Data Mining

Data Mining Overview

Definition: The process of looking for patterns in data.
Relation to Other Fields:
- Basic Data Manipulation and Analysis: Involves performing well-defined computations or queries.
- Machine Learning: Utilizing data to make inferences or predictions.
- Data Visualization: Graphical representation and interpretation of data.
- Data Collection and Preparation: Gathering and pre-processing data for analysis.

Types of Data Mining

Market-Basket Data (the focus of this discussion):
- Typically involves retail transactions (e.g., customers' grocery purchases).
- Patterns detected include frequent item-sets and association rules.
Other Types of Data: Includes networks/graphs, streams, and text mining.
Other Patterns: Includes similar items, structural patterns in networks/graphs, clusters, and anomalies.

Market-Basket Data

Definition: Data derived from retail transactions where each shopper has a basket of groceries.
Key Concepts:
- Domain of Items: Refers to the set of items available for transaction.
- Transaction: A record of one or more items occurring together.
- Dataset: A collection of transactions, usually extensive in size.
Examples:
- Groceries: Items bought together in a grocery cart.
- Online Goods: Items in a virtual shopping cart.
- University Courses: Courses a student includes in their transcript.
- Social Contexts: Groupings such as the attendees at a party or the symptoms of a patient.
- Restaurants: Menu items selected by customers.

Data Mining Algorithms

Frequent Item-Sets: Definition and Importance:
- Sets of items that appear together frequently in transactions.
- Examples include groceries bought, courses taken by students, and movies watched together.
Association Rules:
- Concept: When certain items appear together, another item often appears with them (e.g., buying a phone leads to buying a charger).
Questions Addressed:
- How large is a “set”? Specify a minimum and, optionally, a maximum size.
- What does “frequently” mean? This is made precise by the notion of support.

Support in Data Mining

Definition: The support for a set of items S in a dataset is defined as the fraction of transactions containing S:
$\text{Support}(S) = \frac{\text{Number of transactions containing } S}{\text{Total number of transactions}}$
Thresholds:
- A support-threshold specifies the minimum required support for item-sets.
- Frequent item-sets are returned only if their support exceeds this threshold.
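
The definition and threshold test translate almost directly into code. The following is a minimal Python sketch, not part of the original notes: the function names `support` and `is_frequent` and the toy baskets are hypothetical, and the threshold is treated as a strict lower bound because the notes say support must exceed it.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    containing = sum(1 for t in transactions if wanted <= set(t))
    return containing / len(transactions)


def is_frequent(itemset, transactions, threshold):
    """True when the support strictly exceeds the support-threshold."""
    return support(itemset, transactions) > threshold


# Hypothetical toy baskets, invented for this illustration.
baskets = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "jam"},
    {"tea"},
]

print(support({"bread", "butter"}, baskets))           # 2 of 4 baskets
print(is_frequent({"bread", "butter"}, baskets, 0.3))  # compares 0.5 with 0.3
```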

Case Study: Frequent Item-Sets Example

Given Transactions:
- T1: milk, eggs, juice
- T2: milk, juice, cookies
- T3: eggs, chips
- T4: milk, eggs
- T5: milk, juice, cookies, chips
Criteria: min-set-size = 2; support-threshold = 0.3
Compute support for various item-sets based on these transactions.
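
One way to carry out this computation is to enumerate every candidate item-set by brute force, which is feasible here because there are only five distinct items. The sketch below is illustrative only: the variable names are ours, and the threshold is assumed to be a strict lower bound.

```python
from itertools import combinations

# The five example transactions T1..T5.
transactions = [
    {"milk", "eggs", "juice"},
    {"milk", "juice", "cookies"},
    {"eggs", "chips"},
    {"milk", "eggs"},
    {"milk", "juice", "cookies", "chips"},
]
MIN_SET_SIZE = 2
SUPPORT_THRESHOLD = 0.3


def support(itemset):
    """Fraction of the example transactions containing every item in itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t) / len(transactions)


items = sorted(set().union(*transactions))
frequent = {}
for size in range(MIN_SET_SIZE, len(items) + 1):
    for candidate in combinations(items, size):
        s = support(candidate)
        if s > SUPPORT_THRESHOLD:  # assumption: strict comparison
            frequent[candidate] = s

for itemset, s in frequent.items():
    print(itemset, s)
```

Under these assumptions the frequent item-sets are {milk, eggs}, {milk, cookies} and {juice, cookies} with support 0.4, {milk, juice} with support 0.6, and {milk, juice, cookies} with support 0.4.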

Apriori Algorithm

Purpose: Used to compute frequent item-sets.
Key Property:
- If S is a frequent item-set with support threshold t, then every subset of S is also a frequent item-set with threshold t.
- Contrapositively, if S is not frequent, then no superset of S can be frequent.
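
The key property suggests a level-wise search: find frequent single items, then build larger candidates only from smaller sets already known to be frequent. The notes state only the property, so the join and prune steps below are assumptions about how such a search is commonly organized, not a transcription of the lecture's algorithm. The function name `apriori` and the strict threshold comparison are likewise our choices.

```python
from itertools import combinations


def apriori(transactions, threshold):
    """Return {frozenset: support} for item-sets with support > threshold."""
    n = len(transactions)

    def support(s):
        return sum(1 for t in transactions if s <= t) / n

    # Level 1: frequent single items.
    items = set().union(*transactions)
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) > threshold}
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join: combine frequent (k-1)-sets into candidates of size k.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune: a candidate with an infrequent (k-1)-subset cannot be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) > threshold}
    return frequent


# Usage on the example transactions from the case study.
transactions = [
    {"milk", "eggs", "juice"},
    {"milk", "juice", "cookies"},
    {"eggs", "chips"},
    {"milk", "eggs"},
    {"milk", "juice", "cookies", "chips"},
]
result = apriori(transactions, 0.3)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is where the property pays off: {milk, eggs, juice} is discarded without counting because its subset {eggs, juice} is not frequent.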

Association Rules Definition

An association rule takes the form S → i (e.g., if S occurs, then i is likely to appear).
Factors to Consider:
- Size of S: The minimum and possibly maximum size of the set S.
- Occurrence: Defines conditions under which S and i occur together.
- Frequency: Defined by support and confidence parameters.

Support and Confidence for Association Rules

Support: For an association rule S → i, it is defined as the fraction of transactions containing S:
$\text{Support}(S \rightarrow i) = \frac{\text{Number of transactions containing } S}{\text{Total number of transactions}}$
Confidence: This is the fraction of the transactions containing S that also contain i:
$\text{Confidence}(S \rightarrow i) = \frac{\text{Number of transactions containing both } S \text{ and } i}{\text{Number of transactions containing } S}$
Thresholds: Specify support and confidence thresholds to filter rules returned by the mining process.
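
The two measures can be sketched in Python as follows, using the example transactions from the frequent item-set case study. The helper names `count`, `rule_support` and `confidence` are hypothetical. Following the definition in these notes, the support of a rule depends only on its left-hand side S.

```python
# The five example transactions T1..T5.
transactions = [
    {"milk", "eggs", "juice"},
    {"milk", "juice", "cookies"},
    {"eggs", "chips"},
    {"milk", "eggs"},
    {"milk", "juice", "cookies", "chips"},
]


def count(itemset):
    """Number of transactions containing every item in itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)


def rule_support(s):
    """Support of a rule S -> i as defined in these notes: fraction containing S."""
    return count(s) / len(transactions)


def confidence(s, i):
    """Among transactions containing S, the fraction that also contain i.

    Assumes S occurs in at least one transaction; otherwise this divides by zero.
    """
    return count(set(s) | {i}) / count(s)


print(rule_support({"milk"}))         # 4 of 5 transactions contain milk
print(confidence({"milk"}, "juice"))  # 3 of those 4 also contain juice
```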

Case Study: Association Rules Example

Transactions used:
- T1: milk, eggs, juice
- T2: milk, juice, cookies
- T3: eggs, chips
- T4: milk, eggs
- T5: milk, juice, cookies, chips
Rules derived must satisfy min-set-size, support-threshold (0.5), and confidence threshold (0.5).
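
For instance, two candidate rules can be checked by hand against these thresholds. This is an illustrative calculation using the support and confidence definitions given earlier; the choice of rules is ours:

$\text{Support}(\{\text{juice}\} \rightarrow \text{cookies}) = \frac{3}{5} = 0.6, \quad \text{Confidence}(\{\text{juice}\} \rightarrow \text{cookies}) = \frac{2}{3} \approx 0.67$
$\text{Support}(\{\text{eggs}\} \rightarrow \text{juice}) = \frac{3}{5} = 0.6, \quad \text{Confidence}(\{\text{eggs}\} \rightarrow \text{juice}) = \frac{1}{3} \approx 0.33$

The first rule clears both thresholds. The second has enough support but too little confidence, so it would be filtered out.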

Computing Association Rules

Steps:
1. Use frequent item-sets to identify candidate left-hand sides S that meet the support threshold.
2. For each such S, find the right-hand sides i for which S → i meets the confidence threshold.
Properties: If S → i is an association rule satisfying the thresholds, it is NOT guaranteed that S' → i also satisfies them for a subset S' of S.
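
The two steps can be sketched in Python on the example transactions with both thresholds set to 0.5. This is a brute-force illustration under stated assumptions, not the lecture's implementation: left-hand sides may have any size from 1 upward, both thresholds are strict lower bounds, and all names are hypothetical.

```python
from itertools import combinations

# The five example transactions T1..T5.
transactions = [
    {"milk", "eggs", "juice"},
    {"milk", "juice", "cookies"},
    {"eggs", "chips"},
    {"milk", "eggs"},
    {"milk", "juice", "cookies", "chips"},
]
SUPPORT_THRESHOLD = 0.5
CONFIDENCE_THRESHOLD = 0.5


def count(itemset):
    """Number of transactions containing every item in itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)


n = len(transactions)
items = sorted(set().union(*transactions))

# Step 1: left-hand sides S whose support exceeds the support threshold.
lhs_sets = [
    frozenset(c)
    for size in range(1, len(items) + 1)
    for c in combinations(items, size)
    if count(c) / n > SUPPORT_THRESHOLD
]

# Step 2: right-hand sides i for which S -> i exceeds the confidence threshold.
rules = {}
for s in lhs_sets:
    for i in items:
        if i in s:
            continue
        conf = count(s | {i}) / count(s)
        if conf > CONFIDENCE_THRESHOLD:
            rules[(s, i)] = conf

for (s, i), conf in rules.items():
    print(sorted(s), "->", i, round(conf, 2))
```

With these settings, {milk, juice} → cookies is returned with confidence 2/3, while {milk} → cookies is not, because its confidence is only 2/4. This illustrates that a rule can pass while the same rule with a smaller left-hand side fails.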

Lift in Association Rules

Definition: Lift measures the strength of a rule S → i by comparing its confidence with how often i appears overall, which shows whether a high confidence merely reflects i being a frequent item.
Lift calculation formula:
$\text{Lift}(S \rightarrow i) = \frac{\text{Number of transactions containing both } S \text{ and } i}{\text{Number of transactions containing } S} \div \frac{\text{Number of transactions containing } i}{\text{Total number of transactions}}$
Interpretation of Lift Values:
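- Lift = 1: No association between S and i; i appears with S about as often as it appears overall.
- Lift > 1: Positive association; i is more likely when S is present.
- Lift < 1: Anti-association; i is less likely when S is present.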
Example Computation:
- Use the example transactions to calculate lift for specific rules (juice → cookies and eggs → milk).
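
The two lift values can be computed with a short Python sketch that follows the lift formula above. The function name `lift` is hypothetical, and the code assumes both S and i occur in at least one transaction.

```python
# The five example transactions T1..T5.
transactions = [
    {"milk", "eggs", "juice"},
    {"milk", "juice", "cookies"},
    {"eggs", "chips"},
    {"milk", "eggs"},
    {"milk", "juice", "cookies", "chips"},
]


def lift(s, i, transactions):
    """Confidence of S -> i divided by the overall fraction of transactions with i."""
    s = set(s)
    n = len(transactions)
    with_s = sum(1 for t in transactions if s <= t)
    with_both = sum(1 for t in transactions if s <= t and i in t)
    with_i = sum(1 for t in transactions if i in t)
    confidence = with_both / with_s  # assumes S occurs at least once
    return confidence / (with_i / n)  # assumes i occurs at least once


print(round(lift({"juice"}, "cookies", transactions), 3))  # (2/3) / (2/5)
print(round(lift({"eggs"}, "milk", transactions), 3))      # (2/3) / (4/5)
```

Both rules have confidence 2/3, yet juice → cookies has lift 5/3 (a positive association) while eggs → milk has lift 5/6 (slightly below 1), because milk already appears in four of the five transactions.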

Conclusion

Upcoming Demo: Application of these concepts using SQL with MySQL database software.

Support is a measure of how often a particular itemset appears in a dataset. It is calculated using the formula:

$\text{Support}(S) = \frac{\text{Number of transactions containing } S}{\text{Total number of transactions}}$

This formula tells us what proportion of the total transactions includes the items in set S. For example, if there are 100 total transactions and the itemset S appears in 30 of those, the support for S would be 0.3 or 30%.

3-frequent itemsets refer to itemsets that appear together in a dataset at least 3 times. In the context of data mining, identifying 3-frequent itemsets is important because they can reveal interesting patterns, associations, or correlations between items that can be useful for market analysis, recommendation systems, and other applications.

To find the list of 3-frequent itemsets, you would examine all possible itemsets of size 3 and calculate their support. Those that meet or exceed a predetermined support threshold are considered frequent. In a specific case with a given T-value (threshold), if an itemset is identified as frequent, it means it meets the minimum occurrence criteria set by that T-value.

Comparing 3-frequent itemsets across different datasets can help analysts understand how item relationships vary under different conditions, such as market trends or customer behaviors.

Support, confidence, and lift are metrics used in data mining, especially in association rule learning, to evaluate how items relate to each other in transactions. Here's a simple explanation of each:

- Support: This measures how often a particular itemset appears in a dataset. It is calculated as the number of transactions containing the itemset divided by the total number of transactions. For example, if you have 100 transactions and the itemset appears in 30 of them, the support is 0.3 (or 30%). This indicates the itemset's overall popularity in the dataset.
- Confidence: This indicates how often the rule (S → i) holds true. In simpler terms, it's the likelihood that the item i is bought when itemset S is purchased. Confidence is calculated as the number of transactions containing both S and i divided by the number of transactions containing S. For example, if S appears in 40 transactions and in 30 of those transactions i also appears, the confidence of the rule S → i is 0.75 (or 75%). This tells us how reliable the rule is.
- Lift: This measures the strength of the association between itemset S and item i, showing whether the presence of S changes how likely i is to appear. Lift is calculated as the confidence of the rule S → i divided by the fraction of all transactions containing i. Equivalently, it is the ratio of how often S and i are observed together to how often they would be expected together if they were independent. A lift greater than 1 indicates a positive association, meaning the presence of S increases the likelihood that i will also be present. A lift close to 1 suggests little or no association, and a lift less than 1 suggests a negative association.

In summary:

- Support reflects item popularity.
- Confidence reflects the reliability of rules.
- Lift reflects the strength of the relationship between items.