Data Mining UNIT-III
Presidency College (Autonomous)
Institution Overview
Re-accredited by NAAC: Achieved an 'A+' grade, reflecting excellence in educational standards and institutional framework.
Affiliations: Affiliated to Bengaluru City University, approved by AICTE, and recognised by the Government of Karnataka, ensuring adherence to national education quality requirements.
Mission Statement: Aiming to empower students to gain more knowledge and reach greater heights, emphasizing holistic development and lifelong learning.
Experience: The institution boasts over 40 years of academic wisdom, serving as a hub for education and research.
Association Rule Learning
Definition
Association Rule Learning: An unsupervised learning technique that finds interesting dependencies between items in a dataset, helping businesses and researchers uncover patterns in large volumes of data.
Importance
Market Basket Analysis: Association rule learning plays a crucial role in identifying consumer purchasing patterns, helping retailers stock their products effectively.
Web Usage Mining: It is also instrumental in analyzing user behavior on websites, thereby enhancing user experience and website effectiveness.
Example
A common application is in supermarkets, where analysis reveals that customers who purchase bread also frequently buy butter, eggs, or milk, presenting opportunities for strategic product placements and promotional offers.
Types of Algorithms in Association Rule Learning
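Apriori Algorithm: Uses a breadth-first, level-wise search to find frequent itemsets and then generate rules. It works on large datasets but can be computationally intensive, because it scans the data once per level and can generate many candidate itemsets.
Eclat Algorithm: A depth-first technique that stores the data in a vertical format (each item mapped to the set of transactions that contain it) and finds frequent itemsets by intersecting those sets. It is generally considered efficient for high-dimensional data.
FP-Growth Algorithm: An improvement over Apriori that compresses the transactions into a frequent-pattern tree (FP-tree) and mines frequent patterns from the tree without generating candidate sets, significantly reducing the computational overhead.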
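To make the vertical data format used by Eclat concrete, here is a minimal Python sketch. The function name `eclat`, the `baskets` data, and the threshold of 3 are illustrative assumptions, not taken from the course material, and the sketch leaves out the optimizations a production implementation would use.

```python
from collections import defaultdict

def eclat(transactions, min_support_count):
    """Return {frozenset(itemset): support count} via tid-set intersections."""
    # Vertical format: map each item to the ids of the transactions containing it.
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, tidset) pairs that are frequent together with prefix.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = len(tids)
            # Depth-first step: intersect with the tid-sets of the later items.
            narrower = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_support_count:
                    narrower.append((other, shared))
            if narrower:
                extend(itemset, narrower)

    start = sorted(
        ((item, tids) for item, tids in tidsets.items()
         if len(tids) >= min_support_count),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent

# Hypothetical baskets, used only for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]
frequent_itemsets = eclat(baskets, min_support_count=3)
for itemset, count in frequent_itemsets.items():
    print(sorted(itemset), count)
```

The support count of an itemset is simply the size of the intersected transaction-id set, so the original transactions are read only once, when the vertical format is built.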
How Association Rule Learning Works
Basic Concept
Association rules take the form 'If A then B', allowing practical inference about the relationships between items.
Terminology
Antecedent: Represents the 'if' part of the rule, indicating a condition that must be met.
Consequent: Represents the 'then' part of the rule, suggesting an outcome that occurs if the antecedent is true.
Cardinality: Refers to the number of items in the set, providing insight into the complexity of the relationships.
Metrics Used
Support: Measures how frequently an itemset occurs in the dataset. For example, a support of 0.3 means the itemset appears in 30% of all transactions.
Formula: Support(X) = Frequency(X) / Total Transactions.
Confidence: Indicates the likelihood of the consequent occurring given the antecedent, providing insights into the strength of the rule.
Formula: Confidence(X => Y) = Frequency(X, Y) / Frequency(X).
Lift: Measures the strength of a rule's relationship, comparing its observed support against the expected support if the two items were independent.
Formula: Lift = Support(X, Y) / (Support(X) * Support(Y)).
Interpretation of Lift Values
Lift = 1: Indicates no relationship; variables are independent of each other.
Lift > 1: Shows that the itemsets are positively correlated and that the presence of one influences the other positively.
Lift < 1: Suggests that one item may substitute for another, indicating a negative association.
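The three metrics and the reading of lift values can be sketched in a few lines of Python. The function names, the `baskets` data, and the tolerance used to decide when lift counts as equal to 1 are assumptions made for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support(X, Y) / Support(X); undefined if X never occurs."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Support(X, Y) / (Support(X) * Support(Y))."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / (support(transactions, antecedent)
                   * support(transactions, consequent))

def interpret_lift(value, tolerance=1e-9):
    """Map a lift value to the three cases described above."""
    if abs(value - 1.0) <= tolerance:
        return "independent"
    return "positive association" if value > 1.0 else "negative association"

# Hypothetical baskets, used only for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
rule_lift = lift(baskets, {"bread"}, {"butter"})
print(rule_confidence, rule_lift, interpret_lift(rule_lift))
```

In this made-up data, bread and butter each appear in 4 of 5 baskets and together in 3, so the lift is slightly below 1 even though the confidence looks high. This is why lift is reported alongside confidence.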
Applications of Association Rule Learning
Market Basket Analysis: Core application for major retailers aiming to discover relationships among purchased products to enhance inventory management.
Medical Diagnosis: Helps in identifying diseases based on symptoms that frequently appear together in patient diagnosis data.
Protein Sequence Analysis: Utilized in bioinformatics to uncover relationships in amino acid sequences, aiding in artificial protein synthesis.
Additional applications span areas like catalog design, loss-leader analysis, and risk assessment.
Example of Association Rule Calculation
Rule: A => D.
Total Transactions (N) = 5, Frequency(A) = 4, Frequency(A, D) = 3.
Support Calculation: Support(A, D) = Frequency(A, D) / N = 3 / 5 = 0.6.
Confidence Calculation: Confidence(A => D) = Frequency(A, D) / Frequency(A) = 3 / 4 = 0.75.
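The arithmetic in this example can be checked with a short Python sketch. The five transactions below are hypothetical. They were chosen only to match the example's figures: N = 5, A and D occurring together 3 times, and A occurring 4 times (the denominator of the confidence).

```python
from fractions import Fraction

# Hypothetical transactions consistent with the example's counts.
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D"},
    {"A", "B"},
    {"B", "C"},
]

def frequency(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

n = len(transactions)
freq_a = frequency({"A"})
freq_ad = frequency({"A", "D"})

# Exact fractions avoid floating-point rounding in the check.
support_ad = Fraction(freq_ad, n)
confidence_a_d = Fraction(freq_ad, freq_a)

print(f"Support(A, D) = {support_ad} = {float(support_ad)}")
print(f"Confidence(A => D) = {confidence_a_d} = {float(confidence_a_d)}")
```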
Candidate Set Generation
Step 1: Generate Candidate Set 1, which contains every individual item with its count. Remove items whose count falls below the defined support threshold (for instance, a minimum support count of 2).
Step 2: Generate Candidate Set 2 by pairing the frequent items, and keep only the pairs that meet the minimum support.
Steps 3 & 4: Continue generating larger candidate sets by combining the frequent itemsets from the previous step until no candidate meets the minimum support. The itemsets that survive are the frequent patterns.
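The level-wise candidate generation described in these steps can be sketched in Python as follows. This is a minimal, unoptimized illustration of the Apriori idea. The function name, the sample transactions, and the use of a support count (rather than a fraction) as the threshold are assumptions.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support_count):
    """Return {frozenset(itemset): count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Step 1: candidate 1-itemsets, filtered by the support threshold.
    items = {item for t in transactions for item in t}
    current = {c: n for c, n in count({frozenset([i]) for i in items}).items()
               if n >= min_support_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join: combine frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {c: n for c, n in count(candidates).items()
                   if n >= min_support_count}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical transactions over items 1..5, used only for illustration.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 3, 5}]
frequent_itemsets = apriori_frequent_itemsets(transactions, min_support_count=2)
for itemset, n in sorted(frequent_itemsets.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The loop stops as soon as a level produces no frequent itemsets, which corresponds to the point where the candidate sets fail to meet the minimum support.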
Evaluation of Candidate Sets
Calculate the confidence of each association rule derived from the frequent itemsets, and retain as strong rules only those that meet the defined minimum-confidence threshold.
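A minimal Python sketch of this evaluation step is given below. It assumes the frequent itemsets and their counts are already available as a dictionary. The function name `strong_rules`, the sample counts, and the 0.8 confidence threshold are illustrative assumptions.

```python
from itertools import combinations

def strong_rules(itemset_counts, min_confidence):
    """Return (antecedent, consequent, confidence) for every strong rule.

    itemset_counts maps each frequent itemset (a frozenset) to its count and
    must also contain every subset of those itemsets, which holds for the
    output of a frequent-itemset miner.
    """
    rules = []
    for itemset, count in itemset_counts.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for chosen in combinations(sorted(itemset), size):
                antecedent = frozenset(chosen)
                consequent = itemset - antecedent
                conf = count / itemset_counts[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, consequent, conf))
    return rules

# Hypothetical counts, used only for illustration.
counts = {
    frozenset({"bread"}): 4,
    frozenset({"butter"}): 3,
    frozenset({"bread", "butter"}): 3,
}
rules = strong_rules(counts, min_confidence=0.8)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), conf)
```

With these counts, bread => butter has confidence 3 / 4 and is dropped, while butter => bread has confidence 3 / 3 and is kept. The two directions of a rule are evaluated separately for this reason.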
Types of Association Rules in Data Mining
Multi-relational Association Rules: Deal with more than one relationship within a dataset, which is advantageous in complex data environments.
Generalized Association Rules: Rules established at varying levels of abstraction, allowing flexibility in interpretation.
Quantitative Association Rules: Involve numeric attributes on at least one side of the rule, allowing more detailed analysis.