Association Analysis
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transactions
transaction width
number of items in a transaction
itemset
any collection of 0 or more items
k itemset
an itemset with k items
support count
number of transactions that contain the itemset
frequent itemset
an itemset whose support count is >= some minsup threshold
Association rule
an implication expression of the form X→Y, where X and Y are disjoint itemsets. (X is the antecedent and Y is the consequent). Correlation, not causation. Directionality matters
support
fraction of transactions that contain both X and Y
confidence
measures how often items in Y appear in transactions that contain X
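The two measures above can be computed directly from a transaction list. A minimal sketch (the transactions and the rule {milk, diapers} → {beer} are made up for illustration):

```python
# Transactions as sets of items (illustrative data).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"milk", "diapers"}, {"beer"}
n = len(transactions)
# support: fraction of transactions containing both X and Y
support = support_count(X | Y, transactions) / n
# confidence: of the transactions containing X, how many also contain Y
confidence = support_count(X | Y, transactions) / support_count(X, transactions)
print(support, confidence)  # 0.4 and 2/3
```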
strong association rule
high s and high c (support >= minsup and confidence >= minconf threshold)
goal of association rule mining
to find all strong rules
Association Rule Mining steps
Frequent itemset generation: find all itemsets that satisfy minsup
strong rule generation: find all rules in the frequent itemsets that satisfy minconf
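The two steps can be sketched by brute force, enumerating all 2^d itemsets; this is only viable for tiny item universes, and the transactions and thresholds are illustrative:

```python
from itertools import combinations

transactions = [frozenset(t) for t in (
    {"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}, {"a", "b", "c"})]
items = sorted(set().union(*transactions))
n = len(transactions)
minsup, minconf = 0.4, 0.6

def sup(s):
    """Fraction of transactions containing itemset s."""
    return sum(1 for t in transactions if s <= t) / n

# Step 1: frequent itemset generation -- brute force over all 2^d candidates.
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sup(frozenset(c)) >= minsup]

# Step 2: strong rule generation -- test every X -> Y split of each frequent itemset.
strong = []
for f in frequent:
    for r in range(1, len(f)):
        for x in combinations(f, r):
            X = frozenset(x)
            if sup(f) / sup(X) >= minconf:  # confidence test
                strong.append((X, f - X))

print(len(frequent), len(strong))  # 7 frequent itemsets, 9 strong rules
```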
how many candidate itemsets are there given d items?
2^d itemsets
Apriori principle
if an itemset is frequent, then all of its subsets must also be frequent. if an itemset is infrequent, then all of its supersets must be infrequent
Candidate Generation Fk-1 x F1
items in each frequent itemset must be sorted
Extend each frequent k-1 itemset with a frequent 1 itemset that comes after all of the items in the k-1 itemset
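The Fk-1 x F1 extension can be sketched like so, assuming itemsets are stored as sorted tuples:

```python
def candidates_fk1_f1(freq_k1, freq_1):
    """Extend each frequent (k-1)-itemset with a frequent item that sorts
    after every item already in it, so each candidate is generated once."""
    out = []
    for s in freq_k1:                 # s is a sorted tuple of k-1 items
        for (item,) in freq_1:        # frequent 1-itemsets as 1-tuples
            if item > s[-1]:          # item must come after all items in s
                out.append(s + (item,))
    return out

# e.g. F2 = {(a,b), (a,c), (b,c)}, F1 = {a, b, c}
print(candidates_fk1_f1([("a", "b"), ("a", "c"), ("b", "c")],
                        [("a",), ("b",), ("c",)]))
# only ('a','b') can be extended (with 'c'), giving [('a', 'b', 'c')]
```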
Candidate Generation Fk-1 x Fk-1 (Apriori gen)
items in each frequent itemset must be sorted
Merge pairs of frequent k-1 itemsets whose first k-2 items are identical (all but the last item the same)
Method for Apriori Pruning
if any k-1 subset of the candidate is not a frequent k-1 itemset, the candidate is pruned
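The Fk-1 x Fk-1 merge with subset-based pruning can be sketched as one function, again assuming itemsets are sorted tuples (the F2 input is illustrative):

```python
from itertools import combinations

def apriori_gen(freq_k1):
    """F(k-1) x F(k-1) candidate generation with Apriori pruning."""
    freq_set = set(freq_k1)
    candidates = []
    for i, a in enumerate(freq_k1):
        for b in freq_k1[i + 1:]:
            # Merge only if all but the last item agree.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune: every (k-1)-subset of the candidate must be frequent.
                if all(sub in freq_set
                       for sub in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return candidates

F2 = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
print(apriori_gen(F2))
# [('a', 'b', 'c')] -- ('a','b','d') and ('a','c','d') are pruned because
# ('b','d') and ('c','d') are not frequent
```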
Frequent itemsets
the final list of all frequent itemsets found at every k
Tyranny of Counting Pairs
the most memory is required for counting candidate pairs (k = 2), since that pass has the largest number of candidates
rule generation brute force
all non-empty subsets of a frequent itemset are enumerated and tested against minconf
confidence based pruning
confidence is anti-monotone
as we move items to the right side of the rule, the confidence can only go down, it cannot go up (only rules generated from the same itemset)
look at denominator of confidence equation for explanation
Rule generation algorithm
start with rules that have only one item in the consequent and test against minconf
the high confidence rules that are found are then used to generate the next round of candidate rules by merging consequents
repeat on all frequent itemsets
rules that meet minconf are strong rules
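The level-wise algorithm above can be sketched for one frequent itemset; the support counts here are hypothetical and just need to respect the anti-monotone property:

```python
from itertools import combinations

# Hypothetical support counts for frequent itemset {a,b,c} and its subsets.
sup = {
    frozenset("abc"): 2, frozenset("ab"): 3, frozenset("ac"): 3,
    frozenset("bc"): 3, frozenset("a"): 4, frozenset("b"): 4, frozenset("c"): 4,
}

def gen_rules(f, minconf):
    """Start with 1-item consequents; merge consequents of surviving rules."""
    rules = []
    consequents = [frozenset([i]) for i in f]
    while consequents and len(next(iter(consequents))) < len(f):
        kept = []
        for Y in consequents:
            X = f - Y
            conf = sup[f] / sup[X]
            if conf >= minconf:          # confidence-based pruning
                rules.append((X, Y, conf))
                kept.append(Y)
        # Merge pairs of surviving consequents to build the next level.
        consequents = list({a | b for a, b in combinations(kept, 2)
                            if len(a | b) == len(a) + 1})
    return rules

rules = gen_rules(frozenset("abc"), minconf=0.6)
print(len(rules))  # 3 rules (one-item consequents), each with confidence 2/3
```

Moving a second item to the consequent drops the confidence to 2/4 = 0.5 here, so the second level produces no rules, exactly the anti-monotone behavior described above.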
how to set minsup
a large set of items calls for a small minsup (often < 1%)
maximal frequent itemset
no frequent supersets
closed frequent itemset
no superset with the same support count
maximal and closed frequent itemset algorithm
keep a list of maximal/closed frequent itemsets
each time a frequent itemset is generated, perform the subset check and superset check
subset check for maximal algorithm
is the frequent itemset just found a subset of anything in the maximal list? If so, it is not maximal; stop. Otherwise, add it to the maximal list and do the superset check
superset check for maximal algorithm
Is the frequent itemset just found a superset of anything in the maximal list? If so, remove the itemsets in the maximal list that are subsets of this itemset
subset check for closed algorithm
is the frequent itemset just found a subset of anything in the closed list? If so, is its support higher than its superset's in the list? If no, it is not closed. If yes, add it to the closed list and do the superset check.
superset check for closed algorithm
Is the frequent itemset just found a superset of anything in the closed list? If so, does the subset in the list have the same or higher support? If the subset's support is the same, remove the subset from the closed list; if the subset's support is higher, it remains in the list.
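Given support counts for all frequent itemsets, maximal and closed can also be checked directly from the definitions; a sketch with hypothetical support counts:

```python
# Hypothetical frequent itemsets with their support counts.
sup = {
    frozenset("a"): 4, frozenset("b"): 4, frozenset("c"): 3,
    frozenset("ab"): 4, frozenset("ac"): 2, frozenset("bc"): 3,
    frozenset("abc"): 2,
}
freq = list(sup)

# Maximal: no frequent proper superset.
maximal = [f for f in freq
           if not any(f < g for g in freq)]

# Closed: no proper superset with the same support count.
closed = [f for f in freq
          if not any(f < g and sup[g] == sup[f] for g in freq)]

print(sorted(map(sorted, maximal)))  # only {a,b,c} is maximal
print(sorted(map(sorted, closed)))   # {a,b}, {b,c}, {a,b,c} are closed
```

Every maximal itemset is closed (its frequent supersets don't exist, so none can match its support), but not vice versa, which the output above illustrates.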
cross support patterns
patterns that relate low-frequency items to high-frequency items
identified when the support ratio (lowest item support divided by highest item support) is below a user-specified threshold
limitations of support and confidence
there are times when both support and confidence are high, but the rule produced is not good
Lift
measures the ratio of the observed frequency of co-occurrence to the expected frequency
symmetric metric (directionality doesn't matter)
lift value meanings
1: no correlation
<1: negative correlation
>1: positive correlation
contingency tables
a 2x2 table with rows X and not-X, columns Y and not-Y; each cell is filled in with a support count
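Building the 2x2 table and computing lift from it can be sketched as follows (transactions are made up; X = {bread}, Y = {milk} are illustrative):

```python
transactions = [
    {"bread", "milk"}, {"bread"}, {"milk"}, {"bread", "milk"},
    {"eggs"}, {"bread", "milk"},
]
n = len(transactions)
X, Y = {"bread"}, {"milk"}

# 2x2 contingency table of support counts.
xy   = sum(1 for t in transactions if X <= t and Y <= t)        # X and Y
x_ny = sum(1 for t in transactions if X <= t and not Y <= t)    # X, not Y
nx_y = sum(1 for t in transactions if not X <= t and Y <= t)    # not X, Y
nxny = n - xy - x_ny - nx_y                                     # neither

# Lift: observed co-occurrence over what independence would predict.
# Row and column totals of the table give s(X) and s(Y).
lift = (xy / n) / (((xy + x_ny) / n) * ((xy + nx_y) / n))
print(lift)  # 1.125 here: > 1, a positive correlation
```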