1/44
Flashcards summarising essential vocabulary from the lecture on Slim, an MDL-based algorithm for mining high-quality itemset patterns without pre-mining candidates.
Name | Mastery | Learn | Test | Matching | Spaced |
---|
No study sessions yet.
Pattern Mining
A branch of data mining focused on discovering interesting local structures (patterns) such as frequent itemsets or association rules in data.
Itemset
A set of items that can occur together in a transaction; fundamental unit in many pattern-mining tasks.
Transaction Database
A collection (bag) of transactions, each transaction being a set of items drawn from a common item universe.
Support (of an Itemset)
The number of transactions in which an itemset occurs; a key frequency measure.
Frequent Itemset
An itemset whose support meets or exceeds a specified minimum threshold.
Pattern Explosion
The phenomenon where lowering support thresholds produces an enormous, often unmanageable, number of frequent itemsets.
Minimum Description Length (MDL) Principle
A model-selection principle stating that the best model minimizes the sum of the lengths (in bits) of the model itself and the data encoded with the model.
Two-Part MDL (Crude MDL)
MDL variant where model and data are encoded separately as L(M)+L(D|M).
Code Table (CT)
A two-column dictionary listing itemsets with corresponding code lengths, used to encode data under MDL.
Standard Code Table (ST)
The baseline code table containing only singleton items, used to encode the left column of any other code table.
Standard Cover Order
Ordering of itemsets in a code table: descending by cardinality, then descending by support, then lexicographically.
Cover (of a Transaction)
The set of code-table itemsets whose codes are concatenated to encode a transaction without overlaps.
Usage (of an Itemset)
The count of transactions whose cover contains that itemset.
Relative Usage
The probability that an itemset’s code is used when encoding an arbitrary transaction; equals usage divided by total usages.
Shannon Entropy Code Length
Optimal prefix-code length for an event: –log₂ Pr(event); used to assign code lengths in a code table.
Total Compressed Size L(CT,D)
Sum of encoded database size L(D|CT) and encoded model size L(CT|D); the MDL objective to minimize.
Minimal Coding Set Problem
The task of finding the smallest subset of candidate itemsets whose code table minimizes total compressed size.
Krimp Algorithm
A two-phase MDL approach: mine frequent itemsets, then traverse them once in a fixed order, adding those that improve compression.
Slim Algorithm
An any-time, parameter-free algorithm that greedily constructs code tables directly from data by iteratively combining current patterns for maximal estimated compression gain.
Any-Time Algorithm
An algorithm that continually improves its solution and can be stopped at any moment to yield a valid, if sub-optimal, result.
Gain Order
Ordering of candidate itemsets in Slim based on estimated reduction (gain) in total compressed size ∆L.
Post-Pruning
Step after accepting a candidate where Slim/Krimp reconsider existing itemsets, removing those no longer contributing to compression.
Candidate Explosion
The dramatic growth in the number of candidate itemsets when support thresholds are lowered in traditional mining.
Heuristic Gain Estimate (ΔL̂)
Slim’s fast approximation of compression gain for a candidate, based on usages of only the directly affected itemsets.
Branch-and-Bound (in Slim)
Search technique that prunes candidate exploration using upper bounds on possible gain, avoiding needless evaluation.
Cover Matrix (C)
A binary matrix representing which code-table itemsets cover which transactions; frequent sets in C suggest new candidates.
Tiling
Greedy set-cover method selecting itemsets that cover many uncovered 1s; inspiration but not directly optimal for MDL compression.
SetCover Problem
NP-hard problem of selecting the smallest set of subsets that cover all elements; conceptually related to code-table construction.
Sub-Modular Function
A set function exhibiting diminishing returns, for which greedy selection has provable bounds; not proven for code-table gain.
Local Optimisation (Greedy)
Strategy of repeatedly choosing the best single candidate at the current step; used in Kramp and Slam variants.
Kramp
A Krimp variant that recalculates exact compression gain for all current candidates each iteration, yielding better but slower results.
Slam
A Slim variant ordering candidates by exact (rather than estimated) gain each iteration, serving as a gold-standard reference.
Minsup Threshold
Minimum support parameter controlling the number of frequent itemsets generated; not needed by Slim.
Parameter-Free (in Slim)
Property that Slim requires no user-set thresholds (such as minsup) in theory or practice.
Dense Dataset
A database with a high ratio of 1s; often causes severe pattern explosion for frequent-itemset mining but suits Slim well.
Relative Compression (L%)
Encoded size relative to encoding with the singleton code table: L(CT,D)/L(ST,D) ×100%.
∆L (Exact Gain)
True reduction in total compressed size when a candidate is added to the code table.
Usage List (Tid-list)
Set of transaction identifiers where an itemset appears; intersected to estimate candidate support.
Any-Time Convergence
Behaviour where Slim rapidly achieves large compression gains early, then slowly refines for smaller improvements.
Classification via Code Tables
Technique where a separate code table per class is built and a new instance is assigned to the class yielding the shortest encoding.
Outlier Detection with Compression
Method marking transactions as anomalies if their encoded length under the data’s code table is higher than expected.
Candidate Pair (X ∪ Y)
Union of two existing code-table itemsets considered as a new pattern during Slim’s refinement.
Standard Candidate Order (Krimp)
Fixed ordering for Krimp’s candidate list: descending support, then length, then lexicographic.
Compression Ratio
Measure of how succinctly data is described; lower values signify better compression.
Greedy Approximation Quality
Observation that Slim’s heuristic achieves compression close to exact greedy (Slam/Kramp) but with far less computation.