1/29
Key vocabulary terms from the lecture on the KRIMP algorithm, MDL-based itemset selection, and related concepts.
Name | Mastery | Learn | Test | Matching | Spaced |
---|
No study sessions yet.
Pattern Mining
The data-mining task of discovering regularities (patterns) in data, typically describing only part of the dataset rather than producing a full predictive model.
Itemset
A set of items that can occur together within a transaction in a transactional database.
Frequent Itemset
An itemset whose support (occurrence count) meets or exceeds a user-defined minimum support threshold.
Support
The number of transactions in a database that contain a particular itemset; denoted suppD(X).
Minimum Support (minsup)
The threshold that an itemset’s support must reach to be considered frequent.
Pattern Explosion
The rapid growth in the number of discovered patterns when constraints such as minsup are loosened, leading to huge, redundant result sets.
Minimum Description Length (MDL) Principle
A model-selection principle stating that the best model yields the shortest combined description of the model itself and the data encoded with it.
Two-Part MDL
MDL formulation where model cost L(H) and data-given-model cost L(D|H) are encoded separately and summed.
Code Table (CT)
A two-column table mapping itemsets to prefix codes, used to losslessly encode a database under MDL.
Prefix Code
A set of codes where no code word is the prefix of another, enabling unambiguous decoding.
Optimal Code Length
The length −log P(X|D) assigned to an itemset’s code when codes are chosen to minimise expected description length (Shannon optimal).
Cover Function
A procedure that, given a code table and a transaction, selects a non-overlapping set of itemsets whose union equals the transaction.
Usage (of an Itemset)
The number of times an itemset is employed in covering all transactions when encoding a database with a given code table.
Standard Code Table (ST)
The baseline code table containing only singleton itemsets, yielding an independence-assumption encoding of the data.
Standard Cover Order
The fixed ordering of itemsets in CT—sorted by decreasing length, then decreasing support, then lexicographically—for deterministic covering.
Standard Candidate Order
The order in which candidate itemsets are tested for inclusion—sorted by decreasing support, then decreasing length, then lexicographically.
Krimp Algorithm
A greedy MDL-based algorithm that builds a code table by iteratively adding candidate itemsets that improve compression and optionally pruning.
Post-Acceptance Pruning
Krimp’s step that, after accepting a new itemset, removes existing itemsets whose removal further shortens total description length.
Compression Ratio (L%)
The total encoded size with a code table divided by the size with the standard table, expressed as a percentage; lower is better.
Candidate Set (F)
The collection of itemsets considered for inclusion in the code table, often all (or closed) frequent itemsets mined at a chosen minsup.
Minimal Coding Set Problem
The task of finding the smallest subset of a candidate set that minimises the total encoded size of data and code table.
Singleton Itemset
An itemset containing exactly one item; all singletons must appear in any valid code table.
Total Compressed Size L(D,CT)
The sum of the encoded database size L(D|CT) and the encoded code-table size L(CT|D).
Swap Randomisation
A technique that randomises a binary dataset via value swaps to break structure while preserving row and column sums, used to test pattern significance.
Code Table–Based Classifier
A classifier that assigns an unseen transaction to the class whose code table yields the shortest encoding of that transaction.
Naïve Bayes Assumption (in CTs)
The simplification that itemsets used together in a cover are independent, linking code lengths to class-conditional probabilities.
Krimp Classifier
The simple classifier that builds one Krimp code table per class and predicts the class giving the minimal code length for a new transaction.
Closed Frequent Itemset
A frequent itemset that has no proper superset with the same support, providing a lossless yet smaller representation of frequent patterns.
Pattern Usage Histogram
A distribution showing how often selected itemsets are used in encoding; reveals balance between general and specific patterns.
Heuristic
A practical strategy that aims for good solutions quickly when exhaustive optimal search is computationally infeasible, as in Krimp’s greedy selection.