Vocabulary flashcards covering major concepts, techniques, and terminology related to the application of the MDL principle to pattern mining and compression-based data modeling.
Minimum Description Length (MDL) Principle
A formalization of ‘Induction by Compression’ that selects the model giving the shortest total description in bits of the model plus the data encoded with it.
Compression (in data mining)
Representing data more compactly; in MDL, the best set of patterns is the one that compresses the data most.
Pattern Mining
Discovering recurring structures or associations in data, such as itemsets, sequences, graphs, etc.
Pattern Explosion
The phenomenon that relaxing mining constraints (e.g., lowering the support threshold) yields an overwhelming number of patterns, often far more patterns than there are rows in the data.
Interestingness Measure
A function used to assess the importance or usefulness of a pattern in traditional pattern mining.
Pattern Set Mining
Searching for a small, non-redundant collection of patterns evaluated with a global criterion rather than individually.
Model Class (𝓜)
The predefined family of models among which MDL will search for the description that compresses the data best.
Kolmogorov Complexity
Length of the shortest program for a universal Turing machine that outputs a given string and halts; an ideal but incomputable measure of information.
Universal Turing Machine
A theoretical computer capable of simulating any other Turing machine; used in defining Kolmogorov complexity.
Algorithmic Information Theory
Field studying information content via algorithmic descriptions, founded by Kolmogorov, Solomonoff and Chaitin.
Two-part MDL (Crude MDL)
MDL variant that encodes model and data separately: L(D, M) = L(M) + L(D | M).
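Restated in LaTeX, the two-part criterion picks from the model class 𝓜 the model minimizing this total:

```latex
% Two-part (crude) MDL: the best model minimizes the cost of the model
% plus the cost of the data encoded with that model, both in bits.
M^{*} = \operatorname*{arg\,min}_{M \in \mathcal{M}} \; L(M) + L(D \mid M)
```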
Lossless Encoding
An encoding from which the original data can be perfectly reconstructed; required by MDL for fair model comparison.
Descriptive Model
A model intended to succinctly characterize observed data rather than to generate new data.
Code Table
A two-column dictionary mapping patterns to code words, used to encode data losslessly under MDL.
Pattern Language (𝓧)
The set of all possible patterns considered, e.g., all itemsets over a given item universe.
Cover Function
Procedure that selects a set of pairwise-disjoint patterns from a code table that together encode a tuple in full.
Usage Count
Number of tuples in which a particular pattern’s code appears when covering the database.
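A minimal Python sketch tying the last three cards together, under simplifying assumptions: patterns are frozensets, the code table is a list already sorted in cover order, and all singleton items are present so every tuple can be covered. The names `cover` and `usage_counts` are hypothetical helpers, not an established API.

```python
from collections import Counter

def cover(tuple_items, code_table):
    """Greedily cover a tuple with pairwise-disjoint patterns.

    code_table: patterns (frozensets) already sorted in cover order,
    e.g. longest first. Assumes all singletons are present, so
    covering never fails and the encoding stays lossless.
    """
    remaining, used = set(tuple_items), []
    for pattern in code_table:
        if pattern <= remaining:       # pattern fits the uncovered part
            used.append(pattern)
            remaining -= pattern
        if not remaining:
            break
    return used

def usage_counts(database, code_table):
    """Usage of a pattern = number of tuple covers it appears in."""
    usage = Counter()
    for t in database:
        usage.update(cover(t, code_table))
    return usage

# Toy run: {a,b} and {c} each end up with usage 2.
db = [{"a", "b", "c"}, {"a", "b"}, {"c"}]
ct = [frozenset({"a", "b"}), frozenset({"a"}),
      frozenset({"b"}), frozenset({"c"})]
print(usage_counts(db, ct))
```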
Optimal Prefix Code
A set of code words in which no code word is a prefix of another; assigning each code word a length of −log₂ of its probability yields minimal expected code length.
Shannon’s Source Coding Theorem
States that −log₂ p(y) bits is the optimal code length for a symbol y occurring with probability p(y).
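Continuing the sketch above, Shannon-optimal code lengths follow directly from the usage counts; `code_lengths` is again a hypothetical helper:

```python
from math import log2

def code_lengths(usage):
    """Shannon-optimal lengths: -log2 of each pattern's usage
    probability; unused patterns receive no code."""
    total = sum(usage.values())
    return {p: -log2(u / total) for p, u in usage.items() if u > 0}

# With the usages above ({a,b}: 2, {c}: 2, total 4),
# both patterns get -log2(2/4) = 1 bit.
```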
Dataset-level Compression
Ability to compute the total encoded length of an entire dataset, enabling tasks such as measuring the distance between datasets.
Tuple-level Compression
Capability of computing encoded length for each individual tuple independently of others.
Pattern-level Inspection
Property that allows seeing which specific patterns were used to encode each tuple, supporting explanations.
Krimp
Early MDL-based algorithm that builds code tables for itemsets using single-pass candidate filtering and pruning.
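A minimal sketch of the single-pass filtering loop at the heart of Krimp, under strong simplifying assumptions: `total_length(db, ct)` is an assumed oracle standing in for the full two-part length L(CT) + L(D | CT), candidates arrive pre-sorted in Krimp's standard candidate order, and the cover order is approximated by sorting on pattern length.

```python
def krimp_filter(database, singletons, candidates, total_length):
    """Single-pass filtering: accept a candidate pattern only if
    adding it to the code table shortens the total description.

    total_length(db, ct) is assumed to return L(CT) + L(D | CT) in bits.
    """
    ct = list(singletons)                  # baseline: always lossless
    best = total_length(database, ct)
    for cand in candidates:
        trial = sorted(ct + [cand], key=len, reverse=True)  # cover order
        length = total_length(database, trial)
        if length < best:                  # keep only if it compresses
            ct, best = trial, length
        # (full Krimp also prunes earlier patterns at this point)
    return ct
```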
Slim
Direct-mining algorithm that iteratively merges patterns that co-occur, building compact code tables faster than Krimp.
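A rough sketch of Slim's candidate-generation idea, reusing the hypothetical `cover` from the earlier sketch; the co-occurrence score below is a crude stand-in for Slim's actual estimated gain in total description length.

```python
from itertools import combinations

def best_join(code_table, database):
    """Rank pairwise joins of code table patterns by how often the
    pair co-occurs in tuple covers, and return the best join."""
    covers = [set(cover(t, code_table)) for t in database]
    best, best_score = None, 0
    for x, y in combinations(code_table, 2):
        score = sum(1 for c in covers if x in c and y in c)
        if score > best_score:
            best, best_score = frozenset(x | y), score
    return best
```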
Frequent Itemset
An itemset appearing in the data at least a specified number of times; often used as candidate patterns.
Low-Entropy Set
An attribute set whose occurrence distribution over its value combinations is strongly skewed (i.e., has low entropy); used by the Less algorithm to capture interactions, including absences of items.
Tiling
Covering binary matrices with rectangles (tiles) that contain many 1’s; an alternative compression-based model.
Candidate Filtering
Two-stage strategy: mine a large candidate set first, then select a subset that improves compression.
Single-pass Filtering
Greedy candidate filtering variant that considers each candidate once in a fixed order, accepting if compression improves.
Iterative Candidate Selection
Filtering method that repeatedly re-evaluates all current candidates, adding the one that maximally improves compression per iteration.
Pruning (in MDL mining)
Removing previously accepted patterns that no longer contribute to compression after later additions.
Direct Mining
Approach that generates promising patterns on the fly from the current model, avoiding pre-mining huge candidate sets.
Optimal Model
Model that minimizes total description length within the chosen model class and encoding scheme.
L(D|M)
Encoded length, in bits, of the data when compressed with model M.
L(M)
Encoded length, in bits, needed to describe the model itself.
MDL vs. AIC/BIC
MDL uses full lossless codes and no explicit parameter count; AIC/BIC use (negative) log-likelihood plus parameter penalties.
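For side-by-side contrast, the textbook forms of the penalized-likelihood criteria (with maximized likelihood L̂, k parameters, n samples) versus the two-part MDL total:

```latex
\mathrm{AIC} = -2\ln\hat{L} + 2k, \qquad
\mathrm{BIC} = -2\ln\hat{L} + k\ln n, \qquad
\text{MDL: } \min_{M} \; L(M) + L(D \mid M)
```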
Lossy Compression (in MDL context)
Encoding that discards some data detail; disallowed in strict MDL because comparisons between models become unfair.
Normalized Compression Distance (NCD)
Similarity measure based on general-purpose compressors, approximating Kolmogorov complexity for pairs of strings.
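A self-contained sketch using zlib as the general-purpose compressor; any off-the-shelf compressor could stand in:

```python
import os
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    with C(.) the compressed size under zlib."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"abab" * 200, b"abab" * 200))    # near 0: identical inputs
print(ncd(b"abab" * 200, os.urandom(800)))  # near 1: unrelated inputs
```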
Code Table Classifier
Classification scheme using one code table per class; an unseen tuple gets the class whose table encodes it shortest.
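A minimal sketch of the decision rule, reusing the hypothetical `cover` and per-pattern code lengths from the earlier sketches; one (code table, lengths) pair per class is assumed:

```python
def encoded_length(t, code_table, lengths):
    """Bits needed to encode tuple t under one class's code table."""
    return sum(lengths[p] for p in cover(t, code_table))

def classify(t, class_tables):
    """class_tables: label -> (code_table, per-pattern code lengths).
    Pick the class whose code table encodes t in the fewest bits."""
    return min(class_tables,
               key=lambda c: encoded_length(t, *class_tables[c]))
```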
Asymmetric Code Length Difference (ACLD)
Normalized measure of how much worse database D is compressed by another database’s model than by its own.
Dataset Dissimilarity Measure (DS)
Symmetric distance between two datasets defined as the maximum of two ACLDs, using their MDL-optimal models.
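Written out under the definitions above, with M_i the MDL-optimal model induced from database D_i (hedged as one common formulation):

```latex
\mathrm{ACLD}(D_1, M_2) =
  \frac{L(D_1 \mid M_2) - L(D_1 \mid M_1)}{L(D_1 \mid M_1)}, \qquad
\mathrm{DS}(D_1, D_2) =
  \max\bigl\{\mathrm{ACLD}(D_1, M_2),\, \mathrm{ACLD}(D_2, M_1)\bigr\}
```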
Component Identification
Task of partitioning a mixed database into homogeneous components by finding a set of models that minimizes total description length.
Sequential Pattern (Serial Episode)
Ordered list of events occurring in sequence data; mined using MDL in algorithms like SQS and GoKrimp.
Candidate Set
Subset of the pattern language provided to a filtering algorithm as potential patterns for the final model.
Beam Search (in pattern mining)
Search strategy maintaining the top-k best partial models after each iteration to explore more of the search space.
EM-like Approach (for components)
Iterative procedure alternating between inducing models for current partitions and re-assigning tuples to the best model to minimize description length.
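A minimal sketch of the alternating loop, assuming two hypothetical callables: `induce(tuples)` returns a model (e.g., wrapping an MDL inducer such as Krimp) and `encoded_len(t, model)` returns the bits needed to encode a tuple with it.

```python
def identify_components(database, k, induce, encoded_len, iters=20):
    """Alternate between inducing one model per component and
    re-assigning every tuple to the model that encodes it shortest."""
    parts = [database[i::k] for i in range(k)]   # naive initial split
    models = [induce(p) for p in parts]
    for _ in range(iters):
        new_parts = [[] for _ in range(k)]
        for t in database:
            best = min(range(k), key=lambda i: encoded_len(t, models[i]))
            new_parts[best].append(t)
        if new_parts == parts:                   # assignments stable
            break
        parts = new_parts
        models = [induce(p) for p in parts]
    return parts, models
```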