Vocabulary flashcards covering fundamental terms and definitions from the lecture on applying the MDL principle and compression for pattern mining and related data-mining tasks.
Pattern Explosion
The overwhelming number of patterns returned when interestingness constraints are loosened, making results unmanageable and redundant.
Interestingness Measure
A function that evaluates how important or useful a pattern is with respect to the data and the user’s goal.
Minimum Description Length (MDL) Principle
A model-selection guideline stating that the best model is the one that minimizes the combined length of the model description and the data encoded with it.
Two-part MDL
MDL variant where data are encoded in two parts: the model length L(M) plus the data length given the model L(D|M).
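The two-part score can be illustrated with a minimal sketch; the bit costs below are made-up numbers standing in for a real encoder's output, and `best_model` is a hypothetical helper:

```python
def best_model(models):
    """Two-part MDL: pick the model minimizing L(M) + L(D|M)."""
    return min(models, key=lambda m: m["L_M"] + m["L_D_given_M"])

# Hypothetical bit costs for two competing models of the same data:
models = [
    {"name": "rich",   "L_M": 120.0, "L_D_given_M": 300.0},  # complex model, compresses well
    {"name": "simple", "L_M": 10.0,  "L_D_given_M": 450.0},  # cheap model, compresses worse
]
winner = best_model(models)  # the 'rich' model wins: 420 bits < 460 bits
```

Note that neither the smallest model nor the best-compressing model wins on its own; MDL trades the two costs off against each other.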
Lossless Compression
An encoding scheme that allows exact reconstruction of the original data from the compressed representation.
Kolmogorov Complexity
The length of the shortest program on a universal Turing machine that outputs a given string and halts; the theoretical limit of compression.
Algorithmic Information Theory
Field linking information content to the computational resources needed to generate data, underpinning Kolmogorov complexity and MDL.
Code Table
A two-column model mapping patterns to code words; used to compress data by replacing pattern occurrences with their codes.
Pattern Language
The set of all pattern types allowed for a given data domain, e.g., all possible itemsets for transactional data.
Cover Function
A procedure that selects a non-overlapping set of patterns from a code table that together encode a specific tuple.
Usage Count
The number of tuples in which a pattern’s code appears in the cover; drives probability estimates for code lengths.
Optimal Prefix Code
A prefix-free set of code words (no code is a prefix of another) whose lengths equal –log₂(probability) bits, minimizing the expected encoded size.
Shannon Source Coding Theorem
Result stating that the optimal code length for a symbol equals the negative logarithm of its probability.
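Putting the two previous cards together: given usage counts from a cover, the Shannon-optimal code length per pattern follows directly (a sketch; the pattern names and counts are invented):

```python
import math

def code_lengths(usages):
    """Shannon-optimal code length in bits per pattern, from usage counts."""
    total = sum(usages.values())
    return {p: -math.log2(u / total) for p, u in usages.items()}

# Pattern "AB" covers half of all code slots, so it gets the shortest code:
lengths = code_lengths({"AB": 8, "C": 4, "D": 4})  # {"AB": 1.0, "C": 2.0, "D": 2.0}
```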
Candidate Set Filtering
Family of algorithms that first mine a pool of candidate patterns and then select a subset that minimizes MDL.
Single-pass Filtering
Greedy selection strategy that scans the candidate list once, testing each pattern for compression gain and keeping beneficial ones.
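A toy sketch of single-pass filtering; `encoded_size` here is a made-up cost function standing in for a real MDL encoder such as Krimp's, not an actual implementation:

```python
# Toy transactional database:
data = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"c"}]

def encoded_size(model):
    """Made-up cost: 2 bits per pattern kept in the model, 1 bit per
    emitted pattern code, 1 bit per item left for singleton escape codes."""
    cost = 2 * len(model)
    for t in data:
        rest = set(t)
        for p in sorted(model, key=len, reverse=True):  # longest patterns first
            if p <= rest:
                rest -= p
                cost += 1
        cost += len(rest)
    return cost

def single_pass_filter(candidates):
    """Scan the candidate list once; keep a pattern only if it helps."""
    model, best = [], encoded_size([])
    for pattern in candidates:
        size = encoded_size(model + [pattern])
        if size < best:
            model, best = model + [pattern], size
    return model

chosen = single_pass_filter(
    [frozenset({"a", "b"}), frozenset({"b", "c"}), frozenset({"a", "c"})]
)
# Only {a, b} pays for itself on this toy database.
```

Because each candidate is tested exactly once against the current model, the result depends on the candidate order; iterative candidate selection (next card) avoids this at higher cost.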
Iterative Candidate Selection
Search approach that repeatedly re-evaluates all remaining candidates, adding the pattern that gives the largest MDL improvement each round.
Pruning (Pattern Mining)
Post-selection step that removes patterns whose deletion does not worsen compression, yielding smaller, non-redundant models.
Slim Algorithm
Direct mining method that builds a code table straight from the data, iteratively joining pairs of frequently co-occurring patterns and keeping a join only when it improves the MDL score.
Krimp Algorithm
Pioneering MDL method for transactional data that selects, from a pre-mined set of frequent itemsets, those that together compress the database best.
MDL-based Classification
Assigning a class label to an unseen tuple by encoding it with each class-specific code table and choosing the shortest code.
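A minimal sketch of the idea, assuming a greedy cover and made-up code tables (the pattern sets, bit costs, and the `fallback_bits` escape cost are all invented for illustration):

```python
# Hypothetical class-specific code tables: pattern -> code length in bits
tables = {
    "spam": {frozenset({"buy", "now"}): 1.0, frozenset({"hello"}): 3.0},
    "ham":  {frozenset({"hello"}): 1.0, frozenset({"buy", "now"}): 4.0},
}

def encoded_length(items, table, fallback_bits=8.0):
    """Greedy cover with longest patterns first; items no pattern covers
    fall back to an (assumed) expensive escape code."""
    rest, bits = set(items), 0.0
    for pattern in sorted(table, key=len, reverse=True):
        if pattern <= rest:
            rest -= pattern
            bits += table[pattern]
    return bits + fallback_bits * len(rest)

def classify(items):
    """Assign the class whose code table encodes the tuple most cheaply."""
    return min(tables, key=lambda c: encoded_length(items, tables[c]))
```

For example, `classify({"buy", "now"})` yields `"spam"` (1 bit versus 4), while `classify({"hello"})` yields `"ham"`.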
Aggregated Code Length Difference (ACLD)
Normalized measure of how much longer a foreign model's encoding of a database is than the encoding by the database's own model; used for dataset comparison.
Database Dissimilarity Measure (DS)
Symmetric distance between two datasets defined as the maximum of their cross-encoded ACLD values.
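The two cards above combine into a few lines; the encoded sizes below are invented bit counts for two hypothetical databases A and B, each compressed by its own and by the other's model:

```python
def acld(foreign_bits, own_bits):
    """Relative extra cost of encoding a database with a foreign model."""
    return (foreign_bits - own_bits) / own_bits

def dissimilarity(a_under_b, a_under_a, b_under_a, b_under_b):
    """Symmetric dataset distance: max of the two cross-encoding ACLDs."""
    return max(acld(a_under_b, a_under_a), acld(b_under_a, b_under_b))

# A costs 1200 bits under B's model vs 1000 under its own (ACLD 0.2);
# B costs 1650 bits under A's model vs 1500 under its own (ACLD 0.1):
ds = dissimilarity(1200.0, 1000.0, 1650.0, 1500.0)  # 0.2
```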
Component Identification
Partitioning a dataset into subsets, each with its own MDL-optimal model, to reveal underlying distributions.
Low-Entropy Set
An attribute set whose joint distribution over value combinations has low entropy (is highly skewed), indicating strong interaction among the attributes.
Serial Episode
A sequential pattern describing an ordered list of events that may include gaps between occurrences.
Tiling
Covering a binary matrix with rectangles (tiles) of 1s (or 0s) to summarize dense and sparse areas compactly.
Compression-based Model
Any descriptive model whose quality is evaluated by how well it losslessly compresses the data.
Tuple-level Compression
Property that allows computing the encoded length of each data record independently of others using the model.
Pattern-level Inspection
Ability to examine which specific patterns encode a tuple, enabling interpretability and explanations.
Refined MDL
MDL variant that encodes model and data together in a single code (e.g., via the normalized maximum likelihood); theoretically stronger than two-part MDL but often computationally infeasible.
Bayesian Information Criterion (BIC)
Model-selection score approximating MDL asymptotically; penalizes model complexity using parameter counts.
Akaike Information Criterion (AIC)
Alternative to MDL/BIC selecting models via maximum likelihood plus a complexity penalty proportional to parameters.
Generative vs. Descriptive Models
Generative models aim to reproduce data distributions, while descriptive models (like code tables) focus on succinct representation.
Normalized Compression Distance (NCD)
General dissimilarity metric between objects based on their compressed sizes with a standard compressor.
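NCD can be computed with any off-the-shelf compressor; a sketch using Python's `zlib` (the example strings are arbitrary):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: C(xy) - min(C(x), C(y)), over max(C(x), C(y))."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Similar strings compress well together, so their NCD is low:
a = b"the quick brown fox jumps over the lazy dog" * 10
b = b"the quick brown fox jumps over the lazy cat" * 10
c = bytes(range(256)) * 2  # near-incompressible byte ramp
# ncd(a, b) comes out well below ncd(a, c)
```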
Minimum Message Length (MML)
Information-theoretic principle closely related to MDL that includes explicit priors, often used in Bayesian contexts.
Data Imputation via Compression
Filling missing values by choosing replacements that yield the smallest total encoded size under an MDL model.
Outlier Detection with MDL
Identifying anomalies as tuples that receive unusually long codes when compressed by a model of the normal data.
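Once per-tuple code lengths are available, flagging outliers reduces to simple thresholding; a sketch, where both the code lengths and the mean-plus-k-stddev rule are illustrative choices:

```python
import statistics

def flag_outliers(code_lengths, k=2.0):
    """Flag tuples whose encoded length exceeds mean + k * stddev."""
    mu = statistics.mean(code_lengths)
    sd = statistics.pstdev(code_lengths)
    return [i for i, bits in enumerate(code_lengths) if bits > mu + k * sd]

# Nine 'normal' tuples and one the model compresses very badly:
anomalies = flag_outliers([10, 11, 9, 10, 10, 11, 9, 10, 10, 80])  # [9]
```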
Change Detection in Data Streams
Monitoring streaming data for rises in encoded length, signaling shifts in the underlying distribution.
Class-specific Compressor
A code table trained on data from one class, used both to describe that class and to classify new instances.
Maximum Entropy Principle
Guideline for selecting the least-biased distribution satisfying known constraints; can define priors for MDL/MML models.