Vocabulary flashcards covering major concepts, techniques, and terminology related to the application of the MDL principle to pattern mining and compression-based data modeling.
Minimum Description Length (MDL) Principle
A formalization of ‘Induction by Compression’ that selects the model giving the shortest total description in bits of the model plus the data encoded with it.
Compression (in data mining)
Representing data more compactly; in MDL, the best set of patterns is the one that compresses the data most.
Pattern Mining
Discovering recurring structures or associations in data, such as itemsets, sequences, graphs, etc.
Pattern Explosion
The phenomenon that relaxing mining constraints (e.g., lowering the support threshold) yields an overwhelming number of patterns, often far more patterns than there are rows in the data.
Interestingness Measure
A function used to assess the importance or usefulness of a pattern in traditional pattern mining.
Pattern Set Mining
Searching for a small, non-redundant collection of patterns evaluated with a global criterion rather than individually.
Model Class (𝓜)
The predefined family of models among which MDL will search for the description that compresses the data best.
Kolmogorov Complexity
Length of the shortest program for a universal Turing machine that outputs a given string and halts; an ideal but incomputable measure of information.
Universal Turing Machine
A theoretical computer capable of simulating any other Turing machine; used in defining Kolmogorov complexity.
Algorithmic Information Theory
Field studying information content via algorithmic descriptions, founded by Kolmogorov, Solomonoff and Chaitin.
Two-part MDL (Crude MDL)
MDL variant that encodes model and data separately: L(D, M) = L(M) + L(D | M).
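Restated in LaTeX, the two-part criterion picks from the model class 𝓜 the model minimizing this total:

```latex
% Two-part (crude) MDL: the best model minimizes the cost of the model
% plus the cost of the data encoded with that model, both in bits.
M^{*} = \operatorname*{arg\,min}_{M \in \mathcal{M}} \; L(M) + L(D \mid M)
```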
Lossless Encoding
An encoding from which the original data can be perfectly reconstructed; required by MDL for fair model comparison.
Descriptive Model
A model intended to succinctly characterize observed data rather than to generate new data.
Code Table
A two-column dictionary mapping patterns to code words, used to encode data losslessly under MDL.
Pattern Language (𝓧)
The set of all possible patterns considered, e.g., all itemsets over a given item universe.
Cover Function
Procedure that selects a set of pairwise-disjoint patterns from a code table that together encode a tuple in full.
Usage Count
Number of tuples in which a particular pattern’s code appears when covering the database.
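A minimal Python sketch tying the last three cards together, under simplifying assumptions: patterns are frozensets, the code table is a list already sorted in cover order, and all singleton items are present so every tuple can be covered. The names `cover` and `usage_counts` are hypothetical helpers, not an established API.

```python
from collections import Counter

def cover(tuple_items, code_table):
    """Greedily cover a tuple with pairwise-disjoint patterns.

    code_table: patterns (frozensets) already sorted in cover order,
    e.g. longest first. Assumes all singletons are present, so
    covering never fails and the encoding stays lossless.
    """
    remaining, used = set(tuple_items), []
    for pattern in code_table:
        if pattern <= remaining:       # pattern fits the uncovered part
            used.append(pattern)
            remaining -= pattern
        if not remaining:
            break
    return used

def usage_counts(database, code_table):
    """Usage of a pattern = number of tuple covers it appears in."""
    usage = Counter()
    for t in database:
        usage.update(cover(t, code_table))
    return usage

# Toy run: {a,b} and {c} each end up with usage 2.
db = [{"a", "b", "c"}, {"a", "b"}, {"c"}]
ct = [frozenset({"a", "b"}), frozenset({"a"}),
      frozenset({"b"}), frozenset({"c"})]
print(usage_counts(db, ct))
```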
Optimal Prefix Code
A set of code words in which no code word is a prefix of another; assigning each code word a length of −log₂ of its probability yields minimal expected code length.
Shannon’s Source Coding Theorem
States that −log₂ p(y) bits is the optimal code length for a symbol y occurring with probability p(y).
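Continuing the sketch above, Shannon-optimal code lengths follow directly from the usage counts; `code_lengths` is again a hypothetical helper:

```python
from math import log2

def code_lengths(usage):
    """Shannon-optimal lengths: -log2 of each pattern's usage
    probability; unused patterns receive no code."""
    total = sum(usage.values())
    return {p: -log2(u / total) for p, u in usage.items() if u > 0}

# With the usages above ({a,b}: 2, {c}: 2, total 4),
# both patterns get -log2(2/4) = 1 bit.
```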
Dataset-level Compression
Ability to compute the total encoded length of an entire dataset, enabling tasks such as measuring the distance between datasets.
Tuple-level Compression
Capability of computing encoded length for each individual tuple independently of others.
Pattern-level Inspection
Property that allows seeing which specific patterns were used to encode each tuple, supporting explanations.
Krimp
Early MDL-based algorithm that builds code tables for itemsets using single-pass candidate filtering and pruning.
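A minimal sketch of the single-pass filtering loop at the heart of Krimp, under strong simplifying assumptions: `total_length(db, ct)` is an assumed oracle standing in for the full two-part length L(CT) + L(D | CT), candidates arrive pre-sorted in Krimp's standard candidate order, and the cover order is approximated by sorting on pattern length.

```python
def krimp_filter(database, singletons, candidates, total_length):
    """Single-pass filtering: accept a candidate pattern only if
    adding it to the code table shortens the total description.

    total_length(db, ct) is assumed to return L(CT) + L(D | CT) in bits.
    """
    ct = list(singletons)                  # baseline: always lossless
    best = total_length(database, ct)
    for cand in candidates:
        trial = sorted(ct + [cand], key=len, reverse=True)  # cover order
        length = total_length(database, trial)
        if length < best:                  # keep only if it compresses
            ct, best = trial, length
        # (full Krimp also prunes earlier patterns at this point)
    return ct
```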
Slim
Direct-mining algorithm that iteratively merges patterns that co-occur, building compact code tables faster than Krimp.
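A rough sketch of Slim's candidate-generation idea, reusing the hypothetical `cover` from the earlier sketch; the co-occurrence score below is a crude stand-in for Slim's actual estimated gain in total description length.

```python
from itertools import combinations

def best_join(code_table, database):
    """Rank pairwise joins of code table patterns by how often the
    pair co-occurs in tuple covers, and return the best join."""
    covers = [set(cover(t, code_table)) for t in database]
    best, best_score = None, 0
    for x, y in combinations(code_table, 2):
        score = sum(1 for c in covers if x in c and y in c)
        if score > best_score:
            best, best_score = frozenset(x | y), score
    return best
```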
Frequent Itemset
An itemset appearing in the data at least a specified number of times; often used as candidate patterns.
Low-Entropy Set
An attribute set whose occurrence distribution over its value combinations is strongly skewed (i.e., has low entropy); used by the Less algorithm to capture interactions, including absences of items.
Tiling
Covering binary matrices with rectangles (tiles) that contain many 1’s; an alternative compression-based model.
Candidate Filtering
Two-stage strategy: mine a large candidate set first, then select a subset that improves compression.
Single-pass Filtering
Greedy candidate filtering variant that considers each candidate once in a fixed order, accepting if compression improves.
Iterative Candidate Selection
Filtering method that repeatedly re-evaluates all current candidates, adding the one that maximally improves compression per iteration.
Pruning (in MDL mining)
Removing previously accepted patterns that no longer contribute to compression after later additions.
Direct Mining
Approach that generates promising patterns on the fly from the current model, avoiding pre-mining huge candidate sets.
Optimal Model
Model that minimizes total description length within the chosen model class and encoding scheme.
L(D|M)
Encoded length, in bits, of the data when compressed with model M.
L(M)
Encoded length, in bits, needed to describe the model itself.
MDL vs. AIC/BIC
MDL uses full lossless codes and no explicit parameter count; AIC/BIC use (negative) log-likelihood plus parameter penalties.
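For side-by-side contrast, the textbook forms of the penalized-likelihood criteria (with maximized likelihood L̂, k parameters, n samples) versus the two-part MDL total:

```latex
\mathrm{AIC} = -2\ln\hat{L} + 2k, \qquad
\mathrm{BIC} = -2\ln\hat{L} + k\ln n, \qquad
\text{MDL: } \min_{M} \; L(M) + L(D \mid M)
```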
Lossy Compression (in MDL context)
Encoding that discards some data detail; disallowed in strict MDL because comparisons between models become unfair.
Normalized Compression Distance (NCD)
Similarity measure based on general-purpose compressors, approximating Kolmogorov complexity for pairs of strings.
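A self-contained sketch using zlib as the general-purpose compressor; any off-the-shelf compressor could stand in:

```python
import os
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    with C(.) the compressed size under zlib."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"abab" * 200, b"abab" * 200))    # near 0: identical inputs
print(ncd(b"abab" * 200, os.urandom(800)))  # near 1: unrelated inputs
```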
Code Table Classifier
Classification scheme using one code table per class; an unseen tuple gets the class whose table encodes it shortest.
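A minimal sketch of the decision rule, reusing the hypothetical `cover` and per-pattern code lengths from the earlier sketches; one (code table, lengths) pair per class is assumed:

```python
def encoded_length(t, code_table, lengths):
    """Bits needed to encode tuple t under one class's code table."""
    return sum(lengths[p] for p in cover(t, code_table))

def classify(t, class_tables):
    """class_tables: label -> (code_table, per-pattern code lengths).
    Pick the class whose code table encodes t in the fewest bits."""
    return min(class_tables,
               key=lambda c: encoded_length(t, *class_tables[c]))
```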
Asymmetric Code Length Difference (ACLD)
Normalized measure of how much worse database D is compressed by another database’s model than by its own.
Dataset Dissimilarity Measure (DS)
Symmetric distance between two datasets defined as the maximum of two ACLDs, using their MDL-optimal models.
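Written out under the definitions above, with M_i the MDL-optimal model induced from database D_i (hedged as one common formulation):

```latex
\mathrm{ACLD}(D_1, M_2) =
  \frac{L(D_1 \mid M_2) - L(D_1 \mid M_1)}{L(D_1 \mid M_1)}, \qquad
\mathrm{DS}(D_1, D_2) =
  \max\bigl\{\mathrm{ACLD}(D_1, M_2),\, \mathrm{ACLD}(D_2, M_1)\bigr\}
```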
Component Identification
Task of partitioning a mixed database into homogeneous components by finding a set of models that minimizes total description length.
Sequential Pattern (Serial Episode)
Ordered list of events occurring in sequence data; mined using MDL in algorithms like SQS and GoKrimp.
Candidate Set
Subset of the pattern language provided to a filtering algorithm as potential patterns for the final model.
Beam Search (in pattern mining)
Search strategy maintaining the top-k best partial models after each iteration to explore more of the search space.
EM-like Approach (for components)
Iterative procedure alternating between inducing models for current partitions and re-assigning tuples to the best model to minimize description length.
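A minimal sketch of the alternating loop, assuming two hypothetical callables: `induce(tuples)` returns a model (e.g., wrapping an MDL inducer such as Krimp) and `encoded_len(t, model)` returns the bits needed to encode a tuple with it.

```python
def identify_components(database, k, induce, encoded_len, iters=20):
    """Alternate between inducing one model per component and
    re-assigning every tuple to the model that encodes it shortest."""
    parts = [database[i::k] for i in range(k)]   # naive initial split
    models = [induce(p) for p in parts]
    for _ in range(iters):
        new_parts = [[] for _ in range(k)]
        for t in database:
            best = min(range(k), key=lambda i: encoded_len(t, models[i]))
            new_parts[best].append(t)
        if new_parts == parts:                   # assignments stable
            break
        parts = new_parts
        models = [induce(p) for p in parts]
    return parts, models
```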