01 Mining and Using Sets of Patterns through Compression – Key Vocabulary

0.0(0)
studied byStudied by 0 people
learnLearn
examPractice Test
spaced repetitionSpaced Repetition
heart puzzleMatch
flashcardsFlashcards
Card Sorting

1/39

flashcard set

Earn XP

Description and Tags

Vocabulary flashcards covering fundamental terms and definitions from the lecture on applying the MDL principle and compression for pattern mining and related data-mining tasks.

Study Analytics
Name
Mastery
Learn
Test
Matching
Spaced

No study sessions yet.

40 Terms

1
New cards

Pattern Explosion

The overwhelming number of patterns returned when interestingness constraints are loosened, making results unmanageable and redundant.

2
New cards

Interestingness Measure

A function that evaluates how important or useful a pattern is with respect to the data and the user’s goal.

3
New cards

Minimum Description Length (MDL) Principle

A model-selection guideline stating that the best model is the one that minimizes the combined length of the model description and the data encoded with it.

4
New cards

Two-part MDL

MDL variant where data are encoded in two parts: the model length L(M) plus the data length given the model L(D|M).

5
New cards

Lossless Compression

An encoding scheme that allows exact reconstruction of the original data from the compressed representation.

6
New cards

Kolmogorov Complexity

The length of the shortest program on a universal Turing machine that outputs a given string and halts; the theoretical limit of compression.

7
New cards

Algorithmic Information Theory

Field linking information content to the computational resources needed to generate data, underpinning Kolmogorov complexity and MDL.

8
New cards

Code Table

A two-column model mapping patterns to code words; used to compress data by replacing pattern occurrences with their codes.

9
New cards

Pattern Language

The set of all pattern types allowed for a given data domain, e.g., all possible itemsets for transactional data.

10
New cards

Cover Function

A procedure that selects a non-overlapping set of patterns from a code table that together encode a specific tuple.

11
New cards

Usage Count

The number of tuples in which a pattern’s code appears in the cover; drives probability estimates for code lengths.

12
New cards

Optimal Prefix Code

A set of code words in which no code is a prefix of another and each length equals –log(probability), ensuring minimal expected size.

13
New cards

Shannon Source Coding Theorem

Result stating that the optimal code length for a symbol equals the negative logarithm of its probability.

14
New cards

Candidate Set Filtering

Family of algorithms that first mine a pool of candidate patterns and then select a subset that minimizes MDL.

15
New cards

Single-pass Filtering

Greedy selection strategy that scans the candidate list once, testing each pattern for compression gain and keeping beneficial ones.

16
New cards

Iterative Candidate Selection

Search approach that repeatedly re-evaluates all remaining candidates, adding the pattern that gives the largest MDL improvement each round.

17
New cards

Pruning (Pattern Mining)

Post-selection step that removes patterns whose deletion does not worsen compression, yielding smaller, non-redundant models.

18
New cards

Slim Algorithm

Direct mining method that iteratively merges highly co-occurring itemsets to build a compact, MDL-optimal code table.

19
New cards

Krimp Algorithm

Pioneer MDL method for transactional data selecting frequent itemsets that together compress the database best.

20
New cards

MDL-based Classification

Assigning a class label to an unseen tuple by encoding it with each class-specific code table and choosing the shortest code.

21
New cards

Aggregated Code Length Difference (ACLD)

Normalized measure of how much longer one model encodes another database compared to its own, used for dataset comparison.

22
New cards

Database Dissimilarity Measure (DS)

Symmetric distance between two datasets defined as the maximum of their cross-encoded ACLD values.

23
New cards

Component Identification

Partitioning a dataset into subsets, each with its own MDL-optimal model, to reveal underlying distributions.

24
New cards

Low-Entropy Set

An itemset whose presence shows a highly skewed distribution, indicating strong attribute interaction.

25
New cards

Serial Episode

A sequential pattern describing an ordered list of events that may include gaps between occurrences.

26
New cards

Tiling

Covering a binary matrix with rectangles (tiles) of 1s (or 0s) to summarize dense and sparse areas compactly.

27
New cards

Compression-based Model

Any descriptive model whose quality is evaluated by how well it losslessly compresses the data.

28
New cards

Tuple-level Compression

Property that allows computing the encoded length of each data record independently of others using the model.

29
New cards

Pattern-level Inspection

Ability to examine which specific patterns encode a tuple, enabling interpretability and explanations.

30
New cards

Refined MDL

MDL variant that encodes model and data together in one part; has stronger theory but is often infeasible computationally.

31
New cards

Bayesian Information Criterion (BIC)

Model-selection score approximating MDL asymptotically; penalizes model complexity using parameter counts.

32
New cards

Akaike Information Criterion (AIC)

Alternative to MDL/BIC selecting models via maximum likelihood plus a complexity penalty proportional to parameters.

33
New cards

Generative vs. Descriptive Models

Generative models aim to reproduce data distributions, while descriptive models (like code tables) focus on succinct representation.

34
New cards

Normalized Compression Distance (NCD)

General dissimilarity metric between objects based on their compressed sizes with a standard compressor.

35
New cards

Minimum Message Length (MML)

Information-theoretic principle closely related to MDL that includes explicit priors, often used in Bayesian contexts.

36
New cards

Data Imputation via Compression

Filling missing values by choosing replacements that yield the smallest total encoded size under an MDL model.

37
New cards

Outlier Detection with MDL

Identifying anomalies as tuples that receive unusually long codes when compressed by a model of the normal data.

38
New cards

Change Detection in Data Streams

Monitoring streaming data for rises in encoded length, signaling shifts in the underlying distribution.

39
New cards

Class-specific Compressor

A code table trained on data from one class, used both to describe that class and to classify new instances.

40
New cards

Maximum Entropy Principle

Guideline for selecting the least-biased distribution satisfying known constraints; can define priors for MDL/MML models.