OPT READING KRIMP: Mining Itemsets That Compress – Vocabulary Flashcards

0.0(0)
studied byStudied by 0 people
learnLearn
examPractice Test
spaced repetitionSpaced Repetition
heart puzzleMatch
flashcardsFlashcards
Card Sorting

1/29

flashcard set

Earn XP

Description and Tags

Key vocabulary terms from the lecture on the KRIMP algorithm, MDL-based itemset selection, and related concepts.

Study Analytics
Name
Mastery
Learn
Test
Matching
Spaced

No study sessions yet.

30 Terms

1
New cards

Pattern Mining

The data-mining task of discovering regularities (patterns) in data, typically describing only part of the dataset rather than producing a full predictive model.

2
New cards

Itemset

A set of items that can occur together within a transaction in a transactional database.

3
New cards

Frequent Itemset

An itemset whose support (occurrence count) meets or exceeds a user-defined minimum support threshold.

4
New cards

Support

The number of transactions in a database that contain a particular itemset; denoted suppD(X).

5
New cards

Minimum Support (minsup)

The threshold that an itemset’s support must reach to be considered frequent.

6
New cards

Pattern Explosion

The rapid growth in the number of discovered patterns when constraints such as minsup are loosened, leading to huge, redundant result sets.

7
New cards

Minimum Description Length (MDL) Principle

A model-selection principle stating that the best model yields the shortest combined description of the model itself and the data encoded with it.

8
New cards

Two-Part MDL

MDL formulation where model cost L(H) and data-given-model cost L(D|H) are encoded separately and summed.

9
New cards

Code Table (CT)

A two-column table mapping itemsets to prefix codes, used to losslessly encode a database under MDL.

10
New cards

Prefix Code

A set of codes where no code word is the prefix of another, enabling unambiguous decoding.

11
New cards

Optimal Code Length

The length −log P(X|D) assigned to an itemset’s code when codes are chosen to minimise expected description length (Shannon optimal).

12
New cards

Cover Function

A procedure that, given a code table and a transaction, selects a non-overlapping set of itemsets whose union equals the transaction.

13
New cards

Usage (of an Itemset)

The number of times an itemset is employed in covering all transactions when encoding a database with a given code table.

14
New cards

Standard Code Table (ST)

The baseline code table containing only singleton itemsets, yielding an independence-assumption encoding of the data.

15
New cards

Standard Cover Order

The fixed ordering of itemsets in CT—sorted by decreasing length, then decreasing support, then lexicographically—for deterministic covering.

16
New cards

Standard Candidate Order

The order in which candidate itemsets are tested for inclusion—sorted by decreasing support, then decreasing length, then lexicographically.

17
New cards

Krimp Algorithm

A greedy MDL-based algorithm that builds a code table by iteratively adding candidate itemsets that improve compression and optionally pruning.

18
New cards

Post-Acceptance Pruning

Krimp’s step that, after accepting a new itemset, removes existing itemsets whose removal further shortens total description length.

19
New cards

Compression Ratio (L%)

The total encoded size with a code table divided by the size with the standard table, expressed as a percentage; lower is better.

20
New cards

Candidate Set (F)

The collection of itemsets considered for inclusion in the code table, often all (or closed) frequent itemsets mined at a chosen minsup.

21
New cards

Minimal Coding Set Problem

The task of finding the smallest subset of a candidate set that minimises the total encoded size of data and code table.

22
New cards

Singleton Itemset

An itemset containing exactly one item; all singletons must appear in any valid code table.

23
New cards

Total Compressed Size L(D,CT)

The sum of the encoded database size L(D|CT) and the encoded code-table size L(CT|D).

24
New cards

Swap Randomisation

A technique that randomises a binary dataset via value swaps to break structure while preserving row and column sums, used to test pattern significance.

25
New cards

Code Table–Based Classifier

A classifier that assigns an unseen transaction to the class whose code table yields the shortest encoding of that transaction.

26
New cards

Naïve Bayes Assumption (in CTs)

The simplification that itemsets used together in a cover are independent, linking code lengths to class-conditional probabilities.

27
New cards

Krimp Classifier

The simple classifier that builds one Krimp code table per class and predicts the class giving the minimal code length for a new transaction.

28
New cards

Closed Frequent Itemset

A frequent itemset that has no proper superset with the same support, providing a lossless yet smaller representation of frequent patterns.

29
New cards

Pattern Usage Histogram

A distribution showing how often selected itemsets are used in encoding; reveals balance between general and specific patterns.

30
New cards

Heuristic

A practical strategy that aims for good solutions quickly when exhaustive optimal search is computationally infeasible, as in Krimp’s greedy selection.