OPT READING Slim: Directly Mining Descriptive Patterns – Key Vocabulary

0.0(0)
studied byStudied by 0 people
learnLearn
examPractice Test
spaced repetitionSpaced Repetition
heart puzzleMatch
flashcardsFlashcards
Card Sorting

1/44

flashcard set

Earn XP

Description and Tags

Flashcards summarising essential vocabulary from the lecture on Slim, an MDL-based algorithm for mining high-quality itemset patterns without pre-mining candidates.

Study Analytics
Name
Mastery
Learn
Test
Matching
Spaced

No study sessions yet.

45 Terms

1
New cards

Pattern Mining

A branch of data mining focused on discovering interesting local structures (patterns) such as frequent itemsets or association rules in data.

2
New cards

Itemset

A set of items that can occur together in a transaction; fundamental unit in many pattern-mining tasks.

3
New cards

Transaction Database

A collection (bag) of transactions, each transaction being a set of items drawn from a common item universe.

4
New cards

Support (of an Itemset)

The number of transactions in which an itemset occurs; a key frequency measure.

5
New cards

Frequent Itemset

An itemset whose support meets or exceeds a specified minimum threshold.

6
New cards

Pattern Explosion

The phenomenon where lowering support thresholds produces an enormous, often unmanageable, number of frequent itemsets.

7
New cards

Minimum Description Length (MDL) Principle

A model-selection principle stating that the best model minimizes the sum of the lengths (in bits) of the model itself and the data encoded with the model.

8
New cards

Two-Part MDL (Crude MDL)

MDL variant where model and data are encoded separately as L(M)+L(D|M).

9
New cards

Code Table (CT)

A two-column dictionary listing itemsets with corresponding code lengths, used to encode data under MDL.

10
New cards

Standard Code Table (ST)

The baseline code table containing only singleton items, used to encode the left column of any other code table.

11
New cards

Standard Cover Order

Ordering of itemsets in a code table: descending by cardinality, then descending by support, then lexicographically.

12
New cards

Cover (of a Transaction)

The set of code-table itemsets whose codes are concatenated to encode a transaction without overlaps.

13
New cards

Usage (of an Itemset)

The count of transactions whose cover contains that itemset.

14
New cards

Relative Usage

The probability that an itemset’s code is used when encoding an arbitrary transaction; equals usage divided by total usages.

15
New cards

Shannon Entropy Code Length

Optimal prefix-code length for an event: –log₂ Pr(event); used to assign code lengths in a code table.

16
New cards

Total Compressed Size L(CT,D)

Sum of encoded database size L(D|CT) and encoded model size L(CT|D); the MDL objective to minimize.

17
New cards

Minimal Coding Set Problem

The task of finding the smallest subset of candidate itemsets whose code table minimizes total compressed size.

18
New cards

Krimp Algorithm

A two-phase MDL approach: mine frequent itemsets, then traverse them once in a fixed order, adding those that improve compression.

19
New cards

Slim Algorithm

An any-time, parameter-free algorithm that greedily constructs code tables directly from data by iteratively combining current patterns for maximal estimated compression gain.

20
New cards

Any-Time Algorithm

An algorithm that continually improves its solution and can be stopped at any moment to yield a valid, if sub-optimal, result.

21
New cards

Gain Order

Ordering of candidate itemsets in Slim based on estimated reduction (gain) in total compressed size ∆L.

22
New cards

Post-Pruning

Step after accepting a candidate where Slim/Krimp reconsider existing itemsets, removing those no longer contributing to compression.

23
New cards

Candidate Explosion

The dramatic growth in the number of candidate itemsets when support thresholds are lowered in traditional mining.

24
New cards

Heuristic Gain Estimate (ΔL̂)

Slim’s fast approximation of compression gain for a candidate, based on usages of only the directly affected itemsets.

25
New cards

Branch-and-Bound (in Slim)

Search technique that prunes candidate exploration using upper bounds on possible gain, avoiding needless evaluation.

26
New cards

Cover Matrix (C)

A binary matrix representing which code-table itemsets cover which transactions; frequent sets in C suggest new candidates.

27
New cards

Tiling

Greedy set-cover method selecting itemsets that cover many uncovered 1s; inspiration but not directly optimal for MDL compression.

28
New cards

SetCover Problem

NP-hard problem of selecting the smallest set of subsets that cover all elements; conceptually related to code-table construction.

29
New cards

Sub-Modular Function

A set function exhibiting diminishing returns, for which greedy selection has provable bounds; not proven for code-table gain.

30
New cards

Local Optimisation (Greedy)

Strategy of repeatedly choosing the best single candidate at the current step; used in Kramp and Slam variants.

31
New cards

Kramp

A Krimp variant that recalculates exact compression gain for all current candidates each iteration, yielding better but slower results.

32
New cards

Slam

A Slim variant ordering candidates by exact (rather than estimated) gain each iteration, serving as a gold-standard reference.

33
New cards

Minsup Threshold

Minimum support parameter controlling the number of frequent itemsets generated; not needed by Slim.

34
New cards

Parameter-Free (in Slim)

Property that Slim requires no user-set thresholds (such as minsup) in theory or practice.

35
New cards

Dense Dataset

A database with a high ratio of 1s; often causes severe pattern explosion for frequent-itemset mining but suits Slim well.

36
New cards

Relative Compression (L%)

Encoded size relative to encoding with the singleton code table: L(CT,D)/L(ST,D) ×100%.

37
New cards

∆L (Exact Gain)

True reduction in total compressed size when a candidate is added to the code table.

38
New cards

Usage List (Tid-list)

Set of transaction identifiers where an itemset appears; intersected to estimate candidate support.

39
New cards

Any-Time Convergence

Behaviour where Slim rapidly achieves large compression gains early, then slowly refines for smaller improvements.

40
New cards

Classification via Code Tables

Technique where a separate code table per class is built and a new instance is assigned to the class yielding the shortest encoding.

41
New cards

Outlier Detection with Compression

Method marking transactions as anomalies if their encoded length under the data’s code table is higher than expected.

42
New cards

Candidate Pair (X ∪ Y)

Union of two existing code-table itemsets considered as a new pattern during Slim’s refinement.

43
New cards

Standard Candidate Order (Krimp)

Fixed ordering for Krimp’s candidate list: descending support, then length, then lexicographic.

44
New cards

Compression Ratio

Measure of how succinctly data is described; lower values signify better compression.

45
New cards

Greedy Approximation Quality

Observation that Slim’s heuristic achieves compression close to exact greedy (Slam/Kramp) but with far less computation.