1/41
Vocabulary flashcards summarising the main terms, algorithms and theoretical concepts introduced in the lecture on DIFFNORM and MDL-based pattern discovery.
Name | Mastery | Learn | Test | Matching | Spaced |
---|
No study sessions yet.
Minimum Description Length (MDL) principle
An information-theoretic framework that selects the model giving the shortest combined encoding of the model itself and the data encoded with that model.
Two-part MDL (crude MDL)
Variant of MDL that encodes model and data separately, minimising L(M) + L(D | M).
Refined MDL
MDL version that encodes model and data together; harder to compute but theoretically stronger.
Induction by Compression
The idea that the best explanation of data is the one that compresses it most.
Minimum Message Length (MML)
A model-selection principle closely related to MDL that also minimises total code length.
Transaction
A set of items (e.g., products bought together) treated as one observation.
Itemset
A subset of items; treated as a potential pattern in transaction data.
Support
The number of transactions in which an itemset appears.
Relative support (frequency)
Support divided by the total number of transactions.
Pattern set
A collection of itemsets used together to describe one or more databases.
Model S
The full collection of pattern sets, one for each user-specified subset of databases.
Coding set (Ci)
The union of all pattern sets relevant to database Di plus all single items; used to encode Di.
Cover function
A procedure that selects non-overlapping patterns from the coding set whose union equals a transaction.
GREEDYCOVER
Heuristic cover algorithm that picks patterns in Standard Cover Order (large, frequent, lexicographic).
Standard Cover Order
Sorting rule: descending by pattern length, then by support, then lexicographically.
Usage (usg)
The number of times a pattern is used in the covers of all transactions.
Prequential coding
Universal coding scheme that updates probabilities after every symbol, avoiding the need to encode usages explicitly.
Shannon entropy code length
Optimal prefix code length –log₂ Pr(X) used for encoding symbols.
Universal code LN
A code for non-negative integers used without knowing an upper bound; satisfies the Kraft inequality.
DIFFNORM
Algorithm that jointly discovers non-redundant pattern sets describing both common and database-specific structure.
Candidate pattern (X ∪ Y)
The union of two patterns X and Y considered for addition to the model because their individual codes co-occur.
∆L (gain)
Reduction in total encoded length obtained by adding a candidate pattern to the model.
Estimated gain (∆Ĺ)
Fast approximation of ∆L used to rank candidates before exact evaluation.
WEIGHTEDGREEDYCOVER
Procedure that greedily assigns a candidate to the subset(s) of databases where it yields the largest estimated gain.
PRUNE step
Phase that removes patterns whose usage dropped so much they no longer help compression.
Redundancy penalty (in MDL)
Implicit discouragement of overlapping or unnecessary patterns because they increase description length.
KRIMP
Earlier MDL-based algorithm that mines a succinct code table for a single database.
SLIM algorithm
MDL-based method that refines a code table by iteratively merging patterns; operates on one database.
SLAM
Locally optimal variant of SLIM that evaluates all candidate merges but at higher cost.
Pattern explosion
Phenomenon where frequent-pattern mining returns an unmanageably large number of itemsets.
Frequent Pattern Mining
Task of finding all itemsets with support above a user-defined threshold.
Pattern set mining
Mining only a small, non-redundant collection of patterns that summarise the data.
Joint Subspace Boolean Matrix Factorization (JSBMF)
Method for separating common and dataset-specific binary patterns but requires user-specified pattern counts.
Bag of databases (D)
A collection of individual transaction databases treated jointly.
Index set (J)
The set {1,…,d} used to identify individual databases within D.
User interest set (U)
Chosen subsets of J for which separate pattern sets should be learned.
Gamma function (Γ)
Continuous extension of the factorial, used in the closed form of prequential code length.
Double factorial (!!)
Product of every second integer up to n; appears in prequential coding formulas.
Complexity of DIFFNORM
Worst-case O(|F|³·|D|) where |F| is the number of frequent patterns, but practical runtime is much lower due to pruning.
Parameter-free algorithm
An algorithm, such as DIFFNORM, that avoids user-defined settings like number of patterns or thresholds beyond minimal support.
Global pattern set (SΩ)
The pattern set associated with all databases, capturing norms common across them.
Local pattern set (Si)
Pattern set that is characteristic for a specific database Di or subset thereof.