OPT READING The Difference and the Norm – Key Vocabulary

Description and Tags

Vocabulary flashcards summarising the main terms, algorithms and theoretical concepts introduced in the lecture on DIFFNORM and MDL-based pattern discovery.

42 Terms

Minimum Description Length (MDL) principle

An information-theoretic framework that selects the model giving the shortest combined encoding of the model itself and the data encoded with that model.

Two-part MDL (crude MDL)

Variant of MDL that encodes model and data separately, minimising L(M) + L(D | M).
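
A minimal sketch of two-part model selection, assuming two hypothetical models of a binary string and made-up model costs (0 and 4 bits). It only illustrates how L(M) + L(D | M) trades model complexity against fit; it is not taken from the lecture.

```python
import math

def two_part_length(model_bits, data, prob):
    """Return L(M) + L(D | M) in bits, where prob(x) is the model's probability for symbol x."""
    data_bits = -sum(math.log2(prob(x)) for x in data)
    return model_bits + data_bits

data = "aaaaaaab" * 4  # 28 a's and 4 b's

# Hypothetical model A: uniform over {a, b}; assumed to cost 0 bits to describe.
uniform_bits = two_part_length(0.0, data, lambda x: 0.5)

# Hypothetical model B: skewed distribution; assumed to cost 4 bits to describe.
skewed = {"a": 7 / 8, "b": 1 / 8}
skewed_bits = two_part_length(4.0, data, lambda x: skewed[x])

# Two-part MDL picks the model with the smallest total length.
best = min([("uniform", uniform_bits), ("skewed", skewed_bits)], key=lambda t: t[1])
print(best)
```

On a string as short as `"aaaaaaab"` the extra 4 model bits would not pay off, which is the usual MDL guard against overfitting.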

Refined MDL

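MDL variant that encodes the data with a single one-part (universal) code defined over the whole model class rather than with separate model and data terms; theoretically stronger but harder to compute.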

Induction by Compression

The idea that the best explanation of data is the one that compresses it most.

Minimum Message Length (MML)

A model-selection principle closely related to MDL that also minimises total code length.

Transaction

A set of items (e.g., products bought together) treated as one observation.

Itemset

A subset of items; treated as a potential pattern in transaction data.

Support

The number of transactions in which an itemset appears.

Relative support (frequency)

Support divided by the total number of transactions.
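
A small sketch of support and relative support on a toy database. The transactions and the function names are illustrative assumptions.

```python
def support(itemset, transactions):
    """Count the transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)

def relative_support(itemset, transactions):
    """Support divided by the total number of transactions."""
    return support(itemset, transactions) / len(transactions)

# Toy transaction database (made up for illustration).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

print(support({"bread", "milk"}, transactions))           # 2
print(relative_support({"bread", "milk"}, transactions))  # 0.5
```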

Pattern set

A collection of itemsets used together to describe one or more databases.

Model S

The full collection of pattern sets, one for each user-specified subset of databases.

Coding set (Ci)

The union of all pattern sets relevant to database Di plus all single items; used to encode Di.
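
One way to picture the coding set, assuming the model is stored as a dictionary from database-index subsets to pattern sets. That representation and the name `coding_set` are assumptions made for this sketch.

```python
def coding_set(i, model, items):
    """Union of every pattern set whose database subset contains i, plus all singletons."""
    patterns = set()
    for db_subset, pattern_set in model.items():
        if i in db_subset:
            patterns |= pattern_set
    patterns |= {frozenset([x]) for x in items}
    return patterns

# Hypothetical model over two databases: one local pattern set each, one shared.
model = {
    frozenset({1}): {frozenset("ab")},
    frozenset({2}): {frozenset("cd")},
    frozenset({1, 2}): {frozenset("bc")},
}

c1 = coding_set(1, model, "abcd")
print(sorted("".join(sorted(p)) for p in c1))  # ['a', 'ab', 'b', 'bc', 'c', 'd']
```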

Cover function

A procedure that selects non-overlapping patterns from the coding set whose union equals a transaction.

GREEDYCOVER

Heuristic cover algorithm that picks patterns in Standard Cover Order (large, frequent, lexicographic).

Standard Cover Order

Sorting rule for patterns: descending by cardinality (number of items), then descending by support, then lexicographically ascending.
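
The ordering and the greedy cover can be sketched together as follows. This is a simplified illustration on a toy database, assuming the coding set contains all singletons so that every transaction can be covered; the function names are hypothetical.

```python
def standard_cover_order(patterns, transactions):
    """Sort by size (descending), then support (descending), then lexicographically."""
    def support(p):
        return sum(1 for t in transactions if p <= t)
    return sorted(patterns, key=lambda p: (-len(p), -support(p), sorted(p)))

def greedy_cover(transaction, ordered_patterns):
    """Pick non-overlapping patterns, in the given order, until the transaction is covered."""
    remaining = set(transaction)
    cover = []
    for p in ordered_patterns:
        if p <= remaining:      # pattern fits into the still-uncovered items
            cover.append(p)
            remaining -= p
        if not remaining:
            break
    return cover

transactions = [frozenset("abc"), frozenset("ab"), frozenset("bc"), frozenset("abc")]
patterns = [frozenset(s) for s in ("ab", "bc", "a", "b", "c")]

order = standard_cover_order(patterns, transactions)
cover = greedy_cover(frozenset("abc"), order)
print(["".join(sorted(p)) for p in cover])  # ['ab', 'c']
```

Here `ab` and `bc` tie on size and support, so the lexicographic rule places `ab` first and `bc` can no longer be used for the transaction `abc`.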

Usage (usg)

The number of times a pattern is used in the covers of all transactions.

Prequential coding

Universal coding scheme that updates probabilities after every symbol, avoiding the need to encode usages explicitly.
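
A sketch of a prequential plug-in code, assuming each symbol starts with a pseudo-count ε. The value ε = 0.5 and the function name are assumptions made here, not details confirmed by the cards. Each symbol is encoded with the probability estimated from the symbols seen so far, then its count is updated.

```python
import math
from collections import Counter

def prequential_length(sequence, alphabet, eps=0.5):
    """Total code length in bits when probabilities are updated after every symbol."""
    counts = Counter()
    bits = 0.0
    for seen, symbol in enumerate(sequence):
        # Probability of the next symbol given the counts so far (eps is the pseudo-count).
        p = (counts[symbol] + eps) / (seen + eps * len(alphabet))
        bits -= math.log2(p)
        counts[symbol] += 1
    return bits

alphabet = ["ab", "bc", "c"]
one_order = ["ab", "c", "ab", "bc", "ab", "c"]
other_order = ["c", "c", "bc", "ab", "ab", "ab"]

print(prequential_length(one_order, alphabet))
print(prequential_length(other_order, alphabet))
```

Both orderings give the same length, since the total depends only on the final counts. This is why no separate usage table has to be transmitted.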

Shannon entropy code length

Optimal prefix code length, −log₂ Pr(X) bits, for a symbol X that occurs with probability Pr(X).
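
As an illustration, code lengths can be derived from usage counts by taking Pr(X) to be the relative usage of X. The covers below are invented and the helper name is hypothetical.

```python
import math
from collections import Counter

def code_lengths(covers):
    """Map each pattern to -log2(usage / total usage), its Shannon code length in bits."""
    usage = Counter(p for cover in covers for p in cover)
    total = sum(usage.values())
    return {p: -math.log2(u / total) for p, u in usage.items()}

# Toy covers of four transactions; patterns are written as plain strings for brevity.
covers = [["ab", "c"], ["ab"], ["bc"], ["ab", "c"]]
lengths = code_lengths(covers)

for pattern, bits in sorted(lengths.items()):
    print(pattern, round(bits, 3))
```

Frequently used patterns receive short codes, and the lengths satisfy the Kraft inequality with equality because the probabilities sum to one.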

Universal code LN

A code for positive integers (n ≥ 1) that requires no known upper bound; its code lengths satisfy the Kraft inequality.
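
A sketch of the code length in Rissanen's usual formulation, which covers integers n ≥ 1: log₂ c₀ + log₂ n + log₂ log₂ n + ..., keeping only the positive terms. The constant c₀ ≈ 2.865064 normalises the code so that the Kraft inequality holds. The function name is an assumption.

```python
import math

C0 = 2.865064  # normalising constant of the universal integer code

def universal_code_length(n):
    """Length in bits of the universal code for an integer n >= 1."""
    if n < 1:
        raise ValueError("the universal integer code is defined for n >= 1")
    bits = math.log2(C0)
    term = math.log2(n)
    while term > 0:          # add log2 n, log2 log2 n, ... while the terms stay positive
        bits += term
        term = math.log2(term)
    return bits

for n in (1, 2, 16, 1000):
    print(n, round(universal_code_length(n), 3))
```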

DIFFNORM

Algorithm that jointly discovers non-redundant pattern sets describing both common and database-specific structure.

Candidate pattern (X ∪ Y)

The union of two patterns X and Y considered for addition to the model because their individual codes co-occur.

∆L (gain)

Reduction in total encoded length obtained by adding a candidate pattern to the model.

Estimated gain (∆L̂)

Fast approximation of ∆L used to rank candidates before exact evaluation.

WEIGHTEDGREEDYCOVER

Procedure that greedily assigns a candidate to the subset(s) of databases where it yields the largest estimated gain.

PRUNE step

Phase that removes patterns whose usage has dropped so far that keeping them no longer reduces the total encoded length.

Redundancy penalty (in MDL)

Implicit discouragement of overlapping or unnecessary patterns because they increase description length.

KRIMP

Earlier MDL-based algorithm that mines a succinct code table for a single database.

SLIM algorithm

MDL-based method that refines a code table by iteratively merging patterns; operates on one database.

SLAM

Locally optimal variant of SLIM that evaluates all candidate merges but at higher cost.

Pattern explosion

Phenomenon where frequent-pattern mining returns an unmanageably large number of itemsets.

Frequent Pattern Mining

Task of finding all itemsets whose support meets a user-defined minimum-support threshold.
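
A brute-force sketch of frequent itemset mining on a toy database, which also shows how quickly the result grows. It enumerates every candidate itemset and is only meant for tiny examples; practical miners such as Apriori prune the search. The data and function name are illustrative assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            count = sum(1 for t in transactions if itemset <= t)
            if count >= min_support:
                result[itemset] = count
    return result

# Three identical 5-item transactions plus one short one.
transactions = [frozenset("abcde")] * 3 + [frozenset("ab")]
frequent = frequent_itemsets(transactions, min_support=3)
print(len(frequent))  # 31, i.e. every non-empty subset of {a, b, c, d, e}
```

A single repeated transaction of n items already yields 2ⁿ − 1 frequent itemsets, which is the pattern explosion that pattern set mining tries to avoid.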

Pattern set mining

Mining only a small, non-redundant collection of patterns that summarise the data.

Joint Subspace Boolean Matrix Factorization (JSBMF)

Method that separates common from dataset-specific binary patterns, but requires the user to specify the number of patterns.

Bag of databases (D)

A collection of individual transaction databases treated jointly.

Index set (J)

The set {1,…,d} used to identify individual databases within D.

User interest set (U)

Chosen subsets of J for which separate pattern sets should be learned.

Gamma function (Γ)

Continuous extension of the factorial, with Γ(n) = (n − 1)! for positive integers n; used in the closed form of the prequential code length.
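
A sketch of how the Gamma function gives a closed form, assuming a prequential code in which each of the k symbols starts with pseudo-count ε (ε = 0.5 is an assumed value). Under that assumption the total length is log₂[Γ(N + εk) / Γ(εk)] − Σᵢ log₂[Γ(usgᵢ + ε) / Γ(ε)], where N is the sum of all usages. `math.lgamma` is used to avoid overflow.

```python
import math

def prequential_length_closed(usages, eps=0.5):
    """Closed-form prequential code length in bits, computed from final usage counts."""
    k = len(usages)
    n = sum(usages)
    ln_total = math.lgamma(n + eps * k) - math.lgamma(eps * k)
    ln_symbols = sum(math.lgamma(u + eps) - math.lgamma(eps) for u in usages)
    return (ln_total - ln_symbols) / math.log(2)  # convert from nats to bits

print(round(prequential_length_closed([3, 3, 1]), 3))
```

Only the final counts are needed, so the length can be evaluated without replaying the symbol sequence.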

Double factorial (!!)

Product of all positive integers up to n that have the same parity as n, e.g. 7!! = 7·5·3·1 = 105; appears in prequential coding formulas.
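
One reason the double factorial can show up is the identity Γ(n + 1/2) = (2n − 1)!! · √π / 2ⁿ, which applies when Gamma is evaluated at half-integers, for instance with a pseudo-count of 1/2. That this is the route taken in the lecture is an assumption. The sketch below checks the identity numerically, using the convention (−1)!! = 0!! = 1.

```python
import math

def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2; returns 1 for n <= 0."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def gamma_half_integer(n):
    """Gamma(n + 1/2) computed via the double factorial identity."""
    return double_factorial(2 * n - 1) * math.sqrt(math.pi) / 2 ** n

print(double_factorial(7))                     # 105
print(gamma_half_integer(3), math.gamma(3.5))  # both approximately 3.3234
```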

Complexity of DIFFNORM

Worst-case O(|F|³·|D|) where |F| is the number of frequent patterns, but practical runtime is much lower due to pruning.

Parameter-free algorithm

An algorithm, such as DIFFNORM, that avoids user-defined settings like number of patterns or thresholds beyond minimal support.

Global pattern set (SΩ)

The pattern set associated with all databases, capturing norms common across them.

Local pattern set (Si)

Pattern set that is characteristic of a single database Di or of a user-chosen subset of the databases.