1/39
Key vocabulary from the lecture on MDL-based mining of robust association rules and the Grab algorithm.
Name | Mastery | Learn | Test | Matching | Spaced |
---|
No study sessions yet.
Association Rule
An implication of the form X → Y where X (antecedent) and Y (consequent) are disjoint itemsets, expressing that Y tends to occur when X occurs.
Antecedent (Head)
The itemset X on the left‐hand side of an association rule X → Y.
Consequent (Tail)
The itemset Y on the right‐hand side of an association rule X → Y.
Support
The number of transactions in which a given itemset or rule occurs; denoted |T_X| for itemset X.
Confidence
A traditional quality measure for rules, defined as support(XY)/support(X).
Pattern
A special rule whose antecedent is empty (∅ → Y), describing unconditional co-occurrence of Y.
Singleton Rule
A rule whose consequent contains exactly one item, e.g., ∅ → A.
Noise-Resistant Rule
A rule allowed to hold approximately, even if some items of its consequent are missing in a transaction.
Minimum Description Length (MDL) Principle
Information-theoretic framework that selects the model giving the shortest total code length for describing model plus data.
One-Part (Refined) MDL
MDL variant using a single code L(D | M) to describe data with the whole model class, offering strong optimality but often infeasible.
Two-Part (Crude) MDL
Practical MDL variant using L(M)+L(D | M), i.e., separate costs for model description and for encoding data given the model.
Codelength
The number of bits required to encode an object (model or data) under a chosen coding scheme.
Model Cost L(M)
The portion of the MDL score that encodes the rule set itself, including heads, tails and parameters.
Data Cost L(D | M)
Bits needed to transmit the database when the recipient already knows the model M.
Parametric Complexity (Lpc)
The regret term of the Normalised Maximum Likelihood (NML) code used to encode model parameters optimally.
Grab Algorithm
A greedy, any-time heuristic that starts from singleton rules and iteratively refines the rule set to minimise MDL codelength.
Candidate Generation (Grab)
Phase where Grab creates new rules by merging heads or tails of existing rules, ensuring the dependency graph stays acyclic.
Gain ΔL
Change in total description length when replacing the current model by a refined model; negative ΔL indicates improvement.
Gain Estimation (Δ̂₁, Δ̂₂)
Fast first-tier and tighter second-tier approximations used by Grab to avoid expensive exact ΔL evaluations.
Cover Procedure
Subroutine that determines, for a candidate rule set, which transactions each rule explains and how errors are split.
Split Point k*
Optimal threshold (number of tail items present) that minimises MDL cost when deciding whether a rule holds in a transaction.
Tid-List (Transaction ID List)
Set of transaction identifiers containing a given itemset; used for fast support and intersection operations.
Directed Acyclic Graph (Rule Graph)
Representation of rule dependencies where vertices are rules and edges link overlapping heads/tails; must be acyclic to decode data.
Submodularity (Not satisfied)
Property of set functions with diminishing returns; the MDL score for rule sets is proven NOT to be submodular.
Monotonicity (Not satisfied)
Property where adding elements never worsens the score; MDL score for rule sets is NOT monotonically decreasing.
Greedy Heuristic
Search strategy that builds a solution step by step using locally optimal choices; Grab employs this to approximate the MDL optimum.
Pattern Set Mining
Framework evaluating the quality of a whole set of patterns rather than individual ones; Grab is an example.
Hyper+
Noise-resistant pattern miner used as a baseline in the experiments; tends to output large numbers of patterns.
Kingfisher
State-of-the-art algorithm for statistically significant dependency rules (single-item consequent) based on Fisher’s exact test.
Pack
MDL-based method that builds decision trees per item; tree paths can be interpreted as rules.
Synthetic Data
Artificially generated datasets used to test whether algorithms can recover known planted rules.
Real-World Data
Empirical datasets (e.g., Mushrooms, Mammals) used to assess practical usefulness of the mined rule sets.
Error Matrix
Binary matrix indicating, for transactions where a rule applies, which consequent items are missing (destructive) or present (additive).
Transaction
A single row in a binary dataset, typically representing items bought, attributes present, etc.
Itemset
A subset of items (features) from the alphabet I.
Dependency Graph
Graph capturing overlaps among rules; its acyclicity guarantees lossless decoding under Grab’s model.
Evaluation Order
Sequence in which candidate rules are tested; Grab prioritises those with lower estimated ΔL.
Rule Set Size |R|
Number of non-singleton rules included in a model; MDL penalises large |R| to avoid overfitting.
Information Theory
Field providing the mathematical basis (entropy, coding) underpinning MDL and compression-based model selection.
Frequent Itemset Mining
Task of enumerating all itemsets with support above a threshold; classic precursor to rule mining but often produces huge outputs.