1/26
These flashcards review core vocabulary from the SQUISH lecture, covering MDL foundations, pattern types, encoding concepts, and key algorithms (FINDWIN, GREEDYCOVER, ESTIMATE) used to summarise event-sequence data efficiently.
Name | Mastery | Learn | Test | Matching | Spaced |
---|
No study sessions yet.
SQUISH
A greedy MDL-based algorithm that finds a small set of interleaving sequential patterns (including choicisodes) that best compress an event-sequence database.
Pattern Set Mining
Task of discovering a small collection of patterns whose combined occurrences describe the data well, rather than listing all frequent patterns.
Minimum Description Length (MDL) Principle
Model-selection rule stating that the best model minimises total code length L(M)+L(D|M); i.e., it compresses the data most.
Event Sequence Database
A dataset consisting of |D| sequences, each an ordered list of events drawn from an alphabet Ω.
Serial Episode
A sequential pattern X = x1x2…xk that must appear as an ordered subsequence (with possible gaps) in a sequence.
Choicisode
A ‘choice episode’ allowing multiple alternative events at specific positions, e.g., [a,b] c matches ‘a c’ or ‘b c’.
Interleaving (and Nesting)
Allowing occurrences of different patterns to overlap in time so that their events can be mixed within one another.
Cover (C)
A selection of windows (occurrences) of patterns that jointly explain every event in the database without overlap.
Window
A subsequence S[a,b] that contains one occurrence of a pattern (possibly with gaps).
Minimal Window
Shortest possible window covering a pattern; no proper sub-window also contains the pattern.
Code Table (CT)
Model encoding consisting of all singleton events and a set P of non-singleton patterns, each with associated codes.
Singleton Pattern
A code-table entry containing exactly one event; forms the baseline model (standard code table ST).
Pattern Code (code_p)
Prefix code placed in the pattern stream Cp that identifies which pattern is used at a position.
Gap Code (code_g)
Meta-stream symbol indicating an extra database event between two successive events of a pattern occurrence.
Fill Code (code_f)
Meta-stream symbol telling that the next event of a pattern should be read while decoding.
Pattern Stream (Cp)
Sequence of pattern codes emitted while encoding the database using a cover.
Meta Stream (Cm)
Parallel stream of gap and fill codes that marks how each active pattern occurrence proceeds.
Usage(X)
Number of times pattern X’s code occurs in Cp; determines its optimal code length −log(usage/sum usages).
Compression Gain (ΔL)
Savings L(D,ST) − L(D,CT); higher values mean the discovered patterns describe the data better.
Minimal Code Table Problem
Optimisation task of finding the set P minimising total encoded length L(CT,D).
FINDWIN
SQUISH subroutine that efficiently enumerates candidate windows of a pattern within a sequence using an inverted index and a gap budget.
GREEDYCOVER
Algorithm that incrementally builds an interleaving, non-overlapping cover by inserting candidate windows in a fixed ‘candidate order’.
Candidate Order
Tie-breaking priority for patterns: longer length ↓, higher support ↓, longer singleton encoding ↓, lexicographic ↑.
ESTIMATE
Heuristic that predicts which concatenations XY of existing patterns will likely reduce total code length.
PRUNE
Post-acceptance procedure that deletes patterns whose removal lowers or does not raise the overall description length.
SQS
Earlier MDL method for non-interleaving serial episodes; SQUISH extends it with interleaving and choicisodes and runs much faster.
Pattern Recall
Evaluation metric: proportion of events in ground-truth patterns that can be covered (without gaps) by the learned patterns.