07 SQUISH – Efficiently Summarising Event Sequences

0.0(0)
studied byStudied by 0 people
learnLearn
examPractice Test
spaced repetitionSpaced Repetition
heart puzzleMatch
flashcardsFlashcards
Card Sorting

1/26

flashcard set

Earn XP

Description and Tags

These flashcards review core vocabulary from the SQUISH lecture, covering MDL foundations, pattern types, encoding concepts, and key algorithms (FINDWIN, GREEDYCOVER, ESTIMATE) used to summarise event-sequence data efficiently.

Study Analytics
Name
Mastery
Learn
Test
Matching
Spaced

No study sessions yet.

27 Terms

1
New cards

SQUISH

A greedy MDL-based algorithm that finds a small set of interleaving sequential patterns (including choicisodes) that best compress an event-sequence database.

2
New cards

Pattern Set Mining

Task of discovering a small collection of patterns whose combined occurrences describe the data well, rather than listing all frequent patterns.

3
New cards

Minimum Description Length (MDL) Principle

Model-selection rule stating that the best model minimises total code length L(M)+L(D|M); i.e., it compresses the data most.

4
New cards

Event Sequence Database

A dataset consisting of |D| sequences, each an ordered list of events drawn from an alphabet Ω.

5
New cards

Serial Episode

A sequential pattern X = x1x2…xk that must appear as an ordered subsequence (with possible gaps) in a sequence.

6
New cards

Choicisode

A ‘choice episode’ allowing multiple alternative events at specific positions, e.g., [a,b] c matches ‘a c’ or ‘b c’.

7
New cards

Interleaving (and Nesting)

Allowing occurrences of different patterns to overlap in time so that their events can be mixed within one another.

8
New cards

Cover (C)

A selection of windows (occurrences) of patterns that jointly explain every event in the database without overlap.

9
New cards

Window

A subsequence S[a,b] that contains one occurrence of a pattern (possibly with gaps).

10
New cards

Minimal Window

Shortest possible window covering a pattern; no proper sub-window also contains the pattern.

11
New cards

Code Table (CT)

Model encoding consisting of all singleton events and a set P of non-singleton patterns, each with associated codes.

12
New cards

Singleton Pattern

A code-table entry containing exactly one event; forms the baseline model (standard code table ST).

13
New cards

Pattern Code (code_p)

Prefix code placed in the pattern stream Cp that identifies which pattern is used at a position.

14
New cards

Gap Code (code_g)

Meta-stream symbol indicating an extra database event between two successive events of a pattern occurrence.

15
New cards

Fill Code (code_f)

Meta-stream symbol telling that the next event of a pattern should be read while decoding.

16
New cards

Pattern Stream (Cp)

Sequence of pattern codes emitted while encoding the database using a cover.

17
New cards

Meta Stream (Cm)

Parallel stream of gap and fill codes that marks how each active pattern occurrence proceeds.

18
New cards

Usage(X)

Number of times pattern X’s code occurs in Cp; determines its optimal code length −log(usage/sum usages).

19
New cards

Compression Gain (ΔL)

Savings L(D,ST) − L(D,CT); higher values mean the discovered patterns describe the data better.

20
New cards

Minimal Code Table Problem

Optimisation task of finding the set P minimising total encoded length L(CT,D).

21
New cards

FINDWIN

SQUISH subroutine that efficiently enumerates candidate windows of a pattern within a sequence using an inverted index and a gap budget.

22
New cards

GREEDYCOVER

Algorithm that incrementally builds an interleaving, non-overlapping cover by inserting candidate windows in a fixed ‘candidate order’.

23
New cards

Candidate Order

Tie-breaking priority for patterns: longer length ↓, higher support ↓, longer singleton encoding ↓, lexicographic ↑.

24
New cards

ESTIMATE

Heuristic that predicts which concatenations XY of existing patterns will likely reduce total code length.

25
New cards

PRUNE

Post-acceptance procedure that deletes patterns whose removal lowers or does not raise the overall description length.

26
New cards

SQS

Earlier MDL method for non-interleaving serial episodes; SQUISH extends it with interleaving and choicisodes and runs much faster.

27
New cards

Pattern Recall

Evaluation metric: proportion of events in ground-truth patterns that can be covered (without gaps) by the learned patterns.