Galil Rule
Galil Rule Introduction
Instructor: Dr. Mary Hudachek-Buswell
Developed by: Quill Healey
Contributions by: Ivan Leung, Rena Li, Mai Anh Lê Hoàng, Ruston Shome
Prerequisites for Galil Rule
Must have implemented the Boyer-Moore algorithm.
Must have implemented the Failure Table used in Knuth-Morris-Pratt (KMP) algorithm.
Reviewing these concepts before proceeding is highly recommended.
Historical Context
The Galil Rule was developed by Zvi Galil, who served as the former Dean of the College of Computing.
The foundational article is titled "On improving the worst case running time of the Boyer-Moore string matching algorithm" and was published by ACM in 1979.
The Galil Rule linearizes the worst-case runtime of the full Boyer-Moore algorithm.
Boyer-Moore Algorithm and Heuristics
The Boyer-Moore algorithm employs two heuristics:
Bad Character Heuristic
Good Suffix Heuristic
In this course, we focus on implementing the partial Boyer-Moore algorithm, which utilizes only the Bad Character heuristic.
The addition of the Galil Rule heuristic to the full Boyer-Moore algorithm linearizes its worst-case runtime, and similarly, it also linearizes the worst-case in certain scenarios when added to the partial version of Boyer-Moore.
Overview of the Galil Rule
The Galil Rule modifies the Boyer-Moore algorithm after a complete match has been found.
Traditional Boyer-Moore shifts the pattern over by only one character after a complete match; the Galil Rule enhances this by allowing for a more intelligent shift of the pattern.
This improvement allows the Boyer-Moore algorithm to approach linear time complexity in specific cases.
Understanding Repeated Prefixes
Definition: The period of a string $s1$ is defined as the length of the shortest prefix of $s1$ such that if a new string $s2$ is formed by repeating this prefix, then $s1$ would be a prefix of $s_2$.
Key Insight: When a full match is found, the Galil Rule leverages the period of the pattern to minimize unnecessary comparisons.
Example 1: Identifying the Period
Pattern: "abababa"
The prefix that fulfills the period condition is identified as "ab", which can be repeatedly concatenated to create the infinite string: "abababababababa…".
Therefore, the period of the pattern is the length of "ab", which is $2$.
Example 2: Identifying the Period
Pattern: "abcabcabc"
The shortest prefix fulfilling the period condition here is "abc", allowing the construction of the repeated string: "abcabcabcabcabcabc".
The period length is therefore $3$.
Example 3: Identifying the Period
Pattern: "abcdab"
No prefix shorter than "abcd" can be used to form a suitable string where the given pattern is a prefix.
Thus, the period is the length of "abcd", which is $4$.
Calculating the Period
To compute the period of a pattern efficiently, we must perform it in $O(m)$ time, where $m$ is the length of the pattern.
The failure table from the KMP algorithm can be utilized for this computation.
Period Calculation Formula: $k = m - ext{ft}[m - 1]$, where $ ext{ft}$ denotes the failure table.
Example 1: Computing k
Pattern: "abababa"
Failure Table:
i
0
1
2
3
4
5
6
ft[i]
0
0
1
2
3
4
5
Calculate:
$k = m - ext{ft}[m - 1]$
Where $m = 7$ and $ ext{ft}[6] = 5$, thus:
$k = 7 - 5 = 2$
This suggests we can repetitively use the prefix "ab".
Example 2: Computing k
Pattern: "abcabcabc"
Failure Table:
i
0
1
2
3
4
5
6
7
8
ft[i]
0
0
0
1
2
3
4
5
6
Calculate:
$k = m - ext{ft}[m - 1] = 9 - 6 = 3$
Example 3: Computing k
Pattern: "abcdab"
Failure Table:
i
0
1
2
3
4
5
ft[i]
0
0
0
0
1
2
Calculate:
$k = m - ext{ft}[m - 1] = 6 - 2 = 4$
Applying the Galil Rule
The Galil Rule can be applied following a complete match only.
Upon a mismatch, the Galil Rule does not apply, necessitating a return to the last occurrence table for shifting the pattern.
Example Demonstration of the Galil Rule
Text: "abcdeabcdabcdeabcdabcdeabcd", where $n=27$
Pattern: "abcdeabcd", where $m=9$.
Failure Table:
i
0
0
0
0
0
1
2
3
4
ft[i]
0
0
0
0
0
1
2
3
4
Compute k:
$k = m - ext{ft}[m - 1] = 9 - 4 = 5$.
The subsequent comparisons and realignments can be counted for the process of applying the Galil rule.
No Matches Scenario
The Galil Rule enhances the Boyer-Moore algorithm efficiency when a pattern appears at least once.
In instances where there are no full matches, reliance on the Galil Rule is not possible, reverting to a worst-case time complexity of $O(mn)$.
Algorithm for Galil Rule Application
Step 1: Set $n$ as the length of the text and $m$ as the pattern length.
Step 2: Initialize matches as an empty list.
Step 3: Calculate $k = m - ext{failureTable}[m - 1]$.
Step 4: Initialize $i = 0, w = 0$.
Step 5: While $i ≤ n - m$:
Step 6: Set $j = m - 1$.
Step 7: While $j ≥ w$ and pattern[j] = text[i + j]:
Step 8: Decrement $j$ by 1.
Step 9: End While.
Step 10: If $j < w$, then:
Step 11: Add $i$ to matches.
Step 12: Set $w = m - k$.
Step 13: Increment $i$ by $k$.
Step 14: Else:
Step 15: Set $w = 0$.
Step 16: Shift using the last occurrence table.
Step 17: End Else.
Step 18: End While.