COMP261 - Data Compression 2: Lempel-Ziv Coding

Introduction to Data/Text Compression

Data compression refers to the process of reducing the memory required to store information. This is crucial in improving storage efficiency and minimizing transfer times of data. One well-known method of data compression is Huffman coding, which focuses on minimizing the number of bits used for each symbol. However, to achieve even better compression results, it is advantageous to analyze sequences of symbols rather than individual symbols alone.
In essence, the fundamental principle behind compression is to convert original text, image, or sound into a compressed version, effectively reducing the size while maintaining the core information.

Historical Context: Shannon and Information Theory

Claude Shannon, a pivotal figure in the field of information theory, introduced foundational concepts that influence data compression techniques today. Notably, he described a machine called the "Ultimate Machine," which operates with a simple mechanism: flipping a switch opens a lid, revealing a mechanical hand that turns the switch off and retracts inside. This concept serves as a metaphor for the interplay between control and knowledge regarding past and future events. While knowledge of the past is accessible, it remains uncontrollable; conversely, one can control future outcomes but lacks certainty about them.

Predictable Structures Leading to Shorter Codes

The organization of information has implications for encoding efficiency. Shannon’s source coding theorem posits that the optimal code length for a symbol can be defined as:
$L = - ext{log}_2(P)$
Where $P$ is the probability of occurrence of the input symbol.
The average code length across a symbol set is referred to as entropy, denoted as $H$. In this context, if the probability distribution $P$ is uniform (i.e., each letter occurs with the same frequency), the entropy $H$ for the English alphabet is approximately equal to 4.75 bits. When analyzing real English text with its varied letter frequency, entropy slightly decreases to about 4.2 bits per character.
Moreover, understanding how humans process information further highlights efficiency opportunities. For example, when vowels are omitted from a sentence, people can often still guess the intended meaning due to the redundancy of information. This phenomenon is supported by research from the University of Cambridge, which asserts that the order of letters within words is not critical, provided the first and last letters are correctly placed. Human cognition tends to recognize words as whole structures rather than decoding them letter by letter.

Exploring Digrams, Bigrams, and Trigrams

To further improve coding efficiency, one can analyze digrams (two-letter sequences), trigrams (three-letter sequences), and n-grams in general. From the study of English language patterns, the most frequently occurring digrams and trigrams include:

Digrams:

Trigrams:

Using trigrams instead of single characters reduces the entropy to about 2.6 bits per character. As one increases the scale (n) of n-grams, the entropy can decrease between 0.6 to 1.3 bits per character, presenting an intriguing question: can we create a coding scheme that achieves coding efficiency close to this lower bound?

Run Length Encoding (RLE)

Another data compression method is Run Length Encoding (RLE), particularly effective when data contains many repeating symbols. This technique stores data as pairs of (count, symbol) rather than individual symbols.

Example 1:

Original string: aaabbaaaaaaapaa
Encoded format: 3a2b6a1p2a, requiring a few bytes for counting and representation.

Example 2:

In a binary representation of an image:

Original data: 111111110000001111111111111
Encoded format: 010001 001100 011011 used with a count of 5 bits for counts and 1 bit for representation of repeated data.
However, RLE struggles with data lacking repetitive patterns, such as 01011101110010101010101010101 or alternating sequences like abcabcabbcabcabaabcabcccababcabc…, where multiple character variations impede effective compression.

Lempel-Ziv Coding Basics

Lempel-Ziv coding provides a lossless data compression methodology. Among its variants, LZ77 stands out for its simplicity and effectiveness in utilizing repeated patterns—forming the foundation for many sophisticated compression algorithms. The core principle of Lempel-Ziv is replacing repeated sequences with references to earlier instances. For illustration:

A contrived text: a contrived text containing riveting contrast
The encoded form might appear as: a contrived text[15,5] ain[2,2] g [22,4] t[9,4][35,5] ast, where [offset, length] denotes the position and length of the previously encountered sequence. It is essential to note that sequences of length 1 could be encoded further along, as they are generally included in later processes.

Distinguishing Pointers from Ordinary Characters

To differentiate between pointers and standard characters in the compressed output, Lempel-Ziv utilizes a triple format to store sequences:

Structure: $[offset, length, symbol]$
For cases with no repetition, the encoded format defaults to: $[0, 0, ext{symbol}]$
Moreover, practical limits are implemented on the size of the offset and length through
Restricting the size of the reference window to the left of the current position during match searches, and
Limiting the advancement distance ahead in the input for identifying a match.

Lempel-Ziv Coding Example

For the example of the text a contrived text containing riveting contrasting…, the encoding is constructed as follows:

Each character in the text is examined one at a time, producing values like:
$[0,0,a][0,0, ][0,0,c][0,0,o][0,0,n][0,0,t][0,0,r][0,0,i][0,0,v][0,0,e][0,0,d]$
Continuing with reference matches might yield triples such as:
$[10,1,t]$ , $[4,1,x]$ , etc.
This results in a total of 69 bytes required to store 48 characters based on the encoding adopted, supposing each offset, length, and character is encoded with a single byte. The efficiency of this encoding increases with longer text segments due to the repeated patterns likely emerging.

Lempel-Ziv 77 Encoding Process

The LZ77 algorithm generates outputs as tuples of the format $[offset, length, nextCharacter]$ or $[0,0, ext{character}]$ . The process entails moving the cursor through the text sequentially, pointing to the next character to be encoded, while a sliding window operates behind the cursor. The algorithm employs a lookahead buffer that seeks to match as long a sequence as possible within that window before ceasing expansion on failure to find matches. Each match produces a tuple which encodes the offset, length, and the next character:

Pseudocode for LZ77 Encoding

```plaintext
cursor ← 0;
windowSize ← 100; // a suitable size
while cursor < text.length-1:
look for longest prefix of text[cursor .. text.length-1]
in text[max(cursor-windowSize, 0) .. cursor]
if found, add [offset,length,text[cursor+length]] to output
else add [0, 0, text[cursor]] to output
advance cursor by length + 1

In the search for the longest match, multiple strategies can be employed, such as brute force methods or advanced algorithms (e.g., Knuth-Morris-Pratt, Boyer-Moore, etc.) to streamline the matching process.

---

## Lempel-Ziv 77 - Coding Attempt Example  
An initial attempt to encode data can be structured as follows:

plaintext
cursor ← 0;
windowSize ← 100;
while cursor < text.size:
length ← 1;
prevMatch ← 0;
loop
match ← stringMatch(text[cursor.. cursor+length], text[((cursor<windowSize)?0:cursor-windowSize) .. cursor])
if match succeeded then:
prevMatch ← match
length ← length + 1
else:
output( [a value for prevMatch, length - 1, text[cursor+length - 1]]);
cursor ← cursor + length;
break

In this method, the algorithm searches for an occurrence of the substring in the preceding text. Care is taken to avoid unnecessary searches beyond points where no matches exist.

---

## Decompression Process  
The decompression of encoded data relies on the structured tuples mentioned. An example for decoding the earlier mentioned data is as follows:

plaintext
cursor ← 0;
for each tuple
if [0, 0, ch]:
output[cursor++] ← ch;
elif [offset, length, ch]:
for j = 0 to length-1:
output[cursor++] ← output[cursor-offset];
output[cursor++] ← ch;
```
Here, each tuple is decoded sequentially, and the output is constructed accordingly.

Lempel-Ziv Insights

It is essential to highlight that encoding is computationally expensive, while decoding remains relatively inexpensive. Consequently, various improvements and adaptations have been explored beyond the foundational Lempel-Ziv approach. Notable examples include utilizing dual output formats, comprising pairs for offsets and lengths for repeated sequences alongside individual characters for non-repetitive data.
Compatibility with other schemes, such as Huffman coding, can also enhance overall performance.

Need for String-Searching Algorithms

To sufficiently implement Lempel-Ziv coding, an effective string-searching algorithm is vital. Key strategies such as the Knuth-Morris-Pratt algorithm, along with numerous others like Boyer-Moore, serve as excellent resources for efficient searching.
Interactive visualizations and demos online can provide additional guidance and practical understanding of these algorithms.