Lossless Compression Algorithms
Introduction
- Compression reduces bits needed to represent information.
- Encoder: performs compression, creating codes/codewords.
- Decoder: performs decompression.
- Lossless: No information loss during compression/decompression.
- Lossy: Information loss occurs.
- Compression Ratio: $B_0 / B_1$, where $B_0$ is the number of bits before compression and $B_1$ the number after.
- Desirable compression ratio: Much larger than 1.0.
- Entropy ($\eta$) of information source $S = \{s_1, s_2, \dots, s_n\}$:
- $\eta = H(S) = \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i} = -\sum_{i=1}^{n} p_i \log_2 p_i$
- $p_i$ = probability of symbol $s_i$ occurring in $S$.
- $\log_2 \frac{1}{p_i} = -\log_2 p_i$ = self-information, the number of bits needed to encode $s_i$.
- Entropy is a measure of disorder; higher entropy means more disorder.
- For a gray-level image with a uniform distribution, $p_i = \frac{1}{256}$ and the entropy is 8 bits/symbol (no compression possible).
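The entropy formula above can be checked numerically; a minimal sketch, using the uniform 256-level gray image as the example:

```python
from math import log2

def entropy(probs):
    """H(S) = -sum(p_i * log2(p_i)), in bits per symbol."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Uniform distribution over 256 gray levels: p_i = 1/256 for every symbol.
print(entropy([1 / 256] * 256))  # -> 8.0, so no compression is possible
```

A skewed distribution (e.g. one symbol with probability 0.9) gives an entropy well below 8, which is what leaves room for compression.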
Run-Length Coding (RLC)
- Simple compression for data containing long runs of repeated symbols.
- Codes symbol and its repetition length.
- Example:
WWWWWBBBB becomes 5W4B.
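A minimal sketch of the encoder for the example above (count-then-symbol output format assumed):

```python
from itertools import groupby

def rle_encode(s):
    """Encode each run of a repeated symbol as <count><symbol>."""
    return ''.join(f'{len(list(group))}{symbol}' for symbol, group in groupby(s))

print(rle_encode('WWWWWBBBB'))  # -> 5W4B
```

Note RLC only pays off when runs are long; on data without runs (e.g. 'ABAB') it expands the input.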
Variable-Length Coding (VLC)
Shannon-Fano Algorithm
- Sort symbols by frequency.
- Recursively divide symbols into two parts with approximately equal counts.
- Assign bit 0 to left branches, 1 to right branches in a binary tree.
- Shannon-Fano is top-down.
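The top-down recursion can be sketched as follows; this is a minimal illustration (the split heuristic — left part's count closest to half the total — is one common choice, and the example counts are the symbols of "HELLO"):

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, count). Returns {symbol: codeword}."""
    codes = {}

    def split(group, prefix):
        if len(group) == 1:
            codes[group[0][0]] = prefix or '0'
            return
        total = sum(c for _, c in group)
        best_diff, cut = float('inf'), 1
        # find the split point whose left-part count is closest to total/2
        for i in range(1, len(group)):
            left = sum(c for _, c in group[:i])
            diff = abs(total - 2 * left)
            if diff < best_diff:
                best_diff, cut = diff, i
        split(group[:cut], prefix + '0')   # bit 0 for the left branch
        split(group[cut:], prefix + '1')   # bit 1 for the right branch

    split(sorted(symbols, key=lambda x: -x[1]), '')
    return codes

print(shannon_fano([('L', 2), ('H', 1), ('E', 1), ('O', 1)]))
# L gets the shortest codeword since it is the most frequent symbol
```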
Huffman Coding
- Initialization: Sort symbols by frequency.
- Repeat until one symbol remains:
- Pick two symbols with lowest frequency.
- Create a Huffman subtree with these two as children of a new parent node.
- Assign the sum of the children's frequency counts to the parent and insert it into the list so that sorted order is maintained.
- Delete the children from the list.
- Assign codeword based on path from root.
- Huffman coding is bottom-up.
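The bottom-up steps above can be sketched with a min-heap standing in for the sorted list (a minimal illustration; the tiebreak counter only keeps heap comparisons well-defined and is not part of the algorithm):

```python
import heapq

def huffman(freqs):
    """freqs: {symbol: count}. Returns {symbol: codeword}."""
    # Heap entry: (frequency, tiebreak, tree); a tree is a symbol or (left, right).
    heap = [(f, i, s) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # two lowest-frequency nodes...
        f2, _, t2 = heapq.heappop(heap)
        # ...become children of a parent holding the sum of their counts
        heapq.heappush(heap, (f1 + f2, tiebreak, (t1, t2)))
        tiebreak += 1
    codes = {}

    def walk(tree, prefix):              # read codewords off root-to-leaf paths
        if isinstance(tree, tuple):
            walk(tree[0], prefix + '0')
            walk(tree[1], prefix + '1')
        else:
            codes[tree] = prefix or '0'

    walk(heap[0][2], '')
    return codes

print(huffman({'L': 2, 'H': 1, 'E': 1, 'O': 1}))
```

Whichever of the equally cheap trees the tiebreaking produces, the total encoded length is optimal for the given counts.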
Dictionary-Based Coding
Lempel-Ziv-Welch (LZW)
- Adaptive, dictionary-based technique.
- Places repeated entries in a dictionary.
- Emits the dictionary code for a string instead of the string itself whenever that string is already in the dictionary.
- Used in UNIX compress, GIF, WinZip.
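The LZW encoder can be sketched as below (a minimal illustration: the dictionary starts with all single bytes, the longest dictionary match is emitted as a code, and the match plus the next character becomes a new entry):

```python
def lzw_encode(text):
    """Return the list of LZW output codes for `text`."""
    dictionary = {chr(i): i for i in range(256)}  # initial single-character entries
    next_code = 256
    w, out = '', []
    for ch in text:
        if w + ch in dictionary:
            w += ch                          # extend the current match
        else:
            out.append(dictionary[w])        # emit code for the longest match
            dictionary[w + ch] = next_code   # add new entry: match + next char
            next_code += 1
            w = ch
    if w:
        out.append(dictionary[w])
    return out

print(lzw_encode('ABABABA'))  # -> [65, 66, 256, 258]
```

The codes 256 ('AB') and 258 ('ABA') show the adaptivity: repeated substrings are learned on the fly, with no dictionary transmitted alongside the data.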