Lossless Compression Algorithms

Multimedia Data Compression

Introduction

  • Compression reduces bits needed to represent information.
  • Encoder: performs compression, creating codes/codewords.
  • Decoder: performs decompression.
  • Lossless: No information loss during compression/decompression.
  • Lossy: Information loss occurs.
  • Compression Ratio: B<em>0/B</em>1B<em>0 / B</em>1 where B<em>0B<em>0 is bits before, B</em>1B</em>1 after compression.
  • Desirable compression ratio: Much larger than 1.0.

Basics of Information Theory

  • Entropy (η)(\eta) of information source S = {$s1, s2, …, s_n$}:
    • η=H(S)=<em>i=1np</em>ilog<em>21p</em>i=<em>i=1np</em>ilog<em>2p</em>iη = H(S) = \sum<em>{i=1}^{n} p</em>i log<em>2{\frac{1}{p</em>i}} = -\sum<em>{i=1}^{n} p</em>i log<em>2{p</em>i}
    • p<em>ip<em>i = probability of symbol s</em>is</em>i occurring in S.
    • log<em>21p</em>i-log<em>2{\frac{1}{p</em>i}} = self-information, bits needed to encode sis_i.
  • Entropy is a measure of disorder; higher entropy means more disorder.
  • For a gray-level image with uniform distribution, pi=1256p_i = \frac{1}{256}, entropy is 8 (no compression possible).

Run-Length Coding (RLC)

  • Simple compression for data with continuous symbol groups.
  • Codes symbol and its repetition length.
  • Example: WWWWWBBBB becomes 5W4B.

Variable-Length Coding (VLC)

  • Entropy coding method.
Shannon-Fano Algorithm
  1. Sort symbols by frequency.
  2. Recursively divide symbols into two parts with approximately equal counts.
  3. Assign bit 0 to left branches, 1 to right branches in a binary tree.
  • Shannon-Fano is top-down.
Huffman Coding
  1. Initialization: Sort symbols by frequency.
  2. Repeat until one symbol remains:
    • Pick two symbols with lowest frequency.
    • Create a Huffman subtree with these as children, create a parent node with the sum of their frequencies.
    • Assign the sum of the children’s frequency counts to the parent and insert it in to the list, such that the order is maintained.
    • Delete the children from the list.
  3. Assign codeword based on path from root.
  • Huffman coding is bottom-up.

Dictionary-Based Coding

Lempel-Ziv-Welch (LZW)
  • Adaptive, dictionary-based technique.
  • Places repeated entries in a dictionary.
  • Emits code for element instead of the string itself, if the element is in the dictionary.
  • Used in UNIX compress, GIF, WinZip.