Compression: Information Theory and Lossless Techniques

Introduction

  • TM355 focuses on efficient bit communication.

  • Block 1 covered physical channels, signal transmission, and modulation techniques.

  • Part 1 of Block 2 addressed channel coding for error handling and channel efficiency.

  • This section moves up a layer to consider the source of bits.

  • Block 1: Bottom layers (physical aspects).

  • Block 2, Part 1: Middle layer (channel coding).

  • Parts 2 and 3 of Block 2: Source coding, ignoring channel coding and modulation layers.

  • Source coding assumes safe data delivery and focuses on source-related issues.

Aspects of Source Coding

  • Digitizing: Turning the source into a digital signal.

  • Compression: Reducing the size of the digital signal.

  • Digitization often includes compression aspects for minimizing signal size.

Overview of Part 2

  • Focus on discrete source symbols (e.g., letters of the alphabet).

  • Discrete sources form the basis of all digital communications.

  • The theory of communication for a discrete source forms the foundation of communications theory as a whole.

  • Information theory quantifies the concept of ‘information’.

  • Sections 4 and 5 cover compression methods.

Compression

  • Reduces file sizes for easier storage, manipulation, and transmission.

  • Retains original source content by removing redundancy.

  • Example: ZIP files are compressed to reduce download time.

  • Unzipping restores the exact original file.

Compression Techniques

  • Lossless Compression: Exact reconstruction of original data.

    • Used for text files and computer programs.

    • Relies on exploiting dependencies or properties in the source data.

    • Examples: Run-length encoding, Lempel–Ziv–Welch (LZW) coding, and Huffman coding.

    • File formats: ZIP, TIFF, GIF, PNG, PDF.

    • Compression ratio varies based on source redundancy.

  • If the source contains no redundancy, compression can even result in a larger file.

  • Lossy Compression: Irreversible loss of information with an approximate reconstruction.

    • Used for multimedia (audio, image, video).

    • Allows some level of imperceptible distortion.

    • Is not necessarily inferior; can reduce random noise.

    • Achieves higher bit-rate reduction (order of magnitude or more).

    • Examples: MP3, AAC, HE-AAC, JPEG, JPEG2000, MPEG.

    • Some standards use both lossy and lossless compression (e.g., JPEG uses run-length encoding and Huffman coding).

Information Theory

  • Claude Shannon's work at Bell Telephone Labs in the 1940s is the foundation.

  • Shannon's paper ‘A mathematical theory of communication’ created the field and solved its fundamental problems.

  • Shannon's model is an abstraction of reality with a constrained scenario for analysis.

Model Assumptions

  • Data comes from an information source generating a sequence of messages.

  • Each message consists of a source symbol from a predetermined set (source alphabet).

Symbols
  • Source symbols: Letters, musical notes.

  • Channel symbols: Voltage pulses, sinusoidal waveforms.

Examples
  • Text: Letters, punctuation, spaces.

  • Sports results: Win, lose, draw.

Discrete Source Model

  • Encoding of audio and video signals turns them into discrete sources.

  • Model is very general and suitable for complex sources.

Fixed-Length Codes

  • Represent each message by a unique sequence of bits (code word).

  • Simplest codes have all code words with the same number of bits.

Activity example (2.1)

  • Binary representation for 26 letters + space (27 symbols).

  • Requires 5 bits, giving 32 sequences (2^5 = 32), of which 27 are used.

Representing Digits

  • Representing digits 0-9 requires more bits.

  • 37 symbols require six bits, doubling the number of code words to 2^6 = 64 — adding 32 when only 5 more are needed, which seems wasteful.

Standardized Representations

  • ASCII (American Standard Code for Information Interchange) is a widely used standard.

  • UTF-8 (Universal Coded Character Set + Transformation Format – 8-bit) is now used, being backwards compatible with ASCII.

ASCII Coding

  • Uses a rectangular grid to display symbols compactly.

  • Contains printable characters (letters, numbers, punctuation) and control characters.

  • Code word is column number in binary (b7 b6 b5) + row number in binary (b4 b3 b2 b1).

Efficiency
  • In ASCII coding each code word is seven bits long, yielding 2^7 = 128 combinations.

  • There are 8 × 16 = 128 symbols represented, so every possible combination of the seven bits is used.
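To illustrate (a Python sketch, not part of the course text): the seven-bit ASCII code word for a character is its column bits followed by its row bits, which together are simply the character's code value in binary.

```python
# ASCII 'A' has code value 65; its seven-bit code word is the column
# number (b7 b6 b5 = 100) followed by the row number (b4 b3 b2 b1 = 0001).
code_word = format(ord("A"), "07b")
print(code_word)  # 1000001
print(code_word[:3], code_word[3:])  # 100 0001
```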

Binary-Coded Decimal

  • Four bits are needed to code the digits 0–9 (2^4 = 16 possible code words).

  • The numerical value of each decimal digit is converted to binary.

  • Not very efficient, as only 10 of 16 possible sequences are used.
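A minimal binary-coded decimal sketch (Python assumed): each decimal digit is converted independently to its own four-bit group.

```python
def bcd(digits):
    """Binary-coded decimal: each decimal digit becomes its own 4-bit group."""
    return " ".join(format(int(d), "04b") for d in digits)

print(bcd("2023"))  # 0010 0000 0010 0011
```

Only the ten patterns 0000–1001 ever appear; the remaining six of the sixteen possible four-bit sequences go unused, which is where the inefficiency lies.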

Source Extension

  • Grouping source symbols to improve efficiency.

  • Coding digits in pairs (the second extension) requires 7 bits for 100 symbols (2^7 = 128).

  • Second extension uses 3.5 bits per digit, which is more efficient than binary-coded decimal (4 bits per digit).

  • Third extension (grouping in threes) uses 10 bits for 1000 symbols (2^10 = 1024), needing 10/3 ≈ 3.33 bits per digit.

Limit to Improvement
  • The lowest achievable number of bits per symbol using fixed-length codes and source extensions for n different symbols is log₂(n).
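The figures above all follow from one formula: coding groups of k symbols from an n-symbol alphabet needs ⌈log₂(nᵏ)⌉ bits per group. A small Python sketch (illustrative only):

```python
import math

def bits_per_symbol(n_symbols, group_size=1):
    """Bits per source symbol when a fixed-length code is applied to
    groups of symbols: ceil(log2(n^group)) / group."""
    sequences = n_symbols ** group_size       # distinct groups to encode
    return math.ceil(math.log2(sequences)) / group_size

# Decimal digits (n = 10), as in the text:
print(bits_per_symbol(10, 1))                 # 4.0  (binary-coded decimal)
print(bits_per_symbol(10, 2))                 # 3.5  (second extension)
print(round(bits_per_symbol(10, 3), 2))       # 3.33 (third extension)
print(round(math.log2(10), 2))                # 3.32, the fixed-length limit
```

Larger extensions creep ever closer to log₂(10) but never go below it.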

Variable-Length Codes

  • Use different numbers of bits for different source symbols.

  • Can decrease the average number of bits needed.

Sports Results Example

  • Three symbols (win, lose, draw) require log₂(3) ≈ 1.58 bits.

  • Fixed-length code needs 2 bits.

Coding

  Result    Code word
  Win       11
  Lose      10
  Draw      01

  • Variable-length code can use fewer bits if probabilities are not equal.

Alternative Variable-Length Coding

  Result    Code word
  Win       11
  Lose      1
  Draw      0

  • Code is ambiguous and not uniquely decodable.

Uniquely Decodable and Instantaneously Decodable Codes

  • Uniquely decodable: Sequence has only one possible interpretation.

  • Instantaneously decodable: Can be decoded immediately upon receipt.

  • Practical interest lies in codes that are both uniquely and instantaneously decodable.

Prefix

For a code to be instantaneously decodable as well as uniquely decodable, no valid code word must form the first part (called the prefix) of another, longer code word.
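The prefix condition can be checked mechanically. A small sketch (Python assumed, not part of the course text) relies on the fact that, after sorting, any prefix sorts immediately before the words that extend it:

```python
def is_prefix_free(code_words):
    """True if no code word is a prefix of another, i.e. the code is
    instantaneously decodable."""
    words = sorted(code_words)  # a prefix sorts directly before its extensions
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_free(["11", "10", "01"]))  # True  (fixed-length sports code)
print(is_prefix_free(["11", "1", "0"]))    # False ('1' is a prefix of '11')
```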

Code Trees

  • Instantaneous codes can be generated and decoded using a code tree or decision tree.

  • Each node below the initial state corresponds to a possible code word.

  • Instantaneous code words correspond to the ends of branches in the code tree.

Generation

  Visibility    Code word
  Very poor     00
  Poor          01
  Moderate      10
  Good          11

Taking Account of Source Symbol Probabilities

  • Variable-length codes can reduce average code length.

  • Coding efficiency is affected by the probability of different messages.

Weather Station Example

  • Located in a sunny area, with good visibility most of the time.

  Visibility    Code word
  Very poor     111
  Poor          110
  Moderate      10
  Good          0

Efficient Codes

  • Code word used most often (good visibility) is only one bit long.

Average Length

Average length of the code is calculated by multiplying the code lengths by the probabilities and taking the sum.

Calculation

  Visibility   Code word   Code word length, l   Visibility probability, p   l × p
  Very poor    111         3                     0.1                         0.3
  Poor         110         3                     0.1                         0.3
  Moderate     10          2                     0.3                         0.6
  Good         0           1                     0.5                         0.5
  Sum                                            1                           1.7

Average code length = 1.7 bits.

Summation

If we know the probabilities, pᵢ, with which each of the source symbols will be chosen, the average length of the code is given by the formula l̄ = Σᵢ₌₁ⁿ lᵢpᵢ, where lᵢ is the length of the code word for the ith source symbol and pᵢ is the probability of the ith source symbol.
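For the weather-station code above, the sum can be checked directly (a Python sketch):

```python
# Code word lengths and probabilities from the weather-station example
lengths = {"very poor": 3, "poor": 3, "moderate": 2, "good": 1}
probs   = {"very poor": 0.1, "poor": 0.1, "moderate": 0.3, "good": 0.5}

# Average length: sum of (length x probability) over all symbols
avg_length = sum(lengths[v] * probs[v] for v in lengths)
print(round(avg_length, 2))  # 1.7 bits per symbol
```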

Source Entropy and Information

  • Code extensions can make coding more efficient (fewer bits per symbol).

  • Fewest bits per symbol with fixed-length codes: log₂(n).

Variable-Length Codes

  • Number of bits can be fewer than log₂(n).

  • Most efficient coding approaches the source entropy, H, given by H = −Σᵢ₌₁ⁿ pᵢ log₂(pᵢ), where n is the number of symbols and pᵢ is the probability of the ith symbol.

Entropy
  • Entropy is a measure of the minimum number of bits per symbol theoretically achievable.

  • Coding scheme efficiency, E, can be quantified by E = H / l̄, where l̄ is the average code length and H is the source entropy.
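Both quantities are straightforward to compute. For the weather-station probabilities used earlier (a Python sketch):

```python
import math

def entropy(probs):
    """Source entropy H = -sum(p * log2(p)), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

visibility_probs = [0.1, 0.1, 0.3, 0.5]   # weather-station example
H = entropy(visibility_probs)
avg_len = 3*0.1 + 3*0.1 + 2*0.3 + 1*0.5   # average length of that code
print(round(H, 2))                         # 1.69 bits per symbol
print(round(H / avg_len, 2))               # 0.99 - the code is close to optimal
```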

Huffman Coding

  • Algorithm discovered by David Huffman produces instantaneous code based on symbol probabilities.

Construction

  • Source symbols with largest probabilities allocated to shortest code words.
  • [Figure: Huffman code construction by successive source reductions. Symbols A–E, with probabilities 0.6, 0.15, 0.13, 0.1 and 0.02, are repeatedly combined (the two smallest probabilities first) until a single probability of 1.0 remains; reading back through the reductions gives the code words A: 0, B: 11, C: 100, D: 1010, E: 1011. The tree is conventionally shown in this orientation.]
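The reduction procedure can be sketched with a priority queue (Python's heapq assumed; the exact bit patterns depend on tie-breaking, but the code word lengths match those in the text, A: 1 bit through E: 4 bits):

```python
import heapq
from itertools import count

def huffman(probs):
    """Huffman construction sketch: repeatedly merge the two least
    probable nodes, prefixing '0'/'1' to the code words beneath them."""
    tie = count()  # tie-breaker so equal probabilities never compare lists
    heap = [(p, next(tie), [(sym, "")]) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # smallest probability
        p1, _, c1 = heapq.heappop(heap)   # next smallest
        merged = ([(s, "0" + w) for s, w in c0] +
                  [(s, "1" + w) for s, w in c1])
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return dict(heap[0][2])

code = huffman({"A": 0.6, "B": 0.15, "C": 0.13, "D": 0.1, "E": 0.02})
print(sorted((s, len(w)) for s, w in code.items()))
# [('A', 1), ('B', 2), ('C', 3), ('D', 4), ('E', 4)]
```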

Efficiency

  Symbol   Code word   Code word length, l   Probability, p   l × p
  A        0           1                     0.6              0.6
  B        11          2                     0.15             0.3
  C        100         3                     0.13             0.39
  D        1010        4                     0.1              0.4
  E        1011        4                     0.02             0.08
  Sum                                        1                1.77

H = 1.68 bits.

Quantify

E = 1.68 / 1.77

which is 95%.

Fixed Length

A fixed-length code requires three bits per symbol, because with five symbols 2^2 = 4 is too few and 2^3 = 8 suffices. This gives

E = 1.68 / 3

which is 56% (both efficiencies to two significant figures).

Compressing Files with Huffman Codes

  • Used for compressing computer files.

  • Variable-length coding reduces bits required if characters do not occur with the same frequency.

Uncompress

  • In order to uncompress the file later, the Huffman code that was used to create it needs to be known.

  • Requires inclusion of a coding table alongside the compressed file – an extra overhead.

Coding Table Overhead

Counting the bits in the compressed file together with its coding table, the file contains a total of 42 bits.

Limitations of Huffman Coding

  • Assumes symbols are distributed randomly.

  • Not suitable for data with long runs of identical symbols or relationships between symbols.

  • Implementation considerations can also matter.

  • Requires analysis of the whole file to construct the coding table, introducing time delay.

  • Based on coding discrete symbols (letters, notes), not arbitrary binary files.

RLE and LZW Coding

  • RLE (Run-Length Encoding) for repeated symbols.

  • LZW (Lempel–Ziv–Welch) coding for more complicated repeating patterns.

  • Overcome some limitations of Huffman coding.

Example: C4 D6 A1 E7 A3 D2.

E1 B8 A4 G7 C2 B3 F1 G1

Strengths and Weaknesses of RLE Coding

RLE is a simple technique that can compress data by replacing each run of repeated symbols with a number giving the number of repeats.

  • Strengths: Not necessary to scan the whole file before it begins.

  • Overhead: The single number associated with each run.

  • Efficiency depends on the amount of repetition within the file.

  • Limitation: Requires a lot of repetition.
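A minimal sketch of the scheme (assuming single-character symbols), reproducing the C4 D6 A1 E7 A3 D2 example above:

```python
def rle_encode(data):
    """Replace each run of identical symbols with a (symbol, count) pair."""
    runs = []
    for ch in data:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([ch, 1])      # start a new run
    return [(s, n) for s, n in runs]

def rle_decode(pairs):
    """Expand (symbol, count) pairs back into the original sequence."""
    return "".join(s * n for s, n in pairs)

pairs = rle_encode("CCCCDDDDDDAEEEEEEEAAADD")
print(pairs)  # [('C', 4), ('D', 6), ('A', 1), ('E', 7), ('A', 3), ('D', 2)]
assert rle_decode(pairs) == "CCCCDDDDDDAEEEEEEEAAADD"
```

Note the isolated 'A' still costs a (symbol, count) pair — the overhead that makes RLE a poor fit for data with little repetition.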

Applications That Have Repetition

RLE suits applications with plenty of repetition. One of these is the JPEG standard, which interleaves values with run lengths:

value, run length, value, run length, value, run length, value …

For data with little repetition, however, far from compressing the file, this encoding would more than double the file size.

LZW Coding

  • Exploits repeated patterns in data.

  • Assigns short codes to patterns.

  • Used in ZIP compression.

The LZW Coding Table


  • Creates a coding table (or ‘dictionary’) with code words representing sequences.


  • Structured into predetermined part (single characters) and created part (longer sequences).


  • Predetermined part (0-255) represents single bytes.


  • Created part (256-4095) is for repeating patterns.


  • LZW coding table at the start, without the binary codes:

    Code    Data
    0       NUL
    …
    65      A
    66      B
    …
    255     (not used in ASCII)
    256     unassigned
    …
    4095    unassigned

    The main principles of LZW coding:

    1. The coding table is constructed as the data in the file is read.
    2. Although the coding table is unique to each file and is needed at both the encoder and the decoder, the table itself is not added to the file.

    Code    Data
    0       NUL
    …
    65      A
    66      B
    …
    255     (not used in ASCII)
    256     BC
    257     CB
    258     BCB
    …
    263     CBCB
    …
    4095    unassigned
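A minimal LZW encoder sketch (Python assumed; single-byte symbols, ignoring the 4095-entry limit) shows how the table grows as the data is read — entries BC, CB and BCB appear at codes 256–258, as in the table above:

```python
def lzw_encode(data):
    """LZW encoding sketch: the table starts with all single characters
    (codes 0-255) and grows as repeating patterns are read. The decoder
    rebuilds the same table, so it is never stored with the file."""
    table = {chr(i): i for i in range(256)}
    next_code = 256                          # first code for a created entry
    current, out = "", []
    for ch in data:
        if current + ch in table:            # extend the current pattern
            current += ch
        else:
            out.append(table[current])       # emit code for longest known pattern
            table[current + ch] = next_code  # record the new, longer pattern
            next_code += 1
            current = ch
    if current:
        out.append(table[current])
    return out

print(lzw_encode("BCBCBCBC"))  # [66, 67, 256, 258, 67]
```

Each emitted code stands for the longest pattern already in the table, so repetition is rewarded with ever-longer dictionary entries.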

Summary

  • Information theory provides insights into source coding.

  • Source coding generates digital representations of a source, including compression.

  • The theoretical minimum number of bits per message is the source entropy.

  • Huffman coding is a method of deriving the best coding table for any given set of symbol probabilities.

  • If a data file is to be compressed, Huffman coding requires that the whole file is first analysed in order to determine the best coding table for that particular file.

  • Run-length encoding (RLE) replaces runs of repeated symbols with the number of repeats.

  • LZW coding compresses repeated patterns of symbols.