Compression: Information Theory and Lossless Techniques

Introduction

  • TM355 focuses on efficient bit communication.

  • Block 1 covered physical channels, signal transmission, and modulation techniques.

  • Part 1 of Block 2 addressed channel coding for error handling and channel efficiency.

  • This section moves up a layer to consider the source of bits.

  • Block 1: Bottom layers (physical aspects).

  • Block 2, Part 1: Middle layer (channel coding).

  • Parts 2 and 3 of Block 2: Source coding, ignoring channel coding and modulation layers.

  • Source coding assumes safe data delivery and focuses on source-related issues.

Aspects of Source Coding

  • Digitizing: Turning the source into a digital signal.

  • Compression: Reducing the size of the digital signal.

  • Digitization often includes compression aspects for minimizing signal size.

Overview of Part 2

  • Focus on discrete source symbols (e.g., letters of the alphabet).

  • Discrete sources form the basis of all digital communications.

  • The theory of communication for a discrete source forms the foundation of communications theory as a whole.

  • Information theory quantifies the concept of ‘information’.

  • Sections 4 and 5 cover compression methods.

Compression

  • Reduces file sizes for easier storage, manipulation, and transmission.

  • Retains original source content by removing redundancy.

  • Example: ZIP files are compressed to reduce download time.

  • Unzipping restores the exact original file.

Compression Techniques

  • Lossless Compression: Exact reconstruction of original data.

    • Used for text files and computer programs.

    • Relies on exploiting dependencies or properties in the source data.

    • Examples: Run-length encoding, Lempel–Ziv–Welch (LZW) coding, and Huffman coding.

    • File formats: ZIP, TIFF, GIF, PNG, PDF.

    • Compression ratio varies based on source redundancy.

  • If the source contains no redundancy, compression can even result in a larger file.

  • Lossy Compression: Irreversible loss of information with an approximate reconstruction.

    • Used for multimedia (audio, image, video).

    • Allows some level of imperceptible distortion.

    • Is not necessarily inferior; can reduce random noise.

    • Achieves higher bit-rate reduction (order of magnitude or more).

    • Examples: MP3, AAC, HE-AAC, JPEG, JPEG2000, MPEG.

    • Some standards use both lossy and lossless compression (e.g., JPEG uses run-length encoding and Huffman coding).

Information Theory

  • Claude Shannon's work at Bell Telephone Labs in the 1940s is the foundation.

  • Shannon's paper ‘A mathematical theory of communication’ created the field and solved its fundamental problems.

  • Shannon's model is an abstraction of reality with a constrained scenario for analysis.

Model Assumptions

  • Data comes from an information source generating a sequence of messages.

  • Each message consists of a source symbol from a predetermined set (source alphabet).

Symbols
  • Source symbols: Letters, musical notes.

  • Channel symbols: Voltage pulses, sinusoidal waveforms.

Examples
  • Text: Letters, punctuation, spaces.

  • Sports results: Win, lose, draw.

Discrete Source Model

  • Encoding of audio and video signals turns them into discrete sources.

  • Model is very general and suitable for complex sources.

Fixed-Length Codes

  • Represent each message by a unique sequence of bits (code word).

  • Simplest codes have all code words with the same number of bits.

Activity example (2.1)

  • Binary representation for 26 letters + space (27 symbols).

  • Requires 5 bits, giving 32 sequences (2^5 = 32), of which 27 are used.

Representing Digits

  • Representing digits 0-9 requires more bits.

  • 37 symbols require six bits, doubling the number of code words to 2^6 = 64 — adding 32 when only 5 more are needed, which seems wasteful.

Standardized Representations

  • ASCII (American Standard Code for Information Interchange) is a widely used standard.

  • UTF-8 (Universal Coded Character Set + Transformation Format – 8-bit) is now used, being backwards compatible with ASCII.

ASCII Coding

  • Uses a rectangular grid to display symbols compactly.

  • Contains printable characters (letters, numbers, punctuation) and control characters.

  • Code word is column number in binary (b7 b6 b5) + row number in binary (b4 b3 b2 b1).

Efficiency
  • In ASCII coding each code word is seven bits long, yielding 2^7 = 128 combinations.

  • There are 8 × 16 = 128 symbols represented, so every possible combination of the seven bits is used.
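To illustrate (a Python sketch, not part of the course text): the seven-bit ASCII code word for a character is its column bits followed by its row bits, which together are simply the character's code value in binary.

```python
# ASCII 'A' has code value 65; its seven-bit code word is the column
# number (b7 b6 b5 = 100) followed by the row number (b4 b3 b2 b1 = 0001).
code_word = format(ord("A"), "07b")
print(code_word)  # 1000001
print(code_word[:3], code_word[3:])  # 100 0001
```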

Binary-Coded Decimal

  • Four bits are needed to code the digits 0–9 (2^4 = 16 possible code words).

  • The numerical value of each decimal digit is converted to binary.

  • Not very efficient, as only 10 of 16 possible sequences are used.
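A minimal binary-coded decimal sketch (Python assumed): each decimal digit is converted independently to its own four-bit group.

```python
def bcd(digits):
    """Binary-coded decimal: each decimal digit becomes its own 4-bit group."""
    return " ".join(format(int(d), "04b") for d in digits)

print(bcd("2023"))  # 0010 0000 0010 0011
```

Only the ten patterns 0000–1001 ever appear; the remaining six of the sixteen possible four-bit sequences go unused, which is where the inefficiency lies.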

Source Extension

  • Grouping source symbols to improve efficiency.

  • Coding digits in pairs (the second extension) requires 7 bits for 100 symbols (2^7 = 128).

  • Second extension uses 3.5 bits per digit, which is more efficient than binary-coded decimal (4 bits per digit).

  • Third extension (grouping in threes) uses 10 bits for 1000 symbols (2^10 = 1024), needing 10/3 ≈ 3.33 bits per digit.

Limit to Improvement
  • The lowest achievable number of bits per symbol using fixed-length codes and source extensions for n different symbols is log₂(n).
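The figures above all follow from one formula: coding groups of k symbols from an n-symbol alphabet needs ⌈log₂(nᵏ)⌉ bits per group. A small Python sketch (illustrative only):

```python
import math

def bits_per_symbol(n_symbols, group_size=1):
    """Bits per source symbol when a fixed-length code is applied to
    groups of symbols: ceil(log2(n^group)) / group."""
    sequences = n_symbols ** group_size       # distinct groups to encode
    return math.ceil(math.log2(sequences)) / group_size

# Decimal digits (n = 10), as in the text:
print(bits_per_symbol(10, 1))                 # 4.0  (binary-coded decimal)
print(bits_per_symbol(10, 2))                 # 3.5  (second extension)
print(round(bits_per_symbol(10, 3), 2))       # 3.33 (third extension)
print(round(math.log2(10), 2))                # 3.32, the fixed-length limit
```

Larger extensions creep ever closer to log₂(10) but never go below it.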

Variable-Length Codes

  • Use different numbers of bits for different source symbols.

  • Can decrease the average number of bits needed.

Sports Results Example

  • Three symbols (win, lose, draw) require log₂(3) ≈ 1.58 bits.

  • Fixed-length code needs 2 bits.

Coding

  Result    Code word
  Win       11
  Lose      10
  Draw      01

  • Variable-length code can use fewer bits if probabilities are not equal.

Alternative Variable-Length Coding

  Result    Code word
  Win       11
  Lose      1
  Draw      0

  • Code is ambiguous and not uniquely decodable.

Uniquely Decodable and Instantaneously Decodable Codes

  • Uniquely decodable: Sequence has only one possible interpretation.

  • Instantaneously decodable: Can be decoded immediately upon receipt.

  • Practical interest lies in codes that are both uniquely and instantaneously decodable.

Prefix

For a code to be instantaneously decodable as well as uniquely decodable, no valid code word must form the first part (called the prefix) of another, longer code word.
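The prefix condition can be checked mechanically. A small sketch (Python assumed, not part of the course text) relies on the fact that, after sorting, any prefix sorts immediately before the words that extend it:

```python
def is_prefix_free(code_words):
    """True if no code word is a prefix of another, i.e. the code is
    instantaneously decodable."""
    words = sorted(code_words)  # a prefix sorts directly before its extensions
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_free(["11", "10", "01"]))  # True  (fixed-length sports code)
print(is_prefix_free(["11", "1", "0"]))    # False ('1' is a prefix of '11')
```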

Code Trees

  • Instantaneous codes can be generated and decoded using a code tree or decision tree.

  • Each node below the initial state corresponds to a possible code word.

  • Instantaneous code words correspond to the ends of branches in the code tree.

Generation

  Visibility    Code word
  Very poor     00
  Poor          01
  Moderate      10
  Good          11

Taking Account of Source Symbol Probabilities

  • Variable-length codes can reduce average code length.

  • Coding efficiency is affected by the probability of different messages.

Weather Station Example

  • Located in a sunny area, with good visibility most of the time.

  Visibility    Code word
  Very poor     111
  Poor          110
  Moderate      10
  Good          0

Efficient Codes

  • Code word used most often (good visibility) is only one bit long.

Average Length

Average length of the code is calculated by multiplying the code lengths by the probabilities and taking the sum.

Calculation

  Visibility   Code word   Code word length, l   Visibility probability, p   l × p
  Very poor    111         3                     0.1                         0.3
  Poor         110         3                     0.1                         0.3
  Moderate     10          2                     0.3                         0.6
  Good         0           1                     0.5                         0.5
  Sum                                            1                           1.7

Average code length = 1.7 bits.

Summation

If we know the probabilities, pᵢ, with which each of the source symbols will be chosen, the average length of the code is given by the formula l̄ = Σᵢ₌₁ⁿ lᵢpᵢ, where lᵢ is the length of the code word for the ith source symbol and pᵢ is the probability of the ith source symbol.
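For the weather-station code above, the sum can be checked directly (a Python sketch):

```python
# Code word lengths and probabilities from the weather-station example
lengths = {"very poor": 3, "poor": 3, "moderate": 2, "good": 1}
probs   = {"very poor": 0.1, "poor": 0.1, "moderate": 0.3, "good": 0.5}

# Average length: sum of (length x probability) over all symbols
avg_length = sum(lengths[v] * probs[v] for v in lengths)
print(round(avg_length, 2))  # 1.7 bits per symbol
```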

Source Entropy and Information

  • Code extensions can make coding more efficient (fewer bits per symbol).

  • Fewest bits per symbol with fixed-length codes: log₂(n).

Variable-Length Codes

  • Number of bits can be fewer than log₂(n).

  • Most efficient coding approaches the source entropy, H, given by H = −Σᵢ₌₁ⁿ pᵢ log₂(pᵢ), where n is the number of symbols and pᵢ is the probability of the ith symbol.

Entropy
  • Entropy is a measure of the minimum number of bits per symbol theoretically achievable.

  • Coding scheme efficiency, E, can be quantified by E = H / l̄, where l̄ is the average code length and H is the source entropy.
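Both quantities are straightforward to compute. For the weather-station probabilities used earlier (a Python sketch):

```python
import math

def entropy(probs):
    """Source entropy H = -sum(p * log2(p)), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

visibility_probs = [0.1, 0.1, 0.3, 0.5]   # weather-station example
H = entropy(visibility_probs)
avg_len = 3*0.1 + 3*0.1 + 2*0.3 + 1*0.5   # average length of that code
print(round(H, 2))                         # 1.69 bits per symbol
print(round(H / avg_len, 2))               # 0.99 - the code is close to optimal
```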

Huffman Coding

  • Algorithm discovered by David Huffman produces instantaneous code based on symbol probabilities.

Construction

  • Source symbols with largest probabilities allocated to shortest code words.
  • [Figure: Huffman code construction by successive source reductions. Symbols A–E, with probabilities 0.6, 0.15, 0.13, 0.1 and 0.02, are repeatedly combined (the two smallest probabilities first) until a single probability of 1.0 remains; reading back through the reductions gives the code words A: 0, B: 11, C: 100, D: 1010, E: 1011. The tree is conventionally shown in this orientation.]
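The reduction procedure can be sketched with a priority queue (Python's heapq assumed; the exact bit patterns depend on tie-breaking, but the code word lengths match those in the text, A: 1 bit through E: 4 bits):

```python
import heapq
from itertools import count

def huffman(probs):
    """Huffman construction sketch: repeatedly merge the two least
    probable nodes, prefixing '0'/'1' to the code words beneath them."""
    tie = count()  # tie-breaker so equal probabilities never compare lists
    heap = [(p, next(tie), [(sym, "")]) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # smallest probability
        p1, _, c1 = heapq.heappop(heap)   # next smallest
        merged = ([(s, "0" + w) for s, w in c0] +
                  [(s, "1" + w) for s, w in c1])
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return dict(heap[0][2])

code = huffman({"A": 0.6, "B": 0.15, "C": 0.13, "D": 0.1, "E": 0.02})
print(sorted((s, len(w)) for s, w in code.items()))
# [('A', 1), ('B', 2), ('C', 3), ('D', 4), ('E', 4)]
```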

Efficiency

  Symbol   Code word   Code word length, l   Probability, p   l × p
  A        0           1                     0.6              0.6
  B        11          2                     0.15             0.3
  C        100         3                     0.13             0.39
  D        1010        4                     0.1              0.4
  E        1011        4                     0.02             0.08
  Sum                                        1                1.77

H = 1.68 bits.

Quantify

E = 1.68 / 1.77

which is 95%.

Fixed Length

A fixed-length code requires three bits per symbol, because with five symbols 2^2 = 4 is too few and 2^3 = 8 suffices. This gives

E = 1.68 / 3

which is 56% (both efficiencies to two significant figures).

Compressing Files with Huffman Codes

  • Used for compressing computer files.

  • Variable-length coding reduces bits required if characters do not occur with the same frequency.

Uncompress

  • In order to uncompress the file later, the Huffman code that was used to create it needs to be known.

  • Requires inclusion of a coding table alongside the compressed file – an extra overhead.

Coding Table Overhead

Counting the bits in the compressed file together with its coding table, the file contains a total of 42 bits.

Limitations of Huffman Coding

  • Assumes symbols are distributed randomly.

  • Not suitable for data with long runs of identical symbols or relationships between symbols.

  • Implementation considerations can also matter.

  • Requires analysis of the whole file to construct the coding table, introducing time delay.

  • Based on coding discrete symbols (letters, notes), not arbitrary binary files.

RLE and LZW Coding

  • RLE (Run-Length Encoding) for repeated symbols.

  • LZW (Lempel–Ziv–Welch) coding for more complicated repeating patterns.

  • Overcome some limitations of Huffman coding.

Example: C4 D6 A1 E7 A3 D2.

E1 B8 A4 G7 C2 B3 F1 G1

Strengths and Weaknesses of RLE Coding

RLE is a simple technique that can compress data by replacing each run of repeated symbols with a number giving the number of repeats.

  • Strengths: Not necessary to scan the whole file before it begins.

  • Overhead: The single number associated with each run.

  • Efficiency depends on the amount of repetition within the file.

  • Limitation: Requires a lot of repetition.
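A minimal sketch of the scheme (assuming single-character symbols), reproducing the C4 D6 A1 E7 A3 D2 example above:

```python
def rle_encode(data):
    """Replace each run of identical symbols with a (symbol, count) pair."""
    runs = []
    for ch in data:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([ch, 1])      # start a new run
    return [(s, n) for s, n in runs]

def rle_decode(pairs):
    """Expand (symbol, count) pairs back into the original sequence."""
    return "".join(s * n for s, n in pairs)

pairs = rle_encode("CCCCDDDDDDAEEEEEEEAAADD")
print(pairs)  # [('C', 4), ('D', 6), ('A', 1), ('E', 7), ('A', 3), ('D', 2)]
assert rle_decode(pairs) == "CCCCDDDDDDAEEEEEEEAAADD"
```

Note the isolated 'A' still costs a (symbol, count) pair — the overhead that makes RLE a poor fit for data with little repetition.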

Applications That Have Repetition

RLE suits applications with plenty of repetition. One of these is the JPEG standard, which interleaves values with run lengths:

value, run length, value, run length, value, run length, value …

For data with little repetition, however, far from compressing the file, this encoding would more than double the file size.

LZW Coding

  • Exploits repeated patterns in data.

  • Assigns short codes to patterns.

  • Used in ZIP compression.

The LZW Coding Table


  • Creates a coding table (or ‘dictionary’) with code words representing sequences.


  • Structured into predetermined part (single characters) and created part (longer sequences).


  • Predetermined part (0-255) represents single bytes.


  • Created part (256-4095) is for repeating patterns.


  • LZW coding table at the start, without the binary codes:

    Code    Data
    0       NUL
    …
    65      A
    66      B
    …
    255     (not used in ASCII)
    256     unassigned
    …
    4095    unassigned

    The main principles of LZW coding:

    1. The coding table is constructed as the data in the file is read.
    2. Although the coding table is unique to each file and is needed at both the encoder and the decoder, the table itself is not added to the file.

    Code    Data
    0       NUL
    …
    65      A
    66      B
    …
    255     (not used in ASCII)
    256     BC
    257     CB
    258     BCB
    …
    263     CBCB
    …
    4095    unassigned
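A minimal LZW encoder sketch (Python assumed; single-byte symbols, ignoring the 4095-entry limit) shows how the table grows as the data is read — entries BC, CB and BCB appear at codes 256–258, as in the table above:

```python
def lzw_encode(data):
    """LZW encoding sketch: the table starts with all single characters
    (codes 0-255) and grows as repeating patterns are read. The decoder
    rebuilds the same table, so it is never stored with the file."""
    table = {chr(i): i for i in range(256)}
    next_code = 256                          # first code for a created entry
    current, out = "", []
    for ch in data:
        if current + ch in table:            # extend the current pattern
            current += ch
        else:
            out.append(table[current])       # emit code for longest known pattern
            table[current + ch] = next_code  # record the new, longer pattern
            next_code += 1
            current = ch
    if current:
        out.append(table[current])
    return out

print(lzw_encode("BCBCBCBC"))  # [66, 67, 256, 258, 67]
```

Each emitted code stands for the longest pattern already in the table, so repetition is rewarded with ever-longer dictionary entries.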

Summary

  • Information theory provides insights into source coding.

  • Source coding generates digital representations of a source, including compression.

  • The theoretical minimum number of bits per message is the source entropy.

  • Huffman coding is a method of deriving the best coding table for any given set of symbol probabilities.

  • If a data file is to be compressed, Huffman coding requires that the whole file is first analysed in order to determine the best coding table for that particular file.

  • Run-length encoding (RLE) replaces runs of repeated symbols with the number of repeats.

  • LZW coding compresses repeated patterns of symbols.