CITS3004 - Coding and Compression Notes

The Session/Presentation Layers

Protocols
Encoding
Compression
- Huffman coding

Recap - OSI Model

Handles delivery of packets.
- End-to-end.
- Reliability.
- Transport.
- Etc.
Prepares the data!
- Formatting data.
- Content encoding/conversion.
- Service allocation/negotiation.

The Session/Presentation Layer

Major tasks:
- Data representation.
- Data compression.
- Security and privacy.
- Etc.
Different computers have different representations for characters.
Two computers must agree on the representation of data prior to exchange.
To efficiently transport data, it must be compressed where possible.

Data Types and Standards

Data comes in different types, and they are not self-describing.
Standards are used to identify the format of the data.
- Typically signaled in the protocol used.
- Data formats categorized into eight top-level formats.
- Each type has many sub-types.
- Each sub-type may have parameters.
Examples of data format categories: application, image, message model, multipart (attachments), text, audio, video.
Example: Text/plain; charset=iso-8859-1 (type, sub-type, parameters)

Protocols at the Session/Presentation Layers

Many different protocols exist at the S/P layers.
Notable ones include:
- SDP (Session Description Protocol).
- RPC (Remote Procedure Call).
- PPTP (Point-to-Point Tunneling Protocol).
- LPP (Location Positioning Protocol).
- Telnet.
- X.25.
- Etc.

Session Description Protocol (SDP)

Can be used in a range of network environments and different applications.
Belongs to the session layer.
Can use different transport protocols as appropriate.
These properties make SDP forward-compatible, supporting any upcoming media or traffic type.
Four example usages:
- Session initiation.
- Streaming media.
- Email and the WWW.
- Multicast session announcement.
A textual format to describe multimedia sessions.
- Uses media types to specify media type, subtype, and parameters.
Example:
- v=0
- o=csp 2890844526 2890842807 IN IP4 10.47.16.5
- s=-
- c=IN IP4 224.2.17.12/127
- t=2873397496 2873404696
- m=audio 49170 RTP/AVP 99
- a=rtpmap:99 vorbis/44100/2
- a=fmtp:99 configuration=AAAAAZ2f4g9NAh4aAXZvcmJpcwA...
- audio/vorbis; configuration=AAAAAZ2f4g9NAh4aAXZvcmJpcwA...

Content Negotiation

Multimedia sessions need to negotiate codecs.
- A wide variety of codecs exist, and new codecs are frequently introduced.
Two-stage offer/answer model for negotiation.
- Offer lists supported codecs in order of preference.
- The receiver picks the highest preference codec it also supports and includes this in its answer.
- Negotiates a common supported codec in one round-trip time.
Offer/Answer example:
- Offer:
  - v=0
  - o=alice 2890844526 2890844526 IN IP4 atlanta.example.com
  - s=
  - c=IN IP4 atlanta.example.com
  - t=0 0
  - m=audio 49170 RTP/AVP 0 8 97
  - a=rtpmap:0 PCMU/8000
  - a=rtpmap:8 PCMA/8000
  - a=rtpmap:97 iLBC/8000
  - m=video 51372 RTP/AVP 31 32
  - a=rtpmap:31 H261/90000
  - a=rtpmap:32 MPV/90000
- Answer:
  - v=0
  - o=bob 2808844564 2808844564 IN IP4 biloxi.example.com
  - s=
  - c=IN IP4 biloxi.example.com
  - t=0 0
  - m=audio 49174 RTP/AVP 0
  - a=rtpmap:0 PCMU/8000
  - m=video 49170 RTP/AVP 32
  - a=rtpmap:32 MPV/90000
- Receiver picks a subset of the media subtypes and their parameters to accept → includes them in the answer.
Similar negotiation frameworks exist in many other systems.
- An HTTP client can send an Accept: header listing media types it understands; the server will try to format the response to match.
- HTTP is an application layer protocol but has presentation layer aspects (e.g., character encoding).
Email is one of the few widely used applications without format negotiation.
In HTTP, content negotiation provides the mechanism to serve different representations of a resource at the same URI.
- The client requests the resource and returns the most appropriate representation of the resource.
- Example:
  - Request: GET /URL HTTP/1.1 Accept: text/* Accept-Language: en Accept-Encoding: br, gzip;q=0.8
  - Response: HTTP/1.1 200 OK Content-Location: /URLe Content-Type: text/html Content-Language: en Content-Encoding: br

Binary Data

Data may not be represented within the textual character set in use.
- If using 7-bit ASCII text, any data using all eight bits.
- Example: Very old versions of sendmail used the 8th bit to signal that quoted data was present, stripping it off data on input, since email was guaranteed to be 7 bit ASCII only.
- If using EBCDIC, any unassigned character.
- If using UTF-8, invalid multi-byte sequences.
Must be encoded to fit the character set in use.
Issues when designing a binary coding scheme:
- Must be backwards compatible with text-only systems.
- Some systems only support 7-bit ASCII.
- Some systems enforce a maximum line length.
- Must survive translation between character sets.
  - Legacy systems using ASCII, national extended ASCII variants, EBCDIC, etc.
- Must not use non-printing characters.
- Must not use escape characters (e.g., $ \ # ; & “ ”).
If only a small amount of binary data, can be easier to quote the non-textual values.
- Use a special escape character to signal the start of the quote.
- Signal value of un-representable data.
- Two common approaches:
  - MIME quoted printable.
  - URL encoding.

Internationalization

What character set to use?
- A national character set? ASCII, iso-8859-1, koi-8, etc.
  - Need to identify the character set and the language.
  - Complex to convert between character sets.
- Unicode?
  - A single character set that can represent (almost?) all characters, from (almost?) all languages.
  - 21 bits per character ( $0x000000$ – $0x10FFFF$ ).
  - Several representations (e.g., UTF-8, UTF-32).
  - Just represents characters – still need to identify the language.
Strong recommendation: Unicode in UTF-8 format.
- UTF-8 is a variable-length coding of Unicode characters.
- Backwards compatible with 7-bit ASCII characters.
  - Codes in the ASCII range coded identically; all non-ASCII values are coded with the high bit set.
- No zero octets occur within UTF-8, so it can be represented as a string in C.
- Widely used in Internet standard protocols.
- Unicode character bit pattern:
  - $00000000\ 00000000\ 0zzzzzzz$ -> $0zzzzzzz$
  - $00000000\ 00000yyy\ yyzzzzzz$ -> $110yyyyy\ 10zzzzzz$
  - $00000000\ xxxxyyyy\ yyzzzzzz$ -> $1110xxxx\ 10yyyyyy\ 10zzzzzz$
  - $000wwwxx\ xxxxyyyy\ yyzzzzzz$ -> $11110www\ 10xxxxxx\ 10yyyyyy\ 10zzzzzz$
Unicode just codes the characters, need to code the language separately.
- Things to remember:
  - Different languages have very different rules!
  - At the protocol level.
IETF maintains a standard for identifying languages.
- Surprisingly complex!
- RFC 4646 describes the syntax and semantics of language tags and rules for how to register new tags (59 pages).
- RFC 4647 explains how language tags can be compared (20 pages).
- The list of registered languages is separate.
- Examples:
  - en-GB: English as used in Great Britain.
  - zh-Hans-CN: Chinese written using the Simplified script as used in mainland China.
  - sl-IT-nedis: Slovenian as used in Italy, Nadiza dialect.
  - de-Latn-DE-1996: German, Latin script, orthography of 1996.

Compression

Why compress data?
Where do we compress data?
There is no use in decompressing and recompressing at every node in a multi-hop network.
The compression should be done once at the point of origin and uncompressed only upon reaching the destination.
Thus end-to-end compression is considered a presentation layer function.
This allows better response time and throughput and finally less cost (if charged per byte transmitted).
Two types of compression:
- Lossy compression throws away some information.
  - The data produced by decompression is similar but not an exact copy of the original data.
  - Good use for image, video, and audio compression.
- Lossless compression keeps all information.
  - The data produced by decompression is an exact copy of the original data.
  - Removes redundant data, which can be reconstructed later.
Example:
- 100,000 character data file.
- Contains only 6 characters, appearing with the following frequencies:
  - a: 45,000
  - b: 13,000
  - c: 12,000
  - d: 16,000
  - e: 9,000
  - f: 5,000
- A binary code encodes each character as a binary string or codeword.
- We would like to find a binary code that encodes the file using as few bits as possible (i.e., compresses it as much as possible).
In a fixed-length code, each codeword has the same length.
In a variable-length code, codewords may have different lengths.
Examples:
- Fixed-length code (must have at least 3 bits per codeword):
  - a: 000, b: 001, c: 010, d: 011, e: 100, f: 101
- Variable-length code:
  - a: 0, b: 101, c: 100, d: 111, e: 1101, f: 1100
Variable-length code example: Morse code
- Problem: The same stream of bits can be interpreted into many different words!
  - Example: –. –..– can be translated into CAT, KIM, and TETEETT!
Two coding methods:
- Frequency-dependent coding:
  - E.g., counting individual letters.
- Context-dependent coding:
  - E.g., counting the neighborhood of letters.
Frequency-Dependent (FD) coding (Statistical encoding):
- The letter “e” occurs approximately 100 times more frequently in English text than does the letter “q”.
- A character encoding scheme can be devised where more frequently used letters take up fewer bits than the less frequently used letters.
- Because the frequently used letters form a larger proportion of the text, a net compression is gained over using ASCII.
- The only problem is that computers, when receiving a string of bits, must be able to determine where the beginning and end of each character is because the compressed letters have different lengths.
- This is easy only if they are all the same size.
To handle the problem of identifying letters correctly when they are compressed to different lengths of bits, various mathematical techniques are proposed.
One of the most widely known techniques is called Huffman coding.
- It is used as part of many data compression standards.
  - E.g., MNP 5 used in 2400 bps modems.
- Huffman coding also has an adaptive version to handle changes in the frequency distribution.
Problem: Given an alphabet $A = a1, …, an$ , with frequency distribution $f(ai)$ , find a binary prefix code C for A that minimizes the number of bits needed to encode a message of $\sigma{i=1}^n f(ai)$ characters, where $c(ai)$ is the codeword for encoding $ai$ , and $L(c\ ai )$ is the length of the codeword $c(a_i)$ .
$B(C) = \sum{i=1}^n f(ai) L(c(a_i))$
The solution: Huffman developed a greedy algorithm to solve this problem that produces a minimum-cost prefix code called the Huffman code.
Prefix Code: A code is called a prefix (free) code if no codeword is a prefix of another one.
- E.g., {a = 0, b = 110, c = 10, d = 111} is a prefix code.
Every message encoded by a prefix-free code is uniquely decipherable.
Since no codeword is a prefix of any other, we can always find the first codeword in a message, peel it off, and continue decoding.
Huffman Coding Steps:
- Step 1: Pick two letters x, y from alphabet A with the smallest frequencies and create a subtree that has these two characters as leaves (greedy).
  - Label the root of this subtree as z.
- Step 2: Set frequency $f(z) = f(x) + f(y)$ . Remove x, y and add z from alphabet A.
  - The size of A is now reduced by 1.
- Step 3: Repeat the above two steps until the size of A is 1.
Example:
- $A = {a, b, c, d, e}$ and frequencies are: $f(a) = 20, f(b) = 15, f(c) = 5, f(d) = 15, f(e) = 45$ .
- We will merge c and d in step 1.
- Now we have $A = {a, b, cd, e}$ (i.e., $A = 4$ ).
- Next, we merge a and b.
  - Note, we can also merge b and cd instead.
- Now we have $A = {ab, cd, e}$ (i.e., $A = 3$ ).
- Next, we merge ab and cd.
  - Cannot merge anything with e yet.
- Now we have $A = {abcd, e}$ (i.e., $A = 2$ ).
- Finally, we merge abcd and e.
- Now we have $A = {abcde}$ (i.e., $A = 1$ ), so we stop.
- To assign codeword, we attach 0 and 1 to the branches.
  - Note, left is 0 and right is 1 (cannot mix).
- So, we have the Huffman code of $a = 000, b = 001, c = 010, d = 011, e = 1$ .
- This is the optimum prefix code for this distribution.
Context-dependent coding recognizes that some characters/bytes often occur adjacent to one another.
- E.g., “t” is often followed by “h”, and “q” is often followed by “u”.
- By recognizing this, we might be able to assign a code to “th” which could be shorter than the Huffman code.
The most simple context-dependent coding is called run length encoding (RLE).
- It is not uncommon to get long strings of bytes that are exactly the same.
 - e.g., a long string of $00_{16}$ bytes within an executable file.
- Alternatively, a fax machine often scans long lines of white paper resulting in a long string of 1 bits.
- Given a run of one hundred bytes of $00{16}$ in a row, run length encoding recognizes that it is much shorter to transmit the short integer 100 followed by a $00{16}$ (to indicate it is 100 zeros) than to transmit all 100 bytes.
Problem: Often, the number 100, when stored in a byte, is a valid character.
- How do we signal the destination that this is a run count rather than a piece of data?
- If we precede every byte by a count, how do we avoid a random file expanding by 2 in size?
MNP5 uses the following RLE coding.
- It enforces encoding rules to ensure there are no confusion when the byte values are read.
- Any string of the same character between 3 and 258 long is changed into the 3 characters followed by a one-byte length count.
  - Note if the string is length 3, it will get changed into 4 characters, the last one being $00_{16}$ .
Examples:
- A -> A (No compression)
- AA -> AA (No compression)
- AAA -> AAA0 (3:4 (un)compression)
- AAAA -> AAA1 (No compression)
- AAAAAAAAAA (10 A’s) -> AAA7 (10:4 compression)
- AAA…AAAA (258 A’s) -> AAA255 (258:4 compression)
This allows the decompression algorithm to detect length counts:
- They follow every same-character triple!
- Note that “3” is not a magic number: some files might compress better with 4.
  - ‘3’ was found to be a good value most of the time.
- Note: MNP5 uses adaptive Huffman and RLE.
- We now have MNP10.

Other Things to Consider

Such as security:
- How to keep messages secret between two physically distant people (e.g., between continents)?
- How to share a secret key among trusted users?
- How do you know it is a trusted user?
- How to check things aren’t modified in between?
  - E.g., by a malicious user in the group or from outside.

Summary

Viewed some important functions of the session and presentation layers.
- Representation and encoding data, international standardization, data compression.
Looked at some encoding schemes.
- E.g., base 64, URL.
High-level overview of compression.
- Plus quick overview of Huffman code.