CITS3004 - Coding and Compression Notes

The Session/Presentation Layers

  • Protocols
  • Encoding
  • Compression
    • Huffman coding

Recap - OSI Model

  • Handles delivery of packets.
    • End-to-end.
    • Reliability.
    • Transport.
    • Etc.
  • Prepares the data!
    • Formatting data.
    • Content encoding/conversion.
    • Service allocation/negotiation.

The Session/Presentation Layer

  • Major tasks:
    • Data representation.
    • Data compression.
    • Security and privacy.
    • Etc.
  • Different computers have different representations for characters.
  • Two computers must agree on the representation of data prior to exchange.
  • To efficiently transport data, it must be compressed where possible.

Data Types and Standards

  • Data comes in different types, and they are not self-describing.
  • Standards are used to identify the format of the data.
    • Typically signaled in the protocol used.
    • Data formats categorized into eight top-level formats.
    • Each type has many sub-types.
    • Each sub-type may have parameters.
  • Examples of data format categories: application, image, message model, multipart (attachments), text, audio, video.
  • Example: Text/plain; charset=iso-8859-1 (type, sub-type, parameters)

Protocols at the Session/Presentation Layers

  • Many different protocols exist at the S/P layers.
  • Notable ones include:
    • SDP (Session Description Protocol).
    • RPC (Remote Procedure Call).
    • PPTP (Point-to-Point Tunneling Protocol).
    • LPP (Location Positioning Protocol).
    • Telnet.
    • X.25.
    • Etc.

Session Description Protocol (SDP)

  • Can be used in a range of network environments and different applications.
  • Belongs to the session layer.
  • Can use different transport protocols as appropriate.
  • These properties make SDP forward-compatible, supporting any upcoming media or traffic type.
  • Four example usages:
    • Session initiation.
    • Streaming media.
    • Email and the WWW.
    • Multicast session announcement.
  • A textual format to describe multimedia sessions.
    • Uses media types to specify media type, subtype, and parameters.
  • Example:
    • v=0
    • o=csp 2890844526 2890842807 IN IP4 10.47.16.5
    • s=-
    • c=IN IP4 224.2.17.12/127
    • t=2873397496 2873404696
    • m=audio 49170 RTP/AVP 99
    • a=rtpmap:99 vorbis/44100/2
    • a=fmtp:99 configuration=AAAAAZ2f4g9NAh4aAXZvcmJpcwA...
    • audio/vorbis; configuration=AAAAAZ2f4g9NAh4aAXZvcmJpcwA...

Content Negotiation

  • Multimedia sessions need to negotiate codecs.
    • A wide variety of codecs exist, and new codecs are frequently introduced.
  • Two-stage offer/answer model for negotiation.
    • Offer lists supported codecs in order of preference.
    • The receiver picks the highest preference codec it also supports and includes this in its answer.
    • Negotiates a common supported codec in one round-trip time.
  • Offer/Answer example:
    • Offer:
      • v=0
      • o=alice 2890844526 2890844526 IN IP4 atlanta.example.com
      • s=
      • c=IN IP4 atlanta.example.com
      • t=0 0
      • m=audio 49170 RTP/AVP 0 8 97
      • a=rtpmap:0 PCMU/8000
      • a=rtpmap:8 PCMA/8000
      • a=rtpmap:97 iLBC/8000
      • m=video 51372 RTP/AVP 31 32
      • a=rtpmap:31 H261/90000
      • a=rtpmap:32 MPV/90000
    • Answer:
      • v=0
      • o=bob 2808844564 2808844564 IN IP4 biloxi.example.com
      • s=
      • c=IN IP4 biloxi.example.com
      • t=0 0
      • m=audio 49174 RTP/AVP 0
      • a=rtpmap:0 PCMU/8000
      • m=video 49170 RTP/AVP 32
      • a=rtpmap:32 MPV/90000
    • Receiver picks a subset of the media subtypes and their parameters to accept → includes them in the answer.
  • Similar negotiation frameworks exist in many other systems.
    • An HTTP client can send an Accept: header listing media types it understands; the server will try to format the response to match.
    • HTTP is an application layer protocol but has presentation layer aspects (e.g., character encoding).
  • Email is one of the few widely used applications without format negotiation.
  • In HTTP, content negotiation provides the mechanism to serve different representations of a resource at the same URI.
    • The client requests the resource and returns the most appropriate representation of the resource.
    • Example:
      • Request: GET /URL HTTP/1.1 Accept: text/* Accept-Language: en Accept-Encoding: br, gzip;q=0.8
      • Response: HTTP/1.1 200 OK Content-Location: /URLe Content-Type: text/html Content-Language: en Content-Encoding: br

Binary Data

  • Data may not be represented within the textual character set in use.
    • If using 7-bit ASCII text, any data using all eight bits.
    • Example: Very old versions of sendmail used the 8th bit to signal that quoted data was present, stripping it off data on input, since email was guaranteed to be 7 bit ASCII only.
    • If using EBCDIC, any unassigned character.
    • If using UTF-8, invalid multi-byte sequences.
  • Must be encoded to fit the character set in use.
  • Issues when designing a binary coding scheme:
    • Must be backwards compatible with text-only systems.
    • Some systems only support 7-bit ASCII.
    • Some systems enforce a maximum line length.
    • Must survive translation between character sets.
      • Legacy systems using ASCII, national extended ASCII variants, EBCDIC, etc.
    • Must not use non-printing characters.
    • Must not use escape characters (e.g., $ \ # ; & “ ”).
  • If only a small amount of binary data, can be easier to quote the non-textual values.
    • Use a special escape character to signal the start of the quote.
    • Signal value of un-representable data.
    • Two common approaches:
      • MIME quoted printable.
      • URL encoding.

Internationalization

  • What character set to use?
    • A national character set? ASCII, iso-8859-1, koi-8, etc.
      • Need to identify the character set and the language.
      • Complex to convert between character sets.
    • Unicode?
      • A single character set that can represent (almost?) all characters, from (almost?) all languages.
      • 21 bits per character (0x0000000x0000000x10FFFF0x10FFFF).
      • Several representations (e.g., UTF-8, UTF-32).
      • Just represents characters – still need to identify the language.
  • Strong recommendation: Unicode in UTF-8 format.
    • UTF-8 is a variable-length coding of Unicode characters.
    • Backwards compatible with 7-bit ASCII characters.
      • Codes in the ASCII range coded identically; all non-ASCII values are coded with the high bit set.
    • No zero octets occur within UTF-8, so it can be represented as a string in C.
    • Widely used in Internet standard protocols.
    • Unicode character bit pattern:
      • 00000000 00000000 0zzzzzzz00000000\ 00000000\ 0zzzzzzz -> 0zzzzzzz0zzzzzzz
      • 00000000 00000yyy yyzzzzzz00000000\ 00000yyy\ yyzzzzzz -> 110yyyyy 10zzzzzz110yyyyy\ 10zzzzzz
      • 00000000 xxxxyyyy yyzzzzzz00000000\ xxxxyyyy\ yyzzzzzz -> 1110xxxx 10yyyyyy 10zzzzzz1110xxxx\ 10yyyyyy\ 10zzzzzz
      • 000wwwxx xxxxyyyy yyzzzzzz000wwwxx\ xxxxyyyy\ yyzzzzzz -> 11110www 10xxxxxx 10yyyyyy 10zzzzzz11110www\ 10xxxxxx\ 10yyyyyy\ 10zzzzzz
  • Unicode just codes the characters, need to code the language separately.
    • Things to remember:
      • Different languages have very different rules!
      • At the protocol level.
  • IETF maintains a standard for identifying languages.
    • Surprisingly complex!
    • RFC 4646 describes the syntax and semantics of language tags and rules for how to register new tags (59 pages).
    • RFC 4647 explains how language tags can be compared (20 pages).
    • The list of registered languages is separate.
    • Examples:
      • en-GB: English as used in Great Britain.
      • zh-Hans-CN: Chinese written using the Simplified script as used in mainland China.
      • sl-IT-nedis: Slovenian as used in Italy, Nadiza dialect.
      • de-Latn-DE-1996: German, Latin script, orthography of 1996.

Compression

  • Why compress data?
  • Where do we compress data?
  • There is no use in decompressing and recompressing at every node in a multi-hop network.
  • The compression should be done once at the point of origin and uncompressed only upon reaching the destination.
  • Thus end-to-end compression is considered a presentation layer function.
  • This allows better response time and throughput and finally less cost (if charged per byte transmitted).
  • Two types of compression:
    • Lossy compression throws away some information.
      • The data produced by decompression is similar but not an exact copy of the original data.
      • Good use for image, video, and audio compression.
    • Lossless compression keeps all information.
      • The data produced by decompression is an exact copy of the original data.
      • Removes redundant data, which can be reconstructed later.
  • Example:
    • 100,000 character data file.
    • Contains only 6 characters, appearing with the following frequencies:
      • a: 45,000
      • b: 13,000
      • c: 12,000
      • d: 16,000
      • e: 9,000
      • f: 5,000
    • A binary code encodes each character as a binary string or codeword.
    • We would like to find a binary code that encodes the file using as few bits as possible (i.e., compresses it as much as possible).
  • In a fixed-length code, each codeword has the same length.
  • In a variable-length code, codewords may have different lengths.
  • Examples:
    • Fixed-length code (must have at least 3 bits per codeword):
      • a: 000, b: 001, c: 010, d: 011, e: 100, f: 101
    • Variable-length code:
      • a: 0, b: 101, c: 100, d: 111, e: 1101, f: 1100
  • Variable-length code example: Morse code
    • Problem: The same stream of bits can be interpreted into many different words!
      • Example: –. –..– can be translated into CAT, KIM, and TETEETT!
  • Two coding methods:
    • Frequency-dependent coding:
      • E.g., counting individual letters.
    • Context-dependent coding:
      • E.g., counting the neighborhood of letters.
  • Frequency-Dependent (FD) coding (Statistical encoding):
    • The letter “e” occurs approximately 100 times more frequently in English text than does the letter “q”.
    • A character encoding scheme can be devised where more frequently used letters take up fewer bits than the less frequently used letters.
    • Because the frequently used letters form a larger proportion of the text, a net compression is gained over using ASCII.
    • The only problem is that computers, when receiving a string of bits, must be able to determine where the beginning and end of each character is because the compressed letters have different lengths.
    • This is easy only if they are all the same size.
  • To handle the problem of identifying letters correctly when they are compressed to different lengths of bits, various mathematical techniques are proposed.
  • One of the most widely known techniques is called Huffman coding.
    • It is used as part of many data compression standards.
      • E.g., MNP 5 used in 2400 bps modems.
    • Huffman coding also has an adaptive version to handle changes in the frequency distribution.
  • Problem: Given an alphabet A=a<em>1,,a</em>nA = a<em>1, …, a</em>n, with frequency distribution f(a<em>i)f(a<em>i), find a binary prefix code C for A that minimizes the number of bits needed to encode a message of σ</em>i=1nf(a<em>i)\sigma</em>{i=1}^n f(a<em>i) characters, where c(a</em>i)c(a</em>i) is the codeword for encoding a<em>ia<em>i, and L(c a</em>i)L(c\ a</em>i ) is the length of the codeword c(ai)c(a_i).
  • B(C)=<em>i=1nf(a</em>i)L(c(ai))B(C) = \sum<em>{i=1}^n f(a</em>i) L(c(a_i))
  • The solution: Huffman developed a greedy algorithm to solve this problem that produces a minimum-cost prefix code called the Huffman code.
  • Prefix Code: A code is called a prefix (free) code if no codeword is a prefix of another one.
    • E.g., {a = 0, b = 110, c = 10, d = 111} is a prefix code.
  • Every message encoded by a prefix-free code is uniquely decipherable.
  • Since no codeword is a prefix of any other, we can always find the first codeword in a message, peel it off, and continue decoding.
  • Huffman Coding Steps:
    • Step 1: Pick two letters x, y from alphabet A with the smallest frequencies and create a subtree that has these two characters as leaves (greedy).
      • Label the root of this subtree as z.
    • Step 2: Set frequency f(z)=f(x)+f(y)f(z) = f(x) + f(y). Remove x, y and add z from alphabet A.
      • The size of A is now reduced by 1.
    • Step 3: Repeat the above two steps until the size of A is 1.
  • Example:
    • A=a,b,c,d,eA = {a, b, c, d, e} and frequencies are: f(a)=20,f(b)=15,f(c)=5,f(d)=15,f(e)=45f(a) = 20, f(b) = 15, f(c) = 5, f(d) = 15, f(e) = 45.
    • We will merge c and d in step 1.
    • Now we have A=a,b,cd,eA = {a, b, cd, e} (i.e., A=4A = 4).
    • Next, we merge a and b.
      • Note, we can also merge b and cd instead.
    • Now we have A=ab,cd,eA = {ab, cd, e} (i.e., A=3A = 3).
    • Next, we merge ab and cd.
      • Cannot merge anything with e yet.
    • Now we have A=abcd,eA = {abcd, e} (i.e., A=2A = 2).
    • Finally, we merge abcd and e.
    • Now we have A=abcdeA = {abcde} (i.e., A=1A = 1), so we stop.
    • To assign codeword, we attach 0 and 1 to the branches.
      • Note, left is 0 and right is 1 (cannot mix).
    • So, we have the Huffman code of a=000,b=001,c=010,d=011,e=1a = 000, b = 001, c = 010, d = 011, e = 1.
    • This is the optimum prefix code for this distribution.
  • Context-dependent coding recognizes that some characters/bytes often occur adjacent to one another.
    • E.g., “t” is often followed by “h”, and “q” is often followed by “u”.
    • By recognizing this, we might be able to assign a code to “th” which could be shorter than the Huffman code.
  • The most simple context-dependent coding is called run length encoding (RLE).
    • It is not uncommon to get long strings of bytes that are exactly the same.
      • e.g., a long string of 001600_{16} bytes within an executable file.
    • Alternatively, a fax machine often scans long lines of white paper resulting in a long string of 1 bits.
    • Given a run of one hundred bytes of 00<em>1600<em>{16} in a row, run length encoding recognizes that it is much shorter to transmit the short integer 100 followed by a 00</em>1600</em>{16} (to indicate it is 100 zeros) than to transmit all 100 bytes.
  • Problem: Often, the number 100, when stored in a byte, is a valid character.
    • How do we signal the destination that this is a run count rather than a piece of data?
    • If we precede every byte by a count, how do we avoid a random file expanding by 2 in size?
  • MNP5 uses the following RLE coding.
    • It enforces encoding rules to ensure there are no confusion when the byte values are read.
    • Any string of the same character between 3 and 258 long is changed into the 3 characters followed by a one-byte length count.
      • Note if the string is length 3, it will get changed into 4 characters, the last one being 001600_{16}.
  • Examples:
    • A -> A (No compression)
    • AA -> AA (No compression)
    • AAA -> AAA0 (3:4 (un)compression)
    • AAAA -> AAA1 (No compression)
    • AAAAAAAAAA (10 A’s) -> AAA7 (10:4 compression)
    • AAA…AAAA (258 A’s) -> AAA255 (258:4 compression)
  • This allows the decompression algorithm to detect length counts:
    • They follow every same-character triple!
    • Note that “3” is not a magic number: some files might compress better with 4.
      • ‘3’ was found to be a good value most of the time.
    • Note: MNP5 uses adaptive Huffman and RLE.
    • We now have MNP10.

Other Things to Consider

  • Such as security:
    • How to keep messages secret between two physically distant people (e.g., between continents)?
    • How to share a secret key among trusted users?
    • How do you know it is a trusted user?
    • How to check things aren’t modified in between?
      • E.g., by a malicious user in the group or from outside.

Summary

  • Viewed some important functions of the session and presentation layers.
    • Representation and encoding data, international standardization, data compression.
  • Looked at some encoding schemes.
    • E.g., base 64, URL.
  • High-level overview of compression.
    • Plus quick overview of Huffman code.