CITS3004 - Coding and Compression Notes
The Session/Presentation Layers
- Protocols
- Encoding
- Compression
Recap - OSI Model
- Handles delivery of packets.
- End-to-end.
- Reliability.
- Transport.
- Etc.
- Prepares the data!
- Formatting data.
- Content encoding/conversion.
- Service allocation/negotiation.
The Session/Presentation Layer
- Major tasks:
- Data representation.
- Data compression.
- Security and privacy.
- Etc.
- Different computers have different representations for characters.
- Two computers must agree on the representation of data prior to exchange.
- To efficiently transport data, it must be compressed where possible.
Data Types and Standards
- Data comes in different types, and they are not self-describing.
- Standards are used to identify the format of the data.
- Typically signaled in the protocol used.
- Data formats categorized into eight top-level formats.
- Each type has many sub-types.
- Each sub-type may have parameters.
- Examples of data format categories: application, image, message model, multipart (attachments), text, audio, video.
- Example:
Text/plain; charset=iso-8859-1 (type, sub-type, parameters)
Protocols at the Session/Presentation Layers
- Many different protocols exist at the S/P layers.
- Notable ones include:
- SDP (Session Description Protocol).
- RPC (Remote Procedure Call).
- PPTP (Point-to-Point Tunneling Protocol).
- LPP (Location Positioning Protocol).
- Telnet.
- X.25.
- Etc.
Session Description Protocol (SDP)
- Can be used in a range of network environments and different applications.
- Belongs to the session layer.
- Can use different transport protocols as appropriate.
- These properties make SDP forward-compatible, supporting any upcoming media or traffic type.
- Four example usages:
- Session initiation.
- Streaming media.
- Email and the WWW.
- Multicast session announcement.
- A textual format to describe multimedia sessions.
- Uses media types to specify media type, subtype, and parameters.
- Example:
v=0o=csp 2890844526 2890842807 IN IP4 10.47.16.5s=-c=IN IP4 224.2.17.12/127t=2873397496 2873404696m=audio 49170 RTP/AVP 99a=rtpmap:99 vorbis/44100/2a=fmtp:99 configuration=AAAAAZ2f4g9NAh4aAXZvcmJpcwA...audio/vorbis; configuration=AAAAAZ2f4g9NAh4aAXZvcmJpcwA...
Content Negotiation
- Multimedia sessions need to negotiate codecs.
- A wide variety of codecs exist, and new codecs are frequently introduced.
- Two-stage offer/answer model for negotiation.
- Offer lists supported codecs in order of preference.
- The receiver picks the highest preference codec it also supports and includes this in its answer.
- Negotiates a common supported codec in one round-trip time.
- Offer/Answer example:
- Offer:
v=0o=alice 2890844526 2890844526 IN IP4 atlanta.example.coms=c=IN IP4 atlanta.example.comt=0 0m=audio 49170 RTP/AVP 0 8 97a=rtpmap:0 PCMU/8000a=rtpmap:8 PCMA/8000a=rtpmap:97 iLBC/8000m=video 51372 RTP/AVP 31 32a=rtpmap:31 H261/90000a=rtpmap:32 MPV/90000
- Answer:
v=0o=bob 2808844564 2808844564 IN IP4 biloxi.example.coms=c=IN IP4 biloxi.example.comt=0 0m=audio 49174 RTP/AVP 0a=rtpmap:0 PCMU/8000m=video 49170 RTP/AVP 32a=rtpmap:32 MPV/90000
- Receiver picks a subset of the media subtypes and their parameters to accept → includes them in the answer.
- Similar negotiation frameworks exist in many other systems.
- An HTTP client can send an
Accept: header listing media types it understands; the server will try to format the response to match. - HTTP is an application layer protocol but has presentation layer aspects (e.g., character encoding).
- Email is one of the few widely used applications without format negotiation.
- In HTTP, content negotiation provides the mechanism to serve different representations of a resource at the same URI.
- The client requests the resource and returns the most appropriate representation of the resource.
- Example:
- Request:
GET /URL HTTP/1.1 Accept: text/* Accept-Language: en Accept-Encoding: br, gzip;q=0.8 - Response:
HTTP/1.1 200 OK Content-Location: /URLe Content-Type: text/html Content-Language: en Content-Encoding: br
Binary Data
- Data may not be represented within the textual character set in use.
- If using 7-bit ASCII text, any data using all eight bits.
- Example: Very old versions of sendmail used the 8th bit to signal that quoted data was present, stripping it off data on input, since email was guaranteed to be 7 bit ASCII only.
- If using EBCDIC, any unassigned character.
- If using UTF-8, invalid multi-byte sequences.
- Must be encoded to fit the character set in use.
- Issues when designing a binary coding scheme:
- Must be backwards compatible with text-only systems.
- Some systems only support 7-bit ASCII.
- Some systems enforce a maximum line length.
- Must survive translation between character sets.
- Legacy systems using ASCII, national extended ASCII variants, EBCDIC, etc.
- Must not use non-printing characters.
- Must not use escape characters (e.g., $ \ # ; & “ ”).
- If only a small amount of binary data, can be easier to quote the non-textual values.
- Use a special escape character to signal the start of the quote.
- Signal value of un-representable data.
- Two common approaches:
- MIME quoted printable.
- URL encoding.
Internationalization
- What character set to use?
- A national character set? ASCII, iso-8859-1, koi-8, etc.
- Need to identify the character set and the language.
- Complex to convert between character sets.
- Unicode?
- A single character set that can represent (almost?) all characters, from (almost?) all languages.
- 21 bits per character (0x000000 – 0x10FFFF).
- Several representations (e.g., UTF-8, UTF-32).
- Just represents characters – still need to identify the language.
- Strong recommendation: Unicode in UTF-8 format.
- UTF-8 is a variable-length coding of Unicode characters.
- Backwards compatible with 7-bit ASCII characters.
- Codes in the ASCII range coded identically; all non-ASCII values are coded with the high bit set.
- No zero octets occur within UTF-8, so it can be represented as a string in C.
- Widely used in Internet standard protocols.
- Unicode character bit pattern:
- 00000000 00000000 0zzzzzzz -> 0zzzzzzz
- 00000000 00000yyy yyzzzzzz -> 110yyyyy 10zzzzzz
- 00000000 xxxxyyyy yyzzzzzz -> 1110xxxx 10yyyyyy 10zzzzzz
- 000wwwxx xxxxyyyy yyzzzzzz -> 11110www 10xxxxxx 10yyyyyy 10zzzzzz
- Unicode just codes the characters, need to code the language separately.
- Things to remember:
- Different languages have very different rules!
- At the protocol level.
- IETF maintains a standard for identifying languages.
- Surprisingly complex!
- RFC 4646 describes the syntax and semantics of language tags and rules for how to register new tags (59 pages).
- RFC 4647 explains how language tags can be compared (20 pages).
- The list of registered languages is separate.
- Examples:
en-GB: English as used in Great Britain.zh-Hans-CN: Chinese written using the Simplified script as used in mainland China.sl-IT-nedis: Slovenian as used in Italy, Nadiza dialect.de-Latn-DE-1996: German, Latin script, orthography of 1996.
Compression
- Why compress data?
- Where do we compress data?
- There is no use in decompressing and recompressing at every node in a multi-hop network.
- The compression should be done once at the point of origin and uncompressed only upon reaching the destination.
- Thus end-to-end compression is considered a presentation layer function.
- This allows better response time and throughput and finally less cost (if charged per byte transmitted).
- Two types of compression:
- Lossy compression throws away some information.
- The data produced by decompression is similar but not an exact copy of the original data.
- Good use for image, video, and audio compression.
- Lossless compression keeps all information.
- The data produced by decompression is an exact copy of the original data.
- Removes redundant data, which can be reconstructed later.
- Example:
- 100,000 character data file.
- Contains only 6 characters, appearing with the following frequencies:
- a: 45,000
- b: 13,000
- c: 12,000
- d: 16,000
- e: 9,000
- f: 5,000
- A binary code encodes each character as a binary string or codeword.
- We would like to find a binary code that encodes the file using as few bits as possible (i.e., compresses it as much as possible).
- In a fixed-length code, each codeword has the same length.
- In a variable-length code, codewords may have different lengths.
- Examples:
- Fixed-length code (must have at least 3 bits per codeword):
- a: 000, b: 001, c: 010, d: 011, e: 100, f: 101
- Variable-length code:
- a: 0, b: 101, c: 100, d: 111, e: 1101, f: 1100
- Variable-length code example: Morse code
- Problem: The same stream of bits can be interpreted into many different words!
- Example: –. –..– can be translated into CAT, KIM, and TETEETT!
- Two coding methods:
- Frequency-dependent coding:
- E.g., counting individual letters.
- Context-dependent coding:
- E.g., counting the neighborhood of letters.
- Frequency-Dependent (FD) coding (Statistical encoding):
- The letter “e” occurs approximately 100 times more frequently in English text than does the letter “q”.
- A character encoding scheme can be devised where more frequently used letters take up fewer bits than the less frequently used letters.
- Because the frequently used letters form a larger proportion of the text, a net compression is gained over using ASCII.
- The only problem is that computers, when receiving a string of bits, must be able to determine where the beginning and end of each character is because the compressed letters have different lengths.
- This is easy only if they are all the same size.
- To handle the problem of identifying letters correctly when they are compressed to different lengths of bits, various mathematical techniques are proposed.
- One of the most widely known techniques is called Huffman coding.
- It is used as part of many data compression standards.
- E.g., MNP 5 used in 2400 bps modems.
- Huffman coding also has an adaptive version to handle changes in the frequency distribution.
- Problem: Given an alphabet A=a<em>1,…,a</em>n, with frequency distribution f(a<em>i), find a binary prefix code C for A that minimizes the number of bits needed to encode a message of σ</em>i=1nf(a<em>i) characters, where c(a</em>i) is the codeword for encoding a<em>i, and L(c a</em>i) is the length of the codeword c(ai).
- B(C)=∑<em>i=1nf(a</em>i)L(c(ai))
- The solution: Huffman developed a greedy algorithm to solve this problem that produces a minimum-cost prefix code called the Huffman code.
- Prefix Code: A code is called a prefix (free) code if no codeword is a prefix of another one.
- E.g., {a = 0, b = 110, c = 10, d = 111} is a prefix code.
- Every message encoded by a prefix-free code is uniquely decipherable.
- Since no codeword is a prefix of any other, we can always find the first codeword in a message, peel it off, and continue decoding.
- Huffman Coding Steps:
- Step 1: Pick two letters x, y from alphabet A with the smallest frequencies and create a subtree that has these two characters as leaves (greedy).
- Label the root of this subtree as z.
- Step 2: Set frequency f(z)=f(x)+f(y). Remove x, y and add z from alphabet A.
- The size of A is now reduced by 1.
- Step 3: Repeat the above two steps until the size of A is 1.
- Example:
- A=a,b,c,d,e and frequencies are: f(a)=20,f(b)=15,f(c)=5,f(d)=15,f(e)=45.
- We will merge c and d in step 1.
- Now we have A=a,b,cd,e (i.e., A=4).
- Next, we merge a and b.
- Note, we can also merge b and cd instead.
- Now we have A=ab,cd,e (i.e., A=3).
- Next, we merge ab and cd.
- Cannot merge anything with e yet.
- Now we have A=abcd,e (i.e., A=2).
- Finally, we merge abcd and e.
- Now we have A=abcde (i.e., A=1), so we stop.
- To assign codeword, we attach 0 and 1 to the branches.
- Note, left is 0 and right is 1 (cannot mix).
- So, we have the Huffman code of a=000,b=001,c=010,d=011,e=1.
- This is the optimum prefix code for this distribution.
- Context-dependent coding recognizes that some characters/bytes often occur adjacent to one another.
- E.g., “t” is often followed by “h”, and “q” is often followed by “u”.
- By recognizing this, we might be able to assign a code to “th” which could be shorter than the Huffman code.
- The most simple context-dependent coding is called run length encoding (RLE).
- It is not uncommon to get long strings of bytes that are exactly the same.
- e.g., a long string of 0016 bytes within an executable file.
- Alternatively, a fax machine often scans long lines of white paper resulting in a long string of 1 bits.
- Given a run of one hundred bytes of 00<em>16 in a row, run length encoding recognizes that it is much shorter to transmit the short integer 100 followed by a 00</em>16 (to indicate it is 100 zeros) than to transmit all 100 bytes.
- Problem: Often, the number 100, when stored in a byte, is a valid character.
- How do we signal the destination that this is a run count rather than a piece of data?
- If we precede every byte by a count, how do we avoid a random file expanding by 2 in size?
- MNP5 uses the following RLE coding.
- It enforces encoding rules to ensure there are no confusion when the byte values are read.
- Any string of the same character between 3 and 258 long is changed into the 3 characters followed by a one-byte length count.
- Note if the string is length 3, it will get changed into 4 characters, the last one being 0016.
- Examples:
- A -> A (No compression)
- AA -> AA (No compression)
- AAA -> AAA0 (3:4 (un)compression)
- AAAA -> AAA1 (No compression)
- AAAAAAAAAA (10 A’s) -> AAA7 (10:4 compression)
- AAA…AAAA (258 A’s) -> AAA255 (258:4 compression)
- This allows the decompression algorithm to detect length counts:
- They follow every same-character triple!
- Note that “3” is not a magic number: some files might compress better with 4.
- ‘3’ was found to be a good value most of the time.
- Note: MNP5 uses adaptive Huffman and RLE.
- We now have MNP10.
Other Things to Consider
- Such as security:
- How to keep messages secret between two physically distant people (e.g., between continents)?
- How to share a secret key among trusted users?
- How do you know it is a trusted user?
- How to check things aren’t modified in between?
- E.g., by a malicious user in the group or from outside.
Summary
- Viewed some important functions of the session and presentation layers.
- Representation and encoding data, international standardization, data compression.
- Looked at some encoding schemes.
- High-level overview of compression.
- Plus quick overview of Huffman code.