Chapter 7: Computer Networks
7
THE APPLICATION LAYER
Having finished all the preliminaries, we now come to the layer where all the applications are found. The layers below the application layer are there to provide transport services, but they do not do real work for users. In this chapter, we will study some real network applications.
Even at the application layer there is a need for support protocols, to allow many applications to function. Accordingly, we will look at an important one of these before starting with the applications themselves. The item in question is the DNS (Domain Name System), which maps Internet names to IP addresses. After that, we will examine three real applications: electronic mail, the World Wide Web (generally referred to simply as ‘‘the Web’’), and multimedia, including modern video streaming. We will finish the chapter by discussing content distribution, including peer-to-peer networks and content delivery networks.
7.1 THE DOMAIN NAME SYSTEM (DNS)
Although programs theoretically could refer to Web pages, mailboxes, and other resources by using the network (i.e., IP) addresses of the computers where they are stored, these addresses are difficult for people to remember. Also, browsing a company’s Web pages from 128.111.24.41 is brittle: if the company moves the Web server to a different machine with a different IP address, everyone needs to be told the new IP address. Although moving a Web site from one IP address to
another might seem far-fetched, in practice this general notion occurs quite often, in the form of load balancing. Specifically, many modern Web sites host their content on multiple machines, often geographically distributed clusters. The organization hosting the content may wish to ‘‘move’’ a client’s communication from one Web server to another. The DNS is typically the most convenient way to do this.
High-level, readable names decouple machine names from machine addresses. An organization’s Web server could thus be referred to as www.cs.uchicago.edu, regardless of its IP address. Because the devices along a network path forward traffic to its destination based on IP address, these human-readable domain names must be converted to IP addresses; the DNS (Domain Name System) is the mechanism that does so. In the subsequent sections, we will study how DNS performs this mapping, as well as how it has evolved over the past decades. In particular, one of the most significant developments in the DNS in recent years is its implications for user privacy. We will explore these implications and various recent developments in DNS encryption that are related to privacy.
7.1.1 History and Overview
Back in the ARPANET days, a file, hosts.txt, listed all the computer names and their IP addresses. Every night, all of the hosts would fetch it from the site at which it was maintained. For a network of a few hundred large timesharing machines, this approach worked reasonably well.
However, well before many millions of PCs were connected to the Internet, everyone involved with it realized that this approach could not continue to work forever. For one thing, the size of the file would become too large. Even more importantly, host name conflicts would occur constantly unless names were centrally managed, something unthinkable in a huge international network due to the load and latency. The Domain Name System was invented in 1983 to address these problems, and it has been a key part of the Internet ever since.
DNS is a hierarchical naming scheme and a distributed database system that implements this naming scheme. It is primarily used for mapping host names to IP addresses, but it has several other purposes, which we will outline in more detail below. DNS is one of the most actively evolving protocols in the Internet. DNS is defined in RFC 1034, RFC 1035, RFC 2181, and further elaborated in many other RFCs.
7.1.2 The DNS Lookup Process
DNS operates as follows. To map a name onto an IP address, an application program calls a library procedure (typically gethostbyname or the equivalent), passing this function the name as a parameter. This process is sometimes referred to as the stub resolver. The stub resolver sends a query containing the name to a local DNS resolver, often called the local recursive resolver or simply the local
resolver, which subsequently performs a so-called recursive lookup for the name against a set of DNS resolvers. The local recursive resolver ultimately returns a response with the corresponding IP address to the stub resolver, which then passes
the result to the function that issued the query in the first place. The query and response messages are sent as UDP packets. Given knowledge of the IP address, the program can then communicate with the host corresponding to the DNS name that it had looked up. We will explore this process in more detail later in this chapter.
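The stub-resolver step is visible from ordinary application code. The sketch below (Python, standard library only) asks the system’s stub resolver to map a name to an IPv4 address; the name localhost is used here because it resolves locally, without needing network access.

```python
import socket

def lookup(name):
    """Ask the system's stub resolver to map a DNS name to an IPv4 address.

    Under the hood, the stub resolver hands the query to the configured
    local recursive resolver, which performs the full lookup on our behalf.
    """
    return socket.gethostbyname(name)

# localhost is typically resolved from the local hosts file, so this
# works even without network connectivity.
print(lookup("localhost"))
```

Note that gethostbyname is the traditional, IPv4-only interface; modern programs generally call getaddrinfo instead, which can return IPv6 addresses as well.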
Typically, the stub resolver issues a recursive lookup to the local resolver, meaning that it simply issues the query and waits for the response from the local resolver. The local resolver, on the other hand, issues a sequence of queries to the respective name servers for each part of the name hierarchy; the name server that is responsible for a particular part of the hierarchy is often called the authoritative name server for that domain. As we will see later, DNS uses caching, but caches can be out of date. The authoritative name server is, well, authoritative. It is by definition always correct. Before describing more detailed operation of DNS, we describe the DNS name server hierarchy and how names are allocated.
When a host’s stub resolver sends a query to the local resolver, the local resolver handles the resolution until it has the desired answer, or no answer. It does not return partial answers. On the other hand, the root name server (and each subsequent name server) does not recursively continue the query for the local name server. It just returns a partial answer and moves on to the next query. The local resolver is responsible for continuing the resolution by issuing further iterative queries.
The name resolution process typically involves both mechanisms. A recursive query may always seem preferable, but many name servers (especially the root) will not handle them. They are too busy. Iterative queries put the burden on the originator. The rationale for the local name server supporting a recursive query is that it is providing a service to hosts in its domain. Those hosts do not have to be configured to run a full name server, just to reach the local one. A 16-bit transaction identifier is included in each query and copied to the response so that a name server can match answers to the corresponding query, even if multiple queries are outstanding at the same time.
All of the answers, including all the partial answers returned, are cached. In this way, if a computer at cs.vu.nl queries for cs.uchicago.edu, the answer is cached. If shortly thereafter, another host at cs.vu.nl also queries cs.uchicago.edu, the answer will already be known. Even better, if a host queries for a different host in the same domain, say noise.cs.uchicago.edu, the query can be sent directly to the authoritative name server for cs.uchicago.edu. Similarly, queries for other domains in uchicago.edu can start directly from the uchicago.edu name server. Using cached answers greatly reduces the steps in a query and improves performance. The original scenario we sketched is in fact the worst case that occurs when no useful information is available in the cache.
Cached answers are not authoritative, since changes made at cs.uchicago.edu will not be propagated to all the caches in the world that may know about it. For this reason, cache entries should not live too long. This is the reason that the Time to live field is included in each DNS resource record, a part of the DNS database we will discuss shortly. It tells remote name servers how long to cache records. If a certain machine has had the same IP address for years, it may be safe to cache that information for one day. For more volatile information, it might be safer to purge the records after a few seconds or a minute.
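The TTL-driven caching behavior just described can be sketched in a few lines. This is an illustrative toy, not a real resolver cache; the name and (documentation-range) address below are examples only.

```python
import time

class DnsCache:
    """Toy TTL-respecting cache: a record is purged once its Time to
    live has elapsed, forcing a fresh lookup toward the authority."""

    def __init__(self):
        self._store = {}  # name -> (value, absolute expiry time)

    def put(self, name, value, ttl):
        self._store[name] = (value, time.time() + ttl)

    def get(self, name):
        entry = self._store.get(name)
        if entry is None:
            return None
        value, expires = entry
        if time.time() >= expires:   # record is stale: drop it
            del self._store[name]
            return None
        return value

cache = DnsCache()
cache.put("www.cs.uchicago.edu", "192.0.2.1", ttl=86400)  # stable: 1 day
print(cache.get("www.cs.uchicago.edu"))
```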
DNS queries have a simple format that includes various information, including the name being queried (QNAME), as well as other auxiliary information, such as a transaction identifier; the transaction identifier is often used to map queries to responses. Initially, the transaction ID was only 16 bits, and the queries and responses were not secured; this design choice left DNS vulnerable to a variety of attacks, including something called a cache poisoning attack, whose details we discuss further in Chap. 8. When performing a series of iterative lookups, a recursive DNS resolver might send the entire QNAME to the sequence of authoritative name servers returning the responses. At some point, protocol designers pointed out that sending the entire QNAME to every authoritative name server in a sequence of iterative lookups constituted a privacy risk. As a result, many recursive resolvers now use a process called QNAME minimization, whereby the local resolver only sends the part of the query that the respective authoritative name server has the information to resolve. For example, with QNAME minimization, given a name to resolve such as www.cs.uchicago.edu, a local resolver would send only the string cs.uchicago.edu to the authoritative name server for uchicago.edu, as opposed to the fully qualified domain name (FQDN), to avoid revealing the entire FQDN to the authoritative name server. For more information on QNAME minimization, see RFC 7816.
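The effect of QNAME minimization can be illustrated by listing what a minimizing resolver would send at each step of the iterative lookup, under the simplifying assumption of one query per label:

```python
def minimized_qnames(fqdn):
    """Return the name sent to each successive authoritative server
    under QNAME minimization (RFC 7816): each step reveals only one
    more label, instead of the full FQDN every time."""
    labels = fqdn.rstrip(".").split(".")
    return [".".join(labels[-i:]) for i in range(1, len(labels) + 1)]

for name in minimized_qnames("www.cs.uchicago.edu"):
    print(name)
# edu, then uchicago.edu, then cs.uchicago.edu, then www.cs.uchicago.edu
```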
Until very recently, DNS queries and responses relied on UDP as their transport protocol, based on the rationale that DNS exchanges needed to be fast and lightweight and could not bear the overhead of a TCP three-way handshake. However, various developments, including the insecurity of the DNS protocol and the myriad attacks that DNS has been subject to, ranging from cache poisoning to distributed denial-of-service (DDoS) attacks, have resulted in an increasing trend towards the use of TCP as the transport protocol for DNS. Using TCP as the transport protocol has in turn allowed DNS to leverage modern secure transport and application-layer protocols, resulting in DNS-over-TLS (DoT) and DNS-over-HTTPS (DoH). We discuss these developments in more detail later in this chapter.
If the DNS stub resolver does not receive a response within some relatively short period of time (a timeout period), the DNS client repeats the query, trying another server for the domain after a small number of retries. This process is designed to handle the case of the server being down as well as the query or response packet getting lost.
7.1.3 The DNS Name Space and Hierarchy
Managing a large and constantly changing set of names is challenging. In the postal system, name management is done by requiring letters to specify (implicitly or explicitly) the country, state or province, city, street address, and name of the addressee. Using this kind of hierarchical addressing ensures that there is no confusion between the Marvin Anderson on Main St. in White Plains, N.Y. and the Marvin Anderson on Main St. in Austin, Texas. DNS works the same way.
For the Internet, the top of the naming hierarchy is managed by an organization called ICANN (Internet Corporation for Assigned Names and Numbers). ICANN was created for this purpose in 1998, as part of the maturing of the Internet to a worldwide, economic concern. Conceptually, the Internet is divided into over 250 top-level domains, where each domain covers many hosts. Each domain is partitioned into subdomains, and these are further partitioned, and so on. All of these domains constitute a namespace hierarchy, which can be represented by a tree, as shown in Fig. 7-1. The leaves of the tree represent domains that have no subdomains (but do contain machines, of course). A leaf domain may contain a single host, or it may represent a company and contain thousands of hosts.
[Tree diagram: the root branches into generic domains (aero, com, edu, gov, museum, org, net) and country domains (au, jp, uk, us, nl), which branch further into subdomains such as cisco, acm, ieee, uchicago, vu, uwa, keio, and nec, down to individual hosts such as noise, flits, and fluit.]
Figure 7-1. A portion of the Internet domain name space.
The top-level domains have several different types: gTLD (generic Top Level Domain), ccTLD (country code Top Level Domain), and others. The generic TLDs, listed in Fig. 7-2, include the original domains from the 1980s, plus additional top-level domains introduced through ICANN. The country domains include one entry for every country, as defined in ISO 3166. Internationalized country domain names that use non-Latin alphabets were introduced in 2010. These domains let people name hosts in Arabic, Chinese, Cyrillic, Hebrew, or other languages.
In 2011, there were only 22 gTLDs, but in June 2011, ICANN voted to end restrictions on the creation of additional gTLDs, allowing companies and other
organizations to select essentially arbitrary top-level domains, including TLDs that include non-Latin characters (e.g., Cyrillic). ICANN began accepting applications for new TLDs at the beginning of 2012. The initial cost of applying for a new TLD was nearly 200,000 dollars. Some of the first new gTLDs became operational in 2013, and in July 2013, the first four new gTLDs were launched based on an agreement that was signed in Durban, South Africa. All four were based on non-Latin characters: the Arabic word for ‘‘Web,’’ the Russian word for ‘‘online,’’ the Russian word for ‘‘site,’’ and the Chinese word for ‘‘game.’’ Some tech giants have applied for many gTLDs: Google and Amazon, for example, have each applied for about 100 new gTLDs. Today, some of the most popular gTLDs include top, loan, xyz, and so forth.
Domain Intended use Start date Restricted?
com Commercial 1985 No
edu Educational institutions 1985 Yes
gov Government 1985 Yes
int International organizations 1988 Yes
mil Military 1985 Yes
net Network providers 1985 No
org Non-profit organizations 1985 No
aero Air transport 2001 Yes
biz Businesses 2001 No
coop Cooperatives 2001 Yes
info Informational 2002 No
museum Museums 2002 Yes
name People 2002 No
pro Professionals 2002 Yes
cat Catalan 2005 Yes
jobs Employment 2005 Yes
mobi Mobile devices 2005 Yes
tel Contact details 2005 Yes
travel Travel industry 2005 Yes
xxx Sex industry 2010 No
Figure 7-2. The original generic TLDs, as of 2010. As of 2020, there are more than 1,200 gTLDs.
Getting a second-level domain, such as name-of-company.com, is easy. The top-level domains are operated by companies called registries. They are appointed by ICANN. For example, the registry for com is Verisign. One level down, registrars sell domain names directly to users. There are many of them and they compete on price and service. Common registrars include Domain.com, GoDaddy, and
NameCheap. Fig. 7-3 shows the relationship between registries and registrars as far as registering a domain name is concerned.
[Diagram: a user registers a domain through a registrar, which in turn deals with the registry (e.g., Verisign); the registry is appointed by ICANN.]
Figure 7-3. The relationship between registries and registrars.
The domain name that a machine aims to look up is typically called a FQDN (Fully Qualified Domain Name), such as www.cs.uchicago.edu or cisco.com. The FQDN starts with the most specific part of the domain name, and each part of the hierarchy is separated by a ‘‘.’’ (Technically, all FQDNs end with a ‘‘.’’ as well, signifying the root of the DNS hierarchy, although most operating systems complete that portion of the domain name automatically.)
Each domain is named by the path upward from it to the (unnamed) root. The components are separated by periods (pronounced ‘‘dot’’). Thus, the engineering department at Cisco might be eng.cisco.com., rather than a UNIX-style name such as /com/cisco/eng. Notice that this hierarchical naming means that eng.cisco.com. does not conflict with a potential use of eng in eng.uchicago.edu., which might be used by the English department at the University of Chicago.
Domain names can be either absolute or relative. An absolute domain name always ends with a period (e.g., eng.cisco.com.), whereas a relative one does not. Relative names have to be interpreted in some context to uniquely determine their true meaning. In both cases, a named domain refers to a specific node in the tree and all the nodes under it.
Domain names are case-insensitive, so edu, Edu, and EDU mean the same thing. Component names can be up to 63 characters long, and full path names must not exceed 255 characters. The fact that DNS is case insensitive has been used to defend against various DNS attacks, including DNS cache poisoning attacks, using a technique called 0x20 encoding (Dagon et al., 2008), which we will discuss in more detail later in this chapter.
In principle, domains can be inserted into the hierarchy in either the generic or the country domains. For example, the domain cc.gatech.edu could equally well be listed under the us country domain as cc.gt.atl.ga.us. In practice, however, most organizations in the United States are under generic domains,
and most outside the United States are under the domain of their country. There is no rule against registering under multiple top-level domains. Large companies often do so (e.g., sony.com, sony.net, and sony.nl).
Each domain controls how it allocates the domains under it. For example, Japan has domains ac.jp and co.jp that mirror edu and com. The Netherlands does not make this distinction and puts all organizations directly under nl. Australian universities are all in edu.au. Thus, all three of the following are university CS and EE departments:
1. cs.uchicago.edu (University of Chicago, in the U.S.).
2. cs.vu.nl (Vrije Universiteit, in The Netherlands).
3. ee.uwa.edu.au (University of Western Australia).
To create a new domain, permission is required of the domain in which it will be included. For example, if a security research group at the University of Chicago wants to be known as security.cs.uchicago.edu, it has to get permission from whoever manages cs.uchicago.edu. (Fortunately, that person is typically not far away, thanks to the federated management architecture of DNS.) Similarly, if a new university is chartered, say, the University of Northern South Dakota, it must ask the manager of the edu domain to assign it unsd.edu (if that is still available). In this way, name conflicts are avoided and each domain can keep track of all its subdomains. Once a new domain has been created and registered, it can create subdomains, such as cs.unsd.edu, without getting permission from anybody higher up the tree.
Naming follows organizational boundaries, not physical networks. For example, if the computer science and electrical engineering departments are located in the same building and share the same LAN, they can nevertheless have distinct domains. Similarly, even if computer science is split over Babbage Hall and Turing Hall, the hosts in both buildings will normally belong to the same domain.
7.1.4 DNS Queries and Responses
We now turn to the structure, format, and purpose of DNS queries, and how the DNS servers answer those queries.
DNS Queries
As previously discussed, a DNS client typically issues a query to a local recursive resolver, which performs an iterative query to ultimately resolve the query. The most common query type is an A record query, which asks for a mapping from a domain name to an IP address for a corresponding Internet endpoint. DNS has a range of other resource records (with corresponding queries), as we discuss further in the next section on resource records (i.e., responses).
Although the primary mechanism for DNS has long been to map human-readable names to IP addresses, over the years, DNS queries have been used for a variety of other purposes. Another common use for DNS queries is to look up domains in a DNSBL (DNS-based blacklist), a list that is commonly maintained to keep track of IP addresses associated with spammers and malware. To look up a domain name in a DNSBL, a client might send a DNS A-record query to a special DNS server, such as pbl.spamhaus.org (a ‘‘policy blacklist’’), which corresponds to a list of IP addresses that are not supposed to be making connections to mail servers. To look up a particular IP address, a client simply reverses the octets for the IP address and prepends the result to pbl.spamhaus.org.
For example, to look up 127.0.0.2, a client would simply issue a query for 2.0.0.127.pbl.spamhaus.org. If the corresponding IP address was in the list, the DNS query would return an IP address that typically encodes some additional information, such as the provenance of that entry in the list. If the IP address is not contained in the list, the DNS server would indicate that by responding with the corresponding NXDOMAIN response, corresponding to ‘‘no such domain.’’
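Constructing the DNSBL query name is a purely mechanical string manipulation, sketched below (the default blacklist domain is the one named in the text):

```python
def dnsbl_qname(ipv4, blacklist="pbl.spamhaus.org"):
    """Build the query name for a DNSBL lookup: reverse the octets of
    the IPv4 address and prepend them to the blacklist's domain."""
    return ".".join(reversed(ipv4.split("."))) + "." + blacklist

print(dnsbl_qname("127.0.0.2"))  # 2.0.0.127.pbl.spamhaus.org
```

The client would then issue an ordinary A-record query for the resulting name and interpret the answer (or NXDOMAIN) as described above.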
Extensions and Enhancements to DNS Queries
DNS queries have become more sophisticated and complex over time, as the need to serve clients with increasingly specific and relevant information has grown, and as security concerns have mounted. One significant extension to DNS queries in recent years has been the EDNS0 CS (Extended DNS Client Subnet, or simply EDNS Client Subnet) option, whereby a client’s local recursive resolver passes the IP address subnet of the stub resolver to the authoritative name server.
The EDNS0 CS mechanism allows the authoritative name server for a domain name to learn the IP subnet of the client that initially performed the query. Knowing this information can typically allow an authoritative DNS server to perform a more effective mapping to a nearby copy of a replicated service. For example, if a client issues a query for google.com, the authoritative name server for Google would typically want to return a name that corresponds to a front-end server that is close to the client. The ability to do so of course depends on knowing where on the network (and, ideally, where in the world, geographically) the client is located. Ordinarily, an authoritative name server might only see the IP address of the local recursive resolver.
If the client that initiated the query happens to be located near its respective local resolver, then the authoritative server for that domain could determine an appropriate client mapping simply from the location of the local recursive resolver. Increasingly, however, clients have begun to use local recursive resolvers that may have IP addresses that make it difficult to locate the client. For example, Google and Cloudflare both operate public DNS resolvers (8.8.8.8 and 1.1.1.1, respectively). If a client is configured to use one of these local recursive resolvers, then
the authoritative name server does not learn much useful information from the IP address of the recursive resolver. EDNS0 CS solves this problem by including the IP subnet in the query from the local recursive, so that the authoritative can see the IP subnet of the client that initiated the query.
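A key privacy detail is that EDNS0 CS carries only a subnet, not the client’s full address. A minimal sketch of the truncation, assuming the common /24 prefix length for IPv4:

```python
import ipaddress

def client_subnet(client_ip, prefix_len=24):
    """Truncate a client address to the subnet that a recursive resolver
    would place in the EDNS0 Client Subnet option (host bits zeroed)."""
    network = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
    return str(network)

print(client_subnet("203.0.113.77"))  # 203.0.113.0/24
```

The authoritative server thus learns enough to pick a nearby replica, without being handed the exact client address.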
As previously noted, the names in DNS queries are not case sensitive. This characteristic has allowed modern DNS resolvers to include additional bits of a transaction ID in the query by setting each character in a QNAME to an arbitrary case. A 16-bit transaction ID is vulnerable to various cache poisoning attacks, including the Kaminsky attack described in Chap. 8; this vulnerability partially arises because the transaction ID is only 16 bits. Increasing the number of bits in the transaction ID would require changing the DNS protocol specification, which is a massive undertaking.
An alternative was developed, usually called 0x20 encoding, whereby a local recursive resolver toggles the case of each letter in the QNAME (e.g., uchicago.edu might become uCHicaGO.EDu or similar), allowing each letter in the domain name to encode an additional bit for the DNS transaction ID. The catch, of course, is that no other resolver should alter the case of the QNAME in subsequent iterative queries or responses. If the casing is preserved, then the corresponding reply contains the QNAME with the original casing indicated by the local recursive resolver, effectively adding bits to the transaction identifier. The whole thing is an ugly hack, but such is the nature of trying to change widely deployed software while maintaining backward compatibility.
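The idea can be sketched in a few lines of Python; the random casing and the exact-match check stand in for what a 0x20-encoding resolver does when sending a query and validating the reply:

```python
import random

def encode_0x20(qname, rng=random):
    """Randomize the case of each letter in a QNAME. DNS matching is
    case-insensitive, so servers resolve the name normally, but the
    casing pattern itself now carries extra transaction-ID bits."""
    return "".join(
        ch.upper() if ch.isalpha() and rng.random() < 0.5 else ch.lower()
        for ch in qname
    )

def reply_matches(sent_qname, reply_qname):
    """Accept a reply only if it echoes the exact casing we sent."""
    return sent_qname == reply_qname

query = encode_0x20("uchicago.edu")
print(query)                        # e.g., uCHicaGO.EDu
print(reply_matches(query, query))  # a faithful echo is accepted
```

An off-path attacker forging a response must now guess not only the 16-bit transaction ID but also the casing pattern, one extra bit per letter.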
DNS Responses and Resource Records
Every domain, whether it is a single host or a top-level domain, can have a set of resource records associated with it. These records are the DNS database. For a single host, the most common resource record is just its IP address, but many other kinds of resource records also exist. When a resolver gives a domain name to DNS, what it gets back are the resource records associated with that name. Thus, the primary function of DNS is to map domain names onto resource records.
A resource record is a five-tuple. Although resource records are encoded in binary, in most expositions resource records are presented as ASCII text, with one line per resource record, as follows:
Domain name Time to live Class Type Value
The Domain name tells the domain to which this record applies. Normally, many records exist for each domain, and each copy of the database holds information about multiple domains. This field is thus the primary search key used to satisfy queries. The order of the records in the database is not significant.
The Time to live field gives an indication of how stable the record is. Information that is highly stable is assigned a large value, such as 86400 (the number of seconds in 1 day). Information that is volatile (like stock prices), or that operators
may want to change frequently (e.g., to enable load balancing a single name across multiple IP addresses) may be assigned a small value, such as 60 seconds (1 minute). We will return to this point later when we have discussed caching.
The third field of every resource record is the Class. For Internet information, it is always IN. For non-Internet information, other codes can be used, but in practice these are rarely seen.
The Type field tells what kind of record this is. There are many kinds of DNS records. The important types are listed in Fig. 7-4.
Type Meaning Value
SOA Start of authority Parameters for this zone
A IPv4 address of a host 32-Bit integer
AAAA IPv6 address of a host 128-Bit integer
MX Mail exchange Priority, domain willing to accept email
NS Name server Name of a server for this domain
CNAME Canonical name Domain name
PTR Pointer Alias for an IP address
SPF Sender policy framework Text encoding of mail sending policy
SRV Service Host that provides it
TXT Text Descriptive ASCII text
Figure 7-4. The principal DNS resource record types.
An SOA record provides the name of the primary source of information about the name server’s zone (described below), the email address of its administrator, a unique serial number, and various flags and timeouts.
Common Record Types
The most important record type is the A (Address) record. It holds a 32-bit IPv4 address of an interface for some host. The corresponding AAAA, or ‘‘quad A,’’ record holds a 128-bit IPv6 address. Every Internet host must have at least one IP address so that other machines can communicate with it. Some hosts have two or more network interfaces, so they will have two or more type A or AAAA resource records. Additionally, a single service (e.g., google.com) may be hosted on many geographically distributed machines around the world (Calder et al., 2013). In these cases, a DNS resolver might return multiple IP addresses for a single domain name. In the case of a geographically distributed service, a resolver may return to its client one or more IP addresses of a server that is close to the client (geographically or topologically), to improve performance, and for load balancing.
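The multiple-address behavior is visible through the standard getaddrinfo interface, which returns every address the resolver knows for a name. Using localhost keeps this sketch independent of the network:

```python
import socket

# Ask the resolver for every address associated with a name.
# A replicated service would typically return several; localhost
# usually yields its IPv4 (and possibly IPv6) loopback addresses.
infos = socket.getaddrinfo("localhost", 80, proto=socket.IPPROTO_TCP)
addresses = sorted({info[4][0] for info in infos})
print(addresses)
```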
An important record type is the NS record. It specifies a name server for the domain or subdomain. This is a host that has a copy of the database for a domain. It is used as part of the process to look up names, which we will describe shortly.
Another record type is the MX record. It specifies the name of the host prepared to accept email for the specified domain. It is used because not every machine is prepared to accept email. If someone wants to send email to, as an example, bill@microsoft.com, the sending host needs to find some mail server located at microsoft.com that is willing to accept email. The MX record can provide this information.
CNAME records allow aliases to be created. For example, a person familiar with Internet naming in general and wanting to send a message to user paul in the computer science department at the University of Chicago might guess that paul@cs.chicago.edu will work. Actually, this address will not work, because the domain for the computer science department is cs.uchicago.edu. As a service to people who do not know this, the University of Chicago could create a CNAME entry to point people and programs in the right direction. An entry like this one might do the job:
www.cs.uchicago.edu 120 IN CNAME hnd.cs.uchicago.edu
CNAMEs are commonly used for Web site aliases, because the common Web server addresses (which often start with www) tend to be hosted on machines that serve multiple purposes and whose primary name is not www.
The PTR record points to another name and is typically used to associate an IP address with a corresponding name. PTR lookups that associate a name with a corresponding IP address are typically called reverse lookups.
SRV is a newer type of record that allows a host to be identified for a given service in a domain. For example, the Web server for www.cs.uchicago.edu could be identified as hnd.cs.uchicago.edu. This record generalizes the MX record, which performs the same task but just for mail servers.
SPF lets a domain encode information about what machines in the domain will send mail to the rest of the Internet. This helps receiving machines check that mail is valid. If mail is being received from a machine that calls itself dodgy but the domain records say that mail will only be sent out of the domain by a machine called smtp, chances are that the mail is forged junk mail.
Last on the list, TXT records were originally provided to allow domains to identify themselves in arbitrary ways. Nowadays, they usually encode machine-readable information, typically the SPF information.
Finally, we have the Value field. This field can be a number, a domain name, or an ASCII string. The semantics depend on the record type. A short description of the Value fields for each of the principal record types is given in Fig. 7-4.
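The five-tuple structure makes the textual form easy to parse. A minimal sketch, assuming the simple one-record-per-line format shown above (real zone files allow defaults and omitted fields that this ignores):

```python
def parse_rr(line):
    """Split one textual resource record into its five-tuple:
    (Domain name, Time to live, Class, Type, Value). The Value is
    kept as a single string, since its syntax depends on the Type."""
    name, ttl, klass, rtype, value = line.split(None, 4)
    return name, int(ttl), klass, rtype, value

record = parse_rr("www.cs.uchicago.edu 120 IN CNAME hnd.cs.uchicago.edu")
print(record)
```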
DNSSEC Records
The original deployment of DNS did not consider the security of the protocol. In particular, DNS name servers or resolvers could manipulate the contents of any DNS record, thus causing the client to receive incorrect information. RFC 3833
highlights some of the various security threats to DNS and how DNSSEC addresses these threats. DNSSEC records allow responses from DNS name servers to carry digital signatures, which the local or stub resolver can subsequently verify to ensure that the DNS records were not modified or tampered with. Each DNS server computes a hash (a kind of long checksum) of the RRSET (Resource Record Set) for each set of resource records of the same type, and signs it with its private cryptographic keys. Corresponding public keys can be used to verify the signatures on the RRSETs. (For those not familiar with cryptography, Chap. 8 provides some technical background.)
Verifying the signature of an RRSET with the name server’s corresponding public key of course requires verifying the authenticity of that public key. This verification can be accomplished if one authoritative name server’s public key is signed by the parent name server in the name hierarchy. For example, the .edu authoritative name server might sign the public key corresponding to the uchicago.edu authoritative name server, and so forth.
DNSSEC has two resource records relating to public keys: (1) the RRSIG record, which corresponds to a signature over the RRSET, signed with the corresponding authoritative name server’s private key, and (2) the DNSKEY record, which is the public key for the corresponding RRSET, which is signed by the parent’s private key. This hierarchical structure for signatures allows DNSSEC public keys for the name server hierarchy to be distributed in-band. Only the root-level public keys must be distributed out-of-band, and those keys can be distributed in the same way that resolvers come to know about the IP addresses of the root name servers. Chap. 8 discusses DNSSEC in more detail.
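The hash-then-sign step can be pictured with a small sketch. Real DNSSEC canonicalization and signing (RFC 4034) are considerably more involved and use public-key algorithms such as RSA or ECDSA, which the Python standard library does not provide; the sketch below shows only the digest step over an RRSET, using the illustrative flits A records as input.

```python
# Sketch of the digest computed over an RRSET before signing. A real
# signer would sign this digest with the zone's private key, and a
# validator would recompute the digest and verify the RRSIG against it.
import hashlib

rrset = [
    "flits.cs.vu.nl. 86400 IN A 130.37.16.112",
    "flits.cs.vu.nl. 86400 IN A 192.31.231.165",
]

def rrset_digest(records):
    # Canonical ordering makes the digest independent of record order.
    canonical = "\n".join(sorted(r.lower() for r in records))
    return hashlib.sha256(canonical.encode()).hexdigest()

print(rrset_digest(rrset))
```

Because the records are sorted before hashing, any server or resolver that holds the same RRSET computes the same digest, regardless of the order in which the records arrived.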
DNS Zones
Fig. 7-5 shows an example of the type of information that might be available in the DNS for a particular domain name. This figure depicts part of a (hypothetical) database for the cs.vu.nl domain shown in Fig. 7-1, which is often called a DNS zone file, or sometimes simply DNS zone for short. This zone file contains seven types of resource records.
The first noncomment line of Fig. 7-5 gives some basic information about the domain, which will not concern us further. Then come two entries giving the first and second places to try to deliver email sent to person@cs.vu.nl. The machine zephyr should be tried first. If that fails, top should be tried as the next choice. The next line identifies the name server for the domain as star.
After the blank line (added for readability) come lines giving the IP addresses for star, zephyr, and top. These are followed by an alias, www.cs.vu.nl, so that this address can be used without designating a specific machine. Creating this alias allows cs.vu.nl to change its World Wide Web server without invalidating the address people use to get to it. A similar argument holds for ftp.cs.vu.nl.
; Authoritative data for cs.vu.nl
cs.vu.nl. 86400 IN SOA star boss (9527,7200,7200,241920,86400)
cs.vu.nl. 86400 IN MX 1 zephyr
cs.vu.nl. 86400 IN MX 2 top
cs.vu.nl. 86400 IN NS star
star 86400 IN A 130.37.56.205
zephyr 86400 IN A 130.37.20.10
top 86400 IN A 130.37.20.11
www 86400 IN CNAME star.cs.vu.nl
ftp 86400 IN CNAME zephyr.cs.vu.nl
flits 86400 IN A 130.37.16.112
flits 86400 IN A 192.31.231.165
flits 86400 IN MX 1 flits
flits 86400 IN MX 2 zephyr
flits 86400 IN MX 3 top
rowboat IN A 130.37.56.201
IN MX 1 rowboat
IN MX 2 zephyr
little-sister IN A 130.37.62.23
laserjet IN A 192.31.231.216
Figure 7-5. A portion of a possible DNS database (zone file) for cs.vu.nl.
The section for the machine flits lists two IP addresses, and three choices are given for handling email sent to flits.cs.vu.nl. The first choice is naturally flits itself, but if it is down, zephyr and top are the second and third choices.
The next three lines contain a typical entry for a computer, in this example, rowboat.cs.vu.nl. The information provided contains the IP address and the primary and secondary mail drops. Then comes an entry for a computer that is not capable of receiving mail itself, followed by an entry that is likely for a printer (laserjet) that is connected to the Internet.
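The zone-file layout above is regular enough that a few lines of code can turn it into records. The sketch below parses only the simple one-record-per-line form shown in Fig. 7-5 (class IN, optional TTL, and the convention that a line beginning with the class reuses the previous owner name, as rowboat's MX lines do); real zone-file parsers also handle $ directives, multi-line parentheses, and other record classes.

```python
# Minimal sketch of a zone-file parser for lines like those in Fig. 7-5.
# Assumes class IN and a "name [TTL] IN type value" layout.

def parse_zone(text, default_ttl=86400):
    records = []
    last_name = None
    for line in text.splitlines():
        line = line.split(";")[0].strip()   # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if fields[0] == "IN":
            # Name (and TTL) omitted: reuse the previous owner name.
            name, ttl, rest = last_name, default_ttl, fields[1:]
        elif fields[1] == "IN":
            # TTL omitted.
            name, ttl, rest = fields[0], default_ttl, fields[2:]
        else:
            name, ttl, rest = fields[0], int(fields[1]), fields[3:]
        last_name = name
        records.append((name, ttl, rest[0], " ".join(rest[1:])))
    return records

zone = """\
; Authoritative data for cs.vu.nl
cs.vu.nl. 86400 IN MX 1 zephyr
cs.vu.nl. 86400 IN NS star
star 86400 IN A 130.37.56.205
rowboat IN A 130.37.56.201
IN MX 1 rowboat
"""
for rec in parse_zone(zone):
    print(rec)
```

Running this prints one (name, TTL, type, value) tuple per record, with the final MX record correctly attributed to rowboat even though its line omits the name.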
In theory at least, a single name server could contain the entire DNS database and respond to all queries about it. In practice, this server would be so overloaded as to be useless. Furthermore, if it ever went down, the entire Internet would be crippled.
To avoid the problems associated with having only a single source of infor- mation, the DNS name space is divided into nonoverlapping zones. One possible way to divide the name space of Fig. 7-1 is shown in Fig. 7-6. Each circled zone contains some part of the tree.
Where the zone boundaries are placed within a zone is up to that zone’s administrator. This decision is made in large part based on how many name servers are desired, and where. For example, in Fig. 7-6, the University of Chicago has a zone for uchicago.edu that handles traffic to cs.uchicago.edu. However, it does not handle eng.uchicago.edu. That is a separate zone with its own name servers. Such a decision might be made when a department such as English does not wish to run its own name server, but a department such as Computer Science does.

Figure 7-6. Part of the DNS name space divided into zones (which are circled). [Figure not reproduced: a tree with generic top-level domains (aero, com, edu, gov, museum, org, net) and country-code domains (au, jp, uk, us, nl), with subtrees such as cisco, acm, ieee, uchicago (containing cs and eng), vu (with cs containing flits and fluit), ac, co, uwa, keio, nec, and others.]
7.1.5 Name Resolution
Each zone is associated with one or more name servers. These are hosts that hold the database for the zone. Normally, a zone will have one primary name server, which gets its information from a file on its disk, and one or more secondary name servers, which get their information from the primary name server. To improve reliability, some of the name servers can be located outside the zone.
The process of looking up a name and finding an address is called name resolution. When a resolver has a query about a domain name, it passes the query to a local name server. If the domain being sought falls under the jurisdiction of the name server, such as top.cs.vu.nl falling under cs.vu.nl, it returns the authoritative resource records. An authoritative record is one that comes from the authority that manages the record and is thus always correct. Authoritative records are in contrast to cached records, which may be out of date.
What happens when the domain is remote, such as when flits.cs.vu.nl wants to find the IP address of cs.uchicago.edu at the University of Chicago? In this case, and if there is no cached information about the domain available locally, the name server begins a remote query. This query follows the process shown in Fig. 7-7. Step 1 shows the query that is sent to the local name server. The query contains the domain name sought, the type (A), and the class (IN).
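Those three fields (name, type, class) appear literally in the DNS wire format. The sketch below hand-builds the minimal query packet that would be sent in step 1: a 12-byte header followed by the question section, with the name encoded as length-prefixed labels, QTYPE A = 1, and QCLASS IN = 1. The transaction ID is arbitrary, and the flags word here only sets "recursion desired."

```python
# Hedged sketch of a DNS query in wire format (per RFC 1035): header plus
# one question consisting of QNAME, QTYPE (A=1), and QCLASS (IN=1).
import struct

def build_query(name, txid=0x1234):
    # Header: id, flags (RD bit set), QDCOUNT=1, AN/NS/AR counts = 0.
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)   # QTYPE=A, QCLASS=IN
    return header + question

packet = build_query("cs.uchicago.edu")
print(packet.hex())
```

A resolver would send these bytes over UDP to port 53 of the name server (for example, with socket.sendto) and parse the matching response; that part is omitted here.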
Figure 7-7. Example of a resolver looking up a remote name in 10 steps. [Figure not reproduced: the originator, flits.cs.vu.nl, sends the query noise.cs.uchicago.edu to the local cs.vu.nl resolver (step 1). The local resolver queries the root name server a.root-servers.net (step 2), which refers it to the edu name server (step 3); it queries a.edu-servers.net (step 4), which refers it to uchicago.edu (step 5); it queries the uchicago name server (step 6), which refers it to cs.uchicago.edu (step 7); finally it queries the uchicago cs name server (step 8), which returns 128.135.24.19 (step 9), and the local resolver forwards 128.135.24.19 to the originator (step 10).]
The next step is to start at the top of the name hierarchy by asking one of the root name servers. These name servers have information about each top-level domain. This is shown as step 2 in Fig. 7-7. To contact a root server, each name server must have information about one or more root name servers. This information is normally present in a system configuration file that is loaded into the DNS cache when the DNS server is started. It is simply a list of NS records for the root and the corresponding A records.
There are 13 root DNS servers, unimaginatively called a.root-servers.net through m.root-servers.net. Each root server could logically be a single computer. However, since the entire Internet depends on the root servers, they are powerful and heavily replicated computers. Most of the servers are present in multiple geographical locations and reached using anycast routing, in which a packet is delivered to the nearest instance of a destination address; we described anycast in Chap. 5. The replication improves reliability and performance.
The root name server is very unlikely to know the address of a machine at uchicago.edu, and probably does not know the name server for uchicago.edu either. But it must know the name server for the edu domain, in which cs.uchicago.edu is located. It returns the name and IP address for that part of the answer in step 3.
The local name server then continues its quest. It sends the entire query to the edu name server (a.edu-servers.net). That name server returns the name server for uchicago.edu. This is shown in steps 4 and 5. Closer now, the local name server sends the query to the uchicago.edu name server (step 6). If the domain name being sought was in the English department, the answer would be found, as the uchicago.edu zone includes the English department. The Computer Science department has chosen to run its own name server. The query returns the name and IP address of the uchicago.edu Computer Science name server (step 7).
Finally, the local name server queries the uchicago.edu Computer Science name server (step 8). This server is authoritative for the domain cs.uchicago.edu, so it must have the answer. It returns the final answer (step 9), which the local name server forwards as a response to flits.cs.vu.nl (step 10).
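The referral chain above can be modeled in a few lines of code. The sketch below replaces real name servers with in-memory dictionaries and follows NS referrals from the root down to the authoritative answer; the server names and the address 128.135.24.19 are the illustrative ones from the walkthrough, and everything else is invented for the sketch.

```python
# Toy model of the iterative lookup in Fig. 7-7. Each "server" is a dict
# mapping a zone suffix to either a referral (NS) or an answer (A).

ROOT = {"edu.": ("NS", "a.edu-servers.net")}
EDU = {"uchicago.edu.": ("NS", "uchicago-ns")}
UCHICAGO = {"cs.uchicago.edu.": ("NS", "uchicago-cs-ns")}
UCHICAGO_CS = {"noise.cs.uchicago.edu.": ("A", "128.135.24.19")}
SERVERS = {
    "a.root-servers.net": ROOT,
    "a.edu-servers.net": EDU,
    "uchicago-ns": UCHICAGO,
    "uchicago-cs-ns": UCHICAGO_CS,
}

def resolve(name, server="a.root-servers.net"):
    """Follow NS referrals from the root until an A record is found."""
    zone = SERVERS[server]
    for suffix, (rtype, value) in zone.items():
        if name == suffix or name.endswith("." + suffix):
            if rtype == "A":
                return value                      # authoritative answer
            return resolve(name, server=value)    # referral: ask next server
    raise KeyError(name)

print(resolve("noise.cs.uchicago.edu."))
```

Each recursive call corresponds to one query/referral round trip in the figure: root, then edu, then uchicago, then the Computer Science name server, which returns the final address.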
7.1.6 Hands on with DNS
You can explore this process using standard tools such as the dig program that is installed on most UNIX systems. For example, typing
dig ns @a.edu-servers.net cs.uchicago.edu
will send a query for cs.uchicago.edu to the a.edu-servers.net name server and print out the result for its name servers. This will show you the information obtained in step 4 in the example above, and you will learn the name and IP address of the uchicago.edu name servers. Most organizations will have multiple name servers in case one is down. Half a dozen is not unusual. If you have access to a UNIX, Linux, or MacOS system, try experimenting with the dig program to see what it can do. You can learn a lot about DNS from using it. (The dig program is also available for Windows, but you may have to install it yourself.)
Even though its purpose is simple, it should be clear that DNS is a large and complex distributed system comprising millions of name servers that work together. It forms a key link between human-readable domain names and the IP addresses of machines. It includes replication and caching for performance and reliability and is designed to be highly robust.
Some applications need to use names in more flexible ways, for example, by naming content and resolving to the IP address of a nearby host that has the content. This fits the model of searching for and downloading a movie. It is the movie that matters, not the computer that has a copy of it, so all that is wanted is the IP address of any nearby computer that has a copy of the movie. Content delivery networks are one way to accomplish this mapping. We will describe how they build on the DNS later in this chapter, in Sec. 7.5.
7.1.7 DNS Privacy
Historically, DNS queries and responses have not been encrypted. As a result, any other device or eavesdropper on the network (e.g., other devices, a system administrator, a coffee shop network) could conceivably observe a user’s DNS traffic and determine information about that user. For example, a lookup to a site like uchicago.edu might indicate that a user was browsing the University of Chicago Web site. While such information might seem innocuous, DNS lookups to Web sites such as webmd.com might indicate that a user was performing medical research. Combinations of lookups combined with other information can often reveal even more specific information, possibly even the precise Web site that a user is visiting.
Privacy issues associated with DNS queries have become more contentious when considering emerging applications, such as the Internet of Things (IoT) and smart homes. For example, the DNS queries that a device issues can reveal information about the type of devices that users have in their smart homes and the extent to which they are interacting with those devices. For example, the DNS queries that an Internet-connected camera or sleep monitor issues can uniquely identify the device (Apthorpe et al., 2019). Given the increasingly sensitive activities that people perform on Internet-connected devices, from browsers to Internet-connected ‘‘smart’’ devices, there is an increasing desire to encrypt DNS queries and responses.
Several recent developments are poised to potentially reshape DNS entirely. The first is the movement toward encrypting DNS queries and responses. Various organizations, including Cloudflare, Google, and others are now offering users the opportunity to direct their DNS traffic to their own local recursive resolvers, and additionally offering support for encrypted transport (e.g., TLS, HTTPS) between the DNS stub resolver and their local resolver. In some cases, these organizations are partnering with Web browser manufacturers (e.g., Mozilla) to potentially direct all DNS traffic to these local resolvers by default.
If all DNS queries and responses are exchanged with cloud providers over encrypted transport by default, the implications for the future of the Internet architecture could be extremely significant. Specifically, Internet service providers will no longer have the ability to observe DNS queries from their subscribers’ home networks, which has, in the past, been one of the primary ways that ISPs monitor these networks for infections and malware (Antonakakis et al., 2010). Other functions, such as parental controls and various other services that ISPs offer, also depend on seeing DNS traffic.
Ultimately, two somewhat orthogonal issues are at play. The first is the shift of DNS towards encrypted transport, which almost everyone would agree is a positive change (there were initial concerns about performance, which have mostly now been addressed). The second issue is thornier: it involves who gets to operate the local recursive resolvers. Previously, the local recursive resolver was generally operated by a user’s ISP; if DNS resolution moves to the browser, however, via DoH, then the browsers (the two most popular of which are at this point largely controlled by a single dominant provider, Google) can control who is in a position to observe DNS traffic. Ultimately, the operator of the local recursive resolver can see the DNS queries from the user and associate those with an IP address; whether the user wants their ISP or a large advertising company to see their DNS traffic should be their choice, but the default settings in the browser may ultimately determine who ends up seeing the majority of this traffic. Presently, a wide range of organizations, from ISPs to content providers and advertising companies, are trying to establish what are being called TRRs (Trusted Recursive Resolvers), which are local recursive resolvers that use DoT or DoH to resolve queries for clients. Time will tell how these developments ultimately reshape the DNS architecture.
Even DoT and DoH do not completely resolve all DNS-related privacy concerns, because the operator of the local resolver must still be trusted with sensitive information: namely, the DNS queries and the IP addresses of the clients that issued those queries. Other recent enhancements to DNS and DoH have been proposed, including oblivious DNS (Schmitt et al., 2019) and oblivious DoH (Kinnear et al., 2019), whereby the stub resolver encrypts the original query to the local recursive resolver, which in turn sends the encrypted query to an authoritative name server that can decrypt and resolve the query, but does not know the identity or IP address of the stub resolver that initiated the query. Figure 7-8 shows this relationship.
Figure 7-8. Oblivious DNS. [Figure not reproduced: the client’s stub resolver sends an encrypted query to the recursive resolver, which sees the IP address of the stub but not the query; the ODNS authoritative server at the University of Chicago can decrypt the query but does not know the stub resolver’s IP address.]
Most of these implementations are still nascent, in the forms of early prototypes and draft standards being discussed in the DNS privacy working group at IETF.
7.1.8 Contention Over Names
As the Internet has become more commercial and more international, it has also become more contentious, especially in matters related to naming. This controversy includes ICANN itself. For example, the creation of the xxx domain took several years and court cases to resolve. Is voluntarily placing adult content in its own domain a good or a bad thing? (Some people did not want adult content available at all on the Internet while others wanted to put it all in one domain so nanny filters could easily find and block it from children.) Some of the domains self-organize, while others have restrictions on who can obtain a name, as noted earlier. But what restrictions are appropriate? Take the pro domain, for example. It is for qualified professionals. But who, exactly, is a professional? Doctors and lawyers clearly are professionals. But what about freelance photographers, piano teachers, magicians, plumbers, barbers, exterminators, tattoo artists, mercenaries, and prostitutes? Are these occupations eligible? According to whom?
There is also money in names. Tuvalu (a tiny island country midway between Hawaii and Australia) sold a lease on its tv domain for $50 million, all because the country code is well-suited to advertising television sites. Virtually every common (English) word has been taken in the com domain, along with the most common misspellings. Try household articles, animals, plants, body parts, etc. The practice of registering a domain only to turn around and sell it off to an interested party at a much higher price even has a name. It is called cybersquatting. Many companies that were slow off the mark when the Internet era began found their obvious domain names already taken when they tried to acquire them. In general, as long as no trademarks are being violated and no fraud is involved, it is first-come, first-served with names. Nevertheless, policies to resolve naming disputes are still being refined.
7.2 ELECTRONIC MAIL
Electronic mail, or more commonly email, has been around for over four decades. Faster and cheaper than paper mail, email has been a popular application since the early days of the Internet. Before 1990, it was mostly used in academia. During the 1990s, it became known to the public at large and grew exponentially, to the point where the number of emails sent per day now is vastly more than the number of snail mail (i.e., paper) letters. Other forms of network communication, such as instant messaging and voice-over-IP calls, have expanded greatly in use over the past decade, but email remains the workhorse of Internet communication. It is widely used within industry for intracompany communication, for example, to allow far-flung employees all over the world to cooperate on complex projects. Unfortunately, like paper mail, the majority of email—some 9 out of 10 messages—is junk mail or spam. While mail systems can remove much of it nowadays, a lot still gets through and research into detecting it all is ongoing; for example, see Dan et al. (2019) and Zhang et al. (2019).
Email, like most other forms of communication, has developed its own conventions and styles. It is very informal and has a low threshold of use. People who would never dream of calling up or even writing a letter to a Very Important Person do not hesitate for a second to send a sloppily written email to him or her. By eliminating most cues associated with rank, age, and gender, email debates often focus on content, not status. With email, a brilliant idea from a summer student can have more impact than a dumb one from an executive vice president.
Email is full of jargon such as BTW (By The Way), ROTFL (Rolling On The Floor Laughing), and IMHO (In My Humble Opinion). Many people also use little ASCII symbols called smileys, starting with the ubiquitous ‘‘:-)’’. This symbol and other emoticons help to convey the tone of the message. They have spread to other terse forms of communication, such as instant messaging, typically as graphical emoji. Many smartphones have hundreds of emojis available.
The email protocols have evolved during the period of their use, too. The first email systems simply consisted of file transfer protocols, with the convention that the first line of each message (i.e., file) contained the recipient’s address. As time went on, email diverged from file transfer and many features were added, such as the ability to send one message to a list of recipients. Multimedia capabilities became important in the 1990s to send messages with images and other non-text material. Programs for reading email became much more sophisticated too, shifting from text-based to graphical user interfaces and adding the ability for users to access their mail from their laptops wherever they happen to be. Finally, with the prevalence of spam, email systems now pay attention to finding and removing unwanted email.
In our description of email, we will focus on the way that mail messages are moved between users, rather than the look and feel of mail reader programs. Nevertheless, after describing the overall architecture, we will begin with the user-facing part of the email system, as it is familiar to most readers.
7.2.1 Architecture and Services
In this section, we will provide an overview of how email systems are organized and what they can do. The architecture of the email system is shown in Fig. 7-9. It consists of two kinds of subsystems: the user agents, which allow people to read and send email, and the message transfer agents, which move the messages from the source to the destination. We will also refer to message transfer agents informally as mail servers.
Figure 7-9. Architecture of the email system. [Figure not reproduced: the sender’s user agent performs mail submission (step 1) to a message transfer agent; SMTP carries the message between the sender’s and receiver’s message transfer agents (step 2: message transfer); final delivery (step 3) moves the message from the receiver’s mailbox to the receiver’s user agent.]
The user agent is a program that provides a graphical interface, or sometimes a text- and command-based interface that lets users interact with the email system. It includes a means to compose messages and replies to messages, display incoming messages, and organize messages by filing, searching, and discarding them. The act of sending new messages into the mail system is called mail submission.
Some of the user agent processing may be done automatically, anticipating what the user wants. For example, incoming mail may be filtered to extract or deprioritize messages that are likely spam. Some user agents include advanced features, such as arranging for automatic email responses (‘‘I’m having a wonderful vacation and it will be a while before I get back to you.’’). A user agent runs on the same computer on which a user reads her mail. It is just another program and may be run only some of the time.
The message transfer agents are typically system processes. They run in the background on mail server machines and are intended to be always available. Their job is to automatically move email through the system from the originator to the recipient with SMTP (Simple Mail Transfer Protocol), discussed in Sec. 7.2.4. This is the message transfer step.
SMTP was originally specified as RFC 821 and revised to become the current RFC 5321. It sends mail over connections and reports back the delivery status and any errors. Numerous applications exist in which confirmation of delivery is important and may even have legal significance (‘‘Well, Your Honor, my email system is just not very reliable, so I guess the electronic subpoena just got lost somewhere’’).
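The submission step can be sketched with Python’s standard library, which implements the client side of SMTP. The host name, port, and addresses below are placeholders, and the actual network call is guarded by a flag so the sketch can be read and run without a mail server.

```python
# Hedged sketch of mail submission via SMTP using Python's stdlib.
# All addresses and the server name are invented for illustration.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "alice@cs.vu.nl"
msg["To"] = "bob@cs.uchicago.edu"
msg["Subject"] = "Hello"
msg.set_content("Testing SMTP submission.\n")

SEND = False  # flip to True only if a real submission server is available
if SEND:
    # Port 587 is the standard mail submission port; STARTTLS upgrades
    # the connection to encrypted transport before the message is sent.
    with smtplib.SMTP("smtp.cs.vu.nl", 587) as s:  # hypothetical server
        s.starttls()
        s.send_message(msg)
```

The send_message call performs the SMTP dialogue (HELO/EHLO, MAIL FROM, RCPT TO, DATA) on the caller’s behalf and raises an exception if the server reports a delivery error.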
Message transfer agents also implement mailing lists, in which an identical copy of a message is delivered to everyone on a list of email addresses. Additional advanced features are carbon copies, blind carbon copies, high-priority email, secret (encrypted) email, alternative recipients if the primary one is not currently available, and the ability for assistants to read and answer their bosses’ email.
Linking user agents and message transfer agents are the concepts of mailboxes and a standard format for email messages. Mailboxes store the email that is received for a user. They are maintained by mail servers. User agents simply present users with a view of the contents of their mailboxes. To do this, the user agents send the mail servers commands to manipulate the mailboxes, inspecting their contents, deleting messages, and so on. The retrieval of mail is the final delivery (step 3) in Fig. 7-9. With this architecture, one user may use different user agents on multiple computers to access one mailbox.
Mail is sent between message transfer agents in a standard format. The original format, RFC 822, has been revised to the current RFC 5322 and extended with support for multimedia content and international text. This scheme is called MIME. People still refer to Internet email as RFC 822, though.
A key idea in the message format is the clear distinction between the envelope and the contents of the envelope. The envelope encapsulates the message. Furthermore, it contains all the information needed for transporting the message, such as the destination address, priority, and security level, all of which are distinct from the message itself. The message transport agents use the envelope for routing, just as the post office does.
The message inside the envelope consists of two separate parts: the header and the body. The header contains control information for the user agents. The body
is entirely for the human recipient. None of the agents care much about it. Envelopes and messages are illustrated in Fig. 7-10.
Figure 7-10. Envelopes and messages. (a) Paper mail. (b) Electronic mail. [Figure not reproduced: in (a), a stamped paper envelope addressed to Mr. Daniel Dumkopf, 18 Willow Lane, White Plains, NY 10604, contains a letter from United Gizmo, 180 Main St., Boston, MA 02120, dated Feb. 14, 2020, about Invoice 1081. In (b), an email envelope carries fields such as the recipient’s name and address, priority (Urgent), and encryption (None), enclosing a message with a header (From, Address, Location, Date, Subject) and a body containing the same invoice text.]
We will examine the pieces of this architecture in more detail by looking at the steps that are involved in sending email from one user to another. This journey starts with the user agent.
7.2.2 The User Agent
A user agent is a program (sometimes called an email reader) that accepts a variety of commands for composing, receiving, and replying to messages, as well as for manipulating mailboxes. There are many popular user agents, including Google Gmail, Microsoft Outlook, Mozilla Thunderbird, and Apple Mail. They can vary greatly in their appearance. Most user agents have a menu- or icon-driven graphical interface that requires a mouse, or a touch interface on smaller mobile devices. Older user agents, such as Elm, mh, and Pine, provide text-based interfaces and expect one-character commands from the keyboard. Functionally, these are the same, at least for text messages.
The typical elements of a user agent interface are shown in Fig. 7-11. Your mail reader is likely to be much flashier, but probably has equivalent functions. When a user agent is started, it will usually present a summary of the messages in the user’s mailbox. Often, the summary will have one line for each message in some sorted order. It highlights key fields of the message that are extracted from the message envelope or header.
Figure 7-11. Typical elements of the user agent interface. [Figure not reproduced: a mail reader window with message folders (All items, Inbox, Networks, Travel, Junk Mail) on the left, a mailbox search box, a message summary pane with From, Subject, and Received columns listing seven messages, and a preview pane showing the start of a selected message.]
Seven summary lines are shown in the example of Fig. 7-11. The lines use the From, Subject, and Received fields, in that order, to display who sent the message, what it is about, and when it was received. All the information is formatted in a user-friendly way rather than displaying the literal contents of the message fields, but it is based on the message fields. Thus, people who fail to include a Subject field often discover that responses to their emails tend not to get the highest priority.
Many other fields or indications are possible. The icons next to the message subjects in Fig. 7-11 might indicate, for example, unread mail (the envelope), attached material (the paperclip), and important mail, at least as judged by the sender (the exclamation point).
Many sorting orders are also possible. The most common is to order messages based on the time that they were received, most recent first, with some indication as to whether the message is new or has already been read by the user. The fields in the summary and the sort order can be customized by the user according to her preferences.
User agents must also be able to display incoming messages as needed so that people can read their email. Often a short preview of a message is provided, as in
Fig. 7-11, to help users decide when to read further and when to hit the SPAM button. Previews may use small icons or images to describe the contents of the message. Other presentation processing includes reformatting messages to fit the display, and translating or converting contents to more convenient formats (e.g., digitized speech to recognized text).
After a message has been read, the user can decide what to do with it. This is called message disposition. Options include deleting the message, sending a reply, forwarding the message to another user, and keeping the message for later reference. Most user agents can manage one mailbox for incoming mail with multiple folders for saved mail. The folders allow the user to save messages according to sender, topic, or some other category.
Filing can be done automatically by the user agent as well, even before the user reads the messages. A common example is that the fields and contents of messages are inspected and used, along with feedback from the user about previous messages, to determine if a message is likely to be spam. Many ISPs and companies run software that labels mail as important or spam so that the user agent can file it in the corresponding mailbox. The ISP and company have the advantage of seeing mail for many users and may have lists of known spammers. If hundreds of users have just received a similar message, it is probably spam, although it could be a message from the CEO to all employees. By presorting incoming mail as ‘‘probably legitimate’’ and ‘‘probably spam,’’ the user agent can save users a fair amount of work separating the good stuff from the junk.
And the most popular spam? It is generated by collections of compromised computers called botnets and its content depends on where you live. Fake diplomas are common in Asia, and cheap drugs and other dubious product offers are common in the U.S. Unclaimed Nigerian bank accounts still abound. Pills for enlarging various body parts are common everywhere.
Other filing rules can be constructed by users. Each rule specifies a condition and an action. For example, a rule could say that any message received from the boss goes to one folder for immediate reading and any message from a particular mailing list goes to another folder for later reading. Several folders are shown in Fig. 7-11. The most important folders are the Inbox, for incoming mail not filed elsewhere, and Junk Mail, for messages that are thought to be spam.
7.2.3 Message Formats
Now we turn from the user interface to the format of the email messages themselves. Messages sent by the user agent must be placed in a standard format to be handled by the message transfer agents. First we will look at basic ASCII email using RFC 5322, which is the latest revision of the original Internet message format as described in RFC 822 and its many updates. After that, we will look at multimedia extensions to the basic format.
RFC 5322—The Internet Message Format
Messages consist of a primitive envelope (described as part of SMTP in RFC 5321), some number of header fields, a blank line, and then the message body. Each header field (logically) consists of a single line of ASCII text containing the field name, a colon, and, for most fields, a value. The original RFC 822 was designed decades ago and did not clearly distinguish the envelope fields from the header fields. Although it has been revised to RFC 5322, completely redoing it was not possible due to its widespread usage. In normal usage, the user agent builds a message and passes it to the message transfer agent, which then uses some of the header fields to construct the actual envelope, a somewhat old-fashioned mixing of message and envelope.
The principal header fields related to message transport are listed in Fig. 7-12. The To: field gives the email address of the primary recipient. Having multiple recipients is also allowed. The Cc: field gives the addresses of any secondary recipients. In terms of delivery, there is no distinction between the primary and secondary recipients. It is entirely a psychological difference that may be important to the people involved but is not important to the mail system. The term Cc: (Carbon copy) is a bit dated, since computers do not use carbon paper, but it is well established. The Bcc: (Blind carbon copy) field is like the Cc: field, except that this line is deleted from all the copies sent to the primary and secondary recipients. This feature allows people to send copies to third parties without the primary and secondary recipients knowing this.
Header Meaning
To: Email address(es) of primary recipient(s)
Cc: Email address(es) of secondary recipient(s)
Bcc: Email address(es) for blind carbon copies
From: Person or people who created the message
Sender: Email address of the actual sender
Received: Line added by each transfer agent along the route
Return-Path: Can be used to identify a path back to the sender
Figure 7-12. RFC 5322 header fields related to message transport.
The next two fields, From: and Sender:, tell who wrote and actually sent the message, respectively. These two fields need not be the same. For example, a business executive may write a message, but her assistant may be the one who actually transmits it. In this case, the executive would be listed in the From: field and the assistant in the Sender: field. The From: field is required, but the Sender: field may be omitted if it is the same as the From: field. These fields are needed in case the message is undeliverable and must be returned to the sender.
SEC. 7.2 ELECTRONIC MAIL 639
A line containing Received: is added by each message transfer agent along the way. The line contains the agent’s identity, the date and time the message was received, and other information that can be used for debugging the routing system.
The Return-Path: field is added by the final message transfer agent and was intended to tell how to get back to the sender. In theory, this information can be gathered from all the Received: headers (except for the name of the sender’s mailbox), but it is rarely filled in as such and typically just contains the sender’s address.
In addition to the fields of Fig. 7-12, RFC 5322 messages may also contain a variety of header fields used by the user agents or human recipients. The most common ones are listed in Fig. 7-13. Most of these are self-explanatory, so we will not go into all of them in much detail.
Header Meaning
Date: The date and time the message was sent
Reply-To: Email address to which replies should be sent
Message-Id: Unique number for referencing this message later
In-Reply-To: Message-Id of the message to which this is a reply
References: Other relevant Message-Ids
Keywords: User-chosen keywords
Subject: Short summary of the message for the one-line display
Figure 7-13. Some fields used in the RFC 5322 message header.
The Reply-To: field is sometimes used when neither the person composing the message nor the person sending the message wants to see the reply. For example, a marketing manager may write an email message telling customers about a new product. The message is sent by an assistant, but the Reply-To: field lists the head of the sales department, who can answer questions and take orders. This field is also useful when the sender has two email accounts and wants the reply to go to the other one.
The Message-Id: is an automatically generated number that is used to link messages together (e.g., when used in the In-Reply-To: field) and to prevent duplicate delivery.
The RFC 5322 document explicitly says that users are allowed to invent optional headers for their own private use. By convention since RFC 822, these headers start with the string X-. It is guaranteed that no future headers will use names starting with X-, to avoid conflicts between official and private headers. Sometimes wiseguy undergraduates make up fields like X-Fruit-of-the-Day: or X-Disease-of-the-Week:, which are legal, although not always illuminating.
After the headers comes the message body. Users can put whatever they want here. Some people terminate their messages with elaborate signatures, including quotations from greater and lesser authorities, political statements, and disclaimers
of all kinds (e.g., The XYZ Corporation is not responsible for my opinions; in fact, it cannot even comprehend them).
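As an illustrative sketch, a standard mail library can generate most of these headers automatically; the addresses and the private X- header below are hypothetical. Using Python’s email package:

```python
from email.message import EmailMessage
from email.utils import make_msgid

# Sketch of composing an RFC 5322 message; all addresses are hypothetical.
msg = EmailMessage()
msg["From"] = "executive@example.com"
msg["Sender"] = "assistant@example.com"   # actual submitter, if different
msg["To"] = "customer@example.com"
msg["Subject"] = "Quarterly report"
msg["Message-Id"] = make_msgid()          # unique id, usable in In-Reply-To:
msg["X-Fruit-of-the-Day"] = "kumquat"     # private extension header
msg.set_content("The report follows in a separate message.")

raw = msg.as_string()   # header fields, a blank line, then the body
print(raw)
```

The output shows exactly the structure described above: one header field per line, a blank line, and then the body.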
MIME—The Multipurpose Internet Mail Extensions
In the early days of the ARPANET, email consisted exclusively of text messages written in English and expressed in ASCII. For this environment, the early RFC 822 format did the job completely: it specified the headers but left the content entirely up to the users. In the 1990s, the worldwide use of the Internet and demand to send richer content through the mail system meant that this approach was no longer adequate. The problems included sending and receiving messages in languages with diacritical marks (e.g., French and German), non-Latin alphabets (e.g., Hebrew and Russian), or no alphabets (e.g., Chinese and Japanese), as well as sending messages not containing text at all (e.g., audio, images, or binary documents and programs).
The solution was the development of MIME (Multipurpose Internet Mail Extensions). It is widely used for mail messages that are sent across the Internet, as well as to describe content for other applications such as Web browsing. MIME is described in RFC 2045 and the RFCs that follow it, as well as in RFC 4288 and RFC 4289.
The basic idea of MIME is to continue to use the RFC 822 format but to add structure to the message body and define encoding rules for the transfer of non-ASCII messages. Not deviating from RFC 822 allowed MIME messages to be sent using the existing mail transfer agents and protocols (based on RFC 821 then, and RFC 5321 now). All that had to be changed were the sending and receiving programs, which users could do for themselves.
MIME defines five new message headers, as shown in Fig. 7-14. The first of these simply tells the user agent receiving the message that it is dealing with a MIME message, and which version of MIME it uses. Any message not containing a MIME-Version: header is assumed to be an English plaintext message (or at least one using only ASCII characters) and is processed as such.
Header Meaning
MIME-Version: Identifies the MIME version
Content-Description: Human-readable string telling what is in the message
Content-Id: Unique identifier
Content-Transfer-Encoding: How the body is wrapped for transmission
Content-Type: Type and format of the content
Figure 7-14. Message headers added by MIME.
The Content-Description: header is an ASCII string telling what is in the message. This header is needed so the recipient will know whether it is worth decoding and reading the message. If the string says ‘‘Photo of Aron’s hamster’’ and the
person getting the message is not a big hamster fan, the message will probably be discarded rather than decoded into a high-resolution color photograph. The Content-Id: header identifies the content. It uses the same format as the standard Message-Id: header.
The Content-Transfer-Encoding: tells how the body is wrapped for transmission through the network. A key problem at the time MIME was developed was that the mail transfer (SMTP) protocols expected ASCII messages in which no line exceeded 1000 characters. ASCII characters use 7 bits out of each 8-bit byte. Binary data such as executable programs and images use all 8 bits of each byte, as do extended character sets. There was no guarantee this data would be transferred safely. Hence, some method of carrying binary data that made it look like a regular ASCII mail message was needed. Extensions to SMTP since the development of MIME do allow 8-bit binary data to be transferred, though even today binary data may not always go through the mail system correctly if unencoded.
MIME provides five transfer encoding schemes, plus an escape to new schemes—just in case. The simplest scheme is just ASCII text messages. ASCII characters use 7 bits and can be carried directly by the email protocol, provided that no line exceeds 1000 characters.
The next simplest scheme is the same thing, but using 8-bit characters, that is, all values from 0 up to and including 255 are allowed. Messages using the 8-bit encoding must still adhere to the standard maximum line length.
Then there are messages that use a true binary encoding. These are arbitrary binary files that not only use all 8 bits but also do not adhere to the 1000-character line limit. Executable programs fall into this category. Nowadays, mail servers can negotiate to send data in binary (or 8-bit) encoding, falling back to ASCII if both ends do not support the extension.
The ASCII encoding of binary data is called base64 encoding. In this scheme, groups of 24 bits are broken up into four 6-bit units, with each unit being sent as a legal ASCII character. The coding is ‘‘A’’ for 0, ‘‘B’’ for 1, and so on, followed by the 26 lowercase letters, the 10 digits, and finally + and / for 62 and 63, respectively. The == and = sequences indicate that the last group contained only 8 or 16 bits, respectively. Carriage returns and line feeds are ignored, so they can be inserted at will in the encoded character stream to keep the lines short enough. Arbitrary binary text can be sent safely using this scheme, albeit inefficiently. This encoding was very popular before binary-capable mail servers were widely deployed. It is still commonly seen.
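The grouping and padding rules just described are easy to verify with any base64 implementation, for example Python’s standard library:

```python
import base64

# 24 bits (three bytes) map cleanly onto four 6-bit characters:
print(base64.b64encode(b"Man"))   # b'TWFu'
# 16 bits leave the last group short; a single '=' signals this:
print(base64.b64encode(b"Ma"))    # b'TWE='
# 8 bits leave it even shorter, signaled by '==':
print(base64.b64encode(b"M"))     # b'TQ=='
```

Note the inefficiency: every 3 bytes of binary data become 4 bytes of ASCII, a 33% expansion.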
The last header shown in Fig. 7-14 is really the most interesting one. It specifies the nature of the message body and has had an impact well beyond email. For instance, content downloaded from the Web is labeled with MIME types so that the browser knows how to present it. So is content sent over streaming media and real-time transports such as voice over IP.
Initially, seven MIME types were defined in RFC 1521. Each type has one or more available subtypes. The type and subtype are separated by a slash, as in
‘‘Content-Type: video/mpeg’’. Since then, over 2700 subtypes have been added, along with two new types (font and model). Additional entries are being added all the time as new types of content are developed. The list of assigned types and subtypes is maintained online by IANA at www.iana.org/assignments/media-types. The types, along with several examples of commonly used subtypes, are given in Fig. 7-15.
Type Example subtypes Description
text plain, html, xml, css Text in various formats
image gif, jpeg, tiff Pictures
audio basic, mpeg, mp4 Sounds
video mpeg, mp4, quicktime Movies
font otf, ttf Fonts for typesetting
model vrml 3D model
application octet-stream, pdf, javascript, zip Data produced by applications
message http, RFC 822 Encapsulated message
multipart mixed, alternative, parallel, digest Combination of multiple types
Figure 7-15. MIME content types and example subtypes.
The MIME types in Fig. 7-15 should be self-explanatory except perhaps the last one. It allows a message with multiple attachments, each with a different MIME type.
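A user agent rarely assembles multipart structures by hand. As a sketch, Python’s email library can build a multipart/alternative message (the same content offered as plain text and as HTML, so the receiving user agent can display the richest form it understands):

```python
from email.message import EmailMessage

# Sketch: a message offered in two alternative formats. The receiving
# user agent displays the richest alternative it can render.
msg = EmailMessage()
msg["Subject"] = "Greetings"
msg.set_content("Happy birthday to you")                      # text/plain
msg.add_alternative("<p>Happy birthday to you</p>", subtype="html")

print(msg.get_content_type())        # multipart/alternative
for part in msg.iter_parts():
    print(part.get_content_type())   # text/plain, then text/html
```

The library generates the boundary strings and Content-Type: headers automatically, much like the hand-written example in Fig. 7-16.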
7.2.4 Message Transfer
Now that we have described user agents and mail messages, we are ready to look at how the message transfer agents relay messages from the originator to the recipient. The mail transfer is done with the SMTP protocol.
The simplest way to move messages is to establish a transport connection from the source machine to the destination machine and then just transfer the message. This is how SMTP originally worked. Over the years, however, two different uses of SMTP have been differentiated. The first use is mail submission, step 1 in the email architecture of Fig. 7-9. This is the means by which user agents send messages into the mail system for delivery. The second use is to transfer messages between message transfer agents (step 2 in Fig. 7-9). This sequence delivers mail all the way from the sending to the receiving message transfer agent in one hop. Final delivery is accomplished with different protocols that we will describe in the next section.
In this section, we will describe the basics of the SMTP protocol and its extension mechanism. Then we will discuss how it is used differently for mail submission and message transfer.
SMTP (Simple Mail Transfer Protocol) and Extensions
Within the Internet, email is delivered by having the sending computer establish a TCP connection to port 25 of the receiving computer. Listening to this port is a mail server that speaks SMTP (Simple Mail Transfer Protocol). This server accepts incoming connections, subject to some security checks, and accepts messages for delivery. If a message cannot be delivered, an error report containing the first part of the undeliverable message is returned to the sender.
SMTP is a simple ASCII protocol. This is not a weakness but a feature. Using ASCII text makes protocols easy to develop, test, and debug. They can be tested by sending commands manually, and records of the messages are easy to read. Most application-level Internet protocols now work this way (e.g., HTTP).
We will walk through a simple message transfer between mail servers that delivers a message. After establishing the TCP connection to port 25, the sending machine, operating as the client, waits for the receiving machine, operating as the server, to talk first. The server starts by sending a line of text giving its identity and telling whether it is prepared to receive mail. If it is not, the client releases the connection and tries again later.
If the server is willing to accept email, the client announces whom the email is coming from and whom it is going to. If such a recipient exists at the destination, the server gives the client the go-ahead to send the message. Then the client sends the message and the server acknowledges it. No checksums are needed because TCP provides a reliable byte stream. If there is more email, that is now sent. When all the email has been exchanged in both directions, the connection is released. A sample dialog is shown in Fig. 7-16. The lines sent by the client (i.e., the sender) are marked C:. Those sent by the server (i.e., the receiver) are marked S:.
The first command from the client is indeed meant to be HELO. Of the various four-character abbreviations for HELLO, this one has numerous advantages over its biggest competitor. Why all the commands had to be four characters has been lost in the mists of time.
In Fig. 7-16, the message is sent to only one recipient, so only one RCPT command is used. Multiple RCPT commands are allowed, to send a single message to multiple receivers. Each one is individually acknowledged or rejected. Even if some recipients are rejected (because they do not exist at the destination), the message can be sent to the other ones.
Finally, although the syntax of the four-character commands from the client is rigidly specified, the syntax of the replies is less rigid. Only the numerical code really counts. Each implementation can put whatever string it wants after the code.
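The client side of such a dialog can be sketched as a pure function that produces the command lines; a real client (e.g., Python’s smtplib) would also read and check the server’s numeric reply after each command. The client host name and addresses below match the Fig. 7-16 example:

```python
# Sketch: build the client command lines of an SMTP dialog like Fig. 7-16.
# A real client must read the numeric reply code after each command and
# abort or retry on errors; this function only shows the command sequence.
def smtp_dialog(client_host, sender, recipients, body_lines):
    cmds = [f"HELO {client_host}", f"MAIL FROM: <{sender}>"]
    cmds += [f"RCPT TO: <{r}>" for r in recipients]  # one RCPT per recipient
    cmds += ["DATA"] + body_lines + ["."]            # "." alone ends the body
    cmds.append("QUIT")
    return cmds

for line in smtp_dialog("abcd.com", "alice@cs.uchicago.edu",
                        ["bob@ee.uwa.edu.au"], ["Hello Bob"]):
    print("C:", line)
```

Because each RCPT command is acknowledged separately, a real client would track which recipients were accepted before sending DATA.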
The basic SMTP works well, but it is limited in several respects. It does not include authentication. This means that the FROM command in the example could give any sender address that it pleases. This is quite useful for sending spam. Another limitation is that SMTP transfers ASCII messages, not binary data. This is
S: 220 ee.uwa.edu.au SMTP service ready
C: HELO abcd.com
S: 250 cs.uchicago.edu says hello to ee.uwa.edu.au
C: MAIL FROM: <alice@cs.uchicago.edu>
S: 250 sender ok
C: RCPT TO: <bob@ee.uwa.edu.au>
S: 250 recipient ok
C: DATA
S: 354 Send mail; end with "." on a line by itself
C: From: alice@cs.uchicago.edu
C: To: bob@ee.uwa.edu.au
C: MIME-Version: 1.0
C: Message-Id: <0704760941.AA00747@ee.uwa.edu.au>
C: Content-Type: multipart/alternative; boundary=qwertyuiopasdfghjklzxcvbnm
C: Subject: Earth orbits sun integral number of times
C:
C: This is the preamble. The user agent ignores it. Have a nice day.
C:
C: --qwertyuiopasdfghjklzxcvbnm
C: Content-Type: text/html
C:
C: <p>Happy birthday to you
C: Happy birthday to you
C: Happy birthday dear <bold> Bob </bold>
C: Happy birthday to you
C:
C: --qwertyuiopasdfghjklzxcvbnm
C: Content-Type: message/external-body;
C: access-type="anon-ftp";
C: site="bicycle.cs.uchicago.edu";
C: directory="pub";
C: name="birthday.snd"
C:
C: content-type: audio/basic
C: content-transfer-encoding: base64
C: --qwertyuiopasdfghjklzxcvbnm--
C: .
S: 250 message accepted
C: QUIT
S: 221 ee.uwa.edu.au closing connection
Figure 7-16. A message from alice@cs.uchicago.edu to bob@ee.uwa.edu.au.
why the base64 MIME content transfer encoding was needed. However, with that encoding the mail transmission uses bandwidth inefficiently, which is an issue for large messages. A third limitation is that SMTP sends messages in the clear. It has no encryption to provide a measure of privacy against prying eyes.
To allow these and many other problems related to message processing to be addressed, SMTP was revised to have an extension mechanism. This mechanism
is a mandatory part of the RFC 5321 standard. The use of SMTP with extensions is called ESMTP (Extended SMTP).
Clients wanting to use an extension send an EHLO message instead of HELO initially. If this is rejected, the server is a regular SMTP server, and the client should proceed in the usual way. If the EHLO is accepted, the server replies with the extensions that it supports. The client may then use any of these extensions. Several common extensions are shown in Fig. 7-17. The figure gives the keyword as used in the extension mechanism, along with a description of the new functionality. We will not go into extensions in further detail.
Keyword Description
AUTH Client authentication
BINARYMIME Server accepts binary messages
CHUNKING Server accepts large messages in chunks
SIZE Check message size before trying to send
STARTTLS Switch to secure transport (TLS; see Chap. 8)
UTF8SMTP Internationalized addresses
Figure 7-17. Some SMTP extensions.
To get a better feel for how SMTP and some of the other protocols described in this chapter work, try them out. In all cases, first go to a machine connected to the Internet. On a UNIX (or Linux) system, in a shell, type
telnet mail.isp.com 25
substituting the DNS name of your ISP’s mail server for mail.isp.com. On a Windows machine, you may have to first install the telnet program (or equivalent) and then start it yourself. This command will establish a telnet (i.e., TCP) connection to port 25 on that machine. Port 25 is the SMTP port; see Fig. 6-34 for the ports for other common protocols. You will probably get a response something like this:
Trying 192.30.200.66...
Connected to mail.isp.com
Escape character is '^]'.
220 mail.isp.com Smail #74 ready at Thu, 25 Sept 2019 13:26 +0200
The first three lines are from telnet, telling you what it is doing. The last line is from the SMTP server on the remote machine, announcing its willingness to talk to you and accept email. To find out what commands it accepts, type
HELP
From this point on, a command sequence such as the one in Fig. 7-16 is possible if the server is willing to accept mail from you. You may have to type quickly, though, since the connection may time out if it is inactive too long. Also, not every mail server will accept a telnet connection from an unknown machine.
Mail Submission
Originally, user agents ran on the same computer as the sending message transfer agent. In this setting, all that is required to send a message is for the user agent to talk to the local mail server, using the dialog that we have just described. However, this setting is no longer the usual case.
User agents often run on laptops, home PCs, and mobile phones. They are not always connected to the Internet. Mail transfer agents run on ISP and company servers. They are always connected to the Internet. This difference means that a user agent in Boston may need to contact its regular mail server in Seattle to send a mail message because the user is traveling.
By itself, this remote communication poses no problem. It is exactly what the TCP/IP protocols are designed to support. However, an ISP or company usually does not want any remote user to be able to submit messages to its mail server to be delivered elsewhere. The ISP or company is not running the server as a public service. In addition, this kind of open mail relay attracts spammers. This is because it provides a way to launder the original sender and thus make the message more difficult to identify as spam.
Given these considerations, SMTP is normally used for mail submission with the AUTH extension. This extension lets the server check the credentials (username and password) of the client to confirm that the server should be providing mail service.
There are several other differences in the way SMTP is used for mail submission. For example, port 587 can be used in preference to port 25, and the SMTP server can check and correct the format of the messages sent by the user agent. For more information about the restricted use of SMTP for mail submission, please see RFC 4409.
Physical Transfer
Once the sending mail transfer agent receives a message from the user agent, it will deliver it to the receiving mail transfer agent using SMTP. To do this, the sender uses the destination address. Consider the message in Fig. 7-16, addressed to bob@ee.uwa.edu.au. To what mail server should the message be delivered?
To determine the correct mail server to contact, DNS is consulted. In the previous section, we described how DNS contains multiple types of records, including the MX, or mail exchanger, record. In this case, a DNS query is made for the MX records of the domain ee.uwa.edu.au. This query returns an ordered list of the names and IP addresses of one or more mail servers.
The sending mail transfer agent then makes a TCP connection on port 25 to the IP address of the mail server to reach the receiving mail transfer agent, and uses SMTP to relay the message. The receiving mail transfer agent will then place mail for the user bob in the correct mailbox for Bob to read it at a later time. This local
delivery step may involve moving the message among computers if there is a large mail infrastructure.
With this delivery process, mail travels from the initial to the final mail transfer agent in a single hop. There are no intermediate servers in the message transfer stage. It is possible, however, for this delivery process to occur multiple times. One example that we have described already is when a message transfer agent implements a mailing list. In this case, a message is received for the list. It is then expanded as a message to each member of the list that is sent to the individual member addresses.
As another example of relaying, Bob may have graduated from M.I.T. and also be reachable via the address bob@alum.mit.edu. Rather than reading mail on multiple accounts, Bob can arrange for mail sent to this address to be forwarded to bob@ee.uwa.edu.au. In this case, mail sent to bob@alum.mit.edu will undergo two deliveries. First, it will be sent to the mail server for alum.mit.edu. Then, it will be sent to the mail server for ee.uwa.edu.au. Each of these legs is a complete and separate delivery as far as the mail transfer agents are concerned.
7.2.5 Final Delivery
Our mail message is almost delivered. It has arrived at Bob’s mailbox. All that remains is to transfer a copy of the message to Bob’s user agent for display. This is step 3 in the architecture of Fig. 7-9. This task was straightforward in the early Internet, when the user agent and mail transfer agent ran on the same machine as different processes. The mail transfer agent simply wrote new messages to the end of the mailbox file, and the user agent simply checked the mailbox file for new mail.
Nowadays, the user agent on a PC, laptop, or mobile phone is likely to be on a different machine than the ISP or company mail server, and certain to be on a different machine for a mail provider such as Gmail. Users want to be able to access their mail remotely, from wherever they are. They want to access email from work, from their home PCs, from their laptops when on business trips, and from cybercafes when on so-called vacation. They also want to be able to work offline, then reconnect to receive incoming mail and send outgoing mail. Moreover, each user may run several user agents depending on what computer it is convenient to use at the moment. Several user agents may even be running at the same time.
In this setting, the job of the user agent is to present a view of the contents of the mailbox, and to allow the mailbox to be remotely manipulated. Several different protocols can be used for this purpose, but SMTP is not one of them. SMTP is a push-based protocol. It takes a message and connects to a remote server to transfer the message. Final delivery cannot be achieved in this manner both because the mailbox must continue to be stored on the mail transfer agent and because the user agent may not be connected to the Internet at the moment that SMTP attempts to relay messages.
IMAP—The Internet Message Access Protocol
One of the main protocols that is used for final delivery is IMAP (Internet Message Access Protocol). Version 4 of the protocol is defined in RFC 3501 and in its many updates. To use IMAP, the mail server runs an IMAP server that listens to port 143. The user agent runs an IMAP client. The client connects to the server and begins to issue commands from those listed in Fig. 7-18.
Command Description
CAPABILITY List server capabilities
STARTTLS Start secure transport (TLS; see Chap. 8)
LOGIN Log on to server
AUTHENTICATE Log on with other method
SELECT Select a folder
EXAMINE Select a read-only folder
CREATE Create a folder
DELETE Delete a folder
RENAME Rename a folder
SUBSCRIBE Add folder to active set
UNSUBSCRIBE Remove folder from active set
LIST List the available folders
LSUB List the active folders
STATUS Get the status of a folder
APPEND Add a message to a folder
CHECK Get a checkpoint of a folder
FETCH Get messages from a folder
SEARCH Find messages in a folder
STORE Alter message flags
COPY Make a copy of a message in a folder
EXPUNGE Remove messages flagged for deletion
UID Issue commands using unique identifiers
NOOP Do nothing
CLOSE Remove flagged messages and close folder
LOGOUT Log out and close connection
Figure 7-18. IMAP (version 4) commands.
First, the client will start a secure transport if one is to be used (in order to keep the messages and commands confidential), and then log in or otherwise authenticate itself to the server. Once logged in, there are many commands to list folders and messages, fetch messages or even parts of messages, mark messages
with flags for later deletion, and organize messages into folders. To avoid confusion, please note that we use the term ‘‘folder’’ here to be consistent with the rest of the material in this section, in which a user has a single mailbox made up of
multiple folders. However, in the IMAP specification, the term mailbox is used instead. One user thus has many IMAP mailboxes, each of which is typically presented to the user as a folder.
IMAP has many other features, too. It has the ability to address mail not by message number, but by using attributes (e.g., give me the first message from Alice). Searches can be performed on the server to find the messages that satisfy certain criteria so that only those messages are fetched by the client.
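As a sketch of how these commands compose, the following function retrieves all unseen messages in a folder using the SELECT, SEARCH, and FETCH commands of Fig. 7-18. It is written against Python’s imaplib interface; the host and credentials in the usage comment are placeholders:

```python
import imaplib

# Sketch: fetch all unseen messages from a folder. Works with
# imaplib.IMAP4_SSL or any object with the same select/search/fetch API.
def fetch_unseen(client, folder="INBOX"):
    client.select(folder)                       # SELECT the folder
    typ, data = client.search(None, "UNSEEN")   # SEARCH runs on the server
    messages = []
    for num in data[0].split():                 # e.g., b"1 2" -> [b"1", b"2"]
        typ, parts = client.fetch(num, "(RFC822)")  # FETCH the raw message
        messages.append(parts[0][1])
    return messages

# Usage (host and credentials are placeholders):
# with imaplib.IMAP4_SSL("imap.example.com") as c:
#     c.login("bob", "secret")
#     for raw in fetch_unseen(c):
#         print(raw[:60])
```

Note that the search runs on the server, so only matching messages cross the network, which is exactly the advantage over POP3-style bulk download.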
IMAP is an improvement over an earlier final delivery protocol, POP3 (Post Office Protocol, version 3), which is specified in RFC 1939. POP3 is a simpler protocol but supports fewer features and is less secure in typical usage. Mail is usually downloaded to the user agent computer, instead of remaining on the mail server. This makes life easier on the server, but harder on the user. It is not easy to read mail on multiple computers, plus if the user agent computer breaks, all email may be lost permanently. Nonetheless, you will still find POP3 in use.
Proprietary protocols can also be used because the protocol runs between a mail server and user agent that can be supplied by the same company. Microsoft Exchange is a mail system with a proprietary protocol.
Webmail
An increasingly popular alternative to IMAP and SMTP for providing email service is to use the Web as an interface for sending and receiving mail. Widely used Webmail systems include Google Gmail, Microsoft Hotmail and Yahoo! Mail. Webmail is one example of software (in this case, a mail user agent) that is provided as a service using the Web.
In this architecture, the provider runs mail servers as usual to accept messages for users with SMTP on port 25. However, the user agent is different. Instead of being a standalone program, it is a user interface that is provided via Web pages. This means that users can use any browser they like to access their mail and send new messages.
When the user goes to the email Web page of the provider, say, Gmail, a form is presented in which the user is asked for a login name and password. The login name and password are sent to the server, which then validates them. If the login is successful, the server finds the user’s mailbox and builds a Web page listing the contents of the mailbox on the fly. The Web page is then sent to the browser for display.
Many of the items on the page showing the mailbox are clickable, so messages can be read, deleted, and so on. To make the interface responsive, the Web pages will often include JavaScript programs. These programs are run locally on the client in response to local events (e.g., mouse clicks) and can also download and
upload messages in the background, to prepare the next message for display or a new message for submission. In this model, mail submission happens using the normal Web protocols by posting data to a URL. The Web server takes care of injecting messages into the traditional mail delivery system that we have described. For security, the standard Web protocols can be used as well. These protocols concern themselves with encrypting Web pages, not whether the content of the Web page is a mail message.
7.3 THE WORLD WIDE WEB
The Web, as the World Wide Web is popularly known, is an architectural framework for accessing linked content spread out over millions of machines all over the Internet. In 10 years it went from being a way to coordinate the design of high-energy physics experiments in Switzerland to the application that millions of people think of as being ‘‘The Internet.’’ Its enormous popularity stems from the fact that it is easy for beginners to use and provides access with a rich graphical interface to an enormous wealth of information on almost every conceivable subject, from aardvarks to Zulus.
The Web began in 1989 at CERN, the European Center for Nuclear Research. The initial idea was to help large teams, often with members in a dozen or more countries and time zones, collaborate using a constantly changing collection of reports, blueprints, drawings, photos, and other documents produced by experiments in particle physics. The proposal for a Web of linked documents came from CERN physicist Tim Berners-Lee. The first (text-based) prototype was operational 18 months later. A public demonstration given at the Hypertext ’91 conference caught the attention of other researchers, which led Marc Andreessen at the University of Illinois to develop the first graphical browser. It was called Mosaic and released in February 1993.
The rest, as they say, is now history. Mosaic was so popular that a year later Andreessen left to form a company, Netscape Communications Corp., whose goal was to develop Web software. For the next three years, Netscape Navigator and Microsoft’s Internet Explorer engaged in a ‘‘browser war,’’ each one trying to capture a larger share of the new market by frantically adding more features (and thus more bugs) than the other one.
Through the 1990s and 2000s, Web sites and Web pages, as Web content is called, grew exponentially until there were millions of sites and billions of pages. A small number of these sites became tremendously popular. Those sites and the companies behind them largely define the Web as people experience it today. Examples include: a bookstore (Amazon, started in 1994), a flea market (eBay, 1995), search (Google, 1998), and social networking (Facebook, 2004). The period through 2000, when many Web companies became worth hundreds of millions of dollars overnight, only to go bust practically the next day when they turned
out to be hype, even has a name. It is called the dot com era. New ideas are still striking it rich on the Web. Many of them come from students. For example, Mark Zuckerberg was a Harvard student when he started Facebook, and Sergey Brin and Larry Page were students at Stanford when they started Google. Perhaps you will come up with the next big thing.
In 1994, CERN and M.I.T. signed an agreement setting up the W3C (World Wide Web Consortium), an organization devoted to further developing the Web, standardizing protocols, and encouraging interoperability between sites. Berners-Lee became the director. Since then, several hundred universities and companies have joined the consortium. Although there are now more books about the Web than you can shake a stick at, the best place to get up-to-date information about the Web is (naturally) on the Web itself. The consortium’s home page is at www.w3.org. Interested readers are referred there for links to pages covering all of the consortium’s numerous documents and activities.
7.3.1 Architectural Overview
From the users’ point of view, the Web comprises a vast, worldwide collection of content in the form of Web pages. Each page typically contains links to hundreds of other objects, which may be hosted on any server on the Internet, anywhere in the world. These objects may be text and images, but nowadays also include a wide variety of content, such as advertisements and tracking scripts. A page may also link to other Web pages; users can follow a link by clicking on it, which then takes them to the page pointed to. This process can be repeated indefinitely. The idea of having one page point to another, now called hypertext, was invented by a visionary M.I.T. professor of electrical engineering, Vannevar Bush, in 1945 (Bush, 1945). This was long before the Internet was invented. In fact, it was before commercial computers existed, although several universities had produced crude prototypes that filled large rooms and had millions of times less computing power than a smart watch but consumed more electrical power than a small factory.
Pages are generally viewed with a program called a browser. Brave, Chrome, Edge, Firefox, Opera, and Safari are examples of popular browsers. The browser fetches the page requested, interprets the content, and displays the page, properly formatted, on the screen. The content itself may be a mix of text, images, and formatting commands, in the manner of a traditional document, or other forms of content such as video or programs that produce a graphical interface for users.
Figure 7-19 shows an example of a Web page, which contains many objects. In this case, the page is for the U.S. Federal Communications Commission. This page shows text and graphical elements (which are mostly too small to read here). Many parts of the page include references and links to other pages. The index page, which the browser loads, typically contains instructions for the browser
concerning the locations of other objects to assemble, as well as how and where to render those objects on the page.
A piece of text, icon, graphic image, photograph, or other page element that can be associated with another page is called a hyperlink. To follow a link, a desktop or notebook computer user places the mouse cursor on the linked portion of the page area (which causes the cursor to change shape) and clicks. On a smartphone or tablet, the user taps the link. Following a link is simply a way of telling the browser to fetch another page. In the early days of the Web, links were highlighted with underlining and colored text so that they would stand out. Now, the creators of Web pages can use style sheets to control the appearance of many aspects of the page, including hyperlinks, so links can effectively appear however the designer of the Web site wishes. The appearance of a link can even be dynamic, for example, it might change its appearance when the mouse passes over it. It is up to the creators of the page to make the links visually distinct to provide a good user experience.
Figure 7-19. Fetching and rendering a Web page involves HTTP/HTTPS requests to many servers.
Readers of this page might find a story of interest and click on the area indi- cated, at which point the browser fetches the new page and displays it. Dozens of other pages are linked off the first page besides this example. Every other page can consist of content on the same machine(s) as the first page, or on machines halfway around the globe. The user cannot tell. The browser typically fetches whatever objects the user indicates to the browser through a series of clicks. Thus, moving between machines while viewing content is seamless.
The browser is displaying a Web page on the client machine. Each page is fetched by sending a request to one or more servers, which respond with the contents of the page. The request-response protocol for fetching pages is a simple text-based protocol that runs over TCP, just as was the case for SMTP. It is called HTTP (HyperText Transfer Protocol). The secure version of this protocol, which is now the predominant mode of retrieving content on the Web today, is called HTTPS (Secure HyperText Transfer Protocol). The content may simply be a document that is read off a disk, or the result of a database query and program execution. The page is a static page if it is a document that is the same every time it is displayed. In contrast, if it was generated on demand by a program or contains a program it is a dynamic page.
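Because HTTP is text-based, the shape of an exchange is easy to see. The sketch below builds the raw GET request a browser might send for https://fcc.gov/ and parses a status line from a response. It is a simplified illustration, not the full protocol: real requests carry many more headers, and with HTTPS this text travels inside an encrypted TLS connection.

```python
def build_get_request(host, path="/"):
    # An HTTP/1.1 request is just CRLF-terminated text: a request
    # line, then header lines, then a blank line ending the headers.
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n"
            "\r\n")

def parse_status_line(line):
    # The first line of a response, e.g., "HTTP/1.1 200 OK".
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason

request = build_get_request("fcc.gov")
print(request.splitlines()[0])                 # GET / HTTP/1.1
print(parse_status_line("HTTP/1.1 200 OK"))    # ('HTTP/1.1', 200, 'OK')
```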
A dynamic page may present itself differently each time it is displayed. For example, the front page for an electronic store may be different for each visitor. If a bookstore customer has bought mystery novels in the past, upon visiting the store’s main page, the customer is likely to see new thrillers prominently displayed, whereas a more culinary-minded customer might be greeted with new cookbooks. How the Web site keeps track of who likes what is a story to be told shortly. But briefly, the answer involves cookies (even for culinarily challenged visitors).
In Fig. 7-19, the browser contacts a number of servers to load the Web page. The content on the index page might be loaded directly from files hosted at fcc.gov. Auxiliary content, such as an embedded video, might be hosted at a separate server, still at fcc.gov, but perhaps on infrastructure that is dedicated to hosting the content. The index page may also contain references to other objects that the user may not even see, such as tracking scripts, or advertisements that are hosted on third-party servers. The browser fetches all of these objects, scripts, and so forth and assembles them into a single page view for the user.
Display entails a range of processing that depends on the kind of content. Besides rendering text and graphics, it may involve playing a video or running a script that presents its own user interface as part of the page. In this case, the fcc.gov server supplies the main page, the fonts.gstatic.com server supplies additional objects (e.g., fonts), and the google-analytics.com server supplies nothing that the user can see but tracks visitors to the site. We will investigate trackers and Web privacy later in this chapter.
The Client Side
Let us now examine the Web browser side in Fig. 7-19 in more detail. In essence, a browser is a program that can display a Web page and capture a user’s request to ‘‘follow’’ other content on the page. When an item is selected, the browser follows the hyperlink and retrieves the object that the user indicates (e.g., with a mouse click, or by tapping the link on the screen of a mobile device).
When the Web was first created, it was immediately apparent that having one page point to another Web page required mechanisms for naming and locating
pages. In particular, three questions had to be answered before a selected page could be displayed:
1. What is the page called?
2. Where is the page located?
3. How can the page be accessed?
If every page were somehow assigned a unique name, there would not be any ambiguity in identifying pages. Nevertheless, the problem would not be solved. Consider a parallel between people and pages. In the United States, almost every adult has a Social Security number, which is a unique identifier, as no two people are supposed to have the same one. Nevertheless, if you are armed only with a Social Security number, there is no way to find the owner’s address, and certainly no way to tell whether you should write to the person in English, Spanish, or Chinese. The Web has basically the same problems.
The solution chosen identifies pages in a way that solves all three problems at once. Each page is assigned a URL (Uniform Resource Locator) that effectively serves as the page’s worldwide name. URLs have three parts: the protocol (also
known as the scheme), the DNS name of the machine on which the page is located, and the path uniquely indicating the specific page (a file to read or program to run on the machine). In the general case, the path has a hierarchical name that models a file directory structure. However, the interpretation of the path is up to the server; it may or may not reflect the actual directory structure. As an example, the URL of the page shown in Fig. 7-19 is
https://fcc.gov/
This URL consists of three parts: the protocol (https), the DNS name of the host (fcc.gov), and the path name (/, which the Web server often treats as some default index object).
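This three-part structure can be seen with Python’s standard urllib.parse module, which splits a URL into the scheme, host, and path described above:

```python
from urllib.parse import urlparse

# Split the example URL into its three parts.
parts = urlparse("https://fcc.gov/")
print(parts.scheme)   # https   (the protocol, or scheme)
print(parts.netloc)   # fcc.gov (the DNS name of the host)
print(parts.path)     # /       (the path)
```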
When a user selects a hyperlink, the browser carries out a series of steps in order to fetch the page pointed to. Let us trace the steps that occur when our exam- ple link is selected:
1. The browser determines the URL (by seeing what was selected).
2. The browser asks DNS for the IP address of the server fcc.gov.
3. DNS replies with 23.1.55.196.
4. The browser makes a TCP connection to that IP address; given that the protocol is HTTPS, the secure version of HTTP, the TCP connection would by default be on port 443 (the default port for HTTP, which is used far less often now, is port 80).
5. It sends an HTTPS request asking for the page /, which the Web server typically assumes is some index page (e.g., index.html, index.php, or similar, as configured by the Web server at fcc.gov).
6. The server sends the page as an HTTPS response, for example, by sending the file /index.html, if that is determined to be the default index object.
7. If the page includes URLs that are needed for display, the browser fetches the other URLs using the same process. In this case, the URLs include multiple embedded images also fetched from that server, embedded objects from gstatic.com, and a script from google-analytics.com (as well as a number of other domains that are not shown).
8. The browser displays the page /index.html as it appears in Fig. 7-19.
9. The TCP connections are released if there are no other requests to the same servers for a short period.
Many browsers display which step they are currently executing in a status line at the bottom of the screen. In this way, when the performance is poor, the user can see if it is due to DNS not responding, a server not responding, or simply page transmission over a slow or congested network.
A more detailed way to explore and understand the performance of the Web page is through a so-called waterfall diagram, as shown in Fig. 7-20. The figure shows a list of all of the objects that the browser loads in the process of loading this page (in this case, 64, but many pages have hundreds of objects), as well as the timing dependencies associated with loading each request, and the operations associated with each page load (e.g., a DNS lookup, a TCP connection, the downloading of actual content, and so forth). These waterfall diagrams can tell us a lot about the behavior of a Web browser; for example, we can learn about the number of parallel connections that a browser makes to any given server, as well as whether connections are being reused. We can also learn about the relative time for DNS lookups versus actual object downloads, as well as other potential performance bottlenecks.
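The essence of a waterfall diagram is one time bar per object, offset by its start time. The toy sketch below renders such a chart as text from invented timings (the object names and millisecond values are made up for illustration):

```python
# Hypothetical per-object load timings: (name, start_ms, duration_ms).
timeline = [
    ("index.html",     0, 120),
    ("style.css",    130,  40),
    ("logo.png",     130,  60),
    ("analytics.js", 180,  90),
]

def waterfall(events, scale=10):
    # One row per object; each '#' represents `scale` milliseconds,
    # shifted right by the object's start time.
    rows = []
    for name, start, dur in events:
        bar = " " * (start // scale) + "#" * max(1, dur // scale)
        rows.append(f"{name:>14} |{bar}")
    return "\n".join(rows)

print(waterfall(timeline))
```

A real waterfall additionally breaks each bar into DNS, connect, and download phases, which browser developer tools record for every request.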
The URL design is open-ended in the sense that it is straightforward to have browsers use multiple protocols to retrieve different kinds of resources. In fact, URLs for various other protocols have been defined. Slightly simplified forms of the common ones are listed in Fig. 7-21.
Let us briefly go over the list. The http protocol is the Web’s native language, the one spoken by Web servers. HTTP stands for HyperText Transfer Protocol. We will examine it in more detail later in this section, with a particular focus on HTTPS, the secure version of this protocol, which is now the predominant protocol used to serve objects on the Web today.
The ftp protocol is used to access files by FTP, the Internet’s file transfer protocol. FTP predates the Web and has been in use for more than four decades. The Web makes it easy to obtain files placed on numerous FTP servers throughout the world by providing a simple, clickable interface instead of the older command-line
Figure 7-20. Waterfall diagram for fcc.gov.
interface. This improved access to information is one reason for the spectacular growth of the Web.
It is possible to access a local file as a Web page by using the file protocol, or more simply, by just naming it. This approach does not require having a server. Of course, it works only for local files, not remote ones.
The mailto protocol does not really have the flavor of fetching Web pages, but is still useful anyway. It allows users to send email from a Web browser. Most
Name     Used for                 Example
http     Hypertext (HTML)         http://www.ee.uwa.edu/~rob/
https    Hypertext with security  https://www.bank.com/accounts/
ftp      FTP                      ftp://ftp.cs.vu.nl/pub/minix/README
file     Local file               file:///usr/nathan/prog.c
mailto   Sending email            mailto:JohnUser@acm.org
rtsp     Streaming media          rtsp://youtube.com/montypython.mpg
sip      Multimedia calls         sip:eve@adversary.com
about    Browser information      about:plugins

Figure 7-21. Some common URL schemes.
browsers will respond when a mailto link is followed by starting the user’s mail agent to compose a message with the address field already filled in. The rtsp and sip protocols are for establishing streaming media sessions and audio and video calls.
Finally, the about protocol is a convention that provides information about the browser. For example, following the about:plugins link will cause most browsers to show a page that lists the MIME types that they handle with browser extensions called plug-ins. Many browsers have very interesting information in the about: section; an interesting example in the Firefox browser is about:telemetry, which shows all of the performance and user activity information that the browser gathers about the user. about:preferences shows user preferences, and about:config shows many interesting aspects of the browser configuration, including whether the browser is performing DNS-over-HTTPS lookups (and to which trusted recursive resolvers), as described in the previous section on DNS.
The URLs themselves have been designed not only to allow users to navigate the Web, but to run older protocols such as FTP and email as well as newer protocols for audio and video, and to provide convenient access to local files and browser information. This approach makes all the specialized user interface programs for those other services unnecessary and integrates nearly all Internet access into a single program: the Web browser. If it were not for the fact that this idea was thought of by a British physicist working at a multinational European research lab in Switzerland (CERN), it could easily pass for a plan dreamed up by some software company’s advertising department.
The Server Side
So much for the client side. Now let us take a look at the server side. As we saw above, when the user types in a URL or clicks on a line of hypertext, the browser parses the URL and interprets the part between https:// and the next slash as a DNS name to look up. Armed with the IP address of the server, the browser can
establish a TCP connection to port 443 on that server. Then it sends over a command containing the rest of the URL, which is the path to the page on that server. The server then returns the page for the browser to display.
To a first approximation, a simple Web server is similar to the server of Fig. 6-6. That server is given the name of a file to look up and return via the net- work. In both cases, the steps that the server performs in its main loop are:
1. Accept a TCP connection from a client (a browser).
2. Get the path to the page, which is the name of the file requested.
3. Get the file (from disk).
4. Send the contents of the file to the client.
5. Release the TCP connection.
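A minimal sketch of steps 2–4 in Python (illustrative only: a real server must sanitize the path against directory traversal, set response headers, and speak full HTTP; the document root shown is a made-up example):

```python
import os

def handle_request(path, docroot="/var/www"):
    # Map the requested path to a file under the document root,
    # read it, and return the status line plus body to send back.
    filename = os.path.join(docroot, path.lstrip("/"))
    try:
        with open(filename, "rb") as f:      # step 3: get the file
            body = f.read()
        return "HTTP/1.1 200 OK", body       # step 4: send the contents
    except OSError:
        return "HTTP/1.1 404 Not Found", b""

print(handle_request("missing.html", docroot="/nonexistent-docroot"))
# → ('HTTP/1.1 404 Not Found', b'')
```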
Modern Web servers have more features, but in essence, this is what a Web server does for the simple case of content that is contained in a file. For dynamic content, the third step may be replaced by the execution of a program (determined from the path) that generates and returns the contents.
However, Web servers are implemented with a different design to serve hundreds or thousands of requests per second. One problem with the simple design is that accessing files is often the bottleneck. Disk reads are very slow compared to program execution, and the same files may be read repeatedly from disk using operating system calls. Another problem is that only one request is processed at a time. If the file is large, other requests will be blocked while it is transferred.
One obvious improvement (used by all Web servers) is to maintain a cache in memory of the n most recently read files or a certain number of gigabytes of content. Before going to disk to get a file, the server checks the cache. If the file is there, it can be served directly from memory, thus eliminating the disk access. Although effective caching requires a large amount of main memory and some extra processing time to check the cache and manage its contents, the savings in time are nearly always worth the overhead and expense.
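The cache of the n most recently read files is essentially an LRU (least recently used) cache. A small sketch, with an assumed capacity and a simulated disk:

```python
from collections import OrderedDict

class FileCache:
    # Keep up to `capacity` files in memory, evicting the least
    # recently used one when the cache overflows.
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, name, read_from_disk):
        if name in self.entries:
            self.entries.move_to_end(name)    # mark as recently used
            return self.entries[name]         # hit: no disk access
        data = read_from_disk(name)           # miss: go to disk
        self.entries[name] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the LRU file
        return data

disk_reads = []
def fake_disk(name):                          # stands in for a slow disk
    disk_reads.append(name)
    return f"<contents of {name}>"

cache = FileCache(capacity=2)
cache.get("a.html", fake_disk)
cache.get("a.html", fake_disk)   # second access is served from memory
print(disk_reads)                # ['a.html'] -- the disk was read only once
```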
To tackle the problem of serving more than a single request at a time, one strategy is to make the server multithreaded. In one design, the server consists of a front-end module that accepts all incoming requests and k processing modules, as shown in Fig. 7-22. The k + 1 threads all belong to the same process, so the processing modules all have access to the cache within the process’ address space. When a request comes in, the front end accepts it and builds a short record describing it. It then hands the record to one of the processing modules.
The processing module first checks the cache to see if the requested object is present. If so, it updates the record with a pointer to the cached file. If it is not there, the processing module starts a disk operation to read it into the cache (possibly discarding some other cached file(s) to make room for it). When the file comes in from the disk, it is put in the cache and also sent back to the client.
Figure 7-22. A multithreaded Web server with a front end and processing modules.
The advantage of this approach is that while one or more processing modules are blocked waiting for a disk or network operation to complete (and thus consuming no CPU time), other modules can be actively working on other requests. With k processing modules, the throughput can be as much as k times higher than with a single-threaded server. Of course, when the disk or network is the limiting factor, it is necessary to have multiple disks or a faster network to get any real improvement over the single-threaded model.
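The front-end/processing-module split can be sketched with a thread pool. In the toy below, k worker threads share one in-process cache; the request records, paths, and page contents are invented for illustration (a production server would also need locking around shared state):

```python
from concurrent.futures import ThreadPoolExecutor

K = 4                                           # number of processing modules
cache = {"index.html": "<html>cached</html>"}   # shared within the process

def processing_module(record):
    # Each worker first checks the shared cache, and only
    # simulates a disk read on a miss.
    path = record["path"]
    if path in cache:
        return cache[path]
    body = f"<html>read {path} from disk</html>"  # simulated disk read
    cache[path] = body
    return body

requests = [{"path": "index.html"}, {"path": "news.html"}]
with ThreadPoolExecutor(max_workers=K) as pool:   # the k worker threads
    responses = list(pool.map(processing_module, requests))
print(responses[0])   # <html>cached</html>
```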
Essentially all modern Web architectures are now designed as shown above, with a split between the front end and a back end. The front-end Web server is often called a reverse proxy, because it retrieves content from other (typically back-end) servers and serves those objects to the client. The proxy is called a ‘‘reverse’’ proxy because it is acting on behalf of the servers, as opposed to acting on behalf of clients.
When loading a Web page, a client will often first be directed (using DNS) to a reverse proxy (i.e., the front-end server), which will begin returning static objects to the client’s Web browser so that it can begin loading some of the page contents as quickly as possible. While those (typically static) objects are loading, the back end can perform complex operations (e.g., performing a Web search, doing a database lookup, or otherwise generating dynamic content), which it can serve back to the client via the reverse proxy as those results and content become available.
7.3.2 Static Web Objects
The basis of the Web is transferring Web pages from server to client. In the simplest form, Web objects are static. However, these days, almost any page that you view on the Web will have some dynamic content, but even on dynamic Web pages, a significant amount of the content (e.g., the logo, the style sheets, the header and footer) remains static. Static objects are just files sitting on some server that present themselves in the same way each time they are fetched and viewed. They
are generally amenable to caching, sometimes for a very long time, and are thus often placed on object caches that are close to the user. Just because they are static does not mean that the pages are inert at the browser, however. A video is a static object, for example.
As mentioned earlier, the lingua franca of the Web, in which most pages are written, is HTML. The home pages of university instructors are generally static objects; in some cases, companies may have dynamic Web pages, but the end result of the dynamic-generation process is a page in HTML. HTML (HyperText Markup Language) was introduced with the Web. It allows users to produce Web pages that include text, graphics, video, pointers to other Web pages, and more. HTML is a markup language, or language for describing how documents are to be formatted. The term ‘‘markup’’ comes from the old days when copyeditors actually marked up documents to tell the printer—in those days, a human being—which fonts to use, and so on. Markup languages thus contain explicit commands for formatting. For example, in HTML, <b> means start boldface mode, and </b> means leave boldface mode. Also, <h1> means to start a level 1 heading here. LaTeX and TeX are other examples of markup languages that are well known to most academic authors. In contrast, Microsoft Word is not a markup language because the formatting commands are not embedded in the text.
The key advantage of a markup language over one with no explicit markup is that it separates content from how it should be presented. Most modern Web pages use style sheets to define the typefaces, colors, sizes, padding, and many other attributes of text, lists, tables, headings, ads, and other page elements. Style sheets are written in a language called CSS (Cascading Style Sheets).
Writing a browser is then straightforward: the browser simply has to understand the markup commands and style sheet and apply them to the content. Embedding all the markup commands within each HTML file and standardizing them makes it possible for any Web browser to read and reformat any Web page. That is crucial because a page may have been produced in a 3840 × 2160 window with 24-bit color on a high-end computer but may have to be displayed in a 640 × 320 window on a mobile phone. Just scaling it down linearly is a bad idea because then the letters would be so small that no one could read them.
While it is certainly possible to write documents like this with any plain text editor, and many people do, it is also possible to use word processors or special HTML editors that do most of the work (but correspondingly give the user less direct control over the details of the final result). There are also many programs available for designing Web pages, such as Adobe Dreamweaver.
7.3.3 Dynamic Web Pages and Web Applications
The static page model we have used so far treats pages as (multimedia) documents that are conveniently linked together. It was a good model back in the early days of the Web, as vast amounts of information were put online. Nowadays,
much of the excitement around the Web is using it for applications and services. Examples include buying products on e-commerce sites, searching library catalogs, exploring maps, reading and sending email, and collaborating on documents.
These new uses are like conventional application software (e.g., mail readers and word processors). The twist is that these applications run inside the browser, with user data stored on servers in Internet data centers. They use Web protocols to access information via the Internet, and the browser to display a user interface. The advantage of this approach is that users do not need to install separate application programs, and user data can be accessed from different computers and backed up by the service operator. It is proving so successful that it is rivaling traditional application software. Of course, the fact that these applications are offered for free by large providers helps. This model is a prevalent form of cloud computing, where computing moves off individual desktop computers and into shared clusters of servers in the Internet.
To act as applications, Web pages can no longer be static. Dynamic content is needed. For example, a page of the library catalog should reflect which books are currently available and which books are checked out and are thus not available. Similarly, a useful stock market page would allow the user to interact with the page to see stock prices over different periods of time and compute profits and losses. As these examples suggest, dynamic content can be generated by programs running on the server or in the browser (or in both places).
The general situation is as shown in Fig. 7-23. For example, consider a map service that lets the user enter a street address and presents a corresponding map of the location. Given a request for a location, the Web server must use a program to create a page that shows the map for the location from a database of streets and other geographic information. This action is shown as steps 1 through 3. The request (step 1) causes a program to run on the server. The program consults a database to generate the appropriate page (step 2) and returns it to the browser (step 3).
Figure 7-23. Dynamic pages.
There is more to dynamic content, however. The page that is returned may itself contain programs that run in the browser. In our map example, the program
would let the user find routes and explore nearby areas at different levels of detail. It would update the page, zooming in or out as directed by the user (step 4). To handle some interactions, the program may need more data from the server. In this case, the program will send a request to the server (step 5) that will retrieve more information from the database (step 6) and return a response (step 7). The program will then continue updating the page (step 4). The requests and responses happen in the background; the user may not even be aware of them because the page URL and title typically do not change. By including client-side programs, the page can present a more responsive interface than with server-side programs alone.
Server-Side Dynamic Web Page Generation
Let us look briefly at the case of server-side content generation. When the user clicks on a link in a form, for example in order to buy something, a request is sent to the server at the URL specified with the form along with the contents of the form as filled in by the user. These data must be given to a program or script to process. Thus, the URL identifies the program to run; the data are provided to the program as input. The page returned by this request will depend on what happens during the processing. It is not fixed like a static page. If the order succeeds, the page returned might give the expected shipping date. If it is unsuccessful, the returned page might say that the widgets requested are out of stock or the credit card was not valid for some reason.
Exactly how the server runs a program instead of retrieving a file depends on the design of the Web server. It is not specified by the Web protocols themselves. This is because the interface can be proprietary and the browser does not need to know the details. As far as the browser is concerned, it is simply making a request and fetching a page.
Nonetheless, standard APIs have been developed for Web servers to invoke programs. The existence of these interfaces makes it easier for developers to extend different servers with Web applications. We will briefly look at two APIs to give you a sense of what they entail.
The first API is a method for handling dynamic page requests that has been available since the beginning of the Web. It is called the CGI (Common Gateway Interface) and is defined in RFC 3875. CGI provides an interface to allow Web servers to talk to back-end programs and scripts that can accept input (e.g., from
forms) and generate HTML pages in response. These programs may be written in whatever language is convenient for the developer, usually a scripting language for ease of development. Pick Python, Ruby, Perl, or your favorite language.
By convention, programs invoked via CGI live in a directory called cgi-bin, which is visible in the URL. The server maps a request to this directory to a program name and executes that program as a separate process. It provides any data sent with the request as input to the program. The output of the program gives a Web page that is returned to the browser.
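As a concrete sketch of such a back-end program (the form field name is hypothetical, and a real script would be installed under cgi-bin), a minimal Python CGI-style program might look like this:

```python
#!/usr/bin/env python3
# Minimal CGI-style back-end: the Web server puts the form data in the
# QUERY_STRING environment variable and returns whatever the script
# prints, so the output must start with a header and a blank line.
import os
from html import escape
from urllib.parse import parse_qs

def build_response(query_string):
    """Build a complete CGI response: header, blank line, HTML body."""
    fields = parse_qs(query_string)
    # "name" is a hypothetical form field used for illustration.
    name = escape(fields.get("name", ["stranger"])[0])
    body = f"<html><body><h1>Hello, {name}!</h1></body></html>"
    return "Content-Type: text/html\r\n\r\n" + body

if __name__ == "__main__":
    print(build_response(os.environ.get("QUERY_STRING", "")))
```

Note that the user-supplied value is escaped before being placed in the HTML, since the script cannot trust its input.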
The second API is quite different. The approach here is to embed little scripts inside HTML pages and have them be executed by the server itself to generate the page. A popular language for writing these scripts is PHP (PHP: Hypertext Preprocessor). To use it, the server has to understand PHP, just as a browser has to understand CSS to interpret Web pages with style sheets. Usually, servers identify Web pages containing PHP from the file extension php rather than html or htm. PHP is simpler to use than CGI and is widely used.
Although PHP is easy to use, it is actually a powerful programming language for interfacing the Web and a server database. It has variables, strings, arrays, and most of the control structures found in C, but much more powerful I/O than just printf. PHP is open source code, freely available, and widely used. It was designed specifically to work well with Apache, which is also open source and is the world’s most widely used Web server.
Client-Side Dynamic Web Page Generation
PHP and CGI scripts solve the problem of handling input and interactions with databases on the server. They can all accept incoming information from forms, look up information in one or more databases, and generate HTML pages with the results. What none of them can do is respond to mouse movements or interact with users directly. For this purpose, it is necessary to have scripts embedded in HTML pages that are executed on the client machine rather than the server machine. Starting with HTML 4.0, such scripts were permitted using the tag <script>. The current HTML standard is now generally referred to as HTML5. HTML5 includes many new syntactic features for incorporating multimedia and graphical content, including <video>, <audio>, and <canvas> tags. Notably, the canvas element facilitates dynamic rendering of two-dimensional shapes and bitmap images. Interestingly, the canvas element also has various privacy considerations, because the HTML canvas properties are often unique on different devices. The privacy concerns are significant, because the uniqueness of canvases on individual user devices allows Web site operators to track users, even if the users delete all tracking cookies and block tracking scripts.
The most popular scripting language for the client side is JavaScript, so we will now take a quick look at it. Many books have been written about it (e.g., Coding, 2019; and Atencio, 2020). Despite the similarity in names, JavaScript has almost nothing to do with the Java programming language. Like other scripting languages, it is a very high-level language. For example, in a single line of JavaScript it is possible to pop up a dialog box, wait for text input, and store the resulting string in a variable. High-level features like this make JavaScript ideal for designing interactive Web pages. On the other hand, the fact that it is mutating faster than a fruit fly trapped in an X-ray machine makes it difficult to write JavaScript programs that work on all platforms, but maybe some day it will stabilize.
It is important to understand that while PHP and JavaScript look similar in that they both embed code in HTML files, they are processed totally differently. With PHP, after a user has clicked on the submit button, the browser collects the information into a long string and sends it off to the server as a request for a PHP page. The server loads the PHP file and executes the PHP script that is embedded in it to produce a new HTML page. That page is sent back to the browser for display. The browser cannot even be sure that it was produced by a program. This processing is shown as steps 1 to 4 in Fig. 7-24(a).
Figure 7-24. (a) Server-side scripting with PHP. (b) Client-side scripting with JavaScript.
With JavaScript, when the submit button is clicked the browser interprets a JavaScript function contained on the page. All the work is done locally, inside the browser. There is no contact with the server. This processing is shown as steps 1 and 2 in Fig. 7-24(b). As a consequence, the result is displayed virtually instantaneously, whereas with PHP there can be a delay of several seconds before the resulting HTML arrives at the client.
This difference does not mean that JavaScript is better than PHP. Their uses are completely different. PHP is used when interaction with a database on the server is needed. JavaScript (and other client-side languages) is used when the interaction is with the user at the client computer. It is certainly possible to combine them, as we will see shortly.
7.3.4 HTTP and HTTPS
Now that we have an understanding of Web content and applications, it is time to look at the protocol that is used to transport all this information between Web servers and clients. It is HTTP (HyperText Transfer Protocol), as specified in RFC 2616. Before we get into too many details, it is worth noting some distinctions between HTTP and its secure counterpart, HTTPS (Secure HyperText Transfer Protocol). Both protocols essentially retrieve objects in the same way, and the HTTP standard to retrieve Web objects is evolving essentially independently from its secure counterpart, which effectively uses the HTTP protocol over a secure transport protocol called TLS (Transport Layer Security). In this chapter, we will focus on the protocol details of HTTP and how it has evolved from early
versions to the more modern versions of this protocol in what is now known as HTTP/3. Chapter 8 discusses TLS in more detail, which effectively is the transport protocol that carries HTTP, constituting what we think of as HTTPS. For the remainder of this section, we will talk about HTTP; you can think of HTTPS as simply HTTP transported over TLS.
Overview
HTTP is a simple request-response protocol; conventional versions of HTTP typically run over TCP, although the most modern version of HTTP, HTTP/3, now commonly runs over UDP as well. It specifies what messages clients may send to servers and what responses they get back in return. The request and response headers are given in ASCII, just like in SMTP. The contents are given in a MIME-like format, also like in SMTP. This simple model was partly responsible for the early success of the Web because it made development and deployment straightforward.
In this section, we will look at the more important properties of HTTP as it is used today. Before getting into the details we will note that the way it is used in the Internet is evolving. HTTP is an application layer protocol because it runs on top of TCP and is closely associated with the Web. That is why we are covering it in this chapter. In another sense, HTTP is becoming more like a transport protocol that provides a way for processes to communicate content across the boundaries of different networks. These processes do not have to be a Web browser and Web server. A media player could use HTTP to talk to a server and request album information. Antivirus software could use HTTP to download the latest updates. Developers could use HTTP to fetch project files. Consumer electronics products like digital photo frames often use an embedded HTTP server as an interface to the outside world. Machine-to-machine communication increasingly runs over HTTP. For example, an airline server might contact a car rental server and make a car reservation, all as part of a vacation package the airline was offering.
Methods
Although HTTP was designed for use in the Web, it was intentionally made more general than necessary with an eye to future object-oriented uses. For this reason, operations, called methods, other than just requesting a Web page are supported.
Each request consists of one or more lines of ASCII text, with the first word on the first line being the name of the method requested. The built-in methods are listed in Fig. 7-25. The names are case sensitive, so GET is allowed but not get.
The GET method requests the server to send the page. (When we say ‘‘page’’ we mean ‘‘object’’ in the most general case, but thinking of a page as the contents of a file is sufficient to understand the concepts.) The page is suitably encoded in
Method Description
GET Read a Web page
HEAD Read a Web page’s header
POST Append to a Web page
PUT Store a Web page
DELETE Remove the Web page
TRACE Echo the incoming request
CONNECT Connect through a proxy
OPTIONS Query options for a page
Figure 7-25. The built-in HTTP request methods.
MIME. The vast majority of requests to Web servers are GETs and the syntax is simple. The usual form of GET is
GET filename HTTP/1.1
where filename names the page to be fetched and 1.1 is the protocol version. The HEAD method just asks for the message header, without the actual page. This method can be used to collect information for indexing purposes, or just to test a URL for validity.
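To make this wire format concrete, the following Python sketch builds such a GET request and parses a response status line; the host name is only an example:

```python
# Sketch of the on-the-wire form of a minimal HTTP/1.1 GET request and
# of parsing a response status line. The host name is just an example.

def build_get_request(host, path="/"):
    """Return the ASCII text of a minimal HTTP/1.1 GET request."""
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"      # the Host header is mandatory in HTTP/1.1
            "Connection: close\r\n"
            "\r\n")                  # a blank line ends the header section

def parse_status_line(line):
    """Split a status line such as 'HTTP/1.1 200 OK' into its parts."""
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason
```

Each header line ends with a carriage return and line feed, and an empty line marks the end of the headers.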
The POST method is used when forms are submitted. Like GET, it bears a URL, but instead of simply retrieving a page it uploads data to the server (i.e., the contents of the form or parameters). The server then does something with the data that depends on the URL, conceptually appending the data to the object. The effect
might be to purchase an item, for example, or to call a procedure. Finally, the method returns a page indicating the result.
The remaining methods are not used much for browsing the Web. The PUT method is the reverse of GET: instead of reading the page, it writes the page. This method makes it possible to build a collection of Web pages on a remote server. The body of the request contains the page. It may be encoded using MIME, in which case the lines following the PUT might include authentication headers, to prove that the caller indeed has permission to perform the requested operation.
DELETE does what you might expect: it removes the page, or at least it indicates that the Web server has agreed to remove the page. As with PUT, authentication and permission play a major role here.
The TRACE method is for debugging. It instructs the server to send back the request. This method is useful when requests are not being processed correctly and the client wants to know what request the server actually got.
The CONNECT method lets a user make a connection to a Web server through an intermediate device, such as a Web cache.
The OPTIONS method provides a way for the client to query the server for a page and obtain the methods and headers that can be used with that page.
Every request gets a response consisting of a status line, and possibly additional information (e.g., all or part of a Web page). The status line contains a three-digit status code telling whether the request was satisfied and, if not, why not. The first digit is used to divide the responses into five major groups, as shown in Fig. 7-26.
Code Meaning Examples
1xx Information 100 = server agrees to handle client’s request
2xx Success 200 = request succeeded; 204 = no content present
3xx Redirection 301 = page moved; 304 = cached page still valid
4xx Client error 403 = forbidden page; 404 = page not found
5xx Server error 500 = internal server error; 503 = try again later
Figure 7-26. The status code response groups.
The 1xx codes are rarely used in practice. The 2xx codes mean that the request was handled successfully and the content (if any) is being returned. The 3xx codes tell the client to look elsewhere, either using a different URL or in its own cache (discussed later). The 4xx codes mean the request failed due to a client error such as an invalid request or a nonexistent page. Finally, the 5xx errors mean the server itself has an internal problem, either due to an error in its code or to a temporary overload.
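This grouping by first digit can be captured in a few lines; the following Python sketch maps a status code to its group from Fig. 7-26:

```python
# Sketch: classify a status code by its first digit, as in Fig. 7-26.

GROUPS = {1: "Information", 2: "Success", 3: "Redirection",
          4: "Client error", 5: "Server error"}

def status_group(code):
    """Return the response group for a three-digit HTTP status code."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return GROUPS[code // 100]
```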
Message Headers
The request line (e.g., the line with the GET method) may be followed by additional lines with more information. They are called request headers. This information can be compared to the parameters of a procedure call. Responses may also have response headers. Some headers can be used in either direction. A selection of the more important ones is given in Fig. 7-27. This list is not short, so as you might imagine there are often several headers on each request and response.
The User-Agent header allows the client to inform the server about its browser implementation (e.g., Mozilla/5.0 and Chrome/74.0.3729.169). This information is useful to let servers tailor their responses to the browser, since different browsers can have widely varying capabilities and behaviors.
The four Accept headers tell the server what the client is willing to accept in the event that it has a limited repertoire of what is acceptable to it. The first header specifies the MIME types that are welcome (e.g., text/html). The second gives the character set (e.g., ISO-8859-5 or Unicode-1-1). The third deals with compression methods (e.g., gzip). The fourth indicates a natural language (e.g., Spanish). If the server has a choice of pages, it can use this information to supply the one the client is looking for. If it is unable to satisfy the request, an error code is returned and the request fails.
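A server-side sketch of this negotiation for the Accept header follows; it assumes the common syntax in which a quality value q (default 1.0) ranks each alternative, and it deliberately ignores partial wildcards such as text/* for simplicity:

```python
# Sketch of content negotiation on the Accept header: each alternative
# may carry a quality value q (default 1.0), and the server picks the
# acceptable type it can serve with the highest q. Partial wildcards
# such as text/* are ignored in this simplified version.

def parse_accept(header):
    """Parse 'text/html, application/xml;q=0.9' into [(type, q), ...]."""
    prefs = []
    for part in header.split(","):
        fields = part.strip().split(";")
        mtype, q = fields[0].strip(), 1.0
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                q = float(value)
        prefs.append((mtype, q))
    return prefs

def choose(header, available):
    """Pick the available MIME type the client prefers most, or None."""
    best, best_q = None, 0.0
    for mtype in available:
        for want, q in parse_accept(header):
            if want in (mtype, "*/*") and q > best_q:
                best, best_q = mtype, q
    return best
```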
Header Type Contents
User-Agent Request Information about the browser and its platform
Accept Request The type of pages the client can handle
Accept-Charset Request The character sets that are acceptable to the client
Accept-Encoding Request The page encodings the client can handle
Accept-Language Request The natural languages the client can handle
If-Modified-Since Request Time and date to check freshness
If-None-Match Request Previously sent tags to check freshness
Host Request The server’s DNS name
Authorization Request A list of the client’s credentials
Referer Request The previous URL from which the request came
Cookie Request Previously set cookie sent back to the server
Set-Cookie Response Cookie for the client to store
Server Response Information about the server
Content-Encoding Response How the content is encoded (e.g., gzip)
Content-Language Response The natural language used in the page
Content-Length Response The page’s length in bytes
Content-Type Response The page’s MIME type
Content-Range Response Identifies a portion of the page’s content
Last-Modified Response Time and date the page was last changed
Expires Response Time and date when the page stops being valid
Location Response Tells the client where to send its request
Accept-Ranges Response Indicates the server will accept byte range requests
Date Both Date and time the message was sent
Range Both Identifies a portion of a page
Cache-Control Both Directives for how to treat caches
ETag Both Tag for the contents of the page
Upgrade Both The protocol the sender wants to switch to
Figure 7-27. Some HTTP message headers.
The If-Modified-Since and If-None-Match headers are used with caching. They let the client ask for a page to be sent only if the cached copy is no longer valid. We will describe caching shortly.
The Host header names the server. It is taken from the URL. This header is mandatory. It is used because some IP addresses may serve multiple DNS names and the server needs some way to tell which host to hand the request to.
The Authorization header is needed for pages that are protected. In this case, the client may have to prove it has a right to see the page requested. This header is used for that case.
The client uses the (misspelled) Referer [sic] header to give the URL that referred to the URL that is now requested. Most often this is the URL of the previ- ous page. This header is particularly useful for tracking Web browsing, as it tells servers how a client arrived at the page.
Cookies are small files that servers place on client computers to remember information for later. A typical example is an e-commerce Web site that uses a client-side cookie to keep track of what the client has ordered so far. Every time the client adds an item to her shopping cart, the cookie is updated to reflect the new item ordered. Although cookies are dealt with in RFC 2109 rather than RFC 2616, they also have headers. The Set-Cookie header is how servers send cookies to clients. The client is expected to save the cookie and return it on subsequent requests to the server by using the Cookie header. (Note that there is a more recent specification for cookies with newer headers, RFC 2965, but this has largely been rejected by industry and is not widely implemented.)
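The exchange can be sketched with Python's standard http.cookies module; the cart cookie and its value are made up for the example:

```python
# Sketch of the cookie exchange using Python's standard http.cookies
# module. The "cart" cookie and its value are made up for illustration.
from http.cookies import SimpleCookie

# Server side: build the value of a Set-Cookie response header.
server_cookie = SimpleCookie()
server_cookie["cart"] = "item42"
set_cookie_line = server_cookie.output(header="Set-Cookie:")

# Client side: store the cookie and play it back in a Cookie header
# on every subsequent request to the same server.
client_cookie = SimpleCookie()
client_cookie.load("cart=item42")
cookie_header = "Cookie: " + "; ".join(
    f"{name}={morsel.value}" for name, morsel in client_cookie.items())
```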
Many other headers are used in responses. The Server header allows the server to identify its software build if it wishes. The next five headers, all starting with Content-, allow the server to describe properties of the page it is sending.
The Last-Modified header tells when the page was last modified, and the Expires header tells for how long the page will remain valid. Both of these headers play an important role in page caching.
The Location header is used by the server to inform the client that it should try a different URL. This can be used if the page has moved or to allow multiple URLs to refer to the same page (possibly on different servers). It is also used for companies that have a main Web page in the com domain but redirect clients to a national or regional page based on their IP addresses or preferred language.
If a page is large, a small client may not want it all at once. Some servers will accept requests for byte ranges, so the page can be fetched in multiple small units. The Accept-Ranges header announces the server’s willingness to handle this.
Now we come to headers that can be used either way. The Date header can be used in both directions and contains the time and date the message was sent, while the Range header tells the byte range of the page that is provided by the response.
The ETag header gives a short tag that serves as a name for the content of the page. It is used for caching. The Cache-Control header gives other explicit instructions about how to cache (or, more usually, how not to cache) pages.
Finally, the Upgrade header is used for switching to a new communication protocol, such as a future HTTP protocol or a secure transport. It allows the client to announce what it can support and the server to assert what it is using.
Caching
People often return to Web pages that they have viewed before, and related Web pages often have the same embedded resources. Some examples are the images that are used for navigation across the site, as well as common style sheets
and scripts. It would be very wasteful to fetch all of these resources for these pages each time they are displayed because the browser already has a copy. Squirreling away pages that are fetched for subsequent use is called caching. The advantage is that when a cached page can be reused, it is not necessary to repeat the transfer. HTTP has built-in support to help clients identify when they can safely reuse pages. This support improves performance by reducing both network traffic and latency. The trade-off is that the browser must now store pages, but this is nearly always a worthwhile trade-off because local storage is inexpensive. The pages are usually kept on disk so that they can be used when the browser is run at a later date.
The difficult issue with HTTP caching is how to determine that a previously cached copy of a page is the same as the page would be if it was fetched again. This determination cannot be made solely from the URL. For example, the URL may give a page that displays the latest news item. The contents of this page will be updated frequently even though the URL stays the same. Alternatively, the contents of the page may be a list of the gods from Greek and Roman mythology. This page should change somewhat less rapidly.
HTTP uses two strategies to tackle this problem. They are shown in Fig. 7-28 as forms of processing between the request (step 1) and the response (step 5). The first strategy is page validation (step 2). The cache is consulted, and if it has a copy of a page for the requested URL that is known to be fresh (i.e., still valid), there is no need to fetch it anew from the server. Instead, the cached page can be returned directly. The Expires header returned when the cached page was originally fetched and the current date and time can be used to make this determination.
Figure 7-28. HTTP caching.
However, not all pages come with a convenient Expires header that tells when the page must be fetched again. After all, making predictions is hard—especially about the future. In this case, the browser may use heuristics. For example, if the page has not been modified in the past year (as told by the Last-Modified header) it is a fairly safe bet that it will not change in the next hour. There is no guarantee, however, and this may be a bad bet. For example, the stock market might have closed for the day so that the page will not change for hours, but it will change rapidly once the next trading session starts. Thus, the cacheability of a page may
vary wildly over time. For this reason, heuristics should be used with care, though they often work well in practice.
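One way to express the expiry check and this heuristic in code follows; the 10% fraction is an illustrative choice, not a standard value:

```python
# Sketch of the freshness decision described above. A copy is fresh
# until its Expires time; with only Last-Modified available, a common
# heuristic treats it as fresh for some fraction of its current age
# (10% here is an illustrative choice, not a standard value).
from datetime import datetime

def is_fresh(now, fetched_at, expires=None, last_modified=None):
    """Decide whether a cached copy may be reused without revalidation."""
    if expires is not None:
        return now < expires
    if last_modified is not None:
        age_when_fetched = fetched_at - last_modified
        return now < fetched_at + 0.1 * age_when_fetched
    return False  # no freshness information: revalidate with the server

# A page unchanged for a year is heuristically fresh a day after fetching.
example = is_fresh(datetime(2024, 1, 11), datetime(2024, 1, 10),
                   last_modified=datetime(2023, 1, 10))
```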
Finding pages that have not expired is the most beneficial use of caching because it means that the server does not need to be contacted at all. Unfortunately, it does not always work. Servers must use the Expires header conservatively, since they may be unsure when a page will be updated. Thus, the cached copies may still be fresh, but the client does not know.
The second strategy is used in this case. It is to ask the server if the cached copy is still valid. This request is a conditional GET, and it is shown in Fig. 7-28 as step 3. If the server knows that the cached copy is still valid, it can send a short reply to say so (step 4a). Otherwise, it must send the full response (step 4b).
More header fields are used to let the server check whether a cached copy is still valid. The client has the time a cached page was most recently updated from the Last-Modified header. It can send this time to the server using the If-Modified-Since header to ask for the page if and only if it has been changed in the meantime. There is much more to say about caching because it has such a big effect on performance, but this is not the place to say it. Not surprisingly, there are many tutorials on the Web that you can find easily by searching for ‘‘Web caching.’’
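The server's side of this check can be sketched as follows; the respond helper is hypothetical, not part of any real server API:

```python
# Sketch of the server side of a conditional GET: if the page has not
# changed since the time the client sent in If-Modified-Since, a short
# 304 (Not Modified) reply replaces the full response. The respond
# helper is hypothetical, not part of any real server API.
from datetime import datetime

def respond(page_modified, if_modified_since, body):
    """Return (status code, body) for a possibly conditional GET."""
    if if_modified_since is not None and page_modified <= if_modified_since:
        return 304, b""   # client's cached copy is still valid
    return 200, body      # page changed (or unconditional): send it all

# A copy cached in June is still valid for a page last changed in January.
status, body = respond(datetime(2024, 1, 1), datetime(2024, 6, 1),
                       b"<html>page</html>")
```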
HTTP/1 and HTTP/1.1
The usual way for a browser to contact a server is to establish a TCP connection to port 443 for HTTPS (or port 80 for HTTP) on the server’s machine, although this procedure is not formally required. The value of using TCP is that neither browsers nor servers have to worry about how to handle long messages, reliability, or congestion control. All of these matters are handled by the TCP implementation.
Early in the Web, with HTTP/1.0, after the connection was established a single request was sent over and a single response was sent back. Then the TCP connection was released. In a world in which the typical Web page consisted entirely of HTML text, this method was adequate. Quickly, the average Web page grew to contain large numbers of embedded links for content such as icons and other eye candy. Establishing a separate TCP connection to transport each single icon became a very expensive way to operate.
This observation led to HTTP/1.1, which supports persistent connections. With them, it is possible to establish a TCP connection, send a request and get a response, and then send additional requests and get additional responses. This strategy is also called connection reuse. By amortizing the TCP setup, startup, and release costs over multiple requests, the relative overhead due to TCP is reduced per request. It is also possible to pipeline requests, that is, send request 2 before the response to request 1 has arrived.
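Connection reuse can be demonstrated with Python's standard library; the sketch below starts a throwaway local server and sends two requests over the same persistent connection (the paths are made up):

```python
# Sketch: two sequential requests over one persistent HTTP/1.1
# connection, demonstrated against a throwaway local server built from
# Python's standard library. The paths are made up for the example.
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"        # enables persistent connections

    def do_GET(self):
        body = self.path.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)           # echo the requested path

    def log_message(self, *args):        # keep the example output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), EchoHandler)   # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/page.html")
first = conn.getresponse().read()        # main page
conn.request("GET", "/image1.png")       # reuses the same TCP connection
second = conn.getresponse().read()
conn.close()
server.shutdown()
```

Because the server announces a Content-Length and speaks HTTP/1.1, the client knows where each response ends and can send the next request on the same connection.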
The performance difference between these three cases is shown in Fig. 7-29. Part (a) shows three requests, one after the other and each in a separate connection.
Let us suppose that this represents a Web page with two embedded images on the same server. The URLs of the images are determined as the main page is fetched, so they are fetched after the main page. Nowadays, a typical page has around 40 other objects that must be fetched to present it, but that would make our figure far too big so we will use only two embedded objects.
Figure 7-29. HTTP with (a) multiple connections and sequential requests. (b) A persistent connection and sequential requests. (c) A persistent connection and pipelined requests.
In Fig. 7-29(b), the page is fetched with a persistent connection. That is, the TCP connection is opened at the beginning, then the same three requests are sent, one after the other as before, and only then is the connection closed. Observe that the fetch completes more quickly. There are two reasons for the speedup. First, time is not wasted setting up additional connections. Each TCP connection requires at least one round-trip time to establish. Second, the transfer of the same images proceeds more quickly. Why is this? It is because of TCP congestion control. At the start of a connection, TCP uses the slow-start procedure to increase the throughput until it learns the behavior of the network path. The consequence of this warmup period is that multiple short TCP connections take disproportionately longer to transfer information than one longer TCP connection.
Finally, in Fig. 7-29(c), there is one persistent connection and the requests are pipelined. Specifically, the second and third requests are sent in rapid succession as soon as enough of the main page has been retrieved to identify that the images must be fetched. The responses for these requests follow eventually. This method cuts down the time that the server is idle, so it further improves performance.
Persistent connections do not come for free, however. A new issue that they raise is when to close the connection. A connection to a server should stay open while the page loads. What then? There is a good chance that the user will click on a link that requests another page from the server. If the connection remains open,
the next request can be sent immediately. However, there is no guarantee that the client will make another request of the server any time soon. In practice, clients and servers usually keep persistent connections open until they have been idle for a
short time (e.g., 60 seconds) or they have a large number of open connections and need to close some.
The observant reader may have noticed that there is one combination that we have left out so far. It is also possible to send one request per TCP connection, but run multiple TCP connections in parallel. This parallel connection method was widely used by browsers before persistent connections. It has the same disadvantage as sequential connections—extra overhead—but much better performance. This is because setting up and ramping up the connections in parallel hides some of the latency. In our example, connections for both of the embedded images could be set up at the same time. However, running many TCP connections to the same server is discouraged. The reason is that TCP performs congestion control for each connection independently. As a consequence, the connections compete against each other, causing added packet loss, and in aggregate are more aggressive users of the network than an individual connection. Persistent connections are superior and used in preference to parallel connections because they avoid overhead and do not suffer from congestion problems.
HTTP/2
HTTP/1.0 was around from the start of the Web and HTTP/1.1 was standardized in 1999. By 2012 it was getting a bit long in the tooth, so the IETF set up a working group to create what later became HTTP/2. The starting point was a protocol Google had devised earlier, called SPDY. The final product was published as RFC 7540 in May 2015.
The working group had several goals it tried to achieve, including:

1. Allow clients and servers to choose which HTTP version to use.

2. Maintain compatibility with HTTP/1.1 as much as possible.

3. Improve performance with multiplexing, pipelining, compression, etc.

4. Support existing practices used in browsers, servers, proxies, delivery networks, and more.
A key idea was to maintain backward compatibility. Existing applications had to work with HTTP/2, but new ones could take advantage of the new features to improve performance. For this reason, the headers, URLs, and general semantics
did not change much. What changed was the way everything is encoded and the way the clients and servers interact. In HTTP/1.1, a client opens a TCP connection to a server, sends over a request as text, waits for a response, and in many cases then closes the connection. This is repeated as often as needed to fetch an entire Web page. In HTTP/2, a TCP connection is set up and many requests can be sent over it, in binary, possibly prioritized, and the server can respond to them in any order it wants to. Only after all requests have been answered is the TCP connection torn down.
Through a mechanism called server push, HTTP/2 allows the server to push out files that it knows will be needed but which the client may not know initially. For example, if a client requests a Web page and the server sees that it uses a style sheet and a JavaScript file, the server can send over the style sheet and the JavaScript before they are even requested. This eliminates some delays. An example of getting the same information (a Web page, its style sheet, and two images) in HTTP/1.1 and HTTP/2 is shown in Fig. 7-30.
Figure 7-30. (a) Getting a Web page in HTTP/1.1. (b) Getting the same page in HTTP/2.
Note that Fig. 7-30(a) is the best case for HTTP/1.1, where multiple requests can be sent consecutively over the same TCP connection, but the rules are that they must be processed in order and the results sent back in order. In HTTP/2 [Fig. 7-30(b)], the responses can come back in any order. If it turns out, for example, that image 1 is very large, the server could send back image 2 first so the browser
can start displaying the page with image 2 even before image 1 is available. That is not allowed in HTTP/1.1. Also note that in Fig. 7-30(b) the server sent the style sheet without the browser asking for it.
In addition to the pipelining and multiplexing of requests over the same TCP connection, HTTP/2 compresses the headers and sends them in binary to reduce bandwidth usage and latency. An HTTP/2 session consists of a series of frames, each with a separate identifier. Responses may come back in a different order than the requests, as in Fig. 7-30(b), but since each response carries the identifier of the request, the browser can determine which request each response corresponds to.
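The bookkeeping that lets a browser match out-of-order responses to their requests can be sketched in a few lines. This is not a real HTTP/2 implementation, just an illustration of the idea: each request gets its own stream identifier (HTTP/2 clients use odd IDs), and each response carries that identifier back, so arrival order no longer matters. The URLs and bodies are made up for the example.

```python
# Sketch of HTTP/2-style demultiplexing: responses carry the stream ID
# of the request they answer, so they may arrive in any order.

def send_requests(urls):
    # Assign each request an odd stream ID, as HTTP/2 clients do.
    return {2 * i + 1: url for i, url in enumerate(urls)}

def match_responses(pending, responses):
    # 'responses' arrive in arbitrary order; each carries its stream ID.
    page = {}
    for stream_id, body in responses:
        page[pending[stream_id]] = body
    return page

pending = send_requests(["/page.html", "/style.css", "/img1.png", "/img2.png"])
# The server answers in its own order (e.g., the small style sheet first).
responses = [(3, "css"), (7, "img2"), (1, "html"), (5, "img1")]
assembled = match_responses(pending, responses)
```

However the server schedules its replies, the client ends up with every object associated with the URL it asked for.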
Encryption was a sore point during the development of HTTP/2. Some people wanted it badly, and others opposed it equally badly. The opposition was mostly related to Internet-of-Things applications, in which the ‘‘thing’’ does not have a lot of computing power. In the end, encryption was not required by the standard, but all browsers require encryption, so de facto it is there anyway, at least for Web browsing.
HTTP/3
HTTP/3 or simply H3 is the third major revision of HTTP, designed as a successor to HTTP/2. The major distinction for HTTP/3 is the transport protocol that it uses to support the HTTP messages: rather than relying on TCP, it relies on an augmented version of UDP called QUIC, which relies on user-space congestion control running on top of UDP. HTTP/3 started out simply as HTTP-over-QUIC and has become the latest proposed major revision to the protocol. Many open-source libraries that support client and server logic for QUIC and HTTP/3 are available, in languages that include C, C++, Python, Rust, and Go. Popular Web servers including nginx also now support HTTP/3 through patches.
The QUIC transport protocol supports stream multiplexing and per-stream flow control, similar to that offered in HTTP/2. Stream-level reliability and connection-wide congestion control can dramatically improve the performance of HTTP, since congestion information can be shared across sessions, and reliability can be amortized across multiple connections fetching objects in parallel. Once a connection exists to a server endpoint, HTTP/3 allows the client to reuse that same connection with multiple different URLs.
HTTP/3, running HTTP over QUIC, promises many possible performance enhancements over HTTP/2, primarily because of the benefits that QUIC offers for HTTP vs. TCP. In some ways, QUIC could be viewed as the next generation of TCP. It offers connection setup with no additional round trips between client and server; in the case when a previous connection has been established between client and server, a zero-round-trip connection re-establishment is possible, provided that a secret from the previous connection was established and cached. QUIC guarantees reliable, in-order delivery of bytes within a single stream, but it does not
provide any guarantees with respect to bytes on other QUIC streams. QUIC does permit out-of-order delivery within a stream, but HTTP/3 does not make use of this feature. HTTP/3 over QUIC will be performed exclusively using HTTPS; requests to (the increasingly deprecated) HTTP URLs will not be upgraded to use HTTP/3.
For more details on HTTP/3, see https://http3.net.
7.3.5 Web Privacy
One of the most significant issues in recent years has been the privacy concerns associated with Web browsing. Web sites, Web applications, and other third parties often use mechanisms in HTTP to track user behavior, both within the context of a single Web site or application and across the Internet. Additionally, attackers may exploit various information side channels in the browser or device to track users. This section describes some of the mechanisms that are used to track users and to fingerprint individual users and devices.
Cookies
One conventional way to implement tracking is by placing a cookie (effectively a small amount of data) on client devices, which the clients may then send back upon subsequent visits to various Web sites. When a user requests a Web object (e.g., a Web page), a Web server may place a piece of persistent state, called a cookie, on the user’s device, using the ‘‘set-cookie’’ directive in HTTP. The data passed to the client’s device using this directive is subsequently stored locally on the device. When the device visits that Web domain in the future, the HTTP request passes the cookie, in addition to the request itself.
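The round trip just described can be sketched with Python’s standard http.cookies module. The cookie name ("session_id") and value are illustrative, not from any particular site; the point is that the server emits a Set-Cookie header and the client later echoes the bare name=value pair back in a Cookie header.

```python
# Sketch of the cookie round trip using Python's standard library.
from http.cookies import SimpleCookie

# Server side: build a Set-Cookie header for the response.
server_cookie = SimpleCookie()
server_cookie["session_id"] = "abc123"
server_cookie["session_id"]["path"] = "/"
set_cookie_header = server_cookie["session_id"].OutputString()

# Client side: store the cookie and echo it back on the next request.
stored = SimpleCookie()
stored.load(set_cookie_header)
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in stored.items())
```

Attributes such as Path travel only from server to client; the client’s later Cookie header carries just the name and value.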
‘‘First-party’’ HTTP cookies (i.e., those set by the domain of the Web site that the user intends to visit, such as a shopping or news Web site) are useful for improving user experience on many Web sites. For example, cookies are often used to preserve state across a Web ‘‘session.’’ They allow a Web site to track useful information about a user’s ongoing behavior on a Web site, such as whether they recently logged into the Web site, or what items they have placed in a shopping cart.
Cookies set by one domain are generally only visible to the same domain that set the cookie in the first place. For example, one advertising network may set a cookie on a user device, but no other third party can see the cookie that was set. This Web security policy, called the same-origin policy, prevents one party from reading a cookie that was set by another party and in some sense can limit how information about an individual user is shared.
Although first-party cookies are often used to improve the user experience, third parties, such as advertisers and tracking companies, can also set cookies on client devices, which can allow those third parties to track the sites that users visit
as they navigate different Web sites across the entire Internet. This tracking takes place as follows:
1. When a user visits a Web site, in addition to the content that the user requests directly, the device may load content from third-party sites, including from the domains of advertising networks. Loading an advertisement or script from a third party allows that party to set a unique cookie on the user’s device.
2. That user may subsequently visit different sites on the Internet that load Web objects from the same third party that set tracking information on a different site.
A common example of this practice might be two different Web sites that use the same advertising network to serve ads. In this case, the advertising network would see: (1) the user’s device return the cookie that it set on a different Web site; (2) the HTTP referer request header that accompanies the request to load the object from the advertiser, indicating the original site that the user’s device was visiting. This practice is commonly referred to as cross-site tracking.
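The tracker’s view of cross-site tracking can be sketched as follows. The domains are hypothetical; each ad request delivers the tracker’s own cookie plus the Referer of the first-party site that embedded the ad, which is all the tracker needs to assemble a browsing profile.

```python
# Sketch of cross-site tracking from the advertising network's side:
# its cookie identifies the user; the Referer identifies the site.
from collections import defaultdict

tracker_log = defaultdict(list)   # cookie ID -> list of sites visited

def ad_request(cookie_id, referer):
    # The third party sees its own cookie plus the first-party site.
    tracker_log[cookie_id].append(referer)

# The same user (cookie "u42") browses two unrelated sites that both
# load ads from the same network.
ad_request("u42", "https://news.example")
ad_request("u42", "https://shop.example")

profile = tracker_log["u42"]
```

Neither first-party site shares anything directly with the other; the linkage happens entirely inside the common third party.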
Super cookies, and other locally stored tracking identifiers that a user cannot control as they would regular cookies, can allow an intermediary to track a user across Web sites over time. Unique identifiers can include things such as third-party tracking identifiers encoded in HTTP (specifically, HSTS (HTTP Strict Transport Security) headers that are not cleared when a user clears their cookies) and tags that an intermediate third party such as a mobile ISP can insert into unencrypted Web traffic that traverses a network segment. This enables third parties, such as advertisers, to build up a profile of a user’s browsing across a set of Web sites, similar to the Web tracking cookies used by ad networks and application providers.
Third-Party Trackers
Web cookies that originate from a third-party domain that are used across many sites can allow an advertising network or other third parties to track a user’s browsing habits on any site where that tracking software is deployed (i.e., any site that carries their advertisements, sharing buttons, or other embedded code). Advertising networks and other third parties typically track a user’s browsing patterns across the range of Web sites that the user browses, often using browser-based tracking software. In some cases, a third party may develop its own tracking software (e.g., Web analytics software). In other cases, they may use a different third-party service to collect and aggregate this behavior across sites.
Web sites may permit advertising networks and other third-party trackers to operate on their site, enabling them to collect analytics data, advertise on other Web sites (called re-targeting), or monetize the Web site’s available advertising space via placement of carefully targeted ads. The advertisers collect data about
users by using various tracking mechanisms, such as HTTP cookies, HTML5 objects, JavaScript, device fingerprinting, browser fingerprinting, and other common Web technologies. When a user visits multiple Web sites that leverage the same advertising network, that advertising network recognizes the user’s device, enabling them to track user Web behavior over time.
Using such tracking software, a third party or advertising network can discover a user’s interactions, social network and contacts, likes, interests, purchases, and so on. This information can enable precise tracking of whether an advertisement resulted in a purchase, mapping of relationships between people, creation of detailed user tracking profiles, conduct of highly targeted advertising, and significantly more due to the breadth and scope of tracking.
Even in cases where someone is not a registered user of a particular service (e.g., social media site, search engine), has ceased using that service, or has logged out of that service, they often are still being uniquely tracked using third-party (and first-party) trackers. Third-party trackers are increasingly becoming concentrated with a few large providers.
In addition to third-party tracking with cookies, the same advertisers and third-party trackers can track user browsing behavior with techniques such as canvas fingerprinting (a type of browser fingerprinting), session replay (whereby a third party can see a playback of every user interaction with a particular Web page), and even exploitation of a browser or password manager’s ‘‘auto-fill’’ feature to send back data from Web forms, often before a user even fills out the form. These more sophisticated technologies can provide detailed information about user behavior and data, including fine-grained details such as the user’s scrolls and mouse-clicks and even in some instances the user’s username and password for a given Web site (which can be either intentional on the part of the user or unintentional on the part of the Web site).
A recent study suggests that specific instances of third-party tracking software are pervasive. The same study also discovered that news sites have the largest number of tracking parties on any given first-party site; other popular categories for tracking include arts, sports, and shopping Web sites. Cross-device tracking refers to the practice of linking activities of a single user across multiple devices (e.g., smartphones, tablets, desktop machines, other ‘‘smart devices’’); the practice aims to track a user’s behavior, even as they use different devices.
Certain aspects of cross-device tracking may improve user experience. For example, as with cookies on a single device or browser, cross-device tracking can allow a user to maintain a seamless experience when moving from one device to the next (e.g., continuing to read a book or watch a movie from the place where the user left off). Cross-device tracking can also be useful for preventing fraud; for example, a service provider may notice that a user has logged in from an unfamiliar device in a completely new location. When a user attempts a login from an unrecognized device, a service provider can take additional steps to authenticate the user (e.g., two-factor authentication).
Cross-device tracking is most common by first-party services, such as email service providers, content providers (e.g., streaming video services), and commerce sites, but third parties are also becoming increasingly adept at tracking users across devices.
1. Cross-device tracking may be deterministic, based on a persistent identifier such as a login that is tied to a specific user.
2. Cross-device tracking may also be probabilistic; the IP address is one example of a probabilistic identifier that can be used to implement cross-device tracking. For example, technologies such as network address translation can cause multiple devices on a network to have the same public IP address. Suppose that a user visits a Web site from a mobile device (e.g., a smartphone) and uses that device at both home and work. A third party can set IP address information in the device’s cookies. That user may then appear from two public IP addresses, one at work, and one at home, and those two IP addresses may be linked by the same third party cookie; if the user then visits that third party from different devices that share either of those two IP addresses, then those additional devices can be linked to the same user with high confidence.
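The probabilistic case above can be sketched with toy data. The devices and the (documentation-range) IP addresses are invented for the example; the logic simply links any device that shares a public IP address with a device the tracker has already seen.

```python
# Sketch of probabilistic cross-device linking via shared public IPs.
from collections import defaultdict

sightings = [                     # (device, public IP) observed by a tracker
    ("phone",   "203.0.113.7"),   # home
    ("phone",   "198.51.100.2"),  # work
    ("laptop",  "203.0.113.7"),   # home
    ("desktop", "198.51.100.2"),  # work
]

ip_to_devices = defaultdict(set)
for device, ip in sightings:
    ip_to_devices[ip].add(device)

def linked_to(device):
    # Every device sharing any IP with 'device' is probabilistically linked.
    linked = set()
    for devices in ip_to_devices.values():
        if device in devices:
            linked |= devices
    linked.discard(device)
    return linked
```

The phone, seen at both home and work, becomes the hub that ties the laptop and the desktop to the same presumed user; in practice the confidence would also depend on timing and other signals.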
Cross-device tracking often uses a combination of deterministic and probabilistic techniques; many of these techniques do not require the user to be logged into any site to enable this type of tracking. For example, some parties offer ‘‘analytics’’ services that, when embedded across many first-party Web sites, allow the third party to track a user across Web sites and devices. Third parties often work together to track users across devices and services using a practice called cookie syncing, described in more detail later in this section.
Cross-device tracking enables more sophisticated inference of higher-level user activities, since data from different devices can be combined to build a more comprehensive picture of an individual user’s activity. For example, data about a user’s location (as collected from a mobile device) can be combined with a user’s search history and social network activity (such as ‘‘likes’’) to determine, for example, whether a user has physically visited a store following an online search or online advertising exposure.
Device and Browser Fingerprinting
Even when users disable common tracking mechanisms such as third-party cookies, Web sites and third parties can still track users based on environmental, contextual, and device information that the device returns to the server. Based on a collection of this information, a third party may be able to uniquely identify, or ‘‘fingerprint,’’ a user across different sites and over time.
One well-known fingerprinting method is a technique called canvas fingerprinting, whereby the HTML canvas is used to identify a device. The HTML canvas allows a Web application to draw graphics in real time. Differences in font rendering, smoothing, dimensions, and some other features may cause each device to draw an image differently, and the resulting pixels can serve as a device fingerprint. The technique was first discovered in 2012, but not brought to public attention until 2014. Although there was a backlash at that time, many trackers continue to use canvas fingerprinting and related techniques such as canvas font fingerprinting, which identifies a device based on the browser’s font list; a recent study found that these techniques are still present on thousands of sites. Web sites can also use browser APIs to retrieve other information for tracking devices, including information such as the battery status, which can be used to track a user based on battery charge level and discharge time. Other reports describe how knowing the battery status of a device can be used to track a device and therefore associate a device with a user (Olejnik et al., 2015).
Cookie Syncing
When different third-party trackers share information with each other, these parties can track an individual user even as they visit Web sites that have different tracking mechanisms installed. Cookie syncing is difficult to detect and also facilitates merging of datasets about individual users between disparate third parties, creating significant privacy concerns. A recent study suggests that the practice of cookie syncing is widespread among third-party trackers.
7.4 STREAMING AUDIO AND VIDEO
Email and Web applications are not the only major uses of networks. For many people, audio and video are the holy grail of networking. When the word ‘‘multimedia’’ is mentioned, both the propellerheads and the suits begin salivating as if on cue. The former see immense technical challenges in providing good quality voice over IP and 8K video-on-demand to every computer. The latter see equally immense profits in it.
While the idea of sending audio and video over the Internet has been around since the 1970s at least, it is only since roughly 2000 that real-time audio and real-time video traffic has grown with a vengeance. Real-time traffic is different from Web traffic in that it must be played out at some predetermined rate to be useful. After all, watching a video in slow motion with fits and starts is not most people’s idea of fun. In contrast, the Web can have short interruptions, and page loads can take more or less time, within limits, without it being a major problem.
Two things happened to enable this growth. First, computers have become much more powerful and are equipped with microphones and cameras so that they can input, process, and output audio and video data with ease. Second, a flood of
Internet bandwidth has come to be available. Long-haul links in the core of the Internet run at many gigabits/sec, and broadband and 802.11ac wireless reaches users at the edge of the Internet. These developments allow ISPs to carry tremen- dous levels of traffic across their backbones and mean that ordinary users can con- nect to the Internet 100–1000 times faster than with a 56-kbps telephone modem.
The flood of bandwidth caused audio and video traffic to grow, but for different reasons. Telephone calls take up relatively little bandwidth (in principle 64 kbps, but less when compressed), yet telephone service has traditionally been expensive. Companies saw an opportunity to carry voice traffic over the Internet using existing bandwidth to cut down on their telephone bills. Startups such as Skype saw a way to let customers make free telephone calls using their Internet connections. Upstart telephone companies saw a cheap way to carry traditional voice calls using IP networking equipment. The result was an explosion of voice data carried over the Internet, called Internet telephony; it is discussed in Sec. 7.4.4.
Unlike audio, video takes up a large amount of bandwidth. Reasonable-quality Internet video is encoded with compression, resulting in a stream of around 8 Mbps for 4K (which is 7 GB for a 2-hour movie). Before broadband Internet access, sending movies over the network was prohibitive. Not so any more. With the spread of broadband, it became possible for the first time for users to watch decent, streamed video at home. People love to do it. Around a quarter of the Internet users on any given day are estimated to visit YouTube, the popular video sharing site. The movie rental business has shifted to online downloads. And the sheer size of videos has changed the overall makeup of Internet traffic. The majority of Internet traffic is already video, and it is estimated that 90% of Internet traffic will be video within a few years.
Given that there is enough bandwidth to carry audio and video, the key issue for designing streaming and conferencing applications is network delay. Audio and video need real-time presentation, meaning that they must be played out at a predetermined rate to be useful. Long delays mean that calls that should be interactive no longer are. This problem is clear if you have ever talked on a satellite phone, where the delay of up to half a second is quite distracting. For playing music and movies over the network, the absolute delay does not matter, because it only affects when the media starts to play. But the variation in delay, called jitter, still matters. It must be masked by the player or the audio will sound unintelligible and the video will look jerky.
As an aside, the term multimedia is often used in the context of the Internet to mean video and audio. Literally, multimedia is just two or more media. That definition makes this book a multimedia presentation, as it contains text and graphics (the figures). However, that is probably not what you had in mind, so we use the term ‘‘multimedia’’ to imply two or more continuous media, that is, media that have to be played during some well-defined time interval. The two media are normally video with audio, that is, moving pictures with sound. Smell may take a while. Many people also refer to pure audio, such as Internet telephony or
Internet radio, as multimedia as well, which it is clearly not. Actually, a better term for all these cases is streaming media. Nonetheless, we will follow the herd and consider real-time audio to be multimedia as well.
7.4.1 Digital Audio
An audio (sound) wave is a one-dimensional acoustic (pressure) wave. When an acoustic wave enters the ear, the eardrum vibrates, causing the tiny bones of the inner ear to vibrate along with it, sending nerve pulses to the brain. These pulses are perceived as sound by the listener. In a similar way, when an acoustic wave strikes a microphone, the microphone generates an electrical signal, representing the sound amplitude as a function of time.
The frequency range of the human ear runs from 20 Hz to 20,000 Hz. Some animals, notably dogs, can hear higher frequencies. The ear hears loudness logarithmically, so the ratio of two sounds with power A and B is conventionally expressed in dB (decibels) as the quantity 10 log10(A/B). If we define the lower limit of audibility (a sound pressure of about 20 µPascals) for a 1-kHz sine wave as 0 dB, an ordinary conversation is about 50 dB and the pain threshold is about 120 dB. The dynamic range is a factor of more than 1 million.
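The decibel arithmetic above can be checked with a one-line function: the ratio of two sound powers A and B is 10 log10(A/B) dB, so the 120-dB span from the threshold of hearing to the pain threshold corresponds to a power ratio of 10^12 (the ‘‘more than 1 million’’ in the text is the corresponding ratio of sound pressures, 10^6).

```python
# Decibels from a power ratio, and back again.
import math

def db(a, b):
    # Ratio of powers a and b expressed in decibels.
    return 10 * math.log10(a / b)

# Invert: a 120 dB span corresponds to a power ratio of 10**12.
pain_over_threshold = 10 ** (120 / 10)
```

Because the scale is logarithmic, every additional 10 dB multiplies the power by another factor of 10.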
The ear is surprisingly sensitive to sound variations lasting only a few milliseconds. The eye, in contrast, does not notice changes in light level that last only a few milliseconds. The result of this observation is that jitter of only a few milliseconds during the playout of multimedia affects the perceived sound quality much more than it affects the perceived image quality.
Digital audio is a digital representation of an audio wave that can be used to recreate it. Audio waves can be converted to digital form by an ADC (Analog-to-Digital Converter). An ADC takes an electrical voltage as input and generates a binary number as output. In Fig. 7-31(a) we see an example of a sine wave. To represent this signal digitally, we can sample it every ΔT seconds, as shown by the bar heights in Fig. 7-31(b). If a sound wave is not a pure sine wave but a linear superposition of sine waves where the highest frequency component present is f, the Nyquist theorem (see Chap. 2) states that it is sufficient to make samples at a frequency 2f. Sampling more often is of no value since the higher frequencies that such sampling could detect are not present.
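Sampling and 4-bit quantization, in the spirit of Fig. 7-31, can be sketched as follows. A 1-kHz sine is sampled at eight times its frequency (comfortably above the 2f Nyquist minimum, so the shape of the wave is easy to see) and each sample is mapped onto one of 16 levels; the numbers are chosen purely for illustration.

```python
# Sampling a sine wave and quantizing each sample to 4 bits (16 levels).
import math

f = 1000.0                # the signal's (highest) frequency, in Hz
fs = 8 * f                # sampling rate, well above the Nyquist rate 2f
n_samples = 8

samples = [math.sin(2 * math.pi * f * (k / fs)) for k in range(n_samples)]

def quantize4(x):
    # Map the range [-1.0, 1.0] onto integer codes 0..15 (4 bits).
    return max(0, min(15, int((x + 1.0) / 2.0 * 15 + 0.5)))

codes = [quantize4(s) for s in samples]
```

Each code is one of 16 values, so one full cycle of the wave costs only 4 bits per sample; the price is the small rounding (quantization) error in each sample.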
The reverse process takes digital values and produces an analog electrical voltage. It is done by a DAC (Digital-to-Analog Converter). A loudspeaker can then convert the analog voltage to acoustic waves so that people can hear sounds.
Audio Compression
Audio is often compressed to reduce bandwidth needs and transfer times, even though audio data rates are much lower than video data rates. All compression systems require two algorithms: one is used for compressing the data at the source,
Figure 7-31. (a) A sine wave. (b) Sampling the sine wave. (c) Quantizing the samples to 4 bits.
and another is used for decompressing it at the destination. In the literature, these algorithms are referred to as the encoding and decoding algorithms, respectively. We will use this terminology too.
Compression algorithms exhibit certain asymmetries that are important to understand. Even though we are considering audio first, these asymmetries hold for video as well. The first asymmetry applies to encoding the source material. For many applications, a multimedia document will only be encoded once (when it is stored on the multimedia server) but will be decoded thousands of times (when it is played back by customers). This asymmetry means that it is acceptable for the encoding algorithm to be slow and require expensive hardware provided that the decoding algorithm is fast and does not require expensive hardware.
The second asymmetry is that the encode/decode process need not be invertible. That is, when compressing a data file, transmitting it, and then decompressing it, the user expects to get the original back, accurate down to the last bit. With multimedia, this requirement does not exist. It is usually acceptable to have the audio (or video) signal after encoding and then decoding be slightly different from the original as long as it sounds (or looks) the same. When the decoded output is not exactly equal to the original input, the system is said to be lossy. If the input and output are identical, the system is lossless. Lossy systems are important because accepting a small amount of information loss normally means a huge payoff in terms of the compression ratio possible.
Many audio compression algorithms have been developed. Probably the most popular formats are MP3 (MPEG audio layer 3) and AAC (Advanced Audio Coding) as carried in MP4 (MPEG-4) files. To avoid confusion, note that MPEG provides audio and video compression. MP3 refers to the audio compression portion (part 3) of the MPEG-1 standard, not the third version of MPEG, which has been replaced by MPEG-4. AAC is the successor to MP3 and the default audio encoding used in MPEG-4. MPEG-2 allows both MP3 and AAC audio. Is that clear now? The nice thing about standards is that there are so many to choose from. And if you do not like any of them, just wait a year or two.
Audio compression can be done in two ways. In waveform coding, the signal is transformed mathematically by a Fourier transform into its frequency components. In Chap. 2, we showed an example function of time and its Fourier amplitudes in Fig. 2-12(a). The amplitude of each component is then encoded in a minimal way. The goal is to reproduce the waveform fairly accurately at the other end in as few bits as possible.
The other way, perceptual coding, exploits certain flaws in the human auditory system to encode a signal in such a way that it sounds the same to a human listener, even if it looks quite different on an oscilloscope. Perceptual coding is based on the science of psychoacoustics—how people perceive sound. Both MP3 and AAC are based on perceptual coding.
Perceptual encoding dominates modern multimedia systems, so let us take a look at it. A key property is that some sounds can mask other sounds. For example, imagine that you are broadcasting a live flute concert on a warm summer day. Then, all of a sudden, a crew of workmen show up with jackhammers and start tearing up the street to replace it. No one can hear the flute any more, so you can just transmit the frequency of the jackhammers and the listeners will get the same musical experience as if you had also broadcast the flute, and you can save bandwidth to boot. This is called frequency masking.
When the jackhammers stop, you do not have to start broadcasting the flute frequencies again for a small period of time, because the ear turns down its gain when it picks up a loud sound and it takes a bit of time to reset it. Transmitting low-amplitude sounds during this recovery period is pointless, and omitting them can save bandwidth. This is called temporal masking. Perceptual encoding relies heavily on not encoding or transmitting audio that the listeners are not going to perceive anyway.
7.4.2 Digital Video
Now that we know all about the ear, it is time to move on to the eye. (No, this section is not followed by one on the nose.) The human eye has the property that when an image appears on the retina, the image is retained for some number of milliseconds before decaying. If a sequence of images is drawn at 50 images/sec, the eye does not notice that it is looking at discrete images. All video systems since the Lumière brothers invented the movie projector in 1895 exploit this principle to produce moving pictures.
The simplest digital representation of video is a sequence of frames, each consisting of a rectangular grid of picture elements, or pixels. Common sizes for screens range from 1280 × 720 (called 720p), through 1920 × 1080 (called 1080p or HD video) and 3840 × 2160 (called 4K), to 7680 × 4320 (called 8K).
Most systems use 24 bits per pixel, with 8 bits each for the red, blue, and green (RGB) components. Red, blue, and green are the primary additive colors and every other color can be made from superimposing them in the appropriate intensity.
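The 24-bit pixel layout just described can be sketched as simple bit packing: 8 bits each of red, green, and blue fit into one integer, with red in the most significant byte (the byte order here is just one common convention, chosen for illustration).

```python
# Packing and unpacking a 24-bit RGB pixel (8 bits per component).
def pack_rgb(r, g, b):
    return (r << 16) | (g << 8) | b

def unpack_rgb(pixel):
    return (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF

white = pack_rgb(255, 255, 255)   # all three components at maximum
```

With 8 bits per component there are 2^24, about 16.8 million, representable colors, which is why 24-bit color is often marketed as ‘‘true color.’’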
Older frame rates vary from 24 frames/sec, which traditional film-based movies used, through 25.00 frames/sec (the PAL system used in most of the world), to 30 frames/sec (the American NTSC system). Actually, if you want to get picky, NTSC uses 29.97 frames/sec instead of 30 due to a hack the engineers introduced during the transition from black-and-white television to color. A bit of bandwidth was needed for part of the color management so they took it by reducing the frame rate by 0.03 frame/sec. PAL used color from its inception, so the rate really is exactly 25.00 frames/sec. In France, a slightly different system, called SECAM, was developed, in part, to protect French companies from German television manufacturers. It also runs at exactly 25.00 frames/sec. During the 1950s, the Communist countries of Eastern Europe adopted SECAM to prevent their people from watching West German (PAL) television and getting Bad Ideas.
To reduce the amount of bandwidth required to broadcast television signals over the air, television stations adopted a scheme in which frames were divided into two fields, one with the odd-numbered rows and one with the even-numbered rows, which were broadcast alternately. This meant that 25 frames/sec was actually 50 fields/sec. This scheme is called interlacing, and it gives less flicker than broadcasting entire frames one after another. Modern video does not use interlacing and just sends entire frames in sequence, usually at 50 frames/sec (PAL) or 59.94 frames/sec (NTSC). This is called progressive video.
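The field-splitting step can be sketched on a toy ‘‘frame’’ represented as a list of rows: the transmitter sends the even-numbered rows as one field and the odd-numbered rows as the other, and the receiver interleaves them back together. The row labels are placeholders standing in for real scan lines.

```python
# Interlacing sketched on a toy frame: split into two fields, then merge.
def split_fields(frame):
    # Even-numbered rows form one field, odd-numbered rows the other.
    return frame[0::2], frame[1::2]

def merge_fields(even, odd):
    frame = []
    for e, o in zip(even, odd):
        frame.extend([e, o])
    return frame

frame = ["row0", "row1", "row2", "row3"]
even, odd = split_fields(frame)
restored = merge_fields(even, odd)
```

Each field carries half the rows, so transmitting fields alternately halves the instantaneous bandwidth while refreshing the screen twice per frame period.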
Video Compression
It should be obvious from our discussion of digital video that compression is critical for sending video over the Internet. Even 720p PAL progressive video requires 553 Mbps of bandwidth, and HD, 4K, and 8K require a lot more. To produce a standard for compressing video that could be used over all platforms and by all manufacturers, the standards committees created a group called MPEG (Motion Picture Experts Group) to come up with a worldwide standard. Very briefly, the standards it came up with, known as MPEG-1, MPEG-2, and MPEG-4, work like this. Every few seconds a complete video frame is transmitted. The frame is compressed using something like the familiar JPEG algorithm that is used for digital still pictures. Then, for the next few seconds, instead of sending out full frames, the transmitter sends out differences between the current frame and the base (full) frame it most recently sent out.
First let us briefly look at the JPEG (Joint Photographic Experts Group) algorithm for compressing a single still image. Instead of working with the RGB components, it converts the image into luminance (brightness) and chrominance (color) components because the eye is much more sensitive to luminance than chrominance, allowing fewer bits to be used to encode the chrominance without loss of perceived image quality. The image is then broken up into blocks of typically 8 × 8 or 10 × 10 pixels, each of which is processed separately. Separately, the
luminance and chrominance are run through a kind of Fourier transform (technically a discrete cosine transformation) to get the spectrum. High-frequency amplitudes can then be discarded. The more amplitudes that are discarded, the fuzzier the image and the smaller the compressed image is. Then standard lossless compression techniques like run-length encoding and Huffman encoding are applied to the remaining amplitudes. If this sounds complicated, it is, but computers are pretty good at carrying out complicated algorithms.
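The transform-then-discard step can be sketched for a single block. The code below is a naive, illustrative 2D DCT on one 8 × 8 block, not the optimized integer transforms real codecs use; the block contents and the cutoff threshold are invented for the example.

```python
import math

N = 8  # JPEG-style block size

def dct2(block):
    """Naive orthonormal 2D DCT-II of an N x N block (for illustration only)."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            cu = math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)
            cv = math.sqrt(1 / N) if v == 0 else math.sqrt(2 / N)
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = cu * cv * s
    return out

def discard_high_freq(coeffs, keep):
    """Zero every coefficient whose frequency indices sum to keep or more.
    More discarding -> smaller but fuzzier image."""
    return [[coeffs[u][v] if u + v < keep else 0.0 for v in range(N)]
            for u in range(N)]
```

For a perfectly flat block, all the energy lands in the single DC coefficient, which is why smooth regions (blue sky) compress so well: after discarding, almost nothing is left to entropy-code.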
Now on to the MPEG part, described below in a simplified way. The frame following a full JPEG (base) frame is likely to be very similar to the JPEG frame, so instead of encoding the full frame, only the blocks that differ from the base frame are transmitted. A block containing, say, a piece of blue sky is likely to be the same as it was 20 msec earlier, so there is no need to transmit it again. Only the blocks that have changed need to be retransmitted.
As an example, consider the situation of a camera mounted securely on a tripod with an actor walking toward a stationary tree and house. The first three frames are shown in Fig. 7-32. The encoding of the second frame just sends the blocks that have changed. Conceptually, the receiver starts out producing the second frame by copying the first frame into a buffer and then applying the changes. It then stores the second frame uncompressed for display. It also uses the second frame as the base for applying the changes that arrive describing the difference between the third frame and the second one.
Figure 7-32. Three consecutive frames.
It is slightly more complicated than this, though. If a block (say, the actor) is present in the second frame but has moved, MPEG allows the encoder to say, in effect, ‘‘block 29 from the previous frame is present in the new frame offset by a distance (Δx, Δy), and furthermore the sixth pixel has changed to abc and the 24th pixel is now xyz.’’ This allows even more compression.
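The pure block-differencing idea (without motion compensation) can be sketched as follows. Frames are toy 2D grids and the block size is arbitrary; a real encoder would additionally search for moved blocks and send (Δx, Δy) offsets.

```python
def changed_blocks(prev, curr, block):
    """Return {(block_row, block_col): new_block} for blocks that differ
    between two frames (2D lists whose dimensions are multiples of block)."""
    h = len(prev)
    w = len(prev[0])
    updates = {}
    for r in range(0, h, block):
        for c in range(0, w, block):
            old = [row[c:c + block] for row in prev[r:r + block]]
            new = [row[c:c + block] for row in curr[r:r + block]]
            if old != new:               # only changed blocks are transmitted
                updates[(r // block, c // block)] = new
    return updates

def apply_updates(prev, updates, block):
    """Receiver side: copy the previous frame, then paste in the updates."""
    frame = [row[:] for row in prev]
    for (br, bc), blk in updates.items():
        for i, row in enumerate(blk):
            frame[br * block + i][bc * block:bc * block + block] = row
    return frame
```

An unchanged patch of blue sky costs nothing to transmit; only the blocks the actor walked through are resent.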
We mentioned asymmetries between encoding and decoding before. Here we see one. The encoder can spend as much time as it wants searching for blocks that have moved and blocks that have changed somewhat to determine whether it is better to send a list of updates to the previous frame or a complete new JPEG frame. Finding a moved block is a lot more work than simply copying a block from the previous image and pasting it into the new one at a known (Δx, Δy) offset.
To be a bit more complete, MPEG actually has three different kinds of frames, not just two:
1. I (Intracoded) frames that are self-contained compressed still images.
2. P (Predictive) frames that are differences with the previous frame.
3. B (Bidirectional) frames that code differences with the next I-frame.
The B-frames require the receiver to stop processing until the next I-frame arrives and then work backward from it. Sometimes this gives more compression, but having the encoder constantly check whether differences with the previous frame or differences with any one of the next 30, 50, or 80 frames give the smallest result is time consuming on the encoding side but not on the decoding side. This asymmetry is exploited to the maximum to give the smallest possible encoded file. The MPEG standards do not specify how to search, how far to search, or how good a match has to be in order to send differences or a complete new block. This is up to each implementation.
Audio and video are encoded separately as we have described. The final MPEG-encoded file consists of chunks containing some number of compressed images and the corresponding compressed audio to be played while the frames in
that chunk are displayed. In this way, the video and audio are kept synchronized. Note that this is a rather simplified description. In reality, even more tricks are used to get better compression, but the basic ideas given above are essentially correct. The most recent format is MPEG-4, also called MP4. It is formally defined in a standard known as H.264. Its successor (defined for resolutions up to 8K) is H.265. H.264 is the format most consumer video cameras produce. Because the camera has to record the video on the SD card or other medium in real time, it has very little time to hunt for blocks that have moved a little. Consequently, the compression is not nearly as good as what a Hollywood studio can do when it dynamically allocates 10,000 computers at a cloud server to encode its latest production. This is encoding/decoding asymmetry in action.
7.4.3 Streaming Stored Media
Let us now move on to network applications. Our first case is streaming a video that is already stored on a server somewhere, for example, watching a YouTube or Netflix video. The most common example of this is watching videos over the Internet. This is one form of VoD (Video on Demand). Other forms of video on demand use a provider network that is separate from the Internet to deliver the movies (e.g., the cable TV network).
The Internet is full of music and video sites that stream stored multimedia files. Actually, the easiest way to handle stored media is not to stream it. The straightforward way to make the video (or music track) available is just to treat the
pre-encoded video (or audio) file as a very big Web page and let the browser download it. The sequence of four steps is shown in Fig. 7-33.
Figure 7-33. Playing media over the Web via simple downloads.
The browser goes into action when the user clicks on a movie. In step 1, it sends an HTTP request for the movie to the Web server to which the movie is linked. In step 2, the server fetches the movie (which is just a file in MP4 or some other format) and sends it back to the browser. Using the MIME type, the browser looks up how it is supposed to display the file. The browser then saves the entire movie to a scratch file on disk in step 3. It then starts the media player, passing it the name of the scratch file. Finally, in step 4 the media player starts reading the file and playing the movie. Conceptually, this is no different than fetching and displaying a static Web page, except that the downloaded file is ‘‘displayed’’ by using a media player instead of just writing pixels to a monitor.
In principle, this approach is completely correct. It will play the movie. There is no real-time network issue to address either because the download is simply a file download. The only trouble is that the entire video must be transmitted over the network before the movie starts. Most customers do not want to wait an hour for their ‘‘video on demand’’ to start, so something better is needed.
What is needed is a media player that is designed for streaming. It can either be part of the Web browser or an external program called by the browser when a video needs to be played. Modern browsers that support HTML5 usually have a built-in media player.
A media player has five major jobs to do:
1. Manage the user interface.
2. Handle transmission errors.
3. Decompress the content.
4. Eliminate jitter.
5. Decrypt the file.
Most media players nowadays have a glitzy user interface, sometimes simulating a stereo unit, with shiny buttons, knobs, sliders, and visual displays. Often there are
interchangeable front panels, called skins, that the user can drop onto the player. The media player has to manage all this and interact with the user. The next three jobs are related and depend on the network protocols. We will go through each one in turn, starting with handling transmission errors. Dealing with errors depends on whether a TCP-based transport like HTTP is used to transport the media, or a UDP-based transport like RTP (Real-time Transport Protocol) is used. If a TCP-based transport is being used, then there are no errors for the media player to correct because TCP already provides reliability by using retransmissions. This is an easy way to handle errors, at least for the media player, but it does complicate the removal of jitter in a later step because timing out and asking for retransmissions introduces uncertain and variable delays in the movie.
Alternatively, a UDP-based transport like RTP can be used to move the data. With these protocols, there are no retransmissions. Thus, packet loss due to congestion or transmission errors will mean that some of the media does not arrive. It is up to the media player to deal with this problem. One way is to ignore the problem and just have bits of video and audio be wrong. If errors are infrequent, this works fine and almost no one will notice. Another possibility is to use forward error correction, for example, encoding the video file with some redundancy such as a Hamming code or a Reed-Solomon code. Then the media player will have enough information to correct errors on its own, without having to ask for retransmissions or skip bits of damaged movies.
The downside here is that adding redundancy to the file makes it bigger. Another approach involves using selective retransmission of the parts of the video stream that are most important to play back the content. For example, in a compressed video sequence, a packet loss in an I-frame is much more consequential, since the decoding errors that result from the loss can propagate throughout the group of pictures. On the other hand, losses in derivative frames, including P-frames and B-frames, are easier to recover from. Similarly, the value of a retransmission also depends on whether the retransmission of the content would arrive in time for playback. As a result, some retransmissions can be far more valuable than others, and selectively retransmitting certain packets (e.g., those within I-frames that would arrive before playback) is one possible strategy. Protocols have been built on top of RTP and QUIC to provide unequal loss protection when videos are streamed over UDP (Feamster et al., 2000; and Palmer et al., 2018).
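The simplest form of forward error correction is a single XOR parity packet per group of media packets, which lets the receiver rebuild any one lost packet without a retransmission. This sketch assumes equal-length packets; heavier-weight codes such as Hamming or Reed-Solomon can recover multiple losses at the cost of more redundancy.

```python
def xor_parity(packets):
    """Compute one parity packet: the byte-wise XOR of all data packets
    in the group (all packets assumed to be the same length)."""
    parity = bytearray(len(packets[0]))
    for p in packets:
        for i, b in enumerate(p):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity):
    """Rebuild the single missing packet (the None entry) by XORing the
    parity with every packet that did arrive."""
    missing = received.index(None)
    rec = bytearray(parity)
    for j, p in enumerate(received):
        if j != missing:
            for i, b in enumerate(p):
                rec[i] ^= b
    return missing, bytes(rec)
```

The cost is one extra packet per group: for a group of 10 packets, about 10% more bandwidth buys protection against any single loss within the group.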
The media player’s third job is decompressing the content. Although this task is computationally intensive, it is fairly straightforward. The thorny issue is how to decode media if the underlying network protocol does not correct transmission errors. In many compression schemes, later data cannot be decompressed until the earlier data has been decompressed, because the later data is encoded relative to the earlier data. Recall that a P-frame is based upon the most recent I-frame (and the P-frames following it). If the I-frame is damaged and cannot be decoded, all the subsequent P-frames are useless. The media player will then be forced to wait for the next I-frame and simply skip a few seconds of video.
This reality forces the encoder to make a decision. If I-frames are spaced closely, say, one per second, the gap when an error occurs will be fairly small, but the video will be bigger because I-frames are much bigger than P- or B-frames. If I-frames are, say, 5 seconds apart, the video file will be much smaller but there will be a 5-second gap if an I-frame is damaged and a smaller gap if a P-frame is damaged. For this reason, when the underlying protocol is TCP, I-frames can be spaced much further apart than if RTP is used. Consequently, many video-streaming sites use TCP to allow a smaller encoded file with widely spaced I-frames and less bandwidth needed for smooth playback.
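The trade-off can be made concrete with back-of-the-envelope numbers. The per-frame sizes below (100 KB per I-frame, 10 KB per P-frame) are invented for illustration; only the relative effect of the spacing matters.

```python
def stream_size_mb(duration_s, fps=25, i_spacing_s=1, i_kb=100, p_kb=10):
    """Rough encoded size in MB: one I-frame every i_spacing_s seconds,
    P-frames for everything else (frame sizes are illustrative assumptions)."""
    total_frames = duration_s * fps
    i_frames = duration_s // i_spacing_s
    p_frames = total_frames - i_frames
    return (i_frames * i_kb + p_frames * p_kb) / 1024

# One minute of 25 frames/sec video:
close = stream_size_mb(60, i_spacing_s=1)  # I-frame every second
wide = stream_size_mb(60, i_spacing_s=5)   # I-frame every 5 seconds
```

With these assumed sizes, widening the I-frame spacing from 1 to 5 seconds cuts the stream by roughly a fifth, at the price of longer gaps when an I-frame is lost.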
The fourth job is to eliminate jitter, the bane of all real-time systems. Using TCP makes this much worse, because it introduces random delays whenever retransmissions are needed. The general solution that all streaming systems use is a playout buffer. Before starting to play the video, the system collects 5–30 seconds worth of media, as shown in Fig. 7-34. Playing drains media regularly from the buffer so that the audio is clear and the video is smooth. The startup delay gives the buffer a chance to fill to the low-water mark. The idea is that data should now arrive regularly enough that the buffer is never completely emptied. If that were to happen, the media playout would stall.
Figure 7-34. The media player buffers input from the media server and plays from the buffer rather than directly from the network.
Buffering introduces a new complication. The media player needs to keep the buffer partly full, ideally between the low-water mark and the high-water mark. This means when the buffer passes the high-water mark, the player needs to tell the source to stop sending, lest it lose data for lack of a place to put it. The high-water mark has to be before the end of the buffer because data will continue to stream in until the Stop request gets to the media server. Once the server stops sending and the pipeline is empty, the buffer will start draining. When it hits the low-water mark, the player sends a Start command to the server to start streaming again.
By using a protocol in which the media player can command the server to stop and start, the media player can keep enough, but not too much, media in the buffer to ensure smooth playout. Since RAM is fairly cheap these days, a media player, even on a smartphone, could allocate enough buffer space to hold a minute or more of media, if need be.
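The water-mark logic can be sketched with a toy simulation. The marks, rates, and one-second time step below are arbitrary illustrative values, not those of any real player.

```python
class PlayoutBuffer:
    """Toy playout buffer: the player issues Start/Stop commands to the
    server as the buffered seconds of media cross the water marks."""

    def __init__(self, low=5.0, high=25.0):
        self.low, self.high = low, high
        self.buffered = 0.0      # seconds of media currently buffered
        self.streaming = True    # is the server currently sending?

    def tick(self, arrival_rate, playout_rate=1.0, dt=1.0):
        """Advance one time step; rates are in media-seconds per real second.
        Returns the command sent to the server this step, if any."""
        if self.streaming:
            self.buffered += arrival_rate * dt
        self.buffered = max(0.0, self.buffered - playout_rate * dt)
        if self.streaming and self.buffered >= self.high:
            self.streaming = False
            return "Stop"
        if not self.streaming and self.buffered <= self.low:
            self.streaming = True
            return "Start"
        return None
```

A fast server (here delivering 5 seconds of media per second against a playout rate of 1) quickly fills the buffer to the high-water mark and is told to stop; the buffer then drains to the low-water mark and the player restarts it, decoupling transmission rate from playout rate exactly as described above.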
The start-stop mechanism has another nice feature. It decouples the server’s transmission rate from the playout rate. Suppose, for example, that the player has to play out the video at 8 Mbps. When the buffer drops to the low-water mark, the player will tell the server to deliver more data. If the server is capable of delivering it at 100 Mbps, that is not a problem. It just comes in and is stored in the buffer. When the high-water mark is reached, the player tells the server to stop. In this way, the server’s transmission rate and the playout rate are completely decoupled. What started out as a real-time system has become a simple nonreal-time file transfer system. Getting rid of all the real-time transmission requirements is another reason YouTube, Netflix, Hulu, and other streaming servers use TCP. It makes the whole system design much simpler.
Determining the size of the buffer is a bit tricky. If lots of RAM is available, at first glance it sounds like it might make sense to have a large buffer and allow the server to keep it almost full, just in case the network suffers some congestion later on. However, users are sometimes finicky. If a user finds a scene boring and uses the buttons on the media player’s interface to skip forward, that might render most or all of the buffer useless. In any event, jumping forward (or backward) to a specific point in time is unlikely to work unless that frame happens to be an I-frame. If not, the player has to search for a nearby I-frame. If the new play point is outside the buffer, the entire buffer has to be cleared and reloaded. In effect, users who skip around a lot (and there are many of them) waste network bandwidth by invalidating precious data in their buffers. Systemwide, the existence of users who skip around a lot argues for limiting the buffer size, even if there is plenty of RAM available. Ideally, a media player could observe the user’s behavior and pick a buffer size to match the user’s viewing style.
All commercial videos are encrypted to prevent piracy, so media players have to be able to decrypt them as they come in. That is the fifth task in the list above.
DASH and HLS
The plethora of devices for viewing media introduces some complications we need to look at now. Someone who buys a bright, shiny, and very expensive 8K monitor will want movies delivered in 7680 × 4320 resolution at 100 or 120 frames/sec. But if halfway through an exciting movie she has to go to the doctor and wants to finish watching it in the waiting room on a 1280 × 720 smartphone that can handle at most 25 frames/sec, she has a problem. From the streaming site’s point of view, this raises the question of at what resolution and frame rate movies should be encoded.
The easy answer is to use every possible combination. At most it wastes disk space to encode every movie at seven screen resolutions (e.g., smartphone, NTSC, PAL, 720p, HD, 4K, and 8K) and six frame rates (e.g., 25, 30, 50, 60, 100, and 120), for a total of 42 variants, but disk space is not very expensive. A bigger, but
related problem is what happens when the viewer is stationary at home with her big, shiny monitor, but due to network congestion, the bandwidth between her and the server is changing wildly and cannot always support the full resolution.
Fortunately, several solutions have been already implemented. One solution is DASH (Dynamic Adaptive Streaming over HTTP). The basic idea is simple and it is compatible with HTTP (and HTTPS), so it can be streamed on a Web page. The streaming server first encodes its movies at multiple resolutions and frame rates and has them all stored in its disk farm. Each version is not stored as a single file, but as many files, each storing, say, 10 seconds of video and audio. This would mean that a 90-minute movie with seven screen resolutions and six frame rates (42 variants) would require 42 × 540 = 22,680 separate files, each with 10 seconds worth of content. In other words, each file holds a segment of the movie at one specific resolution and frame rate. Associated with the movie is a manifest, officially known as an MPD (Media Presentation Description), which lists the names of all these files and their properties, including resolution, frame rate, and frame number in the movie.
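The segment bookkeeping is easy to verify with the counts from the text; the helper below is just arithmetic, not part of any DASH implementation.

```python
def dash_segment_count(duration_min, n_resolutions, n_frame_rates, seg_s=10):
    """Number of stored segment files when every resolution x frame-rate
    variant of a movie is cut into fixed-length segments."""
    variants = n_resolutions * n_frame_rates
    segments_per_variant = duration_min * 60 // seg_s
    return variants * segments_per_variant

# 90-minute movie, 7 resolutions, 6 frame rates, 10-second segments:
files = dash_segment_count(90, 7, 6)  # 42 variants x 540 segments each
```

Each of those files is independently addressable by name in the MPD, which is what lets the player switch variants at any 10-second boundary.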
To make this approach work, both the player and the server must use the DASH protocol. The user side could either be the browser itself, a player shipped to the browser as a JavaScript program, or a custom application (e.g., for a mobile device, or a streaming set-top box). The first thing it does when it is time to start viewing the movie is fetch the manifest for the movie, which is just a small file, so a normal HTTPS GET request is all that is needed.
The player then interrogates the device where it is running to discover its maximum resolution and possibly other characteristics, such as what audio formats it can handle and how many speakers it has. Then it begins running some tests by sending test messages to the server to try to estimate how much bandwidth is available. Once it has figured out what resolution the screen has and how much bandwidth is available, the player consults the manifest to find the first, say, 10 seconds of the movie that gives the best quality for the screen and available bandwidth.
But that’s not the end of the story. As the movie plays, the player continues to run bandwidth tests. Every time it needs more content, that is, when the amount of media in the buffer hits the low-water mark, it again consults the manifest and orders the appropriate file depending on where it is in the movie and which resolution and frame rate it wants. If the bandwidth varies wildly during playback, the movie shown may change from 8K at 100 frames/sec to HD at 25 frames/sec and back several times a minute. In this way, the system adapts rapidly to changing network conditions and allows the best viewing experience consistent with the available resources. Companies such as Netflix have published information about how they adapt the bitrate of a video stream based on the playback buffer occupancy (Huang et al., 2014). An example is shown in Fig. 7-35.
In Fig. 7-35, as the bandwidth decreases, the player decides to ask for increasingly low-resolution versions. However, it could also have compromised in other ways. For example, sending out 300 frames for a 10-second playout requires less bandwidth than sending out 600 frames, so the player could have reduced the frame rate rather than the resolution.
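The per-segment adaptation decision can be sketched as picking the best variant the estimated bandwidth supports. The bitrate ladder and the halve-when-nearly-empty heuristic below are invented for illustration; they are not Netflix's actual algorithm, which the cited paper describes in terms of buffer occupancy.

```python
# Bitrate ladder in Mbps (illustrative values, not from any real service)
LADDER = [(0.5, "240p"), (2.0, "720p"), (8.0, "HD"), (25.0, "4K")]

def choose_variant(est_bandwidth_mbps, buffer_s, low_water_s=5.0):
    """Pick the highest-bitrate variant the estimated bandwidth supports.
    When the buffer is below the low-water mark, a stall is imminent,
    so leave headroom by halving the usable bandwidth estimate."""
    usable = est_bandwidth_mbps
    if buffer_s < low_water_s:
        usable *= 0.5
    best = LADDER[0]  # worst case: the lowest rung always fits
    for rate, name in LADDER:
        if rate <= usable:
            best = (rate, name)
    return best[1]
```

A comfortable buffer and 30 Mbps of bandwidth get the 4K variant; the same bandwidth with a nearly empty buffer drops to HD until the buffer recovers.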
7.1.1 History and Overview
Back in the ARPANET days, a file, hosts.txt, listed all the computer names and their IP addresses. Every night, all of the hosts would fetch it from the site at which it was maintained. For a network of a few hundred large timesharing machines, this approach worked reasonably well.
However, well before many millions of PCs were connected to the Internet, everyone involved with it realized that this approach could not continue to work forever. For one thing, the size of the file would become too large. Even more importantly, host name conflicts would occur constantly unless names were centrally managed, something unthinkable in a huge international network due to the load and latency. The Domain Name System was invented in 1983 to address these problems, and it has been a key part of the Internet ever since.
DNS is a hierarchical naming scheme and a distributed database system that implements this naming scheme. It is primarily used for mapping host names to IP addresses, but it has several other purposes, which we will outline in more detail below. DNS is one of the most actively evolving protocols in the Internet. DNS is defined in RFC 1034, RFC 1035, RFC 2181, and further elaborated in many other RFCs.
7.1.2 The DNS Lookup Process
DNS operates as follows. To map a name onto an IP address, an application program calls a library procedure (typically gethostbyname or the equivalent), passing this function the name as a parameter. This process is sometimes referred to as the stub resolver. The stub resolver sends a query containing the name to a local DNS resolver, often called the local recursive resolver or simply the local
resolver, which subsequently performs a so-called recursive lookup for the name against a set of DNS resolvers. The local recursive resolver ultimately returns a response with the corresponding IP address to the stub resolver, which then passes
the result to the function that issued the query in the first place. The query and response messages are sent as UDP packets. Given knowledge of the IP address, the program can then communicate with the host corresponding to the DNS name that it had looked up. We will explore this process in more detail later in this chapter.
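From an application's point of view, this whole machinery hides behind a single library call. A minimal sketch using Python's getaddrinfo (the modern successor to gethostbyname); the example resolves localhost so it does not depend on network access, which also means it is typically answered from the hosts file rather than a real DNS lookup.

```python
import socket

def resolve(name):
    """Ask the OS stub resolver for the IPv4 addresses of a name. The
    stub -> local recursive resolver -> authoritative server chain
    described in the text is entirely hidden behind this one call."""
    infos = socket.getaddrinfo(name, None,
                               family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})
```

Calling resolve("www.cs.uchicago.edu") on a connected machine would trigger the recursive lookup described above; the program sees only the final list of addresses.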
Typically, the stub resolver issues a recursive lookup to the local resolver, meaning that it simply issues the query and waits for the response from the local resolver. The local resolver, on the other hand, issues a sequence of queries to the respective name servers for each part of the name hierarchy; the name server that is responsible for a particular part of the hierarchy is often called the authoritative name server for that domain. As we will see later, DNS uses caching, but caches can be out of date. The authoritative name server is, well, authoritative. It is by definition always correct. Before describing more detailed operation of DNS, we describe the DNS name server hierarchy and how names are allocated.
When a host’s stub resolver sends a query to the local resolver, the local resolver handles the resolution until it has the desired answer, or no answer. It does not return partial answers. On the other hand, the root name server (and each subsequent name server) does not recursively continue the query for the local name server. It just returns a partial answer and moves on to the next query. The local resolver is responsible for continuing the resolution by issuing further iterative queries.
The name resolution process typically involves both mechanisms. A recursive query may always seem preferable, but many name servers (especially the root) will not handle them. They are too busy. Iterative queries put the burden on the originator. The rationale for the local name server supporting a recursive query is that it is providing a service to hosts in its domain. Those hosts do not have to be configured to run a full name server, just to reach the local one. A 16-bit transaction identifier is included in each query and copied to the response so that a name server can match answers to the corresponding query, even if multiple queries are outstanding at the same time.
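On the wire, a DNS query is compact: a 12-byte header beginning with the transaction identifier, followed by the question. A sketch of building one following the RFC 1035 wire format; nothing here is actually sent over the network.

```python
import random
import struct

def build_query(name, qtype=1):  # qtype 1 = A record (IPv4 address)
    """Build a minimal DNS query message per RFC 1035."""
    txid = random.getrandbits(16)   # 16-bit transaction identifier
    flags = 0x0100                  # standard query, recursion desired
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack(">HHHHHH", txid, flags, 1, 0, 0, 0)
    # QNAME: each label preceded by its length, terminated by a zero byte
    qname = b"".join(bytes([len(label)]) + label.encode("ascii")
                     for label in name.split(".")) + b"\x00"
    question = qname + struct.pack(">HH", qtype, 1)  # QCLASS 1 = IN
    return txid, header + question
```

The responder copies the transaction ID into its answer, which is how the querier pairs responses with outstanding queries; the small 16-bit space is also why unsecured DNS was vulnerable to response forgery.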
All of the answers, including all the partial answers returned, are cached. In this way, if a computer at cs.vu.nl queries for cs.uchicago.edu, the answer is cached. If shortly thereafter, another host at cs.vu.nl also queries cs.uchicago.edu, the answer will already be known. Even better, if a host queries for a different host in the same domain, say noise.cs.uchicago.edu, the query can be sent directly to the authoritative name server for cs.uchicago.edu. Similarly, queries for other domains in uchicago.edu can start directly from the uchicago.edu name server. Using cached answers greatly reduces the steps in a query and improves performance. The original scenario we sketched is in fact the worst case that occurs when no useful information is available in the cache.
Cached answers are not authoritative, since changes made at cs.uchicago.edu will not be propagated to all the caches in the world that may know about it. For this reason, cache entries should not live too long. This is the reason that the Time to live field is included in each DNS resource record, a part of the DNS database we will discuss shortly. It tells remote name servers how long to cache records. If a certain machine has had the same IP address for years, it may be safe to cache that information for one day. For more volatile information, it might be safer to purge the records after a few seconds or a minute.
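The caching rule (honor each record's Time to live, then discard) can be sketched as follows. The names, the address (borrowed from the example earlier in this section), and the TTL values are purely illustrative.

```python
import time

class DnsCache:
    """Toy record cache: each answer expires once its TTL has elapsed."""

    def __init__(self):
        self._entries = {}  # name -> (address, expiry timestamp)

    def put(self, name, address, ttl_s):
        """Store an answer; the record's TTL says how long to keep it."""
        self._entries[name] = (address, time.monotonic() + ttl_s)

    def get(self, name):
        """Return the cached address, or None if absent or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expiry = entry
        if time.monotonic() >= expiry:  # record expired: purge it
            del self._entries[name]
            return None
        return address
```

A stable server's record might carry a TTL of a day, while volatile records carry TTLs of seconds, exactly the trade-off the text describes: long TTLs reduce lookups but delay the visibility of changes.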
DNS queries have a simple format that includes various information, including the name being queried (QNAME), as well as other auxiliary information, such as a transaction identifier; the transaction identifier is often used to map queries to responses. Initially, the transaction ID was only 16 bits, and the queries and responses were not secured; this design choice left DNS vulnerable to a variety of attacks including something called a cache poisoning attack, whose details we discuss further in Chap. 8. When performing a series of iterative lookups, a recursive DNS resolver might send the entire QNAME to the sequence of authoritative name servers returning the responses. At some point, protocol designers pointed out that sending the entire QNAME to every authoritative name server in a sequence of iterative queries constituted a privacy risk. As a result, many recursive resolvers now use a process called QNAME minimization, whereby the local resolver only sends the part of the query that the respective authoritative name server has the information to resolve. For example, with QNAME minimization, given a name to resolve such as www.cs.uchicago.edu, a local resolver would send only the string cs.uchicago.edu to the authoritative name server for uchicago.edu, as opposed to the fully qualified domain name (FQDN), to avoid revealing the entire FQDN to the authoritative name server. For more information on QNAME minimization, see RFC 7816.
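The minimization rule is easy to state as code: at each step of the iterative lookup, reveal only one more label. This is a simplified view of the RFC 7816 idea, ignoring details such as zone cuts that do not align with label boundaries.

```python
def minimized_qnames(fqdn):
    """Names a QNAME-minimizing resolver would send at successive steps of
    the iterative lookup, from the TLD's server down to the full name."""
    labels = fqdn.rstrip(".").split(".")
    # Step i reveals only the last i labels of the name.
    return [".".join(labels[-i:]) for i in range(1, len(labels) + 1)]
```

So the root and edu servers learn only about edu and uchicago.edu; only the uchicago.edu server, which needs it, ever sees cs.uchicago.edu.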
Until very recently, DNS queries and responses relied on UDP as their transport protocol, based on the rationale that DNS queries and responses needed to be fast and lightweight, and could not handle the corresponding overhead of a TCP three-way handshake. However, various developments, including the resulting insecurity of the DNS protocol and the myriad subsequent attacks that DNS has been subject to, ranging from cache poisoning to distributed denial-of-service (DDoS) attacks, have resulted in an increasing trend towards the use of TCP as the transport protocol for DNS. Using TCP as the transport protocol for DNS has subsequently allowed DNS to leverage modern secure transport and application-layer protocols, resulting in DNS-over-TLS (DoT) and DNS-over-HTTPS (DoH). We discuss these developments in more detail later in this chapter.
If the DNS stub resolver does not receive a response within some relatively short period of time (a timeout period), the DNS client repeats the query, trying another server for the domain after a small number of retries. This process is designed to handle the case of the server being down as well as the query or response packet getting lost.
7.1.3 The DNS Name Space and Hierarchy
Managing a large and constantly changing set of names is challenging. In the postal system, name management is done by requiring letters to specify (implicitly or explicitly) the country, state or province, city, street address, and name of the addressee. Using this kind of hierarchical addressing ensures that there is no confusion between the Marvin Anderson on Main St. in White Plains, N.Y. and the Marvin Anderson on Main St. in Austin, Texas. DNS works the same way.
For the Internet, the top of the naming hierarchy is managed by an organization called ICANN (Internet Corporation for Assigned Names and Numbers). ICANN was created for this purpose in 1998, as part of the maturing of the Internet to a worldwide, economic concern. Conceptually, the Internet is divided into over 250 top-level domains, where each domain covers many hosts. Each domain is partitioned into subdomains, and these are further partitioned, and so on. All of these domains constitute a namespace hierarchy, which can be represented by a tree, as shown in Fig. 7-1. The leaves of the tree represent domains that have no subdomains (but do contain machines, of course). A leaf domain may contain a single host, or it may represent a company and contain thousands of hosts.
Figure 7-1. A portion of the Internet domain name space. (The figure shows a tree whose root branches into generic top-level domains, such as aero, com, edu, gov, museum, org, and net, and country domains, such as au, jp, uk, us, and nl; these in turn branch into subdomains such as cisco, acm, ieee, and uchicago under the generic TLDs, and vu, oce, ac, and co under the country TLDs, down to leaf hosts such as jack, jill, noise, flits, and fluit.)
The top-level domains come in several different types: gTLD (generic Top Level Domain), ccTLD (country code Top Level Domain), and others. Some of the original generic TLDs, listed in Fig. 7-2, include original domains from the 1980s, plus additional top-level domains introduced to ICANN. The country domains include one entry for every country, as defined in ISO 3166. Internationalized country domain names that use non-Latin alphabets were introduced in 2010. These domains let people name hosts in Arabic, Chinese, Cyrillic, Hebrew, or other languages.
In 2011, there were only 22 gTLDs, but in June 2011, ICANN voted to end restrictions on the creation of additional gTLDs, allowing companies and other
organizations to select essentially arbitrary top-level domains, including TLDs that include non-Latin characters (e.g., Cyrillic). ICANN began accepting applications for new TLDs at the beginning of 2012. The initial cost of applying for a new TLD was nearly 200,000 dollars. Some of the first new gTLDs became operational in 2013, and in July 2013, the first four new gTLDs were launched based on an agreement that was signed in Durban, South Africa. All four were based on non-Latin characters: the Arabic word for ‘‘Web,’’ the Russian word for ‘‘online,’’ the Russian word for ‘‘site,’’ and the Chinese word for ‘‘game.’’ Some tech giants have applied for many gTLDs: Google and Amazon, for example, have each applied for about 100 new gTLDs. Today, some of the most popular gTLDs include top, loan, xyz, and so forth.
Domain Intended use Start date Restricted?
com Commercial 1985 No
edu Educational institutions 1985 Yes
gov Government 1985 Yes
int International organizations 1988 Yes
mil Military 1985 Yes
net Network providers 1985 No
org Non-profit organizations 1985 No
aero Air transport 2001 Yes
biz Businesses 2001 No
coop Cooperatives 2001 Yes
info Informational 2002 No
museum Museums 2002 Yes
name People 2002 No
pro Professionals 2002 Yes
cat Catalan 2005 Yes
jobs Employment 2005 Yes
mobi Mobile devices 2005 Yes
tel Contact details 2005 Yes
travel Travel industry 2005 Yes
xxx Sex industry 2010 No
Figure 7-2. The original generic TLDs, as of 2010. As of 2020, there are more than 1,200 gTLDs.
Getting a second-level domain, such as name-of-company.com, is easy. The top-level domains are operated by companies called registries, which are appointed by ICANN. For example, the registry for com is Verisign. One level down, registrars sell domain names directly to users. There are many of them and they compete on price and service. Common registrars include Domain.com, GoDaddy, and
NameCheap. Fig. 7-3 shows the relationship between registries and registrars as far as registering a domain name is concerned.
Figure 7-3. The relationship between registries and registrars. (ICANN appoints a registry, such as Verisign for com, to operate each top-level domain; registrars, in turn, register domains with the registry on behalf of users.)
The domain name that a machine aims to look up is typically called a FQDN (Fully Qualified Domain Name), such as www.cs.uchicago.edu or cisco.com. The FQDN starts with the most specific part of the domain name, and each part of the hierarchy is separated by a ‘‘.’’ (Technically, all FQDNs end with a ‘‘.’’ as well, signifying the root of the DNS hierarchy, although most operating systems complete that portion of the domain name automatically.)
Each domain is named by the path upward from it to the (unnamed) root. The components are separated by periods (pronounced ‘‘dot’’). Thus, the engineering department at Cisco might be eng.cisco.com., rather than a UNIX-style name such as /com/cisco/eng. Notice that this hierarchical naming means that eng.cisco.com. does not conflict with a potential use of eng in eng.uchicago.edu., which might be used by the English department at the University of Chicago.
Domain names can be either absolute or relative. An absolute domain name always ends with a period (e.g., eng.cisco.com.), whereas a relative one does not. Relative names have to be interpreted in some context to uniquely determine their true meaning. In both cases, a named domain refers to a specific node in the tree and all the nodes under it.
Domain names are case-insensitive, so edu, Edu, and EDU mean the same thing. Component names can be up to 63 characters long, and full path names must not exceed 255 characters. The fact that DNS is case-insensitive has been used to defend against various DNS attacks, including DNS cache poisoning attacks, using a technique called 0x20 encoding (Dagon et al., 2008), which we will discuss in more detail later in this chapter.
In principle, domains can be inserted into the hierarchy in either the generic or the country domains. For example, the domain cc.gatech.edu could equally well be listed under the us country domain as cc.gt.atl.ga.us. In practice, however, most organizations in the United States are under generic domains,
and most outside the United States are under the domain of their country. There is no rule against registering under multiple top-level domains. Large companies often do so (e.g., sony.com, sony.net, and sony.nl).
Each domain controls how it allocates the domains under it. For example, Japan has domains ac.jp and co.jp that mirror edu and com. The Netherlands does not make this distinction and puts all organizations directly under nl. Australian universities are all in edu.au. Thus, all three of the following are university CS and EE departments:
1. cs.uchicago.edu (University of Chicago, in the U.S.).
2. cs.vu.nl (Vrije Universiteit, in The Netherlands).
3. ee.uwa.edu.au (University of Western Australia).
To create a new domain, permission is required of the domain in which it will be included. For example, if a security research group at the University of Chicago wants to be known as security.cs.uchicago.edu, it has to get permission from whoever manages cs.uchicago.edu. (Fortunately, that person is typically not far away, thanks to the federated management architecture of DNS.) Similarly, if a new university is chartered, say, the University of Northern South Dakota, it must ask the manager of the edu domain to assign it unsd.edu (if that is still available). In this way, name conflicts are avoided and each domain can keep track of all its subdomains. Once a new domain has been created and registered, it can create subdomains, such as cs.unsd.edu, without getting permission from anybody higher up the tree.
Naming follows organizational boundaries, not physical networks. For example, if the computer science and electrical engineering departments are located in the same building and share the same LAN, they can nevertheless have distinct domains. Similarly, even if computer science is split over Babbage Hall and Turing Hall, the hosts in both buildings will normally belong to the same domain.
7.1.4 DNS Queries and Responses
We now turn to the structure, format, and purpose of DNS queries, and how the DNS servers answer those queries.
DNS Queries
As previously discussed, a DNS client typically issues a query to a local recur- sive resolver, which performs an iterative query to ultimately resolve the query. The most common query type is an A record query, which asks for a mapping from a domain name to an IP address for a corresponding Internet endpoint. DNS has a range of other resource records (with corresponding queries), as we discuss further in the next section on resource records (i.e., responses).
Although the primary mechanism for DNS has long been to map human-readable names to IP addresses, over the years, DNS queries have been used for a variety of other purposes. Another common use for DNS queries is to look up domains in a DNSBL (DNS-based blacklist), a list that is commonly maintained to keep track of IP addresses associated with spammers and malware. To look up a domain name in a DNSBL, a client might send a DNS A-record query to a special DNS server, such as pbl.spamhaus.org (a ‘‘policy blacklist’’), which corresponds to a list of IP addresses that are not supposed to be making connections to mail servers. To look up a particular IP address, a client simply reverses the octets for the IP address and prepends the result to pbl.spamhaus.org.
For example, to look up 127.0.0.2, a client would simply issue a query for 2.0.0.127.pbl.spamhaus.org. If the corresponding IP address was in the list, the DNS query would return an IP address that typically encodes some additional information, such as the provenance of that entry in the list. If the IP address is not contained in the list, the DNS server would indicate that by responding with the corresponding NXDOMAIN response, corresponding to ‘‘no such domain.’’
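Constructing the DNSBL query name is simple string manipulation, sketched below; the zone pbl.spamhaus.org is just the example used in the text.

```python
def dnsbl_query_name(ipv4, zone="pbl.spamhaus.org"):
    """Reverse the octets of an IPv4 address and prepend them to the
    blacklist zone, yielding the name to use in an A-record query."""
    return ".".join(reversed(ipv4.split("."))) + "." + zone

print(dnsbl_query_name("127.0.0.2"))  # → 2.0.0.127.pbl.spamhaus.org
```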
Extensions and Enhancements to DNS Queries
DNS queries have become more sophisticated and complex over time, as the need to serve clients with increasingly specific and relevant information has increased and as security concerns have grown. One significant extension to DNS queries in recent years has been the EDNS0 CS (Extended DNS Client Subnet, or simply EDNS Client Subnet) option, whereby a client’s local recursive resolver passes the IP address subnet of the stub resolver to the authoritative name server.
The EDNS0 CS mechanism allows the authoritative name server for a domain name to know the IP address of the client that initially performed the query. Knowing this information can typically allow an authoritative DNS server to perform a more effective mapping to a nearby copy of a replicated service. For example, if a client issues a query for google.com, the authoritative name server for Google would typically want to return a name that corresponds to a front-end server that is close to the client. The ability to do so of course depends on knowing where on the network (and, ideally, where in the world, geographically) the client is located. Ordinarily, an authoritative name server might only see the IP address of the local recursive resolver.
If the client that initiated the query happens to be located near its respective local resolver, then the authoritative server for that domain could determine an appropriate client mapping simply from the location of the local recursive resolver. Increasingly, however, clients have begun to use local recursive resolvers that may have IP addresses that make it difficult to locate the client. For example, Google and Cloudflare both operate public DNS resolvers (8.8.8.8 and 1.1.1.1, respectively). If a client is configured to use one of these local recursive resolvers, then
the authoritative name server does not learn much useful information from the IP address of the recursive resolver. EDNS0 CS solves this problem by including the IP subnet in the query from the local recursive resolver, so that the authoritative name server can see the IP subnet of the client that initiated the query.
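A toy sketch of how an authoritative server might use the forwarded client subnet to pick a nearby replica; all subnets and replica names here are invented for illustration, and real services use far richer geolocation and load data.

```python
FRONTENDS = {              # hypothetical /24 prefix → nearby replica
    "192.0.2":    "us-east.example.net",
    "198.51.100": "eu-west.example.net",
}

def pick_frontend(client_subnet, default="anycast.example.net"):
    """Map the EDNS0 Client Subnet value to a nearby front-end;
    without a match, fall back to a generic answer."""
    prefix = ".".join(client_subnet.split(".")[:3])
    return FRONTENDS.get(prefix, default)

print(pick_frontend("192.0.2.0"))    # → us-east.example.net
print(pick_frontend("203.0.113.0"))  # → anycast.example.net
```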
As previously noted, the names in DNS queries are not case sensitive. This characteristic has allowed modern DNS resolvers to include additional bits of a transaction ID in the query by setting each character in a QNAME to an arbitrary case. A 16-bit transaction ID is vulnerable to various cache poisoning attacks, including the Kaminsky attack described in Chap. 8. This vulnerability partially arises because the DNS transaction ID is only 16 bits. Increasing the number of bits in the transaction ID would require changing the DNS protocol specification, which is a massive undertaking.
An alternative was developed, usually called 0x20 encoding, whereby a local recursive resolver toggles the case of each letter in the QNAME (e.g., uchicago.edu might become uCHicaGO.EDu or similar), allowing each letter in the domain name to encode an additional bit for the DNS transaction ID. The catch, of course, is that no other resolver should alter the case of the QNAME in subsequent iterative queries or responses. If the casing is preserved, then the corresponding reply contains the QNAME with the original casing indicated by the local recursive resolver, effectively adding bits to the transaction identifier. The whole thing is an ugly hack, but such is the nature of trying to change widely deployed software while maintaining backward compatibility.
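The encoding side of 0x20 can be sketched as follows. This is a simplified illustration: real implementations derive the case pattern from a keyed function per query rather than a plain random generator.

```python
import random

def encode_0x20(qname, rng=None):
    """Give each letter of the QNAME a random case; the resulting
    pattern serves as extra transaction-ID bits."""
    rng = rng or random.Random()
    return "".join(c.upper() if rng.random() < 0.5 else c.lower()
                   for c in qname)

def response_matches(sent_qname, reply_qname):
    # A legitimate reply must echo the query's exact casing.
    return sent_qname == reply_qname
```

A forged response that guesses the name in lowercase would fail the casing check even if it guessed the 16-bit transaction ID correctly.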
DNS Responses and Resource Records
Every domain, whether it is a single host or a top-level domain, can have a set of resource records associated with it. These records are the DNS database. For a single host, the most common resource record is just its IP address, but many other kinds of resource records also exist. When a resolver gives a domain name to DNS, what it gets back are the resource records associated with that name. Thus, the primary function of DNS is to map domain names onto resource records.
A resource record is a five-tuple. Although resource records are encoded in binary, in most expositions resource records are presented as ASCII text, with one line per resource record, as follows:
Domain name Time to live Class Type Value
The Domain name tells the domain to which this record applies. Normally, many records exist for each domain, and each copy of the database holds information about multiple domains. This field is thus the primary search key used to satisfy queries. The order of the records in the database is not significant.
The Time to live field gives an indication of how stable the record is. Information that is highly stable is assigned a large value, such as 86400 (the number of seconds in 1 day). Information that is volatile (like stock prices), or that operators
may want to change frequently (e.g., to enable load balancing a single name across multiple IP addresses) may be assigned a small value, such as 60 seconds (1 minute). We will return to this point later when we have discussed caching.
The third field of every resource record is the Class. For Internet information, it is always IN. For non-Internet information, other codes can be used, but in practice these are rarely seen.
The Type field tells what kind of record this is. There are many kinds of DNS records. The important types are listed in Fig. 7-4.
Type Meaning Value
SOA Start of authority Parameters for this zone
A IPv4 address of a host 32-Bit integer
AAAA IPv6 address of a host 128-Bit integer
MX Mail exchange Priority, domain willing to accept email
NS Name server Name of a server for this domain
CNAME Canonical name Domain name
PTR Pointer Alias for an IP address
SPF Sender policy framework Text encoding of mail sending policy
SRV Service Host that provides it
TXT Text Descriptive ASCII text
Figure 7-4. The principal DNS resource record types.
An SOA record provides the name of the primary source of information about the name server’s zone (described below), the email address of its administrator, a unique serial number, and various flags and timeouts.
Common Record Types
The most important record type is the A (Address) record. It holds a 32-bit IPv4 address of an interface for some host. The corresponding AAAA, or ‘‘quad A,’’ record holds a 128-bit IPv6 address. Every Internet host must have at least one IP address so that other machines can communicate with it. Some hosts have two or more network interfaces, so they will have two or more type A or AAAA resource records. Additionally, a single service (e.g., google.com) may be hosted on many geographically distributed machines around the world (Calder et al., 2013). In these cases, a DNS resolver might return multiple IP addresses for a single domain name. In the case of a geographically distributed service, a resolver may return to its client one or more IP addresses of a server that is close to the client (geographically or topologically), to improve performance, and for load balancing.
An important record type is the NS record. It specifies a name server for the domain or subdomain. This is a host that has a copy of the database for a domain. It is used as part of the process to look up names, which we will describe shortly.
Another record type is the MX record. It specifies the name of the host prepared to accept email for the specified domain. It is used because not every machine is prepared to accept email. If someone wants to send email to, as an example, bill@microsoft.com, the sending host needs to find some mail server located at microsoft.com that is willing to accept email. The MX record can provide this information.
CNAME records allow aliases to be created. For example, a person familiar with Internet naming in general and wanting to send a message to user paul in the computer science department at the University of Chicago might guess that paul@cs.chicago.edu will work. Actually, this address will not work, because the domain for the computer science department is cs.uchicago.edu. As a service to people who do not know this, the University of Chicago could create a CNAME entry to point people and programs in the right direction. An entry like this one might do the job:
www.cs.uchicago.edu 120 IN CNAME hnd.cs.uchicago.edu
CNAMEs are commonly used for Web site aliases, because the common Web server addresses (which often start with www) tend to be hosted on machines that serve multiple purposes and whose primary name is not www.
The PTR record points to another name and is typically used to associate an IP address with a corresponding name. PTR lookups that associate a name with a corresponding IP address are typically called reverse lookups.
SRV is a newer type of record that allows a host to be identified for a given service in a domain. For example, the Web server for www.cs.uchicago.edu could be identified as hnd.cs.uchicago.edu. This record generalizes the MX record, which performs the same task but just for mail servers.
SPF lets a domain encode information about what machines in the domain will send mail to the rest of the Internet. This helps receiving machines check that mail is valid. If mail is being received from a machine that calls itself dodgy but the domain records say that mail will only be sent out of the domain by a machine called smtp, chances are that the mail is forged junk mail.
Last on the list, TXT records were originally provided to allow domains to identify themselves in arbitrary ways. Nowadays, they usually encode machine readable information, typically the SPF information.
Finally, we have the Value field. This field can be a number, a domain name, or an ASCII string. The semantics depend on the record type. A short description of the Value fields for each of the principal record types is given in Fig. 7-4.
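A minimal parser for the textual five-tuple form, assuming whitespace-separated fields as in the examples in this chapter; real zone-file syntax also permits omitted fields, parentheses, and comments.

```python
def parse_rr(line):
    """Split one textual resource record into the five-tuple
    (domain name, TTL, class, type, value)."""
    name, ttl, klass, rtype, *value = line.split()
    return name, int(ttl), klass, rtype, " ".join(value)

print(parse_rr("cs.vu.nl. 86400 IN MX 1 zephyr"))
# → ('cs.vu.nl.', 86400, 'IN', 'MX', '1 zephyr')
```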
DNSSEC Records
The original deployment of DNS did not consider the security of the protocol. In particular, DNS name servers or resolvers could manipulate the contents of any DNS record, thus causing the client to receive incorrect information. RFC 3833
highlights some of the various security threats to DNS and how DNSSEC addresses these threats. DNSSEC records allow responses from DNS name servers to carry digital signatures, which the local or stub resolver can subsequently verify to ensure that the DNS records were not modified or tampered with. Each DNS server computes a hash (a kind of long checksum) of the RRSET (Resource Record Set) for each set of resource records of the same type and signs it with its private cryptographic keys. Corresponding public keys can be used to verify the signatures on the RRSETs. (For those not familiar with cryptography, Chap. 8 provides some technical background.)
Verifying the signature of an RRSET with the name server’s corresponding public key of course requires verifying the authenticity of that server’s public key. This verification can be accomplished if one authoritative name server’s public key is signed by the parent name server in the name hierarchy. For example, the .edu authoritative name server might sign the public key corresponding to the uchicago.edu authoritative name server, and so forth.
DNSSEC has two resource records relating to public keys: (1) the RRSIG record, which corresponds to a signature over the RRSET, signed with the corresponding authoritative name server’s private key, and (2) the DNSKEY record, which is the public key for the corresponding RRSET, which is signed by the parent’s private key. This hierarchical structure for signatures allows DNSSEC public keys for the name server hierarchy to be distributed in band. Only the root-level public keys must be distributed out-of-band, and those keys can be distributed in the same way that resolvers come to know about the IP addresses of the root name servers. Chap. 8 discusses DNSSEC in more detail.
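The idea of hashing a canonically ordered RRSET (the quantity that is then signed) can be illustrated with a plain hash; actual DNSSEC signing uses public-key algorithms over a precisely specified canonical wire form, which this sketch deliberately omits.

```python
import hashlib

def rrset_digest(rrset):
    """Hash an RRSET in a canonical (sorted) order, so the digest is
    independent of the order in which records arrive; DNSSEC signs a
    value like this with the zone's private key."""
    canonical = "\n".join(sorted(rrset)).encode()
    return hashlib.sha256(canonical).hexdigest()
```

Because the records are sorted first, two servers holding the same RRSET in different orders compute the same digest.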
DNS Zones
Fig. 7-5 shows an example of the type of information that might be available in a typical DNS resource record for a particular domain name. This figure depicts part of a (hypothetical) database for the cs.vu.nl domain shown in Fig. 7-1, which is often called a DNS zone file or sometimes simply DNS zone for short. This zone file contains seven types of resource records.
The first noncomment line of Fig. 7-5 gives some basic information about the domain, which will not concern us further. Then come two entries giving the first and second places to try to deliver email sent to person@cs.vu.nl. The zephyr (a specific machine) should be tried first. If that fails, the top should be tried as the next choice. The next line identifies the name server for the domain as star.
After the blank line (added for readability) come lines giving the IP addresses for the star, zephyr, and top. These are followed by an alias, www.cs.vu.nl, so that this address can be used without designating a specific machine. Creating this alias allows cs.vu.nl to change its World Wide Web server without invalidating the address people use to get to it. A similar argument holds for ftp.cs.vu.nl.
; Authoritative data for cs.vu.nl
cs.vu.nl. 86400 IN SOA star boss (9527,7200,7200,241920,86400)
cs.vu.nl. 86400 IN MX 1 zephyr
cs.vu.nl. 86400 IN MX 2 top
cs.vu.nl. 86400 IN NS star
star 86400 IN A 130.37.56.205
zephyr 86400 IN A 130.37.20.10
top 86400 IN A 130.37.20.11
www 86400 IN CNAME star.cs.vu.nl
ftp 86400 IN CNAME zephyr.cs.vu.nl
flits 86400 IN A 130.37.16.112
flits 86400 IN A 192.31.231.165
flits 86400 IN MX 1 flits
flits 86400 IN MX 2 zephyr
flits 86400 IN MX 3 top
rowboat IN A 130.37.56.201
IN MX 1 rowboat
IN MX 2 zephyr
little-sister IN A 130.37.62.23
laserjet IN A 192.31.231.216
Figure 7-5. A portion of a possible DNS database (zone file) for cs.vu.nl.
The section for the machine flits lists two IP addresses, and three choices are given for handling email sent to flits.cs.vu.nl. First choice is naturally flits itself, but if it is down, zephyr and top are the second and third choices.
The next three lines contain a typical entry for a computer, in this example, rowboat.cs.vu.nl. The information provided contains the IP address and the pri- mary and secondary mail drops. Then comes an entry for a computer that is not capable of receiving mail itself, followed by an entry that is likely for a printer (laserjet) that is connected to the Internet.
In theory at least, a single name server could contain the entire DNS database and respond to all queries about it. In practice, this server would be so overloaded as to be useless. Furthermore, if it ever went down, the entire Internet would be crippled.
To avoid the problems associated with having only a single source of infor- mation, the DNS name space is divided into nonoverlapping zones. One possible way to divide the name space of Fig. 7-1 is shown in Fig. 7-6. Each circled zone contains some part of the tree.
Where the zone boundaries are placed within a zone is up to that zone’s administrator. This decision is made in large part based on how many name servers are
Figure 7-6. Part of the DNS name space divided into zones (which are circled). (The figure repeats the name-space tree of Fig. 7-1, with circles drawn around groups of nodes to mark zone boundaries.)
desired, and where. For example, in Fig. 7-6, the University of Chicago has a zone for uchicago.edu that handles traffic to cs.uchicago.edu. However, it does not handle eng.uchicago.edu. That is a separate zone with its own name servers. Such a decision might be made when a department such as English does not wish to run its own name server, but a department such as Computer Science does.
7.1.5 Name Resolution
Each zone is associated with one or more name servers. These are hosts that hold the database for the zone. Normally, a zone will have one primary name server, which gets its information from a file on its disk, and one or more secondary name servers, which get their information from the primary name server. To improve reliability, some of the name servers can be located outside the zone.
The process of looking up a name and finding an address is called name resolution. When a resolver has a query about a domain name, it passes the query to a local name server. If the domain being sought falls under the jurisdiction of the name server, such as top.cs.vu.nl falling under cs.vu.nl, it returns the authoritative resource records. An authoritative record is one that comes from the authority that manages the record and is thus always correct. Authoritative records are in contrast to cached records, which may be out of date.
What happens when the domain is remote, such as when flits.cs.vu.nl wants to find the IP address of cs.uchicago.edu at the University of Chicago? In this case, and if there is no cached information about the domain available locally, the name server begins a remote query. This query follows the process shown in Fig. 7-7. Step 1 shows the query that is sent to the local name server. The query contains the domain name sought, the type (A), and the class (IN).
Figure 7-7. Example of a resolver looking up a remote name in 10 steps. (The originator, flits.cs.vu.nl, sends its query for noise.cs.uchicago.edu to the local resolver for cs.vu.nl (step 1). The local resolver queries the root name server a.root-servers.net (step 2) and is referred to the edu name server (step 3); queries a.edu-servers.net (step 4) and is referred to the uchicago.edu name server (step 5); queries that server (step 6) and is referred to the uchicago cs name server (step 7); and queries that server (step 8), receiving the answer 128.135.24.19 (step 9), which it forwards to the originator (step 10).)
The next step is to start at the top of the name hierarchy by asking one of the root name servers. These name servers have information about each top-level domain. This is shown as step 2 in Fig. 7-7. To contact a root server, each name server must have information about one or more root name servers. This infor- mation is normally present in a system configuration file that is loaded into the DNS cache when the DNS server is started. It is simply a list of NS records for the root and the corresponding A records.
There are 13 root DNS servers, unimaginatively called a.root-servers.net through m.root-servers.net. Each root server could logically be a single computer. However, since the entire Internet depends on the root servers, they are powerful and heavily replicated computers. Most of the servers are present in multiple geographical locations and reached using anycast routing, in which a packet is delivered to the nearest instance of a destination address; we described anycast in Chap. 5. The replication improves reliability and performance.
The root name server is very unlikely to know the address of a machine at uchicago.edu, and probably does not know the name server for uchicago.edu either. But it must know the name server for the edu domain, in which cs.uchicago.edu is located. It returns the name and IP address for that part of the answer in step 3.
The local name server then continues its quest. It sends the entire query to the edu name server (a.edu-servers.net). That name server returns the name server for uchicago.edu. This is shown in steps 4 and 5. Closer now, the local name server sends the query to the uchicago.edu name server (step 6). If the domain name being sought was in the English department, the answer would be found, as the uchicago.edu zone includes the English department. The Computer Science department has chosen to run its own name server. The query returns the name and IP address of the uchicago.edu Computer Science name server (step 7).
Finally, the local name server queries the uchicago.edu Computer Science name server (step 8). This server is authoritative for the domain cs.uchicago.edu, so it must have the answer. It returns the final answer (step 9), which the local name server forwards as a response to flits.cs.vu.nl (step 10).
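The referral-following loop that the local name server runs can be sketched as follows. The `query` callback and server names are stand-ins; a real resolver also caches responses and handles errors and timeouts.

```python
def iterative_resolve(qname, root_server, query):
    """Follow referrals from the root downward until an
    authoritative answer is returned, as in Fig. 7-7.
    query(server, name) returns ("answer", ip) or
    ("referral", next_server)."""
    server = root_server
    while True:
        kind, data = query(server, qname)
        if kind == "answer":
            return data
        server = data  # descend one level in the hierarchy
```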
7.1.6 Hands on with DNS
You can explore this process using standard tools such as the dig program that is installed on most UNIX systems. For example, typing
dig ns @a.edu-servers.net cs.uchicago.edu
will send a query for cs.uchicago.edu to the a.edu-servers.net name server and print out the result for its name servers. This will show you the information obtained in Step 4 in the example above, and you will learn the name and IP address of the uchicago.edu name servers. Most organizations will have multiple name servers in case one is down. Half a dozen is not unusual. If you have access to a UNIX, Linux, or MacOS system, try experimenting with the dig program to see what it can do. You can learn a lot about DNS from using it. (The dig program is also available for Windows, but you may have to install it yourself.)
Even though its purpose is simple, it should be clear that DNS is a large and complex distributed system comprising millions of name servers that work together. It forms a key link between human-readable domain names and the IP addresses of machines. It includes replication and caching for performance and reliability and is designed to be highly robust.
Some applications need to use names in more flexible ways, for example, by naming content and resolving to the IP address of a nearby host that has the content. This fits the model of searching for and downloading a movie. It is the movie that matters, not the computer that has a copy of it, so all that is wanted is the IP address of any nearby computer that has a copy of the movie. Content delivery networks are one way to accomplish this mapping. We will describe how they build on the DNS later in this chapter, in Sec. 7.5.
7.1.7 DNS Privacy
Historically, DNS queries and responses have not been encrypted. As a result, any other device or eavesdropper on the network (e.g., other devices, a system administrator, a coffee shop network) could conceivably observe a user’s DNS traffic and determine information about that user. For example, a lookup to a site like uchicago.edu might indicate that a user was browsing the University of Chicago Web site. While such information might seem innocuous, DNS lookups to Web sites such as webmd.com might indicate that a user was performing medical research. Combinations of lookups combined with other information can often reveal even more specific information, possibly even the precise Web site that a user is visiting.
Privacy issues associated with DNS queries have become more contentious when considering emerging applications, such as the Internet of Things (IoT) and smart homes. For example, the DNS queries that a device issues can reveal information about the type of devices that users have in their smart homes and the extent to which they are interacting with those devices. For instance, the DNS queries that an Internet-connected camera or sleep monitor issues can uniquely identify the device (Apthorpe et al., 2019). Given the increasingly sensitive activities that people perform on Internet-connected devices, from browsers to Internet-connected ‘‘smart’’ devices, there is an increasing desire to encrypt DNS queries and responses.
Several recent developments are poised to potentially reshape DNS entirely. The first is the movement toward encrypting DNS queries and responses. Various organizations, including Cloudflare and Google, now offer users the opportunity to direct their DNS traffic to these organizations’ own local recursive resolvers, with support for encrypted transport (e.g., TLS, HTTPS) between the DNS stub resolver and the local resolver. In some cases, these organizations are partnering with Web browser manufacturers (e.g., Mozilla) to potentially direct all DNS traffic to these local resolvers by default.
If all DNS queries and responses are exchanged with cloud providers over encrypted transport by default, the implications for the future of the Internet architecture could be extremely significant. Specifically, Internet service providers will no longer have the ability to observe DNS queries from their subscribers’ home networks, which has, in the past, been one of the primary ways that ISPs monitor these networks for infections and malware (Antonakakis et al., 2010). Other functions, such as parental controls and various other services that ISPs offer, also depend on seeing DNS traffic.
Ultimately, two somewhat orthogonal issues are at play. The first is the shift of DNS towards encrypted transport, which almost everyone would agree is a positive change (there were initial concerns about performance, but these have mostly been addressed). The second issue is thornier: it involves who gets to operate the local recursive resolvers. Previously, the local recursive resolver was generally operated by a user’s ISP; if DNS resolution moves to the browser, however, via DoH, then the browsers (the two most popular of which are at this point largely controlled by a single dominant provider, Google) can control who is in a position to observe DNS traffic. Ultimately, the operator of the local recursive resolver can see the DNS queries from the user and associate those with an IP address; whether the user wants their ISP or a large advertising company to see their DNS traffic should be their choice, but the default settings in the browser may ultimately determine who ends up seeing the majority of this traffic. Presently, a wide range of organizations, from ISPs to content providers and advertising companies, are trying to establish what are being called TRRs (Trusted Recursive Resolvers), which are local recursive resolvers that use DoT or DoH to resolve queries for clients. Time will tell how these developments ultimately reshape the DNS architecture.
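To make the DoH mechanism concrete, RFC 8484 specifies that a client may fetch a resolver URL with the binary DNS query passed as a base64url-encoded dns parameter, with padding stripped. The sketch below assumes a hypothetical resolver endpoint; real providers publish their own URLs.

```python
import base64

def doh_get_url(resolver_url, dns_query):
    """Form an RFC 8484 DoH GET URL from a binary DNS query packet."""
    # base64url encoding, with trailing '=' padding removed per the RFC.
    encoded = base64.urlsafe_b64encode(dns_query).rstrip(b"=")
    return resolver_url + "?dns=" + encoded.decode("ascii")

# The endpoint and the query bytes here are placeholders, not a real service.
url = doh_get_url("https://doh.example/dns-query", b"\x12\x34")
```

A real client would then issue an HTTPS GET for this URL and receive the binary DNS response as the HTTP body.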
Even DoT and DoH do not completely resolve all DNS-related privacy concerns, because the operator of the local resolver must still be trusted with sensitive information: namely, the DNS queries and the IP addresses of the clients that issued those queries. Other recent enhancements to DNS and DoH have been proposed, including oblivious DNS (Schmitt et al., 2019) and oblivious DoH (Kinnear et al., 2019), whereby the stub resolver encrypts the original query to the local recursive resolver, which in turn sends the encrypted query to an authoritative name server that can decrypt and resolve the query, but does not know the identity or IP address of the stub resolver that initiated the query. Figure 7-8 shows this relationship.
Figure 7-8. Oblivious DNS. The client’s stub resolver sends an encrypted (ODNS) query to the recursive resolver, which sees the IP address of the stub but cannot decrypt the query; the authoritative server (here, the University of Chicago’s) can decrypt the query but does not know the stub resolver’s IP address.
Most of these implementations are still nascent, in the form of early prototypes and draft standards being discussed in the DNS privacy working group at the IETF.
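The division of knowledge in oblivious DNS can be sketched with a toy program. Everything here is illustrative: the XOR ‘‘cipher’’ merely stands in for the real public-key encryption to the authoritative server, and the key and names are invented.

```python
# Toy model of oblivious DNS. XOR with a key known only to the stub and
# the authoritative server stands in for real public-key encryption.
KEY = b"authoritative-secret"

def seal(data: bytes) -> bytes:
    # XOR each byte with the key (repeating the key as needed).
    return bytes(b ^ KEY[i % len(KEY)] for i, b in enumerate(data))

unseal = seal  # XOR is its own inverse

# Stub resolver: encrypts the query before handing it to the recursive resolver.
sealed = seal(b"webmd.com")

# Recursive resolver: knows the client's IP address, but sees only ciphertext.
assert b"webmd" not in sealed

# Authoritative server: decrypts the query, but never learned the client's address.
assert unseal(sealed) == b"webmd.com"
```

The point is the split: no single party other than the client holds both the query and the client’s IP address.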
7.1.8 Contention Over Names
As the Internet has become more commercial and more international, it has also become more contentious, especially in matters related to naming. This controversy includes ICANN itself. For example, the creation of the xxx domain took several years and court cases to resolve. Is voluntarily placing adult content in its own domain a good or a bad thing? (Some people did not want adult content available at all on the Internet, while others wanted to put it all in one domain so nanny filters could easily find and block it from children.) Some of the domains self-organize, while others have restrictions on who can obtain a name, as noted earlier. But what restrictions are appropriate? Take the pro domain, for example. It is for qualified professionals. But who, exactly, is a professional? Doctors and lawyers clearly are professionals. But what about freelance photographers, piano teachers, magicians, plumbers, barbers, exterminators, tattoo artists, mercenaries, and prostitutes? Are these occupations eligible? According to whom?
There is also money in names. Tuvalu (a tiny island country midway between Hawaii and Australia) sold a lease on its tv domain for $50 million, all because the country code is well-suited to advertising television sites. Virtually every common (English) word has been taken in the com domain, along with the most common misspellings. Try household articles, animals, plants, body parts, etc. The practice of registering a domain only to turn around and sell it off to an interested party at a much higher price even has a name. It is called cybersquatting. Many companies that were slow off the mark when the Internet era began found their obvious domain names already taken when they tried to acquire them. In general, as long as no trademarks are being violated and no fraud is involved, it is first-come, first-served with names. Nevertheless, policies to resolve naming disputes are still being refined.
7.2 ELECTRONIC MAIL
Electronic mail, or more commonly email, has been around for over four decades. Faster and cheaper than paper mail, email has been a popular application since the early days of the Internet. Before 1990, it was mostly used in academia. During the 1990s, it became known to the public at large and grew exponentially, to the point where the number of emails sent per day now is vastly more than the number of snail mail (i.e., paper) letters. Other forms of network communication, such as instant messaging and voice-over-IP calls have expanded greatly in use over the past decade, but email remains the workhorse of Internet communication. It is widely used within industry for intracompany communication, for example, to allow far-flung employees all over the world to cooperate on complex projects. Unfortunately, like paper mail, the majority of email—some 9 out of 10 messages—is junk mail or spam. While mail systems can remove much of it nowadays, a lot still gets through, and research into detecting it is ongoing; see, for example, Dan et al. (2019) and Zhang et al. (2019).
Email, like most other forms of communication, has developed its own conventions and styles. It is very informal and has a low threshold of use. People who would never dream of calling up or even writing a letter to a Very Important Person do not hesitate for a second to send a sloppily written email to him or her. By eliminating most cues associated with rank, age, and gender, email debates often focus on content, not status. With email, a brilliant idea from a summer student can have more impact than a dumb one from an executive vice president.
Email is full of jargon such as BTW (By The Way), ROTFL (Rolling On The Floor Laughing), and IMHO (In My Humble Opinion). Many people also use little ASCII symbols called smileys, starting with the ubiquitous ‘‘:-)’’. This symbol and other emoticons help to convey the tone of the message. They have spread to other terse forms of communication, such as instant messaging, typically as graphical emoji. Many smartphones have hundreds of emojis available.
The email protocols have evolved during the period of their use, too. The first email systems simply consisted of file transfer protocols, with the convention that the first line of each message (i.e., file) contained the recipient’s address. As time went on, email diverged from file transfer and many features were added, such as the ability to send one message to a list of recipients. Multimedia capabilities became important in the 1990s to send messages with images and other non-text material. Programs for reading email became much more sophisticated too, shifting from text-based to graphical user interfaces and adding the ability for users to access their mail from their laptops wherever they happen to be. Finally, with the prevalence of spam, email systems now pay attention to finding and removing unwanted email.
In our description of email, we will focus on the way that mail messages are moved between users, rather than the look and feel of mail reader programs. Nevertheless, after describing the overall architecture, we will begin with the user-facing part of the email system, as it is familiar to most readers.
7.2.1 Architecture and Services
In this section, we will provide an overview of how email systems are organized and what they can do. The architecture of the email system is shown in Fig. 7-9. It consists of two kinds of subsystems: the user agents, which allow people to read and send email, and the message transfer agents, which move the messages from the source to the destination. We will also refer to message transfer agents informally as mail servers.
Figure 7-9. Architecture of the email system. The sender’s user agent performs mail submission (step 1) to a message transfer agent, which relays the message via SMTP to the receiver’s message transfer agent (step 2: message transfer); the message lands in the receiver’s mailbox, from which the receiver’s user agent retrieves it (step 3: final delivery).
The user agent is a program that provides a graphical interface, or sometimes a text- and command-based interface that lets users interact with the email system. It includes a means to compose messages and replies to messages, display incoming messages, and organize messages by filing, searching, and discarding them. The act of sending new messages into the mail system is called mail submission.
Some of the user agent processing may be done automatically, anticipating what the user wants. For example, incoming mail may be filtered to extract or deprioritize messages that are likely spam. Some user agents include advanced features, such as arranging for automatic email responses (‘‘I’m having a wonderful vacation and it will be a while before I get back to you.’’). A user agent runs on the same computer on which a user reads her mail. It is just another program and may be run only some of the time.
The message transfer agents are typically system processes. They run in the background on mail server machines and are intended to be always available. Their job is to automatically move email through the system from the originator to the recipient with SMTP (Simple Mail Transfer Protocol), discussed in Sec. 7.2.4. This is the message transfer step.
SMTP was originally specified as RFC 821 and revised to become the current RFC 5321. It sends mail over connections and reports back the delivery status and any errors. Numerous applications exist in which confirmation of delivery is important and may even have legal significance (‘‘Well, Your Honor, my email system is just not very reliable, so I guess the electronic subpoena just got lost somewhere’’).
Message transfer agents also implement mailing lists, in which an identical copy of a message is delivered to everyone on a list of email addresses. Additional advanced features are carbon copies, blind carbon copies, high-priority email, secret (encrypted) email, alternative recipients if the primary one is not currently available, and the ability for assistants to read and answer their bosses’ email.

Linking user agents and message transfer agents are the concepts of mailboxes and a standard format for email messages. Mailboxes store the email that is received for a user. They are maintained by mail servers. User agents simply present users with a view of the contents of their mailboxes. To do this, the user agents send the mail servers commands to manipulate the mailboxes, inspecting their contents, deleting messages, and so on. The retrieval of mail is the final delivery (step 3) in Fig. 7-9. With this architecture, one user may use different user agents on multiple computers to access one mailbox.
Mail is sent between message transfer agents in a standard format. The original format, RFC 822, has been revised to the current RFC 5322 and extended with support for multimedia content and international text. This scheme is called MIME. People still refer to Internet email as RFC 822, though.
A key idea in the message format is the clear distinction between the envelope and the contents of the envelope. The envelope encapsulates the message. Furthermore, it contains all the information needed for transporting the message, such as the destination address, priority, and security level, all of which are distinct from the message itself. The message transport agents use the envelope for routing, just as the post office does.
The message inside the envelope consists of two separate parts: the header and the body. The header contains control information for the user agents. The body
is entirely for the human recipient. None of the agents care much about it. Envelopes and messages are illustrated in Fig. 7-10.
Figure 7-10. Envelopes and messages. (a) Paper mail: the envelope carries the delivery information (Mr. Daniel Dumkopf, 18 Willow Lane, White Plains, NY 10604, plus a 44¢ stamp), while the letter inside gives the sender (United Gizmo, 180 Main St., Boston, MA 02120), the date (Feb. 14, 2020), the subject (Invoice 1081), and the body. (b) Electronic mail: the envelope holds the transport fields (name, street, city, state, zip code, priority, encryption), the message header holds the From, address, location, date, and subject fields, and the message body holds the text itself.
We will examine the pieces of this architecture in more detail by looking at the steps that are involved in sending email from one user to another. This journey starts with the user agent.
7.2.2 The User Agent
A user agent is a program (sometimes called an email reader) that accepts a variety of commands for composing, receiving, and replying to messages, as well as for manipulating mailboxes. There are many popular user agents, including Google Gmail, Microsoft Outlook, Mozilla Thunderbird, and Apple Mail. They can vary greatly in their appearance. Most user agents have a menu- or icon-driven graphical interface that requires a mouse, or a touch interface on smaller mobile devices. Older user agents, such as Elm, mh, and Pine, provide text-based interfaces and expect one-character commands from the keyboard. Functionally, these are the same, at least for text messages.
The typical elements of a user agent interface are shown in Fig. 7-11. Your mail reader is likely to be much flashier, but probably has equivalent functions. When a user agent is started, it will usually present a summary of the messages in the user’s mailbox. Often, the summary will have one line for each message in some sorted order. It highlights key fields of the message that are extracted from the message envelope or header.
Figure 7-11. Typical elements of the user agent interface. Mail folders (All items, Inbox, Networks, Travel, Junk Mail) are listed on the left, a mailbox search box is provided, and the message summary shows one line per message with From, Subject, and Received columns. Below the summary, a preview pane displays the start of the selected message.
Seven summary lines are shown in the example of Fig. 7-11. The lines use the From, Subject, and Received fields, in that order, to display who sent the message, what it is about, and when it was received. All the information is formatted in a user-friendly way rather than displaying the literal contents of the message fields, but it is based on the message fields. Thus, people who fail to include a Subject field often discover that responses to their emails tend not to get the highest priority.
Many other fields or indications are possible. The icons next to the message subjects in Fig. 7-11 might indicate, for example, unread mail (the envelope), attached material (the paperclip), and important mail, at least as judged by the sender (the exclamation point).
Many sorting orders are also possible. The most common is to order messages based on the time that they were received, most recent first, with some indication as to whether the message is new or has already been read by the user. The fields in the summary and the sort order can be customized by the user according to her preferences.
User agents must also be able to display incoming messages as needed so that people can read their email. Often a short preview of a message is provided, as in
Fig. 7-11, to help users decide when to read further and when to hit the SPAM button. Previews may use small icons or images to describe the contents of the message. Other presentation processing includes reformatting messages to fit the display, and translating or converting contents to more convenient formats (e.g., digitized speech to recognized text).
After a message has been read, the user can decide what to do with it. This is called message disposition. Options include deleting the message, sending a reply, forwarding the message to another user, and keeping the message for later reference. Most user agents can manage one mailbox for incoming mail with multiple folders for saved mail. The folders allow the user to save messages according to sender, topic, or some other category.
Filing can be done automatically by the user agent as well, even before the user reads the messages. A common example is that the fields and contents of messages are inspected and used, along with feedback from the user about previous messages, to determine if a message is likely to be spam. Many ISPs and companies run software that labels mail as important or spam so that the user agent can file it in the corresponding mailbox. The ISP and company have the advantage of seeing mail for many users and may have lists of known spammers. If hundreds of users have just received a similar message, it is probably spam, although it could be a message from the CEO to all employees. By presorting incoming mail as ‘‘probably legitimate’’ and ‘‘probably spam,’’ the user agent can save users a fair amount of work separating the good stuff from the junk.
And the most popular spam? It is generated by collections of compromised computers called botnets and its content depends on where you live. Fake diplomas are common in Asia, and cheap drugs and other dubious product offers are common in the U.S. Unclaimed Nigerian bank accounts still abound. Pills for enlarging various body parts are common everywhere.
Other filing rules can be constructed by users. Each rule specifies a condition and an action. For example, a rule could say that any message received from the boss goes to one folder for immediate reading and any message from a particular mailing list goes to another folder for later reading. Several folders are shown in Fig. 7-11. The most important folders are the Inbox, for incoming mail not filed elsewhere, and Junk Mail, for messages that are thought to be spam.
7.2.3 Message Formats
Now we turn from the user interface to the format of the email messages themselves. Messages sent by the user agent must be placed in a standard format to be handled by the message transfer agents. First we will look at basic ASCII email using RFC 5322, which is the latest revision of the original Internet message format as described in RFC 822 and its many updates. After that, we will look at multimedia extensions to the basic format.
RFC 5322—The Internet Message Format
Messages consist of a primitive envelope (described as part of SMTP in RFC 5321), some number of header fields, a blank line, and then the message body. Each header field (logically) consists of a single line of ASCII text containing the field name, a colon, and, for most fields, a value. The original RFC 822 was designed decades ago and did not clearly distinguish the envelope fields from the header fields. Although it has been revised to RFC 5322, completely redoing it was not possible due to its widespread usage. In normal usage, the user agent builds a message and passes it to the message transfer agent, which then uses some of the header fields to construct the actual envelope, a somewhat old-fashioned mixing of message and envelope.
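The header/blank-line/body layout is easy to see with Python’s standard email library, used here purely as an illustration (the addresses are invented; any RFC 5322 parser behaves similarly).

```python
from email.parser import Parser

# A minimal RFC 5322 message: header fields, a blank line, then the body.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch?\n"
    "\n"
    "How about noon on Friday?\n"
)

msg = Parser().parsestr(raw)
print(msg["Subject"])      # header field values are looked up by name
print(msg.get_payload())   # everything after the blank line is the body
```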
The principal header fields related to message transport are listed in Fig. 7-12. The To: field gives the email address of the primary recipient. Having multiple recipients is also allowed. The Cc: field gives the addresses of any secondary recipients. In terms of delivery, there is no distinction between the primary and secondary recipients. It is entirely a psychological difference that may be important to the people involved but is not important to the mail system. The term Cc: (Carbon copy) is a bit dated, since computers do not use carbon paper, but it is well established. The Bcc: (Blind carbon copy) field is like the Cc: field, except that this line is deleted from all the copies sent to the primary and secondary recipients. This feature allows people to send copies to third parties without the primary and secondary recipients knowing this.
Header Meaning
To: Email address(es) of primary recipient(s)
Cc: Email address(es) of secondary recipient(s)
Bcc: Email address(es) for blind carbon copies
From: Person or people who created the message
Sender: Email address of the actual sender
Received: Line added by each transfer agent along the route
Return-Path: Can be used to identify a path back to the sender
Figure 7-12. RFC 5322 header fields related to message transport.
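As a sketch of how the recipient fields behave in practice, the following uses Python’s standard email library (the addresses are invented). Note how the Bcc: line must be removed from the copy that is actually transmitted.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "executive@example.com"
msg["To"] = "customer@example.com"    # primary recipient
msg["Cc"] = "records@example.com"     # secondary recipient
msg["Bcc"] = "auditor@example.com"    # must not appear in delivered copies
msg.set_content("Please find the invoice below.")

# Before transmission, a sending client strips the Bcc: header so the
# primary and secondary recipients never see it.
del msg["Bcc"]
assert "auditor" not in msg.as_string()
```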
The next two fields, From: and Sender:, tell who wrote and actually sent the message, respectively. These two fields need not be the same. For example, a business executive may write a message, but her assistant may be the one who actually transmits it. In this case, the executive would be listed in the From: field and the assistant in the Sender: field. The From: field is required, but the Sender: field may be omitted if it is the same as the From: field. These fields are needed in case the message is undeliverable and must be returned to the sender.
A line containing Received: is added by each message transfer agent along the way. The line contains the agent’s identity, the date and time the message was received, and other information that can be used for debugging the routing system.
The Return-Path: field is added by the final message transfer agent and was intended to tell how to get back to the sender. In theory, this information can be gathered from all the Received: headers (except for the name of the sender’s mailbox), but it is rarely filled in as such and typically just contains the sender’s address.
In addition to the fields of Fig. 7-12, RFC 5322 messages may also contain a variety of header fields used by the user agents or human recipients. The most common ones are listed in Fig. 7-13. Most of these are self-explanatory, so we will not go into all of them in much detail.
Header Meaning
Date: The date and time the message was sent
Reply-To: Email address to which replies should be sent
Message-Id: Unique number for referencing this message later
In-Reply-To: Message-Id of the message to which this is a reply
References: Other relevant Message-Ids
Keywords: User-chosen keywords
Subject: Short summary of the message for the one-line display
Figure 7-13. Some fields used in the RFC 5322 message header.
The Reply-To: field is sometimes used when neither the person composing the message nor the person sending the message wants to see the reply. For example, a marketing manager may write an email message telling customers about a new product. The message is sent by an assistant, but the Reply-To: field lists the head of the sales department, who can answer questions and take orders. This field is also useful when the sender has two email accounts and wants the reply to go to the other one.
The Message-Id: is an automatically generated number that is used to link messages together (e.g., when used in the In-Reply-To: field) and to prevent duplicate delivery.
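A sketch of how a mail client threads messages, using the standard Python email library (make_msgid generates a fresh, globally unique identifier on each call; the domain is an invented example):

```python
from email.message import EmailMessage
from email.utils import make_msgid

original = EmailMessage()
original["Message-Id"] = make_msgid(domain="example.com")

# A reply links back to the original so readers can reconstruct the thread.
reply = EmailMessage()
reply["In-Reply-To"] = original["Message-Id"]
reply["References"] = original["Message-Id"]
```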
The RFC 5322 document explicitly says that users are allowed to invent optional headers for their own private use. By convention since RFC 822, these headers start with the string X-. It is guaranteed that no future headers will use names starting with X-, to avoid conflicts between official and private headers. Sometimes wiseguy undergraduates make up fields like X-Fruit-of-the-Day: or X-Disease-of-the-Week:, which are legal, although not always illuminating.
After the headers comes the message body. Users can put whatever they want here. Some people terminate their messages with elaborate signatures, including quotations from greater and lesser authorities, political statements, and disclaimers
of all kinds (e.g., The XYZ Corporation is not responsible for my opinions; in fact, it cannot even comprehend them).
MIME—The Multipurpose Internet Mail Extensions
In the early days of the ARPANET, email consisted exclusively of text messages written in English and expressed in ASCII. For this environment, the early RFC 822 format did the job completely: it specified the headers but left the content entirely up to the users. In the 1990s, the worldwide use of the Internet and demand to send richer content through the mail system meant that this approach was no longer adequate. The problems included sending and receiving messages in languages with diacritical marks (e.g., French and German), non-Latin alphabets (e.g., Hebrew and Russian), or no alphabets (e.g., Chinese and Japanese), as well as sending messages not containing text at all (e.g., audio, images, or binary documents and programs).
The solution was the development of MIME (Multipurpose Internet Mail Extensions). It is widely used for mail messages that are sent across the Internet, as well as to describe content for other applications such as Web browsing. MIME is described in RFC 2045 and the RFCs that follow it, as well as in RFC 4288 and RFC 4289.
The basic idea of MIME is to continue to use the RFC 822 format but to add structure to the message body and define encoding rules for the transfer of non- ASCII messages. Not deviating from RFC 822 allowed MIME messages to be sent using the existing mail transfer agents and protocols (based on RFC 821 then, and RFC 5321 now). All that had to be changed were the sending and receiving programs, which users could do for themselves.
MIME defines five new message headers, as shown in Fig. 7-14. The first of these simply tells the user agent receiving the message that it is dealing with a MIME message, and which version of MIME it uses. Any message not containing a MIME-Version: header is assumed to be an English plaintext message (or at least one using only ASCII characters) and is processed as such.
Header Meaning
MIME-Version: Identifies the MIME version
Content-Description: Human-readable string telling what is in the message
Content-Id: Unique identifier
Content-Transfer-Encoding: How the body is wrapped for transmission
Content-Type: Type and format of the content
Figure 7-14. Message headers added by MIME.
The Content-Description: header is an ASCII string telling what is in the message. This header is needed so the recipient will know whether it is worth decoding and reading the message. If the string says ‘‘Photo of Aron’s hamster’’ and the
person getting the message is not a big hamster fan, the message will probably be discarded rather than decoded into a high-resolution color photograph. The Content-Id: header identifies the content. It uses the same format as the standard Message-Id: header.
The Content-Transfer-Encoding: tells how the body is wrapped for transmission through the network. A key problem at the time MIME was developed was that the mail transfer (SMTP) protocols expected ASCII messages in which no line exceeded 1000 characters. ASCII characters use 7 bits out of each 8-bit byte. Binary data such as executable programs and images use all 8 bits of each byte, as do extended character sets. There was no guarantee this data would be transferred safely. Hence, some method of carrying binary data that made it look like a regular ASCII mail message was needed. Extensions to SMTP since the development of MIME do allow 8-bit binary data to be transferred, though even today binary data may not always go through the mail system correctly if unencoded.
MIME provides five transfer encoding schemes, plus an escape to new schemes—just in case. The simplest scheme is just ASCII text messages. ASCII characters use 7 bits and can be carried directly by the email protocol, provided that no line exceeds 1000 characters.
The next simplest scheme is the same thing, but using 8-bit characters, that is, all values from 0 up to and including 255 are allowed. Messages using the 8-bit encoding must still adhere to the standard maximum line length.
Then there are messages that use a true binary encoding. These are arbitrary binary files that not only use all 8 bits but also do not adhere to the 1000-character line limit. Executable programs fall into this category. Nowadays, mail servers can negotiate to send data in binary (or 8-bit) encoding, falling back to ASCII if both ends do not support the extension.
The ASCII encoding of binary data is called base64 encoding. In this scheme, groups of 24 bits are broken up into four 6-bit units, with each unit being sent as a legal ASCII character. The coding is ‘‘A’’ for 0, ‘‘B’’ for 1, and so on, followed by the 26 lowercase letters, the 10 digits, and finally + and / for 62 and 63, respectively. The == and = sequences indicate that the last group contained only 8 or 16 bits, respectively. Carriage returns and line feeds are ignored, so they can be inserted at will in the encoded character stream to keep the lines short enough. Arbitrary binary data can be sent safely using this scheme, albeit inefficiently. This encoding was very popular before binary-capable mail servers were widely deployed. It is still commonly seen.
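The mechanics are easy to check with Python’s standard base64 module:

```python
import base64

# Three bytes (24 bits) become four 6-bit units, hence four ASCII characters.
assert base64.b64encode(b"Man") == b"TWFu"

# A final group of 16 bits gets one '=' of padding; a final 8 bits gets '=='.
assert base64.b64encode(b"Ma") == b"TWE="
assert base64.b64encode(b"M") == b"TQ=="

# Line breaks inserted into the encoded stream are ignored when decoding.
assert base64.b64decode(b"TW\nFu") == b"Man"
```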
The last header shown in Fig. 7-14 is really the most interesting one. It specifies the nature of the message body and has had an impact well beyond email. For instance, content downloaded from the Web is labeled with MIME types so that the browser knows how to present it. So is content sent over streaming media and real-time transports such as voice over IP.
Initially, seven MIME types were defined in RFC 1521. Each type has one or more available subtypes. The type and subtype are separated by a slash, as in
‘‘Content-Type: video/mpeg’’. Since then, over 2700 subtypes have been added, along with two new types (font and model). Additional entries are being added all the time as new types of content are developed. The list of assigned types and subtypes is maintained online by IANA at www.iana.org/assignments/media-types. The types, along with several examples of commonly used subtypes, are given in Fig. 7-15.
Type          Example subtypes                       Description
text          plain, html, xml, css                  Text in various formats
image         gif, jpeg, tiff                        Pictures
audio         basic, mpeg, mp4                       Sounds
video         mpeg, mp4, quicktime                   Movies
font          otf, ttf                               Fonts for typesetting
model         vrml                                   3D model
application   octet-stream, pdf, javascript, zip     Data produced by applications
message       http, RFC 822                          Encapsulated message
multipart     mixed, alternative, parallel, digest   Combination of multiple types
Figure 7-15. MIME content types and example subtypes.
The MIME types in Fig. 7-15 should be self-explanatory except perhaps the last one. It allows a message with multiple attachments, each with a different MIME type.
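As an illustration of the multipart type, Python's standard email package can assemble a multipart/alternative message, which carries the same content in several formats so the receiving user agent can display the richest one it supports. The addresses below are hypothetical:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Build a multipart/alternative message: a plain-text and an HTML
# version of the same content, each as its own body part.
msg = MIMEMultipart("alternative")
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Happy birthday"

msg.attach(MIMEText("Happy birthday to you", "plain"))
msg.attach(MIMEText("<p>Happy birthday to you</p>", "html"))
```

Serializing the message (msg.as_string()) produces a body with a boundary string separating the parts, in the same style as the transcript in Fig. 7-16.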
7.2.4 Message Transfer
Now that we have described user agents and mail messages, we are ready to look at how the message transfer agents relay messages from the originator to the recipient. The mail transfer is done with the SMTP protocol.
The simplest way to move messages is to establish a transport connection from the source machine to the destination machine and then just transfer the message. This is how SMTP originally worked. Over the years, however, two different uses of SMTP have been differentiated. The first use is mail submission, step 1 in the email architecture of Fig. 7-9. This is the means by which user agents send messages into the mail system for delivery. The second use is to transfer messages between message transfer agents (step 2 in Fig. 7-9). This sequence delivers mail all the way from the sending to the receiving message transfer agent in one hop. Final delivery is accomplished with different protocols that we will describe in the next section.
In this section, we will describe the basics of the SMTP protocol and its exten- sion mechanism. Then we will discuss how it is used differently for mail submis- sion and message transfer.
SMTP (Simple Mail Transfer Protocol) and Extensions
Within the Internet, email is delivered by having the sending computer establish a TCP connection to port 25 of the receiving computer. Listening to this port is a mail server that speaks SMTP (Simple Mail Transfer Protocol). This server accepts incoming connections, subject to some security checks, and accepts messages for delivery. If a message cannot be delivered, an error report containing the first part of the undeliverable message is returned to the sender.
SMTP is a simple ASCII protocol. This is not a weakness but a feature. Using ASCII text makes protocols easy to develop, test, and debug. They can be tested by sending commands manually, and records of the messages are easy to read. Most application-level Internet protocols now work this way (e.g., HTTP).
We will walk through a simple message transfer between mail servers that delivers a message. After establishing the TCP connection to port 25, the sending machine, operating as the client, waits for the receiving machine, operating as the server, to talk first. The server starts by sending a line of text giving its identity and telling whether it is prepared to receive mail. If it is not, the client releases the connection and tries again later.
If the server is willing to accept email, the client announces whom the email is coming from and whom it is going to. If such a recipient exists at the destination, the server gives the client the go-ahead to send the message. Then the client sends the message and the server acknowledges it. No checksums are needed because TCP provides a reliable byte stream. If there is more email, that is now sent. When all the email has been exchanged in both directions, the connection is released. A sample dialog is shown in Fig. 7-16. The lines sent by the client (i.e., the sender) are marked C:. Those sent by the server (i.e., the receiver) are marked S:.
The first command from the client is indeed meant to be HELO. Of the various four-character abbreviations for HELLO, this one has numerous advantages over its biggest competitor. Why all the commands had to be four characters has been lost in the mists of time.
In Fig. 7-16, the message is sent to only one recipient, so only one RCPT command is used. Multiple RCPT commands may be used to send a single message to several receivers. Each one is individually acknowledged or rejected. Even if some recipients are rejected (because they do not exist at the destination), the message can be sent to the other ones.
Finally, although the syntax of the four-character commands from the client is rigidly specified, the syntax of the replies is less rigid. Only the numerical code really counts. Each implementation can put whatever string it wants after the code.
The basic SMTP works well, but it is limited in several respects. It does not include authentication. This means that the MAIL FROM command in the example could give any sender address that it pleases. This is quite useful for sending spam. Another limitation is that SMTP transfers ASCII messages, not binary data. This is
S: 220 ee.uwa.edu.au SMTP service ready
C: HELO abcd.com
S: 250 ee.uwa.edu.au says hello to abcd.com
C: MAIL FROM: <alice@cs.uchicago.edu>
S: 250 sender ok
C: RCPT TO: <bob@ee.uwa.edu.au>
S: 250 recipient ok
C: DATA
S: 354 Send mail; end with "." on a line by itself
C: From: alice@cs.uchicago.edu
C: To: bob@ee.uwa.edu.au
C: MIME-Version: 1.0
C: Message-Id: <0704760941.AA00747@ee.uwa.edu.au>
C: Content-Type: multipart/alternative; boundary=qwertyuiopasdfghjklzxcvbnm
C: Subject: Earth orbits sun integral number of times
C:
C: This is the preamble. The user agent ignores it. Have a nice day.
C:
C: --qwertyuiopasdfghjklzxcvbnm
C: Content-Type: text/html
C:
C: <p>Happy birthday to you
C: Happy birthday to you
C: Happy birthday dear <bold> Bob </bold>
C: Happy birthday to you
C:
C: --qwertyuiopasdfghjklzxcvbnm
C: Content-Type: message/external-body;
C: access-type="anon-ftp";
C: site="bicycle.cs.uchicago.edu";
C: directory="pub";
C: name="birthday.snd"
C:
C: content-type: audio/basic
C: content-transfer-encoding: base64
C: --qwertyuiopasdfghjklzxcvbnm
C: .
S: 250 message accepted
C: QUIT
S: 221 ee.uwa.edu.au closing connection
Figure 7-16. A message from alice@cs.uchicago.edu to bob@ee.uwa.edu.au.
why the base64 MIME content transfer encoding was needed. However, with that encoding the mail transmission uses bandwidth inefficiently, which is an issue for large messages. A third limitation is that SMTP sends messages in the clear. It has no encryption to provide a measure of privacy against prying eyes.
To allow these and many other problems related to message processing to be addressed, SMTP was revised to have an extension mechanism. This mechanism
is a mandatory part of the RFC 5321 standard. The use of SMTP with extensions is called ESMTP (Extended SMTP).
Clients wanting to use an extension send an EHLO message instead of HELO initially. If this is rejected, the server is a regular SMTP server, and the client should proceed in the usual way. If the EHLO is accepted, the server replies with the extensions that it supports. The client may then use any of these extensions. Several common extensions are shown in Fig. 7-17. The figure gives the keyword as used in the extension mechanism, along with a description of the new functionality. We will not go into extensions in further detail.
Keyword Description
AUTH Client authentication
BINARYMIME Server accepts binary messages
CHUNKING Server accepts large messages in chunks
SIZE Check message size before trying to send
STARTTLS Switch to secure transport (TLS; see Chap. 8)
UTF8SMTP Internationalized addresses
Figure 7-17. Some SMTP extensions.
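A client that issues EHLO must parse the server's multi-line reply to learn which extensions it may use: continuation lines begin with ‘‘250-’’ and the final line with ‘‘250 ’’. A minimal Python sketch, assuming the full reply text has already been read from the connection (the sample reply in the usage note is invented):

```python
def parse_ehlo_reply(reply: str):
    # Each line begins with a three-digit reply code; "250-" marks a
    # continuation line and "250 " (with a space) marks the last line.
    code = None
    items = []
    for line in reply.splitlines():
        code = int(line[:3])
        items.append(line[4:].strip())
    # The first line is the server's greeting; the rest name extensions.
    return code, items[1:]
```

For example, parsing "250-mail.example.com greets you\r\n250-SIZE 52428800\r\n250-STARTTLS\r\n250 AUTH PLAIN LOGIN" yields code 250 and the extension list ["SIZE 52428800", "STARTTLS", "AUTH PLAIN LOGIN"].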
To get a better feel for how SMTP and some of the other protocols described in this chapter work, try them out. In all cases, first go to a machine connected to the Internet. On a UNIX (or Linux) system, in a shell, type
telnet mail.isp.com 25
substituting the DNS name of your ISP’s mail server for mail.isp.com. On a Windows machine, you may have to first install the telnet program (or equivalent) and then start it yourself. This command will establish a telnet (i.e., TCP) connection to port 25 on that machine. Port 25 is the SMTP port; see Fig. 6-34 for the ports for other common protocols. You will probably get a response something like this:
Trying 192.30.200.66...
Connected to mail.isp.com
Escape character is '^]'.
220 mail.isp.com Smail #74 ready at Thu, 25 Sept 2019 13:26 +0200
The first three lines are from telnet, telling you what it is doing. The last line is from the SMTP server on the remote machine, announcing its willingness to talk to you and accept email. To find out what commands it accepts, type
HELP
From this point on, a command sequence such as the one in Fig. 7-16 is possible if the server is willing to accept mail from you. You may have to type quickly, though, since the connection may time out if it is inactive too long. Also, not every mail server will accept a telnet connection from an unknown machine.
Mail Submission
Originally, user agents ran on the same computer as the sending message transfer agent. In this setting, all that is required to send a message is for the user agent to talk to the local mail server, using the dialog that we have just described. However, this setting is no longer the usual case.
User agents often run on laptops, home PCs, and mobile phones. They are not always connected to the Internet. Mail transfer agents run on ISP and company servers. They are always connected to the Internet. This difference means that a user agent in Boston may need to contact its regular mail server in Seattle to send a mail message because the user is traveling.
By itself, this remote communication poses no problem. It is exactly what the TCP/IP protocols are designed to support. However, an ISP or company usually does not want any remote user to be able to submit messages to its mail server to be delivered elsewhere. The ISP or company is not running the server as a public service. In addition, this kind of open mail relay attracts spammers. This is because it provides a way to launder the original sender and thus make the message more difficult to identify as spam.
Given these considerations, SMTP is normally used for mail submission with the AUTH extension. This extension lets the server check the credentials (username and password) of the client to confirm that the server should be providing mail service.
There are several other differences in the way SMTP is used for mail submission. For example, port 587 can be used in preference to port 25 and the SMTP server can check and correct the format of the messages sent by the user agent. For more information about the restricted use of SMTP for mail submission, please see RFC 4409.
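Python's standard smtplib module follows exactly this submission pattern. The sketch below connects on port 587, upgrades the connection with STARTTLS, authenticates with the AUTH extension, and sends a message; the server name and credentials in the usage comment are placeholders, not a real account:

```python
import smtplib
from email.message import EmailMessage

SUBMISSION_PORT = 587   # mail submission; port 25 is for server-to-server relay

def build_message(sender, recipient, subject, body):
    msg = EmailMessage()
    msg["From"], msg["To"], msg["Subject"] = sender, recipient, subject
    msg.set_content(body)
    return msg

def submit(msg, server, user, password):
    # Connect, switch to TLS, authenticate (the AUTH extension), send.
    with smtplib.SMTP(server, SUBMISSION_PORT) as s:
        s.starttls()
        s.login(user, password)
        s.send_message(msg)

# Hypothetical usage; "mail.example.com" and the credentials are placeholders:
# submit(build_message("alice@example.com", "bob@example.com", "Hi", "Hello"),
#        "mail.example.com", "alice", "secret")
```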
Physical Transfer
Once the sending mail transfer agent receives a message from the user agent, it will deliver it to the receiving mail transfer agent using SMTP. To do this, the sender uses the destination address. Consider the message in Fig. 7-16, addressed to bob@ee.uwa.edu.au. To what mail server should the message be delivered?
To determine the correct mail server to contact, DNS is consulted. In the previous section, we described how DNS contains multiple types of records, including the MX, or mail exchanger, record. In this case, a DNS query is made for the MX records of the domain ee.uwa.edu.au. This query returns an ordered list of the names and IP addresses of one or more mail servers.
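The sender tries the returned mail servers in order of their preference values, lowest first. A tiny sketch of that ordering step (the preference values and the pairing below are made up for illustration):

```python
# A resolver returns MX records as (preference, mail server) pairs;
# the sending transfer agent tries lower preference values first.
def order_mx(records):
    return [host for preference, host in sorted(records)]

mx_records = [(20, "backup.ee.uwa.edu.au"), (10, "mail.ee.uwa.edu.au")]
# order_mx(mx_records): try mail.ee.uwa.edu.au first, then the backup.
```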
The sending mail transfer agent then makes a TCP connection on port 25 to the IP address of the mail server to reach the receiving mail transfer agent, and uses SMTP to relay the message. The receiving mail transfer agent will then place mail for the user bob in the correct mailbox for Bob to read it at a later time. This local
delivery step may involve moving the message among computers if there is a large mail infrastructure.
With this delivery process, mail travels from the initial to the final mail transfer agent in a single hop. There are no intermediate servers in the message transfer stage. It is possible, however, for this delivery process to occur multiple times. One example that we have described already is when a message transfer agent implements a mailing list. In this case, a message is received for the list. It is then expanded as a message to each member of the list that is sent to the individual member addresses.
As another example of relaying, Bob may have graduated from M.I.T. and also be reachable via the address bob@alum.mit.edu. Rather than reading mail on multiple accounts, Bob can arrange for mail sent to this address to be forwarded to bob@ee.uwa.edu.au. In this case, mail sent to bob@alum.mit.edu will undergo two deliveries. First, it will be sent to the mail server for alum.mit.edu. Then, it will be sent to the mail server for ee.uwa.edu.au. Each of these legs is a complete and separate delivery as far as the mail transfer agents are concerned.
7.2.5 Final Delivery
Our mail message is almost delivered. It has arrived at Bob’s mailbox. All that remains is to transfer a copy of the message to Bob’s user agent for display. This is step 3 in the architecture of Fig. 7-9. This task was straightforward in the early Internet, when the user agent and mail transfer agent ran on the same machine as different processes. The mail transfer agent simply wrote new messages to the end of the mailbox file, and the user agent simply checked the mailbox file for new mail.
Nowadays, the user agent on a PC, laptop, or mobile, is likely to be on a different machine than the ISP or company mail server and certain to be on a different machine for a mail provider such as Gmail. Users want to be able to access their mail remotely, from wherever they are. They want to access email from work, from their home PCs, from their laptops when on business trips, and from cybercafes when on so-called vacation. They also want to be able to work offline, then reconnect to receive incoming mail and send outgoing mail. Moreover, each user may run several user agents depending on what computer it is convenient to use at the moment. Several user agents may even be running at the same time.
In this setting, the job of the user agent is to present a view of the contents of the mailbox, and to allow the mailbox to be remotely manipulated. Several different protocols can be used for this purpose, but SMTP is not one of them. SMTP is a push-based protocol. It takes a message and connects to a remote server to transfer the message. Final delivery cannot be achieved in this manner both because the mailbox must continue to be stored on the mail transfer agent and because the user agent may not be connected to the Internet at the moment that SMTP attempts to relay messages.
IMAP—The Internet Message Access Protocol
One of the main protocols that is used for final delivery is IMAP (Internet Message Access Protocol). Version 4 of the protocol is defined in RFC 3501 and in its many updates. To use IMAP, the mail server runs an IMAP server that listens to port 143. The user agent runs an IMAP client. The client connects to the server and begins to issue commands from those listed in Fig. 7-18.
Command Description
CAPABILITY List server capabilities
STARTTLS Start secure transport (TLS; see Chap. 8)
LOGIN Log on to server
AUTHENTICATE Log on with other method
SELECT Select a folder
EXAMINE Select a read-only folder
CREATE Create a folder
DELETE Delete a folder
RENAME Rename a folder
SUBSCRIBE Add folder to active set
UNSUBSCRIBE Remove folder from active set
LIST List the available folders
LSUB List the active folders
STATUS Get the status of a folder
APPEND Add a message to a folder
CHECK Get a checkpoint of a folder
FETCH Get messages from a folder
SEARCH Find messages in a folder
STORE Alter message flags
COPY Make a copy of a message in a folder
EXPUNGE Remove messages flagged for deletion
UID Issue commands using unique identifiers
NOOP Do nothing
CLOSE Remove flagged messages and close folder
LOGOUT Log out and close connection
Figure 7-18. IMAP (version 4) commands.
First, the client will start a secure transport if one is to be used (in order to keep the messages and commands confidential), and then log in or otherwise authenticate itself to the server. Once logged in, there are many commands to list folders and messages, fetch messages or even parts of messages, mark messages
with flags for later deletion, and organize messages into folders. To avoid confusion, please note that we use the term ‘‘folder’’ here to be consistent with the rest of the material in this section, in which a user has a single mailbox made up of multiple folders. However, in the IMAP specification, the term mailbox is used instead. One user thus has many IMAP mailboxes, each of which is typically presented to the user as a folder.
IMAP has many other features, too. It has the ability to address mail not by message number, but by using attributes (e.g., give me the first message from Alice). Searches can be performed on the server to find the messages that satisfy certain criteria so that only those messages are fetched by the client.
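One detail visible on the wire is that every IMAP command carries a client-chosen tag, so that server replies, which may arrive out of order, can be matched to the commands that caused them. A minimal sketch of generating tagged command lines (the credentials and folder name are hypothetical):

```python
import itertools

# Every IMAP command is prefixed with a client-chosen tag ("a001",
# "a002", ...) so that replies can be paired with outstanding commands.
_tags = itertools.count(1)

def tagged(command: str) -> str:
    return f"a{next(_tags):03d} {command}"

# A hypothetical fragment of a session, using commands from Fig. 7-18:
lines = [
    tagged("LOGIN bob secret"),
    tagged("SELECT INBOX"),
    tagged("FETCH 1 BODY[]"),
]
```

Python's standard imaplib module manages these tags automatically; the sketch merely shows the wire format.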
IMAP is an improvement over an earlier final delivery protocol, POP3 (Post Office Protocol, version 3), which is specified in RFC 1939. POP3 is a simpler protocol but supports fewer features and is less secure in typical usage. Mail is usually downloaded to the user agent computer, instead of remaining on the mail server. This makes life easier on the server, but harder on the user. It is not easy to read mail on multiple computers, plus if the user agent computer breaks, all email may be lost permanently. Nonetheless, you will still find POP3 in use.
Proprietary protocols can also be used because the protocol runs between a mail server and user agent that can be supplied by the same company. Microsoft Exchange is a mail system with a proprietary protocol.
Webmail
An increasingly popular alternative to IMAP and SMTP for providing email service is to use the Web as an interface for sending and receiving mail. Widely used Webmail systems include Google Gmail, Microsoft Hotmail and Yahoo! Mail. Webmail is one example of software (in this case, a mail user agent) that is provided as a service using the Web.
In this architecture, the provider runs mail servers as usual to accept messages for users with SMTP on port 25. However, the user agent is different. Instead of being a standalone program, it is a user interface that is provided via Web pages. This means that users can use any browser they like to access their mail and send new messages.
When the user goes to the email Web page of the provider, say, Gmail, a form is presented in which the user is asked for a login name and password. The login name and password are sent to the server, which then validates them. If the login is successful, the server finds the user’s mailbox and builds a Web page listing the contents of the mailbox on the fly. The Web page is then sent to the browser for display.
Many of the items on the page showing the mailbox are clickable, so messages can be read, deleted, and so on. To make the interface responsive, the Web pages will often include JavaScript programs. These programs are run locally on the client in response to local events (e.g., mouse clicks) and can also download and
upload messages in the background, to prepare the next message for display or a new message for submission. In this model, mail submission happens using the normal Web protocols by posting data to a URL. The Web server takes care of injecting messages into the traditional mail delivery system that we have described. For security, the standard Web protocols can be used as well. These protocols concern themselves with encrypting Web pages, not whether the content of the Web page is a mail message.
7.3 THE WORLD WIDE WEB
The Web, as the World Wide Web is popularly known, is an architectural framework for accessing linked content spread out over millions of machines all over the Internet. In 10 years it went from being a way to coordinate the design of high-energy physics experiments in Switzerland to the application that millions of people think of as being ‘‘The Internet.’’ Its enormous popularity stems from the fact that it is easy for beginners to use and provides access with a rich graphical interface to an enormous wealth of information on almost every conceivable subject, from aardvarks to Zulus.
The Web began in 1989 at CERN, the European Center for Nuclear Research. The initial idea was to help large teams, often with members in a dozen or more countries and time zones, collaborate using a constantly changing collection of reports, blueprints, drawings, photos, and other documents produced by experiments in particle physics. The proposal for a Web of linked documents came from CERN physicist Tim Berners-Lee. The first (text-based) prototype was operational 18 months later. A public demonstration given at the Hypertext ’91 conference caught the attention of other researchers, which led Marc Andreessen at the University of Illinois to develop the first graphical browser. It was called Mosaic and released in February 1993.
The rest, as they say, is now history. Mosaic was so popular that a year later Andreessen left to form a company, Netscape Communications Corp., whose goal was to develop Web software. For the next three years, Netscape Navigator and Microsoft’s Internet Explorer engaged in a ‘‘browser war,’’ each one trying to capture a larger share of the new market by frantically adding more features (and thus more bugs) than the other one.
Through the 1990s and 2000s, Web sites and Web pages, as Web content is called, grew exponentially until there were millions of sites and billions of pages. A small number of these sites became tremendously popular. Those sites and the companies behind them largely define the Web as people experience it today. Examples include: a bookstore (Amazon, started in 1994), a flea market (eBay, 1995), search (Google, 1998), and social networking (Facebook, 2004). The period through 2000, when many Web companies became worth hundreds of millions of dollars overnight, only to go bust practically the next day when they turned
out to be hype, even has a name. It is called the dot com era. New ideas are still striking it rich on the Web. Many of them come from students. For example, Mark Zuckerberg was a Harvard student when he started Facebook, and Sergey Brin and Larry Page were students at Stanford when they started Google. Perhaps you will come up with the next big thing.
In 1994, CERN and M.I.T. signed an agreement setting up the W3C (World Wide Web Consortium), an organization devoted to further developing the Web, standardizing protocols, and encouraging interoperability between sites. Berners-Lee became the director. Since then, several hundred universities and companies have joined the consortium. Although there are now more books about the Web than you can shake a stick at, the best place to get up-to-date information about the Web is (naturally) on the Web itself. The consortium’s home page is at www.w3.org. Interested readers are referred there for links to pages covering all of the consortium’s numerous documents and activities.
7.3.1 Architectural Overview
From the users’ point of view, the Web comprises a vast, worldwide collection of content in the form of Web pages. Each page typically contains links to hundreds of other objects, which may be hosted on any server on the Internet, anywhere in the world. These objects may be other text and images, but nowadays also include a wide variety of objects, including advertisements and tracking scripts. A page may also link to other Web pages; users can follow a link by clicking on it, which then takes them to the page pointed to. This process can be repeated indefinitely. The idea of having one page point to another, now called hypertext, was invented by a visionary M.I.T. professor of electrical engineering, Vannevar Bush, in 1945 (Bush, 1945). This was long before the Internet was invented. In fact, it was before commercial computers existed although several universities had produced crude prototypes that filled large rooms and had millions of times less computing power than a smart watch but consumed more electrical power than a small factory.
Pages are generally viewed with a program called a browser. Brave, Chrome, Edge, Firefox, Opera, and Safari are examples of popular browsers. The browser fetches the page requested, interprets the content, and displays the page, properly formatted, on the screen. The content itself may be a mix of text, images, and formatting commands, in the manner of a traditional document, or other forms of content such as video or programs that produce a graphical interface for users.
Figure 7-19 shows an example of a Web page, which contains many objects. In this case, the page is for the U.S. Federal Communications Commission. This page shows text and graphical elements (which are mostly too small to read here). Many parts of the page include references and links to other pages. The index page, which the browser loads, typically contains instructions for the browser
concerning the locations of other objects to assemble, as well as how and where to render those objects on the page.
A piece of text, icon, graphic image, photograph, or other page element that can be associated with another page is called a hyperlink. To follow a link, a desktop or notebook computer user places the mouse cursor on the linked portion of the page area (which causes the cursor to change shape) and clicks. On a smartphone or tablet, the user taps the link. Following a link is simply a way of telling the browser to fetch another page. In the early days of the Web, links were highlighted with underlining and colored text so that they would stand out. Now, the creators of Web pages can use style sheets to control the appearance of many aspects of the page, including hyperlinks, so links can effectively appear however the designer of the Web site wishes. The appearance of a link can even be dynamic, for example, it might change its appearance when the mouse passes over it. It is up to the creators of the page to make the links visually distinct to provide a good user experience.
Figure 7-19. Fetching and rendering a Web page involves HTTP/HTTPS requests to many servers.
Readers of this page might find a story of interest and click on the area indicated, at which point the browser fetches the new page and displays it. Dozens of other pages are linked off the first page besides this example. Every other page can consist of content on the same machine(s) as the first page, or on machines halfway around the globe. The user cannot tell. The browser typically fetches whatever objects the user indicates to the browser through a series of clicks. Thus, moving between machines while viewing content is seamless.
The browser is displaying a Web page on the client machine. Each page is fetched by sending a request to one or more servers, which respond with the contents of the page. The request-response protocol for fetching pages is a simple text-based protocol that runs over TCP, just as was the case for SMTP. It is called HTTP (HyperText Transfer Protocol). The secure version of this protocol, which is now the predominant mode of retrieving content on the Web today, is called HTTPS (Secure HyperText Transfer Protocol). The content may simply be a document that is read off a disk, or the result of a database query and program execution. The page is a static page if it is a document that is the same every time it is displayed. In contrast, if it was generated on demand by a program or contains a program it is a dynamic page.
A dynamic page may present itself differently each time it is displayed. For example, the front page for an electronic store may be different for each visitor. If a bookstore customer has bought mystery novels in the past, upon visiting the store’s main page, the customer is likely to see new thrillers prominently displayed, whereas a more culinary-minded customer might be greeted with new cookbooks. How the Web site keeps track of who likes what is a story to be told shortly. But briefly, the answer involves cookies (even for culinarily challenged visitors).
In Fig. 7-19, the browser contacts a number of servers to load the Web page. The content on the index page might be loaded directly from files hosted at fcc.gov. Auxiliary content, such as an embedded video, might be hosted at a separate server, still at fcc.gov, but perhaps on infrastructure that is dedicated to hosting the content. The index page may also contain references to other objects that the user may not even see, such as tracking scripts, or advertisements that are hosted on third-party servers. The browser fetches all of these objects, scripts, and so forth and assembles them into a single page view for the user.
Display entails a range of processing that depends on the kind of content. Besides rendering text and graphics, it may involve playing a video or running a script that presents its own user interface as part of the page. In this case, the fcc.gov server supplies the main page, the fonts.gstatic.com server supplies additional objects (e.g., fonts), and the google-analytics.com server supplies nothing that the user can see but tracks visitors to the site. We will investigate trackers and Web privacy later in this chapter.
The Client Side
Let us now examine the Web browser side in Fig. 7-19 in more detail. In essence, a browser is a program that can display a Web page and capture a user’s request to ‘‘follow’’ other content on the page. When an item is selected, the browser follows the hyperlink and retrieves the object that the user indicates (e.g., with a mouse click, or by tapping the link on the screen of a mobile device).
When the Web was first created, it was immediately apparent that having one page point to another Web page required mechanisms for naming and locating
pages. In particular, three questions had to be answered before a selected page could be displayed:
1. What is the page called?
2. Where is the page located?
3. How can the page be accessed?
If every page were somehow assigned a unique name, there would not be any ambiguity in identifying pages. Nevertheless, the problem would not be solved. Consider a parallel between people and pages. In the United States, almost every adult has a Social Security number, which is a unique identifier, as no two people are supposed to have the same one. Nevertheless, if you are armed only with a Social Security number, there is no way to find the owner’s address, and certainly no way to tell whether you should write to the person in English, Spanish, or Chinese. The Web has basically the same problems.
The solution chosen identifies pages in a way that solves all three problems at once. Each page is assigned a URL (Uniform Resource Locator) that effectively serves as the page’s worldwide name. URLs have three parts: the protocol (also
known as the scheme), the DNS name of the machine on which the page is located, and the path uniquely indicating the specific page (a file to read or program to run on the machine). In the general case, the path has a hierarchical name that models a file directory structure. However, the interpretation of the path is up to the server; it may or may not reflect the actual directory structure. As an example, the URL of the page shown in Fig. 7-19 is
https://fcc.gov/
This URL consists of three parts: the protocol (https), the DNS name of the host (fcc.gov), and the path name (/, which the Web server often treats as some default index object).
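To make the three-part structure concrete, here is a small sketch in Python that splits a URL into protocol, DNS name, and path using the standard library. The helper name split_url is our own invention for this example, not part of any browser:

```python
from urllib.parse import urlparse

# Split a URL into the three parts described above: the protocol
# (scheme), the DNS name of the host, and the path on that host.
def split_url(url):
    parts = urlparse(url)
    # An empty path is equivalent to "/", which the server treats
    # as a request for its default index object.
    return parts.scheme, parts.netloc, parts.path or "/"

scheme, host, path = split_url("https://fcc.gov/")
```

Running split_url on "https://fcc.gov/" yields the scheme https, the host fcc.gov, and the path /.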
When a user selects a hyperlink, the browser carries out a series of steps in order to fetch the page pointed to. Let us trace the steps that occur when our example link is selected:
1. The browser determines the URL (by seeing what was selected).
2. The browser asks DNS for the IP address of the server fcc.gov.
3. DNS replies with 23.1.55.196.
4. The browser makes a TCP connection to that IP address; given that the protocol is HTTPS, the secure version of HTTP, the TCP connection would by default be on port 443 (the default port for HTTP, which is used far less often now, is port 80).
5. It sends an HTTPS request asking for the page /, which the Web server typically assumes is some index page (e.g., index.html, index.php, or similar, as configured by the Web server at fcc.gov).
SEC. 7.3 THE WORLD WIDE WEB 655
6. The server sends the page as an HTTPS response, for example, by sending the file /index.html, if that is determined to be the default index object.
7. If the page includes URLs that are needed for display, the browser fetches the other URLs using the same process. In this case, the URLs include multiple embedded images also fetched from that server, embedded objects from gstatic.com, and a script from google-analytics.com (as well as a number of other domains that are not shown).
8. The browser displays the page /index.html as it appears in Fig. 7-19.
9. The TCP connections are released if there are no other requests to the same servers for a short period.
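The steps above can be sketched in a few lines of Python. This is a toy, not a real browser: it performs the DNS lookup, the TCP and TLS connection to port 443, and a single HTTP/1.1 request for the index page, with all error handling, parallel connections, and connection reuse omitted:

```python
import socket
import ssl

def build_request(host, path="/"):
    # Step 5: an HTTP/1.1 GET for the default index object ("/").
    # Header lines are ASCII, separated by CRLF, ended by a blank line.
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n"
            "\r\n")

def fetch_index(host):
    # Steps 2-3: ask DNS for the IP address of the server.
    addr = socket.gethostbyname(host)
    # Step 4: TCP connection to that address, wrapped in TLS (HTTPS).
    ctx = ssl.create_default_context()
    with socket.create_connection((addr, 443)) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as conn:
            # Step 5: send the request.
            conn.sendall(build_request(host).encode("ascii"))
            # Step 6: read the response until the server closes.
            chunks = []
            while True:
                data = conn.recv(4096)
                if not data:
                    break
                chunks.append(data)
    return b"".join(chunks)  # steps 7-9 (embedded objects, display) omitted
```

A real browser would then parse the returned HTML, repeat the process for every embedded object, and render the result.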
Many browsers display which step they are currently executing in a status line at the bottom of the screen. In this way, when the performance is poor, the user can see if it is due to DNS not responding, a server not responding, or simply page transmission over a slow or congested network.
A more detailed way to explore and understand the performance of the Web page is through a so-called waterfall diagram, as shown in Fig. 7-20. The figure shows a list of all of the objects that the browser loads in the process of loading this page (in this case, 64, but many pages have hundreds of objects), as well as the timing dependencies associated with loading each request, and the operations associated with each page load (e.g., a DNS lookup, a TCP connection, the downloading of actual content, and so forth). These waterfall diagrams can tell us a lot about the behavior of a Web browser; for example, we can learn about the number of parallel connections that a browser makes to any given server, as well as whether connections are being reused. We can also learn about the relative time for DNS lookups versus actual object downloads, as well as other potential performance bottlenecks.
The URL design is open-ended in the sense that it is straightforward to have browsers use multiple protocols to retrieve different kinds of resources. In fact, URLs for various other protocols have been defined. Slightly simplified forms of the common ones are listed in Fig. 7-21.
Let us briefly go over the list. The http protocol is the Web’s native language, the one spoken by Web servers. HTTP stands for HyperText Transfer Protocol. We will examine it in more detail later in this section, with a particular focus on HTTPS, the secure version of this protocol, which is now the predominant protocol used to serve objects on the Web today.
The ftp protocol is used to access files by FTP, the Internet’s file transfer protocol. FTP predates the Web and has been in use for more than four decades. The Web makes it easy to obtain files placed on numerous FTP servers throughout the world by providing a simple, clickable interface instead of the older command-line
Figure 7-20. Waterfall diagram for fcc.gov.
interface. This improved access to information is one reason for the spectacular growth of the Web.
It is possible to access a local file as a Web page by using the file protocol, or more simply, by just naming it. This approach does not require having a server. Of course, it works only for local files, not remote ones.
The mailto protocol does not really have the flavor of fetching Web pages, but is still useful anyway. It allows users to send email from a Web browser. Most
Name     Used for                 Example
http     Hypertext (HTML)         http://www.ee.uwa.edu/~rob/
https    Hypertext with security  https://www.bank.com/accounts/
ftp      FTP                      ftp://ftp.cs.vu.nl/pub/minix/README
file     Local file               file:///usr/nathan/prog.c
mailto   Sending email            mailto:JohnUser@acm.org
rtsp     Streaming media          rtsp://youtube.com/montypython.mpg
sip      Multimedia calls         sip:eve@adversary.com
about    Browser information      about:plugins

Figure 7-21. Some common URL schemes.
browsers will respond when a mailto link is followed by starting the user’s mail agent to compose a message with the address field already filled in. The rtsp and sip protocols are for establishing streaming media sessions and audio and video calls.
Finally, the about protocol is a convention that provides information about the browser. For example, following the about:plugins link will cause most browsers to show a page that lists the MIME types that they handle with browser extensions called plug-ins. Many browsers have very interesting information in the about: section; an interesting example in the Firefox browser is about:telemetry, which shows all of the performance and user activity information that the browser gathers about the user. about:preferences shows user preferences, and about:config shows many interesting aspects of the browser configuration, including whether the browser is performing DNS-over-HTTPS lookups (and to which trusted recursive resolvers), as described in the previous section on DNS.
The URLs themselves have been designed not only to allow users to navigate the Web, but to run older protocols such as FTP and email as well as newer protocols for audio and video, and to provide convenient access to local files and browser information. This approach makes all the specialized user interface programs for those other services unnecessary and integrates nearly all Internet access into a single program: the Web browser. If it were not for the fact that this idea was thought of by a British physicist working at a multinational European research lab in Switzerland (CERN), it could easily pass for a plan dreamed up by some software company’s advertising department.
The Server Side
So much for the client side. Now let us take a look at the server side. As we saw above, when the user types in a URL or clicks on a line of hypertext, the browser parses the URL and interprets the part between https:// and the next slash as a DNS name to look up. Armed with the IP address of the server, the browser can
establish a TCP connection to port 443 on that server. Then it sends over a command containing the rest of the URL, which is the path to the page on that server. The server then returns the page for the browser to display.
To a first approximation, a simple Web server is similar to the server of Fig. 6-6. That server is given the name of a file to look up and return via the network. In both cases, the steps that the server performs in its main loop are:
1. Accept a TCP connection from a client (a browser).
2. Get the path to the page, which is the name of the file requested.
3. Get the file (from disk).
4. Send the contents of the file to the client.
5. Release the TCP connection.
Modern Web servers have more features, but in essence, this is what a Web server does for the simple case of content that is contained in a file. For dynamic content, the third step may be replaced by the execution of a program (determined from the path) that generates and returns the contents.
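The five-step loop can be written down almost directly. The toy server below is a sketch, not production code: it assumes the whole request arrives in one recv(), trusts the client-supplied path, and speaks just enough HTTP to return a file or a 404:

```python
import socket

# Serve a single request following the five steps above.
def serve_once(listener, webroot="."):
    conn, _ = listener.accept()                  # 1. accept a TCP connection
    request = conn.recv(4096).decode("ascii")
    path = request.split()[1]                    # 2. "GET /page HTTP/1.1"
    try:
        with open(webroot + path, "rb") as f:    # 3. get the file from disk
            body = f.read()
        header = b"HTTP/1.1 200 OK\r\n\r\n"
    except OSError:
        body = b"not found"
        header = b"HTTP/1.1 404 Not Found\r\n\r\n"
    conn.sendall(header + body)                  # 4. send the contents
    conn.close()                                 # 5. release the connection
```

A real server would, at a minimum, validate the path (so clients cannot escape the webroot), send proper headers such as Content-Type and Content-Length, and keep the connection open for further requests.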
However, Web servers are implemented with a different design to serve hundreds or thousands of requests per second. One problem with the simple design is that accessing files is often the bottleneck. Disk reads are very slow compared to program execution, and the same files may be read repeatedly from disk using operating system calls. Another problem is that only one request is processed at a time. If the file is large, other requests will be blocked while it is transferred.
One obvious improvement (used by all Web servers) is to maintain a cache in memory of the n most recently read files or a certain number of gigabytes of content. Before going to disk to get a file, the server checks the cache. If the file is there, it can be served directly from memory, thus eliminating the disk access. Although effective caching requires a large amount of main memory and some extra processing time to check the cache and manage its contents, the savings in time are nearly always worth the overhead and expense.
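A minimal version of such a cache might look like the following sketch, which keeps the n most recently read files in memory and evicts the least recently used one when full (a real server would also bound the cache by total bytes, not just by file count):

```python
from collections import OrderedDict

# An in-memory cache of recently read files, as described above.
# OrderedDict remembers insertion order, which gives us cheap
# least-recently-used (LRU) eviction.
class FileCache:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.files = OrderedDict()

    def get(self, path):
        if path in self.files:
            self.files.move_to_end(path)       # mark as recently used
            return self.files[path]            # served from memory: no disk access
        with open(path, "rb") as f:            # cache miss: go to disk
            data = f.read()
        self.files[path] = data
        if len(self.files) > self.capacity:
            self.files.popitem(last=False)     # evict least recently used
        return data
```

Once a file is cached, subsequent requests for it are served without touching the disk at all.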
To tackle the problem of serving more than a single request at a time, one strategy is to make the server multithreaded. In one design, the server consists of a front-end module that accepts all incoming requests and k processing modules, as shown in Fig. 7-22. The k + 1 threads all belong to the same process, so the processing modules all have access to the cache within the process’ address space. When a request comes in, the front end accepts it and builds a short record describing it. It then hands the record to one of the processing modules.
The processing module first checks the cache to see if the requested object is present. If so, it updates the record with a pointer to the file in the cache. If it is not there, the processing module starts a disk operation to read it into the cache (possibly discarding some other cached file(s) to make room for it). When the file comes in from the disk, it is put in the cache and also sent back to the client.
Figure 7-22. A multithreaded Web server with a front end and processing modules.
The advantage of this approach is that while one or more processing modules are blocked waiting for a disk or network operation to complete (and thus consuming no CPU time), other modules can be actively working on other requests. With k processing modules, the throughput can be as much as k times higher than with a single-threaded server. Of course, when the disk or network is the limiting factor, it is necessary to have multiple disks or a faster network to get any real improvement over the single-threaded model.
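The front-end/processing-module split can be illustrated with threads and a shared queue. In this sketch the ‘‘requests’’ are just strings, handle() stands in for the slow disk read, and the shared cache is protected by a lock (a real server would release the lock during the disk operation rather than hold it); each result records whether it was a cache hit:

```python
import queue
import threading

# One front end hands short records to k processing-module threads,
# which all share one in-process cache, as in Fig. 7-22.
def run_server(requests, k, handle):
    work = queue.Queue()
    cache, lock = {}, threading.Lock()
    results = []

    def processing_module():
        while True:
            record = work.get()
            if record is None:          # sentinel: shut down
                break
            with lock:
                hit = record in cache
                if not hit:
                    cache[record] = handle(record)  # the "disk" on a miss
                value = cache[record]
            results.append((record, value, hit))
            work.task_done()

    modules = [threading.Thread(target=processing_module) for _ in range(k)]
    for m in modules:
        m.start()
    for r in requests:                  # the front end hands out records
        work.put(r)
    work.join()
    for _ in modules:
        work.put(None)
    for m in modules:
        m.join()
    return results
```

With two modules and the requests ["a", "b", "a"], one of the two "a" requests is served from the cache rather than from the (simulated) disk.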
Essentially all modern Web architectures are now designed as shown above, with a split between the front end and a back end. The front-end Web server is often called a reverse proxy, because it retrieves content from other (typically back-end) servers and serves those objects to the client. The proxy is called a ‘‘reverse’’ proxy because it is acting on behalf of the servers, as opposed to acting on behalf of clients.
When loading a Web page, a client will often first be directed (using DNS) to a reverse proxy (i.e., front-end server), which will begin returning static objects to the client’s Web browser so that it can begin loading some of the page contents as quickly as possible. While those (typically static) objects are loading, the back end can perform complex operations (e.g., performing a Web search, doing a database lookup, or otherwise generating dynamic content), which it can serve back to the client via the reverse proxy as those results and content become available.
7.3.2 Static Web Objects
The basis of the Web is transferring Web pages from server to client. In the simplest form, Web objects are static. However, these days, almost any page that you view on the Web will have some dynamic content, but even on dynamic Web pages, a significant amount of the content (e.g., the logo, the style sheets, the header and footer) remains static. Static objects are just files sitting on some server that present themselves in the same way each time they are fetched and viewed. They
are generally amenable to caching, sometimes for a very long time, and are thus often placed on object caches that are close to the user. Just because they are static does not mean that the pages are inert at the browser, however. A video is a static object, for example.
As mentioned earlier, the lingua franca of the Web, in which most pages are written, is HTML. The home pages of university instructors are generally static objects; in some cases, companies may have dynamic Web pages, but the end result of the dynamic-generation process is a page in HTML. HTML (HyperText Markup Language) was introduced with the Web. It allows users to produce Web pages that include text, graphics, video, pointers to other Web pages, and more. HTML is a markup language, or language for describing how documents are to be formatted. The term ‘‘markup’’ comes from the old days when copyeditors actually marked up documents to tell the printer—in those days, a human being—which fonts to use, and so on. Markup languages thus contain explicit commands for formatting. For example, in HTML, <b> means start boldface mode, and </b> means leave boldface mode. Also, <h1> means to start a level 1 heading here. LaTeX and TeX are other examples of markup languages that are well known to most academic authors. In contrast, Microsoft Word is not a markup language because the formatting commands are not embedded in the text.
The key advantage of a markup language over one with no explicit markup is that it separates content from how it should be presented. Most modern Web pages use style sheets to define the typefaces, colors, sizes, padding, and many other attributes of text, lists, tables, headings, ads, and other page elements. Style sheets are written in a language called CSS (Cascading Style Sheets).
Writing a browser is then straightforward: the browser simply has to under- stand the markup commands and style sheet and apply them to the content. Embedding all the markup commands within each HTML file and standardizing them makes it possible for any Web browser to read and reformat any Web page. That is crucial because a page may have been produced in a 3840 × 2160 window with 24-bit color on a high-end computer but may have to be displayed in a 640 × 320 window on a mobile phone. Just scaling it down linearly is a bad idea because then the letters would be so small that no one could read them.
While it is certainly possible to write documents like this with any plain text editor, and many people do, it is also possible to use word processors or special HTML editors that do most of the work (but correspondingly give the user less direct control over the details of the final result). There are also many programs available for designing Web pages, such as Adobe Dreamweaver.
7.3.3 Dynamic Web Pages and Web Applications
The static page model we have used so far treats pages as (multimedia) documents that are conveniently linked together. It was a good model back in the early days of the Web, as vast amounts of information were put online. Nowadays,
much of the excitement around the Web is using it for applications and services. Examples include buying products on e-commerce sites, searching library catalogs, exploring maps, reading and sending email, and collaborating on documents.
These new uses are like conventional application software (e.g., mail readers and word processors). The twist is that these applications run inside the browser, with user data stored on servers in Internet data centers. They use Web protocols to access information via the Internet, and the browser to display a user interface. The advantage of this approach is that users do not need to install separate application programs, and user data can be accessed from different computers and backed up by the service operator. It is proving so successful that it is rivaling traditional application software. Of course, the fact that these applications are offered for free by large providers helps. This model is a prevalent form of cloud computing, where computing moves off individual desktop computers and into shared clusters of servers in the Internet.
To act as applications, Web pages can no longer be static. Dynamic content is needed. For example, a page of the library catalog should reflect which books are currently available and which books are checked out and are thus not available. Similarly, a useful stock market page would allow the user to interact with the page to see stock prices over different periods of time and compute profits and losses. As these examples suggest, dynamic content can be generated by programs running on the server or in the browser (or in both places).
The general situation is as shown in Fig. 7-23. For example, consider a map service that lets the user enter a street address and presents a corresponding map of the location. Given a request for a location, the Web server must use a program to create a page that shows the map for the location from a database of streets and other geographic information. This action is shown as steps 1 through 3. The request (step 1) causes a program to run on the server. The program consults a database to generate the appropriate page (step 2) and returns it to the browser (step 3).
Figure 7-23. Dynamic pages.
There is more to dynamic content, however. The page that is returned may itself contain programs that run in the browser. In our map example, the program
would let the user find routes and explore nearby areas at different levels of detail. It would update the page, zooming in or out as directed by the user (step 4). To handle some interactions, the program may need more data from the server. In this case, the program will send a request to the server (step 5) that will retrieve more information from the database (step 6) and return a response (step 7). The program will then continue updating the page (step 4). The requests and responses happen in the background; the user may not even be aware of them because the page URL and title typically do not change. By including client-side programs, the page can present a more responsive interface than with server-side programs alone.
Server-Side Dynamic Web Page Generation
Let us look briefly at the case of server-side content generation. When the user clicks on a link in a form, for example in order to buy something, a request is sent to the server at the URL specified with the form along with the contents of the form as filled in by the user. These data must be given to a program or script to process. Thus, the URL identifies the program to run; the data are provided to the program as input. The page returned by this request will depend on what happens during the processing. It is not fixed like a static page. If the order succeeds, the page returned might give the expected shipping date. If it is unsuccessful, the returned page might say that the widgets requested are out of stock or the credit card was not valid for some reason.
Exactly how the server runs a program instead of retrieving a file depends on the design of the Web server. It is not specified by the Web protocols themselves. This is because the interface can be proprietary and the browser does not need to know the details. As far as the browser is concerned, it is simply making a request and fetching a page.
Nonetheless, standard APIs have been developed for Web servers to invoke programs. The existence of these interfaces makes it easier for developers to extend different servers with Web applications. We will briefly look at two APIs to give you a sense of what they entail.
The first API is a method for handling dynamic page requests that has been available since the beginning of the Web. It is called the CGI (Common Gateway Interface) and is defined in RFC 3875. CGI provides an interface to allow Web servers to talk to back-end programs and scripts that can accept input (e.g., from
forms) and generate HTML pages in response. These programs may be written in whatever language is convenient for the developer, usually a scripting language for ease of development. Pick Python, Ruby, Perl, or your favorite language.
By convention, programs invoked via CGI live in a directory called cgi-bin, which is visible in the URL. The server maps a request to this directory to a program name and executes that program as a separate process. It provides any data sent with the request as input to the program. The output of the program gives a Web page that is returned to the browser.
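As an illustration, here is what a tiny CGI-style program in Python might look like. The script and its ‘‘name’’ form field are invented for this example; the general shape, however, is what CGI prescribes: the server passes the query string (or form data) to the program, and whatever the program writes to standard output, headers, a blank line, then the generated page, becomes the response sent back to the browser:

```python
#!/usr/bin/env python3
# A minimal CGI-style program of the kind that would live in cgi-bin/.
import os
from urllib.parse import parse_qs

def respond(query_string):
    params = parse_qs(query_string)
    name = params.get("name", ["world"])[0]
    body = f"<html><body><h1>Hello, {name}!</h1></body></html>"
    # The blank line separates the CGI headers from the generated page.
    return "Content-Type: text/html\r\n\r\n" + body

if __name__ == "__main__":
    # The server passes the query string in an environment variable.
    print(respond(os.environ.get("QUERY_STRING", "")))
```

A request for /cgi-bin/hello.py?name=Nathan (a hypothetical URL) would thus come back as a small HTML page greeting Nathan.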
The second API is quite different. The approach here is to embed little scripts inside HTML pages and have them be executed by the server itself to generate the page. A popular language for writing these scripts is PHP (PHP: Hypertext Preprocessor). To use it, the server has to understand PHP, just as a browser has to understand CSS to interpret Web pages with style sheets. Usually, servers identify Web pages containing PHP from the file extension php rather than html or htm. PHP is simpler to use than CGI and is widely used.
Although PHP is easy to use, it is actually a powerful programming language for interfacing the Web and a server database. It has variables, strings, arrays, and most of the control structures found in C, but much more powerful I/O than just printf. PHP is open source code, freely available, and widely used. It was designed specifically to work well with Apache, which is also open source and is the world’s most widely used Web server.
Client-Side Dynamic Web Page Generation
PHP and CGI scripts solve the problem of handling input and interactions with databases on the server. They can all accept incoming information from forms, look up information in one or more databases, and generate HTML pages with the results. What none of them can do is respond to mouse movements or interact with users directly. For this purpose, it is necessary to have scripts embedded in HTML pages that are executed on the client machine rather than the server machine. Starting with HTML 4.0, such scripts were permitted using the tag <script>. The current HTML standard is now generally referred to as HTML5. HTML5 includes many new syntactic features for incorporating multimedia and graphical content, including <video>, <audio>, and <canvas> tags. Notably, the canvas element facilitates dynamic rendering of two-dimensional shapes and bitmap images. Interestingly, the canvas element also has various privacy considerations, because the HTML canvas properties are often unique on different devices. The privacy concerns are significant, because the uniqueness of canvases on individual user devices allows Web site operators to track users, even if the users delete all tracking cookies and block tracking scripts.
The most popular scripting language for the client side is JavaScript, so we will now take a quick look at it. Many books have been written about it (e.g., Coding, 2019; and Atencio, 2020). Despite the similarity in names, JavaScript has almost nothing to do with the Java programming language. Like other scripting languages, it is a very high-level language. For example, in a single line of JavaScript it is possible to pop up a dialog box, wait for text input, and store the resulting string in a variable. High-level features like this make JavaScript ideal for designing interactive Web pages. On the other hand, the fact that it is mutating faster than a fruit fly trapped in an X-ray machine makes it difficult to write JavaScript programs that work on all platforms, but maybe some day it will stabilize.
It is important to understand that while PHP and JavaScript look similar in that they both embed code in HTML files, they are processed totally differently. With PHP, after a user has clicked on the submit button, the browser collects the information into a long string and sends it off to the server as a request for a PHP page. The server loads the PHP file and executes the PHP script that is embedded in it to produce a new HTML page. That page is sent back to the browser for display. The browser cannot even be sure that it was produced by a program. This processing is shown as steps 1 to 4 in Fig. 7-24(a).
Figure 7-24. (a) Server-side scripting with PHP. (b) Client-side scripting with JavaScript.
With JavaScript, when the submit button is clicked the browser interprets a JavaScript function contained on the page. All the work is done locally, inside the browser. There is no contact with the server. This processing is shown as steps 1 and 2 in Fig. 7-24(b). As a consequence, the result is displayed virtually instantaneously, whereas with PHP there can be a delay of several seconds before the resulting HTML arrives at the client.
This difference does not mean that JavaScript is better than PHP. Their uses are completely different. PHP is used when interaction with a database on the server is needed. JavaScript (and other client-side languages) is used when the interaction is with the user at the client computer. It is certainly possible to combine them, as we will see shortly.
7.3.4 HTTP and HTTPS
Now that we have an understanding of Web content and applications, it is time to look at the protocol that is used to transport all this information between Web servers and clients. It is HTTP (HyperText Transfer Protocol), as specified in RFC 2616. Before we get into too many details, it is worth noting some distinctions between HTTP and its secure counterpart, HTTPS (Secure HyperText Transfer Protocol). Both protocols essentially retrieve objects in the same way, and the HTTP standard to retrieve Web objects is evolving essentially independently from its secure counterpart, which effectively uses the HTTP protocol over a secure transport protocol called TLS (Transport Layer Security). In this chapter, we will focus on the protocol details of HTTP and how it has evolved from early
versions to the more modern versions of this protocol in what is now known as HTTP/3. Chapter 8 discusses TLS in more detail, which effectively is the transport protocol that transports HTTP, constituting what we think of as HTTPS. For the remainder of this section, we will talk about HTTP; you can think of HTTPS as simply HTTP that is transported over TLS.
Overview
HTTP is a simple request-response protocol; conventional versions of HTTP typically run over TCP, although the most modern version of HTTP, HTTP/3, now commonly runs over UDP as well. It specifies what messages clients may send to servers and what responses they get back in return. The request and response headers are given in ASCII, just like in SMTP. The contents are given in a MIME-like format, also like in SMTP. This simple model was partly responsible for the early success of the Web because it made development and deployment straightforward.
In this section, we will look at the more important properties of HTTP as it is used today. Before getting into the details we will note that the way it is used in the Internet is evolving. HTTP is an application layer protocol because it runs on top of TCP and is closely associated with the Web. That is why we are covering it in this chapter. In another sense, HTTP is becoming more like a transport protocol that provides a way for processes to communicate content across the boundaries of different networks. These processes do not have to be a Web browser and Web server. A media player could use HTTP to talk to a server and request album information. Antivirus software could use HTTP to download the latest updates. Developers could use HTTP to fetch project files. Consumer electronics products like digital photo frames often use an embedded HTTP server as an interface to the outside world. Machine-to-machine communication increasingly runs over HTTP. For example, an airline server might contact a car rental server and make a car reservation, all as part of a vacation package the airline was offering.
Methods
Although HTTP was designed for use in the Web, it was intentionally made more general than necessary with an eye to future object-oriented uses. For this reason, operations, called methods, other than just requesting a Web page are supported.
Each request consists of one or more lines of ASCII text, with the first word on the first line being the name of the method requested. The built-in methods are listed in Fig. 7-25. The names are case sensitive, so GET is allowed but not get.
The GET method requests the server to send the page. (When we say ‘‘page’’ we mean ‘‘object’’ in the most general case, but thinking of a page as the contents of a file is sufficient to understand the concepts.) The page is suitably encoded in
Method Description
GET Read a Web page
HEAD Read a Web page’s header
POST Append to a Web page
PUT Store a Web page
DELETE Remove the Web page
TRACE Echo the incoming request
CONNECT Connect through a proxy
OPTIONS Query options for a page
Figure 7-25. The built-in HTTP request methods.
MIME. The vast majority of requests to Web servers are GETs and the syntax is simple. The usual form of GET is
GET filename HTTP/1.1
where filename names the page to be fetched and 1.1 is the protocol version. The HEAD method just asks for the message header, without the actual page. This method can be used to collect information for indexing purposes, or just to test a URL for validity.
The POST method is used when forms are submitted. Like GET, it bears a URL, but instead of simply retrieving a page it uploads data to the server (i.e., the contents of the form or parameters). The server then does something with the data that depends on the URL, conceptually appending the data to the object. The effect
might be to purchase an item, for example, or to call a procedure. Finally, the method returns a page indicating the result.
The remaining methods are not used much for browsing the Web. The PUT method is the reverse of GET: instead of reading the page, it writes the page. This method makes it possible to build a collection of Web pages on a remote server. The body of the request contains the page. It may be encoded using MIME, in which case the lines following the PUT might include authentication headers, to prove that the caller indeed has permission to perform the requested operation.
DELETE does what you might expect: it removes the page, or at least it indicates that the Web server has agreed to remove the page. As with PUT, authentication and permission play a major role here.
The TRACE method is for debugging. It instructs the server to send back the request. This method is useful when requests are not being processed correctly and the client wants to know what request the server actually got.
The CONNECT method lets a user make a connection to a Web server through an intermediate device, such as a Web cache.
The OPTIONS method provides a way for the client to query the server for a page and obtain the methods and headers that can be used with that page.
SEC. 7.3 THE WORLD WIDE WEB 667
Every request gets a response consisting of a status line, and possibly additional information (e.g., all or part of a Web page). The status line contains a three-digit status code telling whether the request was satisfied and, if not, why not. The first digit is used to divide the responses into five major groups, as shown in Fig. 7-26.
Code  Meaning       Examples
1xx   Information   100 = server agrees to handle client’s request
2xx   Success       200 = request succeeded; 204 = no content present
3xx   Redirection   301 = page moved; 304 = cached page still valid
4xx   Client error  403 = forbidden page; 404 = page not found
5xx   Server error  500 = internal server error; 503 = try again later
Figure 7-26. The status code response groups.
The 1xx codes are rarely used in practice. The 2xx codes mean that the request was handled successfully and the content (if any) is being returned. The 3xx codes tell the client to look elsewhere, either using a different URL or in its own cache (discussed later). The 4xx codes mean the request failed due to a client error such as an invalid request or a nonexistent page. Finally, the 5xx errors mean the server itself has an internal problem, either due to an error in its code or to a temporary overload.
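Because the group is determined entirely by the first digit, classifying a status code is a one-line computation. The small sketch below is illustrative (the function name is ours):

```python
def status_group(code):
    """Map a three-digit HTTP status code to its group name.

    The group is simply the first digit of the code (code // 100).
    """
    groups = {1: "Information", 2: "Success", 3: "Redirection",
              4: "Client error", 5: "Server error"}
    return groups[code // 100]

print(status_group(404))  # a 4xx code is a client error
```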
Message Headers
The request line (e.g., the line with the GET method) may be followed by additional lines with more information. They are called request headers. This information can be compared to the parameters of a procedure call. Responses may also have response headers. Some headers can be used in either direction. A selection of the more important ones is given in Fig. 7-27. This list is not short, so as you might imagine there are often several headers on each request and response.
The User-Agent header allows the client to inform the server about its browser implementation (e.g., Mozilla/5.0 and Chrome/74.0.3729.169). This information is useful to let servers tailor their responses to the browser, since different browsers can have widely varying capabilities and behaviors.
The four Accept headers tell the server what the client is willing to accept, in case the client can handle only a limited repertoire of responses. The first header specifies the MIME types that are welcome (e.g., text/html). The second gives the character set (e.g., ISO-8859-5 or Unicode-1-1). The third deals with compression methods (e.g., gzip). The fourth indicates a natural language (e.g., Spanish). If the server has a choice of pages, it can use this information to supply the one the client is looking for. If it is unable to satisfy the request, an error code is returned and the request fails.
Header             Type      Contents
User-Agent         Request   Information about the browser and its platform
Accept             Request   The type of pages the client can handle
Accept-Charset     Request   The character sets that are acceptable to the client
Accept-Encoding    Request   The page encodings the client can handle
Accept-Language    Request   The natural languages the client can handle
If-Modified-Since  Request   Time and date to check freshness
If-None-Match      Request   Previously sent tags to check freshness
Host               Request   The server’s DNS name
Authorization      Request   A list of the client’s credentials
Referer            Request   The previous URL from which the request came
Cookie             Request   Previously set cookie sent back to the server
Set-Cookie         Response  Cookie for the client to store
Server             Response  Information about the server
Content-Encoding   Response  How the content is encoded (e.g., gzip)
Content-Language   Response  The natural language used in the page
Content-Length     Response  The page’s length in bytes
Content-Type       Response  The page’s MIME type
Content-Range      Response  Identifies a portion of the page’s content
Last-Modified      Response  Time and date the page was last changed
Expires            Response  Time and date when the page stops being valid
Location           Response  Tells the client where to send its request
Accept-Ranges      Response  Indicates the server will accept byte range requests
Date               Both      Date and time the message was sent
Range              Both      Identifies a portion of a page
Cache-Control      Both      Directives for how to treat caches
ETag               Both      Tag for the contents of the page
Upgrade            Both      The protocol the sender wants to switch to
Figure 7-27. Some HTTP message headers.
The If-Modified-Since and If-None-Match headers are used with caching. They let the client ask for a page to be sent only if the cached copy is no longer valid. We will describe caching shortly.
The Host header names the server. It is taken from the URL. This header is mandatory. It is used because some IP addresses may serve multiple DNS names and the server needs some way to tell which host to hand the request to.
The Authorization header is needed for pages that are protected. With it, the client can prove that it has a right to see the page requested.
The client uses the (misspelled) Referer [sic] header to give the URL that referred to the URL that is now requested. Most often this is the URL of the previous page. This header is particularly useful for tracking Web browsing, as it tells servers how a client arrived at the page.
Cookies are small files that servers place on client computers to remember information for later. A typical example is an e-commerce Web site that uses a client-side cookie to keep track of what the client has ordered so far. Every time the client adds an item to her shopping cart, the cookie is updated to reflect the new item ordered. Although cookies are dealt with in RFC 2109 rather than RFC 2616, they also have headers. The Set-Cookie header is how servers send cookies to clients. The client is expected to save the cookie and return it on subsequent requests to the server by using the Cookie header. (Note that there is a more recent specification for cookies with newer headers, RFC 2965, but this has largely been rejected by industry and is not widely implemented.)
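Python’s standard http.cookies module can illustrate the round trip. The sketch below plays both roles in one program; the cookie name and value are invented for the example:

```python
from http.cookies import SimpleCookie

# Server side: build the Set-Cookie header line for a response.
server_jar = SimpleCookie()
server_jar["cart"] = "item42"
server_jar["cart"]["path"] = "/"
set_cookie_header = server_jar.output()  # full "Set-Cookie: ..." line

# Client side: store what the server sent, then echo it back on the
# next request using the Cookie header.
client_jar = SimpleCookie()
client_jar.load(set_cookie_header.split(": ", 1)[1])
cookie_header = "Cookie: " + "; ".join(
    "%s=%s" % (name, morsel.value) for name, morsel in client_jar.items())
print(cookie_header)
```

Attributes such as Path travel only from server to client; the client returns just the name=value pair.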
Many other headers are used in responses. The Server header allows the server to identify its software build if it wishes. The next five headers, all starting with Content-, allow the server to describe properties of the page it is sending.
The Last-Modified header tells when the page was last modified, and the Expires header tells for how long the page will remain valid. Both of these headers play an important role in page caching.
The Location header is used by the server to inform the client that it should try a different URL. This can be used if the page has moved or to allow multiple URLs to refer to the same page (possibly on different servers). It is also used for companies that have a main Web page in the com domain but redirect clients to a national or regional page based on their IP addresses or preferred language.
If a page is large, a small client may not want it all at once. Some servers will accept requests for byte ranges, so the page can be fetched in multiple small units. The Accept-Ranges header announces the server’s willingness to handle this.
Now we come to headers that can be used either way. The Date header can be used in both directions and contains the time and date the message was sent, while the Range header tells the byte range of the page that is provided by the response.
The ETag header gives a short tag that serves as a name for the content of the page. It is used for caching. The Cache-Control header gives other explicit instructions about how to cache (or, more usually, how not to cache) pages.
Finally, the Upgrade header is used for switching to a new communication protocol, such as a future HTTP protocol or a secure transport. It allows the client to announce what it can support and the server to assert what it is using.
Caching
People often return to Web pages that they have viewed before, and related Web pages often have the same embedded resources. Some examples are the images that are used for navigation across the site, as well as common style sheets
and scripts. It would be very wasteful to fetch all of these resources for these pages each time they are displayed because the browser already has a copy. Squirreling away pages that are fetched for subsequent use is called caching. The advantage is that when a cached page can be reused, it is not necessary to repeat the transfer. HTTP has built-in support to help clients identify when they can safely reuse pages. This support improves performance by reducing both network traffic and latency. The trade-off is that the browser must now store pages, but this is nearly always a worthwhile trade-off because local storage is inexpensive. The pages are usually kept on disk so that they can be used when the browser is run at a later date.
The difficult issue with HTTP caching is how to determine that a previously cached copy of a page is the same as the page would be if it were fetched again. This determination cannot be made solely from the URL. For example, the URL may give a page that displays the latest news item. The contents of this page will be updated frequently even though the URL stays the same. Alternatively, the contents of the page may be a list of the gods from Greek and Roman mythology. This page should change somewhat less rapidly.
HTTP uses two strategies to tackle this problem. They are shown in Fig. 7-28 as forms of processing between the request (step 1) and the response (step 5). The first strategy is page validation (step 2). The cache is consulted, and if it has a copy of a page for the requested URL that is known to be fresh (i.e., still valid), there is no need to fetch it anew from the server. Instead, the cached page can be returned directly. The Expires header returned when the cached page was originally fetched and the current date and time can be used to make this determination.
[Figure 7-28 shows the steps between the browser’s program, its cache, and the Web server: 1: Request; 2: Check expiry; 3: Conditional GET; 4a: Not modified or 4b: Response; 5: Response.]
Figure 7-28. HTTP caching.
However, not all pages come with a convenient Expires header that tells when the page must be fetched again. After all, making predictions is hard—especially about the future. In this case, the browser may use heuristics. For example, if the page has not been modified in the past year (as told by the Last-Modified header), it is a fairly safe bet that it will not change in the next hour. There is no guarantee, however, and this may be a bad bet. For example, the stock market might have closed for the day so that the page will not change for hours, but it will change rapidly once the next trading session starts. Thus, the cacheability of a page may
vary wildly over time. For this reason, heuristics should be used with care, though they often work well in practice.
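One common rule of thumb (suggested in the HTTP caching specification, RFC 7234) is to treat a page as fresh for some fraction, often 10%, of the interval between its Last-Modified time and the time it was fetched. The function below is an illustrative sketch, not code from any actual browser:

```python
def heuristic_freshness(fetched_at, last_modified, fraction=0.1):
    """Seconds to treat a cached page as fresh when no Expires header
    was given: a fraction of how long the page had already gone
    unmodified at the moment it was fetched. Both arguments are
    timestamps in seconds."""
    return max(0.0, (fetched_at - last_modified) * fraction)

# A page unmodified for 10 days (864,000 s) is presumed fresh for 1 day.
print(heuristic_freshness(864000.0, 0.0))
```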
Finding pages that have not expired is the most beneficial use of caching because it means that the server does not need to be contacted at all. Unfortunately, it does not always work. Servers must use the Expires header conservatively, since they may be unsure when a page will be updated. Thus, the cached copies may still be fresh, but the client does not know.
The second strategy is used in this case. It is to ask the server if the cached copy is still valid. This request is a conditional GET, and it is shown in Fig. 7-28 as step 3. If the server knows that the cached copy is still valid, it can send a short reply to say so (step 4a). Otherwise, it must send the full response (step 4b).
More header fields are used to let the server check whether a cached copy is still valid. The client has the time a cached page was most recently updated from the Last-Modified header. It can send this time to the server using the If-Modified-Since header to ask for the page if and only if it has been changed in the meantime. There is much more to say about caching because it has such a big effect on performance, but this is not the place to say it. Not surprisingly, there are many tutorials on the Web that you can find easily by searching for ‘‘Web caching.’’
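The server side of this check can be sketched in a few lines of Python; the function name and the header-dictionary shape are illustrative assumptions:

```python
from email.utils import parsedate_to_datetime

def revalidate(request_headers, page_etag, page_last_modified):
    """Server-side freshness check for a conditional GET: return 304
    (Not Modified) if the client's cached copy is still valid, else
    200, meaning a full response must be sent."""
    # An ETag match is the most precise test.
    if request_headers.get("If-None-Match") == page_etag:
        return 304
    # Otherwise, compare the page's modification time with the one
    # the client remembered from the Last-Modified header.
    ims = request_headers.get("If-Modified-Since")
    if ims and parsedate_to_datetime(ims) >= page_last_modified:
        return 304
    return 200
```

The short 304 reply carries no body, which is exactly what makes the conditional GET cheap.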
HTTP/1 and HTTP/1.1
The usual way for a browser to contact a server is to establish a TCP connection to port 443 for HTTPS (or port 80 for HTTP) on the server’s machine, although this procedure is not formally required. The value of using TCP is that neither browsers nor servers have to worry about how to handle long messages, reliability, or congestion control. All of these matters are handled by the TCP implementation.
Early in the Web, with HTTP/1.0, after the connection was established a single request was sent over and a single response was sent back. Then the TCP connection was released. In a world in which the typical Web page consisted entirely of HTML text, this method was adequate. Quickly, the average Web page grew to contain large numbers of embedded links for content such as icons and other eye candy. Establishing a separate TCP connection to transport each single icon became a very expensive way to operate.
This observation led to HTTP/1.1, which supports persistent connections. With them, it is possible to establish a TCP connection, send a request and get a response, and then send additional requests and get additional responses. This strategy is also called connection reuse. By amortizing the TCP setup, startup, and release costs over multiple requests, the relative overhead due to TCP is reduced per request. It is also possible to pipeline requests, that is, send request 2 before the response to request 1 has arrived.
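A back-of-the-envelope latency model shows why connection reuse helps. The model below is deliberately crude, ignoring slow start and transfer time, and the function is ours:

```python
RTT = 1.0  # one network round-trip time, in arbitrary units

def fetch_time(n_requests, persistent):
    """Crude latency model: TCP setup costs one RTT, and each
    request/response exchange costs one more."""
    if persistent:
        return RTT + n_requests * RTT   # one setup amortized over all requests
    return n_requests * (RTT + RTT)     # a fresh setup for every request

print(fetch_time(3, persistent=False))  # separate connections
print(fetch_time(3, persistent=True))   # one persistent connection
```

Even in this simple model, three requests cost six RTTs without reuse but only four with it; pipelining would shrink the gap further by overlapping the request round trips.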
The performance difference between these three cases is shown in Fig. 7-29. Part (a) shows three requests, one after the other and each in a separate connection.
Let us suppose that this represents a Web page with two embedded images on the same server. The URLs of the images are determined as the main page is fetched, so they are fetched after the main page. Nowadays, a typical page has around 40 other objects that must be fetched to present it, but that would make our figure far too big so we will use only two embedded objects.
[Figure 7-29 shows three message timelines, with time running downward: (a) three connection setups, each followed by one request and response; (b) a single connection setup followed by sequential requests and responses; (c) a single connection setup followed by pipelined requests.]
Figure 7-29. HTTP with (a) multiple connections and sequential requests. (b) A persistent connection and sequential requests. (c) A persistent connection and pipelined requests.
In Fig. 7-29(b), the page is fetched with a persistent connection. That is, the TCP connection is opened at the beginning, then the same three requests are sent, one after the other as before, and only then is the connection closed. Observe that the fetch completes more quickly. There are two reasons for the speedup. First, time is not wasted setting up additional connections. Each TCP connection requires at least one round-trip time to establish. Second, the transfer of the same images proceeds more quickly. Why is this? It is because of TCP congestion control. At the start of a connection, TCP uses the slow-start procedure to increase the throughput until it learns the behavior of the network path. The consequence of this warmup period is that multiple short TCP connections take disproportionately longer to transfer information than one longer TCP connection.
Finally, in Fig. 7-29(c), there is one persistent connection and the requests are pipelined. Specifically, the second and third requests are sent in rapid succession as soon as enough of the main page has been retrieved to identify that the images must be fetched. The responses for these requests follow eventually. This method cuts down the time that the server is idle, so it further improves performance.
Persistent connections do not come for free, however. A new issue that they raise is when to close the connection. A connection to a server should stay open while the page loads. What then? There is a good chance that the user will click on a link that requests another page from the server. If the connection remains open,
the next request can be sent immediately. However, there is no guarantee that the client will make another request of the server any time soon. In practice, clients and servers usually keep persistent connections open until they have been idle for a
short time (e.g., 60 seconds) or they have a large number of open connections and need to close some.
The observant reader may have noticed that there is one combination that we have left out so far. It is also possible to send one request per TCP connection, but run multiple TCP connections in parallel. This parallel connection method was widely used by browsers before persistent connections. It has the same disadvantage as sequential connections (extra overhead) but much better performance. This is because setting up and ramping up the connections in parallel hides some of the latency. In our example, connections for both of the embedded images could be set up at the same time. However, running many TCP connections to the same server is discouraged. The reason is that TCP performs congestion control for each connection independently. As a consequence, the connections compete against each other, causing added packet loss, and in aggregate are more aggressive users of the network than an individual connection. Persistent connections are superior and used in preference to parallel connections because they avoid overhead and do not suffer from congestion problems.
HTTP/2
HTTP/1.0 was around from the start of the Web and HTTP/1.1 was written in 1997. By 2012 it was getting a bit long in the tooth, so the IETF set up a working group to create what later became HTTP/2. The starting point was a protocol Google had devised earlier, called SPDY. The final product was published as RFC 7540 in May 2015.
The working group had several goals it tried to achieve, including:
1. Allow clients and servers to choose which HTTP version to use.
2. Maintain compatibility with HTTP/1.1 as much as possible.
3. Improve performance with multiplexing, pipelining, compression, etc.
4. Support existing practices used in browsers, servers, proxies, delivery networks, and more.
A key idea was to maintain backward compatibility. Existing applications had to work with HTTP/2, but new ones could take advantage of the new features to improve performance. For this reason, the headers, URLs, and general semantics
did not change much. What changed was the way everything is encoded and the way the clients and servers interact. In HTTP/1.1, a client opens a TCP connection to a server, sends over a request as text, waits for a response, and in many cases then closes the connection. This is repeated as often as needed to fetch an entire Web page. In HTTP/2, a TCP connection is set up and many requests can be sent over it, in binary, possibly prioritized, and the server can respond to them in any order it wants to. Only after all requests have been answered is the TCP connection torn down.
Through a mechanism called server push, HTTP/2 allows the server to push out files that it knows will be needed but which the client may not know about initially. For example, if a client requests a Web page and the server sees that it uses a style sheet and a JavaScript file, the server can send over the style sheet and the JavaScript before they are even requested. This eliminates some delays. An example of getting the same information (a Web page, its style sheet, and two images) in HTTP/1.1 and HTTP/2 is shown in Fig. 7-30.
[Figure 7-30 shows two message timelines between the user and the server: requests and responses for an HTML page, a style sheet, and two images, exchanged strictly in order in (a) and multiplexed in (b).]
Figure 7-30. (a) Getting a Web page in HTTP/1.1. (b) Getting the same page in HTTP/2.
Note that Fig. 7-30(a) is the best case for HTTP/1.1, where multiple requests can be sent consecutively over the same TCP connection, but the rules are that they must be processed in order and the results sent back in order. In HTTP/2 [Fig. 7-30(b)], the responses can come back in any order. If it turns out, for example, that image 1 is very large, the server could send back image 2 first so the browser
can start displaying the page with image 2 even before image 1 is available. That is not allowed in HTTP/1.1. Also note that in Fig. 7-30(b) the server sent the style sheet without the browser asking for it.
In addition to the pipelining and multiplexing of requests over the same TCP connection, HTTP/2 compresses the headers and sends them in binary to reduce bandwidth usage and latency. An HTTP/2 session consists of a series of frames, each with a separate identifier. Responses may come back in a different order than the requests, as in Fig. 7-30(b), but since each response carries the identifier of the request, the browser can determine which request each response corresponds to.
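A toy model in Python makes this bookkeeping concrete; the stream ids, paths, and payloads are invented for the example:

```python
# Toy model of HTTP/2 multiplexing: each request is assigned a stream
# id, responses may arrive in any order, and the id pairs them back up.
requests = {1: "/page.html", 3: "/style.css", 5: "/image1.gif"}

# Responses arrive out of order, each tagged with its stream id.
arrivals = [(5, b"GIF89a..."), (1, b"<html>..."), (3, b"body { ... }")]

# The browser reassembles the mapping from URL to body.
received = {requests[stream_id]: body for stream_id, body in arrivals}
print(sorted(received))
```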
Encryption was a sore point during the development of HTTP/2. Some people wanted it badly, and others opposed it equally badly. The opposition was mostly related to Internet-of-Things applications, in which the ‘‘thing’’ does not have a lot of computing power. In the end, encryption was not required by the standard, but all browsers require encryption, so de facto it is there anyway, at least for Web browsing.
HTTP/3
HTTP/3 or simply H3 is the third major revision of HTTP, designed as a successor to HTTP/2. The major distinction for HTTP/3 is the transport protocol that it uses to carry the HTTP messages: rather than relying on TCP, it relies on an augmented version of UDP called QUIC, which uses user-space congestion control running on top of UDP. HTTP/3 started out simply as HTTP-over-QUIC and has become the latest proposed major revision to the protocol. Many open-source libraries that support client and server logic for QUIC and HTTP/3 are available, in languages that include C, C++, Python, Rust, and Go. Popular Web servers including nginx also now support HTTP/3 through patches.
The QUIC transport protocol supports stream multiplexing and per-stream flow control, similar to that offered in HTTP/2. Stream-level reliability and connection-wide congestion control can dramatically improve the performance of HTTP, since congestion information can be shared across sessions, and reliability can be amortized across multiple connections fetching objects in parallel. Once a connection exists to a server endpoint, HTTP/3 allows the client to reuse that same connection with multiple different URLs.
HTTP/3, running HTTP over QUIC, promises many possible performance enhancements over HTTP/2, primarily because of the benefits that QUIC offers for HTTP vs. TCP. In some ways, QUIC could be viewed as the next generation of TCP. It offers connection setup with no additional round trips between client and server; in the case when a previous connection has been established between client and server, a zero-round-trip connection re-establishment is possible, provided that a secret from the previous connection was established and cached. QUIC guarantees reliable, in-order delivery of bytes within a single stream, but it does not
provide any guarantees with respect to bytes on other QUIC streams. QUIC does permit out-of-order delivery within a stream, but HTTP/3 does not make use of this feature. HTTP/3 over QUIC will be performed exclusively using HTTPS; requests to (the increasingly deprecated) HTTP URLs will not be upgraded to use HTTP/3.
For more details on HTTP/3, see https://http3.net.
7.3.5 Web Privacy
One of the most significant issues in recent years has been the privacy concerns associated with Web browsing. Web sites, Web applications, and other third parties often use mechanisms in HTTP to track user behavior, both within the context of a single Web site or application and across the Internet. Additionally, attackers may exploit various information side channels in the browser or device to track users. This section describes some of the mechanisms that are used to track users and fingerprint individual users and devices.
Cookies
One conventional way to implement tracking is by placing a cookie (effectively a small amount of data) on client devices, which the clients may then send back upon subsequent visits to various Web sites. When a user requests a Web object (e.g., a Web page), a Web server may place a piece of persistent state, called a cookie, on the user’s device, using the ‘‘set-cookie’’ directive in HTTP. The data passed to the client’s device using this directive is subsequently stored locally on the device. When the device visits that Web domain in the future, the HTTP request passes the cookie along, in addition to the request itself.
‘‘First-party’’ HTTP cookies (i.e., those set by the domain of the Web site that the user intends to visit, such as a shopping or news Web site) are useful for improving user experience on many Web sites. For example, cookies are often used to preserve state across a Web ‘‘session.’’ They allow a Web site to track useful information about a user’s ongoing behavior on a Web site, such as whether they recently logged into the Web site, or what items they have placed in a shopping cart.
Cookies set by one domain are generally only visible to the same domain that set the cookie in the first place. For example, one advertising network may set a cookie on a user device, but no other third party can see the cookie that was set. This Web security policy, called the same-origin policy, prevents one party from reading a cookie that was set by another party and in some sense can limit how information about an individual user is shared.
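A much-simplified version of the domain-match rule behind this policy can be sketched as follows; real cookie scoping has additional rules (e.g., public-suffix checks), and the function name and domains are our own illustration:

```python
def cookie_sent_to(cookie_domain, request_host):
    """Simplified cookie domain match: a cookie set for
    tracker.example is returned to that domain and its subdomains,
    but never to an unrelated domain."""
    return (request_host == cookie_domain or
            request_host.endswith("." + cookie_domain))

print(cookie_sent_to("tracker.example", "ads.tracker.example"))
print(cookie_sent_to("tracker.example", "shop.example"))
```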
Although first-party cookies are often used to improve the user experience, third parties, such as advertisers and tracking companies, can also set cookies on client devices, which can allow those third parties to track the sites that users visit
as they navigate different Web sites across the entire Internet. This tracking takes place as follows:
1. When a user visits a Web site, in addition to the content that the user requests directly, the device may load content from third-party sites, including from the domains of advertising networks. Loading an advertisement or script from a third party allows that party to set a unique cookie on the user’s device.
2. That user may subsequently visit different sites on the Internet that load Web objects from the same third party that set tracking information on a different site.
A common example of this practice might be two different Web sites that use the same advertising network to serve ads. In this case, the advertising network would see: (1) the user’s device return the cookie that it set on a different Web site; (2) the HTTP referer request header that accompanies the request to load the object from the advertiser, indicating the original site that the user’s device was visiting. This practice is commonly referred to as cross-site tracking.
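A toy request log makes this concrete; the domains and cookie value are invented. Each record is a request the ad network receives from one browser, carrying back the cookie it set along with the Referer of the page being visited:

```python
# Requests one browser makes to a single ad network's domain while
# visiting two unrelated first-party sites.
ad_requests = [
    {"cookie": "uid=42", "referer": "https://news.example/story"},
    {"cookie": "uid=42", "referer": "https://shop.example/cart"},
]

# The ad network groups the Referer values by cookie, yielding a
# cross-site browsing profile for user uid=42.
browsing_profile = sorted({r["referer"] for r in ad_requests
                           if r["cookie"] == "uid=42"})
print(browsing_profile)
```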
Super cookies, and other locally stored tracking identifiers that a user cannot control as they would regular cookies, can allow an intermediary to track a user across Web sites over time. Unique identifiers can include third-party tracking identifiers encoded in HTTP (specifically, HSTS (HTTP Strict Transport Security) headers), which are not cleared when a user clears their cookies, and tags that an intermediate third party such as a mobile ISP can insert into unencrypted Web traffic that traverses a network segment. This enables third parties, such as advertisers, to build up a profile of a user’s browsing across a set of Web sites, similar to the Web tracking cookies used by ad networks and application providers.
Third-Party Trackers
Web cookies that originate from a third-party domain and are used across many sites can allow an advertising network or other third parties to track a user’s browsing habits on any site where that tracking software is deployed (i.e., any site that carries their advertisements, sharing buttons, or other embedded code). Advertising networks and other third parties typically track a user’s browsing patterns across the range of Web sites that the user browses, often using browser-based tracking software. In some cases, a third party may develop its own tracking software (e.g., Web analytics software). In other cases, they may use a different third-party service to collect and aggregate this behavior across sites.
Web sites may permit advertising networks and other third-party trackers to operate on their site, enabling them to collect analytics data, advertise on other Web sites (called re-targeting), or monetize the Web site’s available advertising space via placement of carefully targeted ads. The advertisers collect data about
users by using various tracking mechanisms, such as HTTP cookies, HTML5 objects, JavaScript, device fingerprinting, browser fingerprinting, and other common Web technologies. When a user visits multiple Web sites that leverage the same advertising network, that advertising network recognizes the user’s device, enabling it to track user Web behavior over time.
Using such tracking software, a third party or advertising network can discover a user’s interactions, social network and contacts, likes, interests, purchases, and so on. This information can enable precise tracking of whether an advertisement resulted in a purchase, mapping of relationships between people, creation of detailed user tracking profiles, conduct of highly targeted advertising, and significantly more, due to the breadth and scope of tracking.
Even in cases where someone is not a registered user of a particular service (e.g., social media site, search engine), has ceased using that service, or has logged out of that service, they often are still being uniquely tracked using third-party (and first-party) trackers. Third-party trackers are increasingly becoming concentrated with a few large providers.
In addition to third-party tracking with cookies, the same advertisers and third-party trackers can track user browsing behavior with techniques such as canvas fingerprinting (a type of browser fingerprinting), session replay (whereby a third party can see a playback of every user interaction with a particular Web page), and even exploitation of a browser or password manager’s ‘‘auto-fill’’ feature to send back data from Web forms, often before a user even fills out the form. These more sophisticated technologies can provide detailed information about user behavior and data, including fine-grained details such as the user’s scrolls and mouse clicks and even in some instances the user’s username and password for a given Web site (which can be either intentional on the part of the user or unintentional on the part of the Web site).
A recent study suggests that specific instances of third-party tracking software are pervasive. The same study also discovered that news sites have the largest number of tracking parties on any given first-party site; other popular categories for tracking include arts, sports, and shopping Web sites. Cross-device tracking refers to the practice of linking the activities of a single user across multiple devices (e.g., smartphones, tablets, desktop machines, other ‘‘smart devices’’); the practice aims to track a user’s behavior even as they use different devices.
Certain aspects of cross-device tracking may improve user experience. For example, as with cookies on a single device or browser, cross-device tracking can allow a user to maintain a seamless experience when moving from one device to the next (e.g., continuing to read a book or watch a movie from the place where the user left off). Cross-device tracking can also be useful for preventing fraud; for example, a service provider may notice that a user has logged in from an unfamiliar device in a completely new location. When a user attempts a login from an unrecognized device, a service provider can take additional steps to authenticate the user (e.g., two-factor authentication).
SEC. 7.3 THE WORLD WIDE WEB 679
Cross-device tracking is most common by first-party services, such as email service providers, content providers (e.g., streaming video services), and commerce sites, but third parties are also becoming increasingly adept at tracking users across devices.
1. Cross-device tracking may be deterministic, based on a persistent identifier such as a login that is tied to a specific user.
2. Cross-device tracking may also be probabilistic; the IP address is one example of a probabilistic identifier that can be used to implement cross-device tracking. For example, technologies such as network address translation can cause multiple devices on a network to have the same public IP address. Suppose that a user visits a Web site from a mobile device (e.g., a smartphone) and uses that device at both home and work. A third party can set IP address information in the device’s cookies. That user may then appear from two public IP addresses, one at work, and one at home, and those two IP addresses may be linked by the same third party cookie; if the user then visits that third party from different devices that share either of those two IP addresses, then those additional devices can be linked to the same user with high confidence.
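The IP-based linking described above can be sketched in a few lines. The device names, IP addresses, and grouping logic below are invented for illustration; real trackers combine many more signals than a shared public IP.

```python
# Toy sketch of probabilistic cross-device linking: a tracker that has
# seen (device, public IP) pairs links devices that appear behind the
# same public IP. Device names and IPs here are invented for illustration.
from collections import defaultdict

def link_devices(observations):
    """observations: list of (device_id, public_ip) sightings.
    Returns sorted groups of device_ids sharing at least one IP."""
    by_ip = defaultdict(set)
    for device, ip in observations:
        by_ip[ip].add(device)

    parent = {}
    def find(x):                            # union-find with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for devices in by_ip.values():
        ds = sorted(devices)
        for d in ds:
            find(d)                         # register every device
        for d in ds[1:]:
            parent[find(ds[0])] = find(d)   # merge with the first device

    groups = defaultdict(set)
    for device in parent:
        groups[find(device)].add(device)
    return sorted(sorted(g) for g in groups.values())

sightings = [
    ("phone", "203.0.113.7"),     # home IP
    ("laptop", "203.0.113.7"),    # same home IP -> linked to phone
    ("laptop", "198.51.100.2"),   # work IP
    ("desktop", "198.51.100.2"),  # same work IP -> linked transitively
    ("tablet", "192.0.2.9"),      # never shares an IP -> stays separate
]
print(link_devices(sightings))    # -> [['desktop', 'laptop', 'phone'], ['tablet']]
```

Note how the laptop acts as the bridge: it shares the home IP with the phone and the work IP with the desktop, so all three devices are linked to one user even though the phone and desktop never share an address.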
Cross-device tracking often uses a combination of deterministic and probabilistic techniques; many of these techniques do not require the user to be logged into any site to enable this type of tracking. For example, some parties offer ‘‘analytics’’ services that, when embedded across many first-party Web sites, allow the third party to track a user across Web sites and devices. Third parties often work together to track users across devices and services using a practice called cookie syncing, described in more detail later in this section.
Cross-device tracking enables more sophisticated inference of higher-level user activities, since data from different devices can be combined to build a more comprehensive picture of an individual user’s activity. For example, data about a user’s location (as collected from a mobile device) can be combined with the user’s search history and social network activity (such as ‘‘likes’’) to determine, for example, whether a user has physically visited a store following an online search or online advertising exposure.
Device and Browser Fingerprinting
Even when users disable common tracking mechanisms such as third-party cookies, Web sites and third parties can still track users based on environmental, contextual, and device information that the device returns to the server. Based on a collection of this information, a third party may be able to uniquely identify, or ‘‘fingerprint,’’ a user across different sites and over time.
One well-known fingerprinting method is a technique called canvas fingerprinting, whereby the HTML canvas is used to identify a device. The HTML canvas allows a Web application to draw graphics in real time. Differences in font rendering, smoothing, dimensions, and some other features may cause each device to draw an image differently, and the resulting pixels can serve as a device fingerprint. The technique was first discovered in 2012, but not brought to public attention until 2014. Although there was a backlash at that time, many trackers continue to use canvas fingerprinting and related techniques such as canvas font fingerprinting, which identifies a device based on the browser’s font list; a recent study found that these techniques are still present on thousands of sites. Web sites can also use browser APIs to retrieve other information for tracking devices, including information such as the battery status, which can be used to track a user based on battery charge level and discharge time. Other reports describe how knowing the battery status of a device can be used to track a device and therefore associate a device with a user (Olejnik et al., 2015).
Cookie Syncing
When different third-party trackers share information with each other, these parties can track an individual user even as they visit Web sites that have different tracking mechanisms installed. Cookie syncing is difficult to detect and also facilitates merging of datasets about individual users between disparate third parties, creating significant privacy concerns. A recent study suggests that the practice of cookie syncing is widespread among third-party trackers.
7.4 STREAMING AUDIO AND VIDEO
Email and Web applications are not the only major uses of networks. For many people, audio and video are the holy grail of networking. When the word ‘‘multimedia’’ is mentioned, both the propellerheads and the suits begin salivating as if on cue. The former see immense technical challenges in providing good quality voice over IP and 8K video-on-demand to every computer. The latter see equally immense profits in it.
While the idea of sending audio and video over the Internet has been around since the 1970s at least, it is only since roughly 2000 that real-time audio and real-time video traffic has grown with a vengeance. Real-time traffic is different from Web traffic in that it must be played out at some predetermined rate to be useful. After all, watching a video in slow motion with fits and starts is not most people’s idea of fun. In contrast, the Web can have short interruptions, and page loads can take more or less time, within limits, without it being a major problem.
Two things happened to enable this growth. First, computers have become much more powerful and are equipped with microphones and cameras so that they can input, process, and output audio and video data with ease. Second, a flood of
Internet bandwidth has come to be available. Long-haul links in the core of the Internet run at many gigabits/sec, and broadband and 802.11ac wireless reach users at the edge of the Internet. These developments allow ISPs to carry tremendous levels of traffic across their backbones and mean that ordinary users can connect to the Internet 100–1000 times faster than with a 56-kbps telephone modem.
The flood of bandwidth caused audio and video traffic to grow, but for different reasons. Telephone calls take up relatively little bandwidth (in principle 64 kbps but less when compressed), yet telephone service has traditionally been expensive. Companies saw an opportunity to carry voice traffic over the Internet using existing bandwidth to cut down on their telephone bills. Startups such as Skype saw a way to let customers make free telephone calls using their Internet connections. Upstart telephone companies saw a cheap way to carry traditional voice calls using IP networking equipment. The result was an explosion of voice data carried over the Internet, called Internet telephony, which is discussed in Sec. 7.4.4.
Unlike audio, video takes up a large amount of bandwidth. Reasonable quality Internet video is encoded with compression resulting in a stream of around 8 Mbps for 4K (which is 7 GB for a 2-hour movie). Before broadband Internet access, sending movies over the network was prohibitive. Not so any more. With the spread of broadband, it became possible for the first time for users to watch decent, streamed video at home. People love to do it. Around a quarter of the Internet users on any given day are estimated to visit YouTube, the popular video sharing site. The movie rental business has shifted to online downloads. And the sheer size of videos has changed the overall makeup of Internet traffic. The majority of Internet traffic is already video, and it is estimated that 90% of Internet traffic will be video within a few years.
Given that there is enough bandwidth to carry audio and video, the key issue for designing streaming and conferencing applications is network delay. Audio and video need real-time presentation, meaning that they must be played out at a predetermined rate to be useful. Long delays mean that calls that should be interactive no longer are. This problem is clear if you have ever talked on a satellite phone, where the delay of up to half a second is quite distracting. For playing music and movies over the network, the absolute delay does not matter, because it only affects when the media starts to play. But the variation in delay, called jitter, still matters. It must be masked by the player or the audio will sound unintelligible and the video will look jerky.
As an aside, the term multimedia is often used in the context of the Internet to mean video and audio. Literally, multimedia is just two or more media. That definition makes this book a multimedia presentation, as it contains text and graphics (the figures). However, that is probably not what you had in mind, so we use the term ‘‘multimedia’’ to imply two or more continuous media, that is, media that have to be played during some well-defined time interval. The two media are normally video with audio, that is, moving pictures with sound. (Smell may take a while.) Many people also refer to pure audio, such as Internet telephony or
Internet radio, as multimedia as well, which it is clearly not. Actually, a better term for all these cases is streaming media. Nonetheless, we will follow the herd and consider real-time audio to be multimedia as well.
7.4.1 Digital Audio
An audio (sound) wave is a one-dimensional acoustic (pressure) wave. When an acoustic wave enters the ear, the eardrum vibrates, causing the tiny bones of the middle ear to vibrate along with it, sending nerve pulses to the brain. These pulses are perceived as sound by the listener. In a similar way, when an acoustic wave strikes a microphone, the microphone generates an electrical signal, representing the sound amplitude as a function of time.
The frequency range of the human ear runs from 20 Hz to 20,000 Hz. Some animals, notably dogs, can hear higher frequencies. The ear hears loudness logarithmically, so the ratio of two sounds with power A and B is conventionally expressed in dB (decibels) as the quantity 10 log10(A/B). If we define the lower limit of audibility (a sound pressure of about 20 µPascals) for a 1-kHz sine wave as 0 dB, an ordinary conversation is about 50 dB and the pain threshold is about 120 dB. The dynamic range is a factor of more than 1 million.
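A quick computation makes these figures concrete. The `db` helper below is just an illustrative name for the 10 log10(A/B) formula from the text:

```python
import math

def db(power_a, power_b):
    """Ratio of two sound powers in decibels: 10 log10(A/B)."""
    return 10 * math.log10(power_a / power_b)

# The pain threshold (120 dB) versus the limit of audibility (0 dB) is a
# power ratio of 10^12; the "factor of more than 1 million" refers to the
# corresponding amplitude (sound pressure) ratio, which is the square
# root of the power ratio.
power_ratio = 10 ** (120 / 10)
amplitude_ratio = math.sqrt(power_ratio)
print(power_ratio, amplitude_ratio)

assert abs(db(100, 1) - 20) < 1e-9   # a 100x power ratio is 20 dB
```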
The ear is surprisingly sensitive to sound variations lasting only a few milliseconds. The eye, in contrast, does not notice changes in light level that last only a few milliseconds. The result of this observation is that jitter of only a few milliseconds during the playout of multimedia affects the perceived sound quality much more than it affects the perceived image quality.
Digital audio is a digital representation of an audio wave that can be used to recreate it. Audio waves can be converted to digital form by an ADC (Analog-to-Digital Converter). An ADC takes an electrical voltage as input and generates a binary number as output. In Fig. 7-31(a) we see an example of a sine wave. To represent this signal digitally, we can sample it every ΔT seconds, as shown by the bar heights in Fig. 7-31(b). If a sound wave is not a pure sine wave but a linear superposition of sine waves where the highest frequency component present is f, the Nyquist theorem (see Chap. 2) states that it is sufficient to make samples at a frequency 2f. Sampling more often is of no value since the higher frequencies that such sampling could detect are not present.
The reverse process takes digital values and produces an analog electrical voltage. It is done by a DAC (Digital-to-Analog Converter). A loudspeaker can then convert the analog voltage to acoustic waves so that people can hear sounds.
Audio Compression
Audio is often compressed to reduce bandwidth needs and transfer times, even though audio data rates are much lower than video data rates. All compression systems require two algorithms: one is used for compressing the data at the source,
Figure 7-31. (a) A sine wave. (b) Sampling the sine wave. (c) Quantizing the samples to 4 bits.
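The sampling and quantization steps of Fig. 7-31 can be sketched in a few lines. The 1-kHz tone and 8000-samples/sec rate below are illustrative choices (the classic telephone rate, comfortably above the 2-kHz Nyquist minimum for a 1-kHz tone):

```python
import math

def sample_and_quantize(freq_hz, rate_hz, n_samples, bits=4):
    """Sample a unit-amplitude sine wave every deltaT = 1/rate_hz seconds
    and quantize each sample; 4 bits gives 16 levels spanning [-1, 1]."""
    step = 2.0 / (2 ** bits)          # width of one quantization level
    samples = []
    for n in range(n_samples):
        value = math.sin(2 * math.pi * freq_hz * n / rate_hz)
        samples.append(round(value / step) * step)  # snap to nearest level
    return samples

# A 1-kHz tone sampled at 8000 samples/sec: 8 samples cover one cycle.
samples = sample_and_quantize(1000, 8000, 8)
print(samples)   # 0, 0.75, 1.0, 0.75, 0, -0.75, -1.0, -0.75
```

The quantized values differ slightly from the true sine values (0.7071 becomes 0.75); this quantization error is the irreducible noise that the number of bits per sample controls.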
and another is used for decompressing it at the destination. In the literature, these algorithms are referred to as the encoding and decoding algorithms, respectively. We will use this terminology too.
Compression algorithms exhibit certain asymmetries that are important to understand. Even though we are considering audio first, these asymmetries hold for video as well. The first asymmetry applies to encoding the source material. For many applications, a multimedia document will only be encoded once (when it is stored on the multimedia server) but will be decoded thousands of times (when it is played back by customers). This asymmetry means that it is acceptable for the encoding algorithm to be slow and require expensive hardware provided that the decoding algorithm is fast and does not require expensive hardware.
The second asymmetry is that the encode/decode process need not be invertible. That is, when compressing a data file, transmitting it, and then decompressing it, the user expects to get the original back, accurate down to the last bit. With multimedia, this requirement does not exist. It is usually acceptable to have the audio (or video) signal after encoding and then decoding be slightly different from the original as long as it sounds (or looks) the same. When the decoded output is not exactly equal to the original input, the system is said to be lossy. If the input and output are identical, the system is lossless. Lossy systems are important because accepting a small amount of information loss normally means a huge payoff in terms of the compression ratio possible.
Many audio compression algorithms have been developed. Probably the most popular formats are MP3 (MPEG audio layer 3) and AAC (Advanced Audio Coding) as carried in MP4 (MPEG-4) files. To avoid confusion, note that MPEG provides audio and video compression. MP3 refers to the audio compression portion (part 3) of the MPEG-1 standard, not the third version of MPEG, which has been replaced by MPEG-4. AAC is the successor to MP3 and the default audio encoding used in MPEG-4. MPEG-2 allows both MP3 and AAC audio. Is that clear now? The nice thing about standards is that there are so many to choose from. And if you do not like any of them, just wait a year or two.
Audio compression can be done in two ways. In waveform coding, the signal is transformed mathematically by a Fourier transform into its frequency components. In Chap. 2, we showed an example function of time and its Fourier amplitudes in Fig. 2-12(a). The amplitude of each component is then encoded in a minimal way. The goal is to reproduce the waveform fairly accurately at the other end in as few bits as possible.
The other way, perceptual coding, exploits certain flaws in the human auditory system to encode a signal in such a way that it sounds the same to a human listener, even if it looks quite different on an oscilloscope. Perceptual coding is based on the science of psychoacoustics—how people perceive sound. Both MP3 and AAC are based on perceptual coding.
Perceptual encoding dominates modern multimedia systems, so let us take a look at it. A key property is that some sounds can mask other sounds. For example, imagine that you are broadcasting a live flute concert on a warm summer day. Then all of a sudden, a crew of workmen show up with jackhammers and start tearing up the street to replace it. No one can hear the flute any more, so you can just transmit the frequency of the jackhammers and the listeners will get the same musical experience as if you had broadcast the flute as well, and you can save bandwidth to boot. This is called frequency masking.
When the jackhammers stop, you don’t have to start broadcasting the flute frequency for a small period of time because the ear turns down its gain when it picks up a loud sound and it takes a bit of time to reset it. Transmission of low-amplitude sounds during this recovery period is pointless, and omitting them can save bandwidth. This is called temporal masking. Perceptual encoding relies heavily on not encoding or transmitting audio that the listeners are not going to perceive anyway.
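The temporal-masking decision can be caricatured in code. The thresholds and recovery time below are invented round numbers for illustration, not values from any real psychoacoustic model:

```python
# Toy illustration of temporal masking: after a loud sound the ear's
# gain is turned down, so a perceptual encoder can skip quiet sounds
# during the recovery period. Thresholds and recovery time are invented.
def frames_to_encode(levels_db, loud_db=90, quiet_db=40, recovery_ms=3,
                     frame_ms=1):
    """levels_db: loudness of successive frames. Returns True for frames
    worth encoding, False for frames the listener cannot hear anyway."""
    masked_until = -1
    keep = []
    for i, level in enumerate(levels_db):
        t = i * frame_ms
        if level >= loud_db:
            masked_until = t + recovery_ms   # loud sound restarts the clock
            keep.append(True)
        elif t <= masked_until and level < quiet_db:
            keep.append(False)               # inaudible: skip it
        else:
            keep.append(True)
    return keep

# A 95-dB jackhammer burst followed by quiet 30-dB flute notes and a
# louder 50-dB note after the ear has recovered:
print(frames_to_encode([95, 30, 30, 30, 30, 50]))
# -> [True, False, False, False, True, True]
```

The three frames right after the jackhammer are dropped; by the fifth frame the ear has recovered, so even the quiet note is encoded again.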
7.4.2 Digital Video
Now that we know all about the ear, it is time to move on to the eye. (No, this section is not followed by one on the nose.) The human eye has the property that when an image appears on the retina, the image is retained for some number of milliseconds before decaying. If a sequence of images is drawn at 50 images/sec, the eye does not notice that it is looking at discrete images. All video systems since the Lumière brothers invented the movie projector in 1895 exploit this principle to produce moving pictures.
The simplest digital representation of video is a sequence of frames, each consisting of a rectangular grid of picture elements, or pixels. Common screen sizes range from 1280 × 720 (called 720p) through 1920 × 1080 (called 1080p or HD video) and 3840 × 2160 (called 4K) to 7680 × 4320 (called 8K).
Most systems use 24 bits per pixel, with 8 bits each for the red, blue, and green (RGB) components. Red, blue, and green are the primary additive colors and every other color can be made from superimposing them in the appropriate intensity.
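These pixel and bit counts imply enormous raw bit rates, which is why compression (discussed shortly) is essential. A back-of-the-envelope sketch, with illustrative frame rates:

```python
# Uncompressed video bit rate: pixels/frame x bits/pixel x frames/sec.
def raw_bitrate_mbps(width, height, fps, bits_per_pixel=24):
    return width * height * bits_per_pixel * fps / 1e6

# 720p at 25 frames/sec (a common rate) is already over half a gigabit:
print(round(raw_bitrate_mbps(1280, 720, 25)))    # -> 553 (Mbps)
# 4K at 50 frames/sec is an order of magnitude worse:
print(round(raw_bitrate_mbps(3840, 2160, 50)))   # -> 9953 (Mbps)
```

The 553 Mbps figure is the same one quoted later for uncompressed 720p PAL progressive video.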
Older frame rates vary from 24 frames/sec, which traditional film-based movies used, through 25.00 frames/sec (the PAL system used in most of the world), to 30 frames/sec (the American NTSC system). Actually, if you want to get picky, NTSC uses 29.97 frames/sec instead of 30 due to a hack the engineers introduced during the transition from black-and-white television to color. A bit of bandwidth was needed for part of the color management, so they took it by reducing the frame rate by 0.03 frame/sec. PAL used color from its inception, so the rate really is exactly 25.00 frames/sec. In France, a slightly different system, called SECAM, was developed in part to protect French companies from German television manufacturers. It also runs at exactly 25.00 frames/sec. During the 1950s, the Communist countries of Eastern Europe adopted SECAM to prevent their people from watching West German (PAL) television and getting Bad Ideas.
To reduce the amount of bandwidth required to broadcast television signals over the air, television stations adopted a scheme in which frames were divided into two fields, one with the odd-numbered rows and one with the even-numbered rows, which were broadcast alternately. This meant that 25 frames/sec was actually 50 fields/sec. This scheme is called interlacing, and gives less flicker than broadcasting entire frames one after another. Modern video does not use interlacing and just sends entire frames in sequence, usually at 50 frames/sec (PAL) or 59.94 frames/sec (NTSC). This is called progressive video.
Video Compression
It should be obvious from our discussion of digital video that compression is critical for sending video over the Internet. Even 720p PAL progressive video requires 553 Mbps of bandwidth, and HD, 4K, and 8K require a lot more. To produce a standard for compressing video that could be used over all platforms and by all manufacturers, the standards committees created a group called MPEG (Motion Picture Experts Group) to come up with a worldwide standard. Very briefly, the standards it came up with, known as MPEG-1, MPEG-2, and MPEG-4, work like this. Every few seconds a complete video frame is transmitted. The frame is compressed using something like the familiar JPEG algorithm that is used for digital still pictures. Then for the next few seconds, instead of sending out full frames, the transmitter sends out differences between the current frame and the base (full) frame it most recently sent out.
First let us briefly look at the JPEG (Joint Photographic Experts Group) algorithm for compressing a single still image. Instead of working with the RGB components, it converts the image into luminance (brightness) and chrominance (color) components because the eye is much more sensitive to luminance than chrominance, allowing fewer bits to be used to encode the chrominance without loss of perceived image quality. The image is then broken up into blocks of typically 8 × 8 or 10 × 10 pixels, each of which is processed separately. The
luminance and chrominance are run through a kind of Fourier transform (technically a discrete cosine transformation) to get the spectrum. High-frequency amplitudes can then be discarded. The more amplitudes that are discarded, the fuzzier the image and the smaller the compressed image is. Then standard lossless compression techniques like run-length encoding and Huffman encoding are applied to the remaining amplitudes. If this sounds complicated, it is, but computers are pretty good at carrying out complicated algorithms.
Now on to the MPEG part, described below in a simplified way. The frame following a full JPEG (base) frame is likely to be very similar to the JPEG frame, so instead of encoding the full frame, only the blocks that differ from the base frame are transmitted. A block containing, say, a piece of blue sky is likely to be the same as it was 20 msec earlier, so there is no need to transmit it again. Only the blocks that have changed need to be retransmitted.
As an example, consider the situation of a camera mounted securely on a tripod with an actor walking toward a stationary tree and house. The first three frames are shown in Fig. 7-32. The encoding of the second frame just sends the blocks that have changed. Conceptually, the receiver starts out producing the second frame by copying the first frame into a buffer and then applying the changes. It then stores the second frame uncompressed for display. It also uses the second frame as the base for applying the changes that arrive describing the difference between the third frame and the second one.
Figure 7-32. Three consecutive frames.
It is slightly more complicated than this, though. If a block (say, the actor) is present in the second frame but has moved, MPEG allows the encoder to say, in effect, ‘‘block 29 from the previous frame is present in the new frame offset by a distance (Δx, Δy) and furthermore the sixth pixel has changed to abc and the 24th pixel is now xyz.’’ This allows even more compression.
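The changed-block idea can be sketched concretely. This is a toy model of inter-frame differencing, not real MPEG: frames are small grids of pixel values, blocks are 2 × 2 tiles, and there is no motion search:

```python
# Sketch of inter-frame block differencing: only blocks that changed
# since the base frame need to be transmitted. (Toy model, not MPEG.)
def changed_blocks(base, current, block=2):
    """Return (row, col) coordinates of BxB blocks that differ."""
    changed = []
    for r in range(0, len(base), block):
        for c in range(0, len(base[0]), block):
            tile_a = [row[c:c + block] for row in base[r:r + block]]
            tile_b = [row[c:c + block] for row in current[r:r + block]]
            if tile_a != tile_b:
                changed.append((r // block, c // block))
    return changed

sky = [[10] * 4 for _ in range(4)]     # 4x4 frame of unchanging "blue sky"
frame2 = [row[:] for row in sky]
frame2[3][0] = 99                      # the actor enters bottom-left
print(changed_blocks(sky, frame2))     # -> [(1, 0)]: one block to send
```

Only one of the four blocks needs to be transmitted; the receiver copies the other three from its buffered copy of the previous frame.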
We mentioned asymmetries between encoding and decoding before. Here we see one. The encoder can spend as much time as it wants searching for blocks that have moved and blocks that have changed somewhat to determine whether it is better to send a list of updates to the previous frame or a complete new JPEG frame. Finding a moved block is a lot more work than simply copying a block from the previous image and pasting it into the new one at a known (Δx, Δy) offset.
To be a bit more complete, MPEG actually has three different kinds of frames, not just two:
1. I (Intracoded) frames that are self-contained compressed still images.
2. P (Predictive) frames that code differences with the previous frame.
3. B (Bidirectional) frames that code differences with the next I-frame.
The B-frames require the receiver to stop processing until the next I-frame arrives and then work backward from it. Sometimes this gives more compression, but having the encoder constantly check whether differences with the previous frame or with any one of the next 30, 50, or 80 frames give the smallest result is time consuming on the encoding side but not time consuming on the decoding side. This asymmetry is exploited to the maximum to give the smallest possible encoded file. The MPEG standards do not specify how to search, how far to search, or how good a match has to be in order to send differences or a complete new block. This is up to each implementation.
Audio and video are encoded separately as we have described. The final MPEG-encoded file consists of chunks containing some number of compressed images and the corresponding compressed audio to be played while the frames in
that chunk are displayed. In this way, the video and audio are kept synchronized. Note that this is a rather simplified description. In reality, even more tricks are used to get better compression, but the basic ideas given above are essentially correct. The most recent format is MPEG-4, also called MP4. It is formally defined in a standard known as H.264. Its successor (defined for resolutions up to 8K) is H.265. H.264 is the format most consumer video cameras produce. Because the camera has to record the video on the SD card or other medium in real time, it has very little time to hunt for blocks that have moved a little. Consequently, the compression is not nearly as good as what a Hollywood studio can do when it dynamically allocates 10,000 computers at a cloud server to encode its latest production. This is encoding/decoding asymmetry in action.
7.4.3 Streaming Stored Media
Let us now move on to network applications. Our first case is streaming a video that is already stored on a server somewhere, for example, watching a YouTube or Netflix video over the Internet. This is one form of VoD (Video on Demand). Other forms of video on demand use a provider network that is separate from the Internet to deliver the movies (e.g., the cable TV network).
The Internet is full of music and video sites that stream stored multimedia files. Actually, the easiest way to handle stored media is not to stream it. The straightforward way to make the video (or music track) available is just to treat the
pre-encoded video (or audio) file as a very big Web page and let the browser download it. The sequence of four steps is shown in Fig. 7-33.
Figure 7-33. Playing media over the Web via simple downloads.
The browser goes into action when the user clicks on a movie. In step 1, it sends an HTTP request for the movie to the Web server to which the movie is linked. In step 2, the server fetches the movie (which is just a file in MP4 or some other format) and sends it back to the browser. Using the MIME type, the browser looks up how it is supposed to display the file. The browser then saves the entire movie to a scratch file on disk in step 3. It then starts the media player, passing it the name of the scratch file. Finally, in step 4 the media player starts reading the file and playing the movie. Conceptually, this is no different than fetching and displaying a static Web page, except that the downloaded file is ‘‘displayed’’ by using a media player instead of just writing pixels to a monitor.
In principle, this approach is completely correct. It will play the movie. There is no real-time network issue to address either because the download is simply a file download. The only trouble is that the entire video must be transmitted over the network before the movie starts. Most customers do not want to wait an hour for their ‘‘video on demand’’ to start, so something better is needed.
What is needed is a media player that is designed for streaming. It can either be part of the Web browser or an external program called by the browser when a video needs to be played. Modern browsers that support HTML5 usually have a built-in media player.
A media player has five major jobs to do:
1. Manage the user interface.
2. Handle transmission errors.
3. Decompress the content.
4. Eliminate jitter.
5. Decrypt the file.
Most media players nowadays have a glitzy user interface, sometimes simulating a stereo unit, with shiny buttons, knobs, sliders, and visual displays. Often there are
interchangeable front panels, called skins, that the user can drop onto the player. The media player has to manage all this and interact with the user. The next three jobs are related and depend on the network protocols. We will go through each one in turn, starting with handling transmission errors. Dealing with errors depends on whether a TCP-based transport like HTTP is used to transport the media, or a UDP-based transport like RTP (Real-time Transport Protocol) is used. If a TCP-based transport is being used then there are no errors for the media player to correct because TCP already provides reliability by using retransmissions. This is an easy way to handle errors, at least for the media player, but it does complicate the removal of jitter in a later step because timing out and asking for retransmissions introduces uncertain and variable delays in the movie.
Alternatively, a UDP-based transport like RTP can be used to move the data. With these protocols, there are no retransmissions. Thus, packet loss due to congestion or transmission errors will mean that some of the media does not arrive. It is up to the media player to deal with this problem. One way is to ignore the problem and just have bits of video and audio be wrong. If errors are infrequent, this works fine and almost no one will notice. Another possibility is to use forward error correction, for example, encoding the video file with some redundancy such as a Hamming code or a Reed-Solomon code. Then the media player will have enough information to correct errors on its own, without having to ask for retransmissions or skip bits of damaged movies.
The downside here is that adding redundancy to the file makes it bigger. Another approach involves using selective retransmission of the parts of the video stream that are most important to play back the content. For example, in a compressed video sequence, a packet loss in an I-frame is much more consequential, since the decoding errors that result from the loss can propagate throughout the group of pictures. On the other hand, losses in derivative frames, including P-frames and B-frames, are easier to recover from. Similarly, the value of a retransmission also depends on whether the retransmission of the content would arrive in time for playback. As a result, some retransmissions can be far more valuable than others, and selectively retransmitting certain packets (e.g., those within I-frames that would arrive before playback) is one possible strategy. Protocols have been built on top of RTP and QUIC to provide unequal loss protection when videos are streamed over UDP (Feamster et al., 2000; and Palmer et al., 2018).
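The simplest form of forward error correction can be sketched with XOR parity: one parity packet per group lets the receiver rebuild any single lost packet without a retransmission. Real systems use stronger codes such as Reed-Solomon; this toy survives only one loss per group:

```python
# Minimal FEC sketch: one XOR parity packet per group of equal-length
# packets recovers any single lost packet. (Toy; real codecs use
# stronger codes such as Reed-Solomon.)
def make_parity(packets):
    """XOR equal-length byte strings into one parity packet."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity):
    """received: the group with exactly one entry replaced by None.
    XORing the parity with all surviving packets rebuilds the loss."""
    missing = received.index(None)
    rebuilt = bytearray(parity)
    for pkt in received:
        if pkt is not None:
            for i, b in enumerate(pkt):
                rebuilt[i] ^= b
    return missing, bytes(rebuilt)

group = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(group)
damaged = [b"AAAA", None, b"CCCC"]       # packet 1 lost in transit
print(recover(damaged, parity))          # -> (1, b'BBBB')
```

The bandwidth cost is one extra packet per group, which is exactly the ‘‘makes the file bigger’’ downside mentioned above.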
The media player’s third job is decompressing the content. Although this task is computationally intensive, it is fairly straightforward. The thorny issue is how to decode media if the underlying network protocol does not correct transmission errors. In many compression schemes, later data cannot be decompressed until the earlier data has been decompressed, because the later data is encoded relative to the earlier data. Recall that a P-frame is based upon the most recent I-frame (and the P-frames that follow it depend on it in turn). If the I-frame is damaged and cannot be decoded, all the subsequent P-frames are useless. The media player will then be forced to wait for the next I-frame and simply skip a few seconds of video.
This reality forces the encoder to make a decision. If I-frames are spaced closely, say, one per second, the gap when an error occurs will be fairly small, but the video will be bigger because I-frames are much bigger than P- or B-frames. If I-frames are, say, 5 seconds apart, the video file will be much smaller, but there will be a 5-second gap if an I-frame is damaged and a smaller gap if a P-frame is damaged. For this reason, when the underlying protocol is TCP, I-frames can be spaced much further apart than if RTP is used. Consequently, many video-streaming sites use TCP to allow a smaller encoded file with widely spaced I-frames and less bandwidth needed for smooth playback.
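A back-of-the-envelope calculation shows the tradeoff. The frame sizes below are made up purely for illustration (I-frames assumed ten times larger than P-frames, 30 frames/sec); real numbers depend heavily on the codec and content.

```python
# Illustrative tradeoff between I-frame spacing and stream size.
# Assumed (not real) values: 100-KB I-frames, 10-KB P-frames, 30 frames/sec.

I_SIZE_KB, P_SIZE_KB, FPS = 100, 10, 30

def stream_kb_per_sec(iframe_interval_sec):
    """Average data rate for one I-frame followed by P-frames per group."""
    frames_per_group = FPS * iframe_interval_sec
    group_kb = I_SIZE_KB + (frames_per_group - 1) * P_SIZE_KB
    return group_kb / iframe_interval_sec

for gap in (1, 5):
    print(f"I-frame every {gap}s: {stream_kb_per_sec(gap):.0f} KB/s, "
          f"worst-case skip on I-frame loss: {gap}s")
```

Under these assumptions, widening the I-frame spacing from 1 to 5 seconds shrinks the stream from 390 KB/s to 318 KB/s, at the price of a potential 5-second skip after an I-frame loss.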
The fourth job is to eliminate jitter, the bane of all real-time systems. Using TCP makes this much worse, because it introduces random delays whenever retransmissions are needed. The general solution that all streaming systems use is a playout buffer. Before starting to play the video, the system collects 5–30 seconds’ worth of media, as shown in Fig. 7-34. Playing drains media regularly from the buffer so that the audio is clear and the video is smooth. The startup delay gives the buffer a chance to fill to the low-water mark. The idea is that data should now arrive regularly enough that the buffer is never completely emptied. If that were to happen, the media playout would stall.
[Figure 7-34: the media player on the client machine fills a buffer, marked with low-water and high-water marks, from the media server across the network.]
Figure 7-34. The media player buffers input from the media server and plays from the buffer rather than directly from the network.
Buffering introduces a new complication. The media player needs to keep the buffer partly full, ideally between the low-water mark and the high-water mark. This means when the buffer passes the high-water mark, the player needs to tell the source to stop sending, lest it lose data for lack of a place to put it. The high-water mark has to be before the end of the buffer because data will continue to stream in until the Stop request gets to the media server. Once the server stops sending and the pipeline is empty, the buffer will start draining. When it hits the low-water mark, the player sends a Start command to the server to start streaming again.
By using a protocol in which the media player can command the server to stop and start, the media player can keep enough, but not too much, media in the buffer to ensure smooth playout. Since RAM is fairly cheap these days, a media player, even on a smartphone, could allocate enough buffer space to hold a minute or more of media, if need be.
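The start/stop protocol described above can be sketched as a small state machine. The class and parameter names below are illustrative, not from any real player: the buffer asks the server to stop once it fills to the high-water mark and to start again once playout drains it to the low-water mark.

```python
# Minimal sketch of low-/high-water-mark playout buffering.
# Names and thresholds are illustrative.

class PlayoutBuffer:
    def __init__(self, low_sec, high_sec):
        self.low, self.high = low_sec, high_sec
        self.buffered = 0.0            # seconds of media currently buffered
        self.server_sending = True

    def on_media_received(self, sec):
        if self.server_sending:
            self.buffered += sec
            if self.buffered >= self.high:
                self.server_sending = False   # player sends Stop to server

    def on_playout(self, sec):
        self.buffered = max(0.0, self.buffered - sec)
        if self.buffered <= self.low and not self.server_sending:
            self.server_sending = True        # player sends Start to server

buf = PlayoutBuffer(low_sec=5, high_sec=25)
for _ in range(30):
    buf.on_media_received(1.0)   # server streams 1 s of media per tick
print(buf.server_sending)        # False: hit the high-water mark
for _ in range(21):
    buf.on_playout(1.0)
print(buf.server_sending)        # True: drained to the low-water mark
```

In a real player the Stop would not take effect instantly (data in the pipeline keeps arriving), which is exactly why the text places the high-water mark before the physical end of the buffer.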
SEC. 7.4 STREAMING AUDIO AND VIDEO
The start-stop mechanism has another nice feature. It decouples the server’s transmission rate from the playout rate. Suppose, for example, that the player has to play out the video at 8 Mbps. When the buffer drops to the low-water mark, the player will tell the server to deliver more data. If the server is capable of delivering it at 100 Mbps, that is not a problem. It just comes in and is stored in the buffer. When the high-water mark is reached, the player tells the server to stop. In this way, the server’s transmission rate and the playout rate are completely decoupled. What started out as a real-time system has become a simple nonreal-time file transfer system. Getting rid of all the real-time transmission requirements is another reason YouTube, Netflix, Hulu, and other streaming servers use TCP. It makes the whole system design much simpler.
Determining the size of the buffer is a bit tricky. If lots of RAM is available, at first glance it might seem to make sense to have a large buffer and allow the server to keep it almost full, just in case the network suffers some congestion later on. However, users are sometimes finicky. If a user finds a scene boring and uses the buttons on the media player’s interface to skip forward, that might render most or all of the buffer useless. In any event, jumping forward (or backward) to a specific point in time is unlikely to work unless that frame happens to be an I-frame. If not, the player has to search for a nearby I-frame. If the new play point is outside the buffer, the entire buffer has to be cleared and reloaded. In effect, users who skip around a lot (and there are many of them) waste network bandwidth by invalidating precious data in their buffers. Systemwide, the existence of users who skip around a lot argues for limiting the buffer size, even if there is plenty of RAM available. Ideally, a media player could observe the user’s behavior and pick a buffer size to match the user’s viewing style.
All commercial videos are encrypted to prevent piracy, so media players have to be able to decrypt them as they come in. That is the fifth task in the list above.
DASH and HLS
The plethora of devices for viewing media introduces some complications we need to look at now. Someone who buys a bright, shiny, and very expensive 8K monitor will want movies delivered in 7680 × 4320 resolution at 100 or 120 frames/sec. But if halfway through an exciting movie she has to go to the doctor and wants to finish watching it in the waiting room on a 1280 × 720 smartphone that can handle at most 25 frames/sec, she has a problem. From the streaming site’s point of view, this raises the question of at what resolution and frame rate movies should be encoded.
The easy answer is to use every possible combination. At most it wastes disk space to encode every movie at seven screen resolutions (e.g., smartphone, NTSC, PAL, 720p, HD, 4K, and 8K) and six frame rates (e.g., 25, 30, 50, 60, 100, and 120), for a total of 42 variants, but disk space is not very expensive. A bigger, but
related problem is what happens when the viewer is stationary at home with her big, shiny monitor, but due to network congestion, the bandwidth between her and the server is changing wildly and cannot always support the full resolution.
Fortunately, several solutions have already been implemented. One solution is DASH (Dynamic Adaptive Streaming over HTTP). The basic idea is simple, and it is compatible with HTTP (and HTTPS), so the video can be streamed from a Web page. The streaming server first encodes its movies at multiple resolutions and frame rates and has them all stored in its disk farm. Each version is stored not as a single file, but as many files, each holding, say, 10 seconds of video and audio. This means that a 90-minute movie with seven screen resolutions and six frame rates (42 variants) would require 42 × 540 = 22,680 separate files, each with 10 seconds’ worth of content. In other words, each file holds a segment of the movie at one specific resolution and frame rate. Associated with the movie is a manifest, officially known as an MPD (Media Presentation Description), which lists the names of all these files and their properties, including resolution, frame rate, and frame number in the movie.
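The segment arithmetic above is easy to verify. A 90-minute movie cut into 10-second segments yields 540 segments per variant, and with 7 resolutions times 6 frame rates there are 42 variants:

```python
# Segment count for the DASH example in the text: 90-minute movie,
# 10-second segments, 7 resolutions x 6 frame rates.

MOVIE_MIN, SEGMENT_SEC = 90, 10
RESOLUTIONS, FRAME_RATES = 7, 6

segments_per_variant = MOVIE_MIN * 60 // SEGMENT_SEC   # 540
variants = RESOLUTIONS * FRAME_RATES                   # 42
total_files = variants * segments_per_variant
print(total_files)                                     # 22680
```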
To make this approach work, the player and the server must both use the DASH protocol. The player side could be the browser itself, a player shipped to the browser as a JavaScript program, or a custom application (e.g., for a mobile device or a streaming set-top box). The first thing the player does when it is time to start viewing the movie is fetch the manifest for the movie, which is just a small file, so a normal HTTPS GET request is all that is needed.
The player then interrogates the device on which it is running to discover its maximum resolution and possibly other characteristics, such as what audio formats it can handle and how many speakers it has. Then it begins running some tests, sending test messages to the server to try to estimate how much bandwidth is available. Once it has figured out what resolution the screen has and how much bandwidth is available, the player consults the manifest to find the first, say, 10 seconds of the movie that give the best quality for the screen and available bandwidth.
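The selection step just described can be sketched as follows. The variant table stands in for the manifest and its entries are hypothetical, as is the function name; the logic picks the highest-bitrate variant that both fits the screen and stays within the estimated bandwidth.

```python
# Hypothetical variant selection, standing in for a DASH manifest lookup.
# Each entry: (height in pixels, frames/sec, bitrate in kbps).

VARIANTS = [
    (720, 25, 1500),
    (1080, 50, 4000),
    (2160, 60, 8000),
    (4320, 120, 16000),
]

def pick_variant(screen_height, est_bandwidth_kbps):
    feasible = [v for v in VARIANTS
                if v[0] <= screen_height and v[2] <= est_bandwidth_kbps]
    # If nothing fits the bandwidth estimate, fall back to the lowest rate.
    return max(feasible, key=lambda v: v[2]) if feasible else VARIANTS[0]

print(pick_variant(2160, 10000))   # (2160, 60, 8000): 4K screen, ample bandwidth
print(pick_variant(2160, 2000))    # (720, 25, 1500): congested network
```

Because the player repeats this choice for every segment it fetches, the quality can track changing network conditions, which is exactly the adaptation described next.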
But that’s not the end of the story. As the movie plays, the player continues to run bandwidth tests. Every time it needs more content, that is, when the amount of media in the buffer hits the low-water mark, it again consults the manifest and orders the appropriate file, depending on where it is in the movie and which resolution and frame rate it wants. If the bandwidth varies wildly during playback, the movie shown may change from 8K at 100 frames/sec to HD at 25 frames/sec and back several times a minute. In this way, the system adapts rapidly to changing network conditions and allows the best viewing experience consistent with the available resources. Companies such as Netflix have published information about how they adapt the bitrate of a video stream based on the playback buffer occupancy (Huang et al., 2014). An example is shown in Fig. 7-35.
In Fig. 7-35, as the bandwidth decreases, the player decides to ask for increasingly low-resolution versions. However, it could also have compromised in other ways. For example, sending out 300 frames for a 10-second playout requires less