Comprehensive Notes on Transport Layer Protocols

Transport Layer

What We Know So Far

  • Theory Development:

    • ISO developed/adopted the theory in the late 70s and early 80s.

    • Vint Cerf and Robert Kahn were key figures in 1974, referred to as "Packet Pushers."

  • Evolution of Data Transmission:

    • Raw bit streams are transmitted over a physical transmission medium.

    • Data frames are reliably transmitted between two nodes connected by a physical layer.

    • Multi-node/network data transfer includes network addressing, routing, and traffic control.

    • Reliable end-to-end communication for services/applications incorporates flow control, multiplexing, and connection-oriented communication.

  • Network Stack Layers (Bottom to Top):

    • Bits → Frames → Packets → Segments → Messages

Sample Network Stack in the Internet Reality

  • Headers of a Typical Packet in the AT&T Backbone Network:

    • The diagram illustrates the headers of a typical packet in the AT&T backbone network.

    • Headers lower in the diagram are outermost in the actual packet.

    • Examples of headers include HTTP, TCP, IP, IPsec, GTP, UDP, MPLS, and Ethernet.

    • Layers include Application, Transport, Network, and Data link/physical layers.

  • Layering:

    • The Internet architecture is a composition of a wide variety of networks.

  • Reference:

    • Pamela Zave and Jennifer Rexford. 2019. The compositional architecture of the internet. Commun. ACM 62, 3 (February 2019), 78–87. DOI:https://doi.org/10.1145/3226588

Why Layering Approach to the Network Model?

  • Separation: Breaks down data communication into smaller tasks/functions.

  • Abstraction: Changes in one layer have minimal impact on other layers.

  • Design: Simplifies implementation of functions/protocols as long as interconnection between layers is maintained.

  • Complexity: Eases learning, troubleshooting, and standardization.

Layering Approach to Network Model

  • Network layers should have different scopes or provide different functionalities within the same scope.

  • Diagram shows layering with different scopes (A, B, C) across Host A and Host B.

  • Question posed: What would middleboxes be called in the OSI model?

Transport Layer

  • Position in Network Stacks:

    • L4 in the OSI model.

    • L3 in the TCP/IP model.

  • Components:

    • Port number (e.g., BSD sockets API).

    • Transport Protocol Data Unit.

    • IP address.

    • Protocols: TCP, UDP, QUIC, SCTP

Transport Layer: Addressing

  • Role:

    • Provides end-to-end communication between two applications (transporting data from app A to remote app B on top of the network layer).

  • Addressing:

    • TSAPs (Transport Service Access Points).

    • NSAPs (Network Service Access Points).

    • Transport connections.

Transport Layer: Connection Process

  • Server Role:

    • A process server (e.g., inetd on UNIX) runs on the server machine.

    • Acts as a proxy for less heavily used servers.

    • Listens to CONNECT requests.

    • Spawns the requested server, allowing it to inherit the connection when a request arrives.

  • Connection Establishment Example:

    • How a user process in host 1 can establish a connection with a time-of-day server in host 2.

Transport Layer: Addressing continued

  • Service Selection Using Port Numbers:

    • Port numbers are used to choose a service (e.g., each representing an application) during connection establishment.

  • Well-Known Ports (0-1023):

    • Maintained by the Internet Assigned Numbers Authority (IANA).

    • Reserved for services and applications.

    • See: https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml

  • Sample Well-Known Ports:

    • 20: TCP/UDP - FTP Protocol (data) for transferring FTP data.

    • 21: TCP/UDP - FTP Protocol (control) for FTP commands and flow control.

    • 22: TCP/UDP - SSH (Secure Shell) for secure logins, file transfers, and port forwarding.

    • 25: TCP/UDP - Simple Mail Transfer (SMTP).

    • 53: TCP/UDP - Domain Name Server (DNS).

    • 80: TCP/UDP - World Wide Web HTTP.

    • 123: TCP/UDP - Network Time Protocol (NTP).

    • 220: TCP/UDP - Interactive Mail Access Protocol v3 (IMAP3).

    • 443: TCP/UDP - HTTP protocol over TLS/SSL (HTTPS).

Transport Layer: Addressing - Port Types & Ranges

  • As per RFC 1700 and its successor RFC 6335:

  • Well-known Ports (0-1023):

    • Used by servers (e.g., web, email, DNS).

  • Registered Ports (1024-49151):

    • Assigned by IANA to a requesting entity but not controlled.

    • Used by client applications (e.g., port 6073 for directplay8 (Microsoft) for DirectX gaming & multimedia API).

  • Private and/or Dynamic Ports (49152-65535):

    • Assigned dynamically by the client OS to identify an application/service end-point.
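
The three ranges above can be captured in a few lines. This is a minimal sketch; the function name `classify_port` and the returned labels are illustrative and not part of any standard API.

```python
def classify_port(port: int) -> str:
    """Return the name of the range a 16-bit port number falls into."""
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic/private"


for p in (80, 6073, 51000):
    print(p, classify_port(p))
```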

Transport Layer: Addressing - Communication Process

  • What is needed to distinguish a particular communication process (i.e., conversation)?

  • 5-Tuples:

    1. Source port: Selected dynamically (e.g., by OS), used as return address.

    2. Destination port: (e.g., port 80 for HTTP (web)).

    3. Source IP address: (e.g., 192.168.1.5).

    4. Destination IP address: (e.g., 192.168.1.1).

    5. Protocol: (e.g., TCP or UDP).

  • The Internet Transport Layer:

    • Services are mostly defined by two protocols:

      • UDP (connectionless): sends a “datagram”.

      • TCP (connection-oriented): transfers a reliable bytestream.

    • Addressing: port numbers.

      • Choosing a service during connection establishment.

    • Socket: One end-point to a two-way communication (e.g., 192.168.1.1:10).

    • Socket pairs: two ends of the communication (local and remote) – Berkeley sockets: TCP primitives.
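
As a rough illustration of why all five fields are needed, the sketch below keys a table of conversations by 5-tuple. The class `FiveTuple`, its `flipped` helper, and the addresses and ports are made up for the example; a real OS keeps this state inside the kernel.

```python
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str

    def flipped(self) -> "FiveTuple":
        """The same conversation as seen from the other end."""
        return FiveTuple(self.dst_ip, self.dst_port,
                         self.src_ip, self.src_port, self.protocol)


# Two connections from the same host to the same web server differ
# only in the dynamically chosen source port.
tab1 = FiveTuple("192.168.1.5", 50001, "192.168.1.1", 80, "TCP")
tab2 = FiveTuple("192.168.1.5", 50002, "192.168.1.1", 80, "TCP")
conversations = {tab1: "tab 1", tab2: "tab 2"}

# A reply from the server is matched back to the right conversation.
reply = FiveTuple("192.168.1.1", 80, "192.168.1.5", 50002, "TCP")
print(conversations[reply.flipped()])  # tab 2
```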

Transport Layer Protocols

  • Services are offered to the application by transport layer API (e.g., Berkeley sockets API).

  • Services are implemented by transport protocols.

  • TCP and UDP have so far been the most widely used transport protocols:

    • UDP: Connection/state-less, no reliability, no flow/congestion control, message-based, hence sends a datagram.

    • TCP: Connection-oriented (stateful), reliable and in-order delivery, with flow/congestion control, byte-stream divided into segments.

    • QUIC: An alternative to TCP, rapidly gaining traction since 2012.

  • OSI/Internet Terminology:

    • TPDU (OSI) = Segment/Datagram (Internet).

    • PDU (protocol data unit): Data sent to peer protocol layer at the receiver end.

    • SDU (service data unit): Data sent from one layer to a lower layer.

    • Flow control: Do not exceed receiver’s available capacity.

    • Congestion control: Do not exceed network’s available capacity.

User Datagram Protocol (UDP)

  • Offers only two features over IP:

    • Ports.

    • Checksum.

  • UDP specifications in RFC 768 (1980).

  • Less overhead and delay, but unreliable, no flow control, no congestion control.

  • Used for live media streaming, DNS, SNMP, DHCP, VoIP, online games, IPTV

  • UDP = IP + 2 features: UDP header format
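
The UDP header in RFC 768 is four 16-bit fields: source port, destination port, length, and checksum. The sketch below packs and parses it with Python's `struct` module. The helper names are illustrative, and the checksum is left at 0 (meaning "not computed" over IPv4) to keep the example short.

```python
import struct

# Network byte order, four unsigned 16-bit fields = 8 bytes
UDP_HEADER = struct.Struct("!HHHH")


def build_udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    length = UDP_HEADER.size + len(payload)   # length covers header + data
    checksum = 0                              # 0 = sender generated no checksum (IPv4 only)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload


def parse_udp_datagram(datagram: bytes):
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(datagram)
    payload = datagram[UDP_HEADER.size:length]
    return src_port, dst_port, length, checksum, payload


datagram = build_udp_datagram(50000, 53, b"hello")
print(parse_udp_datagram(datagram))
```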

Transmission Control Protocol (TCP) (#1)

  • Specification defined in RFC 793 (1981).

    • Complex (85 pages) compared to RFC 768 (UDP) (3 pages only).

  • Full reliability established using acknowledgements (ACKs) and retransmissions; sequence numbers for in-order delivery.

  • Implements flow control and congestion control.

  • Used for web (HTTP), email (SMTP), file transfer (FTP).

  • Originally developed for DARPA.

  • More overhead and slower than UDP.

  • TCP encapsulation in IP: an IP packet is an IP header followed by a TCP segment; the TCP segment is a TCP header followed by (optional) TCP data.

Transmission Control Protocol (TCP) (#2)

  • Sequence No.:

    • seqno of the first byte in the segment.

    • If SYN bit is set, indicates the Initial Sequence Number (ISN) denoting the starting value of the byte-stream.

  • Acknowledgment No.:

    • If ACK bit is set, the next sequence number that the sender of this segment (i.e., the receiver of the data) expects to receive.

  • Hlen:

    • Header length; indicates where data begins.

  • Window:

    • The number of bytes receiver is willing to receive (receiver advertised window).

  • Checksum:

    • For error checking of segment header and data.

  • TCP Flags:

    • SYN (establish), ACK (acknowledge), RST (reset), FIN (terminate).

    • URG and PUSH flags are rarely used!
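
The fields listed above sit at fixed offsets in the 20-byte base TCP header, so they can be pulled out with `struct`. This is a sketch only: it ignores TCP options and does not verify the checksum, and the function and dictionary names are illustrative.

```python
import struct

# src port, dst port, seqno, ackno, (data offset + flags), window, checksum, urgent ptr
TCP_HEADER = struct.Struct("!HHIIHHHH")
FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_header(segment: bytes) -> dict:
    src, dst, seq, ack, off_flags, window, checksum, _urg = TCP_HEADER.unpack_from(segment)
    hlen = (off_flags >> 12) * 4          # Hlen is counted in 32-bit words
    return {
        "src_port": src, "dst_port": dst,
        "seq": seq, "ack": ack,
        "hlen": hlen,
        "flags": {name for name, bit in FLAGS.items() if off_flags & bit},
        "window": window, "checksum": checksum,
        "data": segment[hlen:],           # data begins where the header ends
    }


# A hand-built SYN segment: Hlen = 5 words, SYN flag set, no payload
syn = TCP_HEADER.pack(50000, 80, 1000, 0, (5 << 12) | FLAGS["SYN"], 65535, 0, 0)
print(parse_tcp_header(syn))
```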

TCP Connection Establishment

  • 3-way handshake method to establish a reliable connection:

    • (a) Client requests a connection by sending a SYN packet.

    • (b) Server acknowledges back with SYN/ACK.

    • (c) Client ACKs the server’s SYN/ACK.

  • Takes one Round-Trip Time (RTT) at minimum before the client can send data: the ACK in step (c) can already carry data.
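
A toy trace of the sequence and acknowledgment numbers exchanged in steps (a) to (c). It assumes that a SYN consumes one sequence number, which is why each side acknowledges ISN+1. The dictionaries stand in for real segments and the function name is illustrative.

```python
import random

SEQ_SPACE = 2 ** 32   # sequence numbers are 32-bit and wrap around


def three_way_handshake(client_isn=None, server_isn=None):
    """Return the three segments of the handshake as small dicts."""
    if client_isn is None:
        client_isn = random.getrandbits(32)
    if server_isn is None:
        server_isn = random.getrandbits(32)

    syn = {"flags": {"SYN"}, "seq": client_isn}
    syn_ack = {"flags": {"SYN", "ACK"}, "seq": server_isn,
               "ack": (client_isn + 1) % SEQ_SPACE}
    ack = {"flags": {"ACK"}, "seq": syn_ack["ack"],
           "ack": (server_isn + 1) % SEQ_SPACE}
    return [syn, syn_ack, ack]


for segment in three_way_handshake(client_isn=100, server_isn=300):
    print(segment)
```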

TCP Connection Termination

  • 4-way handshake method to terminate a connection (in fact, two two-way handshakes):

    • (a) FIN from Host A.

    • (b) ACK from Host B.

    • (c) FIN from Host B.

    • (d) ACK from Host A.

  • Question: Why 4-way (2x2) instead of 3-way?

  • Answer: TCP provides bi-directional data transfer. One side might still have data to send.

TCP in-order Data Delivery

  • TCP data stream can only be pushed to the application layer buffer in-order.

  • TCP packets/segments can arrive out-of-order (e.g., if they take different routes in the network or due to parallelism in the routers).

  • Using sequence numbers allows re-assembly of the data stream at the receiver side even in the presence of out-of-order segments.

  • Randomly chosen Initial Sequence Numbers (ISNs) are exchanged upon TCP connection establishment (SYN, SYN/ACK). They represent the starting value of byte-stream. Data begins at ISN+1.

  • Seqno. is incremented further as data is being transmitted by the sender.
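
A minimal sketch of receiver-side reassembly. It assumes segments never overlap and that no sequence-number wraparound occurs; the function name and the sample data are made up for illustration.

```python
def reassemble(segments, isn):
    """Deliver bytes in order. `segments` is a list of (seqno, data) in arrival order."""
    next_seq = isn + 1       # data begins at ISN+1
    pending = {}             # out-of-order segments, keyed by their seqno
    delivered = bytearray()
    for seq, data in segments:
        pending[seq] = data
        # Hand over every segment that is now contiguous with the stream so far
        while next_seq in pending:
            chunk = pending.pop(next_seq)
            delivered += chunk
            next_seq += len(chunk)
    return bytes(delivered)


# "world" overtakes "hello" in the network, yet the application sees the right order
arrivals = [(1006, b"world"), (1001, b"hello"), (1011, b"!")]
print(reassemble(arrivals, isn=1000))  # b'helloworld!'
```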

TCP Error Control: ACKs

  • “Positive” acknowledgement (ACK) packets sent back from receiver to the sender.

  • ACKs are cumulative: ACK n acknowledges everything up to n-1.

  • Duplicate ACK (DupACK): sent when receiver sees a gap between received segments; sender retransmits the missing segments after 3 DupACKs.

  • ACKs should be delayed (except when sending DupACKs) - ACK every 2 segments or once every 500ms (RFC 1122) (or 200ms in Microsoft Windows).

  • ACK packets are unreliable (less costly to drop than data packets).
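
The sketch below shows how cumulative ACKs turn a single lost segment into a run of DupACKs. It is a simplification: it sends one ACK per arriving segment (no delayed ACKs), and the function names are illustrative.

```python
def receiver_acks(segments, first_seq):
    """Return the ACK number sent for each arriving (seqno, length) segment."""
    expected = first_seq
    buffered = {}
    acks = []
    for seq, length in segments:
        if seq >= expected:
            buffered[seq] = length
        while expected in buffered:
            expected += buffered.pop(expected)
        acks.append(expected)   # cumulative: every byte below `expected` has arrived
    return acks


def should_fast_retransmit(acks, threshold=3):
    """True once an ACK number has been repeated `threshold` times after the original."""
    dup = 0
    for prev, cur in zip(acks, acks[1:]):
        dup = dup + 1 if cur == prev else 0
        if dup >= threshold:
            return True
    return False


# 100-byte segments; the one starting at 201 is lost and retransmitted last
arrivals = [(1, 100), (101, 100), (301, 100), (401, 100), (501, 100), (201, 100)]
acks = receiver_acks(arrivals, first_seq=1)
print(acks)  # [101, 201, 201, 201, 201, 601]
```

Once the retransmitted segment fills the gap, a single ACK (601) covers everything that was buffered behind it.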

TCP Error Control: Timeout

  • Retransmission Timeout (RTO).

  • When timeout expires, missing packet is retransmitted and cwnd=1 (i.e., starts all over again).

  • Difficult to determine the right RTO value!

    • Too long: too slow to detect loss.

    • Too short: risk of false positives.

  • RTO is calculated based on RTT as laid out in RFC 6298 (RTO is rounded up to at least 1 second).
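
A sketch of the RFC 6298 calculation: a smoothed RTT (SRTT) and an RTT variance (RTTVAR) are updated with each sample, and RTO = SRTT + 4 * RTTVAR, rounded up to 1 second. The class name is illustrative, and the clock-granularity term and the upper bound on RTO from the RFC are left out for brevity.

```python
class RtoEstimator:
    ALPHA, BETA, K = 1 / 8, 1 / 4, 4   # constants recommended by RFC 6298
    MIN_RTO = 1.0                      # seconds

    def __init__(self):
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0                 # initial RTO before any RTT sample

    def on_rtt_sample(self, rtt: float) -> float:
        if self.srtt is None:          # first measurement
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:                          # RTTVAR is updated using the old SRTT
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = max(self.MIN_RTO, self.srtt + self.K * self.rttvar)
        return self.rto

    def on_timeout(self) -> float:
        self.rto *= 2                  # exponential back-off after a timeout
        return self.rto


est = RtoEstimator()
for sample in (0.10, 0.12, 0.30, 0.11):
    print(f"rtt={sample:.2f}s -> rto={est.on_rtt_sample(sample):.3f}s")
```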

TCP Flow Control

  • Receiver advertised window (rwnd) sizes are exchanged during connection establishment.

  • Flow control: Limited receiver capacity.

  • Congestion control: Limited network capacity.

  • Diagram labels: sender (rate adjustment), network congestion, packet loss, receiver overflow, destination.

TCP Flow Control

  • Receiver advertised window (rwnd): Sliding window chosen based on the available TCP receiver buffer size.

  • Send window (congestion window): The number of bytes the TCP sender is allowed to inject into the network.

  • Congestion window (cwnd) is updated whenever an ACK is received and set based on the inferred congestion in the network but capped at rwnd: cwnd = min(cwnd, rwnd).
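
As a small worked example of the cap above, the sender may only have min(cwnd, rwnd) bytes outstanding at any time. The helper name `sendable_bytes` and the numbers are illustrative.

```python
def sendable_bytes(cwnd: int, rwnd: int, bytes_in_flight: int) -> int:
    """How many more bytes the sender may inject right now."""
    window = min(cwnd, rwnd)              # the tighter of the two limits wins
    return max(0, window - bytes_in_flight)


# Network could take 10 000 bytes, but the receiver only advertises 4 000
print(sendable_bytes(cwnd=10_000, rwnd=4_000, bytes_in_flight=1_000))  # 3000
```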

TCP Congestion Control (Slow-Start)

  • Congestion control objective is to adapt to the available network capacity.

  • Congestion Window (cwnd): The number of bytes TCP sender can inject into the network before expecting to receive an ACK.

  • TCP starts with an Initial Window (IW) (initcwnd) of ~3 packets (since 2002), or 10 packets (since ~2013, also from Linux 2.6.39).

  • Then it probes for the available bandwidth in Slow-Start mode (exponential growth, i.e., binary search).

  • SS: for every ACK, cwnd = cwnd + 1; doubles every RTT (exponential).
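
The per-ACK rule can be traced round by round to see the doubling. The sketch counts cwnd in packets and assumes one ACK per packet in flight with no losses; the function name is illustrative.

```python
def slow_start(initial_window=10, rtts=5):
    """Return cwnd (in packets) at the start and after each RTT."""
    cwnd = initial_window
    history = [cwnd]
    for _ in range(rtts):
        acks_this_rtt = cwnd          # every packet sent this round is ACKed
        for _ in range(acks_this_rtt):
            cwnd += 1                 # SS: +1 per ACK
        history.append(cwnd)
    return history


print(slow_start())  # [10, 20, 40, 80, 160, 320]
```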

TCP Congestion Control (Congestion Avoidance)

  • TCP switches from Slow-Start (SS) mode to Congestion Avoidance (CA) mode once cwnd reaches the SSThresh value.

  • SSThresh is initially set to an arbitrarily high value (e.g., the largest possible advertised window) to allow for probing for the full (unknown) bandwidth.

  • CA follows the Additive-Increase Multiplicative-Decrease (AIMD) concept.

  • If (RTO) { SSThresh = cwnd/2; cwnd = 1; } # back to slow start

  • If (DupACK_no == 3) { retransmit the packet; SSThresh = cwnd/2; cwnd = SSThresh; } # Fast Retransmit followed by Fast Recovery (i.e., skip slow start and continue from half the previous cwnd)

  • CA: for every ACK, cwnd = cwnd + 1/cwnd; ~1 extra packet per RTT (linear).
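
The rules of this section fit into one small state holder. This is a sketch, not a faithful TCP implementation: cwnd is counted in packets, and on loss SSThresh is assumed to be set to half of the current cwnd with a floor of 2 packets (the RFC 5681 reading of the halving rule). The class and method names are illustrative.

```python
class RenoLikeSender:
    def __init__(self, initial_window=10.0, ssthresh=float("inf")):
        self.cwnd = initial_window
        self.ssthresh = ssthresh

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1                  # slow start: doubles per RTT
        else:
            self.cwnd += 1 / self.cwnd      # congestion avoidance: ~+1 per RTT

    def on_triple_dupack(self):
        # Fast retransmit + fast recovery: halve and skip slow start
        self.ssthresh = max(self.cwnd / 2, 2)
        self.cwnd = self.ssthresh

    def on_rto(self):
        # Timeout: halve the threshold and start all over again
        self.ssthresh = max(self.cwnd / 2, 2)
        self.cwnd = 1


sender = RenoLikeSender(initial_window=10, ssthresh=16)
for _ in range(7):
    sender.on_ack()
print(sender.cwnd)        # 16.0625: six slow-start ACKs, then one CA ACK
sender.on_triple_dupack()
print(sender.cwnd)        # 8.03125
```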

Internet Transport Protocols: beyond TCP/UDP

  • Are there any other transports? Yes, plenty! E.g. SCTP (RFC 4960), DCCP (RFC 4340), QUIC (draft-ietf-quic-transport), and also extensions to TCP/UDP (MPTCP, UDP-Lite, RUDP, µTP/LEDBAT).

  • Some offer services that TCP/UDP don’t – e.g. partial reliability and multihoming, and multistreaming by SCTP.

  • Most aren’t used on the public Internet due to lack of middlebox support. Since the 1970s-80s the Internet has relied mainly on TCP and UDP (until very recently!)

  • Timeline includes ARPANET, TCP, UDP, RTP, SCTP, DCCP, UDP-Lite, RUDP, MPTCP.

Stream Control Transmission Protocol (SCTP)

  • Message-oriented data transfer (header chunks).

  • Provides reliability and congestion control; connection-oriented.

  • 4-way handshake on association establishment (exchange of cookies).

  • Provides multi-streaming, multi-homing, unordered reliable delivery.

  • Provides partial reliability (optional) (RFC 3758).

  • Many other features…

  • SCTP association establishment 4-way handshake: INIT, INIT-ACK, Cookie-Echo, Cookie-ACK.

Datagram Congestion Control Protocol (DCCP)

  • Message-oriented and unreliable and unordered data transfer.

  • Provides congestion control and ECN using ACKs.

  • Reliable connection setup and teardown.

  • Full-duplex bi-directional communication.

  • Each endpoint can negotiate congestion control mechanism on connection setup.

  • Suitable for interactive multimedia and gaming

    • Prevents HOL-blocking and retransmission (of expired packets) of TCP

    • Prevents congestion induced by UDP

    • Session-based therefore trackable by middleboxes

  • DCCP = UDP + congestion control or DCCP = TCP – bytestream semantics – full reliability

Quick UDP Internet Connections (QUIC)

  • Google has recently developed (2012) and increasingly deployed QUIC (Quick UDP Internet Connections) protocol – breaking the deployment impossibility cycle.

  • QUIC runs encrypted (TLS 1.3), encapsulated over UDP in order to bypass the middleboxes (e.g., routers) that wouldn’t allow anything to pass except TCP/UDP.

  • Userland, 0-RTT handshake with cookies, multiplexed in-order reliable stream-based transport (solves TCP’s HOL-blocking).

  • QUIC now accounts for 7.8% of total Internet traffic [APNIC, 2018].

TCP vs. UDP vs. QUIC

  • TCP:

    • Connection-oriented.

    • Byte-stream based (segments).

    • In-order delivery.

    • Reliability.

    • Flow control.

    • Congestion control.

    • Single stream

  • UDP:

    • Connection/state-less.

    • Message-based (datagrams).

    • No ordering guarantee (datagrams are delivered in order of arrival).

    • Unreliable.

    • No flow control (must be implemented in app).

    • No congestion control (must be implemented in app).

  • QUIC:

    • Connection-oriented.

    • Byte-stream based (segments).

    • In-order delivery.

    • Reliability.

    • Flow control.

    • Congestion control.

    • Multi-streaming/Multiplexing

Transport Layer over Wireless Medium (#1)

  • Wireless networks can be lossy – e.g., due to adverse channel conditions caused by:

    • Frame collisions due to contending hosts on the shared wireless medium (e.g., CSMA/CA-based 802.11 DCF).

    • Environmental noise leading to high bit-error rate (BER) and hence frame loss.

  • Most of the MAC frame losses are masked from transport layer by some form of MAC-level (L2) frame retransmission (retry) on the wireless segment of the end-to-end path

    • A fixed retry limit before discarding the frame (802.11 defaults: short retry limit of 7, long retry limit of 4).

  • Multi-rate retry chain: ([r0, c0], [r1, c1], [r2, c2], [r3, c3])

    • A member of a per-frame transmission descriptor that is stored in a FIFO queue

    • r corresponds to the modulation and coding scheme (MCS) used in each frame retry; c is the number of transmission attempts made at that rate.
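
A toy walk through a multi-rate retry chain, assuming each entry is read as (rate, number of attempts at that rate) and that the frame is discarded once the chain is exhausted. The rates, success probabilities, and function name are invented for the illustration and do not come from any driver.

```python
import random


def transmit_with_retry_chain(chain, success_prob, rng=random.random):
    """chain: list of (rate_mbps, max_attempts); success_prob: rate -> P(success per attempt).

    Returns (rate that succeeded or None, total attempts used).
    """
    attempts = 0
    for rate, count in chain:
        for _ in range(count):
            attempts += 1
            if rng() < success_prob[rate]:
                return rate, attempts
    return None, attempts   # chain exhausted: the MAC layer gives up on the frame


chain = [(54, 2), (24, 2), (6, 3)]            # fall back to more robust rates
noisy_channel = {54: 0.1, 24: 0.4, 6: 0.9}    # lower rates survive noise better
print(transmit_with_retry_chain(chain, noisy_channel))
```

Every extra attempt, especially at the lowest rates, adds delay that the transport layer sees only as a longer RTT.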

Transport Layer over Wireless Medium (#2)

  • Losses on wireless medium can potentially be unrelated to congestion in the network buffers

    • From DupACKs or timeouts, transport protocol (e.g., TCP) has no way of telling apart loss due to wireless noise from loss due to full network buffer

    • If BER is high on wireless channel (e.g., due to low SNR) and there are many MAC layer transmission retries, particularly with low(est) MCS indexes (i.e. bit-rates), this can impact the transport layer

      • Increase in TCP’s perceived end-to-end RTT, slowing down the TCP’s cwnd growth rate and hence sub-optimal utilization of the end-to-end capacity

      • Triggering transport layer loss due to 3-DupACKs/timeout

  • Some links such as GEO SATCOM are intrinsically “high latency”; speed-of-light is in fact slow! ;-)

    • TCP does not perform well on, and was not designed for, SATCOM links (average RTT of ~650 ms)
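
A back-of-the-envelope check of the "speed of light is slow" remark. It assumes a geostationary altitude of 35,786 km with the satellite directly overhead, and a path of client to satellite to ground station and back, so it gives a lower bound on RTT from propagation alone. The constant and function names are illustrative.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458
GEO_ALTITUDE_KM = 35_786          # altitude of a geostationary orbit


def geo_min_rtt_ms(altitude_km=GEO_ALTITUDE_KM):
    # Request: client -> satellite -> ground station; the reply retraces both hops
    hops = 4
    return hops * altitude_km / SPEED_OF_LIGHT_KM_S * 1000


print(round(geo_min_rtt_ms()))  # 477
```

The gap between this lower bound and the ~650 ms quoted above would come from longer slant paths, processing, queueing, and the terrestrial part of the route.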

Transport Layer over Wireless Medium (#3)

  • Unlike wired switched networks, wireless networks dynamically change:

    • Channel conditions e.g., noise, interference, contention (addressed earlier)

    • RTT (both as result of changing channel conditions and, also distance)

      • From a few tens of milliseconds (to regional CDNs on wired networks) up to several hundred milliseconds or a couple of seconds on busy/poor Wi-Fi links

    • Dynamic topology

      • Movement, node density, physical obstacles, distance,…

  • Wireless links have generally had lower bandwidth than wired ones, although this is changing on the access links!

    • 802.11 MAC layer bit-rate: 54 Mbps (802.11g)|~300 Mbps (802.11n)|~1.5 Gbps (802.11ac)|several Gbps (802.11ax)

    • Cellular peak data rate: 1 Gbps (4G, IMT-Advanced)|20 Gbps (5G, IMT-2020)

Transport Layer over Wireless Medium (#4)

  • Limitation of power consumption (e.g., cellphone battery life for 4G/WiFi)

  • Losses/retransmissions are more costly on wireless devices

  • Some wireless devices are resource-constrained -- e.g., IoT devices using LoRaWAN, Bluetooth Low Energy (BLE) or Zigbee (802.15.4).

  • Low-power resource-constrained devices normally can’t run full-stack transport protocols:

    • Instead run a lightweight version of transport/network protocols e.g., uTCP/uIP

    • Run their own wire protocol(s) without TCP/IP and use a gateway for TCP/IP comm.

    • Send very few, sporadic and small data packets -- e.g. cwnd=1

Mitigating Transport Performance over Wireless Links

  • Improving transport’s performance over wireless links is challenging due to the complex and dynamic nature of wireless access links

  • Transport (e.g. TCP) sender is not aware of the presence of wireless link on the path

  • Several possibilities:

    • Use of Explicit Congestion Notification (ECN) (RFC 3168) with Active Queue Management (AQM) on the wireless APs (and in the network)

    • Use of delay-based, model-based or hybrid (instead of loss-based) congestion control mechanisms

    • Use of “better” transport protocols (e.g., message based instead of byte-stream based)

      • However not possible to pick transport protocol based on a link segment on the end-to-end path

      • Lots of apps may break as they are tied to (and developed for) specific underlying transport protocol semantics

      • But the world is moving towards QUIC, so perhaps this is good news!

Explicit Congestion Notification (ECN)

  • ECN: network can explicitly signal congestion to the sender (via receiver)

    1. Negotiate ECN on connection establishment between sender/receivers

    2. If congested, routers will CE-mark a packet belonging to ECN-enabled connection

    3. Receiver will echo this CE-mark, using ECN-echo (ECE) until it receives a CWR bit

    4. Sender will reduce its sending rate (cwnd) and set CWR in the outgoing packet

    5. Upon seeing CWR, receiver stops echoing ECN (i.e. setting ECE on ACKs).

  • ECN bits in IP header: Type-of-Service (TOS) field (6-bit DSCP codepoint, 2 ECN bits): 00 (not ECN-capable); 01 (ECT(1), ECN-capable); 10 (ECT(0), ECN-capable); 11 (CE, i.e. congestion experienced)

  • ECN bits in TCP header: 2 last bits in “reserved bits” field (CWR and ECE)
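
The IP-header side of ECN is plain bit manipulation on the TOS byte: the upper six bits are the DSCP and the lower two are the ECN field. The helper names below are illustrative.

```python
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def split_tos(tos: int):
    """Split the 8-bit TOS byte into (DSCP, ECN)."""
    return tos >> 2, tos & 0b11


def mark_ce(tos: int) -> int:
    """What a congested ECN-aware router does to an ECN-capable packet."""
    dscp, ecn = split_tos(tos)
    if ecn == 0b00:
        raise ValueError("packet is not ECN-capable; the router has to drop it instead")
    return (dscp << 2) | 0b11             # keep the DSCP, set ECN to CE


tos = 0b10111010                          # DSCP 101110, ECN 10 = ECT(0)
dscp, ecn = split_tos(tos)
print(bin(dscp), ECN_NAMES[ecn])          # 0b101110 ECT(0)
print(ECN_NAMES[split_tos(mark_ce(tos))[1]])  # CE
```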

Active Queue Management (AQM)

  • AQM: dropping/marking packets randomly at the bottleneck link at the onset of congestion (i.e. before buffer is full) to signal to the sender to reduce its rate!

  • Many AQM mechanisms have been proposed over the years, starting with Random Early Detection (RED, also called Random Early Discard) by Sally Floyd and Van Jacobson; however, RED was too complex and required fine-tuning for different network conditions => little AQM deployment in practice!

  • Back in 2012, AQM was revitalized with new algorithms aiming to reduce the excessive latency on the Internet access links (a.k.a. bufferbloat)

  • ECN-enabled access link routers (e.g. WiFi APs) should use an ECN-supporting AQM for ECN-marking (e.g., FQ_CoDel)

  • (FQ_)CoDel (2012)

  • PIE (2013)

  • Adaptive RED (ARED) (2001)
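
To make "dropping/marking before the buffer is full" concrete, here is a simplified sketch of the classic RED drop curve: the probability is zero below a minimum threshold, rises linearly to max_p at a maximum threshold, and is 1 beyond it. The thresholds and weight are example values (the tuning burden mentioned above), and the refinements of the original algorithm, such as spacing drops by packet count, are omitted.

```python
def red_drop_probability(avg_queue, min_th=5, max_th=15, max_p=0.1):
    """Drop/mark probability as a function of the average queue length (packets)."""
    if avg_queue < min_th:
        return 0.0                         # no congestion yet
    if avg_queue >= max_th:
        return 1.0                         # persistent congestion: drop everything
    return max_p * (avg_queue - min_th) / (max_th - min_th)


def update_average(avg_queue, current_queue, weight=0.002):
    """Exponentially weighted moving average, so short bursts are tolerated."""
    return (1 - weight) * avg_queue + weight * current_queue


for q in (4, 10, 14, 20):
    print(q, red_drop_probability(q))
```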

Congestion Control and Wireless Networks (#1)

  • Loss-based TCP CC: performance is suboptimal over wireless links because packet loss caused by noise, interference or contention is misread as congestion

    • E.g. unnecessary reduction in TCP’s cwnd

    • Multiplicative decrease factor (beta): beta_std = 0.5 (RFC 6582); beta_cubic = 0.7 (RFC 8312) (default in Linux)

  • Packet loss is harder to recover from on high-latency paths (e.g., wireless)

  • Delay-based TCP CCs: observe the RTT (or OWD) trend

    • As old as Jain’s CARD in 1989

    • TCP Vegas, FAST TCP, TCP-Africa, CTCP, CAIA Delay Gradient (CDG), etc.

    • They differ in their measurement method, setting the thresholds, and cwnd adjustment

  • Model-based TCP CC: actively measures the available end-to-end capacity and stays around that value – e.g. Google’s BBR

    • Need to predict the base-RTT on the path

    • Unfairness when coexisting with loss-based TCP
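
A rough calculation of what one spurious (non-congestion) loss costs a loss-based sender. It applies the multiplicative decrease factors quoted above and, for the standard beta = 0.5 case only, estimates recovery time under the linear "+1 packet per RTT" growth of congestion avoidance. The function names and the example window and RTT are illustrative.

```python
import math

BETA_STD = 0.5      # standard (Reno-style) decrease factor
BETA_CUBIC = 0.7    # CUBIC decrease factor


def cwnd_after_loss(cwnd, beta):
    return cwnd * beta


def reno_recovery_rtts(cwnd, beta=BETA_STD):
    """RTTs needed to grow back to the old cwnd at +1 packet per RTT."""
    return math.ceil(cwnd - cwnd_after_loss(cwnd, beta))


cwnd, rtt_s = 100, 0.05
print(cwnd_after_loss(cwnd, BETA_STD))              # 50.0
print(reno_recovery_rtts(cwnd) * rtt_s, "seconds")  # 50 RTTs at 50 ms each
```

The same 50 RTTs take far longer on a high-latency path, which is one way to read the remark that loss is harder to recover from there.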

Congestion Control and Wireless Networks (#2)

  • TCP over SATCOM: TCP’s feedback loop is too long on the SATCOM links (~650ms)

  • This slows down the cwnd growth and makes packet losses (i.e. retransmissions) too costly!

  • HTTP request process: DNS lookup + TCP 3-way handshake + TLS 1.2 handshake + HTTP request = 4 RTT + DNS => over SATCOM (RTT ~650 ms): t > ~2.6 sec

  • TCP-splitting: a technique traditionally used by SATCOM providers to speed up the connection setup and cwnd growth using a Performance Enhancing Proxy (PEP)
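
The 4-RTT figure can be reproduced by counting round trips per step: 1 for the TCP handshake, 2 for a full TLS 1.2 handshake, and 1 for the HTTP request and response. The function and parameter names are illustrative, and DNS time is left as a separate input because it depends on caching.

```python
def http_request_time_s(rtt_s, dns_s=0.0, tcp_rtts=1, tls_rtts=2, http_rtts=1):
    """Time until the first HTTP response arrives, ignoring transfer and server time."""
    return dns_s + (tcp_rtts + tls_rtts + http_rtts) * rtt_s


print(http_request_time_s(0.65))                  # SATCOM, RTT of ~650 ms
print(http_request_time_s(0.03))                  # a wired path with a 30 ms RTT
```

Because every step costs whole round trips, shaving round trips (which is what a PEP or a 0-RTT handshake aims at) matters far more on such a link than raw bandwidth.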

Congestion Control and Wireless Networks (#3)

  • TCP splitting in PEP is done based on TCP packet header data

  • QUIC over SATCOM: QUIC’s encrypted headers make it hard for PEPs to split the connection i.e. PEPs don’t understand QUIC!

  • As QUIC becomes more prevalent over the public Internet, this is going to be a challenge for SATCOM Internet links

    • Some proposed solutions involve using a QUIC proxy in the future

    • Exposing part of the QUIC header publicly (WiP at the IETF)