Comprehensive Notes on Transport Layer Protocols
Transport Layer
What We Know So Far
Theory Development:
ISO developed/adopted the theory in the late 70s and early 80s.
Vint Cerf and Robert Kahn were key figures in 1974, referred to as "Packet Pushers."
Evolution of Data Transmission:
Raw bit streams are transmitted over a physical transmission medium.
Data frames are reliably transmitted between two nodes connected by a physical layer.
Multi-node/network data transfer includes network addressing, routing, and traffic control.
Reliable end-to-end communication for services/applications incorporates flow control, multiplexing, and connection-oriented communication.
Network Stack Layers (Bottom to Top):
Bits → Frames → Packets → Segments → Messages
Sample Network Stack in the Internet Reality
Headers of a Typical Packet in the AT&T Backbone Network:
The diagram illustrates the headers of a typical packet in the AT&T backbone network.
Headers lower in the diagram are outermost in the actual packet.
Examples of headers include HTTP, TCP, IP, IPsec, GTP, UDP, MPLS, and Ethernet.
Layers include Application, Transport, Network, and Data link/physical layers.
Layering:
The Internet architecture is a composition of a wide variety of networks.
Reference:
Pamela Zave and Jennifer Rexford. 2019. The compositional architecture of the internet. Commun. ACM 62, 3 (February 2019), 78–87. DOI:https://doi.org/10.1145/3226588
Why Layering Approach to the Network Model?
Separation: Breaks down data communication into smaller tasks/functions.
Abstraction: Changes in one layer have minimal impact on other layers.
Design: Simplifies implementation of functions/protocols as long as interconnection between layers is maintained.
Complexity: Eases learning, troubleshooting, and standardization.
Layering Approach to Network Model
Network layers should have different scopes or provide different functionalities within the same scope.
Diagram shows layering with different scopes (A, B, C) across Host A and Host B.
Question posed: What would middleboxes be called in the OSI model?
Transport Layer
Position in Network Stacks:
L4 in the OSI model.
L3 in the four-layer TCP/IP model.
Components:
Port number (e.g., BSD sockets API).
Transport Protocol Data Unit.
IP address.
Protocols: TCP, UDP, QUIC, SCTP
Transport Layer: Addressing
Role:
Provides end-to-end communication between two applications (transporting data from app A to remote app B on top of the network layer).
Addressing:
TSAPs (Transport Service Access Points).
NSAPs (Network Service Access Points).
Transport connections.
Transport Layer: Connection Process
Server Role:
A server runs on a server machine.
Acts as a proxy.
Listens to CONNECT requests.
Spawns the requested server, allowing it to inherit the connection when a request arrives.
Connection Establishment Example:
How a user process in host 1 can establish a connection with a time-of-day server in host 2.
Transport Layer: Addressing continued
Service Selection Using Port Numbers:
Port numbers are used to choose a service (e.g., each representing an application) during connection establishment.
Well-Known Ports (0-1023):
Maintained by the Internet Assigned Numbers Authority (IANA).
Reserved for services and applications.
See: https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml
Sample Well-Known Ports:
20: TCP/UDP - FTP Protocol (data) for transferring FTP data.
21: TCP/UDP - FTP Protocol (control) for FTP commands and flow control.
22: TCP/UDP - SSH (Secure Shell) for secure logins, file transfers, and port forwarding.
25: TCP/UDP - Simple Mail Transfer (SMTP).
53: TCP/UDP - Domain Name Server (DNS).
80: TCP/UDP - World Wide Web HTTP.
123: TCP/UDP - Network Time Protocol (NTP).
220: TCP/UDP - Interactive Mail Access Protocol v3 (IMAP3).
443: TCP/UDP - HTTP protocol over TLS/SSL (HTTPS).
Transport Layer: Addressing - Port Types & Ranges
As per RFC 6335 (RFC 1700 is obsolete; assignments are now kept in the IANA online registry):
Well-known Ports (0-1023):
Used by servers (e.g., web, email, DNS).
Registered Ports (1024-49151):
Assigned by IANA to a requesting entity but not controlled.
Used by client applications (e.g., port 6073 for directplay8 (Microsoft) for DirectX gaming & multimedia API).
Private and/or Dynamic Ports (49152-65535):
Assigned dynamically by the client OS to identify an application/service end-point.
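The three ranges above can be checked mechanically. The following is a minimal sketch; the function name `classify_port` and the returned labels are illustrative choices, not part of any standard API.

```python
def classify_port(port: int) -> str:
    """Return the name of the IANA range that a TCP/UDP port falls into."""
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    if port <= 1023:
        return "well-known"        # servers such as web, email, DNS
    if port <= 49151:
        return "registered"        # assigned by IANA on request
    return "dynamic/private"       # picked by the OS for client end-points
```

For example, `classify_port(6073)` falls in the registered range, while a typical OS-chosen source port such as 50001 is dynamic/private.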
Transport Layer: Addressing - Communication Process
What is needed to distinguish a particular communication process (i.e., conversation)?
5-Tuples:
Source port: Selected dynamically (e.g., by OS), used as return address.
Destination port: (e.g., port 80 for HTTP (web)).
Source IP address: (e.g., 192.168.1.5).
Destination IP address: (e.g., 192.168.1.1).
Protocol: (e.g., TCP or UDP).
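A rough sketch of how the 5-tuple can serve as a lookup key for conversations. The class name `FiveTuple`, the `reply` helper and the port numbers 50001/50002 are hypothetical and only for illustration; real stacks keep this state inside the kernel.

```python
from typing import NamedTuple


class FiveTuple(NamedTuple):
    """Identifies one conversation; hashable, so usable as a dict key."""
    protocol: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int

    def reply(self) -> "FiveTuple":
        """The same conversation as seen from the other end-point."""
        return FiveTuple(self.protocol, self.dst_ip, self.dst_port,
                         self.src_ip, self.src_port)


# Two connections from the same host to the same web server differ only
# in the dynamically chosen source port, yet remain distinguishable.
conn_a = FiveTuple("TCP", "192.168.1.5", 50001, "192.168.1.1", 80)
conn_b = FiveTuple("TCP", "192.168.1.5", 50002, "192.168.1.1", 80)
conversations = {conn_a: "browser tab 1", conn_b: "browser tab 2"}
```

Changing any single field (including the protocol) yields a different key, which is why all five values are needed.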
The Internet Transport Layer:
Services are mostly defined by two protocols:
UDP (connectionless): sends a “datagram”.
TCP (connection-oriented): transfers a reliable bytestream.
Addressing: port numbers.
Choosing a service during connection establishment.
Socket: One end-point to a two-way communication (e.g., 192.168.1.1:10).
Socket pairs: two ends of the communication (local and remote) – Berkeley sockets: TCP primitives.
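To make the socket and socket-pair idea concrete, here is a small sketch using Python's standard `socket` module over the loopback interface. The helper names `echo_once` and `tcp_roundtrip` are made up for this example, and it assumes the message is small enough to arrive in a single `recv` call.

```python
import socket
import threading


def echo_once(server_sock: socket.socket) -> None:
    """Accept one TCP connection and echo back the first chunk received."""
    conn, _peer = server_sock.accept()
    with conn:
        conn.sendall(conn.recv(1024))


def tcp_roundtrip(message: bytes) -> tuple:
    """Send message to a local echo server; return (reply, local, remote)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    server.listen(1)
    server.settimeout(5)
    server_addr = server.getsockname()

    worker = threading.Thread(target=echo_once, args=(server,), daemon=True)
    worker.start()

    with socket.create_connection(server_addr, timeout=5) as client:
        local = client.getsockname()   # (IP, dynamically chosen source port)
        remote = client.getpeername()  # (IP, server port)
        client.sendall(message)
        reply = client.recv(1024)

    worker.join(timeout=5)
    server.close()
    return reply, local, remote
```

The `local` and `remote` values returned here are the two sockets of the socket pair: the client's source port is chosen by the OS, while the server's port is the one it was bound to.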
Transport Layer Protocols
Services are offered to the application by transport layer API (e.g., Berkeley sockets API).
Services are implemented by transport protocols.
TCP and UDP have so far been the most widely used transport protocols:
UDP: Connection/state-less, no reliability, no flow/congestion control, message-based, hence sends a datagram.
TCP: Connection-oriented (stateful), reliable and in-order delivery, with flow/congestion control, byte-stream divided into segments.
QUIC: An alternative to TCP, rapidly gaining traction since 2012.
OSI/Internet Terminology:
TPDU (OSI) = Segment/Datagram (Internet).
PDU (protocol data unit): Data sent to peer protocol layer at the receiver end.
SDU (service data unit): Data sent from one layer to a lower layer.
Flow control: Do not exceed receiver’s available capacity.
Congestion control: Do not exceed network’s available capacity.
User Datagram Protocol (UDP)
Offers only two features over IP:
Ports.
Checksum.
UDP specifications in RFC 768 (1980).
Less overhead and delay, but unreliable, no flow control, no congestion control.
Used for live media streaming, DNS, SNMP, DHCP, VoIP, online games, IPTV
UDP = IP + 2 features: UDP header format
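The UDP header defined in RFC 768 is four 16-bit fields: source port, destination port, length and checksum. The sketch below packs and unpacks it with the standard `struct` module. The function names are illustrative, and the checksum is left at 0 (which in IPv4 means "not computed") rather than calculated over the pseudo-header.

```python
import struct

# Network byte order: source port, destination port, length, checksum.
UDP_HEADER = struct.Struct("!HHHH")


def build_udp_datagram(src_port: int, dst_port: int, payload: bytes,
                       checksum: int = 0) -> bytes:
    """Prepend an 8-byte UDP header; length covers header plus payload."""
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload


def parse_udp_datagram(datagram: bytes) -> dict:
    """Split a UDP datagram into its header fields and payload."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(datagram)
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "length": length,
        "checksum": checksum,
        "payload": datagram[UDP_HEADER.size:length],
    }
```

The whole header is 8 bytes, which illustrates how little UDP adds on top of IP.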
Transmission Control Protocol (TCP) (#1)
Specification defined in RFC 793 (1981).
Complex (85 pages) compared to RFC 768 (UDP) (3 pages only).
Full reliability established using acknowledgements (ACKs) and retransmissions; sequence numbers for in-order delivery.
Implements flow control and congestion control.
Used for web (HTTP), email (SMTP), file transfer (FTP).
Originally developed for DARPA.
More overhead and slower than UDP.
TCP encapsulation in IP: IP header, TCP header, TCP data (optional), IP packet, TCP segment.
Transmission Control Protocol (TCP) (#2)
Sequence No.:
of first byte in the segment.
If SYN bit is set, indicates the Initial Sequence Number (ISN) denoting the starting value of the byte-stream.
Acknowledgment No.:
If ACK bit is set, value of the next sequence number the sender of this segment expects to receive from its peer.
Hlen:
Header length; indicates where data begins.
Window:
The number of bytes receiver is willing to receive (receiver advertised window).
Checksum:
For error checking of segment header and data.
TCP Flags:
SYN (establish), ACK (acknowledge), RST (reset), FIN (terminate).
URG and PUSH flags are rarely used!
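The header fields listed above can be read out of the fixed 20-byte TCP header with the standard `struct` module. This is a minimal sketch: the function name and dictionary keys are illustrative, options are ignored, and the checksum is not verified.

```python
import struct

# Fixed part of the TCP header (20 bytes, network byte order):
# src port, dst port, seq no., ack no., data offset/reserved, flags,
# window, checksum, urgent pointer.
TCP_FIXED_HEADER = struct.Struct("!HHIIBBHHH")

FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed TCP header fields of a raw segment."""
    (src_port, dst_port, seq, ack, offset_byte, flag_byte,
     window, checksum, _urgent) = TCP_FIXED_HEADER.unpack_from(segment)
    hlen = (offset_byte >> 4) * 4          # Hlen is counted in 32-bit words
    flags = {name for name, bit in FLAG_BITS.items() if flag_byte & bit}
    return {
        "src_port": src_port, "dst_port": dst_port,
        "seq": seq, "ack": ack,
        "hlen": hlen, "flags": flags,
        "window": window, "checksum": checksum,
        "data": segment[hlen:],            # data begins where the header ends
    }
```

Hlen is stored in the upper four bits of its byte and counts 32-bit words, so a header without options has the value 5 (20 bytes).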
TCP Connection Establishment
3-way handshake method to establish a reliable connection:
(a) Client requests a connection by sending a SYN packet.
(b) Server acknowledges back with SYN/ACK.
(c) Client ACKs the server’s SYN/ACK.
Takes one Round-Trip Time (RTT) at minimum before the client can start sending data (1.5 RTTs until the final ACK reaches the server).
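The sequence and acknowledgment numbers exchanged in the three handshake steps can be traced with a short sketch. The function name, the dictionary layout and the ISN values in the usage note are illustrative; real ISNs are chosen randomly by each side.

```python
SEQ_SPACE = 2 ** 32   # TCP sequence numbers are 32 bits and wrap around


def three_way_handshake(client_isn: int, server_isn: int) -> list:
    """Return the three handshake segments as simple dictionaries."""
    syn = {"from": "client", "flags": {"SYN"},
           "seq": client_isn, "ack": None}
    # The SYN consumes one sequence number, so the server acks ISN + 1.
    syn_ack = {"from": "server", "flags": {"SYN", "ACK"},
               "seq": server_isn, "ack": (client_isn + 1) % SEQ_SPACE}
    ack = {"from": "client", "flags": {"ACK"},
           "seq": (client_isn + 1) % SEQ_SPACE,
           "ack": (server_isn + 1) % SEQ_SPACE}
    return [syn, syn_ack, ack]
```

With `three_way_handshake(1000, 5000)`, the server acknowledges 1001 and the client acknowledges 5001, so the first data byte in each direction carries sequence number ISN+1.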
TCP Connection Termination
4-way handshake method to terminate a connection (in fact, two two-way handshakes):
(a) FIN from Host A.
(b) ACK from Host B.
(c) FIN from Host B.
(d) ACK from Host A.
Question: Why 4-way (2x2) instead of 3-way?
Answer: TCP provides bi-directional data transfer. One side might still have data to send.
TCP in-order Data Delivery
TCP data stream can only be pushed to the application layer buffer in-order.
TCP packets/segments can arrive out-of-order (e.g., if they take different routes in the network or due to parallelism in the routers).
Using sequence numbers allows re-assembly of the data stream at the receiver side even in the presence of out-of-order segments.
Randomly chosen Initial Sequence Numbers (ISNs) are exchanged upon TCP connection establishment (SYN, SYN/ACK). They represent the starting value of byte-stream. Data begins at ISN+1.
Seqno. is incremented further as data is being transmitted by the sender.
TCP Error Control: ACKs
“Positive” acknowledgement (ACK) packets sent back from receiver to the sender.
ACKs are cumulative: an ACK carrying acknowledgment number $n$ acknowledges every byte up to $n-1$.
Duplicate ACK (DupACK): sent when receiver sees a gap between received segments; sender retransmits the missing segment after 3 DupACKs.
ACKs should be delayed (except when sending DupACKs) - ACK every 2 segments or once every 500ms (RFC 1122) (or 200ms in Microsoft Windows).
ACK packets are unreliable (less costly to drop than data packets).
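In-order delivery, cumulative ACKs and duplicate ACKs fit together as in the following simplified receiver. It is only a sketch under stated assumptions: segments never partially overlap, sequence numbers do not wrap, and ACKs are sent immediately (no delayed ACKs). The class and attribute names are hypothetical.

```python
class Receiver:
    """Toy TCP receiver: buffers out-of-order data, acks cumulatively."""

    def __init__(self, initial_seq: int):
        self.next_expected = initial_seq   # next in-order byte we need
        self.out_of_order = {}             # seq -> payload held back
        self.delivered = bytearray()       # what the application has seen

    def on_segment(self, seq: int, payload: bytes) -> int:
        """Process one segment and return the ACK number to send back."""
        if seq == self.next_expected:
            self._deliver(payload)
            # The gap is filled: hand over any buffered segments that follow.
            while self.next_expected in self.out_of_order:
                self._deliver(self.out_of_order.pop(self.next_expected))
        elif seq > self.next_expected:
            self.out_of_order[seq] = payload   # gap seen: buffer, DupACK
        # seq < next_expected: old duplicate, nothing to deliver
        return self.next_expected

    def _deliver(self, payload: bytes) -> None:
        self.delivered += payload
        self.next_expected += len(payload)
```

If 4-byte segments with sequence numbers 1, 9, 13, 17 and then 5 arrive, the receiver answers with ACKs 5, 5, 5, 5, 21: the three repeated ACKs for 5 are the DupACKs that would trigger a retransmission, and the final ACK covers all the buffered data at once.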
TCP Error Control: Timeout
Retransmission Timeout (RTO).
When timeout expires, missing packet is retransmitted and cwnd=1 (i.e., starts all over again).
Difficult to determine the right RTO value!
Too long: too slow to detect loss.
Too short: risk of false positives.
RTO is calculated based on RTT as laid out in RFC 6298 (RTO is rounded up to at least 1 sec).
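The RFC 6298 calculation can be sketched as below. The constants (alpha = 1/8, beta = 1/4, K = 4, 1-second floor and initial RTO) follow the RFC; the class name and the 1 ms clock granularity default are assumptions made for this example, and details such as timer back-off and Karn's algorithm are left out.

```python
class RtoEstimator:
    """Smoothed RTT / RTT variance tracker along the lines of RFC 6298."""

    ALPHA = 1 / 8     # gain for the smoothed RTT
    BETA = 1 / 4      # gain for the RTT variance
    K = 4
    MIN_RTO = 1.0     # seconds

    def __init__(self, clock_granularity: float = 0.001):
        self.g = clock_granularity
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0          # initial RTO before any measurement

    def update(self, rtt_sample: float) -> float:
        """Feed one RTT measurement (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement initialises both estimators.
            self.srtt = rtt_sample
            self.rttvar = rtt_sample / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - rtt_sample))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt_sample
        self.rto = max(self.MIN_RTO,
                       self.srtt + max(self.g, self.K * self.rttvar))
        return self.rto
```

With a 100 ms sample the computed value (0.3 s) is below the floor, so the RTO stays at 1 s; with a 650 ms sample the first RTO is 1.95 s.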
TCP Flow Control
Receiver advertised window (rwnd) sizes are exchanged during connection establishment.
Flow control: Limited receiver capacity.
Congestion control: Limited network capacity.
Diagram labels: sender (rate adjustment), network congestion, packet loss, receiver overflow, destination.
TCP Flow Control
Receiver advertised window (rwnd): Sliding window chosen based on the available TCP receiver buffer size.
Send window (congestion window): The number of bytes the TCP sender is allowed to inject into the network.
Congestion window (cwnd) is updated whenever an ACK is received and set based on the inferred congestion in the network, but the send window is capped at rwnd: send window $= \min(cwnd, rwnd)$.
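How the two windows combine can be shown in a few lines. This is a sketch only: the function name and the notion of `bytes_in_flight` (sent but not yet acknowledged) are introduced here for illustration.

```python
def usable_window(cwnd: int, rwnd: int, bytes_in_flight: int) -> int:
    """Bytes the sender may still transmit right now."""
    send_window = min(cwnd, rwnd)          # the tighter limit wins
    return max(0, send_window - bytes_in_flight)
```

For instance, with cwnd = 14600 bytes, rwnd = 65535 bytes and 4380 bytes in flight, the network is the limiting factor and 10220 more bytes may be sent; with a small rwnd of 8192 bytes already filled, the sender has to wait regardless of cwnd.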
TCP Congestion Control (Slow-Start)
Congestion control objective is to adapt to the available network capacity.
Congestion Window (cwnd): The number of bytes TCP sender can inject into the network before expecting to receive an ACK.
TCP starts with an Initial Window (IW) (initcwnd) of ~3 packets (since 2002), or 10 packets (since ~2013, also from Linux 2.6.39).
Then it probes for the available bandwidth in Slow-Start mode (exponential growth, i.e., binary search).
SS: for every ACK, $cwnd = cwnd + 1$ (in segments, i.e., one MSS per ACK); doubles every RTT (exponential).
TCP Congestion Control (Congestion Avoidance)
TCP leaves the slow-start (SS) mode to Congestion Avoidance (CA) mode after it reaches SSThresh value.
SSThresh: initially an arbitrary high value – e.g., largest possible advertised window.
CA follows the Additive-Increase Multiplicative-Decrease (AIMD) concept.
SSThresh is initially set to a large value to allow for probing for the full (unknown) bandwidth
If (DupACK_no == 3) {
    Retransmit the missing segment;   # Fast Retransmit
    SSThresh = cwnd/2;
    cwnd = SSThresh;                  # Fast Recovery (skip slow start; resume from half the previous cwnd)
}
CA: for every ACK, $cwnd = cwnd + 1/cwnd$ (in segments); ~1 extra packet per RTT (linear).
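The interplay of slow start, congestion avoidance and the reaction to three duplicate ACKs can be traced with a coarse, per-RTT simulation. This is a sketch with several simplifications: cwnd is counted in segments, growth is applied once per RTT instead of per ACK, slow start is clipped at SSThresh, and timeouts are not modelled. The function name and the `loss_at` parameter are made up for this example.

```python
def simulate_cwnd(rtts: int, ssthresh: float, loss_at: set,
                  initial_cwnd: float = 10) -> list:
    """Return cwnd (in segments) at the start of each RTT round."""
    cwnd = initial_cwnd
    trace = []
    for rtt in range(rtts):
        trace.append(cwnd)
        if rtt in loss_at:
            # 3 DupACKs: halve and continue in congestion avoidance.
            ssthresh = max(cwnd / 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)   # slow start: double per RTT
        else:
            cwnd += 1                        # congestion avoidance: +1 per RTT
    return trace
```

Calling `simulate_cwnd(8, ssthresh=64, loss_at={5})` gives 10, 20, 40, 64, 65, 66, 33, 34: exponential growth up to SSThresh, linear growth afterwards, and a halving (rather than a restart from the initial window) after the loss in round 5.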
Internet Transport Protocols: beyond TCP/UDP
Are there any other transports? Yes, plenty! E.g. SCTP (RFC 4960), DCCP (RFC 4340), QUIC (draft-ietf-quic-transport), and also extensions to TCP/UDP (MPTCP, UDP-Lite, RUDP, µTP/LEDBAT).
Some offer services that TCP/UDP don’t: e.g., SCTP offers partial reliability, multihoming, and multistreaming.
Most aren’t used on the public Internet due to lack of middlebox support. Since the 70s-80s, the Internet has relied mainly on TCP and UDP (until very recently!)
Timeline includes ARPANET, TCP, UDP, RTP, SCTP, DCCP, UDP-Lite, RUDP, MPTCP.
Stream Control Transmission Protocol (SCTP)
Message-oriented data transfer (header chunks).
Provides reliability and congestion control; connection-oriented.
4-way handshake on association establishment (exchange of cookies).
Provides multi-streaming, multi-homing, unordered reliable delivery.
Provides partial reliability (optional) (RFC 3758).
Many other features…
SCTP association establishment 4-way handshake: INIT, INIT-ACK, Cookie-Echo, Cookie-ACK.
Datagram Congestion Control Protocol (DCCP)
Message-oriented and unreliable and unordered data transfer.
Provides congestion control and ECN using ACKs.
Reliable connection setup and teardown.
Full-duplex bi-directional communication.
Each endpoint can negotiate congestion control mechanism on connection setup.
Suitable for interactive multimedia and gaming
Avoids TCP’s HOL blocking and its retransmission of expired packets
Avoids the congestion that uncontrolled UDP traffic can induce
Session-based therefore trackable by middleboxes
DCCP = UDP + congestion control or DCCP = TCP – bytestream semantics – full reliability
Quick UDP Internet Connections (QUIC)
Google has recently developed (2012) and increasingly deployed QUIC (Quick UDP Internet Connections) protocol – breaking the deployment impossibility cycle.
QUIC runs encrypted (TLS 1.3), encapsulated over UDP in order to bypass the middleboxes (e.g., NATs and firewalls) that wouldn’t allow anything to pass except TCP/UDP.
Userland, 0-RTT handshake with cookies, multiplexed in-order reliable stream-based transport (solves TCP’s HOL-blocking).
QUIC accounted for 7.8% of total Internet traffic as of 2018 [APNIC, 2018].
TCP vs. UDP vs. QUIC
TCP:
Connection-oriented.
Byte-stream based (segments).
In-order delivery.
Reliability.
Flow control.
Congestion control.
Single stream
UDP:
Connection/state-less.
Message-based (datagrams).
Delivered in order of arrival (no reordering).
Unreliable.
No flow control (must be implemented in app).
No congestion control (must be implemented in app).
QUIC:
Connection-oriented.
Byte-stream based (segments).
In-order delivery.
Reliability.
Flow control.
Congestion control.
Multi-streaming/Multiplexing
Transport Layer over Wireless Medium (#1)
Wireless networks can be lossy – e.g., due to adverse channel conditions caused by:
Frame collisions due to contending hosts on the shared wireless medium (e.g., CSMA/CA-based 802.11 DCF).
Environmental noise leading to high bit-error rate (BER) and hence frame loss.
Most of the MAC frame losses are masked from transport layer by some form of MAC-level (L2) frame retransmission (retry) on the wireless segment of the end-to-end path
A fixed retry limit (by default 7 (short) and 4 (long) in 802.11) applies before the frame is discarded.
Multi-rate retry chain: ([r0, c0], [r1, c1], [r2, c2], [r3, c3])
A member of a per-frame transmission descriptor that is stored in a FIFO queue
r corresponds to the modulation and coding scheme (MCS) used in each frame retry, and c to the number of attempts made at that rate.
Transport Layer over Wireless Medium (#2)
Losses on wireless medium can potentially be unrelated to congestion in the network buffers
From DupACKs or timeouts, transport protocol (e.g., TCP) has no way of telling apart loss due to wireless noise from loss due to full network buffer
If BER is high on wireless channel (e.g., due to low SNR) and there are many MAC layer transmission retries, particularly with low(est) MCS indexes (i.e. bit-rates), this can impact the transport layer
Increase in TCP’s perceived end-to-end RTT, slowing down the TCP’s cwnd growth rate and hence sub-optimal utilization of the end-to-end capacity
Triggering transport layer loss due to 3-DupACKs/timeout
Some links such as GEO SATCOM are intrinsically “high latency”; speed-of-light is in fact slow! ;-)
TCP does not perform well on, and was not designed for, SATCOM links ($RTT_{avg} \approx 650$ ms)
Transport Layer over Wireless Medium (#3)
Unlike wired switched networks, wireless networks dynamically change:
Channel conditions, e.g., noise, interference, contention (addressed earlier)
RTT (as a result of both changing channel conditions and distance)
RTTs range from a few tens of milliseconds (to regional CDNs over wired networks) to several hundred milliseconds, or even a couple of seconds, on busy/poor Wi-Fi links
Dynamic topology
Movement, node density, physical obstacles, distance,…
Generally used to have lower bandwidth although this is changing on the access links!
802.11 MAC layer bit-rate: 54 Mbps (802.11g)|~300 Mbps (802.11n)|~1.5 Gbps (802.11ac)|several Gbps (802.11ax)
Cellular peak data rate: 1 Gbps (4G, IMT-Advanced)|20 Gbps (5G, IMT-2020)
Transport Layer over Wireless Medium (#4)
Limitation of power consumption (e.g., cellphone battery life for 4G/WiFi)
Losses/retransmissions are more costly on wireless devices
Some wireless devices are resource-constrained -- e.g., IoT devices using LoRaWAN, Bluetooth Low Energy (BLE) or Zigbee (802.15.4).
Low-power, resource-constrained devices normally can’t run full-stack transport protocols:
Instead run a lightweight version of transport/network protocols e.g., uTCP/uIP
run their own wire protocol(s) without TCP/IP and use a gateway for TCP/IP comm.
Send very few, sporadic and small data packets -- e.g. cwnd=1
Mitigating Transport Performance over Wireless Links
Improving transport’s performance over wireless links is challenging due to the complex and dynamic nature of wireless access links
Transport (e.g. TCP) sender is not aware of the presence of wireless link on the path
Several possibilities:
Use of Explicit Congestion Notification (ECN) (RFC 3168) with Active Queue Management (AQM) on the wireless APs (and in the network)
Use of delay-based, model-based or hybrid (instead of loss-based) congestion control mechanisms
Use of “better” transport protocols (e.g., message based instead of byte-stream based)
However, it is not possible to pick a transport protocol based on a single link segment of the end-to-end path
Lots of apps may break as they are tied to (and developed for) specific underlying transport protocol semantics
But the world is moving towards QUIC, so perhaps this is good news!
Explicit Congestion Notification (ECN)
ECN: network can explicitly signal congestion to the sender (via receiver)
Negotiate ECN on connection establishment between sender/receivers
If congested, routers will CE-mark a packet belonging to ECN-enabled connection
Receiver will echo this CE-mark, using ECN-echo (ECE) until it receives a CWR bit
Sender will reduce its sending rate (cwnd) and sets CWR in the outgoing packet
Upon seeing CWR, receiver stops echoing ECN (i.e. setting ECE on ACKs).
ECN bits in IP header: Type-of-Service (TOS) field (6-bit DSCP codepoint, 2 ECN bits): 00 (Not-ECT, i.e. not ECN-capable); 10 (ECT(0), i.e. ECN-capable); 01 (ECT(1), ECN-capable); 11 (CE, i.e. congestion experienced)
ECN bits in TCP header: 2 last bits in “reserved bits” field (CWR and ECE)
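The IP-level ECN codepoints can be decoded with simple bit operations on the TOS byte. The sketch below also shows the marking step a congested ECN-aware router would take; the function names `split_tos` and `mark_ce` are illustrative, and the TCP-side ECE/CWR exchange is not modelled.

```python
ECN_CODEPOINTS = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def split_tos(tos: int) -> tuple:
    """Split the 8-bit TOS byte into (DSCP value, ECN codepoint name)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError(f"TOS must fit in one byte: {tos}")
    return tos >> 2, ECN_CODEPOINTS[tos & 0b11]


def mark_ce(tos: int) -> int:
    """Set Congestion Experienced on an ECN-capable packet.

    Packets that are not ECN-capable are returned unchanged; a real
    router would have to drop those instead of marking them.
    """
    if tos & 0b11 == 0b00:
        return tos
    return tos | 0b11
```

Marking only changes the two low-order bits, so the DSCP value of the packet is preserved.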
Active Queue Management (AQM)
AQM: dropping/marking packets randomly at the bottleneck link on the onset of congestion (i.e. before buffer is full) to signal to the sender to reduce its rate!
Many AQM mechanisms have been proposed over the years, starting with Random Early Detection (RED) by Sally Floyd and Van Jacobson; however, RED was too complex and required fine-tuning for different network conditions => no AQM deployment after all!
Back in 2012, AQM was revitalized with new algorithms aiming to reduce the excessive latency on the Internet access links (a.k.a bufferbloat)
ECN-enabled access link routers (e.g. WiFi APs) should use an ECN-supporting AQM for ECN-marking (e.g., FQ_CoDel)
(FQ_)CoDel (2012)
PIE (2013)
Adaptive RED (ARED) (2001)
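As a rough illustration of marking "before the buffer is full", the core of classic RED can be written in a few lines. This sketch covers only the linear marking probability and the averaged queue length; the count-based spacing of marks from the original algorithm is omitted. The function names and the default weight of 0.002 are assumptions chosen for the example.

```python
def update_avg_queue(avg_queue: float, current_queue: float,
                     weight: float = 0.002) -> float:
    """Exponentially weighted moving average of the queue length."""
    return (1 - weight) * avg_queue + weight * current_queue


def red_mark_probability(avg_queue: float, min_th: float,
                         max_th: float, max_p: float) -> float:
    """Probability of dropping/marking an arriving packet."""
    if avg_queue < min_th:
        return 0.0                      # no sign of congestion yet
    if avg_queue >= max_th:
        return 1.0                      # persistent congestion
    # Between the thresholds the probability rises linearly up to max_p.
    return max_p * (avg_queue - min_th) / (max_th - min_th)
```

The four parameters (`min_th`, `max_th`, `max_p` and the averaging weight) are the knobs that needed per-network tuning, which hints at why later algorithms such as CoDel and PIE tried to reduce or remove such parameters.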
Congestion Control and Wireless Networks (#1)
Loss-based TCP CC: performance is suboptimal over wireless links due to packet loss caused by noise, interference or contention
E.g. unnecessary reduction in TCP’s cwnd
Multiplicative decrease factor (beta): $\beta_{std}=0.5$, $\beta_{cubic}=0.7$ (RFC 8312) (default in Linux)
Packet loss is harder to recover from on high-latency paths (e.g., wireless)
Delay-based TCP CCs: observe the RTT (or OWD) trend
As old as Jain’s CARD in 1989
TCP Vegas, FAST TCP, TCP-Africa, CTCP, CAIA Delay Gradient (CDG), etc.
They differ in their measurement method, setting the thresholds, and cwnd adjustment
Model-based TCP CC: actively measures the available end-to-end capacity and stays around that value – e.g. Google’s BBR
Need to predict the base-RTT on the path
Unfairness when coexisting with loss-based TCP
Congestion Control and Wireless Networks (#2)
TCP over SATCOM: TCP’s feedback loop is too long on the SATCOM links (~650ms)
This slows down cwnd growth and makes packet losses (i.e. retransmissions) too costly!
HTTP request process: DNS lookup + TCP 3-way handshake + TLS 1.2 handshake + HTTP request = 4 RTTs + DNS => over SATCOM: t > ~2.6 sec
TCP-splitting: a technique traditionally used by SATCOM providers to speed up the connection setup and cwnd growth using a Performance Enhancing Proxy (PEP)
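The 4-RTT estimate for an HTTP request over SATCOM can be reproduced with a short calculation. The sketch assumes 1 RTT for the TCP handshake, 2 RTTs for a full TLS 1.2 handshake and 1 RTT for the HTTP request/response; the function name and the 30 ms terrestrial RTT in the usage note are illustrative values only.

```python
def time_to_first_response(rtt_s: float, dns_s: float = 0.0,
                           tcp_rtts: int = 1, tls_rtts: int = 2,
                           http_rtts: int = 1) -> float:
    """Lower bound (seconds) until the first HTTP response arrives."""
    return dns_s + (tcp_rtts + tls_rtts + http_rtts) * rtt_s
```

With a 650 ms RTT this gives 2.6 s before any DNS delay is added, compared with about 0.12 s on a 30 ms terrestrial path, which shows why splitting the connection close to the satellite link is attractive.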
Congestion Control and Wireless Networks (#3)
TCP splitting in PEP is done based on TCP packet header data
QUIC over SATCOM: QUIC’s encrypted headers make it hard for PEPs to split the connection i.e. PEPs don’t understand QUIC!
As QUIC becomes more prevalent over the public Internet, this is going to be a challenge for SATCOM Internet links
Some proposed solutions involve using a QUIC proxy in the future
Exposing some of QUIC header publicly (WiP at the IETF)