How TCP Works: Connections, Flow Control, and Congestion
The Transmission Control Protocol (TCP) is the reliable transport layer that underpins most of the internet. Every time you load a webpage, send an email, transfer a file, or query a database, TCP is the protocol ensuring that your data arrives completely, in order, and without corruption. While BGP decides which path packets take across the internet and IP addresses identify where they go, TCP handles how the data is reliably delivered end-to-end.
TCP is defined in RFC 793 (1981), with decades of refinements since. It operates at Layer 4 of the network stack, sitting on top of IP and below application protocols like HTTP, SMTP, and SSH. Understanding TCP is essential for anyone working with networks, because its behavior directly determines the throughput, latency, and reliability of nearly every internet application.
Connection Establishment: The Three-Way Handshake
TCP is a connection-oriented protocol. Before any data can be exchanged, the two endpoints must establish a connection through a process called the three-way handshake. This handshake serves three purposes: it verifies that both sides are reachable, it synchronizes initial sequence numbers, and it negotiates connection parameters via TCP options.
The handshake proceeds in three steps:
- SYN -- The client sends a segment with the SYN (synchronize) flag set. It picks a random Initial Sequence Number (ISN), say x, and includes TCP options advertising its capabilities (Maximum Segment Size, window scale, SACK support, timestamps).
- SYN-ACK -- The server responds with both SYN and ACK flags set. It picks its own random ISN y, acknowledges the client's sequence number by setting ack = x + 1, and sends its own TCP options.
- ACK -- The client sends a final ACK with ack = y + 1, confirming receipt of the server's sequence number. The connection is now established and data can flow in both directions.
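The sequence-number bookkeeping above can be sketched in a few lines of Python. This is a toy model: real ISNs come from the kernel's generator, and all sequence arithmetic is modulo 2^32.

```python
import random

def three_way_handshake():
    """Toy model of the sequence numbers exchanged in a TCP handshake."""
    client_isn = random.randrange(2**32)              # x: client's random ISN
    server_isn = random.randrange(2**32)              # y: server's random ISN

    syn     = {"flags": "SYN",     "seq": client_isn}
    syn_ack = {"flags": "SYN-ACK", "seq": server_isn,
               "ack": (syn["seq"] + 1) % 2**32}       # ack = x + 1
    ack     = {"flags": "ACK",
               "seq": syn_ack["ack"],                 # client continues at x + 1
               "ack": (syn_ack["seq"] + 1) % 2**32}   # ack = y + 1
    return syn, syn_ack, ack

syn, syn_ack, ack = three_way_handshake()
assert syn_ack["ack"] == (syn["seq"] + 1) % 2**32
assert ack["ack"] == (syn_ack["seq"] + 1) % 2**32
```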
The three-way handshake adds one full round-trip time (RTT) of latency before any application data can be sent. For a connection from New York to London (roughly 70ms RTT), that is 70ms of pure overhead. This is one reason why newer protocols like QUIC combine the transport and cryptographic handshakes to reduce connection setup time.
SYN Cookies and SYN Flood Protection
During the handshake, the server must allocate state (a Transmission Control Block) to track the half-open connection. This creates a vulnerability: an attacker can flood a server with SYN packets from spoofed source addresses, exhausting the server's memory for half-open connections. This is a SYN flood attack.
Modern operating systems defend against this with SYN cookies: instead of storing state for each SYN, the server encodes the connection parameters into the ISN itself. When the client completes the handshake with the final ACK, the server can reconstruct the connection state from the sequence number in the ACK. No state is stored for connections that never complete.
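A toy sketch of the idea follows. The packing scheme, secret, and field widths here are made up for illustration; real kernels use rotating secrets, a different bit layout, and a small MSS lookup table.

```python
import hmac, hashlib

SECRET = b"server-secret"  # hypothetical key; real kernels rotate secrets

def syn_cookie(src, sport, dst, dport, mss_idx, t):
    """Toy SYN cookie: a coarse timestamp (5 bits) and an MSS-table index
    (3 bits) in the high bits, authenticated by a keyed hash of the
    connection 4-tuple in the low 24 bits."""
    msg = f"{src}:{sport}-{dst}:{dport}-{t}".encode()
    mac = int.from_bytes(hmac.new(SECRET, msg, hashlib.sha256).digest()[:3], "big")
    return ((t & 0x1F) << 27) | ((mss_idx & 0x7) << 24) | mac

def check_cookie(cookie, src, sport, dst, dport, t):
    """Recompute the cookie from the ACK's 4-tuple; if it matches, the
    handshake completed and the MSS index can be recovered -- no stored state."""
    mss_idx = (cookie >> 24) & 0x7
    return mss_idx if syn_cookie(src, sport, dst, dport, mss_idx, t) == cookie else None

c = syn_cookie("203.0.113.5", 40000, "198.51.100.7", 443, mss_idx=5, t=100)
assert check_cookie(c, "203.0.113.5", 40000, "198.51.100.7", 443, t=100) == 5
```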
Sequence Numbers and Acknowledgments
TCP provides a reliable, ordered byte stream. To achieve this, it assigns a sequence number to every byte of data transmitted. The sequence number in each segment's header indicates the position of the first data byte in that segment within the overall stream.
When the receiver gets data, it sends back an acknowledgment (ACK) containing the sequence number of the next byte it expects. This is called a cumulative acknowledgment: ACK number N means "I have received all bytes up to N-1, send me byte N next."
For example, if the sender transmits 1000 bytes starting at sequence number 5000, the segment header says seq=5000 and carries bytes 5000-5999. The receiver ACKs with ack=6000, meaning "I got everything up to 5999, send me 6000 next."
Initial Sequence Numbers are chosen randomly (or pseudo-randomly) rather than starting at zero. This prevents segments from old, defunct connections from being confused with segments from a new connection between the same endpoints, and it makes sequence number prediction attacks harder.
The Sliding Window
TCP does not wait for an acknowledgment after every segment. That would be catastrophically slow -- on a 70ms RTT link, you could only send one segment per round trip, yielding a few hundred kilobits per second regardless of link capacity. Instead, TCP uses a sliding window that allows multiple segments to be in flight simultaneously.
The sender maintains a window of bytes it is allowed to send without waiting for acknowledgment. As ACKs arrive, the window "slides" forward, allowing new data to be sent. The size of this window is the minimum of two values:
- Receiver window (rwnd) -- how much buffer space the receiver has available (flow control)
- Congestion window (cwnd) -- how much the sender estimates the network can handle (congestion control)
The effective window at any time is min(rwnd, cwnd). This dual constraint ensures that TCP neither overwhelms the receiver nor congests the network.
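A minimal sketch of the sender-side arithmetic (the names are illustrative, not a real kernel API):

```python
def sendable(next_seq, last_acked, rwnd, cwnd):
    """Bytes the sender may still put on the wire: the effective window
    min(rwnd, cwnd) minus the data already in flight."""
    in_flight = next_seq - last_acked
    return max(0, min(rwnd, cwnd) - in_flight)

# Receiver allows 64 KB, congestion window is 32 KB, 20 KB is unacked:
# the congestion window is the binding constraint.
assert sendable(next_seq=20_000, last_acked=0, rwnd=65_535, cwnd=32_768) == 12_768
```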
Flow Control: The Receiver Window
Flow control prevents a fast sender from overwhelming a slow receiver. Each ACK includes a window size field that tells the sender how many bytes the receiver is willing to accept beyond the acknowledged sequence number. This is the receiver window (rwnd).
If the receiver's application is reading data slowly and the receive buffer fills up, the receiver advertises a smaller window. If the buffer fills completely, the receiver advertises a window of zero, and the sender must stop transmitting data. This is called a zero window condition. The sender periodically sends tiny window probe segments to check whether the receiver has freed up buffer space.
Window Scaling (RFC 7323)
The original TCP header allocates only 16 bits for the window size, limiting it to 65,535 bytes. On modern high-bandwidth, high-latency networks (imagine a 10 Gbps link with 100ms RTT), this window is far too small. The bandwidth-delay product of such a link is 125 MB -- you need a window that large to keep the pipe full.
The Window Scale TCP option, negotiated during the three-way handshake, provides a scaling factor (a shift count from 0 to 14) that multiplies the 16-bit window field. With a scale factor of 14, the maximum window size becomes 65,535 x 2^14 = over 1 GB. This option is universally supported by modern operating systems and is critical for high-performance networking.
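The arithmetic is easy to check directly; `min_window_scale` below is an illustrative helper, not a real API:

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return int(bandwidth_bps * rtt_s / 8)

def min_window_scale(window_bytes):
    """Smallest RFC 7323 shift count whose scaled 16-bit window covers the
    requested window (the protocol caps the shift at 14)."""
    for shift in range(15):
        if 65_535 << shift >= window_bytes:
            return shift
    raise ValueError("window exceeds the maximum scalable size")

bdp = bdp_bytes(10e9, 0.100)        # 10 Gbps link, 100 ms RTT
assert bdp == 125_000_000           # the 125 MB from the text
assert min_window_scale(bdp) == 11  # 65,535 << 11 is about 134 MB
```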
Congestion Control
While flow control prevents overwhelming the receiver, congestion control prevents overwhelming the network. If every TCP sender transmitted as fast as the receiver allowed, routers in the middle would overflow their buffers, drop packets, and the network would collapse. TCP's congestion control algorithms dynamically estimate how much data the network can carry and adjust the sending rate accordingly.
The sender maintains a congestion window (cwnd) that starts small and grows as the sender gains confidence that the network can handle more data. The core algorithms are:
Slow Start
When a connection begins (or after a timeout), the congestion window starts at a small value, typically a few segments (modern stacks use an initial window of 10 segments, or about 14 KB, per RFC 6928). For each ACK received during slow start, cwnd increases by one Maximum Segment Size (MSS). Since each RTT roughly doubles the window (one ACK per segment, each ACK adds one MSS), slow start produces exponential growth.
Slow start continues until one of three things happens: the window reaches the slow start threshold (ssthresh), a packet loss is detected, or the receiver window is reached. When cwnd reaches ssthresh, TCP transitions to congestion avoidance.
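The doubling is easy to simulate, assuming no loss and an unconstrained receiver window:

```python
def slow_start_rtts(target_bytes, init_cwnd=10 * 1460, mss=1460):
    """Round trips for cwnd to reach target_bytes during slow start.
    One ACK per segment, each ACK adds one MSS, so cwnd doubles per RTT."""
    cwnd, rtts = init_cwnd, 0
    while cwnd < target_bytes:
        cwnd *= 2
        rtts += 1
    return rtts

# From the RFC 6928 initial window (10 segments, ~14.6 KB) to a 1 MB window:
assert slow_start_rtts(1_000_000) == 7
```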
Congestion Avoidance
In congestion avoidance, the window grows much more conservatively: cwnd increases by roughly one MSS per entire round trip (not per ACK). This produces linear growth, probing for available bandwidth cautiously. The combination of exponential slow start followed by linear congestion avoidance is the classic TCP sawtooth pattern: the window grows linearly until a loss is detected, then drops sharply, and the cycle repeats.
Loss Detection and Response
TCP interprets packet loss as a signal of network congestion. There are two ways loss is detected, and the response differs:
- Timeout (RTO expiration) -- If an ACK does not arrive within the Retransmission Timeout, TCP assumes severe congestion. It resets cwnd to 1 MSS, sets ssthresh to half the current window, and re-enters slow start. This is the most drastic response.
- Triple duplicate ACKs (fast retransmit) -- If the sender receives three duplicate ACKs for the same sequence number, it infers that a single segment was lost but subsequent segments are arriving. This suggests mild congestion, not a collapse. TCP retransmits the lost segment immediately (without waiting for the timeout) and reduces cwnd to half, entering congestion avoidance from there. This is called fast recovery.
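The two responses can be sketched as a pure function with Reno-style constants (RFC 5681 also imposes the 2-MSS floor on ssthresh shown here):

```python
def on_loss(cwnd, mss, kind):
    """Return the new (cwnd, ssthresh) after a loss signal, Reno-style."""
    half = max(cwnd // 2, 2 * mss)        # ssthresh = max(FlightSize/2, 2*MSS)
    if kind == "timeout":                 # severe: collapse to 1 MSS, slow start
        return mss, half
    if kind == "triple_dupack":           # mild: fast recovery from half
        return half, half
    raise ValueError(kind)

mss = 1460
assert on_loss(80 * mss, mss, "timeout") == (mss, 40 * mss)
assert on_loss(80 * mss, mss, "triple_dupack") == (40 * mss, 40 * mss)
```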
Congestion Control Algorithms: Reno, CUBIC, and BBR
The basic framework described above is TCP Reno (RFC 5681). Over the decades, several improved congestion control algorithms have been developed to handle modern network conditions better.
TCP Reno
Reno implements the classic AIMD (Additive Increase, Multiplicative Decrease) behavior: increase cwnd by 1 MSS per RTT during congestion avoidance, halve it on loss. Reno's fast retransmit and fast recovery handle single segment losses well, but it struggles with multiple losses within the same window -- each loss triggers another halving, and recovery is slow.
TCP NewReno
NewReno (RFC 6582) improves Reno's handling of multiple losses. Instead of exiting fast recovery after the first retransmitted segment is acknowledged, NewReno stays in fast recovery until all segments that were outstanding at the time of the loss are acknowledged. This allows it to retransmit multiple lost segments within a single recovery episode.
TCP CUBIC
CUBIC (RFC 9438) is the default congestion control algorithm on Linux and most modern operating systems. Instead of the linear growth of Reno, CUBIC uses a cubic function to determine the window size. The window grows rapidly when far from the last congestion point, slows down as it approaches, and then grows rapidly again once it passes the previous maximum. This S-shaped growth curve means CUBIC is more aggressive in probing for bandwidth on high-capacity links while being cautious near the point where congestion was last observed.
The key advantage of CUBIC is that its window growth is independent of RTT. In Reno, a connection with a longer RTT grows its window more slowly (since growth happens once per RTT), penalizing long-distance connections. CUBIC's cubic function is based on elapsed time, not round trips, giving fairness across connections with different RTTs.
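The cubic function itself is compact. A sketch with the RFC 9438 constants, with the window measured in MSS units:

```python
def cubic_window(t, w_max, c=0.4, beta=0.7):
    """CUBIC window t seconds after the last loss event.
    Constants per RFC 9438: C = 0.4, multiplicative decrease beta = 0.7.
    K is the time at which the curve returns to the old maximum w_max."""
    k = ((w_max * (1 - beta)) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max

w_max = 100.0                     # window at the last congestion event
k = ((w_max * 0.3) / 0.4) ** (1 / 3)
assert abs(cubic_window(0, w_max) - 0.7 * w_max) < 1e-6  # starts at beta * w_max
assert abs(cubic_window(k, w_max) - w_max) < 1e-9        # flat near the old max
assert cubic_window(2 * k, w_max) > w_max                # aggressive beyond it
```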
TCP BBR (Bottleneck Bandwidth and RTT)
BBR, developed by Google and specified in IETF drafts rather than a published RFC, takes a fundamentally different approach. Rather than using packet loss as the congestion signal, BBR builds an explicit model of the network path by continuously estimating two parameters:
- Bottleneck bandwidth (BtlBw) -- the maximum delivery rate observed over a recent window of time
- Round-trip propagation delay (RTprop) -- the minimum RTT observed, representing the pure propagation delay without queuing
BBR sets its sending rate to match the estimated bottleneck bandwidth and its in-flight data limit to the bandwidth-delay product (BtlBw x RTprop). This approach avoids filling router buffers, resulting in lower latency under congestion compared to loss-based algorithms. BBR periodically probes for more bandwidth (by temporarily increasing the sending rate) and lower RTT (by temporarily reducing it).
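A sketch of BBR's operating point. The cwnd_gain of 2 mirrors BBR's default headroom for delayed and aggregated ACKs; exact parameters vary by BBR version.

```python
def bbr_limits(btlbw_bps, rtprop_s, cwnd_gain=2.0):
    """BBR's operating point: pace at the estimated bottleneck bandwidth,
    cap in-flight data near the bandwidth-delay product."""
    bdp = btlbw_bps * rtprop_s / 8            # BDP in bytes
    return btlbw_bps, int(cwnd_gain * bdp)    # (pacing rate, in-flight cap)

pacing_rate, inflight_cap = bbr_limits(100e6, 0.040)  # 100 Mbps, 40 ms RTprop
assert inflight_cap == 1_000_000              # 2 x the 500 KB BDP
```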
BBR is widely deployed on Google's servers and has shown significant throughput improvements on long-distance, high-bandwidth paths where loss-based algorithms underperform. However, it has also been shown to be somewhat unfair to competing loss-based flows in certain conditions, which BBRv2 and BBRv3 aim to address.
Fast Retransmit and Selective Acknowledgments (SACK)
With standard cumulative ACKs, the sender knows that everything up to the ACK number has been received, but it has no information about which segments beyond that point arrived. If segments 1, 2, 4, and 5 arrive but segment 3 is lost, the receiver can only ACK up to segment 2. The sender sees duplicate ACKs for segment 2 and retransmits segment 3, but it does not know whether it also needs to retransmit segments 4 and 5.
Selective Acknowledgments (SACK) -- RFC 2018
SACK solves this problem. When SACK is enabled (negotiated during the three-way handshake via TCP options), the receiver can report non-contiguous blocks of data it has received. In the example above, the receiver sends a duplicate ACK for segment 2 with a SACK block indicating "I also have segments 4-5." The sender now knows it only needs to retransmit segment 3, avoiding unnecessary retransmission of segments that already arrived.
SACK is critical for performance on lossy links or links with large bandwidth-delay products. Without SACK, recovering from multiple losses in a single window requires multiple round trips (one loss retransmitted per RTT). With SACK, the sender can retransmit exactly the missing segments in a single round trip. SACK support is nearly universal in modern TCP stacks.
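A toy SACK scoreboard, assuming fixed-size segments, shows how the sender narrows retransmission to the actual gap:

```python
def to_retransmit(cum_ack, sent_upto, sack_blocks, mss=1000):
    """Segment start offsets not covered by the cumulative ACK or any
    SACK block -- the only data worth retransmitting."""
    sacked = set()
    for lo, hi in sack_blocks:
        sacked.update(range(lo, hi, mss))
    return [s for s in range(cum_ack, sent_upto, mss) if s not in sacked]

# Five 1000-byte segments sent; segment 3 (bytes 2000-2999) was lost.
# The receiver ACKs cumulatively up to 2000 and SACKs bytes 3000-5000:
assert to_retransmit(2000, 5000, [(3000, 5000)]) == [2000]
```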
Duplicate SACK (D-SACK) -- RFC 2883
D-SACK extends SACK by allowing the receiver to report segments that arrived more than once. This helps the sender distinguish between actual packet loss and reordering: if the sender retransmits a segment and then receives a D-SACK indicating the original segment did arrive, it knows the loss detection was a false positive caused by reordering, and can adjust its behavior accordingly.
TCP Options
TCP options are variable-length fields negotiated during the three-way handshake (and sometimes carried on subsequent segments). They extend TCP's capabilities beyond what the fixed header provides. The most important options are:
- Maximum Segment Size (MSS) -- Each side advertises the largest segment it can receive. Typically 1460 bytes for IPv4 (1500 byte Ethernet MTU minus 20 bytes IP header minus 20 bytes TCP header) or 1440 for IPv6. MSS prevents IP fragmentation, which is costly.
- Window Scale -- As described above, a shift count that extends the 16-bit window field to support windows up to 1 GB.
- SACK Permitted / SACK -- Enables selective acknowledgments. The "SACK Permitted" option is sent in the SYN; actual SACK blocks are sent in subsequent ACKs.
- Timestamps (TSopt) -- Each segment carries a timestamp from the sender and an echo of the most recent timestamp received from the peer. Timestamps serve two purposes: they enable precise RTT measurement (used to compute retransmission timeouts), and they provide protection against wrapped sequence numbers (PAWS) on very fast connections where 32-bit sequence numbers could wrap within the MSL.
- TCP Fast Open (TFO) -- Allows data to be sent in the SYN packet of subsequent connections to the same server, eliminating the RTT overhead of the handshake for repeat connections. The server issues a cookie on the first connection that the client presents in future SYNs.
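On the wire, the options field is a simple kind/length/value encoding with two special one-byte kinds. A minimal parser, using the standard option kind numbers (MSS = 2, window scale = 3, SACK permitted = 4):

```python
def parse_tcp_options(raw):
    """Parse the raw TCP options bytes into {kind: value_bytes}."""
    opts, i = {}, 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:                     # End of Option List
            break
        if kind == 1:                     # NOP (one-byte padding)
            i += 1
            continue
        length = raw[i + 1]               # length covers kind + length + value
        opts[kind] = raw[i + 2:i + length]
        i += length
    return opts

# MSS=1460, NOP pad, window scale=7, SACK permitted:
raw = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 7, 4, 2])
opts = parse_tcp_options(raw)
assert int.from_bytes(opts[2], "big") == 1460
assert opts[3][0] == 7
assert 4 in opts                          # SACK permitted carries no value
```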
Connection Teardown and TIME_WAIT
TCP connections are terminated through a four-way handshake (or a three-way variant). Either side can initiate the close:
- The initiator sends a FIN (finish) segment, indicating it has no more data to send.
- The other side acknowledges the FIN with an ACK. At this point, the connection is half-closed: the initiator can no longer send data, but the other side still can.
- When the other side finishes sending data, it sends its own FIN.
- The initiator ACKs this FIN, and the connection is fully closed.
In practice, the second and third steps are often combined: the side that received the FIN sends its ACK and its own FIN together, making it a three-way close.
An abrupt close can also occur if either side sends a RST (reset) segment, which immediately terminates the connection without the graceful FIN exchange. RSTs are sent when a connection is refused, when data arrives for a connection that no longer exists, or when an application wants to abandon a connection immediately.
The TIME_WAIT State
After sending the final ACK, the initiator enters the TIME_WAIT state and remains there for twice the Maximum Segment Lifetime (2MSL; Linux uses a fixed 60 seconds, while the original spec's MSL of two minutes implies a four-minute wait). During TIME_WAIT, the socket is held open but cannot be reused for a new connection to the same remote address and port.
TIME_WAIT exists for two reasons:
- Reliable termination -- If the final ACK is lost, the peer will retransmit its FIN. The initiator must be around to re-send the ACK, or the peer will be stuck waiting.
- Duplicate segment protection -- Old segments from the just-closed connection might still be in transit. TIME_WAIT ensures they expire before a new connection reuses the same port pair. Without it, stale data from the old connection could corrupt the new one.
On busy servers that handle many short-lived connections, TIME_WAIT sockets can accumulate. Thousands of sockets sitting in TIME_WAIT is normal and usually not a problem, as they consume minimal resources (no file descriptors, no buffers, just a small kernel data structure). Linux provides tcp_tw_reuse to allow reuse of TIME_WAIT sockets for outgoing connections when it is safe (timestamps ensure old segments are rejected).
Retransmission Timeout (RTO) Calculation
TCP must decide how long to wait before assuming a segment was lost and retransmitting it. Wait too long and throughput suffers; retransmit too soon and the network is flooded with unnecessary duplicates.
TCP continuously measures round-trip time by observing how long it takes for an ACK to arrive after sending a segment. It maintains two state variables:
- SRTT (Smoothed RTT) -- An exponentially weighted moving average of observed RTTs
- RTTVAR (RTT Variance) -- A measure of RTT variability
The RTO is calculated as: RTO = SRTT + max(G, 4 * RTTVAR), where G is the clock granularity. This formula (RFC 6298) ensures the timeout adapts to the actual network conditions. On a low-latency LAN, the RTO might be a few milliseconds; on an intercontinental link, it might be several hundred milliseconds.
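A sketch of the estimator with RFC 6298's standard gains (alpha = 1/8, beta = 1/4) and its mandated 1-second minimum RTO (Linux uses a lower floor in practice):

```python
class RtoEstimator:
    """RFC 6298-style RTO computation; g is the clock granularity in seconds."""
    def __init__(self, g=0.001):
        self.g = g
        self.srtt = None
        self.rttvar = None

    def sample(self, rtt):
        """Feed one RTT measurement (seconds); return the current RTO."""
        if self.srtt is None:                       # first measurement
            self.srtt, self.rttvar = rtt, rtt / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt
        return max(1.0, self.srtt + max(self.g, 4 * self.rttvar))

est = RtoEstimator()
rto = est.sample(0.100)              # first sample: 100 ms RTT
assert abs(rto - 1.0) < 1e-9         # 0.1 + 4*0.05 = 0.3 s, floored to 1 s
```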
When timestamps are available (which they almost always are on modern stacks), RTT can be measured on every ACK rather than once per window, giving much more accurate estimates.
Nagle's Algorithm and Delayed ACKs
Nagle's Algorithm (RFC 896)
Without Nagle's algorithm, an application that writes data one byte at a time would generate a separate TCP segment for each byte. With the 40+ bytes of TCP/IP headers, this is enormously wasteful -- a small-packet problem that is the sender-side counterpart of silly window syndrome.
Nagle's algorithm addresses this: if there is unacknowledged data in flight, TCP buffers small writes and combines them into a single segment. Specifically, a small segment can only be sent if all previously sent data has been acknowledged, or if the buffer accumulates a full MSS worth of data. This dramatically reduces the number of tiny segments on the network.
However, Nagle's algorithm interacts poorly with delayed ACKs (where the receiver waits up to 40ms before sending an ACK, hoping to piggyback it on a response). The combination can add up to 40ms of latency on interactive applications. For this reason, latency-sensitive applications (like SSH, real-time games, and some HTTP implementations) disable Nagle's algorithm with the TCP_NODELAY socket option.
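In Python, for example, disabling Nagle on a socket looks like this (`TCP_NODELAY` is the standard option across platforms):

```python
import socket

# Create a TCP socket and disable Nagle's algorithm: small writes are then
# sent immediately instead of being coalesced behind unacknowledged data.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
assert sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 1
sock.close()
```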
Delayed ACKs (RFC 1122)
Rather than ACKing every segment immediately, TCP receivers can delay their ACK (typically up to 40ms on Linux; the spec caps the delay at 500ms, and 200ms is a common default elsewhere). This allows the ACK to be combined with outgoing data (piggybacking) and reduces the number of pure-ACK segments on the network. The receiver must ACK at least every other full-sized segment without delay.
TCP and the Network Path
TCP's performance is heavily influenced by the network path between endpoints. The AS path that BGP selects determines the physical route packets traverse, which directly affects RTT, available bandwidth, and loss rate -- all critical inputs to TCP's algorithms.
A connection routed through a short AS path with low latency will complete its slow start quickly and reach high throughput sooner. The same data transfer across a longer path with higher RTT and more potential for loss will perform worse, even if the raw link bandwidth is identical. You can use traceroute to measure the actual hop-by-hop latency of a path, and the looking glass to see the BGP-level route.
The interaction between TCP and the network path is why content delivery networks (CDNs) place servers at the edge of the network, close to users. By reducing the RTT, they dramatically improve TCP throughput. A page served from a CDN edge node 5ms away reaches full throughput in slow start within a handful of round trips, while the same page from a server 200ms away takes far longer.
TCP vs UDP and Modern Alternatives
TCP's reliability comes at a cost: the handshake latency, head-of-line blocking (where one lost segment stalls all subsequent data), and the complexity of congestion control. For applications that can tolerate loss or need lower latency, UDP offers a simpler, connectionless alternative with no built-in reliability.
Modern protocols increasingly build on UDP to get the best of both worlds:
- QUIC -- Developed by Google and standardized in RFC 9000, QUIC provides TCP-like reliability and congestion control over UDP, with integrated TLS 1.3 encryption and zero-RTT connection resumption. QUIC eliminates head-of-line blocking by multiplexing independent streams within a single connection. HTTP/3 uses QUIC as its transport.
- DTLS -- TLS over UDP, used by WebRTC and some VPN protocols when TCP's ordering guarantees are undesirable.
Despite these alternatives, TCP remains the dominant transport protocol on the internet. BGP itself runs over TCP (port 179), relying on TCP's reliable delivery for the routing updates that hold the internet together.
Putting It All Together
Every TCP connection you make -- whether browsing a website, querying an API, or downloading a file -- goes through the same lifecycle:
- Handshake -- Three-way SYN/SYN-ACK/ACK with option negotiation
- Slow start -- Exponential window growth to quickly discover available bandwidth
- Congestion avoidance -- Linear window growth to cautiously probe for more capacity
- Steady state -- Data flows with the window oscillating as congestion signals are received and the algorithms respond
- Teardown -- FIN/ACK exchange followed by TIME_WAIT
The interaction between TCP's algorithms and the underlying network path -- determined by BGP routing decisions, peering relationships, and physical geography -- ultimately determines the performance that applications and users experience.
Explore the Network
Use the BGP looking glass to explore the routing paths that TCP connections traverse. Look up any IP address to see the BGP route, origin AS, and AS path. Understanding the network path helps explain why some TCP connections are fast and others are slow -- it all comes down to the route.