How TCP Congestion Control Works: From Reno to BBR
TCP congestion control is the set of algorithms that govern how fast a TCP sender transmits data, preventing it from overwhelming the network while maximizing throughput. Without congestion control, every TCP connection would blast packets at the maximum rate the endpoints could handle, routers would overflow their buffers, and the network would enter congestion collapse -- a state where it carries mostly retransmitted packets and useful throughput drops to near zero. This actually happened in October 1986, when throughput between LBL and UC Berkeley dropped from 32 Kbps to 40 bps -- a factor of 800. Van Jacobson's subsequent work on congestion control algorithms, published in 1988, rescued the internet from these collapses and established the foundations that every TCP implementation still builds on today.
Congestion control operates in the sender's TCP stack, adjusting the congestion window (cwnd) -- the number of bytes the sender is allowed to have in flight (sent but not yet acknowledged) at any time. The actual sending window is min(cwnd, rwnd), where rwnd is the receiver's advertised window. While flow control prevents overwhelming the receiver, congestion control prevents overwhelming the routers and links between sender and receiver. Understanding congestion control is essential for diagnosing network performance -- a connection traversing a path you can trace in the BGP looking glass may have ample link capacity but still suffer poor throughput if the congestion control algorithm is fighting against bufferbloat, loss, or competing flows.
The Core State Variables: cwnd and ssthresh
Every TCP connection maintains two key congestion control variables:
- cwnd (congestion window) -- The sender's estimate of how many bytes the network can handle in flight. This starts small and changes constantly based on congestion signals. The initial value (IW) is typically 10 segments (~14,600 bytes) on modern stacks per RFC 6928, though the original specification started at just 1 segment.
- ssthresh (slow start threshold) -- The boundary between the slow start phase (exponential growth) and the congestion avoidance phase (linear growth). When a loss is detected, ssthresh is set to half the current cwnd, recording the sender's estimate of the network's capacity. On a fresh connection, ssthresh starts at an effectively infinite value, so slow start runs until either loss occurs or the receiver window is reached.
The interplay between these two variables defines the classic TCP sawtooth pattern: cwnd grows (exponentially then linearly), hits a loss, drops sharply, ssthresh is updated, and the cycle repeats. Different congestion control algorithms modify how cwnd grows, what constitutes a congestion signal, and how much cwnd drops in response -- but this fundamental framework persists across nearly all variants.
Slow Start
Slow start is TCP's mechanism for quickly discovering the available bandwidth on a new connection. Despite the name, it is actually quite fast -- the congestion window grows exponentially.
The algorithm works as follows: for each ACK received that acknowledges new data during slow start, cwnd increases by one MSS (Maximum Segment Size). Since a full window of data generates roughly one ACK per segment, cwnd approximately doubles every round-trip time:
- RTT 0: cwnd = 10 MSS, send 10 segments
- RTT 1: 10 ACKs arrive, cwnd = 20 MSS, send 20 segments
- RTT 2: 20 ACKs arrive, cwnd = 40 MSS, send 40 segments
- RTT 3: 40 ACKs arrive, cwnd = 80 MSS, send 80 segments
With an initial window of 10 segments and a typical MSS of 1,460 bytes, slow start reaches 1 MB of in-flight data within 7 RTTs. On a 20ms RTT connection, that is 140ms. On a 200ms transatlantic path, that is 1.4 seconds -- a meaningful delay for short transfers like web page loads, which is why CDNs place servers close to users.
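The doubling arithmetic above is easy to check with a toy model. This sketch (illustrative Python, not any stack's actual code) counts RTTs of pure slow start, assuming one ACK per segment and no delayed ACKs:

```python
def rtts_to_reach(target_bytes, iw_segments=10, mss=1460):
    """Count RTTs of pure slow start needed before cwnd covers target_bytes.

    Assumes one ACK per segment, so cwnd approximately doubles every RTT.
    """
    cwnd = iw_segments * mss       # initial window: 10 * 1460 = 14,600 bytes
    rtts = 0
    while cwnd < target_bytes:
        cwnd *= 2                  # +1 MSS per ACK ~= doubling per RTT
        rtts += 1
    return rtts

# Reaching 1 MB of in-flight data from IW=10 takes 7 RTTs:
print(rtts_to_reach(1_000_000))    # 7
```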
Slow start exits when any of three conditions is met:
- cwnd reaches ssthresh -- TCP transitions to congestion avoidance (linear growth).
- Packet loss is detected (timeout or triple duplicate ACKs) -- TCP responds according to its loss recovery algorithm.
- cwnd reaches rwnd -- The receiver's buffer is the bottleneck, not the network.
Congestion Avoidance
Once cwnd reaches ssthresh, TCP enters congestion avoidance. Here, growth becomes linear: cwnd increases by approximately one MSS per round-trip time, regardless of how many ACKs are received. The standard implementation increases cwnd by MSS * MSS / cwnd for each ACK, which sums to roughly one MSS per RTT when the full window is acknowledged.
This cautious probing is the "additive increase" half of TCP's AIMD (Additive Increase, Multiplicative Decrease) strategy. The window grows linearly until congestion is detected, at which point it is multiplicatively decreased (typically halved). AIMD has been mathematically proven to converge to a fair share of bandwidth when multiple flows compete on a shared bottleneck -- this convergence property is why TCP works at all on a shared network.
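The per-ACK increment described above sums to about one MSS of growth per RTT. A minimal sketch (illustrative, assuming one ACK per segment and no delayed ACKs):

```python
def ca_increase(cwnd, mss=1460):
    """Standard congestion-avoidance increment applied on each new ACK:
    cwnd += MSS*MSS/cwnd, which sums to ~1 MSS over a full window of ACKs."""
    return cwnd + mss * mss / cwnd

cwnd = 100 * 1460.0            # 100 segments in flight
for _ in range(100):           # one full window of ACKs ~= one RTT
    cwnd = ca_increase(cwnd)
print(cwnd / 1460)             # ~101 segments: roughly one MSS of growth per RTT
```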
Loss Detection: Timeouts and Duplicate ACKs
TCP treats packet loss as a congestion signal. There are two mechanisms for detecting loss, and the response to each is different because they indicate different severities of congestion:
Retransmission Timeout (RTO)
If an ACK does not arrive within the RTO (computed from smoothed RTT and RTT variance per RFC 6298), TCP assumes the segment was lost and the network may be severely congested -- perhaps a route has changed (which you could observe in the looking glass as a BGP path change) or a link has failed entirely. The response is drastic:
- Retransmit the lost segment
- Set ssthresh = cwnd / 2
- Reset cwnd to 1 MSS (or the initial window)
- Re-enter slow start
This is the harshest response because a timeout suggests the network is so congested that not even duplicate ACKs are getting through.
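The RTO itself is derived from smoothed RTT and RTT variance. A compact sketch of the RFC 6298 estimator (alpha=1/8, beta=1/4, K=4, with the RFC's 1-second minimum) -- a teaching model, not production code:

```python
class RtoEstimator:
    """RTO computation per RFC 6298. Times are in seconds."""
    ALPHA, BETA, K, MIN_RTO = 1 / 8, 1 / 4, 4, 1.0

    def __init__(self):
        self.srtt = None

    def on_rtt_sample(self, r):
        if self.srtt is None:          # first measurement: SRTT=R, RTTVAR=R/2
            self.srtt, self.rttvar = r, r / 2
        else:                          # EWMA updates (RTTVAR before SRTT)
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        # RTO = SRTT + K*RTTVAR, clamped to the 1 s floor (Linux uses a
        # lower floor in practice)
        return max(self.MIN_RTO, self.srtt + self.K * self.rttvar)

est = RtoEstimator()
print(est.on_rtt_sample(0.100))   # 1.0: max(1 s, 0.1 + 4 * 0.05)
```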
Fast Retransmit (Triple Duplicate ACKs)
When the receiver gets an out-of-order segment, it immediately sends a duplicate ACK for the last in-order byte. If the sender receives three duplicate ACKs for the same sequence number, it infers that a specific segment was lost but subsequent segments are still arriving -- the network is not completely overwhelmed, just experiencing mild congestion. The sender retransmits the missing segment immediately, without waiting for the expensive RTO timer. This is fast retransmit, and what happens next depends on the specific congestion control variant.
The Evolution of TCP Congestion Control
TCP Tahoe (1988)
TCP Tahoe, named after the 4.3BSD Tahoe release, was Van Jacobson's original implementation of congestion control. It introduced slow start, congestion avoidance, and fast retransmit. Tahoe's loss response is simple and blunt: on any loss detection -- whether by timeout or by triple duplicate ACKs -- Tahoe sets ssthresh to half the current cwnd, resets cwnd to 1 MSS, and re-enters slow start.
This behavior is correct but conservative. Even when triple duplicate ACKs indicate mild congestion (subsequent segments are still getting through), Tahoe restarts from scratch. On a high-bandwidth, high-latency path, re-entering slow start means the connection takes many RTTs to recover its previous throughput. For a 100ms RTT link where cwnd had reached 100 segments, Tahoe needs roughly 6 RTTs (600ms) of slow start just to climb back to the new ssthresh of 50 segments and re-enter congestion avoidance, and then many more RTTs to regain the previous rate.
TCP Reno (1990)
TCP Reno, from the 4.3BSD Reno release, added fast recovery to Tahoe's algorithm. The key insight: if the sender receives triple duplicate ACKs, the network is not completely collapsed -- those duplicate ACKs prove that segments are still being delivered. So instead of resetting to slow start, Reno halves the congestion window and enters congestion avoidance directly:
- Set ssthresh = cwnd / 2
- Retransmit the lost segment (fast retransmit)
- Set cwnd = ssthresh + 3 MSS (the "3" accounts for the three duplicate ACKs, indicating three segments beyond the lost one have left the network)
- For each additional duplicate ACK received, inflate cwnd by 1 MSS (each duplicate ACK indicates another segment has left the network, so one more can be injected)
- When the retransmitted segment is acknowledged, deflate cwnd back to ssthresh and resume normal congestion avoidance
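The inflate/deflate arithmetic above can be sketched in a few lines (illustrative only; real stacks track this state in segment and byte counters alongside the retransmission machinery):

```python
def reno_fast_recovery(cwnd, mss=1460):
    """Reno's response to triple duplicate ACKs (all values in bytes).
    Returns (ssthresh, cwnd_during_recovery)."""
    ssthresh = cwnd // 2           # multiplicative decrease: halve
    cwnd = ssthresh + 3 * mss      # inflate by the 3 dup ACKs already seen
    return ssthresh, cwnd

ssthresh, cwnd = reno_fast_recovery(100 * 1460)   # loss at 100 segments
cwnd += 1460        # each further duplicate ACK lets one more segment out
cwnd = ssthresh     # the full ACK ends recovery: deflate back to ssthresh
print(ssthresh // 1460)   # 50: recovery resumes congestion avoidance here
```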
This is the AIMD (Additive Increase, Multiplicative Decrease) pattern: linear growth during congestion avoidance (additive increase), halving on loss (multiplicative decrease). AIMD has a critical mathematical property -- it converges to fairness. If two Reno flows compete on a shared bottleneck, they will eventually reach equal bandwidth shares regardless of their starting points.
Reno's weakness is handling multiple losses within the same window. When Reno detects a loss via triple duplicate ACKs, it retransmits one segment and waits. If a second segment in the same window was also lost, the sender only discovers this when the ACK for the first retransmission arrives -- it then has to perform another fast retransmit (or wait for an RTO), halving cwnd again. With N losses in one window, cwnd is halved N times, collapsing the window far more than a single congestion event warrants. On lossy links or large windows, this multi-loss scenario is common.
TCP NewReno (RFC 6582)
NewReno specifically addresses Reno's multi-loss problem by modifying fast recovery. The key change: NewReno tracks which segments were outstanding when fast recovery began (using the "recovery point" -- the highest sequence number sent at the time). When an ACK arrives during fast recovery:
- If the ACK covers the recovery point (a full ACK), fast recovery is complete. Deflate cwnd to ssthresh and resume normal congestion avoidance.
- If the ACK does not cover the recovery point (a partial ACK), another segment in the original window was also lost. Retransmit the next unacknowledged segment immediately, without halving cwnd again. Stay in fast recovery.
This allows NewReno to recover from multiple losses in a single recovery episode, halving cwnd only once per loss event (one window of losses) rather than once per lost packet. NewReno was the standard TCP algorithm for many years and remains a well-understood baseline in both academic and production settings.
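The full-vs-partial ACK decision can be sketched as a simple dispatch (the function name and return convention here are invented for illustration; real implementations act on the socket's recovery state directly):

```python
def newreno_on_ack(ack_seq, recovery_point, ssthresh):
    """Dispatch an ACK received during NewReno fast recovery.

    recovery_point is the highest sequence number sent when recovery began.
    Returns (action, new_cwnd) where new_cwnd is None if recovery continues.
    """
    if ack_seq >= recovery_point:
        # Full ACK: everything outstanding at loss time is now covered.
        return "exit_recovery", ssthresh      # deflate cwnd, resume CA
    # Partial ACK: another segment from the original window was lost.
    return "retransmit_next", None            # stay in recovery, no re-halving

print(newreno_on_ack(5000, 8000, 73000))   # partial ACK: retransmit, stay in
print(newreno_on_ack(8000, 8000, 73000))   # full ACK: recovery complete
```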
SACK-Based Recovery
Selective Acknowledgments (SACK, RFC 2018) provide even better multi-loss recovery. With SACK, the receiver explicitly reports which non-contiguous blocks of data it has received. The sender knows exactly which segments are missing and can retransmit all of them within a single RTT, without waiting for partial ACK feedback. SACK-based loss recovery (RFC 3517) is more efficient than NewReno for bursty losses and is supported by virtually all modern TCP stacks. On paths where loss is common -- a scenario you can investigate by examining route stability in the looking glass -- SACK can dramatically improve recovery time.
TCP CUBIC: The Modern Default
CUBIC (RFC 9438) has been the default congestion control algorithm on Linux since kernel 2.6.19 (2006) and is now used by most operating systems including macOS and Windows. It was designed to address a fundamental limitation of Reno-style algorithms on modern networks: RTT unfairness.
In Reno and NewReno, cwnd grows by 1 MSS per RTT during congestion avoidance. This means a flow with a 10ms RTT grows its window 10 times faster than a flow with a 100ms RTT, even if both share the same bottleneck link. On today's networks, where high-bandwidth links span continents with RTTs of 100-300ms, Reno-style algorithms are painfully slow to utilize available capacity.
CUBIC replaces the linear growth function with a cubic function of elapsed time since the last loss event:
W(t) = C * (t - K)^3 + W_max
Where:
- W_max is the window size at which loss was last detected
- K = (W_max * (1 - beta) / C)^(1/3) is the time period to reach W_max again
- C is a scaling constant (0.4)
- beta is the multiplicative decrease factor (0.7 -- CUBIC reduces cwnd to 70% on loss, not 50% like Reno)
- t is elapsed time since the last loss event
The cubic function produces a distinctive S-shaped growth curve. After a loss, cwnd drops to 70% of its previous value. It then grows rapidly (concave region), slows down as it approaches the previous maximum W_max (plateau region where it probes cautiously), and then accelerates again once it passes W_max (convex region where it probes for new capacity). This behavior is intuitively sensible: be cautious near the point where congestion was last observed, but probe aggressively in unexplored territory.
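The window function is simple enough to compute directly. A sketch using the RFC 9438 constants (C=0.4, beta=0.7), with the window measured in MSS units:

```python
def cubic_window(t, w_max, c=0.4, beta=0.7):
    """CUBIC window (in MSS units) t seconds after a loss that occurred at
    window w_max. beta = 0.7 is the fraction of w_max kept after the loss."""
    k = ((w_max * (1 - beta)) / c) ** (1 / 3)   # seconds to climb back to w_max
    return c * (t - k) ** 3 + w_max

w_max = 100.0
k = ((w_max * 0.3) / 0.4) ** (1 / 3)    # ~4.2 s for these parameters
print(cubic_window(0.0, w_max))         # ~70: right after the loss
print(cubic_window(k, w_max))           # 100: back at w_max at t = K
print(cubic_window(2 * k, w_max))       # ~130: convex probing past w_max
```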
The critical property of CUBIC is that window growth is a function of real time, not RTTs. A CUBIC flow with a 200ms RTT grows at the same rate as one with a 20ms RTT. This makes CUBIC dramatically fairer across flows with different path latencies -- particularly important on the internet, where connections span wildly different distances. Two flows from different continents to the same server will converge to similar throughput, unlike Reno where the closer flow would dominate.
CUBIC also maintains a "TCP-friendly" mode: it computes what standard Reno's window would be and uses the maximum of the CUBIC window and the Reno window. This ensures CUBIC never gets less throughput than Reno, which matters on short-RTT, low-bandwidth paths where CUBIC's time-based growth might be slower than Reno's RTT-based growth.
TCP BBR: A Fundamentally Different Approach
BBR (Bottleneck Bandwidth and Round-trip propagation time), developed by Google and documented in a series of IETF drafts, represents the most significant rethinking of TCP congestion control since Jacobson's original work. While Tahoe, Reno, and CUBIC are all loss-based -- they interpret packet loss as the congestion signal -- BBR is model-based. It builds an explicit model of the network path and sets its sending rate to match the path's capacity, aiming to operate at the optimal point where throughput is maximized and delay is minimized.
The Optimal Operating Point
Consider a network path with a bottleneck link of bandwidth BtlBw and a base round-trip propagation delay of RTprop (the minimum RTT with no queuing). The bandwidth-delay product (BDP) is BtlBw * RTprop -- the amount of data that can be "in flight" (in the pipes and propagating) at any instant.
If the sender has exactly BDP bytes in flight, the bottleneck link is fully utilized (maximum throughput) and the router buffers are empty (minimum delay). This is Kleinrock's optimal operating point. If less data is in flight, the link is underutilized. If more data is in flight, the excess fills router buffers, adding queuing delay without increasing throughput -- the connection gets higher latency for no throughput benefit.
Loss-based algorithms like CUBIC operate far from this point. They keep increasing cwnd until packets are dropped, which means they consistently fill router buffers before backing off. The steady-state operating point of a loss-based algorithm is at the right edge of the buffer -- maximum queuing delay, with periodic loss. BBR aims to operate at the left edge -- full utilization with minimal queuing.
How BBR Works
BBR continuously estimates two quantities:
- BtlBw (bottleneck bandwidth) -- estimated as the maximum delivery rate observed over a sliding window of recent round trips. The delivery rate for each ACK is computed as the amount of data acknowledged divided by the time between when the data was sent and when the ACK arrived.
- RTprop (round-trip propagation time) -- estimated as the minimum RTT observed over a longer window (typically 10 seconds). Since any RTT measurement includes both propagation delay and queuing delay, the minimum RTT approximates the pure propagation delay.
BBR then sets its target sending rate to pacing_rate = pacing_gain * BtlBw and its target in-flight data to cwnd = cwnd_gain * BtlBw * RTprop. During steady state (the "ProbeBW" phase), BBR cycles through pacing gains of 1.25, 0.75, and 1.0 to periodically probe for more bandwidth and drain any queue it created. During "ProbeRTT" phases (triggered when RTprop has not been updated for 10 seconds), BBR temporarily reduces its cwnd to 4 segments to allow queues to drain, getting a clean RTprop measurement.
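A toy model of the two estimators and the gain-based targets described above (the max-filter/min-filter structure and gain values follow the text; the window sizes and sample counts here are simplified stand-ins for BBR's actual windowed filters):

```python
from collections import deque

class BbrModel:
    """Toy BBR path model: BtlBw is the max delivery rate over a recent
    window, RTprop the min RTT over a longer window."""
    def __init__(self):
        self.bw_samples = deque(maxlen=10)     # ~10 round trips of samples
        self.rtt_samples = deque(maxlen=100)   # stand-in for the 10 s window

    def on_ack(self, delivered_bytes, interval_s, rtt_s):
        self.bw_samples.append(delivered_bytes / interval_s)  # delivery rate
        self.rtt_samples.append(rtt_s)

    def btlbw(self):
        return max(self.bw_samples)            # bottleneck bandwidth estimate

    def rtprop(self):
        return min(self.rtt_samples)           # propagation delay estimate

    def pacing_rate(self, pacing_gain=1.0):
        return pacing_gain * self.btlbw()

    def cwnd(self, cwnd_gain=2.0):
        return cwnd_gain * self.btlbw() * self.rtprop()   # gain * BDP

m = BbrModel()
m.on_ack(125_000, 0.01, 0.040)   # 12.5 MB/s delivery sample, 40 ms RTT
m.on_ack(100_000, 0.01, 0.042)   # slower sample; max filter ignores it
print(m.btlbw(), m.rtprop())     # 12.5 MB/s, 40 ms
print(m.cwnd())                  # ~1,000,000 bytes: 2 * 12.5 MB/s * 40 ms
```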
BBR's Advantages
- Lower latency -- By targeting BDP rather than buffer capacity, BBR maintains lower queuing delay than loss-based algorithms. On paths with deep buffers, this can mean hundreds of milliseconds less latency.
- Better performance on lossy links -- BBR does not treat isolated packet loss as a congestion signal. On paths with non-congestion loss (wireless links, satellite links), BBR maintains throughput while CUBIC repeatedly shrinks its window. Google reported 2-25x throughput gains over CUBIC when deploying BBRv1 on its B4 wide-area network.
- Faster startup -- BBR's startup phase estimates BtlBw by doubling the sending rate each RTT (similar to slow start) but exits when the delivery rate plateaus (indicating the bottleneck is reached), rather than waiting for loss. This finds the bottleneck capacity without overshoot.
BBR's Challenges and Evolution
BBRv1 had significant problems in practice:
- Unfairness to loss-based flows -- BBRv1 could consume a disproportionate share of bandwidth when competing with CUBIC flows, because BBR does not back off on loss while CUBIC does. In some measurements, a single BBR flow took 40% of bandwidth while 15 CUBIC flows shared the remaining 60%.
- High retransmission rates -- BBRv1's bandwidth probing (sending at 1.25x estimated BtlBw) caused consistent packet loss in the probing phase, leading to retransmission rates of 1-5% even on uncongested paths.
- RTprop estimation challenges -- Accurately measuring base RTT requires emptying queues, which reduces throughput temporarily and may not produce accurate measurements when many flows share a path.
BBRv2 addressed fairness by incorporating loss as a secondary signal (reducing the sending rate when loss exceeds a threshold), using ECN marks when available, and being less aggressive during bandwidth probing. BBRv3 further refined these behaviors and is the version documented in the current IETF draft. The algorithm remains under active development and is deployed on Google's infrastructure, where it handles a significant fraction of all internet traffic.
Delay-Based Congestion Control: TCP Vegas
Before BBR, TCP Vegas (1994) pioneered the use of delay as a congestion signal. Vegas monitors the RTT of each segment and compares the actual throughput to the expected throughput:
- Expected throughput = cwnd / BaseRTT (where BaseRTT is the minimum observed RTT)
- Actual throughput = cwnd / RTT (where RTT is the current smoothed RTT)
- Diff = Expected - Actual
If Diff is small (below a threshold alpha), the connection is underutilizing the path and cwnd is increased. If Diff is large (above a threshold beta), data is accumulating in buffers and cwnd is decreased. Vegas aims to keep just enough data in flight to fill the pipe without building queues.
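A sketch of the Vegas decision. Here the rate difference is scaled by BaseRTT so the alpha/beta thresholds are expressed in segments queued in the network (the threshold values of 2 and 4 are illustrative defaults, not the original paper's exact settings):

```python
def vegas_adjust(cwnd, base_rtt, rtt, alpha=2.0, beta=4.0):
    """One Vegas window adjustment. cwnd is in segments, RTTs in seconds.

    diff estimates how many segments this flow has sitting in queues:
    (expected - actual throughput) * base_rtt.
    """
    expected = cwnd / base_rtt        # throughput if no queuing
    actual = cwnd / rtt               # throughput actually achieved
    diff = (expected - actual) * base_rtt
    if diff < alpha:
        return cwnd + 1               # path underutilized: grow
    if diff > beta:
        return cwnd - 1               # queue building: shrink
    return cwnd                       # in the sweet spot: hold

print(vegas_adjust(20, base_rtt=0.100, rtt=0.101))  # 21: barely any queuing
print(vegas_adjust(20, base_rtt=0.100, rtt=0.150))  # 19: RTT inflated by queues
```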
Vegas was conceptually elegant but failed in practice for one devastating reason: it is unfair against loss-based algorithms. When a Vegas flow competes with a Reno flow on a shared bottleneck, Vegas detects the increasing RTT (caused by Reno filling buffers) and backs off. Reno, oblivious to delay, keeps increasing its window. The result is that Reno takes most of the bandwidth and Vegas starves. Since the internet is dominated by loss-based TCP, deploying Vegas unilaterally puts you at a disadvantage. This "deployment incentive" problem is why delay-based algorithms remained niche for decades until BBR found a way to combine delay awareness with competitive performance.
Explicit Congestion Notification (ECN)
All the algorithms described so far infer congestion indirectly -- from packet loss (Reno, CUBIC) or from RTT changes (Vegas, BBR). ECN (RFC 3168) provides an explicit congestion signal: routers can mark packets rather than dropping them, signaling congestion before any loss occurs.
ECN uses two bits in the IP header (the ECN field) and two flags in the TCP header (ECE and CWR):
- Negotiation -- During the TCP handshake, the initiator sets both ECE and CWR in its SYN to request ECN; the peer agrees by setting ECE (alone) in its SYN-ACK.
- Marking -- When an ECN-capable router experiences congestion (its queue depth exceeds a threshold), it sets the CE (Congestion Experienced) bit in the IP header of passing packets instead of dropping them.
- Feedback -- The TCP receiver sees the CE mark and sets the ECE (ECN-Echo) flag in its next ACK to the sender.
- Response -- The sender sees the ECE flag and reduces its cwnd (just as it would for a loss event). It sets the CWR (Congestion Window Reduced) flag to tell the receiver it has responded.
ECN has a major advantage: the sender learns about congestion before any packet is actually dropped. With loss-based congestion control, the first signal of congestion is a lost packet, which requires retransmission and degrades application performance. With ECN, the congestion signal arrives in a normal ACK, and the sender can reduce its rate proactively. This is particularly valuable for short flows (like web page loads) where a single lost packet can add an entire RTT of delay.
ECN adoption has been gradual but is now widespread. Linux, macOS, Windows, and iOS all support ECN. Most major CDN providers enable it, and modern AQM algorithms like CoDel and PIE work best with ECN because they can signal congestion at lower queue depths. QUIC also supports ECN, carrying the signals in its own frame types since QUIC does not use TCP headers.
Bufferbloat and Active Queue Management
Loss-based congestion control assumes that packet loss is the signal of congestion. This assumption fails badly when router buffers are excessively large -- a problem known as bufferbloat.
The Bufferbloat Problem
In the early internet, router buffers were small (often sized at the bandwidth-delay product of the link). Buffers would fill quickly, packets would be dropped, and TCP would reduce its sending rate. The feedback loop was fast and effective. But over the years, memory became cheap and network equipment vendors added increasingly large buffers, reasoning that bigger buffers prevent packet loss and therefore improve performance.
The opposite happened. With buffers large enough to hold hundreds of milliseconds or even seconds of data, loss-based TCP fills the entire buffer before experiencing any loss. The connection achieves good throughput, but at the cost of enormous latency. A home router with a 300ms buffer on a 10 Mbps link means that a TCP flow filling that buffer adds 300ms of queuing delay to every packet -- including packets from other flows like VoIP calls, video conferencing, and interactive gaming that share the same bottleneck link.
Bufferbloat is pervasive on consumer internet connections, where cable modems, DSL modems, and home routers routinely have buffers sized for 500ms-2000ms of data. A user running a large download will notice their video calls stutter, web browsing slows to a crawl, and latency to game servers spikes from 20ms to 500ms. The download itself shows high throughput -- the problem is the collateral damage to everything else.
Active Queue Management (AQM)
AQM algorithms solve bufferbloat by having routers intelligently drop or ECN-mark packets before the buffer is full, keeping queue occupancy low and latency bounded. The key AQM algorithms are:
RED (Random Early Detection)
RED (1993) was the first widely deployed AQM. It monitors the average queue length and probabilistically drops packets as the queue grows. When the average queue is below a minimum threshold, no packets are dropped. Between the minimum and maximum thresholds, the drop probability increases linearly. Above the maximum threshold, all packets are dropped. RED's randomized dropping spreads the congestion signal across flows, avoiding the global synchronization problem where all flows drop simultaneously, back off, and then surge together.
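The linear ramp can be written down directly. A simplified sketch of the drop-probability curve (real RED also applies a count-based correction between the thresholds and averages the queue length with an EWMA, both omitted here):

```python
def red_drop_probability(avg_queue, min_th, max_th, max_p=0.1):
    """Simplified RED: zero drops below min_th, a linear ramp up to max_p
    at max_th, and certain drop above max_th. avg_queue is the averaged
    queue length in packets."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)

print(red_drop_probability(10, min_th=20, max_th=80))   # 0.0: queue is short
print(red_drop_probability(50, min_th=20, max_th=80))   # 0.05: halfway up
print(red_drop_probability(90, min_th=20, max_th=80))   # 1.0: queue overloaded
```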
RED's weakness is that it requires careful tuning of its parameters (min threshold, max threshold, drop probability, averaging weight) for each link speed. Poorly tuned RED can be worse than simple tail-drop. This tuning difficulty limited RED's deployment in practice.
CoDel (Controlled Delay)
CoDel (2012, RFC 8289), designed by Van Jacobson and Kathleen Nichols, takes a fundamentally different approach. Instead of monitoring queue length, CoDel monitors the sojourn time of packets -- how long each packet actually spends in the queue. If the minimum sojourn time over an interval exceeds a target (default 5ms), CoDel begins dropping packets at an increasing rate. If the sojourn time drops below the target, CoDel stops dropping immediately.
CoDel's elegance is that it has no tunable parameters that depend on link speed. The 5ms target and 100ms interval work across a wide range of link speeds because CoDel tracks the actual delay experienced by packets, not the buffer occupancy. A burst of packets that arrives and quickly drains does not trigger drops (the minimum sojourn time stays low), while a standing queue that persistently delays packets does trigger drops. This automatically distinguishes between good queues (transient bursts) and bad queues (standing congestion).
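CoDel's control law is also compact: once the minimum sojourn time has stayed above the target for a full interval, drops are scheduled closer and closer together, spaced interval/sqrt(count). A sketch of just that spacing rule (the surrounding dequeue state machine is omitted):

```python
import math

TARGET = 0.005      # 5 ms sojourn-time target
INTERVAL = 0.100    # 100 ms observation interval

def next_drop_interval(drop_count, interval=INTERVAL):
    """Time until the next drop while the queue stays above TARGET:
    interval / sqrt(count), so pressure increases the longer the
    standing queue persists."""
    return interval / math.sqrt(drop_count)

# Successive drops arrive at 100 ms, ~70.7 ms, ~57.7 ms, ... apart:
print([round(next_drop_interval(n), 4) for n in (1, 2, 3)])
```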
PIE (Proportional Integral controller Enhanced)
PIE (RFC 8033) uses control theory to manage queue delay. It periodically estimates the current queuing delay and adjusts the drop probability using a proportional-integral (PI) controller to drive the delay toward a target. PIE is computationally lighter than CoDel (it only needs to calculate the drop probability periodically, not examine every packet's sojourn time) and has been adopted by DOCSIS 3.1 cable modems as the standard AQM algorithm, meaning it is deployed in hundreds of millions of cable modems worldwide. If you are browsing the looking glass from a cable internet connection, PIE is likely managing the upstream queue on your modem.
fq_codel (Fair Queuing + CoDel)
fq_codel combines CoDel with stochastic fair queuing. Rather than a single queue, it hashes flows into many sub-queues (typically 1024) and services them round-robin. Each sub-queue has its own CoDel instance. This provides both latency control (CoDel keeps queues short) and fairness (round-robin prevents any single flow from monopolizing the link). A bulk download gets its own sub-queue and does not delay packets from other flows.
fq_codel has shipped with the Linux kernel since 3.5 and is the default queue discipline on most modern distributions (typically set via systemd's default-qdisc setting) and in OpenWrt routers, making it the most common AQM in home network equipment. The combination of zero-configuration operation, strong fairness, and effective latency control has made it a remarkable success story in practical AQM deployment.
Fairness: Convergence, RTT Bias, and Inter-Protocol Competition
When multiple TCP flows share a bottleneck link, the congestion control algorithms must converge to a fair allocation of bandwidth. What "fair" means, and how well each algorithm achieves it, is a central concern in congestion control design.
Jain's Fairness Index
Fairness is typically measured using Jain's fairness index: for N flows with throughputs x_1 through x_N, the index is (sum(x_i))^2 / (N * sum(x_i^2)). The index ranges from 1/N (maximally unfair -- one flow gets everything) to 1.0 (perfectly fair -- all flows get equal throughput). Reno-based AIMD provably converges to a fairness index of 1.0 for flows with the same RTT.
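The index is a one-liner to compute. For example (the throughput values are illustrative):

```python
def jain_index(throughputs):
    """Jain's fairness index: (sum x_i)^2 / (N * sum x_i^2)."""
    n = len(throughputs)
    return sum(throughputs) ** 2 / (n * sum(x * x for x in throughputs))

print(jain_index([10, 10, 10, 10]))         # 1.0  -- perfectly fair
print(jain_index([40, 0, 0, 0]))            # 0.25 -- one flow takes everything
print(round(jain_index([40, 20, 20, 20]), 3))   # one flow gets a double share
```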
RTT Unfairness
The qualifier "same RTT" is important. Reno's additive increase adds 1 MSS per RTT, so a flow with a 10ms RTT increases its window 20 times faster than a flow with a 200ms RTT. In equilibrium, the short-RTT flow gets approximately 20x the bandwidth. This RTT unfairness is inherent in all RTT-counting algorithms. CUBIC's time-based growth function largely eliminates this bias, which is one of its primary advantages.
Inter-Protocol Fairness
When different congestion control algorithms compete, fairness is not guaranteed. BBRv1 was notably unfair to CUBIC, and delay-based algorithms like Vegas lose to loss-based algorithms. The internet is a heterogeneous environment where flows from different operating systems, with different algorithms and parameters, share bottleneck links. The practical result is that congestion control is always an approximation of fairness, not a guarantee. BBRv3 incorporates loss signals specifically to coexist better with CUBIC, and CUBIC's TCP-friendly mode ensures coexistence with Reno.
Congestion Control in Practice: Tuning and Selection
On Linux, the congestion control algorithm can be set per-socket or system-wide:
- sysctl net.ipv4.tcp_congestion_control -- shows the current default algorithm
- sysctl net.ipv4.tcp_available_congestion_control -- lists available algorithms
- sysctl net.ipv4.tcp_allowed_congestion_control -- lists algorithms unprivileged users can select
- setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", 3) -- sets the algorithm per-socket
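The per-socket option is accessible from any language; for instance, Python's socket module exposes TCP_CONGESTION on Linux. A sketch (Linux-only; the option value is a NUL-padded algorithm name):

```python
import socket

def current_cc(sock):
    """Read the congestion control algorithm on a Linux TCP socket."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    return raw.split(b"\0", 1)[0].decode()

def set_cc(sock, name):
    """Select an algorithm per-socket, e.g. 'reno', 'cubic', or 'bbr'.
    Unprivileged users may only choose names listed in
    tcp_allowed_congestion_control (reno is always permitted)."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, name.encode())

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(current_cc(s))      # typically 'cubic' on stock Linux
set_cc(s, "reno")
print(current_cc(s))      # 'reno'
s.close()
```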
Most Linux distributions default to CUBIC. Google's servers use BBR. Content delivery networks often use BBR for long-distance, high-bandwidth transfers and CUBIC for everything else. The choice depends on the workload:
- CUBIC -- Good default for most workloads. Well-tested, fair to other CUBIC flows, widely deployed. Best for general-purpose servers and short-to-medium RTT paths.
- BBR -- Better for high-bandwidth, high-latency paths (transcontinental transfers, satellite links). Better on paths with non-congestion loss (wireless). More aggressive against competing loss-based flows (improving in BBRv3). Use on Google Cloud, YouTube, and similar infrastructure.
- Reno/NewReno -- Rarely chosen intentionally anymore, but still the default on some older or embedded systems. Useful as a baseline for testing.
Initial Window Size
The initial congestion window (IW) determines how much data TCP can send in the first round trip. The original specification set IW to 1 segment. RFC 3390 (2002) allowed IW of up to 4 segments. RFC 6928 (2013) increased it to 10 segments (~14,600 bytes). Google's experiments showed that increasing IW to 10 reduced average page load time by 10% with minimal impact on congestion, because most web page downloads complete within the initial slow start phase.
Some operators experiment with even larger initial windows (16, 32, or 64 segments) for specific use cases like video streaming initialization, though this remains non-standard and can increase loss rates on congested paths.
TCP Pacing
Traditional TCP sends data in bursts: when an ACK arrives, the sender immediately sends as much data as the window allows. This burstiness can overwhelm shallow router buffers, causing momentary congestion and loss even when the average sending rate is sustainable. TCP pacing spaces out packet transmission evenly over each RTT, converting bursts into a smooth stream.
BBR inherently paces its transmissions (the pacing rate is a core part of the algorithm). For CUBIC and other algorithms, Linux provides the fq (Fair Queuing) packet scheduler, which paces TCP flows based on their socket's assigned rate. Using fq as the qdisc is recommended when running BBR (and beneficial for CUBIC as well). The combination of fq_codel at the bottleneck and fq at the sender provides both pacing and AQM.
Congestion Control Beyond TCP
The principles of congestion control extend beyond TCP to every protocol that transmits data over a shared network:
- QUIC -- QUIC implementations can use any congestion control algorithm. Google's QUIC uses BBR, while other implementations (like those in Firefox or curl) typically use CUBIC-like algorithms. QUIC's advantage is that congestion control runs in userspace, making it easier to update and experiment with new algorithms without kernel changes.
- HTTP/2 -- HTTP/2 runs over TCP and inherits TCP's congestion control entirely. HTTP/2's flow control (at the stream and connection level) is separate from TCP's congestion control -- it is possible for HTTP/2 flow control to limit throughput below what TCP's cwnd would allow, or vice versa.
- SCTP -- The Stream Control Transmission Protocol uses congestion control similar to TCP (SACK-based, with slow start and congestion avoidance) but applies it across multiple streams within a single association.
- DCCP -- The Datagram Congestion Control Protocol provides congestion control for unreliable datagram flows, offering CCID2 (TCP-like) and CCID3 (equation-based, smoother rate) profiles.
- RTP/WebRTC -- Real-time media protocols use specialized congestion control like Google Congestion Control (GCC) that optimizes for low latency and smooth rate changes rather than maximum throughput.
The Interaction Between Congestion Control and Routing
Congestion control algorithms do not operate in isolation -- their performance is shaped by the network path that BGP selects. The AS path determines the physical route, which determines the RTT, bottleneck bandwidth, and loss characteristics that congestion control must adapt to. A path change visible in the looking glass -- perhaps due to a route leak or a peering dispute -- can suddenly change the RTT and loss rate, forcing the congestion control algorithm to re-adapt.
Consider a concrete scenario: a TCP connection from AS15169 (Google) to a user in Europe normally traverses a direct peering link with 30ms RTT. If a BGP event reroutes traffic through a transit provider with a longer path, RTT might jump to 120ms. A CUBIC flow that had converged to a stable cwnd now sees its actual throughput drop (same cwnd, longer RTT), and the cubic function's time-based growth means it slowly probes for the new capacity at this higher RTT. A BBR flow adapts more gracefully -- it measures the new RTprop and BtlBw and adjusts its pacing rate within a few RTTs. But if the new path also has higher loss (perhaps due to congestion at the transit provider), BBR's tolerance of loss keeps throughput higher, while CUBIC's loss-driven halving reduces cwnd more aggressively.
This interplay between routing and transport is fundamental to internet performance. The looking glass shows you the routing layer; congestion control operates at the transport layer; and the user's experience depends on both.
Explore Network Paths
Understanding congestion control helps explain why the same server can feel fast or slow depending on the network path. Use the god.ad BGP Looking Glass to examine the routing paths that your TCP connections traverse. Look up any IP address or autonomous system to see the BGP route, origin AS, and AS path -- and consider how the path's length, latency, and congestion characteristics interact with the congestion control algorithm running on each endpoint.