The Cloudflare Backbone Outage (July 2020)

On July 17, 2020, a routine maintenance operation on Cloudflare's backbone network turned into one of the company's most significant outages. A misconfigured BGP route map on a single router in Atlanta withdrew backbone routes across Cloudflare's network, forcing traffic onto congested transit links and degrading service for roughly half of Cloudflare's global infrastructure for 27 minutes. The incident is a textbook case study in how BGP configuration errors cascade through a network, and how anycast architecture can simultaneously limit and propagate the blast radius of a failure.

Cloudflare's Backbone Architecture

To understand what went wrong, you first need to understand how Cloudflare's network (AS13335) is built. Cloudflare operates one of the most interconnected networks on the internet, with data centers in over 300 cities worldwide. These data centers are connected in two fundamental ways: over Cloudflare's private backbone (dedicated fiber links between data centers), and over the public internet via transit providers and peering.

Under normal operation, traffic flowing between Cloudflare data centers travels over the backbone. If a user in Europe requests content that is cached in a US data center, the request traverses Cloudflare's backbone -- not the public internet. The backbone is engineered to carry this inter-datacenter traffic efficiently, while the transit and peering links are sized for delivering content to end users.

[Diagram: Cloudflare backbone architecture under normal operation. Data centers in San Jose (SJC), Dallas (DFW), Atlanta (ATL), and Ashburn (IAD) are linked by private backbone fiber; each also connects to transit providers, which in turn connect to users.]

This dual-path design is standard for large networks. The backbone handles internal traffic between Cloudflare data centers, while transit and peering handle the last-mile delivery to end users. The backbone routes are advertised internally within Cloudflare's network using BGP -- specifically iBGP (internal BGP) sessions between the company's routers.

The Maintenance Window

On July 17, 2020, Cloudflare engineers began a planned maintenance procedure on a backbone router in their Atlanta data center. The maintenance was intended to decommission one of the backbone segments connected to Atlanta, part of a routine capacity management process. The work involved modifying the BGP configuration on the Atlanta router to remove routes associated with the segment being retired.

The specific change was to apply a new BGP route map -- a set of rules that control which routes a router accepts, rejects, or modifies. Route maps are the core mechanism by which network operators implement routing policy on their BGP-speaking routers. A correctly applied route map would have selectively removed only the routes associated with the decommissioned backbone segment.

But the route map that was applied was incorrect.

What Went Wrong: The BGP Misconfiguration

Instead of selectively withdrawing routes for the specific backbone segment under maintenance, the route map caused the Atlanta router to withdraw all backbone routes. This was a configuration error -- the route map's match conditions were too broad, or the permit/deny logic was inverted, effectively telling the router "reject all backbone prefixes" instead of "reject prefixes for this one segment."

When a BGP router withdraws routes, it sends BGP UPDATE messages to all of its BGP peers informing them that those routes are no longer available. These withdrawal messages propagated through Cloudflare's internal BGP mesh in seconds. Every other Cloudflare router that had been using the Atlanta backbone links as part of its routing table suddenly lost those routes.
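To make the withdrawal mechanics concrete, here is a minimal sketch of routes being withdrawn across an iBGP full mesh. The router names, prefixes, and single-hop propagation are illustrative assumptions, not Cloudflare's actual topology.

```python
# Toy model of a BGP withdrawal propagating through an iBGP full mesh.

class Router:
    def __init__(self, name):
        self.name = name
        self.peers = []   # iBGP peers (full mesh)
        self.rib = {}     # prefix -> next hop

    def advertise(self, prefix, next_hop):
        # BGP UPDATE carrying the prefix in the NLRI field.
        for peer in self.peers:
            peer.rib[prefix] = next_hop

    def withdraw(self, prefixes):
        # BGP UPDATE with the WITHDRAWN ROUTES field populated:
        # every peer deletes the routes from its RIB.
        for peer in self.peers:
            for p in prefixes:
                peer.rib.pop(p, None)

# Build a three-router full mesh.
atl, sjc, iad = Router("ATL"), Router("SJC"), Router("IAD")
for r in (atl, sjc, iad):
    r.peers = [p for p in (atl, sjc, iad) if p is not r]

# ATL advertises two (hypothetical) backbone prefixes.
atl.advertise("10.0.0.0/24", "ATL")
atl.advertise("10.0.1.0/24", "ATL")
print(sorted(sjc.rib))   # both prefixes present at SJC

# The bad route map: ATL withdraws *all* backbone routes, not one segment.
atl.withdraw(["10.0.0.0/24", "10.0.1.0/24"])
print(sorted(sjc.rib), sorted(iad.rib))   # both RIBs now empty
```

In the real event the same mechanism ran at scale: one router's withdrawal emptied the backbone entries from every other router's table within seconds.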

[Diagram: route withdrawal cascade. T+0s: bad route map applied at ATL. T+2s: BGP WITHDRAW propagates to iBGP peers. T+5s: backbone routes removed from the global RIB. T+10s: traffic shifts to transit providers. Before: SJC -> ATL -> IAD over the backbone. After: ATL's backbone routes gone, SJC -> transit -> IAD over congested links.]

The result was immediate. Without backbone routes, Cloudflare's routers needed an alternative path to reach other Cloudflare data centers. BGP did exactly what it was designed to do -- it reconverged and found the next-best routes. But those next-best routes went through external transit providers instead of Cloudflare's private backbone.

The Congestion Cascade

This is where the outage moved from "misconfiguration" to "major incident." The transit links between Cloudflare and its upstream providers were not engineered to carry backbone-level traffic volumes. These links are sized for exchanging traffic with the public internet -- delivering content to end users and receiving requests. They are not designed to shuttle large volumes of inter-datacenter traffic that would normally traverse Cloudflare's own backbone fiber.

When backbone traffic suddenly shifted onto transit links, those links became severely congested. Packet loss spiked. Latency increased dramatically. Connection timeouts mounted. For roughly 50% of Cloudflare's network -- the data centers that depended on backbone routes through or near Atlanta -- service degraded substantially.

The congestion was not limited to one link. Because Atlanta was a major backbone hub, its route withdrawals affected traffic patterns across a wide swath of the network. Data centers that had been routing inter-DC traffic through Atlanta suddenly all tried to push that traffic through their transit links simultaneously, creating congestion at multiple points.

Why Only 50% Was Affected

This is the part of the story that demonstrates both the strength and the limitation of anycast architecture. Cloudflare uses anycast for virtually all of its services -- the same IP prefixes (like 1.1.1.0/24, which contains the 1.1.1.1 resolver) are announced from every Cloudflare data center worldwide.

With anycast, each user is routed to the topologically nearest Cloudflare data center by BGP. This means that Cloudflare's network is not a single monolithic system. It is hundreds of independent instances of the same service, each serving the users that BGP routes to that location. When the backbone outage caused congestion, it affected the data centers that depended on the Atlanta backbone hub. But data centers in other parts of the world -- those that did not route backbone traffic through Atlanta -- continued operating normally.

A user in Sydney whose traffic was served by a local Cloudflare data center experienced no impact. A user in South America, routed to a Cloudflare data center that connected to other data centers via a different backbone path, also saw no disruption. The 50% that was unaffected simply had no dependency on the Atlanta backbone segment.

Compare this to a traditional centralized architecture where all traffic funnels through a single data center or a small cluster. In that model, a backbone failure would affect 100% of users because there is no geographic redundancy at the routing level. Cloudflare's anycast design effectively created a blast radius that was bounded by network topology rather than being global.
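The topological bounding of the blast radius can be sketched in a few lines. The data center names and the set of ATL-dependent sites below are assumptions for illustration, not the actual dependency map.

```python
# Sketch of anycast fault isolation: each user is served by the nearest
# data center, so a failure affects only users whose nearest DC depended
# on the failed backbone hub.

NEAREST_DC = {
    "user_nyc": "EWR", "user_miami": "MIA", "user_dallas": "DFW",
    "user_sydney": "SYD", "user_tokyo": "NRT", "user_paris": "CDG",
}

# Hypothetical set of DCs whose inter-DC traffic rode the Atlanta hub.
DEPENDS_ON_ATL = {"ATL", "EWR", "MIA", "DFW", "ORD", "IAD"}

def impacted(user):
    return NEAREST_DC[user] in DEPENDS_ON_ATL

affected = [u for u in NEAREST_DC if impacted(u)]
unaffected = [u for u in NEAREST_DC if not impacted(u)]
print(affected)     # users served by ATL-dependent DCs
print(unaffected)   # users with no dependency on the Atlanta segment
```

A centralized design collapses this partition: with one serving location, `DEPENDS_ON_ATL` would cover every user.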

[Diagram: anycast blast radius, ~50% affected. Affected: ATL, MIA, ORD, EWR, IAD, DFW -- transit links congested after backbone routes were lost. Unaffected: SYD, NRT, SIN, CDG, FRA, GRU -- no dependency on the ATL backbone segment. Users routed to unaffected data centers continued receiving normal service; a centralized architecture would have affected 100% of users.]

The BGP Mechanics of the Failure

To understand the technical details, let's walk through what happens inside a network when backbone routes are withdrawn.

Cloudflare's internal routing uses iBGP (internal BGP) to distribute route information between its data centers. Each backbone router advertises the prefixes reachable through it, along with a next hop attribute pointing to itself. When the Atlanta router withdrew its backbone routes, it sent BGP UPDATE messages with the WITHDRAWN ROUTES field populated for every backbone prefix it had been advertising.

Each receiving router processed these withdrawals and ran its BGP best-path selection algorithm to find an alternative route. The algorithm follows a deterministic sequence of tiebreakers:

  1. Highest local preference -- Backbone routes typically have high local preference to ensure they are preferred over transit routes.
  2. Shortest AS path -- Backbone routes have an AS path length of 0 (they are internal), while transit alternatives have longer paths.
  3. Lowest origin type -- IGP is preferred over EGP, which is preferred over INCOMPLETE.
  4. Lowest MED -- Multi-exit discriminator, used to influence inbound traffic from peers.
  5. eBGP over iBGP -- External routes preferred over internal.
  6. Lowest IGP cost to next hop
  7. Various other tiebreakers
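The tiebreaker sequence above can be expressed as a sort key. This is a deliberately simplified sketch -- real implementations evaluate more steps and apply MED comparison only between routes from the same neighboring AS (see RFC 4271) -- but it shows why the backbone route wins when present and why the transit route takes over once it is withdrawn.

```python
# Minimal sketch of BGP best-path selection over the listed tiebreakers.
from dataclasses import dataclass

ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}

@dataclass
class Route:
    via: str
    local_pref: int
    as_path_len: int
    origin: str
    med: int
    is_ebgp: bool
    igp_cost: int

def best_path(routes):
    # Higher local-pref wins, then shorter AS path, lower origin rank,
    # lower MED, eBGP over iBGP, lower IGP cost to the next hop.
    return min(routes, key=lambda r: (
        -r.local_pref, r.as_path_len, ORIGIN_RANK[r.origin],
        r.med, not r.is_ebgp, r.igp_cost))

backbone = Route("backbone", local_pref=200, as_path_len=0,
                 origin="IGP", med=0, is_ebgp=False, igp_cost=10)
transit  = Route("transit",  local_pref=100, as_path_len=3,
                 origin="IGP", med=0, is_ebgp=True, igp_cost=5)

print(best_path([backbone, transit]).via)   # backbone wins on local-pref
print(best_path([transit]).via)             # after withdrawal: only transit
```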

With the backbone routes withdrawn, the highest-preference internal routes were gone. The only remaining routes were those learned via eBGP from transit providers. These had lower local preference and longer AS paths, but they were the only paths available. BGP converged on these transit routes, and traffic shifted accordingly.

The critical insight is that BGP did its job perfectly. It reconverged to the best available paths within seconds. The problem was that the best available paths did not have sufficient capacity. This is a fundamental tension in network design: you want BGP to fail over to alternative paths, but those alternative paths must be provisioned to handle the overflow.

Detection and Response: 27 Minutes

Cloudflare's monitoring systems detected the anomaly quickly. Traffic graphs showed backbone utilization dropping to near zero while transit link utilization spiked. Error rates climbed. The operations team was alerted and began investigating.

The fix was straightforward once the cause was identified: revert the route map configuration on the Atlanta router. Once the correct configuration was restored, the router re-advertised its backbone routes via BGP UPDATE messages with the routes in the NLRI (Network Layer Reachability Information) field. The receiving routers immediately recognized these backbone routes as superior to the transit alternatives (higher local preference, shorter path) and switched back to the backbone paths.

The total duration of the outage was approximately 27 minutes. Given the nature of BGP convergence, recovery after the fix was applied took only seconds -- the same rapid convergence that propagated the failure also enabled fast recovery once the correct routes were restored.

Incident Timeline (July 17, 2020)

  21:12 -- Maintenance begins; bad route map applied
  21:14 -- Alerts fire; congestion detected
  21:30 -- Root cause identified
  21:39 -- Config reverted; service restored

Roughly 27 minutes of ~50% degradation. BGP convergence after the fix took seconds, and full service recovery under two minutes; most of the time was spent identifying the root cause, not fixing it.

Root Cause Analysis: Why Route Maps Are Dangerous

BGP route maps are powerful and necessary -- they are how operators implement complex routing policy. But they are also one of the most dangerous configuration primitives in networking. A route map consists of ordered match conditions and set actions, evaluated sequentially. An implicit deny all at the end of most route maps means that any route not explicitly matched by a permit clause will be rejected.

This design creates a trap: if you write a route map that is supposed to selectively filter one set of routes but your match conditions fail to capture everything you intend to permit, the implicit deny at the end silently drops all the routes you forgot to match. There is no error message, no warning. The router simply stops advertising routes, and BGP propagates the withdrawals across your network.
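The trap can be demonstrated with a toy route-map evaluator. The clause syntax here is a simplified model, not any vendor's actual configuration language, and the prefixes are hypothetical.

```python
# Sketch of route-map evaluation with an implicit deny: any prefix not
# matched by a permit clause is silently dropped.
import ipaddress

def apply_route_map(clauses, prefixes):
    """clauses: ordered list of (action, match_supernet); first match wins.
    Unmatched prefixes hit the implicit deny at the end."""
    kept = []
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix)
        for action, match in clauses:
            if net.subnet_of(ipaddress.ip_network(match)):
                if action == "permit":
                    kept.append(prefix)
                break
        # no clause matched -> implicit deny: prefix silently dropped
    return kept

backbone_prefixes = ["10.0.0.0/24", "10.0.1.0/24", "10.1.0.0/24"]

# Intended: deny only the segment being retired, permit everything else.
good = [("deny", "10.0.1.0/24"), ("permit", "10.0.0.0/8")]
print(apply_route_map(good, backbone_prefixes))   # two prefixes survive

# The trap: the engineer writes only the deny clause. Every other prefix
# falls through to the implicit deny -- all routes withdrawn, no warning.
bad = [("deny", "10.0.1.0/24")]
print(apply_route_map(bad, backbone_prefixes))    # []
```

Both route maps apply cleanly; only the resulting advertisements differ, which is exactly why the error produced no error message.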

This exact pattern has caused outages at multiple major networks over the years. It is one of the most common classes of BGP-related outages because route maps fail silently: the implicit deny produces no error or warning, the match logic is easy to get subtly wrong, and the full impact is invisible until the change is applied to a live router.

Lessons: What Cloudflare Changed

Cloudflare's post-mortem identified several improvements they implemented or accelerated after this incident:

1. Automated Route Map Validation

Before applying a route map change, automated tooling now validates the expected impact by simulating the route map against the router's current routing table. If the simulation shows an unexpected number of routes being withdrawn, the change is blocked.
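A hedged sketch of what such a pre-deployment check might look like -- the function names, threshold, and lambda-based "route map" stand in for real simulation tooling, which the source does not detail:

```python
# Simulate a route-map change against the current table and block it if
# the number of withdrawn routes exceeds what the operator declared.

def validate_change(current_routes, apply_map, expected_withdrawals, slack=2):
    surviving = apply_map(current_routes)
    withdrawn = len(current_routes) - len(surviving)
    if withdrawn > expected_withdrawals + slack:
        raise RuntimeError(
            f"blocked: map withdraws {withdrawn} routes, "
            f"expected ~{expected_withdrawals}")
    return surviving

routes = [f"10.0.{i}.0/24" for i in range(16)]

# Operator intends to retire one segment (one prefix)...
ok = validate_change(routes, lambda rs: rs[1:], expected_withdrawals=1)
print(len(ok))   # 15 routes survive; change allowed

# ...but a broken map that drops everything is blocked before deployment.
try:
    validate_change(routes, lambda rs: [], expected_withdrawals=1)
except RuntimeError as e:
    print(e)
```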

2. Staged Rollouts for Backbone Changes

Configuration changes that affect backbone routing are now rolled out incrementally, one router at a time, with monitoring gates between each step. If metrics degrade after changing one router, the rollout is automatically halted and rolled back before affecting additional routers.
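The rollout-with-gates pattern can be sketched as follows. The router names, state model, and health check are hypothetical; the point is the structure: change one router, check, and halt with rollback on degradation.

```python
# Staged rollout with monitoring gates between each step.

def staged_rollout(routers, apply_change, rollback, healthy):
    changed = []
    for router in routers:
        apply_change(router)
        changed.append(router)
        if not healthy():
            # Halt and undo everything applied so far.
            for r in reversed(changed):
                rollback(r)
            return f"halted at {router}, rolled back {len(changed)} routers"
    return "rollout complete"

state = {}
def apply_change(r): state[r] = "new"
def rollback(r):     state[r] = "old"
# Simulated gate: metrics degrade as soon as ATL gets the bad config.
def healthy():       return state.get("ATL") != "new"

result = staged_rollout(["SJC", "DFW", "ATL", "IAD"], apply_change,
                        rollback, healthy)
print(result)
print(state)   # every touched router back on the old config; IAD untouched
```

Contrast this with the incident, where the bad configuration reached the network in a single step with no gate between application and full propagation.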

3. Backbone Capacity on Transit Links

Transit link capacity planning now accounts for backbone failover scenarios. While transit links will never be sized to carry full backbone traffic (that would be prohibitively expensive), they are provisioned with enough headroom to gracefully handle partial backbone failures without severe congestion.

4. Maximum Prefix Limits

BGP sessions between backbone routers now have maximum prefix limits configured. If a router suddenly withdraws more routes than expected, the peer detects this anomaly. This does not prevent the initial withdrawal but provides an additional detection mechanism.
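The detection side described here can be sketched as a prefix-count watchdog. Note that classic maximum-prefix limits guard the opposite direction (too many prefixes received), so this drop detector is an illustrative complement with assumed names and thresholds, not Cloudflare's actual mechanism.

```python
# Alarm when the number of prefixes received from a peer drops sharply
# below its recent baseline -- a sudden mass withdrawal is anomalous.

def check_session(peer, baseline, current, min_ratio=0.5):
    """Flag the session if the peer's advertised prefix count falls
    below min_ratio of its baseline."""
    if baseline and current < baseline * min_ratio:
        return f"ALARM {peer}: {baseline} -> {current} prefixes"
    return f"ok {peer}: {current} prefixes"

print(check_session("ATL-backbone", baseline=1200, current=1180))
print(check_session("ATL-backbone", baseline=1200, current=3))
```

As the text notes, this does not prevent the withdrawal; it shortens the time from "routes gone" to "operator knows where to look."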

Comparison to Other BGP Outages

The Cloudflare backbone outage is part of a broader pattern of BGP-related incidents that have affected major internet infrastructure: the 2008 Pakistan Telecom hijack of YouTube's prefixes, the 2019 route leak through Verizon that disrupted Cloudflare and much of the internet, the 2020 CenturyLink/Level3 Flowspec outage, and the 2021 Facebook outage in which BGP withdrawals made its own DNS servers unreachable.

These incidents share a common theme: a single configuration error on a BGP-speaking router can propagate rapidly through the network, affecting services far beyond the scope of the original change. The power of BGP to rapidly converge on new routing information is both its greatest strength and its most dangerous property.

What This Means for Network Design

The Cloudflare backbone outage illustrates several enduring principles of internet architecture:

Anycast provides natural fault isolation. Because anycast routes each user to the nearest instance, a failure in one region does not automatically cascade to all users globally. This is why CDN architectures based on anycast are more resilient than centralized architectures.

BGP convergence is a double-edged sword. The same rapid reconvergence that makes BGP robust in the face of link failures also means that misconfigurations propagate just as rapidly. A bad route announcement reaches every peer in seconds, and by the time a human notices, the damage is already network-wide.

Capacity planning must account for failure modes. When backbone links fail and traffic shifts to transit, the transit links must have headroom. When a data center fails and traffic shifts to other data centers, those data centers must have spare capacity. The failure mode that overwhelms fallback capacity is the one that turns a minor incident into a major outage.

The fix is often simple; finding the cause is the hard part. Reverting the route map took seconds. The 27-minute outage was mostly spent identifying that the route map was the problem, not applying the fix. This is typical of BGP incidents -- the operational challenge is diagnosis, not remediation.

Explore Cloudflare's Network

You can examine Cloudflare's current BGP presence using a public BGP looking glass. Look at the routes, AS paths, and peering relationships that make up one of the most connected networks on the internet.
