The Cloudflare Backbone Outage (July 2020)

On July 17, 2020, a routine maintenance operation on Cloudflare's backbone network turned into one of the company's most significant outages. A misconfigured BGP route map on a single router in Atlanta withdrew backbone routes across Cloudflare's network, forcing traffic onto congested transit links and degrading service for roughly half of Cloudflare's global infrastructure for 27 minutes. The incident is a textbook case study in how BGP configuration errors cascade through a network, and how anycast architecture can simultaneously limit and propagate the blast radius of a failure.

Cloudflare's Backbone Architecture

To understand what went wrong, you first need to understand how Cloudflare's network (AS13335) is built. Cloudflare operates one of the most interconnected networks on the internet, with data centers in over 300 cities worldwide. These data centers are connected in two fundamental ways: over Cloudflare's private backbone (dedicated fiber links between data centers), and over the public internet via transit providers and peering.

Under normal operation, traffic flowing between Cloudflare data centers travels over the backbone. If a user in Europe requests content that is cached in a US data center, the request traverses Cloudflare's backbone -- not the public internet. The backbone is engineered to carry this inter-datacenter traffic efficiently, while the transit and peering links are sized for delivering content to end users.

[Diagram: Cloudflare backbone architecture under normal operation. Data centers in San Jose (SJC), Dallas (DFW), Atlanta (ATL), and Ashburn (IAD) are linked by private backbone fiber; each also connects to transit providers, which in turn connect to users.]

This dual-path design is standard for large networks. The backbone handles internal traffic between Cloudflare data centers, while transit and peering handle the last-mile delivery to end users. The backbone routes are advertised internally within Cloudflare's network using BGP -- specifically iBGP (internal BGP) sessions between the company's routers.

The Maintenance Window

On July 17, 2020, Cloudflare engineers began a planned maintenance procedure on a backbone router in their Atlanta data center. The maintenance was intended to decommission one of the backbone segments connected to Atlanta, part of a routine capacity management process. The work involved modifying the BGP configuration on the Atlanta router to remove routes associated with the segment being retired.

The specific change was to apply a new BGP route map -- a set of rules that control which routes a router accepts, rejects, or modifies. Route maps are the core mechanism by which network operators implement routing policy on their BGP-speaking routers. A correctly applied route map would have selectively removed only the routes associated with the decommissioned backbone segment.

But the route map that was applied was incorrect.

What Went Wrong: The BGP Misconfiguration

Instead of selectively withdrawing routes for the specific backbone segment under maintenance, the route map caused the Atlanta router to withdraw all backbone routes. This was a configuration error -- the route map's match conditions were too broad, or the permit/deny logic was inverted, effectively telling the router "reject all backbone prefixes" instead of "reject prefixes for this one segment."

When a BGP router withdraws routes, it sends BGP UPDATE messages to all of its BGP peers informing them that those routes are no longer available. These withdrawal messages propagated through Cloudflare's internal BGP mesh in seconds. Every other Cloudflare router that had been using the Atlanta backbone links as part of its routing table suddenly lost those routes.
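To make the withdrawal mechanics concrete, here is a minimal sketch of routes being withdrawn across an iBGP full mesh. The router names, prefixes, and single-hop propagation are illustrative assumptions, not Cloudflare's actual topology.

```python
# Toy model of a BGP withdrawal propagating through an iBGP full mesh.

class Router:
    def __init__(self, name):
        self.name = name
        self.peers = []   # iBGP peers (full mesh)
        self.rib = {}     # prefix -> next hop

    def advertise(self, prefix, next_hop):
        # BGP UPDATE carrying the prefix in the NLRI field.
        for peer in self.peers:
            peer.rib[prefix] = next_hop

    def withdraw(self, prefixes):
        # BGP UPDATE with the WITHDRAWN ROUTES field populated:
        # every peer deletes the routes from its RIB.
        for peer in self.peers:
            for p in prefixes:
                peer.rib.pop(p, None)

# Build a three-router full mesh.
atl, sjc, iad = Router("ATL"), Router("SJC"), Router("IAD")
for r in (atl, sjc, iad):
    r.peers = [p for p in (atl, sjc, iad) if p is not r]

# ATL advertises two (hypothetical) backbone prefixes.
atl.advertise("10.0.0.0/24", "ATL")
atl.advertise("10.0.1.0/24", "ATL")
print(sorted(sjc.rib))   # both prefixes present at SJC

# The bad route map: ATL withdraws *all* backbone routes, not one segment.
atl.withdraw(["10.0.0.0/24", "10.0.1.0/24"])
print(sorted(sjc.rib), sorted(iad.rib))   # both RIBs now empty
```

In the real event the same mechanism ran at scale: one router's withdrawal emptied the backbone entries from every other router's table within seconds.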

[Diagram: route withdrawal cascade. T+0s: bad route map applied at ATL. T+2s: BGP WITHDRAW propagates to iBGP peers. T+5s: backbone routes removed from the global RIB. T+10s: traffic shifts to transit providers. Before: SJC -> ATL -> IAD over the backbone. After: ATL's backbone routes gone, SJC -> transit -> IAD over congested links.]

The result was immediate. Without backbone routes, Cloudflare's routers needed an alternative path to reach other Cloudflare data centers. BGP did exactly what it was designed to do -- it reconverged and found the next-best routes. But those next-best routes went through external transit providers instead of Cloudflare's private backbone.

The Congestion Cascade

This is where the outage moved from "misconfiguration" to "major incident." The transit links between Cloudflare and its upstream providers were not engineered to carry backbone-level traffic volumes. These links are sized for exchanging traffic with the public internet -- delivering content to end users and receiving requests. They are not designed to shuttle large volumes of inter-datacenter traffic that would normally traverse Cloudflare's own backbone fiber.

When backbone traffic suddenly shifted onto transit links, those links became severely congested. Packet loss spiked. Latency increased dramatically. Connection timeouts mounted. For roughly 50% of Cloudflare's network -- the data centers that depended on backbone routes through or near Atlanta -- service degraded substantially.

The congestion was not limited to one link. Because Atlanta was a major backbone hub, its route withdrawals affected traffic patterns across a wide swath of the network. Data centers that had been routing inter-DC traffic through Atlanta suddenly all tried to push that traffic through their transit links simultaneously, creating congestion at multiple points.

Why Only 50% Was Affected

This is the part of the story that demonstrates both the strength and the limitation of anycast architecture. Cloudflare uses anycast for virtually all of its services -- the same IP prefixes (like 1.1.1.0/24, which contains the 1.1.1.1 resolver) are announced from every Cloudflare data center worldwide.

With anycast, each user is routed to the topologically nearest Cloudflare data center by BGP. This means that Cloudflare's network is not a single monolithic system. It is hundreds of independent instances of the same service, each serving the users that BGP routes to that location. When the backbone outage caused congestion, it affected the data centers that depended on the Atlanta backbone hub. But data centers in other parts of the world -- those that did not route backbone traffic through Atlanta -- continued operating normally.

A user in Sydney whose traffic was served by a local Cloudflare data center experienced no impact. A user in South America, routed to a Cloudflare data center that connected to other data centers via a different backbone path, also saw no disruption. The 50% that was unaffected simply had no dependency on the Atlanta backbone segment.

Compare this to a traditional centralized architecture where all traffic funnels through a single data center or a small cluster. In that model, a backbone failure would affect 100% of users because there is no geographic redundancy at the routing level. Cloudflare's anycast design effectively created a blast radius that was bounded by network topology rather than being global.
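The topological bounding of the blast radius can be sketched in a few lines. The data center names and the set of ATL-dependent sites below are assumptions for illustration, not the actual dependency map.

```python
# Sketch of anycast fault isolation: each user is served by the nearest
# data center, so a failure affects only users whose nearest DC depended
# on the failed backbone hub.

NEAREST_DC = {
    "user_nyc": "EWR", "user_miami": "MIA", "user_dallas": "DFW",
    "user_sydney": "SYD", "user_tokyo": "NRT", "user_paris": "CDG",
}

# Hypothetical set of DCs whose inter-DC traffic rode the Atlanta hub.
DEPENDS_ON_ATL = {"ATL", "EWR", "MIA", "DFW", "ORD", "IAD"}

def impacted(user):
    return NEAREST_DC[user] in DEPENDS_ON_ATL

affected = [u for u in NEAREST_DC if impacted(u)]
unaffected = [u for u in NEAREST_DC if not impacted(u)]
print(affected)     # users served by ATL-dependent DCs
print(unaffected)   # users with no dependency on the Atlanta segment
```

A centralized design collapses this partition: with one serving location, `DEPENDS_ON_ATL` would cover every user.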

[Diagram: anycast blast radius, ~50% affected. Affected: ATL, MIA, ORD, EWR, IAD, DFW -- transit links congested after backbone routes were lost. Unaffected: SYD, NRT, SIN, CDG, FRA, GRU -- no dependency on the ATL backbone segment. Users routed to unaffected data centers continued receiving normal service; a centralized architecture would have affected 100% of users.]

The BGP Mechanics of the Failure

To understand the technical details, let's walk through what happens inside a network when backbone routes are withdrawn.

Cloudflare's internal routing uses iBGP (internal BGP) to distribute route information between its data centers. Each backbone router advertises the prefixes reachable through it, along with a next hop attribute pointing to itself. When the Atlanta router withdrew its backbone routes, it sent BGP UPDATE messages with the WITHDRAWN ROUTES field populated for every backbone prefix it had been advertising.

Each receiving router processed these withdrawals and ran its BGP best-path selection algorithm to find an alternative route. The algorithm follows a deterministic sequence of tiebreakers:

  1. Highest local preference -- Backbone routes typically have high local preference to ensure they are preferred over transit routes.
  2. Shortest AS path -- Backbone routes have an AS path length of 0 (they are internal), while transit alternatives have longer paths.
  3. Lowest origin type -- IGP is preferred over EGP, which is preferred over INCOMPLETE.
  4. Lowest MED -- Multi-exit discriminator, used to influence inbound traffic from peers.
  5. eBGP over iBGP -- External routes preferred over internal.
  6. Lowest IGP cost to next hop
  7. Various other tiebreakers
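The tiebreaker sequence above can be expressed as a sort key. This is a deliberately simplified sketch -- real implementations evaluate more steps and apply MED comparison only between routes from the same neighboring AS (see RFC 4271) -- but it shows why the backbone route wins when present and why the transit route takes over once it is withdrawn.

```python
# Minimal sketch of BGP best-path selection over the listed tiebreakers.
from dataclasses import dataclass

ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}

@dataclass
class Route:
    via: str
    local_pref: int
    as_path_len: int
    origin: str
    med: int
    is_ebgp: bool
    igp_cost: int

def best_path(routes):
    # Higher local-pref wins, then shorter AS path, lower origin rank,
    # lower MED, eBGP over iBGP, lower IGP cost to the next hop.
    return min(routes, key=lambda r: (
        -r.local_pref, r.as_path_len, ORIGIN_RANK[r.origin],
        r.med, not r.is_ebgp, r.igp_cost))

backbone = Route("backbone", local_pref=200, as_path_len=0,
                 origin="IGP", med=0, is_ebgp=False, igp_cost=10)
transit  = Route("transit",  local_pref=100, as_path_len=3,
                 origin="IGP", med=0, is_ebgp=True, igp_cost=5)

print(best_path([backbone, transit]).via)   # backbone wins on local-pref
print(best_path([transit]).via)             # after withdrawal: only transit
```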

With the backbone routes withdrawn, the highest-preference internal routes were gone. The only remaining routes were those learned via eBGP from transit providers. These had lower local preference and longer AS paths, but they were the only paths available. BGP converged on these transit routes, and traffic shifted accordingly.

The critical insight is that BGP did its job perfectly. It reconverged to the best available paths within seconds. The problem was that the best available paths did not have sufficient capacity. This is a fundamental tension in network design: you want BGP to fail over to alternative paths, but those alternative paths must be provisioned to handle the overflow.

Detection and Response: 27 Minutes

Cloudflare's monitoring systems detected the anomaly quickly. Traffic graphs showed backbone utilization dropping to near zero while transit link utilization spiked. Error rates climbed. The operations team was alerted and began investigating.

The fix was straightforward once the cause was identified: revert the route map configuration on the Atlanta router. Once the correct configuration was restored, the router re-advertised its backbone routes via BGP UPDATE messages with the routes in the NLRI (Network Layer Reachability Information) field. The receiving routers immediately recognized these backbone routes as superior to the transit alternatives (higher local preference, shorter path) and switched back to the backbone paths.

The total duration of the outage was approximately 27 minutes. Given the nature of BGP convergence, recovery after the fix was applied took only seconds -- the same rapid convergence that propagated the failure also enabled fast recovery once the correct routes were restored.

Incident Timeline (July 17, 2020)

  21:12 -- Maintenance begins; bad route map applied
  21:14 -- Alerts fire; congestion detected
  21:30 -- Root cause identified
  21:39 -- Config reverted; service restored

Roughly 27 minutes of ~50% degradation. BGP convergence after the fix took seconds, and full service recovery under two minutes; most of the time was spent identifying the root cause, not fixing it.

Root Cause Analysis: Why Route Maps Are Dangerous

BGP route maps are powerful and necessary -- they are how operators implement complex routing policy. But they are also one of the most dangerous configuration primitives in networking. A route map consists of ordered match conditions and set actions, evaluated sequentially. An implicit deny all at the end of most route maps means that any route not explicitly matched by a permit clause will be rejected.

This design creates a trap: if you write a route map that is supposed to selectively filter one set of routes but your match conditions fail to capture everything you intend to permit, the implicit deny at the end silently drops all the routes you forgot to match. There is no error message, no warning. The router simply stops advertising routes, and BGP propagates the withdrawals across your network.
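The trap can be demonstrated with a toy route-map evaluator. The clause syntax here is a simplified model, not any vendor's actual configuration language, and the prefixes are hypothetical.

```python
# Sketch of route-map evaluation with an implicit deny: any prefix not
# matched by a permit clause is silently dropped.
import ipaddress

def apply_route_map(clauses, prefixes):
    """clauses: ordered list of (action, match_supernet); first match wins.
    Unmatched prefixes hit the implicit deny at the end."""
    kept = []
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix)
        for action, match in clauses:
            if net.subnet_of(ipaddress.ip_network(match)):
                if action == "permit":
                    kept.append(prefix)
                break
        # no clause matched -> implicit deny: prefix silently dropped
    return kept

backbone_prefixes = ["10.0.0.0/24", "10.0.1.0/24", "10.1.0.0/24"]

# Intended: deny only the segment being retired, permit everything else.
good = [("deny", "10.0.1.0/24"), ("permit", "10.0.0.0/8")]
print(apply_route_map(good, backbone_prefixes))   # two prefixes survive

# The trap: the engineer writes only the deny clause. Every other prefix
# falls through to the implicit deny -- all routes withdrawn, no warning.
bad = [("deny", "10.0.1.0/24")]
print(apply_route_map(bad, backbone_prefixes))    # []
```

Both route maps apply cleanly; only the resulting advertisements differ, which is exactly why the error produced no error message.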

This exact pattern has caused outages at multiple major networks over the years. It is one of the most common classes of BGP-related outages because route maps fail silently: the implicit deny produces no error or warning, the match logic is easy to get subtly wrong, and the full impact is invisible until the change is applied to a live router.

Lessons: What Cloudflare Changed

Cloudflare's post-mortem identified several improvements they implemented or accelerated after this incident:

1. Automated Route Map Validation

Before applying a route map change, automated tooling now validates the expected impact by simulating the route map against the router's current routing table. If the simulation shows an unexpected number of routes being withdrawn, the change is blocked.
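A hedged sketch of what such a pre-deployment check might look like -- the function names, threshold, and lambda-based "route map" stand in for real simulation tooling, which the source does not detail:

```python
# Simulate a route-map change against the current table and block it if
# the number of withdrawn routes exceeds what the operator declared.

def validate_change(current_routes, apply_map, expected_withdrawals, slack=2):
    surviving = apply_map(current_routes)
    withdrawn = len(current_routes) - len(surviving)
    if withdrawn > expected_withdrawals + slack:
        raise RuntimeError(
            f"blocked: map withdraws {withdrawn} routes, "
            f"expected ~{expected_withdrawals}")
    return surviving

routes = [f"10.0.{i}.0/24" for i in range(16)]

# Operator intends to retire one segment (one prefix)...
ok = validate_change(routes, lambda rs: rs[1:], expected_withdrawals=1)
print(len(ok))   # 15 routes survive; change allowed

# ...but a broken map that drops everything is blocked before deployment.
try:
    validate_change(routes, lambda rs: [], expected_withdrawals=1)
except RuntimeError as e:
    print(e)
```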

2. Staged Rollouts for Backbone Changes

Configuration changes that affect backbone routing are now rolled out incrementally, one router at a time, with monitoring gates between each step. If metrics degrade after changing one router, the rollout is automatically halted and rolled back before affecting additional routers.
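The rollout-with-gates pattern can be sketched as follows. The router names, state model, and health check are hypothetical; the point is the structure: change one router, check, and halt with rollback on degradation.

```python
# Staged rollout with monitoring gates between each step.

def staged_rollout(routers, apply_change, rollback, healthy):
    changed = []
    for router in routers:
        apply_change(router)
        changed.append(router)
        if not healthy():
            # Halt and undo everything applied so far.
            for r in reversed(changed):
                rollback(r)
            return f"halted at {router}, rolled back {len(changed)} routers"
    return "rollout complete"

state = {}
def apply_change(r): state[r] = "new"
def rollback(r):     state[r] = "old"
# Simulated gate: metrics degrade as soon as ATL gets the bad config.
def healthy():       return state.get("ATL") != "new"

result = staged_rollout(["SJC", "DFW", "ATL", "IAD"], apply_change,
                        rollback, healthy)
print(result)
print(state)   # every touched router back on the old config; IAD untouched
```

Contrast this with the incident, where the bad configuration reached the network in a single step with no gate between application and full propagation.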

3. Backbone Capacity on Transit Links

Transit link capacity planning now accounts for backbone failover scenarios. While transit links will never be sized to carry full backbone traffic (that would be prohibitively expensive), they are provisioned with enough headroom to gracefully handle partial backbone failures without severe congestion.

4. Maximum Prefix Limits

BGP sessions between backbone routers now have maximum prefix limits configured. If a router suddenly withdraws more routes than expected, the peer detects this anomaly. This does not prevent the initial withdrawal but provides an additional detection mechanism.
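The detection side described here can be sketched as a prefix-count watchdog. Note that classic maximum-prefix limits guard the opposite direction (too many prefixes received), so this drop detector is an illustrative complement with assumed names and thresholds, not Cloudflare's actual mechanism.

```python
# Alarm when the number of prefixes received from a peer drops sharply
# below its recent baseline -- a sudden mass withdrawal is anomalous.

def check_session(peer, baseline, current, min_ratio=0.5):
    """Flag the session if the peer's advertised prefix count falls
    below min_ratio of its baseline."""
    if baseline and current < baseline * min_ratio:
        return f"ALARM {peer}: {baseline} -> {current} prefixes"
    return f"ok {peer}: {current} prefixes"

print(check_session("ATL-backbone", baseline=1200, current=1180))
print(check_session("ATL-backbone", baseline=1200, current=3))
```

As the text notes, this does not prevent the withdrawal; it shortens the time from "routes gone" to "operator knows where to look."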

Comparison to Other BGP Outages

The Cloudflare backbone outage is part of a broader pattern of BGP-related incidents that have affected major internet infrastructure: the 2008 Pakistan Telecom hijack of YouTube's prefixes, the 2019 route leak through Verizon that disrupted Cloudflare and much of the internet, the 2020 CenturyLink/Level3 Flowspec outage, and the 2021 Facebook outage in which BGP withdrawals made its own DNS servers unreachable.

These incidents share a common theme: a single configuration error on a BGP-speaking router can propagate rapidly through the network, affecting services far beyond the scope of the original change. The power of BGP to rapidly converge on new routing information is both its greatest strength and its most dangerous property.

What This Means for Network Design

The Cloudflare backbone outage illustrates several enduring principles of internet architecture:

Anycast provides natural fault isolation. Because anycast routes each user to the nearest instance, a failure in one region does not automatically cascade to all users globally. This is why CDN architectures based on anycast are more resilient than centralized architectures.

BGP convergence is a double-edged sword. The same rapid reconvergence that makes BGP robust in the face of link failures also means that misconfigurations propagate just as rapidly. A bad route announcement reaches every peer in seconds, and by the time a human notices, the damage is already network-wide.

Capacity planning must account for failure modes. When backbone links fail and traffic shifts to transit, the transit links must have headroom. When a data center fails and traffic shifts to other data centers, those data centers must have spare capacity. The failure mode that overwhelms fallback capacity is the one that turns a minor incident into a major outage.

The fix is often simple; finding the cause is the hard part. Reverting the route map took seconds. The 27-minute outage was mostly spent identifying that the route map was the problem, not applying the fix. This is typical of BGP incidents -- the operational challenge is diagnosis, not remediation.

Explore Cloudflare's Network

You can examine Cloudflare's current BGP presence using a public BGP looking glass. Look at the routes, AS paths, and peering relationships that make up one of the most connected networks on the internet.
