The Fastly CDN Global Outage (June 2021)
On June 8, 2021, at 09:47 UTC, roughly 85% of Fastly's (AS54113) global CDN network began returning 503 Service Unavailable errors. Within seconds, some of the most visited websites on the planet went dark. Amazon, Reddit, the UK government's gov.uk, The New York Times, Twitch, Pinterest, GitHub, Hulu, HBO Max, Vimeo, and dozens of other major properties became unreachable for users worldwide. For approximately 49 minutes, a significant fraction of the web was effectively offline — not because those sites had failed, but because the invisible infrastructure delivering their content had broken.
The root cause was not a cyberattack, not a hardware failure, and not an act of nature. It was a latent software bug, triggered by a routine customer configuration change. The incident exposed one of the internet's most underappreciated structural risks: the extreme concentration of web traffic through a handful of CDN providers.
What Is Fastly and Why Did It Matter?
Fastly is a major content delivery network and edge computing platform. Operating as AS54113, Fastly runs a global network of Points of Presence (POPs) — edge servers positioned in data centers worldwide, often at major Internet Exchange Points. These POPs cache and serve content for Fastly's customers, which include some of the largest properties on the web.
When a user requests a page from a site behind Fastly, the request is routed via BGP and anycast to the nearest Fastly POP. That edge server either serves the page from cache or fetches it from the customer's origin server and caches it. This architecture reduces latency dramatically: instead of every request traveling to a distant origin, users get responses from a server a few milliseconds away.
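The cache-or-fetch decision described above can be sketched in a few lines of Python. `EdgePOP` and `fetch_from_origin` are illustrative names for this sketch, not Fastly's actual software:

```python
# Minimal sketch of the decision a CDN edge server makes on each request:
# serve from local cache if possible, otherwise fetch from the distant
# origin and cache the result for subsequent users.

class EdgePOP:
    def __init__(self, fetch_from_origin):
        self.cache = {}                       # URL -> cached response body
        self.fetch_from_origin = fetch_from_origin

    def handle_request(self, url):
        # Cache hit: answered from the edge, milliseconds from the user.
        if url in self.cache:
            return self.cache[url], "HIT"
        # Cache miss: one round trip to the origin, then cached.
        body = self.fetch_from_origin(url)
        self.cache[url] = body
        return body, "MISS"

origin_calls = []
def origin(url):
    origin_calls.append(url)                  # track how often origin is hit
    return f"<html>content for {url}</html>"

pop = EdgePOP(origin)
print(pop.handle_request("/index.html")[1])   # first request: MISS
print(pop.handle_request("/index.html")[1])   # second request: HIT
print(len(origin_calls))                      # origin was contacted once
```

The latency win is visible in the last line: however many users request the page, the slow origin round trip happens once per POP, and everyone else is served from cache.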
The catch is dependency. When a CDN like Fastly works correctly, it is invisible. When it fails, every site that depends on it fails simultaneously. On June 8, 2021, the scale of that dependency became visible to the entire world.
What Happened: The Technical Root Cause
Fastly's post-incident report revealed a precise chain of events. On May 12, 2021 — nearly a month before the outage — Fastly had deployed a software update to its edge network. This update contained a latent bug: under certain conditions, a specific configuration parameter in a customer's VCL (Varnish Configuration Language) settings could trigger a process crash on the edge servers. The bug existed in Fastly's codebase but could not manifest until a particular set of conditions was met.
Those conditions were met on June 8 at 09:47 UTC, when a customer made a valid, routine configuration change to their Fastly service. The configuration was syntactically correct and passed all of Fastly's validation checks. But when it was deployed to the edge network, it triggered the latent bug. The impact was immediate and cascading: approximately 85% of Fastly's global network began returning HTTP 503 errors.
The failure pattern was especially severe because of how edge CDN configurations propagate. When a customer's configuration is deployed, it is pushed to all POPs that serve that customer's traffic. The bug didn't just crash one server — it caused failures across the global fleet.
The Scale of Impact
The list of affected services read like a directory of the internet's most critical properties:
- Amazon (AS16509) — both the retail site and portions of AWS infrastructure behind Fastly
- Reddit — fully unreachable, showing Fastly 503 errors
- GitHub (AS36459) — code hosting, CI/CD pipelines, dependency downloads all failed
- gov.uk — the United Kingdom's primary government services portal
- The New York Times — unable to serve articles during morning rush in the US
- Twitch — live streams and the platform itself went offline
- Pinterest — fully down
- The Guardian, Financial Times, Le Monde — multiple major news outlets went dark simultaneously
- Hulu, HBO Max, Vimeo — streaming services disrupted
- Stack Overflow — developers could not access documentation and answers
- Shopify storefronts — e-commerce sites dependent on Fastly were unreachable
The outage was not localized to a single continent or region. Because the bug propagated across Fastly's global POP fleet, users in North America, Europe, Asia, and elsewhere all experienced failures simultaneously. For nearly an hour, users around the world saw the same cryptic message: Error 503 Service Unavailable, often accompanied by the telltale "Varnish cache server" signature on Fastly's error page.
The Response and Recovery
Fastly's engineering team detected the problem quickly. According to their incident report, the timeline was:
- 09:47 UTC — Global disruption begins as the customer configuration change triggers the bug
- 09:58 UTC — Fastly engineering identifies the bug (11 minutes after onset)
- 10:36 UTC — A fix is deployed (49 minutes after onset)
- 10:46 UTC — 95% of Fastly's network has recovered
- 10:57 UTC — Full network recovery confirmed
The initial fix involved disabling the configuration that triggered the bug, which immediately allowed the affected POPs to recover. Fastly then deployed a permanent software fix to the underlying code. The total outage window of approximately one hour was relatively short for an incident of this severity: because the failure was in software rather than physical infrastructure, the fix could be rolled out remotely and take effect across the fleet within minutes.
However, "one hour" understates the real-world impact. Many sites that use CDNs like Fastly are configured so that if the CDN is unavailable, requests do not automatically fall through to the origin server. This means those sites were completely dark for the entire duration — there was no degraded mode, just a hard failure.
Why CDN Failures Are Different from Origin Failures
When a single website's origin server fails, that one site goes down. When a CDN fails, every site behind it goes down simultaneously. This is the fundamental architectural risk of CDN concentration.
A critical detail of the Fastly outage is that it was an application-layer failure, not a network-layer failure. From a BGP perspective, nothing was wrong. Fastly's IP prefixes remained announced throughout the incident. The AS paths to AS54113 remained stable. Packets were successfully routed to Fastly's edge servers. The problem was that when those packets arrived, the edge servers had nothing useful to return — only 503 errors.
This is an important distinction for network operators. BGP monitoring tools and looking glasses would not have detected this outage because the routing infrastructure was intact. It required application-level health checks and HTTP monitoring to identify the failure. This contrasts sharply with the BGP-level outages we see when a network accidentally withdraws its routes (as in the Facebook outage later that same year), where BGP monitoring immediately reveals the problem.
Fastly's Network Architecture
To understand why the failure was so widespread, it helps to understand how Fastly's network is structured. Fastly operates as AS54113 and announces a set of IP prefixes via BGP from multiple Points of Presence worldwide using anycast.
Like other major CDNs, Fastly uses anycast so that the same IP address (say, one serving reddit.com) is reachable from dozens of locations. BGP ensures that each user's traffic reaches the topologically nearest POP. This is the same technique used by Cloudflare (AS13335), Akamai (AS20940), and other CDN providers.
The architecture is powerful for performance and resilience against localized failures. If a single POP goes down, BGP reconverges and traffic flows to the next-nearest POP. But the Fastly outage was not a localized failure. The software bug hit 85% of POPs simultaneously because they all ran the same code and processed the same configuration. The anycast architecture that normally provides redundancy could not help when the failure was in the software layer shared across nearly all nodes.
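A toy model makes the distinction concrete. The fleet size, failure flags, and percentages below are invented for illustration; the point is only that anycast routes around independent failures but cannot help with a correlated one:

```python
# Hypothetical model contrasting a localized hardware failure (anycast
# reroutes around it) with a correlated software failure (every POP runs
# the same code, so every POP fails at once).

def serve(pops, latent_bug_triggered=False):
    """Return the fraction of POPs able to answer a request."""
    healthy = 0
    for pop in pops:
        if pop["hardware_down"]:
            continue                  # localized failure: BGP reroutes around it
        if latent_bug_triggered and pop["runs_shared_code"]:
            continue                  # correlated failure: same code, same crash
        healthy += 1
    return healthy / len(pops)

fleet = [{"hardware_down": False, "runs_shared_code": True} for _ in range(100)]

# One POP loses power: 99% of the fleet still serves; anycast absorbs it.
fleet[0]["hardware_down"] = True
print(serve(fleet))                               # 0.99

# The latent bug fires via a config change: every POP running the shared
# code fails simultaneously, and anycast has nowhere healthy to route.
print(serve(fleet, latent_bug_triggered=True))    # 0.0
```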
CDN Concentration: The Hidden Risk
The Fastly outage crystallized a risk that network engineers had long warned about: CDN concentration. A small number of CDN providers serve a disproportionate share of global web traffic. When one of these providers experiences a global outage, the blast radius encompasses thousands of seemingly unrelated websites.
The internet was designed as a decentralized, resilient network. BGP routes around failures — if one path goes down, traffic takes another. But CDNs introduce a layer of centralization above BGP. Even though the routing layer is resilient, the application layer served by CDNs is concentrated in a way that creates correlated failure modes.
Consider the analogy: BGP is like the road network, and CDNs are like a chain of warehouses positioned along those roads. If a road is blocked, traffic reroutes via BGP. But if the warehouse chain itself has a systemic problem (a software bug that hits all locations), the roads being functional does not help — the goods cannot be delivered.
The BGP Perspective: What the Routing Table Showed
During the Fastly outage, examining the BGP routing table via a looking glass would have shown something paradoxical: everything looked normal. Fastly's prefixes, such as those in the 151.101.0.0/16 range, remained visible in the global routing table with stable AS paths originating from AS54113.
This is because Fastly's routers continued to announce their prefixes via BGP. The routers themselves were functional — it was the application-layer software (the Varnish-based edge caching layer) that was crashing. From a BGP perspective, the network was healthy. Packets were reaching their destination. The destination simply had nothing to serve.
This highlights an important limitation of BGP monitoring: it tells you whether a network is reachable, not whether it is functional. Reachability and service availability are different things. A prefix can be fully reachable via BGP while the services behind it are completely broken. Comprehensive monitoring requires both BGP-level visibility and application-layer health checks.
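The monitoring gap can be expressed as a small classifier. The `bgp_announced` and `http_status` inputs here are stubbed values standing in for what you would obtain from a looking glass and an HTTP probe, not live measurements:

```python
# Sketch of why BGP-only monitoring missed the Fastly outage: a prefix
# can be announced and fully reachable while the service behind it
# returns nothing but errors.

def classify(bgp_announced, http_status):
    if not bgp_announced:
        return "routing failure"         # routes withdrawn (Facebook, Oct 2021)
    if http_status is None:
        return "unreachable above BGP"   # packets routed, no HTTP answer
    if http_status >= 500:
        return "reachable but broken"    # edge up, serving errors (Fastly)
    return "healthy"

# The Fastly outage signature: routes stable, edge returning 503s.
print(classify(bgp_announced=True, http_status=503))    # reachable but broken
# The Facebook outage signature: routes gone from the table entirely.
print(classify(bgp_announced=False, http_status=None))  # routing failure
```

A monitoring setup that only feeds the first argument will report "healthy" for an outage like Fastly's; both signals are required.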
Comparison with Other Major Outages
The Fastly outage occurred during a year of high-profile internet failures that collectively illustrated different categories of infrastructure risk:
- Fastly (June 2021) — Application-layer bug at a CDN. BGP routes stable, but edge servers returned errors. Duration: ~1 hour.
- Akamai (July 2021) — Just weeks later, Akamai (AS20940) experienced a similar CDN outage affecting major banks, airlines, and other customers. Duration: ~1 hour. Root cause: a software update to its DNS infrastructure.
- Facebook (October 2021) — A BGP withdrawal caused Facebook, Instagram, and WhatsApp to become completely unreachable. Unlike Fastly, this was visible in BGP — Facebook's (AS32934) routes disappeared from the global routing table entirely. Duration: ~6 hours.
The Facebook outage was fundamentally different: it was a routing-layer failure where BGP routes were withdrawn due to a configuration error in Facebook's backbone network. A BGP looking glass would have immediately shown the absence of Facebook's routes. The Fastly and Akamai outages, by contrast, were application-layer failures invisible to BGP monitoring.
Lessons for Network Architecture
The Fastly outage offered concrete lessons for anyone building services that depend on the internet's infrastructure:
Multi-CDN Strategies
Organizations that used Fastly as their sole CDN had no fallback when it failed. Sites that employed a multi-CDN strategy — distributing traffic across multiple providers like Cloudflare, Akamai, and Fastly — experienced reduced impact because traffic could shift to the providers that remained operational. Multi-CDN adds complexity and cost, but it eliminates the single-provider risk.
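One way to sketch such a failover policy in Python — the provider names are real CDNs, but the health map and selection logic are invented for illustration:

```python
# Hypothetical multi-CDN failover: try providers in preference order and
# fall through to the next when one is unhealthy.

def pick_cdn(providers, health):
    """Return the first healthy provider in preference order, else None."""
    for name in providers:
        if health.get(name, False):
            return name
    return None

providers = ["fastly", "cloudflare", "akamai"]

# Normal day: the primary serves.
print(pick_cdn(providers, {"fastly": True, "cloudflare": True, "akamai": True}))
# June 8, 2021 scenario: primary down, traffic shifts to the next provider.
print(pick_cdn(providers, {"fastly": False, "cloudflare": True, "akamai": True}))
```

Real multi-CDN deployments do this selection at the DNS or load-balancer layer and weight traffic rather than picking a single winner, but the failure-handling idea is the same.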
Origin Fallback
Many affected sites had no mechanism to serve traffic directly from their origin servers when the CDN failed. Implementing an origin fallback — where DNS records can be updated (or automatically switched) to point directly to origin servers — provides a last-resort serving path. The trade-off is higher latency and reduced capacity compared to CDN delivery, but serving slowly is better than not serving at all.
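A minimal sketch of that DNS-level fallback decision, using a placeholder CNAME target and a documentation-range origin address:

```python
# Hypothetical origin-fallback logic: when the CDN health check fails,
# answer DNS queries with the origin's address instead of the CDN CNAME.

CDN_TARGET = "example.global.fastly.net."  # illustrative CNAME target
ORIGIN_IP = "203.0.113.10"                 # origin server (documentation range)

def resolve(cdn_healthy):
    if cdn_healthy:
        return ("CNAME", CDN_TARGET)       # fast path: edge cache, low latency
    return ("A", ORIGIN_IP)                # last resort: slow, but serving

print(resolve(True))
print(resolve(False))
```

For this switch to take effect quickly during an incident, the DNS records involved need short TTLs; otherwise resolvers keep serving the cached CDN answer for the remainder of the TTL.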
Software Deployment Safety
The root cause was a latent software bug that went undetected for 27 days. This underscores the importance of canary deployments (rolling changes out to a small fraction of infrastructure first), comprehensive integration testing that covers edge-case configurations, and the ability to rapidly roll back deployments. Fastly's post-incident improvements included enhanced software testing and deployment procedures specifically targeting the class of bug that caused the outage.
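A staged rollout can be sketched as follows; the stage fractions, fleet size, and error threshold are illustrative, not Fastly's actual procedure:

```python
# Hypothetical canary rollout: deploy to a small slice first, watch the
# error rate, and halt before the fleet-wide push if the canary degrades.

def staged_rollout(fleet_size, stages, error_rate_after_deploy, threshold=0.01):
    """Deploy in stages; return (servers on new code, outcome)."""
    deployed = 0
    for fraction in stages:
        deployed = int(fleet_size * fraction)
        if error_rate_after_deploy(deployed) > threshold:
            return deployed, "halted"      # bug caught at canary scale
    return deployed, "complete"

def buggy(n):
    return 0.85                            # 85% of requests erroring at any scale

def healthy(n):
    return 0.001

print(staged_rollout(1000, [0.01, 0.10, 1.0], buggy))    # (10, 'halted')
print(staged_rollout(1000, [0.01, 0.10, 1.0], healthy))  # (1000, 'complete')
```

The caveat is visible in the Fastly incident itself: a canary only catches bugs whose trigger occurs during the canary window. The May 12 update looked healthy at every stage because the triggering customer configuration did not arrive until 27 days later, which is why broader configuration test coverage was needed as well.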
Configuration Validation
The triggering configuration change was valid according to Fastly's validation rules. This revealed that the validation system did not cover all paths through the edge software. Ensuring that configuration validation exercises the same code paths as production serving is critical for platforms that allow customers to deploy arbitrary configurations to edge infrastructure.
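The gap between syntactic validation and behavioral validation can be sketched like this; the "crashing-value" setting is a stand-in for the real, undisclosed trigger:

```python
# Sketch of the validation gap: a syntax-only validator accepts any
# well-formed config, while validating by executing the same code path
# production uses catches the crash before global deployment.

def production_apply(config):
    """The edge code path that actually serves traffic."""
    if config.get("setting") == "crashing-value":
        raise RuntimeError("edge process crash")  # the latent bug
    return "applied"

def syntax_only_validate(config):
    # Checks shape, not behavior: analogous to what passed on June 8.
    return isinstance(config, dict) and "setting" in config

def validate_via_production_path(config):
    # Exercise the real serving code in a sandbox before rollout.
    try:
        production_apply(config)
        return True
    except RuntimeError:
        return False

trigger = {"setting": "crashing-value"}
print(syntax_only_validate(trigger))          # True: passes, deploys, crashes
print(validate_via_production_path(trigger))  # False: caught before rollout
```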
CDNs, Anycast, and BGP: How They Interconnect
The Fastly outage is a useful case study for understanding how CDNs, anycast, and BGP interact in practice.
CDN providers announce their IP prefixes from many locations using BGP anycast. Customers configure their DNS records (typically via CNAME) to point to addresses within the CDN's anycast prefixes. When a user makes a request, BGP routing directs them to the nearest POP, where the CDN software handles the request.
This means CDN reliability depends on three separate layers, all working correctly:
- BGP routing — Prefixes must be announced and reachable. If BGP routes are withdrawn, the CDN is unreachable at the network layer.
- DNS resolution — Customer domains must resolve to CDN addresses. If DNS fails, users cannot find the CDN.
- Edge software — The CDN's application layer must correctly serve or proxy content. If the edge software crashes (as in the Fastly outage), the CDN is reachable but non-functional.
The Fastly outage was a failure in the third of these layers — the edge software (not to be confused with OSI layer 3). BGP and DNS were fine. Understanding which layer failed is essential for diagnosing and responding to CDN incidents.
How to Investigate CDN-Related Outages
When a website goes down and you suspect a CDN issue, you can use several tools to diagnose the problem:
- Check BGP routes — Use a BGP looking glass to verify that the CDN's prefixes are still announced. If routes are missing, it is a BGP-level failure. If routes are present, the issue is above the routing layer.
- Check DNS — Verify that the affected domain resolves to an IP address. If DNS resolution fails, the problem may be in the CDN's DNS infrastructure.
- Check HTTP response — If the IP is reachable, check what HTTP response the server returns. A 503 from a CDN edge typically means the edge is up but cannot serve content — exactly what happened during the Fastly outage.
- Identify the CDN — Look up the IP address the domain resolves to. The BGP origin AS of the prefix containing that address tells you which CDN is serving the site. If it is AS54113, the site is behind Fastly. If it is AS13335, it is Cloudflare.
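The decision tree above can be condensed into a single function. The inputs are results you gather manually with a looking glass, dig, and curl; nothing here performs live probes:

```python
# Sketch of the CDN-outage triage steps as a decision function.

def diagnose(routes_present, dns_resolves, http_status):
    if not routes_present:
        return "BGP-level failure: prefixes withdrawn"
    if not dns_resolves:
        return "DNS failure: domain does not resolve"
    if http_status == 503:
        return "edge up but cannot serve (CDN application-layer failure)"
    if http_status and http_status < 400:
        return "healthy"
    return "other HTTP-level failure"

# The Fastly outage signature: routes present, DNS fine, edge returning 503.
print(diagnose(routes_present=True, dns_resolves=True, http_status=503))
```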
You can practice this right now. Look up any major website and check which CDN it uses:
- reddit.com — check which network serves Reddit
- github.com — see GitHub's current CDN and origin
- nytimes.com — identify the CDN behind The New York Times
The Broader Pattern: Infrastructure Concentration
The Fastly outage is part of a broader pattern of concentration risk in internet infrastructure. Beyond CDNs, similar risks exist in:
- Cloud providers — A large fraction of the web runs on AWS (AS16509), Google Cloud (AS15169), and Azure (AS8075). Regional outages at these providers have cascading effects.
- DNS providers — The 2016 Dyn DNS attack showed how a DDoS against a single managed DNS provider could render thousands of sites unreachable.
- Transit providers — A handful of Tier 1 networks carry a disproportionate share of global internet traffic. Issues at providers like Lumen (AS3356) or Arelion (AS1299) can affect wide swaths of the internet.
- Submarine cables — Physical cable cuts can isolate entire regions, as a few key cables carry the majority of intercontinental traffic.
The internet's BGP routing system is inherently decentralized and resilient, but the services built on top of it have gravitated toward centralization for economic and performance reasons. The Fastly outage was a reminder that logical centralization can undermine the physical decentralization that BGP provides.
What Changed After the Outage
Fastly published a detailed post-incident blog post and implemented several changes:
- The specific software bug was fixed and the fix deployed globally
- Additional testing was added to cover the class of configuration-triggered bugs
- Deployment procedures were updated to improve canary testing and rollback speed
- Configuration validation was enhanced to exercise more code paths
More broadly, the outage accelerated industry adoption of multi-CDN strategies. Organizations that had previously relied on a single CDN provider began evaluating redundancy options. The incident also prompted renewed discussion about the desirability of CDN diversity in critical government services — the fact that gov.uk went down because of a US-based CDN provider's bug raised questions about supply chain risk for national infrastructure.
Explore the Networks Involved
You can examine the routing infrastructure of the networks involved in this incident using the looking glass:
- AS54113 — Fastly: see its announced prefixes and peering relationships
- AS13335 — Cloudflare: compare the size and peering of another major CDN
- AS20940 — Akamai: one of the oldest and largest CDNs
- AS16509 — Amazon/CloudFront: cloud provider with CDN services
- AS36459 — GitHub: one of the affected sites
- reddit.com — look up Reddit's current routing to see which CDN it uses today