The Facebook DNS Outage (October 2021)

On October 4, 2021, Facebook, Instagram, WhatsApp, Messenger, and Oculus all vanished from the internet simultaneously. For approximately six hours, over 3.5 billion users were locked out of the platforms they relied on for communication, business, and daily life. The root cause was not a cyberattack or a hardware failure -- it was a BGP configuration change that withdrew the routes to Facebook's DNS nameservers, rendering the company's entire online presence unreachable.

This was not merely a website going down. It was a case study in how tightly coupled modern internet infrastructure is, and how a single routing mistake can cascade into a global outage. Understanding what happened requires knowledge of BGP, DNS, and how large-scale networks like Facebook's (AS32934) are architected.

Background: Facebook's Network Architecture

Facebook operates AS32934, one of the largest autonomous systems on the internet. Like other hyperscale companies, Facebook runs a global backbone network -- a private fiber-optic network connecting its data centers to each other and to the dozens of Internet Exchange Points (IXPs) and peering locations where it exchanges traffic with ISPs and other networks.

Facebook's infrastructure is divided into two main layers: the data centers, where services and data live, and the global backbone with its edge points of presence, where Facebook's network meets the rest of the internet.

Critically, Facebook's authoritative DNS nameservers -- the servers that answer queries like "what is the IP address of facebook.com?" -- are hosted inside Facebook's own network. Their IP addresses (within prefixes like 129.134.30.0/24 and 185.89.218.0/23) are announced to the internet via BGP from AS32934. If those BGP announcements disappear, the DNS servers become unreachable, and every service that depends on DNS resolution of facebook.com, instagram.com, whatsapp.com, and other Meta domains ceases to function.
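This dependency can be made concrete with Python's standard-library ipaddress module: a quick sketch checking that example nameserver addresses fall inside the prefixes named above. The specific host addresses used here are illustrative assumptions, not confirmed assignments.

```python
import ipaddress

# Prefixes announced via BGP from AS32934 (named in the article)
announced = [ipaddress.ip_network(p) for p in ("129.134.30.0/24", "185.89.218.0/23")]

# Nameserver addresses below are illustrative assumptions, not confirmed assignments
nameservers = {
    "a.ns.facebook.com": ipaddress.ip_address("129.134.30.12"),
    "d.ns.facebook.com": ipaddress.ip_address("185.89.219.12"),
}

# Every nameserver lives inside a prefix that only AS32934 announces; withdraw
# those announcements and no resolver on the internet can reach any of them.
for name, addr in nameservers.items():
    reachable = any(addr in net for net in announced)
    print(f"{name}: inside an AS32934 prefix = {reachable}")
```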

What Happened: The Timeline

October 4, 2021 -- Facebook Outage Timeline (UTC)

  15:39 -- Routine backbone maintenance config change issued; a bug in the audit tool fails to catch the error.
  15:40 -- BGP sessions to all peering routers go down; AS32934 withdraws all route announcements globally.
  15:40-15:42 -- DNS resolvers worldwide lose routes to Facebook's nameservers; queries for facebook.com begin returning SERVFAIL.
  ~16:00 -- Engineers discover internal tools are also down; remote access systems depend on the same DNS and network.
  ~16:30-18:00 -- Teams dispatched to data centers for physical access; badge and door systems are also affected, delaying entry.
  ~21:00 -- BGP routes re-announced; DNS propagation begins; services gradually restored, with full recovery by ~22:00.

The Root Cause: A Backbone Config Change

At approximately 15:39 UTC, a Facebook engineer issued a command to assess the capacity of Facebook's global backbone network. The command was intended to take backbone links out of service for maintenance evaluation. However, a bug in the audit tool -- the software that validates configuration changes before they are applied -- failed to catch a critical error in the command.

The configuration change was designed to evaluate backbone capacity, but it contained an error that caused all backbone connections between Facebook's data centers and its peering edge routers to be withdrawn. When the backbone links to the edge went down, the edge routers lost their internal routes to everything behind them -- including the DNS nameservers.

Facebook's BGP routers are programmed to withdraw their external BGP announcements if they lose connectivity to the internal network, which is a standard safety measure. If a router cannot actually deliver traffic to the destinations it is advertising, it should stop advertising them. In this case, the logic worked exactly as designed: the routers detected that they could no longer reach the DNS nameservers (and everything else) over the backbone, so they sent BGP withdrawal messages to every peer and upstream.
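That safety rule can be sketched in a few lines. The prefixes here are illustrative, and real BGP implementations express this through route policy rather than application code, but the logic is the same:

```python
def edge_announcements(prefixes, backbone_routes):
    """Safety rule: advertise a prefix externally only while the edge router
    still holds an internal (backbone) route to it."""
    return [p for p in prefixes if p in backbone_routes]

prefixes = ["129.134.30.0/24", "185.89.218.0/23", "157.240.0.0/16"]

# Normal operation: backbone routes present, everything is advertised
print(edge_announcements(prefixes, set(prefixes)))

# Backbone disconnected: internal routes vanish, so every external
# announcement is withdrawn -- exactly what happened on October 4
print(edge_announcements(prefixes, set()))  # []
```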

The BGP Withdrawals

Within seconds of the backbone going down, AS32934 withdrew all of its prefix announcements from the global BGP routing table. Every AS path leading to Facebook's address space was removed. From the perspective of every other network on the internet, Facebook's IP addresses simply ceased to exist.

Route collectors like RIPE RIS and RouteViews recorded the mass withdrawal in real time. Within two minutes, the global routing table had zero routes to any Facebook prefix. It was as if someone had erased Facebook from the internet's map.
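Diffing two collector snapshots shows the scale of the event. A sketch with illustrative data follows; a real script would pull its snapshots from RIPE RIS or RouteViews archives rather than hard-coding them:

```python
def withdrawn(before, after):
    """Prefixes present in the earlier snapshot but absent from the later one."""
    return sorted(set(before) - set(after))

# Illustrative snapshots of prefixes originated by AS32934
at_1538_utc = ["129.134.30.0/24", "185.89.218.0/23", "157.240.0.0/16"]
at_1542_utc = []  # every announcement gone

lost = withdrawn(at_1538_utc, at_1542_utc)
print(f"{len(lost)} of {len(at_1538_utc)} prefixes withdrawn: {lost}")
```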

[Figure: BGP route withdrawal cascade. Before -- AS32934 announces its routes normally, and ISPs can reach the DNS nameservers over Facebook's backbone. After -- all routes are withdrawn, so when a user's recursive resolver queries facebook.com, Facebook's nameservers are unreachable because no BGP route exists.]

The withdrawal messages themselves are a normal part of BGP operation. Networks withdraw routes all the time -- when links go down, when maintenance is performed, or when routing policy changes. What made this event exceptional was the scope: every single prefix announced by AS32934 was withdrawn simultaneously, and the withdrawal persisted for hours because the engineers who could fix it had lost access to the systems they needed.

The DNS Cascading Failure

The BGP withdrawal did not directly "break" DNS. What it did was make Facebook's authoritative DNS nameservers unreachable at the network layer. Here is the chain of failure:

  1. BGP routes withdrawn -- the internet's routers no longer have a path to Facebook's IP prefixes
  2. DNS nameservers unreachable -- Facebook's authoritative nameservers (a.ns.facebook.com, b.ns.facebook.com, etc.) have IP addresses within those withdrawn prefixes
  3. DNS resolution fails -- when a recursive resolver (like 8.8.8.8 or 1.1.1.1) tries to resolve facebook.com, it needs to contact Facebook's nameservers, but it cannot reach them
  4. All services fail -- every domain under Facebook's control (facebook.com, instagram.com, whatsapp.net, fbcdn.net, oculus.com) becomes unresolvable
  5. Cached DNS entries expire -- DNS responses have a TTL (time to live). Once cached entries expire, even clients that previously resolved these domains successfully can no longer reach them
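Step 5 is why the failure worsened over time rather than hitting everyone at the same instant. A toy cache model (names, addresses, and TTLs illustrative) shows the behavior:

```python
class DnsCache:
    """Toy resolver cache: an entry answers queries only until its TTL expires."""
    def __init__(self):
        self._entries = {}

    def put(self, name, addr, ttl, now):
        self._entries[name] = (addr, now + ttl)

    def get(self, name, now):
        addr, expires = self._entries.get(name, (None, 0))
        return addr if now < expires else None

cache = DnsCache()
cache.put("facebook.com", "157.240.1.35", ttl=60, now=0)  # address illustrative

# Within the TTL, clients still resolve the name from the cache...
print(cache.get("facebook.com", now=30))  # cached answer

# ...but once it expires, the resolver must contact the unreachable
# authoritative nameservers, and resolution fails with SERVFAIL
print(cache.get("facebook.com", now=90))  # None
```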

The effect was total. It was not a partial degradation or a regional issue. Every user on Earth who tried to access any Facebook property after their DNS cache expired received a SERVFAIL response.

Why This Was So Hard to Fix

In most outages, engineers can SSH into routers, push a config fix, and restore service within minutes. The Facebook outage was different because the failure created a self-reinforcing lockout: the remote-access tools, internal dashboards, and DNS that engineers needed to diagnose and fix the problem all ran on the same network that had just vanished.

Engineers had to physically travel to Facebook's data centers and gain hands-on access to the backbone routers to revert the configuration change. This is why an issue that was conceptually simple -- "undo the last config change" -- took approximately six hours to resolve.

The BGP View: What the World Saw

From the perspective of BGP route collectors and looking glass tools, the outage was immediately visible. Here is what the data showed:

Before the outage

Facebook's AS32934 was announcing hundreds of prefixes, including the nameserver prefixes 129.134.30.0/24 and 185.89.218.0/23 described earlier.

Each prefix had healthy AS paths visible from multiple vantage points, with AS32934 as the origin.

During the outage

All prefixes originated by AS32934 were withdrawn. Looking glass queries returned zero results. The autonomous system was effectively invisible. BGP monitoring services like BGPStream and Cloudflare Radar showed the withdrawals propagating globally within seconds.

After recovery

Starting around 21:00 UTC, prefixes began reappearing in the global routing table. However, full recovery was not instant -- BGP convergence, DNS cache repopulation, and service startup all took additional time. Most users saw service restored between 21:30 and 22:00 UTC.

[Figure: AS32934 prefixes visible in the global routing table on October 4, 2021 (UTC). Roughly 300 prefixes before the 15:39 BGP withdrawals, then zero routes for about six hours, with recovery beginning around 21:00 and prefixes restored by 22:00.]

Collateral Damage: The DNS Tsunami

The outage had a significant secondary effect on the global DNS infrastructure. When Facebook's nameservers became unreachable, every recursive DNS resolver on the internet began receiving SERVFAIL responses for Facebook-related domains. Many applications and clients responded by retrying aggressively.

The volume of DNS queries for Facebook, Instagram, and WhatsApp domains increased by an estimated 30x compared to normal levels. Recursive resolvers like Cloudflare's 1.1.1.1 and Google's 8.8.8.8 were flooded with queries that would all fail. This created noticeable load on DNS infrastructure worldwide, and some smaller resolver operators reported degraded performance for all DNS queries, not just Facebook-related ones.
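Clients that retry failed lookups immediately and in lockstep amplify exactly this kind of storm. A common mitigation is exponential backoff with jitter, sketched minimally here (parameters are illustrative):

```python
import random

def backoff_delays(attempts, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: each retry waits a random
    amount up to min(cap, base * 2**attempt), desynchronizing clients."""
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]

random.seed(7)  # deterministic output for the example
for i, delay in enumerate(backoff_delays(5)):
    print(f"retry {i + 1}: wait {delay:.2f}s")
```

Without the jitter, desynchronized clients resynchronize on the failure itself and hammer resolvers in waves; the randomness is what spreads the load.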

Cloudflare published data showing that their resolver handled a massive spike in queries for Facebook's domains during the outage, and that the retry storms from clients constituted a significant portion of their total query volume during the event.

Technical Deep Dive: Why the Safety System Failed

Facebook's engineers had built safeguards into their change management process. Configuration changes to the backbone were supposed to be validated by an audit tool before being applied. This tool was designed to catch dangerous changes -- like ones that would disconnect all backbone links simultaneously.

The audit tool had a bug. It failed to evaluate the command correctly and approved a change that should have been blocked. According to Facebook's postmortem, the specific failure was:

  1. An engineer initiated a command intended to assess backbone capacity
  2. The command contained a parameter that, when applied, would withdraw all backbone connections to the peering edge
  3. The audit tool was supposed to simulate the effect of the command and reject it if it would cause a loss of connectivity
  4. Due to a bug, the audit tool did not correctly model the impact and allowed the command to proceed
  5. The command was applied, and within seconds, the backbone was disconnected from the edge

This is a classic example of a safety system failure mode: the operators trusted the guardrails, the guardrails had a defect, and the result was a total outage. It is analogous to a circuit breaker that fails to trip, allowing a fault to propagate through the entire system.
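The guardrail that failed can be pictured as a pre-flight simulation: model the backbone as a graph, apply the proposed change, and reject it if the data centers would lose their path to the peering edge. This is a hypothetical sketch of such a check, not Facebook's actual tool:

```python
def reachable(links, start):
    """Nodes reachable from `start` over the given undirected links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, stack = {start}, [start]
    while stack:
        for n in adj.get(stack.pop(), ()):
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen

def audit(links, links_to_disable):
    """Reject any change that would disconnect the data centers from the edge."""
    remaining = [l for l in links if l not in links_to_disable]
    ok = "edge" in reachable(remaining, "dc1")
    return "APPROVED" if ok else "REJECTED: would disconnect backbone from edge"

backbone = [("dc1", "dc2"), ("dc1", "edge"), ("dc2", "edge")]

print(audit(backbone, [("dc1", "edge")]))                   # safe: a path remains
print(audit(backbone, [("dc1", "edge"), ("dc2", "edge")]))  # the Oct 4 class of change
```

The bug in Facebook's tool was, in effect, that this simulation step produced the wrong answer for the command that was issued.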

The Recovery Process

Restoring service required several steps, each complicated by the outage itself:

  1. Physical access -- engineers traveled to data center facilities. Reports indicate that physical security systems (badge readers, door locks) had some dependencies on the affected network, adding delays
  2. Console access -- once inside, engineers connected directly to backbone routers via serial console or out-of-band management interfaces
  3. Configuration revert -- the erroneous backbone configuration was reverted, restoring internal connectivity
  4. BGP re-establishment -- with backbone connectivity restored, edge routers re-established BGP sessions with peers and began re-announcing Facebook's prefixes
  5. DNS propagation -- as routes propagated through the global BGP table, DNS resolvers could once again reach Facebook's nameservers. DNS resolution began working, but cached negative responses (SERVFAIL) in some resolvers added additional delay
  6. Service startup -- Facebook's services had to handle a massive thundering herd of reconnecting users, requiring careful capacity management during the ramp-up
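Step 6's careful ramp-up can be sketched as simple admission control: cap the rate at which reconnecting clients are admitted and raise the cap each interval. This is a toy model with illustrative numbers, not Facebook's actual mechanism:

```python
def ramp_admissions(waiting, start_rate, growth, steps):
    """Admit waiting clients at a rate that grows each step, rather than
    letting the whole thundering herd reconnect at once."""
    admitted_per_step, rate = [], start_rate
    for _ in range(steps):
        batch = min(waiting, rate)
        admitted_per_step.append(batch)
        waiting -= batch
        rate = int(rate * growth)
    return admitted_per_step, waiting

batches, still_waiting = ramp_admissions(
    waiting=1_000_000, start_rate=50_000, growth=2, steps=5
)
print(batches, still_waiting)
```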

Impact and Consequences

The outage had far-reaching effects: Facebook's stock price fell nearly 5% that day; competing messaging apps such as Telegram and Signal reported surges of new signups; and the many businesses worldwide that rely on WhatsApp, Instagram, and Facebook for commerce and customer contact were cut off for the duration.

Lessons for Network Engineering

The Facebook outage of October 2021 offers several important lessons for anyone operating internet infrastructure:

1. DNS should not be a single point of failure

Facebook's authoritative DNS nameservers were all hosted within Facebook's own autonomous system. When AS32934's BGP routes were withdrawn, all nameservers became unreachable simultaneously. Using DNS providers in separate autonomous systems (or at least separate BGP announcements) would have maintained DNS resolution even if the primary network went down.
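A quick way to audit this single-point-of-failure risk is to check whether a zone's nameservers are originated by more than one autonomous system. The sketch below uses hypothetical data; a real check would resolve the zone's NS set and map each address to its origin AS via BGP data:

```python
def ns_as_diversity(ns_origin_asns):
    """Count distinct origin ASNs across a zone's nameservers; a count of 1
    means a single BGP incident can take every nameserver offline at once."""
    return len(set(ns_origin_asns.values()))

# Hypothetical pre-outage picture: every nameserver originated by AS32934
single_as = {"a.ns": 32934, "b.ns": 32934, "c.ns": 32934, "d.ns": 32934}

# A more resilient layout: secondaries hosted in unrelated ASes (ASNs illustrative)
diverse = {"a.ns": 32934, "b.ns": 32934, "c.ns": 13335, "d.ns": 16509}

print(ns_as_diversity(single_as))  # 1 -> single point of failure
print(ns_as_diversity(diverse))    # 3 -> survives any one AS disappearing
```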

2. Out-of-band management is essential

When your primary network is the thing that is broken, you need a completely independent path to reach your management interfaces. This means out-of-band management networks that do not depend on the production backbone or DNS, with separate physical access paths and independent authentication.

3. Audit tools are only as good as their test coverage

The audit tool that should have caught the dangerous configuration change had a bug. Safety-critical validation systems need rigorous testing, including tests that specifically verify they reject configurations known to cause total outages.

4. BGP withdrawal propagation is fast and global

Within minutes, Facebook's prefixes were removed from routing tables worldwide. There is no "undo" button for BGP -- once withdrawals propagate, the only recovery is to re-announce the routes, which requires fixing whatever caused the withdrawal in the first place.

5. Physical access procedures must work during outages

If your badge system, door locks, or security protocols depend on the same network you need to fix, you have a circular dependency. Data center physical access must remain functional even during total network failures.

How BGP Monitoring Could Have Helped

While monitoring cannot prevent a misconfiguration, it can dramatically reduce detection time. BGP looking glass tools and route monitoring services detected the Facebook withdrawal within seconds. Organizations that monitor their own BGP announcements from external vantage points can set up alerts that trigger immediately when their prefixes disappear from the global routing table.
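Such an alert reduces to a simple check: is each expected prefix still visible from most external vantage points? A sketch with hypothetical per-vantage-point data (a real monitor would feed this from RIPE RIS or RouteViews streams):

```python
def prefix_alerts(expected_prefixes, observations, threshold=0.5):
    """Alert on any expected prefix visible from fewer than `threshold`
    of the external vantage points."""
    alerts = []
    for prefix in expected_prefixes:
        seen = sum(prefix in visible for visible in observations.values())
        if seen / len(observations) < threshold:
            alerts.append(prefix)
    return alerts

expected = ["129.134.30.0/24", "185.89.218.0/23"]

# Hypothetical views from three vantage points
healthy = {"vp-ams": set(expected), "vp-nyc": set(expected), "vp-sgp": set(expected)}
outage  = {"vp-ams": set(), "vp-nyc": set(), "vp-sgp": set()}

print(prefix_alerts(expected, healthy))  # [] -- all prefixes visible everywhere
print(prefix_alerts(expected, outage))   # both prefixes missing -> page someone
```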

Key monitoring practices include watching your own prefixes from external vantage points (route collectors such as RIPE RIS and RouteViews), alerting on unexpected withdrawals or origin-AS changes, and tracking services like BGPStream and Cloudflare Radar for global routing anomalies.

See Facebook's Network Today

Facebook's network has since been restored and continues to operate normally. You can explore its current routing state with a BGP looking glass or public route collectors.

This outage remains one of the most significant BGP-related incidents in internet history, alongside the 2008 Pakistan/YouTube BGP hijack. It demonstrated that even the most sophisticated networks are vulnerable to the fundamental mechanics of BGP routing -- and that when DNS and BGP fail together, the consequences are swift and total.
