The Facebook DNS Outage (October 2021)

On October 4, 2021, Facebook, Instagram, WhatsApp, Messenger, and Oculus all vanished from the internet simultaneously. For approximately six hours, over 3.5 billion users were locked out of the platforms they relied on for communication, business, and daily life. The root cause was not a cyberattack or a hardware failure -- it was a BGP configuration change that withdrew the routes to Facebook's DNS nameservers, rendering the company's entire online presence unreachable.

This was not merely a website going down. It was a case study in how tightly coupled modern internet infrastructure is, and how a single routing mistake can cascade into a global outage. Understanding what happened requires knowledge of BGP, DNS, and how large-scale networks like Facebook's (AS32934) are architected.

Background: Facebook's Network Architecture

Facebook operates AS32934, one of the largest autonomous systems on the internet. Like other hyperscale companies, Facebook runs a global backbone network -- a private fiber-optic network connecting its data centers to each other and to the dozens of Internet Exchange Points (IXPs) and peering locations where it exchanges traffic with ISPs and other networks.

Facebook's infrastructure is divided into two main layers: the data centers, where services and data live, and the global backbone with its edge points of presence, where Facebook's network meets the rest of the internet.

Critically, Facebook's authoritative DNS nameservers -- the servers that answer queries like "what is the IP address of facebook.com?" -- are hosted inside Facebook's own network. Their IP addresses (within prefixes like 129.134.30.0/24 and 185.89.218.0/23) are announced to the internet via BGP from AS32934. If those BGP announcements disappear, the DNS servers become unreachable, and every service that depends on DNS resolution of facebook.com, instagram.com, whatsapp.com, and other Meta domains ceases to function.
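This dependency can be made concrete with Python's standard-library ipaddress module: a quick sketch checking that example nameserver addresses fall inside the prefixes named above. The specific host addresses used here are illustrative assumptions, not confirmed assignments.

```python
import ipaddress

# Prefixes announced via BGP from AS32934 (named in the article)
announced = [ipaddress.ip_network(p) for p in ("129.134.30.0/24", "185.89.218.0/23")]

# Nameserver addresses below are illustrative assumptions, not confirmed assignments
nameservers = {
    "a.ns.facebook.com": ipaddress.ip_address("129.134.30.12"),
    "d.ns.facebook.com": ipaddress.ip_address("185.89.219.12"),
}

# Every nameserver lives inside a prefix that only AS32934 announces; withdraw
# those announcements and no resolver on the internet can reach any of them.
for name, addr in nameservers.items():
    reachable = any(addr in net for net in announced)
    print(f"{name}: inside an AS32934 prefix = {reachable}")
```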

What Happened: The Timeline

October 4, 2021 -- Facebook Outage Timeline (UTC)

  15:39 -- Routine backbone maintenance config change issued; a bug in the audit tool fails to catch the error.
  15:40 -- BGP sessions to all peering routers go down; AS32934 withdraws all route announcements globally.
  15:40-15:42 -- DNS resolvers worldwide lose routes to Facebook's nameservers; queries for facebook.com begin returning SERVFAIL.
  ~16:00 -- Engineers discover internal tools are also down; remote access systems depend on the same DNS and network.
  ~16:30-18:00 -- Teams dispatched to data centers for physical access; badge and door systems are also affected, delaying entry.
  ~21:00 -- BGP routes re-announced; DNS propagation begins; services gradually restored, with full recovery by ~22:00.

The Root Cause: A Backbone Config Change

At approximately 15:39 UTC, a Facebook engineer issued a command to assess the capacity of Facebook's global backbone network. The command was intended to take backbone links out of service for maintenance evaluation. However, a bug in the audit tool -- the software that validates configuration changes before they are applied -- failed to catch a critical error in the command.

The configuration change was designed to evaluate backbone capacity, but it contained an error that caused all backbone connections between Facebook's data centers and its peering edge routers to be withdrawn. When the backbone links to the edge went down, the edge routers lost their internal routes to everything behind them -- including the DNS nameservers.

Facebook's BGP routers are programmed to withdraw their external BGP announcements if they lose connectivity to the internal network, which is a standard safety measure. If a router cannot actually deliver traffic to the destinations it is advertising, it should stop advertising them. In this case, the logic worked exactly as designed: the routers detected that they could no longer reach the DNS nameservers (and everything else) over the backbone, so they sent BGP withdrawal messages to every peer and upstream.
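That safety rule can be sketched in a few lines. The prefixes here are illustrative, and real BGP implementations express this through route policy rather than application code, but the logic is the same:

```python
def edge_announcements(prefixes, backbone_routes):
    """Safety rule: advertise a prefix externally only while the edge router
    still holds an internal (backbone) route to it."""
    return [p for p in prefixes if p in backbone_routes]

prefixes = ["129.134.30.0/24", "185.89.218.0/23", "157.240.0.0/16"]

# Normal operation: backbone routes present, everything is advertised
print(edge_announcements(prefixes, set(prefixes)))

# Backbone disconnected: internal routes vanish, so every external
# announcement is withdrawn -- exactly what happened on October 4
print(edge_announcements(prefixes, set()))  # []
```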

The BGP Withdrawals

Within seconds of the backbone going down, AS32934 withdrew all of its prefix announcements from the global BGP routing table. Every AS path leading to Facebook's address space was removed. From the perspective of every other network on the internet, Facebook's IP addresses simply ceased to exist.

Route collectors like RIPE RIS and RouteViews recorded the mass withdrawal in real time. Within two minutes, the global routing table had zero routes to any Facebook prefix. It was as if someone had erased Facebook from the internet's map.
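Diffing two collector snapshots shows the scale of the event. A sketch with illustrative data follows; a real script would pull its snapshots from RIPE RIS or RouteViews archives rather than hard-coding them:

```python
def withdrawn(before, after):
    """Prefixes present in the earlier snapshot but absent from the later one."""
    return sorted(set(before) - set(after))

# Illustrative snapshots of prefixes originated by AS32934
at_1538_utc = ["129.134.30.0/24", "185.89.218.0/23", "157.240.0.0/16"]
at_1542_utc = []  # every announcement gone

lost = withdrawn(at_1538_utc, at_1542_utc)
print(f"{len(lost)} of {len(at_1538_utc)} prefixes withdrawn: {lost}")
```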

[Figure: BGP route withdrawal cascade. Before -- AS32934 announces its routes normally, and ISPs can reach the DNS nameservers over Facebook's backbone. After -- all routes are withdrawn, so when a user's recursive resolver queries facebook.com, Facebook's nameservers are unreachable because no BGP route exists.]

The withdrawal messages themselves are a normal part of BGP operation. Networks withdraw routes all the time -- when links go down, when maintenance is performed, or when routing policy changes. What made this event exceptional was the scope: every single prefix announced by AS32934 was withdrawn simultaneously, and the withdrawal persisted for hours because the engineers who could fix it had lost access to the systems they needed.

The DNS Cascading Failure

The BGP withdrawal did not directly "break" DNS. What it did was make Facebook's authoritative DNS nameservers unreachable at the network layer. Here is the chain of failure:

  1. BGP routes withdrawn -- the internet's routers no longer have a path to Facebook's IP prefixes
  2. DNS nameservers unreachable -- Facebook's authoritative nameservers (a.ns.facebook.com, b.ns.facebook.com, etc.) have IP addresses within those withdrawn prefixes
  3. DNS resolution fails -- when a recursive resolver (like 8.8.8.8 or 1.1.1.1) tries to resolve facebook.com, it needs to contact Facebook's nameservers, but it cannot reach them
  4. All services fail -- every domain under Facebook's control (facebook.com, instagram.com, whatsapp.net, fbcdn.net, oculus.com) becomes unresolvable
  5. Cached DNS entries expire -- DNS responses have a TTL (time to live). Once cached entries expire, even clients that previously resolved these domains successfully can no longer reach them
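Step 5 is why the failure worsened over time rather than hitting everyone at the same instant. A toy cache model (names, addresses, and TTLs illustrative) shows the behavior:

```python
class DnsCache:
    """Toy resolver cache: an entry answers queries only until its TTL expires."""
    def __init__(self):
        self._entries = {}

    def put(self, name, addr, ttl, now):
        self._entries[name] = (addr, now + ttl)

    def get(self, name, now):
        addr, expires = self._entries.get(name, (None, 0))
        return addr if now < expires else None

cache = DnsCache()
cache.put("facebook.com", "157.240.1.35", ttl=60, now=0)  # address illustrative

# Within the TTL, clients still resolve the name from the cache...
print(cache.get("facebook.com", now=30))  # cached answer

# ...but once it expires, the resolver must contact the unreachable
# authoritative nameservers, and resolution fails with SERVFAIL
print(cache.get("facebook.com", now=90))  # None
```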

The effect was total. It was not a partial degradation or a regional issue. Every user on Earth who tried to access any Facebook property after their DNS cache expired received a SERVFAIL response.

Why This Was So Hard to Fix

In most outages, engineers can SSH into routers, push a config fix, and restore service within minutes. The Facebook outage was different because the failure created a self-reinforcing lockout: the remote-access tools, internal dashboards, and DNS that engineers needed to diagnose and fix the problem all ran on the same network that had just vanished.

Engineers had to physically travel to Facebook's data centers and gain hands-on access to the backbone routers to revert the configuration change. This is why an issue that was conceptually simple -- "undo the last config change" -- took approximately six hours to resolve.

The BGP View: What the World Saw

From the perspective of BGP route collectors and looking glass tools, the outage was immediately visible. Here is what the data showed:

Before the outage

Facebook's AS32934 was announcing hundreds of prefixes, including the nameserver prefixes 129.134.30.0/24 and 185.89.218.0/23 described earlier.

Each prefix had healthy AS paths visible from multiple vantage points, with AS32934 as the origin.

During the outage

All prefixes originated by AS32934 were withdrawn. Looking glass queries returned zero results. The autonomous system was effectively invisible. BGP monitoring services like BGPStream and Cloudflare Radar showed the withdrawals propagating globally within seconds.

After recovery

Starting around 21:00 UTC, prefixes began reappearing in the global routing table. However, full recovery was not instant -- BGP convergence, DNS cache repopulation, and service startup all took additional time. Most users saw service restored between 21:30 and 22:00 UTC.

[Figure: AS32934 prefixes visible in the global routing table on October 4, 2021 (UTC). Roughly 300 prefixes before the 15:39 BGP withdrawals, then zero routes for about six hours, with recovery beginning around 21:00 and prefixes restored by 22:00.]

Collateral Damage: The DNS Tsunami

The outage had a significant secondary effect on the global DNS infrastructure. When Facebook's nameservers became unreachable, every recursive DNS resolver on the internet began receiving SERVFAIL responses for Facebook-related domains. Many applications and clients responded by retrying aggressively.

The volume of DNS queries for Facebook, Instagram, and WhatsApp domains increased by an estimated 30x compared to normal levels. Recursive resolvers like Cloudflare's 1.1.1.1 and Google's 8.8.8.8 were flooded with queries that would all fail. This created noticeable load on DNS infrastructure worldwide, and some smaller resolver operators reported degraded performance for all DNS queries, not just Facebook-related ones.
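Clients that retry failed lookups immediately and in lockstep amplify exactly this kind of storm. A common mitigation is exponential backoff with jitter, sketched minimally here (parameters are illustrative):

```python
import random

def backoff_delays(attempts, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: each retry waits a random
    amount up to min(cap, base * 2**attempt), desynchronizing clients."""
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]

random.seed(7)  # deterministic output for the example
for i, delay in enumerate(backoff_delays(5)):
    print(f"retry {i + 1}: wait {delay:.2f}s")
```

Without the jitter, desynchronized clients resynchronize on the failure itself and hammer resolvers in waves; the randomness is what spreads the load.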

Cloudflare published data showing that their resolver handled a massive spike in queries for Facebook's domains during the outage, and that the retry storms from clients constituted a significant portion of their total query volume during the event.

Technical Deep Dive: Why the Safety System Failed

Facebook's engineers had built safeguards into their change management process. Configuration changes to the backbone were supposed to be validated by an audit tool before being applied. This tool was designed to catch dangerous changes -- like ones that would disconnect all backbone links simultaneously.

The audit tool had a bug. It failed to evaluate the command correctly and approved a change that should have been blocked. According to Facebook's postmortem, the specific failure was:

  1. An engineer initiated a command intended to assess backbone capacity
  2. The command contained a parameter that, when applied, would withdraw all backbone connections to the peering edge
  3. The audit tool was supposed to simulate the effect of the command and reject it if it would cause a loss of connectivity
  4. Due to a bug, the audit tool did not correctly model the impact and allowed the command to proceed
  5. The command was applied, and within seconds, the backbone was disconnected from the edge

This is a classic example of a safety system failure mode: the operators trusted the guardrails, the guardrails had a defect, and the result was a total outage. It is analogous to a circuit breaker that fails to trip, allowing a fault to propagate through the entire system.
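The guardrail that failed can be pictured as a pre-flight simulation: model the backbone as a graph, apply the proposed change, and reject it if the data centers would lose their path to the peering edge. This is a hypothetical sketch of such a check, not Facebook's actual tool:

```python
def reachable(links, start):
    """Nodes reachable from `start` over the given undirected links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, stack = {start}, [start]
    while stack:
        for n in adj.get(stack.pop(), ()):
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen

def audit(links, links_to_disable):
    """Reject any change that would disconnect the data centers from the edge."""
    remaining = [l for l in links if l not in links_to_disable]
    ok = "edge" in reachable(remaining, "dc1")
    return "APPROVED" if ok else "REJECTED: would disconnect backbone from edge"

backbone = [("dc1", "dc2"), ("dc1", "edge"), ("dc2", "edge")]

print(audit(backbone, [("dc1", "edge")]))                   # safe: a path remains
print(audit(backbone, [("dc1", "edge"), ("dc2", "edge")]))  # the Oct 4 class of change
```

The bug in Facebook's tool was, in effect, that this simulation step produced the wrong answer for the command that was issued.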

The Recovery Process

Restoring service required several steps, each complicated by the outage itself:

  1. Physical access -- engineers traveled to data center facilities. Reports indicate that physical security systems (badge readers, door locks) had some dependencies on the affected network, adding delays
  2. Console access -- once inside, engineers connected directly to backbone routers via serial console or out-of-band management interfaces
  3. Configuration revert -- the erroneous backbone configuration was reverted, restoring internal connectivity
  4. BGP re-establishment -- with backbone connectivity restored, edge routers re-established BGP sessions with peers and began re-announcing Facebook's prefixes
  5. DNS propagation -- as routes propagated through the global BGP table, DNS resolvers could once again reach Facebook's nameservers. DNS resolution began working, but cached negative responses (SERVFAIL) in some resolvers added additional delay
  6. Service startup -- Facebook's services had to handle a massive thundering herd of reconnecting users, requiring careful capacity management during the ramp-up
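Step 6's careful ramp-up can be sketched as simple admission control: cap the rate at which reconnecting clients are admitted and raise the cap each interval. This is a toy model with illustrative numbers, not Facebook's actual mechanism:

```python
def ramp_admissions(waiting, start_rate, growth, steps):
    """Admit waiting clients at a rate that grows each step, rather than
    letting the whole thundering herd reconnect at once."""
    admitted_per_step, rate = [], start_rate
    for _ in range(steps):
        batch = min(waiting, rate)
        admitted_per_step.append(batch)
        waiting -= batch
        rate = int(rate * growth)
    return admitted_per_step, waiting

batches, still_waiting = ramp_admissions(
    waiting=1_000_000, start_rate=50_000, growth=2, steps=5
)
print(batches, still_waiting)
```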

Impact and Consequences

The outage had far-reaching effects: Facebook's stock price fell nearly 5% that day; competing messaging apps such as Telegram and Signal reported surges of new signups; and the many businesses worldwide that rely on WhatsApp, Instagram, and Facebook for commerce and customer contact were cut off for the duration.

Lessons for Network Engineering

The Facebook outage of October 2021 offers several important lessons for anyone operating internet infrastructure:

1. DNS should not be a single point of failure

Facebook's authoritative DNS nameservers were all hosted within Facebook's own autonomous system. When AS32934's BGP routes were withdrawn, all nameservers became unreachable simultaneously. Using DNS providers in separate autonomous systems (or at least separate BGP announcements) would have maintained DNS resolution even if the primary network went down.
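A quick way to audit this single-point-of-failure risk is to check whether a zone's nameservers are originated by more than one autonomous system. The sketch below uses hypothetical data; a real check would resolve the zone's NS set and map each address to its origin AS via BGP data:

```python
def ns_as_diversity(ns_origin_asns):
    """Count distinct origin ASNs across a zone's nameservers; a count of 1
    means a single BGP incident can take every nameserver offline at once."""
    return len(set(ns_origin_asns.values()))

# Hypothetical pre-outage picture: every nameserver originated by AS32934
single_as = {"a.ns": 32934, "b.ns": 32934, "c.ns": 32934, "d.ns": 32934}

# A more resilient layout: secondaries hosted in unrelated ASes (ASNs illustrative)
diverse = {"a.ns": 32934, "b.ns": 32934, "c.ns": 13335, "d.ns": 16509}

print(ns_as_diversity(single_as))  # 1 -> single point of failure
print(ns_as_diversity(diverse))    # 3 -> survives any one AS disappearing
```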

2. Out-of-band management is essential

When your primary network is the thing that is broken, you need a completely independent path to reach your management interfaces. This means out-of-band management networks that do not depend on the production backbone or DNS, with separate physical access paths and independent authentication.

3. Audit tools are only as good as their test coverage

The audit tool that should have caught the dangerous configuration change had a bug. Safety-critical validation systems need rigorous testing, including tests that specifically verify they reject configurations known to cause total outages.

4. BGP withdrawal propagation is fast and global

Within minutes, Facebook's prefixes were removed from routing tables worldwide. There is no "undo" button for BGP -- once withdrawals propagate, the only recovery is to re-announce the routes, which requires fixing whatever caused the withdrawal in the first place.

5. Physical access procedures must work during outages

If your badge system, door locks, or security protocols depend on the same network you need to fix, you have a circular dependency. Data center physical access must remain functional even during total network failures.

How BGP Monitoring Could Have Helped

While monitoring cannot prevent a misconfiguration, it can dramatically reduce detection time. BGP looking glass tools and route monitoring services detected the Facebook withdrawal within seconds. Organizations that monitor their own BGP announcements from external vantage points can set up alerts that trigger immediately when their prefixes disappear from the global routing table.
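Such an alert reduces to a simple check: is each expected prefix still visible from most external vantage points? A sketch with hypothetical per-vantage-point data (a real monitor would feed this from RIPE RIS or RouteViews streams):

```python
def prefix_alerts(expected_prefixes, observations, threshold=0.5):
    """Alert on any expected prefix visible from fewer than `threshold`
    of the external vantage points."""
    alerts = []
    for prefix in expected_prefixes:
        seen = sum(prefix in visible for visible in observations.values())
        if seen / len(observations) < threshold:
            alerts.append(prefix)
    return alerts

expected = ["129.134.30.0/24", "185.89.218.0/23"]

# Hypothetical views from three vantage points
healthy = {"vp-ams": set(expected), "vp-nyc": set(expected), "vp-sgp": set(expected)}
outage  = {"vp-ams": set(), "vp-nyc": set(), "vp-sgp": set()}

print(prefix_alerts(expected, healthy))  # [] -- all prefixes visible everywhere
print(prefix_alerts(expected, outage))   # both prefixes missing -> page someone
```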

Key monitoring practices include watching your own prefixes from external vantage points (route collectors such as RIPE RIS and RouteViews), alerting on unexpected withdrawals or origin-AS changes, and tracking services like BGPStream and Cloudflare Radar for global routing anomalies.

See Facebook's Network Today

Facebook's network has since been restored and continues to operate normally. You can explore its current routing state with a BGP looking glass or public route collectors.

This outage remains one of the most significant BGP-related incidents in internet history, alongside the 2008 Pakistan/YouTube BGP hijack. It demonstrated that even the most sophisticated networks are vulnerable to the fundamental mechanics of BGP routing -- and that when DNS and BGP fail together, the consequences are swift and total.
