The CenturyLink/Level 3 Flowspec Outage (2020)
On August 30, 2020, a single misconfigured BGP flowspec rule cascaded across the global backbone of CenturyLink/Level 3 (AS3356), one of the world's largest Tier 1 transit providers. For roughly five hours, routers across CenturyLink's network dropped legitimate traffic, disrupting internet service for millions of users and knocking 911 emergency call systems offline in multiple US states. The incident remains one of the most significant examples of how a small configuration error in a critical autonomous system can ripple across the entire internet.
What is BGP Flowspec?
Before examining what went wrong, it helps to understand the technology at the center of this outage. BGP flowspec (defined in RFC 5575) is an extension to BGP that allows routers to distribute traffic filtering rules alongside routing information. While standard BGP deals with where to send traffic (via prefixes and AS paths), flowspec deals with which traffic to allow, rate-limit, or drop.
A flowspec rule can match on multiple packet fields simultaneously:
- Source and destination IP prefixes
- IP protocol (TCP, UDP, ICMP)
- Source and destination port numbers
- Packet length
- DSCP (traffic class) values
- TCP flags
- ICMP type and code
The action associated with a flowspec rule can drop packets, rate-limit them, redirect them to a different VRF (virtual routing and forwarding instance), or mark them with specific DSCP values. These rules propagate across a network through BGP sessions, just like regular routes do, which means they can spread very quickly through iBGP meshes and route reflectors.
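To make this concrete, a flowspec rule can be thought of as a bundle of match criteria plus an action. A minimal Python sketch follows; the class and field names are illustrative, not a router or vendor API:

```python
from dataclasses import dataclass
from ipaddress import IPv4Network
from typing import Optional

@dataclass
class FlowspecRule:
    """Illustrative model of a flowspec rule (match components per RFC 5575)."""
    dst_prefix: Optional[IPv4Network] = None  # type 1: destination prefix
    src_prefix: Optional[IPv4Network] = None  # type 2: source prefix
    ip_protocol: Optional[int] = None         # type 3: e.g. 6 = TCP, 17 = UDP
    dst_ports: tuple = ()                     # type 5: destination ports
    src_ports: tuple = ()                     # type 6: source ports
    packet_lengths: tuple = ()                # type 10: packet length values
    action: str = "discard"                   # discard | rate-limit | redirect | mark

# A rule meant to drop a UDP reflection attack aimed at one victim host
# (addresses and ports are examples):
rule = FlowspecRule(
    dst_prefix=IPv4Network("198.51.100.7/32"),  # the victim
    ip_protocol=17,                             # UDP
    src_ports=(11211,),                         # memcached reflection traffic
    action="discard",
)
```

The danger in the CenturyLink incident was exactly a rule like this with its match criteria far too loose: the fewer fields a rule constrains, the more traffic it swallows.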
Flowspec is a powerful tool for DDoS mitigation because it lets a network deploy surgical traffic filters at every router in the network within seconds, without logging into each device individually. But this same speed and reach is precisely what makes it dangerous when something goes wrong.
CenturyLink/Level 3: The Backbone's Backbone
To appreciate the scale of this outage, you need to understand the role that CenturyLink (now Lumen Technologies) plays in the internet's architecture. Operating under AS3356, Level 3/CenturyLink is consistently ranked as one of the two or three largest autonomous systems by the number of downstream networks it serves. It is a Tier 1 network -- a provider that can reach every destination on the internet purely through settlement-free peering, without purchasing transit from anyone.
As of 2020, AS3356 carried an estimated 3.5% of all global internet traffic. Thousands of ISPs, enterprises, and content providers relied on CenturyLink's backbone to reach the rest of the internet. When AS3356 has a routing problem, it is not just CenturyLink's customers who are affected -- the ripple effects touch networks multiple hops away.
The Sequence of Events
The outage began in the early morning hours of August 30, 2020 (UTC). Here is the timeline as reconstructed from public reports, FCC filings, and network monitoring data:
10:04 UTC -- The Bad Rule is Injected
CenturyLink's automated DDoS mitigation system generated a flowspec rule intended to block malicious traffic associated with an ongoing denial-of-service attack. The rule itself was malformed -- it contained a configuration error that made it far broader than intended. Instead of targeting specific attack traffic, the rule matched a vastly larger set of packets.
The flowspec rule was injected into the iBGP mesh via a route reflector, which propagated it to every router in CenturyLink's global network within seconds. This is standard flowspec behavior -- the same mechanism that makes it effective for rapid DDoS response also ensures that a bad rule reaches everywhere, fast.
10:04-10:30 UTC -- Cascading Failures
As routers across the CenturyLink backbone installed the malformed flowspec rule, they began dropping legitimate traffic that matched the overly broad filter. But the situation was more complex than a simple traffic blackhole. The flowspec rule interacted with the routers' forwarding behavior in a way that caused additional instability:
- Traffic that should have been forwarded normally was dropped, causing TCP retransmissions and application timeouts
- Some routers experienced elevated CPU load as they processed the unexpected filtering behavior, affecting their ability to maintain BGP sessions
- BGP session flaps began occurring as keepalive timers expired on overloaded routers
- Each BGP session reset triggered route withdrawals and re-advertisements, creating a cascade of routing updates across the network
The result was a feedback loop: the flowspec rule caused traffic drops, which caused BGP instability, which caused route withdrawals, which caused more traffic to shift to remaining paths, which overloaded those paths, which caused more BGP flaps.
10:30 UTC -- External Impact Becomes Visible
Within 30 minutes, monitoring systems worldwide detected the problem. Cloudflare, Kentik, ThousandEyes, and other network monitoring platforms reported massive packet loss traversing CenturyLink's backbone. The RIPE RIS route collectors showed BGP route instability for thousands of prefixes normally transiting AS3356.
Networks that relied on CenturyLink as their sole transit provider lost internet connectivity entirely. Networks with multiple upstreams experienced degraded performance as traffic failed over to alternate paths that were not provisioned to handle the full load.
11:00+ UTC -- 911 Services Go Down
The most alarming consequence was the impact on 911 emergency services. Multiple US states reported failures in their emergency call routing. 911 systems in several states -- including Idaho, Missouri, and Washington -- experienced outages or degraded service. The 911 infrastructure in many regions depended on CenturyLink's network for call routing and database lookups. With the backbone dropping traffic, emergency calls could not be completed.
This is what elevated the incident from a network engineering problem to a public safety crisis and ultimately triggered an FCC investigation.
~15:00 UTC -- Restoration
CenturyLink engineers identified the malformed flowspec rule as the root cause and began the process of removing it from the network. However, recovery was not instantaneous. Removing the rule required careful coordination to avoid triggering additional instability, and the BGP convergence process -- in which thousands of sessions had to re-establish and their routes had to be re-advertised and re-learned -- took additional time. Full restoration of service took approximately five hours from the initial incident.
Technical Root Cause: The Malformed Flowspec Rule
The FCC's subsequent investigation and CenturyLink's own disclosures revealed the technical details. The flowspec rule was generated by an automated DDoS mitigation platform. The rule was intended to filter traffic associated with a specific attack, but a flaw in the rule's construction caused it to match a far wider range of traffic than intended.
The critical problem was that the flowspec rule, once installed on a router, caused the router to begin dropping packets that were part of the BGP control plane itself -- specifically, packets needed to maintain iBGP sessions between CenturyLink's own routers. This created the devastating feedback loop:
- The flowspec rule is distributed to all routers via iBGP
- Routers install the rule and begin filtering traffic
- The filter matches some iBGP packets, disrupting BGP sessions
- Disrupted BGP sessions cause route withdrawals
- Route withdrawals cause traffic shifts and more instability
- Engineers attempting to remove the rule face difficulty because the management plane itself is degraded
This last point is crucial: the very mechanism needed to remove the bad rule (BGP) was itself impaired by the rule. It is similar to a fire that has damaged the fire suppression system -- the tool you need to fix the problem is broken by the problem itself.
Global Impact: Measured in Lost Packets
Network monitoring companies provided extensive data on the outage's reach. Cloudflare reported that traffic from CenturyLink's network dropped by approximately 30% during the peak of the incident. Kentik's measurements showed packet loss rates exceeding 50% on paths transiting AS3356.
The impact was not limited to CenturyLink's direct customers. Because AS3356 serves as a transit provider for thousands of downstream networks, the outage created a "gravity well" in the internet's topology. Networks that used CenturyLink as one of multiple upstreams saw traffic fail over to their other providers, but those alternate paths often lacked the capacity to absorb the full load. This caused secondary congestion on networks that had no direct relationship with CenturyLink.
Major services affected included:
- Cloud platforms -- AWS, Azure, and GCP customers using CenturyLink transit experienced connectivity issues
- Gaming services -- Xbox Live, PlayStation Network, and Steam reported disruptions
- Enterprise VPNs -- Remote workers (in the midst of the COVID-19 pandemic) lost access to corporate networks
- Streaming services -- Hulu, Spotify, and other platforms experienced degraded performance
- Financial services -- Some online banking and trading platforms were disrupted
Perhaps most critically, the August 2020 incident occurred during the COVID-19 pandemic, when internet reliability was more important than ever. Millions of people were working remotely, students were attending school online, and telehealth visits had replaced in-person doctor visits. The outage underscored just how dependent modern life had become on a small number of backbone networks.
The 911 Crisis
The failure of 911 emergency services was the most serious consequence and the primary reason the FCC launched a formal investigation. The US 911 system relies on a complex chain of telecommunications infrastructure to connect callers to the appropriate Public Safety Answering Point (PSAP). Many PSAPs use dedicated circuits or IP-based connections that traverse carrier backbone networks.
When CenturyLink's backbone began dropping traffic, the signaling and media paths for 911 calls were disrupted. In some cases, callers heard nothing when dialing 911. In other cases, calls connected but without the caller's location data (ANI/ALI -- Automatic Number Identification / Automatic Location Identification), which is essential for dispatching emergency responders.
The FCC determined that the impact on 911 was particularly egregious because CenturyLink serves as a Local Exchange Carrier (LEC) in many US markets, meaning it is not just a backbone provider but also the local telephone company responsible for last-mile connectivity. The concentration of both backbone and local access in a single provider meant there was no fallback path for 911 calls in affected areas.
The FCC Investigation
The FCC opened an investigation into the outage, focusing on the 911 impact. Their findings highlighted several systemic issues:
- Insufficient testing of flowspec rules -- The malformed rule was not validated against a test environment before being deployed to production
- Lack of blast radius controls -- There was no mechanism to limit how widely a flowspec rule propagated before its effects could be observed
- Inadequate monitoring -- The automated system that generated the rule did not have sufficient safeguards to detect when a rule was having unintended effects
- Single points of failure in 911 infrastructure -- Many 911 systems had no diverse routing that would survive a single carrier outage
CenturyLink agreed to a series of corrective actions, including improved change management procedures for flowspec rules, enhanced monitoring of flowspec rule effects, and investments in 911 network resilience.
Lessons for Network Engineering
The CenturyLink outage is a case study in how powerful automation tools can amplify human or software errors. Several lessons apply broadly to network operations:
1. Flowspec Needs Guardrails
Flowspec is essentially "firewall rules distributed at BGP speed." Its ability to instantly deploy filtering rules across an entire backbone is both its greatest strength and its greatest vulnerability. Best practices that emerged after this incident include:
- Rate limiting flowspec rule propagation -- introducing delays before a new rule is installed globally
- Canary deployments -- applying new flowspec rules to a subset of routers first and monitoring the effect before global rollout
- Automated sanity checks -- validating that a flowspec rule does not match control plane traffic (BGP, OSPF/ISIS, management protocols) before allowing it to be installed
- Kill switches -- maintaining out-of-band access to routers so that flowspec rules can be removed even when the in-band management plane is degraded
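The "automated sanity checks" idea can be sketched as a pre-deployment validator. This is an illustrative example, not a vendor feature; the rule fields and rejection policy are assumptions chosen to mirror the failure mode in this incident:

```python
# Sketch of a pre-deployment sanity check for flowspec rules. The rule is a
# plain dict here; field names are hypothetical, not a router API.
BGP_PORT = 179  # BGP sessions run over TCP port 179

def rule_is_safe(rule: dict) -> bool:
    """Reject rules that could match control-plane traffic or are overly broad."""
    # A rule with no destination prefix matches traffic to *every* address,
    # including the router loopbacks that carry iBGP sessions.
    if rule.get("dst_prefix") in (None, "0.0.0.0/0"):
        return False
    # Never allow a rule that could match BGP itself: any rule that might
    # cover TCP (protocol unset or 6) must name ports, and never port 179.
    if rule.get("ip_protocol") in (None, 6):
        ports = set(rule.get("dst_ports", [])) | set(rule.get("src_ports", []))
        if not ports or BGP_PORT in ports:
            return False
    return True

# Overly broad rule: rejected.
assert not rule_is_safe({"dst_prefix": None, "action": "discard"})
# Rule that would filter BGP sessions: rejected.
assert not rule_is_safe({"dst_prefix": "198.51.100.7/32",
                         "ip_protocol": 6, "dst_ports": [179]})
# Narrow rule against UDP reflection traffic: allowed.
assert rule_is_safe({"dst_prefix": "198.51.100.7/32",
                     "ip_protocol": 17, "src_ports": [11211]})
```

A real implementation would check far more (IGP and management protocols, rate-limit actions, prefix scope against customer allocations), but even a gate this simple would have flagged a rule broad enough to match iBGP traffic.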
2. Control Plane Protection is Non-Negotiable
One of the most important practices in network engineering is Control Plane Policing (CoPP) -- ensuring that traffic to and from the router's control plane (BGP sessions, routing protocol adjacencies, management access) is protected from data plane filtering. The CenturyLink outage demonstrated what happens when this protection fails: the network loses the ability to heal itself.
Modern router configurations should ensure that flowspec rules can never match control plane traffic, regardless of the rule's content. This is equivalent to a "do not filter the filters" principle.
3. Tier 1 Failures Have Outsized Impact
The internet's routing system is theoretically decentralized, but in practice a small number of Tier 1 networks carry a disproportionate share of global traffic. When AS3356, AS1299, or AS2914 has a problem, the effect is global. Networks that depend on a single Tier 1 provider have no fallback -- this is why multihoming (purchasing transit from multiple providers) is considered essential for any service that requires high availability.
4. Automated Systems Need Circuit Breakers
The flowspec rule was generated by an automated DDoS mitigation system. Automation is necessary at the scale of a Tier 1 backbone -- no human operator can manually respond to the volume and velocity of DDoS attacks these networks face. But automation must include circuit breakers: mechanisms that halt automated actions when unexpected outcomes are detected.
If the DDoS mitigation system had monitored the network's health after deploying its rule and detected the resulting packet loss and BGP instability, it could have automatically withdrawn the rule within minutes rather than hours.
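A circuit breaker of that kind can be sketched as follows. The deploy, withdraw, and measurement hooks are hypothetical stand-ins for a real mitigation controller, and the thresholds are illustrative:

```python
import time

# Sketch of a "circuit breaker" wrapped around automated flowspec deployment.
# deploy_rule / withdraw_rule / measure_packet_loss are hypothetical hooks
# into a mitigation controller, not a real API.
LOSS_THRESHOLD = 0.05  # withdraw if packet loss rises more than 5 points

def deploy_with_circuit_breaker(rule, deploy_rule, withdraw_rule,
                                measure_packet_loss,
                                interval=30.0, checks=4):
    """Deploy a rule, watch network health, and auto-withdraw on regression."""
    baseline = measure_packet_loss()
    deploy_rule(rule)
    for _ in range(checks):
        time.sleep(interval)
        if measure_packet_loss() - baseline > LOSS_THRESHOLD:
            # The rule made things worse: pull it automatically rather than
            # waiting hours for a human to find the root cause.
            withdraw_rule(rule)
            return False
    return True
```

The key design choice is that the system measures a health baseline *before* acting, so "worse than before" is defined relative to the network's state at deployment time rather than an absolute threshold.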
Flowspec vs. RTBH: Two Approaches to DDoS Mitigation
The CenturyLink incident renewed discussion about the tradeoffs between flowspec and RTBH (Remotely Triggered Black Hole) routing, the two primary BGP-based DDoS mitigation techniques:
- RTBH -- A destination IP is black-holed (all traffic to it is dropped) by announcing a /32 route for it with a special community that tells upstream routers to discard traffic. This is coarse-grained: it stops the attack but also drops all legitimate traffic to the victim IP. However, RTBH is simple and well-understood, with limited blast radius.
- Flowspec -- Allows surgical filtering based on multiple packet fields. A well-crafted flowspec rule can block attack traffic while allowing legitimate traffic through. But as the CenturyLink incident showed, the complexity that enables precision also increases the risk of misconfiguration.
Many networks now use a layered approach: RTBH for emergency response when an attack threatens network stability, and flowspec for more targeted mitigation once the attack traffic has been characterized. Some operators have moved flowspec filtering off their backbone routers entirely, instead steering traffic to dedicated scrubbing centers where flowspec rules are applied in a more controlled environment.
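The contrast between the two techniques can be illustrated as data. This sketch is not a vendor API; the only standardized value in it is the well-known BLACKHOLE community 65535:666 defined in RFC 7999:

```python
# Illustrative contrast between RTBH and flowspec mitigations, expressed
# as plain dicts (hypothetical structure, not a router API).

def rtbh_announcement(victim_ip: str) -> dict:
    """Coarse: blackhole ALL traffic to the victim's /32."""
    return {
        "prefix": f"{victim_ip}/32",
        "communities": ["65535:666"],  # RFC 7999 BLACKHOLE community:
                                       # upstreams discard traffic to this route
    }

def flowspec_rule(victim_ip: str, protocol: int, src_port: int) -> dict:
    """Surgical: drop only the attack traffic, keep legitimate flows."""
    return {
        "dst_prefix": f"{victim_ip}/32",
        "ip_protocol": protocol,   # e.g. 17 = UDP
        "src_ports": [src_port],   # e.g. 123 = NTP reflection
        "action": "discard",
    }
```

The RTBH version completes the attacker's job (the victim goes fully offline) but cannot take down anything else; the flowspec version preserves legitimate traffic but carries all the misconfiguration risk this incident demonstrated.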
How This Compares to Other Major Outages
The CenturyLink flowspec outage belongs in the pantheon of major internet incidents alongside other BGP-related events. Each highlights a different vulnerability in the internet's routing infrastructure:
- YouTube/Pakistan Telecom (2008) -- Pakistan Telecom announced a more-specific route for YouTube's prefix, intended to block the site domestically, and the announcement leaked to the global internet, hijacking YouTube's traffic worldwide. Highlighted the lack of origin validation (now addressed by RPKI).
- Google (2012) -- An internal BGP misconfiguration caused Google to leak thousands of prefixes with an incorrect AS path, briefly rerouting traffic through unexpected networks.
- CenturyLink (2020) -- A flowspec rule took down a Tier 1 backbone. Highlighted the risks of distributing filtering rules via BGP without adequate safeguards.
- Facebook (2021) -- An internal configuration change withdrew all of Facebook's BGP routes, making the platform completely unreachable for six hours. Highlighted the danger of configuration systems that can withdraw all routes simultaneously.
All of these incidents share a common theme: the internet's routing system is powerful and efficient at propagating changes, which means it is equally efficient at propagating mistakes. A single misconfiguration in a single autonomous system can have global consequences within seconds.
The Current State of Flowspec Safety
Since the 2020 incident, the networking industry has made progress on flowspec safety. Router vendors have added features like flowspec validation rules that prevent rules from matching control plane traffic, rate limits on the number of flowspec rules that can be installed, and logging and alerting when flowspec rules cause significant traffic drops.
RFC 8955 (published in December 2020, a few months after the outage) obsoleted RFC 5575, updating the flowspec specification with improved security considerations. Network operators have also developed community best practices, including the recommendation to always test flowspec rules in a lab environment, implement gradual rollouts, and maintain out-of-band management access that cannot be affected by data plane filtering.
Despite these improvements, flowspec remains a tool that requires careful handling. The fundamental tension between "deploy filters quickly to stop an attack" and "deploy filters carefully to avoid collateral damage" has no easy resolution. Every large network that uses flowspec must balance these competing pressures.
Investigate AS3356 Yourself
You can explore CenturyLink/Lumen's network in real time using the looking glass. Look up AS3356 to see its current BGP announcements, the number of prefixes it originates and transits, and its peering relationships with other major networks. You can also examine the AS paths for any IP address to see whether AS3356 appears in the transit path -- chances are it does for many destinations.
- AS3356 -- CenturyLink / Lumen Technologies
- AS1299 -- Arelion (Telia Carrier), another major Tier 1
- AS2914 -- NTT, another major Tier 1
- AS6939 -- Hurricane Electric, known for open peering
Try it now: Enter any IP address, domain, or ASN in the search box to see its live BGP routing data -- including which Tier 1 transit providers like AS3356 carry its traffic.