Cloudflare WAF Regex Outage 2019: Catastrophic Backtracking
On July 2, 2019, Cloudflare's HTTP and HTTPS proxying was globally unavailable for approximately 27 minutes, from roughly 13:42 to 14:09 UTC. The cause was a single Web Application Firewall rule deployed globally in a matter of seconds — a rule containing a regular expression that exhibited catastrophic backtracking, driving CPU utilization to 100% on every Cloudflare edge server worldwide that processed HTTP traffic. For the duration, any request routed through Cloudflare returned a 502 error.
This was not a network failure, a BGP event, or a hardware problem. It was a software deployment that turned a safety mechanism — the WAF — into a self-inflicted denial of service.
Background: Cloudflare's WAF Architecture
Cloudflare's Web Application Firewall operates inline at the edge: every HTTP request passing through is evaluated against a set of managed rules maintained by Cloudflare's security team, which are updated frequently in response to new vulnerabilities and attack patterns. The WAF is implemented in a worker process on each edge server; when a rule matches, the request is blocked or challenged before it reaches the customer's origin server.
Cloudflare's deployment pipeline for WAF rules was significantly faster than its code deployment pipeline. Code changes went through staged rollouts — a small percentage of traffic, monitored, then gradually expanded. WAF rule changes, however, could be pushed globally almost immediately. The rationale was speed of response: when a critical vulnerability like a zero-day appears in widely deployed software, defenders need to be able to push protective rules within minutes, not hours.
This design tradeoff — fast global deployment of rule updates without gradual rollout — was the underlying condition that made this incident catastrophic rather than contained.
Timeline
Root Cause: Catastrophic Backtracking
The offending rule was intended to block cross-site scripting (XSS) attacks by detecting inline JavaScript in HTTP request bodies. Its regex combined a repeated alternation group with several adjacent greedy wildcards, a known trigger for catastrophic backtracking in backtracking-based regex engines.
The specific pattern had the structure:
(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))
The critical part, as Cloudflare's postmortem describes it, is the tail .*(?:.*=.*), which for matching purposes behaves like .*.*=.* once the non-capturing group is ignored. Each .* is greedy and can match the same characters as its neighbor, so when the input contains no = to complete the match, the engine must try every way of dividing the text between the wildcards before giving up. The repeated alternation in front of it, (?:...|...|...)+, can add further backtracking points. The work grows much faster than linearly with input length: doubling the input at least roughly quadruples the number of steps. For a sufficiently long HTTP request body (or even a modest one designed to trigger backtracking), a single regex evaluation could take seconds of CPU time.
Multiplied across millions of concurrent HTTP requests, 100% CPU utilization was nearly instantaneous.
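To make the growth concrete, the following minimal sketch models only the .*.*= fragment from the tail of the pattern, anchored at the start of the input. It is a simplified model rather than a real regex engine: the function name count_attempts is hypothetical, '.' is assumed to match any character, and the figure it returns is the number of times the model tries to match the literal =, not an engine's actual step count.

```python
def count_attempts(text: str) -> int:
    """Count '=' match attempts made by a naive backtracking matcher
    for the pattern .*.*= anchored at the start of `text`.

    Simplified model: '.' is treated as matching any character.
    """
    n = len(text)
    attempts = 0
    # The first .* starts greedy (takes everything) and gives back one
    # character at a time whenever the rest of the pattern fails.
    for first in range(n, -1, -1):
        # The second .* does the same with whatever the first left over.
        for second in range(n - first, -1, -1):
            attempts += 1
            pos = first + second
            if pos < n and text[pos] == "=":
                return attempts  # found '=', so the match succeeds
    return attempts  # every split was tried and all of them failed


if __name__ == "__main__":
    for n in (10, 20, 40, 80):
        print(n, count_attempts("x" * n))
```

On input with no = the model makes (n+1)(n+2)/2 attempts: 231 for 20 characters and 861 for 40. A real unanchored search repeats this from every starting offset, and the rest of the pattern adds further backtracking points on top.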
Why the Dashboard Was Also Unavailable
The control plane failure compounded the incident's severity. Cloudflare's customer dashboard and API are themselves served through Cloudflare's infrastructure. When the WAF rule drove HTTP worker CPUs to 100%, the dashboard — which customers would use to disable WAF rules or check status — was also returning 502 errors.
The kill switch to disable the managed WAF ruleset existed, but accessing it required going through the same degraded path. Operators had to find alternate control plane access — an internal path not proxied through the saturated HTTP workers — to execute the global disable. This took several minutes and extended the outage beyond what a rapid kill-switch response would have produced.
This is a common failure mode in complex systems: the mechanism to stop a bad deployment is accessible through the infrastructure affected by that deployment. Independent, resilient control plane access that does not traverse the data plane is a necessary design requirement for any kill-switch or emergency rollback system.
Lessons
1. Non-backtracking regex engines for WAF rules
The fundamental fix is to use a regex engine that guarantees matching time linear in the length of the input. RE2-class engines (Google RE2, Hyperscan, Rust's regex crate) compile patterns to automata (NFA/DFA) and never backtrack. They support a slightly reduced feature set (no backreferences and no lookaround, only simple assertions such as anchors and word boundaries) but are entirely adequate for network security pattern matching. In its postmortem, Cloudflare committed to moving the WAF to the RE2 or Rust regex engine, both of which provide run-time guarantees.
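The idea behind automaton-based matching can be sketched with a hand-built NFA for the fragment .*.*=.* from the rule's tail. This is an illustration of the technique only, not how RE2, Hyperscan, or the Rust crate are implemented; the state numbering and the name nfa_match are assumptions made for the example, and '.' is again assumed to match any character.

```python
from typing import Tuple

# Hand-built NFA for .*.*=.* (assumption: '.' matches any character).
# State 0: inside the first .*
# State 1: inside the second .*
# State 2: '=' has been seen; inside the final .* (accepting)
ACCEPTING = 2


def nfa_match(text: str) -> Tuple[bool, int]:
    """Return (matched, steps) by advancing a set of NFA states in lockstep."""
    # State 1 is reachable from state 0 without consuming input.
    states = {0, 1}
    steps = 0
    for ch in text:
        next_states = set()
        for state in states:
            steps += 1
            if state == 0:
                # Stay in the first .* or hand over to the second.
                next_states.update((0, 1))
            elif state == 1:
                next_states.add(1)
                if ch == "=":
                    next_states.add(2)
            else:
                next_states.add(2)
        states = next_states
    return ACCEPTING in states, steps


if __name__ == "__main__":
    print(nfa_match("x" * 20))
    print(nfa_match("x=x"))
```

Because each input character is examined once per live state and there are only three states, the step count can never exceed three times the input length, whether or not the input matches.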
2. Staged rollouts apply to rules, not just code
The same safeguards used for code deployments must apply to configuration and rule changes. A rule deployed to 1% of traffic, monitored for CPU anomalies and error rate spikes for several minutes, then incrementally expanded would have contained this incident to a tiny fraction of users and allowed rapid rollback before global impact. The distinction between "code" and "configuration" or "rules" is artificial when the configuration controls inline processing of every request.
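A staged rollout of this kind can be sketched as a loop over traffic fractions with a health gate between stages. Everything here is hypothetical: staged_rollout and its deploy, healthy, and rollback callbacks stand in for whatever a real deployment system provides, and a real healthy check would watch metrics for several minutes before returning.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    healthy: Callable[[float], bool],
    rollback: Callable[[], None],
) -> float:
    """Expand a rule through `stages` (fractions of traffic, ascending).

    Returns the fraction of traffic serving the rule when the function
    exits: the last stage on success, 0.0 if the rule was rolled back.
    """
    for fraction in stages:
        deploy(fraction)
        if not healthy(fraction):
            # Stop expanding and withdraw the rule everywhere.
            rollback()
            return 0.0
    return stages[-1] if stages else 0.0


if __name__ == "__main__":
    log = []
    result = staged_rollout(
        stages=[0.01, 0.10, 0.50, 1.0],
        deploy=lambda f: log.append(f"deploy {f:.0%}"),
        healthy=lambda f: False,  # simulate a rule that saturates CPU at once
        rollback=lambda: log.append("rollback"),
    )
    print(result, log)
```

In the simulated run the rule never gets past the first stage, so only the smallest traffic slice ever sees it.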
3. Automated canary metrics for WAF deployments
WAF rule deployments should automatically measure CPU utilization, request processing latency, and error rates at the first PoP receiving the rule. A rule that drives per-request processing time above a threshold (say, 10ms for what should be a sub-millisecond check) should halt the deployment automatically, without requiring human detection and intervention.
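One way such an automatic gate might look is sketched below. The 10 ms latency limit is the example figure from the paragraph above; the CPU and error-rate limits, the class CanaryLimits, and the function halt_reasons are all hypothetical values and names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class CanaryLimits:
    max_p99_latency_ms: float = 10.0   # example figure from the text
    max_cpu_utilization: float = 0.80  # hypothetical limit
    max_error_rate: float = 0.01       # hypothetical limit


def p99(samples: Sequence[float]) -> float:
    """Nearest-rank 99th percentile; returns 0.0 for an empty sample set."""
    if not samples:
        return 0.0
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integers
    return ordered[rank - 1]


def halt_reasons(
    latencies_ms: Sequence[float],
    cpu_utilization: float,
    error_rate: float,
    limits: CanaryLimits = CanaryLimits(),
) -> List[str]:
    """Return the list of violated limits; a non-empty list means halt."""
    reasons = []
    if p99(latencies_ms) > limits.max_p99_latency_ms:
        reasons.append("p99 latency above limit")
    if cpu_utilization > limits.max_cpu_utilization:
        reasons.append("CPU utilization above limit")
    if error_rate > limits.max_error_rate:
        reasons.append("error rate above limit")
    return reasons


if __name__ == "__main__":
    # Simulated canary PoP where a tenth of requests hit the slow path.
    print(halt_reasons([0.3] * 90 + [250.0] * 10, 1.0, 0.6))
```

Returning the reasons rather than a bare boolean lets the deployment system record why a rule was halted, which matters when nobody was watching at the moment it tripped.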
4. Independent control plane access
The dashboard must remain reachable when the data plane is saturated or returning errors. This requires at least one of the following: a separate infrastructure path for the control plane that bypasses the affected workers, aggressive circuit-breaking that prioritizes management traffic, or out-of-band access for operators that does not depend on the customer-facing infrastructure.
5. Regex safety review in rule development
Rules containing complex regexes should be reviewed with automated tools that detect ReDoS (Regular Expression Denial of Service) vulnerability patterns. Tools like safe-regex, rxxr2, or static analysis in CI pipelines can flag problematic patterns before they reach production. The most common structures behind catastrophic backtracking, such as nested quantifiers and adjacent overlapping wildcards, can be detected mechanically.
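As a rough illustration of what a CI check could look like, the sketch below applies two textual heuristics to a pattern string. It is not the algorithm used by safe-regex or rxxr2; lint_regex and both heuristics are assumptions made for this example. Because it inspects the pattern as plain text and ignores escaping, it produces both false positives and false negatives, so it should be read as a first-pass filter only.

```python
import re
from typing import List

# Heuristic 1: a quantified group whose body itself contains a quantifier,
# for example (a+)+ or (?:x*y)*. Escaped characters are not understood.
NESTED_QUANTIFIER = re.compile(r"\([^()]*[+*][^()]*\)[+*]")

# Heuristic 2: two unbounded wildcards back to back, optionally separated by
# the opening of a non-capturing group, for example .*.* or .*(?:.*
ADJACENT_WILDCARDS = re.compile(r"\.[*+](?:\(\?:)?\.[*+]")


def lint_regex(pattern: str) -> List[str]:
    """Return heuristic warnings for a regex pattern given as text."""
    findings = []
    if NESTED_QUANTIFIER.search(pattern):
        findings.append("nested quantifier")
    if ADJACENT_WILDCARDS.search(pattern):
        findings.append("adjacent unbounded wildcards")
    return findings


if __name__ == "__main__":
    for candidate in (r".*(?:.*=.*)", r"(a+)+$", r"^[a-z]+=\d+$"):
        print(candidate, lint_regex(candidate))
```

The first candidate is the tail of the rule discussed above, which the second heuristic flags. A production check would parse the pattern properly instead of scanning its text.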
The ReDoS Vulnerability Class
The Cloudflare incident is one of the most visible examples of a broader vulnerability class: ReDoS (Regular Expression Denial of Service). Any application that evaluates untrusted input against a backtracking regex engine is potentially vulnerable. The attack surface extends far beyond WAFs:
- Web frameworks: routing libraries that match URL paths with regexes are susceptible if path parameters are unbounded
- Input validation: email address validators, phone number formatters, and date parsers frequently contain ReDoS-vulnerable patterns — the email RFC is notoriously complex and naive regex implementations are common
- Log parsers: security tools that parse log files with complex regexes can be stalled by maliciously crafted log entries
- Build systems and linters: tools that scan source code with regexes can be attacked via crafted input files
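The vulnerability is largely mechanical: a pattern is at risk when a quantified sub-expression can match the same text in more than one way, as in (a+)+ or (a|aa)+, or when adjacent unbounded quantifiers overlap, as in .*.*=. Against input that almost matches but ultimately fails, a backtracking engine tries every such division of the text. A quantified alternation whose branches cannot overlap, such as (a|b|c)+, is not dangerous by itself. Static analysis tools can detect the common forms automatically. The OWASP testing guide lists ReDoS as a relevant attack vector for application-layer denial of service.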
For WAF rules specifically, the problem is compounded by the need for expressiveness: a WAF rule attempting to detect complex attack patterns naturally tends toward complex regexes. The engineering discipline of maintaining a rule base that is both effective and safe against backtracking requires either automated tooling, a non-backtracking engine, or both.
Impact
During the 27-minute window, all HTTP and HTTPS traffic routed through Cloudflare's proxy returned 502 errors. This included a substantial fraction of internet traffic: Cloudflare proxies millions of websites and serves as the reverse proxy for many major internet properties. Customers on Cloudflare's free tier, which provides no SLA, had no recourse. Enterprise customers received credits. The incident was notable because it affected essentially every Cloudflare customer simultaneously with no geographic variation — a global blast radius that is almost impossible to achieve with infrastructure hardware failures alone.
Related incidents in the CDN/infrastructure space: the Fastly CDN outage in June 2021 also began with a configuration change and caused near-global HTTP errors, though in that case the change exposed a latent software bug rather than a pathological regex pattern.
Comparison With Other Rule-Deployment Incidents
Several subsequent incidents follow the same template, in which a globally deployed configuration or rule change causes near-total service failure, and show why the lessons Cloudflare published in 2019 apply well beyond WAFs:
- The Fastly outage in June 2021 was triggered by a valid customer configuration change that exposed a latent software bug, causing a substantial portion of Fastly's network to return errors. Unlike the Cloudflare incident, CPU exhaustion was not the mechanism. The blast radius was similar: a large fraction of the internet's CDN-served content returned errors simultaneously.
- CrowdStrike's 2024 incident (see CrowdStrike Windows outage 2024) involved a content configuration file, a "channel file" analogous to a WAF rule update, that was pushed to endpoints globally and caused kernel crashes. The deployment pipeline for content updates again bypassed the staged rollout controls applied to software updates.
The common thread is that security tooling — WAF rules, endpoint detection content, CDN routing rules — operates at the edge of what is considered "code" versus "configuration," and organizations tend to apply faster deployment pipelines to configuration than to code. The blast radius when configuration is wrong is identical to when code is wrong, but the safeguards are often weaker.
Explore It Live
- AS13335 — Cloudflare's autonomous system; the network this outage affected globally
- 1.1.1.1 — Cloudflare's public DNS resolver, which remained operational during the outage (DNS is separate from HTTP proxying)
- 104.16.0.0 — a prefix in Cloudflare's anycast space; see how it is announced globally