Cloudflare WAF Regex Outage 2019: Catastrophic Backtracking

On July 2, 2019, Cloudflare's HTTP and HTTPS proxying was globally unavailable for approximately 27 minutes, from roughly 13:42 to 14:09 UTC. The cause was a single Web Application Firewall rule deployed globally in a matter of seconds — a rule containing a regular expression that exhibited catastrophic backtracking, driving CPU utilization to 100% on every Cloudflare edge server worldwide that processed HTTP traffic. For the duration, any request routed through Cloudflare returned a 502 error.

This was not a network failure, a BGP event, or a hardware problem. It was a software deployment that turned a safety mechanism — the WAF — into a self-inflicted denial of service.

Background: Cloudflare's WAF Architecture

Cloudflare's Web Application Firewall operates inline at the edge: every HTTP request passing through is evaluated against a set of managed rules maintained by Cloudflare's security team, which are updated frequently in response to new vulnerabilities and attack patterns. The WAF is implemented in a worker process on each edge server; when a rule matches, the request is blocked or challenged before it reaches the customer's origin server.

Cloudflare's deployment pipeline for WAF rules was significantly faster than its code deployment pipeline. Code changes went through staged rollouts — a small percentage of traffic, monitored, then gradually expanded. WAF rule changes, however, could be pushed globally almost immediately. The rationale was speed of response: when a critical vulnerability like a zero-day appears in widely deployed software, defenders need to be able to push protective rules within minutes, not hours.

This design tradeoff — fast global deployment of rule updates without gradual rollout — was the underlying condition that made this incident catastrophic rather than contained.

Timeline

Timeline of the July 2, 2019 Cloudflare WAF outage (all times UTC):

13:31  WAF rule change submitted for XSS detection improvement. The rule's regex contains greedy wildcards following a repeated alternation group.
13:42  Rule deployed globally to all edge PoPs simultaneously. CPU on HTTP worker processes spikes to 100% worldwide within seconds.
13:42–13:47  502 errors globally. Cloudflare dashboard begins degrading. On-call receives pages; initial hypothesis: attack traffic surge.
~13:52  Teams identify the CPU saturation pattern and correlate it with the rule deployment. Dashboard and API are impaired, so the kill switch is initially hard to reach.
~14:00  Decision made to globally disable the managed WAF ruleset. An alternate access path to the control plane is identified and used.
14:09  WAF kill switch activated globally. CPU drops and 502s resolve. Traffic fully restored. Total HTTP downtime: ~27 minutes.

Root Cause: Catastrophic Backtracking

The offending rule was intended to block cross-site scripting (XSS) attacks by detecting inline JavaScript in HTTP request bodies. The regex combined a repeated alternation group with a run of greedy wildcards, a known trigger for catastrophic backtracking in backtracking-based regex engines.

The specific pattern had the structure:

(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

According to Cloudflare's published analysis, the critical part is the tail of the expression, .*(?:.*=.*): two greedy wildcards in a row before the = and a third after it, all following the repeated alternation (?:...|...|...)+. When the input does not match, a backtracking engine must retry every way of dividing the text among those greedy pieces before it can give up, so the work grows much faster than linearly with input length. Patterns with nested quantifiers can degrade further, to exponential time. For a sufficiently long HTTP request body (or even a modest one designed to trigger backtracking), a single regex evaluation can consume far more CPU time than an inline WAF check can afford.

Multiplied across millions of concurrent HTTP requests, 100% CPU utilization was nearly instantaneous.
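One way to see how the wildcard tail of the rule contributes to the cost is to count what an idealized backtracking engine does with a simplified `.*.*=` pattern anchored at the start of the input. The sketch below is a hand-written model for illustration only, not Cloudflare's engine or rule; the function name `count_wildcard_attempts` is hypothetical, and real engines apply optimizations this model ignores.

```python
def count_wildcard_attempts(text: str) -> tuple[bool, int]:
    """Model a backtracking match of `.*.*=` anchored at the start of `text`.

    Returns (matched, attempts), where attempts counts how many times the
    literal '=' is tested after the two greedy wildcards pick their lengths.
    """
    n = len(text)
    attempts = 0
    # First `.*` is greedy: try the longest span first, then give back one char at a time.
    for first in range(n, -1, -1):
        # Second `.*` is greedy over whatever the first one left behind.
        for second in range(n - first, -1, -1):
            attempts += 1
            pos = first + second
            if pos < n and text[pos] == "=":
                return True, attempts
    return False, attempts


if __name__ == "__main__":
    for length in (20, 40, 80):
        _, attempts = count_wildcard_attempts("x" * length)
        print(length, attempts)
```

In this model an input of n characters with no `=` costs (n+1)(n+2)/2 attempts: 231 for 20 characters, 861 for 40, and 3,321 for 80, so doubling the input roughly quadruples the work. An unanchored search repeats this from every starting offset, which adds another factor of n.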

Figure: Catastrophic backtracking as exponential path explosion

Pattern (a+)+$ against "aaaaaaaax" (no match): the engine consumes the run of a's, fails at x, then backtracks through every way of splitting that run between the inner a+ and the outer group.

For a run of N a's, a backtracking engine explores on the order of 2^N combinations before giving up:

N=10: ~1,024 steps
N=20: ~1,048,576 steps
N=30: ~1 billion steps

A 30-character non-matching input can consume seconds of CPU per request. Multiplied across millions of concurrent requests, that saturates every core almost immediately.

Fix: use an automata-based (RE2-class) engine, which runs in O(N) time with no backtracking and no exponential worst case. RE2, RE2/J, Hyperscan, and Rust's regex crate are designed around that guarantee.
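The exponential growth described above can be reproduced in a few lines. The sketch below hand-implements backtracking for the toy pattern `(a+)+$`, anchored at the start of the input, and counts how many splits it tries. The pattern is chosen purely for illustration (it is not the Cloudflare rule), and `match_nested_plus` is a hypothetical helper, not part of any library.

```python
def match_nested_plus(text: str) -> tuple[bool, int]:
    """Backtracking match of `(a+)+$` from the start of `text`.

    Returns (matched, steps), where steps counts every choice of length
    tried for the inner `a+`.
    """
    steps = 0

    def group(pos: int) -> bool:
        nonlocal steps
        # Inner `a+` is greedy: find the longest run of 'a' starting at pos.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Try the longest run first, then back off one character at a time.
        for stop in range(end, pos, -1):
            steps += 1
            if stop == len(text):   # `$` matches at the end of the input
                return True
            if group(stop):         # outer `+`: attempt another group
                return True
        return False

    return group(0), steps


if __name__ == "__main__":
    for n in (5, 10, 15, 20):
        print(n, match_nested_plus("a" * n + "x"))
```

For n a's followed by a non-matching character this model takes 2^n - 1 steps, so each extra character doubles the work, while a matching input such as "aaa" succeeds on the first step.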

Why the Dashboard Was Also Unavailable

The control plane failure compounded the incident's severity. Cloudflare's customer dashboard and API are themselves served through Cloudflare's infrastructure. When the WAF rule drove HTTP worker CPUs to 100%, the dashboard — which customers would use to disable WAF rules or check status — was also returning 502 errors.

The kill switch to disable the managed WAF ruleset existed, but accessing it required going through the same degraded path. Operators had to find alternate control plane access — an internal path not proxied through the saturated HTTP workers — to execute the global disable. This took several minutes and extended the outage beyond what a rapid kill-switch response would have produced.

This is a common failure mode in complex systems: the mechanism to stop a bad deployment is accessible through the infrastructure affected by that deployment. Independent, resilient control plane access that does not traverse the data plane is a necessary design requirement for any kill-switch or emergency rollback system.

Lessons

1. Non-backtracking regex engines for WAF rules

The fundamental fix is to use a regex engine that guarantees linear time complexity regardless of input. RE2-class engines (Google RE2, Hyperscan, Rust's regex crate) are built on automata (NFA/DFA) rather than backtracking, so matching time is linear in the input length. They support a slightly reduced feature set (no backreferences and no lookaround, only simple assertions such as anchors and word boundaries) but are adequate for most network security pattern matching. After this incident, Cloudflare committed to moving its WAF to a regex engine with run-time guarantees, such as RE2 or Rust's regex crate.
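The reason automata-based engines avoid the blow-up can be sketched with a small state-set simulation: instead of trying one alternative at a time and backtracking, the matcher tracks every state the pattern could be in and advances all of them together. The code below is a minimal illustration under that assumption, using a hand-built NFA for the ambiguous toy pattern `(a|aa)+` matched against the whole input. The names `simulate_nfa` and `A_OR_AA_PLUS` are hypothetical, and production engines such as RE2 are considerably more sophisticated.

```python
def simulate_nfa(transitions, start, accepting, text):
    """Run an NFA by tracking the set of all states it could be in.

    Each input character is examined once, so the loop runs at most
    len(text) times no matter how ambiguous the pattern is.
    Returns (matched, steps).
    """
    current = {start}
    steps = 0
    for ch in text:
        steps += 1
        current = {
            nxt
            for state in current
            for nxt in transitions.get((state, ch), ())
        }
        if not current:  # no live state left: the match has failed
            return False, steps
    return bool(current & accepting), steps


# Hand-built NFA for (a|aa)+ : "mid" means the first 'a' of an "aa" branch
# has been consumed, "done" means at least one full group has matched.
A_OR_AA_PLUS = {
    ("start", "a"): {"done", "mid"},
    ("mid", "a"): {"done"},
    ("done", "a"): {"done", "mid"},
}

if __name__ == "__main__":
    print(simulate_nfa(A_OR_AA_PLUS, "start", {"done"}, "a" * 30 + "x"))
```

The per-character work is bounded by the number of NFA states, so total time is proportional to input length times pattern size. The 31-character non-matching input above is rejected after 31 steps.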

2. Staged rollouts apply to rules, not just code

The same safeguards used for code deployments must apply to configuration and rule changes. A rule deployed to 1% of traffic, monitored for CPU anomalies and error rate spikes for several minutes, then incrementally expanded would have contained this incident to a tiny fraction of users and allowed rapid rollback before global impact. The distinction between "code" and "configuration" or "rules" is artificial when the configuration controls inline processing of every request.
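The staged expansion described above can be sketched as a simple loop. This is a minimal illustration, not Cloudflare's deployment system: `staged_rollout` and its `deploy`, `healthy`, and `rollback` callbacks are hypothetical names, and `healthy` is assumed to block for the monitoring window before returning a verdict.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Expand a rule to each traffic fraction in turn.

    Rolls back and stops at the first unhealthy signal.
    Returns True only if every stage completed.
    """
    for fraction in stages:
        deploy(fraction)
        if not healthy():
            rollback()
            return False
    return True


if __name__ == "__main__":
    deployed: list[float] = []
    ok = staged_rollout(
        stages=[0.01, 0.10, 0.50, 1.00],
        deploy=deployed.append,
        # Pretend the CPU alarm trips once the rule reaches 10% of traffic.
        healthy=lambda: deployed[-1] < 0.10,
        rollback=lambda: deployed.append(0.0),
    )
    print(ok, deployed)
```

In the simulated run the rule never gets past the 10% stage, which is the containment property the lesson argues for.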

3. Automated canary metrics for WAF deployments

WAF rule deployments should automatically measure CPU utilization, request processing latency, and error rates at the first PoP receiving the rule. A rule that drives per-request processing time above a threshold (say, 10ms for what should be a sub-millisecond check) should halt the deployment automatically, without requiring human detection and intervention.
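A gate of this kind might look like the sketch below, which halts a rollout when canary latency or error rate crosses a limit. The 10 ms threshold comes from the example in the text; the 1% error-rate limit, the use of the 99th percentile, and the function name `should_halt` are assumptions made for illustration.

```python
def should_halt(
    latencies_ms: list[float],
    error_count: int,
    threshold_ms: float = 10.0,
    max_error_rate: float = 0.01,
) -> bool:
    """Return True when canary samples breach the latency or error-rate limit."""
    n = len(latencies_ms)
    if n == 0:
        return True  # no data from the canary is treated as a failure
    ordered = sorted(latencies_ms)
    # Nearest-rank 99th percentile, computed as ceil(0.99 * n) in integer math.
    rank = (99 * n + 99) // 100
    p99 = ordered[rank - 1]
    error_rate = error_count / n
    return p99 > threshold_ms or error_rate > max_error_rate


if __name__ == "__main__":
    normal = [0.4] * 100
    saturated = [0.4] * 98 + [2500.0] * 2
    print(should_halt(normal, error_count=0))
    print(should_halt(saturated, error_count=0))
```

Using a high percentile rather than the mean is a design choice: a single slow outlier is tolerated, but a consistent tail of multi-second evaluations trips the gate even when most requests are still fast.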

4. Independent control plane access

The dashboard must remain reachable when the data plane is saturated or returning errors. This requires at least one of the following: a separate infrastructure path for the control plane that bypasses the affected workers, aggressive circuit-breaking that prioritizes management traffic, or out-of-band access for operators that does not depend on the customer-facing infrastructure.

5. Regex safety review in rule development

Rules containing complex regexes should be reviewed with automated tools that detect ReDoS (Regular Expression Denial of Service) vulnerability patterns. Tools like safe-regex, rxxr2, or static analysis in CI pipelines can flag problematic patterns before they reach production. The particular pattern structure that causes catastrophic backtracking — nested quantifiers over alternation groups — is mechanical to detect.
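As a rough idea of what such a CI check does, the sketch below flags patterns in which a quantified group directly contains another quantifier. It is a deliberately crude textual heuristic written for illustration, not the algorithm used by safe-regex or rxxr2; `looks_nested_quantified` is a hypothetical name, and the check only inspects innermost groups.

```python
import re

_ESCAPED = re.compile(r"\\.")            # escaped characters such as \+ or \(
_CHAR_CLASS = re.compile(r"\[[^\]]*\]")  # character classes such as [)*]
# An innermost group that contains + or * and is itself followed by + or *.
_NESTED_QUANTIFIER = re.compile(r"\([^()]*[+*][^()]*\)[+*]")


def looks_nested_quantified(pattern: str) -> bool:
    """Heuristically detect a quantifier applied to a group that holds another."""
    # Neutralize escapes and character classes so literal +, *, ( and )
    # inside them are not mistaken for regex syntax.
    simplified = _ESCAPED.sub("_", pattern)
    simplified = _CHAR_CLASS.sub("_", simplified)
    return _NESTED_QUANTIFIER.search(simplified) is not None


if __name__ == "__main__":
    for candidate in (r"(a+)+$", r"(?:x*|y)*z", r"(a|b|c)+", r"\(a\+\)\+"):
        print(candidate, looks_nested_quantified(candidate))
```

The first two patterns are flagged and the last two are not. Notably, a nested-quantifier check of this kind does not flag the rule quoted earlier, whose cost comes from adjacent greedy wildcards rather than nesting, so a linter complements a linear-time engine rather than replacing it.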

The ReDoS Vulnerability Class

The Cloudflare incident is one of the most visible examples of a broader vulnerability class: ReDoS (Regular Expression Denial of Service). Any application that evaluates untrusted input against a backtracking regex engine is potentially vulnerable. The attack surface extends far beyond WAFs.

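The vulnerability is largely mechanical: a pattern is at risk when a quantifier is applied to a sub-expression that can match the same text in more than one way, as in (a|aa)+ with overlapping alternatives or (a+)+ with nested quantifiers, and it becomes catastrophic against input that almost matches but ultimately fails. A repeated alternation whose branches cannot overlap, such as (a|b|c)+, is not ambiguous on its own. Static analysis tools can detect many of these patterns automatically. The OWASP testing guide lists ReDoS as a relevant attack vector for application-layer denial of service.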

For WAF rules specifically, the problem is compounded by the need for expressiveness: a WAF rule attempting to detect complex attack patterns naturally tends toward complex regexes. The engineering discipline of maintaining a rule base that is both effective and safe against backtracking requires either automated tooling, a non-backtracking engine, or both.

Impact

During the 27-minute window, all HTTP and HTTPS traffic routed through Cloudflare's proxy returned 502 errors. This included a substantial fraction of internet traffic: Cloudflare proxies millions of websites and serves as the reverse proxy for many major internet properties. Customers on Cloudflare's free tier, which provides no SLA, had no recourse. Enterprise customers received credits. The incident was notable because it affected essentially every Cloudflare customer simultaneously with no geographic variation — a global blast radius that is almost impossible to achieve with infrastructure hardware failures alone.

Related incidents in the CDN/infrastructure space: the Fastly CDN outage in June 2021 also caused near-global HTTP outages following a configuration change, though in that case a valid customer configuration change triggered a latent bug in previously deployed software rather than a pathological regex pattern.

Comparison With Other Rule-Deployment Incidents

Several subsequent incidents follow the same template, in which a globally deployed configuration or rule change causes near-total service failure, and benefit from the lessons Cloudflare published in 2019.

The common thread is that security tooling — WAF rules, endpoint detection content, CDN routing rules — operates at the edge of what is considered "code" versus "configuration," and organizations tend to apply faster deployment pipelines to configuration than to code. The blast radius when configuration is wrong is identical to when code is wrong, but the safeguards are often weaker.
