AWS US-East-1 Outage 2021: Internal Network Congestion

On December 7, 2021, a significant portion of Amazon Web Services became unavailable or severely degraded for several hours, beginning around 15:30 UTC. The outage was centered on the US-East-1 region in Northern Virginia — AWS's oldest, largest, and most heavily used region — and affected not just customer workloads but AWS's own operational tooling, making the incident both broad and slow to resolve. Services including Ring, Disney+, Venmo, and Amazon's own delivery operations experienced disruptions visible to end users across the United States.

The incident illustrated a failure mode that has appeared repeatedly in large-scale cloud infrastructure: an automated process triggered unexpected network behavior, the resulting congestion disabled monitoring and control plane tooling, and operators were partly blind during the period they most needed visibility.

Background: US-East-1's Architecture

US-East-1 (Northern Virginia) is where AWS launched in 2006, and it has accumulated more services, more customers, and more technical debt than any other AWS region. It is the home region for most global AWS services including IAM (Identity and Access Management), the AWS Management Console, CloudFront configuration APIs, and Route 53 — services that other regions depend on for control plane operations. A degradation in US-East-1 can therefore cascade outward even to regions geographically distant.

AWS's internal network in US-East-1 connects data center buildings via a dense fabric. Traffic flows between workloads, between workloads and the internet, and between workloads and AWS's internal infrastructure — the services that handle authentication, DNS resolution, API authorization, and monitoring. The internal infrastructure network was separated from the main network by network devices that aggregated and managed traffic flows between them.

Timeline

December 7, 2021: AWS US-East-1 Outage Timeline (UTC)

~15:30: Automated scaling activity triggers an unexpected connection surge on the internal network between US-East-1 data centers.
~15:35: Network devices interconnecting the internal and main networks become congested. EC2 API degraded; instance launches and terminations slow.
~15:45: Monitoring systems, hosted on the same affected network, begin reporting stale data. Operators lose accurate visibility into impact scope.
~16:00: Retry storms from affected services amplify congestion. IAM, the Management Console, and downstream services report widespread errors.
~17:00–20:00: AWS throttles and disables the triggering automation and begins scaling internal network capacity. Services begin gradual partial recovery.
~20:00–22:00: Most EC2, EBS, and Lambda operations recovering. Downstream services restore as API availability improves.
~22:30+: Full recovery for most services. Total duration: ~7+ hours.

Root Cause: Internal Network Congestion

AWS's post-incident analysis described the triggering event as automated network capacity scaling activity. A routine process intended to add capacity to the internal network triggered unexpected behavior in the devices that sit between the internal network (where AWS's own infrastructure services run) and the main customer network. This caused a surge of connection activity — essentially a connection storm — on those internal network devices.

Figure: Internal vs. Main Network congestion cascade. The main (customer) network carries EC2 instances, Lambda / EBS, customer APIs, and networking services; the internal infrastructure network carries IAM / AuthZ, internal DNS, and monitoring. The network devices between the two are congested.

Cascading effects:
1. EC2 API calls hit IAM for AuthZ.
2. IAM sits on the congested internal network, so API calls time out or fail.
3. Retries storm the congested path, so congestion gets worse rather than better.
4. Monitoring runs on the same network, so operators see stale data or none at all.
5. The Console depends on IAM, so customers cannot even check their own service status.

The network devices between these two networks could not handle the volume of connection state that the surge produced. Once they were congested, all traffic traversing them became slow, dropped, or unreliable. That traffic included every API call that required authorization via IAM, every internal DNS query, and all monitoring telemetry.

The feedback loop that followed is characteristic of network congestion incidents: congestion causes timeouts, timeouts cause retries, retries add more traffic to the congested path, which deepens congestion further. Without aggressive backoff or circuit-breaking, this self-reinforcing loop prevents natural recovery.
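The amplification in that loop can be put into rough numbers. The sketch below is a simplified model, not a reconstruction of AWS's traffic: it assumes every attempt fails independently with the same probability and that clients retry immediately, with no backoff. The function name and the figures are illustrative.

```python
def offered_load(base_rps: float, failure_prob: float, max_retries: int) -> float:
    """Requests per second arriving at the congested path, counting retries.

    Assumes each attempt fails independently with failure_prob and that a
    failed attempt is retried immediately, up to max_retries times.
    """
    # Attempt i (0 = first try) is only made if all i earlier attempts failed.
    return base_rps * sum(failure_prob ** i for i in range(max_retries + 1))


for p in (0.0, 0.5, 0.9):
    print(f"failure_prob={p:.1f} -> {offered_load(1000, p, 3):.0f} req/s")
# failure_prob=0.0 -> 1000 req/s
# failure_prob=0.5 -> 1875 req/s
# failure_prob=0.9 -> 3439 req/s
```

Under these assumptions, a path that is already failing 90% of attempts receives more than three times its normal load from the same set of clients, which is why congestion does not clear on its own.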

Why Monitoring Failed

The most operationally damaging aspect of the incident was that AWS's own monitoring infrastructure sat on the internal network that was congested. The tools that operators rely on to understand the scope and progress of an incident, including CloudWatch, the service health dashboard, and internal tooling, were either unavailable or reporting stale data from before the congestion began.

This left operators in an unusual position: they knew something was wrong because customers and service alarms were firing, but their normal tools for determining which services were affected, to what degree, and whether recovery actions were working, were themselves impaired. Making confident decisions about what to throttle, disable, or scale required correlating signals from degraded sources.

The external AWS Service Health Dashboard was also slow to update, which meant customers experienced disruptions for extended periods while the dashboard showed no issues — a communication failure that compounded the frustration of the incident.

Downstream Impact

The architectural reality of US-East-1 meant the blast radius extended well beyond the region itself:

Control-Plane and Data-Plane Separation

AWS designed many of its services with the explicit goal of separating control plane from data plane: once a resource is provisioned, its data path continues to function even if the control plane for provisioning new resources is unavailable. An existing EC2 instance continues to run and accept network traffic even if the EC2 API is down. An existing load balancer continues to forward connections even if the Elastic Load Balancing API cannot be reached.

This design partially insulated running workloads. Services that had pre-provisioned capacity and did not need to make API calls during the incident were largely unaffected. The disruption was worst for:

Lessons

1. Monitoring independence is not optional

Monitoring infrastructure must be topologically separated from the systems it monitors. If your observability stack runs on the same network path that fails during an incident, you lose the ability to understand the incident at the moment you most need to. Effective monitoring independence requires: a dedicated out-of-band management network; monitoring data paths that do not traverse the same network devices as the workloads being monitored; and at minimum, external synthetic monitoring (probes from outside the affected network) that continues to provide ground truth.
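The external synthetic probe mentioned above can be very small. The following is a minimal sketch using only the Python standard library; the names ProbeResult, http_status, and probe are hypothetical, and it assumes the probe runs on a host outside the monitored network and reports through a path that does not depend on it.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    url: str
    ok: bool
    latency_s: float
    error: Optional[str] = None


def http_status(url: str, timeout_s: float) -> int:
    """Fetch url and return the HTTP status code (raises on failure)."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status


def probe(url: str, timeout_s: float = 5.0,
          fetch: Callable[[str, float], int] = http_status) -> ProbeResult:
    """Run one probe. Never raises, so one bad target cannot stop a probe loop."""
    start = time.monotonic()
    try:
        status = fetch(url, timeout_s)
        ok, error = 200 <= status < 300, None
    except Exception as exc:  # timeouts, DNS failures, HTTP 4xx/5xx errors
        ok, error = False, f"{type(exc).__name__}: {exc}"
    return ProbeResult(url, ok, time.monotonic() - start, error)
```

Calling probe() on a health endpoint at a fixed interval gives a success flag and a latency figure that stay meaningful even when internal telemetry is stale. The fetch parameter exists so the probe logic can be exercised without network access.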

2. Retry backoff discipline

Every API client, every application retry loop, and every infrastructure component that retries on failure must implement exponential backoff with jitter. Linear or fixed-interval retries under congestion are not neutral — they actively worsen the congestion they are trying to recover from. This is not a new lesson, but it is one that needs to be enforced architecturally: libraries and SDKs should default to proper backoff, not flat retries.
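As an illustration, here is a minimal sketch of capped exponential backoff with "full jitter", where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function names and default values are assumptions chosen for the example, not the behavior of any particular SDK.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 20.0,
                  uniform: Callable[[float, float], float] = random.uniform) -> float:
    """Full jitter: a random delay in [0, min(cap_s, base_s * 2**attempt)]."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return uniform(0.0, ceiling)


def call_with_backoff(op: Callable[[], T], max_attempts: int = 5,
                      retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op, sleeping a jittered, exponentially growing delay between failures."""
    for attempt in range(max_attempts):
        try:
            return op()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

The jitter matters as much as the exponent: without it, clients that failed at the same moment retry at the same moment, and the congested path sees synchronized bursts instead of a thinned-out load. In practice retry_on should be narrowed to errors that are actually transient.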

3. US-East-1 concentration risk

The outsized impact of this incident was partly a consequence of how much of the internet runs on US-East-1 specifically. AWS's IAM service, Route 53, and the Management Console are all globally dependent on this single region. Organizations that require high availability must architect across multiple regions and accept that some degree of multi-region operation is a prerequisite for resilience against incidents of this type.

The related lesson for application developers: dependency on global services (IAM, Route 53 global) means that a US-East-1 disruption has global reach even for workloads running in other regions. Designing for graceful degradation when these global control-plane services are unavailable — for example, by caching authorization decisions locally — provides meaningful resilience.
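One way to picture local caching of authorization decisions is a small wrapper that serves a recent decision when the upstream check is unreachable. This is a sketch under stated assumptions: CachedAuthorizer and its parameters are hypothetical, the check callable stands in for whatever authorization service the application uses, and serving stale decisions is a security trade-off (a revoked permission can remain effective until the stale window expires) that each system has to weigh for itself.

```python
import time
from typing import Callable, Dict, Tuple


class CachedAuthorizer:
    """Caches allow/deny decisions and falls back to them if the check fails."""

    def __init__(self, check: Callable[[str, str], bool],
                 fresh_ttl_s: float = 60.0, stale_ttl_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check              # upstream authorization call (may raise)
        self._fresh_ttl_s = fresh_ttl_s  # serve from cache without calling upstream
        self._stale_ttl_s = stale_ttl_s  # still usable if upstream is down
        self._clock = clock
        self._cache: Dict[Tuple[str, str], Tuple[bool, float]] = {}

    def is_allowed(self, principal: str, action: str) -> bool:
        key = (principal, action)
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and now - cached[1] <= self._fresh_ttl_s:
            return cached[0]
        try:
            decision = self._check(principal, action)
        except Exception:
            # Upstream unreachable: reuse a recent decision, otherwise fail closed.
            if cached is not None and now - cached[1] <= self._stale_ttl_s:
                return cached[0]
            return False
        self._cache[key] = (decision, now)
        return decision
```

With this shape, requests from principals seen within the stale window keep working through a control-plane outage, while anything never seen before is denied rather than guessed at.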

4. Automated operations need circuit breakers

The triggering cause was automated scaling activity that produced unexpected behavior. Automation that interacts with production network infrastructure should have circuit breakers: if an automated action produces anomalous metrics (connection rates, error rates, device CPU), it should stop and alert before the effect propagates to a catastrophic scale. The time between the trigger and the onset of customer impact in this incident was measured in minutes — faster than human detection and response, but slow enough for an automated circuit breaker to catch.
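A circuit breaker of this kind can be sketched as a loop that applies a change in small steps and checks guardrail metrics after each one. The names Guardrail, run_guarded, and AutomationHalted are hypothetical, and the metric readers are placeholders for whatever telemetry is available; nothing here describes AWS's actual tooling.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


class AutomationHalted(RuntimeError):
    """Raised when a guardrail metric exceeds its limit mid-rollout."""


@dataclass
class Guardrail:
    name: str
    read: Callable[[], float]  # returns the current metric value
    limit: float               # halt if the value rises above this


def run_guarded(steps: Iterable[Callable[[], None]],
                guardrails: List[Guardrail],
                alert: Callable[[str], None] = print) -> int:
    """Apply steps one at a time; halt at the first guardrail breach.

    Returns the number of steps completed if no guardrail trips.
    """
    done = 0
    for step in steps:
        step()
        done += 1
        for g in guardrails:
            value = g.read()
            if value > g.limit:
                alert(f"halting after step {done}: {g.name}={value} exceeds {g.limit}")
                raise AutomationHalted(g.name)
    return done
```

The design choice is that the automation stops itself and alerts, rather than relying on a human to notice. For that to work, the guardrail readings need to come from a source that the change itself cannot degrade.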

5. Independent dashboard access

The AWS Management Console requiring IAM authentication from the affected network during an IAM degradation incident is the same class of failure as the Cloudflare 2019 WAF outage, where the control plane for disabling a bad rule was served through the infrastructure affected by the rule. Control surfaces must remain reachable independently of the services they manage. Compare also with the Facebook 2021 outage, where internal management tools depended on the same BGP-routed DNS that the outage disrupted.
