AWS US-East-1 Outage 2021: Internal Network Congestion
On December 7, 2021, a significant portion of Amazon Web Services became unavailable or severely degraded for several hours, beginning around 15:30 UTC. The outage was centered on the US-East-1 region in Northern Virginia — AWS's oldest, largest, and most heavily used region — and affected not just customer workloads but AWS's own operational tooling, making the incident both broad and slow to resolve. Services including Ring, Disney+, Venmo, and Amazon's own delivery operations experienced disruptions visible to end users across the United States.
The incident illustrated a failure mode that has appeared repeatedly in large-scale cloud infrastructure: an automated process triggered unexpected network behavior, the resulting congestion disabled monitoring and control plane tooling, and operators were partly blind during the period they most needed visibility.
Background: US-East-1's Architecture
US-East-1 (Northern Virginia) is where AWS launched in 2006, and it has accumulated more services, more customers, and more technical debt than any other AWS region. It is the home region for most global AWS services including IAM (Identity and Access Management), the AWS Management Console, CloudFront configuration APIs, and Route 53 — services that other regions depend on for control plane operations. A degradation in US-East-1 can therefore cascade outward even to regions geographically distant.
AWS's internal network in US-East-1 connects data center buildings via a dense fabric. Traffic flows between workloads, between workloads and the internet, and between workloads and AWS's internal infrastructure — the services that handle authentication, DNS resolution, API authorization, and monitoring. The internal infrastructure network was separated from the main network by network devices that aggregated and managed traffic flows between them.
Root Cause: Internal Network Congestion
AWS's post-incident analysis described the triggering event as automated network capacity scaling activity. A routine process intended to add capacity to the internal network triggered unexpected behavior in the devices that sit between the internal network (where AWS's own infrastructure services run) and the main customer network. This caused a surge of connection activity — essentially a connection storm — on those internal network devices.
The devices between the two networks were overwhelmed: the surge produced more connection state than they could handle. Once they were congested, traffic traversing them (which includes every API call that required authorization via IAM, every internal DNS query, and all monitoring telemetry) became slow, dropped, or unreliable.
The feedback loop that followed is characteristic of network congestion incidents: congestion causes timeouts, timeouts cause retries, retries add more traffic to the congested path, which deepens congestion further. Without aggressive backoff or circuit-breaking, this self-reinforcing loop prevents natural recovery.
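The amplification can be made concrete with a small calculation. This is a sketch under simplifying assumptions that do not come from AWS's analysis: each attempt fails independently with a fixed probability, and clients retry immediately up to a fixed limit. The function name and parameters are illustrative.

```python
def offered_load_multiplier(failure_prob: float, max_retries: int) -> float:
    """Expected attempts per logical request when every attempt fails
    independently with probability failure_prob and the client retries
    immediately, up to max_retries times (no backoff)."""
    if not 0.0 <= failure_prob <= 1.0:
        raise ValueError("failure_prob must be in [0, 1]")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # Attempt k (0-based) is only sent if the previous k attempts all failed,
    # which happens with probability failure_prob ** k.
    return sum(failure_prob ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.9):
        print(f"failure_prob={p}: x{offered_load_multiplier(p, max_retries=3):.3f}")
```

Under these assumptions, a healthy path sees one attempt per request, while a path dropping 90% of attempts sees roughly 3.4 attempts per request with three retries. In a real incident the failure probability itself rises with load, so the effect is worse than this fixed-probability model suggests.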
Why Monitoring Failed
The most operationally damaging aspect of the incident was that AWS's own monitoring infrastructure sat on the congested internal network. The tools that operators rely on to understand the scope and progress of an incident, including CloudWatch, the service health dashboard, and internal tooling, were either unavailable or reporting stale data from before the congestion began.
This left operators in an unusual position: they knew something was wrong because customers and service alarms were firing, but their normal tools for determining which services were affected, to what degree, and whether recovery actions were working, were themselves impaired. Making confident decisions about what to throttle, disable, or scale required correlating signals from degraded sources.
The external AWS Service Health Dashboard was also slow to update, which meant customers experienced disruptions for extended periods while the dashboard showed no issues — a communication failure that compounded the frustration of the incident.
Downstream Impact
The architectural reality of US-East-1 meant the blast radius extended well beyond the region itself:
- Ring: live view and notification functionality — which depend on AWS services including Kinesis — were disrupted for users across the US
- Disney+ and other media services: streaming platforms hosted on AWS experienced degraded performance or inability to load
- Venmo and financial services: payment processing and account access were intermittently unavailable
- Amazon's own operations: the company's internal logistics tools and delivery driver apps, which run on AWS, experienced disruptions that affected package routing and delivery confirmation
- AWS Management Console: customers could not access the console to check their own service status or attempt remediation, because the console itself requires IAM authentication against services on the affected internal network
Control-Plane and Data-Plane Separation
AWS designed many of its services with the explicit goal of separating control plane from data plane: once a resource is provisioned, its data path continues to function even if the control plane for provisioning new resources is unavailable. An existing EC2 instance continues to run and accept network traffic even if the EC2 API is down. An existing load balancer continues to forward connections even if the Elastic Load Balancing API cannot be reached.
This design partially insulated running workloads. Services that had pre-provisioned capacity and did not need to make API calls during the incident were largely unaffected. The disruption was worst for:
- Applications that made real-time calls to AWS APIs (SQS, DynamoDB, Lambda invocations) that passed through the congested internal path
- Applications that relied on auto-scaling to handle load — scaling operations require EC2 API calls, which required IAM authentication through the congested network
- Any application using the AWS Management Console for monitoring or incident response
Lessons
1. Monitoring independence is not optional
Monitoring infrastructure must be topologically separated from the systems it monitors. If your observability stack runs on the same network path that fails during an incident, you lose the ability to understand the incident at the moment you most need to. Effective monitoring independence requires: a dedicated out-of-band management network; monitoring data paths that do not traverse the same network devices as the workloads being monitored; and at minimum, external synthetic monitoring (probes from outside the affected network) that continues to provide ground truth.
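The external synthetic monitoring mentioned above can be as simple as a probe run from a network that shares nothing with the monitored system. The following is a minimal sketch, not a description of any AWS tooling: `probe`, `ProbeResult`, and `http_status` are hypothetical names, and the health URL is a placeholder.

```python
import time
import urllib.request
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool
    latency_s: float
    detail: str


def http_status(url: str, timeout: float) -> int:
    """Fetch url and return the HTTP status code (raises on 4xx/5xx or timeout)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def probe(url: str, timeout: float = 5.0,
          fetch: Callable[[str, float], int] = http_status,
          clock: Callable[[], float] = time.monotonic) -> ProbeResult:
    """Run one synthetic check and report success, latency, and a short detail."""
    start = clock()
    try:
        status = fetch(url, timeout)
    except Exception as exc:  # timeouts, DNS failures, HTTP errors all count as failure
        return ProbeResult(False, clock() - start, f"{type(exc).__name__}: {exc}")
    return ProbeResult(200 <= status < 300, clock() - start, f"HTTP {status}")


if __name__ == "__main__":
    # Placeholder endpoint: substitute a real health URL for the service under test.
    print(probe("https://example.invalid/health", timeout=2.0))
```

The probe is only useful if it runs, stores its results, and alerts from somewhere that does not share a failure domain with the target; the code is the easy part and the placement is the design decision.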
2. Retry backoff discipline
Every API client, every application retry loop, and every infrastructure component that retries on failure must implement exponential backoff with jitter. Linear or fixed-interval retries under congestion are not neutral — they actively worsen the congestion they are trying to recover from. This is not a new lesson, but it is one that needs to be enforced architecturally: libraries and SDKs should default to proper backoff, not flat retries.
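One common form of this discipline is capped exponential backoff with "full jitter", where each delay is drawn uniformly between zero and an exponentially growing ceiling. The sketch below is illustrative rather than any particular SDK's implementation; the function names and default values are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 20.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def call_with_backoff(op: Callable[[], T], max_attempts: int = 5,
                      base: float = 0.1, cap: float = 20.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op, retrying failures with jittered exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            # A production client would retry only errors known to be transient.
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure to the caller
            sleep(backoff_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

The jitter matters as much as the exponent: without it, clients that failed at the same moment retry at the same moment, and the congested path sees synchronized bursts instead of a smoothed load.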
3. US-East-1 concentration risk
The outsized impact of this incident was partly a consequence of how much of the internet runs on US-East-1 specifically. IAM, Route 53, and the Management Console are global services whose control planes depend on this single region. Organizations that require high availability must architect across multiple regions and accept that some degree of multi-region operation is a prerequisite for resilience against incidents of this type.
The related lesson for application developers: dependency on global services (IAM, Route 53 global) means that a US-East-1 disruption has global reach even for workloads running in other regions. Designing for graceful degradation when these global control-plane services are unavailable — for example, by caching authorization decisions locally — provides meaningful resilience.
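Caching authorization decisions locally could look like the sketch below: serve fresh entries from memory, and fall back to a recent cached decision only when the upstream authorizer is unreachable. The class name, the `authorize` callable, and the TTL values are all hypothetical; this is not an IAM API.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorAuthCache:
    """Cache authorization decisions; serve stale ones only when upstream fails."""

    def __init__(self, authorize: Callable[[str, str], bool],
                 fresh_ttl: float = 60.0, stale_ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._authorize = authorize   # hypothetical upstream check; may raise
        self._fresh_ttl = fresh_ttl   # serve from cache without calling upstream
        self._stale_ttl = stale_ttl   # upper bound on serving during an outage
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[bool, float]] = {}

    def is_allowed(self, principal: str, action: str) -> bool:
        key = (principal, action)
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[1] <= self._fresh_ttl:
            return cached[0]
        try:
            decision = self._authorize(principal, action)
        except Exception:
            # Upstream unreachable: reuse a recent decision if one exists.
            if cached is not None and now - cached[1] <= self._stale_ttl:
                return cached[0]
            return False  # no usable history: fail closed
        self._entries[key] = (decision, now)
        return decision
```

The trade-off is explicit: a stale "allow" delays revocation by at most `stale_ttl`, so that value should be set by how long the organization can tolerate a revoked permission remaining effective, not by how long an outage might last.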
4. Automated operations need circuit breakers
The triggering cause was automated scaling activity that produced unexpected behavior. Automation that interacts with production network infrastructure should have circuit breakers: if an automated action produces anomalous metrics (connection rates, error rates, device CPU), it should stop and alert before the effect propagates to a catastrophic scale. The time between the trigger and the onset of customer impact in this incident was measured in minutes — faster than human detection and response, but slow enough for an automated circuit breaker to catch.
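A circuit breaker of this kind can be sketched as a loop that applies a change in small steps and checks health metrics after each one. Everything here is an assumption for illustration: the metric names, the thresholds, and the `read_metrics` and `alert` callables are hypothetical, and AWS has not published how its own automation is guarded.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class Thresholds:
    max_conn_rate: float   # connections per second considered safe
    max_error_rate: float  # fraction of failed requests considered safe


def run_with_circuit_breaker(steps: Iterable[Callable[[], None]],
                             read_metrics: Callable[[], Tuple[float, float]],
                             limits: Thresholds,
                             alert: Callable[[str], None]) -> int:
    """Apply steps one at a time and halt at the first anomalous reading.

    read_metrics returns (conn_rate, error_rate). Returns the number of
    steps that were applied before completion or halt.
    """
    applied = 0
    for step in steps:
        step()
        applied += 1
        conn_rate, error_rate = read_metrics()
        if conn_rate > limits.max_conn_rate or error_rate > limits.max_error_rate:
            alert(f"halting after step {applied}: "
                  f"conn_rate={conn_rate}, error_rate={error_rate}")
            break  # stop before the remaining steps widen the impact
    return applied
```

This sketch only stops further steps; it does not roll back the ones already applied. It also assumes the metrics remain readable during the anomaly, which ties back to the monitoring-independence lesson: a breaker that reads its signals over the path it is changing can fail open.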
5. Independent dashboard access
During this incident, the AWS Management Console required IAM authentication that depended on the affected internal network, so the console was impaired at the same time as the services it manages. This is the same class of failure as the Cloudflare 2019 WAF outage, where the control plane for disabling a bad rule was served through the infrastructure affected by the rule. Control surfaces must remain reachable independently of the services they manage. Compare also with the Facebook 2021 outage, where internal management tools depended on the same BGP-routed DNS that the outage disrupted.
Explore It Live
- AS16509 — Amazon AWS; see the BGP prefixes announced for AWS infrastructure including US-East-1
- 52.94.0.0 — a prefix in AWS's US-East-1 address space; see the origin AS and routing
- 205.251.192.0 — Route 53's anycast address space, the DNS service at the center of many AWS dependencies