The AWS S3 Outage (February 2017)
On February 28, 2017, a large portion of the internet went dark. Websites returned errors. Apps stopped loading. Connected doorbells, thermostats, and light switches went unresponsive. The cause was not a sophisticated cyberattack, not a natural disaster, not a massive hardware failure. It was a single mistyped command, entered by an engineer during routine maintenance of Amazon Web Services' Simple Storage Service — better known as S3. The incident would become one of the most consequential cloud outages in history, cascading across thousands of services and exposing the staggering concentration risk the internet had quietly accumulated.
What is S3 and Why Does It Matter?
Amazon Web Services (AS16509) launched S3 in 2006 as one of its first cloud services. S3 provides object storage — a simple way to store and retrieve any amount of data over the internet. By 2017, S3 had become foundational infrastructure for a vast portion of the web. It hosted images for websites, backups for enterprises, static assets for mobile apps, and data for countless other AWS services. Trillions of objects were stored in S3, and the service handled millions of requests per second.
S3 is organized into regions. The oldest and largest region is us-east-1 (Northern Virginia), which served as the default region for many AWS customers and internal services alike. Because of its age and the default status it held in many AWS tools, us-east-1 had become a critical concentration point — a disproportionate share of the internet's infrastructure depended on it.
The Incident: A Mistyped Command
At 9:37 AM Pacific Time on February 28, 2017, an authorized S3 team member was executing a routine playbook to debug a billing system issue. The procedure involved removing a small number of servers from one of S3's subsystems. The engineer entered the command — but made a typo in the input that specified how many servers to remove. Instead of taking down a small handful of servers, the command targeted a far larger set than intended.
Two critical S3 subsystems were affected:
- The index subsystem — This subsystem manages the metadata for every S3 object in the us-east-1 region. It is the lookup table that maps each object key to its physical storage location. Without the index, S3 cannot find any object, even if the data is intact on disk.
- The placement subsystem — This subsystem allocates storage for new objects. It decides where to write new data and manages capacity. Without placement, no new objects can be stored and no existing objects can be modified.
When too many servers in these subsystems were removed simultaneously, both subsystems lost the capacity they needed to function. S3 in us-east-1 effectively stopped working — it could neither read existing objects nor write new ones.
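The role of these two subsystems can be pictured with a toy model. This is an illustrative sketch, not AWS's actual design: the class name, the flat dictionaries, and the up/down flags are all invented to show why intact data becomes unreachable the moment the index is gone.

```python
# Toy model of the two subsystems described above. Everything here is
# invented for illustration; it is not AWS's real architecture.

class ObjectStore:
    def __init__(self):
        self.disks = {}        # physical location -> raw bytes (the data itself)
        self.index = {}        # object key -> physical location (index subsystem)
        self.index_up = True
        self.placement_up = True
        self._next_loc = 0

    def put(self, key, data):
        # The placement subsystem chooses where new data lives.
        if not self.placement_up:
            raise RuntimeError("placement subsystem down: cannot store new objects")
        loc = self._next_loc
        self._next_loc += 1
        self.disks[loc] = data
        self.index[key] = loc
        return loc

    def get(self, key):
        # Every read goes through the index: key -> location -> bytes.
        if not self.index_up:
            # The bytes still exist in self.disks, but nothing can find them.
            raise RuntimeError("index subsystem down: cannot locate objects")
        return self.disks[self.index[key]]

store = ObjectStore()
store.put("logo.png", b"\x89PNG...")
assert store.get("logo.png") == b"\x89PNG..."

store.index_up = False                         # the outage: index capacity removed
assert b"\x89PNG..." in store.disks.values()   # data intact on disk...
try:
    store.get("logo.png")                      # ...but unreachable
except RuntimeError as e:
    print(e)    # index subsystem down: cannot locate objects
```

The point of the sketch: the data was never lost during the outage, which is why AWS's durability guarantee was technically intact even while availability was zero.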
Why Recovery Took So Long
The S3 team identified the problem quickly. But fixing it was far more difficult than simply restarting the removed servers. The index subsystem and the placement subsystem had not been fully restarted in the us-east-1 region for years. S3 had grown so large that these subsystems now managed an enormous volume of metadata. Restarting them required a full rebuild of the index, which had to process the metadata for the billions of objects stored in the region.
S3's designers had built the system for component-level failures — a server here, a rack there. The system was resilient to individual failures and could recover gracefully from them. But the tool that removed the servers had no safeguard against removing too many at once. This was not a failure the system had been designed to handle, because it was not supposed to be possible.
The full restart of the index subsystem took several hours. The index subsystem recovered first, gradually restoring the ability to locate objects; the placement subsystem, which depends on a functioning index, finished recovery afterward. S3 did not return to normal operation until early afternoon Pacific Time, more than four hours after the initial command.
The Cascade: How S3 Broke the Internet
If S3 had been an isolated storage service, the outage would have been significant but contained. Instead, S3 was deeply intertwined with nearly everything else on AWS — and AWS itself was deeply intertwined with the internet.
AWS Services That Depend on S3
Multiple core AWS services stored configuration data, logs, or runtime dependencies in S3. When S3 went down, these services experienced their own failures:
- EC2 — New instances could not be launched because the API endpoints stored configuration in S3. Running instances were mostly unaffected, but management operations failed.
- EBS (Elastic Block Store) — Creating new volumes and snapshots failed.
- Lambda — Function code is stored in S3. New invocations of many functions failed because the code could not be retrieved.
- SES, SNS, SQS — Various messaging and email services experienced degradation.
- CloudWatch — Monitoring and alerting were impaired, making it harder for AWS and its customers to understand the scope of the outage.
- The AWS Service Health Dashboard — Perhaps most ironically, the dashboard used to communicate service status to customers was itself hosted on S3. AWS could not update its own status page to tell people what was happening.
During the outage, the AWS status page showed green checkmarks for all services while half the internet was broken. AWS engineers eventually resorted to updating customers via Twitter. After the incident, AWS redesigned the dashboard to remove its dependency on S3.
Websites and Services That Went Down
The blast radius extended far beyond AWS itself. Any website or service that stored static assets (images, JavaScript, CSS) in S3 buckets experienced partial or complete failure. Any application that used S3 as its primary data store was unable to function. The list of affected services included household names across every sector: media sites, collaboration tools, project management platforms, IoT device backends, CI/CD pipelines, and countless others.
The incident laid bare a reality that many had not fully appreciated: a significant fraction of the internet was, directly or transitively, dependent on a single service in a single AWS region.
The Status Dashboard Problem
One of the most memorable aspects of the outage was the failure of AWS's own Service Health Dashboard. The dashboard — the official channel AWS provides for customers to check whether services are operational — relied on S3 to serve its assets. When S3 went down, the dashboard could not be updated. For hours, it showed a reassuring grid of green icons while customers were experiencing widespread failures.
This created an information vacuum. Customers could not tell whether the problem was on their end or AWS's end. Engineers at affected companies spent precious time debugging their own systems before realizing the issue was upstream at AWS. Twitter became the de facto status page, with the @AWSCloud account posting updates that the official dashboard could not.
The irony was not lost on the internet. The tool designed to tell you if AWS is broken was itself broken because AWS was broken. It was a vivid illustration of circular dependencies in cloud infrastructure — and it became a lasting lesson in the importance of independent monitoring that does not depend on the system it is monitoring.
Cloud Concentration Risk
The S3 outage forced a reckoning with cloud concentration risk. Before February 2017, the narrative around cloud computing emphasized reliability through redundancy. AWS marketed S3 as providing "99.999999999% durability" (eleven nines). That durability claim, however, concerned data loss, not availability, and many customers conflated the two. They assumed S3 would always be accessible.
The outage revealed several uncomfortable truths:
- Single-region dependencies are everywhere. Many customers used only us-east-1 because it was the default, the cheapest, and had the most features. Multi-region architectures were possible but expensive and complex.
- Transitive dependencies are invisible. Even if your application did not use S3 directly, it likely depended on AWS services that did. Your Lambda function might not store data in S3, but its code is stored in S3. Your EC2 instances might not read from S3, but the API that launches them does.
- The blast radius of foundational services is enormous. S3 was not just storage. It had become a load-bearing pillar of the AWS control plane. Removing it was like pulling the foundation out from under a skyscraper.
- Monitoring often fails when you need it most. If your monitoring system runs on the same infrastructure as the thing being monitored, it will fail at exactly the moment you need it.
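The second point, invisible transitive dependencies, can be made concrete with a small sketch. The dependency graph below is hypothetical (the edges are illustrative, not an authoritative map of AWS internals); the traversal shows how a single failed service reaches applications that never call it directly.

```python
from collections import deque

# Hypothetical dependency graph: service -> services it depends on.
# Edges are illustrative only, not an authoritative map of AWS internals.
DEPENDS_ON = {
    "my-web-app":  ["lambda", "dynamodb"],
    "lambda":      ["s3"],     # function code lives in S3
    "dynamodb":    [],
    "ec2-launch":  ["s3"],
    "status-page": ["s3"],     # the circular dependency from 2017
    "s3":          [],
}

def blast_radius(failed, graph):
    """Return every service that transitively depends on `failed`."""
    # Invert the graph: service -> services that depend on it.
    dependents = {svc: [] for svc in graph}
    for svc, deps in graph.items():
        for dep in deps:
            dependents[dep].append(svc)
    # Breadth-first search outward from the failed service.
    affected, queue = set(), deque([failed])
    while queue:
        for svc in dependents[queue.popleft()]:
            if svc not in affected:
                affected.add(svc)
                queue.append(svc)
    return affected

print(sorted(blast_radius("s3", DEPENDS_ON)))
# ['ec2-launch', 'lambda', 'my-web-app', 'status-page']
```

Note that `my-web-app` never touches S3 directly, yet it lands in the blast radius through Lambda: exactly the transitive exposure many teams discovered on February 28.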
From a BGP and routing perspective, the outage was a stark reminder that the internet's logical dependencies can be far more concentrated than its physical topology suggests. AWS (AS16509) peers widely and announces many prefixes — the network layer was fine. The routing table was healthy. Traffic was flowing normally to AWS's IP addresses. But the services behind those IP addresses were broken. BGP routes existed to S3's endpoints, but S3 was not answering.
This highlights a limitation of tools like BGP looking glasses: they show you that a route exists to a destination, but they cannot tell you whether the application at that destination is functional. A hijack or a withdrawal makes routes change or disappear, and a looking glass will show it. The S3 outage was different: the routes were fine; the service behind them was not.
What AWS Changed Afterward
Amazon published a detailed post-mortem and implemented several changes to prevent a similar incident:
Safeguards Against Mass Removal
The tool used to remove servers was modified to include rate limits and capacity checks. It can no longer remove more servers than a defined threshold in a single operation. If a command would remove enough capacity to drop below a minimum safe level, the tool blocks the command and requires explicit override with additional approval.
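The safeguards described above can be sketched as a pre-flight check. Everything here is invented for illustration, since AWS's real tooling and thresholds are not public: the class name, the 90% safe-capacity floor, and the five-server per-operation limit are all assumptions.

```python
# Illustrative sketch of a capacity guardrail; names and thresholds are
# invented, as AWS's actual operational tooling is not public.

class CapacityGuard:
    def __init__(self, total, min_safe_fraction=0.9, max_per_operation=5):
        self.total = total
        self.min_safe = int(total * min_safe_fraction)
        self.max_per_operation = max_per_operation

    def check_removal(self, count, override=False):
        """Raise unless removing `count` servers stays within safe limits."""
        if count > self.max_per_operation and not override:
            raise PermissionError(
                f"refusing to remove {count} servers in one operation "
                f"(limit {self.max_per_operation}); explicit override required")
        # The capacity floor applies even with an override.
        if self.total - count < self.min_safe:
            raise PermissionError(
                f"removal would drop capacity to {self.total - count}, "
                f"below safe minimum {self.min_safe}")
        return True

guard = CapacityGuard(total=1000)
guard.check_removal(3)           # a small, routine removal: allowed
try:
    guard.check_removal(300)     # the fat-fingered version: blocked
except PermissionError as e:
    print(e)
```

The design choice worth noting is that the two checks are independent: an override can lift the per-operation limit, but nothing lifts the minimum-capacity floor.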
Faster Subsystem Restart
AWS re-engineered the index and placement subsystems to restart much faster. The original design had never anticipated a full restart, and over the years as S3 grew, the restart process had become impractically slow. Post-incident, AWS invested in making these critical subsystems capable of cold-starting within minutes rather than hours.
Independent Status Page
The AWS Service Health Dashboard was decoupled from S3 and redesigned to operate independently. AWS now runs the dashboard infrastructure outside of any single service dependency, ensuring it can report on outages even when core services are down.
Partitioning and Isolation
AWS introduced additional partitioning within S3 to limit the blast radius of any single failure. Rather than all of us-east-1 being a single failure domain for the index and placement subsystems, the region was segmented so that an issue in one partition would not bring down the entire region.
Lessons for the Industry
The S3 outage was not the first major cloud outage, and it was far from the last. But it became a defining case study because of its scope and its root cause. The lessons it taught apply far beyond AWS:
Design for Dependent Failure
Every service should plan for the failure of services it depends on. If your application depends on S3, what happens when S3 is unavailable? If the answer is "the application crashes," that is a design flaw, not bad luck. Graceful degradation, fallback paths, and circuit breakers are engineering essentials, not luxuries.
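One of those essentials, a circuit breaker with a fallback path, can be sketched in a few lines. The thresholds and the cache-based fallback are illustrative assumptions, not a prescription:

```python
import time

# Minimal circuit-breaker sketch for a dependency like S3.
# Thresholds and fallback behavior are illustrative assumptions.

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        # While the circuit is open, skip the failing dependency entirely.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None   # half-open: try the primary again
            self.failures = 0
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()

def fetch_from_s3():
    raise ConnectionError("S3 unavailable")   # simulating the outage

def serve_cached_copy():
    return "stale-but-usable asset from local cache"

breaker = CircuitBreaker()
for _ in range(5):
    print(breaker.call(fetch_from_s3, serve_cached_copy))
```

After the third failure the breaker opens, so later requests stop hammering the broken dependency and degrade immediately to the cached copy instead of timing out.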
Avoid Circular Dependencies
If your monitoring system depends on the thing being monitored, your monitoring is an illusion. The status dashboard failing alongside S3 was a textbook case. Independent monitoring — hosted on separate infrastructure, ideally with a different cloud provider — is not paranoia. It is basic engineering hygiene.
Understand Your Transitive Dependencies
Most organizations do not have a complete picture of their dependency graph. You might know that your application uses DynamoDB, but do you know that DynamoDB uses S3 for certain internal operations? Mapping transitive dependencies is difficult but essential for understanding true blast radius.
Multi-Region is Not Optional
The S3 outage only affected us-east-1. S3 in other regions continued to operate normally. Customers who had architected their applications to use multiple regions, or who had replicated critical data to other regions, experienced minimal impact. The outage was a powerful argument for multi-region architecture — and a reminder that "the cloud" is not a uniform entity but a collection of physical regions that can fail independently.
Operational Tooling Needs Guardrails
The root cause was a human error in operational tooling. The engineer was authorized to run the command; they simply made a typo. The tool should have refused the action, since removing that many servers at once was plainly unsafe. Operational tools that can cause widespread damage should validate inputs against safety thresholds, require confirmation for high-impact operations, and prevent actions that would reduce capacity below safe minimums.
The BGP Angle: When Routes Are Fine but Services Are Not
From a BGP routing perspective, the S3 outage is an interesting case because the network layer was healthy throughout. AWS's autonomous system (AS16509) continued to announce its prefixes normally. The AS paths to AWS endpoints were stable. CDN traffic continued to flow. DNS for AWS services continued to resolve. From a BGP looking glass, everything looked normal.
This contrasts with other notable outages like the Facebook outage of October 2021, where the root cause was a BGP withdrawal — Facebook's routes literally disappeared from the global routing table, and a looking glass would have shown no routes to Facebook's prefixes. In the S3 case, the issue was entirely at the application layer, invisible to routing-level monitoring.
This distinction matters for operators and monitoring systems. BGP monitoring can detect route hijacks, leaks, and withdrawals. But it cannot detect application-layer failures like the S3 outage. A comprehensive monitoring strategy needs both: routing-level tools like looking glasses and RPKI validation to catch network-level issues, and application-level health checks to catch service failures.
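That two-layer strategy can be sketched as a simple classifier. Both probes are injected callables here so the example is self-contained; in a real deployment the route probe might query a looking glass or a route-collector API, and the application probe would be an HTTP health check against the service endpoint.

```python
# Sketch of layered health classification. The probes are injected
# callables so the example runs without network access.

def classify_outage(route_reachable, app_healthy):
    """Combine a routing-level probe with an application-level probe."""
    if route_reachable() and app_healthy():
        return "healthy"
    if not route_reachable():
        return "network-level failure (e.g. BGP withdrawal or hijack)"
    return "application-level failure (routes fine, service not answering)"

# Facebook, October 2021: routes withdrawn from the global table.
print(classify_outage(lambda: False, lambda: False))
# S3, February 2017: routes present, service broken.
print(classify_outage(lambda: True, lambda: False))
```

A monitoring stack with only the first probe would have reported the S3 outage as "healthy", which is exactly what routing-level tools showed on February 28.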
Putting the Outage in Context
In the years since the S3 outage, the internet has continued to concentrate on a small number of cloud providers. AWS, Microsoft Azure (AS8075), and Google Cloud (AS15169) collectively host an enormous share of the internet's services. Each subsequent major outage — AWS us-east-1 has experienced several more since 2017, including a significant incident in December 2021 — reinforces the same lessons about concentration risk.
The question the S3 outage raised is still unanswered: how much concentration is too much? The internet was designed as a decentralized, fault-tolerant network. BGP routes around failures. DNS has redundant authoritative servers. Anycast distributes services across many locations. But all of that resilience is undermined if the applications running on top of these protocols are all deployed in a single region of a single cloud provider.
The S3 outage was not about routing or network connectivity. It was about what happens when a linchpin service in a deeply interconnected system fails. The internet's physical infrastructure handled it fine. The internet's logical architecture — the web of dependencies that applications and services build on top of the physical network — did not.
Explore AWS's Network
You can look up AWS's network to see its BGP routes, prefixes, and peering relationships. AWS operates one of the largest autonomous systems on the internet:
- AS16509 — Amazon / AWS
- AS14618 — Amazon.com (legacy)
- AS8075 — Microsoft Azure
- AS15169 — Google Cloud
Try it yourself: Look up any IP address or ASN to see its BGP routing data, origin network, and AS path. Search the looking glass →