The AWS S3 Outage (February 2017)

On February 28, 2017, a large portion of the internet went dark. Websites returned errors. Apps stopped loading. Connected doorbells, thermostats, and light switches went unresponsive. The cause was not a sophisticated cyberattack, not a natural disaster, not a massive hardware failure. It was a single mistyped command, entered by an engineer during routine maintenance of Amazon Web Services' Simple Storage Service — better known as S3. The incident would become one of the most consequential cloud outages in history, cascading across thousands of services and exposing the staggering concentration risk the internet had quietly accumulated.

What is S3 and Why Does It Matter?

Amazon Web Services (AS16509) launched S3 in 2006 as one of its first cloud services. S3 provides object storage — a simple way to store and retrieve any amount of data over the internet. By 2017, S3 had become foundational infrastructure for a vast portion of the web. It hosted images for websites, backups for enterprises, static assets for mobile apps, and data for countless other AWS services. Trillions of objects were stored in S3, and the service handled millions of requests per second.

S3 is organized into regions. The oldest and largest region is us-east-1 (Northern Virginia), which served as the default region for many AWS customers and internal services alike. Because of its age and the default status it held in many AWS tools, us-east-1 had become a critical concentration point — a disproportionate share of the internet's infrastructure depended on it.

The Incident: A Mistyped Command

At 9:37 AM Pacific Time on February 28, 2017, an authorized S3 team member was executing a routine playbook to debug a billing system issue. The procedure involved removing a small number of servers from one of S3's subsystems. The engineer entered the command — but made a typo in the input that specified how many servers to remove. Instead of taking down a small handful of servers, the command targeted a far larger set than intended.

Two critical S3 subsystems were affected:

- The index subsystem, which manages the metadata and location information for every object in the region and serves all GET, LIST, PUT, and DELETE requests.
- The placement subsystem, which manages allocation of storage for new objects and itself depends on the index subsystem.

When too many servers in these subsystems were removed simultaneously, both subsystems lost the capacity they needed to function. S3 in us-east-1 effectively stopped working — it could neither read existing objects nor write new ones.

Impact timeline (all times Pacific, February 28, 2017):

- 9:37 AM — The mistyped command removes far too many servers from the index and placement subsystems
- 9:37–10:00 AM — S3 error rates spike; GET and PUT requests fail across us-east-1
- 10:00 AM onward — Dependent AWS services (EC2 APIs, EBS, Lambda) begin failing; the status dashboard cannot update
- ~2:08 PM — S3 fully recovered; dependent services begin restoring

Why Recovery Took So Long

The S3 team identified the problem quickly. But fixing it was far more difficult than simply restarting the removed servers. The index subsystem and the placement subsystem had not been fully restarted in the us-east-1 region for years. S3 had grown so large that these subsystems now managed an enormous volume of metadata. Restarting them required a full rebuild of the index, which had to process the metadata for the billions of objects stored in the region.

S3's designers had built the system for component-level failures — a server here, a rack there. The system was resilient to individual failures and could recover gracefully from them. But the tool that removed the servers had no safeguard against removing too many at once. This was not a failure the system had been designed to handle, because it was not supposed to be possible.

The full restart of the index subsystem took several hours. The placement subsystem came back first, but without the index, objects could not be located. S3 did not reach full recovery until approximately 2:08 PM Pacific Time — nearly five hours after the initial command.

The Cascade: How S3 Broke the Internet

If S3 had been an isolated storage service, the outage would have been significant but contained. Instead, S3 was deeply intertwined with nearly everything else on AWS — and AWS itself was deeply intertwined with the internet.

AWS Services That Depend on S3

Multiple core AWS services stored configuration data, logs, or runtime dependencies in S3. When S3 went down, these services experienced their own failures:

- The S3 console and new EC2 instance launches in us-east-1
- EBS volumes, when they needed to read data from an S3 snapshot
- Lambda, which relies on S3 for function code and data
- CloudWatch and the AWS Service Health Dashboard, which served assets from S3

During the outage, the AWS status page showed green checkmarks for all services while half the internet was broken. AWS engineers eventually resorted to updating customers via Twitter. After the incident, AWS redesigned the dashboard to remove its dependency on S3.

Websites and Services That Went Down

The blast radius extended far beyond AWS itself. Any website or service that stored static assets (images, JavaScript, CSS) in S3 buckets experienced partial or complete failure. Any application that used S3 as its primary data store was unable to function. The list of affected services included household names across every sector: media sites, collaboration tools, project management platforms, IoT device backends, CI/CD pipelines, and countless others.

The incident laid bare a reality that many had not fully appreciated: a significant fraction of the internet was, directly or transitively, dependent on a single service in a single AWS region.

[Diagram: S3 dependency graph. The us-east-1 failure propagated from direct AWS dependencies (EC2 API, Lambda, EBS, CloudWatch, the status dashboard) to customer applications (web apps, SaaS platforms, serverless apps, CI/CD pipelines, IoT backends, media sites), and finally to end users as broken websites, failed uploads, app errors, unresponsive IoT devices, and missing images. A single storage service failure propagated to affect millions of users.]

The Status Dashboard Problem

One of the most memorable aspects of the outage was the failure of AWS's own Service Health Dashboard. The dashboard — the official channel AWS provides for customers to check whether services are operational — relied on S3 to serve its assets. When S3 went down, the dashboard could not be updated. For hours, it showed a reassuring grid of green icons while customers were experiencing widespread failures.

This created an information vacuum. Customers could not tell whether the problem was on their end or AWS's end. Engineers at affected companies spent precious time debugging their own systems before realizing the issue was upstream at AWS. Twitter became the de facto status page, with the @AWSCloud account posting updates that the official dashboard could not.

The irony was not lost on the internet. The tool designed to tell you if AWS is broken was itself broken because AWS was broken. It was a vivid illustration of circular dependencies in cloud infrastructure — and it became a lasting lesson in the importance of independent monitoring that does not depend on the system it is monitoring.

Cloud Concentration Risk

The S3 outage forced a reckoning with cloud concentration risk. Before February 2017, the narrative around cloud computing emphasized reliability through redundancy. AWS marketed S3 as providing "99.999999999% durability" (eleven nines). But that durability claim was about data loss, not availability — and many customers conflated the two. They assumed S3 would always be accessible.

The outage revealed several uncomfortable truths:

- Durability is not availability: data can be perfectly safe and still be unreachable for hours.
- Defaults concentrate risk: us-east-1's status as the default region had quietly made it a single point of failure for much of the internet.
- Dependencies are transitive: many affected companies never used S3 directly, but depended on services that did.
- Even the provider's own tooling was exposed: AWS's status dashboard shared the same single point of failure it was supposed to report on.

From a BGP and routing perspective, the outage was a stark reminder that the internet's logical dependencies can be far more concentrated than its physical topology suggests. AWS (AS16509) peers widely and announces many prefixes — the network layer was fine. The routing table was healthy. Traffic was flowing normally to AWS's IP addresses. But the services behind those IP addresses were broken. BGP routes existed to S3's endpoints, but S3 was not answering.

This highlights a limitation of tools like BGP looking glasses: they show you that a route exists to a destination, but they cannot tell you whether the application at that destination is functional. A BGP hijack makes routes disappear or change. The S3 outage was different — the routes were fine; the service behind them was not.

What AWS Changed Afterward

Amazon published a detailed post-mortem and implemented several changes to prevent a similar incident:

Safeguards Against Mass Removal

The tool used to remove servers was modified to include rate limits and capacity checks. It can no longer remove more servers than a defined threshold in a single operation. If a command would remove enough capacity to drop below a minimum safe level, the tool blocks the command and requires explicit override with additional approval.
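The logic of such a guardrail can be sketched in a few lines. This is an illustrative example only: the constants and function names are invented here, and AWS's actual tooling is not public.

```python
# Illustrative guardrail: cap the batch size and refuse any removal that
# would drop the fleet below a safety floor. Constants and names are
# invented for this sketch; AWS's real tooling is not public.

MIN_SAFE_FRACTION = 0.85   # never drop below 85% of fleet capacity
MAX_REMOVAL_BATCH = 10     # hard cap on servers removed per operation

def validate_removal(fleet_size, to_remove, override=False):
    """Raise ValueError if a removal request breaches the safety thresholds."""
    if to_remove > MAX_REMOVAL_BATCH and not override:
        raise ValueError(
            f"refusing to remove {to_remove} servers in one operation "
            f"(limit {MAX_REMOVAL_BATCH}); explicit override required"
        )
    remaining = fleet_size - to_remove
    if remaining / fleet_size < MIN_SAFE_FRACTION:
        # the capacity floor holds even when the batch cap is overridden
        raise ValueError(
            f"removal would leave {remaining}/{fleet_size} servers, "
            f"below the {MIN_SAFE_FRACTION:.0%} safety floor"
        )
```

With a check like this in place, a typo that turned "remove 5 servers" into "remove 500" would be rejected before any capacity was lost.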

Faster Subsystem Restart

AWS re-engineered the index and placement subsystems to restart much faster. The original design had never anticipated a full restart, and over the years as S3 grew, the restart process had become impractically slow. Post-incident, AWS invested in making these critical subsystems capable of cold-starting within minutes rather than hours.

Independent Status Page

The AWS Service Health Dashboard was decoupled from S3 and redesigned to operate independently. AWS now runs the dashboard infrastructure outside of any single service dependency, ensuring it can report on outages even when core services are down.

Partitioning and Isolation

AWS introduced additional partitioning within S3 to limit the blast radius of any single failure. Rather than all of us-east-1 being a single failure domain for the index and placement subsystems, the region was segmented so that an issue in one partition would not bring down the entire region.

Safeguards added post-outage, before versus after:

- Server removal: no limit on removal count → rate limits and capacity thresholds
- Failure domains: single index/placement domain per region → partitioned subsystems with isolated blast radius
- Restart time: multi-hour cold restart for subsystems → fast cold-start for index and placement
- Status page: dashboard hosted on S3 → dashboard independent of S3

A five-hour outage led to fundamental architectural changes at AWS.

Lessons for the Industry

The S3 outage was not the first major cloud outage, and it was far from the last. But it became a defining case study because of its scope and its root cause. The lessons it taught apply far beyond AWS:

Design for Dependent Failure

Every service should plan for the failure of services it depends on. If your application depends on S3, what happens when S3 is unavailable? If the answer is "the application crashes," that is a design flaw, not bad luck. Graceful degradation, fallback paths, and circuit breakers are engineering essentials, not luxuries.
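A circuit breaker for an S3-backed read path can be sketched in a few lines. This is a minimal illustration, not a production library; in practice you would reach for a maintained implementation.

```python
import time

# Minimal circuit-breaker sketch (illustrative only). After `max_failures`
# consecutive errors the circuit "opens": calls skip the real function and
# return the fallback for `reset_after` seconds, then one trial call is
# allowed through ("half-open").

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # open: fail fast, don't hammer a down service
            self.opened_at = None      # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
            self.failures = 0          # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
```

An application would wrap its S3 reads in something like `breaker.call(fetch_from_s3, serve_cached_copy)` (both names hypothetical), so a regional outage degrades to stale assets instead of a crash.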

Avoid Circular Dependencies

If your monitoring system depends on the thing being monitored, your monitoring is an illusion. The status dashboard failing alongside S3 was a textbook case. Independent monitoring — hosted on separate infrastructure, ideally with a different cloud provider — is not paranoia. It is basic engineering hygiene.

Understand Your Transitive Dependencies

Most organizations do not have a complete picture of their dependency graph. You might know that your application uses DynamoDB, but do you know that DynamoDB uses S3 for certain internal operations? Mapping transitive dependencies is difficult but essential for understanding true blast radius.
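Once a dependency graph is mapped, computing blast radius is a reachability problem: everything that transitively depends on the failed service is impacted. A toy sketch, with invented service names:

```python
# Toy dependency graph: edges point from a service to what it depends on.
# A failed node's blast radius is the reverse reachability set, i.e. every
# service that transitively depends on it. Service names are illustrative.

DEPENDS_ON = {
    "web_app":     ["api", "cdn"],
    "api":         ["dynamodb", "lambda"],
    "lambda":      ["s3"],
    "dynamodb":    ["s3"],       # hidden internal dependency
    "cdn":         [],
    "status_page": ["s3"],       # monitoring that depends on what it monitors
    "s3":          [],
}

def blast_radius(failed):
    """Return every service that transitively depends on `failed`."""
    impacted, frontier = set(), {failed}
    while frontier:
        nxt = set()
        for svc, deps in DEPENDS_ON.items():
            if svc not in impacted and frontier & set(deps):
                impacted.add(svc)
                nxt.add(svc)
        frontier = nxt
    return impacted
```

In this toy graph, `web_app` never touches S3 directly, yet it lands in the blast radius through both `lambda` and `dynamodb`; that is exactly the surprise many teams experienced in 2017.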

Multi-Region is Not Optional

The S3 outage only affected us-east-1. S3 in other regions continued to operate normally. Customers who had architected their applications to use multiple regions, or who had replicated critical data to other regions, experienced minimal impact. The outage was a powerful argument for multi-region architecture — and a reminder that "the cloud" is not a uniform entity but a collection of physical regions that can fail independently.
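The simplest form of this pattern is a read path that falls back to a replica region when the primary fails. A minimal sketch with stand-in fetch functions; a real implementation would wrap a per-region S3 client (e.g. via boto3) and catch its specific error types.

```python
# Sketch of a multi-region read path: try regions in preference order and
# return the first success. `fetchers` maps region name to a zero-argument
# callable; the callables are stand-ins so the sketch is self-contained.

def read_with_failover(fetchers, order):
    """Attempt each region in order; return (region, result) on first success."""
    last_err = None
    for region in order:
        try:
            return region, fetchers[region]()
        except Exception as err:
            last_err = err             # remember why this region failed
    raise RuntimeError(f"all regions failed; last error: {last_err}")
```

During the 2017 outage, a client like this would have served reads from us-west-2 while us-east-1 was down, assuming the data had been replicated there in advance.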

Operational Tooling Needs Guardrails

The root cause was a human error in operational tooling. The engineer was authorized to run the command; they just made a typo. The tool should have prevented the action because removing that many servers would have been obviously dangerous. Operational tools that can cause widespread damage should validate inputs against safety thresholds, require confirmation for high-impact operations, and prevent actions that would reduce capacity below safe minimums.

The BGP Angle: When Routes Are Fine but Services Are Not

From a BGP routing perspective, the S3 outage is an interesting case because the network layer was healthy throughout. AWS's autonomous system (AS16509) continued to announce its prefixes normally. The AS paths to AWS endpoints were stable. CDN traffic continued to flow. DNS for AWS services continued to resolve. From a BGP looking glass, everything looked normal.

This contrasts with other notable outages like the Facebook outage of October 2021, where the root cause was a BGP withdrawal — Facebook's routes literally disappeared from the global routing table, and a looking glass would have shown no routes to Facebook's prefixes. In the S3 case, the issue was entirely at the application layer, invisible to routing-level monitoring.

This distinction matters for operators and monitoring systems. BGP monitoring can detect route hijacks, leaks, and withdrawals. But it cannot detect application-layer failures like the S3 outage. A comprehensive monitoring strategy needs both: routing-level tools like looking glasses and RPKI validation to catch network-level issues, and application-level health checks to catch service failures.
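The two layers can be combined into a single verdict. A minimal sketch with injectable probe functions, since the real probes (a looking-glass or RIS query for routing, an HTTP GET against a health endpoint for the application) need network access; the names here are illustrative.

```python
# Routing-level reachability and application-level health are separate
# questions, and a monitor needs both. Probe functions are injected so
# this runs offline; real probes would query BGP state and an HTTP
# health endpoint respectively.

def monitor(target, probe_route, probe_app):
    """Combine a routing probe and an application probe into one verdict."""
    if not probe_route(target):      # does a BGP route to the prefix exist?
        return "route-down"          # the Facebook-2021 pattern: routes withdrawn
    if not probe_app(target):        # does the service actually answer?
        return "app-down"            # the S3-2017 pattern: routes fine, service broken
    return "healthy"
```

The value of the combined verdict is in triage: "route-down" points at the network layer, while "app-down" tells engineers not to waste time staring at the routing table.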

Putting the Outage in Context

In the years since the S3 outage, the internet has continued to consolidate onto a small number of cloud providers. AWS, Microsoft Azure (AS8075), and Google Cloud (AS15169) collectively host an enormous share of the internet's services. Each subsequent major outage — AWS us-east-1 has experienced several more since 2017, including a significant incident in December 2021 — reinforces the same lessons about concentration risk.

The question the S3 outage raised is still unanswered: how much concentration is too much? The internet was designed as a decentralized, fault-tolerant network. BGP routes around failures. DNS has redundant authoritative servers. Anycast distributes services across many locations. But all of that resilience is undermined if the applications running on top of these protocols are all deployed in a single region of a single cloud provider.

The S3 outage was not about routing or network connectivity. It was about what happens when a linchpin service in a deeply interconnected system fails. The internet's physical infrastructure handled it fine. The internet's logical architecture — the web of dependencies that applications and services build on top of the physical network — did not.

Explore AWS's Network

You can look up AWS's network to see its BGP routes, prefixes, and peering relationships. AWS operates one of the largest autonomous systems on the internet:

Try it yourself: Look up any IP address or ASN to see its BGP routing data, origin network, and AS path. Search the looking glass →
