
Vivian Voss

Posted on • Originally published at vivianvoss.net

The Command That Removed Too Much

Cover image: a bright modern office; one engineer grips his hair in despair at his desk while the wall monitor behind him shows an AWS US-EAST-1 topology map pulsing alarm red, and a colleague, mid-stride, drops her coffee in shock at his screen.

Tales from the Bare Metal, Episode 02

« Thou shalt give every destructive tool a floor below which it must refuse! »

At 9:37 in the morning, Pacific time, on Tuesday 28 February 2017, an authorised engineer on Amazon's S3 team typed a command. The command was meant to remove a small number of servers from the S3 billing subsystem, an internal cost-tracking layer that had been showing an issue worth debugging. By 13:54, a large share of the public-facing web had been degraded or down for more than four hours, and Amazon had spent the first two of those hours unable to update its own status dashboard to say so.

This is a well-documented incident. Amazon's official postmortem, published on aws.amazon.com shortly after the event, is brief, factual, and self-critical in the AWS house style. The story has been retold many times. The point of revisiting it now is the architecture, which is what survives the retelling.

What Happened, in Sequence

The S3 team was investigating slow billing reports. The investigation pointed at a small set of servers in the billing subsystem that needed to be removed and replaced. Capacity removal in S3 is a routine operation; it has been part of normal operations since the service was launched in 2006. The S3 team has a tool for it.

At 09:37 PST, the engineer ran the tool. One of the inputs to the command was entered incorrectly. The tool did exactly what its arguments said: it removed servers. The set it removed was much larger than the small set the engineer had in mind. The servers it removed were not all part of the billing subsystem; they were also part of two other S3 subsystems.

The first of those was the index subsystem. The index subsystem is what AWS calls the layer that "manages the metadata and location information of all S3 objects in the region". When a client makes a GET, LIST, PUT or DELETE request, the index subsystem is consulted before any byte of object data is touched. With insufficient index capacity, S3 in US-EAST-1 could not serve requests against existing objects. Every S3-backed application in that region lost the ability to read or write.

The second was the placement subsystem. The placement subsystem allocates storage for new objects, and it depends on the index subsystem to function. Once the index was below capacity, placement was also down by transitive dependency.

The team began restarting both subsystems. Restarting them was where the morning got worse. AWS state in their postmortem: "S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected." In other words: the subsystems had not been fully restarted in larger regions for a long time, and the integrity validation against years of accumulated metadata was a procedure no one had recently rehearsed at this scale.

The index subsystem reached partial recovery at 12:26 PST, full recovery at 13:18 PST. The placement subsystem, which had to wait for index, completed recovery at 13:54 PST. Total visible outage: roughly four hours and seventeen minutes.

What the Status Page Did Not Say

There is a small additional indignity in the AWS postmortem that deserves its own paragraph.

From 09:37 to 11:37 PST, AWS was unable to update the AWS Service Health Dashboard (SHD). The dashboard administration console depended on S3 in US-EAST-1. The status icons on the public Service Health page were stored as image files on S3. During the outage, the page showed the affected services as healthy, in green, while many of those services were in fact unable to serve a single request.

AWS communicated through @AWSCloud on Twitter and via SHD banner text until they were able to update individual service status at 11:37 PST. The Service Health Dashboard was subsequently rebuilt to run across multiple AWS regions, so that an outage in any one region could no longer take down the page that reported the outage.

This is the same architectural shape as the Facebook BGP outage of October 2021, where the system used to recover the network depended on the network. The pattern, in both cases: a control surface should not share fate with the data plane it controls.

The Context, in Fairness to the Build

The capacity-removal tool was a routine internal operation. It had been built in the early days of S3, when the operational footprint was small and the humans driving the tool were a few experienced engineers who knew every part of the system. In that context, accepting whatever arguments the operator typed and acting on them was a reasonable design. The cost of an overshoot was small because the system was small.

Three systemic conditions made the morning of 28 February what it became.

First: the tool's authority exceeded the floor of safety. The capacity-removal command, as built, would remove whatever capacity it was told to remove. There was no rate-limit. There was no minimum-capacity floor. There was no dry-run default that printed what would be removed before doing it. The tool was correct; the tool was also unforgiving in a way that had grown out of proportion to the system it now operated against.

Second: the recovery procedure had aged unrehearsed. The index and placement subsystems had not been completely restarted in larger regions for many years. The validation steps required to bring them back, against years of accumulated metadata, took longer than the team's recovery model assumed. This is the same shape as the GitLab backup that had not been restored: a procedure that is correct on paper degrades silently if it is not regularly exercised.

Third: the status surface shared fate with the system it described. The dashboard meant to communicate the outage was itself dependent on the subsystem that was down. Communication therefore had to migrate, mid-event, to a separate channel (@AWSCloud on Twitter), which was less prominent and less trusted, while the official dashboard continued to display green icons.

These are not excuses. The wrong servers were removed because of an input the operator typed and could have re-read before pressing enter. The error happened. But the consequences of the error (four hours of outage across a large part of the public web) were architectural, not behavioural. A different architecture would have absorbed the same error in seconds.

The Principle

There are three architectural moves, each older than S3 itself.

A destructive tool must refuse to descend below a defined floor. The capacity-removal tool's job is to remove capacity. Its second job, equally important, is to know how much capacity is the minimum required for the subsystem to function, and to refuse any command that would breach that floor. After the incident, AWS modified the tool to do exactly this. The pattern is not new. On FreeBSD, the small expression of it is zfs snapshot tank/data@before-cleanup before any destructive operation: the cost is milliseconds; the benefit is a known-good state that zfs rollback can return to in seconds. Capsicum or jails can take it one step further by limiting the tool to the capability it actually needs (remove a small set of hosts) and refusing any escalation.
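To make the floor concrete, here is a minimal sketch, in Python, of what a capacity-removal wrapper with a dry-run default and a minimum-capacity floor can look like. Everything in it is a hypothetical stand-in — the host inventory, the MIN_ACTIVE_HOSTS value, the remove_host() call — and it is not AWS's actual tooling; the shape of the refusals is the point, not the names.

```python
# Sketch of a capacity-removal tool with a dry-run default and a hard floor.
# The inventory, the floor value and remove_host() are hypothetical stand-ins.

import argparse
import sys

MIN_ACTIVE_HOSTS = 12  # hypothetical: smallest fleet that keeps the subsystem healthy


def load_active_hosts() -> set:
    # Stand-in for a real inventory lookup.
    return {f"billing-{i:02d}" for i in range(1, 21)}


def remove_host(host: str) -> None:
    # Stand-in for the real, irreversible removal.
    print(f"removing {host}")


def main() -> int:
    parser = argparse.ArgumentParser(description="Remove capacity, with a floor.")
    parser.add_argument("hosts", nargs="+", help="hosts to remove")
    parser.add_argument("--execute", action="store_true",
                        help="actually remove; the default is a dry run")
    args = parser.parse_args()

    active = load_active_hosts()
    targets = set(args.hosts)

    # Refusal one: the tool will not act on hosts it does not recognise.
    unknown = targets - active
    if unknown:
        print(f"refusing: unknown hosts {sorted(unknown)}", file=sys.stderr)
        return 1

    # Refusal two: the tool will not breach the minimum-capacity floor.
    remaining = len(active) - len(targets)
    if remaining < MIN_ACTIVE_HOSTS:
        print(f"refusing: removal would leave {remaining} hosts; "
              f"the floor is {MIN_ACTIVE_HOSTS}", file=sys.stderr)
        return 1

    print(f"would remove {len(targets)} of {len(active)} hosts, leaving {remaining}")
    if not args.execute:
        print("dry run only; pass --execute to act")
        return 0

    for host in sorted(targets):
        remove_host(host)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

The two refusals carry the whole idea: the tool will not touch hosts it cannot account for, and it will not act at all if acting would take the subsystem below the level at which it can function.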

Recovery procedures must be rehearsed. A subsystem that has not been completely restarted in years is not a recoverable subsystem; it is a subsystem with a hopeful runbook. Periodic restart drills, sometimes called Game Days, surface the unspoken integrity checks that have grown over time. The drill needs to run against a meaningful fraction of production load, not against a sandbox of toy data.
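One way to keep the rehearsal honest is to let staleness itself fail a check. The sketch below assumes a small drills.json file recording when each subsystem last went through a full cold restart; the file name, its format and the twelve-month threshold are all assumptions, but the mechanism — a CI or nightly job that goes red when a runbook has not been exercised — is the useful part.

```python
# Sketch of an "is the recovery procedure still rehearsed?" check.
# drills.json and the 365-day threshold are assumptions, not a convention.

import json
import sys
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=365)


def main() -> int:
    # Hypothetical format: {"index-subsystem": "2024-11-02", "placement": "2023-01-15"}
    with open("drills.json") as f:
        last_drill = json.load(f)

    now = datetime.now(timezone.utc)
    stale = []
    for subsystem, date_str in last_drill.items():
        last = datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc)
        if now - last > MAX_AGE:
            stale.append((subsystem, (now - last).days))

    for subsystem, age in stale:
        print(f"{subsystem}: last full restart drill was {age} days ago")
    return 1 if stale else 0


if __name__ == "__main__":
    sys.exit(main())
```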

Status surfaces must run out-of-band. The page on which an organisation reports the health of its systems must be structurally independent of those systems. If S3 is down, the S3 status page must still load. If BGP is misconfigured, the network operations team must still be able to reach their out-of-band console. AWS rebuilt the Service Health Dashboard to span multiple regions; Facebook, four years later, rebuilt their internal authentication paths after their BGP outage. Both reached the same conclusion: control surface and data plane must not share fate.

The further structural change in S3 was to partition the index subsystem into smaller "cells", so that a future capacity-removal mistake against one cell would not take down the entire region. This is the architectural pattern of cellularisation: many small, independently failable units instead of one large unit. It is the same pattern that gave us containers, jails, and microservices; the difference is in execution, not in idea.

Where the Pattern Travels

The principle is not specific to S3, not specific to AWS, not specific to cloud.

Kubernetes. kubectl delete -A on the wrong context, kubectl scale --replicas=0 with the wrong namespace selected, a Helm release uninstalled in the wrong cluster. Kubernetes accepts the command as typed. The defensive shape is the same: a CI-driven dry-run default, a pre-flight that announces what will change, an admission controller that refuses to drop below a defined replica floor.
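For the replica-floor piece specifically, the decision logic of a validating admission webhook is small. The sketch below shows only that logic in Python; the annotation name, the floor value and the opt-in mechanism are assumptions, and a real deployment additionally needs an HTTPS server, certificates, and a ValidatingWebhookConfiguration pointing at it.

```python
# Sketch of the decision function behind a "replica floor" validating webhook.
# The annotation name and floor are illustrative; the AdmissionReview shape is real.

REPLICA_FLOOR = 2  # hypothetical minimum for workloads that opt in


def validate(review: dict) -> dict:
    """Take an AdmissionReview request dict, return the response dict."""
    request = review["request"]
    obj = request.get("object") or {}
    spec = obj.get("spec", {})
    annotations = obj.get("metadata", {}).get("annotations", {})

    allowed, message = True, ""
    # Only enforce the floor on workloads that declare one (hypothetical annotation).
    if annotations.get("example.dev/replica-floor") == "enforce":
        replicas = spec.get("replicas", 1)
        if replicas is not None and replicas < REPLICA_FLOOR:
            allowed = False
            message = (f"replicas={replicas} is below the floor of "
                       f"{REPLICA_FLOOR}; refusing rather than complying")

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],
            "allowed": allowed,
            "status": {"message": message},
        },
    }
```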

Cloud control planes in general. Terraform applied against the wrong workspace destroys real infrastructure in seconds. The az CLI with the wrong subscription scoped, gcloud with the wrong project active, AWS CLI with the wrong account-aliased profile. The tools do as they are asked. The discipline is to make destructive commands print a confirmation that includes the resolved scope before they act.
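A hedged sketch of that discipline for the AWS case: resolve the scope with aws sts get-caller-identity (a real CLI call) and refuse to run the wrapped command until the operator retypes the resolved account id. The wrapper itself and its calling convention are illustrative, not an established tool.

```python
# Sketch of "print the resolved scope and make the operator retype it" for a
# destructive AWS CLI command. Only the sts get-caller-identity call is real AWS CLI.

import json
import subprocess
import sys


def resolved_identity() -> dict:
    out = subprocess.run(
        ["aws", "sts", "get-caller-identity", "--output", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(out.stdout)


def wrap_destructive(cmd: list) -> int:
    ident = resolved_identity()
    print(f"About to run  : {' '.join(cmd)}")
    print(f"Resolved scope: account {ident['Account']} as {ident['Arn']}")
    answer = input("Type the 12-digit account id to proceed: ").strip()
    if answer != ident["Account"]:
        print("Confirmation did not match the resolved account; aborting.",
              file=sys.stderr)
        return 1
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    # e.g.  python guard.py aws s3 rb s3://some-bucket --force
    sys.exit(wrap_destructive(sys.argv[1:]))
```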

PostgreSQL and other databases. TRUNCATE on the wrong database, DROP TABLE because the prompt looked like staging, role-grant scripts run as superuser when they were intended for a constrained role. The defensive shapes are well-established: schemas with naming conventions that make the wrong target obvious, dedicated users with limited grants, transactions on every destructive statement so that COMMIT is a deliberate second decision.
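A minimal sketch of the transaction discipline, using psycopg2 against a hypothetical database: the destructive statement runs inside an open transaction, the affected row count is printed, and COMMIT is only issued after an explicit second confirmation. The DSN, the role and the DELETE statement are placeholders.

```python
# Sketch: every destructive statement runs inside a transaction, and COMMIT
# is a second, deliberate decision. DSN, role and statement are placeholders.

import psycopg2

DSN = "dbname=app_staging user=cleanup_role"  # deliberately a limited role


def delete_with_confirmation(sql: str, params: tuple) -> None:
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT current_database(), current_user")
            db, user = cur.fetchone()
            print(f"Connected to {db} as {user}")

            cur.execute(sql, params)  # runs inside the open transaction
            print(f"{cur.rowcount} rows affected (not yet committed)")

        if input("Type COMMIT to keep this change: ").strip() == "COMMIT":
            conn.commit()
            print("Committed.")
        else:
            conn.rollback()
            print("Rolled back.")
    finally:
        conn.close()


if __name__ == "__main__":
    delete_with_confirmation(
        "DELETE FROM sessions WHERE last_seen < now() - %s::interval", ("90 days",)
    )
```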

Backup and rotation pipelines. A rotation job that deletes the oldest copy is fine. A rotation job that deletes the only good copy because it could not parse its own retention rule is a Bare Metal incident. The defensive shape: never delete unless the tool can prove that at least one verified copy remains.
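A sketch of that rule for a simple local rotation job: verify at least one retained copy against the checksum recorded when it was taken, and only then delete anything. The paths, the .sha256 convention and the retention count are assumptions for illustration.

```python
# Sketch of "never delete unless at least one verified copy remains".
# Directory layout, .sha256 sidecar files and KEEP count are assumptions.

import hashlib
from pathlib import Path

BACKUP_DIR = Path("/var/backups/app")
KEEP = 7


def verify_backup(path: Path) -> bool:
    """Check the archive against the .sha256 sidecar written when it was taken."""
    try:
        recorded = Path(str(path) + ".sha256").read_text().split()[0]
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
    except OSError:
        return False  # missing or unreadable counts as unverified
    return recorded == actual


def rotate() -> None:
    # Assumes date-stamped filenames, so lexical order is chronological order.
    backups = sorted(BACKUP_DIR.glob("*.tar.gz"))
    if len(backups) <= KEEP:
        return

    # Refuse to delete anything unless at least one retained copy verifies.
    retained = backups[-KEEP:]
    if not any(verify_backup(b) for b in retained):
        raise SystemExit("no retained backup verifies; refusing to rotate")

    for old in backups[:-KEEP]:
        old.unlink()
        Path(str(old) + ".sha256").unlink(missing_ok=True)


if __name__ == "__main__":
    rotate()
```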

Internal status pages and alerting. This is the hardest one to retrofit because it requires the organisation to model fate-sharing explicitly. The question is: when our most important systems are down, can the people who need to coordinate the response still talk to each other, still see what is happening, still reach the boxes? If the answer is "yes, we use Slack", and Slack runs on AWS in US-EAST-1, the answer is no.

In every case, the same shape: the tool does as it is told. The architecture decides what "as it is told" actually means in the worst case. Build the architecture as if the worst case is the case you have been given.

What to Take Home

If your operational situation reminds you of any of the following, treat it as something to investigate this week:

  • You have an internal tool that performs destructive operations and has no minimum-state floor.
  • Your status page, alerting system or incident-response chat shares infrastructure with the systems it reports on.
  • Your most critical subsystems have not been completely restarted from cold in the last twelve months.
  • Your operators rely on terminal hostnames and prompts to distinguish production from staging.

Each of these was true at AWS on 28 February 2017. Each of them is true at many organisations now.

The fix is not a new tool. The fix is a small set of structural moves: floors on destructive operations, rehearsal of recovery, out-of-band status, cellularisation of single points of failure. The cost of these moves is engineering time. The cost of not making them, when the wrong argument is eventually typed, is broadcasting your outage on Twitter while your own dashboard insists everything is green.

The command did exactly what was asked. The architecture decided what exactly meant.

Read the full article on vivianvoss.net →


By Vivian Voss, System Architect & Software Developer. Follow me on LinkedIn for daily technical writing.
