Olawale Afuye

Posted on Jun 5

Amazon S3 Doesn't Hope Hardware Won't Fail. It Assumes It Already Has.

#distributedsystems #aws #architecture #engineering

Most engineers build distributed systems hoping nothing breaks.

Amazon S3 was engineered under the opposite assumption: that something is already broken, right now, and the system needs to be fine with that.

That one mindset shift explains almost everything about how S3 works — and why it's one of the most reliable pieces of infrastructure on the planet.

I went through a deep-dive conversation with Mai-Lan Tomsen Bukovec, VP of Data and Analytics at AWS, and extracted the engineering philosophy underneath the product. Not the marketing version. The real one.

Here's what actually matters.

1. Hardware failure is not an emergency. It's Tuesday.

S3 manages hundreds of exabytes of data across tens of millions of hard drives, spread across 120 Availability Zones in 38 AWS Regions. It currently stores over 500 trillion objects.

At that scale, something is always failing. A disk here. A rack there. An Availability Zone every now and then. The math is unforgiving.

So the S3 team made a deliberate architectural decision early: stop treating failure as an exception. Design it into the system as the baseline state.

This means dedicated auditor and repair microservices run continuously in the background — not when something goes wrong, but always. They scan the entire fleet, inspect every byte of data, detect discrepancies, and trigger repairs automatically. No human in the loop. No incident ticket. No war room.
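To make the audit-and-repair idea concrete, here is a minimal sketch of such a loop. It is not S3's implementation: the in-memory replica dictionaries, the `audit_and_repair` name, and the SHA-256 checksums are all assumptions chosen for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(replicas, expected):
    """Scan every replica and rewrite missing or corrupt objects from a healthy copy.

    `replicas` is a list of dicts (key -> bytes); `expected` maps key -> checksum.
    Both shapes are hypothetical stand-ins for real storage nodes and metadata.
    """
    repaired = []
    for key, digest in expected.items():
        # Find any replica whose copy still matches the recorded checksum.
        healthy = next(
            (r[key] for r in replicas if key in r and checksum(r[key]) == digest),
            None,
        )
        if healthy is None:
            continue  # no good copy left; a real system would escalate here
        for index, replica in enumerate(replicas):
            if key not in replica or checksum(replica[key]) != digest:
                replica[key] = healthy
                repaired.append((index, key))
    return repaired


# Three replicas of one object: replica 1 is silently corrupted, replica 2 lost it.
good = b"hello"
replicas = [{"a": good}, {"a": b"hellp"}, {}]
expected = {"a": checksum(good)}
repairs = audit_and_repair(replicas, expected)
```

The point of the sketch is the shape of the loop: it runs whether or not anyone reported a problem, and a repair is just another routine outcome of a scan.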

There's also a specific property they engineer for called crash consistency — the system is designed so that after any fail-stop event, it automatically returns to a valid state without manual intervention. The failure happens. The system continues. Those two things are not in conflict.
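A single-machine analogy helps here. The sketch below is not how S3 achieves crash consistency; it is the familiar write-to-temp-then-rename pattern, and the `atomic_write` helper is a hypothetical name. The property is the same in spirit: whenever the process dies, a reader sees either the old contents or the new ones, never a half-written file.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a crash leaves the old or the new file intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before publishing them
        os.replace(tmp_path, path)  # the rename is atomic on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # never leave a partial temp file behind
        raise


with tempfile.TemporaryDirectory() as directory:
    target = os.path.join(directory, "object.bin")
    atomic_write(target, b"version 1")
    atomic_write(target, b"version 2")
    with open(target, "rb") as handle:
        final_contents = handle.read()
    leftover_files = os.listdir(directory)
```

A production version would also fsync the containing directory; that step is left out to keep the sketch short.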

The system heals itself because it was designed to assume it's already sick.

If you're building distributed systems and your failure handling is reactive — you only respond after something breaks — you've already lost. Design the repair loop as a first-class citizen, not an afterthought.

2. Strong consistency at this scale wasn't supposed to be possible. They did it anyway — and made it free.

S3 launched in 2006 as an eventually consistent data store. For many use cases, that was fine.

But eventual consistency creates a class of bugs that are incredibly painful to debug: you overwrite an object with a PUT, immediately GET it, and sometimes get the old version back. For developers building workflows that depend on read-after-write correctness, this was a real problem.

The move to strong consistency required building a new distributed data structure from scratch — a replicated journal — where storage nodes are chained together and writes flow sequentially through them. Each node learns the value and the sequence number. This guarantees that any subsequent read reflects the most recent write, everywhere, always.
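Here is a toy version of that chained journal, assuming a simple chain-replication layout: writes enter at the head, pass through every node in order, and reads are served from the tail. The `Node` and `Journal` classes are hypothetical, and the real S3 design is certainly more involved.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    log: list = field(default_factory=list)  # entries are (seq, key, value)

    def apply(self, seq: int, key: str, value: bytes) -> None:
        # Each node learns the value and its sequence number, with no gaps allowed.
        if seq != len(self.log) + 1:
            raise ValueError(f"expected seq {len(self.log) + 1}, got {seq}")
        self.log.append((seq, key, value))

    def latest(self, key: str) -> Optional[tuple]:
        for seq, logged_key, value in reversed(self.log):
            if logged_key == key:
                return seq, value
        return None


class Journal:
    def __init__(self, replicas: int = 3):
        self.chain = [Node() for _ in range(replicas)]
        self.next_seq = 1

    def write(self, key: str, value: bytes) -> int:
        seq = self.next_seq
        for node in self.chain:  # head to tail, strictly in order
            node.apply(seq, key, value)
        self.next_seq += 1
        return seq  # acknowledged only once the tail has the entry

    def read(self, key: str) -> Optional[tuple]:
        # The tail only holds entries every earlier node has already seen.
        return self.chain[-1].latest(key)


journal = Journal()
journal.write("report.csv", b"old")
journal.write("report.csv", b"new")
result = journal.read("report.csv")
```

Because a write is acknowledged only after the tail holds it, and reads come from the tail, a read issued after an acknowledged write cannot observe the older value.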

The hardware cost was real. The engineering effort was significant.

They shipped it free to every customer anyway.

Because their standard isn't "acceptable." It's the best storage service on the planet. Those are different bars.

How many engineering decisions do you make based on what's acceptable versus what's actually right?

3. 200+ microservices. Developers only see `put()` and `get()`.

This is the part that doesn't get enough credit.

S3 is a system of over 200 microservices operating at a scale most people can't picture. Replication, auditing, consistency enforcement, caching, tiered storage, access control — all of it happening under the hood.

And the developer experience is just:

PUT /bucket/object
GET /bucket/object

That's it.

This isn't accidental simplicity. It's a design principle they call relentless simplification. The complexity doesn't disappear — it gets absorbed by the platform. The developer should never be forced to reason about the distributed systems problem underneath.
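In miniature, "absorbing complexity" looks something like the sketch below. The `ObjectStore` class is hypothetical and has nothing to do with the real S3 API; it only shows two public methods sitting on top of replication and integrity checks that the caller never sees.

```python
import hashlib


class ObjectStore:
    """Two public methods; replication and integrity checks stay internal."""

    def __init__(self, replica_count: int = 3):
        self._replicas = [{} for _ in range(replica_count)]

    def put(self, key: str, data: bytes) -> None:
        record = (hashlib.sha256(data).hexdigest(), data)
        for replica in self._replicas:
            replica[key] = record

    def get(self, key: str) -> bytes:
        # Skip missing or corrupt copies; the caller never learns it happened.
        for replica in self._replicas:
            record = replica.get(key)
            if record is None:
                continue
            digest, data = record
            if hashlib.sha256(data).hexdigest() == digest:
                return data
        raise KeyError(key)


store = ObjectStore()
store.put("photo.jpg", b"pixels")

# Simulate silent corruption on the first replica.
digest, _ = store._replicas[0]["photo.jpg"]
store._replicas[0]["photo.jpg"] = (digest, b"garbage")

fetched = store.get("photo.jpg")
```

The caller's code is identical whether zero replicas or two are damaged, which is the whole point.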

Every time you expose internal complexity to your users because it was easier for you to ship it that way, you're failing them. Simplicity isn't a UX concern — it's an engineering responsibility.

4. Formal methods: because at this scale, you can't test your way to correctness.

This one surprised me.

AWS uses formal methods — mathematical verification — to prove the correctness of critical code paths in S3. Not as a best-practice checkbox. As a requirement, on every single check-in.

Why? Because the system is too complex and too consequential to rely solely on unit tests and staging environments. Tests verify that the code does what you expected in the cases you thought to write. Formal methods verify that the logic behaves correctly under conditions you haven't anticipated yet, including edge cases in distributed failure scenarios that would take years to surface organically.
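Real formal methods use proof tools and model checkers, but the core move can be shown with a much lighter relative: exhaustively checking a tiny model instead of sampling it. The sketch below is an assumption-laden toy, not AWS tooling. It crashes a two-step write at every possible point and checks one invariant, that a published pointer always refers to data that exists.

```python
def run_with_crash(steps, crash_after):
    """Apply the first `crash_after` steps to a fresh state, then 'crash'."""
    state = {"blocks": {}, "pointer": None}
    for step in steps[:crash_after]:
        step(state)
    return state


def invariant(state):
    # The pointer must be unset or refer to a block that was fully written.
    return state["pointer"] is None or state["pointer"] in state["blocks"]


def write_block(state):
    state["blocks"]["v1"] = b"payload"


def publish_pointer(state):
    state["pointer"] = "v1"


def violations(steps):
    """Return every crash point (count of completed steps) that breaks the invariant."""
    return [n for n in range(len(steps) + 1) if not invariant(run_with_crash(steps, n))]


safe_order = [write_block, publish_pointer]
unsafe_order = [publish_pointer, write_block]

safe_violations = violations(safe_order)
unsafe_violations = violations(unsafe_order)
```

The unsafe ordering passes any test that runs to completion. It only fails when the crash lands after the first step, and the exhaustive check finds that case without anyone having to think of it.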

Crash consistency. Replication correctness. Cache coherency. These are not things you want to discover are broken at 3am during a customer incident.

Most engineering teams won't need formal methods. But the underlying principle applies everywhere: what does it mean to prove something works, not just hope it does? What are the correctness properties your most critical systems should guarantee, and how do you actually verify them?

5. Scale should make your system faster, not slower. If it doesn't, you designed it wrong.

This is counterintuitive, so read it again.

The S3 engineering team operates on a principle that as the system grows, performance should improve or remain stable — never degrade.

The reason this works: at massive scale, workloads become more decorrelated. No single server or node sees a disproportionate share of traffic. The load spreads so evenly that the marginal cost of an additional request trends toward zero.
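You can see the smoothing effect in a small simulation. The numbers are assumptions picked for illustration: every workload is idle 90% of the time and bursts to 10 units otherwise, and workloads are fully independent. It says nothing about S3's real traffic; it only shows how the peak-to-mean ratio of the aggregate shrinks as independent workloads pile up.

```python
import random


def peak_to_mean(workloads: int, steps: int, rng: random.Random) -> float:
    """Ratio of the busiest time step to the average one for the summed load."""
    totals = []
    for _ in range(steps):
        # Each workload independently bursts with probability 0.1.
        active = sum(1 for _ in range(workloads) if rng.random() < 0.1)
        totals.append(active * 10)
    return max(totals) / (sum(totals) / len(totals))


rng = random.Random(42)  # fixed seed so the run is repeatable
small = peak_to_mean(workloads=10, steps=500, rng=rng)
large = peak_to_mean(workloads=1000, steps=500, rng=rng)

print(f"10 workloads:   peak is {small:.1f}x the mean")
print(f"1000 workloads: peak is {large:.1f}x the mean")
```

With a handful of workloads you have to provision for a peak several times the average. With a thousand, the peak sits close to the average, so capacity sized for the mean is almost enough.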

Most engineers experience the opposite. Their system slows down as usage grows because they built for the current scale, not the next one.

If adding more users to your system introduces proportionally more latency, more contention, or more failure modes — you have a scaling debt. S3 treats scale as structural leverage, not structural liability. That's a choice you make at architecture time, not after the system is groaning under load.

6. Vectors are the next primitive. And S3 is already there.

S3 launched in 2006 to store unstructured data. PDFs, images, backups. Simple.

In 2024/2025, it evolved to support structured, tabular data via Apache Iceberg and S3 Tables. SQL-queryable. Analytics-ready.

Now it's adding S3 Vectors — native vector storage supporting up to 2 billion vectors per index and 20 trillion vectors per bucket.

Why does this matter? Because the next generation of applications doesn't just retrieve files. It retrieves meaning.

Embedding models convert documents, audio, images, and codebases into long lists of numbers (vectors) that represent their semantic content. Instead of searching by filename or metadata, you search by what the data means. You can query for "find an image that looks like a puppy." You can extract sentiment from a batch of customer service audio recordings. You can ask your data ocean a question in plain language and get a conceptually accurate answer back.
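Searching by meaning boils down to comparing vectors. The sketch below uses cosine similarity, one common choice, over made-up three-dimensional "embeddings". Real models emit hundreds or thousands of dimensions, and the file names and numbers here are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=1):
    """Brute force: score every stored vector against the query, keep the best k."""
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:k]


# Hypothetical embeddings; in practice these come from an embedding model.
index = {
    "puppy.jpg": [0.9, 0.1, 0.0],
    "kitten.jpg": [0.7, 0.6, 0.1],
    "invoice.pdf": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]  # stands in for the embedding of "a picture of a puppy"
matches = top_k(query, index, k=2)
```

Nothing in the query mentions a filename. The match comes entirely from where the vectors sit relative to each other.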

This is why the term multimodal matters. Your business data isn't just text. It's PDFs, Slack messages, call recordings, whiteboards, spreadsheets — all different formats representing the same organizational knowledge. Embedding models collapse all of that into a unified vector space. S3 Vectors stores and queries that space at scale.

The performance architecture is interesting too. Rather than brute-force comparison against every record — which becomes prohibitively expensive at billions of vectors — S3 uses pre-computed vector neighborhoods, allowing similarity queries to return results in under 100ms regardless of dataset size.
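The details of S3's indexing aren't spelled out in the talk, so treat the following as a generic sketch of the "neighborhood" idea rather than a description of S3 Vectors. It groups vectors by nearest centroid ahead of time, then scans only the groups closest to the query. The centroids are hard-coded assumptions; a real index would learn them from the data.

```python
import math


def build_neighborhoods(vectors, centroids):
    """Assign each vector to its nearest centroid ahead of query time."""
    neighborhoods = {i: [] for i in range(len(centroids))}
    for name, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
        neighborhoods[nearest].append(name)
    return neighborhoods


def search(query, vectors, centroids, neighborhoods, probes=1):
    """Scan only the `probes` neighborhoods closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [name for i in order[:probes] for name in neighborhoods[i]]
    return min(candidates, key=lambda name: math.dist(query, vectors[name]), default=None)


vectors = {"a": (0.1, 0.2), "b": (0.2, 0.1), "c": (0.9, 0.8), "d": (0.8, 0.9)}
centroids = [(0.0, 0.0), (1.0, 1.0)]  # hypothetical, fixed for the example
neighborhoods = build_neighborhoods(vectors, centroids)
nearest = search((0.85, 0.8), vectors, centroids, neighborhoods)
```

The query touches two of the four vectors instead of all of them. The trade-off is that the answer is approximate: a true nearest neighbor sitting in an unprobed neighborhood gets missed, which is why such indexes usually let you tune how many neighborhoods to probe.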

If you're building anything in the AI space and you're still treating your vector store as an external bolt-on, you need to revisit that architecture. The storage layer is becoming the intelligence layer. Those two things are converging fast.

7. The traits they hire for tell you everything about how they build.

This one's worth paying attention to if you want to understand the culture behind the engineering.

S3 looks for four things in engineers:

Deep ownership. Not ownership in the sense of "you own this ticket." Ownership in the sense of a personal commitment to every byte of data every customer has ever stored. Mai-Lan described it as engineers feeling personally responsible for the durability, integrity, and usability of data that doesn't belong to them. That's a different psychological contract than most engineering roles ask for.

Relentless curiosity. The willingness to step back, read the latest research, and genuinely question whether the way things are currently done is still the right way. Not out of restlessness. Out of the recognition that technology and customer needs are both moving targets, and the system has to move with them.

Technical fearlessness. The ability to innovate aggressively on top of a system that cannot afford to break. This is the hardest one. You're extending a platform that stores 500 trillion objects and serves hundreds of millions of requests per second. Being bold in that context requires a specific kind of courage — the kind grounded in rigour, not recklessness.

Commitment to simplicity. Even as the underlying system grows more complex, the developer-facing interface should feel simpler. Engineers who optimize for their own convenience at the expense of the user's experience don't last long on a team with this mandate.

These aren't interview values. They're engineering identities.

The question isn't whether you can pass a system design interview. The question is whether you think like this when nobody's evaluating you.

The pattern underneath all of this

Seven principles. But really just one idea expressed seven different ways:

Build for what will inevitably happen, not what you hope won't.

Hardware will fail. Consistency will be tested. Scale will arrive before you're ready. Developers will get frustrated with your API. New data primitives will emerge and your architecture will need to absorb them.

S3's durability, the 11 nines (99.999999999%) figure everyone quotes, isn't a product of luck or extraordinary hardware. It's a product of a team that refused to build a system that only works when everything goes right.

That's the standard.

Sources: AWS re:Invent session with Mai-Lan Tomsen Bukovec, VP of Data and Analytics at AWS. [interview](https://www.youtube.com/watch?v=5vL6aCvgQXU&t=817s)
