MaxHuo

Posted on Jun 22

Strong vs Eventual Consistency in Distributed Storage (Without the Confusion)

#storage #devops #s3 #architecture

After writing about metadata in object storage systems, I kept coming back to the same question:

If metadata is distributed across multiple machines, how do those machines agree on what is true?

This is where consistency enters the picture.

The problem is that consistency is often explained through CAP theorem diagrams and academic terminology. While those are important, they don't always help when you're trying to understand how a storage system actually behaves.

A more practical question is:

If I upload an object right now, when should everyone else be able to see it?

That question sits at the heart of consistency.

A storage engineer's version of the problem

Imagine a storage cluster with three metadata nodes.

A client uploads:

photos/cats.png

The upload succeeds.

One second later, another service asks:

"Does photos/cats.png exist?"

Most people expect the answer to be obvious.

Either the object exists or it doesn't.

But in a distributed system, different nodes may learn about the write at different times. That's where consistency guarantees start to matter.
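To make the timing gap concrete, here is a minimal simulation, not a real storage implementation. The names `MetadataNode`, `LazyCluster`, and `replicate` are invented for illustration, and the "background replication" is just a queue that is drained by hand.

```python
from collections import deque


class MetadataNode:
    """One metadata node holding its own view of which keys exist."""

    def __init__(self, name):
        self.name = name
        self.keys = set()

    def exists(self, key):
        return key in self.keys


class LazyCluster:
    """Hypothetical cluster: a write lands on one node, the others catch up later."""

    def __init__(self, node_names):
        self.nodes = [MetadataNode(name) for name in node_names]
        self.pending = deque()  # (node, key) updates not yet delivered

    def put(self, key):
        primary, *replicas = self.nodes
        primary.keys.add(key)  # applied immediately on the first node
        for node in replicas:
            self.pending.append((node, key))  # queued for later delivery
        return "success"  # acknowledged before the other nodes know

    def replicate(self):
        """Deliver every queued update (stands in for background replication)."""
        while self.pending:
            node, key = self.pending.popleft()
            node.keys.add(key)

    def views(self, key):
        """Ask every node whether it thinks the key exists."""
        return {node.name: node.exists(key) for node in self.nodes}


cluster = LazyCluster(["node-a", "node-b", "node-c"])
cluster.put("photos/cats.png")
before = cluster.views("photos/cats.png")  # nodes disagree
cluster.replicate()
after = cluster.views("photos/cats.png")   # nodes agree
print(before)
print(after)
```

Until `replicate()` runs, the answer to "does photos/cats.png exist?" depends on which node you happen to ask.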

Strong consistency: one answer, everywhere

With strong consistency, once a write is acknowledged, every subsequent read, from any client and through any node, must observe that write or a later one.

From a developer's perspective, life is simple.

Upload an object:

1. Upload object
2. Receive success response
3. Read object

If step three doesn't find the object, something is wrong.

The system behaves as though there is a single source of truth, even if dozens of machines are involved behind the scenes.

This is one reason developers like strongly consistent systems: they're easier to reason about.

The complexity stays inside the storage layer rather than leaking into application code.
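One way a storage layer can provide this guarantee is with overlapping read and write quorums. The sketch below is a simplified illustration under several assumptions: the `QuorumStore` and `Replica` classes are hypothetical, a single coordinator hands out version numbers, and nothing fails or runs concurrently. Production systems typically rely on consensus protocols rather than anything this small.

```python
from itertools import combinations


class Replica:
    """A replica that stores the newest (version, value) it has seen per key."""

    def __init__(self):
        self.data = {}

    def write(self, key, version, value):
        if version > self.data.get(key, (0, None))[0]:
            self.data[key] = (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))


class QuorumStore:
    """Hypothetical coordinator for n replicas with write quorum w and read quorum r."""

    def __init__(self, n=3, w=2, r=2):
        if w + r <= n:
            raise ValueError("read and write quorums must overlap (w + r > n)")
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r
        self.version = 0  # assumption: one coordinator orders all writes

    def put(self, key, value, targets):
        """Acknowledge only after at least w replicas have applied the write."""
        if len(targets) < self.w:
            raise ValueError("need at least w replicas to acknowledge a write")
        self.version += 1
        for i in targets:
            self.replicas[i].write(key, self.version, value)
        return "success"

    def get(self, key, sources):
        """Read from at least r replicas and return the newest value among them."""
        if len(sources) < self.r:
            raise ValueError("need at least r replicas to serve a read")
        answers = [self.replicas[i].read(key) for i in sources]
        return max(answers, key=lambda answer: answer[0])[1]


store = QuorumStore()
store.put("photos/cats.png", "metadata-v1", targets=[0, 1])  # replica 2 is behind

# Every possible pair of replicas still returns the acknowledged write.
results = [store.get("photos/cats.png", pair) for pair in combinations(range(3), 2)]
print(results)
```

Because any two replicas out of three must include at least one that took the write, the lagging replica never becomes visible to the caller. The cost is that both reads and writes now wait on more than one machine.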

Why storage systems historically liked eventual consistency

Eventual consistency often gets treated as a design mistake.

I don't think that's fair.

For years, many large-scale storage systems chose eventual consistency deliberately, because coordinating every write across every node adds latency and reduces availability when nodes or network links fail.

The goal wasn't to be less correct.

The goal was to remain responsive and available at scale.

A classic example involved object listing.

You upload an object.

The upload succeeds.

You immediately perform a LIST operation.

The object isn't there.

The first time you see this behavior it feels like a bug.

In reality, nothing is broken. The metadata simply hasn't converged yet.

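This kind of behavior existed in enough systems that many cloud engineers learned to expect it. Amazon S3 is a well-known example: its LIST operations were eventually consistent until the service moved to strong read-after-write consistency in December 2020.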
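The listing case can be sketched in a few lines. This is a toy model, not how any particular product is built: `ObjectStore`, `list_objects`, and `converge` are made-up names, and the assumption is that object data is written synchronously while the listing index is updated by a separate, delayed step.

```python
class ObjectStore:
    """Hypothetical store: object writes are immediate, the listing index lags."""

    def __init__(self):
        self.objects = {}        # key -> body, updated on every put
        self.list_index = set()  # what LIST can see
        self.index_backlog = []  # keys waiting to be added to the index

    def put(self, key, body):
        self.objects[key] = body
        self.index_backlog.append(key)  # index update is deferred
        return "success"

    def get(self, key):
        return self.objects.get(key)

    def list_objects(self, prefix=""):
        return sorted(k for k in self.list_index if k.startswith(prefix))

    def converge(self):
        """Apply the deferred index updates (stands in for background work)."""
        self.list_index.update(self.index_backlog)
        self.index_backlog.clear()


store = ObjectStore()
store.put("photos/cats.png", b"...")

listed_immediately = store.list_objects("photos/")  # object missing from LIST
fetched_directly = store.get("photos/cats.png")     # but a direct read works
store.converge()
listed_later = store.list_objects("photos/")        # LIST has caught up

print(listed_immediately, fetched_directly is not None, listed_later)
```

Nothing was lost at any point. The object was durable the whole time; only the view served by LIST was behind.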

The trade-off is where the complexity lives

A lot of discussions frame consistency as:

Strong consistency = good

Eventual consistency = bad

In practice, the decision is rarely that simple.

Strong consistency often pushes complexity into the storage layer through coordination, consensus, and failure handling.

Eventual consistency often pushes complexity into application code through retries, stale reads, and edge cases.

The complexity doesn't disappear.

It just moves.

That's why consistency discussions are ultimately engineering discussions, not ideological ones.
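On the application side, that moved complexity often looks like a polling helper. The following is one possible shape for it, assuming the caller supplies a `check` function that returns True once the object is visible. The helper name `wait_until_visible` and its parameters are illustrative, not taken from any SDK.

```python
import time


def wait_until_visible(check, attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call check() with exponential backoff until it returns True.

    Returns True as soon as check() succeeds, or False once the
    attempts are used up.
    """
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # base, 2x base, 4x base, ...
    return False


# Simulated eventually consistent check: visible on the third call.
calls = {"count": 0}


def object_is_listed():
    calls["count"] += 1
    return calls["count"] >= 3


delays = []  # record the requested delays instead of actually sleeping
found = wait_until_visible(object_is_listed, sleep=delays.append)
print(found, calls["count"], delays)
```

Every caller that reads right after writing needs something like this, plus a decision about what to do when the attempts run out. With a strongly consistent store, none of that code exists.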

The real question isn't:

"Which model is better?"

The real question is:

"Where do we want to pay the cost?"

Key takeaway

The most useful way I've found to think about consistency is this:

Consistency determines when the result of a change becomes visible to everyone reading from a distributed system.

Strong consistency prioritizes agreement before responding.

Eventual consistency prioritizes progress and convergence over time.

Neither approach is free.

Every storage system chooses a different balance between coordination, performance, availability, and operational complexity.

Top comments (1)

MaxHuo • Jun 23

One thing I've noticed is that eventual consistency usually isn't where the frustration starts.

The storage system is often doing exactly what it's supposed to do. The surprises show up later when someone uploads an object, immediately tries to read or list it, and gets a result they weren't expecting.

I've seen plenty of discussions where people describe that behavior as a bug, when it's really a consequence of the consistency guarantees the system chose to make.

For engineers who have worked with distributed systems, what consistency-related issue caused the most confusion on your team?