MaxHuo

Posted on Jun 30

Why Object Storage Rewrites Whole Objects (It's Far More Than Just Consensus)

#storage #distributedsystems #cloud #architecture

A junior on my team pulled me aside yesterday with that familiar frustrated look—the kind that screams he's convinced he found a massive, stupid design flaw.

He pulled up our monitoring dashboard and pointed aggressively. "Look at this. I changed one freaking character in a 5 MB JSON file—only one single byte modified. The system reuploaded the entire file, saturated our network ingress and kicked off a full garbage collection cycle. Why can't it just flip that byte directly on disk, like my laptop's drive does?"

It's a totally fair question. Consensus and replication do factor into object storage design, but they aren’t the sole driving force. The first constraint lives much lower in the stack: how physical storage hardware performs best under different write patterns. But that’s only half the equation—distributed cluster requirements add another layer of pressure pushing object storage toward immutable write patterns.

One major foundational bottleneck sits in every storage rack: spinning hard drives, running at 7,200 RPM.

Let's pop the hood on spinning hard drives

Object storage is built for massive scale. Even in 2026, that scale still relies on spinning rust—HDDs. No team can afford to host multi-exabyte datasets entirely on all-flash arrays without draining the entire cloud budget.

The core issue comes down to basic hard drive physics: disks hate random access, and they thrive on sequential streaming.

If you edit a single byte buried in the middle of a 5 MB file on an HDD, the hardware has to execute a full read-modify-write sequence, with three distinct steps:

Seek to the right track and pull the full 4K (or larger) sector into memory
Tweak that one tiny byte in RAM
Wait for the platter to rotate back around to the same sector, then write the whole sector back

Each mechanical access (seek plus rotational latency) costs roughly 10 milliseconds.

Ten milliseconds sounds negligible on its own, until you imagine 10,000 concurrent microservices all patching tiny snippets of data at once. Drive heads flick back and forth across platters nonstop, like a broken metronome having a seizure. At 10 ms per access, each drive tops out at around 100 random operations per second, so the IOPS left for any single writer collapse to single digits, and per-drive throughput tanks from 200 MB/s of sequential streaming to well under 1 MB/s of 4K patches.
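Here's the rough arithmetic behind that collapse, per drive. Treat it as a back-of-the-envelope sketch: the 10 ms access time and the 200 MB/s sequential rate come from the text above, while the 4K I/O size and the function names are my own assumptions.

```python
# Back-of-the-envelope numbers; every constant here is an assumption.
ACCESS_TIME_S = 0.010      # ~10 ms per random access (seek + rotation)
IO_BYTES = 4096            # assume each tiny patch touches one 4K sector
SEQUENTIAL_MB_S = 200      # sequential streaming rate of one drive


def random_iops(access_time_s: float) -> float:
    """Random operations per second when every I/O pays a full access."""
    return 1.0 / access_time_s


def random_throughput_mb_s(access_time_s: float, io_bytes: int) -> float:
    """Throughput in MB/s when each I/O moves io_bytes per access."""
    return random_iops(access_time_s) * io_bytes / 1_000_000


iops = random_iops(ACCESS_TIME_S)
rand_mb_s = random_throughput_mb_s(ACCESS_TIME_S, IO_BYTES)
slowdown = SEQUENTIAL_MB_S / rand_mb_s

print(f"random IOPS per drive: {iops:.0f}")
print(f"random 4K throughput:  {rand_mb_s:.2f} MB/s")
print(f"sequential is ~{slowdown:.0f}x faster")
```

Under these assumptions one drive manages about 100 random operations per second, or roughly 0.4 MB/s, and that budget is shared by every writer hitting the drive.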

Object storage isn't optimized for frequent, fine-grained updates. Its strengths are large sequential writes, high durability, and massive scalability. The simplest way to lock in consistent high bandwidth is to avoid random head movement entirely. Instead of mutating existing object data in-place, most object storage implementations generate an entirely separate new object, then flip metadata pointers to mark this new version as live.

We sacrifice random write flexibility to keep drives performant—and save ourselves endless operational headaches.

Hardware explains why sequential writes are efficient.
Distributed systems explain why immutable objects are easier to coordinate.

Modern object storage inherits both ideas. The HDD behavior gave us the mechanical push toward append-only patterns, but the distributed coordination challenges gave us the architectural lock-in.

Yeah, but what about NVMe SSDs?

SSDs have no moving heads, but they carry their own critical flaw: write amplification.

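SSD hardware writes data in 4KB pages, but erases data in large 2MB–4MB blocks, and a page can't be overwritten until its block has been erased. In the worst case, applying that tiny changed byte in-place means reading the entire block into cache, fully erasing the physical block (a destructive operation), then rewriting the whole block. In practice the controller usually writes the updated page somewhere else and marks the old one stale, but garbage collection still has to copy every live page out of a block before erasing it, so the extra writes get deferred rather than avoided.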

This chews through the drive's limited Program/Erase (P/E) cycles. Frequent small overwrites mean more write amplification, and that wears an SSD out far faster than sequential workloads would.

For most large-scale cloud deployments, the extra storage overhead from immutable full rewrites is a worthwhile tradeoff. This design delivers more stable hardware lifespans, simpler failure recovery logic, and consistent throughput, and those benefits outweigh the cost of retaining duplicate object versions until garbage collection cleans them up.
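To put a number on write amplification, here's a small sketch. It assumes the 4KB page and the low end of the 2MB–4MB block range from above, and it models the worst case where a one-page update forces a whole erase block to be rewritten. Real controllers remap pages and amortize this, so read the result as an upper bound and not a measurement.

```python
PAGE_BYTES = 4 * 1024            # program (write) unit, from the text
BLOCK_BYTES = 2 * 1024 * 1024    # erase unit, low end of the 2MB-4MB range


def write_amplification(physical_bytes: int, logical_bytes: int) -> float:
    """Bytes the flash actually writes per byte the host asked to write."""
    return physical_bytes / logical_bytes


# Worst case: a one-page update forces the whole erase block to be rewritten.
worst_case = write_amplification(BLOCK_BYTES, PAGE_BYTES)

# Sequential full rewrite: a 5 MB object fills whole blocks, rounded up.
object_bytes = 5 * 1024 * 1024
blocks_needed = -(-object_bytes // BLOCK_BYTES)   # ceiling division
sequential = write_amplification(blocks_needed * BLOCK_BYTES, object_bytes)

print(f"worst-case small overwrite: {worst_case:.0f}x")
print(f"sequential 5 MB rewrite:    {sequential:.1f}x")
```

Note that the ratio is per logical byte. Rewriting the whole 5 MB object still writes more total bytes than patching one block; what improves is that the writes are sequential and fill whole blocks, which should leave far less cleanup work for the drive's garbage collector.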

Fine, but what about the distributed cluster side of things?

Now we get to the topic everyone fixates on: consistency. Consensus challenges are the other half of the story, but they're not where the physical constraints begin.

That said, supporting in-place updates across three data replicas (Nodes A, B, C) creates an unmanageable pile of edge cases.

Picture a client sending a tiny patch, and Node B crashing halfway through the operation. Node A holds the updated data, Node B is left with corrupted partial bytes, and Node C still serves the original file.

Which version do we serve for the next GET request?
Do we roll back changes entirely, or risk serving a broken mix of old and new data?
How do we manage cross-node locks without crushing read throughput?

Immutable objects eliminate this whole mess. We write the complete new object copy to every node first. Once all replicas validate the checksum successfully, the metadata layer swaps one single pointer to mark the new version as live. If the upload fails halfway? The original object stays fully intact and continues serving traffic. There's no partial state to reconcile, no emergency 2 AM incident calls to untangle split-brain data issues.
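The write-everything-then-swap flow above can be sketched in a few lines. This is a minimal in-memory model, not how any particular product does it: `Replica`, `MetadataLayer`, `put_object`, `get_object`, and the `fail_next_write` test hook are all names I made up for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical storage node: version id -> immutable bytes."""
    name: str
    blobs: dict = field(default_factory=dict)
    fail_next_write: bool = False      # test hook to simulate a crash

    def write(self, version_id: str, data: bytes) -> str:
        if self.fail_next_write:
            self.fail_next_write = False
            raise IOError(f"{self.name} crashed mid-write")
        self.blobs[version_id] = data
        return hashlib.sha256(data).hexdigest()


class MetadataLayer:
    """Hypothetical metadata service: key -> live version id."""

    def __init__(self) -> None:
        self.live: dict = {}

    def swap(self, key: str, version_id: str) -> None:
        self.live[key] = version_id    # the single pointer update


def put_object(key: str, data: bytes, replicas: list, meta: MetadataLayer) -> bool:
    expected = hashlib.sha256(data).hexdigest()
    version_id = f"{key}@{expected[:12]}"
    try:
        checksums = [r.write(version_id, data) for r in replicas]
    except IOError:
        return False                   # pointer untouched, old version stays live
    if any(c != expected for c in checksums):
        return False
    meta.swap(key, version_id)         # commit only after every replica checks out
    return True


def get_object(key: str, replica: Replica, meta: MetadataLayer) -> bytes:
    return replica.blobs[meta.live[key]]


nodes = [Replica("A"), Replica("B"), Replica("C")]
meta = MetadataLayer()
put_object("config.json", b'{"mode": "a"}', nodes, meta)

nodes[1].fail_next_write = True        # Node B will crash on the next upload
ok = put_object("config.json", b'{"mode": "b"}', nodes, meta)
print(ok, [get_object("config.json", n, meta) for n in nodes])
```

After the failed upload, every node still serves the original bytes. Node A is left holding an orphaned copy of the new version, which is exactly the kind of garbage the collector exists to clean up.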

An atomic metadata swap is comparatively simple to implement. Safe partial in-place writes across distributed replicas? That's considerably harder to implement and to reason about.

The unadvertised downside vendors skip in marketing docs

This architecture comes with an unavoidable, messy side effect: rampant storage bloat.

If you repeatedly edit a 10 GB database backup every five minutes (please don't do this), you generate another 10 GB of orphaned stale data every cycle. Object storage quickly turns into a dumping ground unless you carefully tune lifecycle policies and garbage collection cadence.
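The bloat is easy to estimate. Here's a quick sketch using the 10 GB object and five-minute cadence from the example above. The `keep_noncurrent` cap is a hypothetical stand-in for whatever version-retention rule your lifecycle policy supports, and the model assumes nothing else gets cleaned up in the meantime.

```python
from typing import Optional

OBJECT_GB = 10               # size of the object being rewritten
REWRITE_INTERVAL_MIN = 5     # one full rewrite every five minutes


def rewrites(hours: float) -> int:
    """Number of full rewrites that happen in the given window."""
    return int(hours * 60 // REWRITE_INTERVAL_MIN)


def stale_gb(hours: float, keep_noncurrent: Optional[int] = None) -> int:
    """GB held by non-live versions after `hours` of rewrites.

    keep_noncurrent caps how many old versions survive cleanup
    (hypothetical policy knob); None means nothing is ever cleaned up.
    """
    n = rewrites(hours)      # every rewrite orphans the previous version
    if keep_noncurrent is not None:
        n = min(n, keep_noncurrent)
    return n * OBJECT_GB


print(f"1 hour, no cleanup:    {stale_gb(1)} GB")
print(f"24 hours, no cleanup:  {stale_gb(24)} GB")
print(f"24 hours, keep last 3: {stale_gb(24, keep_noncurrent=3)} GB")
```

With no cleanup, a single day of this produces 2,880 GB of stale versions for one 10 GB object. Capping retention at three old versions holds that at 30 GB.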

Versioning is widely marketed as a user recovery feature, but it's really a natural side effect of immutable storage—one that vendors package as a sellable "one-click rollback" capability.

If you treat object storage like a local hard drive, you're going to have a bad time.

It's not built for frequent edits. It's built for publishing—upload once, leave it untouched, delete only when you're certain. It trades random-write flexibility for raw throughput, accepts slower edits to extend hardware lifespan, and dodges the worst distributed consensus nightmares by simply refusing to play the "modify" game at all.

Once you frame object storage as an append-first system rather than a mutable local filesystem, every counterintuitive design choice falls into place.

Top comments (2)

MaxHuo • Jul 1

Quick pinned note for everyone stopping by!

This article pushes back against the common take that consensus is the main driver for full object rewrites in object storage. Hardware limitations (HDD random I/O pain, SSD write amplification) are the foundational constraint, with distributed consistency challenges as a secondary architectural lock-in.

Drop a comment if you’ve run into real-world wear/throughput issues with mutable patch workflows on your storage cluster. Bookmark if this breakdown cleared up confusion for you—much appreciated!

Valentyn Kit • Jul 8

This is the same argument that gets you to LSM trees in general - sequential writes plus immutable segments plus background compaction, whether the "segment" is an SSTable or a 5MB object. Object storage just makes the trade-off explicit because you can't hide it behind a page cache.