Read-Modify-Write Is Where NoSQL Concurrency Bugs Begin

#architecture #distributedsystems #mongodb #nosql

Part 1 of 3 — the single-document case.

There's a class of bug that every backend engineer ships at least once, usually
without noticing for months. It hides inside the most innocent-looking operation:
read a document, decide something, write it back.

Take a concrete invariant: a team can hold at most 10 seats. To add a seat you
read the team document, count the seats, check count < 10, and write. A textbook
Read → Modify → Write.

Now run it twice at the same instant. Request A reads count = 9, decides "9 < 10,
fine", and writes 10. Request B, a millisecond later, also reads count = 9,
decides "fine", and writes 10. You now have a team document that says 10 seats
while 11 were actually granted. Neither request did anything wrong on its own. One write
silently erased the premise of the other. This is a lost update, and it's the
core anomaly of the single-document case.

T0   A reads count = 9
T1   B reads count = 9
T2   A writes count = 10   ("9 < 10, fine")
T3   B writes count = 10   ("9 < 10, fine")

Reality:        11 seats granted
Database state: 10
Invariant:      violated, silently
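
The interleaving above can be replayed deterministically. This is a minimal sketch with no database at all: a plain dict stands in for the team document, and the names (simulate_lost_update, MAX_SEATS, seats_granted) are made up for illustration.

```python
MAX_SEATS = 10


def simulate_lost_update():
    """Replay the T0..T3 interleaving against an in-memory stand-in document."""
    team = {"count": 9}   # hypothetical team document, 9 seats already taken
    seats_granted = 0     # what actually happened in the real world

    # T0, T1: both requests read before either one writes.
    a_seen = team["count"]
    b_seen = team["count"]

    # T2: A checks its snapshot and writes.
    if a_seen < MAX_SEATS:
        team["count"] = a_seen + 1
        seats_granted += 1

    # T3: B checks a snapshot that is now stale and overwrites A's write.
    if b_seen < MAX_SEATS:
        team["count"] = b_seen + 1
        seats_granted += 1

    # (seats that exist in reality, seats the document claims)
    return 9 + seats_granted, team["count"]


reality, stored = simulate_lost_update()
print(f"reality: {reality} seats, database: {stored}")
```

Running it prints "reality: 11 seats, database: 10", the same gap as the timeline.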

Here's what teams actually reach for, and exactly what each option leaves on the
table.

The fat aggregate (atomic operators)

If you can express the whole mutation as a single atomic operator — $inc,
$push with $slice, or a conditional findAndModify — MongoDB applies it
atomically on the document. There's no read-then-write window, so no lost update.
For invariants that fit a single atomic expression, this is genuinely the right
tool, and you should reach for it first.
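
For the seat example, the check and the increment can be folded into one conditional update. Here is a minimal sketch using pymongo's update_one; the collection handle teams, the field name count, and the helper name add_seat_atomic are assumptions for illustration, not something the schema above prescribes.

```python
from typing import Any

MAX_SEATS = 10


def add_seat_atomic(teams: Any, team_id: Any) -> bool:
    """Grant one seat only if the team is below the cap. True on success.

    `teams` is assumed to be a pymongo Collection (or anything exposing the
    same update_one method). The filter and the $inc are applied as a single
    atomic operation on the matched document, so no other writer can slip in
    between the check and the increment.
    """
    result = teams.update_one(
        {"_id": team_id, "count": {"$lt": MAX_SEATS}},  # the invariant, as a filter
        {"$inc": {"count": 1}},                          # the mutation
    )
    # 0 means the team is missing or already at the cap.
    return result.modified_count == 1
```

With two concurrent callers at count = 9, one update matches and the other finds no document satisfying count < 10, so it returns False instead of granting an 11th seat.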

The catch: not every invariant fits. The moment your check needs branching ("if
the plan is free and count ≥ 5, reject") you're back to reading, deciding in
application code, and writing — and the window reopens. Embedding related data is
a perfectly good modeling choice; the trap is different. It's the temptation to keep
stretching one document's consistency boundary — folding in unrelated rules just
to keep the write atomic — which is exactly how you end up with 16 MB documents and a saturated network.

Anomaly status: ✅ lost update handled — for the subset of rules expressible as
one atomic op.

The pessimistic lock (Redis)

Grab a distributed lock before the read, release after the write. It works — but
for a single document it's a sledgehammer. You've added a network round-trip, a
brand-new failure mode (the lock service), and a whole class of distributed
coordination failures (lease expiry, clock drift, fencing, split-brain), all to
guard one document the database could have guarded itself.
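
To make the moving parts concrete, here is a rough sketch of the acquire/release dance against a single Redis node, written for a redis-py style client (set with nx/px, and eval for the release script). The key format, the 5-second lease, and the names team_lock and LockNotAcquired are all assumptions for illustration. It deliberately has no fencing and no lease renewal, so it inherits the failure modes listed above.

```python
import uuid
from contextlib import contextmanager

# Delete the key only if we still own it (compare-and-delete, run inside Redis).
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


class LockNotAcquired(Exception):
    """Raised when another holder currently owns the lock."""


@contextmanager
def team_lock(redis_client, team_id, ttl_ms=5000):
    """Hold a lease-based lock on one team for the duration of the block.

    `redis_client` is assumed to be a redis-py client. If the block runs
    longer than ttl_ms, the lease expires and another caller can enter while
    we are still working; this sketch does nothing to detect that.
    """
    key = f"lock:team:{team_id}"   # hypothetical key naming scheme
    token = uuid.uuid4().hex       # identifies this holder
    # SET key token NX PX ttl: succeeds only if nobody holds the key.
    if not redis_client.set(key, token, nx=True, px=ttl_ms):
        raise LockNotAcquired(key)
    try:
        yield token
    finally:
        # Release only if the stored token is still ours.
        redis_client.eval(RELEASE_SCRIPT, 1, key, token)
```

The read, the check, and the write would all go inside a "with team_lock(client, team_id):" block. Every call now costs at least two extra round-trips to Redis on top of the database work.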

Anomaly status: ✅ lost update, and any other check you run while the lock actually
holds, at the cost of latency and distributed coordination failures. (Part 3 is
dedicated to why that bill is steep.)

Optimistic locking (a version field)

Carry a version on the document. Read it, run your logic, then write with a
guard: findOneAndUpdate({_id, version: v}, {$set: {...}, $inc: {version: 1}}). If
anyone wrote in between, version has moved, your guard matches nothing, and you
retry. This is the clean default for a single-document read-modify-write that
doesn't fit an atomic operator: it kills lost update with no external system.

The catch: under contention it's a retry machine. The more concurrent writers, the
more losers re-run their logic, burning CPU and tail latency.
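
Here is a minimal sketch of that loop in Python, using the branching rule from earlier (free plan capped at 5 seats, otherwise 10). It issues the guarded write with pymongo's update_one and checks matched_count, which is one of several ways to do it. The field names plan, count, and version, the attempt limit, and the helper names are assumptions for illustration.

```python
from typing import Any

MAX_SEATS = 10
FREE_PLAN_MAX_SEATS = 5


class TooMuchContention(Exception):
    """Raised when every attempt lost the race to another writer."""


def add_seat_optimistic(teams: Any, team_id: Any, max_attempts: int = 5) -> bool:
    """Read, decide in application code, then write guarded by the version.

    `teams` is assumed to be a pymongo Collection. Returns True if a seat was
    granted, False if the business rule rejected the request.
    """
    for _ in range(max_attempts):
        doc = teams.find_one({"_id": team_id})
        if doc is None:
            return False

        # Branching logic that does not fit a single atomic operator.
        limit = FREE_PLAN_MAX_SEATS if doc.get("plan") == "free" else MAX_SEATS
        if doc["count"] >= limit:
            return False

        result = teams.update_one(
            {"_id": team_id, "version": doc["version"]},  # guard: nobody wrote since our read
            {"$set": {"count": doc["count"] + 1}, "$inc": {"version": 1}},
        )
        if result.matched_count == 1:
            return True
        # Guard matched nothing: someone else wrote first. Re-read and re-decide.

    raise TooMuchContention(f"gave up after {max_attempts} attempts")
```

Bounding the attempts is a design choice: it turns unbounded retry storms into an explicit error the caller can surface or back off from.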

Anomaly status: ✅ lost update — at the cost of app-side retries.

Pray

Bet that two requests never touch the same document in the same millisecond. They
will.

Anomaly status: ❌ lost update, in production, at 3 a.m.

The point

For a single document, you're actually well served: atomic operators or optimistic
locking close the gap cleanly, without external machinery. The single-document
case is the easy one.

The real pain begins the instant your invariant spans two documents — a
workspace budget gating a user debit, for example. There, optimistic locking stops
being sufficient: it still guards each document on its own, but it can no longer
guarantee an invariant that lives between them. And a nastier anomaly walks in —
the database stays perfectly "consistent" while your business invariant quietly
dies.

Welcome to write skew. That's Part 2.