The race condition appeared exactly once every few thousand requests. Not often enough to catch in testing. Often enough to corrupt customer data in production.
We were using optimistic locking—a pattern that works beautifully in monoliths and disastrously in distributed systems. I learned this the expensive way: by watching it fail in production while our monitoring showed everything was fine.
The pattern seemed reasonable. Read a record, include a version number, perform your business logic, write back with the version check. If the version changed between read and write, someone else modified the record—abort and retry. Classic optimistic concurrency control.
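Inside a single database, the whole pattern collapses into one conditional write. Here's a minimal sketch, assuming a hypothetical accounts table with balance and version columns (not our actual schema):

```python
# Minimal sketch of single-database optimistic locking. The schema (an accounts
# table with balance and version columns) is hypothetical.
import sqlite3

def deduct(conn: sqlite3.Connection, account_id: int, amount: int) -> bool:
    balance, version = conn.execute(
        "SELECT balance, version FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    if balance < amount:
        return False  # insufficient funds

    # The write succeeds only if the version is still the one we read.
    cur = conn.execute(
        "UPDATE accounts SET balance = balance - ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (amount, account_id, version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means someone else got there first: retry
```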
This works when your database transaction can see all the reads and writes. It breaks when those operations happen across service boundaries, with network calls in between, and multiple sources of truth that don't coordinate.
We found out because a customer's account balance went negative in a way that should have been impossible. Our code had checks preventing this. Our database had constraints preventing this. Yet somehow, between three microservices coordinating a transaction, we managed to violate both.
The Setup That Looked Safe
We had three services: Account Service (managed user balances), Payment Service (processed transactions), and Ledger Service (maintained transaction history). Standard microservices decomposition—each service owned its domain, communicated via APIs, stored its own data.
When a user made a purchase, the flow looked like this:
1. Payment Service receives purchase request
2. Payment Service calls Account Service to check balance
3. Account Service returns current balance with version number
4. Payment Service validates sufficient funds
5. Payment Service calls Account Service to deduct amount (passing version)
6. Account Service checks version, deducts if unchanged
7. Payment Service calls Ledger Service to record transaction
Each step looked safe in isolation. The version check at step 6 ensured no one modified the balance between check and deduct. Optimistic locking doing its job.
Except this wasn't atomic. Between steps 2 and 6, other requests could be processing. The version check caught concurrent modifications to the same account, but it didn't coordinate across services. And that's where everything broke.
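In code, the Payment Service side of that flow looked roughly like the sketch below. The client names and payloads are hypothetical reconstructions, but the shape is accurate:

```python
# Hypothetical sketch of the Payment Service purchase flow. account_client and
# ledger_client stand in for HTTP clients to the other two services.
def process_purchase(account_client, ledger_client, account_id: str, amount: int) -> None:
    # Steps 2-3: read balance and version over the network.
    snapshot = account_client.get_balance(account_id)  # {"balance": ..., "version": ...}

    # Step 4: validate in Payment Service, against a snapshot that is already stale.
    if snapshot["balance"] < amount:
        raise ValueError("insufficient funds")

    # Steps 5-6: ask Account Service to deduct, guarded only by the version we read.
    # Everything between the read above and this call is an open race window.
    account_client.deduct(account_id, amount, expected_version=snapshot["version"])

    # Step 7: record the transaction in a third service, in a separate transaction.
    ledger_client.record_transaction(account_id, amount)
```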
The Failure Mode Nobody Expected
Here's what actually happened in production:
Request A: User purchases item for $50. Balance is $100, version 42.
Request B: User purchases item for $75. Balance is $100, version 42.
Both requests read the same balance and version simultaneously. Both validated sufficient funds—$100 is enough for $50 and enough for $75 individually.
Request A writes first: Balance becomes $50, version 43.
Request B's version check fails—version 42 doesn't match current version 43. Retry.
Request B reads again: Balance is $50, version 43.
Request B validates: $50 is enough for $75—oh wait, it's not. Reject.
This is the happy path. Optimistic locking worked. The second transaction was rejected because the balance changed.
But sometimes this happened:
Request A deducts $50. Its version check passes and its write begins committing.
Request B checks version 42 in that window. Request A's write hasn't committed yet, so the version still matches and the check passes.
Request A's commit completes. Version is now 43, balance is $50.
Request B's write goes through anyway, because the version check and the write were separate statements rather than a single atomic compare-and-set. It sets the balance to $25 (the original $100 minus $75).
Now the balance is $25. Both transactions succeeded. The user spent $125 on a $100 balance.
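To make the shape of that bug concrete, here's a simplified sketch of the kind of check-then-write the deduct endpoint was doing. The code is a hypothetical reconstruction, not our actual service:

```python
# Hypothetical sketch of the broken deduct endpoint: the version check and the
# write are separate statements, so another commit can land in between them.
class VersionConflict(Exception):
    pass

def deduct_broken(conn, account_id: int, amount: int, expected_version: int) -> None:
    balance, version = conn.execute(
        "SELECT balance, version FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()

    if version != expected_version:
        raise VersionConflict("version changed, retry")  # Request B passes this check

    # ...Request A's commit lands here, bumping the version to 43...

    # This write is unconditional, so it still succeeds with the stale balance.
    conn.execute(
        "UPDATE accounts SET balance = ?, version = ? WHERE id = ?",
        (balance - amount, expected_version + 1, account_id),
    )
    conn.commit()
```

A single conditional UPDATE with a WHERE version = ? clause closes this particular window inside one service. It doesn't solve the coordination across services, which is the larger problem here.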
We had optimistic locking. We had version checks. We had what looked like safe concurrency control. What we didn't have was transactional coordination across service boundaries.
Why Optimistic Locking Fails in Distributed Systems
In a monolith, optimistic locking works because everything happens in one database transaction. Read, validate, write—atomic. The database guarantees that if the version changed, your write fails.
In microservices, that guarantee disappears. You're not in one transaction. You're in multiple network calls, each with its own transaction, its own timing, its own failure modes.
Network delays create timing windows. Between reading a version and writing with that version, enough time passes for multiple other requests to complete their entire lifecycle. Your version check is validating against state that existed milliseconds ago—an eternity in high-throughput systems.
Service boundaries break atomicity. When Account Service deducts a balance, Payment Service records a charge, and Ledger Service logs a transaction, these aren't one atomic operation. They're three separate operations that can succeed or fail independently.
Retries compound the problem. When a version check fails, the standard response is retry. But retries mean re-reading state, re-validating, re-attempting writes. Each retry is another chance for race conditions between services that think they're coordinating but actually aren't.
Optimistic locking assumes low contention. It's designed for scenarios where concurrent modifications are rare. In distributed systems with multiple services reading and writing shared state, contention isn't rare—it's constant.
We learned this by watching our retry rates. They were acceptable in testing (low load, no concurrency). In production (high load, constant concurrency), retry storms created cascading failures. Services spent more time retrying than processing.
The Monitoring That Lied to Us
Our monitoring showed healthy systems. API response times were good. Error rates were low. Database performance was fine. Everything looked green.
What we didn't monitor was the thing that actually broke: cross-service consistency.
Account Service's database was consistent. Payment Service's database was consistent. Ledger Service's database was consistent. But the relationship between them—the invariant that balance changes must match transaction records—was broken.
We had metrics for each service. We didn't have metrics for the contracts between services.
The bugs appeared as data anomalies discovered by batch jobs hours later. "Account balance doesn't match sum of transactions." By then, the request that caused the inconsistency was long gone from logs, impossible to debug, impossible to prevent from happening again.
We needed to monitor different things:
Cross-service invariant checks. Regular jobs that validated relationships between services. Did the sum of transactions in Ledger match the balance changes in Account? Did every payment in Payment Service have corresponding entries in both other services?
Version collision rates. How often did optimistic locking version checks fail? Rising collision rates indicated growing contention that would eventually cause consistency issues.
Compensation transaction frequency. How often did we need to roll back or fix data? This was the real error rate—not HTTP 500s, but business logic failures that succeeded at the technical level but failed at the semantic level.
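The cross-service invariant checks were the highest-value addition. A minimal sketch of one, with hypothetical read-only clients for the two services and ledger entries assumed to be signed (credits positive, debits negative):

```python
# Minimal sketch of a cross-service invariant check, run on a schedule.
# account_client and ledger_client are hypothetical read-only service clients.
def check_balance_invariant(account_client, ledger_client, account_id: str) -> bool:
    balance = account_client.get_balance(account_id)["balance"]
    ledger_sum = sum(t["amount"] for t in ledger_client.list_transactions(account_id))

    if balance != ledger_sum:
        # Flag it; let a human or a compensation workflow decide how to fix it.
        print(f"INVARIANT VIOLATION account={account_id} "
              f"balance={balance} ledger_sum={ledger_sum}")
        return False
    return True
```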
Tools that help you analyze trends across distributed logs became essential. We couldn't see the pattern from any single service's metrics—only by correlating data across services did the consistency failures become visible.
What Actually Works
After debugging our third major consistency issue, we rewrote the critical paths with different patterns:
Saga pattern with compensation. Instead of optimistic locking across services, we used orchestrated sagas. One service coordinates a multi-step transaction, with explicit compensation logic if any step fails. This trades performance for consistency—it's slower, but it actually works.
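At its core, an orchestrated saga is a list of steps paired with compensations that run in reverse order when a later step fails. A minimal sketch, with hypothetical step functions rather than any particular framework:

```python
# Minimal sketch of an orchestrated saga: each completed step's compensation is
# run in reverse order if a later step fails.
def run_saga(steps):
    """steps: list of (action, compensation) pairs of callables."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            # Roll back everything that already succeeded, newest first.
            for undo in reversed(completed):
                undo()
            raise

# Usage sketch: if the ledger write fails, refund the deduction instead of
# leaving the two services disagreeing about what happened.
# run_saga([
#     (lambda: account_client.deduct(acct, amount), lambda: account_client.refund(acct, amount)),
#     (lambda: ledger_client.record(acct, amount),  lambda: ledger_client.void(acct, amount)),
# ])
```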
Pessimistic locking where it matters. For high-value operations, we switched to distributed locks. Before processing a transaction, acquire a lock on the account. This kills concurrency, but it prevents impossible states. Some operations are worth the latency cost.
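The lock itself doesn't have to be exotic. A Redis SET NX with a TTL covers the common case, as long as you only ever release a lock you still own. A sketch using redis-py (our actual lock implementation differed in the details):

```python
# Sketch of a per-account distributed lock using Redis SET NX + TTL (redis-py).
from contextlib import contextmanager
import uuid
import redis

r = redis.Redis()

# Release only if we still own the lock: compare the token atomically in Lua.
RELEASE = "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) end return 0"

@contextmanager
def account_lock(account_id: str, ttl_seconds: int = 10):
    key, token = f"lock:account:{account_id}", str(uuid.uuid4())
    if not r.set(key, token, nx=True, ex=ttl_seconds):
        raise RuntimeError(f"account {account_id} is locked, retry later")
    try:
        yield
    finally:
        r.eval(RELEASE, 1, key, token)

# with account_lock(account_id):
#     ...read balance, deduct, write the ledger entry...
```

A single-key lock like this is best-effort (the TTL can expire mid-operation), so the highest-value paths may still want fencing tokens or a consensus-backed lock.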
Event sourcing for audit trails. Instead of updating balances directly, we started storing events (TransactionCreated, BalanceDeducted) and computing balances from event streams. This gave us both consistency and a complete audit trail. With an append-only log and a conditional append (the write succeeds only if the stream is still at the position you read), two transactions can't both believe they were first.
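A stripped-down, in-memory sketch of the idea: events are appended, never updated, and the balance is a fold over the stream. A real system would persist the stream and enforce an expected position on append:

```python
# In-memory sketch of event sourcing for balances. Event names match the post;
# persistence and expected-position checks on append are omitted for brevity.
from dataclasses import dataclass

@dataclass(frozen=True)
class TransactionCreated:
    account_id: str
    amount: int  # interpreted here as a credit, for illustration

@dataclass(frozen=True)
class BalanceDeducted:
    account_id: str
    amount: int  # positive amount that was deducted

def current_balance(events, account_id: str) -> int:
    balance = 0
    for e in events:
        if e.account_id != account_id:
            continue
        if isinstance(e, TransactionCreated):
            balance += e.amount
        elif isinstance(e, BalanceDeducted):
            balance -= e.amount
    return balance

def deduct(events: list, account_id: str, amount: int) -> None:
    if current_balance(events, account_id) < amount:
        raise ValueError("insufficient funds")
    events.append(BalanceDeducted(account_id, amount))  # the append is the only write
```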
Idempotency keys everywhere. Every request that modifies state requires an idempotency key. Retries with the same key return the same result without re-executing. This doesn't prevent race conditions, but it prevents them from multiplying on retries.
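Most of the work with idempotency keys is bookkeeping: before executing, check whether the key has been seen, and if so, return the stored result instead of re-executing. A sketch, with the store kept in memory for brevity:

```python
# Sketch of idempotency-key handling. A real implementation would store results
# in a database with a unique constraint on the key, not in an in-memory dict.
_results: dict[str, dict] = {}

def process_payment(idempotency_key: str, account_id: str, amount: int) -> dict:
    if idempotency_key in _results:
        return _results[idempotency_key]  # retry: return the first outcome, don't re-execute

    # ...the actual deduction and ledger write would happen here...
    result = {"status": "charged", "account_id": account_id, "amount": amount}
    _results[idempotency_key] = result
    return result
```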
We also started using AI models to help us reason through distributed transaction flows when designing new features. Not to generate code, but to help us think through edge cases. "What happens if service A succeeds but service B fails? What if they both retry? What invariants could break?"
For complex state machines across services, we'd use tools that could visualize the relationships and data flows, making it easier to spot where optimistic locking assumptions would break down.
The Real Lessons
Lesson one: Patterns that work locally fail globally. Optimistic locking is great in a single database. Across network boundaries, it's a source of subtle bugs. The distributed systems version of these patterns requires different primitives—distributed locks, consensus protocols, event sourcing.
Lesson two: You can't monitor distributed systems like monoliths. Each service being healthy doesn't mean the system is healthy. You need to monitor the relationships between services, not just the services themselves.
Lesson three: Consistency is expensive and worth it. The performance cost of pessimistic locking or sagas is nothing compared to the operational cost of data inconsistencies. Some operations should be slow to be correct.
Lesson four: Design for failure modes you haven't seen yet. Every distributed system has race conditions you didn't anticipate. Build compensation mechanisms, audit trails, and reconciliation processes from day one.
What You Should Actually Do
If you're building microservices, here's what I'd do differently:
Don't trust optimistic locking across service boundaries. Use it within a service's database, but not as coordination between services. The network timing makes version checks unreliable.
Build reconciliation into your architecture. Have background jobs that check cross-service consistency and flag anomalies. You can't prevent all race conditions, but you can detect them quickly.
Make critical operations pessimistic. Distributed locks are painful, but data corruption is worse. Identify the operations where consistency matters more than latency, and use coordination primitives that actually guarantee atomicity.
Log enough to debug race conditions. When a subtle consistency bug appears in production, you need to reconstruct what happened. Log request IDs, correlation IDs, versions, timestamps—enough to piece together the sequence of events across services.
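In practice that meant a correlation ID minted at the edge and attached to every log line in every service. A sketch of the kind of structured log line that makes a cross-service sequence reconstructable (field names are illustrative):

```python
# Sketch: structured log lines carrying the correlation ID, version, and
# timestamp needed to stitch a cross-service sequence back together later.
import json
import logging
import time

logger = logging.getLogger("payments")

def log_event(correlation_id: str, service: str, event: str, **fields) -> None:
    logger.info(json.dumps({
        "ts": time.time(),
        "correlation_id": correlation_id,  # minted once at the edge, passed on every call
        "service": service,
        "event": event,
        **fields,
    }))

# log_event(cid, "account-service", "deduct_attempt", account_id=acct, expected_version=42)
```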
Use idempotency keys religiously. They won't prevent race conditions, but they'll prevent them from getting worse on retries.
Platforms like Crompt AI that let you work with multiple reasoning models can help you think through these distributed transaction flows before you build them. Not as code generators, but as thought partners that help you identify edge cases and failure modes.
The Uncomfortable Truth
Distributed systems are harder than they look. The patterns that feel safe often aren't. Optimistic locking across microservices is one of those patterns—it looks reasonable, it works in testing, and it fails in production in ways that are nearly impossible to debug.
The gap between "technically correct" and "actually works under load" is where most microservices bugs live. You can have perfect code in each service and still have data corruption because the coordination between services has race conditions.
The developers who succeed with microservices aren't the ones who write the most sophisticated code. They're the ones who deeply understand distributed systems failure modes and design defensively for problems they haven't encountered yet.
Your microservices will have race conditions. The question is whether you've designed your system to catch them, log them, and recover from them—or whether they'll silently corrupt data until a batch job discovers the problem hours later.
Optimistic locking works great in monoliths. In distributed systems, optimism gets you in trouble.
Building distributed systems? Use Crompt AI to reason through transaction flows and edge cases before they become production incidents—because distributed systems are too complex to get right the first time.
-Leena:)