If you've ever built integrations for a multichannel ecommerce platform, you've hit a problem that looks simple on the surface and gets deeply interesting the moment you put it under load.
Every marketplace — Shopify, Amazon, eBay, Walmart, Flipkart — maintains its own completely isolated inventory record. They were each designed to be the center of a seller's universe, not one participant in a larger synchronized system.
When you're building a platform that connects all of them, you inherit a distributed state problem: the same piece of data (available quantity for a SKU) exists in N independent systems simultaneously, with no native coordination mechanism between them.
Here's how that problem behaves in production, and the architecture we settled on after learning the hard way.
Why the naive approach fails
The first implementation most teams reach for is a scheduled sync job. Every N minutes, read the master count and push it to every connected channel.
This works at low volume. It fails in specific, predictable ways at high volume.
The polling window is your exposure window. A 15-minute sync cycle means every channel has a 15-minute window where it can sell stock that no longer exists. During a demand spike — a flash sale, a viral product moment, or an AI shopping agent recommending your product to millions of users simultaneously — you can sell the same unit on five channels before a single sync propagates. No batch interval closes this window; only pushing each sale as it happens does.
Last-write-wins corrupts state. If two channels both decrement their local count between sync cycles, the next polling run overwrites one of those decrements with a stale value. You don't just fail to prevent the problem — you actively undo a correct update.
Polling doesn't scale gracefully. As SKU count and channel count grow, sync jobs take longer. A job that runs in 8 minutes on a 15-minute cycle is no longer a 15-minute sync — it's a continuous operation that never fully completes.
The event-driven model
The architecture that actually works treats every sale as an event that must immediately propagate to every connected channel.
When a sale fires on any channel:
Webhook received — acknowledge immediately
Payload queued for async processing
Idempotency check — has this event ID been processed before?
Atomic conditional decrement against master inventory
Fan-out to all connected channels
Event ID stored as processed
Hourly reconciliation catches anything the event stream missed
Each step handles a specific failure mode.
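The steps above can be sketched in a few lines. This is an illustrative in-memory sketch, not the production implementation: the names (`handle_webhook`, `process_event`, `push_count`) are mine, the real system uses a durable queue and a database, and the idempotency check and decrement must happen atomically rather than as separate Python statements.

```python
import queue

# In-memory stand-ins for the real infrastructure (illustrative only):
# a durable queue, an idempotency store, and the master inventory record.
event_queue = queue.Queue()
processed_ids = set()
inventory = {"SKU-1": 5}
channels = ["shopify", "amazon", "ebay"]

def push_count(channel, sku, qty):
    pass  # stand-in for the channel's inventory-update API call

def handle_webhook(payload):
    """Steps 1-2: acknowledge immediately, queue for async processing."""
    event_queue.put(payload)
    return 200  # ack before doing any real work

def process_event(payload):
    """Steps 3-6: idempotency check, guarded decrement, fan-out, mark done."""
    if payload["event_id"] in processed_ids:   # step 3: already seen?
        return "duplicate"
    sku = payload["sku"]
    if inventory.get(sku, 0) <= 0:             # step 4: guard condition
        return "out_of_stock"
    inventory[sku] -= 1
    for channel in channels:                   # step 5: fan-out
        push_count(channel, sku, inventory[sku])
    processed_ids.add(payload["event_id"])     # step 6: mark processed
    return "processed"
```

The key structural point is that `handle_webhook` does nothing but enqueue: all failure handling lives behind the acknowledgment.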
Atomic decrements — the race condition fix
A standard decrement is not safe under concurrent load. If two orders arrive simultaneously from different channels for the last unit, both read quantity = 1, both decrement, you're at -1 with two confirmed orders.
The fix is a conditional decrement that only executes if the guard condition is met:
```sql
UPDATE inventory
SET quantity = quantity - 1
WHERE sku = $1 AND quantity > 0
RETURNING quantity;
```
If this returns no rows, the sale fails cleanly as out-of-stock. If it returns a row, the sale succeeded. No race condition possible regardless of concurrency.
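The same guard can be exercised end to end against SQLite, standing in for the production database (table and column names match the snippet above). SQLite drivers don't always expose `RETURNING`, so this sketch checks `rowcount` instead, which answers the same question: did the guarded UPDATE match a row?

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('SKU-1', 1)")  # one unit left

def try_decrement(conn, sku):
    """True if this sale got the unit, False if it's out of stock."""
    cur = conn.execute(
        "UPDATE inventory SET quantity = quantity - 1 "
        "WHERE sku = ? AND quantity > 0",
        (sku,),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows matched means the guard failed

first = try_decrement(conn, "SKU-1")   # wins the last unit
second = try_decrement(conn, "SKU-1")  # fails cleanly; quantity stays at 0
```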
Idempotency — the retry problem
Platforms retry webhooks on failure. Shopify retries 19 times over 48 hours. Amazon SNS can retry up to 100,000 times with exponential backoff.
Without idempotency, a single network hiccup becomes 19 duplicate inventory decrements.
Every event needs a unique identifier. Every processed event ID needs to be stored. Before processing, check whether you've seen the ID before. If yes, return cached result. If no, process and store.
```sql
INSERT INTO processed_events (event_id, channel, sku)
VALUES ($1, $2, $3)
ON CONFLICT (event_id) DO NOTHING
RETURNING event_id;
```
If this returns a row — new event, process it. If it returns nothing — already processed, skip it.
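Here's the same check runnable against SQLite as a stand-in, where `INSERT OR IGNORE` plays the role of Postgres's `ON CONFLICT DO NOTHING` and `rowcount` replaces `RETURNING`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_events ("
    "  event_id TEXT PRIMARY KEY, channel TEXT, sku TEXT)"
)

def claim_event(conn, event_id, channel, sku):
    """True if this call claimed the event, False if it's a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_events VALUES (?, ?, ?)",
        (event_id, channel, sku),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 means the event_id already existed

first = claim_event(conn, "evt-123", "shopify", "SKU-1")
retry = claim_event(conn, "evt-123", "shopify", "SKU-1")  # webhook retry
```

Because the primary key constraint does the checking, two concurrent deliveries of the same event can't both claim it: the database serializes the conflict for you.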
Webhook reliability variance
Every platform has different delivery guarantees and different retry behavior.
Shopify gives you 5 seconds to acknowledge or it retries. Any synchronous processing that exceeds 5 seconds triggers retries for successfully-processed events. Handler must return 200 immediately and process asynchronously.
Amazon SNS can retry up to 100,000 times depending on the delivery policy. Idempotency is non-negotiable here.
WooCommerce has no retry mechanism at all. Silent failures are possible. Reconciliation catches what webhooks miss.
TikTok Shop's webhook implementation changed twice during our integration period. Plan for API changes in your version management strategy.
The abstraction that helps: normalize every incoming webhook into a canonical internal event format before it touches any business logic. Platform-specific handling lives in the ingestion layer. Everything downstream sees a consistent event schema.
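A minimal sketch of that ingestion layer follows. The payload shapes below are illustrative, not the actual Shopify or WooCommerce webhook schemas, and `InventoryEvent` is a name I've invented for the canonical format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InventoryEvent:
    """Canonical internal event; everything downstream sees only this shape."""
    event_id: str
    channel: str
    sku: str
    quantity_delta: int

def normalize(channel, payload):
    """Ingestion-layer adapter. Per-platform field names live here and
    nowhere else; payload shapes are illustrative, not real schemas."""
    if channel == "shopify":
        item = payload["line_items"][0]
        return InventoryEvent(
            event_id=str(payload["id"]),
            channel="shopify",
            sku=item["sku"],
            quantity_delta=-item["quantity"],
        )
    if channel == "woocommerce":
        return InventoryEvent(
            event_id=payload["order_key"],
            channel="woocommerce",
            sku=payload["sku"],
            quantity_delta=-payload["qty"],
        )
    raise ValueError(f"unknown channel: {channel}")
```

When a platform changes its webhook format (as TikTok Shop did, twice), the blast radius is one adapter function, not the whole pipeline.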
Rate limits under load
One sale triggering simultaneous API calls to five channels — each with their own rate limit budget — can exhaust your quota across all channels simultaneously during a demand spike.
The fix: queue updates and batch them intelligently within each channel's rate limit budget. A small buffer between the event and the outbound API call gives you rate limit management without sacrificing meaningful latency. The tradeoff is a small increase in propagation time — typically under 30 seconds under high load — in exchange for not exhausting API quotas during the exact moments when you most need them.
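One way to sketch that buffer, under the assumption that only the latest count per SKU matters (so rapid consecutive sales coalesce into one outbound call). The class name and budget parameter are mine:

```python
from collections import defaultdict

class UpdateBuffer:
    """Coalesces outbound inventory updates per channel: only the latest
    count per SKU survives, so N rapid sales become one API call."""

    def __init__(self, max_calls_per_flush):
        self.max_calls = max_calls_per_flush  # the channel's rate budget
        self.pending = defaultdict(dict)      # channel -> {sku: latest_qty}

    def enqueue(self, channel, sku, qty):
        self.pending[channel][sku] = qty      # later updates overwrite earlier

    def flush(self, channel, send):
        """Send at most max_calls updates; the rest wait for the next tick."""
        batch = self.pending[channel]
        sent = 0
        for sku in list(batch):
            if sent >= self.max_calls:
                break
            send(channel, sku, batch.pop(sku))
            sent += 1
        return sent
```

Flushing on a short timer per channel is what bounds the added propagation delay while keeping each channel inside its own quota.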
Reconciliation — the safety net
Even with all of the above, events get missed. Network partitions. Deployment windows. Platform outages. Over time small discrepancies accumulate.
Hourly reconciliation compares the master record against each channel's reported count and corrects any drift. The master record wins — always. Channels that drift get corrected from it, never the other way around.
In practice, reconciliation catches roughly 2-3% of events that the webhook layer misses. Small enough that the system is reliable. Large enough that you'd notice immediately if reconciliation stopped running.
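The reconciliation pass itself is simple once the "master wins" rule is fixed. A sketch, with `push` standing in for the channel's inventory-update call:

```python
def reconcile(master, channel_counts, push):
    """Compare each channel's reported count against the master record and
    correct any drift. The master always wins; channels never write back."""
    corrections = []
    for channel, counts in channel_counts.items():
        for sku, reported in counts.items():
            expected = master.get(sku, 0)
            if reported != expected:
                push(channel, sku, expected)  # overwrite the drifted count
                corrections.append((channel, sku, reported, expected))
    return corrections
```

Logging the `corrections` list per run is also your monitoring signal: if the correction rate climbs past the usual few percent, the event stream is dropping more than it should.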
The result
We run this architecture across 34 marketplace integrations at Nventory — Shopify, Amazon, Flipkart, WooCommerce, TikTok Shop, and more. Sub-10 second propagation under normal load. The reconciliation layer handles the edge cases the event stream misses.
If you're evaluating platforms for multichannel ecommerce operations, you can see the full list of integrations and pricing starting from $25/month here.
The pattern isn't ecommerce-specific. It's the same distributed state problem that shows up in any system where the same data lives in multiple independent stores simultaneously. The specific mechanics are ecommerce-flavored. The underlying architecture is just distributed systems.
If you're building something similar or have hit different failure modes, I'd genuinely like to compare notes in the comments.