We Replaced Redis with MySQL SKIP LOCKED for Inventory Reservation — Oversells Went to Zero

#mysql #redis #database #architecture

For two years, our Sponsored Placements service booked limited ad inventory through Redis: a counter in Redis, a Redlock around the decrement, and a TTL key per hold.

It oversold. Not catastrophically — consistently. 40–60 double-booked placements a month, each one a manual refund and an apology email to an advertiser.

The root cause was never one bug. It was the architecture: two sources of truth that could not be made atomic with each other. The count lived in Redis; the ownership lived in SQL. No transaction spans both. The Redlock only ever protected the Redis half.

The one mental shift

SKIP LOCKED turns a contended table into a concurrent work queue. Instead of every request fighting over one counter, each request grabs different rows and ignores the ones someone else is holding.

FOR UPDATE alone serializes — that's the experience that scares people off SQL locking. FOR UPDATE SKIP LOCKED is the opposite: a transaction that would have blocked instead skips the locked row and takes the next free one.

One row per reservable unit, then:

START TRANSACTION;

SELECT id
FROM inventory_unit
WHERE placement_id = 42
  AND (status = 'available'
       OR (status = 'held' AND hold_expires_at < NOW(3)))  -- self-healing expiry
ORDER BY id
LIMIT 2
FOR UPDATE SKIP LOCKED;   -- the whole trick

UPDATE inventory_unit
SET status = 'held', reservation_id = 'uuid', hold_expires_at = NOW(3) + INTERVAL 10 MINUTE
WHERE id IN (1107, 1108);

INSERT INTO reservation (...) VALUES (...);

COMMIT;

Two concurrent requests for the same pool lock different rows. Neither waits. The claim, the hold, and the reservation are one transaction — there is nothing to reconcile because there is nothing else.

The numbers (8 weeks before vs 8 weeks after)

Metric	Redis + Redlock	MySQL SKIP LOCKED
Oversells / month	40–60	0
Reservation p95	210 ms	34 ms
Reservation p99	540 ms	61 ms
Throughput / instance	~600 RPS	1,400 RPS
Lock-wait timeouts / day	~900	<5
Nightly reconciliation	9–14 min	deleted
Redis cluster	3 nodes	decommissioned

What made it work (the short version)

One row per unit, not a counter. A single counter row + FOR UPDATE is correct but serial — we measured the cliff at ~600 RPS.
Self-healing expiry. The claim query also picks up held rows past hold_expires_at, so correctness never depends on a sweeper running on time. (Redis TTL loses holds on failover — async replication.)
ORDER BY id + retry on 1213/1205. Deterministic lock order nearly closes the deadlock window; a 3-attempt retry handles the rest. <2 deadlocks/day, all invisible to users.
READ COMMITTED, not REPEATABLE READ. Gap locks under the MySQL default widened contention — switching cut deadlocks ~70% on its own.
A unique index as the backstop. Even if app logic is wrong, the database refuses to record the same unit sold twice.

When NOT to do this

Row-per-unit explodes for fungible, high-cardinality stock (5M identical SKUs → use a guarded UPDATE ... WHERE available >= qty instead). Flash-sale "1 item, 100k people" still wants a queue in front. And we kept Redis — for caching browse-page counts, where it belongs. Redis for speed, MySQL for truth, never the two confused.

The full write-up has the complete before/after C# handlers, the failover timeline that used to oversell, the index/EXPLAIN work, pool sizing, and a pre-merge checklist:

👉 How We Replaced Redis with MySQL SKIP LOCKED for Inventory Reservation at Scale