There is a moment, somewhere in the design of almost every backend system that mixes Redis and Postgres, where an engineer makes a decision that feels obviously correct and is actually subtly wrong.
The decision looks like this: Redis is fast, Postgres is durable. Use Redis to track inventory — it can handle the load. Persist the important stuff to Postgres.
It feels right because both halves are true. Redis is fast. Postgres is durable. The mistake is not in the premises. The mistake is in the conclusion — that these two systems can jointly own authoritative state.
They cannot. Not because of a tooling limitation you can engineer around. Because of a fundamental property of distributed systems that no amount of clever code eliminates.
This article is about that property, why it matters, and what the correct mental model looks like. It is grounded in a real system I built: FairQueue, a virtual queue and inventory allocation engine for high-demand live events in the Nigerian market — the kind of system that has to survive 50,000 people trying to buy 5,000 tickets at exactly the same second.
## The Problem Space
Picture Detty December in Lagos. A Burna Boy concert. 5,000 tickets. The sale opens at noon.
By 12:00:00.003, your server is receiving more concurrent requests than it has ever seen. Every one of those requests wants the same thing: a ticket. Most of them will be disappointed. Your job is to make sure exactly 5,000 of them succeed — no more, no less — and that payment for each of those 5,000 is correctly recorded.
Overselling is not a minor bug. It means you charged someone for a ticket that does not exist. Silent inventory loss means someone got a ticket and you have no payment record. Both outcomes end careers and companies.
The thundering herd problem is well understood. The less-discussed problem is what happens to your data model when you try to handle it.
## The Intuitive Architecture (And Why It Breaks)
The most natural response to high read/write volume on a shared counter is: put it in Redis. Redis is single-threaded. Operations are atomic. A DECR command cannot race with another DECR the way a Postgres UPDATE can under concurrent load without explicit locking. This reasoning is sound as far as it goes.
So the intuitive architecture emerges:

- Redis holds `inventory:{event_id}` — the live ticket count
- Postgres holds orders, claims, payments — the durable record
- The flow: check Redis, decrement Redis, write to Postgres
Here is what that looks like in code:
```go
// Check inventory
count, _ := redis.Get(ctx, "inventory:event-123").Int64()
if count <= 0 {
    return ErrSoldOut
}

// Decrement
redis.Decr(ctx, "inventory:event-123")

// Persist
db.Exec(ctx, "INSERT INTO claims (...) VALUES (...)")
```
This code has a bug. The bug is not in any single line. The bug is in the model — in the assumption that these three operations form a coherent unit.
They do not. They are three separate operations across two separate systems. No transaction spans them. Between any two of those lines, the process can crash, the network can partition, the Redis instance can restart. Each of those events produces a different kind of corruption.
Let us be precise about what each failure looks like.
## The Four Failure Windows

### Window 1: Between check and decrement
You read the inventory: 1 ticket left. Before you decrement, another request reads the same count. Both see 1. Both decrement. Both insert into Postgres. You have sold the same ticket twice.
This is a classic time-of-check/time-of-use (TOCTOU) race. It is solvable — Redis Lua scripts can make the check-and-decrement atomic. But solving this window does not close the others.
### Window 2: Between Redis decrement and Postgres insert
You atomically decrement Redis to 0. Before you insert into Postgres, the process crashes — OOM kill, deployment, hardware fault, power failure. It does not matter why.
Redis says 0 tickets remain. Postgres has no claim record. The ticket has vanished. A real person paid — or was about to pay — and there is no recoverable record of their claim.
### Window 3: Between Postgres insert and Redis decrement
You reverse the order — Postgres first, Redis second. The Postgres insert commits. Before you decrement Redis, the process crashes.
Now Redis shows 1 ticket remaining. Postgres has a committed claim. The next request that checks Redis will be told a ticket is available when none is. You may oversell.
### Window 4: Redis restart
Your Redis instance restarts. The inventory key evaporates. All the careful decrements you performed are gone. Redis now reports the key as missing — which your code interprets as "full inventory available" — and suddenly every ticket is available again, even the ones already claimed and paid for.
## Why You Cannot Fix This With More Code
The instinct at this point is to reach for compensating mechanisms. Retry logic. Distributed transactions. Two-phase commit. Sagas.
These approaches are real and useful in the right contexts. They do not fix the fundamental problem here, because the fundamental problem is not a missing feature. It is a property of the environment.
Martin Kleppmann puts it clearly in Designing Data-Intensive Applications: the dual-write problem is not solved by making writes faster or retries smarter. It is solved by choosing one system to be the source of truth and treating all other systems as derived state.
The moment you split authoritative state across Redis and Postgres — the moment both systems are required to agree for your data to be correct — you have created a consistency problem that lives in the gap between them. That gap cannot be closed. It can only be made smaller (with enough engineering complexity) or eliminated (by removing the split).
There is no atomic operation that spans two storage systems. That is not a Redis limitation or a Postgres limitation. It is a consequence of the fact that they are separate processes, on separate machines, with separate failure modes.
Every approach that tries to compensate for this — writing to both, reconciling differences, detecting divergence — is acknowledging the problem and managing it, not solving it. Management has a cost: operational complexity, latency, edge cases, and the ever-present risk that your compensation logic has its own bugs.
The simpler answer is to not create the split.
## The Correct Mental Model: One Truth, One Cache
The model that actually works is this:
**Postgres is the single source of truth. Redis is a performance layer. Redis holds nothing that cannot be reconstructed from Postgres.**
This sounds like a constraint. It is actually a simplification. When Redis holds only reconstructible state, every failure mode has a clean answer: reconstruct from Postgres.
The ordering rule that follows from this model is strict:
**Postgres is always written first. Redis is always written second. Never the reverse.**
This rule is asymmetric by design. Violating it in one direction (Redis first, Postgres second) creates the possibility of a Redis state that Postgres cannot recover — an authoritative count with no corresponding record. That is the failure mode that oversells tickets.
Violating it in the other direction (Postgres first, Redis second) means a process crash between the two writes leaves Redis showing more inventory than actually exists. This is inflation — Redis is too generous. It is wrong, but it is recoverable. The next reconciliation pass reads the authoritative Postgres count and corrects Redis. No customer was incorrectly turned away. No ticket was oversold.
Choosing between these two failure modes is not splitting hairs. Temporary inflation that heals automatically is categorically different from silent overselling that requires manual intervention. One is a known, bounded failure. The other is a correctness violation.
## What This Looks Like in FairQueue
FairQueue's inventory flow is built entirely around this model. Here is the actual path a claim request takes:
```
Claim request arrives
        │
        ▼
Redis SET NX lock acquired?          ← Layer 1: prevent concurrent claims for same customer
    No  → return ErrAlreadyClaimed
    Yes → continue
        │
        ▼
Redis Lua: DECRBY inventory if > 0   ← Atomic check-and-decrement
    -2 (sold out)   → return ErrEventSoldOut
    -1 (cache miss) → fall back to Postgres count, then retry
    ≥ 0 (success)   → continue
        │
        ▼
Postgres INSERT claim                ← Source of truth write
    unique violation → rollback Redis decrement, return ErrAlreadyClaimed
    success          → claim created ✓
        │
        ▼
Release lock
```
Several things in this flow are worth examining closely.
The Lua script is not the correctness guarantee. It is a performance optimisation. It prevents most concurrent claims from reaching Postgres at all, which reduces contention. But if Redis is unavailable, if the Lua script has a bug, if the lock fails — the Postgres unique constraint on (customer_id, event_id) is still there. That constraint is the inviolable correctness guarantee. Two rows cannot be inserted for the same customer and the same event. The database enforces this atomically, regardless of what happened in Redis.
This is the two-layer concurrency shield: Redis is the cheap doorman that turns away most concurrent attempts before they reach the database. Postgres is the last line of defence that holds even if the doorman is asleep.
The rollback on Postgres failure is explicit. If the Postgres insert fails after the Redis decrement succeeds, the code immediately increments Redis back. This is a best-effort compensation — if the increment also fails, the reconciliation worker will correct the divergence on its next tick. The failure is bounded and self-healing.
The cache miss path falls back to Postgres. When Redis does not have the inventory key — because it restarted, because it was never set, because the key expired — the code reads the authoritative count from Postgres and retries the decrement. Redis is not required for correctness. It is required for performance.
## The Reconciliation Worker: Embracing Eventual Consistency
No matter how careful your write ordering is, Redis and Postgres will diverge. Process crashes, network blips, partial failures — these are not edge cases in production systems. They are normal operating conditions.
FairQueue has a reconciliation worker that runs every 30 seconds. Its job is mechanical: for every active event, derive the authoritative inventory count from Postgres (total_inventory - COUNT(active claims)), compare it to the Redis count, and force-sync if they differ.
```go
func (w *ReconciliationWorker) reconcileEvent(ctx context.Context, event *domain.Event) error {
	activeClaims, err := w.claims.CountActive(ctx, event.ID)
	if err != nil {
		return err
	}
	pgCount := int64(event.TotalInventory) - activeClaims

	redisCount, err := w.inventory.GetCount(ctx, event.ID)
	if err != nil {
		return err
	}
	if redisCount == pgCount {
		return nil // stores agree; nothing to heal
	}

	w.logger.Warn("inventory divergence detected, healing",
		"event_id", event.ID,
		"postgres_count", pgCount,
		"redis_count", redisCount,
	)
	return w.inventory.ForceSync(ctx, event.ID, pgCount)
}
```
This worker does not make the system eventually consistent in the casual, hand-wavy sense. It makes the system intentionally eventually consistent with a bounded heal window. The maximum time Redis can be wrong is 30 seconds, and the direction of that wrongness (inflation, not deflation) is controlled.
The worker also handles Redis restarts entirely. When Redis comes back empty, the next reconciliation tick finds every event with a missing or zero inventory key and rebuilds them from Postgres. No manual intervention. No data loss. The system heals itself.
## The Broader Principle
The dual-write problem is one instance of a more general principle: every distributed system design decision is actually a choice between failure modes, not a choice between correctness and incorrectness.
There is no architecture that eliminates failure. There are only architectures that choose which failures are acceptable, how long they last, and whether they are recoverable.
The engineers who get this wrong are not making careless mistakes. They are often making locally reasonable decisions — Redis is fast, Postgres is slow, put the hot path in Redis — without tracking the global consequence: that splitting authoritative state across systems creates a consistency gap, and that gap will be exercised in production.
The question to ask when designing a system like this is not "what happens when everything works?" It is "what happens when the process dies between these two lines of code?" And then: "is that failure mode acceptable?"
For FairQueue, the acceptable failure mode is: Redis briefly shows more inventory than exists, a reconciliation worker corrects it within 30 seconds, and no customer is permanently locked out. The unacceptable failure mode is: a ticket is sold that does not exist, or a payment is charged with no record.
Choosing the right failure mode and designing around it deliberately is what separates systems that survive production from systems that produce incident reports.
## What FairQueue Ended Up With
For reference, the final architecture that came out of this reasoning:
| Concern | System | Rationale |
|---|---|---|
| Inventory count | Redis (cache) | Performance — absorbs concurrent reads |
| Inventory truth | Postgres (derived) | total - COUNT(active claims) |
| Claim record | Postgres | Source of truth, unique constraint |
| Concurrency shield | Redis SET NX + Postgres unique index | Two layers; neither alone is sufficient |
| Queue position | Redis ZSET | Reconstructible from Postgres on restart |
| Payment record | Postgres | Outbox pattern; written before gateway call |
| Divergence healing | Reconciliation worker | Runs every 30s; force-syncs from Postgres |
Redis handles roughly 50,000 concurrent queue joins at O(log N) per operation without touching Postgres. Postgres handles claim inserts with a unique constraint that makes overselling physically impossible. The reconciliation worker makes the system self-healing under any single-component failure.
The system never requires Redis and Postgres to agree atomically, because it never splits authoritative state between them. Redis is always derived. Postgres is always truth. The failure modes are chosen, bounded, and recoverable.
## Closing
If you are building a system that mixes Redis and Postgres — and most production backends do — the question worth sitting with is: which system owns the truth?
Not "which system is faster" or "which system is more durable." Those are properties of the systems. The question is about your data model: when Postgres and Redis disagree, which one wins?
If the answer is not immediately obvious, you may have accidentally split your source of truth. That split will find you eventually. It tends to find you at the worst possible time — when load is highest, when the stakes are real, when the Detty December concert just went on sale.
Choose one system to own the truth. Let the other be fast. Design your failure modes deliberately. The system will be simpler, more debuggable, and more survivable for it.
FairQueue is open source. The full implementation — including the Lua scripts, reconciliation worker, and integration tests — is available on GitHub.