You commit a payment to your database. Then you publish an event to Kafka so downstream services can settle it. Both succeed — until one day the process crashes in the 3 milliseconds between those two operations.
The database says the payment happened. Kafka never heard about it. The settlement worker never ran. The customer was charged and nothing moved.
That's the dual-write problem. This post explains why it's unsolvable with the obvious approaches, and how the Outbox pattern fixes it properly — using an implementation I built and load-tested to 1,000 concurrent users with zero duplicate charges.
## Why the Obvious Solutions Don't Work
"Just publish to Kafka first, then write to the DB."
Same problem, reversed. The event fires but the payment row never gets written. Your downstream consumers process a payment that your database has no record of.
"Use a transaction that wraps both."
You can't. A database transaction and a Kafka publish are two entirely separate systems. PostgreSQL has no knowledge of Kafka. There is no COMMIT that covers both. The moment you step outside your DB transaction to call producer.send(), you're in crash territory.
"Use Two-Phase Commit (2PC)."
Kafka doesn't support it. And even in systems where both sides support 2PC, you're introducing a coordinator as a single point of failure with significantly higher latency. This is why 2PC has largely been abandoned in modern distributed systems in favour of patterns like the Outbox.
## The Crash Window Nobody Talks About
Here's the exact sequence that fails silently:
```
1. BEGIN transaction
2. INSERT INTO payments (status = 'PENDING')   ← DB write
3. COMMIT                                      ← success
4.                                             ← 💥 process crashes here
5. producer.send('payment.initiated', ...)     ← never reached
```
Step 4 is real. Network blips, OOM kills, deploys — any of these can fire between steps 3 and 5. The window is tiny, but at scale it closes eventually.
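To make the window concrete, here's a minimal, runnable simulation of the naive dual-write. This is an illustrative sketch, not the repo's code: sqlite3 stands in for PostgreSQL and a plain list stands in for Kafka.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT)")
kafka = []  # stand-in for a real Kafka topic


def create_payment(payment_id, crash_before_publish=False):
    with db:  # transaction commits on normal exit — this is step 3
        db.execute("INSERT INTO payments VALUES (?, 'PENDING')", (payment_id,))
    if crash_before_publish:
        # step 4: the process dies between COMMIT and producer.send()
        raise RuntimeError("process killed between COMMIT and publish")
    kafka.append({"event": "payment.initiated", "payment_id": payment_id})


try:
    create_payment("pay_123", crash_before_publish=True)
except RuntimeError:
    pass

rows = db.execute("SELECT id FROM payments").fetchall()
print(rows)   # [('pay_123',)] — the DB says the payment happened
print(kafka)  # []             — Kafka never heard about it
```

The commit survives the crash; the publish never happens. That divergence is the whole problem.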
## The Outbox Pattern
The fix is to stop writing to two systems. Write to one.
Instead of publishing directly to Kafka, you write the event as a row in an outbox_events table — inside the same database transaction as the payment row. A separate background poller reads from that table and publishes to Kafka.
```
1. BEGIN transaction
2. INSERT INTO payments (status = 'PENDING')
3. INSERT INTO outbox_events (event = 'payment.initiated', published_at = NULL)
4. COMMIT   ← both rows land atomically
```
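A sketch of that transaction in code — sqlite3 keeps it self-contained here, while the real system uses PostgreSQL via async SQLAlchemy, and the column names are illustrative:

```python
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT)")
db.execute("""CREATE TABLE outbox_events (
    id TEXT PRIMARY KEY, event TEXT, payload TEXT, published_at TEXT)""")


def create_payment(payment_id, amount):
    # One transaction: either both rows land, or neither does.
    with db:
        db.execute("INSERT INTO payments VALUES (?, 'PENDING')", (payment_id,))
        db.execute(
            "INSERT INTO outbox_events VALUES (?, 'payment.initiated', ?, NULL)",
            (str(uuid.uuid4()),
             json.dumps({"payment_id": payment_id, "amount": amount})),
        )


create_payment("pay_123", 4200)
pending = db.execute(
    "SELECT COUNT(*) FROM outbox_events WHERE published_at IS NULL"
).fetchone()[0]
print(pending)  # 1 — the event is durably queued, waiting for the poller
```

There is no Kafka call anywhere in the request path; the event is just another row.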
Now the Kafka publish is handled by the poller:
```
OUTBOX POLLER → SELECT * FROM outbox_events WHERE published_at IS NULL
              → producer.send(event)
              → UPDATE outbox_events SET published_at = NOW()
```
If the poller crashes after publishing but before marking the row, it simply replays on restart — Kafka receives a duplicate, which you handle with a deterministic event ID (more on this below). The payment row is never orphaned because the event was committed to the database first.
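A minimal poller loop might look like the following. The sqlite3 database and the stub `publish` function are stand-ins for illustration; the real poller targets PostgreSQL and a Kafka producer.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE outbox_events "
           "(id INTEGER PRIMARY KEY, event TEXT, published_at TEXT)")
db.execute("INSERT INTO outbox_events (event, published_at) "
           "VALUES ('payment.initiated', NULL)")
db.commit()

published = []


def publish(event):
    published.append(event)  # stand-in for producer.send()


def poll_once():
    rows = db.execute(
        "SELECT id, event FROM outbox_events WHERE published_at IS NULL "
        "ORDER BY id LIMIT 100"
    ).fetchall()
    for row_id, event in rows:
        publish(event)  # a crash right here means a replay → a duplicate,
        with db:        # which the consumer deduplicates downstream
            db.execute(
                "UPDATE outbox_events SET published_at = datetime('now') "
                "WHERE id = ?", (row_id,))
    return len(rows)


first = poll_once()
second = poll_once()
print(first, second)  # 1 0 — one event published and marked, then nothing left
```

The ordering is the important part: publish first, mark second. Flipping it would trade duplicates for lost events, which is the wrong trade.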
The full flow in my implementation looks like this:
```
CLIENT → POST /payments + Idempotency-Key: <uuid>
   │
   ▼
┌─ Redis cache check ──── HIT → return stored response (no DB touch)
├─ Distributed lock ───── prevents concurrent duplicate requests
├─ DB transaction ─────── Payment row + OutboxEvent row (atomic)
└─ Cache response, release lock → 202 Accepted

OUTBOX POLLER → polls outbox_events WHERE published_at IS NULL → Kafka

KAFKA → SETTLEMENT WORKER
   ├─ PENDING → PROCESSING → SETTLED / FAILED
   ├─ Exponential backoff, max 5 retries
   └─ Dead Letter Queue on exhaustion
```
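The settlement worker's retry policy can be sketched as pure logic. Note that `MAX_RETRIES = 5` matches the diagram, but `BASE_DELAY_S` and the function names here are illustrative assumptions, not the repo's exact configuration:

```python
MAX_RETRIES = 5
BASE_DELAY_S = 1.0


def backoff_delay(attempt):
    """Exponential backoff: 1s, 2s, 4s, 8s, 16s for attempts 0..4."""
    return BASE_DELAY_S * (2 ** attempt)


def settle_with_retries(settle, event, dead_letter_queue):
    for attempt in range(MAX_RETRIES):
        try:
            return settle(event)
        except Exception:
            delay = backoff_delay(attempt)
            # In production: sleep or reschedule the message for `delay` seconds.
    dead_letter_queue.append(event)  # retries exhausted → park it in the DLQ
    return None


dlq = []


def always_fails(event):
    raise ConnectionError("settlement provider down")


settle_with_retries(always_fails, {"payment_id": "pay_123"}, dlq)
print(dlq)  # [{'payment_id': 'pay_123'}] — failed event preserved, not lost
```

The DLQ matters because it caps retries without losing data: a human (or a replay job) can inspect and re-drive parked events later.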
## Handling the At-Least-Once Delivery Problem
The outbox poller delivers at-least-once to Kafka — meaning duplicate events are possible on replay. The settlement worker handles this with deterministic UUID5 event IDs:
```python
import uuid

event_id = uuid.uuid5(
    uuid.NAMESPACE_URL,
    f"{topic}:{partition}:{offset}",
)
```
The same `topic:partition:offset` always produces the same UUID. On replay, the deduplication check is a no-op — it sees the event ID already in `processed_events` and skips it. No double processing, no complex coordination.
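The property doing the work is that UUID5 is a pure function of its input. Here's a runnable sketch of the consumer-side check, with a Python set standing in for the `processed_events` table and a hypothetical `handle` function for illustration:

```python
import uuid

processed_events = set()  # in the real system: a processed_events table


def handle(topic, partition, offset, settle):
    # Same topic:partition:offset → same UUID, on every replay.
    event_id = uuid.uuid5(uuid.NAMESPACE_URL, f"{topic}:{partition}:{offset}")
    if event_id in processed_events:
        return "skipped"  # replay: the dedup check is a no-op
    settle()
    processed_events.add(event_id)
    return "processed"


charges = []
first = handle("payments", 0, 42, lambda: charges.append("charge"))
replay = handle("payments", 0, 42, lambda: charges.append("charge"))
print(first, replay, len(charges))  # processed skipped 1
```

One delivery settles the payment; the replayed delivery is recognized and dropped, so the customer is charged exactly once even though Kafka delivered twice.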
## Does It Actually Work?
I ran two load test scenarios with Locust against a single Docker container:
| Scenario | Concurrent Users | Total Requests | Duplicate Charges |
|---|---|---|---|
| Normal load | 50 | 1,378 | 0 |
| Stress test | 1,000 | 12,746 | 0 |
Correctness held at 0% duplicate charges through both. The 0.4% error rate at 1,000 users was connection pool exhaustion — not an idempotency failure. Every retry with the same idempotency key returned the identical payment_id.
## What the Outbox Pattern Trades Off
Nothing is free. The outbox poller introduces a small delay — typically 1–5 seconds — between a payment being committed and its event reaching Kafka. For most use cases this is acceptable. For real-time fraud scoring that needs to act on the event immediately, it isn't, and you'd need a different approach.
The poller also needs to be a reliable background process. If it stops running silently, your outbox table grows and events stall. Monitoring queue depth is not optional.
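One cheap health signal is the count of unpublished rows older than some threshold: if it stays above zero, the poller has stalled. A sketch of the check (sqlite3 date syntax here; the equivalent PostgreSQL predicate would be `created_at < NOW() - INTERVAL '60 seconds'`):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE outbox_events (
    id INTEGER PRIMARY KEY, event TEXT,
    created_at TEXT DEFAULT (datetime('now')), published_at TEXT)""")
# One stale unpublished event, one already-published event.
db.execute("INSERT INTO outbox_events (event, created_at, published_at) "
           "VALUES ('payment.initiated', datetime('now', '-5 minutes'), NULL)")
db.execute("INSERT INTO outbox_events (event, published_at) "
           "VALUES ('payment.initiated', datetime('now'))")
db.commit()

stalled = db.execute(
    "SELECT COUNT(*) FROM outbox_events "
    "WHERE published_at IS NULL "
    "AND created_at < datetime('now', '-60 seconds')"
).fetchone()[0]
print(stalled)  # 1 — alert if this number stays above zero
```

Exporting that count as a gauge metric turns a silent poller death into a page instead of a slowly growing table nobody notices.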
## The One-Sentence Summary
The Outbox pattern solves the dual-write problem by making the event a database record first and delegating the Kafka publish to a separate, restartable poller — so you never write to two systems atomically, you write to one.
Full source code, DESIGN.md, and load test results: https://github.com/macaulaypraise/idempotent-payment-processing-system.git
Stack: Python 3.12 · FastAPI · PostgreSQL 15 · Redis 7 · Kafka · SQLAlchemy (async) · Docker Compose