Event-Driven Architecture: The Dual Write Problem and How to Solve It

#microservices #architecture #eventdriven #distributedsystems

You have a well-designed order service. It writes to the database and publishes an event to Kafka. Clean, decoupled, event-driven. Then Kafka has a brief network hiccup. The database write succeeds. The event publish fails. The order exists. Fulfillment never hears about it. No alert fires. Just a quietly broken order going nowhere.

This is the dual write problem — an architectural correctness problem that exists the moment you write to two separate systems without a coordination mechanism.

The Problem

A dual write occurs when your application writes to two separate systems as part of a single logical operation without atomicity across both. The dangerous failure modes are silent — the HTTP response returns 200, the client gets a success, and nothing downstream happens.
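
The silent failure described above can be reproduced in a few lines. This is a minimal sketch, not production code: it assumes SQLite as a stand-in for the service database, and `flaky_publish`, `BrokerUnavailable`, and `place_order_naive` are hypothetical names invented for illustration rather than part of any Kafka client.

```python
import sqlite3


class BrokerUnavailable(Exception):
    """Raised by the stand-in publisher to simulate a broker outage."""


def flaky_publish(topic: str, event: dict) -> None:
    # Hypothetical stand-in for a Kafka producer call; it always fails here.
    raise BrokerUnavailable(f"cannot reach broker for topic {topic!r}")


def place_order_naive(db: sqlite3.Connection, order_id: str, publish) -> int:
    # Write 1: the order row is committed on its own.
    with db:
        db.execute("INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,))
    # Write 2: the publish happens after the commit and is not covered by it.
    try:
        publish("orders", {"type": "OrderPlaced", "order_id": order_id})
    except BrokerUnavailable:
        pass  # swallowed or merely logged: the caller still sees success
    return 200


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
status = place_order_naive(db, "order-1", flaky_publish)
orders_in_db = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(status, orders_in_db)  # the order exists, but no event was ever published
```

The caller sees a success status and the row is committed, yet nothing reached the broker and nothing in the code path records that fact.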

The naive fixes don't work:

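Try/catch with retry: still loses the event if the process crashes before a retry succeeds, and a retry after a timed-out publish can produce duplicates, so consumers must be idempotent anyway
Publish first, then write to the DB: just reverses the failure mode, so an event can announce an order that was never saved
Distributed transactions (2PC): sacrifice availability and introduce distributed locking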

The real solution: reduce to a single atomic write and derive the event from it.

Solution 1: Transactional Outbox Pattern

Write the event as a row in an outbox table in the same database transaction as your business data. A separate relay process reads from the outbox and publishes to the broker.

Both writes succeed or fail together (single DB transaction)
Relay publishes each pending message, then marks it as published; a crash between the two steps causes a re-publish
Guarantees at-least-once delivery — consumers must be idempotent

Best for: greenfield services, full control over event schema, teams wanting simplicity.
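
A minimal sketch of the outbox write and a polling relay, assuming SQLite in place of the service database and a plain callable in place of a broker client. The table layout and the names `place_order` and `relay_once` are illustrative assumptions, not the implementation from the full article.

```python
import json
import sqlite3
import uuid


def create_schema(db: sqlite3.Connection) -> None:
    db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    db.execute(
        "CREATE TABLE outbox ("
        " id TEXT PRIMARY KEY,"
        " topic TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )


def place_order(db: sqlite3.Connection, order_id: str) -> None:
    event = {"event_id": str(uuid.uuid4()), "type": "OrderPlaced", "order_id": order_id}
    # One transaction: both rows commit together or neither does.
    with db:
        db.execute("INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,))
        db.execute(
            "INSERT INTO outbox (id, topic, payload) VALUES (?, ?, ?)",
            (event["event_id"], "orders", json.dumps(event)),
        )


def relay_once(db: sqlite3.Connection, publish) -> int:
    """Publish pending outbox rows in insertion order; return how many were sent."""
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY rowid"
    ).fetchall()
    sent = 0
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))  # if this raises, the row stays pending
        # A crash between the publish and this update re-sends the row on the
        # next pass, which is why delivery is at-least-once, not exactly-once.
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent


db = sqlite3.connect(":memory:")
create_schema(db)
place_order(db, "order-1")

delivered = []
sent_first = relay_once(db, lambda topic, event: delivered.append((topic, event)))
sent_second = relay_once(db, lambda topic, event: delivered.append((topic, event)))
print(sent_first, sent_second, delivered[0][1]["type"])
```

The second relay pass finds nothing to send because the row was marked as published. In a real deployment the relay would run on a schedule or tail the log, and the publish callable would wrap an actual producer.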

Solution 2: Change Data Capture (Debezium)

Read directly from the database's transaction log (WAL/binlog). Every committed write is captured and streamed to Kafka automatically. No application code changes required.

Typically sub-second publish latency (WAL-based, no polling)
Captures every committed row change, including changes made by DB migrations and admin tools
Requires infrastructure for Kafka Connect + Debezium

Best for: legacy systems, high-throughput services, capturing all state changes without code modification.
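
CDC emits row-level changes rather than domain events, so a consumer usually translates one into the other. The sketch below assumes the standard Debezium change-event envelope (`before`, `after`, `op`) delivered as JSON, either bare or nested under `payload` depending on converter settings. The domain event names and `to_domain_event` are hypothetical.

```python
import json
from typing import Optional


def to_domain_event(raw_value: Optional[bytes]) -> Optional[dict]:
    """Map one Debezium change-event value to a hypothetical domain event."""
    if raw_value is None:
        return None  # tombstone record that can follow a delete
    message = json.loads(raw_value)
    # With the JSON converter and schemas enabled, the change sits under "payload".
    change = message.get("payload", message)
    op = change["op"]  # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    if op == "c":
        return {"type": "OrderPlaced", "order": change["after"]}
    if op == "u":
        return {"type": "OrderChanged", "before": change["before"], "order": change["after"]}
    if op == "d":
        return {"type": "OrderRemoved", "order": change["before"]}
    return None  # snapshot reads and other operation codes are ignored in this sketch


sample = json.dumps({
    "payload": {
        "before": None,
        "after": {"id": "order-1", "status": "placed"},
        "source": {"table": "orders"},
        "op": "c",
        "ts_ms": 1700000000000,
    }
}).encode()

domain_event = to_domain_event(sample)
print(domain_event)
```

Whether `before` is populated on updates depends on the database configuration, so a real consumer should not assume it is always present.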

Solution 3: Event Sourcing

The event log is the source of truth. The database is a derived projection. There is no dual write because there is only one write — appending events to the event store.

Eliminates the dual write at the source of truth; projections and broker messages are derived from the stored events
Introduces significant complexity (schema versioning, aggregate rehydration, eventual consistency)

Best for: domains where history of state changes matters (financial systems, audit-heavy domains).
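
To make the single-write idea concrete, here is a small sketch of an append-only stream with aggregate rehydration. It assumes an in-memory store and two made-up event types; `InMemoryEventStore`, `Order`, and `rehydrate` are illustrative names, and a real event store would persist the streams and handle schema versioning.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    """Current state, derived purely by folding over past events."""
    order_id: str = ""
    status: str = "new"
    version: int = 0

    def apply(self, event: dict) -> None:
        if event["type"] == "OrderPlaced":
            self.order_id = event["order_id"]
            self.status = "placed"
        elif event["type"] == "OrderShipped":
            self.status = "shipped"
        self.version += 1


@dataclass
class InMemoryEventStore:
    """Hypothetical append-only store; a real one would persist the streams."""
    streams: dict = field(default_factory=dict)

    def append(self, stream_id: str, event: dict, expected_version: int) -> None:
        stream = self.streams.setdefault(stream_id, [])
        # Optimistic concurrency: reject the write if someone appended first.
        if len(stream) != expected_version:
            raise RuntimeError("concurrent modification of " + stream_id)
        stream.append(event)  # the single write: no separate state table to update

    def load(self, stream_id: str) -> list:
        return list(self.streams.get(stream_id, []))


def rehydrate(store: InMemoryEventStore, stream_id: str) -> Order:
    order = Order()
    for event in store.load(stream_id):
        order.apply(event)
    return order


store = InMemoryEventStore()
store.append("order-1", {"type": "OrderPlaced", "order_id": "order-1"}, expected_version=0)
store.append("order-1", {"type": "OrderShipped"}, expected_version=1)

order = rehydrate(store, "order-1")
print(order.status, order.version)  # shipped 2
```

The expected-version check is one common way to keep concurrent writers from interleaving events on the same stream; it is a design choice of this sketch, not something the pattern mandates.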

Operational Non-Negotiables

Consumer idempotency: at-least-once delivery means duplicates will arrive. Deduplicate on event ID, ideally recording the ID in the same transaction as the side effect.
Outbox housekeeping: purge published messages on a schedule; don't let the table grow unbounded.
Replication slot monitoring: for CDC on PostgreSQL, a stuck connector stops its replication slot from advancing, so WAL accumulates until the disk fills.
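
The first two items can be sketched as follows, assuming SQLite as a stand-in database. The `processed_events`, `shipments`, and `outbox` tables, the `published_at` column, and the function names are assumptions made for illustration. Timestamps are plain integers so the example stays deterministic.

```python
import sqlite3


def handle_once(db: sqlite3.Connection, event: dict) -> bool:
    """Apply an event at most once; return False when it is a duplicate."""
    try:
        with db:  # the dedupe marker and the side effect commit together
            db.execute("INSERT INTO processed_events (event_id) VALUES (?)", (event["event_id"],))
            db.execute("INSERT INTO shipments (order_id) VALUES (?)", (event["order_id"],))
    except sqlite3.IntegrityError:
        return False  # primary key clash: this event ID was already handled
    return True


def purge_outbox(db: sqlite3.Connection, keep_seconds: int, now: int) -> int:
    """Delete published outbox rows older than the retention window."""
    with db:
        cursor = db.execute(
            "DELETE FROM outbox WHERE published = 1 AND published_at < ?",
            (now - keep_seconds,),
        )
    return cursor.rowcount


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE shipments (order_id TEXT NOT NULL)")
db.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, published INTEGER NOT NULL, published_at INTEGER)"
)
db.execute("INSERT INTO outbox VALUES ('a', 1, 100), ('b', 1, 900), ('c', 0, NULL)")
db.commit()

event = {"event_id": "evt-1", "order_id": "order-1"}
first = handle_once(db, event)
second = handle_once(db, event)  # redelivery of the same event
shipments = db.execute("SELECT COUNT(*) FROM shipments").fetchone()[0]
purged = purge_outbox(db, keep_seconds=300, now=1000)
print(first, second, shipments, purged)
```

Only the old published row is purged; the recent one and the still-pending one remain, so the relay never loses unpublished work.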

Read the Full Article

This is a summary of my deep dive into the dual write problem. The full article covers all three solutions with production implementation examples:

👉 The Dual Write Problem and How to Solve It — Full Article

The full article includes:

Four failure scenarios with a dual write matrix
Transactional Outbox Pattern implementation (.NET with EF Core)
Polling relay vs log-tailing relay comparison
Debezium PostgreSQL connector configuration
Event Sourcing with aggregate pattern (C#)
Decision matrix for choosing between the three solutions
Operational concerns: housekeeping, replication slot monitoring, consumer idempotency
Production deployment checklist