Rajkiran

Posted on Jun 12

System Design - 14. Event-Driven Architecture: Event Sourcing, CQRS, and the Outbox Pattern Explained



Covers: Event Sourcing, CQRS, Outbox Pattern, Choreography vs Orchestration

The Bank That Never Stores a Balance

Here's something that surprises most engineers: many banking systems don't treat a balance stored in a database row as the source of truth for your account.

Instead, they store every transaction that ever happened — every deposit, withdrawal, transfer, fee — as an immutable event. Your "balance" is computed by replaying all those events.

Account 12345 events:
  2024-01-01: DEPOSIT +1000
  2024-01-05: WITHDRAWAL -200
  2024-01-10: DEPOSIT +500
  2024-01-15: WITHDRAWAL -150

Balance = 1000 - 200 + 500 - 150 = 1150

Why would anyone do this instead of just storing balance: 1150?

Because the event log gives you something a single number never can: a complete, immutable, auditable history of everything that ever happened. You can answer "what was my balance on January 8th?" You can detect fraud by analyzing transaction patterns. You can replay history to debug a discrepancy.
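To make the replay idea concrete, here is a minimal sketch that recomputes the balance from the four events listed above, including the "balance on January 8th" question. The tuple layout and the `balance_as_of` name are assumptions made for illustration, not a real banking API.

```python
from datetime import date

# The account's event log from the example above: (date, type, signed amount).
EVENTS = [
    (date(2024, 1, 1), "DEPOSIT", 1000),
    (date(2024, 1, 5), "WITHDRAWAL", -200),
    (date(2024, 1, 10), "DEPOSIT", 500),
    (date(2024, 1, 15), "WITHDRAWAL", -150),
]

def balance_as_of(events, as_of=None):
    """Sum the signed amounts of events dated on or before `as_of` (all events if None)."""
    return sum(
        amount
        for when, _type, amount in events
        if as_of is None or when <= as_of
    )

current = balance_as_of(EVENTS)                     # 1150
on_jan_8 = balance_as_of(EVENTS, date(2024, 1, 8))  # 800 (only the first two events count)
```

The same function answers both the "now" question and the historical one, which is exactly what a single stored number cannot do.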

This is event sourcing — and it's one piece of a broader architectural philosophy called event-driven architecture.

The Core Idea: Events as Facts

In traditional architecture, your database stores current state. An UPDATE statement overwrites the old value — it's gone forever.

In event-driven architecture, you store events — immutable facts about things that happened. State is derived from events, not stored directly (or stored as a cache of the derived state).

Traditional:
  UPDATE accounts SET balance = 1150 WHERE id = 12345
  (Previous balance 1300 is lost — no record of the $150 withdrawal)

Event-Driven:
  INSERT INTO events (account_id, type, amount, timestamp)
  VALUES (12345, 'WITHDRAWAL', -150, '2024-01-15T10:30:00Z')
  (The event is permanent. Balance is computed by replaying events.)

This single shift — from "store current state" to "store the history of changes" — unlocks several powerful patterns.
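As a rough illustration of that shift, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table layout follows the INSERT shown above; the `id` column, the helper names, and the time of day on the first three events are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " account_id INTEGER NOT NULL,"
    " type TEXT NOT NULL,"
    " amount INTEGER NOT NULL,"
    " timestamp TEXT NOT NULL)"
)

def append_event(account_id, event_type, amount, timestamp):
    # Events are only ever inserted, never updated or deleted.
    conn.execute(
        "INSERT INTO events (account_id, type, amount, timestamp) VALUES (?, ?, ?, ?)",
        (account_id, event_type, amount, timestamp),
    )

def current_balance(account_id):
    # State is derived on demand from the full history.
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM events WHERE account_id = ?",
        (account_id,),
    ).fetchone()
    return row[0]

# Times of day on the first three events are made up for the example.
append_event(12345, "DEPOSIT", 1000, "2024-01-01T00:00:00Z")
append_event(12345, "WITHDRAWAL", -200, "2024-01-05T00:00:00Z")
append_event(12345, "DEPOSIT", 500, "2024-01-10T00:00:00Z")
append_event(12345, "WITHDRAWAL", -150, "2024-01-15T10:30:00Z")

balance = current_balance(12345)  # 1150
```

Nothing in this sketch ever issues an UPDATE: the $150 withdrawal stays on record, and the balance is just a query over the log.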

Event Sourcing: Store History, Derive State

Event Sourcing is the pattern of persisting all changes to application state as a sequence of events, and reconstructing current state by replaying those events.

How It Works

Event Store (append-only log):
┌────────────────────────────────────────────┐
│ OrderCreated   { order_id: 1, items: [...] }│
│ ItemAdded      { order_id: 1, item: "X" }   │
│ PaymentReceived{ order_id: 1, amount: 50 }  │
│ OrderShipped   { order_id: 1, carrier: "Y" }│
└────────────────────────────────────────────┘
            ↓ replay events in order
┌────────────────────────────────────────────┐
│ Current State:                              │
│ Order #1: items=[...], paid=true,           │
│           status="shipped"                  │
└────────────────────────────────────────────┘

To get the current state of Order #1, you replay all events for that order, applying each one in sequence.
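One way to sketch that replay is a pure `apply` function folded over the event list. The event dictionaries, the initial item `"A"`, and the state fields below are illustrative assumptions; the diagram above does not specify them.

```python
from functools import reduce

def apply(state, event):
    """Return the order state after one event, without mutating the previous state."""
    kind, data = event["type"], event["data"]
    if kind == "OrderCreated":
        return {"order_id": data["order_id"], "items": list(data["items"]),
                "paid": False, "status": "created"}
    if kind == "ItemAdded":
        return {**state, "items": state["items"] + [data["item"]]}
    if kind == "PaymentReceived":
        return {**state, "paid": True}
    if kind == "OrderShipped":
        return {**state, "status": "shipped", "carrier": data["carrier"]}
    raise ValueError(f"unknown event type: {kind}")

def replay(events):
    """Fold the events, in order, into the current state."""
    return reduce(apply, events, None)

order_events = [
    {"type": "OrderCreated", "data": {"order_id": 1, "items": ["A"]}},  # "A" is a placeholder item
    {"type": "ItemAdded", "data": {"order_id": 1, "item": "X"}},
    {"type": "PaymentReceived", "data": {"order_id": 1, "amount": 50}},
    {"type": "OrderShipped", "data": {"order_id": 1, "carrier": "Y"}},
]

order = replay(order_events)
```

Because `apply` never mutates its input, replaying a prefix of the list gives the state at any earlier point.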

Snapshots: Avoiding Replaying Everything

If an entity has 10,000 events (unlikely for an order, but plausible for a long-lived entity like a user account with years of activity), replaying all of them on every read is slow.

Snapshots solve this — periodically save the computed state, then only replay events since the snapshot:

Snapshot at event #9000: { state at that point }
                ↓
Replay events #9001 - #10000 (only 1000 events, not 10000)
                ↓
Current state
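The snapshot flow above can be sketched as follows. The snapshot shape (`version` as the number of events already folded in, plus the `state` at that point) and the toy "+1 per event" state are assumptions chosen to keep the example small.

```python
def apply(balance, event):
    return balance + event["amount"]

def load_state(events, snapshot=None):
    """Rebuild state from the latest snapshot plus only the events recorded after it."""
    if snapshot is None:
        state, start = 0, 0
    else:
        state, start = snapshot["state"], snapshot["version"]
    replayed = 0
    for event in events[start:]:
        state = apply(state, event)
        replayed += 1
    return state, replayed

events = [{"amount": 1} for _ in range(10_000)]  # 10,000 events of +1 each

# Without a snapshot: every event is replayed.
full_state, full_count = load_state(events)

# With a snapshot taken after event #9000: only events #9001-#10000 are replayed.
snapshot = {"version": 9000, "state": load_state(events[:9000])[0]}
snap_state, snap_count = load_state(events, snapshot)
```

Both paths arrive at the same state; the snapshot only changes how much work it takes to get there.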

Why Event Sourcing Is Powerful

1. Complete audit trail
Every change is recorded with who, what, and when, as long as each event carries that metadata. Critical for compliance (finance, healthcare).

2. Time travel debugging
"What did this order look like before the bug was introduced?" — replay events up to that point in time.

3. Temporal queries
"What was the user's subscription status on March 15th?" — replay events up to March 15th.

4. Multiple projections from one source
The same event stream can generate different "views" — a dashboard view, an analytics view, a search index — all derived independently from the same events.

The Costs

1. Complexity
Reconstructing state from events is more complex than SELECT * FROM table.

2. Schema evolution is hard
If your event format changes, you need to handle old event formats when replaying historical events.

3. Eventual consistency
Projections (derived views) may lag behind the event stream slightly.
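For the schema evolution cost, one common approach is "upcasting": converting old event formats to the latest one at read time, so the rest of the code only ever sees one shape. The sketch below is hypothetical: the `version` field, the v1 and v2 payload shapes, and the USD default are all assumptions made for illustration.

```python
def upcast(event):
    """Convert an event of any known version to the latest version (v2)."""
    version = event.get("version", 1)
    if version == 1:
        # v1 stored a bare number; assume those amounts were in USD.
        event = {**event, "version": 2,
                 "amount": {"value": event["amount"], "currency": "USD"}}
    return event

def total_by_currency(events):
    """A projection that only understands v2 events."""
    totals = {}
    for raw in events:
        e = upcast(raw)
        currency = e["amount"]["currency"]
        totals[currency] = totals.get(currency, 0) + e["amount"]["value"]
    return totals

history = [
    {"type": "PaymentReceived", "amount": 50},  # old v1 format
    {"type": "PaymentReceived", "version": 2, "amount": {"value": 30, "currency": "USD"}},
    {"type": "PaymentReceived", "version": 2, "amount": {"value": 20, "currency": "EUR"}},
]

totals = total_by_currency(history)  # {"USD": 80, "EUR": 20}
```

The stored events are never rewritten; the old format is translated on the way out of the store.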

Real examples: Banking systems, the Axon Framework (a Java event sourcing framework used in enterprise systems), and, as a loose analogy, Git: commits are immutable and history is append-only, although each commit records a full snapshot of the tree rather than a change to be replayed.

CQRS: Separating Reads from Writes

CQRS (Command Query Responsibility Segregation) separates the model used for writing data (Commands) from the model used for reading data (Queries).

The Problem CQRS Solves

In a traditional system, the same database table serves both writes and reads:

Single Model:
  Write: INSERT INTO orders (...)
  Read:  SELECT * FROM orders WHERE user_id = ? ORDER BY date DESC

But writes and reads often have very different requirements:

- Writes need to be fast, validated, transactional
- Reads need to be fast, denormalized, optimized for specific UI views, often aggregating data from multiple sources

Trying to satisfy both with one schema leads to compromises on both sides.

The CQRS Solution

Commands (Writes)              Queries (Reads)
       ↓                              ↑
[Write Model / DB]  ──events──► [Read Model / DB]
  Normalized,                    Denormalized,
  transactional,                 optimized per view,
  validates business rules       can be multiple specialized stores

Concrete example — e-commerce order system:

Write side (PostgreSQL, normalized):
  orders table, order_items table, customers table
  → Strict foreign keys, ACID transactions, business rule validation

Read side (multiple specialized views, built from events):
  - "Order History" view (Elasticsearch — fast full-text search)
  - "Admin Dashboard" view (denormalized SQL — pre-joined for reports)
  - "Customer Order Count" view (Redis — instant counter lookups)

Each read view is updated asynchronously when write-side events occur. The write side stays clean and normalized. The read side is optimized for whatever each specific screen needs — even if that means redundant, denormalized copies of data.
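A stripped-down sketch of that split, using in-memory Python objects in place of PostgreSQL and Redis. The class names, method names, and validation rules are assumptions for illustration only.

```python
class OrderWriteModel:
    """Write side: validates commands and records the resulting events."""

    def __init__(self):
        self.orders = {}
        self.events = []

    def place_order(self, order_id, customer_id, total):
        if order_id in self.orders:
            raise ValueError("duplicate order id")
        if total <= 0:
            raise ValueError("total must be positive")
        self.orders[order_id] = {"customer_id": customer_id, "total": total}
        event = {"type": "OrderPlaced", "order_id": order_id,
                 "customer_id": customer_id, "total": total}
        self.events.append(event)
        return event

class CustomerOrderCountView:
    """Read side: a denormalized counter that answers one query cheaply."""

    def __init__(self):
        self.counts = {}

    def handle(self, event):
        if event["type"] == "OrderPlaced":
            cid = event["customer_id"]
            self.counts[cid] = self.counts.get(cid, 0) + 1

    def order_count(self, customer_id):
        return self.counts.get(customer_id, 0)

write_side = OrderWriteModel()
count_view = CustomerOrderCountView()

for event in (write_side.place_order(1, 456, 99.99),
              write_side.place_order(2, 456, 10.00)):
    count_view.handle(event)  # in a real system this happens asynchronously

orders_for_456 = count_view.order_count(456)  # 2
```

The read view never touches the write model's tables; it only sees events, so it can be reshaped or rebuilt without affecting the write side.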

CQRS + Event Sourcing: A Natural Pair

These two patterns are often used together, though each can be adopted without the other:

1. Command arrives: "Place Order"
2. Write side validates, persists event: OrderPlaced
3. Event published to event bus (Kafka)
4. Read-side projections consume the event:
   - Search index adds the order
   - Analytics dashboard updates order count
   - Customer's "recent orders" cache updates
5. Each read view is eventually consistent with the write side
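The five steps above, and in particular the "eventually consistent" part, can be sketched with a plain queue standing in for Kafka. Everything here (the `deque` as a bus, the function names, the `recent_orders` view) is an assumption made for illustration.

```python
from collections import deque

bus = deque()        # stands in for the event bus (Kafka in the text)
event_log = []       # write side: persisted events
recent_orders = {}   # read side: customer_id -> list of order ids

def place_order(order_id, customer_id):
    event = {"type": "OrderPlaced", "order_id": order_id, "customer_id": customer_id}
    event_log.append(event)   # step 2: persist the event
    bus.append(event)         # step 3: publish it

def run_projections():
    # Step 4: consumers drain the bus some time after the write completed.
    while bus:
        event = bus.popleft()
        if event["type"] == "OrderPlaced":
            recent_orders.setdefault(event["customer_id"], []).append(event["order_id"])

place_order(123, 456)
before = list(recent_orders.get(456, []))  # [] : the read view has not caught up yet
run_projections()
after = list(recent_orders.get(456, []))   # [123] : now consistent with the write side
```

The gap between `before` and `after` is the consistency window: the write has succeeded, but a reader hitting the view in between would not see the order yet.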

When to use CQRS: Complex domains where read and write patterns are genuinely different — e-commerce (write: place order; read: browse history, search, recommendations), social media (write: post; read: feed, search, trending).

When NOT to use it: Simple CRUD applications. CQRS adds real complexity — don't introduce it unless reads and writes are genuinely pulling your data model in different directions.

The Outbox Pattern: Solving the Dual-Write Problem

Here's a subtle but critical bug pattern in event-driven systems.

The dual-write problem:

def place_order(order_data):
    db.insert("orders", order_data)          # Write 1: database
    kafka.publish("order-placed", order_data) # Write 2: message queue

    # PROBLEM: What if the process crashes between these two lines?
    # → Order exists in DB, but event was never published
    # → Downstream services never know about this order

These are two separate systems (database and message broker), and no single transaction spans both. If the database write succeeds but the Kafka publish fails (network blip, broker down, process crash), you have a "ghost order" that exists but that nothing downstream knows about. Swapping the order of the two writes only flips the failure: the event goes out, but the order is never saved.

The Outbox Pattern Solution

Write the event to an outbox table in the same database transaction as the business data. A separate process reads the outbox and publishes to Kafka.

BEGIN TRANSACTION;
  INSERT INTO orders (id, user_id, total) VALUES (123, 456, 99.99);
  INSERT INTO outbox (event_type, payload, status) 
    VALUES ('OrderPlaced', '{"order_id": 123, ...}', 'PENDING');
COMMIT;
-- Both inserts succeed or both fail. Atomic. Guaranteed.
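To see the "both or neither" behaviour in action, here is a small sketch using Python's built-in `sqlite3` module. The schema is a simplified assumption based on the SQL above, and the second call deliberately passes an invalid payload so that the outbox insert fails after the orders insert has already run.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
conn.execute(
    "CREATE TABLE outbox ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " event_type TEXT NOT NULL,"
    " payload TEXT NOT NULL,"
    " status TEXT NOT NULL DEFAULT 'PENDING')"
)
conn.commit()

def place_order(order_id, user_id, total, payload):
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute("INSERT INTO orders (id, user_id, total) VALUES (?, ?, ?)",
                     (order_id, user_id, total))
        conn.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                     ("OrderPlaced", payload))

place_order(123, 456, 99.99, json.dumps({"order_id": 123}))

try:
    # payload=None violates NOT NULL on outbox, after the orders insert already ran.
    place_order(124, 456, 10.00, None)
except sqlite3.IntegrityError:
    pass

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
outbox_count = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
order_124 = conn.execute("SELECT id FROM orders WHERE id = 124").fetchone()
```

After the failed call, order 124 is gone as well: the rollback undid the orders insert, so there is never an order without its event.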

A separate outbox processor (running continuously) reads pending outbox rows and publishes them to Kafka:

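import time

def outbox_processor(db, kafka, poll_interval=0.1):
    """Poll the outbox and publish pending events (at-least-once delivery)."""
    while True:
        # Oldest first, assuming the outbox table has a created_at column.
        pending = db.query(
            "SELECT * FROM outbox WHERE status = 'PENDING' ORDER BY created_at"
        )
        for event in pending:
            # Publish first; mark the row only after the publish call returns.
            kafka.publish(event.event_type, event.payload)
            db.execute(
                "UPDATE outbox SET status = 'PUBLISHED' WHERE id = ?",
                event.id
            )
            # A crash between publish and UPDATE leaves the row PENDING,
            # so the event is sent again on restart.
        time.sleep(poll_interval)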

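Why this works: The database transaction guarantees the order and the outbox event are written together, atomically. The outbox processor guarantees eventual publishing to Kafka. Even if the processor crashes mid-publish, it retries unpublished events on restart (the outbox row stays PENDING until confirmed). The trade-off is at-least-once delivery: a crash after the publish but before the status update sends the same event again, so consumers need to tolerate duplicates.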
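Because a row stays PENDING until the processor confirms it, a retry after a crash can publish the same event twice. A common way to cope is an idempotent consumer that remembers which event ids it has already handled. The sketch below is one hypothetical shape for that; the `event_id` field and the in-memory set are assumptions (a real consumer would keep the ids in a durable store).

```python
processed_ids = set()   # assumption: in production this would be a durable store
emails_sent = []

def handle_order_placed(event):
    """Process each event id at most once, even if it is delivered repeatedly."""
    if event["event_id"] in processed_ids:
        return False    # duplicate delivery: skip the side effect
    emails_sent.append(f"confirmation for order {event['order_id']}")
    processed_ids.add(event["event_id"])
    return True

event = {"event_id": "evt-1", "order_id": 123}
first = handle_order_placed(event)    # True: handled
second = handle_order_placed(event)   # False: redelivery is ignored
```

With this in place, a duplicate publish from the outbox processor costs a lookup rather than a second confirmation email.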

Debezium is a popular tool that handles the publishing side via Change Data Capture (CDC): it reads the database's transaction log (the write-ahead log in PostgreSQL) and publishes committed changes, such as new outbox rows, to Kafka, so you don't need to write and run a custom polling processor.

Choreography vs Orchestration

When an event triggers a chain of actions across multiple services, who's "in charge" of the workflow?

Choreography: No Central Coordinator

Each service listens for events and reacts independently. No service knows about the full workflow — each just does its part.

OrderPlaced event published
    ├──► Inventory Service: reserves stock → publishes StockReserved
    ├──► Payment Service: charges card → publishes PaymentProcessed
    └──► Notification Service: (listens for PaymentProcessed) → sends email

Each service reacts to events. No one orchestrates the whole flow.
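The flow above can be sketched with a tiny in-process publish/subscribe bus. The `subscribe` and `publish` helpers and the service functions are hypothetical stand-ins for real services talking through a broker.

```python
from collections import defaultdict

subscribers = defaultdict(list)   # event type -> list of handlers
published = []                    # record of every event type published, in order
notifications = []

def subscribe(event_type, handler):
    subscribers[event_type].append(handler)

def publish(event_type, payload):
    published.append(event_type)
    for handler in list(subscribers[event_type]):
        handler(payload)

# Each service knows only the events it consumes and the events it emits.
def inventory_service(order):
    publish("StockReserved", order)

def payment_service(order):
    publish("PaymentProcessed", order)

def notification_service(order):
    notifications.append(f"email for order {order['order_id']}")

subscribe("OrderPlaced", inventory_service)
subscribe("OrderPlaced", payment_service)
subscribe("PaymentProcessed", notification_service)

publish("OrderPlaced", {"order_id": 1})
```

No function here describes the whole workflow; it only emerges from the subscriptions, which is both the appeal and the debugging problem of choreography.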

Advantages:

- Fully decoupled: services don't know about each other
- Easy to add new participants (just subscribe to relevant events)
- No central coordinator to become a single point of failure (the event broker is still shared infrastructure)

Disadvantages:

- Hard to see the overall flow: the "business process" is implicit, scattered across many services' event handlers
- Debugging is hard: tracing a request through choreographed events requires distributed tracing
- Easy to create circular dependencies (Service A reacts to Service B's event, which reacts to Service A's event...)

Orchestration: A Central Coordinator

A central orchestrator explicitly directs each step of the workflow.

OrderSaga (orchestrator):
  1. Call Inventory Service: reserve stock
     → wait for response
  2. Call Payment Service: charge card
     → wait for response
  3. Call Shipping Service: schedule delivery
     → wait for response
  4. Call Notification Service: send confirmation

If step 2 fails: orchestrator calls Inventory Service to release stock (compensating action)
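A minimal sketch of such an orchestrator, with the compensating action for a failed payment. The service interfaces (`reserve`, `release`, `charge`, `schedule`, `send`), the `PaymentDeclined` exception, and the `Recorder` stub are all assumptions used to make the example runnable.

```python
class PaymentDeclined(Exception):
    pass

def run_order_saga(order, inventory, payment, shipping, notifier):
    """Run the steps in order; undo the stock reservation if payment fails."""
    inventory.reserve(order)              # step 1
    try:
        payment.charge(order)             # step 2
    except PaymentDeclined:
        inventory.release(order)          # compensating action for step 1
        return "cancelled"
    shipping.schedule(order)              # step 3
    notifier.send(order)                  # step 4
    return "completed"

class Recorder:
    """Stub service: records each call and can be told to fail on one method."""

    def __init__(self, name, calls, fail_on=None):
        self.name, self.calls, self.fail_on = name, calls, fail_on

    def __getattr__(self, method):
        def call(order):
            self.calls.append(f"{self.name}.{method}")
            if method == self.fail_on:
                raise PaymentDeclined()
        return call

calls = []
result = run_order_saga(
    {"order_id": 1},
    Recorder("inventory", calls),
    Recorder("payment", calls, fail_on="charge"),
    Recorder("shipping", calls),
    Recorder("notifier", calls),
)
# result == "cancelled"
# calls == ["inventory.reserve", "payment.charge", "inventory.release"]
```

The whole workflow, including the failure path, is readable in one function, which is the main argument for orchestration.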

Advantages:

- Workflow is explicit and visible: read the orchestrator code to understand the whole process
- Easier to handle complex error/retry/compensation logic
- Easier to debug: one place to look

Disadvantages:

- Orchestrator is a central point of coordination: if poorly designed, it becomes a bottleneck
- Services become coupled to the orchestrator's expectations

The general guideline: Choreography for simple event reactions (2-3 services, simple flows). Orchestration for complex multi-step business processes with compensation logic (this leads directly into the Saga pattern — our next topic).

Real-World Example: Activity Feed Using Events

How would you design Instagram's "Following" activity feed using event-driven architecture?

User performs action → publishes event:
  - PostCreated { user_id, post_id, timestamp }
  - PostLiked { user_id, post_id, liker_id }
  - UserFollowed { follower_id, followed_id }
  - CommentAdded { user_id, post_id, comment_id }

Activity Feed Service subscribes to all these events:
  - On PostCreated: notify followers → write to their feed projection
  - On PostLiked: append "X liked your post" to user's notification feed
  - On UserFollowed: append "X started following you"
  - On CommentAdded: append "X commented on your post"

Each user's activity feed is a CQRS read model — 
  built entirely by projecting these events into a 
  per-user feed table (Cassandra, partitioned by user_id).

This mirrors the general approach large-scale social platforms take: the write path (creating posts, likes, follows) is decoupled from the read path (viewing feeds) via events.
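As a rough sketch of the Activity Feed Service, the handler below projects the four event types into per-user feeds held in memory (standing in for the Cassandra table). Two things are assumptions: that `user_id` on PostLiked is the post's author, and that `user_id` on CommentAdded is the commenter, with the post's author looked up from an earlier PostCreated event.

```python
from collections import defaultdict

followers = defaultdict(set)   # followed_id -> set of follower ids
post_authors = {}              # post_id -> author id
feeds = defaultdict(list)      # user_id -> feed entries (the read model)

def on_event(event):
    kind = event["type"]
    if kind == "UserFollowed":
        followers[event["followed_id"]].add(event["follower_id"])
        feeds[event["followed_id"]].append(f"{event['follower_id']} started following you")
    elif kind == "PostCreated":
        post_authors[event["post_id"]] = event["user_id"]
        for follower in sorted(followers[event["user_id"]]):
            feeds[follower].append(f"{event['user_id']} posted {event['post_id']}")
    elif kind == "PostLiked":
        feeds[event["user_id"]].append(f"{event['liker_id']} liked your post")
    elif kind == "CommentAdded":
        owner = post_authors[event["post_id"]]   # assumes PostCreated was seen first
        feeds[owner].append(f"{event['user_id']} commented on your post")

stream = [
    {"type": "UserFollowed", "follower_id": "bob", "followed_id": "alice"},
    {"type": "PostCreated", "user_id": "alice", "post_id": "p1", "timestamp": "2024-01-15T10:30:00Z"},
    {"type": "PostLiked", "user_id": "alice", "post_id": "p1", "liker_id": "bob"},
    {"type": "CommentAdded", "user_id": "bob", "post_id": "p1", "comment_id": "c1"},
]
for e in stream:
    on_event(e)
```

Reading a feed is then a single lookup by user id, with all the fan-out work already done at write time.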

Key Takeaways

- Event-driven architecture treats "things that happened" (events) as the primary data, with state derived from them.
- Event Sourcing: store the full history of events, reconstruct state by replaying. Gives you audit trails, time travel, and multiple derived views, at the cost of complexity.
- CQRS: separate write models (normalized, transactional) from read models (denormalized, optimized per view). Pairs naturally with event sourcing.
- Outbox Pattern: solves the dual-write problem by writing business data and the event in the same DB transaction, then publishing asynchronously via a separate processor.
- Choreography: decentralized, event-reactive; suits simple flows, fully decoupled.
- Orchestration: centralized coordinator; suits complex flows, with explicit logic and easier debugging.

What's Next

Topic 15 closes Day 5 with the Saga Pattern — how to handle transactions that span multiple microservices, why Two-Phase Commit is considered an anti-pattern in modern architectures, and how Uber coordinates trip booking across a dozen services without ever locking a database.

Tags: system-design event-driven-architecture cqrs event-sourcing backend distributed-systems interview-prep
