The Saga vs Event-Sourcing Decision Most Teams Get Backwards

Book: Event-Driven Architecture Pocket Guide
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

Picture the team that spends four months migrating an order-processing service to full event sourcing because checkout keeps losing partial-failure state. They build an event store, projections, snapshots. Six months later, the same partial-failure bug shows up in a different shape: payment captured, inventory reserved, shipping never booked, no compensating action ever ran. They built the wrong pattern. They needed a saga. The event store was useful, but it solved a problem they did not have yet, and the problem they did have was still in production.

The reverse mistake is just as common. Teams reach for sagas when their actual pain is "we cannot reconstruct what happened during yesterday's incident because we only store the current state." Compensations do not give you history. An event store does. Picking the wrong pattern costs you a quarter at minimum and an architecture rewrite at worst.

The two patterns are often introduced together (same conference talk, same chapter in the same book) and most teams come away thinking they are alternatives. They are not. They solve different problems. They compose cleanly. The decision rule is two questions, and a short saga sketch next to a short event-sourcing sketch makes the seam between them visible.

The two questions to ask, in order

You can short-circuit most of the pattern debate with two questions. Ask them in this order, because mixing them up is exactly how teams end up with the wrong tool.

Question 1: Do you need to coordinate a multi-step transaction across services where any step can fail?

If yes, you need a saga. The defining feature of a saga is the compensating action: when step 4 of a 6-step workflow fails, you run the compensations for steps 3, 2, and 1, in that order, to leave the system in a consistent state. microservices.io's saga page frames it as a sequence of local transactions where a failed step triggers compensating transactions that undo the steps before it.

The shape: order placed, payment authorized, inventory reserved, shipping label created, fulfillment notified. If shipping fails, you reverse the inventory reservation, void the payment authorization, and mark the order failed. The orchestrator (or the choreography) tracks where you are in that sequence and runs the compensation that matches the failure point.

Question 2: Do you need a permanent, replayable record of every state change in the system?

If yes, you need event sourcing. The defining feature of event sourcing is that the events are the database. Current state is derived by folding the event log. microservices.io's event sourcing page makes the distinction explicit: state is computed, not stored.

The shape: every change to an account is an event (Deposited, Withdrawn, Frozen). The current balance is a projection. You can rebuild any historical state by replaying events up to a timestamp. You can add a new projection (say, "monthly transaction count") and backfill it by replaying the existing log. There is no UPDATE statement in your domain code.

The two questions test for different things. Question 1 is about coordinating across services right now. Question 2 is about what you can answer about the past.

When the answers go together

Both yes is common. You have a multi-step workflow where steps can fail (saga) and you also need a replayable record of what happened (event sourcing). They compose: the saga's state transitions are themselves events in the event store. Greg Young's foundational CQRS and event sourcing work treats this as the natural pairing.

Both no is also common, and it is the case most teams skip past too fast. If your workflow is single-service and your audit needs are met by application logs plus created_at/updated_at columns, neither pattern earns its keep. Pick boring CRUD, ship, and revisit.

The dangerous quadrants are the off-diagonals. Yes-saga, no-event-sourcing is a saga over plain databases: common, and what most order-processing systems actually run. No-saga, yes-event-sourcing is an event-sourced single-service domain, what banking ledgers and audit-heavy systems run. The mistake is treating the off-diagonals as the same thing.
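The four quadrants can be written down as a tiny lookup. This is only a sketch of the decision rule above; the function name and the returned labels are made up for illustration.

```python
def recommend_pattern(needs_cross_service_coordination: bool,
                      needs_replayable_history: bool) -> str:
    """Map the answers to Question 1 and Question 2 onto a quadrant."""
    if needs_cross_service_coordination and needs_replayable_history:
        return "saga + event sourcing"
    if needs_cross_service_coordination:
        return "saga over plain databases"
    if needs_replayable_history:
        return "event-sourced single-service domain"
    return "plain CRUD"


# Checkout that loses partial-failure state, with no audit requirement:
print(recommend_pattern(True, False))  # saga over plain databases
```

The point of writing it this way is that the two inputs are independent: neither answer tells you anything about the other.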

A short saga: the compensations

A sketch of an orchestrator-based saga. Each step has a forward action and a compensation. The orchestrator runs forward until something fails, then runs the compensations in reverse.

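import logging
from dataclasses import dataclass, field
from typing import Callable

logger = logging.getLogger("saga")


def log_compensation_failure(step_name: str, exc: Exception) -> None:
    # Stand-in for real alerting: a failed compensation needs human follow-up.
    logger.error("compensation for %s failed: %r", step_name, exc)
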

@dataclass
class Step:
    name: str
    forward: Callable[[dict], dict]
    compensate: Callable[[dict], None]


@dataclass
class Saga:
    steps: list[Step]
    state: dict = field(default_factory=dict)

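    def run(self):
        completed: list[Step] = []
        try:
            for step in self.steps:
                # Each forward action returns a dict merged into shared state.
                result = step.forward(self.state)
                self.state.update(result)
                completed.append(step)
        except Exception:
            # Undo only the steps that finished, newest first.
            for step in reversed(completed):
                try:
                    step.compensate(self.state)
                except Exception as comp_exc:
                    # Keep going: one stuck compensation must not block the rest.
                    log_compensation_failure(step.name, comp_exc)
            # Re-raise the original failure so the caller sees why the saga aborted.
            raise
        return self.state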


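# Wiring it for a checkout. The step functions (auth_payment, void_authorization,
# and so on) are assumed to be defined elsewhere. Each forward function takes the
# saga state and returns a dict of results to merge into it.
order_saga = Saga(steps=[
    Step("auth_payment", auth_payment, void_authorization),
    Step("reserve_stock", reserve_stock, release_stock),
    Step("book_shipping", book_shipping, cancel_shipping),
    # Last step: nothing runs after it, so its compensation is a no-op.
    Step("notify_fulfillment", notify_fulfillment, lambda s: None),
])
order_saga.run()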

The shape worth pointing at: compensations are not rollbacks. void_authorization is a new operation that runs against the payment gateway; it is not a transactional ROLLBACK. Compensations can themselves fail, which is why log_compensation_failure exists and why production sagas need their own dead-letter queue (DLQ) for stuck compensations. The AWS saga choreography guide is a useful companion read; the orchestration variant is one click over.
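One way to handle a compensation that fails is to retry it a few times and park it somewhere visible if it never succeeds. The sketch below assumes compensations are idempotent (safe to run more than once) and uses a plain list as a stand-in for a real dead-letter queue; run_compensation, its parameters, and the two fake compensations are hypothetical names for illustration.

```python
import time
from typing import Callable


def run_compensation(name: str,
                     compensate: Callable[[dict], None],
                     state: dict,
                     dead_letters: list[dict],
                     max_attempts: int = 3,
                     base_delay: float = 0.0) -> bool:
    """Retry a compensation; park it in dead_letters if every attempt fails.

    Returns True if the compensation ran, False if it was dead-lettered.
    """
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            compensate(state)
            return True
        except Exception as exc:
            last_error = repr(exc)
            # Exponential backoff between attempts, none after the last one.
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))
    # Record enough context for someone to finish the cleanup by hand.
    dead_letters.append({"step": name, "state": dict(state), "error": last_error})
    return False


# Two fake compensations: one recovers on the third try, one never does.
calls = {"n": 0}

def flaky_void(state: dict) -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("gateway timeout")

def always_failing_release(state: dict) -> None:
    raise ConnectionError("inventory service down")

dlq: list[dict] = []
ok = run_compensation("auth_payment", flaky_void, {"order_id": "o-1"}, dlq)
parked = run_compensation("reserve_stock", always_failing_release,
                          {"order_id": "o-1"}, dlq)
print(ok, parked, [d["step"] for d in dlq])  # True False ['reserve_stock']
```

Retrying only makes sense because the compensation is assumed idempotent. If voiding the same authorization twice could cause harm, the retry loop needs an idempotency key before it needs a backoff.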

What this saga gives you: a forward path with cleanup. What it does not give you: any record of why book_shipping failed three weeks ago, or what the system state looked like before the compensation ran. For that, you need the second pattern.

A short event-sourcing sketch: the log and the projection

A minimal event-sourced bank account. The events are the source of truth. The current state is folded from the events on demand.

from dataclasses import dataclass
from typing import Iterable
import time

@dataclass(frozen=True)
class Event:
    aggregate_id: str
    type: str
    payload: dict
    ts: float


# The store is append-only.
class EventStore:
    def __init__(self):
        self._events: list[Event] = []

    def append(self, event: Event):
        self._events.append(event)

    def stream(self, aggregate_id: str) -> Iterable[Event]:
        return [e for e in self._events
                if e.aggregate_id == aggregate_id]

That is the whole storage layer. The aggregate is just a fold over the stream.

# The aggregate folds events into state.
def account_state(events: Iterable[Event]) -> dict:
    state = {"balance": 0, "frozen": False}
    for e in events:
        if e.type == "Deposited":
            state["balance"] += e.payload["amount"]
        elif e.type == "Withdrawn":
            state["balance"] -= e.payload["amount"]
        elif e.type == "Frozen":
            state["frozen"] = True
    return state


# Commands produce events; events update state.
def deposit(store, account_id, amount):
    store.append(Event(account_id, "Deposited",
                       {"amount": amount}, time.time()))


def withdraw(store, account_id, amount):
    state = account_state(store.stream(account_id))
    if state["frozen"]:
        raise RuntimeError("frozen")
    if state["balance"] < amount:
        raise RuntimeError("insufficient funds")
    store.append(Event(account_id, "Withdrawn",
                       {"amount": amount}, time.time()))

The shape worth pointing at: there is no UPDATE accounts SET balance = ... anywhere. There is append. The current balance is account_state(store.stream(id)). Want to know this account's balance on March 12 at 2pm? Fold only the events with ts < that_time. Want to add a new question ("how many withdrawals over $10k happened last quarter")? Read the same log a different way. That is the property the teams in the reverse scenario, the ones who could not reconstruct yesterday's incident, actually wanted.
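Both questions from the paragraph above can be sketched as folds over the same log. The block redefines a compact Event so it runs on its own; balance_at, large_withdrawals, the utc helper, and the sample data are illustrative names, not part of the earlier sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    aggregate_id: str
    type: str
    payload: dict
    ts: float  # seconds since the epoch


def balance_at(events, cutoff_ts: float) -> int:
    """Point-in-time state: fold only events recorded before cutoff_ts."""
    balance = 0
    for e in events:
        if e.ts >= cutoff_ts:
            continue
        if e.type == "Deposited":
            balance += e.payload["amount"]
        elif e.type == "Withdrawn":
            balance -= e.payload["amount"]
    return balance


def large_withdrawals(events, threshold: int,
                      start_ts: float, end_ts: float) -> int:
    """A new projection: withdrawals above threshold in [start_ts, end_ts)."""
    return sum(
        1 for e in events
        if e.type == "Withdrawn"
        and e.payload["amount"] > threshold
        and start_ts <= e.ts < end_ts
    )


def utc(year: int, month: int, day: int, hour: int = 0) -> float:
    return datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp()


log = [
    Event("acct-1", "Deposited", {"amount": 50_000}, utc(2024, 1, 5)),
    Event("acct-1", "Withdrawn", {"amount": 12_000}, utc(2024, 2, 10)),
    Event("acct-1", "Withdrawn", {"amount": 500}, utc(2024, 3, 1)),
    Event("acct-1", "Withdrawn", {"amount": 15_000}, utc(2024, 3, 20)),
]

print(balance_at(log, utc(2024, 3, 12, 14)))  # 37500
print(large_withdrawals(log, 10_000, utc(2024, 1, 1), utc(2024, 4, 1)))  # 2
```

Neither function existed when the events were written. That is the practical meaning of "read the same log a different way".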

What this event store gives you: history, replay, and new projections built by replaying the log instead of writing one-off backfill scripts. What it does not give you: any coordination across services. If a deposit needs to also reserve credit on a partner ledger, the event store alone will not handle the partial failure. That is the saga's job.

The composition

The two patterns compose by treating the saga's state transitions as events in the event store. The orchestrator emits OrderSagaStarted, PaymentAuthorized, StockReserved, ShippingFailed, CompensationCompleted. Each is appended to the log. The current saga state is folded the same way an account balance is folded.

The composition gives you both properties. Multi-service coordination with compensations: the saga. Replayable record of every step including the compensations: the event store. When something goes wrong six weeks later and a customer disputes a charge, you can replay the exact sequence of events that produced the outcome.
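A minimal sketch of that fold, assuming the event names listed above plus a hypothetical ShippingBooked for the success case. SagaEvent, STEP_DONE, saga_state, and pending_compensations are illustrative names, and a real log would also carry timestamps and versions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SagaEvent:
    saga_id: str
    type: str
    payload: dict


# Which saga step each forward-success event marks as completed.
# "ShippingBooked" is an assumed name; it does not appear in the article.
STEP_DONE = {
    "PaymentAuthorized": "auth_payment",
    "StockReserved": "reserve_stock",
    "ShippingBooked": "book_shipping",
}


def saga_state(events) -> dict:
    """Fold saga events into the current saga state."""
    state = {"status": "not_started", "completed": [], "failed_step": None}
    for e in events:
        if e.type == "OrderSagaStarted":
            state["status"] = "running"
        elif e.type in STEP_DONE:
            state["completed"].append(STEP_DONE[e.type])
        elif e.type == "ShippingFailed":
            state["status"] = "compensating"
            state["failed_step"] = "book_shipping"
        elif e.type == "CompensationCompleted":
            state["status"] = "compensated"
    return state


def pending_compensations(state: dict) -> list[str]:
    """Steps still to undo, newest first, while the saga is compensating."""
    if state["status"] != "compensating":
        return []
    return list(reversed(state["completed"]))


log = [
    SagaEvent("saga-1", "OrderSagaStarted", {"order_id": "o-1"}),
    SagaEvent("saga-1", "PaymentAuthorized", {"auth_id": "a-9"}),
    SagaEvent("saga-1", "StockReserved", {"sku": "sku-7", "qty": 1}),
    SagaEvent("saga-1", "ShippingFailed", {"reason": "carrier timeout"}),
]
state = saga_state(log)
print(state["status"], pending_compensations(state))
# compensating ['reserve_stock', 'auth_payment']
```

Because the saga state is derived from the log, an orchestrator that crashes mid-compensation can refold the stream on restart and see which compensations are still owed.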

The Decipherzone writeup on combining the two goes deeper on the operational details: idempotency keys, event versioning, projection rebuilds. The thing to internalize is that the saga and the event store are not competing for the same slot in your architecture. They sit at different layers.

When the second problem shows up

Teams get this backwards because the patterns are introduced together and the surface vocabulary overlaps: both involve "events," both come up when someone says "distributed systems," both have ASCII diagrams with arrows. Underneath, they are answering different questions. Saga answers "how do we recover when step 4 fails." Event sourcing answers "what happened, and can we replay it."

The cost of getting it wrong is the lost quarter, or more, from the opening scenarios. Build event sourcing when you needed compensations and you still have the original bug, plus a year of accumulated event-versioning headaches. Build a saga when you needed history and you still cannot answer the audit question, plus a workflow-state machine that nobody likes to extend.

Pick the pattern that solves the problem you actually have right now. Compose them later when the second problem shows up. Resist the conference-talk pull to architect for both at once.

If this was useful

The Saga, CQRS, and Event Sourcing chapters of Event-Driven Architecture Pocket Guide walk through the decision in production terms: orchestration vs choreography, snapshotting strategies, the projection rebuild patterns that hold up under traffic, and the operational traps (compensation idempotency, event versioning) that bite teams two quarters in. If you're picking between these patterns or already running both, the book is for you.