Gabriel Anhaia

Posted on May 24

Saga Compensation When Undo Is Impossible: 3 Patterns and the Audit Trail

#architecture #eventdriven #saga #backend

Book: Event-Driven Architecture Pocket Guide: Saga, CQRS, Outbox, and the Traps Nobody Warns You About

A saga step charged the customer's card. The next step, reserving the hotel room, failed because the inventory service returned a 503. Your saga orchestrator does what every textbook says: trigger the compensation. Refund the charge.

The refund call also fails. Stripe responds with 429 Too Many Requests, and your retry policy kicks in. The third retry gets a definitive answer, but it is 400 charge_already_refunded: retry #2 actually went through, and you never saw the ACK.

Now you have a customer who was charged and refunded, and a saga that thinks both the charge and the refund are still in flight. The orchestrator is stuck. The on-call engineer is paged. The lawyer wants to know if the audit trail will survive subpoena.

This is the part of the saga pattern that the tutorials skip.

The compensation lie

Every saga article opens the same way. Booking system. Three services. Charge → reserve hotel → book flight. If flight booking fails, compensate by canceling the hotel and refunding the charge. Clean diagrams with arrows pointing backwards. Everybody nods.

The diagrams are lying to you by omission.

Compensation is a verb, not a guarantee. It assumes:

The downstream side effect is reversible.
The reversal API is reliable enough to count on.
The reversal happens fast enough to matter to the user.

In production, none of these hold consistently. A Stripe charge can be refunded, but the refund is a new asynchronous side effect that can fail on its own. An email sent through SendGrid can't be un-sent. A row inserted into a partner's CRM via webhook can't be removed without their cooperation. A position opened on a brokerage exchange can be flattened, but at a different price.

When the compensation step itself can fail, you don't have a transaction. You have two interleaved sequences of events, each with its own failure modes, and a state machine that has to reconcile both.
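
To make "two interleaved sequences" concrete, here is a minimal sketch of tracking the charge and the refund as separate side effects, each with its own state. The names SideEffectState and customer_view are illustrative assumptions, not part of any library. The point is the explicit UNKNOWN state for a lost ACK.

```python
from enum import Enum


class SideEffectState(Enum):
    NOT_STARTED = "not_started"
    IN_FLIGHT = "in_flight"    # request sent, no response recorded yet
    CONFIRMED = "confirmed"    # provider acknowledged success
    FAILED = "failed"          # provider acknowledged failure
    UNKNOWN = "unknown"        # timeout or lost ACK: may or may not have happened


def customer_view(charge: SideEffectState, refund: SideEffectState) -> str:
    """Best guess at what the customer sees, given both sequences."""
    if charge is not SideEffectState.CONFIRMED:
        return "charge_unsettled"
    if refund is SideEffectState.CONFIRMED:
        return "charged_and_refunded"
    if refund in (SideEffectState.IN_FLIGHT, SideEffectState.UNKNOWN):
        # needs a lookup at the provider, not a blind retry
        return "refund_ambiguous"
    return "charged_not_refunded"
```

With a model like this, the opening scenario is no longer "stuck". It is a confirmed charge paired with a refund in UNKNOWN, which tells the orchestrator to query the provider before doing anything else.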

The three patterns below are what teams actually ship once they accept that "compensate" is a wish, not a primitive.

Pattern 1: Forward recovery

Forward recovery says: don't try to undo. Commit to the end of the saga, mark the result as deviated from the original plan, and resolve the difference out of band.

This is the right call when the cost of an imperfect forward path is lower than the cost of a half-broken rollback. The classic case: you charged the card, the hotel booking failed, but a comparable hotel is available. You book the comparable hotel and email the customer about the swap. The saga finishes successfully even though one step had to deviate from the original plan.

Here is the orchestrator step in Python. Note the explicit Outcome type that carries the deviation forward:

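from dataclasses import dataclass
from enum import Enum

# hotel_api and HotelUnavailable are assumed to come from your own
# hotel client module (not shown here)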

class Outcome(Enum):
    SUCCESS = "success"
    DEVIATED = "deviated"
    UNRECOVERABLE = "unrecoverable"

@dataclass
class StepResult:
    outcome: Outcome
    saga_id: str
    step: str
    payload: dict
    deviation_reason: str | None = None

def reserve_hotel(saga_id: str, request: dict) -> StepResult:
    try:
        booking = hotel_api.reserve(
            hotel_id=request["hotel_id"],
            checkin=request["checkin"],
            checkout=request["checkout"],
        )
        return StepResult(Outcome.SUCCESS, saga_id, "reserve_hotel",
                          {"booking_id": booking.id})
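    except HotelUnavailable:
        # forward recovery: find a comparable hotel instead of unwinding
        fallback = hotel_api.find_comparable(request)
        if fallback is None:
            return StepResult(Outcome.UNRECOVERABLE, saga_id,
                              "reserve_hotel", {},
                              deviation_reason="no_fallback_available")

        try:
            booking = hotel_api.reserve(
                hotel_id=fallback.id,
                checkin=request["checkin"],
                checkout=request["checkout"],
            )
        except HotelUnavailable:
            # the fallback sold out between lookup and reserve: stop here
            return StepResult(Outcome.UNRECOVERABLE, saga_id,
                              "reserve_hotel", {},
                              deviation_reason="fallback_sold_out")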
        return StepResult(Outcome.DEVIATED, saga_id, "reserve_hotel",
                          {"booking_id": booking.id,
                           "original_hotel": request["hotel_id"],
                           "actual_hotel": fallback.id},
                          deviation_reason="primary_sold_out")

The saga itself doesn't care whether the step deviated. It cares whether the saga can continue. The deviation gets recorded in the event log so customer support, billing, and the downstream notification service can act on it.

Forward recovery works because it converts a failure into a different success. It fails when there is no acceptable substitute, when the only safe thing to do is stop. That's when you need the next pattern.

Pattern 2: Pivot transaction

A pivot transaction re-shapes the irreversible side effect into something you can undo.

The Stripe-charge-then-fail scenario is the textbook example. You can't truly undo a charge. You can issue a refund, but the refund is a separate transaction that can fail, can be delayed, and shows up on the customer's statement as two line items.

The pivot: don't capture the charge during the saga. Authorize it, run the rest of the saga, then capture or void at the end.

import stripe

def authorize_payment(saga_id: str, request: dict) -> StepResult:
    intent = stripe.PaymentIntent.create(
        amount=request["amount_cents"],
        currency=request["currency"],
        customer=request["stripe_customer_id"],
        capture_method="manual",  # authorize only
        metadata={"saga_id": saga_id},
        idempotency_key=f"auth-{saga_id}",
    )
    return StepResult(Outcome.SUCCESS, saga_id, "authorize_payment",
                      {"payment_intent_id": intent.id,
                       # assumption: a 7-day hold; the real window depends on
                       # the card network and payment method
                       "auth_expires_at": intent.created + 7 * 86400})

def capture_payment(saga_id: str, payment_intent_id: str) -> StepResult:
    intent = stripe.PaymentIntent.capture(
        payment_intent_id,
        idempotency_key=f"capture-{saga_id}",
    )
    return StepResult(Outcome.SUCCESS, saga_id, "capture_payment",
                      {"charge_id": intent.latest_charge})

def void_authorization(saga_id: str, payment_intent_id: str) -> StepResult:
    # void releases the hold; it never shows on the customer statement
    intent = stripe.PaymentIntent.cancel(
        payment_intent_id,
        cancellation_reason="abandoned",
        idempotency_key=f"void-{saga_id}",
    )
    return StepResult(Outcome.SUCCESS, saga_id, "void_authorization",
                      {"payment_intent_id": intent.id})

A void on an uncaptured PaymentIntent leaves nothing behind. The hold drops off the customer's available balance within a few days. There is no refund line item. There is no support ticket asking "why was I charged $432 yesterday." The pivot turned an irreversible operation into a reversible one by inserting an intermediate state.

The pivot doesn't always exist. Email is sent or it isn't. A market order is filled or it isn't. But for any side effect that has an authorize/commit shape (payments, inventory holds, slot reservations, license grants), the pivot is the cleanest pattern.

The gotcha: the authorization has an expiry. A Stripe card authorization typically holds for about 7 days, though the exact window depends on the card network and payment method. Reservation slots might hold for 15 minutes. Your saga timeout has to be shorter than the shortest authorization in the chain, or the auth expires mid-saga and you're back to managing irreversible commits.
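
One way to enforce that rule is to derive the saga deadline from the holds themselves rather than hard-coding a timeout. The sketch below is an assumption about how you might do it. The saga_deadline function and the two-minute safety margin are illustrative, not a recommendation for your numbers.

```python
from datetime import datetime, timedelta, timezone


def saga_deadline(started_at: datetime, hold_expiries: list[datetime],
                  safety_margin: timedelta = timedelta(minutes=2)) -> datetime:
    """Latest moment the saga may still commit: earliest hold expiry minus a margin."""
    if not hold_expiries:
        raise ValueError("no holds to bound the saga")
    deadline = min(hold_expiries) - safety_margin
    if deadline <= started_at:
        raise ValueError("shortest hold expires before the saga can do any work")
    return deadline


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
payment_hold = start + timedelta(days=7)      # card authorization
slot_hold = start + timedelta(minutes=15)     # reservation slot
deadline = saga_deadline(start, [payment_hold, slot_hold])
```

Here the 15-minute slot hold, not the 7-day payment hold, sets the budget: the saga has 13 minutes to reach its commit step.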

Pattern 3: Reconciliation queue

The third pattern is for the cases where neither forward recovery nor pivot applies. You issued the side effect, the next step failed, and you don't know whether the compensation succeeded.

A human has to decide what happens next. Your job is to make that human's job tractable.

The reconciliation queue is a durable, monitored queue of saga instances stuck in an ambiguous state. Each entry has the full event history, the current best guess at the customer-visible state, and a small set of resolution actions the operator can take.

The schema for queue entries:

CREATE TABLE saga_reconciliation_queue (
    saga_id        UUID PRIMARY KEY,
    saga_type      TEXT NOT NULL,
    stuck_at_step  TEXT NOT NULL,
    stuck_since    TIMESTAMPTZ NOT NULL DEFAULT now(),
    ambiguity      TEXT NOT NULL, -- 'compensation_unknown', 'partial_commit', ...
    customer_id    UUID NOT NULL,
    monetary_risk  NUMERIC(12, 2),
    event_history  JSONB NOT NULL,
    available_actions JSONB NOT NULL,
    assigned_to    TEXT,
    resolved_at    TIMESTAMPTZ,
    resolution     TEXT,
    resolved_by    TEXT
);

CREATE INDEX idx_reconciliation_unresolved
    ON saga_reconciliation_queue (stuck_since)
    WHERE resolved_at IS NULL;

CREATE INDEX idx_reconciliation_by_risk
    ON saga_reconciliation_queue (monetary_risk DESC)
    WHERE resolved_at IS NULL;

The available_actions field is what makes the queue usable. Don't make the operator invent a fix from scratch. Compute the safe set of resolutions when you put the saga on the queue, present them as buttons, log which one they picked.

For the Stripe-then-fail case:

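import json

def queue_for_reconciliation(saga_id: str, error: SagaStuckError):
    # event_store, db, and SagaStuckError are assumed to come from your own modules
    history = list(event_store.events_for(saga_id))
    charge_event = next(
        (e for e in history if e.type == "PaymentCaptured"), None)
    if charge_event is None:
        raise ValueError(f"saga {saga_id} has no PaymentCaptured event")
    refund_attempts = [e for e in history if e.type == "RefundAttempted"]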

    actions = []

    if not refund_attempts:
        # no refund was ever attempted, so we can still try
        actions.append({
            "id": "issue_refund",
            "label": "Issue full refund via Stripe",
            "params": {"charge_id": charge_event.payload["charge_id"]},
        })
    else:
        # a refund was attempted: check Stripe directly before deciding
        actions.append({
            "id": "reconcile_with_stripe",
            "label": "Query Stripe and reconcile our record",
            "params": {"charge_id": charge_event.payload["charge_id"]},
        })

    actions.append({
        "id": "manual_credit",
        "label": "Credit customer account, leave charge intact",
        "params": {"amount_cents": charge_event.payload["amount"]},
    })

    actions.append({
        "id": "escalate_to_finance",
        "label": "Escalate to finance team for manual handling",
        "params": {},
    })

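    # serialize explicitly: event objects are not JSON-serializable as-is;
    # extend the dict with whatever other fields your events carry
    history_json = json.dumps(
        [{"type": e.type, "payload": e.payload} for e in history],
        default=str,  # timestamps, UUIDs, Decimals
    )

    db.execute(
        """INSERT INTO saga_reconciliation_queue
           (saga_id, saga_type, stuck_at_step, ambiguity, customer_id,
            monetary_risk, event_history, available_actions)
           VALUES (%s, %s, %s, %s, %s, %s, %s::jsonb, %s::jsonb)""",
        (saga_id, error.saga_type, error.step, "compensation_unknown",
         error.customer_id, error.monetary_risk,
         history_json, json.dumps(actions)),
    )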
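
The other half is what happens when the operator clicks a button. A minimal sketch, assuming the queue row has been loaded into a plain dict with the same keys as the table columns (resolve_entry is a hypothetical helper, not an existing API): accept only an action that was precomputed for this entry, and record who picked it.

```python
from datetime import datetime, timezone


def resolve_entry(entry: dict, action_id: str, operator: str) -> tuple[dict, dict]:
    """Return (resolved copy of the entry, the action to execute)."""
    if entry.get("resolved_at") is not None:
        raise ValueError(f"saga {entry['saga_id']} is already resolved")

    allowed = {a["id"]: a for a in entry["available_actions"]}
    if action_id not in allowed:
        # the operator can only pick from the safe set computed at enqueue time
        raise ValueError(
            f"action {action_id!r} was not offered for saga {entry['saga_id']}")

    resolved = {
        **entry,  # copy, so the caller's row is not mutated
        "resolution": action_id,
        "resolved_by": operator,
        "resolved_at": datetime.now(timezone.utc).isoformat(),
    }
    return resolved, allowed[action_id]
```

Rejecting anything outside available_actions keeps the operator from improvising a fix, and the resolved_by value should also land on the audit trail as the actor.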

The first time you build this, the queue will be your most-watched dashboard for a month. After that, the patterns settle and you'll catch most ambiguities with automated reconciliation logic. The queue becomes the safety net for the long tail.
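
As an example of that automated reconciliation logic, here is a sketch of a pure decision function for the refund case. It assumes you have already fetched the provider's view of the charge. ProviderCharge is a hypothetical local type holding the charge amount and the amount refunded so far, and the returned action names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderCharge:
    amount: int           # minor units, as reported by the provider
    amount_refunded: int  # minor units refunded so far


def reconcile_refund(provider: ProviderCharge, local_refund_confirmed: bool) -> str:
    """Compare the provider's view with our record and pick a next action."""
    if provider.amount_refunded == provider.amount:
        # fully refunded at the provider: the lost-ACK case if we never recorded it
        return "no_action" if local_refund_confirmed else "mark_refund_confirmed"
    if provider.amount_refunded == 0:
        # nothing refunded: retrying is safe only if we never recorded success
        return "escalate" if local_refund_confirmed else "retry_refund"
    # partial refund: not something to guess at automatically
    return "escalate"
```

The lost-ACK scenario from the opening resolves to mark_refund_confirmed without a human. Anything that contradicts your own record, or any partial refund, still goes to the queue.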

The audit trail every regulator asks for

If your saga touches money, healthcare, or personal data, somebody will eventually ask you to reconstruct what happened to a specific customer on a specific day. The audit trail is the answer to that question.

Three things have to be on the trail: the events the system produced, the decisions the orchestrator made, and the outcomes the user experienced.

from dataclasses import dataclass

@dataclass(frozen=True)  # audit events are never mutated after creation
class AuditEvent:
    event_id: str          # UUIDv7 for sortable IDs
    saga_id: str
    saga_type: str
    occurred_at: str       # ISO-8601 with timezone
    event_type: str        # 'StepStarted', 'StepCompleted', 'CompensationTriggered'...
    step: str
    actor: str             # 'orchestrator', 'operator:jdoe', 'cron:retry'
    causation_id: str | None  # which event caused this one
    correlation_id: str       # request-id that started the whole thing
    decision: dict | None     # for orchestrator decisions: chosen branch + why
    outcome: dict | None      # for step completions: result + side-effect refs
    external_refs: dict       # Stripe charge IDs, vendor booking IDs, etc.

The fields people forget:

causation_id: lets you walk the chain backwards from "refund failed" to "what triggered the refund" to "what made us decide to refund." Without it, you have a flat list of events that look unconnected.
actor: was this an automated orchestrator decision, a retry from cron, an operator pulling a saga off the reconciliation queue? When something goes wrong, the first question is "who did this," and the actor field answers it.
external_refs: the foreign keys for every external system you touched. Stripe charge ID, vendor booking reference, support ticket number. When the regulator asks for the trail of a $4,300 chargeback, this is what you join on.

The trail has to be append-only, immutable, and stored separately from the operational database. If somebody can UPDATE an audit row, your trail is untrustworthy and a competent auditor will say so.
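
To show what causation_id buys you, here is a small sketch that walks the chain backwards from one event to its root. It assumes the events have been loaded into a dict keyed by event_id. The causation_chain helper and the event type names in the sample data are illustrative.

```python
def causation_chain(events: dict[str, dict], event_id: str) -> list[str]:
    """Follow causation_id links from one event back to the root cause."""
    chain: list[str] = []
    seen: set[str] = set()
    current: str | None = event_id
    while current is not None:
        if current in seen:
            raise ValueError(f"causation cycle at event {current}")
        seen.add(current)
        event = events[current]  # KeyError here means the trail has a gap
        chain.append(event["event_type"])
        current = event["causation_id"]
    return chain


trail = {
    "e1": {"event_type": "StepFailed", "causation_id": None},
    "e2": {"event_type": "CompensationTriggered", "causation_id": "e1"},
    "e3": {"event_type": "RefundFailed", "causation_id": "e2"},
}
path = causation_chain(trail, "e3")
# path == ["RefundFailed", "CompensationTriggered", "StepFailed"]
```

A missing link raises instead of silently truncating the chain, which is usually what you want when the output ends up in front of an auditor.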

When to abort, forward-recover, or pivot

A small decision tree, applied per step, not per saga:

Is the side effect cheap and reversible (an idempotent write to your own DB, a hold on inventory you own)? Compensate normally, no special pattern needed.
Is the side effect external but the vendor offers a reliable undo (Stripe refund, partner cancel-booking webhook with strong SLAs)? Compensate, but with a retry budget, idempotency keys, and a fallback to the reconciliation queue.
Is the side effect external and the vendor offers authorize/commit? Pivot. Authorize early, commit late.
Is the side effect external, irreversible, but has acceptable substitutes? Forward recovery. Deviate and continue.
Is the side effect external, irreversible, and no substitute exists? Abort the saga before this step, or accept manual handling and route to the reconciliation queue.

The point of this tree is to stop you from defaulting to "trigger compensation" for every failure. Compensation is one of five answers, not the only one.
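
If you want the tree to live in code rather than in a wiki page, a sketch like the one below can work. StepProfile and Strategy are hypothetical names, and the boolean flags are a simplification of the five questions above, checked in the same order.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    COMPENSATE = "compensate"
    COMPENSATE_WITH_FALLBACK = "compensate_with_fallback"
    PIVOT = "pivot"
    FORWARD_RECOVERY = "forward_recovery"
    ABORT_OR_RECONCILE = "abort_or_reconcile"


@dataclass(frozen=True)
class StepProfile:
    cheap_and_reversible: bool = False   # your own DB, inventory you own
    reliable_vendor_undo: bool = False   # refund or cancel API with strong SLAs
    authorize_commit: bool = False       # vendor supports hold-then-capture
    acceptable_substitute: bool = False  # a different success is possible


def choose_strategy(profile: StepProfile) -> Strategy:
    """Apply the decision tree per step, in the order the questions are asked."""
    if profile.cheap_and_reversible:
        return Strategy.COMPENSATE
    if profile.reliable_vendor_undo:
        # retry budget + idempotency keys, reconciliation queue as the fallback
        return Strategy.COMPENSATE_WITH_FALLBACK
    if profile.authorize_commit:
        return Strategy.PIVOT
    if profile.acceptable_substitute:
        return Strategy.FORWARD_RECOVERY
    return Strategy.ABORT_OR_RECONCILE
```

Declaring a profile next to each step definition forces the question to be answered when the step is written, not at 3 a.m. when it fails.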

The gotcha: partial compensation is worse than none

This is the trap that kills more sagas than any other.

Your saga has four steps. Step 3 fails. You trigger compensations for steps 2 and 1. Step 2's compensation succeeds. Step 1's compensation fails after exhausting retries.

You now have a customer state worse than if you had done nothing. The hotel booking was canceled (step 2 compensation succeeded), but the credit card is still charged (step 1 compensation failed). The customer paid for a hotel they no longer have a booking for. The on-call engineer who got paged at 3 a.m. has to manually compose an apology, manually refund the charge, and manually re-book the hotel if it's still available.

If you can't compensate atomically across all prior steps, compensating some of them is often worse than compensating none. Two options that mitigate this:

Per-step orphan tolerance. Each step declares whether its side effect is acceptable to leave behind. A hotel booking might be tolerable as an orphan (the customer keeps the room). A charge without a delivered service is not tolerable. If the compensation for any non-orphan-tolerant step fails, escalate the whole saga to the reconciliation queue immediately. Don't keep walking backwards.
Compensation in reverse-dependency order, with a checkpoint after each. If step N's compensation fails, stop. Don't proceed to step N-1's compensation. The half-compensated state, with the audit trail explaining where you stopped, is easier to reconcile than a fully-mixed partial-compensation state.
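
The second option can be sketched as a short unwind loop. This is a minimal illustration under assumptions: CompletedStep and unwind are hypothetical names, each compensation is modeled as a callable returning True on confirmed success, and the checkpoint list stands in for writes to your audit trail.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CompletedStep:
    name: str
    compensate: Callable[[], bool]  # True only on confirmed success


def unwind(completed: list[CompletedStep], checkpoints: list[str]) -> list[str]:
    """Compensate newest-first and stop at the first failure.

    Returns the names of steps still uncompensated, newest first.
    An empty list means the saga was fully unwound.
    """
    pending = list(reversed(completed))
    while pending:
        step = pending[0]
        ok = step.compensate()
        # checkpoint after every attempt, success or not
        checkpoints.append(f"{step.name}:{'compensated' if ok else 'failed'}")
        if not ok:
            break  # do not touch earlier steps; hand over to reconciliation
        pending.pop(0)
    return [s.name for s in pending]
```

A non-empty return value is the signal to enqueue the saga for reconciliation, with the checkpoint list showing exactly where the unwind stopped.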

The lesson generalizes. "Always compensate" is the saga fairy tale. "Compensate when it improves customer state, escalate when it doesn't" is the production reality.

What's the worst stuck-saga story you've had to clean up by hand? Was it a payment, an inventory hold, or a third-party API that didn't behave like the docs promised?

If this was useful

The pattern catalog in the Event-Driven Architecture Pocket Guide covers saga, CQRS, outbox, and the half-dozen failure modes that the original papers gloss over. The chapter on compensation and the reconciliation queue goes deeper on the orphan-tolerance heuristic and the audit-trail schema sketched here, with extended examples for payment, inventory, and multi-vendor sagas.