Saga Timeouts: The Compensation Path Most Teams Never Test

Book: Event-Driven Architecture Pocket Guide: Saga, CQRS, Outbox, and the Traps Nobody Warns You About
Me: xgabriel.com | GitHub

Your saga has tests. The happy path passes: reserve inventory, charge the card, create the shipment, mark the order confirmed. The failure path passes too: the card declines, you release the inventory, you tell the customer no. Both green. You shipped it.

Then one Tuesday the payment provider goes slow. Not down. Slow. The charge request sits at 40 seconds. Your saga step has a 30-second timeout, so the orchestrator gives up and moves on. But the provider didn't give up. At second 41 the charge succeeds. Now you have a reserved-but-unconfirmed order, a customer who got charged, and a saga that already decided this step failed.

That's the path nobody tested. Not the decline. The timeout.

A timeout is not a failure

Most saga implementations collapse two different outcomes into one branch. A step either succeeds or it fails, and "fails" triggers compensation. But a timeout is a third thing, and treating it like a plain failure is where the money leaks.

A failure has a definite answer: the operation did not happen. A decline, a 400, a validation error. You know the side effect never landed. Compensating is safe because there's nothing on the other side to reconcile.

A timeout has no answer. The request left your process and never came back. The remote side might have done nothing. It might have done everything and lost the response. It might be doing it right now, after you stopped waiting. You compensate against a state you cannot observe.

So the first move is to stop modelling steps as a boolean. Model the timeout as its own state.

type StepState int

const (
    StepPending StepState = iota
    StepSucceeded
    StepFailed     // definite: it did not happen
    StepTimedOut   // unknown: it might have happened
)

The StepTimedOut branch does not go straight to compensation. It goes to reconciliation first: ask the remote system what actually happened, then decide.
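
One way to picture that routing, as a minimal sketch. The `Action` type and the `nextAction` function are hypothetical names, not part of any saga library. The only point is that a timeout has its own route, and that route never leads straight to compensation.

```go
package main

// StepState mirrors the enum above.
type StepState int

const (
	StepPending StepState = iota
	StepSucceeded
	StepFailed   // definite: it did not happen
	StepTimedOut // unknown: it might have happened
)

// Action is a hypothetical name for what the orchestrator does next.
type Action int

const (
	Wait       Action = iota // step still in flight
	Advance                  // move on to the next step
	Compensate               // undo the earlier steps
	Reconcile                // ask the remote side first
)

// nextAction maps each state to exactly one action.
// StepTimedOut maps to Reconcile, never directly to Compensate.
func nextAction(s StepState) Action {
	switch s {
	case StepSucceeded:
		return Advance
	case StepFailed:
		return Compensate
	case StepTimedOut:
		return Reconcile
	default:
		return Wait
	}
}
```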

What "compensate" means when you don't know the state

Say the charge step times out. You have two honest options, and which one you pick depends on what the payment API gives you.

If the provider exposes a lookup by your idempotency key, query it. The charge either exists or it doesn't.

func reconcileCharge(
    ctx context.Context,
    api PaymentAPI,
    key string,
) (StepState, error) {
    charge, err := api.GetByIdempotencyKey(ctx, key)
    if errors.Is(err, ErrNotFound) {
        // provider has no record: the charge
        // never landed. Safe to treat as failed.
        return StepFailed, nil
    }
    if err != nil {
        // still can't tell. Stay timed out,
        // retry reconciliation later.
        return StepTimedOut, err
    }
    switch charge.Status {
    case "succeeded":
        return StepSucceeded, nil
    case "failed":
        // assumed name for the provider's
        // terminal failure status.
        return StepFailed, nil
    default:
        // e.g. still processing: not a definite
        // answer yet, so keep reconciling.
        return StepTimedOut, nil
    }
}

If the provider has no lookup, you cannot reconcile. Then the rule is: the compensation must be safe to run even if the step never happened. Refunding a charge that does not exist must be a no-op, not an error. Releasing inventory you never reserved must be a no-op. The compensation has to assume it might be cleaning up a thing that isn't there.

That single property — compensation that's safe to run against a step that may or may not have completed — is what makes the timeout path survivable.
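
Here is a rough sketch of what "safe to run even if the step never happened" looks like from the provider's side. `fakeProvider` and its methods are made up for illustration, real payment APIs differ, and amounts are plain integers to keep it short.

```go
package main

import "sync"

// fakeProvider is a hypothetical in-memory stand-in for a payment
// provider, keyed by idempotency key.
type fakeProvider struct {
	mu       sync.Mutex
	charges  map[string]int // key -> amount charged
	refunded map[string]int // key -> amount refunded
}

func newFakeProvider() *fakeProvider {
	return &fakeProvider{
		charges:  map[string]int{},
		refunded: map[string]int{},
	}
}

// Charge records the charge at most once per key.
func (p *fakeProvider) Charge(key string, amount int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.charges[key]; !ok {
		p.charges[key] = amount
	}
}

// Refund returns the amount actually given back. A missing charge or
// an already refunded charge yields 0 rather than an error.
func (p *fakeProvider) Refund(key string) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	amount, charged := p.charges[key]
	if !charged {
		return 0 // nothing there to undo
	}
	if _, done := p.refunded[key]; done {
		return 0 // already refunded once
	}
	p.refunded[key] = amount
	return amount
}
```

Refunding a key that was never charged returns 0 instead of an error, and so does a second refund of a real charge. Note what this does not cover: a charge that lands after the no-op refund already ran. That gap is why the lookup is the better option whenever the provider offers one.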

Compensation has to be idempotent, for the same reasons consumers do

Here is the trap. The timeout fires, you start reconciling, reconciliation itself is slow, the orchestrator restarts, and now a second worker picks up the same saga and also starts compensating. Both run the refund. If your refund isn't idempotent, you refunded twice.

Compensation is a message handler like any other, on a broker that is at-least-once. It gets redelivered. It races itself across restarts. Every argument for idempotent consumers applies here, except the blast radius is worse, because compensation runs precisely when the system is already in a bad state.

The cleanest guard is a state transition the database refuses to apply twice.

UPDATE saga_steps
SET state = 'compensated',
    compensated_at = now()
WHERE saga_id = $1
  AND step = 'charge'
  AND state IN ('succeeded', 'timed_out');

If rowsAffected is zero, this step was already compensated, or was never in a compensatable state. Either way you stop. The external refund call goes after the guard, keyed on the same idempotency key as the original charge, so the provider also refuses the second one.

func (s *Saga) compensateCharge(
    ctx context.Context,
) error {
    n, err := s.store.MarkCompensated(ctx, s.id, "charge")
    if err != nil {
        return err
    }
    if n == 0 {
        return nil // already done, or nothing to undo
    }
    // keyed on the saga's charge key: a duplicate
    // refund request is a no-op on the provider side.
    return s.payments.Refund(ctx, s.chargeKey)
}

Two layers, same shape as a hardened consumer: the database guard stops the duplicate locally, the idempotency key stops it at the provider. The compensation can run a hundred times and the customer gets refunded once.
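
One caveat with marking the row before calling the provider: if the refund call fails, or the worker dies in between, a retry sees zero rows affected and stops, and the refund never goes out. A possible variation, sketched here with an in-memory map standing in for saga_steps, uses an intermediate state. The `compensating` state, `stepStore`, and `claim` are assumptions for illustration, not part of the schema above.

```go
package main

import (
	"errors"
	"sync"
)

// errProviderDown simulates a refund call that fails.
var errProviderDown = errors.New("provider down")

// stepStore is a hypothetical in-memory stand-in for saga_steps.
type stepStore struct {
	mu    sync.Mutex
	state map[string]string // step name -> state
}

// claim moves a step to next if it is currently in one of the from
// states, and reports whether the transition happened. This is the
// in-memory equivalent of checking rowsAffected.
func (s *stepStore) claim(step, next string, from ...string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, f := range from {
		if s.state[step] == f {
			s.state[step] = next
			return true
		}
	}
	return false
}

// compensate marks the step compensated only after the refund call
// returns without error, so a failed refund can be retried.
func compensate(s *stepStore, refund func() error) error {
	// "compensating" can be claimed again on purpose: a worker that
	// died after claiming must not block the retry.
	if !s.claim("charge", "compensating",
		"succeeded", "timed_out", "compensating") {
		return nil // already compensated, or nothing to undo
	}
	if err := refund(); err != nil {
		return err // state stays "compensating" for the next attempt
	}
	s.claim("charge", "compensated", "compensating")
	return nil
}
```

The trade-off: because `compensating` can be claimed again, two workers can both reach the refund call, so this version leans harder on the provider's idempotency key to collapse the duplicates.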

The orchestrator needs the timeout to be durable

A timeout that lives only in memory dies with the process. If your worker holds a 30-second timer and the pod gets evicted at second 20, the timer is gone, the saga is stuck halfway, and nothing ever fires the compensation. The order sits in pending forever.

So the deadline has to be persisted, not held in a goroutine or a setTimeout. Write the deadline to the saga row when you dispatch the step. A separate sweeper scans for steps past their deadline and drives them into reconciliation.

-- find steps whose deadline has passed and
-- that nobody has resolved yet
SELECT saga_id, step
FROM saga_steps
WHERE state = 'pending'
  AND deadline_at < now()
LIMIT 100
FOR UPDATE SKIP LOCKED;

FOR UPDATE SKIP LOCKED lets several sweeper instances share the work without stepping on each other. Each grabs a disjoint batch, transitions the rows to timed_out in the same transaction (the row locks only last until commit), and kicks off reconciliation. Now the timeout survives a deploy, an OOM, a rebalance. It's a row, not a timer.
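
The same idea without a database, as a sketch you can run: an in-memory store where a mutex plays the role of the row locks. `memStore`, `dispatch`, `sweep`, and `sweepExample` are hypothetical names, and the clock is passed in so the behaviour can be checked without sleeping.

```go
package main

import (
	"sync"
	"time"
)

type step struct {
	sagaID   string
	name     string
	state    string // "pending" or "timed_out" in this sketch
	deadline time.Time
}

// memStore is a hypothetical in-memory stand-in for saga_steps.
type memStore struct {
	mu    sync.Mutex
	steps []*step
}

// dispatch records the deadline at the moment the step is sent, so
// the timeout does not depend on any in-process timer.
func (m *memStore) dispatch(sagaID, name string, now time.Time, timeout time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.steps = append(m.steps, &step{
		sagaID:   sagaID,
		name:     name,
		state:    "pending",
		deadline: now.Add(timeout),
	})
}

// sweep claims up to limit pending steps whose deadline has passed,
// marks them timed_out, and returns their saga IDs so the caller can
// start reconciliation.
func (m *memStore) sweep(now time.Time, limit int) []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	var claimed []string
	for _, s := range m.steps {
		if len(claimed) == limit {
			break
		}
		if s.state == "pending" && s.deadline.Before(now) {
			s.state = "timed_out"
			claimed = append(claimed, s.sagaID)
		}
	}
	return claimed
}

// sweepExample dispatches two steps at t0 and sweeps three times
// around the 30-second deadline of the first one.
func sweepExample() (early, late, again []string) {
	t0 := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	m := &memStore{}
	m.dispatch("saga-1", "charge", t0, 30*time.Second)
	m.dispatch("saga-2", "charge", t0, 5*time.Minute)
	early = m.sweep(t0.Add(10*time.Second), 100) // nothing expired yet
	late = m.sweep(t0.Add(40*time.Second), 100)  // saga-1 is past its deadline
	again = m.sweep(t0.Add(41*time.Second), 100) // saga-1 was already claimed
	return
}
```

Nothing in `sweep` cares whether the process that dispatched the step is still alive. Any worker that runs it after the deadline picks the step up.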

Testing the path you keep skipping

The reason the timeout path breaks in production is that it's the one path your tests don't exercise, because it's annoying to simulate. The decline is a fixture. The timeout is a clock and a flaky dependency. So people skip it.

Make the timeout a thing you inject. Wrap the step call so a test can force the "request sent, response never arrived" outcome.

type stepResult struct {
    state StepState
}

// faultyPayment lets a test choose what the
// remote side did versus what we observed.
type faultyPayment struct {
    actuallyCharged bool // what the provider did
    respondInTime   bool // what we saw
    refundCount     int  // refunds the saga issued
}

// Pointer receivers: Refund mutates refundCount
// and the test reads it back afterwards.
func (f *faultyPayment) Charge(
    ctx context.Context, key string,
) (stepResult, error) {
    if !f.respondInTime {
        // we time out; provider may have charged
        return stepResult{StepTimedOut},
            context.DeadlineExceeded
    }
    if !f.actuallyCharged {
        return stepResult{StepFailed}, nil
    }
    return stepResult{StepSucceeded}, nil
}

// Refund counts calls so a test can assert that
// no live charge was compensated.
func (f *faultyPayment) Refund(
    ctx context.Context, key string,
) error {
    f.refundCount++
    return nil
}

Now you can write the test that actually matters: the provider charged, but you timed out.

func TestTimeoutButChargeSucceeded(t *testing.T) {
    ctx := context.Background()
    pay := &faultyPayment{
        actuallyCharged: true,  // money moved
        respondInTime:   false, // we never saw it
    }
    saga := newTestSaga(pay)

    saga.runChargeStep(ctx)

    // the step must NOT be marked failed-and-done.
    // it must reconcile and discover the charge.
    if got := saga.stepState("charge"); got != StepSucceeded {
        t.Fatalf("reconciliation missed the charge: %v", got)
    }
    // and we must not have refunded a live charge
    if pay.refundCount != 0 {
        t.Fatalf("compensated a successful charge")
    }
}

Three test cases cover the timeout path properly:

Timed out, remote did nothing. Reconciliation reports StepFailed, compensation runs for the steps before it (the inventory is released), and there is no charge to refund.
Timed out, remote succeeded. Reconciliation reports StepSucceeded, the saga continues forward instead of compensating a charge that took the customer's money.
Timed out, reconciliation also times out. The step stays StepTimedOut, the sweeper retries it later, and nothing fires compensation prematurely.
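
Those three cases fit in one table. A minimal sketch, assuming a hypothetical `lookupFn` that stands in for the provider's query by idempotency key and a `resolveTimeout` helper that is not from the code above:

```go
package main

import "errors"

type StepState int

const (
	StepPending StepState = iota
	StepSucceeded
	StepFailed
	StepTimedOut
)

var (
	errNotFound      = errors.New("charge not found")
	errLookupTimeout = errors.New("lookup timed out")
)

// lookupFn is a hypothetical stand-in for the provider lookup: it
// reports whether the charge succeeded, or an error.
type lookupFn func() (succeeded bool, err error)

// resolveTimeout turns a timed-out step into a state plus a decision
// on whether compensation may start.
func resolveTimeout(l lookupFn) (state StepState, compensate bool) {
	ok, err := l()
	switch {
	case errors.Is(err, errNotFound):
		return StepFailed, true // definite: the charge never landed
	case err != nil:
		return StepTimedOut, false // still unknown: retry later
	case ok:
		return StepSucceeded, false // money moved: continue forward
	default:
		return StepFailed, true
	}
}

// timeoutCases lists the three scenarios in the same order as above.
var timeoutCases = []struct {
	name           string
	lookup         lookupFn
	wantState      StepState
	wantCompensate bool
}{
	{"remote did nothing",
		func() (bool, error) { return false, errNotFound },
		StepFailed, true},
	{"remote succeeded",
		func() (bool, error) { return true, nil },
		StepSucceeded, false},
	{"reconciliation timed out",
		func() (bool, error) { return false, errLookupTimeout },
		StepTimedOut, false},
}
```

Looping over `timeoutCases` and comparing `resolveTimeout(c.lookup)` against the expected pair gives you all three assertions in a few lines of test code.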

The second case is the one that gets skipped, and it's the one that double-charges customers or strands their money. Write it first.

What to check in your own saga this week

Three questions, in order of how much they'll hurt if the answer is no.

Does a step timeout go to a distinct state, or does it fall into the same branch as a hard failure? If it shares the branch, you compensate against unknown state and you'll eventually refund a charge that hasn't settled or release inventory you actually sold.

Is the compensation idempotent end to end — a local state guard plus an idempotency key at the external call? If only one of those exists, a redelivered compensation message or a racing worker will run it twice.

Is the deadline persisted, with a sweeper driving expired steps? If the timeout lives in process memory, a deploy in the wrong second leaves sagas stuck halfway with no one to finish them.

The happy path and the clean-failure path are the easy two-thirds. The timeout is the third that decides whether your saga is correct or just usually correct.

If this was useful

Sagas are the chapter where event-driven systems stop being a diagram and start being an on-call rotation. The Event-Driven Architecture Pocket Guide walks through orchestrated and choreographed sagas, the reconciliation patterns for ambiguous step state, and the compensation traps that don't show up until a dependency goes slow instead of down. If the timeout path in your own saga still feels untested, that's the part of the book worth reading first.