Gabriel Anhaia

Posted on May 1

Saga Orchestration in Go: A State Machine for Multi-Service Workflows

#go #architecture #distributedsystems #backend

Book: Hexagonal Architecture in Go
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

A team I talked to last spring shipped a "simple" checkout flow.
The order service called the inventory service, then the payment
service, then the shipping service. Three HTTP calls in a row,
inside one HTTP handler. It worked in staging. It died in
production the first time the payment service returned 504 after
the card was charged but before the response came back.

The order showed failed. The customer's card was charged. The
warehouse never reserved the item. Three different systems, three
different stories, no single place to ask "what happened?".

That flow needed a saga (the saga pattern, originally described
by Garcia-Molina and Salem in 1987 — paper).
The most boring, debuggable shape it can take in Go is an
explicit state machine, persisted to Postgres, driven by a single
goroutine that picks up where the last crash left off.

Choreography vs orchestration, fast

Two ways to coordinate a multi-step workflow across services:

Choreography. Each service emits events. Other services listen. No central coordinator. Easy to start, painful to reason about. The workflow only exists in your head and in Grafana.
Orchestration. One coordinator owns the workflow. It calls each step, records the result, and decides what comes next. The whole flow lives in one file you can read top-to-bottom.

This post is about orchestration. The win is operational: when a
support ticket says "where is my order?", you open one row in one
table and see the exact state.

The state machine

The order workflow has four forward states, a compensating state,
and a terminal pair:

created
  → inventory_reserved
    → card_charged
      → shipped
        → completed   (terminal happy path)

  → failed_compensating
    → failed          (terminal sad path)
    → failed_stuck    (compensation gave up — needs human)

Each transition is a step. Each step has a compensation that
runs on the way back if a later step fails.

package saga

type State string

const (
    StateCreated            State = "created"
    StateInventoryReserved  State = "inventory_reserved"
    StateCardCharged        State = "card_charged"
    StateShipped            State = "shipped"
    StateCompleted          State = "completed"
    StateFailedCompensating State = "failed_compensating"
    StateFailed             State = "failed"
    StateFailedStuck        State = "failed_stuck"
)

The saga itself is a plain struct. No framework, no embedded
machinery. Just the data the workflow needs.

type OrderSaga struct {
    ID          string
    OrderID     string
    CustomerID  string
    AmountCents int64
    SKU         string
    Quantity    int

    State       State
    LastError   string
    Attempts    int
    UpdatedAt   time.Time

    LockedBy   string
    LeaseUntil time.Time
}

The State field is the source of truth. Everything else is
data the steps need to do their work, plus two columns
(LockedBy, LeaseUntil) that the lease-based concurrency
control uses — covered below.

One step, one transition, one idempotency key

Every forward step takes the saga, calls one outbound port, and
returns the next state. The contract is small enough to fit in
your head:

type Step interface {
    Name() string
    Forward(
        ctx context.Context, s *OrderSaga,
    ) (State, error)
    Compensate(
        ctx context.Context, s *OrderSaga,
    ) error
}

The idempotency key for each step is derived from the saga ID
and the step name. That way, calling Forward twice for the
same step on the same saga reaches the same downstream record.
The inventory service deduplicates by key, the payment provider
treats it as the same charge, and the shipping API returns the
same shipment.

func stepKey(sagaID, step string) string {
    return fmt.Sprintf("%s:%s", sagaID, step)
}

Boring, but it is the difference between a saga that retries
safely and a saga that double-charges customers.

A real step: charging the card

Here is the charge step. It depends on a payment port the domain
defines (PaymentGateway), and it returns the next state on
success or an error on failure. No HTTP, no payment-provider SDK
in this file. That's an adapter's job.

type ChargeCardStep struct {
    payments PaymentGateway
}

func (ChargeCardStep) Name() string { return "charge_card" }

func (s ChargeCardStep) Forward(
    ctx context.Context, sg *OrderSaga,
) (State, error) {
    key := stepKey(sg.ID, s.Name())
    err := s.payments.Charge(ctx, PaymentRequest{
        IdempotencyKey: key,
        CustomerID:     sg.CustomerID,
        AmountCents:    sg.AmountCents,
    })
    if err != nil {
        return StateFailedCompensating, err
    }
    return StateCardCharged, nil
}

func (s ChargeCardStep) Compensate(
    ctx context.Context, sg *OrderSaga,
) error {
    key := stepKey(sg.ID, s.Name()) + ":refund"
    return s.payments.Refund(ctx, RefundRequest{
        IdempotencyKey: key,
        CustomerID:     sg.CustomerID,
        AmountCents:    sg.AmountCents,
    })
}

Forward charges. Compensate refunds. Both use idempotency
keys derived from the saga ID, so a worker that crashes mid-call
and resumes will not double-charge or double-refund.

The inventory and shipping steps follow the same shape:
ReserveInventoryStep.Forward reserves stock, Compensate
releases it. ShipOrderStep.Forward creates a shipment,
Compensate cancels it. Same contract, different ports.

The runner

The runner is the orchestrator. It loads a saga, looks at its
state, runs the matching step, and writes the new state back.
Crash anywhere and the next run resumes from the persisted state.

The runner holds named references to each step rather than a
positional slice, so a reorder cannot silently break the switch:

type Runner struct {
    repo         Repository
    workerID     string
    reserveStep  Step
    chargeStep   Step
    shipStep     Step
}

func (r *Runner) Step(
    ctx context.Context, sagaID string,
) error {
    sg, err := r.repo.AcquireLease(ctx, sagaID, r.workerID)
    if err != nil {
        return err
    }
    defer r.repo.ReleaseLease(ctx, sagaID, r.workerID)

    switch sg.State {
    case StateCreated:
        return r.advance(ctx, sg, r.reserveStep)
    case StateInventoryReserved:
        return r.advance(ctx, sg, r.chargeStep)
    case StateCardCharged:
        return r.advance(ctx, sg, r.shipStep)
    case StateShipped:
        sg.State = StateCompleted
        return r.repo.Save(ctx, sg)
    case StateFailedCompensating:
        return r.compensate(ctx, sg)
    default:
        return nil
    }
}

The StateShipped → StateCompleted transition is set by the
runner directly rather than by a fourth step, because there is
no outbound call to make at that point. Worth calling out so the
contract above is unambiguous: every forward step exists when
there is an external service to talk to; "completed" is just a
status flip.

advance is one place. It runs Forward, records the new
state, and saves. If the step returns an error, the saga moves
to failed_compensating and the next iteration walks back.

func (r *Runner) advance(
    ctx context.Context, sg *OrderSaga, step Step,
) error {
    next, err := step.Forward(ctx, sg)
    sg.Attempts++
    sg.UpdatedAt = time.Now()

    if err != nil {
        sg.LastError = err.Error()
        sg.State = StateFailedCompensating
        return r.repo.Save(ctx, sg)
    }

    sg.State = next
    sg.LastError = ""
    return r.repo.Save(ctx, sg)
}

Compensation walks the steps in reverse from the last successful
one. Reserve compensates if the saga got past inventory_reserved.
Charge compensates if it got past card_charged. Shipping never
needs to compensate here, because the only way to be past
shipped is to be completed.

compensationOrder returns only the steps that actually ran,
based on the saga's last successful state. It does not blindly
call Compensate on every step. That would refund a card that
never got charged.

const compensationBudget = 5

func (r *Runner) compensationOrder(
    sg *OrderSaga,
) []Step {
    switch sg.State {
    case StateFailedCompensating:
        // The state moved to failed_compensating from the
        // step that just failed. The saga's Attempts counter
        // and LastError tell us which step failed; the steps
        // that ran successfully before it are the ones to
        // compensate, in reverse.
        switch lastSuccessful(sg) {
        case StateCardCharged:
            return []Step{r.chargeStep, r.reserveStep}
        case StateInventoryReserved:
            return []Step{r.reserveStep}
        }
    }
    return nil
}

func lastSuccessful(sg *OrderSaga) State {
    // The saga records the highest forward state it reached
    // before flipping to failed_compensating. In practice
    // this is stored in a separate column (last_state) that
    // advance() writes alongside the failure transition;
    // omitted from the struct above for brevity.
    return sg.State
}

Compensation is bounded. Each compensation attempt increments
Attempts. After compensationBudget failed attempts the saga
moves to failed_stuck so a human can inspect it instead of
the worker hammering the same dead downstream forever.

func (r *Runner) compensate(
    ctx context.Context, sg *OrderSaga,
) error {
    if sg.Attempts >= compensationBudget {
        sg.State = StateFailedStuck
        sg.UpdatedAt = time.Now()
        return r.repo.Save(ctx, sg)
    }

    order := r.compensationOrder(sg)
    for _, step := range order {
        if err := step.Compensate(ctx, sg); err != nil {
            sg.LastError = err.Error()
            sg.Attempts++
            sg.UpdatedAt = time.Now()
            return r.repo.Save(ctx, sg)
        }
    }
    sg.State = StateFailed
    sg.UpdatedAt = time.Now()
    return r.repo.Save(ctx, sg)
}

A failed compensation now has a real ceiling. The saga goes to
failed_stuck, the dashboard query in the polling section
filters it out of the active queue, and an operator gets alerted.

Persisting state in Postgres

The persistence shape is one table. The state column is what the
runner reads on every tick. The lease columns (locked_by,
lease_until) are what keeps two workers from running the same
saga concurrently — explained right after the schema.

CREATE TABLE order_sagas (
    id              UUID PRIMARY KEY,
    order_id        UUID NOT NULL,
    customer_id     UUID NOT NULL,
    amount_cents    BIGINT NOT NULL,
    sku             TEXT NOT NULL,
    quantity        INT NOT NULL,
    state           TEXT NOT NULL,
    last_error      TEXT NOT NULL DEFAULT '',
    attempts        INT NOT NULL DEFAULT 0,
    updated_at      TIMESTAMPTZ NOT NULL,
    locked_by       TEXT,
    lease_until     TIMESTAMPTZ
);
CREATE INDEX ON order_sagas (state, updated_at);
CREATE INDEX ON order_sagas (lease_until)
    WHERE locked_by IS NOT NULL;

The naive way to lock a saga is SELECT ... FOR UPDATE inside a
transaction. The trap is that the transaction has to stay open
for the entire run-and-save cycle, and any commit (even an
implicit one) drops the lock. A cleaner pattern for polled
workers is a lease: a conditional UPDATE that flips
locked_by and lease_until if and only if no other worker
already holds it.

type PgRepo struct {
    db *sql.DB
}

const leaseDuration = 30 * time.Second

func (r *PgRepo) AcquireLease(
    ctx context.Context, id, workerID string,
) (*OrderSaga, error) {
    row := r.db.QueryRowContext(ctx, `
        UPDATE order_sagas
        SET locked_by = $1,
            lease_until = now() + $2::interval
        WHERE id = $3
          AND (locked_by IS NULL
               OR lease_until < now()
               OR locked_by = $1)
        RETURNING id, order_id, customer_id, amount_cents,
                  sku, quantity, state, last_error,
                  attempts, updated_at,
                  locked_by, lease_until`,
        workerID,
        fmt.Sprintf("%d seconds", int(leaseDuration.Seconds())),
        id,
    )

    var s OrderSaga
    var lockedBy sql.NullString
    var leaseUntil sql.NullTime
    if err := row.Scan(
        &s.ID, &s.OrderID, &s.CustomerID,
        &s.AmountCents, &s.SKU, &s.Quantity,
        &s.State, &s.LastError, &s.Attempts,
        &s.UpdatedAt, &lockedBy, &leaseUntil,
    ); err != nil {
        if errors.Is(err, sql.ErrNoRows) {
            return nil, ErrLeaseHeldByOther
        }
        return nil, err
    }
    s.LockedBy = lockedBy.String
    s.LeaseUntil = leaseUntil.Time
    return &s, nil
}

func (r *PgRepo) ReleaseLease(
    ctx context.Context, id, workerID string,
) error {
    _, err := r.db.ExecContext(ctx, `
        UPDATE order_sagas
        SET locked_by = NULL, lease_until = NULL
        WHERE id = $1 AND locked_by = $2`,
        id, workerID,
    )
    return err
}

The UPDATE is atomic. Either this worker now holds the lease
or it does not, decided in a single round-trip. The WHERE
clause accepts three cases: the row is unleased, the previous
lease has expired (the worker that held it crashed without
releasing), or this same worker is reacquiring (idempotent).
Save is a normal UPDATE on the saga columns and is safe
because only the lease holder is supposed to call it; if you
want belt-and-braces, add AND locked_by = $1 to Save too.

ErrLeaseHeldByOther is the signal to the caller to skip this
saga and try a different one. It is not an error worth alerting
on; it is the mechanism by which N workers share the queue.

Polling for work

The runner picks up sagas in non-terminal states by polling.
A simple query keeps the worker fed without picking up sagas
that are already leased or already done:

SELECT id
FROM order_sagas
WHERE state NOT IN ('completed', 'failed', 'failed_stuck')
  AND (locked_by IS NULL OR lease_until < now())
ORDER BY updated_at
LIMIT 100;

The terminal states (completed, failed, failed_stuck) are
the only ones the worker ignores. Everything else is fair game,
including failed_compensating — that is how compensation gets
re-driven on the next tick if it failed mid-walk.

Why a state machine, not a linear function

The first version of every saga is a function with five HTTP
calls in a row and a defer for cleanup. That works until the
process restarts mid-flight. Then the workflow is gone. There
is no pending charge to retry, no reserved stock to release.

The state machine pays for itself the first time a worker pod
gets killed during a deploy. The saga row stays in
card_charged. The next worker picks it up, sees the state,
and runs the shipping step. The customer never knows there was
a restart.

It also gives you something to point support at. "Order #1234
is stuck in failed_compensating, last error was inventory service: 503." That is a sentence a human can act on, not a
needle hunt across three log streams.

What to try next

Add observability. Wrap advance and compensate with a span
per step. The state machine already gives you the names. Every
span name is a step name, every state transition is an event.
You will get a trace timeline that mirrors the workflow exactly,
which is the second-best thing about orchestration after the
debuggable single-row state.

If this was useful

The full saga chapter is in Hexagonal Architecture in Go. It
covers the outbox-driven publish, multi-worker leasing, and the
trade-offs against choreography. The state-machine pattern shows
up again in chapters on long-running workflows and event
sourcing, where the same data shape carries you a long way
before you reach for a workflow engine. Pairs with The Complete
Guide to Go Programming if the language fundamentals are still
settling in.