- Book: Hexagonal Architecture in Go
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
A team I talked to last spring shipped a "simple" checkout flow.
The order service called the inventory service, then the payment
service, then the shipping service. Three HTTP calls in a row,
inside one HTTP handler. It worked in staging. It died in
production the first time the payment service returned 504 after
the card was charged but before the response came back.
The order showed failed. The customer's card was charged. The
warehouse never reserved the item. Three different systems, three
different stories, no single place to ask "what happened?".
That flow needed a saga (the saga pattern, originally described
by Garcia-Molina and Salem in their 1987 paper).
The most boring, debuggable shape it can take in Go is an
explicit state machine, persisted to Postgres, driven by a single
goroutine that picks up where the last crash left off.
Choreography vs orchestration, fast
Two ways to coordinate a multi-step workflow across services:
- Choreography. Each service emits events. Other services listen. No central coordinator. Easy to start, painful to reason about. The workflow only exists in your head and in Grafana.
- Orchestration. One coordinator owns the workflow. It calls each step, records the result, and decides what comes next. The whole flow lives in one file you can read top-to-bottom.
This post is about orchestration. The win is operational: when a
support ticket says "where is my order?", you open one row in one
table and see the exact state.
The state machine
The order workflow has four forward states, a compensating state,
and a terminal pair:
created
  → inventory_reserved
  → card_charged
  → shipped
  → completed              (terminal happy path)

any forward step fails
  → failed_compensating
      → failed             (terminal sad path)
      → failed_stuck       (compensation gave up; needs a human)
Each transition is a step. Each step has a compensation that
runs on the way back if a later step fails.
package saga

type State string

const (
    StateCreated            State = "created"
    StateInventoryReserved  State = "inventory_reserved"
    StateCardCharged        State = "card_charged"
    StateShipped            State = "shipped"
    StateCompleted          State = "completed"
    StateFailedCompensating State = "failed_compensating"
    StateFailed             State = "failed"
    StateFailedStuck        State = "failed_stuck"
)
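The legal edges between those states can also be encoded as data. This guard is not part of the runner below; it is a hypothetical addition you might bolt on so an illegal transition is caught before it ever reaches Postgres:

```go
package main

import "fmt"

type State string

const (
    StateCreated            State = "created"
    StateInventoryReserved  State = "inventory_reserved"
    StateCardCharged        State = "card_charged"
    StateShipped            State = "shipped"
    StateCompleted          State = "completed"
    StateFailedCompensating State = "failed_compensating"
    StateFailed             State = "failed"
    StateFailedStuck        State = "failed_stuck"
)

// validNext lists every legal transition. Each forward state that
// makes an outbound call can also fall into failed_compensating.
var validNext = map[State][]State{
    StateCreated:            {StateInventoryReserved, StateFailedCompensating},
    StateInventoryReserved:  {StateCardCharged, StateFailedCompensating},
    StateCardCharged:        {StateShipped, StateFailedCompensating},
    StateShipped:            {StateCompleted},
    StateFailedCompensating: {StateFailed, StateFailedStuck},
}

// CanTransition reports whether from -> to is a legal edge.
func CanTransition(from, to State) bool {
    for _, next := range validNext[from] {
        if next == to {
            return true
        }
    }
    return false
}

func main() {
    fmt.Println(CanTransition(StateCreated, StateInventoryReserved)) // true
    fmt.Println(CanTransition(StateShipped, StateFailed))            // false
}
```

A map like this doubles as documentation: the diagram above and the code can be diffed against each other.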
The saga itself is a plain struct. No framework, no embedded
machinery. Just the data the workflow needs.
type OrderSaga struct {
    ID          string
    OrderID     string
    CustomerID  string
    AmountCents int64
    SKU         string
    Quantity    int

    State     State
    LastError string
    Attempts  int
    UpdatedAt time.Time

    LockedBy   string
    LeaseUntil time.Time
}
The State field is the source of truth. Everything else is
data the steps need to do their work, plus two columns
(LockedBy, LeaseUntil) that the lease-based concurrency
control uses — covered below.
One step, one transition, one idempotency key
Every forward step takes the saga, calls one outbound port, and
returns the next state. The contract is small enough to fit in
your head:
type Step interface {
    Name() string
    Forward(ctx context.Context, s *OrderSaga) (State, error)
    Compensate(ctx context.Context, s *OrderSaga) error
}
The idempotency key for each step is derived from the saga ID
and the step name. That way, calling Forward twice for the
same step on the same saga reaches the same downstream record.
The inventory service deduplicates by key, the payment provider
treats it as the same charge, and the shipping API returns the
same shipment.
func stepKey(sagaID, step string) string {
    return fmt.Sprintf("%s:%s", sagaID, step)
}
Boring, but it is the difference between a saga that retries
safely and a saga that double-charges customers.
A real step: charging the card
Here is the charge step. It depends on a payment port the domain
defines (PaymentGateway), and it returns the next state on
success or an error on failure. No HTTP, no payment-provider SDK
in this file. That's an adapter's job.
type ChargeCardStep struct {
    payments PaymentGateway
}

func (ChargeCardStep) Name() string { return "charge_card" }

func (s ChargeCardStep) Forward(
    ctx context.Context, sg *OrderSaga,
) (State, error) {
    key := stepKey(sg.ID, s.Name())
    err := s.payments.Charge(ctx, PaymentRequest{
        IdempotencyKey: key,
        CustomerID:     sg.CustomerID,
        AmountCents:    sg.AmountCents,
    })
    if err != nil {
        return StateFailedCompensating, err
    }
    return StateCardCharged, nil
}

func (s ChargeCardStep) Compensate(
    ctx context.Context, sg *OrderSaga,
) error {
    key := stepKey(sg.ID, s.Name()) + ":refund"
    return s.payments.Refund(ctx, RefundRequest{
        IdempotencyKey: key,
        CustomerID:     sg.CustomerID,
        AmountCents:    sg.AmountCents,
    })
}
Forward charges. Compensate refunds. Both use idempotency
keys derived from the saga ID, so a worker that crashes mid-call
and resumes will not double-charge or double-refund.
The inventory and shipping steps follow the same shape:
ReserveInventoryStep.Forward reserves stock, Compensate
releases it. ShipOrderStep.Forward creates a shipment,
Compensate cancels it. Same contract, different ports.
The runner
The runner is the orchestrator. It loads a saga, looks at its
state, runs the matching step, and writes the new state back.
Crash anywhere and the next run resumes from the persisted state.
The runner holds named references to each step rather than a
positional slice, so a reorder cannot silently break the switch:
type Runner struct {
    repo     Repository
    workerID string

    reserveStep Step
    chargeStep  Step
    shipStep    Step
}
func (r *Runner) Step(
    ctx context.Context, sagaID string,
) error {
    sg, err := r.repo.AcquireLease(ctx, sagaID, r.workerID)
    if err != nil {
        return err
    }
    defer r.repo.ReleaseLease(ctx, sagaID, r.workerID)

    switch sg.State {
    case StateCreated:
        return r.advance(ctx, sg, r.reserveStep)
    case StateInventoryReserved:
        return r.advance(ctx, sg, r.chargeStep)
    case StateCardCharged:
        return r.advance(ctx, sg, r.shipStep)
    case StateShipped:
        sg.State = StateCompleted
        return r.repo.Save(ctx, sg)
    case StateFailedCompensating:
        return r.compensate(ctx, sg)
    default:
        return nil
    }
}
The StateShipped → StateCompleted transition is set by the
runner directly rather than by a fourth step, because there is
no outbound call to make at that point. Worth calling out so the
contract above is unambiguous: every forward step exists when
there is an external service to talk to; "completed" is just a
status flip.
advance is one place. It runs Forward, records the new
state, and saves. If the step returns an error, the saga moves
to failed_compensating and the next iteration walks back.
func (r *Runner) advance(
    ctx context.Context, sg *OrderSaga, step Step,
) error {
    next, err := step.Forward(ctx, sg)
    sg.Attempts++
    sg.UpdatedAt = time.Now()
    if err != nil {
        sg.LastError = err.Error()
        sg.State = StateFailedCompensating
        // Reset so compensationBudget counts compensation
        // attempts only, not the forward tries that led here.
        sg.Attempts = 0
        return r.repo.Save(ctx, sg)
    }
    sg.State = next
    sg.LastError = ""
    return r.repo.Save(ctx, sg)
}
Compensation walks the steps in reverse from the last successful
one. Reserve compensates if the saga got past inventory_reserved.
Charge compensates if it got past card_charged. Shipping never
needs to compensate here, because the only way to be past
shipped is to be completed.
compensationOrder returns only the steps that actually ran,
based on the saga's last successful state. It does not blindly
call Compensate on every step. That would refund a card that
never got charged.
const compensationBudget = 5

func (r *Runner) compensationOrder(
    sg *OrderSaga,
) []Step {
    switch sg.State {
    case StateFailedCompensating:
        // The saga flipped to failed_compensating from the step
        // that just failed. The steps that ran successfully
        // before it are the ones to compensate, in reverse.
        switch lastSuccessful(sg) {
        case StateCardCharged:
            return []Step{r.chargeStep, r.reserveStep}
        case StateInventoryReserved:
            return []Step{r.reserveStep}
        }
    }
    return nil
}
func lastSuccessful(sg *OrderSaga) State {
    // Stub. A real implementation reads a separate last_state
    // column (omitted from the struct above for brevity) that
    // advance() writes alongside the failure transition: the
    // highest forward state reached before the flip to
    // failed_compensating. As written, this returns
    // failed_compensating itself and compensationOrder matches
    // nothing, so wire up last_state before relying on it.
    return sg.State
}
Compensation is bounded. Each compensation attempt increments
Attempts. After compensationBudget failed attempts the saga
moves to failed_stuck so a human can inspect it instead of
the worker hammering the same dead downstream forever.
func (r *Runner) compensate(
    ctx context.Context, sg *OrderSaga,
) error {
    if sg.Attempts >= compensationBudget {
        sg.State = StateFailedStuck
        sg.UpdatedAt = time.Now()
        return r.repo.Save(ctx, sg)
    }
    order := r.compensationOrder(sg)
    for _, step := range order {
        if err := step.Compensate(ctx, sg); err != nil {
            sg.LastError = err.Error()
            sg.Attempts++
            sg.UpdatedAt = time.Now()
            return r.repo.Save(ctx, sg)
        }
    }
    sg.State = StateFailed
    sg.UpdatedAt = time.Now()
    return r.repo.Save(ctx, sg)
}
A failed compensation now has a real ceiling. The saga goes to
failed_stuck, the dashboard query in the polling section
filters it out of the active queue, and an operator gets alerted.
Persisting state in Postgres
The persistence shape is one table. The state column is what the
runner reads on every tick. The lease columns (locked_by,
lease_until) are what keeps two workers from running the same
saga concurrently — explained right after the schema.
CREATE TABLE order_sagas (
    id           UUID PRIMARY KEY,
    order_id     UUID NOT NULL,
    customer_id  UUID NOT NULL,
    amount_cents BIGINT NOT NULL,
    sku          TEXT NOT NULL,
    quantity     INT NOT NULL,
    state        TEXT NOT NULL,
    last_error   TEXT NOT NULL DEFAULT '',
    attempts     INT NOT NULL DEFAULT 0,
    updated_at   TIMESTAMPTZ NOT NULL,
    locked_by    TEXT,
    lease_until  TIMESTAMPTZ
);

CREATE INDEX ON order_sagas (state, updated_at);
CREATE INDEX ON order_sagas (lease_until)
    WHERE locked_by IS NOT NULL;
The naive way to lock a saga is SELECT ... FOR UPDATE inside a
transaction. The trap is that the transaction has to stay open
for the entire run-and-save cycle, and any commit (even an
implicit one) drops the lock. A cleaner pattern for polled
workers is a lease: a conditional UPDATE that flips
locked_by and lease_until if and only if no other worker
already holds it.
type PgRepo struct {
    db *sql.DB
}

const leaseDuration = 30 * time.Second

func (r *PgRepo) AcquireLease(
    ctx context.Context, id, workerID string,
) (*OrderSaga, error) {
    row := r.db.QueryRowContext(ctx, `
        UPDATE order_sagas
        SET locked_by = $1,
            lease_until = now() + $2::interval
        WHERE id = $3
          AND (locked_by IS NULL
               OR lease_until < now()
               OR locked_by = $1)
        RETURNING id, order_id, customer_id, amount_cents,
                  sku, quantity, state, last_error,
                  attempts, updated_at,
                  locked_by, lease_until`,
        workerID,
        fmt.Sprintf("%d seconds", int(leaseDuration.Seconds())),
        id,
    )

    var s OrderSaga
    var lockedBy sql.NullString
    var leaseUntil sql.NullTime
    if err := row.Scan(
        &s.ID, &s.OrderID, &s.CustomerID,
        &s.AmountCents, &s.SKU, &s.Quantity,
        &s.State, &s.LastError, &s.Attempts,
        &s.UpdatedAt, &lockedBy, &leaseUntil,
    ); err != nil {
        if errors.Is(err, sql.ErrNoRows) {
            return nil, ErrLeaseHeldByOther
        }
        return nil, err
    }
    s.LockedBy = lockedBy.String
    s.LeaseUntil = leaseUntil.Time
    return &s, nil
}
func (r *PgRepo) ReleaseLease(
    ctx context.Context, id, workerID string,
) error {
    _, err := r.db.ExecContext(ctx, `
        UPDATE order_sagas
        SET locked_by = NULL, lease_until = NULL
        WHERE id = $1 AND locked_by = $2`,
        id, workerID,
    )
    return err
}
The UPDATE is atomic. Either this worker now holds the lease
or it does not, decided in a single round-trip. The WHERE
clause accepts three cases: the row is unleased, the previous
lease has expired (the worker that held it crashed without
releasing), or this same worker is reacquiring (idempotent).
Save is a normal UPDATE on the saga columns and is safe
because only the lease holder is supposed to call it; if you
want belt-and-braces, add AND locked_by = $1 to Save too.
ErrLeaseHeldByOther is the signal to the caller to skip this
saga and try a different one. It is not an error worth alerting
on; it is the mechanism by which N workers share the queue.
Polling for work
The runner picks up sagas in non-terminal states by polling.
A simple query keeps the worker fed without picking up sagas
that are already leased or already done:
SELECT id
FROM order_sagas
WHERE state NOT IN ('completed', 'failed', 'failed_stuck')
AND (locked_by IS NULL OR lease_until < now())
ORDER BY updated_at
LIMIT 100;
The terminal states (completed, failed, failed_stuck) are
the only ones the worker ignores. Everything else is fair game,
including failed_compensating — that is how compensation gets
re-driven on the next tick if it failed mid-walk.
Why a state machine, not a linear function
The first version of every saga is a function with five HTTP
calls in a row and a defer for cleanup. That works until the
process restarts mid-flight. Then the workflow is gone. There
is no pending charge to retry, no reserved stock to release.
The state machine pays for itself the first time a worker pod
gets killed during a deploy. The saga row stays in
card_charged. The next worker picks it up, sees the state,
and runs the shipping step. The customer never knows there was
a restart.
It also gives you something to point support at. "Order #1234
is stuck in failed_compensating, last error was 'inventory
service: 503'." That is a sentence a human can act on, not a
needle hunt across three log streams.
What to try next
Add observability. Wrap advance and compensate with a span
per step. The state machine already gives you the names. Every
span name is a step name, every state transition is an event.
You will get a trace timeline that mirrors the workflow exactly,
which is the second-best thing about orchestration after the
debuggable single-row state.
If this was useful
The full saga chapter is in Hexagonal Architecture in Go. It
covers the outbox-driven publish, multi-worker leasing, and the
trade-offs against choreography. The state-machine pattern shows
up again in chapters on long-running workflows and event
sourcing, where the same data shape carries you a long way
before you reach for a workflow engine. Pairs with The Complete
Guide to Go Programming if the language fundamentals are still
settling in.
