Retries and Backoff in Go: Where They Belong in a Hexagonal Service

#go #hexagonal #backend

Book: Hexagonal Architecture in Go
Also by me: The Complete Guide to Go Programming — the companion book in the Thinking in Go series
Me: xgabriel.com | GitHub

A downstream service starts returning 503 for ten seconds. Your
Go service has a retry: three attempts, 100ms apart. Every
request that lands during those ten seconds fires three times.
Your traffic to the struggling service triples at the exact
moment it can least afford it. The downstream stays down longer
because you and every other caller are hammering it in lockstep,
all retrying at the same 100ms tick.

That is a retry storm, and the fix has been known for a decade.
The AWS Architecture Blog post "Exponential Backoff And Jitter"
laid out the math: spread retries over a growing, randomized
window so callers stop synchronizing. The interesting question
for a Go service is not the formula. It is where the retry
lives. Get the placement wrong and no amount of jitter saves
you.

The domain never retries

Start with the rule, because it decides everything else.

Your domain layer holds business logic. "An order over $500
needs manager approval." "A refund cannot exceed the original
charge." None of that knows the word retry. A retry is a fact
about an unreliable network, a rate-limited API, a database
that briefly refused a connection. Those are transport
concerns. They belong to the code that talks to the transport:
the adapter.

Here is the port the domain sees. It is an interface, and it
says nothing about attempts or delays.

// port/payment.go
package port

import "context"

type PaymentGateway interface {
    Charge(
        ctx context.Context,
        orderID string,
        cents int64,
    ) (string, error)
}

The application service calls Charge once. It does not loop.
If it fails, it fails, and the domain decides what a failed
charge means for the order.

// app/checkout.go
func (s *Checkout) Confirm(
    ctx context.Context, orderID string,
) error {
    order, err := s.orders.Get(ctx, orderID)
    if err != nil {
        return err
    }
    txID, err := s.gateway.Charge(
        ctx, order.ID, order.TotalCents,
    )
    if err != nil {
        return fmt.Errorf("charge: %w", err)
    }
    order.MarkPaid(txID)
    return s.orders.Save(ctx, order)
}

If you put the retry loop in Confirm, every business method
that touches the network grows the same loop, the domain starts
importing time, and your unit tests wait on real backoff
delays. The retry has leaked out of transport and into logic.
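To make the testing point concrete, here is a minimal sketch of what a unit test sees when the application service calls the port exactly once. The names fakeGateway, confirm, errDown, and demo are hypothetical stand-ins, not code from the service above; confirm is a trimmed version of Checkout.Confirm with the repository left out.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// PaymentGateway repeats the port from the article so the
// snippet compiles on its own.
type PaymentGateway interface {
	Charge(ctx context.Context, orderID string, cents int64) (string, error)
}

// fakeGateway is a hypothetical test double. It fails every call
// and counts how many times it was invoked.
type fakeGateway struct {
	calls int
	err   error
}

func (f *fakeGateway) Charge(
	_ context.Context, _ string, _ int64,
) (string, error) {
	f.calls++
	return "", f.err
}

// confirm is a trimmed stand-in for Checkout.Confirm: it charges
// once and wraps the failure. No loop, no sleep.
func confirm(
	ctx context.Context, gw PaymentGateway, orderID string, cents int64,
) error {
	if _, err := gw.Charge(ctx, orderID, cents); err != nil {
		return fmt.Errorf("charge: %w", err)
	}
	return nil
}

var errDown = errors.New("gateway unavailable")

// demo runs confirm against the failing fake and reports
// how many calls were made and which error came back.
func demo() (calls int, err error) {
	gw := &fakeGateway{err: errDown}
	err = confirm(context.Background(), gw, "order-1", 1200)
	return gw.calls, err
}

func main() {
	calls, err := demo()
	fmt.Println(calls, err) // 1 charge: gateway unavailable
}
```

Because nothing in this layer sleeps, the test finishes immediately, and the assertion "one failed charge means one call" stays true no matter how the adapter's retry policy is tuned.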

Backoff lives in the adapter

The adapter implements PaymentGateway. That is the one place
that knows there is an HTTP call underneath, so it is the one
place that gets to retry.

The full-jitter formula from the AWS post: pick a random delay
between zero and an exponentially growing cap. Random spread is
what breaks the lockstep.

// adapter/backoff.go
package adapter

import (
    "math/rand/v2"
    "time"
)

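// fullJitter returns a uniform random delay in [0, ceiling),
// where ceiling = base * 2^attempt, clamped to limit.
// attempt is zero-based.
func fullJitter(
    attempt int, base, limit time.Duration,
) time.Duration {
    // Shift limit right instead of base left for the comparison,
    // so a large attempt can never overflow the ceiling.
    ceiling := limit
    if attempt >= 0 && base <= limit>>attempt {
        ceiling = base << attempt
    }
    if ceiling <= 0 {
        return 0 // rand.Int64N panics on a non-positive bound
    }
    return time.Duration(rand.Int64N(int64(ceiling)))
}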

Go 1.22 added math/rand/v2 to the standard library, and its
top-level functions such as rand.Int64N are safe for concurrent
use with no seeding ceremony. The attempt shift gives you base,
2×base, 4×base as the ceiling (attempt 0 is capped at base), and
each real delay is a uniform draw under that ceiling. Two callers
that fail at the same millisecond now wake up at different times.
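As a worked example of the ceiling arithmetic, the sketch below prints the upper bound of the jitter window for the first few attempts. The helper name ceiling is hypothetical, and the values 50ms and 2s are assumed here purely for illustration; the random draw is left out so the numbers are deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// ceiling returns the upper bound of the jitter window for a
// zero-based attempt: base * 2^attempt, clamped to limit.
func ceiling(attempt int, base, limit time.Duration) time.Duration {
	// Compare against limit>>attempt so the left shift below
	// is only performed when it cannot overflow past limit.
	if attempt >= 0 && base <= limit>>attempt {
		return base << attempt
	}
	return limit
}

func main() {
	const base = 50 * time.Millisecond
	const limit = 2 * time.Second
	for attempt := 0; attempt < 7; attempt++ {
		fmt.Println(attempt, ceiling(attempt, base, limit))
	}
}
```

With these inputs the ceilings are 50ms, 100ms, 200ms, 400ms, 800ms, 1.6s, and then 2s for every later attempt. Each actual delay is a uniform draw below the ceiling for that attempt.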

Respect the context deadline

This is the part most retry loops get wrong. A retry that
ignores the caller's deadline is worse than no retry. If the
HTTP handler upstream has a 2-second budget and your adapter
sleeps for 4 seconds across attempts, you have burned the
budget on waiting and returned nothing.

The loop has to watch two clocks: the backoff timer and
ctx.Done(). Whichever fires first wins.

// adapter/retry.go
package adapter

import (
    "context"
    "time"
)

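func retry(
    ctx context.Context,
    attempts int,
    op func(context.Context) error,
) error {
    const base = 50 * time.Millisecond
    const maxDelay = 2 * time.Second

    if attempts < 1 {
        // Always try once; a zero count would otherwise
        // return nil without ever calling op.
        attempts = 1
    }
    var err error
    for attempt := 0; attempt < attempts; attempt++ {
        err = op(ctx)
        if err == nil {
            return nil
        }
        if !retryable(err) {
            return err
        }
        if attempt == attempts-1 {
            break // out of attempts: do not sleep before giving up
        }
        delay := fullJitter(attempt, base, maxDelay)

        t := time.NewTimer(delay)
        select {
        case <-ctx.Done():
            t.Stop()
            return ctx.Err()
        case <-t.C:
        }
    }
    return err
}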

The select is the whole point. If the context is cancelled or
its deadline passes while you are waiting out the backoff, you
stop immediately and return ctx.Err(). You never sleep past
the budget the caller handed you. time.NewTimer plus
t.Stop() releases the timer right away when the context wins
the race; before Go 1.23 an unstopped timer was not collected
until it fired.
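The two-clock wait can be seen in isolation with a small sketch. The helper sleepCtx and the demo function are hypothetical names, and the 20ms budget and 1s backoff are made-up numbers chosen so the deadline clearly wins.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sleepCtx waits for d or until ctx is done, whichever comes first.
func sleepCtx(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}

// demo gives the caller a 20ms budget and then asks for a 1s
// backoff. It reports whether the wait ended with the deadline
// error and whether it returned before the full backoff elapsed.
func demo() (hitDeadline, returnedEarly bool) {
	ctx, cancel := context.WithTimeout(
		context.Background(), 20*time.Millisecond,
	)
	defer cancel()

	start := time.Now()
	err := sleepCtx(ctx, time.Second)
	hitDeadline = errors.Is(err, context.DeadlineExceeded)
	returnedEarly = time.Since(start) < time.Second
	return hitDeadline, returnedEarly
}

func main() {
	hitDeadline, returnedEarly := demo()
	fmt.Println(hitDeadline, returnedEarly)
}
```

The wait returns as soon as the 20ms budget is gone, with context.DeadlineExceeded, instead of holding the goroutine for the full second.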

Retry the right errors, not all of them

A retry only helps for failures that might succeed next time: a
dropped connection, a 503, a 429, a timeout on the downstream.
Retrying a 400 or a "card declined" is pointless and, for a
non-idempotent charge, dangerous. You want a predicate that
decides.

// adapter/retryable.go
package adapter

import (
    "context"
    "errors"
    "net"
)

func retryable(err error) bool {
    // caller's deadline or cancel: never retry
    if errors.Is(err, context.DeadlineExceeded) ||
        errors.Is(err, context.Canceled) {
        return false
    }
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        return true
    }
    var status *StatusError
    if errors.As(err, &status) {
        return status.Code == 429 ||
            status.Code >= 500
    }
    return false
}

type StatusError struct{ Code int }

func (e *StatusError) Error() string {
    return "unexpected status"
}

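errors.As walks the wrapped chain, so a StatusError buried
under three fmt.Errorf("...: %w", err) calls still gets
matched. The default is false: an error you do not recognize
is not retried. Unknown failures should surface, not loop. As
written, that includes a connection reset that is not reported
as a timeout; add an explicit case if you want dropped
connections retried.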
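A short sketch of the chain walk, under the assumption that only the status branch of the predicate matters for the demonstration. StatusError is repeated so the snippet compiles alone (with the code added to its message), and retryableStatus, wrap3, and errUnknown are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// StatusError is repeated from the article so the snippet
// compiles on its own.
type StatusError struct{ Code int }

func (e *StatusError) Error() string {
	return fmt.Sprintf("unexpected status %d", e.Code)
}

// retryableStatus is the status-code branch of the predicate only.
func retryableStatus(err error) bool {
	var status *StatusError
	if errors.As(err, &status) {
		return status.Code == 429 || status.Code >= 500
	}
	return false // unrecognized errors are not retried
}

// wrap3 buries err under three layers of %w wrapping.
func wrap3(err error) error {
	for _, layer := range []string{"post charge", "gateway", "checkout"} {
		err = fmt.Errorf("%s: %w", layer, err)
	}
	return err
}

var errUnknown = errors.New("unrecognized failure")

func main() {
	fmt.Println(retryableStatus(wrap3(&StatusError{Code: 503}))) // true
	fmt.Println(retryableStatus(wrap3(&StatusError{Code: 400}))) // false
	fmt.Println(retryableStatus(wrap3(errUnknown)))              // false
}
```

The 503 is found three layers down and retried, the 400 is found and rejected, and the unrecognized error falls through to the default.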

Wiring it into the adapter

The gateway adapter holds the HTTP client and runs its calls
through retry. The domain and the application service never
see any of it.

// adapter/payment_http.go
package adapter

import "context"

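// HTTPGateway implements port.PaymentGateway over HTTP.
// Client is the adapter's HTTP wrapper (not shown here);
// its postCharge method performs a single request.
type HTTPGateway struct {
    client *Client
    max    int // total attempts, including the first
}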

func (g *HTTPGateway) Charge(
    ctx context.Context,
    orderID string,
    cents int64,
) (string, error) {
    var txID string
    err := retry(ctx, g.max,
        func(ctx context.Context) error {
            id, err := g.client.postCharge(
                ctx, orderID, cents,
            )
            if err != nil {
                return err
            }
            txID = id
            return nil
        },
    )
    return txID, err
}

One caveat worth stating out loud: retrying a charge is only
safe if the downstream treats it as idempotent. Send an
idempotency key with the request (most payment APIs support
one) so a retried postCharge cannot double-charge. The retry
lives in the adapter, and so does the responsibility for making
it safe. That is the right home for both.
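One way the adapter might attach such a key is sketched below. Everything specific here is an assumption: the "Idempotency-Key" header name is a common convention but your provider's documentation decides the real one, the /charges path and JSON field names are invented, and deriving the key from the order ID presumes one charge per order. The helper names newChargeRequest and keysAcrossAttempts are hypothetical.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// chargeBody is an assumed request shape, for illustration only.
type chargeBody struct {
	OrderID string `json:"order_id"`
	Cents   int64  `json:"amount_cents"`
}

// newChargeRequest builds the request for one attempt. A fresh
// request is needed per attempt because the body reader is
// consumed when the request is sent.
func newChargeRequest(
	ctx context.Context, baseURL, orderID string, cents int64,
) (*http.Request, error) {
	payload, err := json.Marshal(chargeBody{OrderID: orderID, Cents: cents})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(
		ctx, http.MethodPost, baseURL+"/charges", bytes.NewReader(payload),
	)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	// Same order, same key on every attempt, so the downstream
	// can recognize a retry (assumes one charge per order).
	req.Header.Set("Idempotency-Key", "charge-"+orderID)
	return req, nil
}

// keysAcrossAttempts builds n requests for the same order and
// returns the key each one carries.
func keysAcrossAttempts(orderID string, n int) ([]string, error) {
	keys := make([]string, 0, n)
	for i := 0; i < n; i++ {
		req, err := newChargeRequest(
			context.Background(), "https://payments.example", orderID, 1200,
		)
		if err != nil {
			return nil, err
		}
		keys = append(keys, req.Header.Get("Idempotency-Key"))
	}
	return keys, nil
}

func main() {
	keys, err := keysAcrossAttempts("order-42", 3)
	fmt.Println(keys, err)
}
```

The key is a pure function of the order, so the port does not need to change: the adapter can compute it from the orderID it already receives.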

What the placement bought you

The domain stayed pure. Confirm calls Charge once and reads
like the business rule it encodes. The retry policy is one file
you can tune, test against a fake clock (inject the wait, or
run the test under testing/synctest), and reason about on its
own. Swap the HTTP gateway for a gRPC one and the backoff moves
with it, because it was never anywhere else. The context
deadline is honored end to end, so an upstream timeout cancels
the whole chain instead of leaving a goroutine sleeping on a
timer nobody is waiting for.
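A sketch of what testing the policy without real waiting could look like, assuming the loop is refactored to take its delay schedule and its sleeper as parameters. The names retryWith, sleepFunc, and demo are hypothetical, and the retryable check is omitted to keep the example short.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sleepFunc waits for d, or returns early with the context's error.
type sleepFunc func(ctx context.Context, d time.Duration) error

// retryWith is a hypothetical variant of the retry loop with the
// delay schedule and the sleeper injected by the caller.
func retryWith(
	ctx context.Context,
	attempts int,
	delay func(attempt int) time.Duration,
	sleep sleepFunc,
	op func(context.Context) error,
) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for attempt := 0; attempt < attempts; attempt++ {
		if err = op(ctx); err == nil {
			return nil
		}
		if attempt == attempts-1 {
			break // no sleep after the final attempt
		}
		if serr := sleep(ctx, delay(attempt)); serr != nil {
			return serr
		}
	}
	return err
}

// demo fails twice, then succeeds. The fake sleeper records the
// requested delays instead of waiting for them.
func demo() (slept []time.Duration, calls int, err error) {
	fakeSleep := func(_ context.Context, d time.Duration) error {
		slept = append(slept, d)
		return nil
	}
	// A fixed schedule keeps the assertion deterministic.
	fixed := func(attempt int) time.Duration {
		return time.Duration(attempt+1) * 100 * time.Millisecond
	}
	op := func(context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("temporarily unavailable")
		}
		return nil
	}
	err = retryWith(context.Background(), 5, fixed, fakeSleep, op)
	return slept, calls, err
}

func main() {
	slept, calls, err := demo()
	fmt.Println(slept, calls, err) // [100ms 200ms] 3 <nil>
}
```

The test observes the exact delays the policy asked for and finishes instantly. In production the same loop would receive a context-aware timer wait and the jittered schedule.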

Retries are a transport detail. Keep them at the transport
boundary, and the rest of the service does not have to know
that the network is unreliable.

If this was useful

The mechanics here — math/rand/v2, context deadlines,
errors.As over wrapped chains, time.Timer races in a
select — are the language and runtime details The Complete
Guide to Go Programming covers in depth. Keeping the retry at
the adapter and out of the domain is the boundary discipline
Hexagonal Architecture in Go is built around: ports that stay
ignorant of transport, adapters that own the messy parts.