ndmt1at21

Posted on May 26

A Go outbox library that runs inside your own DB transaction

#go #postgres #microservices #opensource

TL;DR — tickr is a Go library. It stores messages in one Postgres table. You add messages inside your own database transaction. A worker pool in the same Go process reads them back using SELECT … FOR UPDATE SKIP LOCKED and runs your handler. It ships with Prometheus metrics, OpenTelemetry tracing, and a Grafana dashboard. No broker. No second datastore.

The problem: two writes, one of them will fail

Most services I've worked on end up with code like this:

func CreateOrder(ctx context.Context, o Order) error {
    if err := db.Insert(ctx, o); err != nil {
        return err
    }
    return broker.Publish(ctx, "order.created", o) // 👀
}

It looks fine. It is not. The two writes are not in the same transaction. There is no way to make a Postgres commit and a Kafka publish succeed or fail together. So you get one of these bugs:

DB commit works, broker publish fails → other services never hear about the order.
Broker publish works, DB rolls back → other services react to an order that does not exist.
The process crashes between the two → one of the above, depending on which write ran first.

The standard fix is the transactional outbox pattern. You write the message into a table in the same transaction as the business row. A separate process reads that table and sends the message. The send itself can still fail, but now you can retry it from a durable record. That gives you at-least-once delivery, even if the process crashes.

The usual setup is Debezium reading the Postgres write-ahead log and pushing into Kafka. That works. It is also a lot of moving parts for a team that only wanted reliable order.created events.

What if the outbox table just was the queue?

If your handlers run in Go anyway, you do not need to push the message to Kafka only to read it back into Go. The outbox table can be the queue. A worker pool in the same service can read it directly. That is the bet tickr makes.

Here is the producer side:

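tx, err := pool.Begin(ctx)
if err != nil {
    return err
}
// Rolls back on any early return; harmless after a successful Commit.
defer tx.Rollback(ctx)

// Business write.
if _, err := tx.Exec(ctx, `INSERT INTO orders ...`); err != nil {
    return err
}

payload, err := tickr.Encode(order)
if err != nil {
    return err
}

// Outbox write, on the same transaction as the business write.
_, err = client.Enqueue(ctx, pgstore.WrapTx(tx), tickr.Message{
    Type:           "order.created",
    Payload:        payload,
    IdempotencyKey: order.ID,
})
if err != nil && !tickr.IsDuplicate(err) {
    return err
}

return tx.Commit(ctx)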

Look at the second argument to Enqueue: pgstore.WrapTx(tx) wraps the caller's pgx.Tx. tickr does not open a transaction of its own. That is the whole point. If your business INSERT rolls back, the outbox row rolls back with it. If it commits, the outbox row commits with it. The two cannot get out of sync.

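How does this compare with other Go job queues like River, Gue, or Asynq? They are good libraries. Asynq stores jobs in Redis, so its enqueue cannot be part of a Postgres transaction. River and Gue are Postgres-backed and also accept a caller-supplied transaction (InsertTx and EnqueueTx respectively), so against those two the difference is in the rest of the package: the status history, the lease handling, and the built-in observability described below.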

The consumer side

Handlers are typed:

reg := tickr.NewRegistry()
err := tickr.On(reg, "order.created",
    func(ctx context.Context, msg *tickr.InboundMessage, body OrderCreated) error {
        return chargeCustomer(ctx, body)
    },
    tickr.WithMaxAttempts(5),
    tickr.WithAttemptTimeout(10*time.Second),
)
if err != nil {
    return err
}

w, err := tickr.NewWorker(tickr.WorkerConfig{Storage: store, Registry: reg})
if err != nil {
    return err
}
return w.Start(ctx)

tickr.On[T] decodes the JSON payload into T before your function runs. If the payload is broken, the message goes straight to the dead-letter queue — it will not decode on the next retry either, so retrying it would waste attempts.
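To make the decode-then-dispatch idea concrete, here is a minimal sketch using only the standard library. It is not tickr's code. Typed, RawHandler, ErrBadPayload, and dispatch are names made up for this example, and the mapping from errors to statuses is simplified (it ignores the attempt limit).

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
)

// ErrBadPayload marks a message that will never decode, so retrying is pointless.
// Hypothetical name for this sketch.
var ErrBadPayload = errors.New("payload does not decode")

// RawHandler is what an untyped registry would store: it only sees bytes.
type RawHandler func(ctx context.Context, payload []byte) error

// Typed adapts a handler that wants a decoded T into a RawHandler.
func Typed[T any](h func(ctx context.Context, body T) error) RawHandler {
	return func(ctx context.Context, payload []byte) error {
		var body T
		if err := json.Unmarshal(payload, &body); err != nil {
			// Tag the error so the caller can dead-letter instead of retrying.
			return fmt.Errorf("%w: %v", ErrBadPayload, err)
		}
		return h(ctx, body)
	}
}

// dispatch runs one payload and reports the status a worker might record next.
// Simplified: a real worker would also compare the attempt count to the limit.
func dispatch(h RawHandler, payload string) string {
	err := h(context.Background(), []byte(payload))
	switch {
	case err == nil:
		return "SUCCESS"
	case errors.Is(err, ErrBadPayload):
		return "DEAD" // broken payload: skip the retry budget entirely
	default:
		return "RETRYING"
	}
}

type OrderCreated struct {
	OrderID string  `json:"order_id"`
	Total   float64 `json:"total"`
}

// chargeCustomer is a stand-in business handler.
func chargeCustomer(ctx context.Context, body OrderCreated) error {
	if body.Total < 0 {
		return errors.New("negative total")
	}
	return nil
}

var orderHandler = Typed(chargeCustomer)

func main() {
	fmt.Println(dispatch(orderHandler, `{"order_id":"o-1","total":42.5}`))
	fmt.Println(dispatch(orderHandler, `{"order_id":`))
	fmt.Println(dispatch(orderHandler, `{"order_id":"o-2","total":-1}`))
}
```

The useful part is the sentinel error: a decode failure is distinguishable from a handler failure, so the caller can treat one as permanent and the other as retryable.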

The worker reads messages in batches with this SQL:

UPDATE tickr_messages
   SET status = 'HANDLING',
       attempt = attempt + 1,
       claimed_by = $worker_id,
       claimed_until = now() + $lease
 WHERE id IN (
   SELECT id FROM tickr_messages
    WHERE status IN ('CREATED', 'RETRYING')
      AND scheduled_at <= now()
    ORDER BY scheduled_at
    FOR UPDATE SKIP LOCKED
    LIMIT $batch
 )
RETURNING ...;

SKIP LOCKED lets you run many workers at the same time. Two workers reading the table never block each other. The slower one just sees fewer rows in its batch.
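If the SQL is hard to picture, the same selection rules can be modelled in memory. This is only a sketch of the semantics, not how tickr or Postgres implement them. The Row type, the Locked flag (standing in for a row lock held by another open transaction), and the claim function are all invented for the example.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type Row struct {
	ID           int
	Status       string
	Attempt      int
	ScheduledAt  time.Time
	ClaimedBy    string
	ClaimedUntil time.Time
	Locked       bool // stands in for a row lock held by another open transaction
}

// claim mirrors the UPDATE ... WHERE id IN (SELECT ... FOR UPDATE SKIP LOCKED) statement.
func claim(rows []*Row, worker string, batch int, lease time.Duration, now time.Time) []int {
	var due []*Row
	for _, r := range rows {
		claimable := r.Status == "CREATED" || r.Status == "RETRYING"
		// SKIP LOCKED: a locked row is skipped, never waited on.
		if claimable && !r.ScheduledAt.After(now) && !r.Locked {
			due = append(due, r)
		}
	}
	// ORDER BY scheduled_at, then LIMIT batch.
	sort.SliceStable(due, func(i, j int) bool {
		return due[i].ScheduledAt.Before(due[j].ScheduledAt)
	})
	if len(due) > batch {
		due = due[:batch]
	}
	ids := make([]int, 0, len(due))
	for _, r := range due {
		r.Status = "HANDLING"
		r.Attempt++
		r.ClaimedBy = worker
		r.ClaimedUntil = now.Add(lease)
		ids = append(ids, r.ID)
	}
	return ids
}

// demo runs two claims against a small table and returns the claimed IDs.
func demo() (first, second []int) {
	now := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	rows := []*Row{
		{ID: 1, Status: "CREATED", ScheduledAt: now.Add(-3 * time.Minute)},
		{ID: 2, Status: "RETRYING", ScheduledAt: now.Add(-2 * time.Minute), Locked: true},
		{ID: 3, Status: "CREATED", ScheduledAt: now.Add(-1 * time.Minute)},
		{ID: 4, Status: "CREATED", ScheduledAt: now.Add(time.Hour)},  // not due yet
		{ID: 5, Status: "SUCCESS", ScheduledAt: now.Add(-time.Hour)}, // already done
	}
	first = claim(rows, "w1", 2, 30*time.Second, now)
	rows[1].Locked = false // pretend the other transaction rolled back and released its lock
	second = claim(rows, "w2", 2, 30*time.Second, now)
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println("w1 claimed", first)
	fmt.Println("w2 claimed", second)
}
```

Worker w1 gets rows 1 and 3 and steps over row 2 without waiting for it. Once the lock on row 2 is gone, w2 picks it up. Rows that are not due or not in a claimable status are never touched.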

Status machine, with history

Each message moves through these states:

CREATED ──claim──▶ HANDLING ──nil────▶ SUCCESS
                     │
                     ├─ err (attempt<max) ─▶ FAILED ─▶ RETRYING ─▶ HANDLING
                     │
                     ├─ err (attempt==max) ▶ FAILED ─▶ DEAD
                     │
                     ├─ DeadLetter()       ─▶ DEAD
                     │
                     └─ ctx.Canceled       ─▶ CREATED | RETRYING (attempt not increased)
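The diagram can also be read as a small pure function. The sketch below is my reading of it, not code from the repo. The type and constant names are invented, and two details are assumptions: that a cancelled message returns to the status it had before the claim, and that "attempt not increased" means the claim's increment is undone.

```go
package main

import "fmt"

type Status string

const (
	Created  Status = "CREATED"
	Handling Status = "HANDLING"
	Success  Status = "SUCCESS"
	Failed   Status = "FAILED"
	Retrying Status = "RETRYING"
	Dead     Status = "DEAD"
)

type Outcome int

const (
	ReturnedNil     Outcome = iota // handler returned nil
	ReturnedErr                    // handler returned an error
	AskedDeadLetter                // handler called DeadLetter()
	ContextCanceled                // context was cancelled mid-attempt
)

// after returns the statuses a HANDLING message moves through once its handler
// finishes, plus the attempt counter to store. prev is the status before the
// claim; attempt already includes the +1 applied at claim time.
func after(prev Status, out Outcome, attempt, maxAttempts int) ([]Status, int) {
	switch out {
	case ReturnedNil:
		return []Status{Success}, attempt
	case AskedDeadLetter:
		return []Status{Dead}, attempt
	case ContextCanceled:
		// Assumption: back to where it came from, with the claim's increment undone.
		return []Status{prev}, attempt - 1
	default: // ReturnedErr
		if attempt < maxAttempts {
			return []Status{Failed, Retrying}, attempt
		}
		return []Status{Failed, Dead}, attempt
	}
}

func main() {
	fmt.Println(after(Created, ReturnedNil, 1, 5))
	fmt.Println(after(Created, ReturnedErr, 1, 5))
	fmt.Println(after(Retrying, ReturnedErr, 5, 5))
	fmt.Println(after(Retrying, ContextCanceled, 3, 5))
}
```

Returning the whole path rather than just the final status matches the idea that every transition, including the short-lived FAILED, is a separate recorded step.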

Every state change writes a row into tickr_history. The history is never read on the hot path. It is there so when production wakes you up at 3am asking "why did this message die?", you can run one SQL query and see every attempt, every error, and the worker that handled it.

If you do not want the audit cost, WithHistoryPolicy(HistoryOff) skips those inserts. More on the trade-off below.

Lease auto-extension

The feature I like most: WithAttemptTimeout(60*time.Second) works correctly even when the lease is only 30 seconds. The engine runs a small goroutine that extends claimed_until every Lease/3 while your handler runs. If the extension ever fails — say another worker takes the row because the network died — the handler's context is cancelled so it stops writing.
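Here is roughly what such an extender loop can look like, as a standard-library sketch. It is not tickr's engine. runWithLease, ExtendFunc, and ErrLeaseLost are hypothetical names, and the extend callback stands in for whatever storage call moves claimed_until forward. The lease must be positive, since time.NewTicker panics on a non-positive interval.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrLeaseLost is returned when an extension fails while the handler runs.
var ErrLeaseLost = errors.New("lease lost")

// ExtendFunc stands in for the storage call that pushes claimed_until forward.
type ExtendFunc func(ctx context.Context, until time.Time) error

// runWithLease runs handler while a goroutine extends the lease every lease/3.
// If an extension fails, the handler's context is cancelled.
func runWithLease(parent context.Context, lease time.Duration, extend ExtendFunc,
	handler func(context.Context) error) error {

	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	done := make(chan struct{})    // closed when the handler returns
	stopped := make(chan struct{}) // closed when the extender goroutine exits
	var lost error

	go func() {
		defer close(stopped)
		ticker := time.NewTicker(lease / 3)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ctx.Done():
				return
			case now := <-ticker.C:
				if err := extend(ctx, now.Add(lease)); err != nil {
					lost = fmt.Errorf("%w: %v", ErrLeaseLost, err)
					cancel() // tell the handler to stop writing
					return
				}
			}
		}
	}()

	err := handler(ctx)
	close(done)
	<-stopped // after this, reading lost is race-free
	if lost != nil {
		return lost
	}
	return err
}

// demo runs one healthy attempt and one where the first extension fails.
func demo() (extensions int, okErr error, lostLease bool) {
	lease := 60 * time.Millisecond
	okErr = runWithLease(context.Background(), lease,
		func(ctx context.Context, until time.Time) error { extensions++; return nil },
		func(ctx context.Context) error { // runs longer than one lease
			select {
			case <-time.After(150 * time.Millisecond):
				return nil
			case <-ctx.Done():
				return ctx.Err()
			}
		})
	err := runWithLease(context.Background(), lease,
		func(ctx context.Context, until time.Time) error {
			return errors.New("row claimed by another worker")
		},
		func(ctx context.Context) error {
			<-ctx.Done() // a well-behaved handler stops when cancelled
			return ctx.Err()
		})
	return extensions, okErr, errors.Is(err, ErrLeaseLost)
}

func main() {
	n, okErr, lost := demo()
	fmt.Println("extensions:", n, "healthy error:", okErr, "lease lost:", lost)
}
```

Extending at a third of the lease leaves room for a slow or failed extension before the lease actually runs out. Cancellation only helps if the handler checks its context, which is why the second handler in the demo blocks on ctx.Done().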

When a worker crashes hard (SIGKILL), the reclaimer moves the orphaned rows back to RETRYING once the lease expires. The attempt counter, already incremented when the row was claimed, is not reset. Each crash therefore uses up one attempt, so a message that crashes the process every time eventually ends up in DEAD instead of looping forever.
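A small in-memory simulation shows why keeping the counter matters. This is a sketch under assumptions, not the real reclaimer. In particular, I assume here that the reclaimer itself moves an orphan to DEAD once its attempts are used up, which the text above does not spell out. Row, reclaim, and simulatePoison are invented names.

```go
package main

import (
	"fmt"
	"time"
)

type Row struct {
	Status       string
	Attempt      int
	ClaimedUntil time.Time
}

// reclaim handles one row whose worker may have died.
// It reports whether the row changed.
func reclaim(r *Row, maxAttempts int, now time.Time) bool {
	if r.Status != "HANDLING" || !r.ClaimedUntil.Before(now) {
		return false // not claimed, or the lease is still live
	}
	if r.Attempt >= maxAttempts {
		r.Status = "DEAD" // assumption: an exhausted orphan is dead-lettered here
	} else {
		r.Status = "RETRYING"
	}
	// r.Attempt is deliberately untouched: the crashed attempt still counts.
	return true
}

// simulatePoison claims a message, "crashes", and reclaims it until the
// message is no longer claimable.
func simulatePoison(maxAttempts int) (claims int, final string) {
	now := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	lease := 30 * time.Second
	r := &Row{Status: "CREATED"}
	for r.Status == "CREATED" || r.Status == "RETRYING" {
		// Claim: same effect as the UPDATE statement in the consumer section.
		r.Status, r.Attempt, r.ClaimedUntil = "HANDLING", r.Attempt+1, now.Add(lease)
		claims++
		// The worker is killed here, so nothing acks the row. Time passes.
		now = now.Add(lease + time.Second)
		reclaim(r, maxAttempts, now)
	}
	return claims, r.Status
}

func main() {
	claims, final := simulatePoison(3)
	fmt.Println("claims:", claims, "final status:", final)
}
```

With a limit of 3, the message is claimed three times and then stops. If reclaim reset Attempt to its pre-claim value, the loop would never end.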

Where tickr wins, where it loses

The repo has a benchmarks/ module that runs the same workload against River, Gue, Watermill SQL, and Asynq. The honest summary on a single-host Docker setup:

Library             Enqueue (msgs/sec)  Drain (msgs/sec)
Watermill SQL                  106,802            40,010
Gue                             95,045             1,835
River                           52,854             5,991
tickr (HistoryOff)              39,133             4,809
Asynq                            4,370             2,655

A few things I want to be upfront about:

Watermill leads both columns because it is not a job queue. It is pub/sub with an offsets table. An ack is one row update, not N row updates. Different guarantees, different workload.
tickr drains at about 80% of River's rate in the table above (4,809 vs 5,991 msgs/sec). With history on, each ack writes two rows: an UPDATE on tickr_messages and an INSERT into tickr_history for the SUCCESS transition. One CTE folds them into a single round-trip, but the write volume is still about 2x River's. HistoryOff recovers 13%. The next planned change, batched ack via pgx.SendBatch, should close most of the gap.
These are single-host Docker numbers. On a real Postgres cluster with PgBouncer and tuned autovacuum, the absolute numbers go up a lot. The repo's throughput section shows the baseline config for 1M msg/min.

If raw drain throughput is your only concern, River is faster today on my laptop. If you want "enqueue inside my own transaction, and observability built in", that is what tickr is for.

Full numbers and methodology: BENCHMARKS.md.

Observability is built in

This was the part I did not want to ship without. Adding observability to an outbox after the fact is painful.

Prometheus — pass a metrics/prom adapter into ClientConfig.Metrics and WorkerConfig.Metrics. You get counters and histograms for every useful number: enqueue rate, handler outcomes, queue depth by status, claim batch size, reclaimed leases, in-flight handlers. The repo has a Grafana dashboard at grafana/tickr-dashboard.json you can import.

OpenTelemetry — the tracer puts the W3C traceparent header into Message.Headers at enqueue time. It reads the header back when the worker picks up the message. So one trace covers producer → outbox → consumer, even if the consumer runs hours later (for example, after a retry). Span attributes follow the OTel messaging conventions, so Tempo, Jaeger, or Honeycomb all work.
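For readers who have not looked inside a traceparent value, here is a hand-rolled sketch of writing one into a headers map and reading it back. It only illustrates the W3C format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The inject and extract functions are invented for this example, and real code would more likely use OpenTelemetry's propagation package than parse the header by hand.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const headerKey = "traceparent"

// isHex reports whether s is exactly n lowercase hex characters.
func isHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// validID also rejects the all-zero ID, which the spec treats as invalid.
func validID(s string, n int) bool {
	return isHex(s, n) && strings.Trim(s, "0") != ""
}

// inject writes a version-00 traceparent into the message headers.
func inject(headers map[string]string, traceID, spanID string, sampled bool) {
	flags := "00"
	if sampled {
		flags = "01"
	}
	headers[headerKey] = "00-" + traceID + "-" + spanID + "-" + flags
}

// extract reads the traceparent back; ok is false if it is missing or malformed.
func extract(headers map[string]string) (traceID, spanID string, sampled, ok bool) {
	parts := strings.Split(headers[headerKey], "-")
	if len(parts) != 4 || parts[0] != "00" {
		return "", "", false, false
	}
	if !validID(parts[1], 32) || !validID(parts[2], 16) || !isHex(parts[3], 2) {
		return "", "", false, false
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil {
		return "", "", false, false
	}
	return parts[1], parts[2], flags&1 == 1, true
}

func main() {
	headers := map[string]string{}
	// Producer side, at enqueue time.
	inject(headers, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true)
	fmt.Println(headers[headerKey])

	// Consumer side, possibly hours later.
	traceID, parentSpan, sampled, ok := extract(headers)
	fmt.Println(traceID, parentSpan, sampled, ok)
}
```

Because the header travels inside the stored message, the consumer can start its span as a child of the producer's span no matter how long the row sat in the table.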

Built-in transport handlers — if your handler just forwards the message to an HTTP webhook or a gRPC method, handlers/http and handlers/grpc are one-liners. They handle status code classification (which codes retry, which dead-letter), Retry-After headers, idempotency-key forwarding, and trace propagation. Each one lives in its own Go module, so its deps stay out of your core go.mod.
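To show what "status code classification" can mean in practice, here is one plausible policy as a sketch. The exact table used by handlers/http is not shown in this post, so treat the mapping below as an assumption: 2xx acks, 408, 429 and 5xx retry, everything else dead-letters. Action, classify, retryAfter, and demoNow are names invented for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

type Action int

const (
	Ack Action = iota
	Retry
	DeadLetter
)

func (a Action) String() string {
	return [...]string{"ack", "retry", "dead-letter"}[a]
}

// classify maps a webhook response code to what the worker should do next.
// Assumed policy, not necessarily the one the library ships with.
func classify(status int) Action {
	switch {
	case status >= 200 && status < 300:
		return Ack
	case status == http.StatusRequestTimeout,
		status == http.StatusTooManyRequests,
		status >= 500:
		return Retry // the receiver may recover, so try again later
	default:
		return DeadLetter // the request itself is wrong; retrying will not fix it
	}
}

// retryAfter parses a Retry-After header, which is either a number of
// seconds or an HTTP date. ok is false if the value is empty or unparseable.
func retryAfter(value string, now time.Time) (delay time.Duration, ok bool) {
	if value == "" {
		return 0, false
	}
	if secs, err := strconv.Atoi(value); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second, true
	}
	if t, err := http.ParseTime(value); err == nil {
		if d := t.Sub(now); d > 0 {
			return d, true
		}
		return 0, true // the date is already in the past: retry immediately
	}
	return 0, false
}

// demoNow is a fixed clock so the example is reproducible.
var demoNow = time.Date(2015, 10, 21, 7, 27, 0, 0, time.UTC)

func main() {
	for _, code := range []int{204, 429, 503, 400} {
		fmt.Println(code, classify(code))
	}
	fmt.Println(retryAfter("120", demoNow))
	fmt.Println(retryAfter("Wed, 21 Oct 2015 07:28:00 GMT", demoNow))
}
```

A parsed Retry-After delay would typically replace the normal backoff for the next attempt, so the outbox respects what the receiver asked for.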

Where this is not a good fit

The Limitations section in the README has the full list. The two that matter most:

Producer and outbox must share a database. The guarantee depends on a single transaction. If your business data lives in MySQL and you want the outbox in Postgres, this library cannot help you.
At-least-once only. Your handlers must be idempotent. The library gives you InboundMessage.IdempotencyKey and the message ID. You build dedup on top. The orders example shows one common pattern: a side table keyed by (handler_name, idempotency_key) written inside the handler's own transaction.

If those are not a problem, the rest is production-shaped. Integration tests run against real Postgres via testcontainers. The storage interface has a conformance suite that any new adapter has to pass. The examples/orders directory spins up the full stack (service, Postgres, Prometheus, Grafana, Tempo) with one docker compose up.

Try it

go get github.com/ndmt1at21/tickr@latest

Or clone and run the example:

git clone https://github.com/ndmt1at21/tickr
cd tickr/examples/orders
docker compose up --build

Then in another terminal:

curl -X POST http://localhost:8080/orders \
  -H 'content-type: application/json' \
  -d '{"order_id":"o-1","customer_id":"c-1","total":42.50}'

You will see the order land in Postgres, the outbox row in tickr_messages, the handler run, and the full trace from Tempo in Grafana at http://localhost:3000.

Repo: github.com/ndmt1at21/tickr — MIT licensed. Issues and PRs are welcome. If you have shipped an outbox in production before, I would love feedback on the history retention design and the partitioning notes in ARCHITECTURE.md.
