DaOfficialWizard
Outbox Done Right in Go: Building Resilient Event-Driven Systems with NATS and SQL

🪧 Introduction

In modern distributed systems, maintaining data consistency while reliably delivering events is a largely solved but still complex challenge.

Let's look at some reasonably common scenarios: your service commits a database transaction successfully but crashes before publishing the corresponding event, or high load causes congestion and contention. For example, 100,000+ users all trying to check out their carts simultaneously.

This is the classic dual-write problem, which can result in lost events or inconsistent system states.

Such issues are particularly critical in microservices, event-driven architectures, and high-throughput applications.

Third-party platforms like Apache Kafka abstract away most of these complexities in today's dev experience, but I would like to take some time to help reinforce the systems thinking around these issues.

To do so, let's explore how the Outbox pattern addresses this challenge, using Go.

Provided below is a practical implementation using SQL and NATS, with SQLite/libSQL as a lightweight alternative.

We'll cover schema design, repository abstraction, dispatcher loops, broker integration, retry strategies, observability, and integration testing. By the end, you'll have a blueprint for how to reason about what it takes to actually build a fault-tolerant, reliable event delivery system.


⚡ The Dual-Write Problem

A large portion of design etiquette teaches us to treat the database as the truth of our world. The order exists, the balance moved, the state changed, and in the same breath we promise to tell the rest of our system about it.

Fundamentally though, breaths aren't atomic. A process dies, a socket blips, a broker backpressures for a few seconds that feel like hours under load. Suddenly we've created a world where the data is right here and wrong everywhere else.

The database commit and the message publish happen in different universes. If the first succeeds and the second never happens, we've silently lost a heartbeat.

If the second happens and the first doesn't, we've created a phantom. Try explaining either to a downstream system at 3 a.m.

What makes this so slippery is that it's rare ... until it isn't. High traffic stretches the tiny windows between things just wide enough for reality to fall through. Once you see the gap, you stop trying to close it with optimism and start designing around it.

Key causes:

  • Independent database commits and broker operations
  • Network issues or broker outages

Consequences:

  • Lost or delayed events
  • Duplicate events when retrying
  • Hard-to-trace downstream bugs
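
To make the hazard concrete, here is a minimal sketch of the naive dual write in Go. The helper names (saveOrder, Order, the "orders.placed" subject) are illustrative, not from any particular codebase; the point is simply that the commit and the publish are two unrelated operations.

func PlaceOrder(ctx context.Context, db *sql.DB, nc *nats.Conn, order Order) error {
  // Step 1: commit the state change.
  if err := saveOrder(ctx, db, order); err != nil {
    return err
  }

  // A crash, redeploy, or network blip right here loses the event forever.

  // Step 2: publish the event. Nothing ties this call to the commit above.
  payload, err := json.Marshal(order)
  if err != nil {
    return err
  }
  return nc.Publish("orders.placed", payload)
}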

Design patterns are largely here to help us avoid such pitfalls. The Outbox pattern is the staunchest of them here: it decouples database writes from message dispatch, ensuring reliability even under failures.


đź§° The Outbox Pattern Explained

The outbox is a very human solution. Write everything down as you do it, then tell everyone later. Seems pretty straightforward.

In database terms, you couple your domain change with a new row in a dedicated outbox table inside the same transaction.

If the state change commits, the message is guaranteed to exist. If the transaction fails, nothing exists. There is no "half-story" to reconcile later.

Only after the transaction commits does a separate, boring background process take the time to deliver those messages to the broker.

If the broker is sleepy or the network is fickle, it tries again. And again. The system degrades gracefully into "eventually" instead of "never."

Two details make this ergonomic in production. First, every message carries a stable subject, something you can reason about across time. Second, every event has an envelope with a durable ID.

With those two anchors, your publishers can be patient and your consumers can be idempotent.

  • Atomic transactions: Domain updates and outbox inserts occur in a single transaction.
  • Asynchronous dispatch: A background dispatcher reads from the outbox and publishes to the broker, retrying on failures.
  • At-least-once delivery: Consumers must handle idempotency.
  • Stable subjects: Human-readable, consistent event topics improve observability. For example, a subject mapper keeps topics predictable:
func SubjectFor(event any) string {
  switch event.(type) {
  case domain.MarketOrderPlaced:
    return "energy.market.epex.idm.order.placed"
  }
  return "energy.unknown"
}

Consistent naming allows easier monitoring and debugging.
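
The envelope mentioned earlier is deliberately small. A minimal sketch, assuming you only need the durable ID and subject described above (the other fields are illustrative extras):

type EventEnvelope struct {
  ID         string    `json:"id"`          // durable, unique per event; consumers dedupe on this
  Subject    string    `json:"subject"`     // stable, human-readable topic
  Source     string    `json:"source"`      // which service produced the event
  OccurredAt time.Time `json:"occurred_at"` // when the fact became true
  Payload    any       `json:"payload"`     // the domain event itself
}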


🛠️ Implementing the Outbox in Go

While most of us are used to expecting yet another framework, design patterns persist across frameworks because they are integral to the system design ethos, not to the language or domain. In this sense, the outbox is a set of responsibilities that one follows to ensure eventual consistency.

Start by insisting on a clear storage contract. It should let you append messages inside an existing transaction, dequeue them in sensible batches, and mark each outcome as definitively done or not-yet.

Most of the reliability you'll feel later comes from the honesty of those methods.

When you publish, you don't "send." You record.

You take the domain event, wrap it in a small, explicit envelope with metadata (who produced it, when, on what subject, with which unique ID), and you append it to the outbox alongside the domain write. That's your atomicity boundary: the truth and the promise live or die together.

Then comes the humblest hero in the system: the dispatcher loop. It wakes on a cadence you choose, picks up a handful of due messages, and offers them to NATS (or your broker of choice). If NATS shrugs, the dispatcher shrugs back and schedules another attempt a few seconds in the future. No drama, no recursion, no cleverness. Just forward motion and receipts.

NATS, for its part, meets you where you are. Publishing raw payloads is a single call. Turn on JetStream, and you get acknowledgements, deadlines, and a place to put the few messages that truly refuse to go quietly. Turn it off, and you still have the outbox’s durability.

1) Outbox Repository Interface

Abstract the storage layer to manage retries and batching:

type OutboxRepository interface {
  // Append stores a message outside of any transaction.
  Append(ctx context.Context, msg OutboxMessage) error
  // AppendTx stores a message inside an existing domain transaction.
  AppendTx(ctx context.Context, tx Transaction, msg OutboxMessage) error
  // DequeueBatch returns up to limit messages that are due for delivery.
  DequeueBatch(ctx context.Context, limit int) ([]OutboxMessage, error)
  // MarkDone records that a message was delivered successfully.
  MarkDone(ctx context.Context, id string) error
  // MarkFailed records a failed attempt and schedules the next one.
  MarkFailed(ctx context.Context, id string, nextAttempt time.Time, errMsg string) error
}

SQL schema (PostgreSQL dialect for demonstration):

CREATE TABLE outbox (
  id UUID PRIMARY KEY,
  topic TEXT NOT NULL,
  payload BYTEA NOT NULL,
  headers JSONB,
  attempts INT NOT NULL DEFAULT 0,
  next_attempt_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  last_error TEXT
);
CREATE INDEX ON outbox ((COALESCE(next_attempt_at, '-infinity'::timestamptz)), created_at);

Supports batching, backoff retries, and efficient pending-event queries.
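
For illustration, a DequeueBatch against this schema might look like the sketch below. It assumes database/sql with PostgreSQL and a repository struct holding a *sql.DB; if you run several dispatcher instances, wrap the query in a transaction and add FOR UPDATE SKIP LOCKED so they don't hand out the same rows.

func (r *SQLOutboxRepository) DequeueBatch(ctx context.Context, limit int) ([]OutboxMessage, error) {
  rows, err := r.db.QueryContext(ctx, `
    SELECT id, topic, payload
    FROM outbox
    WHERE COALESCE(next_attempt_at, '-infinity'::timestamptz) <= now()
    ORDER BY created_at
    LIMIT $1`, limit)
  if err != nil {
    return nil, err
  }
  defer rows.Close()

  var msgs []OutboxMessage
  for rows.Next() {
    var m OutboxMessage
    if err := rows.Scan(&m.ID, &m.Topic, &m.Payload); err != nil {
      return nil, err
    }
    msgs = append(msgs, m)
  }
  return msgs, rows.Err()
}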

2) Appending Events Atomically

func (p *OutboxEventPublisher) Publish(ctx context.Context, event any) error {
  subj := ports.SubjectFor(event)
  env := domain.EventEnvelope{Subject: subj, Payload: event}
  payload, err := json.Marshal(env)
  if err != nil {
    return err
  }

  msg := ports.OutboxMessage{Topic: subj, Payload: payload, CreatedAt: time.Now().UTC()}
  return p.outbox.Append(ctx, msg)
}

This records the event durably, so it survives even if a crash occurs after the write. For the full atomicity guarantee, use AppendTx so the event rides in the same transaction as the domain change.
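
A sketch of that transactional path, assuming the repository's Transaction wraps a *sql.Tx and that OrderService and InsertTx are your own (illustrative) service and persistence helpers:

func (s *OrderService) PlaceOrder(ctx context.Context, evt domain.MarketOrderPlaced) error {
  tx, err := s.db.BeginTx(ctx, nil)
  if err != nil {
    return err
  }
  defer tx.Rollback() // harmless after a successful Commit

  // The domain write.
  if err := s.orders.InsertTx(ctx, tx, evt); err != nil {
    return err
  }

  // The promise, in the same transaction.
  subj := ports.SubjectFor(evt)
  payload, err := json.Marshal(domain.EventEnvelope{Subject: subj, Payload: evt})
  if err != nil {
    return err
  }
  msg := ports.OutboxMessage{Topic: subj, Payload: payload, CreatedAt: time.Now().UTC()}
  if err := s.outbox.AppendTx(ctx, tx, msg); err != nil {
    return err
  }

  // The order row and the outbox row commit together, or not at all.
  return tx.Commit()
}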

3) Dispatcher Loop

func (d *OutboxDispatcher) Run(ctx context.Context) {
  ticker := time.NewTicker(d.interval)
  defer ticker.Stop()

  for {
    select {
    case <-ctx.Done():
      return
    case <-ticker.C:
      msgs, err := d.repo.DequeueBatch(ctx, d.batch)
      if err != nil {
        continue // try again on the next tick
      }
      for _, m := range msgs {
        if err := d.publisher.PublishRaw(ctx, m.Topic, m.Payload); err != nil {
          _ = d.repo.MarkFailed(ctx, m.ID, time.Now().Add(5*time.Second), err.Error())
          continue
        }
        _ = d.repo.MarkDone(ctx, m.ID)
      }
    }
  }
}

Continuous dispatch with retries and batching ensures reliable event propagation.

4) NATS Adapter

func (p *NatsPublisher) PublishRaw(_ context.Context, subject string, payload []byte) error {
  if p.useJS {
    _, err := p.js.Publish(subject, payload)
    return err
  }
  return p.conn.Publish(subject, payload)
}

Seamlessly integrates with NATS or JetStream while keeping application logic unchanged.
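
Wiring the adapter up is just as small. A constructor sketch, assuming the nats.go client and the JetStreamContext API used above:

func NewNatsPublisher(url string, useJS bool) (*NatsPublisher, error) {
  conn, err := nats.Connect(url, nats.Name("outbox-dispatcher"))
  if err != nil {
    return nil, err
  }
  p := &NatsPublisher{conn: conn, useJS: useJS}
  if useJS {
    js, err := conn.JetStream() // acknowledged, stream-backed publishes
    if err != nil {
      conn.Close()
      return nil, err
    }
    p.js = js
  }
  return p, nil
}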


⚖️ Handling Edge Cases

At-least-once delivery is the contract the outbox offers: messages will show up, possibly more than once, never less.

Your consumers keep a tiny ledger of what they've already handled, keyed by the envelope's ID.

If they see the same fact twice, they nod and carry on.
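
A minimal sketch of that ledger, assuming the consumer owns its own database, a processed_events table keyed by envelope ID, and an apply method that performs the real side effect (all three are illustrative; the upsert uses PostgreSQL syntax):

func (c *Consumer) handle(ctx context.Context, env domain.EventEnvelope) error {
  tx, err := c.db.BeginTx(ctx, nil)
  if err != nil {
    return err
  }
  defer tx.Rollback()

  // One row per envelope ID we have already applied.
  res, err := tx.ExecContext(ctx,
    `INSERT INTO processed_events (id) VALUES ($1) ON CONFLICT (id) DO NOTHING`, env.ID)
  if err != nil {
    return err
  }
  if n, _ := res.RowsAffected(); n == 0 {
    return nil // seen this fact before: nod and carry on
  }

  if err := c.apply(ctx, tx, env); err != nil {
    return err
  }
  return tx.Commit() // the side effect and the dedupe row land together
}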

Start with a small delay; five seconds is a good default. Escalate gently if the broker stays grumpy.

You'll be tempted to invent a brilliant backoff function; resist. The more exotic your retry rhythm, the harder it is to reason about under stress.
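
Something like the sketch below is plenty; it assumes OutboxMessage also carries the attempts count from the table:

// nextAttempt schedules the next retry with plain exponential backoff:
// 5s, 10s, 20s, ... capped at roughly five minutes.
func nextAttempt(attempts int) time.Time {
  if attempts > 6 {
    attempts = 6
  }
  delay := time.Duration(1<<uint(attempts)) * 5 * time.Second
  return time.Now().UTC().Add(delay)
}

In the dispatcher, the fixed time.Now().Add(5*time.Second) from earlier becomes nextAttempt(m.Attempts).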

Backpressure is not an emergency; it's feedback. Narrow your dispatcher's batch size. Widen the tick.

On the subscriber side, prefer bounded channels and explicit "not now" acknowledgements when using JetStream. You are shaping flow, not fighting physics.

And yes, cleanup matters. Once messages are marked done, you decide what history you want to keep. Some teams archive outbox rows for audit trails; others delete eagerly. Pick a posture, automate it, and make it visible. Reliability without housekeeping turns into archaeology.

  • Delivery semantics: At-least-once delivery; consumers handle deduplication.
  • Retries/backoff: Use attempts and next_attempt_at for exponential backoff.
  • Backpressure management: Tune batch size and interval; JetStream aids flow control.
  • Cleanup: Archive or purge old events to maintain performance.

Proper edge-case management ensures a reliable production system.


🔍 Observability & Testing

If the outbox is your backbone, observability is your nervous system. The dispatcher should count what it publishes, what it postpones, how long each publish took, and how big each batch was. The NATS adapter should keep score too: successes, retries, dead letters, oversize payloads. None of these numbers are vanity; they are the early warnings that let you steer instead of swerve.
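
One concrete way to keep that score, assuming Prometheus via github.com/prometheus/client_golang, is a pair of counters and a histogram:

var (
  publishedTotal = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "outbox_published_total",
    Help: "Messages successfully handed to the broker.",
  }, []string{"subject"})

  retriedTotal = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "outbox_retried_total",
    Help: "Messages rescheduled after a failed publish.",
  }, []string{"subject"})

  publishSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
    Name: "outbox_publish_duration_seconds",
    Help: "Time spent publishing a single message to NATS.",
  })
)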

Logs should read like a story, not a riddle. Be extremely clear and concise. Include the envelope ID, the subject, and the attempt number. When something fails, say how and when you'll try again. When it succeeds, say so plainly. Your future incident reports will quote these lines verbatim.
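
With the standard library's log/slog, the publish loop from section 3 can tell that story plainly. A sketch of its body, again assuming OutboxMessage exposes the attempts count:

if err := d.publisher.PublishRaw(ctx, m.Topic, m.Payload); err != nil {
  next := nextAttempt(m.Attempts)
  slog.Warn("outbox publish failed",
    "event_id", m.ID, "subject", m.Topic, "attempt", m.Attempts+1,
    "retry_at", next, "error", err)
  _ = d.repo.MarkFailed(ctx, m.ID, next, err.Error())
  continue
}
slog.Info("outbox publish ok", "event_id", m.ID, "subject", m.Topic)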

And then you test like a pessimist. Bring up a real broker. Point a real repository at a real database.
Start the dispatcher. Place an order, cancel another, and watch the envelopes appear where they should.
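
Here is what that pessimist's test can look like, sketched with Go's testing package against a locally running NATS; newTestOutbox is an assumed helper that wires a real repository, publisher, and dispatcher to a real database:

func TestOutboxDeliversAfterCommit(t *testing.T) {
  ctx := context.Background()

  nc, err := nats.Connect(nats.DefaultURL) // real broker, e.g. started via docker compose
  if err != nil {
    t.Fatal(err)
  }
  defer nc.Close()

  got := make(chan *nats.Msg, 1)
  sub, err := nc.ChanSubscribe("energy.market.epex.idm.order.placed", got)
  if err != nil {
    t.Fatal(err)
  }
  defer sub.Unsubscribe()

  publisher, dispatcher := newTestOutbox(t, nc)
  go dispatcher.Run(ctx)

  // Commit a domain change; the event should arrive via the outbox, not a direct publish.
  if err := publisher.Publish(ctx, domain.MarketOrderPlaced{}); err != nil {
    t.Fatal(err)
  }

  select {
  case <-got:
    // delivered, possibly after a retry or two
  case <-time.After(10 * time.Second):
    t.Fatal("event never reached the broker")
  }
}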

Then pull the plug on NATS for a minute. See the retries tick up. Bring it back. Watch the system catch up, calmly.

That feeling, watching a failure script play out exactly as designed, is why the pattern exists.

  • Metrics: Track success/failure rates, retry counts, batch sizes.
  • Logging: Include event ID and subject for traceability.
  • Integration tests: Validate end-to-end event delivery.

Thorough testing guarantees robustness under failure scenarios.


🚀 Why This Matters

Because correctness isn’t enough. A system that is correct in a vacuum and brittle in the world will betray you the first time the world behaves like itself.

The outbox pattern reframes reliability from a promise ("we'll publish right after we commit, cross our hearts") into a mechanism ("we recorded the intent; the machine will finish the job").

When engineers trust that a committed change will become a published fact (no ifs, no fingers crossed), they stop writing defensive, ad-hoc recovery code.

They stop hiding publish calls in try/catch blocks that swallow exceptions because "it should be fine." They stop arguing about the perfect moment to send the message. The moment is always the same: record now, relay later.

It matters to your operations. Batch sizes and intervals are no longer guesses. Retries and DLQs become signals, not surprises.

Subjects turn dashboards from abstract art into maps.

You begin to see your system not as a set of racing threads but as a conversation with memory.

And it matters to your users, though they'll never know why their order confirmations arrive on time even during the midnight sale. They'll think you scaled. You did ... but not just with CPUs.

  • Consistency: Eliminates dual-write hazards.
  • Reliability: Automatic retries, backoff, and monitoring enhance resilience.
  • Scalability: Ideal for microservices, CQRS, and event-driven systems.
  • Flexibility: Works with PostgreSQL, SQLite/libSQL, NATS, and JetStream.

Empowers teams to build resilient, event-driven applications.


âś… Conclusion

If you've ever tried to close the gap between "we wrote it" and "we told everyone," you know how slippery that space can be. The outbox pattern removes the slipperiness by changing the shape of the work.

Instead of attempting a flawless two‑step, you take one firm step: make the state change and record the message.

Then you let a quiet loop carry that message wherever it needs to go, as many times as it needs to try.

What follows is a different posture under failure. A broker outage is a delay, not a disaster. A duplicate delivery is a no‑op, not a bug. A surge is a backlog, not a meltdown. You trade a little latency for a lot of certainty.

Start where you are. Wrap your events in a small, honest envelope. Append them where you already write the truth. Add the simplest possible dispatcher. Watch it run. Turn the knobs when you must. And when the night finally throws you that timing gap you used to dread, enjoy the silence where the panic used to be.

The Outbox pattern is a reliable approach for event delivery in Go. Key takeaways:

  1. Append events in domain transactions for atomicity.
  2. Dispatch asynchronously with retries and exponential backoff.
  3. Monitor metrics, logs, and test thoroughly.

Implement this pattern in your next service and design consumers for idempotent processing to ensure maximum reliability.
