Why You Need a Transactional Outbox for Audit Logs

The cleanest-looking audit log design has a hidden bug that almost every team hits eventually. The application updates the database, then writes an audit entry, and most of the time both succeed. The cases where one succeeds and the other fails are where the trouble lives, and those cases account for the audit logs that lie to the auditor about what actually happened.

The transactional outbox pattern is the standard fix for this. It is the difference between an audit log that is roughly correct and an audit log that is provably consistent with the underlying state.

The Naive Approach and Why It Fails

The naive approach looks something like this in pseudocode:

db.transaction(() => {
  db.update_user(user_id, new_email)
})
audit_log.write({
  actor: current_user,
  action: "user.update_email",
  resource: user_id,
  before: old_email_hash,
  after: new_email_hash,
})

Most of the time this is fine. The database update commits, the audit log write succeeds, the world is consistent. The failures are the interesting cases.

Case one: the audit log write fails. The database commit went through. The user email changed. The audit log knows nothing about it. From the auditor's perspective, the user's email changed in an undocumented way. The system is in a state that violates its own logging policy.

Case two: the database transaction is rolled back after the audit write. This happens in the variant where the audit call sits inside the transaction block, or where the whole function runs inside an outer transaction. Some application logic later in the function throws an exception and the transaction rolls back. The user's email does not actually change, but the audit log says it did. The auditor pulls the log and sees a phantom event for a change that never happened.

Case three: the application crashes between the database commit and the audit write. This is the most common one, because most interruptions in production are network timeouts and pod restarts, not thrown exceptions. The database commit lands. The pod dies. The audit log never sees the event. When the pod restarts, it has no memory of what was supposed to happen. The OWASP logging cheat sheet lists this kind of write-loss as one of the most common audit log integrity gaps in production systems.

These three cases are not edge cases. They happen at scale. In any system processing more than a trivial volume of audited events, the gap between what actually happened and what the audit log thinks happened grows steadily over time.
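
Case one can be reproduced in a few lines. The sketch below is a minimal illustration, assuming SQLite as the application database and a hypothetical FlakyAuditLog class standing in for the external audit store; neither comes from a real audit product.

```python
import sqlite3


class FlakyAuditLog:
    """Hypothetical stand-in for an external audit store that can fail on demand."""

    def __init__(self):
        self.entries = []
        self.fail_next = False

    def write(self, entry):
        if self.fail_next:
            self.fail_next = False
            raise ConnectionError("audit store unavailable")
        self.entries.append(entry)


def naive_update_email(conn, audit_log, user_id, new_email):
    # Step 1: commit the data change on its own.
    with conn:  # commits on success, rolls back on exception
        conn.execute("UPDATE users SET email = ? WHERE id = ?", (new_email, user_id))
    # Step 2: a separate write that can fail after the commit has landed.
    audit_log.write({"action": "user.update_email", "resource": user_id})


def demo_naive_gap():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'old@example.com')")
    conn.commit()

    audit_log = FlakyAuditLog()
    audit_log.fail_next = True
    try:
        naive_update_email(conn, audit_log, 1, "new@example.com")
    except ConnectionError:
        pass  # the caller sees an error, but the email has already changed

    email = conn.execute("SELECT email FROM users WHERE id = 1").fetchone()[0]
    return email, len(audit_log.entries)
```

Calling demo_naive_gap() returns the new email together with an audit entry count of zero: the change is committed and nothing recorded it.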

The Transactional Outbox Pattern

The fix is to write the audit event into a dedicated table in the application database (the outbox), in the same transaction as the data change. A separate worker then reads from the outbox table and writes to the actual audit log storage.

The pattern looks like this:

db.transaction(() => {
  db.update_user(user_id, new_email)
  db.insert_outbox({
    actor: current_user,
    action: "user.update_email",
    resource: user_id,
    before: old_email_hash,
    after: new_email_hash,
    status: "pending",
  })
})

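// In a separate worker:
while (true) {
  pending_events = db.select_outbox({ status: "pending", order_by: "id" })
  for (event of pending_events) {
    audit_log.write(event)  // keyed by event.id so retries are skipped
    db.update_outbox(event.id, { status: "processed" })
  }
  sleep(poll_interval)
}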

The data change and the outbox entry are now in the same database transaction. They either both commit or both roll back. There is no gap between them. The audit log entry will eventually be written by the worker, but the application's commit is the source of truth for whether the event happened.
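
The atomicity claim is easy to check with a runnable version. This is a sketch under stated assumptions: SQLite stands in for the application database, the table layout is illustrative, and the fail_after_write flag is a hypothetical hook that simulates later application logic throwing.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'
);
"""


def update_email(conn, actor, user_id, new_email, fail_after_write=False):
    # One transaction covers both the data change and the outbox row.
    with conn:
        conn.execute("UPDATE users SET email = ? WHERE id = ?", (new_email, user_id))
        event = {"actor": actor, "action": "user.update_email", "resource": user_id}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
        if fail_after_write:
            # Simulated failure: the rollback discards both writes together.
            raise RuntimeError("later application logic failed")


def demo_atomic_outbox():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO users VALUES (1, 'old@example.com')")
    conn.commit()

    update_email(conn, "admin", 1, "first@example.com")
    try:
        update_email(conn, "admin", 1, "second@example.com", fail_after_write=True)
    except RuntimeError:
        pass

    email = conn.execute("SELECT email FROM users WHERE id = 1").fetchone()[0]
    pending = conn.execute(
        "SELECT COUNT(*) FROM outbox WHERE status = 'pending'"
    ).fetchone()[0]
    return email, pending
```

After one successful call and one failed call, demo_atomic_outbox() returns the first email and exactly one pending outbox row. The failed attempt leaves neither a data change nor a phantom event.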

The microservices.io page on the transactional outbox pattern covers the design in more depth and is the canonical reference. It applies equally well in monoliths.

What the Worker Actually Does

The worker reads pending entries from the outbox, writes each one to the immutable audit log storage, and marks the outbox entry as processed. The worker should:

Process events in order by their outbox ID. If the audit log is hash-chained, with each entry committing to the one before it, this keeps the chain in a stable, reproducible order.
Handle duplicate processing gracefully. If the worker crashes after writing to the audit log but before marking the outbox row as processed, the next run will try to write the same event again. The audit log writer needs to recognize and skip duplicates, usually via the outbox ID as an idempotency key.
Run continuously with a short polling interval, or use database notifications (LISTEN/NOTIFY in Postgres) to wake up immediately when new events are inserted.
Have a backlog alarm. If the outbox grows beyond a threshold, something is wrong with the worker or with the audit log storage, and the team needs to know.
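
The worker responsibilities above might be sketched as follows. This is an illustration, not a reference implementation: SQLite stands in for the application database, IdempotentAuditLog is a hypothetical in-memory stand-in for the audit store, and the function names are made up for the example.

```python
import sqlite3


class IdempotentAuditLog:
    """Hypothetical audit store that deduplicates on an idempotency key."""

    def __init__(self):
        self.entries = {}  # outbox id -> payload, kept in insertion order

    def write(self, key, payload):
        if key in self.entries:
            return False  # duplicate delivery, skip it
        self.entries[key] = payload
        return True


def drain_once(conn, audit_log, batch_size=100):
    """Process one batch of pending rows in outbox-ID order; return the count."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE status = 'pending' ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for outbox_id, payload in rows:
        audit_log.write(outbox_id, payload)  # safe to repeat after a crash
        with conn:
            conn.execute(
                "UPDATE outbox SET status = 'processed' WHERE id = ?", (outbox_id,)
            )
    return len(rows)


def backlog_size(conn):
    """Number of rows still waiting; compare against an alert threshold."""
    return conn.execute(
        "SELECT COUNT(*) FROM outbox WHERE status = 'pending'"
    ).fetchone()[0]


def demo_worker():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, status TEXT DEFAULT 'pending')"
    )
    conn.executemany(
        "INSERT INTO outbox (payload) VALUES (?)", [("e1",), ("e2",), ("e3",)]
    )
    conn.commit()

    audit_log = IdempotentAuditLog()
    # Simulate a crash: event 1 reached the audit log but was never marked processed.
    audit_log.write(1, "e1")

    processed = drain_once(conn, audit_log)
    return processed, list(audit_log.entries.items()), backlog_size(conn)
```

In demo_worker(), the row that was already delivered before the simulated crash is marked processed without being written twice. A long-running worker would call drain_once in a loop with a short sleep and raise an alert when backlog_size crosses a threshold.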

Why Not Just Use Two-Phase Commit

Two-phase commit across the database and the audit log storage looks like an obvious alternative, and it is, in theory. In practice, distributed transactions across different storage systems have hard performance and operational costs, and most audit log storage (S3, managed services) does not support them at all.

The transactional outbox sidesteps the distributed transaction problem by keeping everything in the application database for the commit moment. The audit log storage gets written to asynchronously, and the worker handles the consistency. The Wikipedia article on two-phase commit covers the alternative if you want background on why it is usually avoided.

Edge Cases the Outbox Handles Well

A few situations the outbox handles naturally that ad-hoc audit code struggles with.

Replays. If the audit log storage went down for two hours, the outbox accumulates events during the outage. When storage comes back, the worker drains the outbox in order. No events are lost.

Schema migrations. If the audit log schema changes, the outbox keeps accepting writes in the old format and the worker handles translation to the new schema. The application code does not have to be deployed in lockstep with the audit log schema.
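
One way the worker could handle that translation is to tag each outbox payload with a version and upgrade old payloads on read. The field names and the v1 to v2 change below are hypothetical, chosen only to illustrate the idea.

```python
import json

CURRENT_VERSION = 2


def to_current_schema(raw_payload):
    """Upgrade an outbox payload (JSON text) to the current audit schema."""
    event = json.loads(raw_payload)
    version = event.get("schema_version", 1)  # untagged payloads are treated as v1
    if version == 1:
        # Hypothetical v1 -> v2 change: 'resource' was renamed to 'resource_id'.
        event["resource_id"] = event.pop("resource")
        event["schema_version"] = 2
        version = 2
    if version != CURRENT_VERSION:
        raise ValueError(f"unsupported schema version: {version}")
    return event
```

Application code that still writes v1 payloads keeps working, and the worker emits only the current format to the audit store.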

Audit vendor swaps. If you change the audit log storage vendor, the outbox keeps writing to the same place. The worker is the only piece that needs to know about the new vendor.

Throttling. If the audit log vendor rate-limits writes, the worker slows down. The application is unaffected and keeps committing transactions. The backlog grows during the throttle and drains afterward.
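
A minimal sketch of how the worker might slow down under throttling, assuming the audit store client raises a rate-limit error. The RateLimited exception and write_with_backoff helper are hypothetical names, not part of any real vendor SDK.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised by the audit store client when throttled."""


def write_with_backoff(write, event, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call write(event), doubling the wait after each rate-limit error."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return write(event)
        except RateLimited:
            if attempt == max_attempts:
                raise  # give up for now; the row stays pending for the next poll
            sleep(delay)
            delay *= 2
```

Because the outbox row is only marked processed after a successful write, giving up here loses nothing. The event simply waits in the backlog.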

Performance Notes

The outbox approach adds one insert per audited event to the application database transaction. The cost is small but not zero. For a system with high throughput, the outbox table can become a hot spot if it is not designed carefully.

A few practical tips:

Partition the outbox table by date, which makes dropping old rows cheap, or by event hash, which spreads insert contention across partitions.
Move processed rows out of the active outbox table periodically. Either delete them or move them to an archive table. The active table stays small and the queries stay fast.
For very high volumes, consider Debezium or a change-data-capture approach that reads the outbox table from the database's write-ahead log rather than via SQL polling. This adds complexity but removes the polling overhead.
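
The archiving tip can be sketched as a small batch job. This assumes SQLite and an illustrative outbox_archive table; the table and function names are made up for the example, and a production database would likely use its own batching or partition-drop mechanism.

```python
import sqlite3

ARCHIVE_SCHEMA = """
CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, status TEXT);
CREATE TABLE outbox_archive (id INTEGER PRIMARY KEY, payload TEXT);
"""

# The oldest processed rows, limited to one batch.
_BATCH = "SELECT id FROM outbox WHERE status = 'processed' ORDER BY id LIMIT ?"


def archive_processed(conn, batch_size=500):
    """Move up to batch_size processed rows to the archive in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO outbox_archive (id, payload) "
            f"SELECT id, payload FROM outbox WHERE id IN ({_BATCH})",
            (batch_size,),
        )
        # Same selection as above: the insert did not modify the outbox table.
        deleted = conn.execute(
            f"DELETE FROM outbox WHERE id IN ({_BATCH})", (batch_size,)
        ).rowcount
    return deleted
```

Running archive_processed on a schedule keeps the active table limited to pending rows plus whatever was processed since the last run. Pending rows are never touched.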

When to Reach for the Pattern

For an audit log that is going to be used as evidence in any serious context (compliance audit, security investigation, legal discovery), the transactional outbox is essentially mandatory. Without it, or an equivalent atomicity guarantee, there is no sound way to show that the audit log accurately reflects what happened in the system.

For lighter-weight activity logging that is only used for engineering convenience, the simpler synchronous write may be acceptable. The cases where the write fails are also the cases where you are unlikely to care.

The longer write-up on the full audit log architecture, including the two-log design (query log plus evidence log) that the outbox feeds into, lives in the article from 137Foundry. The outbox is one piece of that broader design.

The Underlying Principle

The transactional outbox pattern is an example of a deeper principle. Any time the application needs to do two things atomically and one of them involves an external system, route the second thing through the application database first. The database commit becomes the single point of truth. The asynchronous step happens later, idempotently, and the downstream system becomes eventually consistent with the commit as long as the worker keeps retrying.

This pattern shows up in event-driven architectures, webhook delivery, payment processing, and many other places. Audit logging is just one of the higher-stakes applications. The discipline is the same. Use the database transaction as the source of truth, and treat the downstream side as a consumer that catches up.