Ivole32

Posted on May 7 • Originally published at queueforge.dev on May 7


#queueforge #queuesystems #transactionalqueues #rabbitmqoutage

Outages QueueForge Could Have Prevented #1 — The Transactional Queue Failure That Silently Lost 21,500 User Actions

In January 2026, Sean Hammond published a detailed writeup about a queue-related outage that caused thousands of user actions to silently disappear.

The engineering team behind Hypothesis had a system architecture that looked completely reasonable on paper:

Save data to Postgres
Push a background job to RabbitMQ
Index the data asynchronously into Elasticsearch

The system worked for years.

Until RabbitMQ failed.

Suddenly, users started creating annotations that appeared to save successfully but later vanished from the application.

Not because Postgres lost data.

Because the queue failed at exactly the wrong moment.

The Architecture That Failed

Their request flow looked roughly like this:

Client Request
    ↓
Write annotation to Postgres
    ↓
Commit transaction
    ↓
Enqueue indexing job in RabbitMQ
    ↓
Return 200 OK

The critical flaw was subtle.

The database commit and the queue publish ran against two separate systems with no shared transactional guarantee.

So when RabbitMQ stopped responding:

Postgres commits still succeeded
Queue publish operations failed
The API still returned success responses
Elasticsearch never received indexing jobs

The result was catastrophic:

More than 21,500 annotations existed in Postgres but never appeared in Elasticsearch.

Since Hypothesis relied on Elasticsearch for reads, users experienced this as missing or deleted data.

This is one of the most dangerous categories of outages: silent inconsistency.
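To make the failure mode concrete, here is a minimal sketch of the dual write described above. It is not the Hypothesis code: `FakeBroker`, `BrokerDown`, and `create_annotation` are hypothetical names, SQLite stands in for Postgres, and the assumption that the publish error was swallowed is only inferred from the fact that the API kept returning success.

```python
import sqlite3


class BrokerDown(Exception):
    """Raised by the stand-in broker when it cannot accept messages."""


class FakeBroker:
    """Hypothetical in-memory stand-in for an external queue such as RabbitMQ."""

    def __init__(self):
        self.available = True
        self.messages = []

    def publish(self, message):
        if not self.available:
            raise BrokerDown("broker is not responding")
        self.messages.append(message)


def create_annotation(db, broker, text):
    """Dual write: commit to the database first, then publish to the broker."""
    cur = db.execute("INSERT INTO annotation (text) VALUES (?)", (text,))
    db.commit()  # the row is durable from this point on
    try:
        broker.publish({"index_annotation": cur.lastrowid})
    except BrokerDown:
        pass  # assumption: the failure is swallowed, so the caller never sees it
    return 200


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE annotation (id INTEGER PRIMARY KEY, text TEXT)")
broker = FakeBroker()

create_annotation(db, broker, "first")            # broker healthy
broker.available = False
status = create_annotation(db, broker, "second")  # broker down

saved = db.execute("SELECT COUNT(*) FROM annotation").fetchone()[0]
print(status, saved, len(broker.messages))  # 200 2 1
```

Two rows are committed, only one indexing message exists, and both requests reported success. Nothing in the request path records that a job is still owed.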

The Distributed Systems Problem Behind It

This outage is a real-world instance of the dual-write problem, a close relative of the Two Generals Problem.

You cannot perfectly coordinate two distributed systems if communication between them can fail.

In practice, the system ended up in this state:

Database commit succeeded
BUT
Queue publish may or may not have happened

Once this happens, retries alone are not enough: if the process gives up or crashes after the commit, nothing durable records that a job is still owed. Recovery has to be designed into the architecture.

Most queue systems still leave this edge case entirely to application developers.

Most teams only discover the flaw after production traffic triggers the exact failure mode.

Why These Failures Are So Dangerous

This outage was not caused by scaling problems, malformed requests, or traffic spikes.

It was caused by a missing reliability guarantee between persistence and asynchronous execution.

Failures like this often create:

Ghost writes
Missing events
Broken search indexes
Orphaned jobs
Hard-to-debug race conditions

The most dangerous part is that systems often appear healthy while users are actively losing data.

Some requests work perfectly.

Others silently disappear forever.

The Fix: Transactional Job Queues

Hypothesis eventually redesigned the architecture around a transactional queue stored directly inside Postgres.

Instead of:

DB transaction
THEN
external queue publish

They changed the flow to:

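BEGIN;

-- Table and column names are illustrative, not the actual Hypothesis schema.
INSERT INTO annotation (id, text) VALUES (42, 'example');
INSERT INTO job (name, payload) VALUES ('index_annotation', '{"annotation_id": 42}');

COMMIT;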

Now the write operation and the queued work existed inside the same atomic transaction.

Either both succeeded.

Or neither existed.

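That architectural change eliminated this entire category of bug: a committed write with no job to index it. A worker can still fail while processing, but the job row stays in Postgres until it is retried successfully.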
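The atomic enqueue can be sketched in a few lines. This is an illustration under stated assumptions, not the Hypothesis implementation: SQLite stands in for Postgres so the example runs anywhere, and the `annotation` and `job` tables, the `fail_before_commit` flag, and the helper names are all hypothetical.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE annotation (id INTEGER PRIMARY KEY, text TEXT NOT NULL)")
db.execute(
    "CREATE TABLE job (id INTEGER PRIMARY KEY, name TEXT NOT NULL, payload TEXT NOT NULL)"
)


def create_annotation(db, text, fail_before_commit=False):
    """Insert the annotation and its indexing job in one transaction."""
    with db:  # commits on success, rolls back if the body raises
        cur = db.execute("INSERT INTO annotation (text) VALUES (?)", (text,))
        annotation_id = cur.lastrowid
        if fail_before_commit:
            raise RuntimeError("simulated crash between the two inserts")
        db.execute(
            "INSERT INTO job (name, payload) VALUES (?, ?)",
            ("index_annotation", json.dumps({"annotation_id": annotation_id})),
        )
    return annotation_id


def counts(db):
    """Return (number of annotations, number of jobs)."""
    annotations = db.execute("SELECT COUNT(*) FROM annotation").fetchone()[0]
    jobs = db.execute("SELECT COUNT(*) FROM job").fetchone()[0]
    return annotations, jobs


create_annotation(db, "first")
try:
    create_annotation(db, "second", fail_before_commit=True)
except RuntimeError:
    pass  # the whole transaction was rolled back

print(counts(db))  # (1, 1)
```

The failed request leaves neither an annotation nor a job behind, so the two tables cannot drift apart. In Postgres, workers commonly claim rows from a table like this with SELECT ... FOR UPDATE SKIP LOCKED so that several workers can poll it without blocking each other.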

Why More Teams Are Moving Toward Transactional Queues

Increasingly, engineering teams are rediscovering the same principle:

Reliability comes from transactional guarantees, not retries.

Retries help.

Dead-letter queues help.

Backoff strategies help.

But none of them solve the core issue if queue publishing is not transactionally coupled to the state change that created the work.

This is why more systems are moving toward:

Transactional outboxes
Postgres-backed queues
Durable event logs
Idempotent consumers (effectively-once processing on top of at-least-once delivery)
Atomic enqueue patterns

The industry is slowly realizing that queues are not just infrastructure.

They are part of the application’s consistency model.
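A transactional outbox still needs something to move rows out of the database, and that relay can crash between delivering a message and marking it as sent. The sketch below shows why such a design gives at-least-once delivery and why the consumer should be idempotent. Everything here is hypothetical: the `outbox` table, `relay_once`, `consume`, and the `crash_before_mark` flag exist only for illustration, and SQLite stands in for Postgres.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0)"
)
with db:
    db.executemany("INSERT INTO outbox (payload) VALUES (?)", [("a",), ("b",)])

indexed = {}      # stand-in for the search index, keyed by outbox id
deliveries = []   # every delivery attempt, including duplicates


def consume(message_id, payload):
    """Idempotent consumer: applying the same message twice changes nothing."""
    deliveries.append(message_id)
    indexed[message_id] = payload


def relay_once(db, crash_before_mark=False):
    """Deliver each unpublished outbox row, then mark it as published."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for message_id, payload in rows:
        consume(message_id, payload)
        if crash_before_mark:
            return  # simulated crash: delivered, but never marked published
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (message_id,))


relay_once(db, crash_before_mark=True)  # delivers row 1, then "crashes"
relay_once(db)                          # redelivers row 1, delivers row 2

pending = db.execute("SELECT COUNT(*) FROM outbox WHERE published = 0").fetchone()[0]
print(deliveries, indexed, pending)  # [1, 1, 2] {1: 'a', 2: 'b'} 0
```

Row 1 is delivered twice, yet the index ends up correct because the consumer overwrites by key instead of appending. Duplicates are tolerated and losses are not, which is the opposite of the trade-off in the original outage.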

What QueueForge Could Have Prevented

This is exactly the category of outage that modern queue tooling should detect before production users notice.

A platform like QueueForge could surface warning signals such as:

Successful database commits paired with failed enqueue attempts
Queue publish latency spikes
Growing transaction-to-job mismatch rates
Missing downstream consumers
Jobs never reaching processing pipelines
Consistency drift between systems

The dangerous part of the outage was not that RabbitMQ failed.

Infrastructure failures happen constantly.

The dangerous part was that the application kept acknowledging success while silently dropping critical background work.

That is the exact failure mode teams need visibility into.
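One of those signals, the mismatch between committed rows and indexed documents, can be approximated with a periodic reconciliation check. The sketch below is not QueueForge's API or any real product's interface. `find_unindexed` and `mismatch_rate` are hypothetical helpers, SQLite stands in for Postgres, and a plain set stands in for the ids found in the search index.

```python
import sqlite3


def find_unindexed(db, indexed_ids):
    """Return ids committed to the database but absent from the index."""
    committed = {row[0] for row in db.execute("SELECT id FROM annotation")}
    return sorted(committed - set(indexed_ids))


def mismatch_rate(db, indexed_ids):
    """Fraction of committed rows that never reached the index."""
    total = db.execute("SELECT COUNT(*) FROM annotation").fetchone()[0]
    if total == 0:
        return 0.0
    return len(find_unindexed(db, indexed_ids)) / total


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE annotation (id INTEGER PRIMARY KEY, text TEXT)")
with db:
    db.executemany(
        "INSERT INTO annotation (text) VALUES (?)", [("a",), ("b",), ("c",), ("d",)]
    )

indexed_ids = {1, 2}  # stand-in for ids actually present in the search index

missing = find_unindexed(db, indexed_ids)
rate = mismatch_rate(db, indexed_ids)
print(missing, rate)  # [3, 4] 0.5
```

A real check would scan a bounded time window instead of the whole table and allow for normal indexing lag, but even a crude version like this turns a silent inconsistency into a number that can trigger an alert.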

The Bigger Lesson

Queues are usually marketed as scalability infrastructure.

But the hardest queue problems are rarely about throughput.

They are about correctness.

The moment an architecture says:

"Save data now, process later"

it introduces a distributed consistency problem.

If that boundary is not explicitly designed for, production will eventually find it.

Usually after years of “it has always worked fine.”

Usually at 3 AM.

And usually after thousands of records have already gone missing.
