Outages QueueForge Could Have Prevented #1 — The Transactional Queue Failure That Silently Lost 21,500 User Actions
In January 2026, Sean Hammond published a detailed writeup about a queue-related outage that caused thousands of user actions to silently disappear.
The engineering team behind Hypothesis had a system architecture that looked completely reasonable on paper:
- Save data to Postgres
- Push a background job to RabbitMQ
- Index the data asynchronously into Elasticsearch
The system worked for years.
Until RabbitMQ failed.
Suddenly, users started creating annotations that appeared to save successfully but later vanished from the application.
Not because Postgres lost data.
Because the queue failed at exactly the wrong moment.
The Architecture That Failed
Their request flow looked roughly like this:
Client Request
↓
Write annotation to Postgres
↓
Commit transaction
↓
Enqueue indexing job in RabbitMQ
↓
Return 200 OK
The critical flaw was subtle.
The database commit and the queue publish ran against two separate systems with no shared transactional guarantee.
So when RabbitMQ stopped responding:
- Postgres commits still succeeded
- Queue publish operations failed
- The API still returned success responses
- Elasticsearch never received indexing jobs
The result was catastrophic:
More than 21,500 annotations existed in Postgres but never appeared in Elasticsearch.
Since Hypothesis relied on Elasticsearch for reads, users experienced this as missing or deleted data.
This is one of the most dangerous categories of outages: silent inconsistency.
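The failure mode above can be reproduced in a few lines. The following is a minimal sketch, not the Hypothesis code: it uses SQLite in place of Postgres, a hypothetical in-memory `FakeBroker` in place of RabbitMQ, and it assumes the publish error was swallowed by the request handler (the original writeup is not quoted here on how the error was handled).

```python
import sqlite3


class BrokerDown(Exception):
    """Raised by the stand-in broker when it cannot accept messages."""


class FakeBroker:
    """Hypothetical in-memory stand-in for an external message broker."""

    def __init__(self):
        self.available = True
        self.messages = []

    def publish(self, message):
        if not self.available:
            raise BrokerDown("broker unreachable")
        self.messages.append(message)


def create_annotation(conn, broker, text):
    """Dual write: commit to the database first, then publish separately."""
    cur = conn.execute("INSERT INTO annotation (text) VALUES (?)", (text,))
    conn.commit()  # the row is durable from this point on
    try:
        broker.publish({"annotation_id": cur.lastrowid})
    except BrokerDown:
        pass  # assumption: the failure is swallowed, so the job is lost
    return 200  # the caller sees success either way


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE annotation (id INTEGER PRIMARY KEY, text TEXT)")
broker = FakeBroker()

status_ok = create_annotation(conn, broker, "first note")
broker.available = False  # simulate the broker outage
status_during_outage = create_annotation(conn, broker, "second note")

rows_saved = conn.execute("SELECT COUNT(*) FROM annotation").fetchone()[0]
jobs_published = len(broker.messages)
```

Both calls return 200 and both rows are committed, but only one indexing job exists. Nothing in the response tells the client which request lost its job.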
The Distributed Systems Problem Behind It
This outage is a production instance of the dual-write problem, a practical cousin of the Two Generals Problem.
Two systems cannot be perfectly coordinated if communication between them can fail.
In practice, the system ended up in this state:
Database commit succeeded
BUT
Queue publish may have failed
Once this happens, retries alone are not enough unless the architecture was specifically designed for recovery.
Most queue systems still leave this edge case entirely to application developers.
Most teams only discover the flaw after production traffic triggers the exact failure mode.
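Retries only cover failures the publisher knows about. A recovery path has to start from the source of truth and ask what the downstream system never saw. Here is a minimal sketch of that reconciliation step, assuming the committed ids and the indexed ids can both be listed; the function names and the `enqueue` callback are hypothetical.

```python
def find_unindexed(db_ids, indexed_ids):
    """Return ids that are committed in the database but absent from the index."""
    return sorted(set(db_ids) - set(indexed_ids))


def reconcile(db_ids, indexed_ids, enqueue):
    """Re-enqueue indexing work for every committed row the index never saw."""
    missing = find_unindexed(db_ids, indexed_ids)
    for annotation_id in missing:
        enqueue(annotation_id)
    return missing


# Toy data: five committed annotations, two of which never reached the index.
db_ids = [1, 2, 3, 4, 5]
indexed_ids = [1, 2, 4]
requeued = []

missing = reconcile(db_ids, indexed_ids, requeued.append)
```

A sweep like this repairs the damage after the fact. It does not prevent the inconsistency, which is why it complements a transactional design instead of replacing it.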
Why These Failures Are So Dangerous
This outage was not caused by scaling problems, malformed requests, or traffic spikes.
It was caused by a missing reliability guarantee between persistence and asynchronous execution.
Failures like this often create:
- Ghost writes
- Missing events
- Broken search indexes
- Orphaned jobs
- Impossible-to-debug race conditions
The most dangerous part is that systems often appear healthy while users are actively losing data.
Some requests work perfectly.
Others silently disappear forever.
The Fix: Transactional Job Queues
Hypothesis eventually redesigned the architecture around a transactional queue stored directly inside Postgres.
Instead of:
DB transaction
THEN
external queue publish
They changed the flow to:
BEGIN;
INSERT annotation;
INSERT job;
COMMIT;
Now the write operation and the queued work existed inside the same atomic transaction.
Either both succeeded.
Or neither existed.
That architectural change eliminated the entire category of bugs in which a write commits but its job is never enqueued.
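The atomic flow can be sketched with Python's built-in `sqlite3` module. This is an illustration under stated assumptions, not the Hypothesis implementation: SQLite is used in place of Postgres so the example runs anywhere, and the `annotation` and `job` table layouts, the job name, and the `fail_before_commit` flag are all hypothetical.

```python
import json
import sqlite3


def create_annotation(conn, text, fail_before_commit=False):
    """Insert the annotation and its indexing job in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute("INSERT INTO annotation (text) VALUES (?)", (text,))
        payload = json.dumps({"annotation_id": cur.lastrowid})
        conn.execute(
            "INSERT INTO job (name, payload) VALUES (?, ?)",
            ("index_annotation", payload),
        )
        if fail_before_commit:
            raise RuntimeError("simulated crash before COMMIT")
    return cur.lastrowid


def count(conn, table):
    # Table names come from this script only, never from user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE annotation (id INTEGER PRIMARY KEY, text TEXT)")
conn.execute(
    "CREATE TABLE job (id INTEGER PRIMARY KEY, name TEXT, payload TEXT)"
)

create_annotation(conn, "kept")
try:
    create_annotation(conn, "rolled back", fail_before_commit=True)
except RuntimeError:
    pass

annotations = count(conn, "annotation")
jobs = count(conn, "job")
```

After the simulated crash, both tables hold exactly one row. The failed request left behind neither an annotation nor a job, so there is nothing to reconcile.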
Why More Teams Are Moving Toward Transactional Queues
Increasingly, engineering teams are rediscovering the same principle:
Reliability comes from transactional guarantees, not retries.
Retries help.
Dead-letter queues help.
Backoff strategies help.
But none of them solve the core issue if queue publishing is not transactionally coupled to the state change that created the work.
This is why more systems are moving toward:
- Transactional outboxes
- Postgres-backed queues
- Durable event logs
- Exactly-once processing semantics (at-least-once delivery plus idempotent consumers)
- Atomic enqueue patterns
The industry is slowly realizing that queues are not just infrastructure.
They are part of the application’s consistency model.
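An atomic enqueue guarantees that the job exists, but a worker can still crash between doing the work and recording that it is done. Jobs are therefore usually delivered at least once, and the handler has to tolerate repeats. The sketch below assumes a hypothetical `job` table with a `done` flag, a plain dictionary as a stand-in for the search index, and SQLite in place of Postgres.

```python
import sqlite3


def process_pending(conn, index, crash_after_handler=False):
    """Run every pending job once; a job stays pending until marked done."""
    rows = conn.execute(
        "SELECT id, annotation_id FROM job WHERE done = 0 ORDER BY id"
    ).fetchall()
    for job_id, annotation_id in rows:
        # Idempotent handler: indexing the same id twice leaves one entry.
        index[annotation_id] = f"doc-{annotation_id}"
        if crash_after_handler:
            raise RuntimeError("simulated worker crash before marking done")
        with conn:
            conn.execute("UPDATE job SET done = 1 WHERE id = ?", (job_id,))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job (id INTEGER PRIMARY KEY, annotation_id INTEGER, "
    "done INTEGER NOT NULL DEFAULT 0)"
)
with conn:
    conn.executemany(
        "INSERT INTO job (annotation_id) VALUES (?)", [(101,), (102,)]
    )

index = {}  # stand-in for the search index, keyed by annotation id
try:
    process_pending(conn, index, crash_after_handler=True)
except RuntimeError:
    pass  # job 101 was indexed but never marked done

process_pending(conn, index)  # the retry re-runs 101, then runs 102

pending = conn.execute("SELECT COUNT(*) FROM job WHERE done = 0").fetchone()[0]
```

Job 101 runs twice, yet the index ends up with one entry per annotation and no pending jobs. On Postgres, concurrent workers would typically claim rows with `SELECT ... FOR UPDATE SKIP LOCKED`; this single-worker sketch leaves that out.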
What QueueForge Could Have Prevented
This is exactly the category of outage that modern queue tooling should detect before production users notice.
A platform like QueueForge could surface warning signals such as:
- Successful database commits paired with failed enqueue attempts
- Queue publish latency spikes
- Growing transaction-to-job mismatch rates
- Missing downstream consumers
- Jobs never reaching processing pipelines
- Consistency drift between systems
The dangerous part of the outage was not that RabbitMQ failed.
Infrastructure failures happen constantly.
The dangerous part was that the application kept acknowledging success while silently dropping critical background work.
That is the exact failure mode teams need visibility into.
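One of the signals listed above, the transaction-to-job mismatch rate, is cheap to compute inside the application itself. The following is a minimal sketch with a hypothetical `EnqueueMonitor` class and an arbitrary 1% alert threshold. It is not the QueueForge API, only an illustration of the metric.

```python
from dataclasses import dataclass


@dataclass
class EnqueueMonitor:
    """Hypothetical counter pair; not part of any real product's API."""

    commits: int = 0
    enqueue_failures: int = 0

    def record(self, committed, enqueued):
        """Count a committed write and note whether its job was enqueued."""
        if committed:
            self.commits += 1
            if not enqueued:
                self.enqueue_failures += 1

    def mismatch_rate(self):
        """Fraction of committed writes whose job was never enqueued."""
        return self.enqueue_failures / self.commits if self.commits else 0.0

    def should_alert(self, threshold=0.01):
        return self.mismatch_rate() > threshold


monitor = EnqueueMonitor()
for _ in range(98):
    monitor.record(committed=True, enqueued=True)
for _ in range(2):
    monitor.record(committed=True, enqueued=False)

rate = monitor.mismatch_rate()
alert = monitor.should_alert()
```

In a healthy system this rate should sit at zero, so any sustained non-zero value is worth paging on, whatever threshold is chosen.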
The Bigger Lesson
Queues are usually marketed as scalability infrastructure.
But the hardest queue problems are rarely about throughput.
They are about correctness.
The moment an architecture says:
"Save data now, process later"
it introduces a distributed consistency problem.
If that boundary is not explicitly designed for, production will eventually find it.
Usually after years of “it has always worked fine.”
Usually at 3 AM.
And usually after thousands of records have already gone missing.