We Trusted Auto-Ack. The Queue Agreed. Our Costs Didn't.

#eventdriven #java #distributedsystems #architecture

Most async bugs announce themselves. This one didn't.

No failed jobs. No customer complaints. No error logs. Just infrastructure costs climbing steadily with no obvious cause. Only after correlating message IDs across logs did the pattern show: the same message was being processed two, sometimes three times.

The culprit was a race condition hiding inside an acknowledgment pattern.

What Happened

A consumer picked up a message and started doing work. That work took time. Before it finished, the queue's retry timeout fired, assumed failure, and redelivered the message to a second consumer. Now two workers were doing identical work concurrently, both completing successfully, both silently doubling the cost.

The system looked healthy by every normal metric. It just wasn't.
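The timeline above can be sketched as a small model. It is a simplification, not a measurement of the real system: it assumes the queue sends another copy every `retry_timeout_s` seconds until the first ack lands, and that the first ack arrives after `processing_s` seconds. The function name and both parameters are hypothetical.

```python
import math


def copies_started(processing_s: float, retry_timeout_s: float) -> int:
    """Count how many workers start the same message before the first ack.

    Assumed model: a delivery goes out at t = 0, retry_timeout_s,
    2 * retry_timeout_s, ... for as long as no ack has arrived, and the
    first ack arrives at t = processing_s.
    """
    if processing_s <= 0 or retry_timeout_s <= 0:
        raise ValueError("durations must be positive")
    # One delivery for every multiple of the timeout that falls
    # strictly before the first ack.
    return math.ceil(processing_s / retry_timeout_s)
```

In this model, a job that runs 25 seconds against a 10-second timeout is started three times, while a job that finishes within the timeout is started once.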

The Fix

One configuration change.

Python

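# The problem: the broker treats the message as acknowledged as soon as it is delivered
channel.basic_consume(queue='jobs', on_message_callback=process, auto_ack=True)

# The fix: acknowledge explicitly, and only after the work has finished
def process(ch, method, properties, body):
    do_the_work(body)
    # Reached only if do_the_work returned without raising
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue='jobs', on_message_callback=process, auto_ack=False)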

Java (Spring AMQP)

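// The problem: NONE is Spring AMQP's equivalent of broker-level auto-ack.
// (Spring's own AUTO mode already waits for the listener to return before acking.)
@RabbitListener(queues = "jobs", ackMode = "NONE")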
public void process(String message) {
    doTheWork(message);
}

// The fix
@RabbitListener(queues = "jobs", ackMode = "MANUAL")
public void process(String message, Channel channel, @Header(AmqpHeaders.DELIVERY_TAG) long tag)
        throws IOException {
    doTheWork(message);
    channel.basicAck(tag, false);
}

Acknowledge after the work completes, not when the message arrives.

The Real Blindspots

This pattern shows up in any async system. Three things that hide it.

Auto-ack tells the queue you are done before you are. With auto-ack enabled, the broker counts the message as acknowledged the moment it is delivered, so the acknowledgment says nothing about whether the work actually finished. Whatever retry timeout sits around the queue then has only elapsed time to go on: if your worker takes longer than that timeout, the message is treated as failed and delivered again. A second consumer picks it up and starts the same work. Both complete. Both look successful. Neither knows about the other.

Manual acknowledgment closes this gap. The queue does not consider the message done until your code explicitly says so, after the work is genuinely finished.

Timeout values set for ideal conditions. When your worker runs slow due to load, cold start, or external API lag, the queue retries before you finish. Even with manual ack, if your acknowledgment timeout (the visibility timeout in SQS, the delivery acknowledgement timeout in RabbitMQ) is shorter than your worst-case processing time, you will see the same duplicate behavior. Your timeout needs to reflect worst-case latency, not average.
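One way to act on this is to derive the timeout from the slowest run you have observed rather than from the mean. The sketch below is illustrative only: `suggest_timeout` is a hypothetical helper, and the default `headroom` of 1.5 is an arbitrary assumption you would tune for your own workload.

```python
from typing import Sequence


def suggest_timeout(latencies_s: Sequence[float], headroom: float = 1.5) -> float:
    """Return a timeout based on the slowest observed run, not the average.

    latencies_s: observed processing times in seconds (assumed input).
    headroom:    multiplier applied on top of the worst case (assumed value).
    """
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    if headroom < 1.0:
        raise ValueError("headroom below 1.0 would undercut the worst case")
    return max(latencies_s) * headroom
```

With samples of 2, 3, and 40 seconds the mean is 15 seconds, which the 40-second run would blow straight through. Sizing from the maximum gives 60 seconds instead. Observed samples can still understate the true worst case, so this narrows the window for duplicates without closing it.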

Idempotency masking the problem. If duplicate work produces the same result, nothing breaks visibly. No errors, no data corruption, just silent duplicate calls. The cost climbs and nothing alerts you. This is exactly why the bug survived as long as it did.
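Because nothing fails, the duplicates have to be looked for on purpose. A minimal sketch of the message ID correlation described earlier, assuming you can already extract one message ID per completed job from your logs (the extraction step and the `find_duplicates` name are hypothetical):

```python
from collections import Counter
from typing import Dict, Iterable


def find_duplicates(processed_ids: Iterable[str]) -> Dict[str, int]:
    """Return {message_id: times_processed} for ids processed more than once."""
    counts = Counter(processed_ids)
    return {message_id: n for message_id, n in counts.items() if n > 1}
```

An empty result means every message was processed once. Anything else is a candidate for an alert, even when every individual job reported success.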

The Checklist

Before shipping any async worker:

Manual acknowledgment only. Ack after completion, never on receipt.
Timeout values account for worst-case latency, not average.
Every message has a correlation ID traceable across all consumers.
Worker operations are idempotent and safe to run twice.
You are monitoring work volume, not just queue depth.
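The idempotency item is the hardest one to get right, so here is a minimal sketch of one approach: claim the message ID and do the work inside a single database transaction, and let a uniqueness constraint reject the second attempt. It uses an in-memory SQLite database purely for illustration. The table name, `make_store`, and `handle_once` are hypothetical, and the message ID is assumed to be assigned by the producer and carried with the message.

```python
import sqlite3
from typing import Callable


def make_store() -> sqlite3.Connection:
    """Create an in-memory store of message ids that were already handled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
    return conn


def handle_once(conn: sqlite3.Connection, message_id: str,
                do_the_work: Callable[[], None]) -> bool:
    """Run do_the_work at most once per message_id; return True if it ran."""
    try:
        # One transaction: committed if the block succeeds,
        # rolled back if anything inside it raises.
        with conn:
            # The PRIMARY KEY makes "check and claim" a single atomic step.
            conn.execute(
                "INSERT INTO processed (message_id) VALUES (?)", (message_id,)
            )
            do_the_work()
    except sqlite3.IntegrityError:
        return False  # this id was already claimed, so skip the work
    return True
```

If `do_the_work` raises, the claim is rolled back and a later redelivery can retry. Note the limit of this sketch: the guarantee only covers effects written in the same transaction. A call to an external API inside `do_the_work` would need its own idempotency key.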

The Learning

The queue delivered the message successfully. That is not the same as the work being done once.

Top comments (1)

ANP2 Network • May 31

The closing line is the real takeaway — delivery succeeding isn't the same as the work happening once. I'd push on the checklist a little, though: manual ack and worst-case timeouts only shrink the duplicate window, they can't close it. A worker can finish the effect and then die before the ack lands, and at-least-once redelivers — that gap is irreducible. So those five items aren't equal mitigations: only "idempotent effects" is a correctness property; the other four just lower how often you pay for a duplicate. The subtlety that actually bit me is that "safe to run twice" hides a second race — the dedup key has to be assigned by the producer and ride along with the message, and the "seen this id?" check has to be atomic with the write. Two consumers each minting their own key, or checking-then-writing in separate steps, both clear the dedup gate before either commits, and you're back to double work one layer up. If you want one signal that catches the whole family regardless of cause: count distinct effects per correlation id and alert when it goes above one.