What Back-Pressure Is, and Why Your Queues Need It

#webdev #tutorial

A queue sits between something that produces work and something that consumes it. The pitch is that it smooths out bursts: when requests arrive faster than you can handle them, they wait in line instead of failing. That works right up until the line grows faster than it drains. Then the queue stops being a buffer and becomes the place your system goes to die — usually by running out of memory, sometimes by serving responses so stale they are worthless. Back-pressure is the mechanism that stops that from happening. It is the signal a slow consumer sends back to a fast producer that says: slow down, I cannot keep up.

A queue is a shock absorber until it isn't

The failure mode is worth spelling out because it is so common. Say your producer pushes 1,000 messages per second and your consumer processes 800 per second. Every second, 200 messages pile up. The queue depth climbs linearly. For the first few minutes nothing looks wrong — latency creeps up, but no errors. Then one of three things happens, depending on where the queue lives.

If the queue is an in-memory list in your process, memory use grows until the runtime throws an out-of-memory error or the operating system's OOM killer terminates the process, dropping every queued message at once. If it is a managed broker with a bounded size, it starts rejecting new messages, and now your producer is the one getting errors, often without code to handle them. If it is an unbounded broker with disk spillover, the queue keeps growing and your consumers fall further and further behind, so by the time a message is finally processed the event it describes is minutes old and the user has long since given up.

None of these is the queue "absorbing" load. The load was never absorbed. It was deferred, and deferral with no exit is just a slower crash. An unbounded queue does not have a capacity problem you can buy your way out of with more RAM — it has a rate problem, and the only real fixes are to speed up the consumer, slow down the producer, or shed work. Back-pressure is how the system chooses one of those automatically instead of waiting for the OOM killer to choose for it.
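The rate arithmetic from the example can be written down in a few lines. This is only a sketch: the function names are made up for illustration, and the one-million-message capacity in the usage note is an assumed figure, not one from a real system.

```python
def queue_depth(produce_rate, consume_rate, seconds):
    """Backlog after `seconds`, starting from an empty queue."""
    surplus = produce_rate - consume_rate
    return max(0, surplus * seconds)


def seconds_until_full(produce_rate, consume_rate, capacity):
    """Time until a queue of `capacity` messages fills, or None if it never does."""
    surplus = produce_rate - consume_rate
    if surplus <= 0:
        return None  # the consumer keeps up, so the backlog never grows
    return capacity / surplus
```

With the numbers above, queue_depth(1000, 800, 60) gives 12,000 messages after one minute, and a hypothetical capacity of 1,000,000 messages lasts 5,000 seconds, a little over 83 minutes. Doubling the capacity only doubles that time; it does nothing to the 200 messages per second of surplus.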

If you cannot answer "what happens when this queue is full?", the answer is decided by whichever runs out first: memory, disk, or your patience reading the postmortem. Set a maximum size on every queue. A bounded queue that rejects work loudly is far easier to operate than an unbounded one that fails silently at 3 a.m.

How back-pressure actually works

Back-pressure is not one technique — it is a category. The shared idea is that the consumer's capacity is allowed to influence the producer's rate. There are four common ways to wire that influence, in rough order of how gentle they are.

Blocking. The simplest form. When the queue is full, the producer's put call blocks until a slot frees up. Go channels do this by default: send to a full channel and the goroutine parks until a receiver pulls a value. Java's ArrayBlockingQueue does the same when you call put() (its offer() and add() methods return false or throw instead of waiting). Blocking propagates back-pressure for free, because a slow consumer literally stalls the producer, but it only works when producer and consumer run in the same process and share the queue. It does not cross a network on its own.
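The same behavior can be sketched in Python with the standard library's queue.Queue, whose put() blocks when maxsize is reached. The function name, the item count, and the one-millisecond sleep that stands in for slow work are all illustrative assumptions.

```python
import queue
import threading
import time


def run_blocking_demo(items=20, capacity=4):
    """Push `items` values through a bounded queue with a slow consumer."""
    q = queue.Queue(maxsize=capacity)
    consumed = []

    def consumer():
        while True:
            item = q.get()
            if item is None:  # sentinel: the producer is finished
                return
            time.sleep(0.001)  # stand-in for slow processing
            consumed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()

    max_depth = 0
    for i in range(items):
        q.put(i)  # blocks here whenever the queue is full
        max_depth = max(max_depth, q.qsize())
    q.put(None)
    worker.join()
    return max_depth, consumed
```

The producer loop contains no rate-limiting logic at all. It runs at the consumer's pace simply because put() will not return until there is room, and the observed depth never exceeds the configured capacity.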

Dropping and load shedding. When you cannot afford to block — say the producer is handling live HTTP requests and stalling it would stall users — you drop instead. Either drop the newest message (reject the incoming request with a 429 or 503), or drop the oldest (overwrite stale data nobody will miss). Load shedding is the deliberate version: under overload, reject a fraction of requests immediately so the ones you do accept finish on time. A system that sheds 20% of load and serves the rest in 50 ms is healthier than one that accepts everything and serves all of it in 8 seconds.
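Both dropping policies fit in a few lines of Python. This is a minimal sketch: offer_or_drop is a hypothetical helper name, and the sizes are arbitrary. Drop-newest uses put_nowait(), which raises queue.Full instead of waiting; drop-oldest uses a deque with maxlen, which discards from the opposite end when it is full.

```python
import queue
from collections import deque


def offer_or_drop(q, item):
    """Drop-newest: refuse the incoming item instead of blocking."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False  # the caller should turn this into a visible rejection


# Drop-oldest: a deque with maxlen silently evicts the oldest entry.
latest = deque(maxlen=3)
for reading in range(5):
    latest.append(reading)
# latest now holds 2, 3, 4
```

Drop-newest suits work where the caller can be told "no" and try again later. Drop-oldest suits data where only the freshest values matter, such as sensor readings or status updates.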

Credit-based flow control. The consumer hands the producer a budget of "credits" — permission to send N messages. The producer spends a credit per message and stops when it runs out, replenishing only when the consumer grants more. TCP's receive window works exactly this way: the receiver advertises how many bytes it has buffer for, and the sender is not allowed to exceed it. gRPC and HTTP/2 carry the same idea up to the application layer. Credits are how you do back-pressure across a network, where blocking is not an option.
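A toy version of the credit idea, stripped of any networking, might look like the following. CreditedSender and its methods are invented for illustration; real protocols such as TCP and HTTP/2 count bytes rather than messages and carry the grants inside protocol frames.

```python
class CreditedSender:
    """Producer side of a credit scheme: one credit buys one message."""

    def __init__(self):
        self.credits = 0
        self.sent = []  # stands in for the network connection

    def grant(self, n):
        """Called when the consumer announces room for n more messages."""
        self.credits += n

    def try_send(self, msg):
        """Send only if a credit is available; otherwise the producer must wait."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.sent.append(msg)
        return True
```

After grant(2), the first two try_send() calls succeed and the third returns False until the consumer grants more. The producer never has to guess how much buffer the other side has left, because it is told explicitly.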

Pull instead of push. Invert the relationship: instead of the producer pushing whenever it has data, the consumer pulls when it is ready. Kafka consumers poll for records at their own pace; the broker never forces messages on them. Reactive Streams (the spec behind RxJava, Project Reactor, and Java's Flow API) builds the entire protocol around a request(n) call where the subscriber asks for exactly as many items as it can handle. Pull-based systems have back-pressure built into their shape — a slow consumer simply pulls less often.
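Python generators give a small in-process picture of the pull model. The sketch below is not the Reactive Streams API; request here is a hypothetical helper that only borrows the name to show the shape of the idea.

```python
from itertools import islice


def produce():
    """An endless source. The body only advances when a consumer asks for a value."""
    n = 0
    while True:
        yield n
        n += 1


def request(source, n):
    """Pull exactly n items from the source, in the spirit of request(n)."""
    return list(islice(source, n))
```

Calling request(src, 3) on a fresh source returns the first three values, and the generator then sits idle until the next request. There is no queue to overflow, because nothing is produced before it is asked for.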

Putting back-pressure into your own services

You rarely need to invent any of this. You need to not opt out of it, which is surprisingly easy to do by accident.

Start by bounding every queue, including the implicit ones. A thread pool's task queue is a queue. A buffered channel is a queue. An async runtime's pending-task list is a queue. Each has a configuration for maximum size and a policy for what to do when full — set both deliberately rather than taking the default, which is frequently "unbounded."
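Python's ThreadPoolExecutor is one example of an implicit queue with no built-in size limit: submit() accepts work as fast as you call it. One common workaround, sketched here under the assumption that blocking the caller is acceptable, is to gate submissions with a semaphore. BoundedExecutor and its parameters are illustrative names, not a standard-library class.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedExecutor:
    """Wraps a thread pool so at most max_workers + max_pending tasks are in flight."""

    def __init__(self, max_workers, max_pending):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._slots = threading.BoundedSemaphore(max_workers + max_pending)

    def submit(self, fn, *args, **kwargs):
        self._slots.acquire()  # blocks the caller once the limit is reached
        try:
            future = self._pool.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()  # submission failed, so give the slot back
            raise
        # Free the slot when the task finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _: self._slots.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

The policy when full is "block the submitter". If the submitter must not block, acquire(blocking=False) could be used instead and a failed acquire turned into a rejection.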

Next, make sure the bound actually propagates. A bounded queue that silently drops messages when full has back-pressure that goes nowhere. The whole point is that fullness becomes a signal someone upstream reacts to — a blocked call, a rejected request, a paused poll. If your producer ignores the rejection and retries in a tight loop, you have replaced a memory leak with a busy-wait. Honor the signal: back off, or propagate the rejection further upstream until it reaches something that can legitimately slow down or shed.
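One way to honor a rejection rather than hammer on it is exponential backoff with jitter. The sketch below assumes a hypothetical try_send callable that returns False when the downstream queue refuses the message; the default delays are arbitrary.

```python
import random
import time


def send_with_backoff(try_send, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Retry a rejected send with jittered, growing delays, then give up."""
    for attempt in range(attempts):
        if try_send():
            return True
        if attempt < attempts - 1:
            # Full jitter: wait a random time up to an exponentially growing limit.
            limit = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, limit))
    return False  # still rejected: pass the failure upstream
```

Returning False after the final attempt matters as much as the delays. It hands the decision to the caller, which may be in a better position to slow down or shed the work.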

Then, watch queue depth, not just throughput. Throughput tells you how fast work is leaving. Queue depth over time tells you whether work is arriving faster than it leaves, which is the leading indicator of the failure described earlier. A steadily climbing queue depth is a system that is already losing, even if latency still looks fine. Alert on the slope, not just the absolute number.
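A slope check does not need much machinery. The sketch below fits a least-squares line through recent (time, depth) samples; the function names and the threshold of 50 messages per second are assumptions chosen for illustration, and a real monitoring system would usually offer its own rate or derivative function.

```python
def depth_slope(samples):
    """Least-squares slope, in messages per second, of (t_seconds, depth) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_d = sum(d for _, d in samples) / n
    num = sum((t - mean_t) * (d - mean_d) for t, d in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


def should_alert(samples, max_slope=50.0):
    """Fire when the backlog is growing faster than max_slope messages per second."""
    return depth_slope(samples) > max_slope
```

For the earlier example, samples taken every ten seconds would read roughly 0, 2,000, 4,000 and 6,000, a slope of 200 messages per second. That trips the alert long before any absolute depth limit is reached.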

Finally, decide your overload policy before you are overloaded. "What do we drop, and how do we tell the caller?" is a product question as much as a technical one. Dropping the oldest analytics events is fine; dropping the oldest payment instructions is not. Returning a 503 with a Retry-After header lets a well-behaved client cooperate; silently timing out does not. The systems that survive traffic spikes are not the ones with the biggest queues. They are the ones that decided, in advance, exactly how they would say no.
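Saying no politely can be a very small piece of code. This sketch is framework-agnostic and returns a status code and a header dictionary; overload_response, its parameters, and the five-second default are hypothetical and would be adapted to whatever web framework is in use.

```python
def overload_response(queue_depth, max_depth, retry_after_seconds=5):
    """Decide whether to accept a request, and how to refuse it if not."""
    if queue_depth < max_depth:
        return 202, {}  # accepted for processing
    # Refuse immediately and tell the client when it is reasonable to retry.
    return 503, {"Retry-After": str(retry_after_seconds)}
```

The useful property is that the refusal is fast and explicit. The client gets an answer right away, along with a hint it can act on, instead of waiting for a timeout.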

Originally published at pickuma.com.