- Me: xgabriel.com | GitHub
Of the five backpressure mechanisms in common use, three of them make the outage worse the harder they work. Two of them only help if your producer cooperates, which, in a microservices shop, it usually doesn't.
That's the part nobody tells you when you read the wiki page on bounded queues. You wire up a chan with a capacity of 1024, you watch it stay flat in dev, you ship it, and three weeks later a slow downstream turns your service into a memory bomb that also silently drops every fifth request. The queue did its job. The system fell over anyway.
This post ranks the five mechanisms by how they fail, not by how they look in a textbook. Then it shows the layered stack that actually survives production.
What backpressure actually is (and isn't)
Backpressure is a signal that propagates from consumer back to producer. Slow consumer, slower producer. That's it.
It is not "a queue with a max size." A bounded queue caps memory. When the queue is full, the producer either blocks, drops, or gets an error. None of those are signals the producer learns from unless you wire them into something. A bounded chan in Go that drops on full is the equivalent of a fire alarm in a soundproof room.
Real backpressure has three parts:
- The consumer tells someone it's behind.
- The signal reaches a producer (or a load balancer, or an admission controller, or a human).
- The producer slows down, sheds, or fails fast.
Every mechanism below implements one or two of those three. None of them implement all three on their own. That's why you compose them.
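A minimal sketch of those three parts in Go, assuming the consumer can report its queue depth. The Signal and Decision types and the 80% threshold are hypothetical, picked only to make the shape visible:

```go
package main

import "fmt"

// Signal is what the consumer reports about itself (part 1).
type Signal struct {
	Depth    int // items currently waiting
	Capacity int // maximum items the queue can hold
}

// Decision is the producer's reaction once the signal reaches it (part 3).
type Decision string

const (
	Proceed  Decision = "proceed"
	SlowDown Decision = "slow_down"
	FailFast Decision = "fail_fast"
)

// Decide maps the consumer's signal to a producer action.
// The 80% threshold is an assumption, not a recommendation.
func Decide(s Signal) Decision {
	switch {
	case s.Capacity <= 0 || s.Depth >= s.Capacity:
		return FailFast // consumer is saturated: reject immediately
	case s.Depth*5 >= s.Capacity*4:
		return SlowDown // at or above 80% full: reduce the send rate
	default:
		return Proceed
	}
}

func main() {
	// Part 2 is the plumbing. Here the signal is simply passed as an argument;
	// in a real system it would travel over a header, a metric, or a control channel.
	for _, depth := range []int{10, 900, 1024} {
		fmt.Println(depth, Decide(Signal{Depth: depth, Capacity: 1024}))
	}
}
```

The hard part in production is part 2, getting the signal to something that can act on it. The decision function is the easy bit.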
Mechanism 1: Bounded queues
The default everywhere. Buffered chan in Go, BlockingQueue in Java, buffer(n, overflowStrategy) in Akka Streams, mpsc::sync_channel(n) in Rust's standard library (or Tokio's mpsc::channel(n)).
// Go: bounded channel, producer waits up to 50ms when full
jobs := make(chan Job, 1024)
// producer side
select {
case jobs <- j:
	// accepted
case <-time.After(50 * time.Millisecond):
	// queue full for 50ms, bail
	return ErrQueueFull
}
// Java: ArrayBlockingQueue with offer + timeout
BlockingQueue<Job> jobs = new ArrayBlockingQueue<>(1024);
boolean accepted = jobs.offer(job, 50, TimeUnit.MILLISECONDS);
if (!accepted) {
    return Response.status(503).build();
}
The blocking put() version is worse than it looks. It propagates backpressure exactly one hop, to the immediate caller. If that caller is an HTTP handler, you've moved the queue from memory to your thread pool, and now you've got 800 threads parked on put() while your load balancer cheerfully sends you more.
The offer() version with a timeout (it waits at most 50ms, then gives up) is honest about what it's doing: shedding load. But on its own, it has no idea why it's shedding, and the producer (an upstream service, usually) has no idea it should slow down. It just gets a 503 and retries.
How it breaks: bounded queues alone are memory caps, not backpressure. In a fan-in topology (twenty services feeding one) each upstream sees a 1-in-20 reject rate and keeps the firehose on. The downstream is at 100% CPU, the queue is full, and no signal reached anyone with the power to slow down.
Mechanism 2: Drop-on-full (load shedding)
Same as above, but instead of waiting, you drop. The improvement is that you stop pretending the queue is the backpressure.
Drop-on-full only works if you tag every drop with a reason. Without the reason tag, your metrics tell you drops_total = 47823 and you have no idea whether your downstream is slow, your queue is mis-sized, or a single bad client is doing 90% of the work.
// drop with reason-tagged metric
var dropCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{Name: "queue_drops_total"},
	[]string{"reason", "queue"},
)
func enqueue(j Job) error {
	select {
	case jobs <- j:
		return nil
	default:
		// tag the reason; this is the whole point
		dropCounter.WithLabelValues("queue_full", "ingest").Inc()
		return ErrShed
	}
}
// elsewhere: a different drop reason
if j.PriorityClass == LowPriority && load.Current() > 0.85 {
	dropCounter.WithLabelValues("low_priority_shed", "ingest").Inc()
	return ErrShed
}
Reasons you actually want as labels: queue_full, timeout_in_queue, low_priority_shed, client_rate_limit, circuit_open, dependency_timeout. Six reasons. Six different remediation paths in the runbook. Without the labels, you have one number that means nothing.
How it breaks: drop-on-full alone is a producer-doesn't-care mechanism. The upstream service that just got a 503 will retry. If it has exponential backoff with jitter, fine. If it doesn't (and a depressing number of HTTP clients don't, by default) you get a retry storm that turns a 20% overload into a 200% overload.
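For the upstream side of that, here is a minimal sketch of exponential backoff with full jitter in Go. The function names and the base and max values are illustrative assumptions, not something from a particular retry library:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// BackoffCeiling returns the un-jittered upper bound for an attempt:
// base * 2^attempt, clamped to max. attempt starts at 0.
func BackoffCeiling(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}

// FullJitter picks a uniformly random delay in [0, ceiling], so that
// clients rejected at the same instant do not all retry at the same instant.
func FullJitter(attempt int, base, max time.Duration) time.Duration {
	ceiling := BackoffCeiling(attempt, base, max)
	if ceiling <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}

func main() {
	base, max := 100*time.Millisecond, 5*time.Second // hypothetical values
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("attempt %d: ceiling=%v delay=%v\n",
			attempt, BackoffCeiling(attempt, base, max), FullJitter(attempt, base, max))
	}
}
```

The jitter matters as much as the exponent. Without it, every client that got shed at the same moment comes back at the same moment.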
Mechanism 3: Token buckets / rate limits
This is the first mechanism in the list that actually tries to push back. A token bucket gives the producer an HTTP-level signal: 429 Too Many Requests, with a Retry-After header that tells it when to come back.
// Guava-flavored token bucket
RateLimiter limiter = RateLimiter.create(1000.0); // 1000 req/sec
@PostMapping("/ingest")
public ResponseEntity<?> ingest(@RequestBody Event e) {
    if (!limiter.tryAcquire(1, 50, TimeUnit.MILLISECONDS)) {
        return ResponseEntity.status(429)
            .header("Retry-After", "1")
            .build();
    }
    return ResponseEntity.ok(process(e));
}
The 429 + Retry-After pattern is the only one in this list that gives the producer a structured "slow down" signal that's part of an actual spec (429 is defined in RFC 6585, Retry-After in RFC 9110). A well-behaved client will respect it. Most clients are well-behaved when you write them. Most clients are not well-behaved six months later when the original author has left.
How it breaks: the rate limit is a configuration number, not a property of your system. If your downstream gets 30% slower because a database is under-provisioned, your rate limit is now 30% too high. The config file says 1000, so the bucket lets 1000 through, and your queue fills up anyway. Rate limits encode the version of the world you had when you set them. The world drifts; the config doesn't.
The other failure mode: rate limits applied at the edge protect the edge. They do nothing for a hot internal path between two services where the producer is your own code.
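To show what sits behind a limiter like that, here is a hand-rolled token bucket sketched in Go that also computes a Retry-After value. It is a simplified illustration, not Guava's algorithm: the TokenBucket type, the Allow method, and the fixed epoch reference time are all assumptions made for the example, and it assumes rate > 0.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

// epoch is a fixed reference time so the demo is deterministic.
var epoch = time.Unix(0, 0)

// TokenBucket refills at rate tokens per second, up to burst tokens.
type TokenBucket struct {
	mu     sync.Mutex
	rate   float64
	burst  float64
	tokens float64
	last   time.Time
}

// NewTokenBucket starts with a full bucket. rate must be > 0.
func NewTokenBucket(rate, burst float64, now time.Time) *TokenBucket {
	return &TokenBucket{rate: rate, burst: burst, tokens: burst, last: now}
}

// Allow takes one token if one is available. Otherwise it returns the
// number of whole seconds until a token will exist, for a Retry-After header.
func (b *TokenBucket) Allow(now time.Time) (ok bool, retryAfterSec int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens = math.Min(b.burst, b.tokens+elapsed*b.rate)
		b.last = now
	}
	if b.tokens >= 1 {
		b.tokens--
		return true, 0
	}
	wait := (1 - b.tokens) / b.rate // seconds until one full token
	return false, int(math.Ceil(wait))
}

func main() {
	b := NewTokenBucket(2, 2, epoch) // 2 req/s with a burst of 2
	for i := 0; i < 3; i++ {
		ok, retry := b.Allow(epoch)
		fmt.Println(ok, retry)
	}
}
```

Note that rate is still a constructor argument. Nothing in the bucket knows whether the downstream got slower, which is exactly the drift problem described above.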
Mechanism 4: Reactive streams
Project Reactor (Flux), RxJava's Flowable, Akka Streams. These use a request(n) protocol where the consumer explicitly asks the producer for n items, and the producer is contractually forbidden from sending more.
// Reactor: explicit request(n) backpressure
Flux<Event> source = redisListener.events();
source
    .onBackpressureBuffer(
        10_000,
        dropped -> dropCounter.labels("reactor_buffer_overflow").inc(),
        BufferOverflowStrategy.DROP_OLDEST
    )
    .flatMap(this::process, 32) // concurrency = 32 outstanding
    .subscribe(new BaseSubscriber<Result>() {
        @Override
        protected void hookOnSubscribe(Subscription s) {
            s.request(128); // initial demand
        }
        @Override
        protected void hookOnNext(Result r) {
            // pull the next one only after handling this
            request(1);
        }
    });
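This is the cleanest model in the list. Demand flows backwards from the slow consumer through every operator, all the way to the source. If the consumer is at capacity and the source can be paused, it gets paused, period. No drops. No timeouts. The onBackpressureBuffer in the example covers the case where the source itself can't be paused: the buffer absorbs the gap and the overflow strategy decides what gets dropped.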
It is also the most fragile. The contract holds end-to-end only inside one runtime. The instant your reactive pipeline calls an HTTP endpoint, you've crossed a boundary where there is no request(n) protocol. Now you're back to bounded buffers and timeouts and 429s.
In a microservices shop where 80% of your hops are network hops, reactive streams give you exquisite backpressure inside each service and zero backpressure between them. It's worth it for the in-process pipeline. Don't kid yourself that you've solved the system-level problem.
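The request(n) idea is not tied to a reactive library. As a rough sketch of the same contract using plain Go channels, the consumer below grants credit and the producer only sends while it holds some. Produce, Consume, and the channel layout are hypothetical names for illustration, not the Reactive Streams API:

```go
package main

import "fmt"

// Produce emits 0..total-1 on out, but only while it holds credit granted
// by the consumer through demand. It never sends more than was requested.
func Produce(total int, demand <-chan int, out chan<- int) {
	defer close(out)
	credit := 0
	for next := 0; next < total; {
		if credit == 0 {
			n, ok := <-demand // block until the consumer asks for more
			if !ok {
				return // consumer went away
			}
			credit += n
			continue
		}
		out <- next
		next++
		credit--
	}
}

// Consume requests batch items at a time and returns everything received.
func Consume(total, batch int) []int {
	if batch <= 0 {
		batch = 1
	}
	// Buffered by one so the request sent after the final item never blocks.
	demand := make(chan int, 1)
	out := make(chan int)
	go Produce(total, demand, out)

	var got []int
	for {
		demand <- batch // the request(n) call
		for i := 0; i < batch; i++ {
			v, ok := <-out
			if !ok {
				return got // producer finished
			}
			got = append(got, v)
		}
	}
}

func main() {
	fmt.Println(Consume(5, 2))
}
```

It works because both ends share a process and a channel. Put an HTTP call between them and the credit channel has nowhere to live, which is the boundary problem described above.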
Mechanism 5: Circuit breakers
A circuit breaker is the Hystrix-style "stop calling the broken thing" pattern. It tracks failure rate against a downstream and, when it crosses a threshold, opens the circuit so all subsequent calls fail fast instead of timing out.
// resilience4j (Hystrix's spiritual successor)
CircuitBreakerConfig cfg = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .slowCallRateThreshold(50)
    .slowCallDurationThreshold(Duration.ofMillis(500))
    .waitDurationInOpenState(Duration.ofSeconds(10))
    .slidingWindowSize(100)
    .build();
CircuitBreaker breaker = CircuitBreaker.of("payments", cfg);
Supplier<Receipt> call = CircuitBreaker.decorateSupplier(
    breaker,
    () -> paymentsClient.charge(req)
);
try {
    return Try.ofSupplier(call).get();
} catch (CallNotPermittedException e) {
    dropCounter.labels("circuit_open").inc();
    return Receipt.deferred();
}
Circuit breakers protect the caller, not the system. They keep your service from hanging on a broken downstream. They do not push back on whoever's calling you, they do not slow producers, they do not signal a load problem upstream.
How it breaks: broken downstream stays broken. While the circuit is open, requests fail in 5ms instead of 5000ms. Your callers, seeing 5ms responses with errors, often interpret that as "service is fast and unhealthy" and retry harder. You've replaced a slow-failure mode with a fast-failure mode and accidentally amplified the upstream's load.
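To make the state machine concrete, here is a stripped-down breaker sketched in Go. It is a deliberate simplification: it counts consecutive failures instead of the failure-rate sliding window that resilience4j uses, it is not safe for concurrent use, and the Breaker type and its fields are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	ErrOpen = errors.New("circuit open")
	errBoom = errors.New("boom")   // stand-in for a downstream failure
	epoch   = time.Unix(0, 0)      // fixed reference time for the demo
)

// Breaker opens after maxFailures consecutive failures and stays open
// for cooldown. Not goroutine-safe; a real one needs a mutex.
type Breaker struct {
	maxFailures int
	cooldown    time.Duration
	failures    int
	open        bool
	openedAt    time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the circuit is open. Once the cooldown has elapsed it
// lets a trial call through (half-open): success closes the circuit,
// failure reopens it and restarts the cooldown.
func (b *Breaker) Call(now time.Time, fn func() error) error {
	if b.open && now.Sub(b.openedAt) < b.cooldown {
		return ErrOpen // fail fast without touching the dependency
	}
	err := fn()
	if err == nil {
		b.failures, b.open = 0, false
		return nil
	}
	b.failures++
	if b.open || b.failures >= b.maxFailures {
		b.open, b.openedAt = true, now
	}
	return err
}

func main() {
	b := NewBreaker(2, 10*time.Second)
	fail := func() error { return errBoom }
	fmt.Println(b.Call(epoch, fail))                  // boom
	fmt.Println(b.Call(epoch, fail))                  // boom, circuit opens
	fmt.Println(b.Call(epoch.Add(time.Second), fail)) // circuit open
}
```

Everything in there is about the caller's view of one dependency. There is no code path that tells anyone upstream to send less.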
The three that break under load, alone
Bounded queues, drop-on-full, and circuit breakers all have the same blind spot: none of them tell the producer to slow down. They all reject. The producer's retry policy decides what happens next, and the producer's retry policy is almost never coordinated with your shedding strategy.
Run any one of these in isolation and you get an outage pattern that looks like: latency creeps up, then a step-function spike, then a flood of retries from upstream, then memory or thread exhaustion, then a restart loop. The mechanism worked exactly as designed. The system died anyway.
The two that actually compose
Rate limits with cooperative clients, and reactive streams end-to-end inside a single runtime. Those are the two with a real backpressure signal.
Rate limits work because 429 + Retry-After is a contract the producer understands. The producer-side requirement is a retry library that honors Retry-After and applies jitter. Don't assume your HTTP client does this out of the box: okhttp, reqwest, and Go's stdlib http.Client generally need a retry wrapper or middleware before they will honor a 429, while the AWS SDKs ship with backoff and jitter built in. Check yours.
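On the producer side, honoring the header starts with parsing it. A small sketch in Go, assuming you are writing your own retry wrapper; RetryAfter is a hypothetical helper, while http.ParseTime and strconv.Atoi are standard library calls:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// epoch is a fixed reference time used only to keep the demo deterministic.
var epoch = time.Unix(0, 0)

// RetryAfter parses a Retry-After header value, which is either a number of
// seconds or an HTTP-date. The bool is false if the value is missing or invalid.
func RetryAfter(value string, now time.Time) (time.Duration, bool) {
	if value == "" {
		return 0, false
	}
	if secs, err := strconv.Atoi(value); err == nil {
		if secs < 0 {
			return 0, false
		}
		return time.Duration(secs) * time.Second, true
	}
	if t, err := http.ParseTime(value); err == nil {
		if d := t.Sub(now); d > 0 {
			return d, true
		}
		return 0, true // the date has already passed: retry immediately
	}
	return 0, false
}

func main() {
	fmt.Println(RetryAfter("1", epoch))
	fmt.Println(RetryAfter("Thu, 01 Jan 1970 00:00:30 GMT", epoch))
	fmt.Println(RetryAfter("soon", epoch))
}
```

A wrapper would sleep for the returned duration, plus some jitter, before retrying, and fall back to its own backoff schedule when the bool is false.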
Reactive streams work in-process because the contract is enforced by the operator graph. They don't work across HTTP unless you add something like RSocket, which gives you a request(n)-style transport. At that point you've bought into a new wire protocol and most teams won't.
A production stack that works
You don't pick one mechanism. You stack them, with each layer doing the one job the next layer can't.
+--------------------------------------------------------+
| EDGE: token-bucket rate limit (per-tenant, per-route)  |
|   - 429 + Retry-After to external producers            |
|   - drops tagged: client_rate_limit                    |
+--------------------------------------------------------+
                            |
+--------------------------------------------------------+
| ADMISSION: bounded queue per service                   |
|   - non-blocking offer with 50ms timeout               |
|   - drops tagged: queue_full, timeout_in_queue         |
+--------------------------------------------------------+
                            |
+--------------------------------------------------------+
| SHED: drop-on-full with priority + reason tags         |
|   - low-priority shed above 85% capacity               |
|   - drops tagged: low_priority_shed                    |
+--------------------------------------------------------+
                            |
+--------------------------------------------------------+
| HOT PATH: reactive Flux with request(n) inside         |
|   - in-process backpressure between stages             |
|   - drops tagged: reactor_buffer_overflow              |
+--------------------------------------------------------+
                            |
+--------------------------------------------------------+
| PROTECT: circuit breaker per downstream dep            |
|   - fast-fail with fallback (queued, cached, default)  |
|   - drops tagged: circuit_open, dependency_timeout     |
+--------------------------------------------------------+
Each layer's job:
- Edge rate limit is the only place you talk to clients you don't control. Cooperative clients see 429 and back off; the rest get capped.
- Bounded queue is the memory cap. Without it, a sudden spike eats your heap.
- Drop-on-full with priority is where you make policy decisions visible: which traffic matters more, and why.
- Reactive inside the hot path keeps your async pipeline from filling memory between operators. Don't extend it past a network boundary.
- Circuit breaker protects you from a downstream that's already broken. It is not the layer that talks to your producers.
The reason-tagged drop metric runs through all of them. Layering is the first non-negotiable; the tagging is the second.
What to measure
If you only build one dashboard for this system, build this one. Five panels:
- Queue depth p99, per queue, per service. Not the mean. The p99. Means lie. A queue at mean depth 12 / p99 depth 1023 is a queue that periodically pegs full.
- Drop rate by reason, stacked. The shape of this chart is your diagnosis: a wall of circuit_open means a downstream incident; a wall of queue_full means you're undersized; a wall of client_rate_limit means a client just got noisy.
- Time-in-queue p99. The single best indicator of "we are shedding correctly but slowly." If items wait 4 seconds in queue before getting processed, you'll get downstream timeouts that look like a different problem.
- Downstream rejection rate (from your side), per dependency. Counts the 429s, 503s, and circuit-open responses you receive. A spike here is what your circuit breaker is reacting to. It should never be a surprise.
- Retry-storm detection: ratio of (requests_received / unique_request_ids) over a 30-second window. When the ratio jumps from 1.05 to 2.5, a client is retrying everything. Page someone.
If your platform supports it, a sixth panel: per-tenant or per-client request rate as a heatmap. The single most useful diagnostic in a load incident is "is this everyone, or one customer." A heatmap answers that in two seconds.
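The retry-storm ratio from the fifth panel is simple enough to sketch. This Go version assumes clients reuse the same request ID across retries (an idempotency key, for example) and that you can collect the IDs seen in one window; RetryRatio is a hypothetical helper, not part of any metrics library:

```go
package main

import "fmt"

// RetryRatio returns requests_received / unique_request_ids for one window.
// ids holds the request ID of every request seen in that window.
// An empty window returns 0.
func RetryRatio(ids []string) float64 {
	if len(ids) == 0 {
		return 0
	}
	unique := make(map[string]struct{}, len(ids))
	for _, id := range ids {
		unique[id] = struct{}{}
	}
	return float64(len(ids)) / float64(len(unique))
}

func main() {
	calm := []string{"a", "b", "c", "d"}
	storm := []string{"a", "a", "a", "b", "b", "b"}
	fmt.Println(RetryRatio(calm), RetryRatio(storm))
}
```

If clients mint a fresh ID on every retry, this ratio stays at 1.0 no matter how bad the storm is, so verify that assumption before trusting the panel.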
The shape of the win
You can ship a service with one mechanism. Lots of teams do. It works fine until the day it doesn't, and when it doesn't, the failure mode is loud, expensive, and confusing. The one mechanism you picked did exactly what it was designed to do while the system around it fell apart.
The shape of the win is boring: layered, tagged, measured. Edge rate limit, bounded queue, drop-on-full with reasons, reactive in the hot path, circuit breaker per dep. A dashboard that tells you which layer is firing. A runbook that maps each drop reason to an action.
That's the stack. It survives Black Friday. It survives a bad deploy at a downstream team. It doesn't survive everything (nothing does) but when it falls over, you know why within 90 seconds, and the next page of the runbook tells you what to do.
What's the one mechanism your team is leaning on hardest right now, and which of the failure modes above is it actually exposed to? Drop a comment with the layer you'd add next.
If this was useful
The five-mechanism breakdown, the reason-tagged drop pattern, and the layered stack diagram are all chapters in the System Design Pocket Guide: Fundamentals. It covers the queue, async, and admission-control building blocks the same way: pick the mechanism, understand how it fails, then learn the layered version that actually survives load. It's the book to hand to a backend engineer who's shipped a service that fell over and wants to know what to read next.
