Most inference backends degrade under burst.
This is not specific to LLMs.
It applies to any constrained compute system:
• a single GPU
• a local model runner
• a CPU-bound worker
• a tightly sized inference fleet
When demand spikes, most systems do one of two things:
1. Accept everything and let requests accumulate internally.
2. Rate-limit arrival at the edge.
Both approaches hide the real problem.
Queues grow.
Latency stretches.
Retries amplify pressure.
Memory usage becomes unpredictable.
Overload turns opaque.
You don’t see failure immediately.
You see slow decay.
⸻
The Missing Boundary
There’s a difference between rate limiting and execution governance.
Rate limiting controls how fast requests arrive.
Execution governance controls how many requests are allowed to run.
Those are not the same.
You can rate-limit and still build an unbounded internal queue.
If you don’t enforce a hard cap on concurrent execution, the backend becomes the queue.
And queues under burst are silent liabilities.
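The distinction can be made concrete. Below is a minimal sketch (mine, not from the reference implementation) of an execution cap: admission is decided by current inflight count, never by arrival rate, and a refused request is refused immediately rather than parked.

```python
import threading

class HardCap:
    """Bound how many requests may RUN at once; never queue a refusal."""
    def __init__(self, max_inflight):
        self.max_inflight = max_inflight
        self.inflight = 0
        self.lock = threading.Lock()

    def try_acquire(self):
        # Non-blocking: either a slot is free now, or the caller must yield.
        with self.lock:
            if self.inflight >= self.max_inflight:
                return False
            self.inflight += 1
            return True

    def release(self):
        with self.lock:
            self.inflight -= 1

cap = HardCap(max_inflight=2)
admitted = [cap.try_acquire() for _ in range(5)]
print(admitted)  # [True, True, False, False, False]
```

A rate limiter in front of this would only slow arrivals down; `try_acquire` is what keeps the backend itself from becoming the queue.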
⸻
A Different Approach: Explicit Yield
Instead of buffering overload, convert it into an explicit response.
When capacity is full:
• Do not queue.
• Do not block.
• Do not defer silently.
Return:
status = yield
retry_hint_ms =
The system remains bounded.
The client decides when to retry.
Overload becomes explicit instead of hidden.
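As a sketch, the admission path can be a non-blocking semaphore acquire. The response fields mirror the article; the concrete `retry_hint_ms` value of 250 is an assumption for illustration, not a value taken from the tool.

```python
import threading

MAX_INFLIGHT = 1
slots = threading.Semaphore(MAX_INFLIGHT)

def admit():
    # blocking=False: either we get a slot now, or we yield now.
    # No waiting, no internal queue.
    if not slots.acquire(blocking=False):
        return {"status": "yield", "retry_hint_ms": 250}  # hint value is illustrative
    return {"status": "accepted"}  # caller must slots.release() when execution completes

first = admit()
second = admit()
print(first["status"], second["status"])  # accepted yield
```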
⸻
What This Looks Like
Here’s a simple test:
• max_inflight = 1
• 20 concurrent clients
• backend execution time = 10 seconds
Observed state transitions:
t=44 inflight=1 executed_total=1 yielded_total=19
t=79 inflight=0 executed_total=1 yielded_total=19
Interpretation:
• Inflight never exceeded 1.
• One request executed.
• Nineteen yielded immediately.
• No queue growth.
The system did not degrade.
It remained bounded.
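The test above can be reproduced in miniature with a semaphore and 20 threads. This is my simulation of the same setup, not the tool's own harness; the 10-second backend execution is stubbed out, since the only property being checked is that exactly one request is admitted and nineteen yield.

```python
import threading

MAX_INFLIGHT = 1
slots = threading.Semaphore(MAX_INFLIGHT)
lock = threading.Lock()
executed_total = 0
yielded_total = 0

def client():
    global executed_total, yielded_total
    if slots.acquire(blocking=False):
        with lock:
            executed_total += 1
        # (the real backend would run ~10 s here, then slots.release())
    else:
        with lock:
            yielded_total += 1

threads = [threading.Thread(target=client) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(executed_total, yielded_total)  # 1 19
```

Because the one admitted request never releases its slot during the burst, the outcome is deterministic: inflight never exceeds 1, and no queue forms.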
⸻
Why This Matters for Inference Systems
Inference workloads are bursty.
Prompts don’t arrive in smooth curves.
They arrive in clusters:
• user refresh storms
• retry loops
• concurrent UI events
• load balancer reshuffles
• autoscaler lag
If your backend silently buffers that burst, you inherit the tail latency and memory consequences later.
If you bound execution and yield instead, you trade implicit instability for explicit backpressure.
That trade is almost always worth it.
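Explicit backpressure puts the retry decision on the client. A sketch of the client side, assuming the yield response shape described above (the helper name and jitter strategy are mine):

```python
import random
import time

def call_with_yield(send, max_attempts=5):
    """Honor the server's retry hint instead of hammering it.
    `send` is any callable returning a dict like
    {"status": "yield", "retry_hint_ms": 250} or {"status": "accepted"}."""
    for _ in range(max_attempts):
        resp = send()
        if resp["status"] != "yield":
            return resp
        # Sleep the hinted interval, with jitter so retries don't re-cluster.
        hint_ms = resp.get("retry_hint_ms", 100)
        time.sleep((hint_ms / 1000) * (1 + random.random()))
    return {"status": "gave_up"}

# Fake server that yields twice, then accepts:
responses = iter([{"status": "yield", "retry_hint_ms": 1},
                  {"status": "yield", "retry_hint_ms": 1},
                  {"status": "accepted"}])
result = call_with_yield(lambda: next(responses))
print(result)  # {'status': 'accepted'}
```

The jitter matters: without it, the nineteen yielded clients retry at the same instant and recreate the burst.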
⸻
What This Is Not
This is not:
• a scheduler
• a policy engine
• a fairness system
• a gateway
• a dashboard
• a distributed runtime
It is a narrow primitive:
Hard concurrency cap + explicit yield.
Nothing more.
⸻
A Small Tool, Intentionally
I built a small ingress governor around this idea.
It:
• accepts newline-delimited JSON frames over TCP
• validates upload integrity
• enforces max_inflight
• returns yield immediately when saturated
• exposes minimal metrics (inflight, executed_total, yielded_total)
It does not inspect prompts.
It does not introspect models.
It does not count tokens.
It does not apply policy.
It governs execution slots.
That’s all.
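"Newline-delimited JSON frames" means each frame is one JSON object terminated by a newline, so framing is just line splitting. The field names below are illustrative, not heptamini's documented wire format:

```python
import json

# One JSON object per line: encoding a frame is dumps + newline.
frame = {"op": "submit", "bytes": 42}  # field names are assumptions
line = json.dumps(frame) + "\n"

# Decoding on the receiving side is line splitting plus json.loads.
for raw in line.splitlines():
    decoded = json.loads(raw)
print(decoded["op"])  # submit
```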
⸻
Why Not Just Use Nginx?
Because rate limiting is not execution governance.
You can limit requests per second and still allow an unbounded number of concurrent backend submissions.
Bounded concurrency and explicit yield are different primitives.
They can coexist.
They solve different problems.
⸻
The Core Idea
Stop treating overload as something to buffer.
Treat it as something to expose.
If capacity is full, say so.
Return yield.
Remain bounded.
⸻
If you operate constrained compute systems and care about deterministic behavior under burst, this approach may be useful.
Reference implementation:
https://github.com/newssourcecrawler/heptamini