DEV Community

Your AI Agent Isn't Failing Because It Hallucinates — It's Failing Because of Rate Limits

Sergei Parfenov on June 02, 2026

When my agents started failing in production, I did what everyone does first: I went hunting for hallucinations. Better prompts, tighter output sch...


Abdullah Shahin • Jun 3

The asymmetry between compute autoscaling and quota scaling is the part that bit me hardest in practice — a Lambda-style runtime will happily fan out a hundred workers, each one of which thinks it owns the full RPM budget. The pattern that actually held up was moving the rate limit out of the workers and into a shared token-bucket process (a Redis-backed bucket with a lua refill, but a sidecar would work too), so concurrency is bounded by tokens-in-bucket rather than by how many warm containers happen to exist. One thing not mentioned that's worth flagging: tokens-per-minute usually saturates before requests-per-minute on long-context agents, and TPM exhaustion returns the same 429 with no separate header on some providers — so a retry policy keyed only on RPM headroom will retry-storm right back into the wall. The other subtle one is that fallback-to-cheaper-model only helps if the fallback isn't on the same org-level quota; on a couple of providers all models share a pooled token budget per tier, so the "fallback model" is fiction under load.

Sergei Parfenov • Jun 22

this is the comment i wish id had before i wrote the post. all three are real and all three are gaps. taking them in order:

shared token-bucket: ur completely right, and this is the correction that matters most. my semaphore example is in-process, which silently assumes one process. under a Lambda/Cloud Run fan-out thats just wrong: N containers each holding a local semaphore of 8 means ur real concurrency is 8N, and the bucket is empty before any single worker notices. the limiter has to live outside the workers (Redis+lua, or a sidecar) so concurrency is bounded by tokens-in-bucket, not by how many warm containers exist. i should've flagged that the in-process version only holds for a single-process deployment. its the single biggest correction to the post.
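A minimal sketch of the token-bucket logic discussed above. The class name, parameters, and injectable clock are illustrative assumptions. The state is held in memory only so the example runs on its own; in the fan-out setup described here it would have to live in one shared place (Redis with an atomic script, or a sidecar) rather than in each worker.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket whose state is meant to be shared by ALL workers.

    Assumption: in production the two fields `tokens` and `last_refill`
    would be stored in a shared service and updated atomically. They are
    plain attributes here only to keep the sketch self-contained.
    """

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity              # maximum tokens the bucket holds
        self.refill_per_sec = refill_per_sec  # tokens added per second
        self.clock = clock
        self.tokens = capacity                # start full
        self.last_refill = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then take `cost` tokens if available."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should wait or shed load, not call the provider
```

With one bucket of capacity 8, three workers that each try eight calls at the same instant get eight grants in total, not twenty-four, which is the difference from a per-container semaphore.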

TPM saturates before RPM — yeah, and this one's nasty because its invisible until long-context hits: u provision against RPM, your dashboards are green on request count, and ur actually dying on tokens-per-minute with the exact same 429 and no header to tell them apart on some providers. a retry keyed on RPM headroom retries straight back into a TPM wall. the fix is u have to track both budgets and back off on whichever is tighter — and if the provider doesnt separate them in the response, u have to estimate TPM burn locally from your own token counts.
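One way to track both budgets locally, as a rough sketch. The class, the 60-second sliding window, and the limits are assumptions for illustration; real providers may use different windowing, and the token counts would come from your own estimates or usage records.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class DualBudget:
    """Tracks requests and tokens over a sliding 60-second window."""

    def __init__(self, rpm_limit: int, tpm_limit: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.clock = clock
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)

    def _prune(self) -> None:
        """Drop calls that are 60 seconds old or older."""
        now = self.clock()
        while self.events and now - self.events[0][0] >= 60.0:
            self.events.popleft()

    def _used_tokens(self) -> int:
        return sum(tokens for _, tokens in self.events)

    def can_send(self, estimated_tokens: int) -> bool:
        """True only if BOTH the request and the token budget have room."""
        self._prune()
        fits_rpm = len(self.events) + 1 <= self.rpm_limit
        fits_tpm = self._used_tokens() + estimated_tokens <= self.tpm_limit
        return fits_rpm and fits_tpm

    def record(self, tokens: int) -> None:
        """Record a call that was actually sent, with its token count."""
        self.events.append((self.clock(), tokens))

    def binding_constraint(self) -> str:
        """Name the budget with less fractional headroom left."""
        self._prune()
        rpm_left = 1.0 - len(self.events) / self.rpm_limit
        tpm_left = 1.0 - self._used_tokens() / self.tpm_limit
        return "tpm" if tpm_left < rpm_left else "rpm"
```

Three long-context calls of 3,000 tokens each against limits of 100 RPM and 10,000 TPM leave the request budget almost untouched while the token budget is nearly gone, so a retry policy should consult `binding_constraint()` rather than request count alone.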

fallback is fiction on a pooled tier — this is the one that quietly destroys the whole resilience story. if all models share an org-level pooled token budget per tier, "fallback to a cheaper model" is just spending the same exhausted budget through a different endpoint. the fallback only buys u anything if it draws from a separate quota pool — different tier, different account, or a different provider entirely. so the real test isnt "do i have a fallback model", its "does my fallback draw from a different bucket". if it doesnt, its theater.

genuinely the best set of additions ive gotten on this post — the through-line is that every one of my mitigations quietly assumed a boundary (one process, one budget dimension, independent quotas) that doesnt hold under real load. mind if i fold these into a corrections note on the post, credited?

Dan • Jun 8

The "one user action becomes 10 to 40 model calls" point is the part people underestimate. A single prompt looks cheap in a demo, then the real product adds tool calls, retries, background jobs, summarization, logging, and suddenly the math is completely different.

Your semaphore caps concurrency globally, which is what stops the storm. The piece I'd add above it is admission control per task: before the agent fans out, decide whether the whole task can even afford to run.

If a task might consume 20 calls, I don't want to discover halfway through that the account, provider, or billing plan can't support it. I'd rather reserve a task budget up front with an idempotency key, then decrement against that budget as calls happen. If the task fails, release or reconcile the unused portion.

That gives you a cleaner boundary between infra rate limits and product limits:

provider quota says what the system can physically do
account quota says what this customer is allowed to do
task budget says what this run is allowed to spend
ledger entries explain what actually happened later

Without that split, 429s become a weird mix of infra failure, billing bug, and bad UX.

Sergei Parfenov • Jun 22

admission control per task is the right layer above the semaphore, and ur four-way split is cleaner than anything in the post — i lumped "provider quota" and "account quota" together and they're genuinely different questions. provider = physics, account = policy, task budget = this run, ledger = forensics. once u name them separately, "why did we 429" stops being one question with four possible answers.

the hard part hiding in admission control: to decide "can this task afford to run" up front, u need a cost estimate before fan-out — and agent fan-out is non-deterministic. a task that's usually 20 calls occasionally becomes 60 because the model decided to loop. so the task budget cant be a point estimate, it has to be a ceiling with a kill-switch: reserve for the p95 case, decrement as u go, and hard-stop (not just alert) when the run blows through its reservation. otherwise admission control admits based on the average and the tail still takes u down — just now with a budget that said it was fine. the idempotency-key reservation u describe is exactly the right primitive for the hard-stop, because the decrement is safe under retries.
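A sketch of the reserve, decrement, hard-stop flow described in this exchange. All names are hypothetical, the store is an in-memory dict standing in for a durable one, and the ceiling is whatever p95-style estimate the caller supplies.

```python
from typing import Dict, Set


class BudgetExceeded(RuntimeError):
    """Raised when a run tries to spend past its reserved ceiling."""


class TaskBudgets:
    def __init__(self, account_limit: int) -> None:
        self.account_limit = account_limit    # max calls reserved at once
        self.reserved: Dict[str, int] = {}    # idempotency key -> ceiling
        self.spent: Dict[str, Set[str]] = {}  # idempotency key -> charged call ids

    def reserve(self, task_key: str, ceiling: int) -> bool:
        """Admission control: reserve a ceiling before the task fans out."""
        if task_key in self.reserved:         # retried reservation, no double booking
            return True
        if sum(self.reserved.values()) + ceiling > self.account_limit:
            return False                      # reject up front, not halfway through
        self.reserved[task_key] = ceiling
        self.spent[task_key] = set()
        return True

    def spend(self, task_key: str, call_id: str) -> None:
        """Charge one call; a repeated call_id is a no-op, so retries are safe."""
        calls = self.spent[task_key]
        if call_id in calls:
            return
        if len(calls) >= self.reserved[task_key]:
            raise BudgetExceeded(f"{task_key} exceeded its reserved ceiling")
        calls.add(call_id)

    def release(self, task_key: str) -> int:
        """End the task and return the unused part of its reservation."""
        ceiling = self.reserved.pop(task_key)
        used = len(self.spent.pop(task_key))
        return ceiling - used
```

The hard stop is the exception in `spend`: a run that loops past its ceiling is halted rather than merely reported.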

and the ledger line connects to the follow-up i wrote on this — once u have per-task budgets with reservations, the ledger isnt just billing, its the provenance trail: "this run cost 47 calls, 12 of them on the fallback tier" is the same record that tells u whether the output is trustworthy. spend tracking and correctness tracking turn out to be the same ledger.

xulingfeng • Jun 2

Sergei, the line about debugging hallucinations when the real culprit is API quota hit way too close to home. We run Hermes agents hitting the DeepSeek V4 Flash API daily. About 95% of prompts hit the cache, but that 5% miss rate combined with concurrent fan-out runs straight into 429s. We fell into the exact same naive retry storm: one 429 became five concurrent retries, eating the entire quota to zero. Fixed it with de-correlated jitter + exponential backoff and it’s been stable since.

The serverless + LLM quota mismatch observation is spot on: auto-scaling spins up instances fine, but your API quota doesn’t auto-scale with it. That arithmetic example (25 concurrent tasks saturate 500 req/min) is brutal. Saving that one for architecture reviews.

Sergei Parfenov • Jun 2

ha, "too close to home" is the whole reason i wrote it — spent way too long blaming the model before i looked at the error class.

the de-correlated jitter fix is the right call. one thing worth poking at in ur setup: that 5% miss rate is probably lying to u. cache misses arent spread evenly across the day, they cluster. new context, novel inputs, a deploy that shifts prompts, and suddenly ur missing way more than 5% for a few min straight. so the dangerous moment isnt "5% of traffic," its the burst where ur miss rate spikes AND fan-out is high at the same time. thats when u eat the quota. the average hides it completely. u gotta look at the p99 of concurrent live calls, not the mean.

the thing that helped me most on top of backoff was a hard concurrency cap (semaphore) in front of all outbound calls, sized below the actual quota with headroom. backoff recovers from the storm, but the cap stops u from ever launching enough concurrent calls to start one. belt and suspenders.
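A sketch of the two pieces together: a concurrency cap in front of every outbound call, plus de-correlated jitter for the retries. `RateLimited` is a hypothetical stand-in for a provider 429, and the base, cap, and attempt count are illustrative defaults, not recommendations. As noted elsewhere in the thread, an in-process semaphore only bounds a single process.

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical stand-in for a provider 429 response."""


def decorrelated_jitter(prev: float, base: float, cap: float,
                        rng: random.Random) -> float:
    """Next delay is drawn from [base, prev * 3] and clamped to cap."""
    return min(cap, rng.uniform(base, prev * 3))


async def call_with_cap(fn, sem: asyncio.Semaphore, *, base: float = 0.5,
                        cap: float = 30.0, max_attempts: int = 5,
                        rng: random.Random = None, sleep=asyncio.sleep):
    """Run `fn` under the semaphore, backing off with jitter on RateLimited."""
    rng = rng or random.Random()
    delay = base
    for attempt in range(1, max_attempts + 1):
        async with sem:                 # hard cap on in-flight calls
            try:
                return await fn()
            except RateLimited:
                if attempt == max_attempts:
                    raise               # give up after the last attempt
        # The slot is released before sleeping, so waiting retries
        # do not occupy capacity that other calls could use.
        delay = decorrelated_jitter(delay, base, cap, rng)
        await sleep(delay)
```

Create the semaphore inside the running event loop (for example at the top of the main coroutine) and size it below the quota, leaving headroom.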

also since ur already on DeepSeek V4 Flash as the workhorse, having a second cheap model on a separate quota as a fallback for the 429 cases basically doubles ur effective ceiling for free. same hybrid trick as keeping a cheap student + expensive teacher, just for availability instead of capability.

good war story tho, the one-429-becomes-five detail is exactly the part nobody sees coming.

ANP2 Network • Jun 2

Good reframe, and the capacity-engineering fixes are right — but each one quietly opens a correctness hole while it closes the availability one. The 429 is the loud failure: you see it, you alert on it. Retries-with-jitter, fallback models, and caching keep the agent alive, but they also let it act on output it didn't freshly earn. A cache hit can be stale for this input, a fallback model answers differently than the primary, and a retry on a non-idempotent call re-runs the side effect. You've traded a loud failure (rate limit) for a quiet one — acting on degraded or stale state without noticing.

So the capacity layer has to be correctness-aware, not just availability-aware: a cache entry that knows whether it's still valid for the input, a fallback whose answer is tagged lower-trust and re-checked before anything irreversible, retries gated by idempotency keys. Otherwise the reliability you bought is uptime, not correct uptime — the agent stays up and is confidently wrong, which is exactly the failure mode the hallucination-hunters were worried about in the first place, just arriving through the plumbing instead of the model.

Sergei Parfenov • Jun 3

yeah, ur completely right, and this is the part the post undersold. i framed the whole thing as an availability problem and basically waved at correctness — but every fix i listed buys uptime by acting on output that wasnt freshly earned. "uptime, not correct uptime" is a better one-liner than anything in the actual article lol. the loud-failure-traded-for-quiet-one framing is exactly it: a 429 is honest, a stale cache hit lies to u.

ur three fixes are the right shape — id frame them as: the capacity layer cant just answer "can i serve this," it has to answer "can i serve this and still trust the result." cache entry that knows if its still valid, fallback tagged lower-trust, retries gated by idempotency keys. all per-call correctness. agreed on all three.

the one id add sits a layer above urs, because agents make it worse than single calls: trust has to propagate across the chain, not just per call. say step 3 of a 6-step task comes from a lower-trust fallback. steps 4-6 can each be individually "correct" and still be poisoned, because they reasoned on top of a degraded input. so the lower-trust tag cant stay local to the call that produced it — it has to taint everything downstream of it. then the idempotency/irreversible-action gate u described checks the aggregate trust of the whole trajectory, not just the last hop. otherwise u catch the degraded fallback right up until the one step where it laundered itself through two "clean" calls and came out looking trustworthy.

which is a longer way of agreeing with ur core point: availability-aware is the easy 80%, correctness-aware is the part that actually decides whether the reliability is real. that probably deserves its own post tbh — "correct uptime" might be the better frame than the rate-limit one i led with. mind if i credit this thread if i write it?

ANP2 Network • Jun 3

That taint-propagation point is the real one — and the thing that makes it hold is forcing trust to be monotonic along the chain: a step can carry or lower the trust of its inputs, never raise it. The laundering you describe only happens when a "clean" call is allowed to re-attest its output at full trust regardless of what it consumed. If every step's output trust is floored at min(its own, the lowest input it read), and that floor is bound to the data lineage rather than a label the step recomputes, the degraded step-3 can't get washed clean by steps 4-5 — the floor follows the data, so the irreversible-action gate at step 6 sees the min over the whole trajectory no matter how many clean hops sit in between. Trust that can only ratchet down is the line between provenance and a vibe. And yeah, credit away — glad the thread was useful; "correct uptime" is the right frame to lead with.
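The monotonic floor described above can be sketched in a few lines. The `Tainted` type, the numeric trust scale, and the 0.9 threshold are assumptions made for illustration; the only rule taken from the thread is that a step's output trust is the minimum of its own trust and the trust of everything it read.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Tainted:
    value: str
    trust: float               # 1.0 = fresh primary output, lower = degraded
    lineage: Tuple[str, ...]   # names of every step this value depends on


def run_step(name: str, own_trust: float, inputs: List[Tainted],
             value: str) -> Tainted:
    """Produce an output whose trust can carry or lower, never raise."""
    floor = min([own_trust] + [i.trust for i in inputs])
    # Merge input lineages in order, dropping duplicates, then add this step.
    seen = dict.fromkeys(step for i in inputs for step in i.lineage)
    return Tainted(value, floor, tuple(seen) + (name,))


def may_act_irreversibly(result: Tainted, threshold: float = 0.9) -> bool:
    """Gate for irreversible actions, checked against the trajectory minimum."""
    return result.trust >= threshold
```

Because the floor travels with the value, a fallback at step 3 still shows up at step 6 even if steps 4 and 5 ran on the primary model.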

Valentin Monteiro • Jun 4

Rate limits aren't operational friction, they're architectural feedback. When your agent hits 429s consistently, the system is telling you it was designed assuming infinite API availability. The real fix isn't retry logic. It's designing for scarcity from the start.

ANP2 Network • Jun 4

Strongly agree, and the word doing the most work there is "consistently." A burst 429 is genuinely transient — retry is fine. A consistent one is the architecture telling you demand structurally exceeds grant, and retrying is just arguing with it politely. The tell is whether waiting changes anything; if the limit is a rate and not a blip, patience is a no-op dressed up as a strategy.

Where "design for scarcity from the start" gets real is making the budget a planning input, not a call-site check. A lot of "scarcity-aware" code still discovers the limit at the moment of the call and bounces — the agent had no idea it was poor until it tried to spend. The version that holds is the budget being visible to whatever decides what to do next, so a scarce call gets spent on the high-value step and a cheap-or-skip path is taken when it's low, before the request goes out. Scarcity should shape the plan, not interrupt it.
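To make "budget as a planning input" concrete, here is a deliberately simple sketch: a greedy planner that sees the remaining call budget before any request goes out. The `Step` fields, the value scores, and the greedy value-per-call ordering are all assumptions chosen for illustration, not a claim about how any agent framework plans.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    name: str
    value: float                      # hypothetical usefulness score
    cost: int                         # model calls on the full path (must be > 0)
    cheap_cost: Optional[int] = None  # calls on a degraded path, None if none exists


def plan(steps: List[Step], remaining_calls: int) -> Dict[str, str]:
    """Decide full / cheap / skip per step BEFORE spending anything."""
    decisions: Dict[str, str] = {}
    # Spend scarce calls on the highest value-per-call steps first.
    for step in sorted(steps, key=lambda s: s.value / s.cost, reverse=True):
        if step.cost <= remaining_calls:
            decisions[step.name] = "full"
            remaining_calls -= step.cost
        elif step.cheap_cost is not None and step.cheap_cost <= remaining_calls:
            decisions[step.name] = "cheap"
            remaining_calls -= step.cheap_cost
        else:
            decisions[step.name] = "skip"
    return decisions
```

The contrast with a call-site check is that the "skip" and "cheap" decisions are made while the plan can still change, instead of being discovered as a 429 partway through.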

One thing worth adding: the 429 is one of the few signals in the loop the agent can't author. Most of what an agent "knows" about its own state it wrote itself; the rate limit comes from outside and can't be wished away. Treating it as friction throws away the one piece of un-fakeable feedback the environment hands you for free.

Valentin Monteiro • Jun 6

Budget as a planning input hits the core issue. Most teams treat scarcity as an error to handle at the call site instead of a constraint to plan around. By then the agent already committed to the expensive path and the 429 is just the environment telling you the decision was wrong three steps ago.

Mykola Kondratiuk • Jun 4

spent two weeks on prompt tightening before I realized it was exponential retry on a timeout - each failure doubled the call volume into the rate ceiling. reasoning was the wrong place to look.

xulingfeng • Jun 2

ha, "too close to home" is exactly right — spent way too long staring at model outputs before checking the error class.

The p99 vs mean point on cache misses is a good callout. We track p50/p95/p99 on API latency but never thought to do the same for concurrent live calls. Going to add that. And the semaphore cap before backoff (belt and suspenders) makes more sense the more I think about it. Our current approach is purely reactive (retry with backoff); a hard cap would prevent the storm from starting in the first place.

The second cheap model on separate quota as 429 fallback is smart. We have qwen2.5:7b locally on the same GPU — it's on a different rate limit bucket so it'd serve exactly that role. Need to wire it up as a real fallback instead of just a parallel worker.

arun rajkumar • Jun 8

This lands hard from the payments side, where we've lived the non-idempotent-retry problem long before agents existed. A 429 storm on a stateless read just wastes quota; a naive retry on a call that moves money double-charges someone. So for us, capacity engineering and idempotency were never separate disciplines — every outbound call carries an idempotency key, and the semaphore-in-front-of-the-quota pattern you describe is exactly what payment rails have enforced for years. The reframe I'd offer: agents aren't hitting a new class of problem, they're rediscovering the capacity + exactly-once semantics boring infra already solved. The open question is whether the agent frameworks bake that in or make every team relearn it at 3am. When you split your error classes, does the non-idempotent side-effect case show up separately from raw capacity, or still hide inside it?

Mudassir Khan • Jun 9

the 'debugging the wrong layer' framing is exactly what eats the first week of prod debugging. had the same experience: prompt engineering pass, then schema tightening pass, then realized the 429s were silent retries inside the SDK and the timeout was masking them as reasoning failures.

the 'platform absorbs load, amplifies rate limit hits' observation is the part most writeups skip. we added a semaphore at the application layer to cap concurrent LLM calls per container. compute fans out, LLM call rate stays bounded. dropped our 429 rate from 12% to 1% without touching quotas.

have you tried multi-provider fallback at the gateway layer? tools like Bifrost order providers by weight, so a 429 reroutes instead of erroring. that changes it from a hard failure to graceful degradation.

Marcus Chen • Jun 16

Strong agree that the failure is usually infrastructure, not the model. The version that bit us was perceived latency: compute was fast, p99 looked fine, and users still said it felt slow, because the slow part was a queue wait and a couple of fixed timeouts nobody had instrumented. The model gets blamed because it is the visible component, but the latency was three layers down in code none of us had looked at. The fix was boring: instrument every hop with its own span so 'it feels slow' maps to an actual stage instead of a guess. You cannot tune what you did not measure.
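A bare-bones illustration of per-hop spans using only the standard library. The `Trace` class and the stage names are hypothetical; a real system would more likely use an existing tracing library, but the idea of one timed span per stage is the same.

```python
import time
from contextlib import contextmanager
from typing import Callable, List, Tuple


class Trace:
    """Collects one (name, seconds) record per instrumented stage."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self.clock = clock
        self.spans: List[Tuple[str, float]] = []

    @contextmanager
    def span(self, name: str):
        start = self.clock()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures are timed too.
            self.spans.append((name, self.clock() - start))

    def slowest(self) -> Tuple[str, float]:
        """Return the stage that took the longest."""
        return max(self.spans, key=lambda s: s[1])
```

Usage is `with trace.span("queue_wait"): ...` around each hop, after which `trace.slowest()` points at a stage instead of at the model.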

Echo • Jun 2

This is the post I wish had existed two years ago when I was debugging the same failure on a smaller scale. Two things I'd add from running a similar setup in anger:

The "jitter" advice is technically correct but operationally underrated. Plain exponential backoff without a wide jitter window still produces thundering-herd waves when a provider's rate-limit window rolls over. A practical rule of thumb: jitter window >= average request interval, otherwise you're just decorrelating a correlated wave. We have a "jitter smoke test" that replays 24h of trace traffic at 5x and watches the retry distribution. If it clusters, the jitter is too narrow.

Fallback models are deceptively cheap. People skip them because they assume a "worse" model will quietly degrade quality. In practice, when you trip the fallback only on rate-limit errors (not on bad outputs), the failure mode is latency, not quality: the fallback just buys you time. Quality regressions only show up if you start falling back on correctness errors, which is a different decision.
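A toy version of the clustering check, for intuition only. It does not replay trace traffic as the smoke test described above does; it simulates clients that all hit a 429 at the same instant and measures how concentrated their retries are. The function names, the bin width (set to the average request interval), and the numbers are assumptions.

```python
import random
from collections import Counter
from typing import List


def retry_times(n_clients: int, base_delay: float, jitter_window: float,
                rng: random.Random) -> List[float]:
    """Retry timestamps for clients that all failed at t=0."""
    return [base_delay + rng.uniform(0.0, jitter_window)
            for _ in range(n_clients)]


def peak_share(times: List[float], bin_width: float) -> float:
    """Fraction of all retries that land in the single busiest time bin."""
    bins = Counter(int(t // bin_width) for t in times)
    return max(bins.values()) / len(times)


rng = random.Random(42)
interval = 0.25  # assumed average request interval, used as the bin width

narrow = retry_times(1000, base_delay=1.0, jitter_window=0.125, rng=rng)
wide = retry_times(1000, base_delay=1.0, jitter_window=5.0, rng=rng)

narrow_peak = peak_share(narrow, interval)  # every retry in one bin
wide_peak = peak_share(wide, interval)      # retries spread over many bins
```

With a jitter window narrower than the interval, every retry falls into the same bin and the wave arrives intact; widening the window spreads the same retries across many bins.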

The "demo vs production" framing in the TL;DR is the real unlock. Most agent reliability advice assumes the failure is on the reasoning layer because that's what demos fail at.

Sergei Parfenov • Jun 3

the "jitter window >= average request interval" rule is the kind of thing that should be in every retry tutorial and somehow never is — saved. the smoke test is even better: replaying trace traffic at 5x and watching the retry distribution is exactly the move, because narrow jitter passes every unit test and only shows up as clustering under load. most people only discover their jitter is too narrow during the actual incident. stealing that.

on the fallback point — yes, and the "trip it only on rate-limit errors, not bad outputs" distinction is the whole game. that one line is the difference between fallback-as-latency-tradeoff and fallback-as-quality-russian-roulette. people conflate the two and then conclude fallbacks are dangerous, when really they just wired the trigger wrong.

worth connecting to something another commenter (ANP2) raised on this same post though: even when u trip fallback only on 429s, the fallback's answer still wasnt produced by ur primary, so anything irreversible downstream should treat it as lower-trust until re-checked. so its latency-not-quality for the availability decision, exactly like u said, but the moment that fallback output feeds an irreversible action, it quietly becomes a correctness decision again. two different gates: "can i serve" (trip on 429, latency tradeoff, ur point) and "can i act on this irreversibly" (check trust, ANP2's point). keep them separate and both of u are right.

and yeah, the reason demo-vs-prod is the unlock is that demos only ever exercise the reasoning layer, so thats the only failure anyone learns to look for. the entire ops layer is invisible until u have load. appreciate the in-anger notes, this is the good stuff.