DEV Community

You Fixed the Rate Limits. Now Your Agent Fails Quietly.

Sergei Parfenov on June 11, 2026

Last week I wrote that your agent isn’t failing because it hallucinates — it’s failing because of rate limits. The capacity-engineering toolkit in ...
 
xulingfeng

The "uptime, not correct uptime" distinction is gold. We hit the same pattern with AI-driven test automation at my last company — pass rate climbed because the AI kept "fixing" flaky tests by shrinking their assertion scope. The pipeline stayed green, but the tests stopped catching real regressions.

The taint propagation approach for multi-step agents makes a lot of sense. Same correctness debt, different level of the stack — and way harder to spot until something irreversible happens.

Sergei Parfenov

the shrinking-assertion-scope story is the nastiest version of this pattern i've heard, because the degradation happened in the verification layer itself. my whole taint approach quietly assumes the verifier is trustworthy: tag the degraded data, gate the irreversible action, re-check against something solid. but when the thing that checks correctness is what degraded, you've lost the instrument that would've caught it. green pipeline, hollow assertions. that's not a quiet failure anymore, it's a quiet failure with a forged alibi.

guess the test-automation version of my dashboard metric would be tracking assertion scope/strength over time, not pass rate — pass rate is exactly the metric the failure mode games.

Mykola Kondratiuk

removing the error without replacing it with a new signal is the core problem. one thing the toolkit needs: validate the fallback model's output schema separately - primary and fallback often return differently-structured responses, and format drift is invisible downstream.
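
A minimal sketch of checking a fallback response against the shape the primary is expected to return. The field names in `REQUIRED_FIELDS` are invented for illustration; a real system would use its own schema or a validation library.

```python
# Hypothetical response schema: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list, "confidence": float}


def schema_errors(response: dict, schema: dict = REQUIRED_FIELDS) -> list[str]:
    """Return a list of schema problems; an empty list means the shape matches."""
    errors = []
    for field, expected in schema.items():
        if field not in response:
            errors.append(f"missing field: {field}")
        elif not isinstance(response[field], expected):
            actual = type(response[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors


primary = {"answer": "42", "sources": ["doc-1"], "confidence": 0.9}
fallback = {"answer": "42", "sources": "doc-1", "score": 0.9}
```

Here `schema_errors(primary)` is empty, while `schema_errors(fallback)` reports the string-valued `sources` and the missing `confidence`, which is the kind of format drift that otherwise passes silently downstream.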

Ahmet Özel

Good framing. Silent degradation is where agents get dangerous because the system still looks alive from the outside. One thing I like to add is an eval replay set for degraded runs: keep the tool trace, retrieved context and final answer together, then replay the same cases after prompt/tool changes. It catches cases where the agent learned to continue smoothly while carrying bad state forward.
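
One possible shape for such a replay set, assuming the agent can be called as a function of the recorded trace and context. `DegradedCase` and `replay` are hypothetical names, and a changed answer is only flagged for review, not judged right or wrong.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DegradedCase:
    """One recorded degraded run: trace, context and answer kept together."""
    case_id: str
    tool_trace: tuple[str, ...]
    retrieved_context: str
    final_answer: str


def replay(cases: list[DegradedCase],
           agent: Callable[[tuple[str, ...], str], str]) -> list[str]:
    """Re-run each case and return the ids whose answer differs from the record."""
    changed = []
    for case in cases:
        new_answer = agent(case.tool_trace, case.retrieved_context)
        if new_answer != case.final_answer:
            changed.append(case.case_id)
    return changed
```

Cases that still produce the same confident answer from a known-bad trace are the interesting ones: they suggest the agent continues smoothly over bad state.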

Sergei Parfenov

the degraded-run replay set is a great addition. it's basically the offline half of the "fallback divergence" metric from the post. i diff fallback answers against the primary now; you're replaying the whole trace after changes, which catches the scarier thing: the agent learning to glide smoothly over bad state. keeping trace + retrieved context + answer together is the part most people skip and then can't reconstruct. adding this to the toolkit.

Scarab Systems

This is exactly the sort of pivot in approach I'm interested in.

I would take it a step further: the agent should not need to carry state. State and context should be provided by something that can carry that weight cleanly and, more importantly, truthfully: the repo. Then the agent can continue to do what it does best: code.

Sergei Parfenov

externalizing state is the right instinct — stateless agents + a source of truth they read from beats agents lugging context around, agreed. and for code the repo is the best ledger we have.

but here's where it doesn't close the loop: the repo records outcomes, not provenance. a commit produced from a degraded fallback chain diffs identically to one produced from clean primary reasoning. git gives you receipts for what changed, but it's silent on whether you should trust how it got there. so moving state into the repo solves the "agent carries fragile context" problem, but the evidence problem just moves with it: something still has to carry the trajectory-level receipts alongside the artifact. repo as ledger for state, evidence layer for process. you need both; they're answering different questions.

Scarab Systems

Ah, yes! This is the distinction I was reaching for, and I think you’re right to split it that way.

When I say the repo should carry the authority, I don’t mean the git diff alone proves the process. A commit can show what changed while saying almost nothing about whether the change preserved the right obligations.

The way I think about it is more like: the repo has to be read into a baseline first.

Not just “current files,” but the repo’s claims: tests, docs, contracts, generated-vs-source boundaries, config expectations, ownership surfaces, validation signals, and whatever the system already uses to say “this is true here.”

Then the agent is not carrying the burden of remembering all of that conversationally. It is working against a diagnostic baseline that can say: this claim existed before, this boundary owned it, this artifact was evidence for it, and this change either preserved, moved, weakened, or contradicted it.

So yes: repo as ledger for state, evidence layer for process — but I’d add that the evidence layer has to be grounded in a repo baseline, not just attached afterward as trace metadata.

That is the shape I’m interested in: before the workflow acts, it should be able to show both the artifact change and the evidence chain that says the change still belongs where it landed.

Scarab Systems

This is a really strong framing — especially the distinction between uptime and correct uptime.

The part that stands out to me is that the degraded path is not just a runtime state; it becomes an evidence problem. Once a fallback, stale cache hit, or retried side effect enters the chain, the question is no longer only “did the agent complete?” It becomes “what proof does the system still have that the completed trajectory preserved the intended boundary?”

That is very close to the diagnostic layer I’ve been exploring with Scarab/SDS. The failure is often not the loud error. The loud error is honest. The more dangerous failure is when the system keeps moving after the boundary that was supposed to preserve trust has already weakened.

The taint propagation point feels especially important. A degraded step should not be allowed to launder itself through later successful calls. If step 6 is clean but step 3 was degraded and never re-verified, the trajectory is still carrying that earlier uncertainty.

I like the “two gates” framing a lot. I’d almost describe Gate 2 as an evidence gate: before an irreversible action, the system has to prove not just that the last call succeeded, but that the whole chain still has valid provenance.
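
The "no laundering" rule above can be sketched in a few lines, under the assumption that each step records whether it was degraded and whether it was later re-verified. `Step`, `trajectory_tainted` and `allow_irreversible` are illustrative names, not the post's API.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    degraded: bool = False    # fallback, stale cache hit, or retried side effect
    reverified: bool = False  # re-checked against a trusted source afterwards


def trajectory_tainted(steps: list[Step]) -> bool:
    """Taint persists until the degraded step itself is re-verified."""
    return any(s.degraded and not s.reverified for s in steps)


def allow_irreversible(steps: list[Step]) -> bool:
    """Gate before an irreversible action: look at the whole chain, not the last call."""
    return not trajectory_tainted(steps)
```

A clean step 6 does not change the result: only re-verifying step 3 clears the taint it introduced.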

Sergei Parfenov

"evidence gate" is honestly a better name than mine — because it makes the obligation explicit. a trust tag is passive metadata; evidence is something the chain has to carry and produce on demand. step 6 shouldnt just be untainted, it should be able to show receipts for steps 1-5. same mechanism, stronger contract. stealing the term (with credit).

Scarab Systems

Yes — please take it and use it. Credit appreciated, but honestly the bigger thing is that we start naming the problem clearly enough to work on it together.

That “receipts for steps 1–5” phrasing is exactly the contract I was trying to get at. A tag describes a state, but an evidence gate asks whether the chain can actually produce proof for the state it is claiming.
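
To make the tag-versus-evidence difference concrete, here is a sketch in which a missing receipt fails the gate just as a failed check does. The `Receipt` fields and the `evidence_gate` function are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    step: int
    source: str     # e.g. "primary", "fallback", "cache"
    verified: bool  # whether the step's output was checked


def evidence_gate(receipts: list[Receipt], steps_run: int) -> list[int]:
    """Return step numbers lacking a verified receipt; an empty list passes the gate."""
    by_step = {r.step: r for r in receipts}
    return [n for n in range(1, steps_run + 1)
            if n not in by_step or not by_step[n].verified]
```

With a passive tag, a step that recorded nothing looks clean. Here, absence of evidence blocks the action, which is the stronger contract described in this exchange.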

The more we can shift the conversation from “did the agent finish?” to “what evidence does the workflow carry forward?”, the more useful the whole discussion becomes.

I think that shared language matters here because this failure mode is showing up in a lot of different places under different names. Once we can name it together, we can start designing around it instead of just reacting to it.

Sergei Parfenov

agreed, and your "different names" point is literally true across fields: security calls it taint, data engineering calls it provenance, ML calls it lineage, audit calls it receipts. four communities, one shape: can you trace what this result stands on. agents just made it urgent because now the untraceable thing acts.

"what evidence does the workflow carry forward" is the right question to standardize on. good thread. this is going in the next post.

Manuel Bruña

Quiet failure is worse than a hard rate-limit error. For agent systems I’d rather have an explicit degraded state: skipped tool, stale data, partial result, retry budget exhausted. If that is hidden, the final answer looks more reliable than it is.
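
A small sketch of carrying that explicit degraded state on the result object, using the four conditions named above. `Degraded` and `AgentResult` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Degraded(Flag):
    NONE = 0
    SKIPPED_TOOL = auto()
    STALE_DATA = auto()
    PARTIAL_RESULT = auto()
    RETRY_BUDGET_EXHAUSTED = auto()


@dataclass
class AgentResult:
    answer: str
    degraded: Degraded = Degraded.NONE

    def render(self) -> str:
        """Show the answer together with any degradation, never the answer alone."""
        if not self.degraded:
            return self.answer
        # Skip the zero-valued NONE member, which is "in" every flag value.
        reasons = ", ".join(f.name.lower() for f in Degraded
                            if f.value and f in self.degraded)
        return f"{self.answer} [degraded: {reasons}]"
```

Because the flags combine with `|`, a run that hit both a stale cache and a partial result keeps both facts attached to the final answer.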

VoltageGPU

Great piece—very much in line with what I've seen in distributed systems. In GPU workloads, especially with rate-limited inference APIs, we often add retries with jitter, but subtle state corruption can still happen if the retry logic doesn't fully respect the original request context. It's a good reminder that availability isn't enough if correctness is compromised.

Lily

The distinction between uptime and "correct uptime" is something more teams should be talking about. Most dashboards celebrate successful requests, but very few measure whether degraded paths are influencing downstream decisions. The idea of propagating trust across an agent workflow feels like a natural evolution of traditional reliability engineering.

mote

Rate limits causing silent failures is worse than outright crashes — at least a crash gets logged. I've watched agents accumulate partial state across multiple 429 responses and then execute with half the context missing. The output looks plausible enough that nobody notices until corrupted data hits production three steps later.

The real problem is most agent frameworks treat rate limits as transport-layer issues rather than application-layer state corruption. A 429 isn't "try again later" — it means "your current execution branch is now poisoned." If the agent was in the middle of mutating internal state when the limit hit, the retry starts from a half-baked world.

How do you handle the case where the agent's internal state is already partially written when the rate limit fires? Undo the mutation or trust the retry with the dirty state?
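
One possible answer to the question, sketched under the assumption that the agent's state is small enough to copy: stage each step's mutations on a copy and keep them only if the step finishes. `RateLimited` is a stand-in for whatever the client raises on a 429, and a real retry loop would also back off between attempts.

```python
import copy


class RateLimited(Exception):
    """Hypothetical stand-in for a client's HTTP 429 error."""


def run_step(state: dict, step) -> dict:
    """Apply `step` to a private copy; the caller's state is never half-written."""
    staged = copy.deepcopy(state)
    step(staged)   # may raise RateLimited partway through
    return staged  # only reached if the step completed


def run_with_retry(state: dict, step, attempts: int = 3) -> dict:
    for _ in range(attempts):
        try:
            return run_step(state, step)
        except RateLimited:
            continue  # the staged copy is discarded; retry starts from clean state
    raise RateLimited("retry budget exhausted")
```

This is the "undo the mutation" option. It does not cover side effects outside the state object, such as a tool call that already wrote somewhere else.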

VoltageGPU

Great post—this really hits on the nuance between availability and correct availability. In distributed systems, especially when dealing with GPU-accelerated workloads on platforms like VoltageGPU, it's easy to mask rate-limiting with retries, but that can lead to stale or incorrect results downstream. I've seen this in inference pipelines where cached responses were used under load, leading to subtle correctness issues that only surfaced in edge cases.

Alex Shev

This is the hidden cost of making agents more resilient. Retries, cache, fallback models, and degraded modes all improve uptime, but they can also hide the moment when the answer stopped being freshly earned.

I like the distinction between uptime and correct uptime. For agents, the SLO should probably include provenance: which inputs were current, which tools actually ran, which fallbacks triggered, and what confidence was produced by evidence instead of habit.
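
A sketch of how such an SLO might be computed from run records. The record fields (`completed`, `fallbacks`, `inputs_current`) and the definition of "correct uptime" as completed runs with no fallback and current inputs are assumptions, not a definition from the post.

```python
def uptime_metrics(runs: list[dict]) -> dict[str, float]:
    """Compare plain uptime with the share of runs that completed on a clean path."""
    total = len(runs)
    if total == 0:
        return {"uptime": 0.0, "correct_uptime": 0.0}
    completed = [r for r in runs if r["completed"]]
    clean = [r for r in completed
             if not r["fallbacks"] and r["inputs_current"]]
    return {
        "uptime": len(completed) / total,
        "correct_uptime": len(clean) / total,
    }
```

The gap between the two numbers is the part of availability that was bought with degraded paths.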

VoltageGPU

As someone who's worked on resilient GPU orchestration systems, I appreciate the emphasis on "correct uptime" — it's easy to get tripped up when auto-scaling or retries hide stale or incorrect results. In confidential computing contexts, even a silent failure can compromise data integrity, so validating outputs isn't just a nice-to-have, it's a hard requirement.

Theo Valmis

Trust isn't a scalar that composes, and that's the hard part hiding inside "propagate trust across the chain." A retry that serves a stale cache and a fallback to a weaker model both lower trust, but along different axes, freshness versus capability, and downstream steps don't care about the same one. A summarization step tolerates a weaker model and not stale data; a price calc is the reverse. Propagate a single trust score and you either over-reject, treating every degradation as fatal, or under-reject, averaging the dangerous one away. What composes is typed provenance: which gate got relaxed and how, carried alongside the result, so each consumer applies its own policy. That turns "can I still trust this?" from one global question into a per-step one, which is the only version that survives a long chain.
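
A minimal sketch of typed provenance with per-consumer policy, using the two axes and two consumers from the comment. All names (`Relaxation`, `Result`, `derive`, `POLICY`, `accepts`) are illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class Relaxation(Enum):
    STALE_CACHE = "freshness"
    WEAKER_MODEL = "capability"


@dataclass(frozen=True)
class Result:
    value: str
    relaxed: frozenset = frozenset()  # which gates were relaxed to produce this


def derive(value: str, *inputs: Result, relaxed=()) -> Result:
    """A derived result inherits every relaxation carried by its inputs."""
    inherited = frozenset().union(*(i.relaxed for i in inputs))
    return Result(value, inherited | frozenset(relaxed))


# Each consumer lists the relaxations it refuses to accept.
POLICY = {
    "summarize": {Relaxation.STALE_CACHE},    # tolerates a weaker model
    "price_calc": {Relaxation.WEAKER_MODEL},  # tolerates stale data
}


def accepts(consumer: str, result: Result) -> bool:
    return not (POLICY[consumer] & result.relaxed)
```

The set union is what composes across a long chain: nothing is averaged away, and each consumer answers its own version of "can I still trust this?".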

HARD IN SOFT OUT

This is the rare sequel that makes the original post better. The “uptime vs correct uptime” distinction is one of those phrases that will quietly live in my head during every architecture review now. (Also, the point about fallback models being different models — not just slower — is something I've seen teams realise only after a very expensive incident.)

A couple of thoughts from reading:

  • The taint‑propagation idea is elegant, but it assumes every step can be traced. In a real agentic workflow, steps often run in parallel or produce outputs that get merged non‑linearly. Have you experimented with a time‑bounded taint? Something like “if no degraded step affected the decision path in the last 15 seconds, reset taint.” Not perfect, but cheaper than full DAG tracking, and might catch most of the real risk.

  • The cache validity check using assumptions (file version, data snapshot) is great, but most teams won’t wire that manually. What if the cache entry simply stored the hash of the prompt + system version + timestamp, and the agent refused to use any cache entry older than a task‑specific TTL (short for mutable data, longer for static)? That’s one line of code and catches most staleness without building an assumption registry.
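
The cache suggestion in the second bullet could look roughly like this. One deviation, stated plainly: the timestamp is stored next to the value and not inside the hash, since hashing it would make the key impossible to look up later. The TTL values and the `TTLCache` class are hypothetical.

```python
import hashlib
import time

# Hypothetical task-specific TTLs, in seconds.
TTL_SECONDS = {"mutable": 60, "static": 86_400}


def cache_key(prompt: str, system_version: str) -> str:
    return hashlib.sha256(f"{system_version}\n{prompt}".encode()).hexdigest()


class TTLCache:
    def __init__(self):
        self._entries = {}  # key -> (value, stored_at)

    def put(self, prompt, system_version, value, now=None):
        now = time.time() if now is None else now
        self._entries[cache_key(prompt, system_version)] = (value, now)

    def get(self, prompt, system_version, data_kind, now=None):
        """Return the cached value, or None if absent or older than the TTL."""
        now = time.time() if now is None else now
        entry = self._entries.get(cache_key(prompt, system_version))
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at > TTL_SECONDS[data_kind]:
            return None  # refuse stale entries; caller must fetch fresh
        return value
```

Including the system version in the key means a prompt or tool change invalidates old entries without any extra bookkeeping.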

One small improvement: the dashboard metrics are solid, but “fallback divergence” replay is expensive at scale. A cheaper proxy: sample 1% of fallback responses and send them to the primary asynchronously for a second opinion, logging divergence. No blocking, no extra latency, just a warning light that tells you when your fallback is drifting too far from the truth.
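
A sketch of that sampling proxy. The divergence measure here (token-set Jaccard distance) is a placeholder assumption, and `shadow_check` is written as a plain function that would be handed to a background worker or queue so it never blocks the request path.

```python
import random


def divergence(a: str, b: str) -> float:
    """Token-set Jaccard distance: 0.0 for identical token sets, 1.0 for disjoint."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1 - len(ta & tb) / len(ta | tb)


def shadow_check(prompt, fallback_answer, ask_primary, log,
                 rate=0.01, rng=random.random):
    """For a sampled fraction of fallback answers, ask the primary and log the gap."""
    if rng() >= rate:
        return
    primary_answer = ask_primary(prompt)
    log.append({
        "prompt": prompt,
        "divergence": divergence(fallback_answer, primary_answer),
    })
```

Alerting on a rising average of the logged values gives the warning light without replaying every fallback response.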

And because the 429 that started all of this deserves a dark joke:

The agent hit a rate limit. It fell back to a cached answer from last Tuesday.

The world changed on Wednesday.

The agent kept working. The logs said “cache hit, 200 OK.”

The user got a message: “Your order has shipped.”

The warehouse’s API key expired on Thursday.

Anyway, this post is the reason I’m adding a “trust” field to my agent’s result objects tomorrow. Thank you.

VoltageGPU

As someone working with GPU-accelerated systems, I've seen how rate limiting workarounds can silently break data pipelines, especially when using real-time inference. One time, we added retries for a GPU cluster API, but stale results started getting cached during outages—looks available, acts broken. It's a great reminder that SLOs must account for correct responses, not just timely ones.

James O'Connor

The correct-uptime framing is sharp. The piece I would add from the tool-calling side: a fallback or a cache does not just return stale output, it returns output your validation may never have seen. We had a cached tool result skip the precheck that a fresh call would have hit, so a value that was valid when it was cached sailed through after it had gone stale. Now anything from a fallback path runs the same validation as a fresh call, no exception for the cache. Availability that serves unchecked output is its own kind of outage, just one that does not page anyone.
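
The rule described here, one validation for every path, can be sketched as follows. `get_tool_result` and its parameters are illustrative; the point is only that the precheck runs after the cache lookup, not inside the fresh-call branch.

```python
def get_tool_result(key, cache: dict, call_tool, validate):
    """Fetch from cache or tool, then run the same precheck on either path."""
    if key in cache:
        value, source = cache[key], "cache"
    else:
        value, source = call_tool(key), "fresh"
    validate(value)  # raises if the value is not acceptable right now
    cache[key] = value
    return value, source
```

A value that was valid when cached is re-checked against current conditions on every read, so it cannot pass through after it has gone stale.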

VoltageGPU

Great article — it's easy to focus on just hitting availability numbers, but ensuring correctness under load is where the real complexity lies. In GPU-based systems, especially with inference workloads, we've seen cases where rate limiting was bypassed, but stale or cached results were served without proper tracking, leading to silent errors. It's a reminder that SLOs need to account for both freshness and fidelity, not just uptime.

TuanAnhNguyen

The "uptime vs correct uptime" cut is the part I'll carry around. Reading this from the other end of the scale though — I'm a solo builder, my "agents" are coding tools running all day, and the quiet-failure version I live with is smaller but identical in shape: a tool edits a file off a stale read, three steps later something breaks, and nothing ever errored. No dashboard, no taint tracking — just me eventually noticing the trajectory went bad somewhere upstream.
What landed hardest is "gate on risk, not confidence." At my scale I can't build the full two-gate system, but the cheap version is dead simple: mark which steps touch something I can't undo (a commit, a deploy, anything that writes), and force a human read on exactly those — ignore confidence entirely. Most of the correctness debt you describe is survivable for a solo dev because the irreversible surface is small. The discipline is just knowing precisely where it is.
Question back: for someone without the observability stack, is there a poor-man's proxy for "degraded-path exposure"? Or is the honest answer that below a certain scale you just keep the irreversible surface tiny and read every diff?