DEV Community

You Fixed the Rate Limits. Now Your Agent Fails Quietly.

Sergei Parfenov on June 11, 2026

Last week I wrote that your agent isn’t failing because it hallucinates — it’s failing because of rate limits. The capacity-engineering toolkit in ...

xulingfeng • Jun 12

The "uptime, not correct uptime" distinction is gold. We hit the same pattern with AI-driven test automation at my last company — pass rate climbed because the AI kept "fixing" flaky tests by shrinking their assertion scope. The pipeline stayed green, but the tests stopped catching real regressions.

The taint propagation approach for multi-step agents makes a lot of sense. Same correctness debt, different level of the stack — and way harder to spot until something irreversible happens.

Sergei Parfenov • Jun 12

the shrinking-assertion-scope story is the nastiest version of this pattern ive heard, because the degradation happened in the verification layer itself. my whole taint approach quietly assumes the verifier is trustworthy — tag the degraded data, gate the irreversible action, re-check against something solid. but when the thing that checks correctness is what degraded, uve lost the instrument that wouldve caught it. green pipeline, hollow assertions. thats not a quiet failure anymore, its a quiet failure with a forged alibi.

guess the test-automation version of my dashboard metric would be tracking assertion scope/strength over time, not pass rate — pass rate is exactly the metric the failure mode games.

James O'Connor • Jun 15

The correct-uptime framing is sharp. The piece I would add from the tool-calling side: a fallback or a cache does not just return stale output, it returns output your validation may never have seen. We had a cached tool result skip the precheck that a fresh call would have hit, so a value that was valid when it was cached sailed through after it had gone stale. Now anything from a fallback path runs the same validation as a fresh call, no exception for the cache. Availability that serves unchecked output is its own kind of outage, just one that does not page anyone.

Sergei Parfenov • Jun 22

"availability that serves unchecked output is its own kind of outage, just one that doesnt page anyone" — thats the post in one sentence, better than i said it.

ur cache-skipping-the-precheck case is the sharpest version of this ive seen, because it exposes a hidden assumption: teams put validation on the fresh call path and treat the cache as already-trusted. but the cache is exactly where trust decays silently — the value was valid at write time, and validation never re-runs because "its just a cache hit." so the precheck guards the door the stale value never walks through. ur fix is the right one and stricter than my "trust tag" — validation should be a property of consuming a value, not of the path that produced it. doesnt matter if it came fresh, from cache, or from a fallback: if its about to be used, it gets checked. that closes the laundering loophole at the consumer instead of trying to tag every producer.
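A minimal sketch of "validation is a property of consuming a value, not of the path that produced it". All names here (`ToolResult`, `consume`, `sku_is_active`, the `source` labels) are hypothetical and only illustrate the idea from this exchange; they are not taken from the post's code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolResult:
    value: str
    source: str  # hypothetical labels: "fresh", "cache", "fallback"


class ValidationError(Exception):
    """Raised when a value fails its precheck at the point of use."""


def consume(result: ToolResult, precheck: Callable[[str], bool]) -> str:
    # The check runs on every consumption, whatever path produced the value.
    if not precheck(result.value):
        raise ValidationError(f"{result.source} value {result.value!r} failed precheck")
    return result.value


# Hypothetical "current world": B2 was active when it was cached, but no longer is.
ACTIVE_SKUS = {"A1"}


def sku_is_active(sku: str) -> bool:
    return sku in ACTIVE_SKUS


fresh = ToolResult(value="A1", source="fresh")
cached = ToolResult(value="B2", source="cache")

accepted = consume(fresh, sku_is_active)
try:
    consume(cached, sku_is_active)
    cache_rejected = False
except ValidationError:
    cache_rejected = True
```

The cache hit gets no exemption: `consume` never looks at `source` to decide whether to validate, only to report where a rejected value came from.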

Alex Shev • Jun 12

This is the hidden cost of making agents more resilient. Retries, cache, fallback models, and degraded modes all improve uptime, but they can also hide the moment when the answer stopped being freshly earned.

I like the distinction between uptime and correct uptime. For agents, the SLO should probably include provenance: which inputs were current, which tools actually ran, which fallbacks triggered, and what confidence was produced by evidence instead of habit.

Sergei Parfenov • Jun 22

"confidence produced by evidence instead of habit" — thats the line. and yeah, baking provenance into the SLO is the right move: the current agent SLO is basically "did it respond", which is the uptime trap exactly. a provenance-aware SLO would track % of completed tasks where every step was fresh/primary/verified vs how many leaned on a degraded path nobody re-checked. the hard part is it makes ur dashboard look worse on day one — but thats the honest number. an SLO that cant go red on silent degradation isnt measuring the thing that actually breaks u.

Alex Shev • Jun 22

Yes, that is the uncomfortable part: a provenance-aware SLO makes the system look worse before it gets better. But it also turns silent degradation into something you can budget for. I would rather see a red dashboard for stale or fallback-backed work than a green one hiding the risk.
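One possible way to compute the provenance-aware number discussed above, next to the "did it respond" number it replaces. This is a sketch under assumptions: the `Step` record and its `degraded` / `rechecked` fields are hypothetical, and a task counts as clean only if every degraded step was re-checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    degraded: bool           # ran on a cache, fallback, or other degraded path
    rechecked: bool = False  # a degraded step that was later re-verified


def is_clean(task: list[Step]) -> bool:
    # Clean means: no degraded step that nobody re-checked.
    return all((not s.degraded) or s.rechecked for s in task)


def provenance_slo(tasks: list[list[Step]]) -> float:
    # Fraction of completed tasks whose every step was fresh or re-verified.
    if not tasks:
        return 1.0
    return sum(is_clean(t) for t in tasks) / len(tasks)


completed = [
    [Step("fetch", False), Step("act", False)],
    [Step("fetch", True, rechecked=True), Step("act", False)],
    [Step("fetch", True), Step("act", False)],  # degraded and never re-checked
    [Step("fetch", False)],
]

responded_slo = 1.0  # every task "responded", so the uptime-style SLO is green
honest_slo = provenance_slo(completed)
```

With these four tasks the response-based number stays at 1.0 while the provenance-aware one drops to 0.75, which is the "looks worse on day one" effect described in the thread.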

Manuel Bruña • Jun 15

Quiet failure is worse than a hard rate-limit error. For agent systems I’d rather have an explicit degraded state: skipped tool, stale data, partial result, retry budget exhausted. If that is hidden, the final answer looks more reliable than it is.

Sergei Parfenov • Jun 22

yeah, and ur list is actually an upgrade to what i wrote — i treated trust as binary (full/degraded), but ur right that the kind of degradation matters for what u do next. "skipped a tool" and "retry budget exhausted" and "stale data" should route differently: a skipped tool might just need a re-run, stale data needs a freshness check, partial result needs a human. so the tag probably wants to be an enum, not a bool — degraded-why, not just degraded. the binary version is the MVP, but u lose the routing logic that tells u how to recover. stealing this.

Manuel Bruña • Jul 9

Yes, degraded-why is the useful shape.

A boolean tells you trust changed, but not what to do next. An enum or structured reason can route recovery: retry skipped tool, refresh stale data, escalate partial output, or stop on exhausted budget. That is the difference between detecting degradation and making it operational.
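The "degraded-why" shape might look like the following sketch. The enum members mirror the four cases named in this exchange; the recovery action names are placeholders chosen for illustration.

```python
from enum import Enum


class DegradedReason(Enum):
    SKIPPED_TOOL = "skipped_tool"
    STALE_DATA = "stale_data"
    PARTIAL_RESULT = "partial_result"
    RETRY_BUDGET_EXHAUSTED = "retry_budget_exhausted"


# Hypothetical recovery actions, one per reason.
RECOVERY = {
    DegradedReason.SKIPPED_TOOL: "rerun_tool",
    DegradedReason.STALE_DATA: "refresh_data",
    DegradedReason.PARTIAL_RESULT: "escalate_to_human",
    DegradedReason.RETRY_BUDGET_EXHAUSTED: "stop",
}


def route(reasons: set[DegradedReason]) -> list[str]:
    # An empty set means nothing degraded, so there is nothing to recover.
    return sorted(RECOVERY[r] for r in reasons)


plan = route({DegradedReason.STALE_DATA, DegradedReason.SKIPPED_TOOL})
```

A boolean would have reported the same "degraded" for both reasons; the enum keeps enough information to pick a different recovery for each.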

VoltageGPU • Jun 12

Great post—this really hits on the nuance between availability and correct availability. In distributed systems, especially when dealing with GPU-accelerated workloads on platforms like VoltageGPU, it's easy to mask rate-limiting with retries, but that can lead to stale or incorrect results downstream. I've seen this in inference pipelines where cached responses were used under load, leading to subtle correctness issues that only surfaced in edge cases.

VoltageGPU • Jun 16

Great piece—very much in line with what I've seen in distributed systems. In GPU workloads, especially with rate-limited inference APIs, we often add retries with jitter, but subtle state corruption can still happen if the retry logic doesn't fully respect the original request context. It's a good reminder that availability isn't enough if correctness is compromised.

Sergei Parfenov • Jun 22

right — "retry doesnt respect the original request context" is exactly the non-idempotent retry case. the fix is making the retry carry an idempotency key derived from the original request, so a re-run is a dedup'd no-op instead of a fresh side effect. retry logic that regenerates context on each attempt is where the subtle corruption sneaks in. availability ≠ correctness, agreed.
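A small sketch of an idempotency key derived from the original request, so that a retry is a dedup'd no-op. `PaymentService` is a hypothetical in-memory stand-in for an external system that honors such keys; real services differ in how they accept and store them.

```python
import hashlib
import json


def idempotency_key(request: dict) -> str:
    # Canonical JSON so the same logical request always hashes to the same key.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class PaymentService:
    """Hypothetical service that dedups side effects by idempotency key."""

    def __init__(self) -> None:
        self._receipts: dict[str, str] = {}
        self.charges: list[int] = []

    def charge(self, request: dict, key: str) -> str:
        if key in self._receipts:
            return self._receipts[key]  # replay: return the earlier receipt
        self.charges.append(request["amount"])
        receipt = f"receipt-{len(self.charges)}"
        self._receipts[key] = receipt
        return receipt


service = PaymentService()
original = {"order_id": "o-17", "amount": 500}
key = idempotency_key(original)  # derived once, from the original request

first = service.charge(original, key)
retry = service.charge(original, key)  # e.g. after a timeout or a 429
```

The key is computed from the original request and reused on every attempt; regenerating it per attempt would turn each retry into a fresh side effect.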

Lily • Jun 16

The distinction between uptime and "correct uptime" is something more teams should be talking about. Most dashboards celebrate successful requests, but very few measure whether degraded paths are influencing downstream decisions. The idea of propagating trust across an agent workflow feels like a natural evolution of traditional reliability engineering.

Sergei Parfenov • Jun 22

"dashboards celebrate successful requests" is the whole problem in one line. a 200 with a degraded answer is counted as a win by every standard observability setup — the success metric is structurally blind to exactly the failure that matters. and yeah, the lineage of all this is straight out of classic reliability eng (taint from security, provenance from data, lineage from ML) — agents just made it urgent because now the untraceable result acts. nothing new under the sun, just newly load-bearing.

VoltageGPU • Jun 13

As someone who's worked on resilient GPU orchestration systems, I appreciate the emphasis on "correct uptime" — it's easy to get tripped up when auto-scaling or retries hide stale or incorrect results. In confidential computing contexts, even a silent failure can compromise data integrity, so validating outputs isn't just a nice-to-have, it's a hard requirement.

Theo Valmis • Jun 13

Trust isn't a scalar that composes, and that's the hard part hiding inside "propagate trust across the chain." A retry that serves a stale cache and a fallback to a weaker model both lower trust, but along different axes, freshness versus capability, and downstream steps don't care about the same one. A summarization step tolerates a weaker model and not stale data; a price calc is the reverse. Propagate a single trust score and you either over-reject, treating every degradation as fatal, or under-reject, averaging the dangerous one away. What composes is typed provenance: which gate got relaxed and how, carried alongside the result, so each consumer applies its own policy. That turns "can I still trust this?" from one global question into a per-step one, which is the only version that survives a long chain.

Sergei Parfenov • Jun 22

this is the best critique of the post and ur completely right — i used a single trust scalar (full/degraded) and that collapses on exactly the case u describe. freshness and capability are orthogonal axes, and a scalar forces u to pick a threshold that's simultaneously too strict for the summarization step and too loose for the price calc. someone else in these comments pushed the same direction (enum instead of bool) but u took it all the way: the issue isnt granularity of the level, its that trust is a vector over axes, and collapsing it to one number destroys the information the consumer actually needs.

"typed provenance carried alongside the result, each consumer applies its own policy" is the correct design and a real upgrade over what i wrote. it also dissolves my taint-propagation code: u dont propagate a degraded flag, u propagate the vector (this step ran on a fallback model → capability axis lowered; this step used a 2h-old cache → freshness axis lowered), and the irreversibility gate for a price calc checks the freshness axis while a summary gate checks capability. same mechanism, but the policy lives at the consumer, not in a global threshold. which is also why "trust" was always the wrong word — its provenance, and trust is what each consumer computes from provenance under its own rules.

the part i'd genuinely like ur take on: how many axes before it stops being worth it? freshness + capability are the obvious two. do u model tool-success and human-verified as separate axes too, or does the vector get unwieldy and u collapse back to a small fixed set? feels like there's a sweet spot and i dont know where it is.
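Typed provenance with per-consumer policy could be sketched roughly as below, using only the two axes named in this exchange (freshness and capability). The field names, the 60-second threshold, and which axis each gate checks are assumptions for illustration; the gates follow the example given in the reply above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    data_age_s: float = 0.0       # freshness axis: age of the oldest input
    fallback_model: bool = False  # capability axis: a weaker model was used

    def combine(self, other: "Provenance") -> "Provenance":
        # Downstream results inherit the worst value on each axis.
        return Provenance(
            data_age_s=max(self.data_age_s, other.data_age_s),
            fallback_model=self.fallback_model or other.fallback_model,
        )


def price_gate(p: Provenance, max_age_s: float = 60.0) -> bool:
    # Hypothetical policy: this consumer only checks freshness.
    return p.data_age_s <= max_age_s


def summary_gate(p: Provenance) -> bool:
    # Hypothetical policy: this consumer only checks capability.
    return not p.fallback_model


two_hour_cache = Provenance(data_age_s=2 * 60 * 60)
weaker_model = Provenance(fallback_model=True)
both = two_hour_cache.combine(weaker_model)
```

The same two degradations get opposite verdicts from the two gates, which is the information a single scalar would have averaged away.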

HARD IN SOFT OUT • Jun 13

This is the rare sequel that makes the original post better. The “uptime vs correct uptime” distinction is one of those phrases that will quietly live in my head during every architecture review now. (Also, the point about fallback models being different models — not just slower — is something I've seen teams realise only after a very expensive incident.)

A couple of thoughts from reading:

The taint‑propagation idea is elegant, but it assumes every step can be traced. In a real agentic workflow, steps often run in parallel or produce outputs that get merged non‑linearly. Have you experimented with a time‑bounded taint? Something like “if no degraded step affected the decision path in the last 15 seconds, reset taint.” Not perfect, but cheaper than full DAG tracking, and might catch most of the real risk.
The cache validity check using assumptions (file version, data snapshot) is great, but most teams won’t wire that manually. What if the cache entry simply stored the hash of the prompt + system version + timestamp, and the agent refused to use any cache entry older than a task‑specific TTL (short for mutable data, longer for static)? That’s one line of code and catches most staleness without building an assumption registry.

One small improvement: the dashboard metrics are solid, but “fallback divergence” replay is expensive at scale. A cheaper proxy: sample 1% of fallback responses and send them to the primary asynchronously for a second opinion, logging divergence. No blocking, no extra latency, just a warning light that tells you when your fallback is drifting too far from the truth.

And because the 429 that started all of this deserves a dark joke:

The agent hit a rate limit. It fell back to a cached answer from last Tuesday.

The world changed on Wednesday.

The agent kept working. The logs said “cache hit, 200 OK.”

The user got a message: “Your order has shipped.”

The warehouse’s API key expired on Thursday.

Anyway, this post is the reason I’m adding a “trust” field to my agent’s result objects tomorrow. Thank you.

Sergei Parfenov • Jun 22

this is a ridiculously good comment, thank you.

taking all three, but pushing back on one:

time-bounded taint: i love the instinct (full DAG tracking is too expensive for most teams, agreed), but i think a time bound is the one axis i'd be nervous about, because degraded state can sleep. a stale value gets written to memory, nothing touches it for 20s, the 15s timer resets the taint to clean, and then step 9 reads that value and acts on it. the taint expired before the damage fired. so id bound it by causality instead of wall-clock: taint clears when nothing on the live decision path still derives from the degraded step, not when N seconds passed. harder than a timer, but a timer resets based on the one thing that doesnt correlate with risk. for the parallel/non-linear merge case ur right that pure linear propagation breaks. that genuinely needs the taint to be a set that unions at merge points (which is why the code used a set, not a flag), but i'll admit i hand-waved the parallel case in the post.
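A sketch of causality-bounded taint as a set that unions at merge points. This is not the post's code; `Tracked`, `derive`, `reverify`, and the step labels are hypothetical, and `reverify` stands in for an actual re-check against a trusted source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    value: object
    taint: frozenset = frozenset()  # ids of degraded steps this value derives from


def derive(value: object, *inputs: Tracked) -> Tracked:
    # Merge point: the result carries the union of its inputs' taint.
    merged = frozenset().union(*(i.taint for i in inputs))
    return Tracked(value, merged)


def reverify(item: Tracked, step_id: str) -> Tracked:
    # Call only after the degraded input has really been re-checked.
    return Tracked(item.value, item.taint - {step_id})


def may_act(item: Tracked) -> bool:
    # Gate for irreversible actions: no unresolved taint on the decision path.
    return not item.taint


stale_price = Tracked("price=10", frozenset({"step3:stale_cache"}))
quantity = Tracked("qty=2")

subtotal = derive("subtotal", stale_price, quantity)  # branch A
shipping = derive("shipping", quantity)               # branch B, run in parallel
order = derive("order", subtotal, shipping)           # non-linear merge

cleared = reverify(order, "step3:stale_cache")
```

No clock appears anywhere: a value stays tainted for as long as it derives from the degraded step, however long it sat in memory, and a branch that never touched that step is clean from the start.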

cache TTL with hashed assumptions — yes, this is strictly better than what i wrote. hash(prompt + system version + timestamp) + task-specific TTL is the 80/20: one line, no registry, and the task-specific part is the key insight (mutable data gets seconds, static gets days). i over-engineered the "assumption registry" framing when a TTL keyed on data volatility catches most of it. stealing.
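A sketch of the hashed-key cache with a task-specific TTL. One interpretation is assumed here: the prompt and system version go into the hash, while the timestamp is stored next to the entry, since hashing the timestamp into the key would prevent any later lookup from matching. `TTLCache` and the injectable `clock` are hypothetical.

```python
import hashlib
import time
from typing import Callable, Optional


class TTLCache:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    @staticmethod
    def key(prompt: str, system_version: str) -> str:
        raw = f"{system_version}\x00{prompt}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def put(self, prompt: str, system_version: str, value: str) -> None:
        self._entries[self.key(prompt, system_version)] = (value, self._clock())

    def get(self, prompt: str, system_version: str, ttl_s: float) -> Optional[str]:
        entry = self._entries.get(self.key(prompt, system_version))
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at > ttl_s:
            return None  # too old for this task: treat it as a miss
        return value


now = [0.0]  # fake clock so the example is deterministic
cache = TTLCache(clock=lambda: now[0])
cache.put("stock level for A1?", "v7", "12 units")

now[0] = 30.0
static_hit = cache.get("stock level for A1?", "v7", ttl_s=86400)  # static data
mutable_hit = cache.get("stock level for A1?", "v7", ttl_s=5)     # mutable data
other_version = cache.get("stock level for A1?", "v8", ttl_s=86400)
```

The TTL is passed by the caller at read time, so the same entry can be fresh enough for one task and too stale for another.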

async 1% sampling — also strictly better. full replay was the expensive version; sampling 1% to the primary async for a second opinion gives u the divergence signal as a warning light with zero added latency on the hot path. thats the version that actually ships. the only thing i'd add: weight the sample toward irreversible-action paths, since a divergence on a summary matters less than one on a payment.
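The sampling idea, including the weighting toward irreversible paths, might be sketched as follows. The 1% base rate comes from the thread; the 10% rate for irreversible paths, the string comparison in `diverges`, and the `primary` callable are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def should_sample(irreversible: bool, draw: float,
                  base_rate: float = 0.01, irreversible_rate: float = 0.10) -> bool:
    # draw is a uniform number in [0, 1), e.g. random.random().
    rate = irreversible_rate if irreversible else base_rate
    return draw < rate


def diverges(fallback_answer: str, primary_answer: str) -> bool:
    # Placeholder comparison; a real check would likely be semantic.
    return fallback_answer.strip().lower() != primary_answer.strip().lower()


divergences: list[str] = []


def second_opinion(prompt: str, fallback_answer: str,
                   primary: Callable[[str], str]) -> None:
    if diverges(fallback_answer, primary(prompt)):
        divergences.append(prompt)  # warning light only, nothing is blocked


executor = ThreadPoolExecutor(max_workers=1)
if should_sample(irreversible=True, draw=0.05):
    future = executor.submit(
        second_opinion, "status of order 7?", "shipped", lambda p: "not shipped"
    )
    future.result()  # only this demo waits; the hot path would not
executor.shutdown()
```

The fallback answer has already been returned to the caller by the time the second opinion runs, so the check adds no latency to the request it is auditing.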

and that parable is going in my head permanently: "the warehouse's API key expired on Thursday" is the entire post compressed into five lines. the whole chain green, every hop a 200, and a real package never ships. mind if i quote it (credited) if i write the follow-up?

good luck with the trust field tomorrow. start it as a typed vector, not a bool, youll thank urself (someone else in these comments made the case for why a scalar collapses).

VoltageGPU • Jun 14

As someone working with GPU-accelerated systems, I've seen how rate limiting workarounds can silently break data pipelines, especially when using real-time inference. One time, we added retries for a GPU cluster API, but stale results started getting cached during outages—looks available, acts broken. It's a great reminder that SLOs must account for correct responses, not just timely ones.

TuanAnhNguyen • Jun 18

The "uptime vs correct uptime" cut is the part I'll carry around. Reading this from the other end of the scale though — I'm a solo builder, my "agents" are coding tools running all day, and the quiet-failure version I live with is smaller but identical in shape: a tool edits a file off a stale read, three steps later something breaks, and nothing ever errored. No dashboard, no taint tracking — just me eventually noticing the trajectory went bad somewhere upstream.
What landed hardest is "gate on risk, not confidence." At my scale I can't build the full two-gate system, but the cheap version is dead simple: mark which steps touch something I can't undo (a commit, a deploy, anything that writes), and force a human read on exactly those — ignore confidence entirely. Most of the correctness debt you describe is survivable for a solo dev because the irreversible surface is small. The discipline is just knowing precisely where it is.
Question back: for someone without the observability stack, is there a poor-man's proxy for "degraded-path exposure"? Or is the honest answer that below a certain scale you just keep the irreversible surface tiny and read every diff?

Sergei Parfenov • Jun 22

ur version is the same shape minus the org overhead, and honestly the solo case clarifies the whole thing: "the discipline is just knowing precisely where the irreversible surface is" is the actual lesson, the dashboards are just what u build when u cant hold that surface in ur head anymore.

straight answer to ur question: below a certain scale, yes — keep the irreversible surface tiny and read every diff is the correct answer, not a cop-out. the two-gate system is what u build when the irreversible surface grows past what one person can eyeball. dont build it early.
but theres a poor-mans degraded-path proxy that costs almost nothing: have the tools that touch a stale-readable input write a line to a log whenever they act on data older than some threshold — not a dashboard, just an append to a file u grep before anything irreversible. "did anything in this session act on a stale read" is a grep away, and it converts ur eventual-noticing into a pre-commit check. ur not tracking taint through the graph, ur just leaving a breadcrumb every time a tool could have carried bad state forward, and reading the breadcrumbs at the one moment that matters (right before the commit/deploy). its the 5%-effort version of "% of irreversible actions that fired with degraded input" — same question, answered by grep instead of a metrics pipeline.

the threshold where u graduate from grep to real tracking is basically "when the irreversible surface stops fitting in one human's working memory." for a solo builder thats further out than people think.
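The breadcrumb habit described above can be sketched in a few lines. The `STALE_READ` line format, the function names, and the 30-second threshold are hypothetical; the point is only an append-only file that gets read right before anything irreversible.

```python
import tempfile
from pathlib import Path


def note_stale_read(log_path: Path, tool: str, source: str,
                    age_s: float, threshold_s: float) -> bool:
    # Leave a breadcrumb only when a tool acts on data older than the threshold.
    if age_s <= threshold_s:
        return False
    with log_path.open("a", encoding="utf-8") as f:
        f.write(f"STALE_READ tool={tool} source={source} age_s={age_s:.0f}\n")
    return True


def stale_reads(log_path: Path) -> list[str]:
    # The pre-commit check: the same question grep would answer.
    if not log_path.exists():
        return []
    lines = log_path.read_text(encoding="utf-8").splitlines()
    return [line for line in lines if line.startswith("STALE_READ")]


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "breadcrumbs.log"
    empty_before = stale_reads(log)
    note_stale_read(log, tool="edit_file", source="src/app.py", age_s=5, threshold_s=30)
    note_stale_read(log, tool="edit_file", source="src/db.py", age_s=120, threshold_s=30)
    found = stale_reads(log)  # run this right before a commit or deploy
```

If `found` is non-empty before a commit or deploy, that is the cue to re-read the listed sources and the diff before going ahead.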

TuanAnhNguyen • Jun 23

The grep-the-breadcrumb version is exactly the altitude I can act on — "did anything this session act on a stale read" answered before the commit, not after the incident. That reframes it from observability I can't afford to a one-line habit I can. And "the dashboards are what you build when you can't hold the surface in your head anymore" is the line I'm keeping — it makes the whole two-gate system feel like a natural graduation instead of something I failed to build on day one. Appreciate you taking the solo case seriously; most infra writing pretends the small scale doesn't exist.

mote • Jun 18

Rate limits causing silent failures is worse than outright crashes — at least a crash gets logged. I've watched agents accumulate partial state across multiple 429 responses and then execute with half the context missing. The output looks plausible enough that nobody notices until corrupted data hits production three steps later.

The real problem is most agent frameworks treat rate limits as transport-layer issues rather than application-layer state corruption. A 429 isn't "try again later" — it means "your current execution branch is now poisoned." If the agent was in the middle of mutating internal state when the limit hit, the retry starts from a half-baked world.

How do you handle the case where the agent's internal state is already partially written when the rate limit fires? Undo the mutation or trust the retry with the dirty state?

Sergei Parfenov • Jun 22

"a 429 means your current execution branch is now poisoned" — thats a sharper way to put the whole post, im stealing it. transport-layer vs application-layer state corruption is exactly the mental model most frameworks are missing.

on ur question — i think the honest answer is neither "undo" nor "trust the dirty retry", because both are traps:

undo assumes the mutation is reversible. if the half-written state is purely internal (scratchpad, working memory) u can roll it back, but the moment any of it escaped as a side effect (a write, a message, a tool call that hit an external system), there's nothing to undo. u cant un-send.

trust the dirty retry is just accepting the poison.

the way out is to not be in that position: make each step a checkpoint, and make the side-effecting parts idempotent. then a 429 mid-step doesnt leave u choosing between rollback and dirty-retry: the retry resumes from the last committed checkpoint, and the idempotency key means any side effect that already fired is a dedup'd no-op on replay instead of a double-fire. so the design move is upstream of ur question: the reason "undo vs dirty retry" feels like a dilemma is that the step wasnt atomic in the first place. saga pattern + idempotency keys turn "half-baked world" into "resume from a known-good line."

the part that stays genuinely hard: internal reasoning state that isnt a clean tool call — the agent's half-updated belief about the world. that u cant idempotency-key. for that the only honest answer ive got is checkpoint granularity: commit reasoning state at step boundaries, never mid-step, so there's always a clean line to resume from. curious if uve found anything better there, because thats the part i dont think anyone's fully solved.
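A toy version of "commit at step boundaries, make side effects idempotent". Everything here is hypothetical (`run`, `Outbox`, the step functions, the key format): state is mutated on a scratch copy and committed only when a step finishes, and the simulated external system dedups by key.

```python
import copy


class RateLimited(Exception):
    pass


class Outbox:
    """Hypothetical external system that dedups side effects by idempotency key."""

    def __init__(self) -> None:
        self.sent: dict[str, str] = {}

    def send(self, key: str, message: str) -> None:
        self.sent.setdefault(key, message)  # a replay with the same key is a no-op


def run(steps, checkpoint: dict) -> dict:
    # Resume from checkpoint["next"]; commit state only at step boundaries.
    while checkpoint["next"] < len(steps):
        i = checkpoint["next"]
        working = copy.deepcopy(checkpoint["state"])  # scratch copy, dropped on failure
        steps[i](working)                             # may raise mid-step
        checkpoint["state"], checkpoint["next"] = working, i + 1
    return checkpoint["state"]


outbox = Outbox()
attempts = {"notify": 0}


def reserve(state: dict) -> None:
    state["reserved"] = True


def notify(state: dict) -> None:
    state["notified"] = "partial"  # half-written internal state
    outbox.send("task-1:notify", "order reserved")
    attempts["notify"] += 1
    if attempts["notify"] == 1:
        raise RateLimited("429 after the side effect already fired")
    state["notified"] = True


checkpoint = {"next": 0, "state": {}}
try:
    run([reserve, notify], checkpoint)
except RateLimited:
    pass
state_after_429 = copy.deepcopy(checkpoint["state"])
next_after_429 = checkpoint["next"]

final_state = run([reserve, notify], checkpoint)  # retry resumes at notify
```

After the simulated 429 the committed state has no trace of the half-written `notified` value, and the retry replays the step without sending the message twice. As the comment says, this covers tool-call side effects, not the agent's half-updated beliefs.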

Mykola Kondratiuk • Jun 14

removing the error without replacing it with a new signal is the core problem. one thing the toolkit needs: validate the fallback model's output schema separately - primary and fallback often return differently-structured responses, and format drift is invisible downstream.
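Checking the fallback's output shape separately might look like this sketch. The `REQUIRED` fields and the two example responses are invented for illustration; a real setup would use whatever schema the downstream consumer actually depends on.

```python
# Hypothetical schema the downstream step relies on.
REQUIRED = {"answer": str, "citations": list}


def schema_errors(response: dict) -> list[str]:
    errors = []
    for field, expected in REQUIRED.items():
        if field not in response:
            errors.append(f"missing field: {field}")
        elif not isinstance(response[field], expected):
            actual = type(response[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors


primary_response = {"answer": "42", "citations": ["doc-1"]}
fallback_response = {"answer": "42", "sources": "doc-1"}  # same content, drifted shape

primary_errors = schema_errors(primary_response)
fallback_errors = schema_errors(fallback_response)
```

Running the same check on both paths turns format drift into an explicit signal instead of something a downstream step silently works around.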

Ahmet Özel • Jun 12

Good framing. Silent degradation is where agents get dangerous because the system still looks alive from the outside. One thing I like to add is an eval replay set for degraded runs: keep the tool trace, retrieved context and final answer together, then replay the same cases after prompt/tool changes. It catches cases where the agent learned to continue smoothly while carrying bad state forward.

Sergei Parfenov • Jun 12

the degraded-run replay set is a great addition — its basically the offline half of the "fallback divergence" metric from the post. i diff fallback answers against the primary now; ur replaying the whole trace after changes, which catches the scarier thing: the agent learning to glide smoothly over bad state. keeping trace + retrieved context + answer together is the part most people skip and then cant reconstruct. adding this to the toolkit.

Scarab Systems • Jun 12 • Edited

this is exactly the sort of pivot in approach I'm interested in...

I would take it even a step further... the agent should not need to carry state... state and context should be provided by something that can carry that weight cleanly and more importantly truthfully... the repo... then the agent can continue to do what it does best.. code.

Sergei Parfenov • Jun 12

externalizing state is the right instinct — stateless agents + a source of truth they read from beats agents lugging context around, agreed. and for code the repo is the best ledger we have.

but heres where it doesnt close the loop: the repo records outcomes, not provenance. a commit produced from a degraded fallback chain diffs identically to one produced from clean primary reasoning. git gives u receipts for what changed — its silent on whether u should trust how it got there. so moving state into the repo solves the "agent carries fragile context" problem, but the evidence problem just moves with it: something still has to carry the trajectory-level receipts alongside the artifact. repo as ledger for state, evidence layer for process. u need both, theyre answering different questions.

Scarab Systems • Jun 12

ah Yes! — this is the distinction I was reaching for, and I think you’re right to split it that way.

When I say the repo should carry the authority, I don’t mean the git diff alone proves the process. A commit can show what changed while saying almost nothing about whether the change preserved the right obligations.

The way I think about it is more like: the repo has to be read into a baseline first.

Not just “current files,” but the repo’s claims: tests, docs, contracts, generated-vs-source boundaries, config expectations, ownership surfaces, validation signals, and whatever the system already uses to say “this is true here.”

Then the agent is not carrying the burden of remembering all of that conversationally. It is working against a diagnostic baseline that can say: this claim existed before, this boundary owned it, this artifact was evidence for it, and this change either preserved, moved, weakened, or contradicted it.

So yes: repo as ledger for state, evidence layer for process — but I’d add that the evidence layer has to be grounded in a repo baseline, not just attached afterward as trace metadata.

That is the shape I’m interested in: before the workflow acts, it should be able to show both the artifact change and the evidence chain that says the change still belongs where it landed.

Scarab Systems • Jun 12

This is a really strong framing — especially the distinction between uptime and correct uptime.

The part that stands out to me is that the degraded path is not just a runtime state; it becomes an evidence problem. Once a fallback, stale cache hit, or retried side effect enters the chain, the question is no longer only “did the agent complete?” It becomes “what proof does the system still have that the completed trajectory preserved the intended boundary?”

That is very close to the diagnostic layer I’ve been exploring with Scarab/SDS. The failure is often not the loud error. The loud error is honest. The more dangerous failure is when the system keeps moving after the boundary that was supposed to preserve trust has already weakened.

The taint propagation point feels especially important. A degraded step should not be allowed to launder itself through later successful calls. If step 6 is clean but step 3 was degraded and never re-verified, the trajectory is still carrying that earlier uncertainty.

I like the “two gates” framing a lot. I’d almost describe Gate 2 as an evidence gate: before an irreversible action, the system has to prove not just that the last call succeeded, but that the whole chain still has valid provenance.

Sergei Parfenov • Jun 12

"evidence gate" is honestly a better name than mine — because it makes the obligation explicit. a trust tag is passive metadata; evidence is something the chain has to carry and produce on demand. step 6 shouldnt just be untainted, it should be able to show receipts for steps 1-5. same mechanism, stronger contract. stealing the term (with credit).

Scarab Systems • Jun 12

Yes — please take it and use it. Credit appreciated, but honestly the bigger thing is that we start naming the problem clearly enough to work on it together.

That “receipts for steps 1–5” phrasing is exactly the contract I was trying to get at. A tag describes a state, but an evidence gate asks whether the chain can actually produce proof for the state it is claiming.

The more we can shift the conversation from “did the agent finish?” to “what evidence does the workflow carry forward?”, the more useful the whole discussion becomes.

I think that shared language matters here because this failure mode is showing up in a lot of different places under different names. Once we can name it together, we can start designing around it instead of just reacting to it.

Sergei Parfenov • Jun 12

agreed — and ur "different names" point is literally true across fields: security calls it taint, data engineering calls it provenance, ML calls it lineage, audit calls it receipts. four communities, one shape: can you trace what this result stands on. agents just made it urgent because now the untraceable thing acts.
"what evidence does the workflow carry forward" is the right question to standardize on. good thread — this is going in the next post.

VoltageGPU • Jun 15

Great article — it's easy to focus on just hitting availability numbers, but ensuring correctness under load is where the real complexity lies. In GPU-based systems, especially with inference workloads, we've seen cases where rate limiting was bypassed, but stale or cached results were served without proper tracking, leading to silent errors. It's a reminder that SLOs need to account for both freshness and fidelity, not just uptime.