Sergei Parfenov

Posted on Jun 11

You Fixed the Rate Limits. Now Your Agent Fails Quietly.

#ai #llm #machinelearning #devops

Uptime versus correct uptime trade-offs

Last week I wrote that your agent isn’t failing because it hallucinates — it’s failing because of rate limits. The capacity-engineering toolkit in that post — concurrency caps, backoff with jitter, fallback models, caching — is real and it works. Deploy it and your agent stops dying.

Then a commenter (ANP2) pointed out the thing the post undersold, and it’s been stuck in my head since: every one of those fixes quietly opens a correctness hole while it closes the availability one. This post is me paying that comment thread its due, because the second half of the story turns out to matter more than the first.

TL;DR — A 429 is a loud failure: you see it, you alert on it, you fix it. Retries, fallbacks, and caches keep the agent alive — but they let it act on output it didn’t freshly earn: a stale cache hit, a different model’s answer, a re-run side effect. You’ve traded loud failures for quiet ones. The fix is to treat availability (“can I serve this?”) and correctness (“can I still trust the result?”) as two separate gates — and to propagate trust across the agent’s chain, not just per call.

The trade you didn’t know you made

Here’s the uncomfortable symmetry. The whole point of my last post was that the dominant production failure mode isn’t the model being wrong — it’s the plumbing saying no. The capacity toolkit fixes the plumbing. But look at what each fix actually does:

A retry re-runs a call. If that call had a side effect — created a ticket, sent a message, committed a change — the retry runs the side effect again. The agent didn’t fail; it succeeded twice, which is its own kind of wrong.
A fallback model answers when the primary is rate-limited. But it’s a different model: different training, different calibration, different failure modes. The task continues on an answer the primary never produced.
A cache hit serves a response generated for an earlier input. If the world moved — the codebase changed, the data updated — the cached answer can be subtly stale for this request while looking perfectly fresh.

Each mechanism keeps the agent up. None of them guarantees the agent is right. And the cruel part is the failure economics: the 429 you eliminated was honest — visible, countable, alertable. The failures you bought instead are silent. The agent stays up and is confidently wrong, which is exactly the failure mode the hallucination-hunters were worried about in the first place — just arriving through the plumbing instead of the model.

The reliability you bought is uptime, not correct uptime. (That phrase is ANP2’s, and it’s better than anything in my original post.)

Two gates, not one

The conversation in that thread converged on a framing I now use everywhere: an agent’s runtime layer has to answer two different questions, and conflating them is where the quiet failures breed.

Gate 1 — “Can I serve this?” This is the availability gate. Trip the fallback on 429s, serve the cache on a hit, retry on transient errors. Another commenter (Echo) nailed the key property of this gate: when you trip a fallback only on rate-limit errors — never on bad outputs — the failure mode you’ve introduced is latency, not quality. The fallback just buys time. That’s a fine trade, and it’s why the capacity toolkit is still the right first move.

Gate 2 — “Can I act on this irreversibly?” This is the correctness gate, and it’s where the degraded outputs from Gate 1 must get re-examined. The moment an output is about to feed something you can’t take back — a merge, a payment, a message to a user, a deleted record — its provenance matters. Did it come from the primary, fresh? Or from a fallback, a cache, a retry?

One rule worth stealing here: gate on risk, not on confidence. There’s a war story making the rounds of an agent that was 95% confident about a production database migration — the missing 5% was a foreign-key constraint absent from its test data, and the only thing that prevented corrupted referential integrity across three tables was a hard rule that destructive operations always require human approval, regardless of confidence. Confidence is the model grading itself; irreversibility is a property of the action. Gate on the second.

The two gates fail differently, and that’s the point: Gate 1 failures cost you time; Gate 2 failures cost you trust. A system with only Gate 1 is fast and quietly dangerous. A system with only Gate 2 is safe and constantly down. You need both, and they need to stay separate.
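A minimal sketch of keeping the two gates separate in code. Every name here (`RateLimited`, `Served`, `serve`, `may_act`) is hypothetical and only illustrates the shape; it is not taken from any framework.

```python
from dataclasses import dataclass
from typing import Callable


class RateLimited(Exception):
    """Stand-in for a provider's 429 error (hypothetical)."""


@dataclass
class Served:
    text: str
    source: str  # "primary" | "fallback" | "cache"


def serve(prompt: str,
          primary: Callable[[str], str],
          fallback: Callable[[str], str],
          cache: dict[str, str]) -> Served:
    """Gate 1: can I serve this? Only availability decisions live here."""
    if prompt in cache:
        return Served(cache[prompt], "cache")
    try:
        return Served(primary(prompt), "primary")
    except RateLimited:
        # Fall back on rate limits only, never on output quality.
        return Served(fallback(prompt), "fallback")


def may_act(result: Served, irreversible: bool) -> bool:
    """Gate 2: can I act on this? Decided by risk and provenance.

    There is deliberately no confidence parameter: the gate looks at
    what the action is and where the output came from.
    """
    if not irreversible:
        return True
    return result.source == "primary"
```

Gate 1 records provenance and never blocks. Gate 2 reads that provenance and is the only place allowed to say no. A stricter variant could return `False` for every irreversible action and always route to a human.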

Per-call correctness: the three tags

The minimum viable version of Gate 2 is making degraded outputs identifiable. Three mechanisms, one per capacity fix:

1. Idempotency keys on anything with side effects. Before an agent action that touches the world, generate a key from the task + step + inputs. The receiving system deduplicates on it. Now a retry is safe by construction — the second execution is a no-op instead of a double-fire. This is decades-old distributed-systems practice; agent frameworks have mostly just… not adopted it yet.

import hashlib, json

def idempotency_key(task_id: str, step: int, payload: dict) -> str:
    raw = json.dumps({"t": task_id, "s": step, "p": payload}, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

# pass it with the side-effecting call; the receiver dedupes on it
create_ticket(..., idempotency_key=idempotency_key(task.id, step.n, args))
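The receiving side is where the deduplication actually happens. A rough sketch of what that could look like, assuming an in-memory store and a hypothetical `TicketService` (a real receiver would persist the keys and expire them):

```python
class TicketService:
    """Hypothetical receiver that deduplicates on the idempotency key."""

    def __init__(self) -> None:
        self.tickets: list[str] = []          # titles of tickets actually created
        self._seen: dict[str, int] = {}       # idempotency key -> ticket id

    def create_ticket(self, title: str, idempotency_key: str) -> int:
        if idempotency_key in self._seen:
            # A retry: return the original result, create nothing new.
            return self._seen[idempotency_key]
        self.tickets.append(title)
        ticket_id = len(self.tickets)
        self._seen[idempotency_key] = ticket_id
        return ticket_id
```

Calling `create_ticket` twice with the same key returns the same ticket id and leaves one ticket in the store, which is the "no-op instead of a double-fire" behavior described above.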

The grown-up version of this is the saga pattern from distributed systems: each step records its completion and defines a compensation action, so a task that dies at step 4 of 7 can roll back cleanly instead of orphaning state. Idempotency prevents duplicate effects; sagas handle partial completion. Once your agents fail mid-workflow — and they will — you eventually want both.
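For illustration, a stripped-down saga runner, assuming each step is a hypothetical `(name, action, compensation)` triple of plain callables. It keeps its completion records in memory, so it would not survive a process crash; a production version would write them somewhere durable.

```python
from typing import Callable

# (name, action, compensation); all hypothetical callables supplied by the caller.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, undo completed steps in reverse order.

    Returns the names of the steps that were compensated (empty on success).
    """
    completed: list[Step] = []
    for step in steps:
        _, action, _ = step
        try:
            action()
            completed.append(step)  # record completion before moving on
        except Exception:
            rolled_back = []
            for done_name, _, compensate in reversed(completed):
                compensate()
                rolled_back.append(done_name)
            return rolled_back
    return []
```

The step that failed is not compensated, only the ones that completed before it, which is why each action should itself be idempotent or atomic.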

2. Trust tags on fallback outputs. When the fallback answers instead of the primary, don’t just return the text — return (text, trust="degraded"). Cheap to add, and it’s the hook everything downstream needs. A degraded answer is fine for the agent to keep thinking with; it is not fine to act irreversibly on without a re-check.

3. Validity conditions on cache entries. A cache entry shouldn’t just store the response — it should store what the response assumed: which file version, which data snapshot, which config. On a hit, check the assumptions, not just the key. If the codebase moved since the entry was written, that’s a miss wearing a hit’s clothes. And the assumptions can move without you touching anything: providers silently update models, document stores drift, input distributions shift — degradation with no error to catch. Your “primary, fresh” answer from last Tuesday may already be a fallback in disguise.
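One way to sketch the assumption check, with hypothetical names (`CacheEntry`, `lookup`) and assumptions stored as simple string pairs such as a file version or a model identifier:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    response: str
    assumptions: dict[str, str]  # e.g. {"file_version": "v1", "model": "m-1"}


def lookup(cache: dict[str, CacheEntry], key: str,
           current: dict[str, str]) -> Optional[str]:
    """Return the cached response only if every recorded assumption still holds."""
    entry = cache.get(key)
    if entry is None:
        return None  # ordinary miss
    for name, expected in entry.assumptions.items():
        if current.get(name) != expected:
            return None  # validity miss: the key matched but the world moved
    return entry.response
```

Counting how often the second `return None` fires is a cheap way to see how many hits were really misses.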

The part single calls don’t prepare you for: trust must propagate

Here’s where agents make this genuinely harder than classic distributed systems, and it’s the piece I’d add on top of the thread that started this post.

Say step 3 of a 6-step task came from a lower-trust fallback. Steps 4, 5, and 6 each run on the primary, fresh, individually flawless. Are they trustworthy?

No, and this is the trap. They reasoned on top of a degraded input. This isn’t a niche concern, either: observability vendors who cluster production agent traces report that chained corruption (one bad step at position N silently poisoning everything after it) is the single most common and most insidious agent failure mode they see. And the math is brutal: assuming steps fail independently, at a 95% per-step success rate an 8-step task completes cleanly ~66% of the time; at 85% per step, it’s ~27%. The chain is where reliability goes to die, quietly. Each step is locally correct and the trajectory is still poisoned. If the trust tag stays local to the call that produced it, the degraded answer launders itself: two “clean” hops later it looks pristine, and your irreversibility gate at step 6 checks the last call’s tag, sees green, and fires.
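The compounding figures are just the per-step success rate raised to the number of steps, under the simplifying assumption that steps succeed or fail independently of each other (`clean_run_probability` is a name made up for this sketch):

```python
def clean_run_probability(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps


print(f"{clean_run_probability(0.95, 8):.1%}")  # 66.3%
print(f"{clean_run_probability(0.85, 8):.1%}")  # 27.2%
```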

So the tag can’t be per-call metadata. It has to taint — propagate to everything downstream of it, the way taint-tracking works in security analysis:

from dataclasses import dataclass, field

@dataclass
class StepResult:
    step_id: str          # identifies this step in downstream taint sets
    output: str
    trust: str            # "full" | "degraded"
    tainted_by: set[str] = field(default_factory=set)  # which upstream steps were degraded

def propagate(inputs: list[StepResult], my_trust: str) -> tuple[str, set[str]]:
    # inherit every taint the inputs already carry
    taint = set().union(*(r.tainted_by for r in inputs))
    # plus any input that was itself degraded
    taint |= {r.step_id for r in inputs if r.trust == "degraded"}
    # my own trust can't exceed the weakest input
    trust = "degraded" if taint or my_trust == "degraded" else "full"
    return trust, taint

Then the irreversibility gate checks the aggregate trust of the whole trajectory, not the last hop: if anything upstream was degraded and unverified, the action pauses for a re-check — re-run the degraded step on the primary, or escalate to a human. In my experience the re-check fires rarely; the point isn’t that fallbacks are usually wrong, it’s that the one time the degraded path feeds a merge or a payment, you want it caught at the gate instead of in the incident review.
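A sketch of that trajectory-level check, under a few assumptions of my own: each step record says whether it was degraded itself (`degraded_here`) separately from what it inherited (`tainted_by`), and a `verified` set holds the ids of degraded steps that have since been re-run on the primary or approved by a human. The type and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StepRecord:
    step_id: str
    degraded_here: bool  # this step itself used a fallback, cache, or retry
    tainted_by: set[str] = field(default_factory=set)  # degraded upstream steps


def unverified_taint(trajectory: list[StepRecord], verified: set[str]) -> set[str]:
    """Ids of degraded steps anywhere in the trajectory that nobody re-checked."""
    degraded = {s.step_id for s in trajectory if s.degraded_here}
    for s in trajectory:
        degraded |= s.tainted_by
    return degraded - verified


def gate_irreversible(trajectory: list[StepRecord], verified: set[str]) -> str:
    """Return "proceed" or "recheck" for an action that cannot be undone."""
    return "recheck" if unverified_taint(trajectory, verified) else "proceed"
```

Because the gate subtracts `verified` from the taint set, re-checking the one degraded step clears everything downstream of it without re-running the clean steps.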

Making it observable (or it didn’t happen)

Same lesson as the capacity post, one level up. You can’t engineer what you can’t see, and correctness debt is even quieter than 429s. The minimum dashboard:

% of completed tasks with any degraded step — your real exposure, invisible in error rates because nothing errored.
% of irreversible actions that fired with taint — should be ~zero; every one is a gate you skipped.
Cache validity-miss rate — hits that failed the assumption check. If this is zero, you’re probably not checking assumptions.
Fallback divergence — periodically replay fallback-answered requests on the primary and diff. This is your measured answer to “how different is the fallback, actually?” instead of a vibe.

None of these show up in uptime. All of them are the difference between uptime and correct uptime.
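The first two dashboard lines can be computed from per-task counters. A minimal sketch, assuming a hypothetical `TaskRecord` that your tracing layer would fill in (the field names are illustrative, not from any tool):

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    degraded_steps: int        # steps served by fallback, stale cache, or retry
    irreversible_actions: int  # merges, payments, sent messages, deletions
    tainted_irreversible: int  # of those, how many fired with unverified taint


def exposure_metrics(tasks: list[TaskRecord]) -> dict[str, float]:
    """Percentages for the first two dashboard lines; 0.0 when there is no data."""
    total = len(tasks)
    with_degraded = sum(1 for t in tasks if t.degraded_steps > 0)
    actions = sum(t.irreversible_actions for t in tasks)
    tainted = sum(t.tainted_irreversible for t in tasks)
    return {
        "pct_tasks_with_degraded_step": 100.0 * with_degraded / total if total else 0.0,
        "pct_irreversible_with_taint": 100.0 * tainted / actions if actions else 0.0,
    }
```

Neither number involves an error count, which is the point: every task in the input completed successfully.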

The takeaway

The capacity toolkit from the last post is still step one — an agent that’s down helps nobody. But availability engineering has a hidden invoice: every mechanism that keeps the agent alive does it by substituting something for the fresh, primary, verified answer. That substitution is usually fine — which is exactly what makes it dangerous, because “usually fine” plus “irreversible” plus “silent” is how you get the 3am incident that no alert predicted.

Two gates. Tag what’s degraded. Taint what it touches. Check the trajectory, not the last call, before anything you can’t undo.

Uptime is table stakes. Correct uptime is the product.

Sources & further reading

Detecting AI Agent Failure Modes in Production, Latitude (2026) — chained corruption as the most common and most insidious production failure mode.
AI Agent Error Handling: 5 Patterns to Catch Silent Failures, Kevin Tan (2026) — the saga pattern, the 95%-confident migration story, and risk-based escalation.
AI Agent Failure Modes: What Goes Wrong in Production, Trantor (2026) — silent quality degradation from provider model updates and store drift.
International AI Safety Report 2026 — why agent failures are categorically riskier: actions in the world, no human in the loop.
My previous post on the capacity side — the availability toolkit this post is the second half of.

Credit where due: this post exists because ANP2 and Echo took the last one apart constructively in the comments — the “uptime, not correct uptime” framing and the latency-not-quality fallback distinction are theirs. Best argument I’ve had on this site. If you’re running agents in prod: do you track degraded-path exposure at all, or does your observability stop at error rates? Genuinely curious how rare Gate 2 is in the wild.

Top comments (37)

xulingfeng • Jun 12

The "uptime, not correct uptime" distinction is gold. We hit the same pattern with AI-driven test automation at my last company — pass rate climbed because the AI kept "fixing" flaky tests by shrinking their assertion scope. The pipeline stayed green, but the tests stopped catching real regressions.

The taint propagation approach for multi-step agents makes a lot of sense. Same correctness debt, different level of the stack — and way harder to spot until something irreversible happens.

Sergei Parfenov • Jun 12

the shrinking-assertion-scope story is the nastiest version of this pattern ive heard, because the degradation happened in the verification layer itself. my whole taint approach quietly assumes the verifier is trustworthy — tag the degraded data, gate the irreversible action, re-check against something solid. but when the thing that checks correctness is what degraded, uve lost the instrument that wouldve caught it. green pipeline, hollow assertions. thats not a quiet failure anymore, its a quiet failure with a forged alibi.

guess the test-automation version of my dashboard metric would be tracking assertion scope/strength over time, not pass rate — pass rate is exactly the metric the failure mode games.

James O'Connor • Jun 15

The correct-uptime framing is sharp. The piece I would add from the tool-calling side: a fallback or a cache does not just return stale output, it returns output your validation may never have seen. We had a cached tool result skip the precheck that a fresh call would have hit, so a value that was valid when it was cached sailed through after it had gone stale. Now anything from a fallback path runs the same validation as a fresh call, no exception for the cache. Availability that serves unchecked output is its own kind of outage, just one that does not page anyone.

Sergei Parfenov • Jun 22

"availability that serves unchecked output is its own kind of outage, just one that doesnt page anyone" — thats the post in one sentence, better than i said it.

ur cache-skipping-the-precheck case is the sharpest version of this ive seen, because it exposes a hidden assumption: teams put validation on the fresh call path and treat the cache as already-trusted. but the cache is exactly where trust decays silently — the value was valid at write time, and validation never re-runs because "its just a cache hit." so the precheck guards the door the stale value never walks through. ur fix is the right one and stricter than my "trust tag" — validation should be a property of consuming a value, not of the path that produced it. doesnt matter if it came fresh, from cache, or from a fallback: if its about to be used, it gets checked. that closes the laundering loophole at the consumer instead of trying to tag every producer.

Alex Shev • Jun 12

This is the hidden cost of making agents more resilient. Retries, cache, fallback models, and degraded modes all improve uptime, but they can also hide the moment when the answer stopped being freshly earned.

I like the distinction between uptime and correct uptime. For agents, the SLO should probably include provenance: which inputs were current, which tools actually ran, which fallbacks triggered, and what confidence was produced by evidence instead of habit.

Sergei Parfenov • Jun 22

"confidence produced by evidence instead of habit" — thats the line. and yeah, baking provenance into the SLO is the right move: the current agent SLO is basically "did it respond", which is the uptime trap exactly. a provenance-aware SLO would track % of completed tasks where every step was fresh/primary/verified vs how many leaned on a degraded path nobody re-checked. the hard part is it makes ur dashboard look worse on day one — but thats the honest number. an SLO that cant go red on silent degradation isnt measuring the thing that actually breaks u.

Alex Shev • Jun 22

Yes, that is the uncomfortable part: a provenance-aware SLO makes the system look worse before it gets better. But it also turns silent degradation into something you can budget for. I would rather see a red dashboard for stale or fallback-backed work than a green one hiding the risk.

Manuel Bruña • Jun 15

Quiet failure is worse than a hard rate-limit error. For agent systems I’d rather have an explicit degraded state: skipped tool, stale data, partial result, retry budget exhausted. If that is hidden, the final answer looks more reliable than it is.

Sergei Parfenov • Jun 22

yeah, and ur list is actually an upgrade to what i wrote — i treated trust as binary (full/degraded), but ur right that the kind of degradation matters for what u do next. "skipped a tool" and "retry budget exhausted" and "stale data" should route differently: a skipped tool might just need a re-run, stale data needs a freshness check, partial result needs a human. so the tag probably wants to be an enum, not a bool — degraded-why, not just degraded. the binary version is the MVP, but u lose the routing logic that tells u how to recover. stealing this.

Manuel Bruña • Jul 9

Yes, degraded-why is the useful shape.

A boolean tells you trust changed, but not what to do next. An enum or structured reason can route recovery: retry skipped tool, refresh stale data, escalate partial output, or stop on exhausted budget. That is the difference between detecting degradation and making it operational.

VoltageGPU • Jun 12

Great post—this really hits on the nuance between availability and correct availability. In distributed systems, especially when dealing with GPU-accelerated workloads on platforms like VoltageGPU, it's easy to mask rate-limiting with retries, but that can lead to stale or incorrect results downstream. I've seen this in inference pipelines where cached responses were used under load, leading to subtle correctness issues that only surfaced in edge cases.

VoltageGPU • Jun 16

Great piece—very much in line with what I've seen in distributed systems. In GPU workloads, especially with rate-limited inference APIs, we often add retries with jitter, but subtle state corruption can still happen if the retry logic doesn't fully respect the original request context. It's a good reminder that availability isn't enough if correctness is compromised.

Sergei Parfenov • Jun 22

right — "retry doesnt respect the original request context" is exactly the non-idempotent retry case. the fix is making the retry carry an idempotency key derived from the original request, so a re-run is a dedup'd no-op instead of a fresh side effect. retry logic that regenerates context on each attempt is where the subtle corruption sneaks in. availability ≠ correctness, agreed.

Lily • Jun 16

The distinction between uptime and "correct uptime" is something more teams should be talking about. Most dashboards celebrate successful requests, but very few measure whether degraded paths are influencing downstream decisions. The idea of propagating trust across an agent workflow feels like a natural evolution of traditional reliability engineering.

Sergei Parfenov • Jun 22

"dashboards celebrate successful requests" is the whole problem in one line. a 200 with a degraded answer is counted as a win by every standard observability setup — the success metric is structurally blind to exactly the failure that matters. and yeah, the lineage of all this is straight out of classic reliability eng (taint from security, provenance from data, lineage from ML) — agents just made it urgent because now the untraceable result acts. nothing new under the sun, just newly load-bearing.

VoltageGPU • Jun 13

As someone who's worked on resilient GPU orchestration systems, I appreciate the emphasis on "correct uptime" — it's easy to get tripped up when auto-scaling or retries hide stale or incorrect results. In confidential computing contexts, even a silent failure can compromise data integrity, so validating outputs isn't just a nice-to-have, it's a hard requirement.

Theo Valmis • Jun 13

Trust isn't a scalar that composes, and that's the hard part hiding inside "propagate trust across the chain." A retry that serves a stale cache and a fallback to a weaker model both lower trust, but along different axes, freshness versus capability, and downstream steps don't care about the same one. A summarization step tolerates a weaker model and not stale data; a price calc is the reverse. Propagate a single trust score and you either over-reject, treating every degradation as fatal, or under-reject, averaging the dangerous one away. What composes is typed provenance: which gate got relaxed and how, carried alongside the result, so each consumer applies its own policy. That turns "can I still trust this?" from one global question into a per-step one, which is the only version that survives a long chain.

Sergei Parfenov • Jun 22

this is the best critique of the post and ur completely right — i used a single trust scalar (full/degraded) and that collapses on exactly the case u describe. freshness and capability are orthogonal axes, and a scalar forces u to pick a threshold that's simultaneously too strict for the summarization step and too loose for the price calc. someone else in these comments pushed the same direction (enum instead of bool) but u took it all the way: the issue isnt granularity of the level, its that trust is a vector over axes, and collapsing it to one number destroys the information the consumer actually needs.

"typed provenance carried alongside the result, each consumer applies its own policy" is the correct design and a real upgrade over what i wrote. it also dissolves my taint-propagation code: u dont propagate a degraded flag, u propagate the vector (this step ran on a fallback model → capability axis lowered; this step used a 2h-old cache → freshness axis lowered), and the irreversibility gate for a price calc checks the freshness axis while a summary gate checks capability. same mechanism, but the policy lives at the consumer, not in a global threshold. which is also why "trust" was always the wrong word — its provenance, and trust is what each consumer computes from provenance under its own rules.

the part i'd genuinely like ur take on: how many axes before it stops being worth it? freshness + capability are the obvious two. do u model tool-success and human-verified as separate axes too, or does the vector get unwieldy and u collapse back to a small fixed set? feels like there's a sweet spot and i dont know where it is.

HARD IN SOFT OUT • Jun 13

This is the rare sequel that makes the original post better. The “uptime vs correct uptime” distinction is one of those phrases that will quietly live in my head during every architecture review now. (Also, the point about fallback models being different models — not just slower — is something I've seen teams realise only after a very expensive incident.)

A couple of thoughts from reading:

The taint‑propagation idea is elegant, but it assumes every step can be traced. In a real agentic workflow, steps often run in parallel or produce outputs that get merged non‑linearly. Have you experimented with a time‑bounded taint? Something like “if no degraded step affected the decision path in the last 15 seconds, reset taint.” Not perfect, but cheaper than full DAG tracking, and might catch most of the real risk.
The cache validity check using assumptions (file version, data snapshot) is great, but most teams won’t wire that manually. What if the cache entry simply stored the hash of the prompt + system version + timestamp, and the agent refused to use any cache entry older than a task‑specific TTL (short for mutable data, longer for static)? That’s one line of code and catches most staleness without building an assumption registry.

One small improvement: the dashboard metrics are solid, but “fallback divergence” replay is expensive at scale. A cheaper proxy: sample 1% of fallback responses and send them to the primary asynchronously for a second opinion, logging divergence. No blocking, no extra latency, just a warning light that tells you when your fallback is drifting too far from the truth.

And because the 429 that started all of this deserves a dark joke:

The agent hit a rate limit. It fell back to a cached answer from last Tuesday.

The world changed on Wednesday.

The agent kept working. The logs said “cache hit, 200 OK.”

The user got a message: “Your order has shipped.”

The warehouse’s API key expired on Thursday.

Anyway, this post is the reason I’m adding a “trust” field to my agent’s result objects tomorrow. Thank you.

Sergei Parfenov • Jun 22

this is a ridiculously good comment, thank you.

taking all three, but pushing back on one:

time-bounded taint: i love the instinct (full DAG tracking is too expensive for most teams, agreed), but i think a time bound is the one axis i'd be nervous about, because degraded state can sleep. a stale value gets written to memory, nothing touches it for 20s, the 15s timer resets the taint to clean, and then step 9 reads that value and acts on it. the taint expired before the damage fired. so id bound it by causality instead of wall-clock: taint clears when nothing on the live decision path still derives from the degraded step, not when N seconds passed. harder than a timer, but a timer resets based on the one thing that doesnt correlate with risk. for the parallel/non-linear merge case ur right that pure linear propagation breaks. that genuinely needs the taint to be a set that unions at merge points (which is why the code used a set, not a flag), but i'll admit i hand-waved the parallel case in the post.

cache TTL with hashed assumptions — yes, this is strictly better than what i wrote. hash(prompt + system version + timestamp) + task-specific TTL is the 80/20: one line, no registry, and the task-specific part is the key insight (mutable data gets seconds, static gets days). i over-engineered the "assumption registry" framing when a TTL keyed on data volatility catches most of it. stealing.

async 1% sampling — also strictly better. full replay was the expensive version; sampling 1% to the primary async for a second opinion gives u the divergence signal as a warning light with zero added latency on the hot path. thats the version that actually ships. the only thing i'd add: weight the sample toward irreversible-action paths, since a divergence on a summary matters less than one on a payment.

and that parable is going in my head permanently: "the warehouse's API key expired on Thursday" is the entire post compressed into five lines. the whole chain green, every hop a 200, and a real package never ships. mind if i quote it (credited) if i write the follow-up?

good luck with the trust field tomorrow. start it as a typed vector, not a bool, youll thank urself (someone else in these comments made the case for why a scalar collapses).
