The most useful comment I've seen on a dev blog in a while came from someone running a fleet of AI agents for site auditing and content publishing:
"When an agent hits a rate limit at 2am, it needs to know why and how long to wait, not just get a generic 429."
That's the whole game. Not benchmark scores. Not context window sizes. Not which model sounds smartest in a demo.
When you're building agents that run unattended — loops, chains, scheduled jobs, overnight batch tasks — the question isn't capability. It's behavior under stress.
Here's what our AN Score data reveals about how Anthropic, OpenAI, and Google AI actually perform when they're inside agent loops.
The 5 Dimensions That Matter in Agent Loops
Standard LLM benchmarks measure what models know. Agent loop reliability measures something different:
- Tool calling fidelity — Does the model call the right tool with correct parameters? What happens when tool invocation fails?
- Rate limit behavior — Are retry-after values actionable? Does the model signal structured recovery data or just throw?
- Context handling over long chains — Does behavior degrade at depth 10, depth 20? Does the model start confusing earlier steps?
- Recovery under bad inputs — When an agent sends malformed data, does the API return machine-readable errors that allow self-correction?
- Backoff compliance — Does the API actually enforce what its docs say about retry windows?
These map directly to the AN Score execution dimension, which accounts for 70% of a service's overall score. The reason execution dominates: an agent that can't recover is useless, regardless of capability ceiling.
Anthropic: 8.4/10 — Why It Leads in Agent Loops
Anthropic's score comes primarily from execution reliability, not raw capability. Here's what that means concretely:
Structured errors that agents can act on. When you hit a rate limit, Anthropic returns {"type": "rate_limit_error", "message": "...", "retry_after": 30}. An agent can parse that, wait, and retry without human intervention. This is the detail apex_stack highlighted — and it's the reason their overnight runs succeed where others fail.
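A minimal sketch of what "an agent can parse that, wait, and retry" looks like in practice, using the error shape quoted above. The field names (`type`, `retry_after`) come from the example in this post; check the provider's current error docs before relying on them, and the helper name is ours:

```python
import json
import time

def handle_rate_limit(response_body: str) -> bool:
    """If the body is a structured rate-limit error, wait the advised
    window and report that a retry is safe. Returns False for anything
    the agent can't act on mechanically."""
    try:
        err = json.loads(response_body)
    except json.JSONDecodeError:
        return False  # unstructured error: fall back to generic backoff
    if err.get("type") != "rate_limit_error":
        return False
    # retry_after is in seconds; default conservatively if it's absent
    wait = float(err.get("retry_after", 30))
    time.sleep(wait)
    return True
```

The point is the branch structure: a machine-readable `type` plus a numeric `retry_after` lets the agent decide "wait and retry" versus "escalate" without a human reading the message string.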
Consistent tool use schema. Claude's tool calling behavior is predictable across chain depths. Parameters don't shift between invocations. Rejection responses are structured. When a tool call fails, the error tells the agent why — not just that it failed.
Long context that actually works. 200K context window with consistent behavior across chain length. We haven't observed the kind of mid-chain drift that appears in longer OpenAI runs.
The defensive code tax: ~12%. On Anthropic, you're writing roughly 12% of your codebase as defensive code — backoff handlers, retry logic, error routing. That number matters because it's the floor. Every other provider adds to it.
OpenAI: 6.3/10 — Capable but Unpredictable in Loops
OpenAI scores lower not because it's less capable, but because it's less predictable for autonomous operation.
The unpredictability problem is real and documented. As one developer put it: "OpenAI → super flexible, great ecosystem, but can get unpredictable in longer chains." Our scoring confirms this: execution reliability drops in multi-step, multi-tool scenarios compared to single-prompt calls.
Rate limit signaling is better than Google's but less consistent than Anthropic's. OpenAI does surface retry-after in most cases, but the window reset timing is inconsistent. Agents that implement fixed-delay backoff (based on OpenAI's docs) often back off for longer than necessary or not long enough. The fix — exponential backoff with jitter — works but adds complexity.
Tool use: flexible, which means less predictable. OpenAI's tool calling is powerful and has broad model support. But "flexible" is a double-edged word. The parameter schema for complex tools occasionally drifts across invocations in ways that require defensive normalization.
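What "defensive normalization" means concretely: a small shim that coerces drifting tool-call arguments back into one canonical schema before dispatch. This is a sketch under our own conventions; the schema shape, alias lists, and function name are illustrative, not any provider's actual fields:

```python
def normalize_tool_args(args: dict, schema: dict) -> dict:
    """Map drifting argument names and types onto a fixed schema.

    `schema` maps each canonical parameter name to (aliases, type):
    if the model emits an alias, or a string where an int belongs,
    the dispatcher still sees one stable shape."""
    out = {}
    for name, (aliases, typ) in schema.items():
        for key in (name, *aliases):
            if key in args:
                out[name] = typ(args[key])  # coerce e.g. "5" -> 5
                break
    return out

# Hypothetical tool: read_file(path, limit)
SCHEMA = {
    "path": (("file", "filename"), str),
    "limit": (("max",), int),
}
```

An invocation like `normalize_tool_args({"filename": "a.txt", "max": "5"}, SCHEMA)` yields `{"path": "a.txt", "limit": 5}` regardless of which alias the model chose on this particular call. That shim is part of the defensive code tax this post keeps pricing.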
Confidence: 0.98. This is a significant number. It means we have high confidence in the 6.3 score — it's not a borderline case where additional data might push it higher. The execution shortfalls are consistent.
Google AI: 8.3 Execution, 7.9 Overall — The Three-Surface Problem
Google AI's execution score is remarkably close to Anthropic's (8.3 vs 8.4). The gap in overall score comes from access readiness — specifically what we call the three-surface problem.
Three surfaces, one agent. Google AI has three distinct API surfaces: Google AI Studio, Vertex AI, and the Gemini API. They have different authentication paths, different rate limits, different availability, and overlapping-but-not-identical model access. An agent that self-configures its credentials needs to make a structural choice about which surface it's targeting — and that choice has downstream consequences it can't easily reverse.
For agents built by humans who understand the tradeoffs, this is manageable. For agents that self-provision credentials or operate in multi-tenant environments, it introduces ambiguity that costs defensive code — and sometimes costs failed runs.
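To make the "structural choice" concrete, here is a sketch of the decision a self-provisioning agent faces. Everything below is illustrative: the auth-mode labels and notes are our assumptions about the tradeoffs, not official documentation, and the selection rule is deliberately simplified:

```python
from dataclasses import dataclass
from enum import Enum

class Surface(Enum):
    AI_STUDIO = "ai_studio"
    VERTEX_AI = "vertex_ai"
    GEMINI_API = "gemini_api"

@dataclass(frozen=True)
class SurfaceConfig:
    surface: Surface
    auth_mode: str  # assumed: API key vs. service-account credentials
    note: str

# Illustrative tradeoffs only — verify against current Google docs.
SURFACES = {
    Surface.AI_STUDIO: SurfaceConfig(
        Surface.AI_STUDIO, "api_key", "fast start; its own quota pool"),
    Surface.VERTEX_AI: SurfaceConfig(
        Surface.VERTEX_AI, "service_account", "GCP project binding; enterprise quotas"),
    Surface.GEMINI_API: SurfaceConfig(
        Surface.GEMINI_API, "api_key", "overlapping but not identical model access"),
}

def pick_surface(self_provisioned: bool) -> SurfaceConfig:
    # A self-provisioning agent can usually mint an API key but rarely a
    # service account, so the surface choice is forced before the first
    # model call — and reversing it later means re-plumbing auth.
    return SURFACES[Surface.AI_STUDIO if self_provisioned else Surface.VERTEX_AI]
```

The downstream consequence the post describes is visible in the structure: rate limits, quotas, and model availability all hang off whichever `SurfaceConfig` the agent committed to.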
Context window is real but not the same as context reliability. A 1M+ token context window is genuinely useful for long document processing. But context window size ≠ context handling reliability across chain depth. We've observed more variance in multi-step chains than the capability ceiling suggests.
Still worth it for specific use cases. For agents doing large-scale document analysis, content synthesis, or long-horizon reasoning, the context window advantage can outweigh the access complexity. The data supports using Google AI, with awareness of what you're trading.
What "Adaptive Backoff with Jitter" Actually Does
The practical point that emerged in comments: fixed delays don't work. Exponential backoff with jitter is the standard pattern, but let's be specific about why:
Fixed delay: sleep(30) — either too long (wastes throughput) or too short (triggers another rate limit immediately). And if multiple agents are doing this simultaneously, they all retry at the same time.
Exponential backoff with jitter: sleep(base * 2^attempt + random(0, 1)) — the wait grows exponentially with each failed attempt, and the random jitter spreads retry timing across concurrent agents so they don't retry in lockstep. This is why agents in production fleets succeed overnight.
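The formula above translates into a few lines of retry logic. A sketch, assuming a generic callable that raises on a rate limit (the exception type and function names here are stand-ins, not any SDK's API):

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """base * 2^attempt, capped, plus up to 1s of jitter — the formula
    above, with a ceiling so late attempts don't wait forever."""
    return min(cap, base * 2 ** attempt) + random.uniform(0, 1)

def call_with_backoff(fn, max_attempts: int = 5, base: float = 1.0):
    """Retry `fn` with jittered exponential backoff.

    RuntimeError stands in for whatever rate-limit exception the
    client library actually raises."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(backoff_delay(attempt, base=base))
```

The cap matters in fleets: without it, an agent ten attempts deep is sleeping for seventeen minutes, which looks indistinguishable from a hang to whatever is supervising the run.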
The APIs that make this pattern easiest are the ones that return actionable retry-after values (Anthropic consistently, OpenAI most of the time). The ones that don't leave you implementing your own heuristics about when retry windows reset.
The AN Score Executive Summary
| Provider | AN Score | Execution | Access Readiness | Confidence |
|---|---|---|---|---|
| Anthropic | 8.4 | 8.6 | 8.0 | 0.91 |
| Google AI | 7.9 | 8.3 | 7.1 | 0.83 |
| OpenAI | 6.3 | 6.8 | 5.5 | 0.98 |
The execution/access split explains the pattern:
- Anthropic wins execution and access — lowest total defensive code burden
- Google AI wins execution but loses on access complexity — worth it for specific use cases
- OpenAI's gap is consistent across both dimensions — capable but requires more engineering to make reliable
The Real Test
The actual test for LLM APIs in agent loops isn't "which one is best at writing code" or "which one passes MMLU."
It's this: put it in a loop that runs overnight, has tools, has rate limits it will eventually hit, and operates without anyone watching. Wake up in the morning and check how many tasks failed and why.
The providers that design for machine consumption — structured errors, actionable rate limit headers, consistent tool schema — build that into their APIs from first principles. The ones that don't end up retrofitting it through post-hoc patches.
You can see the full scoring methodology and all 645+ scored services at rhumb.dev.
AN Score is Rhumb's 20-dimension evaluation framework for API agent-nativeness. Execution (reliability, error handling, rate limit behavior) accounts for 70% of the score. Access Readiness (auth, pricing model, self-provisioning) accounts for 30%.
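Assuming the overall number is a plain 70/30 blend of the two top-level dimensions (the post states those weights, but the real aggregation runs across 20 sub-dimensions and may round differently), the summary table roughly reproduces:

```python
def an_score(execution: float, access: float) -> float:
    """Weighted AN Score under a simplified 70/30 blend of the two
    top-level dimensions. An approximation of the stated weighting,
    not Rhumb's actual 20-dimension aggregation."""
    return round(0.7 * execution + 0.3 * access, 1)

# From the summary table:
an_score(8.6, 8.0)  # Anthropic -> 8.4, matching the table
an_score(8.3, 7.1)  # Google AI -> 7.9, matching the table
# OpenAI comes out 6.4 under this blend vs. 6.3 published, so the
# real score presumably reflects finer-grained sub-dimension weights.
```

Execution dominating the blend is the whole editorial stance of the scoring: a two-point execution gap moves the overall score far more than a two-point access gap does.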
Top comments (1)
"One surprising insight we found is that agent reliability often falters not due to API limitations but because of poor state management. Developers sometimes overlook how critical it is to design agents capable of gracefully handling incomplete data states or failed calls. Incorporating robust state recovery mechanisms like checkpointing or rollback strategies can significantly enhance the resilience of your agents, especially during autonomous runs at scale." — Ali Muwwakkil (ali-muwwakkil on LinkedIn)
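The checkpointing the commenter describes can be sketched in a few lines: persist which tasks completed, and skip them on restart. The file format and task interface below are illustrative assumptions, not a prescribed design:

```python
import json
from pathlib import Path

def run_with_checkpoints(tasks, state_file: Path):
    """Run (task_id, callable) pairs, persisting completions so a
    crashed overnight batch resumes instead of starting over.

    Minimal sketch: real agents would also checkpoint intermediate
    state, not just completion flags."""
    done = set()
    if state_file.exists():
        done = set(json.loads(state_file.read_text()))
    for task_id, task in tasks:
        if task_id in done:
            continue  # finished in a previous run; skip on resume
        task()  # may raise — disk state still reflects completed work
        done.add(task_id)
        state_file.write_text(json.dumps(sorted(done)))
```

Paired with the backoff pattern above, this is most of what separates "three tasks failed at 2am, resumed at 7am" from "the whole batch reran from scratch."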