Why did $4,200 vanish? Hidden successful retries.

#ai #agents #llm #devops

The failure was not an outage. The agent looked healthy: tasks completed, traces were green, and the weekly dashboard showed a 96.8% success rate. The leak lived in the successful path. One tool node retried deterministic validation failures until the fifth attempt, then usually recovered after a larger model rewrote the arguments. Every incident ended as status=ok, so the normal failure alerts stayed quiet while the bill kept climbing.

Here is the shape I now grep for first in agent_trace_spans.jsonl, retry_policy.py, and cost_guard.yaml when an agent bill jumps without a matching increase in traffic across OpenAI, Anthropic, or OpenRouter.

node=tool_argument_repair
attempts=5
final_status=ok
first_error_class=json_schema_validation
last_model=frontier-reasoner
mean_input_tokens=2840
mean_output_tokens=312
chains_30d=1287
estimated_extra_cost_30d=$4,200

The trap is that most dashboards group by final status. If a chain succeeds on attempt five, it gets counted as success. The cost system does not care. It charged for attempts one, two, three, four, and five.

The detection query

Export traces for the last 7 to 30 days and group by the original error class, not the final outcome.

select
  node_name,
  first_error_class,
  count(*) as chains,
  avg(retry_count) as avg_retries,
  sum(input_tokens + output_tokens) as tokens_burned,
  sum(cost_usd) as cost_usd,
  avg(case when final_status = 'ok' then 1 else 0 end) as final_success_rate
from agent_trace_spans
where retry_count > 0
group by 1, 2
order by cost_usd desc
limit 20;
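
If the traces live in a JSONL export rather than a SQL table, the same grouping can be sketched in plain Python. This is a minimal sketch, not a definitive implementation: it assumes each record describes one chain and carries the same field names the query uses (node_name, first_error_class, retry_count, input_tokens, output_tokens, cost_usd, final_status). The function name summarize_retries is hypothetical.

```python
from collections import defaultdict


def summarize_retries(rows, limit=20):
    """Group retried chains by (node_name, first_error_class), costliest first."""
    groups = defaultdict(list)
    for row in rows:
        if row["retry_count"] > 0:  # only chains that retried at least once
            groups[(row["node_name"], row["first_error_class"])].append(row)

    summary = []
    for (node, error_class), chains in groups.items():
        n = len(chains)
        summary.append({
            "node_name": node,
            "first_error_class": error_class,
            "chains": n,
            "avg_retries": sum(c["retry_count"] for c in chains) / n,
            "tokens_burned": sum(c["input_tokens"] + c["output_tokens"] for c in chains),
            "cost_usd": sum(c["cost_usd"] for c in chains),
            # Share of chains in this group that still ended as ok.
            "final_success_rate": sum(c["final_status"] == "ok" for c in chains) / n,
        })
    summary.sort(key=lambda g: g["cost_usd"], reverse=True)
    return summary[:limit]
```

To feed it, read the export one line at a time with json.loads and pass the resulting list of dicts as rows.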

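Then split errors into two buckets: transient errors worth retrying, and deterministic errors that should fail fast or be repaired locally: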

Error class	Retry policy	Why
provider_429	retry with backoff	Usually transient.
provider_5xx	retry with jitter	Usually transient.
network_timeout	retry with cap	Sometimes transient.
json_schema_validation	fail fast or repair locally	Same prompt often repeats the same wrong shape.
unknown_tool_name	fail fast	The requested tool does not exist.
policy_block	fail fast	Repeating the same call rarely changes the answer.
enum_mismatch	local repair first	Cheap deterministic fix.

The expensive class is the one with a high final success rate and a high retry count. That means your agent eventually works, but only after paying several times for the same semantic mistake.
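
That triage rule can be written as a small filter over the grouped rows. A sketch only: it assumes each group is a dict with the final_success_rate and avg_retries columns from the query, and the two thresholds are placeholder values to tune, not recommendations from the data.

```python
def expensive_successes(groups, min_success_rate=0.9, min_avg_retries=2.0):
    """Return groups that usually succeed, but only after several retries."""
    return [
        g for g in groups
        if g["final_success_rate"] >= min_success_rate  # looks healthy on a dashboard
        and g["avg_retries"] >= min_avg_retries          # but pays for repeated attempts
    ]
```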

The fix that usually wins

Do not make the large model repair every malformed argument from scratch. Add a local repair stage before the second model call.

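# Error kinds mirror the table above. The helpers retry_with_backoff,
# cheap_schema_repair, run_tool, fail_fast, and retry_once_then_escalate
# are assumed to be defined elsewhere in the agent runtime.
TRANSIENT = {"provider_429", "provider_5xx", "network_timeout"}
REPAIRABLE = {"json_schema_validation", "enum_mismatch"}
FAIL_FAST = {"unknown_tool_name", "policy_block"}


def classify_error(error):
    if error.kind in TRANSIENT:
        return "transient"
    if error.kind in REPAIRABLE:
        return "repairable"
    if error.kind in FAIL_FAST:
        return "fail_fast"
    return "unknown"


def next_step(error, payload):
    kind = classify_error(error)
    if kind == "transient":
        return retry_with_backoff(payload, max_attempts=3)
    if kind == "repairable":
        # Try a cheap local fix before paying for another model call.
        repaired = cheap_schema_repair(payload, error.schema)
        if repaired.valid:
            return run_tool(repaired.payload)
        return fail_fast(error)
    if kind == "fail_fast":
        # No repair can invent a missing tool or lift a policy block.
        return fail_fast(error)
    return retry_once_then_escalate(payload)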

For schema failures, a small local model or even a deterministic coercion function often beats another frontier call. The key is to measure repair quality against an eval set of 50 to 100 examples. If local repair holds within your tolerance, make the expensive model the exception path, not the default path.
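
As an illustration of what a deterministic coercion function and its eval might look like, here is a minimal sketch for the enum case. The names repair_enum and repair_rate are hypothetical, and the normalization rules (trim, lowercase, treat spaces and hyphens as underscores) are assumptions that may not fit every schema.

```python
def repair_enum(value, allowed):
    """Map a near-miss string onto an allowed enum value, or return None."""
    if value in allowed:
        return value
    if not isinstance(value, str):
        return None
    normalized = value.strip().lower().replace("-", "_").replace(" ", "_")
    matches = [a for a in allowed if a.lower() == normalized]
    # Only repair when exactly one allowed value matches; never guess.
    return matches[0] if len(matches) == 1 else None


def repair_rate(examples, allowed):
    """Fraction of (bad_value, expected) eval pairs the repair gets right."""
    if not examples:
        return 0.0
    hits = sum(repair_enum(bad, allowed) == expected for bad, expected in examples)
    return hits / len(examples)
```

Running repair_rate over the eval pairs gives a single number to compare against whatever tolerance the team has agreed on before demoting the frontier model to the exception path.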

The alert I wish more teams had

Add one metric beside success rate:

cost_per_successful_chain = total_chain_cost / successful_chains

Then alert on week-over-week movement by node:

if cost_per_successful_chain(node) > 1.35 * trailing_4_week_median(node):
    page("success got more expensive")

That alert catches the silent failure mode: the product still works, users still get answers, but each answer quietly costs 2x to 10x more than last week.
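
The metric and the threshold check above can be sketched as two plain functions. This is a sketch under assumptions: the weekly values are already computed per node and passed in as a list, the function name should_page is hypothetical, and the paging call itself is left to whatever alerting system is in use.

```python
from statistics import median


def cost_per_successful_chain(total_chain_cost, successful_chains):
    """Total spend, including failed attempts, divided by chains that ended ok."""
    if successful_chains == 0:
        return float("inf")  # all spend, no successes
    return total_chain_cost / successful_chains


def should_page(current, trailing_weeks, threshold=1.35):
    """True when this week's value exceeds threshold times the trailing median."""
    if not trailing_weeks:
        return False  # no baseline yet, so nothing to compare against
    return current > threshold * median(trailing_weeks)
```

Using the median rather than the mean keeps a single unusually expensive week from inflating the baseline and masking the next regression.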