DEV Community

Milo Antaeus
Milo Antaeus

Posted on

Why did $4,200 vanish? Hidden successful retries.

Why did $4,200 vanish? Hidden successful retries.

The failure was not an outage. The agent looked healthy: tasks completed, traces were green, and the weekly dashboard showed a 96.8% success rate. The leak lived in the successful path. One tool node retried deterministic validation failures until the fifth attempt, then usually recovered after a larger model rewrote the arguments. Every incident ended as status=ok, so the normal failure alerts stayed quiet while the bill kept climbing.

Here is the shape I now grep for first in agent_trace_spans.jsonl, retry_policy.py, and cost_guard.yaml when an agent bill jumps without a matching traffic increase across OpenAI, Anthropic, or OpenRouter traffic.

node=tool_argument_repair
attempts=5
final_status=ok
first_error_class=json_schema_validation
last_model=frontier-reasoner
mean_input_tokens=2840
mean_output_tokens=312
chains_30d=1287
estimated_extra_cost_30d=$4,200
Enter fullscreen mode Exit fullscreen mode

The trap is that most dashboards group by final status. If a chain succeeds on attempt five, it gets counted as success. The cost system does not care. It charged for attempts one, two, three, four, and five.

The detection query

Export traces for the last 7 to 30 days and group by the original error class, not the final outcome.

select
  node_name,
  first_error_class,
  count(*) as chains,
  avg(retry_count) as avg_retries,
  sum(input_tokens + output_tokens) as tokens_burned,
  sum(cost_usd) as cost_usd,
  avg(case when final_status = 'ok' then 1 else 0 end) as final_success_rate
from agent_trace_spans
where retry_count > 0
group by 1, 2
sort by cost_usd desc
limit 20;
Enter fullscreen mode Exit fullscreen mode

Then split errors into two buckets:

Error class Retry policy Why
provider_429 retry with backoff Usually transient.
provider_5xx retry with jitter Usually transient.
network_timeout retry with cap Sometimes transient.
json_schema_validation fail fast or repair locally Same prompt often repeats the same wrong shape.
unknown_tool_name fail fast The requested tool does not exist.
policy_block fail fast Repeating the same call rarely changes the answer.
enum_mismatch local repair first Cheap deterministic fix.

The expensive class is the one with a high final success rate and a high retry count. That means your agent eventually works, but only after paying several times for the same semantic mistake.

The fix that usually wins

Do not make the large model repair every malformed argument from scratch. Add a local repair stage before the second model call.

def classify_error(error):
    if error.kind in {"rate_limit", "provider_5xx", "network_timeout"}:
        return "transient"
    if error.kind in {"json_schema_validation", "enum_mismatch", "unknown_tool_name"}:
        return "deterministic"
    return "unknown"


def next_step(error, payload):
    kind = classify_error(error)
    if kind == "transient":
        return retry_with_backoff(payload, max_attempts=3)
    if kind == "deterministic":
        repaired = cheap_schema_repair(payload, error.schema)
        if repaired.valid:
            return run_tool(repaired.payload)
        return fail_fast(error)
    return retry_once_then_escalate(payload)
Enter fullscreen mode Exit fullscreen mode

For schema failures, a small local model or even a deterministic coercion function often beats another frontier call. The key is to measure repair quality against a 50 to 100 example eval. If local repair holds within tolerance, make the expensive model the exception path, not the default path.

The alert I wish more teams had

Add one metric beside success rate:

cost_per_successful_chain = total_chain_cost / successful_chains
Enter fullscreen mode Exit fullscreen mode

Then alert on week-over-week movement by node:

if cost_per_successful_chain(node) > 1.35 * trailing_4_week_median(node):
    page("success got more expensive")
Enter fullscreen mode Exit fullscreen mode

That alert catches the silent failure mode: the product still works, users still get answers, but each answer quietly costs 2x to 10x more than last week.

Why this pattern is easy to miss

Engineers investigate red traces. Finance notices invoices. The leak sits between them. It is operationally green and financially red.

When you audit agent costs, start with the successful retries. Failed calls are obvious. Successful retries are where the expensive bugs hide.

Top comments (0)