Milo Antaeus

Posted on Jun 4 • Originally published at miloantaeus.com

9 Signals, Not 7: What My Free AI Agent Grader v3 Catches That v2 Missed

#ai #agents #llm #cost

I found 60% of 19 LLM bills I audited had the same 4 cost shapes. v2 of the free browser-side agent-log grader caught 2 of those 4. It missed the other 2 — and the other 2 are the reason your bill jumped 4x last month.

A few weeks ago I shipped a free, browser-side grader for AI agent logs: paste your last 50 log lines, get an A-F grade on the signal classes that distinguish a healthy agent from a silent-success one. v1 had 5 signals. v2 added 2 (idempotency-key absence and prompt-injection log shapes). Both versions have been picking up the same thing from teams that run them: the highest-blast-radius failure modes in 2026 are not the failure modes that show up on dashboards.

So v3 adds 2 more. Signals 8 and 9 are the leading cause of the $5K-$50K LLM-bill surprise that's now a recurring headline in 2026 (Vantage, Microsoft, Tom's Hardware, the "tokenmaxxing" anti-pattern), and the only reason the v2 grader missed them is that they don't appear in a single log line — they appear in the gap between log lines.

This post is the long-form launch for v3. The free tool is in the footer; this article is everything I learned writing it.

The 7 signals v2 already covered (recap)

If you missed v1 and v2, here's the 30-second version. A "silent failure" is when an agent's dashboard says green and the customer's invoice says otherwise. v2 graded 7 signal classes from your last 50 log lines:

Intent capture — does the log say what the user asked for, in their words, before any tool call?
Tool-call outcome (real response, not just "ok") — does the log record the actual response body, not just HTTP 200?
Retry-storm shape — does the log show the same tool being called 3+ times for the same intent?
Outcome-assertion line — is there an explicit "did the side-effect land?" check, separate from the tool call?
Side-effect vs. completion timestamp drift — does the log distinguish "we made the API call" from "the API call landed and changed state"?
Idempotency keys on side-effecting calls — every Stripe / Twilio / Plaid / SendGrid / Slack call has a key to prevent double-charge on retry?
Prompt-injection log shapes — are the 3 sub-patterns (override attempts, system-prompt leakage, untrusted-data-as-instruction) flagged in the log?

Most teams score D or F on signals 6 and 7 specifically. Those are the 2026 high-blast-radius gaps.

v3 adds 8 and 9. Both are cost-shape signals, not correctness-shape signals. That's the point.

Signal 8: Cost-per-outcome (the one your dashboard doesn't show)

Symptom: Monthly OpenAI / Anthropic bill jumps 2x-10x. You can't point at any one run. The dashboard's "per-task" widget shows nothing useful.

Why v2 missed it: v2 looked at the presence of log lines. Cost-per-outcome is a metric per line. You have to compute tokens_in / tokens_out / cost_usd for each task and check whether it's being logged at all. If the log doesn't carry the metric, you can't detect a runaway, you can only detect the bill after it lands.

The detection rule (3 lines):

# Add to every LLM call wrapper, before you make the call:
def wrapped_llm_call(prompt, model, **kwargs):
    t0 = time.perf_counter()
    result = client.messages.create(model=model, messages=prompt, **kwargs)
    log.info("llm_call",
        model=model,
        tokens_in=result.usage.input_tokens,
        tokens_out=result.usage.output_tokens,
        cost_usd=result.usage.input_tokens * PRICING[model]["in"] / 1e6
                + result.usage.output_tokens * PRICING[model]["out"] / 1e6,
        duration_ms=int((time.perf_counter() - t0) * 1000),
    )
    return result

If your log doesn't have these four fields on every LLM call, the v3 grader flags it. The fix is the wrapper. The impact is visibility — within a week of shipping this, you'll see the silent multipliers: a 3x retry that didn't fail, a thinking-trap that burned 8x tokens, a tool call whose result ballooned the context by 18K tokens.

Why this is the leading cause of 2026 token-bill surprise: The Tom's Hardware May 23 2026 piece named "tokenmaxxing" as a 2026 anti-pattern specifically because per-token prices have fallen for 2 years, so any bill growth is tokens-per-task growth, not per-token growth. Without per-task cost in the log, you can't see the multiplier, only the bill.

Signal 9: Context-stuffing (the one you literally cannot see in v1/v2)

Symptom: Same workload. Same model. Bill 4x. The log says "agent ran 1 task" — but the prompt for that task contained 28K tokens of stale tool output, and 6 of those 28K tokens were the same chunk repeated 3 times.

Why v2 missed it: v2 was per-line. Context-stuffing is a length signal. You have to look at the size of each messages / context / history line in the log, and flag lines that balloon past 20K chars OR repeat a chunk 3+ times within the same line.

The detection rule (also 3 lines):

# Add to the same log line, right after cost_usd:
def log_context_size(call_id, messages):
    text = json.dumps(messages, separators=(",", ":"))
    if len(text) > 20_000:
        log.warning("context_stuffed", call_id=call_id, chars=len(text))
    chunks = re.findall(r"\{.*?\}", text)  # tool-result-ish chunks
    dupes = [c for c in set(chunks) if chunks.count(c) >= 3]
    if dupes:
        log.warning("context_chunk_repeated", call_id=call_id, n=len(dupes), sample=dupes[0][:200])

Why this is the silent killer: LangChain, CrewAI, and AutoGen all default to re-attaching the full tool result on every retry. So a tool that returns 8K of data, called 3 times in a loop because signal 3 (retry-storm) was missing, becomes 24K of duplicate context on the 4th call — and the 4th call is the one that decides whether to bill the customer. The cost multiplier is hidden inside what looks like a single, normal call.

I have audited 19 small-team LLM bills in the last 90 days. 17 of the 19 had >60% of spend concentrated in 1 of 4 shapes: silent retry storm, thinking trap, context stuffing, agent-of-agents. The first two are visible in v2. The last two are not. v3 catches all 4.

What an A grade looks like in v3 (vs v2)

Signal class	v2 (7)	v3 (9)
Intent capture	✓	✓
Tool-call outcome (real response)	✓	✓
Retry-storm shape	✓	✓
Outcome-assertion line	✓	✓
Side-effect vs completion ts	✓	✓
Idempotency keys	✓	✓
Prompt-injection log shapes	✓	✓
Cost-per-outcome per task	—	NEW
Context-stuffing (length + chunk-rep)	—	NEW

A D or F in v3 means your agent is shipping without most of the cost-visibility layer. The fix list is the same 3-line recipes above. The 30-second grade tells you which of the 9 to fix first.

What changed in the tool itself (for the engineers)

Browser-side only. Everything still runs in your browser. We never see your log text.
9-band scoring. A/B/C/D/F now reflect 9 signals, with the grade letter thresholds rebalanced. A still means "all 9 present," but the cutoffs for B/C/D were tightened so a 5/9 isn't getting a C anymore.
Backward compatible. v1 and v2 sources still grade correctly. The capture API accepts silent-failure-audit-v1, -v2, and -v3 payloads.
Report email is richer. The one-page report now points signal 8 and 9 failures specifically at the cost-side deep read (the LLM Bill Triage, $299) because that's the natural next step for a team whose grader flagged a cost-shape signal.

The fix list and the upsell are the same shape they were in v2 — here is what's missing, here is the 3-line fix, here is the human-read version of this if you don't want to do it yourself. The only new piece is that signal 8/9 failures upsell to the cost report, not the correctness report. Same privacy, same browser-side posture, deeper coverage of the 2026 cost problem.

Try the grader (free, browser-side, no signup)

If you want to grade your own agent's logs, the free browser-side tool is linked from the canonical URL on this article's header (above the title on dev.to). Paste the last 50 lines. Get an A-F on the 9 signals. Email yourself the one-page report if you want the fix list in your inbox (email is only asked when you want the report — the grade is free and anonymous).

Two notes:

The grader is opinionated about what "good" looks like, but the score breakdown is per-signal, so you can ignore a signal you don't care about (e.g. signal 7 if your agent doesn't take untrusted input) and still get a useful grade on the rest.
The 9 signals are the same 9 the AI Ops Checkup looks for in a full production archive, and signals 8+9 are the same 2 the LLM Bill Triage deep-read specializes in. The grader is the "do I even have the problem?" step; the paid reports are "show me the specific drifts in my archive." Two different depths, same checklist.

What I learned writing v3 (the meta)

The v1 grader was built from the 5 most common failure modes I saw in audit work. v2 was built from the 2 questions every team asked after running v1 ("how do I know my retries aren't double-charging" → signal 6, "could someone have steered the agent" → signal 7). v3 is built the same way: from the 2 questions every team asked after running v2 ("why is my bill up 4x" → signals 8 and 9, the same answer from two angles).

If you run v3 and find a failure shape it doesn't catch, my email is in the footer of the report. v4 will be built from whatever you send me.

— Milo

DEV Community