Milo Antaeus

Posted on • Originally published at miloantaeus.com

9 Signals, Not 7: What My Free AI Agent Grader v3 Catches That v2 Missed

In 17 of the 19 LLM bills I audited, more than 60% of spend sat in one of the same 4 cost shapes. v2 of the free browser-side agent-log grader caught 2 of those 4. It missed the other 2, and those 2 are the reason your bill jumped 4x last month.

A few weeks ago I shipped a free, browser-side grader for AI agent logs: paste your last 50 log lines, get an A-F grade on the signal classes that distinguish a healthy agent from a silent-success one. v1 had 5 signals. v2 added 2 (idempotency-key absence and prompt-injection log shapes). Both versions have been picking up the same thing from teams that run them: the highest-blast-radius failure modes in 2026 are not the failure modes that show up on dashboards.

So v3 adds 2 more. Signals 8 and 9 are the leading cause of the $5K-$50K LLM-bill surprise that's now a recurring headline in 2026 (Vantage, Microsoft, Tom's Hardware, the "tokenmaxxing" anti-pattern), and the only reason the v2 grader missed them is that they don't appear in a single log line — they appear in the gap between log lines.

This post is the long-form launch for v3. The free tool is in the footer; this article is everything I learned writing it.


The 7 signals v2 already covered (recap)

If you missed v1 and v2, here's the 30-second version. A "silent failure" is when an agent's dashboard says green and the customer's invoice says otherwise. v2 graded 7 signal classes from your last 50 log lines:

  1. Intent capture — does the log say what the user asked for, in their words, before any tool call?
  2. Tool-call outcome (real response, not just "ok") — does the log record the actual response body, not just HTTP 200?
  3. Retry-storm shape — does the log show the same tool being called 3+ times for the same intent?
  4. Outcome-assertion line — is there an explicit "did the side-effect land?" check, separate from the tool call?
  5. Side-effect vs. completion timestamp drift — does the log distinguish "we made the API call" from "the API call landed and changed state"?
  6. Idempotency keys on side-effecting calls — every Stripe / Twilio / Plaid / SendGrid / Slack call has a key to prevent double-charge on retry?
  7. Prompt-injection log shapes — are the 3 sub-patterns (override attempts, system-prompt leakage, untrusted-data-as-instruction) flagged in the log?

Most teams score D or F on signals 6 and 7 specifically. Those are the 2026 high-blast-radius gaps.
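To make one of these checks concrete, here is a minimal sketch of signal 3 (retry-storm shape). It assumes your log lines are already parsed into dicts; the field names `event`, `intent_id`, and `tool` are hypothetical, not what the grader itself reads, so map them to whatever your logger emits.

```python
from collections import Counter

def find_retry_storms(records, threshold=3):
    """Return {(intent_id, tool): count} for tools called `threshold`+ times per intent."""
    counts = Counter(
        (r["intent_id"], r["tool"])
        for r in records
        if r.get("event") == "tool_call"  # hypothetical event name
    )
    return {pair: n for pair, n in counts.items() if n >= threshold}

# Hypothetical parsed log records, for illustration only
records = [
    {"event": "tool_call", "intent_id": "i-1", "tool": "charge_card"},
    {"event": "tool_call", "intent_id": "i-1", "tool": "charge_card"},
    {"event": "tool_call", "intent_id": "i-1", "tool": "charge_card"},
    {"event": "tool_call", "intent_id": "i-2", "tool": "send_email"},
    {"event": "llm_call", "intent_id": "i-2"},
]
print(find_retry_storms(records))  # {('i-1', 'charge_card'): 3}
```

The check only works if every tool-call line carries something that ties it back to the user's intent, which is why signal 1 comes first in the list.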

v3 adds 8 and 9. Both are cost-shape signals, not correctness-shape signals. That's the point.


Signal 8: Cost-per-outcome (the one your dashboard doesn't show)

Symptom: Monthly OpenAI / Anthropic bill jumps 2x-10x. You can't point at any one run. The dashboard's "per-task" widget shows nothing useful.

Why v2 missed it: v2 only looked at the presence of log lines. Cost-per-outcome is a metric carried on each line. You have to compute tokens_in, tokens_out, and cost_usd for each task and check whether they are being logged at all. If the log doesn't carry the metric, you can't detect a runaway; you can only detect the bill after it lands.

The detection rule (3 lines):

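# Wrap every LLM call; the log line is written right after the call returns.
# Assumes the Anthropic SDK and structlog; PRICING is USD per 1M tokens.
import time
import anthropic
import structlog

log = structlog.get_logger()
client = anthropic.Anthropic()
PRICING = {}  # fill in: {model: {"in": ..., "out": ...}}

def wrapped_llm_call(prompt, model, **kwargs):
    t0 = time.perf_counter()
    # kwargs must include max_tokens, which messages.create requires
    result = client.messages.create(model=model, messages=prompt, **kwargs)
    usage = result.usage
    log.info("llm_call",
        model=model,
        tokens_in=usage.input_tokens,
        tokens_out=usage.output_tokens,
        cost_usd=(usage.input_tokens * PRICING[model]["in"]
                  + usage.output_tokens * PRICING[model]["out"]) / 1e6,
        duration_ms=int((time.perf_counter() - t0) * 1000),
    )
    return result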

If your log doesn't have these four fields on every LLM call, the v3 grader flags it. The fix is the wrapper. The impact is visibility — within a week of shipping this, you'll see the silent multipliers: a 3x retry that didn't fail, a thinking-trap that burned 8x tokens, a tool call whose result ballooned the context by 18K tokens.
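Once the llm_call lines exist, turning them into a per-task number is a small roll-up. The sketch below is one way to do it, not how the grader works internally. It assumes each llm_call record also carries a `task_id` field (hypothetical; the wrapper above does not log one), and the "3x the median" cutoff is an arbitrary starting point.

```python
from collections import defaultdict
from statistics import median

def cost_per_task(llm_calls):
    """Sum cost_usd over all LLM calls that share a task_id."""
    totals = defaultdict(float)
    for call in llm_calls:
        totals[call["task_id"]] += call["cost_usd"]
    return dict(totals)

def flag_runaways(totals, multiple=3.0):
    """Return task ids whose total cost exceeds `multiple` x the median task cost."""
    if not totals:
        return []
    typical = median(totals.values())
    return sorted(t for t, cost in totals.items() if cost > multiple * typical)

# Hypothetical llm_call records, for illustration only
calls = [
    {"task_id": "t-1", "cost_usd": 0.02},
    {"task_id": "t-2", "cost_usd": 0.03},
    {"task_id": "t-3", "cost_usd": 0.02},
    {"task_id": "t-3", "cost_usd": 0.25},
    {"task_id": "t-4", "cost_usd": 0.02},
]
totals = cost_per_task(calls)
print(flag_runaways(totals))  # ['t-3']
```

Comparing against the median rather than the mean keeps one runaway task from dragging the baseline up and hiding itself.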

Why this is the leading cause of 2026 token-bill surprise: The Tom's Hardware piece of May 23, 2026 named "tokenmaxxing" as a 2026 anti-pattern specifically because per-token prices have fallen for 2 years, so any bill growth is growth in tokens per task, not in price per token. Without per-task cost in the log, you can't see the multiplier, only the bill.


Signal 9: Context-stuffing (the one you literally cannot see in v1/v2)

Symptom: Same workload. Same model. Bill 4x. The log says "agent ran 1 task", but the prompt for that task contained 28K tokens of stale tool output, and 6K of those 28K tokens were the same chunk repeated 3 times.

Why v2 missed it: v2 was per-line. Context-stuffing is a length signal. You have to look at the size of each messages / context / history line in the log, and flag lines that balloon past 20K chars OR repeat a chunk 3+ times within the same line.

The detection rule (also 3 lines):

# Call next to the llm_call log line, with the same messages you send.
# Reuses the `log` object from the wrapper above.
import json
import re
from collections import Counter

def log_context_size(call_id, messages):
    text = json.dumps(messages, separators=(",", ":"))
    if len(text) > 20_000:
        log.warning("context_stuffed", call_id=call_id, chars=len(text))
    # Heuristic: brace-delimited spans stand in for tool-result-ish chunks
    chunks = re.findall(r"\{.*?\}", text)
    dupes = [c for c, n in Counter(chunks).items() if n >= 3]
    if dupes:
        log.warning("context_chunk_repeated", call_id=call_id, n=len(dupes), sample=dupes[0][:200])

Why this is the silent killer: LangChain, CrewAI, and AutoGen all default to re-attaching the full tool result on every retry. So a tool that returns 8K of data, called 3 times in a loop because signal 3 (retry-storm) was missing, becomes 24K of duplicate context on the 4th call — and the 4th call is the one that decides whether to bill the customer. The cost multiplier is hidden inside what looks like a single, normal call.
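The arithmetic behind that 24K figure is easy to check. This is a toy model, not a measurement of any framework: it assumes a 2K-token base prompt (a made-up number), an 8K-token tool result, and that call k sees the results of the k tool calls made before it.

```python
def context_sizes(base_tokens, tool_result_tokens, tool_calls, reattach=True):
    """Input size of each LLM call when a tool runs `tool_calls` times.

    reattach=True keeps every earlier tool result in the context;
    reattach=False keeps only the most recent one.
    """
    sizes = []
    for k in range(tool_calls + 1):
        kept = k if reattach else min(k, 1)
        sizes.append(base_tokens + kept * tool_result_tokens)
    return sizes

stuffed = context_sizes(2_000, 8_000, tool_calls=3)
trimmed = context_sizes(2_000, 8_000, tool_calls=3, reattach=False)
print(stuffed)                      # [2000, 10000, 18000, 26000]
print(trimmed)                      # [2000, 10000, 10000, 10000]
print(sum(stuffed), sum(trimmed))   # 56000 32000
```

In this model the 4th call carries 24K tokens of tool output on top of the base prompt, and the whole task is billed for 56K input tokens instead of 32K, even though the log still shows one task.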

I have audited 19 small-team LLM bills in the last 90 days. 17 of the 19 had >60% of spend concentrated in 1 of 4 shapes: silent retry storm, thinking trap, context stuffing, agent-of-agents. The first two are visible in v2. The last two are not. v3 catches all 4.


What an A grade looks like in v3 (vs v2)

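Signal class                             v2 (7)        v3 (9)
Intent capture                           covered       covered
Tool-call outcome (real response)        covered       covered
Retry-storm shape                        covered       covered
Outcome-assertion line                   covered       covered
Side-effect vs completion ts             covered       covered
Idempotency keys                         covered       covered
Prompt-injection log shapes              covered       covered
Cost-per-outcome per task                not covered   NEW
Context-stuffing (length + chunk-rep)    not covered   NEW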

A D or F in v3 means your agent is shipping without most of the cost-visibility layer. The fix list is the same 3-line recipes above. The 30-second grade tells you which of the 9 to fix first.


What changed in the tool itself (for the engineers)

  • Browser-side only. Everything still runs in your browser. We never see your log text.
  • 9-band scoring. A/B/C/D/F now reflect 9 signals, with the grade letter thresholds rebalanced. A still means "all 9 present," but the cutoffs for B/C/D were tightened so a 5/9 isn't getting a C anymore.
  • Backward compatible. v1 and v2 sources still grade correctly. The capture API accepts silent-failure-audit-v1, -v2, and -v3 payloads.
  • Report email is richer. The one-page report now points signal 8 and 9 failures specifically at the cost-side deep read (the LLM Bill Triage, $299) because that's the natural next step for a team whose grader flagged a cost-shape signal.

The fix list and the upsell are the same shape they were in v2 — here is what's missing, here is the 3-line fix, here is the human-read version of this if you don't want to do it yourself. The only new piece is that signal 8/9 failures upsell to the cost report, not the correctness report. Same privacy, same browser-side posture, deeper coverage of the 2026 cost problem.


Try the grader (free, browser-side, no signup)

If you want to grade your own agent's logs, the free browser-side tool is linked from the canonical URL on this article's header (above the title on dev.to). Paste the last 50 lines. Get an A-F on the 9 signals. Email yourself the one-page report if you want the fix list in your inbox (email is only asked when you want the report — the grade is free and anonymous).

Two notes:

  1. The grader is opinionated about what "good" looks like, but the score breakdown is per-signal, so you can ignore a signal you don't care about (e.g. signal 7 if your agent doesn't take untrusted input) and still get a useful grade on the rest.
  2. The 9 signals are the same 9 the AI Ops Checkup looks for in a full production archive, and signals 8+9 are the same 2 the LLM Bill Triage deep-read specializes in. The grader is the "do I even have the problem?" step; the paid reports are "show me the specific drifts in my archive." Two different depths, same checklist.

What I learned writing v3 (the meta)

The v1 grader was built from the 5 most common failure modes I saw in audit work. v2 was built from the 2 questions every team asked after running v1 ("how do I know my retries aren't double-charging" → signal 6, "could someone have steered the agent" → signal 7). v3 is built the same way: from the 2 questions every team asked after running v2 ("why is my bill up 4x" → signals 8 and 9, the same answer from two angles).

If you run v3 and find a failure shape it doesn't catch, my email is in the footer of the report. v4 will be built from whatever you send me.

— Milo
