17 of the 19 LLM bills I audited had more than 60% of their spend in one of the same 4 cost shapes. v2 of the free browser-side agent-log grader caught 2 of those 4. It missed the other 2, and the other 2 are the reason your bill jumped 4x last month.
A few weeks ago I shipped a free, browser-side grader for AI agent logs: paste your last 50 log lines, get an A-F grade on the signal classes that distinguish a healthy agent from a silent-success one. v1 had 5 signals. v2 added 2 (idempotency-key absence and prompt-injection log shapes). Both versions have been picking up the same thing from teams that run them: the highest-blast-radius failure modes in 2026 are not the failure modes that show up on dashboards.
So v3 adds 2 more. Signals 8 and 9 are the leading cause of the $5K-$50K LLM-bill surprise that's now a recurring headline in 2026 (Vantage, Microsoft, Tom's Hardware, the "tokenmaxxing" anti-pattern), and the only reason the v2 grader missed them is that they don't appear in a single log line — they appear in the gap between log lines.
This post is the long-form launch for v3. The free tool is in the footer; this article is everything I learned writing it.
The 7 signals v2 already covered (recap)
If you missed v1 and v2, here's the 30-second version. A "silent failure" is when an agent's dashboard says green and the customer's invoice says otherwise. v2 graded 7 signal classes from your last 50 log lines:
- Intent capture — does the log say what the user asked for, in their words, before any tool call?
- Tool-call outcome (real response, not just "ok") — does the log record the actual response body, not just HTTP 200?
- Retry-storm shape — does the log show the same tool being called 3+ times for the same intent?
- Outcome-assertion line — is there an explicit "did the side-effect land?" check, separate from the tool call?
- Side-effect vs. completion timestamp drift — does the log distinguish "we made the API call" from "the API call landed and changed state"?
- Idempotency keys on side-effecting calls — every Stripe / Twilio / Plaid / SendGrid / Slack call has a key to prevent double-charge on retry?
- Prompt-injection log shapes — are the 3 sub-patterns (override attempts, system-prompt leakage, untrusted-data-as-instruction) flagged in the log?
Most teams score D or F on signals 6 and 7 specifically. Those are the 2026 high-blast-radius gaps.
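For signal 6, here is a minimal sketch of what "check for idempotency keys" can look like over parsed log records. The field names (`event`, `provider`, `idempotency_key`) are my assumptions for illustration, not a schema the grader requires, so rename them to match whatever your logs actually emit.

```python
# Hypothetical field names ("event", "provider", "idempotency_key").
# Adapt them to your own structured-log schema.
SIDE_EFFECT_PROVIDERS = {"stripe", "twilio", "plaid", "sendgrid", "slack"}


def calls_missing_idempotency_key(records):
    """Return side-effecting tool calls that carry no idempotency key."""
    missing = []
    for rec in records:
        if rec.get("event") != "tool_call":
            continue
        provider = str(rec.get("provider") or "").lower()
        if provider not in SIDE_EFFECT_PROVIDERS:
            continue
        if not rec.get("idempotency_key"):
            missing.append(rec)
    return missing


sample = [
    {"event": "tool_call", "provider": "stripe", "idempotency_key": "k-1"},
    {"event": "tool_call", "provider": "stripe"},   # side effect, no key
    {"event": "tool_call", "provider": "search"},   # read-only, ignored
]
flagged = calls_missing_idempotency_key(sample)
print(len(flagged))  # -> 1
```

Any record this returns is a call that could double-charge on retry.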
v3 adds 8 and 9. Both are cost-shape signals, not correctness-shape signals. That's the point.
Signal 8: Cost-per-outcome (the one your dashboard doesn't show)
Symptom: Monthly OpenAI / Anthropic bill jumps 2x-10x. You can't point at any one run. The dashboard's "per-task" widget shows nothing useful.
Why v2 missed it: v2 looked at the presence of log lines. Cost-per-outcome is a metric per line. You have to compute tokens_in / tokens_out / cost_usd for each task and check whether it's being logged at all. If the log doesn't carry the metric, you can't detect a runaway; you can only detect the bill after it lands.
The detection rule (3 lines):
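```python
import time

# Route every LLM call through this wrapper; the log line is emitted
# right after the call returns, once usage numbers are available.
# Assumes `client` (an Anthropic-style SDK client), `log` (a structlog-style
# logger) and PRICING (USD per million tokens, keyed by model) exist elsewhere.
def wrapped_llm_call(prompt, model, **kwargs):
    t0 = time.perf_counter()
    result = client.messages.create(model=model, messages=prompt, **kwargs)
    log.info("llm_call",
        model=model,
        tokens_in=result.usage.input_tokens,
        tokens_out=result.usage.output_tokens,
        cost_usd=result.usage.input_tokens * PRICING[model]["in"] / 1e6
            + result.usage.output_tokens * PRICING[model]["out"] / 1e6,
        duration_ms=int((time.perf_counter() - t0) * 1000),
    )
    return result
```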
If your log doesn't have these four fields on every LLM call, the v3 grader flags it. The fix is the wrapper. The impact is visibility — within a week of shipping this, you'll see the silent multipliers: a 3x retry that didn't fail, a thinking-trap that burned 8x tokens, a tool call whose result ballooned the context by 18K tokens.
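Once those fields are in the log, rolling them up per task is the step that actually surfaces the multipliers. A rough sketch, assuming each record also carries a `task_id` (the wrapper above does not log one, so that field is hypothetical) and that a structlog-style `event` key holds the event name. The "3x the median" threshold is an arbitrary starting point, not a rule from the grader.

```python
from collections import defaultdict
from statistics import median


def cost_per_task(records):
    """Sum cost_usd across every llm_call record that shares a task_id."""
    totals = defaultdict(float)
    for rec in records:
        if rec.get("event") == "llm_call":
            totals[rec["task_id"]] += rec["cost_usd"]
    return dict(totals)


def runaway_tasks(totals, multiple=3.0):
    """Flag tasks whose total cost exceeds `multiple` times the median task."""
    if not totals:
        return []
    typical = median(totals.values())
    return sorted(t for t, cost in totals.items() if cost > multiple * typical)


# Illustrative records only; task_id is an assumed extra field.
records = [
    {"event": "llm_call", "task_id": "t1", "cost_usd": 0.02},
    {"event": "llm_call", "task_id": "t2", "cost_usd": 0.03},
    {"event": "llm_call", "task_id": "t3", "cost_usd": 0.02},
    {"event": "llm_call", "task_id": "t3", "cost_usd": 0.09},
    {"event": "llm_call", "task_id": "t3", "cost_usd": 0.10},
]
totals = cost_per_task(records)
print(runaway_tasks(totals))  # -> ['t3']
```

Task `t3` looks like one task on a dashboard, but it made three calls and cost about 7x the median task.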
Why this is the leading cause of 2026 token-bill surprise: The Tom's Hardware May 23 2026 piece named "tokenmaxxing" as a 2026 anti-pattern specifically because per-token prices have fallen for 2 years, so any bill growth is tokens-per-task growth, not per-token growth. Without per-task cost in the log, you can't see the multiplier, only the bill.
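The arithmetic behind that claim fits in a few lines. The numbers below are made up purely for illustration (they are not from any audited bill): the per-token price halves, tokens per task grow 8x, and the bill still quadruples.

```python
def monthly_bill(tasks, tokens_per_task, usd_per_million_tokens):
    """Bill = tasks x tokens per task x price per token."""
    return tasks * tokens_per_task * usd_per_million_tokens / 1_000_000


# Illustrative inputs only: same task volume, cheaper tokens, fatter tasks.
before = monthly_bill(tasks=10_000, tokens_per_task=5_000, usd_per_million_tokens=4.0)
after = monthly_bill(tasks=10_000, tokens_per_task=40_000, usd_per_million_tokens=2.0)

print(before)          # -> 200.0
print(after)           # -> 800.0
print(after / before)  # -> 4.0
```

The only term that moved against you is `tokens_per_task`, which is exactly the term a bill-level dashboard does not break out.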
Signal 9: Context-stuffing (the one you literally cannot see in v1/v2)
Symptom: Same workload. Same model. Bill 4x. The log says "agent ran 1 task", but the prompt for that task contained 28K tokens of stale tool output, and 6K of those 28K tokens were the same chunk repeated 3 times.
Why v2 missed it: v2 was per-line. Context-stuffing is a length signal. You have to look at the size of each messages / context / history line in the log, and flag lines that balloon past 20K chars OR repeat a chunk 3+ times within the same line.
The detection rule (also 3 lines):
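```python
import json
import re
from collections import Counter

# Call this next to the llm_call log line, right after cost_usd is computed.
# Assumes the same `log` (structlog-style logger) used by the LLM call wrapper.
def log_context_size(call_id, messages):
    text = json.dumps(messages, separators=(",", ":"))
    if len(text) > 20_000:
        log.warning("context_stuffed", call_id=call_id, chars=len(text))
    # Rough heuristic: brace-delimited spans stand in for tool-result chunks.
    counts = Counter(re.findall(r"\{.*?\}", text))
    dupes = [chunk for chunk, n in counts.items() if n >= 3]
    if dupes:
        log.warning("context_chunk_repeated", call_id=call_id, n=len(dupes), sample=dupes[0][:200])
```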
Why this is the silent killer: LangChain, CrewAI, and AutoGen all default to re-attaching the full tool result on every retry. So a tool that returns 8K of data, called 3 times in a loop because signal 3 (retry-storm) was missing, becomes 24K of duplicate context on the 4th call — and the 4th call is the one that decides whether to bill the customer. The cost multiplier is hidden inside what looks like a single, normal call.
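To make that 8K-to-24K arithmetic concrete, here is a small simulation of the prompt size each call sees when a retry appends the tool result again instead of replacing it. The 2,000-unit base prompt is an assumption I picked for the example, and the function models the re-attach behavior described above rather than any specific framework's code.

```python
def context_size_per_call(base_size, tool_result_size, retries):
    """Prompt size seen by each call when every retry re-attaches the tool result."""
    sizes = []
    context = base_size
    for _ in range(retries + 1):
        sizes.append(context)
        context += tool_result_size  # appended again, not replaced
    return sizes


# Assumed base prompt of 2,000 units; the tool returns 8,000 per call.
sizes = context_size_per_call(base_size=2_000, tool_result_size=8_000, retries=3)
print(sizes)  # -> [2000, 10000, 18000, 26000]
```

The 4th call carries 24,000 units of tool output on top of the base prompt, and nothing in a one-line-per-call log says so unless you log the size.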
I have audited 19 small-team LLM bills in the last 90 days. 17 of the 19 had >60% of spend concentrated in 1 of 4 shapes: silent retry storm, thinking trap, context stuffing, agent-of-agents. The first two are visible in v2. The last two are not. v3 catches all 4.
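If you want to run the same concentration check on your own logs, a sketch like this is enough once each call has been tagged with a shape. The `shape` label and the sample costs are hypothetical; how you assign a shape to a call is the hard part and is not shown here.

```python
from collections import Counter


def spend_share_by_shape(calls):
    """Return each shape's fraction of total spend."""
    totals = Counter()
    for call in calls:
        totals[call["shape"]] += call["cost_usd"]
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {shape: cost / grand_total for shape, cost in totals.items()}


# Illustrative costs only; "shape" is an assumed label added during triage.
calls = [
    {"shape": "context_stuffing", "cost_usd": 70.0},
    {"shape": "silent_retry_storm", "cost_usd": 10.0},
    {"shape": "other", "cost_usd": 20.0},
]
shares = spend_share_by_shape(calls)
dominant = max(shares, key=shares.get)
print(dominant, shares[dominant] > 0.60)  # -> context_stuffing True
```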
What an A grade looks like in v3 (vs v2)
| Signal class | v2 (7) | v3 (9) |
|---|---|---|
| Intent capture | ✓ | ✓ |
| Tool-call outcome (real response) | ✓ | ✓ |
| Retry-storm shape | ✓ | ✓ |
| Outcome-assertion line | ✓ | ✓ |
| Side-effect vs completion ts | ✓ | ✓ |
| Idempotency keys | ✓ | ✓ |
| Prompt-injection log shapes | ✓ | ✓ |
| Cost-per-outcome per task | — | NEW |
| Context-stuffing (length + chunk-rep) | — | NEW |
A D or F in v3 means your agent is shipping without most of the cost-visibility layer. The fix list is the same 3-line recipes above. The 30-second grade tells you which of the 9 to fix first.
What changed in the tool itself (for the engineers)
- Browser-side only. Everything still runs in your browser. We never see your log text.
- 9-band scoring. A/B/C/D/F now reflect 9 signals, with the grade letter thresholds rebalanced. A still means "all 9 present," but the cutoffs for B/C/D were tightened so a 5/9 isn't getting a C anymore.
- Backward compatible. v1 and v2 sources still grade correctly. The capture API accepts silent-failure-audit-v1, -v2, and -v3 payloads.
- Report email is richer. The one-page report now points signal 8 and 9 failures specifically at the cost-side deep read (the LLM Bill Triage, $299) because that's the natural next step for a team whose grader flagged a cost-shape signal.
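The scoring change in the list above can be sketched roughly like this. The cutoffs are hypothetical: the only constraints stated in this post are that A means all 9 signals are present and that 5/9 no longer earns a C, so treat the B/C/D thresholds and the signal identifiers below as placeholders rather than the grader's real values.

```python
# Illustrative identifiers for the 9 signal classes (names are assumptions).
SIGNALS = (
    "intent_capture", "tool_call_outcome", "retry_storm", "outcome_assertion",
    "timestamp_drift", "idempotency_keys", "prompt_injection",
    "cost_per_outcome", "context_stuffing",
)

# Hypothetical cutoffs. Only "A == 9/9" and "5/9 is below C" come from the post.
CUTOFFS = ((9, "A"), (8, "B"), (7, "C"), (6, "D"))


def grade(present):
    """Map the set of signals found in a log to a letter grade."""
    score = sum(1 for signal in SIGNALS if signal in present)
    for minimum, letter in CUTOFFS:
        if score >= minimum:
            return letter
    return "F"


print(grade(set(SIGNALS)))      # -> A
print(grade(set(SIGNALS[:5])))  # -> F (under these assumed cutoffs)
```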
The fix list and the upsell are the same shape they were in v2 — here is what's missing, here is the 3-line fix, here is the human-read version of this if you don't want to do it yourself. The only new piece is that signal 8/9 failures upsell to the cost report, not the correctness report. Same privacy, same browser-side posture, deeper coverage of the 2026 cost problem.
Try the grader (free, browser-side, no signup)
If you want to grade your own agent's logs, the free browser-side tool is linked from the canonical URL on this article's header (above the title on dev.to). Paste the last 50 lines. Get an A-F on the 9 signals. Email yourself the one-page report if you want the fix list in your inbox (email is only asked when you want the report — the grade is free and anonymous).
Two notes:
- The grader is opinionated about what "good" looks like, but the score breakdown is per-signal, so you can ignore a signal you don't care about (e.g. signal 7 if your agent doesn't take untrusted input) and still get a useful grade on the rest.
- The 9 signals are the same 9 the AI Ops Checkup looks for in a full production archive, and signals 8+9 are the same 2 the LLM Bill Triage deep-read specializes in. The grader is the "do I even have the problem?" step; the paid reports are "show me the specific drifts in my archive." Two different depths, same checklist.
What I learned writing v3 (the meta)
The v1 grader was built from the 5 most common failure modes I saw in audit work. v2 was built from the 2 questions every team asked after running v1 ("how do I know my retries aren't double-charging" → signal 6, "could someone have steered the agent" → signal 7). v3 is built the same way: from the 2 questions every team asked after running v2 ("why is my bill up 4x" → signals 8 and 9, the same answer from two angles).
If you run v3 and find a failure shape it doesn't catch, my email is in the footer of the report. v4 will be built from whatever you send me.
— Milo