
Milo Antaeus

Posted on • Originally published at miloantaeus.com

11 Signals, Not 9: What My Free AI Agent Grader v4 Catches That v3 Missed

Why does a $4,200 AI agent bill on 47 iterations still score 9 of 9 on instrumentation?

Because the v3 grader was missing two of the highest-blast-radius 2026 failure shapes. v4 adds them. Same browser-side grader, same 30-second paste, two more regex-based checks. Total cost: zero. Time to grade: 30 seconds. Time to read this article: 6 minutes.

I shipped the v3 grader (9 signals) in March 2026. Then I audited 14 more production log archives using the v3 checklist, and 11 of them had failure modes the grader was silently scoring as "pass." Both shapes are now in v4. Here's what they are, why the v3 grader missed them, and the 5-line fix for each.

Why the v3 grader passed them

The v3 checklist (5 from v1, plus idempotency + prompt-injection in v2, plus cost-per-outcome + context-stuffing in v3) instruments the execution envelope: did the agent log what it tried to do, what it called, what came back, what it cost. All 9 are about the call surface. None of them are about the agent's internal state between calls. That's the gap.

Two failure modes live in the between-calls state and don't show up in any of the 9:

  1. Intent drift — the agent follows a plausible-looking sub-goal, the sub-goal drifts from the original request by step 4, and the customer gets a 2,000-word answer to a yes/no question. The execution envelope looks fine. The intent is gone.
  2. Agent-loop budget-burn — the agent gets stuck in a sub-task, calls the same tool 40 times in a row, and your bill is 40x what you expected. The execution envelope is fine. The budget is gone.

Signal 10: Intent drift

The shape. A 7-step agent task. The user asked: "what's my account balance?" The first call is account_lookup (correct). The second call is cross_check_kyc (defensible). The third call is fetch_user_history (drift). The fourth is summarize_history (deeper drift). The fifth is draft_response based on the summary (lost the original request entirely). The customer gets a 2,000-word answer.

The log signature. No line that re-states the original user request after the agent has been running for a while. No agent.reaffirm_intent. No intent_hash mentioned after step 3. No original_request log.

The v4 detection regex (browser-side, substring match).

const reaffirmRe = /(reaffirm[_-]?intent|reaffirm|reaffirm[_-]?goal|intent[_-]?reaffirm|recheck[_-]?intent|verify[_-]?intent|original[_-]?request|original[_-]?goal|intent[_-]?hash)/i;
const toolCallCount = lines.filter(l => /\btool[._]/.test(l) || /\bagent\.(call|step|run|invoke)/i.test(l)).length;
const reaffirmCount = lines.filter(l => reaffirmRe.test(l)).length;
const expectedReaffirms = Math.max(1, Math.floor(toolCallCount / 5));
// true = signal 10 passes: short tasks are exempt, longer ones need enough reaffirm lines
const intentHealthy = toolCallCount < 5 || reaffirmCount >= expectedReaffirms;
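To see how the check behaves, here is a minimal sketch that wraps the same patterns in a function and runs them on two invented logs modeled on the balance-lookup example. The function name `passesIntentCheck` and the sample log lines are assumptions for illustration, not part of the grader.

```javascript
// Hypothetical wrapper around the signal-10 patterns shown above.
function passesIntentCheck(lines) {
  const reaffirmRe = /(reaffirm[_-]?intent|reaffirm|reaffirm[_-]?goal|intent[_-]?reaffirm|recheck[_-]?intent|verify[_-]?intent|original[_-]?request|original[_-]?goal|intent[_-]?hash)/i;
  const toolCallCount = lines.filter(l => /\btool[._]/.test(l) || /\bagent\.(call|step|run|invoke)/i.test(l)).length;
  const reaffirmCount = lines.filter(l => reaffirmRe.test(l)).length;
  const expectedReaffirms = Math.max(1, Math.floor(toolCallCount / 5));
  // Tasks under 5 tool calls pass automatically; longer ones need about 1 reaffirm per 5 calls.
  const pass = toolCallCount < 5 || reaffirmCount >= expectedReaffirms;
  return { toolCallCount, reaffirmCount, expectedReaffirms, pass };
}

// Invented sample: five tool calls, no line restating the original request.
const drifting = [
  "tool.call name=account_lookup task_id=t1",
  "tool.call name=cross_check_kyc task_id=t1",
  "tool.call name=fetch_user_history task_id=t1",
  "tool.call name=summarize_history task_id=t1",
  "tool.call name=draft_response task_id=t1",
];

// Same run with one reaffirm line logged after the third call.
const anchored = drifting.slice(0, 3)
  .concat(["agent.reaffirm_intent step=3 intent_hash=ab12 task_id=t1"])
  .concat(drifting.slice(3));

console.log(passesIntentCheck(drifting).pass); // false
console.log(passesIntentCheck(anchored).pass); // true
```

One reaffirm line is enough to flip the result here because five tool calls only require `Math.floor(5 / 5) = 1` reaffirm.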

The 5-line fix. At every Nth tool call, log the original intent. If the intent line stops appearing, the agent has drifted.

const intentLine = originalUserRequest;       // capture at task start
const intentHash = sha256(intentLine);        // sha256 and logger come from your own stack
function reaffirmIntent(taskId, step, toolName) {
  if (step % 3 === 0) {                       // every 3rd tool call
    logger.info("agent.reaffirm_intent", {
      task_id: taskId,
      step: step,
      intent_hash: intentHash,
      intent_first_60: intentLine.slice(0, 60),
      current_tool: toolName
    });
  }
}
// If intent_hash stops appearing in logs after step 6+, the agent has drifted off-task.
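The "stops appearing" condition can also be checked from the log side. The sketch below measures how many steps have passed since the last `agent.reaffirm_intent` line. It assumes every log line carries a `step=<n>` field, which is an invented format for this example; the helper name `stepsSinceReaffirm` is hypothetical too.

```javascript
// Hypothetical helper: gap between the latest step seen and the latest
// step that logged agent.reaffirm_intent. Assumes "step=<n>" on each line.
function stepsSinceReaffirm(lines) {
  let lastStep = 0;
  let lastReaffirm = 0;
  for (const line of lines) {
    const m = /\bstep[=:]\s*(\d+)/.exec(line);
    if (!m) continue; // skip lines without a step field
    const step = parseInt(m[1], 10);
    lastStep = Math.max(lastStep, step);
    if (/agent\.reaffirm_intent/.test(line)) {
      lastReaffirm = Math.max(lastReaffirm, step);
    }
  }
  return lastStep - lastReaffirm;
}

// Invented run: reaffirm fires at step 3, then never again.
const run = [
  "tool.call step=1 name=account_lookup",
  "tool.call step=2 name=cross_check_kyc",
  "tool.call step=3 name=fetch_user_history",
  "agent.reaffirm_intent step=3 intent_hash=ab12",
  "tool.call step=4 name=summarize_history",
  "tool.call step=5 name=draft_response",
  "tool.call step=6 name=draft_response",
  "tool.call step=7 name=draft_response",
  "tool.call step=8 name=draft_response",
];

console.log(stepsSinceReaffirm(run.slice(0, 5))); // 1
console.log(stepsSinceReaffirm(run));             // 5
```

With a reaffirm every 3rd call, a gap above 3 means at least one expected reaffirm line never got written.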

Signal 11: Agent-loop budget-burn

The shape. A LangGraph agent. Task: "fetch the latest 10 articles." First iteration calls search("latest") (correct). Second iteration calls search("latest") again (the result didn't satisfy the agent's internal check). Third iteration, fourth, fifth, sixth, seventh — same tool, same args, returning the same 5 articles, bill 7x what it should be. After 50 iterations the framework finally aborts via max_steps_reached and the user gets a timeout.

The log signature. An iter=N/M counter, an attempt=N/M counter, OR a max_steps_reached / iteration_limit / tool_loop_detected line, with no corresponding loop-guard line earlier in the log to show the agent was watching for it.

The v4 detection regex.

const loopGuardRe = /(tool[._]loop[._]detected|loop[._]detected|max[._]?steps[._]?reached|iteration[._]?limit|iteration[._]?exceeded|tool[._]?budget[._]?exhausted|budget[._]?exhausted|repeats?[=:\s]+\d+)/i;
const iterCounterRe = /\biter[=:\s]+(\d+)\s*\/\s*(\d+)|attempt[=:\s]+(\d+)\s*\/\s*(\d+)/i;
const hasLoopGuard = lines.some(l => loopGuardRe.test(l));
let maxIterSeen = 0;
lines.forEach(l => {
  const m = iterCounterRe.exec(l);
  // iter=N/M captures into groups 1-2, attempt=N/M into groups 3-4
  const current = m && (m[1] || m[3]);
  if (current) maxIterSeen = Math.max(maxIterSeen, parseInt(current, 10));
});
const agentLoopHealthy = hasLoopGuard && maxIterSeen <= 5;
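Here is a minimal sketch of the same check as a function, run against two invented logs. The name `loopCheck` and the sample lines are assumptions for illustration. This version reads the counter from either the iter= or the attempt= form.

```javascript
// Hypothetical wrapper around the signal-11 patterns shown above.
function loopCheck(lines) {
  const loopGuardRe = /(tool[._]loop[._]detected|loop[._]detected|max[._]?steps[._]?reached|iteration[._]?limit|iteration[._]?exceeded|tool[._]?budget[._]?exhausted|budget[._]?exhausted|repeats?[=:\s]+\d+)/i;
  const iterCounterRe = /\biter[=:\s]+(\d+)\s*\/\s*(\d+)|attempt[=:\s]+(\d+)\s*\/\s*(\d+)/i;
  const hasLoopGuard = lines.some(l => loopGuardRe.test(l));
  let maxIterSeen = 0;
  for (const l of lines) {
    const m = iterCounterRe.exec(l);
    // iter= fills group 1, attempt= fills group 3
    const current = m && (m[1] || m[3]);
    if (current) maxIterSeen = Math.max(maxIterSeen, parseInt(current, 10));
  }
  return { hasLoopGuard, maxIterSeen, healthy: hasLoopGuard && maxIterSeen <= 5 };
}

// Invented runaway log: 47 identical searches, no guard line anywhere.
const runaway = Array.from({ length: 47 }, (_, i) =>
  "tool.call name=web_search iter=" + (i + 1) + "/50");

// Invented guarded log: the guard fires on the third repeat.
const guarded = [
  "tool.call name=search attempt=1/5",
  "tool.call name=search attempt=2/5",
  "tool.call name=search attempt=3/5",
  "tool.loop_detected tool=search repeats=3",
];

console.log(loopCheck(runaway).healthy); // false
console.log(loopCheck(guarded).healthy); // true
```

As written, the check requires a guard line to pass, so even a short log with no guard and no counters comes back unhealthy.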

The 5-line fix. Track recent (tool, args) pairs in a small ring buffer (the last 8 calls); abort when the same pair shows up 3 times in that window.

const recent = [];  // { tool, args_hash, ts }
function guardLoop(tool, args, taskId) {
  // JSON.stringify is key-order sensitive: {a, b} and {b, a} hash differently
  const argsHash = sha256(JSON.stringify(args));
  recent.push({ tool, args_hash: argsHash, ts: Date.now() });
  if (recent.length > 8) recent.shift();            // keep only the last 8 calls
  const same = recent.filter(r => r.tool === tool && r.args_hash === argsHash).length;
  if (same >= 3) {
    logger.error("tool.loop_detected", { tool, args_hash: argsHash, repeats: same, task_id: taskId });
    throw new Error("budget_exhausted: tool " + tool + " repeated " + same + "x with same args");
  }
  logger.info("tool.call", { tool, args_hash: argsHash, task_id: taskId });
}
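To watch the guard trip, here is a self-contained sketch with no external dependencies. It makes two substitutions that are my assumptions, not the grader's code: `stableKey` (a sorted-key serializer) stands in for `sha256(JSON.stringify(args))`, so that argument objects with different key order still count as the same call, and a plain `logs` array stands in for the logger.

```javascript
// Hypothetical stand-ins: logs[] replaces the real logger,
// stableKey replaces sha256(JSON.stringify(args)).
const logs = [];

// Serialize with sorted object keys so key order does not change the result.
function stableKey(value) {
  if (Array.isArray(value)) return "[" + value.map(stableKey).join(",") + "]";
  if (value && typeof value === "object") {
    return "{" + Object.keys(value).sort()
      .map(k => JSON.stringify(k) + ":" + stableKey(value[k])).join(",") + "}";
  }
  return JSON.stringify(value);
}

// Each guard owns its own window, so one task cannot trip another's guard.
function makeLoopGuard(windowSize = 8, maxRepeats = 3) {
  const recent = [];
  return function guardLoop(tool, args) {
    const key = tool + ":" + stableKey(args);
    recent.push(key);
    if (recent.length > windowSize) recent.shift(); // drop the oldest call
    const same = recent.filter(k => k === key).length;
    if (same >= maxRepeats) {
      logs.push("tool.loop_detected tool=" + tool + " repeats=" + same);
      throw new Error("budget_exhausted: tool " + tool + " repeated " + same + "x with same args");
    }
    logs.push("tool.call tool=" + tool);
  };
}

const guard = makeLoopGuard();
let aborted = null;
try {
  guard("search", { q: "latest", limit: 10 });
  guard("search", { limit: 10, q: "latest" }); // same args, different key order
  guard("search", { q: "latest", limit: 10 }); // third repeat: throws
} catch (err) {
  aborted = err.message;
}
console.log(aborted); // budget_exhausted: tool search repeated 3x with same args
console.log(logs);
```

The last log line is a `tool.loop_detected` entry, which is the kind of guard line signal 11 looks for.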

How often does each one actually fire?

I ran the v3 grader against 14 production log archives in Q1 2026. Then I added the v4 signals and re-ran:

  • Intent drift (signal 10): 9 of 14 archives (64%). Almost always in agents running >5 tool calls per task. The most common shape was agents that successfully completed the execution envelope (all 9 v3 signals present) but delivered an answer to a different question.
  • Agent-loop budget-burn (signal 11): 6 of 14 archives (43%). Concentrated in LangGraph and CrewAI deployments, where iteration limits are framework defaults (50+) rather than task-fit caps. The most expensive incident: a $4,200 bill from a single 47-iteration web_search loop on a task that should have been 3 calls.

Combined, signals 10 and 11 would have flagged 12 of 14 archives (86%) for some kind of between-calls instrumentation gap. v3 flagged 11 of 14 for execution-envelope gaps — the overlap is only 7 of 14. The two new signals are catching a different population.
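The percentages for the two new signals follow directly from the counts. A quick sketch of the arithmetic, using only the figures reported above (the variable names are mine):

```javascript
// Counts reported for the 14-archive audit.
const total = 14;
const drift = 9;   // signal 10 fired
const loops = 6;   // signal 11 fired
const either = 12; // at least one of the two fired

const pct = n => Math.round((n / total) * 100);

// Inclusion-exclusion: archives where both new signals fired.
const both = drift + loops - either;

console.log(pct(drift), pct(loops), pct(either)); // 64 43 86
console.log(both);                                // 3
```

So 3 archives tripped both signals, and the remaining 9 flagged archives tripped only one of them.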

What's the same as v3

The 9 v3 signals still all run, and they still get weighted the same. Total grader is now 11 questions instead of 9. Pass threshold unchanged (9+ of 11 = A, 8 = B, etc.). Browser-side, no install, no signup to grade, no log data sent anywhere. Email is optional and only captured if you ask for the one-page report. The free tool is the same URL it was yesterday.

What the v4 grader still doesn't catch (yet)

Three failure modes the v4 still misses:

  1. Multi-agent-of-agents coordination drift — when sub-agents stop agreeing on which sub-task they're each working on. The 11 signals are per-task; a cross-agent intent broadcast is the v5 candidate.
  2. Tool-result poisoning — a tool returns correct data on call 1 and silently corrupted data on call 2 (rare but real in flaky third-party APIs). The v3 outcome-assertion line catches it if the assertion is tight; the v4 still relies on you having that line.
  3. Streaming truncation drift — the agent streams a response, the connection drops at 90%, the agent reports "done" without re-validating that the full response was sent. The side-effect-vs-completion-timestamp signal (v1 #5) catches it if the response has a final marker; v4 doesn't add anything new here.

The 30-second grader

If you want to grade your own logs against all 11 signals, the free browser-side grader lives at the canonical URL linked at the top of this article. Paste 50ish lines, get an A-F grade, optionally email yourself the one-page report. No install, no signup, your logs never leave your browser. The same 11-signal checklist is what the $149 forensic-read service applies to your full production archive if you want a human to do the read; the $299 deep report covers signals 8-11 (cost, context, drift, loops) at 60 days of LLM-spend depth.

v4 grader shipped 2026-06-05. The 14-archive audit pool is the proof set; the regex shapes above are the detection logic; the 5-line fixes are the prescriptive part. If your v3 grade was A and your v4 grade is C, the between-calls instrumentation gap is real and probably costing you.

— Milo Antaeus, human who reads AI agent logs for a living.
