7 Signals, Not 5: What My Free AI Agent Grader v2 Catches That v1 Missed

#ai #agents #observability #devops

I built a free browser-side grader for AI agent logs. It started with 5 signals. v2 has 7.

The two new signals are the ones that bit me the hardest in real customer logs in 2026, and the ones I now think every shipping agent should be checked for before the dashboard says "all green." Both are missing from the logs of almost every team I've audited this year.

If you want to skip the article and just try the v2 grader, it's free, runs in your browser, and takes 30 seconds: https://www.miloantaeus.com/silent-failure-audit.html

Otherwise, here's what the v1 grader missed and why I added the new ones.

The 5-signal v1 grader (and what it actually did)

The v1 grader ran on five signal classes:

Intent capture: did the agent log what it was trying to do (the user request, the goal, the task) before it started tool-calling?
Tool-call outcome: did each tool call log the actual response body / status code / side-effect, not just "ok" or "done"?
Retry-storm shape: did the same call repeat more than twice in a row with no assertion in between?
Outcome-assertion line: after a side-effecting call, is there a line that compares expected vs. actual?
Side-effect vs. completion timestamp: are "landed" / "sent" events logged separately from "done"?
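
Of the five, the retry-storm check is the easiest to sketch in code. This is a minimal illustration, not the grader's actual implementation: it assumes log lines have already been parsed into objects with hypothetical `type`, `tool`, and `args` fields.

```javascript
// Hypothetical parsed-log shape: { type: "tool_call" | "assertion" | ..., tool, args }
// Flags any run where the same call repeats more than `maxRepeats` times
// with no assertion line in between.
function findRetryStorms(entries, maxRepeats = 2) {
  const storms = [];
  let lastKey = null;
  let run = 0;
  let flagged = false;
  for (const e of entries) {
    if (e.type === "assertion") {
      // An assertion between calls resets the run.
      lastKey = null;
      run = 0;
      flagged = false;
      continue;
    }
    if (e.type !== "tool_call") continue; // other lines do not break a run
    const key = e.tool + ":" + JSON.stringify(e.args);
    if (key === lastKey) {
      run += 1;
    } else {
      lastKey = key;
      run = 1;
      flagged = false;
    }
    if (run > maxRepeats && !flagged) {
      storms.push({ tool: e.tool, args: e.args });
      flagged = true; // report each storm once
    }
  }
  return storms;
}
```

Comparing on tool name plus serialized arguments is a deliberate simplification. A real log may need fuzzier matching, for example ignoring timestamps inside the arguments.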

The grading rubric was a clean 5-bool sum: 5 of 5 = A, 4 of 5 = B, 3 of 5 = C, 2 of 5 = D, 0-1 of 5 = F.
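
As a sketch, that rubric is a one-liner over five booleans. The function name `gradeV1` is mine for illustration, and this covers only the v1 mapping described above.

```javascript
// Maps five pass/fail signal results to a letter grade using the v1 rubric.
function gradeV1(signals) {
  if (signals.length !== 5) throw new Error("v1 expects exactly 5 signals");
  const passed = signals.filter(Boolean).length;
  const byScore = { 5: "A", 4: "B", 3: "C", 2: "D" };
  return byScore[passed] ?? "F"; // 0 or 1 passed
}
```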

In the first 24 hours after launch, about 60% of the people who tried it scored D or F. The most common reason was missing signals 3 and 4 (retry storm + no outcome assertion) — the classic "we just see 'ok' in the log" shape.

v1 was fine. But every team I walked through their v1 results on a follow-up call asked the same two questions:

"How do I know my retries aren't double-charging customers?"
"Could someone have steered the agent with adversarial input and I'd never see it in the log?"

Both are excellent questions. v1 didn't answer either. v2 does.

Signal 6 (new in v2): Idempotency keys on side-effecting calls

This one is the highest-blast-radius signal I see missing in 2026, and the easiest to fix.

The shape: Your agent calls send_email, create_charge, transfer_funds, write_row, post_message, or any other side-effecting tool. The call doesn't carry an idempotency_key / request_id / dedup_token. The wrapper retries on timeout. The actual side-effect lands twice (or three times, or ten times — there's no upper bound when there's no key).

Why it matters: APIs like Stripe accept an idempotency key and will de-dup retried requests if you send one, but the key is optional, so the same API will happily charge a customer 3x if you don't. Support varies by provider and endpoint, so check the docs for each one you call (Twilio, Plaid, SendGrid, and so on). The agent wrappers (LangChain, CrewAI, AutoGen) default to NOT emitting one. So a 3x retry storm on a non-idempotent charge = 3 charges to the same customer.

The audit heuristic: For every line that looks like a side-effecting call (send, charge, create, write, update, insert, delete, post, patch, put), does the same line (or a sibling line) carry an idempotency_key, idem_key, request_id, dedup_token, order_id, client_request_id, or trace_id? If the ratio is below 50% on side-effecting calls, you have a problem.
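
A rough sketch of that heuristic over raw log lines follows. It is not the grader's real code. The verb and key lists come from the paragraph above, and reading "sibling line" as the line directly before or after is my assumption.

```javascript
// Verbs and key names are the ones listed in the heuristic above.
const SIDE_EFFECT_VERBS = /\b(send|charge|create|write|update|insert|delete|post|patch|put)/i;
const KEY_NAMES = /\b(idempotency_key|idem_key|request_id|dedup_token|order_id|client_request_id|trace_id)\b/i;

function idempotencyCoverage(lines) {
  let sideEffecting = 0;
  let keyed = 0;
  lines.forEach((line, i) => {
    if (!SIDE_EFFECT_VERBS.test(line)) return;
    sideEffecting += 1;
    // "Sibling line" is interpreted here as the line directly before or after.
    const neighbourhood = [lines[i - 1], line, lines[i + 1]].filter(Boolean);
    if (neighbourhood.some((l) => KEY_NAMES.test(l))) keyed += 1;
  });
  // No side-effecting calls at all counts as full coverage.
  const ratio = sideEffecting === 0 ? 1 : keyed / sideEffecting;
  return { sideEffecting, keyed, ratio, problem: ratio < 0.5 };
}
```

Substring matching like this will produce false positives (a line mentioning "postal" counts as a post). That is acceptable for a 30-second audit but not for alerting.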

The fix (3 lines, no vendor needed):

// At the top of the request, derive a stable key from the intent:
const idemKey = "ord_" + orderId + "_" + intentHash;
// stripe-node takes the key as a request option (second argument), not a body param:
const r = await stripe.charges.create(
  { amount, currency, customer },
  { idempotencyKey: idemKey }
);
logger.info("tool.charge", { idem_key: idemKey, charge_id: r.id });
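
The key only helps if every retry reuses it. Here is a minimal sketch of a retry helper that does that. `callWithStableKey` and `chargeCustomer` are hypothetical names, not part of any wrapper mentioned here.

```javascript
// Hypothetical helper: retries `call` but always passes the same idempotency key,
// so a provider that de-duplicates on the key sees one logical request.
async function callWithStableKey(call, idemKey, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await call(idemKey); // same key on every attempt
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Usage (chargeCustomer is a placeholder for your own side-effecting call):
// const r = await callWithStableKey((key) => chargeCustomer(order, key), idemKey);
```

The point of the design is that the key is derived once, outside the retry loop. A wrapper that generates a fresh key per attempt defeats the de-dup entirely.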

Why I added it as signal 6, not signal 2: I considered folding it into the existing "tool-call outcome" check, but the absence of a key is a cost signal more than an instrumentation signal. Missing the key doesn't mean your agent is silently failing; it means your agent is silently double-acting. Different blast radius, different fix, deserves its own row in the report.

Signal 7 (new in v2): Prompt-injection log shapes

This one is harder to detect from logs, but the patterns are real and the 2026 cost of missing them is high. (Per Palo Alto Unit 42 and the OWASP LLM Top 10, prompt injection attacks on agentic systems increased materially in 2026, and the failure mode is "the agent does something it wasn't supposed to do" — exactly the kind of thing a well-instrumented log should catch.)

The shape (3 sub-patterns the grader looks for):

7a. Adversarial steering phrases inside any line. Substrings like ignore previous, disregard all previous, you are now, system: you, new instructions, override all rules, do not mention, act as, pretend to be. If any of these appear inside a user-input or tool-result line, that line should be flagged for review before the agent acts on it.

7b. Tool call lines that don't bind to any intent line. A well-instrumented agent logs intent first (signal 1) and then logs every tool call with a reference back to the intent. If you see tool-call lines in the log without a corresponding intent line, or with too many tool calls per intent (suggesting the agent was steered mid-flow), that's a prompt-injection-shape signal.

7c. Undeclared tool names. A line containing tool.<name> where <name> doesn't appear in any "tools available: [...]" registration line. If your agent was told it has send_email, lookup_order, and refund, and a line then shows a tool.admin_delete_user call it was never told about, treat that as a candidate prompt-injection event even if the call was blocked.
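
Sub-patterns 7a and 7c are plain string checks, so they can be sketched in a few lines. This is an illustration rather than the grader's code. It scans every line instead of only user-input and tool-result lines, and it assumes the registration line has the literal "tools available: [...]" form.

```javascript
// Phrases are the ones listed under 7a.
const STEERING_PHRASES = [
  "ignore previous", "disregard all previous", "you are now", "system: you",
  "new instructions", "override all rules", "do not mention", "act as", "pretend to be",
];

// 7a: return the lines that contain any steering phrase (case-insensitive).
function findSteeringLines(lines) {
  return lines.filter((line) => {
    const lower = line.toLowerCase();
    return STEERING_PHRASES.some((p) => lower.includes(p));
  });
}

// 7c: compare tool.<name> mentions against names in "tools available: [...]" lines.
function findUndeclaredTools(lines) {
  const declared = new Set();
  const used = new Set();
  for (const line of lines) {
    const reg = line.match(/tools available:\s*\[([^\]]*)\]/i);
    if (reg) {
      reg[1]
        .split(",")
        .map((s) => s.trim().replace(/["']/g, ""))
        .filter(Boolean)
        .forEach((name) => declared.add(name));
    }
    for (const m of line.matchAll(/\btool\.([A-Za-z_][A-Za-z0-9_]*)/g)) used.add(m[1]);
  }
  return [...used].filter((name) => !declared.has(name));
}
```

If your logger uses event names like tool.bind or tool.charge that are not tool names, add them to an allowlist before trusting the 7c output. Short phrases like "act as" will also match benign text, so treat 7a hits as review candidates, not verdicts.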

The fix (a small dispatcher wrapper):

// sha256 helper built on Node's crypto module:
const { createHash } = require("node:crypto");
const sha256 = (text) => createHash("sha256").update(text).digest("hex");

// 1. Hash the intent line at tool-call time and log it:
const intentHash = sha256(intentLine);
logger.info("tool.bind", {
  intent_hash: intentHash,
  tool: "send_email",
  args_hash: sha256(JSON.stringify(args))
});
// 2. At review time: any tool call whose intent_hash doesn't match
//    a recent intent line is a candidate injection event.
//    Flag it for human review, do not auto-execute.
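
Step 2 can be sketched as a review-time pass over parsed log entries. The entry shapes and both thresholds below are hypothetical, chosen only to illustrate the 7b check (unbound tool calls, or too many calls per intent).

```javascript
// Hypothetical entry shapes: { event: "intent", intent_hash } and
// { event: "tool.bind", intent_hash, tool }. `window` and `maxCallsPerIntent`
// are illustrative thresholds, not values from the grader.
function findUnboundToolCalls(entries, { window = 5, maxCallsPerIntent = 10 } = {}) {
  const recentIntents = []; // most recent intent hashes, oldest first
  const callsPerIntent = new Map();
  const flagged = [];
  for (const e of entries) {
    if (e.event === "intent") {
      recentIntents.push(e.intent_hash);
      if (recentIntents.length > window) recentIntents.shift();
    } else if (e.event === "tool.bind") {
      if (!recentIntents.includes(e.intent_hash)) {
        flagged.push({ ...e, reason: "no matching recent intent" });
        continue;
      }
      const n = (callsPerIntent.get(e.intent_hash) || 0) + 1;
      callsPerIntent.set(e.intent_hash, n);
      if (n > maxCallsPerIntent) {
        flagged.push({ ...e, reason: "too many calls for one intent" });
      }
    }
  }
  return flagged;
}
```

Anything this returns is a candidate for human review, not proof of injection. A legitimately long task can exceed the per-intent threshold too.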

The grader doesn't try to catch prompt-injection in real time (you need a runtime guard for that). It checks whether your log would catch it after the fact, which is the audit signal a forensic reader of your logs (me, in the $149 checkup) would use to reconstruct what happened.

What I see when teams try v2 (early returns, 24h in)

The first ~40 v2 results split roughly as follows:

Most A-grades in v1 dropped to B or C in v2. Almost universally because of signal 6 (idempotency). The 5-signal v1 didn't ask about idempotency at all, so teams that had carefully instrumented the first 5 signals discovered their wrapper doesn't emit the key.
D-grades in v1 usually stay D or drop to F in v2. Same reason (idempotency) plus the new signal 7 often flags undeclared tool calls in the log.
One team that scored A in v1 also scored A in v2. They had idempotency keys everywhere and a dispatcher wrapper that bound tool calls to intent hashes. Rare, but the gold standard.

The average v2 grade is meaningfully lower than the average v1 grade, which is exactly what you'd expect when you add two signals that almost nobody has.

How to try it (free, no signup)

Open https://www.miloantaeus.com/silent-failure-audit.html
Paste 50ish lines from your agent's log (JSON lines, plain prose, or a mix — anything text-based)
Click "Grade my agent logs"
Get a 7-row report with an A-F grade

If you want the report as a one-page HTML email, drop your email. (Optional — you can also just look at the on-screen result.)

The whole thing runs client-side. I never see your log text. The only things the server learns are your email and the grade, and only if you choose to send the report.

What this isn't

This grader is not LangSmith, Langfuse, Helicone, or any other vendor. It's a 7-question check you can run in 30 seconds in your browser. For the full 30-page read of your actual production log archive — where I look for the specific silent-success drifts, double-charges, and prompt-injection events affecting your customers — the $149 AI Ops Checkup is the deeper version of this same 7-signal checklist.

But for the 80% case where the answer is "your agent is missing 2-3 of these signal classes, here's which ones, here's the 3-line fix for each," the free v2 grader is enough.

If you find a shape v2 misses, my email is in the footer of every report. v3 will probably be 9 signals.

— Milo Antaeus