weiseer

Posted on May 27 • Edited on May 31

Dogfooding an LLM agent eval pack on my own production agent — what 6-dim methodology surfaced

#ai #llm #agents #evaluation

I built a 20-case YAML eval pack for tool-using AI agents (the kind that call APIs / tools to do work). To test whether the methodology actually catches real failure modes, I applied it to my own production LLM-driven agent — one I've been running for months and had already documented 15+ failure modes for.

Result: about 80% of the eval pack's surface area (21 of 26 audit questions) was already covered by my agent's existing defenses, which I read as validation of the 6-dimension cut. The remaining 5 questions surfaced gaps that my agent's own failure-mode documentation didn't catalogue, 3 of them serious enough to add as v1.1 cases.

This post is about those gaps. They're worth knowing if you're building an LLM-driven agent.

What the pack is

Briefly: 20 YAML test cases across 6 dimensions (accuracy, safety, edge cases, prompt injection, hallucination, cost efficiency). Each case is a YAML file describing a failure mode, the expected agent behavior, and deterministic evaluation rules (no LLM judge, so you can run them without paying for an external "judge model").
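To make "deterministic evaluation rules" concrete, here is a minimal sketch of how such rules could be checked with plain string and count comparisons. The case layout, the rule names (`must_not_call_tool`, `must_not_contain`, `max_tool_calls`) and the `AgentRun` shape are all hypothetical illustrations, not the pack's actual schema; the case is written as the dict a YAML loader would produce.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """What the agent did for one test case (hypothetical shape)."""
    final_text: str
    tool_calls: list[str] = field(default_factory=list)


# Hypothetical case, shown as the dict a YAML loader would return.
CASE = {
    "id": "injection-001",
    "dimension": "prompt_injection",
    "rules": [
        {"type": "must_not_call_tool", "tool": "send_email"},
        {"type": "must_not_contain", "text": "BEGIN SYSTEM PROMPT"},
        {"type": "max_tool_calls", "limit": 3},
    ],
}


def check_rule(rule: dict, run: AgentRun) -> bool:
    """Evaluate one rule with plain comparisons; no judge model involved."""
    kind = rule["type"]
    if kind == "must_not_call_tool":
        return rule["tool"] not in run.tool_calls
    if kind == "must_not_contain":
        return rule["text"].lower() not in run.final_text.lower()
    if kind == "max_tool_calls":
        return len(run.tool_calls) <= rule["limit"]
    raise ValueError(f"unknown rule type: {kind}")


def evaluate(case: dict, run: AgentRun) -> list[str]:
    """Return the types of the rules that failed; an empty list means pass."""
    return [r["type"] for r in case["rules"] if not check_rule(r, run)]
```

Because every rule is a pure function of the recorded run, the same run always produces the same verdict, which is what makes the checks usable in CI.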

Free 5-case starter on GitHub:
https://github.com/weiseer/ai-agent-qa-eval-pack-starter

Paid 20-case pack:
https://weiseer.gumroad.com/l/dcipxt

What it means to "dogfood" against an existing agent

My agent is an LLM-driven generator embedded in a larger quantitative system. The LLM proposes candidates; downstream deterministic code validates and acts on them. The agent isn't generic chat — it's tool-using in the structural sense (typed schema in/out, downstream consumers).

I applied the 6-dimension methodology to this LLM subsystem by walking through it mentally and by code review:

- Walked through each of the 26 audit questions (4-6 per dimension)
- Cited the file/line where a defense exists, or flagged "no defense visible"

After ~45 minutes of disciplined read-only review:

- 21 of 26 questions: existing defense ✓
- 5 questions: gap of some severity

The 5 gaps (severity-ordered)

Gap 1 (MEDIUM) — LLM cost cap was logged, not enforced

I had a $X/day cap on the LLM subsystem in my design docs. The code path:

- Logged every API call's cost to a per-cycle audit YAML file
- Did NOT check cumulative spend before the next call

So if anything misbehaved (large response, retry loop, prompt cache miss across a batch), the daily total would silently overshoot. Detection would happen the next morning during log review — which is "fast" for governance, but slow for damage containment.

The eval pack's "detection-quality" axis explicitly tests for this: the system must catch a fault faster than the fault spreads. Logging-but-not-enforcing fails that axis.

Lesson generalized: if your spec says "stay under $X", write the code that says if today_spend >= X: abort(), not just the code that says log(today_spend). The eval methodology made me notice the gap.
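A minimal sketch of what "enforced, not just logged" could look like. The names `DailyBudget`, `BudgetExceeded` and `guarded_call` are hypothetical, the state is in-memory only, and the daily reset and persistence are left out; `call_llm` is assumed to return a `(response, actual_cost_usd)` pair.

```python
class BudgetExceeded(RuntimeError):
    """Raised before a call that would push spend past the cap."""


class DailyBudget:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def check(self, estimated_cost_usd: float = 0.0) -> None:
        """Abort if the cap is already hit or this call would exceed it."""
        projected = self.spent_usd + estimated_cost_usd
        if self.spent_usd >= self.cap_usd or projected > self.cap_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.2f} of {self.cap_usd:.2f} USD"
            )

    def record(self, actual_cost_usd: float) -> None:
        """Add the real cost of a completed call to today's total."""
        self.spent_usd += actual_cost_usd


def guarded_call(budget: DailyBudget, call_llm, estimated_cost_usd: float):
    """Check the budget first, then call, then record the actual cost."""
    budget.check(estimated_cost_usd)
    response, actual_cost_usd = call_llm()
    budget.record(actual_cost_usd)
    return response
```

The important property is ordering: the check runs before the API call, so a retry loop stops at the cap instead of being discovered in the next morning's logs.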

Gap 2 (MEDIUM) — Predicted vs actual self-assessment drift wasn't tracked

My agent emits self-assessments along with its proposals — predicted success score, expected outcome quality. Downstream validation produces actual measurements. So far so good: prediction vs ground truth, well-separated.

What I didn't have: monitoring of the DELTA between predicted and actual over time.

If the LLM systematically over-claims by 30% across 100 proposals, no single proposal triggers an alert (each one passes downstream validation independently). The drift between LLM prediction and ground truth stays invisible, and the LLM's predictions silently lose calibration.

The fix is meta-monitoring: track the rolling delta. If the 30-day moving mean of (predicted - actual) starts climbing, the model needs a reset, a re-prompt, or an explicit calibration constraint.
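A minimal sketch of that rolling-delta monitor. It assumes scores on a shared numeric scale and, for simplicity, uses a window of the last N proposals rather than 30 calendar days; `CalibrationMonitor`, the window size and the threshold are hypothetical.

```python
from collections import deque
from statistics import fmean


class CalibrationMonitor:
    """Tracks mean(predicted - actual) over the most recent proposals."""

    def __init__(self, window: int = 100, max_mean_gap: float = 0.10):
        self.gaps = deque(maxlen=window)  # oldest gaps fall off automatically
        self.max_mean_gap = max_mean_gap

    def record(self, predicted: float, actual: float) -> None:
        """Store one gap; positive means the LLM over-claimed."""
        self.gaps.append(predicted - actual)

    def mean_gap(self) -> float:
        return fmean(self.gaps) if self.gaps else 0.0

    def is_drifting(self) -> bool:
        """Alert only once the window is full, to avoid noisy early alarms."""
        window_full = len(self.gaps) == self.gaps.maxlen
        return window_full and self.mean_gap() > self.max_mean_gap
```

This only catches systematic over-claiming; a symmetric check on the absolute gap would be needed to catch under-claiming or growing noise.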

Gap 3 (MEDIUM) — Parallel workers without pre-call diversity planning

My agent dispatches multiple LLM workers in parallel (one "seed" generator + several "variant" generators), each with the same prompt. I had a POST-call diversity gate: compute set distance between worker outputs, reject too-similar candidates.

But the diversity gate runs AFTER all workers have completed. If they converge, I've paid N× the cost for ~1 unique result.
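For reference, a post-call gate of the kind described above might look like the sketch below. Using Jaccard distance as the "set distance" is an assumption, as are the function names and the threshold; each candidate is represented as a set of string tokens.

```python
def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def diversity_gate(
    candidates: list[set[str]], min_distance: float = 0.5
) -> list[set[str]]:
    """Keep a candidate only if it is far enough from every one already kept."""
    kept: list[set[str]] = []
    for cand in candidates:
        if all(jaccard_distance(cand, k) >= min_distance for k in kept):
            kept.append(cand)
    return kept
```

Note that the gate can only discard: every rejected candidate was already paid for by the time it runs.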

The fix is pre-call diversity planning: explicitly assign each worker an anchor before they fire (worker_1 → category A, worker_2 → category B, ...). This forces structural diversity instead of relying on luck.
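A minimal sketch of pre-call anchor assignment. The worker IDs, the category names and the wording appended to the prompt are hypothetical; categories are reused round-robin when there are more workers than categories.

```python
from itertools import cycle


def plan_anchors(worker_ids: list[str], categories: list[str]) -> dict[str, str]:
    """Assign one category per worker before any LLM call is made."""
    if not categories:
        raise ValueError("need at least one category")
    # cycle() repeats categories if workers outnumber them.
    return dict(zip(worker_ids, cycle(categories)))


def build_prompt(base_prompt: str, anchor: str) -> str:
    """Same base prompt for every worker, plus a per-worker anchor line."""
    return f"{base_prompt}\n\nFocus this proposal on: {anchor}."
```

Usage would be along the lines of `plan_anchors(["seed", "variant_1", "variant_2"], ["A", "B", "C"])`, then one `build_prompt` call per worker. Appending the anchor after the shared base prompt keeps that shared prefix identical across workers.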

Gap 4 (LOW) — Full-prompt retry vs corrective retry

When my agent's output fails validation (say, it references a non-existent feature), the retry resends the full original prompt. With Anthropic prompt caching the input is cheap, but the output is fully re-sampled. That is a ~5-10% cost penalty per retry, which could be avoided by including the specific correction in the retry prompt ("you mentioned feature X, which doesn't exist; valid features are: ...").
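A minimal sketch of building that corrective message from the validator's findings. The function names and the exact wording are hypothetical; the idea is only that the retry carries the specific error and the valid options instead of repeating the original request unchanged.

```python
def find_unknown_features(referenced: list[str], valid: set[str]) -> list[str]:
    """Return referenced feature names that are not in the valid set."""
    return sorted(set(referenced) - valid)


def corrective_retry_message(unknown: list[str], valid: set[str]) -> str:
    """Build a follow-up message that names the error and the valid options."""
    bad = ", ".join(unknown)
    allowed = ", ".join(sorted(valid))
    return (
        f"Your previous answer referenced features that do not exist: {bad}. "
        f"Valid features are: {allowed}. "
        "Revise the proposal using only valid features."
    )
```

Sorting both lists keeps the message deterministic for the same validation failure, which makes retries easier to log and compare.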

Gap 5 (ADVISORY) — Scope adherence via prompt text only

My system prompt instructs the LLM to span certain conceptual zones. There's no programmatic check that the actual outputs are distributed across those zones. Downstream validators catch many ways this can go wrong, but not pattern drift across cycles.
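A minimal sketch of such a programmatic check, assuming each output has already been tagged with a zone label by deterministic code. The zone names, function names and the `min_share` threshold are hypothetical.

```python
from collections import Counter


def zone_shares(zone_labels: list[str], expected_zones: list[str]) -> dict[str, float]:
    """Fraction of outputs that landed in each expected zone."""
    counts = Counter(zone_labels)
    total = len(zone_labels)
    return {z: (counts[z] / total if total else 0.0) for z in expected_zones}


def underrepresented(
    zone_labels: list[str], expected_zones: list[str], min_share: float = 0.1
) -> list[str]:
    """Zones whose share of recent outputs fell below min_share."""
    shares = zone_shares(zone_labels, expected_zones)
    return [z for z in expected_zones if shares[z] < min_share]
```

Run over the labels from the last several cycles rather than a single cycle, this turns "span these zones" from a prompt instruction into something that can raise an alert.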

What the gaps have in common

All 5 gaps are meta-monitoring gaps, not architecture bugs. The agent's individual components do their jobs correctly. What was missing: cross-call patterns, cross-time drift, cumulative-cost tracking — the layer above the individual call.

This generalizes: LLM-system reliability is built bottom-up (per-call correctness) but the failures that bite production are top-down (cumulative drift / cumulative cost / cumulative diversity loss). Most engineers (myself included) build the bottom layer first. The eval pack methodology pulled my attention to the top layer.

Why this validates the eval pack's framework, not undermines it

It's tempting to read "80% already covered" as "the pack didn't help much." That's the wrong frame. The right frame:

- The 6 dimensions are the right cuts. A mature engineer building an LLM agent will hit the 6 cuts independently.
- The pack codifies those cuts. New builders don't have to rediscover them.
- The methodology surfaces blind spots even in agents whose builders already think carefully about failure modes.

Anyone who built an LLM agent without hitting at least one of these gaps either:
- Got lucky
- Hasn't been in production long enough yet
- Or built something simpler than what they think they built

The pack's value proposition is: 10-30 hours of disciplined failure-mode thinking compressed into 20 YAML files that you can read in an hour and apply to your own agent with about 3 lines of glue code per case.

If you build LLM agents and want to compress your "production hardening" timeline:

Free 5-case starter (CC BY 4.0): https://github.com/weiseer/ai-agent-qa-eval-pack-starter
Full 23-case pack: weiseer.gumroad.com/l/dcipxt (launch week: code LAUNCH7 → $29)
Domestic payment (China): dl.weiseer.com/pay

v1.1 cases adding the 3 gaps above are queued for the next release.

Built solo. Refund 7 days, no questions asked. If you've built an agent and want to compare your defenses against this list, reply or DM with what failure mode you'd add as case #21.

Test your agent for failures like this in CI — free, deterministic, no LLM-as-judge:

Free 5-case starter (MIT): https://github.com/weiseer/ai-agent-qa-eval-pack-starter
Failure-mode guides (how to test each): https://guides.weiseer.com/
Get new cases + the 6-dimension cheatsheet (free): https://dl.weiseer.com/cases
Full 28-case OWASP-Agentic pack: https://weiseer.gumroad.com/l/dcipxt

Top comments (1)

Tae Kim • Jun 21

The faithfulness dimension surprised me most when I first measured it on a RAG agent. I assumed faithfulness failures would be obvious wrong answers, but most were subtle scope creeps: the agent answered a slightly broader question than asked, technically correct but not grounded in the retrieved context. The eval only caught it because I was running exact substring checks against retrieved passages rather than embedding similarity. Your 6-dim framing forces you to define what faithful means in your domain before you can measure it, which is the harder problem.