seanyan

Posted on May 20

Why Your AI Agent Keeps Making the Same Mistakes (And a Structured Fix)

#ai #productivity #agents

If you've worked with AI agent skills — Hermes, Claude Code, LangChain, whatever — you know the cycle. Something breaks. You (or your agent) jot down a fix somewhere in the docs. After a skill has been through a few iteration cycles, you've accumulated 30+ fix notes scattered across the document — no categories, no priority, and your agent still can't find the right fix when something goes wrong.

I ran into this while maintaining production agent skills. As far as I can tell, nobody has systematically addressed how to organize error-related knowledge inside skills. Maybe someone has and I just haven't found it, but the angle was missing from everything I looked at. Here's what worked for me.

What Everyone Else Is Doing

I looked around the usual places — official forums, technical blogs, developer communities. Here's roughly what I found:

| Player | What They Cover | What's Missing |
|--------|-----------------|----------------|
| Hermes official docs | `## Pitfalls` section header, one line of guidance | No format spec, no size limits |
| agentskills.io best practices | "Gotchas" as flat bullet lists | No decision tables, no structure |
| Claude Code community (Jenny Ouyang's SSOT Audit) | Skill auditing tools: broken paths, orphans, duplicates | No content-level governance: no rules for how pitfalls should be categorized, formatted, or sized |
| "AI Skills in Production" guide (Tao An) | System reliability, circuit breakers, structured output | Knowledge document governance |

One pattern stands out: everyone checks whether skill files resolve correctly and whether content is duplicated. But nobody asks — when an agent hits an error mid-execution, can it efficiently find the correct fix in SKILL.md or its reference files?

"Won't bigger models just handle this?"

Fair question. If context windows keep growing, do we even need structure?

Partially. Bigger contexts solve reading — the model can ingest everything. But they don't solve attention. Anthropic's own context engineering research documents the "lost in the middle" effect: critical details buried in a long flat list get overlooked. Every irrelevant item your agent has to scan past is noise competing with the signal it actually needs.

Think of it this way: you could give someone a 500-page manual or a 1-page troubleshooting card. They can read both, but the card is faster. This is an efficiency problem, not a capability problem.

What Worked for Me

After banging my head against this for a while, I landed on five rules for organizing error documentation in skills (PITFALL Rules):

Rule 1: Categorize when >5 items

Group by priority. The specific names depend on what your skill handles, but the structure looks like:

| Priority | What Goes Here | Example Categories |
|----------|----------------|--------------------|
| Highest | Symptom → diagnosis → fix decision tables | Anomaly Diagnosis, Error Recovery |
| High | Tool-specific gotchas, API quirks | Tool Traps, Integration Gotchas |
| Medium | Data formats, parameter edge cases | Data & Parameters, Input Validation |
| Low | Project conventions, naming rules | Project Conventions, Style Rules |

Key principle: anomaly diagnosis always goes first, because that's what agents need when something breaks.
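As a rough illustration of Rule 1, here is a minimal Python sketch that keeps a short list flat and switches to priority-ordered categories once it grows past five items. The function name `render_pitfalls`, the `###` sub-headings, and the choice of one example category per priority level are assumptions made for this example, not part of any skill framework.

```python
from collections import defaultdict

# Hypothetical category order, highest priority first. The names are example
# categories from the table above, not a fixed vocabulary.
PRIORITY = ["Anomaly Diagnosis", "Tool Traps", "Data & Parameters", "Project Conventions"]


def render_pitfalls(items: list[tuple[str, str]], limit: int = 5) -> str:
    """Render (category, text) pairs as a pitfall section.

    Up to `limit` items stay a flat list; beyond that, items are grouped
    by category and the categories are emitted in priority order.
    """
    lines = ["## Pitfalls"]
    if len(items) <= limit:
        lines.extend(f"- {text}" for _, text in items)
        return "\n".join(lines)

    grouped: dict[str, list[str]] = defaultdict(list)
    for category, text in items:
        grouped[category].append(text)

    # Known categories first, in priority order. Unknown categories are kept
    # at the end so that no item is silently dropped.
    ordered = [c for c in PRIORITY if c in grouped]
    ordered += sorted(c for c in grouped if c not in PRIORITY)

    for category in ordered:
        lines.append(f"### {category}")
        lines.extend(f"- {text}" for text in grouped[category])
    return "\n".join(lines)
```

Whatever order items were appended in, the diagnosis category lands at the top of the rendered section.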

Rule 2: Decision tables for diagnostics

Instead of prose like "sometimes the task times out because there's too much data and you should split it", write this:

| Symptom | Diagnosis | Fix |
|---------|-----------|-----|
| Task timeout, no output written | Check if output file exists | Don't retry same config. Split input and rerun |
| Same input fails ≥3 times | Persistent bottleneck | Bypass delegation, process directly in main session |

An agent can scan the Symptom column, find a match, and read the Fix — typically 1-2 steps instead of reading through 30 prose items.
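To make the scan-and-match step concrete, here is a small Python sketch that parses the table above and looks up a row by symptom. The keyword-overlap matching and the names `Row`, `parse_decision_table`, and `find_fix` are assumptions for illustration only. A real agent does this matching in context rather than in code, but the lookup has the same shape.

```python
import re
from typing import NamedTuple, Optional


class Row(NamedTuple):
    symptom: str
    diagnosis: str
    fix: str


TABLE = """\
| Symptom | Diagnosis | Fix |
|---------|-----------|-----|
| Task timeout, no output written | Check if output file exists | Don't retry same config. Split input and rerun |
| Same input fails ≥3 times | Persistent bottleneck | Bypass delegation, process directly in main session |
"""


def parse_decision_table(markdown: str) -> list[Row]:
    """Parse a three-column pipe table, skipping the header and separator rows."""
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip().startswith("|")]
    rows = []
    for line in lines[2:]:  # lines[0] is the header, lines[1] the |---| separator
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) == 3:
            rows.append(Row(*cells))
    return rows


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_fix(rows: list[Row], observed: str) -> Optional[Row]:
    """Return the row whose symptom shares the most words with the observed error.

    Returns None when no row shares any word, so the caller can report
    "no match" instead of forcing an answer.
    """
    observed_words = _words(observed)
    best, best_overlap = None, 0
    for row in rows:
        overlap = len(_words(row.symptom) & observed_words)
        if overlap > best_overlap:
            best, best_overlap = row, overlap
    return best


rows = parse_decision_table(TABLE)
match = find_fix(rows, "Task timeout with no output")
print(match.fix if match else "no matching row")
```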

Rule 3: Don't duplicate the skill body (when applicable)

This applies to operation-manual-style skills — ones with detailed step-by-step instructions and inline warnings (⚠️). If your skill body already explains a gotcha right next to the relevant operation, the pitfall section should give a one-line cross-reference only. Pitfalls cover blind spots the body doesn't address.

For reference-style or API-doc-style skills where the body doesn't include inline warnings, this rule doesn't apply — all error knowledge goes in the pitfall section.

Rule 4: Pre-append checklist

Before adding any item: Is the section already categorized? → Which category? → Does it need a decision table? → Does it duplicate an existing item? → Too many items? Consider splitting to a separate file.
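The "does it duplicate an existing item?" step is the easiest one to automate. Below is a minimal sketch using Python's standard `difflib`. The function name `find_duplicate` and the 0.8 similarity threshold are assumptions to tune against your own pitfall text, not values from the project.

```python
from difflib import SequenceMatcher
from typing import Optional


def find_duplicate(new_item: str, existing: list[str], threshold: float = 0.8) -> Optional[str]:
    """Return the most similar existing item if its similarity meets the threshold.

    Similarity is difflib's character-level ratio on lowercased text, which
    catches near-verbatim repeats but not paraphrases.
    """
    best, best_ratio = None, 0.0
    for item in existing:
        ratio = SequenceMatcher(None, new_item.lower(), item.lower()).ratio()
        if ratio > best_ratio:
            best, best_ratio = item, ratio
    return best if best_ratio >= threshold else None
```

A hit means the new note should probably be merged into the existing item rather than appended as a new one.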

Rule 5: Quality audit

After refactoring, check 7 dimensions: categorized, decision tables for diagnostics, no flat lists >5, no duplicates, <50% body-text overlap, no prose narratives, no information loss.
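Some of these dimensions can be checked mechanically. As one example, the following Python sketch flags flat lists longer than five items by counting runs of consecutive bullet lines. The helper names are hypothetical and the bullet regex is a simplification of markdown list syntax. The other dimensions, such as information loss, still need a human or agent review.

```python
import re

# Matches "-", "*", "+" or "1." style list markers followed by whitespace.
BULLET = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+")


def flat_list_lengths(markdown: str) -> list[int]:
    """Return the length of each run of consecutive bullet lines."""
    runs, current = [], 0
    for line in markdown.splitlines():
        if BULLET.match(line):
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:  # the document may end in the middle of a list
        runs.append(current)
    return runs


def oversized_lists(markdown: str, limit: int = 5) -> list[int]:
    """Return the lengths of all lists that exceed the limit."""
    return [n for n in flat_list_lengths(markdown) if n > limit]
```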

Does It Actually Work? An A/B Experiment

I wanted data, not just intuition. So I ran A/B tests.

How the test works

I picked three real errors from production agent skills. For each error, I constructed two isolated contexts — one group of agents only saw the old flat-list pitfall documentation, another only saw the new structured version. Same error description, same prompt. Multiple runs per version to reduce randomness.

Scoring

Each test run is scored on 4 dimensions:

| Dimension | Scale | Meaning |
|-----------|-------|---------|
| Diagnosis Accuracy | 0-2 | 0 = wrong, 1 = partially correct, 2 = pinpointed root cause |
| Fix Accuracy | 0-2 | 0 = wrong fix, 1 = right direction but not actionable, 2 = correct and executable |
| Wrong Suggestions | N (penalty) | Number of misleading or irrelevant suggestions |
| Reasoning Steps | N (penalty if >3) | How many intermediate steps from problem to fix |

Overall Score = Diagnosis + Fix − Wrong Suggestions − (Steps > 3 ? 1 : 0). Max: 4. Min: -2.
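For readers who want to reuse the rubric, the formula translates directly into a few lines of Python. This is a sketch of the formula as written above. The function name and the input validation are additions for this example, and the result is not clamped.

```python
def overall_score(diagnosis: int, fix: int, wrong_suggestions: int, steps: int) -> int:
    """Overall Score = Diagnosis + Fix - Wrong Suggestions - (1 if Steps > 3 else 0)."""
    if diagnosis not in (0, 1, 2) or fix not in (0, 1, 2):
        raise ValueError("diagnosis and fix are scored on a 0-2 scale")
    if wrong_suggestions < 0 or steps < 0:
        raise ValueError("counts cannot be negative")
    step_penalty = 1 if steps > 3 else 0
    return diagnosis + fix - wrong_suggestions - step_penalty


# A perfect run: correct diagnosis, executable fix, no noise, three steps.
print(overall_score(diagnosis=2, fix=2, wrong_suggestions=0, steps=3))  # prints 4
```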

Scenario A — Task Timeout

Error: Two sub-agents both timed out at 600 seconds with no output
Old version: Fix buried as item #2 in a 32-item flat list, described in prose as "occasionally succeeds, occasionally times out"
New version: Decision table placed as the first category, Fix column directly says "split to 1 item per task"
Shared behavior: Both versions correctly diagnosed the root cause — dense analysis workload + 2-item batch config
Old version's quirk: Misunderstood "batch" as a 2-item unit, suggested splitting into 2 sub-agents instead of 4; averaged 3.7 reasoning steps
New version's advantage: No ambiguity in the Fix column; averaged 3.0 reasoning steps

Score: Flat list 3.0 avg → Structured 4.0 avg (+1.0)

Scenario B — Session Connection Lost

Error: After 2 successful operations, connection broke mid-task — session appeared alive in listings but commands silently failed
Old version: Relevant information scattered across 4 separate sections (24 items total), had to be pieced together
New version: Consolidated into one decision table — symptom → diagnosis → fix at a glance
Shared behavior: Both versions tried to match the error against known patterns
Old version's quirk: Diagnosed as "session ID changed after navigation" — plausible but wrong, assembled from fragments across 4 sections
New version's advantage: Found a partial match against the decision table, explicitly flagged it as partial, gave a conservative fix (stop and report) plus a meta-suggestion to add a new row for this scenario

Score: Flat list 1.0 → Structured 3.0 (+2.0)

Scenario C — Duplicate Database Records

Error: Write operation created duplicate rows instead of replacing existing ones, due to a missing unique constraint
Old version: Last item in an 11-item flat list, but the item itself was clearly written
New version: Placed under a "Database" category with a fixed position
Shared behavior: Both versions scored perfectly — diagnosis and fix both correct
Old version's quirk: The item was well-written enough to find regardless of position
New version's advantage: Category placement makes location more predictable, but in this case it didn't matter

Score: Flat list 4.0 → Structured 4.0 (0)

Summary

| Metric | Flat List | Structured | Change |
|--------|-----------|------------|--------|
| Overall Score | 2.67 | 3.67 | +37% |
| Fix Accuracy | 1.67 | 2.0 | +20% |
| Reasoning Steps | 3.6 | 2.7 | -25% |
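The Overall Score row can be reproduced from the three per-scenario scores reported above. This short Python check shows the arithmetic. The unrounded change works out to 37.5%, which the table reports as +37%. The per-scenario values behind the other two rows are not all listed in this post, so only the overall score is recomputed here.

```python
from statistics import mean

# Per-scenario overall scores (A, B, C) as reported in the sections above.
flat = [3.0, 1.0, 4.0]
structured = [4.0, 3.0, 4.0]


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


flat_avg = mean(flat)
structured_avg = mean(structured)
change = percent_change(flat_avg, structured_avg)

print(f"{flat_avg:.2f} -> {structured_avg:.2f} ({change:+.1f}%)")
# prints: 2.67 -> 3.67 (+37.5%)
```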

Three findings:

1. Scattered information = biggest win. Scenario B went from 24 items scattered across 4 sections to 1 decision table, a +2.0 improvement. When error knowledge is scattered across a document, agents waste steps piecing together clues.

2. Decision tables remove ambiguity. The flat-list version described "occasionally succeeds, occasionally times out." An agent misunderstood and gave the wrong fix. The decision table directly says what to do — no room for misinterpretation.

3. Decision tables make agents honest. When no table row perfectly matched the symptoms, the agent said "partial match" instead of forcing a wrong answer. Flat lists don't encourage this kind of honesty.

Try It

The whole project is up on my repo — feedback welcome: github.com/seanyan1984/skill-pitfalls

It's framework-agnostic — the rules work for any markdown-based skill or prompt documentation. If your error knowledge is growing wild in your agent skills, give it a try.
