DEV Community

seanyan

Why Your AI Agent Keeps Making the Same Mistakes (And a Structured Fix)

If you've worked with AI agent skills — Hermes, Claude Code, LangChain, whatever — you know the cycle. Something breaks. You (or your agent) jot down a fix somewhere in the docs. After a skill has been through a few iteration cycles, you've accumulated 30+ fix notes scattered across the document — no categories, no priority, and your agent still can't find the right fix when something goes wrong.

I ran into this while maintaining production agent skills. As far as I can tell, nobody has systematically addressed how to organize error-related knowledge inside skills. Maybe someone has and I just haven't found it, but in everything I looked at, this angle was missing. Here's what worked for me.

What Everyone Else Is Doing

I looked around the usual places — official forums, technical blogs, developer communities. Here's roughly what I found:

| Source | What They Cover | What's Missing |
|--------|-----------------|----------------|
| Hermes official docs | ## Pitfalls section header, one line of guidance | No format spec, no size limits |
| agentskills.io best practices | "Gotchas" as flat bullet lists | No decision tables, no structure |
| Claude Code community (Jenny Ouyang's SSOT Audit) | Skill auditing tools: broken paths, orphans, duplicates | No content-level governance: no rules for how pitfalls should be categorized, formatted, or sized |
| "AI Skills in Production" guide (Tao An) | System reliability, circuit breakers, structured output | Knowledge document governance |

One pattern stands out: everyone checks whether skill files resolve correctly and whether content is duplicated. But nobody asks — when an agent hits an error mid-execution, can it efficiently find the correct fix in SKILL.md or its reference files?

"Won't bigger models just handle this?"

Fair question. If context windows keep growing, do we even need structure?

Partially. Bigger contexts solve reading: the model can ingest everything. But they don't solve attention. Long-context research describes a "lost in the middle" effect (Liu et al., 2023), and Anthropic's context engineering guidance makes a related point about attention being a limited budget: critical details buried in a long flat list get overlooked. Every irrelevant item your agent has to scan past is noise competing with the signal it actually needs.

Think of it this way: you could give someone a 500-page manual or a 1-page troubleshooting card. They can read both, but the card is faster. This is an efficiency problem, not a capability problem.

What Worked for Me

After banging my head against this for a while, I landed on five rules for organizing error documentation in skills (PITFALL Rules):

Rule 1: Categorize when >5 items

Group by priority. The specific names depend on what your skill handles, but the structure looks like:

| Priority | What Goes Here | Example Categories |
|----------|----------------|--------------------|
| Highest | Symptom → diagnosis → fix decision tables | Anomaly Diagnosis, Error Recovery |
| High | Tool-specific gotchas, API quirks | Tool Traps, Integration Gotchas |
| Medium | Data formats, parameter edge cases | Data & Parameters, Input Validation |
| Low | Project conventions, naming rules | Project Conventions, Style Rules |

Key principle: anomaly diagnosis always goes first, because that's what agents need when something breaks.
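
Here is a minimal Python sketch of Rule 1, assuming pitfalls are kept as small records before being rendered to markdown. The Pitfall class, the function names, and the numeric priority mapping are hypothetical; only the four priority levels and the ">5 items" threshold come from the rule itself.

```python
from dataclasses import dataclass

# Priority levels from the table above; a lower number is rendered earlier.
PRIORITY = {"Highest": 0, "High": 1, "Medium": 2, "Low": 3}
FLAT_LIST_LIMIT = 5  # Rule 1: categorize once there are more than 5 items


@dataclass
class Pitfall:
    text: str
    category: str  # e.g. "Anomaly Diagnosis"
    priority: str  # one of the PRIORITY keys


def needs_categories(pitfalls):
    """Rule 1 trigger: a flat list is acceptable only up to 5 items."""
    return len(pitfalls) > FLAT_LIST_LIMIT


def group_by_priority(pitfalls):
    """Return {category: [texts]} with the highest-priority categories first."""
    # sorted() is stable, so items keep their original order within a priority.
    ordered = sorted(pitfalls, key=lambda p: PRIORITY[p.priority])
    grouped = {}
    for p in ordered:
        grouped.setdefault(p.category, []).append(p.text)
    return grouped
```

Because Python dicts preserve insertion order, rendering the result top to bottom puts the diagnosis categories first, which is the ordering the key principle asks for.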

Rule 2: Decision tables for diagnostics

Instead of prose like "sometimes the task times out because there's too much data and you should split it", write this:

| Symptom | Diagnosis | Fix |
|---------|-----------|-----|
| Task timeout, no output written | Check if output file exists | Don't retry same config. Split input and rerun |
| Same input fails ≥3 times | Persistent bottleneck | Bypass delegation, process directly in main session |

An agent can scan the Symptom column, find a match, and read the Fix — typically 1-2 steps instead of reading through 30 prose items.
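
To make the "scan the Symptom column" idea concrete, here is a small Python sketch that parses the table above and looks rows up by symptom. In practice the agent does the matching itself; the parse_table and match_symptom helpers and the substring match are assumptions for illustration only.

```python
TABLE = """\
| Symptom | Diagnosis | Fix |
|---------|-----------|-----|
| Task timeout, no output written | Check if output file exists | Don't retry same config. Split input and rerun |
| Same input fails ≥3 times | Persistent bottleneck | Bypass delegation, process directly in main session |
"""


def parse_table(markdown):
    """Turn a pipe table into a list of {header: cell} dicts."""

    def split_row(line):
        # Drop the outer pipes, then split the remaining cells.
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    lines = [ln for ln in markdown.strip().splitlines() if ln.strip()]
    headers = split_row(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2.
    return [dict(zip(headers, split_row(ln))) for ln in lines[2:]]


def match_symptom(rows, query):
    """Return the rows whose Symptom cell contains `query` (case-insensitive)."""
    q = query.lower()
    return [row for row in rows if q in row["Symptom"].lower()]


rows = parse_table(TABLE)
for row in match_symptom(rows, "timeout"):
    print(row["Diagnosis"], "->", row["Fix"])
```

The point of the structure is that every row has the same three fields, so a lookup touches one column and returns one fix, instead of reading every prose item.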

Rule 3: Don't duplicate the skill body (when applicable)

This applies to operation-manual-style skills — ones with detailed step-by-step instructions and inline warnings (⚠️). If your skill body already explains a gotcha right next to the relevant operation, the pitfall section should give a one-line cross-reference only. Pitfalls cover blind spots the body doesn't address.

For reference-style or API-doc-style skills where the body doesn't include inline warnings, this rule doesn't apply — all error knowledge goes in the pitfall section.

Rule 4: Pre-append checklist

Before adding any item: Is the section already categorized? → Which category? → Does it need a decision table? → Does it duplicate an existing item? → Too many items? Consider splitting to a separate file.
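
The checklist can also be walked mechanically. The Python sketch below is one possible encoding, not a reference implementation: the section layout, the UNCATEGORIZED marker, the exact-match duplicate test, and the max_items default are all assumptions, since the rule does not give a number for "too many items".

```python
UNCATEGORIZED = "uncategorized"  # hypothetical marker for a still-flat section


def pre_append_check(section, item, category, is_diagnostic, max_items=30):
    """Walk the Rule 4 checklist and return the actions to take before appending.

    section: {category_name: [existing item texts]}
    max_items: assumed split threshold; the rule itself names no number.
    """
    actions = []

    # Is the section already categorized? If so, which category?
    if set(section) == {UNCATEGORIZED}:
        actions.append("categorize the section first")
    elif category not in section:
        actions.append(f"create category '{category}'")

    # Does it need a decision table?
    if is_diagnostic:
        actions.append("write it as a decision-table row")

    # Does it duplicate an existing item? (exact match after normalizing)
    normalized = item.strip().lower()
    existing = [t.strip().lower() for texts in section.values() for t in texts]
    if normalized in existing:
        actions.append("duplicate: merge with the existing item")

    # Too many items?
    if len(existing) + 1 > max_items:
        actions.append("consider splitting into a separate file")

    return actions
```

An empty result means the item can be appended as is; anything else is a to-do list for whoever (or whatever agent) is editing the skill.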

Rule 5: Quality audit

After refactoring, check 7 dimensions: categorized, decision tables for diagnostics, no flat lists >5, no duplicates, <50% body-text overlap, no prose narratives, no information loss.
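
Three of the seven dimensions are easy to check automatically. The Python sketch below covers those three only and rests on assumed definitions: a "flat list" is any single category's item list, a duplicate is an exact match after normalizing case and whitespace, and overlap is the share of items whose text appears verbatim in the skill body. The remaining dimensions need a human or an LLM reviewer.

```python
def audit(section, body_text):
    """Check three mechanically testable Rule 5 dimensions.

    section: {category_name: [item texts]}
    body_text: the skill body, used for the overlap check.
    """
    items = [t.strip().lower() for texts in section.values() for t in texts]
    body = body_text.lower()

    # Assumed overlap metric: fraction of items found verbatim in the body.
    overlap = sum(t in body for t in items) / len(items) if items else 0.0

    return {
        "no_flat_lists_over_5": all(len(texts) <= 5 for texts in section.values()),
        "no_duplicates": len(items) == len(set(items)),
        "body_overlap_under_50pct": overlap < 0.5,
    }
```

Any False value in the returned dict points at the dimension to revisit before calling the refactor done.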

Does It Actually Work? An A/B Experiment

I wanted data, not just intuition. So I ran A/B tests.

How the test works

I picked three real errors from production agent skills. For each error, I constructed two isolated contexts — one group of agents only saw the old flat-list pitfall documentation, another only saw the new structured version. Same error description, same prompt. Multiple runs per version to reduce randomness.

Scoring

Each test run is scored on 4 dimensions:

| Dimension | Scale | Meaning |
|-----------|-------|---------|
| Diagnosis Accuracy | 0-2 | 0 = wrong, 1 = partially correct, 2 = pinpointed root cause |
| Fix Accuracy | 0-2 | 0 = wrong fix, 1 = right direction but not actionable, 2 = correct and executable |
| Wrong Suggestions | N (penalty) | Number of misleading or irrelevant suggestions |
| Reasoning Steps | N (penalty if >3) | How many intermediate steps from problem to fix |

Overall Score = Diagnosis + Fix − Wrong Suggestions − (Steps > 3 ? 1 : 0). Max: 4. Min: -2.
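
The formula translates directly into a few lines of Python. This is a sketch of the formula as written, with a hypothetical function name. The stated minimum of -2 matches a run with one wrong suggestion and more than three steps; whether the wrong-suggestion penalty is capped is not specified, so the sketch leaves it uncapped.

```python
def overall_score(diagnosis, fix, wrong_suggestions, steps):
    """Overall Score = Diagnosis + Fix - Wrong Suggestions - (1 if Steps > 3 else 0)."""
    if diagnosis not in (0, 1, 2) or fix not in (0, 1, 2):
        raise ValueError("diagnosis and fix are scored on a 0-2 scale")
    if wrong_suggestions < 0 or steps < 0:
        raise ValueError("counts cannot be negative")
    step_penalty = 1 if steps > 3 else 0  # flat penalty, not per extra step
    return diagnosis + fix - wrong_suggestions - step_penalty


# A perfect run: root cause pinpointed, executable fix, no noise, 3 steps.
print(overall_score(diagnosis=2, fix=2, wrong_suggestions=0, steps=3))  # 4
# Everything wrong, one misleading suggestion, more than 3 steps.
print(overall_score(diagnosis=0, fix=0, wrong_suggestions=1, steps=4))  # -2
```

Note that the step penalty is binary: a run with 4 steps and a run with 10 steps lose the same single point.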


Scenario A — Task Timeout

  • Error: Two sub-agents both timed out at 600 seconds with no output
  • Old version: Fix buried as item #2 in a 32-item flat list, described in prose as "occasionally succeeds, occasionally times out"
  • New version: Decision table placed as the first category, Fix column directly says "split to 1 item per task"

  • Shared behavior: Both versions correctly diagnosed the root cause — dense analysis workload + 2-item batch config

  • Old version's quirk: Misunderstood "batch" as a 2-item unit, suggested splitting into 2 sub-agents instead of 4; averaged 3.7 reasoning steps

  • New version's advantage: No ambiguity in the Fix column; averaged 3.0 reasoning steps

Score: Flat list 3.0 avg → Structured 4.0 avg (+1.0)


Scenario B — Session Connection Lost

  • Error: After 2 successful operations, connection broke mid-task — session appeared alive in listings but commands silently failed
  • Old version: Relevant information scattered across 4 separate sections (24 items total), had to be pieced together
  • New version: Consolidated into one decision table — symptom → diagnosis → fix at a glance

  • Shared behavior: Both versions tried to match the error against known patterns

  • Old version's quirk: Diagnosed as "session ID changed after navigation" — plausible but wrong, assembled from fragments across 4 sections

  • New version's advantage: Found a partial match against the decision table, explicitly flagged it as partial, gave a conservative fix (stop and report) plus a meta-suggestion to add a new row for this scenario

Score: Flat list 1.0 → Structured 3.0 (+2.0)


Scenario C — Duplicate Database Records

  • Error: Write operation created duplicate rows instead of replacing existing ones, due to a missing unique constraint
  • Old version: Last item in an 11-item flat list, but the item itself was clearly written
  • New version: Placed under a "Database" category with a fixed position

  • Shared behavior: Both versions scored perfectly — diagnosis and fix both correct

  • Old version's quirk: The item was well-written enough to find regardless of position

  • New version's advantage: Category placement makes location more predictable, but in this case it didn't matter

Score: Flat list 4.0 → Structured 4.0 (0)


Summary

| Metric | Flat List | Structured | Change |
|--------|-----------|------------|--------|
| Overall Score | 2.67 | 3.67 | +37% |
| Fix Accuracy | 1.67 | 2.0 | +20% |
| Reasoning Steps | 3.6 | 2.7 | -25% |
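
For anyone who wants to check the arithmetic, the overall averages follow from the three per-scenario scores above. The Python below only recomputes reported numbers; the variable and helper names are mine.

```python
# Per-scenario overall scores as reported in Scenarios A, B, and C.
flat = {"A": 3.0, "B": 1.0, "C": 4.0}
structured = {"A": 4.0, "B": 3.0, "C": 4.0}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pct_change(before, after):
    return (after - before) / before * 100


flat_avg = mean(flat.values())
structured_avg = mean(structured.values())

print(round(flat_avg, 2), round(structured_avg, 2))    # 2.67 3.67
print(round(pct_change(flat_avg, structured_avg), 1))  # 37.5
print(round(pct_change(1.67, 2.0), 1))                 # 19.8
print(round(pct_change(3.6, 2.7), 1))                  # -25.0
```

The overall change works out to 37.5%, which the table reports as +37%, and the other two rows round to the +20% and -25% shown.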

Three findings:

1. Scattered information = biggest win. Scenario B went from 4 sections of 24 items to 1 decision table — a +2.0 improvement. When error knowledge is scattered across a document, agents waste steps piecing together clues.

2. Decision tables remove ambiguity. The flat-list version described "occasionally succeeds, occasionally times out." An agent misunderstood and gave the wrong fix. The decision table directly says what to do — no room for misinterpretation.

3. Decision tables seem to make agents more honest. In Scenario B, when no table row perfectly matched the symptoms, the agent said "partial match" instead of forcing a wrong answer. Flat lists don't encourage this kind of honesty.

Try It

The whole project is up on my repo — feedback welcome: github.com/seanyan1984/skill-pitfalls

It's framework-agnostic — the rules work for any markdown-based skill or prompt documentation. If your error knowledge is growing wild in your agent skills, give it a try.
