If you've worked with AI agent skills — Hermes, Claude Code, LangChain, whatever — you know the cycle. Something breaks. You (or your agent) jot down a fix somewhere in the docs. After a skill has been through a few iteration cycles, you've accumulated 30+ fix notes scattered across the document — no categories, no priority, and your agent still can't find the right fix when something goes wrong.
I ran into this maintaining production agent skills. Here's what I found: as far as I can tell, nobody has systematically addressed how to organize error-related knowledge inside skills. Maybe someone has and I just haven't found it — but in everything I looked at, this angle was missing. And here's what worked for me.
What Everyone Else Is Doing
I looked around the usual places — official forums, technical blogs, developer communities. Here's roughly what I found:
| Player | What They Cover | What's Missing |
|---|---|---|
| Hermes official docs | `## Pitfalls` section header, one line of guidance | No format spec, no size limits |
| agentskills.io best practices | "Gotchas" as flat bullet lists | No decision tables, no structure |
| Claude Code community (Jenny Ouyang's SSOT Audit) | Skill auditing tools: broken paths, orphans, duplicates | No content-level governance — no rules for how pitfalls should be categorized, formatted, or sized |
| "AI Skills in Production" guide (Tao An) | System reliability, circuit breakers, structured output | Knowledge document governance |
One pattern stands out: everyone checks whether skill files resolve correctly and whether content is duplicated. But nobody asks — when an agent hits an error mid-execution, can it efficiently find the correct fix in SKILL.md or its reference files?
"Won't bigger models just handle this?"
Fair question. If context windows keep growing, do we even need structure?
Partially. Bigger contexts solve reading — the model can ingest everything. But they don't solve attention. Anthropic's own context engineering research documents the "lost in the middle" effect: critical details buried in a long flat list get overlooked. Every irrelevant item your agent has to scan past is noise competing with the signal it actually needs.
Think of it this way: you could give someone a 500-page manual or a 1-page troubleshooting card. They can read both, but the card is faster. This is an efficiency problem, not a capability problem.
What Worked for Me
After banging my head against this for a while, I landed on five rules for organizing error documentation in skills (PITFALL Rules):
Rule 1: Categorize when >5 items
Group by priority. The specific names depend on what your skill handles, but the structure looks like:
| Priority | What Goes Here | Example Categories |
|---|---|---|
| Highest | Symptom → diagnosis → fix decision tables | Anomaly Diagnosis, Error Recovery |
| High | Tool-specific gotchas, API quirks | Tool Traps, Integration Gotchas |
| Medium | Data formats, parameter edge cases | Data & Parameters, Input Validation |
| Low | Project conventions, naming rules | Project Conventions, Style Rules |
Key principle: anomaly diagnosis always goes first, because that's what agents need when something breaks.
Rule 2: Decision tables for diagnostics
Instead of prose like "sometimes the task times out because there's too much data and you should split it", write this:
| Symptom | Diagnosis | Fix |
|---------|-----------|-----|
| Task timeout, no output written | Check if output file exists | Don't retry same config. Split input and rerun |
| Same input fails ≥3 times | Persistent bottleneck | Bypass delegation, process directly in main session |
An agent can scan the Symptom column, find a match, and read the Fix — typically 1-2 steps instead of reading through 30 prose items.
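To make the scan, match, read pattern concrete, here's a minimal sketch of how a decision table could be parsed and queried in code. An agent does this implicitly while reading; the names `Row`, `parse_decision_table`, and `find_fix` are hypothetical, and the word-overlap matching is just a stand-in for whatever matching the model actually does.

```python
from dataclasses import dataclass
from typing import Optional

# The decision table from above, as it would appear in SKILL.md.
TABLE_MD = """
| Symptom | Diagnosis | Fix |
|---------|-----------|-----|
| Task timeout, no output written | Check if output file exists | Don't retry same config. Split input and rerun |
| Same input fails ≥3 times | Persistent bottleneck | Bypass delegation, process directly in main session |
"""


@dataclass(frozen=True)
class Row:
    symptom: str
    diagnosis: str
    fix: str


def parse_decision_table(markdown: str) -> list[Row]:
    """Turn a three-column markdown table into Row objects."""
    rows = []
    for line in markdown.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 3:
            continue
        # Skip the header row and the |---|---| separator row.
        if cells[0] == "Symptom" or set(cells[0]) <= set("-: "):
            continue
        rows.append(Row(*cells))
    return rows


def _words(text: str) -> set[str]:
    return set(text.lower().replace(",", " ").split())


def find_fix(rows: list[Row], observed: str) -> Optional[Row]:
    """Return the row whose Symptom shares the most words with the observed error."""
    best, best_overlap = None, 0
    for row in rows:
        overlap = len(_words(row.symptom) & _words(observed))
        if overlap > best_overlap:
            best, best_overlap = row, overlap
    return best  # None means no row matched at all


rows = parse_decision_table(TABLE_MD)
match = find_fix(rows, "sub-agent task hit timeout and no output was written")
print(match.fix if match else "no match: stop and report")
```

The lookup touches only the Symptom column and returns a single row, which is the "1-2 steps" shape described above.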
Rule 3: Don't duplicate the skill body (when applicable)
This applies to operation-manual-style skills — ones with detailed step-by-step instructions and inline warnings (⚠️). If your skill body already explains a gotcha right next to the relevant operation, the pitfall section should give a one-line cross-reference only. Pitfalls cover blind spots the body doesn't address.
For reference-style or API-doc-style skills where the body doesn't include inline warnings, this rule doesn't apply — all error knowledge goes in the pitfall section.
Rule 4: Pre-append checklist
Before adding any item: Is the section already categorized? → Which category? → Does it need a decision table? → Does it duplicate an existing item? → Too many items? Consider splitting to a separate file.
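The checklist above can be sketched as a small function. This is an illustration under assumptions, not the project's implementation: the function name, the dict-of-lists representation, and the `split_threshold` default of 30 are all hypothetical (the rule itself only says "too many items"). The >5 threshold comes from Rule 1.

```python
def pre_append_check(
    section: dict[str, list[str]],  # category name -> existing items
    category: str,
    item: str,
    is_diagnostic: bool,
    split_threshold: int = 30,  # assumed value; the rule gives no number
) -> list[str]:
    """Walk the Rule 4 questions and return the follow-up actions they imply."""
    notes = []
    total = sum(len(items) for items in section.values())

    # Is the section already categorized?
    if len(section) <= 1 and total > 5:
        notes.append("categorize the section first (Rule 1)")

    # Which category?
    if category not in section:
        notes.append(f"new category needed: {category}")

    # Does it need a decision table?
    if is_diagnostic:
        notes.append("write it as a decision-table row (Rule 2)")

    # Does it duplicate an existing item? (case/whitespace-insensitive match only)
    normalized = " ".join(item.lower().split())
    for items in section.values():
        if any(" ".join(old.lower().split()) == normalized for old in items):
            notes.append("duplicate of an existing item")
            break

    # Too many items?
    if total + 1 > split_threshold:
        notes.append("consider splitting into a separate file")

    return notes


section = {
    "Error Recovery": ["Task timeout: split input"],
    "Tool Traps": ["API returns 200 on failure"],
}
print(pre_append_check(section, "Tool Traps", "api returns 200  on failure", False))
```

An empty list means the item can be appended as-is.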
Rule 5: Quality audit
After refactoring, check 7 dimensions: categorized, decision tables for diagnostics, no flat lists >5, no duplicates, <50% body-text overlap, no prose narratives, no information loss.
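Some of these dimensions can be checked mechanically. As a rough sketch, here's one way to flag the "no flat lists >5" dimension in a markdown file. The function name is hypothetical, and it assumes tight lists (a blank line or heading ends a run); the other six dimensions still need human or agent judgment.

```python
import re

# Matches "-", "*", "+", or "1." style list items.
BULLET = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+")


def long_flat_lists(markdown: str, limit: int = 5) -> list[tuple[int, int]]:
    """Return (start_line, item_count) for every bullet run longer than `limit`."""
    findings = []
    run_start, run_len = 0, 0
    # The trailing "" flushes a run that reaches the end of the file.
    for number, line in enumerate(markdown.splitlines() + [""], start=1):
        if BULLET.match(line):
            if run_len == 0:
                run_start = number
            run_len += 1
        else:
            if run_len > limit:
                findings.append((run_start, run_len))
            run_len = 0
    return findings


doc = "## Pitfalls\n" + "\n".join(f"- item {i}" for i in range(6))
print(long_flat_lists(doc))
```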
Does It Actually Work? An A/B Experiment
I wanted data, not just intuition. So I ran A/B tests.
How the test works
I picked three real errors from production agent skills. For each error, I constructed two isolated contexts — one group of agents only saw the old flat-list pitfall documentation, another only saw the new structured version. Same error description, same prompt. Multiple runs per version to reduce randomness.
Scoring
Each test run is scored on 4 dimensions:
| Dimension | Scale | Meaning |
|---|---|---|
| Diagnosis Accuracy | 0-2 | 0 = wrong, 1 = partially correct, 2 = pinpointed root cause |
| Fix Accuracy | 0-2 | 0 = wrong fix, 1 = right direction but not actionable, 2 = correct and executable |
| Wrong Suggestions | N (penalty) | Number of misleading or irrelevant suggestions |
| Reasoning Steps | N (penalty if >3) | How many intermediate steps from problem to fix |
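Overall Score = Diagnosis + Fix − Wrong Suggestions − (Steps > 3 ? 1 : 0). Max: 4. A run with one wrong suggestion bottoms out at -2; because the wrong-suggestion penalty is uncapped, each additional one lowers the score further.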
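For anyone who wants to reuse the rubric, the formula translates directly into a few lines of Python. The function name and the input validation are my additions; the arithmetic is exactly the formula above. The example calls use made-up inputs, not runs from the experiment.

```python
def overall_score(diagnosis: int, fix: int, wrong_suggestions: int, steps: int) -> int:
    """Diagnosis + Fix - Wrong Suggestions - (1 if Steps > 3 else 0)."""
    if diagnosis not in (0, 1, 2) or fix not in (0, 1, 2):
        raise ValueError("diagnosis and fix are scored on a 0-2 scale")
    if wrong_suggestions < 0 or steps < 0:
        raise ValueError("counts cannot be negative")
    step_penalty = 1 if steps > 3 else 0
    return diagnosis + fix - wrong_suggestions - step_penalty


print(overall_score(diagnosis=2, fix=2, wrong_suggestions=0, steps=3))  # perfect run
print(overall_score(diagnosis=2, fix=2, wrong_suggestions=0, steps=4))  # one step too many
```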
Scenario A — Task Timeout
- Error: Two sub-agents both timed out at 600 seconds with no output
- Old version: Fix buried as item #2 in a 32-item flat list, described in prose as "occasionally succeeds, occasionally times out"
- New version: Decision table placed as the first category, Fix column directly says "split to 1 item per task"
- Shared behavior: Both versions correctly diagnosed the root cause: dense analysis workload + 2-item batch config
- Old version's quirk: Misunderstood "batch" as a 2-item unit, suggested splitting into 2 sub-agents instead of 4; averaged 3.7 reasoning steps
- New version's advantage: No ambiguity in the Fix column; averaged 3.0 reasoning steps
Score: Flat list 3.0 avg → Structured 4.0 avg (+1.0)
Scenario B — Session Connection Lost
- Error: After 2 successful operations, connection broke mid-task — session appeared alive in listings but commands silently failed
- Old version: Relevant information scattered across 4 separate sections (24 items total), had to be pieced together
- New version: Consolidated into one decision table: symptom → diagnosis → fix at a glance
- Shared behavior: Both versions tried to match the error against known patterns
- Old version's quirk: Diagnosed as "session ID changed after navigation", plausible but wrong, assembled from fragments across 4 sections
- New version's advantage: Found a partial match against the decision table, explicitly flagged it as partial, gave a conservative fix (stop and report) plus a meta-suggestion to add a new row for this scenario
Score: Flat list 1.0 → Structured 3.0 (+2.0)
Scenario C — Duplicate Database Records
- Error: Write operation created duplicate rows instead of replacing existing ones, due to a missing unique constraint
- Old version: Last item in an 11-item flat list, but the item itself was clearly written
- New version: Placed under a "Database" category with a fixed position
- Shared behavior: Both versions scored perfectly, with diagnosis and fix both correct
- Old version's quirk: The item was well-written enough to find regardless of position
- New version's advantage: Category placement makes location more predictable, but in this case it didn't matter
Score: Flat list 4.0 → Structured 4.0 (0)
Summary
| Metric | Flat List | Structured | Change |
|---|---|---|---|
| Overall Score | 2.67 | 3.67 | +37% |
| Fix Accuracy | 1.67 | 2.0 | +20% |
| Reasoning Steps | 3.6 | 2.7 | -25% |
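The Overall Score row can be reproduced from the three per-scenario scores reported above. A quick sketch (variable and function names are mine; the inputs are the scenario averages from this post):

```python
from statistics import mean

# Per-scenario overall scores, as reported in Scenarios A, B, and C.
flat = {"A": 3.0, "B": 1.0, "C": 4.0}
structured = {"A": 4.0, "B": 3.0, "C": 4.0}


def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


flat_avg = mean(flat.values())
structured_avg = mean(structured.values())
change = percent_change(flat_avg, structured_avg)

print(f"{flat_avg:.2f} -> {structured_avg:.2f} ({change:+.1f}%)")
```

The relative change works out to about 37.5%, which the table reports as +37%. With only three scenarios, treat the percentage as a direction, not a precise effect size.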
Three findings:
1. Scattered information = biggest win. Scenario B went from 4 sections of 24 items to 1 decision table — a +2.0 improvement. When error knowledge is scattered across a document, agents waste steps piecing together clues.
2. Decision tables remove ambiguity. The flat-list version described "occasionally succeeds, occasionally times out." An agent misunderstood and gave the wrong fix. The decision table directly says what to do — no room for misinterpretation.
3. Decision tables make agents honest. When no table row perfectly matched the symptoms, the agent said "partial match" instead of forcing a wrong answer. Flat lists don't encourage this kind of honesty.
Try It
The whole project is up on my repo — feedback welcome: github.com/seanyan1984/skill-pitfalls
It's framework-agnostic — the rules work for any markdown-based skill or prompt documentation. If your error knowledge is growing wild in your agent skills, give it a try.