<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: seanyan</title>
    <description>The latest articles on DEV Community by seanyan (@_10e34d2463b4a0aecf191).</description>
    <link>https://dev.to/_10e34d2463b4a0aecf191</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3941925%2F8d2beb01-fa2c-4082-8fea-56b1812487de.webp</url>
      <title>DEV Community: seanyan</title>
      <link>https://dev.to/_10e34d2463b4a0aecf191</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/_10e34d2463b4a0aecf191"/>
    <language>en</language>
    <item>
      <title>Why Your AI Agent Keeps Making the Same Mistakes (And a Structured Fix)</title>
      <dc:creator>seanyan</dc:creator>
      <pubDate>Wed, 20 May 2026 10:12:02 +0000</pubDate>
      <link>https://dev.to/_10e34d2463b4a0aecf191/why-your-ai-agent-keeps-making-the-same-mistakes-and-a-structured-fix-1j43</link>
      <guid>https://dev.to/_10e34d2463b4a0aecf191/why-your-ai-agent-keeps-making-the-same-mistakes-and-a-structured-fix-1j43</guid>
      <description>&lt;p&gt;If you've worked with AI agent skills — Hermes, Claude Code, LangChain, whatever — you know the cycle. Something breaks. You (or your agent) jot down a fix somewhere in the docs. After a skill has been through a few iteration cycles, you've accumulated 30+ fix notes scattered across the document — no categories, no priority, and your agent still can't find the right fix when something goes wrong.&lt;/p&gt;

&lt;p&gt;I ran into this while maintaining production agent skills. As far as I can tell, nobody has systematically addressed how to organize error-related knowledge inside skills. Maybe someone has and I just haven't found it, but the angle was missing from everything I looked at. Here's what worked for me.&lt;/p&gt;

&lt;h2&gt;What Everyone Else Is Doing&lt;/h2&gt;

&lt;p&gt;I looked around the usual places — official forums, technical blogs, developer communities. Here's roughly what I found:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;What They Cover&lt;/th&gt;
&lt;th&gt;What's Missing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://hermes-agent.nousresearch.com/docs/developer-guide/creating-skills" rel="noopener noreferrer"&gt;Hermes official docs&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;## Pitfalls&lt;/code&gt; section header — one line of guidance&lt;/td&gt;
&lt;td&gt;No format spec, no size limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://agentskills.io/skill-creation/best-practices" rel="noopener noreferrer"&gt;agentskills.io best practices&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;"Gotchas" as flat bullet lists&lt;/td&gt;
&lt;td&gt;No decision tables, no structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://buildtolauch.substack.com/p/claude-skills-not-working-fix" rel="noopener noreferrer"&gt;Claude Code community&lt;/a&gt; (Jenny Ouyang's SSOT Audit)&lt;/td&gt;
&lt;td&gt;Skill auditing tools: broken paths, orphans, duplicates&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No content-level governance — no rules for how pitfalls should be categorized, formatted, or sized&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://tao-hpu.medium.com/how-to-write-ai-skills-that-dont-fail-in-production-6bb679897f30" rel="noopener noreferrer"&gt;"AI Skills in Production" guide&lt;/a&gt; (Tao An)&lt;/td&gt;
&lt;td&gt;System reliability, circuit breakers, structured output&lt;/td&gt;
&lt;td&gt;Knowledge document governance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One pattern stands out: everyone checks whether skill files resolve correctly and whether content is duplicated. But nobody asks — when an agent hits an error mid-execution, can it efficiently find the correct fix in SKILL.md or its reference files?&lt;/p&gt;

&lt;h3&gt;"Won't bigger models just handle this?"&lt;/h3&gt;

&lt;p&gt;Fair question. If context windows keep growing, do we even need structure?&lt;/p&gt;

&lt;p&gt;Partially. Bigger contexts solve &lt;em&gt;reading&lt;/em&gt; — the model can ingest everything. But they don't solve &lt;em&gt;attention&lt;/em&gt;. Anthropic's own &lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;context engineering research&lt;/a&gt; documents the "lost in the middle" effect: critical details buried in a long flat list get overlooked. Every irrelevant item your agent has to scan past is noise competing with the signal it actually needs.&lt;/p&gt;

&lt;p&gt;Think of it this way: you could give someone a 500-page manual or a 1-page troubleshooting card. They can read both, but the card is faster. This is an efficiency problem, not a capability problem.&lt;/p&gt;

&lt;h2&gt;What Worked for Me&lt;/h2&gt;

&lt;p&gt;After banging my head against this for a while, I landed on five rules for organizing error documentation in skills (PITFALL Rules):&lt;/p&gt;

&lt;h3&gt;Rule 1: Categorize when &amp;gt;5 items&lt;/h3&gt;

&lt;p&gt;Group by priority. The specific names depend on what your skill handles, but the structure looks like:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;What Goes Here&lt;/th&gt;
&lt;th&gt;Example Categories&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Highest&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Symptom → diagnosis → fix decision tables&lt;/td&gt;
&lt;td&gt;Anomaly Diagnosis, Error Recovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Tool-specific gotchas, API quirks&lt;/td&gt;
&lt;td&gt;Tool Traps, Integration Gotchas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Data formats, parameter edge cases&lt;/td&gt;
&lt;td&gt;Data &amp;amp; Parameters, Input Validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Project conventions, naming rules&lt;/td&gt;
&lt;td&gt;Project Conventions, Style Rules&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key principle: &lt;strong&gt;anomaly diagnosis always goes first&lt;/strong&gt;, because that's what agents need when something breaks.&lt;/p&gt;
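
&lt;p&gt;To make the ordering concrete, here is a minimal Python sketch of Rule 1. It assumes pitfalls are kept as (category, note) pairs; the &lt;code&gt;organize&lt;/code&gt; helper, the &lt;code&gt;PRIORITY&lt;/code&gt; list, and the specific category names are hypothetical and only illustrate the idea.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical priority order, highest first; rename to fit your skill.
PRIORITY = ["Anomaly Diagnosis", "Tool Traps", "Input Validation", "Project Conventions"]
CATEGORIZE_ABOVE = 5  # Rule 1: categorize once there are more than 5 items


def organize(pitfalls: list[tuple[str, str]]) -> list[str]:
    """Render (category, note) pairs as markdown lines, highest priority first."""
    if CATEGORIZE_ABOVE >= len(pitfalls):
        # Short lists stay flat.
        return [f"- {note}" for _, note in pitfalls]

    grouped = defaultdict(list)
    for category, note in pitfalls:
        grouped[category].append(note)

    # Known categories in priority order, then any unknown ones alphabetically.
    ordered = [c for c in PRIORITY if c in grouped]
    ordered += sorted(c for c in grouped if c not in PRIORITY)

    lines = []
    for category in ordered:
        lines.append(f"### {category}")
        lines.extend(f"- {note}" for note in grouped[category])
    return lines
```

&lt;p&gt;With six or more items, the output always starts with the anomaly-diagnosis group, no matter what order the notes were appended in.&lt;/p&gt;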

&lt;h3&gt;Rule 2: Decision tables for diagnostics&lt;/h3&gt;

&lt;p&gt;Instead of prose like "sometimes the task times out because there's too much data and you should split it", write this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;| Symptom | Diagnosis | Fix |
|---------|-----------|-----|
| Task timeout, no output written | Check if output file exists | Don't retry same config. Split input and rerun |
| Same input fails ≥3 times | Persistent bottleneck | Bypass delegation, process directly in main session |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An agent can scan the Symptom column, find a match, and read the Fix — typically 1-2 steps instead of reading through 30 prose items.&lt;/p&gt;
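
&lt;p&gt;The scan-and-match step can be sketched in Python. This is only an illustration of why the table shape is easy to search, not how any agent framework actually works: &lt;code&gt;parse_table&lt;/code&gt; and &lt;code&gt;find_fix&lt;/code&gt; are hypothetical helpers, and the word-overlap matching is a deliberately naive stand-in for whatever matching the model does.&lt;/p&gt;

```python
from typing import Optional

TABLE = """\
| Symptom | Diagnosis | Fix |
|---------|-----------|-----|
| Task timeout, no output written | Check if output file exists | Don't retry same config. Split input and rerun |
| Same input fails ≥3 times | Persistent bottleneck | Bypass delegation, process directly in main session |
"""


def parse_table(markdown: str) -> list[dict[str, str]]:
    """Parse a simple pipe table into one dict per data row."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator
    return [dict(zip(header, row)) for row in body]


def find_fix(table: list[dict[str, str]], observed: str) -> Optional[str]:
    """Return the Fix of the row whose Symptom shares the most words with `observed`."""
    observed_words = set(observed.lower().replace(",", " ").split())
    best_fix, best_overlap = None, 0
    for row in table:
        symptom_words = set(row["Symptom"].lower().replace(",", " ").split())
        overlap = len(observed_words.intersection(symptom_words))
        if overlap > best_overlap:
            best_fix, best_overlap = row["Fix"], overlap
    return best_fix  # None when no symptom shares any word
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; when nothing matches mirrors the behavior you want from the agent: no matching row means "report it", not "guess".&lt;/p&gt;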

&lt;h3&gt;Rule 3: Don't duplicate the skill body (when applicable)&lt;/h3&gt;

&lt;p&gt;This applies to &lt;strong&gt;operation-manual-style skills&lt;/strong&gt; — ones with detailed step-by-step instructions and inline warnings (⚠️). If your skill body already explains a gotcha right next to the relevant operation, the pitfall section should give a &lt;strong&gt;one-line cross-reference&lt;/strong&gt; only. Pitfalls cover blind spots the body doesn't address.&lt;/p&gt;

&lt;p&gt;For reference-style or API-doc-style skills where the body doesn't include inline warnings, this rule doesn't apply — all error knowledge goes in the pitfall section.&lt;/p&gt;

&lt;h3&gt;Rule 4: Pre-append checklist&lt;/h3&gt;

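&lt;p&gt;Before adding any item, run through these questions in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is the section already categorized?&lt;/li&gt;
&lt;li&gt;Which category does the new item belong to?&lt;/li&gt;
&lt;li&gt;Does it need a decision table?&lt;/li&gt;
&lt;li&gt;Does it duplicate an existing item?&lt;/li&gt;
&lt;li&gt;Are there too many items? If so, consider splitting the section into a separate file.&lt;/li&gt;
&lt;/ol&gt;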

&lt;h3&gt;Rule 5: Quality audit&lt;/h3&gt;

&lt;p&gt;After refactoring, check 7 dimensions: items are categorized, diagnostics use decision tables, no flat list runs longer than 5 items, no duplicates, less than 50% overlap with the skill body text, no prose narratives, and no information lost in the refactor.&lt;/p&gt;
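
&lt;p&gt;A few of these dimensions are mechanical enough to script. The Python sketch below covers only three of the seven (flat-list length, duplicates, presence of a table) and assumes the pitfall section is plain markdown with &lt;code&gt;- &lt;/code&gt; bullets and pipe tables; the &lt;code&gt;audit&lt;/code&gt; function is a hypothetical helper. The remaining dimensions, such as information loss, still need a human or model review.&lt;/p&gt;

```python
def audit(lines: list[str]) -> dict[str, bool]:
    """Check a pitfall section (markdown lines) against three mechanical rules."""
    longest_run = run = 0
    bullets = []
    has_table = False
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank lines do not end a list
        if text.startswith("- "):
            run += 1
            bullets.append(text[2:].lower())
        else:
            run = 0  # a heading or table row ends the current flat run
        longest_run = max(longest_run, run)
        if text.startswith("|"):
            has_table = True
    return {
        "no_flat_list_over_5": 5 >= longest_run,
        "no_duplicates": len(bullets) == len(set(bullets)),
        "has_decision_table": has_table,
    }
```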

&lt;h2&gt;Does It Actually Work? An A/B Experiment&lt;/h2&gt;

&lt;p&gt;I wanted data, not just intuition. So I ran A/B tests.&lt;/p&gt;

&lt;h3&gt;How the test works&lt;/h3&gt;

&lt;p&gt;I picked three real errors from production agent skills. For each error, I constructed two isolated contexts: one group of agents saw only the old flat-list pitfall documentation, and the other saw only the new structured version. Both groups got the same error description and the same prompt, and I ran each version multiple times to reduce randomness.&lt;/p&gt;

&lt;h3&gt;Scoring&lt;/h3&gt;

&lt;p&gt;Each test run is scored on 4 dimensions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Diagnosis Accuracy&lt;/td&gt;
&lt;td&gt;0-2&lt;/td&gt;
&lt;td&gt;0 = wrong, 1 = partially correct, 2 = pinpointed root cause&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fix Accuracy&lt;/td&gt;
&lt;td&gt;0-2&lt;/td&gt;
&lt;td&gt;0 = wrong fix, 1 = right direction but not actionable, 2 = correct and executable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wrong Suggestions&lt;/td&gt;
&lt;td&gt;N (penalty)&lt;/td&gt;
&lt;td&gt;Number of misleading or irrelevant suggestions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning Steps&lt;/td&gt;
&lt;td&gt;N (penalty if &amp;gt;3)&lt;/td&gt;
&lt;td&gt;How many intermediate steps from problem to fix&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Overall Score&lt;/strong&gt; = Diagnosis + Fix − Wrong Suggestions − (Steps &amp;gt; 3 ? 1 : 0). Max: 4. Min: -2.&lt;/p&gt;
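
&lt;p&gt;The formula is small enough to express directly. Below is a minimal Python sketch; the function name &lt;code&gt;overall_score&lt;/code&gt; and its argument names are illustrative and not taken from the project repository.&lt;/p&gt;

```python
def overall_score(diagnosis: int, fix: int, wrong_suggestions: int, steps: int) -> int:
    """Overall Score = Diagnosis + Fix - Wrong Suggestions - (1 if Steps > 3 else 0)."""
    if diagnosis not in (0, 1, 2) or fix not in (0, 1, 2):
        raise ValueError("diagnosis and fix are scored on a 0-2 scale")
    step_penalty = 1 if steps > 3 else 0
    return diagnosis + fix - wrong_suggestions - step_penalty


print(overall_score(2, 2, 0, 3))  # 4: perfect diagnosis and fix, no penalties
print(overall_score(0, 0, 1, 4))  # -2: wrong on both, one bad suggestion, too many steps
```

&lt;p&gt;Note that the stated minimum of -2 matches a run with one wrong suggestion and more than three steps; as written, the formula would go lower with additional wrong suggestions.&lt;/p&gt;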




&lt;h3&gt;Scenario A: Task Timeout&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Error:&lt;/strong&gt; Two sub-agents both timed out at 600 seconds with no output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Old version:&lt;/strong&gt; Fix buried as item #2 in a 32-item flat list, described in prose as "occasionally succeeds, occasionally times out"&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;New version:&lt;/strong&gt; Decision table placed as the first category, Fix column directly says "split to 1 item per task"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shared behavior:&lt;/strong&gt; Both versions correctly diagnosed the root cause — dense analysis workload + 2-item batch config&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Old version's quirk:&lt;/strong&gt; Misunderstood "batch" as a 2-item unit, suggested splitting into 2 sub-agents instead of 4; averaged 3.7 reasoning steps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;New version's advantage:&lt;/strong&gt; No ambiguity in the Fix column; averaged 3.0 reasoning steps&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Score: Flat list 3.0 avg → Structured 4.0 avg (+1.0)&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;Scenario B: Session Connection Lost&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Error:&lt;/strong&gt; After 2 successful operations, connection broke mid-task — session appeared alive in listings but commands silently failed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Old version:&lt;/strong&gt; Relevant information scattered across 4 separate sections (24 items total), had to be pieced together&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;New version:&lt;/strong&gt; Consolidated into one decision table — symptom → diagnosis → fix at a glance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shared behavior:&lt;/strong&gt; Both versions tried to match the error against known patterns&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Old version's quirk:&lt;/strong&gt; Diagnosed as "session ID changed after navigation" — plausible but wrong, assembled from fragments across 4 sections&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;New version's advantage:&lt;/strong&gt; Found a partial match against the decision table, explicitly flagged it as partial, gave a conservative fix (stop and report) plus a meta-suggestion to add a new row for this scenario&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Score: Flat list 1.0 → Structured 3.0 (+2.0)&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;Scenario C: Duplicate Database Records&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Error:&lt;/strong&gt; Write operation created duplicate rows instead of replacing existing ones, due to a missing unique constraint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Old version:&lt;/strong&gt; Last item in an 11-item flat list, but the item itself was clearly written&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;New version:&lt;/strong&gt; Placed under a "Database" category with a fixed position&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shared behavior:&lt;/strong&gt; Both versions scored perfectly — diagnosis and fix both correct&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Old version's quirk:&lt;/strong&gt; The item was well-written enough to find regardless of position&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;New version's advantage:&lt;/strong&gt; Category placement makes location more predictable, but in this case it didn't matter&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Score: Flat list 4.0 → Structured 4.0 (0)&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;Summary&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Flat List&lt;/th&gt;
&lt;th&gt;Structured&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Overall Score&lt;/td&gt;
&lt;td&gt;2.67&lt;/td&gt;
&lt;td&gt;3.67&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+37%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fix Accuracy&lt;/td&gt;
&lt;td&gt;1.67&lt;/td&gt;
&lt;td&gt;2.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+20%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning Steps&lt;/td&gt;
&lt;td&gt;3.6&lt;/td&gt;
&lt;td&gt;2.7&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-25%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
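
&lt;p&gt;The overall averages and the percentage changes can be reproduced from the numbers already given. The short Python check below uses the three per-scenario overall scores; for Fix Accuracy and Reasoning Steps it takes the reported averages as inputs, since the per-scenario values for those metrics aren't all listed. The &lt;code&gt;pct_change&lt;/code&gt; helper is just for illustration.&lt;/p&gt;

```python
from statistics import mean

flat_scores = [3.0, 1.0, 4.0]        # Scenarios A, B, C
structured_scores = [4.0, 3.0, 4.0]  # Scenarios A, B, C


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


flat_avg = round(mean(flat_scores), 2)              # 2.67
structured_avg = round(mean(structured_scores), 2)  # 3.67

overall = pct_change(flat_avg, structured_avg)  # about +37%
fix = pct_change(1.67, 2.0)                     # about +20%
steps = pct_change(3.6, 2.7)                    # about -25%

print(round(overall), round(fix), round(steps))
```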

&lt;p&gt;Three findings:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Scattered information = biggest win.&lt;/strong&gt; Scenario B went from 4 sections of 24 items to 1 decision table — a +2.0 improvement. When error knowledge is scattered across a document, agents waste steps piecing together clues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Decision tables remove ambiguity.&lt;/strong&gt; The flat-list version described "occasionally succeeds, occasionally times out." An agent misunderstood and gave the wrong fix. The decision table directly says what to do — no room for misinterpretation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Decision tables make agents honest.&lt;/strong&gt; In Scenario B, no table row perfectly matched the symptoms, and the agent reported a "partial match" instead of forcing a wrong answer. The flat-list agent, by contrast, assembled a confident but wrong diagnosis from fragments. Flat lists don't encourage this kind of honesty.&lt;/p&gt;

&lt;h2&gt;Try It&lt;/h2&gt;

&lt;p&gt;The whole project is up on my repo — feedback welcome: &lt;a href="https://github.com/seanyan1984/skill-pitfalls" rel="noopener noreferrer"&gt;github.com/seanyan1984/skill-pitfalls&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's framework-agnostic — the rules work for any markdown-based skill or prompt documentation. If your error knowledge is growing wild in your agent skills, give it a try.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
