DEV Community

Soleiman Mansouri

I Built a Debugging Memory for AI Coding Agents — Here's the System Behind It

Here's a question that changed how I debug with AI agents:

What if the agent checked "have I seen this before?" before investigating every bug?

I started logging my debugging sessions — every root cause, every false lead, every fix. After 100+ production bugs across voice pipelines, API integrations, and distributed systems, a clear pattern emerged: the same ~22 root causes explain nearly everything. Config chain gaps. Stale caches. Silent fallbacks. Observer multipliers. Retry/timeout mismatches.

The bugs repeat. The agents don't remember. That's the gap.

So I built Debug Bank — a pattern-first debugging memory system that teaches AI agents to remember.

The Problem: AI Agents Learn Nothing

Here's what happens today:

  1. Bug appears
  2. Agent investigates from scratch
  3. Agent finds root cause, fixes it
  4. Session ends
  5. Same bug appears in a different file
  6. Agent investigates from scratch again

Stack Overflow data shows AI-generated code has 2.66x more formatting problems and 1.5-2x more security bugs than human code. Much of that comes from agents never learning from past mistakes.

Google's ReasoningBank research (arxiv.org/abs/2504.09762) showed that distilling failures into reusable patterns yields +8.3% on WebArena and +4.6% on SWE-Bench. But that work lived in benchmarks. I needed a system for day-to-day production debugging.

The Insight: Patterns Repeat

After documenting my own debugging trajectories, I extracted 22 root cause patterns that account for ~95% of the bugs I encounter:

  • P03 (Observer/Hook Multiplier) — Event listeners registered multiple times, causing duplicate processing per trigger
  • P07 (Partial Rollback) — A multi-step deploy succeeds partway, leaving the system in an inconsistent state
  • P11 (Retry/Timeout Mismatch) — Retry interval exceeds the timeout window, so retries never actually execute
  • P15 (Async Fire-and-Forget) — A background task fails silently because no one awaits its result or checks its status

These aren't theoretical. Each has 2-3 real-world examples from production, a 30-second checklist, and a fix strategy.

The key insight: If an agent checks "have I seen this before?" at the start of every debugging session, 70% of investigations end at step 1.

How Debug Bank Works

It's three layers stacked on top of each other:

Layer 1: Pattern Bank (P01-P22)

When a bug is reported, the agent reads the pattern bank and asks: "Is this a known pattern?"

If yes, it pulls the checklist and verification strategy. If the known fix applies, we're done in 30 seconds. If not, we know why it doesn't match and adjust.

If no pattern matches, we proceed to the 7-step protocol.
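To make this concrete, here is roughly what a pattern entry looks like. The field layout below is my paraphrase of the structure described in this post (description, 30-second checklist, fix strategy), not a verbatim file from the repo:

```markdown
## P08 — Config Resolution Chain Gap

**Symptom:** System falls through to stale data when a link in the
fallback chain is missing.

**30-second checklist:**
- [ ] Trace the full resolution chain (e.g., API → DB → YAML → hardcoded)
- [ ] Check each layer for empty or missing entries
- [ ] Confirm which layer actually served the bad value

**Fix strategy:** Populate the missing layer; add monitoring for empty entries.
```

The agent skims entries like this before investigating anything, which is what makes the 30-second resolution possible.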

Layer 2: The 7-Step Protocol

I call it the "Debug Trajectory" — it's the exact sequence I follow on every production incident:

  1. Reproduce — Get the exact error with full output (logs, stack trace, HTTP status)
  2. Hypothesize — State 2-3 ranked, falsifiable root causes
  3. Isolate — Test hypotheses one at a time using binary search
  4. Diagnose — Identify the single root cause by tracing the full call chain
  5. Fix — Make a minimal change addressing the root cause, not a symptom
  6. Record — Document the trajectory in a domain catalog
  7. Capture feedback — When corrected, turn the correction into a persistent rule

Every step produces evidence for the next. You never skip.

Layer 3: The 3-Exchange Stop Rule

This is the single most impactful rule in the system.

If 3 rounds of iterative fixing show no progress, STOP. Don't continue the same approach. Re-plan from scratch, add logging, or switch strategy entirely.

That's it. That rule alone prevents the #1 failure mode of AI agents — circular debugging that wastes tokens and produces nothing. Most agents loop 5-10 times on hard problems. This forces a strategy pivot.
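Mechanically, the stop rule is just a bounded retry with a forced pivot. This is my own illustration, not code from the repo, with `try_fix_loop` as a hypothetical helper:

```shell
# Hypothetical sketch of the 3-exchange stop rule: attempt a fix command
# at most 3 times, then stop and demand a strategy change instead of looping.
try_fix_loop() {
  local cmd="$1" attempts=0
  until "$cmd"; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 3 ]; then
      echo "STOP: re-plan, add logging, or switch strategy"
      return 1
    fi
  done
  echo "fixed after $attempts retries"
}
```

The same shape works as a prompt rule for an agent: count unproductive exchanges, and at three, abandon the current hypothesis entirely.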

Pre-Deploy: Catching Bugs Before They Ship

The pattern bank is reactive — it kicks in after a bug appears. But the cheapest time to catch a bug is before it ships.

So I built a pre-deploy scanner. Here's how it works:

bash integrations/pre-deploy-check.sh

The scanner:

  1. Reads your git diff (staged changes)
  2. Greps for keywords linked to each pattern (e.g., "observer" for P03, "fallback" for P08)
  3. Prints a ranked list of flagged patterns with their quick-check
  4. Exits non-zero if matches are found (so it blocks your deploy pipeline)

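A minimal sketch of that logic, with a hypothetical pattern-ID-to-keyword table (the real script is `integrations/pre-deploy-check.sh` in the repo):

```shell
# scan_diff: read diff text on stdin, print a FLAGGED line per matched
# pattern, and return non-zero if anything matched so CI can block the deploy.
# The ID:keyword pairs below are illustrative, not the repo's actual list.
scan_diff() {
  local diff_text flagged=0 entry id keyword
  diff_text=$(cat)
  local patterns=("P03:subscribe" "P08:fallback")
  for entry in "${patterns[@]}"; do
    id="${entry%%:*}"
    keyword="${entry#*:}"
    if grep -qi "$keyword" <<<"$diff_text"; then
      echo "FLAGGED $id (keyword: $keyword)"
      flagged=$((flagged + 1))
    fi
  done
  [ "$flagged" -eq 0 ]  # non-zero exit when any pattern was flagged
}

# Typical wiring in a pipeline: git diff --cached | scan_diff || exit 1
```

Because the keywords live next to the pattern IDs, adding a new pattern to the bank automatically extends what the scanner can flag.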
Example output:

[debug-bank] Pre-Deploy Pattern Scan
Scanning git diff for known failure patterns...

  FLAGGED  P03 Observer/Hook Multiplier
           keyword: subscribe
           Check: Deduplicate by event/frame ID

  FLAGGED  P08 Config Resolution Chain Gap
           keyword: fallback
           Check: Trace the full fallback chain

2 pattern(s) flagged. Review before deploying.
Exit code: 1

You fix the flagged issues, run the scanner again, and when there are no matches, the deploy proceeds.

This runs before human review — a catch-all safety net that uses the same pattern bank for prevention, not just diagnosis.

How Feedback Becomes Rules

When you tell Claude "don't do that" or correct its approach, that correction becomes a permanent rule:

---
name: no-mocking-database
type: feedback
---
Integration tests must hit a real database, not mocks.

**Why:** Prior incident where mock/prod divergence masked a broken migration.
**How to apply:** Any test file touching database operations.

The Why field lets the agent judge edge cases instead of blindly following rules. After 30+ feedback rules, the agent rarely needs the same correction twice.

Zero Dependencies, Just Markdown

Here's the radical part: It's all markdown files.

No database. No API. No external dependencies. No installation.

Setup takes 30 seconds:

curl -O https://raw.githubusercontent.com/soleimanmansouri/debug-bank/main/CLAUDE.md

Drop it in your project. Your agent reads it. That's it.

Works with:

  • Claude Code (drop-in via CLAUDE.md)
  • Codex CLI, Gemini CLI, Cursor (via AGENTS.md)
  • Custom agents (copy patterns, follow the protocol)

The pre-deploy scanner is a bash script — no external dependencies beyond git and grep.

Real-World Example: P08 in Production

Let me show you how this works with a real pattern.

The bug: Transfer requests were routing to wrong department numbers.

Step 1: Pattern Check — I read P08 (Config Resolution Chain Gap) and recognized the exact symptom: "System falls through to stale data when a link in the fallback chain is missing."

Step 2: Verify the fix applies — Config was resolved via: API response → database table → YAML → hardcoded fallback. The department table was empty. System fell through to YAML with outdated numbers.

Step 3: Fix — Populate the database table. Add monitoring for empty entries.

Result: 30 seconds. No investigation needed.
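The failure mode itself is easy to sketch. A first-non-empty resolver (a hypothetical helper, not the production code) silently serves stale YAML the moment the database layer is empty:

```shell
# resolve_config: return the first non-empty value in the fallback chain
# API -> database -> YAML -> hardcoded. Illustrative only.
resolve_config() {
  local value
  for value in "$@"; do
    if [ -n "$value" ]; then
      echo "$value"
      return 0
    fi
  done
  return 1  # nothing in the chain had a value
}

# Empty API response + empty DB table means stale YAML wins:
# resolve_config "" "" "dept=555-0100 (old)" "dept=555-0199"
```

Note that the fix sits upstream of this logic: populate the database layer and alert when it goes empty, so the resolver never reaches the stale YAML entry.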

Without the pattern bank, I would have:

  • Checked the API response (working)
  • Checked the database query (returns no results, but why?)
  • Checked the YAML file (contains old numbers, but why is it using those?)
  • Spent an hour on a config chain I didn't fully understand

The pattern bank cut that to 30 seconds, and the same entry applies to every subsequent bug of this type, in this project and in any future one.

Building the Pattern Bank

The 22 patterns came from documenting my own debugging work across diverse systems:

  • Voice pipelines (Pipecat, ElevenLabs)
  • API integrations (Twilio, Odoo, Supabase)
  • Configuration management (database → YAML fallback chains)
  • Distributed systems (cache invalidation, retry storms)

Each pattern includes:

  • A clear description of the failure mode
  • A 30-second checklist
  • 2-3 real-world examples
  • A fix strategy
  • Prevention guidance

The patterns transfer across projects. P02 (Multiple Writers) appears in voice pipelines, web APIs, databases, and infrastructure. Once learned, it's learned forever.

Scenarios and Postmortems

The repo includes higher-tier debugging challenges:

  • Scenarios (S01-S03) — Multi-service L3-L4 problems where the symptom is in one place and the root cause is somewhere else entirely
  • Postmortems (PM01-PM03) — Anonymized production incidents with full timelines, blast radius analysis, and systemic mitigation

These are for learning. They teach you to think like a senior engineer tracking distributed failures.

Why This Works

Compound learning — Every bug fix teaches the system. After 50 bugs, most issues resolve at step 1 (pattern match).

Transfers across projects — You build the pattern bank once. It moves with you. P02 (Multiple Writers) is P02 everywhere.

User-driven self-improvement — Feedback rules capture corrections with context. The agent gets better at matching your expectations.

Evidence-based — Every pattern has a checklist. Every catalog entry links to a pattern ID. Nothing is "just trust me."

Stops circular debugging — The 3-exchange rule forces strategy pivots. No more looping endlessly on the wrong approach.

What's Next: Debug Bank v2

The roadmap includes runnable Docker scenarios where you can practice debugging L3-L4 problems autonomously. Imagine: a broken voice pipeline, a misconfigured database fallback chain, and a timing race condition — all in a sandbox you can experiment in.

The goal is to turn Debug Bank from a reference manual into a practice environment.

Getting Started

  1. Clone or download: git clone https://github.com/soleimanmansouri/debug-bank
  2. Copy CLAUDE.md: curl -O https://raw.githubusercontent.com/soleimanmansouri/debug-bank/main/CLAUDE.md
  3. Read the patterns: skim patterns/ to learn what's available
  4. On your first bug: pattern check first, then the trajectory protocol
  5. Set up pre-deploy: add bash integrations/pre-deploy-check.sh to your deploy pipeline

The full setup guide lives in the repo. You can start using patterns in under 5 minutes.

Why I'm Open-Sourcing This

I built Debug Bank from months of production debugging across 3 major projects and 100+ real incidents. Every pattern has been verified on actual bugs. Every rule has been tested under pressure.

But patterns are only as useful as they are shared. A pattern bank gains value when developers from different domains contribute new patterns, challenge existing assumptions, and improve the checklists.

If you've debugged a production system and found a repeating failure pattern, Debug Bank wants it. Submit a PR with a real example, and the pattern becomes part of the shared knowledge base.

The Bet

My bet is simple: If you give your AI agent access to a well-organized pattern bank, the agent will solve 70% of future bugs at step 1 instead of re-investigating every time.

That's a 10-100x speedup depending on the complexity. That's fewer wasted tokens, faster fixes, and debugging that feels less like Sisyphus pushing a boulder uphill.


Debug Bank is MIT licensed. Zero dependencies. Just markdown and a bash script.

Start here: github.com/soleimanmansouri/debug-bank

Have a bug that doesn't fit the 22 patterns? Open an issue or PR. Let's grow this together.
