bchtitihi
How I Audited 250K Lines of Legacy Code with 11 AI Agents in One Week

I inherited a monolith. 250,000 lines of Python. 20+ years old. The framework had been end-of-life since 2018, the language since 2020. Zero tests. Passwords stored in plain text. A proprietary library maintained by 2 people, embedded in 133 imports across 47 files. A database with 462 tables using exotic PostgreSQL inheritance instead of standard ORM patterns. And 900+ production websites depending on it.

My job: audit the entire thing before a rebuild decision. Traditional approach: 2-3 senior consultants, 4-8 weeks, six figures.

My approach: 11 AI agents, 2 adversarial teams, 7 iterations, 10 days.

Here's what happened — including the mistakes that made it work.

Iteration 1: The Naive Start (1 agent)

I started where everyone starts. One Claude conversation. Upload the codebase. Ask questions.

The results looked impressive: 1,100 paragraphs, 18 sections covering architecture, security, performance, business rules. My first thought was "this is amazing."

My second thought, three days later, was "half of this is wrong."

The hallucinations:

  • The agent claimed a major frontend library was "not present in the codebase." A simple grep later found it in 11 files.
  • It estimated "200+ SQL triggers." The actual count was 401.
  • Most findings had no file references. When I tried to verify them, I couldn't find what the agent was talking about.
  • 4 database classes were referenced that didn't exist anywhere in the codebase. The agent had invented plausible-sounding names with field counts and relationships.

The lesson: AI hallucinates when it can't verify. Without file:line proof, findings are fiction.

Rule #1 established: Every finding must include file:line proof. No exceptions.
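One way to enforce Rule #1 mechanically is a small gate that rejects any finding lacking a `path:line` citation. The function names and the regex below are my illustration, not part of the original audit setup:

```python
import re

# Matches references like "app/views/billing.py:142" anywhere in a finding.
# Pattern and gate are illustrative, not from the actual agent configuration.
FILE_LINE_RE = re.compile(r"\b[\w./-]+\.(?:py|sql|html|js)\s*:\s*\d+\b")

def has_proof(finding: str) -> bool:
    """A finding passes only if it cites at least one file:line location."""
    return bool(FILE_LINE_RE.search(finding))

def gate(findings: list[str]) -> tuple[list[str], list[str]]:
    """Split findings into (accepted, rejected) per Rule #1: no proof, no entry."""
    accepted = [f for f in findings if has_proof(f)]
    rejected = [f for f in findings if not has_proof(f)]
    return accepted, rejected
```

Run as a post-processing step, a gate like this turns "no exceptions" from a guideline into a filter.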

Iteration 2: Filesystem Access (4 agents)

The fix seemed obvious: give the agents filesystem access so they can actually grep and find before making claims.

I set up 4 specialized agents running sequentially:

| # | Agent | Role |
|---|-------|------|
| 1 | Security Hunter | OWASP Top 10, credentials, injections |
| 2 | Code Archaeologist | Dead code, business rules, module scope |
| 3 | Metrics Counter | Exact counts, schema, performance |
| 4 | Cross-Checker | Consolidation, contradictions |

I used a configuration inspired by the shanraisshan/claude-code-best-practice repo: YAML frontmatter for agents, glob-based rules that load only when needed, and progressive disclosure for skills — only descriptions load at startup, with full content loaded on demand. This saved roughly 60% of the context window.
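The agent definitions follow the Claude Code pattern of a markdown file with YAML frontmatter. The sketch below shows the shape; the exact instructions and tool lists are placeholders, not copied from the repo:

```yaml
# .claude/agents/security-hunter.md (frontmatter sketch; body text is illustrative)
---
name: security-hunter
description: Hunts OWASP Top 10 issues, hardcoded credentials, and injection points
tools: Read, Grep, Glob
---
You are a security auditor. Every finding MUST cite file:line proof.
Search the codebase before making any claim. Never estimate effort or cost.
```

Because only the `description` line loads at startup, a team of specialized agents stays cheap until one is actually invoked.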

What worked: Agent 3 ran `grep -c "CREATE TRIGGER"` and got 954 (not the "200+" estimate from Iteration 1). Real numbers replaced guesses.

What went wrong: 15+ findings were left marked "[TO VERIFY]." The agents couldn't verify each other because they ran sequentially — by the time Agent 4 found issues, Agents 1-3 were done.

Also, Agent 4 estimated security remediation at "€95K + 4 weeks." Agents have zero basis for cost estimation. This was pure hallucination dressed as analysis.

Rules established:

  • No "[TO VERIFY]" in final deliverables
  • No effort estimates — agents audit, humans estimate
  • Cross-review between agents is mandatory

Iteration 3: Parallel with Cross-Review (5 agents)

The big change: agents running in parallel, each reviewing one other agent's work.

The breakthrough: Zero "[TO VERIFY]" markers. When Agent 1 claimed 35 imports and Agent 3 counted 38, the consolidator re-ran the grep and settled it (38 was correct).

The problem: Good breadth, shallow depth. The security agent found 12 vulnerabilities. A deeper audit later found 19+ with 7 critical.

Context management became critical. I learned that compacting at 70% of context usage (not the default 95%) prevents agents from losing instructions mid-analysis. And CLAUDE.md files over 200 lines get partially ignored — details need to move into separate rule and skill files.

Iteration 4: Specialized Deep-Dive (10 agents)

One agent per domain: workflows, batch processes, forms, templates, middleware, database schema, integrations, module classification, security+GDPR, quality arbiter.

This produced exhaustive reports. The Module Classifier categorized every view, model, and route as IN_SIMPLE / IN_COMPLEX / OUT / GRAY_ZONE — giving the CTO a clear decision framework: "28 things are easy, 47 are hard, 120 are out of scope, 12 need your decision."
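The bucketing logic can be sketched in a few lines. The signals and thresholds here are hypothetical stand-ins; the real classifier weighed workflows, templates, and integrations per module:

```python
def classify_module(loc: int, uses_proprietary_lib: bool, is_dead: bool) -> str:
    """Bucket a module for the rebuild decision.

    Signals and thresholds are illustrative, not the repo's actual heuristics.
    """
    if is_dead:
        return "OUT"          # dormant code: not worth migrating
    if uses_proprietary_lib:
        return "GRAY_ZONE"    # depends on the 2-person library: needs a human call
    if loc <= 300:
        return "IN_SIMPLE"    # small, self-contained: easy to port
    return "IN_COMPLEX"       # large active module: hard but in scope

# Tally buckets over toy modules: (lines of code, uses proprietary lib, dead)
counts = {"IN_SIMPLE": 0, "IN_COMPLEX": 0, "OUT": 0, "GRAY_ZONE": 0}
for loc, proprietary, dead in [(120, False, False), (2400, False, False),
                               (90, True, False), (500, False, True)]:
    counts[classify_module(loc, proprietary, dead)] += 1
```

Whatever the real heuristics, the output shape is the point: a tally a CTO can act on instead of 1,100 paragraphs of prose.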

But this is also where the biggest error was introduced.

Agent 6 (Schema Architect) reported 889 foreign keys. This number came from counting columns named `ref_*` — a naming convention. Previous iterations had reported the same number. Nobody questioned it.

889 traveled from Iteration 2 → 3 → 4 without anyone verifying the actual database constraints. I'll tell you the real number soon.

Iteration 5: The Validation Tribunal (15 agents)

15 fresh agents, one per domain, each re-verifying every finding from Iterations 1-4. They could only read source code — not previous reports. This prevented bias.

| Verdict | Count | % |
|---|---|---|
| Confirmed | 131 | 78.0% |
| Partially confirmed | 20 | 11.9% |
| Invalidated | 14 | 8.3% |
| Not verifiable | 1 | 0.6% |
| New findings | 17 | — |
| **Reliability score** | | **89.4%** |

The 4 hallucinations from earlier iterations were caught here. The agents had invented database classes that sounded plausible but didn't exist.

But the 889 FK survived. The validators re-ran `grep -c "ref_"` and got 889 again. The query was correct. The interpretation was wrong.

Per-domain scores revealed the weak spot: Architecture: 100%. Business rules: 100%. Database: 37.5%. That score screamed "investigate further."

Iteration 6: Active Exploration (7 agents)

Instead of re-reading the same code, I introduced new data sources: git history, CVE scanning (pip-audit), and the production database schema.

The 889 → 15 moment

The Schema Inspector obtained the production schema and ran:

```sql
SELECT COUNT(*)
FROM information_schema.table_constraints
WHERE constraint_type = 'FOREIGN KEY';
```

Result: 15.

All 15 were on system tables. Zero on business tables. For 20+ years, the application had operated with zero referential integrity on its core data.

The migration strategy changed completely. Instead of "migrate 889 FK relationships," it became "design proper constraints for the new system."
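The trap generalizes: a naming convention is not a constraint. The contrast between the two counts can be shown with toy schema data (these columns are made up, not the real 462-table schema):

```python
# Two very different questions about the same schema.
columns = ["id", "ref_client", "ref_invoice", "ref_status", "label"]
fk_constraints = []  # what information_schema.table_constraints actually enforces

# grep -c "ref_" answers "how many columns LOOK like foreign keys"
ref_named = sum(1 for c in columns if c.startswith("ref_"))

# information_schema answers "how many the database actually ENFORCES"
real_fks = len(fk_constraints)

assert ref_named == 3 and real_fks == 0  # convention says 3, the database says 0
```

Same tables, same data, two answers — and only one of them tells you what happens when a row is deleted.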

Other discoveries

  • 97% of code was dormant (135 active files out of 4,369)
  • 36 CVEs confirmed (4 critical, CVSS ≥ 9.0)
  • Bus factor = 1 (one developer owned 67-72% of commits on 7/8 critical modules)
  • The database dump was 8 years old
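Commit ownership like the bus-factor finding above can be computed from `git log` output. A sketch over a toy author list (the pipeline and names are mine, not the audit's actual tooling):

```python
from collections import Counter

def top_author_share(authors: list[str]) -> float:
    """Fraction of a module's commits owned by its most active author."""
    counts = Counter(authors)
    return max(counts.values()) / len(authors)

# Toy commit authorship for one module; in practice the list comes from
#   git log --format='%an' -- path/to/module
commits = ["alice"] * 7 + ["bob"] * 2 + ["carol"]
share = top_author_share(commits)  # 0.7: one author owns 70% of commits
```

When that share sits at 67-72% across 7 of 8 critical modules, the "bus factor = 1" verdict falls out of arithmetic, not opinion.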

Iteration 7: The Adversarial Dual-Team (11 agents)

The final iteration. Two independent teams: Team A (7 agents) audits, Team B (4 agents) tries to prove Team A wrong.

Critical rule: Team B cannot modify Team A's reports. It produces its own files.

| Metric | Value |
|---|---|
| Findings challenged | 22 |
| Confirmed | 15 |
| Nuanced (correct but misleading) | 5 |
| Invalidated | 2 |
| **Reliability score** | **81.8%** |

Why 81.8% is better than 89.4%

Iteration 5's 89.4% was a validation score: "Are the facts correct?" Iteration 7's 81.8% was an adversarial score: "Can we find reasons these facts are wrong or misleading?"

The lower score is more trustworthy. It means the process works.
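The repo documents the exact scoring formula; the sketch below uses illustrative weights (full credit for confirmed findings, half for partial or nuanced ones, none for invalidated) just to show the shape of such a score:

```python
def reliability(confirmed: int, partial: int, invalidated: int,
                weights: tuple[float, float, float] = (1.0, 0.5, 0.0)) -> float:
    """Weighted share of challenged findings that held up, as a percentage.

    Weights are illustrative; the open-sourced methodology defines its own formula.
    """
    total = confirmed + partial + invalidated
    score = (confirmed * weights[0] + partial * weights[1]
             + invalidated * weights[2])
    return round(100 * score / total, 1)
```

With these placeholder weights the result won't reproduce the article's 81.8% exactly; the structural point is that partial and invalidated findings drag the score down instead of vanishing.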

The 8 Rules (Learned the Hard Way)

| Rule | Learned in | Why |
|---|---|---|
| Every finding needs file:line proof | Iteration 1 | Without proof, agents hallucinate |
| Search before claiming | Iteration 1 | "Library absent" — found in 11 files |
| No effort estimates | Iteration 2 | Agents are terrible at estimation |
| Cross-review between agents | Iteration 3 | Agents contradict each other silently |
| Classify everything | Iteration 4 | Decision-makers need decisions |
| Re-verify previous iterations | Iteration 5 | 889 FK traveled 4 iterations unchecked |
| Use production data | Iteration 6 | 889 → 15 changed everything |
| Adversarial beats validation | Iteration 7 | 89.4% looked good, 81.8% was honest |

The Open-Source Methodology

I've open-sourced everything: github.com/bchtitihi/legacy-audit-agents

The repo includes:

  • 7 detailed iteration documents (every mistake documented)
  • 6 progressive setup levels (Level 2 → 7) — you can iterate the same way I did
  • 11 agent definitions (7 Team A + 4 Team B)
  • Rules, skills, commands, hooks — all battle-tested
  • Stack-specific examples (Django, Rails, Node.js)
  • Reliability scoring formula
  • References to Anthropic's official docs and community best practices

Quick start

```shell
git clone https://github.com/bchtitihi/legacy-audit-agents.git
cp -r legacy-audit-agents/setup/.claude your-project/.claude
cp legacy-audit-agents/setup/CLAUDE.md your-project/CLAUDE.md
cd your-project && claude --dangerously-skip-permissions
# Type: /audit-run
```

Or follow the progressive path: start at Level 2, iterate to Level 7.

Why I'm Publishing This

Every other Claude Code repo gives you configs and templates. This one gives you a methodology — with every mistake documented so you don't repeat them.

The methodology didn't start with 11 agents. It started with 1 agent that hallucinated. This progression is the real value. Not just the final setup — the entire journey from naive to adversarial.

Nobody else has published this. I'm first, and I want others to build on it.


⭐ Star the repo if this is useful. Questions? Open an issue.
