I Built a Multi-Agent Code Review Skill for Claude Code — Here's How It Works
Here's a stat that should bother every developer: 96% of us don't fully trust AI-generated code, but only 48% actually verify it before committing (SonarSource, 2026). That's a verification gap wide enough to ship critical bugs through.
I've been watching AI write more and more of my code over the past year. Claude Code became my daily driver. It's fast. It's capable. But speed without review is just velocity toward technical debt. So I built something to close that gap — a multi-agent code review skill that runs directly inside Claude Code, performing senior-engineer-level static analysis without ever touching your files.
This post breaks down why I built it, how the architecture works, and how you can start using it today.
TL;DR: I built an open-source, multi-agent code review skill for Claude Code with 9 specialized sub-skills — one domain expert each for security, SOLID, architecture, error handling, performance, testing, code smells, patterns, and framework idioms. It runs them in parallel, generates copy-pasteable fix prompts, and saves every audit to `./codeprobe-reports/<timestamp>.md`. Install in one command: `npx skills add nishilbhave/codeprobe`. AI-coauthored PRs show 1.7x more issues than human-only PRs (Jellyfish, 2026). This skill helps close that gap.
Why Does AI-Generated Code Need Better Review?
Forty-two percent of committed code is now AI-generated or AI-assisted, and 38% of developers say reviewing that code takes more effort than reviewing human-written code (SonarSource, 2026). We're producing code faster than we can verify it. That's not a productivity win — it's a quality time bomb.
The numbers paint a clear picture. Across 600+ organizations, teams with the highest AI adoption saw PRs per engineer jump 113%, from 1.36 to 2.9 per week. But those AI-coauthored PRs showed roughly 1.7x more issues than human-only PRs (Jellyfish, 2026). More output, more problems.
And here's the trust paradox. 80% of developers now use AI tools regularly, yet trust in AI accuracy has actually declined — from 40% to 29% year-over-year (Stack Overflow, 2026). We keep using tools we don't trust because the productivity pull is too strong to ignore.
The economic stakes are real. Poor software quality costs the US economy $2.41 trillion annually, and bug-fix costs multiply up to 30x when caught in production versus during development (CISQ/NIST, 2026). Every bug your review process misses compounds that cost.
So the question isn't whether to review AI-generated code. It's how to review it at the same speed AI produces it. That's where multi-agent systems come in.
The gap between AI adoption and verification represents unreviewed AI code entering production.
AI-coauthored pull requests exhibit approximately 1.7 times more issues than human-only PRs, while PRs per engineer jumped 113% in high-AI-adoption teams (Jellyfish, 2026). The speed-quality tradeoff means AI-assisted teams need stronger automated review, not weaker review because "the AI wrote it."
What Is the Code Review Skill?
Teams using AI for code review alongside productivity tools saw quality improvements 81% of the time, compared to just 55% for fast teams without AI review (Qodo, 2026). I wanted those gains without leaving my terminal.
The CodeProbe skill is an open-source, multi-agent code review system that runs directly inside Claude Code. Think of it as a team of nine domain-expert reviewers that analyze your code simultaneously — one focused on security, another on SOLID principles, another on architecture, and so on.
Three design decisions shaped everything:
Read-only, always. The skill never modifies, writes, or deletes your files. It reads your code, finds issues, and generates copy-pasteable fix prompts you can run separately. You stay in control.
One-command install. npx skills add nishilbhave/codeprobe — no repo clone, no config files, no API keys, no OS-specific variants. The skill drops into ~/.claude/skills/ and is immediately available in any project.
Fix prompts, not auto-fixes. Every finding includes a fix_prompt field — a ready-to-paste Claude Code instruction that applies the fix. You decide which fixes to accept. The skill just does the thinking.
It works in two modes. Full mode in Claude Code gives you filesystem access, parallel agents, and deterministic metrics. Degraded mode in Claude.ai still analyzes pasted or uploaded code — you lose parallelism and file scanning, but keep the core review logic.
The skill performs senior-engineer-level static analysis across nine domains including security, SOLID principles, and architecture. With 80% of AI-reviewed PRs requiring no additional human comments (Qodo, 2026), automated review is becoming the quality backstop AI-heavy teams need.
How Does the Multi-Agent Architecture Work?
Gartner reported a 1,445% surge in multi-agent system inquiries from Q1 2026 to Q2 2026, and predicts 40% of enterprise apps will embed task-specific AI agents by 2026 (Gartner, 2026). The multi-agent pattern isn't hype anymore. It's how you solve problems that are too broad for a single prompt.
The architecture follows an orchestrator pattern. A central router (SKILL.md) receives your command, auto-detects your tech stack, loads the right reference guides, and dispatches work to specialized sub-skills.
The Nine Domain Experts
Each sub-skill is a focused analyzer with its own detection rules:
| Sub-Skill | What It Catches |
|---|---|
| codeprobe-solid | Classes with 5+ unrelated methods, switch-on-type patterns, LSP violations, fat interfaces |
| codeprobe-security | SQL injection, XSS, hardcoded credentials, mass assignment, broken access control |
| codeprobe-architecture | Coupling issues, circular dependencies, god objects (>500 LOC), missing module boundaries |
| codeprobe-code-smells | Long methods (>30 LOC), feature envy, data clumps, dead code, magic numbers, deep nesting |
| codeprobe-patterns | Recommends Builder, Factory, Strategy only when concrete problems exist; detects Singleton abuse |
| codeprobe-performance | N+1 queries, missing indexes, O(n²) algorithms, race conditions, unnecessary re-renders |
| codeprobe-error-handling | Swallowed exceptions, missing try-catch, absent timeout/retry, insufficient validation |
| codeprobe-testing | Untested public methods, mock abuse, missing edge cases, brittle test data |
| codeprobe-framework | Stack-specific checks — Laravel Eloquent patterns, React hook rules, Next.js conventions |
Auto-Detection and Reference Loading
The skill doesn't ask you what stack you're using. It scans file extensions and project markers — .tsx files plus a next.config.* triggers React/Next.js references. A migrations/ directory loads SQL rules. Six language-specific reference guides cover PHP/Laravel, JavaScript/TypeScript, Python, React/Next.js, SQL, and API design.
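As a minimal sketch of what marker-based stack detection can look like, here is a scan over file extensions and marker files. The mapping tables are illustrative stand-ins, not CodeProbe's actual detection rules:

```python
from pathlib import Path

# Hypothetical marker -> stack tables, modeled on the behavior
# described above (illustrative, not CodeProbe internals).
STACK_MARKERS = {
    "next.config": "react-nextjs",   # next.config.* => React/Next.js refs
    "composer.json": "php-laravel",
    "requirements.txt": "python",
    "migrations": "sql",             # migrations/ dir => SQL rules
}
EXTENSION_HINTS = {
    ".tsx": "react-nextjs",
    ".php": "php-laravel",
    ".py": "python",
}

def detect_stacks(root: str) -> set:
    """Walk the project tree once, collecting every stack hinted at
    by a known file extension or project marker."""
    stacks = set()
    for path in Path(root).rglob("*"):
        for marker, stack in STACK_MARKERS.items():
            if path.name.startswith(marker):
                stacks.add(stack)
        if path.suffix in EXTENSION_HINTS:
            stacks.add(EXTENSION_HINTS[path.suffix])
    return stacks
```

Keeping detection to a single filesystem walk is what makes the zero-config flow cheap enough to run on every invocation.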
Design decision I'm glad I made: building auto-detection instead of requiring manual configuration. Most review tools front-load setup with config files and CLI flags. I wanted `cd my-project && /codeprobe audit` to just work. The detection adds maybe 200ms and saves every first-time user from reading docs before getting value.
The execution flow runs ten steps: parse command, validate against routing table, load config (or defaults), detect stack, load references, route to sub-skills, collect findings, compute scores, render report, present results. Each step is deterministic — same codebase, same findings every time.
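The validate-and-route steps of that flow reduce to a lookup against a routing table. A sketch, with the caveat that the table contents are my illustration rather than the real SKILL.md routing:

```python
# All nine sub-skills, as listed in the table above.
ALL_SKILLS = [
    "solid", "security", "architecture", "code-smells", "patterns",
    "performance", "error-handling", "testing", "framework",
]

# Illustrative routing: each command dispatches to sub-skills.
# The commands are real; the exact routing is a sketch.
ROUTING_TABLE = {
    "audit": ALL_SKILLS,   # full parallel pass, saved report
    "quick": ALL_SKILLS,   # same pass, only top-5 findings reported
    "health": ALL_SKILLS,  # aggregate scores plus metrics dashboard
}

def route(command: str) -> list:
    """Step 2 of the flow: reject unknown commands before any work
    happens, then hand back the sub-skills to dispatch (step 6)."""
    if command not in ROUTING_TABLE:
        raise ValueError(f"unknown command: {command}")
    return ROUTING_TABLE[command]
```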
Multi-agent AI system inquiries surged 1,445% in just over a year, reflecting explosive enterprise interest in orchestrated AI workflows (Gartner, 2026). The code review skill applies this pattern to a problem every developer faces daily: ensuring code quality at AI-generation speed.
Why Nine Sub-Skills Instead of One Generic Reviewer?
McKinsey found that companies with 80–100% developer AI adoption saw productivity gains exceeding 110%, with teams saving an average of 6 hours per week (McKinsey, 2026). You only capture those gains if your review catches the right problems — and a single generic reviewer, human or AI, can't hold nine specialized domains in its head at the same time.
When you run /codeprobe audit, the orchestrator dispatches all nine sub-skills in parallel. Each one is a focused domain expert — security, SOLID, architecture, error handling, performance, testing, code smells, patterns, framework — with its own detection rules, reference guides, and scoring logic. Instead of one reviewer loosely sweeping across everything, you get nine reviewers that each specialize.
Why granular specialization outperforms broad-bucket reviewers in practice:
- Focused scope per sub-skill. A security sub-skill reviewing for SQL injection isn't distracted by naming conventions. A SOLID reviewer isn't diluted by performance heuristics. Each sub-skill loads only what's relevant to its domain, so every token it processes is earning its keep.
- Domain-specific reference guides. The security sub-skill works from OWASP patterns. The framework sub-skill loads Laravel, React, or Django idioms depending on the detected stack. No single prompt can be equally deep in nine different domains — specialization lets each one be genuinely senior in its lane.
- Independent scoring. Findings don't get averaged into a generic "code quality" number. Security is tracked separately from test quality, so a perfect security score can't mask mediocre testing (or vice versa).
- Parallel execution. Sub-skills run concurrently, not sequentially. A multi-hundred-file codebase gets audited in roughly the time it takes the slowest single sub-skill to finish, not the sum of nine serial passes.
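If you model each sub-skill as a callable over the file list, the parallel dispatch in the last bullet can be sketched with a thread pool. The reviewer functions here are stand-ins, not CodeProbe internals:

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(sub_skills: dict, files: list) -> dict:
    """Run every sub-skill concurrently over the same file list.
    Wall time tracks the slowest reviewer, not the sum of all nine."""
    with ThreadPoolExecutor(max_workers=len(sub_skills)) as pool:
        futures = {name: pool.submit(fn, files)
                   for name, fn in sub_skills.items()}
        # Collect each reviewer's findings under its own name, so
        # scores stay independent per category.
        return {name: f.result() for name, f in futures.items()}
```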
What I learned building this: my first version tried to cram all review logic into one giant prompt. It was worse at every individual category than the specialized version that came later. Narrowing each sub-skill's scope made each one sharper. The lesson generalizes — when you give an LLM a focused role with a focused reference set, it outperforms a generalist instruction every single time.
Nine sub-skills isn't arbitrary. It maps to the nine domains senior reviewers actually track on a large PR — security, structure (SOLID), architecture, error paths, performance, tests, smells, patterns, framework idioms. Adding a tenth would overlap with the others; collapsing to fewer buckets would lose coverage or force a single sub-skill to juggle domains that don't share reference material. The granularity reflects how review work already divides in practice.
Companies with the highest AI adoption reported productivity gains exceeding 110%, with engineering teams saving six hours per week on routine tasks (McKinsey, 2026). Parallel specialist execution applies that same principle to code review — nine simultaneous experts instead of one sequential pass.
How Does the Scoring System Work?
GitHub's randomized controlled trial of 202 developers found Copilot users had a 53.2% greater likelihood of passing all unit tests (GitHub, 2026). But passing tests isn't the same as passing review. The scoring system quantifies what tests don't measure — design quality, security posture, and architectural health.
Every finding from every sub-skill gets classified by severity: critical, major, or minor. The category score uses capped per-severity penalties:
```
crit_penalty  = min(50, critical × 15)
major_penalty = min(30, major × 6)
minor_penalty = min(10, minor × 2)
score = max(0, 100 - crit_penalty - major_penalty - minor_penalty)
```
One critical finding drops your score by 15 points. The caps matter: each severity tier can only contribute so much, and even with all three tiers maxed out the combined penalty tops out at 90, so a category score never drops below 10. A genuinely broken category (many criticals plus majors) still ends up deep in the red; a category with one false positive doesn't. The math reflects how real-world impact compounds while keeping scores interpretable.
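The capped formula translates directly into code:

```python
def category_score(critical: int, major: int, minor: int) -> int:
    """Per-category score with capped per-severity penalties,
    exactly as defined in the formula above."""
    crit_penalty = min(50, critical * 15)
    major_penalty = min(30, major * 6)
    minor_penalty = min(10, minor * 2)
    return max(0, 100 - crit_penalty - major_penalty - minor_penalty)
```

Worth noting: `category_score(1, 0, 0)` is 85 (one critical costs 15 points), and even a disastrous category floor-outs at 10 because the caps sum to 90.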
The overall health score weights categories by importance. Security carries double the weight of test quality, because a vulnerability in production costs orders of magnitude more than a missing unit test.
Why these weights? I modeled them after incident post-mortem frequency data from my own projects. Security and architecture issues caused the most expensive production incidents. Code smells and pattern issues, while annoying, rarely caused outages. The weights reflect real-world blast radius, not theoretical importance.
Three health thresholds make the score actionable. 80–100 means healthy. 60–79 means needs attention — you've got issues accumulating but nothing's on fire. Below 60 is critical — stop shipping features and fix what's broken.
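Putting the weighting and the thresholds together, a sketch of the overall health computation. The weight table is illustrative (the only weight relation stated above is that security weighs double test quality); the three threshold bands are as described:

```python
# Illustrative weights: security at 2.0 vs testing at 1.0 matches the
# stated 2x relation; the remaining values are my assumptions.
WEIGHTS = {
    "security": 2.0, "architecture": 1.5, "solid": 1.0,
    "error-handling": 1.0, "performance": 1.0, "testing": 1.0,
    "code-smells": 0.5, "patterns": 0.5, "framework": 0.5,
}

def overall_health(scores: dict):
    """Weighted average of category scores, mapped to a status band:
    80-100 healthy, 60-79 needs attention, below 60 critical."""
    total = sum(WEIGHTS[c] * s for c, s in scores.items())
    score = total / sum(WEIGHTS[c] for c in scores)
    if score >= 80:
        status = "healthy"
    elif score >= 60:
        status = "needs attention"
    else:
        status = "critical"
    return score, status
```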
Every single finding follows a strict format:
```json
{
  "id": "SEC-003",
  "severity": "critical",
  "location": { "file": "src/UserService.php", "lines": "45-67" },
  "problem": "Raw user input concatenated into SQL query",
  "evidence": "Direct code quotes proving the issue",
  "suggestion": "Use parameterized queries via prepared statements",
  "fix_prompt": "In src/UserService.php lines 45-67, replace the raw SQL concatenation with a prepared statement using parameter binding"
}
```
That fix_prompt field is the key integration point. Copy it, paste it into Claude Code, and the fix gets applied. No context switching. No manual translation from "you should do X" to actually doing X.
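Because every finding follows that schema, pulling the fix prompts out in severity order is a few lines. This helper is my sketch, not part of the skill:

```python
import json

# Critical fixes should be pasted into Claude Code first.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}

def fix_prompts(findings_json: str) -> list:
    """Extract fix_prompt strings from a findings array, ordered
    critical -> major -> minor."""
    findings = json.loads(findings_json)
    findings.sort(key=lambda f: SEVERITY_ORDER[f["severity"]])
    return [f["fix_prompt"] for f in findings]
```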
What Does It Look Like in Practice?
Eighty-five percent of developers now use AI tools regularly for coding, with nearly 9 in 10 saving at least one hour per week (JetBrains, 2026). Here's how those savings look with the review skill.
Full Audit
The command you'll use most:
```
/codeprobe audit src/
```
This dispatches all nine sub-skills in parallel, scans every file in the directory, and produces a complete report with per-category scores, prioritized findings, and fix prompts for each issue. The full audit also saves to ./codeprobe-reports/<timestamp>.md so you can share it in a PR or diff it against the previous run. On a medium-sized project (20-50 files), expect results in under a minute.
Quick Review
When you just want the top problems:
```
/codeprobe quick src/
```
Returns the five highest-priority findings with fix prompts. Perfect for a fast sanity check before pushing.
Health Dashboard
For a bird's-eye view:
```
/codeprobe health
```
Shows aggregate scores across all categories, highlights hot spots (files with the most issues), and gives you a deterministic metrics summary — lines of code, class counts, method counts, comment ratios — powered by the bundled file_stats.py script.
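As a rough illustration of the kind of deterministic metrics file_stats.py is described as computing, here is a toy counter that only understands Python source (the real script recognizes 13+ language types):

```python
import re

def python_file_metrics(source: str) -> dict:
    """Count non-blank lines, classes, methods, and the comment ratio
    for a single Python file. Toy version for illustration only."""
    code = [line for line in source.splitlines() if line.strip()]
    comments = [line for line in code if line.lstrip().startswith("#")]
    return {
        "loc": len(code),
        "classes": len(re.findall(r"^\s*class\s+\w+", source, re.M)),
        "methods": len(re.findall(r"^\s*def\s+\w+", source, re.M)),
        "comment_ratio": round(len(comments) / max(1, len(code)), 2),
    }
```

Metrics like these are deterministic by construction: the same file always yields the same numbers, which is what lets you diff health reports across commits.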
Configuration
Drop a .codeprobe-config.json in your project root to customize thresholds:
```json
{
  "severity_overrides": {
    "long_method_loc": 50,
    "large_class_loc": 500,
    "max_constructor_deps": 6
  },
  "skip_categories": ["codeprobe-testing"],
  "skip_rules": ["SPEC-GEN-001"]
}
```
Don't want pattern analysis on a quick prototype? Skip it. Your legacy codebase has 80-line methods by convention? Raise the threshold. The defaults are opinionated but overridable.
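A sketch of how merging a project config over defaults can work. The default values (30-line methods, 300-line classes, 5 constructor deps) are the skill's documented defaults; the merge logic itself is illustrative:

```python
import json
from pathlib import Path

# Documented defaults; overridable per project.
DEFAULTS = {
    "severity_overrides": {
        "long_method_loc": 30,
        "large_class_loc": 300,
        "max_constructor_deps": 5,
    },
    "skip_categories": [],
    "skip_rules": [],
}

def load_config(project_root: str) -> dict:
    """Merge .codeprobe-config.json over the defaults; a missing
    file simply means defaults apply."""
    cfg = {k: (dict(v) if isinstance(v, dict) else list(v))
           for k, v in DEFAULTS.items()}
    path = Path(project_root) / ".codeprobe-config.json"
    if path.exists():
        user = json.loads(path.read_text())
        cfg["severity_overrides"].update(user.get("severity_overrides", {}))
        cfg["skip_categories"] = user.get("skip_categories", cfg["skip_categories"])
        cfg["skip_rules"] = user.get("skip_rules", cfg["skip_rules"])
    return cfg
```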
What I use daily: I run `/codeprobe quick` before every commit and `/codeprobe audit` before every PR. The quick review catches the obvious stuff — hardcoded strings, missing error handling, methods that got too long. The full audit catches the architectural drift that accumulates over weeks. It's become the part of my workflow I'd miss most if I lost it.
What Makes This Different from Linters?
Forty-five percent of developers say their top frustration with AI tools is "AI solutions that are almost right, but not quite" (Stack Overflow, 2026). Linters catch the obviously wrong. This skill catches the almost-right-but-not-quite.
ESLint catches syntax issues. Prettier fixes formatting. PHPStan checks types. Those tools are necessary. They're not sufficient.
The review skill operates at a layer above. It catches SOLID violations — a class doing five unrelated things. It finds architectural drift — a service layer calling the database directly. It flags security patterns a linter won't see — mass assignment vulnerabilities, broken access control, insecure deserialization chains.
Linters answer "does this code follow syntax rules?" The review skill answers "will I regret this code in six months?"
It also works across your entire codebase, not file-by-file. Circular dependencies, coupling between modules, god objects spanning hundreds of lines — these are codebase-level concerns that single-file linters miss entirely.
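Circular dependencies are a good example of why this needs whole-codebase analysis: no single file contains the cycle. A toy DFS cycle detector over an import graph shows the shape of the check:

```python
def find_cycle(graph: dict):
    """Depth-first search for a cycle in {module: [imported modules]}.
    Returns one cycle as a list of modules, or None if the graph is
    acyclic. A per-file linter cannot perform this check."""
    visiting, visited, path = set(), set(), []

    def dfs(node):
        visiting.add(node)
        path.append(node)
        for dep in graph.get(node, []):
            if dep in visiting:
                # Back edge found: slice out the cycle from the path.
                return path[path.index(dep):] + [dep]
            if dep not in visited:
                cycle = dfs(dep)
                if cycle:
                    return cycle
        visiting.discard(node)
        visited.add(node)
        path.pop()
        return None

    for node in graph:
        if node not in visited:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None
```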
The fix-prompt model is the real differentiator. Most review tools tell you what's wrong. This one tells you what's wrong and hands you the instruction to fix it. That's not a convenience feature — it's a workflow design choice. When the friction between "finding a problem" and "fixing it" drops to a single paste, developers actually fix things instead of adding TODO comments.
Getting Started
Claude Code became the #1 AI coding tool in just 8 months, overtaking GitHub Copilot and Cursor (The Pragmatic Engineer, 2026). If you're already using it, installing the review skill takes one command:
One-Command Install
```
npx skills add nishilbhave/codeprobe
```
That's it. Same command on macOS, Linux, and Windows — no shell-script variants, no curl | bash, no repo clone. npx skills add drops CodeProbe into ~/.claude/skills/ and it's immediately available in any project.
Manage it: npx skills update to pull upgrades, npx skills remove to uninstall.
Python 3.8+ is optional — it enables the deterministic metrics engine in /codeprobe health and the full audit dashboard. Every other review capability works without it.
Reports save to disk. Every audit writes a timestamped Markdown report to ./codeprobe-reports/<timestamp>.md in your current directory, so you can diff reports across commits, share them in a PR, or grep for regressions over time. Add codeprobe-reports/ to .gitignore if you don't want them committed.
Then open any project in Claude Code and run:
```
/codeprobe audit .
```
That's it. No config files to create. No API keys to set up. The skill auto-detects your stack and starts reviewing.
It's MIT licensed. Fork it, extend it, build on it. Contributions are welcome — especially for additional language reference guides and framework-specific detection rules.
Frequently Asked Questions
Does the skill modify my code?
Never. The skill is strictly read-only — it won't write, edit, or delete any of your files. Every issue comes with a fix_prompt you can paste into Claude Code to apply the fix yourself. That keeps you in full control, which matters when 96% of developers say they don't fully trust AI-generated code (SonarSource, 2026).
What languages and frameworks does it support?
Primary support covers PHP/Laravel, JavaScript/TypeScript, Python, React/Next.js, SQL, and API design — with dedicated reference guides for each. The deterministic metrics engine (file_stats.py) recognizes 13+ language types including Java, Ruby, Go, Rust, Vue, and Svelte. Stack detection is automatic based on file extensions and project markers.
Can I use it in Claude.ai without Claude Code?
Yes, in degraded mode. Paste or upload your code and run review commands normally. You'll lose parallel agent execution, filesystem scanning, and script-dependent metrics — but the core review logic, scoring, and fix prompts all work. The skill automatically detects which environment it's running in.
How do I customize severity thresholds?
Add a .codeprobe-config.json file to your project root. You can override defaults like long_method_loc (default: 30 lines), large_class_loc (default: 300), and max_constructor_deps (default: 5). You can also skip entire categories or individual rules. See the repo documentation for the full config schema.
Is it open source?
Fully open source under the MIT license. The repo is at github.com/nishilbhave/codeprobe-claude. Install it via npx skills add nishilbhave/codeprobe. Python 3.8+ is an optional runtime dependency that unlocks the deterministic metrics dashboard — everything else runs without it. Contributions are welcome, especially for new language reference guides and framework detection rules.
Conclusion
AI writes code faster than humans can review it. That's not going to change — it's going to accelerate. The question is whether your review process can keep up.
Here's what I'd take away:
- The verification gap is real. Only 48% of developers always verify AI code, despite 96% not trusting it. Automated review isn't optional anymore.
- Specialized sub-skills beat a single pass. Nine focused domain experts catch what a generalist reviewer — human or AI — would miss. Parallel execution keeps it fast.
- Fix prompts close the loop. Finding problems is easy. Getting developers to fix them is hard. Copy-pasteable instructions remove the friction.
- One-command install wins adoption. `npx skills add nishilbhave/codeprobe` plus auto-detection means zero setup between installing and getting your first audit.
The skill is free, open source, and ready to use: github.com/nishilbhave/codeprobe-claude. Install it with npx skills add nishilbhave/codeprobe, run /codeprobe audit on your next project, and see what it finds. I'd bet it catches something you didn't expect.