Nishil Bhave

Posted on • Originally published at maketocreate.com

I Built a Multi-Agent Code Review Skill for Claude Code — Here's How It Works

Code displayed on multiple monitor screens with blue and red neon lighting in a dark developer workspace


Here's a stat that should bother every developer: 96% of us don't fully trust AI-generated code, but only 48% actually verify it before committing (SonarSource, 2026). That's a verification gap wide enough to ship critical bugs through.

I've been watching AI write more and more of my code over the past year. Claude Code became my daily driver. It's fast. It's capable. But speed without review is just velocity toward technical debt. So I built something to close that gap — a multi-agent code review skill that runs directly inside Claude Code, performing senior-engineer-level static analysis without ever touching your files.

This post breaks down why I built it, how the architecture works, and how you can start using it today.


TL;DR: I built an open-source, multi-agent code review skill for Claude Code with 9 specialized sub-skills and 4 parallel agents. It catches SOLID violations, security flaws, architecture issues, and code smells — then generates copy-pasteable fix prompts. Zero dependencies. AI-coauthored PRs show 1.7x more issues than human-only PRs (Jellyfish, 2026). This helps close that gap.


Why Does AI-Generated Code Need Better Review?

Forty-two percent of committed code is now AI-generated or AI-assisted, and 38% of developers say reviewing that code takes more effort than reviewing human-written code (SonarSource, 2026). We're producing code faster than we can verify it. That's not a productivity win — it's a quality time bomb.

The numbers paint a clear picture. Across 600+ organizations, teams with the highest AI adoption saw PRs per engineer jump 113%, from 1.36 to 2.9 per week. But those AI-coauthored PRs showed roughly 1.7x more issues than human-only PRs (Jellyfish, 2026). More output, more problems.

And here's the trust paradox. 80% of developers now use AI tools regularly, yet trust in AI accuracy has actually declined — from 40% to 29% year-over-year (Stack Overflow, 2026). We keep using tools we don't trust because the productivity pull is too strong to ignore.

The economic stakes are real. Poor software quality costs the US economy $2.41 trillion annually, and bug-fix costs multiply up to 30x when caught in production versus during development (CISQ/NIST, 2026). Every bug your review process misses compounds that cost.

So the question isn't whether to review AI-generated code. It's how to review it at the same speed AI produces it. That's where multi-agent systems come in.

Grouped bar chart comparing AI tool adoption rate at 80 percent, trust in AI accuracy at 29 percent, and developers who always verify AI code at 48 percent

The gap between AI adoption and verification represents unreviewed AI code entering production.

AI-coauthored pull requests exhibit approximately 1.7 times more issues than human-only PRs, while PRs per engineer jumped 113% in high-AI-adoption teams (Jellyfish, 2026). The speed-quality tradeoff means AI-assisted teams need stronger automated review, not weaker review because "the AI wrote it."



What Is the Code Review Skill?

Close-up of HTML and CSS code with colorful syntax highlighting on a dark monitor screen

Teams using AI for code review alongside productivity tools saw quality improvements 81% of the time, compared to just 55% for fast teams without AI review (Qodo, 2026). I wanted those gains without leaving my terminal.

The CodeProbe skill is an open-source, multi-agent code review system that runs directly inside Claude Code. Think of it as a team of nine domain-expert reviewers that analyze your code simultaneously — one focused on security, another on SOLID principles, another on architecture, and so on.

Three design decisions shaped everything:

Read-only, always. The skill never modifies, writes, or deletes your files. It reads your code, finds issues, and generates copy-pasteable fix prompts you can run separately. You stay in control.

Zero dependencies. Python 3.8+ standard library only. No pip installs, no package conflicts, no supply chain worries. Clone it and go.

Fix prompts, not auto-fixes. Every finding includes a fix_prompt field — a ready-to-paste Claude Code instruction that applies the fix. You decide which fixes to accept. The skill just does the thinking.

It works in two modes. Full mode in Claude Code gives you filesystem access, parallel agents, and deterministic metrics. Degraded mode in Claude.ai still analyzes pasted or uploaded code — you lose parallelism and file scanning, but keep the core review logic.

The skill performs senior-engineer-level static analysis across nine domains including security, SOLID principles, and architecture. With 80% of AI-reviewed PRs requiring no additional human comments (Qodo, 2026), automated review is becoming the quality backstop AI-heavy teams need.



How Does the Multi-Agent Architecture Work?

Gartner reported a 1,445% surge in multi-agent system inquiries from Q1 2026 to Q2 2026, and predicts 40% of enterprise apps will embed task-specific AI agents by 2026 (Gartner, 2026). The multi-agent pattern isn't hype anymore. It's how you solve problems that are too broad for a single prompt.

The architecture follows an orchestrator pattern. A central router (SKILL.md) receives your command, auto-detects your tech stack, loads the right reference guides, and dispatches work to specialized sub-skills.

The Nine Domain Experts

Each sub-skill is a focused analyzer with its own detection rules:

  • codeprobe-solid: classes with 5+ unrelated methods, switch-on-type patterns, LSP violations, fat interfaces
  • codeprobe-security: SQL injection, XSS, hardcoded credentials, mass assignment, broken access control
  • codeprobe-architecture: coupling issues, circular dependencies, god objects (>500 LOC), missing module boundaries
  • codeprobe-code-smells: long methods (>30 LOC), feature envy, data clumps, dead code, magic numbers, deep nesting
  • codeprobe-patterns: recommends Builder, Factory, Strategy only when concrete problems exist; detects Singleton abuse
  • codeprobe-performance: N+1 queries, missing indexes, O(n²) algorithms, race conditions, unnecessary re-renders
  • codeprobe-error-handling: swallowed exceptions, missing try-catch, absent timeout/retry, insufficient validation
  • codeprobe-testing: untested public methods, mock abuse, missing edge cases, brittle test data
  • codeprobe-framework: stack-specific checks (Laravel Eloquent patterns, React hook rules, Next.js conventions)

Auto-Detection and Reference Loading

The skill doesn't ask you what stack you're using. It scans file extensions and project markers — .tsx files plus a next.config.* triggers React/Next.js references. A migrations/ directory loads SQL rules. Six language-specific reference guides cover PHP/Laravel, JavaScript/TypeScript, Python, React/Next.js, SQL, and API design.

Design decision I'm glad I made: building auto-detection instead of requiring manual configuration. Most review tools front-load setup with config files and CLI flags. I wanted cd my-project && /codeprobe audit to just work. The detection adds maybe 200ms and saves every first-time user from reading docs before getting value.
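If you're curious what marker-based detection looks like, here's a minimal sketch covering the two rules described above. The function name, guide labels, and structure are mine, not the skill's internals:

```python
from pathlib import Path

def detect_stack(root):
    """Scan a project tree for file extensions and marker files,
    returning the reference guides to load. Illustrative only."""
    root = Path(root)
    paths = list(root.rglob("*"))
    exts = {p.suffix for p in paths if p.is_file()}
    names = {p.name for p in paths if p.is_file()}
    guides = []
    # .tsx files plus a next.config.* marker -> React/Next.js references
    if ".tsx" in exts and any(n.startswith("next.config.") for n in names):
        guides.append("react-nextjs")
    # a migrations/ directory -> SQL rules
    if any(p.is_dir() and p.name == "migrations" for p in paths):
        guides.append("sql")
    return guides
```

The whole point is that there's nothing to configure: the evidence is already sitting in your project tree.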

The execution flow runs ten steps: parse command, validate against routing table, load config (or defaults), detect stack, load references, route to sub-skills, collect findings, compute scores, render report, present results. Each step is deterministic — same codebase, same findings every time.
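That determinism is easiest to get when each step is a pure function of the previous step's output. A toy illustration of the pipeline shape, with two stand-in steps from the list above (the bodies are placeholders, not the skill's code):

```python
def run_pipeline(command, steps):
    """Thread a context dict through an ordered list of steps.
    Same command over the same codebase -> same context at every step."""
    ctx = {"command": command}
    for step in steps:
        ctx = step(ctx)
    return ctx

def parse_command(ctx):
    # "/codeprobe audit src/" -> verb "audit", target "src/"
    parts = ctx["command"].split()
    return {**ctx, "verb": parts[1] if len(parts) > 1 else "audit",
            "target": parts[2] if len(parts) > 2 else "."}

def validate_command(ctx):
    # Verbs from the commands shown later in this post.
    routing = {"audit", "quick", "health"}
    assert ctx["verb"] in routing, "unknown command: " + ctx["verb"]
    return ctx

result = run_pipeline("/codeprobe audit src/", [parse_command, validate_command])
```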

Multi-agent AI system inquiries surged 1,445% in just over a year, reflecting explosive enterprise interest in orchestrated AI workflows (Gartner, 2026). The code review skill applies this pattern to a problem every developer faces daily: ensuring code quality at AI-generation speed.


What Do the Four Parallel Agents Do?

McKinsey found that companies with 80–100% developer AI adoption saw productivity gains exceeding 110%, with teams saving an average of 6 hours per week (McKinsey, 2026). Parallelism is how you keep review speed matched to generation speed.

When you run /codeprobe audit, the orchestrator doesn't run nine sub-skills one at a time. It spawns four specialized agents that execute simultaneously:

  • Agent-Structural runs SOLID, architecture, and patterns analysis. Is your codebase maintainable? Are your abstractions sound? Does the dependency graph make sense?
  • Agent-Safety runs security and error handling checks. Could an attacker exploit this? Will it fail gracefully under pressure?
  • Agent-Quality runs code smells and testing analysis. Is the code clean? Are the tests meaningful and resilient?
  • Agent-Runtime runs performance and framework convention checks. Will it scale? Does it follow idiomatic patterns for your stack?

Each agent works independently with its own context window. No shared state, no coordination overhead. They report back in parallel, and the orchestrator merges their findings into a single scored report.

Why four agents and not nine? Grouping related sub-skills reduces context-switching overhead. A security reviewer thinking about error handling is already in the right headspace. A structural reviewer thinking about SOLID principles naturally considers architecture. The groupings reflect how senior engineers actually think during review.
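Fan-out-then-merge with no shared state is a pattern you can sketch in a few lines. Here's a rough model of it, assuming each agent is a function returning finding dicts; the agent bodies are placeholders, not the real analyzers:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in agents mirroring the four groupings described above.
def agent_structural(files): return [{"id": "SOLID-001", "severity": "major"}]
def agent_safety(files):     return [{"id": "SEC-003", "severity": "critical"}]
def agent_quality(files):    return [{"id": "SMELL-002", "severity": "minor"}]
def agent_runtime(files):    return []

def run_audit(files):
    """Run all four agents concurrently and merge their findings,
    worst severity first. No shared state between agents."""
    agents = [agent_structural, agent_safety, agent_quality, agent_runtime]
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = pool.map(lambda agent: agent(files), agents)
    findings = [f for batch in results for f in batch]
    order = {"critical": 0, "major": 1, "minor": 2}
    return sorted(findings, key=lambda f: order[f["severity"]])
```

Because the agents never coordinate, the merge step is trivial: concatenate and sort.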

Donut chart showing the four parallel agents and their sub-skill coverage: Agent-Structural covers 34 percent of review weight, Agent-Safety covers 32 percent, Agent-Quality covers 18 percent, and Agent-Runtime covers 16 percent

Structural and safety analysis carry the most weight — because maintainability and security issues compound fastest.

Companies with the highest AI adoption (80–100% of developers) reported productivity gains exceeding 110%, with engineering teams saving an average of six hours per week on routine tasks (McKinsey, 2026). Parallel agent execution applies that same principle to code review — four simultaneous reviewers instead of one sequential pass.



How Does the Scoring System Work?

GitHub's randomized controlled trial of 202 developers found Copilot users had a 53.2% greater likelihood of passing all unit tests (GitHub, 2026). But passing tests isn't the same as passing review. The scoring system quantifies what tests don't measure — design quality, security posture, and architectural health.

Every finding from every sub-skill gets classified by severity: critical, major, or minor. The category score formula is simple:

score = max(0, 100 - (critical × 25) - (major × 10) - (minor × 3))

One critical finding drops your score by 25 points. That's intentional. A single SQL injection vulnerability matters more than ten minor style complaints. The math reflects how real-world impact compounds.
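In Python, the formula above is a one-liner:

```python
def category_score(critical, major, minor):
    """Per-category score from the post's formula: start at 100,
    subtract 25 per critical, 10 per major, 3 per minor; floor at 0."""
    return max(0, 100 - critical * 25 - major * 10 - minor * 3)

category_score(1, 0, 0)   # one SQL injection: 75
category_score(0, 2, 3)   # two major + three minor: 71
```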

The overall health score weights categories by importance:

Lollipop chart showing scoring weights for nine review categories: Security at 20 percent, SOLID at 15 percent, Architecture at 15 percent, Error Handling at 12 percent, Performance at 12 percent, Test Quality at 10 percent, Code Smells at 8 percent, Design Patterns at 4 percent, and Framework at 4 percent

Security carries double the weight of test quality — because a vulnerability in production costs orders of magnitude more than a missing unit test.

Why these weights? I modeled them after incident post-mortem frequency data from my own projects. Security and architecture issues caused the most expensive production incidents. Code smells and pattern issues, while annoying, rarely caused outages. The weights reflect real-world blast radius, not theoretical importance.

Three health thresholds make the score actionable. 80–100 means healthy. 60–79 means needs attention — you've got issues accumulating but nothing's on fire. Below 60 is critical — stop shipping features and fix what's broken.
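Putting the weights from the chart and the thresholds together, the overall score is just a weighted average. A sketch (the dict keys are my labels for the nine categories, not the skill's internal names):

```python
# Weights from the chart above; they sum to 1.0.
WEIGHTS = {
    "security": 0.20, "solid": 0.15, "architecture": 0.15,
    "error_handling": 0.12, "performance": 0.12, "testing": 0.10,
    "code_smells": 0.08, "patterns": 0.04, "framework": 0.04,
}

def overall_health(scores):
    """Weighted average of per-category scores (0-100 each)."""
    return sum(WEIGHTS[cat] * score for cat, score in scores.items())

def health_label(score):
    """Map a score onto the three thresholds described above."""
    if score >= 80:
        return "healthy"
    if score >= 60:
        return "needs attention"
    return "critical"
```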

Every single finding follows a strict format:

{
  "id": "SEC-003",
  "severity": "critical",
  "location": { "file": "src/UserService.php", "lines": "45-67" },
  "problem": "Raw user input concatenated into SQL query",
  "evidence": "Direct code quotes proving the issue",
  "suggestion": "Use parameterized queries via prepared statements",
  "fix_prompt": "In src/UserService.php lines 45-67, replace the raw SQL concatenation with a prepared statement using parameter binding"
}

That fix_prompt field is the key integration point. Copy it, paste it into Claude Code, and the fix gets applied. No context switching. No manual translation from "you should do X" to actually doing X.
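Because every finding follows the same schema, downstream tooling is easy. For example, here's a sketch that pulls fix prompts out of a report, worst first (the two-finding report and the function are mine, built on the schema shown above):

```python
import json

# A hypothetical report in the finding format above.
report = json.loads("""[
  {"id": "SEC-003", "severity": "critical",
   "location": {"file": "src/UserService.php", "lines": "45-67"},
   "fix_prompt": "In src/UserService.php lines 45-67, replace the raw SQL concatenation with a prepared statement using parameter binding"},
  {"id": "SMELL-007", "severity": "minor",
   "location": {"file": "src/helpers.php", "lines": "10-12"},
   "fix_prompt": "Extract the magic number on line 11 into a named constant"}
]""")

def fix_prompts(findings, min_severity="major"):
    """Return fix prompts at or above a severity threshold, worst first."""
    rank = {"critical": 0, "major": 1, "minor": 2}
    keep = [f for f in findings if rank[f["severity"]] <= rank[min_severity]]
    return [f["fix_prompt"] for f in sorted(keep, key=lambda f: rank[f["severity"]])]
```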


What Does It Look Like in Practice?

Three software developers collaborating and reviewing code together on a laptop

Eighty-five percent of developers now use AI tools regularly for coding, with nearly 9 in 10 saving at least one hour per week (JetBrains, 2026). Here's how those savings look with the review skill.

Full Audit

The command you'll use most:

/codeprobe audit src/

This spawns all four parallel agents, scans every file in the directory, and produces a complete report with per-category scores, prioritized findings, and fix prompts for each issue. On a medium-sized project (20-50 files), expect results in under a minute.

Quick Review

When you just want the top problems:

/codeprobe quick src/

Returns the five highest-priority findings with fix prompts. Perfect for a fast sanity check before pushing.

Health Dashboard

For a bird's-eye view:

/codeprobe health

Shows aggregate scores across all categories, highlights hot spots (files with the most issues), and gives you a deterministic metrics summary — lines of code, class counts, method counts, comment ratios — powered by the bundled file_stats.py script.

Configuration

Drop a .codeprobe-config.json in your project root to customize thresholds:

{
  "severity_overrides": {
    "long_method_loc": 50,
    "large_class_loc": 500,
    "max_constructor_deps": 6
  },
  "skip_categories": ["codeprobe-testing"],
  "skip_rules": ["SPEC-GEN-001"]
}

Don't want pattern analysis on a quick prototype? Skip it. Your legacy codebase has 80-line methods by convention? Raise the threshold. The defaults are opinionated but overridable.
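Overrides-on-defaults merging is simple to picture. A sketch of how a tool like this might load that file, using the default thresholds quoted later in the FAQ (the function itself is illustrative, not the skill's code):

```python
import json
from pathlib import Path

# Defaults from the FAQ: 30-line methods, 300-line classes, 5 constructor deps.
DEFAULTS = {"long_method_loc": 30, "large_class_loc": 300, "max_constructor_deps": 5}

def load_config(project_root):
    """Merge .codeprobe-config.json severity overrides onto the defaults
    and collect skipped categories/rules. Missing file -> pure defaults."""
    path = Path(project_root) / ".codeprobe-config.json"
    user = json.loads(path.read_text()) if path.exists() else {}
    thresholds = {**DEFAULTS, **user.get("severity_overrides", {})}
    skip = set(user.get("skip_categories", [])) | set(user.get("skip_rules", []))
    return thresholds, skip
```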

What I use daily: I run /codeprobe quick before every commit and /codeprobe audit before every PR. The quick review catches the obvious stuff — hardcoded strings, missing error handling, methods that got too long. The full audit catches the architectural drift that accumulates over weeks. It's become the part of my workflow I'd miss most if I lost it.



What Makes This Different from Linters?

Forty-five percent of developers using AI tools cite "AI solutions that are almost right, but not quite" as their top frustration (Stack Overflow, 2026). Linters catch the obviously wrong. This skill catches the almost-right-but-not-quite.

ESLint catches syntax issues. Prettier fixes formatting. PHPStan checks types. Those tools are necessary. They're not sufficient.

The review skill operates at a layer above. It catches SOLID violations — a class doing five unrelated things. It finds architectural drift — a service layer calling the database directly. It flags security patterns a linter won't see — mass assignment vulnerabilities, broken access control, insecure deserialization chains.

Linters answer "does this code follow syntax rules?" The review skill answers "will I regret this code in six months?"

It also works across your entire codebase, not file-by-file. Circular dependencies, coupling between modules, god objects spanning hundreds of lines — these are codebase-level concerns that single-file linters miss entirely.

The fix-prompt model is the real differentiator. Most review tools tell you what's wrong. This one tells you what's wrong and hands you the instruction to fix it. That's not a convenience feature — it's a workflow design choice. When the friction between "finding a problem" and "fixing it" drops to a single paste, developers actually fix things instead of adding TODO comments.


Getting Started

Claude Code became the #1 AI coding tool in just 8 months, overtaking GitHub Copilot and Cursor (The Pragmatic Engineer, 2026). If you're already using it, adding the review skill takes thirty seconds:

One-Command Install (macOS/Linux)

curl -fsSL https://raw.githubusercontent.com/nishilbhave/codeprobe-claude/main/install.sh | bash

Manual Install

git clone https://github.com/nishilbhave/codeprobe-claude.git
cd codeprobe-claude
./install.sh

Windows (Git Bash)

Requires Git for Windows which includes Git Bash.

# Option 1: One-command install (run from Git Bash, not PowerShell/CMD)
curl -fsSL https://raw.githubusercontent.com/nishilbhave/codeprobe-claude/main/install-win.sh | bash

# Option 2: Manual install
git clone https://github.com/nishilbhave/codeprobe-claude.git
cd codeprobe-claude
./install-win.sh

Note: Right-click the folder and select "Open Git Bash here", or open Git Bash and navigate to the directory. Do not use PowerShell or Command Prompt.

The install script symlinks the skill into ~/.claude/skills/. It checks for Python 3 availability and degrades gracefully if missing — you lose the deterministic metrics but keep all review capabilities.

Then open any project in Claude Code and run:

/codeprobe audit .

That's it. No config files to create. No API keys to set up. No packages to install. The skill auto-detects your stack and starts reviewing.

It's MIT licensed. Fork it, extend it, build on it. Contributions are welcome — especially for additional language reference guides and framework-specific detection rules.



Frequently Asked Questions

Does the skill modify my code?

Never. The skill is strictly read-only — it won't write, edit, or delete any of your files. Every issue comes with a fix_prompt you can paste into Claude Code to apply the fix yourself. This design keeps you in full control, which matters when 96% of developers say they don't fully trust AI-generated code (SonarSource, 2026).

What languages and frameworks does it support?

Primary support covers PHP/Laravel, JavaScript/TypeScript, Python, React/Next.js, SQL, and API design — with dedicated reference guides for each. The deterministic metrics engine (file_stats.py) recognizes 13+ language types including Java, Ruby, Go, Rust, Vue, and Svelte. Stack detection is automatic based on file extensions and project markers.

Can I use it in Claude.ai without Claude Code?

Yes, in degraded mode. Paste or upload your code and run review commands normally. You'll lose parallel agent execution, filesystem scanning, and script-dependent metrics — but the core review logic, scoring, and fix prompts all work. The skill automatically detects which environment it's running in.

How do I customize severity thresholds?

Add a .codeprobe-config.json file to your project root. You can override defaults like long_method_loc (default: 30 lines), large_class_loc (default: 300), and max_constructor_deps (default: 5). You can also skip entire categories or individual rules. See the repo documentation for the full config schema.

Is it open source?

Fully open source under the MIT license. The repo is at github.com/nishilbhave/codeprobe-claude. Zero external dependencies — just Python 3.8+ standard library. Contributions are welcome, especially for new language reference guides and framework detection rules.


Conclusion

AI writes code faster than humans can review it. That's not going to change — it's going to accelerate. The question is whether your review process can keep up.

Here's what I'd take away:

  • The verification gap is real. Only 48% of developers always verify AI code, despite 96% not trusting it. Automated review isn't optional anymore.
  • Multi-agent beats single-pass. Nine domain experts catch what a single reviewer — human or AI — would miss. Parallelism keeps it fast.
  • Fix prompts close the loop. Finding problems is easy. Getting developers to fix them is hard. Copy-pasteable instructions remove the friction.
  • Zero-config wins adoption. Auto-detection, zero dependencies, and one-command installation mean developers actually use it.

The skill is free, open source, and ready to use: github.com/nishilbhave/codeprobe-claude. Clone it, run /codeprobe audit on your next project, and see what it finds. I'd bet it catches something you didn't expect.


Top comments (0)