I Built a Multi-Agent Code Review Skill for Claude Code — Here's How It Works
Here's a stat that should bother every developer: 96% of us don't fully trust AI-generated code, but only 48% actually verify it before committing (SonarSource, 2026). That's a verification gap wide enough to ship critical bugs through.
I've been watching AI write more and more of my code over the past year. Claude Code became my daily driver. It's fast. It's capable. But speed without review is just velocity toward technical debt. So I built something to close that gap — a multi-agent code review skill that runs directly inside Claude Code, performing senior-engineer-level static analysis without ever touching your files.
This post breaks down why I built it, how the architecture works, and how you can start using it today.
TL;DR: I built an open-source, multi-agent code review skill for Claude Code with 9 specialized sub-skills — one domain expert each for security, SOLID, architecture, error handling, performance, testing, code smells, patterns, and framework idioms. It runs them in parallel, generates copy-pasteable fix prompts, and saves every audit to `./codeprobe-reports/<timestamp>.md`. Install in one command: `npx skills add nishilbhave/codeprobe`. AI-coauthored PRs show 1.7x more issues than human-only PRs (Jellyfish, 2026). This skill helps close that gap.
Why Does AI-Generated Code Need Better Review?
Forty-two percent of committed code is now AI-generated or AI-assisted, and 38% of developers say reviewing that code takes more effort than reviewing human-written code (SonarSource, 2026). We're producing code faster than we can verify it. That's not a productivity win — it's a quality time bomb.
The numbers paint a clear picture. Across 600+ organizations, teams with the highest AI adoption saw PRs per engineer jump 113%, from 1.36 to 2.9 per week. But those AI-coauthored PRs showed roughly 1.7x more issues than human-only PRs (Jellyfish, 2026). More output, more problems.
And here's the trust paradox. 80% of developers now use AI tools regularly, yet trust in AI accuracy has actually declined — from 40% to 29% year-over-year (Stack Overflow, 2026). We keep using tools we don't trust because the productivity pull is too strong to ignore.
The economic stakes are real. Poor software quality costs the US economy $2.41 trillion annually, and bug-fix costs multiply up to 30x when caught in production versus during development (CISQ/NIST, 2026). Every bug your review process misses compounds that cost.
So the question isn't whether to review AI-generated code. It's how to review it at the same speed AI produces it. That's where multi-agent systems come in.
The gap between AI adoption and verification represents unreviewed AI code entering production.
AI-coauthored pull requests exhibit approximately 1.7 times more issues than human-only PRs, while PRs per engineer jumped 113% in high-AI-adoption teams (Jellyfish, 2026). The speed-quality tradeoff means AI-assisted teams need stronger automated review, not weaker review because "the AI wrote it."
What Is the Code Review Skill?
Teams using AI for code review alongside productivity tools saw quality improvements 81% of the time, compared to just 55% for fast teams without AI review (Qodo, 2026). I wanted those gains without leaving my terminal.
The CodeProbe skill is an open-source, multi-agent code review system that runs directly inside Claude Code. Think of it as a team of nine domain-expert reviewers that analyze your code simultaneously — one focused on security, another on SOLID principles, another on architecture, and so on.
Three design decisions shaped everything:
Read-only, always. The skill never modifies, writes, or deletes your files. It reads your code, finds issues, and generates copy-pasteable fix prompts you can run separately. You stay in control.
One-command install. npx skills add nishilbhave/codeprobe — no repo clone, no config files, no API keys, no OS-specific variants. The skill drops into ~/.claude/skills/ and is immediately available in any project.
Fix prompts, not auto-fixes. Every finding includes a fix_prompt field — a ready-to-paste Claude Code instruction that applies the fix. You decide which fixes to accept. The skill just does the thinking.
It works in two modes. Full mode in Claude Code gives you filesystem access, parallel agents, and deterministic metrics. Degraded mode in Claude.ai still analyzes pasted or uploaded code — you lose parallelism and file scanning, but keep the core review logic.
The skill performs senior-engineer-level static analysis across nine domains including security, SOLID principles, and architecture. With 80% of AI-reviewed PRs requiring no additional human comments (Qodo, 2026), automated review is becoming the quality backstop AI-heavy teams need.
How Does the Multi-Agent Architecture Work?
Gartner reported a 1,445% surge in multi-agent system inquiries from Q1 2026 to Q2 2026, and predicts 40% of enterprise apps will embed task-specific AI agents by 2026 (Gartner, 2026). The multi-agent pattern isn't hype anymore. It's how you solve problems that are too broad for a single prompt.
The architecture follows an orchestrator pattern. A central router (SKILL.md) receives your command, auto-detects your tech stack, loads the right reference guides, and dispatches work to specialized sub-skills.
The Nine Domain Experts
Each sub-skill is a focused analyzer with its own detection rules:
| Sub-Skill | What It Catches |
|---|---|
| codeprobe-solid | Classes with 5+ unrelated methods, switch-on-type patterns, LSP violations, fat interfaces |
| codeprobe-security | SQL injection, XSS, hardcoded credentials, mass assignment, broken access control |
| codeprobe-architecture | Coupling issues, circular dependencies, god objects (>500 LOC), missing module boundaries |
| codeprobe-code-smells | Long methods (>30 LOC), feature envy, data clumps, dead code, magic numbers, deep nesting |
| codeprobe-patterns | Recommends Builder, Factory, Strategy only when concrete problems exist; detects Singleton abuse |
| codeprobe-performance | N+1 queries, missing indexes, O(n²) algorithms, race conditions, unnecessary re-renders |
| codeprobe-error-handling | Swallowed exceptions, missing try-catch, absent timeout/retry, insufficient validation |
| codeprobe-testing | Untested public methods, mock abuse, missing edge cases, brittle test data |
| codeprobe-framework | Stack-specific checks — Laravel Eloquent patterns, React hook rules, Next.js conventions |
Auto-Detection and Reference Loading
The skill doesn't ask you what stack you're using. It scans file extensions and project markers — .tsx files plus a next.config.* triggers React/Next.js references. A migrations/ directory loads SQL rules. Six language-specific reference guides cover PHP/Laravel, JavaScript/TypeScript, Python, React/Next.js, SQL, and API design.
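As a minimal sketch of what marker-based stack detection can look like, here is a scan over file extensions and marker files. The mapping tables are illustrative stand-ins, not CodeProbe's actual detection rules:

```python
from pathlib import Path

# Hypothetical marker -> stack tables, modeled on the behavior
# described above (illustrative, not CodeProbe internals).
STACK_MARKERS = {
    "next.config": "react-nextjs",   # next.config.* => React/Next.js refs
    "composer.json": "php-laravel",
    "requirements.txt": "python",
    "migrations": "sql",             # migrations/ dir => SQL rules
}
EXTENSION_HINTS = {
    ".tsx": "react-nextjs",
    ".php": "php-laravel",
    ".py": "python",
}

def detect_stacks(root: str) -> set:
    """Walk the project tree once, collecting every stack hinted at
    by a known file extension or project marker."""
    stacks = set()
    for path in Path(root).rglob("*"):
        for marker, stack in STACK_MARKERS.items():
            if path.name.startswith(marker):
                stacks.add(stack)
        if path.suffix in EXTENSION_HINTS:
            stacks.add(EXTENSION_HINTS[path.suffix])
    return stacks
```

Keeping detection to a single filesystem walk is what makes the zero-config flow cheap enough to run on every invocation.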
Design decision I'm glad I made: building auto-detection instead of requiring manual configuration. Most review tools front-load setup with config files and CLI flags. I wanted `cd my-project && /codeprobe audit` to just work. The detection adds maybe 200ms and saves every first-time user from reading docs before getting value.
The execution flow runs ten steps: parse command, validate against routing table, load config (or defaults), detect stack, load references, route to sub-skills, collect findings, compute scores, render report, present results. Each step is deterministic — same codebase, same findings every time.
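The validate-and-route steps of that flow reduce to a lookup against a routing table. A sketch, with the caveat that the table contents are my illustration rather than the real SKILL.md routing:

```python
# All nine sub-skills, as listed in the table above.
ALL_SKILLS = [
    "solid", "security", "architecture", "code-smells", "patterns",
    "performance", "error-handling", "testing", "framework",
]

# Illustrative routing: each command dispatches to sub-skills.
# The commands are real; the exact routing is a sketch.
ROUTING_TABLE = {
    "audit": ALL_SKILLS,   # full parallel pass, saved report
    "quick": ALL_SKILLS,   # same pass, only top-5 findings reported
    "health": ALL_SKILLS,  # aggregate scores plus metrics dashboard
}

def route(command: str) -> list:
    """Step 2 of the flow: reject unknown commands before any work
    happens, then hand back the sub-skills to dispatch (step 6)."""
    if command not in ROUTING_TABLE:
        raise ValueError(f"unknown command: {command}")
    return ROUTING_TABLE[command]
```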
Multi-agent AI system inquiries surged 1,445% in just over a year, reflecting explosive enterprise interest in orchestrated AI workflows (Gartner, 2026). The code review skill applies this pattern to a problem every developer faces daily: ensuring code quality at AI-generation speed.
Why Nine Sub-Skills Instead of One Generic Reviewer?
McKinsey found that companies with 80–100% developer AI adoption saw productivity gains exceeding 110%, with teams saving an average of 6 hours per week (McKinsey, 2026). You only capture those gains if your review catches the right problems — and a single generic reviewer, human or AI, can't hold nine specialized domains in its head at the same time.
When you run /codeprobe audit, the orchestrator dispatches all nine sub-skills in parallel. Each one is a focused domain expert — security, SOLID, architecture, error handling, performance, testing, code smells, patterns, framework — with its own detection rules, reference guides, and scoring logic. Instead of one reviewer loosely sweeping across everything, you get nine reviewers that each specialize.
Why granular specialization outperforms broad-bucket reviewers in practice:
- Focused scope per sub-skill. A security sub-skill reviewing for SQL injection isn't distracted by naming conventions. A SOLID reviewer isn't diluted by performance heuristics. Each sub-skill loads only what's relevant to its domain, so every token it processes is earning its keep.
- Domain-specific reference guides. The security sub-skill works from OWASP patterns. The framework sub-skill loads Laravel, React, or Django idioms depending on the detected stack. No single prompt can be equally deep in nine different domains — specialization lets each one be genuinely senior in its lane.
- Independent scoring. Findings don't get averaged into a generic "code quality" number. Security is tracked separately from test quality, so a perfect security score can't mask mediocre testing (or vice versa).
- Parallel execution. Sub-skills run concurrently, not sequentially. A multi-hundred-file codebase gets audited in roughly the time it takes the slowest single sub-skill to finish, not the sum of nine serial passes.
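If you model each sub-skill as a callable over the file list, the parallel dispatch in the last bullet can be sketched with a thread pool. The reviewer functions here are stand-ins, not CodeProbe internals:

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(sub_skills: dict, files: list) -> dict:
    """Run every sub-skill concurrently over the same file list.
    Wall time tracks the slowest reviewer, not the sum of all nine."""
    with ThreadPoolExecutor(max_workers=len(sub_skills)) as pool:
        futures = {name: pool.submit(fn, files)
                   for name, fn in sub_skills.items()}
        # Collect each reviewer's findings under its own name, so
        # scores stay independent per category.
        return {name: f.result() for name, f in futures.items()}
```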
What I learned building this: my first version tried to cram all review logic into one giant prompt. It was worse at every individual category than the specialized version that came later. Narrowing each sub-skill's scope made each one sharper. The lesson generalizes — when you give an LLM a focused role with a focused reference set, it outperforms a generalist instruction every single time.
Nine sub-skills isn't arbitrary. It maps to the nine domains senior reviewers actually track on a large PR — security, structure (SOLID), architecture, error paths, performance, tests, smells, patterns, framework idioms. Adding a tenth would overlap with the others; collapsing to fewer buckets would lose coverage or force a single sub-skill to juggle domains that don't share reference material. The granularity reflects how review work already divides in practice.
Companies with the highest AI adoption reported productivity gains exceeding 110%, with engineering teams saving six hours per week on routine tasks (McKinsey, 2026). Parallel specialist execution applies that same principle to code review — nine simultaneous experts instead of one sequential pass.
How Does the Scoring System Work?
GitHub's randomized controlled trial of 202 developers found Copilot users had a 53.2% greater likelihood of passing all unit tests (GitHub, 2026). But passing tests isn't the same as passing review. The scoring system quantifies what tests don't measure — design quality, security posture, and architectural health.
Every finding from every sub-skill gets classified by severity: critical, major, or minor. The category score uses capped per-severity penalties:
```
crit_penalty  = min(50, critical × 15)
major_penalty = min(30, major × 6)
minor_penalty = min(10, minor × 2)
score = max(0, 100 - crit_penalty - major_penalty - minor_penalty)
```
One critical finding drops your score by 15 points. The caps matter: each severity tier can only contribute so much, and even with all three tiers maxed out the combined penalty tops out at 90, so a category score never drops below 10. A genuinely broken category (many criticals plus majors) still ends up deep in the red; a category with one false positive doesn't. The math reflects how real-world impact compounds while keeping scores interpretable.
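The capped formula translates directly into code:

```python
def category_score(critical: int, major: int, minor: int) -> int:
    """Per-category score with capped per-severity penalties,
    exactly as defined in the formula above."""
    crit_penalty = min(50, critical * 15)
    major_penalty = min(30, major * 6)
    minor_penalty = min(10, minor * 2)
    return max(0, 100 - crit_penalty - major_penalty - minor_penalty)
```

Worth noting: `category_score(1, 0, 0)` is 85 (one critical costs 15 points), and even a disastrous category floor-outs at 10 because the caps sum to 90.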
The overall health score weights categories by importance. Security carries double the weight of test quality, because a vulnerability in production costs orders of magnitude more than a missing unit test.
Why these weights? I modeled them after incident post-mortem frequency data from my own projects. Security and architecture issues caused the most expensive production incidents. Code smells and pattern issues, while annoying, rarely caused outages. The weights reflect real-world blast radius, not theoretical importance.
Three health thresholds make the score actionable. 80–100 means healthy. 60–79 means needs attention — you've got issues accumulating but nothing's on fire. Below 60 is critical — stop shipping features and fix what's broken.
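Putting the weighting and the thresholds together, a sketch of the overall health computation. The weight table is illustrative (the only weight relation stated above is that security weighs double test quality); the three threshold bands are as described:

```python
# Illustrative weights: security at 2.0 vs testing at 1.0 matches the
# stated 2x relation; the remaining values are my assumptions.
WEIGHTS = {
    "security": 2.0, "architecture": 1.5, "solid": 1.0,
    "error-handling": 1.0, "performance": 1.0, "testing": 1.0,
    "code-smells": 0.5, "patterns": 0.5, "framework": 0.5,
}

def overall_health(scores: dict):
    """Weighted average of category scores, mapped to a status band:
    80-100 healthy, 60-79 needs attention, below 60 critical."""
    total = sum(WEIGHTS[c] * s for c, s in scores.items())
    score = total / sum(WEIGHTS[c] for c in scores)
    if score >= 80:
        status = "healthy"
    elif score >= 60:
        status = "needs attention"
    else:
        status = "critical"
    return score, status
```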
Every single finding follows a strict format:
```json
{
  "id": "SEC-003",
  "severity": "critical",
  "location": { "file": "src/UserService.php", "lines": "45-67" },
  "problem": "Raw user input concatenated into SQL query",
  "evidence": "Direct code quotes proving the issue",
  "suggestion": "Use parameterized queries via prepared statements",
  "fix_prompt": "In src/UserService.php lines 45-67, replace the raw SQL concatenation with a prepared statement using parameter binding"
}
```
That fix_prompt field is the key integration point. Copy it, paste it into Claude Code, and the fix gets applied. No context switching. No manual translation from "you should do X" to actually doing X.
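Because every finding follows that schema, pulling the fix prompts out in severity order is a few lines. This helper is my sketch, not part of the skill:

```python
import json

# Critical fixes should be pasted into Claude Code first.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}

def fix_prompts(findings_json: str) -> list:
    """Extract fix_prompt strings from a findings array, ordered
    critical -> major -> minor."""
    findings = json.loads(findings_json)
    findings.sort(key=lambda f: SEVERITY_ORDER[f["severity"]])
    return [f["fix_prompt"] for f in findings]
```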
What Does It Look Like in Practice?
Eighty-five percent of developers now use AI tools regularly for coding, with nearly 9 in 10 saving at least one hour per week (JetBrains, 2026). Here's how those savings look with the review skill.
Full Audit
The command you'll use most:
```
/codeprobe audit src/
```
This dispatches all nine sub-skills in parallel, scans every file in the directory, and produces a complete report with per-category scores, prioritized findings, and fix prompts for each issue. The full audit also saves to ./codeprobe-reports/<timestamp>.md so you can share it in a PR or diff it against the previous run. On a medium-sized project (20-50 files), expect results in under a minute.
Quick Review
When you just want the top problems:
```
/codeprobe quick src/
```
Returns the five highest-priority findings with fix prompts. Perfect for a fast sanity check before pushing.
Health Dashboard
For a bird's-eye view:
```
/codeprobe health
```
Shows aggregate scores across all categories, highlights hot spots (files with the most issues), and gives you a deterministic metrics summary — lines of code, class counts, method counts, comment ratios — powered by the bundled file_stats.py script.
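As a rough illustration of the kind of deterministic metrics file_stats.py is described as computing, here is a toy counter that only understands Python source (the real script recognizes 13+ language types):

```python
import re

def python_file_metrics(source: str) -> dict:
    """Count non-blank lines, classes, methods, and the comment ratio
    for a single Python file. Toy version for illustration only."""
    code = [line for line in source.splitlines() if line.strip()]
    comments = [line for line in code if line.lstrip().startswith("#")]
    return {
        "loc": len(code),
        "classes": len(re.findall(r"^\s*class\s+\w+", source, re.M)),
        "methods": len(re.findall(r"^\s*def\s+\w+", source, re.M)),
        "comment_ratio": round(len(comments) / max(1, len(code)), 2),
    }
```

Metrics like these are deterministic by construction: the same file always yields the same numbers, which is what lets you diff health reports across commits.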
Configuration
Drop a .codeprobe-config.json in your project root to customize thresholds:
```json
{
  "severity_overrides": {
    "long_method_loc": 50,
    "large_class_loc": 500,
    "max_constructor_deps": 6
  },
  "skip_categories": ["codeprobe-testing"],
  "skip_rules": ["SPEC-GEN-001"]
}
```
Don't want pattern analysis on a quick prototype? Skip it. Your legacy codebase has 80-line methods by convention? Raise the threshold. The defaults are opinionated but overridable.
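A sketch of how merging a project config over defaults can work. The default values (30-line methods, 300-line classes, 5 constructor deps) are the skill's documented defaults; the merge logic itself is illustrative:

```python
import json
from pathlib import Path

# Documented defaults; overridable per project.
DEFAULTS = {
    "severity_overrides": {
        "long_method_loc": 30,
        "large_class_loc": 300,
        "max_constructor_deps": 5,
    },
    "skip_categories": [],
    "skip_rules": [],
}

def load_config(project_root: str) -> dict:
    """Merge .codeprobe-config.json over the defaults; a missing
    file simply means defaults apply."""
    cfg = {k: (dict(v) if isinstance(v, dict) else list(v))
           for k, v in DEFAULTS.items()}
    path = Path(project_root) / ".codeprobe-config.json"
    if path.exists():
        user = json.loads(path.read_text())
        cfg["severity_overrides"].update(user.get("severity_overrides", {}))
        cfg["skip_categories"] = user.get("skip_categories", cfg["skip_categories"])
        cfg["skip_rules"] = user.get("skip_rules", cfg["skip_rules"])
    return cfg
```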
What I use daily: I run `/codeprobe quick` before every commit and `/codeprobe audit` before every PR. The quick review catches the obvious stuff — hardcoded strings, missing error handling, methods that got too long. The full audit catches the architectural drift that accumulates over weeks. It's become the part of my workflow I'd miss most if I lost it.
What Makes This Different from Linters?
Forty-five percent of developers say their top frustration with AI tools is "AI solutions that are almost right, but not quite" (Stack Overflow, 2026). Linters catch the obviously wrong. This skill catches the almost-right-but-not-quite.
ESLint catches syntax issues. Prettier fixes formatting. PHPStan checks types. Those tools are necessary. They're not sufficient.
The review skill operates at a layer above. It catches SOLID violations — a class doing five unrelated things. It finds architectural drift — a service layer calling the database directly. It flags security patterns a linter won't see — mass assignment vulnerabilities, broken access control, insecure deserialization chains.
Linters answer "does this code follow syntax rules?" The review skill answers "will I regret this code in six months?"
It also works across your entire codebase, not file-by-file. Circular dependencies, coupling between modules, god objects spanning hundreds of lines — these are codebase-level concerns that single-file linters miss entirely.
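Circular dependencies are a good example of why this needs whole-codebase analysis: no single file contains the cycle. A toy DFS cycle detector over an import graph shows the shape of the check:

```python
def find_cycle(graph: dict):
    """Depth-first search for a cycle in {module: [imported modules]}.
    Returns one cycle as a list of modules, or None if the graph is
    acyclic. A per-file linter cannot perform this check."""
    visiting, visited, path = set(), set(), []

    def dfs(node):
        visiting.add(node)
        path.append(node)
        for dep in graph.get(node, []):
            if dep in visiting:
                # Back edge found: slice out the cycle from the path.
                return path[path.index(dep):] + [dep]
            if dep not in visited:
                cycle = dfs(dep)
                if cycle:
                    return cycle
        visiting.discard(node)
        visited.add(node)
        path.pop()
        return None

    for node in graph:
        if node not in visited:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None
```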
The fix-prompt model is the real differentiator. Most review tools tell you what's wrong. This one tells you what's wrong and hands you the instruction to fix it. That's not a convenience feature — it's a workflow design choice. When the friction between "finding a problem" and "fixing it" drops to a single paste, developers actually fix things instead of adding TODO comments.
Getting Started
Claude Code became the #1 AI coding tool in just 8 months, overtaking GitHub Copilot and Cursor (The Pragmatic Engineer, 2026). If you're already using it, installing the review skill takes one command:
One-Command Install
```
npx skills add nishilbhave/codeprobe
```
That's it. Same command on macOS, Linux, and Windows — no shell-script variants, no curl | bash, no repo clone. npx skills add drops CodeProbe into ~/.claude/skills/ and it's immediately available in any project.
Manage it: npx skills update to pull upgrades, npx skills remove to uninstall.
Python 3.8+ is optional — it enables the deterministic metrics engine in /codeprobe health and the full audit dashboard. Every other review capability works without it.
Reports save to disk. Every audit writes a timestamped Markdown report to ./codeprobe-reports/<timestamp>.md in your current directory, so you can diff reports across commits, share them in a PR, or grep for regressions over time. Add codeprobe-reports/ to .gitignore if you don't want them committed.
Then open any project in Claude Code and run:
```
/codeprobe audit .
```
That's it. No config files to create. No API keys to set up. The skill auto-detects your stack and starts reviewing.
It's MIT licensed. Fork it, extend it, build on it. Contributions are welcome — especially for additional language reference guides and framework-specific detection rules.
Frequently Asked Questions
Does the skill modify my code?
Never. The skill is strictly read-only — it won't write, edit, or delete any of your files. Every issue comes with a fix_prompt you can paste into Claude Code to apply the fix yourself. That keeps you in full control, which matters when 96% of developers say they don't fully trust AI-generated code (SonarSource, 2026).
What languages and frameworks does it support?
Primary support covers PHP/Laravel, JavaScript/TypeScript, Python, React/Next.js, SQL, and API design — with dedicated reference guides for each. The deterministic metrics engine (file_stats.py) recognizes 13+ language types including Java, Ruby, Go, Rust, Vue, and Svelte. Stack detection is automatic based on file extensions and project markers.
Can I use it in Claude.ai without Claude Code?
Yes, in degraded mode. Paste or upload your code and run review commands normally. You'll lose parallel agent execution, filesystem scanning, and script-dependent metrics — but the core review logic, scoring, and fix prompts all work. The skill automatically detects which environment it's running in.
How do I customize severity thresholds?
Add a .codeprobe-config.json file to your project root. You can override defaults like long_method_loc (default: 30 lines), large_class_loc (default: 300), and max_constructor_deps (default: 5). You can also skip entire categories or individual rules. See the repo documentation for the full config schema.
Is it open source?
Fully open source under the MIT license. The repo is at github.com/nishilbhave/codeprobe-claude. Install it via npx skills add nishilbhave/codeprobe. Python 3.8+ is an optional runtime dependency that unlocks the deterministic metrics dashboard — everything else runs without it. Contributions are welcome, especially for new language reference guides and framework detection rules.
Conclusion
AI writes code faster than humans can review it. That's not going to change — it's going to accelerate. The question is whether your review process can keep up.
Here's what I'd take away:
- The verification gap is real. Only 48% of developers always verify AI code, despite 96% not trusting it. Automated review isn't optional anymore.
- Specialized sub-skills beat a single pass. Nine focused domain experts catch what a generalist reviewer — human or AI — would miss. Parallel execution keeps it fast.
- Fix prompts close the loop. Finding problems is easy. Getting developers to fix them is hard. Copy-pasteable instructions remove the friction.
- One-command install wins adoption. `npx skills add nishilbhave/codeprobe` plus auto-detection means zero setup between installing and getting your first audit.
The skill is free, open source, and ready to use: github.com/nishilbhave/codeprobe-claude. Install it with npx skills add nishilbhave/codeprobe, run /codeprobe audit on your next project, and see what it finds. I'd bet it catches something you didn't expect.