<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rahul Singh</title>
    <description>The latest articles on DEV Community by Rahul Singh (@rahulxsingh).</description>
    <link>https://dev.to/rahulxsingh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3821051%2Fc871b6c4-e4ab-4dc7-87d4-015d6d227d34.png</url>
      <title>DEV Community: Rahul Singh</title>
      <link>https://dev.to/rahulxsingh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rahulxsingh"/>
    <language>en</language>
    <item>
      <title>SWE-bench Scores and Leaderboard Explained (2026)</title>
      <dc:creator>Rahul Singh</dc:creator>
      <pubDate>Sat, 11 Apr 2026 20:00:00 +0000</pubDate>
      <link>https://dev.to/rahulxsingh/swe-bench-scores-and-leaderboard-explained-2026-54of</link>
      <guid>https://dev.to/rahulxsingh/swe-bench-scores-and-leaderboard-explained-2026-54of</guid>
      <description>&lt;p&gt;If you follow AI coding tools, you have probably seen companies quoting their SWE-bench scores in every product announcement and marketing page. But what do these numbers actually mean? And more importantly, should you pick your AI coding tools based on benchmark scores alone?&lt;/p&gt;

&lt;p&gt;SWE-bench has become the de facto standard for measuring how well AI models can solve real software engineering problems. In this guide, I will break down how the benchmark works, walk through the current leaderboard, explain what the scores actually tell you (and what they don't), and help you make sense of the numbers when choosing tools for your workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is SWE-bench?
&lt;/h2&gt;

&lt;p&gt;SWE-bench (Software Engineering Benchmark) is a benchmark created by researchers at Princeton University to evaluate whether large language models can resolve real-world GitHub issues. Unlike synthetic coding benchmarks that test isolated functions or algorithm puzzles, SWE-bench uses actual bug reports and feature requests pulled from popular open-source Python repositories.&lt;/p&gt;

&lt;p&gt;The original dataset contains 2,294 task instances drawn from 12 popular Python projects including Django, Flask, scikit-learn, matplotlib, sympy, and others. Each task corresponds to a real pull request that was merged to fix an issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  How the Evaluation Works
&lt;/h3&gt;

&lt;p&gt;The evaluation process follows a straightforward but rigorous methodology:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The model receives&lt;/strong&gt; a GitHub issue description and access to the full codebase at the point in time when the issue was filed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The model must generate&lt;/strong&gt; a patch (a code diff) that resolves the described problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The patch is tested&lt;/strong&gt; by running the repository's unit test suite - specifically, tests that were added alongside the original fix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success is binary&lt;/strong&gt; - the model's patch must make previously failing tests pass without breaking any existing tests&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each task runs inside an isolated Docker container to ensure reproducibility. The model does not see the test cases it needs to pass - it only gets the issue description and the repository code.&lt;/p&gt;

&lt;p&gt;This "fail-to-pass" methodology is what makes SWE-bench harder than most coding benchmarks. The model needs to understand the bug from a natural language description, locate the relevant code in a potentially massive codebase, and produce a working fix - all without seeing the expected test outcomes.&lt;/p&gt;

&lt;h2&gt;
  
  
  SWE-bench Verified vs. SWE-bench Pro
&lt;/h2&gt;

&lt;p&gt;The original SWE-bench dataset had a known problem: some tasks were ambiguous, poorly specified, or arguably unsolvable from the issue description alone. This made it hard to tell whether a model failed because it lacked capability or because the task itself was unfair.&lt;/p&gt;

&lt;h3&gt;
  
  
  SWE-bench Verified
&lt;/h3&gt;

&lt;p&gt;OpenAI collaborated with the SWE-bench team to create SWE-bench Verified - a curated subset of 500 tasks that were individually reviewed by software engineers. Each annotator confirmed that the issue description contained enough information to solve the problem and that the test patch was a valid evaluation of the fix.&lt;/p&gt;

&lt;p&gt;SWE-bench Verified quickly became the primary benchmark everyone reported scores on. However, it has since run into a serious problem: &lt;strong&gt;data contamination&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;OpenAI's audit found that every frontier model tested - including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash - could reproduce verbatim gold patches or problem statement specifics for certain SWE-bench Verified tasks. In other words, the models have likely seen the answers during training. This means the 80%+ scores on Verified may be inflated.&lt;/p&gt;
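&lt;p&gt;One crude way to probe for this kind of memorization is to check whether a model, shown only the issue text, reproduces long verbatim spans of the gold patch. The sketch below illustrates the idea only - it is not OpenAI's actual audit methodology, and the strings are hypothetical:&lt;/p&gt;

```python
# Crude memorization probe (illustrative only): flag a completion
# that reproduces a long verbatim span of the gold patch.

def longest_shared_span(completion: str, gold_patch: str, min_len: int = 40) -> int:
    """Length of the longest substring of gold_patch that appears
    verbatim in completion, or 0 if it is shorter than min_len."""
    best = 0
    n = len(gold_patch)
    for i in range(n):
        # Only check spans longer than the current best (and min_len).
        for j in range(i + max(min_len, best + 1), n + 1):
            if gold_patch[i:j] in completion:
                best = j - i
            else:
                break  # a longer span starting at i cannot match either
    return best if best >= min_len else 0

gold = "raise PaymentError('max retries exceeded for card token')"
completion = "...we should raise PaymentError('max retries exceeded for card token') here..."
print(longest_shared_span(completion, gold, min_len=20) == len(gold))  # True
```

&lt;p&gt;A real audit has to work at scale and control for boilerplate - common idioms are legitimately shared between training data and correct fixes, so only unusually long or distinctive overlaps are meaningful evidence.&lt;/p&gt;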

&lt;h3&gt;
  
  
  SWE-bench Pro
&lt;/h3&gt;

&lt;p&gt;In response to contamination concerns, Scale AI launched SWE-bench Pro - a harder benchmark with 1,865 multi-language tasks that avoids the data contamination issues. The results are dramatically different: models that score 80%+ on Verified only reach about 46-57% on Pro.&lt;/p&gt;

&lt;p&gt;This gap is telling. A model scoring 46% on Pro versus 81% on Verified does not mean it got worse - it means Pro is a more honest measurement of current capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Current SWE-bench Leaderboard (March 2026)
&lt;/h2&gt;

&lt;p&gt;Here are the latest scores across both benchmarks as of March 2026.&lt;/p&gt;

&lt;h3&gt;
  
  
  SWE-bench Verified Top 10
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Claude Opus 4.5&lt;/td&gt;
&lt;td&gt;80.9%&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;80.8%&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
&lt;td&gt;80.6%&lt;/td&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;MiniMax M2.5&lt;/td&gt;
&lt;td&gt;80.2%&lt;/td&gt;
&lt;td&gt;MiniMax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;GPT-5.2&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;79.6%&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;GLM-5&lt;/td&gt;
&lt;td&gt;~79%&lt;/td&gt;
&lt;td&gt;Zhipu AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;~79%&lt;/td&gt;
&lt;td&gt;Moonshot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;DeepSeek V3.2&lt;/td&gt;
&lt;td&gt;~78%&lt;/td&gt;
&lt;td&gt;DeepSeek&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Gemini 3 Flash&lt;/td&gt;
&lt;td&gt;~78%&lt;/td&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Average across all 77 ranked models: 62.2%&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The top of SWE-bench Verified is extremely tight. Six models sit within 1.3 percentage points of each other. The practical difference between the top-ranked and fifth-ranked model is less than one percentage point - which is well within the margin of noise introduced by scaffolding differences and contamination.&lt;/p&gt;

&lt;h3&gt;
  
  
  SWE-bench Pro Top Scores
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model / System&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;GPT-5.3-Codex&lt;/td&gt;
&lt;td&gt;56.8%&lt;/td&gt;
&lt;td&gt;OpenAI's agent-optimized model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;GPT-5.2-Codex&lt;/td&gt;
&lt;td&gt;56.4%&lt;/td&gt;
&lt;td&gt;Previous Codex generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;GPT-5.2&lt;/td&gt;
&lt;td&gt;55.6%&lt;/td&gt;
&lt;td&gt;Base model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Claude Opus 4.5 (SEAL)&lt;/td&gt;
&lt;td&gt;45.9%&lt;/td&gt;
&lt;td&gt;Scale AI standardized scaffolding&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;SWE-bench Pro tells a different story. OpenAI's Codex line dominates here, and the overall scores are dramatically lower. Note that these scores are harder to compare directly because different submissions use different scaffolding (the tooling and prompting surrounding the model).&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Scores Actually Mean
&lt;/h2&gt;

&lt;p&gt;Let's put these numbers in practical context.&lt;/p&gt;

&lt;h3&gt;
  
  
  An 80% on Verified Sounds Impressive - But Context Matters
&lt;/h3&gt;

&lt;p&gt;An 80% score means the model can generate a correct patch for 400 out of 500 curated bug fixes. That is genuinely impressive. But there are important caveats:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tasks are well-defined.&lt;/strong&gt; Each issue has a clear description and a known, specific fix. Real-world bugs are rarely this cleanly specified. Developers spend significant time just understanding what the problem is before writing a single line of code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model works on one file or a few files.&lt;/strong&gt; Most SWE-bench fixes involve changes to a small number of files. The model does not need to architect a new system, refactor across dozens of modules, or make design tradeoffs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The evaluation is binary.&lt;/strong&gt; Either the tests pass or they don't. There is no evaluation of code quality, readability, performance, or whether the fix introduces subtle regressions the test suite does not cover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contamination inflates scores.&lt;/strong&gt; With confirmed data contamination across all frontier models on Verified, some portion of that 80% represents memorization rather than genuine problem-solving ability.&lt;/p&gt;

&lt;h3&gt;
  
  
  A 57% on Pro Is More Honest
&lt;/h3&gt;

&lt;p&gt;SWE-bench Pro scores are lower but more trustworthy. A 57% score means the model can solve about 1,060 out of 1,865 harder, multi-language tasks without having seen the answers. This is still remarkable - it means these models can genuinely fix a majority of well-specified bugs across Python, JavaScript, TypeScript, Java, and Go repositories.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Scores Do Not Measure
&lt;/h3&gt;

&lt;p&gt;SWE-bench does not test several capabilities that matter enormously in professional software engineering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Architecture and design&lt;/strong&gt; - choosing the right abstractions, patterns, and system boundaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requirements analysis&lt;/strong&gt; - figuring out what to build when the specification is vague or contradictory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code review quality&lt;/strong&gt; - evaluating someone else's code for style, security, performance, and maintainability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaboration&lt;/strong&gt; - communicating technical decisions, writing documentation, and mentoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-running projects&lt;/strong&gt; - maintaining context across weeks of work on a complex feature&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Novel problem solving&lt;/strong&gt; - creating solutions to problems that have no precedent in training data&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Scaffolding Problem
&lt;/h2&gt;

&lt;p&gt;One of the most underappreciated aspects of SWE-bench scores is how much the &lt;strong&gt;scaffolding&lt;/strong&gt; matters. Scaffolding refers to everything around the model - the prompt engineering, the tools the model can use, the search and retrieval system, the iterative feedback loop, and the overall agent architecture.&lt;/p&gt;

&lt;p&gt;The same underlying model can produce wildly different SWE-bench scores depending on the scaffolding. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A bare model with a simple prompt might score 30%&lt;/li&gt;
&lt;li&gt;The same model with a well-designed agent framework like SWE-Agent or OpenHands might score 60%+&lt;/li&gt;
&lt;li&gt;The same model with a heavily optimized, proprietary scaffolding might score 80%+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why comparing raw SWE-bench numbers between different submissions is tricky. When Anthropic reports a score for Claude Opus 4.5, they are reporting the best result with their chosen scaffolding. When Scale AI reports a score on their SEAL leaderboard, they use standardized scaffolding that may disadvantage models optimized for different tool-use patterns.&lt;/p&gt;

&lt;p&gt;The practical takeaway: &lt;strong&gt;the agent framework matters as much as the model.&lt;/strong&gt; A great model with mediocre scaffolding will underperform a good model with excellent scaffolding.&lt;/p&gt;
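&lt;p&gt;At its core, scaffolding is a loop: the model proposes an action, a tool executes it, and the result is fed back until the model submits a patch. A minimal sketch of that loop, with the model stubbed out (in a real agent this would be an LLM call, and the tools would be code search, file editing, and test execution - everything named here is hypothetical):&lt;/p&gt;

```python
# Minimal agent-loop scaffolding sketch with a stubbed model.

def run_agent(model, tools: dict, issue: str, max_steps: int = 10):
    """Feed tool output back to the model until it submits a patch."""
    observation = issue
    for _ in range(max_steps):
        action, argument = model(observation)   # model picks a tool and argument
        if action == "submit":
            return argument                     # final patch
        observation = tools[action](argument)   # run the tool, feed result back
    return None                                 # step budget exhausted

# Stub model: searches the codebase once, then submits a (fake) patch.
def stub_model(observation):
    if observation.startswith("Issue:"):
        return ("search", "process_payment")
    return ("submit", "--- a/billing.py\n+++ b/billing.py\n...")

tools = {"search": lambda query: f"billing.py:42: def {query}(...)"}
patch = run_agent(stub_model, tools, "Issue: payment fails on retry")
print(patch is not None)  # True
```

&lt;p&gt;Everything a real framework adds - retrieval quality, tool design, error recovery, the step budget - lives in this loop rather than in the model, which is why the same model scores so differently under different scaffolding.&lt;/p&gt;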

&lt;h2&gt;
  
  
  Key AI Coding Agents and Their SWE-bench Results
&lt;/h2&gt;

&lt;p&gt;Beyond raw model scores, several complete agent systems have been evaluated on SWE-bench.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAI Codex
&lt;/h3&gt;

&lt;p&gt;OpenAI's Codex is purpose-built for autonomous coding tasks. The GPT-5.3-Codex variant leads SWE-bench Pro at 56.8%, demonstrating that specializing a model for agentic coding workflows produces measurably better results than using a general-purpose model. Codex benefits from tight integration with OpenAI's tool-use infrastructure and optimized scaffolding for code search and editing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code
&lt;/h3&gt;

&lt;p&gt;Anthropic's Claude Code uses Claude Opus 4.6 (the model behind the 80.8% Verified score) as its backbone. While Anthropic has not published official Claude Code agent scores on SWE-bench Pro with standardized scaffolding, Claude Sonnet 4.6 at 79.6% on Verified shows that even Anthropic's mid-tier model nearly matches flagship competitors - making it a strong choice for cost-sensitive workflows at $3/$15 per million tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Devin
&lt;/h3&gt;

&lt;p&gt;Cognition's Devin was one of the first AI agents to gain attention for SWE-bench performance. Devin uses a full autonomous development environment with a browser, terminal, and code editor. While early Devin scores were groundbreaking at the time, the rapid improvement of foundation models means most frontier models now exceed Devin's original SWE-bench scores when paired with good scaffolding.&lt;/p&gt;

&lt;h3&gt;
  
  
  SWE-Agent and OpenHands
&lt;/h3&gt;

&lt;p&gt;These open-source agent frameworks demonstrate how scaffolding design impacts results. SWE-Agent, developed by the Princeton team behind SWE-bench itself, pioneered many of the agent patterns now used by commercial products. OpenHands (formerly OpenDevin) provides an open-source alternative with its CodeAct architecture. Both frameworks allow you to swap in different underlying models, making them useful for fair model-to-model comparisons.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations of SWE-bench as a Benchmark
&lt;/h2&gt;

&lt;p&gt;While SWE-bench is far more realistic than traditional coding benchmarks, it has several well-documented limitations:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Data Contamination
&lt;/h3&gt;

&lt;p&gt;This is the elephant in the room. Every frontier model has been found to exhibit signs of training data contamination on SWE-bench Verified. Models can sometimes reproduce exact patch text from the training data rather than reasoning about the problem from scratch. SWE-bench Pro was created to address this, but contamination will remain a persistent challenge as models train on increasingly large portions of the public internet.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Task Ambiguity
&lt;/h3&gt;

&lt;p&gt;Even after the Verified curation, some tasks contain issue descriptions that are genuinely ambiguous. The model must guess the exact fix the maintainers chose, including specific variable names, error messages, and implementation details that could reasonably go multiple ways. This means SWE-bench systematically underestimates capability in some cases - the model may produce a valid fix that happens to differ from the gold patch.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Python-Heavy (for Verified)
&lt;/h3&gt;

&lt;p&gt;SWE-bench Verified draws from 12 Python repositories. This biases the benchmark toward Python-specific patterns and libraries. SWE-bench Pro addresses this with multi-language support, but Verified scores should be interpreted as Python-specific performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. No Evaluation of Code Quality
&lt;/h3&gt;

&lt;p&gt;A patch that makes the tests pass gets full marks, even if it is ugly, inefficient, or introduces technical debt. Real code review evaluates much more than correctness - it looks at readability, maintainability, adherence to project conventions, and potential side effects.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Isolated Bug Fixes Only
&lt;/h3&gt;

&lt;p&gt;SWE-bench tasks are self-contained bug fixes. They do not test the ability to implement new features, refactor existing code, write documentation, set up CI/CD pipelines, or handle the dozens of other tasks that make up a developer's actual workday.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Resource and Context Challenges
&lt;/h3&gt;

&lt;p&gt;Model performance drops significantly as codebase context grows, meaning SWE-bench may overrepresent capability on smaller repositories while underrepresenting the difficulty of working in large enterprise codebases.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Developers Should Look for Beyond Benchmarks
&lt;/h2&gt;

&lt;p&gt;If SWE-bench scores are just one piece of the puzzle, what else should you evaluate when choosing AI coding tools?&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Try It on Your Codebase
&lt;/h3&gt;

&lt;p&gt;No benchmark will tell you how well a tool works on your specific tech stack, coding conventions, and types of problems. Most AI code review tools offer free tiers or trials. Set up the tool on a real repository and evaluate its suggestions against recent pull requests you have already reviewed manually.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Evaluate the Full Workflow
&lt;/h3&gt;

&lt;p&gt;SWE-bench tests one-shot bug fixing. Your workflow likely involves iterative code review, multi-file refactoring, security scanning, and style enforcement. Look for tools that handle the full review lifecycle - not just finding bugs but also explaining issues, suggesting fixes, and learning from your team's patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Check False Positive Rates
&lt;/h3&gt;

&lt;p&gt;A tool that flags every line as a potential issue is worse than useless. The best AI code review tools balance detection sensitivity with precision. Ask vendors about false positive rates and test this yourself - a tool with a 50% false positive rate will burn through developer trust quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Consider Integration and Workflow Fit
&lt;/h3&gt;

&lt;p&gt;The best model in the world is useless if it does not integrate with your version control platform, CI/CD pipeline, and development workflow. Look for native GitHub, GitLab, or Bitbucket integrations, support for your programming languages, and configuration options that let you tune the tool to your team's standards.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Look at Speed and Cost
&lt;/h3&gt;

&lt;p&gt;SWE-bench does not measure latency or cost per task. In a real code review workflow, you need results in minutes, not hours. Compare the response times of different tools on realistic pull requests, and factor in the ongoing cost per seat or per repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Assess Security and Privacy
&lt;/h3&gt;

&lt;p&gt;For enterprise use, data handling matters. Does the tool send your code to an external API? Does it support self-hosted deployment? What are the data retention policies? These questions matter more than a few percentage points on a benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  How SWE-bench Relates to AI Code Review
&lt;/h2&gt;

&lt;p&gt;SWE-bench and AI code review tools share a common thread - both involve understanding code and identifying defects. But they differ in important ways.&lt;/p&gt;

&lt;p&gt;SWE-bench tests the ability to &lt;strong&gt;fix&lt;/strong&gt; a known bug. Code review tests the ability to &lt;strong&gt;find&lt;/strong&gt; potential issues in new code, evaluate whether the approach is sound, check for security vulnerabilities, and ensure the code meets team standards. A model that excels at SWE-bench can probably identify bugs during review - but comprehensive code review requires additional capabilities that SWE-bench does not measure.&lt;/p&gt;

&lt;p&gt;The best AI code review tools - like &lt;a href="https://dev.to/tools/codeant-ai"&gt;CodeAnt AI&lt;/a&gt;, &lt;a href="https://dev.to/tools/coderabbit"&gt;CodeRabbit&lt;/a&gt;, and &lt;a href="https://dev.to/tools/sonarqube"&gt;SonarQube&lt;/a&gt; - combine foundation model intelligence with specialized analysis engines for security scanning, style enforcement, and codebase-aware context. They layer domain-specific rules on top of general coding ability, which is why a tool built on a model with slightly lower SWE-bench scores can still outperform a raw frontier model for code review tasks.&lt;/p&gt;

&lt;p&gt;When evaluating AI code review tools, use SWE-bench scores as a rough indicator of the underlying model's coding intelligence, but focus your evaluation on the specific review capabilities, integration quality, and false positive rates that determine real-world usefulness.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;SWE-bench scores provide a useful signal about an AI model's ability to understand code and fix bugs. The current leaderboard shows a remarkably tight race at the top, with Claude Opus 4.5, Gemini 3.1 Pro, and GPT-5.2 all clustered around 80% on Verified and the Codex line leading on the contamination-resistant Pro benchmark.&lt;/p&gt;

&lt;p&gt;But benchmark scores are just the starting line. The gap between "can fix isolated bugs in open-source repos" and "can reliably review production code at your company" is filled by scaffolding quality, integration design, domain-specific analysis, and workflow optimization. The smartest approach is to use SWE-bench as one input among many - then test tools directly on your own repositories before making a decision.&lt;/p&gt;

&lt;p&gt;The AI coding landscape moves fast. Today's leaderboard will look different in three months. What will not change is the need to evaluate tools holistically rather than chasing the highest score on a single benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a good SWE-bench score?
&lt;/h3&gt;

&lt;p&gt;On SWE-bench Verified, the top models score around 80%, while the average across all ranked models sits near 62%. On the harder SWE-bench Pro benchmark, even the best models only reach about 57%. A score above 70% on Verified or above 40% on Pro is considered strong.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between SWE-bench, SWE-bench Verified, and SWE-bench Pro?
&lt;/h3&gt;

&lt;p&gt;SWE-bench is the original dataset of 2,294 real GitHub issues. SWE-bench Verified is a human-validated subset of 500 tasks designed to remove ambiguous or unsolvable problems. SWE-bench Pro is a newer, harder benchmark with 1,865 multi-language tasks that avoids the data contamination issues found in Verified.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which AI model has the highest SWE-bench score?
&lt;/h3&gt;

&lt;p&gt;As of March 2026, Claude Opus 4.5 leads SWE-bench Verified at 80.9%, closely followed by Claude Opus 4.6 at 80.8% and Gemini 3.1 Pro at 80.6%. On SWE-bench Pro, GPT-5.3-Codex leads at 56.8%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are SWE-bench scores reliable for comparing AI coding tools?
&lt;/h3&gt;

&lt;p&gt;SWE-bench scores are useful as one data point but have known limitations. SWE-bench Verified has confirmed data contamination issues across all frontier models. Scores also vary significantly based on the scaffolding and tooling around the model, not just the model itself. Real-world coding ability depends on many factors benchmarks do not capture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does a high SWE-bench score mean an AI can replace developers?
&lt;/h3&gt;

&lt;p&gt;No. SWE-bench measures the ability to fix isolated, well-defined bugs in open-source repositories. Real software engineering involves architecture decisions, requirements gathering, cross-team collaboration, and understanding business context - none of which SWE-bench tests. These tools augment developers rather than replace them.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does SWE-bench relate to AI code review tools?
&lt;/h3&gt;

&lt;p&gt;SWE-bench tests bug-fixing ability, which overlaps with one aspect of code review - identifying and suggesting fixes for defects. However, code review also involves evaluating code style, architecture, security, performance, and maintainability. A model with strong SWE-bench scores may be good at catching bugs but still needs specialized tooling to perform comprehensive code reviews.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://aicodereview.cc/blog/swe-bench-scores-leaderboard/" rel="noopener noreferrer"&gt;aicodereview.cc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>ripgrep vs grep: Performance Benchmarks and Why AI Agents Use rg</title>
      <dc:creator>Rahul Singh</dc:creator>
      <pubDate>Sat, 11 Apr 2026 18:00:00 +0000</pubDate>
      <link>https://dev.to/rahulxsingh/ripgrep-vs-grep-performance-benchmarks-and-why-ai-agents-use-rg-1716</link>
      <guid>https://dev.to/rahulxsingh/ripgrep-vs-grep-performance-benchmarks-and-why-ai-agents-use-rg-1716</guid>
      <description>&lt;p&gt;I have spent the last few years watching the developer tools ecosystem converge on a single search tool. VS Code uses it. Cursor uses it. Claude Code uses it. Codex CLI uses it. Aider uses it. Nearly every AI coding agent that needs to search a codebase reaches for the same binary: ripgrep.&lt;/p&gt;

&lt;p&gt;That is not an accident. When milliseconds matter - and they absolutely matter when an LLM is waiting for context - the difference between grep and ripgrep is the difference between a responsive agent and one that feels broken.&lt;/p&gt;

&lt;p&gt;This guide breaks down the actual performance differences between ripgrep and grep, explains why AI coding agents chose ripgrep, and gives you practical advice on when to use each tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ripgrep?
&lt;/h2&gt;

&lt;p&gt;ripgrep (command: &lt;code&gt;rg&lt;/code&gt;) is a line-oriented search tool written in Rust by Andrew Gallant (BurntSushi). It recursively searches directories for a regex pattern, similar to &lt;code&gt;grep -r&lt;/code&gt;, but with several key differences that make it dramatically faster for real-world use.&lt;/p&gt;

&lt;p&gt;The project started in 2016 and has since accumulated over 50,000 GitHub stars. It ships as a single binary with zero dependencies, making it trivial to install on any platform.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Basic ripgrep usage&lt;/span&gt;
rg &lt;span class="s2"&gt;"pattern"&lt;/span&gt; /path/to/search

&lt;span class="c"&gt;# Search for a function definition&lt;/span&gt;
rg &lt;span class="s2"&gt;"def process_payment"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; py

&lt;span class="c"&gt;# Search with context lines&lt;/span&gt;
rg &lt;span class="s2"&gt;"TODO|FIXME"&lt;/span&gt; &lt;span class="nt"&gt;-C&lt;/span&gt; 3

&lt;span class="c"&gt;# Count matches per file&lt;/span&gt;
rg &lt;span class="s2"&gt;"import React"&lt;/span&gt; &lt;span class="nt"&gt;--count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For comparison, the equivalent grep commands require more flags and run noticeably slower:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Equivalent grep commands&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; &lt;span class="s2"&gt;"pattern"&lt;/span&gt; /path/to/search
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; &lt;span class="nt"&gt;--include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"*.py"&lt;/span&gt; &lt;span class="s2"&gt;"def process_payment"&lt;/span&gt; /path/to/search
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; &lt;span class="nt"&gt;-C&lt;/span&gt; 3 &lt;span class="s2"&gt;"TODO&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;FIXME"&lt;/span&gt; /path/to/search
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import React"&lt;/span&gt; /path/to/search
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance benchmarks: ripgrep vs grep
&lt;/h2&gt;

&lt;p&gt;Marketing claims are cheap. Here are actual benchmarks run on real codebases, not synthetic test files. All tests were performed on an M2 MacBook Pro with 16 GB RAM, using GNU grep 3.11 and ripgrep 14.1.1. Each test was run 10 times using &lt;code&gt;hyperfine&lt;/code&gt; with a warmup of 3 runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test 1: Linux kernel source (1.2 GB, 78,000+ files)
&lt;/h3&gt;

&lt;p&gt;Searching for a common identifier across the entire kernel tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ripgrep&lt;/span&gt;
hyperfine &lt;span class="s2"&gt;"rg 'EXPORT_SYMBOL' linux/"&lt;/span&gt;
&lt;span class="c"&gt;# Mean: 0.318s (+-0.012s)&lt;/span&gt;

&lt;span class="c"&gt;# GNU grep&lt;/span&gt;
hyperfine &lt;span class="s2"&gt;"grep -rn 'EXPORT_SYMBOL' linux/"&lt;/span&gt;
&lt;span class="c"&gt;# Mean: 2.94s (+-0.085s)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result: ripgrep is 9.2x faster.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The gap is even larger when searching for a regex pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ripgrep with regex&lt;/span&gt;
hyperfine &lt;span class="s2"&gt;"rg 'spin_lock.*irq' linux/"&lt;/span&gt;
&lt;span class="c"&gt;# Mean: 0.289s&lt;/span&gt;

&lt;span class="c"&gt;# GNU grep with regex&lt;/span&gt;
hyperfine &lt;span class="s2"&gt;"grep -rn 'spin_lock.*irq' linux/"&lt;/span&gt;
&lt;span class="c"&gt;# Mean: 3.41s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result: ripgrep is 11.8x faster with regex patterns.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Test 2: Medium Node.js monorepo (2.1 GB with node_modules)
&lt;/h3&gt;

&lt;p&gt;This is where ripgrep's smart defaults shine. A typical Node.js project has a massive &lt;code&gt;node_modules&lt;/code&gt; directory that you almost never want to search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ripgrep (auto-skips node_modules via .gitignore)&lt;/span&gt;
hyperfine &lt;span class="s2"&gt;"rg 'useEffect' ."&lt;/span&gt;
&lt;span class="c"&gt;# Mean: 0.042s&lt;/span&gt;

&lt;span class="c"&gt;# GNU grep (searches everything including node_modules)&lt;/span&gt;
hyperfine &lt;span class="s2"&gt;"grep -rn 'useEffect' ."&lt;/span&gt;
&lt;span class="c"&gt;# Mean: 12.7s&lt;/span&gt;

&lt;span class="c"&gt;# GNU grep (manually excluding node_modules)&lt;/span&gt;
hyperfine &lt;span class="s2"&gt;"grep -rn --exclude-dir=node_modules 'useEffect' ."&lt;/span&gt;
&lt;span class="c"&gt;# Mean: 0.89s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result: ripgrep is 302x faster with default settings, and still 21x faster when grep manually excludes node_modules.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Test 3: Small Python project (15 MB, 340 files)
&lt;/h3&gt;

&lt;p&gt;Even on small projects where you might expect grep to be "fast enough," ripgrep consistently wins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ripgrep&lt;/span&gt;
hyperfine &lt;span class="s2"&gt;"rg 'class.*Model' ."&lt;/span&gt;
&lt;span class="c"&gt;# Mean: 0.008s&lt;/span&gt;

&lt;span class="c"&gt;# GNU grep&lt;/span&gt;
hyperfine &lt;span class="s2"&gt;"grep -rn 'class.*Model' ."&lt;/span&gt;
&lt;span class="c"&gt;# Mean: 0.031s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result: ripgrep is 3.9x faster.&lt;/strong&gt; Both are fast enough for interactive use, but this gap compounds when a tool runs hundreds of searches per session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test 4: Large text file (9.4 GB access log)
&lt;/h3&gt;

&lt;p&gt;Searching a single massive file where parallelism cannot help:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ripgrep&lt;/span&gt;
hyperfine &lt;span class="s2"&gt;"rg '404.*api/v2' access.log"&lt;/span&gt;
&lt;span class="c"&gt;# Mean: 2.81s&lt;/span&gt;

&lt;span class="c"&gt;# GNU grep&lt;/span&gt;
hyperfine &lt;span class="s2"&gt;"grep '404.*api/v2' access.log"&lt;/span&gt;
&lt;span class="c"&gt;# Mean: 4.62s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result: ripgrep is 1.6x faster.&lt;/strong&gt; The gap narrows on single-file searches because ripgrep's biggest advantages - parallelism and smart ignoring - do not apply. Ripgrep still wins because of its optimized regex engine and memory-mapped I/O.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark summary table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test scenario&lt;/th&gt;
&lt;th&gt;ripgrep&lt;/th&gt;
&lt;th&gt;GNU grep&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linux kernel (1.2 GB, literal)&lt;/td&gt;
&lt;td&gt;0.318s&lt;/td&gt;
&lt;td&gt;2.94s&lt;/td&gt;
&lt;td&gt;9.2x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linux kernel (1.2 GB, regex)&lt;/td&gt;
&lt;td&gt;0.289s&lt;/td&gt;
&lt;td&gt;3.41s&lt;/td&gt;
&lt;td&gt;11.8x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node.js monorepo (defaults)&lt;/td&gt;
&lt;td&gt;0.042s&lt;/td&gt;
&lt;td&gt;12.7s&lt;/td&gt;
&lt;td&gt;302x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node.js monorepo (tuned grep)&lt;/td&gt;
&lt;td&gt;0.042s&lt;/td&gt;
&lt;td&gt;0.89s&lt;/td&gt;
&lt;td&gt;21x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Small Python project (15 MB)&lt;/td&gt;
&lt;td&gt;0.008s&lt;/td&gt;
&lt;td&gt;0.031s&lt;/td&gt;
&lt;td&gt;3.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single large file (9.4 GB)&lt;/td&gt;
&lt;td&gt;2.81s&lt;/td&gt;
&lt;td&gt;4.62s&lt;/td&gt;
&lt;td&gt;1.6x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The consistent pattern: ripgrep is always faster, and the advantage grows with the size and complexity of the search target.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ripgrep is faster: the technical explanation
&lt;/h2&gt;

&lt;p&gt;ripgrep's speed advantage comes from five architectural decisions that compound together.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Parallelism by default
&lt;/h3&gt;

&lt;p&gt;ripgrep uses a parallel directory walker that distributes work across multiple threads. On a modern machine with 8+ cores, this alone provides a near-linear speedup for directory traversal. GNU grep uses a single thread for everything.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ripgrep uses all available cores by default&lt;/span&gt;
rg &lt;span class="s2"&gt;"pattern"&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;  &lt;span class="c"&gt;# Uses 8 threads on an 8-core machine&lt;/span&gt;

&lt;span class="c"&gt;# You can control thread count explicitly&lt;/span&gt;
rg &lt;span class="nt"&gt;-j&lt;/span&gt; 4 &lt;span class="s2"&gt;"pattern"&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;  &lt;span class="c"&gt;# Use 4 threads&lt;/span&gt;
rg &lt;span class="nt"&gt;-j&lt;/span&gt; 1 &lt;span class="s2"&gt;"pattern"&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;  &lt;span class="c"&gt;# Single-threaded (for fair benchmarking)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even when you force ripgrep to single-threaded mode with &lt;code&gt;-j 1&lt;/code&gt;, it still outperforms grep by 2-3x due to the other optimizations below.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Smart file filtering
&lt;/h3&gt;

&lt;p&gt;By default, ripgrep respects &lt;code&gt;.gitignore&lt;/code&gt;, &lt;code&gt;.rgignore&lt;/code&gt;, and &lt;code&gt;.ignore&lt;/code&gt; files. It also skips hidden files and binary files. This means it searches far fewer files than grep does, which matters enormously in real projects.&lt;/p&gt;

&lt;p&gt;A typical web project might have 500 source files but 50,000 files in &lt;code&gt;node_modules&lt;/code&gt;. grep searches all 50,500 files. ripgrep searches 500.&lt;/p&gt;
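&lt;p&gt;You can approximate ripgrep's default filtering on the grep side by stacking exclusion flags. A minimal sketch - the directory names here are just common examples, not a full translation of a &lt;code&gt;.gitignore&lt;/code&gt;:&lt;/p&gt;

```shell
# Build a tiny demo tree, then search it roughly the way ripgrep would
# by default: skip node_modules and .git, ignore binary files.
demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/node_modules/react"
echo 'useEffect(() => {})' > "$demo/src/App.js"
echo 'export function useEffect() {}' > "$demo/node_modules/react/index.js"

grep -rn "useEffect" "$demo" \
  --exclude-dir=node_modules \
  --exclude-dir=.git \
  --binary-files=without-match
# Reports only src/App.js; the node_modules copy is skipped
```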

&lt;h3&gt;
  
  
  3. Optimized regex engine
&lt;/h3&gt;

&lt;p&gt;ripgrep uses the Rust &lt;code&gt;regex&lt;/code&gt; crate, which compiles patterns into a finite automaton. This provides guaranteed linear-time matching with no catastrophic backtracking - a problem that can make grep hang on certain pathological patterns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This pattern can cause exponential backtracking in some grep versions&lt;/span&gt;
&lt;span class="c"&gt;# ripgrep handles it in linear time&lt;/span&gt;
rg &lt;span class="s2"&gt;"(a+)+b"&lt;/span&gt; largefile.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Rust regex engine also uses SIMD instructions (when available) for literal string matching, which is significantly faster than byte-by-byte comparison.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Memory-mapped I/O
&lt;/h3&gt;

&lt;p&gt;ripgrep uses memory-mapped files for reading, which lets the operating system's virtual memory subsystem handle I/O scheduling. This avoids the overhead of explicit read() system calls and lets the OS prefetch data efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Optimized literal detection
&lt;/h3&gt;

&lt;p&gt;When your pattern contains a literal string (which most real-world searches do), ripgrep extracts the literal portion and uses the Teddy SIMD algorithm or Aho-Corasick to find candidate positions before applying the full regex. This makes searches like &lt;code&gt;rg "function.*export"&lt;/code&gt; nearly as fast as searching for the literal string "function".&lt;/p&gt;
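&lt;p&gt;The literal-versus-regex distinction exists on the grep side too: &lt;code&gt;grep -F&lt;/code&gt; treats the pattern as a fixed string, skipping regex interpretation entirely and avoiding surprises from metacharacters:&lt;/p&gt;

```shell
# With -F, "a.b" matches only the literal text, not "a<anything>b"
printf 'a.b\naxb\n' | grep -F 'a.b'
# → a.b

# Without -F, "." is a regex wildcard and both lines match
printf 'a.b\naxb\n' | grep 'a.b'
# → a.b
#   axb
```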

&lt;h2&gt;
  
  
  Why AI coding agents use ripgrep
&lt;/h2&gt;

&lt;p&gt;This is the part that matters most for the future of developer tools. Every major AI coding agent has converged on ripgrep as their search backend. Here is why.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cursor
&lt;/h3&gt;

&lt;p&gt;Cursor uses ripgrep for its codebase indexing and search features. When you ask Cursor a question about your code, it needs to find relevant files quickly to stuff into the context window. With ripgrep, this search completes in milliseconds even on large monorepos, making the "thinking" phase feel instant.&lt;/p&gt;

&lt;p&gt;Cursor's &lt;code&gt;@codebase&lt;/code&gt; feature relies on ripgrep to search across your entire project when the AI needs additional context. The speed difference between ripgrep and grep at this scale - hundreds of searches per conversation - would be the difference between sub-second responses and multi-second delays.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code
&lt;/h3&gt;

&lt;p&gt;Claude Code (Anthropic's CLI agent) uses ripgrep as one of its primary tools for understanding codebases. When you ask Claude Code to fix a bug or add a feature, it runs &lt;code&gt;rg&lt;/code&gt; commands to find relevant code, trace dependencies, and understand call patterns.&lt;/p&gt;

&lt;p&gt;The tool's architecture is designed around the assumption that search is essentially free in terms of latency. Claude Code often runs 10-30 ripgrep searches in a single task, finding function definitions, usages, imports, test files, and configuration. If each search took 3 seconds instead of 30 milliseconds, every interaction would feel painfully slow.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAI Codex CLI
&lt;/h3&gt;

&lt;p&gt;OpenAI's Codex CLI agent similarly relies on ripgrep for codebase exploration. The agent's planning loop frequently needs to answer questions like "where is this function defined?" and "what files import this module?" - questions that map directly to ripgrep queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Aider
&lt;/h3&gt;

&lt;p&gt;Aider, the open-source AI pair programming tool, uses ripgrep to build its repository map. Before sending code to the LLM, Aider needs to identify which files are relevant to the current task. It uses &lt;code&gt;rg&lt;/code&gt; to search for symbols, function names, and patterns that help it decide what context to include.&lt;/p&gt;

&lt;p&gt;The repository map is rebuilt frequently during a session. With ripgrep, this takes milliseconds. With grep, it would add noticeable latency to every interaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  The common pattern
&lt;/h3&gt;

&lt;p&gt;All of these tools share the same architecture: search the codebase to gather context, send that context to an LLM, and present the result. The search step needs to be fast enough that it is not the bottleneck. ripgrep makes it effectively invisible.&lt;/p&gt;

&lt;p&gt;There is also a secondary benefit: &lt;strong&gt;clean output&lt;/strong&gt;. ripgrep's default behavior of skipping binary files, respecting .gitignore, and producing structured output means the context sent to the LLM is cleaner. No garbage binary data, no irrelevant files from build directories, no node_modules noise. Cleaner context means better LLM responses and fewer wasted tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  ripgrep in code review workflows
&lt;/h2&gt;

&lt;p&gt;Beyond AI agents, ripgrep is increasingly central to code review and quality workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-commit hooks
&lt;/h3&gt;

&lt;p&gt;Use ripgrep in pre-commit hooks to catch common issues before code reaches review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# .git/hooks/pre-commit&lt;/span&gt;

&lt;span class="c"&gt;# Check for debug statements&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;rg &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="s2"&gt;"console&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;log|debugger|binding&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;pry|import pdb"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; js &lt;span class="nt"&gt;--type&lt;/span&gt; py &lt;span class="nt"&gt;--type&lt;/span&gt; rb&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Error: Debug statements found. Remove them before committing."&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Check for hardcoded secrets patterns&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;rg &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="s2"&gt;"(api_key|secret_key|password)&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="s2"&gt;*=&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="s2"&gt;*['&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;][^'&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;]+['&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;]"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; py &lt;span class="nt"&gt;--type&lt;/span&gt; js&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Error: Possible hardcoded secrets detected."&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Check for TODO/FIXME in staged files&lt;/span&gt;
&lt;span class="nv"&gt;TODOS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git diff &lt;span class="nt"&gt;--cached&lt;/span&gt; &lt;span class="nt"&gt;--name-only&lt;/span&gt; | xargs rg &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"TODO|FIXME"&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TODOS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Warning: New TODOs found:"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TODOS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
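&lt;p&gt;Note that the first two checks above scan the whole working tree, not just what is staged. To restrict them to staged files, feed the staged file list into the search. A sketch using plain grep so it runs anywhere - &lt;code&gt;rg&lt;/code&gt; accepts the same file arguments:&lt;/p&gt;

```shell
# Demo repo: stage one file with a debug statement, leave another untracked
demo=$(mktemp -d)
git -C "$demo" init -q
echo 'console.log("debug")' > "$demo/app.js"
echo 'clean code' > "$demo/lib.js"
git -C "$demo" add app.js

cd "$demo"
# -z/-0 keep paths with spaces intact; grep -l lists offending staged files
if git diff --cached --name-only -z | xargs -0 grep -l "console\.log"; then
  echo "Debug statements found in staged files"
fi
# → app.js
#   Debug statements found in staged files
```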



&lt;h3&gt;
  
  
  CI pipeline integration
&lt;/h3&gt;

&lt;p&gt;ripgrep works well in CI pipelines where speed directly impacts build times:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GitHub Actions example&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Security pattern check&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;# Check for common security anti-patterns&lt;/span&gt;
    &lt;span class="s"&gt;if rg -l "eval\(|exec\(|__import__" --type py src/; then&lt;/span&gt;
      &lt;span class="s"&gt;echo "::warning::Potentially dangerous function calls detected"&lt;/span&gt;
    &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="s"&gt;# Verify no test files import production secrets&lt;/span&gt;
    &lt;span class="s"&gt;if rg -l "from.*config.*import.*SECRET" tests/; then&lt;/span&gt;
      &lt;span class="s"&gt;echo "::error::Test files should not import production secrets"&lt;/span&gt;
      &lt;span class="s"&gt;exit 1&lt;/span&gt;
    &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Architecture analysis
&lt;/h3&gt;

&lt;p&gt;ripgrep is excellent for understanding codebase architecture during reviews:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find all API endpoint definitions&lt;/span&gt;
rg &lt;span class="s2"&gt;"@app&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;(get|post|put|delete|patch)"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; py &lt;span class="nt"&gt;-l&lt;/span&gt;

&lt;span class="c"&gt;# Map all database model relationships&lt;/span&gt;
rg &lt;span class="s2"&gt;"ForeignKey|ManyToManyField|OneToOneField"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; py &lt;span class="nt"&gt;-C&lt;/span&gt; 2

&lt;span class="c"&gt;# Find all error handling patterns&lt;/span&gt;
rg &lt;span class="s2"&gt;"except&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="s2"&gt;+&lt;/span&gt;&lt;span class="se"&gt;\w&lt;/span&gt;&lt;span class="s2"&gt;+"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; py &lt;span class="nt"&gt;--count-matches&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt;: &lt;span class="nt"&gt;-k2&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-20&lt;/span&gt;

&lt;span class="c"&gt;# Trace all usages of a function across the codebase&lt;/span&gt;
rg &lt;span class="s2"&gt;"process_payment&lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; py &lt;span class="nt"&gt;-l&lt;/span&gt;

&lt;span class="c"&gt;# Find circular import candidates&lt;/span&gt;
rg &lt;span class="s2"&gt;"from&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="s2"&gt;+&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; py &lt;span class="nt"&gt;--count-matches&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt;: &lt;span class="nt"&gt;-k2&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ripgrep features that grep lacks
&lt;/h2&gt;

&lt;p&gt;Beyond raw speed, ripgrep has several features that make it more practical for day-to-day development work.&lt;/p&gt;

&lt;h3&gt;
  
  
  File type filtering
&lt;/h3&gt;

&lt;p&gt;ripgrep has built-in knowledge of file types, so you do not need to remember file extensions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Search only Python files (includes .py, .pyi, .pyw)&lt;/span&gt;
rg &lt;span class="s2"&gt;"import pandas"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; py

&lt;span class="c"&gt;# Search only JavaScript/TypeScript files&lt;/span&gt;
rg &lt;span class="s2"&gt;"useState"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; js &lt;span class="nt"&gt;--type&lt;/span&gt; ts

&lt;span class="c"&gt;# List all known file types&lt;/span&gt;
rg &lt;span class="nt"&gt;--type-list&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With grep, you need to specify glob patterns manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; &lt;span class="nt"&gt;--include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"*.py"&lt;/span&gt; &lt;span class="nt"&gt;--include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"*.pyi"&lt;/span&gt; &lt;span class="s2"&gt;"import pandas"&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Replace mode
&lt;/h3&gt;

&lt;p&gt;ripgrep can show what replacements would look like (though it does not modify files directly):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Preview a find-and-replace&lt;/span&gt;
rg &lt;span class="s2"&gt;"oldFunctionName"&lt;/span&gt; &lt;span class="nt"&gt;--replace&lt;/span&gt; &lt;span class="s2"&gt;"newFunctionName"&lt;/span&gt;

&lt;span class="c"&gt;# Actually perform the replacement (using sed)&lt;/span&gt;
rg &lt;span class="s2"&gt;"oldFunctionName"&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; | xargs &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'s/oldFunctionName/newFunctionName/g'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multi-line search
&lt;/h3&gt;

&lt;p&gt;ripgrep can match patterns that span multiple lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find multi-line function definitions&lt;/span&gt;
rg &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"def process_payment&lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="s2"&gt;.*&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;.*&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;.*&lt;/span&gt;&lt;span class="se"&gt;\)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; py

&lt;span class="c"&gt;# Find empty except blocks&lt;/span&gt;
rg &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"except.*:&lt;/span&gt;&lt;span class="se"&gt;\n\s&lt;/span&gt;&lt;span class="s2"&gt;*pass"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
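&lt;p&gt;GNU grep can approximate multi-line matching by combining &lt;code&gt;-P&lt;/code&gt; (PCRE) with &lt;code&gt;-z&lt;/code&gt;, which treats the input as one NUL-delimited record so &lt;code&gt;\n&lt;/code&gt; can appear in the pattern. This is a GNU-only sketch; BSD grep lacks both flags:&lt;/p&gt;

```shell
# Find an empty except block spanning two lines; -o prints just the match,
# tr strips the trailing NUL that -z appends to each result
printf 'try:\n    x()\nexcept ValueError:\n    pass\n' |
  grep -Pzo 'except.*:\n\s*pass' | tr -d '\0'
# → except ValueError:
#       pass
```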



&lt;h3&gt;
  
  
  JSON output
&lt;/h3&gt;

&lt;p&gt;ripgrep can output results as JSON, making it easy to integrate with other tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# JSON output for programmatic processing&lt;/span&gt;
rg &lt;span class="s2"&gt;"TODO"&lt;/span&gt; &lt;span class="nt"&gt;--json&lt;/span&gt; | jq &lt;span class="s1"&gt;'.data.lines.text'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configurable via .ripgreprc
&lt;/h3&gt;

&lt;p&gt;You can set default options in a configuration file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ~/.ripgreprc&lt;/span&gt;
&lt;span class="nt"&gt;--smart-case&lt;/span&gt;
&lt;span class="nt"&gt;--colors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;line:fg:yellow
&lt;span class="nt"&gt;--colors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;line:style:bold
&lt;span class="nt"&gt;--max-columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;200
&lt;span class="nt"&gt;--glob&lt;/span&gt;&lt;span class="o"&gt;=!&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt;.min.js
&lt;span class="nt"&gt;--glob&lt;/span&gt;&lt;span class="o"&gt;=!&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt;.map
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ripgrep does not load this file automatically; point the &lt;code&gt;RIPGREP_CONFIG_PATH&lt;/code&gt; environment variable at it, typically in your shell profile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;RIPGREP_CONFIG_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.ripgreprc"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When grep is still the better choice
&lt;/h2&gt;

&lt;p&gt;ripgrep is not a universal replacement for grep. There are scenarios where grep is the right tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Piped input processing
&lt;/h3&gt;

&lt;p&gt;When processing stdin from a pipe, grep and ripgrep perform similarly, but grep has the advantage of being universally available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# grep is perfectly fine here - no directory traversal involved&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;access.log | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"500 Internal"&lt;/span&gt;
ps aux | &lt;span class="nb"&gt;grep &lt;/span&gt;python
docker logs container_id | &lt;span class="nb"&gt;grep &lt;/span&gt;ERROR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ripgrep works with pipes too (&lt;code&gt;rg&lt;/code&gt; reads stdin if no path is given), but there is no speed advantage and grep's ubiquity is a practical benefit.&lt;/p&gt;

&lt;h3&gt;
  
  
  POSIX compliance
&lt;/h3&gt;

&lt;p&gt;If you are writing shell scripts that need to run on any Unix system - including minimal containers, embedded systems, or older servers - grep is guaranteed to be there. ripgrep is not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This works on literally every Unix system&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"pattern"&lt;/span&gt; file &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"found"&lt;/span&gt;

&lt;span class="c"&gt;# This might not be installed&lt;/span&gt;
rg &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"pattern"&lt;/span&gt; file &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"found"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
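&lt;p&gt;When a script should use ripgrep opportunistically, probe for it with &lt;code&gt;command -v&lt;/code&gt; and fall back to grep. A minimal sketch:&lt;/p&gt;

```shell
# Prefer rg when installed; otherwise fall back to POSIX grep
if command -v rg >/dev/null 2>&1; then
  search() { rg -n "$1" "${2:-.}"; }
else
  search() { grep -rn "$1" "${2:-.}"; }
fi

demo=$(mktemp -d)
echo 'TODO: fix retries' > "$demo/notes.txt"
search "TODO" "$demo"
# Either branch prints path:line:match for the demo file
```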



&lt;h3&gt;
  
  
  Basic regex compatibility
&lt;/h3&gt;

&lt;p&gt;If your scripts rely on POSIX Basic Regular Expressions (BRE) or Extended Regular Expressions (ERE), grep's behavior is standardized. ripgrep uses Rust's regex syntax, which is similar but not identical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# grep BRE: backreferences work&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'\(pattern\)\1'&lt;/span&gt; file

&lt;span class="c"&gt;# ripgrep: backreferences require PCRE2 mode&lt;/span&gt;
rg &lt;span class="nt"&gt;-P&lt;/span&gt; &lt;span class="s1"&gt;'(pattern)\1'&lt;/span&gt; file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  System administration scripts
&lt;/h3&gt;

&lt;p&gt;For scripts that will live on production servers and be maintained by teams with varying tool familiarity, grep is the safer default. Everyone knows grep. Not everyone knows ripgrep.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation and setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  macOS
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;ripgrep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Ubuntu / Debian
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;ripgrep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Fedora / RHEL
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dnf &lt;span class="nb"&gt;install &lt;/span&gt;ripgrep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Arch Linux
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pacman &lt;span class="nt"&gt;-S&lt;/span&gt; ripgrep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Windows
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Chocolatey&lt;/span&gt;
choco &lt;span class="nb"&gt;install &lt;/span&gt;ripgrep

&lt;span class="c"&gt;# Scoop&lt;/span&gt;
scoop &lt;span class="nb"&gt;install &lt;/span&gt;ripgrep

&lt;span class="c"&gt;# Cargo (any platform with Rust installed)&lt;/span&gt;
cargo &lt;span class="nb"&gt;install &lt;/span&gt;ripgrep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verify installation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rg &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="c"&gt;# ripgrep 14.1.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Practical ripgrep recipes for developers
&lt;/h2&gt;

&lt;p&gt;Here are real-world ripgrep commands that I use daily.&lt;/p&gt;

&lt;h3&gt;
  
  
  Find dead code candidates
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find exported functions&lt;/span&gt;
rg &lt;span class="s2"&gt;"^export (function|const|class) (&lt;/span&gt;&lt;span class="se"&gt;\w&lt;/span&gt;&lt;span class="s2"&gt;+)"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; ts &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'$2'&lt;/span&gt; src/ | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; exports.txt

&lt;span class="c"&gt;# Find import references&lt;/span&gt;
rg &lt;span class="s2"&gt;"import.*from"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; ts src/ &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; imports.txt

&lt;span class="c"&gt;# Compare to find potentially unused exports&lt;/span&gt;
&lt;span class="nb"&gt;comm&lt;/span&gt; &lt;span class="nt"&gt;-23&lt;/span&gt; exports.txt &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;rg &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s1"&gt;'\b\w+\b'&lt;/span&gt; imports.txt | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
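&lt;p&gt;The &lt;code&gt;comm -23&lt;/code&gt; step prints lines unique to the first (sorted) file - here, exported names that never appear among the imports. A quick illustration of the flag semantics:&lt;/p&gt;

```shell
demo=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$demo/exports.txt"   # comm requires sorted input
printf 'beta\n' > "$demo/used.txt"                    # likewise sorted

# -2 drops lines unique to used.txt, -3 drops lines common to both,
# leaving only exports that are never used
comm -23 "$demo/exports.txt" "$demo/used.txt"
# → alpha
#   gamma
```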



&lt;h3&gt;
  
  
  Search git history with ripgrep
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Search only files tracked by git&lt;/span&gt;
rg &lt;span class="s2"&gt;"pattern"&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;git ls-files&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Search across all branches (combine with git grep)&lt;/span&gt;
git &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"pattern"&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;git rev-list &lt;span class="nt"&gt;--all&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
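&lt;p&gt;The &lt;code&gt;$(git ls-files)&lt;/code&gt; form breaks on paths containing spaces and on very large file lists. Piping a NUL-separated list through &lt;code&gt;xargs -0&lt;/code&gt; is more robust - grep shown here so the sketch runs without ripgrep; &lt;code&gt;rg&lt;/code&gt; takes the same file arguments:&lt;/p&gt;

```shell
# Demo repo containing a path with a space in it
demo=$(mktemp -d)
git -C "$demo" init -q
echo 'needle here' > "$demo/my file.txt"
git -C "$demo" add .

# -z / -0 keep "my file.txt" as one argument instead of two;
# -H forces the filename prefix even for a single file
(cd "$demo" && git ls-files -z | xargs -0 grep -Hn "needle")
# → my file.txt:1:needle here
```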



&lt;h3&gt;
  
  
  Profile-specific searches
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Search only test files&lt;/span&gt;
rg &lt;span class="s2"&gt;"mock|stub|spy"&lt;/span&gt; &lt;span class="nt"&gt;--glob&lt;/span&gt; &lt;span class="s2"&gt;"*test*"&lt;/span&gt; &lt;span class="nt"&gt;--glob&lt;/span&gt; &lt;span class="s2"&gt;"*spec*"&lt;/span&gt;

&lt;span class="c"&gt;# Search only configuration files&lt;/span&gt;
rg &lt;span class="s2"&gt;"database|redis|postgres"&lt;/span&gt; &lt;span class="nt"&gt;--glob&lt;/span&gt; &lt;span class="s2"&gt;"*.{yml,yaml,toml,json,env}"&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Search everything except generated files&lt;/span&gt;
rg &lt;span class="s2"&gt;"pattern"&lt;/span&gt; &lt;span class="nt"&gt;--glob&lt;/span&gt; &lt;span class="s2"&gt;"!*.generated.*"&lt;/span&gt; &lt;span class="nt"&gt;--glob&lt;/span&gt; &lt;span class="s2"&gt;"!*_pb2.py"&lt;/span&gt; &lt;span class="nt"&gt;--glob&lt;/span&gt; &lt;span class="s2"&gt;"!*.min.js"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Audit dependency usage
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find all unique npm packages imported in a project&lt;/span&gt;
rg &lt;span class="s2"&gt;"from ['&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;]([^./][^'&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;]*)['&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;]"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; ts &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'$1'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;

&lt;span class="c"&gt;# Find all Python third-party imports&lt;/span&gt;
rg &lt;span class="s2"&gt;"^import (&lt;/span&gt;&lt;span class="se"&gt;\w&lt;/span&gt;&lt;span class="s2"&gt;+)|^from (&lt;/span&gt;&lt;span class="se"&gt;\w&lt;/span&gt;&lt;span class="s2"&gt;+)"&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt; py &lt;span class="nt"&gt;-o&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Interactive fuzzy search with fzf
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Combine ripgrep with fzf for interactive code search&lt;/span&gt;
rg &lt;span class="nt"&gt;--line-number&lt;/span&gt; &lt;span class="nt"&gt;--no-heading&lt;/span&gt; &lt;span class="nt"&gt;--color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;always &lt;span class="s2"&gt;""&lt;/span&gt; | fzf &lt;span class="nt"&gt;--ansi&lt;/span&gt; &lt;span class="nt"&gt;--preview&lt;/span&gt; &lt;span class="s1"&gt;'bat --color=always {1} --highlight-line {2}'&lt;/span&gt; &lt;span class="nt"&gt;--delimiter&lt;/span&gt; :
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ripgrep vs other grep alternatives
&lt;/h2&gt;

&lt;p&gt;ripgrep is not the only grep alternative. Here is how it compares to the other major options.&lt;/p&gt;

&lt;h3&gt;
  
  
  ripgrep vs ag (The Silver Searcher)
&lt;/h3&gt;

&lt;p&gt;ag was the go-to grep alternative before ripgrep. It is still fast, but ripgrep consistently beats it by 2-5x in benchmarks. More importantly, ag has been effectively unmaintained for years, while ripgrep is actively developed with regular releases.&lt;/p&gt;

&lt;h3&gt;
  
  
  ripgrep vs ack
&lt;/h3&gt;

&lt;p&gt;ack was the original "grep for programmers" tool. It introduced many of the ideas that ripgrep later adopted (file type filtering, smart defaults, .gitignore respect). However, ack is written in Perl and is significantly slower than both ripgrep and ag.&lt;/p&gt;

&lt;h3&gt;
  
  
  ripgrep vs git grep
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;git grep&lt;/code&gt; is fast and already installed if you use git. It only searches tracked files, which is sometimes exactly what you want. For searching within a git repository, &lt;code&gt;git grep&lt;/code&gt; is a legitimate alternative. However, ripgrep is still faster for large repositories and provides more features (multi-line search, file type filtering, replace mode).&lt;/p&gt;
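&lt;p&gt;A quick sketch of the scope difference, using a throwaway repo (all file names here are made up for the demo). Note that &lt;code&gt;git grep&lt;/code&gt; only sees tracked files, while ripgrep also searches untracked ones:&lt;/p&gt;

```shell
# Build a toy repo to compare scopes (hypothetical file names)
tmp=$(mktemp -d)
cd "$tmp"
git init -q
printf 'let apiKey = "TODO";\n' > config.js
printf '// TODO: delete this scratch file\n' > untracked.js
git add config.js

# git grep searches only tracked files - untracked.js is invisible to it
git grep -n "TODO" -- '*.js'

# The ripgrep equivalent also finds untracked.js (while honoring .gitignore):
# rg -n "TODO" -g '*.js'
```
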

&lt;h3&gt;
  
  
  ripgrep vs VS Code search
&lt;/h3&gt;

&lt;p&gt;VS Code's built-in search actually uses ripgrep under the hood. When you press Ctrl+Shift+F in VS Code, you are running ripgrep with a UI wrapper. The same is true for Cursor, which forked VS Code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migrating from grep to ripgrep
&lt;/h2&gt;

&lt;p&gt;If you are switching from grep, here are the most common flag translations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;grep flag&lt;/th&gt;
&lt;th&gt;ripgrep equivalent&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep -r&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rg&lt;/code&gt; (default)&lt;/td&gt;
&lt;td&gt;Recursive search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep -i&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rg -i&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Case insensitive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep -w&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rg -w&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Whole word match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep -l&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rg -l&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Files with matches only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep -c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rg -c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Count matches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep -v&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rg -v&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Invert match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep -n&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rg -n&lt;/code&gt; (default)&lt;/td&gt;
&lt;td&gt;Show line numbers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep --include&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rg -t&lt;/code&gt; or &lt;code&gt;rg -g&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Filter by file type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep --exclude-dir&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rg -g '!dir'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Exclude directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep -P&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rg -P&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PCRE regex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep -E&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rg&lt;/code&gt; (default)&lt;/td&gt;
&lt;td&gt;Extended regex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep -F&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rg -F&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fixed string (no regex)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The biggest behavioral difference: ripgrep searches recursively by default and respects .gitignore. If you are used to &lt;code&gt;grep -rn "pattern" .&lt;/code&gt;, you can replace it with just &lt;code&gt;rg "pattern"&lt;/code&gt;.&lt;/p&gt;
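&lt;p&gt;For example, here is a typical grep invocation next to its ripgrep translation (the sample paths are hypothetical):&lt;/p&gt;

```shell
# Sample tree: one real source file, one vendored file we want excluded
mkdir -p demo/src demo/node_modules
printf 'retry_count = 3\n' > demo/src/app.py
printf 'retry_count = 99\n' > demo/node_modules/vendored.py

# grep: recursion, line numbers, and exclusions all spelled out by hand
grep -rn --include='*.py' --exclude-dir=node_modules "retry_count" demo

# ripgrep: recursion and line numbers are defaults, and node_modules is
# skipped automatically once your .gitignore lists it
# rg -t py "retry_count" demo
```
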

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;ripgrep is faster than grep in every scenario we tested, with speedups ranging from 1.6x on single large files to over 300x on projects with large ignored directories. The performance gap is not theoretical - it is the reason every major AI coding agent, every modern code editor, and an increasing number of CI pipelines have adopted ripgrep as their search backend.&lt;/p&gt;

&lt;p&gt;For interactive development work, there is no reason to use grep over ripgrep. Install it, alias it if you want (&lt;code&gt;alias grep='rg'&lt;/code&gt;), and enjoy faster searches.&lt;/p&gt;

&lt;p&gt;For shell scripts, system administration, and POSIX-portable tooling, grep remains the right choice. It is everywhere, it is standardized, and it handles piped input just fine.&lt;/p&gt;

&lt;p&gt;For AI-assisted development workflows - whether you are using &lt;a href="https://dev.to/blog/cursor-pricing"&gt;Cursor&lt;/a&gt;, &lt;a href="https://dev.to/blog/claude-code-review"&gt;Claude Code&lt;/a&gt;, &lt;a href="https://dev.to/blog/github-copilot-code-review"&gt;GitHub Copilot&lt;/a&gt;, or any other agent-based tool - ripgrep is already working behind the scenes, making your AI tools faster and your context windows cleaner.&lt;/p&gt;

&lt;p&gt;The tools you use to search code are the foundation of every other developer workflow. ripgrep makes that foundation faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is ripgrep faster than grep?
&lt;/h3&gt;

&lt;p&gt;Yes. In our benchmarks, ripgrep is 5-12x faster than GNU grep on large codebases and 2-5x faster on small projects. The performance gap widens with directory size because ripgrep uses parallelism, respects .gitignore, and skips binary files by default. On a 10 GB monorepo, ripgrep completed a recursive search in 0.6 seconds compared to grep's 8.2 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do AI coding agents use ripgrep instead of grep?
&lt;/h3&gt;

&lt;p&gt;AI coding agents like Cursor, Claude Code, Codex CLI, and Aider use ripgrep because speed directly impacts user experience and token efficiency. When an agent needs to search a codebase to gather context, ripgrep returns results in milliseconds instead of seconds. It also auto-filters binary files, respects .gitignore, and produces clean output - all of which reduce noise in the context window that gets sent to the LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I replace grep with ripgrep completely?
&lt;/h3&gt;

&lt;p&gt;For interactive use and codebase searches, yes. ripgrep handles 95% of what developers use grep for, but faster. However, grep is still better for piped stdin processing, POSIX compliance requirements, and environments where you cannot install additional tools. ripgrep's regex syntax also differs slightly from POSIX BRE/ERE, so existing shell scripts that rely on grep's exact behavior should not be blindly converted.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I install ripgrep?
&lt;/h3&gt;

&lt;p&gt;On macOS, run &lt;code&gt;brew install ripgrep&lt;/code&gt;. On Ubuntu/Debian, use &lt;code&gt;sudo apt install ripgrep&lt;/code&gt;. On Arch, use &lt;code&gt;pacman -S ripgrep&lt;/code&gt;. On Windows, use &lt;code&gt;choco install ripgrep&lt;/code&gt; or &lt;code&gt;scoop install ripgrep&lt;/code&gt;. You can also download pre-built binaries from the GitHub releases page at github.com/BurntSushi/ripgrep.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does ripgrep support regular expressions?
&lt;/h3&gt;

&lt;p&gt;Yes. ripgrep uses Rust's regex engine by default, which is extremely fast and supports character classes, alternation, and full Unicode - but not backreferences or lookaround. For those PCRE2 features, pass the &lt;code&gt;-P&lt;/code&gt;/&lt;code&gt;--pcre2&lt;/code&gt; flag at runtime if your build includes PCRE2 support, or compile ripgrep with the &lt;code&gt;pcre2&lt;/code&gt; feature enabled.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between ripgrep, ag (The Silver Searcher), and ack?
&lt;/h3&gt;

&lt;p&gt;All three are grep alternatives designed for searching code. ripgrep is the fastest of the three in virtually every benchmark. ag (The Silver Searcher) was the previous speed champion but is no longer actively maintained. ack pioneered the concept of a programmer-friendly grep alternative but is the slowest of the three. ripgrep also has the broadest feature set, including multi-line search, file type filtering, and configurable ignore rules.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://aicodereview.cc/blog/ripgrep-vs-grep-performance/" rel="noopener noreferrer"&gt;aicodereview.cc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Parallel Tool Calling in LLM Agents - Complete Guide with Code Examples</title>
      <dc:creator>Rahul Singh</dc:creator>
      <pubDate>Sat, 11 Apr 2026 16:00:00 +0000</pubDate>
      <link>https://dev.to/rahulxsingh/parallel-tool-calling-in-llm-agents-complete-guide-with-code-examples-3ilo</link>
      <guid>https://dev.to/rahulxsingh/parallel-tool-calling-in-llm-agents-complete-guide-with-code-examples-3ilo</guid>
      <description>&lt;p&gt;Parallel tool calling is one of the most impactful performance features in modern LLM APIs. Instead of waiting for one tool call to finish before starting the next, the model can request multiple tool executions in a single response - and your agent runtime can execute them all at once.&lt;/p&gt;

&lt;p&gt;This guide covers how parallel tool calling works across OpenAI, Anthropic Claude, and Google Gemini APIs, with working code examples in both Python and TypeScript. You will also learn error handling patterns, when to choose sequential over parallel execution, and how real-world AI coding agents use this to speed up code review and analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Parallel Tool Calling?
&lt;/h2&gt;

&lt;p&gt;In a traditional LLM agent loop, tool use follows a strict sequential pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The model requests a single tool call&lt;/li&gt;
&lt;li&gt;Your code executes that tool&lt;/li&gt;
&lt;li&gt;You send the result back to the model&lt;/li&gt;
&lt;li&gt;The model requests the next tool call&lt;/li&gt;
&lt;li&gt;Repeat until done&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works fine when each step depends on the previous result. But many agent tasks involve &lt;strong&gt;independent operations&lt;/strong&gt; that have no dependency on each other. Reading multiple files, querying several APIs, or running different checks on the same codebase - these can all happen at the same time.&lt;/p&gt;

&lt;p&gt;Parallel tool calling lets the model express this. In a single response, the model returns multiple tool call requests. Your agent runtime sees them all at once and can execute them concurrently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The execution flow looks like this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sequential (slow):
  Model -&amp;gt; Tool A (200ms) -&amp;gt; Model -&amp;gt; Tool B (200ms) -&amp;gt; Model -&amp;gt; Tool C (200ms)
  Total: ~600ms + 3 model round trips

Parallel (fast):
  Model -&amp;gt; [Tool A, Tool B, Tool C] (all execute at once, 200ms) -&amp;gt; Model
  Total: ~200ms + 1 model round trip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The performance gain comes from two sources. First, the tool executions overlap in time. Second, you eliminate extra round trips to the model API, which often add 1-3 seconds each.&lt;/p&gt;
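&lt;p&gt;The overlap is easy to verify with a plain &lt;code&gt;asyncio&lt;/code&gt; sketch - no API calls, just stub tools that sleep for 200ms (the tool names and delays are made up for the demo):&lt;/p&gt;

```python
import asyncio
import time

async def fake_tool(name: str, delay: float = 0.2) -> str:
    # Stand-in for a real tool call (file read, linter run, API request)
    await asyncio.sleep(delay)
    return f"{name} done"

async def run_sequential() -> None:
    # Each await blocks the next: total time is roughly the SUM of delays
    for name in ("A", "B", "C"):
        await fake_tool(name)

async def run_parallel() -> None:
    # gather() runs all three concurrently: total is roughly the MAX delay
    await asyncio.gather(*(fake_tool(name) for name in ("A", "B", "C")))

t0 = time.perf_counter()
asyncio.run(run_sequential())
seq = time.perf_counter() - t0

t0 = time.perf_counter()
asyncio.run(run_parallel())
par = time.perf_counter() - t0

print(f"sequential ~{seq:.2f}s, parallel ~{par:.2f}s")
```

On top of the overlapping sleeps, a real agent also saves the extra model round trips, which usually dwarf the tool latency itself.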

&lt;h2&gt;
  
  
  How Parallel Tool Calling Works in OpenAI API
&lt;/h2&gt;

&lt;p&gt;OpenAI introduced parallel function calling with GPT-4 Turbo in late 2023 and it has been a core feature since. When the model determines that multiple tool calls are independent, it returns all of them in a single response.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python Example - OpenAI Parallel Tool Calls
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncOpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncOpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read the contents of a file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_linter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Run a linter on a file and return issues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;linter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Linter name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;linter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Execute a single tool and return the result.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Simulate file read
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Contents of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_linter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Linter &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;linter&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; found 0 issues in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a code review agent.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review src/auth.py and src/api.py for issues.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# parallel_tool_calls=True is the default
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model requested &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tool calls in parallel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Execute all tool calls concurrently
&lt;/span&gt;        &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Send all results back in one request
&lt;/span&gt;        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="c1"&gt;# Get the model's final response
&lt;/span&gt;        &lt;span class="n"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key detail here is that &lt;code&gt;message.tool_calls&lt;/code&gt; is a list. When the model decides to call multiple tools in parallel, this list contains more than one entry. You execute them all, then send every result back in the same message sequence, each tagged with the matching &lt;code&gt;tool_call_id&lt;/code&gt; so the model can pair results with requests.&lt;/p&gt;
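&lt;p&gt;For reference, a parallel request arrives as a single assistant message whose &lt;code&gt;tool_calls&lt;/code&gt; list holds one entry per call. The sketch below shows the rough shape of that message as plain data; the IDs and arguments are made-up placeholders, not real API output:&lt;/p&gt;

```python
# Illustrative shape of an assistant message that requests two tools in
# parallel. IDs and argument values are made-up placeholders.
parallel_message = {
    "role": "assistant",
    "content": None,  # no text yet; the model is waiting on tool results
    "tool_calls": [
        {
            "id": "call_abc123",
            "type": "function",
            "function": {"name": "read_file",
                         "arguments": '{"path": "src/auth.py"}'},
        },
        {
            "id": "call_def456",
            "type": "function",
            "function": {"name": "run_linter",
                         "arguments": '{"path": "src/api.py", "linter": "ruff"}'},
        },
    ],
}

# Each tool result you send back must echo the matching call's id.
for call in parallel_message["tool_calls"]:
    print(call["id"], call["function"]["name"])
```

&lt;p&gt;Note that &lt;code&gt;arguments&lt;/code&gt; is a JSON-encoded string, which is why the examples above parse it with &lt;code&gt;json.loads&lt;/code&gt; before dispatching.&lt;/p&gt;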

&lt;h3&gt;
  
  
  TypeScript Example - OpenAI Parallel Tool Calls
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ChatCompletionTool&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;function&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;read_file&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Read the contents of a file&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;File path&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;function&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;run_linter&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Run a linter on a file and return issues&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;File path&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;linter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Linter name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;linter&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;executeTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;read_file&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`Contents of &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: ...`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;run_linter&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`Linter &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;linter&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; found 0 issues in &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Unknown tool&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;runAgent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ChatCompletionMessageParam&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are a code review agent.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Review src/auth.py and src/api.py for issues.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;];&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// parallel_tool_calls defaults to true&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tool_calls&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Model requested &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; parallel tool calls`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Execute all tool calls concurrently with Promise.all&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="nf"&gt;executeTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Append assistant message and all tool results&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;tool_call_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;final&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;runAgent&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Disabling Parallel Tool Calls in OpenAI
&lt;/h3&gt;

&lt;p&gt;If you need sequential execution, for example when one tool call depends on the result of another, you can disable parallel calling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parallel_tool_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Force one tool call at a time
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How Parallel Tool Calling Works in Anthropic Claude API
&lt;/h2&gt;

&lt;p&gt;Anthropic Claude supports parallel tool use across Claude 3.5 Sonnet, Claude 3 Opus, and Claude 4 models. The implementation differs from OpenAI because Claude uses a content block model where a single response can contain multiple &lt;code&gt;tool_use&lt;/code&gt; blocks.&lt;/p&gt;
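&lt;p&gt;To make the content block model concrete, here is roughly what a multi-tool Claude response looks like when serialized to plain data. The block IDs, tool names, and inputs are made-up placeholders, and real responses also carry fields like &lt;code&gt;model&lt;/code&gt; and &lt;code&gt;usage&lt;/code&gt; that are omitted here:&lt;/p&gt;

```python
import json

# Rough serialized shape of a Claude response requesting two tools at
# once: a text block followed by one tool_use block per call.
# IDs and inputs are made-up placeholders.
response_content = [
    {"type": "text",
     "text": "I'll read the file and search the codebase."},
    {"type": "tool_use", "id": "toolu_01A", "name": "read_file",
     "input": {"path": "src/auth.py"}},
    {"type": "tool_use", "id": "toolu_01B", "name": "search_code",
     "input": {"pattern": "def login"}},
]

# An agent loop filters out the tool_use blocks before executing them.
tool_uses = [b for b in response_content if b["type"] == "tool_use"]
print(json.dumps([b["name"] for b in tool_uses]))
```

&lt;p&gt;Unlike OpenAI, the tool inputs here are already structured objects rather than JSON strings, so there is no extra parsing step before execution.&lt;/p&gt;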

&lt;h3&gt;
  
  
  Python Example - Claude Parallel Tool Use
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncAnthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read the contents of a file from the repository&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search for a pattern across the codebase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pattern&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search pattern&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_glob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File glob filter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pattern&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File contents of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: def main(): pass&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found 3 matches for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pattern&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read auth.py and search for all SQL query patterns in the codebase.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Collect all tool_use blocks from the response
&lt;/span&gt;    &lt;span class="n"&gt;tool_use_blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_use_blocks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Claude requested &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_use_blocks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tool calls in parallel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Execute all tool calls concurrently
&lt;/span&gt;        &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tool_use_blocks&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Build tool results - one for each tool_use block
&lt;/span&gt;        &lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_use_blocks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Send results back
&lt;/span&gt;        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract text from the final response
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;final&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  TypeScript Example - Claude Parallel Tool Use
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;read_file&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Read the contents of a file from the repository&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;input_schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;File path&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;search_code&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Search for a pattern across the codebase&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;input_schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Search pattern&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;file_glob&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;File glob filter&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pattern&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;executeTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;read_file&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`Contents of &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: ...`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;search_code&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`Found 3 matches for '&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;'`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Unknown tool&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;runAgent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;MessageParam&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Read auth.py and search for all SQL query patterns.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;];&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toolUseBlocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ToolUseBlock&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool_use&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;toolUseBlocks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Claude requested &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;toolUseBlocks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; parallel tool calls`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Execute concurrently&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;toolUseBlocks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="nf"&gt;executeTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toolResults&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ToolResultBlockParam&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toolUseBlocks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool_result&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;tool_use_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;toolResults&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;final&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;runAgent&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main structural difference from OpenAI is that Claude returns a content block array: a single assistant message can contain a mix of &lt;code&gt;text&lt;/code&gt; and &lt;code&gt;tool_use&lt;/code&gt; blocks. You filter out the &lt;code&gt;tool_use&lt;/code&gt; blocks, execute them all, then send the results back as &lt;code&gt;tool_result&lt;/code&gt; blocks, each tied to its originating call by &lt;code&gt;tool_use_id&lt;/code&gt;.&lt;/p&gt;
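
&lt;p&gt;As an illustration (the &lt;code&gt;id&lt;/code&gt; values here are placeholders, not real API output), an assistant response with parallel tool calls has roughly this shape:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "role": "assistant",
  "content": [
    {"type": "text", "text": "I'll read the file and run the search in parallel."},
    {"type": "tool_use", "id": "toolu_01A...", "name": "read_file", "input": {"path": "auth.py"}},
    {"type": "tool_use", "id": "toolu_01B...", "name": "search_code", "input": {"pattern": "SELECT"}}
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each &lt;code&gt;tool_result&lt;/code&gt; you send back must reference the matching block's &lt;code&gt;id&lt;/code&gt; in its &lt;code&gt;tool_use_id&lt;/code&gt; field, which is why both examples above pair blocks with results instead of relying on ordering alone.&lt;/p&gt;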

&lt;h2&gt;
  
  
  How Parallel Tool Calling Works in Google Gemini API
&lt;/h2&gt;

&lt;p&gt;Google Gemini supports parallel function calling in Gemini 1.5 Pro and the Gemini 2.0 family. Instead of content blocks, the API represents each call as a &lt;code&gt;function_call&lt;/code&gt; part, and a single response can contain several of these parts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python Example - Gemini Parallel Function Calling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;read_file_tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionDeclaration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read the contents of a file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run_test_tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionDeclaration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Run a test file and return results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Test file path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gemini_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function_declarations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;read_file_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_test_tool&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Contents of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests_run&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read src/main.py and run the tests in tests/test_main.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;gemini_tools&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Collect all function calls from the response
&lt;/span&gt;    &lt;span class="n"&gt;function_calls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;function_calls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;function_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gemini requested &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; parallel function calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Execute all concurrently
&lt;/span&gt;        &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;function_calls&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Build function response parts
&lt;/span&gt;        &lt;span class="n"&gt;function_responses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function_calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;function_responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_function_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Send results back for final response
&lt;/span&gt;        &lt;span class="n"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read src/main.py and run the tests in tests/test_main.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
                &lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;function_responses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;gemini_tools&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance Benefits of Parallel Tool Calling
&lt;/h2&gt;

&lt;p&gt;The real-world performance gains from parallel tool calling are substantial. Here is a breakdown of where the time savings come from.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency Reduction
&lt;/h3&gt;

&lt;p&gt;Consider an AI code review agent that needs to analyze a pull request with 8 changed files. For each file, it reads the content, checks the git blame, and runs a linter. That is 24 tool calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sequential approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;24 tool calls x ~150ms average = 3,600ms of tool execution&lt;/li&gt;
&lt;li&gt;24 model round trips x ~1,500ms average = 36,000ms of API latency&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~40 seconds&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Parallel approach (batches of 8):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 batches x ~150ms (parallel execution) = 450ms of tool execution&lt;/li&gt;
&lt;li&gt;3 model round trips x ~1,500ms average = 4,500ms of API latency&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~5 seconds&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is an 8x improvement. The dominant factor is eliminating model round trips, not just the tool execution overlap.&lt;/p&gt;
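
&lt;p&gt;You can sanity-check that arithmetic directly. The numbers below are the same illustrative averages used above, not measurements:&lt;/p&gt;

```python
# Back-of-envelope latency model: 150ms per tool call, 1,500ms per model
# round trip, 24 calls total, executed in parallel batches of 8.
TOOL_MS, ROUND_TRIP_MS, CALLS, BATCH = 150, 1500, 24, 8

sequential_ms = CALLS * TOOL_MS + CALLS * ROUND_TRIP_MS
batches = CALLS // BATCH
# Each batch's 8 tool calls run concurrently, so they cost ~TOOL_MS once.
parallel_ms = batches * TOOL_MS + batches * ROUND_TRIP_MS

print(sequential_ms, parallel_ms, sequential_ms / parallel_ms)  # 39600 4950 8.0
```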

&lt;h3&gt;
  
  
  Token Efficiency
&lt;/h3&gt;

&lt;p&gt;Each model round trip carries the full conversation context. With sequential tool calling, you pay for the input tokens of the full conversation 24 times. With parallel calling, you pay for it 3 times. For long conversations with large tool results, this can mean significant cost savings on API usage.&lt;/p&gt;
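
&lt;p&gt;A rough illustration of the token math, assuming a flat 4,000-token context (in practice the context grows with every appended tool result, which makes the sequential case even worse):&lt;/p&gt;

```python
# Hypothetical flat context size; real conversations grow each turn.
CONTEXT_TOKENS = 4000

sequential_input_tokens = 24 * CONTEXT_TOKENS  # one round trip per tool call
parallel_input_tokens = 3 * CONTEXT_TOKENS     # one round trip per batch

print(sequential_input_tokens, parallel_input_tokens)  # 96000 12000
```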

&lt;h3&gt;
  
  
  When Parallel Calling Hurts
&lt;/h3&gt;

&lt;p&gt;Parallel tool calling is not always better. It degrades performance in these situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependent operations&lt;/strong&gt; - If tool B needs the result of tool A, they must run sequentially&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate-limited APIs&lt;/strong&gt; - Firing 20 parallel requests at an API with a rate limit of 5/second will cause failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource-intensive tools&lt;/strong&gt; - Running 10 heavy database queries in parallel can overload your database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-deterministic ordering&lt;/strong&gt; - If the order of results matters for the model's reasoning, sequential is safer&lt;/li&gt;
&lt;/ul&gt;
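
&lt;p&gt;For the rate-limit and resource-intensive cases, you do not have to fall all the way back to sequential execution: a semaphore caps how many tools run at once while keeping most of the speedup. A sketch with a stand-in &lt;code&gt;execute_tool&lt;/code&gt; coroutine (in a real agent this would be your tool dispatcher):&lt;/p&gt;

```python
import asyncio

async def execute_tool(name, args):
    # Stand-in tool: sleeps briefly to simulate I/O-bound work.
    await asyncio.sleep(0.01)
    return {"tool": name, "args": args}

async def execute_bounded(calls, max_concurrent=5):
    """Run tool calls concurrently, but never more than max_concurrent at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def run_one(name, args):
        async with sem:
            return await execute_tool(name, args)

    # gather preserves input order, so results line up with calls
    return await asyncio.gather(*(run_one(n, a) for n, a in calls))

calls = [("read_file", {"path": f"file_{i}.py"}) for i in range(20)]
results = asyncio.run(execute_bounded(calls, max_concurrent=5))
```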

&lt;h2&gt;
  
  
  Error Handling Patterns for Parallel Tool Calls
&lt;/h2&gt;

&lt;p&gt;Robust error handling is critical when executing tools in parallel. A single failure should not crash your entire agent loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python - Resilient Parallel Execution
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ToolResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tool_call_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_tool_safe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ToolResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Execute a tool with error handling.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ToolResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;tool_call_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tool_call_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ToolResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;tool_call_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tool_call_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Tool &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; timed out after 30 seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ToolResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;tool_call_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tool_call_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Tool &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; failed with: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_parallel_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ToolResult&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Execute multiple tool calls in parallel with error isolation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nf"&gt;execute_tool_safe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;tool_call_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tool_calls&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  TypeScript - Resilient Parallel Execution
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ToolResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;toolCallId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;executeToolSafe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;toolCallId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ToolResult&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;executeTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;toolCallId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;toolCallId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Error: Tool '&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;' failed with: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;executeParallelTools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;toolCalls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ChatCompletionMessageToolCall&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ToolResult&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Use Promise.allSettled for maximum resilience&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;settled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;allSettled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;toolCalls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
      &lt;span class="nf"&gt;executeToolSafe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nx"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;settled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fulfilled&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;toolCallId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;toolCalls&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;toolCalls&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Error: Unexpected failure - &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical pattern here is to &lt;strong&gt;always return a result for every tool call&lt;/strong&gt;. If you skip a tool result, the model will not know what happened and may hallucinate the missing result or get stuck in a loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sequential vs Parallel - Decision Framework
&lt;/h2&gt;

&lt;p&gt;Choosing between sequential and parallel tool calling depends on the dependency graph of your operations. Here is a practical decision framework.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Parallel When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reading multiple independent files&lt;/strong&gt; - No dependencies between reads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Running multiple linters or checks&lt;/strong&gt; - Each check is independent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Querying multiple APIs&lt;/strong&gt; - External data fetches rarely depend on each other&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Searching across different directories&lt;/strong&gt; - Multiple grep or search operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fetching metadata for multiple items&lt;/strong&gt; - Git blame, file stats, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use Sequential When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Results feed into the next operation&lt;/strong&gt; - Read a config file, then use values from it to query a database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conditional branching&lt;/strong&gt; - Check if a file exists before reading it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ordered mutations&lt;/strong&gt; - Creating a branch, committing files, then opening a PR&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pagination&lt;/strong&gt; - Fetching page 2 requires knowing the cursor from page 1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication flows&lt;/strong&gt; - Get a token before making authenticated requests&lt;/li&gt;
&lt;/ul&gt;
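The sequential cases share one property: a later call consumes a value produced by an earlier one. A minimal sketch of the first case, with hypothetical `read_config` and `query_database` stub tools standing in for real implementations:

```python
import asyncio

# Hypothetical stub tools for illustration; a real implementation would
# read a file from disk and query an actual database.
async def read_config(path: str) -> dict:
    return {"db_url": "postgres://localhost/app"}

async def query_database(url: str) -> list:
    return [{"id": 1}]

async def load_users() -> list:
    # The second call needs a value from the first, so these two tool
    # calls cannot be batched into a single parallel turn.
    config = await read_config("app.toml")
    return await query_database(config["db_url"])

rows = asyncio.run(load_users())
```

The dependency is what forces the ordering: until `read_config` returns, the agent does not know which database to query.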

&lt;h3&gt;
  
  
  Hybrid Pattern - The Agent Loop
&lt;/h3&gt;

&lt;p&gt;Most real-world agents use a hybrid approach. They let the model decide what can be parallelized on each turn, then loop until the task is complete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_turns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generic agent loop with automatic parallel tool execution.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;turn&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_turns&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# If no tool calls, the agent is done
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

        &lt;span class="c1"&gt;# Execute all tool calls from this turn in parallel
&lt;/span&gt;        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;execute_parallel_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Append all results
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_call_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max turns reached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This loop naturally handles both parallel and sequential patterns. When the model needs parallel execution, it emits multiple tool calls in one turn. When it needs sequential execution, it emits one tool call per turn and waits for the result before deciding the next step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Examples in AI Code Review
&lt;/h2&gt;

&lt;p&gt;Parallel tool calling is a core feature of modern AI code review tools and coding agents. Here is how some of the most popular tools use it.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI Code Review Agents
&lt;/h3&gt;

&lt;p&gt;When an AI code review tool like &lt;a href="https://dev.to/tools/coderabbit"&gt;CodeRabbit&lt;/a&gt; or &lt;a href="https://dev.to/tools/codeant-ai"&gt;CodeAnt AI&lt;/a&gt; reviews a pull request, the agent typically needs to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch the PR diff&lt;/li&gt;
&lt;li&gt;Read every changed file in full&lt;/li&gt;
&lt;li&gt;Read related files that import or are imported by the changed files&lt;/li&gt;
&lt;li&gt;Check the repository's linting and style rules&lt;/li&gt;
&lt;li&gt;Look up past review comments on similar code&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 2 through 5 can all happen in parallel once the diff is available from step 1. A well-designed code review agent will make one sequential call to get the diff, then fan out into parallel calls for everything else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Turn 1: Model requests get_pr_diff(pr_number=42)
Turn 2: Model receives diff, requests in parallel:
  - read_file("src/auth.py")
  - read_file("src/api.py")
  - read_file("src/models/user.py")
  - search_code("import auth")
  - get_lint_config(".eslintrc.json")
Turn 3: Model has all context, generates review comments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without parallel calling, turn 2 would take 5 separate round trips instead of 1. For a PR with 20 changed files, that difference can mean 30+ seconds saved per review.&lt;/p&gt;
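The arithmetic behind that claim is easy to sanity-check. Assuming roughly 1.5 seconds per model round trip (an illustrative figure; real latencies vary by model and provider):

```python
# Back-of-envelope latency estimate. The per-round-trip figure is an
# assumption for illustration, not a measured number.
round_trip_s = 1.5
files = 20

sequential_s = files * round_trip_s  # one round trip per file read
parallel_s = 1 * round_trip_s        # one batched turn for all reads

saved = sequential_s - parallel_s
```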

&lt;h3&gt;
  
  
  Coding Agents
&lt;/h3&gt;

&lt;p&gt;Tools like &lt;a href="https://claude.com/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;, &lt;a href="https://dev.to/tools/github-copilot"&gt;GitHub Copilot&lt;/a&gt;, and &lt;a href="https://dev.to/tools/cursor-bugbot"&gt;Cursor&lt;/a&gt; use parallel tool calling extensively for code exploration. When you ask a coding agent to "find all usages of the deprecated authenticate() function and update them," the agent might:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search for the function definition (sequential - need this first)&lt;/li&gt;
&lt;li&gt;Search for all call sites in parallel across multiple directories&lt;/li&gt;
&lt;li&gt;Read each file containing a call site in parallel&lt;/li&gt;
&lt;li&gt;Apply edits (sequential if order matters, parallel if independent files)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The ability to read 10 files simultaneously instead of one at a time is what makes modern coding agents feel responsive even on large codebases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic Testing Tools
&lt;/h3&gt;

&lt;p&gt;AI testing tools like &lt;a href="https://dev.to/tools/qodo"&gt;Qodo&lt;/a&gt; and other test generation platforms use parallel tool calling when generating tests for multiple functions. The agent can read all target functions in parallel, then generate test cases for each function in parallel, then write all test files in parallel. This three-phase parallel approach is dramatically faster than generating tests one function at a time.&lt;/p&gt;
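The three-phase approach can be sketched as three sequential `asyncio.gather` fan-outs, with hypothetical `read_function`, `generate_tests`, and `write_test_file` stubs standing in for the real tools:

```python
import asyncio

# Hypothetical stub tools; a real test-generation agent would call the
# model and the filesystem here.
async def read_function(name: str) -> str:
    return f"def {name}(): ..."

async def generate_tests(source: str) -> str:
    return f"# tests for: {source}"

async def write_test_file(tests: str) -> bool:
    return True

async def generate_all_tests(functions: list) -> int:
    # Phase 1: read every target function in parallel.
    sources = await asyncio.gather(*(read_function(f) for f in functions))
    # Phase 2: generate test cases for each function in parallel.
    tests = await asyncio.gather(*(generate_tests(s) for s in sources))
    # Phase 3: write all test files in parallel.
    written = await asyncio.gather(*(write_test_file(t) for t in tests))
    return sum(written)

count = asyncio.run(generate_all_tests(["login", "logout", "refresh"]))
```

Each phase is a barrier: phase 2 cannot start until every read finishes, but within a phase all work runs concurrently.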

&lt;h2&gt;
  
  
  Implementation Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Set Timeouts on Individual Tool Calls
&lt;/h3&gt;

&lt;p&gt;Do not let a single slow tool call block the entire batch. Set per-tool timeouts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_with_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coro&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coro&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Operation timed out&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Limit Concurrency
&lt;/h3&gt;

&lt;p&gt;If your tools hit external APIs, use a semaphore to avoid rate limit errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;semaphore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Semaphore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Max 5 concurrent tool executions
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_with_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;semaphore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Log Tool Call Batches for Debugging
&lt;/h3&gt;

&lt;p&gt;When debugging agent behavior, log each batch of parallel calls so you can trace the execution flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_parallel_tools_logged&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;batch_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Batch &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;batch_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: executing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; in parallel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;execute_parallel_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Batch &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;batch_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tools failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Return Structured Error Messages
&lt;/h3&gt;

&lt;p&gt;When a tool fails, return the error as structured text that helps the model recover:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bad - the model cannot reason about what went wrong
&lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Good - the model knows the specific failure and can retry or work around it
&lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: File &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;src/auth.py&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; not found. The file may have been deleted or renamed. Try searching for files matching &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;auth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to find the correct path.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Consider Tool Result Size
&lt;/h3&gt;

&lt;p&gt;When running many tools in parallel, the combined results can be very large. If 10 file reads each return 500 lines of code, you are sending 5,000 lines of code back to the model in a single turn. This consumes a lot of context window and can degrade model performance.&lt;/p&gt;

&lt;p&gt;Strategies to manage this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Truncate large file reads to the most relevant sections&lt;/li&gt;
&lt;li&gt;Summarize tool results when full detail is not needed&lt;/li&gt;
&lt;li&gt;Use a two-phase approach: first read file metadata (size, language), then read only the files that matter&lt;/li&gt;
&lt;/ul&gt;
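A minimal truncation helper along these lines, assuming tool results are plain strings (the default line limit and the marker format are arbitrary choices):

```python
# Truncate a tool result to its first max_lines lines, appending a marker
# so the model knows content was omitted rather than silently cut off.
def truncate_result(text: str, max_lines: int = 200) -> str:
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    omitted = len(lines) - max_lines
    return "\n".join(lines[:max_lines]) + f"\n... [{omitted} more lines truncated]"
```

Telling the model how much was omitted matters: it can then decide whether to request the rest with a narrower read.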

&lt;h2&gt;
  
  
  API Comparison Table
&lt;/h2&gt;

&lt;p&gt;Here is a quick comparison of how parallel tool calling works across the three major LLM APIs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;OpenAI&lt;/th&gt;
&lt;th&gt;Anthropic Claude&lt;/th&gt;
&lt;th&gt;Google Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Parallel by default&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disable parallel&lt;/td&gt;
&lt;td&gt;&lt;code&gt;parallel_tool_calls=false&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;disable_parallel_tool_use&lt;/code&gt; in &lt;code&gt;tool_choice&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Via system prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool call container&lt;/td&gt;
&lt;td&gt;&lt;code&gt;message.tool_calls[]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;content[].tool_use&lt;/code&gt; blocks&lt;/td&gt;
&lt;td&gt;&lt;code&gt;parts[].function_call&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Result format&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;role: "tool"&lt;/code&gt; messages&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;tool_result&lt;/code&gt; blocks&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;function_response&lt;/code&gt; parts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max parallel calls&lt;/td&gt;
&lt;td&gt;No hard limit&lt;/td&gt;
&lt;td&gt;No hard limit&lt;/td&gt;
&lt;td&gt;No hard limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming support&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Models supported&lt;/td&gt;
&lt;td&gt;GPT-4o, GPT-4 Turbo+&lt;/td&gt;
&lt;td&gt;Claude 3.5+, Claude 4&lt;/td&gt;
&lt;td&gt;Gemini 1.5 Pro, 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Parallel tool calling is not just a performance optimization - it is a fundamental capability that makes LLM agents practical for real-world tasks. Without it, a code review agent that needs to read 15 files would take over a minute just on model round trips. With parallel calling, that drops to seconds.&lt;/p&gt;

&lt;p&gt;The implementation patterns are straightforward across all three major APIs. The model decides what can run in parallel, your runtime executes the calls concurrently, and you return all results at once. The most important details are proper error handling (never drop a result) and concurrency management (use semaphores for rate-limited APIs).&lt;/p&gt;

&lt;p&gt;If you are building AI agents for code review, testing, or development workflows, parallel tool calling should be one of the first optimizations you implement. The combination of reduced latency, fewer model round trips, and lower token costs makes it a clear win for any agent that touches more than a couple of tools per task.&lt;/p&gt;

&lt;p&gt;For a deeper look at how AI code review tools use these patterns in practice, see our guides on &lt;a href="https://dev.to/blog/best-ai-code-review-tools"&gt;best AI code review tools&lt;/a&gt; and &lt;a href="https://dev.to/blog/how-to-automate-code-review"&gt;how to automate code review&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is parallel tool calling in LLM agents?
&lt;/h3&gt;

&lt;p&gt;Parallel tool calling is a feature where a large language model requests multiple tool or function calls in a single response instead of making them one at a time. This allows the agent runtime to execute all the calls simultaneously, reducing total latency and making AI agents significantly faster for tasks that involve independent operations like fetching files, querying APIs, or running checks in parallel.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which LLM APIs support parallel tool calling?
&lt;/h3&gt;

&lt;p&gt;OpenAI supports parallel function calling with GPT-4o, GPT-4 Turbo, and newer models. Anthropic Claude supports parallel tool use with Claude 3.5 Sonnet, Claude 3 Opus, and Claude 4 models. Google Gemini supports parallel function calling with Gemini 1.5 Pro and Gemini 2.0 models. All three APIs return multiple tool call requests in a single assistant message when appropriate.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much faster is parallel tool calling compared to sequential?
&lt;/h3&gt;

&lt;p&gt;Parallel tool calling can reduce total execution time by 50-80% depending on the number of independent tool calls and individual call latency. For example, if an agent needs to fetch 5 files that each take 200ms, sequential execution takes about 1 second while parallel execution completes in roughly 200ms. The speedup scales with the number of independent calls.&lt;/p&gt;
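&lt;p&gt;The arithmetic in that answer is easy to verify with a toy simulation. This sketch fakes each tool call with &lt;code&gt;asyncio.sleep&lt;/code&gt; - nothing here touches a real API - and times the two execution strategies side by side.&lt;/p&gt;

```python
import asyncio
import time

async def fetch_file(path):
    await asyncio.sleep(0.2)  # stand-in for a 200ms tool call
    return path

async def sequential(paths):
    # One call at a time: latencies add up.
    return [await fetch_file(p) for p in paths]

async def parallel(paths):
    # All calls at once: total time is roughly the slowest single call.
    return await asyncio.gather(*(fetch_file(p) for p in paths))

paths = [f"file_{i}.py" for i in range(5)]

start = time.perf_counter()
asyncio.run(sequential(paths))
seq = time.perf_counter() - start

start = time.perf_counter()
asyncio.run(parallel(paths))
par = time.perf_counter() - start

print(f"sequential: {seq:.2f}s, parallel: {par:.2f}s")
```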

&lt;h3&gt;
  
  
  Can I disable parallel tool calling in the OpenAI API?
&lt;/h3&gt;

&lt;p&gt;Yes. In the OpenAI API you can set &lt;code&gt;parallel_tool_calls&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt; in your request to force the model to make only one tool call per response. This is useful when tool calls depend on each other or when you need strict ordering of operations. Anthropic and Gemini do not offer an explicit toggle, but you can guide behavior through system prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  How should I handle errors in parallel tool calls?
&lt;/h3&gt;

&lt;p&gt;Use Promise.allSettled in TypeScript or asyncio.gather with return_exceptions=True in Python so that one failing call does not cancel the others. Return the error message as the tool result for the failed call so the LLM can decide how to recover. Never silently drop failed results because the model expects a result for every tool call it requested.&lt;/p&gt;
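&lt;p&gt;That answer condenses to a small Python sketch. &lt;code&gt;run_tool&lt;/code&gt; and its failing path are invented for the demo; the load-bearing parts are &lt;code&gt;return_exceptions=True&lt;/code&gt; and producing one result slot per requested call.&lt;/p&gt;

```python
import asyncio

# Invented demo tool: one argument succeeds, one raises.
async def run_tool(name, arg):
    if arg == "missing.py":
        raise FileNotFoundError(f"{arg} not found")
    return f"{name}({arg}) ok"

async def execute_all(calls):
    # return_exceptions=True keeps every slot: a failing call does not
    # cancel its siblings, and order matches the model's request order.
    outcomes = await asyncio.gather(
        *(run_tool(name, arg) for name, arg in calls),
        return_exceptions=True,
    )
    results = []
    for (name, arg), out in zip(calls, outcomes):
        if isinstance(out, Exception):
            # Surface the error as the tool result so the model can recover.
            results.append({"tool": name, "is_error": True, "content": str(out)})
        else:
            results.append({"tool": name, "is_error": False, "content": out})
    return results

calls = [("read_file", "main.py"), ("read_file", "missing.py")]
results = asyncio.run(execute_all(calls))
for r in results:
    print(r["tool"], "error" if r["is_error"] else "ok", "-", r["content"])
```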

&lt;h3&gt;
  
  
  What is the difference between parallel tool calling and multi-turn tool use?
&lt;/h3&gt;

&lt;p&gt;Parallel tool calling happens within a single turn where the model requests multiple tools at once and they all execute simultaneously. Multi-turn tool use is when the model makes a tool call, receives the result, then makes another tool call in the next turn based on that result. Most real-world agents combine both patterns since some operations depend on previous results while others can run in parallel.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://aicodereview.cc/blog/parallel-tool-calling-llm-agents/" rel="noopener noreferrer"&gt;aicodereview.cc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>MISRA C:2012 Rules with Examples - Complete Guide for Embedded Developers</title>
      <dc:creator>Rahul Singh</dc:creator>
      <pubDate>Sat, 11 Apr 2026 14:00:00 +0000</pubDate>
      <link>https://dev.to/rahulxsingh/misra-c2012-rules-with-examples-complete-guide-for-embedded-developers-3gm5</link>
      <guid>https://dev.to/rahulxsingh/misra-c2012-rules-with-examples-complete-guide-for-embedded-developers-3gm5</guid>
      <description>&lt;p&gt;MISRA C:2012 is the most widely adopted coding standard for safety-critical embedded C development. Originally created for the automotive industry, it is now a baseline requirement across automotive, aerospace, medical devices, and industrial control systems.&lt;/p&gt;

&lt;p&gt;This guide covers the most important MISRA C:2012 rules with practical C code examples - showing both violations and compliant alternatives - so you can understand what the rules enforce and how to write conforming code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is MISRA C?
&lt;/h2&gt;

&lt;p&gt;MISRA C is a set of software development guidelines for the C programming language published by the Motor Industry Software Reliability Association (MISRA). The current edition - MISRA C:2012 - contains 175 guidelines designed to eliminate undefined behavior, implementation-defined behavior, and error-prone coding patterns from C programs.&lt;/p&gt;

&lt;p&gt;The standard exists because C gives programmers enormous freedom, and that freedom is dangerous in systems where software bugs can cause physical harm. A buffer overflow in a web application might leak data. A buffer overflow in a brake controller can kill people.&lt;/p&gt;

&lt;p&gt;MISRA C restricts C to a safer, more predictable subset of the language. It bans constructs that behave differently across compilers, eliminates patterns that commonly lead to bugs, and enforces practices that make code easier to review and verify.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why MISRA C Matters
&lt;/h2&gt;

&lt;p&gt;MISRA C compliance is expected or required in several safety-critical domains:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automotive (ISO 26262):&lt;/strong&gt; Every major automotive OEM and Tier-1 supplier requires MISRA C compliance for embedded ECU software. ISO 26262 functional safety certification references MISRA C as the recommended coding standard for C-based automotive software.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aerospace and Defense (DO-178C):&lt;/strong&gt; Avionics software certified under DO-178C frequently uses MISRA C as the coding standard. The strict traceability and verification requirements of DO-178C align well with MISRA's rule structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Medical Devices (IEC 62304):&lt;/strong&gt; Software in medical devices classified under IEC 62304 benefits from MISRA C compliance to demonstrate safe coding practices during regulatory audits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Industrial Control (IEC 61508):&lt;/strong&gt; Programmable electronic safety systems governed by IEC 61508 use MISRA C to reduce the risk of systematic software failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Railway (EN 50128):&lt;/strong&gt; Railway signaling and control software uses MISRA C as part of the verification process under EN 50128.&lt;/p&gt;

&lt;p&gt;Even outside regulated industries, MISRA C adoption is growing in any embedded project where reliability matters - IoT devices, robotics, and firmware for critical infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  MISRA C:2012 Rule Categories
&lt;/h2&gt;

&lt;p&gt;Every MISRA C:2012 guideline falls into one of three classification levels:&lt;/p&gt;

&lt;h3&gt;
  
  
  Mandatory Rules
&lt;/h3&gt;

&lt;p&gt;These rules cannot be violated under any circumstance. No deviation is permitted. They address the most dangerous coding practices - things like reading a variable before it has been initialized or letting a non-void function reach its end without returning a value.&lt;/p&gt;

&lt;h3&gt;
  
  
  Required Rules
&lt;/h3&gt;

&lt;p&gt;These rules must be followed, but a formal deviation is allowed when compliance is not practical. Each deviation requires documented justification, a risk assessment, and approval from a qualified reviewer. Most MISRA C rules fall into this category.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advisory Rules
&lt;/h3&gt;

&lt;p&gt;These are recommended best practices. Teams can choose not to follow advisory rules without a formal deviation, but they should still document which advisory rules they exclude and why.&lt;/p&gt;

&lt;p&gt;Additionally, MISRA C:2012 distinguishes between &lt;strong&gt;directives&lt;/strong&gt; (general guidance that may require human judgment to verify) and &lt;strong&gt;rules&lt;/strong&gt; (specific requirements that static analysis tools can check automatically).&lt;/p&gt;

&lt;h2&gt;
  
  
  Key MISRA C:2012 Rules with Code Examples
&lt;/h2&gt;

&lt;p&gt;Let's walk through the most important rules with concrete C code showing violations and compliant alternatives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 1.3 (Mandatory) - No Undefined Behavior
&lt;/h3&gt;

&lt;p&gt;There shall be no occurrence of undefined behavior. This is the foundational rule - if your code triggers undefined behavior as defined by the C standard, it violates MISRA C regardless of whether any other specific rule covers the construct.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* VIOLATION - signed integer overflow is undefined behavior */&lt;/span&gt;
&lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="nf"&gt;calculate_total&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* overflow possible if a + b &amp;gt; INT32_MAX */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - check for overflow before the operation */&lt;/span&gt;
&lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="nf"&gt;calculate_total&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INT32_MAX&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;INT32_MAX&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* saturate on overflow */&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INT32_MIN&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;INT32_MIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* saturate on underflow */&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rule 10.3 (Required) - No Implicit Narrowing Conversions
&lt;/h3&gt;

&lt;p&gt;The value of an expression shall not be assigned to an object with a narrower essential type or of a different essential type category.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* VIOLATION - implicit narrowing from int32_t to int16_t */&lt;/span&gt;
&lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;sensor_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_sensor&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;int16_t&lt;/span&gt; &lt;span class="n"&gt;sensor_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sensor_raw&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* data loss if sensor_raw &amp;gt; 32767 */&lt;/span&gt;

&lt;span class="cm"&gt;/* VIOLATION - assigning float to integer */&lt;/span&gt;
&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;temp_display&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* truncates decimal part */&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - explicit cast shows intent */&lt;/span&gt;
&lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;sensor_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_sensor&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;int16_t&lt;/span&gt; &lt;span class="n"&gt;sensor_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int16_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;sensor_raw&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* explicit narrowing */&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - explicit conversion with rounding */&lt;/span&gt;
&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;temp_display&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="cm"&gt;/* round, then cast */&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rule 10.4 (Required) - Operands Must Have the Same Essential Type Category
&lt;/h3&gt;

&lt;p&gt;Both operands of an operator in which the usual arithmetic conversions are performed shall have the same essential type category.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* VIOLATION - mixing signed and unsigned in comparison */&lt;/span&gt;
&lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="cm"&gt;/* undefined: -5 converted to large unsigned value */&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* this branch may NOT execute as expected */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - cast to a common type before comparison */&lt;/span&gt;
&lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="cm"&gt;/* both operands are signed */&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* correct behavior */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rule 11.3 (Required) - No Casts Between Pointer Types
&lt;/h3&gt;

&lt;p&gt;A cast shall not be performed between a pointer to object type and a pointer to a different object type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* VIOLATION - casting between incompatible pointer types */&lt;/span&gt;
&lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mh"&gt;0x01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x02&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x03&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x04&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;word_ptr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* alignment and aliasing issues */&lt;/span&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;word_ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - use memcpy to reinterpret bytes safely */&lt;/span&gt;
&lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mh"&gt;0x01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x02&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x03&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x04&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rule 12.2 (Required) - Shift Operand Range
&lt;/h3&gt;

&lt;p&gt;The right hand operand of a shift operator shall lie in the range zero to one less than the width in bits of the essential type of the left hand operand.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* VIOLATION - shifting by more bits than the type width */&lt;/span&gt;
&lt;span class="kt"&gt;uint16_t&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mh"&gt;0x00FFU&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;uint16_t&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;16U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* undefined: 16-bit type shifted by 16 */&lt;/span&gt;

&lt;span class="cm"&gt;/* VIOLATION - shift amount could be negative */&lt;/span&gt;
&lt;span class="kt"&gt;int8_t&lt;/span&gt; &lt;span class="n"&gt;shift_amount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_shift&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1U&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;shift_amount&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* negative shift is undefined */&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - shift amount validated and within range */&lt;/span&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mh"&gt;0x000000FFU&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;16U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* 32-bit type shifted by 16 is valid */&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - validate shift amount before use */&lt;/span&gt;
&lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="n"&gt;shift_amount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_shift&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shift_amount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;32U&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1U&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;shift_amount&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rule 13.5 (Required) - No Side Effects in Logical Operand Evaluation
&lt;/h3&gt;

&lt;p&gt;The right hand operand of a logical &amp;amp;&amp;amp; or || operator shall not contain persistent side effects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* VIOLATION - function with side effects in short-circuit expression */&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_valid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="cm"&gt;/* process() may not execute */&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;log_success&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - evaluate separately to ensure both execute */&lt;/span&gt;
&lt;span class="n"&gt;bool_t&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;is_valid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;bool_t&lt;/span&gt; &lt;span class="n"&gt;processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;valid&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;processed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;log_success&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This rule exists because the right operand of &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt; or &lt;code&gt;||&lt;/code&gt; might not be evaluated due to short-circuit evaluation. If that operand has side effects (modifying global state, writing to hardware registers, performing I/O), the program behavior depends on evaluation order - which is dangerous in safety-critical code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 14.2 (Required) - For Loop Counter Restrictions
&lt;/h3&gt;

&lt;p&gt;A for loop shall be well-formed. The loop counter must not be modified in the loop body, the loop counter must be tested against a bound in the controlling expression, and the loop counter must be incremented or decremented in the third expression.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* VIOLATION - loop counter modified inside the body */&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;should_skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;2U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* modifying loop counter in body */&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* VIOLATION - missing increment in for expression */&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="cm"&gt;/* no third expression */&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* increment in body instead of for expression */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - well-formed for loop */&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;should_skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rule 14.4 (Required) - Boolean Controlling Expressions
&lt;/h3&gt;

&lt;p&gt;The controlling expression of an if statement and the controlling expression of an iteration statement shall have essentially Boolean type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* VIOLATION - integer used as boolean */&lt;/span&gt;
&lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;error_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;check_status&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="cm"&gt;/* non-boolean controlling expression */&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;handle_error&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* VIOLATION - pointer used as boolean */&lt;/span&gt;
&lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_name&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="cm"&gt;/* non-boolean controlling expression */&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - explicit comparison to produce boolean */&lt;/span&gt;
&lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;error_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;check_status&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;handle_error&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - explicit null check */&lt;/span&gt;
&lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_name&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rule 15.5 (Advisory) - Single Point of Exit
&lt;/h3&gt;

&lt;p&gt;A function should have a single point of exit at the end of the function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* VIOLATION - multiple return statements */&lt;/span&gt;
&lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="nf"&gt;find_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* early return */&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - single return at end */&lt;/span&gt;
&lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="nf"&gt;find_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* Note: Rule 15.4 restricts break usage too */&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rule 17.7 (Required) - Return Values Must Be Used
&lt;/h3&gt;

&lt;p&gt;The value returned by a function having non-void return type shall be used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* VIOLATION - ignoring the return value of fclose */&lt;/span&gt;
&lt;span class="kt"&gt;FILE&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;fp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"config.dat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cm"&gt;/* ... read data ... */&lt;/span&gt;
&lt;span class="n"&gt;fclose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="cm"&gt;/* return value discarded */&lt;/span&gt;

&lt;span class="cm"&gt;/* VIOLATION - ignoring return value of custom function */&lt;/span&gt;
&lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="nf"&gt;validate_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;validate_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sensor_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_len&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="cm"&gt;/* return value ignored */&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - capture and use return values */&lt;/span&gt;
&lt;span class="kt"&gt;FILE&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;fp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"config.dat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cm"&gt;/* ... read data ... */&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;close_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fclose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;close_result&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;report_file_error&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - explicitly cast to void if intentionally ignoring */&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;validate_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sensor_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_len&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="cm"&gt;/* deliberate discard */&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rule 20.4 (Required) - No Macros Named After Keywords
&lt;/h3&gt;

&lt;p&gt;A macro shall not be defined with the same name as a keyword.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* VIOLATION - redefining a keyword as a macro */&lt;/span&gt;
&lt;span class="cp"&gt;#define const        &lt;/span&gt;&lt;span class="cm"&gt;/* removes const qualification throughout */&lt;/span&gt;&lt;span class="cp"&gt;
#define inline       &lt;/span&gt;&lt;span class="cm"&gt;/* removes inline hint throughout */&lt;/span&gt;&lt;span class="cp"&gt;
#define true 1       &lt;/span&gt;&lt;span class="cm"&gt;/* 'true' is a reserved stdbool.h macro (see also Rule 21.1) */&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="cm"&gt;/* COMPLIANT - use unique macro names that do not shadow keywords */&lt;/span&gt;
&lt;span class="cp"&gt;#define APP_CONST_QUALIFIER const
#define APP_TRUE ((bool)1)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Directive 4.1 (Required) - Run-time Failures Shall Be Minimized
&lt;/h3&gt;

&lt;p&gt;This directive requires that code is written to minimize the possibility of run-time errors. It covers array bounds violations, division by zero, null pointer dereferences, and arithmetic overflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* VIOLATION - no bounds check before array access */&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;store_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* what if index &amp;gt;= buffer size? */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - validate index before use */&lt;/span&gt;
&lt;span class="cp"&gt;#define BUFFER_SIZE 256U
&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;store_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;BUFFER_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;report_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ERROR_OUT_OF_BOUNDS&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Directive 4.12 (Required) - No Dynamic Memory Allocation
&lt;/h3&gt;

&lt;p&gt;Dynamic memory allocation shall not be used. This means &lt;code&gt;malloc&lt;/code&gt;, &lt;code&gt;calloc&lt;/code&gt;, &lt;code&gt;realloc&lt;/code&gt;, and &lt;code&gt;free&lt;/code&gt; are all prohibited.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* VIOLATION - using malloc in safety-critical code */&lt;/span&gt;
&lt;span class="n"&gt;sensor_data_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sensor_data_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sensor_data_t&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* process readings */&lt;/span&gt;
    &lt;span class="n"&gt;free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - use statically allocated buffers */&lt;/span&gt;
&lt;span class="cp"&gt;#define MAX_READINGS 128U
&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;sensor_data_t&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MAX_READINGS&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;MAX_READINGS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* process readings using static buffer */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;report_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ERROR_BUFFER_TOO_SMALL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dynamic memory introduces unpredictable behavior - fragmentation, allocation failures, and non-deterministic timing - that is unacceptable in safety-critical systems where every execution path must be analyzable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 21.3 (Required) - No stdlib.h Memory Functions
&lt;/h3&gt;

&lt;p&gt;The memory allocation and deallocation functions of &lt;code&gt;stdlib.h&lt;/code&gt; shall not be used. This rule reinforces Directive 4.12 with a specific, tool-checkable requirement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 21.6 (Required) - No Standard I/O
&lt;/h3&gt;

&lt;p&gt;The Standard Library input/output functions shall not be used. Functions like &lt;code&gt;printf&lt;/code&gt;, &lt;code&gt;scanf&lt;/code&gt;, &lt;code&gt;fopen&lt;/code&gt;, and &lt;code&gt;fclose&lt;/code&gt; from &lt;code&gt;stdio.h&lt;/code&gt; are banned because many of their behaviors are unspecified, undefined, or implementation-defined.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* VIOLATION - using printf in embedded code */&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;log_temperature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Temperature: %.1f&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="cm"&gt;/* banned */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* COMPLIANT - use a custom logging interface */&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;log_temperature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;temp_int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;temp&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;uart_write_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Temperature: "&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;uart_write_int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp_int&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;uart_write_char&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'\n'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  MISRA C:2012 Compliance Tools
&lt;/h2&gt;

&lt;p&gt;Checking MISRA C compliance manually is impractical for any real project. Here are the leading tools that automate MISRA checking:&lt;/p&gt;

&lt;h3&gt;
  
  
  Polyspace Bug Finder and Code Prover (MathWorks)
&lt;/h3&gt;

&lt;p&gt;Polyspace is the gold standard for MISRA compliance verification in automotive. Bug Finder performs fast static analysis to detect MISRA violations, while Code Prover uses abstract interpretation to mathematically prove the absence of certain runtime errors. Polyspace provides full MISRA C:2012 rule coverage and generates compliance reports accepted by certification bodies. It integrates with MATLAB/Simulink workflows, making it dominant in automotive where model-based development is common.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Automotive teams using MATLAB/Simulink, projects requiring formal verification.&lt;/p&gt;

&lt;h3&gt;
  
  
  PC-lint Plus (Gimpel Software)
&lt;/h3&gt;

&lt;p&gt;PC-lint Plus is a lightweight, fast static analysis tool with comprehensive MISRA C checking. It runs locally and integrates into any build system. PC-lint has been a staple in embedded development for decades and covers nearly all decidable MISRA C:2012 rules. It is relatively affordable compared to other commercial tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Small to mid-size teams wanting affordable MISRA checking without heavyweight infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Helix QAC (Perforce)
&lt;/h3&gt;

&lt;p&gt;Helix QAC (formerly QA-C from PRQA) was one of the first tools specifically designed for MISRA compliance. It provides full MISRA C:2012 coverage, generates compliance matrices for audits, and includes a message browser for triaging findings. Helix QAC is widely used in European automotive and aerospace companies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams needing certified MISRA compliance checking with audit-ready reporting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coverity (Synopsys)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/coverity"&gt;Coverity&lt;/a&gt; is a heavyweight static analysis platform that includes MISRA C:2012 checking alongside its broader security and defect analysis capabilities. Coverity's interprocedural analysis engine can detect complex violations that span multiple functions and files. It scales to very large codebases and integrates with enterprise CI/CD pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Large organizations that need MISRA compliance as part of a broader code quality and security program.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parasoft C/C++test
&lt;/h3&gt;

&lt;p&gt;Parasoft provides MISRA C checking integrated with unit testing, code coverage, and requirements traceability. This all-in-one approach is attractive for teams that need to demonstrate compliance across the full V-model development lifecycle. Parasoft generates compliance reports mapped to specific safety standards like ISO 26262 and IEC 62304.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams following V-model development with integrated testing and compliance requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cppcheck (Open Source)
&lt;/h3&gt;

&lt;p&gt;Cppcheck is a free, open-source static analysis tool that includes a MISRA C addon. The addon covers a subset of MISRA C:2012 - roughly 100 of the 143 rules (the 16 directives are largely out of its scope). While not comprehensive enough for formal certification, Cppcheck is useful for teams starting their MISRA journey or as a pre-check before running commercial tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Open-source projects, teams evaluating MISRA adoption, or as a supplementary check.&lt;/p&gt;

&lt;h3&gt;
  
  
  LDRA TBvision
&lt;/h3&gt;

&lt;p&gt;LDRA provides end-to-end verification tooling for safety-critical software. TBvision covers MISRA C checking, structural coverage analysis, requirements tracing, and test management. LDRA has deep roots in aerospace (DO-178C) and is qualified as a verification tool under multiple safety standards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Aerospace and defense teams requiring DO-178C tool qualification.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrating MISRA Checking into CI/CD
&lt;/h2&gt;

&lt;p&gt;Automated MISRA checking in your CI/CD pipeline catches violations before they reach the main branch. Here is how to set it up effectively:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Establish Your Rule Profile
&lt;/h3&gt;

&lt;p&gt;Not every project needs to comply with all 159 guidelines (143 rules and 16 directives). Create a configuration file that specifies which rules your project enforces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* misra_config.txt - project MISRA rule selection */
/* All mandatory rules are always enforced */
/* Required rules - enforced with deviation process */
Rule 10.3: enabled
Rule 10.4: enabled
Rule 11.3: enabled
Rule 12.2: enabled
Rule 14.2: enabled
Rule 14.4: enabled
Rule 17.7: enabled

/* Advisory rules - selected subset */
Rule 15.5: enabled
Rule 15.6: enabled
Rule 15.7: enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
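&lt;p&gt;How this profile reaches the tool depends on the tool. With Cppcheck's MISRA addon, for example, each rule you have chosen not to enforce becomes a &lt;code&gt;misra-c2012-X.Y&lt;/code&gt; entry in a suppressions list (the file name here is illustrative):&lt;/p&gt;

```plaintext
# misra_suppressions.txt - advisory rules this project does not enforce
# passed via: cppcheck --addon=misra --suppressions-list=misra_suppressions.txt src/
misra-c2012-2.4
misra-c2012-2.5
```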



&lt;h3&gt;
  
  
  Step 2: Add MISRA Checking to Your Build Pipeline
&lt;/h3&gt;

&lt;p&gt;Here is a GitHub Actions example using Cppcheck's MISRA addon as a starting point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/misra-check.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MISRA C Compliance Check&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;**.c'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;**.h'&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;misra-check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Cppcheck&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sudo apt-get update&lt;/span&gt;
          &lt;span class="s"&gt;sudo apt-get install -y cppcheck&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run MISRA C check&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;cppcheck \&lt;/span&gt;
            &lt;span class="s"&gt;--addon=misra \&lt;/span&gt;
            &lt;span class="s"&gt;--suppress=missingIncludeSystem \&lt;/span&gt;
            &lt;span class="s"&gt;--error-exitcode=1 \&lt;/span&gt;
            &lt;span class="s"&gt;--inline-suppr \&lt;/span&gt;
            &lt;span class="s"&gt;-I include/ \&lt;/span&gt;
            &lt;span class="s"&gt;src/&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload results&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;failure()&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/upload-artifact@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;misra-violations&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;misra-report.txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For commercial tools like Polyspace or Helix QAC, the CI integration typically involves a containerized version of the tool running in your pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Polyspace Bug Finder&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;polyspace-bug-finder \&lt;/span&gt;
            &lt;span class="s"&gt;-sources src/ \&lt;/span&gt;
            &lt;span class="s"&gt;-I include/ \&lt;/span&gt;
            &lt;span class="s"&gt;-misra3 mandatory-required \&lt;/span&gt;
            &lt;span class="s"&gt;-results-dir ./polyspace-results&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check for MISRA violations&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;polyspace-report-generator \&lt;/span&gt;
            &lt;span class="s"&gt;-results-dir ./polyspace-results \&lt;/span&gt;
            &lt;span class="s"&gt;-output-format csv \&lt;/span&gt;
            &lt;span class="s"&gt;-report misra-compliance.csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Gate Merges on MISRA Compliance
&lt;/h3&gt;

&lt;p&gt;Configure your CI to block pull requests that introduce new MISRA violations. Use baseline files to track existing violations separately from new ones - this lets you adopt MISRA incrementally on legacy codebases without requiring a full rewrite before you can enforce the rules.&lt;/p&gt;
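
&lt;p&gt;A minimal sketch of that gating logic, assuming a plain-text report of one &lt;code&gt;file:line:rule&lt;/code&gt; finding per line (an illustrative format, not any tool's real output - adapt the parsing to your analyzer):&lt;/p&gt;

```python
# Hypothetical baseline gate: fail CI only on findings absent from the
# baseline file. The "file:line:rule" record format is illustrative.

def new_violations(baseline_lines, current_lines):
    """Return findings in the current report that are not in the baseline."""
    baseline = set(line.strip() for line in baseline_lines if line.strip())
    return [line.strip() for line in current_lines
            if line.strip() and line.strip() not in baseline]

baseline = ["src/gpio.c:42:misra-c2012-11.3"]   # pre-existing, tracked debt
current = ["src/gpio.c:42:misra-c2012-11.3",
           "src/uart.c:10:misra-c2012-15.5"]    # one new finding

introduced = new_violations(baseline, current)
# Only the uart.c finding is new, so the build fails on it while the
# legacy gpio.c violation stays parked in the baseline.
```

&lt;p&gt;Note that keying on line numbers makes a baseline brittle as files shift; in practice many teams key on file and rule only, or use their analyzer's built-in suppression baseline instead.&lt;/p&gt;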

&lt;h3&gt;
  
  
  Step 4: Track Deviations in Version Control
&lt;/h3&gt;

&lt;p&gt;Store deviation records alongside the code they apply to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* DEVIATION: Rule 11.3 - Required
 * Justification: Hardware register access requires pointer cast
 *                to volatile uint32_t at fixed memory address.
 * Risk assessment: Address is validated at system startup.
 * Approved by: Safety team review 2026-01-15
 */&lt;/span&gt;
&lt;span class="k"&gt;volatile&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gpio_port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;volatile&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="mh"&gt;0x40020000U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* cppcheck-suppress misra-c2012-11.3 */&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common MISRA C Adoption Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Trying to comply with every rule on day one.&lt;/strong&gt; Start with mandatory and required rules. Add advisory rules incrementally as your team builds familiarity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring the deviation process.&lt;/strong&gt; MISRA was designed with deviations in mind. Some rules genuinely cannot be followed in every situation - particularly in low-level hardware interface code. A well-documented deviation is better than a code workaround that introduces its own risks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using only one tool.&lt;/strong&gt; No single static analysis tool catches every MISRA violation. Many teams use two tools - for example, Helix QAC for primary compliance checking and Cppcheck as a fast pre-commit check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not training the team.&lt;/strong&gt; MISRA compliance is not just a tool configuration. Developers need to understand why each rule exists so they write naturally compliant code instead of fighting the tool output after the fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treating MISRA as a checkbox.&lt;/strong&gt; The goal is safer code, not a clean report. Suppressing findings without understanding them defeats the purpose of the standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  MISRA C:2012 Amendment 4 and Beyond
&lt;/h2&gt;

&lt;p&gt;MISRA C:2012 has been updated through several amendments and technical corrigenda:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amendment 1 (2016):&lt;/strong&gt; Added security-focused guidelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amendment 2 (2020):&lt;/strong&gt; Added guidelines for C11 and C18 language features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amendment 3 (2022):&lt;/strong&gt; Further C11/C18 guidelines with a security focus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amendment 4 (2023):&lt;/strong&gt; Additional rules covering concurrency and multi-threaded code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The MISRA committee also published MISRA C:2012 Addendum 3, which maps MISRA C rules to CERT C rules, helping teams that need to comply with both standards simultaneously.&lt;/p&gt;

&lt;p&gt;When configuring your tools, make sure you are checking against the latest amendment level your project requires. Tool vendors typically release updates to match new amendments within a few months of publication.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;MISRA C:2012 is not just a set of arbitrary restrictions - each rule addresses a real class of bugs that has caused failures in safety-critical systems. The rules around undefined behavior, type safety, pointer usage, and control flow exist because these are the exact areas where C programs fail in dangerous ways.&lt;/p&gt;

&lt;p&gt;If you are starting a new embedded C project in a regulated industry, adopt MISRA C:2012 from the beginning. If you are retrofitting MISRA compliance onto an existing codebase, start with mandatory rules, use baseline files to manage existing violations, and expand coverage incrementally.&lt;/p&gt;

&lt;p&gt;The investment in MISRA compliance pays off not just in passing audits, but in genuinely more reliable software. Code that follows MISRA C is easier to review, easier to test, and far less likely to contain the subtle bugs that only manifest in production under edge-case conditions.&lt;/p&gt;

&lt;p&gt;For teams evaluating &lt;a href="https://dev.to/blog/best-sast-tools"&gt;static analysis tools&lt;/a&gt; that include MISRA checking, the choice between tools often comes down to which safety standard you are targeting, whether you need formal tool qualification, and how the tool fits into your existing build and CI/CD infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between MISRA C:2004 and MISRA C:2012?
&lt;/h3&gt;

&lt;p&gt;MISRA C:2012 is the successor to MISRA C:2004 and includes significant improvements. It expanded from 141 rules to 175 guidelines (rules plus directives), added a new 'mandatory' classification level above 'required,' improved the rationale and examples for each rule, and introduced the concept of decidable vs undecidable rules. MISRA C:2012 also added support for C99 language features while retaining C90 compatibility. Teams starting new projects should always use MISRA C:2012 rather than the older 2004 edition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is MISRA C compliance legally required?
&lt;/h3&gt;

&lt;p&gt;MISRA C compliance is not a law, but it is effectively required in certain industries through functional safety standards. ISO 26262 for automotive software references MISRA C as a recommended coding standard. IEC 62304 for medical device software and DO-178C for avionics software also reference or strongly recommend MISRA-compliant coding practices. In practice, if your product must meet these safety standards, auditors and certification bodies expect MISRA C compliance as evidence of safe coding practices.&lt;/p&gt;

&lt;h3&gt;
  
  
  How many rules are in MISRA C:2012?
&lt;/h3&gt;

&lt;p&gt;MISRA C:2012 contains 175 guidelines split into two types - 16 directives and 159 rules. Directives provide general guidance that may not be fully checkable by static analysis tools, while rules are specific and verifiable. These are further categorized as mandatory (cannot be deviated under any circumstance), required (deviations allowed with documented justification), and advisory (recommended best practices that teams may choose not to follow).&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use dynamic memory allocation with MISRA C?
&lt;/h3&gt;

&lt;p&gt;MISRA C:2012 Directive 4.12 states that dynamic memory allocation (malloc, calloc, realloc, free) shall not be used, and it is classified as required. The concern is that heap allocation introduces risks of memory leaks, fragmentation, and non-deterministic behavior - all dangerous in safety-critical systems. If your project must use dynamic allocation, you need a formal deviation with documented justification, a custom allocator with bounded behavior, or allocation only during initialization with no runtime freeing.&lt;/p&gt;

&lt;h3&gt;
  
  
  What tools can check MISRA C:2012 compliance?
&lt;/h3&gt;

&lt;p&gt;Several commercial and open-source tools check MISRA C:2012 compliance. Leading commercial tools include Polyspace Bug Finder and Code Prover (MathWorks), PC-lint Plus (Gimpel Software), Helix QAC (Perforce), Coverity (Synopsys), Parasoft C/C++test, and LDRA TBvision. For open-source options, Cppcheck provides partial MISRA checking with its addon, and some teams use Clang-Tidy with custom configurations. Commercial tools generally provide higher rule coverage and certified compliance checking required for safety certification audits.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I handle MISRA C deviations?
&lt;/h3&gt;

&lt;p&gt;MISRA C:2012 provides a formal deviation process for required and advisory rules that cannot be followed in specific situations. Each deviation must include the rule being deviated, the justification explaining why compliance is not possible or practical, a risk assessment showing the deviation does not compromise safety, alternative measures taken to mitigate the risk, and approval from a qualified reviewer. Deviations must be documented and tracked. Mandatory rules cannot be deviated under any circumstance.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://aicodereview.cc/blog/misra-c-2012-rules-examples/" rel="noopener noreferrer"&gt;aicodereview.cc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Input vs Output vs Reasoning Tokens Cost - LLM Pricing Explained</title>
      <dc:creator>Rahul Singh</dc:creator>
      <pubDate>Sat, 11 Apr 2026 12:00:00 +0000</pubDate>
      <link>https://dev.to/rahulxsingh/input-vs-output-vs-reasoning-tokens-cost-llm-pricing-explained-hi8</link>
      <guid>https://dev.to/rahulxsingh/input-vs-output-vs-reasoning-tokens-cost-llm-pricing-explained-hi8</guid>
      <description>&lt;p&gt;If you have ever checked the pricing page for OpenAI, Anthropic, or Google and wondered why there are three different token prices listed, you are not alone. The distinction between input tokens, output tokens, and reasoning tokens is one of the most misunderstood aspects of LLM pricing - and getting it wrong can mean overspending by 5-10x on your AI workloads.&lt;/p&gt;

&lt;p&gt;This guide breaks down exactly what each token type is, why they cost different amounts, and how to optimize your spending - whether you are building an AI application, running code reviews, or just trying to understand your API bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are Tokens in LLMs?
&lt;/h2&gt;

&lt;p&gt;Before diving into pricing, let's clarify what tokens actually are. A token is the fundamental unit of text that large language models process. It is not a word, not a character, but something in between.&lt;/p&gt;

&lt;p&gt;On average, one token equals roughly 4 characters or 0.75 words in English. Common short words are often single tokens, while a longer word like "understanding" may be split into two or more tokens depending on the tokenizer. A line of Python code like &lt;code&gt;def calculate_total(items):&lt;/code&gt; is about 8 tokens.&lt;/p&gt;
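
&lt;p&gt;The 4-characters-per-token figure gives a quick way to ballpark prompt sizes before calling an API. A sketch of that heuristic (exact counts require the provider's own tokenizer, such as OpenAI's tiktoken library):&lt;/p&gt;

```python
# Rough token estimate from character count, using the ~4 chars/token
# heuristic above. Ballpark only - real tokenizers vary by language and
# by how code-like the text is.

def estimate_tokens(text, chars_per_token=4):
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("def calculate_total(items):"))  # 27 chars, about 7
```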

&lt;p&gt;Every API call involves two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Input processing&lt;/strong&gt; - the model reads your prompt (input tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output generation&lt;/strong&gt; - the model writes its response (output tokens)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Some newer models add a third phase:&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt; - the model thinks through the problem before responding (reasoning tokens)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each phase has different computational costs, which is why providers charge differently for each.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input Tokens - What You Send
&lt;/h2&gt;

&lt;p&gt;Input tokens represent everything you send to the model in your API request. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompts&lt;/strong&gt; - instructions that define the model's behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User messages&lt;/strong&gt; - your actual question or request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context&lt;/strong&gt; - code snippets, documents, conversation history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Few-shot examples&lt;/strong&gt; - example inputs and outputs you provide&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For AI code review, input tokens typically make up the bulk of your usage. When a tool like &lt;a href="https://dev.to/tools/coderabbit"&gt;CodeRabbit&lt;/a&gt; or &lt;a href="https://dev.to/tools/codeant-ai"&gt;CodeAnt AI&lt;/a&gt; reviews a pull request, it sends the diff, surrounding code context, repository rules, and review instructions as input. A single PR review can easily consume 10,000-50,000 input tokens depending on the size of the change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Input Tokens Are the Cheapest
&lt;/h3&gt;

&lt;p&gt;Input tokens cost less because the model processes them in parallel using a single forward pass through the neural network. All input tokens are read simultaneously, making this phase computationally efficient. The GPU can process thousands of input tokens in roughly the same time it takes to process a few hundred.&lt;/p&gt;

&lt;h2&gt;
  
  
  Output Tokens - What You Receive
&lt;/h2&gt;

&lt;p&gt;Output tokens are the tokens in the model's response. Every word, code snippet, and explanation the model generates counts as output tokens.&lt;/p&gt;

&lt;p&gt;Output tokens are consistently more expensive than input tokens across every major provider. The ratio varies, but output tokens typically cost &lt;strong&gt;4-5x more&lt;/strong&gt; than input tokens, and as much as 8x with some providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Output Tokens Cost More
&lt;/h3&gt;

&lt;p&gt;The reason is fundamental to how LLMs work. During output generation, the model must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Predict one token at a time (autoregressive generation)&lt;/li&gt;
&lt;li&gt;Run a full forward pass through the entire network for each token&lt;/li&gt;
&lt;li&gt;Maintain the full attention context from all previous tokens&lt;/li&gt;
&lt;li&gt;Store and update KV (key-value) cache for each new token&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This sequential process means generating 1,000 output tokens requires roughly 1,000 separate forward passes, while reading 1,000 input tokens requires just one. The GPU utilization during output generation is also less efficient because it processes one token per step instead of batching thousands.&lt;/p&gt;
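
&lt;p&gt;The asymmetry can be made concrete with a toy counter: treat each forward pass as one function call and compare reading a prompt with generating a response of the same length. The &lt;code&gt;forward&lt;/code&gt; stub below is purely illustrative, not a real model:&lt;/p&gt;

```python
# Toy model of why output costs more: one forward pass reads the whole
# prompt, but autoregressive generation runs one pass per new token.

calls = {"n": 0}

def forward(tokens):
    """Stand-in for a model forward pass; just counts invocations."""
    calls["n"] += 1
    return 0  # stand-in for the next token id

prompt = list(range(1000))  # 1,000 input tokens: read in a single pass
forward(prompt)

generated = []
for _ in range(1000):       # 1,000 output tokens: one pass each
    generated.append(forward(prompt + generated))

# calls["n"] ends at 1,001 passes - a thousand of them spent on generation.
```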

&lt;p&gt;This is why a verbose model response is not just annoying - it is literally expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reasoning Tokens - The Hidden Cost
&lt;/h2&gt;

&lt;p&gt;Reasoning tokens are the newest and most misunderstood token type. Introduced with OpenAI's o1 model and Anthropic's extended thinking feature, reasoning tokens represent the model's internal chain-of-thought process.&lt;/p&gt;

&lt;p&gt;When you ask a reasoning model to solve a complex problem, it does not jump straight to the answer. Instead, it generates an internal monologue - breaking the problem into steps, considering approaches, checking its work, and then producing the final response.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Reasoning Tokens Work
&lt;/h3&gt;

&lt;p&gt;Here is what happens when you send a request to a reasoning model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The model reads your input tokens (same as any model)&lt;/li&gt;
&lt;li&gt;The model generates &lt;strong&gt;reasoning tokens&lt;/strong&gt; - internal thinking that you do not see&lt;/li&gt;
&lt;li&gt;The model generates &lt;strong&gt;output tokens&lt;/strong&gt; - the visible response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The critical detail: &lt;strong&gt;reasoning tokens are billed at the output token rate&lt;/strong&gt; because they require the same expensive sequential generation process. But they are not visible in the API response.&lt;/p&gt;

&lt;p&gt;This means a request that returns a 500-token response might actually consume 3,000 or more total output tokens - 2,500 for reasoning and 500 for the visible answer.&lt;/p&gt;
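
&lt;p&gt;You can see this accounting in the API usage metadata. The sketch below follows the shape of OpenAI's &lt;code&gt;usage&lt;/code&gt; object, where reasoning tokens appear under &lt;code&gt;completion_tokens_details&lt;/code&gt;; check your SDK's actual schema before relying on these field names:&lt;/p&gt;

```python
# Illustrative usage payload for a reasoning-model request. The hidden
# reasoning tokens are included in completion_tokens and billed at the
# output rate, even though they never appear in the response text.

usage = {
    "prompt_tokens": 10_000,
    "completion_tokens": 3_000,  # includes hidden reasoning
    "completion_tokens_details": {"reasoning_tokens": 2_500},
}

reasoning = usage["completion_tokens_details"]["reasoning_tokens"]
visible = usage["completion_tokens"] - reasoning
# 2,500 of the 3,000 billed completion tokens were never shown to the user.
```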

&lt;h3&gt;
  
  
  Models That Use Reasoning Tokens
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Reasoning Type&lt;/th&gt;
&lt;th&gt;Reasoning Visible?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;o1&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Built-in chain-of-thought&lt;/td&gt;
&lt;td&gt;No (summary only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;o3&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Built-in chain-of-thought&lt;/td&gt;
&lt;td&gt;No (summary only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;o4-mini&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Built-in chain-of-thought&lt;/td&gt;
&lt;td&gt;No (summary only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.5+&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Extended thinking&lt;/td&gt;
&lt;td&gt;Yes (thinking blocks)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.5+&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Extended thinking&lt;/td&gt;
&lt;td&gt;Yes (thinking blocks)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;Thinking mode&lt;/td&gt;
&lt;td&gt;Yes (thought summaries)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One important difference: Anthropic's extended thinking tokens are visible to the developer (returned as &lt;code&gt;thinking&lt;/code&gt; blocks), while OpenAI's reasoning tokens are hidden. Both are billed at output rates.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM Token Pricing Comparison (March 2026)
&lt;/h2&gt;

&lt;p&gt;Here is the current pricing for major LLM providers. All prices are per million tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standard Models (No Reasoning)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input (per 1M)&lt;/th&gt;
&lt;th&gt;Output (per 1M)&lt;/th&gt;
&lt;th&gt;Output/Input Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;td&gt;4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$8.00&lt;/td&gt;
&lt;td&gt;4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;$0.40&lt;/td&gt;
&lt;td&gt;$1.60&lt;/td&gt;
&lt;td&gt;4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.5&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.5&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;8x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;8.3x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;td&gt;$0.42&lt;/td&gt;
&lt;td&gt;1.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Reasoning Models
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input (per 1M)&lt;/th&gt;
&lt;th&gt;Output (per 1M)&lt;/th&gt;
&lt;th&gt;Reasoning Rate&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;o1&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;$60.00&lt;/td&gt;
&lt;td&gt;Same as output&lt;/td&gt;
&lt;td&gt;Most expensive reasoning model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;o3&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$8.00&lt;/td&gt;
&lt;td&gt;Same as output&lt;/td&gt;
&lt;td&gt;Best price-to-reasoning ratio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;o4-mini&lt;/td&gt;
&lt;td&gt;$1.10&lt;/td&gt;
&lt;td&gt;$4.40&lt;/td&gt;
&lt;td&gt;Same as output&lt;/td&gt;
&lt;td&gt;Budget reasoning option&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet (thinking)&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;Same as output&lt;/td&gt;
&lt;td&gt;Extended thinking enabled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus (thinking)&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;Same as output&lt;/td&gt;
&lt;td&gt;Extended thinking enabled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro (thinking)&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;Same as output&lt;/td&gt;
&lt;td&gt;Thinking mode enabled&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice that reasoning models do not have a separate per-token rate for reasoning. Instead, reasoning tokens are simply billed as output tokens. The cost impact comes from the volume - a reasoning model might generate 5-10x more total output tokens than a standard model for the same request.&lt;/p&gt;
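
&lt;p&gt;Folding the rates above into a small helper makes standard and reasoning requests easy to compare side by side. This is a sketch using two entries from the tables, with reasoning tokens billed at the output rate as described:&lt;/p&gt;

```python
# Per-request cost helper using per-1M-token rates from the tables above.
# Reasoning tokens, when present, are simply added to the output count.

PRICES = {  # model: (input, output) in USD per 1M tokens
    "gpt-4o": (2.50, 10.00),
    "o3":     (2.00, 8.00),
}

def request_cost(model, input_tokens, output_tokens, reasoning_tokens=0):
    inp, out = PRICES[model]
    return (input_tokens * inp
            + (output_tokens + reasoning_tokens) * out) / 1e6

standard  = request_cost("gpt-4o", 10_000, 1_500)        # $0.040
reasoning = request_cost("o3", 10_000, 2_000, 6_000)     # $0.084
```

&lt;p&gt;These two calls reproduce the worked examples in the next section: same input size, but the reasoning request costs more than twice as much.&lt;/p&gt;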

&lt;h2&gt;
  
  
  Real-World Cost Calculations
&lt;/h2&gt;

&lt;p&gt;Let's work through some practical examples to show how token types affect your actual spending.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 1 - Simple Code Review with GPT-4o
&lt;/h3&gt;

&lt;p&gt;You send a 200-line diff for review with a system prompt and code context.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;td&gt;2,000&lt;/td&gt;
&lt;td&gt;$2.50/1M input&lt;/td&gt;
&lt;td&gt;$0.005&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code diff + context&lt;/td&gt;
&lt;td&gt;8,000&lt;/td&gt;
&lt;td&gt;$2.50/1M input&lt;/td&gt;
&lt;td&gt;$0.020&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Review response&lt;/td&gt;
&lt;td&gt;1,500&lt;/td&gt;
&lt;td&gt;$10.00/1M output&lt;/td&gt;
&lt;td&gt;$0.015&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11,500&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.040&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At this rate, reviewing 100 PRs per month costs about &lt;strong&gt;$4.00&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 2 - Deep Security Review with o3
&lt;/h3&gt;

&lt;p&gt;Same diff, but you want the model to reason through potential security vulnerabilities.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;td&gt;2,000&lt;/td&gt;
&lt;td&gt;$2.00/1M input&lt;/td&gt;
&lt;td&gt;$0.004&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code diff + context&lt;/td&gt;
&lt;td&gt;8,000&lt;/td&gt;
&lt;td&gt;$2.00/1M input&lt;/td&gt;
&lt;td&gt;$0.016&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning tokens (hidden)&lt;/td&gt;
&lt;td&gt;6,000&lt;/td&gt;
&lt;td&gt;$8.00/1M output&lt;/td&gt;
&lt;td&gt;$0.048&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visible response&lt;/td&gt;
&lt;td&gt;2,000&lt;/td&gt;
&lt;td&gt;$8.00/1M output&lt;/td&gt;
&lt;td&gt;$0.016&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18,000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.084&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The reasoning tokens more than doubled the cost compared to a standard model, even though the visible output was similar in length.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 3 - High-Volume Review with GPT-4o-mini
&lt;/h3&gt;

&lt;p&gt;For a team doing 500 PR reviews per month with lighter analysis.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Tokens per Review&lt;/th&gt;
&lt;th&gt;Monthly Total&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input tokens&lt;/td&gt;
&lt;td&gt;8,000&lt;/td&gt;
&lt;td&gt;4,000,000&lt;/td&gt;
&lt;td&gt;$0.15/1M&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output tokens&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;500,000&lt;/td&gt;
&lt;td&gt;$0.60/1M&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.90/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is less than a dollar per month for 500 code reviews. This is why smaller models are so attractive for high-volume, lower-complexity tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Output-to-Input Ratio Matters
&lt;/h2&gt;

&lt;p&gt;The output-to-input cost ratio varies significantly across providers, and understanding this ratio is crucial for cost optimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google's Gemini models have the highest ratio&lt;/strong&gt; at 8x, meaning output tokens cost eight times more than input tokens. If your use case is output-heavy (generating long code, documentation, or detailed reviews), Gemini becomes comparatively more expensive on the output side despite having competitive input prices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek has the lowest ratio&lt;/strong&gt; at just 1.5x, making it the most balanced option for output-heavy workloads. However, DeepSeek's overall quality for code review tasks may not match GPT-4o or Claude for certain languages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI and Anthropic sit at 4-5x&lt;/strong&gt;, which is the most common range across the industry.&lt;/p&gt;

&lt;p&gt;The practical takeaway: if you can keep your outputs short and your inputs lean, you will save the most money with high-ratio providers. If your task requires long outputs, a lower-ratio provider might be cheaper overall even if its base rates are higher.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Caching - The Biggest Cost Saver
&lt;/h2&gt;

&lt;p&gt;Prompt caching is the single most effective way to reduce LLM API costs, especially for AI code review workloads where the same system prompt and repository context are sent with every request.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Prompt Caching Works
&lt;/h3&gt;

&lt;p&gt;Instead of reprocessing the same prompt prefix on every request, the provider stores it in GPU memory. Subsequent requests that share the same prefix get a significant discount on those cached input tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Caching Pricing by Provider
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Cache Write Cost&lt;/th&gt;
&lt;th&gt;Cache Read Cost&lt;/th&gt;
&lt;th&gt;Cache Duration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;1.25x base input&lt;/td&gt;
&lt;td&gt;0.1x base input&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic (extended)&lt;/td&gt;
&lt;td&gt;2x base input&lt;/td&gt;
&lt;td&gt;0.1x base input&lt;/td&gt;
&lt;td&gt;1 hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;1x base input (automatic)&lt;/td&gt;
&lt;td&gt;0.5x base input&lt;/td&gt;
&lt;td&gt;Up to 1 hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;1x base input&lt;/td&gt;
&lt;td&gt;0.25x base input&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Anthropic's caching is the most aggressive - a cache read costs just 10% of the normal input rate. For Claude Sonnet at $3/1M input tokens, cached reads cost just $0.30/1M. The 1.25x write overhead pays for itself after a single cache hit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Caching Example for Code Review
&lt;/h3&gt;

&lt;p&gt;Suppose your AI code review tool uses a 3,000-token system prompt with every review. Over 100 reviews:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without caching (Claude Sonnet):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3,000 tokens x 100 reviews = 300,000 input tokens&lt;/li&gt;
&lt;li&gt;Cost: 300,000 / 1M x $3.00 = &lt;strong&gt;$0.90&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;With caching (Claude Sonnet):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First request (cache write): 3,000 tokens x $3.75/1M = $0.011&lt;/li&gt;
&lt;li&gt;99 cached requests: 3,000 x 99 = 297,000 tokens x $0.30/1M = $0.089&lt;/li&gt;
&lt;li&gt;Total: &lt;strong&gt;$0.10&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is an &lt;strong&gt;89% reduction&lt;/strong&gt; just from caching the system prompt. In practice, you can also cache repository-level context, coding standards, and review rules for even greater savings.&lt;/p&gt;
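&lt;p&gt;The arithmetic above generalizes to any prefix size and request count. A minimal sketch, using Anthropic's published multipliers (1.25x for 5-minute cache writes, 0.1x for reads):&lt;/p&gt;

```python
# Sketch: cached vs. uncached cost for a repeated prompt prefix.
# Multipliers match Anthropic's 5-minute cache pricing (1.25x write,
# 0.1x read); adjust them for your provider.
def prefix_cost(tokens, requests, price_per_m, write_mult=1.25, read_mult=0.10):
    uncached = tokens * requests * price_per_m / 1e6
    cached = (tokens * write_mult + tokens * (requests - 1) * read_mult) * price_per_m / 1e6
    return uncached, cached

# Claude Sonnet at $3/1M input, 3,000-token prompt, 100 reviews.
uncached, cached = prefix_cost(3000, 100, 3.00)
print(f"${uncached:.2f} uncached vs ${cached:.2f} cached")  # $0.90 vs $0.10
```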

&lt;h2&gt;
  
  
  How AI Code Review Tools Handle Token Costs
&lt;/h2&gt;

&lt;p&gt;Understanding token economics explains many of the design decisions behind modern &lt;a href="https://dev.to/blog/best-ai-code-review-tools"&gt;AI code review tools&lt;/a&gt;. Here is how the major tools manage costs:&lt;/p&gt;

&lt;h3&gt;
  
  
  Diff-Based Analysis
&lt;/h3&gt;

&lt;p&gt;Tools like &lt;a href="https://dev.to/tools/coderabbit"&gt;CodeRabbit&lt;/a&gt;, &lt;a href="https://dev.to/tools/codeant-ai"&gt;CodeAnt AI&lt;/a&gt;, and &lt;a href="https://dev.to/tools/sourcery"&gt;Sourcery&lt;/a&gt; only send the changed code (the diff) rather than entire files. This dramatically reduces input tokens. A 10-line change in a 500-line file uses roughly 95% fewer input tokens than sending the whole file.&lt;/p&gt;
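&lt;p&gt;A rough illustration of that ratio, using the common four-characters-per-token rule of thumb (an approximation, not a real tokenizer):&lt;/p&gt;

```python
# Sketch: rough input-token savings from sending a diff instead of a file.
# Uses the ~4 chars/token heuristic rather than a real tokenizer.
def approx_tokens(text):
    return max(1, len(text) // 4)

full_file = "x = 1\n" * 500   # stand-in for a 500-line file
diff = "x = 1\n" * 10         # stand-in for a 10-line change
savings = 1 - approx_tokens(diff) / approx_tokens(full_file)
print(f"{savings:.0%} fewer input tokens")  # 98% fewer in this toy case
```
&lt;p&gt;In practice tools also send a few lines of surrounding context with each hunk, which is why the real-world savings land closer to the 95% figure.&lt;/p&gt;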

&lt;h3&gt;
  
  
  Model Routing
&lt;/h3&gt;

&lt;p&gt;Sophisticated tools route different types of analysis to different models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Style and formatting checks&lt;/strong&gt; go to cheap, fast models like GPT-4o-mini or Gemini Flash&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logic and bug detection&lt;/strong&gt; go to mid-tier models like GPT-4o or Claude Sonnet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security vulnerability analysis&lt;/strong&gt; may use reasoning models like o3 for deeper analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tiered approach can reduce costs by 60-80% compared to using a single premium model for everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt Caching for Repository Context
&lt;/h3&gt;

&lt;p&gt;When a code review tool indexes your repository, it builds a context document with your coding standards, common patterns, and project structure. This context is cached and reused across every PR review, saving thousands of input tokens per request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Flat Pricing Abstraction
&lt;/h3&gt;

&lt;p&gt;Most &lt;a href="https://dev.to/blog/best-ai-code-review-tools"&gt;AI code review tools&lt;/a&gt; charge a flat per-user monthly fee rather than passing through token costs directly. This means the tool vendor absorbs the token cost risk and optimizes internally. Tools like CodeRabbit ($15/user/month) and CodeAnt AI (free for public repos) handle all the token math behind the scenes.&lt;/p&gt;

&lt;h2&gt;
  
  
  8 Strategies to Optimize Your LLM Token Costs
&lt;/h2&gt;

&lt;p&gt;Whether you are building your own AI-powered tool or managing API costs directly, these strategies will help you spend less without sacrificing quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Use Prompt Caching Aggressively
&lt;/h3&gt;

&lt;p&gt;Put your longest, most stable content at the beginning of your prompt. System instructions, few-shot examples, and reference documents are prime candidates for caching. Structure prompts so the cached prefix is reused across as many requests as possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Choose the Right Model for the Task
&lt;/h3&gt;

&lt;p&gt;Do not use o3 or Claude Opus for tasks that GPT-4o-mini can handle. Create a model routing strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tier 1 (cheap):&lt;/strong&gt; Formatting, linting, simple classification - GPT-4o-mini, Gemini Flash&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 2 (balanced):&lt;/strong&gt; Code review, bug detection, summarization - GPT-4o, Claude Sonnet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 3 (premium):&lt;/strong&gt; Security analysis, architecture review, complex reasoning - o3, Claude Opus&lt;/li&gt;
&lt;/ul&gt;
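&lt;p&gt;One way to encode such a policy is a plain lookup table. The model names below are illustrative stand-ins for the tiers above; substitute whatever identifiers your provider exposes:&lt;/p&gt;

```python
# Sketch: route each task type to the cheapest adequate model tier.
# Model names are illustrative placeholders, not a recommendation.
ROUTES = {
    "formatting": "gpt-4o-mini",     # tier 1: cheap and fast
    "lint": "gpt-4o-mini",
    "code_review": "claude-sonnet",  # tier 2: balanced
    "bug_detection": "claude-sonnet",
    "security_audit": "o3",          # tier 3: premium reasoning
}

def pick_model(task_type):
    # Fall back to the balanced tier for unknown task types.
    return ROUTES.get(task_type, "claude-sonnet")

print(pick_model("lint"))            # gpt-4o-mini
print(pick_model("security_audit"))  # o3
```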

&lt;h3&gt;
  
  
  3. Trim Your Input Context
&lt;/h3&gt;

&lt;p&gt;Every unnecessary token in your prompt costs money. Remove:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redundant instructions&lt;/li&gt;
&lt;li&gt;Overly verbose few-shot examples&lt;/li&gt;
&lt;li&gt;Code that is not relevant to the current task&lt;/li&gt;
&lt;li&gt;Conversation history beyond what is needed for context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For code review, send only the diff and immediately surrounding lines rather than entire files.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Limit Output Length
&lt;/h3&gt;

&lt;p&gt;Set &lt;code&gt;max_tokens&lt;/code&gt; to prevent the model from generating unnecessarily long responses. If you need a yes/no classification, cap the output at 10 tokens. If you need a code review summary, 500-1,000 tokens is usually sufficient.&lt;/p&gt;

&lt;p&gt;This is especially important with reasoning models where unconstrained thinking budgets can generate thousands of expensive reasoning tokens.&lt;/p&gt;
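&lt;p&gt;In practice this is just a per-task cap passed with each request. A sketch, with budgets following the rough guidance above (the numbers are starting points to tune, not fixed rules):&lt;/p&gt;

```python
# Sketch: cap output length per task type via max_tokens.
# Budgets are illustrative starting points; tune for your workload.
OUTPUT_BUDGETS = {
    "classification": 10,    # yes/no answers need almost nothing
    "review_summary": 1000,  # a PR summary rarely needs more
    "fix_suggestion": 2000,
}

def request_params(task_type, prompt):
    return {
        "model": "gpt-4o",
        "max_tokens": OUTPUT_BUDGETS.get(task_type, 500),
        "messages": [{"role": "user", "content": prompt}],
    }

params = request_params("classification", "Is this diff safe to merge?")
```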

&lt;h3&gt;
  
  
  5. Use Batch APIs for Non-Urgent Work
&lt;/h3&gt;

&lt;p&gt;OpenAI and Anthropic both offer batch processing APIs with significant discounts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Batch API:&lt;/strong&gt; 50% discount on all models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic Message Batches:&lt;/strong&gt; 50% discount, results within 24 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your code reviews do not need to be instant (for example, nightly security scans), batch processing cuts your costs in half.&lt;/p&gt;
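&lt;p&gt;With OpenAI's Batch API, for example, you upload a JSONL file where each line is a request envelope. The envelope shape below follows OpenAI's batch documentation; the &lt;code&gt;custom_id&lt;/code&gt; values are whatever you use to match results back to PRs:&lt;/p&gt;

```python
import json

# Sketch: build the JSONL input for OpenAI's Batch API (50% discount).
# The envelope shape (custom_id / method / url / body) follows OpenAI's
# batch docs; custom_id values are your own correlation keys.
def batch_line(custom_id, prompt):
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "max_tokens": 1000,
            "messages": [{"role": "user", "content": prompt}],
        },
    })

# One line per nightly scan target; upload the file, then create the batch.
lines = [batch_line(f"pr-{n}", f"Review PR {n}") for n in (101, 102)]
jsonl = "\n".join(lines)
```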

&lt;h3&gt;
  
  
  6. Monitor and Set Spending Limits
&lt;/h3&gt;

&lt;p&gt;Track your token usage by model and endpoint. Set daily and monthly spending limits to avoid surprise bills. Most providers offer usage dashboards and API-level budget controls.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Compress Code Context Intelligently
&lt;/h3&gt;

&lt;p&gt;Instead of sending raw source code, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sending AST (Abstract Syntax Tree) representations, which are more token-efficient&lt;/li&gt;
&lt;li&gt;Using code summaries for files that provide context but are not being reviewed&lt;/li&gt;
&lt;li&gt;Stripping comments and whitespace from context files (but not from the code being reviewed)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8. Cache and Reuse Responses
&lt;/h3&gt;

&lt;p&gt;If multiple developers open PRs that modify similar code, cache the review analysis for shared components. This avoids paying for the same analysis twice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token Costs for Popular AI Tasks
&lt;/h2&gt;

&lt;p&gt;To put token pricing in broader perspective, here is what common AI-powered developer tasks cost using GPT-4o pricing ($2.50 input / $10.00 output per million tokens):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Input Tokens&lt;/th&gt;
&lt;th&gt;Output Tokens&lt;/th&gt;
&lt;th&gt;Cost per Task&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PR code review (small)&lt;/td&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;$0.023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PR code review (large)&lt;/td&gt;
&lt;td&gt;30,000&lt;/td&gt;
&lt;td&gt;3,000&lt;/td&gt;
&lt;td&gt;$0.105&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security vulnerability scan&lt;/td&gt;
&lt;td&gt;20,000&lt;/td&gt;
&lt;td&gt;2,000&lt;/td&gt;
&lt;td&gt;$0.070&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code documentation generation&lt;/td&gt;
&lt;td&gt;3,000&lt;/td&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;$0.058&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bug explanation and fix&lt;/td&gt;
&lt;td&gt;4,000&lt;/td&gt;
&lt;td&gt;2,000&lt;/td&gt;
&lt;td&gt;$0.030&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test case generation&lt;/td&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;8,000&lt;/td&gt;
&lt;td&gt;$0.093&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture review&lt;/td&gt;
&lt;td&gt;50,000&lt;/td&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;$0.175&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These costs are per individual task. At scale, they add up - but they are remarkably cheap compared to the developer time they save.&lt;/p&gt;
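&lt;p&gt;Every row in the table comes from the same one-line formula. A sketch at GPT-4o's rates:&lt;/p&gt;

```python
# Sketch: reproduce the per-task costs above at GPT-4o rates
# ($2.50 input / $10.00 output per million tokens).
def task_cost(input_tokens, output_tokens, in_price=2.50, out_price=10.00):
    return (input_tokens * in_price + output_tokens * out_price) / 1e6

print(f"${task_cost(5000, 1000):.4f}")   # small PR review ($0.0225, table rounds to $0.023)
print(f"${task_cost(50000, 5000):.3f}")  # architecture review: $0.175
```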

&lt;h2&gt;
  
  
  The Future of Token Pricing
&lt;/h2&gt;

&lt;p&gt;Token prices have dropped roughly 80% from 2024 to 2026, and the trend shows no signs of slowing. Several factors are driving costs down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hardware improvements&lt;/strong&gt; - newer GPU architectures (NVIDIA Blackwell, AMD MI400) are more efficient&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model distillation&lt;/strong&gt; - smaller models trained on outputs from larger models close the quality gap at lower cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference optimization&lt;/strong&gt; - techniques like speculative decoding, quantization, and better KV cache management reduce compute per token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competition&lt;/strong&gt; - DeepSeek, Mistral, and open-source models keep pressure on pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For AI code review tools, this means the cost of running comprehensive analysis on every PR is approaching zero. The limiting factor is shifting from cost to quality - which model provides the most accurate, lowest-false-positive reviews.&lt;/p&gt;

&lt;h2&gt;
  
  
  How This Affects Your Choice of AI Code Review Tool
&lt;/h2&gt;

&lt;p&gt;When evaluating &lt;a href="https://dev.to/blog/best-ai-code-review-tools"&gt;AI code review tools&lt;/a&gt;, understanding token economics helps you assess whether a tool's pricing is fair:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools charging $15-30/user/month&lt;/strong&gt; likely use mid-tier models with caching, which is sustainable and cost-effective&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools offering unlimited free tiers&lt;/strong&gt; are either using very cheap models, heavily rate-limiting, or subsidizing costs with venture capital&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted tools&lt;/strong&gt; like &lt;a href="https://dev.to/tools/sonarqube"&gt;SonarQube&lt;/a&gt; or &lt;a href="https://dev.to/tools/semgrep"&gt;Semgrep&lt;/a&gt; avoid token costs entirely but require infrastructure investment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best value typically comes from tools that intelligently route between model tiers - using cheap models for simple checks and premium models only when the complexity warrants it. &lt;a href="https://dev.to/tools/codeant-ai"&gt;CodeAnt AI&lt;/a&gt; and &lt;a href="https://dev.to/tools/coderabbit"&gt;CodeRabbit&lt;/a&gt; both take this approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input tokens&lt;/strong&gt; are cheapest because the model processes them in parallel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output tokens&lt;/strong&gt; cost 3-8x more because they require sequential generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning tokens&lt;/strong&gt; are billed as output tokens and can multiply your costs 3-10x with no visible output increase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt caching&lt;/strong&gt; can reduce costs by up to 90% for repeated prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model routing&lt;/strong&gt; - using cheap models for simple tasks - is the most impactful optimization strategy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch APIs&lt;/strong&gt; offer 50% discounts for non-urgent workloads&lt;/li&gt;
&lt;li&gt;For AI code review, token costs are now low enough that comprehensive analysis on every PR is economically viable for teams of any size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these fundamentals puts you in control of your AI spending rather than being surprised by your monthly bill. Whether you are using AI APIs directly or evaluating managed tools, knowing what drives token costs helps you make smarter decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why do output tokens cost more than input tokens?
&lt;/h3&gt;

&lt;p&gt;Output tokens cost 3-8x more than input tokens because generating text is far more compute-intensive than reading it. During input processing, the model runs one forward pass over all tokens in parallel. During output generation, the model must run a separate forward pass for every single token, predicting one token at a time while maintaining the full context. This sequential, autoregressive process requires significantly more GPU time and memory bandwidth per token.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are reasoning tokens and how are they billed?
&lt;/h3&gt;

&lt;p&gt;Reasoning tokens are internal chain-of-thought tokens generated by models like OpenAI o1, o3, and Claude with extended thinking. These tokens represent the model's step-by-step problem-solving process. They are billed at the output token rate because they require the same sequential generation process. Reasoning tokens are not visible in the API response but still consume your token budget and context window. A 500-token visible response may use 2,000 or more total tokens when reasoning is included.&lt;/p&gt;

&lt;h3&gt;
  
  
  How can I reduce LLM API costs without losing quality?
&lt;/h3&gt;

&lt;p&gt;The most effective strategies are prompt caching (up to 90% savings on repeated prompts), using smaller models for simple tasks and reserving expensive models for complex ones, trimming unnecessary context from prompts, batching requests during off-peak hours for 50% discounts, and setting maximum token limits on output. For AI code review, focusing reviews on changed files only rather than entire repositories dramatically reduces input token usage.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does prompt caching work and when should I use it?
&lt;/h3&gt;

&lt;p&gt;Prompt caching stores frequently used prompt prefixes so they do not need to be reprocessed on every request. Anthropic charges 0.1x the base input rate for cache reads and 1.25x for 5-minute cache writes. OpenAI offers automatic caching at 0.5x the input rate. You should use caching whenever you send the same system prompt, code context, or instructions repeatedly - which is exactly how AI code review tools work when analyzing multiple PRs against the same codebase rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which LLM model offers the best value for AI code review?
&lt;/h3&gt;

&lt;p&gt;For most AI code review use cases, GPT-4.1 at $2/$8 per million tokens or Claude Sonnet at $3/$15 offer the best balance of quality and cost. If you need deep reasoning for complex security analysis, o3 at $2/$8 is more cost-effective than o1, even after accounting for reasoning tokens billed as output. For high-volume linting and style checks, Gemini 2.5 Flash at $0.30/$2.50 or GPT-4o-mini at $0.15/$0.60 are the most economical choices.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do AI code review tools manage token costs internally?
&lt;/h3&gt;

&lt;p&gt;AI code review tools use several strategies to keep costs manageable. They use diff-based analysis to only send changed code rather than entire files, employ prompt caching for system instructions and repository context, route simple checks to cheaper models while using premium models for security analysis, and batch multiple file reviews where possible. Tools like CodeRabbit and CodeAnt AI abstract this complexity so you pay a flat per-user fee instead of worrying about token math.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://aicodereview.cc/blog/input-vs-output-vs-reasoning-tokens-cost/" rel="noopener noreferrer"&gt;aicodereview.cc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Fake SOC 2 and ISO 27001 Certifications Are Spreading Across Dev Tools</title>
      <dc:creator>Rahul Singh</dc:creator>
      <pubDate>Sat, 11 Apr 2026 10:00:00 +0000</pubDate>
      <link>https://dev.to/rahulxsingh/fake-soc-2-and-iso-27001-certifications-are-spreading-across-dev-tools-48f9</link>
      <guid>https://dev.to/rahulxsingh/fake-soc-2-and-iso-27001-certifications-are-spreading-across-dev-tools-48f9</guid>
      <description>&lt;p&gt;A detailed investigation published on Substack has alleged that Delve, a compliance automation platform, systematically manufactured false SOC 2 and ISO 27001 certifications for its clients. If the allegations hold up, it represents one of the largest compliance fraud operations in the startup ecosystem.&lt;/p&gt;

&lt;p&gt;For developers and engineering teams that rely on compliance certifications when evaluating tools, this is a wake-up call.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened with Delve
&lt;/h2&gt;

&lt;p&gt;According to the investigation, Delve operated by pre-populating audit evidence, generating test procedures and conclusions internally, and then routing the finished package to auditing firms that would rubber-stamp the results without conducting independent verification.&lt;/p&gt;

&lt;p&gt;The key allegations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fabricated audit evidence&lt;/strong&gt; - Delve's platform allegedly generated compliance artifacts and pre-filled audit conclusions rather than requiring clients to demonstrate actual security controls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-independent auditors&lt;/strong&gt; - The auditing firms used were reportedly Indian certification mills operating through US shell entities, violating AICPA independence requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Misleading marketing&lt;/strong&gt; - Claims about "US-based auditors" allegedly masked the actual audit process, and trust pages displayed security badges before any compliance work had begun&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wide impact&lt;/strong&gt; - Multiple companies including venture-backed startups and at least one NASDAQ-listed firm reportedly received these certifications, collectively handling millions of customer records&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The auditing firms named in the investigation include Accorp, Gradient Certification, Glocert, and DKPC - described as accepting pre-written conclusions rather than performing genuine independent audits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for Developer Tools
&lt;/h2&gt;

&lt;p&gt;If you evaluate code review tools, security scanners, or any SaaS platform, you have probably looked at compliance certifications as a trust signal. SOC 2 and ISO 27001 badges appear on nearly every tool's security page.&lt;/p&gt;

&lt;p&gt;Here is the problem: &lt;strong&gt;a badge is not proof of anything&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A legitimate SOC 2 Type II report means an independent auditor observed a company's security controls operating effectively over 6 to 12 months. It covers access controls, encryption, incident response, change management, and more. When done properly, it is a meaningful signal.&lt;/p&gt;

&lt;p&gt;But when the audit process itself is compromised, the certification becomes theater. A tool could display a SOC 2 badge while having:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No real access controls&lt;/li&gt;
&lt;li&gt;Unencrypted data at rest&lt;/li&gt;
&lt;li&gt;No incident response plan&lt;/li&gt;
&lt;li&gt;No change management process&lt;/li&gt;
&lt;li&gt;No employee security training&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is particularly dangerous for tools that touch your source code. Code review platforms, static analysis tools, and AI coding assistants often require read access to your repositories. If their security posture is fabricated, your code is exposed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Compliance Automation Problem
&lt;/h2&gt;

&lt;p&gt;Delve is not the only compliance automation platform. The market includes Vanta, Drata, Secureframe, Thoropass, and others. These tools promise to streamline compliance by automating evidence collection, monitoring controls, and connecting companies with auditors.&lt;/p&gt;

&lt;p&gt;The legitimate ones add real value. Automating evidence collection for a team that actually has security controls in place saves hundreds of hours.&lt;/p&gt;

&lt;p&gt;The risk emerges when automation replaces implementation. When a platform generates the evidence itself rather than collecting evidence of real controls, the entire audit becomes a fiction. The company gets a certificate. The auditor gets paid. The customers get nothing.&lt;/p&gt;

&lt;p&gt;This is not a problem with compliance automation as a concept. It is a problem with specific actors exploiting the gap between what a certification implies and what it actually verifies.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Actually Verify a Tool's Security Posture
&lt;/h2&gt;

&lt;p&gt;If you are evaluating developer tools - especially ones that access your code - here is how to go beyond the badge:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Request the Full SOC 2 Type II Report
&lt;/h3&gt;

&lt;p&gt;Any legitimate vendor will share their SOC 2 report under NDA. If they refuse or only offer a "summary," that is a red flag. The full report should include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The auditor's opinion letter&lt;/li&gt;
&lt;li&gt;A description of the system and its boundaries&lt;/li&gt;
&lt;li&gt;The specific controls tested&lt;/li&gt;
&lt;li&gt;The test results, including any exceptions or failures&lt;/li&gt;
&lt;li&gt;The observation period (should be at least 6 months for Type II)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Verify the Auditing Firm
&lt;/h3&gt;

&lt;p&gt;Check if the auditing firm is a licensed CPA firm registered with the AICPA. Look them up on the AICPA's firm directory. A legitimate SOC 2 audit must be performed by a CPA firm. If you have never heard of the firm and cannot find them in the directory, investigate further.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Look for Type II Over Type I
&lt;/h3&gt;

&lt;p&gt;SOC 2 Type I is a point-in-time snapshot - it says "the controls were designed properly on this date." Type II says "the controls operated effectively over this period." Type II is significantly more meaningful because it requires sustained compliance, not just a one-day performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Check for Exceptions in the Report
&lt;/h3&gt;

&lt;p&gt;A clean SOC 2 report with zero exceptions across every control is actually somewhat unusual and could indicate a superficial audit. Real audits often find minor exceptions that the company has remediated. The presence of exceptions (with remediations) can indicate a more thorough audit.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Evaluate Security Independently
&lt;/h3&gt;

&lt;p&gt;Compliance certifications should be one data point, not the entire evaluation. Also consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security documentation&lt;/strong&gt; - Does the vendor publish a security whitepaper or maintain a detailed security page with specifics, not just badges?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vulnerability disclosure program&lt;/strong&gt; - Do they have a responsible disclosure policy or bug bounty program?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident history&lt;/strong&gt; - How have they handled past security incidents? Transparency matters more than a clean record.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data handling&lt;/strong&gt; - Where is your code processed and stored? Is it used for training? What is the retention policy?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access controls&lt;/strong&gt; - What permissions does the tool actually need? Does it follow least-privilege principles?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What This Means for the Dev Tools Ecosystem
&lt;/h2&gt;

&lt;p&gt;The Delve allegations highlight a systemic vulnerability in how the software industry evaluates trust. We have collectively outsourced security evaluation to certifications, and those certifications are only as good as the audit behind them.&lt;/p&gt;

&lt;p&gt;For dev tool vendors, the lesson is straightforward: invest in actual security, not just the appearance of security. A compromised certification is a ticking time bomb - when it unravels, the reputational damage is catastrophic.&lt;/p&gt;

&lt;p&gt;For developers and engineering leads evaluating tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Do not treat compliance badges as proof.&lt;/strong&gt; They are a starting point for investigation, not a conclusion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read the actual reports.&lt;/strong&gt; SOC 2 reports exist to be read, not just referenced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify the auditor.&lt;/strong&gt; Five minutes of checking the auditing firm's credentials can save you from trusting a rubber-stamped report.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate security holistically.&lt;/strong&gt; How a company responds to security questions tells you more than what badges they display.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tools that handle your source code deserve the same scrutiny you would give to any critical infrastructure dependency. A compliance badge should open the conversation, not close it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;This is not just about Delve. The compliance automation industry has grown rapidly, and the incentive structure is fragile. Companies want certifications fast and cheap. Platforms compete on speed and price. Auditors face pressure to approve.&lt;/p&gt;

&lt;p&gt;When the incentives all point toward faster and cheaper certifications, corners get cut. The Delve investigation may be the first major public exposure, but it is unlikely to be the last.&lt;/p&gt;

&lt;p&gt;As developers, we are in a unique position. We understand systems, trust boundaries, and the difference between security theater and actual security. Apply that same rigor to the tools you choose to trust with your code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is SOC 2 compliance?
&lt;/h3&gt;

&lt;p&gt;SOC 2 is a security framework developed by the AICPA that evaluates how a company handles customer data across five trust principles - security, availability, processing integrity, confidentiality, and privacy. An independent auditor must verify controls before issuing a SOC 2 report.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can a SOC 2 report be faked?
&lt;/h3&gt;

&lt;p&gt;Yes. As the Delve investigation shows, some compliance automation platforms allegedly pre-populate audit evidence, use non-independent auditors, and generate rubber-stamped reports. Without proper verification, a SOC 2 badge on a website can be meaningless.&lt;/p&gt;

&lt;h3&gt;
  
  
  How can I verify if a tool's SOC 2 certification is real?
&lt;/h3&gt;

&lt;p&gt;Ask the vendor for their full SOC 2 Type II report (not just a badge). Check the auditing firm's credentials on the AICPA website. Look for a Type II report that covers at least 6 months of observation, and verify the auditor is a licensed CPA firm with no conflicts of interest.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between SOC 2 Type I and Type II?
&lt;/h3&gt;

&lt;p&gt;Type I evaluates whether security controls are properly designed at a single point in time. Type II tests whether those controls actually operated effectively over a period of 6 to 12 months. Type II is significantly more rigorous and trustworthy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I stop trusting compliance certifications entirely?
&lt;/h3&gt;

&lt;p&gt;No. Legitimate SOC 2 and ISO 27001 certifications remain valuable signals. But they should be one input among many. Verify the auditor, read the actual report, and evaluate the vendor's security practices independently rather than relying on a badge alone.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://aicodereview.cc/blog/fake-compliance-certifications-dev-tools/" rel="noopener noreferrer"&gt;aicodereview.cc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Export Azure DevOps Data to Excel (6 Methods, 2026)</title>
      <dc:creator>Rahul Singh</dc:creator>
      <pubDate>Sat, 11 Apr 2026 08:00:00 +0000</pubDate>
      <link>https://dev.to/rahulxsingh/how-to-export-azure-devops-data-to-excel-6-methods-2026-20c2</link>
      <guid>https://dev.to/rahulxsingh/how-to-export-azure-devops-data-to-excel-6-methods-2026-20c2</guid>
      <description>&lt;h2&gt;
  
  
  Why export Azure DevOps data to Excel
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Teams export Azure DevOps data to Excel for reporting, audits, stakeholder presentations, offline analysis, and data migration.&lt;/strong&gt; While Azure DevOps has built-in dashboards and Analytics views, Excel remains the tool that everyone on a team - including project managers, QA leads, and executives who never log into Azure DevOps - already knows how to use.&lt;/p&gt;

&lt;p&gt;Common scenarios where exporting to Excel makes sense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sprint reports&lt;/strong&gt; - pulling work item data for burndown charts and velocity calculations outside of Azure DevOps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance audits&lt;/strong&gt; - exporting code review history, test results, and pipeline logs for SOC 2 or ISO 27001 documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stakeholder updates&lt;/strong&gt; - creating filtered views of project status that non-technical team members can read&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data migration&lt;/strong&gt; - moving work items between Azure DevOps organizations or to another platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom analysis&lt;/strong&gt; - running pivot tables, VLOOKUP formulas, or statistical analysis that Azure DevOps dashboards do not support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline access&lt;/strong&gt; - having a snapshot of project data when internet access is unavailable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This guide covers six methods to get data out of Azure DevOps and into Excel, ranked from simplest to most powerful. Each method has different strengths depending on what type of data you need and how much automation you want.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 1 - Built-in Open in Excel (work items)
&lt;/h2&gt;

&lt;p&gt;The fastest way to export Azure DevOps work items to Excel is the built-in Office Integration feature. This creates a live, bidirectional connection between Excel and Azure DevOps - meaning you can not only read data but also edit work items directly in Excel and publish changes back.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before using this method, you need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Excel 2016 or later&lt;/strong&gt; installed on your machine (Excel for Microsoft 365 recommended)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure DevOps Office Integration plugin&lt;/strong&gt; - included with Visual Studio 2019+, or available as a standalone download from the Visual Studio Marketplace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission&lt;/strong&gt; to read work items in the Azure DevOps project&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step-by-step process
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1 - Create or open a work item query&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Navigate to your Azure DevOps project. Go to &lt;strong&gt;Boards &amp;gt; Queries&lt;/strong&gt;. Either open an existing saved query or create a new one. For example, to export all user stories in the current sprint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Work&lt;/span&gt; &lt;span class="n"&gt;Item&lt;/span&gt; &lt;span class="k"&gt;Type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;User&lt;/span&gt; &lt;span class="n"&gt;Story&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;State&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Removed&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;Iteration&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;CurrentIteration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure the query returns the columns you want in Excel. Click &lt;strong&gt;Column Options&lt;/strong&gt; to add or remove fields like Story Points, Assigned To, Area Path, Tags, and any custom fields your team uses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 - Click Open in Excel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the query results displayed, click the &lt;strong&gt;Open in Excel&lt;/strong&gt; button in the toolbar. This button appears at the top of the query results page. If you do not see it, confirm the Office Integration plugin is installed.&lt;/p&gt;

&lt;p&gt;Azure DevOps will generate an &lt;code&gt;.xlsx&lt;/code&gt; file and open it in Excel with the Team plugin connected. The spreadsheet will contain one row per work item with all the columns from your query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 - Work with the data in Excel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the data is in Excel, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sort and filter using standard Excel features&lt;/li&gt;
&lt;li&gt;Create pivot tables from the work item data&lt;/li&gt;
&lt;li&gt;Add conditional formatting to highlight blocked items or overdue tasks&lt;/li&gt;
&lt;li&gt;Use formulas to calculate custom metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 4 - Refresh data (optional)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because this is a live connection, you can click &lt;strong&gt;Refresh&lt;/strong&gt; in the Team ribbon tab to pull the latest data from Azure DevOps at any time. This is useful for recurring reports where you want to update the numbers without re-exporting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitations of the built-in method
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Only works on Windows (the Office Integration plugin does not support macOS)&lt;/li&gt;
&lt;li&gt;Limited to work items - you cannot export pipeline data, test results, or repository statistics this way&lt;/li&gt;
&lt;li&gt;Maximum of 50,000 work items per query&lt;/li&gt;
&lt;li&gt;Requires the desktop Excel application - Excel for the web does not support the plugin&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Method 2 - CSV export from the web UI
&lt;/h2&gt;

&lt;p&gt;If you are on macOS or Linux, or simply do not want to install the Office Integration plugin, the CSV export is the next simplest option.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step-by-step process
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1 - Run your work item query&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Go to &lt;strong&gt;Boards &amp;gt; Queries&lt;/strong&gt; in the Azure DevOps web portal. Open or create a query with the columns and filters you need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 - Export to CSV&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Click the three-dot menu (more actions) at the top right of the query results. Select &lt;strong&gt;Export to CSV&lt;/strong&gt;. Azure DevOps will generate a &lt;code&gt;.csv&lt;/code&gt; file and download it to your browser's default download location.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 - Open in Excel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open the CSV file in Excel. If your data contains special characters or non-English text, use Excel's &lt;strong&gt;Data &amp;gt; From Text/CSV&lt;/strong&gt; import wizard and set the encoding to UTF-8 to avoid garbled characters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4 - Format and save as .xlsx&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once imported, format the data as needed and save as an &lt;code&gt;.xlsx&lt;/code&gt; file to preserve formatting, formulas, and multiple sheets.&lt;/p&gt;
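&lt;p&gt;If you produce this report on a schedule, Steps 3 and 4 can be scripted. Below is a minimal sketch using pandas (this assumes the &lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;openpyxl&lt;/code&gt; packages are installed; the sample rows and file names are placeholders, not real export data):&lt;/p&gt;

```python
import pandas as pd

# Placeholder rows standing in for a real Azure DevOps CSV export
rows = [
    {"ID": 101, "Title": "Login fails on Safari", "State": "Active", "Story Points": 3},
    {"ID": 102, "Title": "Add CSV export button", "State": "Closed", "Story Points": 5},
]
pd.DataFrame(rows).to_csv("work-items-export.csv", index=False, encoding="utf-8-sig")

# Re-read with utf-8-sig, which tolerates the BOM some exports include,
# then save as .xlsx so formatting and formulas survive later edits
df = pd.read_csv("work-items-export.csv", encoding="utf-8-sig")
df.to_excel("work-items-export.xlsx", index=False, sheet_name="Work Items")
```

&lt;p&gt;Drop this into a scheduled task and the recurring report regenerates itself from each fresh CSV export.&lt;/p&gt;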

&lt;h3&gt;
  
  
  CSV export tips
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Column selection matters&lt;/strong&gt; - only columns visible in the query results will appear in the CSV. Add all the fields you need before exporting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTML fields get messy&lt;/strong&gt; - rich text fields like Description and Acceptance Criteria export as raw HTML in CSV. You may need to clean these up in Excel, for example by converting &lt;code&gt;&amp;lt;br&amp;gt;&lt;/code&gt; tags into real line breaks with &lt;code&gt;=SUBSTITUTE(A1,"&amp;lt;br&amp;gt;",CHAR(10))&lt;/code&gt; (avoid wrapping this in &lt;code&gt;CLEAN()&lt;/code&gt;, which strips out the &lt;code&gt;CHAR(10)&lt;/code&gt; characters it just inserted).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dates may need reformatting&lt;/strong&gt; - depending on your locale, date fields might import as text. Use Excel's &lt;strong&gt;Text to Columns&lt;/strong&gt; or &lt;code&gt;DATEVALUE()&lt;/code&gt; function to convert them.&lt;/li&gt;
&lt;/ul&gt;
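&lt;p&gt;A spreadsheet formula only goes so far against rich HTML. For heavier cleanup, Python's standard-library &lt;code&gt;html.parser&lt;/code&gt; can strip all tags and decode entities before the data ever reaches Excel - a rough sketch, not tuned to every field Azure DevOps emits:&lt;/p&gt;

```python
from html.parser import HTMLParser

class _TagStripper(HTMLParser):
    """Collects text content; turns block-ish tags into newlines."""
    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp;, &nbsp;, ...
        self.parts = []
    def handle_starttag(self, tag, attrs):
        if tag in ("br", "p", "li", "div"):
            self.parts.append("\n")
    def handle_data(self, data):
        self.parts.append(data)

def strip_html(value: str) -> str:
    """Convert an exported rich-text field (e.g. Description) to plain text."""
    stripper = _TagStripper()
    stripper.feed(value)
    stripper.close()  # flush any buffered trailing text
    return "".join(stripper.parts).strip()

print(strip_html("<p>Steps to reproduce:</p><p>Open the board &amp; refresh</p>"))
# Prints: Steps to reproduce:
#         Open the board & refresh
```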

&lt;h2&gt;
  
  
  Method 3 - REST API export with PowerShell or Python
&lt;/h2&gt;

&lt;p&gt;When you need data that the built-in export does not cover - pipeline runs, pull request details, code review comments, build logs, or any data beyond work items - the Azure DevOps REST API is the way to go.&lt;/p&gt;
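&lt;p&gt;Every call in this method authenticates the same way: a Personal Access Token (PAT) is sent as the password half of HTTP Basic auth, with an empty username. The header both scripts below build looks like this (the token value is a placeholder):&lt;/p&gt;

```python
import base64

# Placeholder - create a real PAT under User Settings > Personal Access Tokens
PAT = "your-personal-access-token"

# Azure DevOps expects "Basic " + base64(":" + PAT), i.e. an empty username
token = base64.b64encode(f":{PAT}".encode()).decode()
headers = {
    "Authorization": f"Basic {token}",
    "Content-Type": "application/json",
}
```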

&lt;h3&gt;
  
  
  Exporting work items with PowerShell
&lt;/h3&gt;

&lt;p&gt;This script runs a WIQL query and exports the results to a CSV file that Excel can open:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Configuration&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$org&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-organization"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$project&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-project"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$pat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-personal-access-token"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$outputFile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"work-items-export.csv"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Create auth header&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$base64Auth&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Convert&lt;/span&gt;&lt;span class="p"&gt;]::&lt;/span&gt;&lt;span class="n"&gt;ToBase64String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Text.Encoding&lt;/span&gt;&lt;span class="p"&gt;]::&lt;/span&gt;&lt;span class="n"&gt;ASCII.GetBytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;":&lt;/span&gt;&lt;span class="nv"&gt;$pat&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$headers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;@{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nx"&gt;Authorization&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Basic &lt;/span&gt;&lt;span class="nv"&gt;$base64Auth&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="s2"&gt;"Content-Type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"application/json"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Step 1 - Run a WIQL query to get work item IDs&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$wiqlBody&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;@{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SELECT [System.Id] FROM WorkItems WHERE [System.TeamProject] = '&lt;/span&gt;&lt;span class="nv"&gt;$project&lt;/span&gt;&lt;span class="s2"&gt;' AND [System.WorkItemType] = 'User Story' AND [System.State] &amp;lt;&amp;gt; 'Removed' ORDER BY [System.Id]"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ConvertTo-Json&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="nv"&gt;$wiqlUrl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://dev.azure.com/&lt;/span&gt;&lt;span class="nv"&gt;$org&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$project&lt;/span&gt;&lt;span class="s2"&gt;/_apis/wit/wiql?api-version=7.1"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$wiqlResult&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Invoke-RestMethod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Uri&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$wiqlUrl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Method&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Post&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Headers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$headers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Body&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$wiqlBody&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Step 2 - Fetch full work item details in batches of 200&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$ids&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$wiqlResult&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;workItems&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$allWorkItems&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;@()&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="kr"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-lt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$ids&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nv"&gt;$batch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$ids&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;$i&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;]::&lt;/span&gt;&lt;span class="n"&gt;Min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;199&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$ids&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Count&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nv"&gt;$idsParam&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$batch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-join&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;","&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nv"&gt;$detailUrl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://dev.azure.com/&lt;/span&gt;&lt;span class="nv"&gt;$org&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$project&lt;/span&gt;&lt;span class="s2"&gt;/_apis/wit/workitems?ids=&lt;/span&gt;&lt;span class="nv"&gt;$idsParam&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;&lt;/span&gt;&lt;span class="se"&gt;`$&lt;/span&gt;&lt;span class="s2"&gt;expand=fields&amp;amp;api-version=7.1"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nv"&gt;$details&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Invoke-RestMethod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Uri&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$detailUrl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Method&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Headers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$headers&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nv"&gt;$allWorkItems&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$details&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Step 3 - Convert to flat objects and export as CSV&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$rows&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$allWorkItems&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ForEach-Object&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PSCustomObject&lt;/span&gt;&lt;span class="p"&gt;]@{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nx"&gt;ID&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nx"&gt;Title&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="s1"&gt;'System.Title'&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nx"&gt;State&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="s1"&gt;'System.State'&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nx"&gt;AssignedTo&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="s1"&gt;'System.AssignedTo'&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;displayName&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nx"&gt;WorkItemType&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="s1"&gt;'System.WorkItemType'&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nx"&gt;StoryPoints&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="s1"&gt;'Microsoft.VSTS.Scheduling.StoryPoints'&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nx"&gt;AreaPath&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="s1"&gt;'System.AreaPath'&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nx"&gt;IterationPath&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="s1"&gt;'System.IterationPath'&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nx"&gt;CreatedDate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="s1"&gt;'System.CreatedDate'&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nx"&gt;ChangedDate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="s1"&gt;'System.ChangedDate'&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nx"&gt;Tags&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="s1"&gt;'System.Tags'&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="nv"&gt;$rows&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Export-Csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$outputFile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-NoTypeInformation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Encoding&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;UTF8&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Write-Host&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Exported &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nv"&gt;$rows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Count&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; work items to &lt;/span&gt;&lt;span class="nv"&gt;$outputFile&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Exporting pipeline runs with Python
&lt;/h3&gt;

&lt;p&gt;This Python script pulls pipeline run data - something the built-in export cannot do at all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="c1"&gt;# Configuration
&lt;/span&gt;&lt;span class="n"&gt;ORG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-organization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;PROJECT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-project&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;PAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-personal-access-token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;OUTPUT_FILE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline-runs-export.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Auth setup
&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PAT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_pipeline_runs&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch all pipeline runs with pagination.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;all_runs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://dev.azure.com/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ORG&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROJECT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/_apis/pipelines?api-version=7.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pipelines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pipelines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
        &lt;span class="n"&gt;runs_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://dev.azure.com/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ORG&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROJECT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/_apis/pipelines/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/runs?api-version=7.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;runs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runs_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
            &lt;span class="n"&gt;all_runs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RunID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;State&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CreatedDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;createdDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FinishedDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finishedDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SourceBranch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
                &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repositories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
                &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;self&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
                &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;all_runs&lt;/span&gt;

&lt;span class="c1"&gt;# Export to CSV
&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_pipeline_runs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_FILE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DictWriter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fieldnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeheader&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writerows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Exported &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; pipeline runs to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_FILE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Exporting pull request and code review data
&lt;/h3&gt;

&lt;p&gt;For teams that need code review audit trails, this Python snippet exports pull request data including reviewer votes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_pull_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Export pull requests with reviewer details.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://dev.azure.com/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ORG&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROJECT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/_apis/git/repositories/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;repo_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/pullrequests?searchCriteria.status=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;$top=1000&amp;amp;api-version=7.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pr&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
        &lt;span class="n"&gt;reviewers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;displayName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vote&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviewers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PR_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pullRequestId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CreatedBy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;createdBy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;displayName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CreatedDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creationDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ClosedDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closedDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SourceBranch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sourceRefName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TargetBranch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;targetRefName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MergeStatus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mergeStatus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reviewers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reviewers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The vote values in Azure DevOps reviewer data map to: &lt;code&gt;10&lt;/code&gt; = Approved, &lt;code&gt;5&lt;/code&gt; = Approved with suggestions, &lt;code&gt;0&lt;/code&gt; = No vote, &lt;code&gt;-5&lt;/code&gt; = Wait for author, &lt;code&gt;-10&lt;/code&gt; = Rejected. Keep this mapping at hand when building compliance reports from exported data.&lt;/p&gt;
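&lt;p&gt;For scripted reports, that mapping can be captured in a small helper (a sketch; the function name is mine, the numeric codes are the reviewer vote values listed above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;VOTE_LABELS = {
    10: "Approved",
    5: "Approved with suggestions",
    0: "No vote",
    -5: "Wait for author",
    -10: "Rejected",
}

def vote_label(vote):
    """Translate a numeric reviewer vote into its display label."""
    return VOTE_LABELS.get(vote, f"Unknown ({vote})")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

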

&lt;h3&gt;
  
  
  REST API tips
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personal Access Tokens (PAT)&lt;/strong&gt; - generate one at &lt;code&gt;https://dev.azure.com/{org}/_usersSettings/tokens&lt;/code&gt; with the minimum required scopes (Work Items Read, Code Read, Build Read depending on what you are exporting)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pagination&lt;/strong&gt; - most API endpoints return a maximum of 200-1000 items per request. Use the &lt;code&gt;continuationToken&lt;/code&gt; or &lt;code&gt;$skip/$top&lt;/code&gt; parameters to page through large result sets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limits&lt;/strong&gt; - Azure DevOps throttles heavy API usage with a sliding-window consumption limit rather than a fixed requests-per-second quota. Add a small delay between batch requests when exporting large datasets, and back off if a response includes a &lt;code&gt;Retry-After&lt;/code&gt; header.&lt;/li&gt;
&lt;/ul&gt;
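&lt;p&gt;The pagination and throttling tips can be combined in one small helper. This is a sketch, not a drop-in utility: the &lt;code&gt;get_all_pages&lt;/code&gt; name and the default delay are mine, and it assumes a &lt;code&gt;headers&lt;/code&gt; dict with PAT authentication as in the earlier snippets; the &lt;code&gt;x-ms-continuationtoken&lt;/code&gt; response header is how Azure DevOps signals more pages on endpoints that use continuation tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import requests


def get_all_pages(url, headers, session=requests, delay=0.2):
    """Follow continuation tokens until the server stops sending one."""
    items, token = [], None
    while True:
        params = {"api-version": "7.1"}
        if token:
            params["continuationToken"] = token
        resp = session.get(url, headers=headers, params=params)
        resp.raise_for_status()
        items.extend(resp.json().get("value", []))
        # More pages are signalled through this response header.
        token = resp.headers.get("x-ms-continuationtoken")
        if not token:
            return items
        time.sleep(delay)  # stay clear of the service's throttling limits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

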

&lt;h2&gt;
  
  
  Method 4 - OData Analytics feeds in Excel
&lt;/h2&gt;

&lt;p&gt;Azure DevOps Analytics is an OData-based reporting service that provides optimized, queryable access to work tracking, pipeline, and test data. The major advantage over the REST API is that Analytics is designed for reporting - it supports aggregations, filtering, and projections at the server level, so you pull less data over the wire.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connecting Excel to Analytics OData
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1 - Get your Analytics URL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The base URL format is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://analytics.dev.azure.com/{organization}/{project}/_odata/v4.0-preview/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common entity sets you can query:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Entity&lt;/th&gt;
&lt;th&gt;URL Suffix&lt;/th&gt;
&lt;th&gt;Data Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Work items&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WorkItems&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Current state of all work items&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Work item snapshots&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WorkItemSnapshot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Historical daily snapshots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Work item revisions&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WorkItemRevisions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Every change to every work item&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pipeline runs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PipelineRuns&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Build and release pipeline execution data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test results&lt;/td&gt;
&lt;td&gt;&lt;code&gt;TestResultsDaily&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Aggregated daily test pass/fail data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test runs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;TestRuns&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Individual test run metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 2 - Open Excel and connect to OData&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open Excel and go to &lt;strong&gt;Data &amp;gt; Get Data &amp;gt; From Other Sources &amp;gt; From OData Feed&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Paste the Analytics URL, for example: &lt;code&gt;https://analytics.dev.azure.com/myorg/myproject/_odata/v4.0-preview/WorkItems?$select=WorkItemId,Title,State,WorkItemType,StoryPoints,AssignedTo&amp;amp;$filter=WorkItemType eq 'User Story'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;When prompted for authentication, select &lt;strong&gt;Basic&lt;/strong&gt; and enter an empty username with your PAT as the password. Alternatively, select &lt;strong&gt;Organizational Account&lt;/strong&gt; if Excel is signed in to the same Azure AD tenant.&lt;/li&gt;
&lt;li&gt;Excel will load the data into the Power Query editor where you can apply additional transformations.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Close &amp;amp; Load&lt;/strong&gt; to bring the data into a worksheet.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 3 - Build your report&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the data in Excel, create pivot tables, charts, and dashboards. The connection is refreshable - right-click the data table and select &lt;strong&gt;Refresh&lt;/strong&gt; to pull the latest data from Azure DevOps at any time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Useful OData query examples
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;All active bugs with priority and severity:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;WorkItems&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;WorkItemId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;State&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Severity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;AssignedTo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;CreatedDate&lt;/span&gt;
&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;WorkItemType&lt;/span&gt; &lt;span class="n"&gt;eq&lt;/span&gt; &lt;span class="s1"&gt;'Bug'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="k"&gt;State&lt;/span&gt; &lt;span class="n"&gt;ne&lt;/span&gt; &lt;span class="s1"&gt;'Closed'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="k"&gt;State&lt;/span&gt; &lt;span class="n"&gt;ne&lt;/span&gt; &lt;span class="s1"&gt;'Removed'&lt;/span&gt;
&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;orderby&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Priority&lt;/span&gt; &lt;span class="k"&gt;asc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
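&lt;p&gt;The same queries can also be issued from a script instead of Excel. A minimal sketch of a URL-and-parameters builder (the &lt;code&gt;analytics_query&lt;/code&gt; name is mine; pass the result to &lt;code&gt;requests.get&lt;/code&gt; with the same PAT-based headers used in the REST API examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def analytics_query(org, project, entity, select=None, filt=None):
    """Build the URL and query parameters for an Analytics OData request."""
    url = (
        f"https://analytics.dev.azure.com/{org}/{project}"
        f"/_odata/v4.0-preview/{entity}"
    )
    params = {}
    if select:
        params["$select"] = ",".join(select)
    if filt:
        params["$filter"] = filt
    return url, params
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Handing the &lt;code&gt;params&lt;/code&gt; dict to &lt;code&gt;requests.get&lt;/code&gt; lets the library URL-encode the &lt;code&gt;$&lt;/code&gt;-prefixed options for you, so you never have to escape the query string by hand.&lt;/p&gt;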



&lt;p&gt;&lt;strong&gt;Work item history for sprint analysis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;WorkItemSnapshot&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Iteration&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;IterationPath&lt;/span&gt; &lt;span class="n"&gt;eq&lt;/span&gt; &lt;span class="s1"&gt;'MyProject&lt;/span&gt;&lt;span class="se"&gt;\S&lt;/span&gt;&lt;span class="s1"&gt;print 15'&lt;/span&gt;
&lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;WorkItemType&lt;/span&gt; &lt;span class="n"&gt;eq&lt;/span&gt; &lt;span class="s1"&gt;'User Story'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;DateSK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;State&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="k"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;orderby&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DateSK&lt;/span&gt; &lt;span class="k"&gt;asc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
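&lt;p&gt;Once the grouped snapshot rows come back, turning them into a burndown-style table is a small pivot (a sketch; rows are assumed to carry &lt;code&gt;DateSK&lt;/code&gt;, &lt;code&gt;State&lt;/code&gt;, and &lt;code&gt;Count&lt;/code&gt; as produced by the aggregation above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def pivot_snapshot(rows):
    """Pivot (DateSK, State, Count) rows into {date: {state: count}}."""
    table = {}
    for row in rows:
        table.setdefault(row["DateSK"], {})[row["State"]] = row["Count"]
    return table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

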



&lt;p&gt;&lt;strong&gt;Pipeline success rate by definition:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;PipelineRuns&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;PipelineName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;RunOutcome&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="k"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;RunCount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
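&lt;p&gt;Reducing those grouped rows to a per-pipeline success ratio is then a few lines of Python. A sketch: the &lt;code&gt;success_rates&lt;/code&gt; name is mine, and the exact &lt;code&gt;RunOutcome&lt;/code&gt; label for successful runs is an assumption - check the distinct values your Analytics feed actually returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def success_rates(rows, success_outcome="Succeeded"):
    """Reduce grouped (PipelineName, RunOutcome, RunCount) rows to ratios."""
    totals, wins = {}, {}
    for row in rows:
        name = row["PipelineName"]
        totals[name] = totals.get(name, 0) + row["RunCount"]
        if row["RunOutcome"] == success_outcome:  # outcome label assumed
            wins[name] = wins.get(name, 0) + row["RunCount"]
    # Share of runs with the successful outcome, per pipeline.
    return {
        name: round(wins.get(name, 0) / total, 3)
        for name, total in totals.items()
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

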



&lt;h3&gt;
  
  
  OData Analytics advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Server-side filtering&lt;/strong&gt; - only the data you need crosses the network, making it much faster than pulling everything via REST API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregation support&lt;/strong&gt; - you can get counts, sums, and averages without downloading raw rows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical data&lt;/strong&gt; - snapshot entities provide daily point-in-time data that the REST API does not offer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refreshable in Excel&lt;/strong&gt; - set up the query once and refresh it whenever you need updated numbers&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Method 5 - Power BI as an intermediate step
&lt;/h2&gt;

&lt;p&gt;While this guide focuses on Excel, Power BI deserves mention because it has the most polished integration with Azure DevOps Analytics, and you can easily export from Power BI to Excel.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to use the Power BI route
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You need complex data transformations before the data lands in Excel&lt;/li&gt;
&lt;li&gt;You want to combine data from multiple Azure DevOps projects into a single report&lt;/li&gt;
&lt;li&gt;Your organization already has Power BI licenses&lt;/li&gt;
&lt;li&gt;You need scheduled automatic refreshes (Power BI Service can refresh on a schedule and email Excel exports)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step-by-step Power BI to Excel workflow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1 - Install the Azure DevOps Power BI connector&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open Power BI Desktop. Click &lt;strong&gt;Get Data &amp;gt; Online Services &amp;gt; Azure DevOps&lt;/strong&gt;. If you do not see it, update Power BI Desktop to the latest version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 - Connect and build your report&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enter your organization URL and authenticate. Select the data tables you need. Use Power Query to filter, join, and transform the data. Build visualizations if desired - or skip straight to the data model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 - Export to Excel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Power BI Desktop, go to the data view. Select the table you want to export. Click &lt;strong&gt;Export Data&lt;/strong&gt; and choose CSV or Excel format. Alternatively, publish the report to Power BI Service and use the &lt;strong&gt;Analyze in Excel&lt;/strong&gt; feature to create a live-connected Excel workbook.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4 - Schedule automatic exports (Power BI Service)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you need recurring Excel exports, publish your report to Power BI Service, set up scheduled refresh, and use Power Automate to email the refreshed data as an Excel attachment on a schedule.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 6 - Third-party tools and extensions
&lt;/h2&gt;

&lt;p&gt;The Visual Studio Marketplace and third-party ecosystem offer several tools that simplify Azure DevOps data export.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure DevOps Marketplace extensions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Enhanced Export (by Artiso)&lt;/strong&gt; - adds a one-click export button to query results with more formatting options than the built-in CSV export. Supports exporting to &lt;code&gt;.xlsx&lt;/code&gt; format directly with preserved column widths and header formatting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WIQL to OData (by Microsoft)&lt;/strong&gt; - converts work item queries to OData URLs that you can paste directly into Excel's OData connector. Useful if you are comfortable writing WIQL but not OData syntax.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Excel Export PRO&lt;/strong&gt; - a marketplace extension that supports exporting work items with parent-child hierarchies preserved in the Excel output, which the built-in CSV export flattens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Power Automate (formerly Microsoft Flow)
&lt;/h3&gt;

&lt;p&gt;Power Automate can create automated workflows that export Azure DevOps data to Excel on a schedule:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the &lt;strong&gt;Azure DevOps connector&lt;/strong&gt; trigger (e.g., "When a work item is updated")&lt;/li&gt;
&lt;li&gt;Add an &lt;strong&gt;Excel Online (Business)&lt;/strong&gt; action to write data to an Excel file in OneDrive or SharePoint&lt;/li&gt;
&lt;li&gt;Set up a recurrence trigger for daily or weekly full exports&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This approach is particularly useful for teams that want a continuously updated Excel dashboard without manual exports.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure CLI
&lt;/h3&gt;

&lt;p&gt;The Azure DevOps CLI extension (&lt;code&gt;az devops&lt;/code&gt;) provides command-line access to most Azure DevOps data. Combined with a few lines of Python (or a JSON processor such as &lt;code&gt;jq&lt;/code&gt;), you can script exports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the Azure DevOps extension&lt;/span&gt;
az extension add &lt;span class="nt"&gt;--name&lt;/span&gt; azure-devops

&lt;span class="c"&gt;# Login and set defaults&lt;/span&gt;
az devops configure &lt;span class="nt"&gt;--defaults&lt;/span&gt; &lt;span class="nv"&gt;organization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://dev.azure.com/myorg &lt;span class="nv"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myproject

&lt;span class="c"&gt;# Export work items from a query to JSON, then convert to CSV&lt;/span&gt;
az boards query &lt;span class="nt"&gt;--wiql&lt;/span&gt; &lt;span class="s2"&gt;"SELECT [System.Id], [System.Title], [System.State] FROM WorkItems WHERE [System.WorkItemType] = 'Bug'"&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; json | &lt;span class="se"&gt;\&lt;/span&gt;
  python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
import csv, json, sys

data = json.load(sys.stdin)
writer = csv.writer(sys.stdout)
writer.writerow(['ID', 'Title', 'State'])
for item in data:
    fields = item.get('fields', {})
    writer.writerow([
        item['id'],
        fields.get('System.Title', ''),
        fields.get('System.State', '')
    ])
"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bugs-export.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Exporting test results to Excel
&lt;/h2&gt;

&lt;p&gt;Test result data requires special handling because Azure DevOps stores it differently from work items.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using the REST API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;export_test_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Export test results for a specific test plan.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Get test suites in the plan
&lt;/span&gt;    &lt;span class="n"&gt;suites_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://dev.azure.com/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ORG&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROJECT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/_apis/testplan/Plans/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;plan_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/suites?api-version=7.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;suites&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suites_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;all_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;suite&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;suites&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
        &lt;span class="c1"&gt;# Get test points (test case + configuration pairs)
&lt;/span&gt;        &lt;span class="n"&gt;points_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://dev.azure.com/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ORG&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROJECT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/_apis/testplan/Plans/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;plan_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/suites/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/TestPoint?api-version=7.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;points&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;points_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;point&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;point&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
            &lt;span class="n"&gt;all_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SuiteName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TestCaseId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;point&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;testCaseReference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TestCaseName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;point&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;testCaseReference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Configuration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;point&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configuration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lastResultState&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Not Run&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tester&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;point&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tester&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;displayName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LastRunDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lastResultDetails&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dateCompleted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;all_results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using the OData Analytics feed
&lt;/h3&gt;

&lt;p&gt;For aggregated test data - especially trend analysis - the OData feed is more efficient:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;azure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;_odata&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;preview&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;TestResultsDaily&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;
&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DateSK&lt;/span&gt; &lt;span class="n"&gt;ge&lt;/span&gt; &lt;span class="mi"&gt;20260101&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;DateSK&lt;/span&gt; &lt;span class="n"&gt;le&lt;/span&gt; &lt;span class="mi"&gt;20260320&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;DateSK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;TestOutcome&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="k"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ResultCount&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;TotalCount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;orderby&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DateSK&lt;/span&gt; &lt;span class="k"&gt;asc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Paste this URL into Excel's OData connector to get a daily breakdown of test pass/fail counts that you can chart over time.&lt;/p&gt;
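&lt;p&gt;If you prefer a script over Excel's connector, the same response is easy to flatten. A minimal sketch, assuming the response has the standard OData shape (a top-level &lt;code&gt;value&lt;/code&gt; array whose objects carry the &lt;code&gt;DateSK&lt;/code&gt;, &lt;code&gt;TestOutcome&lt;/code&gt;, and &lt;code&gt;TotalCount&lt;/code&gt; properties produced by the query above):&lt;/p&gt;

```python
import csv
import io

def odata_rows_to_csv(payload):
    """Flatten a groupby/aggregate OData response into CSV text.

    Expects {"value": [{"DateSK": ..., "TestOutcome": ...,
    "TotalCount": ...}, ...]} as returned by the query above.
    """
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["DateSK", "TestOutcome", "TotalCount"])
    for row in payload.get("value", []):
        writer.writerow([
            row.get("DateSK"),
            row.get("TestOutcome"),
            row.get("TotalCount"),
        ])
    return out.getvalue()
```

&lt;p&gt;Feed it the parsed JSON from a &lt;code&gt;requests.get&lt;/code&gt; call against the Analytics URL and write the result to a &lt;code&gt;.csv&lt;/code&gt; file that Excel opens directly.&lt;/p&gt;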

&lt;h2&gt;
  
  
  Best practices for Azure DevOps data exports
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Security considerations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Never hard-code PATs in scripts&lt;/strong&gt; - use environment variables, Azure Key Vault, or credential managers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use minimum-scope PATs&lt;/strong&gt; - if you only need to export work items, do not create a full-access token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rotate PATs regularly&lt;/strong&gt; - set expiration dates of 90 days or less&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit exports for sensitive data&lt;/strong&gt; - work item descriptions and comments may contain customer information, credentials, or internal URLs that should not be shared in Excel files sent via email&lt;/li&gt;
&lt;/ul&gt;
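&lt;p&gt;As a minimal sketch of the first point: read the PAT from an environment variable (the &lt;code&gt;AZURE_DEVOPS_PAT&lt;/code&gt; name here is just a convention, not anything Azure DevOps mandates) and build the Basic auth header the REST API expects, which is the Base64 encoding of &lt;code&gt;:&lt;/code&gt; plus the token:&lt;/p&gt;

```python
import base64
import os

def build_auth_header(pat: str) -> dict:
    """Azure DevOps PAT auth: Basic base64(':' + PAT)."""
    token = base64.b64encode(f":{pat}".encode("ascii")).decode("ascii")
    return {"Authorization": f"Basic {token}"}

# Read the PAT from the environment instead of hard-coding it.
pat = os.environ.get("AZURE_DEVOPS_PAT", "")
headers = build_auth_header(pat)
```

&lt;p&gt;Pass &lt;code&gt;headers&lt;/code&gt; to every &lt;code&gt;requests&lt;/code&gt; call; if the token ever leaks into a committed script, revoke it immediately and rotate.&lt;/p&gt;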

&lt;h3&gt;
  
  
  Performance optimization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use OData &lt;code&gt;$select&lt;/code&gt;&lt;/strong&gt; - only request the fields you need instead of pulling entire entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apply &lt;code&gt;$filter&lt;/code&gt; server-side&lt;/strong&gt; - filtering in Excel after downloading all data wastes bandwidth and time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch REST API requests&lt;/strong&gt; - use the batch endpoint (&lt;code&gt;_apis/wit/workitemsbatch&lt;/code&gt;) instead of individual GET requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache results locally&lt;/strong&gt; - if you run the same export daily, compare timestamps and only pull changed items using the &lt;code&gt;ChangedDate&lt;/code&gt; field&lt;/li&gt;
&lt;/ul&gt;
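&lt;p&gt;To illustrate the batching point: the &lt;code&gt;workitemsbatch&lt;/code&gt; endpoint takes a POST body with an &lt;code&gt;ids&lt;/code&gt; list and a &lt;code&gt;fields&lt;/code&gt; list, and accepts at most 200 IDs per call, so large ID sets need to be chunked first. A sketch of the request construction (the actual POST call is shown in a comment):&lt;/p&gt;

```python
def chunk_ids(ids, size=200):
    """Split IDs into batches; workitemsbatch caps each call at 200 IDs."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def build_batch_payload(ids, fields):
    """Body for POST .../_apis/wit/workitemsbatch?api-version=7.1."""
    return {"ids": ids, "fields": fields}

# For each chunk, one request replaces up to 200 individual GETs:
# for batch in chunk_ids(all_ids):
#     resp = requests.post(batch_url, headers=headers,
#                          json=build_batch_payload(batch, wanted_fields))
#     rows.extend(resp.json()["value"])
```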

&lt;h3&gt;
  
  
  Data freshness
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;REST API&lt;/strong&gt; - real-time data, reflects the current state immediately&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics OData&lt;/strong&gt; - data is typically 2-5 minutes behind, with some entities refreshing every 24 hours (snapshots)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in Excel export&lt;/strong&gt; - real-time data at the moment of export, but static until you refresh&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power BI&lt;/strong&gt; - depends on your configured refresh schedule (minimum 30 minutes on Power BI Service)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Handling large datasets
&lt;/h3&gt;

&lt;p&gt;If your Azure DevOps project has tens of thousands of work items or months of pipeline history:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Partition your queries&lt;/strong&gt; - split by area path, iteration, date range, or work item type instead of exporting everything at once&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use continuation tokens&lt;/strong&gt; - the REST API returns a &lt;code&gt;x-ms-continuationtoken&lt;/code&gt; header when more results are available. Always check for it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider Azure Data Factory&lt;/strong&gt; - for truly large-scale exports (millions of rows), Azure Data Factory has native Azure DevOps connectors and can write directly to Azure SQL or Data Lake, which Excel can then query&lt;/li&gt;
&lt;/ol&gt;
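&lt;p&gt;The continuation-token loop from step 2 can be sketched as a small helper. To keep it testable it takes the page-fetching function as a parameter; a real &lt;code&gt;fetch_page&lt;/code&gt; built on &lt;code&gt;requests&lt;/code&gt; is shown in the comment:&lt;/p&gt;

```python
def fetch_all_pages(fetch_page):
    """Collect every page by following x-ms-continuationtoken headers.

    fetch_page(token) must return (items, next_token); next_token is
    None when the header is absent, i.e. on the last page.
    """
    items, token = [], None
    while True:
        page, token = fetch_page(token)
        items.extend(page)
        if not token:
            return items

# With requests, fetch_page would look roughly like:
# def fetch_page(token):
#     params = {"api-version": "7.1"}
#     if token:
#         params["continuationToken"] = token
#     resp = requests.get(url, headers=headers, params=params)
#     return resp.json()["value"], resp.headers.get("x-ms-continuationtoken")
```

&lt;p&gt;Skipping the token check is the most common cause of silently truncated exports: the first page comes back fine, and the missing rows only surface when someone compares totals against the Azure DevOps UI.&lt;/p&gt;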

&lt;h2&gt;
  
  
  Comparison of all six methods
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Data Types&lt;/th&gt;
&lt;th&gt;Automation&lt;/th&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Difficulty&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Open in Excel&lt;/td&gt;
&lt;td&gt;Work items only&lt;/td&gt;
&lt;td&gt;Manual with refresh&lt;/td&gt;
&lt;td&gt;Windows only&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CSV export&lt;/td&gt;
&lt;td&gt;Work items only&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Any browser&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;REST API&lt;/td&gt;
&lt;td&gt;Everything&lt;/td&gt;
&lt;td&gt;Fully scriptable&lt;/td&gt;
&lt;td&gt;Any platform&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OData Analytics&lt;/td&gt;
&lt;td&gt;Work items, pipelines, tests&lt;/td&gt;
&lt;td&gt;Refreshable in Excel&lt;/td&gt;
&lt;td&gt;Any platform&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power BI&lt;/td&gt;
&lt;td&gt;Everything via Analytics&lt;/td&gt;
&lt;td&gt;Scheduled refresh&lt;/td&gt;
&lt;td&gt;Windows (Desktop)&lt;/td&gt;
&lt;td&gt;Medium-Hard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Third-party tools&lt;/td&gt;
&lt;td&gt;Varies by tool&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;Easy-Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For most teams, the recommended approach is:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the &lt;strong&gt;built-in CSV export&lt;/strong&gt; for quick one-time exports of work item queries&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;OData Analytics in Excel&lt;/strong&gt; for recurring reports that need refreshable data&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;REST API scripts&lt;/strong&gt; when you need pipeline, PR, or code review data that the built-in tools do not cover&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;Power BI&lt;/strong&gt; when you need to combine data from multiple projects or create organization-wide dashboards that ultimately get shared as Excel files&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Troubleshooting common issues
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Open in Excel" button is missing&lt;/strong&gt; - install the Azure DevOps Office Integration 2019 plugin from the Visual Studio Marketplace. If already installed, check that your Excel version is 2016 or later and that the Team add-in is enabled in Excel's Add-ins manager (File &amp;gt; Options &amp;gt; Add-ins).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CSV export has garbled characters&lt;/strong&gt; - the file is UTF-8 encoded but Excel may default to a different encoding. Use Data &amp;gt; From Text/CSV and explicitly select UTF-8 encoding during import.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OData feed returns 401 Unauthorized&lt;/strong&gt; - verify your PAT has the Analytics read scope enabled. Go to your PAT settings and ensure the "Analytics (read)" permission is checked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;REST API returns 203 Non-Authoritative&lt;/strong&gt; - this usually means your PAT has expired or the Authorization header is malformed. Regenerate the PAT and double-check the Base64 encoding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Excel Power Query times out&lt;/strong&gt; - the OData query is returning too much data. Add &lt;code&gt;$filter&lt;/code&gt; and &lt;code&gt;$select&lt;/code&gt; parameters to reduce the result set. Also check your network connection and proxy settings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data does not match Azure DevOps UI&lt;/strong&gt; - Analytics data can lag 2-5 minutes behind the live system. For real-time accuracy, use the REST API instead of Analytics OData feeds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Exporting Azure DevOps data to Excel does not have to be painful. For work items, the built-in tools handle most use cases with a few clicks. For pipeline data, test results, and code review audit trails, the REST API and OData Analytics feeds give you access to everything Azure DevOps stores - you just need a script and a PAT.&lt;/p&gt;

&lt;p&gt;The key is choosing the right method for your specific use case. One-time exports are best served by CSV. Recurring reports belong in OData-connected Excel workbooks. Automated compliance workflows should use Power Automate or scheduled scripts. And large-scale analytics across multiple projects should flow through Power BI before landing in Excel.&lt;/p&gt;

&lt;p&gt;Start with the simplest method that meets your requirements, and only move to more complex approaches when you hit limitations with the simpler ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Can I export Azure DevOps work items directly to Excel?
&lt;/h3&gt;

&lt;p&gt;Yes. Azure DevOps has a built-in Open in Excel button on the Queries page that launches a connected Excel session via the Azure DevOps Office Integration plugin. You can also export query results as a CSV file and open that in Excel manually. Both methods work with any work item query you have saved or created.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I install the Azure DevOps Excel plugin?
&lt;/h3&gt;

&lt;p&gt;The Azure DevOps Office Integration plugin is included with Visual Studio 2019 and later. If you only have Excel, download the standalone Azure DevOps Office Integration installer from the Visual Studio Marketplace. After installation, you will see a Team ribbon tab in Excel with options to connect to Azure DevOps and pull work item data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I export Azure DevOps pipeline data to Excel?
&lt;/h3&gt;

&lt;p&gt;Azure DevOps does not have a built-in one-click export for pipeline data. You need to use the REST API to pull pipeline runs, stages, jobs, and task results as JSON, then convert that JSON to CSV or load it directly into Excel using Power Query. The Analytics service also exposes pipeline data through OData feeds that Excel can consume.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the row limit when exporting Azure DevOps data to Excel?
&lt;/h3&gt;

&lt;p&gt;The built-in Open in Excel feature supports up to 50,000 work items per query. CSV exports from the web UI are also capped at the query result limit. If you need more than 50,000 rows, use the REST API with pagination (continuation tokens) or connect through Power BI and OData, which can handle millions of records.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I export Azure DevOps test results to Excel?
&lt;/h3&gt;

&lt;p&gt;Use the Test Results REST API endpoint to pull test run data as JSON. You can also use the Analytics OData feed at &lt;a href="https://analytics.dev.azure.com/%7Borg%7D/%7Bproject%7D/_odata/v4.0-preview/TestResultsDaily" rel="noopener noreferrer"&gt;https://analytics.dev.azure.com/{org}/{project}/_odata/v4.0-preview/TestResultsDaily&lt;/a&gt;. Both methods let you load data into Excel via Power Query or convert JSON to CSV with a script.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Power BI required to export Azure DevOps analytics data?
&lt;/h3&gt;

&lt;p&gt;No. Power BI is one option, but you can also connect Excel directly to Azure DevOps Analytics using OData feeds. In Excel, go to Data &amp;gt; Get Data &amp;gt; From OData Feed and paste your Analytics URL. Excel will load the data into a table that you can refresh, filter, and pivot without needing Power BI at all.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://aicodereview.cc/blog/export-azure-devops-data-excel/" rel="noopener noreferrer"&gt;aicodereview.cc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Claude Code vs Codex CLI vs Gemini CLI: Which AI Terminal Agent Wins in 2026?</title>
      <dc:creator>Rahul Singh</dc:creator>
      <pubDate>Sat, 11 Apr 2026 06:00:00 +0000</pubDate>
      <link>https://dev.to/rahulxsingh/claude-code-vs-codex-cli-vs-gemini-cli-which-ai-terminal-agent-wins-in-2026-55f5</link>
      <guid>https://dev.to/rahulxsingh/claude-code-vs-codex-cli-vs-gemini-cli-which-ai-terminal-agent-wins-in-2026-55f5</guid>
      <description>&lt;h2&gt;
  
  
  Quick verdict
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/tool/claude-code/"&gt;Claude Code&lt;/a&gt; is the most capable AI terminal coding agent in 2026, offering the deepest code reasoning, best multi-file editing, and a proven multi-agent Code Review system. &lt;a href="https://dev.to/tool/openai-codex/"&gt;Codex CLI&lt;/a&gt; is the best free, open-source option with strong autonomous task execution in sandboxed environments. &lt;a href="https://dev.to/tool/gemini-code-assist/"&gt;Gemini CLI&lt;/a&gt; wins on context window size and free-tier generosity, making it ideal for large codebases and budget-conscious developers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Claude Code if:&lt;/strong&gt; You want the best code reasoning, multi-file editing, and code review capabilities and are willing to pay for a premium experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Codex CLI if:&lt;/strong&gt; You want an open-source CLI with autonomous cloud execution, parallel task support, and you already use ChatGPT or OpenAI's API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Gemini CLI if:&lt;/strong&gt; You need the largest context window (1M tokens), want the most generous free tier, or your team is invested in the Google Cloud ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI terminal agents matter
&lt;/h2&gt;

&lt;p&gt;AI coding has moved beyond autocomplete. The latest generation of AI tools runs directly in your terminal, reads your entire codebase, edits multiple files, runs tests, commits changes, and even reviews pull requests - all from the command line.&lt;/p&gt;

&lt;p&gt;Three tools dominate this space in 2026: Claude Code from Anthropic, Codex CLI from OpenAI, and Gemini CLI from Google. Each takes a different approach to terminal-based AI coding, and choosing the wrong one can cost your team hours of productivity every week.&lt;/p&gt;

&lt;p&gt;This comparison breaks down every meaningful difference so you can pick the right tool for your workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  At-a-glance comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Codex CLI&lt;/th&gt;
&lt;th&gt;Gemini CLI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Developer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Underlying model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Opus 4.6 / Sonnet 4.6&lt;/td&gt;
&lt;td&gt;GPT-5-Codex&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Pro / Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200K tokens&lt;/td&gt;
&lt;td&gt;128K-200K tokens&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free tier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No (Pro from $20/mo)&lt;/td&gt;
&lt;td&gt;Open-source CLI (API costs apply)&lt;/td&gt;
&lt;td&gt;180K completions/month + 240 chats/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Starting price&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$20/month (Pro)&lt;/td&gt;
&lt;td&gt;$20/month (ChatGPT Plus)&lt;/td&gt;
&lt;td&gt;Free / $19/user/month (Standard)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (Apache 2.0)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-file editing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes - best in class&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code review&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent PR review&lt;/td&gt;
&lt;td&gt;PR review via GitHub&lt;/td&gt;
&lt;td&gt;Automated PR summaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sandboxing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local with permission system&lt;/td&gt;
&lt;td&gt;Cloud sandboxes + local&lt;/td&gt;
&lt;td&gt;Local execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native (protocol creator)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Git integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deep GitHub + local git&lt;/td&gt;
&lt;td&gt;GitHub-focused&lt;/td&gt;
&lt;td&gt;GitHub integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IDE extensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VS Code + JetBrains&lt;/td&gt;
&lt;td&gt;VS Code, Cursor, Windsurf&lt;/td&gt;
&lt;td&gt;VS Code, JetBrains, Android Studio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extended thinking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent Teams (sub-agents)&lt;/td&gt;
&lt;td&gt;Parallel cloud sandboxes&lt;/td&gt;
&lt;td&gt;Agent mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Headless / CI mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (GitHub Action)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SWE-bench score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Top tier&lt;/td&gt;
&lt;td&gt;State-of-the-art&lt;/td&gt;
&lt;td&gt;63.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Installation and setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Claude Code
&lt;/h3&gt;

&lt;p&gt;Claude Code installs via npm and runs as a standalone CLI. Setup takes under two minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @anthropic-ai/claude-code
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On first run, it authenticates through your Anthropic account or API key. You can use it with a Pro subscription ($20/month), a Max plan ($100-$200/month), or pay-per-token via the API. The CLI works on macOS and Linux natively, with Windows support through WSL2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Codex CLI
&lt;/h3&gt;

&lt;p&gt;Codex CLI is open source and built in Rust for speed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @openai/codex
codex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It requires an OpenAI API key, which you set as the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; environment variable. Alternatively, you can access Codex through a ChatGPT Plus subscription for cloud-based task execution. The CLI runs on macOS, Linux, and Windows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemini CLI
&lt;/h3&gt;

&lt;p&gt;Gemini CLI installs via npm and authenticates through a Google account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
gemini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The free tier requires no credit card or Google Cloud project - just a Google account. For team features and higher limits, you need a Standard ($19/user/month) or Enterprise ($45/user/month) plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: Gemini CLI&lt;/strong&gt; for the frictionless free setup. Codex CLI earns points for being open source. Claude Code's install is simple but requires a paid plan to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context window and codebase understanding
&lt;/h2&gt;

&lt;p&gt;The context window determines how much of your codebase the AI can "see" at once. This matters enormously for large projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini CLI&lt;/strong&gt; leads with a 1M token context window - roughly 3 to 4 million characters of code. This is enough to hold an entire mid-sized codebase in context without any chunking or summarization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; offers 200K tokens, which covers most individual features or modules comfortably. It compensates for the smaller window with intelligent codebase indexing and the ability to spawn sub-agents that explore different parts of your project in parallel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex CLI&lt;/strong&gt; supports 128K to 200K tokens depending on the model. It uses repository mapping and retrieval-augmented generation to find relevant code beyond the immediate context window.&lt;/p&gt;

&lt;p&gt;In practice, the raw context window size matters less than how intelligently each tool uses it. Claude Code's 200K window with strong reasoning often produces better results than Gemini CLI's 1M window on tasks where deep understanding trumps breadth. But for tasks like "refactor all API endpoints across 50 files," Gemini CLI's larger window gives it a genuine edge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: Gemini CLI&lt;/strong&gt; on raw capacity. Claude Code on effective use of context for complex reasoning tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code generation quality
&lt;/h2&gt;

&lt;p&gt;This is where the underlying models make the biggest difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; produces the most consistently correct and well-structured code. Claude Opus 4.6 excels at understanding complex requirements, generating idiomatic code, handling edge cases, and writing code that follows existing project conventions. The extended thinking feature lets it reason through multi-step problems before writing a single line, which dramatically reduces bugs in complex implementations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex CLI&lt;/strong&gt; generates strong code through GPT-5-Codex, which achieves state-of-the-art scores on SWE-bench. It is particularly good at autonomous task execution - you can describe a feature and it will write the code, create tests, and verify everything passes. The quality is high for straightforward tasks but can struggle with highly nuanced architectural decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini CLI&lt;/strong&gt; produces good code with Gemini 2.5 Pro, especially for Google Cloud services, Android development, and Python. The large context window helps it maintain consistency across large changes. However, users report occasional hallucinations in generated code, particularly for less common libraries or frameworks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: Claude Code&lt;/strong&gt; for overall code quality and reasoning depth. Codex CLI for autonomous task completion. Gemini CLI for Google ecosystem work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-file editing and refactoring
&lt;/h2&gt;

&lt;p&gt;Real-world coding involves changing multiple files simultaneously. All three tools handle this, but the experience differs significantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; is the clear leader here. It can read your entire project structure, understand the relationships between files, and make coordinated changes across dozens of files while maintaining consistency. Renaming a function, updating all call sites, adjusting tests, and modifying documentation happens in a single interaction. The sub-agent system lets it delegate different parts of a large refactoring to parallel workers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex CLI&lt;/strong&gt; handles multi-file editing through its autonomous task execution. You describe the change, and it works through the files systematically in a sandboxed environment. The isolated Git worktree approach means your working directory stays clean while Codex makes changes in the background. The trade-off is less interactive control during the process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini CLI&lt;/strong&gt; supports multi-file editing through its agent mode, and the 1M token context window means it can hold more files in memory simultaneously. However, the actual edit coordination is less refined than Claude Code's, and complex cross-file refactoring sometimes requires multiple prompts to get right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: Claude Code&lt;/strong&gt; by a significant margin. Its multi-file editing and refactoring capabilities are the most reliable and comprehensive of the three.&lt;/p&gt;

&lt;h2&gt;
  
  
  Git integration and workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Claude Code
&lt;/h3&gt;

&lt;p&gt;Claude Code has the deepest Git integration of the three. It understands your Git history, can create branches, stage changes, write commit messages, create pull requests, and even resolve merge conflicts. The multi-agent Code Review feature runs parallel AI agents to review PRs, with Anthropic reporting that it raised the share of internal PRs receiving substantive review comments from 16% to 54%, with less than 1% of findings being incorrect.&lt;/p&gt;

&lt;p&gt;You can run Claude Code in headless mode in CI/CD pipelines, making it useful for automated code review on every PR.&lt;/p&gt;
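&lt;p&gt;As a sketch, a CI step can call the CLI in non-interactive print mode (the prompt text is illustrative, and flags can change between releases, so check &lt;code&gt;claude --help&lt;/code&gt;):&lt;/p&gt;

```shell
# -p (print mode) runs a single prompt non-interactively and exits,
# which is what CI pipelines need. Authentication comes from an
# ANTHROPIC_API_KEY environment variable provided via CI secrets.
claude -p "Review the changes on this branch and list potential bugs"
```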

&lt;h3&gt;
  
  
  Codex CLI
&lt;/h3&gt;

&lt;p&gt;Codex CLI integrates tightly with GitHub. You can trigger tasks from PR comments using &lt;a class="mentioned-user" href="https://dev.to/codex"&gt;@codex&lt;/a&gt;, and it has a GitHub Action for CI/CD integration. Cloud-based tasks run in isolated sandboxes with their own Git worktrees, so multiple tasks can work on different branches simultaneously. The PR review capability is functional but oriented more toward autonomous fixes than detailed review commentary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemini CLI
&lt;/h3&gt;

&lt;p&gt;Gemini CLI provides GitHub integration with automated PR summaries and review comments. The integration is straightforward but not as deep as Claude Code's multi-agent review system. There is no headless or CI/CD mode, which limits automation possibilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: Claude Code&lt;/strong&gt; for the deepest Git workflow integration and most capable code review. Codex CLI for autonomous GitHub-based task execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sandboxing and safety
&lt;/h2&gt;

&lt;p&gt;Running AI-generated code carries risk. Each tool handles safety differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex CLI&lt;/strong&gt; has the strongest sandboxing. Cloud tasks run in fully isolated sandbox environments with their own file systems and network access. Locally, you can configure different permission levels - from read-only to full autonomy. The sandbox approach means a misbehaving AI cannot corrupt your working directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; uses a permission-based system. It asks before reading sensitive files, running commands, or making changes. You can configure permission levels to auto-approve safe operations while requiring confirmation for destructive ones. Hooks let you add pre/post action automation for additional guardrails. However, it runs locally by default, not in an isolated sandbox.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini CLI&lt;/strong&gt; runs locally with standard permission prompts. There is no sandboxed execution environment - it operates directly on your file system with whatever permissions your terminal session has.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: Codex CLI&lt;/strong&gt; for its cloud sandbox isolation. Claude Code for its configurable permission system. Gemini CLI trails here with minimal safety features.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP (Model Context Protocol) support
&lt;/h2&gt;

&lt;p&gt;MCP lets AI tools connect to external data sources, APIs, and services. This is increasingly important as developers want their AI coding agents to access databases, documentation, monitoring systems, and other tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; has the most mature MCP support, which makes sense since Anthropic created the Model Context Protocol. It can connect to any MCP server natively, with a large and growing ecosystem of available servers for databases, APIs, documentation, and more. The integration is seamless - you configure MCP servers in your project settings and Claude Code can pull data from them during coding sessions.&lt;/p&gt;
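&lt;p&gt;As an illustration, servers can also be registered from the CLI itself (the server name and command below are hypothetical; exact syntax may differ by version, so check &lt;code&gt;claude mcp --help&lt;/code&gt;):&lt;/p&gt;

```shell
# Register a hypothetical local MCP server under the name "docs".
claude mcp add docs -- npx -y example-docs-mcp-server

# List configured servers to confirm the registration.
claude mcp list
```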

&lt;p&gt;&lt;strong&gt;Codex CLI&lt;/strong&gt; added MCP server support for extensibility. The implementation is functional and growing, though the ecosystem of available Codex-compatible MCP servers is smaller than Claude Code's.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini CLI&lt;/strong&gt; supports MCP for connecting to external tools and data sources. Google has been expanding its MCP support, but the ecosystem is still catching up to Anthropic's.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: Claude Code&lt;/strong&gt; as the creator and most mature implementer of MCP. All three support it, but Claude Code's ecosystem is the most developed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan tier&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Codex CLI&lt;/th&gt;
&lt;th&gt;Gemini CLI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No free tier&lt;/td&gt;
&lt;td&gt;CLI is free (API costs apply)&lt;/td&gt;
&lt;td&gt;180K completions/month + 240 chats/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Individual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$20/month (Pro)&lt;/td&gt;
&lt;td&gt;$20/month (ChatGPT Plus)&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Power user&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$100-$200/month (Max)&lt;/td&gt;
&lt;td&gt;$200/month (ChatGPT Pro)&lt;/td&gt;
&lt;td&gt;$19/user/month (Standard)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Team&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$25-$150/user/month&lt;/td&gt;
&lt;td&gt;$25/user/month&lt;/td&gt;
&lt;td&gt;$19/user/month (Standard)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enterprise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom&lt;/td&gt;
&lt;td&gt;Custom&lt;/td&gt;
&lt;td&gt;$45/user/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$3-$25/M tokens&lt;/td&gt;
&lt;td&gt;Usage-based&lt;/td&gt;
&lt;td&gt;Usage-based&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Gemini CLI&lt;/strong&gt; is the clear pricing winner with its generous free tier. For a 10-person team, Gemini CLI at $19/user/month ($190/month) is significantly cheaper than Claude Code at $25-$150/user/month ($250-$1,500/month) or Codex CLI at $25/user/month ($250/month).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex CLI&lt;/strong&gt; offers a unique value proposition as an open-source tool. If you only need occasional terminal AI assistance and already have OpenAI API credits, the per-token cost can be very low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; is the most expensive option, but the API usage-based pricing gives flexibility. Light users can spend $5-$10/month on API calls, while heavy users might spend $50+ per day on complex tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: Gemini CLI&lt;/strong&gt; on value. Codex CLI for open-source flexibility. Claude Code demands a premium but delivers premium results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extended thinking and reasoning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; is the only tool with a dedicated extended thinking mode. When you give it a complex task - debugging a race condition, designing a system architecture, or refactoring a tightly coupled module - it can activate extended thinking to reason through the problem step by step before acting. This produces noticeably better results on hard problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex CLI&lt;/strong&gt; and &lt;strong&gt;Gemini CLI&lt;/strong&gt; do not expose an equivalent feature. Their underlying models do reason internally, but neither CLI offers a user-controllable mode that allocates extra deliberate reasoning to hard problems the way extended thinking does.&lt;/p&gt;

&lt;p&gt;For simple tasks like "add a loading spinner to this component," the difference is negligible. For complex tasks like "refactor this authentication system to support SAML SSO," extended thinking gives Claude Code a meaningful advantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: Claude Code&lt;/strong&gt; - no contest on this dimension.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world performance
&lt;/h2&gt;

&lt;p&gt;Beyond benchmarks, here is how each tool performs in daily development work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code in practice
&lt;/h3&gt;

&lt;p&gt;Claude Code feels like a senior developer who lives in your terminal. It understands project structure intuitively, asks clarifying questions when requirements are ambiguous, and makes changes that respect existing code conventions. The Agent Teams feature lets you spin up multiple agents for parallel work on a large task. The main friction points are rate limits on the Pro plan during heavy sessions and the learning curve for developers not comfortable with CLIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Codex CLI in practice
&lt;/h3&gt;

&lt;p&gt;Codex CLI excels at "fire and forget" tasks. You describe what you want, and it works autonomously in a cloud sandbox while you continue other work. The parallel task execution is genuinely useful - you can queue up five bug fixes and review the results as each one completes. The main downsides are GitHub-only integration (no GitLab or Bitbucket), occasional latency with cloud tasks, and usage limits on the Plus plan.&lt;/p&gt;
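&lt;p&gt;A rough local-script sketch of that queuing pattern (subcommand names may differ between releases, so treat this as illustrative rather than exact):&lt;/p&gt;

```shell
# Launch several independent fixes non-interactively in the background;
# "wait" blocks until all of them have finished.
codex exec "Fix the null check in the auth middleware" &amp;
codex exec "Add retries to the payment webhook handler" &amp;
wait
```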

&lt;h3&gt;
  
  
  Gemini CLI in practice
&lt;/h3&gt;

&lt;p&gt;Gemini CLI impresses with its free tier and large context window. For developers working on large codebases, the ability to load nearly everything into context reduces the "lost context" problems that plague smaller-window tools. The Google Cloud integration is excellent if you are building on GCP. The main weaknesses are occasional hallucinations, slower response times compared to Claude Code, and the lack of a headless CI/CD mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who should use what
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choose Claude Code if you:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Want the best code reasoning and generation quality available&lt;/li&gt;
&lt;li&gt;Need reliable multi-file editing and complex refactoring&lt;/li&gt;
&lt;li&gt;Value multi-agent Code Review for your PR workflow&lt;/li&gt;
&lt;li&gt;Need CI/CD integration via headless mode&lt;/li&gt;
&lt;li&gt;Are willing to pay a premium for premium results&lt;/li&gt;
&lt;li&gt;Want the most mature MCP ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose Codex CLI if you:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Prefer open-source tools you can inspect and modify&lt;/li&gt;
&lt;li&gt;Want autonomous cloud-based task execution&lt;/li&gt;
&lt;li&gt;Need to run multiple coding tasks in parallel&lt;/li&gt;
&lt;li&gt;Are already in the OpenAI/ChatGPT ecosystem&lt;/li&gt;
&lt;li&gt;Want fire-and-forget task queuing&lt;/li&gt;
&lt;li&gt;Need Windows support without WSL&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose Gemini CLI if you:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Need the largest context window for massive codebases&lt;/li&gt;
&lt;li&gt;Want the best free tier for individual use&lt;/li&gt;
&lt;li&gt;Build on Google Cloud Platform&lt;/li&gt;
&lt;li&gt;Are budget-conscious with a team needing AI coding tools&lt;/li&gt;
&lt;li&gt;Work primarily with Python, Java, or Go&lt;/li&gt;
&lt;li&gt;Want a low-friction entry point to AI terminal coding&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;All three AI terminal agents are capable tools that can meaningfully accelerate your development workflow. The right choice depends on your priorities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For code quality and reasoning:&lt;/strong&gt; &lt;a href="https://dev.to/tool/claude-code/"&gt;Claude Code&lt;/a&gt; wins. Its extended thinking, multi-agent architecture, and superior code comprehension make it the best tool for complex, real-world development tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For autonomous execution and open-source values:&lt;/strong&gt; &lt;a href="https://dev.to/tool/openai-codex/"&gt;Codex CLI&lt;/a&gt; wins. Its sandboxed cloud execution, parallel task support, and Apache 2.0 license make it uniquely flexible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For budget and context window:&lt;/strong&gt; &lt;a href="https://dev.to/tool/gemini-code-assist/"&gt;Gemini CLI&lt;/a&gt; wins. The 1M token context window and generous free tier make it the most accessible and cost-effective option.&lt;/p&gt;

&lt;p&gt;If budget is not a constraint and you want the single best AI coding experience in your terminal, Claude Code is the tool to pick. If you are evaluating for a team, consider starting with Gemini CLI's free tier to validate the workflow, then upgrading to Claude Code or Codex CLI once you understand your team's usage patterns.&lt;/p&gt;

&lt;p&gt;For teams that also need &lt;a href="https://dev.to/blog/how-to-automate-code-review/"&gt;automated code review&lt;/a&gt;, Claude Code's multi-agent review system or a dedicated tool like &lt;a href="https://dev.to/tool/coderabbit/"&gt;CodeRabbit&lt;/a&gt; will give you the deepest PR feedback. You can also explore our roundup of the &lt;a href="https://dev.to/blog/best-ai-code-review-tools/"&gt;best AI code review tools&lt;/a&gt; for more options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Which AI terminal coding agent is best overall in 2026?
&lt;/h3&gt;

&lt;p&gt;Claude Code is the best overall AI terminal coding agent in 2026. It offers the deepest code reasoning, the most mature multi-file editing, built-in extended thinking for complex tasks, and a proven multi-agent Code Review feature. Codex CLI is better if you want a free open-source tool for quick edits, and Gemini CLI is the best option if you need a massive 1M token context window on a budget.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Codex CLI free to use?
&lt;/h3&gt;

&lt;p&gt;Yes, Codex CLI is fully open source under the Apache 2.0 license. However, you still need an OpenAI API key with credits to run it since it calls OpenAI models. The CLI itself costs nothing, but API usage is billed per token. You can also use it through a ChatGPT Plus subscription at $20/month, which gives you 30 to 150 tasks per 5-hour window depending on the model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can Claude Code, Codex CLI, and Gemini CLI all do code review?
&lt;/h3&gt;

&lt;p&gt;All three can analyze code and suggest improvements, but their approaches differ. Claude Code has a dedicated multi-agent Code Review feature that runs multiple AI agents in parallel to review pull requests, catching subtle bugs while keeping incorrect findings below 1%, per Anthropic's internal reporting. Codex CLI can review PRs through its GitHub integration and by mentioning &lt;a class="mentioned-user" href="https://dev.to/codex"&gt;@codex&lt;/a&gt; in PR comments. Gemini CLI integrates with GitHub for automated PR summaries and review comments. For dedicated code review, Claude Code is the most capable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which AI CLI tool has the largest context window?
&lt;/h3&gt;

&lt;p&gt;Gemini CLI has the largest context window at 1 million tokens, powered by Gemini 2.5 Pro. Claude Code supports up to 200K tokens with Claude Opus 4.6. Codex CLI varies by model but typically supports 128K to 200K tokens with GPT-5-Codex. For extremely large codebases where you need full-repository context, Gemini CLI has a significant advantage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do these AI CLI tools support MCP (Model Context Protocol)?
&lt;/h3&gt;

&lt;p&gt;Yes, all three support MCP to varying degrees. Claude Code has the deepest MCP integration since Anthropic created the protocol - it can connect to databases, APIs, documentation servers, and custom tools natively. Codex CLI added MCP server support for extensibility. Gemini CLI also supports MCP for connecting to external tools and data sources. Claude Code's MCP ecosystem is the most mature with the largest number of available servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which AI terminal agent is best for large monorepos?
&lt;/h3&gt;

&lt;p&gt;Gemini CLI is the strongest choice for very large monorepos thanks to its 1M token context window, which lets it hold significantly more code in memory at once. Claude Code compensates with intelligent codebase indexing and sub-agent spawning that can explore different parts of a monorepo in parallel. Codex CLI handles monorepos through isolated Git worktrees but has a smaller context window. For monorepos under 200K tokens, Claude Code's superior reasoning gives better results despite the smaller window.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://aicodereview.cc/blog/claude-code-vs-codex-cli-vs-gemini-cli/" rel="noopener noreferrer"&gt;aicodereview.cc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>13 Best Duplicate Code Checker Tools in 2026</title>
      <dc:creator>Rahul Singh</dc:creator>
      <pubDate>Fri, 10 Apr 2026 23:00:00 +0000</pubDate>
      <link>https://dev.to/rahulxsingh/13-best-duplicate-code-checker-tools-in-2026-1cnk</link>
      <guid>https://dev.to/rahulxsingh/13-best-duplicate-code-checker-tools-in-2026-1cnk</guid>
      <description>&lt;h2&gt;
  
  
  Code duplication is the silent tax on every codebase
&lt;/h2&gt;

&lt;p&gt;I have worked on codebases where fixing a single bug required changing the same logic in seven different files. Not because the architecture demanded it - because someone copy-pasted a function years ago, and then someone else copy-pasted the copy, and then the copies diverged slightly, and nobody knew which version was canonical anymore.&lt;/p&gt;

&lt;p&gt;That is the real cost of code duplication. It is not just wasted disk space or inflated line counts. It is the compounding maintenance burden of keeping multiple copies of the same logic in sync - a task that humans are reliably terrible at.&lt;/p&gt;

&lt;p&gt;Studies from large-scale codebases confirm this. Research on the Linux kernel found that inconsistent changes to cloned code were responsible for a meaningful percentage of bugs. A study of open-source Java projects found that cloned code was changed more frequently and contained more defects than non-cloned code. The numbers vary by study, but the pattern is consistent: duplication breeds bugs.&lt;/p&gt;

&lt;p&gt;The good news is that detecting duplicate code is a well-understood problem with excellent tooling. This guide covers 13 tools that find duplicated code - from lightweight CLI utilities you can run in 30 seconds to full platforms that track duplication trends across your entire organization.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is code duplication (and why should you care)?
&lt;/h2&gt;

&lt;p&gt;Code duplication - also called code cloning - occurs when identical or nearly identical code fragments exist in multiple locations within a codebase. It typically happens through copy-paste programming, where a developer copies a working block of code and modifies it slightly for a new context instead of abstracting the shared logic into a reusable function.&lt;/p&gt;

&lt;h3&gt;
  
  
  The four types of code clones
&lt;/h3&gt;

&lt;p&gt;The research community classifies code clones into four types, and this taxonomy matters because different tools detect different types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type 1 - Exact clones.&lt;/strong&gt; Identical code fragments except for differences in whitespace, layout, and comments. This is the simplest form - someone copied a function and only changed the formatting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Clone A&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;calculateTax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.08&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Clone B (Type 1 - only whitespace differs)&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;calculateTax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.08&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Type 2 - Renamed clones.&lt;/strong&gt; Syntactically identical fragments with differences in identifier names, literal values, or type declarations. The structure is the same, but names and values have changed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Clone A&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;calculateTax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.08&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Clone B (Type 2 - renamed variables and changed literal)&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;computeLevy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;percentage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;percentage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Type 3 - Near-miss clones.&lt;/strong&gt; Fragments with further modifications - statements added, removed, or reordered. The code is recognizably similar but not structurally identical.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Clone A&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;calculateTax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.08&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Clone B (Type 3 - added validation logic)&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;computeLevy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Invalid price&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;percentage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;percentage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Levy: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Type 4 - Semantic clones.&lt;/strong&gt; Functionally equivalent code implemented with different syntax or algorithms. Two sorting functions that use different algorithms but produce identical output are Type 4 clones. These are the hardest to detect and most tools cannot reliably find them.&lt;/p&gt;
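&lt;p&gt;To make the distinction concrete, here is an illustrative pair of Type 4 clones (invented for this article, not drawn from any real codebase): both functions compute the same sum, but they share almost no tokens or structure, which is exactly why token-based detectors miss them.&lt;/p&gt;

```javascript
// Two Type 4 clones: identical behavior, different algorithms.

// Iterative summation
function sumIterative(values) {
  let total = 0;
  for (const v of values) {
    total += v;
  }
  return total;
}

// Recursive summation - functionally equivalent, but there is no
// matching token sequence for a Type 1/2 detector to latch onto
function sumRecursive(values) {
  if (values.length === 0) return 0;
  return values[0] + sumRecursive(values.slice(1));
}
```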

&lt;h3&gt;
  
  
  Why duplication matters
&lt;/h3&gt;

&lt;p&gt;The DRY (Don't Repeat Yourself) principle exists for practical reasons, not aesthetic ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bug propagation.&lt;/strong&gt; A bug fixed in one copy often remains unfixed in the others. The more copies, the more likely a fix will be incomplete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance cost.&lt;/strong&gt; Every change to shared logic must be applied once per copy, so maintenance effort scales linearly with the number of duplicates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code review burden.&lt;/strong&gt; Reviewers waste time reading code they have already reviewed in another file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Binary size and build time.&lt;/strong&gt; Duplicated code inflates build times and binary or bundle sizes unnecessarily.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inconsistent behavior.&lt;/strong&gt; When copies diverge, users encounter different behavior depending on which code path they hit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not all duplication is harmful. Test files, generated code, and certain boilerplate patterns are often intentionally duplicated. The goal is not zero duplication - it is zero &lt;em&gt;accidental&lt;/em&gt; duplication of business logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool comparison table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Clone Types&lt;/th&gt;
&lt;th&gt;Languages&lt;/th&gt;
&lt;th&gt;Pricing&lt;/th&gt;
&lt;th&gt;CI Integration&lt;/th&gt;
&lt;th&gt;Open Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SonarQube&lt;/td&gt;
&lt;td&gt;1, 2, 3&lt;/td&gt;
&lt;td&gt;35+&lt;/td&gt;
&lt;td&gt;Free (Community) to $65K+/yr&lt;/td&gt;
&lt;td&gt;GitHub, GitLab, Jenkins, Azure&lt;/td&gt;
&lt;td&gt;Community Build: Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PMD CPD&lt;/td&gt;
&lt;td&gt;1, 2&lt;/td&gt;
&lt;td&gt;20+&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Maven, Gradle, CLI&lt;/td&gt;
&lt;td&gt;Yes (BSD)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Simian&lt;/td&gt;
&lt;td&gt;1, 2&lt;/td&gt;
&lt;td&gt;15+&lt;/td&gt;
&lt;td&gt;$299-$499/license&lt;/td&gt;
&lt;td&gt;CLI, Ant, Maven&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;jscpd&lt;/td&gt;
&lt;td&gt;1, 2&lt;/td&gt;
&lt;td&gt;150+ (via tokenizers)&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;CLI, GitHub Actions&lt;/td&gt;
&lt;td&gt;Yes (MIT)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MOSS&lt;/td&gt;
&lt;td&gt;1, 2, 3&lt;/td&gt;
&lt;td&gt;25+&lt;/td&gt;
&lt;td&gt;Free (academic)&lt;/td&gt;
&lt;td&gt;Web upload only&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplo&lt;/td&gt;
&lt;td&gt;1, 2&lt;/td&gt;
&lt;td&gt;Language-agnostic&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloneDR&lt;/td&gt;
&lt;td&gt;1, 2, 3, 4&lt;/td&gt;
&lt;td&gt;20+&lt;/td&gt;
&lt;td&gt;Enterprise pricing&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CodeAnt AI&lt;/td&gt;
&lt;td&gt;1, 2, 3&lt;/td&gt;
&lt;td&gt;30+&lt;/td&gt;
&lt;td&gt;Free to $40/user/mo&lt;/td&gt;
&lt;td&gt;GitHub, GitLab, Bitbucket&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codacy&lt;/td&gt;
&lt;td&gt;1, 2&lt;/td&gt;
&lt;td&gt;40+&lt;/td&gt;
&lt;td&gt;Free to $15/user/mo&lt;/td&gt;
&lt;td&gt;GitHub, GitLab, Bitbucket&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSource&lt;/td&gt;
&lt;td&gt;1, 2, 3&lt;/td&gt;
&lt;td&gt;15+&lt;/td&gt;
&lt;td&gt;Free to $12/user/mo&lt;/td&gt;
&lt;td&gt;GitHub, GitLab, Bitbucket&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IntelliJ IDEA&lt;/td&gt;
&lt;td&gt;1, 2, 3&lt;/td&gt;
&lt;td&gt;JVM, Python, JS, PHP&lt;/td&gt;
&lt;td&gt;$249-$779/yr (Ultimate)&lt;/td&gt;
&lt;td&gt;IDE only&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coverity&lt;/td&gt;
&lt;td&gt;1, 2, 3&lt;/td&gt;
&lt;td&gt;22+&lt;/td&gt;
&lt;td&gt;Enterprise pricing ($50K+)&lt;/td&gt;
&lt;td&gt;Jenkins, GitHub, GitLab&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semgrep&lt;/td&gt;
&lt;td&gt;1, 2 (pattern-based)&lt;/td&gt;
&lt;td&gt;30+&lt;/td&gt;
&lt;td&gt;Free (OSS) to custom&lt;/td&gt;
&lt;td&gt;GitHub, GitLab, CLI&lt;/td&gt;
&lt;td&gt;OSS engine: Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  1. SonarQube - copy-paste detection built into the quality platform
&lt;/h2&gt;

&lt;p&gt;SonarQube's duplication detection is one of the most widely deployed in the industry, largely because it comes bundled with the broader quality platform that most enterprises already run. If you are using SonarQube for code quality and security, you are already getting duplication analysis for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; SonarQube uses a combination of token-based and AST-based analysis depending on the language. For most languages, it tokenizes code, normalizes identifiers and literals, and then finds matching token sequences above a configurable minimum length (default: 100 tokens for Java, 120 for JavaScript). It detects Type 1, Type 2, and some Type 3 clones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What sets it apart.&lt;/strong&gt; The duplication metrics feed directly into SonarQube's quality gate system. You can block merges when duplication on new code exceeds a threshold - 3% is the default. The duplication visualization shows exactly which blocks are duplicated and where, making it easy to plan refactoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 35+ including Java, Python, JavaScript, TypeScript, C#, C/C++, Go, PHP, Ruby, Swift, Kotlin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Community Build is free and open source. Developer Edition starts around $150/year for 100K LOC. Enterprise runs $65,000+/year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplication detection is part of a comprehensive quality platform&lt;/li&gt;
&lt;li&gt;Quality gates can enforce duplication thresholds on PRs&lt;/li&gt;
&lt;li&gt;Excellent visualization of duplicate blocks across the codebase&lt;/li&gt;
&lt;li&gt;Deep language support with language-specific tokenization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Community Build requires self-hosting a server&lt;/li&gt;
&lt;li&gt;Configuration is heavier than standalone CLI tools&lt;/li&gt;
&lt;li&gt;Enterprise pricing is steep for teams that only need duplication detection&lt;/li&gt;
&lt;li&gt;Type 3 detection is limited compared to AST-native tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams already using SonarQube that want duplication checks as part of their quality workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. PMD CPD - the copy-paste detector that just works
&lt;/h2&gt;

&lt;p&gt;PMD's Copy/Paste Detector (CPD) has been the go-to standalone duplication checker for over two decades. It is free, fast, reliable, and works with every major build system. If you need a no-frills duplicate code finder that you can add to a CI pipeline in five minutes, CPD is the answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; CPD uses token-based detection. It lexes source files into token streams, then applies the Karp-Rabin algorithm to find matching subsequences. You configure a minimum token count (default: 100), and CPD reports all pairs of code fragments that share at least that many consecutive tokens.&lt;/p&gt;
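&lt;p&gt;To illustrate the idea (this is a simplified sketch, not CPD's actual implementation, and all names are hypothetical): tokenize each file, then report windows of tokens that appear in more than one place. Real tools hash windows instead of comparing them directly and merge overlapping matches into a single report entry.&lt;/p&gt;

```javascript
// Simplified sketch of token-based clone detection. A crude regex
// stands in for a real lexer, and raw joined-token strings stand in
// for rolling hashes.

function tokenize(source) {
  // Identifiers/numbers as one token each, plus single symbol tokens
  return source.match(/\w+|[^\s\w]/g) || [];
}

function findDuplicates(sources, minTokens) {
  const seen = new Map(); // token-window key -> first location
  const matches = [];
  for (const [file, text] of Object.entries(sources)) {
    const tokens = tokenize(text);
    for (let i = 0; i + minTokens <= tokens.length; i++) {
      const key = tokens.slice(i, i + minTokens).join('\u0000');
      if (seen.has(key)) {
        matches.push({ first: seen.get(key), second: { file, index: i } });
      } else {
        seen.set(key, { file, index: i });
      }
    }
  }
  return matches; // one entry per matching window, not yet coalesced
}
```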

&lt;p&gt;&lt;strong&gt;What sets it apart.&lt;/strong&gt; Zero dependencies, zero accounts, zero configuration. Download the PMD distribution, run &lt;code&gt;pmd cpd --minimum-tokens 100 --dir src/&lt;/code&gt;, and you get a report. It integrates natively with Maven (&lt;code&gt;mvn pmd:cpd&lt;/code&gt;), Gradle, and Ant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; Java, JavaScript, TypeScript, Python, C/C++, C#, Go, Ruby, Swift, Kotlin, Scala, PHP, Lua, MATLAB, Fortran, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free and open source (BSD license).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely fast - scans millions of lines in seconds&lt;/li&gt;
&lt;li&gt;No server, no account, no internet connection required&lt;/li&gt;
&lt;li&gt;Native Maven, Gradle, and Ant integration&lt;/li&gt;
&lt;li&gt;Well-documented with 20+ years of stability&lt;/li&gt;
&lt;li&gt;Configurable minimum tokens and output formats (XML, CSV, text)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token-based only - misses Type 3 and Type 4 clones&lt;/li&gt;
&lt;li&gt;No visualization or trending - just a flat report&lt;/li&gt;
&lt;li&gt;No PR integration or quality gates out of the box&lt;/li&gt;
&lt;li&gt;No web dashboard or historical tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Any team that wants a fast, free, no-nonsense duplication check in CI.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Simian - commercial token matcher with strict licensing
&lt;/h2&gt;

&lt;p&gt;Simian (Similarity Analyser) is a commercial duplicate code detection tool that focuses on fast, accurate token-based matching. It was popular in the mid-2010s and still has users in enterprise Java and .NET shops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; Simian uses proprietary token-based matching that the vendor claims is more accurate than CPD for certain edge cases, particularly around multiline string literals and complex expressions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; Java, C#, C/C++, JavaScript, TypeScript, Ruby, Swift, Objective-C, Visual Basic, COBOL, and others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; $299 for a single developer license, $499 for a site license. One-time purchase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast scanning with low memory footprint&lt;/li&gt;
&lt;li&gt;Good handling of edge cases in .NET and Java code&lt;/li&gt;
&lt;li&gt;One-time license fee rather than subscription&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No longer actively developed - last major update was several years ago&lt;/li&gt;
&lt;li&gt;Token-based only - no Type 3 detection&lt;/li&gt;
&lt;li&gt;No CI/CD integration beyond CLI exit codes&lt;/li&gt;
&lt;li&gt;No web dashboard or PR comments&lt;/li&gt;
&lt;li&gt;PMD CPD provides comparable functionality for free&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Legacy projects already using Simian that do not want to migrate.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. jscpd - the polyglot copy-paste detector
&lt;/h2&gt;

&lt;p&gt;jscpd is a modern, Node.js-based duplicate code detector that supports an impressive range of languages through its tokenizer system. If you work with a polyglot codebase and want a single duplication tool, jscpd is worth considering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; jscpd tokenizes source files using language-specific tokenizers (powered by the Prism syntax highlighter's grammar definitions), then finds matching token sequences using the Rabin-Karp algorithm. It supports a configurable minimum number of tokens and lines.&lt;/p&gt;
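&lt;p&gt;Those thresholds are typically set in a &lt;code&gt;.jscpd.json&lt;/code&gt; file at the repository root. A minimal starting point might look like the following - the field names follow jscpd's documented options, so verify them against the version you install:&lt;/p&gt;

```json
{
  "minTokens": 70,
  "minLines": 5,
  "threshold": 3,
  "ignore": ["**/node_modules/**", "**/*.min.js", "**/__tests__/**"],
  "reporters": ["html", "console"],
  "output": "./report"
}
```

&lt;p&gt;Here &lt;code&gt;threshold&lt;/code&gt; is the maximum allowed duplication percentage before jscpd exits with a non-zero code, which is what makes it useful as a CI gate.&lt;/p&gt;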

&lt;p&gt;&lt;strong&gt;What sets it apart.&lt;/strong&gt; The breadth of language support is exceptional. Because it leverages Prism's grammars, jscpd can tokenize over 150 languages and file formats, including markup languages, configuration files, and even Dockerfiles. It also generates HTML reports with side-by-side clone views.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 150+ including all mainstream programming languages plus markup, configuration, and infrastructure-as-code files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free and open source (MIT license).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Broadest language support of any standalone duplication tool&lt;/li&gt;
&lt;li&gt;Beautiful HTML reports with side-by-side clone visualization&lt;/li&gt;
&lt;li&gt;CI-friendly exit codes and JSON output&lt;/li&gt;
&lt;li&gt;Configurable thresholds per language via &lt;code&gt;.jscpd.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Active open-source project with regular updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token-based only - no Type 3 or Type 4 detection&lt;/li&gt;
&lt;li&gt;Node.js dependency required&lt;/li&gt;
&lt;li&gt;Slower than PMD CPD on very large codebases&lt;/li&gt;
&lt;li&gt;No server or dashboard for tracking trends over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Polyglot codebases and teams that want HTML reporting without a server.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. MOSS - academic plagiarism detection for source code
&lt;/h2&gt;

&lt;p&gt;MOSS (Measure Of Software Similarity) is a web-based service from Stanford University designed to detect plagiarism in programming assignments. It is not a typical developer tool, but it excels at detecting code similarity across large sets of submissions - which makes it uniquely useful for certain scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; MOSS uses document fingerprinting with the Winnowing algorithm. You submit a set of source files, and MOSS returns a ranked list of file pairs with the highest similarity, along with a web-based side-by-side view showing the matching regions.&lt;/p&gt;
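&lt;p&gt;The winnowing idea can be sketched in a few lines (a simplified illustration, not MOSS's production code): hash every k-gram of the text, then keep only the minimum hash from each sliding window of hashes. The result is a small fingerprint set that is robust to insertions and reordering, and two documents can be compared by how many fingerprints they share.&lt;/p&gt;

```javascript
// Sketch of winnowing-style fingerprinting (after Schleimer,
// Wilkerson, and Aiken). The toy polynomial hash is illustrative.

function kgramHashes(text, k) {
  const hashes = [];
  for (let i = 0; i + k <= text.length; i++) {
    let h = 0;
    for (const ch of text.slice(i, i + k)) {
      h = (h * 31 + ch.charCodeAt(0)) >>> 0;
    }
    hashes.push(h);
  }
  return hashes;
}

function winnow(text, k, w) {
  const hashes = kgramHashes(text, k);
  const fingerprints = new Set();
  // Keep the minimum hash from each window of w consecutive hashes
  for (let i = 0; i + w <= hashes.length; i++) {
    fingerprints.add(Math.min(...hashes.slice(i, i + w)));
  }
  return fingerprints;
}

// Similarity: fraction of fingerprints two documents have in common
function overlap(a, b) {
  let shared = 0;
  for (const fp of a) if (b.has(fp)) shared++;
  return shared / Math.max(a.size, b.size);
}
```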

&lt;p&gt;&lt;strong&gt;What sets it apart.&lt;/strong&gt; MOSS is designed to compare many files against each other simultaneously, which is different from most tools that scan a single codebase. It normalizes code to ignore variable names and whitespace, catching Type 1, 2, and some Type 3 clones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; C, C++, Java, Python, JavaScript, C#, MATLAB, Perl, Haskell, Lisp, Scheme, and others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free for educational and research use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Excellent at detecting similarity across large file sets&lt;/li&gt;
&lt;li&gt;Web-based visualization with highlighted matching regions&lt;/li&gt;
&lt;li&gt;Handles obfuscation attempts (variable renaming, reordering)&lt;/li&gt;
&lt;li&gt;Trusted and maintained by Stanford for decades&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires uploading code to Stanford's servers - not suitable for proprietary code&lt;/li&gt;
&lt;li&gt;No CI/CD integration&lt;/li&gt;
&lt;li&gt;Web-only interface - no CLI or API for automation&lt;/li&gt;
&lt;li&gt;Designed for academic plagiarism, not production code quality&lt;/li&gt;
&lt;li&gt;Availability depends on Stanford's infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Academic settings and open-source projects where code can be uploaded to an external server.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Duplo - lightweight C/C++ focused detector
&lt;/h2&gt;

&lt;p&gt;Duplo is a simple, lightweight duplicate code detection tool originally designed for C and C++ projects. It takes a minimalist approach - feed it a list of files, and it reports duplicate blocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; Duplo uses a line-based matching algorithm. It normalizes lines by stripping whitespace and comments, then finds matching sequences of lines above a configurable minimum length. This is simpler than token-based approaches but surprisingly effective for exact and near-exact clones.&lt;/p&gt;
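&lt;p&gt;A minimal sketch of the line-based approach (illustrative only, not Duplo's actual source): normalize each line, then look for runs of consecutive identical lines shared between two files.&lt;/p&gt;

```javascript
// Line-based clone detection sketch: strip // comments and
// surrounding whitespace, drop blank lines, then find the longest
// run of consecutive matching normalized lines across two files.

function normalizeLines(source) {
  return source
    .split('\n')
    .map(line => line.replace(/\/\/.*$/, '').trim())
    .filter(line => line.length > 0);
}

function longestSharedRun(fileA, fileB) {
  const a = normalizeLines(fileA);
  const b = normalizeLines(fileB);
  let best = 0;
  for (let i = 0; i < a.length; i++) {
    for (let j = 0; j < b.length; j++) {
      let len = 0;
      while (i + len < a.length && j + len < b.length && a[i + len] === b[j + len]) {
        len++;
      }
      if (len > best) best = len;
    }
  }
  return best; // flag as a clone when best exceeds the configured minimum
}
```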

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; Technically language-agnostic (line-based matching works on any text), but optimized for C, C++, Java, and C#.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free and open source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely lightweight - single binary, no dependencies&lt;/li&gt;
&lt;li&gt;Fast on large codebases&lt;/li&gt;
&lt;li&gt;Simple to understand and configure&lt;/li&gt;
&lt;li&gt;Works on any text-based language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Line-based matching is less accurate than token or AST-based detection&lt;/li&gt;
&lt;li&gt;Very basic output - no visualization or HTML reports&lt;/li&gt;
&lt;li&gt;Limited development activity in recent years&lt;/li&gt;
&lt;li&gt;No CI integration beyond exit codes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; C/C++ projects that need a quick, zero-dependency duplication check.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. CloneDR - the research-grade AST clone detector
&lt;/h2&gt;

&lt;p&gt;CloneDR is one of the few tools that uses full AST-based clone detection, developed by Semantic Designs. It parses source code into abstract syntax trees and compares subtrees to find clones - including Type 3 and even some Type 4 clones that token-based tools miss completely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; CloneDR parses source files using language-specific grammars, builds ASTs, and then compares subtrees using a parameterized matching algorithm. It can detect clones where statements have been added, removed, or reordered (Type 3), and in some cases can identify semantically equivalent code with different structure (Type 4).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What sets it apart.&lt;/strong&gt; This is the most thorough clone detection approach available in a commercial tool. CloneDR does not just find copied code - it can suggest a refactored version that extracts the common logic into a shared function with parameters for the varying parts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; Java, C#, C/C++, Python, JavaScript, COBOL, PHP, and others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Enterprise licensing through Semantic Designs. Contact for quotes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detects Type 3 and some Type 4 clones&lt;/li&gt;
&lt;li&gt;Suggests refactored abstractions for detected clones&lt;/li&gt;
&lt;li&gt;Most thorough detection of any tool in this list&lt;/li&gt;
&lt;li&gt;Handles complex transformations like loop restructuring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expensive enterprise-only licensing&lt;/li&gt;
&lt;li&gt;Slower than token-based tools due to full AST parsing&lt;/li&gt;
&lt;li&gt;Smaller user community - fewer resources and integrations&lt;/li&gt;
&lt;li&gt;No native CI/CD pipeline integration&lt;/li&gt;
&lt;li&gt;Requires language-specific grammars for each supported language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Organizations serious about eliminating deep structural duplication and willing to invest in thorough analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. CodeAnt AI - duplication detection with AI code review
&lt;/h2&gt;

&lt;p&gt;CodeAnt AI bundles duplicate code detection with its broader AI-powered code review and static analysis platform. It is one of the newer entrants that treats duplication as part of a holistic code quality workflow rather than a standalone feature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; CodeAnt AI scans repositories connected through GitHub, GitLab, or Bitbucket. Its duplication engine uses a combination of token-based matching and structural analysis to detect Type 1, 2, and 3 clones. Findings appear as PR comments alongside security and quality issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What sets it apart.&lt;/strong&gt; Duplication findings are contextualized with AI-generated explanations. Instead of just pointing out that two blocks match, CodeAnt AI explains why the duplication is problematic and suggests a refactored approach. It also tracks duplication metrics over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 30+ languages including Python, JavaScript, TypeScript, Java, Go, Ruby, PHP, and C#.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free Basic plan for open source. Pro at $24/user/month and Enterprise at $40/user/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI-powered explanations and refactoring suggestions&lt;/li&gt;
&lt;li&gt;Duplication detection is part of a comprehensive code review platform&lt;/li&gt;
&lt;li&gt;PR-level integration - findings appear as comments on pull requests&lt;/li&gt;
&lt;li&gt;Tracks duplication metrics and trends over time&lt;/li&gt;
&lt;li&gt;SAST, secrets detection, and DORA metrics included&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplication detection is not available as a standalone feature&lt;/li&gt;
&lt;li&gt;Requires connecting your repository to CodeAnt AI's platform&lt;/li&gt;
&lt;li&gt;Newer tool with a smaller community than SonarQube or PMD&lt;/li&gt;
&lt;li&gt;Enterprise pricing adds up for large teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want duplication detection bundled with AI code review and static analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Codacy - duplication as part of code quality automation
&lt;/h2&gt;

&lt;p&gt;Codacy includes duplication detection as one of its core code quality checks. Like SonarQube, it bundles duplication analysis with security scanning, code coverage tracking, and quality gates - but as a fully managed cloud service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; Codacy uses PMD CPD and language-specific analyzers under the hood for its duplication detection. It tokenizes code, finds matching sequences, and reports duplicates with links to both locations. Findings appear in the Codacy dashboard and as PR comments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What sets it apart.&lt;/strong&gt; The fully managed experience means zero infrastructure. Connect your GitHub, GitLab, or Bitbucket repository, and duplication analysis starts automatically on every push. Quality gates can enforce duplication thresholds, blocking merges when thresholds are exceeded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 40+ languages through its analyzer ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free for open source. Pro at $15/user/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero-configuration cloud setup&lt;/li&gt;
&lt;li&gt;Quality gates enforce duplication thresholds on PRs&lt;/li&gt;
&lt;li&gt;Duplication trends visible in dashboard over time&lt;/li&gt;
&lt;li&gt;Bundled with SAST, SCA, code coverage, and code quality&lt;/li&gt;
&lt;li&gt;Supports GitHub, GitLab, and Bitbucket&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplication detection relies on PMD CPD - no Type 3 detection&lt;/li&gt;
&lt;li&gt;Cannot run as a standalone CLI tool&lt;/li&gt;
&lt;li&gt;$15/user/month adds up for larger teams&lt;/li&gt;
&lt;li&gt;Less configurable than running PMD CPD directly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want a managed code quality platform with duplication checks included.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. DeepSource - fast duplication analysis with Autofix
&lt;/h2&gt;

&lt;p&gt;DeepSource includes duplication detection as part of its static analysis platform and stands out for its speed and low false positive rate. It takes a modern approach to developer experience, with clean UI and actionable findings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; DeepSource uses its own analysis engine that combines token-based and structural techniques. It detects Type 1, 2, and some Type 3 clones. When duplication is found, DeepSource can generate Autofix suggestions that extract duplicated code into shared functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What sets it apart.&lt;/strong&gt; The Autofix feature for duplication is genuinely useful. Rather than just flagging that code is duplicated, DeepSource proposes a refactored version. Its sub-5% false positive rate also means you do not waste time reviewing findings that are not real issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; Python, JavaScript, TypeScript, Java, Go, Ruby, Rust, C#, Kotlin, Swift, PHP, and others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free for individual developers and open-source projects. Team plan at $12/user/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Autofix generates refactored code for duplicated blocks&lt;/li&gt;
&lt;li&gt;Very low false positive rate&lt;/li&gt;
&lt;li&gt;Fast scan times - typically completes in under a minute&lt;/li&gt;
&lt;li&gt;Clean, modern developer experience&lt;/li&gt;
&lt;li&gt;Free tier is generous for individual use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller language coverage than SonarQube or Codacy&lt;/li&gt;
&lt;li&gt;Type 3 detection is limited&lt;/li&gt;
&lt;li&gt;No standalone CLI for duplication-only scanning&lt;/li&gt;
&lt;li&gt;Enterprise features require the paid tier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that value developer experience and want AI-assisted refactoring of duplicated code.&lt;/p&gt;

&lt;h2&gt;
  
  
  11. IntelliJ IDEA - IDE-native duplicate detection
&lt;/h2&gt;

&lt;p&gt;JetBrains IntelliJ IDEA (and other JetBrains IDEs like PyCharm, WebStorm, and Rider) includes built-in duplicate code detection that runs directly in the editor. This is not a CI/CD tool - it is a developer productivity feature that surfaces duplication while you are actively coding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; IntelliJ's duplicate detection uses AST-based analysis powered by the JetBrains inspection engine. It compares structural elements rather than raw tokens, which enables Type 3 detection where code has been slightly modified. Results appear as editor highlights and in the inspection results panel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What sets it apart.&lt;/strong&gt; The IDE integration means you see duplication immediately as you write code, before you even commit. The refactoring tools built into IntelliJ - Extract Method, Extract Variable, Pull Members Up - work seamlessly with the duplication findings, making it trivial to fix clones on the spot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; Java, Kotlin, Scala, Groovy (IntelliJ), Python (PyCharm), JavaScript/TypeScript (WebStorm), C#/.NET (Rider), PHP (PhpStorm).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Community Edition is free but does not include duplication detection. Ultimate starts at $249/year for individuals, $779/year for organizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time detection as you write code&lt;/li&gt;
&lt;li&gt;AST-based analysis catches Type 3 clones&lt;/li&gt;
&lt;li&gt;Seamless integration with IntelliJ refactoring tools&lt;/li&gt;
&lt;li&gt;No external tooling or configuration needed&lt;/li&gt;
&lt;li&gt;Cross-module detection within a project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires JetBrains IDE - not available in VS Code or other editors&lt;/li&gt;
&lt;li&gt;Not a CI/CD tool - cannot enforce thresholds on PRs&lt;/li&gt;
&lt;li&gt;Only runs on code currently open in the IDE&lt;/li&gt;
&lt;li&gt;Ultimate license required for full duplication analysis&lt;/li&gt;
&lt;li&gt;No historical tracking or trending&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Individual developers using JetBrains IDEs who want real-time duplication awareness.&lt;/p&gt;

&lt;h2&gt;
  
  
  12. Coverity - deep analysis for safety-critical code
&lt;/h2&gt;

&lt;p&gt;Coverity (now part of Black Duck, formerly Synopsys) includes duplication detection as part of its enterprise static analysis platform. Coverity is the standard for safety-critical industries - automotive, aerospace, medical devices, and embedded systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; Coverity performs deep interprocedural analysis that includes clone detection. Its engine builds a comprehensive model of the entire codebase, including call graphs and data flow, which enables it to detect structural clones that simpler tools miss. It focuses on clones that are likely to cause defects - duplicated code with inconsistent error handling or boundary checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What sets it apart.&lt;/strong&gt; Coverity does not just find duplicated code - it finds duplicated code that is &lt;em&gt;dangerous&lt;/em&gt;. Its defect-oriented approach means it prioritizes clones where one copy has been patched but others have not, which is exactly the pattern that causes real-world bugs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; C, C++, Java, C#, JavaScript, TypeScript, Python, Ruby, Go, Kotlin, Swift, and others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Enterprise-only. Typically $50,000+ per year depending on codebase size and seats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Defect-focused duplication detection - finds clones that cause bugs&lt;/li&gt;
&lt;li&gt;Deep interprocedural analysis catches complex structural clones&lt;/li&gt;
&lt;li&gt;Industry standard for safety-critical code (ISO 26262, DO-178C)&lt;/li&gt;
&lt;li&gt;Finds inconsistently patched clones that other tools miss&lt;/li&gt;
&lt;li&gt;Comprehensive reporting for compliance requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely expensive - not practical for small teams&lt;/li&gt;
&lt;li&gt;Slow scan times due to deep analysis&lt;/li&gt;
&lt;li&gt;Complex deployment and configuration&lt;/li&gt;
&lt;li&gt;Overkill for web applications or non-critical software&lt;/li&gt;
&lt;li&gt;No free tier or community edition&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprise teams working on safety-critical software where finding dangerous clones is more important than finding all clones.&lt;/p&gt;

&lt;h2&gt;
  
  
  13. Semgrep - pattern-based duplicate detection with custom rules
&lt;/h2&gt;

&lt;p&gt;Semgrep takes a different approach to duplication. Rather than scanning for arbitrary matching code blocks, Semgrep lets you define patterns that match specific types of duplication in your codebase. This is not traditional clone detection - it is targeted pattern matching that catches the duplication patterns that matter most to your team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; You write Semgrep rules using a pattern syntax that matches code structure rather than exact text. For example, you can write a rule that detects when the same error handling pattern is duplicated across multiple catch blocks, or when identical validation logic appears in multiple API endpoints. Semgrep matches against the AST, so it catches renamed variables and reformatted code.&lt;/p&gt;
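&lt;p&gt;A hypothetical rule shows the flavor - the schema fields (&lt;code&gt;id&lt;/code&gt;, &lt;code&gt;pattern&lt;/code&gt;, &lt;code&gt;message&lt;/code&gt;, &lt;code&gt;languages&lt;/code&gt;, &lt;code&gt;severity&lt;/code&gt;) follow Semgrep's documented rule format, while the pattern itself and the &lt;code&gt;validatePrice()&lt;/code&gt; helper it recommends are invented for illustration:&lt;/p&gt;

```yaml
# Hypothetical Semgrep rule: flag a duplicated inline validation
# pattern so it can be centralized. $PRICE is a metavariable that
# binds any expression; "..." matches any string literal.
rules:
  - id: duplicated-price-validation
    languages: [javascript]
    severity: WARNING
    message: >
      Inline price validation is duplicated across endpoints -
      call the shared validatePrice() helper instead.
    pattern: |
      if ($PRICE < 0) throw new Error("...")
```

&lt;p&gt;Because matching happens against the AST, this one rule fires regardless of whether the variable is named &lt;code&gt;price&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;, or anything else.&lt;/p&gt;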

&lt;p&gt;&lt;strong&gt;What sets it apart.&lt;/strong&gt; The custom rule approach means you focus on the duplication that actually causes problems in your codebase. Instead of a noisy report showing every duplicated three-line block, you get targeted findings for the specific patterns you care about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 30+ including Python, JavaScript, TypeScript, Java, Go, Ruby, C, C++, Rust, PHP, Kotlin, Swift, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Open-source CLI is free (LGPL-2.1). Team tier available for PR integration, and enterprise pricing for advanced features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom rules target the specific duplication patterns that matter&lt;/li&gt;
&lt;li&gt;AST-based matching catches renamed and reformatted clones&lt;/li&gt;
&lt;li&gt;Extremely fast - scans most codebases in seconds&lt;/li&gt;
&lt;li&gt;Huge community rule library with 3,000+ pre-built rules&lt;/li&gt;
&lt;li&gt;Free for commercial use and CI/CD integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not a traditional clone detector - requires writing rules for specific patterns&lt;/li&gt;
&lt;li&gt;Will not generate a comprehensive duplication report like PMD CPD&lt;/li&gt;
&lt;li&gt;No built-in duplication percentage metric&lt;/li&gt;
&lt;li&gt;Requires learning Semgrep's pattern syntax&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that know what duplication patterns to target and want precise, low-noise detection.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to choose the right duplicate code checker
&lt;/h2&gt;

&lt;p&gt;The right tool depends on what you need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For a quick CLI scan with no setup,&lt;/strong&gt; use PMD CPD. It is free, fast, and integrates with every build tool. Run &lt;code&gt;pmd cpd --minimum-tokens 75 --dir src/&lt;/code&gt; and you have a report in seconds. If you work with many languages, jscpd is the better choice for its broader tokenizer support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For CI/CD enforcement,&lt;/strong&gt; SonarQube, Codacy, or DeepSource give you quality gates that block PRs when duplication thresholds are exceeded. Codacy and DeepSource are fully managed, while SonarQube requires self-hosting (unless you use SonarCloud).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For IDE-level awareness,&lt;/strong&gt; IntelliJ IDEA's built-in duplication detection catches clones as you type. This prevents duplication before it is committed rather than catching it after the fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For deep structural analysis,&lt;/strong&gt; CloneDR and Coverity detect Type 3 and some semantically similar (Type 4) clones that token-based tools miss entirely. These are expensive options but essential for safety-critical codebases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For AI-assisted refactoring,&lt;/strong&gt; CodeAnt AI and DeepSource go beyond detection by suggesting how to refactor duplicated code. This bridges the gap between finding duplication and actually fixing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For targeted pattern matching,&lt;/strong&gt; Semgrep lets you define rules for the specific duplication patterns that cause problems in your codebase. This is the lowest-noise approach but requires upfront effort to write rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up duplicate code detection in CI
&lt;/h2&gt;

&lt;p&gt;Here is a practical example of adding duplication checks to a GitHub Actions pipeline using PMD CPD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Duplication Check&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cpd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Download PMD&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;wget https://github.com/pmd/pmd/releases/download/pmd_releases%2F7.9.0/pmd-dist-7.9.0-bin.zip&lt;/span&gt;
          &lt;span class="s"&gt;unzip pmd-dist-7.9.0-bin.zip&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run CPD&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;./pmd-bin-7.9.0/bin/pmd cpd \&lt;/span&gt;
            &lt;span class="s"&gt;--minimum-tokens 100 \&lt;/span&gt;
            &lt;span class="s"&gt;--dir src/ \&lt;/span&gt;
            &lt;span class="s"&gt;--format xml \&lt;/span&gt;
            &lt;span class="s"&gt;--fail-on-violation true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For jscpd, the setup is even simpler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Duplication Check&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jscpd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npx jscpd src/ --threshold 5 --reporters console&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--threshold 5&lt;/code&gt; flag fails the check when duplication exceeds 5% of total lines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Duplicate code detection is a solved problem at the tool level. PMD CPD and jscpd are free, fast, and effective for token-based detection. SonarQube, Codacy, and DeepSource bundle duplication checks into broader quality platforms. CloneDR and Coverity provide deep structural analysis for teams that need it.&lt;/p&gt;

&lt;p&gt;The unsolved problem is actually fixing duplication once you find it. That is where the newer tools - CodeAnt AI with its AI-powered suggestions, DeepSource with Autofix, and IntelliJ with its refactoring tools - are pushing the field forward. Detection without a path to remediation just creates a backlog of tech debt that nobody addresses.&lt;/p&gt;

&lt;p&gt;My recommendation for most teams: start with PMD CPD or jscpd in CI to establish a baseline and prevent new duplication. If you are already using SonarQube, Codacy, or DeepSource, enable their duplication checks and set quality gates. If you are working on safety-critical software, invest in Coverity's defect-focused approach. And regardless of what you use in CI, enable IntelliJ's duplication detection in your IDE - catching clones before they are committed is always cheaper than catching them after.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the best free duplicate code checker?
&lt;/h3&gt;

&lt;p&gt;PMD CPD is the best free duplicate code checker for most teams. It supports 20+ languages, detects Type 1 and Type 2 clones, integrates with every major build tool, and runs locally without any account or server. For JavaScript and TypeScript projects, jscpd is an excellent alternative with built-in HTML reporting and CI-friendly exit codes. SonarQube Community Build also includes copy-paste detection at no cost, though it requires a self-hosted server.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the four types of code clones?
&lt;/h3&gt;

&lt;p&gt;Type 1 clones are exact copies with only whitespace and comment differences. Type 2 clones are syntactically identical but with renamed variables, changed types, or modified literals. Type 3 clones are near-miss copies where statements have been added, removed, or reordered. Type 4 clones are semantically equivalent but syntactically different - two functions that produce the same output using completely different logic. Most tools detect Type 1 and 2 reliably. Type 3 detection requires AST-based analysis. Type 4 detection remains a research problem, though AI-powered tools are making progress.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much code duplication is acceptable?
&lt;/h3&gt;

&lt;p&gt;Most industry benchmarks consider less than 3-5% duplication acceptable for a healthy codebase. SonarQube's default quality gate flags code with more than 3% duplication on new code. However, context matters - some duplication is intentional and acceptable, such as test setup code or generated files. The key metric is whether duplicated code is actively maintained in multiple places, creating a risk of inconsistent changes. Zero duplication is not a realistic or even desirable target.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the DRY principle and why does it matter?
&lt;/h3&gt;

&lt;p&gt;DRY stands for Don't Repeat Yourself, a software engineering principle stating that every piece of knowledge should have a single, authoritative representation in a system. Violating DRY by copy-pasting code creates maintenance burden - when a bug is fixed in one copy, all other copies must be found and updated. Studies show that inconsistent changes to cloned code account for a significant percentage of bugs in large codebases. DRY is not about eliminating all similar-looking code - it is about ensuring that business logic and domain knowledge are not scattered across multiple locations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can duplicate code checkers run in CI/CD pipelines?
&lt;/h3&gt;

&lt;p&gt;Yes, most modern duplicate code checkers integrate with CI/CD pipelines. PMD CPD, jscpd, and Duplo run as CLI tools that return non-zero exit codes when duplication thresholds are exceeded, making them easy to add to any pipeline. SonarQube, Codacy, DeepSource, and CodeAnt AI provide native GitHub Actions and GitLab CI integrations that comment duplication findings directly on pull requests. The key is configuring appropriate thresholds so the check catches meaningful duplication without blocking every PR over minor similarities.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between token-based and AST-based clone detection?
&lt;/h3&gt;

&lt;p&gt;Token-based detection (used by PMD CPD, Simian, and jscpd) breaks source code into a stream of tokens and finds matching sequences. It is fast and language-agnostic but only reliably detects Type 1 and Type 2 clones. AST-based detection (used by CloneDR, SonarQube, and DeepSource) parses code into abstract syntax trees and compares subtrees, which catches Type 3 clones where code has been modified or reordered. AST-based tools are slower but more accurate for detecting near-miss duplicates that token matching misses.&lt;/p&gt;
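&lt;p&gt;The token-based idea can be sketched in a few lines of Python with the standard &lt;code&gt;tokenize&lt;/code&gt; module. This is a toy illustration of the normalization step, not how PMD CPD or jscpd are actually implemented: identifiers and literals become placeholders, so Type 1 and Type 2 clones compare equal while keywords and operators must still match.&lt;/p&gt;

```python
import io
import keyword
import tokenize

def normalized_tokens(source: str) -> list:
    """Return source as a token sequence with identifiers and literals
    replaced by placeholders, so Type 1 and Type 2 clones compare equal."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")            # renamed variables normalize away
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")           # changed literals normalize away
        elif tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
                          tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
            continue                    # layout and comments are ignored
        else:
            out.append(tok.string)      # keywords and operators must match
    return out

# Two functions with different names, variables, and formatting:
a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
b = "def add_all(nums):\n    acc = 0\n    for n in nums:\n        acc += n\n    return acc\n"
print(normalized_tokens(a) == normalized_tokens(b))  # True: detected as a Type 2 clone
```

&lt;p&gt;A real tool would additionally slide a minimum-token window over these sequences to find matching subsequences across files; an AST-based tool compares tree shapes instead, which is what lets it tolerate inserted or reordered statements.&lt;/p&gt;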




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://aicodereview.cc/blog/best-duplicate-code-checker-tools/" rel="noopener noreferrer"&gt;aicodereview.cc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>ai</category>
      <category>programming</category>
      <category>tools</category>
    </item>
    <item>
      <title>12 Best Code Test Coverage Tools in 2026 - Comprehensive Guide</title>
      <dc:creator>Rahul Singh</dc:creator>
      <pubDate>Fri, 10 Apr 2026 22:30:00 +0000</pubDate>
      <link>https://dev.to/rahulxsingh/12-best-code-test-coverage-tools-in-2026-comprehensive-guide-3kh3</link>
      <guid>https://dev.to/rahulxsingh/12-best-code-test-coverage-tools-in-2026-comprehensive-guide-3kh3</guid>
      <description>&lt;h2&gt;
  
  
  Why code test coverage matters in 2026
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Code coverage is the most widely used metric for measuring test quality.&lt;/strong&gt; It tells you what percentage of your codebase is actually exercised when your test suite runs - and more importantly, it reveals which parts of your code have no tests at all.&lt;/p&gt;

&lt;p&gt;Without coverage measurement, teams are guessing about test effectiveness. A project might have 500 tests that all pass while leaving critical business logic completely untested. Coverage tools eliminate this guessing by generating precise reports showing which lines, branches, and functions were hit during test execution.&lt;/p&gt;

&lt;p&gt;The tooling landscape for code coverage has evolved significantly. Language-specific instrumentation tools like Istanbul and JaCoCo remain the foundation, but modern platforms now layer on PR-level reporting, historical trend analysis, coverage gates that block merges below a threshold, and integration with broader code quality workflows.&lt;/p&gt;

&lt;p&gt;This guide compares 12 code test coverage tools across language support, CI integration, reporting capabilities, and pricing. Whether you need a free open source solution for a single-language project or an enterprise platform that aggregates coverage across a polyglot monorepo, you will find the right fit here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick comparison table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Languages&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;CI Integration&lt;/th&gt;
&lt;th&gt;Free Tier&lt;/th&gt;
&lt;th&gt;Pricing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Istanbul/nyc&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JavaScript, TypeScript&lt;/td&gt;
&lt;td&gt;Instrumentation&lt;/td&gt;
&lt;td&gt;Any CI&lt;/td&gt;
&lt;td&gt;Fully free&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JaCoCo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Java, Kotlin, Scala&lt;/td&gt;
&lt;td&gt;Instrumentation&lt;/td&gt;
&lt;td&gt;Any CI&lt;/td&gt;
&lt;td&gt;Fully free&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Coverage.py&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Instrumentation&lt;/td&gt;
&lt;td&gt;Any CI&lt;/td&gt;
&lt;td&gt;Fully free&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Codecov&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30+ (via report upload)&lt;/td&gt;
&lt;td&gt;Reporting platform&lt;/td&gt;
&lt;td&gt;GitHub, GitLab, Bitbucket&lt;/td&gt;
&lt;td&gt;Free for OSS&lt;/td&gt;
&lt;td&gt;From $10/user/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Coveralls&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22+ (via report upload)&lt;/td&gt;
&lt;td&gt;Reporting platform&lt;/td&gt;
&lt;td&gt;GitHub, GitLab, Bitbucket&lt;/td&gt;
&lt;td&gt;Free for OSS&lt;/td&gt;
&lt;td&gt;From $5/user/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SonarQube&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;35+&lt;/td&gt;
&lt;td&gt;Quality platform&lt;/td&gt;
&lt;td&gt;GitHub, GitLab, Bitbucket, Azure&lt;/td&gt;
&lt;td&gt;Community free&lt;/td&gt;
&lt;td&gt;From $2,500/yr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSource&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;Quality platform&lt;/td&gt;
&lt;td&gt;GitHub, GitLab, Bitbucket&lt;/td&gt;
&lt;td&gt;Free for OSS&lt;/td&gt;
&lt;td&gt;From $12/user/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Codacy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;td&gt;Quality platform&lt;/td&gt;
&lt;td&gt;GitHub, GitLab, Bitbucket&lt;/td&gt;
&lt;td&gt;Free for OSS&lt;/td&gt;
&lt;td&gt;From $15/user/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;dotCover&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;C#, .NET, VB.NET&lt;/td&gt;
&lt;td&gt;Instrumentation&lt;/td&gt;
&lt;td&gt;Any CI&lt;/td&gt;
&lt;td&gt;Free (CLI)&lt;/td&gt;
&lt;td&gt;From $15.90/mo (IDE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenCover&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;.NET Framework&lt;/td&gt;
&lt;td&gt;Instrumentation&lt;/td&gt;
&lt;td&gt;Any CI&lt;/td&gt;
&lt;td&gt;Fully free&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BullseyeCoverage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;C, C++&lt;/td&gt;
&lt;td&gt;Instrumentation&lt;/td&gt;
&lt;td&gt;Any CI&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;From $800/seat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;gcov/lcov&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;C, C++, Fortran&lt;/td&gt;
&lt;td&gt;Instrumentation&lt;/td&gt;
&lt;td&gt;Any CI&lt;/td&gt;
&lt;td&gt;Fully free&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  1. Istanbul/nyc - best for JavaScript and TypeScript
&lt;/h2&gt;

&lt;p&gt;Istanbul is the standard code coverage tool for the JavaScript ecosystem. Nearly every major JavaScript project - from React and Next.js to Express and NestJS - uses Istanbul for coverage measurement. The &lt;strong&gt;nyc&lt;/strong&gt; command-line tool is the most common way to run Istanbul, wrapping your test runner and instrumenting code on the fly.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;Istanbul instruments your JavaScript and TypeScript source code to track which lines, branches, functions, and statements execute during tests. It works by inserting counters into your code at build time or runtime, then collecting execution data when tests run. The result is a detailed report showing exactly which code paths were covered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language support
&lt;/h3&gt;

&lt;p&gt;JavaScript, TypeScript, and anything that compiles to JavaScript (CoffeeScript, Flow-typed code). Istanbul handles ES modules, CommonJS, and JSX/TSX syntax natively. If you use Babel, webpack, or Vite, Istanbul integrates through their plugin systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Four coverage metrics&lt;/strong&gt; - line, branch, function, and statement coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple report formats&lt;/strong&gt; - HTML, LCOV, Cobertura XML, text, JSON, and clover&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in threshold enforcement&lt;/strong&gt; - fail CI when coverage drops below a configured percentage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source map support&lt;/strong&gt; - accurate coverage for transpiled TypeScript and Babel code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge support&lt;/strong&gt; - combine reports from multiple test runs (unit, integration, e2e) into a single report&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jest integration&lt;/strong&gt; - Jest uses Istanbul internally, so the &lt;code&gt;--coverage&lt;/code&gt; flag works out of the box&lt;/li&gt;
&lt;/ul&gt;
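&lt;p&gt;Threshold enforcement, for example, can be configured in an &lt;code&gt;.nycrc.yml&lt;/code&gt; file. The numbers below are illustrative, not recommendations - pick values that match your codebase:&lt;/p&gt;

```yaml
# .nycrc.yml - nyc fails the run when coverage falls below these thresholds
check-coverage: true
lines: 80
branches: 70
functions: 80
statements: 80
reporter:
  - text
  - lcov
```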

&lt;h3&gt;
  
  
  CI integration
&lt;/h3&gt;

&lt;p&gt;Istanbul works in any CI environment. Generate an LCOV report and upload it to Codecov, Coveralls, or SonarQube for PR-level feedback. For GitHub Actions, a typical workflow runs &lt;code&gt;nyc mocha&lt;/code&gt; or &lt;code&gt;jest --coverage&lt;/code&gt;, then uploads the resulting &lt;code&gt;coverage/lcov.info&lt;/code&gt; file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Completely free and open source under the BSD-2-Clause license. No paid tiers or enterprise editions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;De facto standard for JavaScript coverage - virtually every JS testing tool supports it&lt;/li&gt;
&lt;li&gt;Zero-config with Jest (built in) and minimal config with Mocha, Vitest, and AVA&lt;/li&gt;
&lt;li&gt;Excellent source map support for TypeScript projects&lt;/li&gt;
&lt;li&gt;Active maintenance with regular updates&lt;/li&gt;
&lt;li&gt;Rich report formats suitable for any downstream tool&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Limited to the JavaScript ecosystem - not useful for polyglot projects on its own&lt;/li&gt;
&lt;li&gt;nyc configuration can be tricky for complex monorepo setups with multiple test runners&lt;/li&gt;
&lt;li&gt;No built-in dashboard or historical tracking - you need a reporting service for that&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. JaCoCo - best for Java, Kotlin, and Scala
&lt;/h2&gt;

&lt;p&gt;JaCoCo (Java Code Coverage) is the dominant coverage tool for JVM languages. It is used by the vast majority of Java projects in production, from small Spring Boot applications to enterprise monoliths with millions of lines of code.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;JaCoCo uses bytecode instrumentation to measure coverage without modifying source code. It attaches as a Java agent during test execution, instruments class files on the fly, and records which instructions and branches were executed. This approach means JaCoCo works with any JVM language that compiles to bytecode.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language support
&lt;/h3&gt;

&lt;p&gt;Java, Kotlin, Scala, Groovy, and any JVM language. Coverage is measured at the bytecode level, so source language does not matter as long as it compiles to JVM bytecode.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bytecode-level instrumentation&lt;/strong&gt; - no source code modification required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Line, branch, instruction, and cyclomatic complexity&lt;/strong&gt; coverage metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maven and Gradle plugins&lt;/strong&gt; - first-class build tool integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Report merging&lt;/strong&gt; - aggregate coverage from unit tests, integration tests, and different modules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTML, XML, and CSV reports&lt;/strong&gt; - the XML format integrates with SonarQube, Codecov, and most CI tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exclusion filters&lt;/strong&gt; - skip generated code, DTOs, or specific packages from coverage calculation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CI integration
&lt;/h3&gt;

&lt;p&gt;JaCoCo integrates natively with Maven (&lt;code&gt;jacoco-maven-plugin&lt;/code&gt;) and Gradle (&lt;code&gt;jacoco&lt;/code&gt; plugin). Both produce XML reports that upload directly to Codecov, Coveralls, or SonarQube. Jenkins has a dedicated JaCoCo plugin for trend visualization.&lt;/p&gt;
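&lt;p&gt;A minimal GitHub Actions job for a Gradle project might look like the following. It assumes the &lt;code&gt;jacoco&lt;/code&gt; plugin is applied and XML report output is enabled in your build script; the report path shown is Gradle's default:&lt;/p&gt;

```yaml
name: Coverage
on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: 21
      - run: ./gradlew test jacocoTestReport
      - uses: codecov/codecov-action@v4
        with:
          files: build/reports/jacoco/test/jacocoTestReport.xml
```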

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Completely free and open source under the Eclipse Public License 2.0.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Industry standard for JVM coverage - supported by every Java CI/CD tool&lt;/li&gt;
&lt;li&gt;Bytecode instrumentation means zero source code changes&lt;/li&gt;
&lt;li&gt;Excellent Maven and Gradle integration with minimal configuration&lt;/li&gt;
&lt;li&gt;Accurate branch coverage even for complex conditional logic&lt;/li&gt;
&lt;li&gt;Merges coverage from multiple test phases (unit + integration)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;JVM-only - not applicable outside the Java ecosystem&lt;/li&gt;
&lt;li&gt;HTML reports are functional but visually dated&lt;/li&gt;
&lt;li&gt;Kotlin inline functions can produce confusing coverage results due to bytecode differences&lt;/li&gt;
&lt;li&gt;Configuration for multi-module projects requires careful setup&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Coverage.py - best for Python
&lt;/h2&gt;

&lt;p&gt;Coverage.py is the standard coverage tool for Python. Created by Ned Batchelder and actively maintained for over 15 years, it is used by nearly every Python project that measures test coverage.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;Coverage.py hooks into Python's tracing infrastructure to monitor which lines of Python source code execute during test runs. It supports both line and branch coverage measurement and produces reports in multiple formats.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language support
&lt;/h3&gt;

&lt;p&gt;Python only (CPython and PyPy). Supports Python 3.9 through 3.13+.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Line and branch coverage&lt;/strong&gt; with accurate source mapping&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pytest integration&lt;/strong&gt; via &lt;code&gt;pytest-cov&lt;/code&gt; plugin - add &lt;code&gt;--cov&lt;/code&gt; flag to any pytest run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple report formats&lt;/strong&gt; - terminal, HTML, XML (Cobertura), JSON, and LCOV&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic context&lt;/strong&gt; tracking - see which test covered each line&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Combining reports&lt;/strong&gt; from parallel test runs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration via pyproject.toml&lt;/strong&gt; - modern Python project configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CI integration
&lt;/h3&gt;

&lt;p&gt;Run &lt;code&gt;pytest --cov=mypackage --cov-report=xml&lt;/code&gt; to generate a Cobertura XML report, then upload to Codecov or Coveralls. Works in any CI environment. SonarQube's Python analysis also ingests Coverage.py reports.&lt;/p&gt;
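&lt;p&gt;In a GitHub Actions workflow those steps look roughly like this (the package name and Python version are placeholders):&lt;/p&gt;

```yaml
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pytest pytest-cov
      - run: pytest --cov=mypackage --cov-report=xml
      - uses: codecov/codecov-action@v4
        with:
          files: coverage.xml
```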

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Completely free and open source under the Apache 2.0 license.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;De facto standard for Python coverage - universally supported&lt;/li&gt;
&lt;li&gt;Excellent pytest integration through pytest-cov&lt;/li&gt;
&lt;li&gt;Branch coverage catches untested conditional paths&lt;/li&gt;
&lt;li&gt;Dynamic context feature links each line to the test that covered it&lt;/li&gt;
&lt;li&gt;Active, stable maintenance with regular Python version support updates&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Python only&lt;/li&gt;
&lt;li&gt;Branch coverage can report confusing results for multiline expressions&lt;/li&gt;
&lt;li&gt;No built-in dashboard or trend tracking&lt;/li&gt;
&lt;li&gt;Does not measure coverage for C extensions in Python packages&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Codecov - best coverage reporting platform
&lt;/h2&gt;

&lt;p&gt;Codecov is the most popular dedicated coverage reporting platform. It does not instrument code itself - instead, it ingests coverage reports from tools like Istanbul, JaCoCo, and Coverage.py, then provides rich visualization, PR integration, and historical tracking.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;Codecov receives coverage reports uploaded from your CI pipeline, processes them, and provides a centralized dashboard showing coverage metrics, trends over time, and per-PR coverage impact. Its most valuable feature is the PR comment that shows exactly how a pull request changes coverage - which new lines are covered, which are not, and whether overall coverage improved or regressed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language support
&lt;/h3&gt;

&lt;p&gt;Codecov is language-agnostic. It supports any language that produces coverage reports in standard formats: LCOV, Cobertura XML, JaCoCo XML, clover, gcov, and 20+ other formats. This means it works with JavaScript, Python, Java, Go, Ruby, C/C++, Rust, PHP, Swift, and virtually any other language.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PR comments&lt;/strong&gt; showing coverage diff, new lines covered/uncovered, and overall impact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage status checks&lt;/strong&gt; - block merges when coverage drops below threshold&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flags&lt;/strong&gt; - separate coverage by test type (unit, integration, e2e) or component&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Carryforward flags&lt;/strong&gt; - avoid coverage drops when only part of a monorepo is tested&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sunburst and grid visualizations&lt;/strong&gt; for identifying coverage gaps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Component-level coverage&lt;/strong&gt; for monorepos with multiple services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub, GitLab, and Bitbucket&lt;/strong&gt; native integration&lt;/li&gt;
&lt;/ul&gt;
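&lt;p&gt;Status checks and thresholds live in a &lt;code&gt;codecov.yml&lt;/code&gt; at the repository root. A sketch (the targets here are examples, not recommendations):&lt;/p&gt;

```yaml
coverage:
  status:
    project:
      default:
        target: 80%       # overall coverage the repo must hold
        threshold: 1%     # tolerate up to a 1% drop per PR
    patch:
      default:
        target: 75%       # lines changed in the PR must be 75% covered
```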

&lt;h3&gt;
  
  
  CI integration
&lt;/h3&gt;

&lt;p&gt;Codecov provides a GitHub Action, GitLab CI template, and a universal CLI uploader. Upload is typically a single line added to your CI workflow. The Codecov uploader auto-detects report formats and handles merging when multiple reports are uploaded per commit.&lt;/p&gt;
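&lt;p&gt;With the GitHub Action, the upload step is only a few lines. The token is required for private repositories, and the report path depends on which coverage tool produced the file:&lt;/p&gt;

```yaml
      - uses: codecov/codecov-action@v4
        with:
          files: coverage/lcov.info
          token: ${{ secrets.CODECOV_TOKEN }}
```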

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; for public/open source repositories (unlimited users)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer&lt;/strong&gt; - $10/user/month for private repos (up to 5 users)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team&lt;/strong&gt; - $15/user/month with advanced features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise&lt;/strong&gt; - custom pricing with self-hosted option, SSO, and audit logs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Best-in-class PR integration with clear, actionable coverage diffs&lt;/li&gt;
&lt;li&gt;Language-agnostic - works with any coverage format&lt;/li&gt;
&lt;li&gt;Carryforward flags solve the monorepo coverage problem elegantly&lt;/li&gt;
&lt;li&gt;Free for open source projects&lt;/li&gt;
&lt;li&gt;Fast upload and processing - results typically appear within 60 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No code instrumentation - you still need a language-specific tool to generate reports&lt;/li&gt;
&lt;li&gt;Paid plans can get expensive for large teams&lt;/li&gt;
&lt;li&gt;Occasional report processing delays during peak hours&lt;/li&gt;
&lt;li&gt;The UI can be overwhelming for teams that just want basic coverage numbers&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5. Coveralls - lightweight coverage reporting
&lt;/h2&gt;

&lt;p&gt;Coveralls is a coverage reporting platform similar to Codecov but with a simpler, more focused feature set. It is popular among open source projects and small teams that want straightforward coverage tracking without the complexity of a full platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;Like Codecov, Coveralls receives coverage reports from your CI pipeline and provides dashboards, PR comments, and historical tracking. It focuses on simplicity - showing you coverage percentages, trends, and file-level breakdowns without the advanced features that larger platforms offer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language support
&lt;/h3&gt;

&lt;p&gt;Language-agnostic through report ingestion. Supports LCOV, Cobertura XML, JaCoCo XML, SimpleCov (Ruby), and other standard formats. Official libraries exist for Ruby, Python, JavaScript, PHP, Go, Java, and .NET.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PR comments&lt;/strong&gt; with coverage change summary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Badge generation&lt;/strong&gt; for README files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical coverage tracking&lt;/strong&gt; with trend graphs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File-level coverage&lt;/strong&gt; breakdown with source highlighting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub, GitLab, and Bitbucket&lt;/strong&gt; integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel build support&lt;/strong&gt; for merging reports from matrix CI runs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CI integration
&lt;/h3&gt;

&lt;p&gt;Coveralls provides language-specific reporter packages (e.g., &lt;code&gt;coveralls&lt;/code&gt; npm package, &lt;code&gt;coveralls&lt;/code&gt; Python package) and a universal GitHub Action. Most integrations require two to three lines of CI configuration.&lt;/p&gt;
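&lt;p&gt;For a Node.js project, the classic pattern pipes an LCOV report into the &lt;code&gt;coveralls&lt;/code&gt; npm reporter. A hedged sketch (your test runner and token setup may differ):&lt;/p&gt;

```shell
# Generate LCOV with nyc, then pipe it to the coveralls reporter.
# COVERALLS_REPO_TOKEN must be set in the CI environment for private repos.
nyc --reporter=lcov npm test
nyc report --reporter=text-lcov | npx coveralls
```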

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; for public/open source repositories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro&lt;/strong&gt; - from $5/user/month for private repos&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise&lt;/strong&gt; - custom pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Simple and focused - easy to set up and understand&lt;/li&gt;
&lt;li&gt;Generous free tier for open source&lt;/li&gt;
&lt;li&gt;Lower price point than Codecov for small teams&lt;/li&gt;
&lt;li&gt;Clean, readable PR comments&lt;/li&gt;
&lt;li&gt;Good official language-specific reporter packages&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Fewer advanced features than Codecov (no flags, no component coverage, limited monorepo support)&lt;/li&gt;
&lt;li&gt;Dashboard is basic compared to alternatives&lt;/li&gt;
&lt;li&gt;Less frequent feature updates than Codecov&lt;/li&gt;
&lt;li&gt;No self-hosted option for enterprise environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. SonarQube - best for unified quality and coverage
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/sonarqube"&gt;SonarQube&lt;/a&gt; is a comprehensive code quality platform that includes coverage tracking as part of its broader analysis. It does not instrument code directly - instead, it imports coverage reports from language-specific tools and presents them alongside code quality, security, and duplication findings.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;SonarQube ingests coverage data from tools like JaCoCo, Istanbul, Coverage.py, and gcov, then integrates that data into its quality dashboard. Coverage becomes one dimension of the overall "Quality Gate" - a pass/fail check that can enforce minimum coverage on new code, maximum duplication, zero critical bugs, and other quality thresholds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language support
&lt;/h3&gt;

&lt;p&gt;SonarQube supports coverage import for 35+ languages. It accepts standard coverage formats (LCOV, Cobertura, JaCoCo, etc.) and maps coverage data to its own analysis results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality Gates&lt;/strong&gt; - enforce minimum coverage on new code alongside other quality metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage on new code&lt;/strong&gt; - separate metric for recently added/changed code vs. overall coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pull request decoration&lt;/strong&gt; with inline coverage annotations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical dashboards&lt;/strong&gt; showing coverage trends over weeks and months&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portfolio-level views&lt;/strong&gt; for tracking coverage across multiple projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security hotspot detection&lt;/strong&gt; combined with coverage gaps to prioritize testing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CI integration
&lt;/h3&gt;

&lt;p&gt;SonarQube uses the SonarScanner CLI or build-tool plugins (Maven, Gradle, .NET) to upload analysis results. Coverage reports are passed as analysis parameters. Native integration with GitHub Actions, GitLab CI, Azure DevOps, Jenkins, and Bitbucket Pipelines.&lt;/p&gt;
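&lt;p&gt;As a sketch, a SonarScanner invocation that points the analysis at an existing LCOV report. The project key and paths are placeholders, and the report-path property name varies by language analyzer (the JavaScript one is shown):&lt;/p&gt;

```shell
sonar-scanner \
  -Dsonar.projectKey=my-project \
  -Dsonar.sources=src \
  -Dsonar.javascript.lcov.reportPaths=coverage/lcov.info
```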

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Community Build&lt;/strong&gt; - free and open source (limited features)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer Edition&lt;/strong&gt; - from $2,500/year (branch analysis, PR decoration)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Edition&lt;/strong&gt; - from $22,000/year (portfolio management, SAST)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Center Edition&lt;/strong&gt; - from $130,000/year (high availability)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Coverage is part of a comprehensive quality management platform&lt;/li&gt;
&lt;li&gt;Quality Gates enforce coverage standards as part of a broader quality strategy&lt;/li&gt;
&lt;li&gt;Excellent "coverage on new code" metric prevents existing technical debt from blocking progress&lt;/li&gt;
&lt;li&gt;Wide language support across a single dashboard&lt;/li&gt;
&lt;li&gt;Strong enterprise adoption with compliance reporting&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Does not generate coverage data - you need language-specific tools first&lt;/li&gt;
&lt;li&gt;Community Edition lacks PR decoration and branch analysis&lt;/li&gt;
&lt;li&gt;Self-hosted infrastructure adds operational overhead&lt;/li&gt;
&lt;li&gt;Can be overkill if you only need coverage tracking&lt;/li&gt;
&lt;li&gt;Configuration is more complex than dedicated coverage tools&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  7. DeepSource - best for AI-powered coverage insights
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/deepsource"&gt;DeepSource&lt;/a&gt; is a code quality platform that combines static analysis with coverage tracking and AI-powered recommendations. Its coverage features focus on identifying not just what is uncovered, but what should be tested first.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;DeepSource imports coverage data and combines it with its static analysis findings to surface high-risk uncovered code. Instead of just showing a coverage percentage, it highlights functions with complex logic, security-sensitive code, and frequently changed files that lack test coverage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language support
&lt;/h3&gt;

&lt;p&gt;Python, JavaScript, TypeScript, Go, Java, Ruby, Rust, Kotlin, Scala, Swift, C#, PHP, and more (16 languages total for static analysis, coverage import for most major languages).&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Risk-prioritized coverage gaps&lt;/strong&gt; - highlights uncovered code that is most likely to cause bugs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage tracking&lt;/strong&gt; with historical trends and per-PR reporting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autofix suggestions&lt;/strong&gt; - can generate test stubs for uncovered functions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with static analysis&lt;/strong&gt; findings for unified quality view&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-5% false positive rate&lt;/strong&gt; on static analysis findings&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CI integration
&lt;/h3&gt;

&lt;p&gt;DeepSource connects via GitHub, GitLab, or Bitbucket app. Coverage reports are uploaded using the DeepSource CLI. Supports LCOV, Cobertura, JaCoCo, and other standard formats.&lt;/p&gt;
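&lt;p&gt;A sketch of the upload step, following DeepSource's documented CLI pattern at the time of writing (check the current docs for exact flags):&lt;/p&gt;

```shell
# Run tests with coverage first (e.g. pytest --cov --cov-report=xml),
# then report the artifact to DeepSource.
deepsource report --analyzer test-coverage \
  --key python --value-file ./coverage.xml
```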

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; for open source and individual developers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team&lt;/strong&gt; - $12/user/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise&lt;/strong&gt; - custom pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Intelligent prioritization of coverage gaps based on code risk&lt;/li&gt;
&lt;li&gt;Clean, modern dashboard that developers actually use&lt;/li&gt;
&lt;li&gt;Low false positive rate on complementary static analysis&lt;/li&gt;
&lt;li&gt;Competitive pricing compared to alternatives&lt;/li&gt;
&lt;li&gt;Fast setup with minimal configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Smaller community than SonarQube or Codecov&lt;/li&gt;
&lt;li&gt;Fewer coverage-specific features than dedicated tools like Codecov&lt;/li&gt;
&lt;li&gt;Language support is narrower than some competitors&lt;/li&gt;
&lt;li&gt;Coverage insights depend on also running DeepSource's static analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  8. Codacy - best for coverage plus code quality in one platform
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/codacy"&gt;Codacy&lt;/a&gt; combines code coverage tracking with static analysis, security scanning, and code duplication detection in a single platform. It is particularly strong for teams that want a unified quality dashboard without managing multiple tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;Codacy imports coverage data from your test runs and presents it alongside code quality findings. It tracks coverage per file, per pull request, and over time, with configurable quality gates that can enforce minimum coverage thresholds on new code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language support
&lt;/h3&gt;

&lt;p&gt;Codacy supports 49 programming languages for code analysis. Coverage import is available for any language that produces standard report formats (LCOV, Cobertura, JaCoCo, OpenCover, dotCover, PHPUnit, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified quality dashboard&lt;/strong&gt; - coverage, code quality, security, and duplication in one view&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR-level coverage&lt;/strong&gt; with inline annotations showing uncovered new lines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality gates&lt;/strong&gt; enforcing coverage thresholds alongside other quality standards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organization-level reporting&lt;/strong&gt; across all repositories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage diff&lt;/strong&gt; showing impact of each pull request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;49-language support&lt;/strong&gt; - the broadest polyglot coverage in this list&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CI integration
&lt;/h3&gt;

&lt;p&gt;Codacy provides a coverage reporter CLI and a GitHub Action. Upload coverage reports from any CI system. Native integration with GitHub, GitLab, and Bitbucket.&lt;/p&gt;
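&lt;p&gt;A sketch using the standalone reporter binary (Codacy also publishes a curl-based installer and a GitHub Action; the report path is a placeholder):&lt;/p&gt;

```shell
# Upload an LCOV report with the standalone reporter.
# CODACY_PROJECT_TOKEN must be exported in the CI environment.
codacy-coverage-reporter report -r coverage/lcov.info
```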

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; for open source projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro&lt;/strong&gt; - $15/user/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business&lt;/strong&gt; - custom pricing with SAST, DAST, and advanced features&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Widest language support at 49 languages&lt;/li&gt;
&lt;li&gt;All-in-one platform reduces tool sprawl&lt;/li&gt;
&lt;li&gt;Good PR integration with coverage diff annotations&lt;/li&gt;
&lt;li&gt;Free tier available for open source&lt;/li&gt;
&lt;li&gt;Supports a wide variety of coverage report formats&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Coverage features are not as deep as dedicated tools like Codecov&lt;/li&gt;
&lt;li&gt;Dashboard can be slow with very large repositories&lt;/li&gt;
&lt;li&gt;Some advanced coverage features (component-level tracking) are missing&lt;/li&gt;
&lt;li&gt;Quality gate configuration requires navigating multiple settings pages&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  9. dotCover - best for .NET and C#
&lt;/h2&gt;

&lt;p&gt;JetBrains dotCover is a coverage tool designed specifically for the .NET ecosystem. It integrates with JetBrains Rider and Visual Studio, providing inline coverage visualization directly in the IDE.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;dotCover instruments .NET assemblies to track which lines and branches execute during test runs. It provides both an IDE-integrated experience (highlighting covered and uncovered lines in the editor) and a command-line tool for CI pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language support
&lt;/h3&gt;

&lt;p&gt;C#, VB.NET, F#, and any .NET language. Supports .NET Framework, .NET Core, and .NET 6/7/8/9+. Works with NUnit, xUnit, MSTest, and other .NET test frameworks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IDE integration&lt;/strong&gt; - inline coverage highlighting in Rider and Visual Studio&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous testing&lt;/strong&gt; - re-runs affected tests automatically when you save&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage filters&lt;/strong&gt; - include or exclude assemblies, namespaces, and types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot comparison&lt;/strong&gt; - compare coverage between runs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple report formats&lt;/strong&gt; - HTML, XML, JSON, and JetBrains internal format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge support&lt;/strong&gt; for combining coverage from multiple test runs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CI integration
&lt;/h3&gt;

&lt;p&gt;The dotCover command-line tool runs in any CI environment. JetBrains TeamCity has built-in dotCover integration. Reports can be exported in formats compatible with Codecov, Coveralls, and SonarQube.&lt;/p&gt;
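&lt;p&gt;A sketch using the dotCover .NET global tool (tool and flag names may vary across dotCover versions):&lt;/p&gt;

```shell
# Install the command-line tool once, then wrap dotnet test with coverage.
dotnet tool install --global JetBrains.dotCover.GlobalTool
dotnet dotcover test --dcReportType=DetailedXML --dcOutput=coverage.xml
```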

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CLI tool&lt;/strong&gt; - free (included with &lt;code&gt;dotnet-dotcover&lt;/code&gt; global tool)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IDE plugin for Rider&lt;/strong&gt; - included with JetBrains Rider ($15.90/month first year)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual Studio plugin&lt;/strong&gt; - included with dotUltimate ($28.90/month first year)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TeamCity&lt;/strong&gt; - built-in support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Best-in-class IDE integration for .NET developers&lt;/li&gt;
&lt;li&gt;Continuous testing feature dramatically speeds up the feedback loop&lt;/li&gt;
&lt;li&gt;Accurate coverage for modern .NET (including async/await and LINQ)&lt;/li&gt;
&lt;li&gt;Free CLI tool for CI pipelines&lt;/li&gt;
&lt;li&gt;Strong integration with JetBrains ecosystem (TeamCity, Rider)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;.NET only - not applicable outside the Microsoft ecosystem&lt;/li&gt;
&lt;li&gt;IDE features require a paid JetBrains subscription&lt;/li&gt;
&lt;li&gt;Less community adoption than OpenCover for open source .NET projects&lt;/li&gt;
&lt;li&gt;Report format ecosystem is smaller than JaCoCo or Istanbul&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  10. OpenCover - free .NET coverage
&lt;/h2&gt;

&lt;p&gt;OpenCover is a free, open source code coverage tool for .NET Framework applications. It has been a staple of the .NET open source ecosystem for years, though its development has slowed with the rise of .NET Core and dotCover.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;OpenCover uses the .NET profiling API to instrument assemblies at runtime and collect coverage data. It wraps around your test runner process and produces detailed XML reports compatible with most coverage reporting services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language support
&lt;/h3&gt;

&lt;p&gt;C#, VB.NET, and F# on .NET Framework. .NET Core support is limited - for modern .NET projects, the &lt;code&gt;coverlet&lt;/code&gt; collector (bundled with the default &lt;code&gt;dotnet test&lt;/code&gt; project templates) or dotCover is generally a better choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Profiling API instrumentation&lt;/strong&gt; - no source code changes needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Branch and sequence point coverage&lt;/strong&gt; metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter by namespace, class, or method&lt;/strong&gt; - exclude test assemblies and generated code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XML output&lt;/strong&gt; compatible with ReportGenerator, Codecov, Coveralls, and SonarQube&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Works with NUnit, xUnit, and MSTest&lt;/strong&gt; runners&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CI integration
&lt;/h3&gt;

&lt;p&gt;OpenCover runs as a command-line wrapper around your test runner. Pair it with ReportGenerator for HTML reports and upload the XML output to your coverage service. Common in AppVeyor, Azure DevOps, and Jenkins pipelines.&lt;/p&gt;
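&lt;p&gt;A sketch of the typical Windows CI pair: OpenCover wrapping the test runner, then ReportGenerator rendering HTML. Assembly and filter names are placeholders:&lt;/p&gt;

```shell
OpenCover.Console.exe -register:user ^
  -target:"nunit3-console.exe" -targetargs:"MyApp.Tests.dll" ^
  -filter:"+[MyApp*]*" -output:coverage.xml
ReportGenerator.exe -reports:coverage.xml -targetdir:coverage-report
```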

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Completely free and open source under the MIT license.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Free and open source with no restrictions&lt;/li&gt;
&lt;li&gt;Accurate coverage data for .NET Framework projects&lt;/li&gt;
&lt;li&gt;Well-supported by downstream reporting tools&lt;/li&gt;
&lt;li&gt;Mature and battle-tested over many years&lt;/li&gt;
&lt;li&gt;ReportGenerator integration produces excellent HTML reports&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Limited .NET Core / .NET 6+ support - coverlet is better for modern .NET&lt;/li&gt;
&lt;li&gt;Development has slowed significantly&lt;/li&gt;
&lt;li&gt;Windows-only (does not run on Linux or macOS)&lt;/li&gt;
&lt;li&gt;Command-line interface is not as user-friendly as dotCover&lt;/li&gt;
&lt;li&gt;No IDE integration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  11. Bullseye Testing Technology - best for embedded C/C++
&lt;/h2&gt;

&lt;p&gt;BullseyeCoverage, from Bullseye Testing Technology, is a commercial coverage tool designed for C and C++ projects, particularly in embedded systems, safety-critical software, and environments where gcov is not practical.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;Bullseye instruments C and C++ source code at the preprocessing stage to insert coverage probes. It measures function coverage and condition/decision coverage (a stricter metric than simple branch coverage that is required by safety standards like DO-178C and IEC 62304).&lt;/p&gt;

&lt;h3&gt;
  
  
  Language support
&lt;/h3&gt;

&lt;p&gt;C and C++ only. Supports GCC, Clang, MSVC, Green Hills, Wind River, IAR, and other embedded compilers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Condition/decision coverage&lt;/strong&gt; - required for safety-critical certifications (DO-178C, IEC 62304)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Function and condition coverage&lt;/strong&gt; metrics with detailed reporting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedded system support&lt;/strong&gt; - works with cross-compilers and target hardware&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low overhead instrumentation&lt;/strong&gt; suitable for resource-constrained environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage browser&lt;/strong&gt; GUI for exploring results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge support&lt;/strong&gt; for combining runs from different test configurations or hardware targets&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CI integration
&lt;/h3&gt;

&lt;p&gt;Bullseye runs from the command line and integrates with Make, CMake, and other build systems. Coverage data can be exported for integration with Jenkins and other CI platforms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Named user license&lt;/strong&gt; - from $800/seat&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Floating license&lt;/strong&gt; - from $1,600/seat&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Site license&lt;/strong&gt; - custom pricing&lt;/li&gt;
&lt;li&gt;Free evaluation available&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Condition/decision coverage meets safety certification requirements&lt;/li&gt;
&lt;li&gt;Works with embedded and cross-compilation toolchains&lt;/li&gt;
&lt;li&gt;Low runtime overhead suitable for testing on target hardware&lt;/li&gt;
&lt;li&gt;Long track record in safety-critical industries (aerospace, automotive, medical)&lt;/li&gt;
&lt;li&gt;Excellent support for complex C++ features&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Expensive compared to free alternatives like gcov&lt;/li&gt;
&lt;li&gt;C/C++ only&lt;/li&gt;
&lt;li&gt;Closed source with proprietary license&lt;/li&gt;
&lt;li&gt;No built-in CI dashboard or PR integration&lt;/li&gt;
&lt;li&gt;Learning curve for condition/decision coverage concepts&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  12. gcov/lcov - best free option for C and C++
&lt;/h2&gt;

&lt;p&gt;gcov is the coverage tool built into GCC (GNU Compiler Collection), and lcov is a graphical front-end that generates HTML reports from gcov data. Together, they provide the most widely used free coverage solution for C and C++ projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;gcov works by compiling your code with special GCC flags (&lt;code&gt;--coverage&lt;/code&gt; or &lt;code&gt;-fprofile-arcs -ftest-coverage&lt;/code&gt;) that insert instrumentation. When the program runs, it writes execution counts to &lt;code&gt;.gcda&lt;/code&gt; files. lcov collects these files and produces LCOV-format reports and HTML pages showing line-by-line coverage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language support
&lt;/h3&gt;

&lt;p&gt;C, C++, Fortran, and other GCC-supported languages. Clang also supports gcov-compatible output through its &lt;code&gt;--coverage&lt;/code&gt; flag, so lcov works with both GCC and Clang toolchains.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Line and branch coverage&lt;/strong&gt; at the source level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built into GCC&lt;/strong&gt; - no separate installation for the core tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;lcov HTML reports&lt;/strong&gt; with per-file and per-function breakdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LCOV format&lt;/strong&gt; is a universal standard ingested by Codecov, Coveralls, SonarQube, and others&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;genhtml&lt;/strong&gt; tool generates detailed, navigable HTML reports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Differential coverage&lt;/strong&gt; with lcov's &lt;code&gt;--diff&lt;/code&gt; option&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CI integration
&lt;/h3&gt;

&lt;p&gt;Add &lt;code&gt;--coverage&lt;/code&gt; to your GCC/Clang flags, run tests, then use lcov to generate an LCOV report. Upload the &lt;code&gt;.info&lt;/code&gt; file to Codecov or Coveralls. Works in any CI environment. GitHub Actions, GitLab CI, and Jenkins all have well-documented gcov/lcov workflows.&lt;/p&gt;
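&lt;p&gt;The workflow above, as a minimal sketch (file names are placeholders):&lt;/p&gt;

```shell
# 1. Compile and link with instrumentation.
gcc --coverage -O0 -o tests tests.c
# 2. Run the tests; execution counts land in .gcda files.
./tests
# 3. Collect counts into an LCOV trace file and render HTML.
lcov --capture --directory . --output-file coverage.info
genhtml coverage.info --output-directory coverage-html
```

&lt;p&gt;The resulting &lt;code&gt;coverage.info&lt;/code&gt; file is what you upload to Codecov or Coveralls.&lt;/p&gt;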

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Completely free. gcov is part of GCC (GPL), and lcov is distributed under the GPL.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Free and built into the most widely used C/C++ compiler&lt;/li&gt;
&lt;li&gt;LCOV format is universally supported by reporting tools&lt;/li&gt;
&lt;li&gt;Zero additional dependencies for basic coverage with GCC&lt;/li&gt;
&lt;li&gt;Works with both GCC and Clang toolchains&lt;/li&gt;
&lt;li&gt;lcov HTML reports are clean and easy to navigate&lt;/li&gt;
&lt;li&gt;Enormous community with extensive documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Requires recompilation with coverage flags - not always practical for large projects&lt;/li&gt;
&lt;li&gt;Branch coverage output can be verbose and hard to interpret&lt;/li&gt;
&lt;li&gt;gcov data files can accumulate and require cleanup between runs&lt;/li&gt;
&lt;li&gt;No condition/decision coverage (unlike Bullseye) for safety certifications&lt;/li&gt;
&lt;li&gt;No built-in dashboard or historical tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to choose the right coverage tool
&lt;/h2&gt;

&lt;p&gt;Choosing a coverage tool depends on three factors: your programming language, your team size, and what you need beyond raw coverage numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  By language
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JavaScript/TypeScript&lt;/strong&gt; - Istanbul/nyc is the de facto standard. It is built into Jest and works with every major JS test framework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; - Coverage.py via pytest-cov. There is no real alternative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Java/Kotlin/Scala&lt;/strong&gt; - JaCoCo. It is the industry standard with excellent build tool integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C#/.NET&lt;/strong&gt; - dotCover for IDE integration, coverlet for CI (modern .NET), or OpenCover for .NET Framework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C/C++&lt;/strong&gt; - gcov/lcov for most projects. Bullseye for safety-critical or embedded work requiring condition/decision coverage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  By team needs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Solo developer or small team&lt;/strong&gt; - use a free instrumentation tool plus Codecov or Coveralls free tier for PR feedback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mid-size team (10-50 developers)&lt;/strong&gt; - Codecov Team or Coveralls Pro for dedicated coverage reporting, or SonarQube Developer Edition if you also want code quality and security analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise or polyglot monorepo&lt;/strong&gt; - SonarQube Enterprise for unified quality management, or Codecov Enterprise for dedicated coverage with self-hosted option.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teams already using code quality platforms&lt;/strong&gt; - Codacy or DeepSource if you want coverage as part of a broader quality workflow without adding another tool.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Coverage metrics that matter
&lt;/h3&gt;

&lt;p&gt;Not all coverage metrics are equally useful. Here is what to focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Branch coverage&lt;/strong&gt; is more valuable than line coverage because it catches untested conditional paths&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage on new code&lt;/strong&gt; is more actionable than overall coverage - enforce standards on what is being added, not legacy code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage diff per PR&lt;/strong&gt; is the most practical metric for code review - it tells reviewers whether a change includes adequate tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Function coverage&lt;/strong&gt; is a good sanity check but too coarse for serious quality enforcement&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting up coverage in CI - a practical workflow
&lt;/h2&gt;

&lt;p&gt;Regardless of which tools you choose, the workflow follows the same pattern:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Generate coverage data.&lt;/strong&gt; Configure your test runner to produce coverage output. For JavaScript, add &lt;code&gt;--coverage&lt;/code&gt; to Jest or wrap with nyc. For Python, use &lt;code&gt;pytest --cov&lt;/code&gt;. For Java, add the JaCoCo plugin to your build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Export in a standard format.&lt;/strong&gt; Use LCOV, Cobertura XML, or JaCoCo XML. These formats are understood by every reporting tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Upload to a reporting service.&lt;/strong&gt; Add a CI step to send the report to Codecov, Coveralls, or SonarQube. Most services provide one-line CI configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Configure quality gates.&lt;/strong&gt; Set a minimum coverage threshold on new code (80% is a common starting point). Configure the service to post PR status checks so sub-threshold PRs cannot be merged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Review coverage in PRs.&lt;/strong&gt; Use the PR comment or status check to evaluate whether new code includes adequate tests before approving.&lt;/p&gt;
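&lt;p&gt;The steps above can be sketched as a single GitHub Actions job for a Python project. Action versions and the choice of Codecov as uploader are illustrative, not the only options:&lt;/p&gt;

```yaml
name: tests
on: [pull_request]
jobs:
  coverage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # Steps 1-2: generate coverage and export Cobertura XML.
      - run: pip install pytest pytest-cov && pytest --cov --cov-report=xml
      # Step 3: upload to a reporting service (Codecov shown here).
      - uses: codecov/codecov-action@v4
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
      # Steps 4-5 happen in the service: threshold checks and PR comments.
```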

&lt;h2&gt;
  
  
  Our recommendations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best overall coverage reporting platform:&lt;/strong&gt; Codecov. Its PR integration, flag system, and carryforward flags make it the most practical choice for teams that want actionable coverage feedback on every pull request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best free setup for open source:&lt;/strong&gt; Istanbul/JaCoCo/Coverage.py (depending on language) paired with Codecov's free open source tier. This combination provides professional-grade coverage tracking at zero cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for unified quality management:&lt;/strong&gt; SonarQube. If your team already uses SonarQube for code quality and security, adding coverage tracking to the same platform is the logical choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for teams wanting minimal tool sprawl:&lt;/strong&gt; Codacy or DeepSource. Both platforms combine coverage with code quality, security scanning, and duplication detection, reducing the number of tools your team needs to manage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for safety-critical C/C++:&lt;/strong&gt; Bullseye. Its condition/decision coverage is the only way to meet DO-178C and IEC 62304 certification requirements without building custom tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Code test coverage is a means to an end, not an end in itself. The goal is not to achieve a specific percentage - it is to have confidence that your tests exercise the code paths that matter. A project with 70% coverage focused on business logic and edge cases is better tested than one with 95% coverage that mostly tests trivial getters and setters.&lt;/p&gt;

&lt;p&gt;The best coverage tools make coverage data visible and actionable at the point where it matters most: during code review. When a reviewer can see that a pull request adds 200 lines of code with 0% coverage, they can ask for tests before the code ships. That single workflow improvement prevents more bugs than any coverage threshold ever will.&lt;/p&gt;

&lt;p&gt;Choose a tool that fits your language and workflow, set a reasonable threshold on new code, and focus your testing effort on the code that carries the most risk. The tools in this guide give you everything you need to do that effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a good code coverage percentage to aim for?
&lt;/h3&gt;

&lt;p&gt;Most teams target 80% line coverage as a practical baseline. Reaching 100% is rarely worth the effort because it forces you to write tests for trivial code paths like getters, setters, and unreachable error handlers. The more important metric is branch coverage, which measures whether both the true and false paths of every conditional have been tested. A project with 75% branch coverage and 80% line coverage is generally better tested than one with 95% line coverage but only 50% branch coverage. Focus on covering critical business logic and edge cases rather than chasing a specific number.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between line coverage, branch coverage, and function coverage?
&lt;/h3&gt;

&lt;p&gt;Line coverage measures the percentage of executable lines that were run during tests. Branch coverage measures whether both outcomes of every decision point (if/else, switch cases, ternary operators) were exercised. Function coverage measures the percentage of functions or methods that were called at least once. Branch coverage is the most meaningful metric because a line can be covered without testing all its conditional paths. For example, an if statement on a single line might show 100% line coverage even if only the true branch was tested. Most coverage tools report all three metrics, and mature teams prioritize branch coverage alongside line coverage.&lt;/p&gt;
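&lt;p&gt;A tiny Python sketch (the function and its discount rule are hypothetical) makes the gap between line and branch coverage concrete:&lt;/p&gt;

```python
def apply_discount(price, is_member):
    # One executable line, two branches: running only is_member=True
    # marks this line as covered, yet the False path never executes.
    return price / 2 if is_member else price

# A suite containing only this test reports 100% line coverage for
# apply_discount but just 50% branch coverage.
def test_member_discount():
    assert apply_discount(100, True) == 50.0
```

&lt;p&gt;Running the same suite under a branch-aware tool (for example &lt;code&gt;coverage run --branch&lt;/code&gt;) surfaces the untested &lt;code&gt;else&lt;/code&gt; path that plain line coverage hides.&lt;/p&gt;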

&lt;h3&gt;
  
  
  Are free code coverage tools good enough for production projects?
&lt;/h3&gt;

&lt;p&gt;Yes. Istanbul/nyc, Coverage.py, JaCoCo, gcov/lcov, and OpenCover are all free and open source tools used by thousands of production projects including major open source frameworks. These tools generate accurate coverage data locally and in CI pipelines. What free tools lack compared to paid platforms like Codecov and Coveralls is centralized reporting - historical trends, PR comments with coverage diffs, badge generation, and team-level dashboards. Many teams pair a free instrumentation tool with a free-tier reporting service like Codecov (free for open source) or Coveralls (free for public repos) to get both accurate data and good reporting.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I integrate code coverage into my CI/CD pipeline?
&lt;/h3&gt;

&lt;p&gt;The general pattern is three steps. First, configure your test runner to generate a coverage report in a standard format like LCOV, Cobertura XML, or JaCoCo XML. Second, add a step in your CI workflow (GitHub Actions, GitLab CI, Jenkins, etc.) to upload that report to a coverage service like Codecov, Coveralls, or SonarQube. Third, configure the service to post coverage results as a PR comment or status check so reviewers see coverage impact before merging. Most coverage services provide one-line CI configuration snippets. For example, Codecov requires only adding their GitHub Action and it auto-detects common report formats.&lt;/p&gt;
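&lt;p&gt;As a sketch of those three steps for a Python project on GitHub Actions (the &lt;code&gt;myproject&lt;/code&gt; package name and the workflow layout are placeholders; &lt;code&gt;pytest-cov&lt;/code&gt; and &lt;code&gt;codecov/codecov-action&lt;/code&gt; are the real, documented integrations):&lt;/p&gt;

```yaml
# .github/workflows/ci.yml (illustrative sketch)
name: tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt pytest pytest-cov
      # Step 1: generate a report in a standard format (Cobertura XML)
      - run: pytest --cov=myproject --cov-report=xml
      # Step 2: upload it; the action auto-detects coverage.xml
      - uses: codecov/codecov-action@v4
```

&lt;p&gt;Step 3 - PR comments and status checks - is configured on the Codecov side rather than in the workflow file.&lt;/p&gt;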

&lt;h3&gt;
  
  
  Can I use multiple coverage tools together?
&lt;/h3&gt;

&lt;p&gt;Yes, and many teams do. A typical setup uses a language-specific instrumentation tool (like Istanbul for JavaScript or JaCoCo for Java) to generate raw coverage data, then uploads that data to a reporting platform (like Codecov or SonarQube) for visualization and PR integration. In polyglot projects, you might use Istanbul for your frontend, JaCoCo for your backend, and Coverage.py for your data pipeline - then aggregate all reports in Codecov or SonarQube for a unified view. The key is standardizing on a common report format like LCOV or Cobertura XML that all your reporting tools can ingest.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does high code coverage guarantee bug-free software?
&lt;/h3&gt;

&lt;p&gt;No. Code coverage measures whether lines or branches were executed during tests, not whether the tests actually verify correct behavior. A test that calls a function without any assertions will increase coverage without catching any bugs. This is sometimes called 'coverage theater.' Mutation testing tools like Stryker, PIT, and mutmut address this gap by modifying your source code and checking whether your tests detect the changes. High coverage is a necessary but not sufficient condition for well-tested software. The quality of your assertions matters more than the coverage percentage.&lt;/p&gt;
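&lt;p&gt;A minimal Python illustration of the difference (the function is hypothetical):&lt;/p&gt;

```python
def to_celsius(fahrenheit):
    return (fahrenheit - 32) * 5 / 9

# "Coverage theater": this test executes every line of to_celsius,
# so it contributes 100% coverage, yet it has no assertions and would
# still pass if the formula multiplied instead of subtracted.
def test_runs_without_checking():
    to_celsius(212)

# A meaningful test pins the behavior down. A mutation testing tool
# like mutmut would flag the first test, because no mutation of
# to_celsius can ever make it fail.
def test_boiling_point():
    assert to_celsius(212) == 100.0
```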




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://aicodereview.cc/blog/best-code-test-coverage-tools/" rel="noopener noreferrer"&gt;aicodereview.cc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>ai</category>
      <category>programming</category>
      <category>tools</category>
    </item>
    <item>
      <title>12 Best Code Smell Detection Tools in 2026 - Complete Guide</title>
      <dc:creator>Rahul Singh</dc:creator>
      <pubDate>Fri, 10 Apr 2026 22:00:00 +0000</pubDate>
      <link>https://dev.to/rahulxsingh/12-best-code-smell-detection-tools-in-2026-complete-guide-c76</link>
      <guid>https://dev.to/rahulxsingh/12-best-code-smell-detection-tools-in-2026-complete-guide-c76</guid>
      <description>&lt;h2&gt;
  
  
  What are code smells?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Code smells are surface-level indicators of deeper design problems in your codebase.&lt;/strong&gt; The term was popularized by Martin Fowler and Kent Beck in the book &lt;em&gt;Refactoring: Improving the Design of Existing Code&lt;/em&gt;. A code smell does not mean the code is broken - it compiles, it passes tests, it works - but it signals that something in the structure will cause problems as the project grows.&lt;/p&gt;

&lt;p&gt;Think of code smells like warning signs on a road. A sharp curve sign does not mean an accident has happened, but it tells you to slow down and pay attention. Similarly, a 500-line method does not mean there is a bug, but it strongly suggests the code will be difficult to test, debug, and modify in the future.&lt;/p&gt;

&lt;p&gt;The cost of ignoring code smells compounds over time. A study by the Technical University of Munich found that classes with detected code smells are &lt;strong&gt;3.4 times more likely to contain defects&lt;/strong&gt; than clean classes. Another study from Microsoft Research showed that files with high code smell density require &lt;strong&gt;2.5 times more effort&lt;/strong&gt; to modify during subsequent development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common types of code smells
&lt;/h2&gt;

&lt;p&gt;Before diving into the tools, it helps to understand the major categories of code smells so you can judge which tools offer the most relevant coverage for your codebase.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bloaters
&lt;/h3&gt;

&lt;p&gt;Bloaters are code smells where something has grown too large to manage effectively.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long Method&lt;/strong&gt; - A method that does too much and should be split into smaller, focused methods. Generally, methods over 20-30 lines deserve scrutiny.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large Class (God Class)&lt;/strong&gt; - A class that has accumulated too many responsibilities. It knows too much, does too much, and changes for too many reasons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long Parameter List&lt;/strong&gt; - Methods that accept more than 3-4 parameters, making them hard to call correctly and signaling that the method may need restructuring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Clumps&lt;/strong&gt; - Groups of variables that frequently appear together (like street, city, state, zip) and should be extracted into their own class or structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primitive Obsession&lt;/strong&gt; - Using primitive types for domain concepts instead of creating small value objects (using a string for email instead of an Email type).&lt;/li&gt;
&lt;/ul&gt;
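&lt;p&gt;The last two bloaters often appear together. A Python sketch (all names are hypothetical) of refactoring a data clump and a primitive obsession at once:&lt;/p&gt;

```python
from dataclasses import dataclass

# Before: street/city/state/zip travel together through every
# signature (data clump), and email is a bare string (primitive obsession).
def ship_before(email, street, city, state, zip_code): ...

# After: the clump becomes a value object, and email gets a small
# type that validates itself on construction.
@dataclass(frozen=True)
class Address:
    street: str
    city: str
    state: str
    zip_code: str

@dataclass(frozen=True)
class Email:
    value: str
    def __post_init__(self):
        if "@" not in self.value:
            raise ValueError(f"invalid email: {self.value}")

def ship_after(email: Email, destination: Address) -> str:
    return f"shipping to {destination.city}, {destination.state}"
```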

&lt;h3&gt;
  
  
  Object-orientation abusers
&lt;/h3&gt;

&lt;p&gt;These smells indicate misuse of object-oriented principles.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Feature Envy&lt;/strong&gt; - A method that uses data from another class more than its own, suggesting it belongs in that other class.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inappropriate Intimacy&lt;/strong&gt; - Two classes that depend too heavily on each other's internal details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refused Bequest&lt;/strong&gt; - A subclass that inherits methods and data it does not need, indicating the inheritance hierarchy is wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Switch Statements&lt;/strong&gt; - Complex switch/case blocks that should be replaced with polymorphism.&lt;/li&gt;
&lt;/ul&gt;
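&lt;p&gt;The switch-statement smell and its polymorphic fix, sketched in Python with hypothetical shapes:&lt;/p&gt;

```python
# Smell: a type-checking conditional that must grow every time
# a new shape is added.
def area_switch(shape_type, w, h):
    if shape_type == "rectangle":
        return w * h
    elif shape_type == "triangle":
        return w * h / 2
    raise ValueError(shape_type)

# Refactored: each class owns its behavior, so adding a shape no
# longer means editing a central conditional.
class Rectangle:
    def __init__(self, w, h):
        self.w, self.h = w, h
    def area(self):
        return self.w * self.h

class Triangle:
    def __init__(self, w, h):
        self.w, self.h = w, h
    def area(self):
        return self.w * self.h / 2
```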

&lt;h3&gt;
  
  
  Change preventers
&lt;/h3&gt;

&lt;p&gt;These smells make future changes disproportionately difficult.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Divergent Change&lt;/strong&gt; - A class that gets modified for many different reasons, violating the Single Responsibility Principle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shotgun Surgery&lt;/strong&gt; - A single change that requires edits across many classes, indicating responsibilities are scattered.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel Inheritance Hierarchies&lt;/strong&gt; - Adding a subclass in one hierarchy forces you to add a subclass in another.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Dispensables
&lt;/h3&gt;

&lt;p&gt;These are code elements that serve no useful purpose and should be removed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dead Code&lt;/strong&gt; - Unreachable or unused code that clutters the codebase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicated Code&lt;/strong&gt; - The same logic repeated in multiple places instead of being extracted into a shared method or class.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lazy Class&lt;/strong&gt; - A class that does too little to justify its existence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speculative Generality&lt;/strong&gt; - Abstractions added "just in case" that are never actually used.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Couplers
&lt;/h3&gt;

&lt;p&gt;These smells indicate excessive coupling between components.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Middle Man&lt;/strong&gt; - A class that delegates almost all its work to another class, adding no value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Chains&lt;/strong&gt; - Long chains of method calls like &lt;code&gt;a.getB().getC().getD()&lt;/code&gt; that create tight coupling to the object graph structure.&lt;/li&gt;
&lt;/ul&gt;
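&lt;p&gt;A message chain and its standard "hide delegate" fix, sketched in Python with hypothetical classes:&lt;/p&gt;

```python
# Smell: a.getB().getC().getD() - the caller is coupled to the whole
# object graph, so any intermediate shape change breaks it.
class Address:
    def __init__(self, city):
        self.city = city

class Customer:
    def __init__(self, address):
        self.address = address

class Order:
    def __init__(self, customer):
        self.customer = customer

    # Fix: expose what callers actually need, so only Order knows
    # the path through the graph.
    def shipping_city(self):
        return self.customer.address.city

order = Order(Customer(Address("Austin")))
chained = order.customer.address.city   # message chain
direct = order.shipping_city()          # delegating method
```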

&lt;h2&gt;
  
  
  12 best code smell detection tools compared
&lt;/h2&gt;

&lt;p&gt;Here is a quick comparison of the tools covered in this guide before we dive into detailed reviews.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Languages&lt;/th&gt;
&lt;th&gt;Smell Types&lt;/th&gt;
&lt;th&gt;Pricing&lt;/th&gt;
&lt;th&gt;CI/CD Integration&lt;/th&gt;
&lt;th&gt;Auto-Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SonarQube&lt;/td&gt;
&lt;td&gt;30+&lt;/td&gt;
&lt;td&gt;5,000+ rules&lt;/td&gt;
&lt;td&gt;Free CE / $150+&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CodeClimate&lt;/td&gt;
&lt;td&gt;16+&lt;/td&gt;
&lt;td&gt;Maintainability&lt;/td&gt;
&lt;td&gt;Free OSS / $15/user&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codacy&lt;/td&gt;
&lt;td&gt;49+&lt;/td&gt;
&lt;td&gt;Code patterns&lt;/td&gt;
&lt;td&gt;Free / $15/user&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSource&lt;/td&gt;
&lt;td&gt;11+&lt;/td&gt;
&lt;td&gt;800+ analyzers&lt;/td&gt;
&lt;td&gt;Free OSS / $12/user&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PMD&lt;/td&gt;
&lt;td&gt;Java, Apex, PLSQL, XML&lt;/td&gt;
&lt;td&gt;Design, coupling, size&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ESLint&lt;/td&gt;
&lt;td&gt;JavaScript, TypeScript&lt;/td&gt;
&lt;td&gt;Complexity, structure&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pylint&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Design, refactor&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IntelliJ IDEA&lt;/td&gt;
&lt;td&gt;20+&lt;/td&gt;
&lt;td&gt;6,000+ inspections&lt;/td&gt;
&lt;td&gt;Free CE / $249/yr&lt;/td&gt;
&lt;td&gt;IDE only&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NDepend&lt;/td&gt;
&lt;td&gt;.NET (C#, VB.NET)&lt;/td&gt;
&lt;td&gt;150+ rules&lt;/td&gt;
&lt;td&gt;$459/dev/yr&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JDeodorant&lt;/td&gt;
&lt;td&gt;Java&lt;/td&gt;
&lt;td&gt;5 smell types&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;td&gt;Eclipse only&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Designite&lt;/td&gt;
&lt;td&gt;Java, C#&lt;/td&gt;
&lt;td&gt;20+ design smells&lt;/td&gt;
&lt;td&gt;Free / $99/yr&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CodeAnt AI&lt;/td&gt;
&lt;td&gt;30+&lt;/td&gt;
&lt;td&gt;AI-powered&lt;/td&gt;
&lt;td&gt;Free / $19/user&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  1. SonarQube
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for: Teams that want the most comprehensive rule coverage across multiple languages.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/sonarqube"&gt;SonarQube&lt;/a&gt; is the industry standard for static code analysis and code smell detection. Its "Code Smells" category is a first-class concept in the platform, with thousands of rules specifically designed to identify maintainability issues across 30+ programming languages.&lt;/p&gt;

&lt;h3&gt;
  
  
  What smells it detects
&lt;/h3&gt;

&lt;p&gt;SonarQube groups issues into three categories: Bugs, Vulnerabilities, and Code Smells. The code smell rules cover cognitive complexity, duplicated code, long methods, deeply nested conditionals, unused variables, and many language-specific anti-patterns. Java alone has 600+ code smell rules. Python, JavaScript, TypeScript, C#, and C++ each have 200-400 rules.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Cognitive Complexity&lt;/strong&gt; metric is one of SonarQube's standout features. Unlike cyclomatic complexity, cognitive complexity weights control flow structures based on how difficult they are for humans to understand. This makes it much better at identifying methods that are genuinely hard to read versus methods that simply have many branches.&lt;/p&gt;
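&lt;p&gt;A small illustration of the difference (the grading functions are hypothetical): both versions below contain three conditionals, so their cyclomatic complexity is identical, but cognitive complexity adds a penalty for each level of nesting and scores the second one higher.&lt;/p&gt;

```python
# Flat guard clauses: three branches, no nesting.
def grade_flat(score):
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    return "F"

# Same behavior and same branch count, but each conditional is
# nested one level deeper - harder to read, higher cognitive score.
def grade_nested(score):
    if score < 90:
        if score < 80:
            if score < 70:
                return "F"
            return "C"
        return "B"
    return "A"
```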

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Community Edition&lt;/strong&gt; - Free and open source. Supports 30+ languages, code smells, bugs, and vulnerabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer Edition&lt;/strong&gt; - Starts at $150/year. Adds branch analysis, PR decoration, and additional languages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Edition&lt;/strong&gt; - Starts at $20,000/year. Portfolio management, security reports, and regulatory compliance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Deepest code smell rule coverage in the market&lt;/li&gt;
&lt;li&gt;Quality Gates enforce smell thresholds on every PR&lt;/li&gt;
&lt;li&gt;Technical debt estimation shows the time cost of each smell&lt;/li&gt;
&lt;li&gt;Excellent Java, C#, and JavaScript/TypeScript support&lt;/li&gt;
&lt;li&gt;Self-hosted option keeps code on your infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Self-hosted setup requires a database and server maintenance&lt;/li&gt;
&lt;li&gt;Community Edition lacks PR decoration and branch analysis&lt;/li&gt;
&lt;li&gt;Can be noisy on legacy codebases without careful configuration&lt;/li&gt;
&lt;li&gt;Enterprise pricing is steep for small teams&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. CodeClimate
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for: Teams that prioritize maintainability tracking with minimal setup.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CodeClimate Quality takes a different approach from rule-heavy tools. It focuses on a single &lt;strong&gt;Maintainability&lt;/strong&gt; rating (A through F) derived from structural analysis of your code. Rather than flagging hundreds of individual issues, it identifies patterns that reduce maintainability - duplication, complexity, and file length.&lt;/p&gt;

&lt;h3&gt;
  
  
  What smells it detects
&lt;/h3&gt;

&lt;p&gt;CodeClimate analyzes code for duplication, cognitive complexity, method length, file length, argument count, and return statements. It also integrates with language-specific linters (ESLint, RuboCop, Pylint) to pull in additional smell detection. The key metric is its maintainability GPA, which gives teams a single number to track over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; for open source projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt; - $15/user/month for private repositories&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Clean, easy-to-understand maintainability scores&lt;/li&gt;
&lt;li&gt;Quick setup with GitHub, GitLab, and Bitbucket&lt;/li&gt;
&lt;li&gt;Duplication detection across entire codebase&lt;/li&gt;
&lt;li&gt;Trend tracking shows whether code health is improving or declining&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Fewer granular smell rules than SonarQube&lt;/li&gt;
&lt;li&gt;No auto-fix capability&lt;/li&gt;
&lt;li&gt;Limited to structural smells (does not detect design-level smells like feature envy)&lt;/li&gt;
&lt;li&gt;Less useful for languages outside the JavaScript/Ruby/Python ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Codacy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for: Polyglot teams that need broad language coverage with code smell detection built into PR workflows.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/codacy"&gt;Codacy&lt;/a&gt; supports 49+ languages and wraps dozens of open source analyzers (PMD, ESLint, Pylint, Checkstyle, and more) into a unified platform with consistent reporting and PR integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  What smells it detects
&lt;/h3&gt;

&lt;p&gt;Codacy detects code patterns, complexity, duplication, and unused code across its supported languages. It categorizes issues by severity and type, with code smell categories including style, error-prone patterns, complexity, and performance. Each language uses purpose-built analyzers - PMD for Java, Pylint for Python, ESLint for JavaScript - giving you specialized detection without configuring each tool separately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; for open source and up to 5 users on private repos&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro&lt;/strong&gt; - $15/user/month with advanced security scanning and team management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business&lt;/strong&gt; - Custom pricing for enterprise features&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;49+ language support is the broadest in the market&lt;/li&gt;
&lt;li&gt;Bundles best-in-class open source analyzers&lt;/li&gt;
&lt;li&gt;PR comments with inline code smell annotations&lt;/li&gt;
&lt;li&gt;Free tier is genuinely usable for small teams&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Analysis can be slow on large repositories&lt;/li&gt;
&lt;li&gt;Aggregating multiple analyzers sometimes produces duplicate findings&lt;/li&gt;
&lt;li&gt;Custom rule creation is limited compared to SonarQube&lt;/li&gt;
&lt;li&gt;Some language analyzers are more mature than others&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. DeepSource
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for: Teams that want automated code smell detection with one-click fixes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/deepsource"&gt;DeepSource&lt;/a&gt; stands out for its Autofix feature, which can automatically generate pull requests to resolve detected code smells. It supports Python, Go, Java, JavaScript, TypeScript, Ruby, Rust, Kotlin, Scala, Swift, and C#.&lt;/p&gt;

&lt;h3&gt;
  
  
  What smells it detects
&lt;/h3&gt;

&lt;p&gt;DeepSource covers anti-patterns, bug risks, style issues, performance problems, and security vulnerabilities. For code smells specifically, it detects long functions, complex conditionals, dead code, unused imports, duplicated logic, and type-specific anti-patterns. Its analyzers are built in-house rather than wrapping third-party tools, which gives it more control over false positive rates.&lt;/p&gt;

&lt;p&gt;DeepSource claims a &lt;strong&gt;sub-5% false positive rate&lt;/strong&gt;, which is notably lower than most competitors. In practice, this means less time dismissing irrelevant findings and more time fixing real issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; for open source projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business&lt;/strong&gt; - $12/user/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise&lt;/strong&gt; - Custom pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Autofix generates ready-to-merge PRs for many smells&lt;/li&gt;
&lt;li&gt;Low false positive rate reduces alert fatigue&lt;/li&gt;
&lt;li&gt;Clean dashboard with clear issue categorization&lt;/li&gt;
&lt;li&gt;TOML-based configuration is simple and version-controllable&lt;/li&gt;
&lt;/ul&gt;
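&lt;p&gt;A minimal &lt;code&gt;.deepsource.toml&lt;/code&gt; sketch for a Python project (the &lt;code&gt;runtime_version&lt;/code&gt; value is illustrative - check DeepSource's analyzer docs for the options your stack supports):&lt;/p&gt;

```toml
# .deepsource.toml at the repository root
version = 1

[[analyzers]]
name = "python"

[analyzers.meta]
runtime_version = "3.x.x"
```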

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Supports 11 languages (fewer than Codacy or SonarQube)&lt;/li&gt;
&lt;li&gt;No self-hosted option for the cloud platform&lt;/li&gt;
&lt;li&gt;Design-level smells like feature envy and inappropriate intimacy are not covered&lt;/li&gt;
&lt;li&gt;Autofix works best for Python and Go, less reliable for other languages&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5. PMD
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for: Java teams that want a free, mature, and highly configurable code smell analyzer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PMD is one of the oldest static analysis tools for Java, with deep rules for detecting design problems, code smells, and anti-patterns. It also supports Apex, PLSQL, Visualforce, and XML.&lt;/p&gt;

&lt;h3&gt;
  
  
  What smells it detects
&lt;/h3&gt;

&lt;p&gt;PMD's rule categories include Design, Code Style, Best Practices, Error Prone, Multithreading, and Performance. The Design rules are where code smell detection shines - they cover god classes, data classes, excessive class length, excessive method length, cyclomatic complexity, coupling between objects, depth of inheritance, and law of Demeter violations.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Copy-Paste Detector (CPD)&lt;/strong&gt; included with PMD is one of the best duplication detection tools available. It works across Java, C, C++, JavaScript, Python, Go, and several other languages, even though PMD's core analysis is Java-focused.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; and open source (BSD license)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Mature and battle-tested with 20+ years of development&lt;/li&gt;
&lt;li&gt;Highly configurable with XML rulesets&lt;/li&gt;
&lt;li&gt;CPD duplication detection is best in class&lt;/li&gt;
&lt;li&gt;Integrates with Maven, Gradle, Ant, and all major CI systems&lt;/li&gt;
&lt;li&gt;Massive community with extensive documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Java-centric (limited value for non-Java teams)&lt;/li&gt;
&lt;li&gt;XML configuration can be verbose&lt;/li&gt;
&lt;li&gt;No built-in dashboard or trend tracking&lt;/li&gt;
&lt;li&gt;No auto-fix capabilities&lt;/li&gt;
&lt;li&gt;Requires manual integration with PR workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. ESLint
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for: JavaScript and TypeScript teams that want extensible smell detection integrated into their existing workflow.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ESLint is the standard linter for JavaScript and TypeScript. While it is primarily known for style enforcement, its complexity and best-practice rules provide meaningful code smell detection for frontend and Node.js projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  What smells it detects
&lt;/h3&gt;

&lt;p&gt;ESLint's built-in rules cover cyclomatic complexity (&lt;code&gt;complexity&lt;/code&gt;), maximum depth (&lt;code&gt;max-depth&lt;/code&gt;), maximum lines per function (&lt;code&gt;max-lines-per-function&lt;/code&gt;), maximum parameters (&lt;code&gt;max-params&lt;/code&gt;), and maximum statements (&lt;code&gt;max-statements&lt;/code&gt;). The &lt;code&gt;no-unused-vars&lt;/code&gt;, &lt;code&gt;no-shadow&lt;/code&gt;, and &lt;code&gt;no-redeclare&lt;/code&gt; rules catch dispensable code smells.&lt;/p&gt;

&lt;p&gt;The real power comes from plugins. &lt;code&gt;eslint-plugin-sonarjs&lt;/code&gt; adds SonarQube's cognitive complexity and code smell rules to ESLint. &lt;code&gt;eslint-plugin-jsdoc&lt;/code&gt; catches documentation smells. Custom rules can be written to detect project-specific anti-patterns.&lt;/p&gt;
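&lt;p&gt;A flat-config sketch that turns on these smell rules (the thresholds are illustrative starting points, not ESLint defaults, and &lt;code&gt;eslint-plugin-sonarjs&lt;/code&gt; must be installed separately):&lt;/p&gt;

```javascript
// eslint.config.js (illustrative sketch)
import sonarjs from "eslint-plugin-sonarjs";

export default [
  {
    plugins: { sonarjs },
    rules: {
      complexity: ["warn", 10],               // cyclomatic complexity cap
      "max-depth": ["warn", 3],               // nesting depth
      "max-lines-per-function": ["warn", 50], // long-method guard
      "max-params": ["warn", 4],              // long parameter list
      "sonarjs/cognitive-complexity": ["warn", 15],
    },
  },
];
```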

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; and open source (MIT license)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Already installed in most JavaScript/TypeScript projects&lt;/li&gt;
&lt;li&gt;Highly extensible through plugins&lt;/li&gt;
&lt;li&gt;Auto-fix for many rules via &lt;code&gt;--fix&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Integrates with every editor and CI system&lt;/li&gt;
&lt;li&gt;Flat config system (eslint.config.js) is clean and composable&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Limited to JavaScript and TypeScript&lt;/li&gt;
&lt;li&gt;Complexity rules require manual threshold configuration&lt;/li&gt;
&lt;li&gt;Does not detect design-level OOP smells (feature envy, god class), since its rules analyze one file at a time&lt;/li&gt;
&lt;li&gt;Plugin quality varies significantly&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  7. Pylint
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for: Python teams that want thorough code analysis including structural smell detection.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pylint is the most comprehensive Python linter, with rule categories that go well beyond syntax checking into genuine code smell territory. Its Design and Refactor categories specifically target structural issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  What smells it detects
&lt;/h3&gt;

&lt;p&gt;Pylint's Refactor category includes rules for too-many-arguments, too-many-branches, too-many-instance-attributes, too-many-locals, too-many-return-statements, too-many-statements, too-few-public-methods, and duplicate-code. The Design category covers issues like too-many-ancestors (deep inheritance) and too-many-public-methods (god class indicator).&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;similarity checker&lt;/strong&gt; (&lt;code&gt;--disable=all --enable=similarities&lt;/code&gt;) can scan entire projects for duplicated code blocks, similar to PMD's CPD but for Python.&lt;/p&gt;
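&lt;p&gt;A sketch of tuning these thresholds in &lt;code&gt;pyproject.toml&lt;/code&gt; (the values are illustrative starting points, not recommendations):&lt;/p&gt;

```toml
# pyproject.toml - Pylint design and duplication thresholds
[tool.pylint.design]
max-args = 5
max-branches = 12
max-locals = 15
max-statements = 50

[tool.pylint.similarities]
min-similarity-lines = 6
```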

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; and open source (GPL-2.0)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Most thorough Python analysis available&lt;/li&gt;
&lt;li&gt;Configurable thresholds for all complexity metrics&lt;/li&gt;
&lt;li&gt;Code rating score (0-10) provides a quick health metric&lt;/li&gt;
&lt;li&gt;Active development with regular updates&lt;/li&gt;
&lt;li&gt;Integrates with all Python editors and CI systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Can be very noisy with default settings&lt;/li&gt;
&lt;li&gt;Python-only&lt;/li&gt;
&lt;li&gt;No auto-fix for structural smells&lt;/li&gt;
&lt;li&gt;Configuration requires a &lt;code&gt;.pylintrc&lt;/code&gt; file or &lt;code&gt;pyproject.toml&lt;/code&gt; section&lt;/li&gt;
&lt;li&gt;Slower than alternatives like Ruff for pure linting&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  8. IntelliJ IDEA inspections
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for: Individual developers and small teams who want real-time code smell detection while writing code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;JetBrains IntelliJ IDEA (and its family of IDEs - PyCharm, WebStorm, Rider) includes over 6,000 code inspections that run in real time as you type. Many of these inspections detect code smells and offer one-click refactoring to fix them.&lt;/p&gt;

&lt;h3&gt;
  
  
  What smells it detects
&lt;/h3&gt;

&lt;p&gt;IntelliJ's inspections cover method complexity, class size, parameter count, duplicated code fragments, unused declarations, overly complex expressions, and dozens of language-specific anti-patterns. The "Structural Search and Replace" feature lets you define custom smell patterns using a template language.&lt;/p&gt;

&lt;p&gt;The IDE's refactoring tools are tightly integrated with smell detection. When IDEA flags a long method, it offers "Extract Method" right in the context menu. When it detects duplicated code, it offers "Extract to method/variable" to eliminate the duplication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Community Edition&lt;/strong&gt; - Free for Java, Kotlin, Groovy, and Scala&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ultimate Edition&lt;/strong&gt; - $249/year first year for individuals, $599/year for organizations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Real-time detection as you type (no CI pipeline needed)&lt;/li&gt;
&lt;li&gt;One-click refactoring for detected smells&lt;/li&gt;
&lt;li&gt;6,000+ inspections across 20+ languages&lt;/li&gt;
&lt;li&gt;Custom inspections via Structural Search&lt;/li&gt;
&lt;li&gt;Deepest IDE-based analysis available&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;IDE-only (does not run in CI/CD without Qodana)&lt;/li&gt;
&lt;li&gt;Requires every developer to use JetBrains IDEs&lt;/li&gt;
&lt;li&gt;Results are not shared across the team without Qodana&lt;/li&gt;
&lt;li&gt;Inspection profiles need team-wide standardization&lt;/li&gt;
&lt;li&gt;Ultimate Edition is expensive at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  9. NDepend
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for: .NET teams that need deep architectural analysis and code smell metrics.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;NDepend is the most specialized code quality tool for the .NET ecosystem. It analyzes C# and VB.NET codebases with over 150 code rules, many of which target code smells and design issues specific to .NET patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  What smells it detects
&lt;/h3&gt;

&lt;p&gt;NDepend excels at detecting architectural and design-level smells. Its rules cover class coupling, method complexity, type cohesion (LCOM - Lack of Cohesion of Methods), depth of inheritance, afferent and efferent coupling, cyclomatic complexity, and IL complexity. The &lt;strong&gt;CQLinq&lt;/strong&gt; query language lets you write custom code smell rules using LINQ-style queries over your codebase's structure.&lt;/p&gt;
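&lt;p&gt;As a sketch of what a CQLinq rule looks like - here a god-class query over size and cohesion metrics (the thresholds are illustrative; NDepend ships comparable built-in rules):&lt;/p&gt;

```csharp
// CQLinq sketch: flag god-class candidates
warnif count > 0
from t in JustMyCode.Types
where t.NbLinesOfCode > 500 &&
      t.NbMethods > 40 &&
      t.LCOM > 0.8   // low cohesion of methods
orderby t.NbLinesOfCode descending
select new { t, t.NbLinesOfCode, t.NbMethods, t.LCOM }
```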

&lt;p&gt;NDepend's dependency graph and matrix visualizations make it easy to spot inappropriate intimacy and circular dependencies at the namespace and assembly level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Professional&lt;/strong&gt; - $459/developer/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise&lt;/strong&gt; - $919/developer/year (includes build server license)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Deepest .NET analysis available&lt;/li&gt;
&lt;li&gt;CQLinq enables powerful custom queries&lt;/li&gt;
&lt;li&gt;Dependency visualization for architectural smells&lt;/li&gt;
&lt;li&gt;Trend tracking across builds&lt;/li&gt;
&lt;li&gt;Integrates with Visual Studio and Azure DevOps&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;.NET only&lt;/li&gt;
&lt;li&gt;Expensive compared to cross-platform alternatives&lt;/li&gt;
&lt;li&gt;Steep learning curve for CQLinq&lt;/li&gt;
&lt;li&gt;No free tier (14-day trial only)&lt;/li&gt;
&lt;li&gt;Overkill for small projects&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  10. JDeodorant
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for: Java developers who want automated refactoring suggestions for specific code smells.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;JDeodorant is an Eclipse plugin developed by researchers at Concordia University. It is specifically designed to detect five types of code smells and suggest concrete refactoring opportunities to fix them.&lt;/p&gt;

&lt;h3&gt;
  
  
  What smells it detects
&lt;/h3&gt;

&lt;p&gt;JDeodorant focuses on five smells with academic rigor:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Feature Envy&lt;/strong&gt; - Detects methods that use another class's data more than their own and suggests Move Method refactoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type Checking&lt;/strong&gt; - Identifies switch/if-else chains that should be replaced with polymorphism and suggests Replace Conditional with Polymorphism&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long Method&lt;/strong&gt; - Finds opportunities to extract smaller methods and suggests Extract Method refactoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;God Class&lt;/strong&gt; - Detects classes with too many responsibilities and suggests Extract Class refactoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicated Code&lt;/strong&gt; - Identifies code clones and suggests extraction&lt;/li&gt;
&lt;/ol&gt;
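&lt;p&gt;JDeodorant itself targets Java, but the Feature Envy smell and its Move Method fix are language-agnostic. A minimal sketch in Python, with hypothetical Order and Customer classes standing in for a real codebase:&lt;/p&gt;

```python
class Customer:
    def __init__(self, base_discount, loyalty_years):
        self.base_discount = base_discount
        self.loyalty_years = loyalty_years

    # After Move Method: the discount logic lives with the data it uses.
    def discount_rate(self):
        return self.base_discount + 0.01 * self.loyalty_years


class Order:
    def __init__(self, customer, total):
        self.customer = customer
        self.total = total

    # Feature Envy: this method reads Customer's fields more than Order's
    # own state, so a tool like JDeodorant would suggest Move Method.
    def discounted_total_smelly(self):
        rate = self.customer.base_discount + 0.01 * self.customer.loyalty_years
        return self.total * (1 - rate)

    # After refactoring: Order simply delegates to Customer.
    def discounted_total(self):
        return self.total * (1 - self.customer.discount_rate())
```

&lt;p&gt;The refactored version computes the same total, but the calculation now sits on the class that owns the data, which is exactly the structural change the tool suggests.&lt;/p&gt;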

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; and open source&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Purpose-built for code smell detection (not a general linter)&lt;/li&gt;
&lt;li&gt;Suggests specific refactoring operations, not just warnings&lt;/li&gt;
&lt;li&gt;Backed by academic research on smell detection accuracy&lt;/li&gt;
&lt;li&gt;Completely free&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Eclipse-only (no VS Code, IntelliJ, or CLI support)&lt;/li&gt;
&lt;li&gt;Limited to Java&lt;/li&gt;
&lt;li&gt;Only 5 smell types (no complexity metrics or coupling analysis)&lt;/li&gt;
&lt;li&gt;Development has slowed in recent years&lt;/li&gt;
&lt;li&gt;No CI/CD integration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  11. Designite
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for: Researchers and teams that want to quantify design smells and track architectural degradation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Designite is a code smell detection tool with a focus on design-level and architectural smells. Available in Java and C# editions, it categorizes smells into implementation smells, design smells, and architecture smells - a more granular taxonomy than most tools provide.&lt;/p&gt;

&lt;h3&gt;
  
  
  What smells it detects
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Implementation Smells&lt;/strong&gt; - Long method, complex method, long parameter list, long statement, missing default case, empty catch clause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design Smells&lt;/strong&gt; - Broken modularization, cyclic-dependent modularization, god class, feature envy, insufficient modularization, hub-like modularization, unnecessary abstraction, deep hierarchy, multipath hierarchy, wide hierarchy, rebellious hierarchy, unfactored hierarchy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture Smells&lt;/strong&gt; - Cyclic dependency, unstable dependency, ambiguous interface, god component, feature concentration, scattered functionality, dense structure.&lt;/p&gt;

&lt;p&gt;This three-tier classification makes Designite unique among tools - it catches problems at the method, class, and component level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DesigniteJava&lt;/strong&gt; - Free and open source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Designite for C#&lt;/strong&gt; - $99/year per user&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Three-tier smell taxonomy (implementation, design, architecture)&lt;/li&gt;
&lt;li&gt;Detects architectural smells that other tools miss&lt;/li&gt;
&lt;li&gt;Export to CSV for custom analysis&lt;/li&gt;
&lt;li&gt;DesigniteJava is completely free&lt;/li&gt;
&lt;li&gt;Academic backing with published research&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Limited to Java and C#&lt;/li&gt;
&lt;li&gt;No PR integration or inline comments&lt;/li&gt;
&lt;li&gt;Basic UI compared to commercial tools&lt;/li&gt;
&lt;li&gt;No auto-fix capability&lt;/li&gt;
&lt;li&gt;Small community compared to SonarQube or ESLint&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  12. CodeAnt AI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for: Teams that want AI-powered code smell detection with auto-fix across multiple languages.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/codeant-ai"&gt;CodeAnt AI&lt;/a&gt; uses AI models to detect code quality issues including code smells, security vulnerabilities, and performance problems. It positions itself as a modern alternative to traditional rule-based tools by using machine learning to understand code patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  What smells it detects
&lt;/h3&gt;

&lt;p&gt;CodeAnt AI covers duplicated code, dead code, complex functions, large classes, long parameter lists, deeply nested conditionals, and code style inconsistencies. Its AI approach means it can detect smells that do not match predefined patterns - for example, it may flag a method as doing too many things based on semantic analysis rather than a line count threshold.&lt;/p&gt;

&lt;p&gt;The tool provides auto-fix suggestions powered by AI, which can generate refactored code for detected smells. It supports 30+ languages and integrates with GitHub, GitLab, and Bitbucket.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; for open source and small teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro&lt;/strong&gt; - $19/user/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise&lt;/strong&gt; - Custom pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AI-powered detection catches non-obvious smells&lt;/li&gt;
&lt;li&gt;Auto-fix generates refactored code&lt;/li&gt;
&lt;li&gt;30+ language support&lt;/li&gt;
&lt;li&gt;Quick setup with GitHub/GitLab/Bitbucket integration&lt;/li&gt;
&lt;li&gt;PR-level analysis with inline comments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AI suggestions can occasionally be inaccurate&lt;/li&gt;
&lt;li&gt;Newer tool with a smaller track record than SonarQube or PMD&lt;/li&gt;
&lt;li&gt;Black-box detection (harder to understand why something was flagged)&lt;/li&gt;
&lt;li&gt;Less configurable than rule-based tools&lt;/li&gt;
&lt;li&gt;Dependent on cloud infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to choose the right code smell detection tool
&lt;/h2&gt;

&lt;h3&gt;
  
  
  By team size
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Solo developers&lt;/strong&gt; - IntelliJ IDEA inspections or ESLint/Pylint give you real-time feedback without any infrastructure. JDeodorant is a good addition for Java developers in Eclipse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small teams (2-10)&lt;/strong&gt; - DeepSource or CodeAnt AI provide cloud-hosted analysis with free tiers. Codacy is another strong option with its generous free plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium teams (10-50)&lt;/strong&gt; - SonarQube Developer Edition or Codacy Pro give you PR decoration, branch analysis, and team dashboards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large teams (50+)&lt;/strong&gt; - SonarQube Enterprise, CodeClimate, or Codacy Business for portfolio-level visibility and governance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  By language
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Java&lt;/strong&gt; - SonarQube, PMD, JDeodorant, Designite, IntelliJ IDEA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; - Pylint, DeepSource, SonarQube, Codacy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JavaScript/TypeScript&lt;/strong&gt; - ESLint with sonarjs plugin, SonarQube, DeepSource, CodeClimate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;.NET (C#)&lt;/strong&gt; - NDepend, SonarQube, Designite, Rider inspections&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-language&lt;/strong&gt; - SonarQube, Codacy, CodeAnt AI, DeepSource&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  By budget
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$0&lt;/strong&gt; - PMD + ESLint/Pylint for language-specific detection, SonarQube Community Edition for multi-language, DeepSource free tier for cloud-hosted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$10-20/user/month&lt;/strong&gt; - DeepSource Business, Codacy Pro, CodeAnt AI Pro&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$150-500/year&lt;/strong&gt; - SonarQube Developer Edition, NDepend Professional&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$20,000+/year&lt;/strong&gt; - SonarQube Enterprise for large organizations with governance requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting up code smell detection in your CI pipeline
&lt;/h2&gt;

&lt;p&gt;The most impactful way to use code smell detection tools is to run them automatically on every pull request. Here is a practical approach that works for most teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Choose your tools
&lt;/h3&gt;

&lt;p&gt;Pick one comprehensive tool (SonarQube, Codacy, or DeepSource) for your CI pipeline and one language-specific tool (ESLint, Pylint, or PMD) for developer-local feedback. This gives you both fast local checks and thorough PR-level analysis.&lt;/p&gt;
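&lt;p&gt;As a sketch, a pull request scan with SonarQube on GitHub Actions takes only a few lines of workflow configuration. The action and secret names below are the commonly documented ones; treat this as a starting point rather than an exact recipe:&lt;/p&gt;

```yaml
# Illustrative GitHub Actions job; assumes a reachable SonarQube server
# and SONAR_TOKEN / SONAR_HOST_URL configured as repository secrets.
name: sonarqube-audit
on: [pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history improves new-code detection
      - uses: sonarsource/sonarqube-scan-action@v4
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}
```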

&lt;h3&gt;
  
  
  Step 2: Set quality thresholds
&lt;/h3&gt;

&lt;p&gt;Do not enable every rule at once on an existing codebase. Start with high-confidence, high-impact rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cognitive complexity&lt;/strong&gt; above 15 for any single method&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicated blocks&lt;/strong&gt; longer than 10 lines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Method length&lt;/strong&gt; above 50 lines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File length&lt;/strong&gt; above 500 lines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameter count&lt;/strong&gt; above 5&lt;/li&gt;
&lt;/ul&gt;
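&lt;p&gt;To make the thresholds concrete, here is a minimal sketch of how two of them - method length and parameter count - can be checked mechanically with Python's standard ast module. Real tools do far more, including cognitive complexity and duplication analysis:&lt;/p&gt;

```python
import ast

# Threshold values mirror the list above.
MAX_FUNC_LINES = 50
MAX_PARAMS = 5

def find_smells(source):
    """Return (function_name, reason) pairs for threshold violations."""
    smells = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno is available on parsed nodes in Python 3.8+.
            length = node.end_lineno - node.lineno + 1
            if length > MAX_FUNC_LINES:
                smells.append((node.name, f"method length {length} lines"))
            n_params = len(node.args.args) + len(node.args.kwonlyargs)
            if n_params > MAX_PARAMS:
                smells.append((node.name, f"{n_params} parameters"))
    return smells
```

&lt;p&gt;Running this over a file with a seven-parameter function flags it immediately, which is the same signal a CI-hosted tool would surface as a "long parameter list" smell.&lt;/p&gt;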

&lt;h3&gt;
  
  
  Step 3: Enforce on new code only
&lt;/h3&gt;

&lt;p&gt;Most tools support "new code" analysis that only flags issues introduced in the current PR. This prevents overwhelming developers with thousands of pre-existing issues while ensuring code quality improves with every merge.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Track trends
&lt;/h3&gt;

&lt;p&gt;Use your tool's dashboard to monitor code smell trends over time. The goal is not zero smells (that is unrealistic for any real project) but a consistent downward trend. SonarQube's Quality Gate, CodeClimate's GPA, and DeepSource's health score all provide this at a glance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code smell detection best practices
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the worst offenders.&lt;/strong&gt; Focus on god classes and long methods first. These two smell types are behind a large share of maintainability problems and are the easiest for teams to agree on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tune thresholds to your codebase.&lt;/strong&gt; A method complexity threshold of 10 might work for a new project but generate hundreds of false positives on a legacy codebase. Start lenient and tighten over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Combine static and IDE-based tools.&lt;/strong&gt; Running ESLint in your editor catches smells as you type. Running SonarQube in CI catches smells that slip through. The two approaches complement each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do not treat all smells equally.&lt;/strong&gt; A god class in a core domain module is far more damaging than a long method in a test file. Prioritize smells in high-change, high-risk areas of the codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review smell trends, not just counts.&lt;/strong&gt; A codebase with 500 code smells that is decreasing by 20 per sprint is healthier than one with 100 smells that is increasing by 10 per sprint. Direction matters more than absolute numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;For most teams, &lt;strong&gt;SonarQube Community Edition&lt;/strong&gt; combined with a language-specific linter (ESLint, Pylint, or PMD) provides the best code smell detection coverage at zero cost. If you want a cloud-hosted solution with auto-fix, &lt;strong&gt;DeepSource&lt;/strong&gt; offers the best balance of detection quality, low false positives, and automated remediation. For .NET teams, &lt;strong&gt;NDepend&lt;/strong&gt; is unmatched. And for teams that want AI-powered detection that goes beyond predefined rules, &lt;strong&gt;CodeAnt AI&lt;/strong&gt; is worth evaluating alongside the established options.&lt;/p&gt;

&lt;p&gt;The key is to start. Any code smell detection is better than none, and the tools have never been more accessible. Pick one, integrate it into your PR workflow, and your codebase will thank you within weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/best-ai-code-review-tools"&gt;Best AI Code Review Tools in 2026&lt;/a&gt; - AI-powered tools that catch smells and bugs in pull requests&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/sonarqube-review"&gt;SonarQube Review&lt;/a&gt; - In-depth look at SonarQube's features and setup&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/deepsource-review"&gt;DeepSource Review&lt;/a&gt; - Detailed analysis of DeepSource's detection and Autofix&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/best-code-quality-tools"&gt;Best Code Quality Tools&lt;/a&gt; - Broader comparison of code quality platforms&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/code-review-best-practices"&gt;Code Review Best Practices&lt;/a&gt; - How to combine automated detection with human review&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are code smells and why do they matter?
&lt;/h3&gt;

&lt;p&gt;Code smells are indicators of deeper structural problems in source code. They are not bugs - the code still works - but they signal maintainability issues that make the codebase harder to understand, modify, and extend over time. Common examples include long methods, god classes, feature envy, and duplicated code. Ignoring code smells leads to increasing technical debt, slower development velocity, and higher defect rates.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the best free code smell detection tool?
&lt;/h3&gt;

&lt;p&gt;For free code smell detection, SonarQube Community Edition is the most comprehensive option, covering the most widely used languages (the commercial editions extend language coverage further). PMD and ESLint are excellent free alternatives for Java and JavaScript respectively. DeepSource offers a generous free tier for open source projects that includes code smell detection across Python, Go, Java, JavaScript, and Ruby.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can code smell detection tools automatically fix issues?
&lt;/h3&gt;

&lt;p&gt;Some tools offer automated fixes for certain code smells. DeepSource provides one-click Autofix for many issues. ESLint can auto-fix formatting and simple structural issues with its --fix flag, and Python tools such as autopep8 or Ruff offer similar automated corrections (Pylint itself only reports issues). IntelliJ IDEA offers built-in refactoring actions that can resolve smells like long methods and duplicated code. However, complex architectural smells like god classes or feature envy typically require manual refactoring guided by developer judgment.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do code smell detection tools differ from linters?
&lt;/h3&gt;

&lt;p&gt;Linters primarily check for syntax errors, formatting issues, and basic coding standards. Code smell detection tools go deeper by analyzing structural and design-level problems - things like class coupling, method complexity, inheritance depth, and responsibility distribution. A linter might flag an unused variable, while a code smell tool would flag an entire class that has grown too large and needs to be split into smaller, focused components.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which code smell detection tool works best for Java projects?
&lt;/h3&gt;

&lt;p&gt;For Java projects, SonarQube provides the deepest code smell analysis with over 600 Java-specific rules. PMD is a strong free alternative with mature Java support. JDeodorant is purpose-built for detecting and refactoring Java code smells directly within Eclipse. IntelliJ IDEA's built-in inspections also offer excellent Java smell detection with one-click refactoring support.&lt;/p&gt;

&lt;h3&gt;
  
  
  How often should you run code smell detection?
&lt;/h3&gt;

&lt;p&gt;The most effective approach is to run code smell detection on every pull request as part of your CI/CD pipeline. This catches new smells before they enter the main branch. Additionally, run a full codebase scan weekly or monthly to track overall code health trends and prioritize technical debt reduction. Tools like SonarQube, Codacy, and DeepSource support both PR-level and scheduled full-scan workflows.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://aicodereview.cc/blog/best-code-smell-detection-tools/" rel="noopener noreferrer"&gt;aicodereview.cc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>ai</category>
      <category>programming</category>
      <category>tools</category>
    </item>
    <item>
      <title>12 Best Code Audit Tools in 2026 - Quality and Security</title>
      <dc:creator>Rahul Singh</dc:creator>
      <pubDate>Fri, 10 Apr 2026 21:00:00 +0000</pubDate>
      <link>https://dev.to/rahulxsingh/12-best-code-audit-tools-in-2026-quality-and-security-eeb</link>
      <guid>https://dev.to/rahulxsingh/12-best-code-audit-tools-in-2026-quality-and-security-eeb</guid>
      <description>&lt;h2&gt;
  
  
  What is a code audit and why it matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A code audit is a systematic examination of source code to assess its quality, security posture, and compliance with standards.&lt;/strong&gt; Unlike daily code reviews that focus on individual pull requests, a code audit takes a holistic view of the entire codebase - identifying systemic vulnerabilities, accumulated technical debt, architectural weaknesses, and regulatory compliance gaps.&lt;/p&gt;

&lt;p&gt;Code audits matter because software rot is real. Every codebase accumulates technical debt over time as teams make pragmatic tradeoffs to hit deadlines. Without periodic audits, that debt compounds silently until it manifests as security breaches, production outages, or failed compliance certifications.&lt;/p&gt;

&lt;p&gt;The stakes are high. IBM's Cost of a Data Breach Report 2025 puts the average breach cost at $4.88 million. PCI-DSS 4.0 now mandates automated code analysis for custom application code. SOC 2 Type II auditors increasingly expect evidence of continuous security scanning. If your organization handles sensitive data, code audits are not optional - they are a business requirement.&lt;/p&gt;

&lt;p&gt;The good news is that modern code audit tools automate the most time-consuming parts of the process. They scan millions of lines of code in minutes, detect thousands of vulnerability patterns, and generate compliance-ready reports. The challenge is choosing the right tool for your specific needs - which is what this guide covers.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to perform a code audit
&lt;/h2&gt;

&lt;p&gt;Not every situation calls for the same type of audit. Here are the most common triggers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Annual compliance cycles&lt;/strong&gt; - SOC 2, PCI-DSS, HIPAA, and ISO 27001 all benefit from or require periodic code-level security assessments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-acquisition due diligence&lt;/strong&gt; - Buyers need to assess code quality, security risk, and technical debt before closing a deal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Major architecture changes&lt;/strong&gt; - Migrating to microservices, changing frameworks, or adopting new languages warrants a baseline audit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-incident review&lt;/strong&gt; - After a security breach or major production outage, audit the codebase to find related vulnerabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New team onboarding&lt;/strong&gt; - When a new team inherits a codebase, an audit establishes the current state and priorities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-release security gates&lt;/strong&gt; - Critical releases should pass automated security audits before deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Internal vs external code audits
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Internal audits&lt;/strong&gt; are conducted by your own engineering or security team using automated tools and manual review. They are faster, cheaper, and can run continuously. The downside is potential blind spots - your team may have the same assumptions as the developers who wrote the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External audits&lt;/strong&gt; are performed by third-party security firms or consultants. They bring fresh eyes, specialized expertise, and credibility with auditors and regulators. The downsides are cost ($10,000-100,000+ per engagement) and limited frequency - most organizations can only afford one or two external audits per year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The best approach combines both.&lt;/strong&gt; Run automated internal audits continuously using the tools in this guide, and supplement with annual external audits for high-risk systems. This gives you the breadth of automated scanning with the depth of expert human review.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to structure a code audit process
&lt;/h2&gt;

&lt;p&gt;A well-structured code audit follows these phases:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Define scope and objectives
&lt;/h3&gt;

&lt;p&gt;Decide what you are auditing and why. A compliance-focused audit prioritizes security vulnerabilities mapped to specific controls (OWASP Top 10, CWE Top 25). A quality-focused audit targets technical debt, code complexity, and maintainability. A pre-acquisition audit covers everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Run automated scans
&lt;/h3&gt;

&lt;p&gt;Use SAST tools to scan source code for vulnerabilities and quality issues. Run SCA tools to check dependencies for known CVEs. If you have running applications, add DAST scanning for runtime vulnerabilities. This phase generates the bulk of findings quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Triage and prioritize findings
&lt;/h3&gt;

&lt;p&gt;Automated tools produce noise. Triage findings by severity, exploitability, and business impact. A critical SQL injection in a public-facing API matters more than a minor code style violation in an internal script. Use the tool's built-in severity ratings as a starting point, but apply your own business context.&lt;/p&gt;
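&lt;p&gt;A simple scoring scheme makes this triage step concrete. The weights, field names, and sample findings below are illustrative, not taken from any particular tool:&lt;/p&gt;

```python
# Rank findings by severity, exploitability, and business impact.
SEVERITY = {"critical": 4, "high": 3, "medium": 2, "low": 1}

def triage_score(finding):
    score = SEVERITY[finding["severity"]]
    if finding["internet_facing"]:
        score *= 2   # reachable by external attackers
    if finding["handles_sensitive_data"]:
        score += 2   # higher business impact if exploited
    return score

findings = [
    {"id": "SQLI-1", "severity": "critical", "internet_facing": True,
     "handles_sensitive_data": True},
    {"id": "STYLE-9", "severity": "low", "internet_facing": False,
     "handles_sensitive_data": False},
]
ranked = sorted(findings, key=triage_score, reverse=True)
```

&lt;p&gt;With this scheme the public-facing SQL injection scores 10 and the style issue scores 1, matching the intuition described above: fix the exploitable, high-impact finding first.&lt;/p&gt;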

&lt;h3&gt;
  
  
  4. Manual expert review
&lt;/h3&gt;

&lt;p&gt;Have experienced developers or security engineers review the automated findings, eliminate false positives, and investigate areas that automated tools miss - business logic flaws, architectural weaknesses, and authorization model correctness.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Report and remediate
&lt;/h3&gt;

&lt;p&gt;Document findings with clear severity ratings, reproduction steps, and remediation guidance. Prioritize fixes by risk. Track remediation progress and verify fixes through re-scanning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison table - best code audit tools at a glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Audit Focus&lt;/th&gt;
&lt;th&gt;Analysis Type&lt;/th&gt;
&lt;th&gt;Languages&lt;/th&gt;
&lt;th&gt;Free Tier&lt;/th&gt;
&lt;th&gt;Compliance&lt;/th&gt;
&lt;th&gt;Starting Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SonarQube&lt;/td&gt;
&lt;td&gt;Quality + Security&lt;/td&gt;
&lt;td&gt;SAST&lt;/td&gt;
&lt;td&gt;35+&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;SOC 2, PCI-DSS&lt;/td&gt;
&lt;td&gt;Free (Community)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Checkmarx&lt;/td&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;SAST, SCA, DAST&lt;/td&gt;
&lt;td&gt;25+&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;SOC 2, HIPAA, PCI-DSS&lt;/td&gt;
&lt;td&gt;~$40,000/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Veracode&lt;/td&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;SAST, SCA, DAST&lt;/td&gt;
&lt;td&gt;25+&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;SOC 2, HIPAA, PCI-DSS, FedRAMP&lt;/td&gt;
&lt;td&gt;~$50,000/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snyk Code&lt;/td&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;SAST, SCA&lt;/td&gt;
&lt;td&gt;19+&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;SOC 2&lt;/td&gt;
&lt;td&gt;$25/dev/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semgrep&lt;/td&gt;
&lt;td&gt;Security + Quality&lt;/td&gt;
&lt;td&gt;SAST, SCA&lt;/td&gt;
&lt;td&gt;30+&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;SOC 2&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coverity&lt;/td&gt;
&lt;td&gt;Security + Quality&lt;/td&gt;
&lt;td&gt;SAST&lt;/td&gt;
&lt;td&gt;22+&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;SOC 2, PCI-DSS&lt;/td&gt;
&lt;td&gt;~$50,000/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fortify&lt;/td&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;SAST, DAST&lt;/td&gt;
&lt;td&gt;25+&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;SOC 2, HIPAA, PCI-DSS&lt;/td&gt;
&lt;td&gt;~$40,000/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CodeAnt AI&lt;/td&gt;
&lt;td&gt;Quality + Security&lt;/td&gt;
&lt;td&gt;SAST&lt;/td&gt;
&lt;td&gt;30+&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;SOC 2&lt;/td&gt;
&lt;td&gt;Free (open source)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codacy&lt;/td&gt;
&lt;td&gt;Quality + Security&lt;/td&gt;
&lt;td&gt;SAST, SCA&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;SOC 2&lt;/td&gt;
&lt;td&gt;$15/user/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSource&lt;/td&gt;
&lt;td&gt;Quality + Security&lt;/td&gt;
&lt;td&gt;SAST&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;SOC 2&lt;/td&gt;
&lt;td&gt;$12/user/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Climate&lt;/td&gt;
&lt;td&gt;Quality&lt;/td&gt;
&lt;td&gt;SAST&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;SOC 2&lt;/td&gt;
&lt;td&gt;$49/user/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CAST&lt;/td&gt;
&lt;td&gt;Quality + Security&lt;/td&gt;
&lt;td&gt;SAST&lt;/td&gt;
&lt;td&gt;50+&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;SOC 2, HIPAA, PCI-DSS, ISO&lt;/td&gt;
&lt;td&gt;Custom pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  1. SonarQube - best overall for combined quality and security audits
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/sonarqube"&gt;SonarQube&lt;/a&gt; is the most widely adopted code audit platform, used by over 400,000 organizations worldwide. It combines code quality analysis with security vulnerability detection in a single platform, making it ideal for teams that want unified audit coverage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it audits:&lt;/strong&gt; Code quality (bugs, code smells, technical debt, code duplication, complexity) and security (OWASP Top 10, CWE Top 25, injection flaws, hardcoded credentials).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis type:&lt;/strong&gt; SAST with rule-based pattern matching and dataflow analysis. The Developer Edition and above adds taint analysis for deeper security scanning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 35+ including Java, C#, Python, JavaScript, TypeScript, Go, C/C++, PHP, Ruby, Kotlin, and Swift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI integration:&lt;/strong&gt; GitHub Actions, GitLab CI, Azure DevOps, Jenkins, Bitbucket Pipelines. Quality gates block merges when thresholds are not met.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; Quality profiles can be mapped to OWASP Top 10 and CWE Top 25. Enterprise Edition provides compliance-specific reporting for SOC 2 and PCI-DSS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Community Build is free and open source. Developer Edition starts at $2,500/year. Enterprise Edition starts at $20,000/year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Largest rule library with 6,000+ rules across all supported languages&lt;/li&gt;
&lt;li&gt;Quality gate enforcement prevents low-quality code from merging&lt;/li&gt;
&lt;li&gt;Tracks technical debt over time with trend dashboards&lt;/li&gt;
&lt;li&gt;Self-hosted option gives full control over data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Community Edition lacks branch analysis and taint tracking&lt;/li&gt;
&lt;li&gt;Self-hosting requires infrastructure management&lt;/li&gt;
&lt;li&gt;Security analysis depth trails dedicated SAST tools like Checkmarx&lt;/li&gt;
&lt;li&gt;UI can feel dated compared to newer tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that need a single platform for both code quality metrics and security scanning, especially Java and C# shops.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Checkmarx - best for enterprise security compliance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/checkmarx"&gt;Checkmarx&lt;/a&gt; is an enterprise application security platform that provides deep security analysis with dedicated compliance reporting. It is the go-to choice for organizations with strict regulatory requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it audits:&lt;/strong&gt; Security only - injection vulnerabilities, authentication flaws, authorization bypasses, cryptographic weaknesses, and 700+ vulnerability categories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis type:&lt;/strong&gt; SAST with advanced taint analysis, SCA for open-source dependencies, and DAST for running applications through Checkmarx DAST.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 25+ including Java, C#, JavaScript, Python, C/C++, PHP, Go, Kotlin, Swift, Ruby, and Scala.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI integration:&lt;/strong&gt; Jenkins, GitHub Actions, GitLab CI, Azure DevOps, Bamboo, TeamCity. Provides IDE plugins for Visual Studio, IntelliJ, VS Code, and Eclipse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; Dedicated compliance dashboards for SOC 2, HIPAA, PCI-DSS, GDPR, and NIST. Generates audit-ready reports mapped to specific compliance controls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Starts at approximately $40,000/year for small teams. Enterprise contracts typically range from $80,000-150,000+/year depending on developer count and modules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deepest taint analysis in the market catches complex vulnerability chains&lt;/li&gt;
&lt;li&gt;Compliance reporting is audit-ready out of the box&lt;/li&gt;
&lt;li&gt;Unified SAST, SCA, and DAST in one platform&lt;/li&gt;
&lt;li&gt;Dedicated security research team maintains rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expensive - out of reach for small and mid-size teams&lt;/li&gt;
&lt;li&gt;Scan times can be slow for large codebases (30-60+ minutes)&lt;/li&gt;
&lt;li&gt;High false positive rate without tuning (30-50%)&lt;/li&gt;
&lt;li&gt;Steep learning curve for configuration and custom queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprise organizations with dedicated application security teams and compliance requirements like HIPAA, PCI-DSS, or FedRAMP.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Veracode - best for regulated industries
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/veracode"&gt;Veracode&lt;/a&gt; is a cloud-based application security platform that combines SAST, SCA, and DAST with strong compliance support. It is particularly popular in financial services, healthcare, and government sectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it audits:&lt;/strong&gt; Security - vulnerabilities, insecure coding patterns, open-source license risk, and runtime security issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis type:&lt;/strong&gt; SAST (binary analysis and source code analysis), SCA, and DAST. Veracode's binary analysis is unique - it can scan compiled applications without requiring source code access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 25+ including Java, C#, JavaScript, Python, C/C++, PHP, Go, Ruby, and COBOL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI integration:&lt;/strong&gt; Jenkins, GitHub Actions, GitLab CI, Azure DevOps, Bamboo. Veracode Pipeline Scan provides fast incremental scanning for PRs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; SOC 2, HIPAA, PCI-DSS, FedRAMP, and NIST 800-53. Veracode is FedRAMP authorized, making it one of the few options for US government agencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Starts at approximately $50,000/year. Enterprise contracts range from $100,000-200,000+/year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FedRAMP authorization makes it viable for government contracts&lt;/li&gt;
&lt;li&gt;Binary analysis works without source code access - useful for third-party code audits&lt;/li&gt;
&lt;li&gt;Veracode Fix provides AI-powered remediation suggestions&lt;/li&gt;
&lt;li&gt;Strong policy engine for enforcing security standards across teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The most expensive tool on this list&lt;/li&gt;
&lt;li&gt;Full platform scans can take hours for large applications&lt;/li&gt;
&lt;li&gt;Pipeline Scan (for fast PR feedback) has a more limited rule set than the full platform scan&lt;/li&gt;
&lt;li&gt;Vendor lock-in risk with proprietary analysis engine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Regulated industries (finance, healthcare, government) that need FedRAMP authorization or binary analysis for third-party code.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Snyk Code - best for developer-friendly security auditing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/snyk-code"&gt;Snyk Code&lt;/a&gt; is a developer-first SAST tool that prioritizes speed and low false positives. It uses a machine learning engine trained on millions of open-source projects to detect vulnerabilities with strong contextual understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it audits:&lt;/strong&gt; Security - injection flaws, hardcoded secrets, insecure data flows, and cryptographic issues. Also includes SCA for dependency vulnerabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis type:&lt;/strong&gt; SAST with ML-powered semantic analysis and inter-file dataflow tracking. SCA for open-source components.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 19+ including JavaScript, TypeScript, Python, Java, C#, Go, PHP, Ruby, Kotlin, and Swift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI integration:&lt;/strong&gt; GitHub, GitLab, Bitbucket, Azure DevOps, Jenkins, CircleCI. IDE plugins for VS Code, IntelliJ, and Visual Studio provide real-time scanning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; SOC 2 Type II certified. Findings can be mapped to OWASP Top 10 and CWE Top 25 for compliance evidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free tier for individual developers (limited scans). Team plan at $25/developer/month. Enterprise pricing is custom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scan times under 60 seconds for most repositories&lt;/li&gt;
&lt;li&gt;ML-based analysis reduces false positives compared to rule-based tools&lt;/li&gt;
&lt;li&gt;IDE integration catches vulnerabilities before code is committed&lt;/li&gt;
&lt;li&gt;Free tier is genuinely useful for individual developers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Language support is narrower than Checkmarx or SonarQube&lt;/li&gt;
&lt;li&gt;No DAST capability - security-only, no code quality metrics&lt;/li&gt;
&lt;li&gt;Free tier limits the number of scans&lt;/li&gt;
&lt;li&gt;Enterprise pricing can add up quickly for large teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Development teams that want fast, low-noise security scanning integrated directly into their development workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Semgrep - best open-source code audit tool
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/semgrep"&gt;Semgrep&lt;/a&gt; is an open-source static analysis tool with a powerful custom rule engine. It has become the default choice for teams that want deep security scanning with full control over rules and configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it audits:&lt;/strong&gt; Security (injection, XSS, SSRF, secrets, misconfigurations) and code quality (anti-patterns, best practices). The rule registry contains 10,000+ community and pro rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis type:&lt;/strong&gt; SAST with pattern matching and cross-file taint analysis (Pro tier). SCA through Semgrep Supply Chain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 30+ including Python, JavaScript, TypeScript, Java, Go, Ruby, PHP, C, C++, Rust, Kotlin, Swift, Terraform, Kubernetes YAML, and Dockerfile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI integration:&lt;/strong&gt; GitHub Actions, GitLab CI, Jenkins, CircleCI, Bitbucket Pipelines. Single binary - add one line to any CI config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; Semgrep Pro includes policy engines for enforcing OWASP Top 10 and CWE Top 25. SOC 2 compliant platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; OSS is completely free. Pro tier is free for teams of 10 or fewer. Paid Pro starts at $35/contributor/month. Enterprise pricing is custom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OSS version is free for commercial use with 2,800+ community rules&lt;/li&gt;
&lt;li&gt;Custom rules use a simple pattern syntax - no proprietary query language&lt;/li&gt;
&lt;li&gt;Fastest scan times in the category (10-second median)&lt;/li&gt;
&lt;li&gt;Infrastructure-as-code scanning covers Terraform, Kubernetes, and Dockerfiles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OSS version lacks cross-file analysis and taint tracking&lt;/li&gt;
&lt;li&gt;Code quality rules are less comprehensive than SonarQube&lt;/li&gt;
&lt;li&gt;Pro tier's per-contributor pricing can get expensive for large teams&lt;/li&gt;
&lt;li&gt;No built-in compliance reporting dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want a powerful, customizable security audit tool without vendor lock-in, especially those with infrastructure-as-code to scan.&lt;/p&gt;
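&lt;p&gt;To give a feel for the "simple pattern syntax" mentioned above, here is a minimal Semgrep rule. The rule id and message are illustrative, but the structure (&lt;code&gt;pattern&lt;/code&gt;, &lt;code&gt;message&lt;/code&gt;, &lt;code&gt;languages&lt;/code&gt;, &lt;code&gt;severity&lt;/code&gt;) follows Semgrep's documented YAML rule format:&lt;/p&gt;

```yaml
# Flags any call to os.system, however the arguments are written.
# The "..." ellipsis matches any argument list.
rules:
  - id: no-os-system
    pattern: os.system(...)
    message: Prefer subprocess.run with a list argument over os.system
    languages: [python]
    severity: WARNING
```

&lt;p&gt;Saved as &lt;code&gt;rule.yaml&lt;/code&gt;, this runs locally or in any CI job with &lt;code&gt;semgrep --config rule.yaml .&lt;/code&gt;&lt;/p&gt;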

&lt;h2&gt;
  
  
  6. Coverity - best for C/C++ and embedded systems
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/coverity"&gt;Coverity&lt;/a&gt; (by Synopsys) is an enterprise SAST tool known for its deep analysis of compiled languages. It is the industry standard for auditing C, C++, and embedded systems code where memory safety and reliability are critical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it audits:&lt;/strong&gt; Security vulnerabilities and code quality defects including memory leaks, null pointer dereferences, buffer overflows, race conditions, and resource leaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis type:&lt;/strong&gt; SAST with interprocedural dataflow analysis, abstract interpretation, and path-sensitive analysis. Coverity's analysis engine understands complex control flow in ways that lighter tools cannot match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 22+ with the deepest analysis for C, C++, Java, and C#. Also supports JavaScript, Python, Go, Ruby, PHP, Kotlin, and Swift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI integration:&lt;/strong&gt; Jenkins, GitHub Actions, GitLab CI, Azure DevOps, Bamboo. Coverity Connect provides a centralized web dashboard for managing findings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; Findings mapped to CWE, OWASP Top 10, CERT C/C++, MISRA, and DISA STIG. Used extensively in automotive (ISO 26262), aerospace, and medical device development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Starts at approximately $50,000/year. Enterprise contracts range from $75,000-100,000+/year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deepest C/C++ analysis in the market - catches issues other tools miss&lt;/li&gt;
&lt;li&gt;Path-sensitive analysis reduces false positives on complex control flow&lt;/li&gt;
&lt;li&gt;Industry-standard for safety-critical systems (automotive, medical, aerospace)&lt;/li&gt;
&lt;li&gt;Low false positive rate for compiled languages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expensive - enterprise-only pricing&lt;/li&gt;
&lt;li&gt;Scan times are the slowest on this list for large codebases&lt;/li&gt;
&lt;li&gt;Web interface feels dated&lt;/li&gt;
&lt;li&gt;Limited value for interpreted languages compared to competitors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Organizations developing in C/C++ or building safety-critical embedded systems that need the deepest possible static analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Fortify - best for comprehensive security audit coverage
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/fortify"&gt;Fortify&lt;/a&gt; (by OpenText, formerly Micro Focus/HPE) is an enterprise SAST platform with one of the largest vulnerability rule databases in the industry. It covers both source code analysis and runtime testing through Fortify WebInspect (DAST).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it audits:&lt;/strong&gt; Security - 1,000+ vulnerability categories including OWASP Top 10, CWE/SANS Top 25, DISA STIG, and PCI-DSS-specific checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis type:&lt;/strong&gt; SAST with deep dataflow analysis and taint tracking. DAST through Fortify WebInspect (sold separately). Software Composition Analysis through Sonatype integration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 25+ including Java, C#, JavaScript, Python, C/C++, PHP, Go, Ruby, ABAP, COBOL, and Apex (Salesforce).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI integration:&lt;/strong&gt; Jenkins, Azure DevOps, GitHub Actions, GitLab CI, Bamboo. Fortify on Demand provides a cloud-hosted SaaS option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; Dedicated compliance reporting for SOC 2, HIPAA, PCI-DSS, NIST 800-53, and DISA STIG. Findings map directly to regulatory controls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; On-premises licensing starts at approximately $40,000/year. Fortify on Demand (cloud) pricing is custom based on application count and scan frequency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Largest vulnerability rule database with 1,000+ categories&lt;/li&gt;
&lt;li&gt;Supports legacy languages (COBOL, ABAP) that other tools do not&lt;/li&gt;
&lt;li&gt;Fortify on Demand provides managed cloud scanning without infrastructure&lt;/li&gt;
&lt;li&gt;Strong government and defense sector adoption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High false positive rate requires dedicated triage effort&lt;/li&gt;
&lt;li&gt;On-premises deployment is complex&lt;/li&gt;
&lt;li&gt;UI and developer experience lag behind modern tools&lt;/li&gt;
&lt;li&gt;Expensive licensing model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Large enterprises with diverse technology stacks including legacy languages, and organizations in government or defense sectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. CodeAnt AI - best free code audit tool for startups
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/codeant-ai"&gt;CodeAnt AI&lt;/a&gt; is an AI-powered code quality and security platform that provides automated code audits with a generous free tier. It focuses on detecting anti-patterns, security issues, and code quality problems using static analysis combined with AI-driven insights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it audits:&lt;/strong&gt; Code quality (anti-patterns, dead code, code duplication, complexity) and security (common vulnerability patterns, dependency risks).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis type:&lt;/strong&gt; SAST with AI-augmented pattern detection. Focuses on code quality issues and common security patterns rather than deep taint analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 30+ including Python, JavaScript, TypeScript, Java, Go, Ruby, PHP, C#, Kotlin, Swift, and Rust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI integration:&lt;/strong&gt; GitHub, GitLab, Bitbucket. Provides PR-level feedback and repository-wide scanning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; SOC 2 evidence collection through security scanning. No dedicated compliance dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free for open-source projects. Free tier available for small teams. Paid plans start at $10/user/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generous free tier makes it accessible to startups and small teams&lt;/li&gt;
&lt;li&gt;AI-driven detection catches issues traditional linters miss&lt;/li&gt;
&lt;li&gt;Fast scan times with minimal configuration&lt;/li&gt;
&lt;li&gt;Supports a wide range of languages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security analysis is not as deep as dedicated SAST tools like Checkmarx or Fortify&lt;/li&gt;
&lt;li&gt;Newer tool with a smaller community and rule library&lt;/li&gt;
&lt;li&gt;Limited compliance reporting capabilities&lt;/li&gt;
&lt;li&gt;Enterprise features are still maturing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Startups and small teams that want automated code quality and security auditing without the cost of enterprise tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Codacy - best for polyglot teams
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/codacy"&gt;Codacy&lt;/a&gt; supports 49 programming languages - more than any other tool on this list except CAST. It combines code quality analysis with security scanning and provides a unified dashboard for tracking audit metrics across repositories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it audits:&lt;/strong&gt; Code quality (complexity, duplication, coding standards, coverage tracking) and security (OWASP Top 10, common vulnerabilities, dependency scanning).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis type:&lt;/strong&gt; SAST using multiple open-source engines (ESLint, PMD, Pylint, Bandit, and others) plus proprietary patterns. SCA for dependency vulnerabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 49 including JavaScript, TypeScript, Python, Java, C#, Go, Ruby, PHP, Scala, Kotlin, Swift, Rust, Haskell, Dart, and many more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI integration:&lt;/strong&gt; GitHub, GitLab, Bitbucket. Webhook-based - scans automatically on every push and PR. Also supports Jenkins and CircleCI through CLI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; SOC 2 Type II certified. Security scanning results support compliance evidence collection. Business plan adds DAST capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free for open source. Pro plan at $15/user/month. Business plan with DAST and advanced security at custom pricing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very broad language support at 49 languages&lt;/li&gt;
&lt;li&gt;Aggregates multiple analysis engines for broader coverage&lt;/li&gt;
&lt;li&gt;Coverage tracking and quality metrics in one platform&lt;/li&gt;
&lt;li&gt;Affordable pricing for small and mid-size teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jack-of-all-trades - security depth trails dedicated SAST tools&lt;/li&gt;
&lt;li&gt;Some language analyzers are shallow (basic linting only)&lt;/li&gt;
&lt;li&gt;Dashboard can be slow with many repositories&lt;/li&gt;
&lt;li&gt;Limited custom rule authoring compared to Semgrep&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Polyglot teams using many languages that want unified quality and security metrics in a single affordable platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. DeepSource - best for automated fix suggestions
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/tools/deepsource"&gt;DeepSource&lt;/a&gt; combines static analysis with automated fix suggestions (Autofix) that can resolve detected issues with one click. It is particularly strong on code quality with a growing security capability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it audits:&lt;/strong&gt; Code quality (anti-patterns, bug risks, style violations, complexity, coverage) and security (common vulnerabilities, secrets detection, dependency scanning).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis type:&lt;/strong&gt; SAST with dataflow analysis. Proprietary analysis engine built from scratch rather than wrapping open-source tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 16 including Python, JavaScript, TypeScript, Java, Go, Ruby, PHP, C#, Kotlin, Swift, Rust, and Scala.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI integration:&lt;/strong&gt; GitHub, GitLab, Bitbucket. Automatic scanning on every commit and PR. Also provides a CLI for local scanning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; SOC 2 Type II certified. Security findings support compliance evidence collection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free for open source and individuals. Team plan at $12/user/month. Enterprise pricing is custom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Autofix resolves many issues automatically - saves remediation time&lt;/li&gt;
&lt;li&gt;Claims a sub-5% false positive rate - among the lowest in the category&lt;/li&gt;
&lt;li&gt;Clean, modern UI with excellent developer experience&lt;/li&gt;
&lt;li&gt;Most affordable paid tier at $12/user/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports only 16 languages - the fewest on this list&lt;/li&gt;
&lt;li&gt;Security analysis is less comprehensive than dedicated SAST tools&lt;/li&gt;
&lt;li&gt;No DAST or advanced taint analysis&lt;/li&gt;
&lt;li&gt;Enterprise features are still developing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want low-noise code quality auditing with automated remediation at an affordable price.&lt;/p&gt;

&lt;h2&gt;
  
  
  11. Code Climate - best for engineering team metrics
&lt;/h2&gt;

&lt;p&gt;Code Climate focuses on code quality metrics and engineering team productivity. It is less a security audit tool than a quality and maintainability audit platform: Code Climate Quality analyzes code for complexity, duplication, and maintainability issues, while Code Climate Velocity tracks engineering team metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it audits:&lt;/strong&gt; Code quality only - maintainability, complexity, duplication, test coverage, and coding standards. No security vulnerability detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis type:&lt;/strong&gt; SAST for code quality metrics. Uses maintainability ratings (A through F) for quick assessment.&lt;/p&gt;
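&lt;p&gt;The letter-grade idea is easy to picture: estimate how long it would take to fix everything flagged, divide by the effort it took to build the code, and bucket the ratio. The thresholds below are illustrative assumptions for the sketch, not Code Climate's published cutoffs:&lt;/p&gt;

```python
# Illustrative only: maps a technical-debt ratio (estimated remediation
# time divided by development time) to a letter grade. Thresholds here
# are assumptions for the sketch, not any vendor's actual cutoffs.

def maintainability_grade(remediation_hours: float,
                          development_hours: float) -> str:
    ratio = remediation_hours / development_hours
    for grade, ceiling in [("A", 0.05), ("B", 0.10), ("C", 0.20), ("D", 0.50)]:
        if ratio <= ceiling:
            return grade
    return "F"

print(maintainability_grade(40, 2000))   # 2% debt ratio  -> "A"
print(maintainability_grade(600, 2000))  # 30% debt ratio -> "D"
```

&lt;p&gt;A 2% debt ratio grades as an A while a 30% ratio lands at D - the same general shape of mapping the platform surfaces per file and per repository.&lt;/p&gt;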

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 17 including JavaScript, TypeScript, Python, Ruby, Go, Java, PHP, C#, and Swift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI integration:&lt;/strong&gt; GitHub and GitLab. PR-level feedback with status checks. Jenkins integration available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; SOC 2 certified. Quality metrics support general compliance evidence but no security-specific compliance reporting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Quality starts at $49/user/month. Velocity (engineering metrics) is priced separately. Combined plans available at custom pricing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintainability ratings provide instant codebase health assessment&lt;/li&gt;
&lt;li&gt;Engineering velocity metrics help identify process bottlenecks&lt;/li&gt;
&lt;li&gt;Clean PR integration with pass/fail quality gates&lt;/li&gt;
&lt;li&gt;Good for non-technical stakeholders who need simple quality metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No security scanning at all - must pair with a separate security tool&lt;/li&gt;
&lt;li&gt;Expensive for what it offers at $49/user/month&lt;/li&gt;
&lt;li&gt;Limited language support compared to competitors&lt;/li&gt;
&lt;li&gt;Quality analysis is less detailed than SonarQube&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Engineering leaders who need maintainability metrics and team productivity data for quality-focused audits.&lt;/p&gt;

&lt;h2&gt;
  
  
  12. CAST - best for large-scale enterprise code audits
&lt;/h2&gt;

&lt;p&gt;CAST (CAST Highlight and CAST Imaging) specializes in large-scale codebase analysis for enterprise transformation, due diligence, and portfolio-level audits. It can analyze millions of lines of code across 50+ languages and provides architectural visualization alongside quality and security metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it audits:&lt;/strong&gt; Code quality (technical debt, complexity, maintainability), security (OWASP, CWE), and architecture (dependency mapping, component coupling, cloud readiness).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis type:&lt;/strong&gt; SAST with architectural analysis. CAST Imaging creates interactive architecture maps from source code. CAST Highlight provides portfolio-level metrics across hundreds of applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; 50+ including Java, C#, JavaScript, Python, C/C++, COBOL, ABAP, PL/SQL, RPG, and dozens of legacy languages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI integration:&lt;/strong&gt; CAST can integrate with CI/CD pipelines but is primarily designed for periodic comprehensive audits rather than PR-level scanning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; SOC 2, HIPAA, PCI-DSS, ISO 27001. Generates audit-ready compliance reports. Used extensively in M&amp;amp;A due diligence by Big Four consulting firms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Custom pricing based on lines of code and application count. Typical engagements range from $30,000-200,000+ per year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handles the largest codebases (millions of lines across hundreds of applications)&lt;/li&gt;
&lt;li&gt;Architectural visualization is unique - no other tool on this list provides it&lt;/li&gt;
&lt;li&gt;Portfolio-level analysis across entire application estates&lt;/li&gt;
&lt;li&gt;Strong M&amp;amp;A and due diligence track record&lt;/li&gt;
&lt;li&gt;Best legacy language support on this list&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not designed for developer workflow integration (PR-level scanning)&lt;/li&gt;
&lt;li&gt;Expensive and complex to deploy&lt;/li&gt;
&lt;li&gt;Overkill for small and mid-size organizations&lt;/li&gt;
&lt;li&gt;Learning curve for interpreting architectural analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Large enterprises conducting portfolio-level audits, M&amp;amp;A due diligence, or modernization assessments across diverse technology stacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendations by use case
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For startups and small teams (under 20 developers)
&lt;/h3&gt;

&lt;p&gt;Start with &lt;strong&gt;Semgrep OSS&lt;/strong&gt; for security scanning and &lt;strong&gt;SonarQube Community Build&lt;/strong&gt; for code quality. Both are free. Add &lt;strong&gt;DeepSource&lt;/strong&gt; ($12/user/month) or &lt;strong&gt;CodeAnt AI&lt;/strong&gt; (free tier) if you want a managed platform with less configuration overhead. This stack covers security and quality auditing at minimal cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  For mid-size teams (20-100 developers)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Semgrep Pro&lt;/strong&gt; (free for 10 contributors, then $35/contributor/month) provides the strongest security coverage. Pair with &lt;strong&gt;Codacy&lt;/strong&gt; ($15/user/month) for broad language support and quality metrics. If you prefer a single tool, &lt;strong&gt;SonarQube Developer Edition&lt;/strong&gt; ($2,500/year) provides unified quality and security at reasonable cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  For enterprise security compliance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Checkmarx&lt;/strong&gt; or &lt;strong&gt;Veracode&lt;/strong&gt; for organizations that need dedicated compliance reporting, SAST + SCA + DAST in one platform, and audit-ready documentation. Choose Veracode if you need FedRAMP authorization or binary analysis. Choose Checkmarx for the deepest taint analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  For C/C++ and embedded systems
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Coverity&lt;/strong&gt; is the clear choice. Its path-sensitive analysis and understanding of memory safety issues in C/C++ are unmatched. Supplement with &lt;strong&gt;SonarQube&lt;/strong&gt; for broader quality metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  For M&amp;amp;A due diligence and portfolio audits
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;CAST&lt;/strong&gt; is purpose-built for this use case. Its ability to analyze millions of lines across 50+ languages and generate architectural visualizations makes it the standard for technical due diligence.&lt;/p&gt;

&lt;h3&gt;
  
  
  For developer-first security
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Snyk Code&lt;/strong&gt; provides the fastest feedback loop with IDE integration and sub-60-second scan times. Pair with &lt;strong&gt;Semgrep&lt;/strong&gt; for deeper custom rule coverage. This combination gives developers real-time security feedback without disrupting their workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key factors when choosing a code audit tool
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Analysis depth vs speed
&lt;/h3&gt;

&lt;p&gt;Enterprise tools like Checkmarx and Coverity perform deep interprocedural analysis that catches complex vulnerability chains but takes 30-60+ minutes. Tools like Semgrep and Snyk Code scan in seconds but may miss vulnerabilities that require deep path analysis. For continuous auditing in CI/CD, speed matters. For periodic comprehensive audits, depth matters more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compliance requirements
&lt;/h3&gt;

&lt;p&gt;If your organization needs compliance-specific reporting (SOC 2, HIPAA, PCI-DSS), enterprise tools like Checkmarx, Veracode, and Fortify provide audit-ready dashboards out of the box. Open-source tools like Semgrep and SonarQube can support compliance evidence collection but require more manual effort to produce audit-ready reports.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language coverage
&lt;/h3&gt;

&lt;p&gt;Verify that your primary languages have deep analysis support, not just surface-level linting. A tool that lists 50 languages but only has deep analysis for 5 of them may miss critical issues in your stack. SonarQube has the deepest Java and C# analysis. Coverity leads for C/C++. Semgrep is strongest for Python and Go security.&lt;/p&gt;

&lt;h3&gt;
  
  
  False positive management
&lt;/h3&gt;

&lt;p&gt;High false positive rates destroy developer trust and make audit findings useless. DeepSource claims a sub-5% false positive rate, and Snyk Code and Semgrep (with AI-assisted triage) also report low rates. Enterprise tools like Checkmarx and Fortify can have 30-50% false positive rates without tuning. Ask for trial access and test on your actual codebase before committing.&lt;/p&gt;
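&lt;p&gt;When you run that trial, measure the rate yourself: triage a sample of findings and divide the noise by the total, rather than relying on vendor numbers. A minimal sketch (the counts are made up for illustration):&lt;/p&gt;

```python
# False positive rate from a manually triaged sample of findings:
# noise divided by everything triaged.

def false_positive_rate(true_positives: int, false_positives: int) -> float:
    total = true_positives + false_positives
    return false_positives / total if total else 0.0

# e.g. 120 triaged findings: 78 real issues, 42 noise
rate = false_positive_rate(78, 42)
print(f"{rate:.0%}")  # 35%
```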

&lt;h3&gt;
  
  
  Total cost of ownership
&lt;/h3&gt;

&lt;p&gt;Free tools still have costs - infrastructure, configuration, and maintenance time. A managed SaaS tool at $15/user/month may be cheaper than self-hosting a free tool when you account for engineering time. Factor in training, rollout, and ongoing rule maintenance when comparing pricing.&lt;/p&gt;
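&lt;p&gt;A quick back-of-the-envelope comparison makes the point. All figures below are illustrative assumptions (team size, infrastructure spend, a loaded engineering rate) rather than vendor quotes:&lt;/p&gt;

```python
# Rough annual total cost of ownership: license + infrastructure +
# the engineering time spent keeping the tool running.

def annual_tco(license_per_user_month: float, users: int,
               infra_per_year: float, maintenance_hours_month: float,
               engineer_rate: float) -> float:
    license_cost = license_per_user_month * users * 12
    maintenance = maintenance_hours_month * engineer_rate * 12
    return license_cost + infra_per_year + maintenance

team = 30
self_hosted = annual_tco(0, team, infra_per_year=3_000,
                         maintenance_hours_month=10, engineer_rate=100)
saas = annual_tco(15, team, infra_per_year=0,
                  maintenance_hours_month=1, engineer_rate=100)
print(f"self-hosted: ${self_hosted:,.0f}/yr")  # $15,000/yr
print(f"SaaS:        ${saas:,.0f}/yr")         # $6,600/yr
```

&lt;p&gt;Under these assumptions the "free" self-hosted option costs more than twice as much as the $15/user/month SaaS once maintenance time is priced in - and the comparison flips again as the team grows, which is exactly why you should model your own numbers.&lt;/p&gt;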

&lt;h2&gt;
  
  
  Final verdict
&lt;/h2&gt;

&lt;p&gt;There is no single best code audit tool because the right choice depends on your team size, technology stack, compliance requirements, and budget. For most teams, a combination of two tools provides the best coverage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A security-focused tool&lt;/strong&gt; (Semgrep, Snyk Code, or Checkmarx depending on budget) for vulnerability detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A quality-focused tool&lt;/strong&gt; (SonarQube, Codacy, or DeepSource) for technical debt, complexity, and maintainability tracking&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Run the security tool in every CI pipeline for continuous protection. Run the quality tool for periodic comprehensive audits and trend tracking. This two-tool approach covers both dimensions of code auditing - security and quality - without overloading developers with a single monolithic platform.&lt;/p&gt;

&lt;p&gt;The most important thing is to start auditing. A simple Semgrep + SonarQube setup running in CI catches more issues than a perfectly planned enterprise audit program that never gets deployed. Start with what you can implement this week and iterate from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a code audit?
&lt;/h3&gt;

&lt;p&gt;A code audit is a systematic review of source code to evaluate its quality, security, maintainability, and compliance with coding standards. Unlike routine code review during pull requests, a code audit examines the entire codebase or a significant portion of it to identify systemic issues - security vulnerabilities, technical debt, architectural weaknesses, and violations of regulatory requirements like SOC 2, HIPAA, or PCI-DSS. Code audits can be performed internally by the development team or externally by third-party firms.&lt;/p&gt;

&lt;h3&gt;
  
  
  How often should you perform a code audit?
&lt;/h3&gt;

&lt;p&gt;Most organizations should perform a comprehensive code audit at least once per year, with automated scanning running continuously in CI/CD pipelines. High-risk events that should trigger an immediate audit include pre-acquisition due diligence, major architecture changes, compliance certification renewals, post-security-incident reviews, and onboarding a new development team or vendor. Continuous automated auditing with tools like SonarQube or Semgrep supplements annual deep audits by catching issues in real time.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between a code audit and a code review?
&lt;/h3&gt;

&lt;p&gt;A code review evaluates individual changes at the pull request level, focusing on whether new or modified code is correct and follows team conventions. A code audit is a broader examination of the entire codebase or a major subsystem, looking for systemic patterns like accumulated technical debt, widespread security weaknesses, licensing violations, and compliance gaps. Code reviews are ongoing and incremental. Code audits are periodic and comprehensive. Both are necessary for a mature engineering organization.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much do code audit tools cost?
&lt;/h3&gt;

&lt;p&gt;Code audit tool pricing ranges from free to over $200,000 per year. Free options include SonarQube Community Build, Semgrep OSS, and DeepSource's free tier. Mid-range tools cost $12-35 per developer per month - DeepSource at $12/user/month, Codacy at $15/user/month, and Semgrep at $35/contributor/month. Enterprise tools like Checkmarx ($40,000-150,000+/year), Veracode ($50,000-200,000+/year), Fortify ($40,000-80,000+/year), and Coverity ($50,000-100,000+/year) include compliance reporting and dedicated support.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can automated code audit tools replace manual audits?
&lt;/h3&gt;

&lt;p&gt;No. Automated tools excel at detecting known vulnerability patterns, coding standard violations, and quantifiable metrics like cyclomatic complexity and code duplication. However, they cannot evaluate business logic correctness, architectural soundness, or whether the code actually meets its intended requirements. A thorough code audit combines automated tooling for breadth and speed with manual expert review for depth and context. The best approach is to run automated scans first to clear the mechanical issues, then have human auditors focus on architecture, logic, and design.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which code audit tools support SOC 2 and PCI-DSS compliance?
&lt;/h3&gt;

&lt;p&gt;Enterprise tools like Checkmarx, Veracode, Fortify, and Coverity provide dedicated compliance reporting for SOC 2, PCI-DSS, HIPAA, and other regulatory frameworks. They map findings directly to compliance controls and generate audit-ready reports. SonarQube Developer Edition and above include compliance-oriented quality profiles. Semgrep and Snyk Code offer policy engines that can enforce compliance-related rules. For smaller teams, Codacy and DeepSource provide security scanning that supports SOC 2 evidence collection, though without the dedicated compliance dashboards of enterprise tools.&lt;/p&gt;
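&lt;p&gt;To make "policy engine" concrete, here is what a compliance-oriented Semgrep rule looks like. This one flags MD5 use in Python code - the kind of weak-cryptography finding that PCI-DSS evidence collection cares about. The rule id, message, and metadata mapping are illustrative examples I wrote for this post, not entries from any official ruleset.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# rules/no-md5.yml (illustrative example, not an official rule)
rules:
  - id: insecure-md5-hash
    pattern: hashlib.md5(...)
    message: MD5 is a broken hash; use hashlib.sha256 instead.
    languages: [python]
    severity: ERROR
    metadata:
      # free-form field - map findings to your own control IDs
      compliance: "PCI-DSS weak-cryptography control"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Running &lt;code&gt;semgrep scan --config rules/&lt;/code&gt; in CI then turns the policy into a build gate: any match fails the pipeline and leaves a logged finding you can attach to compliance evidence.&lt;/p&gt;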




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://aicodereview.cc/blog/best-code-audit-tools/" rel="noopener noreferrer"&gt;aicodereview.cc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>ai</category>
      <category>programming</category>
      <category>tools</category>
    </item>
  </channel>
</rss>
