cognix-dev
AI CLI Coding Tool Execution Accuracy Benchmark: Claude Code vs Aider vs Cognix on the Same LLM

πŸ“‹ Summary

You've probably been there: AI generates some code, you run it β€” and it fails.

It's not a speed problem. You're using the same LLM, but different tools give different results. So where does that gap come from?

The idea behind this experiment: whether code succeeds or fails isn't about how fast the tool is β€” it's about how the generation pipeline is designed.

We put 3 tools head-to-head with the same LLM (sonnet-4-5) and the same task.

| Metric | Claude Code | Aider | Cognix |
| --- | --- | --- | --- |
| Execution Accuracy | 100% | 87.5% | 100% |
| Code Quality * | 4.79 | 1.69 | 0.0 |
| Speed | 391s | 191s | 864s |

\* Lint errors per 100 lines; lower is better. n=3 average.

There was a clear difference. And it lined up with differences in pipeline design β€” the sequence of steps from code generation to validation.

From this experiment: whether code fails comes down to the thickness of the validation layer β€” how well the tool catches cases where the AI's assumptions turn out to be wrong.

"What makes them different?" β€” the experimental design, raw data, and breakdown of what makes code actually run are all below.


Abstract

Most AI coding tool benchmarks measure speed or output volume. Almost none measure whether the generated code actually runs correctly.

We designed a benchmark around a single, concrete task: "add a feature to an existing Python project." All tools used the same LLM (claude-sonnet-4-5-20250929), run 3 times each, evaluated across 5 axes.

Key results:

  • Same LLM across all tools: claude-sonnet-4-5-20250929
  • Execution accuracy: Cognix=100%, Claude Code=100%, Aider=87.5%
  • Code quality (lint errors/100 lines): Cognix=0.0, Aider=1.69, Claude Code=4.79
  • Speed: Aider fastest (190.6s), Cognix slowest (863.7s)
  • Fully reproducible: source code and raw data are published

1. Why We Did This

Most discussions about "which AI coding tool is better" focus on UI polish, context window size, or how fast it responds. But what developers actually care about is: does the generated code work?

That distinction matters a lot. A tool that generates code fast is useless if that code breaks when you integrate it into a real project.

Cognix was built around this problem β€” not speed, but multi-stage quality validation to ensure generated code meets external contracts.

To back that up, we needed real numbers. This article is Phase 1 of a benchmark comparing Cognix against two widely-used tools under controlled, reproducible conditions.

Disclosure: The author is the developer of Cognix and co-author of the evaluation script (verify.py). All artifacts are published and independently verifiable.


2. Experimental Setup

2.1 Tools and Versions

| Tool | Version | Notes |
| --- | --- | --- |
| Cognix | v0.2.5 | Open-source, multi-stage generation pipeline |
| Claude Code | Latest (2026-02-18) | Anthropic's official CLI tool |
| Aider | Latest (2026-02-18) | Open-source AI coding assistant |

2.2 LLM Model

All three tools used the same model: claude-sonnet-4-5-20250929

This is the key control variable. By removing LLM differences from the equation, we can actually measure what each tool's pipeline and quality controls contribute.

2.3 The Task: Adding a Feature to an Existing Codebase

We didn't start from scratch β€” we asked each tool to add functionality to an existing codebase. Specifically: add a RecurringTask feature to a task management CLI app. That means:

  • Understanding the existing codebase structure
  • Implementing a new RecurringRule model as a dataclass
  • Implementing storage functions (save_recurring_rules, load_recurring_rules) with correct round-trip behavior
  • Implementing a validator (validate_no_circular_dependency) with the right function signature
  • Modifying existing entry points without breaking them
  • Passing 8 automated verification tests in verify.py
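To make the task concrete, here is a rough sketch of what the `RecurringRule` dataclass might look like. The field names and semantics here are assumptions for illustration; the actual specification lives in `prompt.md` in the benchmark repo.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical sketch of the RecurringRule model the task asks for.
# Field names (task_title, interval_days, next_run) are illustrative
# assumptions, not the benchmark's actual spec.
@dataclass
class RecurringRule:
    task_title: str
    interval_days: int
    next_run: date

    def should_run(self, today: date) -> bool:
        # The rule is due once today reaches the scheduled date.
        return today >= self.next_run

    def advance(self) -> None:
        # Push the schedule forward by one interval.
        self.next_run = self.next_run + timedelta(days=self.interval_days)
```

Note that test 3 below exercises exactly this kind of behavioral contract: `should_run()` and `advance()` must behave as specified, not merely exist.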

2.4 Why This Particular Task?

This task was designed to surface the failure patterns AI-generated code most commonly hits in real production use β€” code that silently fails at external interface boundaries.

Here's what we were trying to expose:

  • LLM-written unit tests pass internally, but external verify.py fails because save_recurring_rules can't correctly serialize/deserialize RecurringRule objects
  • Imports work fine, but passing actual objects (instead of dicts) to storage functions raises TypeError
  • The model class exists, but the constructor parameter names don't match what the verifier expects

These aren't obscure edge cases. They're predictable failure patterns that show up when an LLM generates code without really understanding the external contracts of what it's writing.

2.5 How We Evaluated

We used 8 automated tests in verify.py. Each test is binary (pass/fail). Execution score = tests passed / 8.

  1. All 61 existing tests still pass (no regressions)
  2. New tests were added (total > 61)
  3. RecurringRule model works correctly β€” should_run() and advance() behave as specified
  4. Recurring storage round-trip β€” save_recurring_rules() / load_recurring_rules() works with actual objects (not dicts)
  5. TaskDependency model exists, validate_no_circular_dependency() catches direct and transitive cycles
  6. Dependency storage round-trip β€” save_dependencies() / load_dependencies() return correct types
  7. Dashboard command (cmd_dashboard) and format_dashboard() are importable and callable
  8. CLI subcommands exist: recurring-add, recurring-list, recurring-run, dep-add, dep-list, dashboard (at least 5 of 6 present = PASS)
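The scoring scheme above maps cleanly onto a tiny harness: each check is a function returning pass/fail, any crash counts as a failure, and the execution score is the pass rate. This is an illustration of the scoring logic, not the actual `verify.py`:

```python
from typing import Callable

# Minimal sketch of a verify.py-style harness (illustrative only).
# Each check is binary; a crash inside a check counts as FAIL.
def run_checks(checks: list[tuple[str, Callable[[], bool]]]) -> float:
    passed = 0
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False  # e.g. TypeError from a broken storage round-trip
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
        passed += ok
    return passed / len(checks)  # execution score, e.g. 7/8 = 0.875
```

Under this scheme, Aider's recurring result of 7 passing tests out of 8 is exactly the 87.5% reported below.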

2.6 Metrics

| Metric | Definition | Unit | Better |
| --- | --- | --- | --- |
| Exec | verify.py test pass rate | % | Higher |
| Dep | All imports resolve at runtime | % | Higher |
| Lint | Style/quality errors per 100 lines (ruff) | errors/100 lines | Lower |
| Scope | Required features reflected in output | % | Higher |
| Speed | Wall time from prompt to completion | seconds | Lower |
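For clarity, the Lint metric is a simple density, normalizing the raw ruff error count by the size of the generated code so tools that write more lines aren't penalized:

```python
# Lint density: errors per 100 lines of generated code (lower is better).
# The error count would come from `ruff check` output in practice.
def lint_density(error_count: int, total_lines: int) -> float:
    return round(error_count / total_lines * 100, 2)
```

For example, 12 ruff errors across 250 generated lines gives a density of 4.8 errors/100 lines, in the same range as Claude Code's average below.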

2.7 Protocol

  • Each tool ran 3 independent times from a clean project state
  • Every run started from the same unmodified base project
  • No manual intervention during generation
  • Results recorded directly from verify.py output

3. Results

3.1 Summary (3-run average)

All values are averages of n=3 independent runs.

| Metric | Cognix | Claude Code | Aider |
| --- | --- | --- | --- |
| Exec | 100.0% | 100.0% | 87.5% |
| Dep | 100.0% | 100.0% | 100.0% |
| Lint | 0.00 | 4.79 | 1.69 |
| Scope | 100.0% | 100.0% | 100.0% |
| Speed | 863.7s | 390.8s | 190.6s |

3.2 Raw Data: All Runs

Cognix (v0.2.5)

| Run | Exec | Dep | Lint | Scope | Speed |
| --- | --- | --- | --- | --- | --- |
| 1 | 100% | 100% | 0.00 | 100% | 930.9s |
| 2 | 100% | 100% | 0.00 | 100% | 891.1s |
| 3 | 100% | 100% | 0.00 | 100% | 769.0s |
| Avg | 100% | 100% | 0.00 | 100% | 863.7s |

Claude Code

| Run | Exec | Dep | Lint | Scope | Speed |
| --- | --- | --- | --- | --- | --- |
| 1 | 100% | 100% | 4.27 | 100% | 410.4s |
| 2 | 100% | 100% | 5.25 | 100% | 409.8s |
| 3 | 100% | 100% | 4.86 | 100% | 352.2s |
| Avg | 100% | 100% | 4.79 | 100% | 390.8s |

Aider

| Run | Exec | Dep | Lint | Scope | Speed |
| --- | --- | --- | --- | --- | --- |
| 1 | 88% | 100% | 5.06 | 100% | 187.8s |
| 2 | 88% | 100% | 0.00 | 100% | 189.5s |
| 3 | 88% | 100% | 0.00 | 100% | 194.7s |
| Avg | 87.5% | 100% | 1.69 | 100% | 190.6s |

4. Analysis

4.1 Execution Accuracy

Cognix and Claude Code hit exec=100% on all 3 runs. Aider consistently came in at 87.5% (7 of 8 tests passing, rounded to 88% in the per-run tables) — and it failed on the same test every single time.

That consistency is the interesting part. It wasn't random LLM variance β€” it was the same failure, 3 times in a row. What we observed: Aider consistently generated storage functions that work fine with dict input but throw TypeError when you pass actual RecurringRule instances. We'll dig into the specific failure in a follow-up article.

4.2 Code Quality (Lint)

Cognix is the only tool that hit lint=0.00 on all 3 runs. That's because Cognix's pipeline includes an auto-fix loop:

  1. Generate code
  2. Run lint check (ruff/flake8)
  3. LLM auto-fixes violations
  4. Re-check, repeat until clean

Claude Code doesn't include lint checking or auto-fix, so whatever style issues the LLM introduces just stay there β€” averaging 4.79 errors/100 lines. That's code that wouldn't pass a standard CI lint gate.

Aider's lint score swung wildly across runs (0.00–5.06). It's purely a byproduct of LLM output, not a controlled quality gate.

4.3 Speed

Speed ranking: Aider (190.6s) < Claude Code (390.8s) < Cognix (863.7s)

Cognix is about 4.5x slower than Aider and 2.2x slower than Claude Code. That's the expected cost of running more stages:

Code Generation β†’ Lint Check & Auto-fix β†’ Code Review β†’ Test Execution & Auto-fix β†’ API Contract Validation β†’ Quality Assessment

Each stage takes time. The tradeoff is intentional: slower generation, but stronger guarantees on accuracy and quality.

Whether that tradeoff makes sense depends on what you're doing. For CI/CD automation or complex feature work where correctness really matters, the extra time is worth it. For quick prototypes or small edits, a faster tool probably makes more sense.

4.4 Stability

Cognix had zero variance on every quality metric (Exec, Dep, Lint, Scope) over 3 runs; only wall time varied (769.0–930.9s). Claude Code had slight lint variance (4.27–5.25). Aider had no exec variance (stuck at 7/8 every run) but big lint variance (0.00–5.06).

Cognix's consistency comes from deterministic post-processing. No matter how much the LLM output varies, the lint fix loop always converges to 0.00, and API Contract Validation catches interface issues before they reach the verifier.

4.5 Hypothesis: A Structural Blind Spot in AI Coding Tools

Here's what's really going on: AI writes code that looks right. But it fills in assumptions about how that code will be called β€” types, arguments, return values. When those assumptions are wrong, the code breaks. Whether a tool can catch those wrong assumptions is what separates the results.

From this experiment: whether code fails comes down to the thickness of the validation layer β€” how well the tool catches cases where the AI's assumptions turn out to be wrong.

Aider failed in the same spot all 3 times. That's not random β€” it's what you'd expect from a pipeline that doesn't verify external contracts (in this case, whether storage functions actually handle real objects correctly).

And this probably isn't just an Aider issue. Any tool without a validation loop is structurally more likely to ship "plausible-looking code" that fails real-world checks.

That said β€” this is a hypothesis. n=3, one task, one language isn't enough to confirm it. We need more task types, more languages, bigger samples. That's what Phase 2 is for.


5. Limitations

5.1 Single Task Type

This benchmark only covers one scenario: feature addition to an existing Python project. We can't generalize to all code generation use cases. New projects from scratch, bug fixes, refactoring, or non-Python work might look quite different.

5.2 Why This Task?

We didn't pick it arbitrarily. It was designed to expose the failure pattern that shows up most often when developers use AI tools in real production environments β€” code that silently fails at external interface boundaries:

  • Working with existing APIs (not just writing standalone functions)
  • External interface contracts (storage round-trips, function signatures)
  • Cross-file consistency (model, storage, validators, and entry points all have to agree)

5.3 Sample Size

3 runs is a small sample. The Cognix and Claude Code results (exec=100%, no variance) hold up fine, but Aider's numbers (87.5% exec, large lint variance) deserve more caution in interpretation.

5.4 What's Next

  • More task types: new projects, bug fixes, refactoring, real-world scenarios
  • Phase 2: benchmarking each tool at its optimal settings (e.g., Aider with --architect)
  • Failure analysis: digging into exactly which tests fail and why
  • Hypothesis validation: testing the relationship between validation layer thickness and execution accuracy across multiple tasks and languages

6. Conclusion

On a feature-addition task focused on external API contract correctness:

  • Cognix matches Claude Code on execution accuracy (both exec=100%)
  • Cognix leads on code quality at lint=0.00 (Claude Code 4.79, Aider 1.69)
  • Cognix beats Aider on execution accuracy (100% vs. 87.5%)
  • Cognix is the slowest (863.7s vs. Claude Code 390.8s, Aider 190.6s)

The data backs up the hypothesis: multi-stage quality validation produces more reliable, cleaner code on complex integration tasks β€” at the cost of speed.


Try Cognix

```shell
pipx install cognix
```

https://cognix-dev.github.io/cognix/


7. Reproducibility

Everything is published:

  • prompt.md β€” task specification
  • verify.py β€” evaluation script (8 test items)
  • Generated code from each tool, each run
  • Raw JSON result data

Repository: cognix/benchmark/phase1


This is Phase 1 of a benchmark series. Phase 2 will cover more task types, optimal tool configurations, and validation of the hypothesis in section 4.5.
