cognix-dev
AI CLI Coding Tool Execution Accuracy Benchmark: Claude Code vs Aider vs Cognix on the Same LLM

πŸ“‹ Summary

You've probably been there: AI generates some code, you run it β€” and it fails.

It's not a speed problem. You're using the same LLM, but different tools give different results. So where does that gap come from?

The idea behind this experiment: whether code succeeds or fails isn't about how fast the tool is β€” it's about how the generation pipeline is designed.

We put 3 tools head-to-head with the same LLM (sonnet-4-5) and the same task.

| Metric | Claude Code | Aider | Cognix |
| --- | --- | --- | --- |
| Execution Accuracy | 100% | 87.5% | 100% |
| Code Quality * | 4.79 | 1.69 | 0.0 |
| Speed | 391s | 191s | 864s |

\* Lint errors per 100 lines; lower is better. n=3 average.

There was a clear difference. And it lined up with differences in pipeline design β€” the sequence of steps from code generation to validation.

From this experiment: whether code fails comes down to the thickness of the validation layer β€” how well the tool catches cases where the AI's assumptions turn out to be wrong.

"What makes them different?" β€” the experimental design, raw data, and breakdown of what makes code actually run are all below.


Abstract

Most AI coding tool benchmarks measure speed or output volume. Almost none measure whether the generated code actually runs correctly.

We designed a benchmark around a single, concrete task: "add a feature to an existing Python project." All tools used the same LLM (claude-sonnet-4-5-20250929), run 3 times each, evaluated across 5 axes.

Key results:

  • Same LLM across all tools: claude-sonnet-4-5-20250929
  • Execution accuracy: Cognix=100%, Claude Code=100%, Aider=87.5%
  • Code quality (lint errors/100 lines): Cognix=0.0, Aider=1.69, Claude Code=4.79
  • Speed: Aider fastest (190.6s), Cognix slowest (863.7s)
  • Fully reproducible: source code and raw data are published

1. Why We Did This

Most discussions about "which AI coding tool is better" focus on UI polish, context window size, or how fast it responds. But what developers actually care about is: does the generated code work?

That distinction matters a lot. A tool that generates code fast is useless if that code breaks when you integrate it into a real project.

Cognix was built around this problem β€” not speed, but multi-stage quality validation to ensure generated code meets external contracts.

To back that up, we needed real numbers. This article is Phase 1 of a benchmark comparing Cognix against two widely-used tools under controlled, reproducible conditions.

Disclosure: The author is the developer of Cognix and co-author of the evaluation script (verify.py). All artifacts are published and independently verifiable.


2. Experimental Setup

2.1 Tools and Versions

| Tool | Version | Notes |
| --- | --- | --- |
| Cognix | v0.2.5 | Open-source, multi-stage generation pipeline |
| Claude Code | Latest (2026-02-18) | Anthropic's official CLI tool |
| Aider | Latest (2026-02-18) | Open-source AI coding assistant |

2.2 LLM Model

All three tools used the same model: claude-sonnet-4-5-20250929

This is the key control variable. By removing LLM differences from the equation, we can actually measure what each tool's pipeline and quality controls contribute.

2.3 The Task: Adding a Feature to an Existing Codebase

We didn't start from scratch β€” we asked each tool to add functionality to an existing codebase. Specifically: add a RecurringTask feature to a task management CLI app. That means:

  • Understanding the existing codebase structure
  • Implementing a new RecurringRule model as a dataclass
  • Implementing storage functions (save_recurring_rules, load_recurring_rules) with correct round-trip behavior
  • Implementing a validator (validate_no_circular_dependency) with the right function signature
  • Modifying existing entry points without breaking them
  • Passing 8 automated verification tests in verify.py
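To make the task concrete, here is a rough sketch of what the `RecurringRule` dataclass might look like. The field names and semantics here are assumptions for illustration; the actual specification lives in `prompt.md` in the benchmark repo.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical sketch of the RecurringRule model the task asks for.
# Field names (task_title, interval_days, next_run) are illustrative
# assumptions, not the benchmark's actual spec.
@dataclass
class RecurringRule:
    task_title: str
    interval_days: int
    next_run: date

    def should_run(self, today: date) -> bool:
        # The rule is due once today reaches the scheduled date.
        return today >= self.next_run

    def advance(self) -> None:
        # Push the schedule forward by one interval.
        self.next_run = self.next_run + timedelta(days=self.interval_days)
```

Note that test 3 below exercises exactly this kind of behavioral contract: `should_run()` and `advance()` must behave as specified, not merely exist.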

2.4 Why This Particular Task?

This task was designed to surface the failure patterns AI-generated code most commonly hits in real production use β€” code that silently fails at external interface boundaries.

Here's what we were trying to expose:

  • LLM-written unit tests pass internally, but external verify.py fails because save_recurring_rules can't correctly serialize/deserialize RecurringRule objects
  • Imports work fine, but passing actual objects (instead of dicts) to storage functions raises TypeError
  • The model class exists, but the constructor parameter names don't match what the verifier expects

These aren't obscure edge cases. They're predictable failure patterns that show up when an LLM generates code without really understanding the external contracts of what it's writing.

2.5 How We Evaluated

We used 8 automated tests in verify.py. Each test is binary (pass/fail). Execution score = tests passed / 8.

  1. All 61 existing tests still pass (no regressions)
  2. New tests were added (total > 61)
  3. RecurringRule model works correctly β€” should_run() and advance() behave as specified
  4. Recurring storage round-trip β€” save_recurring_rules() / load_recurring_rules() works with actual objects (not dicts)
  5. TaskDependency model exists, validate_no_circular_dependency() catches direct and transitive cycles
  6. Dependency storage round-trip β€” save_dependencies() / load_dependencies() return correct types
  7. Dashboard command (cmd_dashboard) and format_dashboard() are importable and callable
  8. CLI subcommands exist: recurring-add, recurring-list, recurring-run, dep-add, dep-list, dashboard (at least 5 of 6 present = PASS)
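The scoring scheme above maps cleanly onto a tiny harness: each check is a function returning pass/fail, any crash counts as a failure, and the execution score is the pass rate. This is an illustration of the scoring logic, not the actual `verify.py`:

```python
from typing import Callable

# Minimal sketch of a verify.py-style harness (illustrative only).
# Each check is binary; a crash inside a check counts as FAIL.
def run_checks(checks: list[tuple[str, Callable[[], bool]]]) -> float:
    passed = 0
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False  # e.g. TypeError from a broken storage round-trip
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
        passed += ok
    return passed / len(checks)  # execution score, e.g. 7/8 = 0.875
```

Under this scheme, Aider's recurring result of 7 passing tests out of 8 is exactly the 87.5% reported below.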

2.6 Metrics

| Metric | Definition | Unit | Better |
| --- | --- | --- | --- |
| Exec | verify.py test pass rate | % | Higher |
| Dep | All imports resolve at runtime | % | Higher |
| Lint | Style/quality errors per 100 lines (ruff) | errors/100 lines | Lower |
| Scope | Required features reflected in output | % | Higher |
| Speed | Wall time from prompt to completion | seconds | Lower |
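For clarity, the Lint metric is a simple density, normalizing the raw ruff error count by the size of the generated code so tools that write more lines aren't penalized:

```python
# Lint density: errors per 100 lines of generated code (lower is better).
# The error count would come from `ruff check` output in practice.
def lint_density(error_count: int, total_lines: int) -> float:
    return round(error_count / total_lines * 100, 2)
```

For example, 12 ruff errors across 250 generated lines gives a density of 4.8 errors/100 lines, in the same range as Claude Code's average below.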

2.7 Protocol

  • Each tool ran 3 independent times from a clean project state
  • Every run started from the same unmodified base project
  • No manual intervention during generation
  • Results recorded directly from verify.py output

3. Results

3.1 Summary (3-run average)

All values are averages of n=3 independent runs.

| Metric | Cognix | Claude Code | Aider |
| --- | --- | --- | --- |
| Exec | 100.0% | 100.0% | 87.5% |
| Dep | 100.0% | 100.0% | 100.0% |
| Lint | 0.00 | 4.79 | 1.69 |
| Scope | 100.0% | 100.0% | 100.0% |
| Speed | 863.7s | 390.8s | 190.6s |

3.2 Raw Data: All Runs

Cognix (v0.2.5)

| Run | Exec | Dep | Lint | Scope | Speed |
| --- | --- | --- | --- | --- | --- |
| 1 | 100% | 100% | 0.00 | 100% | 930.9s |
| 2 | 100% | 100% | 0.00 | 100% | 891.1s |
| 3 | 100% | 100% | 0.00 | 100% | 769.0s |
| Avg | 100% | 100% | 0.00 | 100% | 863.7s |

Claude Code

| Run | Exec | Dep | Lint | Scope | Speed |
| --- | --- | --- | --- | --- | --- |
| 1 | 100% | 100% | 4.27 | 100% | 410.4s |
| 2 | 100% | 100% | 5.25 | 100% | 409.8s |
| 3 | 100% | 100% | 4.86 | 100% | 352.2s |
| Avg | 100% | 100% | 4.79 | 100% | 390.8s |

Aider

| Run | Exec | Dep | Lint | Scope | Speed |
| --- | --- | --- | --- | --- | --- |
| 1 | 88% | 100% | 5.06 | 100% | 187.8s |
| 2 | 88% | 100% | 0.00 | 100% | 189.5s |
| 3 | 88% | 100% | 0.00 | 100% | 194.7s |
| Avg | 87.5% | 100% | 1.69 | 100% | 190.6s |

4. Analysis

4.1 Execution Accuracy

Cognix and Claude Code hit exec=100% on all 3 runs. Aider consistently came in at 87.5% (7 of 8 tests passing, rounded to 88% in the per-run tables) — and it failed on the same test every single time.

That consistency is the interesting part. It wasn't random LLM variance β€” it was the same failure, 3 times in a row. What we observed: Aider consistently generated storage functions that work fine with dict input but throw TypeError when you pass actual RecurringRule instances. We'll dig into the specific failure in a follow-up article.

4.2 Code Quality (Lint)

Cognix is the only tool that hit lint=0.00 on all 3 runs. That's because Cognix's pipeline includes an auto-fix loop:

  1. Generate code
  2. Run lint check (ruff/flake8)
  3. LLM auto-fixes violations
  4. Re-check, repeat until clean

Claude Code doesn't include lint checking or auto-fix, so whatever style issues the LLM introduces just stay there β€” averaging 4.79 errors/100 lines. That's code that wouldn't pass a standard CI lint gate.

Aider's lint score swung wildly across runs (0.00–5.06). It's purely a byproduct of LLM output, not a controlled quality gate.

4.3 Speed

Speed ranking: Aider (190.6s) < Claude Code (390.8s) < Cognix (863.7s)

Cognix is about 4.5x slower than Aider and 2.2x slower than Claude Code. That's the expected cost of running more stages:

Code Generation β†’ Lint Check & Auto-fix β†’ Code Review β†’ Test Execution & Auto-fix β†’ API Contract Validation β†’ Quality Assessment

Each stage takes time. The tradeoff is intentional: slower generation, but stronger guarantees on accuracy and quality.

Whether that tradeoff makes sense depends on what you're doing. For CI/CD automation or complex feature work where correctness really matters, the extra time is worth it. For quick prototypes or small edits, a faster tool probably makes more sense.

4.4 Stability

Cognix had zero variance on every quality metric (Exec, Dep, Lint, Scope) over 3 runs; only wall time varied (769.0–930.9s). Claude Code had slight lint variance (4.27–5.25). Aider had no exec variance (stuck at 7/8 every run) but big lint variance (0.00–5.06).

Cognix's consistency comes from deterministic post-processing. No matter how much the LLM output varies, the lint fix loop always converges to 0.00, and API Contract Validation catches interface issues before they reach the verifier.

4.5 Hypothesis: A Structural Blind Spot in AI Coding Tools

Here's what's really going on: AI writes code that looks right. But it fills in assumptions about how that code will be called β€” types, arguments, return values. When those assumptions are wrong, the code breaks. Whether a tool can catch those wrong assumptions is what separates the results.

From this experiment: whether code fails comes down to the thickness of the validation layer β€” how well the tool catches cases where the AI's assumptions turn out to be wrong.

Aider failed in the same spot all 3 times. That's not random β€” it's what you'd expect from a pipeline that doesn't verify external contracts (in this case, whether storage functions actually handle real objects correctly).

And this probably isn't just an Aider issue. Any tool without a validation loop is structurally more likely to ship "plausible-looking code" that fails real-world checks.

That said β€” this is a hypothesis. n=3, one task, one language isn't enough to confirm it. We need more task types, more languages, bigger samples. That's what Phase 2 is for.


5. Limitations

5.1 Single Task Type

This benchmark only covers one scenario: feature addition to an existing Python project. We can't generalize to all code generation use cases. New projects from scratch, bug fixes, refactoring, or non-Python work might look quite different.

5.2 Why This Task?

We didn't pick it arbitrarily. It was designed to expose the failure pattern that shows up most often when developers use AI tools in real production environments β€” code that silently fails at external interface boundaries:

  • Working with existing APIs (not just writing standalone functions)
  • External interface contracts (storage round-trips, function signatures)
  • Cross-file consistency (model, storage, validators, and entry points all have to agree)

5.3 Sample Size

3 runs is a small sample. The Cognix and Claude Code results (exec=100%, no variance) hold up fine, but Aider's numbers (87.5% exec, large lint variance) deserve more caution in interpretation.

5.4 What's Next

  • More task types: new projects, bug fixes, refactoring, real-world scenarios
  • Phase 2: benchmarking each tool at its optimal settings (e.g., Aider with --architect)
  • Failure analysis: digging into exactly which tests fail and why
  • Hypothesis validation: testing the relationship between validation layer thickness and execution accuracy across multiple tasks and languages

6. Conclusion

On a feature-addition task focused on external API contract correctness:

  • Cognix matches Claude Code on execution accuracy (both exec=100%)
  • Cognix leads on code quality at lint=0.00 (Claude Code 4.79, Aider 1.69)
  • Cognix beats Aider on execution accuracy (100% vs. 87.5%)
  • Cognix is the slowest (863.7s vs. Claude Code 390.8s, Aider 190.6s)

The data backs up the hypothesis: multi-stage quality validation produces more reliable, cleaner code on complex integration tasks β€” at the cost of speed.


Try Cognix

```shell
pipx install cognix
```

https://cognix-dev.github.io/cognix/


7. Reproducibility

Everything is published:

  • prompt.md β€” task specification
  • verify.py β€” evaluation script (8 test items)
  • Generated code from each tool, each run
  • Raw JSON result data

Repository: cognix/benchmark/phase1


This is Phase 1 of a benchmark series. Phase 2 will cover more task types, optimal tool configurations, and validation of the hypothesis in section 4.5.
