Summary
You've probably been there: AI generates some code, you run it, and it fails.
It's not a speed problem. You're using the same LLM, but different tools give different results. So where does that gap come from?
The idea behind this experiment: whether code succeeds or fails isn't about how fast the tool is; it's about how the generation pipeline is designed.
We put 3 tools head-to-head with the same LLM (sonnet-4-5) and the same task.
| Metric | Claude Code | Aider | Cognix |
|---|---|---|---|
| Execution Accuracy | 100% | 87.5% | 100% |
| Code Quality * | 4.79 | 1.69 | 0.0 |
| Speed | 391s | 191s | 864s |
\* Lint errors per 100 lines. Lower is better. n=3 average.
There was a clear difference. And it lined up with differences in pipeline design: the sequence of steps from code generation to validation.
From this experiment, whether code fails comes down to the thickness of the validation layer: how well the tool catches cases where the AI's assumptions turn out to be wrong.
"What makes them different?" β the experimental design, raw data, and breakdown of what makes code actually run are all below.
Abstract
Most AI coding tool benchmarks measure speed or output volume. Almost none measure whether the generated code actually runs correctly.
We designed a benchmark around a single, concrete task: "add a feature to an existing Python project." All tools used the same LLM (claude-sonnet-4-5-20250929), run 3 times each, evaluated across 5 axes.
Key results:
- Same LLM across all tools: claude-sonnet-4-5-20250929
- Execution accuracy: Cognix=100%, Claude Code=100%, Aider=87.5%
- Code quality (lint errors/100 lines): Cognix=0.0, Aider=1.69, Claude Code=4.79
- Speed: Aider fastest (190.6s), Cognix slowest (863.7s)
- Fully reproducible: source code and raw data are published
1. Why We Did This
Most discussions about "which AI coding tool is better" focus on UI polish, context window size, or how fast it responds. But what developers actually care about is: does the generated code work?
That distinction matters a lot. A tool that generates code fast is useless if that code breaks when you integrate it into a real project.
Cognix was built around this problem: not speed, but multi-stage quality validation to ensure generated code meets external contracts.
To back that up, we needed real numbers. This article is Phase 1 of a benchmark comparing Cognix against two widely-used tools under controlled, reproducible conditions.
Disclosure: The author is the developer of Cognix and co-author of the evaluation script (verify.py). All artifacts are published and independently verifiable.
2. Experimental Setup
2.1 Tools and Versions
| Tool | Version | Notes |
|---|---|---|
| Cognix | v0.2.5 | Open-source, multi-stage generation pipeline |
| Claude Code | Latest (2026-02-18) | Anthropic's official CLI tool |
| Aider | Latest (2026-02-18) | Open-source AI coding assistant |
2.2 LLM Model
All three tools used the same model: claude-sonnet-4-5-20250929
This is the key control variable. By removing LLM differences from the equation, we can actually measure what each tool's pipeline and quality controls contribute.
2.3 The Task: Adding a Feature to an Existing Codebase
We didn't start from scratch; we asked each tool to add functionality to an existing codebase. Specifically: add a RecurringTask feature to a task management CLI app. That means:
- Understanding the existing codebase structure
- Implementing a new `RecurringRule` model as a dataclass
- Implementing storage functions (`save_recurring_rules`, `load_recurring_rules`) with correct round-trip behavior
- Implementing a validator (`validate_no_circular_dependency`) with the right function signature
- Modifying existing entry points without breaking them
- Passing 8 automated verification tests in `verify.py`
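To make the shape of the task concrete, here is a minimal sketch of what a `RecurringRule` dataclass could look like. The field names (`task_title`, `interval_days`, `next_run`) and the exact semantics of `should_run()` / `advance()` are assumptions for illustration; the benchmark's actual model may differ.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical sketch of the model the task asks for; field names are assumed.
@dataclass
class RecurringRule:
    task_title: str
    interval_days: int
    next_run: date

    def should_run(self, today: date) -> bool:
        """True when the rule is due on or before `today`."""
        return today >= self.next_run

    def advance(self) -> None:
        """Move next_run forward by one interval."""
        self.next_run += timedelta(days=self.interval_days)
```

A model like this is where the round-trip requirement bites: the storage functions must accept and return these objects, not the dicts an LLM might silently assume.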
2.4 Why This Particular Task?
This task was designed to surface the failure patterns AI-generated code most commonly hits in real production use: code that silently fails at external interface boundaries.
Here's what we were trying to expose:
- LLM-written unit tests pass internally, but the external `verify.py` fails because `save_recurring_rules` can't correctly serialize/deserialize `RecurringRule` objects
- Imports work fine, but passing actual objects (instead of dicts) to storage functions raises `TypeError`
- The model class exists, but the constructor parameter names don't match what the verifier expects
These aren't obscure edge cases. They're predictable failure patterns that show up when an LLM generates code without really understanding the external contracts of what it's writing.
2.5 How We Evaluated
We used 8 automated tests in verify.py. Each test is binary (pass/fail). Execution score = tests passed / 8.
- All 61 existing tests still pass (no regressions)
- New tests were added (total > 61)
- `RecurringRule` model works correctly: `should_run()` and `advance()` behave as specified
- Recurring storage round-trip: `save_recurring_rules()` / `load_recurring_rules()` work with actual objects (not dicts)
- `TaskDependency` model exists, and `validate_no_circular_dependency()` catches direct and transitive cycles
- Dependency storage round-trip: `save_dependencies()` / `load_dependencies()` return correct types
- Dashboard command (`cmd_dashboard`) and `format_dashboard()` are importable and callable
- CLI subcommands exist: `recurring-add`, `recurring-list`, `recurring-run`, `dep-add`, `dep-list`, `dashboard` (5 of 6 = PASS)
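The scoring rule is trivial to express in code, and writing it out also explains why 7 of 8 passing appears as both 88% (rounded, in the per-run tables) and 87.5% (exact, in the averages):

```python
def execution_score(passed: int, total: int = 8) -> float:
    """Fraction of binary verify.py tests passed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total

# 7 of 8 passing is exactly 0.875, i.e. 87.5%; rounded for display it shows as 88%
```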
2.6 Metrics
| Metric | Definition | Unit | Better |
|---|---|---|---|
| Exec | verify.py test pass rate | % | Higher |
| Dep | All imports resolve at runtime | % | Higher |
| Lint | Style/quality errors per 100 lines (ruff) | errors/100 lines | Lower |
| Scope | Required features reflected in output | % | Higher |
| Speed | Wall time from prompt to completion | seconds | Lower |
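The Lint metric normalizes a raw ruff error count by the length of the generated code. A sketch of that normalization; the two-decimal rounding matches the tables above, but the benchmark's exact rounding is an assumption:

```python
def lint_per_100_lines(error_count: int, total_lines: int) -> float:
    """Lint errors per 100 lines of generated code; lower is better."""
    if total_lines <= 0:
        return 0.0
    return round(100.0 * error_count / total_lines, 2)
```

Normalizing per 100 lines keeps the metric comparable across runs that generate different amounts of code.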
2.7 Protocol
- Each tool ran 3 independent times from a clean project state
- Every run started from the same unmodified base project
- No manual intervention during generation
- Results recorded directly from `verify.py` output
3. Results
3.1 Summary (3-run average)
All values are averages of n=3 independent runs.
| Metric | Cognix | Claude Code | Aider |
|---|---|---|---|
| Exec | 100.0% | 100.0% | 87.5% |
| Dep | 100.0% | 100.0% | 100.0% |
| Lint | 0.00 | 4.79 | 1.69 |
| Scope | 100.0% | 100.0% | 100.0% |
| Speed | 863.7s | 390.8s | 190.6s |
3.2 Raw Data: All Runs
Cognix (v0.2.5)
| Run | Exec | Dep | Lint | Scope | Speed |
|---|---|---|---|---|---|
| 1 | 100% | 100% | 0.00 | 100% | 930.9s |
| 2 | 100% | 100% | 0.00 | 100% | 891.1s |
| 3 | 100% | 100% | 0.00 | 100% | 769.0s |
| Avg | 100% | 100% | 0.00 | 100% | 863.7s |
Claude Code
| Run | Exec | Dep | Lint | Scope | Speed |
|---|---|---|---|---|---|
| 1 | 100% | 100% | 4.27 | 100% | 410.4s |
| 2 | 100% | 100% | 5.25 | 100% | 409.8s |
| 3 | 100% | 100% | 4.86 | 100% | 352.2s |
| Avg | 100% | 100% | 4.79 | 100% | 390.8s |
Aider
| Run | Exec | Dep | Lint | Scope | Speed |
|---|---|---|---|---|---|
| 1 | 88% | 100% | 5.06 | 100% | 187.8s |
| 2 | 88% | 100% | 0.00 | 100% | 189.5s |
| 3 | 88% | 100% | 0.00 | 100% | 194.7s |
| Avg | 87.5% | 100% | 1.69 | 100% | 190.6s |
4. Analysis
4.1 Execution Accuracy
Cognix and Claude Code hit exec=100% on all 3 runs. Aider consistently came in at 87.5% (7 of 8 tests passing), and it failed on the same test every single time.
That consistency is the interesting part. It wasn't random LLM variance; it was the same failure, 3 times in a row. What we observed: Aider consistently generated storage functions that work fine with dict input but throw `TypeError` when you pass actual `RecurringRule` instances. We'll dig into the specific failure in a follow-up article.
4.2 Code Quality (Lint)
Cognix is the only tool that hit lint=0.00 on all 3 runs. That's because Cognix's pipeline includes an auto-fix loop:
- Generate code
- Run lint check (ruff/flake8)
- LLM auto-fixes violations
- Re-check, repeat until clean
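The loop above can be sketched generically. `check` and `fix` below are stand-ins for a real linter invocation (ruff/flake8) and an LLM repair call; this is a minimal sketch, not Cognix's actual implementation.

```python
from typing import Callable, List

def fix_until_clean(
    code: str,
    check: Callable[[str], List[str]],      # stand-in for a linter: returns violations
    fix: Callable[[str, List[str]], str],   # stand-in for an LLM fix: returns revised code
    max_rounds: int = 5,
) -> str:
    """Re-lint and re-fix until the code is clean or the round budget is exhausted."""
    for _ in range(max_rounds):
        violations = check(code)
        if not violations:
            return code
        code = fix(code, violations)
    return code
```

The round budget matters: without it, a fix step that never fully converges would loop forever, so a real pipeline trades a bounded number of rounds for a guaranteed stop.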
Claude Code doesn't include lint checking or auto-fix, so whatever style issues the LLM introduces just stay there, averaging 4.79 errors/100 lines. That's code that wouldn't pass a standard CI lint gate.
Aider's lint score swung wildly across runs (0.00 to 5.06). It's purely a byproduct of LLM output, not a controlled quality gate.
4.3 Speed
Speed ranking: Aider (190.6s) < Claude Code (390.8s) < Cognix (863.7s)
Cognix is about 4.5x slower than Aider and 2.2x slower than Claude Code. That's the expected cost of running more stages:
Code Generation → Lint Check & Auto-fix → Code Review → Test Execution & Auto-fix → API Contract Validation → Quality Assessment
Each stage takes time. The tradeoff is intentional: slower generation, but stronger guarantees on accuracy and quality.
Whether that tradeoff makes sense depends on what you're doing. For CI/CD automation or complex feature work where correctness really matters, the extra time is worth it. For quick prototypes or small edits, a faster tool probably makes more sense.
4.4 Stability
Cognix had zero variance across all 5 metrics over 3 runs. Claude Code had slight lint variance (4.27 to 5.25). Aider had no exec variance (7 of 8 tests every run) but big lint variance (0.00 to 5.06).
Cognix's consistency comes from deterministic post-processing. No matter how much the LLM output varies, the lint fix loop always converges to 0.00, and API Contract Validation catches interface issues before they reach the verifier.
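One lightweight form of contract validation is checking a generated function's signature against what the verifier expects before running anything. A hypothetical sketch (not Cognix's actual mechanism; both sample functions are invented for illustration):

```python
import inspect

def signature_matches(fn, expected_params) -> bool:
    """True when fn's parameter names match the external contract."""
    return list(inspect.signature(fn).parameters) == list(expected_params)

# Hypothetical generated functions: same idea, different parameter names.
def validate_no_circular_dependency(tasks, dependencies):
    return True  # placeholder body

def validate_wrong(task_list, deps):
    return True  # placeholder body
```

A check like this is cheap and deterministic, which is exactly why it can run before the slower test-execution stage.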
4.5 Hypothesis: A Structural Blind Spot in AI Coding Tools
Here's what's really going on: AI writes code that looks right. But it fills in assumptions about how that code will be called: types, arguments, return values. When those assumptions are wrong, the code breaks. Whether a tool can catch those wrong assumptions is what separates the results.
From this experiment, whether code fails comes down to the thickness of the validation layer: how well the tool catches cases where the AI's assumptions turn out to be wrong.
Aider failed in the same spot all 3 times. That's not random; it's what you'd expect from a pipeline that doesn't verify external contracts (in this case, whether storage functions actually handle real objects correctly).
And this probably isn't just an Aider issue. Any tool without a validation loop is structurally more likely to ship "plausible-looking code" that fails real-world checks.
That said, this is a hypothesis. n=3, one task, one language isn't enough to confirm it. We need more task types, more languages, bigger samples. That's what Phase 2 is for.
5. Limitations
5.1 Single Task Type
This benchmark only covers one scenario: feature addition to an existing Python project. We can't generalize to all code generation use cases. New projects from scratch, bug fixes, refactoring, or non-Python work might look quite different.
5.2 Why This Task?
We didn't pick it arbitrarily. It was designed to expose the failure pattern that shows up most often when developers use AI tools in real production environments, namely code that silently fails at external interface boundaries:
- Working with existing APIs (not just writing standalone functions)
- External interface contracts (storage round-trips, function signatures)
- Cross-file consistency (model, storage, validators, and entry points all have to agree)
5.3 Sample Size
3 runs is a small sample. The Cognix and Claude Code results (exec=100%, no variance) hold up fine, but Aider's numbers (88%, lint variance) deserve more caution in interpretation.
5.4 What's Next
- More task types: new projects, bug fixes, refactoring, real-world scenarios
- Phase 2: benchmarking each tool at its optimal settings (e.g., Aider with `--architect`)
- Failure analysis: digging into exactly which tests fail and why
- Hypothesis validation: testing the relationship between validation layer thickness and execution accuracy across multiple tasks and languages
6. Conclusion
On a feature-addition task focused on external API contract correctness:
- Cognix matches Claude Code on execution accuracy (both exec=100%)
- Cognix leads on code quality at lint=0.00 (Claude Code 4.79, Aider 1.69)
- Cognix beats Aider on execution accuracy (100% vs. 87.5%)
- Cognix is the slowest (863.7s vs. Claude Code 390.8s, Aider 190.6s)
The data backs up the hypothesis: multi-stage quality validation produces more reliable, cleaner code on complex integration tasks β at the cost of speed.
Try Cognix
pipx install cognix
https://cognix-dev.github.io/cognix/
7. Reproducibility
Everything is published:
- `prompt.md`: task specification
- `verify.py`: evaluation script (8 test items)
- Generated code from each tool, each run
- Raw JSON result data
Repository: cognix/benchmark/phase1
This is Phase 1 of a benchmark series. Phase 2 will cover more task types, optimal tool configurations, and validation of the hypothesis in section 4.5.