DEV Community

Richard Kakengi
AI Agents Can't Mark Their Own Homework [Case Study]

I ran an experiment putting the same project through two scenarios with two LLMs: once with a standard prompt, once with a spec-driven workflow. The results weren't what I expected.

The headline isn't about tokens or which model performed best. It's about measuring what the agents thought they delivered versus what they actually delivered.

Repo: https://github.com/SpecLeft/specleft-delta-demo

Models: Claude Opus 4.6, GPT-5.2-Codex

Coding Agent: OpenCode 1.1.36


TL;DR — The Good, The Bad, The Ugly

The Good: Spec-driven runs caught real bugs that baseline runs shipped silently. Claude Opus with specs found 3 defects during behaviour verification, including a classic Python truthiness trap that would have hit production. GPT Codex with SpecLeft adopted TDD naturally without being told to. Both agents had fewer failed test runs with specs guiding them.

The Bad: Token usage roughly tripled. Complete baseline runs used 53k–83k tokens; spec-driven runs used 146k–147k. The spec externalisation phase alone consumed more tokens than some entire baseline implementations. Time increased too: Codex went from ~18 minutes to ~38 minutes.

The Ugly: When asked to self-assess, the baseline agents gave themselves a clean bill of health. Opus 4.6 with only a PRD reported 0 issues. The code had bugs and missed a key scenario from the PRD; the agent just had no framework to find them. It marked its own homework and gave itself an A+.

The takeaway: In its current state, spec-driven development introduces an upfront token tax but produces code with fewer hidden defects. Whether that trade-off is worth it depends on whether you're building a side project or something that matters in production.


The Problem

AI coding agents are fast. Impressively fast. You can hand one a well-written product scope and a FastAPI project and it'll have routes, models, services, and tests in under 15 minutes.

But, as we know, "tests pass" isn't the same as "the system is correct."

I've been building SpecLeft — an open source spec-driven development tool that externalises behaviour into structured markdown specs and generates pytest scaffolding, with traceable links between them.

The idea is simple: define what correct looks like before the agent starts coding, then verify against it. The workflow looks like BDD, but quacks like TDD.

There are many spec-driven dev tools out there (yes, sorry, this is another one), but they generally target AI-assisted developer workflows, so a human developer still needs to drive. SpecLeft tries a different approach: it is agent-native, meaning it's optimised for AI agent adoption, with an agent contract to verify safety. The aim is to add trust when building software with minimal human intervention or technical review.

To summarise the goal: can we trust AI agents to develop software that actually behaves as it should, while keeping the code readable and maintainable and fulfilling the original intent?


The Experiment

The application: A document approval workflow API — documents move through draft → review → approved/rejected, with multi-reviewer approval, time-bound delegation, automatic escalation, and a handful of edge cases.

This isn't a basic CRUD system built for a nice vibe-coding showcase. The scope includes a state machine, concurrent decision handling, time-based logic, and business rules that interact with each other. Complex enough that an agent can't just wing it.
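To give a feel for the state machine piece, here is a minimal sketch of the draft → review → approved/rejected lifecycle described above. The enum values and transition table are illustrative, not the experiment's actual code:

```python
from enum import Enum

class Status(str, Enum):
    DRAFT = "draft"
    REVIEW = "review"
    APPROVED = "approved"
    REJECTED = "rejected"

# Legal transitions for the workflow: rejected documents can be
# resubmitted, which starts a new review cycle; approved is terminal.
TRANSITIONS = {
    Status.DRAFT: {Status.REVIEW},
    Status.REVIEW: {Status.APPROVED, Status.REJECTED},
    Status.REJECTED: {Status.REVIEW},  # resubmission
    Status.APPROVED: set(),            # terminal state
}

def can_transition(current: Status, target: Status) -> bool:
    return target in TRANSITIONS[current]
```

The interesting bugs in this experiment live in the interactions around this core (timeouts, delegation, concurrent decisions), not in the transitions themselves.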

Product Scope

The setup:

  • Same starting commit for both runs
  • Same PRD (prd.md) with 5 features and 20 scenarios
  • Same models (Opus 4.6 and Codex 5.2)
  • Same coding agent (OpenCode 1.1.36)
  • Two runs per model: baseline prompt vs SpecLeft-assisted workflow

Controlled variables:

  • Tech stack (FastAPI + SQLAlchemy + SQLite + pytest)
  • Agent skill
  • Virtual environment with UV
  • Product requirements

The only difference was whether SpecLeft was involved.

💻 Repos and session playbacks are attached to each test run.
🎥 Sessions have to be downloaded and played with asciinema.


Workflow A — Baseline (No SpecLeft)

The agent gets a straightforward prompt:

```
You are an autonomous agent guided by a planning-first workflow.
Build a document approval API using FastAPI and SQLAlchemy.
The project has had the initial setup already.
Follow ../prd.md for product requirements.
Follow ../SKILLS.md for instructions.
Include tests and ensure they pass.
Stop when all features are complete.
Go with your own recommendations for system behaviour instead of verifying with me.
```

Then I walked away and let it run.

Claude Opus 4.6 — Baseline

Prompt entered, and Opus took its time. It spent a solid chunk of the session reading and analysing the PRD before writing anything. Implementation and tests came out together, interleaved rather than in separate phases. The server started first time. The first test run had 2 failures, both resolved quickly.

Total time: 13 minutes 53 seconds. Total tokens: 83,243.

When asked for a retrospective, Opus reported 0 issues found. Clean run. Everything looked good.

Claude Opus 4.6 Opencode Snapshot
Code: Branch
Session Playback (asciinema cast): claude-opus-no-specs.cast

Bugs Discovered Post-Analysis:

  • Missing Auto-Escalation Feature: Despite the PRD requiring automatic escalation after timeouts, only manual escalation is implemented. The check_and_escalate function exists but performs no escalation, violating a core business requirement.

  • Potential Timezone Brittleness: Delegation expiry checks assume naive datetimes are UTC, which could fail if assumptions are incorrect.

  • Concurrency Risks: No explicit locking for concurrent reviewer decisions, potentially leading to race conditions.

GPT Codex 5.2 — Baseline

Codex moved faster and more aggressively. Implementation came out in parallel batches — models, schemas, routes, services written simultaneously. But it backtracked more. Tests failed 4 times before going green. Server failed to start on the first attempt. Behaviour verification required 4 patches to services.py.

Total time: ~18 minutes. Total tokens: 53,000.

The retrospective was vague: logic gaps were "caught early," timezone handling was a known issue. No specific bugs named.

Codex 5.2 Baseline snapshot
Code: Branch

Session Playback (asciinema cast): gpt-codex-no-specs.cast

Baseline Results

| Metric | Codex 5.2 | Opus 4.6 |
| --- | --- | --- |
| Total tokens | 53,000 | 83,243 |
| Total tests passed | 19 (100%) | 53 (100%) |
| Failed test runs | 4 | 2 |
| Issues found in retro | 0 | 0 |
| Time to completion | ~18m | 13m 53s |
| Tokens before implementation | 14,000 | ~33,000 |

Both agents declared the job done. Tests pass. Features work. Ship it?


Workflow B — With SpecLeft

Same project, same PRD. But this time SpecLeft is installed as a dependency, and the prompt tells the agent to externalise behaviour before writing code:

```
You are an autonomous agent guided by a planning-first workflow.
Build a document approval API using FastAPI and SQLAlchemy.
The project has had the initial setup already.
Follow ../prd.md for product requirements.
Follow ../SKILLS.md for instructions.
Initialize SpecLeft and use its commands to externalize behaviour before implementation.
I have installed v0.2.2.
Only if required, use doc: https://github.com/SpecLeft/specleft/blob/main/AI_AGENTS.md for more context.
Do not write implementation code until behaviour is explicit.
Go with your own recommendations for system behaviour instead of verifying with me.
```

Then I walked away again.

Note: The AI_AGENTS.md doc helps the agent use the SpecLeft tool correctly.

Claude Opus 4.6 — With SpecLeft

Opus externalised all 5 features into SpecLeft specs before writing a line of implementation code. It updated scenario priorities to match feature priorities — a decision it made on its own. Then it generated test skeletons with specleft test skeleton, giving it 27 decorated test stubs mapped directly to scenarios.

Claude Opus 4.6 with SpecLeft

First test run: 25/27 passed. The 2 failures were test logic issues, not application bugs. The core service layer was correct on first implementation.

Then came behaviour verification. And this is where it got interesting.

Bug 1: timeout_hours or doc.escalation_timeout_hours or 24 — when timeout_hours=0, Python treats 0 as falsy and falls through to the default of 24. Classic truthiness trap. The unit tests didn't catch it because they manipulated review_started_at directly with 25-hour backdating, never testing with timeout_hours=0.
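The trap reproduces in a few lines. This is a minimal reconstruction of the pattern (function names are mine, not the run's code), along with the explicit-None fix:

```python
# Reproduction of Bug 1: `or` treats 0 as falsy, so an explicit
# timeout of 0 hours silently becomes the 24-hour default.
def effective_timeout_buggy(timeout_hours, doc_timeout=None):
    return timeout_hours or doc_timeout or 24

def effective_timeout_fixed(timeout_hours, doc_timeout=None):
    # Explicit None checks keep 0 as a valid, intentional value.
    if timeout_hours is not None:
        return timeout_hours
    if doc_timeout is not None:
        return doc_timeout
    return 24

print(effective_timeout_buggy(0))  # 24 (the bug)
print(effective_timeout_fixed(0))  # 0
```

A test that only backdates `review_started_at` never exercises the `timeout_hours=0` path, which is why the unit tests sailed past it.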

Bug 2: review_cycle in the DocumentResponse schema had a default value of 1, but the model never exposed the actual cycle count. Pydantic's from_attributes silently fell back to the default. A resubmitted document showed review_cycle: 1 when it should have been 2.
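Pydantic's behaviour here is easy to miss: with `from_attributes`, a field default papers over a missing ORM attribute instead of raising. A minimal sketch, with an illustrative stand-in for the ORM object:

```python
from pydantic import BaseModel, ConfigDict

class DocumentResponse(BaseModel):
    model_config = ConfigDict(from_attributes=True)
    review_cycle: int = 1  # the default masks missing ORM attributes

class DocumentModel:
    # Illustrative ORM-like object that never exposes review_cycle,
    # mirroring Bug 2 above.
    def __init__(self):
        self.title = "spec.md"

resp = DocumentResponse.model_validate(DocumentModel())
print(resp.review_cycle)  # 1, regardless of the real cycle count
```

Dropping the default (making the field required) turns the silent fallback into a loud `ValidationError` the first time the model forgets to expose the attribute.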

Bug 3: Escalation test logic accessed response data before checking the status code — a test fragility that would cause misleading failures.

Total time: 21 minutes 1 second. Total tokens: ~147,000 (across two context windows, with compaction at 105k).

Claude Opus with SpecLeft Snapshot 2
Code: Branch
Session Playback (asciinema download): claude-opus.cast

GPT Codex 5.2 — With SpecLeft

This was the surprise. Codex consumed the SpecLeft specs and test skeletons, and then did something I didn't engineer: it wrote functional test logic before implementation code. Genuine TDD, driven by the structure of the skeletons. The scaffolding naturally guided the agent into writing assertions first, then building the code to satisfy them. Sweet!

GPT Codex with SpecLeft snapshot

It read all the specs — which burned tokens on context — but that context clearly influenced implementation quality. Tests failed twice before going green, down from 4 in the baseline run.

Total time: ~38 minutes. Total tokens: 146,000.

GPT Codex 5.2 with SpecLeft Test snapshot 2

Code : Branch

Session Playback (Asciinema download): gpt-codex-specs.cast

SpecLeft Results

| Metric | Codex 5.2 | Opus 4.6 |
| --- | --- | --- |
| Total tokens | 146,499 | ~147,000 |
| Total tests passed | 27 (100%) | 27 (100%) |
| Failed test runs | 2 | 1 |
| Issues found in retro | 0 | 3 |
| Time to completion | ~38m | 21m 1s |
| Tokens to externalise specs | 49,000 | 45,000 |
| Tokens before implementation | 89,000 | 63,000 |

Side-by-Side Comparison

Opus without specs generated 53 tests, nearly double the SpecLeft run's 27 — but quantity isn't coverage. The 53 tests were whatever the agent decided mattered, with no traceability to product requirements, as the missing auto-escalation requirement shows. The 27 SpecLeft tests each map to a specific scenario in the PRD.

| Metric | Codex Baseline | Codex + SpecLeft | Opus Baseline | Opus + SpecLeft |
| --- | --- | --- | --- | --- |
| Total tokens | 53,000 | 146,000 | 83,243 | ~147,000 |
| Total tests passed | 19 | 27 | 53 | 27 |
| Failed test runs | 4 | 2 | 2 | 1 |
| Bugs found during retro | 0 | 0 | 0 | 3 |
| Missing Requirements | 0 | 0 | 1 | 0 |

Missing Requirements: count of unimplemented PRD features.


Which Stack Stayed on Track the Best?

Having a look at the code and testing the API manually, both spec-driven runs are strong, so it's pretty even. Codex had a much cleaner data model and a modern SQLAlchemy implementation, while Opus was flatter in its design. With that in mind, I'd feel better about picking up the Codex SpecLeft project in a realistic situation. That said, the code wasn't mind-blowing either, especially the lack of exception handling around database queries in the service layer.

I've also prompted a few neutral agents (Gemini-3, Kimi K2.5, Grok) to evaluate the codebases on quality, maintainability, and correctness.

The full analysis can be found in the repo.


What's the Takeaway?

Agents can't assess their own output

The fact that the critical defects were missed by the agent itself but caught by external verification highlights a fundamental limitation: AI agents can't reliably assess their own output without structured external criteria.

On top of that, the baseline code was brittle. The Codex baseline shipped with 175 deprecation warnings in its test suite, technical debt that the agent completely ignored because the tests technically "passed."
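One cheap guardrail against this: make warnings fail the build. This is plain pytest configuration, not SpecLeft-specific:

```ini
# pytest.ini: promote all warnings to errors so deprecation debt
# fails the test run instead of scrolling past unread
[pytest]
filterwarnings = error
```

With this in place, the agent can't claim "tests pass" while 175 deprecation warnings pile up; they become failures it has to fix.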

In contrast, the SpecLeft agent did introduce bugs during development—like the timeout_hours=0 truthiness trap and the review_cycle default issue. But crucially, it found and fixed them. The structured verification process forced the agent to confront its own logic errors, whereas the baseline agent simply marked its own homework as "correct" and moved on.

TDD emerged naturally from the workflow

This was unplanned but a pleasant surprise! Codex with SpecLeft generated test skeletons via specleft test skeleton, and those skeletons guided the agent into writing test assertions before implementation code. Not because the prompt said "do TDD" — it didn't. The structure of the scaffolding naturally produced that workflow.

What was even more interesting was how the agent approached the implementation. Based on the agent logs, it appeared to reason about the overall behaviour of the app rather than purely about local logic.

The SDD token cost is real and significant

No getting around it: SpecLeft runs used 2–3x more tokens than baseline. The spec externalisation phase (45k–49k tokens) is pure overhead if you measure by "tokens to first passing test." The baseline agents started writing code sooner and finished sooner. From what I've seen, this is a common problem across SDD tools, which makes sense: it's a lot of additional context.

The question is whether "passing tests" is the right finish line. The risk is that the baseline code ships to production without a key piece of functionality, which would take time to find, diagnose, and fix. My guess is that would cost more than 90k tokens, plus the impact on end users.

It's worth investigating whether token usage can be reduced with context-engineering techniques.


Try It on your PRD.md

```shell
pip install specleft

specleft init
specleft doctor
specleft status
specleft plan

# or add individual features
specleft features add
```

Repo: https://github.com/SpecLeft/specleft
Docs: specleft.dev


Over to You

The data is in the repo. The recordings are linked above. Run it yourself if you want — same PRD, same setup, different agent if you like.

The bigger question: How are you verifying agent output today, or are you going with the pure vibe-coding approach?

Personally, here's what I'm thinking about: should that traceability be enforced in CI? A gate that fails the build if critical scenarios aren't implemented — not as a suggestion, but as a policy. Or is visibility enough?
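As a sketch of what such a gate could look like, here is a hypothetical traceability check. The spec format and `@scenario(...)` marker syntax are invented for illustration; a real gate would consume the tool's own output rather than regex over files:

```python
# Hypothetical CI gate: fail the build if any scenario ID declared in
# the specs has no corresponding test referencing it.
import re

def find_gaps(spec_text: str, test_text: str) -> list[str]:
    """Scenario IDs declared in specs but not referenced by any test."""
    declared = set(re.findall(r"Scenario:\s*(\S+)", spec_text))
    covered = set(re.findall(r'@scenario\("([^"]+)"\)', test_text))
    return sorted(declared - covered)

spec = "Scenario: auto-escalation\nScenario: multi-reviewer-approval\n"
tests = '@scenario("multi-reviewer-approval")\ndef test_approval(): ...\n'
gaps = find_gaps(spec, tests)
print(gaps)  # ['auto-escalation'] -> a CI step would exit nonzero here
```

Run against the Opus baseline in this experiment, a gate like this would have flagged the missing auto-escalation requirement before merge instead of after a manual post-analysis.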

I've started working on CI enforcement of behavioural functionality — request early access if you use Python and AI agents in your dev workflow and want to be involved.

Drop a comment — I'm keen to hear your thoughts.

Top comments (11)

Ned C

i ran some tests on cursor's agent mode a few weeks ago and found something similar. it would pass its own lint checks and tests but introduce subtle drift from what you actually asked for, like renaming variables to match its preferred style or adding error handling you didn't request. the 0-issues self-report on code that actually had bugs is the part that concerns me most because it means you can't even use the agent's confidence as a signal.

Richard Kakengi

Oh that's interesting to hear - haven't used Cursor much. Quite worrying how these types of issues start small but can snowball into something much more critical over time, especially with how quickly the agents iterate.

May I ask what followed after you came across those renaming variables and error handling issues - did you put any process in place to reduce the chance of it happening?

Ned C

i started keeping a .cursor/rules file that explicitly tells the agent not to rename variables or add error handling unless i ask for it. the "ONLY do X when explicitly asked" framing sticks better than "don't do X" in my experience. i also got into the habit of diffing the full changeset before accepting anything, not just the file i asked about. the sneaky edits are usually 2-3 files away from the one you're focused on

Richard Kakengi

Nice, yeah I've found that too - telling the agent NOT to do something, doesn't work too well. Seems like agents don't like being told "no" haha.

I came across a useful tip online to always include "ask for clarification if unsure" in the dev task prompts which has reduced some of the drift from original intent, but still needs that manual diff review like you say.

That's basically what led me to the experiment — the rules tell the agent what to do, but the specs give you something to verify and trace against after it's done. The diff review you're doing manually is the step I wanted to automate and make a better DX.

Ned C

yeah the "ask for clarification" instruction helps but it's not enough on its own. what actually moved the needle for me was switching from vague rules ("write clean code") to exclusive framing. like instead of "don't rename my variables" i changed it to "ONLY rename variables when the user explicitly asks you to rename them." the model treats that differently for whatever reason, it gives it a concrete condition to check against instead of a soft preference to maybe follow

Richard Kakengi

That approach sounds quite close to the problem I'm trying to solve with this tool - but it sounds like you're defining them more as general project rules rather than for explicit behaviour. Sounds like a good solution!

If you'd be up for trying the spec approach on one of your Cursor projects I'd genuinely like to hear what works and what doesn't — I'm collecting feedback from people who've already hit this wall.

Ned C

i'd be open to trying it on something small. the thing i'd want to see is how the spec handles cases where the agent does something technically correct but architecturally wrong, like renaming a variable to something "better" and breaking references downstream. that's the gap my rules don't cover well either

Richard Kakengi

That's an interesting edge case to try out. I'd love to set this up and get your insight — although I don't think Dev.to has DMs. Feel free to drop me a line on richard@specleft.dev. Thanks!

Ned C

i'll shoot you an email. the variable rename case sounds like a good first test. would be useful to see if the spec catches it where rules alone don't.

Matthew Hou

This title nails it. The whole "use AI to review AI" loop sounds elegant in theory but has a fundamental problem: the verifier shares the same blindspots as the generator.

I've been leaning hard into human-defined verification instead. Not reviewing every line — that doesn't scale — but writing acceptance criteria before the AI writes any code. "This function should handle X, reject Y, and never touch Z." If the AI's output doesn't match the spec, it doesn't matter how clean the code looks.

Kent Beck landed on something similar: humans define WHAT (tests, specs, acceptance criteria), AI implements HOW. The moment you let AI define both sides, you get code that passes its own tests but misses the edge cases no one thought to test for.

The hard part is that writing good acceptance criteria is harder than writing code. But that's kind of the point — it's where the actual thinking lives.

Richard Kakengi

Great to hear that we're on the same page! Definitely agree with Kent Beck's assessment. Creativity and common sense are things AI can't do well (from what I've seen, anyway).

Writing strong acceptance criteria or expected behaviour has always been a challenge, but I think this is where agents can be super useful. The fast feedback loops let us trial the intent a lot quicker, rather than finding edge cases in production.

What does your current workflow look like - do you provide ACs in small increments or in bigger PRD-style docs?