I ran an experiment with the same project through two LLM scenarios — once with a standard prompt, once with a spec-driven workflow. The res...
i ran some tests on cursor's agent mode a few weeks ago and found something similar. it would pass its own lint checks and tests but introduce subtle drift from what you actually asked for, like renaming variables to match its preferred style or adding error handling you didn't request. the 0-issues self-report on code that actually had bugs is the part that concerns me most because it means you can't even use the agent's confidence as a signal.
Oh that's interesting to hear - I haven't used Cursor much. Quite worrying how these types of issues start small but can snowball into something much more critical over time, especially with how quickly the agents iterate.
May I ask what happened after you came across those variable-renaming and error-handling issues - did you put any process in place to reduce the chance of it happening again?
i started keeping a .cursor/rules file that explicitly tells the agent not to rename variables or add error handling unless i ask for it. the "ONLY do X when explicitly asked" framing sticks better than "don't do X" in my experience. i also got into the habit of diffing the full changeset before accepting anything, not just the file i asked about. the sneaky edits are usually 2-3 files away from the one you're focused on
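for reference, the full-changeset check is just plain git, nothing cursor-specific (the second path is a made-up example):

```shell
# see every file the agent touched, not just the one you asked about
git diff --stat HEAD

# then review any file you never mentioned in the prompt
# (src/unrelated_module.py is an illustrative path, not a real one)
git diff HEAD -- src/unrelated_module.py
```

the `--stat` summary is usually enough to spot when the agent wandered 2-3 files away from the one you were focused on.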
Nice, yeah I've found that too - telling the agent NOT to do something doesn't work too well. Seems like agents don't like being told "no" haha.
I came across a useful tip online to always include "ask for clarification if unsure" in the dev task prompts which has reduced some of the drift from original intent, but still needs that manual diff review like you say.
That's basically what led me to the experiment — the rules tell the agent what to do, but the specs give you something to verify and trace against after it's done. The diff review you're doing manually is the step I wanted to automate and give a better DX.
yeah the "ask for clarification" instruction helps but it's not enough on its own. what actually moved the needle for me was switching from vague rules ("write clean code") to exclusive framing. like instead of "don't rename my variables" i changed it to "ONLY rename variables when the user explicitly asks you to rename them." the model treats that differently for whatever reason, it gives it a concrete condition to check against instead of a soft preference to maybe follow
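to make it concrete, a rules fragment using that framing might look something like this (the wording is illustrative, not any official cursor syntax):

```text
ONLY rename variables when the user explicitly asks for a rename.
ONLY add error handling when the user explicitly requests it.
ONLY edit files beyond the one under discussion when the task requires it,
and list any such edits in your summary.
If a requirement is ambiguous, ask for clarification before changing code.
```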
That approach sounds quite close to the problem I'm trying to solve with this tool - though it sounds like you're defining them more as general project rules rather than as explicit behaviour specs. Sounds like a good solution!
If you'd be up for trying the spec approach on one of your Cursor projects I'd genuinely like to hear what works and what doesn't — I'm collecting feedback from people who've already hit this wall.
i'd be open to trying it on something small. the thing i'd want to see is how the spec handles cases where the agent does something technically correct but architecturally wrong, like renaming a variable to something "better" and breaking references downstream. that's the gap my rules don't cover well either
That's an interesting edge case to try out. I'd love to set this up and get your insight — although I don't think Dev.to has DMs. Feel free to drop me a line on richard@specleft.dev. Thanks!
i'll shoot you an email. the variable rename case sounds like a good first test. would be useful to see if the spec catches it where rules alone don't.
This title nails it. The whole "use AI to review AI" loop sounds elegant in theory but has a fundamental problem: the verifier shares the same blindspots as the generator.
I've been leaning hard into human-defined verification instead. Not reviewing every line — that doesn't scale — but writing acceptance criteria before the AI writes any code. "This function should handle X, reject Y, and never touch Z." If the AI's output doesn't match the spec, it doesn't matter how clean the code looks.
Kent Beck landed on something similar: humans define WHAT (tests, specs, acceptance criteria), AI implements HOW. The moment you let AI define both sides, you get code that passes its own tests but misses the edge cases no one thought to test for.
The hard part is that writing good acceptance criteria is harder than writing code. But that's kind of the point — it's where the actual thinking lives.
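That split can be made concrete as executable acceptance tests written before the agent produces any code. A minimal sketch (the function name and the three criteria are invented for illustration):

```python
# Human-written acceptance criteria, committed BEFORE the AI implements
# parse_amount(). The stub below stands in for whatever the agent produces;
# the asserts are the spec it must satisfy.

def parse_amount(raw: str) -> int:
    """Stand-in implementation; in practice the agent writes this part."""
    cleaned = raw.strip().replace(",", "")
    if not cleaned.isdigit():
        raise ValueError(f"not a whole non-negative amount: {raw!r}")
    return int(cleaned)

# "handle X": thousands separators and surrounding whitespace are accepted
assert parse_amount(" 1,234 ") == 1234

# "reject Y": fractional and non-numeric input must raise, never coerce
for bad in ("12.5", "abc", ""):
    try:
        parse_amount(bad)
    except ValueError:
        pass
    else:
        raise AssertionError(f"expected ValueError for {bad!r}")
```

If the agent's implementation doesn't satisfy these asserts, it fails regardless of how clean the diff looks.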
Great to hear that we're on the same page! Definitely agree with Kent Beck's assessment. Creativity and common sense are things that AI cannot do well (from what I've seen, anyway).
Writing strong acceptance criteria or expected behaviour has always been a challenge, but this is where I think agents can be super useful. The fast feedback loops let us trial the intent much quicker, rather than finding edge cases in production.
What does your current workflow look like - do you provide ACs in small increments or bigger PRD-style docs?