Kunal Sharda

Posted on Jun 3 • Edited on Jun 7

I changed how I write acceptance criteria, and my AI agent stopped building the wrong thing

#ai #devtools #programming #productivity

For a while I blamed the model. The agent would build something plausible and wrong, and I would assume it needed a smarter brain. Then I went back and read the tickets I had handed it, and the problem was obvious. My acceptance criteria were wishes, not specifications. The agent built exactly what I wrote. I just had not written what I meant.

Here is the change that fixed most of it, and it has nothing to do with the model.

Prose acceptance criteria are where intent goes to die

Most ACs read like this:

The export should handle large files gracefully and not time out.

Every word in that sentence is a landmine. How big is "large"? What behavior counts as "gracefully"? "Time out" at what threshold? A human reviewer fills those gaps with assumptions, usually different from the ones the author had in mind. An AI agent fills them too, just faster and more confidently. You get working code for a spec nobody actually agreed on.

The fix is to stop writing criteria as description and start writing them as something checkable. If a criterion cannot become a pass or fail, it is not a criterion. It is a vibe.

The format that travels

I moved everything to a given / when / then shape. Boring on purpose.

Given a CSV with 100,000 rows
When the user triggers an export
Then the file streams to download and completes within 30 seconds
And peak memory stays under 512 MB

Now there is nothing to assume. The thresholds are explicit. An engineer, QA, and the agent all read it the same way. And the "then" clause is the quiet hero: it makes the criterion testable. You can write the test before the code, and the agent can check itself against it.
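As a rough illustration of writing the test before the code, the criterion above could be turned into a check like the one below. Treat it as a sketch under assumptions: `stream_export` and `make_rows` are hypothetical stand-ins for the real export path and fixture, and `tracemalloc` only counts Python-level allocations, so it approximates peak memory but does not measure the full process footprint.

```python
import csv
import io
import time
import tracemalloc
from typing import Iterable, Iterator

ROW_COUNT = 100_000                  # Given: a CSV with 100,000 rows
MAX_SECONDS = 30.0                   # Then: completes within 30 seconds
MAX_PEAK_BYTES = 512 * 1024 * 1024   # And: peak memory stays under 512 MB


def make_rows(n: int) -> Iterator[list[str]]:
    """Lazily yield n synthetic rows so the fixture itself stays small."""
    for i in range(n):
        yield [str(i), f"name-{i}", f"user{i}@example.com"]


def stream_export(rows: Iterable[list[str]]) -> Iterator[str]:
    """Hypothetical export: yields one CSV line at a time instead of
    building the whole file in memory."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow(row)
        yield buffer.getvalue()
        # Reset the buffer so memory use does not grow with row count.
        buffer.seek(0)
        buffer.truncate(0)


def check_export_criterion() -> dict:
    """Run the export once and compare each measurement to its threshold."""
    tracemalloc.start()
    started = time.perf_counter()
    # When: the user triggers an export (simulated by draining the stream).
    lines = sum(1 for _ in stream_export(make_rows(ROW_COUNT)))
    elapsed = time.perf_counter() - started
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "rows_exported": lines,
        "elapsed_seconds": elapsed,
        "peak_bytes": peak,
        "passed": (
            lines == ROW_COUNT
            and elapsed <= MAX_SECONDS
            and peak < MAX_PEAK_BYTES
        ),
    }


if __name__ == "__main__":
    print(check_export_criterion())
```

Each constant at the top maps to one line of the criterion, so if the spec changes, the number changes in one place and the check follows.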

A few rules I hold to now:

Numbers, not adjectives. "Fast" becomes "under 200ms at p95." "Large" becomes a row count. If you cannot put a number on it, you do not understand the requirement yet, and neither will the agent.

One behavior per criterion. The moment a criterion has an "and also," split it. Compound criteria are how half-finished features pass review.

State the unhappy path explicitly. Most agent failures live here. What happens on an empty input, a duplicate, a permission error? If you do not write it, the agent will invent it, and you will not like what it invents.
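Here is a minimal sketch of the three rules above as checks. Everything in it is hypothetical: `add_item`, `DuplicateItem`, `PermissionDenied`, the sample latencies, and the behavior chosen for each unhappy path are assumptions for illustration, not a recommendation for what your system should do. The percentile uses the nearest-rank method, which is one of several common definitions of p95.

```python
class DuplicateItem(Exception):
    """Hypothetical error for an item that already exists."""


class PermissionDenied(Exception):
    """Hypothetical error for a caller without write access."""


def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of the given latency samples."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def add_item(store: set[str], name: str, can_write: bool) -> None:
    """Hypothetical function under test. Each unhappy path is decided
    explicitly instead of being left for the implementer to invent."""
    if not can_write:
        raise PermissionDenied("caller lacks write access")
    if not name.strip():
        raise ValueError("name must not be empty")
    if name in store:
        raise DuplicateItem(name)
    store.add(name)


def raises(expected: type, store: set[str], name: str, can_write: bool) -> bool:
    """Return True only if add_item raises the expected error type."""
    try:
        add_item(store, name, can_write)
    except expected:
        return True
    except Exception:
        return False
    return False


# Numbers, not adjectives: "fast" becomes a threshold on a measured value.
def check_latency_p95_under_200ms() -> bool:
    samples = [120.0] * 95 + [450.0] * 5  # made-up latencies in ms
    return p95(samples) < 200.0


# One behavior per criterion: each check answers one pass/fail question.
def check_empty_name_rejected() -> bool:
    return raises(ValueError, set(), "   ", True)


def check_duplicate_rejected() -> bool:
    return raises(DuplicateItem, {"widget"}, "widget", True)


def check_permission_error() -> bool:
    return raises(PermissionDenied, set(), "widget", False)


if __name__ == "__main__":
    for check in (
        check_latency_p95_under_200ms,
        check_empty_name_rejected,
        check_duplicate_rejected,
        check_permission_error,
    ):
        print(check.__name__, "PASS" if check() else "FAIL")
```

Splitting the unhappy paths into separate checks means a failure names the exact behavior that is missing, which is harder to get from one compound criterion.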

Why this matters more with AI, not less

A human engineer who reads a vague AC will often stop and ask. Slack you, raise it in refinement, push back. That friction is annoying and it is also a safety net. The vague spec gets clarified because a person refused to guess.

An agent does not refuse to guess. It guesses instantly and commits. So the vague AC that a human would have flagged sails straight through into code. The discipline that you could get away with skipping when humans were the only readers is now load-bearing.

This is the part people miss when they say AI lets you move faster. It does, but it removes the human who used to catch your underspecified tickets. You have to put that rigor back into the spec, because the agent will not supply it for you.

Where good criteria alone are not enough

Honesty time. A sharp AC fixes the "built the wrong thing" failure. It does not fix the "could not see the rest of the system" failure. The agent can perfectly satisfy a criterion and still duplicate an existing utility or violate an architecture decision it never knew about, because that context lived in another tool.

So the AC is necessary, not sufficient. The agent needs the criterion AND the surrounding truth: the existing tests, the relevant decisions, the related stories. When I write criteria as checkable statements and the agent can query them along with the rest of the project, the output stops being plausible and starts being correct.

That is the thesis behind what I am building at Stride: the AI writes and reads acceptance criteria as linked nodes next to the tests and decisions they relate to, so a criterion is never three tabs away from the thing that proves it. But you do not need any particular tool to get most of this benefit today. You need to stop writing wishes.

Try this on your next ticket

Take the next thing you are about to hand an agent. Find every adjective in the acceptance criteria and replace it with a number or a concrete behavior. Add the unhappy path. Then run the agent. The difference is not subtle.
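If you want a mechanical first pass at that exercise, a small script can flag the suspicious words for you. This is a rough sketch, not a real linter: the word list is a hand-picked assumption that you would extend for your own team's habits, and a clean result only means no listed word appeared, not that the criterion is actually checkable.

```python
import re

# Hand-picked starter list (an assumption, not an exhaustive vocabulary).
VAGUE_WORDS = {
    "fast", "slow", "large", "small", "gracefully", "snappy",
    "responsive", "robust", "quickly", "easily", "properly", "reasonable",
}


def find_vague_words(criterion: str) -> list[str]:
    """Return the listed vague words found in the criterion, in order."""
    words = re.findall(r"[a-z]+", criterion.lower())
    return [word for word in words if word in VAGUE_WORDS]


def has_number(criterion: str) -> bool:
    """Return True if the criterion contains at least one digit."""
    return re.search(r"\d", criterion) is not None


def review(criterion: str) -> str:
    """Summarize what a human should look at before handing this to an agent."""
    vague = find_vague_words(criterion)
    notes = []
    if vague:
        notes.append("replace: " + ", ".join(vague))
    if not has_number(criterion):
        notes.append("no number or threshold found")
    return "; ".join(notes) if notes else "no obvious vague wording"


if __name__ == "__main__":
    for text in (
        "The export should handle large files gracefully and not time out.",
        "Then the file streams to download and completes within 30 seconds",
    ):
        print(f"{text}\n  -> {review(text)}")
```

Criteria that are legitimately a boolean or an exact string will trip the "no number" note, so read the output as a prompt to look again, not as a verdict.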

What is the worst acceptance criterion you have shipped, in hindsight? I will go first: "should feel snappy." Caused a week of rework. Your turn.

Top comments (2)

Echo • Jun 4

Same here. The unlock for me was making every criterion a number or a boolean - 'fast' became 'p95 < 200ms', 'handles errors' became 'returns 4xx with json body'. Once it's checkable, both the model and QA can verify it the same way.

Kunal Sharda • Jun 7

Exactly that. "Fast" and "handles errors" are the two phrases that have wasted the most agent-tokens in my codebase.

The riff I'd add: the discipline gets harder once you leave backend territory. "p95 < 200ms": clean. "Returns 4xx with json body": clean. But "the form feels responsive" or "the empty state explains what to do" resist numbers, and you end up either writing them as a precise string ("show empty-state copy: 'No items yet. Add your first one.'") or punting them to a Figma reference.

The hidden second unlock for me: once every criterion is checkable, the agent can verify its own work before opening the PR. The same definition you use as input becomes the test it runs as output. Closes a feedback loop that used to need a human in the middle.

What do you do for the criteria that resist clean numbers?
