What 'Done' Means for an AI Feature (Write It Down)

#ai #projectmanagement #testing #softwaredevelopment

For an AI feature, "done" means it passes a written set of acceptance tests on real inputs at an agreed success rate, not that it produced a good answer once in a demo. If you can't state the metric and the threshold, the feature isn't defined, and you'll argue about it at sign-off instead of agreeing on it up front.

This is the conversation I have at the start of every AI engagement, because it's the one that prevents the worst conversation at the end of one. Here's how I run it.

Traditional "done" breaks on AI

Normal software has a clean contract: input X returns output Y. You write a test, it passes or fails, everyone agrees on the state. Generative AI breaks that. The same input can return different output across runs. So "it worked when I tried it" is no longer evidence the feature is finished. It's evidence it can work, on that input, that time.

That's why I don't accept "the demo passed" as a definition of done. A demo is one sample from a probabilistic system. I need to know how the feature behaves across the real input distribution, and I need that agreed before we build, because acceptance criteria are how the client and the team agree on what "done" actually means.
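To put a number on how little one demo tells you, here is a minimal sketch using the Wilson score interval, a standard confidence interval for a success rate. The choice of interval and the counts (1 of 1, 180 of 200) are my assumptions for illustration, not figures from a real project.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 is roughly 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# One successful demo run: a single sample.
demo = wilson_interval(successes=1, trials=1)

# Hypothetical golden-set run: 180 passes out of 200 cases.
golden = wilson_interval(successes=180, trials=200)

print(f"demo:   {demo[0]:.2f} to {demo[1]:.2f}")
print(f"golden: {golden[0]:.2f} to {golden[1]:.2f}")
```

With these made-up counts, the single demo is consistent with a true success rate anywhere from roughly 0.2 to 1.0, while the 200-case run narrows it to roughly 0.85 to 0.93. That gap is the difference between "it can work" and "it works at an agreed rate."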

Define done as a measurable target, not a vibe

For an AI feature, I write acceptance criteria as numbers tied to a test set:

Task completion rate on a fixed set of realistic cases. "The agent resolves the request correctly in at least N% of the golden test cases."
Tool/argument correctness for agents that take actions. Did it call the right function with the right inputs, not just produce nice text?
Groundedness for anything retrieval-based. Is the answer supported by the source, or a confident fabrication?
Refusal quality. When it shouldn't answer, does it decline cleanly instead of guessing?
Latency and cost per request, because a correct answer that takes nine seconds or burns a fortune in tokens may still fail the real requirement.

The exact metrics depend on the feature. The discipline doesn't: every criterion is a number against a named test set, with a threshold the client signed off on.
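Written down, criteria like these can be as simple as a list of named thresholds. The sketch below is one possible shape, not a standard: the `Criterion` class, the metric names, and every threshold value are hypothetical placeholders that you would replace with the numbers the client actually signed off on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One acceptance criterion: a metric name and its agreed threshold."""
    name: str
    threshold: float
    higher_is_better: bool = True

    def passes(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.threshold
        return measured <= self.threshold


# Placeholder thresholds for illustration only.
CRITERIA = [
    Criterion("task_completion_rate", 0.90),
    Criterion("tool_call_accuracy", 0.95),
    Criterion("groundedness_rate", 0.95),
    Criterion("refusal_accuracy", 0.90),
    Criterion("p95_latency_seconds", 5.0, higher_is_better=False),
    Criterion("cost_per_request_usd", 0.05, higher_is_better=False),
]


def evaluate(measured: dict[str, float]) -> dict[str, bool]:
    """Check each criterion; a metric that was never measured counts as a failure."""
    return {
        c.name: c.name in measured and c.passes(measured[c.name])
        for c in CRITERIA
    }


def is_done(measured: dict[str, float]) -> bool:
    """The feature is 'done' only when every agreed criterion passes."""
    return all(evaluate(measured).values())


# Hypothetical measurements from one run against the golden set.
run = {
    "task_completion_rate": 0.93,
    "tool_call_accuracy": 0.97,
    "groundedness_rate": 0.96,
    "refusal_accuracy": 0.91,
    "p95_latency_seconds": 9.0,  # accurate, but too slow
    "cost_per_request_usd": 0.03,
}
report = evaluate(run)
```

In this made-up run every quality metric clears its bar, yet `is_done(run)` is false because latency misses the threshold, which is the point of putting latency and cost in the same list as accuracy.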

The eval harness is a deliverable, not overhead

Here's the part teams skip and then regret. You can't measure any of that without an eval harness: a repeatable way to run the feature against your golden cases and grade the output. So I scope it as a real deliverable with its own hours: pick a handful of key tasks, define success, build golden examples, and grade them, starting with simple deterministic checks before anything fancier.
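A first version of such a harness can be small. The sketch below assumes the feature is callable as a plain function from prompt to text and grades with the simplest deterministic check I know of, a case-insensitive "output must contain these strings" test. `GoldenCase`, `run_harness`, and `fake_feature` are hypothetical names, and the stand-in feature exists only so the example runs without a model.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    case_id: str
    prompt: str
    must_contain: list[str]  # deterministic check: all of these must appear


@dataclass
class CaseResult:
    case_id: str
    passed: bool
    latency_s: float


def run_harness(feature: Callable[[str], str], cases: list[GoldenCase]) -> list[CaseResult]:
    """Run the feature once per golden case and grade each output."""
    results = []
    for case in cases:
        start = time.perf_counter()
        output = feature(case.prompt)
        latency = time.perf_counter() - start
        text = output.lower()
        passed = all(s.lower() in text for s in case.must_contain)
        results.append(CaseResult(case.case_id, passed, latency))
    return results


def success_rate(results: list[CaseResult]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def fake_feature(prompt: str) -> str:
    """Stand-in for the real AI feature (an assumption for this example)."""
    if "refund" in prompt.lower():
        return "Your refund was issued."
    return "I can't help with that."


golden_set = [
    GoldenCase("refund-01", "Please refund order 123", ["refund"]),
    GoldenCase("hours-01", "What are your opening hours?", ["9am"]),
]

results = run_harness(fake_feature, golden_set)
print(f"success rate: {success_rate(results):.0%}")
```

Because the real feature is non-deterministic, in practice you would likely run each case several times and report a rate per case; this sketch runs each once to keep the structure visible. Fancier graders can replace the `must_contain` check later without changing the loop.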

A practical sequence I use:

Weeks 1–2: agree the key tasks and what success means for each. Collect real examples into a golden set.
Weeks 3–4: build the harness, add basic metrics (success rate, steps, latency, cost), and establish a baseline.
From there: every change runs against the harness, so "did this improvement break something else?" is a number, not an opinion.
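The "did this break something else?" check can be sketched as a comparison of per-case results between the baseline and a candidate change. The function names and the four case IDs below are hypothetical; the results are invented to show one specific trap.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case IDs that passed in the baseline but fail (or are missing) in the candidate."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


# Invented per-case results keyed by golden-case ID.
baseline = {"refund-01": True, "refund-02": True, "hours-01": False, "cancel-01": True}
candidate = {"refund-01": True, "refund-02": False, "hours-01": True, "cancel-01": True}

regressions = find_regressions(baseline, candidate)
print(f"baseline {pass_rate(baseline):.0%}, candidate {pass_rate(candidate):.0%}")
print(f"regressions: {regressions}")
```

In this example the overall pass rate is identical before and after, yet one previously passing case now fails. Comparing per case, not just the headline rate, is what surfaces it.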

This is the same instinct as defining production requirements before you prototype, which we make the case for in "The AI demo works. That's the problem." The eval set is how you make "production-ready" measurable instead of a feeling.

Why this protects everyone

When "done" is a written, measured target, sign-off stops being a negotiation. The client knows exactly what they're accepting. The team knows exactly when they're finished. And nobody is stuck in the bad meeting where a stakeholder says "but it gave me a wrong answer yesterday" and the team says "it works on our machine." Both can be true for a non-deterministic system. The eval is what turns that standoff into a number you agreed on weeks earlier.

Key takeaways

"It worked in the demo" is not a definition of done for AI. One sample from a probabilistic system proves capability, not completion.
Write acceptance criteria as metrics against a fixed test set: completion rate, tool correctness, groundedness, refusal quality, latency, cost.
The eval harness is a real deliverable. Scope it with hours, not as an afterthought.
Agree the thresholds before building, so sign-off is confirmation, not a fight.
Start grading with simple deterministic checks before reaching for complex graders.

FAQ

How do you test something that gives different answers each time? You stop testing single outputs and start measuring rates across a fixed set of real cases. Done becomes "passes the agreed threshold on the golden set," not "gave a good answer once."

What's a golden set? A curated collection of real, representative inputs with known good outcomes. You run every version of the feature against it so you can see, in numbers, whether you're getting better or worse.

Isn't building evals expensive? It costs hours up front and saves far more in disputed sign-offs, regressions, and rework. It's the cheapest insurance on an AI project.

If your team is about to build an AI feature and "done" still means "the demo looked good," that's worth fixing before a line of code ships. The team at Shanti Infosoft can help you define measurable acceptance criteria and the eval harness to back them.

Top comments (1)

Saleha Mubeen • Jul 16

This is a solid perspective. One thing I've noticed with AI projects is that teams often spend more time debating whether a feature is "good enough" than actually measuring it. Having a predefined evaluation set and measurable success criteria removes a lot of that subjectivity.

I also like the emphasis on including latency and cost alongside accuracy. In production, a highly accurate model that's too slow or expensive can still fail to meet business requirements. Treating the evaluation harness as part of the product rather than an afterthought seems like a best practice that more teams should adopt.