DEV Community

Jamie Gray
How I Approach Evaluation When Building AI Features

Building an AI feature is not the same as shipping traditional software.

In classic software, you write code, test it, and deploy it. Deployment is usually a finish line.

With AI features, deployment is just the beginning.

That is one of the biggest mindset shifts I have had while working on AI systems. The question is not only whether a feature works during development. The bigger question is whether it keeps working well when real users, messy inputs, changing data, and production constraints enter the picture.

That is why I take evaluation seriously.

Not as a one-time quality check.
Not as something to do right before launch.
But as an ongoing part of building the product.

Why evaluation has to be continuous

AI systems are different because their behavior is not fully fixed.

Even if the code around the model does not change, the outputs can still shift because of:

  • new user inputs
  • different context data
  • retrieval quality changes
  • prompt changes
  • model updates
  • distribution drift in real-world usage

That means evaluation cannot be treated as a checkbox.

It has to be part of the product lifecycle.

I want to know not just whether the feature looked good in a demo, but whether it remains useful, stable, and trustworthy as conditions change.

That is the real test.

What I actually care about when evaluating AI features

When I evaluate an AI feature, there are five things I care about most:

1. Accuracy

Is the output correct?

This sounds obvious, but it is still the first thing I check. If the system produces wrong answers, wrong classifications, wrong summaries, or wrong structured data, nothing else matters much.

That said, accuracy in AI systems is often contextual. Sometimes “correct” means factually correct. Sometimes it means aligned with a business rule. Sometimes it means sufficiently useful for the task.

So I try to define accuracy in a way that matches the real product outcome, not just a vague technical idea of correctness.
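As a concrete illustration, here is a minimal sketch of what "accuracy defined by the product outcome" can look like. The invoice fields and the business rule (exact match on totals, case-insensitive match on vendor names) are hypothetical examples, not a general recipe:

```python
# Hypothetical example: accuracy defined by the product outcome, not by
# generic string equality. "invoice_total" must match exactly, but
# "vendor_name" only needs to match case-insensitively -- a business rule.

def is_correct(predicted: dict, expected: dict) -> bool:
    """Task-specific correctness: exact on totals, lenient on vendor casing."""
    return (
        predicted.get("invoice_total") == expected["invoice_total"]
        and predicted.get("vendor_name", "").strip().lower()
            == expected["vendor_name"].strip().lower()
    )

def accuracy(predictions: list[dict], gold: list[dict]) -> float:
    """Fraction of outputs that meet the product's definition of correct."""
    correct = sum(is_correct(p, g) for p, g in zip(predictions, gold))
    return correct / len(gold) if gold else 0.0
```

The point is that the scorer encodes what "correct" means for this product, so the accuracy number actually tracks the outcome you care about.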

2. Relevance

Even technically correct output can still be unhelpful.

A feature might produce something reasonable, but if it does not solve the user’s actual need, it is not high quality.

That is why I evaluate whether the output is relevant to the request, the workflow, and the context in which the feature is being used.

This matters a lot in AI systems because models are often capable of producing plausible but slightly off-target results.

Those are dangerous because they look good at first glance.
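One cheap first-pass signal for relevance is checking how much of the request's vocabulary the output actually addresses. This is a crude heuristic, not a real relevance metric (production systems would use embeddings or human judgment), and the stopword list here is an illustrative assumption:

```python
# Crude, hedged proxy for relevance: share of the request's content words
# that appear in the output. Only useful as a cheap first-pass filter.

STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "how"}

def request_coverage(request: str, output: str) -> float:
    """Fraction of content words from the request that appear in the output."""
    req_terms = {w for w in request.lower().split() if w not in STOPWORDS}
    out_terms = set(output.lower().split())
    if not req_terms:
        return 0.0
    return len(req_terms & out_terms) / len(req_terms)
```

A low score does not prove irrelevance, but it is a cheap way to flag outputs worth a human look.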

3. Consistency

If two similar inputs produce wildly different output quality, the product will feel unreliable.

Consistency matters because users form expectations fast.

If a feature works beautifully once and then performs poorly the next time, trust drops quickly.

So I pay attention to whether the system behaves predictably across similar cases, especially around formatting, decision logic, quality level, and error handling.

Consistency is one of the most underrated parts of AI quality.
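One way to probe consistency is to run paraphrases of the same request through the feature and check that the outputs agree on structure. This sketch only compares output keys and rough answer length; the `2.0` length ratio is an assumed tolerance, not a standard:

```python
# Sketch of a consistency check over outputs from similar inputs: same
# schema, comparable answer lengths. The length-ratio threshold is assumed.

def outputs_are_consistent(outputs: list[dict], max_length_ratio: float = 2.0) -> bool:
    """True if all outputs share the same keys and comparable answer lengths."""
    keys = {frozenset(o.keys()) for o in outputs}
    if len(keys) != 1:
        return False  # schema drifted between similar inputs
    lengths = [len(o.get("answer", "")) for o in outputs]
    return max(lengths) <= max_length_ratio * max(min(lengths), 1)
```

Structural checks like this will not catch every inconsistency, but they catch the obvious cases where similar inputs get wildly different treatment.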

4. Safety and failure behavior

A feature is not only defined by when it works.

It is also defined by how it behaves when it does not work.

I want to know:

  • does it fail clearly?
  • does it avoid unsafe output?
  • does it avoid pretending to know more than it knows?
  • does it return a controlled result when confidence is low?
  • does it trigger fallback logic appropriately?

This is part of evaluation too.

A system that occasionally says “I cannot produce a reliable result here” may be much better than a system that always returns something confident but questionable.
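A minimal sketch of that idea is a confidence gate around the model's result. Everything here is a hypothetical stand-in (the confidence score, the threshold value, the response shape); the point is that low confidence produces a controlled message instead of a confident-sounding guess:

```python
# Hedged sketch of confidence-gated fallback. The 0.7 threshold is an
# assumed tuning value, not a universal constant.

CONFIDENCE_THRESHOLD = 0.7

def answer_or_fallback(result: dict) -> dict:
    """Return the model answer only when confidence clears the bar."""
    if result.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD:
        return {"status": "ok", "answer": result["answer"]}
    return {
        "status": "fallback",
        "answer": "I cannot produce a reliable result here.",
    }
```

The fallback path is itself something to evaluate: how often it fires, and whether it fires on the right cases.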

5. Usability

Even a technically strong model can create a bad product if the experience is clumsy.

That is why I also evaluate things like:

  • response format
  • readability
  • latency
  • whether the output is actionable
  • whether the UI can use the response cleanly
  • whether the feature helps the user move forward

Usability is not separate from model quality.

In product terms, usability is quality.

Automated evaluation is useful, but limited

I like automated tests.
I use them often.
But I do not think they are enough for AI systems.

Automated evaluation is great for checking:

  • known test cases
  • regression behavior
  • output shape
  • schema compliance
  • business rules
  • scoring against benchmark datasets

These are all valuable.

They help catch obvious breakages early.
They make iteration safer.
They create a baseline.
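To make that concrete, here is one shape an automated check can take: a rule-based validator run over every output in a fixed regression set. The specific rules (length budget, forbidden phrase) are illustrative assumptions:

```python
# Minimal regression-style check: assert output shape and business rules
# for each known test case. The specific rules are illustrative.

def check_case(output: str) -> list[str]:
    """Return a list of rule violations for one output (empty = pass)."""
    violations = []
    if not output.strip():
        violations.append("empty output")
    if len(output) > 500:
        violations.append("exceeds length budget")
    if "as an ai language model" in output.lower():
        violations.append("forbidden boilerplate phrase")
    return violations
```

Running this over a stored set of known inputs after every prompt or model change is what makes iteration safer.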

But automated evaluation usually has limits.

It may miss subtle quality issues.
It may fail to capture user expectations.
It may not notice when output is technically valid but practically weak.

So I treat automation as necessary, but not sufficient.

Human evaluation still matters a lot

One of the most important lessons in AI product work is that human review still matters.

AI output quality is often contextual, and context is hard to fully encode in automated checks.

That is why I like including some kind of human evaluation loop, such as:

  • manual review of outputs
  • comparison between versions
  • user feedback collection
  • spot checks on edge cases
  • domain expert review for sensitive use cases

Human evaluation helps catch issues that metrics alone can miss.

For example:

  • tone feels wrong
  • reasoning is shallow
  • output is technically correct but not useful
  • the answer misses the most important point
  • the result feels inconsistent with user expectations

These are real product issues, even if an automated score does not flag them.

My favorite way to think about AI evaluation

I usually break evaluation into three layers:

Layer 1: Component correctness

This is the most basic layer.

I ask:

  • does the endpoint work?
  • is the request valid?
  • is the output schema correct?
  • does the system parse and return the result properly?
  • do rules and validations work as expected?

This layer is mostly about engineering correctness.
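A typical Layer 1 check is schema validation on the system's response before any quality judgment happens. The field names and types here are illustrative assumptions:

```python
# Layer 1 sketch: engineering correctness. Validate the response against an
# expected schema. Field names here are hypothetical.

REQUIRED_FIELDS = {"summary": str, "confidence": float, "sources": list}

def schema_errors(response: dict) -> list[str]:
    """List missing or mistyped fields; an empty list means schema-correct."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in response:
            errors.append(f"missing: {field}")
        elif not isinstance(response[field], expected_type):
            errors.append(f"wrong type: {field}")
    return errors
```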

Layer 2: Workflow quality

Here I ask whether the full feature works in practice.

For example:

  • does retrieval bring in the right context?
  • does the prompt produce the intended behavior?
  • does the output fit the product need?
  • does fallback behavior work when needed?
  • does latency stay in an acceptable range?

This is where many real issues appear.

The model may work fine, but the workflow around it may be weak.
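One way to evaluate at this layer is to run a request end to end and record workflow-level signals rather than model-level ones. The `pipeline` callable and its response fields are hypothetical stand-ins, and the latency budget is an assumed product requirement:

```python
# Layer 2 sketch: evaluate the whole workflow, not just the model call.
import time

LATENCY_BUDGET_S = 2.0  # assumed product requirement

def evaluate_workflow(pipeline, request: str) -> dict:
    """Run one request end to end and record workflow-level signals."""
    start = time.monotonic()
    result = pipeline(request)
    elapsed = time.monotonic() - start
    return {
        "has_context": bool(result.get("retrieved_context")),
        "used_fallback": result.get("status") == "fallback",
        "within_budget": elapsed <= LATENCY_BUDGET_S,
    }
```

Aggregating these signals over a test set shows where the workflow, not the model, is the weak link.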

Layer 3: Real user value

This is the highest layer.

I ask:

  • is this actually helping users?
  • is it reducing effort?
  • is it improving speed or quality?
  • are people trusting it?
  • are they using it again?

This is the layer that matters most in the long run.

A feature can pass technical tests and still fail to create value.

I care a lot about edge cases

A lot of AI features look strong on normal examples.

That is not enough.

The real test is how they behave when inputs are incomplete, ambiguous, messy, repetitive, or just strange.

That is why I deliberately evaluate edge cases such as:

  • missing context
  • contradictory input
  • unexpected formatting
  • overly long input
  • low-signal input
  • near-duplicate requests
  • malformed documents
  • empty or partial results from upstream systems

Edge cases are where reliability becomes visible.

They also reveal whether the system is truly engineered or just loosely connected around a model call.
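Edge-case coverage can be made deliberate with a small table of nasty inputs and a wrapper contract: the feature must never raise and must always return a controlled result. The `handle` wrapper and its `status` contract are hypothetical assumptions:

```python
# Hedged sketch of a deliberate edge-case suite. The wrapper under test must
# never raise and must always return a dict with a "status" field.

EDGE_CASES = [
    "",                              # empty input
    "   ",                           # whitespace only
    "yes no yes no " * 200,          # overly long, low-signal input
    "Cancel it. Do not cancel it.",  # contradictory instructions
    "\x00\x01 malformed bytes",      # unexpected formatting
]

def run_edge_suite(handle) -> list[str]:
    """Return a description of each edge case that crashed or went uncontrolled."""
    failures = []
    for case in EDGE_CASES:
        try:
            result = handle(case)
        except Exception as exc:  # the wrapper should never let this escape
            failures.append(f"raised {type(exc).__name__} on {case[:20]!r}")
            continue
        if not isinstance(result, dict) or "status" not in result:
            failures.append(f"uncontrolled result on {case[:20]!r}")
    return failures
```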

Feedback loops are part of evaluation

Evaluation should not stop once a feature goes live.

After launch, I want to learn from real usage.

That means looking at signals like:

  • user ratings
  • correction patterns
  • support complaints
  • failed requests
  • fallback frequency
  • manual review findings
  • drift in quality over time

These signals help answer an important question:

Is the system getting better, staying flat, or quietly getting worse?

Without feedback loops, it is very easy to assume an AI feature is healthy just because no one is actively reporting disaster.

That is not a strong standard.
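One of those signals, fallback frequency, can be tracked with very little machinery. This sketch assumes each logged event carries a `day` and a `status`; a rising fallback rate across days is an early hint of quiet degradation:

```python
# Sketch of one post-launch feedback signal: fallback rate per day.
# Event records are assumed to carry "day" and "status" fields.
from collections import defaultdict

def fallback_rate_by_day(events: list[dict]) -> dict[str, float]:
    """Map each day to the share of requests that ended in fallback."""
    totals: dict[str, int] = defaultdict(int)
    fallbacks: dict[str, int] = defaultdict(int)
    for event in events:
        totals[event["day"]] += 1
        if event["status"] == "fallback":
            fallbacks[event["day"]] += 1
    return {day: fallbacks[day] / totals[day] for day in totals}
```

The same pattern works for correction rates, failed requests, or any other per-request signal worth watching over time.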

Metrics matter, but I try not to worship them

Metrics are useful.
I rely on them.
But I also think it is easy to over-trust them.

A number can look clean while the user experience is getting worse.

For example, a feature may have:

  • strong request success rate
  • good average latency
  • valid JSON output
  • stable infrastructure

and still be underperforming from a product perspective.

Maybe the answers are too generic.
Maybe the model is missing nuance.
Maybe users are redoing the work manually.
Maybe the feature is technically “working” but not actually helping.

So I like metrics, but I always want them paired with real qualitative review.

My practical rule

When I evaluate an AI feature, I usually come back to one simple question:

If this feature became important to users tomorrow, would I trust the current evaluation process to catch quality problems early?

If the answer is no, the evaluation setup is probably too weak.

That usually means one of these is missing:

  • representative test cases
  • regression checks
  • edge-case coverage
  • human review
  • real-user feedback loops
  • monitoring for drift
  • clear definitions of quality

Evaluation is not just about measuring the system.

It is about building confidence that the system can keep improving without quietly breaking.

Final thought

When building AI features, I do not think evaluation is something you do after the work.

I think it is part of the work.

It shapes how you design the system.
It affects how safely you can iterate.
It influences how much trust the product earns.
And it determines whether the feature can survive real-world usage instead of just looking good in a test environment.

To me, strong AI evaluation means combining engineering discipline with product thinking.

It means checking correctness, usefulness, consistency, safety, and real user value.
It means using both automated checks and human review.
And it means accepting that quality is something you keep managing, not something you permanently finish.

That is how I approach evaluation when building AI features.


Closing question for DEV readers:

When you evaluate AI features, what do you trust more: automated test coverage, human review, or real user feedback after launch?
