Most AI demos look great on Friday afternoon.
You try five prompts. The model answers smoothly. The summary is crisp. The chatbot sounds helpful. The extraction workflow pulls the right fields out of the sample PDF. Everyone nods. Someone says, "This is basically ready."
Then real users arrive...
They paste messy inputs. They ask ambiguous questions. They upload documents with weird formatting. They use company slang your test prompts never included. They click "regenerate" six times. They get a beautifully formatted answer that is completely wrong.
This is the moment many teams discover that "I tried it and it seemed good" is not an engineering strategy.
Without evals, teams ship blind: regressions hide inside polished answers, model upgrades become guesswork, and users slowly learn not to trust the product.
If you are building with LLMs, evals are how you move from vibes to evidence.
What Is an AI Eval?
An eval is a repeatable way to measure some aspect of an AI system's behavior.
Typical behavior checks include:
Correctness: whether the answer is factually, logically, or operationally right.
Groundedness: whether the answer is supported by the context, sources, or documents the system was given.
Usefulness: whether the output actually helps the user make progress on the task.
Safety: whether the system avoids harmful, disallowed, private, or policy-violating behavior.
Formatting: whether the output follows the structure the product expects, such as valid JSON, required sections, or a specific schema.
Latency: how long the system takes to produce something usable for the user.
Cost: how much model, retrieval, infrastructure, or review expense it takes to complete the task.
Consistency: whether the system behaves reliably across similar inputs, users, and versions.
A few good examples in a Slack thread can be useful evidence, but they are not an eval by themselves. Neither is a vague sense that the new prompt "feels better." Even manually trying the same prompt after every change only gets you part of the way there, because it does not give the team a structured way to compare behavior over time.
A real eval lets you compare version A against version B and answer questions like:
- Did the new prompt improve answer quality?
- Did the model upgrade break our JSON schema?
- Did retrieval get better, or did it just get faster?
- Are we hallucinating less, or are the hallucinations just shorter?
- Did we fix the original bug without damaging the happy path?
That is the job: make AI behavior measurable enough that you can improve it on purpose.
Why AI Quality Needs More Than Unit Tests
Traditional software is usually deterministic. If you pass the same input into the same function, you expect the same output. Unit tests work beautifully in that world:
```javascript
expect(calculateTotal(cart)).toBe(42.99);
```
LLM applications are different.
The same prompt may produce slightly different outputs. Many tasks do not have one exact correct answer. A response can be grammatically perfect and factually false. A summary can be concise but omit the one point the user needed. A chatbot can be polite and useless.
This does not mean normal tests are obsolete. You should absolutely still test your code. Validate schemas. Check permissions. Assert that required fields exist. Test your retrieval filters. Test your API integration.
But AI quality usually needs more than pass/fail assertions.
For example, imagine you are building a support assistant. The user asks:
My flight was canceled, but I still need to get home tonight. Can I get a refund or switch to another flight?
There may be several acceptable answers depending on the airline's refund policy, available flights, the user's ticket type, and whether the cancellation was the airline's fault. You may care that the answer is accurate, grounded in the latest policy docs, concise, appropriately cautious, and useful.
That is not one assertion. That is a rubric.
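In code, the difference looks something like this: instead of a single pass/fail assertion, a rubric defines a minimum score per dimension. The dimension names, scale, and thresholds below are illustrative, not a real API.

```python
# Illustrative sketch: a rubric is a set of per-dimension floors,
# not one aggregate assertion. Names and thresholds are made up.
RUBRIC = {
    "correctness": 4,   # minimum acceptable score on a 1-5 scale
    "groundedness": 4,
    "conciseness": 3,
    "usefulness": 4,
}

def check_rubric(scores: dict[str, int], rubric: dict[str, int]) -> dict[str, bool]:
    """Return pass/fail per dimension instead of one overall verdict."""
    return {dim: scores.get(dim, 0) >= floor for dim, floor in rubric.items()}

# A human reviewer (or a calibrated judge model) supplies the scores:
result = check_rubric(
    {"correctness": 5, "groundedness": 3, "conciseness": 4, "usefulness": 4},
    RUBRIC,
)
# groundedness fails even though every other dimension passes
```

The payoff is that a failure tells you which dimension broke, not just that something did.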
The Basic Eval Loop
A good AI development loop looks something like this:
- Define what good and bad behavior look like.
- Build evals that measure those behaviors.
- Run the evals against the current system.
- Change the prompt, model, retrieval pipeline, or tool logic.
- Run the same evals again.
- Compare results before shipping.
The important part is that you do not only ask, "Did the score go up?"
You also ask, "What got worse?"
That question matters because AI systems often trade one behavior for another. You make answers shorter and suddenly they stop citing sources. You make the model more cautious and suddenly it refuses normal requests. You improve extraction recall and accidentally increase false positives.
An aggregate score can hide those tradeoffs. Mature evals measure multiple dimensions separately.
The Five Types of Evals
No single eval catches everything. Strong AI teams usually combine several kinds.
1. Deterministic Evals
These are old-school code checks. They are cheap, fast, and underrated.
Use deterministic evals when the property is unambiguous:
- Is the output valid JSON?
- Are all required fields present?
- Does the response stay under the token or character limit?
- Do citation IDs refer to real retrieved documents?
- Did the agent call an allowed tool?
- Did the workflow return the expected status?
For structured AI features, deterministic evals should be your first line of defense.
If your app expects this:
```json
{
  "company_name": "Acme Corp",
  "contract_value": 250000,
  "renewal_date": "2026-09-30"
}
```
Then do not use an LLM judge to decide whether the JSON parses. Just parse it.
Reach for code before you reach for another model.
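A deterministic check for the extraction output above can be plain standard-library code. The field names match the example; the validation shape is a sketch, not a prescribed format.

```python
import json

# Required fields and their expected JSON types for the contract example.
REQUIRED_FIELDS = {"company_name": str, "contract_value": int, "renewal_date": str}

def validate_extraction(raw_output: str) -> list[str]:
    """Return a list of deterministic failures; an empty list means pass."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors
```

No model call, no rubric, no ambiguity: the output either parses and conforms or it does not.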
2. Offline Evals
Offline evals run against saved examples before release.
Think of them as your AI regression test suite. You collect a dataset of inputs, expected behavior, labels, or reference outputs, then run candidate versions of your system against the same set.
Offline evals are especially useful for:
- comparing prompts
- comparing model versions
- testing retrieval changes
- catching regressions
- validating known failure cases
For an invoice extraction system, your offline eval might contain 1,000 historical invoices with human-verified fields. For a summarization system, it might contain representative source documents and a rubric for faithfulness, coverage, and clarity.
The goal is not to prove the system is perfect. The goal is to know whether a change made it better, worse, or merely different.
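A minimal offline runner is just a loop over a fixed dataset. The dataset entries, the candidate systems, and the exact-match scoring below are hypothetical stand-ins for your real pipeline.

```python
# Minimal offline-eval sketch: run candidate systems over the same saved
# dataset and compare scores. Inputs and labels here are invented.
DATASET = [
    {"input": "invoice_001.pdf", "expected_total": 1200.00},
    {"input": "invoice_002.pdf", "expected_total": 89.50},
]

def run_offline_eval(system, dataset) -> float:
    """Fraction of cases where the extracted total exactly matches the label."""
    hits = sum(1 for case in dataset
               if system(case["input"]) == case["expected_total"])
    return hits / len(dataset)

# Compare versions against the same fixed set, e.g.:
# score_v1 = run_offline_eval(extract_total_v1, DATASET)
# score_v2 = run_offline_eval(extract_total_v2, DATASET)
```

When the two scores differ, inspect which individual cases flipped, not just the aggregate number.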
3. Human Evals
Some things require human judgment.
You cannot reliably regex your way into knowing whether a customer support response shows empathy. You cannot exact-match your way into judging whether a brainstorming assistant produced genuinely useful campaign ideas.
Human evals ask reviewers to score outputs against a rubric.
They should also use a stable set of golden questions or examples that you revisit each time, which we will come back to later when we talk about golden datasets.
The rubric is the whole game. A vague "rate this from 1 to 5" is almost useless because every reviewer brings their own private definition of quality. One person gives a 3 because the answer is too long. Another gives a 3 because it missed a fact. Another gives a 3 because the tone feels off.
Same score. Completely different problem.
Better rubrics separate dimensions:
Correctness:
1 = contains major factual errors
3 = mostly correct but missing or ambiguous on important details
5 = accurate and complete based on the provided source
Groundedness:
1 = makes claims not supported by the source
3 = mostly grounded but includes minor unsupported claims
5 = all substantive claims are supported by the source
Usefulness:
1 = does not help the user complete the task
3 = partially helpful but requires significant user repair
5 = directly helps the user make progress with little or no repair
Your scores should point to something actionable.
4. LLM-as-Judge Evals
LLM-as-judge means using one model to evaluate another model's output.
This can be incredibly useful. It is faster and cheaper than asking humans to review thousands of outputs every time you change a prompt. It can help score dimensions like relevance, faithfulness, and completeness at a scale human review cannot match.
But do not treat the judge model as an oracle.
Judge models can prefer longer answers. They can favor certain writing styles. They can miss hallucinations. They can share blind spots with the model being evaluated. They can become inconsistent when the rubric is vague.
The right mental model is not "the AI grades the AI."
The right mental model is "the AI helps scale a review process that humans designed and calibrated."
A better workflow looks like this:
- Create a rubric.
- Have humans score a representative sample.
- Prompt the judge model with the same rubric.
- Compare judge scores to human scores.
- Inspect disagreements.
- Improve the judge prompt or rubric.
- Use the judge at scale, with ongoing human spot checks.
LLM judges are accelerators. They are not a substitute for knowing what quality means.
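The calibration step, comparing judge scores to human scores, can be as simple as an agreement rate. The scores below are invented; both lists are assumed to use the same 1-to-5 rubric.

```python
# Sketch of judge calibration: how often does the judge land within one
# point of the human reviewer on the same rubric? Data is illustrative.
def agreement_rate(human_scores: list[int], judge_scores: list[int],
                   tolerance: int = 1) -> float:
    """Fraction of cases where the judge is within `tolerance` of the human."""
    close = sum(1 for h, j in zip(human_scores, judge_scores)
                if abs(h - j) <= tolerance)
    return close / len(human_scores)

humans = [5, 4, 2, 3, 5, 1]
judge  = [5, 5, 4, 3, 4, 2]
rate = agreement_rate(humans, judge)
```

The disagreements are the interesting part: the human-2 versus judge-4 case above is exactly the kind of gap that tells you whether to fix the judge prompt or the rubric itself.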
5. Online Evals
Offline evals tell you how the system performs in the lab. Online evals tell you how it performs in the world.
Production users are wonderfully inconvenient. They reveal problems your curated dataset missed.
Useful online signals include:
- acceptance rate: did the user keep or use the output?
- copy rate: did the user copy the generated answer?
- regeneration rate: did they keep asking for another try?
- correction rate: how much did they edit the result?
- abandonment rate: did they leave after the AI response?
- escalation rate: did they need a human?
- time to first useful answer: how long until the user actually made progress?
- cost per successful task: how much did useful automation cost?
"Time to first useful answer" is especially important.
Developers often measure time to first token because it is easy. Users care about getting something useful. A fast model that gives three bad answers may be slower in product terms than a slightly slower model that gets it right the first time.
Match the Eval to the Task
A common mistake is trying to create one generic "AI quality score."
That sounds tidy, but it usually collapses different product goals into mush.
An AI system that extracts numbers from financial reports and a creative brainstorming assistant should not be evaluated the same way.
For financial extraction, you probably care about:
- field-level precision
- field-level recall
- schema validity
- exact values
- confidence calibration
- severity of mistakes
If the model extracts $10,000,000 instead of $100,000, it does not matter that the answer was well-written. It failed.
For a brainstorming assistant, you probably care about:
- relevance
- diversity
- novelty
- brand fit
- usefulness
- user adoption
Exact-match testing would be absurd there. There are many valid outputs.
For a retrieval-augmented assistant, you may need to evaluate at least two separate layers:
- retrieval quality: did we fetch the right context?
- answer quality: did the model use that context correctly?
If the answer is wrong, the model may not be the problem. Your retriever may have handed it irrelevant documents. Good evals help you locate the failure instead of randomly tweaking the prompt.
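Evaluating the layers separately lets a failed case point at the broken layer. The document IDs and the grounded-or-not flag below are illustrative; in practice the flag would come from a deterministic check or a calibrated judge.

```python
# Sketch: score retrieval and answering separately so failures can be located.
def retrieval_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Did we fetch the right context? Fraction of relevant docs retrieved."""
    if not relevant_ids:
        return 1.0
    return len(relevant_ids.intersection(retrieved_ids)) / len(relevant_ids)

def locate_failure(retrieved_ids, relevant_ids, answer_grounded: bool) -> str:
    if retrieval_recall(retrieved_ids, relevant_ids) < 1.0:
        return "retrieval layer: relevant documents were never fetched"
    if not answer_grounded:
        return "answer layer: model ignored or misused the retrieved context"
    return "both layers pass"

# A wrong answer with recall 0 implicates the retriever, not the prompt:
diagnosis = locate_failure(["doc_7", "doc_9"], {"doc_3"}, answer_grounded=False)
```

With that split, "the answer was wrong" becomes "the retriever never fetched doc_3," which is something you can actually fix.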
Build a Golden Dataset
Every serious AI feature should eventually have a golden dataset.
A golden dataset is a small, trusted set of examples that your team uses repeatedly to compare system changes. It should be hand-curated, reviewed, and stable enough to act as an anchor.
It does not need to be huge at first.
Start with 30 to 100 examples:
- common happy paths
- hard edge cases
- adversarial inputs
- ambiguous user requests
- known historical failures
- examples from important customer segments
Then keep growing it from production.
When your AI fails in an interesting way, ask whether that failure belongs in the dataset. Send the useful examples to a review queue. Be sure to remove sensitive data and tag them by task, failure type, or customer segment.
Then someone has to decide what the expected behavior should have been. For a support assistant, that might be a product manager and a support lead. For financial extraction, it might be a domain expert who can verify the source document and the expected fields. For legal or medical workflows, the reviewer may need real professional expertise.
If reviewers disagree on the right answer, treat that as a signal to clarify the policy, improve the rubric, or mark the case as ambiguous.
Once a case is reviewed, cleaned, labeled, and tagged, it can become part of the regression suite. That is how a weird production failure turns into a check that prevents the same bug from quietly returning later.
This is one of the simplest habits that separates demo engineering from production engineering.
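The promotion step itself can be a small record builder. The field names, tags, and example content below are invented to show the shape, not a standard format.

```python
# Sketch: turn a reviewed, scrubbed production failure into a stable
# golden-dataset case. Record shape and tags are illustrative.
def promote_to_golden(failure: dict, reviewed_expected: str,
                      tags: list[str]) -> dict:
    return {
        "id": failure["id"],
        "input": failure["input"],      # already scrubbed of sensitive data
        "expected": reviewed_expected,  # decided by the domain reviewer
        "tags": tags,                   # e.g. task, failure type, segment
        "source": "production_failure",
    }

case = promote_to_golden(
    {"id": "prod-4812", "input": "Can I pause my annual plan?"},
    reviewed_expected="Explain that annual plans cannot be paused, only canceled.",
    tags=["billing", "policy-gap"],
)
```

Once a case like this lands in the regression suite, that production failure can never quietly return.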
A Tiny Example Eval
Here is a simplified example for a retrieval-backed answer system.
Imagine each eval case looks like this:
```json
{
  "id": "refund-policy-annual-plan",
  "question": "Can I get a refund if I cancel an annual plan after 30 days?",
  "required_sources": ["billing_policy_v4"],
  "must_include": ["30 days", "annual plan", "not eligible"],
  "must_not_include": ["monthly refund", "contact sales"]
}
```
A tiny eval runner might look something like this:
```python
case = load_eval_case("refund-policy-annual-plan")
result = app.answer(case["question"])

# Deterministic checks
assert case["required_sources"][0] in result.source_ids
assert all(term in result.text for term in case["must_include"])
assert not any(term in result.text for term in case["must_not_include"])

# Structural checks
assert result.latency_ms < 3000
assert word_count(result.text) < 120

# LLM judge, after the rubric is validated against human scores
judge_score = judge.evaluate(
    question=case["question"],
    answer=result.text,
    sources=result.sources,
    rubric=["groundedness", "correctness", "usefulness"],
)
assert judge_score["groundedness"] >= 4
```
This is intentionally layered. The first checks are deterministic: did retrieval include the required source, did the answer include expected concepts, and did it avoid known wrong claims? The next checks are structural: was the response fast enough and short enough for the product experience? The judge step handles a more subjective question: does the answer look grounded, correct, and useful according to the rubric?
You might still sample this output for human review, especially while you are calibrating the judge model or investigating failures.
That layered view is much more useful than asking, "Was the answer good?"
Regression Testing Matters More Than You Think
AI regressions are weird.
A tiny prompt edit can change outputs across hundreds of cases. A model upgrade can improve reasoning while breaking formatting. A retrieval change can reduce latency while lowering answer quality. A safety update can reduce risky outputs while blocking normal users.
Run evals when you change:
- the system prompt
- the model or model version
- temperature or decoding settings
- retrieval chunking
- embedding models
- reranking logic
- tool definitions
- output schemas
- safety policies
And when you review results, look dimension by dimension.
Do not ship just because the average score improved. Ask what got worse. If correctness rose but groundedness fell, that is not a simple win. If helpfulness rose but cost doubled, that may or may not be acceptable.
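That review can be sketched as a dimension-by-dimension diff between two eval runs. The scores and the regression threshold below are invented.

```python
# Sketch: flag per-dimension regressions instead of trusting the average.
def find_regressions(before: dict[str, float], after: dict[str, float],
                     min_drop: float = 0.02) -> list[str]:
    """Return dimensions that dropped by more than `min_drop`."""
    return [dim for dim in before
            if before[dim] - after.get(dim, 0.0) > min_drop]

before = {"correctness": 0.78, "groundedness": 0.81, "conciseness": 0.70}
after  = {"correctness": 0.86, "groundedness": 0.74, "conciseness": 0.71}

avg_improved = sum(after.values()) / 3 > sum(before.values()) / 3
regressions = find_regressions(before, after)
# the average went up, but groundedness regressed: not a simple win
```

Surfacing `regressions` next to the headline score is what turns "the number went up, ship it" into an actual decision.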
Evals are decision support, not decoration.
Evals Are Product Work
One of the most important mindset shifts is that evals are not only an ML concern.
They are product work.
Why? Because "good" depends on what the user is trying to do.
A legal research assistant needs a different quality bar than a social media caption generator. A coding agent that can edit files needs a different eval strategy than a chatbot that answers onboarding questions. A medical workflow should be much more conservative than a creative writing toy.
This is why engineers, product managers, domain experts, and designers all belong in the eval conversation.
You are not only measuring model intelligence. You are measuring whether the system helps a human accomplish something safely and reliably.
What About Eval Tools?
This article is focused on the evaluation concepts, not a specific tool stack.
In practice, teams often use tools like Braintrust, LangSmith, RAGAS, custom notebooks, CI scripts, spreadsheets, or internal dashboards. Those tools can make evals easier to run, store, compare, and review. But they do not remove the core design work.
You still need to decide what quality means for your product. You still need representative examples. You still need clear rubrics. You still need repeatable runs and a way to compare results over time.
The tool should support the eval loop. It cannot define the loop for you.
A Practical Starting Plan
If you do not have evals today, do not start by building a giant internal platform.
Start small:
- Pick one important AI workflow.
- Write down the quality dimensions that matter.
- Create 30 real or realistic examples.
- Add deterministic checks wherever possible.
- Create a simple rubric for subjective dimensions.
- Have humans score a small sample.
- Add an LLM judge only after the rubric is clear.
- Run the eval before every meaningful prompt, model, or retrieval change.
- Track production signals like acceptance, regeneration, correction, and escalation.
- Add real failures back into the dataset.
That loop is enough to change how your team talks about AI quality.
Instead of:
The new prompt feels better.
You can say:
The new prompt improved groundedness from 72% to 84% on our golden dataset, but conciseness dropped and regeneration increased in the beta group. We should inspect long-answer cases before rolling it out.
That is a completely different level of engineering conversation.
Final Thought
The hard part of building AI products is not getting a model to say something impressive once.
The hard part is knowing whether it will keep being useful as your users, prompts, data, tools, and models change.
That is what evals give you: a way to see the system clearly enough to improve it.
Stop vibe-checking your AI app.
Measure it.