Tang Weigang

Posted on Jun 26

Before You Ship an Agent, Make DeepEval Test the Failure Path

#ai #llm #agents #testing

Most AI agent projects add evaluation too late. The usual order is: connect the model, wire the tools, add retrieval, make the demo work, then think about evals. That is convenient, but it means the team only knows that a few happy paths looked fine. It does not know which failures are stable, which ones are dangerous, and which ones will quietly return with the next prompt or model change.

DeepEval is useful when you treat it as a release gate, not as a dashboard you add after launch. The Doramagic DeepEval manual breaks the project into the practical pieces that matter for that gate: LLMTestCase, GEval, AnswerRelevancyMetric, TaskCompletionMetric, hallucination checks, deepeval test run, deepeval generate golden, trace-aware evaluation, framework integrations, and the difference between local evaluation and Confident AI cloud synchronization.

The point is not to say "use every metric." The point is to make agent failure testable before the agent touches real workflows.

Start with failure examples

The first question should not be "which metric should we use?" A better first question is: what does a bad answer look like in this product?

For an agent or RAG system, useful failure examples might be:

the retriever found the right context, but the answer ignored the key fact;
the agent completed the task with the wrong tool;
the final answer sounded confident, but the retrieval_context did not support it;
the tool path worked once, but the trace showed repeated retries or a wrong branch;
the answer looked correct but violated a permission rule, policy rule, or user constraint.

Once those examples are written down, metrics become meaningful. Without them, a threshold such as 0.7 is just a number.

Keep the first test case boring

A minimal DeepEval test case can be very small:

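from deepeval.test_case import LLMTestCase

# One case, four separate fields, so each can be judged on its own.
case = LLMTestCase(
    input="What is the refund policy?",  # what the user asked
    actual_output="Customers can get a free refund within 30 days.",  # what the system said
    expected_output="We offer a 30-day full refund at no extra costs.",  # the reference answer
    retrieval_context=[
        # what the retriever returned for this input
        "All customers are eligible for a 30 day full refund at no extra costs."
    ],
)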

The value is not in the amount of code. The value is in separating input, actual output, expected output, and retrieval context. That separation prevents a common RAG mistake: judging whether the answer reads well instead of checking whether the allowed context supports it.
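
To show why the separation matters, here is a deliberately crude sketch in plain Python. It is not a DeepEval metric and it makes no LLM call. The helper name unsupported_numbers is my own invention, and the only thing it checks is whether every number stated in the answer also appears somewhere in the retrieval context. A real groundedness check needs far more than this, but the shape is the same: the answer is judged against the allowed context, not against how well it reads.

```python
import re


def unsupported_numbers(actual_output: str, retrieval_context: list[str]) -> set[str]:
    """Return numbers stated in the answer that appear in no context passage.

    Hypothetical helper for illustration only; not part of DeepEval.
    """
    number = re.compile(r"\d+(?:\.\d+)?")
    stated = set(number.findall(actual_output))
    allowed: set[str] = set()
    for passage in retrieval_context:
        allowed.update(number.findall(passage))
    return stated - allowed


context = ["All customers are eligible for a 30 day full refund at no extra costs."]

# The answer from the test case above: its only number is backed by the context.
grounded = unsupported_numbers("Customers can get a free refund within 30 days.", context)

# A fluent answer with a number the context never mentions.
drifted = unsupported_numbers("Customers can get a free refund within 60 days.", context)

print(grounded)  # set()
print(drifted)   # {'60'}
```

Both answers read equally well. Only the comparison against retrieval_context tells them apart.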

Pick metrics by failure type

DeepEval gives you several metric families. I would not start by wiring all of them into a pipeline. Pick one or two that map to a known failure.

Use answer relevancy when the answer drifts away from the user's question.
Use task completion when the agent may finish the wrong job or skip a step.
Use GEval when the product has custom criteria that need to be spelled out.
Use retrieval-aware tests when the system depends on context, sources, or documents.

The practical rule: every metric should tell you where the fix belongs. Is it the prompt, the retriever, the tool router, the dataset, the threshold, or the product boundary? If a failed eval only tells you "the model was bad," the eval is still too vague.
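
One way to hold yourself to that rule is to write the mapping down before wiring anything. The sketch below is plain Python, not DeepEval code. The metric names come from the list above, but the MetricPlan class, the owner_for helper, and the specific owner assigned to each metric are my assumptions for illustration. Your product may route the same failure somewhere else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPlan:
    failure: str  # the failure example this metric is meant to catch
    metric: str   # the metric family chosen for it
    owner: str    # where the fix belongs when the metric fails (assumed)


# Hypothetical plan; the owners are illustrative, not prescribed by DeepEval.
PLAN = [
    MetricPlan("answer drifts from the question", "AnswerRelevancyMetric", "prompt"),
    MetricPlan("agent finishes the wrong job or skips a step", "TaskCompletionMetric", "tool router"),
    MetricPlan("answer breaks a product-specific rule", "GEval", "product boundary"),
    MetricPlan("answer is not supported by the context", "retrieval-aware check", "retriever"),
]


def owner_for(metric: str) -> str:
    """Look up where a failing metric should be routed."""
    for entry in PLAN:
        if entry.metric == metric:
            return entry.owner
    # A metric with no recorded owner is the "too vague" case from the rule.
    raise KeyError(f"no owner recorded for metric {metric!r}")


print(owner_for("TaskCompletionMetric"))  # tool router
```

If you cannot fill in the owner column for a metric, that is a sign the metric is not yet tied to a concrete failure.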

For agents, trace matters more than the final sentence

The Doramagic manual distinguishes end-to-end evaluation from trace-aware evaluation. For simple chatbots, final-answer checks already help. For agents, path quality matters more.

An agent can produce the right final sentence while taking an unsafe or unstable path. It may call a tool it should not use, retry the same step several times, continue after low-quality retrieval, or swallow a tool error and write a polished conclusion.

For agent evaluation, I would want the trace to answer four questions:

Which tool was selected?
Which context or source was used?
Where did the run retry, fail, or branch?
Which evidence supported the final answer?

Without that, you may only be evaluating writing quality.
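
Here is a minimal sketch of what checking those four questions could look like. It uses a made-up Step record in plain Python, not DeepEval's own trace format, and the tool names, the max_calls_per_tool limit, and the review_trace helper are all hypothetical. The point is only that each question becomes a condition you can assert on.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str                                         # which tool was selected
    ok: bool = True                                   # did the call succeed
    sources: list[str] = field(default_factory=list)  # context or evidence it returned


def review_trace(steps, allowed_tools, max_calls_per_tool=2):
    """Return a list of path problems; an empty list means the path looks clean."""
    problems = []
    calls = {}
    for index, step in enumerate(steps):
        calls[step.tool] = calls.get(step.tool, 0) + 1
        if step.tool not in allowed_tools:
            problems.append(f"step {index}: tool '{step.tool}' is not allowed")
        # A failure counts as recovered only if the same tool succeeds later.
        recovered = any(later.tool == step.tool and later.ok for later in steps[index + 1:])
        if not step.ok and not recovered:
            problems.append(f"step {index}: '{step.tool}' failed and was never recovered")
    for tool, count in calls.items():
        if count > max_calls_per_tool:
            problems.append(f"tool '{tool}' was called {count} times")
    if not any(step.sources for step in steps):
        problems.append("no step recorded a source for the final answer")
    return problems


# A run that ends with a plausible answer but took a questionable path.
trace = [
    Step("search_policy", ok=False),
    Step("search_policy", ok=False),
    Step("search_policy", sources=["refund-policy.md"]),
    Step("issue_refund"),
]
problems = review_trace(trace, allowed_tools={"search_policy"})
for problem in problems:
    print(problem)
```

For this trace the sketch reports the disallowed issue_refund call and the three calls to search_policy, even though the final answer could look perfectly fine.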

Generated goldens still need review

deepeval generate golden is valuable because it lowers the cost of starting an eval set. It can generate candidate goldens from documents, from contexts, from scratch, or from existing golden examples. But generated goldens are not the same thing as reviewed truth.

A safer path is:

generate 20 to 50 candidate cases;
remove duplicates, vague questions, and unsupported answers;
mark 5 to 10 cases as critical regression cases;
rerun those cases whenever the prompt, retriever, tool router, or model version changes.

That turns DeepEval into a regression habit instead of a one-time screenshot.
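
The mechanical part of that review can be sketched in a few lines. This is plain Python over dictionaries, not DeepEval's golden objects. The field names id, input, and context, the min_question_words cutoff, and the review_goldens helper are assumptions for illustration. The script only removes the obvious rejects. Deciding which cases are critical is still a human call, passed in here as critical_ids.

```python
def review_goldens(candidates, critical_ids, min_question_words=4):
    """Drop duplicate, vague, and unsupported candidates; flag critical ones."""
    seen = set()
    kept = []
    for golden in candidates:
        # Normalise case and whitespace so near-identical questions collide.
        question = " ".join(golden["input"].lower().split())
        if question in seen:
            continue  # duplicate
        if len(question.split()) < min_question_words:
            continue  # too short to be a clear question (crude proxy for "vague")
        if not golden.get("context"):
            continue  # nothing to support an answer
        seen.add(question)
        kept.append({**golden, "critical": golden["id"] in critical_ids})
    return kept


candidates = [
    {"id": "g1", "input": "What is the refund policy?", "context": ["30 day full refund."]},
    {"id": "g2", "input": "what is the  refund policy?", "context": ["30 day full refund."]},
    {"id": "g3", "input": "Refunds?", "context": ["30 day full refund."]},
    {"id": "g4", "input": "Can I get a refund after 45 days?", "context": []},
    {"id": "g5", "input": "Who approves refunds above the limit?", "context": ["Managers approve."]},
]
reviewed = review_goldens(candidates, critical_ids={"g1"})
print([(g["id"], g["critical"]) for g in reviewed])  # [('g1', True), ('g5', False)]
```

The critical subset is the part worth versioning next to the prompt and retriever config, since it is what gets rerun on every change.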

Local eval and cloud sync are different risk levels

The basic local path can be small:

pip install -U deepeval
deepeval test run test_chatbot.py

If you log in and sync reports, datasets, traces, or production monitoring to Confident AI, you should treat it as a separate data-boundary decision. Before doing that, answer:

will inputs, outputs, retrieval context, or traces be uploaded?
do any cases contain user data, internal documents, or secrets?
who can view the report?
can failure cases be redacted?
should CI be allowed to sync results automatically?

This is not a criticism of DeepEval. It is just the normal boundary work for any eval or observability system.
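
If the answer to "can failure cases be redacted?" is yes, the redaction step can sit in front of any upload. The sketch below is plain Python and knows nothing about DeepEval or Confident AI. The two patterns, the placeholder strings, and the redact_case helper are my assumptions, and real data will need a reviewed pattern list rather than these two regexes.

```python
import re

# Illustrative patterns only; a real deployment needs a reviewed, broader list.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SECRET = re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9_-]{8,}")


def redact(text: str) -> str:
    """Replace email addresses and key-like strings with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return SECRET.sub("[SECRET]", text)


def redact_case(case: dict) -> dict:
    """Redact every string field, and every string inside list fields."""
    redacted = {}
    for name, value in case.items():
        if isinstance(value, str):
            redacted[name] = redact(value)
        elif isinstance(value, list):
            redacted[name] = [redact(item) for item in value]
        else:
            redacted[name] = value
    return redacted


case = {
    "input": "Refund for ana@example.com please",
    "actual_output": "Done. Used key sk-abc12345XYZ to call billing.",
    "retrieval_context": ["Contact billing@example.com for refunds."],
}
safe = redact_case(case)
print(safe["input"])          # Refund for [EMAIL] please
print(safe["actual_output"])  # Done. Used key [SECRET] to call billing.
```

Running this on the failure cases before they leave the machine does not settle who can view the report, but it does shrink what there is to see.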

A useful host rule

If an AI coding host is going to help set up DeepEval, I would give it this rule first:

You may design DeepEval tests, but first state:
1. whether the target is RAG, an agent, a chatbot, or a single prompt;
2. the failure examples being tested;
3. which metric maps to which failure;
4. why the threshold is chosen;
5. whether trace-aware evaluation is needed;
6. whether cloud sync, API keys, or user data are involved.

Do not treat LLM-as-a-Judge scores as absolute truth.
Do not treat generated goldens as human-reviewed labels.
Do not claim DeepEval is installed or validated locally without a separate run log.

That rule is more useful than "add evals." It makes the evaluation plan reviewable.

A sane first day

For a first run, I would pick one real workflow and write ten cases:

three normal success cases;
three likely hallucination cases;
two missing-context cases;
one permission-boundary case;
one empty-result case.

Then I would add one metric and make the failure explanation useful before adding more metrics, trace collection, generated goldens, or CI gates.
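
The ten-case mix is easy to lose track of once cases start piling up, so a small check can keep it honest. This is a plain Python sketch. The kind labels and the missing_cases helper are hypothetical names that mirror the list above.

```python
from collections import Counter

# The first-day mix from the list above, keyed by hypothetical kind labels.
FIRST_DAY_MIX = {
    "success": 3,
    "hallucination": 3,
    "missing_context": 2,
    "permission_boundary": 1,
    "empty_result": 1,
}


def missing_cases(case_kinds):
    """Return how many cases of each kind are still needed to reach the mix."""
    have = Counter(case_kinds)
    return {
        kind: need - have[kind]
        for kind, need in FIRST_DAY_MIX.items()
        if have[kind] < need
    }


written_so_far = ["success"] * 3 + ["hallucination"] * 2
print(missing_cases(written_so_far))
# {'hallucination': 1, 'missing_context': 2, 'permission_boundary': 1, 'empty_result': 1}
```

An empty result means the ten cases are in place and it is time to pick the one metric.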

DeepEval's value is not that it makes AI systems look controlled. Its value is that it makes failures show up earlier, in sharper form, and in a way that is easier to reproduce.

Reference roles

Upstream project: confident-ai/deepeval, the source for code, installation, releases, and API facts, https://github.com/confident-ai/deepeval
Doramagic project page: an independent capability asset for AI hosts, https://doramagic.ai/en/projects/deepeval/
Doramagic manual: a practical reading path for test cases, metrics, tracing, generated goldens, pitfalls, and boundaries, https://doramagic.ai/en/projects/deepeval/manual/
