DEV Community

Mahima Thacker

Why AI Agents Need Both Tests and Traces

I’ve been learning more about evaluating AI agents recently, and one thing clicked for me:

For agents, checking the final answer is not enough.
You also need to evaluate the path the agent took.

Traditional software is usually easier to test because it is more deterministic.

If you give a function the same input, you usually expect the same output.

same input → same expected output

That makes unit tests and integration tests easier to write.
But LLM-based systems are different.

The same input can produce different outputs. The agent may use tools, memory, retrieval, prompts, and multiple reasoning steps before giving a final answer.

So the question is not only:
Did the agent answer correctly?

It is also:
How did the agent reach that answer?

A simple example: data analyzer agent

Imagine you are building a data analyzer agent.

The user asks:
“What was our revenue growth last month?”

The agent may need to:

  1. Understand the user’s request
  2. Choose the right tool
  3. Query the right database table
  4. Analyze the result
  5. Summarize the answer
  6. Remember previous context if needed

The final answer might look correct, but the path may still be wrong.

Maybe the agent queried the wrong table.
Maybe it relied on weak or irrelevant context.
Maybe it ignored a tool result.
Maybe it repeated the same step multiple times.
Maybe it got stuck in a loop before answering.

This is why agent evaluation is different.

Output evaluation is not enough

Most people start by evaluating the final output.
That is useful.

You can ask:

  • Is the answer correct?
  • Is the answer grounded?
  • Did the agent hallucinate?
  • Did it answer the user’s question?
  • Was the response useful?

But for agents, this only shows part of the picture.
An agent can produce a correct-looking answer after taking a bad path.

And in production, that bad path matters.
It can increase cost, latency, risk, and user confusion.

You need to evaluate the path too.

For an AI agent, the path includes things like:

  • which tool it selected
  • what input it passed to the tool
  • what the tool returned
  • whether it retried
  • whether it repeated itself
  • whether it made progress
  • whether it used the right context
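To make the list above concrete, here is a minimal sketch of how one step of the path could be recorded, plus a small check for repeated steps. The `TraceStep` class, its field names, and `repeated_steps` are my own illustrative assumptions, not part of any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TraceStep:
    """One recorded step of an agent run (hypothetical field names)."""
    tool: str                    # which tool the agent selected
    tool_input: dict[str, Any]   # what input it passed to the tool
    tool_output: Any             # what the tool returned
    is_retry: bool = False       # whether this step retried a failed call
    context_ids: list[str] = field(default_factory=list)  # context it used


def repeated_steps(steps: list[TraceStep]) -> list[int]:
    """Return the indexes of steps that repeat an earlier (tool, input) pair."""
    seen = set()
    repeats = []
    for i, step in enumerate(steps):
        # Sort the input items so key order does not hide a repeat.
        key = (step.tool, repr(sorted(step.tool_input.items())))
        if key in seen:
            repeats.append(i)
        seen.add(key)
    return repeats


steps = [
    TraceStep("run_sql_query", {"table": "revenue", "month": "May"}, [{"total": 100}]),
    TraceStep("run_sql_query", {"month": "May", "table": "revenue"}, [{"total": 100}]),
    TraceStep("summarize", {"rows": 1}, "ok"),
]
print(repeated_steps(steps))  # [1]
```

A repeated `(tool, input)` pair is only a rough signal that the agent is not making progress, but it is cheap to compute once each step is stored in a structured form.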

This is where traces become important.

A trace shows what actually happened inside the agent workflow.
Without traces, debugging agents becomes guesswork.

You see the final answer, but you don’t know what happened before that.

Error analysis for agents

In machine learning, error analysis means observing, isolating, and diagnosing mistakes made by a model.

For agentic workflows, error analysis applies to the whole system, not just the final LLM response.

For example:

  • The router selected the wrong tool
  • The retrieval step returned irrelevant context
  • The agent queried the wrong database table
  • The tool call failed, but the agent continued confidently
  • The agent repeated the same action without progress
  • The final answer was acceptable, but the workflow was inefficient
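One simple way to do this kind of error analysis is to label each reviewed run with the failures found in its trace, then count which failure shows up most often. The sketch below assumes a hand-labelled list of runs. The run IDs and label names are hypothetical and only mirror the examples above.

```python
from collections import Counter

# Hypothetical labels a reviewer assigned after reading each trace.
labelled_runs = [
    {"run_id": "run-1", "failures": ["wrong_tool"]},
    {"run_id": "run-2", "failures": ["wrong_table", "inefficient_path"]},
    {"run_id": "run-3", "failures": []},
    {"run_id": "run-4", "failures": ["wrong_table"]},
    {"run_id": "run-5", "failures": ["ignored_tool_error", "repeated_action"]},
]


def failure_counts(runs):
    """Count how often each failure label appears across all runs."""
    counts = Counter()
    for run in runs:
        counts.update(run["failures"])
    return counts


counts = failure_counts(labelled_runs)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

The most frequent label is usually the best place to start fixing, because it points at one component of the system rather than at one bad answer.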

These are system-level failures.

You cannot catch them by only reading the final answer.

Different parts need different evals

One eval cannot check everything.

A router decision, tool call, retrieval step, summary, and final answer may all need different checks.

1) Code-based evals

Code-based evals are useful when the expected behavior is clear.

For example:

  • Did the agent call the right tool?
  • Did it return valid JSON?
  • Did it stay within the expected number of steps?
  • Did it avoid unsafe operations?
  • Did the API response match the expected schema?

These are easier to automate.
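As a rough illustration, the checks listed above can be written as plain Python. This is a sketch under assumptions: the shape of `tool_calls`, the `required_keys` default, and the keyword list are invented for the example, and the unsafe-operation check is a crude substring match rather than real SQL parsing.

```python
import json


def check_run(tool_calls, raw_output, expected_tool, max_steps,
              required_keys=("answer",),
              unsafe_keywords=("DROP", "DELETE", "UPDATE")):
    """Return a dict of named pass/fail checks for one agent run."""
    try:
        parsed = json.loads(raw_output)
        valid_json = True
    except json.JSONDecodeError:
        parsed, valid_json = None, False

    # Crude check: flags any tool input containing a blocked keyword.
    used_unsafe = any(
        keyword in call.get("input", "").upper()
        for call in tool_calls
        for keyword in unsafe_keywords
    )

    return {
        "called_expected_tool": any(c["tool"] == expected_tool for c in tool_calls),
        "valid_json": valid_json,
        "within_step_budget": len(tool_calls) <= max_steps,
        "no_unsafe_operation": not used_unsafe,
        "matches_schema": isinstance(parsed, dict)
        and all(key in parsed for key in required_keys),
    }


tool_calls = [{"tool": "run_sql_query", "input": "SELECT SUM(amount) FROM revenue"}]
raw_output = '{"answer": "Revenue was 1200", "value": 1200}'

results = check_run(tool_calls, raw_output, expected_tool="run_sql_query", max_steps=5)
print(results)
```

Returning one named boolean per check, instead of a single pass/fail, makes it easier to see which part of the run broke.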

2) LLM-as-a-judge

LLM-as-a-judge is useful when quality is harder to check with code.

For example:

  • Is the summary useful?
  • Is the answer grounded in the source?
  • Did it answer the actual user question?
  • Is the response coherent?
  • Did it hallucinate?

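This is useful for subjective outputs, but it should still be used carefully: the judge is itself an LLM, so its verdicts can be inconsistent and are worth spot-checking against human labels.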
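Here is a minimal sketch of what a groundedness judge could look like. It does not call any real model API. The `call_llm` parameter is a placeholder for whatever client you use, and the prompt wording and the PASS/FAIL format are my own assumptions.

```python
from typing import Callable

JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Source: {source}
Answer: {answer}
Is the answer grounded in the source? Reply with exactly one word: PASS or FAIL."""


def judge_groundedness(question: str, source: str, answer: str,
                       call_llm: Callable[[str], str]) -> bool:
    """Ask a judge model for a PASS/FAIL verdict and parse it strictly."""
    prompt = JUDGE_PROMPT.format(question=question, source=source, answer=answer)
    reply = call_llm(prompt).strip().upper()
    if reply not in {"PASS", "FAIL"}:
        # Fail loudly instead of guessing what the judge meant.
        raise ValueError(f"Unexpected judge reply: {reply!r}")
    return reply == "PASS"


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to run the example."""
    return "PASS"


verdict = judge_groundedness(
    question="What was our revenue growth last month?",
    source="Revenue grew 12% month over month.",
    answer="Revenue grew 12% last month.",
    call_llm=fake_llm,
)
print(verdict)  # True
```

Forcing a one-word verdict keeps the judge output easy to parse, and raising on anything else makes a misbehaving judge visible instead of silently counting it as a fail.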

3) Human evaluation

Human evaluation still matters, especially for:

  • high-stakes workflows
  • domain-specific tasks
  • safety-sensitive outputs
  • tone and usefulness
  • ambiguous answers

Sometimes the best evaluator is still a person who understands the real user and context.

Tests tell us if something passed

Tests are important.
They help us check whether the agent behaves correctly on known examples.

For example:

Question: “Show me total revenue for May”
Expected tool: run_sql_query
Expected behavior: query revenue table
Expected output: grounded answer with a number

This gives us a way to catch regressions.

If we change the prompt, model, tool schema, or agent logic, we can run the same eval again and see what changed.
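The test case above could be expressed in code roughly like this. It is a sketch, not a full eval harness: `EvalCase`, `AgentRun`, and the `revenue` table name are assumptions made for the example, and the run is hard-coded instead of coming from a real agent.

```python
import re
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    expected_tool: str
    expected_table: str


@dataclass
class AgentRun:
    tool: str     # tool the agent actually called
    sql: str      # query the agent actually ran
    answer: str   # final answer shown to the user


def evaluate(case: EvalCase, run: AgentRun) -> dict[str, bool]:
    """Compare one recorded run against one known example."""
    table_pattern = rf"\bfrom\s+{re.escape(case.expected_table)}\b"
    return {
        "tool": run.tool == case.expected_tool,
        "table": re.search(table_pattern, run.sql, re.IGNORECASE) is not None,
        "has_number": re.search(r"\d", run.answer) is not None,
    }


case = EvalCase("Show me total revenue for May", "run_sql_query", "revenue")
run = AgentRun(
    tool="run_sql_query",
    sql="SELECT SUM(amount) FROM revenue WHERE month = 5",
    answer="Total revenue for May was 1200.",
)
print(evaluate(case, run))
```

Keeping cases as data means the same list can be re-run after every prompt, model, or tool change, and any check that flips from true to false is a regression to look into.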

Traces show us what happened

Traces are equally important.
They tell us the story of the run.

For example:

User query
→ Router decision
→ Tool selected
→ Tool input
→ Tool output
→ Agent reasoning step
→ Final answer

A trace helps us see where things went wrong.

Maybe the router failed.
Maybe the tool returned bad data.
Maybe the LLM ignored the tool result.
Maybe the agent looped.
Maybe the answer was fine, but the path was too expensive.

Tests and traces work better together.
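As a small illustration of reading a trace, the sketch below stores each stage of a run as a span and reports the first one that failed. The `Span` class and the example values are hypothetical. A real tracing tool would record these for you.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str                 # e.g. "router", "tool_call", "final_answer"
    ok: bool                  # whether this stage succeeded
    detail: str = ""          # short note about what happened
    duration_ms: float = 0.0  # how long the stage took


def first_failure(spans: list[Span]) -> Optional[Span]:
    """Return the earliest failing span, or None if every span succeeded."""
    for span in spans:
        if not span.ok:
            return span
    return None


def render(spans: list[Span]) -> str:
    """Format the run as a readable story, one stage per line."""
    lines = []
    for span in spans:
        status = "ok" if span.ok else "FAILED"
        line = f"{span.name}: {status} ({span.duration_ms:.0f} ms) {span.detail}"
        lines.append(line.rstrip())
    return "\n→ ".join(lines)


spans = [
    Span("router", True, "selected run_sql_query", 40),
    Span("tool_call", False, "query timed out", 5000),
    Span("final_answer", True, "answered without tool data", 900),
]

print(render(spans))
failure = first_failure(spans)
if failure is not None:
    print(f"First failure: {failure.name} ({failure.detail})")
```

In this made-up run the final answer step reports success, so a test on the output alone could pass, while the trace shows the tool call failed before it.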

The main lesson

AI agents need both:
Tests to check expected behavior.
Traces to understand actual behavior.

Tests tell us whether something passed or failed.
Traces show us what happened.

Together, they make agents easier to debug, improve, and trust.

That is the direction I’m exploring more through projects like LoopGuard and Supabase Agent Eval Kit.

I’m still learning, but this area feels important because agents are not just prompt-in, answer-out systems. They are workflows.

And workflows need visibility.
