Mahima Thacker

Posted on Jul 3

How to Choose the Right Eval for an AI Agent

#agents #ai #llm #testing

When I started learning about AI agent evaluation, I thought evals were mostly about checking the final answer.

But agents are not just final-answer machines.

They are systems made of smaller parts:

router
tools
skills
memory
retrieval
final response

Each part can fail differently.

So the better question is not only:
Did the agent answer correctly?

It is also:
Which part of the agent should I evaluate, and what type of eval makes sense there?

Not every eval needs an LLM judge

This was one important thing I learned.
It is easy to think that if we are evaluating LLM systems, we should use another LLM to judge everything.

But that is not always the best choice.
Some checks are simple and deterministic.
Some checks are subjective.
Some checks need human judgment.

A good eval setup uses the right method for the right problem.

Three common eval methods

There are three common ways to evaluate LLM systems and agents:
1) Code-based evals
2) LLM-as-a-judge evals
3) Human evaluation

Let’s break them down simply.

1. Code-based evals

Code-based evals are useful when the expected behavior is clear.
For example:

did the output return valid JSON?
did the answer contain a required keyword?
did it match a regex?
did the agent call the correct tool?
did it extract the correct parameter?
did the SQL stay read-only?
did generated code run successfully?

These are easier to automate because the rules are clear.
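A few of the checks above can be sketched in plain Python. This is a minimal illustration, not a complete eval harness: the function names and the sample output are made up for this example.

```python
import json
import re


def is_valid_json(output: str) -> bool:
    """Return True if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def contains_keyword(output: str, keyword: str) -> bool:
    """Case-insensitive check for a required keyword."""
    return keyword.lower() in output.lower()


def matches_pattern(output: str, pattern: str) -> bool:
    """Return True if the regex matches anywhere in the output."""
    return re.search(pattern, output) is not None


# Illustrative agent output (an assumption, not from a real system).
sample = '{"order_number": "1234", "status": "shipped"}'

results = {
    "valid_json": is_valid_json(sample),
    "has_status": contains_keyword(sample, "status"),
    "has_order_number": matches_pattern(sample, r'"order_number":\s*"\d+"'),
}
print(results)
```

Each check returns a plain boolean, so the results are easy to log and compare across runs.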

Example:

Expected:
tool = order_status_check
order_number = 1234

Actual:
tool = order_status_check
order_number = 1234

Result:
pass

This does not need an LLM judge.

Code can check it.
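Here is a small sketch of that comparison. The `ToolCall` class and `check_tool_call` function are hypothetical names for this example, and I am assuming the order number is stored as a string.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    params: dict = field(default_factory=dict)


def check_tool_call(expected: ToolCall, actual: ToolCall) -> dict:
    """Compare the tool name and the parameters separately."""
    tool_ok = expected.tool == actual.tool
    params_ok = expected.params == actual.params
    return {
        "tool_ok": tool_ok,
        "params_ok": params_ok,
        "passed": tool_ok and params_ok,
    }


expected = ToolCall("order_status_check", {"order_number": "1234"})
actual = ToolCall("order_status_check", {"order_number": "1234"})

result = check_tool_call(expected, actual)
print("pass" if result["passed"] else "fail")
```

Reporting the tool and the parameters separately helps later, because a wrong tool and a wrong parameter are different bugs.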

2. LLM-as-a-judge evals

LLM-as-a-judge is useful when the output quality is harder to check with simple rules.

For example:

is the summary useful?
is the answer grounded in the source?
did the response answer the user’s actual question?
is the tone appropriate?
did the agent hallucinate?
is the reasoning coherent?

These are qualitative checks.
You can ask another LLM to judge the output using a clear rubric.
But this method is not perfect: LLM judges can also make mistakes.

So it helps to use clear labels like:
correct / incorrect
grounded / not grounded
safe / unsafe
useful / not useful

Avoid vague scores like:
87% helpful
73% grounded

Discrete labels are usually easier to interpret and compare.
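One way to enforce discrete labels is to state them in the rubric and reject anything else the judge returns. The sketch below assumes a `call_judge` function that sends a prompt to some model and returns its text reply. That function, the rubric wording, and `fake_judge` are all placeholders for this example, not a real API.

```python
from typing import Callable

ALLOWED_LABELS = ("grounded", "not grounded")

RUBRIC = """You are grading whether an answer is grounded in a source.

Source:
{source}

Answer:
{answer}

Reply with exactly one label: grounded or not grounded."""


def build_judge_prompt(source: str, answer: str) -> str:
    """Fill the rubric template with the case being judged."""
    return RUBRIC.format(source=source, answer=answer)


def parse_label(raw: str) -> str:
    """Normalize the judge reply and reject anything outside the label set."""
    label = raw.strip().lower().rstrip(".")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"Unexpected judge label: {raw!r}")
    return label


def judge_groundedness(source: str, answer: str,
                       call_judge: Callable[[str], str]) -> str:
    return parse_label(call_judge(build_judge_prompt(source, answer)))


# Stand-in judge so the sketch runs offline; swap in a real model call.
def fake_judge(prompt: str) -> str:
    return "Grounded."


label = judge_groundedness(
    "The order shipped on Monday.",
    "Your order shipped on Monday.",
    fake_judge,
)
print(label)
```

Raising an error on an unexpected reply is a deliberate choice here: a judge that drifts into free-form scores gets noticed instead of silently counted.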

3. Human evaluation

Human evaluation is still important.

Especially when the task is:

high-stakes
subjective
domain-specific
safety-sensitive
tied to user trust

For example, in healthcare, legal, finance, education, or enterprise workflows, a human may need to review whether the result is actually acceptable.

Humans can also provide labels that later become part of your test dataset.

For example:
User feedback: thumbs up / thumbs down
Human label: correct / incorrect
Reviewer note: answer missed key context

This feedback can help improve future evals.
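As a rough sketch, human feedback can be stored as structured records and the failures reused as regression cases. The `ReviewedExample` fields and the sample rows below are assumptions for illustration, not real data.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ReviewedExample:
    user_message: str
    agent_answer: str
    user_feedback: str   # "thumbs up" or "thumbs down"
    human_label: str     # "correct" or "incorrect"
    reviewer_note: str = ""


def to_jsonl(examples) -> str:
    """Serialize reviewed examples, one JSON object per line."""
    return "\n".join(json.dumps(asdict(example)) for example in examples)


def regression_cases(examples) -> list:
    """Keep the examples a human marked incorrect, for re-testing later."""
    return [example for example in examples if example.human_label == "incorrect"]


# Illustrative rows only.
reviewed = [
    ReviewedExample(
        user_message="Can you check the status of order #1234?",
        agent_answer="Your order is on its way.",
        user_feedback="thumbs down",
        human_label="incorrect",
        reviewer_note="answer missed key context",
    ),
    ReviewedExample(
        user_message="Can you check the status of order #5678?",
        agent_answer="Order #5678 is pending.",
        user_feedback="thumbs up",
        human_label="correct",
    ),
]

print(to_jsonl(regression_cases(reviewed)))
```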

How to choose the right eval

A simple rule:

If the criterion is deterministic, use code-based evals.
If the criterion is qualitative, use LLM-as-a-judge.
If the criterion needs domain judgment, use human evaluation.

Here is a simple mental model:

Clear rule → code-based eval
Quality judgment → LLM-as-a-judge
Domain or safety judgment → human evaluation
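The same mental model can be written down as a small lookup, which is handy when planning evals for a whole agent. The `Criterion` enum and the example plan are hypothetical and only illustrate the mapping.

```python
from enum import Enum


class Criterion(Enum):
    CLEAR_RULE = "clear rule"
    QUALITY_JUDGMENT = "quality judgment"
    DOMAIN_OR_SAFETY = "domain or safety judgment"


EVAL_METHOD = {
    Criterion.CLEAR_RULE: "code-based eval",
    Criterion.QUALITY_JUDGMENT: "LLM-as-a-judge",
    Criterion.DOMAIN_OR_SAFETY: "human evaluation",
}

# Hypothetical eval plan for an order-status agent.
EVAL_PLAN = [
    ("router picked the correct tool", Criterion.CLEAR_RULE),
    ("summary is useful", Criterion.QUALITY_JUDGMENT),
    ("refund advice is acceptable to send", Criterion.DOMAIN_OR_SAFETY),
]

for check, criterion in EVAL_PLAN:
    print(f"{check}: {EVAL_METHOD[criterion]}")
```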

The goal is not to use the most advanced eval.
The goal is to use the eval that gives the clearest signal.

Evaluating a router

A router decides what the agent should do next.

It may choose:
which tool to call
which workflow to run
what parameters to extract
whether to answer directly
whether more context is needed

Router evals usually check two things:
Did the router choose the right function/tool?
Did the router extract the right parameters?

Example:
User:
“Can you check the status of order #1234?”

Correct router decision:
tool = order_status_check
order_number = 1234

Wrong router decision:
tool = shipping_status_check
shipping_tracking_id = 1234

The wrong decision may look small, but it changes the whole path.
If the router fails, the rest of the agent may fail too.

That is why router evals are important.
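A router eval can run a set of labeled messages through the router and report tool accuracy and parameter accuracy separately. In this sketch, `toy_router` is a deliberately naive stand-in (a keyword rule, not a real model), and the test cases beyond the order #1234 example are invented for illustration.

```python
import re


def toy_router(message: str) -> dict:
    """Hypothetical stand-in router: a keyword rule, not a real model."""
    match = re.search(r"#(\d+)", message)
    number = match.group(1) if match else None
    if "order" in message.lower():
        return {"tool": "order_status_check", "params": {"order_number": number}}
    return {"tool": "shipping_status_check",
            "params": {"shipping_tracking_id": number}}


CASES = [
    {"message": "Can you check the status of order #1234?",
     "tool": "order_status_check", "params": {"order_number": "1234"}},
    {"message": "Where is package #5678?",
     "tool": "shipping_status_check", "params": {"shipping_tracking_id": "5678"}},
    {"message": "Has my order shipped yet? Tracking #9012",
     "tool": "shipping_status_check", "params": {"shipping_tracking_id": "9012"}},
]


def evaluate_router(router, cases) -> dict:
    """Score tool choice and parameter extraction over labeled cases."""
    if not cases:
        raise ValueError("need at least one test case")
    tool_hits, param_hits, failures = 0, 0, []
    for case in cases:
        decision = router(case["message"])
        tool_ok = decision["tool"] == case["tool"]
        # Parameters only count as correct when the tool is also correct.
        params_ok = tool_ok and decision["params"] == case["params"]
        tool_hits += tool_ok
        param_hits += params_ok
        if not params_ok:
            failures.append({"message": case["message"], "got": decision})
    return {
        "tool_accuracy": tool_hits / len(cases),
        "param_accuracy": param_hits / len(cases),
        "failures": failures,
    }


report = evaluate_router(toy_router, CASES)
print(report)
```

The third case fails on purpose: the message mentions "order" but is really a tracking question, so the keyword rule picks the wrong tool. The failures list shows exactly which message was misrouted.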

Evaluating skills

A skill is a task the agent can perform.
For example:

database lookup
data analysis
data visualization
retrieval
summarization
code generation

Each skill may need its own eval.
For a database lookup skill, you may evaluate:

did it generate correct SQL?
did it query the right table?
did it avoid unsafe operations?
did it return the right data?

For a data analysis skill, you may evaluate:

was the calculation correct?
did it identify the right entity?
was the explanation clear?

For a visualization skill, you may evaluate:

did the generated code run?
did the chart match the user’s request?
did it use the right data?

Different skills need different checks.
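For the database lookup skill, here is one possible sketch using Python's built-in `sqlite3` with an in-memory fixture. The table, the rows, and the `is_read_only` heuristic are assumptions for this example. The heuristic is intentionally rough (it rejects anything that is not a single `SELECT` statement), so treat it as a first filter, not a security boundary.

```python
import sqlite3


def is_read_only(sql: str) -> bool:
    """Rough heuristic: a single statement that starts with SELECT."""
    statement = sql.strip().rstrip(";")
    return statement.lower().startswith("select") and ";" not in statement


def eval_sql_skill(conn, generated_sql: str, expected_rows: list) -> dict:
    """Check safety first, then whether the SQL runs, then the result."""
    if not is_read_only(generated_sql):
        return {"safe": False, "ran": False, "correct": False}
    try:
        rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return {"safe": True, "ran": False, "correct": False}
    return {"safe": True, "ran": True,
            "correct": sorted(rows) == sorted(expected_rows)}


# Illustrative fixture database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_number TEXT, status TEXT)")
conn.execute("INSERT INTO orders VALUES ('1234', 'shipped'), ('5678', 'pending')")

good = eval_sql_skill(
    conn, "SELECT status FROM orders WHERE order_number = '1234'", [("shipped",)]
)
unsafe = eval_sql_skill(conn, "DELETE FROM orders", [])
wrong_table = eval_sql_skill(conn, "SELECT status FROM shipments", [("shipped",)])

print(good, unsafe, wrong_table)
```

The three result fields line up with three of the questions above: did it avoid unsafe operations, did the SQL run, and did it return the right data.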

The mistake: evaluating the agent as one black box

If we only evaluate the final answer, we get weak feedback.

We may only know:
The agent failed.

But we do not know why.

A better eval setup tells us:
the router chose the wrong tool
the parameter extraction failed
the retrieval step returned weak context
the SQL was invalid
the summary hallucinated
the final answer was not useful

This makes improvement easier.

You can fix the specific part that failed.
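To make this concrete, here is a minimal sketch that runs one check per step over a recorded trace and lists the steps that failed. The trace fields, the check names, and the sample values are all hypothetical. A real agent framework would have its own trace format.

```python
# Hypothetical trace of one agent run.
trace = {
    "tool": "shipping_status_check",
    "params": {"shipping_tracking_id": "1234"},
    "retrieved_docs": [],
    "final_answer": "I could not find that shipment.",
}

# What the run should have done at the routing step.
expected = {"tool": "order_status_check", "params": {"order_number": "1234"}}

# One named check per step, each a function of the trace.
CHECKS = [
    ("router chose the right tool", lambda t: t["tool"] == expected["tool"]),
    ("parameters extracted correctly", lambda t: t["params"] == expected["params"]),
    ("retrieval returned context", lambda t: len(t["retrieved_docs"]) > 0),
    ("final answer is non-empty", lambda t: bool(t["final_answer"].strip())),
]


def diagnose(trace: dict, checks: list) -> list:
    """Return the names of the step checks that failed, in order."""
    return [name for name, check in checks if not check(trace)]


failed = diagnose(trace, CHECKS)
for name in failed:
    print("FAILED:", name)
```

Because the checks are ordered along the execution path, the first failure in the list is usually the best place to start debugging.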

Final thought

Agent evals become more useful when we break the agent into parts.

Evaluate the router.
Evaluate the tools.
Evaluate the skills.
Evaluate the final response.
Evaluate the path.

That is how evals move from vague scoring to practical debugging.

The goal is not just to say:
This agent is good or bad.

The goal is to understand:
What failed, where it failed, and how to improve it.

Top comments (1)

Raju Dandigam • Jul 3

I like that you break agent evals down by subsystem instead of treating the final response as the only thing worth measuring. Once routers, tools, skills, memory, and retrieval are all in play, a “bad answer” is too coarse to tell you what actually needs fixing. In practice the useful shift is from answer-level scoring to path-level diagnosis: which branch was chosen, which tool call failed, which retrieval step drifted, and whether the agent recovered well. That is also where traces from tools like agent-inspect become valuable, because they connect the eval result back to the execution path that caused it. Curious which subsystem you usually see teams under-instrument first: tool use, memory, or routing?