Mahima Thacker

Posted on Jun 26

Understanding AI Agents: Routers, Tools, Memory, and Why They Need Better Evals

#ai #evals #llm #agents

AI agents are software systems that can reason, choose tools, and take actions on behalf of a user.

They work by routing a request, using one or more tools or skills, carrying state or memory, and then producing a final answer or action.

That sounds simple, but once you start building agentic systems, you quickly realize something:

An agent is not one big magic box. It is a workflow. And every workflow has places where it can break.

A simple agent flow

A basic agent flow may look like this:
User query
→ Router
→ Tool selection
→ Tool call
→ Tool result
→ LLM reasoning
→ Final answer

The user only sees the final answer.

But the system did many things before reaching that answer.

This is why evaluating agents is harder than checking if the final response sounds good.
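
The flow above can be sketched in a few lines. This is a minimal illustration, not a real agent: the tools (`calculator`, `echo`), the keyword rule in `route`, and the string formatting that stands in for the LLM reasoning step are all assumptions made up for this example.

```python
from typing import Callable

# Hypothetical tools: plain functions from a query string to a result string.
def calculator(query: str) -> str:
    # Toy arithmetic: sums the integers found in the query.
    numbers = [int(tok) for tok in query.split() if tok.isdigit()]
    return str(sum(numbers))

def echo(query: str) -> str:
    return query

TOOLS: dict[str, Callable[[str], str]] = {"calculator": calculator, "echo": echo}

def route(query: str) -> str:
    # Stand-in router: a keyword rule instead of an LLM call.
    return "calculator" if "add" in query.lower() else "echo"

def run_agent(query: str) -> dict:
    steps = []
    tool_name = route(query)                 # Router + tool selection
    steps.append(("router", tool_name))
    result = TOOLS[tool_name](query)         # Tool call
    steps.append(("tool_result", result))
    answer = f"Result: {result}"             # Stand-in for LLM reasoning
    steps.append(("final_answer", answer))
    return {"answer": answer, "steps": steps}

print(run_agent("add 2 and 3"))
```

The caller only needs `answer`, but `steps` keeps the intermediate decisions around, which is the part an eval would want to inspect.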

The main parts of an agent

Most agents have a few common parts.

1. Router

The router decides what should happen next.

It may decide:
which tool to call
which workflow to run
whether the question needs retrieval
whether the agent should ask a follow-up question
whether it can answer directly

The router can be an LLM, a rules-based system, a classifier, or a mix of these.
If the router makes a wrong decision, everything after that can go wrong too.
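
As a rough sketch, a rules-based router covering the decisions listed above might look like this. The keywords, the `database` tool name, and the `RouteDecision` type are hypothetical; a real router could just as well be an LLM or a classifier.

```python
from dataclasses import dataclass

@dataclass
class RouteDecision:
    action: str       # "call_tool", "retrieve", "ask_followup", or "answer_directly"
    target: str = ""  # tool name, only used when action == "call_tool"

def route(query: str) -> RouteDecision:
    text = query.lower().strip()
    # Very short queries are treated as too vague to act on.
    if len(text.split()) < 2:
        return RouteDecision("ask_followup")
    # Assumed keyword rules; these are placeholders, not a recommended taxonomy.
    if "sales" in text or "revenue" in text:
        return RouteDecision("call_tool", "database")
    if "docs" in text or "policy" in text:
        return RouteDecision("retrieve")
    return RouteDecision("answer_directly")

print(route("Show me sales trends from last month"))
```

Because the decision is a small structured value, it is easy to log and to compare against an expected decision in a router eval.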

2. Tools or skills

A tool, or skill, is a block of logic that helps the agent complete a task.
It may include:

API calls
database queries
web search
calculations
retrieval
code execution
summarization
file processing

For example, a RAG skill might look like this:
embed query
→ search vector database
→ retrieve context
→ call LLM with retrieved context

A tool may work perfectly on its own, but the agent can still use it incorrectly.

That is why we need to evaluate not only the tool, but also how the agent chooses and uses the tool.
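
To make the RAG skill concrete, here is a toy version of the four steps. Everything in it is an assumption for illustration: the "embedding" is just word counts, the "vector database" is a Python list, and the LLM call is replaced by building the prompt that would be sent.

```python
import math
from collections import Counter

# Hypothetical document store standing in for a vector database.
DOCS = [
    "Refunds are processed within five business days.",
    "The router decides which tool the agent calls.",
]

def embed(text: str) -> Counter:
    # Toy embedding: word counts. A real system would use an embedding model.
    return Counter(text.lower().replace(".", "").replace("?", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    # The prompt an LLM would receive; the actual LLM call is left out.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How long do refunds take?"))
```

Each function here can be tested on its own, but an agent eval also has to check that the skill was invoked for the right queries and that the retrieved context actually made it into the answer.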

3. Memory and state

Memory helps the agent remember previous context.
State helps different parts of the agent share information during execution.

This can include:

chat history
retrieved context
configuration values
previous tool outputs
intermediate decisions
execution steps

Memory is powerful, but it can also create problems.
If the memory contains outdated, irrelevant, or confusing context, the agent may make poor decisions.
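
One way to picture this is a small state object that the steps of an agent share. The field names and the `max_age` cutoff below are hypothetical; dropping old items by turn count is only one simple way to keep outdated context out of a decision.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    text: str
    turn: int  # conversation turn at which the item was stored

@dataclass
class AgentState:
    turn: int = 0
    chat_history: list[str] = field(default_factory=list)
    tool_outputs: dict[str, str] = field(default_factory=dict)
    memory: list[MemoryItem] = field(default_factory=list)

    def remember(self, text: str) -> None:
        self.memory.append(MemoryItem(text, self.turn))

    def recent_memory(self, max_age: int = 3) -> list[str]:
        # Keep only items stored within the last max_age turns.
        return [m.text for m in self.memory if self.turn - m.turn <= max_age]

state = AgentState()
state.remember("user prefers weekly totals")
state.turn = 5
state.remember("last query used the orders table")
print(state.recent_memory())
```

A cutoff like this can also throw away something useful, which is exactly why memory deserves its own eval rather than being assumed helpful.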

Example: data analysis agent

Imagine a data analysis agent.
The user asks:
“Show me sales trends from last month.”

The agent may need to:

Understand the request
Choose the database tool
Generate the right query
Fetch the data
Analyze the result
Create a summary
Maybe generate a chart

The final answer may look fine. But the path may still be wrong.

The agent may have queried the wrong table.
It may have used incomplete data.
It may have ignored a tool result.
It may have repeated the same step.
It may have generated a confident answer from weak context.
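
A check on the generated query can catch the first two of these before the answer is ever written. The sketch below is a rough heuristic, not a SQL parser, and the table name `sales` and column `order_date` are made-up examples.

```python
import re

def tables_in_query(sql: str) -> set[str]:
    # Rough heuristic: collect identifiers that follow FROM or JOIN.
    names = re.findall(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", sql, flags=re.IGNORECASE)
    return {name.lower() for name in names}

def check_query(sql: str, expected_table: str, required_filter: str) -> list[str]:
    problems = []
    if expected_table.lower() not in tables_in_query(sql):
        problems.append(f"expected table '{expected_table}' not queried")
    # Plain substring check: does the query mention the column it should filter on?
    if required_filter.lower() not in sql.lower():
        problems.append(f"missing filter on '{required_filter}'")
    return problems

good = "SELECT day, SUM(amount) FROM sales WHERE order_date >= '2024-05-01' GROUP BY day"
bad = "SELECT * FROM sales_archive"
print(check_query(good, "sales", "order_date"))
print(check_query(bad, "sales", "order_date"))
```

Both queries could lead to a fluent summary, but only the first one answers the question the user asked.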

This is the problem with only checking final answers.

What can go wrong?

A few common failures:

wrong tool selected
tool called with wrong input
weak retrieval context
invalid query
repeated tool calls
loop without progress
hallucinated summary
poor final response
correct answer but inefficient path
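
Some of these failures are cheap to detect mechanically. For example, repeated tool calls and loops without progress often show up as the same tool being called with the same input several times. A minimal sketch, where the `(tool_name, tool_input)` shape and the `max_repeats` threshold are assumptions:

```python
from collections import Counter

def find_repeated_calls(
    tool_calls: list[tuple[str, str]], max_repeats: int = 2
) -> list[tuple[str, str]]:
    # Each call is (tool_name, tool_input). Identical calls that occur more
    # than max_repeats times suggest the agent is looping without progress.
    counts = Counter(tool_calls)
    return [call for call, n in counts.items() if n > max_repeats]

calls = [
    ("search", "sales last month"),
    ("search", "sales last month"),
    ("search", "sales last month"),
    ("database", "monthly_totals"),
]
print(find_repeated_calls(calls))
```

Failures like a hallucinated summary need a different kind of check, so this covers only one row of the list.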

This is why agent evals need to be more granular.

Evaluate the components

Instead of asking only:
“Is the final answer correct?”
We should also ask:

Router eval
Did the agent choose the right path?
Tool eval
Did it call the right tool with the right input?
Retrieval eval
Did it use relevant context?
Memory eval
Did memory help, or did it confuse the agent?
Path eval
Did the agent avoid loops and unnecessary repeated steps?
Response eval
Was the final answer useful, grounded, and clear?
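
A few of these questions can be turned into simple checks over a recorded run. The sketch below assumes a hypothetical trace shape (`route`, `steps`, `answer`) and a hand-written `expected` spec; retrieval and memory evals are left out because they usually need labeled context or a judge model.

```python
def evaluate_run(trace: dict, expected: dict) -> dict[str, bool]:
    steps = trace["steps"]
    tools_used = [s["tool"] for s in steps]
    calls = [(s["tool"], s["input"]) for s in steps]
    answer = trace["answer"].lower()
    return {
        # Router eval: did the agent choose the expected path?
        "router": trace["route"] == expected["route"],
        # Tool eval: was the expected tool called at all?
        "tool": expected["tool"] in tools_used,
        # Path eval: no identical repeated calls, and within the step budget.
        "path": len(calls) == len(set(calls)) and len(steps) <= expected["max_steps"],
        # Response eval: crude keyword check on the final answer.
        "response": all(term.lower() in answer for term in expected["answer_terms"]),
    }

# Made-up example run, for illustration only.
trace = {
    "route": "database",
    "steps": [{"tool": "database", "input": "SELECT ... FROM sales"}],
    "answer": "Sales increased over the last month.",
}
expected = {"route": "database", "tool": "database", "max_steps": 3, "answer_terms": ["sales"]}
print(evaluate_run(trace, expected))
```

Reporting one result per component makes it clear which part failed, instead of collapsing everything into a single pass or fail.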

Why traces matter

Traces show what actually happened during the agent run.

A good trace can show:

router decision
selected tool
tool input
tool output
step count
retries
repeated steps
final answer

Without traces, debugging agents is mostly guesswork.
You see the final answer, but not the path.
And for agents, the path matters.
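
A trace does not need heavy tooling to be useful. Here is a minimal sketch of a trace record that captures the fields listed above; the class and field names are hypothetical, and treating a repeated identical call as a retry is a simplifying assumption.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class Step:
    tool: str
    tool_input: str
    tool_output: str
    retry: bool = False

@dataclass
class Trace:
    query: str
    router_decision: str = ""
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""

    def record(self, tool: str, tool_input: str, tool_output: str) -> None:
        # Assumption: the same tool called again with the same input is a retry.
        retry = any(s.tool == tool and s.tool_input == tool_input for s in self.steps)
        self.steps.append(Step(tool, tool_input, tool_output, retry))

    def summary(self) -> dict:
        return {"step_count": len(self.steps), "retries": sum(s.retry for s in self.steps)}

    def to_json(self) -> str:
        # Serialize the whole run so it can be stored and inspected later.
        return json.dumps(asdict(self))

t = Trace(query="Show me sales trends from last month", router_decision="database")
t.record("database", "SELECT ... FROM sales", "timeout")
t.record("database", "SELECT ... FROM sales", "42 rows")
t.final_answer = "Summary of 42 rows."
print(t.summary())
```

Once runs are stored in a structure like this, the component checks and loop detection described earlier can be run over them after the fact.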

Final thought

AI agents are workflows.
Workflows need visibility.
If we want agents to become reliable, we need more than prompts.

We need:

traces
evals
error analysis
component-level checks
path-level debugging
better developer tooling

That is what I’m exploring more through projects like LoopGuard and Supabase Agent Eval Kit.

Still learning, but this area feels important.
