As you all know, AI agents are software systems that can reason, choose tools, and take actions on behalf of a user.
They work by routing a request, using one or more tools or skills, carrying state or memory, and then producing a final answer or action.
That sounds simple, but once you start building agentic systems, you quickly realize something:
An agent is not one big magic box. It is a workflow. And every workflow has places where it can break.
A simple agent flow
A basic agent flow may look like this:
User query
→ Router
→ Tool selection
→ Tool call
→ Tool result
→ LLM reasoning
→ Final answer
The user only sees the final answer.
But the system did many things before reaching that answer.
This is why evaluating agents is harder than checking if the final response sounds good.
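To make the flow above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two toy tools, the single hard-coded routing rule, and the string template standing in for the LLM reasoning step. The point is only that one run produces several intermediate steps, not just an answer.

```python
def add_numbers(query: str) -> str:
    """Toy tool: sum every integer token found in the query."""
    return str(sum(int(tok) for tok in query.split() if tok.isdigit()))


def search(query: str) -> str:
    """Toy tool: stands in for a real search call."""
    return f"no results for {query!r}"


# Hypothetical tool registry, keyed by tool name.
TOOLS = {"add_numbers": add_numbers, "search": search}


def run_agent(query: str) -> dict:
    steps = []

    # Router + tool selection: a single hard-coded rule in this sketch.
    has_number = any(tok.isdigit() for tok in query.split())
    tool_name = "add_numbers" if has_number else "search"
    steps.append(("tool_selection", tool_name))

    # Tool call and tool result.
    result = TOOLS[tool_name](query)
    steps.append(("tool_result", result))

    # Placeholder for the LLM reasoning step.
    answer = f"The result is {result}."
    steps.append(("final_answer", answer))

    return {"answer": answer, "steps": steps}
```

Calling `run_agent("add 2 and 2")` returns the answer along with three recorded steps. A user would normally see only the `answer` field.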
The main parts of an agent
Most agents have a few common parts.
1. Router
The router decides what should happen next.
It may decide:
- which tool to call
- which workflow to run
- whether the question needs retrieval
- whether the agent should ask a follow-up question
- whether it can answer directly
The router can be an LLM, a rules-based system, a classifier, or a mix of these.
If the router makes a wrong decision, everything after that can go wrong too.
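As a rough illustration, a rules-based router can be a small function that maps a query to one of a few decisions. The `Decision` names and the keyword rules below are made up for this sketch. A real router might be an LLM call or a trained classifier instead.

```python
from enum import Enum


class Decision(Enum):
    CALL_TOOL = "call_tool"
    RETRIEVE = "retrieve"
    ASK_FOLLOW_UP = "ask_follow_up"
    ANSWER_DIRECTLY = "answer_directly"


def route(query: str) -> Decision:
    """Toy rules-based router. The keyword lists are illustrative only."""
    text = query.lower().strip()
    if len(text.split()) < 2:
        # Too little to act on, so ask the user for more detail.
        return Decision.ASK_FOLLOW_UP
    if any(word in text for word in ("sales", "revenue", "orders")):
        return Decision.CALL_TOOL
    if any(word in text for word in ("docs", "policy", "manual")):
        return Decision.RETRIEVE
    return Decision.ANSWER_DIRECTLY
```

Because the router is a plain function here, it can be tested on its own with a list of queries and the decisions you expect for each.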
2. Tools or skills
A skill is a block of logic that helps the agent complete a task.
It may include:
- API calls
- database queries
- web search
- calculations
- retrieval
- code execution
- summarization
- file processing
For example, a RAG skill might look like this:
embed query
→ search vector database
→ retrieve context
→ call LLM with retrieved context
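Here is a toy version of those steps, assuming a bag-of-words "embedding" and an in-memory list in place of a vector database. Neither is what a production system would use. The final LLM call is left out, so the sketch stops at building the prompt.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by similarity to the query and keep the top_k."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the LLM."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}"
```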
A tool may work perfectly on its own, but the agent can still use it incorrectly.
That is why we need to evaluate not only the tool, but also how the agent chooses and uses the tool.
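One way to check tool usage, sketched under some assumptions: each tool call is recorded as a name plus an arguments dict, and you know which arguments each tool requires. The `TOOL_SCHEMAS` registry and the tool names in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


# Hypothetical registry: required argument names per tool.
TOOL_SCHEMAS = {
    "query_database": {"table", "start_date", "end_date"},
    "web_search": {"query"},
}


def check_tool_call(call: ToolCall, expected_tool: str) -> list[str]:
    """Return a list of problems with a tool call (empty list means OK)."""
    if call.name not in TOOL_SCHEMAS:
        return [f"unknown tool: {call.name}"]
    problems = []
    if call.name != expected_tool:
        problems.append(f"expected {expected_tool}, got {call.name}")
    missing = TOOL_SCHEMAS[call.name] - set(call.args)
    if missing:
        problems.append(f"missing args: {sorted(missing)}")
    return problems
```

This only checks that the right tool was picked and that required arguments are present. Whether the argument values are correct (the right table, the right date range) needs a separate check.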
3. Memory and state
Memory helps the agent remember previous context.
State helps different parts of the agent share information during execution.
This can include:
- chat history
- retrieved context
- configuration values
- previous tool outputs
- intermediate decisions
- execution steps
Memory is powerful, but it can also create problems.
If the memory contains outdated, irrelevant, or confusing context, the agent may make poor decisions.
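A simple way to hold this kind of state is a single object passed between steps. The field names below are assumptions for this sketch. The `recent_history` helper shows one basic guard against stale context: only pass the latest turns forward.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Shared state for one agent run. Field names are illustrative."""
    chat_history: list[str] = field(default_factory=list)
    retrieved_context: list[str] = field(default_factory=list)
    tool_outputs: dict[str, str] = field(default_factory=dict)
    decisions: list[str] = field(default_factory=list)

    def recent_history(self, max_turns: int = 5) -> list[str]:
        """Return only the latest turns so older context is less likely to leak in."""
        if max_turns <= 0:
            return []
        return self.chat_history[-max_turns:]
```

Truncating by turn count is crude. It drops old context whether or not it is still relevant, so treat it as a starting point rather than a memory strategy.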
Example: data analysis agent
Imagine a data analysis agent.
The user asks:
“Show me sales trends from last month.”
The agent may need to:
- Understand the request
- Choose the database tool
- Generate the right query
- Fetch the data
- Analyze the result
- Create a summary
- Maybe generate a chart
The final answer may look fine. But the path may still be wrong.
The agent may have queried the wrong table.
It may have used incomplete data.
It may have ignored a tool result.
It may have repeated the same step.
It may have generated a confident answer from weak context.
This is the problem with only checking final answers.
What can go wrong?
A few common failures:
- wrong tool selected
- tool called with wrong input
- weak retrieval context
- invalid query
- repeated tool calls
- loop without progress
- hallucinated summary
- poor final response
- correct answer but inefficient path
This is why agent evals need to be more granular.
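Some of these failures are easy to detect mechanically. As a small sketch, repeated tool calls can be flagged by counting identical (tool, input) pairs in a run. The step format and the `max_repeats` threshold are assumptions here.

```python
from collections import Counter


def find_repeated_calls(
    steps: list[tuple[str, str]], max_repeats: int = 2
) -> list[tuple[str, str]]:
    """Return (tool, input) pairs that appear more than max_repeats times."""
    counts = Counter(steps)
    return [step for step, n in counts.items() if n > max_repeats]
```

This catches exact repeats only. An agent that loops while slightly rewording its tool input each time would slip past it.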
Evaluate the components
Instead of asking only:
“Is the final answer correct?”
We should also ask:
Router eval
Did the agent choose the right path?
Tool eval
Did it call the right tool with the right input?
Retrieval eval
Did it use relevant context?
Memory eval
Did memory help, or did it confuse the agent?
Path eval
Did the agent avoid loops and unnecessary repeated steps?
Response eval
Was the final answer useful, grounded, and clear?
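Several of these checks can be scored from one recorded run. The sketch below covers the router, tool, path, and response evals with simple rules. The shape of `run` and `expected` is an assumption, and the response check is only a keyword match. Retrieval and memory evals are left out because they usually need labeled relevance data or a judge.

```python
def evaluate_run(run: dict, expected: dict) -> dict[str, bool]:
    """Score one recorded run per component. All field names are assumptions."""
    calls = run["tool_calls"]  # list of {"tool": ..., "input": ...}
    seen = [(c["tool"], str(c["input"])) for c in calls]
    answer = run["answer"].lower()
    return {
        # Router eval: did the agent choose the expected path?
        "router": run["route"] == expected["route"],
        # Tool eval: was the expected tool called with the expected input?
        "tool": any(
            c["tool"] == expected["tool"] and c["input"] == expected["input"]
            for c in calls
        ),
        # Path eval: no exact repeated calls, and within the step budget.
        "path": len(seen) == len(set(seen)) and len(calls) <= expected["max_steps"],
        # Response eval: crude keyword check, not a quality judgment.
        "response": all(term.lower() in answer for term in expected["must_mention"]),
    }
```

A run can pass the response check and still fail the path check, which is exactly the case that final-answer-only evaluation misses.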
Why traces matter
Traces show what actually happened during the agent run.
A good trace can show:
- router decision
- selected tool
- tool input
- tool output
- step count
- retries
- repeated steps
- final answer
Without traces, debugging agents is mostly guesswork.
You see the final answer, but not the path.
And for agents, the path matters.
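A trace does not need to be complicated to be useful. Here is a minimal sketch of an in-memory trace recorder, assuming each step has a kind and a free-form payload. The class names and step kinds are made up for illustration. Real tracing tools record much more, such as timing per step and parent/child relationships.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TraceStep:
    kind: str       # e.g. "router", "tool_call", "tool_result", "final_answer"
    payload: dict
    timestamp: float = field(default_factory=time.time)


@dataclass
class Trace:
    steps: list[TraceStep] = field(default_factory=list)

    def log(self, kind: str, **payload) -> None:
        """Append one step to the trace."""
        self.steps.append(TraceStep(kind, payload))

    def summary(self) -> dict:
        """Count steps, tool calls, and tool calls marked as retries."""
        tool_calls = [s for s in self.steps if s.kind == "tool_call"]
        return {
            "step_count": len(self.steps),
            "tool_calls": len(tool_calls),
            "retries": sum(1 for s in tool_calls if s.payload.get("retry")),
        }

    def to_json(self) -> str:
        """Serialize the steps so the trace can be stored or inspected later."""
        return json.dumps([asdict(s) for s in self.steps])
```

Usage is a matter of calling `trace.log("tool_call", tool="query_database", retry=False)` at each step of the agent run, then reading `trace.summary()` or saving `trace.to_json()` afterwards.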
Final thought
AI agents are workflows.
Workflows need visibility.
If we want agents to become reliable, we need more than prompts.
We need:
- traces
- evals
- error analysis
- component-level checks
- path-level debugging
- better developer tooling
That is what I’m exploring more through projects like LoopGuard and Supabase Agent Eval Kit.
Still learning, but this area feels important.

