James O'Connor

# Evaluating an AI agent is not evaluating an LLM call

I compared six tools for evaluating AI agents: LangSmith, Galileo, Arize Phoenix, Braintrust, Future AGI, and Langfuse. My thesis, up front so you can argue with it early: the mistake that wastes the most time is grading the agent's final answer as if it were a single LLM call. An agent has a trajectory: which tools it called, in what order, and how it recovered. A correctly reasoned final answer and a right-by-luck final answer look identical until you score the path. Here is the rundown as of June 2026.

## The final answer is not the unit of evaluation

An LLM-call eval grades one output. An agent eval has to grade a sequence: did the agent call the right tool, with the right arguments, in a sensible order, and did it recover when a call failed? Two runs can produce the same final answer, one by reasoning correctly and one by luck, and only trajectory-level scoring tells them apart. If your agent eval only looks at the final response, you are testing a chatbot, not an agent.
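To make the "same answer, different path" point concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: the `Step` and `Run` types, the tool names, and the exact-match comparison are assumptions for illustration, not any tool's API.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str  # name of the tool the agent called
    ok: bool   # whether the call succeeded


@dataclass
class Run:
    steps: list[Step]
    final_answer: str


def final_answer_correct(run: Run, expected_answer: str) -> bool:
    """LLM-call-style grading: looks only at the last output."""
    return run.final_answer.strip() == expected_answer


def tools_match(run: Run, expected_tools: list[str]) -> bool:
    """Trajectory-level grading: compares the sequence of tools called."""
    return [s.tool for s in run.steps] == expected_tools


# Hypothetical task: look up an order, then report its status.
expected_tools = ["lookup_order", "get_status"]

reasoned = Run([Step("lookup_order", True), Step("get_status", True)], "shipped")
lucky = Run([Step("search_web", True), Step("lookup_customer", False)], "shipped")

for name, run in [("reasoned", reasoned), ("lucky", lucky)]:
    print(name, final_answer_correct(run, "shipped"), tools_match(run, expected_tools))
```

Both runs pass the final-answer check; only the tool-sequence check separates them. Exact sequence matching is the crudest possible trajectory score and is used here only to show the gap.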

## The six, by how deep they score

LangSmith. The LangChain-native pick. Agent traces plus eval, automatic if you are on LangChain or LangGraph. Deep on traces, proprietary and coupled to that stack.

Galileo. The agent-focused eval pick. Built around agentic workflows with metrics aimed at tool use and task completion, managed.

Arize Phoenix. The open-source OTel pick. Span-level agent traces plus eval, self-hostable, good if you want trajectory visibility without a license.

Braintrust. The polished-SaaS pick. Strong eval and observability UI for agents, proprietary, no self-host.

Future AGI. The simulate-then-score pick. Their Simulation runs synthetic voice or text personas through your agent before prod, and agentic_eval scores the multi-turn trajectory, tool calls, stepwise reasoning, and the full conversation, not just the final output (github.com/future-agi, as of June 2026). The draw for me was running a synthetic-persona session through the agent like an integration test and then scoring the path it took, not only where it ended up. It is one option among several here, not the answer.

Langfuse. The open-source observability pick. Agent traces plus eval, self-hostable, framework-agnostic; the eval layer is lighter than the eval-specialist tools.

I am not crowning one. Pick LangSmith if you live in LangChain, Phoenix or Langfuse for self-hosted OTel traces, Galileo or Braintrust for managed agent metrics, and the simulate-then-score approach if you want to generate the sessions rather than just observe them.

## What I actually score on a trajectory

I score four things, in this order: tool-selection-correct (right tool for the step), tool-args-valid (arguments well-formed for that tool), recovery (did it handle a failed call gracefully), and only then final-answer-correct. The first three catch the agent-specific failures the final-answer score hides. The agent that reached a fine answer through three wrong tool calls is a latent incident, not a pass.
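The four checks above can be sketched as one scoring function. This is a rough illustration under stated assumptions, not how any of the six tools implement it: the `ToolCall` and `Trajectory` types, the `ARG_SCHEMAS` table, and the definition of recovery ("every failed call is followed by a later successful call") are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    args: dict
    ok: bool = True  # False means the call returned an error


@dataclass
class Trajectory:
    calls: list[ToolCall]
    final_answer: str


# Hypothetical schema: required argument names and types per tool.
ARG_SCHEMAS = {
    "lookup_order": {"order_id": str},
    "get_status": {"order_id": str},
}


def args_valid(call: ToolCall) -> bool:
    schema = ARG_SCHEMAS.get(call.tool)
    if schema is None:
        return False  # unknown tool: treat its arguments as invalid
    return set(call.args) == set(schema) and all(
        isinstance(call.args[k], t) for k, t in schema.items()
    )


def recovered(calls: list[ToolCall]) -> bool:
    """Assumed definition: every failed call is followed by a later success."""
    for i, call in enumerate(calls):
        if not call.ok and not any(c.ok for c in calls[i + 1:]):
            return False
    return True


def score(traj: Trajectory, expected_tools: list[str], expected_answer: str) -> dict:
    # Compare only successful calls so a retry is not counted as a wrong tool.
    used = [c.tool for c in traj.calls if c.ok]
    return {
        "tool_selection_correct": used == expected_tools,
        "tool_args_valid": all(args_valid(c) for c in traj.calls),
        "recovery": recovered(traj.calls),
        "final_answer_correct": traj.final_answer == expected_answer,
    }


expected = ["lookup_order", "get_status"]

# A run that hit an error, retried, and finished cleanly.
retried = Trajectory(
    [
        ToolCall("lookup_order", {"order_id": "A1"}, ok=False),
        ToolCall("lookup_order", {"order_id": "A1"}),
        ToolCall("get_status", {"order_id": "A1"}),
    ],
    "shipped",
)

# A run that reached the right answer through three wrong tool calls.
stumbled = Trajectory([ToolCall("search_web", {"q": "A1"})] * 3, "shipped")

print(score(retried, expected, "shipped"))
print(score(stumbled, expected, "shipped"))
```

The second run is the latent incident: `final_answer_correct` is true while `tool_selection_correct` and `tool_args_valid` are false. Reporting the four scores separately, rather than averaging them, keeps that visible.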

## Objections I'd accept / wouldn't

Accept: "single-turn metrics still matter." They do. They grade each response, and you want them. They just miss the cross-turn failures (state, tool ordering) that are the whole reason you built an agent rather than a chatbot, so they are necessary but not sufficient.
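A small sketch of a tool-ordering failure that per-turn grading cannot see. The `Turn` type, the tool names, the `must_precede` rule format, and the stand-in single-turn metric are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    tool: str      # tool called in this turn
    response: str  # what the agent said to the user


def single_turn_ok(turn: Turn) -> bool:
    """Stand-in for a per-response metric; here just 'non-empty reply'."""
    return bool(turn.response.strip())


def ordering_violations(turns, must_precede):
    """Return (before, after) pairs where `after` ran before `before` had run."""
    seen = set()
    violations = []
    for turn in turns:
        for before, after in must_precede:
            if turn.tool == after and before not in seen:
                violations.append((before, after))
        seen.add(turn.tool)
    return violations


# Hypothetical rule: the card may only be charged after the order is confirmed.
rules = [("confirm_order", "charge_card")]

session = [
    Turn("charge_card", "Your card has been charged."),
    Turn("confirm_order", "Your order is confirmed."),
]

print(all(single_turn_ok(t) for t in session))  # every turn passes on its own
print(ordering_violations(session, rules))      # the cross-turn check flags the order
```

Each response looks fine in isolation; the failure only exists across turns, which is why the single-turn score stays and the cross-turn check gets added on top of it.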

Wouldn't accept: "trajectory scoring is overkill, ship on final-answer accuracy." That is the position that produces the right-by-luck pass. The agent that stumbles to a correct answer through three wrong tool calls will fail differently next week, and your final-answer metric will not have warned you.

## Where I'd push back on this

Steelmanning against myself: trajectory scoring assumes I know what the right path looks like, and for open-ended agents there is often more than one valid path to a good answer. A lot of what I call "wrong trajectory" might be "a reasonable path I did not anticipate," and if I over-fit my eval to one golden path I will punish agents for being creative in ways that are actually fine. The concession: I do not have a clean way to score "took a reasonable path I did not anticipate" without hand-labeling every trajectory. What I hold onto is narrower than full-path matching: tool-args-valid and graceful-recovery are path-independent. They are correct or not regardless of which route the agent took, so I trust those two even when I cannot agree on the one true path. If you have a way to score path-reasonableness without hand-labeling everything, that is the comment I want.
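Here is a sketch of why those two checks survive when the golden path does not. It does not score path-reasonableness, which stays an open problem; it only shows an unanticipated route failing an exact-match check while passing the path-independent ones. The tool names, `REQUIRED_ARGS`, and `GOLDEN_PATH` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Call:
    tool: str
    args: dict
    ok: bool = True  # False means the call returned an error


# Hypothetical tools and their required argument names.
REQUIRED_ARGS = {
    "search_docs": {"query"},
    "read_page": {"url"},
    "ask_kb": {"question"},
}

# The one route the eval author happened to anticipate.
GOLDEN_PATH = ["search_docs", "read_page"]


def matches_golden_path(calls):
    return [c.tool for c in calls if c.ok] == GOLDEN_PATH


def args_valid(calls):
    # Path-independent: each call is judged on its own arguments.
    return all(
        c.tool in REQUIRED_ARGS and set(c.args) == REQUIRED_ARGS[c.tool]
        for c in calls
    )


def recovered(calls):
    # Path-independent: a success must follow the last failure, if there was one.
    last_failure = max((i for i, c in enumerate(calls) if not c.ok), default=-1)
    if last_failure < 0:
        return True
    return any(c.ok for c in calls[last_failure + 1:])


anticipated = [
    Call("search_docs", {"query": "refund policy"}),
    Call("read_page", {"url": "https://example.com/refunds"}),
]

# A different route: one failed knowledge-base call, then a successful retry.
unanticipated = [
    Call("ask_kb", {"question": "refund policy?"}, ok=False),
    Call("ask_kb", {"question": "What is the refund policy?"}),
]

for name, calls in [("anticipated", anticipated), ("unanticipated", unanticipated)]:
    print(name, matches_golden_path(calls), args_valid(calls), recovered(calls))
```

The unanticipated route is marked wrong by the golden-path check and clean by the other two, which is the over-fitting risk in miniature.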
