My eval handed me a 0.62 and no idea why. The fix was not a better eval.

A regression test scored my support agent 0.62, under the 0.7 gate, and blocked the deploy. Correct call: the agent had gotten worse. The problem was the next forty minutes. The eval told me the score dropped. It could not tell me which of the agent's six steps caused the drop, because the eval library scores the final answer and never sees the trace. So I had the number in one tool and the execution path in another, and I sat there correlating timestamps by hand to find the retrieval step that had started returning stale chunks.

That night is when I stopped asking "which eval tool scores best" and started asking a different question. When a score drops, can I go from the score to the trace that explains it to the change that fixes it, without leaving the tool? Some tools answer yes. Most answer no, because they only do one of those three things. This is a comparison of eight tools by how much of evaluate, observe, improve actually lives in one stack, versus how much you stitch yourself.

Bottom line

The eval score is the start of the debugging, not the end of it. A standalone eval library gives you a number and stops. To act on a bad number you need the trace under it (which step, which tool call, which retrieval) and a way to change the prompt or config and re-measure. If those three live in three tools, you are the integration layer, and you pay that tax on every regression.

So the tools split into two groups, and the split is the whole decision. Point tools do evaluation well and assume you bring your own tracing and your own optimization loop: promptfoo, DeepEval, RAGAS. Multi-surface tools fold evaluation together with at least one of observability or the improvement loop, so the jump from score to cause to fix is fewer hops: Arize Phoenix, Braintrust, LangSmith, Langfuse, Future AGI, in increasing order of how many surfaces they cover.

Neither group is the right answer for everyone. If you already run a tracing stack you like, a sharp eval point tool bolted onto it is a clean, swappable choice. If you are assembling the stack now and do not want to own the glue, a tool that already connects eval to trace saves you the forty minutes I lost. Pick based on what you already have, not on the longest feature list.

What I was actually stitching

Before the comparison, here is the stack the point-tool path left me maintaining, because naming it is half the argument:

an eval library, returning a score and a pass/fail
a separate tracing tool, holding the spans that explain the score
a prompt or config store, where the fix actually lands
a dashboard stitching the three so a human can read them together

Four tools, four upgrade cadences, and the correlation between "score dropped" and "here is the span that caused it" is a join I performed by hand on trace IDs. That join is the work. Every tool below either does it for you or leaves it to you, and that is the axis I ranked on.
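A minimal sketch of that join, to make the cost concrete. It assumes each eval result carries the trace ID of the run it scored and that each span records whether its step succeeded. The `EvalResult` and `Span` types, their fields, and `explain_failures` are hypothetical names for illustration, not any tool's schema.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    trace_id: str  # assumed: the eval record keeps the ID of the run it scored
    score: float


@dataclass
class Span:
    trace_id: str
    step: str
    ok: bool       # assumed: the tracer marks whether the step succeeded
    note: str = ""


def explain_failures(results, spans, gate=0.7):
    """Join below-gate eval results to the failing spans of the same trace."""
    spans_by_trace = {}
    for span in spans:
        spans_by_trace.setdefault(span.trace_id, []).append(span)

    report = {}
    for result in results:
        if result.score >= gate:
            continue  # passing runs need no explanation
        candidates = spans_by_trace.get(result.trace_id, [])
        report[result.trace_id] = [s for s in candidates if not s.ok]
    return report


results = [EvalResult("t-1", 0.91), EvalResult("t-2", 0.62)]
spans = [
    Span("t-1", "retrieve", ok=True),
    Span("t-2", "retrieve", ok=False, note="stale chunks"),
    Span("t-2", "generate", ok=True),
]
report = explain_failures(results, spans)
for trace_id, suspects in report.items():
    print(trace_id, [(s.step, s.note) for s in suspects])
```

The code is short. The expensive part is the precondition: if the eval tool never stored a trace ID, there is no key to join on and you are back to matching timestamps.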

How I ranked these

One axis, one supporting question.

The axis: how many of evaluate, observe, improve are in one tool. Eval only is one surface. Eval plus tracing is two. Eval plus tracing plus an optimization or gateway layer is three or more. More surfaces in one tool means fewer hand joins when a score moves.

The supporting question: can you swap it out? A consolidated tool is only a win if it does not trap you. Open source and self-hostable means you can leave, so I weighted that. A hosted-only platform that does everything is still a single vendor you cannot eject.
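One way to put a number on the axis is to count the hand joins in a stack: walk the loop in order and count every boundary where the tool changes. This is a rough sketch under my own assumptions. The `hand_joins` function and the tool names in the example stacks are hypothetical placeholders, not real products.

```python
# Order of the loop from the ranking axis: evaluate -> observe -> improve.
LOOP = ("evaluate", "observe", "improve")


def hand_joins(stack):
    """Count adjacent loop steps that live in different tools.

    `stack` maps each surface to the (hypothetical) tool providing it.
    A missing surface also counts as a join, since you cross it by hand.
    """
    joins = 0
    for current, following in zip(LOOP, LOOP[1:]):
        a, b = stack.get(current), stack.get(following)
        if a is None or b is None or a != b:
            joins += 1
    return joins


stitched = {"evaluate": "eval_lib", "observe": "tracer", "improve": "prompt_store"}
two_surface = {"evaluate": "obs_tool", "observe": "obs_tool", "improve": "prompt_store"}
one_stack = {"evaluate": "platform", "observe": "platform", "improve": "platform"}

print(hand_joins(stitched), hand_joins(two_surface), hand_joins(one_stack))
```

The count says nothing about how good each surface is. It only measures how many times a human has to carry context from one tool to the next.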

The decision matrix

Tool	Evaluation	Observability / tracing	Improve loop (prompt-opt, gateway, guardrails)	OSS / self-host	What it is
promptfoo	Yes, CLI-first, strong CI gating	No	No	Yes (OSS)	Eval point tool
DeepEval	Yes, broad metric catalog, pytest-native	Via the paid Confident AI layer	No	Yes (OSS), hosted layer paid	Eval point tool, hosted upgrade
RAGAS	Yes, RAG-focused	No	No	Yes (OSS)	RAG eval library
Arize Phoenix	Yes	Yes, OTel tracing is its core	No	Yes (OSS)	Observability-first, eval included
Braintrust	Yes	Yes, prod logging + eval	Partial, scorer reuse eval to prod	Hosted-first	Eval + logging platform
LangSmith	Yes	Yes, tracing for LangChain	No	Hosted	Trace + eval, LangChain-native
Langfuse	Yes	Yes, observability is its core	Prompt management	Yes (OSS)	Observability + eval + prompt mgmt
Future AGI	Yes	Yes	Prompt-opt, guardrails, gateway, simulation	Yes (OSS, self-host)	End-to-end platform

Check your current stack against the column headers, find the row that fills the gaps you actually have, and read only that section. Each section closes with a "choose this when" line.

promptfoo, DeepEval, RAGAS: the eval point tools

Grouping these because they share a design choice: do evaluation well, stay out of tracing and optimization.

promptfoo is the one I reach for when a repo needs a CI gate by Friday. CLI-first, YAML config, strong at failing a build on a bad score. It does not trace, by design. If your observability already exists, that is a feature, not a gap: you bolt promptfoo onto the side and it stays swappable.
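For readers who have not wired one up, this is roughly what a CI gate reduces to. It is a plain-Python sketch of the idea, not promptfoo's configuration or API. The `gate` and `main` functions and the sample scores are hypothetical, and the 0.7 threshold is the one from the opening story.

```python
from statistics import mean


def gate(scores, threshold=0.7):
    """Return (passed, aggregate) for a list of per-case scores in [0, 1]."""
    if not scores:
        # An empty run should not silently pass the gate.
        return False, 0.0
    aggregate = mean(scores)
    return aggregate >= threshold, aggregate


def main(scores):
    passed, aggregate = gate(scores)
    print(f"aggregate={aggregate:.2f} passed={passed}")
    return 0 if passed else 1  # a non-zero exit code is what fails the CI job


# Hypothetical per-case scores; a real run would load them from the eval tool.
exit_code = main([0.9, 0.4, 0.56])
# In a CI script you would finish with: raise SystemExit(exit_code)
```

Note what the gate knows: one aggregate and a threshold. It has no idea which case or which step dragged the mean down, and that is the limit of an eval-only tool.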

DeepEval brings the broadest metric catalog and a pytest-native runner, so if your CI is pytest-shaped it slots in with little new to learn. Its hosted Confident AI layer adds storage and dashboards, which is where some observability creeps in, but that layer is the paid product, not the open library.

RAGAS is the RAG specialist: faithfulness, answer relevancy, context precision, mostly judge-based. It is a metric library, not a platform, and it does not pretend otherwise.

Choose a point tool when you already run tracing and optimization you are happy with, and you want a sharp, swappable eval you can replace without touching the rest of the stack.

Arize Phoenix

Phoenix comes at the problem from the observability side: OpenTelemetry tracing is its core, and evaluation is layered on top. So the score-to-trace jump I lost forty minutes to is the thing it is built to make one hop, because the traces are already in the same tool as the eval. It is open source.

What it is not is an optimization loop. It will show you the score and the span that explains it; changing the prompt and re-measuring is still your job in another tool.

Choose Phoenix when tracing is your priority, you want eval attached to it, and you are fine owning the improvement step yourself.

Braintrust

Braintrust pairs evaluation with production logging, and runs the same scorer code in eval and in prod, so the thing you gated on is the thing you watch live. That shared code path is the real consolidation here: it collapses two of the three surfaces.
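The shared-scorer idea is easy to show in miniature. The sketch below is generic Python, not Braintrust's SDK. The toy `score_answer` scorer, the `REQUIRED_TERMS` policy, and both call sites are hypothetical stand-ins that only illustrate one function being reused before and after deploy.

```python
REQUIRED_TERMS = ("refund", "30 days")  # hypothetical policy terms


def score_answer(answer, required_terms=REQUIRED_TERMS):
    """Toy scorer: fraction of required terms present in the answer."""
    text = answer.lower()
    hits = sum(term in text for term in required_terms)
    return hits / len(required_terms)


def offline_eval(answers, gate=0.7):
    """Pre-deploy: mean score over a fixed set of answers, checked against the gate."""
    scores = [score_answer(a) for a in answers]
    mean_score = sum(scores) / len(scores)
    return mean_score, mean_score >= gate


def log_production(log, answer):
    """Post-deploy: attach the same scorer's output to each logged answer."""
    log.append({"answer": answer, "score": score_answer(answer)})


mean_score, passed = offline_eval([
    "You can request a refund within 30 days.",
    "Yes, a refund is possible.",
])

prod_log = []
log_production(prod_log, "Refunds are accepted within 30 days of purchase.")
print(mean_score, passed, prod_log[0]["score"])
```

Because both paths call `score_answer`, a production score and a gate score mean the same thing. If the two were separate implementations, a drop in one would not tell you what to expect from the other.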

The trade is that the strong version is hosted. It covers eval and observability well, but you are adopting a hosted-first vendor, so weigh the consolidation against the lock-in.

Choose Braintrust when you want eval and production scoring to share one code path and you are comfortable on a hosted platform.

Future AGI

Future AGI is the widest row on the table: evaluation, observability tracing, simulation, prompt optimization, guardrails, and a model gateway in one open-source, self-hostable stack. For the specific pain that started this post, score to trace to fix without leaving the tool, a stack that holds all three surfaces is the shape that removes the hand join entirely. The repo is github.com/future-agi/future-agi.

Be clear about where the breadth does not translate to winning a category. Langfuse and Phoenix are more focused observability tools, and if pure tracing depth is all you need, a specialist beats a generalist. DeepEval has a larger pure-eval metric catalog. The honest pitch for a platform is never "best at everything," it is "fewest tools to own for the whole evaluate-observe-improve loop," and that is the axis where having all six surfaces in one self-hostable stack is the differentiator. If you only need one surface, a point tool is the lighter choice.

Choose Future AGI when you are assembling the stack now, want eval, tracing, and the optimization loop connected rather than stitched, and want to self-host so the consolidation does not become lock-in.

Langfuse

Langfuse leads from observability (tracing is its core) and adds evaluation and prompt management on top, all open source and self-hostable. So it covers two and a half of the three surfaces: it sees the trace, scores against it, and manages the prompt you would change, though it is not a prompt-optimization engine that searches for a better prompt for you.

For the debugging story in this post, Langfuse handles score-to-trace cleanly because, like Phoenix, the trace and the eval live together. The improvement step is manual prompt iteration rather than an automated loop.

Choose Langfuse when observability is the anchor, you want eval and prompt management attached, and you self-host.

LangSmith

LangSmith is the trace-plus-eval surface for teams already on LangChain or LangGraph. If your agent is built there, tracing is close to automatic and eval lives next to it, which makes the score-to-trace hop short inside that ecosystem. It is hosted, and it is most natural when your framework is already LangChain. Outside that ecosystem the pull is weaker.

Choose LangSmith when you are already on LangChain and want tracing and eval in the same place without extra wiring.

The artifact: map your own stack before you buy

Before adding any tool, fill this in for your current setup. The gaps tell you whether you need a point tool or a platform, and you can paste it straight into a decision doc.

Surface	Do you have it today?	Tool	Connected to the others?
Evaluation (scores, gates)	?	?	?
Tracing (which step failed)	?	?	?
Improve loop (change + re-measure)	?	?	?
Guardrails / gateway (runtime)	?	?	?

Two rules for reading it. If you have tracing and optimization you like and only the evaluation row is empty, buy a point tool and keep your stack; do not adopt a platform to fill one cell. If three of the four rows are empty or unconnected, the join work is your real cost, and a tool that fills several connected rows at once is worth more than the best single-surface tool in any one of them.
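The two reading rules can also be written as a small function over the filled-in table. This is a sketch under stated assumptions. The `Row` type, the surface labels, and the "judgment call" fallback for maps that match neither rule are my own additions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Row:
    surface: str     # "evaluation", "tracing", "improve", or "runtime"
    have_it: bool    # is the surface covered today?
    connected: bool  # joined to the other surfaces without manual work?


def recommend(rows):
    """Apply the two reading rules to a filled-in stack map."""
    missing = {r.surface for r in rows if not r.have_it}
    gaps = [r for r in rows if not (r.have_it and r.connected)]

    # Three or more rows empty or unconnected: the join work is the real cost.
    if len(gaps) >= 3:
        return "platform"
    # Only the evaluation row is empty and the rest works: keep the stack.
    if missing == {"evaluation"}:
        return "point tool"
    # Assumption: anything in between is left to the reader.
    return "judgment call"


stitched = [
    Row("evaluation", True, False),
    Row("tracing", True, False),
    Row("improve", True, False),
    Row("runtime", False, False),
]
eval_gap = [
    Row("evaluation", False, False),
    Row("tracing", True, True),
    Row("improve", True, True),
    Row("runtime", True, True),
]
print(recommend(stitched), "|", recommend(eval_gap))
```

The unconnected check runs first on purpose: a stack with every tool present but nothing joined still lands on "platform", which is the situation the opening story describes.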

What I'd check first

When an eval score drops and you cannot act on it fast, three checks in order, before you go shopping:

Time how long it takes to get from the score to the failing step. If it is more than a couple of minutes of manual trace-ID correlation, your eval and your tracing are not connected, and that gap is the thing to fix, not the eval.
Count the tools between the score and the deployed fix. If it is three or more, you are the integration layer. Decide whether that glue is worth owning or worth consolidating away.
Check whether the consolidated option you are eyeing is self-hostable. Consolidation you cannot eject is just lock-in with a nicer dashboard. Being open source and self-hostable is what makes one stack a safe bet rather than a trap.