Picking LLM Eval Tooling in 2026: 4 Questions Before You Commit

Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Me: xgabriel.com | GitHub

You picked an eval vendor in a hurry. The demo looked clean. Six weeks
later your golden dataset lives in their cloud, your CI pipeline shells
out to their CLI, and a model snapshot rotation just dropped your judge
scores with no way to tell which traces caused it. Switching now means
re-exporting everything through an API that was never built for export.

That is the shape of a bad eval-tooling decision. It does not announce
itself on day one. It shows up the day you want to leave.

The market in 2026 is crowded. Langfuse, LangSmith, Arize Phoenix,
Braintrust, DeepEval, Helicone, plus whatever your tracing vendor
bolted an evals tab onto last quarter. Most comparison posts rank them
on features. Features are the wrong axis. The axis that matters is
whether the tool still fits your team in a year, after your traffic,
your compliance scope, and your model lineup have all changed.

Four questions screen for that. Ask them before you sign anything.

Question 1: Self-host or SaaS, and what does "self-host" actually mean?

This is the first fork because it gates everything downstream:
compliance, data residency, cost at scale, and how hard it is to leave.

The trap is that "self-hostable" is marketing-true for a lot of tools
and operations-false. Some open-core products gate the parts you need
behind an enterprise license: SSO, the eval runner, dataset versioning.
You get a self-hosted box that can read traces but cannot run the evals
you bought it for.

Pin it down with three checks before you trust the label:

1. Can the eval engine run fully offline, with no callback to a SaaS API?
2. Which features are gated behind an enterprise license?
3. Is storage your own database, or a managed store you can't query directly?
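
One way to probe the first check from the client side is to run the eval
suite with outbound sockets blocked and see what breaks. The sketch below
is an illustration under assumptions, not any vendor's tooling:
`block_outbound` and `ALLOWED_HOSTS` are hypothetical names, and the guard
only covers `socket.connect` calls made by Python code in the same
process. It says nothing about what a separately deployed server does;
for that, watch egress at the network layer.

```python
import socket
from contextlib import contextmanager

# Hypothetical allow-list: hosts the self-hosted stack is expected to use.
ALLOWED_HOSTS = {"127.0.0.1", "::1", "localhost"}


@contextmanager
def block_outbound(allowed=frozenset(ALLOWED_HOSTS)):
    """Raise if Python code in this process connects outside `allowed`."""
    original_connect = socket.socket.connect

    def guarded_connect(self, address):
        # AF_INET/AF_INET6 addresses are tuples whose first item is the
        # host; Unix socket paths are plain strings and treated as local.
        host = address[0] if isinstance(address, tuple) else None
        if host is not None and host not in allowed:
            raise RuntimeError(f"outbound connection attempted: {host}")
        return original_connect(self, address)

    socket.socket.connect = guarded_connect
    try:
        yield
    finally:
        # Always restore the real method, even if the eval run failed.
        socket.socket.connect = original_connect
```

Running the suite inside `with block_outbound():` turns a hidden SaaS
callback into a loud failure instead of a line item you discover later.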

As of early 2026, Langfuse and Arize Phoenix are genuinely
self-hostable and open source; you run the eval engine on your own
infrastructure. SaaS-first tools like Braintrust and LangSmith optimize
for the hosted path, and their self-host story is enterprise-tier where
it exists. License tiers move, so check current docs before you commit.

Neither posture is wrong. A three-person team shipping a side feature
does not want to operate a trace store. A bank with data-residency rules
does not get to use the hosted one. Decide which you are first.

Question 2: Who owns the dataset, and can you walk out with it?

Your eval dataset is the asset. The tool is replaceable; the curated
set of inputs, expected outputs, and human labels you built over months
is not. If that set is trapped in a vendor's proprietary format, you
are renting your own test suite.

Two things to verify before the dataset grows large enough to trap you:

1. Export. Can you pull the full dataset (inputs, expected outputs,
   labels, and metadata) as JSON or CSV, on demand, without a support
   ticket? Test the export on day one, while the set is small.
2. Round-trip. Can you re-import what you exported into a clean instance
   and get the same evals back? A lossy export is not an export.
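
The round-trip check can be automated with a small diff over the two
copies. This is a minimal sketch, assuming both the original and the
re-imported dataset are available as lists of dicts keyed by an `id`
field; `canonical` and `roundtrip_diff` are hypothetical helper names,
not part of any tool's API.

```python
import json


def canonical(row: dict) -> str:
    """Serialize a row with sorted keys so field order does not matter."""
    return json.dumps(row, sort_keys=True, ensure_ascii=False)


def roundtrip_diff(original: list[dict], reimported: list[dict]) -> dict[str, list[str]]:
    """Compare two datasets by `id` and report what the round trip lost or changed."""
    before = {row["id"]: canonical(row) for row in original}
    after = {row["id"]: canonical(row) for row in reimported}
    shared = before.keys() & after.keys()
    return {
        "missing": sorted(before.keys() - after.keys()),
        "unexpected": sorted(after.keys() - before.keys()),
        "changed": sorted(k for k in shared if before[k] != after[k]),
    }
```

Three empty lists mean the export was lossless for the fields you track.
Anything in `missing` or `changed` is worth finding while the dataset is
still small.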

Keep the dataset in your own repo as the source of truth and treat the
tool as a runner over it. A plain JSONL file in version control survives
any vendor decision:

{"id": "refund-001", "input": "cancel order 4471 and refund me", "expected_tool": "refund_order", "tags": ["agent", "refund"]}
{"id": "refund-002", "input": "what is your refund policy", "expected_tool": "none", "tags": ["agent", "policy"]}

A load step keeps the tool downstream of your git history, not the
other way around:

import json
from pathlib import Path


def load_dataset(path: str) -> list[dict]:
    rows = []
    # Read as UTF-8 explicitly so results do not depend on the platform default.
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:
            rows.append(json.loads(line))
    return rows


cases = load_dataset("evals/refund_agent.jsonl")
# push `cases` into whatever runner you picked;
# the file in git stays the source of truth.

When the file is the source of truth, switching tools is a new adapter,
not a migration project. Do this before you have 5,000 labeled rows, not
after.
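
What "a new adapter" might look like, as a sketch: the caller depends on
a small interface, and each tool gets its own implementation behind it.
`EvalRunner`, `ToolMatchRunner`, and `score_dataset` are hypothetical
names for illustration; a vendor-backed runner would implement the same
`run` method using that vendor's SDK.

```python
from typing import Callable, Protocol


class EvalRunner(Protocol):
    """Hypothetical interface: anything that scores cases from the JSONL file."""

    def run(self, cases: list[dict]) -> list[float]: ...


class ToolMatchRunner:
    """Local stand-in: 1.0 when the predicted tool matches `expected_tool`."""

    def __init__(self, predict: Callable[[str], str]):
        self.predict = predict

    def run(self, cases: list[dict]) -> list[float]:
        return [
            1.0 if self.predict(case["input"]) == case["expected_tool"] else 0.0
            for case in cases
        ]


def score_dataset(runner: EvalRunner, cases: list[dict]) -> list[float]:
    # The caller sees only the interface, so swapping tools means writing
    # one new class, not touching the dataset or the CI gate.
    return runner.run(cases)
```

The local runner also doubles as a smoke test: if the suite cannot run
against a stub, the dataset is more coupled to the vendor than it looks.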

Question 3: Does it gate CI, or just draw dashboards?

A lot of eval tooling is observability theater. It produces a beautiful
trend line that nobody looks at until after the bad prompt already
shipped. The tool earns its place when it can fail a build.

The test: can the tool run a defined dataset, apply pass thresholds, and
exit non-zero in a CI job? If the only output is a dashboard, it tells
you about regressions after they reach production. You want the gate
before the merge.

What this looks like in a pipeline, vendor-agnostic on purpose:

# .github/workflows/eval-gate.yml
name: eval-gate
on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run eval suite
        run: |
          python -m evals.run \
            --dataset evals/refund_agent.jsonl \
            --threshold 0.90 \
            --fail-under

The runner's job is to return the right exit code:

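import sys


def run_all() -> list[float]:
    """Placeholder: run every eval case and return one score per case."""
    raise NotImplementedError("wire this to the runner you picked")


def gate(scores: list[float], threshold: float) -> int:
    # An empty run is a failure, not a silent pass.
    if not scores:
        print("no eval cases ran")
        return 1
    # Note: the same threshold serves as the per-case pass mark and as
    # the minimum pass rate. Split it into two values if they should differ.
    passed = sum(s >= threshold for s in scores)
    rate = passed / len(scores)
    print(f"pass rate {rate:.2%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1


if __name__ == "__main__":
    sys.exit(gate(run_all(), 0.90))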

If a tool cannot drive that exit code, it is a monitoring product, not a
testing product. Both have a place. Know which one you are buying, and
do not let a dashboard stand in for a gate.

One caveat worth budgeting for: LLM-as-judge evals call a model, so a CI
gate that runs them costs tokens and adds minutes per build. Sample a
subset on every PR and run the full set nightly, or the gate becomes the
thing engineers route around.
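
One way to pick the PR subset is to hash each case id, so the same cases
run on every build and a score change reflects the code, not the sample.
A minimal sketch, assuming each case carries a stable `id` as in the
JSONL above; `in_sample`, `sample_cases`, and the `salt` parameter are
hypothetical names.

```python
import hashlib


def in_sample(case_id: str, fraction: float, salt: str = "") -> bool:
    """Deterministically decide whether a case belongs to the PR subset."""
    digest = hashlib.sha256(f"{salt}:{case_id}".encode("utf-8")).digest()
    # Map the first 8 bytes to a number in [0, 1) and compare to the fraction.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def sample_cases(cases: list[dict], fraction: float, salt: str = "") -> list[dict]:
    """Keep roughly `fraction` of the cases, chosen by id rather than by position."""
    return [case for case in cases if in_sample(case["id"], fraction, salt)]
```

Because membership depends only on the id, adding new cases does not
reshuffle the existing subset. Changing `salt` rotates which cases run
on PRs, if the subset starts to feel stale.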

Question 4: Can you jump from an eval score to the exact trace?

This is the question most teams forget, and it is the one that decides
whether the tool is useful at 02:00. A judge score dropped from 0.91 to
0.78 overnight. Now what? If the eval result and the production trace
live in two systems with no shared key, you are reconstructing the
incident by hand.

The linkage you need is a stable id that exists on both the trace span
and the eval record:

from opentelemetry import trace

tracer = trace.get_tracer("app.llm")


def handle_request(req, conv_id, eval_case_id=None):
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attribute("gen_ai.conversation.id", conv_id)
        if eval_case_id:
            # the join key between trace and eval record
            span.set_attribute("app.eval.case_id", eval_case_id)
        # ... model call, judge step, etc.

With that key, "judge score fell" turns into a query that returns the
exact traces behind the drop. Without it, you are eyeballing
dashboards and guessing at the deploy boundary.
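
As an illustration of that query, here is an in-memory version of the
join. It is a sketch under assumptions: the record shapes (`case_id` and
`score` on eval records, `trace_id` and `attributes` on spans) are
hypothetical stand-ins for whatever your tool exports, and in practice
the same join would run in your trace store's query language.

```python
from collections import defaultdict


def traces_behind_drop(
    eval_records: list[dict], spans: list[dict], threshold: float
) -> dict[str, list[str]]:
    """Map each failing eval case id to the trace ids tagged with that case id."""
    traces_by_case = defaultdict(list)
    for span in spans:
        case_id = span.get("attributes", {}).get("app.eval.case_id")
        if case_id is not None:
            traces_by_case[case_id].append(span["trace_id"])

    # A failing case with an empty list means the join key never reached
    # the trace, which is its own finding.
    return {
        record["case_id"]: traces_by_case.get(record["case_id"], [])
        for record in eval_records
        if record["score"] < threshold
    }
```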

Tools built on OpenTelemetry tracing have a head start here, because the
trace and the eval can share the same span context. Tools that bolted
evals onto a separate data model often cannot make that jump at all.
Ask the vendor to show you the click from a failing eval to the
production trace. If they demo two browser tabs instead of one link,
you have your answer.

The screen, in order

Run the four in sequence, because each one can disqualify a tool before
you spend time on the next.

1. Self-host or SaaS: your compliance scope decides this, not your
   preference. Verify the self-host claim covers the eval engine.
2. Dataset ownership: keep the golden set in your repo; the tool is a
   runner. Test export on day one.
3. CI gating: non-zero exit code or it is a dashboard. Sample on PR,
   full run nightly.
4. Trace-to-eval linkage: one shared id, one click from score to trace.
   Make the vendor demo the click.

Anything that passes all four will still be a fit a year from now, after
your traffic and your model lineup have both moved. Anything that fails
two of them will be the migration project you resent next spring.

If this was useful

The choice between Langfuse, Phoenix, Braintrust, LangSmith, and the
rest comes down to these trade-offs more than the feature matrix on the
landing page. The LLM Observability Pocket Guide walks the tooling
landscape tool by tool: what each one does well, where the self-host
story holds, and how to wire the trace-to-eval link so a score drop
points straight at the traces that caused it.