Gabriel Anhaia

Langfuse vs LangSmith vs Phoenix vs Braintrust: The Honest 2026 Comparison


Every vendor in this category publishes their own comparison matrix. Every matrix happens to conclude the same thing.

This one does not, because nobody is paying for it.

Four platforms, one workload, honest trade-offs. The workload is Hermes IDE, an AI-coding IDE pushing ~2M LLM spans per month as of this comparison's pin date (April 2026), with mixed OpenAI and Anthropic traffic in both agent and chat shapes.

The four

  • Langfuse — OSS (MIT), self-hostable, acquired by ClickHouse in January 2026. The de facto default for teams that want to self-host.
  • LangSmith — commercial, by the LangChain team. Deepest integration with the LangChain ecosystem.
  • Arize Phoenix — OSS (Elastic License), on OpenInference. Phoenix is the OSS surface; Arize AX is the hosted commercial tier.
  • Braintrust — commercial, CI/CD-first. Marketed as the "evals-in-CI" specialist. Most aggressive on deployment-blocking automation.

Tracing

All four ingest OpenTelemetry GenAI semantic conventions without a translation layer. That is the single most important fact in this post. If you instrument against the gen_ai.* namespace, you can move between the four with no application-code changes.

|  | Langfuse | LangSmith | Phoenix | Braintrust |
|---|---|---|---|---|
| OTel GenAI ingest | yes, OTLP HTTP | yes (2026 Q1) | yes, translates to OpenInference | yes, v1.37+ |
| Self-host | yes, all tiers | no (cloud only) | yes, Elastic License | no (cloud only) |
| SDK languages | Python, TS, OTLP any | Python, TS | Python, TS, OTLP any | Python, TS, OTLP any |
| Trace retention (free) | unlimited (self-host) | 14 days | unlimited (self-host) | 30 days |

The self-host column matters. Langfuse and Phoenix are the only two of the four you can run fully inside your VPC. For regulated workloads (healthcare, finance, EU-resident data), that is often decisive before any UX comparison matters.
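Instrumenting against the gen_ai.* namespace is concrete: each LLM call becomes a span whose attributes use standard keys. A minimal sketch — the attribute names follow the OTel GenAI semantic conventions as of this writing, but the helper function and values are hypothetical:

```python
def chat_span_attributes(model: str, input_tokens: int,
                         output_tokens: int, system: str = "openai") -> dict:
    """Build the attribute map for one chat-completion span.

    Keys follow the OTel GenAI semantic conventions (gen_ai.* namespace);
    any of the four backends in this post can ingest them as-is.
    """
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.system": system,              # provider: openai, anthropic, ...
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

attrs = chat_span_attributes("claude-sonnet-4", 1200, 300, system="anthropic")
```

Because every key lives under one namespace, the backend never needs application-specific mapping logic.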

Evals

This is where the platforms diverge.

Langfuse. LLM-as-a-judge and custom scorer primitives. Ships the runtime. The orchestration layer is DIY: regression suites, CI gates, drift alerts. Good if you want control. Tedious if you want batteries.
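What "DIY orchestration" looks like in practice: pull judge scores for a baseline run and a candidate run (from Langfuse's API or anywhere else), then fail the build on regression. A minimal sketch — the threshold and the hard-coded scores are illustrative assumptions, not a Langfuse API:

```python
from statistics import mean

def passes_gate(baseline: list[float], candidate: list[float],
                max_regression: float = 0.02) -> bool:
    """CI gate: block the merge if the candidate's mean eval score drops
    more than max_regression below the baseline's mean (scores in [0, 1])."""
    return mean(baseline) - mean(candidate) <= max_regression

# Hypothetical judge scores: main branch vs. this PR's eval run.
baseline = [0.91, 0.88, 0.93]
candidate = [0.90, 0.87, 0.92]
merge_allowed = passes_gate(baseline, candidate)  # drop of 0.01 is within 0.02
```

Wire the boolean to your CI exit code and you have a deployment gate; everything around it (datasets, scheduling, alerting) is the part Langfuse leaves to you.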

LangSmith. Strong on dataset management and human-review queues. Eval runs are structured like experiments with statistical-significance math baked in. Heaviest gravity toward LangChain applications; works fine outside LangChain but you feel the edges.

Phoenix. OpenInference eval format. The agent-evaluation surface is the deepest of the four — Phoenix captures multi-step agent traces in more detail than the others, with purpose-built views for tool-call graphs. Best for teams running agentic workloads.

Braintrust. The CI-first pitch is real. Scorers run in CI, statistical significance is computed automatically, merges block on quality regressions. If your culture is "every PR runs full evals," this is the shortest path. If your culture is "evals are a separate workflow," you are paying for machinery you will not use.
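The "statistical significance computed automatically" part is not magic; a permutation test on the difference of mean scores is one way to build it yourself. This is a sketch of the general technique, not Braintrust's actual method:

```python
import random
from statistics import mean

def permutation_p_value(baseline: list[float], candidate: list[float],
                        n_resamples: int = 5000, seed: int = 0) -> float:
    """Two-sided permutation test: how often does a random relabeling of
    the pooled scores produce a mean difference at least as extreme as
    the one observed? A small p-value means the score change is unlikely
    to be sampling noise."""
    rng = random.Random(seed)
    observed = abs(mean(candidate) - mean(baseline))
    pooled = baseline + candidate
    k = len(candidate)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if abs(mean(pooled[:k]) - mean(pooled[k:])) >= observed:
            hits += 1
    return hits / n_resamples
```

Gating on p-value rather than a raw threshold is what stops a noisy three-example eval set from blocking merges at random.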

Pricing (free tier, as of April 2026)

| Platform | Free tier |
|---|---|
| Langfuse | Unlimited on self-host. Cloud: 50k observations/month. |
| LangSmith | 5k traces/month. |
| Phoenix | Unlimited on self-host. Arize AX cloud: 25k spans/month, 1 user. |
| Braintrust | 1M trace spans + 10k eval runs/month, unlimited users. |

Braintrust's free tier is the most generous by a wide margin. Langfuse and Phoenix are the cheapest at scale if you self-host (infrastructure cost only).
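To make that concrete for the Hermes workload (~2M spans/month), here is the rough free-tier coverage math. Caveat: the metering units are not identical across vendors (LangSmith counts traces, Langfuse counts observations), so treat this as an order-of-magnitude sketch:

```python
WORKLOAD_SPANS_PER_MONTH = 2_000_000

# Free-tier caps from the table above; units are roughly span-shaped
# but differ per vendor (see caveat in the text).
free_caps = {
    "Langfuse cloud (observations)": 50_000,
    "LangSmith (traces)": 5_000,
    "Arize AX cloud (spans)": 25_000,
    "Braintrust (trace spans)": 1_000_000,
}

coverage = {name: cap / WORKLOAD_SPANS_PER_MONTH
            for name, cap in free_caps.items()}
# Braintrust's 1M spans covers half the workload; the other clouds cover
# ~2.5% or less. Self-hosting is the only "free" path at this scale.
```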

Where each wins

Self-host, vendor-neutral, OSS-first → Langfuse. If ClickHouse's backing reassures you (it should), Langfuse is the safest long-term bet for teams that want control over their data plane. The MIT license is a hard guarantee.

LangChain shops → LangSmith. If 80%+ of your application is LangChain, the integration gravity pays back the pricing. If 20% is LangChain, you probably don't want to be locked in.

Agent-heavy workloads → Phoenix. The multi-step agent tracing and OpenInference eval format are Phoenix's clearest technical edge. Teams running ReAct, LangGraph, or custom agent frameworks find it the most expressive.

Ship-every-day teams → Braintrust. Deployment-blocking evals are the differentiator. Every PR runs the full regression suite, Braintrust computes statistical significance, merges block on regression. This is the shortest path from "we should have evals" to "we have evals enforced in CI."

Where each loses

Langfuse — the self-assembly tax is real. You get primitives; you own the orchestration. The 2026 managed cloud under ClickHouse is the easiest exit from that tax, but not a free one. Acquisition gravity may still reshape roadmap priorities toward enterprise: the open-source project stays open, but post-acquisition there is always tension between velocity on OSS features and velocity on cloud features.

LangSmith — pricing escalates fast once you leave the free tier. 14-day retention on the free plan is short for regulated audit trails. The LangChain integration cuts both ways: if you leave LangChain, you feel it.

Phoenix — the Elastic License is not MIT. It does not let you run it as a hosted service you sell. Fine for 99% of users; trip-wire for the 1%. The Arize AX commercial tier is pricier than the OSS path suggests.

Braintrust — closed source. No self-host. The 30-day trace retention on the lower tiers can be short for quarterly drift investigations. The CI-first framing is an asset if you ship daily, a mismatch if you don't.

The vendor-neutral hedge

All four ingest the same OpenTelemetry GenAI semconv. The implication:

```python
# instrumentation.py — the only file that changes
# when you switch backends.
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor

OpenAIInstrumentor().instrument()
```

Your application code emits gen_ai.* spans. The Collector ships them to whichever backend you point it at. Moving from Langfuse to Phoenix, or Phoenix to Datadog LLM Observability, or Datadog to Braintrust, is an Ops-side config change. No application rewrite.
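In practice the swap is one setting: the standard `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable that the SDK's OTLP exporter (and the Collector) read. A sketch — the hostnames and paths below are placeholders for your own deployments, not real vendor URLs:

```python
import os

# Placeholder internal endpoints; consult each vendor's docs for the
# actual OTLP ingest URL and any auth headers it requires.
BACKENDS = {
    "langfuse": "https://langfuse.internal.example.com/api/public/otel",
    "phoenix": "https://phoenix.internal.example.com/v1/traces",
}

def point_traces_at(backend: str) -> str:
    """Switch tracing backends without touching application code."""
    endpoint = BACKENDS[backend]
    os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = endpoint
    return endpoint

point_traces_at("phoenix")
```

In a real deployment this lives in the Collector's exporter config or the service's environment, not in Python, but the principle is the same: the backend is a deploy-time value.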

The 2026 consolidation wave (ClickHouse–Langfuse, Cisco–Galileo) is exactly why this matters. The vendors you evaluate in Q4 2025 are not the vendors on your invoice in Q4 2026. Instrument to the spec. Let the backend be a detail you can change without the team feeling it.

The honest verdict

For a greenfield team in April 2026, the least-regret default is Langfuse self-hosted. MIT license, ClickHouse backing, full OTel GenAI support, no trace-retention ceiling. Add Braintrust later if your CI culture demands deployment-blocking evals and you want to stop building that workflow yourself.

For an existing team: instrument to OpenTelemetry GenAI semconv first, then pick the backend. The instrumentation is forever. The backend is a 2-year decision.

If this was useful

Chapters 12–14 of Observability for LLM Applications cover all four platforms with full self-host walkthroughs, screenshots, and migration playbooks between them. Chapter 15 covers the roll-your-own path (OTel Collector + ClickHouse + Grafana) for teams who want to own every layer.

Observability for LLM Applications — the book
