Gabriel Anhaia

Langfuse vs LangSmith vs Phoenix vs Arize: One App, Four Stacks


You've already had the conversation. Someone in the standup says "we need observability on the LLM side" and four browser tabs open. Langfuse. LangSmith. Phoenix. Arize. The first 30 minutes feel productive. By minute 45 the differences blur and you're back to comparing pricing pages.

In January 2026 ClickHouse acquired Langfuse, which nudged the market again. Self-host got more credible; the SaaS calculus shifted. Datadog shipped native OpenTelemetry GenAI semconv support in v1.37 in the same window, so your APM is in the conversation now.

This post is one experiment: same app, four backends. No anointed winner up front. The picks fall out of which trade-offs you can live with.

The app under test

Picture a boring RAG bot. FastAPI endpoint, pgvector retrieval over a 12k-document corpus, gpt-4o-mini for the answer. Two tools: lookup_doc and summarize_thread. Prompt, model, and temperature held constant across all four wires.

If you instrument the same path against each backend in turn and run a 24-hour staging window with a few thousand requests a day (call it ~6 spans per request after retrieval and tool wiring), here's what each tool actually surfaces: how the trace tree renders, what eval primitives ship in the box, how prompts get versioned, whether token cost is broken out per call, and what it costs to leave the lights on.
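For orientation, a stripped-down version of that endpoint might look like the sketch below. The route, schema, and stubbed retrieval step are illustrative, not lifted from the actual rig:

```python
from fastapi import FastAPI
from openai import OpenAI

app = FastAPI()
client = OpenAI()

def lookup_doc(query: str) -> list[str]:
    # Stub for the pgvector step: embed `query`, then something like
    # SELECT body FROM docs ORDER BY embedding <=> %(q)s LIMIT 4
    return ["(retrieved passages would go here)"]

@app.post("/answer")
def answer_endpoint(q: str) -> dict:
    context = "\n".join(lookup_doc(q))
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # prompt, model, temperature held constant across backends
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": q},
        ],
    )
    return {"answer": r.choices[0].message.content}
```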

Self-host versus SaaS

Langfuse is the only one of the four that meaningfully self-hosts at scale today. The core is MIT-licensed, with no usage limits and no license keys: you stand up Postgres, ClickHouse, Redis, S3, and the app servers (Langfuse self-host docs). On a small team that already runs a Kubernetes cluster, the operational cost is real and manageable. On a team that doesn't, you'll spend a sprint on infra.

Phoenix self-hosts trivially (`pip install arize-phoenix` and `phoenix serve`, or one Docker image). The SQLite-backed default is a development convenience; the Phoenix deployment docs recommend Postgres for production, so plan the swap before you ship. Phoenix is source-available under Elastic License 2.0; it is not OSI-open (Phoenix on GitHub).

Arize AX is SaaS-only. There's a free dev tier; the enterprise tier is a sales call (Arize pricing).

LangSmith is also SaaS-only for practical purposes. There is a self-hosted offering, but it's gated behind the Enterprise tier and the LangChain pricing page doesn't list a number — you talk to sales.

If "we cannot send prompt content to a third party" is a hard constraint, the shortlist becomes Langfuse self-host or Phoenix self-host. Everything else needs a procurement conversation.

Tracing model — span versus run

Span-vs-run is the trade-off the marketing pages skim past.

LangSmith's atomic unit is a run. A run has a parent run and child runs; it nests, but the data model is run-first. Their auto-instrumentation for LangChain and LangGraph is the most complete out-of-the-box capture of the four: if your stack is langchain + langgraph, you import their tracer and the tree comes for free. Outside that stack, you write @traceable decorators or push runs manually.

Langfuse, Phoenix, and Arize are all OpenTelemetry-first. A trace is a tree of spans; an LLM call is a span with gen_ai.* attributes per the OTel GenAI semconv spec. Phoenix uses OpenInference (Arize's open instrumentation library), which sits on top of OTel and adds AI-specific conventions. Langfuse accepts OTLP/HTTP and also has its own SDK that itself emits OTel under the hood.

Practical consequence: if you already run an OTel collector, three of the four plug into it. LangSmith doesn't: it has its own SDK and wire protocol.
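Concretely, "a span with gen_ai.* attributes" means something like this hand-rolled version: plain OTel SDK plus the OTLP/HTTP exporter, which reads the standard OTEL_EXPORTER_OTLP_* env vars, so retargeting Langfuse, Phoenix, or Arize is configuration rather than code. The token numbers are made up:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# OTLPSpanExporter() picks up OTEL_EXPORTER_OTLP_ENDPOINT / _HEADERS from the env
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-bot")

# Span name follows the semconv pattern "{operation} {model}"
with tracer.start_as_current_span("chat gpt-4o-mini") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
    span.set_attribute("gen_ai.usage.input_tokens", 812)   # made-up numbers
    span.set_attribute("gen_ai.usage.output_tokens", 64)
```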

Instrumentation, the actual code

Same OpenAI call, four wires.

Langfuse (OTel-based SDK, decorator):

```python
from langfuse.openai import openai
from langfuse import observe

@observe()
def answer(q: str) -> str:
    r = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": q}],
    )
    return r.choices[0].message.content
```

LangSmith:

```python
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable(run_type="llm")
def answer(q: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": q}],
    )
    return r.choices[0].message.content
```

Phoenix (OpenInference auto-instrumentation):

```python
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openai import OpenAI

tracer_provider = register(project_name="rag-bot")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

client = OpenAI()

def answer(q: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": q}],
    )
    return r.choices[0].message.content
```

Arize AX (same OpenInference, different exporter):

```python
from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openai import OpenAI

tracer_provider = register(
    space_id="...", api_key="...", project_name="rag-bot",
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

client = OpenAI()
# answer() identical to Phoenix
```

Phoenix and Arize share the instrumentation library. That's not an accident. Arize built OpenInference (per the Arize-ai/openinference repo); Arize the SaaS reads it, and Phoenix the OSS reads it too. Switching between the two backends is a config change.

Eval frameworks

LangSmith ships datasets, experiments, and a hosted evaluator runner. You define a dataset, define evaluators (functions that return scores), and call evaluate(target, data, evaluators). The whole thing runs in their cloud against your prompts. If your team is already on LangChain, you write less wrapper code than with the other three.
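A minimal sketch of that loop, assuming a recent langsmith SDK and an existing dataset called rag-bot-goldens; the dataset name, key names, and evaluator are placeholders:

```python
from langsmith import evaluate

def non_empty(outputs: dict, reference_outputs: dict) -> bool:
    # toy evaluator: any callable returning a bool, number, or dict of scores
    return bool(outputs["answer"].strip())

def target(inputs: dict) -> dict:
    return {"answer": answer(inputs["question"])}  # answer() from the snippet above

evaluate(target, data="rag-bot-goldens", evaluators=[non_empty])
```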

Langfuse ships datasets, experiments, and LLM-as-a-judge templates (hallucination, toxicity, relevance). You can run evals from a CI job or trigger them inside the Langfuse UI. The user model is unlimited seats — you can give the whole company access to the eval UI without bolting on per-seat fees.

Phoenix ships phoenix.evals — a Python package with judge templates, dataset evaluation, and a UI-driven experiment runner. The eval framework is open and you run it locally against your traces. Arize AX is the same primitives plus monitors, drift detection, and per-tenant slicing.
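For concreteness, here's roughly what the hallucination judge looks like, assuming a recent phoenix.evals release; the data row is invented:

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# One invented row; the template expects input/reference/output columns.
df = pd.DataFrame([{
    "input": "What is the SLA?",
    "reference": "The SLA is 99.9% uptime, measured monthly.",
    "output": "The SLA is 99.9% uptime.",
}])

results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o-mini"),
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),  # ["hallucinated", "factual"]
)
print(results["label"])
```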

If you're picking on eval primitives alone, the four are closer than the marketing implies. What actually differs: where the eval runs (your infra vs theirs), and whether the eval can write back into the trace as a child span.

Prompt-version management

LangSmith has the most polished prompt registry — a hub-style UI where you push, tag, and roll back prompts, plus pull-via-SDK at runtime. Tightly coupled to their dataset/experiment loop.

Langfuse has a comparable prompt registry with versioning, labels, and runtime fetch via the SDK. The diff view is decent. Variables are templated.
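Runtime fetch is a couple of lines with the Langfuse SDK; the prompt name, label, and variable here are illustrative:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the env
prompt = langfuse.get_prompt("rag-answer", label="production")
compiled = prompt.compile(question="What is the SLA?")
```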

Phoenix and Arize treat prompts as a first-class object too. Compared to the LangSmith prompt hub, the Phoenix prompt management and Arize prompt hub flows lean more on "edit and re-deploy" than "click and revert" today.

If prompt iteration is your team's daily activity — product writers tweaking prompts, then comparing eval scores — LangSmith and Langfuse are ahead. Phoenix is fine; Arize is fine; neither will be the reason you switch.

Cost tracking

All four read the usage block from OpenAI/Anthropic responses (input_tokens/output_tokens on Anthropic and the OpenAI Responses API, prompt_tokens/completion_tokens on Chat Completions) and apply a model-price table to compute USD. None of them has a magic source of truth; they all rely on a price catalog they update.

Langfuse exposes per-trace cost, per-user cost (via a user_id attribute), and per-tag cost. Custom price tables are a config edit.

LangSmith does the same. Their per-run cost surface is good; per-tenant cost is doable via metadata fields.

Phoenix shows cost on each span and aggregates at the project level (Phoenix cost tracking). Splitting cost by tenant inside the OSS UI is less native; you query the underlying database for that. Arize AX does it natively (part of what you're paying for).

If "alert me when one tenant's monthly LLM cost crosses $X" is a real requirement, Arize AX and Langfuse have the cleanest path. LangSmith works if you commit to setting metadata={"tenant_id": ...} on every run.

Pricing — the part that makes the call

Numbers below are illustrative for a 7-person team at ~250k user-traces/month, drawn from public pricing pages as of April 2026:

| Backend | Sticker price | Notes |
| --- | --- | --- |
| Langfuse Hobby | $0 | 50k units, 30-day retention, 2 users |
| Langfuse Core | $29/mo | 100k units, 90-day retention, unlimited users |
| Langfuse self-host | Infra only | MIT-licensed, no per-seat fees |
| LangSmith Plus | $39/seat + traces | 7 seats = $273; 250k traces ≈ +$625 at $2.50/1k overage |
| Phoenix self-host | Infra only | Elastic License 2.0 |
| Arize AX | Sales call | Free dev tier; production is custom |

Sources (as of April 2026): Langfuse pricing, LangSmith pricing, Arize pricing, and a third-party Pydantic pricing comparison. Your spend will look different. These are list rates; your trace shape is yours; do the math on your own volume before quoting it in a doc.
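If you want to sanity-check the LangSmith row, the arithmetic is a one-liner; list rates come from the table above, and any included trace allotment is ignored, so treat it as an upper bound:

```python
def langsmith_plus_monthly(seats: int, traces: int) -> float:
    # $39/seat plus $2.50 per 1k trace overage, per the table;
    # ignores whatever trace allotment the plan includes
    return seats * 39 + traces / 1_000 * 2.50

print(langsmith_plus_monthly(seats=7, traces=250_000))  # 273 + 625 = 898.0
```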

The headline: per-seat pricing scales with headcount whether or not those engineers actually log in. Per-trace pricing scales with traffic. Self-host shifts the cost onto your infra team's hour rate. Pick the curve you'd rather argue about in a budget review.

A decision matrix that survives the day

If you run the same app through all four, here are the picks that hold up:

  • You are deep in LangChain and LangGraph and your team is small. LangSmith. The auto-instrumentation alone saves the first sprint, and the prompt registry pays for itself if your team iterates daily.
  • You will not send prompt content off-prem, or you have many users. Langfuse self-host. The unlimited-users model and ClickHouse-backed query layer scale further than Phoenix's defaults.
  • Your team already runs OTel and you want a single eval/trace UI you can pip-install. Phoenix. Lowest friction to get a tracer running. Trade-off is a younger prompt registry.
  • You have an enterprise data/ML org with PCI/SOC and existing Arize ML monitoring. Arize AX. The continuity with classic ML drift monitors is real, and the per-tenant slicing is built in.
  • You're 80% of the market: a 5-15 person engineering team building production LLM features. Langfuse Core or Langfuse self-host. The cost curve and the OTel-first model give you the most exits if the answer changes in six months.

The other thing this exercise teaches: most of the differences disappear once you've instrumented. Trace trees, eval primitives, and dashboards all converge in the UI. What stays different is operational: how you run it, what it costs, who can log in, and whether your data leaves the building. Pick on those.

If this was useful

The LLM Observability Pocket Guide walks the same evaluation methodology in detail — what to put on each span, how to wire LLM-as-judge as a child span without breaking your trace, the eval rigs that catch silent regressions before users do. If this comparison saved you an afternoon, the book is the next two weeks.

