The LLM observability category is fragmented
Search for "LLM observability" today and you'll get results from eight tools that do subtly different things. One is a tracing SDK you wire into your app. Another is a reverse proxy that logs every request. A third is an evals platform that happens to include tracing. A fourth is an enterprise ML monitoring product that added LLM support last year.
They all claim the same keywords — tracing, observability, logging, cost tracking — but their architectures, data models, and strengths diverge significantly. Picking the wrong one costs you weeks of integration work and, worse, leaves blind spots in production.
This post is the map we wish we'd had when we started building Grepture. We'll cover the eight tools most teams evaluate in 2026, how they actually differ, and when to pick each. We build a tool in this space, so we'll flag that clearly — but the bulk of this post is about the other seven, because you need that context first.
What you're actually evaluating
Before the tool-by-tool walkthrough, here are the five dimensions that matter.
Architecture. Is it a proxy (requests flow through it), an SDK (you instrument your code), or both? Proxies give you coverage without code changes but add a network hop to every request. SDKs add nothing to the request path but require integration in every service.
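To make that split concrete, here's a minimal sketch of both integration styles. The proxy hostname and the `record_trace` helper are placeholders, not any particular vendor's API.

```python
import time
from openai import OpenAI

# Proxy style: point the existing client at the observability proxy.
# "llm-proxy.example.com" is a placeholder, not a real endpoint.
proxied_client = OpenAI(base_url="https://llm-proxy.example.com/v1")


def record_trace(**fields):
    # Stand-in for a real tracing SDK: just print the trace record.
    print(fields)


# SDK style: wrap the call yourself and ship a trace record from your code.
def traced_completion(client: OpenAI, **kwargs):
    start = time.time()
    response = client.chat.completions.create(**kwargs)
    record_trace(
        model=kwargs.get("model"),
        latency_s=time.time() - start,
        usage=response.usage,  # prompt/completion token counts
    )
    return response
```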
Data captured by default. Some tools log full prompts and completions. Others capture only metadata (tokens, latency, errors). This matters for privacy — if your prompts contain PII, a default-log-everything tool creates a compliance liability you probably didn't plan for.
Evals vs. monitoring orientation. Some platforms are built around experiments and LLM-as-judge evals; observability is secondary. Others are production-monitoring first with evals bolted on.
Cost tracking granularity. Token counts are table stakes. The real question is: can you attribute spend to a team, a feature, an environment, or a user? And can you set budget alerts before the CFO notices?
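As a sketch of what attribution actually involves: if each logged request carries a team tag and token counts, spend rolls up with a few lines of arithmetic. The prices below are made-up placeholders, not anyone's current list prices.

```python
from collections import defaultdict

# Hypothetical per-1K-token prices -- placeholders, not real list prices.
PRICE_PER_1K = {"model-a": {"in": 0.0025, "out": 0.0100}}

# Each record is roughly what an observability tool logs per request.
records = [
    {"model": "model-a", "team": "search", "prompt_tokens": 1200, "completion_tokens": 300},
    {"model": "model-a", "team": "support-bot", "prompt_tokens": 400, "completion_tokens": 900},
]

spend_by_team = defaultdict(float)
for r in records:
    price = PRICE_PER_1K[r["model"]]
    spend_by_team[r["team"]] += (
        r["prompt_tokens"] / 1000 * price["in"]
        + r["completion_tokens"] / 1000 * price["out"]
    )

print(dict(spend_by_team))  # {'search': 0.006, 'support-bot': 0.01}
```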
Deployment model. Self-hosted open source, managed cloud, or both? This is usually a compliance question, not a cost question. Teams under EU regulation often need self-hosting or EU data residency; US startups rarely do.
The eight tools
1. Langfuse
Langfuse is the most widely deployed open-source LLM observability platform. It's MIT-licensed, self-hostable, and has a generous cloud free tier.
- Architecture: SDK-based tracing. Not a proxy.
- Strengths: Open source with active community. Rich tracing model. Built-in prompt management and evals. Self-host is genuinely usable (needs PostgreSQL, ClickHouse, Redis, blob storage).
- Weaknesses: Instrumentation burden in every service. No gateway features.
- Pick if: You want open-source, can instrument code, don't need a proxy.
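For a sense of the instrumentation burden, decorator-style tracing with the Langfuse Python SDK looks roughly like this. Import paths and decorator details differ between SDK versions, so treat it as a sketch rather than copy-paste code.

```python
from langfuse import observe  # older SDK versions: from langfuse.decorators import observe
from openai import OpenAI

client = OpenAI()


@observe()  # creates a trace/span around each call to this function
def answer_question(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Every service that calls an LLM needs this kind of decoration --
# that is the integration cost of the SDK approach.
```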
2. Helicone
Helicone is the clearest example of "observability as a proxy." Change your OpenAI base URL to Helicone's endpoint and every request gets logged.
- Architecture: HTTP proxy, primarily. Async logging mode also available.
- Strengths: Zero-code integration. Strong cost tracking and user-level attribution. Caching built in.
- Weaknesses: Proxy adds a network hop. Basic evals and prompt management.
- Pick if: You want the fastest possible integration and are comfortable with a proxy in the request path.
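The base-URL swap looks roughly like this. The endpoint and header names reflect Helicone's documented OpenAI-compatible gateway as we understand it; verify them against the current docs before wiring anything up.

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # check Helicone's docs for the current endpoint
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        "Helicone-User-Id": "user-123",  # optional: per-user cost attribution
    },
)

# From here on, every request through this client is logged by the proxy --
# no further instrumentation in your code.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
```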
3. Arize (Phoenix + AX)
Arize comes from traditional ML observability and has since extended into LLMs. Phoenix is its open-source tracing library; Arize AX is the paid enterprise platform.
- Architecture: OpenTelemetry-based SDK.
- Strengths: Deep eval and drift-detection heritage. Best if you also monitor traditional ML models. OTel plays nicely with existing observability stacks.
- Weaknesses: Enterprise-oriented pricing. Overkill for most startups.
- Pick if: You're a larger org already running ML in production.
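Because the tracing is OpenTelemetry-based, LLM traces are ordinary OTel spans and flow to whatever exporters you already run. A minimal sketch with the vanilla OTel SDK; the attribute names are illustrative, not the exact semantic conventions Phoenix/OpenInference define.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire spans to whichever exporter you already run (console here for brevity).
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-app")

with tracer.start_as_current_span("llm.completion") as span:
    # Illustrative attribute names; real deployments follow the OpenInference conventions.
    span.set_attribute("llm.model", "gpt-4o-mini")
    span.set_attribute("llm.prompt_tokens", 512)
    span.set_attribute("llm.completion_tokens", 128)
```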
4. Braintrust
Braintrust is evals-first. Observability is there, but the product is organized around experiments, scoring, and iterating on prompts.
- Architecture: SDK + strong web UI for evals.
- Strengths: Best eval workflow on this list, by a wide margin. Playground, datasets, and LLM-as-judge scoring tightly integrated.
- Weaknesses: More product than you need for pure monitoring. Closed source, cloud only.
- Pick if: Your team iterates heavily on prompts and evals.
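If LLM-as-judge scoring is new to you, the core loop is simple. The sketch below is not Braintrust's API, just the idea; the `judge` callable stands in for any model call.

```python
JUDGE_PROMPT = """You are grading an answer for factual accuracy.
Question: {question}
Answer: {answer}
Reply with a single integer from 1 (wrong) to 5 (fully correct)."""


def score_answer(question: str, answer: str, judge) -> int:
    """judge(prompt) -> str is a stand-in for any LLM call."""
    reply = judge(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip())

# An eval run is then: score every (question, answer) pair in a dataset and
# compare aggregate scores across prompt or model versions.
```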
5. Lunary
Lunary (formerly LLMonitor) is a lightweight open-source platform aimed at indie devs and small teams.
- Architecture: SDK-based tracing, also offers a proxy mode.
- Strengths: Simple setup, clean UI, open source. Decent cost tracking.
- Weaknesses: Smaller team and ecosystem than Langfuse. Basic evals.
- Pick if: You're a small team and Langfuse feels heavy.
6. Humanloop
Humanloop leans into prompt management and evaluation more than pure observability.
- Architecture: SDK-based with strong prompt versioning.
- Strengths: Excellent prompt-management story — versioning, deployment, non-engineer collaboration.
- Weaknesses: Observability is secondary. Closed source, enterprise pricing.
- Pick if: Prompt management and non-engineer collaboration are your primary pain points.
7. LangSmith
LangSmith is LangChain's official observability and eval platform.
- Architecture: SDK-based tracing, tightly integrated with LangChain primitives.
- Strengths: Zero-friction if you're already in LangChain. Deep support for agents, tool calls, and chain runs.
- Weaknesses: Feels bolted-on if you're using raw SDKs. Closed source.
- Pick if: You're committed to LangChain/LangGraph.
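The zero-friction claim is mostly environment variables: with LangChain installed, tracing turns on without code changes, and the `langsmith` SDK's decorator covers code outside LangChain. Variable names have shifted between releases, so check the current docs.

```python
import os

# Enabling tracing for LangChain code is typically just env vars.
os.environ["LANGCHAIN_TRACING_V2"] = "true"  # newer releases also accept LANGSMITH_* names
os.environ["LANGCHAIN_API_KEY"] = "..."      # your LangSmith API key

# For code outside LangChain, the langsmith SDK offers a decorator.
from langsmith import traceable


@traceable
def summarize(text: str) -> str:
    # ... call any model here; the call is recorded as a LangSmith run
    return text[:100]
```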
8. Grepture
Disclosure: this is us. Grepture started as a content-aware AI gateway with PII redaction and expanded into full observability.
- Architecture: Proxy + SDK. Trace-only mode for zero latency overhead, full gateway mode when you need routing or redaction.
- Strengths: Observability + AI gateway + PII redaction in one. Multi-provider routing and fallback. EU-hosted with GDPR defaults.
- Weaknesses: Smaller eval workflow than Braintrust. Younger product than Langfuse or Helicone.
- Pick if: You want observability + PII handling + cost tracking + multi-provider routing in one tool. Especially if you're EU-based.
Side-by-side comparison
| Tool | Architecture | Open source | Evals | Gateway features | Cost tracking | Best for |
|---|---|---|---|---|---|---|
| Langfuse | SDK | Yes (MIT) | Strong | No | Good | Open-source tracing |
| Helicone | Proxy | Yes | Basic | Partial | Strong | Fastest integration |
| Arize | SDK (OTel) | Partial (Phoenix) | Strong | No | Good | Enterprise ML + LLM |
| Braintrust | SDK | No | Best-in-class | No | Basic | Eval-heavy workflows |
| Lunary | SDK + proxy | Yes | Basic | Limited | Good | Small teams |
| Humanloop | SDK | No | Strong | No | Good | Prompt-first teams |
| LangSmith | SDK | No | Strong | No | Good | LangChain users |
| Grepture | Proxy + SDK | No | Production-focused | Full | Strong | Obs + gateway + PII |
How to decide
Start with your integration constraints. Can't touch every service? You need a proxy — Helicone, Lunary (proxy mode), or Grepture. If you can instrument, everything else opens up.
Then filter on evals vs. monitoring. Daily prompt iteration → Braintrust or Humanloop. Watching production → Langfuse, Helicone, or Grepture.
Then compliance. Self-host or EU residency → Langfuse, Phoenix, Lunary, Grepture EU.
Finally, scope creep. Single-purpose observability tools tend to expand. If you know you'll need a gateway, evals, and prompt management later, pick something that already has them.
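The same flow as a toy function, purely to make the ordering explicit. The mappings mirror the rules of thumb above, not exhaustive product facts, and it's no substitute for trialing the shortlist.

```python
def shortlist(can_instrument_all_services: bool,
              eval_heavy: bool,
              needs_self_host_or_eu: bool) -> set[str]:
    all_tools = {"Langfuse", "Helicone", "Arize", "Braintrust",
                 "Lunary", "Humanloop", "LangSmith", "Grepture"}

    # 1. Integration constraint: no instrumentation means you need a proxy.
    tools = all_tools if can_instrument_all_services else {"Helicone", "Lunary", "Grepture"}

    # 2. Orientation: daily prompt iteration vs. watching production.
    preferred = {"Braintrust", "Humanloop"} if eval_heavy else {"Langfuse", "Helicone", "Grepture"}
    tools = (tools & preferred) or tools  # don't let the filter empty the shortlist

    # 3. Compliance: self-host or EU residency (Phoenix covers the Arize option).
    if needs_self_host_or_eu:
        tools = (tools & {"Langfuse", "Arize", "Lunary", "Grepture"}) or tools

    return tools
```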
Key takeaways
- "LLM observability" is fragmented — eight leading tools, four architectures, overlapping but distinct strengths.
- Biggest fork is proxy vs. SDK. Pick based on whether you can instrument every service.
- Evals and observability are converging. Eval tools now trace; tracing tools now eval.
- Default data capture varies — if your prompts contain PII, check what the tool logs before integrating.
- Pure observability is solved. The interesting question is whether you want it stitched together with gateway, prompt management, and redaction — or as separate products.