EVAL #006: LLM Evaluation Tools — RAGAS vs DeepEval vs Braintrust vs LangSmith vs Arize Phoenix

By Ultra Dune | EVAL Newsletter


You shipped the RAG pipeline. The demo worked. The CEO nodded. Then production happened.

Users started asking questions your retriever never anticipated. The LLM hallucinated a return policy that doesn't exist. Your "95% accuracy" metric turned out to measure nothing useful. Welcome to the actual hard part of building LLM applications: evaluation.

Here's the uncomfortable truth most AI engineering teams discover around month three: building the LLM app was the easy part. Knowing whether it actually works — consistently, at scale, across edge cases — is where projects go to die. Evals are the difference between a demo and a product. And yet most teams are still vibes-checking their outputs manually, or worse, not evaluating at all.

The tooling landscape for LLM evaluation has exploded in the past year. We now have open-source frameworks, managed platforms, and hybrid approaches all competing for your eval workflow. But they're not interchangeable. They make fundamentally different bets about what evaluation should look like.

I've dug into five of the most prominent tools — RAGAS, DeepEval, Braintrust, LangSmith, and Arize Phoenix — to give you an honest assessment of where each one shines, where each one breaks down, and which one you should actually use.


The Comparison Table

| Feature | RAGAS | DeepEval | Braintrust | LangSmith | Arize Phoenix |
|---|---|---|---|---|---|
| Type | OSS library | OSS framework | Platform + SDK | Platform + SDK | OSS + commercial |
| Language | Python | Python | Python/TS/cURL | Python/TS/REST | Python |
| RAG-specific metrics | ★★★★★ | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | ★★★★☆ |
| Custom metrics | Moderate | Excellent | Excellent | Good | Good |
| LLM-as-judge | Built-in | Built-in | Built-in | Built-in | Built-in |
| Tracing/observability | No | No | Yes | Yes | Yes |
| CI/CD integration | Manual | Native (pytest) | Native | Moderate | Moderate |
| Dataset management | Basic | Built-in | Excellent | Excellent | Good |
| UI dashboard | No | Yes (Confident AI) | Yes | Yes | Yes |
| Pricing | Free (OSS) | Free + cloud | Free tier + paid | Free tier + paid | Free (OSS) + paid |
| Self-hostable | Yes | Partial | No | No | Yes |
| Learning curve | Low | Low | Medium | Medium-high | Medium |

Per-Tool Analysis

RAGAS — The RAG Eval Specialist

GitHub: explodinggradients/ragas | Stars: ~25k | License: Apache 2.0

RAGAS does one thing and does it well: evaluate Retrieval Augmented Generation pipelines. If you're building RAG, you've probably already seen RAGAS mentioned in every tutorial. There's a reason for that.

The core metric suite is purpose-built for RAG: faithfulness (does the answer stick to the retrieved context?), answer relevancy (is the response actually relevant to the question?), context precision and recall (did the retriever pull the right documents?). These four metrics alone cover 80% of what you need to evaluate a RAG pipeline.
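As a rough intuition only — the actual RAGAS implementation scores these with an LLM judge over claims and contexts, not document IDs — context precision and recall reduce to overlap between what the retriever pulled and what was actually relevant:

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for d in retrieved_ids if d in relevant) / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant documents the retriever managed to pull."""
    if not relevant_ids:
        return 0.0
    retrieved = set(retrieved_ids)
    return sum(1 for d in relevant_ids if d in retrieved) / len(relevant_ids)

# Retriever pulled 4 docs; only 2 of the 3 truly relevant docs are among them.
p = context_precision(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])  # 0.5
r = context_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])     # ~0.667
```

High precision with low recall means your retriever is conservative but missing documents; the reverse means it's pulling noise — which is exactly the diagnosis these two metrics are designed to separate.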

RAGAS uses LLM-as-judge under the hood for most metrics, which means your eval quality depends on the judge model you choose. GPT-4o works well. Cheaper models introduce noise. The library recently added support for custom metrics and non-RAG evaluation, but it still feels bolted on — the heart of RAGAS is RAG evaluation, and straying far from that core use case gets awkward.

The biggest limitation: RAGAS is a library, not a platform. There's no dashboard, no dataset versioning, no experiment tracking out of the box. You'll pipe results into your own tracking system — MLflow, Weights & Biases, a spreadsheet, whatever. For small teams running evals locally, this is fine. For teams that need to share results across engineering, product, and domain experts, it's a gap.

Best for: Teams building RAG pipelines that want lightweight, focused evaluation without platform lock-in. Especially strong if you already have your own experiment tracking infrastructure.

Watch out for: Metric computation costs. Every RAGAS eval calls an LLM, so evaluating 1,000 samples with 4 metrics = 4,000+ LLM calls. That adds up fast.
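A back-of-envelope sketch of that cost — the token count and per-token price below are illustrative assumptions, so plug in your own numbers:

```python
def judge_cost(samples, metrics, calls_per_metric=1,
               tokens_per_call=1500, usd_per_1k_tokens=0.005):
    """Rough LLM-as-judge cost: every (sample, metric) pair is >= 1 judge call."""
    calls = samples * metrics * calls_per_metric
    usd = calls * tokens_per_call / 1000 * usd_per_1k_tokens
    return calls, usd

calls, usd = judge_cost(samples=1000, metrics=4)
# 4,000 judge calls; roughly $30 per eval run at these assumed rates
```

Run evals nightly across a few model variants and that per-run figure multiplies quickly, which is why judge-call batching and caching (more on this below) matter.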

DeepEval — The Testing-First Framework

GitHub: confident-ai/deepeval | Stars: ~15k | License: Apache 2.0

DeepEval takes a fundamentally different philosophical approach: LLM evaluation should look like software testing. If you've written pytest tests, you already know how to write DeepEval tests. This is not a small thing.

from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def test_rag_faithfulness():
    # rag_pipeline and retrieved_docs come from your application under test
    metric = FaithfulnessMetric(threshold=0.7)
    test_case = LLMTestCase(
        input="What's your return policy?",
        actual_output=rag_pipeline("What's your return policy?"),
        retrieval_context=retrieved_docs
    )
    assert_test(test_case, [metric])

Run it with deepeval test run test_evals.py. It plugs directly into CI/CD. Jenkins, GitHub Actions, GitLab CI — wherever pytest runs, DeepEval runs. This is the killer feature. Evaluation becomes part of your deployment pipeline, not a separate manual step someone forgets to do.
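As a sketch of what that looks like in practice — workflow, file, and secret names here are placeholders, not prescribed by DeepEval — a minimal GitHub Actions job could be:

```yaml
# .github/workflows/evals.yml — run the eval suite on every pull request
name: llm-evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install deepeval
      - env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # credentials for the judge model
        run: deepeval test run test_evals.py  # non-zero exit blocks the merge
```

Because the command exits non-zero on failing metrics, a regression below your faithfulness threshold fails the check like any other broken test.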

The metric library is extensive: faithfulness, answer relevancy, hallucination, toxicity, bias, summarization quality, G-Eval (custom criteria via LLM), and more. The G-Eval implementation is particularly useful — you define evaluation criteria in natural language and DeepEval turns it into a structured LLM-as-judge scorer.

DeepEval also offers Confident AI, a cloud platform for dataset management, experiment tracking, and result visualization. It's optional — the core framework is fully open source and works standalone. But the cloud platform fills the collaboration gap nicely if you need it.

Limitations: The metric implementations can be opaque. When a faithfulness score comes back at 0.65, it's not always clear why. Debugging failing evals requires understanding the internal prompt chains, which aren't always well-documented. Also, like RAGAS, every metric evaluation is an LLM call, so costs scale linearly with dataset size.

Best for: Engineering teams that want eval-as-code with CI/CD integration. Teams already using pytest-based workflows will feel immediately at home.

Braintrust — The Experiment Platform

Website: braintrust.dev | Type: Commercial platform with open SDK

Braintrust approaches evaluation as an experiment management problem. Every eval run is an experiment. Every experiment has a dataset, a set of scorers, and produces a set of results that you can compare side-by-side with previous experiments. If you've used an A/B testing platform, the mental model is similar.

The SDK is clean and minimal:

from braintrust import Eval
# Factuality and AnswerRelevancy are scorers from Braintrust's companion
# autoevals package (shown here by name; check the current autoevals API)

Eval(
    "My RAG Pipeline",
    data=lambda: [{"input": "...", "expected": "..."}],  # golden dataset
    task=lambda input: my_pipeline(input),               # system under test
    scores=[Factuality, AnswerRelevancy],
)

Where Braintrust pulls ahead is the UI. The experiment comparison view is genuinely useful — you can diff outputs between runs, see which test cases regressed, and drill into individual examples. Dataset management is first-class: you can version datasets, add examples from production logs, and share golden sets across teams.

Braintrust also supports online evaluation (scoring production traffic in real-time) alongside offline eval, which bridges the gap between "pre-deployment testing" and "production monitoring." The proxy feature lets you route LLM calls through Braintrust for automatic logging without code changes, which is clever.

Pricing: Free tier includes 1,000 log rows/month and basic features. Pro starts at $250/month for teams. Enterprise is custom pricing. The free tier is genuinely usable for small projects, but you'll hit limits fast on anything serious.

The main tension with Braintrust is that it's a platform play. Your eval data, experiment history, and datasets live in Braintrust's cloud. If you're at a company with strict data residency requirements or you philosophically object to vendor lock-in for core development infrastructure, this is a consideration.

Best for: Teams that want a polished experiment management workflow with strong visualization. Particularly good for teams where non-engineers (product, domain experts) need to review eval results.

LangSmith — The LangChain Ecosystem Play

Website: smith.langchain.com | Type: Commercial platform by LangChain

LangSmith is LangChain's observability and evaluation platform, and your opinion of it will likely correlate with your opinion of the LangChain ecosystem generally.

If you're already using LangChain or LangGraph, LangSmith integration is nearly frictionless. Tracing is automatic — every chain execution, retriever call, and LLM invocation gets logged with full input/output/latency data. The trace view is excellent for debugging complex chains, and the ability to jump from a trace to an evaluation dataset (by promoting production examples to test cases) is a genuinely good workflow.

LangSmith's evaluation features let you define custom evaluators, use LLM-as-judge scoring, and run evaluations against versioned datasets. The annotation queue feature — where you route outputs to human reviewers for labeling — is a strong differentiator. Human eval at scale is something most other tools punt on.

The challenges: LangSmith's evaluation features feel secondary to its tracing/observability features. The eval SDK has gone through multiple API iterations, and the documentation can be confusing about which approach is current. If you're NOT using LangChain, the integration story is weaker — you can use LangSmith standalone, but you'll be working against the grain.

Pricing: Free tier is generous at 5,000 traces/month. Plus is $39/seat/month. Enterprise adds SOC2, SSO, and dedicated support.

The elephant in the room: LangChain itself is polarizing. Some teams swear by it, others have migrated away from it to simpler abstractions. If you've built your LLM app without LangChain, adopting LangSmith just for evals means buying into an ecosystem you explicitly avoided. That's a harder sell.

Best for: Teams already in the LangChain/LangGraph ecosystem who want integrated tracing + evaluation. Also strong for teams that need human-in-the-loop annotation workflows.

Arize Phoenix — The Observability-First Approach

GitHub: Arize-ai/phoenix | Stars: ~10k | License: Elastic License 2.0

Phoenix comes from Arize AI, which built its reputation on ML observability before the LLM wave. That lineage shows — Phoenix thinks about evaluation through the lens of monitoring, tracing, and data analysis.

Phoenix runs as a local server that collects traces (OpenTelemetry-compatible), lets you explore them in a notebook-like UI, and run evaluations on collected data. The workflow is: instrument your app → collect traces → define eval criteria → run evals → analyze results. It's more exploratory and iterative than the test-driven approach of DeepEval or the experiment-driven approach of Braintrust.

The trace-first design means Phoenix excels at answering questions like "why did my pipeline fail on this input?" You can see the full execution trace, inspect intermediate steps, and then define evals based on patterns you discover. This is powerful for debugging and for the early stages of eval development when you don't yet know what to measure.

Phoenix supports LLM-as-judge evals with customizable templates, embedding-based retrieval analysis (useful for understanding your retriever's behavior), and custom evaluation functions. The integration with the broader Arize platform adds production monitoring, drift detection, and alerting — which becomes relevant once you're past the eval stage and into production.

The trade-off: Phoenix's strength as an exploration tool is also its weakness as a testing tool. There's no native pytest integration, no built-in CI/CD workflow. You can build these yourself, but it's more assembly required than DeepEval or Braintrust. The Elastic License 2.0 (not true open source) may also matter for some organizations.

Best for: Teams that need deep observability alongside evaluation. Especially strong for teams in the "we don't know what to eval yet" phase where exploration and data analysis are more valuable than automated testing.


The Recommendation Matrix

"I'm building RAG and just need metrics" → Start with RAGAS. It's free, focused, and you'll have results in 30 minutes. Graduate to DeepEval or Braintrust when you need CI/CD integration or experiment tracking.

"I want eval in my CI/CD pipeline yesterday" → DeepEval. The pytest integration is unmatched. Write tests, run them in CI, block deploys on regressions. This is the most engineering-native approach.

"My team includes non-engineers who need to review evals" → Braintrust. The UI and experiment comparison features are built for collaboration. Worth the cost if stakeholder alignment on quality is a bottleneck.

"We're already using LangChain" → LangSmith. The integrated tracing + eval experience is the best in the LangChain ecosystem. Don't fight the integration — lean into it.

"We need observability AND evaluation" → Arize Phoenix. If you need production monitoring, trace analysis, and evaluation in one tool, Phoenix covers the widest surface area.

"We have zero budget and strict data requirements" → RAGAS + your own tracking. Everything stays local, no data leaves your infra, total cost is the LLM API calls for judge models.
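A minimal sketch of the "your own tracking" half: append each run's scores to a local CSV so nothing leaves your infrastructure. The file layout and column names below are just a suggestion, not a RAGAS convention:

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

def log_eval_run(path, run_name, scores):
    """Append one eval run (a dict of metric -> score) to a local CSV."""
    file = Path(path)
    write_header = not file.exists()
    with file.open("a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["timestamp", "run", "metric", "score"])
        ts = datetime.now(timezone.utc).isoformat()
        for metric, score in scores.items():
            writer.writerow([ts, run_name, metric, score])

# e.g. after ragas.evaluate() returns its scores dict:
log_eval_run("eval_runs.csv", "baseline",
             {"faithfulness": 0.82, "answer_relevancy": 0.91})
```

A CSV plus `git diff` on the numbers is crude, but it gives you run-over-run comparison with zero vendors and zero data egress.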

"We're a 1-2 person team" → DeepEval or RAGAS. Don't over-invest in platforms. Write eval tests, run them locally, iterate fast. Add a platform when collaboration becomes a bottleneck.

"We're enterprise and need the full stack" → Evaluate Braintrust and LangSmith head-to-head. Both offer enterprise features. Your choice depends on whether you value experiment management (Braintrust) or ecosystem integration (LangSmith) more.

One meta-observation: these tools are converging. RAGAS is adding more platform features. DeepEval built Confident AI. Phoenix adds more eval metrics every release. In 12 months, the differences will be smaller. Pick based on your needs today, but don't over-commit — the switching costs between eval frameworks are lower than you think.


The Changelog

Notable releases in the LLM eval ecosystem this month:

  1. RAGAS v0.2.12 — Added support for multi-turn conversation evaluation and custom LLM-as-judge templates. The conversation metrics (conversation_faithfulness, conversation_relevancy) fill a significant gap for chat-based RAG systems.

  2. DeepEval v2.1 — Introduced DAG-based metric evaluation for reduced LLM judge calls. Claims 40% cost reduction on complex eval suites. Also added Red Teaming metrics for adversarial testing of LLM apps.

  3. Braintrust SDK 0.0.170 — Online scoring now supports streaming responses. New "AI Task" scorer type lets you define evaluation criteria in natural language without writing code.

  4. LangSmith Annotation Queues v2 — Overhauled human annotation workflow with batch assignment, inter-annotator agreement tracking, and custom rubric support.

  5. Arize Phoenix 8.0 — Major release with overhauled UI, new structured extraction evaluators, and improved OpenTelemetry instrumentation. Added native Guardrails evaluation support.

  6. OpenAI Evals Framework Update — Quietly updated their evals framework with new model-graded evaluation templates and support for multi-step agent evaluation. Still under-documented but increasingly capable.

  7. UpTrain v0.7 — Worth watching as an emerging alternative. Added automated root cause analysis for failing evals — traces back low scores to specific retrieval or generation failures.


The Signal

Signal 1: Eval-Driven Development is becoming the standard. The term "EDD" is showing up in more engineering blog posts and conference talks. The pattern: write evals first, then build the pipeline to pass them. This is TDD for LLM apps, and teams that adopt it ship more reliable systems. The tooling is finally mature enough to support this workflow natively — DeepEval and Braintrust both enable it out of the box.

Signal 2: The cost of LLM-as-judge is becoming a real concern. Every major eval framework relies on calling an LLM to judge outputs. At scale, this means your eval costs can approach or exceed your inference costs. Watch for frameworks that optimize judge calls (batching, caching, smaller specialized judge models). DeepEval's DAG-based approach and the emerging trend of fine-tuned small judge models (like Prometheus-2 and Flow Judge) are early responses to this pressure.
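One of the cheapest optimizations is plain caching: identical (answer, context) pairs shouldn't be judged twice across runs. A sketch with a stubbed judge function standing in for the real LLM call:

```python
import functools

def judge_faithfulness(answer: str, context: str) -> float:
    """Stand-in for a real LLM-as-judge API call (toy scoring logic)."""
    return 1.0 if all(word in context for word in answer.split()) else 0.5

calls = {"n": 0}

@functools.lru_cache(maxsize=None)
def cached_judge(answer: str, context: str) -> float:
    calls["n"] += 1  # count actual judge invocations, not cache hits
    return judge_faithfulness(answer, context)

# Re-running the same eval suite reuses cached verdicts instead of paying twice.
for _ in range(2):
    cached_judge("returns accepted within 30 days",
                 "returns accepted within 30 days only")
```

In a real setup you'd key the cache on a hash of the judge prompt and persist it to disk, so unchanged test cases cost nothing on subsequent runs — only new or regressed outputs trigger fresh judge calls.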

Signal 3: Human eval isn't going away — it's getting better tooling. Despite the hype around automated LLM evaluation, every serious team still does human evaluation for high-stakes decisions. The interesting shift is that tools are building human eval into the workflow rather than treating it as a separate process. LangSmith's annotation queues and Braintrust's human scoring integration point toward a future where human and automated eval are unified in a single system.


Subscribe

If this analysis was useful, you'll want the next one too.

EVAL is a weekly newsletter covering the tools, techniques, and culture of LLM evaluation. No hype, no vendor pitches — just honest technical analysis for people building AI systems that need to work.

Subscribe: https://buttondown.com/ultradune
Explore: https://github.com/softwealth/eval-report-skills

See you next week.

— Ultra Dune
