TL;DR: DeepEval for pytest-native open-source evaluation. Braintrust for full-lifecycle eval with CI/CD quality gates. Arize Phoenix for vendor-neutral self-hosted tracing and eval. LangSmith if you are all-in on LangChain. Comet Opik for budget-conscious teams running high-volume traces.
Promptfoo Is Gone. Now What?
On March 9, OpenAI acquired Promptfoo for $86 million. Promptfoo was the most widely used open-source LLM eval and red-teaming CLI -- 10,800 GitHub stars, used by thousands of teams testing prompts, model outputs, and agent behavior across every major provider.
The acquisition raises an immediate question for anyone using non-OpenAI models: will Promptfoo stay vendor-neutral? The team says yes. The incentive structure says maybe not.
Whether you are running agents on Nebula, LangGraph, CrewAI, or your own framework, eval tooling is non-negotiable. Agents that call tools, make decisions, and interact with production systems need automated testing that catches failures before users do.
Here are five independent alternatives -- none owned by a model provider.
Quick Comparison
| Feature | DeepEval | Braintrust | Arize Phoenix | LangSmith | Comet Opik |
|---|---|---|---|---|---|
| Type | OSS framework | Hosted platform | OSS + cloud | Cloud + self-host | OSS + cloud |
| Agent metrics | 6 (DAG, tool-call) | Custom + 8 RAG | Dedicated evaluators | Step-level scoring | Agent Optimizer |
| CI/CD integration | pytest native | GitHub Actions gates | Via API | Via API | Via API |
| Production monitoring | No (eval only) | Yes (traces + scoring) | Yes (OTel traces) | Yes (traces) | Yes (40M/day) |
| Self-host option | OSS local | Enterprise only | Free, no feature gates | Enterprise tier | Apache 2.0 |
| Framework support | Python-first | 25+ integrations | 15+ via OTel | LangChain-native | LangChain, OpenAI, custom |
| Pricing | Free OSS / $19.99/user | Free 1M spans / $249/mo | Free self-host / $50/mo | $39/seat/mo | Free / $19/mo |
DeepEval -- Best for Open-Source Pytest Teams
DeepEval is a Python-native eval framework that runs inside pytest. If your team already writes tests with pytest, DeepEval slots in without changing your workflow. Define metrics, write test cases, and run them alongside your existing test suite.
The metric library is the deepest on this list: over 50 metrics including 6 agent-specific ones for DAG evaluation, tool-call correctness, and multi-step reasoning. You define expected tool calls and argument schemas, and DeepEval scores whether your agent followed the correct path.
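To make the idea concrete, here is a plain-Python sketch of what tool-call correctness scoring amounts to: comparing the tool calls an agent actually made against an expected sequence. This is not DeepEval's API, and every name below is illustrative; it is written as a pytest-style test only to show the shape DeepEval tests take inside a suite.

```python
# Sketch of tool-call correctness scoring -- the concept, not DeepEval's API.
# An agent's tool calls are matched in order against an expected sequence,
# comparing tool names and argument keys.

def tool_call_score(expected: list[dict], actual: list[dict]) -> float:
    """Fraction of expected tool calls found in order, matching on
    tool name and argument keys."""
    if not expected:
        return 1.0
    matched, i = 0, 0
    for exp in expected:
        # Scan forward through the actual calls for the next match.
        while i < len(actual):
            call = actual[i]
            i += 1
            if call["name"] == exp["name"] and set(call["args"]) == set(exp["args"]):
                matched += 1
                break
    return matched / len(expected)

def test_agent_searches_before_fetching():
    # Hypothetical agent that should search, then fetch the top result.
    expected = [
        {"name": "search", "args": ["query"]},
        {"name": "fetch", "args": ["url"]},
    ]
    actual = [
        {"name": "search", "args": {"query": "pricing page"}},
        {"name": "fetch", "args": {"url": "https://example.com/pricing"}},
    ]
    assert tool_call_score(expected, actual) == 1.0
```

With DeepEval you would use its built-in metrics rather than hand-rolling a scorer; the point is that the test above runs under plain `pytest`, which is exactly the workflow DeepEval plugs into.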
At 13,900 GitHub stars, it has strong community momentum and active development.
Strength: The pytest integration means zero adoption friction for Python teams. You write eval tests exactly like unit tests. CI/CD integration is free -- just add DeepEval tests to your existing pipeline.
Weakness: Python-only. No persistent dashboard unless you pay for Confident AI ($19.99/user/month). Eval-only -- no production tracing or monitoring. You need a separate tool for runtime observability.
Best for: Python teams that want open-source eval integrated directly into their test suite and CI pipeline.
Pricing: Free and open-source. Confident AI dashboard at $19.99/user/month.
Braintrust -- Best for Full Production Lifecycle
Braintrust goes beyond evaluation into the full lifecycle: prompt management, eval scoring, CI/CD quality gates, production tracing, and its Loop AI feature that automates prompt optimization. If you want one platform covering eval, monitoring, and improvement, this is the most complete option.
The CI/CD quality gates are the standout feature. Define minimum score thresholds for your evals, and Braintrust blocks deployments that fail. No more shipping prompts that regress on accuracy because someone merged without running evals.
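What a quality gate boils down to is simple, and sketching it generically makes the value clear: compare eval scores against minimum thresholds and fail the build on any regression. The snippet below is a hand-rolled sketch with assumed metric names, not Braintrust's API.

```python
# Generic sketch of an eval quality gate: any metric below its threshold
# fails the build. Metric names and scores here are illustrative.

THRESHOLDS = {"accuracy": 0.85, "faithfulness": 0.90, "tool_call_correctness": 0.95}

def gate(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the metrics that fall below their minimum threshold."""
    return [
        name for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]

if __name__ == "__main__":
    scores = {"accuracy": 0.88, "faithfulness": 0.92, "tool_call_correctness": 0.97}
    failing = gate(scores, THRESHOLDS)
    for name in failing:
        print(f"FAIL {name}: {scores[name]:.2f} < {THRESHOLDS[name]:.2f}")
    # In CI you would exit nonzero when `failing` is non-empty,
    # so the pipeline blocks the deploy step.
```

Braintrust wires this pattern into GitHub Actions for you, with the scores coming from its hosted evals instead of a local dict.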
Used by Stripe, Notion, and other production-heavy teams. Supports 25+ framework integrations.
Strength: The only tool here that covers eval, production monitoring, and automated prompt optimization in a single platform. The GitHub Actions integration for quality gates is genuinely useful -- it turns evals from "something you run manually" into an automated safety net.
Weakness: The Pro plan at $249/month is the most expensive option on this list. The free tier (1 million log spans) is generous for prototyping, but production teams will hit it. Self-hosting is enterprise-only.
Best for: Teams that want a single platform for the entire eval-to-production lifecycle and have the budget for it.
Pricing: Free tier with 1M log spans. Pro at $249/month. Enterprise pricing on request.
Arize Phoenix -- Best for Vendor-Neutral Self-Hosting
Arize Phoenix is built on OpenTelemetry, which means it plays nicely with any observability stack you already run. The self-hosted version is completely free with no feature gating -- you get the same capabilities whether you pay or not.
Phoenix includes dedicated agent evaluators for tool-call accuracy, retrieval quality, and response faithfulness. The embedding visualization feature helps you spot clustering issues and drift in your agent's behavior over time.
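The idea behind embedding-based drift detection can be shown in miniature: compare the centroid of recent response embeddings against a baseline centroid and flag when the cosine distance exceeds a threshold. This is a hand-rolled illustration of the concept, not Phoenix's implementation, and the threshold is an arbitrary assumption.

```python
# Miniature drift check: has the centroid of recent embeddings moved
# away from a baseline centroid? Not Phoenix's implementation.
import math

def centroid(vectors: list[list[float]]) -> list[float]:
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def drifted(baseline: list[list[float]], recent: list[list[float]],
            threshold: float = 0.2) -> bool:
    """True when recent embeddings have moved past the drift threshold."""
    return cosine_distance(centroid(baseline), centroid(recent)) > threshold

# Toy 2-D "embeddings": `shifted` points in a clearly different direction.
baseline = [[1.0, 0.0], [0.9, 0.1]]
shifted = [[0.0, 1.0], [0.1, 0.9]]
```

Phoenix does this over real high-dimensional embeddings with UMAP-style visualization on top; the sketch only shows why centroid distance is a useful drift signal.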
Backed by a $70M Series C, used by Uber and Booking.com.
Strength: The most genuinely vendor-neutral option. OTel-native means your traces are portable -- you are not locked into Arize's ecosystem. Self-hosting is first-class, not an enterprise upsell. If data residency or compliance matters, this is your safest bet.
Weakness: The eval capabilities are less specialized than DeepEval's metric library. Phoenix started as an observability tool and added eval later, so the eval-specific features (custom metrics, assertion frameworks) are less mature than purpose-built eval tools.
Best for: Teams that need self-hosted, vendor-neutral tracing and eval -- especially those with existing OTel infrastructure or compliance requirements.
Pricing: Free self-hosted (no feature gates). Arize cloud from $50/month.
LangSmith -- Best for LangChain Teams
LangSmith is the eval and observability platform built by the LangChain team. If you are building agents with LangGraph, LangSmith gives you the deepest integration: multi-turn agent evaluation, step-level scoring for each node in your graph, and 400-day trace retention.
The dataset management and annotation features are strong. You can build eval datasets from production traces, annotate them with human labels, and run automated evals against them. The feedback loop between production data and eval quality is well-designed.
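That trace-to-dataset loop looks roughly like the sketch below: filter production traces (here, by a low user rating), dedupe by input, and emit eval examples awaiting human labels. The trace fields are assumptions for illustration, not LangSmith's schema.

```python
# Generic sketch of building an eval dataset from production traces.
# Field names ("user_rating", "input", "output") are assumed, not
# LangSmith's schema.

def build_eval_dataset(traces: list[dict], max_rating: int = 2) -> list[dict]:
    """Turn poorly rated production traces into eval examples."""
    seen = set()
    dataset = []
    for trace in traces:
        if trace.get("user_rating", 5) > max_rating:
            continue  # keep only low-rated interactions
        key = trace["input"].strip().lower()
        if key in seen:
            continue  # one example per distinct input
        seen.add(key)
        dataset.append({
            "input": trace["input"],
            "actual_output": trace["output"],
            "expected_output": None,  # to be filled by a human annotator
        })
    return dataset
```

LangSmith handles the storage, annotation UI, and re-running evals against the resulting dataset; the sketch only shows the selection step.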
Backed by LangChain's $1.25B valuation and used by most LangGraph production deployments.
Strength: Unmatched integration depth with LangGraph and LangChain. If your agents are built on these frameworks, LangSmith provides visibility into every step, every tool call, and every decision point with zero extra instrumentation code.
Weakness: Ecosystem lock-in. LangSmith works best -- and sometimes only -- with LangChain-based agents. If you switch frameworks or use a custom agent architecture, the deep integrations become shallow. The $39/seat/month pricing adds up for larger teams.
Best for: Teams already building with LangGraph or LangChain who want the tightest possible eval and observability integration.
Pricing: Developer plan free. Plus at $39/seat/month. Enterprise pricing on request.
Comet Opik -- Best for Budget and Volume
Comet Opik is the newest entrant positioning itself on two fronts: price and scale. At $19/month for the paid tier (with a generous free plan), it is the cheapest option here. And it handles up to 40 million traces per day, which matters if you are running high-throughput eval pipelines or monitoring agents at scale.
The standout feature is the Agent Optimizer, which uses six different optimization algorithms to automatically improve your agent's prompts and configurations based on eval results. Think of it as automated prompt tuning driven by your eval metrics.
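In its simplest form, the loop such an optimizer closes looks like the sketch below: score each candidate prompt against an eval set and keep the winner. Opik's Agent Optimizer uses six real algorithms; this greedy version only shows the shape of the feedback loop, and every name in it is illustrative.

```python
# Greedy sketch of eval-driven prompt selection -- the shape of the
# feedback loop, not Opik's Agent Optimizer algorithms.

def best_prompt(candidates: list[str], eval_set: list[dict],
                score_fn) -> tuple[str, float]:
    """Return the candidate prompt with the highest mean eval score."""
    ranked = []
    for prompt in candidates:
        scores = [score_fn(prompt, example) for example in eval_set]
        ranked.append((sum(scores) / len(scores), prompt))
    top_score, top_prompt = max(ranked)
    return top_prompt, top_score

# Toy scorer: reward prompts that mention the example's topic keyword.
# A real setup would score actual model outputs, not the prompt text.
def keyword_score(prompt: str, example: dict) -> float:
    return 1.0 if example["topic"] in prompt else 0.0

eval_set = [{"topic": "refund"}, {"topic": "billing"}]
candidates = [
    "You handle refund and billing questions.",
    "You are a helpful bot.",
]
```

A production optimizer mutates prompts between rounds instead of ranking a fixed list, but the "score, compare, keep the best" core is the same.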
Apache 2.0 licensed, so you can self-host without restrictions.
Strength: The best price-to-capability ratio on this list. The Agent Optimizer turns eval results into actionable improvements automatically, closing the loop between "this prompt scored poorly" and "here's a better prompt." Apache 2.0 licensing gives you full self-hosting flexibility.
Weakness: Newer platform with less enterprise traction and a smaller community than the others. Fewer case studies and production references. The Agent Optimizer is promising but still early -- results vary by use case.
Best for: Teams watching their budget who need production-grade tracing and eval at scale, or teams that want self-hosted eval with a permissive license.
Pricing: Free tier available. Paid plans from $19/month.
How to Choose
The decision depends on three questions:
Do you need eval only, or eval plus production monitoring? If eval-only, DeepEval is the lightest option. If you need both, Braintrust or Arize Phoenix cover the full stack.
Is self-hosting a requirement? Arize Phoenix (free, no feature gates) or Comet Opik (Apache 2.0) are your options. Everything else is cloud-first or enterprise-only for self-hosting.
What is your framework? LangChain teams should start with LangSmith. Everyone else should start with DeepEval (eval-focused) or Braintrust (full lifecycle).
Quick decision tree:
- Open-source + Python? DeepEval
- Full lifecycle + CI/CD gates? Braintrust
- Vendor-neutral + self-hosted? Arize Phoenix
- LangChain ecosystem? LangSmith
- Budget + volume? Comet Opik
The Verdict
The Promptfoo acquisition is a reminder of a principle that applies to every layer of your AI stack: do not depend on a single vendor for critical infrastructure. Today it is your eval tool. Tomorrow it could be your model provider, your hosting platform, or your vector database.
All five tools on this list are either independent companies or open-source projects. Your eval infrastructure should survive any single acquisition.
If you are already writing pytest tests for your agents, DeepEval is the fastest path -- add eval metrics to your existing test suite in an afternoon. If you need a complete platform that covers eval, monitoring, and CI/CD quality gates, Braintrust is the most mature. And if self-hosting is non-negotiable, Arize Phoenix gives you everything for free.
Pick one and start testing. An agent without eval coverage is an agent waiting to break in production.
If you want to go deeper on testing agents at the code level, check out How to Test AI Agent Tool Calls with Pytest. For the frameworks these eval tools pair with, see our Top 5 AI Agent Frameworks for 2026. And for a look at where your agents actually run, here is our Top 5 Code Sandboxes for AI Agents.