Ashish Verma

Posted on Jun 23

CortexOps vs Arize Phoenix: AI Agent Observability Compared

Arize Phoenix is one of the most popular open-source LLM observability platforms in 2026. If you are evaluating it alongside CortexOps, this article compares them directly on the things that matter for production agent workloads.

Short version:

Arize Phoenix is the right choice if you are doing notebook-based experimentation, RAG evaluation, or need embeddings analysis. It is well-established with 9,000+ GitHub stars and strong OTel support.
CortexOps is the right choice if you need a first-class CI/CD deployment gate, multi-framework agent tracing beyond LangGraph, or a flat-rate pricing model that doesn't get expensive at agent-scale span volumes.

What They Are

Arize Phoenix is an open-source LLM observability and evaluation platform from Arize AI. It self-hosts in one command, uses OpenTelemetry for tracing, and includes built-in evaluators covering faithfulness, relevance, hallucination, and toxicity. Phoenix supports popular frameworks including OpenAI Agents SDK, LangGraph, CrewAI, LlamaIndex, and DSPy.

CortexOps is an open-source AI agent observability platform focused on the full production lifecycle — tracing, evaluation in CI/CD, and continuous monitoring. It supports 12 agent frameworks via a unified instrumentation layer and ships a CLI-based deployment gate that exits with code 1 when quality drops below a configured threshold.

One Important Distinction

Phoenix is optimized for prompt-centric experimentation and LLM-as-judge evals in a notebook-friendly self-host. When your app stops looking like a notebook — a production agent that runs for ten minutes, calls fifteen tools, spawns a sub-agent, and fails four tool calls deep — you open Phoenix and get a span tree. You wanted to know what the agent said to the user, what the user said back, and which tool call threw.

CortexOps is designed for that second scenario: debugging multi-node agent failures in production with a structured execution waterfall, not a flat span list.

Feature Comparison

Feature	Arize Phoenix	CortexOps
Open source	✓ Elastic License 2.0	✓ MIT
Self-hostable	✓ Yes	✓ Yes
OTel support	✓ OpenInference conventions	✓ OTLP native
Framework support	LangGraph, CrewAI, OpenAI Agents SDK + more	12 frameworks
LLM-as-judge eval	✓ Yes	✓ Yes
Embeddings analysis	✓ Yes	✗ Not yet
CI/CD eval gate CLI	Partial (custom script needed)	✓ First-class
GitHub Actions	Manual integration	✓ cortexops-eval-action
RAG-specific metrics	✓ Strong	✗ General metrics only
Free tier (hosted)	✓ AX Free (25k spans/15-day retention)	✓ 5k traces/month
Pro pricing	AX Pro $50/month (50k spans, 30-day retention)	$49/month flat (unlimited traces)
License	Elastic License 2.0	MIT

Tracing

Both platforms use OpenTelemetry. Phoenix ships OpenInference — the most widely adopted set of OpenTelemetry semantic conventions for LLM spans. CortexOps uses the emerging OTel LLM semantic conventions directly.

Phoenix instrumentation for LangGraph:

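# Requires: pip install arize-phoenix-otel openinference-instrumentation-langchain
from phoenix.otel import register

# auto_instrument=True activates any installed OpenInference instrumentors;
# LangGraph runs are captured through the LangChain instrumentor.
tracer_provider = register(project_name="my-agent", auto_instrument=True)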

CortexOps instrumentation:

from cortexops import CortexTracer
tracer = CortexTracer(api_key="cxo-...", project="my-agent")
agent = tracer.wrap(your_compiled_graph)

Both get you traces. The difference is in what you see: Phoenix shows a span tree. CortexOps shows a node waterfall organised by agent execution flow — which node ran, in what order, how long each took, and which tool calls happened inside each node.
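To make the span tree versus node waterfall distinction concrete, here is a minimal sketch of how flat span records could be regrouped by agent node. The record fields (`kind`, `parent`, `start_ms`, and so on) and the `build_waterfall` helper are hypothetical and chosen for illustration; they are not the schema or API of either tool.

```python
from collections import defaultdict

# Hypothetical flat span records, roughly what a generic trace export might contain.
# Field names are illustrative assumptions, not Phoenix's or CortexOps's schema.
SPANS = [
    {"id": "s1", "parent": None, "name": "plan", "kind": "node", "start_ms": 0, "end_ms": 400},
    {"id": "s2", "parent": "s1", "name": "search_docs", "kind": "tool", "start_ms": 50, "end_ms": 300},
    {"id": "s3", "parent": None, "name": "act", "kind": "node", "start_ms": 400, "end_ms": 900},
    {"id": "s4", "parent": "s3", "name": "send_email", "kind": "tool",
     "start_ms": 450, "end_ms": 800, "error": "timeout"},
]


def build_waterfall(spans):
    """Group tool spans under their parent node and order nodes by start time."""
    tools_by_parent = defaultdict(list)
    for span in spans:
        if span["kind"] == "tool":
            tools_by_parent[span["parent"]].append(span)

    nodes = sorted((s for s in spans if s["kind"] == "node"), key=lambda s: s["start_ms"])
    rows = []
    for node in nodes:
        tools = sorted(tools_by_parent[node["id"]], key=lambda t: t["start_ms"])
        rows.append({
            "node": node["name"],
            "duration_ms": node["end_ms"] - node["start_ms"],
            "tools": [t["name"] for t in tools],
            "failed_tools": [t["name"] for t in tools if "error" in t],
        })
    return rows


for row in build_waterfall(SPANS):
    print(row)
```

Each row answers the questions listed above: which node ran, in what order, how long it took, and which tool calls happened (or failed) inside it.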

Winner: Roughly equal for tracing. Phoenix has more mature OTel conventions. CortexOps has a better agent-native execution view.

Evaluation

Phoenix has a strong evaluation suite. Built-in evaluators cover faithfulness, relevance, hallucination, toxicity, and custom criteria. LLM evaluators use function calling to extract structured judgments rather than parsing freeform text.
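The idea of extracting structured judgments through function calling can be sketched without any vendor SDK. The tool schema, the `record_judgment` name, and the `parse_judgment` helper below are assumptions made for illustration, not Phoenix's actual evaluator internals; the point is only that the judge returns JSON arguments that can be validated, instead of free text that has to be parsed.

```python
import json

# Hypothetical tool schema that a judge model could be asked to call.
JUDGMENT_TOOL = {
    "name": "record_judgment",
    "parameters": {
        "type": "object",
        "properties": {
            "label": {"type": "string", "enum": ["faithful", "unfaithful"]},
            "explanation": {"type": "string"},
        },
        "required": ["label", "explanation"],
    },
}


def parse_judgment(tool_call_arguments: str) -> dict:
    """Validate the JSON arguments of a judge's tool call against the schema above."""
    args = json.loads(tool_call_arguments)
    schema = JUDGMENT_TOOL["parameters"]
    missing = [key for key in schema["required"] if key not in args]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    allowed = schema["properties"]["label"]["enum"]
    if args["label"] not in allowed:
        raise ValueError(f"unexpected label: {args['label']!r}")
    return args


# Simulated tool-call payload; in practice this would come from the model response.
payload = '{"label": "unfaithful", "explanation": "Cites a figure not in the context."}'
print(parse_judgment(payload)["label"])
```

A malformed or off-schema judgment raises an error here instead of being silently misread, which is the main practical advantage over regex-parsing a freeform answer.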

CortexOps evaluation uses a golden dataset approach with three built-in rubrics (task completion, response quality, safety) plus a CLI gate:

cortexops eval run \
  --dataset datasets/my_agent.yaml \
  --judge \
  --fail-on "task_completion < 0.90"

Phoenix can be integrated into CI/CD, but it requires custom scripting. The approach: run experiments on every PR, check the scores against thresholds in a Python script, and use the script's exit code to signal pass or fail. CortexOps ships this pattern out of the box as a first-class CLI command and GitHub Action.
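The custom-script pattern described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the `scores` dictionary stands in for whatever your evaluation run produces, and `parse_rule` and `gate` are hypothetical helpers, not part of Phoenix or CortexOps. The rule syntax mirrors the `--fail-on` expression shown earlier.

```python
import operator
import re
import sys

# Comparison operators allowed in a rule such as "task_completion < 0.90".
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
RULE = re.compile(r"^\s*(\w+)\s*(<=|>=|<|>)\s*([0-9]*\.?[0-9]+)\s*$")


def parse_rule(rule: str):
    """Split a rule string into (metric name, comparison function, threshold)."""
    match = RULE.match(rule)
    if match is None:
        raise ValueError(f"cannot parse rule: {rule!r}")
    metric, op, threshold = match.groups()
    return metric, OPS[op], float(threshold)


def gate(scores: dict, fail_on: list) -> int:
    """Return 1 if any fail-on rule is true for the given scores, else 0."""
    failures = []
    for rule in fail_on:
        metric, compare, threshold = parse_rule(rule)
        if compare(scores[metric], threshold):
            failures.append(rule)
    for rule in failures:
        print(f"FAIL: {rule}", file=sys.stderr)
    return 1 if failures else 0


# Illustrative scores; a real script would load these from the evaluation run.
scores = {"task_completion": 0.87, "safety": 0.99}
exit_code = gate(scores, ["task_completion < 0.90"])
print("exit code:", exit_code)
# In a CI job, finish with sys.exit(exit_code) so the pipeline step fails on regression.
```

A non-zero exit code is all a CI system needs to block a merge, so the gate itself stays independent of whichever platform produced the scores.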

Winner: Phoenix for RAG and research-oriented evals. CortexOps for CI/CD deployment gates.

Pricing at Agent Scale

Arize's hosted platform, AX, is priced by span volume. AX Free includes 25k spans and 1GB at 15-day retention. AX Pro is $50/month for 50k spans and 10GB at 30-day retention. Graduating from self-hosted Phoenix to AX is a new contract, not a tier upgrade, and span-based pricing gets expensive on agent workloads.

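A production agent with 10 nodes running 1,000 times per day generates at least 10,000 spans per day if you count one span per node, or about 300,000 per month. If each node also emits around ten child spans for LLM and tool calls, that rises to roughly 100,000 spans per day, or 3 million per month. That is between 6x and 60x the 50k-span AX Pro limit at $50/month.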

CortexOps Pro is $49/month for unlimited traces. No span counting.
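The span estimate above depends heavily on how many spans each node emits, so it is worth running the arithmetic for your own workload. In this short sketch, the `spans_per_node` parameter is an assumption you should replace with a measured value; the 50k limit is the AX Pro figure quoted above.

```python
def monthly_spans(nodes: int, runs_per_day: int, spans_per_node: int = 1, days: int = 30) -> int:
    """Estimate spans per month for an agent with a fixed number of nodes per run."""
    return nodes * spans_per_node * runs_per_day * days


AX_PRO_SPAN_LIMIT = 50_000  # AX Pro span allowance quoted in this article

# One span per node versus roughly ten spans per node (LLM and tool-call children).
for spans_per_node in (1, 10):
    total = monthly_spans(nodes=10, runs_per_day=1_000, spans_per_node=spans_per_node)
    ratio = total / AX_PRO_SPAN_LIMIT
    print(f"{spans_per_node} span(s) per node: {total:,} spans/month, {ratio:.0f}x the limit")
```

With one span per node the workload is 300,000 spans per month (6x the limit); at ten spans per node it is 3,000,000 (60x).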

Winner: CortexOps for high-volume agent workloads. Phoenix/AX for lower-volume experimentation.

License

This matters for some teams. Phoenix uses the Elastic License 2.0, which restricts certain commercial use cases (you cannot offer Phoenix as a managed service to others). CortexOps is MIT, which permits commercial use and redistribution as long as the license notice is retained.

Winner: CortexOps if license flexibility matters.

When to Choose Arize Phoenix

You are doing RAG development and need embeddings analysis and context relevance metrics
You want notebook-friendly local development with a mature, established platform
You are already in the Arize ecosystem for traditional ML monitoring
You need the breadth of Phoenix's built-in evaluator library

When to Choose CortexOps

You need a CI/CD deployment gate that blocks merges on quality regression
Your agent uses multiple frameworks (CrewAI + OpenAI SDK + LangGraph simultaneously)
Span-based pricing would get expensive at your trace volume
MIT license matters for your use case
You want a flat-rate Pro plan

Try Both

Both are open source with free tiers. The fastest way to decide:

# Phoenix
pip install arize-phoenix
phoenix serve  # starts on localhost:6006

# CortexOps
pip install cortexops
# 3 lines to your first trace — getcortexops.com

Links:

CortexOps: getcortexops.com | github.com/ashishodu2023/cortexops
Arize Phoenix: arize.com/phoenix | github.com/arize-ai/phoenix

Ashish Verma is a Senior AI Engineer at PayPal and co-founder of CortexOps. This comparison reflects publicly available information as of June 2026.

Top comments (1)

Luis Cruz • Jun 23

This post is basically about moving LiteLLM toward a Rust-based architecture and what the benchmark differences look like between the Python gateway vs a Rust implementation.

The core idea (and why it’s getting attention) is pretty simple:

LiteLLM has become a common “LLM gateway layer” in Python, but at scale it starts to show typical Python-infrastructure limits (latency overhead, concurrency constraints, memory pressure). The Rust migration argument is that rewriting or replacing parts of it in Rust can dramatically reduce overhead and improve throughput in production systems.

From the broader benchmarking direction in the ecosystem, the claims usually revolve around:

Lower per-request latency due to compiled execution
Much higher throughput under load
Smaller memory footprint and more predictable performance
Tradeoff: more complex implementation and less flexibility compared to Python-first tooling

This fits into a wider trend right now where LLM infrastructure is splitting into two layers:

Python: orchestration, experimentation, agent logic
Rust/Go: inference gateways, routing layers, and high-throughput serving infrastructure

The important nuance is that most of these posts (including this kind of benchmark-heavy migration narrative) are showing relative system design advantages, not “Rust magically replaces LiteLLM everywhere.” In practice, teams usually end up with hybrid stacks rather than full rewrites.

So the real signal here isn’t just “Rust is faster,” but:
LLM infrastructure is maturing into performance-critical systems where language choice at the gateway layer actually matters.

That’s the underlying shift this post is pointing at 🤝