Arize Phoenix is one of the most popular open-source LLM observability platforms in 2026. If you are evaluating it alongside CortexOps, this article compares them directly on the things that matter for production agent workloads.
Short version:
Arize Phoenix is the right choice if you are doing notebook-based experimentation, RAG evaluation, or need embeddings analysis. It is well-established with 9,000+ GitHub stars and strong OTel support.
CortexOps is the right choice if you need a first-class CI/CD deployment gate, multi-framework agent tracing beyond LangGraph, or a flat-rate pricing model that doesn't get expensive at agent-scale span volumes.
What They Are
Arize Phoenix is an open-source LLM observability and evaluation platform from Arize AI. It self-hosts in one command, uses OpenTelemetry for tracing, and includes built-in evaluators covering faithfulness, relevance, hallucination, and toxicity. Phoenix supports popular frameworks including OpenAI Agents SDK, LangGraph, CrewAI, LlamaIndex, and DSPy.
CortexOps is an open-source AI agent observability platform focused on the full production lifecycle — tracing, evaluation in CI/CD, and continuous monitoring. It supports 12 agent frameworks via a unified instrumentation layer and ships a CLI-based deployment gate that exits with code 1 when quality drops below a configured threshold.
One Important Distinction
Phoenix is optimized for prompt-centric experimentation and LLM-as-judge evals in a notebook-friendly self-host. When your app stops looking like a notebook — a production agent that runs for ten minutes, calls fifteen tools, spawns a sub-agent, and fails four tool calls deep — you open Phoenix and get a span tree. You wanted to know what the agent said to the user, what the user said back, and which tool call threw.
CortexOps is designed for that second scenario: debugging multi-node agent failures in production with a structured execution waterfall, not a flat span list.
Feature Comparison
| Feature | Arize Phoenix | CortexOps |
|---|---|---|
| Open source | ✓ Elastic License 2.0 | ✓ MIT |
| Self-hostable | ✓ Yes | ✓ Yes |
| OTel support | ✓ OpenInference conventions | ✓ OTLP native |
| Framework support | LangGraph, CrewAI, OAI SDK + more | 12 frameworks |
| LLM-as-judge eval | ✓ Yes | ✓ Yes |
| Embeddings analysis | ✓ Yes | ✗ Not yet |
| CI/CD eval gate CLI | Partial (custom script needed) | ✓ First-class |
| GitHub Actions | Manual integration | ✓ cortexops-eval-action |
| RAG-specific metrics | ✓ Strong | ✗ General metrics only |
| Free tier (hosted) | ✓ AX Free (25k spans/15-day retention) | ✓ 5k traces/month |
| Pro pricing | AX Pro $50/month (50k spans, 30-day retention) | $49/month flat (unlimited traces) |
| License | Elastic License 2.0 | MIT |
Tracing
Both platforms use OpenTelemetry. Phoenix ships OpenInference — the most widely adopted set of OpenTelemetry semantic conventions for LLM spans. CortexOps uses the emerging OTel LLM semantic conventions directly.
Phoenix instrumentation for LangGraph:
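from phoenix.otel import register
# auto_instrument=True enables any installed OpenInference instrumentors,
# e.g. openinference-instrumentation-langchain, which covers LangGraph
tracer_provider = register(project_name="my-agent", auto_instrument=True)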
CortexOps instrumentation:
from cortexops import CortexTracer
tracer = CortexTracer(api_key="cxo-...", project="my-agent")
agent = tracer.wrap(your_compiled_graph)
Both get you traces. The difference is in what you see: Phoenix shows a span tree. CortexOps shows a node waterfall organised by agent execution flow — which node ran, in what order, how long each took, and which tool calls happened inside each node.
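To make the span tree versus node waterfall distinction concrete, here is a minimal sketch of how flat spans could be regrouped by agent node. The span fields (`node`, `kind`, `name`, `start_ms`, `end_ms`) and the `build_waterfall` helper are hypothetical and chosen for illustration; this is not the data model of either Phoenix or CortexOps.

```python
from collections import defaultdict

# Hypothetical flat spans, as a generic tracing backend might export them.
# Field names are illustrative only, not a real Phoenix or CortexOps schema.
spans = [
    {"node": "planner", "kind": "node", "name": "planner", "start_ms": 0, "end_ms": 400},
    {"node": "researcher", "kind": "tool", "name": "fetch_page", "start_ms": 900, "end_ms": 1400},
    {"node": "researcher", "kind": "node", "name": "researcher", "start_ms": 400, "end_ms": 1500},
    {"node": "researcher", "kind": "tool", "name": "web_search", "start_ms": 450, "end_ms": 900},
    {"node": "writer", "kind": "node", "name": "writer", "start_ms": 1500, "end_ms": 2100},
]


def build_waterfall(spans):
    """Group tool spans under their node and order nodes by start time."""
    tools_by_node = defaultdict(list)
    node_spans = []
    for span in spans:
        if span["kind"] == "node":
            node_spans.append(span)
        else:
            tools_by_node[span["node"]].append(span)

    rows = []
    for node in sorted(node_spans, key=lambda s: s["start_ms"]):
        tools = sorted(tools_by_node[node["node"]], key=lambda s: s["start_ms"])
        rows.append({
            "node": node["name"],
            "duration_ms": node["end_ms"] - node["start_ms"],
            "tools": [t["name"] for t in tools],
        })
    return rows


waterfall = build_waterfall(spans)
for row in waterfall:
    print(f'{row["node"]:<12}{row["duration_ms"]:>6} ms  tools={row["tools"]}')
```

Each row answers the questions listed above: which node ran, in what order, how long it took, and which tool calls happened inside it.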
Winner: Roughly equal for tracing. Phoenix has more mature OTel conventions; CortexOps has a better agent-native execution view.
Evaluation
Phoenix has a strong evaluation suite. Built-in evaluators cover faithfulness, relevance, hallucination, toxicity, and custom criteria. LLM evaluators use function calling to extract structured judgments rather than parsing freeform text.
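As a rough illustration of why function calling helps here, the sketch below defines a judgment tool in the JSON-schema style used by OpenAI-compatible chat APIs and validates the arguments a model would return. The tool name `record_judgment`, the label set, and the simulated response are assumptions for this example; Phoenix's own evaluator schema may look different.

```python
import json

# Hypothetical label set for a faithfulness judge.
ALLOWED_LABELS = ("faithful", "unfaithful")

# Tool definition in the JSON-schema style of OpenAI-compatible chat APIs.
# The name and fields are illustrative, not Phoenix's internal schema.
JUDGE_TOOL = {
    "type": "function",
    "function": {
        "name": "record_judgment",
        "description": "Record whether the answer is supported by the context.",
        "parameters": {
            "type": "object",
            "properties": {
                "label": {"type": "string", "enum": list(ALLOWED_LABELS)},
                "explanation": {"type": "string"},
            },
            "required": ["label", "explanation"],
        },
    },
}


def parse_judgment(arguments_json):
    """Validate the tool-call arguments instead of scraping freeform text."""
    args = json.loads(arguments_json)
    label = args.get("label")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return {"label": label, "explanation": str(args.get("explanation", ""))}


# Simulated tool-call arguments; a real run would read these from the model response.
simulated_arguments = (
    '{"label": "unfaithful", '
    '"explanation": "The answer cites a figure absent from the context."}'
)
judgment = parse_judgment(simulated_arguments)
print(judgment["label"])
```

Because the label is constrained to an enum, the evaluator never has to guess whether "mostly faithful, I think" counts as a pass.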
CortexOps evaluation uses a golden dataset approach with three built-in rubrics (task completion, response quality, safety) plus a CLI gate:
cortexops eval run \
--dataset datasets/my_agent.yaml \
--judge \
--fail-on "task_completion < 0.90"
Phoenix can be integrated into CI/CD, but it requires custom scripting. The approach: run experiments on every PR, check the scores against thresholds in a Python script, and use the script's exit code to signal pass or fail. CortexOps ships this pattern out of the box as a first-class CLI command and GitHub Action.
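The custom-script pattern can be sketched in a few lines. This is a minimal, framework-agnostic example: the `gate_fails` and `main` helpers are hypothetical, the scores are placeholders standing in for whatever your experiment run produces, and the rule syntax simply mirrors the `task_completion < 0.90` string shown above rather than any documented grammar.

```python
import operator
import re

_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
_RULE = re.compile(r"^\s*(\w+)\s*(<=|>=|<|>)\s*([0-9]*\.?[0-9]+)\s*$")


def gate_fails(scores, fail_on):
    """Return True when the fail_on rule matches, i.e. the build should fail."""
    match = _RULE.match(fail_on)
    if match is None:
        raise ValueError(f"cannot parse rule: {fail_on!r}")
    metric, op, threshold = match.groups()
    if metric not in scores:
        raise KeyError(f"no score reported for {metric!r}")
    return _OPS[op](scores[metric], float(threshold))


def main(scores, fail_on):
    """Map the gate result to a process exit code: 1 fails the job, 0 passes."""
    return 1 if gate_fails(scores, fail_on) else 0


# Placeholder scores; in practice these come from your evaluation run.
example_scores = {"task_completion": 0.87, "safety": 0.99}
exit_code = main(example_scores, "task_completion < 0.90")
print("exit code:", exit_code)
# In a CI script, finish with: import sys; sys.exit(exit_code)
```

A non-zero exit code is all most CI systems need to mark the job as failed and block the merge.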
Winner: Phoenix for RAG and research-oriented evals. CortexOps for CI/CD deployment gates.
Pricing at Agent Scale
AX Free is 25k spans and 1GB at 15-day retention. AX Pro is $50/month for 50k spans and 10GB at 30-day retention. Graduating from Phoenix to AX is a new contract, not a tier upgrade, and span-based pricing gets expensive on agent workloads.
Consider a production agent with 10 nodes that runs 1,000 times per day. If each node emits around 10 spans (LLM calls, tool calls, retrievals), that is roughly 100,000 spans per day, or 3 million per month: 60x the 50k-span AX Pro limit at $50/month.
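The arithmetic is easy to rerun with your own numbers. In this sketch the spans-per-node figure is an assumption: 10 nodes times 1,000 runs only reaches 100,000 spans per day if each node emits about 10 spans. The 50k limit is also treated as a monthly allowance, and `monthly_spans` is a hypothetical helper.

```python
def monthly_spans(nodes, spans_per_node, runs_per_day, days=30):
    """Estimate monthly span volume for an agent workload."""
    return nodes * spans_per_node * runs_per_day * days


AX_PRO_SPAN_LIMIT = 50_000  # spans on the $50/month tier, assumed to be per month

# Assumption: about 10 spans per node (LLM calls, tool calls, retrievals).
total = monthly_spans(nodes=10, spans_per_node=10, runs_per_day=1_000)
print(total)                      # 3000000
print(total / AX_PRO_SPAN_LIMIT)  # 60.0

# Lower bound: exactly one span per node still exceeds the limit.
minimal = monthly_spans(nodes=10, spans_per_node=1, runs_per_day=1_000)
print(minimal / AX_PRO_SPAN_LIMIT)  # 6.0
```

Even the conservative one-span-per-node case lands at 6x the limit, so the conclusion does not hinge on the exact span count.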
CortexOps Pro is $49/month for unlimited traces. No span counting.
Winner: CortexOps for high-volume agent workloads. Phoenix/AX for lower-volume experimentation.
License
This matters for some teams. Phoenix uses the Elastic License 2.0, which restricts certain commercial use cases (you cannot offer Phoenix as a managed service to others). CortexOps is MIT, which allows commercial use, including hosted offerings, as long as the copyright notice is preserved.
Winner: CortexOps if license flexibility matters.
When to Choose Arize Phoenix
- You are doing RAG development and need embeddings analysis and context relevance metrics
- You want notebook-friendly local development with a mature, established platform
- You are already in the Arize ecosystem for traditional ML monitoring
- You need the breadth of Phoenix's built-in evaluator library
When to Choose CortexOps
- You need a CI/CD deployment gate that blocks merges on quality regression
- Your agent uses multiple frameworks (CrewAI + OpenAI SDK + LangGraph simultaneously)
- Span-based pricing would get expensive at your trace volume
- MIT license matters for your use case
- You want a flat-rate Pro plan
Try Both
Both are open source with free tiers. The fastest way to decide:
# Phoenix
pip install arize-phoenix
python -m phoenix.server.main serve # starts on localhost:6006
# CortexOps
pip install cortexops
# 3 lines to your first trace — getcortexops.com
Links:
- CortexOps: getcortexops.com | github.com/ashishodu2023/cortexops
- Arize Phoenix: arize.com/phoenix | github.com/arize-ai/phoenix
Ashish Verma is a Senior AI Engineer at PayPal and co-founder of CortexOps. This comparison reflects publicly available information as of June 2026.