DEV Community: Ashish Verma

CortexOps vs Arize Phoenix: AI Agent Observability Compared

Ashish Verma — Tue, 23 Jun 2026 03:38:01 +0000

Arize Phoenix is one of the most popular open-source LLM observability platforms in 2026. If you are evaluating it alongside CortexOps, this article compares them directly on the things that matter for production agent workloads.

Short version:

Arize Phoenix is the right choice if you are doing notebook-based experimentation, RAG evaluation, or need embeddings analysis. It is well-established with 9,000+ GitHub stars and strong OTel support.
CortexOps is the right choice if you need a first-class CI/CD deployment gate, multi-framework agent tracing beyond LangGraph, or a flat-rate pricing model that doesn't get expensive at agent-scale span volumes.

What They Are

Arize Phoenix is an open-source LLM observability and evaluation platform from Arize AI. It self-hosts in one command, uses OpenTelemetry for tracing, and includes built-in evaluators covering faithfulness, relevance, hallucination, and toxicity. Phoenix supports popular frameworks including OpenAI Agents SDK, LangGraph, CrewAI, LlamaIndex, and DSPy.

CortexOps is an open-source AI agent observability platform focused on the full production lifecycle — tracing, evaluation in CI/CD, and continuous monitoring. It supports 12 agent frameworks via a unified instrumentation layer and ships a CLI-based deployment gate that exits with code 1 when quality drops below a configured threshold.

One Important Distinction

Phoenix is optimized for prompt-centric experimentation and LLM-as-judge evals in a notebook-friendly self-host. When your app stops looking like a notebook — a production agent that runs for ten minutes, calls fifteen tools, spawns a sub-agent, and fails four tool calls deep — you open Phoenix and get a span tree. You wanted to know what the agent said to the user, what the user said back, and which tool call threw.

CortexOps is designed for that second scenario: debugging multi-node agent failures in production with a structured execution waterfall, not a flat span list.

Feature Comparison

Feature	Arize Phoenix	CortexOps
Open source	✓ Elastic License 2.0	✓ MIT
Self-hostable	✓ Yes	✓ Yes
OTel support	✓ OpenInference conventions	✓ OTLP native
Framework support	LangGraph, CrewAI, OAI SDK + more	12 frameworks
LLM-as-judge eval	✓ Yes	✓ Yes
Embeddings analysis	✓ Yes	✗ Not yet
CI/CD eval gate CLI	Partial (custom script needed)	✓ First-class
GitHub Actions	Manual integration	✓ cortexops-eval-action
RAG-specific metrics	✓ Strong	✗ General metrics only
Free tier (hosted)	✓ AX Free (25k spans/15-day retention)	✓ 5k traces/month
Pro pricing	AX Pro $50/month (50k spans, 30-day retention)	$49/month flat (unlimited traces)
License	Elastic License 2.0	MIT

Tracing

Both platforms use OpenTelemetry. Phoenix ships OpenInference — the most widely adopted set of OpenTelemetry semantic conventions for LLM spans. CortexOps uses the emerging OTel LLM semantic conventions directly.

Phoenix instrumentation for LangGraph:

from phoenix.otel import register
tracer_provider = register(project_name="my-agent")
# Auto-instruments LangGraph calls

CortexOps instrumentation:

from cortexops import CortexTracer
tracer = CortexTracer(api_key="cxo-...", project="my-agent")
agent  = tracer.wrap(your_compiled_graph)

Both get you traces. The difference is in what you see: Phoenix shows a span tree. CortexOps shows a node waterfall organised by agent execution flow — which node ran, in what order, how long each took, and which tool calls happened inside each node.

Winner: Roughly equal for tracing. Phoenix has more mature OTel conventions. CortexOps has better agent-native execution view.

Evaluation

Phoenix has a strong evaluation suite. Built-in evaluators cover faithfulness, relevance, hallucination, toxicity, and custom criteria. LLM evaluators use function calling to extract structured judgments rather than parsing freeform text.

CortexOps evaluation uses a golden dataset approach with three built-in rubrics (task completion, response quality, safety) plus a CLI gate:

cortexops eval run \
  --dataset datasets/my_agent.yaml \
  --judge \
  --fail-on "task_completion < 0.90"

Phoenix can be integrated into CI/CD but requires custom scripting. The approach: run experiments on every PR, check scores against thresholds in a Python script, and use exit code to reflect pass/fail. CortexOps ships this pattern out of the box as a first-class CLI command and GitHub Action.

Winner: Phoenix for RAG and research-oriented evals. CortexOps for CI/CD deployment gates.

Pricing at Agent Scale

AX Free is 25k spans and 1GB at 15-day retention. AX Pro is $50/month for 50k spans and 10GB at 30-day retention. Graduating from Phoenix to AX is a new contract, not a tier upgrade, and span-based pricing gets expensive on agent workloads.

A production agent with 10 nodes running 1,000 times per day generates roughly 100,000 spans per day — 3 million per month. That is 60x the AX Pro limit at $50/month.

CortexOps Pro is $49/month for unlimited traces. No span counting.

Winner: CortexOps for high-volume agent workloads. Phoenix/AX for lower-volume experimentation.

License

This matters for some teams. Phoenix uses the Elastic License 2.0, which restricts certain commercial use cases (you cannot offer Phoenix as a managed service to others). CortexOps is MIT — no restrictions.

Winner: CortexOps if license flexibility matters.

When to Choose Arize Phoenix

You are doing RAG development and need embeddings analysis and context relevance metrics
You want notebook-friendly local development with a mature, established platform
You are already in the Arize ecosystem for traditional ML monitoring
You need the breadth of Phoenix's built-in evaluator library

When to Choose CortexOps
You need a CI/CD deployment gate that blocks merges on quality regression
Your agent uses multiple frameworks (CrewAI + OpenAI SDK + LangGraph simultaneously)
Span-based pricing would get expensive at your trace volume
MIT license matters for your use case

- You want a flat-rate Pro plan

Try Both

Both are open source with free tiers. The fastest way to decide:

# Phoenix
pip install arize-phoenix
python -m phoenix.server.main  # starts on localhost:6006

# CortexOps
pip install cortexops
# 3 lines to your first trace — getcortexops.com

Links:

CortexOps: getcortexops.com | github.com/ashishodu2023/cortexops

- Arize Phoenix: arize.com/phoenix | github.com/arize-ai/phoenix

Ashish Verma is a Senior AI Engineer at PayPal and co-founder of CortexOps. This comparison reflects publicly available information as of June 2026.

CortexOps vs Langfuse: Open Source AI Observability Compared

Ashish Verma — Sat, 20 Jun 2026 05:36:58 +0000

Both CortexOps and Langfuse are open-source AI observability platforms. If you are evaluating them, the choice comes down to a few key differences: framework support, evaluation methodology, and whether you need a CI/CD deployment gate.

What They Are

Langfuse is an open-source LLM engineering platform focused on tracing, prompt management, and evaluation. It has a strong Python and TypeScript SDK, a hosted cloud option, and a popular self-hosted deployment. Over 6 million SDK downloads per month.

CortexOps is an open-source AI agent observability platform focused specifically on agentic systems. It supports 12 agent frameworks via a unified instrumentation layer, provides LLM-as-judge evaluation, and ships a CI/CD deployment gate CLI designed to block regressions before they reach production.

Feature Comparison

Feature	Langfuse	CortexOps
Open source	✓ MIT	✓ MIT
Self-hostable	✓ Yes	✓ Yes
Cloud hosted	✓ Yes	✓ Yes
Tracing	✓ LLM calls	✓ Agent execution (nodes, tools, state)
Agent frameworks	Via SDK wrappers	✓ 12 native integrations
OpenTelemetry	✓ Partial	✓ OTLP native
LLM-as-judge	✓ Yes	✓ Yes
CI/CD eval gate CLI	✗	✓ cortexops eval run
GitHub Actions	✗	✓ cortexops-eval-action
PII redaction	✓	✓
Free tier	✓	✓ 5,000 traces/month
Pro pricing	Usage-based	$49/month flat

The Key Difference: LLM Tracing vs Agent Tracing

Langfuse traces LLM calls — the individual model invocations that happen inside your application. This is valuable for prompt engineering and cost monitoring.

CortexOps traces agent execution — the full graph of nodes, tool calls, state transitions, and conditional branches that make up an agent run. This distinction matters when you are debugging:

With Langfuse you see:

LLM call #1 → input tokens: 342, output tokens: 89, latency: 1.2s
LLM call #2 → input tokens: 218, output tokens: 45, latency: 0.8s

With CortexOps you see:

agent_run (4.3s)
  └── classify_intent (1.2s) ✓
  └── check_refund_policy (0.9s) ✓
  └── process_refund (2.1s) ✗ FAILED
       └── tool: lookup_order (0.3s) ✓
       └── tool: issue_refund (1.8s) ✗ timeout

The agent-level trace tells you which node failed, which tool call timed out, and what the execution path was — without that, debugging a multi-node agent is guesswork.

The CI/CD Gate

This is where CortexOps has a clear advantage for production teams.

# Block the merge if task_completion drops below 90%
cortexops eval run \
  --dataset datasets/my_agent.yaml \
  --judge \
  --fail-on "task_completion < 0.90"

Combined with the GitHub Action:

- uses: ashishodu2023/cortexops-eval-action@v1
  with:
    dataset: datasets/my_agent.yaml
    fail-on: "task_completion < 0.90"
    cortexops-api-key: ${{ secrets.CORTEXOPS_API_KEY }}

Every pull request shows an eval report as a PR comment. The merge is blocked if quality drops. Langfuse has evaluation capabilities but does not ship a first-class CI/CD gate pattern.

When to Choose Langfuse

You are optimising LLM calls and prompts more than agent behaviour
You need TypeScript SDK support
You have an existing Langfuse deployment
You want the largest open-source community in this space

When to Choose CortexOps

You are building and operating LLM agents specifically
You need agent-level traces (nodes, tools, state) not just LLM call logs
You want a CI/CD gate that blocks regressions automatically
You use multiple agent frameworks

Try Both

Both are open source, both have free tiers. The fastest way to decide is to instrument one agent run with each and compare the trace data you get back.

pip install cortexops — 3 lines to your first agent trace.

Links:

CortexOps: getcortexops.com | github.com/ashishodu2023/cortexops
Langfuse: langfuse.com | github.com/langfuse/langfuse

Ashish Verma is a Senior AI Engineer at PayPal and co-founder of CortexOps.

CortexOps vs LangSmith: Which AI Agent Observability Tool Is Right for You?

Ashish Verma — Mon, 15 Jun 2026 07:39:58 +0000

If you are building LLM agents with LangGraph or LangChain and need production observability, you have probably looked at LangSmith. You may also have found CortexOps. This article compares them directly so you can make an informed choice.

Short version:
LangSmith is the right choice if you are already in the LangChain ecosystem and want deep integration with minimal setup.
CortexOps is the right choice if you need framework-neutral observability, an open-source option you can self-host, or a CI/CD eval gate that works across any agent framework.

What They Are
LangSmith is LangChain's commercial observability and evaluation platform. It is tightly integrated with LangChain and LangGraph, captures traces automatically when you set an environment variable, and provides a hosted dashboard at smith.langchain.com.

CortexOps is an open-source AI agent observability platform that supports 12 agent frameworks including LangGraph, CrewAI, OpenAI Agents SDK, PydanticAI, Google ADK, Smolagents, Haystack, DSPy, and more. It provides distributed tracing via OpenTelemetry, an LLM-as-judge eval framework, and a CI/CD deployment gate CLI. Available at getcortexops.com and pip install cortexops.

Feature Comparison
FeatureLangSmithCortexOpsTracing✓ Automatic (LangChain/LangGraph)✓ 12 frameworksOpenTelemetry export✗ Proprietary format✓ OTLP to any backendSelf-hostable✗ Cloud only✓ MIT license, Railway/DockerLLM-as-judge eval✓ Yes✓ YesGolden dataset API✓ Yes✓ YesCI/CD eval gate CLI✓ Yes✓ Yes (exit code 1 on regression)GitHub Actions✓ Yes✓ cortexops-eval-actionFree tier✓ Limited✓ 5,000 traces/monthOpen source✗ Closed source✓ MIT licenseFramework supportLangChain/LangGraph focused12 frameworksPII redaction✓✓PricingUsage-based, paid plansFree + $49/month Pro

Tracing
LangSmith wins on zero-configuration tracing for LangChain:

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"

# That's it — all LangChain calls are traced automatically

CortexOps requires three lines but works across any framework

from cortexops import CortexTracer

tracer = CortexTracer(api_key="cxo-...", project="my-agent")
agent  = tracer.wrap(your_compiled_graph)

The same three lines work for CrewAI, OpenAI Agents SDK, PydanticAI — any of the 12 supported frameworks. LangSmith traces are captured in LangSmith's proprietary format. CortexOps traces are exported via OpenTelemetry OTLP, which means you can send them to Honeycomb, Jaeger, Grafana Tempo, or Datadog alongside your existing infrastructure.

Winner: LangSmith for LangChain teams. CortexOps for multi-framework teams or teams with existing OTel infrastructure.

Evaluation in CI/CD

Both platforms offer golden dataset evaluation. CortexOps ships a CLI specifically designed as a CI/CD gate:

# CortexOps — fails with exit code 1 if quality drops below threshold
cortexops eval run \
  --dataset datasets/refund_agent.yaml \
  --judge \
  --fail-on "task_completion < 0.90"

# .github/workflows/eval.yml
- uses: ashishodu2023/cortexops-eval-action@v1
  with:
    dataset: datasets/refund_agent.yaml
    fail-on: "task_completion < 0.90"
    cortexops-api-key: ${{ secrets.CORTEXOPS_API_KEY }}

LangSmith has evaluation capabilities and can be integrated into CI/CD, but the deployment gate pattern — where the CI job explicitly fails on quality regression — is a first-class feature in CortexOps.

Winner: Roughly equal, with CortexOps having a tighter CI/CD gate integration.
**
Open Source vs Closed Source**

This is the clearest distinction. LangSmith is a commercial SaaS. If LangSmith changes pricing, deprecates features, or shuts down, your observability infrastructure is affected.

CortexOps is MIT licensed. You can:

Self-host on Railway, Docker, or your own infrastructure
Inspect and modify the source code
Contribute back to the project
Build internal tooling on top of the API

For teams with data residency requirements, compliance constraints, or air-gapped environments, open source self-hosting is often the only viable option.

Winner: CortexOps if open source or self-hosting matters. LangSmith if you prefer managed infrastructure.

Framework Support

If your entire stack is LangChain and LangGraph, LangSmith is purpose-built for you. If you use multiple frameworks — a common pattern as the agent ecosystem matures — CortexOps covers the breadth:

LangGraph    ✓ Both
CrewAI       ✓ CortexOps only
OpenAI SDK   ✓ CortexOps only
PydanticAI   ✓ CortexOps only
Google ADK   ✓ CortexOps only
Smolagents   ✓ CortexOps only
Haystack     ✓ CortexOps only
DSPy         ✓ CortexOps only
AutoGen      ✓ CortexOps only

Winner: CortexOps for multi-framework teams.

When to Choose LangSmith

Your entire agent stack is LangGraph or LangChain
You want automatic tracing with zero configuration
You prefer a managed SaaS with commercial support
Budget is not a constraint

When to Choose CortexOps

You use multiple agent frameworks
You need OpenTelemetry-native tracing for existing infrastructure
Open source and self-hosting matter
You want a CI/CD eval gate that works out of the box
You are on a budget (generous free tier)

Conclusion

LangSmith and CortexOps solve the same problem from different angles. LangSmith is deeper in the LangChain ecosystem. CortexOps is broader across the agent framework landscape and open source.

For most teams using a mix of frameworks, or teams who care about vendor neutrality, CortexOps is the stronger choice. For teams entirely on LangChain/LangGraph who want zero-configuration setup, LangSmith may be simpler to start with.

Try CortexOps: pip install cortexops — free tier, no credit card required.

Links:

CortexOps: getcortexops.com
GitHub: github.com/ashishodu2023/cortexops
LangSmith: smith.langchain.com

How to add an eval gate to your LangGraph agent in 5 minutes

Ashish Verma — Mon, 06 Apr 2026 06:09:21 +0000

It was 2:17am on a Tuesday.My phone lit up. A payment agent we had shipped three weeks earlier had started approving refunds it was never supposed to approve. By the time I was fully awake, eleven transactions had gone through incorrectly.Four hours later we found the root cause. A one-word prompt change. "Approve refunds under $500" became "approve refunds under $500 when possible." That word — possible — cost real money and a sleepless night.The worst part: we had tests. They just checked whether the agent returned a response. Not whether the response was correct. Not whether it contained the right keywords. Not whether it called the right tools. Not whether it finished within the latency budget.We were testing the wrong thing.After that incident I spent my evenings building the tool I wished I had. It is called CortexOps. This post walks through the exact setup that would have caught the regression before it ever shipped.

The problem with how most teams test agents
The problem with how most teams test agentsTraditional software testing is binary. The function either returned the right value or it did not. AI agents do not work like that.Same input, different output every run. Multi-step tool calls that may or may not happen. Latency that can spike without warning. Hallucinations that do not throw errors — they just confidently return wrong information with a 200 status code.The tools most teams reach for — pytest, basic assertions, even LangSmith's default setup — tell you that something failed. They do not stop it from shipping.What you actually need is a CI eval gate: a step in your pull request pipeline that runs a golden dataset against your agent, scores the outputs across multiple dimensions, and blocks the merge if quality drops below a threshold.Here is how to build one in 5 minutes.

What you are building
Your LangGraph agent feeds into a golden dataset that defines what correct looks like. EvalSuite runs every case and scores five metrics. If task completion drops below your threshold, the GitHub Actions step fails, exit code 1, and the PR is blocked. One prompt change that breaks your agent gets caught before it ships. No 2am page.

Step 1 — Install

pip install cortexops

Step 2 — Wrap your agent with one line

from cortexops import CortexTracer

tracer = CortexTracer(project="payments-agent")
agent  = tracer.wrap(your_langgraph_app)

CortexTracer.wrap() auto-detects your framework. LangGraph wraps CompiledStateGraph.invoke(). CrewAI wraps Crew.kickoff(). Any Python callable wraps directly. Your agent works identically after wrapping. No decorators, no config files, no changes to your existing code. Tracing uses an async flush that never blocks the agent.

Step 3 — Write a golden dataset
Create golden_v1.yaml. This is your ground truth — what correct agent behavior looks like for each case.

project: payments-agent
version: 1

cases:
  - id: refund_approved
    input:
      query: process refund for order ORD-8821
    expected_output_contains:
      - refund
      - approved
    expected_tool_calls:
      - lookup_refund
    max_latency_ms: 3000

  - id: balance_check
    input:
      query: what is my current balance
    expected_output_contains:
      - balance
      - amount
    max_latency_ms: 2000

  - id: dispute_filed
    input:
      query: I was charged twice, dispute this charge
    expected_output_contains:
      - dispute
      - filed
    expected_tool_calls:
      - classify_dispute
    max_latency_ms: 5000

The expected_output_contains list is the key. Every keyword must appear in the output. If your refund agent stops saying "approved" after a prompt change that case fails immediately.

Step 4 — Run the eval locally

from cortexops import EvalSuite

results = EvalSuite.run(
    dataset="golden_v1.yaml",
    agent=agent,
    verbose=True,
    fail_on="task_completion < 0.90",
)

print(results.summary())

When your agent is healthy you see this:

[1/3] refund_approved ... pass (100)
[2/3] balance_check   ... pass (100)
[3/3] dispute_filed   ... pass (94)

CortexOps eval — payments-agent
  Cases           : 3  (3 passed, 0 failed)
  Task completion : 100.0%
  Tool accuracy   : 100.0/100
  Latency p50/p95 : 287ms / 1,240ms

When the regression is present — the one-word prompt change — you see this:

[1/3] refund_approved ... FAIL (50)
[2/3] balance_check   ... pass (100)
[3/3] dispute_filed   ... pass (94)

CortexOps eval — payments-agent
  Cases           : 3  (2 passed, 1 failed)
  Task completion : 66.6%
  Failed cases:
    - refund_approved: OUTPUT_FORMAT (score 50)

EvalThresholdError: task_completion=0.666 < 0.9 (project=payments-agent)

Gate fires. Exit code 1. PR blocked.

Step 5 — The 5 metrics

CortexOps runs these automatically on every case without any configuration.
task_completion checks whether the output contains all expected keywords. This is the primary signal. A refund agent that stops saying "approved" after a prompt change fails this metric instantly.
tool_accuracy checks whether the right tools were called. Critical for multi-step payment flows where tool sequence matters. If lookup_refund is skipped, the case fails regardless of what the output says.
latency checks whether the agent responded within max_latency_ms. A refund that takes 30 seconds is not a working refund in production.
hallucination detects fabricated dates, false capability claims, and prohibited content patterns. Built in, no extra configuration, catches the most common LLM failure modes that break compliance in financial applications.
LLM judge uses GPT-4o to score open-ended outputs against natural language criteria you define. For cases where keyword matching is not enough — tone, empathy, completeness. Falls back to heuristic scoring automatically if OpenAI is unavailable so your eval never fails due to a third-party outage.

Step 6 — Add to GitHub Actions

name: CortexOps eval gate
on: [push, pull_request]

jobs:
  eval-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install cortexops
      - name: Run eval gate
        run: |
          cortexops eval run \
            --dataset golden_v1.yaml \
            --fail-on "task_completion < 0.90"

Every PR now triggers the eval. If task completion drops below 90% the merge is blocked. Not flagged. Not logged somewhere you will look at in two weeks. Blocked.
The one-word change that cost us at 2am would have hit this gate. The PR would have been blocked. The regression never ships.

Edge cases I tested before trusting this in production

Empty agent output — returns empty dict. Scored correctly, no crash, status COMPLETED.
Agent raises an exception mid-run. Status captured as FAILED, failure_kind set to UNKNOWN, exception detail stored in the trace. The eval suite does not crash.
16KB output from a verbose LLM response. Scored correctly with no performance issues.
Unicode and CJK characters in output. Keyword matching works correctly across character sets.
Five concurrent eval runs using Python threading. All five pass with no race conditions.
The SDK is built to never break your agent. Tracing failures are swallowed silently. The agent always returns normally even if the eval infrastructure is unreachable.

**
Optional — live observability**
If you want traces stored, a live dashboard, and Slack alerts when production regresses, point the SDK at the hosted API:

tracer = CortexTracer(
    project="payments-agent",
    api_key="cxo-...",
    api_url="https://api.getcortexops.com",
    environment="production",
)

The dashboard at app.getcortexops.com shows a live trace feed with status, latency, and failure kind per run. Click any trace row and a waterfall panel slides in showing exactly which node took how long, which tools were called, and what the raw JSON output was. That is how you go from a Slack alert to root cause in 30 seconds instead of digging through CloudWatch for an hour.

Pro tier is $49 per seat per month flat. No per-trace billing. 14-day free trial. Cancel anytime via the Stripe dashboard.

**
The free tier is real**
Everything you need to catch the 2am incident is free forever.
Full SDK. Unlimited local eval runs. YAML golden dataset format. GitHub Actions CI gate. All five metrics. CLI tool. MIT licensed. Full source on GitHub.
The free tier is what I would have needed that Tuesday night. The Pro tier adds the hosted observability layer for teams that want production visibility without building their own infrastructure.

**
What is next for me**
I am a Senior AI Engineer at PayPal. I have spent five years building production ML systems for payments — anomaly detection, fraud signals, real-time scoring. CortexOps came out of real production pain, not a side project looking for a problem.
I am looking for five design partners. Free Pro access in exchange for 30 minutes on a call telling me what is missing. If you are shipping LangGraph or CrewAI agents to production — especially in fintech, payments, compliance, or any domain where a wrong output has real consequences — I want to talk to you.
GitHub: github.com/ashishodu2023/cortexops
Docs: docs.getcortexops.com
Install: pip install cortexops
Website: getcortexops.com
If you have ever been paged at 2am over an agent regression, this is the tool that stops it from happening again.