It was 2:17am on a Tuesday. My phone lit up. A payment agent we had shipped three weeks earlier had started approving refunds it was never supposed to approve. By the time I was fully awake, eleven transactions had gone through incorrectly.

Four hours later we found the root cause. A one-word prompt change. "Approve refunds under $500" became "approve refunds under $500 when possible." That word — possible — cost real money and a sleepless night.

The worst part: we had tests. They just checked whether the agent returned a response. Not whether the response was correct. Not whether it contained the right keywords. Not whether it called the right tools. Not whether it finished within the latency budget.

We were testing the wrong thing.

After that incident I spent my evenings building the tool I wished I had. It is called CortexOps. This post walks through the exact setup that would have caught the regression before it ever shipped.
The problem with how most teams test agents
Traditional software testing is binary. The function either returned the right value or it did not. AI agents do not work like that.

Same input, different output every run. Multi-step tool calls that may or may not happen. Latency that can spike without warning. Hallucinations that do not throw errors — they just confidently return wrong information with a 200 status code.

The tools most teams reach for — pytest, basic assertions, even LangSmith's default setup — tell you that something failed. They do not stop it from shipping.

What you actually need is a CI eval gate: a step in your pull request pipeline that runs a golden dataset against your agent, scores the outputs across multiple dimensions, and blocks the merge if quality drops below a threshold.

Here is how to build one in 5 minutes.
What you are building
Your LangGraph agent feeds into a golden dataset that defines what correct looks like. EvalSuite runs every case and scores five metrics. If task completion drops below your threshold, the GitHub Actions step fails, exit code 1, and the PR is blocked. One prompt change that breaks your agent gets caught before it ships. No 2am page.
Step 1 — Install
pip install cortexops
Step 2 — Wrap your agent with one line
from cortexops import CortexTracer
tracer = CortexTracer(project="payments-agent")
agent = tracer.wrap(your_langgraph_app)
CortexTracer.wrap() auto-detects your framework. LangGraph wraps CompiledStateGraph.invoke(). CrewAI wraps Crew.kickoff(). Any Python callable wraps directly. Your agent works identically after wrapping. No decorators, no config files, no changes to your existing code. Tracing uses an async flush that never blocks the agent.
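To make the "never blocks, never breaks your agent" claim concrete, here is a minimal stdlib-only sketch of that wrapper pattern: a background thread drains a queue of trace records, and the wrapped callable behaves identically whether tracing succeeds or not. This is an illustration of the pattern, not the actual CortexOps source; `MiniTracer` and its internals are hypothetical names.

```python
import queue
import threading
import time

class MiniTracer:
    """Illustrative tracer: wraps any callable, records latency and status,
    and flushes trace records on a daemon thread so the agent call itself
    is never blocked by tracing."""

    def __init__(self, project):
        self.project = project
        self._queue = queue.Queue()
        self._records = []  # stand-in for a remote trace store
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def _flush_loop(self):
        while True:
            record = self._queue.get()
            try:
                self._records.append(record)  # stand-in for an HTTP POST
            except Exception:
                pass  # tracing failures must never break the agent
            finally:
                self._queue.task_done()

    def wrap(self, fn):
        def wrapped(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                status = "COMPLETED"
                return result
            except Exception:
                status = "FAILED"
                raise  # the caller still sees the original exception
            finally:
                self._queue.put({
                    "project": self.project,
                    "status": status,
                    "latency_ms": (time.perf_counter() - start) * 1000,
                })
        return wrapped

tracer = MiniTracer(project="payments-agent")
agent = tracer.wrap(lambda q: f"processed: {q}")
print(agent("refund ORD-8821"))  # prints "processed: refund ORD-8821"
```

The key design point is the `finally` block: a trace record is enqueued on both the success and the failure path, while the original return value or exception passes through untouched.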
Step 3 — Write a golden dataset
Create golden_v1.yaml. This is your ground truth — what correct agent behavior looks like for each case.
project: payments-agent
version: 1
cases:
  - id: refund_approved
    input:
      query: process refund for order ORD-8821
    expected_output_contains:
      - refund
      - approved
    expected_tool_calls:
      - lookup_refund
    max_latency_ms: 3000
  - id: balance_check
    input:
      query: what is my current balance
    expected_output_contains:
      - balance
      - amount
    max_latency_ms: 2000
  - id: dispute_filed
    input:
      query: I was charged twice, dispute this charge
    expected_output_contains:
      - dispute
      - filed
    expected_tool_calls:
      - classify_dispute
    max_latency_ms: 5000
The expected_output_contains list is the key. Every keyword must appear in the output. If your refund agent stops saying "approved" after a prompt change, that case fails immediately.
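The matching logic is simple enough to sketch in a few lines. This is an illustrative version assuming a case-insensitive substring check; the exact matching rules CortexOps applies may differ.

```python
def score_keywords(output: str, expected: list[str]) -> float:
    """Return the fraction of expected keywords found in the output
    (case-insensitive substring match). 1.0 means every keyword hit."""
    if not expected:
        return 1.0
    text = output.lower()
    hits = sum(1 for kw in expected if kw.lower() in text)
    return hits / len(expected)

healthy = "Your refund for ORD-8821 has been approved."
regressed = "I can look into whether a refund is possible."

print(score_keywords(healthy, ["refund", "approved"]))    # 1.0
print(score_keywords(regressed, ["refund", "approved"]))  # 0.5
```

The regressed output still mentions "refund", so a naive "did it respond?" test passes. The keyword check does not.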
Step 4 — Run the eval locally
from cortexops import EvalSuite
results = EvalSuite.run(
    dataset="golden_v1.yaml",
    agent=agent,
    verbose=True,
    fail_on="task_completion < 0.90",
)
print(results.summary())
When your agent is healthy you see this:
[1/3] refund_approved ... pass (100)
[2/3] balance_check ... pass (100)
[3/3] dispute_filed ... pass (94)
CortexOps eval — payments-agent
Cases : 3 (3 passed, 0 failed)
Task completion : 100.0%
Tool accuracy : 100.0/100
Latency p50/p95 : 287ms / 1,240ms
When the regression is present — the one-word prompt change — you see this:
[1/3] refund_approved ... FAIL (50)
[2/3] balance_check ... pass (100)
[3/3] dispute_filed ... pass (94)
CortexOps eval — payments-agent
Cases : 3 (2 passed, 1 failed)
Task completion : 66.6%
Failed cases:
- refund_approved: OUTPUT_FORMAT (score 50)
EvalThresholdError: task_completion=0.666 < 0.9 (project=payments-agent)
Gate fires. Exit code 1. PR blocked.
Step 5 — The 5 metrics
CortexOps runs these automatically on every case without any configuration.
task_completion checks whether the output contains all expected keywords. This is the primary signal. A refund agent that stops saying "approved" after a prompt change fails this metric instantly.
tool_accuracy checks whether the right tools were called. Critical for multi-step payment flows where tool sequence matters. If lookup_refund is skipped, the case fails regardless of what the output says.
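An ordered tool-call check of this kind boils down to a subsequence test: every expected tool must appear in the actual call list, in order, with other calls allowed in between. An illustrative sketch (the actual CortexOps matching rules may be stricter):

```python
def tools_called_in_order(actual: list[str], expected: list[str]) -> bool:
    """Check that every expected tool appears in the actual call list,
    in order (other calls may be interleaved): a subsequence check.
    Membership tests on the iterator consume it, which enforces order."""
    it = iter(actual)
    return all(tool in it for tool in expected)

# lookup_refund ran before approve_refund, with another call in between: passes
print(tools_called_in_order(
    ["lookup_refund", "check_policy", "approve_refund"],
    ["lookup_refund", "approve_refund"],
))  # True

# lookup_refund was skipped entirely: fails
print(tools_called_in_order(["approve_refund"], ["lookup_refund"]))  # False
```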
latency checks whether the agent responded within max_latency_ms. A refund that takes 30 seconds is not a working refund in production.
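Per-case latencies also roll up into the p50/p95 figures shown in the eval summary. Those percentiles can be computed with the stdlib; this sketch uses `statistics.quantiles` with the inclusive method (an assumption on my part, since small golden sets need an estimator that behaves sensibly with few samples):

```python
import statistics

def latency_summary(samples_ms: list[float]) -> tuple[float, float]:
    """Compute p50/p95 over per-case latencies. method='inclusive'
    interpolates between observed values, so it works on small sets."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return statistics.median(samples_ms), cuts[94]  # cut 95 of 99 = p95

p50, p95 = latency_summary([212, 287, 1240, 301, 264])
print(f"p50={p50:.0f}ms p95={p95:.0f}ms")
```

Note that p95 over a handful of cases is dominated by the single slowest run, which is exactly what you want a gate to notice.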
hallucination detects fabricated dates, false capability claims, and prohibited content patterns. Built in, no extra configuration, catches the most common LLM failure modes that break compliance in financial applications.
LLM judge uses GPT-4o to score open-ended outputs against natural language criteria you define. For cases where keyword matching is not enough — tone, empathy, completeness. Falls back to heuristic scoring automatically if OpenAI is unavailable so your eval never fails due to a third-party outage.
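The fallback behavior is worth sketching, because it is what keeps the gate deterministic. The shape is: try the LLM call, and on any failure score with a local heuristic instead. Everything below (`judge_score`, `heuristic_score`, the stub judge) is illustrative, not the CortexOps implementation:

```python
def heuristic_score(output: str, criteria: str) -> int:
    """Crude local fallback: reward non-empty, reasonably detailed answers."""
    return 70 if len(output.split()) >= 5 else 30

def judge_score(output: str, criteria: str, call_llm) -> int:
    """Try the LLM judge; fall back to the heuristic on any failure,
    so a third-party outage can never fail the eval run."""
    try:
        return call_llm(output, criteria)
    except Exception:
        return heuristic_score(output, criteria)

def flaky_llm(output, criteria):
    raise ConnectionError("OpenAI unreachable")

# Judge is down: the heuristic keeps the eval running
print(judge_score("Refund approved, funds arrive in 3-5 days.",
                  "empathetic tone", flaky_llm))  # 70
```

The trade-off is explicit: during an outage you get a coarser score, but you never get a red CI run caused by someone else's infrastructure.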
Step 6 — Add to GitHub Actions
name: CortexOps eval gate
on: [push, pull_request]
jobs:
  eval-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install cortexops
      - name: Run eval gate
        run: |
          cortexops eval run \
            --dataset golden_v1.yaml \
            --fail-on "task_completion < 0.90"
Every PR now triggers the eval. If task completion drops below 90% the merge is blocked. Not flagged. Not logged somewhere you will look at in two weeks. Blocked.
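The blocking mechanism itself is nothing exotic: a threshold comparison and a process exit code, which GitHub Actions treats as step failure. A minimal sketch of that logic (illustrative; `enforce_gate` is a hypothetical name, and the real CLI would pass the result to `sys.exit()`):

```python
def enforce_gate(scores: dict[str, float], metric: str, threshold: float) -> int:
    """Return the exit code CI should see: 0 when the metric clears
    the threshold, 1 to fail the step and block the merge."""
    value = scores[metric]
    if value < threshold:
        print(f"EvalThresholdError: {metric}={value:.3f} < {threshold}")
        return 1
    return 0

print(enforce_gate({"task_completion": 1.0}, "task_completion", 0.90))    # 0
print(enforce_gate({"task_completion": 2 / 3}, "task_completion", 0.90))  # 1
```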
The one-word change that cost us at 2am would have hit this gate. The PR would have been blocked. The regression would never have shipped.
Edge cases I tested before trusting this in production
- Empty agent output (an empty dict). Scored correctly, no crash, status COMPLETED.
- Agent raises an exception mid-run. Status captured as FAILED, failure_kind set to UNKNOWN, exception detail stored in the trace. The eval suite does not crash.
- 16KB output from a verbose LLM response. Scored correctly with no performance issues.
- Unicode and CJK characters in output. Keyword matching works correctly across character sets.
- Five concurrent eval runs using Python threading. All five pass with no race conditions.
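That last check is easy to reproduce with a few lines of stdlib concurrency. This is an illustrative harness, not the CortexOps test suite; `run_case` is a hypothetical stand-in for one eval run:

```python
from concurrent.futures import ThreadPoolExecutor

def run_case(case_id: int) -> dict:
    """Stand-in for one eval run: score a fake agent output
    against the two keywords a refund case expects."""
    output = f"refund approved for case {case_id}"
    keywords = ["refund", "approved"]
    hits = sum(kw in output for kw in keywords)
    return {"case": case_id, "score": 100 * hits // len(keywords)}

# Five eval runs in parallel, as in the concurrency edge case above
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(run_case, range(5)))

print(results)  # five cases, each scored 100
```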
The SDK is built to never break your agent. Tracing failures are swallowed silently. The agent always returns normally even if the eval infrastructure is unreachable.
**Optional — live observability**
If you want traces stored, a live dashboard, and Slack alerts when production regresses, point the SDK at the hosted API:
tracer = CortexTracer(
    project="payments-agent",
    api_key="cxo-...",
    api_url="https://api.getcortexops.com",
    environment="production",
)
The dashboard at app.getcortexops.com shows a live trace feed with status, latency, and failure kind per run. Click any trace row and a waterfall panel slides in showing exactly which node took how long, which tools were called, and what the raw JSON output was. That is how you go from a Slack alert to root cause in 30 seconds instead of digging through CloudWatch for an hour.
Pro tier is $49 per seat per month flat. No per-trace billing. 14-day free trial. Cancel anytime via the Stripe dashboard.
**The free tier is real**
Everything you need to catch the 2am incident is free forever.
Full SDK. Unlimited local eval runs. YAML golden dataset format. GitHub Actions CI gate. All five metrics. CLI tool. MIT licensed. Full source on GitHub.
The free tier is what I would have needed that Tuesday night. The Pro tier adds the hosted observability layer for teams that want production visibility without building their own infrastructure.
**What is next for me**
I am a Senior AI Engineer at PayPal. I have spent five years building production ML systems for payments — anomaly detection, fraud signals, real-time scoring. CortexOps came out of real production pain, not a side project looking for a problem.
I am looking for five design partners. Free Pro access in exchange for 30 minutes on a call telling me what is missing. If you are shipping LangGraph or CrewAI agents to production — especially in fintech, payments, compliance, or any domain where a wrong output has real consequences — I want to talk to you.
GitHub: github.com/ashishodu2023/cortexops
Docs: docs.getcortexops.com
Install: pip install cortexops
Website: getcortexops.com
If you have ever been paged at 2am over an agent regression, this is the tool that stops it from happening again.