Building LLM agents that actually work reliably is hard. Really hard.
You've probably experienced this cycle: your agent works perfectly in three test cases, fails spectacularly in production, you tweak a prompt, it fixes one problem but creates two others. Rinse and repeat.
This is where Opik comes in. Built by Comet, Opik is an open-source platform that brings systematic evaluation and optimization to LLM development. Let me show you how to use it to build better agents.
Why Traditional Testing Fails for Agents
Before diving into Opik, let's understand why agent testing is uniquely challenging:
- **Non-deterministic outputs** - The same input can produce different responses
- **Multi-step reasoning** - Errors compound across tool calls and decision points
- **No single "right answer"** - Multiple valid approaches exist
- **Integration complexity** - Agents interact with real APIs and databases
Traditional unit tests can't capture this complexity. You need a different approach.
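To make this concrete, here's a toy illustration in plain Python (no LLM involved) of why exact-match assertions break down: both responses below are perfectly valid, but only one survives a traditional unit test. The responses and the keyword check are made up for illustration.

```python
# Two equally valid agent responses to the same refund request
response_a = "Your refund for order #12345 has been processed."
response_b = "I've processed the refund on order #12345 for you."

expected = "Your refund for order #12345 has been processed."

# Traditional exact-match assertion: passes for A, fails for B
exact_match = [r == expected for r in (response_a, response_b)]
print(exact_match)  # [True, False]

# A criteria-based check (the kind of metric Opik formalizes) accepts both
def mentions_refund_and_order(response: str) -> bool:
    return "refund" in response.lower() and "#12345" in response

print([mentions_refund_and_order(r) for r in (response_a, response_b)])  # [True, True]
```

Metric-based evaluation asks "does this response satisfy our criteria?" instead of "is this response byte-identical to the one we saw last time?"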
Enter Opik: Evaluation-First Development
Opik treats evaluation as a first-class concern. The core workflow:
Collect traces → Define metrics → Run evaluations → Optimize → Deploy
Let me walk through a practical example of optimizing a customer support agent.
Example: Building a Resilient Support Agent
We'll build an agent that handles refund requests. It needs to check order history, verify refund eligibility, and process requests - all while maintaining a helpful tone.
Step 1: Instrument Your Agent
First, add Opik instrumentation to capture everything:
```python
from opik import opik_context, track
from opik.integrations.openai import track_openai
import openai

# Track OpenAI calls automatically
openai_client = track_openai(openai.OpenAI())

class SupportAgent:
    @track(name="process_refund_request")
    def process(self, user_message: str, user_id: str):
        # Get conversation history
        history = self.get_conversation_history(user_id)

        # Track this as a conversation
        opik_context.update_current_trace(
            name="customer_support",
            metadata={
                "user_id": user_id,
                "conversation_length": len(history),
            },
        )

        # Step 1: Understand intent
        intent = self.classify_intent(user_message)

        # Step 2: If refund-related, check eligibility
        eligibility = None
        if intent == "refund":
            order_info = self.check_order_history(user_id)
            # Decorate check_refund_eligibility with @track(type="tool")
            # so the tool call appears as its own span in the trace
            eligibility = self.check_refund_eligibility(order_info)

        # Step 3: Generate response
        response = self.generate_response(intent, eligibility)
        return response
```
Step 2: Define What "Good" Looks Like
This is where Opik shines. Instead of writing brittle assertions, define metrics that capture agent quality:
```python
from opik.evaluation.metrics import (
    IsJson,
    ContainsAny,
    Hallucination,
    ToolCallCorrectness,
    BaseMetric,
)

class ToneAppropriateness(BaseMetric):
    """Custom metric for customer service tone"""

    def __init__(self, min_score: int = 4):
        self.min_score = min_score

    def evaluate(self, output: str, reference: str = None):
        # Use an LLM judge to evaluate tone (llm_client is your own LLM wrapper)
        prompt = f"""
        Rate the professionalism and helpfulness of this support response (1-5):
        Response: {output}
        Return only a number.
        """
        score = int(llm_client.complete(prompt))
        return {
            "score": score,
            "reason": f"Tone rated {score}/5 (minimum: {self.min_score})",
            "name": "tone_appropriateness",
        }

# Define evaluation criteria
metrics = [
    Hallucination(threshold=0.3),  # Penalize making up facts
    ContainsAny(["refund", "credit", "process"], min_count=1),  # Keywords present
    ToolCallCorrectness(),  # Tools used appropriately
    ToneAppropriateness(min_score=4),
]
```
Step 3: Create a Test Dataset
Good evaluations need good data. Opik lets you create datasets from production traces:
```python
from opik import Opik

client = Opik()
dataset = client.create_dataset("refund_requests")

# Add edge cases you've encountered
dataset.insert([
    {
        "input": "I want a refund for order #12345",
        "expected_output": "Check eligibility and process if valid",
        "user_id": "user_1",
        "order_exists": True,
        "eligible": True,
    },
    {
        "input": "Give me my money back!!!",  # Emotional customer
        "expected_output": "De-escalate and check order",
        "user_id": "user_2",
        "order_exists": True,
        "eligible": False,  # Past return window
    },
    {
        "input": "Refund for order that never arrived",
        "expected_output": "Check delivery status, offer replacement",
        "user_id": "user_3",
        "order_exists": True,
        "eligible": True,
    },
])
```
Step 4: Run Systematic Evaluations
Now the magic happens. Run your agent against the dataset and Opik automatically evaluates each response:
```python
from opik.evaluation import evaluate

def evaluation_task(item):
    agent = SupportAgent()
    response = agent.process(item["input"], item["user_id"])
    return {
        "output": response,
        "reference": item["expected_output"],
        "metadata": {"user_id": item["user_id"]},
    }

results = evaluate(
    dataset="refund_requests",
    task=evaluation_task,
    metrics=metrics,
)

print(f"Overall score: {results.score}")
print(f"Failed examples: {results.failures}")
```
Step 5: Identify Failure Patterns
Here's where you get real insights. Opik's dashboard shows you:
- **Low-scoring traces** - Which conversations performed poorly
- **Metric breakdowns** - Is tone consistently bad? Tool usage failing?
- **Clustering** - Similar failures grouped together
In my experience, you'll typically find patterns like:
1. **Tool call errors**: Agent tries to process refunds without checking eligibility
2. **Tone failures**: Responses become robotic when handling angry customers
3. **Context loss**: Agent forgets conversation history after long exchanges
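As a rough sketch of that triage step, here's how grouping failures by metric might look once you pull per-trace scores into plain Python. The trace records, metric names, and thresholds below are hypothetical, not Opik's API - they just mirror the metrics defined earlier:

```python
from collections import defaultdict

# Hypothetical evaluation results: one record per trace, with per-metric scores
results = [
    {"trace_id": "t1", "hallucination": 0.1, "tone_appropriateness": 2},
    {"trace_id": "t2", "hallucination": 0.6, "tone_appropriateness": 5},
    {"trace_id": "t3", "hallucination": 0.2, "tone_appropriateness": 1},
]

# Thresholds mirroring the metrics defined earlier:
# hallucination must stay below 0.3, tone must reach at least 4
thresholds = {"hallucination": ("max", 0.3), "tone_appropriateness": ("min", 4)}

failures = defaultdict(list)
for record in results:
    for metric, (kind, limit) in thresholds.items():
        value = record[metric]
        failed = value > limit if kind == "max" else value < limit
        if failed:
            failures[metric].append(record["trace_id"])

print(dict(failures))
# {'tone_appropriateness': ['t1', 't3'], 'hallucination': ['t2']}
```

Even this crude grouping tells you whether you have one systemic problem or several unrelated ones, which decides what you fix first.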
Step 6: Optimize Iteratively
Now you optimize based on evidence, not intuition:
Iteration 1: Fix tool usage
```python
# Problem: Agent called process_refund before eligibility check
# Solution: Explicit system prompt
system_prompt = """
You are a customer support agent. Follow this order:
1. ALWAYS check eligibility before processing refunds
2. Call check_eligibility() first
3. Only call process_refund() if eligibility confirmed
"""
```
Iteration 2: Fix tone for edge cases
```python
# Problem: Angry customers get cold, scripted responses
# Solution: Tone guidelines in system prompt
tone_guidelines = """
For frustrated customers:
- Acknowledge their frustration: "I understand this is frustrating..."
- Show empathy before solving
- Use softer language: "I'd be happy to help" vs "I will help"
"""
```
Iteration 3: Add safety checks
```python
# Problem: Agent hallucinated refund policies
# Solution: Add factual grounding
@track(name="check_policy")
def get_policy(order_date):
    # Pull from actual database, not model memory
    return db.get_refund_policy(order_date)
```
Step 7: Continuous Evaluation
Don't just evaluate once. Set up continuous evaluation:
```yaml
# .github/workflows/evaluate-agent.yml
name: Evaluate Agent
on: [push, pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run evaluations
        # evaluate() logs results to Opik as it runs
        run: python evaluate_agent.py
      - name: Compare with baseline
        # Fails the build if the score drops more than 0.05 below the baseline
        run: python compare_scores.py
```
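The compare-with-baseline step assumes a small helper script. Here's one possible shape for it, with the score loaders stubbed out; in practice you'd read them from your latest evaluation results and a committed baseline file (the file name and function names are illustrative):

```python
# compare_scores.py - a possible baseline gate for CI
import sys

TOLERANCE = 0.05  # allow small fluctuations from non-deterministic outputs

def get_current_score() -> float:
    # Stub: read from the evaluation results your CI run just produced
    return 0.82

def get_baseline_score() -> float:
    # Stub: read from a committed baseline file (e.g. baseline.json)
    return 0.80

def main() -> int:
    current, baseline = get_current_score(), get_baseline_score()
    if current < baseline - TOLERANCE:
        print(f"Regression: {current:.2f} is below baseline {baseline:.2f} - {TOLERANCE}")
        return 1
    print(f"OK: {current:.2f} vs baseline {baseline:.2f}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The tolerance matters: with non-deterministic outputs, an exact `current >= baseline` gate would fail CI on noise alone.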
Real Impact: What You Gain
After implementing this workflow with Opik, I've consistently seen:
- **50-70% reduction in regression bugs** - Each change is evaluated against 100+ test cases automatically
- **2-3x faster iteration cycles** - No more manual testing of every edge case
- **Clear success metrics** - You know exactly when your agent is ready for production
- **Traceability** - When something fails in production, you can trace it back to the exact prompt and tool call
Getting Started
- Install Opik:

  ```shell
  pip install opik
  ```

- Start the platform (local or cloud):

  ```shell
  opik local start
  # or sign up at comet.com/opik
  ```

- Instrument your first agent:

  ```python
  import opik
  opik.configure()
  ```

- Run your first evaluation:

  ```python
  from opik.evaluation import evaluate
  # Follow the examples above
  ```
The Bottom Line
Building reliable LLM agents isn't about perfect prompts or the latest model. It's about having a systematic way to measure quality, identify issues, and verify improvements.
Opik gives you that system. It's not magic - you still need to iterate and think critically about your agent's behavior. But it transforms agent optimization from guesswork into engineering.
The LLM space is moving fast. The teams that win won't be the ones with the cleverest prompts - they'll be the ones who can iterate fastest while maintaining quality. That's what Opik enables.
Your turn: Pick one agent you're currently building or maintaining. Instrument it with Opik this week. Run one evaluation. I guarantee you'll find something you didn't expect.
Have you tried systematic evaluation for your agents? What challenges are you facing? Let me know in the comments.