DEV Community

ruchika bhat


Opik: Your Agent's Black Box Flight Recorder

Building LLM agents that actually work reliably is hard. Really hard.

You've probably experienced this cycle: your agent works perfectly in three test cases, fails spectacularly in production, you tweak a prompt, it fixes one problem but creates two others. Rinse and repeat.

This is where Opik comes in. Built by Comet, Opik is an open-source platform that brings systematic evaluation and optimization to LLM development. Let me show you how to use it to build better agents.

Why Traditional Testing Fails for Agents

Before diving into Opik, let's understand why agent testing is uniquely challenging:

  1. Non-deterministic outputs - The same input can produce different responses
  2. Multi-step reasoning - Errors compound across tool calls and decision points
  3. No single "right answer" - Multiple valid approaches exist
  4. Integration complexity - Agents interact with real APIs and databases

Traditional unit tests can't capture this complexity. You need a different approach.
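To see concretely why exact-match assertions break down, compare a traditional unit test with a criteria-based check. This is a toy sketch; both responses are invented, equally valid phrasings:

```python
def passes_exact_match(response: str) -> bool:
    # Traditional unit test: brittle against valid paraphrases
    return response == "Your refund has been processed."

def passes_criteria(response: str) -> bool:
    # Agent-style check: assert properties of the answer, not exact strings
    text = response.lower()
    return "refund" in text and ("processed" in text or "issued" in text)

a = "Your refund has been processed."
b = "Good news! I've issued your refund."  # equally valid answer

# The exact-match test accepts only one phrasing; the criteria check accepts both.
```

Opik's metrics generalize this idea: instead of hand-rolled keyword checks, you score properties like hallucination, relevance, and tone.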

Enter Opik: Evaluation-First Development

Opik treats evaluation as a first-class concern. The core workflow:

Collect traces → Define metrics → Run evaluations → Optimize → Deploy

Let me walk through a practical example of optimizing a customer support agent.

Example: Building a Resilient Support Agent

We'll build an agent that handles refund requests. It needs to check order history, verify refund eligibility, and process requests - all while maintaining a helpful tone.

Step 1: Instrument Your Agent

First, add Opik instrumentation to capture everything:

from opik import opik_context, track
from opik.integrations.openai import track_openai
import openai

# Track OpenAI calls automatically
openai_client = track_openai(openai.OpenAI())

class SupportAgent:
    @track(name="process_refund_request")
    def process(self, user_message: str, user_id: str):
        # Get conversation history
        history = self.get_conversation_history(user_id)

        # Attach metadata to the current trace
        opik_context.update_current_trace(
            name="customer_support",
            metadata={
                "user_id": user_id,
                "conversation_length": len(history)
            }
        )

        # Step 1: Understand intent
        intent = self.classify_intent(user_message)

        # Step 2: If refund-related, check eligibility.
        # Decorating helpers like check_refund_eligibility with @track
        # records each tool call (inputs and outputs) as a child span.
        eligibility = None
        if intent == "refund":
            order_info = self.check_order_history(user_id)
            eligibility = self.check_refund_eligibility(order_info)

        # Step 3: Generate response
        response = self.generate_response(intent, eligibility)
        return response

Step 2: Define What "Good" Looks Like

This is where Opik shines. Instead of writing brittle assertions, define metrics that capture agent quality:

from opik.evaluation.metrics import Hallucination, Contains
from opik.evaluation.metrics import base_metric, score_result

class ToneAppropriateness(base_metric.BaseMetric):
    """Custom LLM-as-judge metric for customer service tone."""

    def __init__(self, name: str = "tone_appropriateness"):
        super().__init__(name=name)

    def score(self, output: str, **ignored_kwargs) -> score_result.ScoreResult:
        # llm_client is your own wrapper around whichever model
        # you use as the judge
        prompt = f"""
        Rate the professionalism and helpfulness of this support response (1-5):

        Response: {output}

        Return only a number.
        """

        rating = int(llm_client.complete(prompt))
        return score_result.ScoreResult(
            value=rating / 5,  # normalize to 0-1
            name=self.name,
            reason=f"Tone rated {rating}/5"
        )

# Define evaluation criteria
metrics = [
    Hallucination(),        # LLM judge that penalizes made-up facts
    Contains(),             # Checks the reference text appears in the output
    ToneAppropriateness()   # Tool-call correctness can be another custom metric
]
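Stripped of the SDK plumbing, the LLM-judge pattern above is: build a rubric prompt, get a numeric rating back, normalize it, and return a score with a reason. Here's a minimal stdlib sketch with the judge stubbed out as a keyword heuristic standing in for a real model call (the marker list and scoring shape are illustrative, not Opik's):

```python
def judge_tone(response: str) -> int:
    """Stub judge: a real implementation would send a rubric prompt
    to an LLM and parse the returned number."""
    polite_markers = ["happy to help", "understand", "sorry", "thank"]
    hits = sum(1 for m in polite_markers if m in response.lower())
    return min(1 + hits, 5)  # clamp to a 1-5 scale

def tone_score(response: str) -> dict:
    rating = judge_tone(response)
    return {
        "value": rating / 5,  # normalized 0-1, like ScoreResult.value
        "name": "tone_appropriateness",
        "reason": f"Tone rated {rating}/5",
    }
```

A deterministic stub like this is also handy for unit-testing your evaluation harness itself without burning judge-model tokens.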

Step 3: Create a Test Dataset

Good evaluations need good data. Opik lets you create datasets from production traces:

from opik import Opik

client = Opik()
dataset = client.get_or_create_dataset(name="refund_requests")

# Add edge cases you've encountered
dataset.insert([
    {
        "input": "I want a refund for order #12345",
        "expected_output": "Check eligibility and process if valid",
        "user_id": "user_1",
        "order_exists": True,
        "eligible": True
    },
    {
        "input": "Give me my money back!!!",  # Emotional customer
        "expected_output": "De-escalate and check order",
        "user_id": "user_2", 
        "order_exists": True,
        "eligible": False  # Past return window
    },
    {
        "input": "Refund for order that never arrived",
        "expected_output": "Check delivery status, offer replacement",
        "user_id": "user_3",
        "order_exists": True,
        "eligible": True
    }
])
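You don't have to hand-write every row: low-scoring production traces make the best regression tests. A stdlib sketch of harvesting them, assuming a hypothetical exported trace shape (the field names are illustrative, not Opik's schema):

```python
# Hypothetical trace records exported from production logging
traces = [
    {"input": "Refund order #777", "output": "...", "score": 0.35},
    {"input": "Where is my order?", "output": "...", "score": 0.92},
    {"input": "Money back NOW", "output": "...", "score": 0.41},
]

def harvest_edge_cases(traces: list[dict], threshold: float = 0.5) -> list[dict]:
    """Turn low-scoring traces into dataset rows awaiting a reference answer."""
    return [
        {"input": t["input"], "expected_output": "TODO: write ideal answer"}
        for t in traces
        if t["score"] < threshold
    ]

rows = harvest_edge_cases(traces)
# rows can then be passed to dataset.insert(rows)
```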

Step 4: Run Systematic Evaluations

Now the magic happens. Run your agent against the dataset and Opik automatically evaluates each response:

from opik import Opik
from opik.evaluation import evaluate

client = Opik()
dataset = client.get_dataset(name="refund_requests")

def evaluation_task(item):
    agent = SupportAgent()
    response = agent.process(item["input"], item["user_id"])
    # Keys here map onto the inputs your metrics expect
    return {
        "output": response,
        "reference": item["expected_output"],
    }

results = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=metrics,
    experiment_name="support-agent-v1"
)

# Per-item scores and failures appear in the Opik dashboard;
# results.test_results holds them programmatically.
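Whether you read the results from Opik or a homegrown harness, the aggregation is the same bookkeeping: average each metric across items and collect the items that fall below a pass threshold. A stdlib sketch over made-up per-item results (not Opik's result schema):

```python
from statistics import mean

# Illustrative per-item metric scores
item_scores = [
    {"id": "user_1", "hallucination": 0.9, "tone": 0.8},
    {"id": "user_2", "hallucination": 0.7, "tone": 0.4},
    {"id": "user_3", "hallucination": 0.2, "tone": 0.9},
]

def summarize(items: list[dict], threshold: float = 0.5):
    """Return per-metric averages and the ids of failing items."""
    metric_names = [k for k in items[0] if k != "id"]
    averages = {m: mean(i[m] for i in items) for m in metric_names}
    failures = [
        i["id"] for i in items
        if any(i[m] < threshold for m in metric_names)
    ]
    return averages, failures

averages, failures = summarize(item_scores)
```

Note that a single overall number hides which metric is dragging the score down, which is exactly what the per-metric breakdown in the next step surfaces.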

Step 5: Identify Failure Patterns

Here's where you get real insights. Opik's dashboard shows you:

  • Low-scoring traces - Which conversations performed poorly
  • Metric breakdowns - Is tone consistently bad? Tool usage failing?
  • Clustering - Similar failures grouped together

In my experience, you'll typically find patterns like:

1. Tool call errors: Agent tries to process refunds without checking eligibility
2. Tone failures: Responses become robotic when handling angry customers
3. Context loss: Agent forgets conversation history after long exchanges
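Once you've labeled failing traces during triage, ranking the failure modes tells you which fix pays off first. A tiny sketch (the labels are hypothetical):

```python
from collections import Counter

# Hypothetical failure labels attached while triaging low-scoring traces
failure_labels = [
    "tool_order", "tone", "tool_order",
    "context_loss", "tool_order", "tone",
]

# Most common failure mode first - that's the next iteration's target
ranked = Counter(failure_labels).most_common()
```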

Step 6: Optimize Iteratively

Now you optimize based on evidence, not intuition:

Iteration 1: Fix tool usage

# Problem: Agent called process_refund before eligibility check
# Solution: Explicit system prompt

system_prompt = """
You are a customer support agent. Follow this order:
1. ALWAYS check eligibility before processing refunds
2. Call check_eligibility() first
3. Only call process_refund() if eligibility confirmed
"""
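Prompt instructions alone can't guarantee ordering, so it's worth enforcing the rule in code as well. A minimal sketch of a guard that refuses to process a refund until eligibility has passed (class and method names, and the 30-day window, are hypothetical):

```python
class RefundGuard:
    """Rejects process_refund unless eligibility was checked and passed."""

    def __init__(self):
        self.eligibility_confirmed = False

    def check_eligibility(self, order_info: dict) -> bool:
        # Hypothetical policy: refundable within 30 days of delivery
        eligible = order_info.get("days_since_delivery", 999) <= 30
        self.eligibility_confirmed = eligible
        return eligible

    def process_refund(self, order_id: str) -> str:
        if not self.eligibility_confirmed:
            raise RuntimeError("check_eligibility must pass before process_refund")
        return f"refund issued for {order_id}"
```

With a guard like this, an out-of-order tool call becomes a loud trace error instead of a silent bad refund, and the evaluation metric catches it.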

Iteration 2: Fix tone for edge cases

# Problem: Angry customers get cold, scripted responses
# Solution: Tone guidelines in system prompt

tone_guidelines = """
For frustrated customers:
- Acknowledge their frustration: "I understand this is frustrating..."
- Show empathy before solving
- Use softer language: "I'd be happy to help" vs "I will help"
"""

Iteration 3: Add safety checks

# Problem: Agent hallucinated refund policies
# Solution: Add factual grounding

@track(name="check_policy")
def get_policy(order_date):
    # Pull from actual database, not model memory
    return db.get_refund_policy(order_date)

Step 7: Continuous Evaluation

Don't just evaluate once. Set up continuous evaluation:

# GitHub Action / CI Pipeline
# .github/workflows/evaluate-agent.yml

name: Evaluate Agent
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # evaluate() logs results to Opik as part of the run
      - name: Run evaluations
        run: python evaluate_agent.py

      # Fail the job if the score regresses past the baseline
      - name: Compare with baseline
        run: python compare_with_baseline.py
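The baseline-comparison step needs a small script that exits nonzero on regression so CI fails the job. A sketch, assuming your evaluation run writes score files (the file names, the `overall` key, and the 0.05 tolerance are all assumptions to adapt):

```python
import json
import sys

def check_regression(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True if the current score hasn't regressed past the tolerance."""
    return current >= baseline - tolerance

def main(current_path="current_scores.json", baseline_path="baseline_scores.json"):
    # Hypothetical artifacts written by the evaluation run
    current = json.load(open(current_path))["overall"]
    baseline = json.load(open(baseline_path))["overall"]
    if not check_regression(current, baseline):
        # sys.exit with a string prints it and exits with status 1,
        # which fails the CI job
        sys.exit(f"Regression: {current:.2f} vs baseline {baseline:.2f}")
```

The tolerance matters: agent scores are noisy run to run, so gating on an exact score match would fail builds randomly.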

Real Impact: What You Gain

After implementing this workflow with Opik, I've consistently seen:

50-70% reduction in regression bugs - Each change is evaluated against 100+ test cases automatically

2-3x faster iteration cycles - No more manual testing of every edge case

Clear success metrics - You know exactly when your agent is ready for production

Traceability - When something fails in production, you can trace it back to the exact prompt and tool call

Getting Started

  1. Install Opik:

pip install opik

  2. Start the platform (self-hosted or cloud):

# self-host by running the Docker Compose stack from the Opik repo,
# or sign up for the hosted version at comet.com/opik

  3. Configure the SDK and instrument your first agent:

import opik
opik.configure()

  4. Run your first evaluation:

from opik.evaluation import evaluate
# Follow the examples above

The Bottom Line

Building reliable LLM agents isn't about perfect prompts or the latest model. It's about having a systematic way to measure quality, identify issues, and verify improvements.

Opik gives you that system. It's not magic - you still need to iterate and think critically about your agent's behavior. But it transforms agent optimization from guesswork into engineering.

The LLM space is moving fast. The teams that win won't be the ones with the cleverest prompts - they'll be the ones who can iterate fastest while maintaining quality. That's what Opik enables.

Your turn: Pick one agent you're currently building or maintaining. Instrument it with Opik this week. Run one evaluation. I guarantee you'll find something you didn't expect.

Have you tried systematic evaluation for your agents? What challenges are you facing? Let me know in the comments.
