DEV Community

ruchika bhat


Opik: Your Agent's Black Box Flight Recorder

Building LLM agents that actually work reliably is hard. Really hard.

You've probably experienced this cycle: your agent works perfectly in three test cases, fails spectacularly in production, you tweak a prompt, it fixes one problem but creates two others. Rinse and repeat.

This is where Opik comes in. Built by Comet, Opik is an open-source platform that brings systematic evaluation and optimization to LLM development. Let me show you how to use it to build better agents.

Why Traditional Testing Fails for Agents

Before diving into Opik, let's understand why agent testing is uniquely challenging:

  1. Non-deterministic outputs - The same input can produce different responses
  2. Multi-step reasoning - Errors compound across tool calls and decision points
  3. No single "right answer" - Multiple valid approaches exist
  4. Integration complexity - Agents interact with real APIs and databases

Traditional unit tests can't capture this complexity. You need a different approach.
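To see concretely why exact-match assertions break down, compare a traditional unit test with a criteria-based check. This is a toy sketch; both responses are invented, equally valid phrasings:

```python
def passes_exact_match(response: str) -> bool:
    # Traditional unit test: brittle against valid paraphrases
    return response == "Your refund has been processed."

def passes_criteria(response: str) -> bool:
    # Agent-style check: assert properties of the answer, not exact strings
    text = response.lower()
    return "refund" in text and ("processed" in text or "issued" in text)

a = "Your refund has been processed."
b = "Good news! I've issued your refund."  # equally valid answer

# The exact-match test accepts only one phrasing; the criteria check accepts both.
```

Opik's metrics generalize this idea: instead of hand-rolled keyword checks, you score properties like hallucination, relevance, and tone.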

Enter Opik: Evaluation-First Development

Opik treats evaluation as a first-class concern. The core workflow:

Collect traces → Define metrics → Run evaluations → Optimize → Deploy

Let me walk through a practical example of optimizing a customer support agent.

Example: Building a Resilient Support Agent

We'll build an agent that handles refund requests. It needs to check order history, verify refund eligibility, and process requests - all while maintaining a helpful tone.

Step 1: Instrument Your Agent

First, add Opik instrumentation to capture everything:

from opik import opik_context, track
from opik.integrations.openai import track_openai
import openai

# Track OpenAI calls automatically
openai_client = track_openai(openai.OpenAI())

class SupportAgent:
    @track(name="process_refund_request")
    def process(self, user_message: str, user_id: str):
        # Get conversation history
        history = self.get_conversation_history(user_id)

        # Attach metadata to the current trace
        opik_context.update_current_trace(
            name="customer_support",
            metadata={
                "user_id": user_id,
                "conversation_length": len(history)
            }
        )

        # Step 1: Understand intent
        intent = self.classify_intent(user_message)

        # Step 2: If refund-related, check eligibility.
        # Decorating helpers like check_refund_eligibility with @track
        # records each tool call (inputs and outputs) as a child span.
        eligibility = None
        if intent == "refund":
            order_info = self.check_order_history(user_id)
            eligibility = self.check_refund_eligibility(order_info)

        # Step 3: Generate response
        response = self.generate_response(intent, eligibility)
        return response

Step 2: Define What "Good" Looks Like

This is where Opik shines. Instead of writing brittle assertions, define metrics that capture agent quality:

from opik.evaluation.metrics import Hallucination, Contains
from opik.evaluation.metrics import base_metric, score_result

class ToneAppropriateness(base_metric.BaseMetric):
    """Custom LLM-as-judge metric for customer service tone."""

    def __init__(self, name: str = "tone_appropriateness"):
        super().__init__(name=name)

    def score(self, output: str, **ignored_kwargs) -> score_result.ScoreResult:
        # llm_client is your own wrapper around whichever model
        # you use as the judge
        prompt = f"""
        Rate the professionalism and helpfulness of this support response (1-5):

        Response: {output}

        Return only a number.
        """

        rating = int(llm_client.complete(prompt))
        return score_result.ScoreResult(
            value=rating / 5,  # normalize to 0-1
            name=self.name,
            reason=f"Tone rated {rating}/5"
        )

# Define evaluation criteria
metrics = [
    Hallucination(),        # LLM judge that penalizes made-up facts
    Contains(),             # Checks the reference text appears in the output
    ToneAppropriateness()   # Tool-call correctness can be another custom metric
]
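Stripped of the SDK plumbing, the LLM-judge pattern above is: build a rubric prompt, get a numeric rating back, normalize it, and return a score with a reason. Here's a minimal stdlib sketch with the judge stubbed out as a keyword heuristic standing in for a real model call (the marker list and scoring shape are illustrative, not Opik's):

```python
def judge_tone(response: str) -> int:
    """Stub judge: a real implementation would send a rubric prompt
    to an LLM and parse the returned number."""
    polite_markers = ["happy to help", "understand", "sorry", "thank"]
    hits = sum(1 for m in polite_markers if m in response.lower())
    return min(1 + hits, 5)  # clamp to a 1-5 scale

def tone_score(response: str) -> dict:
    rating = judge_tone(response)
    return {
        "value": rating / 5,  # normalized 0-1, like ScoreResult.value
        "name": "tone_appropriateness",
        "reason": f"Tone rated {rating}/5",
    }
```

A deterministic stub like this is also handy for unit-testing your evaluation harness itself without burning judge-model tokens.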

Step 3: Create a Test Dataset

Good evaluations need good data. Opik lets you create datasets from production traces:

from opik import Opik

client = Opik()
dataset = client.get_or_create_dataset(name="refund_requests")

# Add edge cases you've encountered
dataset.insert([
    {
        "input": "I want a refund for order #12345",
        "expected_output": "Check eligibility and process if valid",
        "user_id": "user_1",
        "order_exists": True,
        "eligible": True
    },
    {
        "input": "Give me my money back!!!",  # Emotional customer
        "expected_output": "De-escalate and check order",
        "user_id": "user_2", 
        "order_exists": True,
        "eligible": False  # Past return window
    },
    {
        "input": "Refund for order that never arrived",
        "expected_output": "Check delivery status, offer replacement",
        "user_id": "user_3",
        "order_exists": True,
        "eligible": True
    }
])
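You don't have to hand-write every row: low-scoring production traces make the best regression tests. A stdlib sketch of harvesting them, assuming a hypothetical exported trace shape (the field names are illustrative, not Opik's schema):

```python
# Hypothetical trace records exported from production logging
traces = [
    {"input": "Refund order #777", "output": "...", "score": 0.35},
    {"input": "Where is my order?", "output": "...", "score": 0.92},
    {"input": "Money back NOW", "output": "...", "score": 0.41},
]

def harvest_edge_cases(traces: list[dict], threshold: float = 0.5) -> list[dict]:
    """Turn low-scoring traces into dataset rows awaiting a reference answer."""
    return [
        {"input": t["input"], "expected_output": "TODO: write ideal answer"}
        for t in traces
        if t["score"] < threshold
    ]

rows = harvest_edge_cases(traces)
# rows can then be passed to dataset.insert(rows)
```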

Step 4: Run Systematic Evaluations

Now the magic happens. Run your agent against the dataset and Opik automatically evaluates each response:

from opik import Opik
from opik.evaluation import evaluate

client = Opik()
dataset = client.get_dataset(name="refund_requests")

def evaluation_task(item):
    agent = SupportAgent()
    response = agent.process(item["input"], item["user_id"])
    # Keys here map onto the inputs your metrics expect
    return {
        "output": response,
        "reference": item["expected_output"],
    }

results = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=metrics,
    experiment_name="support-agent-v1"
)

# Per-item scores and failures appear in the Opik dashboard;
# results.test_results holds them programmatically.
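Whether you read the results from Opik or a homegrown harness, the aggregation is the same bookkeeping: average each metric across items and collect the items that fall below a pass threshold. A stdlib sketch over made-up per-item results (not Opik's result schema):

```python
from statistics import mean

# Illustrative per-item metric scores
item_scores = [
    {"id": "user_1", "hallucination": 0.9, "tone": 0.8},
    {"id": "user_2", "hallucination": 0.7, "tone": 0.4},
    {"id": "user_3", "hallucination": 0.2, "tone": 0.9},
]

def summarize(items: list[dict], threshold: float = 0.5):
    """Return per-metric averages and the ids of failing items."""
    metric_names = [k for k in items[0] if k != "id"]
    averages = {m: mean(i[m] for i in items) for m in metric_names}
    failures = [
        i["id"] for i in items
        if any(i[m] < threshold for m in metric_names)
    ]
    return averages, failures

averages, failures = summarize(item_scores)
```

Note that a single overall number hides which metric is dragging the score down, which is exactly what the per-metric breakdown in the next step surfaces.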

Step 5: Identify Failure Patterns

Here's where you get real insights. Opik's dashboard shows you:

  • Low-scoring traces - Which conversations performed poorly
  • Metric breakdowns - Is tone consistently bad? Tool usage failing?
  • Clustering - Similar failures grouped together

In my experience, you'll typically find patterns like:

1. Tool call errors: Agent tries to process refunds without checking eligibility
2. Tone failures: Responses become robotic when handling angry customers
3. Context loss: Agent forgets conversation history after long exchanges
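Once you've labeled failing traces during triage, ranking the failure modes tells you which fix pays off first. A tiny sketch (the labels are hypothetical):

```python
from collections import Counter

# Hypothetical failure labels attached while triaging low-scoring traces
failure_labels = [
    "tool_order", "tone", "tool_order",
    "context_loss", "tool_order", "tone",
]

# Most common failure mode first - that's the next iteration's target
ranked = Counter(failure_labels).most_common()
```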

Step 6: Optimize Iteratively

Now you optimize based on evidence, not intuition:

Iteration 1: Fix tool usage

# Problem: Agent called process_refund before eligibility check
# Solution: Explicit system prompt

system_prompt = """
You are a customer support agent. Follow this order:
1. ALWAYS check eligibility before processing refunds
2. Call check_eligibility() first
3. Only call process_refund() if eligibility confirmed
"""
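Prompt instructions alone can't guarantee ordering, so it's worth enforcing the rule in code as well. A minimal sketch of a guard that refuses to process a refund until eligibility has passed (class and method names, and the 30-day window, are hypothetical):

```python
class RefundGuard:
    """Rejects process_refund unless eligibility was checked and passed."""

    def __init__(self):
        self.eligibility_confirmed = False

    def check_eligibility(self, order_info: dict) -> bool:
        # Hypothetical policy: refundable within 30 days of delivery
        eligible = order_info.get("days_since_delivery", 999) <= 30
        self.eligibility_confirmed = eligible
        return eligible

    def process_refund(self, order_id: str) -> str:
        if not self.eligibility_confirmed:
            raise RuntimeError("check_eligibility must pass before process_refund")
        return f"refund issued for {order_id}"
```

With a guard like this, an out-of-order tool call becomes a loud trace error instead of a silent bad refund, and the evaluation metric catches it.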

Iteration 2: Fix tone for edge cases

# Problem: Angry customers get cold, scripted responses
# Solution: Tone guidelines in system prompt

tone_guidelines = """
For frustrated customers:
- Acknowledge their frustration: "I understand this is frustrating..."
- Show empathy before solving
- Use softer language: "I'd be happy to help" vs "I will help"
"""

Iteration 3: Add safety checks

# Problem: Agent hallucinated refund policies
# Solution: Add factual grounding

@track(name="check_policy")
def get_policy(order_date):
    # Pull from actual database, not model memory
    return db.get_refund_policy(order_date)

Step 7: Continuous Evaluation

Don't just evaluate once. Set up continuous evaluation:

# GitHub Action / CI Pipeline
# .github/workflows/evaluate-agent.yml

name: Evaluate Agent
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # evaluate() logs results to Opik as part of the run
      - name: Run evaluations
        run: python evaluate_agent.py

      # Fail the job if the score regresses past the baseline
      - name: Compare with baseline
        run: python compare_with_baseline.py
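The baseline-comparison step needs a small script that exits nonzero on regression so CI fails the job. A sketch, assuming your evaluation run writes score files (the file names, the `overall` key, and the 0.05 tolerance are all assumptions to adapt):

```python
import json
import sys

def check_regression(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True if the current score hasn't regressed past the tolerance."""
    return current >= baseline - tolerance

def main(current_path="current_scores.json", baseline_path="baseline_scores.json"):
    # Hypothetical artifacts written by the evaluation run
    current = json.load(open(current_path))["overall"]
    baseline = json.load(open(baseline_path))["overall"]
    if not check_regression(current, baseline):
        # sys.exit with a string prints it and exits with status 1,
        # which fails the CI job
        sys.exit(f"Regression: {current:.2f} vs baseline {baseline:.2f}")
```

The tolerance matters: agent scores are noisy run to run, so gating on an exact score match would fail builds randomly.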

Real Impact: What You Gain

After implementing this workflow with Opik, I've consistently seen:

50-70% reduction in regression bugs - Each change is evaluated against 100+ test cases automatically

2-3x faster iteration cycles - No more manual testing of every edge case

Clear success metrics - You know exactly when your agent is ready for production

Traceability - When something fails in production, you can trace it back to the exact prompt and tool call

Getting Started

  1. Install Opik:

pip install opik

  2. Start the platform (self-hosted or cloud):

# self-host by running the Docker Compose stack from the Opik repo,
# or sign up for the hosted version at comet.com/opik

  3. Configure the SDK and instrument your first agent:

import opik
opik.configure()

  4. Run your first evaluation:

from opik.evaluation import evaluate
# Follow the examples above

The Bottom Line

Building reliable LLM agents isn't about perfect prompts or the latest model. It's about having a systematic way to measure quality, identify issues, and verify improvements.

Opik gives you that system. It's not magic - you still need to iterate and think critically about your agent's behavior. But it transforms agent optimization from guesswork into engineering.

The LLM space is moving fast. The teams that win won't be the ones with the cleverest prompts - they'll be the ones who can iterate fastest while maintaining quality. That's what Opik enables.

Your turn: Pick one agent you're currently building or maintaining. Instrument it with Opik this week. Run one evaluation. I guarantee you'll find something you didn't expect.

Have you tried systematic evaluation for your agents? What challenges are you facing? Let me know in the comments.
