Benedict Isaac
Opik by Comet The Open Source Observability Tool Every AI Builder Needs in Their Stack

I came across Opik during the Commit to Change Hackathon by Encode Club, in partnership with Comet. I had never heard of it before, but after integrating it into my project, it became one of those tools I couldn't imagine building without.

If you're building LLM-powered applications, whether that's a RAG pipeline, an AI agent, a chatbot, or any system that calls a language model, you already know the pain:

- Something breaks and you don't know where
- Your agent hallucinates and you can't trace why
- Token costs spike and you have no visibility into what's consuming them
- You change a prompt and don't know if it actually improved anything

Opik is the answer to all of that.
This article covers what Opik is, how it works, its core features, and how to integrate it into your existing LLM stack — with real code examples.

What is Opik?
Opik is an open-source LLM observability and evaluation platform built by Comet. It sits alongside your AI application and gives you complete visibility into everything your system does: every LLM call, every tool invocation, and every chain step, logged, scored, and visualized in one dashboard.

It covers the full development lifecycle:

- Development: trace and debug your agents as you build
- Evaluation: score outputs and run experiments across prompt versions
- Production: monitor live traffic, detect issues, and auto-optimize

Think of it like this: if your AI agent is a car, Opik is the full onboard diagnostics system. Not just a dashboard light that tells you something is wrong, but the full readout that tells you exactly which component failed and why.

Core Concepts: Traces and Spans
Before diving into code, it helps to understand two key concepts Opik is built around.

A trace is a complete record of one end-to-end request through your LLM application. From the moment a user sends a question to the moment your app returns a response, that entire journey is one trace.

A span is a single step inside that trace. If your agent calls a retrieval function, then calls the LLM, then formats the output, each of those steps is a span nested inside the parent trace.

This structure gives you surgical visibility. Instead of just knowing "the response was bad," you can see exactly which step produced the bad output, how long it took, and what it was working with.
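To make the trace/span relationship concrete, here is a minimal plain-Python sketch of the data model. This is illustrative only, not Opik's actual classes: a trace is just a tree of spans, which is why the dashboard can show you each step with its depth.

```python
class Span:
    """One step inside a trace: a name, its inputs, and any nested child spans."""
    def __init__(self, name, inputs):
        self.name = name
        self.inputs = inputs
        self.children = []

class Trace:
    """A complete end-to-end request: a root span plus everything nested under it."""
    def __init__(self, root):
        self.root = root

    def flatten(self):
        # Walk the span tree depth-first so every step is visible with its depth.
        steps = []
        def walk(span, depth):
            steps.append((depth, span.name))
            for child in span.children:
                walk(child, depth + 1)
        walk(self.root, 0)
        return steps

root = Span("llm_chain", {"question": "What did the dog do?"})
root.children.append(Span("get_context", {"question": "What did the dog do?"}))
root.children.append(Span("call_llm", {"question": "What did the dog do?"}))
print(Trace(root).flatten())
# [(0, 'llm_chain'), (1, 'get_context'), (1, 'call_llm')]
```

The depth column is exactly what lets you pinpoint which nested step misbehaved.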

Getting Started: Installation

```bash
pip install opik
opik configure
```
Running opik configure sets up your API key and connects your environment to the Opik cloud dashboard. You can also self-host Opik if you prefer to keep everything local.

Core Integration: The @track Decorator
The fastest way to get started with Opik is the @track decorator. Add it to any function in your LLM pipeline and Opik automatically logs it as a span.

```python
from opik import track

@track
def llm_chain(user_question):
    context = get_context(user_question)
    response = call_llm(user_question, context)
    return response

@track
def get_context(user_question):
    # Retrieval logic, hard-coded here for simplicity
    return ["The dog chased the cat.", "The cat was called Luky."]

@track
def call_llm(user_question, context):
    # Your actual LLM call goes here
    return "The dog chased the cat Luky."

response = llm_chain("What did the dog do?")
print(response)
```

What happens when you run this:
- llm_chain is logged as the parent trace
- get_context and call_llm are logged as child spans nested inside it
- Every input, output, and execution time is captured automatically
- The full chain appears in your Opik dashboard instantly

No boilerplate. No manual logging. Just a decorator.
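For intuition, the decorator pattern behind @track can be approximated in a few lines of plain Python: wrap the function, time it, and record inputs and outputs. This is a simplified sketch, not Opik's implementation, which also handles span nesting, async code, and shipping data to the backend:

```python
import functools
import time

SPANS = []  # stand-in for Opik's backend

def track(fn):
    """Log each call's name, inputs, output, and duration as a 'span'."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        SPANS.append({
            "name": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "duration_s": time.perf_counter() - start,
        })
        return result
    return wrapper

@track
def call_llm(question):
    return f"Answer to: {question}"

call_llm("What did the dog do?")
print([s["name"] for s in SPANS])  # ['call_llm']
```

Because the wrapper sees every call, instrumentation comes for free once the decorator is applied, which is exactly why a single line of code is enough.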

Integrations Works With Your Existing Stack
Opik isn't asking you to rewrite your application. It integrates directly with the tools you're already using:
LangChain:

```python
from langchain_openai import ChatOpenAI
from opik.integrations.langchain import OpikTracer

opik_tracer = OpikTracer()
llm = ChatOpenAI(temperature=0)
llm = llm.with_config({"callbacks": [opik_tracer]})

llm.invoke("Hello, how are you?")
```

OpenAI SDK:

```python
from openai import OpenAI
from opik.integrations.openai import track_openai

openai_client = OpenAI()
openai_client = track_openai(openai_client)

response = openai_client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello, world!"}]
)
```

LlamaIndex:

```python
from llama_index.core import set_global_handler

set_global_handler("opik")
```

One line. That's it for LlamaIndex.
Opik also supports LiteLLM, DSPy, Ragas, OpenTelemetry, and Predibase, so whatever your stack looks like, it fits in.

Evaluation: Stop Guessing, Start Scoring
Once your traces are being logged, you can start running evaluations. Opik has built-in eval metrics including:

- Hallucination detection: flags responses that contradict the provided context
- Answer relevance: scores how well the response addresses the question
- Context precision: measures the quality of retrieved context in RAG systems
- Factuality: checks responses against a ground truth dataset
- Moderation: flags harmful or policy-violating content

You can also define your own custom metrics using the SDK.

The real power here is running experiments: give Opik a dataset, define what "good" looks like using your chosen metrics, and let it automatically score different versions of your app against each other. You stop debating which prompt is better and start measuring it.
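As a mental model, here's what an experiment boils down to, sketched with a toy word-overlap relevance metric. This is purely illustrative; Opik's built-in metrics are far more sophisticated (typically LLM-as-a-judge), and the function and dataset names here are made up:

```python
def overlap_relevance(question, answer):
    """Toy metric: fraction of question words that show up in the answer."""
    q_words = set(question.lower().split())
    a_words = set(answer.lower().split())
    return len(q_words & a_words) / len(q_words)

# A tiny dataset with outputs from two prompt versions of the same app.
dataset = [
    {"question": "what did the dog do",
     "v1": "it ran away",
     "v2": "the dog chased the cat"},
]

results = {}
for version in ("v1", "v2"):
    scores = [overlap_relevance(row["question"], row[version]) for row in dataset]
    results[version] = sum(scores) / len(scores)

print(results)  # {'v1': 0.0, 'v2': 0.4}
```

The point isn't the metric itself; it's that both versions are scored on the same dataset with the same yardstick, so "which prompt is better" becomes a number instead of an opinion.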

Guardrails: Safety Built In
Opik ships with built-in guardrails that screen both user inputs and LLM outputs before they cause problems:

- PII detection and redaction
- Competitor mention filtering
- Off-topic content detection
- Custom content moderation rules

You can use Opik's built-in models or plug in your own third-party guardrail libraries. This means safety isn't an afterthought you bolt on at the end — it's baked into the same observability pipeline you're already running.
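Conceptually, a guardrail like PII redaction is a filter that runs before text reaches the model or the logs. Here is a bare-bones regex sketch to show the idea; the patterns are deliberately naive (real PII detection, including Opik's, uses proper detection models, not two regexes):

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text):
    """Replace detected PII with a labeled placeholder before logging or inference."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(redact("Contact me at jane@example.com or 555-123-4567."))
# Contact me at [EMAIL] or [PHONE].
```

Running the same filter on both inputs and outputs is what lets one observability pipeline double as a safety layer.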

Automatic Prompt Optimization
This is one of the most powerful features Opik offers, and one that most developers don't expect from an observability tool.

Once you've defined your evaluation metrics and built a test dataset, Opik can automatically generate and test improved versions of your prompts using four built-in optimizers:

- Few-shot Bayesian: finds the best few-shot examples for your use case
- MIPRO: multi-stage instruction and prefix optimization
- Evolutionary optimizer: iteratively evolves prompt variations
- MetaPrompt (LLM-powered): uses an LLM to rewrite and improve your prompts

The result is a production-ready, frozen prompt that you can lock in and deploy with confidence, without manually iterating through dozens of variations yourself.
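The core loop behind all four optimizers is the same: propose prompt variants, score each one against your dataset and metrics, keep the winner. A deterministic toy sketch of that loop; the `judge` function here is a stand-in (a real run would execute your app over the dataset and aggregate eval-metric scores):

```python
def judge(prompt):
    """Stand-in for an eval run. In practice you'd run the app on a dataset
    and aggregate metric scores; here, grounding instructions score higher."""
    score = 0.0
    if "use only the provided context" in prompt.lower():
        score += 0.5
    if "cite" in prompt.lower():
        score += 0.3
    return score

candidates = [
    "Answer the question.",
    "Answer the question. Use only the provided context.",
    "Answer the question. Use only the provided context and cite it.",
]

best = max(candidates, key=judge)
print(best)
# Answer the question. Use only the provided context and cite it.
```

The optimizers differ in how they *generate* candidates (Bayesian search, evolution, an LLM rewriter), but selection always comes down to scoring against your own metrics.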

Production Monitoring
When you ship to production, Opik keeps running. Every live request is logged, scored using online eval metrics, and surfaced in your monitoring dashboard.

This means:
- You catch regressions immediately when a new model version behaves differently
- You build new test datasets directly from real production traffic
- You close the loop between what you tested in development and what actually happens with real users
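Online regression detection ultimately reduces to comparing a rolling aggregate of eval scores against a baseline. A self-contained sketch of that idea; the class name, window size, and thresholds are invented for illustration:

```python
from collections import deque

class RollingMonitor:
    """Track a rolling mean of online eval scores and flag regressions."""
    def __init__(self, window=100, baseline=0.8, tolerance=0.1):
        self.scores = deque(maxlen=window)
        self.baseline = baseline
        self.tolerance = tolerance

    def record(self, score):
        self.scores.append(score)

    def regressed(self):
        if not self.scores:
            return False
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance

monitor = RollingMonitor(window=5, baseline=0.8, tolerance=0.1)
for s in [0.9, 0.85, 0.5, 0.4, 0.45]:  # scores dip after a model update
    monitor.record(s)
print(monitor.regressed())  # True
```

Because every production request is already scored, a check like this can fire the moment a new model version starts behaving differently.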

Why This Matters Right Now
The conversation in AI development has shifted. A year ago, the focus was almost entirely on prompts: write a better prompt, get a better output. That still matters, but it's not enough anymore.

As AI systems get more complex (multi-agent workflows, RAG pipelines, tool-calling chains), the failure modes multiply. You can't eyeball your way through 10,000 production traces. You need instrumentation.

Opik gives you that instrumentation. And because it's open source with 18k+ GitHub stars, it's backed by a real community — not a vendor lock-in waiting to happen.

Getting Started

1. Install: pip install opik
2. Configure: opik configure
3. Add @track to your LLM functions
4. Open your Opik dashboard and watch your traces appear

Free to start, no credit card required.
🔗 comet.com/site/products/opik
⭐ github.com/comet-ml/opik
