Online vs Offline Evals: Where Each One Catches the Bug

Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Me: xgabriel.com | GitHub

Your CI is green. Forty golden cases passed. Faithfulness, relevance, the exact-match checks on the three prompts that broke last quarter — all green. You ship the prompt change at 4pm and go home.

By Thursday support has a pattern. Users in one locale are getting answers that read fine but cite the wrong policy version. Your golden set never had a case for that locale. It couldn't have. Nobody wrote that bug down before it happened.

This is the split that trips up most teams running evals in 2026. Offline evals and online evals are not two flavors of the same thing. They catch different bugs, on different timelines, with different blind spots. Run one and skip the other and you have a known hole in your safety net.

What offline evals actually are

An offline eval runs against a fixed dataset before you ship. You have a set of inputs, a set of expected outputs (or a scorer that doesn't need them), and a scoring function. You run it in CI on every prompt or model change. It gates the deploy.

The dataset is the golden set. It comes from three places: cases you wrote by hand, cases you pulled from production after they broke, and cases a colleague added after a customer complained. Every entry encodes a bug someone already knows about.

A minimal offline run looks like this:

import json

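# golden.jsonl: one {"input", "expected"} object per line
def load_cases(path):
    with open(path, encoding="utf-8") as f:
        # skip blank lines so a trailing newline doesn't crash json.loads
        return [json.loads(line) for line in f if line.strip()]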


def score_case(case, run_model):
    out = run_model(case["input"])
    return {
        "input": case["input"],
        "got": out,
        "pass": case["expected"].lower() in out.lower(),
    }


def run_offline(path, run_model):
    cases = load_cases(path)
    results = [score_case(c, run_model) for c in cases]
    failed = [r for r in results if not r["pass"]]
    return results, failed

Wire the gate into CI:

def ci_gate(path, run_model, max_fail=0):
    _, failed = run_offline(path, run_model)
    if len(failed) > max_fail:
        for r in failed:
            print("FAIL:", r["input"][:60])
        raise SystemExit(1)
    print("offline evals passed")

That raise SystemExit(1) is the whole point. It blocks the merge. A model alias rotated under you, someone bumped temperature, a prompt edit broke the JSON contract — the gate catches it before a user does.
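
The "broke the JSON contract" case is a good fit for a scorer that needs no expected output at all. Here is a minimal sketch of one, assuming the model is supposed to return a JSON object. The field names in REQUIRED_FIELDS and the function name are made up for illustration and are not from any real contract.

```python
import json

# Hypothetical contract: these field names are illustrative only.
REQUIRED_FIELDS = {"answer", "policy_version"}


def score_json_contract(output: str) -> dict:
    """Reference-free check: is the output a JSON object with the required keys?"""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"pass": False, "reason": f"invalid JSON: {exc.msg}"}
    if not isinstance(parsed, dict):
        return {"pass": False, "reason": "top-level value is not an object"}
    missing = sorted(REQUIRED_FIELDS - set(parsed))
    if missing:
        return {"pass": False, "reason": "missing fields: " + ", ".join(missing)}
    return {"pass": True, "reason": ""}
```

A check like this can run on every golden input without anyone writing an expected answer, which makes it cheap to add alongside the substring match.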

Where the golden set goes blind

The golden set only knows the bugs you taught it. That is its strength in CI and its hard ceiling everywhere else.

It cannot see the input distribution shift. Your dataset froze the day you wrote it. Real traffic moves: new tenants, new phrasings, a product launch that floods you with questions about a feature the golden set predates. The cases that matter most next month are the ones nobody has written yet.

It cannot see slice concentration. A 100-case set that passes at 98% feels safe. Production might be hiding a 30% failure rate on the 4% of traffic asking about entities added last week. The aggregate pass rate smooths over the exact slice that is on fire.
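
The arithmetic behind that is worth seeing once. This is a small sketch, assuming (for simplicity) that the other 96% of traffic never fails; the function name is illustrative.

```python
def aggregate_pass_rate(slices):
    """slices: iterable of (traffic_share, failure_rate); shares should sum to 1."""
    overall_failure = sum(share * fail for share, fail in slices)
    return 1.0 - overall_failure


# 4% of traffic failing 30% of the time, the remaining 96% assumed never failing
rate = aggregate_pass_rate([(0.04, 0.30), (0.96, 0.0)])
print(f"{rate:.1%}")  # 98.8%
```

A slice failing nearly one request in three still leaves the headline number above 98%, which is why the aggregate alone cannot tell you the slice exists.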

It also drifts toward the cases that are easy to write. Hand-authored golden sets over-represent clean, well-formed inputs because those are the ones a person thinks of at a desk. The messy real input — half a sentence, a paste of a stack trace, two languages in one message — rarely makes it in.

None of this is an argument against offline evals. They are the cheapest place to catch a named regression, and a named regression caught in CI costs you nothing. The argument is that "CI is green" answers a narrower question than the one your users keep asking.

What online evals actually are

An online eval scores live production traffic after the fact. You sample a slice of real requests, run a scorer against them, and emit the score as telemetry. No expected output, because there isn't one — production inputs are open-ended. The scorer is a code heuristic, a model judge, or a relevance check against retrieved context.

The key move: online evals read off the same trace stream your observability stack already produces. You are not building a second pipeline. You are attaching a scorer to spans you already emit.

import random
from opentelemetry import trace

tracer = trace.get_tracer("app.evals")


def judge_relevance(question, answer):
    # swap for your model-judge or heuristic;
    # must return a float in [0, 1]
    raise NotImplementedError("plug in a judge or heuristic")


def maybe_eval_online(question, answer, sample_rate=0.05):
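    # random.random() is in [0, 1), so >= keeps exactly sample_rate of traffic
    # and sample_rate=0 never samples
    if random.random() >= sample_rate: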
        return None
    score = judge_relevance(question, answer)
    span = trace.get_current_span()
    span.set_attribute("app.eval.relevance", score)
    span.set_attribute("app.eval.sampled", True)
    return score

You sample because judging every request is expensive and slow. At moderate volume, five percent of traffic is usually enough to give a stable aggregate; on low-traffic routes you may need a higher rate or a longer window. Stamp the score onto the span so it lands next to the latency, the token counts, and the cost you are already tracing. Then it is queryable like any other attribute.
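
Whether a given sample rate is enough depends on volume. Here is a rough sketch of the math, assuming scores are roughly independent and using the textbook standard error of a mean. The 0.25 standard deviation is a made-up figure, so measure your own before trusting the numbers.

```python
import math


def sampled_per_window(requests_per_window: int, sample_rate: float) -> int:
    """Expected number of scored requests in one window."""
    return round(requests_per_window * sample_rate)


def standard_error(score_std: float, n_samples: int) -> float:
    """Standard error of the mean score; infinite when nothing was sampled."""
    if n_samples <= 0:
        return float("inf")
    return score_std / math.sqrt(n_samples)


ASSUMED_STD = 0.25  # hypothetical spread of judge scores
for requests in (10_000, 200):
    n = sampled_per_window(requests, 0.05)
    print(requests, n, round(standard_error(ASSUMED_STD, n), 3))
```

Under these assumptions, 10,000 requests an hour yields 500 samples and an error of about 0.01 on the hourly mean. At 200 requests an hour you get 10 samples and an error near 0.08, which is too noisy to alert on.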

Where the trace stream catches what CI can't

Once the score is an attribute on the span, the online eval becomes an alert. The rolling average of app.eval.relevance drops below a baseline and you page or open a ticket. This is the part that catches the bug your golden set never had a case for.

A drift query in PromQL, assuming the score is also exported as a gauge metric named app_eval_relevance:

avg_over_time(app_eval_relevance[1h]) < 0.6
  and
avg_over_time(app_eval_relevance[7d]) > 0.75

Both terms matter. The first says quality dropped in the last hour. The second says it used to be fine, so this is a regression and not just a hard week of off-corpus questions. Without the second term you page every time traffic gets weird.
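
If you want to unit test that logic before trusting an alert rule, the same two-window condition can be sketched in plain Python. The thresholds mirror the query above. The function name and the list-of-floats inputs are assumptions for illustration.

```python
from statistics import fmean


def is_regression(last_hour, last_week, drop_below=0.6, baseline_above=0.75):
    """True only when the recent mean is low AND the long-run mean was healthy."""
    if not last_hour or not last_week:
        return False  # no samples, no signal
    return fmean(last_hour) < drop_below and fmean(last_week) > baseline_above
```

A low hour on top of a low week returns False, which matches the "hard week, not a regression" case the second term exists to filter out.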

The Thursday bug from the opening shows up here. The locale you never tested floods in, the relevance judge scores those answers low, the 1-hour average sags under the 7-day baseline, and you get a signal on day one instead of a support pattern on day four. The golden set could not have caught it. The trace stream did, because it scores what actually arrived rather than what you predicted would arrive.

Slice the score by the dimensions that drift: model, prompt version, tenant, locale. A global average hides the regression that is destroying one customer's experience while everyone else stays fine.
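
To make that concrete, here is a small sketch of per-slice averaging over scored records. The record shape (a dict with a "score" key plus whatever dimensions you tag) and the sample values are invented for illustration.

```python
from collections import defaultdict
from statistics import fmean


def mean_score_by_slice(records, key):
    """Group scored records by one dimension and average the score per group."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[key]].append(rec["score"])
    return {name: fmean(scores) for name, scores in buckets.items()}


# Made-up records: one healthy locale, one struggling locale
records = [
    {"locale": "en-US", "score": 0.9},
    {"locale": "en-US", "score": 0.8},
    {"locale": "pt-BR", "score": 0.3},
    {"locale": "pt-BR", "score": 0.5},
]
by_locale = mean_score_by_slice(records, "locale")
overall = fmean(r["score"] for r in records)
```

In this toy data the overall mean sits at about 0.63 while one locale is near 0.4 and the other near 0.85. In a metrics backend the equivalent move is keeping locale, tenant, model, and prompt version as labels and grouping by them.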

Which one catches which bug

A short map, because the whole point is that they do not overlap:

Offline catches the named regression: a prompt edit that breaks the output contract, a model rotation that fails a known case, a code change that drops a required field. Caught in CI, before ship, for free.
Offline misses anything not in the dataset: distribution shift, new slices, the input nobody thought to write down.
Online catches the unnamed regression: a quality sag on real traffic, a slice silently failing, drift after a provider-side model change. Caught hours after ship, on live users, at sampling cost.
Online misses the deploy-blocking bug, because by the time it fires the bad version is already serving traffic.

That last line is why you run both. Offline is the gate that stops the bug you can name. Online is the net under the bug you can't. Neither covers the other's hole.

The cheapest way to run both

Build the online scorer first, even if you do it badly. The reason is that online evals are where your next golden cases come from. Every production failure the relevance alert surfaces is a case you copy into the golden set, so next time CI catches it before ship. The two systems feed each other: online finds the unknown bug, offline makes sure it stays found.
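
The copy step can be very small. Here is a sketch that appends a flagged production input to the same golden.jsonl format used earlier. Since production traffic has no expected output, this assumes a person reviews the failure and writes the expected text first. The function and parameter names are hypothetical.

```python
import json


def promote_to_golden(path, question, reviewed_expected):
    """Append one human-reviewed production failure to the golden set (JSONL)."""
    case = {"input": question, "expected": reviewed_expected}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")
    return case
```

Appending keeps existing cases untouched, so the golden set only ever grows with bugs that actually happened.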

Use the same scorer code in both places. The relevance judge that stamps app.eval.relevance onto a live span is the same function you call in the offline scorer. One implementation, two harnesses — CI and the trace stream. When you change how you score, both move together, and you never have to reconcile two definitions of "good."
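
Here is one way that sharing might look, as a sketch. The word-overlap relevance function is a toy stand-in for a real judge, and both harness names are hypothetical. The online harness takes a set_attribute callable (for example a span's set_attribute method), so the sketch runs without any tracing library installed.

```python
def relevance(question: str, answer: str) -> float:
    """Toy stand-in for a judge: share of question words that reappear in the answer."""
    q_words = set(question.lower().split())
    if not q_words:
        return 0.0
    a_words = set(answer.lower().split())
    return len(q_words & a_words) / len(q_words)


def offline_case_passes(case, run_model, threshold=0.5):
    """CI harness: score the model's answer to a golden input."""
    return relevance(case["input"], run_model(case["input"])) >= threshold


def online_stamp(question, answer, set_attribute):
    """Production harness: score a live answer and record it as an attribute."""
    score = relevance(question, answer)
    set_attribute("app.eval.relevance", score)
    return score
```

Swapping the toy heuristic for a model judge changes one function, and both harnesses pick up the new definition together.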

Start with one offline gate and one online alert. Sample 5% of traffic, score relevance, alert on the 1-hour-vs-7-day drop. Pull every alert that fires into the golden set. In a month the gate is sharper and the alert is quieter, because the bugs keep migrating from "unknown" to "named." That migration is the entire job.

If this was useful

If your CI is green and your users still aren't happy, the bug is almost always living in the gap between what your golden set tested and what production actually sent. The LLM Observability Pocket Guide walks through wiring both eval loops off one trace stream, picking scorers that work in CI and in prod, and the sampling math that keeps online evals cheap.