We wired our LLM eval suite into Datadog over about four months. Most of the panels we built got deleted. These are the five that stayed, and the metrics that feed them.
TL;DR: We run an LLM-as-judge eval suite on every PR that touches a prompt, and we ship the results to Datadog as custom metrics. The dashboard started with fourteen panels. We kept five. The one that catches the most real regressions is pass-rate split out by judge criterion, not the single rolled-up pass-rate number, because an aggregate of 91 percent hid the fact that one criterion had dropped from 0.96 to 0.71. Below are the metrics we emit, the Python that submits them, the monitor config we alert on, and the panels we tried and dropped.
Some context on the setup so the rest makes sense. We are a Series-C dev-tool startup. We have a handful of prompts in production that do real work (classification, extraction, a summarization step in an agent loop). Each one has an eval set of tagged examples, somewhere between 80 and 400 per prompt. The judge is a separate model call that scores each output against a rubric. We run the suite in GitHub Actions. The eval job emits metrics to Datadog at the end of every run. Backend service health was already in Datadog, so putting eval data next to it meant one place to look during an incident instead of two.
1. Emit per-criterion pass-rate, not just the rolled-up number
This is the one that earns its place. Our judge scores each output against multiple criteria. For the extraction prompt it is four: correct fields, no hallucinated fields, format valid, no refusal. Early on we only emitted one number, prompt_eval.pass_rate, a single figure averaged across the criteria. That number is fine for a smoke test and useless for debugging.
The problem showed up on a prompt change that looked clean. Overall pass-rate went from 0.93 to 0.91. Two points. Nobody would block a PR on two points. But underneath, the "no hallucinated fields" criterion had dropped from 0.96 to 0.71, and "format valid" had gone up enough to mask it in the average. We were trading correctness for formatting and the rolled-up number said everything was basically fine.
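To make the masking concrete, here is a small worked example. It assumes the rolled-up number is a plain mean of the per-criterion rates. Only the no_hallucinated_fields values (0.96 and 0.71) and the two rolled-up figures come from the run described above; the other three criteria use invented values chosen so the mean lands on 0.93 and 0.91.

```python
# Hypothetical per-criterion pass rates for the extraction prompt.
# Only no_hallucinated_fields (0.96 -> 0.71) is taken from the post;
# the other values are invented so the means come out at 0.93 and 0.91.
before = {"correct_fields": 0.95, "no_hallucinated_fields": 0.96,
          "format_valid": 0.82, "no_refusal": 0.99}
after = {"correct_fields": 0.95, "no_hallucinated_fields": 0.71,
         "format_valid": 0.99, "no_refusal": 0.99}


def rolled_up(rates):
    """Mean of the per-criterion pass rates, rounded to two decimals."""
    return round(sum(rates.values()) / len(rates), 2)


def criterion_deltas(old, new):
    """Change in pass rate per criterion, most negative first."""
    deltas = {name: round(new[name] - old[name], 2) for name in old}
    return sorted(deltas.items(), key=lambda item: item[1])


print(rolled_up(before), rolled_up(after))   # the two-point move
print(criterion_deltas(before, after)[0])    # the criterion that actually fell
```

The rolled-up value moves by two points while the worst criterion moves by twenty-five, which is the gap a single number cannot show.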
So now every criterion gets its own series, tagged. The metric name stays prompt_eval.pass_rate and the criterion rides as a tag. That keeps the list of metric names short and lets you graph all criteria on one panel.
```python
# eval_metrics.py
# Submits eval results to Datadog after a run completes.
import os
import time

from datadog import initialize, api

initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])


def submit_eval_metrics(prompt_name, git_sha, results):
    now = time.time()
    # Truncated SHA: 12 characters is enough to find the commit.
    base_tags = [f"prompt:{prompt_name}", f"git_sha:{git_sha[:12]}", "env:ci"]
    series = []
    # One point per criterion; the criterion is a tag, not part of the metric name.
    for criterion, rate in results["per_criterion"].items():
        series.append({"metric": "prompt_eval.pass_rate", "points": [(now, rate)],
                       "type": "gauge", "tags": base_tags + [f"criterion:{criterion}"]})
    # The rolled-up number shares the metric name under criterion:overall.
    series.append({"metric": "prompt_eval.pass_rate", "points": [(now, results["overall_pass_rate"])],
                   "type": "gauge", "tags": base_tags + ["criterion:overall"]})
    series.append({"metric": "prompt_eval.judge_kappa", "points": [(now, results["judge_kappa"])],
                   "type": "gauge", "tags": base_tags})
    series.append({"metric": "prompt_eval.token_cost", "points": [(now, results["token_cost_usd"])],
                   "type": "gauge", "tags": base_tags})
    series.append({"metric": "prompt_eval.p95_latency_ms", "points": [(now, results["p95_latency_ms"])],
                   "type": "gauge", "tags": base_tags})
    # Send every series for this run in a single API call.
    api.Metric.send(series)
```
Two things I got wrong the first time. I submitted the criterion in the metric name (prompt_eval.pass_rate.no_hallucinated_fields) instead of as a tag. That generated a new metric name per criterion, the list of names kept growing, and you cannot graph them together without listing each one. Tags fix both. The other thing: I tagged with the full 40-character git SHA, which is a high-cardinality tag value and not useful at that length. Truncating to 12 is enough to find the commit and keeps the tag readable. It does not change how many distinct values there are, so the tag is still one value per commit.
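The submit function takes a results dict without showing where results["per_criterion"] comes from. A minimal sketch of that step follows, assuming the judge returns one dict of criterion name to boolean per example. That input shape and the function name per_criterion_rates are illustrative, not the actual eval code.

```python
from collections import defaultdict


def per_criterion_rates(verdicts):
    """Turn per-example judge verdicts into one pass rate per criterion.

    verdicts: list of dicts mapping criterion name -> bool, one per example.
    This input shape is an assumption; adapt it to your judge's output.
    """
    passed = defaultdict(int)
    seen = defaultdict(int)
    for verdict in verdicts:
        for criterion, ok in verdict.items():
            seen[criterion] += 1
            passed[criterion] += int(bool(ok))
    return {name: round(passed[name] / seen[name], 4) for name in seen}


# Four made-up examples scored on two criteria.
verdicts = [
    {"correct_fields": True, "no_hallucinated_fields": True},
    {"correct_fields": True, "no_hallucinated_fields": False},
    {"correct_fields": False, "no_hallucinated_fields": True},
    {"correct_fields": True, "no_hallucinated_fields": True},
]
results = {"per_criterion": per_criterion_rates(verdicts)}
```

Counting each criterion separately means a criterion the judge skipped on some examples still gets a rate over the examples where it was scored.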
2. Track the judge against humans, or you are graphing noise
My standing opinion, and I will say it plainly: LLM-as-judge is the only scalable eval, but most teams use it wrong because they never validate the judge itself. A pass-rate panel that looks beautiful is worthless if all you are measuring is the judge agreeing with itself. We learned this the slow way on a hallucination-detection judge that ran around a 30 percent false-positive rate for weeks. The dashboard was green. Customers were not.
So prompt_eval.judge_kappa is a first-class metric now. We keep a small human-labeled holdout per prompt (200 examples, labeled by two of us, disagreements resolved by a third). Every eval run scores that holdout too and computes Cohen's kappa between the judge and the human labels. That number goes to Datadog next to the pass-rate.
The panel for it is a single timeseries with a marker line at 0.6. When kappa drifts under the line, the pass-rate numbers above it stop meaning anything and we know to revisit the judge prompt before trusting any regression signal. In our setup kappa sits around 0.66 to 0.72 on a good prompt. When we rewrote a judge rubric badly once, it fell to 0.41 in a single run, and that drop is what told us the rubric change was the problem, not the model.
```python
from sklearn.metrics import cohen_kappa_score


def compute_judge_kappa(human_labels, judge_labels):
    # labels: 1 = pass, 0 = fail, aligned by example id.
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must align by example id")
    return round(cohen_kappa_score(human_labels, judge_labels), 3)
```
The holdout does not need to be big. It needs to be labeled by an actual person and refreshed when the prompt's job changes. We re-label maybe once a month, or whenever a prompt's scope moves.
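To see why kappa rather than raw agreement goes on the panel, here is a dependency-free sketch of Cohen's kappa for binary labels, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The function name and the toy labels are illustrative; the scikit-learn call above is what the pipeline uses.

```python
import math


def cohen_kappa_binary(human, judge):
    """Cohen's kappa for two aligned lists of 0/1 labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: fraction of examples where the two labels match.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: both say pass, plus both say fail, if each rater
    # labeled independently at their own pass rate.
    human_pass = sum(human) / n
    judge_pass = sum(judge) / n
    expected = human_pass * judge_pass + (1 - human_pass) * (1 - judge_pass)
    if math.isclose(expected, 1.0):
        return float("nan")  # both raters used a single label; kappa is undefined
    return (observed - expected) / (1 - expected)


# A judge that passes everything still agrees with humans on 9 of 10 examples.
human = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
always_pass = [1] * 10
agreement = sum(h == j for h, j in zip(human, always_pass)) / len(human)
kappa = cohen_kappa_binary(human, always_pass)
```

Raw agreement here is 0.9 and kappa is 0, because a judge that never fails anything adds no information beyond the base rate. That is the failure a pass-rate panel alone cannot reveal.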
3. Wire the monitors before you trust the dashboard
A dashboard nobody is staring at does not catch anything at 2am. The panels are for debugging once you already know something moved. The monitors are what tell you something moved. We run two kinds. The first is an absolute floor on per-criterion pass-rate. The second is a change-based monitor on the overall pass-rate, so a slow week-over-week slide gets caught even when no single run trips the floor.
Here is the per-criterion floor as a Terraform datadog_monitor resource, so it lives in version control instead of someone's browser tab.
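```hcl
resource "datadog_monitor" "extraction_no_hallucinated_fields" {
  name = "[prompt-eval] extraction: no_hallucinated_fields below floor"
  type = "metric alert"

  # max over the window: fires only when every run in the window is under the floor.
  # Datadog windows are time-based; last_4h is a placeholder for "about three runs".
  query = "max(last_4h):min:prompt_eval.pass_rate{prompt:extraction,criterion:no_hallucinated_fields,env:ci} < 0.85"

  monitor_thresholds {
    critical = 0.85
    warning  = 0.90
  }

  notify_no_data    = true
  no_data_timeframe = 60

  message = "no_hallucinated_fields for extraction stayed below 0.85 across the evaluation window (about 3 runs). Check the most recent prompt change. @slack-eval-alerts"
  tags    = ["team:ai", "prompt:extraction"]
}
```
A note on the window. We do not alert on a single run. Eval sets have sampling noise, and one unlucky run can dip a criterion below the floor and recover on the next. Datadog evaluates a monitor over a time window, not a run count, so "three runs" has to be written as a window wide enough to hold about three runs at your cadence. The 4h in the query is a placeholder for that, and it only approximates a run count if runs are evenly spaced. The time aggregator is max, not min: the maximum being under the floor means every run in the window was under it, whereas min would fire on a single dip. Requiring three consecutive runs under the line cut our false pages down a lot. The CI check itself goes red on the first run, so the PR is already blocked. The page is for the slow drift, the red check is for the obvious break. notify_no_data: true matters more than it looks. The most common failure was not a regression. It was the eval job silently not running and the dashboard quietly going flat.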
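The second monitor from the start of this section, the change-based one on overall pass-rate, has no config shown here. In Datadog it would be a change alert monitor. The sketch below only illustrates the comparison such a monitor makes, in plain Python: recent runs against an earlier baseline. The function name, the window of three runs, and the 0.03 drop are placeholders, not tuned values.

```python
from statistics import mean


def drifted(pass_rates, recent=3, max_drop=0.03):
    """Return True when the recent runs sit well below the earlier baseline.

    pass_rates: overall pass rates, oldest first.
    recent and max_drop are placeholder values, not tuned thresholds.
    """
    if len(pass_rates) <= recent:
        return False  # not enough history to form a baseline
    baseline = mean(pass_rates[:-recent])
    current = mean(pass_rates[-recent:])
    return (baseline - current) > max_drop


# A slow slide: no run drops more than a point from the one before it.
history = [0.93, 0.93, 0.92, 0.92, 0.91, 0.90, 0.89, 0.88, 0.87]
slow_slide_caught = drifted(history)
```

No single step in that history would trip a floor set at 0.85, but the average of the last three runs is almost four points under the earlier baseline, so the check fires.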
4. The five panels we kept, and the nine we dropped
The test we landed on: if a panel has not changed what someone did in the last month, it goes.
| Panel | Metric | Keep or drop |
|---|---|---|
| Per-criterion pass-rate (one line per criterion) | prompt_eval.pass_rate by criterion | Kept. The single most-used panel. |
| Judge kappa vs human (marker at 0.6) | prompt_eval.judge_kappa | Kept. Tells you whether to trust everything else. |
| Token cost per run | prompt_eval.token_cost | Kept. A rewrite that doubles cost shows here before the bill does. |
| Pass-rate by git SHA (table, last 20) | prompt_eval.pass_rate by git_sha | Kept. The "which commit moved this" lookup. |
| p95 eval latency | prompt_eval.p95_latency_ms | Kept, barely. |
| Single big pass-rate number | overall pass-rate | Dropped. A green 0.91 gave false confidence. |
| Per-example score heatmap | per-example gauge | Dropped. Too dense, never drove a fix. |
| Cost cumulative sum for the month | summed cost | Dropped. A billing question, not an eval one. |
The pattern in what we dropped: anything that was a different view of a number we already had a better panel for, and anything too dense to read in the ten seconds you actually look at a dashboard mid-incident. We started by copying a generic service dashboard layout, and that was a mistake. Service dashboards assume a continuous stream of requests. Eval runs are discrete events on PRs.
5. Tag everything by prompt and SHA so the board answers "which change"
The whole point during a regression is to answer one question fast: which prompt change moved this metric. Every metric we send carries prompt, git_sha (truncated), and env. The pass-rate also carries criterion. With those tags, the "which commit" table is a straight group-by on git_sha. When a criterion drops, you read the table, find the SHA, and you are looking at the diff in under a minute. We also post a Datadog event at the start of each eval run as an overlay, so a drop on the graph lines up visibly with a commit.
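A minimal sketch of that event overlay, assuming the datadog Python client and that initialize() has already been called as in eval_metrics.py. The function names and the title and text wording are illustrative. The tags mirror the ones on the metrics so the event can be filtered the same way.

```python
def build_run_event(prompt_name, git_sha, env="ci"):
    """Build keyword arguments for a Datadog event marking one eval run."""
    short_sha = git_sha[:12]  # same truncation as the metric tags
    return {
        "title": f"prompt eval run: {prompt_name} @ {short_sha}",
        "text": f"Eval suite started for prompt '{prompt_name}' at commit {short_sha}.",
        "tags": [f"prompt:{prompt_name}", f"git_sha:{short_sha}", f"env:{env}"],
        "alert_type": "info",
    }


def post_run_event(prompt_name, git_sha):
    """Send the event. Assumes datadog.initialize() has already been called."""
    from datadog import api  # imported here so the builder works without the client

    return api.Event.create(**build_run_event(prompt_name, git_sha))


# Made-up SHA, only to show the payload shape.
payload = build_run_event("extraction", "0123456789abcdef0123456789abcdef01234567")
```

Keeping the payload builder separate from the send call makes the tag set easy to check in a unit test without touching the network.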
FAQ
Do you really need a human-labeled holdout for kappa? You need it once per prompt and refresh it occasionally. 200 examples labeled by two people is an afternoon. Without it you are trusting the judge with no check.
Why Datadog instead of the eval tool's own dashboard? We already lived in Datadog for service health. If your team does not, this is probably not a reason to adopt it. The metrics matter more than the surface they render on.
What thresholds should I start with? Do not copy mine. Run the suite on main for a week, watch where each criterion sits, set the floor a little below the normal range.
Does this replace running Promptfoo or your eval framework locally? No. The framework still runs the evals and is where you read per-example detail. Datadog is the rollup and the alerting layer on top.
Why gauge and not count or rate? A pass-rate is a snapshot value at a point in time, so gauge fits. Using the wrong type was one of my early mistakes.
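On the thresholds question above, "a little below the normal range" can be sketched as a small helper. This is one way to do it under stated assumptions, not the rule we use: it takes the lowest pass rate seen for one criterion during the baseline week and subtracts a margin. The function name, the 0.02 margin, and the sample rates are all placeholders.

```python
def suggest_floor(baseline_rates, margin=0.02):
    """Suggest an alert floor a little below the lowest baseline run.

    baseline_rates: one criterion's pass rates from runs on main.
    margin is a placeholder; how far "a little below" is depends on your noise.
    """
    if not baseline_rates:
        raise ValueError("need at least one baseline run")
    return round(max(0.0, min(baseline_rates) - margin), 2)


# Made-up week of runs on main for a single criterion.
week_on_main = [0.95, 0.96, 0.94, 0.97, 0.95]
floor = suggest_floor(week_on_main)
```

Anchoring on the minimum rather than the mean keeps an ordinary noisy run from sitting right at the floor.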
What I am still chewing on
The kappa holdout goes stale when a prompt's job drifts, and I do not have a clean signal for when it has gone stale short of re-labeling. The three-run window trades detection speed for fewer false pages, and I am not sure three is the right number per eval set. And the harder one: this catches regressions in the prompts I already have eval sets for. The judge can only score what the rubric asks about. The class of bug where everything passes and the customer is still wrong lives in the gap between the criteria, and I do not have a panel for the thing I forgot to measure.
If you have wired per-criterion eval alerting and found a better window than three runs, or a way to tell when a judge holdout has gone stale without re-labeling it, I want to hear it.