- Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
- Also by me: Thinking in Go (2-book series): Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub. An IDE for developers who ship with Claude Code and other AI coding tools.
- Me: xgabriel.com | GitHub
Three dashboards is the right number for an LLM team. One. Two. Three. Cost, quality, latency. Anything more and people stop opening them. Anything less and you're flying blind on whichever you skipped.
That's the whole post. The rest is wiring.
Why three, not one
A team I talked to last month had a single "AI" dashboard. Forty-two panels. Cost graphs next to eval pass rates next to TTFT histograms next to a leaderboard of slow customers. Nobody opened it. Not the platform engineer, not the PM, not the finance partner who actually pays the OpenAI bill.
When you ask three different questions of one chart wall, you get three different audiences scrolling past each other's panels. Finance scrolls past TTFT. The PM scrolls past spend-per-route. The on-call scrolls past groundedness scores.
Three dashboards is the smallest split that keeps each one usable:
- Cost: owned by platform engineering, watched by finance. Cadence: weekly review, daily glance.
- Quality: owned by the product engineer. Cadence: per release, per prompt change.
- Latency: owned by on-call. Cadence: live during incidents.
One owner, one audience, one composite alert per dashboard. That's the rule.
The OTel pipeline they share
All three dashboards read from the same span and metric stream. You don't run three pipelines. You run one OTel pipeline emitting one set of attributes, and each dashboard slices what it needs.
Here is the attribute set every span on the LLM call path should carry. The names follow the OpenTelemetry GenAI semantic conventions where they exist, plus a few app-side fields you'll always wish you had:
# attached to every LLM span
gen_ai.system: "openai" # or "anthropic", "bedrock"
gen_ai.request.model: "gpt-4o-mini"
gen_ai.response.model: "gpt-4o-mini-2024-07-18"
gen_ai.usage.input_tokens: 1284
gen_ai.usage.output_tokens: 312
gen_ai.usage.cached_input_tokens: 1024 # prompt cache hit
gen_ai.response.finish_reason: "stop" # or length / tool_calls
gen_ai.request.temperature: 0.2
# app-side, not in the spec but always worth carrying
app.route: "support.summarize" # logical feature, not URL
app.conversation_id: "c_abc123"
app.tenant_id: "t_98123"
app.eval.pass: true # if a runtime judge ran
app.eval.groundedness: 0.87
You also emit a small set of metrics derived from these spans. The dashboards mostly read metrics, not raw spans, because pre-aggregated metrics in a TSDB are far cheaper to query (often by orders of magnitude) than traces aggregated at read time:
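llm_request_tokens_input_total{gen_ai_request_model, app_route, app_tenant_id}
llm_request_tokens_output_total{gen_ai_request_model, app_route, app_tenant_id}
llm_request_tokens_cached_total{gen_ai_request_model, app_route, app_tenant_id}
llm_request_cost_usd_total{gen_ai_request_model, app_route, app_tenant_id}
llm_request_duration_seconds{gen_ai_request_model, app_route, le} # histogram
llm_request_ttft_seconds{gen_ai_request_model, app_route, le} # histogram
llm_request_errors_total{gen_ai_request_model, app_route, status}
llm_eval_score{app_route, judge, dim} # gauge
llm_eval_pass_total{app_route, outcome} # counter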
Two notes that will save you a week:
The collector (not the SDK) is where you compute USD cost. Models change price, your code shouldn't. Use an OTel processor with a model→price table you can hot-reload. When OpenAI drops a price (they do), you change one config map, not every service.
app.route is the most useful attribute you'll add and it's not in any spec. Pick it carefully. It's the label finance will filter by, the label PMs will group by, the label on-call will sort by. Make it a short string that maps 1:1 to a product feature, not an HTTP route.
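The cost computation from the first note can be sketched as follows. This is a minimal illustration in Python for readability; a real collector processor would be written in Go or expressed in collector config. The `Price` type, the `PRICE_TABLE` contents, and the `span_cost_usd` helper are all hypothetical, the prices are placeholders rather than current list prices, and the sketch assumes cached input tokens are a subset of input tokens.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Price:
    """USD per 1M tokens. Placeholder values, not real list prices."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


# Hypothetical price table; in practice this lives in a hot-reloadable config.
PRICE_TABLE: Dict[str, Price] = {
    "gpt-4o-mini": Price(input_per_m=0.15, cached_input_per_m=0.075, output_per_m=0.60),
}


def span_cost_usd(attrs: dict, prices: Dict[str, Price] = PRICE_TABLE) -> Optional[float]:
    """Return the USD cost of one LLM span, or None if the model has no price."""
    price = prices.get(attrs.get("gen_ai.request.model", ""))
    if price is None:
        return None  # surface unpriced models instead of silently charging zero
    input_tokens = attrs.get("gen_ai.usage.input_tokens", 0)
    cached = attrs.get("gen_ai.usage.cached_input_tokens", 0)
    output_tokens = attrs.get("gen_ai.usage.output_tokens", 0)
    # Assumption: cached tokens are counted inside input_tokens.
    uncached = max(input_tokens - cached, 0)
    micro_usd = (
        uncached * price.input_per_m
        + cached * price.cached_input_per_m
        + output_tokens * price.output_per_m
    )
    return micro_usd / 1_000_000


example_span = {
    "gen_ai.request.model": "gpt-4o-mini",
    "gen_ai.usage.input_tokens": 1284,
    "gen_ai.usage.output_tokens": 312,
    "gen_ai.usage.cached_input_tokens": 1024,
}
example_cost = span_cost_usd(example_span)
```

Returning None for an unknown model is a deliberate choice here: a new model with no price entry shows up as a gap you can alert on, not as free traffic.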
Dashboard 1: Cost
Audience: platform engineering + finance. Question it answers: where is the money going, and is it moving the wrong way?
Attributes read: gen_ai.request.model, gen_ai.response.model, app.route, app.tenant_id, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cached_input_tokens. Metric read: llm_request_cost_usd_total.
Panels:
- Spend per model: stacked area, last 30 days. Headline panel.
- Spend per route: bar chart, last 7 days, sorted desc.
- Top 10 tenants by spend: table, last 7 days.
- Cost per conversation P95: single stat, last 24h, with a delta vs last week.
- Prompt-cache hit ratio: single stat, last 24h. (Cached input tokens / total input tokens.)
- Cost burn vs monthly budget: gauge, current month.
The headline panel (spend per model) uses this PromQL:
sum by (gen_ai_request_model) (
rate(llm_request_cost_usd_total[5m])
) * 60 * 60 * 24
That gives you USD/day per model as a time series, smoothed over 5-minute rate windows. The Datadog equivalent:
sum:llm.request.cost_usd{*} by {gen_ai_request_model}.as_rate() * 86400
The cost-per-conversation P95 is the panel that catches the "one runaway agent loop" pattern before it shows up on the bill. PromQL:
histogram_quantile(0.95,
sum by (le, app_route) (
rate(llm_conversation_cost_usd_bucket[1h])
)
)
You'll need to emit llm_conversation_cost_usd as a histogram in the collector (the query reads its _bucket series), summing per-conversation spend at conversation-end. It's worth the setup. The first time you catch a single conversation costing $40, you'll have paid for the dashboard.
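A rough sketch of that per-conversation aggregation, in plain Python with no metrics library: spend accumulates per conversation ID and is observed into a histogram only when the conversation ends. The class name, the bucket bounds, and the `end_conversation` trigger are assumptions for illustration; how you detect conversation-end depends on your app.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket upper bounds in USD; tune them to your own spend profile.
BUCKETS = [0.01, 0.05, 0.25, 1.0, 5.0, 25.0, float("inf")]


class ConversationCostHistogram:
    def __init__(self, buckets=BUCKETS):
        self.buckets = list(buckets)
        self.open_spend = defaultdict(float)          # conversation_id -> running USD
        self.bucket_counts = [0] * len(self.buckets)  # per-bucket, non-cumulative

    def add_span(self, conversation_id, cost_usd):
        """Accumulate the cost of one LLM span into its conversation."""
        self.open_spend[conversation_id] += cost_usd

    def end_conversation(self, conversation_id):
        """Observe the conversation's total spend and forget the conversation."""
        total = self.open_spend.pop(conversation_id, 0.0)
        # First bucket whose upper bound is >= total (the "le" bucket).
        self.bucket_counts[bisect_left(self.buckets, total)] += 1
        return total

    def cumulative(self):
        """Prometheus-style cumulative counts keyed by upper bound."""
        out, running = {}, 0
        for bound, count in zip(self.buckets, self.bucket_counts):
            running += count
            out[bound] = running
        return out


hist = ConversationCostHistogram()
hist.add_span("c_abc123", 0.02)
hist.add_span("c_abc123", 0.01)
hist.end_conversation("c_abc123")
hist.add_span("c_runaway", 40.0)   # the runaway agent loop
hist.end_conversation("c_runaway")
counts = hist.cumulative()
```

The runaway conversation lands in the top bucket, which is what pulls the P95 up on the panel.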
Sample Grafana JSON for the headline "Spend per model" panel. Drop this into a dashboard's panels array and adjust the datasource UID:
{
  "type": "timeseries",
  "title": "Spend per model (USD/day)",
  "datasource": {
    "type": "prometheus",
    "uid": "prom-default"
  },
  "fieldConfig": {
    "defaults": {
      "unit": "currencyUSD",
      "custom": {
        "drawStyle": "line",
        "fillOpacity": 30,
        "stacking": { "mode": "normal" }
      }
    }
  },
  "options": {
    "legend": { "showLegend": true, "placement": "right" }
  },
  "targets": [
    {
      "refId": "A",
      "expr": "sum by (gen_ai_request_model) (rate(llm_request_cost_usd_total[5m])) * 86400",
      "legendFormat": "{{gen_ai_request_model}}"
    }
  ]
}
SLO template:
# slo-cost.yaml
slo:
  name: llm-monthly-spend
  owner: platform-eng
  description: Total LLM spend stays within monthly budget
  objective:
    target: 8000 # USD, monthly
    window: 30d
  indicator:
    metric: llm_request_cost_usd_total
    aggregation: sum
  alerts:
    - name: budget-burn-fast
      burn_rate: 4 # 4x expected burn over 1h
      window: 1h
      severity: page
    - name: budget-burn-slow
      burn_rate: 1.5 # 1.5x expected burn over 24h
      window: 24h
      severity: ticket
One alert per dashboard. For cost, it's burn-rate against budget, not "tokens spiked 20%". A 20% spike means a feature launched. A 4× burn rate sustained for an hour means something's wrong.
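The burn-rate arithmetic behind that alert is short enough to sketch. This assumes a flat spend profile (the budget divided evenly across a 30-day, 720-hour month), which is a simplification; the function name is hypothetical and the numbers come from the SLO template above.

```python
def budget_burn_rate(spend_in_window_usd, window_hours, monthly_budget_usd,
                     month_hours=30 * 24):
    """Observed hourly spend divided by the hourly spend that would
    exactly exhaust the monthly budget."""
    expected_per_hour = monthly_budget_usd / month_hours
    observed_per_hour = spend_in_window_usd / window_hours
    return observed_per_hour / expected_per_hour


MONTHLY_BUDGET_USD = 8000  # from slo-cost.yaml

# $50 spent in the last hour against an $8000 monthly budget.
fast = budget_burn_rate(50, 1, MONTHLY_BUDGET_USD)
# $300 spent in the last 24 hours.
slow = budget_burn_rate(300, 24, MONTHLY_BUDGET_USD)

should_page = fast >= 4      # budget-burn-fast threshold
should_ticket = slow >= 1.5  # budget-burn-slow threshold
```

An $8000 budget works out to roughly $11 per hour, so $50 in one hour is a 4.5× burn and pages, while $300 over a day is about 1.1× and stays quiet.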
Dashboard 2: Quality
Audience: product engineer + the PM who owns the feature. Question it answers: is the model getting worse, and on which slice?
Attributes read: app.route, app.eval.pass, app.eval.groundedness, gen_ai.request.model, gen_ai.response.finish_reason, plus a feedback.score attribute if you collect thumbs-up/down.
This one is the hardest of the three because "quality" isn't free. Something has to score the output. You have three options and the dashboard supports all of them:
- A runtime judge (cheap model rates the response) emitting app.eval.pass and app.eval.groundedness.
- A nightly batch eval over yesterday's traffic, emitting llm_eval_score by route.
- User feedback signals (thumbs, copy-button clicks, retry rate).
Panels:
- Eval pass rate per route: multi-line, last 14 days. Headline panel.
- Groundedness P50/P95 over time: two lines, last 14 days.
- User feedback ratio (thumbs-up / total): single stat + sparkline, last 7 days.
- Top failure types: pie or bar, last 24h. Group by a failure_type attribute your judge emits.
- Regression flag: annotation marker showing prompt-version deploys overlaid on eval pass rate.
- Refusal rate: single stat, last 24h. (finish_reason="content_filter" divided by total.)
Headline panel PromQL, eval pass rate per route over a 1-hour rolling window:
sum by (app_route) (
rate(llm_eval_pass_total{outcome="pass"}[1h])
)
/
sum by (app_route) (
rate(llm_eval_pass_total[1h])
)
Datadog:
sum:llm.eval.pass{outcome:pass} by {app_route}.as_rate()
/ sum:llm.eval.pass{*} by {app_route}.as_rate()
The regression-flag panel is what makes this dashboard worth opening. You overlay prompt-version deploy events as Grafana annotations on the pass-rate chart. Every time someone ships a prompt change, a vertical line appears. Pass rate dipping after a vertical line is the conversation you want to have, fast.
Sample Grafana JSON for the eval pass rate panel. Grafana defines annotation queries at the dashboard level, not on a panel, so this fragment shows the dashboard's annotations block for deploy events next to the panel:
{
  "annotations": {
    "list": [
      {
        "name": "Prompt deploys",
        "enable": true,
        "datasource": { "type": "prometheus", "uid": "prom-default" },
        "expr": "changes(llm_prompt_version_info[5m]) > 0",
        "titleFormat": "Prompt {{app_route}} deployed",
        "iconColor": "rgba(255, 96, 96, 0.8)"
      }
    ]
  },
  "panels": [
    {
      "type": "timeseries",
      "title": "Eval pass rate per route",
      "datasource": {
        "type": "prometheus",
        "uid": "prom-default"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "min": 0,
          "max": 1
        }
      },
      "targets": [
        {
          "refId": "A",
          "expr": "sum by (app_route) (rate(llm_eval_pass_total{outcome=\"pass\"}[1h])) / sum by (app_route) (rate(llm_eval_pass_total[1h]))",
          "legendFormat": "{{app_route}}"
        }
      ]
    }
  ]
}
SLO template:
# slo-quality.yaml
slo:
  name: llm-eval-pass-rate
  owner: product-eng
  description: Eval pass rate per route stays above floor
  objective:
    target: 0.92 # 92% pass rate
    window: 7d
  indicator:
    metric: llm_eval_pass_ratio
    group_by: [app_route]
  alerts:
    - name: quality-drop-fast
      burn_rate: 10 # 10x error budget in 1h
      window: 1h
      severity: page
    - name: quality-drop-slow
      burn_rate: 2
      window: 6h
      severity: ticket
The one gotcha here: don't alert on absolute pass rate. Alert on burn rate against the SLO. A pass rate of 88% on a route that historically sits at 96% is the same emergency as a pass rate of 75% on a route that's always been at 85%. Burn rate captures both. Absolute thresholds don't.
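To make the two-route comparison concrete, here is a small sketch of error-budget burn for a pass-rate SLO. It assumes each route has its own target set at its historical baseline, which is an assumption of this example, not something the template above encodes; the route name "search.answer" and the 80% absolute threshold are hypothetical.

```python
def error_budget_burn(pass_rate: float, target: float) -> float:
    """Observed failure rate divided by the failure rate the SLO allows.
    A value of 1.0 means the budget is being spent exactly on schedule."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    return (1.0 - pass_rate) / (1.0 - target)


# Hypothetical per-route targets, each set at the route's own baseline.
routes = {
    "support.summarize": {"target": 0.96, "observed": 0.88},
    "search.answer": {"target": 0.85, "observed": 0.75},
}

burns = {
    name: error_budget_burn(r["observed"], r["target"])
    for name, r in routes.items()
}

# A hypothetical absolute threshold ("alert under 80%") for comparison.
ABSOLUTE_FLOOR = 0.80
flagged_by_floor = [n for n, r in routes.items() if r["observed"] < ABSOLUTE_FLOOR]
flagged_by_burn = [n for n, b in burns.items() if b > 1.0]
```

Under these assumptions both routes burn faster than their budget allows (roughly 3× and 1.7×), while the absolute floor only notices the second one.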
Dashboard 3: Latency
Audience: on-call. Question it answers: how fast is the model, where's it slow, and is it us or the provider?
Attributes read: gen_ai.request.model, app.route, gen_ai.response.finish_reason, plus llm_request_ttft_seconds, llm_request_duration_seconds, llm_request_errors_total.
Panels:
- TTFT P95 per route: multi-line, last 6 hours. Headline panel.
- Inter-token gap (ITG) P95: multi-line, last 6 hours.
- Perceived latency composite: single stat, last 1h. TTFT + (output_tokens × ITG_p50).
- Provider-side vs total: two stacked lines per provider, last 6h. Shows what's network/queueing vs what's the provider.
- Error rate by status: multi-line, last 6 hours. Group by error.type.
- Rate-limit hit rate: single stat, last 1h. (429 errors / total.)
Why both TTFT and ITG? Because they fail differently. A slow TTFT usually means a cold start, a slow router, or a prompt-cache miss. A slow ITG means the model is genuinely struggling with the output: long generation, busy provider, deep reasoning step. Watching one without the other lies to you.
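As a sketch of how the two signals and the perceived-latency composite from the panel list fit together, the snippet below derives TTFT and inter-token gaps from token arrival timestamps. The function names and the example timestamps are made up for illustration; in practice the timestamps would come from your streaming client.

```python
from statistics import median


def stream_latency(request_sent_at, token_times):
    """Return (ttft_seconds, inter_token_gaps) for one streamed response."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent_at
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, gaps


def perceived_latency(ttft, output_tokens, itg_p50):
    """TTFT + (output_tokens x ITG_p50), the composite on the latency dashboard."""
    return ttft + output_tokens * itg_p50


# Made-up timestamps in seconds: first token after 0.5s, then one every 1/32s.
ttft, gaps = stream_latency(0.0, [0.5, 0.53125, 0.5625, 0.59375])
itg_p50 = median(gaps)
composite = perceived_latency(ttft, output_tokens=312, itg_p50=itg_p50)
```

With these numbers a 312-token answer feels like about ten seconds even though TTFT is half a second, which is why TTFT alone understates what users experience on long outputs.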
Headline PromQL, TTFT P95 per route:
histogram_quantile(0.95,
sum by (le, app_route) (
rate(llm_request_ttft_seconds_bucket[5m])
)
)
Datadog:
p95:llm.request.ttft_seconds{*} by {app_route}
Provider-side vs total. This is the panel that ends the "is it our fault or theirs" argument in 4 seconds:
# total
histogram_quantile(0.95,
sum by (le) (rate(llm_request_duration_seconds_bucket[5m]))
)
# provider-side only
histogram_quantile(0.95,
sum by (le) (rate(llm_provider_duration_seconds_bucket[5m]))
)
You emit llm_provider_duration_seconds from the collector based on a span you start at the HTTP-request boundary and end at the last-byte-received event. The gap between "provider-side" and "total" is everything your stack adds: queueing, retries, post-processing.
Sample Grafana JSON for the TTFT P95 panel with multiple percentiles:
{
  "type": "timeseries",
  "title": "TTFT per route (P50/P95/P99)",
  "datasource": {
    "type": "prometheus",
    "uid": "prom-default"
  },
  "fieldConfig": {
    "defaults": {
      "unit": "s",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "orange", "value": 1.5 },
          { "color": "red", "value": 3 }
        ]
      }
    }
  },
  "targets": [
    {
      "refId": "p50",
      "expr": "histogram_quantile(0.50, sum by (le, app_route) (rate(llm_request_ttft_seconds_bucket[5m])))",
      "legendFormat": "p50 {{app_route}}"
    },
    {
      "refId": "p95",
      "expr": "histogram_quantile(0.95, sum by (le, app_route) (rate(llm_request_ttft_seconds_bucket[5m])))",
      "legendFormat": "p95 {{app_route}}"
    },
    {
      "refId": "p99",
      "expr": "histogram_quantile(0.99, sum by (le, app_route) (rate(llm_request_ttft_seconds_bucket[5m])))",
      "legendFormat": "p99 {{app_route}}"
    }
  ]
}
SLO template:
# slo-latency.yaml
slo:
  name: llm-ttft-p95
  owner: on-call
  description: TTFT P95 stays under target per route
  objective:
    target: 1.5 # seconds
    window: 7d
    target_percentile: 95
  indicator:
    metric: llm_request_ttft_seconds_bucket
    group_by: [app_route]
  alerts:
    - name: ttft-burn-fast
      burn_rate: 14.4 # consumes ~8.6% of a 7d budget in 1h (2% of a 30d budget)
      window: 1h
      severity: page
    - name: ttft-burn-slow
      burn_rate: 3
      window: 6h
      severity: ticket
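The relationship between burn rate, alert window, and SLO window is simple arithmetic, sketched below with hypothetical helper names. The 14.4 figure is the familiar "2% of budget in one hour" threshold for a 30-day window; against the 7-day window in this template the same burn rate spends budget roughly four times faster.

```python
def budget_consumed(burn_rate, alert_window_hours, slo_window_hours):
    """Fraction of the error budget spent if burn_rate holds for the alert window."""
    return burn_rate * alert_window_hours / slo_window_hours


def burn_rate_for(budget_fraction, alert_window_hours, slo_window_hours):
    """Burn rate that spends budget_fraction of the budget within the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


HOURS_30D = 30 * 24  # 720
HOURS_7D = 7 * 24    # 168

consumed_30d = budget_consumed(14.4, 1, HOURS_30D)  # fraction of a 30d budget
consumed_7d = budget_consumed(14.4, 1, HOURS_7D)    # fraction of a 7d budget
rate_for_2pct_7d = burn_rate_for(0.02, 1, HOURS_7D)
```

If you want the fast alert to mean "2% of the budget gone in an hour" on a 7-day window, the matching burn rate is about 3.4, not 14.4. Which threshold you keep is a judgment call about how noisy a page you can tolerate.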
One alert per dashboard, not per panel
This is the part most teams get wrong, and it's the part that makes the difference between a dashboard you open during an incident and a dashboard you'd silence given the chance.
A dashboard with twelve panel-level alerts produces twelve pages when something goes wrong. A dashboard with one composite SLO-burn alert produces one page that says "the thing this dashboard is for is broken." Then the on-call opens the dashboard and the panels tell them which slice.
So: one burn-rate alert per dashboard, against the SLO that dashboard's owner defined. Cost gets a budget-burn alert. Quality gets a pass-rate-burn alert. Latency gets a TTFT-burn alert. That's three alerts total for the whole LLM platform.
Panel-level "anomaly detected" or "value > X" alerts go in a separate, optional dashboard people can subscribe to if they want. They don't page.
What to leave out
The pruning list is as important as the panels you keep. Things you'll be tempted to add and shouldn't:
- Tokens-per-second per provider. Interesting once, useless monthly. The latency dashboard already shows what users feel.
- Top prompt templates by length. A debugging tool, not a dashboard panel. Put it in a Jupyter notebook.
- Embedding cache size. Infrastructure metric, belongs on the platform dashboard, not the LLM one.
- A "models used today" leaderboard. Cute on launch day. Nobody reads it in week 3.
- Real-time spend ticker. The cost dashboard already shows spend. A live counter just makes it feel urgent when it isn't.
- Per-tenant quality scores on the main quality dashboard. Too much fan-out. Build a separate drill-down filtered by tenant.
- GPU utilization (if you self-host). Lives on the infra dashboard. Cross-link, don't duplicate.
- A "model comparison" panel that A/B's two models on the same chart. Belongs in the eval report, not the live dashboard. Live dashboards are for current state; comparisons are for change reviews.
The test for any new panel is brutal and works: would the dashboard's one owner open the dashboard at 2am to look at this panel? If not, it goes elsewhere.
Three dashboards. Three owners. Three composite alerts. One OTel pipeline. That's the shape that survives the year.
What does your team's LLM dashboard split look like today? One mega-dashboard nobody opens, three lean ones, or something messier? Drop the panel you'd cut first in the comments.
If this was useful
The dashboard layer is the visible 10%; the OTel attribute design, the collector processors that compute USD cost server-side, and the burn-rate alert math underneath are the 90% that decides whether any of it holds up under a real incident. The LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team walks through the full pipeline (span design, attribute conventions, collector setup, eval signal integration, and the picker for which tracing backend actually fits your team) so the dashboards you build on top of it don't need to be rewritten every quarter.
