- Book: LLM Observability Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Picture it: at 03:14 on a Tuesday, a single user fires a 240-message conversation at your customer-support agent and racks up four figures of model spend inside a 90-second window. The on-call engineer finds out from the finance Slack at 11am. The traces are there. The spans are there. The alert is not.
Three alerts catch the things that actually break LLM systems in 2026. They are easy to write. They are hard to write correctly, because the OpenTelemetry GenAI semantic conventions have churned through several revisions in the last year (for example, gen_ai.system was superseded by gen_ai.provider.name, and gen_ai.usage.prompt_tokens/completion_tokens became input_tokens/output_tokens), so most blog posts you find are using attribute names that were renamed or deprecated. The current names live in the OpenTelemetry GenAI semconv registry, and as of March 2026 most of them are still marked experimental. The OTEL_SEMCONV_STABILITY_OPT_IN env var lets you opt into emitting the latest experimental GenAI conventions during migration.
This post is the three alerts, the OTel attributes they read, the actual queries you paste into Grafana or Datadog, and a Python emitter that produces spans those queries can hit.
Why three
You can have ten alerts. Most teams either have zero or have ninety, and ninety means none of them work because the on-call mutes the channel. Three is the floor that catches the failures that actually hurt: cost, quality, and retrieval. Add more once you have these three working and tuned.
The three:
- Per-trace cost ceiling breach — a single request line item exceeds a threshold.
- Judge-score drift over a 7-day baseline — quality is silently regressing.
- Retrieval-relevance drop — RAG is sending the model worse context.
They map to the three things a user notices before they churn: the bill, the answer quality, and the "this thing has no idea what I'm asking" feeling.
The OTel attributes you actually need
The current GenAI conventions split into client spans (the request to a model) and server/agent spans (your application's view of the request). The attributes you need for the alerts below:
- gen_ai.request.model: e.g. "gpt-4o-2024-11-20"
- gen_ai.provider.name: e.g. "openai", "anthropic", "aws.bedrock"
- gen_ai.usage.input_tokens: int
- gen_ai.usage.output_tokens: int
- gen_ai.response.finish_reasons: array
- gen_ai.conversation.id: stable id across turns
- gen_ai.data_source.id: for RAG, the index/store id
A few that are not in the spec yet but you should add as custom attributes. They cost nothing and make the alerts below trivial:
- app.llm.cost_usd: float, computed from token usage
- app.llm.judge.score: float 0-1, from your eval/judge step
- app.rag.retrieved_count: int
- app.rag.top_score: float, similarity of top hit
- app.rag.relevance_score: float 0-1, from a relevance judge
The naming uses app.* because anything not in the spec belongs in a project-prefixed namespace. Datadog, Grafana, and Honeycomb all happily ingest custom attributes as long as you keep the prefix consistent.
A Python emitter that gets it right
This is the minimum honest emitter. It assumes you already have the OTel SDK initialized.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("app.llm")

# Token costs in USD per 1K tokens (April 2026 list prices;
# verify against your provider's pricing page — these rotate).
COSTS = {
    "gpt-4o-2024-11-20": (0.0025, 0.0100),
    "gpt-4o-mini": (0.00015, 0.00060),
    "claude-sonnet-4-5": (0.003, 0.015),
}

def usd(model: str, in_tok: int, out_tok: int) -> float:
    cin, cout = COSTS.get(model, (0.0, 0.0))
    return (in_tok / 1000) * cin + (out_tok / 1000) * cout

def emit_llm_span(model, provider, prompt, response,
                  usage, conv_id, judge_score=None):
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.provider.name", provider)
        span.set_attribute("gen_ai.usage.input_tokens", usage["in"])
        span.set_attribute("gen_ai.usage.output_tokens", usage["out"])
        span.set_attribute("gen_ai.conversation.id", conv_id)

        cost = usd(model, usage["in"], usage["out"])
        span.set_attribute("app.llm.cost_usd", cost)

        if judge_score is not None:
            span.set_attribute("app.llm.judge.score", judge_score)

        if response.get("error"):
            span.set_status(Status(StatusCode.ERROR, response["error"]))

        return cost
Two things this emitter does that most do not. It sets app.llm.cost_usd at emit time from your own price table; the provider's billing endpoint will not settle fast enough to drive an alert. It also threads gen_ai.conversation.id consistently across turns, which is the join key the cost-ceiling alert below needs. If you do not have a conversation id, generate one and thread it. Doing this later is painful.
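A minimal sketch of that generate-and-thread step, assuming a dict-like per-user session store; the session object and key name here are illustrative, not part of any SDK:

import uuid

def get_conversation_id(session: dict) -> str:
    # Mint the id once per conversation and reuse it on every turn, so all
    # spans in the conversation share one gen_ai.conversation.id, the key
    # the per-conversation cost query below sums by.
    if "conversation_id" not in session:
        session["conversation_id"] = str(uuid.uuid4())
    return session["conversation_id"]

Pass the result as conv_id on every emit_llm_span call for that conversation.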
Alert 1: per-trace cost ceiling breach
The runaway-loop pattern at the top of this post bites teams that have a per-call alert (no single call ever exceeds $5) but no per-trace alert. A long agent loop makes hundreds of cheap calls that sum to four figures inside one user session.
The alert: any conversation that crosses $X in cumulative cost over a rolling 5-minute window.
Prometheus / PromQL (assuming you've got the OTel collector exporting to Prometheus):
sum by (gen_ai_conversation_id) (
rate(app_llm_cost_usd_sum[5m]) * 300
) > 25
That fires when any single conversation accumulates more than $25 of model spend in a 5-minute window. Set the threshold based on your p99 — a conversation that has never legitimately exceeded $5 should not be allowed to silently exceed $25.
Datadog DDQL:
sum:app.llm.cost_usd{*} by {gen_ai.conversation.id}
.rollup(sum, 300) > 25
Grafana / Loki + LogQL (if you're aggregating from logs instead of metrics; assumes each log line is JSON with a flat cost_usd field):
sum by (gen_ai_conversation_id) (
sum_over_time(
{app="llm"} | json | unwrap cost_usd [5m]
)
) > 25
The mistake teams make here is alerting on per-tenant cost instead of per-conversation cost. Per-tenant catches the slow overspend. Per-conversation catches the runaway agent loop. The runaway loop is the one that wakes you up.
Alert 2: judge-score drift over a 7-day baseline
You are running a judge or eval step on a sample of production traffic. (If you are not, that is the alert before this one — but assume you are.) The judge emits app.llm.judge.score between 0 and 1. The drift alert fires when the rolling 1-day average drops more than two standard deviations below the trailing 7-day baseline.
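What the sampling side can look like, as a sketch; judge_fn stands in for whatever evaluator you run (a small model call or a rubric scorer) and is not a library function:

import random

def maybe_judge(prompt: str, answer: str, judge_fn, sample_rate: float = 0.05):
    # Score only a sample of production traffic so judge cost stays bounded.
    # Returns a float in [0, 1] from judge_fn, or None when not sampled.
    if random.random() > sample_rate:
        return None
    return judge_fn(prompt, answer)

Whatever it returns goes straight into emit_llm_span's judge_score argument; unsampled requests simply omit the attribute.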
The drift alert catches:
- A prompt change that subtly degraded quality.
- A model version flip on the provider's side (both Anthropic and OpenAI rotate model snapshots behind aliases).
- A retrieval regression that pushed the judge into "the model is hallucinating" territory.
- A new tenant whose data type your prompt cannot handle.
PromQL:
(
avg_over_time(app_llm_judge_score[1d])
-
avg_over_time(app_llm_judge_score[7d] offset 1d)
) <
( -2 * stddev_over_time(app_llm_judge_score[7d] offset 1d) )
DDQL:
avg:app.llm.judge.score{*}.rollup(avg, 86400)
- avg:app.llm.judge.score{*}.rollup(avg, 604800)
< -0.05
The DDQL form simplifies to a fixed threshold of 0.05 because Datadog's stddev rollup behavior over 7 days is awkward; pick a threshold from your own historical variance. The shape of failure this catches: someone bumps temperature from 0.2 to 0.7 "for variety," the judge sees a measurable drop within hours, and the team rolls back before a customer hits it.
The thing that makes this alert work in practice: slice it by the things that actually drift. Per model. Per prompt-version. Per tenant if you serve multiple. A global average smooths over the regression that is destroying one customer's experience.
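That slicing only works if the dimensions are on the span in the first place. Neither prompt version nor tenant id is in the GenAI spec, so emit them as custom attributes; the names below follow the app.* convention from earlier and are illustrative, not standard:

def add_slice_attributes(span, prompt_version: str, tenant_id: str) -> None:
    # Custom slicing dimensions; neither is part of the GenAI semconv.
    span.set_attribute("app.llm.prompt_version", prompt_version)
    span.set_attribute("app.tenant.id", tenant_id)

Call it inside emit_llm_span, then add those labels to the by / group-by clauses of the queries above once they reach your metrics backend.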
Alert 3: retrieval-relevance drop (RAG)
Your retriever returns chunks. A relevance judge (a small model or a heuristic) scores how relevant each retrieved chunk is to the query. You emit app.rag.relevance_score per request. When it drops, the model still produces fluent answers, but the answers are wrong, because the context is wrong.
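A sketch of the retrieval-side emission, reusing the tracer from the emitter above; the chunk shape and the relevance_judge callable are assumptions, not a fixed API:

def emit_retrieval_span(query: str, chunks: list, data_source_id: str,
                        relevance_judge=None) -> None:
    # chunks: assumed list of (text, similarity_score) pairs from the retriever.
    with tracer.start_as_current_span("rag.retrieve") as span:
        span.set_attribute("gen_ai.data_source.id", data_source_id)
        span.set_attribute("app.rag.retrieved_count", len(chunks))
        if chunks:
            span.set_attribute("app.rag.top_score",
                               max(score for _, score in chunks))
        if relevance_judge is not None:
            span.set_attribute(
                "app.rag.relevance_score",
                relevance_judge(query, [text for text, _ in chunks]),
            )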
This is the alert teams skip and regret. Cost spikes are loud and judge drift is medium-loud. Retrieval drops are silent: the system returns 200, the latency is fine, the model dutifully synthesizes nonsense.
PromQL:
avg_over_time(app_rag_relevance_score[1h]) < 0.55
and
avg_over_time(app_rag_relevance_score[7d]) > 0.70
Both terms matter. The first is the drop; the second confirms the 7-day baseline was healthy to begin with, so a corpus or tenant whose relevance has always hovered near 0.55 does not page as if something just broke.
DDQL:
avg:app.rag.relevance_score{*}.rollup(avg, 3600) < 0.55
Pair it with a separate alert on retrieved_count = 0 (the retriever returned nothing). That is a different failure mode (index empty, embedder broken, query rewriter producing garbage) and deserves its own page.
What to do once they fire
Three alerts means three runbooks. Each one is a single page.
For cost ceiling: page the on-call. Kill the conversation if it is still open (you should have a kill switch keyed on gen_ai.conversation.id). Find the loop. Most cost spikes are agent loops without a step limit.
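The kill switch can be as small as a shared set of conversation ids checked before every model call. A sketch assuming Redis as the shared store (any low-latency shared store works):

import redis

r = redis.Redis()  # assumed shared store; any low-latency KV works
KILL_SET = "llm:killed_conversations"

def kill_conversation(conv_id: str) -> None:
    r.sadd(KILL_SET, conv_id)

def check_kill_switch(conv_id: str) -> None:
    # Call this before every model call in the agent loop; killing a
    # conversation stops the loop on its next iteration.
    if r.sismember(KILL_SET, conv_id):
        raise RuntimeError(f"conversation {conv_id} killed by on-call")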
For judge drift: do not page. Open a ticket, tag the last deploy. Compare the judge-score histogram before and after the deploy boundary. If the deploy is the cause, roll back the prompt or the model alias. If it is not, look at tenant mix — a new customer with weird data might have crossed your sample threshold.
For retrieval drop: check the retriever, then the embedder, then the index. Most retrieval drops are index drift — a re-index that produced fewer or worse chunks, or a deletion that orphaned the index from the live corpus. The Building Production RAG writeup is a decent reference for what "good" relevance looks like at retrieval time.
The fourth alert you will want
Latency. Skip it for now and add it once these three are tuned. Latency alerts on LLM systems mostly catch provider incidents you cannot fix anyway, and they generate noise that crowds out the three alerts above. When you do add it, alert on p95 of provider latency rather than p50, and exclude tool-call spans from the calculation. Tool-call spans will swamp the signal: agent loops fan out into many tool calls per turn, and their latency does not reflect user-perceived wait.
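Excluding tool calls is only practical if they are tagged. One way, sketched here, is to give tool executions their own spans with gen_ai.operation.name set to "execute_tool" (the operation name the current GenAI registry uses for tool execution, at the time of writing), so the latency query can filter on it:

from contextlib import contextmanager

@contextmanager
def tool_span(tool_name: str):
    # Tool executions get their own span and operation name so the latency
    # alert can filter them out of the p95 calculation.
    with tracer.start_as_current_span(f"execute_tool {tool_name}") as span:
        span.set_attribute("gen_ai.operation.name", "execute_tool")
        span.set_attribute("gen_ai.tool.name", tool_name)
        yield span

Tag the model-call span in emit_llm_span with gen_ai.operation.name set to "chat" and the two are cleanly separable in the query.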
The three alerts take an afternoon to write and a week to tune. Ship them tomorrow.
If this was useful
The LLM Observability Pocket Guide walks through the full OTel GenAI attribute set, what a healthy span actually looks like, and the alert-tuning loop that keeps these queries from going stale every time a model version rotates.