An LLM Error Taxonomy: Classifying Failures in Your Traces

#llm #observability #devops #tutorial

Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You open the LLM dashboard after a bad week. Error rate is up. The number is 4.2%. That is the whole story the panel tells you. Up from what? Caused by what? You have no idea, because "error" is one bucket and everything that goes wrong gets thrown into it.

So you start reading traces by hand. One was a refusal. One got cut off mid-sentence. One returned JSON your parser choked on. One was a clean 200 with a fabricated citation. One timed out. One was a 429 from the provider. Six different problems, six different fixes, one number on the dashboard.

The fix is a taxonomy. You decide on a fixed set of error classes, tag every span with the one it hit, and your dashboard goes from "4.2% bad" to "refusals doubled after the system-prompt change, everything else flat." That sentence is actionable. The 4.2% was not.

The six classes that cover most of it

You do not need fifty categories. You need enough to route an incident to the right person. These six cover the failures that get engineers paged, and each one has a different owner and a different fix.

refusal — the model declined. Safety filter, policy, or it decided your prompt was out of scope. Owner: prompt or policy.
truncation — the response stopped early. Hit max_tokens, or the provider cut it. Owner: config.
tool_call_malformed — the model emitted a tool call your code could not parse or validate. Bad JSON, wrong schema, hallucinated argument. Owner: tool schema or prompt.
hallucination — fluent, well-formed, wrong. Caught by an eval or a user report, not by a status code. Owner: retrieval or grounding.
timeout — the request never came back inside your deadline. Owner: infra or provider.
rate_limit — the provider returned 429. Owner: capacity or retry policy.

Two of these (timeout, rate_limit) are transport failures your APM already half-sees. The other four live inside a 200 and your APM is blind to them. That asymmetry is the whole reason you tag.
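The owner column can live in code as plain data, so an alert can name a team instead of a class. A minimal sketch: the owner strings are copied from the list above, and OWNERS, TRANSPORT, route, and hidden_in_200 are hypothetical names, not part of any library.

```python
# Hypothetical routing table; owner labels mirror the list above.
OWNERS = {
    "refusal": "prompt or policy",
    "truncation": "config",
    "tool_call_malformed": "tool schema or prompt",
    "hallucination": "retrieval or grounding",
    "timeout": "infra or provider",
    "rate_limit": "capacity or retry policy",
}

# Transport failures: the ones an APM already partly sees.
TRANSPORT = frozenset({"timeout", "rate_limit"})


def route(error_class: str) -> str:
    """Return the owner for a failure class (KeyError if unknown)."""
    return OWNERS[error_class]


def hidden_in_200(error_class: str) -> bool:
    """True for classes that come back as a successful HTTP response."""
    return error_class in OWNERS and error_class not in TRANSPORT
```

Keeping the table next to the class definitions means adding a class forces a decision about who owns it.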

One attribute carries the class

Put the class on the span as a single low-cardinality attribute. Low-cardinality matters: you want to group by it without blowing up your metrics backend. Use a fixed string from a closed set.

from enum import Enum


class LLMError(str, Enum):
    OK = "ok"
    REFUSAL = "refusal"
    TRUNCATION = "truncation"
    TOOL_CALL_MALFORMED = "tool_call_malformed"
    HALLUCINATION = "hallucination"
    TIMEOUT = "timeout"
    RATE_LIMIT = "rate_limit"

The OTel GenAI spec does not define an attribute for this, so it goes under your own project prefix: app.llm.error_class. Keep the prefix consistent across services so one query works everywhere.
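One way to keep the set closed in practice is to reject anything outside it at the point where the attribute is built. A sketch under assumptions: error_class_attrs and ERROR_CLASS_ATTR are hypothetical names, and the enum is repeated only so the snippet runs on its own.

```python
from enum import Enum


class LLMError(str, Enum):
    OK = "ok"
    REFUSAL = "refusal"
    TRUNCATION = "truncation"
    TOOL_CALL_MALFORMED = "tool_call_malformed"
    HALLUCINATION = "hallucination"
    TIMEOUT = "timeout"
    RATE_LIMIT = "rate_limit"


ERROR_CLASS_ATTR = "app.llm.error_class"


def error_class_attrs(value) -> dict:
    """Build the span attribute, rejecting values outside the closed set."""
    # LLMError(value) raises ValueError for unknown strings, which keeps
    # free-text labels from leaking in and inflating cardinality.
    cls = LLMError(value)
    return {ERROR_CLASS_ATTR: cls.value}
```

A caller that passes "model said no" gets a ValueError at emit time instead of a new series in the metrics backend.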

Classifying at emit time

Most of these you can detect the moment the response lands, with no extra model call. The finish reason (the value the GenAI conventions record as gen_ai.response.finish_reasons) does a lot of the work, and the rest is cheap string and parse checks.
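The parse check can be quite small. A sketch, assuming each tool call arrives as a dict with a name and a JSON-encoded arguments string, and that you keep the required argument names per tool. tool_calls_parse_ok and REQUIRED_ARGS are hypothetical names, and get_weather is an invented tool. The boolean it returns is what the classifier takes as parsed_ok.

```python
import json

# Hypothetical registry: tool name -> argument names that must be present.
REQUIRED_ARGS = {"get_weather": {"city"}}


def tool_calls_parse_ok(tool_calls) -> bool:
    """True only if every call names a known tool and carries valid
    JSON arguments containing all required keys."""
    for call in tool_calls:
        required = REQUIRED_ARGS.get(call.get("name"))
        if required is None:
            return False  # a tool the model made up
        try:
            args = json.loads(call.get("arguments", ""))
        except (json.JSONDecodeError, TypeError):
            return False  # bad or cut-off JSON
        if not isinstance(args, dict) or not required.issubset(args):
            return False  # wrong shape or missing argument
    return True
```

A real validator would also check argument types against the tool schema; this one stops at presence.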

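def classify(resp, finish_reason, parsed_ok):
    # Order encodes precedence: the first matching check wins.
    # Truncation comes first because cut-off JSON also fails to parse.
    if finish_reason == "length":
        return LLMError.TRUNCATION
    # Provider-side filters report a refusal via the finish reason.
    if finish_reason in ("content_filter", "safety"):
        return LLMError.REFUSAL
    # parsed_ok is the caller's parse-and-validate result for tool calls.
    if resp.get("tool_calls") and not parsed_ok:
        return LLMError.TOOL_CALL_MALFORMED
    # Refusals written as ordinary text need the phrase heuristic.
    # "or" guards against a text field that is present but None.
    if looks_like_refusal(resp.get("text") or ""):
        return LLMError.REFUSAL
    return LLMError.OK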


def looks_like_refusal(text: str) -> bool:
    t = text.strip().lower()
    cues = (
        "i can't help with",
        "i cannot assist",
        "i'm not able to",
        "as an ai",
    )
    return any(t.startswith(c) for c in cues)

The refusal heuristic is a starting point, not a verdict. Phrase matching catches the obvious cases and misses the polite ones. Treat it as a cheap first pass and let an eval correct it later.
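To make the gap concrete, here is the same heuristic run against three replies. The function is repeated so the snippet runs on its own, and the sample strings are invented for illustration.

```python
def looks_like_refusal(text: str) -> bool:
    t = text.strip().lower()
    cues = (
        "i can't help with",
        "i cannot assist",
        "i'm not able to",
        "as an ai",
    )
    return any(t.startswith(c) for c in cues)


samples = {
    "blunt": "I can't help with that request.",
    # Same words, but with a typographic apostrophe (U+2019).
    "curly": "I can\u2019t help with that request.",
    # A refusal that never uses a cue phrase.
    "polite": "Thanks for asking! That falls outside what I can discuss.",
}

caught = {name: looks_like_refusal(text) for name, text in samples.items()}
```

Only the blunt reply is caught. Normalizing apostrophes before matching would recover the second one; the third needs the eval pass.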

Timeout and rate_limit you classify where you catch the exception, not from the response body:

import httpx


def classify_transport(exc) -> LLMError:
    if isinstance(exc, httpx.TimeoutException):
        return LLMError.TIMEOUT
    if (isinstance(exc, httpx.HTTPStatusError)
            and exc.response.status_code == 429):
        return LLMError.RATE_LIMIT
    raise exc

Hallucination is the one you cannot catch at emit time

The other five classes are knowable when the request finishes. Hallucination is not. The response is a clean 200 with the right shape, and only an eval or a human knows it is wrong.

So you tag it on a second pass. Run a judge over a sample of traffic, and when it flags an answer as unfaithful, write the class back onto the original span by id.

def tag_hallucination(backend, span_id, judge_verdict):
    # backend is a placeholder for your tracing backend's client;
    # update_attribute stands in for whatever update or annotation
    # call it exposes, not a real library API.
    if judge_verdict.faithful:
        return
    # Re-open the stored span by id and overwrite the class.
    backend.update_attribute(
        span_id,
        "app.llm.error_class",
        LLMError.HALLUCINATION.value,
    )

If your tracing backend will not let you mutate a closed span, emit a separate llm.eval span that links back to the original span (by trace id and span id) and carries the class. Your dashboard query then unions the two sources. Either shape works. What does not work is leaving hallucinations in the ok bucket because they returned 200.
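The union is easy to get subtly wrong, so here is one way to express it outside the query language. A sketch over in-memory records: the dict shapes and the effective_classes name are hypothetical, standing in for whatever your backend returns for original spans and for correction spans.

```python
def effective_classes(spans, corrections):
    """Merge emit-time classes with second-pass corrections.

    spans: iterable of {"span_id": str, "error_class": str}
    corrections: iterable of {"target_span_id": str, "error_class": str}
    Returns {span_id: class}; a correction overrides the emit-time tag.
    """
    merged = {s["span_id"]: s["error_class"] for s in spans}
    for c in corrections:
        # Skip corrections whose target span is not in this result set,
        # for example because it was sampled out or has expired.
        if c["target_span_id"] in merged:
            merged[c["target_span_id"]] = c["error_class"]
    return merged


spans = [
    {"span_id": "a1", "error_class": "ok"},
    {"span_id": "b2", "error_class": "ok"},
    {"span_id": "c3", "error_class": "truncation"},
]
corrections = [{"target_span_id": "b2", "error_class": "hallucination"}]
result = effective_classes(spans, corrections)
```

Here span b2 moves from ok to hallucination and the other two keep their emit-time class. Whether a correction should also override a non-ok class is a precedence decision for your runbook.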

Wiring it into the span

The full emit path sets the class once and sets the span status to error for the transport failures, so your existing error-rate panel still lights up for those.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("app.llm")


def emit(model, resp, finish_reason, parsed_ok, exc=None):
    with tracer.start_as_current_span("gen_ai.chat") as sp:
        sp.set_attribute("gen_ai.request.model", model)
        # Transport failures arrive as exceptions, so classify()
        # alone can never return TIMEOUT or RATE_LIMIT.
        if exc is not None:
            cls = classify_transport(exc)
        else:
            cls = classify(resp, finish_reason, parsed_ok)
        sp.set_attribute("app.llm.error_class", cls.value)
        if cls in (LLMError.TIMEOUT, LLMError.RATE_LIMIT):
            sp.set_status(Status(StatusCode.ERROR, cls.value))
        return cls

Note what this does not do: it does not set span status to error for refusals or truncations. Those are not transport errors, and flagging them red would re-pollute the one number you were trying to clean up. The class attribute carries the nuance; the status stays honest about the transport layer.

Slicing the dashboard

Now the payoff. Instead of one error-rate panel, you get a stacked breakdown by class. The shape of the stack tells you which fix to reach for before you open a single trace.

A per-class rate in PromQL, assuming the collector exports a counter named app_llm_requests_total, incremented once per span, with the class exposed as an error_class label:

sum by (error_class) (
  rate(app_llm_requests_total[5m])
)

Stack that and you read incidents off the chart. A spike in refusal right after a deploy points at the system prompt. A spike in tool_call_malformed points at a schema change or a model version flip. A climb in rate_limit is a capacity conversation, not a code one.

The same idea as a Datadog metric query, with the class as a tag:

sum:app.llm.requests{*} by {error_class}.as_rate()

The query that earns its keep is the ratio of one class to total. Refusal rate, for instance:

sum(rate(app_llm_requests_total{error_class="refusal"}[1h]))
/
sum(rate(app_llm_requests_total[1h]))

Alert on that ratio per class, not on the aggregate. A 2% global error rate can hide a refusal rate that quietly tripled on one tenant, so if you export a tenant label, slice by that too. The aggregate smooths over the regression that is breaking one customer.
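A worked example of how the aggregate hides that kind of regression. The counts and tenant names are made up for illustration, and the tenant dimension is an assumption: the queries above only see it if a tenant label is exported alongside the class.

```python
from collections import Counter

# Hypothetical hourly counts keyed by (tenant, error_class).
before = Counter({
    ("acme", "ok"): 980, ("acme", "refusal"): 20,
    ("globex", "ok"): 8820, ("globex", "refusal"): 180,
})
after = Counter({
    ("acme", "ok"): 940, ("acme", "refusal"): 60,        # tripled
    ("globex", "ok"): 8820, ("globex", "refusal"): 180,  # flat
})


def refusal_rate(counts, tenant=None):
    """Refusals over total requests, optionally for one tenant."""
    rows = {k: v for k, v in counts.items() if tenant in (None, k[0])}
    total = sum(rows.values())
    refusals = sum(v for (_, cls), v in rows.items() if cls == "refusal")
    return refusals / total


global_shift = (refusal_rate(before), refusal_rate(after))
acme_shift = (refusal_rate(before, "acme"), refusal_rate(after, "acme"))
```

Globally the refusal rate moves from 2.0% to 2.4%, which few alert thresholds would catch. For acme it moves from 2% to 6%.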

Where teams get the taxonomy wrong

Two mistakes show up again and again. The first is too many classes. If you have twenty, nobody remembers them and the tagging drifts; half your spans end up in other. Start with these six and split a class only when a real incident proves you need the resolution.

The second is treating the classes as mutually exclusive when they cascade. A truncation can produce a malformed tool call, because the JSON got cut off. Pick the root, not the symptom: tag it truncation, because raising max_tokens fixes both. Encode that precedence in the classify order and write it in the runbook so two engineers tag the same failure the same way.
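One way to encode that precedence so it cannot drift: collect every symptom that fired, then return the first one in a fixed order. A sketch; PRECEDENCE and pick_root are hypothetical names, and the ordering beyond truncation-before-malformed is an assumption to settle in your own runbook.

```python
# Root causes first: earlier entries win when several symptoms fire.
PRECEDENCE = (
    "timeout",
    "rate_limit",
    "truncation",
    "refusal",
    "tool_call_malformed",
    "hallucination",
)


def pick_root(symptoms) -> str:
    """Return the highest-precedence class among the observed symptoms."""
    observed = set(symptoms)
    unknown = observed - set(PRECEDENCE)
    if unknown:
        raise ValueError(f"not in the closed set: {sorted(unknown)}")
    for cls in PRECEDENCE:
        if cls in observed:
            return cls
    return "ok"
```

A response that was both cut off and unparseable resolves to truncation, which matches the rule above, and the tuple doubles as the runbook's ordering.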

Get the closed set right, tag every span, slice by class. The dashboard stops saying "something is wrong" and starts saying which thing, which is the only version of that sentence you can act on at 02:00.

If this was useful

A taxonomy is only as good as the spans it sits on, and getting those spans right means picking tracing and eval tools that let you tag, mutate, and slice without fighting the backend. The LLM Observability Pocket Guide walks through that tooling choice and the attribute conventions that keep a taxonomy like this from rotting every time a model version rotates.