Julio Molina Soler

LLM Observability Audit: 32% Error Rate, 720K-Token Bug, and One $1.11 Call

A self-hosted Langfuse instance, 21 hours of production traffic, 516 traces, $2.86 in spend, and an OpenRouter-fronted LLM router shuffling 24 different models. I pulled the entire dataset through Langfuse's REST API and ran a flat audit. Below is what surfaced — the kind of findings that don't show up on a dashboard until you actually grep the data.

This is a walkthrough of (1) how to extract every observable from Langfuse via the public API, and (2) the five concrete bugs the data exposed.

1. Pulling the data

Langfuse's public API at /api/public/* uses HTTP Basic Auth with a project-scoped key pair (pk-lf-… / sk-lf-…). Self-hosted and cloud (cloud.langfuse.com, us.cloud.langfuse.com) are identical. Three endpoints carry 95% of the analytical signal:

  • /api/public/traces — top-level requests
  • /api/public/observations — spans, generations, events (the LLM-level detail)
  • /api/public/scores — evaluator outputs

All paginate with page / limit (max 100) and return a meta block with totalPages. A minimal extractor:

import os, httpx
from dotenv import load_dotenv
load_dotenv()

BASE = os.environ["LANGFUSE_BASE_URL"].rstrip("/")
AUTH = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])

def paginate(client, path, params=None):
    params = dict(params or {})
    params.setdefault("limit", 100)
    page = 1
    while True:
        params["page"] = page
        r = client.get(f"{BASE}{path}", params=params)
        r.raise_for_status()
        j = r.json()
        yield from j.get("data", [])
        if page >= j.get("meta", {}).get("totalPages", 1):
            break
        page += 1

with httpx.Client(auth=AUTH, timeout=60) as c:
    traces = list(paginate(c, "/api/public/traces"))
    obs    = list(paginate(c, "/api/public/observations"))
    scores = list(paginate(c, "/api/public/scores"))

Three calls, 1,398 records, full dataset on disk. From here it's pandas.

2. The first red flag: 32.1% error rate

Filtering observations to type == "GENERATION" and name == "LLM Generation" (the application's actual LLM calls, excluding the LLM-as-a-judge evaluator runs) gives 330 generations. Of those, 106 carry level == "ERROR":

Total errors: 106 / 330 = 32.1%

Classification by statusMessage:
  ctx_overflow     91
  other            15

A third of production calls failing isn't a tail problem — it's a structural one. Two patterns explain almost all of it.
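These numbers fall out of two pandas filters. A minimal sketch, with a toy sample standing in for the observations pulled by the extractor above (field names follow Langfuse's observation schema):

```python
import pandas as pd

# Toy rows standing in for the extracted observations.
obs = [
    {"type": "GENERATION", "name": "LLM Generation", "level": "ERROR",
     "statusMessage": "This endpoint's maximum context length is 262144 tokens..."},
    {"type": "GENERATION", "name": "LLM Generation", "level": "ERROR",
     "statusMessage": "upstream provider error"},
    {"type": "GENERATION", "name": "LLM Generation", "level": "DEFAULT",
     "statusMessage": None},
    {"type": "SPAN", "name": "retrieval", "level": "DEFAULT", "statusMessage": None},
]

df = pd.DataFrame(obs)

# The application's real LLM calls, excluding judge/evaluator runs.
gens = df[(df["type"] == "GENERATION") & (df["name"] == "LLM Generation")]
errors = gens[gens["level"] == "ERROR"]
rate = len(errors) / len(gens)

# Rough classification: context-overflow rejections vs everything else.
is_ctx = errors["statusMessage"].str.contains("maximum context length", na=False)
by_class = is_ctx.map({True: "ctx_overflow", False: "other"}).value_counts()
```

On the real dataset the same filters produce the 106 / 330 split above.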

3. Bug #1: max_tokens set to 720,000

Every ctx_overflow error had a near-identical statusMessage:

This endpoint's maximum context length is 262144 tokens. However, you requested about 720337 tokens (337 of text input, 720000 in the output)…

The input was 337 tokens. The system was requesting 720,000 output tokens. No model on the planet has a 720K output budget, so OpenRouter rejected the request before any inference ran (median latency: 0.094s — gateway-level rejection).

A value like 720000 smells like an int that should have been 720, or a temperature * 1000-style cast applied to the wrong field. Either way, the fix is a single line in the request builder:

def cap_max_tokens(model_ctx: int, input_tok: int, requested: int, margin: int = 256) -> int:
    return min(requested, max(0, model_ctx - input_tok - margin), 8192)

Hardcode an upper sanity bound (8192) regardless of what gets passed in. This alone removes ~28% of all errors.
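Sanity-checking the guard against the exact request the error message describes (function repeated here so the snippet runs standalone):

```python
def cap_max_tokens(model_ctx: int, input_tok: int, requested: int, margin: int = 256) -> int:
    # Clamp to what fits in the context window, then to a hard sanity ceiling.
    return min(requested, max(0, model_ctx - input_tok - margin), 8192)

# The rejected call: 262144-token context, 337 input tokens, 720000 requested output.
capped = cap_max_tokens(262144, 337, 720000)
print(capped)  # 8192 — the sanity bound wins
```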

4. Bug #2: invalid model slugs

Two slugs failed 100% of the time:

| Slug | Calls | Errors |
| --- | --- | --- |
| openrouter/free | 91 | 91 |
| google/gemma-4-26b-a4b-it:free | 9 | 9 |

openrouter/free is not a real model — it looks like a placeholder or a fallback the routing layer emits when no slug is resolved. Latency p50 = 0.094s confirms gateway rejection. gemma-4-26b-a4b-it doesn't exist in OpenRouter's catalog either (Gemma 4 isn't a real release; the closest valid Gemma slugs are 2 and 3).

The fix is a startup-time validation against OpenRouter's /api/v1/models endpoint:

async def validate_models(used_slugs: set[str]) -> None:
    async with httpx.AsyncClient() as client:
        r = await client.get("https://openrouter.ai/api/v1/models")
        r.raise_for_status()
        valid = {m["id"] for m in r.json()["data"]}
    if invalid := used_slugs - valid:
        raise RuntimeError(f"Unknown OpenRouter slugs: {invalid}")

Run this in CI against your config. Catches drift the moment a model deprecates.

5. Bug #3: cost concentration — 52% of spend in 2 calls

Total cost across 330 generations: $2.8577. Of that, $1.486 (52%) came from two anthropic/claude-opus-4.6 calls:

| traceId | model | input tokens | cost |
| --- | --- | --- | --- |
| #1 | claude-opus-4.6 | 221,266 | $1.1086 |
| #2 | claude-opus-4.6 | 75,101 | $0.3773 |

A 221K input prompt to Opus is either an entire RAG corpus shoved into context, full chat history with no truncation, or a pasted document. Looking at the next tier — four gemini-2.5-flash-lite calls each carrying ~189K input tokens — confirms the pattern. The retrieval layer isn't truncating.

Cheap fix:

def trim_context(chunks: list[Chunk], budget_tok: int, encoder) -> list[Chunk]:
    """Greedy by score, stop when budget is exhausted."""
    chunks = sorted(chunks, key=lambda c: c.score, reverse=True)
    out, used = [], 0
    for c in chunks:
        n = len(encoder.encode(c.text))
        if used + n > budget_tok:
            break
        out.append(c); used += n
    return out

Pair with a hard ceiling on the system prompt + retrieved-content combined size, well below the model's context window. A 32K input cap on Opus would have cut that single call from $1.11 to ~$0.17.
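Back-of-envelope on that claim, using only numbers already in the trace (observed cost and input tokens of call #1) to derive an implied per-token input rate:

```python
# Observed: $1.1086 for 221,266 input tokens on claude-opus-4.6.
observed_cost = 1.1086
observed_input = 221_266

rate_per_tok = observed_cost / observed_input  # implied $/input-token
capped_cost = 32_000 * rate_per_tok            # the same call under a 32K input cap

# ~ $0.16 of input cost; the (unchanged) output tokens account for the
# remainder of the ~$0.17 estimate.
print(f"${capped_cost:.2f}")
```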

6. Bug #4: input/output token ratio of 97:1

Aggregate token counts across the 330 generations:

  • Input: 9,745,108 tokens
  • Output: 100,371 tokens
  • Ratio: 97:1

A typical chat workload sits around 3:1 to 10:1. 97:1 means the system is shipping massive prompts and getting tiny responses. Combined with the cost finding above, this is a strong signal that:

  • Prompts include retrieved context that isn't deduplicated across turns.
  • Output is being aggressively constrained (tool-call JSON, classification, scoring) but the input side has no equivalent budget.

Action: add a token-budget metric per request to your dashboards. If the ratio drifts past ~20:1 sustained, your retrieval is overshooting.
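A minimal version of that metric, assuming each generation record carries promptTokens / completionTokens fields (the names are illustrative; use whatever your tracing layer reports):

```python
RATIO_ALERT = 20  # sustained input:output above this suggests retrieval overshoot

def token_ratio(input_tok: int, output_tok: int) -> float:
    """Input/output token ratio; guards against zero-output calls."""
    return input_tok / max(output_tok, 1)

def flag_overshoot(gens: list[dict]) -> bool:
    """Fleet-level check over a batch of generation records."""
    total_in = sum(g["promptTokens"] for g in gens)
    total_out = sum(g["completionTokens"] for g in gens)
    return token_ratio(total_in, total_out) > RATIO_ALERT
```

Fed the aggregate numbers above (9,745,108 in, 100,371 out) it flags immediately at ~97:1.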

7. Quality signal: model leaderboard from LLM-as-a-judge

A separate evaluator pipeline runs gemini-2.5-flash over each generation, scoring Correctness ∈ [0,1]. 183 scored runs across the model fleet (n ≥ 5):

| Model | n | mean Correctness |
| --- | --- | --- |
| openai/gpt-oss-20b:free | 5 | 0.940 |
| openai/gpt-oss-120b:free | 10 | 0.870 |
| qwen/qwen3-coder:free | 11 | 0.836 |
| nvidia/nemotron-3-nano-30b-a3b:free | 8 | 0.819 |
| qwen/qwen3-next-80b-a3b-instruct:free | 7 | 0.814 |
| z-ai/glm-4.5-air:free | 8 | 0.800 |
| nvidia/nemotron-3-super-120b-a12b:free | 9 | 0.767 |
| meta-llama/llama-3.3-70b-instruct:free | 8 | 0.739 |
| nvidia/nemotron-nano-12b-v2-vl:free | 10 | 0.735 |
| poolside/laguna-xs.2:free | 6 | 0.700 |
| poolside/laguna-m.1:free | 6 | 0.683 |
| nvidia/nemotron-nano-9b-v2:free | 10 | 0.680 |
| tencent/hy3-preview:free | 9 | 0.589 |

Caveats: small samples, the judge is itself an LLM (gemini-2.5-flash), and "Correctness" was scored against ground-truth replications — which means the metric rewards faithful reproduction, not creative quality. Still, the spread is large enough that tencent/hy3-preview:free (0.589) is meaningfully below the median (~0.79). On a free-tier router that sees this slug routinely, the ROI is removing it.

gpt-oss-20b topping the chart is more interesting: a 20B model beating 70B+ peers on this workload suggests the workload is not capacity-bound. If your evaluator confirms similar results, your routing weights should reflect it.
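The leaderboard itself is one groupby away once scores are joined to their generations. A sketch with toy rows (column names assumed):

```python
import pandas as pd

# Toy rows standing in for judge scores joined to their generations.
rows = pd.DataFrame({
    "model":       ["a", "a", "a", "a", "a", "b", "b"],
    "correctness": [1.0, 0.9, 0.9, 1.0, 0.9, 0.5, 0.7],
})

board = (rows.groupby("model")["correctness"]
             .agg(n="count", mean_correctness="mean"))
# Drop small samples, rank best-first — the n >= 5 cut used above.
board = board[board["n"] >= 5].sort_values("mean_correctness", ascending=False)
```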

8. Latency tail

p50    3.2s
p95   30.1s
p99   69.6s
max  223.7s

The p99 is 22× the median. The 223.7s outlier was a minimax/minimax-m2.5:free call with 20,619 input / 86 output tokens — not pathological size, just a free-tier provider stalling. Three takeaways:

  1. Per-request timeouts, scoped per model. A free-tier slug should not get 220 seconds.
  2. Hedging: fire a backup request to a different provider after 2× p50.
  3. Retry budget: cap retries at the request level, not per-call, or your tail amplifies.
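Takeaway 2 in sketch form: race the primary against a delayed backup and return whichever answers first. call_model is a stand-in for your async provider call; the hedge delay would come from your observed p50:

```python
import asyncio

async def hedged_call(call_model, primary: str, backup: str, hedge_after: float):
    """Fire primary immediately; if it hasn't answered after hedge_after
    seconds (~2x p50), fire backup too and return the first result."""
    tasks = [asyncio.create_task(call_model(primary))]
    done, _ = await asyncio.wait(tasks, timeout=hedge_after)
    if not done:
        # Primary is stalling: hedge with a second provider.
        tasks.append(asyncio.create_task(call_model(backup)))
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    result = done.pop().result()
    for t in tasks:
        t.cancel()  # no-op on the finished task, stops the loser
    return result
```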

9. Observability gaps that made this audit harder than it needed to be

Three fields were essentially empty across the dataset:

  • userId: populated on 0.6% of traces.
  • sessionId: 0 unique sessions across 516 traces.
  • release: 0 populated.

Without these, you can't:

  • Bisect a regression to a deploy.
  • Reconstruct a multi-turn conversation from disjoint traces.
  • Attribute cost or errors to a customer cohort.

The Langfuse SDK accepts these as keyword args on every trace. They cost nothing to populate and are the single highest-leverage observability change you can make:

langfuse.trace(
    name="chat_completion",
    user_id=request.user_id,
    session_id=request.session_id,
    release=os.environ["GIT_SHA"],
    tags=[request.feature_flag],
    metadata={"tier": request.user.tier},
)

10. Prioritized action list

In order of effort-to-impact:

  1. Cap max_tokens server-side. Eliminates 28% of errors. One line.
  2. Validate model slugs at startup against OpenRouter's catalog. Eliminates the remaining ~3% of slug-related errors and prevents silent drift.
  3. Populate userId / sessionId / release on every trace. Zero perf cost, unblocks every future audit.
  4. Add an input-token budget to the retrieval layer. Will cut top-tier model spend by an order of magnitude on this workload.
  5. Per-model timeouts and hedging. Brings p99 latency under control.
  6. Drop tencent/hy3-preview:free from the routing pool until you have larger-n quality evidence.

Closing note

The audit took roughly 90 minutes of API pulling and pandas. The fixes are five lines of defensive code and a configuration change. The reason a 32% error rate persisted long enough to produce 516 traces of evidence is that none of these failures were loud — OpenRouter returned errors as completed responses, the gateway rejections were sub-100ms, and the cost spikes were in single calls that didn't trip any alert. What killed visibility wasn't the absence of telemetry — it was the absence of aggregation. Langfuse stored everything correctly. Nobody had run groupby(model).agg(error_rate) until now.

If you're running an LLM router on free-tier infrastructure and you haven't done this exact audit on your own data, you almost certainly have at least two of these five bugs. The REST API is right there.
