A self-hosted Langfuse instance, 21 hours of production traffic, 516 traces, $2.86 in spend, and an OpenRouter-fronted LLM router shuffling 24 different models. I pulled the entire dataset through Langfuse's REST API and ran a flat audit. Below is what surfaced — the kind of findings that don't show up on a dashboard until you actually grep the data.
This is a walkthrough of (1) how to extract every observable from Langfuse via the public API, and (2) the five concrete bugs the data exposed.
1. Pulling the data
Langfuse's public API at `/api/public/*` uses HTTP Basic Auth with a project-scoped key pair (pk-lf-… / sk-lf-…). The API surface is identical for self-hosted and cloud (cloud.langfuse.com, us.cloud.langfuse.com). Three endpoints carry 95% of the analytical signal:
- `/api/public/traces` — top-level requests
- `/api/public/observations` — spans, generations, events (the LLM-level detail)
- `/api/public/scores` — evaluator outputs
All three paginate with `page`/`limit` (limit maxes out at 100) and return a `meta` block with `totalPages`. A minimal extractor:
```python
import os, httpx
from dotenv import load_dotenv

load_dotenv()
BASE = os.environ["LANGFUSE_BASE_URL"].rstrip("/")
AUTH = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])

def paginate(client, path, params=None):
    params = dict(params or {})
    params.setdefault("limit", 100)
    page = 1
    while True:
        params["page"] = page
        r = client.get(f"{BASE}{path}", params=params)
        r.raise_for_status()
        j = r.json()
        yield from j.get("data", [])
        if page >= j.get("meta", {}).get("totalPages", 1):
            break
        page += 1

with httpx.Client(auth=AUTH, timeout=60) as c:
    traces = list(paginate(c, "/api/public/traces"))
    obs = list(paginate(c, "/api/public/observations"))
    scores = list(paginate(c, "/api/public/scores"))
```
Three endpoints, 1,398 records, and the full dataset is in hand. From here it's pandas.
2. The first red flag: 32.1% error rate
Filtering observations to `type == "GENERATION"` and `name == "LLM Generation"` (the application's actual LLM calls, excluding the LLM-as-a-judge evaluator runs) gives 330 generations. Of those, 106 carry `level == "ERROR"`:
Total errors: 106 / 330 = 32.1%. Classification by `statusMessage`:

```
ctx_overflow    91
other           15
```
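For reference, here's a minimal pandas sketch of that filter and tally, working from the `obs` list produced by the extractor above (the `type`, `name`, `level`, and `statusMessage` fields are the ones the observations endpoint returns):

```python
import pandas as pd

df = pd.DataFrame(obs)

# Application LLM calls only; the judge's own generations carry a different name
gens = df[(df["type"] == "GENERATION") & (df["name"] == "LLM Generation")]
errs = gens[gens["level"] == "ERROR"]
print(f"error rate: {len(errs)} / {len(gens)} = {len(errs) / len(gens):.1%}")

# Rough split: context-window rejections vs. everything else
ctx = errs["statusMessage"].str.contains("maximum context length", na=False)
print(ctx.map({True: "ctx_overflow", False: "other"}).value_counts())
```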
A third of production calls failing isn't a tail problem — it's a structural one. Two patterns explain almost all of it.
3. Bug #1: max_tokens set to 720,000
Every ctx_overflow error had a near-identical statusMessage:
> This endpoint's maximum context length is 262144 tokens. However, you requested about 720337 tokens (337 of text input, 720000 in the output)…
The input was 337 tokens. The system was requesting 720,000 output tokens. No model on the planet has a 720K output budget, so OpenRouter rejected the request before any inference ran (median latency: 0.094s — gateway-level rejection).
The 720,000 smells like an int that should have been 720 (or a `temperature * 1000`-style scaling applied to the wrong field). Either way, the fix is a single clamp in the request builder:
```python
def cap_max_tokens(model_ctx: int, input_tok: int, requested: int, margin: int = 256) -> int:
    return min(requested, max(0, model_ctx - input_tok - margin), 8192)
```
Hardcode an upper sanity bound (8192) regardless of what gets passed in. This alone removes the ctx_overflow class, which accounts for errors on ~28% of all calls.
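Wired into the request builder it looks something like this; `payload`, `messages`, `count_tokens`, and the 262,144 context size are placeholders for whatever your router already knows about the model:

```python
payload["max_tokens"] = cap_max_tokens(
    model_ctx=262_144,                  # the model's advertised context window
    input_tok=count_tokens(messages),   # hypothetical tokenizer helper
    requested=payload.get("max_tokens", 1024),
)
```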
4. Bug #2: invalid model slugs
Two slugs failed 100% of the time:
| Slug | Calls | Errors |
|---|---|---|
| `openrouter/free` | 91 | 91 |
| `google/gemma-4-26b-a4b-it:free` | 9 | 9 |
openrouter/free is not a real model — it looks like a placeholder or a fallback the routing layer emits when no slug is resolved. Latency p50 = 0.094s confirms gateway rejection. gemma-4-26b-a4b-it doesn't exist in OpenRouter's catalog either (Gemma 4 isn't a real release; the closest valid Gemma slugs are 2 and 3).
The fix is a startup-time validation against OpenRouter's /api/v1/models endpoint:
```python
import httpx

async def validate_models(used_slugs: set[str]) -> None:
    async with httpx.AsyncClient() as client:
        r = await client.get("https://openrouter.ai/api/v1/models")
    r.raise_for_status()
    valid = {m["id"] for m in r.json()["data"]}
    if invalid := used_slugs - valid:
        raise RuntimeError(f"Unknown OpenRouter slugs: {invalid}")
```
Run this in CI against your config. It catches drift the moment a model is deprecated.
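One way to wire that into CI, assuming a hypothetical `router.config` module that exposes the slugs your router is allowed to emit:

```python
# test_model_slugs.py
import asyncio

from router.config import MODEL_SLUGS          # hypothetical: your router's allow-list
from router.validation import validate_models  # the function above

def test_all_slugs_exist_on_openrouter():
    asyncio.run(validate_models(set(MODEL_SLUGS)))
```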
5. Bug #3: cost concentration — 52% of spend in 2 calls
Total cost across 330 generations: $2.8577. Of that, $1.486 (52%) came from two anthropic/claude-opus-4.6 calls:
| traceId | model | input tokens | cost |
|---|---|---|---|
| #1 | claude-opus-4.6 | 221,266 | $1.1086 |
| #2 | claude-opus-4.6 | 75,101 | $0.3773 |
A 221K input prompt to Opus is either an entire RAG corpus shoved into context, full chat history with no truncation, or a pasted document. Looking at the next tier — four gemini-2.5-flash-lite calls each carrying ~189K input tokens — confirms the pattern. The retrieval layer isn't truncating.
Cheap fix:
```python
def trim_context(chunks: list[Chunk], budget_tok: int, encoder) -> list[Chunk]:
    """Greedy by score, stop when budget is exhausted."""
    chunks = sorted(chunks, key=lambda c: c.score, reverse=True)
    out, used = [], 0
    for c in chunks:
        n = len(encoder.encode(c.text))
        if used + n > budget_tok:
            break
        out.append(c)
        used += n
    return out
```
Pair with a hard ceiling on the system prompt + retrieved-content combined size, well below the model's context window. A 32K input cap on Opus would have cut that single call from $1.11 to ~$0.17.
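For example, with a tiktoken encoder and a 32K budget (`retrieved_chunks` stands in for whatever your retriever returns, assumed to expose `.text` and `.score` as in the sketch above):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
context = trim_context(retrieved_chunks, budget_tok=32_000, encoder=enc)
prompt_context = "\n\n".join(c.text for c in context)
```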
6. Bug #4: input/output token ratio of 97:1
Aggregate token counts across the 330 generations:
- Input: 9,745,108 tokens
- Output: 100,371 tokens
- Ratio: 97:1
A typical chat workload sits around 3:1 to 10:1. 97:1 means the system is shipping massive prompts and getting tiny responses. Combined with the cost finding above, this is a strong signal that:
- Prompts include retrieved context that isn't deduplicated across turns.
- Output is being aggressively constrained (tool-call JSON, classification, scoring) but the input side has no equivalent budget.
Action: add a token-budget metric per request to your dashboards. If the ratio drifts past ~20:1 sustained, your retrieval is overshooting.
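A quick way to track it from the same export, reusing the `gens` frame from the section 2 snippet and assuming the flat `promptTokens`/`completionTokens` fields on each observation (depending on your Langfuse version the counts may live under a nested `usage` object instead):

```python
ratio = gens["promptTokens"].sum() / gens["completionTokens"].sum()
print(f"aggregate input/output ratio: {ratio:.0f}:1")

# Per-trace distribution, to spot the worst offenders
per_trace = gens.groupby("traceId").agg(
    in_tok=("promptTokens", "sum"),
    out_tok=("completionTokens", "sum"),
)
per_trace["ratio"] = per_trace["in_tok"] / per_trace["out_tok"].clip(lower=1)
print(per_trace.nlargest(5, "ratio"))
```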
7. Quality signal: model leaderboard from LLM-as-a-judge
A separate evaluator pipeline runs gemini-2.5-flash over each generation, scoring Correctness ∈ [0,1]. 183 scored runs across the model fleet (n ≥ 5):
| Model | n | mean Correctness |
|---|---|---|
| openai/gpt-oss-20b:free | 5 | 0.940 |
| openai/gpt-oss-120b:free | 10 | 0.870 |
| qwen/qwen3-coder:free | 11 | 0.836 |
| nvidia/nemotron-3-nano-30b-a3b:free | 8 | 0.819 |
| qwen/qwen3-next-80b-a3b-instruct:free | 7 | 0.814 |
| z-ai/glm-4.5-air:free | 8 | 0.800 |
| nvidia/nemotron-3-super-120b-a12b:free | 9 | 0.767 |
| meta-llama/llama-3.3-70b-instruct:free | 8 | 0.739 |
| nvidia/nemotron-nano-12b-v2-vl:free | 10 | 0.735 |
| poolside/laguna-xs.2:free | 6 | 0.700 |
| poolside/laguna-m.1:free | 6 | 0.683 |
| nvidia/nemotron-nano-9b-v2:free | 10 | 0.680 |
| tencent/hy3-preview:free | 9 | 0.589 |
Caveats: small samples, the judge is itself an LLM (gemini-2.5-flash), and "Correctness" was scored against ground-truth replications — which means the metric rewards faithful reproduction, not creative quality. Still, the spread is large enough that tencent/hy3-preview:free (0.589) sits meaningfully below the fleet median (~0.77). On a free-tier router that sees this slug routinely, the ROI is removing it.
gpt-oss-20b topping the chart is more interesting: a 20B model beating 70B+ peers on this workload suggests the workload is not capacity-bound. If your evaluator confirms similar results, your routing weights should reflect it.
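If the evaluator does confirm it, the change can be as small as score-weighted sampling in the router. A hypothetical sketch, seeded from the judge means in the table above (the slugs listed and the 0.65 floor are illustrative):

```python
import random

# Mean correctness per slug (values from the table above); floor out the worst
CORRECTNESS = {
    "openai/gpt-oss-20b:free": 0.940,
    "openai/gpt-oss-120b:free": 0.870,
    "qwen/qwen3-coder:free": 0.836,
    "tencent/hy3-preview:free": 0.589,
}
FLOOR = 0.65
POOL = {slug: score for slug, score in CORRECTNESS.items() if score >= FLOOR}

def pick_model() -> str:
    slugs, weights = zip(*POOL.items())
    return random.choices(slugs, weights=weights, k=1)[0]
```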
8. Latency tail
| Metric | Latency |
|---|---|
| p50 | 3.2s |
| p95 | 30.1s |
| p99 | 69.6s |
| max | 223.7s |
The p99 is 22× the median. The 223.7s outlier was a minimax/minimax-m2.5:free call with 20,619 input / 86 output tokens — not pathological size, just a free-tier provider stalling. Three takeaways:
- Per-request timeouts, scoped per model. A free-tier slug should not get 220 seconds.
- Hedging: fire a backup request to a different provider after 2× p50 (see the sketch after this list).
- Retry budget: cap retries at the request level, not per-call, or your tail amplifies.
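A minimal sketch of the hedging idea, assuming `call_primary` and `call_backup` are async callables that each return a completion (both names, and the fixed hedge delay, are placeholders):

```python
import asyncio

async def hedged_completion(call_primary, call_backup, hedge_after: float = 6.4):
    """Start the primary call; if it hasn't finished within hedge_after seconds
    (~2x the observed p50), race a backup against it and take whichever wins."""
    primary = asyncio.create_task(call_primary())
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        return primary.result()
    backup = asyncio.create_task(call_backup())
    done, pending = await asyncio.wait({primary, backup}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()
```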
9. Observability gaps that made this audit harder than it needed to be
Three fields were essentially empty across the dataset:
- `userId`: populated on 0.6% of traces.
- `sessionId`: 0 unique sessions across 516 traces.
- `release`: never populated.
Without these, you can't:
- Bisect a regression to a deploy.
- Reconstruct a multi-turn conversation from disjoint traces.
- Attribute cost or errors to a customer cohort.
The Langfuse SDK accepts these as keyword args on every trace. They cost nothing to populate and are the single highest-leverage observability change you can make:
```python
langfuse.trace(
    name="chat_completion",
    user_id=request.user_id,
    session_id=request.session_id,
    release=os.environ["GIT_SHA"],
    tags=[request.feature_flag],
    metadata={"tier": request.user.tier},
)
```
10. Prioritized action list
In order of effort-to-impact:
1. Cap `max_tokens` server-side. Eliminates errors on ~28% of calls. One line.
2. Validate model slugs at startup against OpenRouter's catalog. Eliminates the remaining ~3% of slug-related errors and prevents silent drift.
3. Populate `userId`/`sessionId`/`release` on every trace. Zero perf cost, unblocks every future audit.
4. Add an input-token budget to the retrieval layer. Will cut top-tier model spend by an order of magnitude on this workload.
5. Per-model timeouts and hedging. Brings p99 latency under control.
6. Drop `tencent/hy3-preview:free` from the routing pool until you have larger-n quality evidence.
Closing note
The audit took roughly 90 minutes of API pulling and pandas. The fixes are five lines of defensive code and a configuration change. The reason a 32% error rate persisted long enough to produce 516 traces of evidence is that none of these failures were loud — OpenRouter returned errors as completed responses, the gateway rejections were sub-100ms, and the cost spikes were in single calls that didn't trip any alert. What killed visibility wasn't the absence of telemetry — it was the absence of aggregation. Langfuse stored everything correctly. Nobody had run groupby(model).agg(error_rate) until now.
If you're running an LLM router on free-tier infrastructure and you haven't done this exact audit on your own data, you almost certainly have at least two of these five bugs. The REST API is right there.