- Book: Observability for LLM Applications
- Me: xgabriel.com | GitHub
Someone hands you a cost dashboard. Your LLM bill doubled month over month and finance wants it back under control. You do the obvious thing. You swap the big model for a smaller one on the hot path, drop max_tokens, and shave the few-shot examples out of the system prompt. The bill drops. You close the ticket.
Three weeks later support tickets climb. The smaller model gets the edge cases wrong, the tighter token budget truncates answers mid-sentence, and the stripped prompt removed the examples that kept the format stable. Nobody connects the dots, because the cost dashboard and the quality complaints live in different tools owned by different teams.
You optimized one corner of a triangle and the other two corners moved. The problem is not that you made a bad call. The problem is that you made the call blind.
The three axes are one decision
Every LLM request sits at a point inside a triangle:
- Cost — input tokens plus output tokens, priced per model.
- Quality — does the answer do what the user needed.
- Latency — time to first token, and total time to the full response.
You do not get to pick all three. A bigger model lifts quality and raises both cost and latency. Streaming improves perceived latency but does nothing for the total. A reasoning model that thinks before answering can lift quality a lot while quietly tripling output tokens and wall-clock time. Trim retrieved context to save cost and you can starve the model into worse answers.
The trade-off is real and it is constant. What teams lack is not the awareness of it. It is the instrument that shows all three numbers for the same request, in the same place, at the same time.
Put all three on one span
Most teams already trace latency. The fix is small: attach cost and a quality signal to the same span the latency lives on. The OpenTelemetry GenAI semantic conventions already define attribute names for the token counts, so you are not inventing a schema.
# llm_span.py: one span, all three axes.
import time

from opentelemetry import trace

tracer = trace.get_tracer("llm.app")

# pricing in USD per 1M tokens, per model: (input, output)
PRICING = {
    "gpt-4o-2024-11-20": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def cost_usd(model, in_tok, out_tok):
    pin, pout = PRICING[model]
    return (in_tok * pin + out_tok * pout) / 1_000_000


def call_llm(client, model, messages):
    with tracer.start_as_current_span("llm.chat") as span:
        span.set_attribute("gen_ai.request.model", model)
        t0 = time.monotonic()
        resp = client.chat.completions.create(
            model=model,
            messages=messages,
        )
        latency = time.monotonic() - t0
        u = resp.usage
        span.set_attribute(
            "gen_ai.usage.input_tokens", u.prompt_tokens)
        span.set_attribute(
            "gen_ai.usage.output_tokens",
            u.completion_tokens)
        span.set_attribute(
            "llm.cost_usd",
            cost_usd(model, u.prompt_tokens,
                     u.completion_tokens))
        span.set_attribute("llm.latency_s", latency)
        return resp
Now every request carries its position in the triangle. Cost is computed from the token counts the API returned, not guessed. Latency is the wall-clock time of the whole call, which is what the user waited for when the response is not streamed. The only axis missing is quality, and quality is the one that does not arrive for free in the API response.
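The span above records total time only. Time to first token, the other half of the latency axis, needs a streamed response. Here is a minimal sketch, assuming the stream can be reduced to an iterable of text pieces. `timed_stream` and `fake_stream` are hypothetical names for illustration, not part of any SDK; with a real client you would pull the text delta out of each chunk before passing it in.

```python
import time
from typing import Iterable, Optional, Tuple


def timed_stream(pieces: Iterable[str]) -> Tuple[str, Optional[float], float]:
    """Consume a stream of text pieces and time it.

    Returns (full_text, ttft_s, total_s). ttft_s is None when the
    stream never yields a non-empty piece.
    """
    t0 = time.monotonic()
    ttft = None
    parts = []
    for piece in pieces:
        # The first non-empty piece marks time to first token.
        if piece and ttft is None:
            ttft = time.monotonic() - t0
        parts.append(piece)
    total = time.monotonic() - t0
    return "".join(parts), ttft, total


def fake_stream():
    # Stand-in for a model stream: a pause before each piece.
    time.sleep(0.05)
    yield "Hello"
    time.sleep(0.05)
    yield ", world"


text, ttft, total = timed_stream(fake_stream())
```

Both numbers can then go on the span next to `llm.latency_s`, for example under an attribute such as `llm.ttft_s` (a name assumed here, in the same spirit as the custom `llm.*` attributes above).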
Quality has to be measured, not assumed
You cannot read quality off resp.usage. You score it. For a fixed slice of traffic, run a judge against the output and attach the verdict to the same trace.
# quality.py: attach a judge score to the live trace.
import re

from opentelemetry import trace

JUDGE = """Score the RESPONSE to the PROMPT from 0 to 1.
Return only a number. 1 = fully correct and well-formed,
0 = wrong or unusable.
PROMPT:
{prompt}
RESPONSE:
{response}"""


def parse_score(text):
    # judges drift: "0.8/1", ".8", a stray period, a newline.
    # pull the first number and clamp it to [0, 1].
    m = re.search(r"\d*\.\d+|\d+", text)
    if not m:
        return None
    return max(0.0, min(1.0, float(m.group())))


def score_quality(judge_client, prompt, response):
    out = judge_client.chat.completions.create(
        model="gpt-4o-2024-11-20",  # pinned judge
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE.format(
                prompt=prompt, response=response),
        }],
    )
    # content is Optional in the SDK; treat None as "no score"
    raw = (out.choices[0].message.content or "").strip()
    score = parse_score(raw)
    if score is None:
        return None  # skip the sample, do not crash
    # sets the attribute on whichever span is current, so call
    # this while the request's span is still active
    span = trace.get_current_span()
    span.set_attribute("llm.quality_score", score)
    return score
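Run this on a sample, not on every request, or the judge cost swallows the savings you came for. One or two percent of live traffic is often enough to keep the quality axis honest, provided each slice you care about still collects enough scored samples. Pin the judge to a model from a different family than the one under test; a judge rates its own family higher (arXiv:2410.21819), which masks exactly the regressions you want to see. The snippet above hard-codes a gpt-4o judge for brevity, so if gpt-4o or gpt-4o-mini is the model under test, swap in a judge from another vendor.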
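One way to pick that sample is to hash a stable request identifier, so the same request always gets the same decision and retries do not skew the rate. This is a sketch under assumptions: `should_score` is a hypothetical helper, and the request ID can be any stable string such as a trace ID.

```python
import hashlib


def should_score(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of requests for judging."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the request when its bucket falls in the lowest `rate` share.
    return bucket < rate * 2**64


# Count how many of 100,000 synthetic IDs get selected at the default rate.
picked = sum(should_score(f"req-{i}") for i in range(100_000))
```

Hashing rather than calling a random number generator keeps the decision reproducible, which helps when you later need to explain why a given trace has or lacks a quality score.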
The dashboard that shows the tension
Three single-metric dashboards hide the trade-off. One dashboard that puts the three axes side by side exposes it. The query you want groups by model and shows all three at once.
-- per-model: cost, quality, latency on the same row (ClickHouse syntax)
SELECT
    model,
    count(*)                            AS requests,
    round(avg(quality_score), 3)        AS avg_quality,
    round(sum(cost_usd), 2)             AS total_cost,
    round(quantile(0.95)(latency_s), 2) AS p95_latency
FROM llm_spans
WHERE ts > now() - INTERVAL 7 DAY
GROUP BY model
ORDER BY total_cost DESC;
A row like this is the whole point:
model             requests  avg_quality  total_cost  p95_latency
gpt-4o-mini          48210         0.78       61.40          0.9
gpt-4o-2024-...      12044         0.93      301.10          2.4
The mini model costs about a fifth as much in total while serving four times the traffic, which works out to roughly twenty times cheaper per request. Its p95 latency is well under half that of the big model. It is also fifteen points worse on quality. Whether that trade is good or bad is a product decision, but now it is a decision someone can make on purpose. Before the dashboard, the mini model looked like a free win on the cost report and an unexplained ticket spike on the support report.
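The table reports total spend, and the two models served very different request counts, so per-request cost is the fairer comparison. A quick sketch of that arithmetic on the two rows above (`per_request` is a hypothetical helper, and the model name is kept truncated exactly as it appears in the table):

```python
rows = [
    # (model, requests, avg_quality, total_cost_usd, p95_latency_s)
    ("gpt-4o-mini", 48210, 0.78, 61.40, 0.9),
    ("gpt-4o-2024-...", 12044, 0.93, 301.10, 2.4),
]


def per_request(rows):
    """Normalize each row to per-request cost and cost per quality point."""
    out = {}
    for model, requests, quality, total_cost, p95 in rows:
        cost_per_req = total_cost / requests
        out[model] = {
            "cost_per_request": cost_per_req,
            "cost_per_quality": cost_per_req / quality,
            "p95_latency_s": p95,
        }
    return out


stats = per_request(rows)
mini, big = stats["gpt-4o-mini"], stats["gpt-4o-2024-..."]
cost_ratio = big["cost_per_request"] / mini["cost_per_request"]
latency_ratio = big["p95_latency_s"] / mini["p95_latency_s"]
print(f"big model: {cost_ratio:.1f}x cost per request, "
      f"{latency_ratio:.1f}x p95 latency")
```

On these numbers the big model is roughly twenty times more expensive per request, which is a very different picture from the fivefold gap in total spend.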
The pattern that earns its keep is routing by request shape. Send the easy, high-volume queries to the cheap model and reserve the big model for the hard slice. The triangle dashboard is how you find that slice: filter by quality below a threshold, group by your own request-category attribute, and the categories where the cheap model fails fall out. Route those up, leave the rest down.
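The slice-finding step and the routing decision can be sketched in a few lines. Everything here is an assumption for illustration: the category labels, the 0.8 quality threshold, the minimum sample count, and the helper names `failing_categories` and `pick_model` are hypothetical, and the input is taken to be `(category, quality_score)` pairs from judged cheap-model traffic.

```python
from collections import defaultdict

CHEAP, BIG = "gpt-4o-mini", "gpt-4o-2024-11-20"


def failing_categories(samples, threshold=0.8, min_samples=20):
    """Return categories whose mean judged quality is below threshold.

    Categories with fewer than min_samples scores are ignored so a
    handful of bad samples cannot reroute a whole category.
    """
    scores = defaultdict(list)
    for category, score in samples:
        scores[category].append(score)
    return {
        cat
        for cat, vals in scores.items()
        if len(vals) >= min_samples and sum(vals) / len(vals) < threshold
    }


def pick_model(category, route_up):
    """Send failing categories to the big model, everything else down."""
    return BIG if category in route_up else CHEAP


# Synthetic judged samples: one healthy category, one failing, one too rare.
samples = (
    [("faq", 0.95)] * 30
    + [("refund_dispute", 0.55)] * 30
    + [("rare", 0.10)] * 3
)
route_up = failing_categories(samples)
```

The minimum sample count matters because the judge only sees a small share of traffic; a category with three scored requests is noise, not a routing signal.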
Watch the corners move together
The trade-off is not static. A vendor reprices a model and your cost axis jumps with no code change. A model gets quietly swapped behind the same ID and your quality axis drops while latency holds. A prompt template change adds three paragraphs of instructions and your input-token cost climbs across every request at once.
Because all three numbers now ride the same span, you can alert on the relationship rather than on any single axis. The alert that matters is not "cost went up." It is "cost went up and quality did not." That is a regression. "Cost went up and quality went up" is a choice you can defend.
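# cost per unit of quality: last 24h vs. the 7-day baseline ending a week ago
# assumes llm_cost_usd and llm_quality_score are counters (running sums)
(
    sum(rate(llm_cost_usd[1d]))
  / sum(rate(llm_quality_score[1d]))
)
> 1.25 *
(
    sum(rate(llm_cost_usd[7d] offset 7d))
  / sum(rate(llm_quality_score[7d] offset 7d))
)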
Cost per unit of quality folds the cost and quality axes into a single number that stays meaningful when both move at once. Latency is not part of the ratio, so keep a separate threshold on p95 next to it. When the ratio climbs, you are paying more for the same answer, and that is worth a page. When it falls, you found a better point in the triangle, and that is worth a brag.
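The same comparison can be run offline, for instance in a weekly report. This is a sketch, assuming each window is summarized as mean cost per request and mean judge score. `Window`, `cost_per_quality`, and `classify` are hypothetical names, and the 1.25 factor mirrors the alert threshold above.

```python
from typing import NamedTuple


class Window(NamedTuple):
    avg_cost_usd: float  # mean cost per request in the window
    avg_quality: float   # mean judge score in [0, 1]


def cost_per_quality(w: Window) -> float:
    # A window with zero quality is treated as infinitely expensive.
    if w.avg_quality <= 0:
        return float("inf")
    return w.avg_cost_usd / w.avg_quality


def classify(baseline: Window, current: Window, factor: float = 1.25) -> str:
    """Label a change between two windows."""
    if cost_per_quality(current) > factor * cost_per_quality(baseline):
        return "regression"        # paying more per unit of quality
    if (current.avg_cost_usd > baseline.avg_cost_usd
            and current.avg_quality > baseline.avg_quality):
        return "defensible trade"  # cost and quality rose together
    return "ok"


baseline = Window(avg_cost_usd=0.010, avg_quality=0.90)
flat_quality = classify(baseline, Window(0.014, 0.90))
both_up = classify(baseline, Window(0.011, 0.95))
unchanged = classify(baseline, baseline)
```

Here the first comparison raises cost by 40 percent with no quality gain and lands on "regression", while the second buys five points of quality for a 10 percent cost increase and lands on "defensible trade".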
Start with the span
You do not need a new platform for any of this. You need the cost and the quality score attached to the trace your latency already lives on, and one query that reads all three back together. Everything else — routing, alerting, slice analysis — is built on top of that one span.
If you want the longer version, Observability for LLM Applications walks through the GenAI tracing conventions, the eval pipeline that produces the quality axis, and the cost-accounting chapter that turns token counts into the per-request dollar figures the triangle runs on. The triangle is the through-line: you cannot manage a trade-off you cannot see on one screen.
