The bill arrives before the alert does. Because the system that creates the cost isn't the system you're monitoring.
Inference observability isn't a tooling problem — it's a layer problem. Your APM stack tracks latency. Your infrastructure monitoring tracks GPU utilization. Neither one tracks the routing decision that sent a thousand requests to your most expensive model, or the prompt length drift that silently doubled your token consumption over three weeks.
By the time your cost alert fires, the tokens are already spent.
## The Visibility Gap
Inference cost is generated at the decision layer. Routing decisions, token consumption, model selection, retry behavior — these are the variables that determine what you pay. But most observability exists at the infrastructure layer.
Here's how the layers break down:
| Layer | What It Tracks | What It Misses |
|---|---|---|
| Infrastructure | CPU, GPU, memory, latency | Token usage, routing decisions, model selection |
| Application | Errors, response time, request volume | Model decisions, prompt length, retry cost |
| Inference (decision layer) | Usually not instrumented | Everything that drives cost |
The inference layer is where routing decisions get made, where token budgets get consumed, where cache hits and misses determine whether you're paying for compute or serving from memory. It's also the layer that most monitoring stacks treat as a black box.
## The 5 Signals That Predict Cost Before It Spikes
Standard metrics tell you what happened. These signals tell you what's about to happen.
### Signal 01 — Token Consumption Rate (spend velocity)
Tokens per second per endpoint. A spike in token consumption rate precedes a cost spike by minutes to hours. Track it at the endpoint level, not the aggregate.
### Signal 02 — Prompt Length Drift (silent cost multiplier)
The p95 prompt length over time. When prompt length drifts upward — users adding more context, system prompts growing, retrieval chunks increasing — token cost grows with it. No alert fires. No system breaks. The bill just quietly doubles over three weeks.
### Signal 03 — Cache Hit Rate (efficiency signal)
Semantic cache and KV cache hit rates. A cache hit rate drop from 40% to 20% raises the share of requests paying full compute from 60% to 80%, a one-third increase in effective inference cost with no change in request volume. Most teams don't instrument it at all.
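The effect as arithmetic, sketched under the assumption that a cache hit is near-free. That holds for a semantic cache serving full responses; a KV cache hit still pays partial compute, so pass a nonzero `hit_cost` there.

```python
def effective_cost_per_request(full_cost: float, hit_rate: float,
                               hit_cost: float = 0.0) -> float:
    """Expected cost per request given a cache. Assumes hits are near-free
    by default, which is only true for full-response caches."""
    return hit_rate * hit_cost + (1.0 - hit_rate) * full_cost


# Hit rate 40% -> 60% of requests pay full compute:
#   effective_cost_per_request(1.0, 0.40) -> 0.6
# Hit rate 20% -> 80% pay full compute, a one-third increase:
#   effective_cost_per_request(1.0, 0.20) -> 0.8
```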
### Signal 04 — Routing Distribution (decision quality signal)
The percentage of requests hitting each model tier. When routing distribution drifts — more requests hitting your frontier model than expected — cost escalates without any system error.
### Signal 05 — Retry Rate (failure cost amplifier)
Failed requests that retry still consume tokens on the failed attempt. A 10% retry rate means 10% of your token spend generated zero value.
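A sketch of the amplification. This assumes a retry-until-success policy with independent failures and a cap of three attempts; both are illustrative policy choices, not a statement about any particular client.

```python
def retry_cost_multiplier(failure_rate: float, max_attempts: int = 3) -> float:
    """Expected attempts per request when each attempt fails independently
    with `failure_rate` and is retried up to `max_attempts` times.
    Every attempt, failed or not, is assumed to bill its tokens."""
    return sum(failure_rate ** k for k in range(max_attempts))


# At a 10% failure rate: 1 + 0.1 + 0.01 = 1.11 expected attempts,
# so roughly 10% of total token spend (0.11 / 1.11) produced zero output.
```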
## What to Instrument — The 3-Layer Observability Stack
Instrumentation must exist at the same layer where decisions are made.
### Decision Layer (request-level)
- Tokens in / tokens out per request
- Model selected
- Routing path taken
- Cost per request
- Cache hit or miss
- Latency to first token
### Behavior Layer (session-level)
- Total token budget consumed per session
- Routing path distribution
- Retry count
- Prompt length trend
- Token budget remaining vs elapsed session time
### Business Layer (aggregate)
- Cost per feature
- Cost per user cohort
- Token burn rate (velocity)
- Routing distribution drift
- Cache efficiency trend
- Budget utilization rate
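The request-level fields above can be captured as one structured event emitted per request; the behavior and business layers are rollups of that same event. A minimal sketch, with field names that are illustrative rather than a schema from any particular tool:

```python
from dataclasses import dataclass, asdict


@dataclass
class InferenceEvent:
    """One decision-layer event per request (field names are illustrative).
    Emit as a structured log line; aggregate session and business views
    downstream from the same record."""
    request_id: str
    session_id: str
    model: str            # model selected
    routing_path: str     # e.g. "cache-miss -> tier-2" (hypothetical label)
    tokens_in: int
    tokens_out: int
    cache_hit: bool
    retry_count: int
    ttft_ms: float        # latency to first token
    cost_usd: float       # price table applied at emit time


def to_log_line(event: InferenceEvent) -> dict:
    """Flatten for a structured logger or event pipeline."""
    return asdict(event)
```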
## The Budget Signal Pattern
Dollar alerts are lagging indicators. Token rate alerts are leading indicators.
Most teams set cost alerts at the dollar level. By the time that alert fires, the tokens are already spent, the requests already executed, the routing decisions already made. You can't stop a cost spike that already executed.
Token rate — tokens consumed per minute per endpoint — fires earlier. A token rate anomaly is detectable within minutes of a routing change, a prompt length drift, or a cache configuration failure.
| Alert Type | When It Fires | Can You Intervene? |
|---|---|---|
| Dollar alert | After spend threshold exceeded | No — tokens already spent |
| Token rate alert | When a consumption-velocity anomaly is detected | Yes — reroute, throttle, or kill |
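A minimal leading-indicator alert, sketched as an exponentially weighted baseline with a standard-deviation band over tokens-per-minute samples. The `alpha` and `k` values are assumptions to tune against real traffic, not recommended defaults.

```python
class TokenRateAlert:
    """Fires when tokens/min exceeds k standard deviations above an
    exponentially weighted moving baseline (illustrative sketch)."""

    def __init__(self, alpha: float = 0.1, k: float = 3.0):
        self.alpha, self.k = alpha, k
        self.mean = None   # EWMA of the rate
        self.var = 0.0     # EW variance of the rate

    def observe(self, tokens_per_min: float) -> bool:
        if self.mean is None:
            self.mean = tokens_per_min   # seed baseline on first sample
            return False
        threshold = self.mean + self.k * (self.var ** 0.5)
        fired = tokens_per_min > threshold and self.var > 0
        # Update the baseline after the check so the spike itself
        # doesn't suppress its own alert.
        d = tokens_per_min - self.mean
        self.mean += self.alpha * d
        self.var = (1 - self.alpha) * (self.var + self.alpha * d * d)
        return fired
```

Because the check runs on consumption velocity, it fires while the routing change or cache failure is still happening, when rerouting or throttling can still change the bill.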
## Where Inference Observability Fails
Most teams can tell you what they spent. Very few can tell you why.
[01] Tracking latency, not tokens.
Response time is green. Token consumption has been climbing for two weeks. The system looks healthy. The bill doesn't.
[02] Tracking errors, not retries.
Error rate is 0.1%. Retry rate is 12%. Every retry is a token burn that generated zero output value.
[03] Tracking requests, not routing paths.
Request volume is flat. Routing distribution has drifted — 60% of requests now hitting the frontier model instead of the expected 20%. Volume didn't change. Cost per request tripled.
[04] Tracking cost, not cause.
Monthly spend alert fires. The investigation begins after the fact — sifting through logs to reconstruct which routing decision, which prompt length drift, which cache failure caused it. Post-incident analysis, not prevention.
## How the Series Connects
This series has been building a single architecture across four posts:
- Part 1 — The cost model: why inference behaves like egress
- Part 2 — Execution budgets: runtime controls that cap spend before it cliffs
- Part 3 — Cost-aware routing: getting requests to the right model at the right cost
- Part 4 — Observability: the feedback loop that makes the other three work
Without observability, the other three are blind. Budgets are unvalidated. Routing is unconfirmed. Cost model predictions are theoretical.
## Architect's Verdict
You can't enforce a budget you can't see. And you can't see inference cost until you instrument the decision layer.
Instrument the decision layer. Set token rate alerts, not just dollar alerts. Track routing distribution as a cost signal. Treat cache hit rate as an efficiency metric with direct cost implications.
The goal isn't more dashboards — it's visibility at the layer where cost decisions are actually made. That's the only layer where intervention is still possible.
Full post with HTML diagrams, the visibility gap table, and the complete 5-signal card breakdown: rack2cloud.com/ai-inference-observability
Part of the AI Infrastructure Architecture series on Rack2Cloud.


