- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
18 items. Each one should be true before your LLM feature meets a paying customer. If any are false, you are accepting a specific risk — at least know which.
Save this page. Tick against it. Come back when the next feature ships.
Tracing (items 1–5)
1. Every LLM call emits an OpenTelemetry GenAI span with gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.response.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens. Non-negotiable. The cost of adding this later is three sprints.
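A minimal sketch of that attribute set, using the semantic-convention names from the item. The plain dict stands in for a real OpenTelemetry span, and `genai_attributes` / `assert_complete` are hypothetical helpers, not library API — the point is that completeness is checkable in tests before it costs you three sprints.

```python
# The six non-negotiable GenAI attributes, checkable at test time.
# A plain dict stands in for a real OTel span here.
REQUIRED_KEYS = {
    "gen_ai.operation.name",
    "gen_ai.provider.name",
    "gen_ai.request.model",
    "gen_ai.response.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
}

def genai_attributes(operation, provider, request_model,
                     response_model, input_tokens, output_tokens):
    """Build the attribute dict for one LLM call."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,  # may differ from request
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

def assert_complete(attrs):
    """Fail fast in tests if a call path forgot an attribute."""
    missing = REQUIRED_KEYS - attrs.keys()
    if missing:
        raise ValueError(f"span missing GenAI attributes: {sorted(missing)}")

attrs = genai_attributes("chat", "openai", "gpt-4o",
                         "gpt-4o-2024-11-20", 412, 96)
assert_complete(attrs)  # silent when all six keys are present
```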
2. The gap between gen_ai.request.model and gen_ai.response.model is captured. They differ more often than you think. Bedrock aliases, OpenAI autorouting, Azure deployment IDs, gateway fallback tiers. If you only record one, you cannot debug silent routing.
3. Tool-calling agents are traced as a single invoke_agent parent span with chat and execute_tool children. Flat traces (every call a root) make it impossible to ask "how many times did this agent call this tool in this turn." That question is your first agent-loop detector.
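Here is a sketch of that loop detector over a parent/child trace. The span shape (`span_id`, `parent_id`, `name`, `tool` as dict keys) is illustrative, not any exporter's schema — in practice you would run this query in your trace backend.

```python
from collections import Counter

def tool_calls_per_turn(spans):
    """Count execute_tool children under each invoke_agent parent.
    Hypothetical span shape: {"span_id", "parent_id", "name", "tool"}."""
    counts = Counter()
    agent_ids = {s["span_id"] for s in spans if s["name"] == "invoke_agent"}
    for s in spans:
        if s["name"] == "execute_tool" and s["parent_id"] in agent_ids:
            counts[(s["parent_id"], s["tool"])] += 1
    return counts

def looping(spans, limit=5):
    """First agent-loop detector: any (turn, tool) pair over the limit."""
    return [k for k, n in tool_calls_per_turn(spans).items() if n > limit]

spans = [
    {"span_id": "a1", "parent_id": None, "name": "invoke_agent", "tool": None},
    {"span_id": "c1", "parent_id": "a1", "name": "chat", "tool": None},
    {"span_id": "c2", "parent_id": "a1", "name": "execute_tool", "tool": "search"},
    {"span_id": "c3", "parent_id": "a1", "name": "execute_tool", "tool": "search"},
]
```

With flat traces (every call a root) `agent_ids` never matches a parent and the question becomes unanswerable — which is the point of the item.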
4. Prompt and response content capture is gated behind OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT. The default is off. Turn on in dev. Turn on selectively in staging. In production, only after you have decided who reads traces and for how long you retain them.
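A sketch of the gate, assuming the standard env var from the item. The attribute names in `maybe_record_content` are illustrative only — the actual semantic conventions put message content on events, not span attributes — but the off-by-default check is the part that matters.

```python
import os

def content_capture_enabled() -> bool:
    """Content is recorded only when the operator opts in. Default: off."""
    flag = os.getenv("OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT", "")
    return flag.strip().lower() == "true"

def maybe_record_content(span_attrs, prompt, completion):
    """Attach content only behind the gate. Attribute names are
    illustrative; real conventions emit content as events."""
    if content_capture_enabled():
        span_attrs["gen_ai.prompt"] = prompt
        span_attrs["gen_ai.completion"] = completion
    return span_attrs
```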
5. Trace retention is at least 30 days. The window for discovering a quality regression is longer than the window for discovering an HTTP error. A week of retention is a week of not being able to bisect last month's drift.
Evals (items 6–10)
6. A canary eval runs hourly against every production model ID. 10–20 fixed prompts. Scored by an LLM-judge from a different provider to avoid self-preference bias. Emits a metric; alerts on a drop.
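The canary loop itself is small. In this sketch, `complete` and `judge` are stand-ins for your client calls (the judge on a different provider than the model under test); the fixed prompt list, `ALERT_FLOOR`, and function names are all assumptions for illustration.

```python
# Sketch of one canary run. `complete(model_id, prompt) -> answer`,
# `judge(prompt, answer, reference) -> score in [0, 1]` are stand-ins
# for real client calls; the judge lives on a different provider.
CANARY_PROMPTS = [
    ("What is the capital of France?", "Paris"),
    ("2 + 2 =", "4"),
    # ...10-20 fixed prompts in practice
]

ALERT_FLOOR = 0.85  # baseline-relative in a real setup; absolute here

def run_canary(model_id, complete, judge):
    """Score every fixed prompt; return (mean score, should_alert)."""
    scores = [judge(p, complete(model_id, p), ref)
              for p, ref in CANARY_PROMPTS]
    mean = sum(scores) / len(scores)
    return mean, mean < ALERT_FLOOR
```

Schedule it hourly per production model ID and emit `mean` as a metric; the alert rides on the metric, not on this function.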
7. An online eval samples live production traces on a rolling window. Multi-axis: faithfulness, relevance, format, safety. Judged by a pinned model snapshot (never gpt-4o, always gpt-4o-2024-11-20).
8. The online judge has been meta-evaluated against at least 100 human labels. TPR and TNR measured. Both above 0.8. The baseline is stored in Git alongside the prompt.
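The meta-eval arithmetic is just sensitivity and specificity against the human labels. A sketch, assuming binary good/bad labels in parallel lists; the function names are mine:

```python
def judge_agreement(human, judge):
    """human, judge: parallel lists of booleans (True = 'good output').
    Returns (TPR, TNR): judge sensitivity and specificity vs. humans."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    return tp / pos, tn / neg

def judge_usable(human, judge, floor=0.8):
    """The item's bar: both TPR and TNR above the floor."""
    tpr, tnr = judge_agreement(human, judge)
    return tpr >= floor and tnr >= floor
```

Store the resulting `(TPR, TNR)` pair in Git next to the judge prompt; that is the baseline item 10 drifts against.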
9. Code-based evals run in CI. Every prompt change triggers a regression suite: schema checks, latency budgets, cost budgets, PII leak checks, tool-call argument shape. Fast. Deterministic. Merges block on regression.
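One such deterministic check might look like this. The `answer`/`citations` schema and the `@`-based PII stand-in are hypothetical — substitute your real response schema and a real detector — but the shape (pure function, list of failures, no LLM in the loop) is what keeps it fast enough to block merges.

```python
import json

def check_output_schema(raw):
    """Deterministic CI check on one model output: valid JSON, required
    keys, no obvious PII leak. Returns a list of failures (empty = pass)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    failures = []
    for key in ("answer", "citations"):  # hypothetical response schema
        if key not in data:
            failures.append(f"missing key: {key}")
    if "@" in raw:  # crude stand-in for a real PII detector
        failures.append("possible email address in output")
    return failures
```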
10. Judge drift is monitored weekly. Cohen's kappa between judge and human on a rolling sample. Alert if kappa drops below 0.6 or TPR/TNR falls more than 5 points from baseline.
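Cohen's kappa for the binary case fits in a few lines; a sketch over parallel boolean label lists, with the item's 0.6 floor as the alert condition:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two parallel binary label lists."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    pa = (sum(a) / n) * (sum(b) / n) + \
         ((n - sum(a)) / n) * ((n - sum(b)) / n)     # chance agreement
    return (po - pa) / (1 - pa) if pa != 1 else 1.0

def judge_drift_alert(human, judge, floor=0.6):
    """The weekly check: fire when agreement decays below the floor."""
    return cohens_kappa(human, judge) < floor
```

Run it on the week's rolling sample of human-labeled traces, alongside the TPR/TNR deltas from the stored baseline.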
Operations (items 11–14)
11. Quality alerts fire per slice, not globally. Tenant, feature, prompt version, model, language. A single global judge-score alert on a multi-feature product is worse than no alert at all — it averages a regression in one feature into the noise of another.
12. Thresholds are baseline-relative, not absolute. Alert on the delta from a trailing 24-hour mean, not a hardcoded floor. A judge score of 0.82 is fine for one feature and a crisis for another.
13. Cost-per-tenant-per-hour is alerted. Fires when cost exceeds 3× the 7-day rolling mean for that tenant. Also alert on cache hit ratio dropping below 50% of baseline — cheapest cost-regression detector you will build.
14. A quality-aware circuit breaker wraps every LLM call path. Trips on HTTP error rate and judge-score drop. A breaker that only watches HTTP status will keep serving a silently degraded provider.
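A minimal sketch of the quality-aware variant. The class name, thresholds, and window size are assumptions; the structural point is that the trip condition reads two signals, HTTP errors OR sampled judge scores, so a silently degraded provider still trips it.

```python
from collections import deque

class QualityBreaker:
    """Trips on HTTP error rate OR judge-score drop. A breaker that
    watches only HTTP status keeps serving silent degradation."""
    def __init__(self, max_error_rate=0.2, min_judge_score=0.7, window=50):
        self.errors = deque(maxlen=window)
        self.scores = deque(maxlen=window)
        self.max_error_rate = max_error_rate
        self.min_judge_score = min_judge_score

    def record(self, http_ok, judge_score=None):
        """Record one call; judge_score comes from sampling, not every call."""
        self.errors.append(0 if http_ok else 1)
        if judge_score is not None:
            self.scores.append(judge_score)

    @property
    def open(self):
        error_rate = sum(self.errors) / len(self.errors) if self.errors else 0.0
        mean_score = sum(self.scores) / len(self.scores) if self.scores else 1.0
        return (error_rate > self.max_error_rate
                or mean_score < self.min_judge_score)
```

A production breaker also needs half-open probing and per-path instances; this only shows the two-signal trip condition.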
Incident response (items 15–18)
15. A multi-provider fallback is configured AND exercised monthly. Configured is not enough. The first real outage is a bad time to discover your secondary's tokenizer is different. Pick a Tuesday, break the primary, watch the secondary take over in traces.
16. The online judge runs on fallback tiers in steady state. A tertiary tier scoring 0.55 on a Tuesday is a tertiary tier that will fail you during the outage that forces you to use it. Steady-state measurement is how you find out before.
17. Every span carries app.feature.owner and app.oncall.rotation. The SRE who gets paged at 02:14 clicks the span, sees who owns the regressing feature, pages them. No Slack hunt. Ownership in the trace saves fifteen minutes per incident.
18. Postmortems answer this bold question at the top: Was this detectable from metrics alone? If not, what signal would have caught it, and what do we need to build to capture that signal next time? LLM incidents fail the metrics-alone test more often than traditional incidents. Every "no" is a ticket. Close it before the next quarter.
How to use this list
Not a bar for shipping an MVP. A bar for shipping to a paying customer whose logo is on your website. Three tiers to triage:
- Items 1, 6, 9, 11, 14, 15 are blockers. You cannot responsibly operate an LLM feature in production without these.
- Items 2, 3, 7, 8, 12, 13, 17, 18 are fixes. You can ship without them for a week. You will regret shipping without them for a quarter.
- Items 4, 5, 10, 16 are nits. Necessary for maturity. You get a year to add them before they become load-bearing.
Tick them. Commit the tick list to the repo. Re-tick quarterly. The checklist is a contract with your future on-call self.
If this was useful
This is Chapter 18 of Observability for LLM Applications, distilled. The book's actual Chapter 18 is 50 items, not 18, and each one has a paragraph on the failure mode it prevents and the specific instrument to add. Chapter 17 covers the alerting SLOs underneath. Chapters 15–16 cover roll-your-own stacks and cost accounting.
- Book: Observability for LLM Applications — paperback and hardcover now; ebook April 22.
- Hermes IDE: hermes-ide.com — the IDE for developers shipping with Claude Code and other AI tools.
- Me: xgabriel.com · github.com/gabrielanhaia.
