DEV Community

Denis Moroz

Posted on • Originally published at denismoroz.ai

LLMs in Production: What No One Tells You

Deploying a language model demo is easy. Running it in production — reliably, at scale, within budget — is not. After shipping several LLM-backed products, here's the honest picture.

Cost Is Not Linear

Every engineer does the math: tokens in × tokens out × price per 1M tokens = monthly bill. Then they ship to production and discover the bill is 4x the estimate.

Why? Because production traffic is never as clean as your prototype. Real users:

  • Send ambiguous queries that need clarification rounds
  • Retry when responses feel off
  • Trigger edge cases your prompt never anticipated
  • Explore the product in ways you didn't model

Budget 2-3x your projected token usage for the first quarter in production. Track cost per user, not cost in aggregate — aggregate numbers hide the outliers who will blow your budget.
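A per-user view can be as simple as an accumulator keyed by user ID. A minimal sketch — the prices here are hypothetical placeholders, so substitute your provider's actual per-1M-token rates:

```python
from collections import defaultdict

# Hypothetical per-1M-token prices; substitute your provider's real rates.
PRICE_PER_M_INPUT = 2.50
PRICE_PER_M_OUTPUT = 10.00

class CostTracker:
    """Accumulates estimated spend per user so outliers become visible."""

    def __init__(self):
        self.spend = defaultdict(float)

    def record(self, user_id, input_tokens, output_tokens):
        cost = (input_tokens * PRICE_PER_M_INPUT +
                output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
        self.spend[user_id] += cost
        return cost

    def top_spenders(self, n=10):
        # Sort descending by spend; these are the users to investigate.
        return sorted(self.spend.items(), key=lambda kv: -kv[1])[:n]

tracker = CostTracker()
tracker.record("user-a", input_tokens=1_200, output_tokens=400)
tracker.record("user-b", input_tokens=250_000, output_tokens=80_000)  # an outlier
```

Surfacing `top_spenders()` on a dashboard makes the outlier problem concrete long before the invoice does.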

Prompt Engineering is Software Engineering

Treat prompts like code:

  • Version control them
  • Test them against a regression suite
  • Review changes before deployment
  • Monitor production drift

I've seen teams ship prompt changes as untracked edits to environment variables. Three weeks later, a regression surfaced in a corner case they didn't know existed. No way to diff, no way to roll back.

Use a prompt management system. At minimum, store prompts in your repo, not in .env files.
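The minimum viable version of this is a `prompts/` directory in the repo, one template file per prompt, loaded at runtime. A sketch, assuming a simple `str.format`-style template convention:

```python
from pathlib import Path

def load_prompt(name: str, prompt_dir: Path = Path("prompts"), **variables) -> str:
    """Load a versioned prompt template from the repo and fill its placeholders.

    Because the templates live in source control, every change is diffed,
    reviewed, and revertible like any other code change.
    """
    template = (prompt_dir / f"{name}.txt").read_text(encoding="utf-8")
    return template.format(**variables)
```

From here, a prompt change is a pull request, and rolling back is `git revert`.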

Latency Has Tails

Average latency for GPT-4-class models is roughly 1-3 seconds for typical requests. P99 is often 8-15 seconds. P99.9 includes timeouts.

For most applications, you should:

  1. Stream all responses — users tolerate latency much better when they see tokens appearing
  2. Set aggressive timeouts and have a fallback path (retry with a faster model, return a cached response)
  3. Track latency percentiles, not averages — averages hide the user experience for 1 in 100
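The timeout-and-fallback path (point 2) can be sketched with a thread pool deadline. The model functions below are stand-ins, not real API calls:

```python
import concurrent.futures
import time

def call_with_fallback(primary, fallback, timeout_s):
    """Run the primary model call with a hard deadline; fall back if it misses."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(primary)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            future.cancel()  # best effort; the worker thread may still be running
            return fallback()

def slow_model():
    # Stand-in for a GPT-4-class call that blew its latency budget.
    time.sleep(0.2)
    return "slow answer"

def cached_response():
    # Stand-in for a faster model or a cached reply.
    return "fallback answer"
```

The same wrapper works whether the fallback is a cheaper model, a cached answer, or a graceful error message; the point is that the user never waits on the P99.9 tail.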

The Prompt Injection Problem

If your product processes user-provided text through an LLM, you have a prompt injection surface. This is not theoretical.

Common scenarios:

  • Document summarization where the document contains "Ignore previous instructions"
  • Customer support bots that process user-submitted tickets
  • Code review tools that analyze user-submitted code with embedded instructions

Defense in depth:

  • Sanitize inputs (strip instruction-like patterns before including in prompts)
  • Separate system and user content with role delimiters the model respects
  • Treat LLM outputs as untrusted user input before rendering them
  • Monitor for anomalous output patterns

No defense is perfect. Assume attackers will find ways around your guardrails and design your system so that a successful injection doesn't cause irreversible harm.
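The first layer — flagging instruction-like patterns — can start as simple pattern matching. These patterns are illustrative only; real attacks are far more varied, so treat this as one layer, never the whole defense:

```python
import re

# Illustrative override patterns; attackers will phrase around these.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior|above) instructions",
    r"you are now",
    r"system prompt",
]

def flag_suspicious(text: str) -> bool:
    """Flag user text that looks like an instruction-override attempt."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS)
```

Flagged inputs can be quarantined for review or routed through a more constrained prompt rather than rejected outright, which keeps false positives from breaking legitimate users.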

Evals are Non-Negotiable

You cannot ship changes to your AI system confidently without evals. A test suite of 50-100 representative prompts and expected output characteristics (not exact string matches — LLMs are stochastic) is the minimum bar.

What to eval:

  • Task accuracy: Does the model do the right thing?
  • Format compliance: Does the output match the expected structure?
  • Refusal rate: Is the model refusing valid requests?
  • Hallucination rate: Is the model making up facts in a domain where you can verify?

Run evals before every prompt change, model upgrade, and temperature adjustment. The cost of an eval suite is a rounding error compared to the cost of a silent regression in production.
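A minimal harness captures the "characteristics, not exact matches" idea: each case pairs a prompt with predicate checks. This is a sketch, with a hypothetical case and a stand-in for your model call:

```python
# Each eval case pairs a prompt with characteristic checks (predicates),
# not exact-string matches, since LLM outputs are stochastic.
EVAL_CASES = [
    {
        "prompt": "List three risks of deploying LLMs.",
        "checks": [
            lambda out: len(out) > 0,                        # non-empty
            lambda out: out.count("\n") >= 2 or "," in out,  # looks list-like
        ],
    },
]

def run_evals(model_fn, cases):
    """Run every case through model_fn; return (passed, failed) counts."""
    passed = failed = 0
    for case in cases:
        output = model_fn(case["prompt"])
        if all(check(output) for check in case["checks"]):
            passed += 1
        else:
            failed += 1
    return passed, failed
```

Wire `run_evals` into CI so that any prompt change, model upgrade, or temperature adjustment has to clear the suite before it ships.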

Model Versioning Surprises

Model providers update their models without always announcing breaking behavioral changes. A model that was reliable at your task in Q1 may behave differently in Q4, even with the same version tag.

Point to specific model versions in production (e.g., gpt-4-0613, not gpt-4). Subscribe to your provider's changelog. Run your eval suite against any model update before rolling it out.
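In code, that policy amounts to pinning a snapshot and gating any upgrade behind the eval suite. A sketch — the candidate model name here is a hypothetical placeholder:

```python
# Pin an exact snapshot, never the floating alias ("gpt-4-0613", not "gpt-4").
PINNED_MODEL = "gpt-4-0613"
CANDIDATE_MODEL = "gpt-4-1106-preview"  # hypothetical upgrade under evaluation

def choose_model(candidate_passed_evals: bool) -> str:
    """Only promote the candidate once it clears the full eval suite."""
    return CANDIDATE_MODEL if candidate_passed_evals else PINNED_MODEL
```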

What Actually Matters

After all of this, the one thing that predicts success more than anything else: feedback loops.

Teams that instrument everything (latency, cost, user thumbs-up/down, session length), evaluate regularly, and iterate on their prompts weekly consistently outperform teams that ship a v1 and assume the model handles the rest.

The model is a component in a system. The system needs the same engineering discipline as any other component in production.
