DEV Community

Denis Moroz

Posted on • Originally published at denismoroz.ai

LLMs in Production: What No One Tells You

Deploying a language model demo is easy. Running it in production — reliably, at scale, within budget — is not. After shipping several LLM-backed products, here's the honest picture.

Cost Is Not Linear

Every engineer does the math: tokens in × tokens out × price per 1M tokens = monthly bill. Then they ship to production and discover the bill is 4x the estimate.

Why? Because production traffic is never as clean as your prototype. Real users:

  • Send ambiguous queries that need clarification rounds
  • Retry when responses feel off
  • Trigger edge cases your prompt never anticipated
  • Explore the product in ways you didn't model

Budget 2-3x your projected token usage for the first quarter in production. Track cost per user, not cost in aggregate — aggregate numbers hide the outliers who will blow your budget.
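A per-user view can be as simple as an accumulator keyed by user ID. A minimal sketch — the prices here are hypothetical placeholders, so substitute your provider's actual per-1M-token rates:

```python
from collections import defaultdict

# Hypothetical per-1M-token prices; substitute your provider's real rates.
PRICE_PER_M_INPUT = 2.50
PRICE_PER_M_OUTPUT = 10.00

class CostTracker:
    """Accumulates estimated spend per user so outliers become visible."""

    def __init__(self):
        self.spend = defaultdict(float)

    def record(self, user_id, input_tokens, output_tokens):
        cost = (input_tokens * PRICE_PER_M_INPUT +
                output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
        self.spend[user_id] += cost
        return cost

    def top_spenders(self, n=10):
        # Sort descending by spend; these are the users to investigate.
        return sorted(self.spend.items(), key=lambda kv: -kv[1])[:n]

tracker = CostTracker()
tracker.record("user-a", input_tokens=1_200, output_tokens=400)
tracker.record("user-b", input_tokens=250_000, output_tokens=80_000)  # an outlier
```

Surfacing `top_spenders()` on a dashboard makes the outlier problem concrete long before the invoice does.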

Prompt Engineering is Software Engineering

Treat prompts like code:

  • Version control them
  • Test them against a regression suite
  • Review changes before deployment
  • Monitor production drift

I've seen teams ship prompt changes as untracked edits to environment variables. Three weeks later, a regression surfaced in a corner case they didn't know existed. No way to diff, no way to roll back.

Use a prompt management system. At minimum, store prompts in your repo, not in .env files.
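The minimum viable version of this is a `prompts/` directory in the repo, one template file per prompt, loaded at runtime. A sketch, assuming a simple `str.format`-style template convention:

```python
from pathlib import Path

def load_prompt(name: str, prompt_dir: Path = Path("prompts"), **variables) -> str:
    """Load a versioned prompt template from the repo and fill its placeholders.

    Because the templates live in source control, every change is diffed,
    reviewed, and revertible like any other code change.
    """
    template = (prompt_dir / f"{name}.txt").read_text(encoding="utf-8")
    return template.format(**variables)
```

From here, a prompt change is a pull request, and rolling back is `git revert`.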

Latency Has Tails

Average latency for GPT-4-class models is roughly 1-3 seconds for typical requests. P99 is often 8-15 seconds. P99.9 includes timeouts.

For most applications, you should:

  1. Stream all responses — users tolerate latency much better when they see tokens appearing
  2. Set aggressive timeouts and have a fallback path (retry with a faster model, return a cached response)
  3. Track latency percentiles, not averages — averages hide the user experience for 1 in 100
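The timeout-and-fallback path (point 2) can be sketched with a thread pool deadline. The model functions below are stand-ins, not real API calls:

```python
import concurrent.futures
import time

def call_with_fallback(primary, fallback, timeout_s):
    """Run the primary model call with a hard deadline; fall back if it misses."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(primary)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            future.cancel()  # best effort; the worker thread may still be running
            return fallback()

def slow_model():
    # Stand-in for a GPT-4-class call that blew its latency budget.
    time.sleep(0.2)
    return "slow answer"

def cached_response():
    # Stand-in for a faster model or a cached reply.
    return "fallback answer"
```

The same wrapper works whether the fallback is a cheaper model, a cached answer, or a graceful error message; the point is that the user never waits on the P99.9 tail.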

The Prompt Injection Problem

If your product processes user-provided text through an LLM, you have a prompt injection surface. This is not theoretical.

Common scenarios:

  • Document summarization where the document contains "Ignore previous instructions"
  • Customer support bots that process user-submitted tickets
  • Code review tools that analyze user-submitted code with embedded instructions

Defense in depth:

  • Sanitize inputs (strip instruction-like patterns before including in prompts)
  • Separate system and user content with role delimiters the model respects
  • Treat LLM outputs as untrusted user input before rendering them
  • Monitor for anomalous output patterns

No defense is perfect. Assume attackers will find ways around your guardrails and design your system so that a successful injection doesn't cause irreversible harm.
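The first layer — flagging instruction-like patterns — can start as simple pattern matching. These patterns are illustrative only; real attacks are far more varied, so treat this as one layer, never the whole defense:

```python
import re

# Illustrative override patterns; attackers will phrase around these.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior|above) instructions",
    r"you are now",
    r"system prompt",
]

def flag_suspicious(text: str) -> bool:
    """Flag user text that looks like an instruction-override attempt."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS)
```

Flagged inputs can be quarantined for review or routed through a more constrained prompt rather than rejected outright, which keeps false positives from breaking legitimate users.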

Evals are Non-Negotiable

You cannot ship changes to your AI system confidently without evals. A test suite of 50-100 representative prompts and expected output characteristics (not exact string matches — LLMs are stochastic) is the minimum bar.

What to eval:

  • Task accuracy: Does the model do the right thing?
  • Format compliance: Does the output match the expected structure?
  • Refusal rate: Is the model refusing valid requests?
  • Hallucination rate: Is the model making up facts in a domain where you can verify?

Run evals before every prompt change, model upgrade, and temperature adjustment. The cost of an eval suite is a rounding error compared to the cost of a silent regression in production.
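A minimal harness captures the "characteristics, not exact matches" idea: each case pairs a prompt with predicate checks. This is a sketch, with a hypothetical case and a stand-in for your model call:

```python
# Each eval case pairs a prompt with characteristic checks (predicates),
# not exact-string matches, since LLM outputs are stochastic.
EVAL_CASES = [
    {
        "prompt": "List three risks of deploying LLMs.",
        "checks": [
            lambda out: len(out) > 0,                        # non-empty
            lambda out: out.count("\n") >= 2 or "," in out,  # looks list-like
        ],
    },
]

def run_evals(model_fn, cases):
    """Run every case through model_fn; return (passed, failed) counts."""
    passed = failed = 0
    for case in cases:
        output = model_fn(case["prompt"])
        if all(check(output) for check in case["checks"]):
            passed += 1
        else:
            failed += 1
    return passed, failed
```

Wire `run_evals` into CI so that any prompt change, model upgrade, or temperature adjustment has to clear the suite before it ships.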

Model Versioning Surprises

Model providers update their models without always announcing breaking behavioral changes. A model that was reliable at your task in Q1 may behave differently in Q4, even with the same version tag.

Point to specific model versions in production (e.g., gpt-4-0613, not gpt-4). Subscribe to your provider's changelog. Run your eval suite against any model update before rolling it out.
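In code, that policy amounts to pinning a snapshot and gating any upgrade behind the eval suite. A sketch — the candidate model name here is a hypothetical placeholder:

```python
# Pin an exact snapshot, never the floating alias ("gpt-4-0613", not "gpt-4").
PINNED_MODEL = "gpt-4-0613"
CANDIDATE_MODEL = "gpt-4-1106-preview"  # hypothetical upgrade under evaluation

def choose_model(candidate_passed_evals: bool) -> str:
    """Only promote the candidate once it clears the full eval suite."""
    return CANDIDATE_MODEL if candidate_passed_evals else PINNED_MODEL
```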

What Actually Matters

After all of this, the one thing that predicts success more than anything else: feedback loops.

Teams that instrument everything (latency, cost, user thumbs-up/down, session length), evaluate regularly, and iterate on their prompts weekly consistently outperform teams that ship a v1 and assume the model handles the rest.

The model is a component in a system. The system needs the same engineering discipline as any other component in production.
