There's a great post on Dev.to right now arguing that LLMs are not deterministic and that making them reliable is expensive. The author is right on both counts. But I want to push on the "expensive" part, because I think most developers overestimate the difficulty and underestimate how mundane the solutions actually are.
Making LLMs reliable in production is not an AI problem. It's a plumbing problem.
The Demo vs. Production Gap
Every AI demo looks magical. One prompt, one model call, one beautiful result.
Then you try to ship it.
The model hallucinates a field name. It returns JSON with a trailing comma. It gives you a confident answer that's wrong in a way you'd never anticipate. It works perfectly 97 times out of 100, and the 3 failures are catastrophic.
This is not a bug. This is the fundamental nature of probabilistic systems. If you're surprised by it, you haven't shipped one yet.
The Boring Solutions That Actually Work
Here's what I've found works in practice, running AI-powered workflows daily:
1. Structured Output, Always
Never let the model free-form respond when you need to act on the output. Use JSON mode, function calling, or whatever structured output format your model supports. If the model's response needs to be parsed by code downstream, force it into a schema.
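Here's the idea in miniature. Everything in this sketch is illustrative: call_llm is a hypothetical stand-in for your real model call, and the "schema" is just a contract in the prompt plus a strict parse, so malformed output fails loudly at the boundary instead of leaking downstream.

```python
import json

# Hypothetical schema contract, embedded in the prompt.
SCHEMA_HINT = (
    "Respond ONLY with a JSON object of the form "
    '{"action": "<string>", "confidence": <number between 0 and 1>}.'
)

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call (JSON mode, function calling, etc.).
    return '{"action": "archive", "confidence": 0.92}'

def structured_call(task: str) -> dict:
    raw = call_llm(f"{task}\n\n{SCHEMA_HINT}")
    # Strict parse: a trailing comma or a prose preamble raises here,
    # at the boundary, instead of three functions downstream.
    return json.loads(raw)
```

With a real provider you'd use its native JSON mode or function calling rather than a prompt hint, but the shape is the same: one boundary where free-form text becomes typed data.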
This alone eliminates maybe 60% of production issues.
2. Validate Like It's User Input
Treat every LLM response the way you'd treat a form submission from a user. Validate types, check required fields, verify that values are within expected ranges.
# Don't do this: act on unvalidated output
result = call_llm(prompt)
do_something(result["action"])

# Do this: validate first, with an explicit failure path
result = call_llm(prompt)
validated = validate_schema(result, expected_schema)
if validated.errors:
    retry_or_fallback()
else:
    do_something(validated.data["action"])
I know this looks obvious. I'm constantly amazed by how many production AI systems skip this step.
3. Retry With Variation
When a call fails validation, don't just retry the same prompt. Rephrase it, add an example of the expected output, or nudge the temperature. In my experience, 2-3 retries with small prompt variations recover from most transient failures.
The key word is "most." You still need a fallback for when retries don't work.
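Sketched out, with call_llm again a hypothetical stand-in and the contract being "a JSON object with an action field":

```python
import json

# Each retry appends a different nudge to the prompt.
VARIATIONS = [
    "",  # first attempt: the original prompt, unchanged
    "\nReturn ONLY valid JSON, with no commentary.",
    '\nExample of a valid response: {"action": "archive"}',
]

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return '{"action": "archive"}'

def call_with_retries(prompt: str, max_attempts: int = 3):
    for attempt in range(max_attempts):
        raw = call_llm(prompt + VARIATIONS[attempt % len(VARIATIONS)])
        try:
            result = json.loads(raw)
            if "action" in result:
                return result
        except json.JSONDecodeError:
            pass  # malformed output: fall through to the next variation
    return None  # retries exhausted; the caller owns the fallback
```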
4. Chain Reliability Compounds
This is where it gets interesting. If one LLM call is 95% reliable, a chain of 5 calls is about 77% reliable (0.95^5). That's a real problem.
The solution is boring: make each step independently validated, with clear success/failure signals. If step 3 fails, you need to know whether to retry step 3 or go back to step 2. You need checkpointing.
Sound familiar? It's the exact same pattern as any distributed system. Message queues, retry policies, dead letter queues, idempotency keys. The LLM is just another unreliable service in your architecture.
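Both halves of that fit in a few lines. chain_reliability is just the arithmetic; run_pipeline is an illustrative checkpoint loop (the step/validator pairs are hypothetical, not a real framework):

```python
def chain_reliability(per_step: float, steps: int) -> float:
    # 0.95 ** 5 ≈ 0.774: five "fine in the demo" steps, one flaky chain.
    return per_step ** steps

def run_pipeline(steps, state):
    """Run (step, validate) pairs in order; on failure, report exactly
    which step broke and return the last validated state as the
    checkpoint, so the caller can retry one step, not the whole chain."""
    for i, (step, validate) in enumerate(steps):
        candidate = step(state)
        if not validate(candidate):
            return {"ok": False, "failed_step": i, "checkpoint": state}
        state = candidate  # checkpoint: last known-good state
    return {"ok": True, "result": state}
```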
5. Log Everything
In a traditional app, you can reproduce bugs. With LLMs, the same input might produce different output tomorrow. Log the full prompt, the full response, and the validation result for every call. When something goes wrong in production, you'll need this to understand what happened.
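A bare-bones version with the standard library; logged_call is a hypothetical wrapper, and in production you'd point this at whatever structured logging you already run:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")

def logged_call(prompt, call, validate):
    """Emit one structured record per call: prompt, response, and
    validation result. Enough to reconstruct a failure after the fact."""
    response = call(prompt)
    errors = validate(response)
    log.info(json.dumps({
        "ts": time.time(),
        "prompt": prompt,
        "response": response,
        "validation_errors": errors,
    }))
    return response, errors
```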
The Mental Shift
The developers I see struggling with LLM reliability are usually thinking about it as an AI problem. They're reading papers about prompt engineering, fine-tuning, and model selection.
The developers who ship reliably are thinking about it as an infrastructure problem. The LLM is a service that sometimes returns bad data. Build accordingly.
That's not exciting. It's not going to get you Twitter engagement. But it works.
One Thing To Try
If you have an LLM call in production right now without output validation, add it this week. Just schema validation on the response. Track how often it fails. You'll learn more from that one metric than from any blog post (including this one).
I'm a developer who uses AI tools every day and writes about what actually works. If you're figuring out how to integrate AI into your work without the hype, I share practical workflows and tools at updatewave.gumroad.com.