17 Things I Wish I Knew Before Building My First Production LLM System
I deployed an LLM system to 236 real users last year. It broke in ways I couldn't have anticipated, taught me lessons I'll carry forever, and made me realize that building for production is fundamentally different from building a demo. Here's what I learned.
1. JSON Schema Validation Saves Your Life
Never trust an LLM to output valid JSON. Ever. I learned this the hard way when a single malformed response cascaded into 47 failed user requests. Validate every single output before you touch it.
```python
import json
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "reasoning": {"type": "string"}
    },
    "required": ["sentiment", "confidence", "reasoning"]
}

def validate_llm_output(response_text):
    try:
        data = json.loads(response_text)
        jsonschema.validate(instance=data, schema=schema)
        return data
    except (json.JSONDecodeError, jsonschema.ValidationError) as e:
        raise ValueError(f"Invalid LLM output: {e}")
```
This single pattern prevented approximately 60% of the bugs I would have encountered in production.
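One related failure mode: models often wrap JSON in markdown fences or add a chatty preamble, which makes `json.loads` fail before validation even runs. Before validating, I'd pull out the outermost JSON object first. This is my own sketch (`extract_json` is not part of the code above), and the greedy brace match is a heuristic, not a parser:

```python
import json
import re

def extract_json(text):
    """Pull the outermost JSON object out of model output before parsing.

    Handles markdown fences and chatty preambles; it's a heuristic, so
    validate the result with your schema afterwards as usual.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("No JSON object found in model output")
    return json.loads(match.group(0))
```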
2. Model-Agnostic From Day 1: Store Model Name in DB
You will switch models. Maybe GPT-4 becomes too expensive, maybe Claude gets better at your task, maybe you need to self-host. If your model name is hardcoded, you're going to have a bad time. Store it in your database from day one.
```sql
CREATE TABLE llm_configs (
    id INT PRIMARY KEY AUTO_INCREMENT,
    config_name VARCHAR(255) UNIQUE,
    model_name VARCHAR(255),
    api_key_secret VARCHAR(255),
    temperature FLOAT,
    max_tokens INT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

INSERT INTO llm_configs (config_name, model_name, temperature, max_tokens)
VALUES ('sentiment_analysis', 'gpt-4-turbo', 0.0, 500);
```
Then in your code:
```python
def get_llm_config(config_name):
    result = db.query(
        "SELECT model_name, temperature, max_tokens FROM llm_configs WHERE config_name = %s",
        (config_name,)
    )
    return result[0] if result else None
```
This saved me three days of refactoring when I needed to downgrade from GPT-4 to GPT-3.5-turbo for cost reasons.
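To make the stored `model_name` actually drive behavior, a thin dispatch layer helps: call sites never name a provider, they just hand over the config row. A minimal sketch, assuming providers register under a model-name prefix (`_openai_stub` here is a placeholder, not a real client):

```python
# Registry mapping model-name prefixes to provider callables.
PROVIDERS = {}

def register(prefix):
    def deco(fn):
        PROVIDERS[prefix] = fn
        return fn
    return deco

@register("gpt-")
def _openai_stub(prompt, model_name, temperature, max_tokens):
    # Placeholder: a real implementation would call the OpenAI API here.
    return f"[{model_name}] {prompt}"

def dispatch(prompt, config):
    # config is the row fetched by get_llm_config()
    for prefix, fn in PROVIDERS.items():
        if config["model_name"].startswith(prefix):
            return fn(prompt, config["model_name"],
                      config["temperature"], config["max_tokens"])
    raise ValueError(f"No provider registered for {config['model_name']}")
```

With this in place, switching models means updating one row in `llm_configs`, not touching call sites.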
3. Retry Logic Is the Difference Between Dev and Prod
Your LLM provider will have blips. Rate limits will hit. Networks will hiccup. Exponential backoff with jitter isn't optional—it's required.
```python
import time
import random
import openai
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries - 1:
                        raise
                    # Exponential backoff plus jitter, so retries don't synchronize
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                    print(f"Attempt {attempt + 1} failed. Retrying in {delay:.2f}s...")
                    time.sleep(delay)
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3, base_delay=1)
def call_llm(prompt):
    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
```
This reduced our error rate from 3.2% to 0.4% in production.
4. System Prompts Are Code: Version Them and Test Them
Your system prompt is not a one-time thing. It's code. It evolves. It needs versions, tests, and rollback capability.
```python
# prompts.py
SYSTEM_PROMPTS = {
    "v1.0": """You are a customer support assistant. Be helpful, concise, and professional.""",
    "v1.1": """You are a customer support assistant. Be helpful, concise, and professional.
If you don't know something, say "I don't know" instead of guessing.""",
    "v1.2": """You are a customer support assistant. Be helpful, concise, and professional.
If you don't know something, say "I don't know" instead of guessing.
Always cite sources when providing information."""
}

def get_system_prompt(version="latest"):
    if version == "latest":
        # Note: string max() compares lexicographically; fine up to v1.9
        return SYSTEM_PROMPTS[max(SYSTEM_PROMPTS.keys())]
    return SYSTEM_PROMPTS.get(version)

# In your tests
def test_system_prompt_v1_2():
    response = call_llm("What's your favorite food?", prompt_version="v1.2")
    assert "I don't know" in response or "prefer" not in response.lower()
```
When a system prompt change caused issues, we rolled back in 30 seconds instead of debugging for hours.
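One caveat with `max(SYSTEM_PROMPTS.keys())`: string comparison is lexicographic, so `"v1.9"` would sort above `"v1.10"` the day you reach double-digit minor versions. A numeric-aware comparison avoids the surprise (my own sketch; `latest_version` is not in the snippet above):

```python
def latest_version(versions):
    # "v1.10" must beat "v1.9": compare numeric parts, not raw strings
    return max(versions, key=lambda v: tuple(int(p) for p in v.lstrip("v").split(".")))
```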
5. Streaming Is Not Optional for UX
Users will leave if they stare at a loading spinner for 8 seconds. Streaming isn't a nice-to-have—it's table stakes for production LLM apps.
```python
from flask import Flask, Response
import json
import openai

app = Flask(__name__)

@app.route("/chat", methods=["GET"])  # EventSource on the frontend can only issue GET requests
def chat():
    def generate():
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": "Explain quantum computing"}],
            stream=True
        )
        for chunk in response:
            if "content" in chunk["choices"][0]["delta"]:
                content = chunk["choices"][0]["delta"]["content"]
                yield f"data: {json.dumps({'text': content})}\n\n"
    return Response(generate(), mimetype="text/event-stream")
```
And on the frontend:
```javascript
const eventSource = new EventSource("/chat");
eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);
  // textContent (not innerHTML) so model output can't inject markup
  document.getElementById("response").textContent += data.text;
};
```
Streaming reduced perceived latency from "feels slow" to "feels responsive" even though the actual time-to-first-token didn't change much.
6. Your Most Expensive Token Is the One That Causes a Retry Loop
A single token that causes validation to fail, which triggers a retry, which fails again—that's not just expensive, it's a death spiral. One user hit a retry loop that cost us $47 before we noticed.
```python
def call_llm_with_circuit_breaker(prompt, config):
    cost_this_call = 0
    max_cost_per_call = 0.50  # Hard limit in dollars

    for attempt in range(3):
        try:
            response = openai.ChatCompletion.create(
                model=config["model"],
                messages=[{"role": "user", "content": prompt}],
                temperature=config["temperature"]
            )
            # Estimate cost (simplified; a real tracker should use the
            # provider's reported token usage, not a word count)
            cost_this_call += len(prompt.split()) * 0.00001
            if cost_this_call > max_cost_per_call:
                raise Exception(f"Cost exceeded limit: ${cost_this_call}")
            return validate_llm_output(response["choices"][0]["message"]["content"])
        except Exception as e:
            if attempt == 2:
                # log_alert: your alerting hook (PagerDuty, Slack, etc.)
                log_alert(f"Failed after 3 attempts, cost: ${cost_this_call}", severity="critical")
                raise
```
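For the cost accounting itself, the word-count estimate above undercounts: completion tokens usually cost more than prompt tokens, and OpenAI-style responses report both in a `usage` field. A sketch of a usage-based estimate (the prices here are made-up placeholders, not real pricing; check your provider's current rates):

```python
# Hypothetical (prompt, completion) prices per 1K tokens -- NOT real pricing.
PRICES_PER_1K = {"gpt-4": (0.03, 0.06)}

def estimate_cost(model, prompt_tokens, completion_tokens):
    # Input and output tokens are priced separately on most providers.
    in_price, out_price = PRICES_PER_1K[model]
    return (prompt_tokens / 1000) * in_price + (completion_tokens / 1000) * out_price
```

In the circuit breaker, you would feed this `response["usage"]["prompt_tokens"]` and `response["usage"]["completion_tokens"]` instead of `len(prompt.split())`.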
7. Context Window Bloat Is a Hidden Cost
Every token you send costs money. Every token you receive costs money. I was sending 8KB of context when I needed 2KB. That's a 4x cost multiplier for no benefit.
```python
def build_context_efficiently(user_history, max_context_tokens=2000):
    """Only include recent, relevant messages"""
    # Don't include the entire history
    recent_messages = user_history[-5:]  # Last 5 messages only
    context = "\n".join([
        f"User: {msg['user_message']}\nAssistant: {msg['assistant_response']}"
        for msg in recent_messages
    ])
    # Check token count (whitespace split is a rough proxy for real tokens)
    token_count = len(context.split())
    if token_count > max_context_tokens:
        # Truncate oldest messages first
        context = "\n".join(context.split("\n")[-20:])
    return context
```
Implementing this reduced our per-request token usage by 65% without degrading quality.
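The fixed `[-5:]` window can still blow the budget when individual messages are long. A budget-aware trim that walks newest-to-oldest and keeps whole messages is a small improvement (my own sketch, not the function above; the whitespace split stands in for a real tokenizer):

```python
def trim_history(messages, max_tokens, count=lambda s: len(s.split())):
    # Walk newest-to-oldest, keeping whole messages until the budget is spent.
    kept, total = [], 0
    for msg in reversed(messages):
        cost = count(msg)
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order
```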
8. Temperature 0 for Structured Output, 0.7 for Creative
This seems obvious in retrospect, but I spent two weeks debugging "inconsistent JSON output" before realizing I had temperature set to 0.9. Temperature isn't a knob you set once.
```python
TEMPERATURE_CONFIGS = {
    "sentiment_analysis": 0.0,   # Deterministic
    "classification": 0.0,       # Deterministic
    "content_generation": 0.7,   # Creative but coherent
    "brainstorming": 0.9,        # Very creative
    "code_generation": 0.2,      # Mostly deterministic, slight variation
}

def call_llm_with_task(prompt, task_type):
    temperature = TEMPERATURE_CONFIGS.get(task_type, 0.7)
    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )
```
9. Tool Calling Is More Reliable Than JSON-in-Prompt
When you need structured output, use the model's tool calling API instead of asking the model to output JSON in a prompt. It's more reliable, more consistent, and actually simpler.
```python
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Extract the customer's name and email"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "extract_contact_info",
            "description": "Extract contact information from text",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "email": {"type": "string"}
                },
                "required": ["name", "email"]
            }
        }
    }],
    tool_choice="auto"
)

# The response is already structured
tool_call = response["choices"][0]["message"]["tool_calls"][0]
extracted_data = json.loads(tool_call["function"]["arguments"])
```
This reduced JSON parsing errors from 8% to 0.3%.
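Even with tool calling, the `arguments` string is still model-generated, so I'd verify required fields before touching the data. A minimal check (my own sketch, independent of the snippet above):

```python
import json

def parse_tool_args(raw_args, required=("name", "email")):
    # Tool-call arguments arrive as a JSON string; the model can still omit fields.
    data = json.loads(raw_args)
    missing = [k for k in required if k not in data]
    if missing:
        raise ValueError(f"Tool call missing fields: {missing}")
    return data
```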
10. Log Every LLM Call With Input/Output/Latency/Cost
You cannot debug what you cannot see. Log everything. When something goes wrong at 2 AM, you'll be grateful.
```python
import logging
import time
import json
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def call_llm_with_logging(prompt, model, user_id):
    start_time = time.time()
    try:
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        output = response["choices"][0]["message"]["content"]
        # One structured log line per call: who, what, with which model, how long
        logger.info(json.dumps({
            "ts": datetime.utcnow().isoformat(),
            "user_id": user_id,
            "model": model,
            "prompt": prompt,
            "output": output,
            "latency_s": round(time.time() - start_time, 3),
        }))
        return output
    except Exception as e:
        logger.error(f"LLM call failed for user {user_id} after {time.time() - start_time:.2f}s: {e}")
        raise
```
---
**Need an AI system for your business?**
I'm Alessandro Trimarco, AI engineer behind a 6-module AI stack for a 14-location restaurant chain (236 users, ~88k EUR/month processed).
Email: **alevibecoding@gmail.com** | [Portfolio](https://alessandrotrimarco.github.io) | [Case study](https://github.com/AlessandroTrimarco/aires-burger-case-study)