Ale Santini

17 Things I Wish I Knew Before Building My First Production LLM System

I deployed an LLM system to 236 real users last year. It broke in ways I couldn't have anticipated, taught me lessons I'll carry forever, and made me realize that building for production is fundamentally different from building a demo. Here's what I learned.


1. JSON Schema Validation Saves Your Life

Never trust an LLM to output valid JSON. Ever. I learned this the hard way when a single malformed response cascaded into 47 failed user requests. Validate every single output before you touch it.

import json
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "reasoning": {"type": "string"}
    },
    "required": ["sentiment", "confidence", "reasoning"]
}

def validate_llm_output(response_text):
    try:
        data = json.loads(response_text)
        jsonschema.validate(instance=data, schema=schema)
        return data
    except (json.JSONDecodeError, jsonschema.ValidationError) as e:
        raise ValueError(f"Invalid LLM output: {e}")

This single pattern prevented approximately 60% of the bugs I would have encountered in production.
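One related failure mode the validator above doesn't cover: models often wrap JSON in markdown fences even when told not to, which makes `json.loads` fail before the schema check ever runs. A small pre-parse helper catches that class of failure (the regex is a heuristic of mine, not from the original system):

```python
import json
import re

def extract_json_block(response_text):
    """Strip markdown code fences that models often wrap around JSON.
    Matches ```json ... ``` or bare ``` ... ``` fences; falls back to
    parsing the raw text if no fence is present."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", response_text, re.DOTALL)
    candidate = match.group(1) if match else response_text.strip()
    return json.loads(candidate)
```

Run the result through the schema validator afterwards as usual; this only handles the fence, not semantic validity.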


2. Model-Agnostic From Day 1: Store Model Name in DB

You will switch models. Maybe GPT-4 becomes too expensive, maybe Claude gets better at your task, maybe you need to self-host. If your model name is hardcoded, you're going to have a bad time. Store it in your database from day one.

CREATE TABLE llm_configs (
    id INT PRIMARY KEY AUTO_INCREMENT,
    config_name VARCHAR(255) UNIQUE,
    model_name VARCHAR(255),
    api_key_secret VARCHAR(255), -- reference into your secrets manager, never the raw key
    temperature FLOAT,
    max_tokens INT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

INSERT INTO llm_configs (config_name, model_name, temperature, max_tokens)
VALUES ('sentiment_analysis', 'gpt-4-turbo', 0.0, 500);

Then in your code:

def get_llm_config(config_name):
    result = db.query(
        "SELECT model_name, temperature, max_tokens FROM llm_configs WHERE config_name = %s",
        (config_name,)
    )
    return result[0] if result else None

This saved me three days of refactoring when I needed to downgrade from GPT-4 to GPT-3.5-turbo for cost reasons.
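One practical note: querying the config table on every request adds a DB round-trip to every LLM call. A short-lived cache keeps the flexibility without the latency. A minimal sketch, assuming you pass in the uncached lookup as `fetch` (the 60-second TTL is illustrative):

```python
import time

_config_cache = {}
_CACHE_TTL_SECONDS = 60  # how long a config stays fresh; tune to taste

def get_llm_config_cached(config_name, fetch):
    """Wrap the DB lookup with a TTL cache so every request doesn't
    hit the database. `fetch` is the uncached lookup function."""
    now = time.time()
    entry = _config_cache.get(config_name)
    if entry and now - entry[0] < _CACHE_TTL_SECONDS:
        return entry[1]  # still fresh, skip the DB
    config = fetch(config_name)
    _config_cache[config_name] = (now, config)
    return config
```

A 60-second TTL means a model switch in the DB takes effect within a minute, which is usually fast enough for a rollback.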


3. Retry Logic Is the Difference Between Dev and Prod

Your LLM provider will have blips. Rate limits will hit. Networks will hiccup. Exponential backoff with jitter isn't optional—it's required.

import time
import random
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries - 1:
                        raise

                    delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                    print(f"Attempt {attempt + 1} failed. Retrying in {delay:.2f}s...")
                    time.sleep(delay)
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3, base_delay=1)
def call_llm(prompt):
    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

This reduced our error rate from 3.2% to 0.4% in production.
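One refinement worth making to the decorator above: it retries every exception, including ones that will never succeed on retry (bad requests, validation failures), which burns money for nothing. A sketch that retries only errors you've classified as transient (`TransientError` is a hypothetical stand-in; map your SDK's rate-limit and timeout exceptions onto it):

```python
import time
import random
from functools import wraps

class TransientError(Exception):
    """Stand-in for rate-limit / timeout errors from your LLM SDK."""

def retry_transient(max_retries=3, base_delay=1, sleep=time.sleep):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except TransientError:
                    if attempt == max_retries - 1:
                        raise
                    # exponential backoff with jitter, as above
                    sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
                # any other exception propagates immediately --
                # retrying a bad request just costs more tokens
        return wrapper
    return decorator
```

The `sleep` parameter is injected so tests can run without real delays.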


4. System Prompts Are Code: Version Them and Test Them

Your system prompt is not a one-time thing. It's code. It evolves. It needs versions, tests, and rollback capability.

# prompts.py
SYSTEM_PROMPTS = {
    "v1.0": """You are a customer support assistant. Be helpful, concise, and professional.""",
    "v1.1": """You are a customer support assistant. Be helpful, concise, and professional. 
    If you don't know something, say "I don't know" instead of guessing.""",
    "v1.2": """You are a customer support assistant. Be helpful, concise, and professional. 
    If you don't know something, say "I don't know" instead of guessing.
    Always cite sources when providing information."""
}

def get_system_prompt(version="latest"):
    if version == "latest":
        # Note: lexicographic max works up to v1.9; switch to a real
        # version parser (e.g. packaging.version) past two digits.
        return SYSTEM_PROMPTS[max(SYSTEM_PROMPTS.keys())]
    return SYSTEM_PROMPTS.get(version)

# In your tests
def test_system_prompt_v1_2():
    response = call_llm("What's your favorite food?", prompt_version="v1.2")
    assert "I don't know" in response or "prefer" not in response.lower()

When a system prompt change caused issues, we rolled back in 30 seconds instead of debugging for hours.


5. Streaming Is Not Optional for UX

Users will leave if they stare at a loading spinner for 8 seconds. Streaming isn't a nice-to-have—it's table stakes for production LLM apps.

import json

from flask import Flask, Response
import openai

app = Flask(__name__)

@app.route("/chat")  # GET: the browser's EventSource API can only make GET requests
def chat():
    def generate():
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": "Explain quantum computing"}],
            stream=True
        )

        for chunk in response:
            if "content" in chunk["choices"][0]["delta"]:
                content = chunk["choices"][0]["delta"]["content"]
                yield f"data: {json.dumps({'text': content})}\n\n"

    return Response(generate(), mimetype="text/event-stream")

And on the frontend:

const eventSource = new EventSource("/chat");
eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);
    // append as text, not innerHTML, so model output can't inject markup
    document.getElementById("response").textContent += data.text;
};

Streaming reduced perceived latency from "feels slow" to "feels responsive" even though the actual time-to-first-token didn't change much.


6. Your Most Expensive Token Is the One That Causes a Retry Loop

A single token that causes validation to fail, which triggers a retry, which fails again—that's not just expensive, it's a death spiral. One user hit a retry loop that cost us $47 before we noticed.

import openai

def call_llm_with_circuit_breaker(prompt, config):
    cost_this_call = 0
    max_cost_per_call = 0.50  # Hard limit in dollars

    for attempt in range(3):
        try:
            response = openai.ChatCompletion.create(
                model=config["model"],
                messages=[{"role": "user", "content": prompt}],
                temperature=config["temperature"]
            )

            # Rough estimate from prompt length; use the `usage` field
            # in the API response for real accounting
            cost_this_call += len(prompt.split()) * 0.00001

            if cost_this_call > max_cost_per_call:
                raise Exception(f"Cost exceeded limit: ${cost_this_call}")

            return validate_llm_output(response["choices"][0]["message"]["content"])

        except Exception as e:
            if attempt == 2:
                # log_alert is our internal alerting helper
                log_alert(f"Failed after 3 attempts, cost: ${cost_this_call}", severity="critical")
                raise
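A per-call cap doesn't stop one user from triggering many capped calls in a row, which is exactly how our $47 incident happened. A minimal per-user daily budget sketch (in-memory for illustration; the $1.00 cap and names are mine, and production state belongs in Redis or the DB):

```python
from collections import defaultdict
from datetime import date

_daily_spend = defaultdict(float)

DAILY_BUDGET_PER_USER = 1.00  # dollars; illustrative threshold

def charge_user(user_id, cost):
    """Track per-user spend and refuse calls once the daily cap is hit.
    Call this before (or as part of) every LLM request."""
    key = (user_id, date.today())
    if _daily_spend[key] + cost > DAILY_BUDGET_PER_USER:
        raise RuntimeError(f"Daily LLM budget exceeded for user {user_id}")
    _daily_spend[key] += cost
    return _daily_spend[key]
```

Keying on `(user_id, date)` means budgets reset naturally at midnight without a cron job.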

7. Context Window Bloat Is a Hidden Cost

Every token you send costs money. Every token you receive costs money. I was sending 8KB of context when I needed 2KB. That's a 4x cost multiplier for no benefit.

def build_context_efficiently(user_history, max_context_tokens=2000):
    """Only include recent, relevant messages"""

    # Don't include the entire history
    recent_messages = user_history[-5:]  # Last 5 messages only

    context = "\n".join([
        f"User: {msg['user_message']}\nAssistant: {msg['assistant_response']}"
        for msg in recent_messages
    ])

    # Word count is a rough proxy; a real tokenizer (e.g. tiktoken)
    # is more accurate
    token_count = len(context.split())
    if token_count > max_context_tokens:
        # Crude fallback: keep only the last 20 lines
        context = "\n".join(context.split("\n")[-20:])

    return context

Implementing this reduced our per-request token usage by 65% without degrading quality.
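A refinement on the truncation above: drop whole messages oldest-first instead of slicing lines, so a message never gets cut in half mid-sentence. A sketch of that idea (word counts stand in for a real tokenizer; swap `count_tokens` for something like tiktoken in production):

```python
def trim_history(messages, max_tokens, count_tokens=lambda s: len(s.split())):
    """Keep the most recent messages whose combined (approximate)
    token count fits the budget. Walks newest-to-oldest and stops
    at the first message that would overflow."""
    kept = []
    total = 0
    for msg in reversed(messages):  # newest first
        cost = count_tokens(msg)
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order
```

Because it stops at the first overflow, the newest messages always survive, which is usually what matters for conversational context.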


8. Temperature 0 for Structured Output, 0.7 for Creative

This seems obvious in retrospect, but I spent two weeks debugging "inconsistent JSON output" before realizing I had temperature set to 0.9. Temperature isn't a knob you set once.

TEMPERATURE_CONFIGS = {
    "sentiment_analysis": 0.0,      # Deterministic
    "classification": 0.0,           # Deterministic
    "content_generation": 0.7,       # Creative but coherent
    "brainstorming": 0.9,            # Very creative
    "code_generation": 0.2,          # Mostly deterministic, slight variation
}

def call_llm_with_task(prompt, task_type):
    temperature = TEMPERATURE_CONFIGS.get(task_type, 0.7)

    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )

9. Tool Calling Is More Reliable Than JSON-in-Prompt

When you need structured output, use the model's tool calling API instead of asking the model to output JSON in a prompt. It's more reliable, more consistent, and actually simpler.

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Extract the customer's name and email"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "extract_contact_info",
            "description": "Extract contact information from text",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "email": {"type": "string"}
                },
                "required": ["name", "email"]
            }
        }
    }],
    tool_choice="auto"
)

# The response is already structured
tool_call = response["choices"][0]["message"]["tool_calls"][0]
extracted_data = json.loads(tool_call["function"]["arguments"])

This reduced JSON parsing errors from 8% to 0.3%.
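Even with tool calling, models occasionally answer in plain text instead of invoking the tool, and `arguments` can still arrive as malformed JSON, so defensive unpacking still pays off. A sketch (`parse_tool_call` is my own helper name; the dict shape mirrors the response format above):

```python
import json

def parse_tool_call(message, expected_name):
    """Defensively unpack the first tool call from a chat message dict.
    Raises ValueError if the model replied with text instead, or
    called a different tool than expected."""
    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        raise ValueError("Model replied with text instead of a tool call")
    call = tool_calls[0]["function"]
    if call["name"] != expected_name:
        raise ValueError(f"Unexpected tool: {call['name']}")
    # arguments is a JSON string; json.loads can still raise here,
    # which your retry/alerting layer should catch
    return json.loads(call["arguments"])
```

Wiring this into the retry logic from section 3 means a text-only reply gets one or two more chances before surfacing an error.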


10. Log Every LLM Call With Input/Output/Latency/Cost

You cannot debug what you cannot see. Log everything. When something goes wrong at 2 AM, you'll be grateful.


import logging
import time
import json
import openai

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def call_llm_with_logging(prompt, model, user_id):
    start_time = time.time()

    try:
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        logger.info(json.dumps({
            "user_id": user_id,
            "model": model,
            "input": prompt,
            "output": response["choices"][0]["message"]["content"],
            "latency_s": round(time.time() - start_time, 3),
            "usage": response["usage"],  # token counts -> derive cost
        }))
        return response
    except Exception as e:
        logger.error(f"LLM call failed for {user_id} after {time.time() - start_time:.2f}s: {e}")
        raise

---
**Need an AI system for your business?**
I'm Alessandro Trimarco, AI engineer behind a 6-module AI stack for a 14-location restaurant chain (236 users, ~88k EUR/month processed).
Email: **alevibecoding@gmail.com** | [Portfolio](https://alessandrotrimarco.github.io) | [Case study](https://github.com/AlessandroTrimarco/aires-burger-case-study)
