Introduction
If you have built APIs, databases, and distributed systems, you already have the mindset needed for AI engineering. The missing piece is a clear mental model of what a Large Language Model (LLM) actually is.
An LLM is not a search engine with better grammar. It is not a database of facts. It is a probabilistic token prediction engine trained on massive text corpora. You give it a sequence of tokens. It predicts the next token. Repeat that thousands of times and you get text that reads like an answer.
That single distinction explains both the power and the unreliability of modern AI applications.
Why This Matters
Backend engineers are comfortable with unreliable dependencies. Caches go stale. Third-party APIs return 500s. Queues back up. We design retries, circuit breakers, fallbacks, and observability around those failures.
LLMs are another unreliable dependency, but with a twist: they fail confidently. A wrong answer often looks as polished as a right one. There is no HTTP status code that says "this paragraph is hallucinated."
Production AI engineering is therefore the discipline of wrapping a non-deterministic core inside a deterministic system: retrieval, guardrails, parsers, eval gates, and human escalation paths.
Prerequisites
This is the first article in the AI Engineering Handbook. You should be comfortable with:
- Basic Python
- HTTP APIs and request/response flows
- The idea of latency, throughput, and error rates in production services
No prior machine learning coursework is required.
The Problem
Teams often treat the LLM as the entire product:
User -> LLM -> Answer
That mental model breaks in production:
- No grounding: The model invents facts about your domain that were never in its training data.
- No memory contract: Stateless APIs do not remember prior sessions unless you build memory.
- No permission boundary: The model will attempt any completion the prompt allows.
- Unbounded cost: Token usage scales with input and output length.
- Variable latency: Time to first token and total generation time depend on model size and load.
The model is the easy part. The system around it is where engineering begins.
Understanding the Core Concept
Tokens, not words
LLMs operate on tokens, subword units produced by a tokenizer. The word "unhappiness" might be one token or three depending on the tokenizer. This matters for:
- Billing: API pricing is per token.
- Context limits: Windows are measured in tokens, not characters.
- Domain failures: Rare product names may split into many tokens or, with some older tokenizers, map to unknown tokens, hurting quality and cost.
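To make the tokenizer effect concrete, here is a toy sketch. It uses greedy longest-match splitting over two made-up vocabularies, which is an assumption for illustration only; production tokenizers typically use byte-pair encoding or a similar learned scheme, and the product name `Zyntraq` is hypothetical.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match subword split (toy; not how real BPE tokenizers work)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocab entry matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Two hypothetical vocabularies for illustration.
VOCAB_LARGE = {"unhappiness", "un", "happi", "ness"}
VOCAB_SMALL = {"un", "happi", "ness"}

print(tokenize("unhappiness", VOCAB_LARGE))  # ['unhappiness']
print(tokenize("unhappiness", VOCAB_SMALL))  # ['un', 'happi', 'ness']
print(len(tokenize("Zyntraq", VOCAB_SMALL)))  # 7
```

The same string costs one token under one vocabulary and three under the other, and the unfamiliar name degrades to one token per character. That is the mechanism behind both the billing and the quality concerns listed above.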
Next-token prediction
At each step, the model outputs a probability distribution over the vocabulary. A decoding strategy (greedy, temperature sampling, top-p) selects the next token. The process repeats autoregressively until a stop condition.
There is no separate "fact checking" step. There is no guaranteed retrieval from a knowledge base unless you add retrieval (covered in Blog 002).
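A single decoding step can be sketched in plain Python. The four-token vocabulary and the logits below are invented for illustration; a real model emits logits over tens of thousands of tokens, and provider implementations of top-p may differ in detail.

```python
import math
import random


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs: list[float]) -> int:
    """Pick the index of the most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sample_top_p(probs: list[float], top_p: float, rng: random.Random) -> int:
    """Sample from the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in ranked:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Hypothetical vocabulary and logits for one decoding step.
VOCAB = ["refund", "policy", "banana", "<eos>"]
logits = [2.0, 1.5, -1.0, 0.5]

cold = softmax(logits, temperature=0.2)
warm = softmax(logits, temperature=1.5)
rng = random.Random(0)

print(VOCAB[greedy(cold)])  # refund
print(cold[0] > warm[0])    # True: low temperature concentrates mass on the top token
print(VOCAB[sample_top_p(warm, top_p=0.9, rng=rng)])
```

Nothing in this loop consults a source of truth. The step only reshapes and samples a distribution, which is why grounding has to come from the system around the model.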
Training vs inference
| Phase | What happens | Engineering concern |
|---|---|---|
| Pretraining | Learn language patterns from huge corpora | Model choice, license, capability ceiling |
| Fine-tuning / alignment | Adapt behavior to instructions or domain | Data quality, forgetting, eval |
| Inference | Generate tokens for your prompt | Latency, cost, guardrails |
As a backend engineer building applications, you mostly live in inference. You choose models, assemble context, and enforce policies around the call.
How It Works Internally (High Level)
- Tokenization: Raw text becomes integer token IDs.
- Embedding: Token IDs map to dense vectors.
- Transformer layers: Self-attention lets each token attend to others; feed-forward layers transform representations.
- Output head: Final layer projects to vocabulary logits.
- Sampling: Decoding strategy picks the next token.
- Repeat: Append token, update KV cache (Blog 013), continue until stop.
You do not need to implement a transformer to ship a product. You do need to know that latency and memory grow with context length and output length.
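One way to see why memory grows with context is to estimate the size of the KV cache, which stores a key and a value vector per token, per layer, per KV head. The model dimensions below are hypothetical and chosen only to make the arithmetic easy to follow; real deployments vary and may use quantized or paged caches.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes assumes fp16/bf16 storage
    batch_size: int = 1,
) -> int:
    """Estimate KV cache size: one key and one value vector per token, layer, and KV head."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical model shape, for illustration only.
HYPOTHETICAL_MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)

for seq_len in (1_024, 8_192, 32_768):
    gib = kv_cache_bytes(seq_len, **HYPOTHETICAL_MODEL) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:.3f} GiB per sequence")
```

Under these assumptions each token costs 128 KiB of cache, so a single sequence needs about 0.125 GiB at 1,024 tokens, 1 GiB at 8,192, and 4 GiB at 32,768. The growth is linear in context length and multiplies again by the number of concurrent sequences.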
Step-by-Step Example
User question: "What is our refund policy for annual plans?"
Naive approach: Send the question directly to the LLM.
Likely failure: The model produces a plausible refund policy that does not match your actual terms.
Production approach:
- Authenticate the user.
- Retrieve policy chunks from your knowledge base (RAG, Blog 002).
- Assemble a prompt with system rules, retrieved context, and the user question.
- Call the model with temperature appropriate for factual tasks (low).
- Parse structured output if needed.
- Run output guardrails (no legal commitments beyond retrieved text).
- Log prompt hash, retrieval IDs, latency, and token counts.
The LLM generates language. Your system decides what it is allowed to say.
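The production steps above can be sketched end to end with the model call injected as a plain function. Everything here is a stand-in chosen for illustration: the in-memory `KNOWLEDGE_BASE`, the keyword-overlap `retrieve`, the number-matching guardrail, and the `fake_model` stub are all hypothetical, and token counts are omitted because the stub does not report usage.

```python
import hashlib
import re
import time
from typing import Callable

# Hypothetical in-memory knowledge base; a real system would query a retriever.
KNOWLEDGE_BASE = {
    "kb-001": "Annual plans: refunds within 14 days of purchase.",
    "kb-002": "Monthly plans: cancel any time; no partial refunds.",
}
SYSTEM_RULES = "Answer using only the context. If it is insufficient, say you do not know."
FALLBACK = "I cannot confirm that from our policy documents. Routing you to support."


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, k: int = 1) -> list[tuple[str, str]]:
    """Rank documents by keyword overlap (a naive stand-in for real retrieval)."""
    query = _words(question)
    ranked = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda item: len(query & _words(item[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    context = "\n".join(text for _, text in chunks)
    return f"{SYSTEM_RULES}\n\nContext:\n{context}\n\nQuestion:\n{question}"


def passes_guardrail(answer: str, chunks: list[tuple[str, str]]) -> bool:
    """Toy check: every number in the answer must appear in the retrieved text."""
    allowed = set(re.findall(r"\d+", " ".join(text for _, text in chunks)))
    return all(num in allowed for num in re.findall(r"\d+", answer))


def answer_question(user_id: str, question: str, call_model: Callable[[str], str]) -> dict:
    if not user_id:  # stand-in for real authentication
        raise PermissionError("unauthenticated request")
    chunks = retrieve(question)
    prompt = build_prompt(question, chunks)
    started = time.perf_counter()
    text = call_model(prompt)
    latency_ms = (time.perf_counter() - started) * 1000
    if not passes_guardrail(text, chunks):
        text = FALLBACK
    log = {
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:12],
        "retrieval_ids": [doc_id for doc_id, _ in chunks],
        "latency_ms": round(latency_ms, 2),
    }
    return {"text": text, "log": log}


def fake_model(prompt: str) -> str:
    """Stub that plays the role of the LLM call."""
    return "Annual plans can be refunded within 14 days of purchase."


result = answer_question("user-42", "What is our refund policy for annual plans?", fake_model)
print(result["text"])
print(result["log"]["retrieval_ids"])  # ['kb-001']
```

Injecting `call_model` keeps the deterministic parts (auth, retrieval, guardrails, logging) testable without a network call, and a stub that answers "30 days" would be replaced by the fallback message.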
Architecture
The LLM is one box in a larger service. Treat it like you would treat an external API with no SLA on correctness.
Python Example
A minimal illustration: call an OpenAI-compatible chat API and record token usage. This is inference only, with no retrieval.
"""
Minimal LLM inference wrapper with token usage logging.
Requires: pip install openai
Set OPENAI_API_KEY in environment.
"""
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

SYSTEM_PROMPT = (
    "You answer using only the context provided. "
    "If the context is insufficient, say you do not know."
)


def complete(user_message: str, context: str = "") -> dict:
    """Send one question plus context to the model; return text and token usage."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion:\n{user_message}",
        },
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.2,  # low temperature for a factual task
        max_tokens=512,  # cap output length, and therefore cost
    )
    # content can be None (for example on a refusal), so normalize to a string.
    text = response.choices[0].message.content or ""
    usage = response.usage
    return {
        "text": text,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
    }


if __name__ == "__main__":
    result = complete(
        user_message="Summarize the refund window.",
        context="Annual plans: refunds within 14 days of purchase.",
    )
    print(result["text"])
    print(f"Tokens: {result['total_tokens']}")
In production, add timeouts, retries with idempotency keys, structured logging, and budget caps per user.
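Two of those production additions, retries and per-user budget caps, can be sketched without any provider dependency. The class and function names below (`TokenBudget`, `TransientError`, `call_with_retries`) are hypothetical, and `flaky_call` simulates a provider that fails twice before succeeding. Idempotency keys and structured logging are left out to keep the sketch short.

```python
import random
import time
from collections import defaultdict
from typing import Callable


class TransientError(Exception):
    """Stand-in for timeouts, rate limits, and 5xx responses from a provider."""


class BudgetExceeded(Exception):
    """Raised when a user has used up their token allowance."""


class TokenBudget:
    """Tracks token usage per user against a fixed cap (in-memory, illustrative only)."""

    def __init__(self, max_tokens_per_user: int) -> None:
        self.max_tokens = max_tokens_per_user
        self.used: dict[str, int] = defaultdict(int)

    def check(self, user_id: str) -> None:
        if self.used[user_id] >= self.max_tokens:
            raise BudgetExceeded(user_id)

    def record(self, user_id: str, tokens: int) -> None:
        self.used[user_id] += tokens


def call_with_retries(
    call: Callable[[], dict],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise
            jitter = 0.5 + random.random() / 2  # between 0.5x and 1.0x of the delay
            sleep(base_delay * 2 ** (attempt - 1) * jitter)
    raise RuntimeError("unreachable: max_attempts must be at least 1")


# Simulated provider call that fails twice, then succeeds.
attempts = {"n": 0}


def flaky_call() -> dict:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("simulated timeout")
    return {"text": "ok", "total_tokens": 40}


budget = TokenBudget(max_tokens_per_user=100)
budget.check("user-42")
result = call_with_retries(flaky_call, sleep=lambda seconds: None)  # skip real sleeping in the demo
budget.record("user-42", result["total_tokens"])
print(attempts["n"], budget.used["user-42"])  # 3 40
```

In a real service, `call` would wrap `complete(...)` and translate provider exceptions into `TransientError`. The `openai` Python client also documents client-level `timeout` and `max_retries` options, so check what the version you use already handles before layering your own retries on top.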
Real-World Applications
- Customer support assistants with retrieval over help docs
- Code assistants with repo context and sandboxed execution (Blog 004)
- Document Q&A over internal PDFs and wikis
- Classification and extraction with constrained output schemas
- Agent workflows that call tools via standardized protocols (Blog 003)
Performance Considerations
- Latency: Dominated by model size, context length, and output length. Measure time to first token (TTFT) and tokens per second (Blog 015).
- Cost: prompt_tokens + completion_tokens at model-specific rates. Long system prompts are not free.
- Concurrency: GPU memory limits concurrent sequences. Queue or route when saturated.
- Caching: Identical prefix prompts may benefit from prompt caching on some providers (later in handbook).
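The latency and cost bullets can be turned into measurements with a few lines. This sketch assumes a generic iterable of streamed tokens rather than any specific provider's streaming API, and the per-million-token rates are placeholders, not real prices. Many providers price input and output tokens at different rates, which is why the function takes two.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report time to first token and decode throughput."""
    start = clock()
    first_at = last_at = None
    count = 0
    for _ in tokens:
        last_at = clock()
        if first_at is None:
            first_at = last_at
        count += 1
    if first_at is None:
        return {"ttft_s": None, "tokens": 0, "tokens_per_s": 0.0}
    decode_time = last_at - first_at
    # Throughput is measured over the tokens that arrived after the first one.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return {"ttft_s": first_at - start, "tokens": count, "tokens_per_s": rate}


def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
) -> float:
    """Estimate request cost in the currency of the supplied rates."""
    total = prompt_tokens * input_rate_per_million + completion_tokens * output_rate_per_million
    return total / 1_000_000


# Deterministic demo: a fake clock stands in for wall time.
ticks = iter([0.0, 0.5, 1.0, 1.5, 2.0])
stats = measure_stream(["Annual", " plans", ":", " 14 days"], clock=lambda: next(ticks))
print(stats)  # {'ttft_s': 0.5, 'tokens': 4, 'tokens_per_s': 2.0}

# Placeholder rates, not real pricing.
print(estimate_cost(2_000, 500, input_rate_per_million=0.50, output_rate_per_million=1.50))
```

With the placeholder rates, a 2,000-token prompt and a 500-token completion come to 0.00175 per request, and the prompt accounts for more than half of that. That is the sense in which long system prompts are not free.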
Common Mistakes
- Treating the model as source of truth for dynamic business data.
- Omitting observability on prompts, retrieval IDs, and token usage.
- Using high temperature for factual tasks.
- Ignoring tokenizer effects on domain-specific vocabulary.
- Having no fallback path when the model refuses or returns empty output.
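The last mistake is cheap to avoid. Below is a minimal sketch of a post-call check that maps empty, filtered, truncated, or refused outputs to an explicit fallback. The `ModelResult` type, the refusal phrases, and the fallback text are assumptions for illustration; the `finish_reason` values mirror those documented for Chat Completions (`stop`, `length`, `content_filter`), and substring matching is a crude heuristic rather than a reliable refusal detector.

```python
from dataclasses import dataclass
from typing import Optional

FALLBACK = "I could not produce a reliable answer. Routing you to a human agent."
# Hypothetical heuristics; tune or replace with a classifier in a real system.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")


@dataclass
class ModelResult:
    text: Optional[str]
    finish_reason: str


def resolve(result: ModelResult) -> tuple[str, str]:
    """Return (text_to_show, outcome_label) so every failure path is explicit and loggable."""
    text = (result.text or "").strip()
    if not text:
        return FALLBACK, "empty"
    if result.finish_reason == "content_filter":
        return FALLBACK, "filtered"
    if result.finish_reason == "length":
        # Output hit the token cap; a truncated policy answer is treated as unusable here.
        return FALLBACK, "truncated"
    if any(marker in text.lower() for marker in REFUSAL_MARKERS):
        return FALLBACK, "refused"
    return text, "ok"


print(resolve(ModelResult("Refunds within 14 days.", "stop")))  # ('Refunds within 14 days.', 'ok')
print(resolve(ModelResult(None, "stop"))[1])                    # empty
print(resolve(ModelResult("I cannot help with that.", "stop"))[1])  # refused
```

Returning an outcome label alongside the text gives observability a concrete field to count, so refusal and empty-output rates show up on a dashboard instead of in user complaints.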
Interview Questions
Q1: What is an LLM in one sentence?
A: A neural network trained to predict the next token in a sequence, used at inference time to generate text autoregressively.
Q2: How is an LLM different from a search engine?
A: Search retrieves existing documents by matching queries. An LLM generates new text from learned patterns without guaranteed grounding unless you add retrieval.
Q3: Why do LLMs hallucinate?
A: They optimize for plausible continuations, not verified truth. Without external grounding or constraints, they may invent facts.
Q4: What belongs in the system around the model?
A: Retrieval, auth, rate limits, guardrails, parsers, evals, logging, and escalation paths.
Q5: What drives LLM API cost?
A: Total tokens processed (input + output), model tier, and optional features like tool calls or vision.
Q6: When should you not use an LLM?
A: When deterministic rules suffice, when strict correctness is required without verification, or when latency and cost cannot tolerate probabilistic generation.
Summary
An LLM is a probabilistic token predictor, not an oracle. Backend engineers succeed with LLMs when they design systems: context assembly, retrieval, policy enforcement, and observability. The model generates language. Your architecture decides whether that language is safe, grounded, and useful.
Further Reading
- Outcome School: AI Engineering Explained (LLM, RAG, MCP overview)
- Jay Alammar: The Illustrated Transformer
- OpenAI API documentation: Chat Completions, token usage
