DEV Community

Ameer Hamza


LLMs Explained for Backend Engineers

Introduction

If you have built APIs, databases, and distributed systems, you already have the mindset needed for AI engineering. The missing piece is a clear mental model of what a Large Language Model (LLM) actually is.

An LLM is not a search engine with better grammar. It is not a database of facts. It is a probabilistic token prediction engine trained on massive text corpora. You give it a sequence of tokens. It predicts the next token. Repeat that thousands of times and you get text that reads like an answer.

That single distinction explains both the power and the unreliability of modern AI applications.

Why This Matters

Backend engineers are comfortable with unreliable dependencies. Caches go stale. Third-party APIs return 500s. Queues back up. We design retries, circuit breakers, fallbacks, and observability around those failures.

LLMs are another unreliable dependency, but with a twist: they fail confidently. A wrong answer often looks as polished as a right one. There is no HTTP status code that says "this paragraph is hallucinated."

Production AI engineering is therefore the discipline of wrapping a non-deterministic core inside a deterministic system: retrieval, guardrails, parsers, eval gates, and human escalation paths.

Prerequisites

This is the first article in the AI Engineering Handbook. You should be comfortable with:

  • Basic Python
  • HTTP APIs and request/response flows
  • The idea of latency, throughput, and error rates in production services

No prior machine learning coursework is required.

The Problem

Teams often treat the LLM as the entire product:

User -> LLM -> Answer

That mental model breaks in production:

  1. No grounding: Your domain's facts are not in the training data, so the model fills the gap with plausible inventions.
  2. No memory contract: Stateless APIs do not remember prior sessions unless you build memory.
  3. No permission boundary: The model will attempt any completion the prompt allows.
  4. Unbounded cost: Token usage scales with input and output length.
  5. Variable latency: Time to first token and total generation time depend on model size and load.

The model is the easy part. The system around it is where engineering begins.
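To make the "no memory contract" point concrete, here is a minimal sketch of session memory that the application owns. The in-process store, the `MAX_TURNS` cap, and the function names are assumptions for illustration; a real service would more likely keep history in Redis or a database and trim by token count rather than message count.

```python
from collections import defaultdict

# Hypothetical in-process store: session ID -> list of prior chat messages.
_SESSIONS: dict[str, list[dict]] = defaultdict(list)

MAX_TURNS = 6  # assumed cap on remembered messages, chosen for illustration


def remember(session_id: str, role: str, content: str) -> None:
    """Append one message to the session history and drop the oldest ones."""
    history = _SESSIONS[session_id]
    history.append({"role": role, "content": content})
    # Keep only the most recent MAX_TURNS messages so the prompt stays bounded.
    del history[:-MAX_TURNS]


def build_messages(session_id: str, system_prompt: str, user_message: str) -> list[dict]:
    """Assemble the full message list that would be sent on this request."""
    return (
        [{"role": "system", "content": system_prompt}]
        + list(_SESSIONS[session_id])
        + [{"role": "user", "content": user_message}]
    )
```

The API call itself stays stateless. Whatever "memory" the user experiences is the history your service chooses to resend, which is also why it shows up in the token bill.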

Understanding the Core Concept

Tokens, not words

LLMs operate on tokens, subword units produced by a tokenizer. The word "unhappiness" might be one token or three depending on the tokenizer. This matters for:

  • Billing: API pricing is per token.
  • Context limits: Windows are measured in tokens, not characters.
  • Domain failures: Rare product names may split into many tokens or, with some older tokenizers, map to an unknown token, hurting quality and cost.
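The effect of the tokenizer can be seen with a toy example. This is a greedy longest-match splitter over two made-up vocabularies, not the byte-pair encoding that production tokenizers typically use; `VOCAB_A`, `VOCAB_B`, and the product name `Zyqlor` are invented for illustration.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match subword split; falls back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        # A single character is always accepted so the loop makes progress.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


# Two made-up vocabularies standing in for two different tokenizers.
VOCAB_A = {"unhappiness"}
VOCAB_B = {"un", "happi", "ness"}

print(tokenize("unhappiness", VOCAB_A))  # ['unhappiness']
print(tokenize("unhappiness", VOCAB_B))  # ['un', 'happi', 'ness']
print(len(tokenize("Zyqlor", VOCAB_B)))  # 6: a rare name falls apart into characters
```

The same string costs one token under one vocabulary and three under the other, and the rare name costs six. To count tokens for a real model, use the tokenizer that model's provider publishes.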

Next-token prediction

At each step, the model outputs a probability distribution over the vocabulary. A decoding strategy (greedy, temperature sampling, top-p) selects the next token. The process repeats autoregressively until a stop condition.

There is no separate "fact checking" step. There is no guaranteed retrieval from a knowledge base unless you add retrieval (covered in Blog 002).
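The decoding strategies named above can be sketched in plain Python. The four-token vocabulary and the logits are made up; a real model emits one logit per vocabulary entry, often more than a hundred thousand of them.

```python
import math
import random


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Turn raw logits into probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]  # temperature must be > 0
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs: list[float]) -> int:
    """Always pick the single most likely token."""
    return max(range(len(probs)), key=probs.__getitem__)


def top_p_sample(probs: list[float], p: float, rng: random.Random) -> int:
    """Sample from the smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for idx in ranked:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


VOCAB = ["14", "30", "60", "no"]  # made-up four-token vocabulary
LOGITS = [3.0, 1.5, 0.5, -1.0]    # made-up model output for one step

print(VOCAB[greedy(softmax(LOGITS, temperature=0.2))])
print(VOCAB[top_p_sample(softmax(LOGITS), p=0.9, rng=random.Random(0))])
```

Nothing in this loop knows which of "14" or "30" is true for your business. The sampler only sees probabilities, which is why grounding has to come from outside the model.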

Training vs inference

| Phase | What happens | Engineering concern |
|---|---|---|
| Pretraining | Learn language patterns from huge corpora | Model choice, license, capability ceiling |
| Fine-tuning / alignment | Adapt behavior to instructions or domain | Data quality, forgetting, eval |
| Inference | Generate tokens for your prompt | Latency, cost, guardrails |

As a backend engineer building applications, you mostly live in inference. You choose models, assemble context, and enforce policies around the call.

How It Works Internally (High Level)

  1. Tokenization: Raw text becomes integer token IDs.
  2. Embedding: Token IDs map to dense vectors.
  3. Transformer layers: Self-attention lets each token attend to others; feed-forward layers transform representations.
  4. Output head: Final layer projects to vocabulary logits.
  5. Sampling: Decoding strategy picks the next token.
  6. Repeat: Append token, update KV cache (Blog 013), continue until stop.

You do not need to implement a transformer to ship a product. You do need to know that latency and memory grow with context length and output length.
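The generation loop itself is small. In this sketch a made-up lookup table stands in for steps 2 to 4 (embedding, transformer layers, output head), and it conditions only on the last token. A real model conditions on the whole sequence, which is where the context-length cost comes from.

```python
# Made-up table standing in for the transformer: for each token ID,
# the scores ("logits") of every possible next token ID.
VOCAB = ["<stop>", "refunds", "within", "14", "days"]
NEXT_LOGITS = {
    1: [0.0, 0.1, 2.0, 0.3, 0.2],  # after "refunds" the top score is "within"
    2: [0.0, 0.1, 0.2, 2.0, 0.3],  # after "within" the top score is "14"
    3: [0.1, 0.0, 0.2, 0.3, 2.0],  # after "14" the top score is "days"
    4: [2.0, 0.1, 0.2, 0.3, 0.0],  # after "days" the top score is "<stop>"
}


def encode(text: str) -> list[int]:
    """Step 1: text to token IDs (whitespace split, for illustration only)."""
    return [VOCAB.index(word) for word in text.split()]


def generate(prompt: str, max_new_tokens: int = 8) -> str:
    """Run the autoregressive loop until a stop token or the length cap."""
    ids = encode(prompt)
    for _ in range(max_new_tokens):
        logits = NEXT_LOGITS[ids[-1]]        # steps 2-4 collapsed into a lookup
        next_id = logits.index(max(logits))  # step 5: greedy decoding
        if next_id == 0:                     # stop condition
            break
        ids.append(next_id)                  # step 6: append and repeat
    return " ".join(VOCAB[i] for i in ids)


print(generate("refunds"))  # refunds within 14 days
```

Each output token requires one more pass through the loop, so output length drives latency roughly linearly. `max_new_tokens` plays the role that `max_tokens` plays on a hosted API.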

Step-by-Step Example

User question: "What is our refund policy for annual plans?"

Naive approach: Send the question directly to the LLM.

Likely failure: The model produces a plausible refund policy that does not match your actual terms.

Production approach:

  1. Authenticate the user.
  2. Retrieve policy chunks from your knowledge base (RAG, Blog 002).
  3. Assemble a prompt with system rules, retrieved context, and the user question.
  4. Call the model with temperature appropriate for factual tasks (low).
  5. Parse structured output if needed.
  6. Run output guardrails (no legal commitments beyond retrieved text).
  7. Log prompt hash, retrieval IDs, latency, and token counts.

The LLM generates language. Your system decides what it is allowed to say.
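The production steps above can be sketched end to end with the model stubbed out. Everything here is an assumption for illustration: the two-entry `KNOWLEDGE_BASE`, the word-overlap `retrieve` function standing in for real retrieval, and a guardrail that only checks whether numbers in the answer appear in the retrieved text. Step 5 (parsing) is skipped because the answer is free text.

```python
import hashlib
import re
import time
from typing import Callable

# Hypothetical knowledge base: chunk ID -> policy text.
KNOWLEDGE_BASE = {
    "policy-annual-01": "Annual plans: refunds within 14 days of purchase.",
    "policy-monthly-01": "Monthly plans: no refunds after the billing date.",
}


def retrieve(question: str, k: int = 1) -> list[tuple[str, str]]:
    """Rank chunks by word overlap; a stand-in for a real retriever."""
    words = set(re.findall(r"\w+", question.lower()))
    ranked = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda item: len(words & set(re.findall(r"\w+", item[1].lower()))),
        reverse=True,
    )
    return ranked[:k]


def answer_question(user_id: str, question: str,
                    llm: Callable[[str], str], authorized_users: set[str]) -> dict:
    if user_id not in authorized_users:                    # 1. authenticate
        raise PermissionError("unknown user")
    chunks = retrieve(question)                            # 2. retrieve
    context = "\n".join(text for _, text in chunks)
    prompt = (                                             # 3. assemble prompt
        "Answer only from the context.\n"
        f"Context:\n{context}\n\nQuestion:\n{question}"
    )
    started = time.perf_counter()
    text = llm(prompt)                                     # 4. call the model
    latency_ms = (time.perf_counter() - started) * 1000
    # 6. guardrail: every number in the answer must appear in the context.
    allowed_numbers = set(re.findall(r"\d+", context))
    grounded = all(n in allowed_numbers for n in re.findall(r"\d+", text))
    if not grounded:
        text = "I can't confirm that from our policy documents."
    log = {                                                # 7. log
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest()[:12],
        "retrieval_ids": [chunk_id for chunk_id, _ in chunks],
        "latency_ms": round(latency_ms, 2),
    }
    return {"text": text, "grounded": grounded, "log": log}


def grounded_llm(prompt: str) -> str:
    """Stub model that repeats the retrieved policy."""
    return "Refunds are available within 14 days of purchase."


def inventive_llm(prompt: str) -> str:
    """Stub model that invents a number not present in the context."""
    return "Refunds are available within 30 days of purchase."


question = "What is our refund policy for annual plans?"
print(answer_question("u-1", question, grounded_llm, {"u-1"})["text"])
print(answer_question("u-1", question, inventive_llm, {"u-1"})["text"])
```

The number check is deliberately crude. The point is the shape: the model's text passes through code you control before it reaches the user, and the log carries enough to reproduce the request later.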

Architecture

[Diagram: the LLM as one component inside a larger service architecture]

The LLM is one box in a larger service. Treat it like you would treat an external API with no SLA on correctness.

Python Example

Minimal illustration: call an OpenAI-compatible chat API and measure tokens. This is inference-only, no retrieval.

"""
Minimal LLM inference wrapper with token usage logging.
Requires: pip install openai
Set OPENAI_API_KEY in environment.
"""
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

SYSTEM_PROMPT = (
    "You answer using only the context provided. "
    "If the context is insufficient, say you do not know."
)

def complete(user_message: str, context: str = "") -> dict:
    """Send one question plus context to the model; return text and token usage."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion:\n{user_message}",
        },
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.2,  # low temperature for factual answers
        max_tokens=512,   # caps output length, and therefore output cost
    )
    # content can be None (for example on a refusal), so normalize to a string.
    text = response.choices[0].message.content or ""
    # usage is typed as optional in the client, so guard before reading it.
    usage = response.usage
    return {
        "text": text,
        "prompt_tokens": usage.prompt_tokens if usage else 0,
        "completion_tokens": usage.completion_tokens if usage else 0,
        "total_tokens": usage.total_tokens if usage else 0,
    }


if __name__ == "__main__":
    result = complete(
        user_message="Summarize the refund window.",
        context="Annual plans: refunds within 14 days of purchase.",
    )
    print(result["text"])
    print(f"Tokens: {result['total_tokens']}")

In production, add: timeouts, retries with idempotency keys, structured logging, and budget caps per user.
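Two of those additions, retries and per-user budget caps, can be sketched without touching the provider SDK. The names `call_with_retries`, `TokenBudget`, and `BudgetExceeded` are hypothetical, and the choice of which exceptions count as transient is an assumption you would adapt to your client library.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BudgetExceeded(Exception):
    """Raised when a user has spent their token allowance."""


def call_with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry transient failures with exponential backoff plus a little jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    raise ValueError("attempts must be at least 1")


class TokenBudget:
    """Tracks tokens spent per user against a fixed cap (in memory, for illustration)."""

    def __init__(self, limit_per_user: int):
        self.limit = limit_per_user
        self.used: dict[str, int] = {}

    def charge(self, user_id: str, tokens: int) -> None:
        spent = self.used.get(user_id, 0) + tokens
        if spent > self.limit:
            raise BudgetExceeded(user_id)
        self.used[user_id] = spent
```

A call site might look like `call_with_retries(lambda: complete(question, context))`, followed by `budget.charge(user_id, result["total_tokens"])`. The official `openai` Python client also accepts `timeout` and `max_retries` arguments on the client constructor, which may be enough for simple cases.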

Real-World Applications

  • Customer support assistants with retrieval over help docs
  • Code assistants with repo context and sandboxed execution (Blog 004)
  • Document Q&A over internal PDFs and wikis
  • Classification and extraction with constrained output schemas
  • Agent workflows that call tools via standardized protocols (Blog 003)
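For the classification and extraction case, "constrained output schema" means your code, not the model, has the final say on what counts as a valid reply. A minimal sketch, assuming a hypothetical ticket classifier that is prompted to return a JSON object with `label` and `confidence` fields:

```python
import json

ALLOWED_LABELS = {"billing", "technical", "account"}  # hypothetical label set


def parse_classification(raw: str) -> dict:
    """Validate a model reply expected to look like {"label": ..., "confidence": ...}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    label = data.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError(f"confidence is not a number: {confidence!r}")
    if not 0 <= confidence <= 1:
        raise ValueError(f"confidence out of range: {confidence!r}")

    return {"label": label, "confidence": float(confidence)}


print(parse_classification('{"label": "billing", "confidence": 0.92}'))
```

A `ValueError` here is an ordinary, expected outcome. Callers would typically retry once with a corrective prompt or route the item to a human queue.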

Performance Considerations

  • Latency: Dominated by model size, context length, and output length. Measure TTFT and tokens per second (Blog 015).
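  • Cost: prompt_tokens and completion_tokens are each billed at model-specific rates, and output tokens are often priced higher than input tokens. Long system prompts are not free.
  • Concurrency: On self-hosted models, GPU memory limits concurrent sequences; on hosted APIs, provider rate limits play the same role. Queue or route when saturated.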
  • Caching: Prompts that share an identical prefix may benefit from prompt caching on some providers (covered later in the handbook).
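The cost bullet reduces to simple arithmetic once input and output rates are kept separate. The rates below are placeholders, not any provider's real prices; substitute the figures from your provider's pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token rates in USD. The values used below are placeholders."""
    input_per_million: float
    output_per_million: float


def estimate_cost(prompt_tokens: int, completion_tokens: int, pricing: Pricing) -> float:
    """Return the estimated USD cost of one request."""
    return (
        prompt_tokens * pricing.input_per_million
        + completion_tokens * pricing.output_per_million
    ) / 1_000_000


EXAMPLE = Pricing(input_per_million=1.00, output_per_million=4.00)  # not real prices

per_request = estimate_cost(prompt_tokens=1_200, completion_tokens=300, pricing=EXAMPLE)
print(f"per request: ${per_request:.6f}")
print(f"per 1M requests: ${per_request * 1_000_000:,.2f}")
```

At these placeholder rates, a 1,200-token prompt costs as much as the 300-token answer. This is the sense in which a long system prompt, resent on every call, is not free.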

Common Mistakes

  1. Treating the model as source of truth for dynamic business data.
  2. Omitting observability on prompts, retrieval IDs, and token usage.
  3. Using high temperature for factual tasks.
  4. Ignoring tokenizer effects on domain-specific vocabulary.
  5. Having no fallback for when the model refuses or returns empty output.
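The last mistake is cheap to avoid. Below is a sketch of a post-processing step that turns empty, filtered, or truncated output into an explicit status. The `finish_reason` values `stop`, `length`, and `content_filter` are those reported by the Chat Completions API; the `FALLBACK` text, the status names, and the escalation flag are assumptions.

```python
from typing import Optional

FALLBACK = "Sorry, I could not produce an answer. A teammate will follow up."


def finalize_reply(text: Optional[str], finish_reason: str) -> dict:
    """Map raw model output to a reply the rest of the system can rely on."""
    if finish_reason == "content_filter":
        return {"text": FALLBACK, "status": "filtered", "escalate": True}
    if not text or not text.strip():
        return {"text": FALLBACK, "status": "empty", "escalate": True}
    if finish_reason == "length":
        # The output hit the token cap; usable, but flag it for monitoring.
        return {"text": text.strip(), "status": "truncated", "escalate": False}
    return {"text": text.strip(), "status": "ok", "escalate": False}


print(finalize_reply(None, "stop")["status"])                   # empty
print(finalize_reply(" Refunds within 14 days. ", "stop")["text"])  # Refunds within 14 days.
```

Emitting the status as a metric gives you something close to the missing "HTTP status code" for model output, at least for the failures that are detectable.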

Interview Questions

Q1: What is an LLM in one sentence?

A: A neural network trained to predict the next token in a sequence, used at inference time to generate text autoregressively.

Q2: How is an LLM different from a search engine?

A: Search retrieves existing documents by matching queries. An LLM generates new text from learned patterns without guaranteed grounding unless you add retrieval.

Q3: Why do LLMs hallucinate?

A: They optimize for plausible continuations, not verified truth. Without external grounding or constraints, they may invent facts.

Q4: What belongs in the system around the model?

A: Retrieval, auth, rate limits, guardrails, parsers, evals, logging, and escalation paths.

Q5: What drives LLM API cost?

A: Total tokens processed (input + output), model tier, and optional features like tool calls or vision.

Q6: When should you not use an LLM?

A: When deterministic rules suffice, when strict correctness is required without verification, or when latency and cost cannot tolerate probabilistic generation.

Summary

An LLM is a probabilistic token predictor, not an oracle. Backend engineers succeed with LLMs when they design systems: context assembly, retrieval, policy enforcement, and observability. The model generates language. Your architecture decides whether that language is safe, grounded, and useful.

Further Reading

  • Outcome School: AI Engineering Explained (LLM, RAG, MCP overview)
  • Jay Alammar: The Illustrated Transformer
  • OpenAI API documentation: Chat Completions, token usage
