How Large Language Models (LLMs) Generate Text
These notes summarize how Large Language Models (LLMs) generate text, based on what I learned from the DeepLearning.AI RAG course and some further exploration.
This is a mental model, not marketing.
High-Level Overview
A Large Language Model (LLM) is fundamentally a next-token prediction system.
Given a sequence of tokens as input, the model:
- Predicts the most probable next token
- Appends it to the sequence
- Repeats the process until a stop condition is reached (e.g. an end-of-sequence token or a length limit)
That’s it.
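To make the loop concrete, here is a minimal sketch of greedy next-token generation, using Hugging Face's GPT-2 purely as a stand-in model (any causal LLM follows the same pattern; the 10-token limit and greedy selection are just choices for this example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The sun is", return_tensors="pt").input_ids

for _ in range(10):                                      # generate at most 10 new tokens
    with torch.no_grad():
        logits = model(ids).logits                       # scores for every token in the vocabulary
    next_id = logits[0, -1].argmax()                     # greedy: pick the most probable next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)    # append it and repeat
    if next_id.item() == tokenizer.eos_token_id:         # stop at end-of-sequence
        break

print(tokenizer.decode(ids[0]))
```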
What LLMs Do Not Do
LLMs do not:
- Look up words in a dictionary at runtime
- Search the internet by default
- Reason like humans
Instead, they rely entirely on statistical patterns learned during training.
Two Core Components of an LLM
1️⃣ Training Data
LLMs are trained on massive text datasets:
- Books
- Articles
- Websites
- Code repositories
- Documentation
During training:
- The model learns statistical relationships between tokens
- It does not memorize exact sentences
- It learns generalizable language patterns
Example of a learned pattern:
After “the sun is”, tokens like shining, bright, or hot are statistically likely.
These patterns are encoded into the model’s parameters (weights).
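As a purely illustrative picture (the numbers below are made up, not taken from any real model), the learned distribution after "the sun is" might look like this:

```python
# Made-up numbers for illustration only.
next_token_probs = {
    " shining": 0.21,
    " bright":  0.14,
    " hot":     0.09,
    " a":       0.07,
    # ...tens of thousands of other tokens share the remaining probability mass
}
```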
2️⃣ Tokenizer and Vocabulary
Before training begins, every LLM is paired with a tokenizer.
The tokenizer:
- Splits text into tokens (sub-word units)
- Converts tokens into numeric IDs
- Defines a fixed vocabulary (e.g. 20k–100k tokens)
Important properties:
- The vocabulary is fixed at training time
- The model can only generate tokens from this vocabulary
- Different models use different tokenizers
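You can see all three properties in a few lines of code. Here I use GPT-2's tokenizer via Hugging Face purely as an example, since every model ships its own:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

print(tok.vocab_size)                   # fixed vocabulary size (50257 for GPT-2)
ids = tok.encode("the sun is shining")  # text -> numeric token IDs
print(ids)                              # a short list of integers
print(tok.decode(ids))                  # IDs -> text again
```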
Tokens Are Not Words
A token:
- Might be a full word
- Might be part of a word
- Might include spaces or punctuation
Example: "unbelievable" may be split into ["un", "believ", "able"].
This is why:
- Token counts ≠ word counts
- Prompt length matters
- Context limits exist
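A quick way to check this yourself (again with GPT-2's tokenizer as an example; the exact split varies by model):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

text = "unbelievable tokenization quirks"
print(len(text.split()))        # 3 words
print(tok.tokenize(text))       # the sub-word pieces the model actually sees
print(len(tok.tokenize(text)))  # usually more than 3 tokens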
How a Single Token Is Generated
At each step:
- The model takes the current token sequence
- Produces a probability distribution over every token in its vocabulary
- Selects one token (based on the decoding strategy, e.g. greedy or sampling)
- Appends it to the sequence
This repeats token by token.
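Here is a sketch of that single step in isolation, assuming `logits` is the score vector for the last position (as in the generation loop above); temperature sampling is just one common decoding strategy, shown for illustration:

```python
import torch

def pick_next_token(logits: torch.Tensor, temperature: float = 0.8) -> int:
    """One decoding step: scores -> probabilities -> one chosen token ID."""
    probs = torch.softmax(logits / temperature, dim=-1)    # distribution over the vocabulary
    return torch.multinomial(probs, num_samples=1).item()  # sample a single token ID

# Greedy decoding would simply be: logits.argmax().item()
```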
Why Output Feels Like “Reasoning”
LLMs appear to reason because:
- Language itself encodes reasoning patterns
- The model has seen millions of examples of explanations
- It predicts tokens that look like reasoning
But internally:
It’s still just predicting the next token.
Mental Model (Remember This)
LLMs generate text one token at a time based on probability, not understanding.
If you remember this, most confusion around LLM behavior disappears.
Why This Matters (Especially for RAG)
In RAG systems:
- The LLM does not know facts
- It only knows patterns
- Retrieved context steers token prediction
Good retrieval = better next-token probabilities.
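Concretely, "steers" just means the retrieved text becomes part of the token sequence the model conditions on. A minimal sketch, where `retrieve` is a hypothetical stand-in for whatever vector search you use:

```python
def build_rag_prompt(question: str, retrieve) -> str:
    chunks = retrieve(question, k=3)      # hypothetical: top-3 passages from a vector store
    context = "\n\n".join(chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```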
TL;DR
- LLMs are next-token predictors
- They don’t think or search by default
- Tokenizers define what models can generate
- Everything happens one token at a time
Understanding this mental model makes prompt engineering, RAG design, and debugging much easier.