Micheal Angelo
How Large Language Models (LLMs) Actually Generate Text


These notes summarize how Large Language Models (LLMs) generate text, based on what I learned in the DeepLearning.AI RAG course and some further exploration.

This is a mental model, not marketing.


High-Level Overview

A Large Language Model (LLM) is fundamentally a next-token prediction system.

Given a sequence of tokens as input, the model:

  1. Predicts the most probable next token
  2. Appends it to the sequence
  3. Repeats the process until the response is complete

That’s it.
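
To make the loop concrete, here is a minimal Python sketch. The model and tokenizer objects and their methods are hypothetical placeholders (not any real library's API); the point is only the shape of the process.

def generate(model, tokenizer, prompt, max_new_tokens=50):
    # Hypothetical interfaces: encode/decode map text <-> token IDs,
    # next_token_probs returns {token_id: probability} for the next position.
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        probs = model.next_token_probs(tokens)    # distribution over the whole vocabulary
        next_token = max(probs, key=probs.get)    # greedy pick; sampling is also common
        tokens.append(next_token)                 # the new token becomes part of the input
        if next_token == tokenizer.eos_token_id:  # model emits an end-of-sequence token
            break
    return tokenizer.decode(tokens)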


What LLMs Do Not Do

LLMs do not:

  • Look up words in a dictionary at runtime
  • Search the internet by default
  • Reason like humans

Instead, they rely entirely on statistical patterns learned during training.


Two Core Components of an LLM

1️⃣ Training Data

LLMs are trained on massive text datasets:

  • Books
  • Articles
  • Websites
  • Code repositories
  • Documentation

During training:

  • The model learns statistical relationships between tokens
  • It does not simply memorize exact sentences
  • It learns generalizable language patterns

Example of a learned pattern:

After “the sun is”, tokens like shining, bright, or hot are statistically likely.

These patterns are encoded into the model’s parameters (weights).
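
To make that concrete, here is an invented next-token distribution for the example above. The numbers are made up purely for illustration, not taken from any real model.

# Hypothetical probabilities for the token after "the sun is"
# (values invented for illustration; a real vocabulary has tens of thousands of entries)
next_token_probs = {
    " shining": 0.21,
    " bright": 0.14,
    " hot": 0.09,
    " setting": 0.06,
    # ... every other token in the vocabulary shares the remaining probability
}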


2️⃣ Tokenizer and Vocabulary

Before training begins, every LLM is paired with a tokenizer.

The tokenizer:

  • Splits text into tokens (sub-word units)
  • Converts tokens into numeric IDs
  • Defines a fixed vocabulary (e.g. 20k–100k tokens)

Important properties:

  • The vocabulary is fixed at training time
  • The model can only generate tokens from this vocabulary
  • Different models use different tokenizers

Tokens Are Not Words

A token:

  • Might be a full word
  • Might be part of a word
  • Might include spaces or punctuation

Example:

"unbelievable"

May be split into:

["un", "believ", "able"]

This is why:

  • Token counts ≠ word counts
  • Prompt length matters
  • Context limits exist
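
You can see this yourself with a real tokenizer. The sketch below uses the tiktoken library and its cl100k_base encoding (used by several OpenAI models); the exact splits and IDs depend on the tokenizer, so they may differ from the example above.

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one specific, fixed vocabulary
print(enc.n_vocab)                           # size of that vocabulary

ids = enc.encode("unbelievable")             # token IDs (integers), not words
pieces = [enc.decode([i]) for i in ids]      # the text each ID maps back to

print(ids)      # a short list of integers
print(pieces)   # sub-word pieces; the exact split depends on the tokenizer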

How a Single Token Is Generated

At each step:

  1. The model takes the current token sequence
  2. Produces a probability distribution over every token in its vocabulary
  3. Selects one token (based on decoding strategy)
  4. Appends it to the sequence

This repeats token by token.
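
Step 3 ("selects one token") is where decoding strategies come in. Below is a toy comparison of greedy decoding and random sampling over an invented four-token distribution; real models do the same thing over their full vocabulary.

import random

# Invented distribution over a tiny four-token "vocabulary"
probs = {" shining": 0.45, " bright": 0.30, " hot": 0.20, " purple": 0.05}

# Greedy decoding: always take the most probable token (deterministic)
greedy = max(probs, key=probs.get)

# Sampling: draw a token in proportion to its probability (varies run to run)
tokens, weights = zip(*probs.items())
sampled = random.choices(tokens, weights=weights, k=1)[0]

print("greedy:", greedy)
print("sampled:", sampled)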


Why Output Feels Like “Reasoning”

LLMs appear to reason because:

  • Language itself encodes reasoning patterns
  • The model has seen millions of examples of explanations
  • It predicts tokens that look like reasoning

But internally:

It’s still just predicting the next token.


Mental Model (Remember This)

LLMs generate text one token at a time based on probability, not understanding.

If you remember this, most confusion around LLM behavior disappears.


Why This Matters (Especially for RAG)

In RAG systems:

  • The LLM does not look up facts at generation time
  • It only knows statistical patterns
  • Retrieved context steers next-token prediction

Good retrieval = better next-token probabilities.
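
In practice, "steering" mostly means putting the retrieved text into the prompt, so every next-token prediction is conditioned on it. Here is a minimal sketch, where retrieve and llm are hypothetical placeholders rather than a specific library:

# Hypothetical sketch of retrieval-augmented generation.
# `retrieve` and `llm` are placeholders, not a real API.
def answer_with_rag(question, retrieve, llm):
    passages = retrieve(question, top_k=3)   # e.g. vector search over your documents
    context = "\n\n".join(passages)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return llm(prompt)   # same next-token loop, now conditioned on the retrieved text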


TL;DR

  • LLMs are next-token predictors
  • They don’t think or search by default
  • Tokenizers define what models can generate
  • Everything happens one token at a time

Understanding this mental model makes prompt engineering, RAG design, and debugging much easier.
