You type a question into ChatGPT. Half a second later, a surprisingly relevant, well-formed answer starts streaming back.
It feels like magic. It isn't. It's math — a lot of math, wrapped in a system that's been engineered to feel conversational.
In this post, we'll pull back the curtain and walk through exactly what happens between the moment you hit "Enter" and the moment a response appears — from raw text, to numbers, to tokens, to Transformers.
You: "What's the capital of France?"
│
▼
[ ...this article... ]
│
▼
ChatGPT: "The capital of France is Paris."
Let's open the black box.
1. What is an LLM?
LLM stands for Large Language Model.
Strip away the buzzwords, and an LLM is a program trained on enormous amounts of text (books, articles, code, websites) to do one core thing really well: predict the next word in a sequence, given everything that came before it.
That sounds almost too simple to power something like ChatGPT — but predicting "the next most likely word" over and over, billions of times during training, is enough for a model to learn grammar, facts, reasoning patterns, coding syntax, and conversational tone.
What problems do LLMs solve?
Before LLMs, computers were great at structured tasks (math, sorting, database lookups) but terrible at anything involving human language — understanding context, tone, ambiguity, or intent. LLMs closed that gap. They let computers:
- Understand loosely-worded, ambiguous human input
- Generate fluent, context-aware text
- Summarize, translate, and explain complex information
-
Hold a conversation instead of just returning search results
Popular examples of LLMs
GPT-4 / GPT-4o / GPT-5 (OpenAI) — powers ChatGPT
Claude (Anthropic) — the model answering this very prompt
Gemini (Google DeepMind)
LLaMA (Meta) — open-weight models
-
Mistral, DeepSeek, and other open-source families
Common applications in daily life
Chatbots and customer support assistants
Code autocomplete (GitHub Copilot, Claude Code)
Writing assistants and grammar tools
Search engines with AI-generated summaries
Voice assistants that sound less robotic
Translation apps
┌─────────────────────────────┐
│ LLM USES │
├─────────────────────────────┤
│ Chatbots │ Coding │
│ Translation │ Summaries │
│ Search │ Writing │
└─────────────────────────────┘
2. What Happens When You Send a Message to ChatGPT?
Let's trace the journey of a single message.
Step 1 — Typing a prompt
You type something like:
"Explain how tokenization works, like I'm 12."
This is your prompt — plain human text, full of the messiness humans use: casual phrasing, typos, incomplete sentences.
Step 2 — Processing your message
Before the model can "think" about anything, your text has to be transformed into a format it can actually work with — a sequence of numbers. (We'll go deep on this in Sections 3 and 4.)
The system also adds invisible context around your message: system instructions, earlier turns in the conversation, and formatting markers — all packaged together and fed into the model at once.
Step 3 — Generating a response
The model doesn't write the whole answer in one shot. It predicts one token at a time — a token being a small chunk of text, often a word or part of a word — feeding each new token back in as input for predicting the next one.
Prompt tokens → [ MODEL ] → "The"
Prompt + "The" → [ MODEL ] → "capital"
Prompt + ... → [ MODEL ] → "of"
Prompt + ... → [ MODEL ] → "France"
Prompt + ... → [ MODEL ] → "is"
Prompt + ... → [ MODEL ] → "Paris."
This is why responses "stream in" word by word — that's not a UI trick, that's literally the order in which the model generates them.
Step 4 — Why responses aren't copied from the internet
A common misconception: people assume ChatGPT is "searching the web" or "quoting a database" for its answers. It isn't (unless a browsing tool is explicitly turned on).
Instead, during training, the model adjusted billions of internal parameters based on patterns across huge amounts of text. By the time you're chatting with it, none of the original training text is being looked up — the model is generating new text based on learned patterns, one probable next-token at a time. That's also why it can occasionally get things confidently wrong: it's producing what's statistically likely, not retrieving a verified fact from a source.
3. Why Computers Don't Understand Human Language
Here's the uncomfortable truth: computers don't understand words at all.
Text vs numbers
At the hardware level, a computer only understands one thing: numbers, encoded as electrical signals — on or off, 1 or 0. Every piece of text, image, or sound you've ever seen on a screen is, underneath, a pile of numbers.
"Hi" → H = 72, i = 105 (in plain ASCII)
That's fine for storing and displaying text. But it's nowhere near enough for a computer to grasp meaning, grammar, or intent.
Why computers need everything converted into numbers
Machine learning models are, at their core, mathematical functions — matrix multiplications, additions, and non-linear transformations. You can't multiply a matrix by the word "dog." You can multiply it by a list of numbers that represents "dog."
So before any language model can process your sentence, that sentence has to be converted into numerical form the math can operate on.
Introduction to tokens
This is where tokens come in — the bridge between human-readable text and machine-readable numbers. Instead of chopping text into individual characters (too granular) or whole words (too limiting for a language with typos, slang, and multiple languages), models break text into tokens: small, reusable chunks of text.
That's the perfect segue into the next section.
4. Tokenization
What tokens are
A token is a chunk of text — sometimes a whole word, sometimes part of a word, sometimes a single character or punctuation mark. Tokenization is the process of breaking your input into these chunks before feeding it to the model.
Why tokenization is needed
Tokenization solves a few problems at once:
- It gives the model a manageable, fixed-size vocabulary (tens of thousands of tokens) instead of infinite possible words.
- It lets the model handle words it's never seen before, by breaking them into familiar sub-pieces.
- It keeps common words efficient (one token) while rare or complex words get split into multiple tokens. ### Words vs tokens
A common beginner assumption is "1 word = 1 token." That's often wrong.
| Text | Token Breakdown | Token Count |
|---|---|---|
| "cat" | cat |
1 |
| "cats" |
cat, s
|
2 |
| "tokenization" |
token, ization
|
2 |
| "unbelievable" |
un, believ, able
|
3 |
| "ChatGPT" |
Chat, G, PT
|
3 |
As a rough rule of thumb, in English, 1 token ≈ 4 characters, or about ¾ of a word.
Simple example — full sentence
Text: "Tokenization is cool!"
Tokens: [ "Token" | "ization" | " is" | " cool" | "!" ]
Numbers: [ 15496 | 1634 | 318 | 3608 | 0 ]
(illustrative IDs)
Each token gets mapped to a unique number (a token ID) from the model's vocabulary. That list of numbers is what actually gets fed into the model — not the words themselves.
TEXT
│
▼
TOKENS ["Token", "ization", " is", " cool", "!"]
│
▼
TOKEN IDs [15496, 1634, 318, 3608, 0]
│
▼
MODEL INPUT
5. Transformers
What a Transformer is
The Transformer is a neural network architecture introduced by Google researchers in 2017, in a paper with the memorable title "Attention Is All You Need." It's the architectural backbone behind GPT ("Generative Pre-trained Transformer"), Claude, Gemini, and virtually every major LLM today.
At its core, a Transformer takes a sequence of token numbers and passes them through many stacked layers that repeatedly ask: "Given every other token in this sequence, how relevant is each one to understanding this particular token?"
Why it changed AI
Before Transformers, models processed text mostly one word at a time, in order (like reading left to right, one step at a time, forgetting earlier context along the way). This made it hard to handle long sentences or connect ideas that were far apart in the text.
Transformers introduced a mechanism called self-attention, which lets the model look at all tokens in a sequence simultaneously and weigh how much each one matters to every other one — regardless of distance. This was faster to train (highly parallelizable on GPUs) and dramatically better at capturing long-range context.
How it helps understand language
Consider this sentence:
"The trophy didn't fit in the suitcase because it was too big."
What does "it" refer to — the trophy or the suitcase? Humans resolve this instantly using context. Self-attention lets the model do something similar: when processing the token "it," it can assign high attention weight to "trophy" (and lower weight to "suitcase"), effectively learning which words matter most for interpreting which other words.
ATTENTION FOR "it"
The trophy didn't fit suitcase because it was too big
. ███ . . ▓ . ● . . .
███ = strong attention ▓ = some attention ● = current token
Stack dozens of these attention layers on top of each other, across billions of parameters, and the model builds up a rich internal representation of grammar, meaning, and even reasoning patterns.
Why almost every modern LLM uses Transformers
Transformers scale remarkably well — feed them more data and more compute, and they reliably keep getting better, a trend often called the "scaling laws" of deep learning. Combined with their ability to train in parallel (much faster than older sequential architectures) and their strength at capturing context over long passages, Transformers became the default choice for essentially every major LLM built since 2018 — GPT, Claude, Gemini, LLaMA, and beyond.
Putting It All Together — The Complete Workflow
┌──────────┐ ┌────────────┐ ┌─────────────┐ ┌──────────────┐ ┌──────────┐
│ YOUR │──▶│ TOKENIZE │──▶│ CONVERT TO │──▶│ TRANSFORMER │──▶│ RESPONSE │
│ PROMPT │ │ (text → │ │ NUMBERS │ │ (predicts │ │ (decoded │
│ "Hi!" │ │ tokens) │ │ (token IDs)│ │ next token, │ │ back to │
│ │ │ │ │ │ │ repeatedly) │ │ text) │
└──────────┘ └────────────┘ └─────────────┘ └──────────────┘ └──────────┘
Bonus concept: the context window
Every LLM has a context window — the maximum number of tokens (prompt + conversation history + response) it can "see" at once. Think of it as the model's short-term memory span for a single conversation.
┌──────────────────────── CONTEXT WINDOW (e.g. 128K tokens) ────────────────────────┐
│ [ System instructions ] [ Earlier chat turns ] [ Your prompt ] [ Response so far ]│
└──────────────────────────────────────────────────────────────────────────────────┘
▲ ▲
older tokens model generates
may get dropped new tokens here
if window is full
If a conversation grows longer than the context window, the oldest parts eventually fall outside the model's view — which is why very long chats can sometimes cause a model to "forget" something mentioned much earlier.
Bonus concept: temperature
Temperature is a setting that controls how "predictable" vs "creative" the model's next-token choices are.
LOW TEMPERATURE (e.g. 0.2) HIGH TEMPERATURE (e.g. 1.2)
──────────────────────── ────────────────────────
Prompt: "The sky is" Prompt: "The sky is"
blue ████████████ 92% blue ███████ 40%
clear ██ 5% falling █████ 28%
grey █ 2% infinite ████ 20%
(other) ▏ 1% (other) ███ 12%
→ Picks "blue" almost every time → More willing to pick unusual,
(focused, deterministic, surprising words (creative,
repetitive on retries) varied, less predictable)
Lower temperature is useful for factual, consistent answers (like code or math). Higher temperature is useful for creative writing, brainstorming, or varied phrasing.
Wrapping Up
ChatGPT doesn't "understand" your question the way a human does. It doesn't have beliefs, memories of you between separate conversations, or a database of facts it looks up. What it has is:
- Your text, broken into tokens
- Those tokens converted into numbers
- A Transformer architecture that uses self-attention to figure out which parts of your input matter most to each other
- A generation loop that predicts the most probable next token, one at a time, until it forms a complete response It's pattern-matching at a staggering scale — trained on enough human-written text that the patterns it learned are genuinely useful for reasoning, writing, and conversation. Not magic. Just very, very good math.
If you found this useful, the next post in this series will dig into **embeddings* — how tokens actually get turned into the vectors that carry meaning inside a Transformer. Stay tuned.*
Top comments (0)