Large Language Models often feel magical. You type a sentence, and suddenly an AI writes code, explains physics, or drafts emails.
But under the hood, the system is doing something surprisingly structured.
Let’s walk through the core building blocks of modern AI models in a simple and fun way.
1. Tokenization — Breaking Language into Pieces
Before an AI understands text, it must split the sentence into smaller units called tokens.
Example sentence:
"I love artificial intelligence"
Tokenized form might look like:
["I", "love", "artificial", "intelligence"]
Sometimes tokens are even smaller:
["art", "ificial", "intelli", "gence"]
How the text gets split depends on the model’s tokenizer and the vocabulary it learned during training.
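To make the splitting step concrete, here is a minimal sketch of a greedy longest-match tokenizer. The tiny vocabulary is made up for illustration, and real tokenizers (BPE, WordPiece and friends) learn their vocabularies from data and use more involved rules, so treat this as a toy rather than how any particular model does it.

```python
# Toy vocabulary (hypothetical); real tokenizers learn theirs from training data.
VOCAB = {"I", "love", "art", "ificial", "intelli", "gence", " "}

def tokenize(text, vocab):
    """Greedy longest-match split: at each position, take the longest known piece."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                tokens.append(piece)
                i = end
                break
        else:
            # Unknown character: emit it alone so the loop always makes progress.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("artificial intelligence", VOCAB))
# ['art', 'ificial', ' ', 'intelli', 'gence']
```

A bigger vocabulary would let the same function return whole words instead of fragments.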
Vocabulary Size
This is the number of distinct tokens a model can recognize.
Example:
| Model | Approx Vocabulary |
|---|---|
| Small models | ~30k tokens |
| Modern LLMs | 100k+ tokens |
Think of it like a dictionary the AI uses to read text.
2. Vectors — Turning Words into Numbers
Computers don’t understand words.
They understand numbers.
So each token becomes a vector (a list of numbers).
Example:
"cat" → [0.21, -0.34, 0.77, 0.11]
"dog" → [0.19, -0.30, 0.74, 0.10]
Notice something interesting?
The vectors for cat and dog look similar.
That’s because they have similar meaning.
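One common way to measure "looks similar" is cosine similarity, which compares the direction of two vectors. The sketch below uses the cat and dog vectors from above. The `car` vector is an invented example of an unrelated word, added only for contrast.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat = [0.21, -0.34, 0.77, 0.11]
dog = [0.19, -0.30, 0.74, 0.10]
car = [-0.60, 0.45, 0.05, 0.62]  # hypothetical vector for an unrelated word

print("cat vs dog:", round(cosine_similarity(cat, dog), 3))
print("cat vs car:", round(cosine_similarity(cat, car), 3))
```

The cat/dog score comes out very close to 1, while the cat/car score is negative.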
3. Embeddings — Capturing Meaning
These vectors are called embeddings.
Embeddings represent semantic meaning.
Example relationships:
| Word | Meaning Relation |
|---|---|
| King | Male ruler |
| Queen | Female ruler |
| Paris | Capital of France |
Embeddings capture these relationships mathematically.
Famous example:
King - Man + Woman ≈ Queen
Pretty cool, right?
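Here is a small sketch of that arithmetic. The 2-D vectors are hand-made for illustration (one axis loosely standing for "royalty", the other for "gender"). Real embeddings have hundreds or thousands of learned dimensions, and the analogy only holds approximately there.

```python
import math

# Hand-made 2-D embeddings (hypothetical): [royalty, gender]
EMB = {
    "king":  [0.9,  0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1,  0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def nearest(vec, exclude):
    """Return the word whose embedding is closest to vec, skipping excluded words."""
    candidates = (w for w in EMB if w not in exclude)
    return min(candidates, key=lambda w: math.dist(vec, EMB[w]))

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(EMB["king"], EMB["man"], EMB["woman"])]

print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```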
4. Position Encoding — Remembering Word Order
Words alone are not enough.
Compare:
Dog bites man
Man bites dog
Same words, very different meaning.
Transformers solve this using positional encoding.
Each word receives a position signal:
Dog (position 1)
bites (position 2)
man (position 3)
This allows the model to understand sequence and grammar.
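As one concrete example of a position signal, the sketch below computes the sinusoidal encoding from the original Transformer paper. Many current models use learned or rotary position embeddings instead, so this shows one scheme, not the only one. Positions start at 0 here, and `d_model=4` is just a small size picked for readability.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding: even indices use sine, odd indices use cosine."""
    encoding = []
    for i in range(d_model):
        # Each sine/cosine pair shares one frequency.
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

for pos, word in enumerate(["Dog", "bites", "man"]):
    print(word, [round(x, 3) for x in positional_encoding(pos, d_model=4)])
```

Each position gets a different vector, which is added to the token's embedding so that "Dog" in first place looks different from "Dog" in third place.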
5. Encoder vs Decoder
The original Transformer architecture combines two types of networks: an encoder and a decoder.
Encoder
Reads and understands input.
Example:
Translate English → French
The encoder converts the sentence into meaning vectors.
Decoder
Generates new text one token at a time.
Example:
The capital of France is → Paris
Many modern chat models are primarily decoder-based.
6. Self-Attention — The Secret Sauce
Self-attention allows each word to look at every other word in the sentence.
Example:
"The animal didn't cross the street because it was tired"
The word "it" must figure out what it refers to.
Self-attention helps the model connect:
it → animal
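Instead of passing information along one word at a time, the model directly weighs the relationships between words. (In decoder-only models, each token can only look at the tokens that came before it.)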
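A stripped-down sketch of scaled dot-product attention shows the idea. Two simplifications are assumed here: the 2-D vectors are made up, and the queries, keys, and values are all the raw input vectors, whereas a real transformer first passes them through learned projection matrices.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention with queries = keys = values = the inputs."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Score this token against every token, then normalize into weights.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # The output is a weighted mix of all the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["animal", "street", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # made-up 2-D embeddings

_, weights = self_attention(vectors)
for tok, w in zip(tokens, weights[2]):
    print(f"it -> {tok}: {w:.2f}")
```

Because the invented vector for "it" points in nearly the same direction as "animal", the weight on "animal" comes out larger than the weight on "street".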
7. Softmax — Turning Scores into Probabilities
When predicting the next word, the model produces scores.
Example:
Paris: 8.1
London: 3.2
Pizza: -1.4
Softmax converts these scores into probabilities.
Paris → 0.993
London → 0.007
Pizza → 0.0001
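The model then picks a token from these probabilities: greedy decoding takes the most likely one, while sampling draws a token at random in proportion to its probability.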
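Softmax itself is only a few lines: exponentiate every score, then divide by the sum. This sketch runs it on the three scores from the example. Subtracting the maximum first is a standard numerical-stability trick and does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = {"Paris": 8.1, "London": 3.2, "Pizza": -1.4}
probs = dict(zip(logits, softmax(list(logits.values()))))

for word, p in probs.items():
    print(f"{word}: {p:.4f}")
```

Because of the exponential, a gap of a few points between scores becomes a very large gap between probabilities.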
8. Multi-Head Attention — Multiple Perspectives
Instead of one attention mechanism, transformers use multiple attention heads.
Think of it like a team of analysts:
One head focuses on grammar
One focuses on subject relationships
One focuses on context
Together they build a deeper understanding of the sentence.
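In code, the "team" can be sketched by slicing each token vector into chunks, running attention on each chunk separately, and gluing the results back together. This is a simplified illustration: real multi-head attention gives every head its own learned query, key, and value projections plus a final output projection, all omitted here. The input vectors are made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(vectors):
    """Single-head scaled dot-product attention (queries = keys = values)."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        )
        out.append([sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)])
    return out

def multi_head_attention(vectors, num_heads):
    """Split each vector into num_heads slices, attend per slice, concatenate."""
    d = len(vectors[0])
    if d % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    head_dim = d // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        head_outputs.append(attention([v[lo:hi] for v in vectors]))
    # For each token, join the outputs of all heads back into one vector.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(vectors))]

vectors = [[1.0, 0.0, 0.2, 0.7], [0.0, 1.0, 0.9, 0.1], [0.9, 0.1, 0.3, 0.6]]
mixed = multi_head_attention(vectors, num_heads=2)
print([[round(x, 2) for x in row] for row in mixed])
```

Each head only sees its own slice of the vector, so the heads can end up with different attention patterns over the same sentence.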
9. Temperature — Controlling Creativity
Temperature controls how adventurous the AI becomes.
| Temperature | Behavior |
|---|---|
| 0.1 | Very predictable |
| 0.5 | Balanced |
| 1.0 | Creative |
| 2.0 | Chaotic |
Example prompt:
The best food in Italy is...
Low temperature → “pizza”
High temperature → “truffle pasta with wild mushrooms”.
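Mechanically, temperature divides the scores before softmax is applied. The sketch below uses invented scores for three foods to show how the probabilities sharpen at low temperature and flatten at high temperature, then samples a few tokens with a fixed seed.

```python
import math
import random

def softmax_with_temperature(scores, temperature):
    """Divide scores by the temperature, then apply softmax."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

words = ["pizza", "pasta", "gelato"]
scores = [3.0, 2.0, 1.0]  # hypothetical model scores

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(scores, t)
    print(f"T={t}:", [round(p, 3) for p in probs])

# Sampling: draw tokens in proportion to their probabilities.
rng = random.Random(0)
probs = softmax_with_temperature(scores, 2.0)
print(rng.choices(words, weights=probs, k=5))
```

At T=0.1 almost all of the probability lands on "pizza". At T=2.0 the three options are much closer together, so sampling picks the less likely words more often.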
10. Knowledge Cutoff — When the Model Stopped Learning
LLMs are trained on huge datasets, but training eventually stops.
That date is called the knowledge cutoff.
If the cutoff is 2024, the model might not know events from 2025 unless external tools provide updated data.
Final Picture: How Everything Works Together
When you type a prompt:
Explain black holes simply
The AI pipeline roughly looks like this:
- Tokenization splits the text
- Tokens become embeddings
- Positional encoding adds order
- Self-attention analyzes relationships
- Multi-head attention extracts deeper context
- The decoder predicts the next token
- Temperature rescales the scores to control creativity
- Softmax turns the scores into probabilities, and the next token is picked from them
Repeat this once for every generated token, which can mean hundreds or thousands of times per response.
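The loop can be sketched in a few lines. The `NEXT_SCORES` table below is a hypothetical stand-in for a trained model: it only looks at the last token, whereas a real LLM computes its scores from the whole context using the attention layers described earlier. The sampling step (temperature, softmax, pick) has the same shape, though.

```python
import math
import random

# Hypothetical stand-in for a model: next-token scores given the last token.
NEXT_SCORES = {
    "<start>": {"black": 4.0, "holes": 1.0},
    "black": {"holes": 5.0, "bend": 0.5},
    "holes": {"bend": 3.0, "<end>": 1.0},
    "bend": {"light": 4.0, "<end>": 1.0},
    "light": {"<end>": 5.0},
}

def sample_next(last_token, temperature, rng):
    """Scale scores by temperature, exponentiate, and sample one token."""
    scores = NEXT_SCORES[last_token]
    scaled = [s / temperature for s in scores.values()]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax
    return rng.choices(list(scores), weights=weights, k=1)[0]

def generate(temperature=0.7, max_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_tokens):
        nxt = sample_next(tokens[-1], temperature, rng)
        if nxt == "<end>":  # the model signals that it is done
            break
        tokens.append(nxt)
    return tokens[1:]

print(" ".join(generate()))
```

Each pass through the loop produces exactly one token, and the new token becomes part of the input for the next pass.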
And that’s how a machine writes paragraphs.
Closing Thought
Large Language Models are not magic.
They are layers of clever mathematics turning language into patterns.
But when these pieces work together — tokens, vectors, attention, and probabilities — something remarkable happens:
Machines start speaking our language.