Otaku Intros Run Long
Hey there.
These days — though it's honestly not even "these days" anymore — AI is all the rage, right?
Adults and kids alike, everyone's using AI now. Ask ChatGPT and you'll get an answer to anything. Chappie (everyone's pet name for ChatGPT around here) does it all for you. The ninth Tokugawa shogun, what to make for dinner tonight, your past life — Chappie's got you covered, lol.
There's clearly the smell of money in the air too, because everyone and their dog is climbing up on a soapbox to say stuff like "Leverage AI for MASSIVE value!!" Never mind how many of those dogs actually understand the first thing about AI.
If you're going to go on about Chappie this and Claude that (or even just sit there listening to that kind of talk), you're way better off getting a grip on how they're built, one way or another. It pays off.
This article tackles that nagging question — "ChatGPT, Claude, whatever — what even ARE these guys?! It makes no damn sense!!" — casually, long-windedly, and absolutely stuffed with otaku in-group humor.
Having complicated stuff explained at you in an equally complicated tone — "okay so you tokenize, then embed, the attention scores, and the fal'Cie's l'Cie in Cocoon get Purged, and—" — yeah, that's no fun, and it's not funny either, right?
I mean, this article runs on full-blast cringe energy too, so the moment someone hits me with "Hey, otaku-kun, did you actually think this was funny while you were writing it? lol," all I can do is curl up and shrink into myself, but...
Back in those unforgettable days, holed up in the corner of the classroom griping about card-game ban-list updates and debating whether L or Light Yagami was smarter in Death Note — I figured that explaining this complicated stuff with that exact same energy has gotta have at least a little demand. So this time, I picked up my pen (keyboard).
I hope this article helps those of you who want to learn about AI get a better grasp of it.
It's Less "AI," More "LLM"
People throw the word "AI" around constantly, but it actually covers a LOT of ground. It's not like chatty robots such as ChatGPT are the only things that count as AI.
Rule-based expert systems, image recognition, speech synthesis — all of these are types of AI too.
That party member in an RPG who's supposed to be your ally, yet keeps making clueless, off-the-mark decisions and dragging the whole party down? That's a perfectly upstanding AI in its own right. Honestly, NPCs like that don't make anybody happy, do they. Just quietly stick to healing, man — that's all we're asking.
Now, among all these AIs, the approach of learning patterns from huge piles of data is called machine learning. And the flavor of machine learning that stacks up many layers of a mysterious contraption called a neural network — which straight-up cribs how the brain works — is what we call deep learning.
Things like ChatGPT, Claude, and Gemini are AI models that use the approaches above to learn the patterns of language from massive amounts of text data and predict "what's the most likely word (token) to come next." Those guys are what we call Large Language Models (LLMs).
Lumping Chappie under the blanket label "AI" is kind of like your mom calling every game console a "Nintendo"... no wait, it's the other way around — it's like calling your Nintendo, Uno, and I Spy all "games." It's not technically wrong, but it leaves you with a faint, nagging sense that something's just slightly off.
What This Article Is For
In this article, I'll walk you step by step through everything that happens between the moment an LLM receives some text and the moment it spits text back out.
Me: "What's the capital of Japan?"
LLM: "THE CAPITAL OF JAPAN IS TOKYO."
Understand just why an exchange like this even works, and congratulations — you've achieved Ultimate, Complete LLM Mastery.
No prior knowledge of natural language processing or machine learning required. I mean — if you already had that kind of background, you wouldn't be struggling with this stuff in the first place, right?
Now, since we're getting into the actual internals, there are a few spots where you'll need some programming know-how and a little math. But even without that background, if you just read straight through while skipping over the parts that look scary, I think you can still walk away with a pretty solid grasp of what LLMs are and how the whole thing fits together.
Even if you're not feeling confident, just dive in anyway. You might find it goes down a lot easier than you'd expect.
Also, for the concrete parameter values in this article's formulas and code examples, I'll be using the configuration of the smallest GPT-2 model (listed as 117M parameters in the original paper; the released checkpoint actually counts out to roughly 124M). Some cryptic-looking numbers are gonna show up right off the bat, but feel free to just skim them with a "huh, okay" and keep moving.
| Parameter | Value |
|---|---|
| vocab_size | 50,257 token types |
| d_model (hidden dimension) | 768 dimensions |
| n_heads (number of attention heads) | 12 heads |
| n_layers (number of decoder layers) | 12 layers |
| d_ff (FFN intermediate dimension) | 3,072 dimensions |
Just a heads-up: these are values specific to GPT-2, not some kind of absolute, set-in-stone constants, got it!?
First Things First: The Big Picture
The Flow at the Chat Level
Using ChatGPT or Claude is ridiculously easy, right? All we do is type in some text. A little while later, some nice-sounding text comes back. Everybody's happy.
Now, here's the thing — the LLM isn't handing back that whole response in one breath. Get this: what an LLM actually outputs in a single inference pass is just the next single token (≒ one word). The entire response gets assembled through the heroic, tear-jerking effort of repeating this one-token prediction over and over.
The crucial thing here is that the LLM isn't actually understanding your question and then answering it.
At the risk of beating a dead horse: what the LLM is doing, at the end of the day, is predicting "what's the most natural text to follow on from this context."
For example, take the input "What's the capital of Japan?". The LLM first picks whichever token is most likely to come next.
Once it lands on "The", it then predicts "capital" as the continuation of "What's the capital of Japan? The", and from there it just keeps going one token at a time — "of", "Japan", "is", "Tokyo" — building up the answer piece by piece.
An LLM can only ever output one token at a time. The reason it looks like a whole response comes back at once in the chat is simply that this loop is spinning at blazing speed.
So, how about it? I think you've now got a rough sense that these things are monsters that don't comprehend a word of human language.
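If you'd rather see that loop as code, here's a minimal sketch. Heads up: TOY_NEXT_TOKEN and predict_next are made-up stand-ins I'm assuming purely for illustration (a real LLM computes a probability distribution with a neural network and looks at the whole context, not just the last token), and the stop marker is borrowed from GPT-2's end-of-text token.

```python
# Hypothetical stand-in for an LLM: maps the last token to its "most likely" successor.
TOY_NEXT_TOKEN = {
    "?": "The", "The": "capital", "capital": "of", "of": "Japan",
    "Japan": "is", "is": "Tokyo", "Tokyo": ".", ".": "<|endoftext|>",
}

def predict_next(tokens):
    """One inference pass: returns exactly one next token."""
    return TOY_NEXT_TOKEN[tokens[-1]]

def generate(prompt_tokens, max_new_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)   # predict a single token
        if next_token == "<|endoftext|>":   # stop marker: end the loop
            break
        tokens.append(next_token)           # feed the output back in as input
    return tokens[len(prompt_tokens):]      # only the newly generated part

prompt = ["What", "'s", "the", "capital", "of", "Japan", "?"]
answer = generate(prompt)
print(" ".join(answer))
```

The point to take away is the shape of the loop: predict one token, append it, go again. Everything else in this article is about what happens inside that one predict_next call.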
The Single-Step Inference Pipeline
From here on, I'll walk you through a single pass of the loop I described above — in other words, the pipeline that runs from feeding in a token sequence to getting back the next single token.
At each step, the shape of the data keeps morphing. Sort of like a conveyor belt, huh.
| Step | Data shape | Mental image | Example (GPT-2) |
|---|---|---|---|
| Input token sequence | Sequence of token IDs (integers) | Numbers stamped onto the text | [46036, 25, 171, 120, 234] |
| Embedding | Sequence of vectors | Each token now carries meaning | Each token becomes a 768-dim vector |
| Transformer decoder | Sequence of vectors (transformed) | Meaning reworked with context in mind | Passed through 768-dim × 12-layer computation |
| Output layer | Probability distribution | A shortlist of likely next words | A probability for each of 50,257 token types |
| Sampling | A single token ID | One pick from the candidates | The next token ID |
One more note: converting back and forth between text and token sequences (tokenizing / detokenizing) is actually handled outside the LLM model proper. But since it's essential for understanding the pipeline, I'll go ahead and cover it alongside everything else.
Look, if you're gonna talk Fate, it's only human nature to want you to watch Zero too, dammit!!
Um, otaku-kun — Zero isn't exactly seamless with the original canon, you know.
From Text to Tokens
Strings and Unicode
Alright. Text on a computer is represented as a sequence of Unicode characters. For example, "Hello" is a sequence of five code points: [U+0048, U+0065, U+006C, U+006C, U+006F].
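You can check those code points yourself with nothing but Python's built-ins. This small sketch also peeks at the UTF-8 bytes behind a Japanese string, which comes in handy later when bytes show up as the starting vocabulary.

```python
text = "Hello"

# ord() returns the Unicode code point of a single character
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)  # ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']

# The same text stored as UTF-8 bytes: one byte per ASCII character
print(list(text.encode("utf-8")))  # [72, 101, 108, 108, 111]

# Japanese characters take three UTF-8 bytes each
utf8_bytes = list("日本".encode("utf-8"))
print(utf8_bytes)  # [230, 151, 165, 230, 156, 172]
```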
But alas — the LLM is a cold-blooded monster, so it can only deal in numbers. Which means we somehow have to convert warm, human Unicode into a cold, lifeless string of numbers. As if Unicode itself wasn't already plenty lifeless to begin with.
The most naive approach would be to just assign a number to each individual character. But Unicode defines over 150,000 characters, so fiddling around doing that would blow your vocabulary up to a ludicrous size. Worse, every single character becomes its own token, so the token sequences themselves get long. And the Attention operation we'll get to later has a compute cost that scales with the square of the sequence length, so there's just no way we can afford to do things this way.
Go the other way and assign a number to each whole word, and now you can't handle unknown words (words that never appeared in the training data). Japanese is especially bad for this — you can crank out plausible-sounding compound words all day long, so you'd wind up absolutely drowning in unknowns.
The fix for all this is a slick little technique called subword tokenization. It chops text into units that live somewhere between characters and words (subwords) and assigns a number (token ID) to each one. Basically, it's the best of both worlds. These days, this is pretty much the only method anyone uses.
How BPE Works
Alright. The most widely used subword tokenization method in today's LLMs is a thing called BPE (Byte Pair Encoding). It's basically become common knowledge at this point, so let's go ahead and lock it down while we're here.
Building the Merge Rules (at Training Time)
A BPE vocabulary is learned from the training data using an algorithm something like this:
1. Register all byte values (0–255) as the initial vocabulary.
2. Split the entire training corpus into units of that initial vocabulary (i.e., into byte sequences).
3. Count how often each adjacent pair of tokens occurs.
4. Add the most frequent pair to the vocabulary as a single new token (this is a merge rule).
5. Replace every occurrence of that pair throughout the training data with the new token.
6. Repeat steps 3–5 until the vocabulary hits the target size (e.g., 50,257).
Let's see it with a concrete example. Say the training data contains "low lower lowest" — here's how it plays out.
Initial: l o w l o w e r l o w e s t
If the l–o pair is the most frequent, a new token lo gets created.
merge 1: lo w lo w e r lo w e s t
Next, if the lo–w pair is the most frequent, a token low gets created.
merge 2: low low e r low e s t
And just like that, the more frequently a pattern shows up, the more it gets bundled together into a single token. There you go — round of applause!
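Here's a minimal sketch of that training loop, run on the same "low lower lowest" example. It's a simplification under a couple of assumptions of mine: it starts from characters rather than raw bytes, it splits on whitespace first so merges never cross word boundaries, and ties go to whichever pair was seen first. The function names are made up for this sketch.

```python
from collections import Counter

def count_pairs(words):
    """Count how often each adjacent pair of symbols occurs."""
    counts = Counter()
    for symbols in words:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += 1
    return counts

def apply_merge(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    result = []
    for symbols in words:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        result.append(merged)
    return result

def train_bpe(text, num_merges):
    words = [list(word) for word in text.split()]  # start from single characters
    merges = []
    for _ in range(num_merges):
        counts = count_pairs(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent pair (first seen wins ties)
        merges.append(best)
        words = apply_merge(words, best)
    return merges, words

merges, words = train_bpe("low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)   # [['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't']]
```

The order of the merges list is the priority order, which is exactly what gets written out to the merge-rule file.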
The Tokenizer's Artifacts
Once BPE training finishes, the merge rules and the vocabulary get saved as files in the model's project. For GPT-2, the deliverables are two text files:
- merges.txt: the list of merge rules. One pair per line, written in priority order.
- vocab.json: the lookup table mapping token strings to IDs.
The project's directory layout looks something like this:
models/gpt-2/
├── merges.txt # BPE merge rules (text)
├── vocab.json # token → ID lookup table (JSON)
└── model.ckpt # model weights
You're dying to know what's actually written in these, right? Just what kind of cutting-edge, high-level wizardry is about to leap off the page...? Let's hold our breath and peek inside each one.
# merges.txt (excerpt from the actual GPT-2 file)
#version: 0.2
Ġ t
Ġ a
h e
i n
r e
o n
Ġt he
e r
...
// vocab.json (excerpt from the actual GPT-2 file)
{
"!": 0,
"\"": 1,
"#": 2,
"$": 3,
"%": 4,
"Ġthe": 262,
"Ġof": 286,
"Ġand": 290,
...
}
And the lovely part: you can actually try tokenization out for yourself in Python.
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# vocabulary size
print(tokenizer.vocab_size) # 50257 (total number of tokens in the vocabulary)
# run tokenization
text = "Hello world"
print(tokenizer.tokenize(text)) # ['Hello', 'Ġworld']
print(tokenizer.encode(text)) # [15496, 995]
Yep. That's it.
When you get down to it, a tokenizer is really just a combination of some text files (merges.txt, vocab.json) and the code that reads them in and does the splitting.
For all the fuss people make about ChatGPT, there's no ultra-special, mysterious sorcery at work — at the end of the day, it's nothing more than a bundle of the same files and programs we engineers handle every single day.
Special Tokens
Alright. The vocabulary also contains special tokens that never appear in ordinary text.
Rather than being treated as words that carry meaning in their own right, they exist to hand the model instructions like "the input starts here" or "wrap up generation at this point."
# check GPT-2's special tokens
print(tokenizer.eos_token) # '<|endoftext|>'
print(tokenizer.eos_token_id) # 50256
| Token | Name | Role |
|---|---|---|
| <\|endoftext\|> | EOS (End of Sequence) | Marks the end of the text. Once the model outputs this, generation stops. |
| padding | PAD (Padding) | Filler used to make multiple inputs the same length. |
Remember the one-token-at-a-time loop from "The Flow at the Chat Level"? Its stop condition is exactly this EOS token. The instant the LLM predicts EOS as its next token, response generation comes to a halt.
Tokenization at Inference Time
Let's talk about what actually happens when you run inference. That said, all it really does is apply the merge rules we built earlier in priority order and chop the text up — that's the whole story. It goes something like this:
- Break the input text down into a byte sequence (the units of the initial vocabulary).
- Apply the merge rules in priority order (earliest-added-during-training first).
- Stop when there are no more applicable merges.
- Look up each token's corresponding token ID in the vocabulary table.
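The steps above can be sketched in a few lines. Same caveats as any toy: the merge list and the vocabulary IDs below are hypothetical values I made up, and the sketch works on characters within a single word instead of on bytes the way GPT-2's real tokenizer does.

```python
def bpe_tokenize(word, merges):
    """Apply merge rules in priority order until none of them fits."""
    rank = {pair: r for r, pair in enumerate(merges)}  # lower rank = higher priority
    symbols = list(word)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:  # no applicable merge left
            break
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge rules and vocabulary table, for illustration only
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
vocab = {"l": 0, "o": 1, "w": 2, "e": 3, "r": 4, "s": 5, "t": 6,
         "lo": 7, "low": 8, "er": 9}

tokens = bpe_tokenize("lower", merges)
ids = [vocab[t] for t in tokens]
print(tokens, ids)  # ['low', 'er'] [8, 9]
```

Notice that "lowest" would still tokenize fine, just into more pieces, because the single characters are always there as a fallback. That fallback is why subword tokenizers don't choke on unknown words.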
Converting to Token IDs
So what do you actually wind up with after all this "tokenizing" business? The finished product is a sequence of integer token IDs. The vocabulary table is just a mapping from token strings to IDs, and the conversion is a stupidly simple lookup.
Let's start with a normal English query:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
text = "What's the capital of Japan?"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
print(tokens) # ['What', "'s", 'Ġthe', 'Ġcapital', 'Ġof', 'ĠJapan', '?']
print(ids) # [2061, 338, 262, 3139, 286, 2869, 30]
Nice and tidy — English breaks into recognizable, word-ish chunks, each mapped to a clean integer ID. But now watch what happens when we throw Japanese at the very same tokenizer:
text = "日本の首都はどこですか"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
print(tokens) # ['æĹ', '¥', 'æľ', '¬', 'ãģ®é', '¦', 'ĸ', ...]
print(ids) # [33768, 98, 17312, 105, 33426, 99, 244, ...]
Here's the thing: GPT-2 was trained mostly on English, so Japanese gets shredded into fine-grained byte fragments. It looks like total gibberish, I know — but rest easy, this is purely because GPT-2's vocabulary was built English-first, not some limitation of BPE itself. Hate GPT-2 for it all you want; just don't take it out on BPE.
And this token ID sequence is what becomes the input to the LLM model itself.
From Tokens to Vectors
The Embedding Table
From here on, we're stepping into the world of neural networks! There's even gonna be a little math! How cool is that!
Alright — we've successfully whittled everything down into a language LLM-kun can actually read (integers). But the numbers we handed out in that last step are nothing more than serial numbers. How large or small they are means absolutely nothing. ID 15496 ("Hello") is bigger than ID 995 (" world"), but there's zero linguistic meaning in that.
And so, this is the step where we give those flavorless, odorless integers the warmth of meaning (weren't we just stripping that warmth away a second ago?!). We transform a plain ID into an embedding vector packed with 768-dimensional (768 dimensions?!) semantic features.
This Embedding step looks like pure nonsense at first glance, but a map analogy makes it click.
City names like "Tokyo" and "Osaka" are just labels, but convert them into latitude/longitude coordinates and suddenly you can compute things like "Tokyo and Osaka are close together" and "Tokyo and London are far apart," right?
Embedding runs on the exact same idea, placing each token somewhere in a multi-dimensional coordinate space. With a whole 768 dimensions to work with, it seems plausible you could capture all sorts of relationships — "close in meaning," "same part of speech," "swappable in context," and so on — as distances in that space.
A relationship on a number line (1D) carries less information than one on a flat 2D plane, and once you stack on a third axis of depth to get a 3D space, the picture is richer still, right? Say "768 dimensions" out loud and you might recoil a little — "the heck is 768 dimensions? Is it, like, some realm where eldritch horrors dwell?" — but really, it all boils down to "the more measuring sticks you've got, the happier you are." That's the whole story.
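To make the map analogy concrete, here's a tiny sketch that treats approximate latitude/longitude pairs as 2-dimensional "embeddings." The coordinates are rounded values I've plugged in for illustration, and treating degrees as a flat plane is a crude assumption, but it's enough to show that labels become comparable once they're coordinates.

```python
import math

# Approximate (latitude, longitude) in degrees, used as toy 2-D "embeddings"
city_vectors = {
    "Tokyo": (35.68, 139.69),
    "Osaka": (34.69, 135.50),
    "London": (51.51, -0.13),
}

def distance(a, b):
    """Straight-line distance between two cities in coordinate space."""
    return math.dist(city_vectors[a], city_vectors[b])

tokyo_osaka = distance("Tokyo", "Osaka")
tokyo_london = distance("Tokyo", "London")
print(tokyo_osaka < tokyo_london)  # True: Osaka is the closer neighbor
```

A real embedding does the same thing with 768 axes instead of 2, and the axes are learned rather than handed down by a cartographer.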
Now, the conversion mechanism itself is simple: you prepare a 2D table (a matrix) of size vocab_size × d_model, then just look up a row using the token ID as the row number.
$$e = E[\text{token\_id}], \qquad E \in \mathbb{R}^{|V| \times d_{\text{model}}}$$
- $E$: the embedding table (a matrix)
- $|V|$: vocabulary size (50,257 token types in GPT-2)
- $d_{\text{model}}$: embedding dimensionality (768 in GPT-2)
In GPT-2's case, this table consists of $50{,}257 \times 768 = 38{,}597{,}376$ floating-point numbers. That's roughly 38.6 million parameters eaten up by the embedding alone. Can't even picture it.
import torch
import torch.nn as nn
# define the embedding table
embedding = nn.Embedding(num_embeddings=50257, embedding_dim=768)
# token ID sequence → vector sequence
token_ids = torch.tensor([15496, 995]) # "Hello world"
vectors = embedding(token_ids) # returns a 2-token × 768-dim matrix
Where token_ids was just two integers, vectors has become two 768-dimensional vectors. The values in each vector are learned during training, and semantically similar tokens wind up with similar vectors.
In other words, every token has finally gotten to carry meaning. Ahh, what a happy occasion.
Positional Encoding
Embedding let every individual token take on meaning — but we're not done yet. Because simply running Embedding throws away the information about the order of the tokens.
In language, word order is a key factor in determining meaning. Without that information, we're in real trouble. ["cat", "chases", "dog"] and ["dog", "chases", "cat"] would collapse into the exact same set of embedding vectors. That's a problem, right?
So we use a technique called positional encoding to bake each token's positional information into its vector.
Learned Positional Embedding
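In GPT-2, a vector is learned for each position and then added to the token's Embedding. This is called Positional Embedding. (The original Transformer from 2017 also added a position vector to each token, but it computed that vector from fixed sine and cosine functions instead of learning it.)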
What the hell does that even mean?! ...Right. Let me use movie theater seats as an analogy. The same person (token) ends up with totally different relationships to everyone around them depending on whether they're sitting in seat 1 or seat 10. This technique adds together both pieces of info — "who's sitting there" plus "which seat it is."
Written out as a formula, it looks like this:
$$x_i = e_i + p_i$$
- $x_i$: the final input vector for the token at position $i$
- $e_i$: the embedding vector looked up from the token ID
- $p_i$: the position vector corresponding to position $i$
# GPT-2's positional encoding
pos_embedding = nn.Embedding(num_embeddings=1024, embedding_dim=768) # up to 1024 positions
positions = torch.arange(len(token_ids)) # [0, 1]
x = embedding(token_ids) + pos_embedding(positions) # 2-token × 768-dim matrix
GPT-2's positional embedding table works out to $1{,}024 \times 768 = 786{,}432$ parameters. Simplicity is nice, but it comes with a catch: it can't handle any position beyond the maximum length it saw during training (1024 tokens).
RoPE (Rotary Position Embedding)
Awkward to bring this up right after introducing it, but Positional Embedding is basically old news now.
"Ummm, you're STILL doing Positional Embedding?? Positional Embedding was, like, only okay up until April 2021, riiiight~ lmaooo"
We've reached the point where even some California girl doing a TikTok dance with a Starbucks in one hand will hit you with this.
Today's LLMs use a slick, now-standard approach called RoPE (Rotary Position Embedding) — a hot technique that dropped around April 2021.
Whereas the classical approach "adds" a position vector onto the Embedding, RoPE "rotates" the vector based on its position. Once again, I have no idea what that means~. There's no way an operation that just spins things around can beat straight-up composition (addition)!! ...Right.
Picture the hands of a clock. Start at the 12 o'clock position and rotate the hand by a fixed angle for every token you move forward. How far apart two tokens are is something you can read straight off the difference in their hands' angles.
This very property — the fact that "the angle difference tells you the distance" — is the heart of RoPE. It expresses position in relative terms (how far apart things are) rather than absolute terms (which slot in line you're in).
Mathematically, the trick goes like this: inside the Attention computation (which we'll cover in the next section), you multiply the Q (Query) and K (Key) vectors by a rotation matrix, and the relative distance between two tokens naturally falls out in the dot product.
$$v' = R_i \, v$$
- $v$: the vector being rotated (Q or K)
- $i$: the token's position (0, 1, 2, ...)
- $R_i$: the rotation matrix corresponding to position $i$
Let me answer the obvious question — "Okay, so what's actually so cool about this?" Here are RoPE's advantages:
- Handles relative position naturally — the distance between two tokens shows up directly in the dot product.
- Extrapolation — it can cope, to a degree, with input lengths it never saw during training.
- No extra parameters — there's no positional embedding table; the rotation angles are fixed by a formula.
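The "angle difference tells you the distance" property is easy to check numerically. Below is a minimal 2-dimensional sketch: rotate a query and a key by an angle proportional to their positions, then compare dot products. The step angle theta=0.1 and the vectors are arbitrary values I picked; real RoPE applies this rotation to many pairs of dimensions, each with its own frequency.

```python
import math

def rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by position * theta radians."""
    angle = position * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.0)

# Same relative offset (3 positions apart), very different absolute positions
score_near_start = dot(rotate(q, 5), rotate(k, 2))
score_far_along = dot(rotate(q, 103), rotate(k, 100))

# A different relative offset (1 position apart)
score_other_offset = dot(rotate(q, 5), rotate(k, 4))

print(math.isclose(score_near_start, score_far_along, abs_tol=1e-9))     # True
print(math.isclose(score_near_start, score_other_offset, abs_tol=1e-9))  # False
```

Shifting both tokens by 98 positions leaves the score untouched, while changing the gap between them changes it. That's the relative-position behavior described above, with no lookup table in sight.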
Boss Battle!! The Transformer Decoder
In the previous section, each token was reborn as a 768-dimensional vector through Embedding plus positional encoding, picking up information about "what meaning do I carry," and from there, LLM-kun became able to grasp meaning.
But at this very moment, a jolt of electricity shoots through LLM-kun──────.
"The correlations between the tokens... I can't figure 'em out!!"
Take "He grabbed a bat and stepped up to the plate." Zero in on just the word "bat," and LLM-kun hasn't the faintest idea whether it means a piece of sporting equipment or a little flying mammal.
The Transformer decoder is the core component in charge of integrating context. It has each token go around asking every other token, "Hey you — how related to me are you, exactly?", then soak up more information from the ones it's closely related to and less from the ones it barely is.
This is the single hardest part of understanding LLMs.
It's long, and it's crawling with baffling jargon! How delightful!
Now, in GPT-2, twelve decoder blocks of identical structure are stacked on top of one another, and each block consists of the following two operations:
- Attention — gathers information from the other tokens
- Feed-Forward Network (FFN) — transforms and reworks the gathered information
Attention (Multi-Head Attention)
So, What Even IS Attention?
In a single sentence, Attention is the mechanism where "each token computes how much attention it should pay to every other token, and then gathers information in proportion to those attention levels."
Picture a meeting. When you're the one summarizing what everyone said, you don't give every single word from every participant equal weight — you zero in on the remarks relevant to the topic at hand, right? Attention does this exact same thing, just numerically. "Pfft, as if an otaku would ever be the one put in charge of summarizing a meeting"? Hey — don't you talk back to me!!
That said, there isn't necessarily just one lens for deciding "how much to pay attention." Take the sentence "She opened an account at the bank yesterday" and focus on the word "bank." Syntactically, what matters is its predicate relationship with "opened"; semantically, what matters is its co-occurrence with "account."
Two tokens that look only loosely connected from one perspective can turn out to be tightly connected from another.
From one angle you might go "eh, these two have zero chemistry," but switch your angle and it could flip to "wait, no — these two are totally a thing." Looking at a tsundere heroine who's secretly carrying a torch for the protagonist and writing them off with "nah, those two? Not happening (smug grin...)" would be way too much of a waste. There's nothing wrong with Sora-and-Naminé fan art existing, okay? If they're in the same panel, they're a couple. Period.
This is why real Attention uses multiple heads, each computing attention scores from a different perspective at the same time. In GPT-2, 12 heads run in parallel, each one capturing a different kind of relationship. This is what we call Multi-Head Attention.
Input Projection
The math piles on all at once from here, but if it gets to be too much, feel free to just skim for the vibe!
From here on, we'll fix our attention on the 768-dimensional vector $x_i$ of some token $i$ (strictly speaking, a position index within the sequence), and follow how a given head $h$ processes it with respect to each of the comparison tokens ($j$).
Here, since $i$ is the specific token we're zooming in on, it's a constant, but the comparison index $j$ is a variable that can take on any token in the input, including itself ($j = i$). Since the whole point is for each token to compute how much attention to pay to every other token and gather information accordingly, it inevitably works out this way.
Now, from the focus token $i$ and each comparison token $j$, we construct three kinds of vectors.
- Q (Query): represents "what information am I looking for." Generated from the focus token $i$.
- K (Key): represents "what information do I hold." Generated from the comparison token $j$.
- V (Value): represents "the actual content of that information." Also generated from the comparison token $j$.
In short: Q is "what I'm looking for," while K and V are "what the other party can offer, plus the actual contents."
$$q_i^{(h)} = x_i W_Q^{(h)}, \qquad k_j^{(h)} = x_j W_K^{(h)}, \qquad v_j^{(h)} = x_j W_V^{(h)}$$
- $x_i, x_j$: the input vectors of the tokens at positions $i$ and $j$ (768-dimensional each)
- $W_Q^{(h)}, W_K^{(h)}, W_V^{(h)}$: head $h$'s learned weight matrices (768 × 64 each)
- $q_i^{(h)}$: the Query vector for position $i$ in head $h$ (64-dimensional)
- $k_j^{(h)}, v_j^{(h)}$: the Key and Value vectors for position $j$ in head $h$ (64-dimensional)
And here's the very heart of multi-head: each head carries its own separate set of weight matrices. Because the same input yields different Q, K, and V depending on the head, each head gets to compute attention from a different perspective. Spelling it all out makes the formulas a pain to read, so from here on I'll drop the head superscript $(h)$ except where it specifically needs to be shown. Honestly, it's just a hassle.
Deriving the Attention Score
Within a given head, the token at position $i$ computes "how much attention should I pay to the token at position $j$?" You get the score by taking the dot product of your own Q with the other token's K.
$$s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}$$
- $s_{ij}$: the raw attention score from the token at position $i$ to the token at position $j$
- $q_i$: the Query vector for position $i$ (64-dim per head)
- $k_j$: the Key vector for position $j$ (64-dim per head)
- $\sqrt{d_k}$: a scaling factor ($d_k$ = 64 dims per head). Keeps the dot product from blowing up too large.
Because LLM decoders apply a causal mask, the token at position $i$ can only reference tokens at or before its own position ($j \le i$). After all, a model whose entire job is predicting the next token would be utterly pointless if it got to peek at the answer ahead of time. Talk about cheating!
We then normalize the resulting scores with softmax, turning them into weights between 0 and 1. Big unwieldy numbers are a pain to deal with, right? Corralling everything into the 0-to-1 range makes life a lot easier in all sorts of ways.
$$\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{j' \le i} \exp(s_{ij'})}$$
- $\alpha_{ij}$: the normalized attention weight.
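Here's the score-then-softmax step for a single focus position, written out with plain Python lists so nothing hides behind a tensor library. The 2-dimensional vectors are toy values I made up (real heads use 64 dimensions), and production code would also subtract the maximum score before exponentiating for numerical stability.

```python
import math

def attention_weights(q_i, keys, i):
    """Causal attention weights from position i to every position j."""
    d_k = len(q_i)
    scores = []
    for j, k_j in enumerate(keys):
        if j > i:
            scores.append(float("-inf"))  # causal mask: no peeking at the future
        else:
            dot = sum(a * b for a, b in zip(q_i, k_j))
            scores.append(dot / math.sqrt(d_k))  # scaled dot product
    exps = [math.exp(s) for s in scores]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy Key vectors for positions 0, 1, 2
q_1 = [1.0, 0.0]                              # toy Query vector for position 1

weights = attention_weights(q_1, keys, i=1)
print([round(w, 4) for w in weights])
```

Position 2 sits in the future relative to position 1, so its weight comes out as exactly zero, and the remaining weights add up to 1.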
The Weighted Sum
Now that we've got the weights, we use them to take a weighted average of the V (the actual information content) at each position. Information from high-attention tokens gets pulled in heavily, while the info from low-attention ones barely registers at all. ...Which means guys like us barely register either. C'mon, dry those tears — we're "buddies," aren't we?
$$o_i^{(h)} = \sum_{j \le i} \alpha_{ij} \, v_j$$
- $o_i^{(h)}$: the output vector for position $i$ in head $h$ (64-dimensional)
- $\alpha_{ij}$: the attention weight toward position $j$
- $v_j$: the Value vector for position $j$ (64-dim per head)
And that's the whole process for a single head. All 12 heads each carry their own distinct weight matrices $W_Q^{(h)}, W_K^{(h)}, W_V^{(h)}$, running this exact same computation in parallel from their different perspectives.
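The weighted sum itself is about as unglamorous as it sounds. A quick sketch with hand-picked toy numbers (the weights and Value vectors below are assumptions for illustration, not anything a real model produced):

```python
def weighted_sum(weights, values):
    """Blend Value vectors according to their attention weights."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

weights = [0.7, 0.3, 0.0]                       # attention weights for positions 0, 1, 2
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]  # toy Value vectors (2-dim instead of 64)

head_output = weighted_sum(weights, values)
print(head_output)
```

The third Value vector is the loudest one in the room, but with a weight of zero it contributes nothing. The output leans 70% toward position 0 and 30% toward position 1.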
Concatenation and Linear Transformation
We concatenate the outputs of all 12 heads (64-dimensional each) back into a single 768-dimensional vector, then run it through a linear transformation with the output weight matrix $W_O$ to merge the information from every head. In coder terms, it's roughly like taking the elements of twelve 64-element 1D arrays and packing them, in order, into one big 1D array to produce a single 768-element array. Not the most rigorous description, but you get the idea.
$$c_i = \text{Concat}\big(o_i^{(1)}, \dots, o_i^{(12)}\big), \qquad a_i = c_i W_O$$
- $c_i$: concatenation of the 12 head outputs (768-dimensional)
- $W_O$: the output weight matrix (768 × 768)
- $a_i$: the final Attention output for position $i$ (768-dimensional)
In practice, this whole computation is handled in one batched matrix operation across all the tokens.
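import torch
import torch.nn.functional as F

n_heads, d_model, num_tokens = 12, 768, 5
d_head = d_model // n_heads  # 768 dims ÷ 12 heads = 64 dims/head

# Random stand-ins for the input vectors and the learned weight matrices
x = torch.randn(num_tokens, d_model)
W_Q, W_K, W_V, W_O = (torch.randn(d_model, d_model) * 0.02 for _ in range(4))

# 1. Input projection: generate Q, K, V for all tokens at once, then split into 12 heads
def split_heads(t):  # (num_tokens, 768) → (12, num_tokens, 64)
    return t.view(num_tokens, n_heads, d_head).transpose(0, 1)
Q, K, V = split_heads(x @ W_Q), split_heads(x @ W_K), split_heads(x @ W_V)

# 2. Compute attention scores per head (apply causal mask)
scores = (Q @ K.transpose(-2, -1)) / (d_head ** 0.5)  # (12, num_tokens, num_tokens)
causal_mask = torch.triu(torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float('-inf'))  # set future positions to -∞
weights = F.softmax(scores, dim=-1)

# 3. Weighted sum
head_output = weights @ V  # (12, num_tokens, 64)

# 4. Concatenate the 12 heads and apply the linear transformation
concat = head_output.transpose(0, 1).reshape(num_tokens, d_model)  # (num_tokens, 768)
output = concat @ W_O  # (num_tokens, 768)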
Feed-Forward Network (FFN)
After Attention has shuffled information between the tokens, each token's representation is now in a "has absorbed info from the other tokens" state — but its internal transformation is still incomplete. The FFN (Feed-Forward Network) fills that gap, applying a nonlinear transformation to each token's representation individually to polish it into its final form (with zero information exchange between tokens). Like I keep telling you, this makes no sense!! ...Right.
If Attention is the "horizontal" pass that gathers information across tokens, then FFN is the "vertical" pass that transforms and reworks each token's information on its own, deepening its meaning.
"Teamwork! The bonds between friends! My friends are my power!" and "I'm an egoist! I win on my own power and mine alone!" — both of them are great, aren't they? That's the conversation we're having here. ...Were we really having that conversation? Probably. Partially.
Anyway, the structure of this FFN is just a simple two-layer neural network.
$$\text{FFN}(x) = \sigma(x W_1 + b_1) \, W_2 + b_2$$
- $x$: the output vector from Attention (768-dimensional)
- $W_1$: the first-layer weight matrix (768 × 3,072). Expands the vector to 4× the dimensions.
- $W_2$: the second-layer weight matrix (3,072 × 768). Brings it back to the original dimensions.
- $b_1, b_2$: bias terms
- $\sigma$: the activation function (a nonlinear transformation)
Why blow it up to 3,072 dimensions just to squash it back down to 768?! ...Right. Because widening the dimensions opens up a "workspace" where more complex patterns can be expressed. Then, by compressing it back to the original size, only the important features survive. That's the trick.
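import torch
import torch.nn.functional as F
# FFN computation (random tensors stand in for the trained weights)
x = torch.randn(5, 768) # Attention output for 5 tokens: (num_tokens, 768)
W1, b1 = torch.randn(768, 3072) * 0.02, torch.zeros(3072) # (768, 3072)
W2, b2 = torch.randn(3072, 768) * 0.02, torch.zeros(768) # (3072, 768)
hidden = F.gelu(x @ W1 + b1) # (num_tokens, 3072): expand
output = hidden @ W2 + b2 # (num_tokens, 768): back to original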
Residual Connections and LayerNorm
Actually — in each decoder block, a residual connection and a LayerNorm are applied after both the Attention step and the FFN step. Sorry for springing this on you after the fact; bear with me. It's genuinely hard to find a good spot to slip this explanation in.
A residual connection is a mechanism that takes the input of an operation and adds it straight back onto the output.
Running through as many as 12 layers of processing brings the risk of the original information gradually getting lost, or of training going unstable. But if a residual connection reframes things as adding a delta on top of the original information, then each layer only has to learn the delta it's responsible for improving. What a bargain...
LayerNorm is an operation that normalizes the values of each element in a vector (nudging them toward a mean of 0 and a variance of 1).
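$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
- $x$: the input vector (768-dimensional)
- $\mu$: the mean of all elements of $x$
- $\sigma^2$: the variance of all elements of $x$
- $\epsilon$: a tiny value to prevent division by zero (e.g., $10^{-5}$)
- $\gamma, \beta$: learnable scale and shift parameters (768-dimensional each)
- $\odot$: element-wise product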
It keeps the vector values from ballooning or shrinking to extremes as the layers get deeper, holding training steady — the unsung hero quietly doing the heavy lifting behind the scenes!
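If you'd rather see LayerNorm as code than as symbols, here's a minimal sketch. The function name `layer_norm_manual` and the toy input are my own inventions for illustration; the only thing borrowed from PyTorch is `F.layer_norm`, used as a sanity check.

```python
import torch
import torch.nn.functional as F

def layer_norm_manual(x, gamma, beta, eps=1e-5):
    """Normalize each token vector to mean 0 / variance 1, then scale and shift."""
    mu = x.mean(dim=-1, keepdim=True)                  # mean over the 768 elements
    var = x.var(dim=-1, keepdim=True, unbiased=False)  # variance over the same elements
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

torch.manual_seed(0)
x = torch.randn(5, 768) * 3.0 + 2.0    # 5 tokens, deliberately off-center and too wide
gamma, beta = torch.ones(768), torch.zeros(768)

manual = layer_norm_manual(x, gamma, beta)
builtin = F.layer_norm(x, (768,), weight=gamma, bias=beta, eps=1e-5)

print(torch.allclose(manual, builtin, atol=1e-5))   # do the two versions agree?
print(manual.mean(dim=-1))                          # per-token mean after normalization
```

Each token is normalized on its own, across its 768 elements, so nothing leaks between tokens here either.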
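Putting the residual connection and LayerNorm together, the processing inside a decoder block ends up looking like this:
$$x' = \text{LayerNorm}(x + \text{Attention}(x))$$
$$y = \text{LayerNorm}(x' + \text{FFN}(x'))$$
(Strictly speaking, the real GPT-2 applies LayerNorm just before Attention and FFN rather than just after them, the so-called pre-LN arrangement, but the ingredients are exactly the same.)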
Stacking the Layers
In GPT-2, the decoder block we just described (Attention → residual connection + LayerNorm → FFN → residual connection + LayerNorm) is stacked 12 layers deep. As you'd guess, that's an absolutely brutal amount of computation. Long live the GPU!
Notice here that both the input and the output are num_tokens × 768 — the shape never changes! Each layer takes in data of the same shape and hands back data of the same shape. It's only the contents that get refined as they pass through layer after layer.
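To watch the "same shape in, same shape out" thing actually happen, here's a rough sketch that stacks 12 blocks and logs the shape after each one. The `DecoderBlock` class is a hypothetical stand-in I wrote for this article (randomly initialized, following the Attention → add & norm → FFN → add & norm order described here), not GPT-2's real implementation.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Illustrative block: Attention -> add & norm -> FFN -> add & norm."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        n = x.size(1)
        # True above the diagonal means "this future position may not be attended to"
        causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)
        x = self.ln1(x + attn_out)       # residual connection + LayerNorm
        x = self.ln2(x + self.ffn(x))    # residual connection + LayerNorm
        return x

torch.manual_seed(0)
blocks = nn.ModuleList([DecoderBlock() for _ in range(12)])
x = torch.randn(1, 5, 768)               # (batch, num_tokens, 768)
shapes = []
with torch.no_grad():
    for block in blocks:
        x = block(x)
        shapes.append(tuple(x.shape))

print(shapes[0], shapes[-1])
```

Because every block hands back exactly what it was given, shape-wise, you can stack as many as your GPU budget tolerates.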
Also, research seems to suggest that the shallower layers tend to capture surface-level features like syntax and parts of speech, while the deeper layers capture more abstract features like meaning and contextual understanding.
The KV Cache
Why We Need a Cache
In the Attention computation, each token computes an attention score against every other token. So for a sequence of $n$ tokens, the number of score computations scales with $O(n^2)$. Double the sequence length and the compute cost quadruples; make it 10× longer and it's 100×. Gahaha. Yeah, no way.
Now, as we saw back in "The Flow at the Chat Level," an LLM generates autoregressively, one token at a time. Implement that naively and you'll redo the Attention computation for the whole sequence from scratch every single time you produce one token. And where does that leave you? Generating the $t$-th token costs $O(t^2)$, and you repeat that $n$ times, so the full sequence works out to $O(n^3)$. Gahaha. Yeah, no way, take 2.
But let's actually stop and think for a second. I bet you'll arrive at this realization: "Look, I ain't the sharpest, so maybe I'm missin' somethin', but — do we really gotta recompute every single K and V from scratch each time?"
For example, say you're predicting the token that comes after "The capital of Japan is" (5 tokens). Attention does its computation using the K and V of all 5 tokens. Then, once "Tokyo" is generated, you move on to predict what follows "The capital of Japan is Tokyo" (6 tokens) — but at this point, the K and V for those first 5 tokens are the exact same values as last time, right? Since the causal mask keeps past tokens from being affected by future ones, the same input always produces the same K and V.
Recomputing K and V for every token, every single time, is a colossal waste. So the trick of keeping the already-computed K and V in memory and only tacking on the K and V for the new token — that's the KV cache. An almost absurdly sensible idea, isn't it.
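Just to put rough numbers on that waste, here's a back-of-the-envelope counter. It's a simplification under my own assumptions: it counts one "score" per (query, key) pair, pretends every token is generated one at a time, and ignores that the causal mask lets you skip about half the pairs. The function names are made up for this sketch.

```python
def naive_score_count(n):
    """Without a cache: step t recomputes the full t x t score matrix."""
    return sum(t * t for t in range(1, n + 1))

def cached_score_count(n):
    """With a KV cache: step t only scores the one new token against t keys."""
    return sum(t for t in range(1, n + 1))

for n in (10, 100, 1024):
    naive, cached = naive_score_count(n), cached_score_count(n)
    print(f"n={n}: naive={naive:,} cached={cached:,} ratio={naive / cached:.1f}x")
```

The gap between the two columns widens as the sequence gets longer, which is the whole argument for caching in one loop.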
How It Works in the Autoregressive Loop
All right, let's walk through the autoregressive loop with the KV cache, one concrete step at a time.
Step 1: The first pass (prompt input)
Compute the K and V for every token in the input "The capital of Japan is" (5 tokens) and store them in the cache.
K_cache = [k_0, k_1, k_2, k_3, k_4] # all 5 tokens
V_cache = [v_0, v_1, v_2, v_3, v_4]
This step, naturally, requires the full Attention computation across all 5 tokens. No way around it.
Step 2: Generating the first token
Compute only the K and V for the new token "Tokyo" and append them to the cache.
K_cache = [k_0, k_1, k_2, k_3, k_4, k_5] # one appended
V_cache = [v_0, v_1, v_2, v_3, v_4, v_5]
Just use the new token's Q together with the whole cache's K and V, and you skip computing Q for the past tokens entirely — along with all those attention scores among the past tokens. Best thing ever!!
Step 3 onward: Just repeat. Every step, all you do is append one new token's worth of K and V to the cache.
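If "trust me, the cached version gives the same answer" isn't good enough for you (fair), here's a small sketch that checks it. It's a single Attention head with random weights, and `full_attention` plus all the variable names are assumptions of mine for illustration, not code from any real model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64                                   # one head's dimension
W_Q, W_K, W_V = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
x = torch.randn(6, d)                    # 5 prompt tokens + 1 newly generated token

def full_attention(x):
    """Naive version: recompute Q, K, V and every masked score for the whole sequence."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    scores = Q @ K.T / d ** 0.5
    mask = torch.triu(torch.ones(len(x), len(x), dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ V

# Step 1: the prompt pass fills the cache
K_cache, V_cache = x[:5] @ W_K, x[:5] @ W_V

# Step 2: compute q, k, v for the new token only, and append k and v to the cache
x_new = x[5:6]
q_new = x_new @ W_Q
K_cache = torch.cat([K_cache, x_new @ W_K])
V_cache = torch.cat([V_cache, x_new @ W_V])
cached_out = F.softmax(q_new @ K_cache.T / d ** 0.5, dim=-1) @ V_cache

# Compare against the last row of the from-scratch computation
print(torch.allclose(cached_out, full_attention(x)[5:6], atol=1e-5))
```

No mask is needed in the cached step, because the newest token sits at the end and is allowed to look at everything before it.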
A Worked Example of Memory Consumption
The KV cache saves compute, but it pays for that in memory. Of course it does — the whole fix is to memorize the things you'd otherwise recompute. An unavoidable trade-off. Inescapable karma.
Let's run the actual numbers for GPT-2.
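Here's the KV cache size per token:
$$2 \times n_{\text{layers}} \times d_{\text{model}} \times \text{bytes} = 2 \times 12 \times 768 \times 4 = 73{,}728\ \text{bytes} \approx 72\ \text{KB}$$
- $2$: the two kinds, K and V
- $n_{\text{layers}}$: number of decoder layers (12)
- $d_{\text{model}}$: hidden dimension size (768)
- $\text{bytes}$: the size of one floating-point number (4 bytes for float32)
And for GPT-2's maximum sequence length of 1,024 tokens, it looks like this:
$$73{,}728\ \text{bytes} \times 1{,}024 = 75{,}497{,}472\ \text{bytes} = 72\ \text{MB}$$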
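At GPT-2's scale, 72 MB is downright modest, but in modern LLMs this number balloons fast. (The table below plugs each model's full hidden dimension into the same formula, i.e. it assumes every head keeps its own K and V.)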
| Model | Layers | Dimension | Max sequence length | KV cache (float16) |
|---|---|---|---|---|
| GPT-2 | 12 | 768 | 1,024 tokens | 36 MB |
| Llama 3 8B | 32 | 4,096 | 8,192 tokens | 4 GB |
| Llama 3 70B | 80 | 8,192 | 8,192 tokens | 20 GB |
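The numbers in the table can be reproduced with a few lines of arithmetic. A minimal sketch, assuming the per-token formula from this section and binary units (1 MB = 1,024² bytes); `kv_cache_bytes` is a helper name I made up.

```python
def kv_cache_bytes(n_layers, d_model, seq_len, bytes_per_value):
    """2 (K and V) x layers x hidden size x bytes per value, for every token held."""
    return 2 * n_layers * d_model * bytes_per_value * seq_len

MiB, GiB = 1024 ** 2, 1024 ** 3

print("GPT-2 float32 :", kv_cache_bytes(12, 768, 1024, 4) / MiB, "MB")
print("GPT-2 float16 :", kv_cache_bytes(12, 768, 1024, 2) / MiB, "MB")
print("Llama 3 8B    :", kv_cache_bytes(32, 4096, 8192, 2) / GiB, "GB")
print("Llama 3 70B   :", kv_cache_bytes(80, 8192, 8192, 2) / GiB, "GB")
```

Keep in mind this is per sequence; a server juggling many conversations at once multiplies it by the batch size.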
Advances in Attention
The Attention we covered in this article computes a score for every single token pair, so the compute cost and an ever-bloating KV cache are unavoidable. Modern LLMs have researched and deployed all sorts of approaches in answer to this problem. Let me walk through a few.
GQA (Grouped Query Attention)
In ordinary multi-head Attention, every head carries its own K and V — but GQA shrinks the KV cache by having several Query heads share a single set of K/V heads. In Llama 3 8B, for instance, 32 Query heads are served by just 8 KV heads, bringing the cache size down to a quarter of plain multi-head Attention. Awesome!
Sure, it won't hit the exact same accuracy as full multi-head Attention — but because the Attention mechanism itself is unchanged, it's a wildly practical technique that saves memory while keeping the quality hit minimal. This one's the real deal.
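Here's roughly what "several Query heads share one KV head" looks like as tensors. This is a sketch under my own assumptions: random values, the 32 / 8 head counts from the Llama 3 8B example with a head dimension of 128, and sharing done with `repeat_interleave` at compute time. Real implementations tend to be cleverer about avoiding the copy.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_q_heads, n_kv_heads, head_dim, n_tokens = 32, 8, 128, 5
group = n_q_heads // n_kv_heads                  # 4 Query heads share each KV head

Q = torch.randn(n_q_heads, n_tokens, head_dim)
K = torch.randn(n_kv_heads, n_tokens, head_dim)  # only these two tensors get cached
V = torch.randn(n_kv_heads, n_tokens, head_dim)

# Expand K/V at compute time so that Query head i uses KV head i // group
K_shared = K.repeat_interleave(group, dim=0)     # (32, n_tokens, head_dim)
V_shared = V.repeat_interleave(group, dim=0)

scores = Q @ K_shared.transpose(-2, -1) / head_dim ** 0.5
mask = torch.triu(torch.ones(n_tokens, n_tokens, dtype=torch.bool), diagonal=1)
weights = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
out = weights @ V_shared                         # (32, n_tokens, head_dim)

mha_cache = 2 * n_q_heads * n_tokens * head_dim  # values plain multi-head would cache
gqa_cache = K.numel() + V.numel()                # values GQA actually caches
print(out.shape, gqa_cache / mha_cache)
```

The output still has all 32 heads; only the stored K and V shrink.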
DeltaNet (Gated DeltaNet)
As an approach that attacks Attention's cost head-on, there's a whole family of methods called linear Attention. The one that's been turning heads lately is DeltaNet.
Rather than a KV cache, DeltaNet uses a fixed-size memory matrix, holding the compute down to $O(n)$. Ordinary linear Attention suffers from old information accumulating and interfering, but DeltaNet fixes that with the delta rule (erase the old info first, then write in the new). Way too slick, c'mon......
Naturally, since memory usage stays flat no matter how long the sequence grows, it's a great fit for long-form processing. That said, because everything gets squeezed into a fixed-size state, the fine details of distant information do degrade — but we'll let that slide as part of its charm.
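The "erase first, then write" idea is easier to feel with a toy. Below is a minimal sketch of a delta-rule memory update next to the plain additive one. Heavy caveats: this is my own simplified illustration with a fixed write strength `beta` and a unit-length key, and it leaves out the gating of Gated DeltaNet and everything that makes the real thing fast.

```python
import torch
import torch.nn.functional as F

def additive_update(S, k, v):
    """Plain linear Attention: keep adding outer products, nothing is ever erased."""
    return S + torch.outer(v, k)

def delta_update(S, k, v, beta=1.0):
    """Delta rule: subtract what the memory currently returns for k, then write v."""
    v_old = S @ k                                  # what the memory holds under key k
    return S + beta * torch.outer(v - v_old, k)

torch.manual_seed(0)
d = 16
k = F.normalize(torch.randn(d), dim=0)             # unit-length key
v1, v2 = torch.randn(d), torch.randn(d)

# Write v1 under key k, then write v2 under the very same key
S_add = additive_update(additive_update(torch.zeros(d, d), k, v1), k, v2)
S_delta = delta_update(delta_update(torch.zeros(d, d), k, v1), k, v2)

print(torch.allclose(S_add @ k, v1 + v2, atol=1e-5))   # additive: old and new get blended
print(torch.allclose(S_delta @ k, v2, atol=1e-5))      # delta rule: old value replaced
print(S_delta.shape)                                   # memory size never grows
```

Reading is just `S @ q`, so the cost per token depends on the size of `S`, not on how many tokens came before.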
Hybrid Attention
To enjoy both the accuracy of softmax Attention and the efficiency of linear Attention like DeltaNet, Hybrid Attention setups that combine the two have made it into production as well. Qwen3-Next, for instance, adopts a 3:1 configuration, with one softmax Attention layer for every three linear Attention (Gated DeltaNet) layers, handling long contexts efficiently while holding onto high quality.
Honestly, at my level of technical skill, my reaction is basically "Ahh, I see — a flawless plan! ...assuming, of course, we just turn a blind eye to the teeny-tiny detail that it's impossiiible~~!!!!" — and yet, somehow, it works. Qwen, you're unreal, I swear......
Generating the Output
Logits and the Probability Distribution
After passing through all 12 layers of the Transformer decoder, each token's 768-dimensional vector has the context thoroughly woven into it. But there's a catch: left as a bare 768-dimensional vector, it can't actually decide "what the next token should be."
So what's that supposed to mean, ya know?! For the input "The capital of Japan is," the candidates for the next token are literally all 50,257 token types in the vocabulary, right? It could be "Tokyo," it could be "Kyoto," it could even be "where." The point is, the model needs to turn that 768-dimensional vector into a score — a "how likely to come next" value — for every one of those 50,257 candidates.
Concretely: take the vector at the final token position, multiply it by a weight matrix sized to the vocabulary, and out comes a 50,257-dimensional vector. These raw scores are what we call logits!
$$z = h\, W_{\text{vocab}} + b$$
- $h$: the output vector of the decoder's final layer (768-dim)
- $W_{\text{vocab}}$: the vocabulary-projection weight matrix (768 × 50,257)
- $b$: the bias term
- $z$: the logits vector (50,257-dim); each element corresponds to one token in the vocabulary
Logits are raw scores that can come out positive or negative. Left like that they're unwieldy as all heck, so we push them through the softmax function to turn them into probabilities between 0 and 1.
$$P(i) = \frac{e^{z_i}}{\sum_{j=1}^{|V|} e^{z_j}}$$
- $P(i)$: the probability that token $i$ comes next
- $z_i$: the logit value of token $i$
- $|V|$: the vocabulary size (50,257 token types)
import torch
import torch.nn.functional as F
# Assumes h, W_vocab, b, and tokenizer are already defined as described above
# Compute logits from the decoder's final output
logits = h @ W_vocab + b # (50257-dim)
# Convert to a probability distribution via softmax
probs = F.softmax(logits, dim=-1) # (50257-dim) all elements sum to 1
# Example: the highest-probability token
top_token_id = torch.argmax(probs).item()
print(tokenizer.decode([top_token_id]))
At this point, each of the 50,257 tokens has been assigned a probability of "how likely it is to be the next token."
Sampling Strategies
Once you've got the probability distribution, you select a single next token from it. And there are all kinds of ways to make that choice. Working out which method to use is what we call a sampling strategy (or decoding strategy).
Greedy
You simply pick the highest-probability token. Simple is best.
What's nice about greedy is that it's deterministic and reproducible — but since it always goes for the "safest" option, the text it generates tends to come out monotonous. Here's the thing about us humans, see! Even knowing full well the other side's a machine — yeah! — it still stings a little when it even talks like one!!
Temperature
So this is what you reach for when you want to inject some human-like wobble. Father, I'll adjust the Temperature and make Rei Ayanami smile. Concretely, by dividing the logits by a constant $T$ (the Temperature) before computing the softmax, you tune the "sharpness" of the probability distribution.
$$P(i) = \frac{e^{z_i / T}}{\sum_{j} e^{z_j / T}}$$
- $T$: the Temperature parameter
- $T < 1$: the distribution sharpens (high-probability tokens become more dominant), giving more certain, conservative output
- $T = 1$: the original distribution, unchanged
- $T > 1$: the distribution flattens (low-probability tokens get picked more easily), giving more diverse, creative output
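A quick sketch of what the Temperature knob does to a toy distribution. The four logits are invented for illustration, and `softmax_with_temperature` is just a name I picked.

```python
import torch
import torch.nn.functional as F

def softmax_with_temperature(logits, T):
    """Divide the logits by T, then apply the usual softmax."""
    return F.softmax(logits / T, dim=-1)

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])   # toy logits for a 4-token vocabulary

for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
```

Watch the first entry: it grabs more of the probability mass as T drops and gives it back as T rises. Note that T = 0 would divide by zero, which is why "temperature 0" is usually treated as plain greedy.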
Top-k Sampling
You keep only the top $k$ tokens by probability as candidates, then pick randomly from among them. This keeps extremely low-probability tokens from getting chosen. Simple, but effective.
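As a sketch, Top-k fits in a handful of lines. I'm assuming here that "pick randomly" means sampling in proportion to the renormalized probabilities of the survivors, which is the usual reading; `sample_top_k` and the toy logits are made up for this example.

```python
import torch
import torch.nn.functional as F

def sample_top_k(logits, k, generator=None):
    """Keep the k largest logits, renormalize, and draw one token ID."""
    top_values, top_indices = torch.topk(logits, k)
    probs = F.softmax(top_values, dim=-1)            # probabilities over the k survivors
    choice = torch.multinomial(probs, num_samples=1, generator=generator)
    return top_indices[choice].item()

gen = torch.Generator().manual_seed(0)
logits = torch.tensor([3.0, 2.5, 0.1, -1.0, -4.0])   # toy 5-token vocabulary

samples = [sample_top_k(logits, k=2, generator=gen) for _ in range(200)]
print(sorted(set(samples)))                          # which token IDs ever got picked
```

With k = 1 this collapses into greedy, so k is really a dial between "boring" and "anything goes."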
Top-p Sampling (Nucleus Sampling)
You add up tokens in descending order of probability, taking everything up until the cumulative probability crosses $p$ (say, 0.9) as your candidate pool. Unlike Top-k, the number of candidates shifts dynamically with the shape of the distribution. When the probability is concentrated, you're choosing from a handful of tokens; when it's spread out, from many. Pretty slick.
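Here's a sketch of that "candidate pool changes size" behavior. One assumption to flag: I include the token that pushes the cumulative probability over the threshold, which is a common convention but not the only one. `top_p_filter` and the two toy distributions are my own.

```python
import torch

def top_p_filter(probs, p):
    """Zero out everything outside the nucleus and renormalize."""
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep a token if the mass *before* it is still below p,
    # so the token that crosses the threshold is included.
    keep = (cumulative - sorted_probs) < p
    filtered = torch.zeros_like(probs)
    filtered[sorted_idx[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()

peaked = torch.tensor([0.85, 0.10, 0.03, 0.02])   # probability concentrated on one token
flat = torch.tensor([0.30, 0.25, 0.25, 0.20])     # probability spread out

peaked_pool = top_p_filter(peaked, 0.9)
flat_pool = top_p_filter(flat, 0.9)
print(int((peaked_pool > 0).sum()), int((flat_pool > 0).sum()))

# Drawing the actual next token from the filtered distribution
next_token = torch.multinomial(flat_pool, num_samples=1).item()
```

Same threshold, different pool sizes: that's the whole selling point over a fixed k.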
From Tokens to Text
Detokenization
Sampling has locked in a single next token ID. The process of turning that token ID back into text is detokenization. 'Cause us humans can't read machine code, dammit!!
The mechanism is simply tokenization run backwards — you just look up the string that the token ID maps to in the vocabulary table.
# A token ID obtained from sampling
next_token_id = 11790 # example
# Detokenize: convert the ID back into a string
next_token_str = tokenizer.decode([next_token_id])
print(next_token_str) # e.g. " Tokyo"
You can also decode several token IDs all at once.
token_ids = [15496, 995] # "Hello world"
text = tokenizer.decode(token_ids)
print(text) # "Hello world"
Because this is the inverse of BPE's subword splitting, token boundaries and word boundaries don't always coincide. For example, if "playing" had been split into the two tokens ["play", "ing"], decoding stitches them back together into the original word.
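To make the "stitching back together" concrete without downloading anything, here's a toy version. The five-entry `toy_vocab` is entirely hypothetical (real GPT-2 has 50,257 entries and operates on bytes), but the leading-space convention mimics how GPT-2's tokens carry their own whitespace.

```python
# Hypothetical toy vocabulary, for illustration only
toy_vocab = {0: "The", 1: " kids", 2: " are", 3: " play", 4: "ing"}

def toy_decode(token_ids):
    """Look up each ID's string and concatenate; no separator is inserted."""
    return "".join(toy_vocab[i] for i in token_ids)

print(repr(toy_decode([3, 4])))
print(repr(toy_decode([0, 1, 2, 3, 4])))
```

Since the spaces live inside the tokens, plain concatenation is all it takes to get word boundaries right.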
The Big Picture of the Autoregressive Loop
To close out, let's pull everything from all the sections so far into a single autoregressive loop.
This is the recap episode. Rejoice:)
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
# Load the model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
# Input text
prompt = "The capital of Japan is"
input_ids = tokenizer.encode(prompt, return_tensors="pt") # Tokenize
# Autoregressive loop
max_new_tokens = 20
generated = input_ids
for _ in range(max_new_tokens):
    with torch.no_grad():
        outputs = model(generated) # Embedding → 12 decoder layers → logits
    logits = outputs.logits[:, -1, :] # logits at the final token position (50257-dim)
    # Sampling (Greedy here)
    next_token = torch.argmax(logits, dim=-1, keepdim=True)
    # Stop if it's the EOS (end token)
    if next_token.item() == tokenizer.eos_token_id:
        break
    # Append the generated token to the input and loop again
    generated = torch.cat([generated, next_token], dim=-1)
# Detokenize
output_text = tokenizer.decode(generated[0])
print(output_text)
Every single step we've covered in this article is running inside this short snippet of code.
I've been clattering on about all sorts of complicated stuff, but once you actually implement it, this is about all there is to it.
- `tokenizer.encode`: text to tokens (BPE)
- `model(generated)`: Embedding → positional encoding → 12 Transformer decoder layers (Attention + FFN) → logits
- `torch.argmax`: sampling (picking the next token from the probability distribution)
- comparison against `tokenizer.eos_token_id`: deciding when to stop, via the EOS (special token)
- `torch.cat`: in this naive implementation without a KV cache, the already-generated tokens get appended to the input and recomputed
- `tokenizer.decode`: tokens to text (detokenization)
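For comparison, here's a sketch of the same greedy loop with the KV cache switched on, using the `past_key_values` / `use_cache` arguments of Hugging Face's GPT-2 classes. To keep it download-free I'm assuming a tiny, randomly initialized model with arbitrary sizes and made-up token IDs, so the output is gibberish by design; the point is only that both loops pick the same tokens. The EOS check is left out for brevity.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

torch.manual_seed(0)
# Tiny randomly initialized model so nothing is downloaded; sizes are arbitrary
config = GPT2Config(vocab_size=100, n_positions=64, n_embd=32, n_layer=2, n_head=2)
model = GPT2LMHeadModel(config).eval()
prompt = torch.tensor([[5, 17, 42, 8, 23]])          # 5 made-up token IDs

def generate_naive(input_ids, steps):
    ids = input_ids
    with torch.no_grad():
        for _ in range(steps):
            logits = model(ids).logits[:, -1, :]      # whole sequence recomputed
            ids = torch.cat([ids, logits.argmax(-1, keepdim=True)], dim=-1)
    return ids

def generate_cached(input_ids, steps):
    ids, step_input, past = input_ids, input_ids, None
    with torch.no_grad():
        for _ in range(steps):
            out = model(step_input, past_key_values=past, use_cache=True)
            past = out.past_key_values                # K and V for everything so far
            next_token = out.logits[:, -1, :].argmax(-1, keepdim=True)
            ids = torch.cat([ids, next_token], dim=-1)
            step_input = next_token                   # next pass feeds only the new token
    return ids

naive_ids = generate_naive(prompt, 10)
cached_ids = generate_cached(prompt, 10)
print(torch.equal(naive_ids, cached_ids))
```

The only real difference is what gets fed in each pass: the whole sequence in the naive loop, a single token plus the cache in the other.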
The End, and What Comes Next
All right. And so, y'all have officially made it to the end of this long-winded, cringe-inducing wall of drivel. Nice work.
As I noted at the very start — and it bears repeating — LLM-kun turned out to be no magic wand, no ultimate general intelligence, and definitely no god. Text gets turned into numbers, operated on as vectors, the next token gets drawn from a probability distribution, and the whole thing turns back into text...... This one continuous data-transformation pipeline is the true identity of the "Chappie" we all use every single day.
What I've covered here is nothing more than the basic inference architecture. The latest technical methods, the many extensions for actually putting LLMs to work, the details of "training" — the crucial counterpart to inference — etcetera, etcetera. There's still a whole lot left to learn.
I'm hoping to write all that up too, bit by bit.
Our battle is only just beginning!! Be sure to look forward to Sensei's next series!
Once more, and in closing: I hope this article proves to be a small aid in the understanding of everyone out there who wants to learn about AI.
Farewell.




