Everyone's talking about AI like it's magic. I work with it daily. It's not. Here's what's actually happening inside.
I've fine-tuned LLMs. I've published research on them. I've built systems around them.
And the single most honest thing I can tell you about large language models is this:
At the bottom, it's matrix multiplication. That's it.
Not intelligence. Not reasoning. Not understanding. Matrices of floating point numbers being multiplied together, billions of times per second.
But here's the uncomfortable part. That doesn't mean nothing interesting is happening.
Let me break this down without the hype, without the doomsaying, and without the marketing.
What a "Model" Actually Is
Forget the word "model." It carries too much baggage.
What you're actually dealing with is a file. A very large file full of numbers: floats arranged in matrices. GPT-2 has 117 million of them; GPT-3 has 175 billion. These numbers are called weights.
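To make "numbers in a file" concrete, here's a sketch that tallies the weights of a GPT-2-small-shaped network directly from its architecture config (12 blocks, 768-dim embeddings, 50,257-token vocabulary, 1,024-token context). The breakdown below follows the standard GPT-2 layout; note the exact tally lands near 124M, a bit above the commonly quoted 117M figure from the original paper.

```python
# Count the floats in a GPT-2-small-shaped network from its config.
# (12 transformer blocks, d_model=768, vocab=50257, context=1024.)
d, layers, vocab, ctx = 768, 12, 50257, 1024

params = vocab * d            # token embedding table
params += ctx * d             # learned position embeddings
for _ in range(layers):
    params += 2 * d                  # layernorm 1 (scale + bias)
    params += d * 3 * d + 3 * d      # fused Q, K, V projection
    params += d * d + d              # attention output projection
    params += 2 * d                  # layernorm 2
    params += d * 4 * d + 4 * d      # MLP up-projection
    params += 4 * d * d + d          # MLP down-projection
params += 2 * d               # final layernorm

print(params)  # 124439808
```

Every one of those ~124 million numbers is just a float in a file. That's the whole artifact.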
That's the model. Numbers in a file.
When you send a message to an LLM, here's what happens mechanically:
- Your text gets converted to tokens (integers from a vocabulary)
- Each token gets looked up in an embedding table (a matrix) and becomes a vector
- Those vectors pass through N structurally identical blocks (each with its own weights), each doing attention (matmuls) and a feedforward network (matmul + nonlinearity)
- The final layer produces logits (raw scores) over the vocabulary
- Softmax turns the logits into a probability distribution; sample from it to get the next token
Repeat until done.
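The steps above can be sketched end to end in a few dozen lines of NumPy. This is a toy: the weights are random (so the output is noise), there's one block instead of N, and real models add layernorms and multiple attention heads. But the mechanics, lookup, matmuls, softmax, sample, are exactly these.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, T = 50, 16, 5          # toy vocab size, embedding dim, sequence length

# "The model": matrices of floats, nothing else. (Random here; trained in reality.)
E = rng.normal(0, 0.02, (V, d))        # embedding table
Wq, Wk, Wv, Wo = (rng.normal(0, 0.02, (d, d)) for _ in range(4))
W1 = rng.normal(0, 0.02, (d, 4 * d))   # feedforward up-projection
W2 = rng.normal(0, 0.02, (4 * d, d))   # feedforward down-projection
Wu = rng.normal(0, 0.02, (d, V))       # unembedding -> logits

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

tokens = np.array([3, 17, 42, 7, 7])   # step 1: text -> token ids (pretend)
x = E[tokens]                          # step 2: embedding lookup -> (T, d) vectors

# step 3: one block of attention (matmuls) + feedforward (matmul + nonlinearity)
q, k, v = x @ Wq, x @ Wk, x @ Wv
mask = np.tril(np.ones((T, T)))                        # causal: no peeking ahead
scores = np.where(mask == 1, q @ k.T / np.sqrt(d), -1e9)
x = x + softmax(scores) @ v @ Wo                       # attention + residual
x = x + np.maximum(0, x @ W1) @ W2                     # ReLU feedforward + residual

logits = x[-1] @ Wu                    # step 4: logits over the vocabulary
probs = softmax(logits)                # step 5: a distribution over next tokens
next_token = rng.choice(V, p=probs)    # sample it; append; repeat
```

Everything in that loop is array indexing, matrix multiplication, and one exponential. That's the entire inference path.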
No memory beyond the context window. No hidden state carried between calls. No "thinking." Pure function application.
What "Training" Actually Is
This is where people get philosophical, so let me be precise.
Training is not teaching. There's no curriculum, no explanation, no understanding being transferred.
Here's the actual process:
- Show the model some text
- It predicts the next token
- Compare prediction to actual next token, compute loss (a single number)
- Backpropagate gradients through every matrix
- Nudge every weight by a tiny amount in the direction that reduces loss
- Repeat roughly a trillion times
The only signal the model ever receives is: your probability distribution over the next token was wrong, adjust.
No grammar lessons. No semantic explanations. No world knowledge explicitly provided. Just: you were wrong, here's by how much, here's which direction to shift.
And yet grammar emerges. Semantics emerges. World knowledge emerges.
That's not magic. That's what happens when you apply a single optimization pressure billions of times across the entire written record of human thought.
Why Next-Token Prediction Even Works
This is the question nobody asks clearly enough.
The implicit assumption is that predicting the next word is a shallow task. A parlor trick.
It's not. Here's why.
Language is not random. Language is massively structured at every level simultaneously:
- Surface statistics: "New" is followed by "York" with near-deterministic frequency
- Syntax: "The ___" expects a noun or adjective, not a verb. Every time.
- Semantics: "eat" expects a food object. "drink" expects liquid. Violations sound wrong.
- World knowledge: "Paris is the capital of ___" has essentially one answer across millions of documents
- Discourse: A medical article doesn't randomly switch to cooking. Topics are coherent.
- Causal structure: "He dropped the glass. It ___" invites "shattered," not "floated." Physics is implicitly encoded because text describing the physical world is consistent.
To predict the next token accurately across all of these simultaneously, the model is forced to learn all of these regularities. Not because anyone labeled them. Because they're all load-bearing for loss reduction.
Shannon estimated English at roughly one bit of true entropy per character (his 1951 bounds were 0.6 to 1.3 bits). Language is not a high-entropy signal. It's a highly compressible, deeply structured one.
Next-token prediction works because language itself is learnable. The model just exploits that ruthlessly.
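You can see the compressibility with a back-of-the-envelope measurement. The sketch below computes the unigram (per-character) entropy of a short English string; this ignores all context, so it's only an upper bound on the true entropy rate, and it already comes out well below the ~4.7 bits a uniform alphabet would give. Shannon's ~1 bit figure used human prediction with full context.

```python
import math
from collections import Counter

text = ("language is not random language is massively structured "
        "at every level simultaneously").lower()

counts = Counter(text)
n = len(text)

# Unigram entropy: -sum p(c) * log2 p(c). Context-free, so it's an
# upper bound on the true entropy rate of the text.
H = -sum((c / n) * math.log2(c / n) for c in counts.values())
print(round(H, 2))   # well below log2(27) ~ 4.75 bits for uniform letters+space
```

Add context (bigrams, trigrams, full sentences) and the bound keeps dropping. That headroom between the uniform baseline and the true entropy is exactly what a language model learns to exploit.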
What the Weights Actually Encode
Here's the honest answer: nobody fully knows.
What we do know from interpretability research is directional. Early layers tend to capture surface patterns, later layers tend to capture more abstract task-relevant signals. But it's distributed, messy, and not cleanly decomposable.
You cannot point to a weight and say "this one handles subject-verb agreement."
What the weights collectively store is a giant tangled function that behaves as if it knows grammar, semantics, world facts, and reasoning patterns. Because that behavior is what minimizes loss on human-generated text.
It's not a list of rules. It's a compressed statistical model of human language and thought, discovered purely by gradient pressure.
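The phrase "compressed statistical model of language, discovered purely by optimization pressure" has a degenerate but instructive miniature: a character bigram table built by counting. No rules are written down anywhere, yet the table behaves as if it knows some English. This toy is obviously nothing like a transformer; it's here to make "statistics in, behavior out" tangible.

```python
from collections import defaultdict, Counter

corpus = "the cat sat on the mat. the dog sat on the log."

# "Training": count which character follows which. The table of counts
# IS the model -- a (tiny) statistical summary of the corpus.
follows = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    follows[a][b] += 1

# "Inference": repeatedly emit the most likely next character.
out = "t"
for _ in range(10):
    out += follows[out[-1]].most_common(1)[0][0]
print(out)   # the the the
```

Nobody told it "h follows t" or "a space follows e." The counts did. An LLM's weights store something vastly richer than a count table, but the relationship between the stored statistics and the emitted behavior is the same in kind.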
"Just Matrix Multiplication" Is Not the Same as "Trivial"
This is where I'll push back on the reductive take, including my own initial instinct.
Yes, it's matmul. But DNA is just chemical interactions. The brain is just electrical signals. Both produce systems of staggering complexity.
The Universal Approximation Theorem tells us that even a single hidden layer with a nonlinearity can approximate any continuous function given enough width. The architecture isn't doing something magical. But the composition of many simple operations produces a function of extraordinary complexity.
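A micro-example of how simple pieces compose into nonlinear behavior: two ReLU units and a sum reproduce the absolute-value function exactly, since |x| = relu(x) + relu(-x). Piecewise-linear fragments like this, stacked and widened, are the raw material the approximation results are built from.

```python
def relu(x):
    # The standard rectified linear unit: max(0, x).
    return max(0.0, x)

def abs_net(x):
    # A "network" of two ReLU units computes |x| exactly:
    #   |x| = relu(x) + relu(-x)
    return relu(x) + relu(-x)

for x in [-3.5, -1.0, 0.0, 2.0, 7.25]:
    assert abs_net(x) == abs(x)
```

Nothing in `relu` knows about absolute values. The behavior lives entirely in the composition, which is the point of the argument above, writ small.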
"Just matmul" at the mechanistic level does not imply "nothing meaningful" at the behavioral level.
What it does imply is that the meaningful behavior is not designed, it's emergent. And that's a genuinely different and more honest framing than either "it's just statistics" or "it's basically thinking."
The Transformer Is Not The Only Answer
Here's something the hype cycle obscures: the transformer architecture is not special in principle. It's dominant in practice.
What you actually need to exploit language regularities:
- A way to consume sequential input
- A way to condition predictions on context
- Enough capacity to store learned regularities
- A next-token prediction objective on enough data
RNNs satisfied all four. So did CNNs on text. So do State Space Models like Mamba, which use no attention at all, run in O(n) instead of O(n²), and are competitive with transformers on many benchmarks today.
The transformer won because attention handles long-range context without compression, and because it parallelizes perfectly on GPUs. The hardware fit mattered as much as the architecture itself.
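The O(n²) versus O(n) distinction is easy to see in code. The sketch below is deliberately stripped down: the attention side just materializes the n-by-n score matrix, and the recurrent side is a bare linear recurrence, the skeleton that SSMs like Mamba elaborate on (real SSMs use structured, input-dependent state updates, not a single scalar decay).

```python
import numpy as np

n, d = 256, 32
x = np.random.default_rng(0).normal(size=(n, d))

# Attention: every token scores against every other token.
# The score matrix alone is n x n -- quadratic in sequence length.
scores = x @ x.T                       # (256, 256): 65,536 entries
assert scores.shape == (n, n)

# A linear recurrence: one fixed-size state, updated once per token.
# n steps, each O(d) -- linear in sequence length.
a, h = 0.9, np.zeros(d)
states = []
for t in range(n):
    h = a * h + x[t]                   # state carries compressed history forward
    states.append(h.copy())
```

Double the context and the score matrix quadruples while the recurrence merely doubles. The transformer's bet is that paying the quadratic cost to avoid compressing history is worth it; the SSM's bet is the opposite.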
Mamba is a serious contender. Hybrid architectures mixing SSM and attention layers are emerging. Five years from now, the dominant architecture is probably neither pure transformer nor pure SSM.
And honestly? It's probably something nobody is currently working on.
What We Still Don't Know
I want to be precise here because this matters.
We don't know why scaling works.
We know that scaling works: more parameters plus more data plus more compute yields better performance, with surprising consistency. But the mechanism by which scaling produces emergent capabilities (tasks the model couldn't do at smaller scale suddenly working at larger scale) is genuinely unresolved.
"Combinatorial pattern composition" is a label for the mystery, not an explanation of it.
We built something that behaves intelligently. We know the training procedure in full detail. We do not know why that procedure, at scale, produces the behavior it does.
That's not a gap that marketing will fill. That's an open research problem.
The Real Summary
| Claim | Truth |
|---|---|
| It's just matrix multiplication | True, mechanistically |
| Nothing meaningful is happening | False |
| It understands language | False |
| It learns statistical structure | True |
| We fully understand why it works | False |
| The transformer is the final architecture | Almost certainly false |
Why This Framing Matters
If you think LLMs are magic, you'll use them wrong. You'll trust outputs that are confident but wrong, anthropomorphize failure modes, expect capabilities that aren't there.
If you think LLMs are "just statistics" and therefore uninteresting, you'll also use them wrong. You'll dismiss genuine capabilities and fail to see where they're reliable and where they're brittle.
The accurate framing is: a large function approximator trained via optimization that exhibits emergent structured behavior, whose full mechanism we don't yet understand.
Not magic. Not trivial. Something genuinely new that deserves precise thinking.
I research LLM training, continual learning, and quantization. If this sparked something, let's discuss in the comments.