Three Rounds of Training Turn a Word-Predictor Into a Chatbot. None of Them Are Magic.

#ai #llm #machinelearning #beginners

Last time I argued that the Transformer, the architecture under basically every model you've heard of, is just three plain engineering fixes stacked together. A shortcut, a rescale, and a weighted lookup. None of them magic.

Then I ended on a cheat. I said architecture was only one leg of the stool, that the other two were scale and "the pretraining-plus-alignment recipe that turns a raw next-word predictor into something worth talking to," and that those were a different post. This is that post.

Here's the part that surprises people. Build a Transformer, pour the entire internet through it, spend hundreds of millions of dollars on compute, and you'll still have something that can't reliably answer a question. A brilliant text machine that won't do what you ask. The gap between that and the thing in your chat window is three rounds of training. The first builds the raw engine. The next two are each a fix for a specific, annoying way the round before it left the model broken.

Step one: guess the next word, forever

The training objective is almost embarrassing when you say it out loud. Show the model a stretch of text with the last word hidden, and make it guess that word. Score the guess. Nudge the weights. Do this a few trillion times.

That's it. That's pretraining. There's no human grading the answers, because the answer is just the next word, already sitting right there in the text. The data labels itself. That one fact is why this can run on raw internet text instead of on something a person had to annotate by hand, and it's the whole reason the thing can scale at all.
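Here's a toy sketch of "the data labels itself," under some loud assumptions: real models work on tokens (words or pieces of words) rather than whitespace-split words, and the probabilities below are invented by hand, not produced by any model. The function names are made up for illustration. The scoring rule shown is cross-entropy, the standard way to grade a next-word guess.

```python
import math

def next_word_examples(text):
    """Turn raw text into (context, target) pairs; the label is just the next word."""
    words = text.split()  # assumption: whitespace words stand in for real tokens
    return [(words[:i], words[i]) for i in range(1, len(words))]

def cross_entropy(predicted_probs, target):
    """Score one guess: low when the model put high probability on the true next word."""
    return -math.log(predicted_probs.get(target, 1e-12))

examples = next_word_examples("the capital of France is Paris")
context, target = examples[-1]
# context == ['the', 'capital', 'of', 'France', 'is'], target == 'Paris'

# Hypothetical model outputs: a probability for each candidate next word.
confident = {"Paris": 0.9, "Lyon": 0.05, "London": 0.05}
unsure = {"Paris": 0.2, "Lyon": 0.4, "London": 0.4}

loss_confident = cross_entropy(confident, target)
loss_unsure = cross_entropy(unsure, target)
# The confident, correct guess gets the lower loss. Training nudges weights that way.
```

Nobody wrote a label anywhere in that snippet. One sentence of six words produced five training examples for free, which is the property that lets this run on raw text.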

The trick is what guessing the next word forces. To finish "the capital of France is __" you need a fact. To finish "she opened the door and __" you need a sense of how stories go. To finish a line of code you need to track variables and brackets. The objective looks trivial, but the best way we've found to get good at it, across all the text there is, turns out to be learning a surprisingly rich model of the stuff the text is about. Predicting the next word turns out to be a side door into learning almost everything.

And this is where scale lives. Around 2020 people noticed the error doesn't drop in fits and starts as you grow the model. It falls along a smooth curve. Add more parameters, more data, more compute in the right ratio, and the loss keeps sliding down in a way you can sketch on a log plot and more or less extrapolate. That predictability is most of why anyone was willing to spend hundreds of millions of dollars on a single training run. You weren't gambling blindly. You were buying a fairly predictable amount of "better."
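To make "sketch on a log plot and more or less extrapolate" concrete, here's a minimal sketch. It assumes the loss follows a pure power law in compute, loss ≈ a · compute^(-b), which is a simplification (fits in the literature usually add a floor the loss never drops below). The data points are made up for illustration and are not measurements from any real model.

```python
import math

def fit_power_law(compute, loss):
    """Fit a straight line through (log compute, log loss).

    Returns (a, b) such that loss is approximately a * compute ** -b.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Invented numbers: four small "training runs" that happen to sit on a power law.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [4.0 * (c / 1e18) ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)

# Extrapolate two orders of magnitude past the largest run we "measured."
predicted_loss_at_1e23 = a * 1e23 ** -b
```

The design point is that a power law is a straight line once both axes are logarithmic, so ordinary line fitting on small runs gives a forecast for a run you haven't paid for yet.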

I'll add one caveat, because the field argues about it. Past certain sizes, models seem to suddenly "get" things they couldn't do before, and it's still unsettled how much of that is a real jump versus an artifact of how we measure it. So, plainly: bigger reliably buys lower loss. The claim that bigger suddenly unlocks brand-new skills is the more exciting story, and it's only partly true.

The genius who won't answer the question

So now you've got this enormously capable thing. You type a question. And it does something maddening. It writes three more questions.

A freshly pretrained model, a "base model," has learned exactly one habit: continue text the way the internet would. Nothing more. It has never once been asked to be helpful, because "be helpful" was not the objective. "Guess the next word" was the objective. So you type "How do I reset my router?" and it reasonably continues with "How do I change my Wi-Fi password? How do I find my IP address?" because on the actual internet, a question like that mostly shows up in a list of similar questions.

It's a brilliant mimic with no notion that there's a you on the other side who wants something. Think of an improv actor who will match any scene you start and never, ever break character to ask what you actually need. The knowledge is all in there. The willingness to point it at your problem is not. That's not a bug in the model. We just haven't told it what the job is yet.

Step two: hand it a script

The first fix is the obvious one. Show it the job.

You collect a few thousand examples of the behavior you want: a prompt, paired with a good response a human wrote or approved. Helpful, on-topic, answers the actual question. Then you keep training on those. This is supervised fine-tuning, or instruction tuning, and the important thing is what it does and doesn't change. Most of what it changes is behavior, not knowledge. It teaches the model which of the many voices it already contains is the one to use. Out of every way the internet completes a question, "directly answer it like a competent assistant" is now the default instead of "list more questions."
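Mechanically, this is the same next-word training as before, just on curated pairs. A rough sketch of one training example follows. It assumes a common convention the description above doesn't spell out: the prompt is given as context, and the model is only graded on the response words. The helper names, the example response, and the whitespace "tokenizer" are all hypothetical.

```python
def build_sft_example(prompt, response):
    """Join prompt and response, and mark which positions count toward the loss."""
    prompt_tokens = prompt.split()      # assumption: whitespace words as tokens
    response_tokens = response.split()
    tokens = prompt_tokens + response_tokens
    # 0 = prompt word (context only), 1 = response word (the model is graded here).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

def masked_average_loss(per_token_losses, loss_mask):
    """Average the next-word loss over the response positions only."""
    graded = [loss for loss, keep in zip(per_token_losses, loss_mask) if keep == 1]
    return sum(graded) / len(graded)

tokens, mask = build_sft_example(
    "How do I reset my router?",
    "Hold the reset button for ten seconds.",
)

# Pretend per-word losses: bad on the prompt, good on the response.
fake_losses = [5.0] * 6 + [1.0] * 7
graded_loss = masked_average_loss(fake_losses, mask)
```

Masking the prompt is why this teaches "answer the question" rather than "write more questions": the only words that move the weights are the ones in the answer.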

You're not making the actor smarter. You're handing them a script that says: this is the character. Helpful, plainspoken, replies to the person in front of you. They could always play this part. Now they know it's the one you want.

The striking evidence here is that this works without much extra size. Back in 2022 the InstructGPT work showed people preferred the answers from a 1.3-billion-parameter tuned model over those from the raw 175-billion-parameter GPT-3, a model more than a hundred times its size. (To be fair, that small model got both this step and the preference training described next.) The polish wasn't in the parameters. It was in the examples of how to behave.

Step three: coach it by taste

Scripts only get you so far. You can't write an example for every situation, and a lot of what makes an answer good is fuzzy. Is this too long? Too hedgy? Is it confidently wrong? You feel the difference more than you can spell it out, which means you can't really write it into a rulebook.

So you stop trying to write the answer and start judging answers instead. Have the model produce two responses to the same prompt. Show both to a person and ask which is better. Just that, the comparison, over and over. Then train a second model to predict those human choices, so it can hand any answer a score. Now you've turned "good response," the thing nobody could define, into a number. And a number is something the main model can chase. You let it keep adjusting itself to score higher, with a leash that stops it drifting too far from the sensible script it already learned.
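Both halves of that can be sketched in a few lines. The assumptions here are mine: the scoring model is trained with a Bradley-Terry style pairwise loss (a common choice for this job), and the "leash" is a penalty proportional to how much more likely the tuned model finds its answer than the original model did. The function names and the `beta` value are hypothetical.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise loss for the scoring model: small when the preferred answer scores higher."""
    margin = score_chosen - score_rejected
    return math.log(1.0 + math.exp(-margin))  # same as -log(sigmoid(margin))

def leashed_reward(score, logp_policy, logp_reference, beta=0.1):
    """Score from the reward model, minus a penalty for drifting from the reference model."""
    drift = logp_policy - logp_reference  # > 0 when the tuned model has moved toward this answer
    return score - beta * drift

# The scorer agrees with the human: low loss. It disagrees: high loss.
agrees = preference_loss(score_chosen=2.0, score_rejected=0.0)
disagrees = preference_loss(score_chosen=0.0, score_rejected=2.0)

# Same raw score, but the second answer required drifting away from the original model.
no_drift = leashed_reward(score=1.0, logp_policy=-5.0, logp_reference=-5.0)
some_drift = leashed_reward(score=1.0, logp_policy=-2.0, logp_reference=-5.0)
```

The penalty is what keeps "chase the number" from turning into "say whatever weird thing fools the scorer." A larger `beta` means a shorter leash.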

That's reinforcement learning from human feedback, RLHF, and it's where the manners come from. The tone, the refusals, the instinct to add a caveat, the way it tries to figure out what you meant rather than what you literally typed. All distilled from piles of "this one's better than that one." (The machinery keeps getting simpler, too. A 2023 method called DPO showed you can skip the separate scoring model and the reinforcement-learning loop and just train straight on the pairs of human choices. Another fix that turned out to be smaller than it looked.)
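For the curious, the DPO objective fits in one short function. This is a sketch for a single preference pair, assuming you already have four log-probabilities: how likely the model being trained and a frozen reference copy each find the chosen and the rejected answer. The numbers in the usage lines are invented, and `beta` is a tunable strength whose value here is arbitrary.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of answers to the same prompt."""
    # How much more the trained model likes each answer than the reference does.
    chosen_gain = logp_chosen - ref_logp_chosen
    rejected_gain = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    return math.log(1.0 + math.exp(-margin))  # same as -log(sigmoid(margin))

# Before any training the model equals the reference, so the margin is zero.
start = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# After shifting probability toward the chosen answer and away from the rejected one.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

No separate scoring model appears anywhere, and the reference log-probabilities play the role of the leash: the loss only rewards movement relative to where the model started.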

Two things worth saying plainly, because this step is the one people mythologize. First, it's not neutral. Those preferences came from specific humans following specific written guidelines. The model's politeness and its lines in the sand are choices somebody made, not laws of nature. Second, optimizing for "what raters liked" has a known failure mode: the model learns that agreeing with you and sounding confident tends to win, so it'll sometimes tell you what you want to hear. The taste you trained it on is exactly the taste it will start gaming.

What you're actually talking to

Stack the three rounds and the whole recipe comes apart cleanly. This is the part I'd keep if you forget everything else.

Pretraining, fed by scale, pours in the knowledge and the fluency. It builds something that knows an enormous amount and has no idea you exist. Instruction tuning picks the assistant out of that crowd of voices and pushes it to the front, so all that knowledge finally points at the person asking. Preference tuning then files down the judgment and the manners, round after round of "this one's better than that one," until the thing is pleasant and mostly safe to hand to a stranger.

That's the entire arc in one breath. A predictor that knew everything but wouldn't help. A helper with a fixed script and no taste. A model with taste it occasionally games. The foundation does the learning, and the two layers on top of it each fix what the layer beneath left broken. You can point at exactly which round gave the model which trait.

The personality you chat with is the top coat. A thin, deliberate finish brushed over a raw next-word predictor. Scrape it off and the improv mimic is still down there, ready to continue your text the way the internet would.

The unglamorous truth, again

Same punchline as the architecture post, which I think is the honest one. Nobody needed a theory of intelligence to build this. They needed a dumb objective that happened to force real learning, a curve that said scale would pay off, and two rounds of "no, like this" to make the result usable. Predict a word. Show it the job. Coach it by taste.

What I'll cop to is that the legibility runs out faster here than it did with the architecture. The shortcut and the rescale, you can fully follow. But why guessing the next word at this scale produces something that can hold a conversation, nobody can really tell you yet. We can build it, steer it, and measure it. We mostly can't explain it. That's the open part, and it's the one I'd actually like to read a post about. I just can't write that one yet, because as far as I can tell, no one can.