Welcome back to AI From Scratch.
This is Day 8/30 of the Understanding Beginner AI series.
Where we are:
Days 1–5: how the brain works — tokens, weights, transformers, attention.
Day 6: why bigger models often feel smarter (and when that breaks).
Day 7: how base models turn into instruction‑tuned assistants that actually listen.
Today’s question:
How did these models go from “super smart autocomplete”
to something that tries to be helpful, polite, and safe?
Short answer: humans got into the training loop.
That upgrade has a name: Reinforcement Learning from Human Feedback (RLHF).
The problem: powerful, but kind of feral
Imagine a pure base model, fresh out of pretraining.
It has read half the internet, can mimic lots of styles, knows tons of facts — but no one has told it what good behavior looks like.
So it can:
- Spit out toxic stuff (because the internet has plenty).
- Argue with you, overshare, or confidently hallucinate.
- Ignore instructions and just continue text in weird ways.
In other words: raw capability, zero manners.
Companies realized: if we release that to the public, it will be a PR and safety disaster. They needed a way to bend the model toward being helpful, harmless, and honest — the famous “HHH” alignment goals.
So what this means for you: the model you chat with today is not the raw brain — it’s the raw brain plus a bunch of extra training to make it behave more like a decent human teammate.
RLHF in one line: “Do more of what humans like”
Traditional reinforcement learning is:
“Take an action, get a reward, update your behavior to get more reward next time.”
RLHF just swaps out “game score” or “environment reward” for “what a human preferred.”
Instead of:
“You got +10 for reaching the goal in a maze”
we use:
“Humans liked answer A more than answer B, so A gets a higher ‘reward’.”
Then we train the model to prefer answers humans tend to prefer.
So what this means for you: RLHF is literally teaching the model, “When in doubt, act more like this human‑approved answer, not that one.”
Step 1: start with a capable base model
RLHF doesn’t replace pretraining; it rides on top of it.
The recipe starts with:
A pretrained base model that already knows language, facts, code, etc.
Often it’s also gone through a supervised fine‑tuning stage with curated “good assistant” examples (we touched on this in Day 7).
Think of this as:
“We’ve built a super talented intern who knows a ton,
but hasn’t yet been taught company culture or what’s off‑limits.”
So what this means for you: RLHF doesn’t make a dumb model smart; it makes a smart model behave better.
Step 2: humans rate multiple answers
Now comes the “humans behind the curtain” part.
For lots of prompts, the model generates several different answers: A, B, C…
Human reviewers then rank these answers from best to worst.
They judge things like:
- Is it helpful and on‑topic?
- Is it safe, non‑toxic, non‑harassing?
- Is it factually reasonable (as far as they can tell)?
- Is the tone appropriate (not rude, not over‑confident)?
Those rankings feed into a reward model, a separate smaller model trained to predict “how much would a human like this answer?”
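Before any training happens, those rankings are usually expanded into pairwise comparisons. A minimal sketch of the idea (the `ranking_to_pairs` helper is made up for illustration, not a real library function):

```python
from itertools import combinations

# Hypothetical helper: expand one human ranking (best to worst) into the
# (preferred, rejected) pairs a reward model is typically trained on.
def ranking_to_pairs(ranking):
    # Every answer earlier in the ranking "wins" against every later one.
    return list(combinations(ranking, 2))

pairs = ranking_to_pairs(["A", "B", "C"])
# → [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

A ranking of three answers yields three pairs, which is why even modest amounts of human labeling produce a lot of training signal.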
So what this means for you: somewhere in the background, people have literally sat and said “this answer is better than that one” thousands of times, so your AI now has a sense of which directions humans tend to prefer.
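The reward-model idea can be sketched with a toy example. Everything here is invented for illustration: real reward models are neural networks reading actual text, while this one is just a linear scorer over made-up answer "features", trained with the pairwise logistic (Bradley-Terry style) loss that real systems use on preference pairs.

```python
import math
import random

random.seed(0)

# Toy reward model: a linear scorer over hand-made answer features,
# e.g. [helpfulness, politeness, toxicity]. Purely illustrative.
weights = [random.uniform(-0.1, 0.1) for _ in range(3)]

def reward(features):
    """Scalar score: roughly, 'how much would a human like this answer?'"""
    return sum(w * f for w, f in zip(weights, features))

# Human rankings become (preferred, rejected) pairs of feature vectors.
preference_pairs = [
    ([0.9, 0.8, 0.0], [0.4, 0.2, 0.7]),  # helpful + polite beats toxic
    ([0.7, 0.9, 0.1], [0.8, 0.1, 0.6]),  # politeness matters too
    ([0.8, 0.6, 0.0], [0.3, 0.7, 0.2]),
]

def train_step(lr=0.5):
    """Pairwise logistic (Bradley-Terry style) loss: push the preferred
    answer's reward above the rejected answer's reward."""
    for preferred, rejected in preference_pairs:
        margin = reward(preferred) - reward(rejected)
        grad = -1.0 / (1.0 + math.exp(margin))  # d(-log sigmoid(m))/dm
        for i in range(len(weights)):
            weights[i] -= lr * grad * (preferred[i] - rejected[i])

for _ in range(300):
    train_step()

# After training, the model scores each preferred answer above its
# rejected counterpart.
agrees = all(reward(p) > reward(r) for p, r in preference_pairs)
```

Notice the reward model never sees an absolute "goodness" score, only comparisons; that is exactly the signal humans are best at providing.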
Step 3: train the model to chase that reward
Once we have a reward model that can score answers, we bring in reinforcement learning:
The main model (the “policy”) tries different styles of answers.
The reward model scores them: higher for human‑like good behavior, lower for bad ones.
An RL algorithm (often something like Proximal Policy Optimization, or PPO) tweaks the main model’s weights to maximize that score.
Repeat this over and over:
Answers that humans would probably like more become more likely.
Answers that would get human side‑eye become less likely.
Over time, the model shifts from “raw internet brain” to “politer assistant that tries to avoid landmines.”
So what this means for you: the reason your AI often refuses to give dangerous instructions or shifts tone when you get heated is because there’s been a whole extra training phase that told it which behaviors are rewarded and which get smacked down.
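The loop above can be sketched in miniature. To be clear about the assumptions: real systems use PPO over a full language model, while this toy uses the simpler REINFORCE update over three canned "answer styles", with hard-coded rewards standing in for a learned reward model.

```python
import math
import random

random.seed(0)

# Toy "policy": a softmax distribution over a few canned answer styles.
# The rewards dict is a hard-coded stand-in for a reward model's scores.
styles = ["helpful", "rude", "evasive"]
rewards = {"helpful": 1.0, "rude": -1.0, "evasive": -0.2}
logits = {s: 0.0 for s in styles}

def policy_probs():
    """Softmax over the style logits."""
    exps = {s: math.exp(l) for s, l in logits.items()}
    total = sum(exps.values())
    return {s: e / total for s, e in exps.items()}

def reinforce_step(lr=0.1):
    """Sample a style, score it, then nudge the logits so that
    high-reward styles become more likely (REINFORCE update)."""
    probs = policy_probs()
    chosen = random.choices(styles, weights=[probs[s] for s in styles])[0]
    advantage = rewards[chosen]  # real systems subtract a learned baseline
    for s in styles:
        # Gradient of log p(chosen) with respect to each logit.
        grad = (1.0 if s == chosen else 0.0) - probs[s]
        logits[s] += lr * advantage * grad

for _ in range(2000):
    reinforce_step()

final_probs = policy_probs()  # "helpful" should now dominate
```

Even this tiny loop shows the core mechanic: behaviors the reward function likes get sampled more and more often, which is the "politer assistant" drift in miniature.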
What RLHF actually changes in your experience
Compared to a non‑aligned base model, RLHF‑aligned models tend to:
- Follow your instructions more reliably: they treat your prompt as “please do X” instead of “here’s some text, continue it however.”
- Be more cautious around harmful content: they push back on prompts about self‑harm, hate, scams, etc., because those answers get hammered in the feedback loop.
- Sound more cooperative and less chaotic: tone, politeness, and disclaimers are all shaped by what human raters rewarded.
- Be more “brand safe”: enterprises can align models with their own values, policies, and legal requirements.
So what this means for you: when the AI feels “nice,” “responsible,” or “a little too careful,” that’s not an accident; it’s RLHF steering its behavior toward a particular definition of “good.”
The limits and trade‑offs of “niceness”
RLHF is powerful, but it’s not magic.
Some real‑world issues people point out:
- Human bias leaks in: if your human raters have certain cultural or political biases, those can be baked into what the model sees as “good behavior.”
- Over‑cautiousness: in trying to be safe, models sometimes refuse harmless requests or give generic, over‑sanitized answers.
- Reward hacking: the model may learn to “sound” safe and thoughtful without actually being more accurate; it optimizes what looks good to raters, not some perfect moral truth.
- Alignment ≠ solved ethics: RLHF nudges models toward broad goals like “helpful, harmless, honest,” but what those mean in edge cases is still a messy, ongoing debate.
So what this means for you: “trained with RLHF” doesn’t mean “always right or perfectly ethical.” It means “there was a serious attempt to point this very strong engine in a direction humans generally like better.”
Where this leaves us by Day 8
By now, your mental picture could be:
Pretraining gave the model its raw knowledge and skills.
Instruction tuning taught it to treat prompts as commands and follow formats.
RLHF used human preferences to steer it toward being more helpful, polite, and safe.
So what this means for you: when you chat with an AI today, you’re not just talking to a giant matrix of numbers — you’re talking to something many humans have indirectly shaped through millions of tiny “this response is better than that one” judgments.
Teaser for Day 9 – Why the Way You Talk to AI Changes Everything
Now that you know how we trained the model to be more aligned with human values, there’s another big lever left:
How you talk to the model at runtime.
That’s the world of prompting:
System prompts: the hidden “personality script” the model gets before you even type.
Few‑shot prompts: giving examples in your message so it learns the pattern you want on the fly.
Chain‑of‑thought: nudging the model to think step by step instead of jumping to an answer.
On Day 9 – “Why the Way You Talk to AI Changes Everything”
we’ll treat prompting as an engineering skill, not mysticism, and show how tiny changes in how you ask can completely change what you get back.
What blew your mind most? Drop a comment!