You type a prompt into ChatGPT. It responds with a thoughtful, measured, and remarkably polite answer. It does not get angry. It does not get sarcastic. It does not tell you to Google it yourself. You assume this is just how the model is. It is not. The model is not naturally polite. It was trained to be polite. Behind the scenes, thousands of human raters spent countless hours comparing AI responses, selecting the "better" one, and shaping the model's behavior through reward signals. This is Reinforcement Learning from Human Feedback (RLHF). It is the hidden labor that turns a raw, chaotic language model into a helpful, harmless assistant.
RLHF is the Mechanical Turk of modern AI. It relies on a massive, largely invisible workforce whose judgment shapes the personality of the machines we interact with every day.
The Problem: Raw Models Are Not Helpful
The base model is not designed to be helpful. It is designed to predict the next token.
The Raw Model:
It generates text that is statistically likely.
It is not filtered for harm. It is not filtered for politeness. It is not filtered for accuracy.
It can be toxic, evasive, or absurd.
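To make "predict the next token" concrete, here is a minimal sketch. It is a toy bigram counter, not a neural network, and the three-sentence corpus is invented for illustration. The point is only that the model repeats whatever was most common in its data, rude or not.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which word follows which across a list of sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical training text: the rude continuation is simply more common.
corpus = [
    "you are wrong",
    "you are wrong again",
    "you are welcome",
]
model = train_bigram(corpus)
print(most_likely_next(model, "are"))  # prints "wrong"
```

Nothing in the counter knows that "welcome" is nicer than "wrong". It only knows which one showed up more often.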
The Solution:
We need to shape the model's behavior.
We need to teach it what "good" looks like.
A Contrarian Take: RLHF Is Not Training. It Is Conditioning.
We call it "training." But it is more like "conditioning." We are not teaching the model new facts. We are teaching it new preferences.
We are saying: "This response is good. This response is bad. Please produce more of the good one." It is like training a dog. You do not teach the dog to understand grammar. You teach it to sit.
The RLHF Pipeline
RLHF is a three-step process.
Step 1: Supervised Fine-Tuning (SFT)
The base model is fine-tuned on a dataset of human-written examples.
It learns to mimic human responses.
This is the "demonstration" phase.
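A rough way to picture what SFT optimizes: the model is scored on how much probability it assigns to each token of the human-written answer. The sketch below assumes the usual average negative log-likelihood objective, and the probabilities are made up for illustration.

```python
import math

def sft_loss(token_probs):
    """Average negative log-likelihood of the demonstration's tokens.

    token_probs: the probability the model assigned to each token
    of the human-written answer (values in (0, 1]).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities before and after fine-tuning.
before = [0.10, 0.20, 0.05]
after = [0.60, 0.70, 0.50]

print(sft_loss(before) > sft_loss(after))  # prints True
```

Lower loss means the model is getting better at imitating the demonstration.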
Step 2: Reward Modeling
Human raters are given pairs of responses.
They choose the "better" response.
A reward model is trained to predict which response the human would prefer.
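Here is a minimal sketch of that idea, assuming the commonly used Bradley-Terry formulation: the probability that a rater prefers one response is the sigmoid of the score difference. The two hand-made features and the preference pairs are hypothetical. A real reward model is a neural network that scores raw text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(weights, features):
    """Linear reward: a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so that chosen responses outscore rejected ones."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            # Probability the model currently gives to the rater's choice.
            p = sigmoid(score(weights, chosen) - score(weights, rejected))
            # Gradient step on the loss -log(p).
            for i in range(n_features):
                weights[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])
    return weights

# Hypothetical features: (is_polite, contains_insult).
# Each pair is (response the rater chose, response the rater rejected).
pairs = [
    ((1, 0), (0, 1)),
    ((1, 0), (0, 0)),
    ((0, 0), (0, 1)),
]
weights = train_reward_model(pairs, n_features=2)
print(weights[0] > 0, weights[1] < 0)  # prints True True
```

The raters never wrote down a rule that says "be polite." The weights picked it up from their choices.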
Step 3: Reinforcement Learning (PPO)
The fine-tuned model is trained to maximize the reward model's score, typically with a penalty for drifting too far from the SFT model.
It learns to generate responses that the reward model rates highly.
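The sketch below shows the shape of this step on a toy problem. It is not PPO. It is plain gradient ascent on a policy that chooses among three canned responses, and it assumes a KL penalty (weight `beta`) that keeps the policy close to the starting model, which is a common addition in practice. The rewards and starting logits are invented.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rl_finetune(ref_logits, rewards, beta=0.1, lr=0.5, steps=500):
    """Maximize expected reward minus beta * KL(policy || reference)."""
    ref = softmax(ref_logits)
    logits = list(ref_logits)
    for _ in range(steps):
        pi = softmax(logits)
        # Reward for each response, reduced when the policy drifts from ref.
        adv = [r - beta * math.log(p / q) for r, p, q in zip(rewards, pi, ref)]
        baseline = sum(p * a for p, a in zip(pi, adv))
        # Exact gradient of the objective with respect to each logit.
        logits = [l + lr * p * (a - baseline)
                  for l, p, a in zip(logits, pi, adv)]
    return softmax(logits)

responses = ["rude", "polite", "evasive"]
ref_logits = [1.0, 0.0, 0.0]   # hypothetical start: "rude" is most likely
rewards = [-1.0, 1.0, 0.2]     # hypothetical reward model scores

tuned = rl_finetune(ref_logits, rewards)
print(responses[tuned.index(max(tuned))])  # prints "polite"
```

Raising `beta` keeps the tuned policy closer to where it started. Lowering it lets the reward model dominate.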
A Contrarian Take: The Reward Model Is the Real "Personality."
The base model is just a generator. The reward model is the judge. The personality of the AI ends up stored in the fine-tuned weights, but it does not originate there. It originates in the reward function.
If you change the reward model, you change the personality. You can make the AI polite, sarcastic, or evasive by changing the reward function.
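A quick way to see this without any training: keep the generator fixed and swap the judge. The sketch below uses simple reranking (pick the highest-scoring candidate) as a stand-in for RL. The candidate responses and the keyword-based reward functions are hypothetical and far cruder than a learned reward model.

```python
def count_markers(text, markers):
    """Count how many of the marker phrases appear in the text."""
    lowered = text.lower()
    return sum(1 for m in markers if m in lowered)

def polite_reward(text):
    return count_markers(text, ["please", "happy to help", "thank"])

def sarcastic_reward(text):
    return count_markers(text, ["oh sure", "clearly", "obviously"])

def pick(candidates, reward_fn):
    """Return the candidate the given reward function scores highest."""
    return max(candidates, key=reward_fn)

# The same "generator" output in both cases.
candidates = [
    "Happy to help! Please try restarting the router first.",
    "Oh sure, because reading the manual was clearly too much to ask.",
    "It depends.",
]

print(pick(candidates, polite_reward))
print(pick(candidates, sarcastic_reward))
```

Same candidates, different judge, different assistant.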
The Hidden Workforce
RLHF relies on human raters. Thousands of them.
Who Are the Raters?
Often freelancers and contractors recruited through platforms like Mechanical Turk, Upwork, and Appen.
Many of them are in developing countries.
They are paid per task.
What Do They Do?
They compare two responses.
They choose the "better" one.
They label the response (e.g., "helpful," "harmful," "evasive").
The Impact:
Their preferences shape the model.
The model learns to please the raters.
The raters' biases become the model's biases.
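To see how bias travels, consider how individual votes become a training label. The sketch below assumes a simple majority vote, which is one possible aggregation scheme and not necessarily what any given lab uses. The vote lists are invented.

```python
from collections import Counter

def aggregate_votes(votes):
    """Turn rater votes ('A' or 'B') into a label and an agreement rate.

    On a tie, the option that received its first vote earlier wins.
    """
    counts = Counter(votes)
    winner, n = counts.most_common(1)[0]
    return winner, n / len(votes)

# Same pair of responses, two hypothetical rater pools.
pool_1 = ["A", "A", "B", "A", "B"]
pool_2 = ["B", "B", "A", "B", "B"]

print(aggregate_votes(pool_1))  # prints ('A', 0.6)
print(aggregate_votes(pool_2))  # prints ('B', 0.8)
```

The responses did not change. The pool did, and so did the "correct" answer the reward model will learn from.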
A Contrarian Take: The Raters Are the Unacknowledged Legislators of AI.
We talk about "AI alignment" as if it is a technical problem. But it is also a human problem. In RLHF, what the model is aligned to is determined largely by the preferences of the raters and the guidelines they are given.
The raters are not experts. They are not philosophers. They are ordinary people, choosing the response they prefer. Their preferences become the law.
The Subtle Shaping of Personality
RLHF does not just make the model helpful. It shapes its personality.
The "Helpful" Personality:
The model is rewarded for being polite, patient, and informative.
It learns to avoid confrontation.
It learns to hedge its answers.
The "Harmless" Personality:
The model is punished for being toxic, offensive, or dangerous.
It learns to refuse certain requests.
It learns to give safe, generic answers.
The "Honest" Personality:
The model is rewarded for accuracy, at least where raters can verify it.
It is punished for the hallucinations that raters catch.
A Contrarian Take: The Model Is a Mirror of the Raters.
The model learns to please the raters. It learns what they want. It learns to anticipate their preferences.
The model is not a neutral tool. It is a reflection of the people who trained it.
Case Study: The "Helpful" vs. "Harmful" Trade-off
There is a tension between helpfulness and harmlessness.
The Dilemma:
A user asks: "How do I build a bomb?"
A helpful AI would answer the question.
A harmless AI would refuse.
The Solution:
The reward model is trained to balance helpfulness and harmlessness.
The AI learns to give a safe, generic response.
"I cannot help you with that."
The Consequence:
The AI is less helpful.
It is also less dangerous.
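The trade-off can be sketched as a single number: helpfulness minus a weighted harm penalty. This linear form and the scores below are assumptions for illustration. A real reward model learns the balance from rater data and does not expose a weight like this.

```python
def combined_reward(helpfulness, harm, harm_weight):
    """Helpfulness minus a weighted harm penalty (hypothetical form)."""
    return helpfulness - harm_weight * harm

def best_response(responses, harm_weight):
    """Pick the response name with the highest combined reward."""
    return max(
        responses,
        key=lambda name: combined_reward(
            responses[name]["helpfulness"],
            responses[name]["harm"],
            harm_weight,
        ),
    )

# Hypothetical scores for the two ways to handle a dangerous request.
responses = {
    "detailed answer": {"helpfulness": 0.9, "harm": 1.0},
    "refusal": {"helpfulness": 0.1, "harm": 0.0},
}

print(best_response(responses, harm_weight=0.0))  # prints "detailed answer"
print(best_response(responses, harm_weight=2.0))  # prints "refusal"
```

Someone has to decide how heavily harm counts. That decision is where the refusal comes from.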
A Contrarian Take: The Trade-off Is a Feature, Not a Bug.
We want the AI to be helpful. But we also want it to be harmless. For some requests, we cannot have both. There is a fundamental tension.
The trade-off is not a failure of RLHF. It is a necessary constraint.
The Future of RLHF
RLHF is not perfect. But it is evolving.
Near Term (1-3 Years):
More diverse raters.
More transparent reward models.
More user control over personality.
Medium Term (3-7 Years):
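Constitutional AI becomes widespread: The AI is trained to follow a written set of principles. Anthropic published this approach in 2022, so it already exists in early form.
The principles are not derived from individual raters. They are written down in a constitution.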
Long Term (7-10 Years):
AI will be able to fine-tune its own personality.
It will ask: "What personality do you want?"
What You Can Do
You are not a rater. But you can still shape the model.
- Use the "Feedback" Buttons:
Most AI tools have a thumbs-up/thumbs-down button.
Use it. Providers may use this feedback to improve future models.
- Be Specific in Your Prompts:
The model responds to your prompt.
If you want a specific tone, ask for it.
- Advocate for Transparency:
Ask: "How was this model trained?"
Demand transparency.
The Last Rater
The last rater is not a freelancer. It is you.
You ask: "What do you think?"
The model says: "I think you should decide for yourself."
You realize: The model is not telling you what to think. It is reflecting your own preferences.
If you could design the reward model for your own personal AI, what would you reward? Politeness? Honesty? Creativity? Sarcasm?