While ChatGPT is now the household name for Large Language Models (LLMs), the technology wasn't born in a vacuum with its late-2022 debut. Depending on how you define an LLM, we already had models like GPT-1, GPT-2, GPT-3, and PaLM. If we don't strictly limit ourselves to Decoder-Only architectures, the family tree grows even larger, including foundational models like BERT, XLNet, and T5.
A Quick Note: Technically speaking, ChatGPT isn't the model itself. It's the service developed by OpenAI that leverages their LLM family. The initial viral 2022 release used GPT-3.5, while today's ChatGPT runs on GPT-5 and a variety of other models.
So, why did ChatGPT become the global phenomenon that it is? There are many contributing factors, but one of the most crucial is the introduction of Reinforcement Learning (RL). This technique is central to a process called Alignment, which is key to refining LLMs.
Today, we're diving into the core technology that transformed LLMs from mere "internet parrots" into smart, sensible "conversation partners": Reinforcement Learning. We'll unpack the famously cryptic acronyms like PPO, GRPO, DPO, and ORPO, and do our best to skip the complex math.
1. Why Did LLMs Need Reinforcement Learning?
The early LLM training method was both simple and incredibly powerful. Researchers would scrape every piece of text they could find on the internet, throw it at the model, hide a word or part of a sentence, and ask the model to predict the missing piece. It was an endless "what comes next?" quiz.
Repeating this process over hundreds or thousands of gigabytes of text taught the model grammar and let it absorb the world's knowledge (along with its biases and fake news). By the end, it was smart enough to spin a few words into a full-length novel.
This is the Base Model. Think of it as a brilliant new intern—let's call him Tylor. Tylor has read every book imaginable but has zero social skills.
If you ask Tylor, "What's the weather like today?" he might answer based on the statistical likelihood of the next word from his training data: "What's the weather like today? is a common question, but also, the Yankees lost their game..." or even, "The weather is terrible! I want to quit this job right now." This happens because his sole objective was predicting the most plausible next token, not providing a helpful answer to your question.
This is where the problem of Alignment emerges.
Alignment: The process of aligning a model's objective with human intentions, values, and preferences.
We don't just want models that speak fluently; we want them to behave in specific ways:
Helpful: Understand my query's intent and provide accurate, useful information tailored to it.
Honest: Admit when they don't know something and avoid making things up (hallucinating).
Harmless: Refrain from generating discriminatory, violent, or otherwise harmful content.
To turn our brilliant intern Tylor into a truly valuable team member, we need to teach him social cues, professionalism, judgment, and a bit of real-world sense. The most effective way to teach this is through Reinforcement Learning.
2. The RL Pipeline: LLM Training in 3 Phases
Before diving into RL itself, let's see exactly where it fits in the LLM training process.
LLM training is typically a three-phase pipeline.
Phase 1: Pre-training
The model learns to predict the next word in a sequence by looking at a massive, internet-scale dataset. This phase creates a model with vast knowledge and incredible predictive power—the moment our "brilliant intern" Tylor is hired.
Previous-generation models, including foundational versions of GPT-3, were typically trained up to this first phase.
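To make the "what comes next?" quiz concrete, here is a minimal sketch of the pre-training objective in PyTorch. The random tensors stand in for real token IDs and a real model's output; the point is only the next-token cross-entropy loss.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: a batch of 2 sequences, 6 tokens each, vocabulary of 100 tokens.
vocab_size, batch, seq_len = 100, 2, 6
token_ids = torch.randint(0, vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab_size)  # pretend model predictions

# Next-token prediction: the prediction at position t is scored against
# the (hidden) token at position t + 1.
predictions = logits[:, :-1, :].reshape(-1, vocab_size)
targets = token_ids[:, 1:].reshape(-1)

loss = F.cross_entropy(predictions, targets)
print(f"pre-training (next-token) loss: {loss.item():.3f}")
```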
Phase 2: Supervised Fine-Tuning (SFT)
Now, Tylor is given an employee handbook. This handbook is full of examples: "When a client asks this, this is the right answer." Tylor reads the manual repeatedly, learning to speak in a way that matches the company's required output format. Since he's already fluent, he just needs to learn the new business domain knowledge.
Training an LLM is similar. Human annotators (Labelers) manually create thousands of examples of high-quality, correct answer data—our "employee handbook."
Q: "What caused the French Revolution?"
A: "The French Revolution occurred at the end of the 18th century due to a complex set of factors... (well-structured, right answer)."
The model trains on these model answers. It's no longer just saying plausible things; it learns to follow the pattern of a model answer. This is called Supervised Fine-Tuning (SFT). After this phase, the model can at least respond in a proper question-and-answer format.
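A minimal sketch of the SFT step under the same toy assumptions: each "handbook" entry is a prompt followed by a model answer, and the loss is ordinary next-token cross-entropy computed only on the answer tokens (the -100 masking value is the usual PyTorch convention for ignored positions).

```python
import torch
import torch.nn.functional as F

vocab_size = 100
prompt = torch.tensor([[5, 17, 42, 8]])      # "What caused the French Revolution?"
answer = torch.tensor([[23, 61, 2, 99, 7]])  # the well-structured model answer
tokens = torch.cat([prompt, answer], dim=1)

logits = torch.randn(1, tokens.shape[1], vocab_size)  # stand-in for model output

# Score only the answer portion: mask out targets that belong to the prompt.
targets = tokens[:, 1:].clone()
targets[:, : prompt.shape[1] - 1] = -100

loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    targets.reshape(-1),
    ignore_index=-100,
)
print(f"SFT loss on the answer tokens: {loss.item():.3f}")
```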
However, the real world is messy. Following a manual too strictly can lead to rigid answers, an inability to handle curveballs, and sometimes, even dangerous outputs.
It's time to teach our brilliant intern some real-world social skills and judgment.
Phase 3: Reinforcement Learning from Human Feedback (RLHF)
We take the SFT model (Tylor, the manual-trained intern) and ask him a real-world question, then have him generate 2 to 4 different responses.
Q: "What should I have for dinner tonight?"
Response A: "How about pizza? It's high in calories but delicious."
Response B: "The weather is chilly, so I recommend a hot soup dish, perhaps a hearty stew or chili."
Response C: "I am an AI and cannot eat food."
Response D: "Go hungry."
Which answer do you prefer? From a factual standpoint, they are all valid responses, but you'd likely rank B as the best, then A, then maybe C, with D being the absolute worst.
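For reference, a human-preference record like this one is often stored as little more than a prompt, the candidate responses, and the human's ranking. The field names below are purely illustrative, not taken from any particular dataset.

```python
# One illustrative preference record (field names are hypothetical).
preference_record = {
    "prompt": "What should I have for dinner tonight?",
    "responses": {
        "A": "How about pizza? It's high in calories but delicious.",
        "B": "The weather is chilly, so I recommend a hot soup dish.",
        "C": "I am an AI and cannot eat food.",
        "D": "Go hungry.",
    },
    "ranking": ["B", "A", "C", "D"],  # best to worst, as judged by a human
}

best, worst = preference_record["ranking"][0], preference_record["ranking"][-1]
print(f"preferred: {best}, least preferred: {worst}")
```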
We need to teach our intern this sense of social priority. We need to prevent the catastrophe of him telling the CEO to "Go hungry."
The process of guiding the model to produce answers aligned with human expectations is called Alignment. The technology for LLM Alignment is Reinforcement Learning.
So, how exactly do we teach Tylor this complex social acumen? Let's break down the methods.
3. The Two Camps of LLM Alignment: Online RL vs. Offline Optimization
Here is the crux of the matter: How do we translate that precious preference data—"B is better than A"—into a model update?
The technological "families" of LLM Alignment split into two major branches:
- The "Online Reinforcement Learning" Family (True RL): These methods follow the traditional RL framework. During training, the model itself generates a response, receives immediate feedback (Reward) on that response, and then adjusts its internal policy to generate better responses going forward. This is like getting one-on-one piano lessons from a teacher. PPO and GRPO belong to this camp.
- The "Offline Optimization" Family (a.k.a. RL-free Alignment): These methods skip the traditional real-time (online) RL loop. Instead, they use a large, pre-collected dataset of human preference data, much like a regular supervised dataset, to train the model. This is more like studying by watching hundreds of videos of good and bad piano performances on YouTube. DPO and ORPO belong to this camp.
Wait, are DPO and ORPO really Reinforcement Learning?
Before we look at the methods, let's address the elephant in the room.
Strictly speaking, DPO and ORPO are NOT traditional Reinforcement Learning (RL).
Traditional RL involves an 'Agent' taking 'Actions' in an 'Environment,' receiving a 'Reward,' and updating its 'Policy' in an online, real-time loop. PPO and GRPO fit this description.
DPO and ORPO lack this online loop. They update the model by observing a pre-collected dataset and learning to "increase the probability of the preferred response and decrease the probability of the non-preferred response." This process is much closer to SFT (Supervised Fine-Tuning).
So why do we lump them in with RL?
First, they share the same goal: aligning the model to human preferences, which is exactly what PPO-style RLHF aims to do.
Second, they share the same mathematical roots: without getting too technical, the DPO paper showed that the reinforcement learning objective PPO-style RLHF is trying to optimize can be rewritten, through a few algebraic steps, into an offline objective that achieves the same result. DPO essentially found a mathematical shortcut for the reinforcement learning problem that PPO was trying to solve.
If PPO is like climbing a mountain firsthand to find the best route (Online RL), DPO is like looking at a fully mapped out trail on a topo map (Offline Dataset) and calculating the fastest way to the summit.
For these reasons, though the methods differ, DPO and ORPO are included in the broad category of LLM Alignment techniques, sometimes even called RL-free RLHF.
Let's meet the members of these two influential families.
3.1. The "Online RL" Family: PPO and GRPO
This family is defined by the feedback loop where the model continuously generates and is evaluated during training.
3.1.1. PPO (Proximal Policy Optimization): The Classic and Powerful Champion
PPO is an RL technique introduced by OpenAI in 2017. It became famous again when it served as the core training method for the GPT-3.5 model underpinning the initial viral release of ChatGPT in late 2022.
3.1.1.1. Reward Model, Value Model, and Advantage
PPO follows the typical RL paradigm. The LLM generates a response, a Reward Model evaluates that response with a score (Reward), and the LLM adjusts its response policy based on this score (and the Advantage, which we'll get to shortly).
Recall the piano lesson analogy: the teacher's feedback is the key. In PPO, the Reward Model (RM) plays the role of the teacher.
How is the RM built?
A question is prepared, like our example: "What should I have for dinner tonight?"
The LLM generates multiple responses (A, B, C, D).
A human (the labeler) ranks them (e.g., B > A > C > D).
The Reward Model is trained on this ranking data.
Through this training, the RM learns to assign a continuous score: Response B gets a 95/100, while Response D gets a 10/100.
An RM trained this way can take a (Question, Response) pair and say, "A human is 80% likely to like this answer."
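One common way to turn that ranking data into a training signal for the RM is a pairwise comparison loss (the Bradley-Terry style objective used in many RLHF setups): for every ranked pair, push the score of the preferred response above the score of the rejected one. The scalar scores below are placeholders for the RM's output.

```python
import torch
import torch.nn.functional as F

# Placeholder RM scores for one ranked pair taken from "B > A > C > D".
score_preferred = torch.tensor([2.3])   # e.g. response B
score_rejected = torch.tensor([0.7])    # e.g. response A

# Pairwise loss: maximize the probability that the preferred response
# outscores the rejected one, i.e. sigmoid(score_preferred - score_rejected).
loss = -F.logsigmoid(score_preferred - score_rejected).mean()
print(f"reward-model pairwise loss: {loss.item():.3f}")
```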
Now that we can score the LLM's responses, we need to use that to train the LLM itself.
Let's return to our intern, Tylor. We put Tylor and his evaluator (the RM) in a room and repeat the following loop:
Tylor picks a question card.
Tylor gives an answer.
The Evaluator gives Tylor a score: "80 points!"
But there is also a Value Model (VM) expert silently observing. This expert estimates the expected reward for the current situation. The expert rates the current situation at 70 points.
Tylor compares his actual score (80) to the expert's expectation (70). He scored 10 points higher than expected! Tylor adjusts his answering style to be more like his current high-scoring response.
The Value Model predicts the future score (reward) expected from the current state. The difference between the score received (from the RM) and the expected score (from the VM) is called the Advantage. We train the LLM to maximize this Advantage. If the model is doing better than expected, reinforce that behavior; if worse, discourage it.
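In code, Tylor's bookkeeping is a single subtraction; the numbers mirror the example above.

```python
reward = 80.0      # actual score from the Reward Model ("80 points!")
expected = 70.0    # Value Model's estimate of how well this should go
advantage = reward - expected

print(f"advantage: {advantage:+.1f} (positive means: reinforce this behavior)")
```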
3.1.1.2. Learning Safely: The Meaning of Proximal
PPO stands for Proximal Policy Optimization. The "Policy" is the LLM's response style.
What does "Proximal" mean?
Proximal means "close by" or "nearby." It means the model updates its policy, but only by a little bit—it changes its policy cautiously.
What if Tylor, our intern, overreacted to the Advantage score? We hired him because his fundamentals were strong (from pre-training and SFT). By focusing too intensely on pleasing the evaluator, he could undermine his excellent baseline skills. We need to scold him if he tries to change too quickly.
It's similar with LLMs. The SFT-trained LLM is already a great model; we just need to align its responses. If the LLM tries too hard to maximize the Advantage, it can forget the valuable knowledge and skills it learned during SFT.
PPO uses several mechanisms to prevent overly drastic policy changes. The most popular one is Clipping. Simplifying a bit: even if the Advantage calculation suggests a policy change of 50 units, we might cap the allowed change at a range of [-20, 20]. A calculated change of 50 gets "clipped" down to the maximum allowed value of 20, keeping every update incremental and safe. (In the actual algorithm, what gets clipped is the ratio between the new and old policy's probabilities, as sketched below.)
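Here is a minimal sketch of that clipped update rule, PPO's clipped surrogate objective. The advantage and log-probabilities are placeholder numbers; the clip range of 0.2 is a commonly used default.

```python
import torch

epsilon = 0.2                            # clip range: ratio must stay in [0.8, 1.2]
advantage = torch.tensor([10.0])         # "10 points better than expected"

# Probability the old and new policy assign to the token that was generated.
old_logprob = torch.tensor([-2.0])
new_logprob = torch.tensor([-1.2])       # the new policy likes this token a lot more

ratio = torch.exp(new_logprob - old_logprob)            # ~2.23: a big policy jump
clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)  # capped at 1.2

# PPO keeps the more pessimistic of the two terms, so an overly aggressive
# update earns no extra credit beyond the clip boundary.
objective = torch.min(ratio * advantage, clipped * advantage)
loss = -objective.mean()
print(f"ratio={ratio.item():.2f}, clipped={clipped.item():.2f}, loss={loss.item():.2f}")
```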
3.1.2. GRPO (Group Relative Policy Optimization): The Smart Evolution of PPO (feat. DeepSeek)
PPO, while effective, has a few drawbacks. One major issue is the need for an extra Value Model in addition to the Reward Model to calculate the Advantage. The Value Model is often as large as the target LLM itself. Building an LLM is hard; building a Reward Model is hard; avoiding the need for a third giant model (the VM) would be fantastic!
GRPO tackles this head-on. It proposed an idea to calculate the Advantage without training a separate Value Model.
The LLM generates multiple responses (A, B, C, D).
The Reward Model scores each response: A=80, B=95, C=55, D=10.
We calculate the average score of all responses: (80 + 95 + 55 + 10) / 4 = 60 points.
Now, we subtract the average score from each response's score:
- Response A: 80 - 60 = 20
- Response B: 95 - 60 = 35
- Response C: 55 - 60 = -5
- Response D: 10 - 60 = -50
GRPO uses these values as the Advantage. The advantage is the relative reward within the group. That's what the GR in GRPO stands for: Group Relative.
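The same group-relative calculation in code, using the scores from the example above.

```python
import torch

# Reward Model scores for the group of responses A, B, C, D.
rewards = torch.tensor([80.0, 95.0, 55.0, 10.0])

group_mean = rewards.mean()            # 60.0
advantages = rewards - group_mean      # [20., 35., -5., -50.]

# Note: many GRPO implementations additionally divide by the group's
# standard deviation to normalize the scale; the core idea is the same.
print(f"group mean: {group_mean.item():.1f}")
print(f"group-relative advantages: {advantages.tolist()}")
```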
The LLM is then updated to maximize this relative Advantage.
GRPO inherits PPO's approach but simplifies the complex, resource-intensive intermediate step of training a Value Model.
3.2. The "Offline Optimization" Family: DPO and ORPO
This family challenged the very complexity of PPO and GRPO's "online feedback loop."
Their argument: "Why not just collect tons of data and train the model like SFT?"
3.2.1. DPO (Direct Preference Optimization): The Game-Changer After PPO
The PPO pipeline (Train Reward Model → Train LLM with PPO) is notoriously hellish to manage. So, researchers asked:
"Wait, training a Reward Model, sampling online... it's too complicated. Why don't we directly train the LLM using the 'B is better than A' offline dataset?"
That question led to DPO (Direct Preference Optimization). The idea behind DPO is this:
"The goal is for the model to prefer 'B' over 'A,' right? So let's skip the Reward Model bridge and train the LLM directly to judge that 'B is more plausible than A'!"
DPO takes the model after Phase 2 (SFT) and gives it pairs of preference data (Preferred response, Non-preferred response). It then trains the model to:
Increase the probability of generating the preferred response.
Decrease the probability of generating the non-preferred response.
Instead of the online method (LLM generates → RM scores → LLM trains), DPO uses a static dataset of preferred/non-preferred pairs, making it an extension of SFT.
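A minimal sketch of the DPO loss, assuming we already have the total log-probabilities of the preferred and non-preferred responses under both the model being trained (the policy) and a frozen copy of the SFT model (the reference). All numbers are placeholders; beta is the usual strength-of-constraint hyperparameter.

```python
import torch
import torch.nn.functional as F

beta = 0.1   # how strongly to stay close to the frozen SFT reference model

# Placeholder log-probabilities of each full response (summed over its tokens).
policy_chosen = torch.tensor([-12.0])    # preferred response, current model
policy_rejected = torch.tensor([-11.0])  # non-preferred response, current model
ref_chosen = torch.tensor([-12.5])       # same responses under the frozen SFT model
ref_rejected = torch.tensor([-10.5])

# DPO: widen the margin by which the policy prefers the chosen response,
# measured relative to the reference model.
chosen_logratio = policy_chosen - ref_chosen
rejected_logratio = policy_rejected - ref_rejected
loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
print(f"DPO loss: {loss.item():.3f}")
```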
DPO is a game-changer because it eliminates the need to train a Reward Model, is far less resource-intensive, is much more stable to train, and simplifies the complex RL pipeline to a simple extension of SFT. Often, its performance is comparable to or even better than PPO, leading many to adopt it instead.
3.2.2. ORPO (Odds Ratio Preference Optimization): Combining SFT and DPO
If DPO came after SFT (Phase 2) to handle Alignment (Phase 3), ORPO takes it one step further.
"Hold on, we do SFT (Phase 2) and then DPO (Phase 3). Why not just combine the two?"
ORPO utilizes the concept of the Odds Ratio to merge the SFT and DPO stages into a single, highly efficient process. But first, what is an Odds Ratio?
Let's say a basketball player, Jordan, attempts 10 free throws, making 8 and missing 2. His probability of making a free throw is 8/10 = 0.8.
The Odds represent the likelihood of an event occurring relative to the likelihood of it not occurring. Jordan's odds of making a free throw are 8 (makes) / 2 (misses) = 4. This means he is four times more likely to make the shot than to miss it.
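The same arithmetic in code:

```python
makes, misses = 8, 2

probability = makes / (makes + misses)   # 0.8: chance of making a free throw
odds = makes / misses                    # 4.0: four times more likely to make than miss

print(f"P(make) = {probability:.1f}, odds = {odds:.1f}")
```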
ORPO calculates the ratio of two odds: the odds of the LLM generating the preferred response versus the odds of the LLM generating the non-preferred response. This Odds Ratio increases as the LLM favors the preferred response and decreases as it favors the non-preferred one. The training objective, therefore, is to maximize this Odds Ratio.
While DPO primarily trains the LLM to favor the preferred response, ORPO simultaneously and quite strongly pushes it to avoid generating the non-preferred one.
Another key feature is that this Odds Ratio calculation and the Alignment it drives happen during the SFT stage (Phase 2), not afterwards. ORPO integrates Alignment via the Odds Ratio directly into the SFT process, so the LLM learns to generate good responses (SFT) while simultaneously learning to avoid non-preferred responses (Alignment). By merging Phase 2 and Phase 3, it significantly streamlines the entire training pipeline.
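A minimal sketch of a loss with ORPO's shape, assuming the average per-token log-probabilities of the preferred and non-preferred responses are already computed. Following the paper's formulation, the odds of a response are p / (1 - p), and the final loss adds the odds-ratio term to the ordinary SFT loss on the preferred response, weighted by a coefficient (lambda below); the numbers are placeholders.

```python
import torch
import torch.nn.functional as F

lam = 0.1   # weight of the odds-ratio term relative to the SFT term

# Placeholder average per-token log-probabilities under the model being trained.
logp_chosen = torch.tensor([-1.2])     # preferred response
logp_rejected = torch.tensor([-0.9])   # non-preferred response

def log_odds(logp):
    # odds = p / (1 - p), computed in log space: log(p) - log(1 - p)
    return logp - torch.log1p(-torch.exp(logp))

# Odds-ratio term: push the odds of the preferred response above the
# odds of the non-preferred one.
or_loss = -F.logsigmoid(log_odds(logp_chosen) - log_odds(logp_rejected)).mean()

# SFT term: ordinary negative log-likelihood of the preferred response.
sft_loss = -logp_chosen.mean()

loss = sft_loss + lam * or_loss
print(f"SFT term: {sft_loss.item():.3f}, OR term: {or_loss.item():.3f}, total: {loss.item():.3f}")
```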
4. Pros and Cons of LLM Reinforcement Learning
Naturally, RLHF has its strengths and weaknesses.
Pros: Creating the Chatbot We Know and Love
Alignment (Real Human-Like Conversation): This is the core reason. It teaches the model to grasp my intent, be helpful, and be polite. It transforms the "parrot" into a "perceptive assistant."
Safety and Harm Reduction: It is the most effective way to train a model to refuse dangerous requests like "How do I build a bomb?" RL is an excellent teacher for setting boundaries.
Fine-Grained Control: RL feedback can be used to reflect subtle style requests, such as "be more humorous" or "explain it like an expert." Rank-based methods like GRPO are particularly strong at this.
Cons: Money, Time, and New Headaches
Data Collection Costs: A major barrier. Creating hundreds of thousands of high-quality 'preference data' (A > B) or 'ranking data' (A > B > C) requires expensive labor from many human annotators. The quality of the labelers directly dictates the performance of the model.
Reward Hacking: Models are smart and ruthless. They will find a loophole to maximize their score, often at the expense of usefulness. The LLM can become a clever sycophant.
- The Flattery Problem: Human evaluators tend to give higher scores to responses that are long, polite, and sound plausible. In response, models learned to flatter. Responses like, "That is a truly excellent question!" or "I completely agree with your assessment," followed by a lengthy but vacuous answer, became common. When the response score is high, but the usefulness is low, it's called Reward Hacking.
Bias Entrenchment: "Human feedback" ultimately comes from a small group of labelers, often from a specific cultural background and education level. Their values and biases can be unintentionally injected into the model as the "correct answer," leading the model to respond with entrenched biases.
Training Instability: Especially with "Online RL" methods like PPO, the system involves multiple interacting models. The training process can be highly unstable and extremely sensitive to minor hyperparameter changes. It's common for a small misstep to turn the model into a "genius fool." (GRPO, DPO, and ORPO were developed to simplify or largely mitigate this issue).
5. Conclusion
We've covered the crucial secret behind making LLMs truly useful: Reinforcement Learning. We focused on concepts while avoiding complex math.
The LLM starts as a "Brilliant Intern" (Base Model) from massive data, then learns the "Employee Handbook" (SFT).
To become a true professional, it must learn social skills and judgment using human preference data (A > B) via RLHF (Reinforcement Learning from Human Feedback).
This breaks into two paths:
- Online RL: PPO is the classic method, creating a complex Reward Model evaluator to train with 'absolute scores.' GRPO made PPO more efficient by simplifying the pipeline and avoiding the need for a separate Value Model.
- Offline Optimization: DPO was an innovation that bypassed the Reward Model and online loop, training the LLM directly with a 'preference dataset,' like an SFT extension. ORPO went one step further, combining SFT and DPO into a single, unified step.
The evolution of LLMs is no longer just a race to build bigger models; it's a competition to see who can best master Alignment—making the AI safe, helpful, and truly aligned with human values.
Ultimately, training an AI is much like building a mirror that reflects our own human values. It's the process of teaching the AI what we prefer and what we value.