DEV Community

Shrijith Venkatramana
The Three Phases of Post-Training: How LLMs Learn to Provide Sensible Responses

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.


Most developers have heard the phrase:

"LLMs are trained on massive amounts of internet data."

While technically true, it leaves out the most interesting part.

Pretraining teaches a model how language works. But it doesn't teach the model how to be helpful, harmless, honest, or aligned with human expectations.

If pretraining creates a brilliant but socially awkward intern, post-training turns that intern into a productive teammate.

Modern AI systems such as ChatGPT, Gemini, Claude, and others rely heavily on a multi-stage post-training pipeline. While implementations differ, the overall pattern is surprisingly consistent:

  1. Supervised Fine-Tuning (SFT)
  2. Reward Modeling (RM)
  3. Reinforcement Learning (RL)

Let's explore what each stage does, why it exists, and how they work together.

Why Pretraining Isn't Enough

Imagine we train a model on the entire internet and ask:

"How do I become a better software engineer?"

The model has seen thousands of answers:

  • Good advice
  • Bad advice
  • Contradictory advice
  • Sarcasm
  • Reddit arguments
  • Technical blog posts
  • Motivational speeches

The model learns patterns in text, but it doesn't inherently know which response humans would prefer.

It only knows what tends to come next.

This is the core limitation of pretraining.
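To make "what tends to come next" concrete, here is a minimal sketch of a count-based next-word predictor. The three-sentence corpus is made up for illustration, and a real LLM is a neural network rather than a lookup table, but the objective has the same shape: predict the most likely continuation, whether or not it is good advice.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for "internet text"; the sentences are invented.
corpus = [
    "to become a better engineer read more code",
    "to become a better engineer just quit",
    "to become a better engineer just quit",
]

def train_bigram(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent continuation, with no notion of quality."""
    return counts[word].most_common(1)[0][0]

model = train_bigram(corpus)
print(predict_next(model, "engineer"))  # prints "just"
```

The unhelpful continuation wins simply because it appears more often in the data.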

The model learns:

"What people write."

But not:

"What humans want."

Post-training bridges this gap.

Stage 1: Supervised Fine-Tuning (SFT)

The first step is teaching the model what good behavior looks like.

Researchers create high-quality examples that pair a prompt with an ideal response, such as:

User: Explain TCP vs UDP.

Assistant:
TCP provides reliable ordered delivery...

Or:

User: Write a Python function that reverses a linked list.

Assistant:
def reverse(head):
    ...

Thousands or millions of these examples are collected.

The model is then trained to imitate the desired responses.

Conceptually:

Question → Ideal Answer

becomes

Model → Learn to reproduce ideal answer

The objective is still next-token prediction, but now on carefully curated data instead of arbitrary internet text.
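As a rough sketch of that objective, the function below averages the negative log-likelihood of the target tokens. It assumes one common convention that the article does not state: only the assistant's tokens count toward the loss, so the model is not trained to reproduce the user's prompt. The probabilities and the mask are hypothetical values, not output from a real model.

```python
import math

def sft_loss(token_probs, is_assistant_token):
    """Average negative log-likelihood over assistant tokens only.

    token_probs[i] is the probability the model assigned to the i-th
    target token; is_assistant_token[i] says whether that token belongs
    to the assistant's reply (True) or to the user's prompt (False).
    """
    losses = [
        -math.log(p)
        for p, keep in zip(token_probs, is_assistant_token)
        if keep
    ]
    return sum(losses) / len(losses)

# Hypothetical per-token probabilities: three prompt tokens, three reply tokens.
probs = [0.20, 0.10, 0.30, 0.90, 0.80, 0.95]
mask = [False, False, False, True, True, True]

print(round(sft_loss(probs, mask), 4))
```

Training lowers this number by pushing up the probability of each token in the curated reply.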

Software Engineering Analogy

Think of SFT as onboarding a new engineer.

Instead of letting them learn exclusively from random GitHub repositories, you provide:

  • Coding standards
  • Architecture guidelines
  • Example pull requests
  • Internal best practices

The engineer begins to imitate the patterns you want.

What SFT Solves

SFT dramatically improves:

  • Instruction following
  • Formatting
  • Tool usage
  • Coding style
  • Conversational behavior

However, it still has a limitation.

For many prompts, there isn't one correct answer.

There may be multiple reasonable responses with varying quality levels.

That's where Reward Modeling enters.

Stage 2: Reward Modeling (RM)

Suppose a user asks:

"How should I learn distributed systems?"

Three responses might all be technically correct.

Response A:

Read a textbook.

Response B:

Read a textbook and build projects.

Response C:

Study networking, databases, consensus algorithms,
then implement a small Raft cluster.

Most humans would likely prefer C.

But how does a model learn that preference?

The answer is Reward Modeling.

Collecting Human Preferences

Human evaluators compare multiple outputs:

Prompt

Answer A
Answer B

They choose the better response.

Thousands or millions of comparisons are collected.

Example:

Prompt:
How do I learn Go?

Preferred:
Build projects and read Effective Go.

Rejected:
Just read documentation.
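One possible way to store a comparison like this is as a small record with a chosen and a rejected response. The `PreferencePair` class and `from_comparison` helper are hypothetical names used for illustration, not part of any particular library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # the response the evaluator preferred
    rejected: str  # the response the evaluator passed over

def from_comparison(prompt, answer_a, answer_b, a_is_better):
    """Turn one evaluator judgment into a (chosen, rejected) record."""
    if a_is_better:
        return PreferencePair(prompt, answer_a, answer_b)
    return PreferencePair(prompt, answer_b, answer_a)

pair = from_comparison(
    "How do I learn Go?",
    "Just read documentation.",
    "Build projects and read Effective Go.",
    a_is_better=False,
)
print(pair.chosen)  # prints "Build projects and read Effective Go."
```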

A separate model is trained to predict these preferences.

This becomes the Reward Model.

Conceptually:

(Prompt, Response) → Quality Score

The reward model acts like an automated judge.
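A minimal sketch of how such a judge could be trained from pairs uses the pairwise logistic (Bradley-Terry) loss, `-log sigmoid(r(chosen) - r(rejected))`, which is a common choice for reward models. The two hand-made features below are an assumption made to keep the example tiny; a real reward model is a neural network that reads the prompt and the response.

```python
import math

def features(response):
    """Hypothetical hand-made features standing in for a neural network."""
    words = response.lower().split()
    return [
        1.0 if "build" in words or "implement" in words else 0.0,  # hands-on advice
        len(words) / 10.0,                                          # rough detail proxy
    ]

def reward(weights, response):
    """Linear score: higher means the judge likes the response more."""
    return sum(w * x for w, x in zip(weights, features(response)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, steps=200, lr=0.5):
    """Minimize -log sigmoid(r(chosen) - r(rejected)) by gradient descent."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. weights is -(1 - sigmoid(margin)) * (f_c - f_r).
            scale = 1.0 - sigmoid(margin)
            fc, fr = features(chosen), features(rejected)
            weights = [w + lr * scale * (c - r) for w, c, r in zip(weights, fc, fr)]
    return weights

pairs = [
    ("Build projects and read Effective Go.", "Just read documentation."),
    ("Read a textbook and build projects.", "Read a textbook."),
]
w = train(pairs)
a = reward(w, "Study consensus algorithms, then implement a small Raft cluster.")
b = reward(w, "Read a textbook.")
print(a > b)  # prints True
```

The trained scorer generalizes from two comparisons to an unseen response, which is the property the RL stage relies on.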

Why This Matters

SFT teaches:

"Produce answers similar to examples."

Reward Modeling teaches:

"Recognize which answers humans prefer."

This distinction is subtle but important.

One is imitation.

The other is evaluation.

Stage 3: Reinforcement Learning (RL)

Now we have:

  • A policy model (the assistant)
  • A reward model (the judge)

The final stage uses Reinforcement Learning to optimize the assistant.

The process looks like:

Prompt
   ↓
Model generates answer
   ↓
Reward model scores answer
   ↓
Update model to increase reward

Repeated millions of times.

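Over time, the assistant learns to generate responses that maximize the reward signal. In practice, the update is usually constrained so the model stays close to its SFT starting point; otherwise it can learn to exploit quirks of the reward model instead of genuinely improving.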
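The loop can be sketched with a toy policy that chooses among three canned answers. Everything here is a simplifying assumption: the `reward_model` function is a hypothetical stand-in that favors longer answers, the policy is just three logits, and the update is a plain exact policy gradient rather than PPO. It also leaves out refinements that production systems use, such as a penalty for drifting away from the SFT model.

```python
import math

# Three canned answers stand in for everything the policy could generate.
responses = [
    "Read a textbook.",
    "Read a textbook and build projects.",
    "Study networking, databases, consensus algorithms, then implement a small Raft cluster.",
]

def reward_model(response):
    """Hypothetical judge: longer, more specific answers score higher."""
    return len(response.split()) / 10.0

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, lr=1.0):
    """One exact gradient-ascent step on the policy's expected reward."""
    probs = softmax(logits)
    rewards = [reward_model(r) for r in responses]
    baseline = sum(p * r for p, r in zip(probs, rewards))  # expected reward
    # Gradient of expected reward w.r.t. logit i is p_i * (r_i - baseline).
    return [l + lr * p * (r - baseline) for l, p, r in zip(logits, probs, rewards)]

logits = [0.0, 0.0, 0.0]  # start with no preference among the answers
for _ in range(500):
    logits = policy_gradient_step(logits)

probs = softmax(logits)
best = max(range(len(responses)), key=lambda i: probs[i])
print(responses[best])
```

After enough updates, most of the probability mass sits on the answer the judge scores highest.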

PPO and Modern Variants

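Historically, many systems used:

  • PPO (Proximal Policy Optimization)

More recently, other approaches have gained popularity:

  • DPO (Direct Preference Optimization), which trains directly on preference pairs and skips the separate reward model and RL loop
  • RLAIF (Reinforcement Learning from AI Feedback), which replaces some or all human preference labels with labels from another model
  • GRPO (Group Relative Policy Optimization) and related techniques, which score each response relative to a group of samples for the same prompt instead of training a separate value model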

The exact algorithm matters less than the goal:

Move the model toward outputs that humans consistently prefer.
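As one concrete instance of that goal, here is a sketch of the DPO loss for a single preference pair. The log-probabilities are made-up numbers, and `beta=0.1` is just an illustrative setting; in a real run they would come from the model being trained and from a frozen reference copy.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the total log-probability a model assigns to a full
    response: policy_* from the model being trained, ref_* from a frozen
    reference copy (usually the SFT model).
    """
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    margin = beta * (chosen_gain - rejected_gain)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up log-probabilities for illustration.
untrained = dpo_loss(-12.0, -11.0, -12.0, -11.0)  # policy equals reference
improved = dpo_loss(-10.0, -14.0, -12.0, -11.0)   # policy now favors chosen
print(untrained > improved)  # prints True
```

The loss falls as the policy raises the preferred response's likelihood relative to the reference and lowers the rejected one's.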

Software Engineering Analogy

Imagine code review automation.

SFT teaches an engineer using examples of good pull requests.

Reward Modeling creates a senior reviewer that scores submissions.

RL repeatedly updates the engineer based on reviewer feedback.

Eventually the engineer starts producing code that receives better review scores.

The New Trend: Models Training Models

One interesting detail from recent Gemini research is the growing use of models themselves in post-training workflows.

Instead of relying exclusively on humans, powerful models help:

  • Generate candidate responses
  • Identify low-quality data
  • Detect inconsistencies
  • Perform ranking tasks
  • Assist evaluation pipelines

This creates a feedback loop:

Model
  ↓
Generates data
  ↓
Humans verify
  ↓
Improved model
  ↓
Generates better data

The result is dramatically improved scalability.
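One step of that loop might look like the sketch below: a judge scores candidate responses, low scorers are dropped, and a sample of the rest goes to humans for spot-checking. The `judge_score` function is a hypothetical stand-in (it only measures length), and the threshold and review fraction are arbitrary; in a real pipeline the score would come from a strong model.

```python
import random

def judge_score(response):
    """Hypothetical stand-in for a strong model grading a response (0 to 1)."""
    return min(len(response.split()) / 8.0, 1.0)

def filter_candidates(candidates, threshold=0.5):
    """Keep only the candidates the judge rates at or above the threshold."""
    return [c for c in candidates if judge_score(c) >= threshold]

def sample_for_human_review(kept, fraction=0.5, seed=0):
    """Pick a random subset for humans to spot-check."""
    rng = random.Random(seed)
    count = max(1, round(len(kept) * fraction)) if kept else 0
    return rng.sample(kept, count)

candidates = [
    "Read docs.",
    "Build a small CLI tool in Go and read Effective Go alongside it.",
    "Write tests first, then refactor with the race detector enabled.",
    "It depends.",
]
kept = filter_candidates(candidates)
review_queue = sample_for_human_review(kept)
print(len(kept), len(review_queue))  # prints "2 1"
```

Humans review one item instead of four, which is where the scalability gain comes from.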

The future of post-training may involve humans increasingly acting as supervisors while models handle much of the operational workload.


Why Data Quality Beats Model Size

A common assumption is that better AI comes primarily from larger models.

Recent results across the industry increasingly suggest otherwise.

Many recent gains come not from:

More parameters

but from:

Better post-training data

A smaller model trained on excellent preference data can often outperform a larger model trained on mediocre data.

This explains why modern research papers frequently emphasize:

  • Preference datasets
  • Evaluation quality
  • Synthetic data generation
  • Data filtering
  • Alignment pipelines

The quality of feedback often matters more than the quantity of compute.

Pretraining is about language, SFT is about sensible responses

Pretraining teaches a model how language works.

Supervised Fine-Tuning teaches it how to respond.

Reward Modeling teaches it what humans prefer.

Reinforcement Learning teaches it to consistently optimize for those preferences.

Together, these stages transform a statistical text predictor into something that feels surprisingly useful.

As foundation models become increasingly capable, the competitive advantage may shift away from raw model size and toward the sophistication of post-training systems and data quality pipelines.

The next major breakthrough in AI might not come from a bigger model.

It might come from a better teacher.

If you were designing a reward model for a coding assistant, what signals would you optimize for: correctness, readability, maintainability, performance, or something else entirely?


AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.

HexmosTech / git-lrc

Free, Micro AI Code Reviews That Run on Git Commit



