Shrijith Venkatramana

Posted on Jun 19

The Three Phases of Post-Training: How LLMs Learn to Provide Sensible Responses

#webdev #ai #programming #productivity

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.

Most developers have heard the phrase:

"LLMs are trained on massive amounts of internet data."

While technically true, it leaves out the most interesting part.

Pretraining teaches a model how language works. But it doesn't teach the model how to be helpful, harmless, honest, or aligned with human expectations.

If pretraining creates a brilliant but socially awkward intern, post-training turns that intern into a productive teammate.

Modern AI systems such as ChatGPT, Gemini, Claude, and others rely heavily on a multi-stage post-training pipeline. While implementations differ, the overall pattern is surprisingly consistent:

Supervised Fine-Tuning (SFT)
Reward Modeling (RM)
Reinforcement Learning (RL)

Let's explore what each stage does, why it exists, and how they work together.

Why Pretraining Isn't Enough

Imagine we train a model on the entire internet and ask:

"How do I become a better software engineer?"

The model has seen thousands of answers:

Good advice
Bad advice
Contradictory advice
Sarcasm
Reddit arguments
Technical blog posts
Motivational speeches

The model learns patterns in text, but it doesn't inherently know which response humans would prefer.

It only knows what tends to come next.

This is the core limitation of pretraining.

The model learns:

"What people write."

But not:

"What humans want."

Post-training bridges this gap.

Stage 1: Supervised Fine-Tuning (SFT)

The first step is teaching the model what good behavior looks like.

Researchers create high-quality examples, each pairing a user prompt with an ideal assistant response:

User: Explain TCP vs UDP.

Assistant:
TCP provides reliable ordered delivery...

Or:

User: Write a Python function that reverses a linked list.

Assistant:
def reverse(head):
    ...

Thousands or millions of these examples are collected.

The model is then trained to imitate the desired responses.

Conceptually:

Question → Ideal Answer

becomes

Model → Learn to reproduce ideal answer

The objective is still next-token prediction, but now on carefully curated data instead of arbitrary internet text.
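To make that objective concrete, here is a minimal sketch of the loss for a single training example. It assumes something the post does not state: that the loss is averaged only over the assistant's tokens, with the prompt tokens masked out, which is a common (but not universal) choice. The function name `sft_loss` and the probabilities are made up for illustration.

```python
import math

def sft_loss(token_probs, is_response_token):
    """Average negative log-likelihood over the assistant's tokens.

    token_probs[i] is the probability the model gave to the i-th target token.
    is_response_token[i] is True when that token belongs to the assistant turn.
    Masking out prompt tokens is an assumption of this sketch.
    """
    losses = [
        -math.log(p)
        for p, keep in zip(token_probs, is_response_token)
        if keep
    ]
    if not losses:
        raise ValueError("example contains no response tokens")
    return sum(losses) / len(losses)

# Hypothetical per-token probabilities for one prompt/response pair.
probs = [0.9, 0.2, 0.5, 0.8, 0.4]
mask = [False, False, True, True, True]  # first two tokens are the prompt

loss = sft_loss(probs, mask)

# If the model becomes more confident in the ideal answer, the loss drops.
better_loss = sft_loss([0.9, 0.2, 0.9, 0.95, 0.9], mask)
```

Training nudges the weights so that the probabilities on the curated response tokens go up, which is exactly what drives `loss` down toward `better_loss`.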

Software Engineering Analogy

Think of SFT as onboarding a new engineer.

Instead of letting them learn exclusively from random GitHub repositories, you provide:

Coding standards
Architecture guidelines
Example pull requests
Internal best practices

The engineer begins to imitate the patterns you want.

What SFT Solves

SFT dramatically improves:

Instruction following
Formatting
Tool usage
Coding style
Conversational behavior

However, it still has a limitation.

For many prompts, there isn't one correct answer.

There may be multiple reasonable responses with varying quality levels.

That's where Reward Modeling enters.

Stage 2: Reward Modeling (RM)

Suppose a user asks:

"How should I learn distributed systems?"

Three responses might all be technically correct.

Response A:

Read a textbook.

Response B:

Read a textbook and build projects.

Response C:

Study networking, databases, consensus algorithms,
then implement a small Raft cluster.

Most humans would likely prefer C.

But how does a model learn that preference?

The answer is Reward Modeling.

Collecting Human Preferences

Human evaluators compare multiple outputs:

Prompt

Answer A
Answer B

They choose the better response.

Thousands or millions of comparisons are collected.

Example:

Prompt:
How do I learn Go?

Preferred:
Build projects and read Effective Go.

Rejected:
Just read documentation.

A separate model is trained to predict these preferences.

This becomes the Reward Model.

Conceptually:

(Prompt, Response) → Quality Score

The reward model acts like an automated judge.
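One common way to train such a judge from pairwise comparisons is a Bradley-Terry style loss: the reward model is penalized whenever it scores the rejected answer above the preferred one. The sketch below shows only the loss, not the model. The scores are hypothetical numbers standing in for reward-model outputs, and the post does not say which loss any particular system uses.

```python
import math

def preference_probability(score_preferred, score_rejected):
    """Bradley-Terry probability that the preferred answer wins the comparison."""
    return 1.0 / (1.0 + math.exp(-(score_preferred - score_rejected)))

def pairwise_loss(score_preferred, score_rejected):
    """Negative log-likelihood of the human's choice under the current scores."""
    return -math.log(preference_probability(score_preferred, score_rejected))

# Hypothetical reward-model scores for the "How do I learn Go?" pair.
agrees = pairwise_loss(2.0, -1.0)     # model already ranks the preferred answer higher
tie = pairwise_loss(0.0, 0.0)         # model cannot tell the answers apart
disagrees = pairwise_loss(-1.0, 2.0)  # model ranks the rejected answer higher
```

The loss is smallest when the model agrees with the human label and grows as it disagrees, so minimizing it over many comparisons pushes the scores toward human preferences.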

Why This Matters

SFT teaches:

"Produce answers similar to examples."

Reward Modeling teaches:

"Recognize which answers humans prefer."

This distinction is subtle but important.

One is imitation.

The other is evaluation.

Stage 3: Reinforcement Learning (RL)

Now we have:

A policy model (the assistant)
A reward model (the judge)

The final stage uses Reinforcement Learning to optimize the assistant.

The process looks like:

Prompt
   ↓
Model generates answer
   ↓
Reward model scores answer
   ↓
Update model to increase reward

Repeated millions of times.

Over time, the assistant learns to generate responses that maximize the reward signal.
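The loop above can be sketched with a toy policy that chooses between three fixed answers instead of generating text. This is a deliberately simplified illustration: it uses the exact policy gradient of the expected reward rather than sampling, and it omits the safeguards real systems add (PPO's clipping, and typically a penalty for drifting too far from the SFT model). The reward values are hypothetical.

```python
import math

def softmax(logits):
    """Turn logits into a probability distribution over answers."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, rewards, lr):
    """One gradient-ascent step on the expected reward.

    For a softmax policy, d(expected reward)/d(logit_i) = p_i * (r_i - expected).
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [
        logit + lr * p * (r - expected)
        for logit, p, r in zip(logits, probs, rewards)
    ]

# Hypothetical reward-model scores for answers A, B, and C.
rewards = [0.1, 0.5, 0.9]
logits = [0.0, 0.0, 0.0]  # the policy starts with no preference

for _ in range(500):
    logits = policy_gradient_step(logits, rewards, lr=1.0)

final_probs = softmax(logits)
```

After training, nearly all of the probability mass sits on the highest-scoring answer. That is also the risk: the policy chases whatever the reward model scores highly, flaws included.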

PPO and Modern Variants

Historically, many systems used:

PPO (Proximal Policy Optimization)

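More recently, related approaches have gained popularity:

DPO (Direct Preference Optimization), which trains directly on preference pairs and skips both the separate reward model and the RL loop
RLAIF (Reinforcement Learning from AI Feedback), which replaces some or all human preference labels with labels from another model
GRPO (Group Relative Policy Optimization) and related techniques, which score each sampled response against others for the same prompt instead of relying on a learned value model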

The exact algorithm matters less than the goal:

Move the model toward outputs that humans consistently prefer.
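As one example of how a variant reaches that goal, here is a sketch of the DPO loss for a single preference pair. It compares how much the model being trained has raised the log-probability of the preferred answer, relative to a frozen reference model, against how much it has raised the rejected one. The log-probabilities and the `beta` value below are illustrative assumptions, not numbers from any real run.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the total log-probability of a full response under
    either the model being trained (policy) or the frozen reference model.
    """
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference has been learned yet.
unchanged = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has raised the chosen answer and lowered the rejected one.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

No reward model appears anywhere in the function. The preference data shapes the policy directly, which is why DPO is often described as an alternative to the reward model plus RL pipeline rather than a drop-in replacement for PPO.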

Software Engineering Analogy

Imagine code review automation.

SFT teaches an engineer using examples of good pull requests.

Reward Modeling creates a senior reviewer that scores submissions.

RL repeatedly updates the engineer based on reviewer feedback.

Eventually the engineer starts producing code that receives better review scores.

The New Trend: Models Training Models

One interesting detail from recent Gemini research is the growing use of models themselves in post-training workflows.

Instead of relying exclusively on humans, powerful models help:

Generate candidate responses
Identify low-quality data
Detect inconsistencies
Perform ranking tasks
Assist evaluation pipelines

This creates a feedback loop:

Model
  ↓
Generates data
  ↓
Humans verify
  ↓
Improved model
  ↓
Generates better data

The result is dramatically improved scalability.

The future of post-training may involve humans increasingly acting as supervisors while models handle much of the operational workload.
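A small sketch of what "models handle the operational workload" could look like: a judge model scores candidate responses, obviously weak ones are dropped, and the rest are ranked so human reviewers see the most promising first. Everything here is hypothetical. In particular, `toy_judge` scores by word count purely so the example runs without a model; a real pipeline would call a model-based grader instead.

```python
def prefilter_for_review(candidates, judge, min_score=0.5):
    """Drop low-scoring candidates and rank the rest for human verification."""
    scored = [(judge(text), text) for text in candidates]
    kept = [(score, text) for score, text in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept]

def toy_judge(text):
    """Stand-in for a model-based grader: longer answers score higher, capped at 1.0.

    Length is a poor quality signal; it is used here only to keep the sketch runnable.
    """
    return min(len(text.split()) / 10, 1.0)

candidates = [
    "Read a textbook.",
    "Read a textbook and build projects.",
    "Study networking, databases, consensus algorithms, "
    "then implement a small Raft cluster.",
]

# The shortest answer is filtered out; the other two are queued best-first.
review_queue = prefilter_for_review(candidates, toy_judge)
```

Humans still make the final call, but they spend their time on a shorter, pre-sorted queue.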

Why Data Quality Beats Model Size

A common assumption is that better AI comes primarily from larger models.

Recent results increasingly suggest otherwise.

Many recent gains come not from:

More parameters

but from:

Better post-training data

A smaller model trained on excellent preference data can outperform a larger model trained on mediocre data, at least on the tasks that data covers.

This explains why modern research papers frequently emphasize:

Preference datasets
Evaluation quality
Synthetic data generation
Data filtering
Alignment pipelines

The quality of feedback often matters more than the quantity of compute.

Pretraining is about language, SFT is about sensible responses

Pretraining teaches a model how language works.

Supervised Fine-Tuning teaches it how to respond.

Reward Modeling teaches it what humans prefer.

Reinforcement Learning teaches it to consistently optimize for those preferences.

Together, these stages transform a statistical text predictor into something that feels surprisingly useful.

As foundation models become increasingly capable, the competitive advantage may shift away from raw model size and toward the sophistication of post-training systems and data quality pipelines.

The next major breakthrough in AI might not come from a bigger model.

It might come from a better teacher.

If you were designing a reward model for a coding assistant, what signals would you optimize for: correctness, readability, maintainability, performance, or something else entirely?

AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.

HexmosTech / git-lrc

Free, Micro AI Code Reviews That Run on Git Commit
