Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star the project to help developers discover it, give it a try, and share your feedback on how to improve it.
Most developers have heard the phrase:
"LLMs are trained on massive amounts of internet data."
While technically true, it leaves out the most interesting part.
Pretraining teaches a model how language works. But it doesn't teach the model how to be helpful, harmless, honest, or aligned with human expectations.
If pretraining creates a brilliant but socially awkward intern, post-training turns that intern into a productive teammate.
Modern AI systems such as ChatGPT, Gemini, Claude, and others rely heavily on a multi-stage post-training pipeline. While implementations differ, the overall pattern is surprisingly consistent:
- Supervised Fine-Tuning (SFT)
- Reward Modeling (RM)
- Reinforcement Learning (RL)
Let's explore what each stage does, why it exists, and how they work together.
Why Pretraining Isn't Enough
Imagine we train a model on the entire internet and ask:
"How do I become a better software engineer?"
The model has seen thousands of answers:
- Good advice
- Bad advice
- Contradictory advice
- Sarcasm
- Reddit arguments
- Technical blog posts
- Motivational speeches
The model learns patterns in text, but it doesn't inherently know which response humans would prefer.
It only knows what tends to come next.
This is the core limitation of pretraining.
The model learns:
"What people write."
But not:
"What humans want."
Post-training bridges this gap.
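To make "it only knows what tends to come next" concrete, here is a minimal sketch of next-word prediction using simple bigram counts. Real pretraining uses neural networks over tokens, not word counts, and the corpus below is made up for illustration. The point is only that the most frequent continuation wins, whether or not a human would prefer it.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def most_likely_next(counts, word):
    """Return the most frequent continuation seen in training, or None."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy corpus (hypothetical): the popular advice wins over the better advice.
corpus = [
    "just rewrite everything",
    "just rewrite everything",
    "just write tests",
]
counts = train_bigram(corpus)
print(most_likely_next(counts, "just"))  # rewrite
```

The model here has no notion of which advice is better. It reproduces whatever was written most often.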
Stage 1: Supervised Fine-Tuning (SFT)
The first step is teaching the model what good behavior looks like.
Researchers create high-quality prompt-and-response examples, such as:
User: Explain TCP vs UDP.
Assistant:
TCP provides reliable ordered delivery...
Or:
User: Write a Python function that reverses a linked list.
Assistant:
def reverse(head):
...
Thousands or millions of these examples are collected.
The model is then trained to imitate the desired responses.
Conceptually:
Question → Ideal Answer
becomes
Model → Learn to reproduce ideal answer
The objective is still next-token prediction, but now on carefully curated data instead of arbitrary internet text.
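The objective described above can be sketched as an average negative log-likelihood over the tokens of the ideal answer. This is a simplified illustration, not any particular lab's implementation: the probabilities are hypothetical model outputs, and scoring only the assistant's tokens (masking out the prompt) is a common choice that this sketch assumes.

```python
import math

def sft_loss(token_probs, targets, is_response):
    """Average negative log-likelihood over response tokens only.

    token_probs[i] maps candidate tokens to the probability the model
    assigned at position i; targets[i] is the token in the curated example.
    """
    losses = [
        -math.log(probs[target])
        for probs, target, keep in zip(token_probs, targets, is_response)
        if keep
    ]
    return sum(losses) / len(losses)

# Hypothetical example: prompt token "hi", ideal answer "hello there".
targets = ["hi", "hello", "there"]
is_response = [False, True, True]  # assumption: prompt tokens are not scored
token_probs = [
    {"hi": 0.1, "yo": 0.9},
    {"hello": 0.5, "bye": 0.5},
    {"there": 0.25, "world": 0.75},
]
print(round(sft_loss(token_probs, targets, is_response), 4))  # 1.0397
```

Training lowers this number by raising the probability the model assigns to each token of the curated answer.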
Software Engineering Analogy
Think of SFT as onboarding a new engineer.
Instead of letting them learn exclusively from random GitHub repositories, you provide:
- Coding standards
- Architecture guidelines
- Example pull requests
- Internal best practices
The engineer begins to imitate the patterns you want.
What SFT Solves
SFT dramatically improves:
- Instruction following
- Formatting
- Tool usage
- Coding style
- Conversational behavior
However, it still has a limitation.
For many prompts, there isn't one correct answer.
There may be multiple reasonable responses with varying quality levels.
That's where Reward Modeling enters.
Stage 2: Reward Modeling (RM)
Suppose a user asks:
"How should I learn distributed systems?"
Three responses might all be technically correct.
Response A:
Read a textbook.
Response B:
Read a textbook and build projects.
Response C:
Study networking, databases, consensus algorithms,
then implement a small Raft cluster.
Most humans would likely prefer C.
But how does a model learn that preference?
The answer is Reward Modeling.
Collecting Human Preferences
Human evaluators compare multiple outputs:
Prompt
Answer A
Answer B
They choose the better response.
Thousands or millions of comparisons are collected.
Example:
Prompt:
How do I learn Go?
Preferred:
Build projects and read Effective Go.
Rejected:
Just read documentation.
A separate model is trained to predict these preferences.
This becomes the Reward Model.
Conceptually:
Response → Quality Score
The reward model acts like an automated judge.
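One common way to train such a judge is a pairwise (Bradley-Terry style) loss that pushes the preferred response's score above the rejected one's. The sketch below assumes a tiny linear reward model over two hand-made features (hypothetical, chosen only for illustration). Real reward models are typically neural networks that read the full text.

```python
import math

def reward(weights, features):
    """Linear reward: a weighted sum of response features."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, preferred, rejected):
    """-log sigmoid(r_preferred - r_rejected)."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    return math.log(1.0 + math.exp(-margin))

def train(pairs, n_features, lr=0.5, epochs=200):
    """Plain gradient descent on the pairwise loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # d(loss)/d(margin) = -sigmoid(-margin)
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(n_features):
                weights[i] -= lr * grad_scale * (preferred[i] - rejected[i])
    return weights

# Hypothetical features: [suggests_building_something, names_specific_topics]
A = [0.0, 0.0]  # "Read a textbook."
B = [1.0, 0.0]  # "Read a textbook and build projects."
C = [1.0, 1.0]  # "Study networking, ... implement a small Raft cluster."

# Human comparisons as (preferred, rejected) pairs.
pairs = [(C, B), (B, A), (C, A)]
weights = train(pairs, n_features=2)
scores = {name: reward(weights, f) for name, f in [("A", A), ("B", B), ("C", C)]}
print(sorted(scores, key=scores.get, reverse=True))  # ['C', 'B', 'A']
```

Note that the model never sees an absolute "quality" label. It only learns from which response won each comparison.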
Why This Matters
SFT teaches:
"Produce answers similar to examples."
Reward Modeling teaches:
"Recognize which answers humans prefer."
This distinction is subtle but important.
One is imitation.
The other is evaluation.
Stage 3: Reinforcement Learning (RL)
Now we have:
- A policy model (the assistant)
- A reward model (the judge)
The final stage uses Reinforcement Learning to optimize the assistant.
The process looks like:
Prompt
↓
Model generates answer
↓
Reward model scores answer
↓
Update model to increase reward
Repeated millions of times.
Over time, the assistant learns to generate responses that maximize the reward signal.
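The loop above can be sketched with a basic policy-gradient update (REINFORCE) on a toy problem: one prompt, three fixed candidate answers, and reward scores that are made up for illustration. Production systems add pieces this sketch leaves out, such as PPO's clipped updates and a penalty for drifting too far from the SFT model.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_policy(rewards, steps=5000, lr=0.1, seed=0):
    """REINFORCE for a single prompt with a fixed set of candidate answers."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        # Model generates an answer (samples from the current policy).
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        # Reward model scores the answer.
        r = rewards[action]
        # Baseline: expected reward under the current policy. This is a toy
        # shortcut; it needs every candidate's score, which real systems lack.
        baseline = sum(p * x for p, x in zip(probs, rewards))
        advantage = r - baseline
        # Update: gradient of log pi(action) w.r.t. logits is onehot - probs.
        for i in range(len(logits)):
            indicator = 1.0 if i == action else 0.0
            logits[i] += lr * advantage * (indicator - probs[i])
    return softmax(logits)

# Hypothetical reward-model scores for three candidate answers.
rewards = [0.1, 0.5, 0.9]
probs = train_policy(rewards)
print([round(p, 3) for p in probs])
```

After training, most of the probability mass should sit on the highest-scoring answer, which is the toy version of "the assistant learns to generate responses that maximize the reward signal."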
PPO and Modern Variants
Historically, many systems used:
- PPO (Proximal Policy Optimization)
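More recently, newer approaches have gained popularity:
- DPO (Direct Preference Optimization), which trains directly on preference pairs without a separate reward model or RL loop
- RLAIF (Reinforcement Learning from AI Feedback), which replaces some human preference labels with model-generated ones
- GRPO and related techniques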
The exact algorithm matters less than the goal:
Move the model toward outputs that humans consistently prefer.
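As one concrete instance of the variants listed above, the DPO objective can be written as a loss on a single preference pair. The sketch below takes sequence log-probabilities as plain numbers (hypothetical values) from the model being trained and from a frozen reference model. The `beta` default is an illustrative choice, not a recommendation.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more (or less) likely each response became versus the reference.
    chosen_shift = policy_chosen - ref_chosen
    rejected_shift = policy_rejected - ref_rejected
    margin = beta * (chosen_shift - rejected_shift)
    return math.log(1.0 + math.exp(-margin))  # equals -log sigmoid(margin)

# Policy identical to the reference: no preference learned yet.
print(round(dpo_loss(-12.0, -12.0, -12.0, -12.0), 4))  # 0.6931

# Policy has raised the preferred answer and lowered the rejected one.
print(dpo_loss(-10.0, -15.0, -12.0, -12.0) < dpo_loss(-12.0, -12.0, -12.0, -12.0))  # True
```

The loss falls as the model makes the preferred response more likely, relative to the reference, than the rejected one.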
Software Engineering Analogy
Imagine code review automation.
SFT teaches an engineer using examples of good pull requests.
Reward Modeling creates a senior reviewer that scores submissions.
RL repeatedly updates the engineer based on reviewer feedback.
Eventually the engineer starts producing code that receives better review scores.
The New Trend: Models Training Models
One interesting detail from recent Gemini research is the growing use of models themselves in post-training workflows.
Instead of relying exclusively on humans, powerful models help:
- Generate candidate responses
- Identify low-quality data
- Detect inconsistencies
- Perform ranking tasks
- Assist evaluation pipelines
This creates a feedback loop:
Model
↓
Generates data
↓
Humans verify
↓
Improved model
↓
Generates better data
The result is dramatically improved scalability.
The future of post-training may involve humans increasingly acting as supervisors while models handle much of the operational workload.
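Here is a rough sketch of what such a loop might look like in code: a model generates candidates, a judge ranks them, and only the close calls go to humans. Every name here (`build_preference_pairs`, `toy_generate`, `toy_judge`, `review_margin`) is hypothetical, and the length-based judge is a stand-in for a real model, not a sensible quality signal.

```python
def build_preference_pairs(prompts, generate, judge, review_margin=0.2):
    """Turn model-generated candidates into preference pairs.

    Pairs whose judge scores are too close to call are set aside for
    human review instead of being trusted automatically.
    """
    auto_labeled, needs_human = [], []
    for prompt in prompts:
        candidates = generate(prompt)
        ranked = sorted(candidates, key=lambda c: judge(prompt, c), reverse=True)
        best, worst = ranked[0], ranked[-1]
        gap = judge(prompt, best) - judge(prompt, worst)
        record = {"prompt": prompt, "preferred": best, "rejected": worst}
        (auto_labeled if gap >= review_margin else needs_human).append(record)
    return auto_labeled, needs_human

def toy_generate(prompt):
    # Stand-in for sampling several responses from a model.
    return ["Read docs.", "Build a small project, then read Effective Go."]

def toy_judge(prompt, response):
    # Stand-in for a judge model: scores by word count, capped at 1.0.
    return min(len(response.split()) / 10.0, 1.0)

auto, human = build_preference_pairs(["How do I learn Go?"], toy_generate, toy_judge)
print(len(auto), len(human))  # 1 0
```

Raising `review_margin` sends more pairs to humans. That threshold is where the trade-off between scale and oversight shows up in this sketch.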
Why Data Quality Beats Model Size
A common assumption is that better AI comes primarily from larger models.
Recent industry results increasingly suggest otherwise.
Many recent gains come not from:
More parameters
but from:
Better post-training data
A smaller model trained on excellent preference data can often outperform a larger model trained on mediocre data.
This explains why modern research papers frequently emphasize:
- Preference datasets
- Evaluation quality
- Synthetic data generation
- Data filtering
- Alignment pipelines
The quality of feedback often matters more than the quantity of compute.
Pretraining is about language, SFT is about sensible responses
Pretraining teaches a model how language works.
Supervised Fine-Tuning teaches it how to respond.
Reward Modeling teaches it what humans prefer.
Reinforcement Learning teaches it to consistently optimize for those preferences.
Together, these stages transform a statistical text predictor into something that feels surprisingly useful.
As foundation models become increasingly capable, the competitive advantage may shift away from raw model size and toward the sophistication of post-training systems and data quality pipelines.
The next major breakthrough in AI might not come from a bigger model.
It might come from a better teacher.
If you were designing a reward model for a coding assistant, what signals would you optimize for: correctness, readability, maintainability, performance, or something else entirely?
*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.
git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*
Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.
