Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.
Large Language Models have gotten remarkably good at generating text.
But there has always been a fundamental problem:
How do you tell an AI whether its answer is actually correct?
For creative writing, opinions, brainstorming, and conversations, correctness is fuzzy. Human feedback is usually required.
But what about problems where correctness can be objectively verified?
- Is the code passing the tests?
- Did the SQL query return the expected result?
- Does the mathematical proof produce the correct answer?
- Does the generated webpage match the specification?
This simple observation is driving one of the most interesting developments in modern AI training:
Reinforcement Learning with Verifiable Rewards (RLVR).
Instead of asking humans to score every answer, we let reality score it.
And that changes everything.
The Traditional RLHF Approach
Most developers have heard of Reinforcement Learning from Human Feedback (RLHF).
The basic process looks like this:
- Model generates answers.
- Humans rank the answers.
- A reward model learns human preferences.
- Reinforcement learning optimizes against that reward model.
Conceptually:
Question
↓
Model Output
↓
Human Evaluation
↓
Reward Signal
↓
Model Improvement
This worked well for making models more helpful, harmless, and conversational.
But it has a major limitation:
Humans are expensive.
You need thousands or millions of human judgments.
Even worse, humans often disagree.
Ask ten programmers whether a piece of code is elegant and you'll get eleven opinions.
The Key Insight: Some Tasks Are Self-Verifying
Now imagine a different task:
Write a Python function that reverses a linked list.
You don't necessarily need a human reviewer.
You can simply run:
pytest
If all tests pass:
Reward = 1
If tests fail:
Reward = 0
The reward becomes objective.
This is the central idea behind RLVR.
Instead of asking:
"Does a human like this answer?"
we ask:
"Can we verify this answer automatically?"
Whenever verification is possible, reward generation becomes dramatically cheaper and more scalable.
Why This Works So Well for Coding
Coding is one of the most natural domains for RLVR.
Consider a coding benchmark:
Input:
Implement binary search.
Output:
Generated code
Verification is straightforward:
run_tests()
If:
assert binary_search([1,2,3],2) == 1
passes for all test cases, the model receives a high reward.
Otherwise it receives a low reward.
The model gradually learns patterns that lead to successful execution.
Over millions of examples, it begins discovering:
- Better debugging strategies
- Better decomposition strategies
- Better reasoning chains
- Better code structures
without needing humans to manually inspect every solution.
This is one reason coding models have improved so rapidly in recent years.
Beyond Coding: Mathematics
Mathematics is another ideal RLVR environment.
Suppose the task is:
Solve:
127 × 348
The final answer can be checked automatically.
Even more interesting:
Find x:
2x + 5 = 17
Verification is trivial:
Substitute x
Check equation
Correct answer?
Reward = 1.
Incorrect answer?
Reward = 0.
This allows models to practice enormous numbers of mathematical problems without requiring armies of human annotators.
Many recent reasoning-focused models have benefited heavily from this kind of training.
What Is Actually Being Optimized?
Under the hood, RLVR still looks like reinforcement learning.
The model generates a solution:
State → Action → Outcome
The difference is the source of the reward.
Traditional RLHF:
Reward = Human Preference
RLVR:
Reward = Verifiable Correctness
A simplified objective looks like:
maximize E[reward]
where reward comes from an automated verifier.
The verifier might be:
- Unit tests
- Mathematical checking
- Compilation success
- Benchmark execution
- Formal proof validation
- Simulation outcomes
The model is effectively searching for behaviors that maximize success rates.
An Example: Training a Coding Model
Imagine training an AI on algorithmic problems.
For each problem:
Problem
↓
Model generates solution
↓
Compile
↓
Run tests
↓
Assign reward
Example:
def factorial(n):
return n
Tests:
assert factorial(5) == 120
Fails.
Reward:
0
The model tries another approach:
def factorial(n):
if n <= 1:
return 1
return n * factorial(n - 1)
Tests pass.
Reward:
1
Over time, reinforcement learning shifts probability mass toward successful behaviors.
The model isn't memorizing solutions.
It's learning patterns associated with success.
The Hidden Superpower: Scaling Rewards
The most important consequence of RLVR is not accuracy.
It's scalability.
Suppose you want:
- 10 million training examples
- 100 million training examples
- 1 billion training examples
Human evaluation becomes impossible.
Automated verification remains feasible.
Once a verifier exists, reward generation can scale almost indefinitely.
This transforms the economics of model training.
Instead of hiring more evaluators, you simply generate more problems and run more verifications.
Many researchers believe this is one of the major reasons reasoning and coding models have improved so quickly over the last few years.
Limitations and Open Problems
RLVR is powerful, but it is not universal.
Many important tasks lack objective verification.
Examples include:
- Writing a compelling novel
- Designing a great product strategy
- Creating a persuasive marketing campaign
- Conducting a nuanced negotiation
For these domains, correctness is subjective.
Human judgment remains necessary.
Another challenge is reward hacking.
If a model discovers shortcuts that exploit the verifier rather than solving the underlying problem, training can become misleading.
The verifier itself must be robust.
In practice, designing good reward functions is often harder than training the model.
Final Thoughts
For years, the AI community focused on teaching models through human preferences.
RLVR introduces a different idea:
Whenever reality can verify an answer, let reality provide the reward.
For coding, mathematics, theorem proving, scientific reasoning, and other objective domains, this approach dramatically reduces the need for human supervision while enabling massive training scale.
The result is a new generation of models that aren't just learning from people.
They're learning from whether their outputs actually work.
And that may be one of the most important shifts in modern AI training.
If you were training an AI for your domain, what would serve as the verifier? Unit tests, simulations, customer metrics, formal proofs, or something else entirely?
*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.
git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*
Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.
HexmosTech
/
git-lrc
Free, Micro AI Code Reviews That Run on Git Commit
| 🇩🇰 Dansk | 🇪🇸 Español | 🇮🇷 Farsi | 🇫🇮 Suomi | 🇯🇵 日本語 | 🇳🇴 Norsk | 🇵🇹 Português | 🇷🇺 Русский | 🇦🇱 Shqip | 🇨🇳 中文 | 🇮🇳 हिन्दी |
git-lrc
Free, Micro AI Code Reviews That Run on Commit
GenAI today is a race car without brakes. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents silently break things: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.
git-lrc is your braking system. It hooks into git commit and runs an AI review on every diff before it lands. 60-second setup. Completely free.
In short, git-lrc helps Prevent Outages, Breaches, and Technical Debt Before They Happen
At a glance: 10 risk categories · 100+ failure patterns tracked · every commit…

Top comments (0)