Shrestha Pandey

Why Reasoning Models Changed Everything

For years, making language models smarter meant making them bigger. Then someone asked a different question: what if, instead of training more, you let the model think longer?

In September 2024, OpenAI released o1. In January 2025, DeepSeek released R1. Together, these two models invalidated an assumption that had governed the entire field since 2020: that the path to better AI runs through bigger training runs.

This assumption was backed by the Kaplan et al. scaling laws paper, which showed that language model performance follows a reliable power law in training compute. Provide more parameters, more data, and more GPU-hours, and you get a more capable model. The field organized itself around this insight, and every major lab poured billions into pre-training.

o1 and R1 showed that there’s a second dimension to scale that the field had largely ignored: compute at inference time. And it turns out that for tasks requiring multi-step reasoning, this second dimension can be just as powerful as the first, and far cheaper to exploit.

What chain-of-thought actually is

Chain-of-thought prompting has been around since 2022, when Wei et al. at Google Brain showed that simply asking a language model to “think step by step” before answering dramatically improved its performance on math and logic tasks. This was a surprising result: the model wasn’t being retrained, just prompted differently. The extra tokens the model generated while reasoning served as a scratchpad that improved its final answer.

Transformers generate one token at a time, and each token is conditioned on all previous tokens. When a model solves a math problem by writing out steps, those steps become part of the context that informs the final answer. The model is using its own output as working memory. Without chain-of-thought, it has to compress all that computation into a single forward pass.
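The mechanics can be sketched with a toy decoder. The lookup table below is a deliberately crude stand-in for a real model; the point is only that each new token is chosen from the context built so far, which is how reasoning tokens become working memory.

```python
# Toy autoregressive decoding loop. NEXT_TOKEN is a stand-in for a real
# model's next-token distribution; a real transformer would condition on
# the entire context, not just the last token.
NEXT_TOKEN = {
    "2+2": "=",
    "=": "4",
    "4": "<eos>",
}

def generate(prompt: str, max_tokens: int = 10) -> list[str]:
    context = [prompt]
    for _ in range(max_tokens):
        # Everything generated so far is available to the "model";
        # here we peek only at the last token for simplicity.
        token = NEXT_TOKEN.get(context[-1], "<eos>")
        if token == "<eos>":
            break
        context.append(token)
    return context

print(generate("2+2"))  # ['2+2', '=', '4']
```

Each intermediate token ("=", then "4") is appended to the context before the next step, which is the same structural trick chain-of-thought exploits at scale.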

Until o1, nobody had figured out how to train a model to do this reliably. Prompting a model to “think step by step” helps, but the quality of the reasoning is inconsistent: sometimes you get careful, structured reasoning, and sometimes you get verbose filler that doesn’t really help. The model doesn’t know when to think hard and when not to.

The reinforcement learning connection

OpenAI’s o1 system card describes training the model with reinforcement learning to produce a chain of thought before answering. The key idea is that RL can teach the model how to think: to develop reasoning strategies that lead to correct answers on verifiable tasks like mathematics and code.

According to OpenAI’s published description, o1 learns through RL to recognize and correct its mistakes mid-reasoning, break hard problems into simpler subproblems, and abandon approaches that aren’t working. These behaviors (self-correction, decomposition, backtracking) emerge from the training signal rather than being explicitly programmed.

The result is a model where more thinking time produces better answers. On the AIME 2024 benchmark (American Invitational Mathematics Examination), o1 solved 74% of the problems with a single sample per problem. GPT-4o, on the same benchmark, scored around 9%. That gap comes from the model being trained to use its inference compute more productively.

DeepSeek R1: showing the mechanism

DeepSeek R1, published in January 2025, matched o1’s performance on reasoning benchmarks, but matching scores was the easy part. What made R1 significant was that the paper described the training recipe in enough depth for others in the field to study and replicate it.

They started with DeepSeek-V3-Base, a 671B-parameter pretrained model built on a Mixture-of-Experts architecture. Their first experiment, DeepSeek-R1-Zero, applied reinforcement learning directly to this base model without any supervised fine-tuning beforehand. The reward signal was simple: the model was rewarded for producing a correct final answer and for formatting its output with its reasoning inside explicit <think> tags.

No human-labeled reasoning examples were used to build the reward signal, and no preference-based human reward model was involved. The entire RL setup boiled down to two questions: did you produce the correct final answer, and did you use the expected format?

The results of R1-Zero were remarkable and, to be honest, a little unsettling. The model's average pass@1 score on AIME 2024 increased from 15.6% at the start of training to 71.0% by the end. More striking was what happened to the model's behavior during this process. The reasoning traces grew substantially longer as training progressed.

The model spontaneously developed strategies like re-reading the problem from the beginning partway through a solution, checking its own work, and explicitly noting when it suspected an error. None of this was designed in. It emerged from optimizing for correct answers.

The paper describes a notable emergent behavior where the model, while solving a math problem, pauses mid-reasoning, re-evaluates its approach, and switches to a different strategy to reach the correct answer. This emerges from reward training, where the model learns that revisiting its reasoning can sometimes lead to better outcomes.

The training algorithm: GRPO

The RL algorithm DeepSeek used, Group Relative Policy Optimization (GRPO), is worth understanding because it’s part of why this approach is tractable at scale.

Standard reinforcement learning for language models (the approach used in earlier RLHF pipelines) depends on Proximal Policy Optimization (PPO). PPO requires an additional network called a critic: a second large neural network that estimates the value of partially completed responses. The critic is typically the same size as the model being trained, which roughly doubles the memory and compute of every training step.

GRPO, first introduced in the DeepSeekMath paper (2024), eliminates the critic. Instead of estimating value with a learned model, GRPO samples a group of responses for each prompt and scores each one. The group’s mean reward serves as the baseline, and each response’s advantage (the signal that tells the model whether that response was better or worse than expected) is computed as:

A_i = (r_i - mean(r_1...r_G)) / std(r_1...r_G)
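The formula is simple enough to sketch directly. This is a minimal illustration of the group-relative advantage computation, not DeepSeek's training code; the choice of population standard deviation and the zero-variance fallback are assumptions for the sketch.

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as in GRPO: each response is scored
    against its own group's mean and std, so no learned critic is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    if std == 0:
        # All responses scored the same: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four sampled responses to one prompt: two correct (reward 1), two wrong (0).
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Correct responses in a mostly-wrong group get a large positive advantage, and vice versa, which is exactly the signal a critic network would otherwise have to learn to provide.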

This is essentially the REINFORCE algorithm from the early 1990s, run at modern scale and combined with the clipping mechanism from PPO for training stability. Cutting memory and compute roughly in half is a huge win when your policy model has 671 billion parameters.

The reward function had only two components: correctness rewards (whether the final answer matched the ground truth for math and coding problems) and format rewards (whether the model used the expected output structure).

There were no process rewards and no step-by-step supervision. The implicit bet was that, given a correct-answer signal and plenty of freedom to explore, the model would find its own way to good reasoning.
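A rule-based reward of this kind is easy to picture in code. The sketch below assumes exact-match answer checking and illustrative weights (0.1 for format, 1.0 for correctness); DeepSeek's actual checkers and weights are not public in this form.

```python
import re

def reward(output: str, ground_truth: str) -> float:
    """Sketch of an R1-Zero-style rule-based reward: a format reward for
    wrapping reasoning in <think> tags, plus a correctness reward for the
    final answer. Weights here are illustrative assumptions."""
    score = 0.0
    m = re.fullmatch(r"<think>(.*?)</think>\s*(.*)", output, re.DOTALL)
    if m:
        score += 0.1  # format reward: reasoning is inside <think> tags
        final_answer = m.group(2).strip()
        if final_answer == ground_truth:
            score += 1.0  # correctness reward: exact-match on the answer
    return score

print(reward("<think>2+2 is 4</think> 4", "4"))  # 1.1
print(reward("4", "4"))                          # 0.0 (no think tags)
```

Note there is no term rewarding any particular step inside the `<think>` span; the model is free to fill it however it likes, which is where the emergent behaviors come from.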

Why R1-Zero wasn't the final model

R1-Zero had problems. Because DeepSeek-V3-Base was pretrained on multilingual data, the model sometimes switched languages mid-reasoning: it would start a problem in English, shift into Chinese for a few sentences, then continue in English. The reasoning was often correct, but the output was hard to read. Readability suffered in other ways too; the thinking traces were sometimes clear internal monologues and at other times nearly incoherent.

DeepSeek R1 addressed this through a more involved training pipeline. They first collected a small set of high-quality chain-of-thought examples that demonstrated the kind of structured, readable reasoning they wanted, and used it for supervised fine-tuning before RL began, giving the RL process a better starting point. They also added a language-consistency reward to penalize mid-reasoning language switching. A subsequent round of rejection sampling and SFT on the RL model's own outputs added coverage for non-reasoning tasks like writing and general question-answering. The result, DeepSeek-R1, performs comparably to OpenAI's o1-1217 on reasoning benchmarks.

DeepSeek also released smaller models at 7B, 14B, and 32B parameters, trained by fine-tuning on reasoning traces generated by the full R1 model. The 32B distilled version outperforms o1-mini on several benchmarks. This is knowledge distillation applied to reasoning: smaller models learn reasoning patterns from a larger model without running the expensive RL training themselves.
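The data-preparation side of this kind of distillation is straightforward to sketch. Everything below is hypothetical scaffolding (the `teacher_outputs` sample and helper name are invented for illustration); the key point is that the student sees ordinary supervised text, with the teacher's trace included, and no RL is involved.

```python
# Hypothetical sketch of distillation data prep: the teacher's full
# reasoning traces become supervised fine-tuning targets for a smaller
# student model. `teacher_outputs` stands in for traces sampled from R1.
teacher_outputs = [
    {"prompt": "What is 12 * 13?",
     "trace": "<think>12*13 = 12*10 + 12*3 = 120 + 36 = 156</think> 156"},
]

def to_sft_examples(samples: list[dict]) -> list[str]:
    """Turn (prompt, trace) pairs into plain SFT text: the student is
    trained with ordinary next-token prediction on the teacher's trace."""
    return [s["prompt"] + "\n" + s["trace"] for s in samples]

dataset = to_sft_examples(teacher_outputs)
print(dataset[0])
```

Because the target text includes the `<think>` span, the student learns to imitate the reasoning style, not just the final answers.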

Scaling just got a second dimension

The pre-training scaling laws described by Kaplan et al. in 2020 showed a clean relationship: training compute in, model capability out. The Chinchilla paper (Hoffmann et al., 2022) refined this further, showing that for a fixed compute budget, the optimal strategy is to train a smaller model on more data rather than a larger model on less. These results organized the field for years.

Reasoning models introduce a second scaling curve. OpenAI’s o1 announcement reported that performance improves consistently as the model is given more time to think at inference.
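o1 spends its extra compute on one long learned chain of thought, but the simplest way to see the inference-compute trade-off is an older technique, self-consistency: sample several independent reasoning paths and majority-vote on the final answers. The sketch below is that simpler method, not o1's mechanism.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Self-consistency voting: given the final answers from several
    sampled reasoning chains, return the most common one. More samples
    means more inference compute and, typically, higher accuracy."""
    return Counter(answers).most_common(1)[0][0]

# Pretend we sampled 5 chains of thought for one math problem:
sampled_answers = ["42", "42", "17", "42", "17"]
print(majority_vote(sampled_answers))  # 42
```

The knob here is the number of samples: turning it up costs only inference compute, with no retraining, which is exactly the second scaling dimension the section describes.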

In “The Bitter Lesson” (2019), Richard Sutton argues that general methods that leverage computation tend to win out, in the long run, over methods that encode human knowledge explicitly. Scaling pre-training compute was the first instance of this; scaling inference compute through learned reasoning is the second. Far from being done with scaling, we have just opened a new chapter in that story.

That said, inference-time scaling works best for tasks with verifiable answers: math, formal proofs, code that can be tested. For these, it is straightforward to determine whether the model’s answer is correct.

Extending this to open-ended tasks (e.g., writing), where there is no clear ground truth, is far harder and a very active area of research. The current generation of reasoning models is clearly powerful in STEM and coding; how broadly they will generalize beyond those domains remains an open question.

This one is different

Every few years someone claims AI has reached a major breakthrough, and it usually turns out to be a modest improvement. This time is different: models can now convert extra inference compute into better reasoning, which opens possibilities that were not available before.

Tasks that previously needed human intervention, because models couldn’t be trusted to reason carefully, now work differently. You can spend more compute at inference time to get more reliable results: simple tasks run fast, while harder or more important problems get more time to think. That makes these models far more flexible tools than those of a few years ago.

DeepSeek R1 being released with open weights under an MIT license also changes the economics. The distilled R1-32B model, quantized to fit on a single high-end GPU, outperforms o1-mini on several benchmarks. Labs, researchers, and smaller organizations with limited resources can now run this class of model themselves, a dramatic expansion of access compared to what was possible before R1 was made freely available.

Making models bigger and training them on more data helps them learn more, like reading lots of books makes someone more knowledgeable. But letting models “think longer” when answering questions helps them reason better and make fewer mistakes, like a person taking more time to work through a problem.

Both of these improvements are important and now we have both. What we do with that combination is still an open question.

For more such developer content, visit:
https://vickybytes.com
