Chain-of-Thought and Beyond: How LLMs Actually Learn to Reason

"The ability to reason step-by-step is not just a feature. It might be the difference between a language model that sounds intelligent and one that actually is."


Introduction: When AI Started Thinking

In 2022, researchers at Google Brain published a paper titled "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". At the time, nobody quite anticipated it would mark the beginning of a shift that would reshape the entire AI field.

The idea was simple: instead of asking a model to answer directly, give it time to think. Ask it to write out intermediate steps. Accuracy improves dramatically.

That paper now sits at over 10,000 citations. But the question it raised has never been fully answered:

Do LLMs actually think? Or do they create a very convincing illusion of thinking?

That is what this blog is about. And as someone preparing for a PhD in AI, it is a question I keep coming back to.


Part 1: What Is Chain-of-Thought?

Standard Prompting vs. CoT Prompting

Imagine asking a model this:

"Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many does he have now?"

With standard prompting, the model jumps straight to: "11"

With chain-of-thought prompting, the model works through it first:

```
Roger starts with 5 balls.
2 cans × 3 balls = 6 balls.
5 + 6 = 11 balls.
Answer: 11
```

Both get the same answer. So what is the point?

The gap shows up on harder problems. Models that reason through steps outperform those that answer directly on multi-step math, symbolic reasoning, and commonsense problems. The more complex the task, the bigger the difference.
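
To make the contrast concrete, here is a minimal sketch of the two prompt styles in Python. This is not the original paper's code; the exemplar is just the Roger problem written out both ways, and the model call itself is left out.

```python
# Minimal sketch: standard vs. few-shot chain-of-thought prompts.
# The exemplar is the Roger problem from above.

QUESTION = (
    "A cafeteria had 23 apples. They used 20 for lunch and bought 6 more. "
    "How many apples do they have?"
)

# Standard prompting: the exemplar shows only the final answer,
# so the model learns to answer immediately.
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many does he have now?\n"
    "A: The answer is 11.\n\n"
    f"Q: {QUESTION}\nA:"
)

# CoT prompting: the exemplar spells out the intermediate steps,
# so the model imitates the step-by-step format before answering.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    f"Q: {QUESTION}\nA:"
)
```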

Zero-Shot CoT: One Phrase Changes Everything

In the same year, researchers discovered something even more surprising. Simply adding the phrase "Let's think step by step" to a question, without any examples, significantly improved reasoning accuracy.

No demonstrations. No fine-tuning. Just those five words.

This became known as zero-shot CoT. And the obvious follow-up question is: why does this even work?
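
In code, zero-shot CoT is almost nothing, which is part of what makes the result so strange. A sketch:

```python
# Zero-shot CoT: no exemplars, no fine-tuning, just the trigger phrase.
def zero_shot_cot_prompt(question: str) -> str:
    return f"Q: {question}\nA: Let's think step by step."

# The direct version, for comparison:
def direct_prompt(question: str) -> str:
    return f"Q: {question}\nA:"
```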


My Own Experiment: Testing CoT on GSM8K

Before going deeper into the theory, I wanted to test this myself. So I ran a small experiment using an open-source model on a standard benchmark.

Setup:

  • Model: Qwen 2.5 1.5B Instruct (free, runs on Kaggle GPU)
  • Dataset: GSM8K (grade school math problems)
  • Test: Standard prompting vs. "Let's think step by step"
  • Sample: 10 problems
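
Here is a condensed sketch of the evaluation loop. It assumes the Hugging Face transformers and datasets libraries; the prompts and answer parsing are simplified relative to the full script linked at the end of this post.

```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

def ask(question: str, cot: bool) -> str:
    # Zero-shot CoT is literally one extra sentence in the prompt.
    suffix = " Let's think step by step." if cot else " Answer with only the final number."
    messages = [{"role": "user", "content": question + suffix}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)

def last_number(text: str) -> str:
    # GSM8K gold answers end in "#### <number>", so taking the last number
    # works for the reference and (roughly) for the model's output too.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else ""

problems = load_dataset("gsm8k", "main", split="test").select(range(10))
for cot in (False, True):
    correct = sum(
        last_number(ask(ex["question"], cot)) == last_number(ex["answer"])
        for ex in problems
    )
    print(f"CoT={cot}: {correct}/10 correct")
```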

Results:

| Approach | Correct | Accuracy |
|---|---|---|
| Without CoT | 2/10 | 20% |
| With CoT | 3/10 | 30% |

[CoT vs No-CoT Results on GSM8K]

Even on a model roughly 360 times smaller than the one used in the original paper, the improvement showed up. A single phrase shifted accuracy by 10 percentage points.

A few things stood out from the per-problem breakdown:

Problem 1 was solved correctly with CoT, but not without it. Problem 7 showed the same pattern. Problem 4 was solved correctly either way. But Problem 6 was actually solved correctly without CoT and incorrectly with it. The model overthought a straightforward calculation and got it wrong.

That last observation matters and connects to something I discuss in Part 4.

Quick note: the overall accuracy numbers look low because this model is tiny compared to what the original paper used. The point here is the relative difference, not the absolute numbers.


Part 2: What Is Actually Happening Inside the Model?

More Than Pattern Matching

The common criticism of LLMs is that they are sophisticated autocomplete. They match patterns from training data rather than genuinely reasoning. This criticism is not entirely wrong, but it is incomplete.

Between 2023 and 2024, researchers doing mechanistic interpretability work found some interesting things inside these models.

LLMs contain specific reasoning circuits: groups of neurons and attention heads that work together to perform logical operations. They use something called induction heads, which are attention patterns that identify sequences in context and predict what follows. Some models have developed implicit world models, meaning they internally represent concepts like spatial relationships, time, and causality.

None of this was explicitly programmed. It emerged from training on text.

The picture that comes out of this research is more interesting than "just pattern matching." These models have developed internal structures that support reasoning-like behavior. Whether that constitutes real reasoning is a separate philosophical question, but it is clearly more than autocomplete.
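
To build intuition for induction heads specifically, here is a toy Python caricature of the pattern they implement. The real mechanism is a learned attention pattern operating on internal representations, not an explicit search, so treat this as an analogy only:

```python
# Toy caricature of an induction head: to guess the next token, find the
# most recent earlier occurrence of the current token and copy whatever
# followed it. [A][B] ... [A] -> predict [B].
def induction_predict(tokens: list[str]) -> str | None:
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_predict(["the", "cat", "sat", "on", "the"]))  # -> "cat"
```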

Process Reward Models: Grading the Work, Not Just the Answer

Here is an idea that changed how reasoning models are trained. Instead of grading only the final answer, what if you graded every individual reasoning step?

That is the core of a Process Reward Model (PRM).

In standard training, the model produces an answer and gets told whether it was right or wrong. In PRM-based training, each step in the reasoning chain gets its own score. A wrong step gets flagged early, before it derails the rest of the solution.

OpenAI's 2023 paper "Let's Verify Step by Step" showed that PRMs significantly outperform outcome-based reward models on mathematical reasoning tasks.
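
Schematically, one common way to use a trained PRM at inference time is to score candidate solutions step by step and rerank them. In this sketch, score_step is a placeholder for the trained reward model, and the product aggregation (one bad step sinks the whole chain) follows the spirit of the best-of-N setup in that paper:

```python
import math

# Schematic PRM-based reranking. `score_step` stands in for a trained
# process reward model returning P(step is correct) in [0, 1].
def score_chain(steps: list[str], score_step) -> float:
    # Product of per-step scores: a single bad step sinks the chain.
    return math.prod(score_step(step) for step in steps)

def best_of_n(candidates: list[list[str]], score_step) -> list[str]:
    # Keep the candidate whose reasoning chain the PRM trusts most.
    return max(candidates, key=lambda steps: score_chain(steps, score_step))
```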

This idea became the foundation for something much bigger, which I will cover in Week 12 when we get to test-time compute scaling.


Part 3: OpenAI o1 and DeepSeek-R1

OpenAI o1: Giving Models Time to Think

In September 2024, OpenAI released o1, and the response from the research community was immediate.

The idea behind o1 is straightforward: give the model more time to think at inference. Before producing an answer, o1 generates a hidden chain of thought that the user never sees but the model uses internally. This chain is trained with reinforcement learning: the model gets rewarded for reaching correct answers, which teaches it to develop better internal reasoning strategies.

The results on AIME 2024, a notoriously difficult high school math competition, were striking. GPT-4o scored 12%. o1 scored 74%.

That is not a small improvement. That is a different class of performance, driven almost entirely by letting the model think longer.

DeepSeek-R1: The Open Source Answer

In January 2025, a Chinese startup called DeepSeek released R1, and it caused genuine disruption in the Western AI community.

DeepSeek-R1 matched o1-level performance at a fraction of the training cost. And it was fully open source.

Three technical contributions made this possible.

Group Relative Policy Optimization (GRPO): Standard RLHF needs a separate critic model to score responses, which adds significant overhead. GRPO removes that requirement. Instead, the model generates multiple responses to the same question and rewards each one relative to the group average. No separate critic needed.
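
The group-relative part is simple enough to sketch in a few lines. This follows the published advantage normalization; the full GRPO objective also includes a clipped policy ratio and a KL penalty, which I am leaving out:

```python
# GRPO advantage sketch: sample a group of responses to the same prompt,
# score each one, and normalize against the group itself.
# The group is the baseline -- no learned critic involved.
def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = std or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers, only the first judged correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
# -> roughly [1.73, -0.58, -0.58, -0.58]
```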

Warm Start Before RL: Training with pure reinforcement learning straight from the base model (DeepSeek's R1-Zero experiment) produced strong reasoning, but the process was unstable early on and the outputs were often hard to read. Their fix was to first run a short supervised fine-tuning "cold start" phase to give the model a reasonable starting point, then apply RL on top of that. A sensible idea that turned out to matter a lot.

Emergent Reasoning Behaviors: During training, R1 developed behaviors that were never explicitly programmed. The model began catching its own mistakes mid-reasoning and reconsidering. It started verifying its own answers before finalizing them. It explored alternative solution paths. These behaviors just appeared from the training process. For researchers trying to understand what is happening inside these models, this is genuinely interesting territory.


Part 4: Where CoT Fails

Unfaithful Reasoning

One of the more unsettling findings in recent research is that CoT explanations do not always reflect what the model actually computed.

Anthropic's 2023 research showed that models sometimes produce post-hoc rationalizations. They settle on an answer through some internal process, then construct a reasoning chain that appears to justify it. The explanation and the computation are decoupled.

What the model writes as its reasoning may not be what actually happened.

Reasoning or Memorization?

There is a deeper question underneath CoT performance: is the model actually reasoning, or is it recalling reasoning-shaped patterns from its training data?

Researchers created a symbolic variant of GSM8K where the logic of each problem stayed the same, but surface features like numbers and names were changed. Performance dropped significantly. If the model were truly reasoning about the structure of the problem, this change should not matter. The fact that it does suggests some of the apparent reasoning is memorization in disguise.
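
The perturbation itself is mechanically trivial, which is what makes the performance drop so telling. A sketch of the idea (the real benchmark, GSM-Symbolic, is constructed much more carefully than this):

```python
import random

# Keep the problem's logical structure fixed, randomize the surface.
# A model that truly reasons about structure should not care.
TEMPLATE = (
    "{name} has {a} tennis balls. {name} buys {b} more cans of tennis "
    "balls. Each can has {c} tennis balls. How many does {name} have now?"
)

def sample_variant() -> tuple[str, int]:
    name = random.choice(["Roger", "Amina", "Wei", "Sofia"])
    a, b, c = random.randint(2, 9), random.randint(2, 5), random.randint(2, 6)
    return TEMPLATE.format(name=name, a=a, b=b, c=c), a + b * c

question, gold_answer = sample_variant()
```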

The Overthinking Problem

My experiment showed a small version of this. On Problem 6, the model solved it correctly without CoT. With CoT, it added extra steps, got confused, and got it wrong.

Researchers have documented this pattern at scale. Longer reasoning chains are not always better: past a certain point, additional steps introduce errors rather than correct them. This has been called the "overthinking" problem.

Compositional Generalization

LLMs also struggle when they need to combine reasoning skills in novel ways. They can handle familiar patterns well. But put two familiar patterns together in a configuration the model has not seen, and performance degrades. This suggests the reasoning ability is less flexible and generalizable than it might appear from benchmark numbers.


Part 5: What We Still Do Not Know

CoT has genuinely advanced what language models can do. But there are open questions that the field has not resolved.

Are the Explanations Honest?

When a model shows its reasoning, is that actually what happened computationally? The unfaithful reasoning research says it often is not. We do not have reliable tools to check whether a model's stated reasoning matches its internal computation. This matters a lot if you want to trust the reasoning, not just the answer.

Where Does Reasoning End and Memorization Begin?

The symbolic variant experiments raise a question that nobody has cleanly answered yet. For any given correct reasoning chain, how much of it reflects genuine logical inference versus pattern recall? The boundary is not well defined.

Why Does CoT Work in English and Struggle Elsewhere?

Almost all CoT research was conducted in English. When you apply the same techniques to Arabic, Urdu, or other lower-resource languages, performance drops noticeably. Whether this is primarily a data coverage problem or something more structural about how reasoning transfers across language families is still an open question.

Can We Formally Verify a Reasoning Step?

A calculator gives you a provably correct answer. An LLM gives you a confident one. There is currently no reliable way to formally verify whether an individual step in an LLM's reasoning chain is logically valid. Researchers are exploring integrations with formal theorem provers such as Lean4, but this remains largely unsolved.
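
For contrast, here is what a machine-checked reasoning step looks like in Lean 4, using the arithmetic from Part 1. The unsolved part is not checking statements like this one; it is reliably translating a model's free-text reasoning steps into formal statements at all:

```lean
-- The Roger problem's arithmetic as a Lean 4 theorem.
-- `rfl` succeeds only if both sides compute to the same value,
-- so this step is verified by the type checker, not merely asserted.
example : 5 + 2 * 3 = 11 := rfl
```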

Does Interpretability Scale?

Mechanistic interpretability research has produced real insights at small model scales: specific circuits identified, specific behaviors localized. But as models grow to hundreds of billions of parameters, these techniques become computationally impractical. How interpretability research keeps pace with model scale is an open problem.


Papers Worth Reading

| Paper | What It Contributes | Venue |
|---|---|---|
| Wei et al. (2022) | Original CoT paper | NeurIPS 2022 |
| Kojima et al. (2022) | Zero-shot CoT discovery | NeurIPS 2022 |
| Lightman et al. (2023) | Process Reward Models | OpenAI Tech Report |
| DeepSeek-AI (2025) | GRPO and DeepSeek-R1 | arXiv 2501 |
| Turpin et al. (2023) | Unfaithful reasoning | NeurIPS 2023 |
| Wang et al. (2022) | Self-consistency via majority voting | ICLR 2023 |

Research Groups Doing Interesting Work Here

Anthropic's interpretability team is doing some of the most rigorous work on understanding what is happening inside these models. DeepMind's Gemini team is pushing multimodal reasoning. MIT's BCS and CSAIL groups are connecting cognitive science with language model research. Peking University's NLP group has produced strong work on multilingual reasoning.

Benchmarks You Should Know

GSM8K covers grade school math with 8,500 problems. MATH is competition-level with 12,500 problems. MMLU covers broad knowledge across many domains. ARC-Challenge focuses on scientific reasoning. BIG-Bench Hard collects 23 tasks specifically designed to be difficult for current models.


Conclusion

Chain-of-thought prompting is one of the more surprising ideas in recent AI research. A single phrase, added to a prompt, unlocks reasoning capabilities that were already there but not being used.

And yet the central question it raised remains unanswered. Do these models actually reason, or do they produce sophisticated simulations of reasoning? The honest answer is that we do not fully know.

The gap between sounding intelligent and being intelligent is where the most interesting work in this field is happening right now.

Next week: Small Language Models. How models like Phi-3 and Gemma became serious competitors to GPT-4, and what the research landscape looks like when you do not need a data center to run your model.


References

1. Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
2. Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners. NeurIPS 2022.
3. Lightman, H., et al. (2023). Let's Verify Step by Step. arXiv:2305.20050.
4. DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
5. Turpin, M., et al. (2023). Language Models Don't Always Say What They Think. NeurIPS 2023.
6. Wang, X., et al. (2022). Self-Consistency Improves Chain-of-Thought Reasoning in Language Models. ICLR 2023.
7. Elhage, N., et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic.

Code for this experiment is available on GitHub: Week 01 Code

This is part of a weekly series on AI/ML research. Each post covers theory, recent work, and experiments I run myself.

Connect on LinkedIn: Soohan Abbasi
