
Jeff Reese

Posted on • Originally published at purecontext.dev

Chain-of-Thought: Teaching AI to Reason Out Loud

AI in Practice, No Fluff — Day 2/10

When I first started using ChatGPT, I asked it to calculate how many business days fell between two dates. It gave me a confident number. The number was very wrong. I only caught it because I had already done the count by hand and was just double-checking.

I asked the exact same question again, added five words at the end of the prompt, "Let's think step by step," and watched it walk through the weekends, subtract a holiday, and then land on the correct number.

Same model. Same question. Five extra words. A different answer.

There is a specific reason for that. The reason matters more in 2026 than the technique does.

What is going on under the hood

In the first series we covered the three pieces that make a prompt work: context, task, format. Chain-of-thought is not a fourth piece. It sits on top of those, in the territory of how the AI should think before it responds.

The AI doesn't have a private thinking step. There is no silent internal process happening before it starts writing. Every word it produces is part of the same running output. If you ask for an answer, the first thing it writes is an answer. If you ask it to reason first, the reasoning is the first thing it writes, and the answer comes after.

The words that come out between your question and the final answer are where reasoning actually lives. These are the intermediate tokens. They are not a description of thinking. They are the thinking.

The act of generating "Monday to Monday is 5 business days, subtract the holiday on Thursday, that leaves 4" is the reasoning step itself. Take those tokens away and the thought did not happen.
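That written-out arithmetic is easy to verify for yourself. A minimal sketch of the same count, with illustrative dates chosen so a Monday-to-Monday window contains a Thursday holiday:

```python
from datetime import date, timedelta

def business_days(start: date, end: date, holidays: set[date]) -> int:
    """Count weekdays in [start, end), skipping any listed holidays."""
    count = 0
    day = start
    while day < end:
        if day.weekday() < 5 and day not in holidays:  # Mon=0 .. Fri=4
            count += 1
        day += timedelta(days=1)
    return count

# Monday 2026-01-05 to Monday 2026-01-12 spans five weekdays;
# subtracting the Thursday holiday on 2026-01-08 leaves four.
print(business_days(date(2026, 1, 5), date(2026, 1, 12), {date(2026, 1, 8)}))
```

The point is not the code itself; it is that each line of the model's intermediate output corresponds to one of these concrete operations, and skipping the intermediate output skips the operation.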

That is why "think step by step" is not a magic spell. It is a structural move. You are asking the model to lay down the intermediate computation as written words before committing to an answer, because without those words there is no computation.

When it helps

Chain-of-thought earns its tokens on anything that requires more than one step to get right.

  • Math with multiple operations, especially word problems
  • Logic puzzles and constraint-satisfaction problems
  • Planning a sequence of actions
  • Analyzing tradeoffs between options
  • Debugging why a system behaves the way it does
  • Any judgment call that depends on comparing several factors

If the answer depends on holding more than one fact in mind and combining them, letting the model write out the combination first usually produces a better result than asking it to jump to the answer.

When it does not help

Not every problem rewards reasoning out loud. If you are asking the AI to retrieve a single fact, summarize a passage, translate between languages, or pick the correct word from context, there is no multi-step reasoning to surface. You are not asking it to think in parallel about several things; you are asking it to produce one thing. Requesting step-by-step reasoning on a lookup task just generates filler and makes the response longer without making it better.

It can be worse than neutral. A 2024 paper tested reasoning models on tasks where deliberation pushes the model away from its correct intuitive answer. Forcing chain-of-thought dropped accuracy by more than a third compared to answering directly. Step-by-step reasoning is a tool, not a default setting; on the wrong task it actively hurts.

A rough test I use: if I would not need to show my work to get credit for the answer, the AI does not need to either.

How to structure a chain-of-thought prompt

The simplest pattern is to append "Let's think step by step" to your question. That alone will often flip a wrong answer into a right one. It is the lowest-effort move, and it is often enough on its own.

For anything more involved, give the model a scaffold. A reliable template is first identify, then determine, then answer:

Question: [your question]

Please solve this by:
1. First, identify what information you have and what you need to find
2. Then, determine the steps required to get from one to the other
3. Work through each step
4. Finally, state your answer

The explicit structure does two things: it names the stages so the model is less likely to skip one, and it delays the jump to the answer until the work is done. Making "state your answer" a distinct final step matters. Without it the model sometimes trails off into more reasoning and never commits.
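If you reuse the scaffold often, it is worth wrapping in a helper so the stages never get dropped. A small sketch (the function name and constant are my own, not any library's API):

```python
SCAFFOLD = """Question: {question}

Please solve this by:
1. First, identify what information you have and what you need to find
2. Then, determine the steps required to get from one to the other
3. Work through each step
4. Finally, state your answer"""

def scaffold_prompt(question: str) -> str:
    """Wrap a question in the identify -> determine -> work -> answer scaffold."""
    return SCAFFOLD.format(question=question)

print(scaffold_prompt("How many business days are between Jan 5 and Jan 12, 2026?"))
```

The template string is the whole technique; the function just guarantees you send it intact every time.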

The shift with reasoning models

This started shifting in late 2024 with OpenAI's o-series and Anthropic's extended thinking. By 2026 it has flipped all the way. Reasoning is built into the flagship models by default. GPT-5, Claude 4.6, and Gemini 3 all default to reasoning in their main consumer interfaces. Claude's approach, called adaptive thinking, lets the model decide how much to reason based on the question. You steer how hard it thinks through an effort parameter in the API rather than setting a token budget by hand.

If you are using a current flagship, explicit "think step by step" prompting is mostly redundant. A 2025 paper measuring CoT performance across reasoning-class models found the benefit of adding explicit step-by-step prompting is small, and sometimes negative. The reasoning is already happening. You are not unlocking anything the model was not already going to do.

There is a tradeoff worth knowing: reasoning models are slower and more expensive per response because they are generating a lot of hidden thinking tokens before answering. For simple questions that did not need reasoning, you are paying for thinking that did not improve the output. This is why most providers let you dial the reasoning effort up and down, or offer a non-reasoning mode alongside their reasoning model.
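In practice the effort knob is just a field on the request. A sketch of what that looks like, assuming an OpenAI-style chat payload; the exact field name and accepted values vary by provider, and the model name here is a placeholder:

```python
def build_request(prompt: str, effort: str = "medium") -> dict:
    """Build a chat request dict with a reasoning-effort knob.

    "reasoning_effort" follows the OpenAI-style parameter for
    reasoning models; other providers expose the same idea under
    different names (e.g. a thinking token budget).
    """
    return {
        "model": "gpt-5",            # placeholder model name
        "reasoning_effort": effort,  # e.g. "low" | "medium" | "high"
        "messages": [{"role": "user", "content": prompt}],
    }

# Cheap lookup: dial the effort down. Hard multi-step problem: dial it up.
easy = build_request("What is the capital of France?", effort="low")
hard = build_request("Plan a three-region failover strategy.", effort="high")
```

Same prompt-building code either way; the only thing you change per task is how much hidden thinking you are willing to pay for.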

The practical move: turn the effort up for hard problems, turn it down for easy ones, and if you are on a model without built-in reasoning, reach for explicit chain-of-thought prompting when the problem has more than one step.
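That decision can be reduced to two questions: does the task need more than one step, and does the model already reason by default? A toy router (the heuristic and names are my own illustration, not a library):

```python
def choose_strategy(multi_step: bool, model_reasons_by_default: bool) -> str:
    """Pick a prompting strategy from two yes/no questions."""
    if not multi_step:
        return "ask directly"            # lookup tasks: CoT just adds filler
    if model_reasons_by_default:
        return "raise reasoning effort"  # built-in reasoning is already on
    return "add step-by-step prompt"     # classic explicit chain-of-thought

print(choose_strategy(multi_step=True, model_reasons_by_default=False))
```

Trivial as it is, this is the whole decision tree the rest of the post describes.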

The reflex

When you get a confident wrong answer from an AI, the reflex is to add more context. More background, more examples, more specificity about what you want.

That is sometimes the right move. It is often the wrong one.

The question worth asking first is whether the model had room to think. If the answer depended on more than one step and the model jumped straight to the answer, the failure is structural, not informational. On a non-reasoning model, give the model room to think out loud before answering. On a reasoning model, the reasoning was probably running already; the fix is usually switching the approach rather than adding to the prompt.

If the answer requires thinking, make the thinking happen out loud before the answer.

Tomorrow: one exchange is rarely enough. How to design a back-and-forth with an AI so the conversation does not drift.


If there is anything I left out or could have explained better, tell me in the comments.
