Devanshu Biswas

Posted on Jun 7

7 Magic Words That Make Your LLM 10 Smarter at Math

#ai #llm #prompting #beginners

🌐 Live demo (LOOK · UNDERSTAND · BUILD): https://dev48v.infy.uk/prompt/day2-chain-of-thought.html

Day 2 of my PromptFromZero series — 50 LLM techniques in 50 days, each visualized with LOOK / UNDERSTAND / BUILD.

Today: Chain of Thought (CoT). The single highest-impact prompt change you can make. Costs nothing. Adds 7 words. Often turns wrong answers into right ones.

The setup

Same problem. Same model. Two prompts.

Roger has 5 tennis balls. He buys 2 cans of 3 balls each.
How many balls does he have now?

Prompt A — "just answer"

…question… Just answer with the number, nothing else.

Small / older models often answer: 8. Wrong.

Prompt B — Chain of Thought

…same question… Let's think step by step.

Model writes:

Roger starts with 5 balls.
He buys 2 cans, each holding 3 balls.
2 × 3 = 6 new balls.
5 + 6 = 11.

Final answer: 11.

Right.

Same model. Same problem. Seven extra words on the prompt. The accuracy boost on multi-step math problems is consistently massive.

Why it works

LLMs generate one token at a time, each token conditioned on every token that came before. If you ask for the answer with no working, the model has to compress the whole computation into a single number prediction. There's nowhere to "scratch paper".

Chain of Thought forces the model to write the scratch paper out. Each step becomes additional context for the next step. By the time it gets to "Final answer:", the arithmetic is already on the page — anchored to real numbers, not vibes.

More tokens spent = more compute per problem = more reasoning capacity. CoT is literally trading latency for accuracy.

When to use it

Use CoT	Skip CoT
Math word problems	Factual lookups ("What's the capital of France?")
Multi-step logical reasoning	Creative writing
Cause-and-effect chains	Short summaries
Subtle classifications	Code completion

Heuristic: if you would write scratch-paper math yourself, the model will benefit from CoT.

Build it in 10 minutes

mkdir cot-from-zero && cd cot-from-zero
npm init -y
npm install ai @ai-sdk/google
echo "GOOGLE_GENERATIVE_AI_API_KEY=your_key_here" > .env

Get a free Gemini key at https://aistudio.google.com/apikey (no credit card).

// cot.mjs
import { generateText } from "ai";
import { google } from "@ai-sdk/google";

const model = google("gemini-2.5-flash");
const problem = "Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?";

const bad = await generateText({
  model,
  prompt: problem + "\n\nJust answer with the number, nothing else."
});

const good = await generateText({
  model,
  prompt: problem + "\n\nLet's think step by step."
});

console.log("=== Without CoT ===\n" + bad.text);
console.log("\n=== With CoT ===\n" + good.text);

node --env-file=.env cot.mjs

Two runs of the same model on the same problem, side by side. The difference is visible immediately.

Levels of CoT

1. Zero-shot CoT (above)

Just add "Let's think step by step." Works on most modern models.

2. Few-shot CoT

Prepend 2-3 worked examples before the question:

Q: Sara had 4 apples and got 2 more. How many?
A: Sara had 4. She got 2 more. 4 + 2 = 6. Answer: 6.

Q: Roger has 5 tennis balls. He buys 2 cans of 3 each. How many balls?
A: [model continues in same format]

Better on harder problems — the model has explicit examples of the reasoning depth you want.

3. Structured CoT

Force a format:

"Solve this. Number your steps 1, 2, 3. Final answer on a new line starting 'Answer:'."

Easier to parse programmatically.

4. Hidden CoT

Generate the chain, then strip it before showing the user:

const reply = result.text;
const clean = reply.replace(/<thinking>[\s\S]*?<\/thinking>/g, '').trim();

User sees just the answer; the model gets the accuracy benefit.

What about reasoning models?

GPT-5, Claude 4 Sonnet, o1, o3, Gemini 2.5 — modern flagship models train with reasoning baked in. They don't need "let's think step by step." They do it automatically.

But:

They cost 10× more per token
They're slower (visible "thinking..." UI)
They're overkill for simple tasks

Cheap model + CoT prompt ≈ reasoning model output, at ~10% of the cost. CoT is still the highest-leverage technique you can use on small models.

What this unlocks

CoT is the foundation. Every fancier reasoning technique builds on top:

Self-consistency — sample N CoT runs, take majority vote
ReAct — CoT + tool calls interleaved (Day 1)
Tree of Thoughts — branch CoT into multiple paths, evaluate
Reflection — generate, criticize own output, regenerate

Master CoT first. Everything else is variations.

Try it now

Three tabs on one page:
https://dev48v.infy.uk/prompt/day2-chain-of-thought.html

LOOK — animated side-by-side trace of both prompts
UNDERSTAND — 8 click-through steps on why CoT works
BUILD — copy the code, run it on your machine

What's next in PromptFromZero

Day 3: Self-consistency. Sample 5 CoT runs, take majority vote. Same model, even higher accuracy.

Series: 50 LLM techniques · 50 days · Vercel AI SDK throughout.

🌐 All techniques: https://dev48v.infy.uk/promptfromzero.php

DEV Community