Plan-and-Solve: make the model plan the steps before it computes any of them

#ai #promptengineering #llm #beginners

Ask a plain language model a multi-step word problem and it will often blurt out a confident number that is wrong — not because the model is dumb, but because it quietly skipped a step or fumbled the arithmetic. "Let's think step by step" (zero-shot chain-of-thought) helps a lot, yet the same two mistakes survive. Plan-and-Solve (PS) prompting is a one-line upgrade that targets exactly those two failure modes: first make the model devise a plan of subtasks, then make it carry out the plan in order. No examples, no extra calls — just a better trigger phrase.

The two ways reasoning breaks

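The Plan-and-Solve authors studied the reasoning chains that zero-shot CoT produces and reported three kinds of error: calculation errors, missing-step errors, and semantic misunderstandings. The prompt targets two of them, which are the focus here: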

Missing-step error — the chain simply never performs an operation it needed. It reads fluently and looks complete, but a step is gone.
Calculation error — a step does the right operation but produces the wrong number.

Plain prompting is worst of all because it compresses several operations into one leap, so a step vanishing is almost the default. CoT spreads the work out and helps, but it still slips.

A problem that exposes it

A shop starts with 120 apples. It sells 40% in the morning, then 15 more in the afternoon. How many are left?

The correct answer is 120 − 48 − 15 = 57. Watch what plain prompting does: "40% of 120 is 48, so 120 − 48 = 72." It computes the morning sale, subtracts it, and stops — the afternoon sale of 15 just disappears. That is a textbook missing-step error, and "think step by step" often falls into the same trap because nothing forces the last operation to exist.
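To make the gap concrete, here is a small Python sketch of the two chains. The function names are illustrative only, and the integer division assumes the percentage works out to a whole number of apples, as it does here.

```python
def apples_left(start: int, morning_pct: int, afternoon_sold: int) -> int:
    """Full chain: morning percentage sale, then the afternoon sale."""
    morning_sold = start * morning_pct // 100  # 40% of 120 = 48
    after_morning = start - morning_sold       # 120 - 48 = 72
    return after_morning - afternoon_sold      # 72 - 15 = 57


def apples_left_missing_step(start: int, morning_pct: int) -> int:
    """Truncated chain: stops after the morning sale (missing-step error)."""
    return start - start * morning_pct // 100


print(apples_left(120, 40, 15))           # 57
print(apples_left_missing_step(120, 40))  # 72
```

The truncated version is not wrong arithmetic. Every line in it is correct; it simply ends one operation early, which is why the output looks so plausible.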

The Plan-and-Solve trigger

Instead of appending "let's think step by step", append this:

Let's first understand the problem and devise a plan to solve it.
Then, let's carry out the plan and solve the problem step by step.

The magic is in the first half, the plan. Before touching a number, the model has to enumerate the subtasks:

1. Find how many were sold in the morning (40% of 120).
2. Subtract the morning sales from the start.
3. Subtract the 15 sold in the afternoon.
4. Report what's left.

Once step 3 is written down in the plan, it is much harder for the execution to glide past it. The plan acts as a commitment device: the model has told itself the afternoon subtraction exists, so it does it. Execute the plan and 48 → 72 → 57 falls straight out.
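In code, swapping triggers is a one-line change. The sketch below only builds the prompt string and makes no model call. The `build_prompt` helper and the "Q: ... A: ..." layout are assumptions for illustration, so adapt them to whatever client and template you actually use.

```python
ZERO_SHOT_COT = "Let's think step by step."
PLAN_AND_SOLVE = (
    "Let's first understand the problem and devise a plan to solve it. "
    "Then, let's carry out the plan and solve the problem step by step."
)


def build_prompt(question: str, trigger: str = PLAN_AND_SOLVE) -> str:
    """Place the trigger phrase after the question (hypothetical Q/A layout)."""
    return f"Q: {question.strip()}\nA: {trigger}"


question = (
    "A shop starts with 120 apples. It sells 40% in the morning, "
    "then 15 more in the afternoon. How many are left?"
)
print(build_prompt(question))                 # Plan-and-Solve version
print(build_prompt(question, ZERO_SHOT_COT))  # zero-shot CoT version
```

Keeping the triggers as named constants makes it easy to run both variants on the same question and compare the chains side by side.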

PS+: also kill the arithmetic slips

Plain PS fixes missing steps but not bad arithmetic, so the authors add two more instructions. This is PS+:

Let's first understand the problem, extract relevant variables and
their corresponding numerals, and devise a plan. Then, let's carry
out the plan, calculate intermediate results (pay attention to
calculation and commonsense), and solve the problem step by step.

Two clauses are doing real work here. "Extract relevant variables and numerals" pins the given numbers down (start = 120, morning rate = 40%, afternoon_sold = 15), so that intermediate results such as morning_sold = 48 and after_morning = 72 are computed from concrete, named values instead of being re-derived from memory on every line. And the parenthetical "pay attention to calculation and commonsense" nudges it to double-check each operation and reject impossible results (you can't have negative apples; an average speed can't exceed the fastest leg). In the paper, PS+ beat plain PS and closed most of the gap to few-shot CoT, with zero worked examples.
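The same discipline can be written out as ordinary code, which shows what the two clauses are asking the model to do. This is only an analogy under assumed variable names. The model works in text, not in Python, and the bounds check is one possible commonsense test among many.

```python
# Variables extracted from the problem statement (the "numerals").
start_apples = 120
morning_rate = 0.40
afternoon_sold = 15

# Intermediate results, each named so later steps reuse it
# instead of recomputing it.
morning_sold = round(start_apples * morning_rate)
after_morning = start_apples - morning_sold
remaining = after_morning - afternoon_sold

# Commonsense check: an apple count is never negative and never
# larger than the starting stock.
for name, value in [
    ("morning_sold", morning_sold),
    ("after_morning", after_morning),
    ("remaining", remaining),
]:
    assert 0 <= value <= start_apples, f"{name}={value} is implausible"

print(morning_sold, after_morning, remaining)  # 48 72 57
```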

A second trap it dodges

A train goes 60 km/h for 2 hours, then 90 km/h for 3 hours. What's the average speed?

The tempting wrong answer is (60 + 90) / 2 = 75. Average speed is total distance over total time, not the mean of the two speeds. PS forces the plan (leg-1 distance, leg-2 distance, total distance, total time, divide), so it computes (120 + 270) / 5 = 78 km/h, the right number. Same shape of fix: once the plan lists the "weight by time" step, it is much harder to skip.
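The planned steps map directly onto a few lines of Python. The `average_speed` helper below is an illustrative name, and it assumes each leg is given as a (speed in km/h, duration in hours) pair.

```python
def average_speed(legs: list[tuple[float, float]]) -> float:
    """Total distance over total time for (speed_kmh, hours) legs."""
    total_distance = sum(speed * hours for speed, hours in legs)
    total_time = sum(hours for _, hours in legs)
    return total_distance / total_time


legs = [(60, 2), (90, 3)]

# The tempting shortcut: an unweighted mean of the two speeds.
naive_mean = sum(speed for speed, _ in legs) / len(legs)

print(naive_mean)           # 75.0
print(average_speed(legs))  # 78.0
```

The correct result sits closer to 90 than to 60 because the train spends more time on the faster leg, which is exactly the weighting the shortcut drops.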

Where PS sits, and its limits

Plan-and-Solve is the cheapest member of the "decompose first" family. Least-to-most prompting also breaks a problem into ordered subproblems but usually needs examples and multiple calls; full task-decomposition frameworks route subtasks to different tools or agents. PS packs the plan and its execution into a single zero-shot call — most of the decomposition benefit, almost none of the machinery.

It is not a guarantee, though. The same model still executes its own plan in one generation, so a flawed plan produces a confidently wrong answer, and very long or symbolic problems can still defeat it. Treat PS+ as a strong, nearly-free first line of defence, and escalate when accuracy has to be higher: self-consistency (sample many PS chains and take the majority vote), least-to-most for deep dependencies, or offloading the actual arithmetic to a calculator or code tool so the model only has to plan.
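As a minimal sketch of the self-consistency step, the vote itself is just a count over final answers. The sampled answers below are made up for illustration; in practice they would come from several PS+ generations at non-zero temperature, with the final number extracted from each chain.

```python
from collections import Counter


def majority_answer(answers: list[str]) -> str:
    """Return the most common final answer among the sampled chains."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [answer.strip() for answer in answers]
    # On a tie, Counter.most_common keeps the answer seen first.
    return Counter(normalized).most_common(1)[0][0]


# Hypothetical final answers from five sampled PS+ chains.
sampled = ["57", "57", "72", "57 ", "57"]
print(majority_answer(sampled))  # 57
```

One chain dropping the afternoon sale (72) gets outvoted by the four that keep it, which is the whole point: independent samples rarely make the same mistake.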

Try the interactive version — run plain vs CoT vs Plan-and-Solve on the same problem and watch only the plan path keep every step: https://dev48v.infy.uk/prompt/day24-plan-and-solve.html