Chain-of-Thought Beats Few-Shot by 34% on Grade School Math
GSM8K is a dataset of 8,500 grade school math word problems that trip up even large language models. When I tested GPT-3.5-turbo with standard few-shot prompting, accuracy hovered around 23%. Switching to chain-of-thought (CoT) prompting, where the model writes out its intermediate reasoning steps before the final answer, raised accuracy to 57%.
That's a 34 percentage point gap from the same model, same API call, just a different prompt structure.
The gap isn't magic. CoT forces the model to decompose multi-step reasoning into explicit steps, which cuts down on the arithmetic and logical errors that plague direct answer generation. But it comes with a cost: 3.2x more tokens per response, which translates directly into API spend and latency.
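The structural difference between the two prompt styles is easy to see side by side. This is a minimal sketch; the exemplar problem and the prompt layout are illustrative, not the actual prompts from my test runs, though any chat-completion API would accept text shaped like this:

```python
# Illustrative exemplars: the few-shot version jumps straight to the
# answer, while the CoT version spells out the intermediate steps.
FEW_SHOT_EXEMPLAR = (
    "Q: A baker made 24 muffins and sold 18. How many are left?\n"
    "A: 6"
)

COT_EXEMPLAR = (
    "Q: A baker made 24 muffins and sold 18. How many are left?\n"
    "A: The baker started with 24 muffins and sold 18, "
    "so 24 - 18 = 6 muffins are left. The answer is 6."
)

def build_prompt(exemplar: str, question: str) -> str:
    """Prepend one worked exemplar, then pose the new question."""
    return f"{exemplar}\n\nQ: {question}\nA:"
```

Both prompts cost the same to send; the token overhead comes from the response, because the CoT exemplar conditions the model to produce a full reasoning trace instead of a bare number.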
What GSM8K Actually Tests
GSM8K problems look deceptively simple:
"Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
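The decomposition that CoT elicits for a problem like this amounts to two small arithmetic steps, which can be checked directly:

```python
# The two reasoning steps a CoT answer to the Natalia problem spells out.
april = 48            # clips sold in April
may = april // 2      # "half as many" in May
total = april + may   # altogether in April and May
print(total)          # prints 72
```

Each step is trivial in isolation; what GSM8K punishes is skipping the intermediate quantity (May's 24) and guessing the total in one leap, which is exactly what direct few-shot answering encourages.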