Chain-of-Thought Beats Few-Shot by 34% on Grade School Math
GSM8K is a dataset of 8,500 grade school math word problems that trip up even large language models. When I tested GPT-3.5-turbo with standard few-shot prompting, accuracy hovered around 23%. Switching to chain-of-thought (CoT) prompting, where the model writes out its intermediate reasoning steps before the final answer, raised accuracy to 57%.
That's a 34 percentage point gap from the same model, same API call, just a different prompt structure.
The gap isn't magic. CoT forces the model to decompose multi-step reasoning into explicit steps, which cuts down on the arithmetic and logical errors that plague direct answer generation. But it comes with a cost: 3.2x more tokens per response, which translates directly into API spend and latency.
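The structural difference between the two prompt styles is easy to see side by side. This is a minimal sketch; the exemplar problem and the prompt layout are illustrative, not the actual prompts from my test runs, though any chat-completion API would accept text shaped like this:

```python
# Illustrative exemplars: the few-shot version jumps straight to the
# answer, while the CoT version spells out the intermediate steps.
FEW_SHOT_EXEMPLAR = (
    "Q: A baker made 24 muffins and sold 18. How many are left?\n"
    "A: 6"
)

COT_EXEMPLAR = (
    "Q: A baker made 24 muffins and sold 18. How many are left?\n"
    "A: The baker started with 24 muffins and sold 18, "
    "so 24 - 18 = 6 muffins are left. The answer is 6."
)

def build_prompt(exemplar: str, question: str) -> str:
    """Prepend one worked exemplar, then pose the new question."""
    return f"{exemplar}\n\nQ: {question}\nA:"
```

Both prompts cost the same to send; the token overhead comes from the response, because the CoT exemplar conditions the model to produce a full reasoning trace instead of a bare number.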
What GSM8K Actually Tests
GSM8K problems look deceptively simple:
"Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
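The decomposition that CoT elicits for a problem like this amounts to two small arithmetic steps, which can be checked directly:

```python
# The two reasoning steps a CoT answer to the Natalia problem spells out.
april = 48            # clips sold in April
may = april // 2      # "half as many" in May
total = april + may   # altogether in April and May
print(total)          # prints 72
```

Each step is trivial in isolation; what GSM8K punishes is skipping the intermediate quantity (May's 24) and guessing the total in one leap, which is exactly what direct few-shot answering encourages.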