I spent the past weekend running a head-to-head experiment: five popular prompt engineering patterns, one real model (Claude Sonnet 4.5), fifty real movie reviews. The goal was simple — find out which technique actually delivers the best results.
The result? The simplest approach won. And the most sophisticated one — Chain-of-Thought — didn't just underperform. It actively made things worse.
Here's what happened.
The 5 Patterns I Tested
1. Zero-Shot
Direct instruction, no examples.
```
Classify this movie review as 'positive' or 'negative'. Reply with only one word.

Review: "The plot was a mess but the performances were great."
```
2. Few-Shot (k=3)
Three input-output examples before the query.
```
Classify movie reviews as 'positive' or 'negative'. Reply with only one word.

Review: "a masterpiece of modern cinema" → positive
Review: "boring and pointless" → negative
Review: "absolutely loved every minute" → positive
Review: "The plot was a mess but the performances were great." →
```
3. Chain-of-Thought (CoT)
Ask the model to reason step by step.
```
Classify this movie review as 'positive' or 'negative'.
Think step by step about the sentiment words and overall tone,
then give your final answer on the last line as just one word.
```
4. Role Prompting
Assign the model a persona.
```
You are an expert sentiment analyst with 20 years of experience
in film criticism and NLP.

Classify this movie review as 'positive' or 'negative'. Reply with only one word.
```
5. Structured Output
Force the model to respond in JSON.
```
Analyze this movie review and respond ONLY with valid JSON:
{"sentiment": "positive" or "negative"}
```
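The payoff of this pattern is that the reply can be parsed mechanically. Here's a minimal sketch of the kind of parser I used (my own helper, not part of any SDK); it tolerates stray text around the JSON object, since models sometimes add a preamble even when told not to:

```python
import json

def parse_sentiment(raw: str) -> str:
    """Extract and validate the sentiment from a JSON-formatted reply."""
    # Tolerate text before/after the object by slicing to the braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"no JSON object found in: {raw!r}")
    obj = json.loads(raw[start:end + 1])
    sentiment = obj.get("sentiment")
    if sentiment not in ("positive", "negative"):
        raise ValueError(f"unexpected sentiment value: {sentiment!r}")
    return sentiment
```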
The Experiment
I ran all 5 patterns against 50 real movie reviews from the SST-2 dataset using Claude Sonnet 4.5. Each review was classified as positive or negative, and I measured:
- Accuracy — did it get the right answer?
- Latency — how long did it take?
- Token cost — how many tokens were consumed?
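The measurement loop itself is simple. Here's a runnable sketch of it; the model call is injected as a `classify` callable (in the real run it wrapped an Anthropic API call, which I've abstracted away here so the skeleton works without an API key):

```python
import time

def run_benchmark(classify, samples):
    """Run one prompt pattern over labeled samples and tally the metrics.

    `classify` maps a review string to (prediction, tokens_used).
    Returns accuracy, average latency per call, and average tokens per call.
    """
    correct = 0
    total_tokens = 0
    start = time.perf_counter()
    for review, label in samples:
        prediction, tokens = classify(review)
        correct += (prediction == label)
        total_tokens += tokens
    elapsed = time.perf_counter() - start
    n = len(samples)
    return {
        "accuracy": correct / n,
        "avg_latency": elapsed / n,
        "avg_tokens": total_tokens / n,
    }
```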
Results
| Pattern | Accuracy | Avg Latency | Avg Tokens | Relative Cost |
|---|---|---|---|---|
| Zero-Shot | 98.0% | 1.58s | 50 | 1.0x |
| Few-Shot (k=3) | 98.0% | 1.78s | 86 | 1.7x |
| Role Prompting | 98.0% | 1.83s | 72 | 1.4x |
| Structured Output | 98.0% | 2.06s | 65 | 1.3x |
| Chain-of-Thought | 64.0% | 5.23s | 228 | 4.6x |
The Surprising Finding
Four out of five patterns hit 98% accuracy. The model is simply good enough at binary sentiment that Zero-Shot, Few-Shot, Role Prompting, and Structured Output all achieve nearly the same result.
But Chain-of-Thought collapsed to 64%: barely better than the 56% you'd get by always guessing the majority class (the set is 28 positive to 22 negative).
Here's a real example. For the review "an utterly compelling 'torture' story" (label: positive), Zero-Shot immediately returned "positive." But Chain-of-Thought went down a rabbit hole:
"The word 'torture' has negative connotations... however 'compelling' is positive... the quotes around 'torture' suggest it may be used figuratively... but the overall sentiment is ambiguous..."
And got it wrong.
Why? Because asking the model to "think step by step" about something it already knows how to do introduces confusion. The reasoning process picks up on ambiguity that doesn't exist when the model just answers directly. It overthinks.
And it costs 4.6x more tokens for that worse result.
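One practical wrinkle worth noting: CoT replies have to be parsed, since the answer is buried under the reasoning. My prompt asked for the final answer on the last line, so scoring meant extracting that line. A minimal sketch of that helper (my own, not from any library):

```python
def extract_cot_answer(response: str) -> str:
    """Take the last non-empty line of a chain-of-thought reply
    and normalize it to 'positive' or 'negative'."""
    lines = [l.strip() for l in response.strip().splitlines() if l.strip()]
    if not lines:
        raise ValueError("empty response")
    word = lines[-1].strip(".!\"'").lower()
    if word not in ("positive", "negative"):
        raise ValueError(f"could not parse final answer: {lines[-1]!r}")
    return word
```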
What This Means for You
1. Don't reach for complex patterns by default
If your model is capable enough for the task, Zero-Shot might be all you need. In my test, the simplest approach was the cheapest, fastest, AND tied for most accurate. There was literally no reason to use anything fancier.
2. Chain-of-Thought can actively hurt on "simple" tasks
CoT is designed for multi-step reasoning (math, logic, planning). When you apply it to tasks the model already handles well in one shot, you're adding noise, not signal. In my test, it cut accuracy by 34 percentage points while costing nearly 5x more. That's the worst possible trade-off.
3. Fancier patterns cost more without accuracy gains
Few-Shot used 1.7x the tokens. Role Prompting used 1.4x. Structured Output used 1.3x. All hit the same 98% as Zero-Shot. If you're running thousands of classifications per day in production, that cost difference adds up — for literally zero accuracy benefit.
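To make that concrete, here's a back-of-the-envelope cost calculation. The per-token price is an illustrative assumption ($3 per million tokens, treated as a flat rate for simplicity; real billing splits input and output tokens at different rates):

```python
def daily_cost(avg_tokens, calls_per_day, usd_per_million=3.0):
    """Rough daily spend at a flat illustrative per-token rate."""
    return avg_tokens * calls_per_day * usd_per_million / 1_000_000

# At 100k classifications/day, using the average token counts from the table:
zero_shot = daily_cost(50, 100_000)   # $15.00/day
cot = daily_cost(228, 100_000)        # $68.40/day
```

Same accuracy band across the cheap patterns, so every extra dollar buys nothing.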
4. Match your pattern to your problem
Stop asking "which pattern is best?" and start asking "how hard is this task for this model?" If the answer is "not very" — and for a frontier model on binary classification, it usually isn't — just use Zero-Shot and move on.
A Practical Decision Guide
Based on these results and broader published research:
| Situation | Recommended Pattern |
|---|---|
| Model is capable enough for the task | Zero-Shot |
| Model needs calibration on output format | Few-Shot |
| Multi-step math or logic problems | Chain-of-Thought |
| Need specific tone or perspective | Role Prompting |
| Need machine-parseable output | Structured Output |
| High-stakes decisions needing max accuracy | Self-Consistency (multiple CoT runs + voting) |
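For that last row, Self-Consistency is just "sample several CoT answers, take the majority vote" (Wang et al., 2023). A minimal sketch, with the sampler injected as a callable so it runs without an API (in practice it would be a temperature > 0 CoT call):

```python
from collections import Counter

def self_consistency(sample_fn, prompt, k=5):
    """Sample k answers for the same prompt and return the majority vote.

    `sample_fn` maps a prompt to one parsed answer; non-determinism
    (e.g. sampling temperature) is what makes the votes differ.
    """
    votes = Counter(sample_fn(prompt) for _ in range(k))
    answer, _count = votes.most_common(1)[0]
    return answer
```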
The Bottom Line
Start with the simplest pattern that works. Add complexity only when the data proves you need it.
Zero-Shot was the fastest (1.58s), cheapest (50 tokens), and tied for most accurate (98%). Every other pattern either matched it at higher cost, or actively hurt performance.
The biggest mistake I see is reaching for Chain-of-Thought or complex prompting strategies when a direct instruction would have been faster, cheaper, and more reliable.
Full experiment code available on Kaggle. Model: Claude Sonnet 4.5. Dataset: 50 SST-2 sentiment samples (28 positive, 22 negative).
References
- Wei, J., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS.
- Wang, X., et al. (2023). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR.
- Brown, T., et al. (2020). "Language Models are Few-Shot Learners." NeurIPS.
- Anthropic. (2024). "Prompt Engineering for Claude." Anthropic Documentation.