In a previous article, I tested 8 models across 4 tasks and reported on "few-shot collapse" — cases where adding few-shot examples actually degrades LLM performance.
This time, I expanded the experiment to 12 models (6 cloud + 6 local) and 5 tasks to see whether those findings hold at a larger scale. They do — and I found even more dramatic cases, including a model that dropped from 93% to 30% with more examples.
## What I tested
I evaluated 12 models — 6 cloud APIs and 6 local models — across 5 tasks designed to mirror real business scenarios.
Cloud models: Claude Haiku 4.5, Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3 Flash, GPT-4o-mini, GPT-5.4-mini
Local models: Gemma 3 27B, LLaMA 4 Scout (17B active, MoE), GPT-OSS 120B, Qwen 3.5 (35B total / 3B active, MoE), Ministral 3 14B Reasoning, Phi-4 Reasoning Plus
Tasks:
- Classification — Categorize customer support inquiries into specific categories (exact match scoring)
- Code Fix — Identify and fix bugs in short Python functions
- Route Optimization — Calculate optimal delivery routes with time windows and fuel costs (LLM-as-judge scoring)
- Sentiment Analysis — Classify product reviews as positive/negative/neutral/mixed
- Summarization — Extract key points from news articles into summaries (F1 scoring)
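Scoring varies by task; the summarization F1 is worth making concrete. The article doesn't pin down the exact variant used, but a standard token-overlap F1 (the SQuAD-style metric) looks like this:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated summary and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(tf) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that covers two of three reference tokens with no extras scores 0.8 (precision 1.0, recall 2/3).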
Each model-task pair was evaluated at 0, 1, 2, 4, and 8 shots, with 3 trials per configuration and TF-IDF-based dynamic example selection. That's 60 model-task pairs and over 27,000 individual evaluations.
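AdaptGauge's actual selection code isn't reproduced here, but the idea behind TF-IDF-based dynamic selection is simple: for each test input, pick the k candidate examples whose TF-IDF vectors are most cosine-similar to it. A minimal stdlib-only sketch:

```python
import math
from collections import Counter

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_examples(query: str, pool: list[str], k: int) -> list[str]:
    """Return the k pool items most similar to the query by TF-IDF cosine."""
    docs = pool + [query]
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    # Document frequency over pool + query.
    df = Counter(t for tokens in tokenized for t in set(tokens))

    def vec(tokens: list[str]) -> Counter:
        tf = Counter(tokens)
        return Counter({t: c * math.log(n / df[t]) for t, c in tf.items()})

    vecs = [vec(t) for t in tokenized]
    sims = [_cosine(vecs[-1], v) for v in vecs[:-1]]
    ranked = sorted(range(len(pool)), key=lambda i: sims[i], reverse=True)
    return [pool[i] for i in ranked[:k]]
```

In practice you'd select over the example inputs and then attach their labels; this sketch only shows the retrieval step.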
I'll describe how to explore the full results later, but here are three patterns that stood out.
## Pattern 1: The zero-shot leader can crash to last place
Gemini 3 Flash scored 93% on route optimization at zero-shot — the highest of any model. Then I added examples.
| Shots | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Score | 93% | 93% | 43% | 53% | 30% |
At 8-shot it scored 30%, a 63-point drop. The model that led at zero-shot became the worst performer once examples were added.
Here's the twist: Gemma 3 27B — from the same model family — stayed stable around 90% across all shot counts. Same architecture lineage, completely different behavior. This isn't a property of the Gemini/Gemma family. It's specific to Gemini 3 Flash on this task.
## Pattern 2: Most models benefit from few-shot examples
On classification, every model scored between 0% and 20% at zero-shot. They all looked equally bad. Based on a zero-shot benchmark alone, you'd conclude these models can't classify customer support tickets.
But with examples, performance improved dramatically across the board. The graph is a bit busy, but you can see the overall upward trend from 0-shot to 8-shot:
At 8-shot:
- Claude Haiku: 80% (from 20%)
- Claude Sonnet: 73% (from 20%)
- GPT-OSS 120B: 73% (from 0%)
- Gemma 3 27B: 67% (from 0%)
- GPT-4o-mini: 33% (from 0%)
- Gemini 2.5 Flash: 27% (from 13%)
Models that scored below 20% at zero-shot improved significantly with examples. Claude Haiku reached 80%, and Claude Sonnet and GPT-OSS 120B also showed strong gains. Gemma 3 27B, which performed well on route optimization in Pattern 1, went from 0% to 67%. On the other hand, models like GPT-4o-mini and Gemini 2.5 Flash barely improved.
If you pick your model from zero-shot benchmarks, you might choose the wrong one.
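For readers who haven't assembled few-shot prompts by hand: the experiment's exact template isn't shown, but a generic k-shot classification prompt is just the instruction, the selected (input, label) pairs, and the query. A hypothetical builder:

```python
def build_prompt(instruction: str,
                 examples: list[tuple[str, str]],
                 query: str) -> str:
    """Assemble a k-shot classification prompt from (input, label) pairs.

    The "Input:"/"Label:" template is illustrative, not AdaptGauge's format.
    """
    parts = [instruction]
    for text, label in examples:
        parts.append(f"Input: {text}\nLabel: {label}")
    # The query gets the same shape, with the label left for the model.
    parts.append(f"Input: {query}\nLabel:")
    return "\n\n".join(parts)
```

With zero examples this degenerates to a plain zero-shot prompt, which is what makes shot-count sweeps easy to automate.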
## Pattern 3: Models bad at a task stay bad — even with examples
On summarization, most models improved steadily with more examples. This is the behavior everyone expects from few-shot prompting. The graph is busy again, but the overall upward trend is clearer than with classification:
Gemma 3 27B — a local model — achieved the highest score at 75%, outperforming all cloud models. Claude Sonnet followed at 73%, then Gemini 3 Flash at 72%. For straightforward tasks, local models can be more than enough.
However, even on this task, Phi-4 Reasoning Plus and Ministral 3 14B scored poorly. Both are reasoning-specialized models, optimized for expanding and elaborating information — not compressing it as summarization requires. This isn't "collapse" from adding examples; they simply weren't suited for the task.
Few-shot prompting works well for most models on most tasks, but models that are fundamentally mismatched with a task won't be saved by more examples.
## The 60 model-task combinations fall into three patterns
To summarize the three patterns:
1. Few-shot causes collapse — Like Gemini 3 Flash on route optimization in Pattern 1, adding examples dramatically degrades performance. The most notable cases:
| Model | Task | Behavior | Drop |
|---|---|---|---|
| Gemini 3 Flash | Route Optimization | Gradual decline | 93% → 30% |
| Qwen 3.5 (3B active) | Code Fix | Gradual decline | 56% → 0% |
| Ministral 3 14B | Code Fix | Peak regression | 44% → 33% |
2. Few-shot works as expected — Like summarization for most models, performance improves steadily with more examples.
3. Task-model mismatch — As described in Pattern 3, models like Phi-4 Reasoning Plus and Ministral 3 14B scored low on summarization even at zero-shot. Adding examples didn't help — this isn't "collapse" but a fundamental mismatch.
Additionally, four pairs showed temporary dips that recovered. Scores eventually returned, but testing at a single shot count could lead to the wrong conclusion:
| Model | Task | Detail |
|---|---|---|
| GPT-5.4-mini | Classification | 60% at 2-shot → 27% at 4-shot → 60% at 8-shot |
| GPT-OSS 120B | Route Optimization | 78% at 0-shot → 58% at 1-shot → 74% at 8-shot |
| Gemini 2.5 Flash | Route Optimization | 63% at 2-shot → 52% at 4-shot → 63% at 8-shot |
| Qwen 3.5 | Classification | 40% at 2-shot → 20% at 4-shot → 40% at 8-shot |
Testing at multiple shot counts helps catch these, though the issue may be less about shot count itself and more about the interaction between the specific examples provided and the model's state.
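A rough way to automate this triage is to compare each learning curve's start, end, and interior minimum. The thresholds below are illustrative, not AdaptGauge's actual detection rules:

```python
def classify_curve(scores: list[float], threshold: float = 0.15) -> str:
    """Heuristically label a learning curve (scores ordered by shot count)."""
    start, end = scores[0], scores[-1]
    interior_min = min(scores[1:-1]) if len(scores) > 2 else end
    if end <= start - threshold:
        return "collapse"          # e.g. 0.93 -> ... -> 0.30
    if interior_min <= min(start, end) - threshold:
        return "dip-and-recover"   # e.g. 0.60 -> 0.27 -> 0.60
    if end >= start + threshold:
        return "improves"
    return "flat"
```

The dip check is why testing at a single shot count is risky: sampling only the interior point of a dip-and-recover curve looks identical to sampling a collapse.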
## Who performed best?
Measuring adaptation efficiency (area under the learning curve across all tasks):
| Rank | Model | Type | Avg AUC |
|---|---|---|---|
| 1 | Claude Haiku 4.5 | Cloud | 0.815 |
| 2 | Gemma 3 27B | Local | 0.814 |
| 3 | Claude Sonnet 4.6 | Cloud | 0.802 |
| 4 | LLaMA 4 Scout | Local | 0.748 |
| 5 | GPT-5.4-mini | Cloud | 0.730 |
A 27B local model matched Claude Haiku's adaptation efficiency. LLaMA 4 Scout, with only 17B active parameters (MoE), outperformed GPT-5.4-mini. Results will vary depending on the evaluation method and tasks, but this suggests that with proper few-shot prompting, local models can achieve performance comparable to cloud APIs.
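For reference, "area under the learning curve" can be computed with a simple trapezoidal rule over the shot counts. The normalization below (dividing by the shot range so a perfect score at every count yields 1.0) is my assumption; AdaptGauge's exact AUC definition may differ:

```python
def learning_curve_auc(shots: list[int], scores: list[float]) -> float:
    """Trapezoidal area under the score-vs-shots curve, scaled to [0, 1].

    `shots` must be increasing (e.g. [0, 1, 2, 4, 8]); `scores` in [0, 1].
    """
    area = 0.0
    for i in range(1, len(shots)):
        width = shots[i] - shots[i - 1]
        area += width * (scores[i] + scores[i - 1]) / 2
    return area / (shots[-1] - shots[0])
```

Because the metric integrates over all shot counts, a model that collapses at high shot counts is penalized even if its zero-shot score was the best.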
## Prior research
Few-shot performance degradation has been reported by several independent studies:
- Tang et al. (2025) documented "over-prompting" — performance peaks then declines — across GPT-4o, DeepSeek-V3, Gemma-3, LLaMA-3, and Mistral.
- Lin & Mohaisen (NDSS 2025) found that few-shot examples degraded vulnerability detection: Gemma 7B dropped from 78% to 40%.
- Chroma Research (2025) showed that simply adding more tokens — even irrelevant ones — degrades performance.
- Min et al. (2022) found that randomly replacing labels in few-shot examples barely hurts performance — suggesting models aren't learning from examples the way we assume.
The phenomenon is well-documented. This makes it all the more important to evaluate whether few-shot prompting actually works for your specific use case before deploying to production.
## Practical takeaways
- Don't assume more examples = better. It's worth testing at multiple shot counts. The optimal number varies by model and task.
- Don't pick models from zero-shot benchmarks alone. We found that rankings can change significantly with examples. When referencing benchmarks, check whether they were measured at zero-shot or few-shot — the methodology matters.
- Distinguish collapse from task mismatch. If scores drop after adding examples, check the zero-shot baseline. Low from the start suggests a model-task compatibility issue. High at zero-shot but dropping with examples points to a few-shot prompting effect.
- Measure, don't guess. Whether few-shot prompting helps a specific model-task pair can only be determined by actually evaluating it. Tracking the full learning curve ensures you don't miss non-monotonic patterns.
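Putting the last takeaway into practice is cheap. A minimal sweep harness, where `evaluate` is a stand-in for your own model call plus scoring:

```python
def sweep_shot_counts(evaluate,
                      shot_counts=(0, 1, 2, 4, 8),
                      trials=3) -> dict[int, float]:
    """Average `evaluate(n_shots)` over several trials per shot count.

    Returns the full learning curve so non-monotonic patterns aren't missed.
    """
    curve = {}
    for n in shot_counts:
        scores = [evaluate(n) for _ in range(trials)]
        curve[n] = sum(scores) / len(scores)
    return curve
```

The shot counts and trial count mirror this experiment's setup; tune both to your budget.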
## Reproducing these results
The evaluation was run with AdaptGauge (OSS, MIT license), a tool that tracks learning curves, auto-detects collapse, and classifies degradation patterns.
The full results from this 12-model × 5-task experiment are available as default demo data. After installation, you can immediately explore the patterns and learning curves discussed in this article — no API keys needed.
To evaluate your own tasks and models, AdaptGauge supports cloud APIs as well as local models via any OpenAI-compatible API (LM Studio, Ollama, etc.).
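Under the hood, every OpenAI-compatible server speaks the same chat-completions wire format, so pointing an evaluator at LM Studio or Ollama is just a base-URL change. A stdlib-only sketch (the localhost URL and model name in the comment are assumptions about a typical Ollama setup, not AdaptGauge internals):

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str) -> bytes:
    """Build an OpenAI-style chat-completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }).encode()

def complete(base_url: str, model: str, prompt: str) -> str:
    """POST to an OpenAI-compatible endpoint (LM Studio, Ollama, vLLM, ...)."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=chat_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (assumes a local Ollama server with a model already pulled):
# complete("http://localhost:11434/v1", "your-model-name", "Hello")
```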
GitHub: github.com/ShuntaroOkuma/adapt-gauge-core
## References
- Chollet (2019) — "On the Measure of Intelligence"
- Min et al. (2022) — "Rethinking the Role of Demonstrations", EMNLP
- Liu et al. (2024) — "Lost in the Middle", TACL
- Tang et al. (2025) — "The Few-shot Dilemma"
- Lin & Mohaisen (2025) — "From Large to Mammoth", NDSS
- Chroma Research (2025) — "Context Rot"