Here's something everyone agrees on about few-shot prompting: the more examples you give the model, the better it performs.
I believed that too. Then I measured it.
To do the measuring, I built AdaptGauge, an open-source tool that tracks how efficiently LLMs learn from few-shot examples.
What I tested
I evaluated eight models across four tasks designed to mirror real business scenarios, at shot counts of 0, 1, 2, 4, and 8:
- Classification — Categorize customer support inquiries into one of 8 categories (billing, technical support, returns, etc.)
- Code Fix — Identify and fix bugs in short Python functions (off-by-one errors, missing edge cases)
- Summarization — Extract key points from Japanese news articles into bullet-point summaries
- Route Optimization — Calculate optimal delivery routes across multiple destinations with time windows and fuel costs
Models tested:
- Cloud APIs: Claude Haiku 4.5, Claude Opus 4.5, Gemini 2.5 Flash, Gemini 3 Flash, Gemini 3 Pro
- Local models: Gemma 3 27B, GPT-OSS 120B, Qwen3-VL 8B
For each model-task pair, I also compared two example selection strategies:
- Fixed — The same hand-picked examples used for every test input.
- TF-IDF dynamic selection — For each test input, score every candidate example by TF-IDF similarity to that input and pick the closest matches. The idea: examples that resemble the current input should help the model more. Tang et al. (2025) reported that combining this with stratified sampling achieves better performance with fewer examples.
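The TF-IDF selection strategy can be sketched in a few lines of pure Python. This is a simplified illustration of the idea, not AdaptGauge's actual implementation, which may tokenize and weight differently:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_examples(test_input, candidates, k):
    """Pick the k candidate examples most similar to the test input."""
    vecs = tfidf_vectors([test_input] + candidates)
    query, cand_vecs = vecs[0], vecs[1:]
    ranked = sorted(range(len(candidates)),
                    key=lambda i: cosine(query, cand_vecs[i]),
                    reverse=True)
    return [candidates[i] for i in ranked[:k]]
```

Given a billing-related query, this ranks a billing-related candidate above unrelated ones, which is exactly the behavior that sometimes helps and, as shown later, sometimes hurts.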
Full task definitions — including prompts, examples, and scoring rubrics — are in the demo task pack.
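Putting the pieces together, the whole evaluation is a grid over models, tasks, shot counts, and selection strategies. Conceptually it looks like the sketch below, where `run_task` is a hypothetical callable standing in for the actual harness (prompt assembly, API call, and scoring):

```python
SHOT_COUNTS = [0, 1, 2, 4, 8]

def evaluate_grid(models, tasks, run_task, strategies=("fixed", "tfidf")):
    """Score every model on every task at each shot count and strategy.

    run_task(model, task, shots, strategy) is assumed to return a
    score in [0, 1]. Each (model, task, strategy) combination gets a
    full learning curve: a dict mapping shot count -> score.
    """
    results = {}
    for model in models:
        for task in tasks:
            for strategy in strategies:
                curve = {k: run_task(model, task, k, strategy)
                         for k in SHOT_COUNTS}
                results[(model, task, strategy)] = curve
    return results
```

The important design choice is that the unit of output is a full curve, not a single score; every pattern described below is invisible if you only keep one point per model.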
Most of the time, the results looked exactly like you'd expect. More examples, better scores. But not always.
Three patterns that break the assumption
Pattern 1: The model learns, then unlearns
On a route optimization task, Gemini 3 Flash scored 33% at zero-shot. Adding examples helped — performance climbed to 64% at 4-shot. Textbook behavior.
Then I added more. At 8-shot, the score crashed back to 33%. Right back where it started.
The model learned, then unlearned.
Four models improved steadily. One didn't. I call this peak regression — and you can't spot it without tracking the full learning curve.
Pattern 2: Rankings flip completely
On a classification task, something even stranger happened. The model rankings reversed between zero-shot and eight-shot:
Look at Gemini 2.5 Flash: it scored just 20% at zero-shot, but climbed to 80% with eight examples — the highest of any model. Meanwhile, Gemini 3 Pro stayed flat at 60% regardless of shot count.
A "Pro" model isn't necessarily better than a "Flash" model — it depends on how you prompt it. Choosing a model based on public benchmarks alone can lead you to the wrong conclusion.
Pattern 3: How you pick examples can trigger collapse
I tested two methods for selecting few-shot examples: fixed (hand-picked) and TF-IDF (dynamically selected by text similarity).
Tang et al.'s "The Few-shot Dilemma" (2025) found that TF-IDF-based selection combined with stratified sampling achieved superior performance with fewer examples. And on most of my tasks, TF-IDF did help.
But on a route optimization task with GPT-OSS 120B, it made things dramatically worse:
With fixed examples, the model stayed above 50%. With TF-IDF, it collapsed to 35% at 2-shot — a 58% relative drop. The method designed to find "better" examples triggered a failure.
Adding in-context examples — or changing how you select them — can actively degrade model performance. I call this few-shot collapse.
I'm not the first to see this
After finding these patterns, I dug into the literature. Turns out researchers have been documenting the same thing.
The over-prompting problem. Tang et al. (2025) showed that LLM performance peaks at a certain number of examples and then declines. LLaMA and Gemma models showed dramatic degradation. GPT models held up better.
Catastrophic drops in security tasks. An NDSS 2025 study (Lin & Mohaisen) found that few-shot examples dramatically degraded vulnerability type identification. In terms of AP (accurate-response percentage), Gemma 7B dropped from 77.9% to 39.9%, and LLaMA-2 70B from 68.6% to 21.0%.
Labels don't even matter. Min et al. (2022) found that randomly replacing labels in few-shot examples barely hurts performance. Models aren't learning input-label mappings — they're picking up format and distribution cues. The mechanism behind few-shot "learning" is far more fragile than most people assume.
Why this happens
A few factors are at play:
More tokens = worse performance. Chroma Research's "Context Rot" study (2025) showed that simply increasing input tokens — even with irrelevant whitespace — significantly degrades performance across a wide range of models and tasks. More examples means more tokens.
Position matters. Liu et al. (2024) showed that models struggle with information in the middle of long contexts. When examples push the actual task further down, the model loses track.
Pre-training biases conflict with examples. Some models have strong priors. When examples contradict those priors, or introduce patterns the model over-indexes on, the result is worse than no examples at all.
Example selection amplifies or dampens all of this. My TF-IDF comparison showed that "textually similar" doesn't always mean "helpful." A relevant example can still confuse the model.
What this means for you
If you're using LLMs in production:
Your prompt "improvements" might be breaking things. Adding examples is the default fix when a model underperforms. My data shows it can backfire — and without measurement, you won't know until users complain.
Leaderboard rankings don't predict this. Alzahrani et al. (2024) showed that minor benchmark changes shift rankings by up to 8 positions. My classification results confirm it: the zero-shot leader dropped to third once examples were added.
Different models break on different tasks. Gemini 3 Flash collapsed on route optimization but improved on summarization. There's no universal "safe" model.
Example selection is a variable, not a constant. Switching from hand-picked to TF-IDF examples turned a working model into a broken one. This isn't a "set and forget" choice.
Detecting it automatically
These findings led me to a framework inspired by Chollet's "On the Measure of Intelligence" (2019):
"The intelligence of a system is the measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty."
Instead of "how good is this model?", the question should be "how efficiently does it adapt?" — and critically, does it ever adapt in the wrong direction?
I built this idea into AdaptGauge. For each model-task pair across shot counts (0 through 8), it automatically computes:
- Learning Curve AUC — Overall learning efficiency. Higher means the model learns faster from examples.
- Few-Shot Collapse — Auto-alerts when 8-shot performance drops below 80% of the 0-shot baseline.
- Collapse Pattern — Classifies each curve as immediate collapse, gradual decline, peak regression, or stable.
- Resilience Score — How well the model holds up as shot count increases.
- Example Selection Comparison — Runs fixed vs TF-IDF side-by-side to find what works for each model-task pair.
AdaptGauge is primarily a CLI tool, but it also includes a simple GUI for reviewing results.
In my evaluation, it flagged the peak regression in Gemini 3 Flash and the TF-IDF-induced collapse in GPT-OSS 120B automatically. These are patterns that spot-checking would miss entirely.
Try it
AdaptGauge is open-source. Clone the repo, check the pre-computed demo results, or run your own evaluations against any model with an API. For local models, LM Studio makes it easy to test.
If you've ever added examples to a prompt and wondered whether it actually helped — now you can find out.
The repo is ShuntaroOkuma/adapt-gauge-core: an open-source evaluation harness that measures Adaptation Efficiency — how quickly a language model improves with few-shot examples (0, 1, 2, 4, 8 shots) and whether it suffers from few-shot collapse (performance degradation with more examples, also called over-prompting in the literature).
References
- Chollet (2019) — "On the Measure of Intelligence"
- Min et al. (2022) — "Rethinking the Role of Demonstrations", EMNLP
- Liu et al. (2024) — "Lost in the Middle", TACL
- Alzahrani et al. (2024) — "When Benchmarks are Targets", ACL
- Tang et al. (2025) — "The Few-shot Dilemma"
- Lin & Mohaisen (2025) — "From Large to Mammoth", NDSS
- Chroma Research (2025) — "Context Rot"