I Added 6 Few-Shot Examples to One Prompt. Two of Them Made the Output Worse.

#ai #llm #contextengineering #prompts

For a long time I treated few-shot examples like seasoning. More is more. If two examples made a prompt better, six would make it great, and I never bothered to check the math on that assumption.

Last month I sat down with one classification prompt and actually measured it, one example at a time. I had six examples I was proud of. Four of them pulled accuracy up. Two of them pulled it down. Not "added noise," not "no effect." Two examples I hand-picked, that looked perfectly reasonable in isolation, dragged the output 9 points lower than the prompt with four examples.

The uncomfortable part: in isolation, the two bad examples were the ones I would have shown off. They were the most detailed. That is exactly why they did the damage.

This is the story of which two, why, and how I now catch this before it ships.

The setup, so you can discount my numbers properly

This is not a clean benchmark paper. It is one task, run on one model, scored against a 60-item labeled set I built by hand. Treat the numbers as directional, not gospel.

The task: route incoming support messages into one of four buckets: billing, bug, feature_request, account. Plain text in, one label out. I am using a dummy product called Hookline (it does not exist) so nothing here leaks a real prompt.

The prompt is a short system instruction plus N few-shot examples, each a message paired with its correct label:

You route support messages into exactly one bucket:
billing, bug, feature_request, account.

Message: "I was charged twice this month."
Label: billing

Message: "The export button does nothing on Safari."
Label: bug

[...more examples...]

Message: {incoming}
Label:
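
For anyone who wants to reproduce the layout, here is a minimal sketch of assembling that prompt from a list of examples. The function name `build_prompt`, the `INSTRUCTION` constant, and the sample incoming message are illustrative assumptions, not the code used in the experiment.

```python
from typing import List, Tuple

# Instruction text copied from the prompt shown above.
INSTRUCTION = (
    "You route support messages into exactly one bucket:\n"
    "billing, bug, feature_request, account."
)


def build_prompt(examples: List[Tuple[str, str]], incoming: str) -> str:
    """Join the instruction, the few-shot examples, and the unlabeled message."""
    blocks = [INSTRUCTION]
    for message, label in examples:
        blocks.append(f'Message: "{message}"\nLabel: {label}')
    # The final block leaves the label empty so the model completes it.
    blocks.append(f"Message: {incoming}\nLabel:")
    return "\n\n".join(blocks)


examples = [
    ("I was charged twice this month.", "billing"),
    ("The export button does nothing on Safari.", "bug"),
]
# Hypothetical incoming message, used only to show the shape of the output.
prompt = build_prompt(examples, "pls add dark mode")
print(prompt)
```

Keeping the examples as a plain list makes it easy to add, drop, or reorder them without touching the rest of the prompt.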

I scored accuracy on the 60-item holdout set. Same model, same temperature, same holdout; the only thing that changes is which examples sit in the prompt. I added them one at a time and re-ran the whole set after each addition.
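
The add-one-and-re-score procedure can be sketched roughly as below. This is an assumption-laden outline, not the original harness: `classify` is a placeholder for whatever calls your model, and `stub_classify` plus the tiny holdout are toy stand-ins so the loop runs on its own.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[str, str]  # (message, label)
Classifier = Callable[[Sequence[Example], str], str]


def accuracy(classify: Classifier, examples: Sequence[Example],
             holdout: Sequence[Example]) -> float:
    """Percentage of holdout messages that receive the gold label."""
    correct = sum(classify(examples, msg) == gold for msg, gold in holdout)
    return 100.0 * correct / len(holdout)


def incremental_scores(classify: Classifier, candidates: Sequence[Example],
                       holdout: Sequence[Example]
                       ) -> List[Tuple[int, float, Optional[float]]]:
    """Score the prompt with 0, 1, 2, ... examples and record each delta."""
    rows = []
    previous = None
    for n in range(len(candidates) + 1):
        acc = accuracy(classify, candidates[:n], holdout)
        delta = None if previous is None else acc - previous
        rows.append((n, acc, delta))
        previous = acc
    return rows


def stub_classify(examples: Sequence[Example], message: str) -> str:
    """Toy stand-in for a model call: it only knows labels it has seen."""
    for _, label in examples:
        if label in message:
            return label
    return "billing"


# Toy data, invented for illustration only.
candidates = [("charged twice", "billing"), ("button broken", "bug"),
              ("cannot log in", "account")]
holdout = [("billing issue", "billing"), ("bug in export", "bug"),
           ("account locked", "account"), ("bug on login", "bug")]

rows = incremental_scores(stub_classify, candidates, holdout)
for n, acc, delta in rows:
    print(n, acc, delta)
```

The important design choice is that the holdout and the classifier stay fixed while only the example slice changes, so each delta can be attributed to the example that was just added.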

The numbers

Here is the curve I got. I had expected it to keep climbing.

Examples in prompt	Accuracy	Change from previous (points)
0 (zero-shot)	68%	--
+1 (billing)	73%	+5
+2 (bug)	78%	+5
+3 (account)	81%	+3
+4 (feature_request)	84%	+3
+5 (long billing case)	79%	-5
+6 (long bug case)	75%	-4

The first four examples did what few-shot is supposed to do. Smooth climb, 68 to 84. Then I added two more examples that I thought were my best material, and the line went the wrong way. Six examples scored 9 points below four.
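
The arithmetic behind that paragraph is small enough to check in a few lines. The accuracy values are copied from the table above; the variable names are just for illustration.

```python
# Accuracy (%) with 0 through 6 examples in the prompt, from the table.
accuracies = [68, 73, 78, 81, 84, 79, 75]

# Change from the previous row, in percentage points.
deltas = [after - before for before, after in zip(accuracies, accuracies[1:])]

# Number of examples at which accuracy peaked.
best_n = max(range(len(accuracies)), key=accuracies.__getitem__)

# Steps where adding an example lowered accuracy.
harmful_steps = [n for n, d in enumerate(deltas, start=1) if d < 0]

# Gap between the peak and the full six-example prompt.
drop_from_peak = accuracies[best_n] - accuracies[-1]

print(deltas)          # [5, 5, 3, 3, -5, -4]
print(best_n)          # 4
print(harmful_steps)   # [5, 6]
print(drop_from_peak)  # 9
```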

If you wallpaper a room and step back to find you covered the light switch, you know the feeling. I had spent the afternoon making the prompt worse and feeling productive the entire time.

Which two, and why

Let me be specific about the two that broke it, because the pattern is more useful than the verdict.

Example 5 was a long, detailed billing case. A three-sentence message about a failed refund, a duplicate charge, and a confusing invoice, labeled billing. Reasonable label. The problem was length and position. It was four times longer than my other examples and it sat near the end of the prompt.

Example 6 was a long, detailed bug report. Stack trace, repro steps, browser version, labeled bug. Again, a fine label in isolation. Again, long, and now it was the very last thing the model read before the real message.

Two failure modes stacked on top of each other here, and both have names.

Recency: the model over-weights the last thing it saw

LLMs lean toward the label that appeared most recently in the prompt. This is a documented bias, not a quirk of my setup: the "Calibrate Before Use" work showed that models exhibit recency, majority-label, and common-token biases, and that contextual calibration can recover up to 30% absolute accuracy by correcting for exactly this (Zhao et al., 2021).
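
For a sense of what contextual calibration does, here is a minimal sketch of the core idea: ask the model for label probabilities on a content-free input (something like "N/A"), then divide each real prediction by that baseline so a label the prompt already favors gets discounted. All probabilities below are made up, and the sketch assumes you can read per-label probabilities from your model. It renormalizes after the division for readability, which leaves the winning label unchanged.

```python
from typing import Dict


def calibrate(label_probs: Dict[str, float],
              content_free_probs: Dict[str, float]) -> Dict[str, float]:
    """Rescale label probabilities by the content-free baseline, then renormalize."""
    scaled = {
        label: p / content_free_probs[label]
        for label, p in label_probs.items()
    }
    total = sum(scaled.values())
    return {label: s / total for label, s in scaled.items()}


# Made-up baseline: what the prompt predicts for a content-free input.
# A prompt that ends on a heavy bug example might lean toward "bug" like this.
content_free = {"billing": 0.15, "bug": 0.55,
                "feature_request": 0.15, "account": 0.15}

# Made-up raw prediction for a real message.
raw = {"billing": 0.10, "bug": 0.45, "feature_request": 0.05, "account": 0.40}

calibrated = calibrate(raw, content_free)
raw_choice = max(raw, key=raw.get)
calibrated_choice = max(calibrated, key=calibrated.get)
print(raw_choice, calibrated_choice)
```

In this invented case the raw prediction is "bug" and the calibrated one is "account", because the baseline shows the prompt leaning toward "bug" before it has read any real message.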

My example 6 was a bug example, and it was last. After I added it, my misroutes skewed bug. Messages that were really account or feature_request started getting stamped bug, because the freshest, heaviest thing in the model's short-term memory was a vivid bug report. The example was not wrong. Its position was.
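
A quick way to see this kind of skew is to count predicted labels among the wrong answers only. The sketch below uses invented (gold, predicted) pairs, not the experiment's actual results, and `misroute_counts` is a hypothetical helper name.

```python
from collections import Counter
from typing import List, Tuple


def misroute_counts(results: List[Tuple[str, str]]) -> Counter:
    """Count predicted labels for misrouted messages; results are (gold, predicted)."""
    return Counter(pred for gold, pred in results if pred != gold)


# Invented results for illustration.
results = [
    ("account", "bug"),
    ("feature_request", "bug"),
    ("billing", "billing"),
    ("bug", "bug"),
    ("account", "account"),
    ("feature_request", "billing"),
]

counts = misroute_counts(results)
top_label, top_count = counts.most_common(1)[0]
print(counts)
print(top_label, top_count)
```

If one label dominates the misroutes and it matches the label of the last example in the prompt, recency is a reasonable first suspect.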

Distribution shift: the example did not look like the test set

My real support messages are short. One or two sentences, often a typo, usually no stack trace. Examples 5 and 6 were polished, multi-sentence, well-formatted. So I had handed the model a picture of "what an input looks like" that did not match what inputs actually look like.

This is the part I find genuinely counterintuitive. The two examples I added were higher quality in the abstract. More complete, more information, more careful. But few-shot examples are not documentation, they are a sample of the input distribution, and a beautiful example that misrepresents your real traffic teaches the model the wrong shape. A blurry photo of the right thing beats a sharp photo of the wrong thing.
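
One cheap guard against this is to compare example length with real traffic before anything goes into the prompt. The sketch below flags examples whose word count exceeds a multiple of the median traffic length. The `max_ratio` of 2.0 is an arbitrary assumption, and the sample messages are invented.

```python
from statistics import median
from typing import List


def off_distribution(examples: List[str], traffic: List[str],
                     max_ratio: float = 2.0) -> List[str]:
    """Return examples much longer (in words) than the median real message."""
    typical = median(len(m.split()) for m in traffic)
    return [e for e in examples if len(e.split()) > max_ratio * typical]


# Invented sample of real traffic: short, messy messages.
traffic = [
    "charged twice??",
    "export btn broken on safari",
    "cant login",
    "pls add dark mode",
]

short_example = "I was charged twice this month."
long_example = (
    "My refund failed last week, then I was charged a second time, "
    "and now the invoice shows a total I do not recognize."
)

flagged = off_distribution([short_example, long_example], traffic)
print(flagged)
```

Word count is a crude proxy for "looks like real traffic", but it is enough to catch an example that is several times longer than anything users actually send.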

The recent literature has a tidy name for the general version of this: the "few-shot dilemma." Performance peaks at some number of examples and then declines as you add more, and the decline shows up across many models, not just one (Zhang et al., 2025). The takeaway isn't "use fewer examples." Example count has a ceiling that depends on what the examples actually contain, and going past that ceiling costs you.

What few-shot is actually for

This experiment forced me to be honest about a category error I had been making. I was using few-shot examples to do a job they are bad at.

Few-shot is a control surface for form and behavior: output shape, label vocabulary, tone, how the model handles "I don't know." It is not a knowledge channel. If you want the model to know a new fact, examples will not put it there. That is what retrieval is for, and even retrieval has its own failure mode where misleading retrieved context can drag a model below its zero-shot answer (Ming et al., 2025).

So the clean division of labor I now keep in my head:

Job	Right tool	Wrong tool
Fix output format / label set	Few-shot	RAG
Inject new facts	RAG	Few-shot
Teach "admit when unsure"	Few-shot	More facts

My examples 5 and 6 failed because I was unconsciously treating them as "more information is better" knowledge injection, when their only real job was to demonstrate the mapping from message to label. For that job, two short, on-distribution examples beat one long, off-distribution masterpiece.

How I catch it now

The fix was not cleverness. It was measuring at the right granularity. Three habits came out of this:

Add examples one at a time and re-score. The aggregate "6 examples, 75%" hides which examples did the damage. I only learned which two were toxic because I had the per-example deltas. If I had added all six in one batch I would have seen 75% and shrugged, blaming the model.

Match example length and style to real inputs. Before an example goes in the prompt, I ask whether it looks like something a real user would actually send. If my traffic is one-line typo-ridden messages, my examples are one-line typo-ridden messages. The polished ones go in the docs, not the prompt.

Watch the last example specifically. Because of recency, the final example carries extra weight. I now make sure the last slot is either neutral or rotated, rather than letting one heavy label sit there poisoning the tail. Contextual calibration is the rigorous version of this if you want to go further; the cheap version is just not stacking your most dramatic example last.
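
One possible reading of "rotated" is sketched below: run the prompt once per cyclic rotation of the examples, so every example takes the last slot exactly once, then take a majority vote. The voting step is an added assumption rather than something described above, and it costs one model call per rotation. `recency_biased_stub` is a toy stand-in for a model, and the account and feature_request examples are invented.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (message, label)
Classifier = Callable[[Sequence[Example], str], str]


def rotations(examples: Sequence[Example]) -> List[List[Example]]:
    """Every cyclic rotation, so each example sits in the last slot once."""
    return [list(examples[i:]) + list(examples[:i]) for i in range(len(examples))]


def vote(classify: Classifier, examples: Sequence[Example], message: str) -> str:
    """Classify once per rotation and return the most common label."""
    predictions = [classify(order, message) for order in rotations(examples)]
    return Counter(predictions).most_common(1)[0][0]


def recency_biased_stub(examples: Sequence[Example], message: str) -> str:
    """Toy model: answers "bug" whenever a bug example comes last."""
    if examples and examples[-1][1] == "bug":
        return "bug"
    return "account" if "password" in message else "billing"


examples = [
    ("I was charged twice this month.", "billing"),
    ("I need to change my login email.", "account"),
    ("Please add dark mode.", "feature_request"),
    ("The export button does nothing on Safari.", "bug"),
]
message = "cant reset my password"

single = recency_biased_stub(examples, message)  # bug example is last here
voted = vote(recency_biased_stub, examples, message)
print(single, voted)
```

With the toy model, the single fixed order returns "bug" while the vote across rotations returns "account", which is the effect rotation is meant to have on a recency-biased model.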

The whole thing cost me an afternoon and a small dent in my confidence. But I no longer treat "add another example" as a free move. Every example is a vote, recency makes the late votes louder, and two confident voters facing the wrong direction can outvote four that are right.

Six examples. Four good citizens. Two that I would have bragged about, quietly making everything worse.

This experiment is a worked-through version of one chapter from my book on context engineering, where I go deeper on the form-versus-knowledge split and when to reach for retrieval instead. If that division of labor is useful to you, the full treatment is in Context Engineering.