Few-Shot Examples Are Eating Your Tokens. Here's the Cull Test.

#llm #ai #prompt #testing

Book: Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

Three of the five examples in your few-shot block contribute nothing to accuracy. They cost you tokens on every call, and you've never run the ablation that would prove it.

You won't notice until someone audits the spend. Then you'll find a 1,400-token preamble that started as a 200-token nudge six months ago. Nobody remembers which examples were load-bearing and which were panic-merges from a bug report on a Friday afternoon.

How few-shot blocks bloat

The pattern is identical across most teams shipping an LLM product. Day one: two clean examples that nail the format. Week three: a customer hits an edge case, someone copy-pastes the failing input into the prompt as example three, marks the PR "fixes ticket 4421," and ships. Month two: example four was added to handle the German address with the comma in the street name. Month four: example five fixed a regression that example three introduced and nobody flagged.

Every example added is signed off. None ever gets removed.

The block becomes append-only because removal feels risky. If you delete example four and the German address breaks, you're the one who shipped it. If you keep all five and the bill creeps up, that's just "LLM costs." So nothing moves. The prompt grows, accuracy doesn't, and tokens compound.

The fix isn't "write better examples." Almost every prompt-engineering guide tells you that. The fix is to test deletion. Leave-one-out ablation, the same trick ML researchers use to pick features, works on few-shot examples too. Drop one, re-run the eval, look at the delta. If accuracy doesn't move beyond noise, that example is decoration.

Leave-one-out ablation, conceptually

You have N examples and an eval set of M inputs with known-good outputs. First, run the full prompt against the eval set to get a baseline score. Say 0.84 accuracy across 200 inputs.

Now you run the same eval N more times, each time with exactly one example removed. N evals, N scores. Each score tells you what the prompt would do without that example.

If removing example three drops the score from 0.84 to 0.83, that's inside noise. Example three isn't doing anything measurable. If removing example four drops it to 0.71, example four is load-bearing and you keep it. The math is that boring.

The only subtlety is the noise floor. Run the baseline three times (same prompt, same eval, same model) and you'll see scores like 0.84, 0.83, 0.85. The spread between those is your noise floor. For most extraction tasks at moderate temperature, expect 1-2 percentage points. Anything inside that range is indistinguishable from rerunning the same prompt twice. Cull rule: drop the example only when its absence costs you less than 1.5× the observed noise floor.

If you want to be rigorous, bump the eval set past 100 items and the noise tightens. Below 50 items you're guessing.
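
To make the noise-floor rule concrete, here is a minimal sketch. The helper names `noise_floor` and `is_cullable` are mine, not part of any library, and "spread" is taken as max minus min across repeated baseline runs, which is one reasonable reading of the rule above.

```python
import statistics

def noise_floor(baseline_scores: list[float]) -> float:
    """Spread (max - min) across repeated runs of the same prompt and eval."""
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline runs to measure spread")
    return max(baseline_scores) - min(baseline_scores)

def is_cullable(baseline: float, score_without: float, floor: float,
                factor: float = 1.5) -> bool:
    """True when removing the example costs less than factor x noise floor."""
    cost = baseline - score_without  # positive = accuracy lost by removing it
    return cost < factor * floor

runs = [0.84, 0.83, 0.85]           # the three baseline runs from the text
floor = noise_floor(runs)            # about 0.02
baseline = statistics.mean(runs)     # about 0.84
print(is_cullable(baseline, 0.83, floor))  # True: removal costs ~0.01
print(is_cullable(baseline, 0.71, floor))  # False: removal costs ~0.13
```

An example whose removal raises the score has a negative cost, so it lands on the cullable side automatically.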

The 50-line harness

Vendor-agnostic, async, no framework. Swap the call_llm function for whatever client you're using: OpenAI, Anthropic, Bedrock, a local vLLM endpoint, doesn't matter. The harness only cares that it gets a string back.

import asyncio
import json
import statistics
from typing import Callable

# replace with your client of choice
async def call_llm(prompt: str, user_input: str) -> str:
    # e.g. openai.AsyncOpenAI().chat.completions.create(...)
    raise NotImplementedError

def build_prompt(examples: list[dict], user_input: str) -> str:
    shots = "\n\n".join(
        f"INPUT:\n{e['input']}\nOUTPUT:\n{json.dumps(e['output'])}"
        for e in examples
    )
    return f"{shots}\n\nINPUT:\n{user_input}\nOUTPUT:\n"

async def score_one(examples, item, judge):
    prompt = build_prompt(examples, item["input"])
    out = await call_llm(prompt, item["input"])
    return 1.0 if judge(out, item["expected"]) else 0.0

async def evaluate(examples, eval_set, judge, concurrency=8):
    sem = asyncio.Semaphore(concurrency)
    async def bounded(item):
        async with sem:
            return await score_one(examples, item, judge)
    scores = await asyncio.gather(*(bounded(i) for i in eval_set))
    return statistics.mean(scores)

async def leave_one_out(examples, eval_set, judge):
    baseline = await evaluate(examples, eval_set, judge)
    deltas = {}
    for i, ex in enumerate(examples):
        without = examples[:i] + examples[i+1:]
        score = await evaluate(without, eval_set, judge)
        # positive delta = accuracy went up when this example was removed
        deltas[ex["id"]] = round(score - baseline, 4)
    return baseline, deltas

# strict JSON-equality judge; swap in a fuzzy one for free text
def json_judge(out: str, expected: dict) -> bool:
    try:
        return json.loads(out) == expected
    except json.JSONDecodeError:
        return False

if __name__ == "__main__":
    # context managers close the files once they are parsed
    with open("examples.json") as f:
        examples = json.load(f)
    with open("eval.json") as f:
        eval_set = json.load(f)
    base, deltas = asyncio.run(
        leave_one_out(examples, eval_set, json_judge)
    )
    print(f"baseline: {base:.3f}")
    for ex_id, d in sorted(deltas.items(), key=lambda x: -x[1]):
        print(f"  drop {ex_id}: {d:+.3f}")

Two things to know before you run it. The semaphore caps fan-out so you don't trip rate limits. Set it to whatever your tier handles. And json_judge is strict; for free-text tasks swap in a fuzzy comparator (token-level F1, BLEU, or a cheaper LLM-as-judge). The strict version once caught a "fix" that was actually breaking JSON parsing for 8% of inputs before it shipped.
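
For the free-text case, a token-level F1 judge might look like the sketch below. It assumes `expected` is a plain string rather than a dict, tokenizes by lowercasing and splitting on whitespace, and uses a 0.8 pass threshold. All three are illustrative choices to tune for your task, and `make_f1_judge` is a hypothetical helper name.

```python
from collections import Counter

def token_f1(predicted: str, expected: str) -> float:
    """Token-level F1 over lowercased, whitespace-split tokens."""
    pred = predicted.lower().split()
    gold = expected.lower().split()
    if not pred or not gold:
        # both empty counts as a match; only one empty is a miss
        return 1.0 if pred == gold else 0.0
    # multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def make_f1_judge(threshold: float = 0.8):
    """Build a judge(out, expected) -> bool with the same call shape as json_judge."""
    def judge(out: str, expected: str) -> bool:
        return token_f1(out, expected) >= threshold
    return judge
```

Pass `make_f1_judge(0.8)` wherever the harness takes `json_judge`. A looser threshold forgives more rewording but also lets more wrong answers through, so check a handful of borderline outputs by hand before trusting the scores.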

Running it on a real-shaped prompt

Take a structured-extraction prompt that pulls company name, role, and start date out of free-text job descriptions. Five examples accumulated over four months. The team thought all five were pulling weight. They weren't.

The output from running the harness against a 180-item eval set, baseline 0.847:

baseline: 0.847
  drop ex5: +0.018
  drop ex3: +0.011
  drop ex2: -0.004
  drop ex1: -0.052
  drop ex4: -0.061

Read that carefully. Positive deltas mean accuracy went up when the example was removed. Example five isn't just decoration; it's actively hurting. Dropping it moves the score up by enough to clear the noise floor (which, baselined three times, sat around 0.012). Example three leans the same way, but +0.011 sits just inside that floor, so the safe reading is decoration at best. Either way, both go.

Example two is dead weight: -0.004 is well inside noise, no measurable effect either direction. Example one and example four are the keepers. One handles a common pattern, four handles the German address case with the embedded comma. Deleting either one costs 5-6 points of accuracy, four to five times the noise floor.

Final block: two examples, with example four moved to the last slot (recency bias: the last example carries more weight in most models, so put your strongest there). Token count went from 1,420 to 920. Accuracy went up from 0.847 to 0.866 because the confusing example five stopped poisoning the format. 35% token reduction, slightly better quality, ten minutes of harness work.

This is the boring outcome you should expect. Most teams find a clean 30-40% cull on prompts older than three months.

The cull rules

Distilled from running this on a few dozen prompts:

Drop anything whose removal costs less than 1.5× the noise floor. If your baseline jitter is 0.012, an accuracy loss under 0.018 is invisible. An example whose removal raises the score goes too.
Keep edge-case examples even if their delta is small. If example four handles a rare pattern that shows up once a month in production but wasn't in the eval set, the harness can't see its value. Tag examples in your source file with an intent field (format, edge_case, tone) and protect anything tagged edge_case from the cull unless you've explicitly added that case to the eval set.
Strongest example last. After the cull, reorder. Models weight the most recent example slightly higher; put the one with the largest negative delta (the most load-bearing) at the bottom.
Archive, don't delete. Move culled examples into a comment block at the top of the prompt file with the date and the LOO delta. When the regression shows up six weeks later you'll want to know what you removed.
Re-run quarterly. Models update. The example that was load-bearing on gpt-4-0613 might be decoration on a current snapshot.
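
The rules above can be sketched as one function. This is an illustration under a few assumptions: `apply_cull` is a hypothetical name, each example carries the `id` field the harness uses plus an optional `intent` tag, and `deltas` follows the printed convention (score without the example minus baseline, so positive means removal helped).

```python
from datetime import date

def apply_cull(examples, deltas, noise_floor, factor=1.5,
               protected=("edge_case", "docs")):
    """Split examples into (kept, archived) using the cull rules."""
    threshold = factor * noise_floor
    kept, archived = [], []
    for ex in examples:
        delta = deltas[ex["id"]]
        is_protected = ex.get("intent") in protected
        # cullable when removal costs less than the threshold, or helps
        if delta > -threshold and not is_protected:
            archived.append({**ex, "loo_delta": delta,
                             "archived_on": date.today().isoformat()})
        else:
            kept.append(ex)
    # most load-bearing example (most negative delta) goes last
    kept.sort(key=lambda ex: deltas[ex["id"]], reverse=True)
    return kept, archived

# deltas from the run above; the intent tags here are made up for the demo
examples = [{"id": f"ex{i}", "intent": "format"} for i in range(1, 6)]
deltas = {"ex1": -0.052, "ex2": -0.004, "ex3": 0.011,
          "ex4": -0.061, "ex5": 0.018}
kept, archived = apply_cull(examples, deltas, noise_floor=0.012)
print([ex["id"] for ex in kept])      # ['ex1', 'ex4']
print([ex["id"] for ex in archived])  # ['ex2', 'ex3', 'ex5']
```

Protected examples stay in the block regardless of their delta and sort toward the top, since they are the least load-bearing on the eval set. The archived entries carry the date and the LOO delta, ready to paste into the comment block.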

A team I worked with started running LOO automatically in CI for every prompt change. Any new example has to clear the 1.5× threshold to be kept. They went from 47 cumulative few-shot examples across their system to 19, with a 4-point accuracy improvement and a 38% token-cost drop. The CI step takes nine minutes and runs once per PR that touches a prompt.

When NOT to cull

The harness can mislead you in three specific cases.

Your eval set is small. Below 50 items, every score is too noisy to trust. The noise floor swallows real signal. Either grow the set first (sample 200 real production inputs, hand-label them) or skip the cull and live with the bloat. Don't cull on a 20-item eval and tell yourself the math worked.

High-variance tasks. Open-ended generation (summaries, marketing copy, anything with no single "right" answer) has noise floors closer to 5-7 points, which means most LOO deltas live below the threshold. You can still run it, but pair it with a side-by-side LLM-as-judge eval rather than exact-match scoring, or you'll cull examples that were quietly setting tone.

Examples-as-documentation. Sometimes an example exists not for the model but for the next engineer reading the prompt. It shows the contract: "this is what the JSON should look like." If the model gets the format right without it, the harness will flag it as cullable. Keep it anyway, mark it intent: docs in your source, and skip it during cull runs. Future-you will thank present-you the first time a new teammate reads the prompt.

One more thing. Fix the eval before you cull. If your eval set has wrong labels, the harness will dutifully optimize against them and you'll ship a prompt that scores higher against bad ground truth. Sample 20 items from your eval, hand-check them, fix the mislabels, then run LOO. The order matters.
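
Pulling the audit sample takes a few lines. This sketch assumes the eval set is the list of dicts the harness loads from eval.json; `sample_for_audit` and the fixed seed are my additions, so that two people checking labels look at the same 20 items.

```python
import random

def sample_for_audit(eval_set: list[dict], k: int = 20, seed: int = 0) -> list[dict]:
    """Pick k eval items to hand-check; a fixed seed makes the audit repeatable."""
    rng = random.Random(seed)  # local generator, leaves global random state alone
    return rng.sample(eval_set, min(k, len(eval_set)))
```

Change the seed on the next audit so you are not re-checking the same items every quarter.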

The harness is fifty lines. Most teams who run it the first time find they're paying for a third of their prompt for no reason.

If this was useful

Leave-one-out is one of the cheapest prompt-engineering moves with a measurable payoff, and it's the kind of thing that's obvious once you've seen it and invisible otherwise. The chapter on example selection in the Prompt Engineering Pocket Guide goes deeper on ordering effects, intent-tagging, and when to swap few-shot for retrieval-based dynamic examples. Same harness mindset, applied to a few more decisions you're probably making on instinct.

What's the worst few-shot example you culled, and how much did you save? Drop the numbers in the comments.