Gabriel Anhaia

Few-Shot Selection at Runtime: Why Static Examples Hurt Edge Cases


You picked five good few-shot examples for your classifier, pasted them into the system prompt, and shipped. The eval set passed. Production accuracy on the dashboard sat at the same number as the eval, until a misclassified ticket landed in your inbox: a refund question the model called shipping, with vocabulary nothing like your five examples.

That is the failure mode every static few-shot prompt eventually hits. Five examples cover the head of your input distribution. The tail is everywhere else, and the tail is where most of the interesting bugs live.

The fix is to stop treating examples as a constant. Treat them as a retrieval problem. For each incoming query, pull the k most-similar past examples from a bank, inject those into the prompt, then call the model. The prompt now changes shape per query. The 70 lines below are the whole pattern.

Why static few-shot caps out

Picking few-shot examples by hand is a sampling problem in disguise. You read 50 tickets, pick 5 that feel representative, and ship. The 5 you picked anchor the model's behavior on every input that follows. Inputs that look like those 5 get handled well; the rest don't.

Three forces make this worse over time:

  • Vocabulary drift. A team launches a new product line. The customer's words change. The examples in your prompt still describe last quarter's product taxonomy.
  • Customer-segment skew. Your hand-picked examples come from English-speaking users on the web app. The mobile app on Android in Brazil sends queries that don't look like any of them.
  • Long-tail intents. The 5 examples cover the top 5 intents. The 50th intent (the one that comes up once a week and is always urgent) is invisible to the prompt.

You can paper over this for a while by adding more static examples. At 20 you start hitting context-budget pressure. At 40 you've noticeably slowed every call and raised the per-request cost. At 60 the model starts ignoring later examples in the list. The relationship between number-of-examples and accuracy is not linear. Anthropic's prompt-engineering guide recommends three to five for most tasks, which lines up with that ceiling.

The ceiling isn't a count of examples. It's whether they fit the query in front of you.

The dynamic few-shot pattern

Build a bank of past inputs paired with their correct outputs. Embed every input in the bank into a vector store. At query time, embed the new input, retrieve the k most-similar examples, format them into the prompt, call the model. The bank can be your eval set, your historical labeled data, or a feedback log of past corrections.

Three things change versus static:

  1. The prompt is now query-conditional. The 5 examples a refund question sees are different from the 5 a shipping question sees.
  2. The bank can grow without changing the prompt. Add a new labeled case → it's available next request. No deploy.
  3. The bank can be filtered at retrieval time. Recent cases only. High-success-rate cases only. Cases for this customer's tier only.

The cost is one embedding call plus a similarity search per query (rough order of magnitude on a hot in-memory index: tens of ms) and a context window that holds the same five-example budget, just chosen per query. You can tune k from 3 to 10 without much code change.

Where dynamic beats static, in order of impact:

  • Long-tail queries. If the new query is rare, the static prompt has nothing for it. The dynamic prompt finds the closest historical match and gives the model a foothold.
  • Customer-specific vocabulary. A B2B tenant uses internal jargon for their product names. The dynamic bank, filtered by tenant_id, returns examples that share that jargon.
  • Evolving taxonomies. New intents added to the bank yesterday show up in retrieval today. No prompt rewrite, no redeploy.

Where static still wins:

  • Tasks with five canonical examples. "Translate informal English to formal English" doesn't need retrieval. The model has the concept; you just need to anchor the output style.
  • Contracts with the model. When the few-shot is teaching format, not content (always answer in JSON, always cite the source), static is enough and the prompt cache loves it.
  • Cold-start, no labeled data. You need a bank to retrieve from. If you have 10 examples total, dynamic adds latency without payoff.

A working rule of thumb on data volume: under ~50 labels, stay static. Past ~200, dynamic almost always wins. Between those, run the eval both ways and let the numbers pick. These are heuristics rather than empirical findings, so anchor the call to your own eval set when you can.

A 70-line harness

Below is the whole thing in Python. It uses Anthropic for the chat call and a local sentence-transformers model for embeddings (so you can run it without standing up a vector DB). For a real deployment, replace the in-memory numpy index with pgvector, Qdrant, or whatever your stack already uses.

import json
from dataclasses import dataclass

import numpy as np
from anthropic import Anthropic
from sentence_transformers import SentenceTransformer

client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")
MODEL = "claude-sonnet-4-5"


@dataclass
class Example:
    text: str
    label: str
    embedding: np.ndarray


def build_bank(items: list[dict]) -> list[Example]:
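    # Embed every bank item once up front; retrieval reuses these vectors.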
    texts = [it["text"] for it in items]
    vecs = embedder.encode(texts, normalize_embeddings=True)
    return [
        Example(it["text"], it["label"], vec)
        for it, vec in zip(items, vecs)
    ]


def topk(query: str, bank: list[Example], k: int) -> list[Example]:
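    # Embeddings are L2-normalized, so the dot product below is cosine similarity.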
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = np.array([float(q @ ex.embedding) for ex in bank])
    idx = np.argsort(-scores)[:k]
    return [bank[i] for i in idx]


def format_shots(shots: list[Example]) -> str:
    return "\n\n".join(
        f"Input: {s.text}\nLabel: {s.label}" for s in shots
    )


def classify(query: str, bank: list[Example], k: int = 5) -> str:
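    # Retrieve the k nearest labeled examples and inline them into the system prompt.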
    shots = topk(query, bank, k)
    system = (
        "You classify support tickets into one of: "
        "refund, shipping, account, technical, other.\n\n"
        "Examples:\n\n" + format_shots(shots)
    )
    resp = client.messages.create(
        model=MODEL,
        max_tokens=64,
        system=system,
        messages=[{"role": "user", "content": query}],
    )
    return resp.content[0].text.strip()


if __name__ == "__main__":
    with open("bank.jsonl") as f:
        items = [json.loads(line) for line in f]
    bank = build_bank(items)
    while True:
        q = input("ticket> ")
        if not q:
            break
        print(classify(q, bank, k=5))

Two functions carry it: topk retrieves, classify runs the call. The rest is plumbing.

A few notes on the shape:

  • Cosine similarity via dot product. normalize_embeddings=True turns dot product into cosine similarity. That's the single line that decides retrieval quality more than anything else.
  • k = 5 is a good starting default. Below 3, the model lacks context. Above 8, you start paying for examples that don't move accuracy.
  • The system prompt holds the examples. Anthropic's prompt cache works on the stable prefix of the system prompt, not on the per-query example block, which by definition varies per request. So the cached portion is whatever sits above the examples (instructions, label list, format spec); the example block is fresh every call. In high-traffic systems with stable input distributions, hot queries still hit the cached prefix; long-tail queries pay full freight, which is the right cost shape. A sketch of that split follows this list.
  • No re-ranker. A classifier with a tight label set runs fine on top-k by cosine. Open-ended generation is where rerankers earn their keep: drop a cross-encoder on the top-50 and pull the top-5 from that.
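
To make the caching note concrete, here is one shape the split can take with the Messages API's block-level cache_control: the stable instructions go in a block marked cacheable, and the retrieved examples go in a second block that changes per request. Treat it as a sketch rather than a drop-in; caching only kicks in once the cacheable prefix clears the model's minimum length, so a label list this short would not be cached on its own.

STABLE_INSTRUCTIONS = (
    "You classify support tickets into one of: "
    "refund, shipping, account, technical, other."
)


def classify_cached(query: str, bank: list[Example], k: int = 5) -> str:
    shots = topk(query, bank, k)
    resp = client.messages.create(
        model=MODEL,
        max_tokens=64,
        system=[
            # Stable prefix: instructions and label list, marked cacheable.
            {
                "type": "text",
                "text": STABLE_INSTRUCTIONS,
                "cache_control": {"type": "ephemeral"},
            },
            # Per-query block: the retrieved examples, fresh on every call.
            {"type": "text", "text": "Examples:\n\n" + format_shots(shots)},
        ],
        messages=[{"role": "user", "content": query}],
    )
    return resp.content[0].text.strip()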

Filters that earn their keep

The topk function in the harness retrieves on similarity alone. In production, similarity is the start, not the end. Three filters tend to pay off:

Filter by recency. If the bank goes back two years, examples from 2024 may describe a product taxonomy that no longer exists. Add a created_at column to the bank and either hard-cut at 90 days or weight similarity by recency: score = cosine * exp(-age_days / 60). The exponential decay gives recent examples a soft boost without erasing useful older ones.
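
In the in-memory harness that weighting is a few lines. A sketch, under the assumption that each Example grows an age_days field derived from created_at:

import math


def topk_recency(query: str, bank: list[Example], k: int,
                 decay_days: float = 60.0) -> list[Example]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    # Recent examples get a soft boost; older ones decay but never vanish.
    scores = np.array([
        float(q @ ex.embedding) * math.exp(-ex.age_days / decay_days)
        for ex in bank
    ])
    idx = np.argsort(-scores)[:k]
    return [bank[i] for i in idx]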

Filter by success rate. If you're capturing model corrections (the human relabeled this case after a misclassification), tag those rows. Now you have two banks: a "model got it right" bank and a "model got it wrong, here's the correction" bank. Pull from the corrections bank first when the input looks like a previously-broken case. This is the cheapest active-learning loop you can ship.
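
A sketch of that lookup order, assuming the bank has been split into a corrections list and a regular list, and using an arbitrary 0.8 cosine cutoff for "looks like a previously-broken case":

def topk_corrections_first(query: str, corrections: list[Example],
                           regular: list[Example], k: int = 5,
                           cutoff: float = 0.8) -> list[Example]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    # Corrections that strongly resemble the query jump the queue.
    close_fixes = sorted(
        (ex for ex in corrections if float(q @ ex.embedding) >= cutoff),
        key=lambda ex: -float(q @ ex.embedding),
    )
    shots = close_fixes[:k]
    if len(shots) < k:
        # Fill the remaining slots from the regular bank.
        shots += topk(query, regular, k - len(shots))
    return shots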

Filter by tenant. Multi-tenant systems leak vocabulary across customers if you don't scope the bank. Add a tenant_id column. At retrieval, scope to tenant_id IN (this_tenant, GLOBAL). The tenant-specific examples come back first; the global examples fill in when the tenant bank is sparse.
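
On the in-memory bank the same scoping is a pre-filter before topk. A sketch, assuming each Example also carries a tenant_id and the string "GLOBAL" marks shared rows:

def topk_for_tenant(query: str, bank: list[Example],
                    tenant_id: str, k: int = 5) -> list[Example]:
    # Tenant-specific rows come back first; global rows fill in when sparse.
    tenant_rows = [ex for ex in bank if ex.tenant_id == tenant_id]
    shots = topk(query, tenant_rows, k) if tenant_rows else []
    if len(shots) < k:
        global_rows = [ex for ex in bank if ex.tenant_id == "GLOBAL"]
        shots += topk(query, global_rows, k - len(shots))
    return shots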

Each filter is a WHERE clause on the underlying store. None of them require changing the prompt template.

Two things to instrument

If you ship this without metrics, you've replaced one black box with another. Two gauges tell you whether dynamic few-shot is paying off:

  1. Retrieval similarity score for the top match. Log the cosine similarity of the best-matching example for every query. Plot the distribution. The left tail is the queries that didn't match anything in the bank — those are your gap. Build a feedback loop where queries below a threshold get flagged for human labeling, and the labels feed back into the bank.
  2. Per-bucket accuracy on the eval set. Split the eval set by retrieval-similarity quartile. Quartile 1 (worst-matched) and quartile 4 (best-matched) should both pass. If quartile 1 accuracy drops, your bank doesn't cover that part of the distribution and you need more labels there. The quartile 4 number tells you the ceiling — what the model can do when it has good examples.

These two gauges turn the bank into a system you can debug. Without them, you're back to "five examples in a prompt."
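
The first gauge fits in a thin wrapper around classify. This sketch assumes a plain logger, a needs_label.jsonl review file, and an arbitrary 0.4 flag threshold you would tune against your own score distribution:

import logging

log = logging.getLogger("fewshot")
MIN_SIMILARITY = 0.4  # assumption: tune against your own score distribution


def classify_instrumented(query: str, bank: list[Example], k: int = 5) -> str:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    top_score = max(float(q @ ex.embedding) for ex in bank)
    log.info("top_match_similarity=%.3f", top_score)
    if top_score < MIN_SIMILARITY:
        # Left-tail query: nothing in the bank looks like it. Queue for labeling.
        with open("needs_label.jsonl", "a") as f:
            f.write(json.dumps({"text": query, "top_score": top_score}) + "\n")
    return classify(query, bank, k)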

What to do with this on Monday

If you have a static few-shot prompt in production with at least 200 labeled examples in some form (eval set, historical tickets, anything with input + correct output), the migration is small. Index your existing labels. Replace the static example block in the system prompt with a format_shots(topk(query, bank, k=5)) call. Run your eval set through both versions. The accuracy delta is your answer.
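
The eval comparison itself is short. A sketch, assuming eval_rows is a list of {"text", "label"} dicts and classify_static is your current prompt with the hard-coded example block:

def accuracy(classify_fn, eval_rows: list[dict]) -> float:
    hits = sum(
        1 for row in eval_rows
        if classify_fn(row["text"]).strip().lower() == row["label"]
    )
    return hits / len(eval_rows)


static_acc = accuracy(classify_static, eval_rows)
dynamic_acc = accuracy(lambda t: classify(t, bank, k=5), eval_rows)
print(f"static={static_acc:.3f}  dynamic={dynamic_acc:.3f}")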

If you have fewer than 50 labels, build the bank as a side-effect of running production traffic instead. Log every (input, model output, human correction) tuple. Inside two weeks of moderate traffic you'll have a bank worth retrieving from.

If you have hundreds of labels but the eval set already passes, the lift will land in the long tail rather than the headline number. Watch the quartile-1 accuracy gauge described above. That's where dynamic few-shot earns its place.

The static prompt will keep working for the head of your distribution. The tail is where the bug reports come from. Letting the prompt change shape per query is what makes that tail tractable.


If this was useful

Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs covers dynamic few-shot selection alongside the rest of the patterns that survive in production prompt engineering — example banks, retrieval-conditional prompts, prompt caching, and the eval discipline that tells you which choices paid off. The book is structured the way you'd actually pick up these patterns at work: one concrete failure, one pattern, one decision rule.

