Few-Shot or Fine-Tune in 2026: The Decision Most Teams Get Backwards

#ai #llm #promptengineering #machinelearning

Book: Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You're three weeks into a classification feature and accuracy is stuck at 0.78. Someone in the standup says the word: fine-tune. The room nods. It sounds like the grown-up move. Examples in the prompt feel like a hack you're embarrassed to still be running.

So you spend a sprint building a training set, paying for a tuning job, and wiring a new endpoint into the deploy. Accuracy lands at 0.82. Four points, two weeks, and now you own a model artifact that's pinned to a snapshot you'll have to re-tune the next time the base model updates.

The thing is, the four points were probably sitting in your prompt the whole time. Most teams reach for fine-tuning when better examples would have closed the gap, and they keep stacking examples when fine-tuning was the right call months earlier. The decision runs backwards more often than it runs forward.

What each one actually changes

Few-shot puts examples in the context window at request time. The model's weights never move. You're showing it the pattern on every call: here are five inputs and the outputs I want, now do the sixth. It learns the shape for the length of that one request and forgets it the moment the call returns.

Fine-tuning changes the weights. You take a base model, feed it hundreds or thousands of input-output pairs, and the training run nudges the parameters so the behavior is baked in. The pattern lives in the model, not the prompt. Your per-call prompt gets shorter because you stop shipping the examples.

That's the whole difference, and it drives everything else. Few-shot trades tokens-per-call for zero training cost and instant iteration. Fine-tuning trades upfront cost and slower iteration for cheaper, shorter calls later. Which trade wins depends on numbers you can mostly estimate before you build anything.

When few-shot wins

Reach for examples-in-context first. It's the cheaper experiment and it tells you whether the task is even learnable from demonstration.

Few-shot is the right tool when:

You're still iterating on the spec. If the task definition changes weekly, fine-tuning is pouring concrete on a moving target. Editing a prompt is free; re-running a tuning job is not.
Your volume is low or spiky. A few thousand calls a day doesn't generate enough per-call token spend to justify a training run plus a hosted custom model.
The pattern fits in a handful of examples. Format-following, simple extraction, tone matching, routing into 3-5 categories. If five good examples get you most of the way, you don't have a fine-tuning problem.
You need to change behavior today. A customer hits an edge case at 4pm. You add an example, run the eval, ship before dinner. No training loop in the way.

Modern context windows also moved this line. When few-shot meant three examples because that's all that fit, fine-tuning won earlier. With long context, you can fit dozens of examples or pull them dynamically per request, and that covers a lot of ground that used to need tuning.

When fine-tuning wins

Fine-tuning earns its keep in a narrower band than most teams assume. The signals:

High, steady volume. Millions of calls where a 1,500-token few-shot preamble is being paid for on every single one. That's where moving the examples into the weights and shipping a 100-token prompt actually pays back the training cost.
A stable task. The spec hasn't moved in months and you don't expect it to. You're optimizing a known quantity, not exploring.
A pattern too big for examples. Hundreds of categories, a house style with a thousand subtle rules, a domain dialect the base model wasn't trained on. If you can't demonstrate it in a dozen examples, demonstration isn't enough.
Latency budgets that can't absorb a long prompt. Shorter prompts mean fewer input tokens to process. When you're fighting for every millisecond at the p99, a fat few-shot block is dead weight on each call.

Notice that "accuracy is stuck" is not on this list. A stuck score is usually a prompt problem, an example-selection problem, or an eval-label problem before it's a weights problem.

The cost math, roughly

Run the back-of-envelope before you commit a sprint. The shape:

few-shot cost  = calls/month
               × few_shot_tokens
               × input_price_per_token

fine-tune cost = training_run_cost          (one-time)
               + calls/month
                 × short_prompt_tokens
                 × input_price_per_token
               + hosting_premium             (if any)

The crossover is where the per-call token savings from a shorter prompt pay back the training run. At a thousand calls a month, a fat preamble costs almost nothing and you'll never recoup a tuning job. At tens of millions of calls, a 1,400-token few-shot block is a line item someone will eventually circle in red.

Plug in your real numbers. Most teams who do this discover they're nowhere near the crossover, and the fine-tuning conversation quietly ends.

A quick simulation

You don't have to guess the crossover. Estimate it.

def monthly_cost_few_shot(calls, shot_tokens, price):
    return calls * shot_tokens * price

def monthly_cost_fine_tune(
    calls, short_tokens, price,
    train_cost, months, hosting=0.0,
):
    per_call = calls * short_tokens * price
    amortized = train_cost / months
    return per_call + amortized + hosting

# example inputs: price per input token, token counts
price = 3e-6          # $3 / 1M input tokens
few_shot_tokens = 1400
short_tokens = 150
train_cost = 400.0    # one-time tuning run
months = 6            # amortization window

for calls in (50_000, 500_000, 5_000_000):
    fs = monthly_cost_few_shot(
        calls, few_shot_tokens, price
    )
    ft = monthly_cost_fine_tune(
        calls, short_tokens, price,
        train_cost, months,
    )
    winner = "few-shot" if fs < ft else "fine-tune"
    print(
        f"{calls:>9,} calls | "
        f"few-shot ${fs:8.2f} | "
        f"fine-tune ${ft:8.2f} | {winner}"
    )

Swap in your provider's prices and your real token counts. The output tells you where the line sits for your workload, not for a blog post's assumptions. Run it before the sprint, not after.

The decision checklist

Walk these in order. Stop at the first honest "yes" that points you somewhere.

Have you culled and reordered your few-shot block yet? If not, do that first. Half the "stuck accuracy" cases are a bloated, badly-ordered example block, and the fix is an afternoon.
Is the task spec stable? If it's still moving, stay on few-shot. Don't tune a target you're still defining.
Have you run the cost math? If you're under the crossover, fine-tuning loses on economics regardless of accuracy. Stay on few-shot.
Can you demonstrate the pattern in a dozen examples? If yes, you probably don't need new weights. If no, fine-tuning starts to make sense.
Do you have clean labeled data? Fine-tuning needs hundreds to thousands of good pairs. If your labels are noisy, you'll bake the noise into the model. Fix the data first.
Can you own the maintenance? A tuned model is pinned to a base snapshot. When the base updates, you re-tune, re-eval, re-deploy. If nobody owns that loop, the model rots.

If you cleared 2 through 6 and you're past the crossover with clean data and a stable spec, fine-tune. Otherwise the answer was examples all along.

The maintenance tax nobody prices in

The part that gets skipped: a fine-tuned model is a frozen asset in a world that doesn't freeze. Base models improve every few months, and the new base often beats your tuned old one out of the box. Now you're choosing between a stale custom model and re-running the whole pipeline.

Few-shot has no such tax. The examples ride on whatever the current base model is. You swap the model name, re-run your eval, and you're done. That flexibility is worth real money during a stretch where the frontier moves this fast, and it rarely shows up in the spreadsheet that justified the tuning job.

None of this means fine-tuning is wrong. It means it's a commitment, and most teams commit before they've earned the right to. Get the prompt clean, run the math, check the data, then decide. The order is the whole point.

If this was useful

The choice between examples-in-context and weight updates is one of those decisions that feels technical but is mostly economic, and getting it backwards costs a sprint you didn't have to spend. The Prompt Engineering Pocket Guide goes deeper on example selection, dynamic few-shot retrieval, and the eval discipline that tells you which lever to pull before you pull it. Same instinct as the checklist above, applied across more of the decisions you're making by feel.