Ken Imoto

Posted on • Originally published at zenn.dev

A $0.25 model beat a $3 model -- with better context

The numbers

I ran the same benchmark on two Claude models. The $0.25 model scored 11.8. The $3 model scored 5.3.

The cheap model won by 123% -- more than double the score. And it cost one-twelfth as much per token.

I didn't expect this. Nobody expects this. The AI industry runs on a simple assumption: bigger model, better results. Pay more, get more. But the data told a different story.

Claude Haiku 3, Anthropic's smallest model, paired with a RAG pipeline, outperformed Claude Sonnet 4 running with zero context. Not by a small margin. By more than double.

Here's what makes this even stranger: Haiku + RAG (11.8) also beat Haiku + full Context Engineering (10.1). RAG alone -- just retrieving the right documents and stuffing them into the prompt -- unlocked more of Haiku's potential than a full stack of context techniques.

Think of it this way. A local hire with a perfect briefing doc outperforms a big-company transfer who wings it. Raw talent matters, but knowing what you're walking into matters more.

This wasn't a fluke in one test. The pattern held across question types. And it forced me to rethink everything I assumed about model selection.

The cost math

The performance gap is interesting. The cost gap is where it gets practical.

Here are the API prices (at time of writing):

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude Haiku 3 | $0.25 | $1.25 |
| Claude Sonnet 4 | $3.00 | $15.00 |

Sonnet costs 12x what Haiku costs. That's the sticker price. But RAG adds overhead -- you're retrieving documents, embedding queries, and injecting extra tokens into every prompt. Let's be generous and say RAG adds 50% to the base cost.

  • Haiku + RAG: $0.75 blended (simple average of input and output prices) × 1.5 = $1.125 per 1M tokens
  • Sonnet (no context): $9.00 blended per 1M tokens

Even after RAG overhead, Haiku + RAG is 1/8th the cost of Sonnet. And it scores 11.8 vs 5.3.

ROI: 17.8x difference

When you divide performance by cost, the gap explodes:

  • Haiku + RAG ROI: 11.8 / $1.125 = 10.49
  • Sonnet zero-context ROI: 5.3 / $9.00 = 0.59

That's a 17.8x ROI difference. You're paying less and getting more.
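The back-of-envelope math above is easy to reproduce. This sketch uses the same assumptions as the text: blended price is a simple average of input and output rates, and RAG adds a 50% overhead.

```python
def blended_price(input_price, output_price):
    """Simple average of per-1M-token input and output prices."""
    return (input_price + output_price) / 2

def roi(score, cost):
    """Benchmark score per dollar of blended cost."""
    return score / cost

haiku_cost = blended_price(0.25, 1.25) * 1.5   # RAG adds ~50% overhead -> $1.125
sonnet_cost = blended_price(3.00, 15.00)       # no context, no overhead -> $9.00

haiku_roi = roi(11.8, haiku_cost)    # ~10.49
sonnet_roi = roi(5.3, sonnet_cost)   # ~0.59
print(round(haiku_roi / sonnet_roi, 1))  # 17.8
```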

This is why so many production SaaS products run on small models behind the scenes. ChatGPT, Cursor, Perplexity -- they're not routing every query to the biggest model they have. They triage. Simple queries go to fast, cheap models. Only the hard stuff gets escalated. And the "simple" bucket is usually 70-80% of traffic.
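A hedged sketch of what that triage can look like. The heuristic and bucket names here are illustrative only -- production routers use trained classifiers, not keyword checks, and this is not how any of those products actually route.

```python
def route_query(query: str) -> str:
    """Toy triage: send short, lookup-style queries to the cheap model,
    escalate long or reasoning-heavy ones to the expensive one."""
    hard_signals = ("prove", "step by step", "design", "refactor")
    if len(query.split()) > 100 or any(s in query.lower() for s in hard_signals):
        return "large-model"
    return "small-model"

print(route_query("What is our refund policy?"))            # small-model
print(route_query("Design a sharded cache, step by step"))  # large-model
```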

The lesson: your default model should be the smallest one that works. Context design is where you spend your engineering effort, not model upgrades.

Real monthly numbers

Abstract per-token costs don't hit the same way as a monthly bill. Let's make it concrete.

Assume a mid-sized application: 1,000 queries/day, averaging 2,000 input tokens and 500 output tokens each.

```python
def monthly_cost(queries_per_day, avg_input, avg_output, input_price, output_price):
    monthly_queries = queries_per_day * 30
    cost_per_query = (avg_input / 1_000_000) * input_price + \
                     (avg_output / 1_000_000) * output_price
    return monthly_queries * cost_per_query

# Sonnet 4
sonnet = monthly_cost(1000, 2000, 500, 3.00, 15.00)
# => $405.00/month

# Haiku 3 + RAG (1,000 extra input tokens from retrieval)
haiku_rag = monthly_cost(1000, 3000, 500, 0.25, 1.25)
# + RAG infra ~$0.001/query = $30/month
# => $41.25 + $30 = $71.25/month
```
| | Sonnet 4 | Haiku + RAG |
|---|---|---|
| Monthly cost | $405.00 | $71.25 |
| Monthly savings | -- | $333.75 (82.4%) |
| Benchmark score | 5.3 | 11.8 |

You save $333.75/month. Your benchmark score more than doubles. And this is a modest workload -- at 10,000 queries/day, you're saving $3,337.50/month.

That $333.75 isn't theoretical. It's real budget you can redirect toward building the RAG pipeline, hiring, or just not bleeding cash on API bills while you validate product-market fit. For startups especially, this is the difference between "we can afford to experiment" and "we're burning runway on inference costs."

When should you actually do this?

Not always. That's the honest answer.

Some tasks genuinely need a large model's reasoning depth. Complex multi-step logic, subtle creative writing, tasks where the model needs to "think through" problems rather than look up answers -- these are where bigger models earn their cost.

But many production workloads are lookup-heavy, pattern-matching-heavy, or template-heavy. Those are exactly where small model + good context shines.

I built a four-phase framework for making this decision systematically.

The decision framework

Phase 1: Define what "good enough" means

Before touching any model, pin down three things:

  1. Performance threshold -- What accuracy do you actually need? 90%? 99%?
  2. Cost ceiling -- What's your monthly budget? What's your max per-query cost?
  3. Latency requirements -- Real-time response? Or can you batch?

Most teams skip this and jump straight to "let's use the best model." That's like renting a moving truck to carry groceries. You'll get the groceries home, sure. But you'll also spend $200 on something a backpack could handle.
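Pinning these three constraints down as an explicit config, so later phases can test against them, might look like this. The numbers are placeholders, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class Requirements:
    min_accuracy: float      # e.g. 0.90 -- the "good enough" threshold
    max_monthly_cost: float  # budget ceiling in dollars
    max_latency_ms: int      # per-query latency budget

    def accepts(self, accuracy, monthly_cost, latency_ms):
        """True only when a candidate setup clears all three constraints."""
        return (accuracy >= self.min_accuracy
                and monthly_cost <= self.max_monthly_cost
                and latency_ms <= self.max_latency_ms)

reqs = Requirements(min_accuracy=0.90, max_monthly_cost=100.0, max_latency_ms=2000)
print(reqs.accepts(accuracy=0.93, monthly_cost=71.25, latency_ms=800))  # True
```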

Phase 2: Establish baselines

Test in this exact order:

  1. Smallest model, zero context -- How bad is it? This is your floor.
  2. Smallest model + context layers -- Add RAG, few-shot examples, chain-of-thought prompting. Measure each one individually.
  3. Larger model, zero context -- How much does raw model size buy you?

This order matters. If step 2 already meets your threshold from Phase 1, you're done. You don't need the bigger model.
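The test order can be encoded as an early-exit ladder. Here `evaluate` is a stand-in for whatever benchmark you already run, and the toy scores echo the article's finding rather than real measurements:

```python
def pick_setup(evaluate, threshold):
    """Walk the baseline ladder in order, stopping at the first
    configuration that clears the threshold. evaluate(setup) is
    assumed to return a benchmark score."""
    ladder = [
        "small + none",           # step 1: the floor
        "small + rag",            # step 2: context layers, one at a time
        "small + rag + few-shot",
        "large + none",           # step 3: what raw size alone buys
    ]
    for setup in ladder:
        if evaluate(setup) >= threshold:
            return setup
    return None

scores = {"small + none": 4.0, "small + rag": 11.8,
          "small + rag + few-shot": 12.1, "large + none": 5.3}
print(pick_setup(scores.get, 10.0))  # small + rag
```

If the ladder is exhausted without a hit, `None` signals it's time to reconsider the requirements from Phase 1.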

Phase 3: The decision algorithm

```mermaid
flowchart TD
    A[Start: Define performance threshold] --> B[Test smallest model + context]
    B --> C{Meets threshold?}
    C -->|Yes| D[Use small model + context]
    C -->|No| E[Test larger model + same context]
    E --> F{Meets threshold?}
    F -->|Yes| G{Cost within budget?}
    G -->|Yes| H[Use larger model + context]
    G -->|No| I[Optimize context further]
    I --> B
    F -->|No| J[Reconsider requirements]
    D --> K[Deploy + monitor]
    H --> K
```

The key insight: always try context improvements before scaling up the model. Context is cheaper than compute. Every time.
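The same flowchart as straight-line code -- a sketch, with the scores and costs as stand-ins for your own measurements:

```python
def decide(small_score, large_score, large_cost, threshold, budget):
    """Mirror of the flowchart: small model first, escalate only when
    it misses the threshold AND the larger model fits the budget."""
    if small_score >= threshold:
        return "small model + context"
    if large_score >= threshold:
        if large_cost <= budget:
            return "larger model + context"
        return "optimize context further, then retest small"
    return "reconsider requirements"

print(decide(11.8, 12.5, 405.0, threshold=10.0, budget=100.0))
# small model + context
```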

Phase 4: Don't set it and forget it

Run a monthly check:

  1. Actual performance vs. expected -- Has quality drifted?
  2. Cost trend -- API prices change. New models launch.
  3. A/B testing -- Periodically test newer small models. They keep getting better.

For migration, go gradual. Route 10% of traffic to the new setup in week one. 30% in week two. 70% in week three. 100% only after you've confirmed quality holds at scale. At any stage, roll back if metrics drop.
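That ramp is easy to express as deterministic traffic splitting -- hashing on a stable request or user ID so the same user always lands in the same bucket. The percentages match the schedule above; everything else is an illustrative sketch:

```python
import hashlib

ROLLOUT = {1: 10, 2: 30, 3: 70, 4: 100}  # week -> % of traffic on new setup

def bucket(request_id: str, week: int) -> str:
    """Stable hash-based split: a given request_id always maps to the
    same bucket for a given percentage, so rollback is clean."""
    pct = ROLLOUT.get(week, 100)
    h = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "new" if h < pct else "old"

# Sanity check: by week 4, everything routes to the new setup.
assert all(bucket(f"user-{i}", 4) == "new" for i in range(100))
```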

This is the boring part. But boring is what keeps production systems alive. The teams that skip Phase 4 are the ones who wake up three months later wondering why their AI feature's quality tanked and nobody noticed.

The 1M token complication

Everything I've said so far assumes traditional context windows of 4K-32K tokens. But we're now in the era of 1M+ token context windows (Gemini, Claude). This changes the calculus.

With a million tokens, you can dump an entire codebase, three books, or months of conversation history into a single prompt. The "selection" problem that RAG solves -- which documents to retrieve -- becomes less important when you can just include everything.

But "less important" doesn't mean "gone." Three problems show up at scale:

Lost in the Middle -- Models pay less attention to information in the middle of very long contexts. Critical information should go at the beginning or end of your prompt.

Attention Dilution -- More information isn't always better. When everything is included, the model struggles to figure out what's actually relevant. You still need to signal importance explicitly.

Processing Overhead -- Long context costs scale faster than linearly. Stuffing in everything because you can is wasteful if 90% of it is irrelevant.
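One cheap mitigation for the first two problems is purely structural: put the task and the must-see facts at the edges of the prompt, and label the bulk explicitly as reference material. A sketch -- the section names are my own convention, not any API's:

```python
def build_prompt(task, critical_facts, background_docs):
    """Place must-see information at the start and end, where long-context
    models attend most reliably; label the middle as reference-only so
    relevance is signaled explicitly."""
    return "\n\n".join([
        f"TASK: {task}",
        "KEY FACTS (authoritative, prefer over background):\n"
        + "\n".join(f"- {f}" for f in critical_facts),
        "BACKGROUND (reference only):\n" + "\n".join(background_docs),
        f"REMINDER -- answer the task above: {task}",
    ])

p = build_prompt("When does the contract renew?",
                 ["Renewal date: 2026-03-01"],
                 ["...thousands of tokens of contract text..."])
print(p.splitlines()[0])  # TASK: When does the contract renew?
```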

The takeaway: even in the 1M token era, structured context beats dumped context. The tools change, but the principle holds -- context quality matters more than context quantity.

If anything, long context makes context design more important, not less. When your window was 4K tokens, you had to be selective but the model could attend to everything. With 1M tokens, you have room for everything, but the model can't attend to it all equally. Structure becomes the bottleneck.

The real shift

Here's what this experiment changed in how I think about AI systems.

The old mental model:

Performance = Model Size

Want better results? Use a bigger model. Pay more.

The new mental model:

Performance = Model × Context Design

Want better results? Design better context. Then pick the smallest model that meets your threshold.

This isn't just a cost optimization trick. It's a democratization argument.

Under the "bigger is better" model, only companies with massive API budgets could build high-quality AI products. Under "context-first design," a two-person team with strong retrieval infrastructure and well-structured prompts can match -- or beat -- what a well-funded competitor gets from a flagship model.

The playing field isn't level. But it's a lot more level than it was when the only variable was model size.

I've seen this firsthand. A team of two, with a well-tuned RAG pipeline and carefully structured prompts, consistently outperformed a team of twenty that threw everything at GPT-4 and called it a day. The two-person team wasn't smarter. They just spent their time on context design instead of model shopping.

Try it yourself

If you're running Sonnet (or GPT-4, or any large model) in production, here's a weekend experiment:

  1. Take your 20 most common query types
  2. Run them through the smallest model from the same provider, zero context
  3. Add a basic RAG pipeline (even a simple vector store works)
  4. Compare scores
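Step 4 can be as simple as a score table. In this sketch, `run(setup, query)` is a placeholder for your actual API calls plus whatever grading you already use (exact-match, LLM-as-judge), and the toy grader below just stands in for real evaluations:

```python
def compare(queries, run, setups=("large + none", "small + none", "small + rag")):
    """Run each query through each setup and report mean scores,
    assuming run(setup, query) returns a 0-1 quality score."""
    totals = {s: 0.0 for s in setups}
    for q in queries:
        for s in setups:
            totals[s] += run(s, q)
    return {s: round(t / len(queries), 3) for s, t in totals.items()}

fake = lambda setup, q: {"large + none": 0.5, "small + none": 0.3,
                         "small + rag": 0.8}[setup]
print(compare(["q1", "q2"], fake))
# {'large + none': 0.5, 'small + none': 0.3, 'small + rag': 0.8}
```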

You might be surprised. I was.

I won't claim this works for every use case. If you're doing open-ended reasoning, long-horizon planning, or creative work that requires genuine "thinking," bigger models still earn their premium. But for the majority of production AI workloads -- classification, extraction, Q&A, summarization, code generation from specs -- the small model + good context approach is worth testing before you commit to a $405/month API bill.

The book this article is based on goes deeper into the full Context Engineering stack -- RAG, few-shot learning, memory systems, MCP, and more: Turning LLMs from Liars into Experts: Context Engineering in Practice.

What model are you using -- and have you tested a smaller one with better context?
