I tested cheap vs expensive LLMs across 3 real agent tasks. The cheap model won every time.

#ai #opensource #llm #python

Everyone tells you to use the best model you can afford. I built a toolkit to test that assumption with data, and the results surprised me.
Over the past few weeks, I built three different LLM-based agents and ran structured evaluations against each one — comparing models on accuracy, cost, and latency using the same golden datasets.

THE SETUP
I built a CLI tool that wraps any agent function, runs it against a labeled dataset, and produces a comparison table. The agent doesn't need to change — you write a thin wrapper that adapts its interface, point the tool at your test cases, and get results.
I tested three agent types:

GitHub issue classifier — Takes issue text from Streamlit's open-source repo and classifies it as bug, feature_request, question, or incomplete. 30 hand-labeled issues.
Sentiment analyzer — Binary positive/negative classification on movie review sentences from SST-2. 40 cases including 10 deliberately ambiguous ones (sarcasm, double negatives, mixed sentiment).
RAG question-answering — Answers questions about the US Constitution using retrieval-augmented generation. 30 questions across easy, medium, and hard difficulty. Tested both with and without RAG retrieval.

Every test used the same workflow: build a golden dataset with manually verified labels, run each model against it, compare.
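As a rough sketch of that workflow (not the toolkit's actual code), the loop below runs each agent function over the same labeled cases and reports accuracy and average latency. The function names, the dataset fields `input` and `expected`, and the two stand-in agents are all assumptions made for illustration; a real run would wrap an LLM call instead.

```python
import time
from typing import Callable


def evaluate(agent: Callable[[str], str], dataset: list[dict]) -> dict:
    """Run one agent over a labeled dataset; summarize accuracy and latency."""
    correct = 0
    failures = []
    total_seconds = 0.0
    for case in dataset:
        start = time.perf_counter()
        predicted = agent(case["input"])
        total_seconds += time.perf_counter() - start
        # Case-insensitive exact match on the label.
        if predicted.strip().lower() == case["expected"].strip().lower():
            correct += 1
        else:
            failures.append({**case, "got": predicted})
    n = len(dataset)
    return {
        "accuracy": correct / n,
        "avg_latency_s": total_seconds / n,
        "failures": failures,
    }


# Hypothetical stand-in agents; a real wrapper would call a model here.
def keyword_agent(text: str) -> str:
    lowered = text.lower()
    return "bug" if "crash" in lowered or "error" in lowered else "feature_request"


def always_bug_agent(text: str) -> str:
    return "bug"


# Made-up examples, not taken from the Streamlit dataset.
golden = [
    {"input": "App crashes when I upload a CSV", "expected": "bug"},
    {"input": "Please add dark mode", "expected": "feature_request"},
    {"input": "Error on startup after upgrade", "expected": "bug"},
    {"input": "Support exporting charts as SVG", "expected": "feature_request"},
]

agents = {"keyword": keyword_agent, "always_bug": always_bug_agent}
results = {name: evaluate(fn, golden) for name, fn in agents.items()}

for name, r in results.items():
    print(f"{name:12s} accuracy={r['accuracy']:.1%} failures={len(r['failures'])}")
```

Because every agent sees the same cases and the same scoring rule, the differences in the table come from the models and not from the test setup.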

RESULT 1: HAIKU OUTPERFORMED SONNET ON CLASSIFICATION
GitHub issue triage, 30 Streamlit issues:
Haiku — 30/30 (100%), avg latency 1.7s, 0 errors
Sonnet — 29/30 (96.7%), avg latency 3.3s, 1 error
Haiku, the model that costs roughly one-third as much, was both more accurate and faster. It labeled all 30 issues correctly, while Sonnet misclassified one. The pattern also held in a separate earlier test on 30 LangChain issues, where Haiku scored 87% vs Sonnet's 80% on the hard ambiguous subset.

RESULT 2: A FREE LOCAL MODEL MATCHED PAID APIs ON SENTIMENT
Sentiment analysis, 40 SST-2 sentences:
DistilBERT (local) — 40/40 (100%), cost: free, avg latency 0.1s
Haiku — 40/40 (100%), cost: $0.006, avg latency 0.9s
Sonnet — 40/40 (100%), cost: $0.019, avg latency 1.5s
A 250MB model running locally on my laptop matched both Claude models on every case — including the ambiguous ones with sarcasm and double negatives. It was 9x faster than Haiku and 15x faster than Sonnet. For this task, paying for an API gives you nothing.
The two Claude models did disagree with DistilBERT on 2 edge cases (one each), but all three scored 100% because the golden dataset correctly marked those cases as ambiguous with multiple acceptable answers.
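One way to represent "multiple acceptable answers" in a golden dataset is to let the expected field hold either a single label or a list of labels. The sketch below assumes that layout; the field names and the example sentences are invented for illustration and are not taken from SST-2 or from the toolkit.

```python
def is_correct(predicted: str, expected) -> bool:
    """Accept a prediction if it matches any acceptable label."""
    acceptable = [expected] if isinstance(expected, str) else list(expected)
    normalized = {label.strip().lower() for label in acceptable}
    return predicted.strip().lower() in normalized


def accuracy(predictions: list[str], cases: list[dict]) -> float:
    hits = sum(is_correct(p, c["expected"]) for p, c in zip(predictions, cases))
    return hits / len(cases)


# Made-up cases: the second is marked ambiguous, so both labels count.
cases = [
    {"text": "A warm, funny film.", "expected": "positive"},
    {"text": "Not exactly the disaster I was promised.",
     "expected": ["positive", "negative"]},
]

model_a = ["positive", "positive"]
model_b = ["positive", "negative"]  # disagrees with model_a on the ambiguous case

print(accuracy(model_a, cases), accuracy(model_b, cases))
```

Two models can disagree on an ambiguous case and still both score 100%, which is the situation described above.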

RESULT 3: RAG HURT ACCURACY ON WELL-KNOWN CONTENT
US Constitution Q&A, 30 questions, 4 model variants:
RAG + Haiku — 80.0%, avg latency 1.7s
Direct Haiku (no RAG) — 96.7%, avg latency 1.3s
RAG + Sonnet — 80.0%, avg latency 2.2s
Direct Sonnet (no RAG) — 93.3%, avg latency 1.9s
This was the most surprising finding. Adding retrieval made both models worse. RAG+Haiku scored 80% while Direct Haiku scored 96.7%.
The reason: when the retriever failed to pull the right document chunks, the model correctly said "I can't answer from this context" instead of using what it already knows. All 5 RAG failures were retrieval failures — the model refused to hallucinate, which is good behavior, but it means RAG constrained the model's knowledge rather than augmenting it.
On well-known content that's already in the model's training data, RAG is a net negative. It adds latency, adds cost (more input tokens from retrieved context), and reduces accuracy by hiding information the model already has. RAG helps when the model doesn't know the information — proprietary documents, recent data, internal knowledge bases. For public, well-established content, it hurts.
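To tell a refusal apart from a plainly wrong answer when reviewing RAG failures, a simple heuristic is to look for refusal phrases in the output. The sketch below is one possible approach, not how the toolkit does it: the marker phrases, the category names, and the substring-based correctness check are all assumptions, and the sample answers are made up.

```python
from collections import Counter

# Hypothetical phrases that signal "I can't answer from this context".
REFUSAL_MARKERS = (
    "can't answer",
    "cannot answer",
    "not in the context",
    "does not contain",
)


def categorize(answer: str, expected: str) -> str:
    """Label an answer as correct, a refusal, or a wrong answer."""
    # Normalize curly apostrophes so "can’t" and "can't" match alike.
    text = answer.lower().replace("\u2019", "'")
    if expected.lower() in text:
        return "correct"
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "refusal"
    return "wrong_answer"


# Made-up outputs for a question about the length of a Senate term.
outputs = [
    "Senators serve six-year terms.",
    "I can't answer from this context.",
    "Senators serve four-year terms.",
]

tally = Counter(categorize(answer, "six") for answer in outputs)
print(dict(tally))
```

A high refusal count with few wrong answers points at the retriever, while wrong answers point at the model or the prompt.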
Also: Haiku beat Sonnet again. Direct Haiku (96.7%) outperformed Direct Sonnet (93.3%) at one-third the cost.

WHAT I LEARNED
"More expensive = more accurate" is wrong for specific tasks. General benchmarks show Sonnet ahead of Haiku, but on every specific task I tested, Haiku matched or beat Sonnet. The model's general capability doesn't predict its performance on your particular agent. You have to measure.
Model selection should be data-driven, not assumption-driven. Most teams default to the most expensive model they can afford, assuming it'll be the most accurate. Across four datasets and three task types, the cheapest option was consistently the best. That's not a universal truth — there will be tasks where Sonnet or Opus genuinely outperform Haiku. The point is you can't know without testing.
RAG isn't always the answer. The default assumption in AI engineering is "add RAG for better accuracy." On well-known domains, it backfires. The architecture decision of whether to use RAG should be tested, not assumed.
The hardest part of evaluation is defining "correct." Three of four "failures" in my early testing turned out to be labeling problems — ambiguous cases, categories that don't cover reality, or penalizing the model for not accessing information behind a link. The eval tooling worked fine mechanically. Garbage labels in, garbage results out.

THE TOOL
I open-sourced the evaluation toolkit: github.com/aimvik07/agent-eval
Install: pip install agt-eval
Three commands:
agent-eval probe config.py — Find where your agent fails. Shows accuracy, failure list, category distribution.
agent-eval compare config.py — Compare models side by side. Accuracy, cost, latency, head-to-head disagreements.
agent-eval gate config.py — Catch regressions. Compares against a stored baseline, exits 1 if accuracy dropped.
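The idea behind a regression gate can be sketched in a few lines. This is an illustration of the general pattern, not the toolkit's implementation: the `gate` function, the baseline file layout (a JSON object with an `accuracy` key), and the `tolerance` parameter are all assumptions.

```python
import json
import tempfile
from pathlib import Path


def gate(current_accuracy: float, baseline_path: Path, tolerance: float = 0.0) -> int:
    """Return a process exit code: 1 if accuracy regressed, else 0."""
    baseline = json.loads(baseline_path.read_text())["accuracy"]
    if current_accuracy + tolerance < baseline:
        print(f"accuracy dropped: {current_accuracy:.1%} < baseline {baseline:.1%}")
        return 1
    return 0


# Write a hypothetical stored baseline to a temporary file for the demo.
baseline_file = Path(tempfile.mkdtemp()) / "baseline.json"
baseline_file.write_text(json.dumps({"accuracy": 0.967}))

regressed = gate(0.900, baseline_file)  # below baseline
held = gate(1.000, baseline_file)       # at or above baseline
print(regressed, held)
```

In a CI script, passing the return value to `sys.exit()` would make the job fail whenever accuracy falls below the stored baseline.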
You write a Python config file that wraps your agent function and point it at a JSON golden dataset. The toolkit handles the rest. It supports exact match, substring match, and custom comparison functions for different task types.
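To make the three matching styles concrete, here is a minimal sketch of what such comparison functions might look like. The function names and signatures are assumptions for illustration and may not match the toolkit's own API.

```python
import re


def exact_match(predicted: str, expected: str) -> bool:
    """Suits classification labels such as 'bug' or 'positive'."""
    return predicted.strip().lower() == expected.strip().lower()


def substring_match(predicted: str, expected: str) -> bool:
    """Suits free-text answers that should mention a key phrase."""
    return expected.strip().lower() in predicted.lower()


def first_number_match(predicted: str, expected: str) -> bool:
    """Example custom comparator: compare the first integer in the answer."""
    found = re.search(r"\d+", predicted)
    return found is not None and int(found.group()) == int(expected)


print(exact_match(" Bug ", "bug"))
print(substring_match("That power is granted in Article II.", "article ii"))
print(first_number_match("There are 27 amendments.", "27"))
# Substring matching accepts "127" for "27"; the custom comparator does not.
print(substring_match("There are 127 amendments.", "27"))
print(first_number_match("There are 127 amendments.", "27"))
```

The last two lines show why the choice matters: a loose comparator can mark a wrong answer as correct, which feeds straight into the labeling problems described earlier.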
It's a personal tool I built for my own workflow, not a startup or a product. If it's useful to you, use it. If you find something broken or missing, open an issue.
