Suleyman

Originally published at Medium

Why You Underestimate Haiku

Most people pick a model the wrong way around. They look at the leaderboard, see Opus on top, and reach for it by default. Sonnet if they want to save money. Haiku almost never, because the name says "small."

That habit costs you. For a lot of what you actually build, Haiku is the right call, and you're paying three to five times more for capability the task never uses. This post is about how to choose, and why Haiku should be your default more often than it is.

The short version: don't start from "what's the best model." Start from "what does this task need." Most tasks don't need much.


Comparison

Here is the current lineup, with the numbers that matter when you're choosing.

| | Haiku 4.5 | Sonnet 4.6 | Opus 4.8 |
|---|---|---|---|
| Model ID | claude-haiku-4-5 | claude-sonnet-4-6 | claude-opus-4-8 |
| Input price (per 1M tokens) | $1 | $3 | $5 |
| Output price (per 1M tokens) | $5 | $15 | $25 |
| Context window | 200K | 1M | 1M |
| Max output | 64K | 64K | 128K |
| Best at | speed, volume | balance | hardest reasoning |

Two things jump out.

First, price. Haiku input is a fifth of Opus and a third of Sonnet. Output is the same ratio. If you send a million output tokens through Opus for $25 and the same work would have been fine on Haiku at $5, you spent $20 for nothing. And that gap recurs on every request, so it compounds. A feature that runs ten thousand times a day on Opus instead of Haiku is not a rounding error. It is the difference between a feature that ships and one that gets cut for cost.

Second, the context window. This is where Haiku gives something up: 200K tokens instead of 1M. That is the real tradeoff, and it points straight at when to use it. We'll come back to that.
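
To make the price gap concrete, here is a minimal sketch of the arithmetic. The prices are copied from the table above; the function names and the example workload (2,000 input and 300 output tokens per call, ten thousand calls a day) are made up for illustration.

```python
# Prices in USD per 1M tokens, copied from the comparison table.
PRICES = {
    "claude-haiku-4-5": {"input": 1.0, "output": 5.0},
    "claude-sonnet-4-6": {"input": 3.0, "output": 15.0},
    "claude-opus-4-8": {"input": 5.0, "output": 25.0},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-million-token prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def daily_cost(model: str, input_tokens: int, output_tokens: int,
               calls_per_day: int) -> float:
    """Cost in USD of running the same request calls_per_day times."""
    return request_cost(model, input_tokens, output_tokens) * calls_per_day


if __name__ == "__main__":
    # Hypothetical workload: 2,000 input + 300 output tokens, 10,000 calls a day.
    for model in PRICES:
        print(model, round(daily_cost(model, 2_000, 300, 10_000), 2))
```

Under those assumed numbers the same feature costs roughly $35 a day on Haiku, $105 on Sonnet, and $175 on Opus. Plug in your own token counts before drawing conclusions.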


The mental model

Stop ranking models. Rank tasks. Ask three questions about the task in front of you:

  1. Does it need real reasoning, or is it bounded? A task is bounded when a competent junior could do it from a clear spec without much judgment: pull these fields out, sort this into one of five buckets, rewrite this in a different tone, answer this from the text I gave you. A task needs reasoning when the path isn't obvious: debug this across files, plan this migration, weigh these tradeoffs.

  2. What does a wrong answer cost? If a bad output is caught by a test, a schema check, or a human two seconds later, errors are cheap and you should go for speed and price. If a bad output ships money or breaks production, errors are expensive and you pay up for the better model.

  3. How often does it run, and does latency show? A nightly job that runs once doesn't care about speed or per-call cost. A loop that fires on every keystroke, or a batch of a hundred thousand items, cares about both, a lot.

Now map the answers:

  • Bounded, cheap to get wrong, high volume or latency-sensitive → Haiku. This is most of what you build.
  • Some judgment, longer output, moderate stakes → Sonnet.
  • Hard reasoning, long multi-step work, expensive to get wrong → Opus.

The reason you underestimate Haiku is that you picked the model top-down, from the leaderboard, where the test is always something hard. But almost nothing you ship in production is leaderboard-hard. It's extraction, routing, classification, summaries, and small edits, run over and over. That's exactly the work Haiku is built for.
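
The three questions and the mapping can be written down as a small function. This is one possible encoding, not a rule from any SDK: the labels ("bounded", "some", "hard", and so on) and the tiebreak for the middle ground are my own assumptions.

```python
HAIKU = "claude-haiku-4-5"
SONNET = "claude-sonnet-4-6"
OPUS = "claude-opus-4-8"


def pick_model(reasoning: str, error_cost: str, high_volume: bool) -> str:
    """Map the three questions to a starting model.

    reasoning:   "bounded", "some", or "hard"        (question 1)
    error_cost:  "cheap", "moderate", or "expensive" (question 2)
    high_volume: runs constantly or latency shows    (question 3)
    """
    if reasoning == "hard" or error_cost == "expensive":
        return OPUS
    if reasoning == "bounded" and error_cost == "cheap":
        return HAIKU
    # Middle ground: some judgment or moderate stakes.
    # Assumption: when errors are cheap and volume is high, start on Haiku.
    if high_volume and error_cost == "cheap":
        return HAIKU
    return SONNET


if __name__ == "__main__":
    print(pick_model("bounded", "cheap", True))    # ticket classifier
    print(pick_model("some", "moderate", False))   # drafting a longer reply
    print(pick_model("hard", "expensive", False))  # planning a migration
```

Treat the result as a starting point for an experiment, not a verdict.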


What Haiku is actually good at

These are the jobs where Haiku is not a compromise. It's the correct tool.

  • Classification and routing. "Is this ticket a bug, a feature request, or spam?" "Which of these eight queues does this go to?" Bounded, checkable, often high volume.
  • Extraction. Pull the name, email, and plan out of this message. Pair it with structured outputs (Haiku supports them) so the result is a validated object, not a string you have to parse and pray over.
  • Summarizing and rewriting. Tighten this paragraph. Turn these notes into a changelog line. Translate this. The input is right there; there's nothing to reason about.
  • First-pass filtering. Run Haiku over a thousand records to find the fifty worth a closer look, then send only those fifty to a bigger model. You just cut your Opus volume by 95% and barely touched quality; the Haiku pass over all thousand adds back only a fraction of what you saved.
  • The inner steps of an agent. More on this next, because it's the pattern that changes the most.

What ties these together: the answer is in the input or in a short list of options, the output is short, and you can check it cheaply. That's the Haiku zone.
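
The first-pass filter is easy to sketch without committing to any API. Below, `cheap_filter` and `deep_review` are hypothetical callables standing in for a Haiku call and a bigger-model call. `relative_cost` estimates the net bill, assuming both stages see similar token counts per record and that the cheap model costs a fifth as much per token.

```python
from typing import Callable, Iterable


def two_stage(records: Iterable[str],
              cheap_filter: Callable[[str], bool],
              deep_review: Callable[[str], str]) -> list[tuple[str, str]]:
    """Run the cheap filter on every record, the expensive review only on hits."""
    shortlisted = [r for r in records if cheap_filter(r)]
    return [(r, deep_review(r)) for r in shortlisted]


def relative_cost(total: int, escalated: int, cheap_ratio: float = 0.2) -> float:
    """Pipeline cost as a fraction of sending every record to the big model.

    Assumes similar token counts per record in both stages.
    cheap_ratio=0.2 reflects Haiku costing a fifth of Opus per token.
    """
    return (total * cheap_ratio + escalated) / total


if __name__ == "__main__":
    # 1,000 records through the cheap model, 50 escalated to the big one.
    print(round(relative_cost(1_000, 50), 2))
```

With a thousand records and fifty escalations, the whole pipeline comes to about a quarter of the all-Opus cost under those assumptions: the Opus share drops by 95%, and the Haiku pass adds back about 20%.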


The pattern that matters most: mixed models

The biggest mistake is treating model choice as one decision for the whole app. It's a decision per step.

A real agent doesn't do one thing. It reads files, searches, plans, edits, checks. Those steps are not equally hard. The planning step might need Opus. The "go read these twelve files and tell me which ones mention auth" step does not. That's a Haiku job, and there are usually a lot of them.

So run the main loop on a strong model and hand the cheap, parallel sub-tasks to Haiku. This is exactly how Claude Code works: its Explore subagents run on Haiku while the main agent stays on a bigger model. The expensive model does the thinking. The cheap fast model does the legwork, often several at once.

There's a second reason to do it with subagents rather than swapping the model mid-conversation: switching models invalidates your prompt cache. Caches are tied to one model. If you flip the main loop from Opus to Haiku and back, you throw away the cached prefix every time and pay full price to rebuild it. Spawning a Haiku subagent for the sub-task keeps the main loop's cache intact. You get the cheap model and the warm cache.

In rough terms, the shape is:

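```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
planning_prompt = "Plan the auth refactor."  # placeholder task
files = ["..."]  # placeholder: the file contents to scan

# Main loop: the model that does the hard part
plan = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=4096,  # placeholder limit
    messages=[{"role": "user", "content": planning_prompt}],
)

# Fan-out: the bounded sub-tasks, on Haiku. This comprehension issues the calls
# one after another; use a thread pool or the async client to run them in parallel.
results = [
    client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Does this file touch auth? {f}"}],
    )
    for f in files
]
```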

Most of your token volume lives in those sub-tasks. Move them to Haiku and your bill changes more than any single model upgrade ever will.
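
To get the "several at once" part, the sub-tasks have to actually run concurrently. Here is a minimal sketch using a thread pool from the standard library. `ask` is a hypothetical callable that sends one prompt to Haiku and returns the reply text; the worker count is an assumption you should tune against your rate limit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fan_out(files: dict[str, str],
            ask: Callable[[str], str],
            max_workers: int = 8) -> dict[str, str]:
    """Ask the same bounded question about every file, several at a time.

    files maps a file name to its contents. Returns file name -> answer.
    """
    prompts = [
        f"Does this file touch auth? Answer yes or no.\n\n{body}"
        for body in files.values()
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        answers = list(pool.map(ask, prompts))  # map preserves input order
    return dict(zip(files.keys(), answers))


if __name__ == "__main__":
    # Stub in place of a real Haiku call, so the sketch runs offline.
    def fake_ask(prompt: str) -> str:
        return "yes" if "import auth" in prompt else "no"

    print(fan_out({"a.py": "import auth", "b.py": "print(1)"}, fake_ask))
```

In a real agent, `ask` would wrap a `client.messages.create(model="claude-haiku-4-5", ...)` call and return the text of the response. Because each sub-task is its own request, the main loop's conversation is untouched.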


Haiku plus Batch, for the bulk stuff

If the work isn't time-sensitive — overnight classification, backfilling labels, processing a big export — send it through the Batch API. That's another 50% off on top of Haiku's already-low price. Haiku output drops from $5 to $2.50 per million tokens. For bulk, nothing else comes close, and the quality is fine because bulk work is almost always bounded work.
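
A batch is a list of ordinary message requests, each tagged with an ID so you can match results later. The sketch below only builds that list. As far as I understand the Anthropic Python SDK, you would then submit it with `client.messages.batches.create(requests=...)`, but check the current Batch API docs for the exact shape. The ticket data and the prompt wording are made up.

```python
def build_batch_requests(tickets: dict[str, str]) -> list[dict]:
    """Build one batch entry per ticket, keyed by custom_id for matching results."""
    return [
        {
            "custom_id": ticket_id,
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 64,
                "messages": [
                    {
                        "role": "user",
                        "content": f"Classify this ticket as bug, feature, or spam: {text}",
                    }
                ],
            },
        }
        for ticket_id, text in tickets.items()
    ]


if __name__ == "__main__":
    requests = build_batch_requests({"t1": "App crashes on login", "t2": "Buy now!!!"})
    print(len(requests), requests[0]["custom_id"])
```

Results come back asynchronously, so this fits jobs where nobody is waiting on the answer.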


When Haiku is the wrong choice

The mental model cuts both ways. Reaching for Haiku on the wrong task is its own mistake. Send it up the ladder when:

  • The task needs deep, multi-step reasoning. Haiku answers fast and direct. It doesn't even take the effort parameter, the setting that tells a model how hard to think, which only Sonnet 4.6 and the Opus tier support. That's the point: Haiku is built for fast answers, not slow thinking. Send hard debugging, planning, and deep research to Opus.
  • The context is huge. 200K is a lot, but Sonnet and Opus give you 1M. If you're feeding in a whole codebase or a pile of long documents at once, you need the bigger window.
  • A wrong answer is expensive. Anything that moves money, ships to users without review, or is hard to undo. Pay for the better model; the error you avoid is worth more than the tokens you save.
  • The output is long and structured. Long coding runs and big generated documents. That's where Opus's 128K output, and its knack for staying on track over long tasks, earn their price.

If you're unsure which tier a task needs, the cheap experiment is to run it on Haiku first and look at the failures. If it's already good enough, you're done. If it fails in a clear, consistent way, you've learned exactly what capability the task needs before you pay for it.
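
"Look at the failures" works best with a small labeled sample and a loop that records every miss. A rough sketch, assuming a classification task: `predict` is a hypothetical callable that wraps your Haiku call and returns a label, and the helper names are mine.

```python
from collections import Counter
from typing import Callable


def collect_failures(samples: list[tuple[str, str]],
                     predict: Callable[[str], str]):
    """Return (accuracy, failures) for a list of (text, expected_label) pairs."""
    failures = []
    for text, expected in samples:
        got = predict(text).strip().lower()
        if got != expected:
            failures.append((text, expected, got))
    accuracy = 1 - len(failures) / len(samples)
    return accuracy, failures


def failure_pattern(failures) -> Counter:
    """Count (expected, got) pairs to show whether the errors are consistent."""
    return Counter((expected, got) for _, expected, got in failures)


if __name__ == "__main__":
    # Stub predictor so the sketch runs offline; swap in a real model call.
    def fake_predict(text: str) -> str:
        return "feature" if "slow" in text else "bug"

    samples = [("crash on login", "bug"), ("slow page", "bug")]
    accuracy, failures = collect_failures(samples, fake_predict)
    print(accuracy, failure_pattern(failures))
```

If one (expected, got) pair dominates the counter, the failure is consistent, and that tells you whether a prompt fix is enough or the task belongs on a bigger model.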


How to try it

It's a one-line change. The API surface is the same across all three models, so swapping the model string is usually all it takes:

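```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Classify this ticket: ..."}],
)
print(response.content[0].text)  # the reply text
```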

Two things to know going in. Haiku has its own rate-limit pool, separate from the bigger models, so test your throughput at the volume you actually expect. And it doesn't take the effort parameter, so strip it from the request if it's there, or the call will error.

Pick your highest-volume, most boring API call — the classifier, the extractor, the summarizer you run thousands of times a day. Move it to Haiku, watch the failures for a day, and check your bill at the end of the week. That one change usually pays for the experiment many times over.


Is Haiku 4.5 actually good, or just cheap?
Both. It's a current-generation model, not a stripped-down one. On bounded, well-specified tasks the gap to the bigger models is small and often invisible once you add a schema check or a test. The gap shows up on hard reasoning, which is the work you shouldn't be sending to Haiku anyway.

What's the model ID?
claude-haiku-4-5, or the pinned snapshot claude-haiku-4-5-20251001.

How much cheaper is it, really?
$1 in / $5 out per million tokens, versus $3 / $15 for Sonnet and $5 / $25 for Opus. A fifth of Opus, a third of Sonnet. Halve it again with the Batch API.

What does Haiku give up?
A smaller context window (200K vs 1M), no effort parameter, and less depth on hard multi-step reasoning. Those three limits tell you when to reach past it.

Does it support structured outputs?
Yes. Haiku 4.5, Sonnet 4.6, and Opus 4.8 all do. Use them for extraction and classification so you get a validated object back instead of a string to parse.

So when do I still use Opus?
The hardest reasoning, the longest multi-step jobs, and anything where a wrong answer is expensive. Use it for the step that needs it, not for the whole app.
