In the previous post, I showed you two models answering the same question. One hallucinated confidently. The other knew when to stop.
And a bunch of you asked: okay, but which one should I actually use?
Haiku, Sonnet, Opus. Micro, Lite, Pro. Mini, Small, Large. There are dozens of models and they all sound like perfume brands. How are you supposed to pick one?
That's this post. I'm going to take one prompt, run it through two models (one small, one large), and show you what's different. Then I'll give you a simple framework for choosing the right one.
The demo: same prompt, two models
I went back to the recipe from the first post. Same recipe. Same question. Two different model sizes.
The prompt:
"I'm cooking this for six people on Saturday. One is vegan, one is gluten-free. Adapt the recipe for me, give me a shopping list, and a timeline starting from 4pm."
The small model
It answered fast. A shopping list, a timeline, and basic adaptations. Nothing fancy, but everything I asked for. If I just need a quick answer and I'm going to double-check it anyway, this works.
For a lot of everyday tasks, this is genuinely all you need.
The large model
Same prompt but a very different response.
It added a whole "Strategy for the vegan guest" section explaining why you should make a parallel pot instead of adapting the main dish. It gave me a timeline starting from the night before. Separated prep into phases. Told me to keep the rice pots separate so nothing touches the vegan side. Scaling math for going from 4 servings to 6. It even gave me an oven method AND a stovetop method as alternatives.
More thorough and more considerate. But did I need all of that for a Saturday dinner? Maybe, maybe not.
There are medium-sized models in between these two and they exist in every family. I'll tell you when to reach for them later.
But the contrast between small and large is where today's lesson lives.
Why models come in sizes
Let's take a simple example.
My son is two. If I ask him what he wants for dinner, he says "pasta." Done in two seconds without any deliberation.
If you ask me what to make for dinner, I'm thinking: what's in the fridge, what did we have yesterday, does he need more protein today, is it too late to start something that takes 40 minutes, should I batch-cook for tomorrow. Ten variables that will take me five minutes. I will give you a better answer, but slower.
Models work the same way.
A model's "size" is roughly how many parameters it has.
Think of parameters as the variables it can hold in its head when making a decision. More variables, more nuance, more ability to handle complex tasks. Fewer variables, faster and cheaper, but less sophisticated.
My son doesn't need ten variables to pick dinner. He just needs to decide. And for a lot of tasks, that's all you need from a model too. A fast answer. Not a perfect one.
Training a big model costs more. Running a big model costs more per question. And it's slower, because there are more variables to weigh for every single response.
So why not just always use the biggest one? Two reasons.
First, cost.
If you're building something that handles thousands of requests, the difference between a small model and a large model is the difference between a reasonable bill and a terrifying one.
Second, and this is the one people miss: bigger isn't always better.
For simple tasks, a big model can actually overthink it. Give you more than you asked for. Take longer to say something the small model said in two seconds.
The model families you see (Haiku/Sonnet/Opus, Micro/Lite/Pro) are just size tiers from the same provider. Same architecture, different capacity. Like buying a car in compact, sedan, or SUV. Same manufacturer. Different trade-offs. You don't take the SUV to grab milk. You don't take the compact on a cross-country road trip with three kids.
Tokens and pricing: how you actually pay
Models don't charge by the question. They charge by the token.
What's a token? It's a chunk of text. Not quite a word, not quite a letter, but roughly three-quarters of a word.
Take the sentence: "Adapt this recipe for a gluten-free vegan." Seven words but nine tokens. Some words get split, some punctuation becomes its own token.
You don't need to memorise this. Just know: token count and word count aren't the same thing. A full page of text is around 400 tokens. A million tokens is roughly a 750,000-word book.
There's a free tool called Tiktokenizer where you can paste text and see exactly how a model breaks it into tokens. It's weirdly satisfying. Try it.
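If you'd rather poke at it in code, here's a minimal sketch using tiktoken, OpenAI's open-source tokenizer. It's not the tokenizer Claude uses, so the exact counts will differ, but the splitting behaviour is the same idea.

```python
# pip install tiktoken -- OpenAI's open-source tokenizer, used here as
# an approximation; Claude and other models use their own tokenizers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Adapt this recipe for a gluten-free vegan."
tokens = enc.encode(text)

print(len(text.split()), "words")  # 7 words
print(len(tokens), "tokens")       # token count depends on the tokenizer

# See exactly where the words get split
for t in tokens:
    print(t, repr(enc.decode([t])))
```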
One thing that surprised me: different models tokenize the same text differently. I sent the exact same prompt and recipe to both models. The small one counted 6,548 input tokens. The large one counted 16,685. Same words, different tokenizers under the hood.
And here's the thing: you get charged twice. Once for the tokens you send in (your question). And once for the tokens the model sends back (its answer). Input tokens and output tokens. They're priced separately, and output is always more expensive, because that's where the model is doing the work.
Real numbers
On Amazon Bedrock, for the Claude family (rough pricing as of May 2025; check the current prices before relying on these numbers):
| Model | Size | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| Haiku | Small | ~$1 | ~$5 |
| Sonnet | Medium | ~$3 | ~$15 |
| Opus | Large | ~$5 | ~$25 |
That's a 5x jump from small to large. Same question, five times the price on every token in and every token out.
If you're asking one question yourself, who cares. The difference is fractions of a cent. But if you're building an app that handles ten thousand requests a day, each one generating a few hundred output tokens, that 5x multiplier turns into real money fast.
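To make that concrete, here's a back-of-the-envelope sketch using the rough prices from the table above. The traffic numbers are made up; swap in your own.

```python
# Back-of-the-envelope monthly cost at app scale.
# Prices are the rough Bedrock figures from the table above;
# the traffic numbers are illustrative.

requests_per_day = 10_000
input_tokens = 500    # per request
output_tokens = 300   # per request

prices = {                     # (input, output) in $ per 1M tokens
    "Haiku (small)": (1, 5),
    "Opus (large)": (5, 25),
}

for model, (p_in, p_out) in prices.items():
    per_request = input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out
    monthly = per_request * requests_per_day * 30
    print(f"{model}: ${per_request:.4f}/request, ${monthly:,.0f}/month")

# Haiku (small): $0.0020/request, $600/month
# Opus (large):  $0.0100/request, $3,000/month
```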
The best model is the model you can afford to run at the scale you need.
Where it breaks: when bigger is worse
Let's go back to the large model's response and look at the over-engineered parts. The timeline starting from the night before. "Marinate chicken in yogurt and spices, overnight is best." Fry the vegan portion first in clean oil, then fry the onions for the chicken in separate oil. Keep the rice pots separate. An oven method AND a stovetop method as alternatives.
The small model? A simple table. 4:00pm, start marinating. 4:05, fry onions. 5:15, into the oven. 7:00, serve. Done.
Opus is doing project management for my Saturday dinner. And here's the real cost of that overthinking.
The small model: 18 seconds, about 1,900 output tokens.
The large model: 44 seconds, 2,700 output tokens.
40% more output. 2.4x slower. And about 10x more expensive for that single request.
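If you want to check that 10x figure, here's the arithmetic, using the demo's token counts and the table prices:

```python
# Rough per-request cost of the demo, using the token counts above
# and the table prices ($ per 1M tokens).
small = 6_548 / 1e6 * 1 + 1_900 / 1e6 * 5    # input + output, ~$0.016
large = 16_685 / 1e6 * 5 + 2_700 / 1e6 * 25  # input + output, ~$0.151

print(f"small: ${small:.3f}, large: ${large:.3f}, ratio: {large / small:.1f}x")
# small: $0.016, large: $0.151, ratio: 9.4x
```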
For a Saturday dinner, this is overkill. And if I were building an app that answers recipe questions for thousands of users, I'd be paying for all that extra thinking on every single request.
This is the trade-off. Bigger models are smarter, but "smarter" isn't always what you need. Sometimes you need fast, cheap, and good enough.
How to actually choose
Here's how I think about it.
First, the biggest factor: cost. We just saw a 5x difference between small and large. And that's per token. When the big model also generates 40% more tokens per response, it compounds fast. That alone narrows the field for most people. If you're building something, cost is the thing that decides what's even on the table.
If you can't afford to run it at the scale you need, it doesn't matter how good it is.
Start there. What can you actually sustain?
Then, once cost has set your boundaries, three questions help you pick within them.
1. How complex is the task?
Summarising an email? Small model. Writing a legal brief? Big model. Adapting a recipe? Probably medium.
2. How many times will you run it?
If it's one question from you personally, use whatever you want. If it's an app serving thousands of users, speed matters just as much as quality. Start small, upgrade only when the quality isn't good enough.
3. What are the stakes?
If a wrong answer ruins dinner, that's low stakes. If a wrong answer means bad financial processing logic that costs you millions, that's high stakes. Higher stakes, bigger model, plus verification on top.
That's it. Cost sets the ceiling. Complexity, volume, and stakes help you pick the floor. You don't need to memorise model names. You need to know what you're optimising for.
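If it helps to see the framework as code, here's a hypothetical helper that encodes those three questions. The tier names and thresholds are mine, not anyone's official guidance; the point is the order of the questions, not the exact cut-offs.

```python
# A hypothetical helper encoding the framework above, once cost has
# already set your boundaries. Thresholds are illustrative.

def choose_model(complexity: str, daily_volume: int, stakes: str) -> str:
    """complexity and stakes: 'low', 'medium', or 'high'."""
    if stakes == "high":
        return "large"   # high stakes: big model, plus verification on top
    if daily_volume > 1_000 and complexity == "low":
        return "small"   # high volume, simple task: keep it cheap and fast
    if complexity == "high":
        return "large"
    return "medium"      # the sweet spot for most everyday tasks

print(choose_model("low", 10_000, "low"))   # small
print(choose_model("medium", 1, "low"))     # medium
print(choose_model("high", 100, "high"))    # large
```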
What about picking a provider?
I've been showing models from different providers. Claude, Nova, Llama. How do you pick a family?
Honestly? Pick the one that's available where you already work. If you're on AWS, you have access to all of them through Bedrock. If you're somewhere else, use what's there. The concepts are the same. Don't overthink the brand. Overthink the task.
One thing that confuses a lot of us early on: models and products are not the same thing. Claude is a model. But Claude inside Kiro (a coding IDE) behaves differently from Claude in the Bedrock Playground, which behaves differently from Claude on claude.ai. Same model underneath.
But each product wraps it with different instructions, tools, and context that shape how it responds. Kiro's Claude is tuned for writing code. The Playground's Claude is general-purpose. Same brain, different job description.
So when you see dozens of AI "products" out there, many of them are the same few models dressed up for different use cases. The model decides how smart it is. The product decides what it's pointed at, and priced accordingly.
Try it yourself
If you're just getting started: models come in sizes. Bigger is smarter but slower and more expensive. For most everyday tasks, a medium model is the sweet spot. Try a few and see which one feels right for what you're doing.
If you're more on the builder side: start with the smallest model that gives acceptable quality. Only upgrade when you can point to a specific failure the bigger model fixes. Don't start big and optimise down. Start small and justify up. And remember, you can use different models for different parts of the same system. The router doesn't need to be the same size as the reasoner. The model that decides which tool to call doesn't need to be the same one that processes the result.
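Here's what that mixing can look like on Amazon Bedrock, sketched with boto3's Converse API: a small model routes, and only the hard cases pay for the big one. The model IDs are examples; check the current ones available in your region.

```python
# A sketch of mixing model sizes in one system on Amazon Bedrock.
# Model IDs are examples -- check the current ones in your region.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

SMALL = "anthropic.claude-3-haiku-20240307-v1:0"
LARGE = "anthropic.claude-3-opus-20240229-v1:0"

def ask(model_id: str, prompt: str) -> str:
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

def answer(question: str) -> str:
    # The cheap model classifies; only complex questions reach the big one.
    verdict = ask(SMALL, f"Answer SIMPLE or COMPLEX only. Question: {question}")
    model = LARGE if "COMPLEX" in verdict.upper() else SMALL
    return ask(model, question)

print(answer("Adapt this recipe for six people, one vegan, one gluten-free."))
```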
Start small. Justify up.
What's next
Next time: why a model forgets what you told it. Ride along.
This post is part of the "Learning AI Out Loud" series, a cloud architect learning AI from first principles.