We open our IDE and let a model running somewhere in the cloud read our entire codebase to add a null check - and track our behaviour along the way. We open Google Docs and ask Gemini to fix a typo. We fire up GPT-class models to refine a Slack message, restructure a comment, generate a thumbnail. We're going to shove AI into every single hole that has data for it to be trained on.
I'm not saying we shouldn't - that's the nature of technological progress, and there isn't much choice in the matter. But somewhere along the way we stopped asking whether the scale of the model matches the scale of the task. And the answer, more often than we'd like to admit, is no.
This isn't a doom take. We're not being replaced. We're just still in the early adoption phase, when most people don't fully grasp what AI is not and where its limits are, and indulge in a bit too much wishful thinking. Which means we can still shape it - like we shaped radio, then the internet, then open source. We just need to find a more natural path for this technology before the current default ossifies into the only option.
The numbers don't support the defaults
Take Qwen3-Coder-Next: 80B total parameters but only 3B active - performing on par with models that have 10-20x more active compute, runnable on high-end consumer hardware (think a 64GB+ Apple Silicon Mac, or a beefy workstation card) instead of a datacenter rack. Go smaller still and it gets more interesting. A Qwen3-4B fine-tuned for a specific task matches a 120B+ model on that task, deployable on consumer hardware. Or take Chandra - a 5B OCR model purpose-built for PDF and image conversion that outperforms both Gemini 2.5 Flash and GPT-5 Mini on multilingual document benchmarks. Not because it's smarter. Because it's focused.
And every major model release is announced like an earth-shattering event, destined to overshadow everything before it and boost everything tenfold. Then we actually start using the thing, and we find a modest improvement - mostly narrow, mostly a derivative of what the model was trained on. Take the mysterious announcement of Anthropic's Mythos, supposedly "too dangerous to release" - we don't even know yet if it justifies the hype. Meanwhile this experimental article from Aisle already suggests small models can match or outperform it in vulnerability scans - one early experiment, but telling.
This isn't new, either. Chinchilla challenged the "bigger is always better" orthodoxy back in 2022, and since then the evidence has only stacked up - small models trained on high-quality data for a dedicated task can match or beat their much larger cousins. We just kept defaulting to the biggest available thing anyway, partly out of habit, partly because the cloud paradigm is being pushed hard by everyone with a stake in keeping us there. The headline outpaces the reality, and the reality is that for most tasks, we're already past the point of useful returns from going bigger.
A different path
There's another path, and it doesn't look like Cyberpunk 2077. It doesn't require massive H200 clusters just to prettify your CV. It leads to a more equal distribution of AI, and it doesn't try to substitute for anybody.
That path consists of small, dedicated models trained to do one or a few specific things at most. Models that are just smart enough to fulfill their purpose, and small enough to avoid creating the false impression that they're replacing anyone. This is the mass AI of the future - a true symbiosis. Or to be more precise, it's proper tool use.
Because AI is not a being. It's a simulation of one: a very cleverly engineered statistical model that's good at approximation in a way that looks like adaptability. Treating it as a being is what gets us reaching for the largest possible model every time, as if we were asking a person for help. Treating it as a tool is what lets us match the model to the task - the way you don't use a chainsaw to slice bread.
What this looks like in practice is software built AI-native from the ground up, not AI bolted on through MCPs and API calls to remote giants. A document editor with small models embedded or pluggable for grammar checks, restructuring, summarization, all running locally. An OCR pipeline that just does OCR, well - paired with a small RAG model that lets you actually search and query a shelf of scanned papers or PDFs locally. A video editor with a small model that clips and tags footage on your machine. An in-game AI that runs on the player's hardware. None of these require breakthroughs - the models already exist, or could be trained without a billion-dollar cluster if there's enough data available.
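To make the document-editor case concrete, here's a minimal sketch of a pluggable local grammar check - assuming Ollama as the local runtime and a small Qwen model as the specialist; both are stand-in choices, and the prompt is illustrative:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def grammar_check(text: str, model: str = "qwen3:4b") -> str:
    """Send text to a small local model for correction - no cloud round-trip."""
    prompt = (
        "Fix grammar and spelling in the following text. "
        "Return only the corrected text.\n\n" + text
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# An editor could call this on save, on selection, or as a background pass.
print(grammar_check("Their going to they're house tomorow."))
```

The whole thing is one HTTP call to localhost - no account, no telemetry, no datacenter.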
What's missing is the software paradigm to host them properly - and the orchestration layer to chain them together. If general AI adoption is in its early phase, small-model orchestration is in its infancy: tooling, conventions, ecosystems, all still forming. ComfyUI already lets people chain specialized image and video models into local pipelines - the closest thing we have to a working blueprint, though it's fragile and leans heavily on Python venvs. LM Studio and Ollama make running local models trivial and stable, but they're runtimes more than orchestrators. These are embryos - but they prove the paradigm works. And that's the part worth building out further.
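As a toy sketch of what that orchestration layer could feel like - hypothetical shape, not any real framework's API - each step wraps one small, dedicated model, and the chain is just data flowing through them:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str                  # which specialist this step wraps
    run: Callable[[str], str]  # in practice: a call to a local model server

def run_pipeline(steps: list[Step], data: str) -> str:
    """Feed the output of each small model into the next one."""
    for step in steps:
        data = step.run(data)
    return data

# Imagine each lambda replaced by a call to a dedicated local model:
pipeline = [
    Step("ocr", lambda path: f"<text extracted from {path}>"),  # e.g. a 5B OCR model
    Step("summarize", lambda text: text[:200]),                 # e.g. a 4B summarizer
]
print(run_pipeline(pipeline, "scanned_contract.pdf"))
```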
Where the big ones still belong
Large models aren't a dead end. They're the right tool for genuinely hard, open-ended problems - complex coding across unfamiliar codebases, in-depth analysis, anything that requires reasoning across a wide context. The argument isn't "small models for everything." It's "stop using a trillion-parameter model to fix a typo."
The honest version of the AI future is mixed: large models where their capabilities are actually needed, and small specialized models for the long tail of focused tasks - which is most of them. Treating those two cases the same way is what's wasteful. Not the technology itself.
Why this matters
Using large models for everything is the dead end. Not because it doesn't work, but because of what it costs and where it leads. Every "fix this typo" routed through a frontier model is a small vote for the centralization of compute, the centralization of data, and the centralization of who gets to decide what AI does next. Multiply that by a billion daily prompts and you get the bubble we're currently inflating - one where the only viable AI is the kind that requires a hyperscaler to run.
The small-model path isn't just more efficient. It's more honest about what most AI tasks actually need, and it leaves room for AI to be something other than a service we rent from a handful of hyperscalers.
We can still take that path. Many of the models are already there; others are still to be explored and trained. The hardware is there. What's missing is the will to stop assuming bigger is always better - and the software to make small the new default.
Top comments (2)
The cheapest option - not using AI for the task at all - is always there, but we're lazy. Too lazy to select the perfect model for a specific task. So your idea could be delegated to a layer behind the AI API: on that first layer, an AI decides which model to use for the answer.
That's one way to address this: have a router model decide which dedicated model to use. This router model can conveniently be a small one too :) The other part is that there's no mature infrastructure this router can delegate execution to yet. Maybe it can be done the way OpenRouter does it across hosted providers. I hope we'll accelerate in this direction.
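A back-of-the-napkin version of that router, again assuming Ollama as the local runtime and placeholder model names throughout:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

# Illustrative specialist registry - model names are placeholders.
SPECIALISTS = {
    "grammar": "qwen3:4b",
    "code": "qwen2.5-coder:7b",
    "summarize": "qwen3:4b",
}

def ask(model: str, prompt: str) -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def route(task: str) -> str:
    # The router is itself a small model: classify first, then delegate.
    labels = ", ".join(SPECIALISTS)
    label = ask(
        "qwen3:4b",
        f"Classify this task as one of: {labels}. Reply with the label only.\n\n{task}",
    ).strip().lower()
    specialist = SPECIALISTS.get(label, "qwen3:4b")  # generalist fallback
    return ask(specialist, task)
```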