AI agents have a cost problem.
A single "task" often means many model calls: reading context, calling tools, summarizing results, deciding the next step, retrying, validating output. If every step hits a frontier LLM, the unit economics get ugly fast.
One big model for everything is probably the wrong shape
The better question isn't "which model is smartest?" — it's "which part of the task actually needs the smartest model?"
LLMs should handle the hard parts: planning, backtracking, judgment, ambiguous decisions.
Small language models can handle the boring but frequent parts: extraction, routing, JSON formatting, tool parameters, log summaries, simple validation.
Most agent workflows contain a lot of that second category.
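As a rough illustration of that split, here is a minimal routing sketch. The step labels and the `pick_tier` helper are hypothetical names invented for this example, not part of any library. A real agent would need its own taxonomy and a better way to classify steps than a lookup table.

```python
from enum import Enum


class Tier(Enum):
    SLM = "small language model"
    LLM = "frontier LLM"


# Hypothetical step labels mirroring the "boring but frequent" list above.
SLM_STEPS = {
    "extraction",
    "routing",
    "json_formatting",
    "tool_parameters",
    "log_summary",
    "simple_validation",
}


def pick_tier(step_kind: str) -> Tier:
    """Send known routine steps to the SLM and everything else to the LLM."""
    if step_kind in SLM_STEPS:
        return Tier.SLM
    # Planning, backtracking, judgment, and any step we have not
    # classified yet default to the stronger model.
    return Tier.LLM
```

Defaulting unknown steps to the LLM is a deliberate choice in this sketch: misrouting a hard step to a small model is assumed to cost more than overpaying for an easy one.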
Why desktop agents are interesting
Cloud agents pay for tokens at almost every step — every retry, every summary, every tool-call decision, every formatting pass usually goes through a remote model.
Desktop agents have another option: local compute. They can run small local models or deterministic code for cheap, repetitive work, and only call cloud LLMs when the task actually needs deeper reasoning.
That changes the cost structure. Instead of:
every step → cloud LLM token cost
you get something closer to:
routine work → local compute
hard decisions → cloud LLMs
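To make the shift concrete, here is a back-of-the-envelope sketch. All numbers are illustrative assumptions rather than measurements: costs are in abstract units, a cloud step costs 1 unit, and local compute is treated as free at the margin, which ignores hardware and energy. The `task_cost` function is a hypothetical helper.

```python
def task_cost(
    steps: int,
    routine_fraction: float,
    cloud_cost_per_step: float = 1.0,
    local_cost_per_step: float = 0.0,
) -> float:
    """Cost of one task when routine steps run locally and the rest hit the cloud."""
    routine_steps = steps * routine_fraction
    hard_steps = steps - routine_steps
    return routine_steps * local_cost_per_step + hard_steps * cloud_cost_per_step


# Assumed workload: 20 steps per task, 75% of them routine.
all_cloud = task_cost(steps=20, routine_fraction=0.0)   # 20.0 units
hybrid = task_cost(steps=20, routine_fraction=0.75)     # 5.0 units
```

Under these assumptions the hybrid setup pays only for the hard steps, so the saving tracks whatever fraction of the workflow is routine.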
The long-term loop
start with LLMs → log agent traces → find repeated task patterns → distill them into SLMs / LoRAs → run them locally or cheaply → keep LLMs as fallback
In other words, agents should get cheaper as they're used more. The more traces you collect, the clearer it gets which tasks are repeated, narrow, and safe to move off frontier models.
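The "find repeated task patterns" step could start as simply as counting step types in logged traces. The sketch below assumes each logged step carries a hypothetical `kind` label and a `succeeded` flag. The thresholds are arbitrary placeholders, and using success rate as a stand-in for "safe" is an assumption of this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TraceStep:
    kind: str        # hypothetical label, e.g. "extraction"
    succeeded: bool  # whether the step's output was accepted downstream


def distillation_candidates(
    steps: list[TraceStep],
    min_count: int = 100,
    min_success: float = 0.95,
) -> list[str]:
    """Return step kinds that are frequent and reliably successful."""
    totals = Counter(step.kind for step in steps)
    successes = Counter(step.kind for step in steps if step.succeeded)
    return sorted(
        kind
        for kind, count in totals.items()
        if count >= min_count and successes[kind] / count >= min_success
    )
```

Step kinds that pass both filters are the ones worth examining as candidates for a distilled SLM or LoRA. Everything else stays on the frontier model until more traces accumulate.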
My takeaway
The next wave of agents won't just be about stronger models — it'll be about better compute allocation: LLMs for judgment, SLMs for narrow repeated work, code for deterministic checks, local compute wherever possible.
That may be what makes agent economics work.
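One way the allocation above might fit together is a local-first call with a deterministic check and a cloud fallback. This is a sketch under assumptions, not a reference design. `local_model` and `cloud_model` are hypothetical callables that take a prompt and return a string, and the only validation shown is "parses as a JSON object with the required keys".

```python
import json
from typing import Callable, Optional


def _parse_if_valid(raw: str, required_keys: set[str]) -> Optional[dict]:
    """Deterministic check: a JSON object containing all required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and required_keys.issubset(data):
        return data
    return None


def extract_with_fallback(
    prompt: str,
    local_model: Callable[[str], str],
    cloud_model: Callable[[str], str],
    required_keys: set[str],
) -> tuple[dict, str]:
    """Try the local model first; escalate only when the check fails."""
    parsed = _parse_if_valid(local_model(prompt), required_keys)
    if parsed is not None:
        return parsed, "local"
    # Fallback path: the cloud output is parsed without further checks here,
    # so a malformed reply raises json.JSONDecodeError for the caller.
    return json.loads(cloud_model(prompt)), "cloud"
```

The second element of the returned tuple records which tier answered, which is the kind of signal the trace-logging loop needs.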
Paper: Small Language Models are the Future of Agentic AI — https://arxiv.org/abs/2506.02153
Top comments (1)
I think this is where a lot of AI cost discussions miss the point. It's not about replacing a GPT-5 class model with a smaller one; it's about orchestrating the right model for the right job.
Deterministic code, SLMs, and frontier models each have different strengths. If every retry, formatter, and validation step goes through the biggest model, costs will scale linearly with usage. The teams that win will be the ones treating AI pipelines like distributed systems, where compute is allocated intentionally instead of uniformly.