If your LLM costs are climbing, the instinct is almost always the same: swap to a cheaper model. GPT-4o to GPT-4o mini. Claude Opus to Claude Haiku. Sometimes that helps a little. It rarely fixes the actual problem.
The actual problem, in most workflows I've looked at, is that every step gets routed through the LLM, even the steps that don't need language reasoning at all.
This post breaks down a simple mental model for deciding what should and shouldn't touch an LLM, with a working example you can adapt.
The four components of any AI workflow
Every automated workflow — whether it's a support ticket router, a fraud check, or a content pipeline — is built from some combination of four building blocks. They get treated the same once a workflow diagram is drawn flat, but they have wildly different cost and latency profiles.
| Component | What it does | Think of it as | Typical cost |
|---|---|---|---|
| Trigger | Starts the workflow | The doorbell | ~$0 |
| Deterministic ML | Structured predictions — classify, score, rank | The calculator | Cents per 1,000 calls |
| LLM / Generative | Reads, writes, reasons in language | The writer | Dollars per 1,000 calls |
| Tool / API | Fetches or writes real data | The hands | Cents per 1,000 calls |
The gap between row 2 and row 3 is the whole article. A classifier and an LLM call can solve the exact same problem, but the LLM call costs roughly 100x more going by the table above, and closer to 1000x for some model and provider combinations. If you're not deliberately deciding which one handles which step, you're probably defaulting to the expensive one, because in frameworks like LangChain or a quick custom agent loop, it's just easier to shove everything into a prompt.
Where this actually shows up
Here's a workflow I see constantly: an automated support ticket triage system.
```mermaid
flowchart LR
    A[New support ticket] --> B{Classify intent}
    B --> C[Route to team]
    B --> D[Auto-draft response]
    D --> E[Update CRM]
```
A naive build sends the entire ticket text to an LLM and asks it to do everything at once: classify the intent, decide routing, draft a response, and format a CRM update — all in a single prompt, often with the LLM also asked to output structured JSON for the routing decision.
This works. It's also wildly overpriced for what it's doing, because step B — classification — doesn't need an LLM's reasoning ability. It needs a model that's good at one narrow task: mapping ticket text to one of N categories.
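For contrast, here is a rough sketch of what that naive build tends to look like. `call_llm` is a hypothetical stand-in for a real client, stubbed with canned JSON and a call counter so the snippet runs on its own; the prompt wording and JSON keys are assumptions for illustration, not a recommended format.

```python
import json

LLM_CALLS = 0  # counts how often the (stubbed) LLM is hit


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client; returns canned JSON."""
    global LLM_CALLS
    LLM_CALLS += 1
    is_spam = "unsubscribe" in prompt.lower()
    return json.dumps({
        "intent": "spam" if is_spam else "technical",
        "team": "none" if is_spam else "support",
        "reply": "" if is_spam else "Thanks for reaching out...",
    })


def handle_ticket_naive(ticket_text: str) -> dict:
    """One prompt does classification, routing, and drafting together."""
    prompt = (
        "Classify the intent, pick a team, and draft a reply. "
        "Answer as JSON with keys intent, team, reply.\n\n"
        f"Ticket: {ticket_text}"
    )
    return json.loads(call_llm(prompt))


tickets = ["My app crashes on login", "Click here to unsubscribe from deals"]
results = [handle_ticket_naive(t) for t in tickets]
print(LLM_CALLS)  # 2: the spam ticket paid for a full LLM call too
```

Every ticket costs one LLM call regardless of what it contains, which is the behavior the rest of this post tries to avoid.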
The breakdown
Trigger — ticket arrives via webhook. Free.
Deterministic ML — a lightweight classifier (a fine-tuned BERT-style model, or just a gradient-boosted classifier on embeddings) decides intent: billing, technical, account, spam. This is a calculator problem. Fast, cheap, and consistent — the same input gives the same output every time, which matters when you're debugging routing logic later.
LLM / Generative — only invoked for the response draft, and only for tickets that actually need a written reply (not, say, an auto-tagged spam ticket that gets silently archived).
Tool / API — the CRM update. A database write. No reasoning required.
In the naive version, every single ticket — including spam that gets immediately discarded — pays the LLM tax for classification it didn't need.
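To make the "calculator" step concrete, here is a minimal sketch of a deterministic intent classifier. It swaps in a tiny multinomial Naive Bayes written with the standard library only, rather than the BERT-style or gradient-boosted models mentioned above, purely so the example has no dependencies. The training sentences are made up for illustration; a real system would train on labeled historical tickets.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing, stdlib only."""

    def __init__(self) -> None:
        self.word_counts: dict[str, Counter] = defaultdict(Counter)
        self.label_counts: Counter = Counter()
        self.vocab: set[str] = set()

    def fit(self, examples: list[tuple[str, str]]) -> None:
        for text, label in examples:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text: str) -> str:
        total_docs = sum(self.label_counts.values())
        best_label, best_score = "", -math.inf
        for label in sorted(self.label_counts):  # sorted: ties break the same way every run
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.label_counts[label] / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data, two examples per intent.
TRAINING = [
    ("I was charged twice on my invoice", "billing"),
    ("Refund the extra charge on my card", "billing"),
    ("The app crashes when I log in", "technical"),
    ("Error message when uploading a file", "technical"),
    ("How do I change my account email", "account"),
    ("Reset the password on my account", "account"),
    ("Click here to unsubscribe from offers", "spam"),
    ("Win a free prize click now", "spam"),
]

model = TinyNaiveBayes()
model.fit(TRAINING)
print(model.predict("Why is there a second charge on my invoice?"))  # billing
print(model.predict("Unsubscribe me from these offers"))             # spam
```

The same input always yields the same label, and there is no network call involved, which is the property the breakdown above relies on.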
A simplified routing layer
Here's roughly what separating these concerns looks like in code. This is illustrative, not production-hardened — the point is the shape of the decision, not the specific classifier implementation.
```python
from __future__ import annotations  # lets `str | None` work on Python < 3.10

from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    BILLING = "billing"
    TECHNICAL = "technical"
    ACCOUNT = "account"
    SPAM = "spam"


@dataclass
class Ticket:
    text: str
    customer_id: str


def call_llm(prompt: str) -> str:
    """Placeholder so the example runs; swap in your actual LLM client."""
    return f"[LLM draft for a {len(prompt)}-character prompt]"


class PlaceholderCRM:
    """Placeholder CRM client; replace with your real SDK or database layer."""

    def update_ticket(self, customer_id: str, intent: str, response: str | None) -> None:
        print(f"CRM update: {customer_id=} {intent=} {response=}")


crm_client = PlaceholderCRM()


def classify_intent(ticket: Ticket) -> Intent:
    """
    Deterministic ML step. In practice this might be a small
    fine-tuned classifier, a logistic regression over embeddings,
    or even keyword/regex rules for simple cases.
    No LLM call here; this should run in single-digit milliseconds.
    """
    # placeholder logic (never returns ACCOUNT; a real classifier would)
    text = ticket.text.lower()
    if "unsubscribe" in text:
        return Intent.SPAM
    if "invoice" in text or "charge" in text:
        return Intent.BILLING
    return Intent.TECHNICAL


def needs_generated_response(intent: Intent) -> bool:
    """Only some intents need a written reply at all."""
    return intent != Intent.SPAM


def draft_response(ticket: Ticket, intent: Intent) -> str:
    """
    This is the only place an LLM call belongs in this pipeline.
    Everything upstream has already done the cheap filtering.
    """
    prompt = f"Write a helpful support reply for this {intent.value} ticket:\n{ticket.text}"
    return call_llm(prompt)  # your actual LLM client call


def update_crm(ticket: Ticket, intent: Intent, response: str | None) -> None:
    """Tool/API step. A database write, no reasoning involved."""
    crm_client.update_ticket(
        customer_id=ticket.customer_id,
        intent=intent.value,
        response=response,
    )


def handle_ticket(ticket: Ticket) -> None:
    intent = classify_intent(ticket)                # deterministic ML
    response = None
    if needs_generated_response(intent):            # cheap gate
        response = draft_response(ticket, intent)   # LLM only when needed
    update_crm(ticket, intent, response)            # tool/API
```
The structure matters more than the specific classifier you plug in. I've seen teams spend a week picking the "best" classifier model when the real win was just moving classification out of the prompt in the first place. The LLM call sits behind two cheap gates: classification, and a boolean check on whether a response is even warranted. Spam tickets never reach the LLM. Routine billing tickets that match a known pattern could, in a more developed version, skip the LLM entirely and use a templated response instead.
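The templated-response idea mentioned above could look something like the following sketch. The regex patterns, the canned replies, and the `draft_or_template` helper are all hypothetical; the point is only that a pattern match returns a fixed reply and the LLM is called solely when nothing matches.

```python
import re
from typing import Callable, Optional

# Hypothetical patterns and canned replies; replace with your own.
TEMPLATES: list[tuple[re.Pattern, str]] = [
    (re.compile(r"\b(copy of|download).*\binvoice\b", re.I),
     "You can download any invoice from your billing settings."),
    (re.compile(r"\bupdate\b.*\b(card|payment method)\b", re.I),
     "You can update your payment method in your billing settings."),
]


def templated_reply(text: str) -> Optional[str]:
    """Return a canned reply if the ticket matches a known pattern."""
    for pattern, reply in TEMPLATES:
        if pattern.search(text):
            return reply
    return None


def draft_or_template(text: str, llm: Callable[[str], str]) -> str:
    """Try the free template path first; fall back to the LLM."""
    reply = templated_reply(text)
    if reply is not None:
        return reply
    return llm(f"Write a helpful support reply for this ticket:\n{text}")


llm_prompts: list[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in LLM that records each call, for demonstration only."""
    llm_prompts.append(prompt)
    return "generated reply"


print(draft_or_template("How do I download my invoice?", fake_llm))
print(draft_or_template("I was charged twice this month", fake_llm))
print(len(llm_prompts))  # 1: only the second ticket reached the LLM
```

Passing the LLM in as a callable keeps the gate testable without a real client, and makes it easy to see exactly which tickets fall through to generation.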
Illustrative cost comparison
To be clear: these are example numbers to illustrate the order of magnitude, not measured results from a specific deployment. Your actual costs depend on your provider, model choice, and ticket volume.
| Approach | Classification | Response generation | Total for 10,000 tickets/month (~30% spam, ~70% need replies) |
|---|---|---|---|
| Everything through LLM | LLM call per ticket | LLM call per ticket | LLM called 10,000 times |
| Routed architecture | Cheap classifier per ticket | LLM call only for non-spam | LLM called ~7,000 times |
Even in this simple example, routing alone removes 30% of the most expensive calls before any model swap. Add templated responses for common patterns and caching for repeated questions, and the LLM call count drops further still — usually by more than switching models would save on its own.
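The arithmetic behind that table can be sketched in a few lines. The per-call prices below are assumptions chosen to match the order of magnitude in the component table (dollars per 1,000 LLM calls, cents per 1,000 classifier calls), not quotes from any provider.

```python
# Assumed prices, for illustration only.
LLM_COST_PER_CALL = 0.005          # $5 per 1,000 calls
CLASSIFIER_COST_PER_CALL = 0.00005  # 5 cents per 1,000 calls


def monthly_llm_calls(tickets: int, spam_rate: float, routed: bool) -> int:
    """Naive: every ticket hits the LLM. Routed: only non-spam tickets do."""
    if not routed:
        return tickets
    return round(tickets * (1 - spam_rate))


def monthly_cost(tickets: int, spam_rate: float, routed: bool) -> float:
    """Total monthly spend: LLM calls plus, if routed, the classifier."""
    llm_cost = monthly_llm_calls(tickets, spam_rate, routed) * LLM_COST_PER_CALL
    classifier_cost = tickets * CLASSIFIER_COST_PER_CALL if routed else 0.0
    return llm_cost + classifier_cost


naive = monthly_cost(10_000, 0.30, routed=False)
routed = monthly_cost(10_000, 0.30, routed=True)
print(monthly_llm_calls(10_000, 0.30, routed=True))  # 7000
print(f"naive:  ${naive:.2f}")   # naive:  $50.00
print(f"routed: ${routed:.2f}")  # routed: $35.50
```

Under these assumed prices, the classifier adds about fifty cents a month while removing 3,000 LLM calls, so its own cost barely registers next to what it saves.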
When to actually reach for a smaller model
This isn't an argument against using cheaper LLMs. It's an argument for using them in the right place. Once you've separated deterministic work from generative work, "should I use a smaller/cheaper model" becomes a much narrower question: applied only to the generation step, where it belongs, instead of bolted onto everything.
A reasonable order of operations:
- Map your workflow against the four components above. Be honest about which steps are actually classification/extraction/ranking versus genuine language generation.
- Move deterministic steps out of the prompt. Classification, routing, scoring, structured extraction — these usually have a non-LLM solution that's faster and cheaper, even if it takes more upfront engineering than "just ask the LLM to do it."
- Gate the LLM call. Don't generate a response for tickets that don't need one. Don't summarize content nobody asked to see.
- Only then, evaluate model size for what's left. If you're still calling an LLM 10,000 times a month for response generation, that's the point where comparing model tiers actually matters.
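The first step in that list, mapping the workflow against the four components, can be done on paper, but a small audit structure makes the mismatches explicit. This is a sketch with hypothetical names (`Step`, `misrouted_steps`); the step list mirrors the naive triage build described earlier.

```python
from dataclasses import dataclass
from enum import Enum


class Component(Enum):
    TRIGGER = "trigger"
    DETERMINISTIC_ML = "deterministic_ml"
    LLM = "llm"
    TOOL = "tool"


@dataclass(frozen=True)
class Step:
    name: str
    natural_fit: Component  # what the step actually needs
    handled_by: Component   # what handles it today


def misrouted_steps(steps: list[Step]) -> list[str]:
    """Steps currently on the LLM that don't need language generation."""
    return [
        s.name for s in steps
        if s.handled_by is Component.LLM and s.natural_fit is not Component.LLM
    ]


naive_triage = [
    Step("receive webhook", Component.TRIGGER, Component.TRIGGER),
    Step("classify intent", Component.DETERMINISTIC_ML, Component.LLM),
    Step("draft response", Component.LLM, Component.LLM),
    Step("format CRM update", Component.TOOL, Component.LLM),
]
print(misrouted_steps(naive_triage))  # ['classify intent', 'format CRM update']
```

Whatever the audit flags is the candidate list for the second step: moving deterministic work out of the prompt.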
The takeaway
A scoped-tools agent and a scoped-architecture pipeline are solving the same problem: give an expensive, general-purpose reasoning engine less to think about, so it spends its compute on the one thing it's actually needed for.
I'll admit the smaller-model conversation is more fun to have. It feels like progress: swap a config value, watch the bill drop a little, move on. Rearchitecting which steps even touch the LLM is slower and less satisfying in the short term. But it's usually where the real savings are sitting, untouched, while everyone argues about which model is cheapest per token.