Your AI Bill Isn't a Model Problem. It's an Architecture Problem.

#ai #systemdesign #llm #architecture

If your LLM costs are climbing, the instinct is almost always the same: swap to a cheaper model. GPT-4o to GPT-4o mini. Claude Opus to Claude Haiku. Sometimes that helps a little. It rarely fixes the actual problem.

The actual problem, in most workflows I've looked at, is that every step gets routed through the LLM, even the steps that don't need language reasoning at all.

This post breaks down a simple mental model for deciding what should and shouldn't touch an LLM, with a working example you can adapt.

The four components of any AI workflow

Every automated workflow — whether it's a support ticket router, a fraud check, or a content pipeline — is built from some combination of four building blocks. They get treated the same once a workflow diagram is drawn flat, but they have wildly different cost and latency profiles.

Component	What it does	Think of it as	Typical cost
Trigger	Starts the workflow	The doorbell	~$0
Deterministic ML	Structured predictions — classify, score, rank	The calculator	Cents per 1,000 calls
LLM / Generative	Reads, writes, reasons in language	The writer	Dollars per 1,000 calls
Tool / API	Fetches or writes real data	The hands	Cents per 1,000 calls

The gap between row 2 and row 3 is the whole article. A classifier and an LLM call can solve the exact same problem, but one costs roughly 100-1000x more than the other, depending on model and provider. If you're not deliberately deciding which one handles which step, you're probably defaulting to the expensive one — because in frameworks like LangChain or a quick custom agent loop, it's just easier to shove everything into a prompt.
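To make that gap concrete, here is a back-of-the-envelope calculation. The two per-call prices are placeholder assumptions picked only to match the table's orders of magnitude; they are not quotes from any provider, so plug in your own.

```python
# Hypothetical prices, chosen only to match the table's orders of magnitude.
CLASSIFIER_COST_PER_1K_CALLS = 0.05   # assumed: 5 cents per 1,000 classifier calls
LLM_COST_PER_1K_CALLS = 5.00          # assumed: 5 dollars per 1,000 LLM calls


def monthly_cost(calls: int, cost_per_1k: float) -> float:
    """Cost in dollars of making `calls` requests at a given price per 1,000."""
    return calls / 1000 * cost_per_1k


calls_per_month = 10_000
classifier_bill = monthly_cost(calls_per_month, CLASSIFIER_COST_PER_1K_CALLS)
llm_bill = monthly_cost(calls_per_month, LLM_COST_PER_1K_CALLS)
ratio = LLM_COST_PER_1K_CALLS / CLASSIFIER_COST_PER_1K_CALLS

print(f"classifier: ${classifier_bill:.2f}/month")
print(f"LLM:        ${llm_bill:.2f}/month")
print(f"ratio:      {ratio:.0f}x")
```

With these assumed prices the same 10,000 classifications cost fifty cents on one path and fifty dollars on the other. The absolute numbers will differ for you; the ratio is the part worth checking against your own bill.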

Where this actually shows up

Here's a workflow I see constantly: an automated support ticket triage system.

flowchart LR
    A[New support ticket] --> B{Classify intent}
    B --> C[Route to team]
    B --> D[Auto-draft response]
    D --> E[Update CRM]

A naive build sends the entire ticket text to an LLM and asks it to do everything at once: classify the intent, decide routing, draft a response, and format a CRM update — all in a single prompt, often with the LLM also asked to output structured JSON for the routing decision.

This works. It's also wildly overpriced for what it's doing, because step B — classification — doesn't need an LLM's reasoning ability. It needs a model that's good at one narrow task: mapping ticket text to one of N categories.

The breakdown

Trigger — ticket arrives via webhook. Free.

Deterministic ML — a lightweight classifier (a fine-tuned BERT-style model, or just a gradient-boosted classifier on embeddings) decides intent: billing, technical, account, spam. This is a calculator problem. Fast, cheap, and consistent — the same input gives the same output every time, which matters when you're debugging routing logic later.

LLM / Generative — only invoked for the response draft, and only for tickets that actually need a written reply (not, say, an auto-tagged spam ticket that gets silently archived).

Tool / API — the CRM update. A database write. No reasoning required.

In the naive version, every single ticket — including spam that gets immediately discarded — pays the LLM tax for classification it didn't need.
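For a sense of how small the deterministic ML step can be, here is a minimal sketch of a Naive Bayes text classifier using only the standard library. It is a stand-in I chose for illustration, not the fine-tuned or embedding-based model described above, and the training sentences are invented. A real system would train on labeled historical tickets.

```python
import math
from collections import Counter, defaultdict

# Toy training data, invented for illustration only.
TRAINING_EXAMPLES = [
    ("I was charged twice on my invoice", "billing"),
    ("please refund the extra charge on my card", "billing"),
    ("the app crashes when I upload a file", "technical"),
    ("getting an error message on the export page", "technical"),
    ("I cannot log in and need a password reset", "account"),
    ("how do I change the email on my account", "account"),
    ("win a free prize click this link now", "spam"),
    ("cheap pills limited offer click here", "spam"),
]


def tokenize(text):
    """Lowercase and split on whitespace; a real tokenizer would do more."""
    return text.lower().split()


class NaiveBayesIntentClassifier:
    """Multinomial Naive Bayes with add-one smoothing, standard library only."""

    def fit(self, examples):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of examples
        self.vocabulary = set()
        for text, label in examples:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        return self

    def predict(self, text):
        total_examples = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):  # sorted for a stable tie-break
            total_words = sum(self.word_counts[label].values())
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.label_counts[label] / total_examples)
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = NaiveBayesIntentClassifier().fit(TRAINING_EXAMPLES)
print(classifier.predict("there is a wrong charge on my invoice"))
print(classifier.predict("click this link for a free prize"))
```

There is no sampling anywhere in `predict`, so the same ticket text always produces the same label, which is the property that makes routing bugs reproducible.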

A simplified routing layer

Here's roughly what separating these concerns looks like in code. This is illustrative, not production-hardened — the point is the shape of the decision, not the specific classifier implementation.

from dataclasses import dataclass
from enum import Enum

class Intent(Enum):
    BILLING = "billing"
    TECHNICAL = "technical"
    ACCOUNT = "account"
    SPAM = "spam"

@dataclass
class Ticket:
    text: str
    customer_id: str

def classify_intent(ticket: Ticket) -> Intent:
    """
    Deterministic ML step. In practice this might be a small
    fine-tuned classifier, a logistic regression over embeddings,
    or even keyword/regex rules for simple cases.
    No LLM call here — this should run in single-digit milliseconds.
    """
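    # Placeholder keyword rules; swap in a trained classifier in practice.
    text = ticket.text.lower()  # lowercase once instead of per check
    if "unsubscribe" in text:
        return Intent.SPAM
    if "invoice" in text or "charge" in text:
        return Intent.BILLING
    if "password" in text or "login" in text:
        return Intent.ACCOUNT
    return Intent.TECHNICAL  # default bucket when nothing else matches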


def needs_generated_response(intent: Intent) -> bool:
    """Only some intents need a written reply at all."""
    return intent != Intent.SPAM


def draft_response(ticket: Ticket, intent: Intent) -> str:
    """
    This is the only place an LLM call belongs in this pipeline.
    Everything upstream has already done the cheap filtering.
    """
    prompt = f"Write a helpful support reply for this {intent.value} ticket:\n{ticket.text}"
    return call_llm(prompt)  # your actual LLM client call


def update_crm(ticket: Ticket, intent: Intent, response: str | None) -> None:
    """Tool/API step. A database write, no reasoning involved."""
    crm_client.update_ticket(
        customer_id=ticket.customer_id,
        intent=intent.value,
        response=response,
    )


def handle_ticket(ticket: Ticket) -> None:
    intent = classify_intent(ticket)          # deterministic ML

    response = None
    if needs_generated_response(intent):       # cheap gate
        response = draft_response(ticket, intent)  # LLM only when needed

    update_crm(ticket, intent, response)        # tool/API

The structure matters more than the specific classifier you plug in. I've seen teams spend a week picking the "best" classifier model when the real win was just moving classification out of the prompt in the first place. The LLM call sits behind two cheap gates: classification, and a boolean check on whether a response is even warranted. Spam tickets never reach the LLM. Routine billing tickets that match a known pattern could, in a more developed version, skip the LLM entirely and use a templated response instead.
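The templated-response idea could look something like the sketch below. Everything in it is hypothetical: the pattern names, the keyword lists, the template wording, and the `fake_llm` stand-in that lets the example run without a real client.

```python
from string import Template
from typing import Callable, Optional, Tuple

# Hypothetical templates keyed by pattern name; wording invented for illustration.
TEMPLATES = {
    "duplicate_charge": Template(
        "Hi $name, we found the duplicate charge and a refund is on its way."
    ),
    "invoice_copy": Template(
        "Hi $name, a copy of your latest invoice has been sent to your email."
    ),
}

# Placeholder keyword rules mapping ticket text to a template name.
PATTERN_KEYWORDS = {
    "duplicate_charge": ("charged twice", "double charge", "duplicate charge"),
    "invoice_copy": ("copy of my invoice", "resend invoice"),
}


def match_known_pattern(ticket_text: str) -> Optional[str]:
    """Return the name of the first matching template, or None."""
    lowered = ticket_text.lower()
    for pattern_name, keywords in PATTERN_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return pattern_name
    return None


def respond(
    ticket_text: str, customer_name: str, llm: Callable[[str], str]
) -> Tuple[str, bool]:
    """Return (reply, used_llm). The LLM is only the fallback path."""
    pattern_name = match_known_pattern(ticket_text)
    if pattern_name is not None:
        return TEMPLATES[pattern_name].substitute(name=customer_name), False
    return llm(f"Write a helpful support reply:\n{ticket_text}"), True


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM client so the sketch runs offline."""
    return "[LLM-generated reply]"


reply, used_llm = respond("I was charged twice this month", "Sam", fake_llm)
print(used_llm, reply)
```

Returning the `used_llm` flag alongside the reply is a deliberate choice here: it makes it easy to log how often the fallback path is taken, which tells you whether adding more templates is worth the effort.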

Illustrative cost comparison

To be clear: these are example numbers to illustrate the order of magnitude, not measured results from a specific deployment. Your actual costs depend on your provider, model choice, and ticket volume.

Approach	Classification	Response generation	Total for 10,000 tickets/month (~30% spam, ~70% need replies)
Everything through LLM	LLM call per ticket	LLM call per ticket	LLM called 10,000 times
Routed architecture	Cheap classifier per ticket	LLM call only for non-spam	LLM called ~7,000 times

Even in this simple example, routing alone removes 30% of the most expensive calls before any model swap. Add templated responses for common patterns and caching for repeated questions, and the LLM call count drops further still, in some cases by more than switching models would save on its own.
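The call counts in the table, and the effect of stacking more gates, come down to a few multiplications. In the sketch below the 30% spam rate comes from the example above, while `template_rate` and `cache_hit_rate` are made-up values used only to show how the gates compound.

```python
def llm_calls(
    total_tickets: int,
    spam_rate: float,
    template_rate: float = 0.0,
    cache_hit_rate: float = 0.0,
) -> int:
    """
    Estimate how many tickets still reach the LLM after each gate.

    template_rate:  assumed share of non-spam tickets answered by a template.
    cache_hit_rate: assumed share of the remaining tickets served from a cache.
    """
    needs_reply = total_tickets * (1 - spam_rate)
    after_templates = needs_reply * (1 - template_rate)
    after_cache = after_templates * (1 - cache_hit_rate)
    return round(after_cache)


naive = 10_000  # everything through the LLM: one call per ticket
routed = llm_calls(10_000, spam_rate=0.30)
with_extras = llm_calls(
    10_000, spam_rate=0.30, template_rate=0.20, cache_hit_rate=0.10
)

print(f"naive:       {naive} LLM calls")
print(f"routed:      {routed} LLM calls")
print(f"with extras: {with_extras} LLM calls")
```

Because each gate multiplies what is left, even modest rates compound: under these assumed values roughly half of the original calls never happen.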

When to actually reach for a smaller model

This isn't an argument against using cheaper LLMs. It's an argument for using them in the right place. Once you've separated deterministic work from generative work, "should I use a smaller/cheaper model" becomes a much narrower question, one that applies only to the generation step, where it belongs, instead of being bolted onto everything.

A reasonable order of operations:

1. Map your workflow against the four components above. Be honest about which steps are actually classification/extraction/ranking versus genuine language generation.
2. Move deterministic steps out of the prompt. Classification, routing, scoring, and structured extraction usually have a non-LLM solution that's faster and cheaper, even if it takes more upfront engineering than "just ask the LLM to do it."
3. Gate the LLM call. Don't generate a response for tickets that don't need one. Don't summarize content nobody asked to see.
4. Only then, evaluate model size for what's left. If you're still calling an LLM 10,000 times a month for response generation, that's the point where comparing model tiers actually matters.
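The first step, mapping a workflow against the four components, can be as lightweight as a list you keep next to the code. The sketch below is one hypothetical way to write that audit down; the `Step` structure and the step names are my own illustration, based on the naive triage build described earlier.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Component(Enum):
    TRIGGER = "trigger"
    DETERMINISTIC_ML = "deterministic_ml"
    LLM = "llm"
    TOOL = "tool"


@dataclass
class Step:
    name: str
    needs: Component           # what the step actually requires
    currently_uses: Component  # what handles it today


def misrouted_steps(steps: List[Step]) -> List[str]:
    """Names of steps that go through the LLM without needing language generation."""
    return [
        step.name
        for step in steps
        if step.currently_uses is Component.LLM and step.needs is not Component.LLM
    ]


# The naive triage build, where a single prompt does everything.
naive_triage = [
    Step("receive ticket", Component.TRIGGER, Component.TRIGGER),
    Step("classify intent", Component.DETERMINISTIC_ML, Component.LLM),
    Step("draft response", Component.LLM, Component.LLM),
    Step("format CRM update", Component.TOOL, Component.LLM),
]

print(misrouted_steps(naive_triage))
```

Whatever the audit flags is the candidate list for step 2; anything it leaves on the LLM is what step 4's model-size comparison should be scoped to.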

The takeaway

A scoped-tools agent and a scoped-architecture pipeline are solving the same problem: give an expensive, general-purpose reasoning engine less to think about, so it spends its compute on the one thing it's actually needed for.

I'll admit the smaller-model conversation is more fun to have. It feels like progress: swap a config value, watch the bill drop a little, move on. Rearchitecting which steps even touch the LLM is slower and less satisfying in the short term. But it's usually where the real savings are sitting, untouched, while everyone argues about which model is cheapest per token.