<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yogesh Bakshi</title>
    <description>The latest articles on DEV Community by Yogesh Bakshi (@bakshiyogesh).</description>
    <link>https://dev.to/bakshiyogesh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1183537%2F0ab92292-425c-46a9-af29-59b3ff84b71d.jpeg</url>
      <title>DEV Community: Yogesh Bakshi</title>
      <link>https://dev.to/bakshiyogesh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bakshiyogesh"/>
    <language>en</language>
    <item>
      <title>Your AI Bill Isn't a Model Problem. It's an Architecture Problem.</title>
      <dc:creator>Yogesh Bakshi</dc:creator>
      <pubDate>Tue, 23 Jun 2026 06:43:17 +0000</pubDate>
      <link>https://dev.to/bakshiyogesh/your-ai-bill-isnt-a-model-problem-its-an-architecture-problem-1ole</link>
      <guid>https://dev.to/bakshiyogesh/your-ai-bill-isnt-a-model-problem-its-an-architecture-problem-1ole</guid>
      <description>&lt;p&gt;If your LLM costs are climbing, the instinct is almost always the same: swap to a cheaper model. GPT-4 to GPT-4-mini. Claude Opus to Claude Haiku. Sometimes that helps a little. It rarely fixes the actual problem.&lt;/p&gt;

&lt;p&gt;The actual problem, in most workflows I've looked at, is that every step gets routed through the LLM, even the steps that don't need language reasoning at all.&lt;/p&gt;

&lt;p&gt;This post breaks down a simple mental model for deciding what should and shouldn't touch an LLM, with a working example you can adapt.&lt;/p&gt;

&lt;p&gt;The four components of any AI workflow&lt;/p&gt;

&lt;p&gt;Every automated workflow — whether it's a support ticket router, a fraud check, or a content pipeline — is built from some combination of four building blocks. They get treated the same once a workflow diagram is drawn flat, but they have wildly different cost and latency profiles.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Think of it as&lt;/th&gt;
&lt;th&gt;Typical cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trigger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Starts the workflow&lt;/td&gt;
&lt;td&gt;The doorbell&lt;/td&gt;
&lt;td&gt;~$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deterministic ML&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured predictions — classify, score, rank&lt;/td&gt;
&lt;td&gt;The calculator&lt;/td&gt;
&lt;td&gt;Cents per 1,000 calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM / Generative&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads, writes, reasons in language&lt;/td&gt;
&lt;td&gt;The writer&lt;/td&gt;
&lt;td&gt;Dollars per 1,000 calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool / API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fetches or writes real data&lt;/td&gt;
&lt;td&gt;The hands&lt;/td&gt;
&lt;td&gt;Cents per 1,000 calls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gap between row 2 and row 3 is the whole article. A classifier and an LLM call can solve the exact same problem, but one costs roughly 100-1000x more than the other, depending on model and provider. If you're not deliberately deciding which one handles which step, you're probably defaulting to the expensive one — because in frameworks like LangChain or a quick custom agent loop, it's just easier to shove everything into a prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this actually shows up
&lt;/h2&gt;

&lt;p&gt;Here's a workflow I see constantly: an automated support ticket triage system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    A[New support ticket] --&amp;gt; B{Classify intent}
    B --&amp;gt; C[Route to team]
    B --&amp;gt; D[Auto-draft response]
    D --&amp;gt; E[Update CRM]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A naive build sends the entire ticket text to an LLM and asks it to do everything at once: classify the intent, decide routing, draft a response, and format a CRM update — all in a single prompt, often with the LLM also asked to output structured JSON for the routing decision.&lt;/p&gt;

&lt;p&gt;This works. It's also wildly overpriced for what it's doing, because step B — classification — doesn't need an LLM's reasoning ability. It needs a model that's good at one narrow task: mapping ticket text to one of N categories.&lt;/p&gt;

&lt;h3&gt;
  
  
  The breakdown
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Trigger&lt;/strong&gt; — ticket arrives via webhook. Free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deterministic ML&lt;/strong&gt; — a lightweight classifier (a fine-tuned BERT-style model, or just a gradient-boosted classifier on embeddings) decides intent: billing, technical, account, spam. This is a calculator problem. Fast, cheap, and consistent — the same input gives the same output every time, which matters when you're debugging routing logic later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM / Generative&lt;/strong&gt; — only invoked for the response draft, and only for tickets that actually need a written reply (not, say, an auto-tagged spam ticket that gets silently archived).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool / API&lt;/strong&gt; — the CRM update. A database write. No reasoning required.&lt;/p&gt;

&lt;p&gt;In the naive version, every single ticket — including spam that gets immediately discarded — pays the LLM tax for classification it didn't need.&lt;/p&gt;

&lt;h2&gt;
  
  
  A simplified routing layer
&lt;/h2&gt;

&lt;p&gt;Here's roughly what separating these concerns looks like in code. This is illustrative, not production-hardened — the point is the shape of the decision, not the specific classifier implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;BILLING&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;billing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;TECHNICAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;technical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;ACCOUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;account&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;SPAM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spam&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Ticket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Ticket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Deterministic ML step. In practice this might be a small
    fine-tuned classifier, a logistic regression over embeddings,
    or even keyword/regex rules for simple cases.
    No LLM call here — this should run in single-digit milliseconds.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# placeholder logic
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unsubscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SPAM&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BILLING&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TECHNICAL&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;needs_generated_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Only some intents need a written reply at all.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SPAM&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;draft_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Ticket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This is the only place an LLM call belongs in this pipeline.
    Everything upstream has already done the cheap filtering.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a helpful support reply for this &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ticket:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# your actual LLM client call
&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_crm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Ticket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Tool/API step. A database write, no reasoning involved.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;crm_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_ticket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_ticket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Ticket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classify_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# deterministic ML
&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;needs_generated_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;       &lt;span class="c1"&gt;# cheap gate
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;draft_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# LLM only when needed
&lt;/span&gt;
    &lt;span class="nf"&gt;update_crm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# tool/API
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The structure matters more than the specific classifier you plug in. I've seen teams spend a week picking the "best" classifier model when the real win was just moving classification out of the prompt in the first place. The LLM call sits behind two cheap gates: classification, and a boolean check on whether a response is even warranted. Spam tickets never reach the LLM. Routine billing tickets that match a known pattern could, in a more developed version, skip the LLM entirely and use a templated response instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Illustrative cost comparison
&lt;/h2&gt;

&lt;p&gt;To be clear: these are example numbers to illustrate the order of magnitude, not measured results from a specific deployment. Your actual costs depend on your provider, model choice, and ticket volume.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Classification&lt;/th&gt;
&lt;th&gt;Response generation&lt;/th&gt;
&lt;th&gt;Total for 10,000 tickets/month (~30% spam, ~70% need replies)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Everything through LLM&lt;/td&gt;
&lt;td&gt;LLM call per ticket&lt;/td&gt;
&lt;td&gt;LLM call per ticket&lt;/td&gt;
&lt;td&gt;LLM called 10,000 times&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routed architecture&lt;/td&gt;
&lt;td&gt;Cheap classifier per ticket&lt;/td&gt;
&lt;td&gt;LLM call only for non-spam&lt;/td&gt;
&lt;td&gt;LLM called ~7,000 times&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Even in this simple example, routing alone removes 30% of the most expensive calls before any model swap. Add templated responses for common patterns and caching for repeated questions, and the LLM call count drops further still — usually by more than switching models would save on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to actually reach for a smaller model
&lt;/h2&gt;

&lt;p&gt;This isn't an argument against using cheaper LLMs. It's an argument for using them in the right place. Once you've separated deterministic work from generative work, "should I use a smaller/cheaper model" becomes a much narrower question: applied only to the generation step, where it belongs, instead of bolted onto everything.&lt;/p&gt;

&lt;p&gt;A reasonable order of operations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Map your workflow&lt;/strong&gt; against the four components above. Be honest about which steps are actually classification/extraction/ranking versus genuine language generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Move deterministic steps out of the prompt.&lt;/strong&gt; Classification, routing, scoring, structured extraction — these usually have a non-LLM solution that's faster and cheaper, even if it takes more upfront engineering than "just ask the LLM to do it."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gate the LLM call.&lt;/strong&gt; Don't generate a response for tickets that don't need one. Don't summarize content nobody asked to see.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only then, evaluate model size&lt;/strong&gt; for what's left. If you're still calling an LLM 10,000 times a month for response generation, that's the point where comparing model tiers actually matters.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;A scoped-tools agent and a scoped-architecture pipeline are solving the same problem: give an expensive, general-purpose reasoning engine less to think about, so it spends its compute on the one thing it's actually needed for.&lt;/p&gt;

&lt;p&gt;I'll admit the smaller-model conversation is more fun to have. It feels like progress swap a config value, watch the bill drop a little, move on. Rearchitecting which steps even touch the LLM is slower and less satisfying in the short term. But it's usually where the real savings are sitting, untouched, while everyone argues about which model is cheapest per token.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
