<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dhruv Kapadia</title>
    <description>The latest articles on DEV Community by Dhruv Kapadia (@dhruv_kapadia_703eadaa762).</description>
    <link>https://dev.to/dhruv_kapadia_703eadaa762</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4004472%2F857f8f61-5b8f-4a4d-b8e2-d4bdcb30d09a.png</url>
      <title>DEV Community: Dhruv Kapadia</title>
      <link>https://dev.to/dhruv_kapadia_703eadaa762</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dhruv_kapadia_703eadaa762"/>
    <language>en</language>
    <item>
      <title>Cutting our LLM bill ~80% with model routing: the actual cost math</title>
      <dc:creator>Dhruv Kapadia</dc:creator>
      <pubDate>Fri, 26 Jun 2026 19:23:40 +0000</pubDate>
      <link>https://dev.to/dhruv_kapadia_703eadaa762/cutting-our-llm-bill-80-with-model-routing-the-actual-cost-math-mfk</link>
      <guid>https://dev.to/dhruv_kapadia_703eadaa762/cutting-our-llm-bill-80-with-model-routing-the-actual-cost-math-mfk</guid>
      <description>&lt;p&gt;Most teams I talk to run every LLM call through one frontier model, then act surprised when the invoice shows up. We did the same thing for a while. The fix that actually moved our bill was boring: route each request to the cheapest model that can still do the job. Here is the math and how we set it up.&lt;/p&gt;

&lt;h2&gt;The price spread is bigger than people assume&lt;/h2&gt;

&lt;p&gt;If you line up current API pricing across providers, the gap between budget and frontier models for comparable output is roughly &lt;strong&gt;50x&lt;/strong&gt; per token. Output tokens also cost more than input, usually in the 4-6x range, which matters a lot if your app generates long responses.&lt;/p&gt;

&lt;p&gt;So the question is not "which model is best." It is "which model is good enough for &lt;em&gt;this&lt;/em&gt; request, at what cost." For a support reply, a classification, or a short summary, a mid-tier model often produces output you cannot distinguish from the frontier one in a blind test. You are paying frontier prices for work a cheaper model finishes fine.&lt;/p&gt;

&lt;h2&gt;What routing looks like in practice&lt;/h2&gt;

&lt;p&gt;The pattern is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Classify the incoming task (intent, complexity, how much it would cost if it went wrong).&lt;/li&gt;
&lt;li&gt;Pick the cheapest model that clears the quality bar for that class.&lt;/li&gt;
&lt;li&gt;Fall back to a stronger model if a confidence check or validation step fails.&lt;/li&gt;
&lt;/ol&gt;
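
&lt;p&gt;The three steps above can be sketched in a few lines. This is a minimal illustration, not our production router: the &lt;code&gt;Model&lt;/code&gt; type, the keyword-based &lt;code&gt;classify&lt;/code&gt; rule, the &lt;code&gt;allowed&lt;/code&gt; mapping, and the prices are all hypothetical stand-ins, and a real classifier and validator would be more involved.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Set, Tuple


# Every name and price here is a hypothetical placeholder, not a real model or rate.
@dataclass(frozen=True)
class Model:
    name: str
    usd_per_m_output: float  # assumed price, used only to order candidates
    call: Callable[[str], str]  # stand-in for a provider SDK call


def classify(prompt: str) -> str:
    """Step 1: toy rule standing in for a real intent/complexity classifier."""
    high_stakes_words = ("legal", "medical", "contract")
    if any(word in prompt.lower() for word in high_stakes_words):
        return "high_stakes"
    return "complex" if len(prompt.split()) > 50 else "simple"


def route(
    prompt: str,
    models: Iterable[Model],
    allowed: Dict[str, Set[str]],
    validate: Callable[[str], bool],
) -> Tuple[str, str]:
    """Steps 2 and 3: try allowed models cheapest first, escalate on failed validation."""
    task_class = classify(prompt)
    candidates = sorted(
        (m for m in models if m.name in allowed[task_class]),
        key=lambda m: m.usd_per_m_output,
    )
    for model in candidates:
        answer = model.call(prompt)
        if validate(answer):
            return model.name, answer
    raise RuntimeError(f"no allowed model produced a valid answer for class {task_class!r}")
```

&lt;p&gt;The &lt;code&gt;allowed&lt;/code&gt; mapping is where the quality bar lives: a class only lists a cheaper model once that model has been shown to be good enough for it, and high-stakes classes list the strong model alone.&lt;/p&gt;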

&lt;p&gt;A rough example from our own traffic. Say a workflow does 1M requests a month, averaging 500 input tokens and 800 output tokens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Everything on a frontier model: you are paying frontier output rates on all 800M output tokens.&lt;/li&gt;
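&lt;li&gt;Route ~70% of those (the simple classes) to a mid-tier model at a fraction of the per-token cost and keep the hard 30% on frontier: the blended cost drops sharply, and in our case it landed around 80% lower than the all-frontier baseline. Savings from routing alone can never exceed the share of baseline spend you move off frontier, so a figure above 70% means the routed requests carried more than 70% of the original token cost.&lt;/li&gt;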
&lt;/ul&gt;
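
&lt;p&gt;Here is a sketch of that arithmetic with the numbers from the example. The per-token prices are assumptions picked for illustration (a frontier model at $3 input and $15 output per 1M tokens, and a mid-tier model ten times cheaper), not real quotes, and every request is assumed to have the same 500/800 token shape.&lt;/p&gt;

```python
REQUESTS = 1_000_000        # requests per month, from the example above
IN_TOK, OUT_TOK = 500, 800  # average tokens per request, from the example above

# Hypothetical USD prices per 1M tokens; swap in your providers' real rates.
FRONTIER = {"input": 3.00, "output": 15.00}
MID_TIER = {"input": 0.30, "output": 1.50}


def monthly_cost(requests: int, price: dict) -> float:
    """Cost in USD of `requests` uniform requests at the given per-1M-token prices."""
    per_request = IN_TOK * price["input"] + OUT_TOK * price["output"]
    return requests * per_request / 1_000_000


def blended_cost(routed_share: float) -> float:
    """Cost when `routed_share` of requests go to mid-tier and the rest stay on frontier."""
    routed = round(REQUESTS * routed_share)
    return monthly_cost(routed, MID_TIER) + monthly_cost(REQUESTS - routed, FRONTIER)


def savings(routed_share: float) -> float:
    """Fraction saved versus sending everything to the frontier model."""
    return 1 - blended_cost(routed_share) / monthly_cost(REQUESTS, FRONTIER)


def share_needed(target_savings: float) -> float:
    """Routed share required to hit a savings target, given the assumed price gap."""
    price_ratio = monthly_cost(1, MID_TIER) / monthly_cost(1, FRONTIER)
    return target_savings / (1 - price_ratio)


print(f"baseline:  ${monthly_cost(REQUESTS, FRONTIER):,.0f}")
print(f"70% routed: ${blended_cost(0.70):,.0f} ({savings(0.70):.0%} saved)")
print(f"share needed for 80% saved: {share_needed(0.80):.0%}")
```

&lt;p&gt;Under these assumptions the baseline is $13,500 a month and a 70/30 split saves about 63%. With uniform request sizes, getting to roughly 80% would take about 89% of traffic on the cheaper model, so your own result depends heavily on how tokens are distributed across task classes and on the actual price gap.&lt;/p&gt;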

&lt;p&gt;The savings are not magic. They come from the fact that most production traffic is not hard, and the price curve between "good enough" and "best" is steep.&lt;/p&gt;

&lt;h2&gt;The parts that bite you&lt;/h2&gt;

&lt;p&gt;Routing is not free to run. A few things I would not skip:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;An eval harness.&lt;/strong&gt; You need to measure quality per task class before and after you move it to a cheaper model, or you are guessing. Without this you will either over-route and ship worse output, or under-route and keep overpaying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A real fallback.&lt;/strong&gt; If the cheap model returns low confidence or fails a schema check, escalate. The escalation rate tells you whether your routing thresholds are set right.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency, not just cost.&lt;/strong&gt; Sometimes the cheaper model is also faster, sometimes not. Track both.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Know when not to route.&lt;/strong&gt; High-stakes output (legal, medical, anything a human acts on directly) is where you keep the strong model and eat the cost. Routing is for the long tail of ordinary requests, not the 1% that has to be right.&lt;/li&gt;
&lt;/ul&gt;
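
&lt;p&gt;The first two items can start out very small. The sketch below assumes you already have per-example quality scores between 0 and 1 from some grader (human or automated) and that your router reports whether each request escalated. The function and class names, and the 0.02 tolerance, are hypothetical choices for illustration.&lt;/p&gt;

```python
from collections import Counter
from typing import Sequence


def can_route(
    cheap_scores: Sequence[float],
    frontier_scores: Sequence[float],
    max_drop: float = 0.02,  # assumed tolerance; tune per task class
) -> bool:
    """Eval gate: allow the cheaper model only if mean quality drops by at most max_drop."""
    cheap_mean = sum(cheap_scores) / len(cheap_scores)
    frontier_mean = sum(frontier_scores) / len(frontier_scores)
    return cheap_mean >= frontier_mean - max_drop


class EscalationTracker:
    """Counts how often each task class falls back to the stronger model."""

    def __init__(self) -> None:
        self.total: Counter = Counter()
        self.escalated: Counter = Counter()

    def record(self, task_class: str, escalated: bool) -> None:
        self.total[task_class] += 1
        if escalated:
            self.escalated[task_class] += 1

    def rate(self, task_class: str) -> float:
        """Escalation rate for a class, or 0.0 if no requests have been seen."""
        seen = self.total[task_class]
        return self.escalated[task_class] / seen if seen else 0.0
```

&lt;p&gt;A class whose escalation rate keeps climbing is a sign the threshold is too loose or the class definition is too broad. Recording latency alongside each call in the same tracker is a natural next step.&lt;/p&gt;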

&lt;h2&gt;Build vs buy&lt;/h2&gt;

&lt;p&gt;You can build this yourself with a classifier in front of a few provider SDKs, plus the eval and fallback logic above. It is a reasonable weekend prototype, but a real project to run in production.&lt;/p&gt;

&lt;p&gt;The other option is a gateway that sits in front of the providers and does the routing for you. That is the part of the problem I work on day to day at Coworker, where the &lt;a href="https://coworker.ai/llm-gateway" rel="noopener noreferrer"&gt;LLM gateway&lt;/a&gt; routes each task across OpenAI, Anthropic, Google, and open models and connects to the tools a request actually needs. Either way, the lever is the same: stop sending easy work to expensive models.&lt;/p&gt;

&lt;p&gt;If you just want to sanity-check your own spend before changing anything, we put the per-model 2026 pricing into a free &lt;a href="https://coworker.ai/llm-cost-calculator" rel="noopener noreferrer"&gt;LLM cost calculator&lt;/a&gt; so you can plug in your token volumes and see the spread for yourself.&lt;/p&gt;

&lt;h2&gt;The takeaway&lt;/h2&gt;

&lt;p&gt;The single biggest AI cost win for most teams is not a smaller context window or a prompt tweak. It is admitting that most requests do not need your best model, then routing accordingly. Measure quality per task class, set a fallback, and let price do the rest.&lt;/p&gt;

&lt;p&gt;What are you routing on in production: task complexity, intent, something else? Curious how other people are drawing the line.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
