<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Juan</title>
    <description>The latest articles on DEV Community by Juan (@juanok).</description>
    <link>https://dev.to/juanok</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864719%2F703a7b7a-a551-44c6-882f-86824afd6ef1.jpeg</url>
      <title>DEV Community: Juan</title>
      <link>https://dev.to/juanok</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/juanok"/>
    <language>en</language>
    <item>
      <title>I was mass-sending everything to GPT-4. Here's what I changed.</title>
      <dc:creator>Juan</dc:creator>
      <pubDate>Tue, 07 Apr 2026 01:19:39 +0000</pubDate>
      <link>https://dev.to/juanok/i-was-mass-sending-everything-to-gpt-4-heres-what-i-changed-20jh</link>
      <guid>https://dev.to/juanok/i-was-mass-sending-everything-to-gpt-4-heres-what-i-changed-20jh</guid>
      <description>&lt;p&gt;I'm a solo dev from Argentina building AI tools. For months I was doing what most of us do — every API call went straight to GPT-4 (now GPT-4o). Summarization? GPT-4. Formatting a JSON? GPT-4. Answering "what's 2+2"? You guessed it.&lt;/p&gt;

&lt;p&gt;Then I looked at my bill and did some math.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The numbers that made me stop&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's what the main LLM providers charge per 1M tokens right now:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B (via Groq)&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.5 Sonnet&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Look at that gap between GPT-4o and Llama. That's a 50x price difference on input tokens.&lt;/p&gt;

&lt;p&gt;And here's the thing — for probably 70% of what I was sending to GPT-4o, Llama would've given me the same answer.&lt;/p&gt;
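&lt;p&gt;To make that concrete, here's a quick back-of-envelope sketch using the input prices from the table. The monthly volume and the 70% split are illustrative assumptions, not measured numbers:&lt;/p&gt;

```python
# Back-of-envelope savings estimate using the table's input prices.
# Traffic volume and cheap-model share are made up for illustration.
GPT4O_IN = 2.50          # $ per 1M input tokens
LLAMA_IN = 0.05          # $ per 1M input tokens (via Groq)

monthly_tokens_m = 100   # hypothetical: 100M input tokens per month
cheap_share = 0.70       # share of requests a small model can handle

all_gpt4o = monthly_tokens_m * GPT4O_IN
routed = monthly_tokens_m * (cheap_share * LLAMA_IN + (1 - cheap_share) * GPT4O_IN)

print(f"all GPT-4o: ${all_gpt4o:.2f}")   # $250.00
print(f"routed:     ${routed:.2f}")      # $78.50
```

&lt;p&gt;Under those (invented) numbers, routing cuts the input-token bill by roughly two thirds before you even touch output pricing.&lt;/p&gt;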

&lt;p&gt;&lt;strong&gt;What I tried first&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The obvious solution: just add some if/else logic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="n"&gt;pythonif&lt;/span&gt; &lt;span class="nf"&gt;is_simple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-3.1-8b-instant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sounds easy. It's not.&lt;/p&gt;

&lt;p&gt;What's "simple"? How do you define that? Token count? Keywords? And then you need different API clients for OpenAI vs Groq. Different error handling. Fallback logic when one provider goes down. Rate limiting per provider.&lt;br&gt;
It turned into spaghetti real fast.&lt;/p&gt;
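&lt;p&gt;For what it's worth, here is one hypothetical way to pin down "simple" with pure heuristics. The keyword list and word-count cutoff are invented for illustration:&lt;/p&gt;

```python
import re

# Hypothetical heuristics for is_simple(). The keyword list and the
# word-count threshold are illustrative guesses, not real product rules.
COMPLEX_HINTS = re.compile(
    r"\b(step by step|analyze|refactor|prove|debug|architecture)\b",
    re.IGNORECASE,
)

def is_simple(prompt: str, max_words: int = 40) -> bool:
    """Route short prompts with no 'hard task' keywords to the cheap model."""
    if COMPLEX_HINTS.search(prompt):
        return False
    if len(prompt.split()) > max_words:
        return False
    return True

print(is_simple("what's 2+2"))                         # True
print(is_simple("refactor this module step by step"))  # False
```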

&lt;p&gt;&lt;strong&gt;What I actually ended up building&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent a few months building a proxy that handles all of this automatically. You point your OpenAI SDK at it, and it figures out which model to use per request.&lt;/p&gt;

&lt;p&gt;The routing logic is basically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Classify the prompt — is it casual chat, coding, analysis, math, translation? Each has a different complexity baseline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check complexity — token count, multi-step signals, risk level.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Route — simple stuff goes to Llama (nearly free), complex stuff goes to GPT-4o.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Validate — a background process compares cheap vs premium responses to catch quality drops. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
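&lt;p&gt;The classify-then-score-then-route steps above can be sketched roughly like this. The categories, baseline scores, and cutoff are hypothetical placeholders, not the actual product logic:&lt;/p&gt;

```python
# Rough sketch of the routing pipeline: classify -> score complexity -> route.
# Baselines, keyword checks, and the cutoff are invented for illustration.
BASELINE = {"chat": 0, "translation": 1, "coding": 2, "math": 3, "analysis": 3}

def classify(prompt: str) -> str:
    p = prompt.lower()
    if "translate" in p:
        return "translation"
    if "code" in p or "function" in p:
        return "coding"
    if "solve" in p or "calculate" in p:
        return "math"
    if "analyze" in p or "compare" in p:
        return "analysis"
    return "chat"

def complexity(prompt: str, category: str) -> int:
    score = BASELINE[category]
    score += len(prompt.split()) // 100   # long prompts score higher
    score += prompt.count("\n") // 5      # multi-step structure signal
    return score

def route(prompt: str) -> str:
    cat = classify(prompt)
    # scores of 2 or more go premium; the cutoff is arbitrary, tune it
    if complexity(prompt, cat) > 1:
        return "gpt-4o"
    return "llama-3.1-8b-instant"

print(route("hello there"))                    # llama-3.1-8b-instant
print(route("analyze this quarterly report"))  # gpt-4o
```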

&lt;p&gt;The integration looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

# just change the base_url, that's it
client = OpenAI(
    base_url="https://your-proxy-url/v1",
    api_key="your-routing-key"
)

# use it exactly like before
response = client.chat.completions.create(
    model="gpt-4o",  # the router overrides this
    messages=[{"role": "user", "content": "hello"}]
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;One line change. Everything else stays the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A few things that surprised me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Most classification doesn't need AI. I started with GPT-4o-mini classifying each prompt (ironic, paying for AI to decide if I should pay for AI). Switched to regex + heuristics. Zero cost, runs in &amp;lt;1ms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fallbacks matter more than routing. If Groq goes down and you don't have a fallback, your users don't care about your 80% savings. They care that the app is broken.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Quality validation is the hard part. Routing is easy. Knowing if the cheap model gave a good enough answer — that's the real problem. I built a shadow engine that samples responses and compares them. Still not perfect.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where I'm at now&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;$0 MRR. Zero paying customers. The product works — I use it myself every day. But I'm at the "talking to people" stage now, which is way harder than building.&lt;/p&gt;

&lt;p&gt;If you're spending $100+/mo on LLM APIs and want to try it, the project is called NeuralRouting. Free tier available, no credit card.&lt;/p&gt;

&lt;p&gt;But honestly I'm more interested in hearing: how are you handling multi-model routing? Roll your own? Using a gateway? Just sending everything to one model and accepting the cost?&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
