<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jinav Shah</title>
    <description>The latest articles on DEV Community by Jinav Shah (@jinav_shah_6c63fcc03c0b9e).</description>
    <link>https://dev.to/jinav_shah_6c63fcc03c0b9e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3986615%2F0afee9e8-a0a8-44e9-a708-64861feead31.jpg</url>
      <title>DEV Community: Jinav Shah</title>
      <link>https://dev.to/jinav_shah_6c63fcc03c0b9e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jinav_shah_6c63fcc03c0b9e"/>
    <language>en</language>
    <item>
      <title>You're probably using AI wrong. And it's costing you more than you think.</title>
      <dc:creator>Jinav Shah</dc:creator>
      <pubDate>Tue, 16 Jun 2026 04:58:54 +0000</pubDate>
      <link>https://dev.to/jinav_shah_6c63fcc03c0b9e/youre-probably-using-ai-wrong-and-its-costing-you-more-than-you-think-1ilc</link>
      <guid>https://dev.to/jinav_shah_6c63fcc03c0b9e/youre-probably-using-ai-wrong-and-its-costing-you-more-than-you-think-1ilc</guid>
      <description>&lt;p&gt;Most companies today have one AI setup: send everything to the most powerful model available. Pay the bill. Repeat.&lt;/p&gt;

&lt;p&gt;It works. But it's expensive, slower than it needs to be, and honestly — a bit like hiring a surgeon to change a lightbulb.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem nobody talks about
&lt;/h2&gt;

&lt;p&gt;Imagine a hospital where every patient — whether they need open-heart surgery or a bandage on a paper cut — is seen by the senior consultant first.&lt;/p&gt;

&lt;p&gt;The consultant is brilliant. But the waiting room is chaos. The costs are sky-high. And half his time is spent on things a nurse could have handled in two minutes.&lt;/p&gt;

&lt;p&gt;That's what most AI pipelines look like today.&lt;/p&gt;

&lt;p&gt;When your team sends something to an AI model, it might be a Python file, a customer complaint in Hindi, a SQL query, or a casual Hinglish support ticket. These are completely different problems requiring different expertise, different depth, different cost.&lt;/p&gt;

&lt;p&gt;Yet most systems send them all to the same model, at the same price, with the same wait time.&lt;/p&gt;




&lt;h2&gt;
  
  
  The smarter approach: right model for the right job
&lt;/h2&gt;

&lt;p&gt;Some inputs have hard, deterministic boundaries. A &lt;code&gt;.py&lt;/code&gt; file contains Python. A &lt;code&gt;.sql&lt;/code&gt; file contains SQL. You don't need the most powerful AI in the world to figure that out — you need a rule.&lt;/p&gt;
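&lt;p&gt;To make "you need a rule" concrete, here is a minimal sketch of extension-based routing. The specialist names and the &lt;code&gt;EXTENSION_RULES&lt;/code&gt; table are hypothetical, chosen only for illustration; anything without a matching rule returns &lt;code&gt;None&lt;/code&gt; and would fall through to a model-based decision.&lt;/p&gt;

```python
from pathlib import Path

# Hypothetical mapping from file extension to specialist name (illustrative only).
EXTENSION_RULES = {
    ".py": "python_specialist",
    ".sql": "sql_specialist",
}


def route_by_extension(filename):
    """Return a specialist name for a known extension, or None if no rule applies."""
    suffix = Path(filename).suffix.lower()
    return EXTENSION_RULES.get(suffix)


print(route_by_extension("report.py"))
print(route_by_extension("ticket.txt"))
```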

&lt;p&gt;Here's what a smarter pipeline looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input arrives
      ↓
Orchestrator SLM — a small, fast model that reads
the input and decides: what is this, who handles it?
      ↓
├── Python file   → Python specialist model
├── SQL query     → SQL specialist model  
├── Hindi doc     → Hindi specialist model
└── Ambiguous     → Frontier model directly
      ↓
Specialist outputs + original input
→ Frontier model → Final answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no separate routing system to build. The orchestrator is itself a small language model (the "SLM" in the diagram), trained to classify inputs and direct traffic. Per request, it costs a small fraction of a frontier call.&lt;/p&gt;

&lt;p&gt;The powerful frontier model — your Claude, your GPT-4 — stays in the loop for the final answer. It just isn't doing the sorting anymore.&lt;/p&gt;
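&lt;p&gt;The flow in the diagram can be sketched in a few lines. Everything here is a stand-in: &lt;code&gt;orchestrator&lt;/code&gt; uses keyword checks where a real system would call a small model, and &lt;code&gt;SPECIALISTS&lt;/code&gt; and &lt;code&gt;frontier&lt;/code&gt; are placeholder functions, not real model clients. The point is only the shape of the control flow.&lt;/p&gt;

```python
# All "models" below are placeholder functions (assumptions for illustration).

def orchestrator(text):
    """Stand-in for the small orchestrator model: returns a route label."""
    stripped = text.lstrip()
    if stripped.startswith(("def ", "import ", "class ")):
        return "python"
    if stripped.upper().startswith(("SELECT ", "INSERT ", "UPDATE ")):
        return "sql"
    return "ambiguous"


SPECIALISTS = {
    "python": lambda text: {"language": "python", "lines": len(text.splitlines())},
    "sql": lambda text: {"language": "sql", "lines": len(text.splitlines())},
}


def frontier(text, findings):
    """Stand-in for the frontier model: reports what it was given."""
    return {"input_chars": len(text), "findings": findings}


def run_pipeline(text):
    route = orchestrator(text)
    specialist = SPECIALISTS.get(route)
    # Ambiguous inputs skip the specialists and go to the frontier model directly.
    findings = [specialist(text)] if specialist else []
    # The frontier model always sees the original input plus any findings.
    return frontier(text, findings)


print(run_pipeline("SELECT id FROM users"))
```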




&lt;h2&gt;
  
  
  One insight most teams miss
&lt;/h2&gt;

&lt;p&gt;When specialist models pass findings to the frontier model, the instinct is to format outputs for human readability. Paragraphs. Explanations. Full sentences.&lt;/p&gt;

&lt;p&gt;Wrong target.&lt;/p&gt;

&lt;p&gt;The downstream consumer is another model — not a human. Specialist models should produce machine-readable structured output. Dense. Precise. No explanation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"issues_detected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"unbounded loop at line 47"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't for a person to read. It's for a model to consume efficiently. The brevity is trained into the specialist itself, not imposed by truncating its output at runtime. Prevention over correction.&lt;/p&gt;

&lt;p&gt;Think of it like a doctor handing a consultant a structured chart instead of a five-page narrative. Same information. Faster to read. More room to think.&lt;/p&gt;
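&lt;p&gt;One way to hold specialists to that shape is a small schema that serialises to compact JSON. This is a sketch, not a prescribed format: the &lt;code&gt;SpecialistFinding&lt;/code&gt; class is hypothetical, and its fields simply mirror the example above.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SpecialistFinding:
    # Field names mirror the JSON example in the article; the class is hypothetical.
    language: str
    issues_detected: list = field(default_factory=list)
    confidence: float = 0.0


def to_compact_json(finding):
    # separators=(",", ":") removes the spaces json.dumps inserts by default.
    return json.dumps(asdict(finding), separators=(",", ":"))


finding = SpecialistFinding("python", ["unbounded loop at line 47"], 0.94)
print(to_compact_json(finding))
```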




&lt;h2&gt;
  
  
  What happens when context gets too large?
&lt;/h2&gt;

&lt;p&gt;The frontier model has a finite context window, and outputs from multiple specialists fill it fast. Here's the fallback stack, in order:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First:&lt;/strong&gt; specialists send only essential structured signal. No reasoning traces. This is the default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If needed:&lt;/strong&gt; summarise the original input first. Compress before routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If still needed:&lt;/strong&gt; feed specialist outputs one at a time. The frontier model builds context incrementally. Slower, but accurate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Last resort:&lt;/strong&gt; skip specialists entirely. Raw input goes directly to the frontier model. Full cost, but the same quality you get from the frontier model today.&lt;/p&gt;

&lt;p&gt;The pipeline always has a path to the right answer. You're just choosing how much it costs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Does this actually save money?
&lt;/h2&gt;

&lt;p&gt;Honest answer: only at scale.&lt;/p&gt;

&lt;p&gt;Specialist models are open-source: free to license, but you pay for the compute to host them. A reasonable GPU setup costs roughly $1,000–$1,100 per month. The savings come from routing a large share of queries away from expensive frontier API calls.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Monthly AI API spend&lt;/th&gt;
&lt;th&gt;Does this make sense?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Below $2,000&lt;/td&gt;
&lt;td&gt;Probably not — keep it simple&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$3,000–$5,000&lt;/td&gt;
&lt;td&gt;Worth evaluating&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Above $5,000&lt;/td&gt;
&lt;td&gt;Very likely yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
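&lt;p&gt;The thresholds in the table follow from simple break-even arithmetic. The sketch below uses $1,050 per month for GPUs (the midpoint of the range above) and treats &lt;code&gt;avoided_share&lt;/code&gt;, the fraction of frontier spend the pipeline removes, as an assumption you would have to measure for your own traffic. It ignores engineering and maintenance cost.&lt;/p&gt;

```python
def monthly_net_savings(api_spend, avoided_share, gpu_cost=1050.0):
    """Net monthly savings in dollars: avoided frontier spend minus GPU cost."""
    return api_spend * avoided_share - gpu_cost


def break_even_spend(avoided_share, gpu_cost=1050.0):
    """Monthly API spend at which avoided spend exactly covers the GPU cost."""
    return gpu_cost / avoided_share


# avoided_share=0.5 is an illustrative assumption, not a measured figure.
for spend in (2000, 5000):
    print(spend, monthly_net_savings(spend, avoided_share=0.5))
print(break_even_spend(avoided_share=0.5))
```

&lt;p&gt;If the pipeline avoided half of frontier spend, break-even would sit near $2,100 per month, which is roughly where the table starts to say "worth evaluating".&lt;/p&gt;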

&lt;p&gt;&lt;strong&gt;One important caveat.&lt;/strong&gt; If your team currently uses Claude.ai, Claude Code, or any managed AI interface — this architecture means moving away from that. You'd be calling APIs directly from your own system, which means building and owning the interaction layer your employees use.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;How you use AI today&lt;/th&gt;
&lt;th&gt;What this means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Managed interface (Claude.ai, etc.)&lt;/td&gt;
&lt;td&gt;Build a custom interface first — factor in engineering cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Already using APIs with custom tooling&lt;/td&gt;
&lt;td&gt;Plugs in naturally&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  You've already seen this architecture
&lt;/h2&gt;

&lt;p&gt;If you've used agent mode in Cursor, the AI coding tool, you've likely experienced a version of this pattern without realising it.&lt;/p&gt;

&lt;p&gt;Cursor doesn't send your entire codebase to one model and hope for the best. From the outside, it looks like an orchestration step reads your request and decides what to do (read a file, search the codebase, run a terminal command), routes to the right tool, and then a frontier model synthesises the final response.&lt;/p&gt;

&lt;p&gt;Enterprise tools like Atlassian's Rovo are moving in the same direction for workplace workflows.&lt;/p&gt;

&lt;p&gt;The companies that built these tools figured out that one model doing everything is wasteful. The question is whether the AI pipelines &lt;em&gt;inside your organisation&lt;/em&gt; are designed with the same intelligence — or still sending every query to the most expensive model available.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real lesson
&lt;/h2&gt;

&lt;p&gt;Most AI cost and speed problems aren't model problems. They're routing problems.&lt;/p&gt;

&lt;p&gt;The best AI pipelines look less like "one genius doing everything" and more like a well-run team: a smart receptionist, skilled specialists, and senior judgment applied only where it genuinely matters.&lt;/p&gt;

&lt;p&gt;The question isn't which model is best.&lt;/p&gt;

&lt;p&gt;It's: &lt;strong&gt;are you using the right model for the right job?&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What routing decisions is your organisation making — or avoiding? Would love to hear in the comments.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Views expressed are my own and do not represent my employer.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
