<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Megha mukherjee</title>
    <description>The latest articles on DEV Community by Megha mukherjee (@megha_mukherjee_5eb776f2b).</description>
    <link>https://dev.to/megha_mukherjee_5eb776f2b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3934575%2Fe13fdbdb-c511-4781-a470-8c188e85f23c.png</url>
      <title>DEV Community: Megha mukherjee</title>
      <link>https://dev.to/megha_mukherjee_5eb776f2b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/megha_mukherjee_5eb776f2b"/>
    <language>en</language>
    <item>
      <title>Three LLM Infrastructure Problems That Shouldn't Exist in 2026</title>
      <dc:creator>Megha mukherjee</dc:creator>
      <pubDate>Wed, 27 May 2026 16:33:25 +0000</pubDate>
      <link>https://dev.to/megha_mukherjee_5eb776f2b/three-llm-infrastructure-problems-that-shouldnt-exist-in-2026-3a88</link>
      <guid>https://dev.to/megha_mukherjee_5eb776f2b/three-llm-infrastructure-problems-that-shouldnt-exist-in-2026-3a88</guid>
      <description>&lt;p&gt;LLM infrastructure has three problems that shouldn't exist in 2026. Here's what we built because nobody else fixed them.&lt;/p&gt;




&lt;h2&gt;Problem 1: Your LLM bill is unnecessarily high&lt;/h2&gt;

&lt;p&gt;The default is to route everything to GPT-4, because who has time to configure per-query routing? The bill ends up 3-5x what it should be, for zero extra value.&lt;/p&gt;

&lt;p&gt;People are already switching because of this. A dev on X: &lt;em&gt;"Cancelled both my Claude Code Pro and ChatGPT Pro. Kimi K2.6 is just as good for my side projects as Opus or GPT 5.4 were. The price for this is crazy low."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Another one: &lt;em&gt;"Just used gemini-embedding-2 to vectorize 27,603 notes for semantic search. Total cost: $0.07. That's pretty amazing."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The pattern is obvious: developers are actively looking for cheaper alternatives. The hard part is picking the cheaper model query by query without spending time on it.&lt;/p&gt;

&lt;p&gt;We built a router that classifies every query by complexity and sends it to the cheapest capable model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Design a clinical trial protocol&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nf"&gt;premium  &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;$2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;M&lt;/span&gt; &lt;span class="nx"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Write a Python sort function&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;      &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nf"&gt;groq     &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;$0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;M&lt;/span&gt; &lt;span class="nx"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What is 2+2?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;                      &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nf"&gt;free     &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;$0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;M&lt;/span&gt; &lt;span class="nx"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: &lt;strong&gt;62% cost savings&lt;/strong&gt; measured across 200 real API calls. Not theoretical.&lt;/p&gt;
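
&lt;p&gt;To make the idea concrete, here is a minimal sketch of complexity-based routing. It is not A3M's actual classifier: the keyword lists and the &lt;code&gt;classifyQuery&lt;/code&gt; and &lt;code&gt;routeQuery&lt;/code&gt; names are assumptions invented for this example. Only the three tiers and their prices come from the example above.&lt;/p&gt;

```javascript
// Tier names and prices mirror the example above.
const TIERS = {
  premium: { pricePerMTokens: 2.5 },
  groq: { pricePerMTokens: 0.2 },
  free: { pricePerMTokens: 0.0 },
};

// Hypothetical keyword hints (assumption): a real classifier would use
// richer signals than substring matches.
const COMPLEX_HINTS = ["design", "protocol", "architecture"];
const CODE_HINTS = ["function", "python", "javascript", "sql", "regex"];

// True if the text contains at least one of the hints.
function containsAny(text, hints) {
  return hints.some(function (hint) {
    return text.includes(hint);
  });
}

// Map a query to the cheapest tier assumed capable of answering it.
function classifyQuery(query) {
  const text = query.toLowerCase();
  if (containsAny(text, COMPLEX_HINTS)) return "premium";
  if (containsAny(text, CODE_HINTS)) return "groq";
  return "free";
}

// Return the chosen tier together with its price per million tokens.
function routeQuery(query) {
  const tier = classifyQuery(query);
  return { tier: tier, pricePerMTokens: TIERS[tier].pricePerMTokens };
}
```

&lt;p&gt;With these hints, the three example queries above land on the same tiers as shown. Checking the expensive tier first is a deliberate choice in this sketch: a query that looks both complex and code-related is sent to the stronger model rather than the cheaper one.&lt;/p&gt;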




&lt;h2&gt;Problem 2: Sequential fallback gives you one answer, not the best&lt;/h2&gt;

&lt;p&gt;Every gateway does: try A → fail → try B → fail → try C.&lt;/p&gt;

&lt;p&gt;You always get the answer from the first provider that succeeds, never the best answer across all of them. And if A is slow, everything behind it waits.&lt;/p&gt;

&lt;p&gt;Someone already built &lt;code&gt;ai-retry&lt;/code&gt;, a library for retry and fallback mechanisms, because this pain is so common. People are hacking around it manually.&lt;/p&gt;

&lt;p&gt;We went further. Run all providers in parallel. Score every result on specificity, structure, and relevance. Return the best answer, along with the reasons it won.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;executeEnsemble&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;nvidia&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;callNvidia&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;groq&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;callGroq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;callOpenAI&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="c1"&gt;// → nvidia (scored 75, higher specificity on code)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
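
&lt;p&gt;For readers who want to see the shape of the technique, here is a minimal sketch of a parallel ensemble. It is not the implementation behind &lt;code&gt;executeEnsemble&lt;/code&gt;: the &lt;code&gt;runEnsemble&lt;/code&gt; and &lt;code&gt;scoreAnswer&lt;/code&gt; names, the scoring heuristics, and the weights are all assumptions. The sketch only assumes each provider is a function that takes the query and returns (or resolves to) a string.&lt;/p&gt;

```javascript
// Hypothetical scorer (assumption): crude stand-ins for the three criteria
// named above. Specificity = distinct words (capped at 50), structure =
// multi-line answer (25), relevance = share of query words reused (up to 25).
function scoreAnswer(query, answer) {
  const answerWords = new Set(answer.toLowerCase().split(/\W+/).filter(Boolean));
  const queryWords = query.toLowerCase().split(/\W+/).filter(Boolean);
  const specificity = Math.min(answerWords.size, 50);
  const structure = answer.includes("\n") ? 25 : 0;
  const matched = queryWords.filter(function (word) {
    return answerWords.has(word);
  }).length;
  const relevance =
    queryWords.length === 0 ? 0 : Math.round((25 * matched) / queryWords.length);
  return specificity + structure + relevance;
}

// Call every provider at once, ignore the ones that fail, return the top score.
async function runEnsemble(query, providers) {
  const names = Object.keys(providers);
  const settled = await Promise.allSettled(
    names.map(function (name) {
      // Promise.resolve().then(...) also captures synchronous throws.
      return Promise.resolve().then(function () {
        return providers[name](query);
      });
    })
  );

  const scored = [];
  settled.forEach(function (outcome, index) {
    if (outcome.status === "fulfilled") {
      scored.push({
        provider: names[index],
        answer: outcome.value,
        score: scoreAnswer(query, outcome.value),
      });
    }
  });

  if (scored.length === 0) throw new Error("all providers failed");
  scored.sort(function (a, b) {
    return b.score - a.score;
  });
  return scored[0];
}
```

&lt;p&gt;Calling &lt;code&gt;runEnsemble(query, { groq: callGroq, openai: callOpenAI })&lt;/code&gt; resolves to the top-scored &lt;code&gt;{ provider, answer, score }&lt;/code&gt;. Using &lt;code&gt;Promise.allSettled&lt;/code&gt; means one failing provider does not sink the rest, but total latency is still bounded by the slowest provider unless you add a timeout.&lt;/p&gt;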






&lt;h2&gt;Problem 3: Every gateway claims "negligible overhead." None publish numbers.&lt;/h2&gt;

&lt;p&gt;It's the standard line. "Negligible overhead" followed by zero data.&lt;/p&gt;

&lt;p&gt;We ran ours through a third-party benchmark tool (llm-gateway-bench) and published everything:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;What's included&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Direct to Groq&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;138ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Raw API call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Through A3M&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;374ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Routing + cache + guardrails + cost tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's 236ms of overhead (374ms minus 138ms). Not zero. But the router saves 62% on API costs, which works out to ~$2,600/year at 100K queries/month.&lt;/p&gt;
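
&lt;p&gt;If you want to sanity-check those figures, the arithmetic is short. The inputs below are the numbers quoted in this post. The baseline spend and per-query cost are not stated anywhere in it: they are back-calculated here on the assumption that the $2,600 figure is exactly 62% of the unrouted bill.&lt;/p&gt;

```javascript
// Inputs taken from the post.
const directMs = 138; // direct call to Groq
const gatewayMs = 374; // same call through the gateway
const savingsRate = 0.62; // measured cost savings
const savedPerYearUsd = 2600; // quoted yearly savings
const queriesPerYear = 100000 * 12; // 100K queries/month

// Latency added by routing, cache, guardrails, and cost tracking.
const overheadMs = gatewayMs - directMs;

// Implied figures (assumption: savings = savingsRate * baseline spend).
const impliedBaselineUsd = savedPerYearUsd / savingsRate;
const impliedCostPerQueryUsd = impliedBaselineUsd / queriesPerYear;

console.log(overheadMs); // 236
console.log(impliedBaselineUsd.toFixed(0)); // "4194"
console.log(impliedCostPerQueryUsd.toFixed(4)); // "0.0035"
```

&lt;p&gt;In other words, the quoted savings imply an unrouted bill of roughly $4,200/year, or about a third of a cent per query. Plug in your own volume and per-query cost to see whether 236ms is a fair trade for your workload.&lt;/p&gt;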




&lt;h2&gt;Why it grew&lt;/h2&gt;

&lt;p&gt;10,024 downloads in 14 days. Zero marketing. Developers found it on npm, tried it, told other developers.&lt;/p&gt;

&lt;p&gt;The feedback loop was: &lt;em&gt;"My bill is too high"&lt;/em&gt; → 62% savings. &lt;em&gt;"I want the best answer, not the first one"&lt;/em&gt; → parallel ensemble. &lt;em&gt;"I don't trust your latency claims"&lt;/em&gt; → here's the third-party benchmark, run it yourself.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;npm: &lt;code&gt;npm install adaptive-memory-multi-model-router&lt;/code&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;GitHub: &lt;a href="https://github.com/Das-rebel/a3m-router" rel="noopener noreferrer"&gt;github.com/Das-rebel/a3m-router&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Benchmarks: third-party via &lt;a href="https://github.com/taffy-owo/llm-gateway-bench" rel="noopener noreferrer"&gt;llm-gateway-bench&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>typescript</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
