<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sathvik 07</title>
    <description>The latest articles on DEV Community by Sathvik 07 (@sathviksu).</description>
    <link>https://dev.to/sathviksu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3907404%2F75b1b49e-6dd6-4312-847b-0d9057b1d2e2.png</url>
      <title>DEV Community: Sathvik 07</title>
      <link>https://dev.to/sathviksu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sathviksu"/>
    <language>en</language>
    <item>
      <title>How I built multi-model LLM routing on Groq's free tier</title>
      <dc:creator>Sathvik 07</dc:creator>
      <pubDate>Fri, 01 May 2026 11:19:48 +0000</pubDate>
      <link>https://dev.to/sathviksu/how-i-built-multi-model-llm-routing-on-groqs-free-tier-5618</link>
      <guid>https://dev.to/sathviksu/how-i-built-multi-model-llm-routing-on-groqs-free-tier-5618</guid>
      <description>&lt;p&gt;I hit Groq's token limits building an AI research paper analyser. Here's the routing system I built to get around it — and why it made the app better.&lt;/p&gt;

&lt;p&gt;I didn't plan to build a multi-model routing system.&lt;/p&gt;

&lt;p&gt;I was just trying to summarise a 40-page research paper without paying for an API.&lt;/p&gt;

&lt;p&gt;That's how Papers.ai started — a side project born out of frustration with how painful academic literature reviews are. You open a paper, it's 30 pages of dense methodology, and you spend 20 minutes just figuring out whether it's even relevant to what you're working on.&lt;/p&gt;

&lt;p&gt;I wanted to fix that. And I wanted to fix it for free.&lt;/p&gt;




&lt;h2&gt;The setup&lt;/h2&gt;

&lt;p&gt;The stack was simple at first: React frontend, Node.js backend, Firebase for auth and storage, and Groq as the LLM provider.&lt;/p&gt;

&lt;p&gt;Why Groq? Because it's fast. Genuinely, shockingly fast compared to most LLM APIs. And on the free tier, it's good enough to build real things.&lt;/p&gt;

&lt;p&gt;The plan was: user uploads a PDF → extract text → send to Groq → get a summary back. Done.&lt;/p&gt;
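
&lt;p&gt;In code, version one was barely more than this (a minimal sketch; &lt;code&gt;pdf-parse&lt;/code&gt;, the prompt, and the model choice are stand-ins, not the actual Papers.ai code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import Groq from 'groq-sdk';
import pdf from 'pdf-parse';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

async function summarisePdf(buffer) {
  const { text } = await pdf(buffer); // extract raw text from the uploaded PDF
  const completion = await groq.chat.completions.create({
    model: 'llama3-70b-8192',
    messages: [{ role: 'user', content: `Summarise this paper:\n\n${text}` }],
  });
  return completion.choices[0].message.content;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;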

&lt;p&gt;Except it wasn't done. Not even close.&lt;/p&gt;




&lt;h2&gt;The first wall I hit&lt;/h2&gt;

&lt;p&gt;Groq's free tier has token limits per model per minute. When you're summarising a research paper, you're often pushing 8,000–15,000 tokens in a single request. Hit that limit and you get a 429 error. Hit it repeatedly and your app becomes unusable.&lt;/p&gt;

&lt;p&gt;My first reaction was the obvious one: just truncate the paper. Send the first N tokens, get a summary.&lt;/p&gt;

&lt;p&gt;That worked. It was also terrible. You'd miss the results section entirely, or skip the methodology, or get a summary that was confidently wrong because it only saw the abstract and introduction.&lt;/p&gt;

&lt;p&gt;So truncation was out. I needed something smarter.&lt;/p&gt;




&lt;h2&gt;The routing idea&lt;/h2&gt;

&lt;p&gt;Here's what I noticed: Groq offers multiple models, and each has its own separate rate limit bucket.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;llama3-8b-8192&lt;/code&gt; — smaller, faster, 8k context&lt;/li&gt;
&lt;li&gt;&lt;code&gt;llama3-70b-8192&lt;/code&gt; — bigger, smarter, 8k context&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mixtral-8x7b-32768&lt;/code&gt; — larger context window, 32k tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one was the key insight. Different tasks need different things. A quick keyword extraction doesn't need a 70B model. A deep synthesis of methodology across three papers probably does.&lt;/p&gt;

&lt;p&gt;So instead of routing every request to one model and hoping for the best, I built a simple router that picks the model based on what the task actually needs.&lt;/p&gt;




&lt;h2&gt;How the routing works&lt;/h2&gt;

&lt;p&gt;The logic is straightforward — almost embarrassingly so once you see it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;routeToModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tokenCount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tokenCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;7000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Past ~7k tokens there's no room left in llama3's 8k window for a response; only mixtral can handle this context size&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mixtral-8x7b-32768&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;summary&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;qa&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// These need reasoning ability — use the big model&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llama3-70b-8192&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;extraction&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;keywords&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Structured extraction doesn't need a 70B model&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llama3-8b-8192&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Default fallback&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llama3-70b-8192&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then on every API call, before hitting Groq, I estimate the token count (rough heuristic: 1 token ≈ 4 characters), call the router, and send the request to whichever model it picks.&lt;/p&gt;
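
&lt;p&gt;The estimate itself is one line (a sketch; &lt;code&gt;paperText&lt;/code&gt; is an illustrative name):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;function estimateTokens(text) {
  return Math.ceil(text.length / 4); // rough heuristic: 1 token ≈ 4 characters
}

const model = routeToModel('summary', estimateTokens(paperText));
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;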

&lt;p&gt;If that model is rate-limited, I fall back to the next best option and log it. The user never sees a 429 — they just get a slightly slower response.&lt;/p&gt;
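
&lt;p&gt;The fallback wrapper looks roughly like this (a sketch, not the exact Papers.ai code; the fallback order is illustrative, and I'm assuming the &lt;code&gt;groq-sdk&lt;/code&gt; client, which surfaces rate limits as errors with &lt;code&gt;status === 429&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import Groq from 'groq-sdk';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

// Illustrative fallback order. Note mixtral is the only option once the
// input outgrows llama3's 8k context window.
const FALLBACKS = {
  'llama3-70b-8192': ['llama3-8b-8192', 'mixtral-8x7b-32768'],
  'llama3-8b-8192': ['llama3-70b-8192', 'mixtral-8x7b-32768'],
  'mixtral-8x7b-32768': [],
};

async function callWithFallback(model, messages) {
  for (const m of [model, ...FALLBACKS[model]]) {
    try {
      return await groq.chat.completions.create({ model: m, messages });
    } catch (err) {
      if (err.status !== 429) throw err;         // only swallow rate limits
      console.warn(`429 on ${m}, falling back`); // log the downgrade
    }
  }
  throw new Error('All models are rate-limited; retry shortly');
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;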




&lt;h2&gt;The Genkit layer&lt;/h2&gt;

&lt;p&gt;The routing alone solved the rate limit problem. But I still had an architectural issue: my backend was a mess of ad-hoc Groq calls scattered across different route handlers.&lt;/p&gt;

&lt;p&gt;That's where Genkit came in. Genkit (by Firebase/Google) lets you define "flows" — type-safe, structured pipelines for LLM tasks. Think of it like Express routes but for AI operations.&lt;/p&gt;

&lt;p&gt;Each tab in Papers.ai (Summary, Extraction, Visualisation, Q&amp;amp;A) became its own Genkit flow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summaryFlow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defineFlow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;summarise&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;PaperInputSchema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;outputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;SummarySchema&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;routeToModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;summary&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tokenCount&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;buildSummaryPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;parseSummaryOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output schema is the part I underestimated. When you define what the output &lt;em&gt;should&lt;/em&gt; look like — sections, confidence scores, citation references — the model actually follows it much more consistently. Structured output via Genkit killed most of my prompt reliability problems overnight.&lt;/p&gt;
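
&lt;p&gt;Genkit schemas are plain Zod, so a summary schema along these lines gives the general idea (an illustrative shape, not my exact one):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { z } from 'zod';

// Illustrative shape; adapt the fields to your own output.
const SummarySchema = z.object({
  tldr: z.string(),
  sections: z.array(z.object({
    heading: z.string(),
    summary: z.string(),
  })),
  confidence: z.number().min(0).max(1),
  citations: z.array(z.string()),
});
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;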




&lt;h2&gt;What changed after routing&lt;/h2&gt;

&lt;p&gt;Before routing: analysis took around 20 minutes if you include re-uploads, retries, and manually piecing together partial summaries.&lt;/p&gt;

&lt;p&gt;After routing: under 60 seconds for a full paper. Not because the models got faster — because I stopped wasting tokens on the wrong model for the wrong task, stopped hitting rate limits mid-analysis, and stopped making the user re-upload papers they'd already processed (that's the Share ID system, which is a whole other post).&lt;/p&gt;




&lt;h2&gt;What I'd do differently&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use token counting properly.&lt;/strong&gt; My "4 chars = 1 token" heuristic works 90% of the time and breaks badly the other 10% — especially on papers with lots of equations or non-English text. A proper tokenizer would make the routing more reliable.&lt;/p&gt;
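
&lt;p&gt;&lt;code&gt;js-tiktoken&lt;/code&gt; would be one option (it's not Llama's actual tokenizer, so counts are still approximate, but far closer than chars/4):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { getEncoding } from 'js-tiktoken';

// cl100k_base is not Llama's tokenizer, but it tracks real token counts
// much better than the chars/4 heuristic, especially on dense notation.
const enc = getEncoding('cl100k_base');

function countTokens(text) {
  return enc.encode(text).length;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;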

&lt;p&gt;&lt;strong&gt;Add a queue.&lt;/strong&gt; Right now if two users hit the same model simultaneously and both get rate-limited, they both see a delay. A simple Redis queue would smooth that out entirely.&lt;/p&gt;
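
&lt;p&gt;With BullMQ, say, it could look like this (the limits are illustrative, and &lt;code&gt;callWithFallback&lt;/code&gt; is the sketch from earlier):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// Producers enqueue work instead of calling Groq directly:
// llmQueue.add('summarise', { model, messages })
const llmQueue = new Queue('llm-requests', { connection });

// One rate-limited worker keeps throughput under Groq's per-minute cap,
// so simultaneous users queue briefly instead of both eating a 429.
async function processJob(job) {
  return callWithFallback(job.data.model, job.data.messages);
}

new Worker('llm-requests', processJob, {
  connection,
  limiter: { max: 25, duration: 60_000 }, // illustrative cap
});
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;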

&lt;p&gt;&lt;strong&gt;Expose the model choice to users.&lt;/strong&gt; Power users would genuinely want to know "this summary used llama3-70b" and be able to override it. That transparency also builds trust.&lt;/p&gt;




&lt;h2&gt;Try it yourself&lt;/h2&gt;

&lt;p&gt;Papers.ai is live at &lt;a href="https://papers-ai-delta.vercel.app" rel="noopener noreferrer"&gt;papers-ai-delta.vercel.app&lt;/a&gt; — free tier lets you upload 5 papers a month. Throw a dense paper at it and see what the router picks.&lt;/p&gt;

&lt;p&gt;The routing logic I described here is simple enough to drop into any Groq-based project. If you're building something on the free tier and hitting limits, the answer usually isn't "pay for a bigger plan" — it's "stop treating all tasks as identical."&lt;/p&gt;

&lt;p&gt;Different tasks, different models. That's it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm a 3rd-year CS student at Reva University. Building Papers.ai as a solo project taught me more about LLM infrastructure than any course has. If you have questions or want to talk about the Genkit architecture, drop a comment — happy to go deeper on any part of this.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>groq</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
