
Sathvik 07

How I built multi-model LLM routing on Groq's free tier

I hit Groq's token limits building an AI research paper analyser. Here's the routing system I built to get around it — and why it made the app better.

I didn't plan to build a multi-model routing system.

I was just trying to summarise a 40-page research paper without paying for an API.

That's how Papers.ai started — a side project born out of frustration with how painful academic literature reviews are. You open a paper, it's 30 pages of dense methodology, and you spend 20 minutes just figuring out whether it's even relevant to what you're working on.

I wanted to fix that. And I wanted to fix it for free.


The setup

The stack was simple at first: React frontend, Node.js backend, Firebase for auth and storage, and Groq as the LLM provider.

Why Groq? Because it's fast. Genuinely, shockingly fast compared to most LLM APIs. And on the free tier, it's good enough to build real things.

The plan was: user uploads a PDF → extract text → send to Groq → get a summary back. Done.

Except it wasn't done. Not even close.


The first wall I hit

Groq's free tier has token limits per model per minute. When you're summarising a research paper, you're often pushing 8,000–15,000 tokens in a single request. Hit that limit and you get a 429 error. Hit it repeatedly and your app becomes unusable.

My first reaction was the obvious one: just truncate the paper. Send the first N tokens, get a summary.

That worked. It was also terrible. You'd miss the results section entirely, or skip the methodology, or get a summary that was confidently wrong because it only saw the abstract and introduction.

So truncation was out. I needed something smarter.


The routing idea

Here's what I noticed: Groq offers multiple models, and each has its own separate rate limit bucket.

  • llama3-8b-8192 — smaller, faster, 8k context
  • llama3-70b-8192 — bigger, smarter, 8k context
  • mixtral-8x7b-32768 — larger context window, 32k tokens

Separate buckets meant separate budgets. And the models suit different jobs: a quick keyword extraction doesn't need a 70B model, a deep synthesis of methodology across three papers probably does, and a paper that won't fit in 8k of context needs mixtral regardless of the task.

So instead of routing every request to one model and hoping for the best, I built a simple router that picks the model based on what the task actually needs.


How the routing works

The logic is straightforward — almost embarrassingly so once you see it:

function routeToModel(task, tokenCount) {
  if (tokenCount > 6000) {
    // The llama3 models only have 8k of context, so anything that won't fit
    // (leaving headroom for the prompt and the response) goes to mixtral's 32k window
    return 'mixtral-8x7b-32768';
  }

  if (task === 'summary' || task === 'qa') {
    // These need reasoning ability — use the big model
    return 'llama3-70b-8192';
  }

  if (task === 'extraction' || task === 'keywords') {
    // Structured extraction doesn't need a 70B model
    return 'llama3-8b-8192';
  }

  // Default fallback
  return 'llama3-70b-8192';
}

Then on every API call, before hitting Groq, I estimate the token count (rough heuristic: 1 token ≈ 4 characters), call the router, and send the request to whichever model it picks.
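Here's a minimal sketch of that pre-flight step. The estimateTokens helper and the paperText/task variables are illustrative, not the exact Papers.ai code:

// Rough pre-flight estimate: ~4 characters per token.
// Crude, but close enough to pick a model without running a tokenizer.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Before every Groq call:
const tokenCount = estimateTokens(paperText);
const model = routeToModel(task, tokenCount);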

If that model is rate-limited, I fall back to the next best option and log it. The user never sees a 429 — they just get a slightly slower response.
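The fallback wrapper is nothing fancy. A sketch of the idea, assuming the groq-sdk client and that rate-limit errors surface with a 429 status (the fallback order here is illustrative):

import Groq from 'groq-sdk';

const groq = new Groq(); // reads GROQ_API_KEY from the environment

// Illustrative fallback order per model
const FALLBACKS = {
  'llama3-70b-8192': ['llama3-8b-8192', 'mixtral-8x7b-32768'],
  'llama3-8b-8192': ['llama3-70b-8192', 'mixtral-8x7b-32768'],
  'mixtral-8x7b-32768': ['llama3-70b-8192'],
};

async function callWithFallback(model, params) {
  const candidates = [model, ...(FALLBACKS[model] || [])];
  for (const candidate of candidates) {
    try {
      return await groq.chat.completions.create({ ...params, model: candidate });
    } catch (err) {
      if (err.status !== 429) throw err;
      // Rate-limited on this model's bucket: log it and try the next one
      console.warn(`429 on ${candidate}, falling back`);
    }
  }
  throw new Error('Every candidate model is rate-limited right now');
}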


The Genkit layer

The routing alone solved the rate limit problem. But I still had an architectural issue: my backend was a mess of ad-hoc Groq calls scattered across different route handlers.

That's where Genkit came in. Genkit (by Firebase/Google) lets you define "flows" — type-safe, structured pipelines for LLM tasks. Think of it like Express routes but for AI operations.

Each tab in Papers.ai (Summary, Extraction, Visualisation, Q&A) became its own Genkit flow:

const summaryFlow = defineFlow(
  { name: 'summarise', inputSchema: PaperInputSchema, outputSchema: SummarySchema },
  async (input) => {
    const model = routeToModel('summary', input.tokenCount);
    const result = await generate({ model, prompt: buildSummaryPrompt(input) });
    return parseSummaryOutput(result.text());
  }
);

The output schema is the part I underestimated. When you define what the output should look like — sections, confidence scores, citation references — the model actually follows it much more consistently. Structured output via Genkit killed most of my prompt reliability problems overnight.
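Genkit schemas are just zod objects, so defining the shape takes a few lines. The fields below are illustrative, not the exact SummarySchema from Papers.ai:

import { z } from 'zod';

// Illustrative shape only; the real schema has more fields
const SummarySchema = z.object({
  tldr: z.string().describe('Two-sentence overview of the paper'),
  sections: z.array(z.object({
    heading: z.string(),
    summary: z.string(),
    confidence: z.number().min(0).max(1),
  })),
  citations: z.array(z.string()).describe('Referenced paper titles or DOIs'),
});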


What changed after routing

Before routing: analysis took around 20 minutes if you include re-uploads, retries, and manually piecing together partial summaries.

After routing: under 60 seconds for a full paper. Not because the models got faster — because I stopped wasting tokens on the wrong model for the wrong task, stopped hitting rate limits mid-analysis, and stopped making the user re-upload papers they'd already processed (that's the Share ID system, which is a whole other post).


What I'd do differently

Use token counting properly. My "4 chars = 1 token" heuristic works 90% of the time and breaks badly the other 10% — especially on papers with lots of equations or non-English text. A proper tokenizer would make the routing more reliable.
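For instance, swapping the heuristic estimateTokens from earlier for one backed by the gpt-tokenizer package would get much closer, even though Llama's actual tokenizer differs (sketch, not production code):

import { encode } from 'gpt-tokenizer';

// Not Llama's exact tokenizer, but far closer than length / 4,
// especially on equation-heavy or non-English text
function estimateTokens(text) {
  return encode(text).length;
}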

Add a queue. Right now if two users hit the same model simultaneously and both get rate-limited, they both see a delay. A simple Redis queue would smooth that out entirely.
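One way to do it: a BullMQ queue per model, so each rate-limit bucket drains at its own pace. Names and limits below are made up, and it reuses the callWithFallback sketch from earlier:

import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// One queue per model so each rate-limit bucket drains independently
const llamaQueue = new Queue('groq-llama3-70b', { connection });

const worker = new Worker(
  'groq-llama3-70b',
  async (job) => callWithFallback('llama3-70b-8192', job.data),
  {
    connection,
    // Keep below Groq's per-minute limit for this model
    limiter: { max: 25, duration: 60_000 },
  }
);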

Expose the model choice to users. Power users would genuinely want to know "this summary used llama3-70b" and be able to override it. That transparency also builds trust.


Try it yourself

Papers.ai is live at papers-ai-delta.vercel.app — free tier lets you upload 5 papers a month. Throw a dense paper at it and see what the router picks.

The routing logic I described here is simple enough to drop into any Groq-based project. If you're building something on the free tier and hitting limits, the answer usually isn't "pay for a bigger plan" — it's "stop treating all tasks as identical."

Different tasks, different models. That's it.


I'm a 3rd-year CS student at Reva University. Building Papers.ai as a solo project taught me more about LLM infrastructure than any course has. If you have questions or want to talk about the Genkit architecture, drop a comment — happy to go deeper on any part of this.
