<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ayush Not so great</title>
    <description>The latest articles on DEV Community by Ayush Not so great (@ayush_notsogreat_b673d5).</description>
    <link>https://dev.to/ayush_notsogreat_b673d5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3989794%2Fd25fa66f-8eef-49a4-a020-89616faa3345.jpg</url>
      <title>DEV Community: Ayush Not so great</title>
      <link>https://dev.to/ayush_notsogreat_b673d5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ayush_notsogreat_b673d5"/>
    <language>en</language>
    <item>
      <title>How I built a 3-provider LLM fallback system in production (and what actually broke)</title>
      <dc:creator>Ayush Not so great</dc:creator>
      <pubDate>Wed, 17 Jun 2026 21:06:14 +0000</pubDate>
      <link>https://dev.to/ayush_notsogreat_b673d5/how-i-built-a-3-provider-llm-fallback-system-in-production-and-what-actually-broke-46jk</link>
      <guid>https://dev.to/ayush_notsogreat_b673d5/how-i-built-a-3-provider-llm-fallback-system-in-production-and-what-actually-broke-46jk</guid>
      <description>&lt;h1&gt;
  
  
  How I built a 3-provider LLM fallback system in production (and what actually broke)
&lt;/h1&gt;

&lt;p&gt;I'm a pre-final year student. I built Socra(&lt;a href="https://socra-production.up.railway.app/" rel="noopener noreferrer"&gt;https://socra-production.up.railway.app/&lt;/a&gt;) — a multi-agent LLM SaaS that interrogates your startup idea using 5 specialist AI personas before generating an architecture masterplan. It has paying users. It runs on Railway. And for the first two weeks of production, it was quietly broken in a way I didn't notice until real users hit it.&lt;/p&gt;

&lt;p&gt;This is the story of how I built the 3-provider fallback chain (Anthropic → Google → Groq), what broke along the way, and the actual code that runs in production today.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why you need a fallback chain at all
&lt;/h2&gt;

&lt;p&gt;When I first deployed Socra, the LLM routing was simple: one provider, one model, one API key. It worked fine in development.&lt;/p&gt;

&lt;p&gt;Then real users started using it.&lt;/p&gt;

&lt;p&gt;Groq's free tier is &lt;strong&gt;6,000 tokens per minute&lt;/strong&gt;. A single Socra masterplan pipeline — 5 specialist agents running in parallel, each with ~1,500 input tokens — consumes roughly 9,500 tokens in one burst. The math: 3 out of 5 agents were returning &lt;code&gt;Error code: 429&lt;/code&gt; on every session with any real traffic.&lt;/p&gt;

&lt;p&gt;The app was showing agent cards to users. Some said "Error" in amber text. I thought it was a race condition. It wasn't. It was me naively assuming one free-tier API could handle a multi-agent pipeline.&lt;/p&gt;

&lt;p&gt;The fix wasn't to optimize — it was to add redundancy.&lt;/p&gt;




&lt;h2&gt;
  
  
  The routing priority chain
&lt;/h2&gt;

&lt;p&gt;The final production routing order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Anthropic Claude Haiku   — if ANTHROPIC_API_KEY is set
2. Google Gemini 2.0 Flash  — if GOOGLE_API_KEY is set  ← production default
3. Groq LLaMA 3.1 8B        — if GROQ_API_KEY is set    ← fallback
4. Stub mode                — demo scenarios, no API key needed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why this order? Cost and rate limits, not model quality:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input $/MTok&lt;/th&gt;
&lt;th&gt;Output $/MTok&lt;/th&gt;
&lt;th&gt;Free tier TPM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;claude-haiku-4-5&lt;/td&gt;
&lt;td&gt;$0.80&lt;/td&gt;
&lt;td&gt;$4.00&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;gemini-2.0-flash&lt;/td&gt;
&lt;td&gt;$0.075&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;1,000,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Groq&lt;/td&gt;
&lt;td&gt;llama-3.1-8b-instant&lt;/td&gt;
&lt;td&gt;$0.06&lt;/td&gt;
&lt;td&gt;$0.06&lt;/td&gt;
&lt;td&gt;6,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Google's free tier is &lt;strong&gt;150× more headroom than Groq&lt;/strong&gt; for a pipeline that fires 5 LLM calls simultaneously. For a student-built SaaS where LLM cost needs to be near zero while testing, that's not a small difference — it's the difference between the app working and not working.&lt;/p&gt;




&lt;h2&gt;
  
  
  The implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The routing check
&lt;/h3&gt;

&lt;p&gt;Every LLM call in the system goes through one of two entrypoints: &lt;code&gt;_call_llm&lt;/code&gt; (non-streaming, for structured JSON) and &lt;code&gt;_stream_llm_tokens&lt;/code&gt; (streaming, for conversation text). Both use the same routing logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# backend/llm_client.py
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json_mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;anthropic_api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_call_anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json_mode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;google_api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_call_google&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json_mode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groq_api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_call_groq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json_mode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;_stub_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dead simple. The routing is just: which key is set? The first match wins.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google via the OpenAI SDK (the elegant hack)
&lt;/h3&gt;

&lt;p&gt;Google AI Studio exposes an OpenAI-compatible endpoint. This means you don't need the Google SDK — just point the OpenAI SDK at a different base URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_call_google&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json_mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncOpenAI&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;google_api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://generativelanguage.googleapis.com/v1beta/openai/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;json_mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same pattern works for streaming — just use &lt;code&gt;stream=True&lt;/code&gt; and iterate &lt;code&gt;async for chunk in stream&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is a pattern worth knowing: Groq, Azure OpenAI, and Google AI Studio all support the OpenAI-compatible endpoint format. If you write against the OpenAI SDK with configurable &lt;code&gt;base_url&lt;/code&gt; and &lt;code&gt;api_key&lt;/code&gt;, you get multi-provider support with almost no extra code.&lt;/p&gt;

&lt;h3&gt;
  
  
  The structured output problem
&lt;/h3&gt;

&lt;p&gt;Here's where it got messy. After the multi-agent pipeline runs and generates a masterplan, Socra needs structured JSON back from the LLM — eval scores, assumption tracking, quick reply choices. The original approach was a separator in the stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Stream:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Here are my questions... ###JSON###{"&lt;/span&gt;&lt;span class="err"&gt;eval_delta&lt;/span&gt;&lt;span class="s2"&gt;": {...}, "&lt;/span&gt;&lt;span class="err"&gt;choices&lt;/span&gt;&lt;span class="s2"&gt;": [...]}"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked fine with Anthropic (Claude follows formatting instructions reliably). It broke completely with smaller models.&lt;/p&gt;

&lt;p&gt;The 8B Groq model would occasionally include the separator, occasionally not, occasionally put it in the middle of a sentence. Parsing failed silently and &lt;code&gt;choices&lt;/code&gt; came back empty — users saw no quick reply options after the first message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix: two separate calls.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Call 1: Stream plain text, no format requirements
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;_stream_llm_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;
    &lt;span class="n"&gt;full_message&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;

&lt;span class="c1"&gt;# Call 2: After streaming ends, get structured data separately
&lt;/span&gt;&lt;span class="n"&gt;eval_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;eval_system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;full_message&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;json_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Anthropic path still uses the separator (it's reliable there and saves one API call). The Groq and Google paths use two calls. A bit more latency, zero parsing failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  What actually broke in production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The trailing newline API key
&lt;/h3&gt;

&lt;p&gt;This one cost me 45 minutes.&lt;/p&gt;

&lt;p&gt;After deploying to Railway, every LLM call was failing with &lt;code&gt;Illegal header value&lt;/code&gt;. The API key was correct — I'd copied it straight from the Groq console. Except I hadn't. I'd pasted it into Railway's Variables tab and there was an invisible &lt;code&gt;\n&lt;/code&gt; at the end.&lt;/p&gt;

&lt;p&gt;The fix was two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Re-enter the key manually (don't paste from clipboard)&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;.strip()&lt;/code&gt; defensively in &lt;code&gt;config.py&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseSettings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;groq_api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;anthropic_api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;google_api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

    &lt;span class="nd"&gt;@validator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;groq_api_key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;anthropic_api_key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google_api_key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pre&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;strip_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the app is defensive against copy-paste mistakes. The &lt;code&gt;.strip()&lt;/code&gt; costs nothing and prevents a class of errors that are genuinely hard to debug.&lt;/p&gt;

&lt;h3&gt;
  
  
  The startup log that lied
&lt;/h3&gt;

&lt;p&gt;After adding Google as the second provider, I pushed to Railway and checked the logs. They said:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Using Groq LLaMA for LLM calls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But I'd set &lt;code&gt;GOOGLE_API_KEY&lt;/code&gt;. For two days I thought Google wasn't working. It was. The startup log was wrong.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;main.py&lt;/code&gt; lifespan check had a bug:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before — skipped Google entirely
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;anthropic_api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Using Anthropic Claude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groq_api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;         &lt;span class="c1"&gt;# ← checked Groq before Google
&lt;/span&gt;    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Using Groq LLaMA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual routing in &lt;code&gt;_call_llm&lt;/code&gt; was correct (Google checked second, before Groq). But the log check had a different order — so if Groq was also set (it was), it logged "Using Groq" even though every actual call was going to Google.&lt;/p&gt;

&lt;p&gt;Fix: mirror the routing logic exactly in the startup log.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 429 cascade
&lt;/h3&gt;

&lt;p&gt;Running 5 parallel specialist agents against Groq's 6k TPM free tier: the math never worked and I was pretending it did.&lt;/p&gt;

&lt;p&gt;Each agent gets ~1,500 input tokens + generates ~400 output tokens = ~1,900 tokens per call. 5 parallel calls = 9,500 tokens launched simultaneously. Groq's rate limiter sees all 9,500 in the same minute window and rejects the overflow.&lt;/p&gt;

&lt;p&gt;Three approaches I tried, in order:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 1: Retry with backoff.&lt;/strong&gt; Added 3-attempt retry with 4s/8s exponential backoff on 429 errors. Helped slightly. Didn't fix the underlying math.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 2: Sequential execution with delays.&lt;/strong&gt; Switched from &lt;code&gt;asyncio.gather()&lt;/code&gt; to sequential calls with 1.5s gaps between agents. This spread the token burst across multiple rate-limit windows. Worked on Groq, but added ~7.5s to the masterplan pipeline — noticeable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 3: Switch to Google.&lt;/strong&gt; Google's free tier is 1,000,000 TPM. Problem disappeared entirely. Now Groq is the fallback, not the primary.&lt;/p&gt;

&lt;p&gt;The real lesson: design for the rate limits of your fallback providers, not just your primary. Groq is fast and cheap but not meant for parallel multi-agent workloads on the free tier.&lt;/p&gt;




&lt;h2&gt;
  
  
  The cost analysis
&lt;/h2&gt;

&lt;p&gt;After switching to Google as the production default, I did a full token and cost breakdown per session:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Input tokens&lt;/th&gt;
&lt;th&gt;Output tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Conversation (7 turns avg)&lt;/td&gt;
&lt;td&gt;~16,700&lt;/td&gt;
&lt;td&gt;~3,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 specialist agents&lt;/td&gt;
&lt;td&gt;~24,000&lt;/td&gt;
&lt;td&gt;~3,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Synthesis&lt;/td&gt;
&lt;td&gt;~12,700&lt;/td&gt;
&lt;td&gt;~2,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devil's advocate&lt;/td&gt;
&lt;td&gt;~2,800&lt;/td&gt;
&lt;td&gt;~600&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total per session&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~56,200&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~10,100&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At Google Gemini Flash pricing ($0.075 input / $0.30 output per million tokens):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input cost:  56,200 / 1,000,000 × $0.075 = $0.0042
Output cost: 10,100 / 1,000,000 × $0.30  = $0.0030
Total:       ~$0.007 per session
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Socra charges ₹499 (~$6) for a full masterplan session. LLM cost per session: &lt;strong&gt;$0.007&lt;/strong&gt;. That's &lt;strong&gt;99.8% gross margin on the LLM cost alone&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Railway hosting is ~$30/month fixed. Break-even is roughly 6 paid sessions per month.&lt;/p&gt;

&lt;p&gt;This math only works because of the provider choice. The same session on Anthropic Haiku costs ~$0.085 — 12× more expensive, which would put margins at ~98.6%. Still fine, but the point is: provider selection is a product decision, not just a technical one.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Design for multi-provider from day one.&lt;/strong&gt; I added the fallback chain in Phase 3 after production broke. It should have been in the architecture from the start. The routing abstraction (&lt;code&gt;_call_llm&lt;/code&gt; with provider detection) is simple enough to add in 30 minutes — there's no reason to start with a single provider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Test the rate limit math before deploying parallel calls.&lt;/strong&gt; 5 parallel agents × 1,900 tokens = 9,500 tokens in one burst. Groq's free tier is 6,000 TPM. This is elementary arithmetic that I didn't do until users were getting errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Strip API keys at the config layer.&lt;/strong&gt; &lt;code&gt;.strip()&lt;/code&gt; in your settings class is a 5-minute change that eliminates an entire class of deployment bugs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Make your startup log mirror your routing logic exactly.&lt;/strong&gt; A log that says "Using Groq" when you're actually using Google is worse than no log — it actively misleads debugging.&lt;/p&gt;




&lt;h2&gt;
  
  
  The full stack for context
&lt;/h2&gt;

&lt;p&gt;Socra is built on: FastAPI + React + PostgreSQL + Railway + LangGraph (for the multi-agent pipeline) + Langfuse v4 (for per-call LLM observability) + Clerk (auth) + Razorpay (payments). The LLM fallback chain described here handles all LLM calls across the entire system — conversation, agents, synthesis, pitch deck generation, and the tribunal verdict scoring.&lt;/p&gt;

&lt;p&gt;The live app is at &lt;a href="https://socra-production.up.railway.app" rel="noopener noreferrer"&gt;socra-production.up.railway.app&lt;/a&gt;. The approach described here — OpenAI-compatible endpoints, two-call structured output, provider detection at the config layer — is all running in production today.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm a pre-final year student at HBTU Kanpur building production ML systems. If you're working on something similar or have questions about the multi-agent architecture, I'm on &lt;a href="https://linkedin.com/in/ayush-yadav-7ba731289" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://github.com/CaringNihilistic" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>llm</category>
      <category>python</category>
      <category>webdev</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
