<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Devansh</title>
    <description>The latest articles on DEV Community by Devansh (@devansh365).</description>
    <link>https://dev.to/devansh365</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F679755%2F9dc6ebfe-a1d9-4613-8192-f2854324ea75.png</url>
      <title>DEV Community: Devansh</title>
      <link>https://dev.to/devansh365</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devansh365"/>
    <language>en</language>
    <item>
      <title>Your Google Sheets backend is silently dropping rows. Here's why.</title>
      <dc:creator>Devansh</dc:creator>
      <pubDate>Sun, 26 Apr 2026 23:30:00 +0000</pubDate>
      <link>https://dev.to/devansh365/your-google-sheets-backend-is-silently-dropping-rows-heres-why-3o74</link>
      <guid>https://dev.to/devansh365/your-google-sheets-backend-is-silently-dropping-rows-heres-why-3o74</guid>
      <description>&lt;p&gt;A signup form POSTing to Google Sheets is the most common "backend" on the indie web.&lt;/p&gt;

&lt;p&gt;It works for your landing page demo. It works when you test it with 5 friends. It works right up until the moment you can least afford for it to fail.&lt;/p&gt;

&lt;p&gt;Here's the part nobody tells you: &lt;strong&gt;Google's own &lt;code&gt;values.append&lt;/code&gt; endpoint silently drops rows under concurrent writes.&lt;/strong&gt; Two simultaneous POSTs can resolve to the same target row and one of them gets overwritten. No error in your logs. No error in your client. Just rows that silently didn't land.&lt;/p&gt;

&lt;p&gt;Every "Sheets as a backend" wrapper you've heard of — SheetDB, Sheety, SheetBest, NoCodeAPI — forwards your request straight to &lt;code&gt;values.append&lt;/code&gt;. They inherit the bug.&lt;/p&gt;

&lt;p&gt;I spent the last few weeks building a fix. It's called SheetForge, it's MIT-licensed, and this post is about the actual bug and why the fix isn't as simple as "throw a mutex on it."&lt;/p&gt;

&lt;h2&gt;
  
  
  The bug, reproduced in 4 lines
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="nx"&gt;sheets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;rowA&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="nx"&gt;sheets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;rowB&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="nx"&gt;sheets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;rowC&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="nx"&gt;sheets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;rowD&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four POSTs. You expect four rows. Under load you often get three. Sometimes two.&lt;/p&gt;

&lt;p&gt;The reason is inside &lt;code&gt;values.append&lt;/code&gt;. The operation reads the current last row, then writes to the position after it. When two calls race, they can read the same "current last row" and write to the same target cell range. One value wins. The other silently disappears.&lt;/p&gt;
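
&lt;p&gt;Spelled out as a timeline (the row numbers are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t0  request A reads the sheet → last row is 41
t0  request B reads the sheet → last row is 41
t1  A writes its values to row 42
t2  B writes its values to row 42   ← A's row is overwritten, neither call errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
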

&lt;p&gt;Google has &lt;a href="https://developers.google.com/sheets/api/guides/values#appending_values" rel="noopener noreferrer"&gt;documented this&lt;/a&gt;. The official workaround is: "Don't write concurrently." That's a fine rule when your launch gets 3 signups. It's actively destructive when you hit HN's front page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c4ejfe7ra8amw1navaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c4ejfe7ra8amw1navaf.png" alt="4 POSTs, 3 rows. rowC was silently overwritten. No error." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why your form loses 40% of rows during a traffic spike
&lt;/h2&gt;

&lt;p&gt;Your form looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// /api/signup&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;email&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sheets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()]]&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is deployed on Vercel. Each request gets its own isolated serverless invocation. They run truly in parallel. There is zero coordination between them.&lt;/p&gt;

&lt;p&gt;Now Product Hunt puts you on the daily leaderboard. Forty people land on your page in the same 10 seconds. Twelve of them hit submit at almost the same moment.&lt;/p&gt;

&lt;p&gt;Twelve concurrent POSTs to &lt;code&gt;values.append&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You won't end up with twelve rows. You'll end up with something closer to eight, maybe nine. The exact number depends on how Google's backend happens to interleave the writes; there's no deterministic serialization, and that's the whole problem. The lost rows produce no error. The users who submitted them see a green checkmark. Their email is gone.&lt;/p&gt;

&lt;p&gt;This isn't theoretical. Levels.fyi &lt;a href="https://www.levels.fyi/blog/scaling-to-millions-with-google-sheets.html" rel="noopener noreferrer"&gt;wrote a long engineering post&lt;/a&gt; about running their entire site on Sheets until exactly this class of problem forced them to migrate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a mutex doesn't fix it
&lt;/h2&gt;

&lt;p&gt;The naive fix is obvious: put a lock in front of the Sheets API call. One write at a time per sheet. Problem solved.&lt;/p&gt;

&lt;p&gt;In practice, this is where it breaks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Your API runs on serverless.&lt;/strong&gt; A lock in Node process memory doesn't work when requests are spread across 12 cold starts. You need a distributed lock.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. A distributed lock has its own bugs.&lt;/strong&gt; Redis &lt;code&gt;SETNX&lt;/code&gt; with a TTL is the standard answer, but it has a classic failure mode: process A acquires the lock, gets paused by GC, the TTL expires, process B grabs the lock, then process A wakes up and releases "its" lock, which is now B's lock. Two processes now think they hold the lock, and you're back to silent data loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Retries break everything.&lt;/strong&gt; Your client retries a POST that timed out but actually landed. The lock-protected write runs twice, and now you have duplicate rows. Fixing the drops just created duplicates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Crashes strand the lock.&lt;/strong&gt; If your process dies while holding the lock, the next writer has to wait for the TTL to expire. For TTLs long enough to be safe (30+ seconds), that stalls your whole write throughput.&lt;/p&gt;
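
&lt;p&gt;For reference, here's roughly what that naive lock looks like (a sketch using ioredis, not something SheetForge ships):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Naive distributed lock: this is the pattern that hits failure modes 2 and 4 above
import Redis from 'ioredis'

const redis = new Redis(process.env.REDIS_URL!)

async function withNaiveLock(sheetId: string, fn: () =&amp;gt; Promise&amp;lt;void&amp;gt;) {
  const key = `lock:${sheetId}`
  // Acquire only if not already held, with a 30s TTL as a crash safety net
  const acquired = await redis.set(key, 'me', 'EX', 30, 'NX')
  if (!acquired) throw new Error('sheet busy')
  try {
    // If this stalls past 30s (GC pause, slow Sheets call), the TTL expires and
    // another invocation acquires the lock: two writers again, silent data loss again
    await fn()
  } finally {
    // May delete a lock that now belongs to someone else
    await redis.del(key)
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
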

&lt;p&gt;Every one of these is a real bug I hit while prototyping. The fix isn't a mutex. The fix is a proper queue.&lt;/p&gt;

&lt;h2&gt;
  
  
  How SheetForge actually fixes it
&lt;/h2&gt;

&lt;p&gt;The architecture, in one sentence: &lt;strong&gt;every write goes into a per-sheet queue, one worker per sheet pulls from that queue inside a Postgres transaction, and an idempotency key dedupes retries.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;submitWrite(row)
  └─ INSERT INTO write_ledger (sheet_id, idempotency_key, payload)
     (partial unique index dedupes retries before we even touch Sheets)
  └─ XADD to Redis Stream for this sheet
  └─ return { writeId, status: 'pending' }

processNext()  ← runs in a loop on the worker
  └─ XREADGROUP from the sheet's stream
  └─ BEGIN transaction
     └─ SELECT pg_advisory_xact_lock(hashtextextended(streamKey, 0))
     └─ call sheets.values.append(payload)
     └─ UPDATE write_ledger SET status = 'committed'
  └─ COMMIT
  └─ XACK only after commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four things matter here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The advisory lock is the fence.&lt;/strong&gt; &lt;code&gt;pg_advisory_xact_lock&lt;/code&gt; acquires a lock inside the transaction. If the transaction commits or aborts, the lock is released automatically by Postgres. There is no TTL. No lease clock. No split-brain. If your process dies mid-handler, the transaction rolls back, the lock releases, and the message redelivers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The idempotency key is the deduper.&lt;/strong&gt; Every write comes in with an &lt;code&gt;Idempotency-Key&lt;/code&gt; header. A partial unique index on &lt;code&gt;(sheet_id, idempotency_key)&lt;/code&gt; WHERE &lt;code&gt;status IN ('pending', 'committed')&lt;/code&gt; means the database rejects duplicates before the worker even sees them. Retry the same request 100 times, you get 1 row.&lt;/p&gt;
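
&lt;p&gt;The index that does the deduping looks roughly like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE UNIQUE INDEX write_ledger_dedupe
  ON write_ledger (sheet_id, idempotency_key)
  WHERE status IN ('pending', 'committed');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
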

&lt;p&gt;&lt;strong&gt;The ledger is the truth.&lt;/strong&gt; The write is durable in Postgres the moment you get the &lt;code&gt;writeId&lt;/code&gt; back. Sheets is downstream. If Google's API is down, your writes queue up and flush when it recovers. Your users see a green checkmark and it means something.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;XACK happens post-commit.&lt;/strong&gt; Redis Streams' PEL (pending entries list) redelivers messages if they're not acked. If the worker crashes mid-transaction, Postgres rolls back, Redis redelivers, and the idempotency key catches the replay. Effectively exactly-once semantics, end to end.&lt;/p&gt;
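
&lt;p&gt;Put together, the worker's critical section looks roughly like this (a sketch with node-postgres and ioredis; &lt;code&gt;pool&lt;/code&gt;, &lt;code&gt;streamKey&lt;/code&gt;, &lt;code&gt;payload&lt;/code&gt;, &lt;code&gt;writeId&lt;/code&gt;, &lt;code&gt;messageId&lt;/code&gt; and the consumer group name are stand-ins, not SheetForge's exact internals):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// One message: take the per-sheet advisory lock, write to Sheets, mark the
// ledger row committed, and only then ack so Redis won't redeliver.
const client = await pool.connect()
try {
  await client.query('BEGIN')
  // Transaction-scoped lock: released by Postgres on COMMIT or ROLLBACK, no TTL involved
  await client.query('SELECT pg_advisory_xact_lock(hashtextextended($1, 0))', [streamKey])
  await sheets.values.append(payload)   // only writer for this sheet right now
  await client.query("UPDATE write_ledger SET status = 'committed' WHERE id = $1", [writeId])
  await client.query('COMMIT')
  await redis.xack(streamKey, 'writers', messageId)   // ack strictly after the commit
} catch (err) {
  await client.query('ROLLBACK')   // lock releases, message stays pending and redelivers
  throw err
} finally {
  client.release()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
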

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh004xnip7xx24dxgk2c0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh004xnip7xx24dxgk2c0.png" alt="Queue Architecture" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The full test for this concurrency
&lt;/h2&gt;

&lt;p&gt;Every change to the write-queue slice requires a concurrency test. This is the one I do not break:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;50 parallel writes land 50 rows in order&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sheet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createTestSheet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;writes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;length&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`user-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;@test.com`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;idempotencyKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`key-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}))&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;writes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;sheetforge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;idempotencyKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;idempotencyKey&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;waitForQueueDrain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sheet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;readSheet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sheet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveLength&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;writes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;50 parallel POSTs. 50 rows. In order. Retry safe.&lt;/p&gt;

&lt;p&gt;This same test against raw &lt;code&gt;values.append&lt;/code&gt; reliably fails. It fails under SheetDB, Sheety, and SheetBest too — I checked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhard04qrm1c4pkalpz5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhard04qrm1c4pkalpz5i.png" alt="Concurrency Test Result" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The typed SDK is the bonus
&lt;/h2&gt;

&lt;p&gt;Once you have a proper queue, the rest gets interesting. Since SheetForge knows your sheet's header row, it can generate a typed TypeScript client with literal union types inferred from your sample cells.&lt;/p&gt;

&lt;p&gt;Header row: &lt;code&gt;email | plan | created_at&lt;/code&gt;&lt;br&gt;
Sample cells: &lt;code&gt;hi@example.com | free | 2026-04-15&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Generated SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./sheetforge-client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sheet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SHEETFORGE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;sheetId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sht_abc123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sheet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hi@example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;free&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// 'free' | 'pro' — inferred from sample cells&lt;/span&gt;
    &lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;idempotencyKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randomUUID&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The compiler catches header drift. Rename a column in the sheet and regenerate the client — TypeScript tells you every call site that needs updating.&lt;/p&gt;
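
&lt;p&gt;Roughly what the generated row type looks like for that sheet (illustrative, not the exact codegen output):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Shape inferred from the header row and sample cells
type SignupsRow = {
  email: string
  plan: 'free' | 'pro'   // literal union inferred from the sample cells
  created_at: string     // ISO 8601 timestamp
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
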

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;One-click hosted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://getsheetforge.vercel.app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sign in with Google. Connect a sheet. Copy your API key. Done.&lt;/p&gt;

&lt;p&gt;Self-host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Devansh-365/sheetforge.git
&lt;span class="nb"&gt;cd &lt;/span&gt;sheetforge
pnpm &lt;span class="nb"&gt;install
cp&lt;/span&gt; .env.example .env   &lt;span class="c"&gt;# Google OAuth + DATABASE_URL + Redis&lt;/span&gt;
pnpm db:push
pnpm dev               &lt;span class="c"&gt;# web :3000, api :3001&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prereqs: Node 20+, pnpm 9+, Postgres 14+, Redis 6+ (or Upstash REST).&lt;/p&gt;

&lt;p&gt;The OSS core (&lt;code&gt;packages/queue&lt;/code&gt;, &lt;code&gt;packages/codegen&lt;/code&gt;, &lt;code&gt;packages/sdk-ts&lt;/code&gt;) is MIT and stays free forever. The hosted SaaS runs the same code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyi08g3kjqj1953ks7sa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyi08g3kjqj1953ks7sa.png" alt="If you need webhooks today, use SheetDB. If you need your rows to land, come back." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What SheetForge is not
&lt;/h2&gt;

&lt;p&gt;It is not a Postgres replacement. If you need complex queries, indices, or relational integrity, use a real database.&lt;/p&gt;

&lt;p&gt;It is not a high-throughput pipe. Google caps you at ~60 writes/minute per sheet regardless of what sits in front. SheetForge makes sure those writes land; it doesn't make them faster.&lt;/p&gt;

&lt;p&gt;It is not a reason to keep Sheets as your backend forever. It's the right tool for landing pages, waitlists, internal forms, ops tools, and MVPs where you need rows to actually land. If your app outgrows Sheets, it outgrows SheetForge too.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one takeaway
&lt;/h2&gt;

&lt;p&gt;Every Sheets-as-backend wrapper you've been using has a silent data-loss bug that only shows up when traffic spikes. That's exactly the moment you most care about losing rows.&lt;/p&gt;

&lt;p&gt;If you've ever shipped a form on Sheets and watched rows vanish mid-launch, give SheetForge a try. If it saves you one bug, star the repo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Devansh-365/sheetforge" rel="noopener noreferrer"&gt;github.com/Devansh-365/sheetforge&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Hosted:&lt;/strong&gt; &lt;a href="https://getsheetforge.vercel.app" rel="noopener noreferrer"&gt;getsheetforge.vercel.app&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>googlesheets</category>
      <category>startup</category>
      <category>saas</category>
    </item>
    <item>
      <title>Gemini 2.5 Flash was returning 37 tokens. Here's why.</title>
      <dc:creator>Devansh</dc:creator>
      <pubDate>Sun, 19 Apr 2026 23:30:00 +0000</pubDate>
      <link>https://dev.to/devansh365/gemini-25-flash-was-returning-37-tokens-heres-why-4ppp</link>
      <guid>https://dev.to/devansh365/gemini-25-flash-was-returning-37-tokens-heres-why-4ppp</guid>
      <description>&lt;p&gt;I set &lt;code&gt;max_tokens: 1000&lt;/code&gt; on a Gemini 2.5 Flash call.&lt;/p&gt;

&lt;p&gt;The response came back with 37 tokens. &lt;code&gt;finish_reason: "MAX_TOKENS"&lt;/code&gt;. No error. No warning. Just a string that stopped mid-sentence.&lt;/p&gt;

&lt;p&gt;I changed it to 2000. Got back 41 tokens. Then 5000. Got back 38.&lt;/p&gt;

&lt;p&gt;That's when I knew something was actually broken, not just a config issue.&lt;/p&gt;

&lt;p&gt;I spent a day tracing this. The root cause is surprising, the official docs don't explain it, and the fix depends on which version of which SDK you're using. Here's what I learned, and a diagnostic script at the end so you can figure out which variant of the bug you hit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The symptom
&lt;/h2&gt;

&lt;p&gt;Your Gemini 2.5 Flash or Pro call returns one of these shapes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"candidates"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"parts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"finishReason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MAX_TOKENS"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"usageMetadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"promptTokenCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"candidatesTokenCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"thoughtsTokenCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;964&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"totalTokenCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1084&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or a truncated mid-sentence response with &lt;code&gt;candidatesTokenCount&lt;/code&gt; near zero and &lt;code&gt;thoughtsTokenCount&lt;/code&gt; close to whatever you set &lt;code&gt;max_output_tokens&lt;/code&gt; to.&lt;/p&gt;

&lt;p&gt;The word &lt;code&gt;thoughtsTokenCount&lt;/code&gt; is the giveaway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it happens
&lt;/h2&gt;

&lt;p&gt;Gemini 2.5 Flash and Pro are reasoning models. Like OpenAI's o-series, they burn tokens on internal reasoning before writing the visible response, and those thinking tokens count against your &lt;code&gt;max_output_tokens&lt;/code&gt; budget.&lt;/p&gt;

&lt;p&gt;So when you ask for 1,000 tokens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The model thinks. This uses some number of tokens, tracked as &lt;code&gt;thoughtsTokenCount&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Once &lt;code&gt;thoughtsTokenCount + candidatesTokenCount&lt;/code&gt; hits your budget, generation stops.&lt;/li&gt;
&lt;li&gt;If thinking consumed most of the budget, &lt;code&gt;candidatesTokenCount&lt;/code&gt; ends up near zero.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Gemini 2.5 Flash defaults to a dynamic thinking budget. It decides how much to think based on the task. For anything non-trivial, it will happily burn 90 to 98 percent of your budget on reasoning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyutlsc5hr5fc8x0t2zmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyutlsc5hr5fc8x0t2zmc.png" alt="Where your max_tokens actually go"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see this directly in the API response. If you're using the Google GenAI SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize quantum computing.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output tokens:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates_token_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Thinking tokens:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thoughts_token_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_token_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Finish reason:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;thoughts_token_count&lt;/code&gt; field is where your budget actually went.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three fixes, ranked
&lt;/h2&gt;

&lt;p&gt;There are three ways to handle this, and they have real tradeoffs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Disable thinking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;thinking_budget: 0&lt;/code&gt; (Flash) or &lt;code&gt;reasoning_effort: "none"&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Lower on complex reasoning&lt;/td&gt;
&lt;td&gt;Chat UIs, structured extraction, high-volume endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cap thinking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;thinking_budget: 1024&lt;/code&gt; + &lt;code&gt;max_output_tokens: 8192&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Most production workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dynamic thinking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Let Flash decide, set &lt;code&gt;max_output_tokens&lt;/code&gt; to 8K+&lt;/td&gt;
&lt;td&gt;Slowest&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Best&lt;/td&gt;
&lt;td&gt;Research queries, complex analysis, one-shot deep tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The third option is the default, and it's the source of the bug. It's only the right choice if you're actually okay with burning most of your tokens on reasoning and waiting 5 to 30 seconds per response.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 1: Disable thinking for Flash
&lt;/h2&gt;

&lt;p&gt;For Gemini 2.5 Flash, you can turn thinking off entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain circuit breakers in 2 sentences.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;thinking_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThinkingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thinking_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;thinking_budget=0&lt;/code&gt; is only valid for 2.5 Flash. Pro refuses to run without at least some thinking, and throws:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Thinking can't be disabled for this model.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Pro, the minimum accepted value is 128. Using &lt;code&gt;thinking_budget=128&lt;/code&gt; gets you the closest thing to "off" that Pro allows.&lt;/p&gt;
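
&lt;p&gt;Same call shape as Fix 1, just with the minimum budget instead of zero (a sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Closest-to-off configuration for 2.5 Pro: cap thinking at the 128-token minimum
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Explain circuit breakers in 2 sentences.",
    config=types.GenerateContentConfig(
        max_output_tokens=1000,
        thinking_config=types.ThinkingConfig(thinking_budget=128),
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
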

&lt;h2&gt;
  
  
  Fix 2: The OpenAI-compat escape hatch (underdocumented)
&lt;/h2&gt;

&lt;p&gt;If you're hitting Gemini through the OpenAI-compatible endpoint (either Google's own &lt;code&gt;generativelanguage.googleapis.com/v1beta/openai&lt;/code&gt; or through a proxy like LiteLLM), you can use &lt;code&gt;reasoning_effort&lt;/code&gt; instead of &lt;code&gt;thinking_budget&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://generativelanguage.googleapis.com/v1beta/openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GEMINI_API_KEY&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain circuit breakers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reasoning_effort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# or "low", "medium", "high"
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is barely documented. Google's official OpenAI-compatibility page mentions it in passing, and almost no tutorials cover it. But it works, and it's the cleanest way to control reasoning from code that uses the OpenAI SDK.&lt;/p&gt;

&lt;p&gt;Mapping:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;reasoning_effort: "none"&lt;/code&gt; → &lt;code&gt;thinking_budget: 0&lt;/code&gt; (Flash only)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reasoning_effort: "low"&lt;/code&gt; → &lt;code&gt;thinking_budget: 1024&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reasoning_effort: "medium"&lt;/code&gt; → &lt;code&gt;thinking_budget: 8192&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reasoning_effort: "high"&lt;/code&gt; → &lt;code&gt;thinking_budget: 24576&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm75kc7oclgtgkgc5r2x6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm75kc7oclgtgkgc5r2x6.png" alt="Fix decision tree"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 3: The integration-specific gotchas
&lt;/h2&gt;

&lt;p&gt;The bug manifests differently depending on your stack. Some quick notes from actual GitHub issues (&lt;a href="https://github.com/googleapis/python-genai/issues/782" rel="noopener noreferrer"&gt;python-genai #782&lt;/a&gt;, &lt;a href="https://github.com/google-gemini/gemini-cli/issues/23081" rel="noopener noreferrer"&gt;gemini-cli #23081&lt;/a&gt;, &lt;a href="https://github.com/langchain-ai/langchain-google/issues/1490" rel="noopener noreferrer"&gt;langchain-google #1490&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangChain&lt;/strong&gt; silently truncates output. Developers report setting &lt;code&gt;max_tokens=16000&lt;/code&gt; and still getting cut-off responses. Fix: pass &lt;code&gt;thinking_budget&lt;/code&gt; via &lt;code&gt;model_kwargs&lt;/code&gt;, or switch to the OpenAI-compat endpoint through &lt;code&gt;ChatOpenAI&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM&lt;/strong&gt; accepts &lt;code&gt;reasoning_effort&lt;/code&gt; and maps it to the Gemini thinking budget, but as of late 2025, passing &lt;code&gt;"none"&lt;/code&gt; for Pro failed with "Thinking can't be disabled." Fix: use &lt;code&gt;reasoning_effort="low"&lt;/code&gt; instead of &lt;code&gt;"none"&lt;/code&gt; for Pro.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ha-llmvision&lt;/strong&gt; defaulted &lt;code&gt;thinkingBudget&lt;/code&gt; to 35 to 50 tokens. That value gets fully consumed by thinking, leaving nothing for output. Fix: set to 1024 or higher.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cline&lt;/strong&gt; set &lt;code&gt;thinkingBudget: 0&lt;/code&gt; which works for Flash Lite but throws on Pro. Fix depends on which model you're targeting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vertex AI&lt;/strong&gt; uses &lt;code&gt;thinkingConfig.thinkingBudget&lt;/code&gt; nested inside the config object. Raw API requests that put it at the top level silently ignore it.&lt;/p&gt;
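
&lt;p&gt;If you're building the request body by hand, the nesting that actually gets honored looks like this (a sketch of the shape, not a complete request):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# thinkingConfig has to live inside generationConfig; a top-level copy is ignored
body = {
    "contents": [{"role": "user", "parts": [{"text": "Summarize quantum computing."}]}],
    "generationConfig": {
        "maxOutputTokens": 1000,
        "thinkingConfig": {"thinkingBudget": 1024},
    },
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
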

&lt;h2&gt;
  
  
  Diagnostic script
&lt;/h2&gt;

&lt;p&gt;If you're not sure which variant of the bug you hit, run this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;diagnose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;finish&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;finish_reason&lt;/span&gt;

    &lt;span class="n"&gt;thinking_pct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thoughts_token_count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thoughts_token_count&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;output_pct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates_token_count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates_token_count&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model:          &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Budget:         &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Thinking used:  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thoughts_token_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;thinking_pct&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output tokens:  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates_token_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_pct&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Finish reason:  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finish&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response len:   &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;finish&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MAX_TOKENS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates_token_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;DIAGNOSIS: Thinking tokens ate your budget.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FIX: Set thinking_budget=0 (Flash) or reasoning_effort=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;finish&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MAX_TOKENS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;DIAGNOSIS: Output actually hit the cap. Raise max_output_tokens.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;diagnose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a short poem about debugging.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script prints a percentage breakdown showing exactly where your budget went. If thinking is over 50 percent of your budget, you need to cap it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gateway-level fix
&lt;/h2&gt;

&lt;p&gt;All of this is fixable at the application layer, but it requires every caller to know about reasoning budgets. That doesn't scale if you have multiple services calling Gemini.&lt;/p&gt;

&lt;p&gt;I run &lt;a href="https://github.com/devansh-365/freellm" rel="noopener noreferrer"&gt;FreeLLM&lt;/a&gt; in front of my LLM calls. It's an OpenAI-compatible gateway that routes across six providers, and it sets the right reasoning budget per Gemini model automatically. Flash gets &lt;code&gt;reasoning_effort: "none"&lt;/code&gt;. Pro gets &lt;code&gt;"low"&lt;/code&gt;. Your full &lt;code&gt;max_tokens&lt;/code&gt; budget goes to the actual answer. You can override per-request if you need reasoning back.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:3000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "gemini/gemini-2.5-flash",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 1000
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before the gateway: 37 output tokens. After: 670+ tokens, &lt;code&gt;finish_reason: stop&lt;/code&gt;. Same prompt, same budget.&lt;/p&gt;

&lt;p&gt;The point is not "use my tool." The point is that gateway-level defaults let you fix provider quirks once instead of in every service.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to take away
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Gemini 2.5 is a reasoning model. Its thinking tokens count against your &lt;code&gt;max_output_tokens&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The dynamic default will eat 90 to 98 percent of your budget on anything non-trivial.&lt;/li&gt;
&lt;li&gt;For Flash, disable thinking with &lt;code&gt;thinking_budget: 0&lt;/code&gt; or &lt;code&gt;reasoning_effort: "none"&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For Pro, cap thinking with &lt;code&gt;thinking_budget: 128&lt;/code&gt; (minimum) or &lt;code&gt;reasoning_effort: "low"&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If you're using an OpenAI-compat endpoint, &lt;code&gt;reasoning_effort&lt;/code&gt; is the cleaner option, but it's under-documented. Both approaches are sketched right after this list.&lt;/li&gt;
&lt;li&gt;Run the diagnostic script above when in doubt.&lt;/li&gt;
&lt;/ol&gt;
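
&lt;p&gt;For reference, here's roughly what those two fixes look like in Python. Treat it as a sketch: the calls follow the current &lt;code&gt;google-genai&lt;/code&gt; and &lt;code&gt;openai&lt;/code&gt; SDKs, the model names match the ones above, and the API key placeholder is yours to fill in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: cap Gemini 2.5 thinking so max_output_tokens goes to the visible answer.
from google import genai
from google.genai import types
from openai import OpenAI

gclient = genai.Client(api_key="YOUR_GEMINI_KEY")

# Flash: disable thinking entirely.
resp = gclient.models.generate_content(
    model="gemini-2.5-flash",
    contents="Write a short poem about debugging.",
    config=types.GenerateContentConfig(
        max_output_tokens=200,
        thinking_config=types.ThinkingConfig(thinking_budget=0),
    ),
)

# Same idea through the OpenAI-compatible endpoint, using reasoning_effort.
# Passed via extra_body so it works regardless of your openai SDK version.
oai = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key="YOUR_GEMINI_KEY",
)
chat = oai.chat.completions.create(
    model="gemini-2.5-flash",
    max_tokens=200,
    messages=[{"role": "user", "content": "Write a short poem about debugging."}],
    extra_body={"reasoning_effort": "none"},  # use "low" for 2.5 Pro, which can't disable thinking
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
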

&lt;p&gt;The official docs don't make any of this obvious. Hopefully this post saves you the day I spent figuring it out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/devansh-365/freellm" rel="noopener noreferrer"&gt;github.com/devansh-365/freellm&lt;/a&gt; (the gateway that handles this for you)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>openai</category>
      <category>development</category>
    </item>
    <item>
      <title>LiteLLM got hacked. I built a simpler LLM gateway you can actually audit.</title>
      <dc:creator>Devansh</dc:creator>
      <pubDate>Tue, 14 Apr 2026 00:10:45 +0000</pubDate>
      <link>https://dev.to/devansh365/litellm-got-hacked-i-built-a-simpler-llm-gateway-you-can-actually-audit-3hia</link>
      <guid>https://dev.to/devansh365/litellm-got-hacked-i-built-a-simpler-llm-gateway-you-can-actually-audit-3hia</guid>
      <description>&lt;p&gt;On March 24, 2026, LiteLLM versions 1.82.7 and 1.82.8 were uploaded to PyPI with a credential harvester, a Kubernetes lateral-movement toolkit, and a persistent remote code execution backdoor baked in.&lt;/p&gt;

&lt;p&gt;The malicious package was live for about 40 minutes before PyPI quarantined it.&lt;/p&gt;

&lt;p&gt;40 minutes doesn't sound like much. But LiteLLM gets 95 million downloads a month. It's the default multi-provider routing library for anyone building on LLMs. Teams running &lt;code&gt;pip install litellm&lt;/code&gt; during that window got compromised automatically. No explicit import needed. The payload triggered on Python interpreter startup via a &lt;code&gt;.pth&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Google brought in Mandiant for the investigation. Snyk, Kaspersky, and Trend Micro all published breakdowns. The attack vector: a compromised Trivy security scanner leaked CircleCI credentials, including the PyPI publishing token and a GitHub PAT.&lt;/p&gt;

&lt;p&gt;This is not a theoretical risk. This happened.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn61cm58mhjeckf0ttnf8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn61cm58mhjeckf0ttnf8.png" alt="LiteLLM Attack Timeline" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The real problem is not one attack
&lt;/h2&gt;

&lt;p&gt;LiteLLM does a lot. 2,000+ models across 100+ providers. Proxy server, load balancing, spend tracking, A/B testing, caching, logging, guardrails, prompt management.&lt;/p&gt;

&lt;p&gt;That scope is the problem.&lt;/p&gt;

&lt;p&gt;A developer on HN described the codebase as having a 7,000+ line &lt;code&gt;utils.py&lt;/code&gt;. An engineer with 30 years of experience called it "the worst code I have ever read in my life." Before the supply chain attack, a DEV Community post titled "5 Real Issues With LiteLLM That Are Pushing Teams Away in 2026" was already documenting the trust erosion.&lt;/p&gt;

&lt;p&gt;The supply chain attack was the tipping point, not the root cause. The root cause is depending on a massive, opaque library for critical routing infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a simpler design looks like
&lt;/h2&gt;

&lt;p&gt;I ran into the same multi-provider routing problem last year while building Metis, an AI stock analysis tool. Kept burning through Groq's free tier in 20 minutes, switching to Gemini manually, hitting their cap, switching again.&lt;/p&gt;

&lt;p&gt;Built FreeLLM to stop doing that manually. It solves a narrower problem than LiteLLM, and that's the point.&lt;/p&gt;

&lt;p&gt;FreeLLM is an OpenAI-compatible gateway that routes across Groq, Gemini, Mistral, Cerebras, NVIDIA NIM, and Ollama. When one provider rate-limits, the next one answers. That's the core of it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqv5zmkmnwy550olr0tz4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqv5zmkmnwy550olr0tz4.png" alt="LiteLLM vs FreeLLM" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:3000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "free-fast", "messages": [{"role": "user", "content": "Hello!"}]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your existing OpenAI SDK code works. Swap the base URL. Keep your code.&lt;/p&gt;

&lt;p&gt;Three meta-models handle routing: &lt;code&gt;free-fast&lt;/code&gt; (lowest latency, usually Groq/Cerebras), &lt;code&gt;free-smart&lt;/code&gt; (best reasoning, usually Gemini 2.5 Pro), and &lt;code&gt;free&lt;/code&gt; (max availability).&lt;/p&gt;

&lt;h3&gt;
  
  
  What it fixes that LiteLLM doesn't
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Gemini 2.5 reasoning tokens eating your output.&lt;/strong&gt; This is one of the most reported Gemini bugs right now. Gemini 2.5 Flash and Pro are reasoning models. They burn 90-98% of your &lt;code&gt;max_tokens&lt;/code&gt; on internal thinking before producing visible text. Ask for 1,000 tokens and you get back 37. There are 15+ open GitHub issues about this across multiple SDKs.&lt;/p&gt;

&lt;p&gt;FreeLLM fixes it at the gateway. Flash gets &lt;code&gt;reasoning_effort: "none"&lt;/code&gt; by default. Pro gets &lt;code&gt;"low"&lt;/code&gt;. Your full token budget goes to the actual answer. Override per-request if you want the reasoning back.&lt;/p&gt;
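
&lt;p&gt;Because the gateway speaks the OpenAI protocol, a per-request override is just another field in the body. Here's a sketch with the Python SDK; it assumes FreeLLM forwards &lt;code&gt;reasoning_effort&lt;/code&gt; through unchanged, so check the README for the exact passthrough behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: ask the gateway for full reasoning on one specific call.
# Assumes reasoning_effort is forwarded to Gemini as-is.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="unused")

response = client.chat.completions.create(
    model="gemini/gemini-2.5-pro",
    messages=[{"role": "user", "content": "Plan a migration from REST to gRPC."}],
    max_tokens=2000,
    extra_body={"reasoning_effort": "high"},  # overrides the gateway's "low" default for Pro
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
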

&lt;p&gt;&lt;strong&gt;Provider outages don't break your app.&lt;/strong&gt; Claude went down for three consecutive days in early April. 8,000+ Downdetector reports. If your app depends on one provider, that's three days of broken service. FreeLLM's circuit breakers pull failing providers from rotation and test for recovery automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response caching without a separate layer.&lt;/strong&gt; Identical prompts return in ~23ms with zero quota burn. The cache refuses to store truncated responses (another Gemini bug: reasoning models returning cut-off output that then poisons your cache for an hour).&lt;/p&gt;
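
&lt;p&gt;The design rule behind that last point is simple: key the cache on the exact request and refuse to store anything that didn't finish cleanly. A conceptual sketch, not FreeLLM's actual TypeScript; the function names are illustrative.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Conceptual sketch: cache only responses that completed, never truncated ones.
import hashlib, json, time

cache = {}  # key -&amp;gt; (expires_at, response)

def cache_key(model, messages):
    # Identical prompts hash to the same key, so a repeat skips the provider entirely.
    raw = json.dumps([model, messages], sort_keys=True).encode()
    return hashlib.sha256(raw).hexdigest()

def maybe_store(key, response, ttl=3600):
    # A "length"/MAX_TOKENS finish means the output was cut off; caching it would
    # poison every identical request for the next hour, so refuse it.
    if response["choices"][0]["finish_reason"] != "stop":
        return
    cache[key] = (time.time() + ttl, response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
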

&lt;p&gt;&lt;strong&gt;Browser-safe tokens for static sites.&lt;/strong&gt; Mint a short-lived HMAC-signed token from a serverless function, pass it to the browser, call the gateway directly from client-side JavaScript. No auth backend. No session store.&lt;/p&gt;
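
&lt;p&gt;The token scheme is the classic short-lived HMAC pattern. A minimal sketch of the idea; the payload layout and secret name here are illustrative assumptions, not FreeLLM's actual token format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: mint a short-lived, HMAC-signed token in a serverless function.
# Payload fields and the secret name are illustrative assumptions.
import base64, hashlib, hmac, json, time

SECRET = b"shared-with-the-gateway"  # the same secret is configured on the gateway

def mint_token(ttl_seconds=300):
    payload = json.dumps({"exp": int(time.time()) + ttl_seconds}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    # The browser sends this as its bearer token. The gateway recomputes the HMAC
    # and checks exp, so there is no auth backend and no session store.
    return base64.urlsafe_b64encode(payload + b"." + sig).decode()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
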

&lt;h3&gt;
  
  
  Key stacking: 360 free requests per minute
&lt;/h3&gt;

&lt;p&gt;Every provider env var accepts a comma-separated list of keys, and FreeLLM rotates through them round-robin.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GROQ_API_KEY=gsk_key1,gsk_key2,gsk_key3
GEMINI_API_KEY=AI_key1,AI_key2,AI_key3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stack 3 keys across 5 cloud providers: ~360 req/min. All free. Enough to prototype an entire product without spending anything.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrn8oxy5n9bvgvnpbjze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrn8oxy5n9bvgvnpbjze.png" alt="Key Stacking Math" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Get it running
&lt;/h2&gt;

&lt;p&gt;Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 3000:3000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gsk_... &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;AI... &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/devansh-365/freellm:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or one-click deploy on Railway or Render (buttons in the README).&lt;/p&gt;

&lt;p&gt;Use it from Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unused&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;free-smart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain circuit breakers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TypeScript, Go, Ruby, anything that speaks OpenAI. Same pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters beyond FreeLLM
&lt;/h2&gt;

&lt;p&gt;The LiteLLM attack exposed something the community already suspected: critical AI infrastructure is running on libraries nobody audits.&lt;/p&gt;

&lt;p&gt;The fix is not "use my tool instead." The fix is smaller dependencies, pinned versions, codebases you can read in an afternoon. FreeLLM ships 262 tests across 22 files. TypeScript, not Python. Docker images with pinned deps. MIT licensed.&lt;/p&gt;

&lt;p&gt;If you don't use FreeLLM, build something similarly scoped. The era of "install this 100-provider mega-library and trust it with your API keys" should be over.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsl47ise5n7ao6pfjztu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsl47ise5n7ao6pfjztu3.png" alt="Request Flow" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;262 tests. 6 providers. One endpoint. Zero cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Devansh-365/freellm" rel="noopener noreferrer"&gt;github.com/devansh-365/freellm&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>security</category>
      <category>openai</category>
    </item>
    <item>
      <title>I built an OpenAI-compatible gateway that routes across 5 free LLM providers</title>
      <dc:creator>Devansh</dc:creator>
      <pubDate>Mon, 06 Apr 2026 20:22:07 +0000</pubDate>
      <link>https://dev.to/devansh365/i-built-an-openai-compatible-gateway-that-routes-across-5-free-llm-providers-6jo</link>
      <guid>https://dev.to/devansh365/i-built-an-openai-compatible-gateway-that-routes-across-5-free-llm-providers-6jo</guid>
      <description>&lt;p&gt;Every LLM provider has a free tier.&lt;/p&gt;

&lt;p&gt;Groq gives you 30 requests per minute. Gemini gives you 15. Cerebras gives you 30. Mistral gives you 5.&lt;/p&gt;

&lt;p&gt;Combined, that's about 80 requests per minute. Enough for prototyping, internal tools, and side projects where you don't want to pay for API access yet.&lt;/p&gt;

&lt;p&gt;The problem: each provider has its own SDK, its own rate limits, its own auth, and its own downtime. You end up writing provider-switching logic, catching 429 errors, and managing API keys across five different dashboards.&lt;/p&gt;

&lt;p&gt;I got tired of this while building &lt;a href="https://trymetis.app" rel="noopener noreferrer"&gt;Metis&lt;/a&gt;, an AI stock analysis tool. Kept hitting Groq's limits while Gemini had capacity sitting idle. So I built FreeLLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  What FreeLLM does
&lt;/h2&gt;

&lt;p&gt;One endpoint. Five providers. Twenty models. All free.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:3000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "free-fast", "messages": [{"role": "user", "content": "Hello!"}]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your existing OpenAI SDK code works. Just change the base URL. That's the whole migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the routing works
&lt;/h2&gt;

&lt;p&gt;When a request comes in, FreeLLM:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Checks which providers are healthy (circuit breakers track this automatically)&lt;/li&gt;
&lt;li&gt;Picks the best available provider based on your model choice&lt;/li&gt;
&lt;li&gt;If that provider returns a 429 or fails, it tries the next one (sketched just after this list)&lt;/li&gt;
&lt;li&gt;You get a response&lt;/li&gt;
&lt;/ol&gt;
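
&lt;p&gt;In pseudocode, that failover loop looks something like this. It's a conceptual sketch, not FreeLLM's TypeScript source; the provider objects and exception names are stand-ins.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Conceptual failover loop: walk providers in priority order, skip anything the
# circuit breaker marks unhealthy, and move on after a rate limit or error.

class RateLimited(Exception): ...
class ProviderError(Exception): ...

def route(request, providers):
    last_error = None
    for provider in providers:
        if not provider.healthy():             # circuit breaker says skip
            continue
        try:
            return provider.complete(request)  # first success wins
        except (RateLimited, ProviderError) as err:
            last_error = err                   # fall through to the next provider
    raise RuntimeError("all providers are rate-limited or down") from last_error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
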

&lt;p&gt;Three meta-models handle routing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;free-fast   → lowest latency (usually Groq or Cerebras)
free-smart  → most capable model (usually Gemini 2.5)
free        → maximum availability across all providers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Providers and their free tiers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Models&lt;/th&gt;
&lt;th&gt;Free Tier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Groq&lt;/td&gt;
&lt;td&gt;Llama 3.3 70B, Llama 4 Scout, Qwen3 32B&lt;/td&gt;
&lt;td&gt;~30 req/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;2.5 Flash, 2.5 Pro, 2.0 Flash&lt;/td&gt;
&lt;td&gt;~15 req/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cerebras&lt;/td&gt;
&lt;td&gt;Llama 3.1 8B, Qwen3 235B, GPT-OSS 120B&lt;/td&gt;
&lt;td&gt;~30 req/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral&lt;/td&gt;
&lt;td&gt;Small, Medium, Nemo&lt;/td&gt;
&lt;td&gt;~5 req/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;td&gt;Any local model&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwx65ruk33vv4zqkl0q5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwx65ruk33vv4zqkl0q5.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's under the hood
&lt;/h2&gt;

&lt;p&gt;This isn't a simple round-robin proxy. The routing layer handles real production concerns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sliding-window rate limiter.&lt;/strong&gt; Each provider's limits are tracked independently. FreeLLM knows how many requests you've sent to Groq in the last 60 seconds and won't send another if you're near the cap.&lt;/p&gt;
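
&lt;p&gt;The window check is the usual deque-of-timestamps approach. Roughly, and only as a sketch of the idea rather than the real implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a per-provider sliding-window rate limiter (defaults mimic Groq's ~30 req/min).
import time
from collections import deque

class SlidingWindowLimiter:
    def __init__(self, max_requests=30, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.sent = deque()  # timestamps of recent requests to this provider

    def allow(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] &amp;gt; self.window:
            self.sent.popleft()
        if len(self.sent) &amp;gt;= self.max_requests:
            return False  # at the cap: route this request to another provider
        self.sent.append(now)
        return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
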

&lt;p&gt;&lt;strong&gt;Circuit breakers.&lt;/strong&gt; If Gemini starts returning 500s, FreeLLM pulls it from rotation. Every 30 seconds, it sends a test request. When the provider recovers, it goes back in.&lt;/p&gt;
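
&lt;p&gt;The breaker itself is a small state machine: open after repeated failures, probe on an interval, close again once a probe succeeds. A conceptual sketch; the thresholds here are made up, not FreeLLM's defaults.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a circuit breaker with a probe interval.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, probe_interval=30):
        self.failure_threshold = failure_threshold
        self.probe_interval = probe_interval
        self.failures = 0
        self.opened_at = None  # None means the provider is in rotation

    def healthy(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at &amp;gt;= self.probe_interval:
            # Half-open: let one probe request through, push the next probe out.
            self.opened_at = time.monotonic()
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # back into rotation

    def record_failure(self):
        self.failures += 1
        if self.failures &amp;gt;= self.failure_threshold:
            self.opened_at = time.monotonic()  # pull the provider from rotation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
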

&lt;p&gt;&lt;strong&gt;Per-client rate limiting.&lt;/strong&gt; If you expose this to a team, each client gets their own limit. Admin auth protects the config endpoints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zod validation.&lt;/strong&gt; Every request is validated before it hits any provider. Bad payloads fail fast with clear error messages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time dashboard.&lt;/strong&gt; React frontend showing provider health, request logs, and latency. You can see which providers are healthy at a glance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get it running in 30 seconds
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/devansh-365/freellm.git
&lt;span class="nb"&gt;cd &lt;/span&gt;freellm
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env   &lt;span class="c"&gt;# add your free API keys&lt;/span&gt;
docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;API on &lt;code&gt;localhost:3000&lt;/code&gt;. Dashboard on &lt;code&gt;localhost:3000/dashboard&lt;/code&gt;. Done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using it with the OpenAI SDK
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:3000/v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;not-needed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;free-fast&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Explain circuit breakers in 2 sentences&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No new SDK to learn. No migration effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I built this
&lt;/h2&gt;

&lt;p&gt;I was building Metis and kept running into the same pattern: burn through Groq's free tier in 20 minutes of testing, switch to Gemini manually, hit their limit, switch to Mistral. Repeat.&lt;/p&gt;

&lt;p&gt;Wrote a quick proxy to automate the switching. Added failover because providers go down randomly. Added circuit breakers because I didn't want to wait for timeouts. Added a dashboard because I wanted to see what was happening.&lt;/p&gt;

&lt;p&gt;It grew into a proper tool. Open-sourced it because every developer prototyping with LLMs has this exact problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stack
&lt;/h2&gt;

&lt;p&gt;TypeScript, Express 5, React 19, Zod, Docker. MIT licensed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/devansh-365/freellm" rel="noopener noreferrer"&gt;github.com/devansh-365/freellm&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>typescript</category>
      <category>opensource</category>
    </item>
    <item>
      <title>react native animation</title>
      <dc:creator>Devansh</dc:creator>
      <pubDate>Wed, 10 Sep 2025 16:11:18 +0000</pubDate>
      <link>https://dev.to/devansh365/react-native-animation-46f2</link>
      <guid>https://dev.to/devansh365/react-native-animation-46f2</guid>
      <description></description>
      <category>reactnative</category>
      <category>react</category>
      <category>animation</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
