<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Akash Melavanki</title>
    <description>The latest articles on DEV Community by Akash Melavanki (@thsky21).</description>
    <link>https://dev.to/thsky21</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3866017%2Fbe85f26b-1253-41a7-a3f8-a912d5d47e89.png</url>
      <title>DEV Community: Akash Melavanki</title>
      <link>https://dev.to/thsky21</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thsky21"/>
    <language>en</language>
    <item>
      <title>Vercel got hacked. Your API keys rotated. You're still not safe.</title>
      <dc:creator>Akash Melavanki</dc:creator>
      <pubDate>Tue, 21 Apr 2026 16:44:56 +0000</pubDate>
      <link>https://dev.to/thsky21/vercel-got-hacked-your-api-keys-rotated-youre-still-not-safe-361c</link>
      <guid>https://dev.to/thsky21/vercel-got-hacked-your-api-keys-rotated-youre-still-not-safe-361c</guid>
      <description>&lt;p&gt;I host Thskyshield on Vercel.&lt;/p&gt;

&lt;p&gt;So when I woke up to the news that Vercel had been breached — internal systems compromised, customer environment variables exposed, data allegedly being sold on BreachForums for $2 million — I didn't panic. I rotated my keys immediately, like everyone else.&lt;/p&gt;

&lt;p&gt;And then I sat with a question that I don't think enough developers are asking:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens in the window between a key being stolen and you rotating it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That question is exactly why I built Thskyshield.&lt;/p&gt;




&lt;h2&gt;What actually happened at Vercel&lt;/h2&gt;

&lt;p&gt;Let me explain the attack chain, because it's more interesting than "Vercel got hacked."&lt;/p&gt;

&lt;p&gt;A Context.ai employee got hit with Lumma Stealer malware in February 2026. Lumma is an infostealer — it quietly harvests credentials, OAuth tokens, session cookies. The attacker sat on those credentials for weeks.&lt;/p&gt;

&lt;p&gt;Then they used the stolen OAuth token to access Vercel's Google Workspace. That gave them access to Vercel's internal environments. And in those environments sat thousands of customer environment variables — API keys, database credentials, signing tokens — that weren't marked as "sensitive."&lt;/p&gt;

&lt;p&gt;One employee. One third-party tool. One overly permissive OAuth grant. And suddenly an attacker has the keys to a significant chunk of the web's developer infrastructure.&lt;/p&gt;

&lt;p&gt;This is what a supply chain attack looks like in 2026. It's not brute force. It's patient, precise, and automated.&lt;/p&gt;




&lt;h2&gt;The standard advice is incomplete&lt;/h2&gt;

&lt;p&gt;"Rotate your keys immediately."&lt;/p&gt;

&lt;p&gt;Yes. Obviously. Do that.&lt;/p&gt;

&lt;p&gt;But here's what nobody's talking about: key rotation is reactive. You rotate after you know you're compromised. The attacker who stole your key at 2 AM on a Sunday doesn't wait for Monday morning. They act the moment they have it.&lt;/p&gt;

&lt;p&gt;With AI-powered automation, that window between theft and damage is now measured in seconds, not hours. Google's Threat Intelligence team found that the time between initial access and full breach has collapsed from 8 hours in 2022 to 22 seconds in 2025. The attacker doesn't need you to be asleep. They're done before you finish reading the breach notification email.&lt;/p&gt;

&lt;p&gt;So yes, rotate your keys. But also ask: if this key gets stolen tonight, what is the maximum damage the attacker can do with it?&lt;/p&gt;




&lt;h2&gt;What a stolen LLM API key actually enables&lt;/h2&gt;

&lt;p&gt;Here's where I want to be honest about scope, because I think most people only think about one dimension of this.&lt;/p&gt;

&lt;p&gt;A stolen OpenAI, Anthropic, or Gemini API key gives an attacker several options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Denial of Wallet&lt;/strong&gt; — loop your chatbot endpoint with high-token payloads. Max out your billing. Leave you with a $5,000 invoice by sunrise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data exfiltration&lt;/strong&gt; — if your LLM calls include user data, system prompts, or sensitive context, the attacker can extract that by replaying your own endpoints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Content generation at your cost&lt;/strong&gt; — use your key as a free compute resource. Generate content, run agents, build products — all billed to you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Prompt injection into your users&lt;/strong&gt; — if the attacker can make calls to your endpoint, they may be able to manipulate the responses your actual users see.&lt;/p&gt;

&lt;p&gt;I want to be clear: I'm only solving one of these. Thskyshield is a financial kill-switch. It stops the billing damage. It does not stop data exfiltration. It doesn't block malicious prompt injection. There are other tools for those layers.&lt;/p&gt;

&lt;p&gt;What it does guarantee: every attack that costs money is stopped at a known ceiling. If your OpenAI key is stolen and the attacker launches a Denial of Wallet attack, they can only drain up to the daily limit you set per user. Not $5,000. Not $500. Whatever you decided.&lt;/p&gt;

&lt;p&gt;That ceiling holds even if everything else fails. Even if Vercel leaks your key. Even if the attacker has full access. The financial blast radius is bounded.&lt;/p&gt;
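
&lt;p&gt;To make that concrete, here's a minimal sketch of the idea in TypeScript. This is illustrative only, not the Thskyshield API; &lt;code&gt;spentToday&lt;/code&gt; is a stand-in for however you track per-user spend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Illustrative sketch, not the Thskyshield API.
const DAILY_LIMIT_USD = 5.0;

async function guardedCall(userId: string, estimatedCostUsd: number) {
  const spent = await spentToday(userId); // e.g., read from Redis
  if (spent + estimatedCostUsd &amp;gt; DAILY_LIMIT_USD) {
    // The ceiling: even an attacker holding your key stops here.
    return new Response('Budget exceeded', { status: 429 });
  }
  // ...proceed with the LLM call, then record the real cost...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;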




&lt;h2&gt;Why AI makes this more urgent, not less&lt;/h2&gt;

&lt;p&gt;I've had this instinct for a while, and the data is starting to confirm it.&lt;/p&gt;

&lt;p&gt;Attacks are getting faster because the automation is getting smarter. The Lumma Stealer that hit the Context.ai employee — that's not a human manually harvesting credentials. That's a piece of software running autonomously, finding credentials, exfiltrating them, and handing them off to an operator in real time.&lt;/p&gt;

&lt;p&gt;The same automation that makes AI products useful makes AI-powered attacks cheap to run at scale. A Denial of Wallet attack used to require someone sitting at a keyboard, writing a loop script, running it manually. Today it's a five-line agentic task: "Loop this endpoint until the budget hits $X or you get a 429."&lt;/p&gt;

&lt;p&gt;The attack surface for every AI product is a financial endpoint. Every chatbot, every AI feature, every LLM-powered tool has a cost function. And attackers are starting to understand that better than most developers do.&lt;/p&gt;




&lt;h2&gt;The thing I keep coming back to&lt;/h2&gt;

&lt;p&gt;The Vercel breach is not an outlier. LiteLLM in March. Axios in March. Context.ai in February. Vercel in April.&lt;/p&gt;

&lt;p&gt;Every one of these is a supply chain attack. Every one of them exposed developer credentials. Every one of them happened through trusted infrastructure — things developers rely on every day without thinking twice.&lt;/p&gt;

&lt;p&gt;You can't stop the breaches. You can't guarantee your keys won't be stolen. What you can control is what happens after.&lt;/p&gt;

&lt;p&gt;Rotate your keys — yes. But also put a ceiling on what a stolen key can do to your business.&lt;/p&gt;

&lt;p&gt;That's the layer most developers don't have yet.&lt;/p&gt;




&lt;p&gt;If you're building with LLM APIs and you want a hard limit on what a stolen key can drain: thskyshield.com&lt;/p&gt;

&lt;p&gt;Or if you want to watch a simulated Denial of Wallet attack fire in real time: thskyshield.com/simulator&lt;/p&gt;

&lt;p&gt;Curious what others are doing about this. Are you relying on provider-side limits? Rolling your own governance? Or just hoping it doesn't happen to you?&lt;/p&gt;

</description>
      <category>security</category>
      <category>saas</category>
      <category>devops</category>
      <category>ai</category>
    </item>
    <item>
      <title>Rate limiting your LLM API is useless. Here's what actually protects you.</title>
      <dc:creator>Akash Melavanki</dc:creator>
      <pubDate>Tue, 14 Apr 2026 13:32:34 +0000</pubDate>
      <link>https://dev.to/thsky21/rate-limiting-your-llm-api-is-useless-heres-what-actually-protects-you-mmc</link>
      <guid>https://dev.to/thsky21/rate-limiting-your-llm-api-is-useless-heres-what-actually-protects-you-mmc</guid>
      <description>&lt;p&gt;Last month, the LiteLLM supply chain attack exposed API keys across thousands of developer projects.&lt;/p&gt;

&lt;p&gt;The standard advice: rotate your keys immediately.&lt;/p&gt;

&lt;p&gt;Here's what nobody tells you after that: a rotated key doesn't protect you from the next attack. Rate limiting doesn't either. I'll show you why — and what actually works.&lt;/p&gt;




&lt;h2&gt;The problem with rate limiting LLMs&lt;/h2&gt;

&lt;p&gt;Rate limiting assumes all requests cost roughly the same. For traditional APIs, that's true.&lt;/p&gt;

&lt;p&gt;For LLMs, it's completely wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request 1: "Hi"
→ ~10 tokens → cost: $0.0001

Request 2: "Summarize this 50-page PDF"
→ ~30,000 tokens → cost: $0.45
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An attacker doesn't need high volume. They just need expensive requests. 10 requests per minute means nothing when each request costs $0.45.&lt;/p&gt;

&lt;p&gt;What you actually need is &lt;strong&gt;budget limiting&lt;/strong&gt; — enforcing a maximum dollar spend per user, per day, in real time.&lt;/p&gt;
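
&lt;p&gt;Budget limiting starts with pricing every request in dollars before it runs. A minimal sketch (the per-token prices here are illustrative, not current provider rates):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Illustrative per-1M-token prices, NOT current provider rates.
const PRICES_PER_1M = {
  'gpt-4o': { input: 2.5, output: 10.0 },
} as const;

function estimateCostUsd(
  model: keyof typeof PRICES_PER_1M,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = PRICES_PER_1M[model];
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

console.log(estimateCostUsd('gpt-4o', 10, 20));        // "Hi": ~$0.0002
console.log(estimateCostUsd('gpt-4o', 30_000, 2_000)); // big PDF: ~$0.095
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;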




&lt;h2&gt;The race condition nobody talks about&lt;/h2&gt;

&lt;p&gt;OK, so you decide to track spend in Redis. Simple, right?&lt;/p&gt;

&lt;p&gt;Wrong. Here's what happens at scale.&lt;/p&gt;

&lt;p&gt;Your app receives 10 concurrent requests from the same user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Instance A reads budget: $0.05 remaining. Proceeds.
Instance B reads budget: $0.05 remaining. Proceeds.
Instance C reads budget: $0.05 remaining. Proceeds.
...all 10 instances read $0.05 and proceed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 10 fire $1.00 LLM requests. Your user's budget was $1.00. You just spent $10.00.&lt;/p&gt;

&lt;p&gt;This is the race condition. Standard Redis GET + SET cannot solve it — there's always a gap between reading and writing where another instance sneaks through.&lt;/p&gt;
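
&lt;p&gt;Here's the broken pattern in TypeScript, using @upstash/redis as an example client. Any client with separate GET and SET calls has the same flaw:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Redis } from '@upstash/redis';

const redis = Redis.fromEnv();

// BROKEN on purpose: check-then-act across two round trips.
async function naiveCheck(budgetKey: string, cost: number, limit: number) {
  const spent = Number((await redis.get(budgetKey)) ?? 0);
  if (spent + cost &amp;gt; limit) return false; // BLOCK
  // Another instance can read the same stale value right here.
  await redis.set(budgetKey, spent + cost);
  return true; // ALLOW, possibly ten times over
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;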




&lt;h2&gt;The fix: atomic Lua scripts&lt;/h2&gt;

&lt;p&gt;The solution is to move the entire check-and-update logic into a single atomic operation inside Redis. Lua scripts on Redis run as one uninterruptible step — no interleaving, no race condition possible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- runs atomically inside Redis — no race condition possible&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'GET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="c1"&gt;-- BLOCK&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'INCRBYFLOAT'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;-- ALLOW&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs in ~10ms on edge infrastructure. Instance A and Instance B hitting this at the exact same millisecond? Redis queues them. One passes, one fails. Budget enforced. Mathematically consistent.&lt;/p&gt;
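
&lt;p&gt;Invoking it from a route handler is one round trip. A sketch reusing the @upstash/redis client from the broken example above; the function name and key format are my own, not a fixed API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// BUDGET_SCRIPT holds the Lua script above, verbatim.
const BUDGET_SCRIPT = `-- ...the Lua from this section...`;

async function tryReserve(userId: string, costUsd: number, limitUsd: number) {
  // A date-scoped key gives you a daily budget without managing TTLs.
  const key = `budget:${userId}:${new Date().toISOString().slice(0, 10)}`;
  // One network hop; Redis runs the whole script as one atomic step.
  const allowed = await redis.eval(BUDGET_SCRIPT, [key], [limitUsd, costUsd]);
  return allowed === 1;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;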




&lt;h2&gt;The two-phase protocol&lt;/h2&gt;

&lt;p&gt;There's one more problem: you don't know the exact cost of an LLM call until it finishes.&lt;/p&gt;

&lt;p&gt;The solution is a two-phase commit:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1 — pre-flight (before the LLM call)&lt;/strong&gt;&lt;br&gt;
Estimate the cost based on max possible tokens. Reserve that amount atomically. If budget exceeded, return 429 immediately — the LLM never even gets called.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2 — reconciliation (after the LLM call)&lt;/strong&gt;&lt;br&gt;
OpenAI returns actual token usage. Reconcile: release the estimate, apply the real cost. If the estimate was too high, refund the difference back to the user's budget.&lt;/p&gt;

&lt;p&gt;This means the budget can never be overspent, even under worst-case concurrency.&lt;/p&gt;
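
&lt;p&gt;Wired together, the two phases look roughly like this. &lt;code&gt;countTokens&lt;/code&gt;, &lt;code&gt;callModel&lt;/code&gt;, and &lt;code&gt;todayKey&lt;/code&gt; are hypothetical helpers (todayKey must produce the same key tryReserve writes to):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async function governedCall(userId: string, prompt: string, limitUsd: number) {
  const maxOutput = 1_000; // enforce this via the provider's max_tokens
  // Phase 1: reserve the worst case before the model is ever called.
  const reserved = estimateCostUsd('gpt-4o', countTokens(prompt), maxOutput);
  if (!(await tryReserve(userId, reserved, limitUsd))) {
    return Response.json({ error: 'Budget exceeded' }, { status: 429 });
  }

  const { text, usage } = await callModel(prompt, maxOutput); // your LLM call

  // Phase 2: reconcile. Actual cost never exceeds the reservation, so this
  // negative increment refunds the unused part of the estimate.
  const actual = estimateCostUsd('gpt-4o', usage.prompt_tokens, usage.completion_tokens);
  await redis.incrbyfloat(todayKey(userId), actual - reserved);
  return Response.json({ text });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;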


&lt;h2&gt;Benchmark: controlled Denial of Wallet attack&lt;/h2&gt;

&lt;p&gt;I ran a simulated DoW attack against a standard GPT-4o endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt; Recursive script, concurrent requests, 800+ token payloads per request.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Unprotected&lt;/th&gt;
&lt;th&gt;With atomic governance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Duration&lt;/td&gt;
&lt;td&gt;47 seconds&lt;/td&gt;
&lt;td&gt;Stopped at request 3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total spend&lt;/td&gt;
&lt;td&gt;$847.00&lt;/td&gt;
&lt;td&gt;$0.08&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intervention&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The governance layer fired a 429 at the third request. The attacker's loop never got traction.&lt;/p&gt;


&lt;h2&gt;Why this matters after supply chain attacks&lt;/h2&gt;

&lt;p&gt;Here's the thing about the LiteLLM and Axios breaches: rotating your API key is the right move, but it's reactive. The damage happens in the window between the breach and you waking up.&lt;/p&gt;

&lt;p&gt;Budget governance is your last line of defense. Even with a stolen key, the attacker can only drain up to the limit you set. No $1,000 surprise at sunrise.&lt;/p&gt;


&lt;h2&gt;The implementation&lt;/h2&gt;

&lt;p&gt;I built this into an open-source SDK called Thskyshield. Two calls wrap any LLM request: a pre-flight check and a post-flight log.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;requestId&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;shield&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;externalUserId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;estimatedTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Budget exceeded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// ...your LLM call here...&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;shield&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;externalUserId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completion_tokens&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SDK handles the atomic reservation, the two-phase reconciliation, and the 429 response automatically. Works on Vercel Edge, Cloudflare Workers, or any Node.js environment.&lt;/p&gt;

&lt;p&gt;→ SDK: &lt;code&gt;npm install @thsky-21/thskyshield&lt;/code&gt;&lt;br&gt;
→ Live attack simulation: thskyshield.com/simulator&lt;/p&gt;




&lt;p&gt;What are you actually using to cap LLM spend right now? Are you relying on OpenAI's hard limits, or have you built something custom? Would genuinely like to know what's working.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>javascript</category>
      <category>security</category>
    </item>
    <item>
      <title>How I built a real-time LLM "Kill-Switch" for Vercel Edge using Atomic Redis</title>
      <dc:creator>Akash Melavanki</dc:creator>
      <pubDate>Tue, 07 Apr 2026 14:39:17 +0000</pubDate>
      <link>https://dev.to/thsky21/how-i-built-a-real-time-llm-kill-switch-for-vercel-edge-using-atomic-redis-3njm</link>
      <guid>https://dev.to/thsky21/how-i-built-a-real-time-llm-kill-switch-for-vercel-edge-using-atomic-redis-3njm</guid>
      <description>&lt;p&gt;Last week, the Axios supply chain attack compromised over 100 million weekly downloads. A week before that, it was LiteLLM.&lt;/p&gt;

&lt;p&gt;In both cases, the goal was simple: Exfiltrate API keys. As developers, we are taught to rotate our keys immediately. But there’s a massive gap in that advice. If an attacker gets your OpenAI key at 2 AM, they don't wait for you to wake up. They loop your endpoints, drain your credits, and leave you with a $1,000+ bill by sunrise.&lt;/p&gt;

&lt;p&gt;This is what OWASP calls LLM10:2025 – Unbounded Consumption (or "Denial of Wallet"). I spent the last two weeks building a way to stop it at the Edge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem: Why Rate Limiting Fails LLMs&lt;/strong&gt;&lt;br&gt;
Standard rate-limiting (e.g., 10 requests per minute) is useless for LLMs.&lt;/p&gt;

&lt;p&gt;Request 1: "Hi" (10 tokens) — Cost: $0.0001&lt;/p&gt;

&lt;p&gt;Request 2: "Summarize this 50-page PDF" (30,000 tokens) — Cost: $0.45&lt;/p&gt;

&lt;p&gt;An attacker doesn't need a high volume of requests to ruin you; they just need expensive requests. We need Budget Limiting, not Rate Limiting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Technical Challenge: The Stateless Race Condition&lt;/strong&gt;&lt;br&gt;
I’m building this for Next.js on Vercel Edge.&lt;/p&gt;

&lt;p&gt;Vercel Edge functions are stateless. If you try to track a user's spend in a local variable, it vanishes. If you use a standard database, the latency kills your UX.&lt;/p&gt;

&lt;p&gt;But the real "final boss" is the Race Condition.&lt;/p&gt;

&lt;p&gt;Imagine a user fires 10 concurrent requests.&lt;/p&gt;

&lt;p&gt;Instance A checks the budget: "Remaining: $0.05. Proceed."&lt;/p&gt;

&lt;p&gt;Instance B checks the budget: "Remaining: $0.05. Proceed."&lt;/p&gt;

&lt;p&gt;Both fire $1.00 requests.&lt;/p&gt;

&lt;p&gt;Result: You are now -$1.95 in the hole.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Atomic Lua Scripts on Redis&lt;/strong&gt;&lt;br&gt;
To solve this, I moved the logic into an Atomic Lua Script on Upstash Redis. Instead of "Check then Update" (two steps), the logic happens in one single, uninterruptible step inside the database memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- The "Kill-Switch" Logic&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;-- user_budget_key&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;-- e.g., 1.00 USD&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;-- estimated cost&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'GET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="c1"&gt;-- BLOCK&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'INCRBYFLOAT'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;-- ALLOW&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs in ~10ms. If Instance A and B hit the script at the exact same millisecond, Redis queues them. One passes, the second fails. No race condition. No $1,000 surprises.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Benchmark: A Controlled Stress Test&lt;/strong&gt;&lt;br&gt;
To quantify the risk, I ran a simulated Denial of Wallet (DoW) attack against a standard Next.js API route.&lt;/p&gt;

&lt;p&gt;The Setup:&lt;/p&gt;

&lt;p&gt;Attacker: A simple recursive script firing concurrent requests with high-token payloads (800+ tokens/request).&lt;/p&gt;

&lt;p&gt;Target: A GPT-4o endpoint.&lt;/p&gt;

&lt;p&gt;The Result (Unprotected): The script ran for 47 seconds. Total simulated cost reached $847.00 before manual intervention.&lt;/p&gt;

&lt;p&gt;The Result (Thskyshield): Using the same script, the governance layer triggered a 429 (Too Many Requests) at the 3rd call. Total spend: $0.08.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.thskyshield.com/simulator" rel="noopener noreferrer"&gt;Watch the Live Simulation →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Two-Phase" Protocol&lt;/strong&gt;&lt;br&gt;
The hardest part was handling the fact that you don't know the exact cost of an LLM call until it's finished. I settled on a two-phase approach:&lt;/p&gt;

&lt;p&gt;Phase 1 (Pre-flight): Check the budget based on the max possible tokens. "Lock" that amount.&lt;/p&gt;

&lt;p&gt;Phase 2 (Post-flight): Once the LLM returns, reconcile the actual usage and "Refund" the difference to the user's budget.&lt;/p&gt;
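
&lt;p&gt;A worked example, at an illustrative $15 per 1M output tokens (your model's rates will differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase 1: cap output at 1,000 tokens → reserve $0.0150
Phase 2: model actually uses 400 tokens → real cost $0.0060
Refund:  $0.0150 - $0.0060 = $0.0090 back to the user's budget
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;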

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Supply chain attacks like the Axios one are the "new normal." We can't stop every key from being stolen, but we can stop a stolen key from being a business-ending event.&lt;/p&gt;

&lt;p&gt;I’ve open-sourced the SDK for this under Thskyshield. If you're building with Next.js and want to stop worrying about your OpenAI bill, it's free for founders.&lt;/p&gt;

&lt;p&gt;SDK: &lt;a href="https://www.npmjs.com/package/@thsky-21/thskyshield" rel="noopener noreferrer"&gt;@thsky-21/thskyshield&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Website: &lt;a href="https://www.thskyshield.com/" rel="noopener noreferrer"&gt;thskyshield.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Would love to hear how others are handling "Denial of Wallet" risks. Are you just relying on OpenAI's hard limits, or are you building your own governance layer?&lt;/p&gt;

</description>
      <category>nextjs</category>
      <category>ai</category>
      <category>security</category>
      <category>api</category>
    </item>
  </channel>
</rss>
