<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Faruk Celikkanat</title>
    <description>The latest articles on DEV Community by Faruk Celikkanat (@celikkanat).</description>
    <link>https://dev.to/celikkanat</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3949551%2Ff2e495ee-d3b3-4884-9469-3a27b276e14a.jpeg</url>
      <title>DEV Community: Faruk Celikkanat</title>
      <link>https://dev.to/celikkanat</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/celikkanat"/>
    <language>en</language>
    <item>
      <title>AWS Budgets Has a 24-Hour Delay. Your Bedrock Bill Doesn't.</title>
      <dc:creator>Faruk Celikkanat</dc:creator>
      <pubDate>Sat, 30 May 2026 17:07:23 +0000</pubDate>
      <link>https://dev.to/celikkanat/aws-budgets-has-a-24-hour-delay-your-bedrock-bill-doesnt-2d9g</link>
      <guid>https://dev.to/celikkanat/aws-budgets-has-a-24-hour-delay-your-bedrock-bill-doesnt-2d9g</guid>
      <description>&lt;h1&gt;
  
  
  Why AWS Budgets Can't Protect You From Bedrock Overspend
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Published: May 30, 2026 · 6 min read&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You set a $100 AWS Budget alert for your Bedrock usage. You feel safe. Then you wake up to a $2,300 bill.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical. It's the failure mode that AWS Budgets was designed to handle — and fundamentally cannot.&lt;/p&gt;




&lt;h2&gt;
  
  
  How AWS Budgets Actually Works
&lt;/h2&gt;

&lt;p&gt;AWS Budgets monitors your account spend and sends alerts when you cross a threshold. Sounds exactly right for controlling Bedrock costs. Here's the catch buried in the AWS documentation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Budget alerts are typically delivered within 24 hours of your spending crossing the threshold."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In practice, the lag is 8–24 hours. AWS aggregates billing data in batch jobs. The notification pipeline has no concept of "real time."&lt;/p&gt;

&lt;p&gt;This means the sequence of events is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your code enters a runaway loop at 11:47 PM&lt;/li&gt;
&lt;li&gt;Claude Sonnet runs 900 iterations at ~$0.003/request&lt;/li&gt;
&lt;li&gt;You accumulate $2,700 in spend over 3 hours&lt;/li&gt;
&lt;li&gt;You go to sleep&lt;/li&gt;
&lt;li&gt;AWS sends you an email at 9 AM&lt;/li&gt;
&lt;li&gt;The damage is done&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;AWS Budgets is a &lt;em&gt;reporting&lt;/em&gt; tool wearing the costume of an &lt;em&gt;enforcement&lt;/em&gt; tool. It doesn't stop anything. It tells you what already happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Bedrock Is Especially Dangerous
&lt;/h2&gt;

&lt;p&gt;Most compute services have natural rate limits — you can only spin up so many EC2 instances before you hit account limits. Bedrock has no such floor.&lt;/p&gt;

&lt;p&gt;Claude Sonnet 3.7 costs $3.00 per million input tokens and $15.00 per million output tokens. A single request with 10,000 tokens in and 4,000 out costs about $0.09. That seems small.&lt;/p&gt;

&lt;p&gt;Now consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A retry loop with no backoff: 1 request every 200ms&lt;/li&gt;
&lt;li&gt;300 requests/minute × $0.09 = &lt;strong&gt;$27/minute&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;One hour of this: &lt;strong&gt;$1,620&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This happens. A missing &lt;code&gt;await&lt;/code&gt;, an exception handler that retries in a loop, a batch job that gets misconfigured — these are real bugs that hit developers every week.&lt;/p&gt;

&lt;p&gt;And if you're using &lt;code&gt;claude-opus-4-7&lt;/code&gt; at $75/M output tokens, multiply those numbers by 5.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Failures of the Standard Advice
&lt;/h2&gt;

&lt;p&gt;When developers ask "how do I control Bedrock costs," the standard advice is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Set AWS Budgets alerts&lt;/strong&gt;&lt;br&gt;
Already covered — these are retroactive, not preventive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Set service quotas&lt;/strong&gt;&lt;br&gt;
AWS Bedrock service quotas are measured in requests-per-minute, not dollars. You can cap throughput, but not spend. A slow leak at 5 RPM with a large context can still cost hundreds per day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Monitor CloudWatch metrics and set alarms&lt;/strong&gt;&lt;br&gt;
CloudWatch's &lt;code&gt;InvokeModel&lt;/code&gt; metrics have a delay too. You can alarm on invocation count, but mapping invocations to dollars requires knowing the exact token counts per request — which CloudWatch doesn't report in real time.&lt;/p&gt;

&lt;p&gt;None of these intercept a request &lt;em&gt;before&lt;/em&gt; it hits Bedrock. They all tell you something happened. None of them stop it from happening.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Actual Enforcement Looks Like
&lt;/h2&gt;

&lt;p&gt;The only way to guarantee a dollar cap is to intercept at the request level.&lt;/p&gt;

&lt;p&gt;When your code calls Bedrock, the request has to pass through something that knows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How much you've spent so far (in real time)&lt;/li&gt;
&lt;li&gt;How much this request will approximately cost (token estimate)&lt;/li&gt;
&lt;li&gt;Whether allowing this request would exceed your cap&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answer to (3) is yes, the request never reaches Bedrock. You get a 429. The token is never consumed. The money is never spent.&lt;/p&gt;

&lt;p&gt;This is the difference between a smoke detector and a sprinkler system. AWS Budgets is a smoke detector. You need a sprinkler.&lt;/p&gt;


&lt;h2&gt;
  
  
  How We Built This for Bedrock
&lt;/h2&gt;

&lt;p&gt;At LLMCap, we built a transparent proxy that sits between your code and AWS Bedrock Runtime. Your code changes exactly one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="n"&gt;bedrock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After — via LLMCap proxy
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="c1"&gt;# Pass your AWS credentials in headers, LLMCap signs the request
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://proxy.llmcap.io/bedrock/us-east-1/model/anthropic.claude-sonnet-4-6/invoke&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-LLMCap-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tg_live_...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-AWS-Access-Key-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AKIA...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-AWS-Secret-Access-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every request goes through three checks before reaching AWS:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Token estimation&lt;/strong&gt; (&amp;lt;10ms)&lt;br&gt;
We estimate input tokens from the request body. For Anthropic Claude on Bedrock, we use the Anthropic token counting endpoint. For other models, we use character-count heuristics. This gives us a pre-request cost estimate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Budget check&lt;/strong&gt; (&amp;lt;5ms)&lt;br&gt;
We query Redis for your current spend in the active window (daily/weekly/monthly). If `current_spend + estimated_cost &amp;gt; limit`, we return 429 immediately. The request never leaves our servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. SigV4 signing&lt;/strong&gt;&lt;br&gt;
Your AWS credentials pass through per-request in headers and are discarded after signing. We never store them. LLMCap holds only your token counts and costs.&lt;/p&gt;

&lt;p&gt;Total added latency: &lt;strong&gt;&amp;lt;35ms&lt;/strong&gt; in the median case.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Happens Mid-Stream
&lt;/h2&gt;

&lt;p&gt;Bedrock supports streaming responses via the binary event stream format. A streaming request can run for seconds and generate thousands of output tokens — more cost exposure than a single-turn request.&lt;/p&gt;

&lt;p&gt;LLMCap handles streaming by checking budget periodically as chunks arrive. If spending crosses the cap mid-stream, we close the connection and send a final error event. Your code sees a disconnection (or a &lt;code&gt;budget_exceeded&lt;/code&gt; error, depending on how you handle it). Anthropic's servers close the generation.&lt;/p&gt;

&lt;p&gt;The tokens already generated are charged. Everything after the connection close is not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Developer Experience
&lt;/h2&gt;

&lt;p&gt;Setting a $50/day cap on Bedrock looks like this in the LLMCap dashboard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provider: Bedrock&lt;/li&gt;
&lt;li&gt;Window: Daily&lt;/li&gt;
&lt;li&gt;Limit: $50.00&lt;/li&gt;
&lt;li&gt;Action: Block&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once that rule is set, no Bedrock request through your proxy key can push you past $50 for the day. The cap is hard. Not a notification — a wall.&lt;/p&gt;

&lt;p&gt;You can set separate caps per API key, per provider, per model, per time window. A staging key can have a $5/day limit while your production key has a $200/day limit. They're isolated.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Tradeoffs
&lt;/h2&gt;

&lt;p&gt;LLMCap adds a network hop. That's ~35ms of latency in the happy path. For interactive applications, this is usually invisible. For batch workloads that run millions of requests, it's worth measuring.&lt;/p&gt;

&lt;p&gt;LLMCap also requires your AWS credentials to pass through the proxy on each request. We sign them and discard them — we never store credentials, and this is auditable in our code. But you're trusting a third party with temporary credential access on each call. Some security policies won't allow this.&lt;/p&gt;

&lt;p&gt;For those cases, self-hosted deployment is on our roadmap.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;AWS Budgets is not a spending cap. It's a bill notification with a 24-hour delay. For Bedrock workloads where a runaway loop can cost thousands per hour, that's not protection — it's a post-mortem.&lt;/p&gt;

&lt;p&gt;Real enforcement requires interception at the request level, before the token is consumed.&lt;/p&gt;

&lt;p&gt;That's what we built. If you're running Bedrock in production and you don't have a hard cap in place, &lt;a href="https://llmcap.io" rel="noopener noreferrer"&gt;try LLMCap free for 3 days&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;LLMCap supports Anthropic, OpenAI, Google Gemini, Mistral, Cohere, and AWS Bedrock. Setup takes under 15 minutes.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>infrastructure</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
