<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shek</title>
    <description>The latest articles on DEV Community by Shek (@midrelay).</description>
    <link>https://dev.to/midrelay</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3946647%2F2579c26e-cc96-44dc-9fea-2b095aa31b3d.png</url>
      <title>DEV Community: Shek</title>
      <link>https://dev.to/midrelay</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/midrelay"/>
    <language>en</language>
    <item>
      <title>Why your Claude API bill is 3x what it should be (and how to fix it)</title>
      <dc:creator>Shek</dc:creator>
      <pubDate>Fri, 22 May 2026 19:24:53 +0000</pubDate>
      <link>https://dev.to/midrelay/why-your-claude-api-bill-is-3x-what-it-should-be-and-how-to-fix-it-4lfo</link>
      <guid>https://dev.to/midrelay/why-your-claude-api-bill-is-3x-what-it-should-be-and-how-to-fix-it-4lfo</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I audited a friend's startup that was spending $4,200/month on Claude API. Only $1,300 produced business value. The other $2,900 was waste — split across three patterns that hit most teams using LLM APIs in production. Here's how to find them in your own bill, and the code to fix each one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The audit
&lt;/h2&gt;

&lt;p&gt;A friend running a B2B doc-summarization product asked me to look at their Claude bill. Q1 was $4,200/month and climbing. We pulled their request logs into a spreadsheet, classified each call by purpose, then estimated what each &lt;em&gt;should&lt;/em&gt; have cost. The answer was uncomfortable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bucket&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Producing business value?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;First-time doc analysis&lt;/td&gt;
&lt;td&gt;$890&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User chat turns&lt;/td&gt;
&lt;td&gt;$410&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Repeated system prompts (no cache)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,810&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Opus calls that should be Sonnet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$680&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Serial bulk runs (should be batched)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$410&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three problems, $2,900/month of waste. Each one is unsexy and easy to miss, but together they were 70% of the bill.&lt;/p&gt;




&lt;h2&gt;
  
  
  Culprit #1: prompt caching is off
&lt;/h2&gt;

&lt;p&gt;This is the silent killer. Claude 4.x supports prompt caching: send a 5-minute or 1-hour TTL &lt;code&gt;cache_control&lt;/code&gt; block, and Anthropic charges you ~10x less for cached tokens on subsequent requests. Pricing today (per million tokens for Sonnet 4.6):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fresh input: $3.00&lt;/li&gt;
&lt;li&gt;Cache write: $3.75 (one-time, slightly more than fresh)&lt;/li&gt;
&lt;li&gt;Cache read: &lt;strong&gt;$0.30 — 10x cheaper&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The catch: &lt;strong&gt;you have to opt in per-request&lt;/strong&gt;, and most code doesn't. Before/after:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before — every call pays for the full system prompt
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are an expert at...[2000 words of rules + examples]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After — system prompt cached for 5 minutes
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are an expert at...[2000 words of rules + examples]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One-line change. 90% discount on every subsequent call within the cache TTL.&lt;/p&gt;

&lt;p&gt;For my friend: 20K tokens of system prompt × 8 requests/min × 50% cache hit ratio = ~$80/day saved. &lt;strong&gt;That alone was $2,400/month&lt;/strong&gt; — most of the $1,810 leak.&lt;/p&gt;

&lt;p&gt;OpenAI SDK calling Claude (via compatible proxies) has equivalent semantics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt;
    &lt;span class="n"&gt;prompt_cache_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-session-12345&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Stable across calls = cache hit
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;: open your last week of API logs. If you have any repeated &lt;code&gt;system&lt;/code&gt; content across requests, you're leaking.&lt;/p&gt;




&lt;h2&gt;
  
  
  Culprit #2: model overkill
&lt;/h2&gt;

&lt;p&gt;The mental shortcut "Claude = quality, just always use Opus" is expensive. Opus is 4x the cost of Sonnet for inputs, 5x for outputs. For a lot of work, Sonnet or even Haiku is indistinguishable.&lt;/p&gt;

&lt;p&gt;I ran 5 tasks across the lineup (1000 samples, scored by judge model + human spot-check):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Opus 4.7&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;th&gt;Haiku 4.5&lt;/th&gt;
&lt;th&gt;Best price/quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON extraction from PDFs&lt;/td&gt;
&lt;td&gt;99.2%&lt;/td&gt;
&lt;td&gt;98.7%&lt;/td&gt;
&lt;td&gt;96.4%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Haiku&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code review (real bugs)&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;td&gt;61%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Sonnet&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative copy (blind judged)&lt;/td&gt;
&lt;td&gt;51% pref&lt;/td&gt;
&lt;td&gt;48% pref&lt;/td&gt;
&lt;td&gt;32% pref&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Sonnet&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-step reasoning chain&lt;/td&gt;
&lt;td&gt;89%&lt;/td&gt;
&lt;td&gt;76%&lt;/td&gt;
&lt;td&gt;54%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Opus&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer chat&lt;/td&gt;
&lt;td&gt;92% sat&lt;/td&gt;
&lt;td&gt;89% sat&lt;/td&gt;
&lt;td&gt;81% sat&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Sonnet&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern: &lt;strong&gt;Opus wins clearly only on complex multi-step reasoning&lt;/strong&gt;. For most tasks Sonnet is within margin of error at 1/4 the cost. Haiku trades 2-5% accuracy for 1/13 the cost — fine when you have downstream validation.&lt;/p&gt;

&lt;p&gt;My friend was running every doc through Opus by default. Switching to Sonnet for analysis + Haiku for tagging dropped that bucket from $680 to $140. No quality complaints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;: pick the 3 most expensive endpoints in your bill, A/B-test them on the next cheapest model for a week, score outputs blind.&lt;/p&gt;




&lt;h2&gt;
  
  
  Culprit #3: serial calls when you could batch
&lt;/h2&gt;

&lt;p&gt;If your work doesn't need a response in the next 30 seconds, the Anthropic Message Batches API charges &lt;strong&gt;half price&lt;/strong&gt; with a 24-hour SLA. Same models, same quality, half the bill.&lt;/p&gt;

&lt;p&gt;Good fits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nightly summarization runs&lt;/li&gt;
&lt;li&gt;Classifying or tagging large datasets&lt;/li&gt;
&lt;li&gt;Embedding generation for indexing&lt;/li&gt;
&lt;li&gt;Internal report generation&lt;/li&gt;
&lt;li&gt;Training data prep&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bad fits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anything user-facing (you'll wait hours)&lt;/li&gt;
&lt;li&gt;Anything where input depends on previous output
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;custom_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Poll until done (or just check tomorrow)
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processing_status&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ended&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My friend had a nightly job re-summarizing all docs from the previous 24h. Moving it from &lt;code&gt;asyncio.gather&lt;/code&gt; to batches cut that bucket from $410 to $205, no user-visible impact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;: any cron job, weekly report, or async task hitting your LLM API — most can be batched.&lt;/p&gt;




&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;p&gt;After three changes (cache hint, model rebalance, batch the async work), my friend's monthly bill went &lt;strong&gt;$4,200 → $1,540&lt;/strong&gt;. Same product, same quality, no rewrites — just turning on features the API already supports.&lt;/p&gt;

&lt;p&gt;If your bill feels high, do the same audit:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pull last 30 days of API calls&lt;/li&gt;
&lt;li&gt;Count distinct &lt;code&gt;system&lt;/code&gt; prompts. &amp;lt;10 unique but &amp;gt;10,000 calls = no caching&lt;/li&gt;
&lt;li&gt;Look at top 5 model+endpoint combos by spend. Anything simple enough to downshift?&lt;/li&gt;
&lt;li&gt;Find your largest single-day spike. Batch job? Use the batches API.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  A shortcut, if you don't want to instrument all this
&lt;/h2&gt;

&lt;p&gt;I built a little proxy called &lt;a href="https://midrelay.com" rel="noopener noreferrer"&gt;MidRelay&lt;/a&gt; that handles the first two automatically: it injects a per-key cache hint into every request (even SDK code that doesn't know about &lt;code&gt;cache_control&lt;/code&gt; gets the discount), and it exposes both OpenAI and Anthropic surfaces from the same key so you can route model-by-model without rewriting.&lt;/p&gt;

&lt;p&gt;It also happens to be 60-80% cheaper than calling Anthropic / OpenAI directly. (Same models, same wire protocol — your existing SDK just changes the &lt;code&gt;base_url&lt;/code&gt;.)&lt;/p&gt;

&lt;p&gt;$5 of free credit to test it: drop a comment, I'll DM a code. First 100 readers, no signup gate.&lt;/p&gt;

&lt;p&gt;But honestly — &lt;strong&gt;the techniques above work on any provider&lt;/strong&gt;. Even if you never touch MidRelay, just turning on &lt;code&gt;cache_control&lt;/code&gt; and downshifting one over-spec'd Opus call will cut your bill more than any "AI cost optimization" SaaS will.&lt;/p&gt;

&lt;p&gt;Check your logs tonight.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudeai</category>
      <category>openai</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
