<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shek</title>
    <description>The latest articles on DEV Community by Shek (@midrelay).</description>
    <link>https://dev.to/midrelay</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3946647%2F2579c26e-cc96-44dc-9fea-2b095aa31b3d.png</url>
      <title>DEV Community: Shek</title>
      <link>https://dev.to/midrelay</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/midrelay"/>
    <language>en</language>
    <item>
      <title>Migrating Claude Code to a custom backend in 2 lines (and what to actually watch for)</title>
      <dc:creator>Shek</dc:creator>
      <pubDate>Sat, 06 Jun 2026 04:52:03 +0000</pubDate>
      <link>https://dev.to/midrelay/migrating-claude-code-to-a-custom-backend-in-2-lines-and-what-to-actually-watch-for-1e0g</link>
      <guid>https://dev.to/midrelay/migrating-claude-code-to-a-custom-backend-in-2-lines-and-what-to-actually-watch-for-1e0g</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Claude Code's &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; env var lets you redirect every request to any compatible backend in one shell line. Almost no one does it, but it unlocks caching, rate-limit pooling, multi-vendor routing, audit logging, and billing routing without rewriting any code. Here's the 2-line setup, the things that quietly break, and the production checklist for actually running it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 2-line setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://your-proxy.example.com
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-key
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The Anthropic SDK (which Claude Code is built on) doesn't care where it's sending requests, as long as the response comes back in Anthropic Messages format. It'll append &lt;code&gt;/v1/messages&lt;/code&gt; automatically, forward all your &lt;code&gt;anthropic-version&lt;/code&gt; and &lt;code&gt;anthropic-beta&lt;/code&gt; headers, and stream SSE the same way.&lt;/p&gt;

&lt;p&gt;The same idea applies to Cursor, Cline, Continue, and anything else built on the Anthropic SDK, provided the tool exposes a base-URL override.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why would you do this?
&lt;/h2&gt;

&lt;p&gt;I've found five patterns that justify a custom backend:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Caching middleware
&lt;/h3&gt;

&lt;p&gt;Anthropic's prompt caching requires explicit &lt;code&gt;cache_control&lt;/code&gt; markers that most teams skip. A proxy can inject them server-side on the system prompt of every request. For a Claude Code session with a 20-30K-token system prompt and 50+ turns, this can cut your bill by 60-80%.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Rate-limit pooling
&lt;/h3&gt;

&lt;p&gt;Most teams have 3-5 API keys spread across projects. When those keys sit in separate organizations or workspaces with their own limits, each is rate-limited independently, so you can hit 429 on one while another sits idle. A proxy can round-robin across keys, exposing one virtual key with the combined rate budget.&lt;/p&gt;
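
&lt;p&gt;A minimal sketch of that pooling, assuming the proxy learns about a 429 and reports it back to the pool. The class name &lt;code&gt;KeyPool&lt;/code&gt;, the 30-second cooldown, and the injectable clock are illustrative choices, not part of any SDK.&lt;/p&gt;

```python
import itertools
import time


class KeyPool:
    """Round-robin over upstream API keys, skipping keys that are cooling down."""

    def __init__(self, keys, cooldown_seconds=30.0, clock=time.monotonic):
        if not keys:
            raise ValueError("at least one upstream key is required")
        self._keys = list(keys)
        self._cycle = itertools.cycle(self._keys)
        self._cooldown = cooldown_seconds  # assumed value, tune to your limits
        self._clock = clock
        self._blocked_until = {}  # key -> time when it may be used again

    def next_key(self):
        """Return the next usable key, or None if every key is cooling down."""
        now = self._clock()
        for _ in range(len(self._keys)):
            key = next(self._cycle)
            if now >= self._blocked_until.get(key, 0.0):
                return key
        return None

    def mark_rate_limited(self, key):
        """Call this when the upstream answers 429 for `key`."""
        self._blocked_until[key] = self._clock() + self._cooldown
```

&lt;p&gt;When &lt;code&gt;next_key()&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt;, the proxy should answer 429 itself rather than queue the request indefinitely.&lt;/p&gt;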

&lt;h3&gt;
  
  
  3. Multi-vendor abstraction
&lt;/h3&gt;

&lt;p&gt;Want to route &lt;code&gt;claude-haiku-4-5&lt;/code&gt; calls to one provider and &lt;code&gt;claude-opus-4-7&lt;/code&gt; calls to another (maybe one offers Haiku throughput discounts)? A proxy inspects the &lt;code&gt;model&lt;/code&gt; field and routes accordingly. Your Claude Code config doesn't change.&lt;/p&gt;
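
&lt;p&gt;The routing decision itself is small. A sketch, assuming the proxy has already parsed the request body; the two provider URLs are placeholders and the prefix table is a hypothetical layout.&lt;/p&gt;

```python
# Hypothetical upstream base URLs; replace with your real providers.
ROUTES = [
    ("claude-haiku", "https://haiku-provider.example.com"),
    ("claude-opus", "https://opus-provider.example.com"),
]
DEFAULT_UPSTREAM = "https://api.anthropic.com"


def pick_upstream(request_body):
    """Choose an upstream base URL from the `model` field of a Messages request."""
    model = str(request_body.get("model", ""))
    for prefix, upstream in ROUTES:
        if model.startswith(prefix):
            return upstream
    # Unknown or missing model: fall back to the default upstream.
    return DEFAULT_UPSTREAM
```

&lt;p&gt;Matching on a prefix rather than the full model id means a new point release of a model family keeps its route without a config change.&lt;/p&gt;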

&lt;h3&gt;
  
  
  4. Audit logging
&lt;/h3&gt;

&lt;p&gt;Claude Code makes a &lt;em&gt;lot&lt;/em&gt; of requests. Without instrumentation, you have no idea what each session cost. A proxy can log per-request metadata (model, tokens, latency, session id) to a warehouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Billing routing
&lt;/h3&gt;

&lt;p&gt;Different team members hitting different cost centers? A proxy can map API keys to billing tags and forward usage to finance.&lt;/p&gt;




&lt;h2&gt;
  
  
  What breaks (and what to watch for)
&lt;/h2&gt;

&lt;p&gt;The 2-line setup works for the happy path. In practice, here's what bites you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Streaming behavior
&lt;/h3&gt;

&lt;p&gt;Anthropic returns chunked SSE for &lt;code&gt;stream: true&lt;/code&gt;. If your proxy buffers responses or doesn't flush each chunk immediately, you'll see massive perceived latency — Claude Code will spin while waiting for the full response.&lt;/p&gt;

&lt;p&gt;Fix: pass through SSE chunks without buffering. In Caddy, that means &lt;code&gt;flush_interval -1&lt;/code&gt; in the &lt;code&gt;reverse_proxy&lt;/code&gt; block. In Cloudflare Workers, return the upstream body stream directly, or use a &lt;code&gt;TransformStream&lt;/code&gt; and write each chunk as it arrives. In Node, call &lt;code&gt;socket.setNoDelay(true)&lt;/code&gt; on the response socket and keep compression middleware off the streaming route.&lt;/p&gt;

&lt;h3&gt;
  
  
  Header forwarding
&lt;/h3&gt;

&lt;p&gt;Anthropic's SDK sends headers like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;anthropic-version&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;anthropic-beta&lt;/code&gt; (opt-in for beta features)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;x-api-key&lt;/code&gt; (auth)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;x-stainless-*&lt;/code&gt; (SDK telemetry)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your proxy strips any of these before forwarding to the real upstream, certain features break silently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strip &lt;code&gt;anthropic-version&lt;/code&gt; → the upstream may reject the request, or answer with a version the SDK does not expect&lt;/li&gt;
&lt;li&gt;Strip &lt;code&gt;anthropic-beta&lt;/code&gt; → any beta feature the client opted into is silently turned off, and you wonder why it isn't working&lt;/li&gt;
&lt;li&gt;Strip &lt;code&gt;x-stainless-*&lt;/code&gt; → harmless&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rule of thumb: forward everything except &lt;code&gt;host&lt;/code&gt; and whichever auth header the client sent (&lt;code&gt;x-api-key&lt;/code&gt; or &lt;code&gt;authorization&lt;/code&gt;), since you're rewriting those.&lt;/p&gt;
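
&lt;p&gt;A sketch of that rule as a header filter, assuming the proxy swaps the client's key for an upstream key sent as &lt;code&gt;x-api-key&lt;/code&gt;. The function name and the exact drop list are illustrative; the list also covers hop-by-hop headers and &lt;code&gt;content-length&lt;/code&gt;, which the outgoing HTTP client should recompute.&lt;/p&gt;

```python
# Headers the proxy must not copy: hop-by-hop headers, plus the ones it
# rewrites itself (host, body length, and client credentials).
DROP = {
    "host", "content-length", "connection", "keep-alive", "transfer-encoding",
    "te", "trailer", "upgrade", "proxy-authorization", "proxy-authenticate",
    "x-api-key", "authorization",
}


def upstream_headers(incoming, upstream_key):
    """Copy client headers, drop the rewritten ones, attach the upstream key."""
    out = {k: v for k, v in incoming.items() if k.lower() not in DROP}
    out["x-api-key"] = upstream_key
    return out
```

&lt;p&gt;Everything else, including &lt;code&gt;anthropic-version&lt;/code&gt;, &lt;code&gt;anthropic-beta&lt;/code&gt;, and the &lt;code&gt;x-stainless-*&lt;/code&gt; telemetry, passes through untouched.&lt;/p&gt;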

&lt;h3&gt;
  
  
  Tool use iteration loops
&lt;/h3&gt;

&lt;p&gt;Claude Code uses tool calling heavily. A typical session has 30-50 tool call rounds, each one a separate API call with the &lt;strong&gt;full&lt;/strong&gt; conversation history.&lt;/p&gt;

&lt;p&gt;If your proxy rate-limits at 60 RPM per source IP, a single busy Claude Code session can blow through it in 30 seconds, and a whole team behind one office NAT shares that same budget.&lt;/p&gt;

&lt;p&gt;Fix: rate-limit per API key, not per source IP. And budget for high burst rates (200+ RPM during heavy tool-call sessions).&lt;/p&gt;
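
&lt;p&gt;One way to get per-key limits with burst headroom is a token bucket per key. This is a sketch under assumed numbers (200 RPM steady rate, bursts of 50); the class name and the injectable clock are hypothetical, and a multi-instance proxy would need shared state instead of an in-process dict.&lt;/p&gt;

```python
import time


class PerKeyLimiter:
    """Token bucket per API key: steady `rate_per_minute`, bursts up to `burst`."""

    def __init__(self, rate_per_minute=200, burst=50, clock=time.monotonic):
        self._rate = rate_per_minute / 60.0  # tokens refilled per second
        self._burst = float(burst)
        self._clock = clock
        self._buckets = {}  # api_key -> (tokens, last_refill_time)

    def allow(self, api_key):
        """Return True and consume one token if `api_key` may send a request now."""
        now = self._clock()
        # A key seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(api_key, (self._burst, now))
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[api_key] = (tokens, now)
        return allowed
```

&lt;p&gt;Because buckets are keyed by API key, two developers behind the same IP no longer compete for one budget.&lt;/p&gt;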

&lt;h3&gt;
  
  
  Request body inspection
&lt;/h3&gt;

&lt;p&gt;If you want to inject &lt;code&gt;cache_control&lt;/code&gt; or rewrite request bodies, you have to parse the JSON, modify it, re-serialize it, and send a &lt;code&gt;Content-Length&lt;/code&gt; that matches the new body. Be careful: large requests (50K+ tokens of context as JSON-encoded strings) take noticeable CPU time to parse.&lt;/p&gt;

&lt;p&gt;Minimal injection in Node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rawBody&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Wrap a plain-string system prompt in a text block that carries cache_control&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;system&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;system&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;cache_control&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ephemeral&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;newRawBody&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Claude Code's typical traffic this adds &amp;lt;5ms per request. For very large workloads, consider stream-parsing.&lt;/p&gt;
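
&lt;p&gt;Clients may also send &lt;code&gt;system&lt;/code&gt; as a list of content blocks rather than a string. A Python sketch that handles both shapes; the function name is hypothetical, and the choice to leave client-set breakpoints alone is a design assumption rather than an API requirement.&lt;/p&gt;

```python
import json


def inject_cache_control(raw_body):
    """Return the request body (bytes) with a cache breakpoint on the system prompt."""
    body = json.loads(raw_body)
    system = body.get("system")

    if isinstance(system, str) and system:
        # Plain string: wrap it in a single text block.
        body["system"] = [{
            "type": "text",
            "text": system,
            "cache_control": {"type": "ephemeral"},
        }]
    elif isinstance(system, list) and system:
        # Already a list of blocks: mark the last one, unless the client
        # set its own breakpoints, in which case leave them alone.
        if not any("cache_control" in block for block in system):
            system[-1] = {**system[-1], "cache_control": {"type": "ephemeral"}}

    return json.dumps(body).encode("utf-8")
```

&lt;p&gt;The returned bytes have a different length from the original body, so the outgoing request needs a recomputed &lt;code&gt;Content-Length&lt;/code&gt;.&lt;/p&gt;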

&lt;h3&gt;
  
  
  Streaming error handling
&lt;/h3&gt;

&lt;p&gt;Anthropic sends &lt;code&gt;error&lt;/code&gt; SSE events mid-stream when something fails (rate limit, content filter, etc.). Most naive proxies forward the body as opaque bytes, which works. If your proxy tries to be "smart" and parse the SSE, you have to handle error events without breaking the connection.&lt;/p&gt;

&lt;p&gt;The safe path: don't parse SSE in your proxy unless you absolutely have to.&lt;/p&gt;




&lt;h2&gt;
  
  
  A minimal production checklist
&lt;/h2&gt;

&lt;p&gt;If you're building this proxy yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] TLS termination (Let's Encrypt or Cloudflare front)&lt;/li&gt;
&lt;li&gt;[ ] Pass-through SSE streaming (no buffering)&lt;/li&gt;
&lt;li&gt;[ ] Per-key rate limiting (not per-IP)&lt;/li&gt;
&lt;li&gt;[ ] Forward all headers except &lt;code&gt;host&lt;/code&gt; and the auth header you rewrite (&lt;code&gt;x-api-key&lt;/code&gt;/&lt;code&gt;authorization&lt;/code&gt;)
&lt;/li&gt;
&lt;li&gt;[ ] Handle JSON request body modification gracefully&lt;/li&gt;
&lt;li&gt;[ ] Log per-request metadata without storing message bodies (privacy)&lt;/li&gt;
&lt;li&gt;[ ] Retry on upstream 5xx with exponential backoff, but only before any response bytes have been streamed to the client&lt;/li&gt;
&lt;li&gt;[ ] Health check endpoint that doesn't proxy&lt;/li&gt;
&lt;li&gt;[ ] Graceful degradation when upstream is down (return 502 fast, don't hang)&lt;/li&gt;
&lt;li&gt;[ ] CORS headers if you want browser clients to hit it directly (TypingMind etc.)&lt;/li&gt;
&lt;/ul&gt;
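
&lt;p&gt;For the retry item, a sketch of a capped exponential backoff schedule with jitter. The base delay, cap, retry count, and the set of retryable status codes are all assumptions to tune; the random source is injectable so the schedule can be tested.&lt;/p&gt;

```python
import random

# Assumed set of upstream statuses worth retrying; adjust for your upstream.
RETRYABLE = {500, 502, 503, 504, 529}


def backoff_delays(max_retries=4, base=0.5, cap=8.0, rng=random.random):
    """Yield sleep durations: exponential growth, capped, scaled by jitter."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        # rng() is in [0, 1), so each delay lands somewhere below the ceiling.
        yield ceiling * rng()
```

&lt;p&gt;A proxy would sleep for each yielded delay between attempts and give up with a fast 502 once the generator is exhausted.&lt;/p&gt;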

&lt;p&gt;That's not "2 lines" anymore. The 2-line part is the SDK config; the checklist is the actual cost of running it in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  The side benefit no one talks about
&lt;/h2&gt;

&lt;p&gt;When you put a proxy in front of Claude Code, you get &lt;strong&gt;observability for free&lt;/strong&gt;. You can see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every request and its token count&lt;/li&gt;
&lt;li&gt;Which sessions are expensive&lt;/li&gt;
&lt;li&gt;Which prompts repeat (cache candidates)&lt;/li&gt;
&lt;li&gt;Tool-call patterns&lt;/li&gt;
&lt;li&gt;Latency distribution per model&lt;/li&gt;
&lt;li&gt;Time-of-day spend curves&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anthropic's dashboard tells you the bill. Your proxy tells you &lt;em&gt;why&lt;/em&gt;. For any team running Claude Code at scale, that's often more valuable than the cost optimization itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  When NOT to do this
&lt;/h2&gt;

&lt;p&gt;A few cases where a custom backend is more trouble than it's worth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single developer running &lt;code&gt;claude&lt;/code&gt; once a day — direct API is fine&lt;/li&gt;
&lt;li&gt;Team &amp;lt; 5 people with simple usage patterns&lt;/li&gt;
&lt;li&gt;No engineering bandwidth to maintain a proxy&lt;/li&gt;
&lt;li&gt;Using Anthropic-specific beta features that might break in a proxy chain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The break-even point is roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have &amp;gt; 3 API keys to manage, OR&lt;/li&gt;
&lt;li&gt;Monthly Claude bill &amp;gt; $500, OR&lt;/li&gt;
&lt;li&gt;You need cross-team usage analytics, OR&lt;/li&gt;
&lt;li&gt;You're integrating with multiple LLM vendors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below that, just use the real Anthropic API and pocket the engineering time.&lt;/p&gt;




&lt;h2&gt;
  
  
  If you don't want to build it
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://midrelay.com" rel="noopener noreferrer"&gt;MidRelay&lt;/a&gt; precisely because this checklist took 6 weeks to get right and I figured other teams shouldn't repeat it. Hosted proxy with both Anthropic and OpenAI surfaces, prompt cache injection on by default, per-key usage logs, CORS enabled for browser clients.&lt;/p&gt;

&lt;p&gt;Pointing Claude Code at it is the 2-line setup at the top of this post.&lt;/p&gt;

&lt;p&gt;But honestly: the techniques here work whether you use MidRelay, build your own, or pick any of the other gateways. The interesting thing is almost nobody knows you &lt;em&gt;can&lt;/em&gt; point Claude Code at a custom backend, even though Anthropic explicitly documented it. If this post does nothing else but make you realize that capability exists, that's worth your reading time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've built something similar, drop a comment with what bit you in production — I'm collecting patterns for a follow-up post on multi-vendor LLM routing.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudeai</category>
      <category>anthropic</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why your Claude API bill is 3x what it should be (and how to fix it)</title>
      <dc:creator>Shek</dc:creator>
      <pubDate>Fri, 22 May 2026 19:24:53 +0000</pubDate>
      <link>https://dev.to/midrelay/why-your-claude-api-bill-is-3x-what-it-should-be-and-how-to-fix-it-4lfo</link>
      <guid>https://dev.to/midrelay/why-your-claude-api-bill-is-3x-what-it-should-be-and-how-to-fix-it-4lfo</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I audited a friend's startup that was spending $4,200/month on Claude API. Only $1,300 produced business value. The other $2,900 was waste — split across three patterns that hit most teams using LLM APIs in production. Here's how to find them in your own bill, and the code to fix each one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The audit
&lt;/h2&gt;

&lt;p&gt;A friend running a B2B doc-summarization product asked me to look at their Claude bill. Q1 was $4,200/month and climbing. We pulled their request logs into a spreadsheet, classified each call by purpose, then estimated what each &lt;em&gt;should&lt;/em&gt; have cost. The answer was uncomfortable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bucket&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Producing business value?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;First-time doc analysis&lt;/td&gt;
&lt;td&gt;$890&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User chat turns&lt;/td&gt;
&lt;td&gt;$410&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Repeated system prompts (no cache)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,810&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Opus calls that should be Sonnet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$680&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Serial bulk runs (should be batched)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$410&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three problems, $2,900/month of waste. Each one is unsexy and easy to miss, but together they were about 70% of the bill ($2,900 of $4,200).&lt;/p&gt;




&lt;h2&gt;
  
  
  Culprit #1: prompt caching is off
&lt;/h2&gt;

&lt;p&gt;This is the silent killer. Claude 4.x supports prompt caching: send a 5-minute or 1-hour TTL &lt;code&gt;cache_control&lt;/code&gt; block, and Anthropic charges you ~10x less for cached tokens on subsequent requests. Pricing today (per million tokens for Sonnet 4.6):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fresh input: $3.00&lt;/li&gt;
&lt;li&gt;Cache write: $3.75 (5-minute TTL; paid each time the cache is filled, 25% more than fresh)&lt;/li&gt;
&lt;li&gt;Cache read: &lt;strong&gt;$0.30 — 10x cheaper&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The catch: &lt;strong&gt;you have to opt in per-request&lt;/strong&gt;, and most code doesn't. Before/after:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before — every call pays for the full system prompt
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are an expert at...[2000 words of rules + examples]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After — system prompt cached for 5 minutes
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are an expert at...[2000 words of rules + examples]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Small change. 90% discount on the cached prompt tokens for every subsequent call within the cache TTL.&lt;/p&gt;

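&lt;p&gt;For my friend: a 20K-token system prompt at 8 requests/min with a 50% cache hit ratio. Each hit saves about $0.054 (20,000 tokens × $2.70 per million, the gap between fresh input and cache read), which worked out to ~$80/day. &lt;strong&gt;That alone recovered most of the $1,810 leak.&lt;/strong&gt;&lt;/p&gt;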
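
&lt;p&gt;The arithmetic is easy to rerun for your own traffic. A sketch using the Sonnet 4.6 prices listed above, under the simplifying assumption that every cache miss pays the cache-write price; the function names are illustrative.&lt;/p&gt;

```python
FRESH_INPUT = 3.00  # USD per million tokens, figures quoted above
CACHE_WRITE = 3.75
CACHE_READ = 0.30


def uncached_cost(tokens, calls):
    """Cost in USD of sending a `tokens`-long prompt `calls` times, no caching."""
    return tokens * calls * FRESH_INPUT / 1_000_000


def cached_cost(tokens, calls, hit_ratio):
    """Same traffic with caching: hits pay the read price, misses the write price."""
    hits = calls * hit_ratio
    misses = calls - hits
    per_million = hits * CACHE_READ + misses * CACHE_WRITE
    return tokens * per_million / 1_000_000


# One hour at 8 requests/min with a 20K-token prompt and a 50% hit ratio.
calls_per_hour = 8 * 60
before = uncached_cost(20_000, calls_per_hour)
after = cached_cost(20_000, calls_per_hour, 0.5)
print(f"uncached ${before:.2f}/hour, cached ${after:.2f}/hour")
```

&lt;p&gt;Under these prices caching pays for itself once the hit ratio passes roughly 22%, because each miss costs an extra $0.75 per million tokens while each hit saves $2.70.&lt;/p&gt;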

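&lt;p&gt;If you call Claude through the OpenAI SDK via a compatible proxy, some proxies map the SDK's &lt;code&gt;prompt_cache_key&lt;/code&gt; parameter to similar behavior; check your proxy's docs before relying on it:&lt;br&gt;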
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt;
    &lt;span class="n"&gt;prompt_cache_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-session-12345&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Stable across calls = cache hit
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;: open your last week of API logs. If you have any repeated &lt;code&gt;system&lt;/code&gt; content across requests, you're leaking.&lt;/p&gt;




&lt;h2&gt;
  
  
  Culprit #2: model overkill
&lt;/h2&gt;

&lt;p&gt;The mental shortcut "Claude = quality, just always use Opus" is expensive. Opus is 4x the cost of Sonnet for inputs, 5x for outputs. For a lot of work, Sonnet or even Haiku is indistinguishable.&lt;/p&gt;

&lt;p&gt;I ran 5 tasks across the lineup (1000 samples, scored by judge model + human spot-check):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Opus 4.7&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;th&gt;Haiku 4.5&lt;/th&gt;
&lt;th&gt;Best price/quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON extraction from PDFs&lt;/td&gt;
&lt;td&gt;99.2%&lt;/td&gt;
&lt;td&gt;98.7%&lt;/td&gt;
&lt;td&gt;96.4%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Haiku&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code review (real bugs)&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;td&gt;61%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Sonnet&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative copy (blind judged)&lt;/td&gt;
&lt;td&gt;51% pref&lt;/td&gt;
&lt;td&gt;48% pref&lt;/td&gt;
&lt;td&gt;32% pref&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Sonnet&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-step reasoning chain&lt;/td&gt;
&lt;td&gt;89%&lt;/td&gt;
&lt;td&gt;76%&lt;/td&gt;
&lt;td&gt;54%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Opus&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer chat&lt;/td&gt;
&lt;td&gt;92% sat&lt;/td&gt;
&lt;td&gt;89% sat&lt;/td&gt;
&lt;td&gt;81% sat&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Sonnet&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern: &lt;strong&gt;Opus wins clearly only on complex multi-step reasoning&lt;/strong&gt;. On the other four tasks Sonnet is within about 3 points at 1/4 the cost. Haiku gives up about 3 points on JSON extraction and 11 or more on everything else for 1/13 the cost, which is fine for extraction when you have downstream validation.&lt;/p&gt;
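
&lt;p&gt;The selection rule behind the last column can be written down explicitly. A sketch using the scores from the table and the relative costs from the text; the 3-point tolerance is an assumption, chosen here because it reproduces the table's picks.&lt;/p&gt;

```python
# Relative costs from the text: Sonnet about 1/4 of Opus, Haiku about 1/13.
RELATIVE_COST = {"Opus": 1.0, "Sonnet": 1 / 4, "Haiku": 1 / 13}

# Scores copied from the table above.
SCORES = {
    "JSON extraction": {"Opus": 99.2, "Sonnet": 98.7, "Haiku": 96.4},
    "Code review": {"Opus": 74.0, "Sonnet": 74.0, "Haiku": 61.0},
    "Creative copy": {"Opus": 51.0, "Sonnet": 48.0, "Haiku": 32.0},
    "Multi-step reasoning": {"Opus": 89.0, "Sonnet": 76.0, "Haiku": 54.0},
    "Customer chat": {"Opus": 92.0, "Sonnet": 89.0, "Haiku": 81.0},
}


def cheapest_within(scores, tolerance=3.0):
    """Cheapest model whose score is within `tolerance` points of the best."""
    best = max(scores.values())
    good_enough = [m for m, s in scores.items() if s >= best - tolerance]
    return min(good_enough, key=RELATIVE_COST.get)


for task, scores in SCORES.items():
    print(task, cheapest_within(scores))
```

&lt;p&gt;Tightening the tolerance pushes borderline tasks like creative copy and customer chat back to Opus, which is the knob to turn if quality complaints appear.&lt;/p&gt;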

&lt;p&gt;My friend was running every doc through Opus by default. Switching to Sonnet for analysis + Haiku for tagging dropped that bucket from $680 to $140. No quality complaints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;: pick the 3 most expensive endpoints in your bill, A/B-test them on the next cheapest model for a week, score outputs blind.&lt;/p&gt;




&lt;h2&gt;
  
  
  Culprit #3: serial calls when you could batch
&lt;/h2&gt;

&lt;p&gt;If your work doesn't need a response in the next 30 seconds, the Anthropic Message Batches API charges &lt;strong&gt;half price&lt;/strong&gt; in exchange for asynchronous processing: results arrive within 24 hours, often much sooner. Same models, same quality, half the bill.&lt;/p&gt;

&lt;p&gt;Good fits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nightly summarization runs&lt;/li&gt;
&lt;li&gt;Classifying or tagging large datasets&lt;/li&gt;
&lt;li&gt;Generating summaries or keywords for search indexing&lt;/li&gt;
&lt;li&gt;Internal report generation&lt;/li&gt;
&lt;li&gt;Training data prep&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bad fits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anything user-facing (you'll wait hours)&lt;/li&gt;
&lt;li&gt;Anything where input depends on previous output
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;custom_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Poll until done (or just check tomorrow)
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processing_status&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ended&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My friend had a nightly job re-summarizing all docs from the previous 24h. Moving it from &lt;code&gt;asyncio.gather&lt;/code&gt; to batches cut that bucket from $410 to $205, no user-visible impact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;: any cron job, weekly report, or async task hitting your LLM API — most can be batched.&lt;/p&gt;




&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;p&gt;After three changes (cache hint, model rebalance, batch the async work), my friend's monthly bill went &lt;strong&gt;$4,200 → $1,540&lt;/strong&gt;. Same product, same quality, no rewrites — just turning on features the API already supports.&lt;/p&gt;

&lt;p&gt;If your bill feels high, do the same audit:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pull last 30 days of API calls&lt;/li&gt;
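&lt;li&gt;Count distinct &lt;code&gt;system&lt;/code&gt; prompts. &amp;lt;10 unique but &amp;gt;10,000 calls is a strong caching candidate; confirm by checking whether &lt;code&gt;usage.cache_read_input_tokens&lt;/code&gt; is ever non-zero in the responses&lt;/li&gt;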
&lt;li&gt;Look at top 5 model+endpoint combos by spend. Anything simple enough to downshift?&lt;/li&gt;
&lt;li&gt;Find your largest single-day spike. Batch job? Use the batches API.&lt;/li&gt;
&lt;/ol&gt;
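
&lt;p&gt;Step 2 of the audit can be scripted. A sketch, assuming your logs can be loaded as a list of dicts with the request's &lt;code&gt;system&lt;/code&gt; field preserved; the record shape and function names are hypothetical, and the thresholds are the rule-of-thumb numbers from the list.&lt;/p&gt;

```python
import hashlib
from collections import Counter


def system_prompt_stats(log_records):
    """Count calls per distinct system prompt across logged requests."""
    counts = Counter()
    for record in log_records:
        system = record.get("system") or ""
        if isinstance(system, list):
            # Block form: join the text of every block.
            system = "".join(block.get("text", "") for block in system)
        # Hash the prompt so the counter never holds full prompt text.
        digest = hashlib.sha256(system.encode("utf-8")).hexdigest()[:12]
        counts[digest] += 1
    return counts


def looks_uncached(counts, max_unique=10, min_calls=10_000):
    """Rule of thumb: fewer than `max_unique` prompts, more than `min_calls` calls."""
    return max_unique > len(counts) and sum(counts.values()) > min_calls
```

&lt;p&gt;Sorting the counter with &lt;code&gt;counts.most_common(5)&lt;/code&gt; shows which prompts to cache first.&lt;/p&gt;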




&lt;h2&gt;
  
  
  A shortcut, if you don't want to instrument all this
&lt;/h2&gt;

&lt;p&gt;I built a little proxy called &lt;a href="https://midrelay.com" rel="noopener noreferrer"&gt;MidRelay&lt;/a&gt; that handles the first two automatically: it injects a per-key cache hint into every request (even SDK code that doesn't know about &lt;code&gt;cache_control&lt;/code&gt; gets the discount), and it exposes both OpenAI and Anthropic surfaces from the same key so you can route model-by-model without rewriting.&lt;/p&gt;

&lt;p&gt;It also happens to be 60-80% cheaper than calling Anthropic / OpenAI directly. (Same models, same wire protocol — your existing SDK just changes the &lt;code&gt;base_url&lt;/code&gt;.)&lt;/p&gt;

&lt;p&gt;$5 of free credit to test it: drop a comment, I'll DM a code. First 100 readers, no signup gate.&lt;/p&gt;

&lt;p&gt;But honestly — &lt;strong&gt;the techniques above work on any provider&lt;/strong&gt;. Even if you never touch MidRelay, just turning on &lt;code&gt;cache_control&lt;/code&gt; and downshifting one over-spec'd Opus call will cut your bill more than any "AI cost optimization" SaaS will.&lt;/p&gt;

&lt;p&gt;Check your logs tonight.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudeai</category>
      <category>openai</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
