<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tufail Khan</title>
    <description>The latest articles on DEV Community by Tufail Khan (@tufailkhan457).</description>
    <link>https://dev.to/tufailkhan457</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3890666%2F512a744d-eab5-48fd-a402-4adccef0aef2.jpg</url>
      <title>DEV Community: Tufail Khan</title>
      <link>https://dev.to/tufailkhan457</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tufailkhan457"/>
    <language>en</language>
    <item>
      <title>FastAPI at 1M+ users: the patterns that actually matter</title>
      <dc:creator>Tufail Khan</dc:creator>
      <pubDate>Tue, 21 Apr 2026 11:53:15 +0000</pubDate>
      <link>https://dev.to/tufailkhan457/fastapi-at-1m-users-the-patterns-that-actually-matter-1o44</link>
      <guid>https://dev.to/tufailkhan457/fastapi-at-1m-users-the-patterns-that-actually-matter-1o44</guid>
      <description>&lt;p&gt;FastAPI is the default Python web framework in 2026 — 38% of Python teams ship on it, up from 29% a year ago. That means a lot of greenfield projects are making the same early mistakes.&lt;/p&gt;

&lt;p&gt;This post is what I wish I'd known before scaling &lt;strong&gt;Savyour&lt;/strong&gt; (Pakistan's first cashback platform, 1M+ users, 300+ merchant integrations) from 50 RPS to 3,000+ RPS on FastAPI.&lt;/p&gt;

&lt;p&gt;Everything below is drawn from production. No "hello world" demos.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Know your async boundaries
&lt;/h2&gt;

&lt;p&gt;FastAPI supports both &lt;code&gt;def&lt;/code&gt; and &lt;code&gt;async def&lt;/code&gt; endpoints. The framework is smart enough to run sync routes in a threadpool — but &lt;em&gt;your&lt;/em&gt; code may not be.&lt;/p&gt;

&lt;p&gt;The failure mode: an &lt;code&gt;async def&lt;/code&gt; endpoint that calls a blocking library (say, &lt;code&gt;requests&lt;/code&gt; instead of &lt;code&gt;httpx&lt;/code&gt;). The sync call holds the event loop, everything queues behind it, and your p99 latency goes vertical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; if the function is &lt;code&gt;async def&lt;/code&gt;, every IO operation inside it must be awaitable. Use &lt;code&gt;httpx.AsyncClient&lt;/code&gt;, &lt;code&gt;asyncpg&lt;/code&gt;, &lt;code&gt;aioboto3&lt;/code&gt;, &lt;code&gt;redis.asyncio&lt;/code&gt;.&lt;/p&gt;
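&lt;p&gt;You can watch this failure mode happen without FastAPI at all. A stdlib-only sketch that measures how long one blocking call freezes every other coroutine sharing the loop:&lt;/p&gt;

```python
import asyncio
import time

async def blocking_handler():
    # stands in for an async def endpoint that calls requests/pandas/etc.
    time.sleep(0.2)  # sync sleep: the event loop is hostage for 200ms

async def ticker(gaps):
    # any other request on the same loop; records gaps between its ticks
    last = time.monotonic()
    for _ in range(5):
        await asyncio.sleep(0.01)
        now = time.monotonic()
        gaps.append(now - last)
        last = now

async def main():
    gaps = []
    await asyncio.gather(ticker(gaps), blocking_handler())
    return max(gaps)

worst_gap = asyncio.run(main())
print(f"worst gap between ticks: {worst_gap:.3f}s")  # ~0.2s, not 0.01s
```

&lt;p&gt;Swap &lt;code&gt;time.sleep&lt;/code&gt; for &lt;code&gt;await asyncio.sleep(0.2)&lt;/code&gt; and the worst gap collapses back to ~10ms, which is exactly what switching &lt;code&gt;requests&lt;/code&gt; for &lt;code&gt;httpx.AsyncClient&lt;/code&gt; buys you in a real endpoint.&lt;/p&gt;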

&lt;p&gt;When you must call a sync library, wrap it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.concurrency&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_in_threadpool&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_report&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# sync pandas code — don't block the loop
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;run_in_threadpool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expensive_sync_function&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Connection pools are not optional
&lt;/h2&gt;

&lt;p&gt;Naive async code opens a new database connection per request. At 500 RPS with a 50ms query, Little's law says only ~25 queries are in flight at once, but you're paying for &lt;strong&gt;500 connection handshakes every second&lt;/strong&gt;, and the first latency spike piles open connections far past what Postgres tolerates. Most instances cap out around 200-500 connections.&lt;/p&gt;

&lt;p&gt;Fix: use a single pool per worker, with tuned sizing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# database.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy.ext.asyncio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_async_engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AsyncSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy.orm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sessionmaker&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_async_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pool_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# steady-state per worker
&lt;/span&gt;    &lt;span class="n"&gt;max_overflow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# burst tolerance
&lt;/span&gt;    &lt;span class="n"&gt;pool_pre_ping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# detect dead connections
&lt;/span&gt;    &lt;span class="n"&gt;pool_recycle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# rotate every 30min
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;AsyncSessionLocal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sessionmaker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AsyncSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expire_on_commit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_db&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncSessionLocal&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For multi-worker deployments (Uvicorn &lt;code&gt;--workers 4&lt;/code&gt;), multiply by worker count. If your Postgres caps at 200 connections, 4 workers × 30 max = 120 is safe. Monitor &lt;code&gt;pg_stat_activity&lt;/code&gt; in prod.&lt;/p&gt;
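&lt;p&gt;The budgeting is worth writing down. A small sketch (the helper name is mine; the 3-connection reserve mirrors Postgres's default &lt;code&gt;superuser_reserved_connections&lt;/code&gt;):&lt;/p&gt;

```python
def per_worker_ceiling(pg_max_connections: int, workers: int,
                       superuser_reserve: int = 3) -> int:
    # the most connections one worker may ever hold (pool_size + max_overflow)
    # without the fleet being able to exhaust Postgres
    usable = pg_max_connections - superuser_reserve
    return usable // workers

ceiling = per_worker_ceiling(200, workers=4)
print(ceiling)  # 49: pool_size=20 + max_overflow=10 (30 total) leaves headroom
```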

&lt;h2&gt;
  
  
  3. Push heavy work to background queues
&lt;/h2&gt;

&lt;p&gt;The endpoint that took Savyour down in month two: a synchronous product sync that iterated through 50K affiliate offers per merchant. Five merchants syncing at once meant 250K records processed inside the request cycle, and the timeouts cascaded.&lt;/p&gt;

&lt;p&gt;The fix was simple but non-obvious to a team new to async: &lt;strong&gt;never do heavy work in the request cycle.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;arq&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_pool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;arq.connections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RedisSettings&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/sync/{merchant_id}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;trigger_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;merchant_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_arq_pool&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sync_merchant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;merchant_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queued&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ARQ, Celery, or Dramatiq — pick one. The worker fleet scales independently of the API fleet. Requests return in milliseconds. Monitoring stays sane.&lt;/p&gt;
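&lt;p&gt;The shape matters more than the library. A toy stdlib version of enqueue-and-return (ARQ's real &lt;code&gt;enqueue_job&lt;/code&gt; goes through Redis and a separate worker process, not an in-process queue):&lt;/p&gt;

```python
import asyncio
import itertools

job_ids = itertools.count(1)

async def worker(queue, results):
    # drains heavy jobs off the request path, one at a time
    while True:
        job_id, merchant_id = await queue.get()
        await asyncio.sleep(0)  # stand-in for the 50K-offer sync
        results[job_id] = f"synced merchant {merchant_id}"
        queue.task_done()

async def trigger_sync(queue, merchant_id):
    # what the endpoint does: enqueue and return immediately
    job_id = next(job_ids)
    await queue.put((job_id, merchant_id))
    return {"job_id": job_id, "status": "queued"}

async def main():
    queue, results = asyncio.Queue(), {}
    worker_task = asyncio.create_task(worker(queue, results))
    resp = await trigger_sync(queue, merchant_id=42)
    await queue.join()  # only this demo waits; the real API never does
    worker_task.cancel()
    return resp, results

resp, results = asyncio.run(main())
print(resp)  # {'job_id': 1, 'status': 'queued'}
```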

&lt;h2&gt;
  
  
  4. Pydantic v2 is 5-50× faster — use it
&lt;/h2&gt;

&lt;p&gt;If you're still on Pydantic v1, migrate. The v2 rewrite in Rust dropped our request validation overhead from ~8ms to ~0.5ms per request. At 3,000 RPS that's a full CPU core back.&lt;/p&gt;

&lt;p&gt;Gotchas we hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Config&lt;/code&gt; inner class → &lt;code&gt;model_config&lt;/code&gt; dict (built with &lt;code&gt;ConfigDict&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.dict()&lt;/code&gt; → &lt;code&gt;.model_dump()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;validator&lt;/code&gt; → &lt;code&gt;field_validator&lt;/code&gt;, &lt;code&gt;root_validator&lt;/code&gt; → &lt;code&gt;model_validator&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use &lt;code&gt;bump-pydantic&lt;/code&gt; for the mechanical parts. The semantic changes (validator signatures) need human review.&lt;/p&gt;
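&lt;p&gt;Here is what the renames look like side by side in a made-up model (comments mark the v1 spelling):&lt;/p&gt;

```python
from pydantic import BaseModel, ConfigDict, field_validator

class User(BaseModel):
    model_config = ConfigDict(str_strip_whitespace=True)  # was: class Config

    name: str
    age: int

    @field_validator("age")        # was: @validator("age")
    @classmethod
    def age_non_negative(cls, v: int) -> int:
        if v < 0:
            raise ValueError("age must be >= 0")
        return v

u = User(name="  Tufail  ", age=30)
print(u.model_dump())              # was: u.dict()
```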

&lt;h2&gt;
  
  
  5. Middleware for observability, not magic
&lt;/h2&gt;

&lt;p&gt;We run three middleware layers in production. In order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Request ID — every log line traces back
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RequestIDMiddleware&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Timing — p50/p95/p99 per route
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TimingMiddleware&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Structured logging — JSON out to CloudWatch
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LoggingMiddleware&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# CORS goes OUTERMOST so OPTIONS requests skip everything
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CORSMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allow_origins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FRONTEND_ORIGINS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Avoid:&lt;/strong&gt; auto-magic middleware that wraps your handlers with decorators you can't inspect. When things break at 3 AM, you need to grep the source and understand what's happening. Explicit &amp;gt; clever.&lt;/p&gt;
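&lt;p&gt;In that spirit, a timing middleware fits in ~20 lines of pure ASGI, greppable end to end. A sketch, not the exact class we run:&lt;/p&gt;

```python
import asyncio
import time

class TimingMiddleware:
    # pure-ASGI sketch: measures wall time and stamps an X-Response-Time
    # header; a production version would also emit to your metrics backend
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()

        async def send_timed(message):
            if message["type"] == "http.response.start":
                elapsed_ms = (time.perf_counter() - start) * 1000
                headers = list(message.get("headers", []))
                headers.append((b"x-response-time", f"{elapsed_ms:.1f}ms".encode()))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_timed)

# minimal demo against a bare ASGI app
async def app(scope, receive, send):
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})

sent = []
async def capture(message):
    sent.append(message)

asyncio.run(TimingMiddleware(app)({"type": "http"}, None, capture))
print([name for name, _ in sent[0]["headers"]])  # [b'x-response-time']
```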

&lt;h2&gt;
  
  
  6. Health checks, liveness, readiness
&lt;/h2&gt;

&lt;p&gt;Three distinct endpoints. Don't collapse them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/healthz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# is the process up?
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/readyz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# can we serve traffic?
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ready&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_db&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_redis&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/livez&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# should kubelet restart us?
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;live&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubernetes (or ECS, or Fargate) uses these to make restart decisions. A failing dependency should make &lt;code&gt;readyz&lt;/code&gt; fail so the LB stops sending traffic — but shouldn't make &lt;code&gt;livez&lt;/code&gt; fail and trigger a restart loop.&lt;/p&gt;
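&lt;p&gt;One wrinkle the snippet glosses over: a failing probe should become a clean 503, not an unhandled exception. A framework-free sketch of the aggregation logic (&lt;code&gt;check_db&lt;/code&gt; and &lt;code&gt;check_redis&lt;/code&gt; are stand-ins for the real probes):&lt;/p&gt;

```python
import asyncio

async def check_db():
    return True   # stand-in for: await db.execute(text("SELECT 1"))

async def check_redis():
    return True   # stand-in for: await redis.ping()

async def readiness(checks):
    # run every probe concurrently; any exception or falsy result means 503
    results = await asyncio.gather(*(c() for c in checks),
                                   return_exceptions=True)
    ready = all(r is True for r in results)
    return (200 if ready else 503), {"status": "ready" if ready else "degraded"}

status, body = asyncio.run(readiness([check_db, check_redis]))
print(status, body)  # 200 {'status': 'ready'}
```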

&lt;h2&gt;
  
  
  7. One project structure to rule them all
&lt;/h2&gt;

&lt;p&gt;After shipping a dozen FastAPI services, this is the structure I reach for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;app/
├── main.py            # FastAPI app, middleware, lifespan
├── config.py          # pydantic-settings, env-driven
├── db.py              # engine + session factory
├── dependencies.py    # shared Depends() providers
├── routers/
│   ├── customers.py
│   ├── orders.py
│   └── webhooks.py
├── schemas/           # pydantic request/response models
├── models/            # SQLAlchemy ORM
├── services/          # business logic, pure-ish
├── workers/           # ARQ/Celery task definitions
└── tests/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key discipline: &lt;strong&gt;routers call services, services call models, models don't reach back up.&lt;/strong&gt; Break that rule and tests get painful fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd skip
&lt;/h2&gt;

&lt;p&gt;Things I used to reach for that I don't anymore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Starlette middleware for auth.&lt;/strong&gt; Use FastAPI &lt;code&gt;Depends()&lt;/code&gt; instead — it composes cleanly with per-route permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom exception handlers for every error.&lt;/strong&gt; One global handler that maps exceptions → HTTP codes is enough for 95% of services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-engineered response models for internal APIs.&lt;/strong&gt; &lt;code&gt;dict&lt;/code&gt; returns are fine for handlers only your own code calls.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The meta-point
&lt;/h2&gt;

&lt;p&gt;FastAPI's documentation is aggressively good — better than most frameworks' books. Read it twice before inventing patterns. Most of the hard-won lessons above are implicit in the docs; I just didn't slow down enough to absorb them the first time.&lt;/p&gt;

</description>
      <category>fastapi</category>
      <category>python</category>
      <category>scaling</category>
      <category>async</category>
    </item>
    <item>
      <title>Cutting our Claude API bill by 78% with prompt caching</title>
      <dc:creator>Tufail Khan</dc:creator>
      <pubDate>Tue, 21 Apr 2026 11:41:20 +0000</pubDate>
      <link>https://dev.to/tufailkhan457/cutting-our-claude-api-bill-by-78-with-prompt-caching-1fon</link>
      <guid>https://dev.to/tufailkhan457/cutting-our-claude-api-bill-by-78-with-prompt-caching-1fon</guid>
      <description>&lt;p&gt;In January 2026 our monthly Claude bill crossed &lt;strong&gt;$4,200&lt;/strong&gt;, up from $600 six months earlier. We were serving a RAG-backed customer-support assistant that retrieved ~12K tokens of context per query, ran through an 800-token system prompt, and called Claude an average of 4.2 times per user session.&lt;/p&gt;

&lt;p&gt;Rolling out Anthropic's &lt;strong&gt;prompt caching&lt;/strong&gt; dropped that to &lt;strong&gt;$920/month&lt;/strong&gt; — a 78% reduction — without touching any user-facing behavior.&lt;/p&gt;

&lt;p&gt;This post is the exact playbook.&lt;/p&gt;

&lt;h2&gt;
  
  
  What prompt caching does
&lt;/h2&gt;

&lt;p&gt;Claude's prompt caching stores &lt;em&gt;prefix portions&lt;/em&gt; of your prompt in Anthropic's infrastructure. When a subsequent request reuses that same prefix, the cached portion costs &lt;strong&gt;10% of the normal input-token price&lt;/strong&gt; and is processed much faster.&lt;/p&gt;

&lt;p&gt;The pricing in 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cache write:&lt;/strong&gt; 1.25× input cost (on first use)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache read (hit):&lt;/strong&gt; 0.1× input cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTL:&lt;/strong&gt; 5 minutes by default, 1 hour available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Break-even arrives on the first cache hit: the 0.25× write premium is repaid the moment one read bills at 0.1× instead of 1×. In practice, a well-placed cache break point hits &lt;strong&gt;dozens to hundreds of times&lt;/strong&gt; before it expires.&lt;/p&gt;
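&lt;p&gt;The arithmetic, with an illustrative $3-per-million-token input price:&lt;/p&gt;

```python
def cached_cost(requests, prefix_tokens, usd_per_mtok=3.00):
    # request 1 writes the prefix at 1.25x; every later request reads at 0.1x
    base = prefix_tokens / 1_000_000 * usd_per_mtok
    return base * (1.25 + 0.1 * (requests - 1))

def uncached_cost(requests, prefix_tokens, usd_per_mtok=3.00):
    return prefix_tokens / 1_000_000 * usd_per_mtok * requests

# a 12K-token prefix, like our RAG context
for n in (1, 2, 10, 100):
    print(n, round(cached_cost(n, 12_000), 4), round(uncached_cost(n, 12_000), 4))
# 1 0.045 0.036   <- the lone write costs 25% extra
# 2 0.0486 0.072  <- one hit and you're already ahead
# 100 0.4014 3.6  <- ~9x cheaper at steady state
```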

&lt;h2&gt;
  
  
  Where to cache — high, medium, low ROI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;High ROI&lt;/strong&gt; (always cache):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System prompts (usually stable across all requests)&lt;/li&gt;
&lt;li&gt;Long tool-schema definitions&lt;/li&gt;
&lt;li&gt;Retrieved context chunks reused within a session (RAG)&lt;/li&gt;
&lt;li&gt;Few-shot example banks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Medium ROI&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User conversation history early in a session (caches grow as the conversation progresses)&lt;/li&gt;
&lt;li&gt;Document chunks that appear frequently across queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Low / anti-ROI&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-request user input&lt;/li&gt;
&lt;li&gt;Anything that changes every call&lt;/li&gt;
&lt;li&gt;Caches smaller than 1024 tokens (minimum cache block size for Claude Opus/Sonnet)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The anatomy of a cached prompt
&lt;/h2&gt;

&lt;p&gt;In the Python SDK, you add &lt;code&gt;cache_control&lt;/code&gt; markers to the content blocks you want cached. Everything &lt;em&gt;before&lt;/em&gt; the marker gets cached as a prefix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LONG_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# stable, reusable
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{...},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retrieved_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# session-scoped RAG chunks
&lt;/span&gt;                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="c1"&gt;# no cache marker — this changes every request
&lt;/span&gt;                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# inspect cache metrics
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache_creation_input_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache_read_input_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;up to 4 cache break points&lt;/strong&gt; per request. We use all 4:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;System prompt (changes ~monthly)&lt;/li&gt;
&lt;li&gt;Tool schemas (changes ~monthly)&lt;/li&gt;
&lt;li&gt;Retrieved RAG context (changes per session)&lt;/li&gt;
&lt;li&gt;Conversation history (grows within session)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Metrics from real traffic
&lt;/h2&gt;

&lt;p&gt;Before caching, on a representative 1,000-request sample:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input tokens billed: 14.2M (≈ $42.60 at Opus 4.7 pricing)&lt;/li&gt;
&lt;li&gt;Output tokens billed: 380K (≈ $28.50)&lt;/li&gt;
&lt;li&gt;Total: &lt;strong&gt;$71.10&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After caching, same workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache write input tokens: 1.8M ($6.75)&lt;/li&gt;
&lt;li&gt;Cache read input tokens: 12.1M ($3.63)&lt;/li&gt;
&lt;li&gt;Uncached input tokens: 300K ($0.90)&lt;/li&gt;
&lt;li&gt;Output tokens: 380K ($28.50)&lt;/li&gt;
&lt;li&gt;Total: &lt;strong&gt;$39.78&lt;/strong&gt; (−44%)&lt;/li&gt;
&lt;/ul&gt;
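&lt;p&gt;As a sanity check, the −44% follows directly from the line items above (the dollar figures are the post's own numbers, not pricing constants):&lt;/p&gt;

```python
# reproduce the savings figure from the itemized costs above
before = 42.60 + 28.50               # uncached input + output
after = 6.75 + 3.63 + 0.90 + 28.50   # cache writes + cache reads + uncached input + output
savings = 1 - after / before

print(f"${before:.2f} -> ${after:.2f} ({savings:.0%} saved)")  # → $71.10 -> $39.78 (44% saved)
```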

&lt;p&gt;Output tokens dominate what's left. Short of switching models, the input side is essentially solved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch out for: cache invalidation footguns
&lt;/h2&gt;

&lt;p&gt;Cache hits match on &lt;strong&gt;exact byte-level prefix equality&lt;/strong&gt;. Any variance busts the cache. Things that silently broke ours early on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Whitespace drift&lt;/strong&gt; in system-prompt templating (a stray &lt;code&gt;\n&lt;/code&gt; from a template engine)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dict-ordering&lt;/strong&gt; when serializing tool schemas from a Python dict — always use &lt;code&gt;json.dumps(..., sort_keys=True)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp injection&lt;/strong&gt; into system prompts (&lt;code&gt;"Today is {date}..."&lt;/code&gt; rebuilds the cache every day — move it to user content)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User-scoped data in system prompt&lt;/strong&gt; — blows cache per user; move it down the prompt&lt;/li&gt;
&lt;/ul&gt;
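&lt;p&gt;The dict-ordering footgun is easy to demonstrate: two semantically identical tool schemas can serialize to different bytes, and only a byte-identical prefix gets a cache hit. (The schema below is illustrative, not one of ours.)&lt;/p&gt;

```python
import json

# same schema, different insertion order (as happens when dicts are built in different code paths)
schema_a = {"name": "get_order", "input_schema": {"type": "object", "properties": {}}}
schema_b = {"input_schema": {"properties": {}, "type": "object"}, "name": "get_order"}

print(json.dumps(schema_a) == json.dumps(schema_b))  # False - different bytes, cache miss
print(json.dumps(schema_a, sort_keys=True) == json.dumps(schema_b, sort_keys=True))  # True - stable bytes
```

&lt;p&gt;In practice that means normalizing schemas once (e.g. &lt;code&gt;json.loads(json.dumps(s, sort_keys=True))&lt;/code&gt;) before handing them to the SDK, so the serialized request prefix never varies.&lt;/p&gt;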

&lt;p&gt;Instrument &lt;code&gt;cache_creation_input_tokens&lt;/code&gt; vs &lt;code&gt;cache_read_input_tokens&lt;/code&gt; on every response and alert if the ratio drifts. A week of silent cache misses can cost you thousands.&lt;/p&gt;
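&lt;p&gt;A minimal sketch of that instrumentation (the helper names and the 80% threshold are our choices; the two &lt;code&gt;usage&lt;/code&gt; fields are the ones the API returns):&lt;/p&gt;

```python
def cache_hit_ratio(usage) -> float:
    """Fraction of cacheable prefix tokens that were served from cache."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    total = read + written
    return read / total if total else 0.0

def check_cache_health(usage, alert_below: float = 0.8) -> None:
    # wire this into your metrics pipeline; print stands in for a real alert
    ratio = cache_hit_ratio(usage)
    if ratio < alert_below:
        print(f"cache hit ratio degraded: {ratio:.1%}")
```

&lt;p&gt;Call it on every response; a healthy steady state writes rarely and reads constantly, so the ratio should sit well above the threshold.&lt;/p&gt;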

&lt;h2&gt;
  
  
  The 1-hour cache tier
&lt;/h2&gt;

&lt;p&gt;Anthropic added a &lt;strong&gt;1-hour TTL&lt;/strong&gt; option in mid-2025. It costs 2× the write price but lives 12× longer. For workloads with predictable hot paths — e.g. a support assistant where 80% of sessions hit the same product docs — the 1-hour tier amortizes beautifully.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ttl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use it where cache hit rate is high. Don't use it for small cache blocks or unpredictable traffic — you'll pay the write premium without the hit volume.&lt;/p&gt;
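&lt;p&gt;The break-even is simple arithmetic: the 1-hour tier wins once the prefix would otherwise go cold and be re-written more than about twice per hour. A back-of-envelope sketch (the multipliers are assumptions based on the "2× the write price" figure above, not quoted pricing; check the current pricing page):&lt;/p&gt;

```python
def hourly_write_cost(prefix_mtok: float, base_per_mtok: float,
                      write_mult: float, cold_starts: int) -> float:
    """Cost in dollars of (re)writing a cached prefix over one hour of traffic."""
    return prefix_mtok * base_per_mtok * write_mult * cold_starts

# assumptions: 2M-token prefix, $3/MTok base input price,
# 5-minute writes at 1.25x, 1-hour writes at 2x that
five_min = hourly_write_cost(2.0, 3.0, 1.25, cold_starts=4)  # prefix goes cold 4x/hour
one_hour = hourly_write_cost(2.0, 3.0, 2.5, cold_starts=1)   # written once, lives the hour

print(five_min, one_hour)  # 30.0 15.0
```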

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;Prompt caching is the highest-ROI single change I've made to a production Claude app in the last year. If you're running a RAG, agent, or long-context workload on Claude and &lt;em&gt;not&lt;/em&gt; using prompt caching, there is almost certainly a 40-80% saving sitting on the table.&lt;/p&gt;

&lt;p&gt;The cost to implement: two afternoons, including the instrumentation. The cost to ignore: compounding every month you don't do it.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>anthropic</category>
      <category>costoptimization</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why we replaced LangChain with the raw Anthropic SDK in production</title>
      <dc:creator>Tufail Khan</dc:creator>
      <pubDate>Tue, 21 Apr 2026 11:40:59 +0000</pubDate>
      <link>https://dev.to/tufailkhan457/why-we-replaced-langchain-with-the-raw-anthropic-sdk-in-production-3611</link>
      <guid>https://dev.to/tufailkhan457/why-we-replaced-langchain-with-the-raw-anthropic-sdk-in-production-3611</guid>
      <description>&lt;p&gt;LangChain was the right answer in 2023. It abstracted away a messy ecosystem of half-baked provider APIs, gave you a unified &lt;code&gt;LLM&lt;/code&gt; interface, and let you stitch agents together with a few dozen lines of Python. We used it everywhere — including in production on Vettio, our AI recruitment platform.&lt;/p&gt;

&lt;p&gt;In April 2026, we ripped it out.&lt;/p&gt;

&lt;p&gt;This post is about &lt;strong&gt;why&lt;/strong&gt; we made that call, &lt;strong&gt;what replaced it&lt;/strong&gt;, and &lt;strong&gt;the metrics&lt;/strong&gt; that justified the migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The symptoms
&lt;/h2&gt;

&lt;p&gt;LangChain's abstractions started leaking the moment we went beyond happy-path demos. Three things kept biting us:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stack traces from hell.&lt;/strong&gt; A single &lt;code&gt;AgentExecutor.invoke()&lt;/code&gt; call crossed 14 frames of LangChain internals before reaching &lt;em&gt;our&lt;/em&gt; code. Debugging a malformed tool call felt like archaeology.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version churn.&lt;/strong&gt; Every minor bump renamed, relocated, or deprecated something we depended on. Our CI was pinned to a specific LangChain SHA for six months just to stay green.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abstracted-away observability.&lt;/strong&gt; We couldn't cleanly trace token usage, cache hits, or per-tool latencies without monkey-patching internal classes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Meanwhile, Anthropic's native SDK was getting &lt;em&gt;better&lt;/em&gt;. Native tool calling, prompt caching, extended thinking, streaming — all first-class and documented.&lt;/p&gt;

&lt;h2&gt;
  
  
  The refactor
&lt;/h2&gt;

&lt;p&gt;The logic we were using LangChain for wasn't complicated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build a system prompt from templates&lt;/li&gt;
&lt;li&gt;Call Claude with a list of tools&lt;/li&gt;
&lt;li&gt;Route tool calls to our internal handlers&lt;/li&gt;
&lt;li&gt;Return the result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We replaced ~800 lines of LangChain glue with this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tool_handlers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_turn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

        &lt;span class="c1"&gt;# Handle tool use
&lt;/span&gt;        &lt;span class="n"&gt;tool_calls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_handlers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No &lt;code&gt;AgentExecutor&lt;/code&gt;, no &lt;code&gt;Callback&lt;/code&gt;, no &lt;code&gt;ConversationBufferMemory&lt;/code&gt;. Just the model and our code.&lt;/p&gt;
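&lt;p&gt;Wiring it up looks like this (the tool name, schema, and handler below are illustrative, not Vettio's real ones):&lt;/p&gt;

```python
# hypothetical tool definition + handler registry for run_agent above
tools = [{
    "name": "get_candidate",
    "description": "Fetch a candidate profile by id.",
    "input_schema": {
        "type": "object",
        "properties": {"candidate_id": {"type": "string"}},
        "required": ["candidate_id"],
    },
}]

def get_candidate(candidate_id: str) -> dict:
    # stand-in for a real DB/service call
    return {"id": candidate_id, "stage": "screening"}

tool_handlers = {"get_candidate": get_candidate}

# answer = run_agent("Summarize candidate 42's status", tools, tool_handlers)
```

&lt;p&gt;The dispatch is the single &lt;code&gt;tool_handlers[call.name](**call.input)&lt;/code&gt; line in the loop, which is also why each handler's signature must match its &lt;code&gt;input_schema&lt;/code&gt; exactly.&lt;/p&gt;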

&lt;h2&gt;
  
  
  The metrics
&lt;/h2&gt;

&lt;p&gt;We ran the old and new paths side-by-side for two weeks on Vettio's interview-bot service. Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;p50 latency:&lt;/strong&gt; 2.1s → 1.4s (−33%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;p95 latency:&lt;/strong&gt; 4.8s → 3.2s (−33%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rate:&lt;/strong&gt; 0.9% → 0.2%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stack trace depth on errors:&lt;/strong&gt; 14 → 4 frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lines of integration code:&lt;/strong&gt; 812 → 187&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latency win came mostly from eliminating LangChain's implicit retry behavior on tool-use mismatches. With direct SDK calls, a malformed tool schema fails loudly instead of being silently retried three times.&lt;/p&gt;
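&lt;p&gt;When you do want retries, make them explicit and scoped to transient failures only. A sketch (the exception class and backoff values are our choices, not SDK defaults):&lt;/p&gt;

```python
import time

def with_retries(call, max_attempts: int = 3, backoff_s: float = 0.5):
    """Retry transient errors with exponential backoff; let schema errors fail loudly."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ConnectionError:  # stand-in for the SDK's transient error types
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * 2 ** (attempt - 1))

# usage: with_retries(lambda: client.messages.create(...))
```

&lt;p&gt;(The Anthropic SDK also exposes a &lt;code&gt;max_retries&lt;/code&gt; client option; the point is that either way, retry policy lives somewhere you can see it.)&lt;/p&gt;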

&lt;h2&gt;
  
  
  When LangChain still makes sense
&lt;/h2&gt;

&lt;p&gt;This isn't a blanket "don't use LangChain" post. It still wins if you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider abstraction.&lt;/strong&gt; Swapping between Claude, GPT-4, and Gemini behind a stable interface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph workflows&lt;/strong&gt; for graph-based agent topologies you'd otherwise build from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith observability&lt;/strong&gt; you don't want to rebuild.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a team that's already committed to one provider (we're all-in on Claude) and wants full control over prompts, tool schemas, and observability — the native SDK is the right tool in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lesson
&lt;/h2&gt;

&lt;p&gt;Abstractions pay for themselves when the underlying APIs are bad. Anthropic's API isn't bad. It's clean, well-documented, and stable. The abstraction tax was real; the abstraction benefit had quietly evaporated.&lt;/p&gt;

&lt;p&gt;If you're still on LangChain in a production Claude app, benchmark a direct-SDK rewrite of your hot path. You might be surprised.&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>claude</category>
      <category>anthropic</category>
      <category>python</category>
    </item>
  </channel>
</rss>
