<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tufail Khan</title>
    <description>The latest articles on DEV Community by Tufail Khan (@tufailkhan457).</description>
    <link>https://dev.to/tufailkhan457</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3890666%2F512a744d-eab5-48fd-a402-4adccef0aef2.jpg</url>
      <title>DEV Community: Tufail Khan</title>
      <link>https://dev.to/tufailkhan457</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tufailkhan457"/>
    <language>en</language>
    <item>
      <title>Harness engineering: a self-evolving feature loop in 312 lines of bash</title>
      <dc:creator>Tufail Khan</dc:creator>
      <pubDate>Thu, 30 Apr 2026 17:24:45 +0000</pubDate>
      <link>https://dev.to/tufailkhan457/harness-engineering-a-self-evolving-feature-loop-in-312-lines-of-bash-4j33</link>
      <guid>https://dev.to/tufailkhan457/harness-engineering-a-self-evolving-feature-loop-in-312-lines-of-bash-4j33</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/tufailkhan45/harness-loop" rel="noopener noreferrer"&gt;github.com/tufailkhan45/harness-loop&lt;/a&gt; — one bash script, drop into any spec-driven repo.&lt;br&gt;
&lt;strong&gt;Originally published on:&lt;/strong&gt; &lt;a href="https://tufail.dev/blog/harness-engineering-self-evolving-loop" rel="noopener noreferrer"&gt;tufail.dev/blog/harness-engineering-self-evolving-loop&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most posts about Claude Code talk about prompts. This one is about the &lt;em&gt;harness&lt;/em&gt; — the wrapper around the model that turns a single &lt;code&gt;claude -p&lt;/code&gt; invocation into a system that can ship a backlog of features over hours, survive its own failures, and learn as it goes.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/tufailkhan45/harness-loop" rel="noopener noreferrer"&gt;harness-loop&lt;/a&gt; after watching too many headless Claude runs silently spin on the same broken approach for thirty minutes. This post walks through what a harness actually does, why the design comes down to three load-bearing parts, and what I learned writing one in bash.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is harness engineering?
&lt;/h2&gt;

&lt;p&gt;The model produces tokens. The spec describes the goal. The &lt;strong&gt;harness&lt;/strong&gt; is everything in between: when to invoke the model, what context to feed it, when to stop, when to halt the whole run, and what to trust as a "done" signal.&lt;/p&gt;

&lt;p&gt;If the model is the engine and the spec is the destination, the harness is the chassis, fuel system, and dashboard warning lights. Most AI workflows fail not because the model is wrong but because the harness is missing — the model gets called once, returns something that looks plausible, and the human is left to figure out whether the work actually shipped.&lt;/p&gt;

&lt;p&gt;A good harness answers four questions on every iteration:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is the next unit of work?&lt;/li&gt;
&lt;li&gt;What context does the model need that it didn't have last time?&lt;/li&gt;
&lt;li&gt;Did anything just happen that requires a human?&lt;/li&gt;
&lt;li&gt;Is this feature actually done, or does the model just think it is?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole loop is built around answering those four questions, mechanically, in a way that survives crashes, quota windows, and the model's occasional confidence in answers that are wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the loop does
&lt;/h2&gt;

&lt;p&gt;The runner is one bash file (&lt;code&gt;scripts/run-features.sh&lt;/code&gt;, 312 lines). Every iteration:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Picks the next feature without a &lt;code&gt;.done&lt;/code&gt; marker&lt;/li&gt;
&lt;li&gt;Builds a prompt from the spec, the feature's prior attempt log, and a global learnings file&lt;/li&gt;
&lt;li&gt;Invokes &lt;code&gt;claude -p&lt;/code&gt; under &lt;code&gt;timeout&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Inspects the resulting log for halt signals (&lt;code&gt;BLOCKED:&lt;/code&gt;, no growth, quota errors)&lt;/li&gt;
&lt;li&gt;Loops&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It exits 0 when every feature has a marker, or with a halt code (3-6) when something demands a human.&lt;br&gt;
&lt;/p&gt;
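The pick-next step is just a filesystem scan. A minimal sketch of how it could work, assuming the `specs/<slug>/spec.md` layout and `.done` markers described in the post (the function name and exact paths are mine, not necessarily the script's):

```shell
#!/usr/bin/env bash
# Sketch of step 1: find the first feature without a .done marker.
# Assumes specs/<slug>/spec.md specs and logs/<slug>.done markers.
next_feature() {
  local spec slug
  for spec in specs/*/spec.md; do
    [ -e "$spec" ] || break              # glob matched nothing: no specs yet
    slug=$(basename "$(dirname "$spec")")
    if [ ! -f "logs/$slug.done" ]; then
      echo "$slug"                       # this is the next unit of work
      return 0
    fi
  done
  return 1                               # queue empty: every feature is done
}
```

Because the queue is re-derived from disk on every iteration, killing and restarting the runner needs no recovery logic at all.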

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;specs/auth-login/spec.md   ──┐
logs/auth-login.log        ──┼──&amp;gt; prompt ──&amp;gt; claude -p ──&amp;gt; append log + maybe .done
logs/learnings.md          ──┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the whole architecture. No queue, no database, no orchestrator. The filesystem is the state machine, and &lt;code&gt;.done&lt;/code&gt; markers are the source of truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-evolution: three parts that all have to work
&lt;/h2&gt;

&lt;p&gt;"Self-evolving" sounds hand-wavy until you stare at what it actually requires. There are exactly three mechanisms, and breaking any one breaks the loop:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Read.&lt;/strong&gt; Every iteration tails the last 200 lines of two files into the prompt — the feature's own prior log (so the model does not repeat what already failed), and a cross-feature learnings file (so feature D benefits from a discovery made in feature A). Recency is the point: head or middle slices would work less well, because the most recent attempt carries the most relevant signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Write.&lt;/strong&gt; The prompt explicitly asks the model to do two things at the end of every iteration: append a progress note to the feature log, and append a one-line lesson to &lt;code&gt;learnings.md&lt;/code&gt; — but &lt;strong&gt;only&lt;/strong&gt; if the lesson is broadly applicable. The wording is deliberately load-bearing. Soften it ("you may want to add a note...") and the loop's memory degrades within a handful of iterations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Floor.&lt;/strong&gt; A circuit breaker. If a feature's log does not grow by more than 32 bytes for &lt;code&gt;STUCK_LIMIT&lt;/code&gt; iterations in a row, the runner halts that feature with exit code 5. The runner cannot audit &lt;em&gt;what&lt;/em&gt; the model writes, only &lt;em&gt;whether&lt;/em&gt; it writes anything. Without this floor, a model that has hallucinated its feedback channel will spin forever and burn quota.&lt;/p&gt;

&lt;p&gt;The asymmetry matters. Read and Write are model behaviour — both can fail subtly. The Floor is a hard mechanical guardrail that catches the failure mode the model itself cannot self-detect.&lt;/p&gt;
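The Floor fits in a few lines of coreutils. A hedged sketch, assuming the 32-byte threshold and a `STUCK_LIMIT` of consecutive quiet iterations from the post (function and variable names here are illustrative, not the script's actual ones):

```shell
# Sketch of the circuit breaker: halt once the log stops growing.
STUCK_LIMIT=${STUCK_LIMIT:-3}
MIN_GROWTH=32                      # bytes; below this an iteration counts as "quiet"

log_size() { stat -c %s "$1" 2>/dev/null || echo 0; }

# usage: check_floor <logfile> <prev_size> <stuck_count>
# Sets globals: size (new log size), stuck (updated quiet-iteration count).
# Returns non-zero once the feature has been stuck STUCK_LIMIT times in a row.
check_floor() {
  local now
  now=$(log_size "$1")
  if [ $((now - $2)) -gt "$MIN_GROWTH" ]; then
    stuck=0                        # real progress: reset the counter
  else
    stuck=$(($3 + 1))              # quiet iteration: count it
  fi
  size=$now
  [ "$stuck" -lt "$STUCK_LIMIT" ]  # false => runner halts with exit code 5
}
```

Note what this does and does not check: it audits byte growth only, which is exactly the "whether, not what" guarantee described above.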

&lt;h2&gt;
  
  
  The prompt is the API
&lt;/h2&gt;

&lt;p&gt;Most of the prompt is a fixed heredoc, but two blocks are dynamic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;&amp;lt;&amp;lt;PRIOR_LOG
[last 200 lines of logs/feature-runner/&amp;lt;slug&amp;gt;.log]
PRIOR_LOG

&amp;lt;&amp;lt;&amp;lt;LEARNINGS
[last 200 lines of logs/feature-runner/learnings.md]
LEARNINGS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Followed by a six-step task list that constrains the iteration to &lt;em&gt;one meaningful step&lt;/em&gt; — not "finish the feature," not "make as much progress as you can," but pick the next unfinished piece, do it, verify it, log it. The "one step at a time" framing prevents the model from spending a 30-minute timeout on a megacommit it then cannot verify.&lt;/p&gt;

&lt;p&gt;Step 6 is the contract: write &lt;code&gt;&amp;lt;slug&amp;gt;.done&lt;/code&gt; &lt;strong&gt;only&lt;/strong&gt; if the spec is satisfied AND verification is green. The runner trusts this signal. Weaken the prompt ("write .done when you think you're close enough") and the whole loop loses its meaning — features get marked done that aren't done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four halt codes for four failure modes
&lt;/h2&gt;

&lt;p&gt;Halt categories matter because each one needs a different human response:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;What you do&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;HALT&lt;/code&gt; file present&lt;/td&gt;
&lt;td&gt;Someone paused it; resume with &lt;code&gt;rm HALT&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;BLOCKED:&lt;/code&gt; in feature log&lt;/td&gt;
&lt;td&gt;Model hit something it can't fix; read the log&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Circuit breaker tripped&lt;/td&gt;
&lt;td&gt;Silent spin; feature spec probably ambiguous&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Quota / auth / rate limit&lt;/td&gt;
&lt;td&gt;External issue; wait or rotate keys&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Code 5 is the most interesting. It catches the failure where the model is technically running but producing nothing. Without it, you can lose hours of quota on a feature that has gone silent.&lt;/p&gt;
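A supervising cron job or CI step can dispatch on these codes. A minimal sketch (the handler name and messages are mine, paraphrased from the table above):

```shell
# Dispatch on the runner's exit codes; one branch per failure mode.
handle_rc() {
  case "$1" in
    0) echo "complete: every feature has its .done marker" ;;
    3) echo "paused by HALT file: rm HALT to resume" ;;
    4) echo "BLOCKED: read the feature log before resuming" ;;
    5) echo "circuit breaker tripped: spec is probably ambiguous" ;;
    6) echo "quota/auth/rate limit: wait or rotate keys" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}
# e.g.  ./scripts/run-features.sh; handle_rc $?
```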

&lt;h2&gt;
  
  
  Why bash
&lt;/h2&gt;

&lt;p&gt;I considered Python. Bash won for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Zero install friction.&lt;/strong&gt; Copy one script and a settings file into any repo. No &lt;code&gt;venv&lt;/code&gt;, no &lt;code&gt;pip install&lt;/code&gt;, no version juggling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resumability is trivial.&lt;/strong&gt; State is files on disk. Kill the process, restart it, it picks up exactly where it left off. &lt;code&gt;.done&lt;/code&gt; markers are the source of truth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coreutils already does the work.&lt;/strong&gt; &lt;code&gt;timeout&lt;/code&gt; for per-call kills, &lt;code&gt;tail -n 200&lt;/code&gt; for windowed context, &lt;code&gt;stat -c %s&lt;/code&gt; for the size-delta circuit breaker, &lt;code&gt;df -Pm&lt;/code&gt; for the disk warning. None of this needs a programming language.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;set -uo pipefail&lt;/code&gt; is on; &lt;code&gt;set -e&lt;/code&gt; is intentionally &lt;strong&gt;off&lt;/strong&gt;. The runner &lt;em&gt;must&lt;/em&gt; survive a non-zero exit from &lt;code&gt;claude&lt;/code&gt; — a failed iteration is data, not a fatal error. With &lt;code&gt;-e&lt;/code&gt;, the loop dies on the first model error and you lose the entire run.&lt;/p&gt;
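What that tolerance looks like in practice, as a sketch (the wrapper function and its name are my invention, not the script's):

```shell
set -uo pipefail            # deliberately no -e

# usage: run_once <logfile> <cmd...>
# Runs one iteration's command; a non-zero exit is recorded, never fatal.
run_once() {
  local log=$1; shift
  if ! "$@" >>"$log" 2>&1; then
    # A failed iteration is data: note it in the log and let the loop continue.
    echo "[runner] iteration exited non-zero, continuing" >>"$log"
  fi
  return 0
}
# e.g.  run_once "logs/auth-login.log" timeout 1800 claude -p "$prompt"
```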

&lt;h2&gt;
  
  
  What it isn't
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not a planner.&lt;/strong&gt; Specs are the input, not the output. Decomposition happens inside each iteration, by the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not a verifier.&lt;/strong&gt; Verification is delegated to the model — pytest, npm test, curl, claude-in-chrome MCP for UI smoke tests, whatever fits the feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not language-specific.&lt;/strong&gt; It runs against any repo with a &lt;code&gt;specs/&amp;lt;slug&amp;gt;/spec.md&lt;/code&gt; layout. Python, TypeScript, Rust, Go — the runner doesn't care. The model reads the spec and any project-level &lt;code&gt;CLAUDE.md&lt;/code&gt; and picks the right tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;Three things I would change if I rebuilt it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Make the size-delta threshold configurable per feature.&lt;/strong&gt; 32 bytes works on average but some features have legitimately quiet iterations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add a &lt;code&gt;PARALLEL=N&lt;/code&gt; flag.&lt;/strong&gt; Right now it is strictly serial. For independent features, parallelism would give a 3-4x throughput gain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream the run log to stderr unconditionally.&lt;/strong&gt; I added &lt;code&gt;tee&lt;/code&gt; later when I realised I couldn't see what was happening without tailing two files at once.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The deeper lesson from this build: &lt;strong&gt;self-evolving systems don't need to be smart, they need to be honest about their own failure modes.&lt;/strong&gt; The harness loop has no learning algorithm, no graph, no agent framework. It has three text files and a circuit breaker. That turns out to be enough to ship features overnight without a human in the chair — provided the spec is clear and the model is given a way to remember.&lt;/p&gt;

&lt;p&gt;Try it: &lt;a href="https://github.com/tufailkhan45/harness-loop" rel="noopener noreferrer"&gt;github.com/tufailkhan45/harness-loop&lt;/a&gt;. The README has install steps and a dry-run mode that prints the resolved queue and sample prompt without spending tokens.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>bash</category>
      <category>aicoding</category>
      <category>agents</category>
    </item>
    <item>
      <title>Spec-driven development with Claude Code: shipping features in an hour</title>
      <dc:creator>Tufail Khan</dc:creator>
      <pubDate>Fri, 24 Apr 2026 07:58:02 +0000</pubDate>
      <link>https://dev.to/tufailkhan457/spec-driven-development-with-claude-code-shipping-features-in-an-hour-1e40</link>
      <guid>https://dev.to/tufailkhan457/spec-driven-development-with-claude-code-shipping-features-in-an-hour-1e40</guid>
      <description>&lt;p&gt;The developers I know who are shipping the most in 2026 aren't the ones with the fastest typing speed. They're the ones who've rewired their workflow around &lt;strong&gt;spec-driven development&lt;/strong&gt; with tools like Claude Code.&lt;/p&gt;

&lt;p&gt;I've been using this pattern for nine months on everything from Savyour to Vettio. My output has roughly doubled. My bug count is down. My reviews are shorter.&lt;/p&gt;

&lt;p&gt;Here's what the workflow actually looks like — not the marketing version, the messy version.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shift: from chat to spec
&lt;/h2&gt;

&lt;p&gt;The first generation of AI coding assistants (2023-early 2024) were chat-based: you'd have a long conversation with the model, paste code back and forth, and iterate. It was faster than solo, but the context was ephemeral, the quality was uneven, and it didn't play nicely with git.&lt;/p&gt;

&lt;p&gt;Claude Code, Cursor's agent mode, and similar tools inverted this. The new loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Write a spec&lt;/strong&gt; — a markdown document describing what to build.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hand the spec to the agent&lt;/strong&gt; — it reads it, explores the repo, writes the code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review the diff&lt;/strong&gt; — like reviewing a junior engineer's PR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate via spec amendments&lt;/strong&gt;, not chat.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The spec becomes the source of truth. The agent is the implementer. You stay in the architect / reviewer role.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a good spec looks like
&lt;/h2&gt;

&lt;p&gt;Specs that produce clean PRs share a few traits:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Intent and constraint, not instructions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Bad spec:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Open &lt;code&gt;app/routes/users.ts&lt;/code&gt;, add a new function called &lt;code&gt;getUserByEmail&lt;/code&gt;, call the prisma client...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Good spec:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Add an endpoint &lt;code&gt;GET /users/by-email?email=...&lt;/code&gt; that returns the user profile. Must hit the existing Prisma-backed &lt;code&gt;users&lt;/code&gt; table. Must respect the existing auth middleware on the &lt;code&gt;/users&lt;/code&gt; router. 404 when not found. Covered by a unit test in the same style as the existing &lt;code&gt;/users/:id&lt;/code&gt; test.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The good version tells the agent &lt;em&gt;what&lt;/em&gt; to build and &lt;em&gt;what rules apply&lt;/em&gt;, not &lt;em&gt;how&lt;/em&gt; to build it. The agent figures out the how from reading the codebase.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Acceptance criteria
&lt;/h3&gt;

&lt;p&gt;End every spec with a bulleted list of what "done" means:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Acceptance criteria&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; [ ] New route passes all existing auth middleware
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Returns 200 + user JSON when the email matches
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Returns 404 with a &lt;span class="sb"&gt;`{"error": "not_found"}`&lt;/span&gt; body otherwise
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Email lookup is case-insensitive
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Test added alongside &lt;span class="sb"&gt;`users.spec.ts`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [ ] No changes to the DB schema
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent uses these to self-check. You use them to review.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Out-of-scope callouts
&lt;/h3&gt;

&lt;p&gt;This is the one most devs skip, and it's the difference between a focused PR and a sprawl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Out of scope&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Do NOT refactor the existing &lt;span class="sb"&gt;`/users/:id`&lt;/span&gt; route
&lt;span class="p"&gt;-&lt;/span&gt; Do NOT add rate limiting (we'll do that in a follow-up)
&lt;span class="p"&gt;-&lt;/span&gt; Do NOT touch the signup flow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agents, like junior engineers, will happily "improve" adjacent code unless told not to. Make the boundary explicit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The iteration loop
&lt;/h2&gt;

&lt;p&gt;Real workflow, from spec to merged PR, on a typical 200-line feature:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;10 min:&lt;/strong&gt; Write the spec (&lt;code&gt;specs/2026-03-02-user-by-email.md&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;30 sec:&lt;/strong&gt; &lt;code&gt;claude "implement the spec at specs/2026-03-02-user-by-email.md"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3-8 min:&lt;/strong&gt; Claude reads the codebase, writes the code, runs the tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5-10 min:&lt;/strong&gt; I review the diff. I ask for a change. The agent makes it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 min:&lt;/strong&gt; CI runs. Green.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Merge.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total: ~30 minutes of my time for work that used to take 2 hours. Most of the savings aren't typing — they're &lt;em&gt;not-context-switching&lt;/em&gt; because the agent does the file-hunting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the agent is bad at — and how to compensate
&lt;/h2&gt;

&lt;p&gt;Three failure modes I've seen repeatedly:&lt;/p&gt;

&lt;h3&gt;
  
  
  Over-abstracting
&lt;/h3&gt;

&lt;p&gt;Agents love to introduce helper classes, utility modules, and "future-proofing" abstractions you didn't ask for. Explicit "keep it simple, match the surrounding code style" in the spec mitigates this 80% of the way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Silent test deletion
&lt;/h3&gt;

&lt;p&gt;Sometimes an agent will disable a failing test rather than fix the underlying bug. I've caught this half a dozen times. Mitigation: &lt;strong&gt;always grep the diff for &lt;code&gt;.skip&lt;/code&gt;, &lt;code&gt;xit(&lt;/code&gt;, &lt;code&gt;@pytest.mark.skip&lt;/code&gt;&lt;/strong&gt; before approving.&lt;/p&gt;
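That grep is easy to automate as a pre-review gate. A sketch using the patterns above (the function name is mine; extend the patterns for your stack):

```shell
# Fail if a diff quietly disables tests. Reads the diff on stdin:
#   git diff main... | check_no_skipped_tests
check_no_skipped_tests() {
  if grep -nE '\.skip\b|xit\(|@pytest\.mark\.skip'; then
    echo "diff disables tests: review before approving" >&2
    return 1
  fi
  return 0
}
```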

&lt;h3&gt;
  
  
  Confident wrong answers on versioning
&lt;/h3&gt;

&lt;p&gt;If your codebase uses an unusual library version, agents will default to the current version's API. Mitigation: pin the spec to "read &lt;code&gt;package.json&lt;/code&gt; first and match versions" or include a short "stack notes" section.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CI piece: trust but verify
&lt;/h2&gt;

&lt;p&gt;I treat AI-written code with &lt;em&gt;slightly&lt;/em&gt; more suspicion than my own. My CI for agent-produced PRs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard test suite&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grep -n 'skip\|FIXME\|TODO'&lt;/code&gt; diff check&lt;/li&gt;
&lt;li&gt;Secret scanner (agents occasionally echo-back test credentials)&lt;/li&gt;
&lt;li&gt;Bundle-size budget check&lt;/li&gt;
&lt;li&gt;Type-coverage threshold&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of those fail, the PR goes back for revision via a spec amendment, not a code fix on my side.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where spec-driven development fails
&lt;/h2&gt;

&lt;p&gt;Not every task is a fit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Highly exploratory work&lt;/strong&gt; ("figure out why this is slow") is still better with an interactive shell session, not a spec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Very small changes&lt;/strong&gt; (a one-line fix) have too much spec overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep refactors spanning &amp;gt;10 files&lt;/strong&gt; often do better broken into multiple specs handed off sequentially&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the 200-line-feature sweet spot — the majority of backend and glue work — spec-driven is my default.&lt;/p&gt;

&lt;h2&gt;
  
  
  The meta-skill
&lt;/h2&gt;

&lt;p&gt;The thing that's changed most about my job in 2026 isn't the model. It's that &lt;strong&gt;writing precise English&lt;/strong&gt; has become my single most leveraged engineering skill. A good spec is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unambiguous about intent&lt;/li&gt;
&lt;li&gt;Explicit about constraints&lt;/li&gt;
&lt;li&gt;Clear about what "done" looks like&lt;/li&gt;
&lt;li&gt;Honest about what's out of scope&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which, now that I think about it, is also what a good pre-2023 design doc looked like. Maybe we've come full circle.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>developerproductivity</category>
      <category>aicoding</category>
      <category>workflow</category>
    </item>
    <item>
      <title>Building MCP Servers in Python: a production primer for 2026</title>
      <dc:creator>Tufail Khan</dc:creator>
      <pubDate>Thu, 23 Apr 2026 05:49:09 +0000</pubDate>
      <link>https://dev.to/tufailkhan457/building-mcp-servers-in-python-a-production-primer-for-2026-4kh2</link>
      <guid>https://dev.to/tufailkhan457/building-mcp-servers-in-python-a-production-primer-for-2026-4kh2</guid>
      <description>&lt;p&gt;The Model Context Protocol (MCP) went from "Anthropic side project" to &lt;strong&gt;industry standard&lt;/strong&gt; in eighteen months. As of March 2026, MCP SDKs are pulling &lt;strong&gt;97 million monthly downloads&lt;/strong&gt;. Every serious agent framework — Claude, Cursor, OpenAI Agents SDK, Microsoft Agent Framework — speaks MCP natively.&lt;/p&gt;

&lt;p&gt;If you're a Python backend engineer, MCP is the most leveraged thing you can learn right now. This post is a practical walkthrough of shipping a production-grade MCP server using &lt;strong&gt;FastMCP&lt;/strong&gt;, the Python framework that makes it boring.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MCP actually is
&lt;/h2&gt;

&lt;p&gt;MCP is a protocol for exposing &lt;strong&gt;tools&lt;/strong&gt;, &lt;strong&gt;resources&lt;/strong&gt;, and &lt;strong&gt;prompts&lt;/strong&gt; to an AI agent in a standardized way. Instead of each agent framework inventing its own adapter format, you write your server once and it plugs into any MCP-compatible client.&lt;/p&gt;

&lt;p&gt;Think of it as &lt;strong&gt;"USB-C for agents."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A minimal server exposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; — functions the agent can call (e.g. &lt;code&gt;search_customers&lt;/code&gt;, &lt;code&gt;get_order_status&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt; — URIs the agent can read (e.g. &lt;code&gt;crm://contacts/123&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt; — parameterized prompt templates&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Starter: a FastMCP server in 40 lines
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# server.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;internal-crm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Customer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;mrr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;

&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_customers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Customer&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Search the CRM for customers by name or email. Optionally filter by tier.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://crm.internal/api/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;Customer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;

&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_customer_notes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch the latest account-manager notes for a customer.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://crm.internal/api/notes/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="nd"&gt;@mcp.resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crm://customer/{customer_id}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;customer_resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Read-only customer profile.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://crm.internal/api/customer/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;streamable-http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a complete, production-adjacent MCP server. Type-safe inputs and outputs via Pydantic. Docstrings become tool descriptions the agent reads. Resources get URIs the agent can embed in its context.&lt;/p&gt;

&lt;h2&gt;
  
  
  The transport shift: stdio → Streamable HTTP
&lt;/h2&gt;

&lt;p&gt;Every MCP tutorial from 2024 used &lt;code&gt;stdio&lt;/code&gt; transport — the server runs as a subprocess and the agent pipes JSON-RPC over stdin/stdout. That's fine for local clients like Claude Desktop. It's the wrong answer for production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streamable HTTP&lt;/strong&gt; (finalized in the 2025 spec) fixes this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Servers run as long-lived HTTP services, not per-invocation subprocesses&lt;/li&gt;
&lt;li&gt;Scale horizontally behind a load balancer&lt;/li&gt;
&lt;li&gt;Share across teams and apps&lt;/li&gt;
&lt;li&gt;Deploy once, discover via URL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In FastMCP, the switch is one line: &lt;code&gt;transport="streamable-http"&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auth: OAuth 2.1 the boring way
&lt;/h2&gt;

&lt;p&gt;MCP's 2025 spec added OAuth 2.1 as the standard auth mechanism. You don't roll your own. FastMCP ships with OAuth middleware that plugs into your existing IdP (Auth0, Okta, Cognito, Clerk, etc.):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastmcp.auth&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OAuth2Middleware&lt;/span&gt;

&lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OAuth2Middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;issuer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tufail.auth0.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;audience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp-internal-crm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;required_scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crm:read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent handles the authorization dance. Your server just enforces scopes on each tool.&lt;/p&gt;
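&lt;p&gt;FastMCP's middleware does the token verification for you, but the core check is small enough to understand end to end. A stripped-down sketch of scope enforcement against a decoded JWT payload (signature verification is deliberately omitted here; never skip it in production). &lt;code&gt;decode_payload&lt;/code&gt; and &lt;code&gt;has_scope&lt;/code&gt; are illustrative names, not FastMCP API:&lt;/p&gt;

```python
import base64
import json


def decode_payload(token: str) -> dict:
    """Decode a JWT's payload segment. No signature check: demo only."""
    payload_b64 = token.split(".")[1]
    # urlsafe base64 in JWTs has its padding stripped; restore it
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def has_scope(payload: dict, required_scope: str) -> bool:
    """OAuth scopes arrive as one space-delimited string in the scope claim."""
    return required_scope in payload.get("scope", "").split()
```

&lt;p&gt;A token carrying &lt;code&gt;scope: "crm:read crm:write"&lt;/code&gt; passes the &lt;code&gt;crm:read&lt;/code&gt; check and fails anything else.&lt;/p&gt;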

&lt;h2&gt;
  
  
  Deploying to AWS without overspending
&lt;/h2&gt;

&lt;p&gt;Two patterns we've landed on for production MCP:&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern A — Low-traffic internal tools: &lt;strong&gt;Lambda + API Gateway&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;mangum&lt;/code&gt; or FastMCP's ASGI adapter to run inside Lambda&lt;/li&gt;
&lt;li&gt;Cold starts ~300-500ms (acceptable for human-speed agent interactions)&lt;/li&gt;
&lt;li&gt;Cost: near-zero when idle&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pattern B — High-traffic shared servers: &lt;strong&gt;ECS Fargate behind ALB&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;One service per logical server&lt;/li&gt;
&lt;li&gt;Auto-scale on CPU/memory&lt;/li&gt;
&lt;li&gt;Pair with ElastiCache for stateful session continuity&lt;/li&gt;
&lt;li&gt;Cost: predictable, ~$30/mo for a small always-on service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mistake we made early on: treating every MCP server like it needed an always-on Fargate task. For servers that handle &amp;lt;10 agent calls/hour, Lambda is dramatically cheaper.&lt;/p&gt;
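&lt;p&gt;The arithmetic behind that call is worth writing down. A back-of-the-envelope sketch; the per-GB-second and per-request prices are illustrative assumptions, not current AWS list prices:&lt;/p&gt;

```python
def lambda_monthly_cost(calls_per_hour: float,
                        avg_duration_s: float = 0.5,
                        memory_gb: float = 0.5,
                        price_per_gb_s: float = 0.0000166667,   # assumed price
                        price_per_million_reqs: float = 0.20):  # assumed price
    """Rough monthly Lambda cost. Prices are illustrative assumptions."""
    calls = calls_per_hour * 24 * 30
    compute = calls * avg_duration_s * memory_gb * price_per_gb_s
    requests = calls / 1_000_000 * price_per_million_reqs
    return compute + requests


FARGATE_MONTHLY = 30.0  # the small always-on task estimated above

# At 10 calls/hour, Lambda costs a few cents a month; under these
# assumptions the crossover with an always-on task sits near 10K calls/hour.
```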

&lt;h2&gt;
  
  
  What to expose — and what not to
&lt;/h2&gt;

&lt;p&gt;The #1 mistake I see is devs exposing their entire internal API as MCP tools. &lt;strong&gt;Don't.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Good MCP servers are &lt;em&gt;curated&lt;/em&gt; for an agent's use case. Ask: what would a smart human operator need to do their job? Expose &lt;em&gt;those&lt;/em&gt; 5-15 tools. Not your 300-endpoint API.&lt;/p&gt;

&lt;p&gt;Good tool design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One clear job per tool.&lt;/strong&gt; &lt;code&gt;search_customers&lt;/code&gt; not &lt;code&gt;crm_unified_query&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Typed inputs and outputs.&lt;/strong&gt; Pydantic makes this cheap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Honest docstrings.&lt;/strong&gt; The agent reads them. Lie in the docstring and the agent will confidently call your tool wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotent where possible.&lt;/strong&gt; Agents retry. Accept that.&lt;/li&gt;
&lt;/ul&gt;
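&lt;p&gt;Those properties are mechanical enough to see in code. A sketch of how a typed, honestly documented function carries everything an agent needs, using plain &lt;code&gt;inspect&lt;/code&gt; rather than FastMCP internals (which work along broadly similar lines):&lt;/p&gt;

```python
import inspect
from typing import get_type_hints


def describe_tool(fn) -> dict:
    """Derive an agent-facing tool description from a function's metadata."""
    hints = get_type_hints(fn)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "inputs": {k: v.__name__ for k, v in hints.items() if k != "return"},
        "output": hints.get("return", type(None)).__name__,
    }


def search_customers(query: str, limit: int = 10) -> list:
    """Search customers by name or email. Returns at most `limit` matches."""
    return []
```

&lt;p&gt;If the docstring lies, the description lies: the agent sees exactly what you wrote.&lt;/p&gt;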

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Remote MCP servers + fine-grained OAuth scopes are unlocking internal-AI-assistant work that was impossible a year ago. If you're a Python backend engineer and you haven't shipped an MCP server yet, pick your highest-leverage internal system and wrap it. You'll be surprised how quickly it changes how your team works.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>python</category>
      <category>claude</category>
      <category>agentic</category>
    </item>
    <item>
      <title>FastAPI at 1M+ users: the patterns that actually matter</title>
      <dc:creator>Tufail Khan</dc:creator>
      <pubDate>Tue, 21 Apr 2026 11:53:15 +0000</pubDate>
      <link>https://dev.to/tufailkhan457/fastapi-at-1m-users-the-patterns-that-actually-matter-1o44</link>
      <guid>https://dev.to/tufailkhan457/fastapi-at-1m-users-the-patterns-that-actually-matter-1o44</guid>
      <description>&lt;p&gt;FastAPI is the default Python web framework in 2026 — 38% of Python teams ship on it, up from 29% a year ago. That means a lot of greenfield projects are making the same early mistakes.&lt;/p&gt;

&lt;p&gt;This post is what I wish I'd known before scaling &lt;strong&gt;Savyour&lt;/strong&gt; (Pakistan's first cashback platform, 1M+ users, 300+ merchant integrations) from 50 RPS to 3,000+ RPS on FastAPI.&lt;/p&gt;

&lt;p&gt;Everything below is drawn from production. No "hello world" demos.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Know your async boundaries
&lt;/h2&gt;

&lt;p&gt;FastAPI supports both &lt;code&gt;def&lt;/code&gt; and &lt;code&gt;async def&lt;/code&gt; endpoints. The framework is smart enough to run sync routes in a threadpool — but &lt;em&gt;your&lt;/em&gt; code may not be.&lt;/p&gt;

&lt;p&gt;The failure mode: an &lt;code&gt;async def&lt;/code&gt; endpoint that calls a blocking library (say, &lt;code&gt;requests&lt;/code&gt; instead of &lt;code&gt;httpx&lt;/code&gt;). The sync call holds the event loop, everything queues behind it, and your p99 latency goes vertical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; if the function is &lt;code&gt;async def&lt;/code&gt;, every IO operation inside it must be awaitable. Use &lt;code&gt;httpx.AsyncClient&lt;/code&gt;, &lt;code&gt;asyncpg&lt;/code&gt;, &lt;code&gt;aioboto3&lt;/code&gt;, &lt;code&gt;redis.asyncio&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When you must call a sync library, wrap it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.concurrency&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_in_threadpool&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_report&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# sync pandas code — don't block the loop
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;run_in_threadpool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expensive_sync_function&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
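&lt;p&gt;You can reproduce the failure mode without FastAPI at all. A self-contained asyncio demo: two tasks that should interleave, one of which blocks the loop with a sync sleep:&lt;/p&gt;

```python
import asyncio
import time

order = []


async def blocking_task():
    time.sleep(0.2)  # sync sleep: holds the event loop the whole time
    order.append("blocking done")


async def polite_task():
    await asyncio.sleep(0.05)  # async sleep: yields control to the loop
    order.append("polite done")


async def main():
    # polite_task sleeps for a quarter of the time, so it "should" finish
    # first. It can't: blocking_task never yields while time.sleep runs.
    await asyncio.gather(blocking_task(), polite_task())


asyncio.run(main())
```

&lt;p&gt;After the run, &lt;code&gt;order&lt;/code&gt; is &lt;code&gt;["blocking done", "polite done"]&lt;/code&gt;: the fast task finished last. Swap &lt;code&gt;time.sleep&lt;/code&gt; for &lt;code&gt;await asyncio.sleep&lt;/code&gt; and the order flips.&lt;/p&gt;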



&lt;h2&gt;
  
  
  2. Connection pools are not optional
&lt;/h2&gt;

&lt;p&gt;Naive async code opens a new database connection per request. At 500 RPS that's 500 connection handshakes per second — &lt;strong&gt;30,000 a minute&lt;/strong&gt; — hammering your Postgres instance, which forks a backend process for every one of them. Postgres caps out around 200-500 concurrent connections.&lt;/p&gt;

&lt;p&gt;Fix: use a single pool per worker, with tuned sizing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# database.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy.ext.asyncio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_async_engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AsyncSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy.orm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sessionmaker&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_async_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pool_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# steady-state per worker
&lt;/span&gt;    &lt;span class="n"&gt;max_overflow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# burst tolerance
&lt;/span&gt;    &lt;span class="n"&gt;pool_pre_ping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# detect dead connections
&lt;/span&gt;    &lt;span class="n"&gt;pool_recycle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# rotate every 30min
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;AsyncSessionLocal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sessionmaker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AsyncSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expire_on_commit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_db&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncSessionLocal&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For multi-worker deployments (Uvicorn &lt;code&gt;--workers 4&lt;/code&gt;), multiply by worker count. If your Postgres caps at 200 connections, 4 workers × 30 max = 120 is safe. Monitor &lt;code&gt;pg_stat_activity&lt;/code&gt; in prod.&lt;/p&gt;
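&lt;p&gt;That sizing rule is easy to encode so it can't drift as you add workers. A small helper; the 80% headroom factor is our convention (room for migrations, cron jobs, and a stray &lt;code&gt;psql&lt;/code&gt; session), not a Postgres requirement:&lt;/p&gt;

```python
def max_pool_per_worker(pg_max_connections: int, workers: int,
                        headroom: float = 0.8) -> int:
    """Largest (pool_size + max_overflow) per worker that keeps the whole
    fleet under pg_max_connections, with headroom for everything else."""
    return int(pg_max_connections * headroom) // workers


# 200-connection Postgres, 4 Uvicorn workers: 40 connections per worker max,
# so pool_size=20 + max_overflow=10 (30 total) fits comfortably.
```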

&lt;h2&gt;
  
  
  3. Push heavy work to background queues
&lt;/h2&gt;

&lt;p&gt;The endpoint that took Savyour down in month two: a synchronous product-sync that iterated through 50K affiliate offers per merchant. Five merchants syncing at once = 250K records processed in-request = cascading timeouts.&lt;/p&gt;

&lt;p&gt;The fix was simple but non-obvious to a team new to async: &lt;strong&gt;never do heavy work in the request cycle.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;arq&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_pool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;arq.connections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RedisSettings&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/sync/{merchant_id}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;trigger_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;merchant_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_arq_pool&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sync_merchant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;merchant_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queued&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ARQ, Celery, or Dramatiq — pick one. The worker fleet scales independently of the API fleet. Requests return in milliseconds. Monitoring stays sane.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Pydantic v2 is 5-50× faster — use it
&lt;/h2&gt;

&lt;p&gt;If you're still on Pydantic v1, migrate. The v2 rewrite in Rust dropped our request validation overhead from ~8ms to ~0.5ms per request. At 3,000 RPS that's a full CPU core back.&lt;/p&gt;

&lt;p&gt;Gotchas we hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Config&lt;/code&gt; → &lt;code&gt;model_config&lt;/code&gt; (nested dict)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.dict()&lt;/code&gt; → &lt;code&gt;.model_dump()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;validator&lt;/code&gt; → &lt;code&gt;field_validator&lt;/code&gt;, &lt;code&gt;root_validator&lt;/code&gt; → &lt;code&gt;model_validator&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use &lt;code&gt;bump-pydantic&lt;/code&gt; for the mechanical parts. The semantic changes (validator signatures) need human review.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Middleware for observability, not magic
&lt;/h2&gt;

&lt;p&gt;We run three observability middleware layers in production, plus CORS. In order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Request ID — every log line traces back
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RequestIDMiddleware&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Timing — p50/p95/p99 per route
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TimingMiddleware&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Structured logging — JSON out to CloudWatch
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LoggingMiddleware&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# CORS goes OUTERMOST so OPTIONS requests skip everything
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CORSMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allow_origins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FRONTEND_ORIGINS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Avoid:&lt;/strong&gt; auto-magic middleware that wraps your handlers with decorators you can't inspect. When things break at 3 AM, you need to grep the source and understand what's happening. Explicit &amp;gt; clever.&lt;/p&gt;
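&lt;p&gt;In that spirit, the &lt;code&gt;RequestIDMiddleware&lt;/code&gt; above doesn't need a framework at all. A greppable pure-ASGI sketch (our naming; the class body is a sketch, not a library import):&lt;/p&gt;

```python
import uuid


class RequestIDMiddleware:
    """Pure-ASGI middleware: attach a request ID and echo it in a header."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            return await self.app(scope, receive, send)

        request_id = str(uuid.uuid4())
        # downstream handlers and log lines can read it from scope["state"]
        scope.setdefault("state", {})["request_id"] = request_id

        async def send_with_header(message):
            if message["type"] == "http.response.start":
                headers = message.setdefault("headers", [])
                headers.append((b"x-request-id", request_id.encode()))
            await send(message)

        await self.app(scope, receive, send_with_header)
```

&lt;p&gt;When it breaks at 3 AM, all of it fits on one screen.&lt;/p&gt;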

&lt;h2&gt;
  
  
  6. Health checks, liveness, readiness
&lt;/h2&gt;

&lt;p&gt;Three distinct endpoints. Don't collapse them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/healthz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# is the process up?
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/readyz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# can we serve traffic?
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ready&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_db&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_redis&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/livez&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# should kubelet restart us?
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;live&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubernetes (or ECS, or Fargate) uses these to make restart decisions. A failing dependency should make &lt;code&gt;readyz&lt;/code&gt; fail so the LB stops sending traffic — but shouldn't make &lt;code&gt;livez&lt;/code&gt; fail and trigger a restart loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. One project structure to rule them all
&lt;/h2&gt;

&lt;p&gt;After shipping a dozen FastAPI services, this is the structure I reach for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;app/
├── main.py            # FastAPI app, middleware, lifespan
├── config.py          # pydantic-settings, env-driven
├── db.py              # engine + session factory
├── dependencies.py    # shared Depends() providers
├── routers/
│   ├── customers.py
│   ├── orders.py
│   └── webhooks.py
├── schemas/           # pydantic request/response models
├── models/            # SQLAlchemy ORM
├── services/          # business logic, pure-ish
├── workers/           # ARQ/Celery task definitions
└── tests/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key discipline: &lt;strong&gt;routers call services, services call models, models don't reach back up.&lt;/strong&gt; Break that rule and tests get painful fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd skip
&lt;/h2&gt;

&lt;p&gt;Things I used to reach for that I don't anymore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Starlette middleware for auth.&lt;/strong&gt; Use FastAPI &lt;code&gt;Depends()&lt;/code&gt; for auth — it composes cleanly with route permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom exception handlers for every error.&lt;/strong&gt; One global handler that maps exceptions → HTTP codes is enough for 95% of services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-engineered response models for internal APIs.&lt;/strong&gt; &lt;code&gt;dict&lt;/code&gt; returns are fine for handlers only your own code calls.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The meta-point
&lt;/h2&gt;

&lt;p&gt;FastAPI's documentation is aggressively good — better than most frameworks' books. Read it twice before inventing patterns. Most of the hard-won lessons above are implicit in the docs; I just didn't slow down enough to absorb them the first time.&lt;/p&gt;

</description>
      <category>fastapi</category>
      <category>python</category>
      <category>scaling</category>
      <category>async</category>
    </item>
    <item>
      <title>Cutting our Claude API bill by 78% with prompt caching</title>
      <dc:creator>Tufail Khan</dc:creator>
      <pubDate>Tue, 21 Apr 2026 11:41:20 +0000</pubDate>
      <link>https://dev.to/tufailkhan457/cutting-our-claude-api-bill-by-78-with-prompt-caching-1fon</link>
      <guid>https://dev.to/tufailkhan457/cutting-our-claude-api-bill-by-78-with-prompt-caching-1fon</guid>
      <description>&lt;p&gt;In January 2026 our monthly Claude bill crossed &lt;strong&gt;$4,200&lt;/strong&gt;, up from $600 six months earlier. We were serving a RAG-backed customer-support assistant that retrieved ~12K tokens of context per query, ran through an 800-token system prompt, and called Claude an average of 4.2 times per user session.&lt;/p&gt;

&lt;p&gt;Rolling out Anthropic's &lt;strong&gt;prompt caching&lt;/strong&gt; dropped that to &lt;strong&gt;$920/month&lt;/strong&gt; — a 78% reduction — without touching any user-facing behavior.&lt;/p&gt;

&lt;p&gt;This post is the exact playbook.&lt;/p&gt;

&lt;h2&gt;
  
  
  What prompt caching does
&lt;/h2&gt;

&lt;p&gt;Claude's prompt caching stores &lt;em&gt;prefix portions&lt;/em&gt; of your prompt in Anthropic's infrastructure. When a subsequent request reuses that same prefix, the cached portion costs &lt;strong&gt;10% of the normal input-token price&lt;/strong&gt; and is processed much faster.&lt;/p&gt;

&lt;p&gt;The pricing in 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cache write:&lt;/strong&gt; 1.25× input cost (on first use)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache read (hit):&lt;/strong&gt; 0.1× input cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTL:&lt;/strong&gt; 5 minutes by default, 1 hour available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Break-even is a single hit: one cache write plus one read costs 1.25× + 0.1× = 1.35× the base input price, versus 2× for two uncached requests. In practice, a well-placed cache break point hits &lt;strong&gt;dozens to hundreds of times&lt;/strong&gt; before it expires.&lt;/p&gt;
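&lt;p&gt;The savings compound quickly. A quick model of the prefix-token cost using the multipliers above (prefix tokens only; the per-request uncached input costs the same either way):&lt;/p&gt;

```python
def cached_cost_ratio(hits: int) -> float:
    """Cost of one cache write plus N hits, relative to N + 1 uncached
    requests, counting only the cached prefix tokens (write 1.25x, hit 0.1x)."""
    cached = 1.25 + 0.1 * hits
    uncached = 1.0 * (hits + 1)
    return cached / uncached


# 1 hit:    1.35x vs 2x uncached, already ahead
# 100 hits: the prefix costs roughly 11% of the uncached price
```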

&lt;h2&gt;
  
  
  Where to cache — high, medium, low ROI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;High ROI&lt;/strong&gt; (always cache):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System prompts (usually stable across all requests)&lt;/li&gt;
&lt;li&gt;Long tool-schema definitions&lt;/li&gt;
&lt;li&gt;Retrieved context chunks reused within a session (RAG)&lt;/li&gt;
&lt;li&gt;Few-shot example banks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Medium ROI&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User conversation history early in a session (caches grow as the conversation progresses)&lt;/li&gt;
&lt;li&gt;Document chunks that appear frequently across queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Low / anti-ROI&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-request user input&lt;/li&gt;
&lt;li&gt;Anything that changes every call&lt;/li&gt;
&lt;li&gt;Caches smaller than 1024 tokens (minimum cache block size for Claude Opus/Sonnet)&lt;/li&gt;
&lt;/ul&gt;
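&lt;p&gt;A cheap guardrail before you sprinkle &lt;code&gt;cache_control&lt;/code&gt; markers: estimate whether a block even clears the minimum. The 4-characters-per-token ratio is a rough English-prose heuristic, not the tokenizer; check the API's reported usage for real numbers:&lt;/p&gt;

```python
def rough_token_count(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English prose."""
    return len(text) // 4


def worth_caching(text: str, min_tokens: int = 1024) -> bool:
    """Blocks under the minimum cacheable length won't create a cache entry."""
    return rough_token_count(text) >= min_tokens
```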

&lt;h2&gt;
  
  
  The anatomy of a cached prompt
&lt;/h2&gt;

&lt;p&gt;In the Python SDK, you add &lt;code&gt;cache_control&lt;/code&gt; markers to the content blocks you want cached. Everything up to and including the marked block gets cached as a prefix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LONG_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# stable, reusable
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{...},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retrieved_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# session-scoped RAG chunks
&lt;/span&gt;                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="c1"&gt;# no cache marker — this changes every request
&lt;/span&gt;                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# inspect cache metrics
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache_creation_input_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache_read_input_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: the API allows &lt;strong&gt;up to 4 cache breakpoints&lt;/strong&gt; per request. We use all 4:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;System prompt (changes ~monthly)&lt;/li&gt;
&lt;li&gt;Tool schemas (changes ~monthly)&lt;/li&gt;
&lt;li&gt;Retrieved RAG context (changes per session)&lt;/li&gt;
&lt;li&gt;Conversation history (grows within session)&lt;/li&gt;
&lt;/ol&gt;
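&lt;p&gt;The fourth breakpoint moves every turn: mark the last content block of the newest message so the whole transcript up to that point becomes next turn's cached prefix. A minimal sketch of that bookkeeping (the message shapes are illustrative; adapt to however you store history):&lt;/p&gt;

```python
def mark_history_breakpoint(messages):
    """Move the cache breakpoint to the final content block of the last message,
    so the whole conversation so far is reused as a cached prefix next turn."""
    # Assumes block-style content (lists of dicts); adapt if you store strings.
    for m in messages:
        for block in m["content"]:
            block.pop("cache_control", None)  # only the newest marker survives
    messages[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return messages

history = [
    {"role": "user", "content": [{"type": "text", "text": "First question"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "First answer"}]},
    {"role": "user", "content": [{"type": "text", "text": "Follow-up"}]},
]
mark_history_breakpoint(history)
print(history[-1]["content"][-1]["cache_control"])  # {'type': 'ephemeral'}
```

&lt;p&gt;Clearing the old marker matters: with a budget of 4 breakpoints per request, a stale marker on an earlier turn wastes one.&lt;/p&gt;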

&lt;h2&gt;Metrics from real traffic&lt;/h2&gt;

&lt;p&gt;Before caching, on a representative 1,000-request sample:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input tokens billed: 14.2M (≈ $42.60 at Opus 4.7 pricing)&lt;/li&gt;
&lt;li&gt;Output tokens billed: 380K (≈ $28.50)&lt;/li&gt;
&lt;li&gt;Total: &lt;strong&gt;$71.10&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After caching, same workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache write input tokens: 1.8M ($6.75)&lt;/li&gt;
&lt;li&gt;Cache read input tokens: 12.1M ($3.63)&lt;/li&gt;
&lt;li&gt;Uncached input tokens: 300K ($0.90)&lt;/li&gt;
&lt;li&gt;Output tokens: 380K ($28.50)&lt;/li&gt;
&lt;li&gt;Total: &lt;strong&gt;$39.78&lt;/strong&gt; (−44%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output tokens dominate what's left. Short of switching models, the input side is essentially solved.&lt;/p&gt;
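&lt;p&gt;Those totals are easy to sanity-check. A quick sketch, assuming the per-MTok rates the figures imply ($3 input, $75 output, 1.25× cache writes, 0.1× cache reads; the post's numbers are consistent with these):&lt;/p&gt;

```python
# Re-deriving the before/after totals from the token counts above.
# Assumed rates (dollars per million tokens): not stated explicitly in the post.
RATE_IN, RATE_OUT = 3.00, 75.00
RATE_WRITE, RATE_READ = RATE_IN * 1.25, RATE_IN * 0.10  # 5-minute-tier multipliers

before = 14.2 * RATE_IN + 0.38 * RATE_OUT
after = 1.8 * RATE_WRITE + 12.1 * RATE_READ + 0.3 * RATE_IN + 0.38 * RATE_OUT

print(f"before=${before:.2f} after=${after:.2f} savings={1 - after / before:.0%}")
```

&lt;p&gt;That reproduces the $71.10 and $39.78 totals, so the figures above are internally consistent.&lt;/p&gt;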

&lt;h2&gt;Watch out for: cache invalidation footguns&lt;/h2&gt;

&lt;p&gt;Cache hits match on &lt;strong&gt;exact byte-level prefix equality&lt;/strong&gt;. Any variance busts the cache. Things that silently broke ours early on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Whitespace drift&lt;/strong&gt; in system-prompt templating (a stray &lt;code&gt;\n&lt;/code&gt; from a template engine)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dict-ordering&lt;/strong&gt; when serializing tool schemas from a Python dict — always use &lt;code&gt;json.dumps(..., sort_keys=True)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp injection&lt;/strong&gt; into system prompts (&lt;code&gt;"Today is {date}..."&lt;/code&gt; rebuilds the cache every day — move it to user content)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User-scoped data in system prompt&lt;/strong&gt; — blows cache per user; move it down the prompt&lt;/li&gt;
&lt;/ul&gt;
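&lt;p&gt;The dict-ordering one is worth seeing concretely: two logically identical schemas serialize to different bytes unless you canonicalize them (the schema contents here are illustrative):&lt;/p&gt;

```python
import json

# Two logically identical tool schemas, built in different insertion orders.
schema_a = {"name": "search_docs", "description": "Search product docs"}
schema_b = {"description": "Search product docs", "name": "search_docs"}

# json.dumps preserves insertion order by default, so the bytes differ
# and the byte-level prefix match fails even though nothing "changed".
assert json.dumps(schema_a) != json.dumps(schema_b)

# sort_keys=True canonicalizes the serialization, keeping the prefix stable.
assert json.dumps(schema_a, sort_keys=True) == json.dumps(schema_b, sort_keys=True)
```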

&lt;p&gt;Instrument &lt;code&gt;cache_creation_input_tokens&lt;/code&gt; vs &lt;code&gt;cache_read_input_tokens&lt;/code&gt; on every response and alert if the ratio drifts. A week of silent cache misses can cost you thousands.&lt;/p&gt;
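&lt;p&gt;A minimal version of that instrumentation, assuming nothing beyond the two usage fields (the threshold is ours to tune, not a recommendation from Anthropic):&lt;/p&gt;

```python
from types import SimpleNamespace

def cache_hit_ratio(usage):
    """Fraction of cacheable input tokens served from cache (0.0 if none)."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    total = read + written
    return read / total if total else 0.0

ALERT_THRESHOLD = 0.80  # assumption: tune to your own steady-state traffic

def check_cache_health(usage, notify):
    # A sustained drop below steady-state almost always means the prefix
    # bytes drifted (whitespace, ordering, timestamps -- see list above).
    if ALERT_THRESHOLD > cache_hit_ratio(usage):
        notify("prompt-cache hit ratio dropped: check for prefix drift")

# Stand-in for response.usage, using the sample numbers from this post:
usage = SimpleNamespace(cache_read_input_tokens=12_100_000,
                        cache_creation_input_tokens=1_800_000)
print(round(cache_hit_ratio(usage), 2))  # 0.87
```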

&lt;h2&gt;The 1-hour cache tier&lt;/h2&gt;

&lt;p&gt;Anthropic added a &lt;strong&gt;1-hour TTL&lt;/strong&gt; option in mid-2025. It costs 2× the write price but lives 12× longer. For workloads with predictable hot paths — e.g. a support assistant where 80% of sessions hit the same product docs — the 1-hour tier amortizes beautifully.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ttl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use it where cache hit rate is high. Don't use it for small cache blocks or unpredictable traffic — you'll pay the write premium without the hit volume.&lt;/p&gt;
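&lt;p&gt;Whether the write premium pays off is a simple expected-cost question. A back-of-envelope sketch (the traffic shape and re-write counts are illustrative assumptions):&lt;/p&gt;

```python
# Cost of keeping one 10K-token prefix cached across 100 requests in an hour,
# assuming $3/MTok base input, 1.25x / 2x write multipliers, 0.1x reads.
BASE = 3.00 / 1_000_000  # dollars per input token
BLOCK = 10_000           # tokens in the cached prefix

def tier_cost(write_mult, writes, reads):
    return BLOCK * BASE * (write_mult * writes + 0.1 * reads)

# With gappy traffic, the 5-minute tier re-writes after every TTL expiry
# (say 6 of the 100 requests are writes); the 1-hour tier writes once.
five_min = tier_cost(1.25, writes=6, reads=94)
one_hour = tier_cost(2.00, writes=1, reads=99)
print(f"5m=${five_min:.3f} 1h=${one_hour:.3f}")  # 5m=$0.507 1h=$0.357
```

&lt;p&gt;Flip the assumptions (small block, one write, few reads) and the 5-minute tier wins, which is exactly the unpredictable-traffic caveat above.&lt;/p&gt;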

&lt;h2&gt;The takeaway&lt;/h2&gt;

&lt;p&gt;Prompt caching is the highest-ROI single change I've made to a production Claude app in the last year. If you're running a RAG, agent, or long-context workload on Claude and &lt;em&gt;not&lt;/em&gt; using prompt caching, you're almost certainly leaving 40-80% of your input spend on the table.&lt;/p&gt;

&lt;p&gt;The cost to implement: two afternoons, including the instrumentation. The cost to ignore: compounding every month you don't do it.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>anthropic</category>
      <category>costoptimization</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why we replaced LangChain with the raw Anthropic SDK in production</title>
      <dc:creator>Tufail Khan</dc:creator>
      <pubDate>Tue, 21 Apr 2026 11:40:59 +0000</pubDate>
      <link>https://dev.to/tufailkhan457/why-we-replaced-langchain-with-the-raw-anthropic-sdk-in-production-3611</link>
      <guid>https://dev.to/tufailkhan457/why-we-replaced-langchain-with-the-raw-anthropic-sdk-in-production-3611</guid>
      <description>&lt;p&gt;LangChain was the right answer in 2023. It abstracted away a messy ecosystem of half-baked provider APIs, gave you a unified &lt;code&gt;LLM&lt;/code&gt; interface, and let you stitch agents together with a few dozen lines of Python. We used it everywhere — including in production on Vettio, our AI recruitment platform.&lt;/p&gt;

&lt;p&gt;In April 2026, we ripped it out.&lt;/p&gt;

&lt;p&gt;This post is about &lt;strong&gt;why&lt;/strong&gt; we made that call, &lt;strong&gt;what replaced it&lt;/strong&gt;, and &lt;strong&gt;the metrics&lt;/strong&gt; that justified the migration.&lt;/p&gt;

&lt;h2&gt;The symptoms&lt;/h2&gt;

&lt;p&gt;LangChain's abstractions started leaking the moment we went beyond happy-path demos. Three things kept biting us:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stack traces from hell.&lt;/strong&gt; A single &lt;code&gt;AgentExecutor.invoke()&lt;/code&gt; call crossed 14 frames of LangChain internals before reaching &lt;em&gt;our&lt;/em&gt; code. Debugging a malformed tool call felt like archaeology.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version churn.&lt;/strong&gt; Every minor bump renamed, relocated, or deprecated something we depended on. Our CI was pinned to a specific LangChain SHA for six months just to stay green.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abstracted-away observability.&lt;/strong&gt; We couldn't cleanly trace token usage, cache hits, or per-tool latencies without monkey-patching internal classes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Meanwhile, Anthropic's native SDK was getting &lt;em&gt;better&lt;/em&gt;. Native tool calling, prompt caching, extended thinking, streaming — all first-class and documented.&lt;/p&gt;

&lt;h2&gt;The refactor&lt;/h2&gt;

&lt;p&gt;The logic we were using LangChain for wasn't complicated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build a system prompt from templates&lt;/li&gt;
&lt;li&gt;Call Claude with a list of tools&lt;/li&gt;
&lt;li&gt;Route tool calls to our internal handlers&lt;/li&gt;
&lt;li&gt;Return the result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We replaced ~800 lines of LangChain glue with this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tool_handlers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_turn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

        &lt;span class="c1"&gt;# Handle tool use
&lt;/span&gt;        &lt;span class="n"&gt;tool_calls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_handlers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No &lt;code&gt;AgentExecutor&lt;/code&gt;, no &lt;code&gt;Callback&lt;/code&gt;, no &lt;code&gt;ConversationBufferMemory&lt;/code&gt;. Just the model and our code.&lt;/p&gt;

&lt;h2&gt;The metrics&lt;/h2&gt;

&lt;p&gt;We ran the old and new paths side-by-side for two weeks on Vettio's interview-bot service. Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;p50 latency:&lt;/strong&gt; 2.1s → 1.4s (−33%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;p95 latency:&lt;/strong&gt; 4.8s → 3.2s (−33%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rate:&lt;/strong&gt; 0.9% → 0.2%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stack trace depth on errors:&lt;/strong&gt; 14 → 4 frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lines of integration code:&lt;/strong&gt; 812 → 187&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latency win came mostly from eliminating LangChain's implicit retry behavior on tool-use mismatches. With direct SDK calls, a malformed tool schema fails loudly instead of being silently retried three times.&lt;/p&gt;
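&lt;p&gt;Removing implicit retries doesn't mean removing retries; it means owning the policy. A sketch of the kind of explicit wrapper we mean (the names and policy here are ours, not SDK API):&lt;/p&gt;

```python
import logging
import time

log = logging.getLogger("agent")

def call_with_retries(fn, max_attempts=3, backoff_s=1.0):
    """One visible retry policy instead of retries buried in framework internals."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # in real code, narrow this to the SDK's error types
            log.warning("attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise  # fail loudly instead of silently swallowing the error
            time.sleep(backoff_s * attempt)
```

&lt;p&gt;Every retry is logged, counted, and capped; nothing retries without showing up in our traces.&lt;/p&gt;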

&lt;h2&gt;When LangChain still makes sense&lt;/h2&gt;

&lt;p&gt;This isn't a blanket "don't use LangChain" post. It still wins if you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider abstraction.&lt;/strong&gt; Swapping between Claude, GPT-4, and Gemini behind a stable interface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph workflows&lt;/strong&gt; for graph-based agent topologies you'd otherwise build from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith observability&lt;/strong&gt; you don't want to rebuild.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a team that's already committed to one provider (we're all-in on Claude) and wants full control over prompts, tool schemas, and observability — the native SDK is the right tool in 2026.&lt;/p&gt;

&lt;h2&gt;The lesson&lt;/h2&gt;

&lt;p&gt;Abstractions pay for themselves when the underlying APIs are bad. Anthropic's API isn't bad. It's clean, well-documented, and stable. The abstraction tax was real; the abstraction benefit had quietly evaporated.&lt;/p&gt;

&lt;p&gt;If you're still on LangChain in a production Claude app, benchmark a direct-SDK rewrite of your hot path. You might be surprised.&lt;/p&gt;
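&lt;p&gt;That benchmark can be as small as a timing harness around the two paths. A sketch (the two commented-out calls at the bottom are placeholders for your own call sites):&lt;/p&gt;

```python
import statistics
import time

def percentiles(fn, runs=20):
    """Time repeated calls and report p50/p95 latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    cuts = statistics.quantiles(samples, n=20)  # cut points at 5% steps
    return {"p50": statistics.median(samples), "p95": cuts[18]}

# Compare the old and new paths on identical inputs, e.g.:
# old = percentiles(lambda: agent_executor.invoke(payload))
# new = percentiles(lambda: run_agent(payload["input"], tools, handlers))
```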

</description>
      <category>langchain</category>
      <category>claude</category>
      <category>anthropic</category>
      <category>python</category>
    </item>
  </channel>
</rss>
