DEV Community: Tufail Khan

Harness engineering: a self-evolving feature loop in 312 lines of bash

Tufail Khan — Thu, 30 Apr 2026 17:24:45 +0000

Repo: github.com/tufailkhan45/harness-loop — one bash script, drop into any spec-driven repo.
Originally published on: tufail.dev/blog/harness-engineering-self-evolving-loop

Most posts about Claude Code talk about prompts. This one is about the harness — the wrapper around the model that turns a single claude -p invocation into a system that can ship a backlog of features over hours, survive its own failures, and learn as it goes.

I built harness-loop after watching too many headless Claude runs silently spin on the same broken approach for thirty minutes. This post walks through what a harness actually does, why the design comes down to three load-bearing parts, and what I learned writing one in bash.

What is harness engineering?

The model produces tokens. The spec describes the goal. The harness is everything in between: when to invoke the model, what context to feed it, when to stop, when to halt the whole run, and what to trust as a "done" signal.

If the model is the engine and the spec is the destination, the harness is the chassis, fuel system, and dashboard warning lights. Most AI workflows fail not because the model is wrong but because the harness is missing — the model gets called once, returns something that looks plausible, and the human is left to figure out whether the work actually shipped.

A good harness answers four questions on every iteration:

What is the next unit of work?
What context does the model need that it didn't have last time?
Did anything just happen that requires a human?
Is this feature actually done, or does the model just think it is?

The whole loop is built around answering those four questions, mechanically, in a way that survives crashes, quota windows, and the model's own occasional confidence in incorrect things.

What the loop does

The runner is one bash file (scripts/run-features.sh, 312 lines). Every iteration:

Picks the next feature without a .done marker
Builds a prompt from the spec, the feature's prior attempt log, and a global learnings file
Invokes claude -p under timeout
Inspects the resulting log for halt signals (BLOCKED:, no growth, quota errors)
Loops

It exits 0 when every feature has a marker, or with a halt code (3-6) when something demands a human.

specs/auth-login/spec.md   ──┐
logs/auth-login.log        ──┼──> prompt ──> claude -p ──> append log + maybe .done
logs/learnings.md          ──┘

That is the whole architecture. No queue, no database, no orchestrator. The filesystem is the state machine, and .done markers are the source of truth.

Self-evolution: three parts that all have to work

"Self-evolving" sounds hand-wavy until you stare at what it actually requires. There are exactly three mechanisms, and breaking any one breaks the loop:

1. Read. Every iteration tails the last 200 lines of two files into the prompt — the feature's own prior log (so the model does not repeat what already failed), and a cross-feature learnings file (so feature D benefits from a discovery made in feature A). Recency is the model. Switching to head or middle slices wouldn't work as well, because the most recent attempt holds the most relevant signal.

2. Write. The prompt explicitly asks the model to do two things at the end of every iteration: append a progress note to the feature log, and append a one-line lesson to learnings.md — but only if the lesson is broadly applicable. The wording is deliberately load-bearing. Soften it ("you may want to add a note...") and the loop's memory degrades within a handful of iterations.

3. Floor. A circuit breaker. If a feature's log does not grow by more than 32 bytes for STUCK_LIMIT iterations in a row, the runner halts that feature with exit code 5. The runner cannot audit what the model writes, only whether it writes anything. Without this floor, a model that has hallucinated its feedback channel will spin forever and burn quota.

The asymmetry matters. Read and Write are model behaviour — both can fail subtly. The Floor is a hard mechanical guardrail that catches the failure mode the model itself cannot self-detect.

The prompt is the API

Most of the prompt is a fixed heredoc, but two blocks are dynamic:

<<<PRIOR_LOG
[last 200 lines of logs/feature-runner/<slug>.log]
PRIOR_LOG

<<<LEARNINGS
[last 200 lines of logs/feature-runner/learnings.md]
LEARNINGS

Followed by a six-step task list that constrains the iteration to one meaningful step — not "finish the feature," not "make as much progress as you can," but pick the next unfinished piece, do it, verify it, log it. The "one step at a time" framing prevents the model from spending a 30-minute timeout on a megacommit it then cannot verify.

Step 6 is the contract: write <slug>.done only if the spec is satisfied AND verification is green. The runner trusts this signal. Weaken the prompt ("write .done when you think you're close enough") and the whole loop loses its meaning — features get marked done that aren't done.

Four halt codes for four failure modes

Halt categories matter because each one needs a different human response:

Code	Meaning	What you do
3	`HALT` file present	Someone paused it; resume with `rm HALT`
4	`BLOCKED:` in feature log	Model hit something it can't fix; read the log
5	Circuit breaker tripped	Silent spin; feature spec probably ambiguous
6	Quota / auth / rate limit	External issue; wait or rotate keys

Code 5 is the most interesting. It catches the failure where the model is technically running but producing nothing. Without it, you can lose hours of quota on a feature that has gone silent.

Why bash

I considered Python. Bash won for three reasons:

Zero install friction. Copy one script and a settings file into any repo. No venv, no pip install, no version juggling.
Resumability is trivial. State is files on disk. Kill the process, restart it, it picks up exactly where it left off. .done markers are the source of truth.
Coreutils already does the work. timeout for per-call kills, tail -n 200 for windowed context, stat -c %s for the size-delta circuit breaker, df -Pm for the disk warning. None of this needs a programming language.

set -uo pipefail is on; set -e is intentionally off. The runner must survive a non-zero exit from claude — a failed iteration is data, not a fatal error. With -e, the loop dies on the first model error and you lose the entire run.

What it isn't

Not a planner. Specs are the input, not the output. Decomposition happens inside each iteration, by the model.
Not a verifier. Verification is delegated to the model — pytest, npm test, curl, claude-in-chrome MCP for UI smoke tests, whatever fits the feature.
Not language-specific. It runs against any repo with a specs/<slug>/spec.md layout. Python, TypeScript, Rust, Go — the runner doesn't care. The model reads the spec and any project-level CLAUDE.md and picks the right tools.

What I'd do differently

Three things I would change if I rebuilt it:

Make the size-delta threshold configurable per feature. 32 bytes works on average but some features have legitimately quiet iterations.
Add a PARALLEL=N flag. Right now it is strictly serial. For independent features, parallelism would 3-4x throughput.
Stream the run log to stderr unconditionally. I added tee later when I realised I couldn't see what was happening without tailing two files at once.

The deeper lesson from this build: self-evolving systems don't need to be smart, they need to be honest about their own failure modes. The harness loop has no learning algorithm, no graph, no agent framework. It has three text files and a circuit breaker. That turns out to be enough to ship features overnight without a human in the chair — provided the spec is clear and the model is given a way to remember.

Try it: github.com/tufailkhan45/harness-loop. The README has install steps and a dry-run mode that prints the resolved queue and sample prompt without spending tokens.

Spec-driven development with Claude Code: shipping features in an hour

Tufail Khan — Fri, 24 Apr 2026 07:58:02 +0000

The developers I know who are shipping the most in 2026 aren't the ones with the fastest typing speed. They're the ones who've rewired their workflow around spec-driven development with tools like Claude Code.

I've been using this pattern for nine months on everything from Savyour to Vettio. My output has roughly doubled. My bug count is down. My reviews are shorter.

Here's what the workflow actually looks like — not the marketing version, the messy version.

The shift: from chat to spec

The first generation of AI coding assistants (2023-early 2024) were chat-based: you'd have a long conversation with the model, paste code back and forth, and iterate. It was faster than solo, but the context was ephemeral, the quality was uneven, and it didn't play nicely with git.

Claude Code, Cursor's agent mode, and similar tools inverted this. The new loop:

Write a spec — a markdown document describing what to build.
Hand the spec to the agent — it reads it, explores the repo, writes the code.
Review the diff — like reviewing a junior engineer's PR.
Iterate via spec amendments, not chat.

The spec becomes the source of truth. The agent is the implementer. You stay in the architect / reviewer role.

What a good spec looks like

Specs that produce clean PRs share a few traits:

1. Intent and constraint, not instructions

Bad spec:

Open app/routes/users.ts, add a new function called getUserByEmail, call the prisma client...

Good spec:

Add an endpoint GET /users/by-email?email=... that returns the user profile. Must hit the existing Prisma-backed users table. Must respect the existing auth middleware on the /users router. 404 when not found. Covered by a unit test in the same style as the existing /users/:id test.

The good version tells the agent what to build and what rules apply, not how to build it. The agent figures out the how from reading the codebase.

2. Acceptance criteria

End every spec with a bulleted list of what "done" means:

## Acceptance criteria

- [ ] New route passes all existing auth middleware
- [ ] Returns 200 + user JSON when the email matches
- [ ] Returns 404 with a `{"error": "not_found"}` body otherwise
- [ ] Email lookup is case-insensitive
- [ ] Test added alongside `users.spec.ts`
- [ ] No changes to the DB schema

The agent uses these to self-check. You use them to review.

3. Out-of-scope callouts

This is the one most devs skip, and it's the difference between a focused PR and a sprawl:

## Out of scope

- Do NOT refactor the existing `/users/:id` route
- Do NOT add rate limiting (we'll do that in a follow-up)
- Do NOT touch the signup flow

Agents, like junior engineers, will happily "improve" adjacent code unless told not to. Make the boundary explicit.

The iteration loop

Real workflow, from spec to merged PR, on a typical 200-line feature:

10 min: Write the spec (specs/2026-03-02-user-by-email.md)
30 sec: claude "implement the spec at specs/2026-03-02-user-by-email.md"
3-8 min: Claude reads the codebase, writes the code, runs the tests
5-10 min: I review the diff. I ask for a change. The agent makes it.
2 min: CI runs. Green.
Merge.

Total: ~30 minutes of my time for work that used to take 2 hours. Most of the savings aren't typing — they're not-context-switching because the agent does the file-hunting.

What the agent is bad at — and how to compensate

Three failure modes I've seen repeatedly:

Over-abstracting

Agents love to introduce helper classes, utility modules, and "future-proofing" abstractions you didn't ask for. Explicit "keep it simple, match the surrounding code style" in the spec mitigates this 80% of the way.

Silent test deletion

Sometimes an agent will disable a failing test rather than fix the underlying bug. I've caught this half a dozen times. Mitigation: always grep the diff for .skip, xit(, @pytest.mark.skip before approving.

Confident wrong answers on versioning

If your codebase uses an unusual library version, agents will default to the current version's API. Mitigation: pin the spec to "read package.json first and match versions" or include a short "stack notes" section.

The CI piece: trust but verify

I treat AI-written code with slightly more suspicion than my own. My CI for agent-produced PRs:

Standard test suite
grep -n 'skip\|FIXME\|TODO' diff check
Secret scanner (agents occasionally echo-back test credentials)
Bundle-size budget check
Type-coverage threshold

If any of those fail, the PR goes back for revision via a spec amendment, not a code fix on my side.

Where spec-driven development fails

Not every task is a fit:

Highly exploratory work ("figure out why this is slow") is still better with an interactive shell session, not a spec
Very small changes (a one-line fix) have too much spec overhead
Deep refactors spanning >10 files often do better broken into multiple specs handed off sequentially

For the 200-line-feature sweet spot — the majority of backend and glue work — spec-driven is my default.

The meta-skill

The thing that's changed most about my job in 2026 isn't the model. It's that writing precise English has become my single most leveraged engineering skill. A good spec is:

Unambiguous about intent
Explicit about constraints
Clear about what "done" looks like
Honest about what's out of scope

Which, now that I think about it, is also what a good pre-2023 design doc looked like. Maybe we've come full circle.

Building MCP Servers in Python: a production primer for 2026

Tufail Khan — Thu, 23 Apr 2026 05:49:09 +0000

The Model Context Protocol (MCP) went from "Anthropic side project" to industry standard in eighteen months. As of March 2026, MCP SDKs are pulling 97 million monthly downloads. Every serious agent framework — Claude, Cursor, OpenAI Agents SDK, Microsoft Agent Framework — speaks MCP natively.

If you're a Python backend engineer, MCP is the most leveraged thing you can learn right now. This post is a practical walkthrough of shipping a production-grade MCP server using FastMCP, the Python framework that makes it boring.

What MCP actually is

MCP is a protocol for exposing tools, resources, and prompts to an AI agent in a standardized way. Instead of each agent framework inventing its own adapter format, you write your server once and it plugs into any MCP-compatible client.

Think of it as "USB-C for agents."

A minimal server exposes:

Tools — functions the agent can call (e.g. search_customers, get_order_status)
Resources — URIs the agent can read (e.g. crm://contacts/123)
Prompts — parameterized prompt templates

Starter: a FastMCP server in 40 lines

# server.py
from fastmcp import FastMCP
from pydantic import BaseModel
import httpx

mcp = FastMCP("internal-crm")

class Customer(BaseModel):
    id: str
    name: str
    tier: str
    mrr: float

@mcp.tool()
async def search_customers(query: str, tier: str | None = None) -> list[Customer]:
    """Search the CRM for customers by name or email. Optionally filter by tier."""
    async with httpx.AsyncClient() as client:
        r = await client.get(
            "https://crm.internal/api/search",
            params={"q": query, "tier": tier},
        )
        return [Customer(**row) for row in r.json()]

@mcp.tool()
async def get_customer_notes(customer_id: str) -> str:
    """Fetch the latest account-manager notes for a customer."""
    async with httpx.AsyncClient() as client:
        r = await client.get(f"https://crm.internal/api/notes/{customer_id}")
        return r.text

@mcp.resource("crm://customer/{customer_id}")
async def customer_resource(customer_id: str) -> str:
    """Read-only customer profile."""
    async with httpx.AsyncClient() as client:
        r = await client.get(f"https://crm.internal/api/customer/{customer_id}")
        return r.text

if __name__ == "__main__":
    mcp.run(transport="streamable-http", host="0.0.0.0", port=8000)

That's a complete, production-adjacent MCP server. Type-safe inputs and outputs via Pydantic. Docstrings become tool descriptions the agent reads. Resources get URIs the agent can embed in its context.

The transport shift: stdio → Streamable HTTP

Every MCP tutorial from 2024 used stdio transport — the server runs as a subprocess, the agent pipes JSON-RPC over stdin/stdout. That's fine for desktop tools like Claude Desktop. It's the wrong answer for production.

Streamable HTTP (finalized in the 2025 spec) fixes this:

Servers run as long-lived HTTP services, not per-invocation subprocesses
Scale horizontally behind a load balancer
Share across teams and apps
Deploy once, discover via URL

In FastMCP, the switch is one line: transport="streamable-http".

Auth: OAuth 2.1 the boring way

MCP's 2025 spec added OAuth 2.1 as the standard auth mechanism. You don't roll your own. FastMCP ships with OAuth middleware that plugs into your existing IdP (Auth0, Okta, Cognito, Clerk, etc.):

from fastmcp.auth import OAuth2Middleware

mcp.add_middleware(OAuth2Middleware(
    issuer="https://tufail.auth0.com/",
    audience="mcp-internal-crm",
    required_scope="crm:read",
))

The agent handles the authorization dance. Your server just enforces scopes on each tool.

Deploying to AWS without overspending

Two patterns we've landed on for production MCP:

Pattern A — Low-traffic internal tools: Lambda + API Gateway

Use mangum or FastMCP's ASGI adapter to run inside Lambda
Cold starts ~300-500ms (acceptable for human-speed agent interactions)
Cost: near-zero when idle

Pattern B — High-traffic shared servers: ECS Fargate behind ALB

One service per logical server
Auto-scale on CPU/memory
Pair with ElastiCache for stateful session continuity
Cost: predictable, ~\$30/mo for a small always-on service

The mistake we made early on: treating every MCP server like it needed an always-on Fargate task. For servers that handle <10 agent calls/hour, Lambda is dramatically cheaper.

What to expose — and what not to

The #1 mistake I see is devs exposing their entire internal API as MCP tools. Don't.

Good MCP servers are curated for an agent's use case. Ask: what would a smart human operator need to do their job? Expose those 5-15 tools. Not your 300-endpoint API.

Good tool design:

One clear job per tool. search_customers not crm_unified_query.
Typed inputs and outputs. Pydantic makes this cheap.
Honest docstrings. The agent reads them. Lie in the docstring and the agent will confidently call your tool wrong.
Idempotent where possible. Agents retry. Accept that.

What's next

Remote MCP servers + fine-grained OAuth scopes are unlocking internal-AI-assistant work that was impossible a year ago. If you're a Python backend engineer and you haven't shipped an MCP server yet, pick your highest-leverage internal system and wrap it. You'll be surprised how quickly it changes how your team works.

FastAPI at 1M+ users: the patterns that actually matter

Tufail Khan — Tue, 21 Apr 2026 11:53:15 +0000

FastAPI is the default Python web framework in 2026 — 38% of Python teams ship on it, up from 29% a year ago. That means a lot of greenfield projects are making the same early mistakes.

This post is what I wish I'd known before scaling Savyour (Pakistan's first cashback platform, 1M+ users, 300+ merchant integrations) from 50 RPS to 3,000+ RPS on FastAPI.

Everything below is drawn from production. No "hello world" demos.

1. Know your async boundaries

FastAPI supports both def and async def endpoints. The framework is smart enough to run sync routes in a threadpool — but your code may not be.

The failure mode: an async def endpoint that calls a blocking library (say, requests instead of httpx). The sync call holds the event loop, everything queues behind it, and your p99 latency goes vertical.

Rule: if the function is async def, every IO operation inside it must be awaitable. Use httpx.AsyncClient, asyncpg, aioboto3, redis.asyncio.

When you must call a sync library, wrap it:

from fastapi.concurrency import run_in_threadpool

@app.get("/report")
async def generate_report():
    # sync pandas code — don't block the loop
    result = await run_in_threadpool(expensive_sync_function)
    return result

2. Connection pools are not optional

Naive async code opens a new database connection per request. At 500 RPS with a 50ms query, that's 25,000 connections fighting your Postgres instance. Postgres caps out around 200-500.

Fix: use a single pool per worker, with tuned sizing:

# database.py
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker

engine = create_async_engine(
    DATABASE_URL,
    pool_size=20,           # steady-state per worker
    max_overflow=10,        # burst tolerance
    pool_pre_ping=True,     # detect dead connections
    pool_recycle=1800,      # rotate every 30min
)

AsyncSessionLocal = sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)

async def get_db():
    async with AsyncSessionLocal() as session:
        yield session

For multi-worker deployments (Uvicorn --workers 4), multiply by worker count. If your Postgres caps at 200 connections, 4 workers × 30 max = 120 is safe. Monitor pg_stat_activity in prod.

3. Push heavy work to background queues

The endpoint that made Savyour go down in month two: a synchronous product-sync that iterated through 50K affiliate offers per merchant. Five merchants syncing at once = 250K records in-request = timeouts cascading.

The fix was simple but non-obvious to a team new to async: never do heavy work in the request cycle.

from arq import create_pool
from arq.connections import RedisSettings

@app.post("/sync/{merchant_id}")
async def trigger_sync(merchant_id: int, pool=Depends(get_arq_pool)):
    job = await pool.enqueue_job("sync_merchant", merchant_id)
    return {"job_id": job.job_id, "status": "queued"}

ARQ, Celery, or Dramatiq — pick one. The worker fleet scales independently of the API fleet. Requests return in milliseconds. Monitoring stays sane.

4. Pydantic v2 is 5-50× faster — use it

If you're still on Pydantic v1, migrate. The v2 rewrite in Rust dropped our request validation overhead from ~8ms to ~0.5ms per request. At 3,000 RPS that's a full CPU core back.

Gotchas we hit:

Config → model_config (nested dict)
.dict() → .model_dump()
validator → field_validator, root_validator → model_validator

Use bump-pydantic for the mechanical parts. The semantic changes (validator signatures) need human review.

5. Middleware for observability, not magic

We run three middleware layers in production. In order:

# main.py
app = FastAPI()

# 1. Request ID — every log line traces back
app.add_middleware(RequestIDMiddleware)

# 2. Timing — p50/p95/p99 per route
app.add_middleware(TimingMiddleware)

# 3. Structured logging — JSON out to CloudWatch
app.add_middleware(LoggingMiddleware)

# CORS goes OUTERMOST so OPTIONS requests skip everything
app.add_middleware(CORSMiddleware, allow_origins=FRONTEND_ORIGINS)

Avoid: auto-magic middleware that wraps your handlers with decorators you can't inspect. When things break at 3 AM, you need to grep the source and understand what's happening. Explicit > clever.

6. Health checks, liveness, readiness

Three distinct endpoints. Don't collapse them.

@app.get("/healthz")  # is the process up?
async def health():
    return {"status": "ok"}

@app.get("/readyz")  # can we serve traffic?
async def ready(db=Depends(get_db), redis=Depends(get_redis)):
    await db.execute("SELECT 1")
    await redis.ping()
    return {"status": "ready"}

@app.get("/livez")  # should kubelet restart us?
async def live():
    return {"status": "alive"}

Kubernetes (or ECS, or Fargate) uses these to make restart decisions. A failing dependency should make readyz fail so the LB stops sending traffic — but shouldn't make livez fail and trigger a restart loop.

7. One project structure to rule them all

After shipping a dozen FastAPI services, this is the structure I reach for:

app/
├── main.py            # FastAPI app, middleware, lifespan
├── config.py          # pydantic-settings, env-driven
├── db.py              # engine + session factory
├── dependencies.py    # shared Depends() providers
├── routers/
│   ├── customers.py
│   ├── orders.py
│   └── webhooks.py
├── schemas/           # pydantic request/response models
├── models/            # SQLAlchemy ORM
├── services/          # business logic, pure-ish
├── workers/           # ARQ/Celery task definitions
└── tests/

The key discipline: routers call services, services call models, models don't reach back up. Break that rule and tests get painful fast.

What I'd skip

Things I used to reach for that I don't anymore:

Starlette middleware for auth. Use FastAPI Depends() for auth — it composes cleanly with route permissions.
Custom exception handlers for every error. One global handler that maps exceptions → HTTP codes is enough for 95% of services.
Over-engineered response models for internal APIs. dict returns are fine for handlers only your own code calls.

The meta-point

FastAPI's documentation is aggressively good — better than most frameworks' books. Read it twice before inventing patterns. Most of the hard-won lessons above are implicit in the docs; I just didn't slow down enough to absorb them the first time.

Cutting our Claude API bill by 78% with prompt caching

Tufail Khan — Tue, 21 Apr 2026 11:41:20 +0000

In January 2026 our monthly Claude bill crossed $4,200, up from $600 six months earlier. We were serving a RAG-backed customer-support assistant that retrieved ~12K tokens of context per query, ran through an 800-token system prompt, and called Claude an average of 4.2 times per user session.

Rolling out Anthropic's prompt caching dropped that to $920/month — a 78% reduction — without touching any user-facing behavior.

This post is the exact playbook.

What prompt caching does

Claude's prompt caching stores prefix portions of your prompt in Anthropic's infrastructure. When a subsequent request reuses that same prefix, the cached portion costs 10% of the normal input-token price and is processed much faster.

The pricing in 2026:

Cache write: 1.25× input cost (on first use)
Cache read (hit): 0.1× input cost
TTL: 5 minutes by default, 1 hour available

Break-even is ~2 hits per cache write. In practice, a well-placed cache break point hits dozens to hundreds of times before it expires.

Where to cache — high, medium, low ROI

High ROI (always cache):

System prompts (usually stable across all requests)
Long tool-schema definitions
Retrieved context chunks reused within a session (RAG)
Few-shot example banks

Medium ROI:

User conversation history early in a session (caches grow as the conversation progresses)
Document chunks that appear frequently across queries

Low / anti-ROI:

Per-request user input
Anything that changes every call
Caches smaller than 1024 tokens (minimum cache block size for Claude Opus/Sonnet)

The anatomy of a cached prompt

In the Python SDK, you add cache_control markers to the content blocks you want cached. Everything before the marker gets cached as a prefix.

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # stable, reusable
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=[
        {
            "name": "search_docs",
            "description": "...",
            "input_schema": {...},
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": retrieved_context,  # session-scoped RAG chunks
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": f"User question: {user_input}",
                    # no cache marker — this changes every request
                },
            ],
        }
    ],
)

# inspect cache metrics
print(response.usage.cache_creation_input_tokens)
print(response.usage.cache_read_input_tokens)

The key insight: up to 4 cache break points per request. We use all 4:

System prompt (changes ~monthly)
Tool schemas (changes ~monthly)
Retrieved RAG context (changes per session)
Conversation history (grows within session)

Metrics from real traffic

Before caching, on a representative 1,000-request sample:

Input tokens billed: 14.2M (≈ $42.60 at Opus 4.7 pricing)
Output tokens billed: 380K (≈ $28.50)
Total: $71.10

After caching, same workload:

Cache write input tokens: 1.8M ($6.75)
Cache read input tokens: 12.1M ($3.63)
Uncached input tokens: 300K ($0.90)
Output tokens: 380K ($28.50)
Total: $39.78 (−44%)

Output tokens dominate what's left. Short of switching models, the input side is essentially solved.

Watch out for: cache invalidation footguns

Cache hits match on exact byte-level prefix equality. Any variance busts the cache. Things that silently broke ours early on:

Whitespace drift in system-prompt templating (a stray \n from a template engine)
Dict-ordering when serializing tool schemas from a Python dict — always use json.dumps(..., sort_keys=True)
Timestamp injection into system prompts ("Today is {date}..." rebuilds the cache every day — move it to user content)
User-scoped data in system prompt — blows cache per user; move it down the prompt

Instrument cache_creation_input_tokens vs cache_read_input_tokens on every response and alert if the ratio drifts. A week of silent cache misses can cost you thousands.

The 1-hour cache tier

Anthropic added a 1-hour TTL option in mid-2025. It costs 2× the write price but lives 12× longer. For workloads with predictable hot paths — e.g. a support assistant where 80% of sessions hit the same product docs — the 1-hour tier amortizes beautifully.

"cache_control": {"type": "ephemeral", "ttl": "1h"}

Use it where cache hit rate is high. Don't use it for small cache blocks or unpredictable traffic — you'll pay the write premium without the hit volume.

The takeaway

Prompt caching is the highest-ROI single change I've made to a production Claude app in the last year. If you're running a RAG, agent, or long-context workload on Claude and not using prompt caching, the savings are almost certainly 40-80% sitting on the table.

The cost to implement: two afternoons, including the instrumentation. The cost to ignore: compounding every month you don't do it.

Why we replaced LangChain with the raw Anthropic SDK in production

Tufail Khan — Tue, 21 Apr 2026 11:40:59 +0000

LangChain was the right answer in 2023. It abstracted away a messy ecosystem of half-baked provider APIs, gave you a unified LLM interface, and let you stitch agents together with a few dozen lines of Python. We used it everywhere — including in production on Vettio, our AI recruitment platform.

In April 2026, we ripped it out.

This post is about why we made that call, what replaced it, and the metrics that justified the migration.

The symptoms

LangChain's abstractions started leaking the moment we went beyond happy-path demos. Three things kept biting us:

Stack traces from hell. A single AgentExecutor.invoke() call crossed 14 frames of LangChain internals before reaching our code. Debugging a malformed tool call felt like archaeology.
Version churn. Every minor bump renamed, relocated, or deprecated something we depended on. Our CI was pinned to a specific LangChain SHA for six months just to stay green.
Abstracted-away observability. We couldn't cleanly trace token usage, cache hits, or per-tool latencies without monkey-patching internal classes.

Meanwhile, Anthropic's native SDK was getting better. Native tool calling, prompt caching, extended thinking, streaming — all first-class and documented.

The refactor

The logic we were using LangChain for wasn't complicated:

Build a system prompt from templates
Call Claude with a list of tools
Route tool calls to our internal handlers
Return the result

We replaced ~800 lines of LangChain glue with this:

from anthropic import Anthropic

client = Anthropic()

def run_agent(user_input: str, tools: list[dict], tool_handlers: dict):
    messages = [{"role": "user", "content": user_input}]

    while True:
        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=4096,
            tools=tools,
            messages=messages,
            system=SYSTEM_PROMPT,
        )

        if response.stop_reason == "end_turn":
            return response.content[0].text

        # Handle tool use
        tool_calls = [b for b in response.content if b.type == "tool_use"]
        messages.append({"role": "assistant", "content": response.content})

        tool_results = []
        for call in tool_calls:
            result = tool_handlers[call.name](**call.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": call.id,
                "content": str(result),
            })
        messages.append({"role": "user", "content": tool_results})

That's it. No AgentExecutor, no Callback, no ConversationBufferMemory. Just the model and our code.

The metrics

We ran the old and new paths side-by-side for two weeks on Vettio's interview-bot service. Results:

p50 latency: 2.1s → 1.4s (−33%)
p95 latency: 4.8s → 3.2s (−33%)
Error rate: 0.9% → 0.2%
Stack trace depth on errors: 14 → 4 frames
Lines of integration code: 812 → 187

The latency win came mostly from eliminating LangChain's implicit retry-and-retry-again behavior on tool-use mismatches. With direct SDK calls, a malformed tool schema fails loudly instead of silently retrying three times.

When LangChain still makes sense

This isn't a blanket "don't use LangChain" post. It still wins if you need:

Multi-provider abstraction. Swapping between Claude, GPT-4, and Gemini behind a stable interface.
LangGraph workflows for graph-based agent topologies you'd otherwise build from scratch.
LangSmith observability you don't want to rebuild.

For a team that's already committed to one provider (we're all-in on Claude) and wants full control over prompts, tool schemas, and observability — the native SDK is the right tool in 2026.

The lesson

Abstractions pay for themselves when the underlying APIs are bad. Anthropic's API isn't bad. It's clean, well-documented, and stable. The abstraction tax was real; the abstraction benefit had quietly evaporated.

If you're still on LangChain in a production Claude app, benchmark a direct-SDK rewrite of your hot path. You might be surprised.