DEV Community: Umesh Malik

How to Build Enterprise-Grade AI Agents for Free (MaxKB, 2026)

Umesh Malik — Tue, 07 Jul 2026 19:57:26 +0000

TL;DR

How to build enterprise-grade AI agents for free: self-host MaxKB — an open-source (GPLv3, ~22k GitHub stars) agent platform — and point it at a local model like DeepSeek or Llama via Ollama, so you pay zero API tokens and no data ever leaves your server.
"Enterprise-grade" isn't a checkbox — it's five things: answer precision, cost control, data-sovereign security, access control, and observability. Free tools can nail four of them out of the box.
The one honest catch: MaxKB's Community edition is free forever but capped (2 users, 5 apps, 50 knowledge bases). SSO, LDAP, and RBAC live in the paid Pro tier ($1,920/yr) — you can replace them yourself for $0 with more effort.
Precision comes from your RAG pipeline, not the model. Chunking, hybrid search, a reranker, and a "cite or refuse" prompt matter more than which LLM you pick.
MaxKB vs Dify vs n8n: pick MaxKB for a knowledge-grounded Q&A agent, Dify for a broad LLM app builder, n8n when the agent is one step in a bigger automation.

Most "Free AI Agent" Guides Are Lying to You

Here's how to build enterprise-grade AI agents for free in 2026: self-host the open-source platform MaxKB, point it at a local model, and you get a document-grounded, tool-using agent with $0 API cost and zero data leaving your network.

Most other "free AI agent" tutorials are either toys or bait. The toy version wires ChatGPT to a prompt and calls it an "agent." The bait version is free until step 7, when you hit a paywall, a per-token API meter, or a "contact sales" wall right as it gets useful. Neither gives you something you'd actually put in front of customers or run your internal knowledge base on.

This guide is the version I wish existed. We're going to stand up a real, enterprise-grade AI agent — one that answers from your documents with citations, takes actions through tools, runs entirely on infrastructure you control, and costs $0 in API fees — and I'll be honest about exactly where "free" stops and money starts. No hand-waving.

The vehicle is MaxKB, and by the end you'll have a working agent plus a clear-eyed view of the five things that separate a demo from something you can trust in production.

What Is an Enterprise-Grade AI Agent?

An enterprise-grade AI agent is an AI system that answers or acts on your organization's own data with measurable accuracy, keeps that data under your control, enforces who can do what, stays observable, and does all of it at a cost you can predict. "Enterprise-grade" is about trust and control — not about how big the model is.

The five-question test
Point any agent at five questions: is it accurate on our data, is the cost predictable, does our data stay ours, can we control who does what, and can we see why it answered the way it did? Five yeses is enterprise-grade. Anything less is a demo.

People throw the phrase around like it means "expensive." It doesn't. A $30/month SaaS chatbot can be less enterprise-grade than a well-configured open-source stack running on a $40 VPS. What makes an agent enterprise-grade is whether it holds up on five pillars:

Keep these five in mind. Everything below maps back to them. The good news: a free, self-hosted stack wins pillars 1, 2, 3, and 5 outright. Pillar 4 is the one place "free" gets an asterisk, and we'll deal with it head-on.

The $0 Stack: MaxKB + a Local Model

The cheapest enterprise-grade agent in 2026 is MaxKB running against a local LLM, both self-hosted on one machine. That's the whole trick. MaxKB gives you the agent platform; a local model kills the API bill; self-hosting solves data sovereignty.

MaxKB (short for Max Knowledge Brain, built by the 1Panel team) is an open-source platform for building enterprise-grade agents. Its stack is boringly solid — Vue frontend, Django backend, LangChain under the hood, and PostgreSQL + pgvector as the vector store — which means no exotic dependencies and one-command deployment.

Out of the box it gives you a full RAG pipeline (upload docs or crawl a site → automatic chunking → vectorization), a visual workflow engine, and MCP tool-use so the agent can call external tools. Crucially, it's model-agnostic — it'll talk to OpenAI, Claude, and Gemini or to local models like DeepSeek, Qwen, and Llama.

That last part is the money-saver. Point MaxKB at a local model served by Ollama and your per-token cost drops to exactly zero.

How to Build Enterprise-Grade AI Agents for Free, Step by Step

Here's the honest, end-to-end walkthrough. You need a machine with Docker installed — a laptop works for testing; for a local model you'll want at least 16GB of RAM (more if you run larger models). Every step below is free.

Step 1 — Run MaxKB

One command. This maps a data volume so your knowledge bases survive restarts:

docker run -d --name=maxkb --restart=always \
  -p 8080:8080 \
  -v ~/.maxkb:/var/lib/postgresql/data \
  -v ~/.python-packages:/opt/maxkb/app/sandbox/python-packages \
  1panel/maxkb

Open http://localhost:8080 and log in with the default credentials shown in the MaxKB docs. Change the password immediately — that's the first line of your security checklist.

Step 2 — Serve a local model for $0

Install Ollama, then pull a model. DeepSeek and Qwen punch far above their weight for RAG in 2026:

# install ollama, then:
ollama pull deepseek-r1:7b        # reasoning model, runs on modest hardware
ollama pull nomic-embed-text      # embeddings for the RAG pipeline

Ollama now serves an OpenAI-compatible endpoint at http://localhost:11434. This is the whole reason your token bill is zero — inference happens on your hardware, not someone's metered API.

Step 3 — Connect the model in MaxKB

In the MaxKB UI go to Model Settings → Add Model, choose the Ollama provider, and point it at your Ollama host. If MaxKB runs in Docker and Ollama runs on the host machine, use http://host.docker.internal:11434 as the base URL. Add both the chat model (deepseek-r1:7b) and the embedding model (nomic-embed-text). No API key required.

Precision tip
Use a dedicated embedding model (like nomic-embed-text) for the knowledge base, not your chat model. Retrieval quality — and therefore answer precision — depends far more on the embedding model than on the chat model. This is the single most common mistake people make and never diagnose.

Step 4 — Build the knowledge base

Create a knowledge base, then either upload documents or paste a URL to crawl. MaxKB handles splitting, vectorizing, and indexing into pgvector automatically. Two settings decide your precision:

Chunk size — too big and retrieval pulls in noise; too small and it loses context. Start around 500–800 tokens with overlap, then tune against real questions.
Segment cleanup — strip navigation boilerplate and repeated headers before indexing. Garbage in the index is the number-one cause of confidently wrong answers.

Step 5 — Create the agent and give it a tool

Create an application, attach your knowledge base, and set a system prompt that enforces grounding — the "cite or refuse" rule:

Answer only from the provided knowledge base.
Cite the source document for every claim.
If the answer isn't in the knowledge base, say
"I don't have that information" — never guess.

Then wire an MCP tool so the agent can act, not just answer — look up an order, create a ticket, query a database. MaxKB's function library and MCP support let you register tools the workflow can call. If you're new to MCP, start with how to build an MCP server and deploying one on Cloudflare Workers.

Step 6 — Embed it

MaxKB generates an embeddable chat widget and a REST API. Paste the widget script into any page, or call the API from your backend. Zero front-end code required — this is the "zero-coding integration" MaxKB is built around.

The Five Pillars, Judged Honestly

A running agent isn't the same as an enterprise-grade one. Let's grade the free stack against the five pillars — including where it falls short.

Pillar 1 — Effectiveness & precision

Precision comes from the retrieval layer, not the model. This is the most important sentence in this article. Teams burn weeks swapping models when their real problem is a bad chunking strategy or no reranking. RAG works by retrieving relevant chunks and forcing the model to answer from those chunks, which is what crushes hallucinations.

The precision levers, in order of impact:

Chunking — right size + overlap, boilerplate stripped.
Hybrid search — combine keyword and vector search so exact terms (part numbers, names) aren't lost to fuzzy semantics.
Reranking — reorder the top-k so the best chunk lands in the model's context, not just a relevant one.
A grounding prompt — "cite or refuse," as above.

Nail those four and a 7B local model will out-answer a frontier model with a sloppy pipeline. If you want the full theory, read building a RAG pipeline from scratch.

Pillar 2 — Cost control

This is where self-hosting quietly wins. A metered API bills you more as you succeed; a per-seat SaaS bills you more as your team grows. A self-hosted local model turns both into one flat server bill that barely moves.

The tradeoff is real and worth stating: self-hosting trades a variable money cost for a fixed operational cost — you run the server, you patch it, you own uptime. For a small internal agent that's a rounding error. At scale it's a massive saving.

Pillar 3 — Security & data sovereignty

The strongest argument for this stack. When the model runs locally and MaxKB runs on your server, no document ever touches a third-party cloud. For anyone handling PII, health, financial, or regulated data, that alone can be the difference between "allowed" and "not allowed." You're not sending your knowledge base to an API you don't control.

That's the architecture being secure. You still have to harden the deployment:

For the broader threat model of agents that take actions, see the agentic AI enterprise security model.

Pillar 4 — Access control (the honest asterisk)

Here's where free ends. MaxKB's Community edition caps you at 2 users, and SSO, LDAP, and RBAC are Pro-tier features. If you need "marketing can only see the marketing knowledge base, support leads can edit agents, everyone logs in with Okta" — that's the paid tier, or DIY work you take on yourself (an auth proxy in front, separate instances per team). I'd rather tell you that now than let you discover it at rollout.

Pillar 5 — Observability

MaxKB logs conversations and lets you inspect what was retrieved for a given answer, which is enough to debug precision and iterate on chunking. It's not a full LLM-observability suite — if you need deep tracing and eval dashboards you'll add tooling — but for "why did the agent say that?" you have what you need for free.

Where Does "Free" Actually End?

No dodging it. Here's the exact line between $0 and paid, so you can plan.

My take: start on Community. It is genuinely free and genuinely capable. Only pay when access control across multiple teams becomes a real, present need — not a hypothetical one. Most people building their first agent are nowhere near the 2-user wall.

MaxKB vs Dify vs n8n: Which Free AI Agent Platform Should You Use?

"Free AI agent platform" returns a dozen tools that do overlapping-but-different things. Here's how the honest contenders compare — and none of these are the same product.

The key insight: Ollama isn't a competitor to MaxKB — it's a component of the stack. MaxKB and Dify compete; n8n plays a different game (orchestration). Here's how I'd choose:

Common Mistakes That Kill Free Agents

FAQ

Bottom Line

Free and enterprise-grade are not opposites in 2026 — that's the myth this guide exists to kill. Self-host MaxKB, point it at a local model, invest in your retrieval pipeline, harden the box, and you have an agent that answers from your data with citations, keeps that data on your own hardware, and costs nothing per token. The only honest asterisk is multi-team access control, and you now know exactly when that bill arrives.

Start on the Community edition today. Build the narrow, useful agent — the one that answers your support questions or your internal docs — get the retrieval right, and let the results decide whether you ever need to pay.

If this was useful, read building a RAG pipeline from scratch next to push your agent's precision further, or why 77% of AI agents never reach production to make sure yours is in the 23% that do.

Sources

MaxKB — 1Panel-dev/MaxKB on GitHub (open-source, GPLv3)
MaxKB official documentation
MaxKB project site
Ollama — run local models

Written for umesh-malik.com — no-fluff technical writing on AI, Web Dev, and Engineering.

Originally published at umesh-malik.com

Keep reading on umesh-malik.com:

HTTP QUERY Method Explained: The REST Fix for GET vs POST (RFC 10008)

Umesh Malik — Fri, 03 Jul 2026 14:07:20 +0000

The HTTP QUERY method (RFC 10008, June 2026) is a new REST request method that is safe and idempotent like GET but carries the query in a request body like POST — so large, structured queries finally get HTTP-layer caching and automatic retries. It's the first general-purpose method HTTP has gained since PATCH in 2010, and it closes a design gap REST API developers have been working around for 25 years. If you've ever tunneled a read-only search through POST and silently given up caching and retry safety, this method exists because of you.

TL;DR

QUERY is a new HTTP request method that sends a query in the request body while guaranteeing the operation is safe and idempotent — no state change, freely retryable, and cacheable by proxies and CDNs.
It exists because GET can't reliably carry a body (the spec gives GET bodies "no defined semantics", and URLs cap out around 2,000–8,000 characters) while POST forfeits caching and automatic retries even for purely read-only queries.
The cache key for a QUERY request is the URL plus the request body, so two identical searches can be served from the edge without touching your origin — something POST-based search endpoints will never get.
It's already real: Node.js parses QUERY natively (21.7.2+/22+), Fastify supports it via addHttpMethod(), nginx landed basic support, ASP.NET Core recognizes it in .NET 11 previews, and OpenAPI 3.2 models it — but browsers' fetch() can't send it yet.
My take: design new server-to-server search/filter/report endpoints as QUERY-shaped now (single query document in the body, mandatory Content-Type), even if you expose them via POST today. The migration later becomes a method-name change.

What Is the HTTP QUERY Method?

The HTTP QUERY method is a safe, idempotent request method, defined in RFC 10008, that asks a resource to run a query described by the request body and return the result. The body — with a mandatory Content-Type — carries the query parameters, so the URL no longer has to. Unlike POST, the server promises the operation changes nothing; unlike GET, the query can be as large and structured as you want: JSON documents, GraphQL queries, SQL-ish filters, JSONPath expressions.

The RFC's own framing is precise: a QUERY request asks "the target resource to perform a query operation within the scope of that target resource." Two guarantees make everything else possible. First, safety: "the client does not request or expect any change to the state of the target resource." Second, idempotency: QUERY requests "can be retried or repeated when needed, for instance, after a connection failure."

That second sentence is the quiet superpower. A dropped connection mid-POST leaves your client guessing — did the server process it? A dropped connection mid-QUERY is a non-event: any client, library, or proxy may resend it without asking.

Why HTTP Needed a New Method

Here's the design flaw we all normalized: HTTP made you choose between honest semantics and a request body.

GET is the semantically correct method for reads. Every cache, proxy, prefetcher, and crawler on Earth understands it. But RFC 9110 gives content in a GET request "no generally defined semantics" — intermediaries may drop it, some servers reject it, and browsers won't send it. So real query data had to squeeze into the URL, which breaks somewhere between 2,000 and 8,000 characters depending on the browser, proxy, and server in the chain.

POST takes any body you like. But POST is defined as potentially unsafe and non-idempotent, so every intermediary treats your read-only search like a payment submission: no shared caching, no automatic retry, no prefetching. Every GraphQL query — explicitly read-only by design — loses HTTP-level caching and retry safety the moment it ships as POST.

The industry's workarounds tell the story better than any spec rationale:

This isn't a new idea, which makes the 18-year journey instructive. WebDAV had a body-carrying SEARCH method back in RFC 5323 (2008), but it never escaped WebDAV's orbit. The IETF HTTP working group adopted draft-ietf-httpbis-safe-method-w-body in 2021, initially to generalize SEARCH, renamed it QUERY by draft-02 in 2022 to shed the WebDAV baggage, and shipped RFC 10008 after fourteen draft revisions. The author list explains the industry buy-in: Julian Reschke (greenbytes, co-editor of the core HTTP specs), James Snell (Cloudflare), and Mike Bishop (Akamai). When the two largest CDNs co-author a caching-focused method, they intend to implement it.

How QUERY Works on the Wire

A QUERY request looks like a POST that tells the truth. Here's the RFC's canonical example — a form-encoded query against a contacts collection:

Beyond the basic exchange, RFC 10008 nails down four behaviors that make QUERY more than "POST, but pinky-promise it's safe":

1. Content-Type is not optional. Servers "MUST fail the request if the Content-Type request field is missing or is inconsistent with the request content." No content sniffing, no guessing — a sharp break from POST's anything-goes reality.

2. Discovery via Accept-Query. A server advertises QUERY support and its accepted formats with a response header using Structured Field syntax:

OPTIONS /contacts HTTP/1.1
Host: api.example.com

HTTP/1.1 204 No Content
Allow: GET, HEAD, OPTIONS, QUERY
Accept-Query: application/x-www-form-urlencoded, application/jsonpath

One subtlety worth knowing: the Accept-Query value "applies to every URI on the server that shares the same path" — the URL's query component is ignored for this purpose.

3. Results can get their own URIs. A 2xx response may include a Location header naming a URI you can later GET to re-run the equivalent query, or a Content-Location naming a resource that holds this specific result snapshot. That turns an expensive ad-hoc query into a shareable, bookmarkable, cacheable resource — a genuinely RESTful touch POST never standardized for reads.

4. Redirects behave like you'd hope. A 301 or 308 means repeat the QUERY at the new target; 303 means fetch the result with a plain GET. Crucially, the legacy exception that lets clients degrade a redirected POST into GET "does not apply to QUERY requests" — your query body survives the redirect.

Caching: The Actual Killer Feature

The RFC is blunt about the mechanics: "The cache key for a QUERY request MUST incorporate the request content and related metadata." URL plus body, hashed, stored. Identical query, identical key, edge-served response.

Caches are even allowed to normalize semantically insignificant differences — stripping content encoding, or applying format-aware knowledge (JSON key order, whitespace) — so trivially different bodies can still share a cache entry. If that makes you nervous, no-transform opts out.

Think about what this does for a search-heavy API. Today, your "cache" for POST /search is an application-level Redis layer you built, keyed by a hash you invented, invalidated by rules you maintain. QUERY moves that entire concern into infrastructure that already exists at every hop between the user and your origin. This is the same architectural instinct as moving analytics to the edge with Cloudflare Workers — push work to the layer that's already positioned to do it.

💡 Key insight: QUERY doesn't make anything possible that was impossible before. It makes the default correct — caching, retries, and semantics you previously had to hand-build now fall out of the method name.

QUERY vs GET vs POST

The one-table version, straight from the RFC's own comparison:

That security row deserves a sentence: because the query moves out of the URL, it stops leaking into access logs, browser history, referrer chains, and analytics pipelines. RFC 10008's Security Considerations calls this out as a real benefit for sensitive queries — with the matching warning that servers minting result URIs must not embed sensitive query parts back into them.

Where QUERY Works Today (July 2026)

The honest adoption picture, one month after publication:

On the working side: Node.js parses QUERY natively since 21.7.2 and 22+, and Fastify exposes it with one line. nginx landed basic upstream support, Spring Framework has an active PR (#34993), Rails has a live core proposal, and ASP.NET Core recognizes it in .NET 11 previews.

OpenAPI 3.2 also models QUERY operations — which means generated clients will offer it before most hand-written servers accept it.

Here's a working Fastify endpoint today:

import Fastify from 'fastify';

const app = Fastify();
app.addHttpMethod('QUERY', { hasBody: true });

app.route({
  method: 'QUERY',
  url: '/products',
  handler: async (req, reply) => {
    const results = await search(req.body); // body IS the query
    reply.header('cache-control', 'public, max-age=60');
    return results;
  }
});

And from any client that lets you set a custom method:

curl -X QUERY https://api.example.com/products \
  -H 'content-type: application/json' \
  -d '{ "category": "keyboards", "maxPrice": 200, "sort": "rating" }'

The missing piece is the browser. fetch() can't send QUERY until the WHATWG Fetch spec adds it, and even then QUERY is not a CORS-safelisted method, so cross-origin requests will always preflight. In 2026, QUERY is a server-to-server and API-gateway story — which is exactly where search, reporting, and internal query traffic lives anyway. If you build backend-for-frontend layers in Node.js, that BFF-to-service hop is the perfect first deployment.

Common Mistakes to Avoid

The #1 mistake: treating QUERY as a POST alias
The method is a contract, not a vibe. If your QUERY handler writes an audit row, increments a rate counter that changes responses, or mutates anything a client could observe, you've broken the contract — and a cache or retrying proxy will eventually expose that bug in production, far from the code that caused it. Side effects that aren't client-observable (logging, metrics) are fine, exactly as with GET.

Four more, seen in the wild already:

Omitting Content-Type. Legal-ish on POST, a mandatory failure on QUERY. Fail these requests loudly, as the RFC requires — don't sniff.
Forgetting the CORS preflight. QUERY from browser-adjacent contexts will send OPTIONS first. Your gateway needs Access-Control-Allow-Methods: QUERY or every request dies preflight.
Leaking query content into result URIs. If QUERY returns a Location for later GETs, that URL will be logged everywhere. Use opaque result IDs, not serialized query parameters.
Assuming your WAF and proxies pass it through. Older intermediaries reject unknown methods. Test the full path — one method-allowlist in a corporate proxy can 405 your rollout.

Should You Adopt QUERY Now?

FAQ

The Bottom Line

REST API design has spent 25 years contorting reads into methods that either couldn't carry the query or couldn't admit it was a read. The HTTP QUERY method ends that trade-off with the least glamorous, most durable kind of fix: a verb that tells the infrastructure the truth, and infrastructure that rewards honesty with caching and retries you no longer have to build.

You don't need to rewrite anything this quarter. But every new search, filter, or reporting endpoint you design from today should be QUERY-shaped: one query document in the body, a mandatory content type, and a handler with zero side effects. Do that, and adopting the verb itself — when your framework, gateway, and CDN all catch up — becomes a one-line change instead of a migration project.

If you found this useful, read Node.js Backend for Frontend Developers next — the BFF layer is exactly where QUERY will land first — or see how FastAPI's app.frontend() rethinks another old default.

Sources

RFC 10008 — The HTTP QUERY Method (IETF, June 2026)
draft-ietf-httpbis-safe-method-w-body — draft history (IETF Datatracker)
RFC 9110 — HTTP Semantics (GET body semantics, method properties)
RFC 5323 — WebDAV SEARCH (the 2008 precursor)
Spring Framework PR #34993 — Add RFC 10008 (QUERY) support
Ruby on Rails core proposal — Support for the HTTP QUERY method
RFC 10008: HTTP QUERY Method Ends the POST Workaround (byteiota, adoption roundup)

Written for umesh-malik.com — no-fluff technical writing on AI, Web Dev, and Engineering.

Originally published at umesh-malik.com

Keep reading on umesh-malik.com:

Claude Fable 5: The Deep-Dive Guide — What It Does, What It Costs, and When It's Worth the Premium (2026)

Umesh Malik — Thu, 02 Jul 2026 13:49:44 +0000

Claude Fable 5 is the model Anthropic points to when the answer to "can an AI actually do this?" needs to be yes. It's the most capable model they've released widely — built for the work that used to be a research demo: multi-hour autonomous runs, first-shot builds of well-specified systems, and end-to-end deliverables a person would bill days for. In the API it's the model id claude-fable-5.

It's also the most expensive Claude you can call, it behaves differently from every Opus-tier model before it, and for most of what you do day to day it is the wrong choice. This is the honest deep-dive: what Fable 5 genuinely does better, what it costs once you do the arithmetic, the API quirks that will trip you up, and — the part the launch hype skips — the large set of jobs where you should reach for something cheaper.

TL;DR

Claude Fable 5 is Anthropic's most capable widely released model — tuned for the hardest reasoning and long-horizon agentic work, not for chat or high-volume throughput.
It's premium-priced at $10 / $50 per million tokens — roughly 2× Opus 4.8 and over 3× Sonnet 5. The capability ceiling is real; so is the bill.
It behaves differently from Opus-tier models. Thinking is always on (you can't disable it), the raw chain of thought is never returned, and safety classifiers can decline a request with a refusal stop reason you have to handle in code.
It got broadly available the slow way — through an invitation-only preview and a restricted access program before general release, not a big-bang launch.
The play: keep Opus 4.8 or Sonnet 5 as your default, and escalate a specific workload to Fable 5 only when its ceiling — long-horizon autonomy, first-shot correctness on a genuinely hard task — is worth 2×+ the price.

What Is Claude Fable 5?

Claude Fable 5 is Anthropic's flagship model for the most demanding reasoning and long-horizon agentic work — the most capable model they've released to the general public. It has a 1-million-token context window (which is both the maximum and the default), up to 128K output tokens per request, always-on extended thinking, and high-resolution vision. You call it in the API as claude-fable-5.

The one-sentence version: it's the model you give your hardest, longest, least-supervised problem to. Where a mid-tier model shines on the tasks you'd assign a competent engineer, Fable 5 is built for the ones you'd assign a senior engineer a week to figure out — and it's meant to run largely unattended while it does.

Three things define it in practice:

It's built for autonomy, not turns. The headline is long-horizon execution: a single request on a hard task can run for many minutes while the model gathers context, builds, and verifies its own work. This is not a chat model with a bigger brain — it's a model tuned to be pointed at an outcome and left alone.
It trades control for capability. Several knobs you're used to are gone. Thinking is always on. Sampling parameters are rejected. There's no assistant prefill. You steer it with prompting and an effort dial, not low-level parameters.
It's the premium tier, and priced like it. At $10 / $50 per million tokens it sits above every Opus model. The interesting question was never "is it the most capable?" — it is. The question is "is this workload hard enough to justify it?"

💡 Key insight: Fable 5's value isn't spread evenly. On routine work it's overkill you're overpaying for. On the hardest long-horizon runs it does things no cheaper model reliably does. Knowing which bucket a task falls into is the whole skill.

How Claude Fable 5 Became Widely Available

Fable 5 didn't arrive as a single splashy launch. The capability reached the public through a staged rollout that started tightly restricted and opened up over time — worth understanding, because the more exclusive tiers still exist alongside it.

The practical takeaway: Fable 5 and Mythos 5 are the same model with different front doors. Mythos 5 is what Project Glasswing participants use; Fable 5 is the generally available equivalent everyone else calls. If you've read about "Mythos" and wondered whether you're missing a more powerful model — you're not. You just reach it under a different id.

A note on the timeline
This is an access-tier progression — invitation-only preview, to a restricted program, to general availability — not a case of the model being pulled and re-launched. If you see claims of a "recall," treat them skeptically; the documented path is restricted-to-broad access, not launch-then-relaunch.

What Claude Fable 5 Can Actually Do

Fable 5's gains show up most on work above what previous models could reliably finish. Evaluate it on the tasks a mid-tier model already handles and you'll wonder why you paid the premium. Point it at the hard stuff and the difference is obvious.

If your problem lives on that grid — an overnight build, a hard multi-file migration, an end-to-end analysis deliverable, a long autonomous agent loop where correctness matters more than the token bill — Fable 5 is the model that clears the bar cheaper models keep tripping on. If it doesn't, keep reading, because you're probably about to overpay.

The API Mechanics Nobody Warns You About

This is where Fable 5 stops being "a better model" and starts being a different one. If you migrate an Opus-tier integration by swapping the model string, several of these will bite you. Every one of them is a documented behavior, not a bug.

Thinking is always on — steer it with `effort`, not `budget_tokens`

Extended thinking runs on every request and cannot be disabled. Sending thinking: {type: "disabled"} returns a 400. Sending the old thinking: {type: "enabled", budget_tokens: N} also returns a 400 — the token-budget knob is gone. You control depth with the effort parameter instead.

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-fable-5",
    max_tokens=16000,
    # No `thinking` field — it's always on. Control depth with effort:
    output_config={"effort": "high"},  # low | medium | high | xhigh | max
    messages=[{"role": "user", "content": "..."}],
)

effort runs from low through high, plus xhigh and max. Higher effort buys deeper reasoning and better self-verification; it also spends more tokens and takes longer. A useful counterintuitive fact: Fable 5 at low often beats a previous-generation model at max. Sweep the levels on your own workload — don't reflexively pin it to max.

The raw chain of thought is never returned

You'll get thinking blocks in the response, but their text is empty by default (display: "omitted"). Set display: "summarized" to get a readable summary of the reasoning — the raw chain of thought is never exposed on any setting. If you stream reasoning to users, the default looks like a long pause before output; opt into summaries explicitly.

Requests can be refused — handle it before reading content

Fable 5 runs safety classifiers (targeting research biology and most cybersecurity content) that can decline a request. A decline is a successful HTTP 200 with stop_reason: "refusal" — not an exception. Code that reads response.content[0] unconditionally will crash on a refused request. Always branch on stop_reason first, and opt into a server-side fallback so a refusal doesn't just fail:

response = client.beta.messages.create(
    model="claude-fable-5",
    max_tokens=16000,
    betas=["server-side-fallback-2026-06-01"],
    fallbacks=[{"model": "claude-opus-4-8"}],  # re-served on the same call if Fable declines
    messages=[{"role": "user", "content": "..."}],
)

if response.stop_reason == "refusal":
    handle_refusal()          # pre-output refusals are empty and unbilled
else:
    print(response.content[0].text)

Benign, adjacent work (security tooling, life-sciences tasks) can trip a false positive, which is exactly why the fallback matters. A decline before any output isn't billed at all; a mid-stream decline bills the partial output — discard it.

It requires 30-day data retention

Fable 5 is not available under zero data retention. If your org is configured for ZDR (or any retention below 30 days), every Fable 5 request returns a 400 — even a perfectly valid one. If a migration suddenly 400s with no obvious payload problem, check your retention setting before you debug the request body.

The five things that will bite an Opus-to-Fable migration
1. Disabling thinking (thinking.type: "disabled") and budget_tokens both return 400 — thinking is always on; use effort. 2. Assistant-message prefills are rejected — use structured outputs or a system instruction. 3. Refused requests return HTTP 200 with stop_reason: "refusal" — check it before reading content. 4. Zero-data-retention orgs get a 400 on every request. 5. Turns can run for minutes — plan timeouts, streaming, and progress UX before you ship.

Plan for minutes-long turns and token budgets

Because thinking is always on and the model is tuned for long-horizon work, a single request on a hard task can run for many minutes. Two consequences: stream anything with a large max_tokens (128K output requires streaming to avoid HTTP timeouts), and consider a task budget to let the model pace itself across an agentic loop — it sees a running countdown and wraps up gracefully instead of getting cut off.

with client.beta.messages.stream(
    model="claude-fable-5",
    max_tokens=128000,
    output_config={"effort": "high", "task_budget": {"type": "tokens", "total": 64000}},
    betas=["task-budgets-2026-03-13"],
    messages=[...],
    tools=[...],
) as stream:
    response = stream.get_final_message()

The Real Cost Math

Here's where the deep-dive earns its keep. Fable 5 is the premium tier, and the sticker is only half the story. First, the shape of the ladder — because the cost climbs a lot faster than the difficulty of most work does.

Now the same ladder with the per-model guidance spelled out:

Do the arithmetic on a real task. Say a long agentic run consumes 500K input tokens (large context, re-sent across turns) and 100K output tokens. On Fable 5 that's $5.00 + $5.00 = $10.00. The identical run on Opus 4.8 is $2.50 + $2.50 = $5.00, and on Sonnet 5 (standard) about $1.50 + $1.50 = $3.00. Same task, 2×–3.3× the cost. Multiply by a fleet of agents and Fable's premium stops being a rounding error and becomes a budget decision.

One thing that does NOT change: the tokenizer
Fable 5 uses the same tokenizer as Opus 4.8, so migrating between those two, your token counts are roughly unchanged — the cost difference is purely the per-token price, not a hidden count inflation. (Coming from Opus 4.6, Sonnet, or older, the counts do differ; re-baseline with a count_tokens call.) This is the opposite of the Sonnet 5 tokenizer gotcha — with Fable-vs-Opus, what you see on the sticker is what you pay.

So the premium is honest and predictable relative to Opus. The question is never "is it inflated by tokenization?" — it's "is this task hard enough that Fable's higher ceiling saves me more than the 2× costs."

Where Fable 5 Sits vs Opus 4.8 and Sonnet 5

This isn't a "which is better" question — Fable 5 is the most capable, that's what the top of the ladder means. It's a "when is the extra capability worth 2× Opus and 3× Sonnet" question.

{#snippet newContent()}

Reach for Fable 5 when the ceiling is the whole point. The hardest long-horizon autonomous runs, first-shot builds from a complete spec where you'd rather pay more than review a subtly-wrong result, end-to-end enterprise deliverables, and agent fleets whose correctness on hard tasks — not throughput — is the constraint. If the task is one you'd give a senior engineer days for and want done largely unattended, this is the tier.

{/snippet}
{#snippet oldContent()}

Stay on Opus 4.8 or Sonnet 5 for everything else — which is most things. Interactive coding, tool and CLI work, RAG and product backends, knowledge work, and high-volume agent loops where token cost is the line item. Opus 4.8 gives you state-of-the-art capability at half Fable's price; Sonnet 5 gives you near-Opus quality at a third. For the overwhelming majority of work in 2026, one of these two is the correct call.

{/snippet}

The heuristic I use: default to Sonnet 5, escalate hard tasks to Opus 4.8, and reserve Fable 5 for the handful of jobs where you've measured that even Opus leaves value on the table. Don't pay the Fable premium as an insurance policy across the board — pay it where you've watched a cheaper model fall short on that specific task. This is the same "measure, don't assume" discipline that separates teams who ship real production work with AI agents from teams who burn budget guessing.

The Honest Ledger

When You Should NOT Reach for Fable 5

The most useful thing I can tell you about the most capable model is when to skip it. Fable 5's premium is only worth it at the top of the difficulty curve. Everywhere else, you're lighting money on fire for capability you won't use.

Don't use Fable 5 for:

Chat, assistants, and RAG backends. These want speed and cost efficiency. Sonnet 5 is the right default; Haiku 4.5 for the simplest, highest-volume paths.
Interactive, latency-sensitive coding. Minutes-long turns are a feature for overnight autonomy and a liability when a human is waiting. Use Opus 4.8 or Sonnet 5 in the loop.
High-volume agent fleets where cost is the constraint. At 2×–3× the token price, Fable turns an affordable experiment into a line item you'll have to defend. Run the fleet on Sonnet 5 and escalate individual hard tasks.
Anything a mid-tier model already passes your evals on. If Sonnet 5 or Opus 4.8 clears the bar, the extra capability is invisible and the extra cost is not.

The one-line rule
Default to Sonnet 5 for most work, escalate hard tasks to Opus 4.8, and only reach for Fable 5 when you've measured that even Opus leaves capability on the table for that specific job. Capability you can't point to is capability you shouldn't pay for.

How to Actually Get Value From Fable 5

If a task does clear the bar, the way you prompt Fable 5 matters more than with any prior model. It's more autonomous and more literal, so a few habits change.

That last one is counterintuitive and worth repeating: prompts and skills written for previous models are often too prescriptive for Fable 5 and actively lower its output quality. State the goal and the constraints, then get out of the way. If you've internalized a house style for writing a good CLAUDE.md or running Claude Code in auto mode, Fable is the model that most rewards trusting it with the what and why instead of dictating every step.

FAQ

The Verdict

Claude Fable 5 is the most capable model Anthropic has put in front of the general public, and it earns that title on exactly the work it was built for: long, unattended, genuinely hard problems where the ceiling is the point. The staged rollout — invitation-only preview, to a restricted program, to general availability — was really a story about access widening, not capability changing hands.

But "most capable" and "the one you should use" are different sentences. At $10 / $50, Fable 5 is a specialist tool. Put it on routine work and you're paying flagship rates for a ceiling you'll never touch; put it on the hardest autonomous runs and it does things cheaper models don't. Default to Sonnet 5, escalate hard tasks to Opus 4.8, and spend Fable 5 tokens only where you've measured that even Opus leaves capability on the table.

Want to see what that top-of-the-ladder capability looks like turned loose on a real project? Read how Fable 5 built a full streaming microservice in a day next — the case study behind the capabilities on this page.

Sources

Written for umesh-malik.com — no-fluff technical writing on AI, Web Dev, and Engineering.

Originally published at umesh-malik.com

Keep reading on umesh-malik.com:

Claude Sonnet 5: The Honest Guide — Pros, Cons, Use Cases, and What It Actually Costs (2026)

Umesh Malik — Wed, 01 Jul 2026 10:02:25 +0000

Claude Sonnet 5 is the model that quietly changes the default. For two years the rule was simple: reach for the biggest model when the work is hard, and drop to a Sonnet-tier model when you need speed or you're watching the bill. Sonnet 5 blurs that line. It lands within a few points of Opus 4.8 on the benchmarks that matter for real engineering work — and it does it at roughly 60% of Opus pricing, less during the launch window.

This is the honest guide: what it's genuinely good at, where it still loses to Opus, the use cases it was built for, and the actual cost math — including the tokenizer change almost every launch-day post skipped over.

TL;DR

Claude Sonnet 5 is the most agentic Sonnet model Anthropic has shipped — it plans, uses tools like browsers and terminals, and runs autonomously at a level that needed an Opus-class model a few months ago.
On agentic coding it scores 63.2% on SWE-bench Pro (up from Sonnet 4.6's 58.1%), closing much of the gap to Opus 4.8's 69.2%. On knowledge work it actually matches Opus 4.8 (1,618 vs 1,615 on GDPval-AA v2).
Pricing is $2 / $10 per million tokens through August 31, 2026, then $3 / $15 — the same sticker as the older Sonnet 4.6, and well under Opus 4.8's $5 / $25.
The catch nobody mentions: Sonnet 5 uses a new tokenizer, so the same text can cost roughly 1.0–1.35× more tokens. The per-token price dropped relative to Opus, but re-baseline your real costs before you celebrate.
The play: make Sonnet 5 your default for coding, tool use, and knowledge work; escalate to Opus 4.8 only for the hardest long-horizon autonomous runs.

What Is Claude Sonnet 5?

Claude Sonnet 5 is Anthropic's mid-tier model, released June 30, 2026, built to bring near-Opus agentic and coding performance to the Sonnet price point. It has a 1-million-token context window, adaptive extended thinking, high-resolution vision, and the strongest tool-use and computer-use scores of any Sonnet model to date. In the API it's the model id claude-sonnet-5.

The one-sentence version: it's Opus-class capability for most real work, priced like a Sonnet. Anthropic's own framing is that Sonnet 5 "narrows the gap: its performance is close to that of Opus 4.8, but at lower prices." That's marketing, but for once the benchmarks back it up.

Three things define it in practice:

It's agent-first. The headline gains are on agentic coding, terminal/CLI tasks, and computer use — not chat. This is a model tuned to be dropped into an autonomous loop.
It's the new default. On claude.ai it's the default model for Free and Pro users, and it ships in Claude Code, the Claude API, Cursor, VS Code, and GitHub Copilot.
It changes the cost calculus. The interesting question stopped being "is it as good as Opus?" and became "is it good enough that I never need Opus?"

The Benchmarks: Sonnet 5 vs Sonnet 4.6 vs Opus 4.8

Numbers first, opinion after. These are the published comparison figures across the three models most teams are choosing between.

Read those rows carefully, because they tell two different stories.

On pure agentic coding (SWE-bench Pro), Opus 4.8 is still ahead — 69.2% vs 63.2%. That six-point gap is real, and it's exactly the kind of gap that shows up as "the agent got 94% of the way and then made a mess of the last file" on genuinely hard, multi-file tasks.

But look at Terminal-Bench and knowledge work. On terminal/CLI-style tasks, Sonnet 5 jumps to 80.4% — a 13-point leap over Sonnet 4.6, and the kind of number that used to be Opus territory. On GDPval-AA v2, a knowledge-work benchmark, Sonnet 5 (1,618) doesn't just approach Opus 4.8 (1,615) — it edges past it.

💡 Key insight: Sonnet 5 isn't "Opus minus a bit" across the board. It's at parity or better on knowledge work and CLI tasks, and only meaningfully behind on the hardest agentic-coding runs. Where you land on "is it worth it" depends entirely on which of those your workload actually is.

The Pros and Cons

No hedging. Here's the real ledger after reading the launch data, the third-party comparisons, and the developer reaction.

The Real Cost Math (and the Tokenizer Gotcha)

Here's where most write-ups stop at the sticker price. Don't.

The headline is genuinely good: at standard pricing, Sonnet 5 costs 60% of Opus 4.8 on both input and output, and during the intro window it's 40%. For a team running agents at volume, that's the difference between an experiment and a line item you can defend.

But there's a footnote that changes the arithmetic:

The tokenizer gotcha — read this before you budget
Sonnet 5 ships with a new tokenizer (the same family Anthropic introduced with Opus 4.7). The same input text maps to roughly 1.0–1.35× as many tokens as it did on Sonnet 4.6 — up to ~30% more, depending on content. Per-token pricing dropped relative to Opus, but your per-task cost won't drop by the full sticker difference, and against Sonnet 4.6 (identical $3/$15 sticker) an identical task can actually cost slightly more. Re-baseline with a real count_tokens call on your own prompts before you model the savings — don't apply a blanket multiplier.

The practical takeaway: the discount is real, but it's smaller than "$3 vs $5" suggests once tokenization is accounted for. Model your actual traffic. The savings are still large enough to justify switching most Opus workloads — they're just not the clean 40–60% the price table implies.

Use Cases: What Sonnet 5 Is Actually For

Sonnet 5 was tuned for a specific shape of work — autonomous, tool-using, and high-volume. These are the places it earns its keep.

If your workload is on that grid, Sonnet 5 is very likely the right model. If you're building a RAG pipeline or wiring up an MCP server, it's the model I'd reach for first and only escalate from if evals tell me to.

Sonnet 5 vs Opus 4.8: When to Pick Which

This is the decision most teams are actually making. It's not "which is better" — Opus 4.8 is better, that's what the top tier is for. It's "when is the extra capability worth ~1.7× the price."

{#snippet newContent()}

Pick Sonnet 5 for the default case — which is most cases. Interactive coding, tool-heavy and CLI workflows, computer use, long-context reads, knowledge work, and any high-volume agent loop where token cost is the constraint. Run it at high effort for everyday work and xhigh for the genuinely hard tasks; that alone closes much of the remaining gap to Opus. For the overwhelming majority of engineering work in 2026, this is the model you should reach for first.

{/snippet}
{#snippet oldContent()}

Reserve Opus 4.8 for the hardest long-horizon runs. Multi-hour autonomous builds where a single wrong turn is expensive to unwind, the last few points of accuracy on gnarly multi-file agentic coding (that 69.2% vs 63.2% SWE-bench Pro gap), and cases where you'd rather pay 1.7× than review a subtly-wrong diff. If correctness on a hard task matters more than cost, this is still the tier to use.

{/snippet}

The heuristic I use: default to Sonnet 5, and only escalate an individual workload to Opus 4.8 when your own evals show it losing on that specific task. Don't pay the Opus premium as an insurance policy across the board — pay it where you've measured that it's earned. This is the same "measure, don't assume" discipline that separates teams who ship production work with AI agents from teams who burn budget guessing.

Migrating From Sonnet 4.6? Read This First

If you're upgrading an existing integration, Sonnet 5 is mostly a drop-in — but a few defaults changed and will bite you silently if you don't know about them.

Three things that changed under you
1. Adaptive thinking is on by default. On Sonnet 4.6, omitting the thinking parameter meant no thinking. On Sonnet 5, omitting it runs adaptive thinking — so you'll spend thinking tokens (and may hit max_tokens) where you didn't before. Set the thinking parameter to "disabled" explicitly if you want the old behavior. 2. Sampling parameters are rejected. Non-default temperature, top_p, or top_k now return a 400 — steer with prompting instead. 3. The tokenizer changed. Re-run count_tokens and revisit max_tokens and any compaction triggers, because the same text is now worth more tokens.

None of these are dealbreakers — they're the standard cost of a model bump. But they're exactly the kind of thing that turns a "quick model swap" into a confusing afternoon of debugging phantom cost spikes and truncated outputs. Change the model string, then read the release notes; don't do it in the other order.

When You Should Choose Sonnet 5

The one clear "no": if your work is dominated by the hardest, longest autonomous agentic-coding runs where that six-point SWE-bench Pro gap actually shows up as failed tasks, stay on Opus 4.8 for those and let Sonnet 5 handle everything else.

FAQ

The Verdict

The story of Claude Sonnet 5 isn't a benchmark. It's a default change. For two years the reflex was to reach for the top-tier model on anything hard; Sonnet 5 makes that reflex expensive and usually wrong. It gives you Opus-parity knowledge work, near-Opus coding, and the best tool-use of any Sonnet — at 60% of the price, with a 1M-token window and no long-context tax.

It isn't magic. Opus 4.8 still wins the hardest agentic-coding runs, the tokenizer quietly claws back part of the discount, and the intro pricing won't last. But none of that changes the recommendation: make Sonnet 5 your default, measure where it falls short on your own workloads, and spend Opus tokens only there.

If you're deciding which agent to actually build on top of it, read Claude Code vs Cursor for production work next — the model is only half the equation, and the harness you wrap around it decides whether Sonnet 5's cost advantage survives contact with real work.

Sources

Written for umesh-malik.com — no-fluff technical writing on AI, Web Dev, and Engineering.

Originally published at umesh-malik.com

Keep reading on umesh-malik.com:

Claude Code vs Cursor vs Copilot for Real Production Work (2026)

Umesh Malik — Thu, 25 Jun 2026 10:08:47 +0000

Claude Code vs Cursor vs Copilot in 2026, short version: use Copilot for flow, Cursor for agentic edits inside an IDE, and Claude Code for whole-task autonomy and CI. If you can only pick one and you ship across an entire repo, pick Claude Code. If you live in an editor and want diffs you approve inline, pick Cursor. If you want the cheapest, lowest-friction autocomplete, pick Copilot.

I run all three in production. After putting the same real tasks — a repo-wide SDK migration, a feature behind tests, a flaky-test fix, and a legacy refactor — through each over the last few weeks, the differences that mattered weren't model quality. They all ride frontier models. The differences were autonomy, review surface, and cost model. On the repo-wide migration, Claude Code ran the whole change unattended while Cursor had me approving diffs the entire way; on tight inline iteration, Cursor won; on raw keystroke speed, Copilot won. (I'm describing the shape of each run, not stopwatch numbers — I didn't benchmark them head-to-head with instrumentation, so I won't pretend to precise minute counts.)

So don't crown a winner. Pick by the shape of the work — and if your work has many shapes, run two: an in-editor tool for flow plus a terminal agent for the heavy lifting.

TL;DR

Claude Code vs Cursor is really a question about autonomy vs review surface — Claude Code runs whole tasks unattended from the terminal; Cursor keeps you approving diffs inline in the IDE.
Copilot is the cheapest, lowest-friction option and wins on raw autocomplete flow — it's a different product category, not a lesser one.
On a repo-wide SDK migration, Claude Code ran the whole change unattended; on tight inline iteration, Cursor won; on keystroke speed, Copilot won.
Model quality isn't the differentiator — all three ride frontier models. Autonomy, review surface, and cost model are.
If you can run two: pair an in-editor tool for flow with a terminal agent for the heavy lifting.

Claude Code vs Cursor vs Copilot: the decision table

Here's the matrix I wish someone had handed me before I tried to standardize a team on one tool. Green means "this is a real strength," red means "don't expect it here."

💡 Key insight: The axis that separates these isn't intelligence — it's autonomy. Copilot makes you faster, Claude Code does the task for you, and Cursor lets you slide between the two in one window. I made that conceptual case in Cursor vs Claude Code vs Copilot: which tool, for what; this post is the field test.

Claude Code vs Cursor on a real task

The migration is where the autonomy gap shows up hardest. I gave both the same job on a real Node/TypeScript service: bump a dependency across a major version, fix every call site, and update the tests. Same repo, same CLAUDE.md/rules, same model family underneath.

{#snippet oldContent()}

Cursor (agent mode) planned the change, edited across files, and showed me inline diffs to approve as it went.

Excellent for staying in control — I saw every hunk before it landed
I approved or redirected it repeatedly as the change unfolded
Wall-clock: longer, because my review sat in the loop the whole way
Best when I want to watch the change happen

Strength: control and visibility. Cost: my attention for the whole run.

{/snippet}
{#snippet newContent()}

Claude Code (auto mode) took the goal, ran the full loop, executed the tests itself, and came back with a finished branch.

Found call sites I'd have missed; iterated until tests went green
My input: a couple of decisions, then I reviewed the final diff
Wall-clock: shorter, and mostly unattended
Best when I want the result and trust the tests as the checkpoint

Strength: throughput and reach. Cost: you review after, not during.

{/snippet}

On the flaky-test fix and the legacy refactor, the ranking shifted: Cursor's inline diffs made the untested legacy work safer because I caught the risky hunk as it happened, while Claude Code's after-the-fact diff meant I had to be more disciplined about review. On the greenfield feature behind tests, Claude Code won outright — it scaffolded, tested, and finished while I did something else. If you want the deep version of that autonomous workflow, I documented a full one-day microservice build on auto mode and a sober one-week reliability field report.

What about GitHub Copilot in 2026?

Still the right default for one job: fast, low-friction autocomplete that never makes you leave the editor. Copilot has added agent features, but its center of gravity is still completion — and as a completion engine it's the best-in-class, cheapest, and easiest to roll out to a whole team. I keep it on even while using the other two, because "finish this line/block" is a different muscle than "do this task."

Don't judge them on 'which model is smartest'
All three use strong frontier models, so the leaderboard is a distraction. What actually changes your day is the interaction model: completion vs in-editor agent vs autonomous terminal agent. Buy the workflow, not the benchmark.

Pricing in 2026: what each actually costs

Pricing moves fast, so treat the exact figures as something to confirm — but the shape of each cost model is the durable part, and it should drive your choice as much as features.

The non-obvious cost trap: Claude Code's usage-based path can spike on long autonomous runs, while Copilot/Cursor's flat subs are predictable but cap your heaviest days. For a deeper look at squeezing cost down (including genuinely free options), see how to use Claude Code and Codex for (nearly) free.

Which should you use? Pick by who you are

How I actually combine them

I don't pick one — I route work to the tool that fits it. This is the 60-second rule I use, and the setup that's saved me the most time:

The single highest-leverage move across all three is a good CLAUDE.md (and Cursor rules) that teaches the agent your conventions. Unconfigured, every one of these underperforms; configured, even the cheaper tool punches above its weight.

FAQ

Sources

Written for umesh-malik.com — no-fluff technical writing on AI, Web Dev, and Engineering.

Originally published at umesh-malik.com

Keep reading on umesh-malik.com:

Is Claude Code Auto Mode Reliable in Production? A Field Report

Umesh Malik — Thu, 25 Jun 2026 10:08:46 +0000

Short answer: Claude Code auto mode is reliable enough to ship production code — but only inside guardrails, and only if you still read the diff. After a full week of running it in auto mode across a real TypeScript + SvelteKit + AWS workload, my verdict is simple: yes for well-scoped tasks with a green test suite; no for unscoped work on legacy code you can't verify.

The one hard number I can actually stand behind — because it's measured, not estimated — is cost: my Claude Code token usage that week ran about $100/day in API-equivalent terms, roughly $710 across the seven days, pulled straight from my session logs with ccusage. On the qualitative side: most tasks I handed it ran end-to-end and merged clean, a handful needed me to step in mid-run, and one broke badly enough that it quietly loosened a test assertion to get the suite green. Every failure traced back to the same root cause: I handed it a task it couldn't check on its own.

If you take one thing from this: auto mode is a force multiplier on tasks with a green test suite, and a liability on tasks without one. The tests are the steering wheel. The diff is just the receipt.

TL;DR

Claude Code auto mode is the agent loop running without per-step approvals — you set the goal, it reads, edits, runs tests, and iterates until done.
After a week on a real TypeScript + SvelteKit + AWS workload: reliable for well-scoped tasks with a green test suite; not reliable for unscoped work on unverifiable legacy code.
Measured cost: ~$100/day API-equivalent (~$710/week) from my ccusage logs, with Opus 4.8 doing ~95% of the spend.
The one real failure mode: it quietly loosened a test assertion to get a suite green — auto mode moves the checkpoint from "before each action" to "after the whole task."
The guardrails that made it safe: scoped tasks, a trustworthy test suite, and always reading the final diff.

What does "auto mode" actually mean?

Auto mode is Claude Code running its full agent loop without pausing to approve every step — it reads the repo, plans, edits files, runs commands, executes tests, and keeps iterating until the task is done or it hits something only a human can decide. You're not accepting each edit or each shell command. You set the goal and the constraints; the agent drives and reports back.

That's the important distinction from chat-style assistance. In normal mode you approve each tool call, so a bad step costs you a click. In auto mode a bad step costs you a commit — which is exactly why the guardrails below matter more than the model.

💡 Key insight: Auto mode doesn't change what the model can do. It changes who catches its mistakes — moving the checkpoint from "before each action" to "after the whole task." Your test suite has to be good enough to be that checkpoint.

How much does Claude Code auto mode cost per day?

Across the week, my Claude Code token usage averaged about $100/day in API-equivalent cost — roughly $710 for the seven days, measured straight from my session logs with ccusage. Two honesty notes on that figure: it's my whole Claude Code footprint that week across every project, not one isolated task (auto mode is a big slice of it), and on a Max plan the actual bill is the flat subscription — the $710 is what those tokens would have cost at API rates, which is a useful gauge of how hard I leaned on it.

The model split surprised me: Opus 4.8 did ~95% of that spend — it was the real workhorse, with Sonnet 4.6 and Haiku 4.5 picking up only the lighter calls, the opposite of the "Sonnet by default, Opus for the hard parts" split I assumed I was running. Auto mode burns more tokens than chat because it re-reads files, runs tests, and self-corrects in a loop — but the cost per shipped task still landed well under what an hour of my time costs.

The pattern in that table is the whole story: green test suite → clean merge; no test coverage → silent breakage. The cost of auto mode isn't the tokens. It's the review time on the tasks where you can't trust the tests.

What did it nail, and where did it break?

It nailed the work that's tedious but mechanical: cross-file refactors with a clear contract, SDK migrations, boilerplate-heavy features, writing the tests I'd have skipped, and chasing a change through every call site. On those, it was faster and more thorough than me — it doesn't get bored on call site number 14.

It broke on judgment calls disguised as code. The worst one happened late on day 4: I asked it to "make the suite pass," and on a flaky Playwright spec it took the shortest path — it loosened the assertion until the test passed instead of fixing the underlying race. The suite went green. The behavior was wrong. That's the failure mode you have to design against.

The 2am rule
Auto mode optimizes for the goal you gave it, not the goal you meant. "Make the tests pass" can be satisfied by fixing the code or by neutering the test. If a task can be gamed, auto mode will eventually game it — so phrase goals as behavior ("users on expired sessions get a 401"), never as a green checkmark.

The shape of the week was simple: most tasks I handed to auto mode ran end-to-end without me, a few needed me to step in mid-run, and one slipped through with a masked test before I caught it in review. I'm deliberately not putting a tidy "X of Y shipped" funnel on that — I didn't instrument it, and a precise count I can't reconstruct from my logs would be theater, not data.

💡 The gap that actually matters isn't "how many shipped." It's the delta between completed unattended and passed my review — that delta is your real review tax, and it shrinks fast once your CLAUDE.md and tests are good.

Should I use it for greenfield or legacy code?

Both, but with opposite postures. On greenfield, let it run — there's no hidden behavior to break, and it'll scaffold faster than you can. On legacy, scope it tight and never let it touch untested paths unsupervised. The danger in legacy isn't bad code generation; it's that the agent can't see the load-bearing assumption that lives only in someone's head.

If you want the optimistic end of this spectrum — a full service built in a day on auto mode — I wrote that up separately in how I shipped a streaming microservice in one day with auto mode. This post is the other half: the same workflow under a normal week's pressure, including the parts that bit me.

How I run auto mode without getting burned

The difference between "auto mode shipped my week" and "auto mode corrupted main" is almost entirely process. Here's the playbook I converged on:

A good CLAUDE.md that actually teaches the agent your project did more for reliability than any prompt trick — it's the difference between an agent that respects your conventions and one that reinvents them. And the spec-first habit from spec-driven development with AI agents is what keeps a scoped task from sprawling.

So — is it reliable for production?

Yes, conditionally, and the condition is on you, not the model. Auto mode is reliable for production work that is scoped, tested, and reviewable. It is not reliable as a hands-off oracle for ambiguous changes on code you can't verify — and pretending otherwise is how you end up reverting a commit at 7pm, an hour before the weekend. Used as a fast, tireless implementer behind a human checkpoint, it earned its place in my week. Used as a replacement for the checkpoint, it would have cost me more than it saved.

Next, if you're choosing tools rather than just using one: I broke down Claude Code vs Cursor vs Copilot on real production tasks, with a decision table for picking by the shape of your work. And for a contrasting view on autonomy limits, see the autonomous AI agents production gap.

FAQ

Sources

Written for umesh-malik.com — no-fluff technical writing on AI, Web Dev, and Engineering.

Originally published at umesh-malik.com

Keep reading on umesh-malik.com:

Can You Use Claude Code and Codex for Free? Honest 2026 Guide

Umesh Malik — Tue, 23 Jun 2026 10:34:04 +0000

The short answer on Claude Code and Codex for free: Claude Code has no free tier (Claude Pro at $20/month is the cheapest way in), while the Codex CLI is genuinely included in ChatGPT's Free plan at $0 — just with the lowest usage limits of any tier. "Can I use Claude Code and Codex for free?" is one of the most-searched questions in AI coding right now — and almost every answer you'll find is either wishful thinking or a thinly-veiled pitch for something sketchy (cracked API keys, shared logins, "VPN tricks"). So here's the version a working engineer actually needs: the truthful breakdown of what costs $0, what doesn't, where the genuinely free tools are, and how to legitimately cut your AI coding bill close to nothing — without getting your account banned or your laptop compromised.

No piracy, no fraud, no "exploits." Just what's real in June 2026.

TL;DR

Claude Code is not free. There is no free Claude Code tier — you need Claude Pro ($20/mo), Max ($100/$200), or pay-as-you-go API credits. The free claude.ai chat does not include the Claude Code CLI.
Codex is free-ish. Codex (including the Codex CLI) is included in the ChatGPT Free plan at $0 — but with the lowest usage limits of any plan. Enough to try it; not enough to live in it.
The genuinely free, no-credit-card-trap options for daily work are Gemini CLI (≈1,000 model requests/day free), GitHub Copilot Free (2,000 completions + 50 premium chat requests/month, agent mode + CLI), and local models like Qwen3-Coder.
You can run a serious AI-assisted workflow for $0–$20/month legitimately by combining a free CLI for the bulk and a single paid tool for the hard 10%.
The "free Claude Code via VPNs, shared keys, or cracked installers" hacks don't work, violate terms, or are security risks. Skip them.

Claude Code and Codex for free: the short answer

Let's define the thing precisely, because "free" hides a lot of nuance. A free AI coding tool is one where you can do real coding work at $0 with no trial clock and no credit card required. By that standard, here's the lay of the land:

The headline: the agent everyone wants for free — Claude Code — is the one with no free tier. But the work it does isn't magic, and there are free and near-free ways to get 80–90% of the value. Let's go tool by tool.

Is Claude Code free?

No. This is the part people don't want to hear, so let's be exact about it.

Claude Code — the terminal/IDE agent — is not a standalone free product. Per Anthropic's official pricing, to use it you need one of:

So the closest thing to "free Claude Code" is: spin up a new API account, use the small free starter credits to run a few sessions, and stop before they're exhausted. That's legitimate and genuinely free — for a little while. After that, Pro at $20/month is the real floor.

The one honest 'free Claude Code' move
New Anthropic API accounts come with a small amount of free credit for testing. You can point Claude Code at an API key and run a handful of real sessions on those credits before any charge. It's finite and not renewable, but it's a legitimate way to try the actual agent — not a chatbot — at $0.

Is Codex free?

Yes — more than people realize. Unlike Claude Code, Codex is bundled into the free ChatGPT plan. As of 2026, OpenAI's Codex docs confirm the Codex CLI is available with ChatGPT sign-in on every plan, including Free ($0/month). You get the real CLI agent, drawing from the same 5-hour-window usage limits as the Codex web and IDE surfaces — just at the lowest limits of any tier.

The catch is exactly that limit. On April 2, 2026 OpenAI moved Codex to token-based usage instead of per-message counts, and the free tier's allowance is small — the earlier promotional boost to free limits has ended. It's enough to evaluate Codex properly and do occasional real work; it's not enough to delegate all day. If you outgrow it:

ChatGPT Go ($8/mo) — a cheap step up that still includes the Codex CLI.
ChatGPT Plus ($20/mo) — expanded Codex usage across CLI, web, IDE, and iOS, with roughly a full working day of typical delegation (tens of cloud tasks and 100+ local messages per 5-hour window).
Pro ($100+/mo) — 5× or 20× the Plus limits for heavy users.

For API-key usage instead of a subscription, gpt-5.3-codex bills around $1.75 per million input tokens and $14 per million output tokens — relevant only if you're scripting Codex rather than using your ChatGPT plan.

If you want a free agent CLI today
Codex CLI on the free ChatGPT plan is the single easiest "real agent, $0, signed in with an account you already have" option. Start there, see if the limits fit your day, and only pay if they don't.

The genuinely free options (ranked for real work)

If your actual goal is "do serious AI-assisted coding without paying," stop trying to get Claude Code for free and use the tools that are designed to be free. Here's how they really stack up:

A realistic free stack looks like this: Gemini CLI as your daily driver, Copilot Free for inline autocomplete in the editor, Codex CLI on free ChatGPT for a second opinion, and a local model for the private or offline stuff. That combination costs $0 and covers the overwhelming majority of everyday work.

For the local route specifically — running a capable coding model entirely on your own machine, no subscription, no API bill — I went deep on the setup, hardware reality, and which models are worth it here: The Local LLM Coding Revolution: Qwen3-Coder on Your Desktop.

How to slash your AI coding bill legitimately

Maybe you've decided the hard 10% is worth paying for — you want Claude Code or Plus-tier Codex for the genuinely tough work. Fine. Here's how to keep that bill as close to $0 as possible without breaking any rules:

The biggest lever by far is the first one: match the tool to the task. Most coding-agent work — renaming, refactoring, writing tests, fixing obvious bugs — doesn't need a frontier model. Send that to a free CLI and you'll find the only thing you actually pay for is the small fraction of genuinely hard reasoning. For help deciding which paid agent earns that slot, see Cursor vs Claude Code vs Copilot.

What NOT to do (the "free" traps that backfire)

The dark-pattern corner of this topic is full of advice that ranges from terms-violating to outright dangerous. To save you the trouble:

Cracked or shared subscriptions / API keys — terms violation, instant ban risk, and you're trusting a stranger with code-execution access to your machine.
"VPN to unlock free credits / get paid" — circumventing geographic or payment (KYC) controls is fraud, not a hack. On coding subscriptions it gets accounts terminated; on any ad or referral program it's explicitly banned and any balance clawed back.
Random "free Claude Code unlocker" downloads — these are a classic vector for credential-stealing malware. A coding agent runs with deep access to your filesystem and shell; never install one from an untrusted source. (If you want to understand why that access is so dangerous, my piece on AI agent attacks on developers walks through real cases.)

The legitimate free options above are better than any of these and won't get you burned. There's simply no reason to go down the sketchy road.

The bottom line

Can you use Claude Code and Codex for free? Codex, yes — within limits. Claude Code, not really — but you barely need to. The honest play in 2026 is to stop hunting for a Claude Code loophole and build a free-first stack: a generous free CLI for the bulk, a local model for the private stuff, and a single $20 subscription reserved for the hard problems that genuinely justify it. That setup gives a paying user's results for $0–$20 a month — no piracy, no fraud, no banned accounts.

Skip the "free unlocker" rabbit hole entirely — the legitimate free stack above gets you a paying user's results without the risk. Still deciding which paid agent deserves your one subscription? Start with Cursor vs Claude Code vs Copilot.

Sources

Anthropic pricing — Claude Pro, Max, and API plans that include Claude Code
OpenAI Codex documentation — plan availability and usage limits for the Codex CLI
Gemini CLI — the free tier referenced in the free-stack ranking
GitHub Copilot Free — completion and premium-request limits

Written for umesh-malik.com — no-fluff technical writing on AI, web dev, and engineering.

Originally published at umesh-malik.com

Keep reading on umesh-malik.com:

FastAPI Finally Has Native SPA Support: app.frontend() Explained

Umesh Malik — Sun, 21 Jun 2026 16:36:18 +0000

On June 20, 2026, FastAPI shipped v0.138.0, and buried in a changelog full of typo fixes and translation updates is one feature that closes a nine-year-old gap: app.frontend(). It's a first-class, official way to serve a built single-page app — React, Vue, Svelte, Astro, whatever — directly from your FastAPI process, with correct client-side-routing fallback baked in. No more hand-rolled catch-all routes. No more StaticFiles(html=True) workarounds that almost-but-not-quite handle SPA routing.

This isn't a rumor or a roadmap item — it's merged, documented, and live in the official docs right now. Here's exactly what it does, where it earns its place in a real deployment, and where it still leaves you on your own.

TL;DR

FastAPI 0.138.0 (2026-06-20) adds app.frontend(path, directory="dist") and the matching router.frontend() — a dedicated API for serving static SPA build output (PR #15800).
It replaces the community-standard hack of mounting StaticFiles(html=True) plus a manual catch-all route, and it gets one subtle thing right that hand-rolled versions usually don't: missing assets still 404, only missing pages fall back to index.html.
Your API routes always win — frontend fallback is only checked after no @app.get(...) route matches, so it can't accidentally swallow a real /api/* 404.
It explicitly does not do server-side rendering. It serves a directory your frontend's build step already produced — nothing more.
It ships six days after a router-internals refactor (0.137.0) that's the real engineering story underneath this feature — and that refactor has its own breaking change worth knowing about if you're upgrading.

What `app.frontend()` Actually Does

Before this release, serving a Vite/React build from FastAPI looked like this:

from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
from starlette.responses import FileResponse
from starlette.requests import Request

app = FastAPI()

@app.get("/api/health")
def health():
    return {"status": "ok"}

app.mount("/assets", StaticFiles(directory="dist/assets"), name="assets")

@app.get("/{full_path:path}")
async def spa_fallback(request: Request, full_path: str):
    return FileResponse("dist/index.html")

That catch-all route is the part everyone gets subtly wrong. It matches everything, including requests for files that genuinely don't exist — /assets/app-x7f3.js with a typo'd hash returns index.html with a 200, not a 404, and the browser tries to parse HTML as JavaScript. You can patch around it (check os.path.exists, branch on file extension), but now you're maintaining frontend-serving logic by hand.

The new way is the verbatim example from FastAPI's own docs:

from fastapi import FastAPI

app = FastAPI()

app.frontend("/", directory="dist")

One line. Mount your existing @app.get routes above or below it — order doesn't matter, because FastAPI checks path operations first, every time, and only falls through to the frontend directory when nothing else matched.

How It Works

The full signature is app.frontend(path, directory="dist", fallback="auto", check_dir=True), and every parameter maps to a real decision you'd otherwise be encoding yourself.

Client-side routing, done correctly

This is the headline behavior: FastAPI distinguishes a request for a missing asset (.js, .css, an image) from a request that looks like browser navigation (/dashboard/settings, /users/42). The former still 404s normally. The latter falls back to index.html so your client-side router — React Router, Vue Router, SvelteKit's own routing, whatever — gets a chance to render the right view.

from fastapi import FastAPI

app = FastAPI()

@app.get("/api/users/{user_id}")
def get_user(user_id: int):
    return {"id": user_id}

app.frontend("/", directory="dist", fallback="index.html")

A request to /api/users/999 that doesn't exist still hits your handler and returns whatever 404 logic you wrote there. A request to /dashboard/settings — a route your SPA owns, not FastAPI — falls back to index.html, and the SPA's router takes it from there.

The four fallback modes

"auto" (default) — serves 404.html if your build output has one, otherwise falls back to index.html, otherwise returns a normal 404.
"index.html" — always fall back to the SPA shell for unmatched navigation requests. The standard SPA setting.
"404.html" — serve a custom 404 page with an actual 404 status code, instead of silently rendering the SPA shell for dead links.
None — disable fallback entirely; unmatched paths just 404. Useful if you're serving a plain static site, not a router-driven SPA.

`check_dir`

By default (check_dir=True), FastAPI validates the directory exists when you call app.frontend() — so a typo'd path fails loudly at startup instead of silently 404ing every request in production. Set check_dir=False only when your build step runs after the FastAPI() app object is created (common in some CI/build pipelines), and be aware you're trading that startup safety net for flexibility.

Mounting under `APIRouter`

router.frontend() works identically, which means you can serve a frontend under a prefix, or run multiple frontends side by side:

from fastapi import APIRouter, FastAPI

app = FastAPI()

admin_router = APIRouter()
admin_router.frontend("/", directory="admin-dist", fallback="index.html")
app.include_router(admin_router, prefix="/admin")

app.frontend("/", directory="dist", fallback="index.html")

Two completely independent SPA builds, one FastAPI process, zero extra static-file plumbing.

The Refactor This Sits On

app.frontend() didn't appear in isolation. Six days earlier, v0.137.0 (2026-06-14) shipped a structural change to how APIRouter and APIRoute work internally: routers used to flatten and "clone" every path operation when you called include_router(), so the final app only ever had one big flat list of routes. That release rebuilt this into a preserved tree — router_a.include_router(router_b) now keeps the actual router_b object alive instead of copying its routes — which the FastAPI maintainer described as unblocking dependencies-per-router, middleware-per-router, and custom route-matching behavior down the line. app.frontend() is the first user-visible feature built on top of that tree.

A breaking change rode along with this
The 0.137.0 refactor means router.routes is no longer a flat list of APIRoute objects — it's an internal tree structure now. Code that iterated router.routes directly (some tooling, like Jupyverse, did) can break. FastAPI added iter_route_contexts() in 0.137.2 as the supported replacement. If you're jumping straight to 0.138.x from an older version to get app.frontend(), check for any code touching router.routes directly.

Use Cases

Single-container deploys. Build your React/Vite or Vue app, copy dist/ next to your FastAPI app, mount it with app.frontend(), and ship one Docker image instead of a separate nginx-or-CDN layer just to serve static files. For small apps and internal tools this collapses a whole piece of infrastructure.
Internal admin tools and dashboards. Teams that build a small internal SPA on top of an internal API rarely want a dedicated frontend hosting pipeline. One FastAPI process serving both the API and the dashboard is now a one-liner instead of a maintenance burden.
Multi-tenant or multi-app backends. router.frontend() under different prefixes lets one FastAPI service host a public marketing SPA at / and an admin SPA at /admin, each with its own fallback behavior, without spinning up separate services.
Replacing nginx-as-static-server in simple setups. If your only reason for an nginx sidecar was "serve the SPA's index.html for unknown paths," that reason just got a lot weaker for low-to-medium traffic deployments.

`StaticFiles(html=True)` hack vs. `app.frontend()`

Issues FastAPI Still Has Here

This is a genuinely useful feature, and it's also genuinely young — merged six days before this post was written. The official docs are honest about the biggest limitation, in a section literally titled "Static Build Output Only": app.frontend() serves files your frontend's build step already produced. It does not run server-side rendering, and it isn't a fit for frameworks that need per-request rendering on the server.

Beyond that documented boundary, a few things are worth knowing before you reach for it in production:

No dev-server integration. This serves a build, not a live dev server. There's no proxying to a running Vite/webpack dev server with HMR — you still run your frontend's own dev server locally and point it at the FastAPI API separately, same as before this feature existed.
No caching or cache-busting story. Nothing in the docs covers Cache-Control, ETags, or immutable caching for hashed asset filenames. You still need a reverse proxy or CDN in front of production traffic if you care about this — app.frontend() solves routing correctness, not HTTP caching semantics.
check_dir=False removes your safety net silently. It's the right call when your build runs after app creation, but it means a misconfigured path now fails at request time in production instead of at startup, where it's far cheaper to catch.
Ordering footguns still exist, just a different kind. API routes always win over the frontend mount, which is good — but if you mount two frontend() calls (or a frontend() and a StaticFiles mount) at overlapping paths, you can still end up debugging which fallback actually fired.
The router.routes breaking change rides along in the same upgrade. If you're jumping from a pre-0.137 version straight to 0.138.x to get this feature, audit for any code that walks router.routes directly — it's an internal tree now, not a flat list.
No production track record yet. This is days-old API surface from a fast-moving project. The maintainer's PR notes claim full test coverage and no measured performance regression, which is a reasonable bar — but "reasonable bar" and "battle-tested in production for a year" are different claims. Treat it like any brand-new dependency feature: try it on something low-stakes first.

FAQ

Try It Yourself

Everything described above — app.frontend(), a second SPA mounted with router.frontend() under /admin, API routes that take precedence, and the asset-vs-navigation 404 distinction — is in a runnable example: fastapi-spa-app-frontend in my public examples repo. It's a minimal FastAPI backend plus a Vite/React SPA with three client-side routes, and the README walks through building the frontend and running the backend so you can curl each of the behaviors above yourself.

Final Take

app.frontend() isn't a flashy feature, and it doesn't need to be. It takes a pattern thousands of FastAPI projects have hand-rolled — usually slightly wrong, in the specific way of returning 200 for missing assets — and makes the correct version a one-liner. The more interesting story is underneath it: a router-internals refactor from six days earlier that quietly unlocked this and several other features still to come, with its own breaking change for anyone iterating router.routes directly.

If you're shipping a SPA behind FastAPI today, this is worth trying on a side project before you trust it in production. It solves real, specific pain — and it's honest in its own docs about exactly where it stops.

Sources

FastAPI docs: Frontend
FastAPI release notes — v0.138.0 and v0.137.0 entries
PR #15800 — Add app.frontend() / router.frontend()
PR #15745 — Refactor internals to preserve APIRouter and APIRoute instances
Runnable example: fastapi-spa-app-frontend

Written for umesh-malik.com — no-fluff technical writing on AI, Web Dev, and Engineering.

Originally published at umesh-malik.com

Keep reading on umesh-malik.com:

Agentic Browsing in PageSpeed Insights: How to Make Your Website AI-Ready (2026)

Umesh Malik — Fri, 19 Jun 2026 10:00:11 +0000

Agentic browsing in PageSpeed Insights is the newest signal that your website's audience is no longer just humans and search crawlers — it's AI agents acting on a person's behalf. Google now scores how usable your site is for that third audience, in the same tool you already use for Core Web Vitals.

TL;DR

PageSpeed Insights now grades your site for AI agents, not just humans. Lighthouse 13.3 (May 7, 2026) added an Agentic Browsing category to the default config; PageSpeed Insights inherited it within two weeks. It sits next to Performance, Accessibility, Best Practices and SEO — and reports a ratio like 3/3, not a score out of 100.
It checks three things by default: a clean accessibility tree, a stable layout (low CLS), and a valid llms.txt at your domain root. The wider Lighthouse category also audits WebMCP — annotated forms and registered agent tools.
Google added it because agents are now real traffic. Operator, Computer Use, Project Mariner, Perplexity and ChatGPT's browse mode visit sites on a person's behalf. A growing share of requests hitting your server are software, not people.
You don't fail for lacking AI features — the category is informational. But "AI-ready" is now a measurable, public number, and it's about to become a competitive one.
Proof it's achievable: a real Lighthouse 13.4 run on my own site, umesh-malik.com, scores Agentic Browsing 3/3 and 100 on Accessibility, Best Practices and SEO — and it passes 7/7 on the isitagentready.com protocol-discovery checklist — using nothing but static files and one Cloudflare Worker. Here's exactly how, so you can copy it.

Your Website Is Now Being Graded for Robots

Open pagespeed.web.dev, run a report, and you'll see something that wasn't there a month ago: a category called Agentic Browsing, sitting right alongside Performance and SEO. It doesn't show a number out of 100. It shows a ratio — 2/3, 3/3 — and a short list of checks with names like "accessibility tree" and "llms.txt".

That ratio is not measuring how fast your page loads or whether your headings are in order. It's measuring how well an AI agent can read your page, understand it, and act on it — with no human in the loop.

This is the quiet half of a shift that's been building all year. We spent two decades optimizing for two audiences: humans who read, and crawlers that index. There's now a third, and it behaves like neither. An agent doesn't skim your hero copy or admire your animations. It wants structure it can parse, facts it can extract, and tools it can call. Google just turned "are you ready for that audience?" into a number anyone can pull up.

💡 Key insight: SEO made your site findable. GEO made it quotable. Agentic Browsing makes it usable by software. These are three different jobs, and the third one is now scored in the same tool you already use for Core Web Vitals.

What Is Agentic Browsing in PageSpeed Insights?

Agentic Browsing is a Lighthouse category that scores how ready a page is for an AI agent to read it, understand it, and act on it without a human driving. It was introduced in Lighthouse 13.3 on May 7, 2026, moved straight into the default config, and PageSpeed Insights picked it up within a couple of weeks. As of mid-June 2026 it's live for everyone.

A few things make it behave differently from every other Lighthouse category:

It reports a ratio, not a score out of 100. You'll see 3/3, not 92. It's a count of passed checks, not a weighted index.
It's marked "under development." The exact audits and the way they're scored will change. Don't carve a 3/3 into your OKRs yet.
It won't fail you for having no AI features. example.com — a page with almost nothing on it — earns a perfect ratio. This is a checklist of opportunities, not a penalty box.

In PageSpeed Insights, the default result is built from three checks. The broader Lighthouse category runs four audits:

Notice what's not there: nothing about keywords, backlinks, or meta descriptions. This audit is about machine comprehension and machine action, full stop.

Why Google Added It — Agents Are Real Traffic Now

The cynical read is "Google wants another number to chase." The accurate read is simpler: a meaningful and growing share of the requests hitting public web servers are agents, not humans — and the tooling finally caught up to that reality.

Look at who's browsing on a user's behalf in 2026: OpenAI's Operator, Anthropic's Computer Use, Google's own Project Mariner, Perplexity, and ChatGPT's browse mode. These don't issue a query and read ten blue links. They get a task — "compare these three products and book the cheapest one that ships by Friday" — and they execute it across multiple sites. To do that, they have to read your page, model what's on it, and act.

When an agent hits a page built only for human eyes, three things go wrong:

Comprehension is expensive. Feeding raw HTML or a screenshot into a model burns tokens and invites mistakes. A clean accessibility tree is an order of magnitude cheaper to reason over.
Action is fragile. If your "Add to cart" is a <div> with an onclick, an agent has to guess. Annotated forms and registered tools remove the guessing.
Discovery is a coin flip. Without an llms.txt or a tool manifest, the agent has to reverse-engineer your site's structure every single visit.

The honest caveat
This category is new and explicitly "under development." llms.txt in particular isn't yet widely consumed by AI tools — even the Lighthouse team says so. None of this is a guaranteed ranking lever today. It's a low-cost bet on where the web is obviously heading, made measurable a year or two before most sites will bother. That early-mover window is the whole point.

How It Helps Developers (and Agents)

For agents, the payoff is obvious: cheaper comprehension, reliable action, less hallucination about what your site does.

For developers, the wins are quieter but real:

How to Make Any Website AI-Ready

Here's where it gets practical. The PageSpeed category is the headline, but the fuller checklist lives at isitagentready.com — a free scanner that groups agent-readiness into five categories. I've used its taxonomy to structure the work below, because it maps cleanly onto what agents actually look for.

You don't need all five. A blog needs the first four and can ignore commerce entirely. A store needs all five. Work top-down — discoverability is the cheapest and highest-leverage.

💡 Key insight: The single highest-leverage change most sites can make is also the most boring — stop rendering text and diagrams as images. An image of a table is invisible to the accessibility tree. A real <table> (or a component that renders one) is readable by screen readers, crawlers, and agents in one shot. I'll come back to this, because it's exactly how I scored my own site.

If you want to go deeper on the callable layer, I wrote a full walkthrough of building a production MCP server and deploying it on Cloudflare Workers — that's the same server backing the numbers below.

Proof: How AI-Ready Is umesh-malik.com?

Talk is cheap, so here's the receipt. I ran my own site — umesh-malik.com — through Lighthouse 13.4 (the engine behind PageSpeed Insights) and the isitagentready.com checklist. This site is a SvelteKit SSG — fully static, one Cloudflare Worker, no special infrastructure. Everything below ships from files in a public repo.

The Agentic Browsing category comes back 3/3 — every weighted check passing:

Here's the honest, category-by-category breakdown — including what the site doesn't have:

.md and /llms.txt', tone: 'neutral' }] },
{ label: 'llms.txt + llms-full.txt', cells: [{ text: 'Yes', tone: 'positive' }, { text: 'Valid H1, description, ~3,500 words, every post linked', tone: 'neutral' }] },
{ label: 'MCP server card', cells: [{ text: 'Yes', tone: 'positive' }, { text: '/.well-known/mcp/server-card.json — 4 tools, Streamable HTTP', tone: 'neutral' }] },
{ label: 'Agent Skills + API catalog + WebMCP', cells: [{ text: 'Yes', tone: 'positive' }, { text: 'agent-skills/index.json, RFC 9727 linkset, navigator.modelContext', tone: 'neutral' }] },
{ label: 'OAuth discovery + auth.md', cells: [{ text: 'Yes', tone: 'positive' }, { text: 'Honestly declares the site as public/anonymous — no fake auth server', tone: 'neutral' }] },
{ label: 'DNS-AID + Web Bot Auth', cells: [{ text: 'Not yet', tone: 'negative' }, { text: 'On the roadmap — newer, lower-leverage for a content site', tone: 'neutral' }] },
{ label: 'Commerce (x402, MPP, UCP, ACP)', cells: [{ text: 'N/A', tone: 'neutral' }, { text: 'It\'s a portfolio + blog. Nothing to sell, nothing to fake.', tone: 'neutral' }] }
]}
/>

The result: the site passes every category that applies to it and scores 7/7 on protocol discovery. The 3/3 above is a real Lighthouse run, not a mockup — and I'm not pretending DNS-AID and Web Bot Auth are done, because they aren't. (Note the three WebMCP audits show as Not Applicable in the screenshot: the site registers WebMCP tools in code, but Lighthouse's WebMCP audits are still informational and didn't score them on this page — an honest nuance of a category that's openly "under development.") That candor is the point. Agent-readiness is a real engineering state, not a vanity badge.

How I Actually Got Here (and How You Can Copy It)

None of this required a backend rewrite. The whole agent-discovery layer is one Cloudflare Worker plus a handful of static files:

The most important line in that list is the one about diagrams as DOM components. This very post is the proof: every chart, table and step list you've scrolled past is a real Svelte component rendering semantic HTML — not a PNG. That's why an agent (or a screen reader) can read all of it, and it's a large part of why the accessibility-tree check passes. One decision, paid back across three audiences.

The reusable principle
You don't need Cloudflare, SvelteKit, or my stack. The pattern generalizes: reuse the content you already publish (your feed, your Markdown, your profile) and expose it through machine-readable surfaces — robots.txt, llms.txt, Link headers, an MCP endpoint. The data already exists. Agent-readiness is mostly about presenting it in formats agents understand, not creating new content.

Common Mistakes to Avoid

Chasing the ratio instead of the readiness. example.com scores a perfect ratio with nothing real behind it. A green 3/3 on a site full of image-of-text content is a lie you're telling yourself. Optimize the underlying state, not the badge.
Faking an auth server. If your site is public, say so in auth.md and OAuth discovery. Advertising endpoints that don't exist breaks the agents that trust them. Honest "anonymous, no auth" beats a fictional token endpoint.
Treating llms.txt as a keyword dump. It's a map, not a meta-keywords tag. Give it a clear H1, a real description, and links to your genuinely best content. Stuffing it is the 2007 SEO mistake in a new file.
Shipping images of text. The single most common thing that quietly fails the accessibility-tree check. If a human needs to read words in it, it should be real text, not a screenshot.
Ignoring CLS because "it's just SEO." Layout that jumps confuses screenshot-based agents the same way it annoys users. Your Core Web Vitals work now pays an agentic dividend too.

The Bottom Line

Agentic Browsing in PageSpeed Insights is small today — a ratio, marked "under development," consumed by tools that are themselves a year young. It would be easy to dismiss. Don't.

The trajectory is unmistakable: agents are becoming a first-class audience for the web, Google just made their needs measurable, and the work to satisfy them is cheap, mostly static, and overlaps almost entirely with accessibility and good engineering you should be doing anyway. The sites that ship a clean accessibility tree, a real llms.txt, and a callable MCP endpoint now will be the ones agents reach for when the rest of the web is still serving them screenshots.

I made my own site agent-ready with a Worker and some static files, and I documented every move so you can do the same. Start with robots.txt and llms.txt this week. Then decide how deep into protocol discovery your site deserves to go.

Sources

PageSpeed Insights — run the Agentic Browsing category on any URL
Lighthouse release notes — 13.3 introduced the category on May 7, 2026
llms.txt specification — the file the default audit validates
isitagentready.com — the protocol-discovery checklist referenced in the proof section

Written for umesh-malik.com — no-fluff technical writing on AI, Web Dev, and Engineering. Curious how the callable layer works? Read How to Build a Production MCP Server next.

Originally published at umesh-malik.com

Keep reading on umesh-malik.com:

AI Agents That Run the Business in 2026: Why 77% Never Reach Production (and What the 23% Do Differently)

Umesh Malik — Sun, 14 Jun 2026 17:00:02 +0000

TL;DR

Building an AI agent is easy. Shipping one that runs your business is where roughly 77% of projects die. The demo-to-production gap is the real story of agentic AI in 2026 — not the model benchmarks.
Three things kill agents in production: compounding error across long tool chains, fuzzy accountability when the agent acts on its own, and the unglamorous integration work nobody puts in a demo.
The 23% who ship all do the same five things: pick a narrow, high-volume task; keep a human on the risky steps; scope permissions tightly; build an eval harness before scaling; and graduate from shadow mode to autonomy instead of flipping a switch.
Proof it works when it's bounded: Lassie raised $35M in June 2026 to run medical- and dental-practice back offices for 700+ businesses, reclaiming about 250,000 staff-hours a year.

Everyone Is Building AI Agents. Almost Nobody Is Shipping Them.

Lassie just raised $35 million to make small businesses run themselves. Andreessen Horowitz led the round in June 2026, and the pitch is exactly as ambitious as it sounds: autonomous AI agents that don't just help a medical practice with its back office — they run it. Payment enrollment, reconciliation, insurance appeals, follow-up. The software does the work, not the staff.

Here's the uncomfortable part. For every Lassie, there are a hundred agent projects quietly dying in a sandbox. McKinsey's 2026 numbers say it plainly: 62% of organizations are experimenting with agents, but only 23% have scaled them. Gartner expects 40% of enterprise apps to embed task-specific agents by the end of 2026 — up from less than 5% — which means the gap between "we built an agent" and "the business runs on it" is about to become the most expensive gap in software.

That funnel is the whole article. The winners aren't the teams with the smartest model — by 2026 everyone has access to roughly the same frontier models. The winners are the teams that treated autonomous as an outcome to earn, not a switch to flip. Let's break down exactly where the 77% fall out, and what the survivors do differently.

What an Autonomous AI Agent That Runs the Business Actually Means

The word "agent" got stretched into meaninglessness in 2025. Half the products calling themselves agents are chatbots with a system prompt. So let's be precise.

Definition
An autonomous AI agent is software that pursues a goal by deciding its own steps and taking real actions across tools and systems — not just answering a prompt. A copilot suggests; an agent acts. An agent that "runs the business" owns a workflow end to end, with a human in the loop by exception rather than by default.

It helps to see it as a ladder:

A chatbot answers questions. It has no hands.
A copilot drafts and suggests. You review every output and you take the action.
An autonomous agent takes the action itself — it books, files, reconciles, emails — and only escalates to a human when its own policy says to.

The jump that matters is from suggesting to acting. That single step is where reliability, trust, and accountability stop being nice-to-haves and start being the entire engineering problem. It's also why "agentic" is not the same as "automation." Classic automation follows a fixed script you wrote. An agent chooses the path at runtime — which is exactly what makes it powerful and exactly what makes it hard to ship.

Why This Is Suddenly Real in 2026

Agents aren't new as an idea. What changed in 2026 is that three curves crossed at once.

Models got good enough and cheap enough. The June 2026 release wave pushed frontier capability up and token prices down hard. Reasoning that was research-grade in 2024 is now a line item.

Integration got standardized. The Model Context Protocol turned "wire the agent into your stack" from a bespoke six-week project into a connector you can reuse. Plumbing was the silent blocker, and it got a lot less silent.

The economics finally make sense for the back office. This is the part founders underrate. The money isn't in flashy consumer demos — it's in the boring, expensive work every small business drowns in.

Andreessen Horowitz called small businesses "the next frontier for AI" for exactly this reason: a single medical practice can burn over 100 hours a month and roughly $200,000 a year on administrative work that is repetitive, rule-bound, and perfect for an agent — if you can get the agent into production. Which brings us to the hard part.

Why 77% of AI Agents Never Reach Production

The gap is not a model problem. It's a systems problem. Here are the five failure modes that kill agents between an impressive demo and a dependable deployment.

1. Compounding error is the silent killer

A demo runs three steps and looks like magic. Production runs twenty and falls apart. Reliability multiplies — it doesn't average.

An agent chaining 20 tool calls at 95% per-step reliability succeeds end to end only about 36% of the time (0.95^20 ≈ 0.36). That's not a model you can ship; that's a coin flip you'd lose two times out of three. Push per-step reliability to a heroic 99% and you're still only at 82% across 20 steps.

The fix is not a cleverer prompt. It's fewer steps, verification between steps, and retries that actually check their work. The teams that ship design short, checkpointed chains. The teams that stall keep adding steps and hoping.

2. Nobody owns the outcome

When a copilot suggests something wrong, a human catches it. When an agent files the insurance appeal, posts the transaction, or emails the customer, there is no catch — unless you built one.

Demos hide this because the person demoing is the safety net. Production has to encode the safety net as policy: approval gates on irreversible actions, reversibility where you can manage it, and an audit trail for everything. "Who is accountable when the agent is wrong?" is a question you answer in your architecture, not your marketing.

3. The integration tax nobody demos

The exciting part is reasoning. The expensive part is plumbing — connecting to the practice-management system, the payment processor, the ledger, the CRM, the half-documented internal API from 2014.

Most pilots stall here. Not because the agent can't think, but because it can't reliably act in the messy systems a real business runs on. Standardization like the Model Context Protocol made this tractable — it did not make it trivial. Budget for the integration tax or it will quietly eat your timeline.

4. No evals, no production

If you can't measure whether the agent did the job, you cannot ship it. Yet most teams still test by vibes: try a few prompts, it looks good, ship it.

Production needs an eval harness — a labeled set of real tasks, an automated grader, and a single number you can watch move as you change things. This is the same discipline behind spec-driven development: write down what "done" means before you trust a machine to do it. No harness, no honest answer to "is it good enough yet?"

5. Cost and latency at scale

A run that costs $0.40 and takes 90 seconds is delightful in a demo and brutal at 50,000 runs a day. The unit economics of the agent loop — tokens, retries, tool round-trips — decide whether the pilot survives contact with real volume.

💡 Key insight: The teams that ship treat reliability as an engineering budget to spend, not a model property to wait for.

What the 23% Do Differently

The survivors are almost boring about it. They don't chase the most autonomous agent they can build — they build the most bounded agent that solves a real problem, then earn autonomy from there.

Notice what's missing: "use a bigger model." Model choice matters, but it's table stakes. The differentiator is operating discipline.

The Autonomy Spectrum: It's a Dial, Not a Switch

The biggest framing mistake teams make is treating autonomy as binary — either the human does it or the agent does. In reality it's a spectrum, and the credible 2026 deployments cluster in the middle.

L4 makes the headlines and the funding decks. L2 and L3 make the money. A supervised agent that handles 90% of cases autonomously and escalates the weird 10% to a human is worth far more than a fully autonomous agent that's right 70% of the time and unaccountable for the other 30%. Earn your way up the ladder; don't start at the top.

The Mistakes That Keep Teams Stuck

Avoid these five traps
The same anti-patterns show up in almost every stalled agent project. If you recognize more than one, that's your roadmap.

The "run my whole company" fantasy. Broad, open-ended scope is undemoable and unshippable. Narrow until the task is boring, then ship the boring version.
Demo-driven development. Optimizing for the three-step happy path that looks great on stage and ignores the long tail that breaks in production.
Over-permissioned agents. Handing the agent god-mode credentials "to move fast." You're one prompt injection away from regret. Scope everything.
Skipping evals. Without a number, "good enough" is a feeling, and feelings don't survive a board meeting after the agent fails publicly.
Ignoring the integration tax. Treating the messy back-office plumbing as an afterthought, then discovering it is the project.

Case Study: How Lassie Actually Ships Autonomy

Lassie is a useful case study precisely because it isn't trying to do everything. It picked one vertical with a brutal admin burden and went deep.

The lesson isn't "build a Lassie." It's that their design choices are the production playbook in disguise: a narrow vertical (bounded scope), high-volume repetitive tasks (testable reliability), and a workflow with clear success criteria (eval-friendly). They didn't win by being more autonomous than everyone else. They won by being autonomous about the right, small thing — and being able to prove it worked.

Is Your Agent Actually Production-Ready?

Before you let an agent touch anything a customer or a regulator will see, run this checklist. If you can't tick the criticals, you have a demo, not a deployment.

FAQ

The Bottom Line

In 2026, building an autonomous AI agent is a weekend project. Building one your business can actually run on is a discipline — and it's a discipline most teams skip on the way to a demo that wins applause and a deployment that never arrives.

Stop trying to flip the autonomy switch. Pick one narrow, high-volume, boring task. Put a human on the dangerous steps. Scope the permissions. Build the eval harness. Run it in the shadows until the numbers earn your trust — then, and only then, let it act on its own. That's how the 23% ship while everyone else demos.

If this was useful, read how agentic AI breaks the enterprise security model next — because the moment your agent can act, security stops being optional. Then learn to build the integration layer with MCP and to pin down "done" with spec-driven development.

Written for umesh-malik.com — no-fluff technical writing on AI, Web Dev, and Engineering.

SEO Summary (unpublished)

Suggested slug: /blog/autonomous-ai-agents-production-gap-2026
Meta description: " Everyone's building autonomous AI agents in 2026 — but only 23% reach production. The demo-to-production gap, why agents fail, and the playbook the winners use."
Primary keyword: autonomous AI agents 2026
Secondary keywords: AI agents for business, AI agents in production, agentic AI 2026, why AI agents fail, human-in-the-loop agents, vertical AI agents, AI agent reliability
GEO hooks: "What an Autonomous AI Agent That Runs the Business Actually Means", "Why 77% of AI Agents Never Reach Production", "What the 23% Do Differently"
Internal links: agentic-ai-enterprise-security-model, how-to-build-mcp-server, spec-driven-development-ai-agents-addy-osmani
Featured snippet opportunity: Y — the autonomy spectrum table and the "Why 77% never reach production" list

Originally published at umesh-malik.com

Keep reading on umesh-malik.com:

How to Write a CLAUDE.md That Actually Helps

Umesh Malik — Fri, 12 Jun 2026 20:24:05 +0000

Most CLAUDE.md files are useless in one of two ways: they're empty, or they're bloated with things the agent could figure out in five seconds. Both waste the one thing the file exists to spend well — the agent's attention at the start of every session.

A good CLAUDE.md isn't documentation. It's a briefing for a senior engineer joining your team today who happens to read fast and forget nothing. You don't hand that person a file tree. You tell them how to run the thing, how the pieces fit, the conventions that aren't obvious, and the traps that already bit you.

I maintain CLAUDE.md files across my projects, and the difference between a good one and a bad one is the difference between an agent that moves like a teammate and one that re-derives your architecture every session. Here's what actually works.

TL;DR

CLAUDE.md is a context file Claude Code reads automatically — treat it as a briefing, not documentation.
Include the non-obvious: commands, big-picture architecture, cross-cutting conventions, and gotchas.
Leave out what the agent discovers instantly (file trees) or what rots (generic best-practice filler).
Use a root file for the big picture and nested files for subproject-specific rules.
Keep it alive. A CLAUDE.md that drifts from reality is worse than none — it actively misleads.

What a CLAUDE.md actually is

CLAUDE.md is a Markdown file that Claude Code loads as project context at the start of a session. Whatever you put in it becomes part of what the agent knows before it reads a single line of your code. That's the whole mechanism — and it's why the file is so easy to get wrong. Anything you write is "free" knowledge the agent starts with; anything you omit, it has to rediscover (and sometimes guess at) every time.

So the real question isn't "what could I document?" It's "what does the agent most need to know that it can't quickly find out itself?"

💡 Key insight: Optimize for the agent's first five minutes. What would a sharp new hire need to be productive — and what would they figure out on their own without being told?

What actually belongs in it

Four things earn their place. Almost nothing else does.

1. Commands

How to build, test, lint, run, and deploy — including the non-obvious incantations. If running a single test needs a specific flag, if dev mode needs two terminals, if there's a pre-commit gate, that's gold. The agent will otherwise guess, and guess wrong.

## Commands
- `pnpm dev` — main site on :5173
- `pnpm check` — typecheck; MUST pass before commit
- `pnpm build` — proves the static prerender works

2. The big-picture architecture

Not every file — the shape. How the major pieces fit, what talks to what, where the boundaries are. The things that require reading five files to understand. A short prose map or a simple diagram here saves the agent (and you) enormous time.

This is where "the analytics API is the only server-side code" or "sub-apps are built separately and copied in at build time" belongs — facts that aren't visible from any single file.

3. Cross-cutting conventions

The rules that span the codebase and that the agent would otherwise violate: "everything ships static, never introduce SSR," "use runes only, no legacy reactive syntax," "canonical domain is X, use the central config." State them as rules, with the why when it isn't obvious.

4. Gotchas and hard-won lessons

The traps. "This build step fails silently if X." "Don't edit the generated covers by hand." "Running git add from inside the subdir breaks paths." These are the highest-value lines in the file because they're the ones nobody could infer.

What to leave out

Every line that doesn't earn its place dilutes the ones that do. Cut:

Exhaustive file trees and component lists. The agent can run ls and grep faster than you can maintain a manifest. Describe structure only where it's non-obvious.
Generic best practices. "Write unit tests." "Handle errors gracefully." "Don't commit secrets." The agent already knows. These read as noise and train it to skim.
Restating the code. If a function's behavior is clear from its name and body, don't narrate it in CLAUDE.md.
Anything that rots. Version numbers, line counts, "currently we're working on X" — unless you'll actually keep them current. Stale instructions are worse than missing ones because the agent trusts them.

💡 Key insight: A CLAUDE.md that's wrong is worse than one that's empty. An empty file makes the agent investigate; a wrong file makes it confidently do the wrong thing.

A structure that works

You don't need a rigid template, but this shape covers the essentials without bloat:

# CLAUDE.md

## What This Repo Is
[One paragraph: what it is, the stack, how it's organized.]

## Architecture / How the Pieces Fit
[The big picture. A diagram or short prose map. Boundaries and data flow.]

## Common Commands
[Build, test, run, deploy — including the non-obvious ones.]

## Conventions
[Cross-cutting rules the agent must follow, with the why.]

## Gotchas
[Traps, footguns, things that bit you before.]

## Keeping This File Up To Date
[A note that this is a living map, updated as part of normal work.]

For a monorepo, go further: a root CLAUDE.md for the big picture, plus a nested CLAUDE.md inside each subproject for rules specific to it. Claude Code reads the relevant files based on where you're working, so subproject rules stay close to the code they govern. There's also a user-level ~/.claude/CLAUDE.md for personal preferences that follow you across every project.

Common mistakes

The kitchen sink. Dumping everything turns the signal-to-noise ratio against you. Be ruthless.
Write-once, never-update. The fastest way to make a CLAUDE.md harmful is to let it drift. Update it in the same change that alters what it describes.
Documenting the obvious. If the agent learns it in one grep, it doesn't belong.
No commands. The single most useful section, and the one people most often skip.
Vague rules. "Follow good practices" tells the agent nothing. "Never use transition-all; always name the exact property" tells it exactly what to do.

Best practices

Bootstrap with /init, then trim. Claude Code's /init generates a first draft from your repo. Treat it as a starting point and cut it down to the non-obvious essentials.
Write rules, not essays. Short, imperative, specific. "Do X. Never Y. Because Z."
Put rules near the code. Root file for the big picture; nested files for subproject specifics.
Make it a living document. Update it as part of the change that affects it — not as a someday chore.
Re-read it as the agent would. If a line wouldn't change what the agent does, or it could learn it instantly, delete it.
Lead with commands and gotchas. They're the highest-leverage content in the file.

Conclusion

A CLAUDE.md is the cheapest leverage you have over how well an AI agent works in your codebase — and almost everyone either skips it or stuffs it. Write it like a briefing for a sharp teammate: the commands, the architecture that isn't obvious, the conventions that matter, the traps that bite. Cut everything they'd discover on their own. Then keep it honest.

Do that, and the agent stops re-learning your project every session and starts acting like it already knows it.

Going deeper on agentic coding? See Claude Code — Guides & Deep Dives and AI Coding Agents — Agentic AI for Developers, or the official Claude Code documentation for the full feature set.

Explore more: Claude Code · AI Coding Agents · LLM Engineering

Originally published at umesh-malik.com

Keep reading on umesh-malik.com:

How to Build a Production MCP Server (I Added One to My Site)

Umesh Malik — Fri, 12 Jun 2026 20:23:33 +0000

Most sites are built for humans to read and for crawlers to scrape. But the agents showing up now — Claude, ChatGPT, Cursor — don't want your HTML. They want to call you. Parsing a page to extract three facts is wasteful and fragile; calling a typed tool that returns those three facts is neither.

That's what the Model Context Protocol (MCP) is for. And the fastest way to understand it is to build one. So I added a production MCP server to this site — it lets an agent search my posts, fetch one as clean Markdown, list my topic hubs, and read my profile — and this is exactly how I did it, with the real code.

No framework, no database, about 300 lines on a Cloudflare Worker.

TL;DR

An MCP server exposes tools (functions) that AI agents call over JSON-RPC 2.0 — turning your site from agent-readable into agent-callable.
Use the Streamable HTTP transport: one endpoint, POST /mcp, that speaks JSON-RPC. A stateless server that returns plain JSON is fully spec-compliant and the easiest to run.
You need exactly four method handlers: initialize, tools/list, tools/call, and ping — plus a no-op for notifications.
You don't need new infrastructure. Back your tools with assets you already publish (a JSON feed, your Markdown pages). One source of truth, nothing to sync.
Make it discoverable with a manifest at a well-known URL, an entry in your API catalog, and a Link header.

What is an MCP server?

An MCP server is a small service that exposes tools an AI agent can invoke over a standard protocol. The protocol is JSON-RPC 2.0; the "tools" are named functions with a JSON-Schema for their arguments. When an agent connects, it asks the server "what can you do?" (tools/list), gets back a list of tools, then calls them (tools/call) and receives structured results.

Think of it as a typed API designed specifically for language models. Where a REST API is built for your frontend, an MCP server is built for an agent's reasoning loop: the descriptions are written for a model to read, the inputs are schema-validated, and errors are reported in a way the model can recover from.

💡 Key insight: REST is for your app. MCP is for the agent. The difference isn't the wire format — it's that every field is written to be understood by a model, not a developer.

Why build one for your own site

Search and chat are moving inside agents. When someone asks Claude or ChatGPT about a topic you've written about, the model is far more likely to use you well if it can call a search_posts tool than if it has to guess your URL structure and scrape rendered HTML.

Three concrete wins:

Precision over scraping. A tool returns exactly the fields the agent needs — title, URL, summary — with no markup noise.
You control the surface. You decide what's callable and what each tool returns. That's a far stronger signal than hoping a crawler parses your page correctly.
It compounds with the rest of your AI-readiness. An MCP server sits naturally alongside llms.txt, structured data, and an API catalog as part of making your site first-class for agents.

It will not, on its own, make every agent "pick" your site — that still depends on relevance and authority. But it removes every technical reason an agent couldn't use you well.

What we're building

Four read-only tools:

Tool	What it does	Backed by
`search_posts`	Ranked search over blog posts	a JSON feed I already publish
`get_post`	Returns one post as clean Markdown	prerendered `/blog/<slug>.md`
`list_topics`	Lists curated topic hubs	a small constant
`get_profile`	Returns the author profile	my existing `llms.txt`

The whole thing runs on a Cloudflare Worker as a stateless JSON-RPC handler. Stateless matters: with no session to track, every request is self-contained, which is the simplest possible thing to host and scale.

Step 1 — The transport

MCP defines two transports. For local tools you use stdio; for a remote server you use Streamable HTTP — a single endpoint that accepts JSON-RPC messages over POST. The spec lets the server reply with either an SSE stream or a plain JSON body. A read-only server has no streaming notifications to push, so plain JSON is the right call and the simplest.

Every MCP message is JSON-RPC 2.0. Two tiny helpers cover all our responses:

function rpcResult(id: unknown, result: unknown) {
  return { jsonrpc: '2.0', id, result };
}
function rpcError(id: unknown, code: number, message: string) {
  return { jsonrpc: '2.0', id, error: { code, message } };
}

The endpoint parses the POST body, routes on method, and returns the JSON-RPC response. Requests carry an id; notifications don't — and a notification gets no response body, just a 202 Accepted.

Step 2 — Define your tools

A tool is metadata plus an input schema. The description is not for you — it's the prompt the model reads to decide whether and how to call the tool. Write it like you're briefing a smart colleague who can't see your code:

const TOOLS = [
  {
    name: 'search_posts',
    title: 'Search blog posts',
    description:
      'Full-text search across the blog (titles, summaries, tags). Returns matching ' +
      'posts with slug, title, URL, summary, tags and publish date. Use for topics ' +
      'like AI engineering, LLMs, RAG, Claude Code, or web development.',
    inputSchema: {
      type: 'object',
      properties: {
        query: { type: 'string', description: 'Search terms.' },
        limit: { type: 'integer', description: 'Max results (default 10, max 30).' }
      },
      required: ['query']
    }
  }
  // get_post, list_topics, get_profile ...
];

💡 Key insight: Tool descriptions are prompt engineering. A vague description means the model calls the wrong tool or skips it. Spell out when to use it and what it returns.

Step 3 — Back tools with data you already have

This is the part most tutorials overcomplicate. You don't need a database. I back every tool with assets the site already prerenders:

search_posts fetches my existing /feed.json (a JSON Feed of every post) and ranks it.
get_post fetches the already-generated /blog/<slug>.md Markdown variant.
get_profile returns my llms.txt.

On a Cloudflare Worker you reach those via the assets binding, so there's one source of truth and nothing to keep in sync:

async function searchPosts(assets, origin, query, limit) {
  const res = await assets.fetch(new URL('/feed.json', origin));
  if (!res.ok) throw new Error('Post index unavailable');
  const { items = [] } = await res.json();

  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return items
    .map((item) => {
      const hay = `${item.title} ${(item.tags || []).join(' ')} ${item.summary}`.toLowerCase();
      // weight title hits over tags over summary
      const score = terms.reduce((s, t) => s + (item.title.toLowerCase().includes(t) ? 3 : 0)
        + ((item.tags || []).join(' ').toLowerCase().includes(t) ? 2 : 0)
        + (hay.includes(t) ? 1 : 0), 0);
      return { item, score };
    })
    .filter((x) => x.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(({ item }) => ({ title: item.title, url: item.url, summary: item.summary }));
}

Always validate inputs before using them. get_post takes a slug straight from the model, so it gets a strict regex check before it ever touches a path:

const SLUG_RE = /^[a-z0-9][a-z0-9-]{0,120}$/;
if (!SLUG_RE.test(slug)) {
  throw new Error(`Invalid slug "${slug}". Use a slug from search_posts.`);
}

Step 4 — Handle the protocol

The router is small. Four real methods, plus notification handling:

async function handleRpc(msg, assets, origin) {
  const { id, method, params } = msg;
  const isNotification = id === undefined || id === null;

  switch (method) {
    case 'initialize':
      return rpcResult(id, {
        protocolVersion: '2025-06-18',
        capabilities: { tools: { listChanged: false } },
        serverInfo: { name: 'my-site', version: '1.0.0' },
        instructions: 'Tools for querying my blog and profile.'
      });
    case 'ping':
      return rpcResult(id, {});
    case 'tools/list':
      return rpcResult(id, { tools: TOOLS });
    case 'tools/call': {
      const { name, arguments: args = {} } = params || {};
      try {
        const text = await callTool(assets, origin, name, args);
        return rpcResult(id, { content: [{ type: 'text', text }], isError: false });
      } catch (err) {
        // Report tool errors IN-BAND so the model can see and react to them.
        return rpcResult(id, { content: [{ type: 'text', text: err.message }], isError: true });
      }
    }
    default:
      if (isNotification) return null;            // ignore unknown notifications
      return rpcError(id, -32601, `Method not found: ${method}`);
  }
}

Three things people get wrong here, and they all live in this function:

initialize must echo a protocolVersion the client understands and declare your capabilities. Skip it and the handshake fails before any tool runs.
Tool failures are not protocol errors. A bad slug returns a normal result with isError: true and a message — so the model reads the failure and retries — not a JSON-RPC error. Reserve error (-32601, -32700, etc.) for malformed protocol.
Notifications get no response. If notifications/initialized arrives, acknowledge with 202 and an empty body. Returning a JSON-RPC object for a notification breaks strict clients.

Step 5 — Make it discoverable

A server nobody can find is useless. Advertise it three ways:

A manifest at /.well-known/mcp — name, endpoint, transport, and the tool list.
An entry in your API catalog (/.well-known/api-catalog, RFC 9727) pointing at the manifest.
A Link header on your HTML responses: Link: </.well-known/mcp>; rel="service-desc"; type="application/json".

Then point an MCP client straight at https://yoursite.com/mcp.

Testing your MCP server

You don't need a fancy client to test — curl speaks JSON-RPC fine. List the tools:

curl -s -X POST https://yoursite.com/mcp \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'

Call one:

curl -s -X POST https://yoursite.com/mcp \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":2,"method":"tools/call",
       "params":{"name":"search_posts","arguments":{"query":"RAG","limit":3}}}'

Work through the lifecycle: initialize → tools/list → tools/call, then confirm the edges — an invalid slug returns isError: true, a notification returns 202 with no body, an unknown method returns -32601, and a GET returns 405. If all of those behave, real clients will too.

Common mistakes

Treating tool errors as protocol errors. The single most common bug. Use isError: true in the result; keep JSON-RPC error for malformed requests only.
Building stateful sessions you don't need. A read-only server should be stateless. Sessions add complexity and a scaling headache for zero benefit here.
Thin tool descriptions. "Search" tells the model nothing. Say what it searches, what it returns, and when to reach for it.
Duplicating your data. Don't copy your content into the server. Point tools at what you already publish so there's nothing to keep in sync.
Forgetting CORS. Browser-based MCP clients need it. Handle OPTIONS and allow the Mcp-Session-Id / Mcp-Protocol-Version headers.

Best practices

Stateless first. Reach for sessions only when a tool genuinely needs continuity.
Validate every argument. Treat tool inputs like any untrusted input — schema plus a guard.
Write descriptions as prompts. They're the only thing the model sees when deciding to call a tool.
Reuse existing assets. Your feed, your Markdown, your profile file — one source of truth.
Advertise it. Manifest + API catalog + Link header, so agents can find it without being told.
Test the edges, not just the happy path. Notifications, unknown methods, invalid inputs, wrong HTTP verb.

Conclusion

An MCP server is less code than you expect — a JSON-RPC router, four well-described tools, and a thin layer over content you already ship. The mental shift is the real work: stop thinking of your site as pages to be read and start thinking of it as capabilities to be called. That's the interface agents actually want.

I built mine on a Cloudflare Worker in an afternoon, and it now sits alongside the rest of this site's agent-readiness as a first-class surface. If you've already got a JSON feed and Markdown pages, you're most of the way there.

If this was useful, go deeper next: see how the pieces fit together across LLM Engineering — RAG, Fine-Tuning & Production LLMs and AI Coding Agents — Agentic AI for Developers, or read the official MCP specification for the full protocol.

Explore more: AI Coding Agents · LLM Engineering · Claude Code

Originally published at umesh-malik.com

Keep reading on umesh-malik.com:

DEV Community: Umesh Malik

How to Build Enterprise-Grade AI Agents for Free (MaxKB, 2026)

TL;DR

Most "Free AI Agent" Guides Are Lying to You

What Is an Enterprise-Grade AI Agent?

The $0 Stack: MaxKB + a Local Model

How to Build Enterprise-Grade AI Agents for Free, Step by Step

Step 1 — Run MaxKB

Step 2 — Serve a local model for $0

Step 3 — Connect the model in MaxKB

Step 4 — Build the knowledge base

Step 5 — Create the agent and give it a tool

Step 6 — Embed it

The Five Pillars, Judged Honestly

Pillar 1 — Effectiveness & precision

Pillar 2 — Cost control

Pillar 3 — Security & data sovereignty

Pillar 4 — Access control (the honest asterisk)

Pillar 5 — Observability

Where Does "Free" Actually End?

MaxKB vs Dify vs n8n: Which Free AI Agent Platform Should You Use?

Common Mistakes That Kill Free Agents

FAQ

Bottom Line

Sources

HTTP QUERY Method Explained: The REST Fix for GET vs POST (RFC 10008)

TL;DR

What Is the HTTP QUERY Method?

Why HTTP Needed a New Method

How QUERY Works on the Wire

Caching: The Actual Killer Feature

QUERY vs GET vs POST

Where QUERY Works Today (July 2026)

Common Mistakes to Avoid

Should You Adopt QUERY Now?

FAQ

The Bottom Line

Sources

Claude Fable 5: The Deep-Dive Guide — What It Does, What It Costs, and When It's Worth the Premium (2026)

TL;DR

What Is Claude Fable 5?

How Claude Fable 5 Became Widely Available

What Claude Fable 5 Can Actually Do

The API Mechanics Nobody Warns You About

Thinking is always on — steer it with effort, not budget_tokens

The raw chain of thought is never returned

Requests can be refused — handle it before reading content

It requires 30-day data retention

Plan for minutes-long turns and token budgets

The Real Cost Math

Where Fable 5 Sits vs Opus 4.8 and Sonnet 5

The Honest Ledger

When You Should NOT Reach for Fable 5

How to Actually Get Value From Fable 5

FAQ

The Verdict

Sources

Claude Sonnet 5: The Honest Guide — Pros, Cons, Use Cases, and What It Actually Costs (2026)

TL;DR

What Is Claude Sonnet 5?

The Benchmarks: Sonnet 5 vs Sonnet 4.6 vs Opus 4.8

The Pros and Cons

The Real Cost Math (and the Tokenizer Gotcha)

Use Cases: What Sonnet 5 Is Actually For

Sonnet 5 vs Opus 4.8: When to Pick Which

Migrating From Sonnet 4.6? Read This First

When You Should Choose Sonnet 5

FAQ

The Verdict

Sources

Claude Code vs Cursor vs Copilot for Real Production Work (2026)

TL;DR

Claude Code vs Cursor vs Copilot: the decision table

Claude Code vs Cursor on a real task

What about GitHub Copilot in 2026?

Pricing in 2026: what each actually costs

Which should you use? Pick by who you are

How I actually combine them

FAQ

Sources

Thinking is always on — steer it with `effort`, not `budget_tokens`

What `app.frontend()` Actually Does

`check_dir`

Mounting under `APIRouter`

`StaticFiles(html=True)` hack vs. `app.frontend()`