DEV Community: Mudassir Khan

One AGENTS.md for every coding agent: stop maintaining CLAUDE.md and GEMINI.md separately

Mudassir Khan — Thu, 11 Jun 2026 10:32:17 +0000

You have a CLAUDE.md in your repo. You have a GEMINI.md. If you use Cursor, you have .cursorrules. They all say roughly the same thing: run pnpm test before committing, don't touch the generated files in src/generated/, keep components colocated with their tests.

Then you update one and forget the other two. The agent you're using today reads the stale file. Bugs follow.

There's a spec for this now. It's called AGENTS.md and it's worth knowing about.

The problem: one repo, five instruction files

Every coding agent has its own convention for where it looks for instructions. Claude Code reads CLAUDE.md. Codex reads AGENTS.md. Cursor reads .cursorrules. Gemini Code Assist reads GEMINI.md. Some tools read from multiple places and merge them.

The result is that a shared codebase ends up with a pile of overlapping instruction files, each slightly out of date. Whoever added the note about the database migration last month updated CLAUDE.md. Nobody updated the Cursor file. Now your teammate is using Cursor and the agent keeps trying to use the old migration pattern.

This is a solved problem in human tooling. We don't maintain separate README-for-alice.md and README-for-bob.md. We have one README. We need the same thing for agents.

What AGENTS.md is and where it came from

AGENTS.md is an open format that works as a "README for agents." It loads automatically into an agent's context and holds the information that would clutter a human README: build steps, test commands, project conventions, and things the agent should never do.

The format emerged from collaboration across the ecosystem. OpenAI Codex, Google Jules, Cursor, Amp, and Factory were all involved in the early development. A growing set of tools now supports it: Aider, goose, opencode, Zed, Warp, VS Code, and Devin are all on board.

The spec lives at agents.md and on GitHub. It's intentionally minimal. The goal is to be readable by both humans and tools, not to define a complex schema.

What goes in it (and what stays in the README)

Think of it as the context your agent needs that a human already knows from working in the repo. A good AGENTS.md usually covers:

# AGENTS.md

## Build
pnpm install
pnpm run build

## Test
pnpm test          # unit tests (Vitest)
pnpm test:e2e      # Playwright, requires local dev server

## Conventions
- Components live next to their tests: src/components/Foo/Foo.tsx + Foo.test.tsx
- Never edit files in src/generated/ — they are codegen outputs
- Use Zod for all input validation, not manual typeof checks
- CSS modules only, no inline styles in components

## Architecture notes
- Auth is handled in src/middleware.ts — do not add auth logic to route handlers
- Database access goes through src/lib/db.ts only — never import Prisma client directly

What stays in the README: human onboarding context, project history, how to get a dev environment running the first time, links to docs. That's for humans. The agent doesn't need your origin story.

The split sounds obvious once you name it. In practice, most READMEs are a mix of both, which is why agents end up reading things that don't help them and missing things that do.

Tool support across the ecosystem

Here's where things stand. This is worth bookmarking because it's moving fast:

Tool	Reads AGENTS.md
OpenAI Codex	Yes (native)
Google Jules	Yes
Cursor	Yes
Amp	Yes
Factory (Droids)	Yes
Aider	Yes
goose	Yes
opencode	Yes
Zed	Yes
Warp	Yes
VS Code (Copilot)	Yes
Devin	Yes
Claude Code	Reads `CLAUDE.md` natively; AGENTS.md support in progress

Claude Code currently reads CLAUDE.md as its primary file. If you're using Claude Code alongside other agents, you still want AGENTS.md as your canonical source and CLAUDE.md as a thin wrapper (more on that in the next section).

Deriving CLAUDE.md from one source

The pattern that works: put everything shared in AGENTS.md. In CLAUDE.md, include a short note pointing to it, then add only configuration specific to Claude.

<!-- CLAUDE.md -->

# Claude Code Instructions

See AGENTS.md for shared project conventions. The notes below are Claude-specific.

## Extended thinking
Use extended thinking for architecture decisions and debugging sessions where context depth matters.

## MCP tools
Local dev uses the Filesystem MCP mounted at /tmp/project. Don't use the Bash tool for file operations when the Filesystem MCP is available.

This way AGENTS.md is the truth. CLAUDE.md is an extension. When a convention changes (new test runner, new linting rule), you update one file.

Same pattern for GEMINI.md or any other agent file: start with "see AGENTS.md" and add only the things that are genuinely specific to that tool.
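One way to keep the wrappers honest is a small check you could run in CI. This is a sketch under assumptions, not part of the AGENTS.md format: `findWrappersMissingPointer` and the `WRAPPER_FILES` list are hypothetical names, and the file contents are passed in as plain strings so the example stays self-contained.

```typescript
// Hypothetical CI helper: flag agent-specific files that never point back to AGENTS.md.
type RepoFiles = Record<string, string>; // file name -> file contents

const WRAPPER_FILES = ["CLAUDE.md", "GEMINI.md", ".cursorrules"];

function findWrappersMissingPointer(files: RepoFiles): string[] {
  return WRAPPER_FILES.filter((name) => {
    const body = files[name];
    // A wrapper that does not exist in the repo is not a problem.
    if (body === undefined) return false;
    return !body.includes("AGENTS.md");
  });
}

const missing = findWrappersMissingPointer({
  "AGENTS.md": "## Test\npnpm test",
  "CLAUDE.md": "See AGENTS.md for shared project conventions.",
  "GEMINI.md": "Run pnpm test before committing.",
});
// missing is ["GEMINI.md"]: that file duplicates a rule instead of pointing at the shared source
```

A check like this only proves the pointer exists. It does not stop someone from also pasting shared rules into the wrapper, so review still matters.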

Gotchas: hierarchy, extensions, and edge cases

Codex extends the spec with a hierarchy. OpenAI Codex supports a hierarchical AGENTS.md system where you can place AGENTS.md files in subdirectories and they compose together. This is an extension specific to Codex, not part of the base spec. Other tools read the root AGENTS.md only (or whatever their lookup path is). Don't rely on the hierarchy pattern if you're targeting the broader ecosystem.

Keep it evergreen. The same rule about avoiding dates that applies to your README applies here. Agent instructions that reference a specific sprint or quarter confuse the agent when that sprint is long past. Write instructions for the current state of the codebase, not for what you were working on when you wrote them.

Test descriptions, not test output. Some teams paste in example test output to tell the agent "this is what passing tests look like." That output goes stale fast. Instead, describe the test runner invocation and what a passing run looks like in plain terms. If you're wiring up more rigorous evaluation for LLM agents in production, this framing carries over.

Tool lookup paths vary. Most tools look for AGENTS.md at the repo root, but some support workspace configurations. Check the docs for whichever tool you're using if you have a monorepo setup. Codex in particular has well documented monorepo support via the hierarchical extension.
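To make the hierarchy idea concrete, here is a rough sketch of how a tool could collect instruction files from the repo root down to the directory of the file being edited. It is an illustration only, not Codex's actual lookup algorithm; `agentFilesFor` is a hypothetical name and the set of directories is supplied by hand.

```typescript
// Sketch only: collect AGENTS.md files from the repo root down to a target file's directory.
function agentFilesFor(targetPath: string, dirsWithAgentsFile: Set<string>): string[] {
  const parts = targetPath.split("/").slice(0, -1); // drop the file name
  const found: string[] = [];
  // "" stands for the repo root; each iteration walks one directory deeper.
  for (let depth = 0; depth <= parts.length; depth++) {
    const dir = parts.slice(0, depth).join("/");
    if (dirsWithAgentsFile.has(dir)) {
      found.push(dir === "" ? "AGENTS.md" : `${dir}/AGENTS.md`);
    }
  }
  return found; // root first, most specific last
}

const chain = agentFilesFor(
  "packages/api/src/routes.ts",
  new Set(["", "packages/api"]),
);
// chain is ["AGENTS.md", "packages/api/AGENTS.md"]
```

A tool that only reads the root file would stop at the first entry, which is why anything critical belongs in the root AGENTS.md.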

FAQ

What is AGENTS.md?
It's an open format for a single instruction file that multiple coding agents can read automatically. It holds project conventions, build commands, and test steps that you want every agent working in the repo to know.

Which tools support AGENTS.md?
OpenAI Codex, Google Jules, Cursor, Amp, Factory, Aider, goose, opencode, Zed, Warp, VS Code (Copilot), and Devin. Claude Code reads CLAUDE.md natively and AGENTS.md support is in progress.

How is it different from a README?
A README is written for humans. AGENTS.md is written for agents. It covers build and test mechanics, codebase conventions, and architecture rules. Human onboarding context belongs in the README.

If you want to go deeper on structuring AI systems architecture in your codebase, that is exactly the kind of work I take on.

More writing on agentic patterns and developer tooling on my blog.

Curious which agents you're targeting. Drop a comment with your stack and whether you've hit the problem of maintaining five instruction files.

Claude Fable 5: Features, Pricing, and Fallbacks

Mudassir Khan — Wed, 10 Jun 2026 17:37:38 +0000

Anthropic just shipped claude-fable-5, its most capable widely released model. It is a Mythos class model made safe for general use, with a 1M token context window, adaptive thinking, and a safety classifier that quietly hands risky prompts to Claude Opus 4.8. Here is what actually matters when you wire it into a product.

Fable 5 and Mythos 5 are the same model

The launch came in two names. claude-fable-5 and claude-mythos-5 are the same underlying model. The only difference is the safeguards.

Mythos 5 runs without the safety classifiers and is limited to a small group of cyber defenders through Project Glasswing. Fable 5 is the version everyone gets, on the Claude API, Amazon Bedrock, Vertex AI, and Microsoft Foundry. A Mythos class model sits a tier above the Opus class in raw capability, so when your request is not flagged, you are talking to the strongest model Anthropic has put in general hands. Their own data says more than 95% of Fable sessions never hit a fallback.

The specs you plan around

Spec	Claude Fable 5	Claude Opus 4.8
API id	`claude-fable-5`	`claude-opus-4-8`
Context window	1M tokens	up to 1M tokens
Max output	128k tokens	lower
Input price / M tokens	$10	$5
Output price / M tokens	$50	$25
Thinking mode	adaptive only	configurable

Fable 5 costs twice as much per token as Opus 4.8. That premium buys the strongest reasoning and the longest autonomous runs, but it means routing matters. Send the hard, long horizon tasks to Fable and keep routine work on Opus or Sonnet. The 1M context is on by default, and the 128k output ceiling is enough to return whole files or a long migration in one response.
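A quick way to see what the premium means for a routing decision is to price a single request on both models. The sketch below uses the per-million-token prices from the table above; the helper name `estimateCostUsd` and the token counts are made up for illustration.

```typescript
// Prices in USD per million tokens, copied from the table above.
const PRICES = {
  "claude-fable-5": { input: 10, output: 50 },
  "claude-opus-4-8": { input: 5, output: 25 },
} as const;

type ModelId = keyof typeof PRICES;

function estimateCostUsd(model: ModelId, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  return (inputTokens / 1e6) * price.input + (outputTokens / 1e6) * price.output;
}

// Hypothetical request: a 200k-token prompt with a 20k-token answer.
const fableCost = estimateCostUsd("claude-fable-5", 200000, 20000); // about 3 USD
const opusCost = estimateCostUsd("claude-opus-4-8", 200000, 20000); // about 1.5 USD
```

The ratio stays at 2x whatever the token mix, so the routing question is only whether the task is hard enough to justify it.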

Where it actually leads

The headline is long horizon autonomy. The longer the task, the bigger Fable 5's lead over earlier models.

Software engineering. Stripe ran a codebase wide migration across 50 million lines of Ruby in a single day. Fable 5 also tops Cognition's FrontierCode eval, even at medium effort.
Knowledge work. Highest score of any model on Hebbia's finance benchmark, with real gains on document reasoning and chart and table interpretation.
Vision. New state of the art. It reads precise numbers off scientific figures and rebuilds a web app's source from screenshots alone.
Memory. It stays focused across millions of tokens and improves using notes it writes to a file based memory tool.

The safety classifier is the thing to design for

This is the part that will surprise you in production. A separate classifier model sits in front of Fable 5. When a prompt looks like it touches cybersecurity, biology, chemistry, or distillation, Fable does not answer. The request falls back to Claude Opus 4.8, and the user is told it happened.

The classifiers are tuned conservatively, so they sometimes catch harmless prompts. Anthropic says the fallback fires in under 5% of sessions on average. For builders, the move is to expect the occasional fallback on benign cyber or bio adjacent prompts and handle it gracefully instead of treating it as an error.

What changes on the API

The biggest one is refusals. When Fable 5 declines, the Messages API does not throw. You get a successful HTTP 200 with stop_reason set to refusal, and it tells you which classifier declined. You are not billed for a request refused before any output is generated.

// A refused request comes back as a normal 200 response
{
  "type": "message",
  "role": "assistant",
  "stop_reason": "refusal",
  "content": [],
  "usage": { "input_tokens": 412, "output_tokens": 0 }
}
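Because the refusal arrives as ordinary data, client code can branch on it instead of wrapping the call in a try/catch. The sketch below assumes only the fields visible in the response above; `MessageResponse`, `Outcome`, and `interpret` are illustrative names, not SDK types.

```typescript
// Minimal shape of the fields used here; a real response carries more.
interface MessageResponse {
  stop_reason: string;
  content: Array<{ type: string; text?: string }>;
}

type Outcome =
  | { kind: "answer"; text: string }
  | { kind: "refused" };

function interpret(response: MessageResponse): Outcome {
  // A refusal is a successful response, so treat it as a branch, not an exception.
  if (response.stop_reason === "refusal") {
    return { kind: "refused" };
  }
  const text = response.content
    .filter((block) => block.type === "text")
    .map((block) => block.text ?? "")
    .join("");
  return { kind: "answer", text };
}

const refused = interpret({ stop_reason: "refusal", content: [] });
const answered = interpret({
  stop_reason: "end_turn",
  content: [{ type: "text", text: "Done." }],
});
```

The `refused` branch is where you would retry on another model or show the user a neutral message.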

A refused request can usually be served by another model. Pass the fallbacks parameter and the API retries for you, or use the SDK middleware to retry from the client. Fallback credit refunds the prompt cache cost so you do not pay it twice.

{
  "model": "claude-fable-5",
  "fallbacks": ["claude-opus-4-8"],
  "messages": [ ]
}

Three more behaviors that are specific to Fable 5 and Mythos 5:

Adaptive thinking is always on. It is the only thinking mode. You cannot disable thinking. Use the effort parameter to control depth and spend.
Raw thinking is never returned. It is omitted by default. Set thinking display to summarized for readable summaries, and pass thinking blocks back unchanged in multiturn conversations on the same model.
30 day data retention. Both models are Covered Models, so zero data retention is not available. The data is used for safety, not training.

Pricing and availability

Pricing is $10 per million input tokens and $50 per million output tokens for both models. On the Claude API and consumption based Enterprise plans, Fable 5 is fully available now. On subscription plans the rollout is staged: included on Pro, Max, Team, and seat based Enterprise for a short window, then drawing on usage credits until capacity allows it back into the standard plans. Mythos 5 stays restricted to Glasswing partners and, soon, a small set of biology researchers.

Three things to check before you ship on Fable 5

Handle stop_reason: "refusal" as a normal response, not an exception.
Decide your fallback model now, either with the fallbacks param or SDK middleware.
Drop the thinking: {"type": "disabled"} path. It is not supported here.

If you want a deeper look at Claude Fable 5, I cover the features, pricing, and fallback behavior in more detail on my site.

If you want a model like this wired into a real product end to end, that is exactly the kind of work I take on.

Drop a comment if your fallback strategy looks different. Curious what people are routing to Opus versus keeping on Fable.

AI agent memory management: beyond the context window

Mudassir Khan — Sat, 06 Jun 2026 10:17:54 +0000

Your agent answered correctly five minutes ago. Now it's asking for the same information again. The context window filled up, the early messages got evicted, and all that history is gone.

This is not a hallucination problem. It's a memory architecture problem.

The symptom: agents that forget during a task

You're debugging an agent that handles multistep workflows. Somewhere around turn 15, it starts contradicting itself. It asks questions you already answered in turn 3. It ignores decisions it made in turn 7.

Context limits are the obvious culprit, but the real issue is deeper. Most agent implementations treat the context window as memory. It isn't. It's working memory that evicts old data as soon as the window fills. The moment something scrolls out of the context, the agent has no path back to it.

That's the gap this article addresses.

Working memory vs long-term memory

A context window is the agent's working memory. It holds the current conversation, the current task state, and whatever you stuffed in at the start. It's fast to read, but it resets. Every new session starts blank.

Long-term memory is what persists across sessions: user preferences, prior decisions, learned facts about the problem domain. Without it, every session is the agent's first session.

The distinction matters because the solutions are different. Working memory problems (forgetting during a task) need context management techniques. Long-term memory problems (cross-session amnesia) need a storage and retrieval layer.

Most articles conflate the two. Most agents solve neither properly.

The lost in the middle problem and why position matters

Even inside a single context window, position matters more than you'd think.

Liu et al. (2023), in "Lost in the Middle: How Language Models Use Long Contexts," showed that LLMs exhibit this effect: accuracy is highest when relevant information is at the start or end of the context, and drops significantly for information buried in the middle. If you have a 64k token window and you put the most critical retrieved documents at position 30k, you've placed them exactly where the model is least likely to use them.

The practical consequence: in a RAG system, you should not dump all your retrieved chunks in the middle of the prompt and hope the model weighs them equally. It doesn't.

A production fix is to place your highest confidence retrieved documents at the very beginning and end of the context. Treat the context window like a sandwich: critical context at the top, critical context again at the bottom, filler in the middle.
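The sandwich ordering is easy to get wrong by hand, so here is a small sketch of one way to do it. It assumes the chunks arrive sorted best-first; `sandwichOrder` is a hypothetical helper, and alternating front and back is just one reasonable policy.

```typescript
// Sketch: given chunks sorted best-first, push the strongest ones to both edges of the context.
function sandwichOrder<T>(rankedBestFirst: T[]): T[] {
  const front: T[] = [];
  const back: T[] = [];
  rankedBestFirst.forEach((chunk, i) => {
    // Alternate: rank 1 goes to the front, rank 2 to the back, rank 3 to the front, and so on.
    if (i % 2 === 0) front.push(chunk);
    else back.unshift(chunk);
  });
  return [...front, ...back];
}

const ordered = sandwichOrder(["A", "B", "C", "D", "E"]);
// ordered is ["A", "C", "E", "D", "B"]: A and B sit at the edges, E and D end up in the middle
```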

The hybrid pattern: auto-retrieve at start, agent-driven storage after

The pattern that holds up in production is a combination of two mechanisms.

Auto-retrieve at request start. Before every agent turn, automatically retrieve relevant long-term memories based on the current query or task state. This means the agent always starts with the best available context, even in a fresh session.

Agent-driven storage. Let the agent decide what is worth keeping. After completing a task or making a significant decision, the agent writes to long-term memory: "User prefers TypeScript strict mode, reminded me three times." That's information worth retrieving in the next session.

You can think of solid agent memory management as a filing cabinet alongside the working scratchpad. Auto-retrieve pulls relevant files from the cabinet at the start of each turn. Agent-driven storage files things back when they are worth keeping.

Where vector stores fit and how to rerank

Long-term memory at scale lives in vector databases. You embed memories (or document chunks) into a vector space and retrieve by semantic similarity. When a new query arrives, you run a similarity search and pull the top K most relevant chunks.

The problem is that "most similar" and "most useful" are not the same thing. A retrieval system that returns raw cosine similarity scores will sometimes surface tangentially related content over the most relevant hit. That's where reranking earns its place.

A reranker takes the top K retrieved chunks and scores them again using a more expensive cross-encoder model. It's slower than vector similarity search, but it runs on a small candidate set (K is usually 10 to 20), so the latency stays manageable. The output is a reordered list where the most genuinely useful chunks end up at the top.
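The two stages can be sketched in a few lines. Everything here is a toy under stated assumptions: the embeddings are two-dimensional, and `rerankScore` is a stand-in keyword check where a real system would call a cross-encoder model.

```typescript
interface Chunk { id: string; text: string; embedding: number[] }

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Stage 1: cheap vector similarity over every chunk, keep the top K candidates.
// Stage 2: re-score only those K with a more expensive scorer.
function retrieve(
  queryEmbedding: number[],
  chunks: Chunk[],
  k: number,
  rerankScore: (chunk: Chunk) => number, // stand-in for a cross-encoder call
): Chunk[] {
  const candidates = [...chunks]
    .sort((x, y) => cosine(queryEmbedding, y.embedding) - cosine(queryEmbedding, x.embedding))
    .slice(0, k);
  return candidates.sort((x, y) => rerankScore(y) - rerankScore(x));
}

const ranked = retrieve(
  [1, 0],
  [
    { id: "a", text: "general database notes", embedding: [1, 0] },
    { id: "b", text: "how to run a migration", embedding: [0.9, 0.1] },
    { id: "c", text: "unrelated release notes", embedding: [0, 1] },
  ],
  2,
  (chunk) => (chunk.text.includes("migration") ? 1 : 0),
);
// ranked ids are ["b", "a"]: "a" wins on raw similarity, but the reranker promotes "b"
```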

If you're picking a stack, you can compare vector databases by latency, filtering support, and managed hosting options before committing.

A minimal memory loop in code

Here's a realistic Python sketch of the hybrid pattern. It uses a hypothetical memory client wrapping a vector store.

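from memory_client import MemoryClient  # hypothetical wrapper around a vector store
from llm_client import LLMClient        # hypothetical LLM wrapper

memory = MemoryClient(user_id="user_123")
llm = LLMClient()

def format_memories(memories: list) -> str:
    # Assumes each memory object exposes its text as `.content`
    return "\n".join(f"- {m.content}" for m in memories)

def agent_turn(query: str, agent_id: str) -> str:
    # Step 1: auto-retrieve relevant memories before every turn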
    memories = memory.search(
        query=query,
        agent_id=agent_id,
        top_k=5,
        rerank=True,
    )

    # Step 2: inject top memories at start, rest at end (sandwich pattern)
    top_memories = memories[:2]    # highest confidence at the very top
    bottom_memories = memories[2:] # remaining near the bottom

    context_parts = []
    if top_memories:
        context_parts.append(format_memories(top_memories))

    context_parts.append(query)

    if bottom_memories:
        context_parts.append(format_memories(bottom_memories))

    response = llm.complete("\n\n".join(context_parts))

    # Step 3: agent driven storage — write what is worth keeping
    if response.should_store:
        memory.add(
            content=response.memory_payload,
            agent_id=agent_id,
            user_id="user_123",
        )

    return response.text

The should_store flag is where the interesting decisions happen. You can implement it as a second LLM call ("Is this response something worth remembering for future sessions?"), a simple heuristic (decisions over a certain length, or responses containing explicit preferences), or a structured output field the main LLM populates.

Start simple. A naive heuristic beats no memory at all, and you can upgrade the storage logic once you see what your agent actually needs to keep.
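As an example of how naive the first version can be, here is the keyword heuristic sketched in TypeScript. The marker list and the `shouldStore` name are assumptions for illustration; tune them to whatever your agent's responses actually look like.

```typescript
// Naive heuristic: keep a response if it states an explicit preference or decision.
const PREFERENCE_MARKERS = ["prefer", "always use", "never use", "decided to"];

function shouldStore(responseText: string): boolean {
  const lower = responseText.toLowerCase();
  return PREFERENCE_MARKERS.some((marker) => lower.includes(marker));
}

const keep = shouldStore("User prefers TypeScript strict mode, reminded me three times.");
const skip = shouldStore("Here is the file listing you asked for.");
// keep is true, skip is false
```

It will miss paraphrased preferences and occasionally store noise, which is an acceptable trade while you learn what is worth keeping.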

FAQ

What is the difference between a context window and memory?

A context window is temporary working memory. It holds the current turn's information and is cleared between sessions. Memory is a persistent store that survives session resets, backed by a database the agent reads from and writes to explicitly.

How do agents remember across sessions?

They don't automatically. You wire up explicit storage (usually a vector database or a structured key-value store) and retrieval logic. The agent writes important information to the store at the end of a task and retrieves relevant entries at the start of the next session.

What is the lost in the middle problem?

LLMs pay less attention to information in the middle of a long context window than to information at the start or end. If you place your most critical retrieved documents in the center of a large prompt, the model may effectively ignore them. The fix is to place highest confidence chunks at the very beginning and end of the context.

If you want a deeper look at agent memory architecture, I cover it in more detail on my site.

If you want this wired up on your own system end to end, that is exactly the kind of work I take on.

Drop a comment if your setup looks different. Curious what memory stacks people are running in production.

How to test MCP servers in TypeScript before they break in production

Mudassir Khan — Mon, 01 Jun 2026 10:20:52 +0000

Your MCP server works on your laptop. The tool calls return the right shapes, the client connects cleanly, the session behaves. Then you deploy it and a client reconnects after a network hiccup and the session state is gone. Or you scale to two instances and half the requests fail because session IDs resolve to the wrong process. Or someone sends two concurrent requests and the tool handler corrupts shared state.

Testing catches these before your users do. This is a testing playbook for TypeScript MCP servers built on the official SDK.

The demo-to-production gap for MCP servers

The official TypeScript SDK makes it easy to get something working. A few tool registrations, an McpServer instance, a transport, and you are serving. The problem is that "working" in the demo sense and "working" in the production sense are different things.

A demo tests one happy path. Production tests edge cases that emerge from real clients: reconnects, concurrent tool calls, malformed inputs, slow downstream APIs, and the transport contract itself. None of those show up in a single manual run against your local instance.

The gap is not a criticism of the SDK. It is a consequence of how easy the SDK makes it to build a server without thinking about what breaks it. A test suite closes the gap before you ship.

What actually breaks: transport, sessions, tool contracts

Three categories fail most often.

Transport behavior. The SDK added Streamable HTTP support in version 1.10.0. Under this transport, the server exposes a single HTTP endpoint that handles both POST and GET. Clients use POST for tool calls and GET to open a streaming connection via server-sent events (SSE). Tests that only exercise stdio miss this entirely.

Session state. The StreamableHTTPServerTransport is stateful per session. If you store anything in process memory keyed by session ID, a restart or a second instance will drop it. Tests that do not simulate reconnects miss this failure mode.

Tool contracts. Each tool you register has an input schema and an expected output shape. Tests that call tools with valid inputs only miss the cases where a client sends a slightly wrong shape or a downstream API returns something unexpected.

Unit-testing tools and resources in isolation

The cleanest place to start is the tool handler itself, before any transport is involved.

Each tool handler is a function that takes validated input and returns a result. Extract the handler logic from the server.tool() registration so you can call it directly in tests.

// tool-handlers.ts
export async function getItemHandler(input: { id: string }) {
  const item = await fetchItem(input.id);
  return { content: [{ type: "text", text: JSON.stringify(item) }] };
}

// getItem.test.ts
import { getItemHandler } from "./tool-handlers";

test("returns item content", async () => {
  const result = await getItemHandler({ id: "abc" });
  expect(result.content[0].type).toBe("text");
});

This pattern makes each handler independently testable without spinning up the full server. Stub the downstream calls with your test framework's mocking utilities. Cover the happy path, malformed input, and downstream failure.

Resource handlers work the same way. Extract, test in isolation, stub dependencies.

Contract assertions against the MCP schema

After unit tests, the next layer checks that your tool registrations conform to the MCP protocol. A contract test instantiates the real server and sends actual protocol requests, then asserts on the response shape.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { InMemoryTransport } from "@modelcontextprotocol/sdk/inMemory.js";
import { createServer } from "./server";

test("list tools returns registered tools", async () => {
  const server = createServer();
  const [clientTransport, serverTransport] = InMemoryTransport.createLinkedPair();
  await server.connect(serverTransport);

  const client = new Client({ name: "test", version: "1.0.0" }, {});
  await client.connect(clientTransport);

  const result = await client.listTools();
  const toolNames = result.tools.map((t) => t.name);
  expect(toolNames).toContain("get-item");
});

The InMemoryTransport in the SDK is designed exactly for this. It lets you run client and server in the same process without any network, which keeps tests fast and deterministic.

Assert on the full response shape for each tool: input schema, output content types, error response format. This is the layer that catches the gap between what your server claims to support and what it actually returns.
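A shape assertion does not need a schema library to be useful. The sketch below is a minimal hand-written check that a tool result carries a `content` array of typed blocks; `isValidToolResult` is a hypothetical helper, and it checks far less than the full MCP schema.

```typescript
// Sketch: a tool result must have a `content` array whose blocks each carry a string `type`.
function isValidToolResult(value: unknown): boolean {
  if (typeof value !== "object" || value === null) return false;
  const content = (value as { content?: unknown }).content;
  if (!Array.isArray(content)) return false;
  return content.every(
    (block) =>
      typeof block === "object" &&
      block !== null &&
      typeof (block as { type?: unknown }).type === "string",
  );
}

const good = isValidToolResult({ content: [{ type: "text", text: "ok" }] });
const bad = isValidToolResult({ repos: ["org/app"] });
// good is true, bad is false: a handler that returns a bare object fails the contract
```

Inside a contract test you would run a check like this against the result of every registered tool.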

Testing Streamable HTTP behavior

The in memory transport covers the protocol layer. Testing the HTTP transport layer catches a different class of failure: auth middleware, session header handling, and the streaming path.

Stand up a real HTTP server on a random port, run your requests against it, and shut it down after each test.

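import { createServer as createHttpServer } from "node:http";
import { randomUUID } from "node:crypto";
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";
import { createServer } from "./server";

let httpServer: ReturnType<typeof createHttpServer>;
let baseUrl: string;

// Streamable HTTP clients must accept both JSON and SSE responses
const headers = {
  "Content-Type": "application/json",
  Accept: "application/json, text/event-stream",
};

beforeAll(async () => {
  const transport = new StreamableHTTPServerTransport({
    sessionIdGenerator: () => randomUUID(),
    enableJsonResponse: true, // reply with plain JSON instead of an SSE stream
  });
  const mcpServer = createServer();
  await mcpServer.connect(transport);
  httpServer = createHttpServer((req, res) => transport.handleRequest(req, res));
  await new Promise<void>((r) => httpServer.listen(0, () => r()));
  const addr = httpServer.address() as { port: number };
  baseUrl = `http://localhost:${addr.port}`;
});

afterAll(() => httpServer.close());

test("POST returns a valid MCP response", async () => {
  // A stateful session must be initialized before other requests are accepted
  const init = await fetch(`${baseUrl}/mcp`, {
    method: "POST",
    headers,
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 0,
      method: "initialize",
      params: {
        protocolVersion: "2025-03-26",
        capabilities: {},
        clientInfo: { name: "test", version: "1.0.0" },
      },
    }),
  });
  const sessionId = init.headers.get("mcp-session-id")!;

  const res = await fetch(`${baseUrl}/mcp`, {
    method: "POST",
    headers: { ...headers, "mcp-session-id": sessionId },
    body: JSON.stringify({ jsonrpc: "2.0", method: "tools/list", params: {}, id: 1 }),
  });
  expect(res.status).toBe(200);
  const json = await res.json();
  expect(json.result.tools).toBeDefined();
});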

Add a test that opens GET on the same endpoint and confirms the SSE connection accepts. Add a test that sends an invalid session ID and checks the server handles it without crashing.

A CI setup that catches regressions

A test suite only helps if it runs consistently. For MCP servers, the minimal CI setup is:

Unit tests on every commit (fast, no network)
Contract tests via InMemoryTransport on every commit (still fast)
HTTP transport tests on pull requests and on merge to main

If you deploy across multiple instances, add a test that starts two server processes and verifies that a session created on instance one can be resumed on instance two. This tests your external session store. It is slower, so running it on pull requests is reasonable.

If you are building an AI product that depends on your MCP server, the deployment and observability patterns in Next.js for AI products apply directly to the production layer above the server. For the enterprise session model, see MCP for enterprise agents.

FAQ

What is an MCP server?
An MCP server exposes tools, resources, and prompts to LLM clients via the Model Context Protocol. Clients connect to it and call tools by name, receiving structured results.

How do you test an MCP server?
Start with unit tests on each tool handler in isolation. Add contract tests using InMemoryTransport. Add HTTP transport tests against a local server instance for the full network path.

What transport does MCP use?
MCP supports stdio for local use and Streamable HTTP for networked servers. Streamable HTTP uses a single endpoint that handles POST requests for tool calls and GET requests for SSE streaming. The TypeScript SDK supports Streamable HTTP from version 1.10.0 onward.

Building production AI systems? I consult on agentic AI and AI product engineering at mudassirkhan.me.

Per User OAuth in a Next.js MCP Server (Step by Step)

Mudassir Khan — Fri, 29 May 2026 13:32:36 +0000

Your MCP server is using one shared API key for every caller. That works in a demo. The moment each user needs to call a tool with their own credentials (their GitHub token, their Notion workspace, their Stripe key), a shared key breaks down. Here's how to wire per user OAuth into a Next.js MCP server so each session gets its own scoped token.

The problem with shared keys

When you build an MCP server the usual way, tools look like this:

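server.tool("get-my-repos", {}, async () => {
  // The shared token from the environment is used for every caller
  const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
  const { data } = await octokit.repos.listForAuthenticatedUser();
  // MCP tool results are returned as a content array
  return { content: [{ type: "text", text: data.map((r) => r.full_name).join("\n") }] };
});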

That GITHUB_TOKEN is yours. Every user who connects to this server gets results from your account. In a multitenant setup, that's a footgun. Alice calls get-my-repos and sees your repos instead of hers. Bob calls create-issue and the issue is filed under your identity, in whatever repos your token can reach.

You need each MCP session to carry its own OAuth token. Here's how to do it without reinventing the session layer.

How the token flow works

Three pieces fit together:

NextAuth holds the user's OAuth access token in their session cookie.
A Next.js Route Handler acts as the MCP transport. It can read the session before any tool runs.
MCP tool context receives the token via closure (no global state, no env vars).

The key insight: your Route Handler runs in a request context, so it has full access to the session. You create the MCP server instance inside that handler, which means every tool closure captures the authenticated user's token before it runs.

Step 1: Set up NextAuth with token persistence

Install what you need:

npm install next-auth @auth/core @octokit/rest

Configure NextAuth to store the OAuth access token in the session. You need both the jwt and session callbacks, because by default NextAuth does not expose the raw access token on your session object:

// src/app/api/auth/[...nextauth]/route.ts
import NextAuth from "next-auth";
import GitHub from "next-auth/providers/github";

export const { handlers, auth } = NextAuth({
  providers: [
    GitHub({
      clientId: process.env.GITHUB_CLIENT_ID!,
      clientSecret: process.env.GITHUB_CLIENT_SECRET!,
      authorization: {
        params: { scope: "repo read:user" },
      },
    }),
  ],
  callbacks: {
    async jwt({ token, account }) {
      // `account` is only present on the initial sign-in
      if (account) {
        token.accessToken = account.access_token;
      }
      return token;
    },
    async session({ session, token }) {
      session.accessToken = token.accessToken as string;
      return session;
    },
  },
});

export const { GET, POST } = handlers;

And extend the session type so TypeScript doesn't complain:

// src/types/next-auth.d.ts
import { DefaultSession } from "next-auth";

declare module "next-auth" {
  interface Session extends DefaultSession {
    accessToken: string;
  }
}

After a user signs in with GitHub, session.accessToken is their personal GitHub token scoped to repo read:user. It's stored in their encrypted session cookie, not anywhere on your server.

Step 2: The MCP Route Handler with per user context

This is the important part. Your MCP transport lives inside a Route Handler at /api/mcp:

// src/app/api/mcp/route.ts
import { createMcpHandler } from "@vercel/mcp-adapter";
import { auth } from "@/app/api/auth/[...nextauth]/route";

export const POST = async (request: Request) => {
  // Read the session for THIS request
  const session = await auth();

  if (!session?.accessToken) {
    return new Response("Unauthorized", { status: 401 });
  }

  // Create a new MCP handler instance for this request.
  // The tool registrations close over `session` — each call gets its own.
  const handler = createMcpHandler(
    (server) => {
      server.tool("get-my-repos", {
        description: "Lists repositories the current user has access to",
        inputSchema: {
          type: "object",
          properties: {
            visibility: {
              type: "string",
              enum: ["all", "public", "private"],
              default: "all",
            },
          },
        },
      }, async ({ visibility = "all" }) => {
        const response = await fetch(
          `https://api.github.com/user/repos?visibility=${visibility}&per_page=20`,
          {
            headers: {
              Authorization: `Bearer ${session.accessToken}`,
              Accept: "application/vnd.github+json",
            },
          }
        );

        if (!response.ok) {
          throw new Error(`GitHub API returned ${response.status}`);
        }

        const repos = await response.json();
        return {
          content: [
            {
              type: "text",
              text: repos
                .map((r: { full_name: string; private: boolean }) =>
                  `${r.full_name} (${r.private ? "private" : "public"})`
                )
                .join("\n"),
            },
          ],
        };
      });
    },
    {},
    { redisUrl: process.env.REDIS_URL }
  );

  return handler(request);
};

The session.accessToken in the tool handler is captured from the outer POST function scope. It's the token for whoever made this request. Two users hitting /api/mcp at the same moment get two separate MCP handler instances, each with their own token.

No global state. No shared credentials. Each request is isolated.
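You can convince yourself of the isolation claim without any MCP machinery. The sketch below is a stripped-down illustration of the closure pattern, not the adapter's API: Session, Tools, buildTools, and whoAmI are hypothetical names, and the handler only echoes which token it captured.

```typescript
// Hypothetical minimal types; not part of @vercel/mcp-adapter or NextAuth.
interface Session {
  user: string;
  accessToken: string;
}

interface Tools {
  whoAmI: () => Promise<string>;
}

// Called once per request. Every handler it returns closes over the
// `session` argument of that call, never over a module-level variable.
function buildTools(session: Session): Tools {
  return {
    whoAmI: async () => `${session.user}:${session.accessToken}`,
  };
}

// Simulate two requests arriving at the same moment.
async function simulateConcurrentRequests(): Promise<string[]> {
  const alice = buildTools({ user: "alice", accessToken: "token-a" });
  const bob = buildTools({ user: "bob", accessToken: "token-b" });
  return Promise.all([alice.whoAmI(), bob.whoAmI()]);
}
```

Each call to buildTools produces its own handlers with its own captured session, which is the same property the Route Handler relies on.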

Step 3: Adding a second tool

Once the pattern is in place, adding tools is just adding more server.tool calls inside the same closure. They all get session for free:

server.tool("create-issue", {
  description: "Creates a GitHub issue in a repo the current user owns",
  inputSchema: {
    type: "object",
    required: ["repo", "title"],
    properties: {
      repo: { type: "string", description: "owner/repo format" },
      title: { type: "string" },
      body: { type: "string" },
    },
  },
}, async ({ repo, title, body }) => {
  const [owner, repoName] = repo.split("/");

  const response = await fetch(
    `https://api.github.com/repos/${owner}/${repoName}/issues`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${session.accessToken}`,
        Accept: "application/vnd.github+json",
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ title, body }),
    }
  );

  if (!response.ok) {
    const err = await response.json();
    throw new Error(err.message || `GitHub API returned ${response.status}`);
  }

  const issue = await response.json();
  return {
    content: [{ type: "text", text: `Created: ${issue.html_url}` }],
  };
});

This tool creates an issue on the calling user's behalf. If you'd wired this with a shared token, every issue would be created by your service account. With per-user tokens, it's the actual user: their name on the issue, their rate limits, their permissions.
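One thing the handler above takes on trust is the repo argument: repo.split("/") will happily accept "octocat" or "a/b/c" and interpolate whatever comes out into the URL. A small validation step before the fetch is one way to close that gap. This is a sketch under assumptions: parseRepo and RepoRef are hypothetical names, and the character pattern is deliberately conservative rather than a statement of GitHub's exact naming rules.

```typescript
// Hypothetical helper; the name and the exact character rules are assumptions.
interface RepoRef {
  owner: string;
  name: string;
}

// Conservative pattern: letters, digits, dot, underscore, hyphen on both sides
// of exactly one slash.
const REPO_PATTERN = /^([A-Za-z0-9_.-]+)\/([A-Za-z0-9_.-]+)$/;

function parseRepo(input: string): RepoRef | null {
  const match = REPO_PATTERN.exec(input.trim());
  if (!match) return null;
  const [, owner, name] = match;
  if (owner === undefined || name === undefined) return null;
  // Reject path-traversal style segments.
  if (owner === "." || owner === ".." || name === "." || name === "..") return null;
  return { owner, name };
}
```

Inside the tool handler you would call parseRepo(repo) first and return an error result when it yields null, instead of building the URL from unchecked input.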

Verifying credential isolation

Here's a quick check before you ship. Sign in as two different GitHub accounts and grab each session cookie from the browser DevTools. Then run both calls at the same time:

# Alice's session
# (cookie name shown is the NextAuth v4 default; Auth.js v5 names it authjs.session-token)
curl -s -X POST http://localhost:3000/api/mcp \
  -H "Cookie: next-auth.session-token=<alice-token>" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"tools/call","id":1,"params":{"name":"get-my-repos","arguments":{}}}' \
  | jq '.result.content[0].text' &

# Bob's session (run at the same time)
curl -s -X POST http://localhost:3000/api/mcp \
  -H "Cookie: next-auth.session-token=<bob-token>" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"tools/call","id":1,"params":{"name":"get-my-repos","arguments":{}}}' \
  | jq '.result.content[0].text'

You should get Alice's repos and Bob's repos in the two outputs. If both return the same list, your session isn't threading through. Check that auth() in the Route Handler is returning the right session for each request.

What to watch out for

Token expiry mid-session. OAuth tokens can expire. If your MCP client holds a long-running session, the token can go stale between calls. Handle this by catching 401 responses from the upstream API inside your tool handlers and returning a clear error message the AI can relay to the user: "Your GitHub session expired, please sign in again."
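A minimal sketch of that 401 handling, assuming you want every tool to report upstream failures the same way. upstreamErrorResult and ToolResult are hypothetical names; ToolResult is only the subset of the MCP tool result shape that this example needs.

```typescript
// Shape of an MCP tool result, reduced to the fields used here.
interface ToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

// Hypothetical helper: maps an upstream HTTP status to a tool error result,
// or returns null when the call succeeded and the handler should carry on.
function upstreamErrorResult(status: number): ToolResult | null {
  if (status >= 200 && status < 300) return null;
  const text =
    status === 401
      ? "Your GitHub session expired, please sign in again."
      : `GitHub API returned ${status}`;
  return { content: [{ type: "text", text }], isError: true };
}
```

In a handler, you would call upstreamErrorResult(response.status) right after the fetch and return the result if it is not null. Returning an error result instead of throwing gives the model a message it can relay to the user.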

Stateless vs stateful transports. The example above uses @vercel/mcp-adapter with a Redis URL for stateful SSE sessions. If you're using a stateless transport (plain POST with no SSE), you don't need Redis, but you also can't hold multi-turn conversations over the same session. For most auth use cases, stateless is fine. You get a fresh authenticated context on every tool call.

Avoid the "just use NextAuth" trap. That's going to be the first reply to this article. Yes, NextAuth is handling the session layer here. The hard part that NextAuth does not solve for you is threading that session token into the MCP tool context specifically. That's what the closure pattern above does.

Three things to verify right now

console.log(session.accessToken) inside the Route Handler: should print a real token, not undefined.
Hit /api/mcp without a session cookie: you should get a 401, not a 500 or a 200 with the server token.
Two concurrent users, same tool, different results. That's the definitive proof the isolation is working.

If you want a deeper look at production MCP patterns at scale, I wrote about the full enterprise architecture in my post on MCP for enterprise agents. The NebulaDesk case study in my agentic AI product work is also a good read if you are building a multitenant Next.js product on top of AI tooling.

If you want this wired up on your own stack end to end, that is exactly the kind of work I take on via my Next.js for AI products service. Alternatively if you want strategic help on the broader agentic AI architecture, my consulting engagement covers that.

Drop a comment if your setup looks different. Curious what OAuth providers people are pairing with MCP in production. Notion? Linear? Stripe? All of the above?

Your LLM Is Wrong. Your Codebase Is Why.

Mudassir Khan — Tue, 26 May 2026 21:53:18 +0000

It happened on a Tuesday. I asked my AI coding assistant to explain a function I'd written three months earlier. It described a function that doesn't exist.

Not a total hallucination. The function did exist. Just not by that name, not with those parameters, not doing what the model confidently told me it was doing. The model had assembled a plausible story from vague signals and filled the gaps with fiction.

My first instinct was to blame the model. My second instinct, the one that actually helped, was to look at the code itself.

The model wasn't broken. My codebase was.

What is comprehension debt?

Technical debt is code that's hard to change. Comprehension debt is code that's hard to understand. Not just by future developers. By anything that has to read it cold: a new hire, a rubber duck, and increasingly, an AI assistant.

You've probably heard "write code as if the next maintainer is a serial killer who knows where you live." The LLM version is more forgiving. But not by much.

Comprehension debt shows up when the intent of your code isn't captured in your code. The logic works. The tests pass. But nothing in the source tells a reader why a function does what it does, what its constraints are, or what it absolutely should not do. That knowledge lives in someone's head, in a Slack thread from two months ago, or nowhere at all.

LLMs don't have access to the Slack thread. They only have your source.

Five signals your LLM is showing you

When your AI assistant gets your own codebase wrong, it's not random. The errors cluster around specific failure modes, and each one points to a real gap.

1. It invents function names.
The model calls functions that don't exist, or calls existing functions by the wrong name. This usually means your naming is inconsistent or your barrel exports are incomplete. The model is pattern matching across conventions that don't agree with each other.

2. It gets parameter types wrong.
It passes a string where you want a typed enum, or a plain object where you've defined a specific interface. This almost always means missing or implicit type annotations in your function signatures. The model is guessing.

3. It imports packages you don't use.
It reaches for lodash or axios when you already have utility wrappers around them. Your internal abstractions aren't legible to the model because they aren't documented anywhere it can find them. The model falls back to what it knows from training.

4. It uses patterns you've deprecated.
It calls the old version of your API, the one you stopped using eight months ago. Your codebase still contains those old patterns (maybe for backward compatibility, maybe just because cleanup hasn't happened yet) and the model doesn't know which version is current. Deprecation comments cost thirty seconds to write. Their absence costs you five minutes of confusion per assistant interaction.

5. It doesn't know the business rule.
It gives you the technically valid version of a function, not the version that accounts for the actual constraint. "This user lookup should always check the soft delete flag first" isn't written in a comment in any file. It was decided in a call. The model can't know what was never written down.

Each of these errors is a free audit item. You didn't have to run a tool to find it. The model found it for you.

The five minute audit

You don't need a formal process for this. You just need to treat your LLM's confusion as a signal instead of noise.

Pick a module. Any module that's been around for more than a few months. Feed it to your AI assistant and ask these questions:

"What does this module do? Describe it in two sentences."
"List all exported functions, their parameters, and their return types."
"What would break if I deleted this module?"
"What should I never pass to this module's main function?"
"Is there anything in this module that looks like it was deprecated but never removed?"

Don't correct the model when it gets something wrong. Write down what it got wrong. That list is your comprehension debt register.

For a healthy module, the model will get most of this right. For a module with comprehension debt, you'll see the five signals show up fast.

I ran this on an internal TypeScript service last quarter. Twelve exported functions. The model hallucinated the names of three of them, got the return type wrong on two others, and had no idea what the rate limit parameter was for. That's five of twelve functions wrong, about 42%, on a module I thought was well maintained. It wasn't. It just worked.
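If you want to keep the register as data rather than a scratch note, a sketch like the one below is enough. AuditEntry and wrongAnswerRate are hypothetical names, and the tally counts only the five function-level misses from the audit above (three names, two return types). Five of twelve is 41.7%, which truncates to 41 and rounds to 42.

```typescript
// Hypothetical register entry: one row per exported function the model was asked about.
interface AuditEntry {
  fn: string;
  wrong: boolean;
  note?: string;
}

// Share of audited functions the model got wrong, rounded to a whole percentage.
function wrongAnswerRate(entries: AuditEntry[]): number {
  if (entries.length === 0) return 0;
  const wrong = entries.filter((e) => e.wrong).length;
  return Math.round((wrong / entries.length) * 100);
}

// Twelve functions with placeholder names; the first five are marked wrong.
const register: AuditEntry[] = Array.from({ length: 12 }, (_, i) => ({
  fn: `fn${i + 1}`,
  wrong: i < 5,
}));

const rate = wrongAnswerRate(register); // 42
```

Re-running the same audit after a documentation pass gives you a before and after number for the same module.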

Working and legible are not the same thing.

What to do about it

The instinct is to reach for RAG (chunk your codebase, embed it, retrieve relevant context before each LLM call). That helps. I cover the full approach in my production RAG guide if you want the implementation details.

But RAG retrieves your documentation. If your documentation is the code itself and the code is opaque, RAG gives the model better access to opaque code. The underlying problem doesn't change.

The actual fix is cheaper than you think:

Write the intent, not the implementation. A JSDoc comment that says "Validates and normalizes a user object. Always call this before persisting to the database. Does NOT check permissions." gives the model something to retrieve. A comment that says "validates user" does not.

Mark your deprecations inline. @deprecated Use getUserV2 instead takes five seconds. It means the model stops confidently recommending the old API.

Put your business rules in the file that enforces them. Not in the ticket. Not in Confluence. In the file. A comment above the rate limit parameter that says "this is hardcoded per the billing agreement with enterprise customers, do not make it configurable" is documentation that actually travels with the code.
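Here is what those three habits can look like side by side in one file. Everything in this sketch is hypothetical (the User shape, the function bodies, and the 600 value are made up for illustration); the comment wording is taken from the examples above.

```typescript
// Hypothetical user shape and functions, for illustration only.
interface User {
  email: string;
  deletedAt: Date | null;
}

// Business rule, stated where it is enforced:
// this is hardcoded per the billing agreement with enterprise customers.
// Do not make it configurable. (The value 600 is a placeholder.)
const RATE_LIMIT_PER_MINUTE = 600;

/**
 * Validates and normalizes a user object.
 * Always call this before persisting to the database.
 * Does NOT check permissions.
 */
function normalizeUser(user: User): User {
  const email = user.email.trim().toLowerCase();
  if (!email.includes("@")) {
    throw new Error(`Invalid email: ${user.email}`);
  }
  return { ...user, email };
}

/** @deprecated Use getUserV2 instead. This version ignores the soft delete flag. */
function getUser(users: User[], email: string): User | undefined {
  return users.find((u) => u.email === email);
}

/** Looks up a user by email. Soft-deleted users are never returned. */
function getUserV2(users: User[], email: string): User | undefined {
  return users.find((u) => u.email === email && u.deletedAt === null);
}
```

None of these comments describe how the code works. They describe what a reader is allowed to assume, which is the part the model cannot infer.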

The goal isn't to write documentation for humans. It's to write documentation that your LLM assistant can parse so it can help you correctly. The secondary effect is that it also helps the next human on your team. That's free.

For teams working on larger AI agent systems, the memory and context patterns that help here are the same ones I break down in my post on AI agent memory management. Comprehension debt in your codebase and context gaps in your agents come from the same root cause: undocumented intent.

You can also get a quick read on your current exposure with this LLM hallucination risk estimator. It won't diagnose specific debt, but it gives you a calibrated starting point for where to focus.

The model is the test

Your LLM assistant is, right now, the most honest reader your codebase has. It doesn't know the context you carry in your head. It doesn't remember the decision you made in 2024. It reads what's there and tries to make sense of it.

When it gets something wrong, that's signal. The model isn't failing. It's showing you exactly what a reader without your context has to work with.

That's a gift. Most code never gets that kind of external read until the next engineer joins and asks the same confused questions.

Use it.

If you want a deeper look at production AI systems, I cover it on mudassirkhan.me.

How wrong does your LLM get your own codebase? Drop a number in the comments. Curious what percentage of wrong answers people are seeing in production.

Building MCP Servers in TypeScript That Don't Fall Apart

Mudassir Khan — Sun, 24 May 2026 11:18:35 +0000

Your MCP server works great at tool number three. By tool number twelve it is a pile of switch cases you are afraid to touch. Here is the TypeScript architecture that keeps it clean as it grows — borrowed straight from Domain-Driven Design.

The Problem with Flat MCP Servers

Most MCP server tutorials start with something like this:

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
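import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  ListToolsRequestSchema,
  CallToolRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";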

const server = new Server(
  { name: "my-server", version: "1.0.0" },
  { capabilities: { tools: {} } }
);

server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    { name: "get_user", description: "Get a user by ID", inputSchema: { ... } },
    { name: "create_order", description: "Create an order", inputSchema: { ... } },
    { name: "send_email", description: "Send an email", inputSchema: { ... } },
    { name: "get_product", description: "Get product details", inputSchema: { ... } },
    // ...8 more tools
  ],
}));

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  switch (request.params.name) {
    case "get_user": return handleGetUser(request.params.arguments);
    case "create_order": return handleCreateOrder(request.params.arguments);
    case "send_email": return handleSendEmail(request.params.arguments);
    // ...
  }
});

Familiar? The problem is not that this code is wrong. It is that it scales to exactly one person maintaining it for exactly three weeks.

Once you hit real production requirements — different owners for user logic vs order logic, different auth contexts, stateful resources — you need structure. That is where Domain-Driven Design earns its keep.

Three DDD Concepts, One MCP Mapping

You do not need to read a 600-page book to apply the ideas that matter here. Three concepts cover 90% of what you will encounter:

DDD Concept	What it means	MCP equivalent
Bounded context	A namespace where terms have a consistent meaning	Tool name prefix (`users__get`, `orders__create`)
Aggregate root	The single entry point for a cluster of related state	A typed server context object passed to each handler
Domain event	A fact that happened in the system	An MCP notification emitted after a mutation

The fastest way to learn these is through code rather than theory. So let us build the thing.

Bounded Contexts as Tool Namespaces

In DDD, a bounded context is a boundary within which a particular domain model applies. In an MCP server, that maps cleanly to a tool naming convention plus a factory function that groups related tools together.

Here is a createDomainModule factory:

// src/domains/types.ts

import { Tool, CallToolResult } from "@modelcontextprotocol/sdk/types.js";

export interface DomainTool<TInput = unknown> {
  definition: Tool;
  handler: (input: TInput) => Promise<CallToolResult>;
}

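export interface DomainModule {
  namespace: string;
  // `any` lets tools with different input types share one array. Under `strict`,
  // a handler typed for `{ id: string }` is not assignable to DomainTool<unknown>.
  tools: DomainTool<any>[];
}

export function createDomainModule(
  namespace: string,
  tools: DomainTool<any>[]
): DomainModule {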
  return {
    namespace,
    tools: tools.map((tool) => ({
      ...tool,
      definition: {
        ...tool.definition,
        name: `${namespace}__${tool.definition.name}`,
      },
    })),
  };
}

Now define the users domain:

// src/domains/users.ts

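import { createDomainModule } from "./types.js";
// `db` below stands in for your own data-access layer; import it from wherever it lives in your project.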

export const usersDomain = createDomainModule("users", [
  {
    definition: {
      name: "get",
      description: "Get a user by ID",
      inputSchema: {
        type: "object",
        properties: {
          id: { type: "string", description: "User UUID" },
        },
        required: ["id"],
      },
    },
    handler: async ({ id }: { id: string }) => {
      const user = await db.users.findById(id);
      return {
        content: [{ type: "text", text: JSON.stringify(user) }],
      };
    },
  },
  {
    definition: {
      name: "list",
      description: "List users with optional filters",
      inputSchema: {
        type: "object",
        properties: {
          limit: { type: "number" },
          role: { type: "string" },
        },
      },
    },
    handler: async ({ limit = 20, role }: { limit?: number; role?: string }) => {
      const users = await db.users.findMany({ limit, role });
      return {
        content: [{ type: "text", text: JSON.stringify(users) }],
      };
    },
  },
]);

The resulting tool names are users__get and users__list. The double underscore is a common MCP convention for namespacing — it is readable and easy to split on in client code.
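On the client side, splitting a namespaced name back into its parts can be as small as this. It is a sketch under assumptions: parseToolName and ParsedToolName are hypothetical names, and splitting on the first separator (rather than every one) is a choice I made so that tool names containing a double underscore survive intact.

```typescript
// Hypothetical client-side helper; the separator matches the factory above.
const NAMESPACE_SEPARATOR = "__";

interface ParsedToolName {
  namespace: string | null; // null when the name carries no prefix
  tool: string;
}

function parseToolName(fullName: string): ParsedToolName {
  const index = fullName.indexOf(NAMESPACE_SEPARATOR);
  // No separator, or a name that starts with it: treat as un-namespaced.
  if (index <= 0) return { namespace: null, tool: fullName };
  return {
    namespace: fullName.slice(0, index),
    tool: fullName.slice(index + NAMESPACE_SEPARATOR.length),
  };
}
```

With this, users__get parses to namespace users and tool get, while a bare name such as ping comes back with a null namespace.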

I cover the agent design patterns that consume these namespaced tools in my multi-agent design patterns post — the namespace boundary maps directly to the "agent as specialist" pattern.

Aggregate Root for Server State

The aggregate root is the single class responsible for coordinating all writes to a domain's state. In an MCP server, "state" usually means cached data, auth tokens, connection pools, or feature flags that multiple tools share.

The anti-pattern: each handler closes over a different reference to the same underlying data, and you end up with stale reads mid-session.

The fix is a typed ServerContext that every handler receives as an injected dependency:

// src/context.ts

export interface UserContext {
  currentUserId: string | null;
  permissions: Set<string>;
  refresh: () => Promise<void>;
}

export interface ServerContext {
  users: UserContext;
  requestId: string;
  startedAt: number;
}

export function createServerContext(): ServerContext {
  const users: UserContext = {
    currentUserId: null,
    permissions: new Set(),
    async refresh() {
      // `auth` stands in for your own session provider; it is not defined in this snippet.
      const session = await auth.getCurrentSession();
      this.currentUserId = session?.userId ?? null;
      this.permissions = new Set(session?.permissions ?? []);
    },
  };

  return {
    users,
    requestId: crypto.randomUUID(),
    startedAt: Date.now(),
  };
}

Then update the DomainTool interface to carry context:

export interface DomainTool<TInput = unknown> {
  definition: Tool;
  handler: (input: TInput, ctx: ServerContext) => Promise<CallToolResult>;
}

Now each handler has a type safe path to shared state with no globals:

handler: async ({ id }: { id: string }, ctx: ServerContext) => {
  if (!ctx.users.permissions.has("users:read")) {
    return {
      content: [{ type: "text", text: "Forbidden" }],
      isError: true,
    };
  }
  const user = await db.users.findById(id);
  return { content: [{ type: "text", text: JSON.stringify(user) }] };
},
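Once several handlers repeat that permission check, you may want to lift it into a wrapper. The sketch below is one way to do it, not part of the MCP SDK: requirePermission is a hypothetical helper, and ToolResult, Ctx, and Handler are local stand-ins for the types defined earlier so the example is self-contained.

```typescript
// Minimal local stand-ins for the article's types.
interface ToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}
interface Ctx {
  users: { permissions: Set<string> };
}
type Handler<TInput> = (input: TInput, ctx: Ctx) => Promise<ToolResult>;

// Hypothetical wrapper: runs the permission check before the wrapped handler.
function requirePermission<TInput>(
  permission: string,
  handler: Handler<TInput>
): Handler<TInput> {
  return async (input, ctx) => {
    if (!ctx.users.permissions.has(permission)) {
      return { content: [{ type: "text", text: "Forbidden" }], isError: true };
    }
    return handler(input, ctx);
  };
}

// Usage: the handler body no longer repeats the check.
const getUserHandler = requirePermission<{ id: string }>(
  "users:read",
  async ({ id }) => ({ content: [{ type: "text", text: `user ${id}` }] })
);
```

The permission string now sits next to the tool definition, which makes it easy to audit which tools require what.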

A stateful MCP server is also an injection surface worth thinking about carefully. I cover the specific attack patterns you should harden against in my post on prompt injection hardening for agents.

Domain Events as MCP Notifications

Domain events record facts that have happened ("OrderPlaced", "UserCreated"). MCP has a notification mechanism that maps naturally to this pattern.

Here is a typed event emitter that handlers can call to publish notifications to connected clients:

// src/events.ts

import { Server } from "@modelcontextprotocol/sdk/server/index.js";

type EventName =
  | "users.created"
  | "orders.placed"
  | "orders.failed";

interface DomainEvent<TData = unknown> {
  name: EventName;
  data: TData;
  occurredAt: number;
}

export function createEventEmitter(server: Server) {
  return {
    async emit<TData>(name: EventName, data: TData): Promise<void> {
      const event: DomainEvent<TData> = {
        name,
        data,
        occurredAt: Date.now(),
      };

      // MCP notification — any connected client sees this
      await server.notification({
        method: "notifications/message",
        params: {
          level: "info",
          logger: name,
          data: JSON.stringify(event),
        },
      });
    },
  };
}

Use it inside a handler after a successful mutation:

handler: async ({ email, role }: CreateUserInput, ctx: ServerContext) => {
  const user = await db.users.create({ email, role });

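  // `events` is the emitter from createEventEmitter(server); how it reaches this
  // handler (a module export, or a field on ctx) is left to your setup.
  await events.emit("users.created", { userId: user.id, email });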

  return {
    content: [{ type: "text", text: JSON.stringify(user) }],
  };
},

The client (your AI agent or IDE) now receives a realtime notification whenever a user is created — without polling. For testing these notification paths, the evaluation patterns I describe in evaluating LLM agents in production apply directly: set up an observer client, trigger the mutation, assert the notification fired within your SLA window.

Wiring It All Together

Here is a complete server.ts that composes the domain modules, context, and events:

// src/server.ts

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  ListToolsRequestSchema,
  CallToolRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";

import { createServerContext } from "./context.js";
import { createEventEmitter } from "./events.js";
import { usersDomain } from "./domains/users.js";
import { ordersDomain } from "./domains/orders.js";

const server = new Server(
  { name: "my-structured-server", version: "1.0.0" },
  { capabilities: { tools: {}, logging: {} } }
);

const events = createEventEmitter(server);
const domains = [usersDomain, ordersDomain];

// Flat list of all tool definitions across all domains
const allTools = domains.flatMap((d) =>
  d.tools.map((t) => t.definition)
);

// Lookup map: tool name → handler
const handlerMap = new Map(
  domains.flatMap((d) =>
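    // `as const` keeps each entry a [name, handler] tuple, which the Map constructor requires
    d.tools.map((t) => [t.definition.name, t.handler] as const)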
  )
);

server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: allTools,
}));

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const ctx = createServerContext();
  await ctx.users.refresh();

  const handler = handlerMap.get(request.params.name);
  if (!handler) {
    return {
      content: [{ type: "text", text: `Unknown tool: ${request.params.name}` }],
      isError: true,
    };
  }

  return handler(request.params.arguments ?? {}, ctx);
});

const transport = new StdioServerTransport();
await server.connect(transport);

The handlerMap lookup is O(1). Adding a new domain means writing one file, then adding one import and one entry to the domains array in server.ts. The request handlers themselves do not change.
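One caveat with building the map this way: a Map silently keeps the last entry when two tools share a name, for example if two domains are registered with the same namespace. A startup check is one way to catch that early. This is a sketch, assuming you are happy to fail fast at boot; buildHandlerMap, NamedTool, and Domain are hypothetical names that mirror the shapes used above.

```typescript
// Minimal stand-ins for the article's types.
interface NamedTool<H> {
  definition: { name: string };
  handler: H;
}
interface Domain<H> {
  namespace: string;
  tools: NamedTool<H>[];
}

// Hypothetical startup check: builds the lookup map and throws on a duplicate name.
function buildHandlerMap<H>(domains: Domain<H>[]): Map<string, H> {
  const map = new Map<string, H>();
  for (const domain of domains) {
    for (const tool of domain.tools) {
      const name = tool.definition.name;
      if (map.has(name)) {
        throw new Error(`Duplicate tool name: ${name}`);
      }
      map.set(name, tool.handler);
    }
  }
  return map;
}
```

Swapping this in for the inline new Map(...) keeps the O(1) lookup and turns a silent overwrite into an error at startup.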

Is This Overkill?

For three tools? Yes. For a server that a team will ship, maintain, and extend over six months? No.

The patterns here add about 40 lines of infrastructure. In exchange you get: namespaced tools that cannot collide, a single typed context object instead of scattered globals, and notifications that fire without wiring up polling clients.

If you are not sure whether your use case actually needs an MCP server or a simpler agent framework, the AI agent framework chooser on my site walks through the decision.

If you want a deeper look at how stateful MCP servers fit into larger agent architectures, I cover it in more detail on my site.

If you want this wired up on your own production system end to end, that is exactly the kind of work I take on.

Drop a comment if your setup looks different — curious what variations people are running in production.

WebMCP and the Browser AI Layer: What Next.js Devs Need to Know

Mudassir Khan — Sat, 23 May 2026 22:00:38 +0000

The dev.to MCP discourse this week is predictably binary — either "WebMCP will destroy how browsers work" or "it's just another spec that won't ship." Both framings skip the question that actually matters for practitioners: when a browser speaks MCP natively, what specifically changes in a Next.js app, and what stays the same?

I've been watching the WebMCP proposal since it surfaced and I want to give you a grounded read — not hype, not dismissal.

What WebMCP actually is (and isn't)

MCP (Model Context Protocol) is a protocol for letting AI agents communicate with tools and data sources in a structured way. You've likely used it via the Node.js SDK: you define server capabilities, expose them, and an LLM host calls them.

WebMCP is the proposed browser-native version of that same protocol — a JavaScript API that lets a webpage (or a web application) act as an MCP client directly inside the browser, without proxying through a backend server. The browser itself becomes an MCP-aware runtime.

What it is not: a replacement for your Next.js API routes. It is not a way to skip your backend. And it is not — at the time I'm writing this — a shipped standard. It's a proposal with real momentum (Google showed a demo at I/O, the Chromium bug tracker has an entry, and multiple browser teams are watching), but momentum is not a ship date.

The distinction matters because several posts this week conflate "WebMCP exists as a spec" with "your existing Next.js app is already obsolete." It's not.

What actually changes for your Next.js routes

The meaningful shift is about who initiates the MCP conversation.

Today, if your Next.js app integrates with an AI agent, the flow looks like this: browser → Next.js API route → MCP server → LLM. Your Next.js route is the trust boundary. It holds the credentials, it validates the request, it shapes the response.

With WebMCP, a browser-native MCP client can connect directly to an MCP server — your server — without going through the Next.js route at all. The flow becomes: browser (WebMCP client) → MCP server.

Here is a minimal illustration of what that means for a capability you might expose today:

// Today: your Next.js API route wraps the MCP capability
// app/api/search/route.ts
export async function POST(req: Request) {
  const { query } = await req.json();
  // validate session, rate-limit, sanitise
  const result = await mcpClient.callTool('search', { query });
  return Response.json(result);
}

// With WebMCP: the browser calls your MCP server directly.
// Your route no longer sits between the browser and the capability.
// Anything that was implicitly safe because it was server-side
// now needs to be explicitly safe at the MCP server boundary.

The security model has to move from the HTTP route layer to the MCP server itself. If your MCP server today relies on "only my backend calls this," that assumption breaks with WebMCP.
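To make that concrete, here is a rough sketch of the three checks from the route's comment (validate session, rate-limit, sanitise) moved into a guard that runs at the MCP server boundary. None of this is a WebMCP or MCP SDK API: guardToolCall, Caller, GuardResult, and the limits are all hypothetical, and the in-memory rate limiter only works for a single process.

```typescript
// All names here are hypothetical; this is not a WebMCP or MCP SDK API.
interface Caller {
  userId: string | null; // null when the request carried no valid credentials
}
interface GuardResult {
  ok: boolean;
  reason?: string;
}

const WINDOW_MS = 60_000;
const MAX_CALLS_PER_WINDOW = 30;
const recentCalls = new Map<string, number[]>(); // userId -> call timestamps

function guardToolCall(caller: Caller, query: unknown, now: number = Date.now()): GuardResult {
  // 1. Auth: the caller must be identified before anything else runs.
  const userId = caller.userId;
  if (userId === null) return { ok: false, reason: "unauthenticated" };

  // 2. Input validation: only non-empty strings of bounded length.
  if (typeof query !== "string" || query.length === 0 || query.length > 256) {
    return { ok: false, reason: "invalid input" };
  }

  // 3. Rate limiting: in-memory sliding window per user.
  const calls = (recentCalls.get(userId) ?? []).filter((t) => now - t < WINDOW_MS);
  if (calls.length >= MAX_CALLS_PER_WINDOW) {
    recentCalls.set(userId, calls);
    return { ok: false, reason: "rate limited" };
  }
  calls.push(now);
  recentCalls.set(userId, calls);
  return { ok: true };
}
```

A tool handler would call the guard first and return an error result when ok is false. With more than one server instance, the timestamps would need to live in a shared store instead of a local Map.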

What to watch before you ship anything WebMCP-adjacent

Three things I'm tracking — and you should too.

1. The trust boundary shift is not free. Every tool you expose through an MCP server needs to be safe to call from an untrusted browser context, not just from your backend. That means per-user auth, rate limiting, and input validation at the MCP layer, none of which that layer had to handle while your API route did the work. I wrote about the prompt injection surface area that MCP servers already carry in a separate post on prompt injection risks. The WebMCP expansion makes that surface much wider.

2. The spec is still moving. OAuth-style per-user sessions for WebMCP are proposed but not finalised. The browser API shape changed at least twice between I/O and the current draft. Building on top of a draft spec is fine for exploration but not for production. I'd build the MCP server side of your stack now — that's stable — and treat the browser client as a layer you'll wire up later.

3. SEO and AEO implications are real but indirect. If AI browsers start rendering MCP-powered responses inside the browser chrome (not your page), what gets indexed changes. The page content that currently signals authority to crawlers may not be what the AI browser layer actually presents to the user. This is the piece of the puzzle I'm watching most closely — it's still early, but it's the same pattern I've been mapping in my post on agentic AI production architecture where the rendering and the indexing layer start to diverge.

What to do right now (practically)

You don't need to refactor anything today. But you should do two things:

Audit your MCP server's trust assumptions. Walk every tool your MCP server exposes and ask: "If an untrusted browser client called this directly, what's the worst that happens?" If the answer involves data exfiltration, privilege escalation, or uncapped resource use — that's your prep list for when WebMCP ships.

Track the spec, not the takes. The Chromium intent to prototype and the WebMCP GitHub repo are better signal than blog posts (including this one). Subscribe to the MCP spec changelog. When the browser teams move from "intent to prototype" to "origin trial," that's when you actually need to ship something.

I built an agentic workspace that hit several of these boundary problems in production — the patterns I used there are the same ones WebMCP will put pressure on at the browser layer.

If you want a deeper read on building production-grade AI agent systems, I cover the architecture decisions in more detail on mudassirkhan.me.

If you want this kind of thing wired up on your stack end to end — MCP servers, Next.js integration, trust boundary design — that is exactly the kind of work I take on.

Drop a comment if you're already running an MCP server behind a Next.js route — curious what your auth layer looks like and whether you've thought about the WebMCP scenario yet.

llms.txt vs robots.txt vs ai.txt: The Developer's Cheat Sheet

Mudassir Khan — Sat, 23 May 2026 12:40:02 +0000

Every Next.js developer building a public site in the last 18 months has hit the same wall: you Google "how to control what AI crawlers read" and get three different answers pointing at three different files — robots.txt, llms.txt, and ai.txt. They are not the same thing. They do not talk to the same audience. And using the wrong one (or none at all) means AI search engines are either ignoring your content entirely or indexing pages you never intended them to.

This is the one-stop breakdown I wish existed when I was figuring this out.

The three files at a glance

	`robots.txt`	`llms.txt`	`ai.txt`
Proposed by	Martijn Koster (1994)	Answer.AI (Jeremy Howard, 2024)	AI-txt.com initiative
Primary audience	Web crawlers (Googlebot, Bingbot, etc.)	LLM training & AI search crawlers	AI assistants (ChatGPT, Claude, Gemini)
Format	Key-value directives	Markdown / structured text	Key-value + JSON blocks
Spec status	RFC standard, universally supported	Emerging, growing adoption	Early proposal, limited adoption
Enforced by	All major search engines	Anthropic, Perplexity, some others	No major enforcer yet
Location	`yourdomain.com/robots.txt`	`yourdomain.com/llms.txt`	`yourdomain.com/ai.txt`

The short mental model: robots.txt is for Googlebot. llms.txt is for ClaudeBot and GPTBot when they are building knowledge, not just indexing. ai.txt is a newer proposal that tries to cover AI assistants reading your site in real time. Use all three if you want complete coverage — but robots.txt + llms.txt is where you get 90% of the value today.

robots.txt — the original gatekeeper

robots.txt has been around since 1994. Well-behaved crawlers (Googlebot, Bingbot, DuckDuckBot, GPTBot, ClaudeBot, PerplexityBot) check it before crawling, but compliance is voluntary: the file is a request, not an access control. If you block GPTBot in robots.txt, a compliant GPTBot will not crawl your site for training data or AI-search indexing.

Basic syntax:

User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /blog/
Disallow: /

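User-agent: * applies to every crawler that has no group of its own. A crawler with a named group follows only that group and ignores the * rules, so repeat any shared Disallow lines under each named user-agent. Allow and Disallow match path prefixes; the 1994 convention had no wildcards, though RFC 9309 and most modern crawlers support * and $.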
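
The override behaviour can be sketched in a few lines. This is a simplified model of the matching rules in RFC 9309, assuming plain path prefixes with no wildcard support; the Group type and the selectGroup and isAllowed functions are hypothetical names for illustration, not part of any library.

```typescript
// Simplified robots.txt evaluation (no `*` or `$` pattern support).
type Group = { userAgent: string; allow: string[]; disallow: string[] };

// Pick the group that names this bot; fall back to the "*" group.
function selectGroup(groups: Group[], bot: string): Group | undefined {
  const name = bot.toLowerCase();
  return (
    groups.find((g) => g.userAgent.toLowerCase() === name) ??
    groups.find((g) => g.userAgent === "*")
  );
}

// The longest matching rule wins; on a tie, Allow wins.
function isAllowed(groups: Group[], bot: string, path: string): boolean {
  const group = selectGroup(groups, bot);
  if (!group) return true; // no applicable group: everything is crawlable

  // Length of the longest rule that prefixes the path, or -1 if none match.
  const longest = (rules: string[]): number =>
    rules
      .filter((rule) => rule.length > 0 && path.startsWith(rule))
      .reduce((max, rule) => Math.max(max, rule.length), -1);

  return longest(group.allow) >= longest(group.disallow);
}

// The rules from the example above.
const groups: Group[] = [
  { userAgent: "*", allow: [], disallow: ["/admin/", "/private/"] },
  { userAgent: "GPTBot", allow: [], disallow: ["/"] },
  { userAgent: "ClaudeBot", allow: ["/blog/"], disallow: ["/"] },
];

console.log(isAllowed(groups, "ClaudeBot", "/blog/post-one")); // true
console.log(isAllowed(groups, "ClaudeBot", "/pricing")); // false
console.log(isAllowed(groups, "GPTBot", "/blog/post-one")); // false
console.log(isAllowed(groups, "Googlebot", "/admin/users")); // false
```

Note that ClaudeBot is kept out of /admin/ only because its own group disallows /, not because of the * group, which it never consults.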

In Next.js App Router — generate it dynamically from app/robots.ts:

// app/robots.ts
import type { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        allow: '/',
        disallow: ['/admin/', '/api/private/'],
      },
      {
        userAgent: 'GPTBot',
        allow: ['/blog/', '/services/'],
        disallow: ['/'],
      },
    ],
    sitemap: 'https://yourdomain.com/sitemap.xml',
  }
}

Next.js renders this as text/plain at /robots.txt automatically. No separate file needed.

What robots.txt does NOT do: it is not access control. It asks compliant crawlers not to fetch certain paths, but it cannot stop a crawler that ignores it, and it does nothing about copies fetched before the rule existed. A blocked URL can also still surface in an index if other pages link to it. If /admin/ must stay private, put it behind authentication.

llms.txt — built for the AI era

llms.txt was proposed by Answer.AI and picked up by Anthropic, Perplexity, and others as a structured way to tell LLMs what your site is actually about — not just what they can crawl, but what context they should carry when reasoning about your content.

Unlike robots.txt which is access control, llms.txt is documentation. Think of it as a README for your site aimed at language models.

Basic structure:

# YourSite

> One-line description of what this site is and who it's for.

A few sentences of context. What does this site do? Who is the author?
What should an LLM understand before citing any page from this domain?

## Blog

- [Post title one](https://yourdomain.com/blog/post-one): One-line summary.
- [Post title two](https://yourdomain.com/blog/post-two): One-line summary.

## Services

- [Service name](https://yourdomain.com/services/name): What this service does in one line.

## Contact

- Author: Your Name
- Email: you@yourdomain.com
- LinkedIn: linkedin.com/in/yourhandle

The format is intentionally plain Markdown. No special parser needed — any LLM can read it. The /llms.txt path is the convention; some sites also serve /llms-full.txt with deeper content for models that want more context.
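
Because the file is plain Markdown, a consumer can pull out its sections with a couple of string checks and one regular expression. A minimal sketch, assuming the heading and link-list layout shown above; parseLlmsTxt and its return shape are hypothetical, not part of any published tooling.

```typescript
type LlmsLink = { title: string; url: string; note: string };
type LlmsDoc = {
  title: string;
  summary: string;
  sections: Record<string, LlmsLink[]>;
};

function parseLlmsTxt(text: string): LlmsDoc {
  const doc: LlmsDoc = { title: "", summary: "", sections: {} };
  let links: LlmsLink[] | null = null; // link list of the section being read

  for (const raw of text.split("\n")) {
    const line = raw.trim();
    // Matches "- [Title](url): optional note"
    const link = line.match(/^- \[(.+?)\]\((.+?)\)(?::\s*(.*))?$/);

    if (line.startsWith("# ")) {
      doc.title = line.slice(2).trim();
    } else if (line.startsWith("> ")) {
      doc.summary = line.slice(2).trim();
    } else if (line.startsWith("## ")) {
      links = [];
      doc.sections[line.slice(3).trim()] = links;
    } else if (link && links) {
      links.push({
        title: link[1] ?? "",
        url: link[2] ?? "",
        note: link[3] ?? "",
      });
    }
  }
  return doc;
}

const sample = [
  "# YourSite",
  "",
  "> One-line description of what this site is and who it's for.",
  "",
  "## Blog",
  "",
  "- [Post title one](https://yourdomain.com/blog/post-one): One-line summary.",
].join("\n");

const parsed = parseLlmsTxt(sample);
console.log(parsed.title, Object.keys(parsed.sections));
```

Lines that are not links, such as the plain "- Author: Your Name" entries under Contact, are skipped by this sketch rather than parsed.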

In Next.js — generate dynamically from app/llms.txt/route.ts:

// app/llms.txt/route.ts
import { blogPosts } from '@/data/blog-posts'

export async function GET() {
  const lines = [
    '# YourSite',
    '',
    '> AI-search-ready Next.js development and SEO consulting.',
    '',
    'This site covers AI engineering, GEO/AEO, and production Next.js patterns.',
    '',
    '## Blog',
    '',
    ...blogPosts.map(p => `- [${p.title}](https://yourdomain.com/blog/${p.slug}): ${p.excerpt}`),
    '',
    '## Services',
    '',
    '- [AI-Search Consulting](https://yourdomain.com/services): End-to-end GEO and AEO for Next.js sites.',
  ]

  return new Response(lines.join('\n'), {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  })
}

This keeps llms.txt in sync with your actual content automatically — no manual updates.

Who reads llms.txt today: as far as I can tell, ClaudeBot (Anthropic's crawler), PerplexityBot, and some versions of GPTBot, though none of these vendors documents it as a guarantee. Adoption is growing fast. If you are publishing content you want AI search engines to cite accurately, this file is worth the few minutes it takes.

ai.txt — the wildcard

ai.txt is a newer proposal from a different working group. Where llms.txt focuses on what an LLM should know about your site, ai.txt focuses on granting or denying permission for AI assistants to use your content in responses.

Basic syntax:

# ai.txt
Version: 1.0

[permissions]
allow: true
commercial-use: false
training: false
real-time-access: true

[attribution]
require: true
format: "Source: {title} ({url})"

[contact]
email: you@yourdomain.com

Honest assessment: ai.txt has minimal enforcer support right now. No major AI company officially reads it. The spec is still evolving. That said, if the initiative gains traction (similar to how robots.txt went from informal convention to de-facto standard), having it early costs nothing and signals intent.

For most developers today: add it, keep it simple, and do not spend more than 10 minutes on it.

How a crawler actually decides what to read

The flow for a modern AI crawler like GPTBot or ClaudeBot hitting your domain:

1. Fetch robots.txt: am I allowed to crawl this path?
2. If allowed, fetch the page HTML
3. Fetch llms.txt (periodically, not per-request): what is this site actually about?
4. Check ai.txt if the implementation supports it
5. Index the content with the context from steps 2–4 combined

The key insight: robots.txt is checked per crawl request. llms.txt is fetched periodically and cached — it shapes how the model understands your whole site over time, not just whether it can read one page.
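
The "fetched periodically and cached" behaviour amounts to a value with a time-to-live. Crawler vendors do not publish their internals, so the sketch below is only an illustration of the idea: the cached helper, the 60-second interval, and the fake clock are all assumptions made for the example.

```typescript
// Wrap a loader so its result is reused until the time-to-live expires.
function cached<T>(
  ttlMs: number,
  load: () => T,
  now: () => number = Date.now,
): () => T {
  let value: T | undefined;
  let fetchedAt = Number.NEGATIVE_INFINITY; // forces a load on first use

  return () => {
    const t = now();
    if (t - fetchedAt >= ttlMs) {
      value = load();
      fetchedAt = t;
    }
    return value as T;
  };
}

// Fake clock and loader so the behaviour is visible without a network.
let clock = 0;
let loads = 0;
const siteContext = cached(
  60_000,
  () => {
    loads += 1;
    return `llms.txt v${loads}`;
  },
  () => clock,
);

siteContext(); // first call loads
clock = 30_000;
siteContext(); // still fresh, served from cache
clock = 60_000;
siteContext(); // time-to-live elapsed, reloads
console.log(loads); // 2
```

The practical consequence is that an edit to llms.txt may take one refresh interval to be seen, whereas a robots.txt change would apply on the next request under the flow described above.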

Putting it all together in Next.js App Router

Here is the complete implementation for a Next.js 15 App Router site:

app/
├── robots.ts          ← generates /robots.txt
├── sitemap.ts         ← generates /sitemap.xml
├── llms.txt/
│   └── route.ts       ← generates /llms.txt (dynamic, always in sync)
public/
└── ai.txt             ← static file, update manually

robots.ts (full version with AI crawler rules):

// app/robots.ts
import type { MetadataRoute } from 'next'

const BASE_URL = process.env.NEXT_PUBLIC_SITE_URL ?? 'https://yourdomain.com'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // Default: allow everything
      { userAgent: '*', allow: '/' },

      // GPTBot: open the content sections, block admin and API routes
      // (paths not listed under disallow stay crawlable)
      {
        userAgent: 'GPTBot',
        allow: ['/blog/', '/services/', '/tools/'],
        disallow: ['/admin/', '/api/'],
      },

      // ClaudeBot: same rules
      {
        userAgent: 'ClaudeBot',
        allow: ['/blog/', '/services/', '/tools/'],
        disallow: ['/admin/', '/api/'],
      },

      // PerplexityBot: full access (it drives meaningful referral traffic)
      { userAgent: 'PerplexityBot', allow: '/' },

      // Google-Extended (used for Gemini training): restrict to blog only
      {
        userAgent: 'Google-Extended',
        allow: ['/blog/'],
        disallow: ['/'],
      },
    ],
    sitemap: `${BASE_URL}/sitemap.xml`,
  }
}

llms.txt/route.ts (dynamic, pulls from your data layer):

// app/llms.txt/route.ts
import { blogPosts } from '@/data/blog-posts'
import { servicePosts } from '@/data/services'

const BASE_URL = process.env.NEXT_PUBLIC_SITE_URL ?? 'https://yourdomain.com'

export async function GET() {
  const content = `# YourSite

> [One-sentence description of your site and its purpose.]

[Two or three sentences giving an LLM the context it needs to cite your site accurately.
What topics do you cover? Who is the author? What makes this site's perspective unique?]

## Blog

${blogPosts.map(p => `- [${p.title}](${BASE_URL}/blog/${p.slug}): ${p.excerpt}`).join('\n')}

## Services

${servicePosts.map(s => `- [${s.title}](${BASE_URL}/services/${s.slug}): ${s.summary}`).join('\n')}

## Author

- Name: Your Name
- Site: ${BASE_URL}
- Expertise: [Your primary expertise areas]
`

  return new Response(content, {
    headers: {
      'Content-Type': 'text/plain; charset=utf-8',
      'Cache-Control': 'public, max-age=3600, stale-while-revalidate=86400',
    },
  })
}

public/ai.txt (static, update when the spec stabilizes):

# ai.txt
Version: 1.0

[permissions]
allow: true
commercial-use: false
training: false
real-time-access: true

[attribution]
require: true

[contact]
site: https://yourdomain.com

Which file do you actually need?

Start here:

Building a public Next.js site? → Add robots.txt first. Always.
Want AI search engines (ChatGPT, Perplexity, Claude) to cite your content accurately? → Add llms.txt. This is the highest-leverage file for AI-search visibility right now.
Want to future-proof against the ai.txt spec gaining enforcement? → Add a simple public/ai.txt. It costs 10 lines.
Want to block AI crawlers from training on your content? → Set Disallow: / for GPTBot, ClaudeBot, and Google-Extended in robots.txt. Of the three files, this is the only one those crawlers are documented to honor today, and compliance is still voluntary.

The one mistake I see most often: developers add robots.txt but skip llms.txt, then wonder why ChatGPT gives wrong answers about their site even though Googlebot indexes it fine. Googlebot and GPTBot-for-knowledge are completely different crawlers with different purposes.

Three things to verify right now

Open your terminal and check these three URLs on your live site:

curl https://yourdomain.com/robots.txt
curl https://yourdomain.com/llms.txt
curl https://yourdomain.com/ai.txt

If any returns a 404 or HTML error page, that file is missing. For robots.txt, a missing file means all crawlers assume full access — usually fine for public sites, but you lose granular control. For llms.txt, missing means LLMs are forming their understanding of your site from raw page HTML with no structured context — which almost always leads to inaccurate citations.
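
The same check can be scripted so it looks at the status code and content type instead of relying on reading curl output by eye. A minimal sketch: classify, checkSite, and the verdict labels are hypothetical names, and the fetch function is passed in so the logic can be exercised without a network.

```typescript
type Verdict = "ok" | "missing" | "html-fallback" | "error";

// A 2xx response with an HTML body usually means a fallback page was served
// instead of the text file.
function classify(status: number, contentType: string): Verdict {
  if (status === 404) return "missing";
  if (status < 200 || status >= 300) return "error";
  return contentType.toLowerCase().includes("text/html") ? "html-fallback" : "ok";
}

// The subset of the fetch API this sketch relies on.
type FetchLike = (url: string) => Promise<{
  status: number;
  headers: { get(name: string): string | null };
}>;

async function checkSite(
  origin: string,
  fetchFn: FetchLike,
): Promise<Record<string, Verdict>> {
  const results: Record<string, Verdict> = {};
  for (const file of ["robots.txt", "llms.txt", "ai.txt"]) {
    const res = await fetchFn(`${origin}/${file}`);
    results[file] = classify(res.status, res.headers.get("content-type") ?? "");
  }
  return results;
}

// Usage on a runtime with a global fetch (for example Node 18+):
//   checkSite("https://yourdomain.com", fetch).then(console.log);
```

Treating an HTML response as a separate verdict catches the case where the server answers 200 with a not-found page, which a status-only check would miss.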

If you want a deeper look at how AI crawlers read Next.js sites specifically — what RSC payloads they fetch, how streaming affects what they see, and which metadata fields they actually use — I have a longer writeup on the AI-search architecture patterns I use in production that goes further than this cheat sheet.

And if you want this wired up on your own site end-to-end, that is exactly the kind of work I take on.

If your own llms.txt or robots.txt setup looks different from what I showed here — especially if you are on an older Next.js version or using the Pages Router — drop it in the comments. Curious what variations people are running in production.

CLaRa: Fixing RAG’s Broken Retrieval–Generation Pipeline With Shared-Space Learning

Mudassir Khan — Tue, 09 Dec 2025 11:00:13 +0000

Retrieval-Augmented Generation (RAG) has become the default solution for grounding LLM outputs in external knowledge. But the classical RAG setup still carries a major architectural flaw: the retriever and generator learn in isolation. This separation quietly sabotages accuracy, increases hallucinations, and prevents genuine end-to-end optimization.

CLaRa (Continuous Latent Reasoning) introduces a fundamentally different approach: one that actually allows the retriever to learn from what the generator gets wrong.

Let’s break down why that matters.

The Core Problem: RAG Is Optimizing Two Brains That Never Talk

Traditional RAG pipelines train two components separately:

Retriever → picks documents using similarity search (dense or sparse).

Generator (LLM) → takes raw text and tries to answer.

The failure point?
There is no gradient flow between these two components.

The retriever has no idea whether the documents it selected actually helped the generator produce the correct answer. It only optimizes for similarity—not usefulness.

This leads to:

"Close but wrong" retrieved documents

Irrelevant context passed to the LLM

Weak factual grounding because retrieval can't learn from generation errors

RAG keeps trying harder at the wrong task.

CLaRa’s Fix: A Shared Continuous Representation Space

CLaRa solves the broken gradient issue by mapping both queries and documents into a shared representation space.

This changes everything.

How the shared space helps:

Document embeddings and query embeddings coexist in the same vector space

The generator’s final answer loss backpropagates through the retriever

Retriever learns what actually helps answer a query

Retrieval stops being a similarity contest and becomes a relevance optimization loop

This feedback loop is the missing piece in traditional RAG.

The result:
Your retriever becomes intelligent — not just associative.
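
One way to see why a shared space makes retrieval trainable: if the retriever's choice is expressed as smooth weights rather than a hard top-k cut, any change to the query or document vectors shifts those weights continuously, which is what lets a loss propagate back through them. The sketch below uses a plain softmax over dot products as a stand-in. It is an illustrative relaxation with made-up vectors, not the estimator CLaRa itself uses, and retrievalWeights and mixedContext are hypothetical names.

```typescript
type Vec = number[];

const dot = (a: Vec, b: Vec): number =>
  a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);

// Softmax over query-document scores: a smooth alternative to hard top-k.
function retrievalWeights(query: Vec, docs: Vec[], temperature = 1): number[] {
  const scores = docs.map((d) => dot(query, d) / temperature);
  const max = Math.max(...scores); // subtract the max for numerical stability
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Weighted mix of document vectors: the context a generator would consume.
function mixedContext(weights: number[], docs: Vec[]): Vec {
  const dim = docs[0]?.length ?? 0;
  return Array.from({ length: dim }, (_, j) =>
    docs.reduce((sum, d, i) => sum + (weights[i] ?? 0) * (d[j] ?? 0), 0),
  );
}

// Toy query and document vectors living in the same 2-D space.
const query: Vec = [1, 0];
const docs: Vec[] = [
  [2, 0],
  [0, 2],
  [1, 1],
];

const weights = retrievalWeights(query, docs);
console.log(weights.map((w) => w.toFixed(3))); // [ '0.665', '0.090', '0.245' ]
console.log(mixedContext(weights, docs));
```

Lowering the temperature pushes the weights toward a single winning document, which approximates hard selection while keeping the function differentiable.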

Document Compression: Retrieval Without Text Bloat

One of CLaRa’s most practical innovations is how it handles documents:

It never retrieves raw text. It retrieves compressed memory tokens.

These are compact, dense vector representations that summarize meaning, not wording.

How it works:

Document → compressed memory tokens (embeddings)

Retriever fetches tokens instead of full text

Generator consumes tokens directly

Why this matters:

Context length shrinks dramatically

You can process more documents without hitting LLM token limits

Computation cost drops

Throughput increases

This isn’t just more accurate — it’s more efficient.

SCP: Training the Compressor to Capture Meaning, Not Noise

CLaRa doesn’t trust standard compression to produce semantically meaningful vectors (and rightly so).
So it introduces Salient Compressor Pre-training (SCP).

Goal of SCP:

Make compressed representations focus on meaning, not superficial text features.

How SCP trains the compressor:

The system uses synthetic data generated by an LLM:

Simple QA pairs

Complex QA tasks

Paraphrased document sets

The compressor is trained to:

Generate embeddings that can answer these questions

Reconstruct paraphrased meaning (not the exact text)

This forces the vectors to internalize the semantic core of the document.

By the time end-to-end training starts, the compressor already knows how to distill content into high-information embeddings.

Why CLaRa Matters

CLaRa isn't just a tweak — it’s a structural correction to how RAG should work:

Retriever learns from generator errors

Vector-based compressed memory beats raw-text retrieval

End-to-end gradients reconnect the entire pipeline

Accuracy improves without inflating compute

Embeddings become meaning-first, not token-first

This is the kind of architecture shift that will define the next generation of knowledge-augmented LLM systems.

The Rise of Prompt-Driven Development: Why the Future of Software May Be Written in Prompts

Mudassir Khan — Wed, 24 Sep 2025 03:36:03 +0000

If someone told you a few years ago that software engineers would spend more time writing prompts than code, you probably would have laughed. Yet, here we are, standing right on the edge of that reality.

Think about it. The last decade was all about cloud computing, DevOps, and agile transformations. Today, the shift is happening all over again—this time towards Prompt-Driven Development (PDD) and Governed Prompt Software (GPS) Engineering.

📖 A Little Story: From Code to Prompts

Imagine you're building a chatbot for a healthcare startup.

The Old Way: You'd spend weeks coding dialogue trees, managing APIs, and writing endless if/else statements.

The New Way: You simply design structured prompts: "If a patient reports chest pain, escalate immediately to a human doctor and provide emergency guidelines."

The code is still there, of course. But most of the intelligence now comes from how you craft prompts, set rules, and govern the AI's behavior.

It feels less like writing code… and more like writing instructions for a very smart but unpredictable intern. 😅

🚀 What is Prompt-Driven Development (PDD)?

PDD treats prompts the same way traditional software treats code. Instead of just focusing on syntax and logic, you are now responsible for designing prompt workflows, testing them, and documenting your design choices.

Some of its main building blocks include:

Prompt Requests (PRs): Think of them as reusable functions, but for prompts.

Architectural Decision Records (ADRs): Why did you phrase the prompt this way? Why was this workflow better than another?

Prompt History Records (PHRs): A changelog of how your prompts evolve over time, just like version control for code.

Testing & Validation: Running prompts against test cases (almost like TDD, but for language models).
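
To make the building blocks above concrete, here is a minimal sketch of a versioned prompt record and a test run against it. The article names these concepts but defines no schema, so PromptRecord, PromptTest, and runPromptTests are hypothetical shapes, and the stub model stands in for a real model call so the example runs offline.

```typescript
// Hypothetical shapes for a versioned prompt and its test cases.
type PromptRecord = {
  id: string;
  version: number;
  template: string;
  rationale: string; // the "why" an ADR would capture
};
type PromptTest = { input: string; mustInclude: string };
type Model = (prompt: string) => string;

function render(record: PromptRecord, input: string): string {
  // Function replacement avoids special handling of "$" in the input.
  return record.template.replace("{{input}}", () => input);
}

// Returns one message per failed case; an empty array means all passed.
function runPromptTests(
  record: PromptRecord,
  model: Model,
  tests: PromptTest[],
): string[] {
  const failures: string[] = [];
  for (const t of tests) {
    const output = model(render(record, t.input));
    if (!output.toLowerCase().includes(t.mustInclude.toLowerCase())) {
      failures.push(
        `${record.id}@v${record.version}: "${t.input}" missing "${t.mustInclude}"`,
      );
    }
  }
  return failures;
}

// Stub standing in for a real model call.
const stubModel: Model = (prompt) =>
  prompt.includes("chest pain") ? "Escalate to a human doctor now." : "Noted.";

const triage: PromptRecord = {
  id: "triage",
  version: 2,
  template: "Patient says: {{input}}. Decide whether to escalate.",
  rationale: "Explicit escalation wording for urgent symptoms.",
};

const failures = runPromptTests(triage, stubModel, [
  { input: "I have chest pain", mustInclude: "escalate" },
  { input: "I need a refill", mustInclude: "noted" },
]);
console.log(failures); // []
```

Substring checks are the crudest possible assertion; with a real model the outputs vary between runs, so a practical harness would likely need looser matching or repeated sampling.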

🛡️ And What About GPS (Governed Prompt Software)?

If PDD is about building the system, GPS is about keeping it safe.

AI is incredibly powerful, but it can also be unpredictable. GPS Engineering ensures that your prompts don't just "work," but that they also follow governance rules:

Prevent biased or harmful outputs.
Ensure compliance with safety standards.
Maintain accountability (who wrote this prompt, and why?).

Think of it as DevSecOps for prompts—a safety layer that ensures your AI systems are trustworthy.

💡 Why Does All This Matter?

Let's look at a few real-world scenarios:

Finance: Imagine tokenizing ETFs (like what BlackRock is experimenting with). Instead of traders managing everything manually, prompts can govern transactions, risk checks, and reporting in real time, 24/7.

Healthcare: A digital nurse agent could run entirely on prompt workflows, escalating only when human intervention is truly required.

Education: Personalized tutors, driven by carefully crafted prompts, could adapt teaching styles to each student in ways no static curriculum ever could.

In all of these cases, the prompts themselves become the real intellectual property (IP).

⚡ The Big Shift: From Coder to Prompt Architect

Just as the rise of DevOps created new roles like SREs and Platform Engineers, this new era is already giving birth to new roles:

Prompt Architects

AI Workflow Engineers

Governance Leads for AI Systems

The engineers of tomorrow may not just write Python or Java—they will design conversational logic, governance frameworks, and agent orchestration strategies.

⚠️ A Reality Check

We are still in the very early innings of this journey. Every company has its own approach, and the standards are still evolving. But remember: agile, DevOps, and cloud computing all started this way—as small experiments that eventually reshaped the entire industry.

Prompt-Driven Development and GPS Engineering could be the next major transformation.

🧭 Final Thoughts

The future of software might not be written line by line in code, but designed prompt by prompt, governed with care, and orchestrated at scale.

The real question is: are you ready to evolve from just a developer into a prompt architect?

Because the next generation of apps won't just be coded. They'll be prompted, governed, and trusted.

👉 What do you think? Let me know in the comments! Will prompts ever truly replace coding workflows, or will they simply add another powerful layer to the software stack?

I Had a Fight With My Toaster. It Made Me Realize Everything About the Future of AI.

Mudassir Khan — Tue, 23 Sep 2025 06:43:40 +0000

It was 7 AM. The coffee was brewing, the sun was streaming in, and I was arguing with a toaster.

"You had one job," I mumbled, scraping the blackened edges of my toast into the sink. The toaster, a "smart" one, was supposed to know my preferences. Yet, here we were. This little gadget, a tiny island of supposed intelligence, was completely disconnected from the rest of my morning. It didn't know I was running late. It didn't know the coffee machine had just finished a dark roast, which pairs terribly with burnt bread. It just knew toast.

This frustratingly common experience is a symptom of a larger problem. We've built millions of "smart" things, but they're not wise. They're isolated, they follow rigid rules, and they don't talk to each other.

But what if they did? What if we're on the verge of a world that isn't just smart, but truly alive? I recently stumbled upon a vision for the future that gave this idea a name: Agentia World.

From Rigid Commands to Living Conversations

Before we dive in, let's talk about how things work now. Almost every digital interaction you have is governed by APIs (Application Programming Interfaces). Think of an API as a strict restaurant menu. You can order item #3 with a side of #B, and the kitchen knows exactly what to do. It's efficient, but rigid. You can't say, "I'm feeling something light and spicy today." The system doesn't understand intent.

Agentia World imagines a future that scraps this menu. Instead, our devices—our "agents"—will have intelligent dialogues.

Imagine your car agent talking to your home agent. It wouldn't send a rigid API call like home.gate.open(). It would have a conversation:

Car Agent: "Hey, I'm about five minutes away with the owner. It's been a long day."

Home Agent: "Understood. Preparing for arrival. I'll open the garage, turn on the hallway lights, and start the evening focus playlist on the speakers."

This isn't just a command; it's a collaborative exchange based on understanding the goal. The home doesn't just obey; it anticipates.

A World Where Everything is an Agent

This is where the vision gets really wild. It proposes that everything becomes an AI agent. Not just your phone and your car, but the mundane, everyday objects.

Your coffee machine becomes a personal barista, checking your calendar for early meetings and your health tracker for your caffeine limits.

Your houseplants have agents that negotiate with the window blinds for the perfect amount of sunlight.

An entire city becomes a macro-agent. The traffic light agents talk to public transport agents, which talk to the power grid agents, all working in a seamless, living network to eliminate traffic jams and save energy.

This creates a living network that's constantly learning and adapting. It's a system that's both digital (the AI making decisions) and physical (the agents controlling real-world objects).

So, What Does This Mean for Us?

For the average person, it means a world that simply works. A world where the technology disappears into the background, seamlessly orchestrating our lives for the better. The constant micro-management of our digital lives fades away.

For us developers, it represents the next frontier. We'll move from building isolated apps and services to designing collaborative agents that can negotiate, learn, and act within a massive, decentralized ecosystem. The challenge will shift from writing rigid code to teaching intelligent systems.

The fight with my toaster was a reminder of how far we still have to go. We don't just need smarter devices; we need a wiser world. And that's the promise of a future where everything, from our toasters to our cities, is part of one intelligent, living conversation.