DEV Community: Ali Ibrahim

MCP v2: What's Changing, What's Deprecated, and Why

Ali Ibrahim — Fri, 17 Jul 2026 13:30:00 +0000

MCP v2 breaking changes explained: the protocol goes stateless and deprecates sampling, roots, and logging. What is changing across every MCP SDK, why, and whether you should migrate yet.

Introduction

If you have built anything on the Model Context Protocol, the next few weeks matter. MCP v2 (protocol revision 2026-07-28) finalizes on July 28, 2026. The release candidate was locked on May 21, 2026, and SDK maintainers are validating it against real workloads during a ten-week window before the spec is published.

This is not a point release. v2 makes the protocol stateless, formalizes extensions as first-class components, and deprecates three subsystems that many existing servers rely on: sampling, roots, and logging.

This article is the language-agnostic companion to our SDK-specific guides. It covers what is changing and why, so that when the stable SDKs land you understand the shape of the migration regardless of whether you write TypeScript, Python, Go, or C#. For the current v1 TypeScript walkthrough, see The MCP TypeScript SDK: A Complete Guide; a v2 update to that guide will follow once the SDKs reach stable.

Important: As of this writing the v2 SDKs are in beta. The MCP team is explicit: "For any critical workloads, the stable SDK releases remain the recommended versions," and "public APIs may still change between the betas and the stable releases." Treat the code shapes below as directional, not final.

What you'll learn:

The headline change: why MCP is going stateless
Which subsystems are deprecated (sampling, roots, logging) and what replaces each
What the stateless shift means for the SDKs you build on
What extensions are, and the two official ones shipping with v2
Whether you should migrate now or wait for stable

The Headline Change: MCP Goes Stateless

The single biggest change in v2 is that the protocol becomes stateless.

In v1, every session began with an initialize / initialized handshake, and the server issued an Mcp-Session-Id that the client sent back on every subsequent request. That session ID pinned a client to a specific server instance. If you scaled your server horizontally, you needed sticky sessions so that follow-up requests landed on the same process that ran the handshake.

v2 removes all of that:

No initialize / initialized handshake.
No Mcp-Session-Id header.
Client information now travels in _meta fields on each request, so any request is self-contained.
Two new operational headers, Mcp-Method and Mcp-Name, let infrastructure route and observe requests without parsing the body.
Caching metadata (ttlMs and cacheScope) is now part of the protocol.

The practical payoff: a request can be handled by any server instance. No sticky sessions, no session affinity, no shared session store. This is a much better fit for serverless and autoscaled deployments, where you cannot assume the next request reaches the same box.
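
To make the "self-contained request" idea concrete, here is a minimal sketch of what a client and a routing layer might do. It is directional only: the `clientInfo` key inside `_meta`, the assumption that `Mcp-Name` carries the tool name, and the pool names are all placeholders, not taken from the spec.

```python
import json


def build_request(method: str, name: str, arguments: dict, client_info: dict) -> tuple[dict, bytes]:
    """Return (headers, body) for one self-contained request."""
    body = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": method,
        "params": {
            "name": name,
            "arguments": arguments,
            # Hypothetical key: the article only says client info travels in _meta.
            "_meta": {"clientInfo": client_info},
        },
    }
    headers = {
        "Content-Type": "application/json",
        "Mcp-Method": method,  # operational header named in the article
        "Mcp-Name": name,      # assumed here to carry the tool name
    }
    return headers, json.dumps(body).encode("utf-8")


def pick_pool(headers: dict) -> str:
    """Route on headers alone, the way a proxy could, without parsing the body."""
    if headers.get("Mcp-Method") == "tools/call" and headers.get("Mcp-Name") == "render_report":
        return "heavy-pool"  # hypothetical pool for an expensive tool
    return "default-pool"
```

Because nothing in the request depends on an earlier handshake, `pick_pool` can send it to any instance in the chosen pool.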

This one change is also the reason behind most of the deprecations below. Several v1 features quietly assumed a long-lived, stateful connection between one client and one server process. Once that assumption is gone, those features no longer fit.

What's Deprecated, and Why

Three subsystems are deprecated under v2's new lifecycle policy. Deprecated does not mean removed. Each remains functional during a one-year grace period, so existing v1 servers keep working. But new code should adopt the replacements.

Sampling

What it was: Sampling let a server ask the client's LLM to generate text mid-execution (via createMessage in the TypeScript SDK). It powered "agentic servers" that could reason and make decisions internally by borrowing the client's model.

Why it's going away: Sampling requires the server to send a request back to the client in the middle of handling a tool call, which only works over a persistent, stateful connection. That is exactly the assumption v2 removes. In a stateless world, the server processing your request may not be the one holding a connection to your client at all.

What replaces it:

Direct integration with LLM provider APIs. If your server needs a model, call the provider (Anthropic, OpenAI, etc.) directly from the server. You control the model, the keys, and the cost.
The InputRequiredResult pattern for the cases where you genuinely need something from the client mid-task. Instead of a live callback, the server returns a result that says "I need this input," and the client retries the request with the answer. Because each round-trip is a fresh, self-contained request, it can be retried against any server instance, keeping the interaction stateless.
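
The retry loop behind the InputRequiredResult pattern can be sketched in a few lines. This is an illustration of the idea, not the SDK API: the `type` discriminator, the `requests` and `answers` fields, and both function names are hypothetical, and the "delete" is simulated with a string.

```python
def handle_delete(params: dict) -> dict:
    """Stateless handler: everything it needs arrives in this one request."""
    path = params["path"]
    answers = params.get("answers", {})  # hypothetical field carrying retried input
    if "confirm" not in answers:
        # Instead of calling back into the client, describe what is missing.
        return {
            "type": "input_required",  # hypothetical discriminator
            "requests": {"confirm": f"Delete {path}? (yes/no)"},
        }
    if answers["confirm"] != "yes":
        return {"type": "result", "text": f"Kept {path}"}
    return {"type": "result", "text": f"Deleted {path}"}  # simulated, no file is touched


def call_with_retries(handler, params: dict, ask, max_rounds: int = 3) -> dict:
    """Client loop: resend the same request with answers until a final result arrives."""
    answers: dict = {}
    for _ in range(max_rounds):
        result = handler({**params, "answers": answers})
        if result["type"] != "input_required":
            return result
        for key, prompt in result["requests"].items():
            answers[key] = ask(prompt)
    raise RuntimeError("server kept asking for input")
```

Each call to `handler` carries the full set of answers so far, so the second round could land on a different server instance than the first.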

If you use sampling today, this is the deprecation most likely to require real rework, because the replacement changes where the model call happens (server-side, not client-side).

Roots

What it was: Roots were URIs the client provided to scope what the server should operate on, for example file:///home/user/my-project to tell a code-analysis server which directory to scan. Servers read them with listRoots().

Why it's going away: Roots were another piece of session-scoped state: the client supplied them, and the server relied on them for the life of a connection. With no session in v2, there is nothing to attach that state to.

What replaces it: Pass the same information explicitly, per request:

Tool parameters — accept the working directory or scope as a tool input.
Resource URIs — encode the scope directly in the resource being requested.
Server configuration — set boundaries at deploy time when they are static.
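
As a sketch of the first and third options combined, a tool can take the working directory as an ordinary parameter and check it against a boundary fixed at deploy time. The base path and the function name are hypothetical, and the example assumes POSIX paths and Python 3.9+.

```python
from pathlib import Path

# Server configuration: a static boundary set at deploy time (hypothetical value).
ALLOWED_BASE = Path("/srv/projects")


def resolve_scope(working_dir: str, base: Path = ALLOWED_BASE) -> Path:
    """Validate a per-request working directory passed as a tool parameter."""
    candidate = (base / working_dir).resolve()
    # Reject absolute paths and ".." tricks that land outside the base.
    if not candidate.is_relative_to(base.resolve()):
        raise ValueError(f"{working_dir!r} escapes the configured base directory")
    return candidate
```

The scope now arrives with every call instead of living in a session, so a handler on any instance can validate and use it.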

Logging

What it was: Structured log messages sent from the server to the client over the MCP protocol, filtered by a client-set level (logging/setLevel).

Why it's going away: Protocol-level logging tied observability to a live client connection, which is awkward for stateless, multi-instance deployments and duplicates tooling that already exists.

What replaces it:

stderr for stdio transports. Write logs to standard error; the host captures them. (This was already the recommended practice for stdio servers in v1, since stdout is reserved for the protocol.)
OpenTelemetry for structured, production-grade observability. Emit traces and metrics to your existing OTel pipeline rather than through MCP.
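
For the stdio case, the standard library is enough. The sketch below wires a logger to standard error so stdout stays free for protocol messages; the logger name and format string are arbitrary choices, and OpenTelemetry export is deliberately left out.

```python
import logging
import sys


def make_logger(name: str = "mcp-server", stream=None) -> logging.Logger:
    """Build a logger that writes to stderr so stdout stays free for protocol messages."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    if not logger.handlers:
        handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    make_logger().info("tool call started")
```

The optional `stream` argument exists only so the logger can be pointed at a buffer in tests.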

What the Stateless Shift Means for the SDKs

Beyond the deprecations, statelessness ripples through every SDK's surface. The specifics differ by language, and the v2 SDKs are still settling, so this is a map of what kind of change to expect rather than a list of exact symbols. Three patterns show up regardless of which SDK you use:

Transports get reshaped. Removing sessions is a transport-level change, so expect HTTP transports to be renamed or split by runtime, and expect the old SSE transport (a two-endpoint design built around a persistent stream) to disappear. A single, unified request/response transport fits the stateless model; a long-lived stream does not.
Errors split into protocol errors and local SDK errors. v2 draws a clearer line between "the request itself was malformed" (a wire-protocol error the other side should see) and "something failed locally, like the HTTP connection dropped." If your code inspects error types, expect that distinction to surface in the API.
Request context becomes transport-aware. Because a stdio server has no HTTP request behind it, the per-request context object separates protocol-level fields from transport-specific ones (like auth info that only exists over HTTP). Handlers read those optional fields defensively.
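
The last two patterns are easier to picture with types. The following is a rough sketch, not any SDK's real surface: every class, field, and function name here is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


class ProtocolError(Exception):
    """The peer should see this: the request itself was invalid on the wire."""

    def __init__(self, code: int, message: str):
        super().__init__(message)
        self.code = code


class LocalSdkError(Exception):
    """Something failed on this side, for example the HTTP connection dropped."""


@dataclass
class HttpInfo:
    auth_subject: Optional[str] = None  # only meaningful over HTTP


@dataclass
class RequestContext:
    method: str                      # protocol-level, always present
    http: Optional[HttpInfo] = None  # transport-specific, absent on stdio


def caller_label(ctx: RequestContext) -> str:
    """Read optional transport fields defensively."""
    if ctx.http is None or ctx.http.auth_subject is None:
        return "anonymous"
    return ctx.http.auth_subject


def should_retry(err: Exception) -> bool:
    """Retry local failures; a malformed request will fail the same way again."""
    return isinstance(err, LocalSdkError)
```

Code that branches on error types today is where this split is most likely to show up during migration.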

None of this requires action today. It is a map of what the eventual migration touches so you can gauge its size for your codebase, whatever language you build in. When the SDKs reach stable, our language-specific guides will cover the concrete renames.

Should You Migrate Now?

Short answer: not for production, not yet. Here is the state of play as of early July 2026:

The v2 SDKs are still pre-stable across the board. The TypeScript and Python SDKs are in beta; Go and C# are on pre-release/preview builds. Exact versions are moving quickly, so check each SDK's releases rather than trusting a number quoted here.
Stable is close. The spec finalizes July 28, 2026, and stable SDK releases are expected around the same timeframe.
The betas are opt-in. Upgrading the SDK does not automatically switch your server to the new protocol revision. You choose when to adopt v2 behavior.
Pin exact versions if you experiment. Public APIs may change between the pre-stable builds and stable, so a floating version range can break you.

A reasonable plan for most teams:

Now: Read the spec changes (this article). Audit your servers for sampling, roots, and logging usage. Note where you rely on session state.
On stable (post July 28): Upgrade a non-critical server first. Work through the renamed imports and the deprecated subsystems.
Within the grace year: Migrate the rest before the deprecated subsystems are removed.
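
The audit step in that plan can start as a simple text scan. This sketch searches for the identifiers mentioned in this article (`createMessage`, `listRoots`, `logging/setLevel`, `Mcp-Session-Id`); other SDKs spell these differently, so treat the pattern table as a starting point to extend rather than a complete list.

```python
import re
from pathlib import Path

# Identifiers taken from this article; extend for your SDK and language.
PATTERNS = {
    "sampling": re.compile(r"\bcreateMessage\b"),
    "roots": re.compile(r"\blistRoots\b"),
    "logging": re.compile(r"logging/setLevel"),
    "session": re.compile(r"Mcp-Session-Id", re.IGNORECASE),
}


def audit_text(text: str) -> dict[str, list[int]]:
    """Map each subsystem label to the 1-based line numbers that mention it."""
    hits: dict[str, list[int]] = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(label, []).append(lineno)
    return hits


def audit_tree(root: str, suffixes=(".ts", ".js", ".py", ".go", ".cs")) -> dict[str, dict[str, list[int]]]:
    """Scan source files under root and return only files with at least one hit."""
    report = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            hits = audit_text(path.read_text(encoding="utf-8", errors="ignore"))
            if hits:
                report[str(path)] = hits
    return report
```

Running `audit_tree(".")` from a repository root gives a per-file list of places to review; it finds mentions, not proof of use, so expect some false positives in comments and docs.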

What's New: Extensions

v2 does not only subtract. It also formalizes extensions as first-class protocol components with reverse-DNS identifiers and independent versioning. Instead of bolting capabilities onto the core spec, features can now evolve as versioned extensions.

Two official extensions launch with v2:

MCP Apps — server-rendered UIs, so a server can present a rich interface to the user rather than only returning text and structured data.
Tasks — a standard pattern for long-running operations, so a server can kick off work that outlives a single request and report on it. This pairs naturally with the stateless core: a task is addressable across instances rather than tied to one connection.

Expect the extension model to be where much of MCP's future capability growth happens, precisely because extensions can ship and version without waiting on a full spec revision.
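
To picture what "reverse-DNS identifiers and independent versioning" could look like in code, here is a small sketch. The identifier pattern, the example IDs, and the exact-match rule are all assumptions made for illustration; the spec may define the format and version matching differently.

```python
import re
from dataclasses import dataclass

# Assumed shape: lowercase dot-separated labels, at least two of them.
_REVERSE_DNS = re.compile(r"^[a-z][a-z0-9-]*(\.[a-z][a-z0-9-]*)+$")


@dataclass(frozen=True)
class ExtensionRef:
    identifier: str  # e.g. "com.example.reports" (hypothetical)
    version: str     # versioned independently of the core protocol revision


def shared_extensions(client: list[ExtensionRef], server: list[ExtensionRef]) -> list[ExtensionRef]:
    """Keep extensions that both sides list with the same identifier and version."""
    for ref in client + server:
        if not _REVERSE_DNS.match(ref.identifier):
            raise ValueError(f"not a reverse-DNS identifier: {ref.identifier!r}")
    offered = set(server)
    return [ref for ref in client if ref in offered]
```

The point of the sketch is the data shape: an extension is identified by a namespaced ID plus its own version, separate from the protocol revision date.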

What This Means for Your Existing v1 Servers

If you have MCP servers in production today, nothing breaks on July 28. To recap:

v1 servers keep working. Deprecations have a one-year grace period.
You are not forced to migrate on July 28, but you do need to migrate before the one-year grace period ends.
The biggest real change is sampling. Roots and logging have straightforward replacements you may already be halfway to. Sampling moves the model call from client to server, which is an architectural shift, not a rename.
Statelessness is a gift if you run at scale. It removes sticky-session complexity from your infrastructure.

Plan for it now, migrate on stable, and you will have a full year of runway.

Resources

MCP 2026-07-28 Release Candidate — the protocol changes, straight from the source
MCP SDK v2 Betas — per-language beta status and versions
TypeScript SDK v2 Migration Guide — the concrete TS upgrade path
MCP Specification — the full protocol spec

The MCP TypeScript SDK: A Complete Guide — the current v1 SDK deep dive
Create Your First MCP Server in 5 Minutes — beginner quickstart
Getting Started with FastMCP in TypeScript — a streamlined framework
Securing MCP Servers with OAuth and Keycloak — authentication
Top AI Agent Protocols in 2026 — where MCP fits among agent protocols

Top AI Agent Standards to Know in 2026

Ali Ibrahim — Mon, 15 Jun 2026 13:30:00 +0000

Protocols tell agents how to connect. Standards tell them what to know. As the agent ecosystem matures, a second layer of convergence is emerging: open formats that give agents consistent, structured context — about projects, capabilities, and design systems. Unlike protocols (which define communication between systems), these standards are file-based, human-readable, and version-controlled alongside your code. Here are the three standards shaping how agents are informed and extended in 2026.

1. AGENTS.md

agentsmd/agents.md | Agentic AI Foundation (Linux Foundation) | MIT

The universal context file for AI coding agents. Where a README explains a project to humans, AGENTS.md explains it to agents: build commands, test commands, code style conventions, testing frameworks, architectural decisions, and anything else an agent needs to work effectively in the codebase. Plain Markdown, no required schema, no tooling to install — any agent that reads it benefits immediately.

The problem it solves is fragmentation. Before AGENTS.md, every tool was reading different files, or nothing. Cursor read .cursorrules. Claude read CLAUDE.md. Most agents read whatever they found and hoped for the best. AGENTS.md gives a single, predictable location for agent-specific context without bloating the human README with instructions no human needs.

Adopted by OpenAI Codex, Cursor, GitHub Copilot, and others — reported across 60,000+ open-source repositories as of mid-2026. Governance moved to the Agentic AI Foundation (AAIF) under the Linux Foundation, the same body that now stewards MCP.

Avoiding vendor lock-in: The pragmatic pattern many teams use is to write AGENTS.md as the canonical source of truth, then in tool-specific files (like CLAUDE.md or .cursorrules) simply instruct the agent to read AGENTS.md. One file to maintain, every tool benefits. If a tool stops being used, nothing is lost.
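
The pointer pattern is simple enough to script. This is a minimal sketch under stated assumptions: the wording of the pointer line and the default file list are illustrative choices, and each tool may expect its own file format.

```python
from pathlib import Path

# Illustrative wording; adjust to whatever each tool expects.
POINTER = "Read AGENTS.md in the repository root and follow it as the source of truth.\n"


def write_pointers(repo: str, tool_files=("CLAUDE.md", ".cursorrules")) -> list[str]:
    """Create tool-specific files that defer to AGENTS.md; never overwrite existing ones."""
    root = Path(repo)
    if not (root / "AGENTS.md").is_file():
        raise FileNotFoundError("AGENTS.md must exist before pointer files are generated")
    created = []
    for name in tool_files:
        target = root / name
        if not target.exists():
            target.write_text(POINTER, encoding="utf-8")
            created.append(name)
    return created
```

Existing tool files are left alone so hand-written instructions are not lost; the function returns only the names it actually created.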

2. Agent Skills (SKILL.md)

agentskills.io | Open standard | MIT

Where AGENTS.md tells agents what a project is, Agent Skills tell agents how to do something — and crucially, that capability travels with the agent across any project. A skill is a folder containing a SKILL.md file with two required YAML fields (name and description) and a Markdown body with instructions. Optional assets like scripts, templates, and reference files live alongside it.

The distinction matters: AGENTS.md is project-scoped context. Agent Skills are reusable, portable capabilities — domain expertise, team-specific workflows, and repeatable procedures that agents load on demand. A skill for writing commit messages, one for generating migration scripts, one for running the company's deployment checklist: each is self-contained, version-controlled, and usable in any compatible agent.

Originally developed at Anthropic and released as an open standard in late 2025, the format has since been adopted by Claude Code, OpenAI Codex, Cursor, VS Code, and a reported 30+ other tools. Partners including Atlassian, Figma, Stripe, and Notion published skills at launch.

The format is intentionally minimal. Two required fields and a Markdown body — simple enough to implement in an afternoon. No protocol negotiation, no runtime dependencies, no auth flows.
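
To show how little a consumer has to do, here is a sketch of a SKILL.md reader. It assumes the two YAML fields sit in a `---`-delimited frontmatter block at the top of the file and handles only flat `key: value` lines; a real implementation would use a YAML parser. The function name and the sample skill are made up.

```python
def parse_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md into (frontmatter fields, Markdown body).

    Handles only flat `key: value` lines; real YAML needs a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---' frontmatter delimiter")
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        raise ValueError("missing closing '---' frontmatter delimiter") from None
    fields = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    # The two required fields named in the article.
    missing = [f for f in ("name", "description") if not fields.get(f)]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    return fields, "\n".join(lines[end + 1:]).strip()
```

Given a file that starts with `name: commit-messages` and a `description:` line between the delimiters, `parse_skill` returns those two fields plus the instruction body as plain Markdown.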

The open format is what makes marketplaces possible. Because a skill is just a folder with a Markdown file, anyone can publish one and any compatible agent can consume it. skills.sh by Vercel is the most active registry today, hosting skills from Anthropic, GitHub, OpenAI, and the community — installable with a single npx skills add <owner/repo> command. The existence of a thriving marketplace is the clearest signal that the standard is working.

3. DESIGN.md

google-labs-code/design.md | Google Labs | Apache-2.0 | ⭐ ~14.6k

The newest of the three, and the most specialized. DESIGN.md is a format for encoding a project's visual identity system in a way agents can read and apply when generating UI code. It combines machine-readable design tokens (YAML) with human-readable rationale (Markdown prose) — giving agents not just the values but the reasoning behind them.

Without something like DESIGN.md, agents generating frontend code have no reliable way to know a brand's colors, typography scale, spacing system, or interaction patterns. They guess from comments in CSS files, or they ignore design consistency entirely. DESIGN.md solves this by making the design system a first-class input to the agent.

Google Labs introduced and open-sourced DESIGN.md as the export/import format for Google Stitch, an AI design canvas that uses Gemini to generate UI from natural language. Designers export a DESIGN.md from Stitch; developers import it into their project; agents use it to keep generated code on-brand. An npm package handles validation (npx @google/design.md lint DESIGN.md), diffing, and export to Tailwind, CSS variables, or W3C Design Token format.

Status: currently in alpha. The format is still evolving and breaking changes are expected. Worth watching and experimenting with, but not yet a safe dependency for production workflows.

Special Mentions

CLI tools as a convention: Not a formal standard, but worth naming. As agents become more capable of invoking shell commands, CLI interfaces for well-known tools (git, gh, docker) are increasingly the simpler, cheaper alternative to a full MCP server — when the tool is already well-documented and the agent can reason about flags. The MCP vs CLI debate has been heated; the practical answer is: MCP for APIs and internal systems, CLI for tools that already have mature interfaces. The AI Agent Roadmap covers this tradeoff in Phase 4.

Key Takeaways

Three complementary layers: AGENTS.md gives agents project context; SKILL.md gives them portable capabilities; DESIGN.md gives them visual identity. They solve different problems and work together.
Markdown as the lingua franca: All three formats are Markdown-first, human-readable, and version-control-friendly. No complex serialization, no runtime dependencies.
Open foundations matter: AGENTS.md is now AAIF-governed alongside MCP, signaling the industry wants neutral, community-owned standards for this layer too.
One file, many tools: The lock-in pattern to avoid is maintaining separate context files per tool. AGENTS.md as the canonical source, referenced from tool-specific files, is the practical answer.

Enjoying content like this? Sign up for Agent Briefings, where I share insights and news on building and scaling AI agents.

In 2026, There Are 4 Ways to Build an AI Agent. Here's How to Choose

Ali Ibrahim — Wed, 03 Jun 2026 13:30:00 +0000

In 2025, the default assumption was: if you want an AI agent, you build one. Pick a framework, wire up your tools, own the stack. The instinct was to build — almost automatically, regardless of whether it was the right call.

That assumption is worth questioning in 2026. Not because building is wrong, but because it's now one option among four. And defaulting to it without asking which path actually fits your situation is how teams spend weeks on infrastructure that didn't need to be theirs.

The four paths below are not a ranking. They're different tools for different jobs, and they can combine. The goal is to give you enough of a framework to ask the right question before you commit to an approach.

Path 1: Build It Yourself

This is the original answer to the question: you write the agent. You own the full stack: the model calls, the tool wiring, the memory system, the orchestration loop, the deployment, the monitoring. Frameworks like LangGraph and the OpenAI Agents SDK give you building blocks, but the architecture is yours.

When this is the right path:

Your requirements are specific enough that no existing agent or service maps to them cleanly
The agent needs deep integration with internal systems that can't be exposed to external infrastructure
The way you build the agent is itself the competitive advantage: proprietary orchestration logic, domain-specific memory structures, custom tool design
You need to understand every layer because you're responsible for debugging it in production

What it demands:

Time and discipline. The gap between a working demo and a production agent is real and large. A demo proves the model can do the task. A production agent handles failure gracefully, recovers from interrupted sessions, behaves predictably across thousands of runs, and doesn't surprise you at 2am.

This path also demands architectural judgment that the frameworks don't supply. Which layer owns state? How does context flow between agents in a multi-agent system? What does the agent do when a tool call fails three times in a row? These questions have answers that depend on your system, not on the framework's defaults.

There's a third cost that doesn't show up in tutorials: the standards are moving. MCP, agent skills, sandboxed execution — new primitives are landing every quarter, and whatever you build today needs to be able to absorb what ships next. On Path 1, that adaptation burden is yours. The teams doing this well aren't just building agents; they're building agents that are designed to evolve. That's a different, harder problem.

Where to go deeper:

AI Agent Roadmap: Everything You Need to Build Agents (In the Right Order) covers this path in full, from picking your stack through production deployment, with links to the depth articles for each phase.

If you want a starting point that's already wired together but simple enough to understand every layer, the Agentailor fullstack starter gives you a LangGraph + Next.js scaffold you can extend without fighting boilerplate — and the architecture is deliberately decoupled, so swapping LangGraph for another orchestration layer is straightforward if your requirements call for it.

Path 2: Build It With a Coding Agent

This path gets confused with vibe coding. It's not the same thing.

Vibe coding means: describe what you want, accept what comes out, ship it. For most software, that's increasingly viable. The models are good enough, the training data is dense enough, and the blast radius of a suboptimal decision is manageable.

Building agent systems with a coding agent is a different situation. The domain is too new, the training data is too sparse, and the reference repos that exist were themselves largely vibe-coded. When you ask Claude Code or Cursor to scaffold a multi-agent orchestration loop, it's drawing on a shallow well. It will produce something that runs. Whether you'd want to run it in production is a separate question.

Agentic engineering is the discipline that fills that gap. You make the architectural decisions before the agent touches the keyboard: which transport layer, which abstraction boundaries, where state lives, what the agent is not allowed to do. You point the coding agent to the right reference material rather than letting it reach for whatever it finds. You review not just whether the code works but whether the decisions embedded in it are the ones you'd have made.

The coding agent handles implementation. You handle architecture. The split matters.

When this is the right path:

You know what you want to build and have strong opinions about how it should work
You want velocity without sacrificing quality or scalability
Your backend instincts are strong enough to review what the agent produces critically, not just whether the outputs look plausible

What it demands:

Strong opinions upfront. The architectural decisions have to be made before the session starts, not discovered during it. This also requires knowing enough about the domain to recognize when the agent made a choice you wouldn't have, even if the code compiles and the tests pass.

Where to go deeper:

Agent Briefings Issue 16 goes deep on the four practices that separate agentic engineering from vibe coding: decision authority, resource quality, orchestration as system design, and context engineering as architecture. Issue 17 will cover the next step: formalizing these practices into specifications so you don't have to enforce them manually every session.

Path 3: Deploy an Existing Open-Source Agent

The instinct to build from scratch runs deep in engineering culture. Sometimes it's the right instinct. Often, for agents, it's a waste.

The OSS agent space has matured to the point where real options exist across multiple categories: task-execution agents, gateway agents, self-improving server agents. The capability that would have taken weeks to build in 2024 often exists today as a configurable extension. The question is no longer "does something exist?" but "which one is worth deploying, and why?"

When this is the right path:

Your use case maps well to what an existing agent already does
You want full infrastructure control without the cost of building the agent yourself
80% or more of the functionality you need already exists and the remaining gap can be closed through configuration or extension

What it demands:

Choosing carefully. "Open source" covers everything from a weekend project with 400 stars to foundation-governed infrastructure with hundreds of contributors. The star count is a weak signal; the governance model and production track record are the signals that matter. An agent abandoned by its maintainer is worse than no agent, because you've now inherited the maintenance burden.

The other judgment call is fit. OSS agents have opinions baked in: extension models, memory architectures, sandboxing approaches, provider assumptions. You need to know whether those opinions align with your use case before you build on top of them, not after.

Three examples worth knowing:

Goose — local-first task-execution agent, built by Block and governed by the Agentic AI Foundation under the Linux Foundation. MCP-based, provider-agnostic, 44k+ stars. The reference point when governance and long-term stability matter.

Hermes — a self-improving server agent by Nous Research that runs persistently on your own infrastructure, learns from completed tasks, and auto-generates reusable skills over time. 173k+ stars, MIT licensed. Built for longer-running autonomous workloads rather than interactive sessions.

OpenClaw — a multi-channel gateway that routes conversations across WhatsApp, Telegram, Discord, Slack, and more through a single runtime. 374k+ stars, community-maintained. A different category from the two above: if your use case is multi-platform orchestration rather than task execution, it's the one to evaluate.

These three aren't competing for the same slot. Hermes even ships with built-in migration tooling from OpenClaw, which tells you something about how the space is consolidating around clearer categories.

Path 4: Use a Managed Agent Service

This is the path to watch most closely in 2026. The frontier labs are converging on it, and the category is moving faster than any of the others.

Someone else provides the harness, wires the primitives, handles deployment, manages the infrastructure. You configure and consume via API. Anthropic's Claude Managed Agents, LangChain's Managed Deep Agents, and Vercel Agent all take this approach, each with different trade-offs in scope and generality.

When this is the right path:

Most teams. The session persistence layer is not a competitive advantage. Neither is the execution environment or the retry logic. If you're spending engineering time rebuilding those components, you're not spending it on what makes your agent actually better.

The teams for whom this path fits are the ones who can honestly ask, "Is the cost of owning this infrastructure proportional to the value it creates for us?" and answer no. For most, it isn't.

What it demands:

Careful evaluation before you commit. Managed doesn't mean hands-off on architecture, and it doesn't mean all services are equivalent. Four things matter most when assessing a managed agent service:

Primitive coverage. Does the service actually wire the capabilities your use case needs? Verify at the tool level, not the marketing page.
Observability access. Can you see what the agent did, step by step? In production, you need traces, not just final outputs.
Ejection path. How painful is the dependency if circumstances change? This is the lock-in question asked practically.
Execution environment. Where does the agent actually run? For agents handling sensitive data or internal systems, the answer to this question may determine viability.

Worth noting: this market is moving fast. Google announced its own managed agent service days after Agent Briefings Issue 15 covered the category. OpenAI will likely follow. The list of providers will look different by the end of 2026, which is a reason to evaluate carefully now rather than wait.

Where to go deeper:

Agent Briefings Issue 15 covers what distinguishes a managed agent from a standard inference API, the vendor lock-in trade-off reframed, and a practical breakdown of what to look for when evaluating a specific service.

Paths Can Combine

The framework above is a starting point, not a box.

A developer doing agentic engineering (path 2) might be building on top of a managed service (path 4), making architectural decisions while outsourcing the operational layer. A team that started with Goose (path 3) might eventually fork and customize it to the point where they're effectively on path 1. Someone who started on a managed service might eject to self-hosted infrastructure once their requirements outgrow the service's constraints.

The paths describe different approaches to where ownership sits: who builds the agent, who runs it, and who makes the architectural decisions. Most real agent projects mix these to some degree.

The decision that matters most is the first one: which path do you start on, and why? Getting that right determines whether you're building toward what you actually need or accumulating technical debt in the wrong direction.

Pick based on what fits your requirements today, not what sounds most sophisticated. The right answer changes as the problem matures.

References

Agentic Engineering Patterns — Simon Willison
Claude Managed Agents Overview — Anthropic
Introducing Managed Deep Agents — LangChain
Hermes Agent — Nous Research
Goose — Agentic AI Foundation

Observability for AI Agents: Why Tracing Matters and How to Do It with Langfuse

Ali Ibrahim — Sun, 17 May 2026 13:30:00 +0000

Introduction

Deploying an AI agent means shipping a system you can't fully predict. The same user message can produce different behavior on different runs. A bad tool result at step 3 can silently corrupt steps 4 through 10. Token costs compound in loops you didn't expect. Without visibility into what's happening inside the agent, you're flying blind.

This is where observability comes in — and if you've built distributed systems before, you already know the concepts. Traces, spans, metadata: the same model applies. If you've used OpenTelemetry, Jaeger, or Datadog APM, your existing instrumentation still helps. Your OTel setup will capture the HTTP request, the database query, the response time. Your trace might look like this:

Trace: POST /api/agent/stream
  └─ HTTP span (200 OK, 2.1s)
       └─ DB span: checkpoint read (12ms)

That's useful, but it leaves the most important part opaque. The 2.1s that happened inside the agent — the LLM calls, the tool decisions, the graph node execution — is invisible. Your trace tells you the agent ran. It doesn't tell you:

Which LLM call returned a hallucinated result?
Why did it invoke the same tool three times?
Which node in the graph caused the wrong branch to execute?
How much did that one conversation cost?

Agents inherit the same observability concepts as distributed systems, but they need one layer deeper: semantic traces that capture agent reasoning, not just infrastructure spans.

Why Agents Are Uniquely Hard to Debug

Non-determinism. The same user message can produce different agent behavior on every run. You can't just "reproduce the bug" in isolation. The trace of what happened IS the bug report.

Multi-step reasoning chains. A LangGraph agent might run through 8–12 nodes before producing a response. A bad tool result at step 3 can silently corrupt everything that follows. Without visibility into each step, you're guessing where the chain broke.

Compounding costs. LLM calls in agent loops are expensive. Without per-call visibility, you only discover the runaway cost when the billing statement arrives. Knowing that one node is responsible for 80% of token usage changes where you optimize.

Real-world side effects. Agents don't just think — they act. They send emails, write to databases, call external APIs. A trace showing "tool called 4 times" versus "tool called once" can be the difference between a recoverable bug and an incident.

Standard OTel instrumentation captures none of this. It sees the HTTP boundary, not the reasoning inside it. What you need is an additional layer that understands agent semantics: what was the LLM asked, what did it decide, which tools did it choose and why.

What Agent Tracing Adds

You already know traces and spans from distributed systems. The model is the same — a trace is a record of one end-to-end operation, composed of nested spans. What changes for agents is what goes inside those spans.

Instead of just latency and status codes, agent spans carry:

The full prompt sent to the LLM and its completion
Token counts and cost per call
Which tool was invoked, with what arguments, and what it returned
The graph node that triggered each operation

For a LangGraph agent with human-in-the-loop tool approval, a single user message produces a trace like this:

Trace: user-message
  ├─ LangGraph: agent node
  │    └─ LLM call (GPT-4o, 1,240 tokens, 0.8s, $0.006)
  ├─ LangGraph: tool_approval node
  │    └─ interrupt (waiting for human approval)
  ├─ LangGraph: tools node
  │    └─ search_web("...") → result
  └─ LangGraph: agent node (second pass)
       └─ LLM call (GPT-4o, 380 tokens, 0.3s, $0.002)

That trace answers every question from the introduction. You can see exactly where time was spent, which LLM call was expensive, and what inputs each node received.
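To make the span contents concrete, here is a minimal sketch of how a trace like the one above could be modeled and aggregated. The `AgentSpan` shape and the `summarizeTrace` helper are hypothetical names for illustration, not Langfuse's data model; the numbers are the ones from the example trace.

```typescript
// Hypothetical span shape for illustration; not Langfuse's actual schema.
interface AgentSpan {
  node: string                        // graph node that triggered the operation
  kind: 'llm' | 'tool' | 'interrupt'
  tokens?: number                     // only set on LLM calls
  costUsd?: number                    // only set on LLM calls
  toolName?: string                   // only set on tool calls
}

interface TraceSummary {
  totalTokens: number
  totalCostUsd: number
  tokensByNode: Record<string, number>
  toolCalls: Record<string, number>
}

// Walk the spans once and aggregate tokens, cost, and tool-call counts.
function summarizeTrace(spans: AgentSpan[]): TraceSummary {
  const summary: TraceSummary = { totalTokens: 0, totalCostUsd: 0, tokensByNode: {}, toolCalls: {} }
  for (const span of spans) {
    if (span.kind === 'llm') {
      const tokens = span.tokens ?? 0
      summary.totalTokens += tokens
      summary.totalCostUsd += span.costUsd ?? 0
      summary.tokensByNode[span.node] = (summary.tokensByNode[span.node] ?? 0) + tokens
    } else if (span.kind === 'tool' && span.toolName) {
      summary.toolCalls[span.toolName] = (summary.toolCalls[span.toolName] ?? 0) + 1
    }
  }
  return summary
}

// The example trace from above, flattened into spans.
const exampleTrace: AgentSpan[] = [
  { node: 'agent', kind: 'llm', tokens: 1240, costUsd: 0.006 },
  { node: 'tool_approval', kind: 'interrupt' },
  { node: 'tools', kind: 'tool', toolName: 'search_web' },
  { node: 'agent', kind: 'llm', tokens: 380, costUsd: 0.002 },
]

const traceSummary = summarizeTrace(exampleTrace)
console.log(traceSummary.totalTokens) // 1620
console.log(traceSummary.toolCalls)
```

Grouping tokens by node is what lets you spot the one node responsible for most of the spend, and counting tool calls is what surfaces a tool that ran four times instead of once.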

Why Langfuse

Langfuse is an open-source LLM observability platform built specifically for this. I'm not affiliated with Langfuse — I use it because it's open-source, does the job well, and doesn't lock you in. Here's what makes it a good fit for agent builders:

Open-source and self-hostable. All Langfuse features (traces, evals, prompt management, annotation queues, playground) are MIT licensed. You can run the full platform on your own infrastructure with Docker Compose, which means full data ownership and no vendor lock-in. This matters for teams in regulated industries or with data residency requirements.

LangGraph native. LangGraph is built on LangChain. Langfuse's CallbackHandler integrates directly with the LangChain callback system, automatically capturing every graph node execution, LLM call, and tool invocation with full semantic context.

Broad compatibility via OpenTelemetry. Beyond LangGraph, Langfuse provides native integrations for many popular TypeScript and Python libraries, including the OpenAI SDK and LiteLLM. For anything not natively supported, if your stack already emits OpenTelemetry spans, Langfuse can ingest them without framework-specific integration code. One platform covers your entire stack.

Traces are just the start. Langfuse also handles prompt management and evaluations in the same platform. Once your traces are flowing, you can run LLM-as-a-judge evals on them, collect user feedback, and build datasets for regression testing. More on that in a follow-up post.

Alternatives worth knowing: LangSmith is excellent if you're already deep in the LangChain ecosystem, but it's closed-source. Helicone offers the fastest setup (a proxy URL swap), but proxy-based tracing captures less semantic detail for complex agent graphs. Langfuse sits in the sweet spot of depth and control.

Adding Langfuse to the Fullstack Template

I recently added Langfuse support to the fullstack-langgraph-nextjs-agent template. The integration required about 26 lines of code across two files. Here's how it works.

Install

pnpm add @langfuse/langchain @langfuse/otel @opentelemetry/sdk-node

Configure

LANGFUSE_ENABLED=true
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_BASE_URL=https://cloud.langfuse.com

For self-hosting, swap LANGFUSE_BASE_URL to your local instance (e.g., http://localhost:3000). The setup instructions are in OBSERVABILITY.md.
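If you want to fail soft when the variables are incomplete, a small reader like the following can help. This is a sketch under assumptions: `readLangfuseConfig` and `LangfuseConfig` are hypothetical names, the template may wire things differently, and the default base URL simply mirrors the value shown above.

```typescript
// Hypothetical helper; the template's real wiring may differ.
interface LangfuseConfig {
  secretKey: string
  publicKey: string
  baseUrl: string
}

type Env = Record<string, string | undefined>

// Returns the config when tracing is enabled and fully configured, otherwise null.
function readLangfuseConfig(env: Env): LangfuseConfig | null {
  if (env.LANGFUSE_ENABLED !== 'true') return null

  const secretKey = env.LANGFUSE_SECRET_KEY
  const publicKey = env.LANGFUSE_PUBLIC_KEY
  if (!secretKey || !publicKey) {
    console.warn('LANGFUSE_ENABLED is true but keys are missing; tracing stays off')
    return null
  }

  // Assumed default: Langfuse Cloud. Point this at your own instance when self-hosting.
  const baseUrl = env.LANGFUSE_BASE_URL ?? 'https://cloud.langfuse.com'
  return { secretKey, publicKey, baseUrl }
}
```

In a Node.js process you would call it as `readLangfuseConfig(process.env)` and skip tracing setup whenever it returns `null`.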

Piece 1: OTel Initialization (`instrumentation.ts`)

According to the Langfuse OTel docs, the SDK should be initialized once per process — before any application code runs — so that the span processor is registered before the first trace is emitted. How you do that depends on your framework. For non-Next.js apps, the docs cover the setup for Node.js, Python, and other runtimes.

In Next.js, the right place is the built-in instrumentation.ts hook, which the framework calls once at process startup before any route handler executes:

export async function register() {
  if (process.env.NEXT_RUNTIME === 'nodejs' && process.env.LANGFUSE_ENABLED === 'true') {
    const { NodeSDK } = await import('@opentelemetry/sdk-node')
    const { LangfuseSpanProcessor } = await import('@langfuse/otel')

    const sdk = new NodeSDK({
      spanProcessors: [new LangfuseSpanProcessor()],
    })

    sdk.start()
  }
}

The NEXT_RUNTIME === "nodejs" guard is important. Next.js evaluates instrumentation.ts in both the Node.js runtime and the Edge runtime. The OTel SDK uses Node.js-only APIs that would crash in Edge context. Dynamic imports prevent Next.js from bundling these modules for the Edge bundle.

Piece 2: Semantic Callbacks (`agentService.ts`)

The OTel processor captures infrastructure spans (HTTP calls, DB queries). To capture agent-specific semantics — LLM inputs/outputs, token counts, tool names — we need to hook into LangGraph's execution. Langfuse does this through LangChain's callback system.

Callbacks are the right pattern here for two reasons. First, they're async and fire out-of-band: sending trace data to Langfuse adds no latency to your agent's response time. Second, they're isolated: if the Langfuse endpoint is down or a tracing call fails, it doesn't throw in your agent's execution path. Observability that can take down your agent isn't observability you can trust in production.
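The two properties can be illustrated without any tracing library. The sketch below is a generic fire-and-forget pattern with hypothetical names (`emitOutOfBand`, `runAgentStep`); it is not how LangChain or Langfuse implement callbacks internally, only a picture of what "out-of-band and isolated" means.

```typescript
// Hypothetical illustration of the isolation idea; not LangChain's implementation.
type TraceEvent = { name: string; payload: unknown }
type TraceHandler = (event: TraceEvent) => Promise<void>

// Fire the handler without awaiting it, and swallow any failure so the
// caller's execution path never sees a tracing error.
function emitOutOfBand(handler: TraceHandler, event: TraceEvent): void {
  Promise.resolve()
    .then(() => handler(event))
    .catch((err: unknown) => {
      console.warn('tracing failed, continuing:', err)
    })
}

// Stand-in for one agent step: it emits a trace event, then does its work.
async function runAgentStep(input: string, tracer: TraceHandler): Promise<string> {
  emitOutOfBand(tracer, { name: 'step:start', payload: input })
  return `echo: ${input}` // placeholder for the real agent work
}
```

Even if `tracer` rejects on every call, `runAgentStep` still resolves with its normal result.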

We attach a CallbackHandler to each agent.stream() call:

import { CallbackHandler } from '@langfuse/langchain'

const langfuseHandler = process.env.LANGFUSE_ENABLED === 'true' ? new CallbackHandler() : null

// In streamResponse():
const iterable = await agent.stream(inputs, {
  configurable: { thread_id: threadId },
  ...(langfuseHandler ? { callbacks: [langfuseHandler] } : {}),
})

When LANGFUSE_ENABLED is false or missing, the handler is null and the spread adds nothing. Zero overhead, no silent errors from missing credentials, no code changes needed to disable.

What You See in the Dashboard

The agent runs in the browser as a standard chat interface backed by the LangGraph agent.

Every conversation that goes through that UI generates a trace. Once traces are flowing, Langfuse gives you a hierarchical view of each agent run — the full node execution sequence, nested LLM calls with token counts and costs, tool invocations with their inputs and outputs, and latency at every step.

At the project level you get aggregated cost and latency trends over time, which makes it easy to spot regressions after prompt or model changes.

Quick Start

The template has Langfuse support built in. Clone it and add your keys:

git clone https://github.com/agentailor/fullstack-langgraph-nextjs-agent
cd fullstack-langgraph-nextjs-agent
pnpm install
# Copy .env.example to .env.local and fill in LANGFUSE_* vars
pnpm dev

Traces appear in your Langfuse project the moment the agent handles its first message. For self-hosting instructions, see OBSERVABILITY.md.

Conclusion

Observability isn't an optional extra for production agents. It's how you move from "I think it's working" to "I can prove why it works and catch it when it doesn't."

Langfuse gives you traces. But traces are the foundation, not the destination. Once you have a record of what your agent does, the next step is evaluations: systematically measuring output quality, catching regressions when you change a prompt, and building datasets that let you iterate with confidence.

That's the next article.

Enjoying content like this? Sign up for Agent Briefings, where I share insights and news on building and scaling AI agents.

Resources

Fullstack LangGraph + Next.js Template (GitHub) — complete implementation with Langfuse built in
OBSERVABILITY.md — setup guide including self-hosting
Langfuse Docs — official documentation
Langfuse LangGraph Integration — official cookbook

How I Made My Blog Native to AI Agents (And Launched One)

Ali Ibrahim — Thu, 07 May 2026 13:49:10 +0000

A few months ago I started noticing something in my analytics. Agentailor is about a year old now, ~40K reads total, averaging around 3.5K reads per month. But inside that traffic, a growing slice wasn't coming from humans. It was agents — coding assistants, AI crawlers, automated pipelines — fetching articles to use as context.

This wasn't a surprise exactly. The blog covers MCP, LangGraph, and production AI agent patterns. Of course tools like Claude Code, Cursor, and ChatGPT are going to pull this content when developers ask them agent-related questions. But watching the trend grow month after month made something click: if agents are already here, I should build for them properly.

That's when I started thinking seriously about what "AI-first" means in practice — not as a design philosophy but as a set of concrete engineering choices. Here's what I ended up building, why each decision was made, and what you can apply to your own project.

What AI-First Actually Means

Before diving into the features, let me be precise about what I mean.

AI agents are already users of your platform. Most content platforms haven't designed for them. The instinct is to say "agents can just convert HTML to Markdown" — and technically, many MCP tools do exactly that. But third-party conversion is lossy. You have no control over what gets dropped (code blocks, tables, structured callouts), it adds latency to every request, and the output quality varies based on which tool is doing the converting.

First-party Markdown means you control the output. You decide what structure is preserved, how code blocks are formatted, whether image URLs are absolute. The agent gets exactly what you intended to deliver.

But this isn't only about agents. It's about giving users — human and AI alike — the choice of how to consume your content. Not everyone wants to read a 4,000-word article end-to-end. Some want to paste it into their AI assistant and ask a specific question. Some want to skim. Giving them clean, accessible formats respects that.

The analogy I keep using: just like you built a mobile-responsive layout for phone users, you should build an agent-responsive content layer for AI users. Your human readers will benefit from it too.

llms.txt and llms-full.txt

The first thing I added was llms.txt.

The idea is simple: robots.txt tells crawlers what they can access, llms.txt tells AI assistants what they should read. It's an emerging standard (llmstxt.org) and the concept is straightforward — a plain text file at the root of your site that lists your content with links and optional summaries.

Here's what Agentailor's llms.txt looks like:

# Agentailor

> Your go-to resource for building production-ready AI agents...

Note: All article links point to .md files containing clean Markdown. Use these
instead of the HTML pages to save tokens.

## Blog Posts

- [How to Build Your First MCP Server in 5 Minutes](https://blog.agentailor.com/posts/create-your-first-mcp-server-in-5-minutes.md): Step-by-step guide...
- [LangGraph vs LlamaIndex for JavaScript](https://blog.agentailor.com/posts/langgraph-vs-llamaindex-javascript.md): Comparing the two...

Note that the links point to .md files, not HTML pages. This is intentional: when an AI assistant follows a link from llms.txt, it gets clean Markdown, not a web page. The file is telling agents: "here's what exists, and here's how to read it efficiently."

I also generate llms-full.txt: the same index but with more detailed summaries of every post included inline. This is useful for AI tools that want to load the entire site's content into a single context window.

Both files are generated at build time from the published blog posts. Every build keeps them in sync automatically.

If you want to add this to your project: the format is dead simple. A # title, a > site description, and a list of links. You could generate it with a build script, a route handler, or even a static file you update manually. The spec is open and lightweight.
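As a rough sketch of the build-script route, the function below renders that structure from a list of posts. `PostMeta` and `renderLlmsTxt` are hypothetical names, and the `/posts/[slug].md` link pattern is taken from the example above; adapt both to your own build.

```typescript
// Hypothetical post shape; adapt it to however your build loads posts.
interface PostMeta {
  title: string
  slug: string
  summary: string
}

// Render an llms.txt body: "# title", "> description", then a list of links.
function renderLlmsTxt(siteTitle: string, description: string, baseUrl: string, posts: PostMeta[]): string {
  const links = posts.map(
    (post) => `- [${post.title}](${baseUrl}/posts/${post.slug}.md): ${post.summary}`,
  )
  return [`# ${siteTitle}`, '', `> ${description}`, '', '## Blog Posts', '', ...links, ''].join('\n')
}

const llmsTxt = renderLlmsTxt(
  'Agentailor',
  'Your go-to resource for building production-ready AI agents...',
  'https://blog.agentailor.com',
  [
    {
      title: 'How to Build Your First MCP Server in 5 Minutes',
      slug: 'create-your-first-mcp-server-in-5-minutes',
      summary: 'Step-by-step guide...',
    },
  ],
)
console.log(llmsTxt)
```

Writing the returned string to the site's static root at build time is enough to keep the file in sync with published posts.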

Per-Post Markdown API

Every post on Agentailor is available at /posts/[slug].md.

Fetch https://blog.agentailor.com/posts/mcp-typescript-sdk-complete-guide.md and you get 45,000 words of clean, structured Markdown. No nav. No footer. No scripts. No conversion artifacts. The code blocks are intact, the headings are preserved, the image paths are absolute URLs.

This is the backbone that llms.txt links to. But it's also available to anything that knows the URL pattern — MCP fetch tools, RAG pipelines, AI coding assistants.

The implementation is straightforward: during the build, a script generates a .md file for each published post and writes it to the static output folder. The files are served as static assets — no route handler needed.
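The build script itself is not shown here. As one small piece of such a script, this sketch shows how relative image paths could be rewritten to absolute URLs before the `.md` file is written. `absolutizeImageUrls` is a hypothetical helper, and the regex only handles the simple `![alt](src)` form.

```typescript
// Hypothetical helper: rewrites Markdown image paths to absolute URLs.
function absolutizeImageUrls(markdown: string, siteUrl: string): string {
  // Matches ![alt](src) where src contains no whitespace or closing parenthesis.
  return markdown.replace(/!\[([^\]]*)\]\(([^)\s]+)\)/g, (_match: string, alt: string, src: string) => {
    // new URL(src, base) resolves relative paths and leaves absolute URLs unchanged.
    const absolute = new URL(src, siteUrl).href
    return `![${alt}](${absolute})`
  })
}

const sampleMarkdown = 'Intro\n\n![diagram](/static/images/diagram.png)\n'
console.log(absolutizeImageUrls(sampleMarkdown, 'https://blog.agentailor.com'))
```

Doing this at build time means an agent that fetches the Markdown never has to guess which origin an image path belongs to.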

The "Copy as Markdown" button is the human-facing equivalent of the same feature. On every post, there's a small button in the header. Click it and the article's Markdown is copied to your clipboard — ready to paste directly into Claude, ChatGPT, or any AI assistant. The implementation fetches /posts/[slug].md on demand and writes it to the clipboard:

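const handleCopy = async () => {
  // Build the Markdown URL from the current page path
  const mdUrl = window.location.pathname + '.md'
  const res = await fetch(mdUrl)
  // Avoid copying an error page to the clipboard if the .md file is missing
  if (!res.ok) throw new Error(`Failed to fetch ${mdUrl}: ${res.status}`)
  const text = await res.text()
  await navigator.clipboard.writeText(text)
}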

That's the whole thing. The URL pattern pathname + '.md' maps /posts/some-post to /posts/some-post.md. No custom API endpoint, no server-side logic, just a static file fetch.

The /summarize <url> command in the Agentailor agent (more on that below) is the agent-native equivalent of this button. Both are solving the same problem — give users and agents the content in the format they actually need — from different entry points.

The Blog Redesign

I also redesigned the blog earlier this year.

The old design was functional but cluttered. The new one takes inspiration from a minimalistic approach: high contrast, generous whitespace, typography-first, minimal chrome. The goal is that nothing competes with the article itself.

This matters more than it sounds for AI-first design. Clean, semantic HTML is easier for agents to parse even before they hit the Markdown API. Clear heading hierarchy, consistent code block structure, and minimal decorative markup all make the HTML more parseable. It's a floor, not a ceiling — the Markdown API is still the right choice for agents — but it doesn't hurt to start from a well-structured baseline.

More importantly, the redesign was about making the blog a place worth returning to. 3.5K reads a month means people are finding Agentailor useful. The redesign is about earning that return visit.

The Agentailor Agent: v0.1

This is the piece I'm most excited about, and the most intentionally scoped.

The Agentailor agent is a chat widget — designed as a terminal — embedded on the site. It's live now at blog.agentailor.com. The terminal aesthetic isn't decoration. The audience is developers. Terminals are how we think. Slash commands map to developer instincts in a way that clicking buttons doesn't.

Here's what it supports in v0.1:

Command	What it does
`/help`	Show available commands
`/clear`	Clear the terminal, start a new session
`/summarize <url>`	Summarize a blog article by URL
`/find <topic>`	Find articles about a topic
Natural language	Consult on architecture, production patterns, and agent design
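The command set above is small enough to parse with a few lines of code. The following is a sketch, not the widget's actual implementation: `parseCommand` and the `Command` union are hypothetical names, and the error messages are made up for illustration.

```typescript
// Hypothetical parser for the v0.1 command set; the real widget may differ.
type Command =
  | { kind: 'help' }
  | { kind: 'clear' }
  | { kind: 'summarize'; url: string }
  | { kind: 'find'; topic: string }
  | { kind: 'chat'; message: string } // natural language fallback
  | { kind: 'error'; reason: string }

function parseCommand(input: string): Command {
  const text = input.trim()
  // Anything that does not start with a slash is treated as natural language.
  if (!text.startsWith('/')) return { kind: 'chat', message: text }

  const firstSpace = text.indexOf(' ')
  const commandName = firstSpace === -1 ? text : text.slice(0, firstSpace)
  const arg = firstSpace === -1 ? '' : text.slice(firstSpace + 1).trim()

  switch (commandName) {
    case '/help':
      return { kind: 'help' }
    case '/clear':
      return { kind: 'clear' }
    case '/summarize':
      return arg ? { kind: 'summarize', url: arg } : { kind: 'error', reason: 'usage: /summarize <url>' }
    case '/find':
      return arg ? { kind: 'find', topic: arg } : { kind: 'error', reason: 'usage: /find <topic>' }
    default:
      return { kind: 'error', reason: `unknown command: ${commandName}` }
  }
}
```

Returning a tagged union rather than executing directly keeps the parser easy to test and gives another agent a predictable structure to call into.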

The knowledge base is intentionally focused: it knows Agentailor's content deeply. That's a feature, not a limitation. A focused agent that gives grounded, opinionated answers is more useful than a broad one that hedges everything. Ship something real, learn, expand.

Why This Is Different from a Chat Widget

Most chat widgets on content sites are support tools — they answer questions, resolve ambiguity, point to docs. This one is designed to do something harder: act as an architect.

The difference matters. Any agent with web search or the right skills can answer a question. What's rarer is an agent that tells you why your current approach won't scale, what the production failure mode looks like, or how the pattern you're using breaks under real-world load. That's the gap Agentailor is designed to fill — not information retrieval, but architectural judgment built on direct experience building and writing about these systems.

Think of it less as "ask a question, get an answer" and more like consulting a senior engineer: "here's what I'm building, here's my current approach — is this right? What would you do differently at scale?"

The design is explicitly dual-audience: built for human developers today, built for agents too. What does "built for agents" mean here? It means the interface is designed to be callable, not just clickable. The slash command vocabulary maps to agent-native patterns (consult, find, summarize) not GUI patterns (click, scroll, navigate). The responses are structured to be useful as context passed between agents, not just readable as chat.

The roadmap for v0.2 and beyond is: more commands, broader knowledge base, and an agent-to-agent interface. The vision is an agent that other coding agents can consult the way a junior dev consults a senior engineer. Your agent hits a hard architectural decision — "should I use a supervisor pattern or parallel subgraphs here?" — and rather than guessing or surfacing generic results from a web search, it consults Agentailor: a platform with a proven track record, built specifically for this problem space. It gets back a specific, opinionated answer grounded in real production experience — not a generic result scraped from the internet.

That's what "AI-first platform" means at its fullest: not just readable by AI, but genuinely useful to AI as a peer.

What You Can Take From This

If you're building a content platform, a developer tool, or any product that AI agents might interact with, here are five concrete things you can steal:

1. Add llms.txt. Takes less than an hour. List your content, point links to clean Markdown versions. The spec is at llmstxt.org. It signals to AI tools that you've thought about their needs.

2. Serve your content as first-party Markdown. Whether it's /posts/[slug].md, a /api/content endpoint, or a bulk export, give agents a format they can use without lossy conversion. You control the fidelity.

3. Add a "Copy as Markdown" button. Your human readers who use AI assistants will thank you. One button, one fetch call to your own .md endpoint. Fifteen minutes of work.

4. Design for choice, not just completeness. Not everyone reads every word. Give users tools to engage with your content in the way that works for them — summaries, search, quick copy. This applies to both human and AI users.

5. Ask whether your interface serves agents. If you're building a chat interface or an API, ask explicitly: can an AI agent use this? Not just a human with an AI assistant — but an agent acting autonomously. If the answer is no, it probably could be yes with small changes.

What's Next

The Agentailor agent is at v0.1, and shipping it at that narrow scope is intentional. The next version will expand the knowledge base, add more commands, and open up the agent-to-agent interface.

If you want to see it in action: open blog.agentailor.com, click the terminal in the bottom right, and bring it a real problem — an architectural decision you're stuck on, a pattern you're not sure scales, a tradeoff you want a second opinion on. That's what it's built for.

For context on where agents are heading more broadly, the Agent Development Roadmap is a good next read. If you want to go deeper on MCP specifically, the MCP TypeScript SDK Complete Guide covers the protocol end-to-end.

The web was built for browsers. It's being rebuilt for agents. Might as well build for both.

AI Agent Roadmap: Everything You Need to Build Agents (In the Right Order)

Ali Ibrahim — Sun, 19 Apr 2026 13:29:57 +0000

Introduction

There is no shortage of content on AI agents. Tutorials, framework comparisons, deep dives on MCP, prompting guides, memory strategies — the material is out there. What is often missing is the map.

If you are a developer picking up agents for the first time, the landscape can feel overwhelming: Which framework? Which language? Do I need MCP? What even is an eval? This article answers all of those questions, but more importantly, it answers them in the right order.

By the end, you will know what to learn, what to build first, and what to come back to later. Each phase links to dedicated articles that go deeper. Think of this as your table of contents for the entire journey.

Phase 0: Get the Mental Model Right

Before you pick a framework or write a single line of agent code, you need to answer one question: does your problem actually need an agent?

Most AI-powered features do not. A workflow — a predefined sequence of LLM calls and logic — is simpler, faster, cheaper, and easier to debug. Agents shine when the path to the goal is genuinely uncertain: when the system needs to reason about what to do next, adapt based on new information, or handle open-ended tasks.
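The difference is easiest to see side by side. In this toy sketch the "model" is a plain function (`Decide`, a hypothetical stand-in for an LLM call): the workflow's steps are fixed in code, while the agent asks the model what to do next on every turn.

```typescript
// Stand-in for a model call; a real one would hit an LLM API.
type Decide = (goal: string, history: string[]) => { action: 'search' | 'answer'; text: string }

// Workflow: the sequence of steps is fixed in code.
function runWorkflow(question: string): string[] {
  return [`search: ${question}`, 'summarize results', `answer: ${question}`]
}

// Agent: the model picks the next step each turn, until it answers or hits the step cap.
function runAgent(goal: string, decide: Decide, maxSteps = 5): string[] {
  const history: string[] = []
  for (let step = 0; step < maxSteps; step++) {
    const next = decide(goal, history)
    history.push(`${next.action}: ${next.text}`)
    if (next.action === 'answer') break
  }
  return history
}

// Toy policy: search once, then answer.
const toyDecide: Decide = (goal, history) =>
  history.length === 0
    ? { action: 'search', text: goal }
    : { action: 'answer', text: `result for ${goal}` }

console.log(runWorkflow('pricing'))
console.log(runAgent('pricing', toyDecide))
```

The workflow always costs the same number of steps. The agent's cost depends on what the model decides, which is why it needs a step cap and why it is harder to test.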

Using an agent when a workflow would do is one of the most common mistakes in AI development. It adds complexity without adding value.

The distinction is not just conceptual. It shapes your architecture, your testing strategy, and your costs. Get this right before anything else.

Read: The Future of AI Building: Workflows, Agents, and Everything In Between

Phase 1: Pick Your Stack (and Stop Second-Guessing It)

Once you have decided agents are the right tool, you will face the stack question. The good news: you probably already have the answer.

Language

If you write Python: Stay there. The Python agent ecosystem (LangChain, LangGraph, the OpenAI Agents SDK) is mature, well-documented, and has the largest community.

If you write TypeScript: You are equally well-served. LangGraph.js, Vercel AI SDK, and the OpenAI Agents SDK for TypeScript have all reached production maturity. The gap with Python has closed significantly.

If you come from a typed language like Java, Go, or C#: TypeScript is the recommended entry point. The mental model will feel familiar, the npm ecosystem for agents is growing fast, and you will not need to learn a dynamically typed language to get started.

The one thing to avoid: switching languages specifically to learn agents. The cognitive overhead of learning a new language and a new paradigm at the same time is high. Pick the language you already know.

Framework

The framework landscape can be paralysing. A few principles to cut through it:

Pick one framework to start. Depth in one beats surface knowledge across five.
For multi-step, stateful agents, LangGraph (Python or JS) is the most battle-tested option.
For simpler, tool-calling agents, the OpenAI Agents SDK is a good starting point.

Read: Choosing Your Stack: LangChain and LangGraph in Python vs TypeScript

Read: Top 10 Most Starred AI Agent Frameworks on GitHub (2026)

Read: Top 5 TypeScript AI Agent Frameworks You Should Know in 2026

Read: LangGraph vs LlamaIndex Showdown: Who Makes AI Agents Easier in JavaScript?

Phase 2: Learn the 4 Core Primitives

Every AI agent, regardless of framework or language, is built from the same four pieces. Master these concepts and any framework becomes learnable quickly. Skip them and you will be debugging symptoms instead of understanding causes.

1. The Model (The Brain)

The language model is the reasoning engine of your agent. Everything else is infrastructure around it.

Choosing the right model is not just a performance question; it is a cost, latency, and deployment question. Frontier models like GPT-5 or Claude offer the highest capability but come with API costs and latency. Open-weight models give you more control and can run locally, but require more setup.

For most developers starting out, begin with a hosted frontier model. Optimize later once you understand your agent's actual requirements.

Read: GPT-5 Is Here — And It's Built for Devs Who Build with Tools

Read: OpenAI Releases GPT-OSS: What It Means for AI Developers and Agent Builders

Read: Run Open-Source AI Models Locally with Docker Model Runner

2. Tools (How Agents Act on the World)

A model without tools can only reason and respond. Tools are what let an agent actually do something: search the web, query a database, call an API, write a file.

Tool design is one of the most underestimated skills in agent development. Poorly named tools, tools that do too much, or tools with unhelpful error messages are a common source of agent failures that look like model problems.

Key principles: each tool should do one thing, have a name that is self-explanatory to the model, and return errors in a form the model can reason about and recover from.
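Here is what those principles might look like for a single tool. This is a framework-agnostic sketch: `getOrderStatus`, the `ToolResult` shape, and the in-memory `ORDERS` table are all hypothetical, and a real tool would call your API instead.

```typescript
// Hypothetical result shape: errors are data the model can read, not thrown exceptions.
type ToolResult =
  | { ok: true; data: string }
  | { ok: false; error: string; hint: string }

// In-memory stand-in for a real orders API.
const ORDERS: Record<string, string> = { 'A-100': 'shipped', 'A-101': 'processing' }

// One tool, one job, with a name the model can understand without extra context.
function getOrderStatus(orderId: string): ToolResult {
  if (!/^A-\d+$/.test(orderId)) {
    return {
      ok: false,
      error: 'invalid_order_id',
      hint: 'Order IDs look like "A-100". Ask the user to confirm the ID.',
    }
  }
  const status = ORDERS[orderId]
  if (status === undefined) {
    return {
      ok: false,
      error: 'order_not_found',
      hint: `No order ${orderId} exists. Do not retry with the same ID.`,
    }
  }
  return { ok: true, data: status }
}

console.log(getOrderStatus('A-100'))
console.log(getOrderStatus('100'))
```

The `hint` field is the part that matters most: it tells the model what to do next instead of leaving it to retry blindly.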

Read: Writing Effective Tools for AI Agents: Production Lessons from Anthropic

3. Memory (What It Remembers)

Agents operate inside a context window. That window is finite, and in multi-turn conversations or long-running tasks, it fills up fast.

Memory in agents has two layers: short-term (what is currently in the context window) and long-term (external storage the agent can read from and write to). Managing the boundary between the two is an engineering problem, not just a prompt problem.

Naive approaches — keeping the full message history forever — break down quickly. Smarter strategies use summarization, selective retention, and structured external memory to keep agents coherent across long sessions.
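One simple version of that idea: keep the most recent turns verbatim and fold everything older into a single summary message. The sketch below assumes a hypothetical `Summarize` function, which in practice would be an LLM call, and a minimal `ChatMessage` shape that is not tied to any framework.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant'
  content: string
}

// Stand-in summarizer; in practice this would be an LLM call.
type Summarize = (messages: ChatMessage[]) => string

// Keep system messages and the last `keepRecent` turns verbatim;
// fold everything older into one summary message.
function compactHistory(history: ChatMessage[], keepRecent: number, summarize: Summarize): ChatMessage[] {
  const system = history.filter((m) => m.role === 'system')
  const turns = history.filter((m) => m.role !== 'system')
  if (turns.length <= keepRecent) return history

  const older = turns.slice(0, turns.length - keepRecent)
  const recent = turns.slice(turns.length - keepRecent)
  const summary: ChatMessage = {
    role: 'system',
    content: `Summary of earlier conversation: ${summarize(older)}`,
  }
  return [...system, summary, ...recent]
}
```

Calling `compactHistory(messages, 6, summarizeWithLlm)` before each model call keeps the context bounded. Deciding what the summary must preserve, such as user preferences or open tasks, is where the real engineering work lies.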

Read: Don't Let Your AI Agent Forget: Smarter Strategies for Summarizing Message History

4. Prompting (The System Prompt Is Code)

The system prompt is not a suggestion. It is the behavioral contract for your agent: what it does, how it reasons, when it uses tools, what it refuses, how it handles uncertainty.

Treat it with the same discipline you would apply to application code. Version it. Review changes. Test it against known failure cases. Small edits to the system prompt can have outsized effects on agent behavior, for better or worse.

Read: The Art of Agent Prompting: Anthropic's Playbook for Reliable AI Agents

Phase 3: Build Your First Agent

With the mental model in place and the primitives understood, it is time to build something that runs.

The goal of this phase is not a production-ready application. It is getting the feedback loop working: write agent logic, run it, observe what it does, understand why, iterate. This is how you learn faster than any tutorial can teach you.

Pick one framework from Phase 1 and follow it end-to-end. Resist the urge to switch frameworks when you hit friction; friction early is usually a sign you are learning, not a sign you chose wrong.

Read (TypeScript): Getting Started with OpenAI's Agents SDK for TypeScript

Read (LangGraph path): How to Build a Fullstack AI Agent with LangGraphJS and NestJS

Phase 4: Extend With MCP (Tools at Scale)

Once your agent is working, you will quickly hit the ceiling of hand-coded tools. Building a custom integration for every API your agent needs does not scale.

This is where the Model Context Protocol (MCP) comes in. MCP is an open standard that lets agents connect to tools, data sources, and services through a common interface. Instead of writing custom tool code for GitHub, Notion, or Stripe, you connect your agent to existing MCP servers that expose those integrations.

There are two paths here:

The first is using existing MCP servers: running pre-built servers locally or in the cloud and connecting your agent to them.
The second is building your own: creating MCP servers to expose your own APIs and data sources to any compatible agent.

A note on the current debate: you will find arguments online that "MCP is dead" and that CLI tools are the better default.

CLI tools are a legitimate choice for well-known, documented tools like git or gh, where a shell command is simpler and cheaper to invoke than a full MCP server. But this framing misses what MCP is actually good at: standardized access to APIs and internal systems that have no CLI equivalent, with scoped permissions, auditable logs, and a consistent interface across any compatible agent.

The standard is also gaining institutional backing, which matters for enterprise contexts. The practical answer is not CLI or MCP; it is knowing when to use each. Do not let the hype cycle, in either direction, talk you into skipping this phase. Understanding MCP is foundational to building agents at scale.

Read: Run Any MCP Server Locally with Docker's MCP Catalog and Toolkit

Read: Create Your First MCP Server in 5 Minutes with create-mcp-server

Read: The MCP TypeScript SDK: A Complete Guide to Tools, Resources, Prompts, and Beyond

Phase 5: Evaluate Before You Ship

This is the phase most developers skip. It is also the one they regret most.

Agents are non-deterministic. The same input can produce different outputs across runs. Manual testing — running the agent a few times and checking that it "seems fine" — is not enough. It gives you false confidence, and it does not scale as your agent's behavior becomes more complex.

Evaluation is the practice of measuring agent performance systematically. Before you write your first eval, define what "correct" looks like in concrete terms. What does a good output contain? What does a bad output look like? Without that definition, you cannot measure anything meaningful.

Start small: collect 20 to 50 real-world cases where your agent failed or behaved unexpectedly. These are worth more than hundreds of synthetic benchmarks. Then build graders to score outputs automatically. Three types are available to you:

code-based graders for deterministic checks (did the agent call the right tool?)
model-based graders for flexible judgment (is this response helpful and accurate?), and
human graders for ground truth calibration.
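A code-based grader can be very small. The sketch below checks the "did the agent call the right tool?" case; `AgentRun` and `gradeToolCall` are hypothetical names, and the run record would come from whatever your traces expose.

```typescript
// Hypothetical record of one agent run; adapt to whatever your traces expose.
interface AgentRun {
  toolCalls: { name: string; args: Record<string, unknown> }[]
  output: string
}

interface GradeResult {
  pass: boolean
  reason: string
}

// Code-based grader: deterministic check that the expected tool was called exactly once.
function gradeToolCall(run: AgentRun, expectedTool: string): GradeResult {
  const calls = run.toolCalls.filter((call) => call.name === expectedTool)
  if (calls.length === 0) return { pass: false, reason: `${expectedTool} was never called` }
  if (calls.length > 1) return { pass: false, reason: `${expectedTool} was called ${calls.length} times` }
  return { pass: true, reason: 'ok' }
}
```

Returning a reason alongside the verdict makes failed cases much quicker to triage than a bare boolean.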

Because agents are non-deterministic, run each test case multiple times and measure how often the agent succeeds across those runs. Metrics such as pass@k (the probability that at least one of k attempts succeeds) summarize this and give you a much more honest picture than a single pass or fail.
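For reference, pass@k can be computed from n recorded attempts with c successes as 1 - C(n-c, k) / C(n, k). The sketch below uses the product form of that expression, which is the estimator commonly used in code-generation benchmarks; whether it matches the exact metric your eval tooling reports is an assumption you should check.

```typescript
// pass@k: estimated probability that at least one of k sampled attempts succeeds,
// given n recorded attempts of which c succeeded. Equals 1 - C(n-c, k) / C(n, k).
function passAtK(n: number, c: number, k: number): number {
  if (k < 1 || k > n || c < 0 || c > n) {
    throw new RangeError('expected 1 <= k <= n and 0 <= c <= n')
  }
  // Fewer than k failures: any k attempts must include at least one success.
  if (n - c < k) return 1

  // Product form of C(n-c, k) / C(n, k), which avoids large factorials.
  let allFail = 1
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i
  }
  return 1 - allFail
}

// 10 runs of one test case, 5 of them successful.
console.log(passAtK(10, 5, 1)) // plain success rate
console.log(passAtK(10, 5, 5)) // chance that at least one of 5 attempts succeeds
```

Note how quickly pass@k climbs with k. If your agent performs real-world side effects, the per-attempt success rate (k = 1) is usually the number to watch.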

Anthropic's engineering team has written the most thorough practical guide on this topic available today.

Read: Demystifying Evals for AI Agents — Anthropic Engineering

Phase 6: Go Fullstack

An agent that runs in a terminal is a prototype. A product needs a UI, real-time feedback, authentication, and — for many use cases — a human-in-the-loop approval step.

Going fullstack means wiring your agent backend to a frontend: streaming responses to the user as the agent works, handling long-running tasks without timeouts, and letting users approve or reject agent actions before they execute. Human-in-the-loop is not just a safety feature; it is often what makes users trust the system.

Read: Building a Fullstack AI Agent with LangGraph.js and Next.js: MCP Integration and Human-in-the-Loop

Read: Implementing OAuth for MCP Clients: A Next.js and LangGraph.js Guide

Phase 7: Deploy

Getting off localhost is a milestone. It means your agent is accessible, persistent, and running in a real environment.

For MCP servers, Google Cloud Run is a strong starting point: it scales to zero when idle, has a generous free tier, and deploys with minimal infrastructure setup. For the agent backend itself, the same principle applies: start with managed infrastructure that lets you focus on the agent, not the servers.

When deploying, pay attention to environment management (API keys, model endpoints), logging (you need to be able to debug agent runs after the fact), and cost monitoring (agent runs can be expensive at scale if not tracked).

Read: Deploy Your MCP Server to Google Cloud Run (For Free)

Read: How I Built and Deployed a Production-Ready AI SaaS in 14 Days Using Agent Initializr

Phase 8: Think Like an Architect

Once you have shipped an agent, the real education begins. You will look back at your first design and see all the decisions you made by accident. This phase is about making those decisions on purpose.

Two concepts become important at this stage.

Skills are a composability pattern: instead of baking every capability directly into your agent, you package behaviors as plug-in skills that the agent can load and use. This keeps your agent core small and lets you iterate on capabilities independently.

Architecture patterns — how you structure agent state, how you handle errors, how you design for multi-step tasks — matter more as your agent grows. Real production systems have made these mistakes and learned from them.

Read: Lessons from OpenClaw's Architecture for Agent Builders

Read: Top 5 Agent Skills Every Agent Builder Should Install

Read: How to Build and Deploy an Agent Skill from Scratch

Conclusion

The path above is sequential for a reason. Each phase builds on the one before it. Getting the mental model right (Phase 0) shapes every framework choice (Phase 1). Understanding the primitives (Phase 2) makes your first build (Phase 3) faster and less frustrating. Evaluating before you ship (Phase 5) is what separates prototypes from products.

If you take one thing from this roadmap: do not skip Phase 5. Evaluation is the most commonly skipped step and the one developers most wish they had started earlier.

The map is here. Start at Phase 0 and build forward.

Enjoying content like this? Sign up for the newsletter Agent Briefings, where I share insights and news on building and scaling AI agents.

References

Demystifying Evals for AI Agents — Anthropic Engineering
How to Think About Agent Frameworks — LangChain
Building Effective Agents — Anthropic

5 Agent Skills I’d install before starting any new agent project in 2026

Ali Ibrahim — Mon, 16 Mar 2026 14:44:21 +0000

Your coding agent can write code, refactor functions, and debug errors. But can it design production-grade prompts? Build MCP servers that follow best practices? Evaluate whether your agent's outputs are actually good?

Agent Skills give your coding assistant specialized expertise on demand. They're folders containing a SKILL.md file with instructions, workflows, and references that your agent loads only when relevant. No context bloat, no manual setup. For a deep dive into how skills work and how to build your own, see How to Build and Deploy an Agent Skill from Scratch.

Here are 5 skills that cover the full agent development lifecycle, from designing prompts to evaluating outputs. Every skill listed works across Claude Code, Cursor, VS Code Copilot, Codex, and Gemini CLI.

To install any skill, run:

npx skills add <owner/repo> --skill <skill-name>

1. prompt-engineer

An expert prompt engineering skill that teaches your agent advanced techniques for designing effective LLM prompts. It covers system prompt architecture, few-shot example design, chain-of-thought patterns, output format specification, and context management. The skill identifies common pitfalls like imprecise language, missing format constraints, and prompt injection vulnerabilities.

Why it matters: Prompt design is the highest-leverage activity in agent development. A well-crafted prompt can be the difference between a prototype and a production-ready agent. This skill turns your coding assistant into a prompt engineering partner that catches issues before they reach users.

Best for: Any developer writing or refining prompts for LLM-powered applications and agents.

npx skills add davila7/claude-code-templates --skill prompt-engineer

GitHub: davila7/claude-code-templates

For Anthropic's approach to agent-specific prompting, see The Art of Agent Prompting.

2. skill-creator

Anthropic's official skill for creating, modifying, and evaluating skills. Instead of starting from a blank SKILL.md, this skill guides your agent through an iterative development cycle: define intent, draft the skill file, test with sample prompts, evaluate outputs, and refine. It adapts to different environments (Claude.ai, Claude Code, Cursor) and supports users across technical expertise levels.

Why it matters: Building skills manually is educational but slow. This skill automates the creation process and includes built-in evaluation with variance analysis, helping you ship higher-quality skills faster.

Best for: Developers who want to create custom skills for their team, product, or domain without starting from scratch.

npx skills add anthropics/skills --skill skill-creator

GitHub: anthropics/skills

For the manual approach that teaches you every component, see How to Build and Deploy an Agent Skill from Scratch.

3. mcp-builder

Anthropic's official guide for creating high-quality MCP (Model Context Protocol) servers that enable LLMs to interact with external services. The skill covers the full development cycle in four phases: research and planning, implementation, review and testing, and evaluation creation. It supports both Python (FastMCP) and TypeScript implementations, and emphasizes concise tool descriptions and actionable error messages.

Why it matters: Skills give agents knowledge; MCP servers give agents capabilities. If you need your agent to interact with APIs, databases, or external services, this skill teaches it how to build MCP servers that follow best practices.

Best for: Developers building custom MCP servers to extend their agents' capabilities.

npx skills add anthropics/skills --skill mcp-builder

GitHub: anthropics/skills

For a quick hands-on start with MCP, see Create Your First MCP Server in 5 Minutes.

4. agentic-eval

Patterns and techniques for evaluating and improving AI agent outputs through iterative refinement. This skill teaches your agent to implement self-critique loops, evaluator-optimizer pipelines, rubric-based assessment, and LLM-as-judge evaluation systems. Rather than relying on single-shot generation, it introduces systematic approaches to measuring and improving output quality.

Why it matters: Building an agent is half the challenge. Knowing whether it works reliably is the other half. This skill teaches your coding assistant how to evaluate agent outputs systematically, a practice that separates prototype-quality agents from production-ready ones.

Best for: Developers testing, debugging, or improving agent output quality, especially when building evaluation pipelines.

npx skills add github/awesome-copilot --skill agentic-eval

GitHub: github/awesome-copilot

5. openai-docs

Provides up-to-date OpenAI developer documentation with citations, covering the Responses API, Agents SDK, Chat Completions, Codex, Realtime API, model capabilities, and more. The skill uses OpenAI's MCP server to search, fetch, and browse official documentation pages, prioritizing MCP tools over general web search for accuracy.

Why it matters: LLM training data goes stale. If you are building with OpenAI APIs, this skill ensures your agent references the latest documentation rather than outdated knowledge. Every answer comes with a citation to the official source.

Best for: Developers building with OpenAI APIs who need accurate, current references without leaving their IDE.

npx skills add openai/skills --skill openai-docs

GitHub: openai/skills

Bonus: ai-sdk

The top 5 skills above are broadly useful to any agent builder regardless of stack. This bonus is more specialized, but for its audience, it may be the most valuable skill on this list.

The ai-sdk skill answers questions about the Vercel AI SDK and helps build AI-powered features. It covers core functions like generateText, streamText, ToolLoopAgent, embed, and tool calling. The skill checks local node_modules/ai/docs/ first, then falls back to ai-sdk.dev for the latest information.

Why it matters: The AI SDK is the most downloaded TypeScript AI framework with 2.8M weekly npm downloads. If you're building AI features in a Next.js or React application, this skill makes your coding assistant an AI SDK expert.

Best for: React and Next.js developers integrating AI features using the Vercel AI SDK.

npx skills add vercel/ai --skill ai-sdk

GitHub: vercel/ai

Quick Reference

Skill	Best For	Created By
prompt-engineer	Writing effective LLM prompts	Community (davila7)
skill-creator	Creating and iterating on skills	Anthropic
mcp-builder	Building MCP servers	Anthropic
agentic-eval	Evaluating agent outputs	GitHub
openai-docs	OpenAI API documentation	OpenAI
ai-sdk (Bonus)	Vercel AI SDK development	Vercel

Key Takeaways

These 5 skills cover the full agent development lifecycle: design prompts, package expertise, build tools, evaluate quality, and reference documentation.
All skills are cross-platform. They work across Claude Code, Cursor, VS Code Copilot, Codex, and Gemini CLI.
Combine skills for compound effect. Use prompt-engineer to design your agent's prompts, skill-creator to package your expertise, and agentic-eval to verify quality.
Vet skills before installing. Prefer skills published by known organizations like Anthropic, GitHub, or OpenAI. For community skills, check their detail page on skills.sh: every listed skill displays a Security Audit report so you know what you're installing.
The skills ecosystem is growing fast. Browse skills.sh regularly to discover new skills as they are published.

Enjoying content like this? Sign up for Agent Briefings, where I share insights and news on building and scaling AI agents.

Securing MCP Servers: A Practical Guide with Keycloak (using create-mcp-server)

Ali Ibrahim — Tue, 03 Mar 2026 14:30:00 +0000

Introduction

MCP servers are powerful. They let AI agents interact with databases, APIs, file systems, and virtually anything you can imagine. But there's a catch: most tutorials show you how to build MCP servers without authentication.

That's fine for local development. It's a problem for production.

An unsecured MCP server is an open door. Anyone who discovers your endpoint can invoke your tools, access your resources, and potentially wreak havoc on your systems. As MCP adoption grows and servers move from localhost to cloud deployments, security isn't optional anymore.

The good news? The MCP Authorization specification provides a standard way to secure MCP servers using OAuth 2.1. And with the right tools, implementing it is straightforward.

In this guide, you'll learn:

How MCP authorization works (without the jargon)
How to scaffold a secure MCP server with create-mcp-server
How to set up Keycloak as your OIDC provider
How to test your authenticated server with VS Code, Cursor and a terminal client

Let's lock down your MCP server.

Understanding MCP Authorization

If you're new to MCP, here's the short version: the Model Context Protocol is an open standard that lets AI assistants discover and use external tools. Think of it as a universal adapter between AI models and the real world. For a deeper dive, check out our article on running MCP servers with Docker.

Why Security Matters

When you deploy an MCP server, you're exposing capabilities to the network. Those capabilities might include:

Reading and writing to databases
Sending emails or notifications
Accessing internal APIs
Managing cloud resources

Without authentication, anyone can call these tools. That's why the MCP specification includes authorization as a core feature.

OAuth: The Hotel Key Card

OAuth might sound intimidating, but the concept is simple. Think of it like a hotel key card system:

You check in at the front desk (authentication)
You receive a key card (access token)
You use the key card to access your room (authorized requests)
The door checks if your card is valid (token validation)

That's OAuth in a nutshell. The MCP client gets a token from an authorization server, then includes that token with every request. The MCP server validates the token before granting access.

The MCP specification requires OAuth 2.1 with PKCE (Proof Key for Code Exchange), which adds an extra security layer: even if an attacker intercepts the authorization code, they cannot exchange it for a token without the secret verifier held by the client. You don't need to understand the cryptographic details; just know that it's a modern, secure approach.
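
If you are curious what PKCE looks like on the client side, the sketch below generates the two values defined in RFC 7636 using Node's built-in `crypto` module. The function names are illustrative; MCP clients and OAuth libraries normally do this for you.

```typescript
import { createHash, randomBytes } from "node:crypto";

// code_verifier: 32 random bytes, base64url-encoded (43 characters, no padding).
function createCodeVerifier(): string {
  return randomBytes(32).toString("base64url");
}

// code_challenge for the S256 method: BASE64URL(SHA-256(code_verifier)).
function createCodeChallenge(verifier: string): string {
  return createHash("sha256").update(verifier, "ascii").digest("base64url");
}

const verifier = createCodeVerifier();
const challenge = createCodeChallenge(verifier);

// The client sends the challenge (with code_challenge_method=S256) in the authorization
// request and keeps the verifier secret until it exchanges the code for a token.
console.log({ verifierLength: verifier.length, challengeLength: challenge.length });
```

The authorization server recomputes the hash from the verifier during the token exchange, so an intercepted authorization code is useless without it.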

Dynamic Client Registration (DCR)

Here's something important that often gets overlooked: Dynamic Client Registration.

In traditional OAuth setups, you manually register each client application with your authorization server. You create a client, get a client ID and secret, and configure them in your app. This works fine when you have a handful of known clients.

But MCP is different. The whole point is that any MCP-compatible client should be able to connect to your server. Claude Desktop, VS Code, Cursor, custom agents: there could be dozens of different clients trying to connect.

DCR solves this. It allows clients to register themselves automatically with the authorization server. No manual setup required. The client says "hey, I'd like to connect," and the server says "here are your credentials."

This is critical for the MCP ecosystem to scale. And it's one of the main reasons we're using Keycloak: Keycloak fully supports Dynamic Client Registration. Not all OIDC providers do. If you're evaluating alternatives like Auth0, Azure AD, or Okta, check their DCR support carefully.
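
To make the "hey, I'd like to connect" step concrete, here is a sketch of the JSON body a client might send when registering itself. The field names come from the OAuth Dynamic Client Registration standard (RFC 7591); the helper name, client name, and redirect URI are illustrative assumptions. A client would POST this body to the `registration_endpoint` advertised in the authorization server's metadata.

```typescript
// Subset of the client metadata fields defined by RFC 7591.
interface ClientRegistrationRequest {
  client_name: string;
  redirect_uris: string[];
  grant_types: string[];
  response_types: string[];
  token_endpoint_auth_method: string;
}

// Build a registration body for a public client using the authorization code flow.
function buildRegistrationRequest(
  clientName: string,
  redirectUris: string[],
): ClientRegistrationRequest {
  if (redirectUris.length === 0) {
    throw new Error("At least one redirect URI is required");
  }
  return {
    client_name: clientName,
    redirect_uris: redirectUris,
    grant_types: ["authorization_code", "refresh_token"],
    response_types: ["code"],
    // "none" marks a public client with no secret; PKCE protects the code exchange.
    token_endpoint_auth_method: "none",
  };
}

// Hypothetical client name and local callback URL.
const registration = buildRegistrationRequest("my-terminal-client", [
  "http://127.0.0.1:33418/callback",
]);
console.log(JSON.stringify(registration, null, 2));
```

The server's response includes a generated `client_id`, which the client then uses for the normal OAuth flow.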

For the complete technical specification, see the MCP Authorization documentation.

Introducing create-mcp-server

Building an MCP server from scratch with OAuth is tedious. You need to set up Express, configure middleware, handle token validation, manage sessions, and wire up SSE for real-time updates. That's hours of boilerplate before you write a single tool.

create-mcp-server eliminates that friction. It's a CLI tool that scaffolds production-ready MCP servers in seconds:

npx @agentailor/create-mcp-server

The CLI walks you through a few questions and generates a complete project with TypeScript, Express.js, and optionally OAuth authentication baked in.

Two Templates

Feature	Stateless	Stateful
Session management	—	✓
SSE support	—	✓
OAuth option	—	✓
Endpoints	POST /mcp	POST, GET, DELETE /mcp

Stateless: Each request creates a new transport instance. Simple, but no session persistence.

Stateful: Sessions are maintained across requests. Supports Server-Sent Events for real-time updates. This is what you need for OAuth.

For this guide, we'll use the Stateful template with OAuth enabled.

Scaffolding Your Secure MCP Server

Let's create our server. Run the CLI and answer the prompts:

npx @agentailor/create-mcp-server@0.2.1

Note: If you are using v0.3.0 or later, you'll have a new prompt for selecting a framework. To follow this article, you should select the official SDK.

When prompted:

Enter (y) for npx to download the package
Project name: my-secure-mcp-server
Template: Stateful
Enable OAuth: Yes
Package manager: Your preference (npm, pnpm, or yarn)

The CLI generates this structure:

my-secure-mcp-server/
├── src/
│   ├── server.ts     # MCP server (tools, prompts, resources)
│   ├── index.ts      # Express app and transport setup
│   └── auth.ts       # OAuth middleware
├── package.json
├── tsconfig.json
├── .gitignore
├── .env.example
└── README.md

Key Files Explained

src/server.ts: This is where you define your MCP tools, prompts, and resources. It's the "business logic" of your server.

src/index.ts: The Express application. It sets up routes, applies middleware, and manages the HTTP transport.

src/auth.ts: The OAuth middleware. It is provider-agnostic: it works with any OIDC-compliant authorization server, and you configure the provider through environment variables.

.env.example: Template for required environment variables:

PORT=3000

# OAuth Configuration
# Issuer URL - your OAuth provider's base URL
# Examples:
#   Auth0: https://your-tenant.auth0.com
#   Keycloak: http://localhost:8080/realms/your-realm
OAUTH_ISSUER_URL=https://your-oauth-provider.com

# Audience - the API identifier (optional, but recommended)
# This should match the "aud" claim in your JWT tokens
OAUTH_AUDIENCE=https://your-mcp-server.com

Install dependencies and you're ready to configure Keycloak:

cd my-secure-mcp-server
npm install

Setting Up Keycloak

We're using Keycloak as our OIDC provider. Here's why:

Open-source: No vendor lock-in, full transparency
OIDC-compliant: Works with any OAuth 2.1 / OpenID Connect client
Supports DCR: Dynamic Client Registration out of the box
Self-hosted: Complete control over your auth infrastructure
Battle-tested: Used by enterprises worldwide

Running Keycloak with Docker

From your terminal, run the following command to start the Keycloak container:

docker run -p 127.0.0.1:8080:8080 -e KC_BOOTSTRAP_ADMIN_USERNAME=admin -e KC_BOOTSTRAP_ADMIN_PASSWORD=admin quay.io/keycloak/keycloak start-dev

Wait a minute for startup, then access the admin console at http://localhost:8080.

Configuring Keycloak

1. Create a Realm

Click on "Manage realms" in the top-left sidebar
Click Create realm
Name it mcp-realm
Click Create

2. Create a Client

Go to Clients → Create client
Client ID: mcp-server-client
Client authentication: Enable (this makes it confidential)
Leave redirect URIs empty (we'll update later if needed)
Click Save

Environment Variables

Update/create your .env file with the Keycloak values:

PORT=3000

OAUTH_ISSUER_URL=http://localhost:8080/realms/mcp-realm
# Leave OAUTH_AUDIENCE empty for Keycloak
OAUTH_AUDIENCE=

3. Create a Test User

Go to Users → Add user
Username: testuser
Set Email verified to ON
Click Create
Go to Credentials tab → Set password
Enter a password and disable "Temporary"

Connecting MCP Server to Keycloak

With Keycloak running and your .env configured, start your MCP server:

npm run dev

You should see output like:

[auth] Validating OAuth configuration for issuer: http://localhost:8080/realms/mcp-realm
[auth] Successfully fetched OIDC discovery document
[auth] Authorization endpoint: http://localhost:8080/realms/mcp-realm/protocol/openid-connect/auth
[auth] Token endpoint: http://localhost:8080/realms/mcp-realm/protocol/openid-connect/token
[auth] JWKS URI: http://localhost:8080/realms/mcp-realm/protocol/openid-connect/certs
[auth] JWKS endpoint is accessible
[auth] OAuth configuration validated successfully
MCP Stateful HTTP Server listening on port 3000
OAuth metadata available at http://localhost:3000/.well-known/oauth-protected-resource
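
The endpoints in this output are what standard OIDC discovery returns: the provider publishes a JSON document at a well-known path under the issuer URL. The sketch below shows how that URL is derived and which fields carry the endpoints. It illustrates the convention and is not the scaffold's actual `auth.ts`; the helper names are made up.

```typescript
// Fields of the OIDC discovery document that matter for token validation.
interface DiscoveryDocument {
  issuer: string;
  authorization_endpoint: string;
  token_endpoint: string;
  jwks_uri: string;
}

// OIDC discovery: append the well-known path to the issuer, tolerating a trailing slash.
function discoveryUrl(issuer: string): string {
  return `${issuer.replace(/\/+$/, "")}/.well-known/openid-configuration`;
}

// The issuer in the document must match the one you configured.
function issuerMatches(configuredIssuer: string, doc: DiscoveryDocument): boolean {
  return doc.issuer.replace(/\/+$/, "") === configuredIssuer.replace(/\/+$/, "");
}

const issuer = "http://localhost:8080/realms/mcp-realm";
console.log(discoveryUrl(issuer));
// http://localhost:8080/realms/mcp-realm/.well-known/openid-configuration
```

Opening that URL in a browser is a quick way to confirm the realm name and issuer before starting the server.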

The auth middleware automatically:

Intercepts incoming requests
Extracts the Bearer token from the Authorization header
Validates the token against Keycloak's JWKS endpoint
Rejects requests with invalid or missing tokens

Your server is now protected. Unauthenticated requests will receive a 401 Unauthorized response.
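
The first two steps of that list can be sketched as a pure function. This is an illustration under assumptions, not the generated middleware: the names are hypothetical, and signature verification against the JWKS endpoint is a separate step that is not shown.

```typescript
type AuthCheck =
  | { ok: true; token: string }
  | { ok: false; status: 401; reason: string };

// Extract the Bearer token from an Authorization header, or explain why the request is rejected.
function checkAuthorizationHeader(header: string | undefined): AuthCheck {
  if (!header) {
    return { ok: false, status: 401, reason: "missing Authorization header" };
  }
  // The scheme name is case-insensitive; the token itself must not contain whitespace.
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  if (!match) {
    return { ok: false, status: 401, reason: "expected a Bearer token" };
  }
  return { ok: true, token: match[1] };
}

console.log(checkAuthorizationHeader(undefined).ok); // false
console.log(checkAuthorizationHeader("Basic dXNlcjpwYXNz").ok); // false
console.log(checkAuthorizationHeader("Bearer abc.def.ghi").ok); // true
```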

Testing Your Secure MCP Server

Let's verify everything works. We'll test with three methods: VS Code integration, Cursor integration, and a terminal client.

Setting Redirect URIs for VS Code and Cursor

In Keycloak, go to Clients → mcp-server-client → Settings
Under Valid Redirect URIs, add the following URIs:

cursor://anysphere.cursor-mcp/oauth/callback
https://vscode.dev/redirect/*
http://127.0.0.1:33418/* # this may vary based on your setup

Click Save

Note: If you get an "Invalid redirect URI" error, double-check that the URIs match exactly, and add any that are missing.

VS Code Integration

VS Code supports MCP servers with OAuth authentication. Create a .vscode/mcp.json file in your project:

{
  "servers": {
    "my-secure-mcp-server": {
      "type": "http",
      "url": "http://localhost:3000/mcp"
    }
  }
}

Above the server name, VS Code shows a "Start" action. Click it, and the following happens:

VS Code will try to connect to your MCP server
If it detects OAuth, it initiates the authorization flow
If DCR is supported, it registers the client dynamically; otherwise, it prompts you to enter the client details
You log in with your Keycloak credentials
VS Code receives the token and connects to your server

Your MCP tools are now available through VS Code's Copilot.

Cursor Integration

Cursor also supports MCP with OAuth. In Cursor, add a new MCP server:

Go to Cursor settings → Tools & MCP → New MCP Server
Enter the following details:

{
  "mcpServers": {
    "oauth-server": {
      "url": "http://localhost:3000/mcp",
      "auth": {
        "CLIENT_ID": "mcp-server-client",
        "CLIENT_SECRET": "<your-client-secret | EMPTY it's optional>",
        "scopes": ["mcp:tools"]
      }
    }
  }
}

After saving, go back to Tools & MCP and click on "Connect" next to your new server.
Cursor will open a browser window for you to log in via Keycloak.
After logging in, Cursor will receive the access token and connect to your MCP server.

Terminal Client Testing

The terminal client uses Dynamic Client Registration (DCR) to connect to your MCP server. This requires additional Keycloak configuration that wasn't needed for VS Code or Cursor (which use predefined clients).

Enabling Dynamic Client Registration in Keycloak

For DCR to work, Keycloak needs to trust the host where your client is running:

In Keycloak, go to Clients → Client registration → Trusted Hosts
Disable the Client URIs Must Match setting
Add your testing host's IP address to the trusted hosts list

To find your host IP:

Linux/macOS: Run ifconfig in your terminal (or ip addr on Linux distributions that no longer ship ifconfig)
Windows: Run ipconfig in Command Prompt

If you're unsure which IP to add, check the Keycloak logs for a line like Failed to verify remote host : 192.168.x.x. That's the IP you need to whitelist.

Creating the mcp:tools Scope

The terminal client requires a custom scope to access MCP tools. Without this, authentication will succeed but tool access will fail.

In Keycloak, go to Client scopes → Create client scope
Name: mcp:tools
Type: Set to Default (so it's automatically included)
Include in token scope: Enable this toggle (required for token validation)
Click Save
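
When authentication succeeds but tool access fails, it helps to look inside the access token and check whether the scope actually made it in. The sketch below decodes a JWT payload for debugging only (it does not verify the signature) and checks the space-delimited `scope` claim. The helper names and the sample token are made up for illustration.

```typescript
import { Buffer } from "node:buffer";

// Debugging aid only: decodes the payload WITHOUT verifying the signature.
function decodeJwtPayload(token: string): Record<string, unknown> {
  const [, payload] = token.split(".");
  if (!payload) {
    throw new Error("not a JWT: expected header.payload.signature");
  }
  return JSON.parse(Buffer.from(payload, "base64url").toString("utf8"));
}

// The scope claim is a single space-delimited string, e.g. "openid profile mcp:tools".
function hasScope(scopeClaim: unknown, required: string): boolean {
  if (typeof scopeClaim !== "string") return false;
  return scopeClaim.split(/\s+/).includes(required);
}

// Build a fake, unsigned token so the example runs without a Keycloak instance.
const encodePart = (value: object): string =>
  Buffer.from(JSON.stringify(value)).toString("base64url");
const sampleToken = [
  encodePart({ alg: "none" }),
  encodePart({ sub: "testuser", scope: "openid profile mcp:tools" }),
  "signature",
].join(".");

const claims = decodeJwtPayload(sampleToken);
console.log(hasScope(claims.scope, "mcp:tools")); // true
console.log(hasScope(claims.scope, "mcp:admin")); // false
```

If the check returns false for a real token, revisit the client scope settings above.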

Running the Terminal Client

Now you can test with the mcp-oauth-client:

git clone https://github.com/IBJunior/mcp-oauth-client
cd mcp-oauth-client
npm install
npm run build

Configure the client with your server and Keycloak details, then run:

npm run dev

The client will:

Use DCR to register itself with Keycloak automatically
Perform the OAuth flow
Obtain an access token (if the browser doesn't open automatically, copy-paste the URL from the console)
Connect to your MCP server
List available tools
Allow you to invoke tools interactively

This is useful for debugging and verifying your setup works end-to-end.

Adding Custom Tools

The scaffolded server comes with example tools. Let's add a custom one to see how easy it is.

Open src/server.ts and add a new tool:

server.registerTool(
  'greet',
  {
    description: 'Greets a user in their preferred language.',
    inputSchema: {
      name: z.string().describe("The user's name"),
      language: z.enum(['en', 'es', 'fr']).optional().describe('Greeting language'),
    },
  },
  async ({ name, language = 'en' }) => {
    const greetings = {
      en: `Hello, ${name}! Welcome to the secure MCP server.`,
      es: `¡Hola, ${name}! Bienvenido al servidor MCP seguro.`,
      fr: `Bonjour, ${name}! Bienvenue sur le serveur MCP sécurisé.`,
    }

    return {
      content: [
        {
          type: 'text',
          text: greetings[language],
        },
      ],
    }
  }
)

That's it. Restart your server and the new tool is available, automatically protected by OAuth. No additional authentication code required per tool. The middleware handles everything at the request level.

Using MCP Inspector

MCP Inspector is a debugging UI for MCP servers. It lets you explore available tools, test invocations, and inspect responses.

The scaffolded project includes an inspect script:

npm run inspect

Important caveat: For authenticated servers, you'll need to set up the OAuth flow manually in MCP Inspector.

Recap

Let's summarize what we've covered:

What	Why
MCP Authorization	Industry standard for securing MCP servers
OAuth 2.1 + PKCE	Modern, secure token-based authentication
Dynamic Client Registration	Allows any MCP client to connect without manual setup
Keycloak	Open-source, OIDC-compliant, supports DCR
create-mcp-server	Fast scaffolding with authentication built-in
OIDC abstraction	Swap providers via environment variables

The key insight: security doesn't have to be complicated. With the right tools and a clear understanding of the concepts, you can go from zero to a production-ready, authenticated MCP server in minutes.

Conclusion

You've built a production-ready MCP server with OAuth authentication. Your tools are protected, tokens are validated, and you're following the MCP Authorization specification.

But here's the best part: Keycloak is just one option.

Because create-mcp-server uses a provider-agnostic OIDC implementation, you can swap Keycloak for any compliant provider. Just update your environment variables and you're done. That's the power of standards-based authentication.

A note on production deployments: The Keycloak configuration in this guide is designed for demonstration purposes. A production setup requires additional hardening: HTTPS everywhere, stricter redirect URI validation, token lifetime tuning, proper realm and client policies, and more. For production-grade configuration, refer to the official Keycloak documentation.

Next Steps

Star create-mcp-server on GitHub
Contribute to create-mcp-server
Explore the mcp-oauth-client for testing

Enjoying content like this? Sign up for Agent Briefings, where I share insights and news on building and scaling MCP Servers and AI agents.

create-mcp-server v0.6.0 is out, now with stdio transport support

Ali Ibrahim — Thu, 26 Feb 2026 20:22:59 +0000

Until now, the CLI only scaffolded Streamable HTTP servers (remote, cloud-deployable). But a lot of people build local MCP servers for Claude Desktop and other local clients — and stdio is the right transport for that.

So I added it.

npx @agentailor/create-mcp-server --name=my-server --stdio

That's it. No HTTP server, no port, no Dockerfile. Just a clean stdio server ready to connect to any local MCP client.

You can also combine it with FastMCP:

npx @agentailor/create-mcp-server --name=my-server --stdio --framework=fastmcp

What the CLI now supports:
→ HTTP (Streamable) or stdio transport
→ Official MCP SDK or FastMCP
→ Stateless or stateful server modes
→ Optional OAuth (SDK + HTTP)
→ npm, pnpm, or yarn

📚 Learning Resources

If you're getting started with MCP servers, here are the guides I've written:

→ Build your first MCP server in 5 minutes
https://blog.agentailor.com/posts/create-your-first-mcp-server-in-5-minutes

→ Secure your MCP server with OAuth (Keycloak)
https://blog.agentailor.com/posts/oauth-for-mcp-servers-practical-guide-keycloak

→ Getting started with FastMCP
https://blog.agentailor.com/posts/getting-started-with-fastmcp

→ OAuth for MCP Clients (Next.js + LangGraph.js)
https://blog.agentailor.com/posts/mcp-client-oauth-nextjs-langgraph

Star the repo if this is useful 🙏
https://github.com/agentailor/create-mcp-server

Deploy Your MCP Server to Google Cloud Run (For Free)

Ali Ibrahim — Mon, 23 Feb 2026 18:47:58 +0000

Introduction

You've built an MCP server. It works on localhost. Your AI assistant can call tools, fetch data, and do useful things, as long as everything runs on your machine.

But what happens when you want to share it with your team? Or connect to it from a different device? Or just keep it running without your laptop open?

You need to deploy it.

This guide walks you through deploying a Streamable HTTP MCP server to Google Cloud Run, from scaffolding to a live URL in minutes. Streamable HTTP is the transport designed for remote deployments: it works over standard HTTPS, plays nicely with load balancers, and doesn't require the client to run your server as a subprocess.

Note: If you're working with a stdio MCP server, the deployment path is different. We'll cover that briefly at the end of this article.

If you've already built an MCP server using our first MCP server guide, you can deploy that project directly. Otherwise, we'll scaffold a fresh one below.

What you'll learn:

How to scaffold a deployment-ready MCP server
How to set up the Google Cloud CLI
How to deploy to Cloud Run with a single command
How to test your live server with MCP Inspector

Prerequisites:

Node.js 20 or later
A Google Cloud account (free to create)
Basic terminal familiarity

Why Google Cloud Run?

Cloud Run can deploy from a container image, a Dockerfile, or even raw source code (using buildpacks). The create-mcp-server scaffold includes a Dockerfile, so that's the path we'll use. If you're bringing your own server, any of these options work.

Here's why Cloud Run is a great fit for MCP servers:

Generous free tier — 2 million requests/month, 180,000 vCPU-seconds, and 360,000 GiB-seconds of memory. More than enough for development, testing, and light production use.
Build from source — No need to install Docker locally. Cloud Run uses Cloud Build to build your Dockerfile remotely.
HTTPS by default — Every deployed service gets an HTTPS URL automatically. MCP clients expect HTTPS for remote servers.
Scale to zero — When no one is calling your server, it scales down to zero instances. You pay nothing when idle.
No code changes — Streamable HTTP servers work as-is on Cloud Run. The POST /mcp endpoint maps directly to Cloud Run's request-based model.

In short: you get a free, secure, production-ready deployment with almost zero configuration.

Scaffold Your MCP Server

If you already have an MCP server project with a Dockerfile, skip ahead to Set Up the gcloud CLI.

Otherwise, let's scaffold a fresh one. Run:

# For more options, see https://github.com/agentailor/create-mcp-server
npx @agentailor/create-mcp-server --name=my-mcp-server

This creates a stateless MCP server using the Official TypeScript SDK. No interactive prompts, no choices needed. One command, one project.

We're using a stateless server here for simplicity, but you can deploy a stateful server to Cloud Run the same way. The deployment steps are identical.

The generated project structure:

my-mcp-server/
├── src/
│   ├── server.ts     # MCP server (tools, prompts, resources)
│   └── index.ts      # Express app and HTTP transport
├── Dockerfile        # Production-ready Docker build
├── package.json
├── tsconfig.json
├── .env.example
├── .gitignore
└── README.md

Install dependencies and verify it works locally:

cd my-mcp-server
npm install
npm run dev

You should see:

MCP Stateless HTTP Server listening on port 3000

Your server is running at http://localhost:3000/mcp. Stop it with Ctrl+C. We're ready to deploy.

Want to understand the scaffolded code in detail? See our Create Your First MCP Server guide.

The Dockerfile

The scaffold includes a production-ready Dockerfile. Here's what it does:

# Multi-stage build for production
FROM node:20-alpine AS builder

WORKDIR /app

# Copy package files
COPY package.json package-lock.json ./

# Install all dependencies (including dev)
RUN npm ci

# Copy source code
COPY . .

# Build the application
RUN npm run build

# Production stage
FROM node:20-alpine AS production

WORKDIR /app

# Copy package files
COPY package.json package-lock.json ./

# Install production dependencies only
RUN npm ci --omit=dev

# Copy built application from builder stage
COPY --from=builder /app/dist ./dist

# Expose the port the app runs on
EXPOSE 3000

# Start the application
CMD ["node", "dist/index.js"]

It's a multi-stage build: the first stage compiles TypeScript, the second stage copies only the compiled output and dependencies. This keeps the final image small.

The important part: it exposes port 3000, which matches the --port 3000 flag we'll use when deploying. Cloud Build will use this Dockerfile automatically. You don't need Docker installed on your machine.
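
Cloud Run tells the container which port to listen on through the `PORT` environment variable, set to the value passed with `--port`. A server that reads `PORT` and falls back to 3000 works both locally and when deployed. The sketch below assumes that pattern; the helper name is hypothetical, and the scaffold's actual code may differ.

```typescript
// Resolve the listening port from the environment, falling back to the local default.
function resolvePort(env: Record<string, string | undefined>, fallback = 3000): number {
  const raw = env.PORT;
  if (raw === undefined || raw === "") {
    return fallback;
  }
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT value: ${raw}`);
  }
  return port;
}

// In the server you would call resolvePort(process.env) before app.listen(...).
console.log(resolvePort({})); // 3000
console.log(resolvePort({ PORT: "8080" })); // 8080
```

If the port the app listens on and the `--port` flag disagree, the deployment fails its startup check, so keeping both tied to one value avoids a common first-deploy error.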

Set Up the gcloud CLI

Install gcloud CLI

Windows (via winget):

winget install Google.CloudSDK

Or download the installer.

macOS (via Homebrew):

brew install --cask google-cloud-sdk

Linux:

curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
tar -xf google-cloud-cli-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh

After installing, restart your terminal and authenticate:

gcloud init
gcloud auth login

Enable Billing

Cloud Run requires a billing account, even for the free tier. You won't be charged if you stay within the free limits.

Option 1: Via the console

Go to Google Cloud Billing
Click Link a billing account (or Create account if you don't have one)
Add a payment method
Select your project and link it to the billing account

Option 2: Via CLI (if you already have a billing account):

gcloud billing accounts list
gcloud billing projects link YOUR_PROJECT_ID --billing-account=YOUR_BILLING_ACCOUNT_ID

Deploy to Cloud Run

Create a New Project

We recommend creating a dedicated project for this tutorial. This keeps your demo isolated from existing resources, so cleanup commands won't accidentally affect your other projects or container images.

gcloud projects create my-mcp-project --name="My MCP Server"
gcloud config set project my-mcp-project

Project IDs must be globally unique. If my-mcp-project is taken, choose something else.

Enable Required APIs

gcloud services enable run.googleapis.com artifactregistry.googleapis.com cloudbuild.googleapis.com

This enables three services:

Cloud Run: hosts your container
Artifact Registry: stores your container image
Cloud Build: builds your Dockerfile remotely

Deploy from Source

Make sure you're in your project directory (my-mcp-server/), then run:

gcloud run deploy my-mcp-server \
  --source . \
  --region us-central1 \
  --allow-unauthenticated \
  --port 3000

Here's what each flag does:

--source .: sends your source code to Cloud Build, which builds the container using your Dockerfile
--region us-central1: deploys to a US region (or choose another region closer to you)
--allow-unauthenticated: makes your /mcp endpoint publicly accessible
--port 3000: tells Cloud Run which port your container listens on

The first deploy takes a couple of minutes as Cloud Build pulls the base image and builds your container. Subsequent deploys are faster thanks to layer caching.

Get Your Service URL

gcloud run services describe my-mcp-server --region us-central1 --format='value(status.url)'

This outputs something like:

https://my-mcp-server-abc123xyz.us-central1.run.app

Your MCP endpoint is now live at https://<service-url>/mcp.

Troubleshooting: Invalid Host Error

If you get an error like "Invalid Host: my-mcp-server-abc123xyz.us-central1.run.app", it means the server is validating the Host header and rejecting the Cloud Run domain.

Fix it by setting the ALLOWED_HOSTS environment variable:

gcloud run services update my-mcp-server \
  --region us-central1 \
  --set-env-vars ALLOWED_HOSTS=my-mcp-server-abc123xyz.us-central1.run.app

Replace the domain with your actual service URL (without https://).
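
For context, the kind of Host-header check that produces this error can be sketched as follows. This is a minimal illustration, not the scaffold's actual code: the helper names (parseAllowedHosts, isHostAllowed) and the comma-separated format of ALLOWED_HOSTS are assumptions.

```typescript
// Hypothetical helpers; the scaffold's real implementation may differ.
// Assumes ALLOWED_HOSTS holds a comma-separated list of hostnames.
function parseAllowedHosts(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(",")
    .map((host) => host.trim().toLowerCase())
    .filter((host) => host.length > 0)
}

function isHostAllowed(hostHeader: string | undefined, allowed: string[]): boolean {
  if (!hostHeader) return false
  // Strip an optional port ("localhost:3000" becomes "localhost")
  const hostname = hostHeader.toLowerCase().replace(/:\d+$/, "")
  return allowed.includes(hostname)
}

const allowed = parseAllowedHosts("localhost, my-mcp-server-abc123xyz.us-central1.run.app")
console.log(isHostAllowed("my-mcp-server-abc123xyz.us-central1.run.app", allowed)) // true
console.log(isHostAllowed("evil.example.com", allowed)) // false
```

A check like this is why the Cloud Run domain has to be listed explicitly: the request arrives with the run.app hostname in its Host header, not localhost.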

Test Your Deployed Server

Let's verify everything works. Open a terminal and run:

npx @modelcontextprotocol/inspector

Once the inspector opens in your browser:

Change the transport type to Streamable HTTP
Enter your Cloud Run URL: https://<service-url>/mcp
Click Connect

You should see your server's tools listed. Click on any tool, fill in the parameters, and run it.

That's it. Your MCP server is live on the internet, accessible from any MCP-compatible client, anywhere.

Cleanup

If you want to remove the deployed resources:

# Delete the Cloud Run service
gcloud run services delete my-mcp-server --region us-central1

# Delete the container image from Artifact Registry
gcloud artifacts docker images list us-central1-docker.pkg.dev/my-mcp-project --format='value(IMAGE)' | xargs -I {} gcloud artifacts docker images delete {} --quiet

# Optionally delete the entire project (removes everything)
gcloud projects delete my-mcp-project

Note: Cloud Build stores your container image in Artifact Registry, which charges for storage after the first 500MB free. Deleting the project removes everything, but if you're keeping the project, clean up the images separately.

If you're staying within the free tier, there's no urgency to clean up. But it's good practice to remove resources you no longer need.

What's Next?

Now that your server is deployed, here are some ideas:

Add authentication: secure your deployed server with OAuth using our OAuth for MCP Servers guide.

Build real tools: replace the placeholder tools with your own by following our first MCP server guide.

Try FastMCP: build with a different framework using our FastMCP guide.

Connect your IDE: point your VS Code or Cursor MCP configuration at your Cloud Run URL instead of localhost.

Set up a custom domain: configure a custom domain in Cloud Run for a cleaner URL.

What About Stdio Servers?

Everything in this guide applies to Streamable HTTP servers, the transport designed for cloud deployment. But not all MCP servers use HTTP.

Stdio servers communicate via stdin/stdout and run as local subprocesses. They can't be deployed as web services. Instead, they're distributed so clients can run them locally: via npm (npx your-server), PyPI (uvx your-server), or Docker Hub (pull and run via Docker).

We'll cover stdio distribution in detail in a future article. Stay tuned.

Conclusion

You just went from a scaffolded project to a live, publicly accessible MCP server, in minutes, for free. No Docker installed locally, no infrastructure to manage, just a single gcloud run deploy --source . command.

With Cloud Run, your MCP server is always available, scales automatically, and costs nothing while idle. That's a pretty good deal for getting your AI tools off localhost and into the real world.

If you found create-mcp-server useful, consider giving it a star on GitHub and sharing it with others who are building MCP servers. It helps the project grow and helps more developers get started quickly.

Resources

create-mcp-server (GitHub): the CLI tool used in this guide
Google Cloud Run Documentation: official Cloud Run docs
Google Cloud Free Tier: free tier details and limits
MCP Inspector (GitHub): testing and debugging tool for MCP servers
MCP Documentation: official protocol specification
gcloud CLI Reference: CLI documentation

Lessons from OpenClaw's Architecture for Agent Builders

Ali Ibrahim — Thu, 19 Feb 2026 10:51:26 +0000

Introduction

OpenClaw has over 200,000 GitHub stars. It is one of the fastest-growing open-source projects in history. Lex Fridman discussed it on his podcast. Andrej Karpathy called one of its side projects "the most incredible sci-fi takeoff-adjacent thing" he has seen recently.

Most of the coverage has been surface-level: "look, an AI that controls your computer." But if you are building agents, the interesting question is not what OpenClaw does — it is how the architecture enables it.

OpenClaw solves a problem most agent frameworks ignore: running a persistent, multi-channel AI agent on your own hardware that does not break under real-world usage. The engineering decisions behind that are worth studying.

What you'll learn:

The 4-layer gateway architecture and why a single process matters
The Lane Queue system, the core reliability pattern most agents lack
Skills-as-markdown: why prompt engineering beat code for extensibility
Human-readable memory you can open in a text editor
Security lessons from CVEs and supply chain attacks
10 concrete patterns to adopt in your own agent systems

Why OpenClaw Matters for Builders

OpenClaw was created by Peter Steinberger, an Austrian software engineer known for building developer tools in the iOS/macOS ecosystem. The project started in November 2025 as a personal WhatsApp relay script called Clawdbot. After a trademark dispute with Anthropic, it became Moltbot, then settled on OpenClaw three days later.

Three factors drove the viral growth:

Local-first in the age of cloud lock-in. Your data, your hardware, no vendor dependency.
It actually works across platforms. WhatsApp, Telegram, Discord, iMessage, Slack, Signal, and more from a single agent.
Self-modifying skills. The agent can write and deploy its own new capabilities mid-conversation.

But the insight that matters most for builders: OpenClaw is not a framework. It is a gateway — a single runtime that sits between your AI model and the outside world. That architectural choice shaped every other decision in the project.

Let's walk through the architecture layer by layer.

The 4-Layer Architecture

OpenClaw's architecture breaks down into four distinct layers, each with a clear responsibility:

OpenClaw's 4-layer architecture: Gateway, Integration, Execution, and Intelligence.

Layer	Responsibility	Key Pattern
Gateway	Connection management, routing, auth	Single-process multiplexing
Execution	Task ordering, concurrency control	Per-session serial queues (Lane Queue)
Integration	Platform normalization	Channel adapters
Intelligence	Agent behavior, knowledge, proactivity	Skills + Memory + Heartbeat

What Runs the Agent Itself?

A notable architectural decision: OpenClaw does not implement its own agent runtime. The core agent loop — tool calling, context management, LLM interaction — is handled by the Pi agent framework (@mariozechner/pi-agent-core, pi-ai, pi-coding-agent). OpenClaw builds the gateway, orchestration, and integration layers on top of Pi.

This separation is telling. It reinforces the project's core thesis: the hard problem in personal AI agents is not the agent loop itself, but everything around it. Channel normalization, session management, memory persistence, skill extensibility, and security are where the complexity lives. Pi handles the "think and act" cycle. OpenClaw handles the "connect, queue, remember, and extend" layers.

OpenClaw also implements the Agent Client Protocol (ACP) via @agentclientprotocol/sdk, a standardized protocol for agent-to-editor communication. This bridges the Gateway to tools like code editors, mapping ACP sessions to Gateway session keys and translating between protocol-native commands (prompt → chat.send, cancel → chat.abort).

The Gateway Layer

Everything routes through a single Node.js process — the Gateway. It runs locally on port 18789 and handles WebSocket control messages, HTTP APIs (OpenAI-compatible), and a browser-based Control UI from a single multiplexed port.

This is a deliberate trade-off. A single process means no inter-process communication overhead, simple deployment (one npm i -g openclaw command), and straightforward debugging. It also means no horizontal scaling, but OpenClaw targets personal and small-team use, where operational simplicity matters more than throughput.

The Gateway enforces authentication by default. Non-loopback binding without a token is refused. The WebSocket protocol follows a strict handshake: the client sends a connect frame, and the Gateway responds with a hello-ok snapshot containing presence, health, state, uptime, and rate limits.

Note: This single-process design is a conscious trade-off. If you need horizontal scaling across many users, you need a different architecture. OpenClaw optimizes for the "personal AI assistant" use case where one Gateway serves one person (or a small team).

The Intelligence Layer

The top layer is where agent behavior lives: skills, memory, the heartbeat daemon, and multi-agent routing. We will cover each of these in dedicated sections below.

The Lane Queue: OpenClaw's Core Innovation

If you take one pattern from this article, make it this one.

The Problem

Most agent systems let multiple requests execute concurrently against the same session state. A user sends three messages in quick succession. The agent starts processing all three in parallel. Now you have three tool calls potentially writing to the same file, three API requests with contradictory assumptions, and an interleaved log that is impossible to debug.

Race conditions in agent systems are not edge cases — they are the default failure mode when you accept concurrent input without explicit ordering.

The Solution: Default Serial, Explicit Parallel

OpenClaw's Lane Queue enforces a simple rule: every session gets its own queue, and tasks within a queue execute one at a time.

With the Lane Queue, messages execute serially per session, eliminating race conditions by design.

Here is the conceptual model:

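type SessionKey = `${string}:${string}:${string}` // workspace:channel:userId

// Minimal task shape assumed by this conceptual model
interface Task {
  execute(): Promise<void>
}

class LaneQueue {
  private queues = new Map<SessionKey, Task[]>()

  async enqueue(sessionKey: SessionKey, task: Task) {
    const queue = this.queues.get(sessionKey) ?? []
    this.queues.set(sessionKey, queue)
    queue.push(task)

    if (queue.length === 1) {
      // No other task running, start draining this session's queue
      await this.process(sessionKey)
    }
    // Otherwise, this task waits its turn
  }

  private async process(sessionKey: SessionKey) {
    const queue = this.queues.get(sessionKey)!
    while (queue.length > 0) {
      const task = queue[0]
      try {
        await task.execute() // Serial: wait for completion
      } catch (err) {
        // A failed task must not wedge the lane: log it and move on
        console.error(`Task failed in session ${sessionKey}`, err)
      } finally {
        queue.shift()
      }
    }
    // Drop the empty lane so the map does not grow without bound
    this.queues.delete(sessionKey)
  }
}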

The key decisions:

Session keys are structured. workspace:channel:userId, not just a user ID. This prevents cross-context data leaks between the same user in different channels.
Parallelism is opt-in. Additional lanes (e.g., cron, subagent) allow background jobs to run without blocking the main session queue. But the default is serial.
Backpressure is built in. If the agent is overwhelmed, the queue grows. You can implement timeout or overflow strategies at the queue level, not scattered across individual handlers.
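
To make the opt-in lanes and queue-level backpressure concrete, here is a minimal sketch. It is not OpenClaw's implementation: the BoundedLanes class, the maxPending limit, and the reject-on-overflow policy are all assumptions chosen for illustration.

```typescript
// Hypothetical sketch: serial execution per lane, with a cap on queued jobs.
type Job<T> = () => Promise<T>

class BoundedLanes {
  private tails = new Map<string, Promise<unknown>>()
  private depth = new Map<string, number>()
  private maxPending: number

  constructor(maxPending: number) {
    this.maxPending = maxPending
  }

  // Runs `job` after every earlier job in the same lane has settled.
  // Rejects immediately when the lane already holds `maxPending` jobs.
  enqueue<T>(lane: string, job: Job<T>): Promise<T> {
    const current = this.depth.get(lane) ?? 0
    if (current >= this.maxPending) {
      return Promise.reject(new Error(`lane "${lane}" is full`))
    }
    this.depth.set(lane, current + 1)

    const tail = this.tails.get(lane) ?? Promise.resolve()
    const run = tail.then(job)
    const release = () => {
      this.depth.set(lane, (this.depth.get(lane) ?? 1) - 1)
    }
    // The stored tail never rejects, so one failed job cannot wedge the lane.
    this.tails.set(lane, run.then(release, release))
    return run
  }
}

const lanes = new BoundedLanes(2)
const order: string[] = []
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

Promise.all([
  // Same lane: "a" finishes before "b" starts, even though "a" is slower.
  lanes.enqueue("home:whatsapp:alice", async () => { await sleep(20); order.push("a") }),
  lanes.enqueue("home:whatsapp:alice", async () => { order.push("b") }),
  // A separate lane (for example, cron) is not blocked by the session lane.
  lanes.enqueue("cron", async () => { order.push("cron") }),
]).then(() => console.log(order.join(" -> "))) // prints "cron -> a -> b"
```

Rejecting on overflow is only one possible policy. Dropping the oldest job or applying a per-job timeout would slot into the same enqueue method, which is the point of handling backpressure at the queue rather than in individual handlers.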

Why This Matters for Your Agents

Even if you are not building a multi-channel gateway, the per-session serial queue pattern prevents an entire class of bugs. If your agent can receive concurrent input — webhooks, streaming UI, multiple users — you need something like this.

The Lane Queue also makes debugging straightforward. Every action for a given session happened in order. There is no "which thread was this?" question.

Channel Abstraction: One Agent, Many Platforms

OpenClaw supports over a dozen messaging platforms. Core channels are implemented in src/ (WhatsApp via Baileys, Telegram via grammY, Discord via @buape/carbon, Slack via @slack/bolt, iMessage, Signal), while extension channels live in the extensions/ directory as standalone packages (Matrix via @vector-im/matrix-bot-sdk, Google Chat, Microsoft Teams, LINE, Feishu/Lark, and more).

Each of these platforms has a wildly different message format, media handling, authentication model, and rate limiting strategy. OpenClaw normalizes all of this through channel adapter interfaces defined in its plugin-sdk (including ChannelMessagingAdapter, ChannelGatewayAdapter, and ChannelAuthAdapter).

Conceptually, the pattern looks like this:

// Simplified illustration of the adapter pattern (not actual OpenClaw code)
interface ChannelAdapter {
  name: string
  connect(): Promise<void>
  send(sessionKey: string, message: UnifiedMessage): Promise<void>
  onMessage(handler: (sessionKey: string, msg: UnifiedMessage) => void): void
}

interface UnifiedMessage {
  text?: string
  media?: MediaAttachment[]
  replyTo?: string
  metadata: Record<string, unknown>
}

Key design decisions:

Adapters are stateless. Connection state lives in the Gateway, not in individual adapters. This means you can restart an adapter without losing session context.
Media is normalized. Images, audio, and documents all get the same treatment regardless of source platform. The agent does not need to know whether a photo came from WhatsApp or Telegram.
Platform-specific features use a metadata bag. Reactions, threads, typing indicators, and read receipts flow through metadata. The core agent logic never touches platform-specific fields.
Fault isolation. Each adapter starts independently. If the WhatsApp connection fails, Telegram keeps running. One failing channel does not take down the Gateway.

Takeaway for builders: If your agent integrates with even two platforms, build a normalization layer early. The unified message format is the contract between your integration layer and your intelligence layer. Without it, platform-specific logic leaks into your agent's core and becomes impossible to untangle later.
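
A normalization layer of this kind can be sketched as below. The two source payloads (PlatformAEvent, PlatformBEvent) are invented shapes for imaginary platforms, and MediaAttachment is a guess at a minimal attachment type; none of this is taken from OpenClaw's plugin-sdk.

```typescript
// Simplified, hypothetical shapes; real platform payloads are much richer.
interface MediaAttachment { kind: "image" | "audio" | "document"; url: string }

interface UnifiedMessage {
  text?: string
  media?: MediaAttachment[]
  replyTo?: string
  metadata: Record<string, unknown>
}

// Invented example payloads for two imaginary platforms
interface PlatformAEvent { body: string; photoUrl?: string; quotedId?: string; reaction?: string }
interface PlatformBEvent { content: { text: string }; threadTs?: string }

function fromPlatformA(event: PlatformAEvent): UnifiedMessage {
  return {
    text: event.body,
    media: event.photoUrl ? [{ kind: "image", url: event.photoUrl }] : undefined,
    replyTo: event.quotedId,
    // Platform-only features ride along in metadata, never as top-level fields
    metadata: event.reaction ? { reaction: event.reaction } : {},
  }
}

function fromPlatformB(event: PlatformBEvent): UnifiedMessage {
  return {
    text: event.content.text,
    metadata: event.threadTs ? { thread: event.threadTs } : {},
  }
}

// Agent logic sees only UnifiedMessage, whatever the source platform
function describe(msg: UnifiedMessage): string {
  const count = msg.media?.length ?? 0
  return `${msg.text ?? ""} (${count} attachment${count === 1 ? "" : "s"})`
}

console.log(describe(fromPlatformA({ body: "lunch receipt", photoUrl: "https://example.com/r.jpg" })))
// prints "lunch receipt (1 attachment)"
console.log(describe(fromPlatformB({ content: { text: "thanks" } })))
// prints "thanks (0 attachments)"
```

The describe function never branches on the platform, which is the property the unified format is meant to guarantee.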

Skills: Prompt Engineering as the Extension Mechanism

OpenClaw's capabilities are modular plugins called skills, but they are not what you might expect. Skills are not TypeScript modules or Python packages. They are folders containing a SKILL.md file, a markdown document with YAML frontmatter.

This is the same format we covered in our previous article on building agent skills. OpenClaw was one of the first large projects to adopt this pattern at scale.

The skills lifecycle: discovery, activation, execution, and self-authoring.

How Skills Work

On startup, the agent reads skill names and descriptions, roughly 97 characters per skill. This is the progressive disclosure pattern: keep initial context lean.
When a user request matches a skill's description, the full skill content is injected into the agent's context as markdown.
Skills can reference local files (scripts, templates, reference data).
Skills are hot-reloadable. Edit the file, and the agent picks it up on the next turn (configurable debounce of 250ms).
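
The progressive disclosure step can be illustrated with a small sketch: read only the frontmatter at startup, and return the full body on activation. The function names are hypothetical, and the frontmatter reader below handles only single-line key: value pairs; a real loader would use a proper YAML parser.

```typescript
interface SkillSummary { name: string; description: string }

// Minimal frontmatter reader: single-line `key: value` pairs between `---` lines.
// This is a simplification, not a YAML parser.
function readFrontmatter(skillMd: string): { meta: Record<string, string>; body: string } {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(skillMd)
  if (!match) return { meta: {}, body: skillMd }
  const meta: Record<string, string> = {}
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(":")
    if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim()
  }
  return { meta, body: match[2] }
}

// Startup: keep only name and description in the base context
function summarize(skillMd: string): SkillSummary {
  const { meta } = readFrontmatter(skillMd)
  return { name: meta["name"] ?? "", description: meta["description"] ?? "" }
}

// Activation: load the full body only for the skill that was selected
function activate(skillMd: string): string {
  return readFrontmatter(skillMd).body.trim()
}

const skill = [
  "---",
  "name: expense-report",
  "description: Generate monthly expense reports with charts",
  "---",
  "# Steps",
  "1. Fetch expenses",
].join("\n")

console.log(summarize(skill).name) // prints "expense-report"
console.log(activate(skill).split("\n")[0]) // prints "# Steps"
```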

The Self-Writing Agent

The feature that captured the most attention: the agent can create and edit its own SKILL.md files. It observes patterns in how the user asks for things, identifies repetitive workflows, and writes a skill to handle them better next time.

The agent stores self-authored skills in the per-agent workspace (<workspace>/skills/) or the shared ~/.openclaw/skills/ directory. These skills persist across sessions and survive restarts.

ClawHub Marketplace

OpenClaw has a community skill registry called ClawHub, where users share and discover skills via CLI commands. The agent can even auto-search for and install skills at runtime based on user intent.

This extensibility is powerful, and also where the security story gets interesting (more on that later).

Why Markdown Over Code?

Approach	Skills (Markdown)	Plugins (Code)
Extension language	Natural language + YAML	TypeScript / Python
Who can author	Anyone (including the agent)	Developers only
Security surface	Low (injected as context)	High (arbitrary code execution)
Hot reload	Trivial (re-read the file)	Requires restart or dynamic import
Debuggability	Read the file, read the prompt	Stack traces, runtime errors

The markdown-first approach has a key advantage: the barrier to creating skills is effectively zero. Users who cannot write code can still teach their agent new workflows. And the agent itself can participate in the skill ecosystem.

Note: For a hands-on guide to building skills in this format, see our How to Build and Deploy an Agent Skill from Scratch.

Memory Architecture: Files You Can Read

Most agent memory lives in a vector database that humans cannot inspect, edit, or debug. When the agent remembers something wrong, you have no practical way to fix it.

OpenClaw takes a different approach. Memory is stored as flat files:

Markdown files for long-form notes and context
YAML files for structured data (user preferences, configurations)
JSONL files for conversation history (one line per message, append-only)

Everything lives under ~/.openclaw/ in a directory structure you can browse in your file manager.
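
As an illustration of the append-only JSONL format, here is a minimal sketch. The HistoryEntry fields (role, text, ts) are assumptions for the example, not OpenClaw's actual schema.

```typescript
// Hypothetical entry shape for one line of conversation history
interface HistoryEntry { role: "user" | "assistant"; text: string; ts: string }

// One JSON object per line. JSON.stringify escapes embedded newlines,
// so each entry always occupies exactly one line.
function toJsonlLine(entry: HistoryEntry): string {
  return JSON.stringify(entry) + "\n"
}

// Tolerates a trailing newline and skips blank lines
function parseJsonl(content: string): HistoryEntry[] {
  return content
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as HistoryEntry)
}

let log = ""
log += toJsonlLine({ role: "user", text: "What did I spend on coffee?", ts: "2026-02-19T10:00:00Z" })
log += toJsonlLine({ role: "assistant", text: "About $42 this month.", ts: "2026-02-19T10:00:02Z" })

console.log(parseJsonl(log).length) // prints 2
```

Because new entries are only ever appended, earlier lines are never rewritten, which is also what makes the file friendly to diffing and version control.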

Hybrid Search

Retrieval uses two complementary search strategies, both running locally in SQLite:

Vector similarity search via sqlite-vec, which finds semantically related content even when the wording differs
Keyword search via FTS5 for precise matches on exact technical terms, names, and identifiers

Hybrid search typically outperforms either strategy alone. Vector search introduces semantic noise on precise queries, while keyword search misses paraphrased content. Combining them covers both cases.
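
The article does not say how OpenClaw merges the two result lists. One common technique is reciprocal rank fusion (RRF), sketched below as an assumption rather than a description of OpenClaw's code; the document IDs are made up, and k = 60 is a conventional default.

```typescript
// Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank),
// where rank starts at 1 for the top result of each list.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>()
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1))
    })
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id)
}

const vectorHits = ["note-7", "note-2", "note-9"] // semantic matches
const keywordHits = ["note-2", "note-4", "note-7"] // exact-term matches

console.log(reciprocalRankFusion([vectorHits, keywordHits]))
// prints [ 'note-2', 'note-7', 'note-4', 'note-9' ]
```

Documents that appear in both lists (note-2 and note-7) rise to the top, while results found by only one strategy are kept but ranked lower.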

Smart Sync

When the agent writes to a memory file, a file monitor automatically triggers an index update for both vector embeddings and the full-text index. New "experiences" are immediately available for the next prompt. No manual reindexing.

Why Flat Files Matter

You can git diff your agent's memory. Version control for agent state.
You can edit memory in VS Code. Wrong fact? Fix it directly.
You can back up memory with standard file system tools. No database export needed.
You can review what your agent "knows" without building a custom admin UI.

Semantic Snapshots for Browser Automation

When OpenClaw automates a browser, it does not rely on screenshots. Instead, it parses the accessibility tree, a structured text representation of the page content. This "Semantic Snapshot" approach is cheaper in tokens, faster to process, and more accurate for LLM reasoning than pixel data.

Accessibility trees give the model structured information about buttons, links, form fields, and content hierarchy. A screenshot gives it pixels. For most agent tasks, the structured data wins.
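
To show what a structured text representation might look like, here is a minimal sketch that flattens a tree into indented lines. The AxNode shape and the output format are assumptions for illustration; real accessibility trees carry many more properties, and OpenClaw's snapshot format is not shown in this article.

```typescript
// Hypothetical node shape, reduced to role, name, and children
interface AxNode {
  role: string
  name?: string
  children?: AxNode[]
}

// Renders the tree as indented lines of the form: role "name"
function snapshot(node: AxNode, depth = 0): string[] {
  const label = node.name ? `${node.role} "${node.name}"` : node.role
  const lines = [`${"  ".repeat(depth)}${label}`]
  for (const child of node.children ?? []) {
    lines.push(...snapshot(child, depth + 1))
  }
  return lines
}

const page: AxNode = {
  role: "main",
  children: [
    { role: "heading", name: "Sign in" },
    {
      role: "form",
      children: [
        { role: "textbox", name: "Email" },
        { role: "button", name: "Continue" },
      ],
    },
  ],
}

console.log(snapshot(page).join("\n"))
// main
//   heading "Sign in"
//   form
//     textbox "Email"
//     button "Continue"
```

Five short lines describe everything the model needs to act on this page, where a screenshot of the same page would cost far more tokens to encode.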

The Heartbeat: Proactive Agents

Most agents sit idle until a user sends a message. OpenClaw has a different model.

The Gateway runs as a background daemon with a configurable heartbeat interval (30 minutes by default). On each tick, the agent reads a HEARTBEAT.md checklist in the workspace:

- Check for new emails and summarize anything urgent
- Review today's calendar for upcoming meetings
- Run daily expense summary if it's after 6 PM

The agent processes each item and can send proactive messages to the user via any connected channel. If nothing requires attention, it responds with HEARTBEAT_OK and goes back to sleep.

This is a simple but powerful pattern: a cron job for your AI agent, configured in plain text. No scheduling framework, no database of recurring tasks. Just a markdown file the user can edit.
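
The tick logic can be sketched in a few lines. This is an illustration under stated assumptions: parseChecklist and runHeartbeat are hypothetical names, and the evaluate callback stands in for the agent turn that decides whether an item needs attention.

```typescript
// Extracts "- item" lines from a HEARTBEAT.md-style checklist
function parseChecklist(markdown: string): string[] {
  return markdown
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("- "))
    .map((line) => line.slice(2).trim())
}

// `evaluate` is a hypothetical stand-in for the agent: it returns a message
// to send, or null when the item needs no attention.
function runHeartbeat(
  markdown: string,
  evaluate: (item: string) => string | null,
): string[] | "HEARTBEAT_OK" {
  const messages = parseChecklist(markdown)
    .map(evaluate)
    .filter((message): message is string => message !== null)
  return messages.length > 0 ? messages : "HEARTBEAT_OK"
}

const checklist = [
  "- Check for new emails and summarize anything urgent",
  "- Review today's calendar for upcoming meetings",
].join("\n")

// Pretend only the calendar item has something to report
const result = runHeartbeat(checklist, (item) =>
  item.includes("calendar") ? "Standup starts in 15 minutes" : null,
)
console.log(result) // prints [ 'Standup starts in 15 minutes' ]
```

A daemon would call runHeartbeat on a timer at the configured interval and forward any returned messages to a connected channel.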

Takeaway for builders: If your agent should be proactive, implement a heartbeat. A markdown checklist is all you need to start. It is far simpler than building a full scheduling system.

Nodes: The Companion Device Model

OpenClaw "nodes" are native apps on iOS, macOS, and Android that connect to the central Gateway via WebSocket. They act as peripherals: the phone node can take photos and read notifications, while the macOS node can interact with desktop apps and record the screen.

Nodes register through a pairing protocol (node.pair.request → node.pair.approve) and expose capabilities through a standardized node.invoke interface. The model never communicates directly with nodes. It talks to the Gateway, which forwards calls to the appropriate device.

Capability	macOS	iOS	Android	Headless
WebView (canvas)	Yes	Yes	Yes	No
Camera	Yes	Yes	Yes	No
Shell commands	Yes	No	No	Yes
SMS sending	No	No	Yes	No
Screen recording	Yes	Yes	Yes	No
Location	Yes	Yes	Yes	No

Takeaway for builders: If your agent needs device-specific capabilities, a lightweight WebSocket peripheral model is cleaner than trying to run everything on the server. The key is keeping nodes dumb: they execute commands, they do not run agent logic.

Security: The Cautionary Tale

OpenClaw's architecture is innovative. Its security track record is a cautionary tale. These lessons are essential for anyone building agent systems.

CVE-2026-25253: One-Click Remote Code Execution

The most critical vulnerability, patched in v2026.1.29:

The Gateway's Control UI trusted the gatewayUrl from the query string without validation and auto-connected on load, sending the stored authentication token in the WebSocket payload. A crafted link could redirect this token to an attacker-controlled server.

The root cause was deeper: the WebSocket server did not validate the Origin header. Any website could connect to a running OpenClaw instance. With the token, an attacker could:

Connect to the victim's local Gateway
Modify configuration (disable sandbox, weaken tool policies)
Invoke privileged actions using operator.admin and operator.approvals scopes
Run arbitrary commands on the host machine — full remote code execution

Lesson for builders: If your agent exposes any network interface — HTTP, WebSocket, gRPC — origin validation and authentication are not optional. A local-first agent is only safe if the network boundary is actually enforced. In the single-process Gateway model, one WebSocket vulnerability compromises everything because there is no isolation boundary.
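
The two checks this vulnerability calls for can be sketched as below. This is not the actual patch: the function names and the allowlist approach are assumptions, and the port number is simply the Gateway default mentioned earlier in this article.

```typescript
// Hypothetical allowlist check for the Origin header of a WebSocket upgrade.
// This sketch rejects requests without an Origin; non-browser clients would
// need a separate, token-authenticated path.
function isOriginAllowed(origin: string | undefined, allowedOrigins: string[]): boolean {
  if (!origin) return false
  try {
    // URL.origin normalizes scheme, host, and port before comparison
    return allowedOrigins.includes(new URL(origin).origin)
  } catch {
    return false // malformed Origin header
  }
}

// Hypothetical guard for a user-supplied gateway URL: accept only
// WebSocket URLs that point at the local machine.
function isLoopbackGatewayUrl(raw: string): boolean {
  try {
    const url = new URL(raw)
    const loopbackHosts = ["localhost", "127.0.0.1", "[::1]"]
    const isWebSocket = url.protocol === "ws:" || url.protocol === "wss:"
    return isWebSocket && loopbackHosts.includes(url.hostname)
  } catch {
    return false
  }
}

console.log(isOriginAllowed("http://localhost:18789", ["http://localhost:18789"])) // true
console.log(isOriginAllowed("https://evil.example", ["http://localhost:18789"])) // false
console.log(isLoopbackGatewayUrl("wss://attacker.example/ws")) // false
```

Either check alone would have narrowed the attack described above: the first stops arbitrary websites from opening a socket to the Gateway, and the second stops a crafted link from redirecting the stored token to a remote server.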

ClawHub Supply Chain Attacks

ClawHub, OpenClaw's skill marketplace, became a major attack vector:

Security audits found that 12-20% of uploaded skills contained malicious instructions
A campaign called ClawHavoc distributed macOS malware through skills with professional documentation and names like solana-wallet-tracker and youtube-summarize-pro
The #1 ranked community skill silently exfiltrated data and used direct prompt injection to bypass safety guidelines

The attack vector is subtle: skills are markdown injected into agent context. A malicious skill can instruct the agent to exfiltrate data, modify other skills, or execute harmful commands. This is prompt injection via the extension ecosystem.

ClawHub's publishing requirement was minimal: a 1-week-old GitHub account. No code review, no content scanning, no sandboxing.

Lesson for builders: If your agent loads third-party skills or prompts, treat them as untrusted input. "It is just markdown" does not mean "it is safe." Sandboxing, review processes, and automated content scanning are necessary for any public skill registry.

Plaintext Credential Storage

Connected account credentials (WhatsApp sessions, API keys for Anthropic/OpenAI, Telegram bot tokens, Discord OAuth tokens) are stored as plaintext files under ~/.openclaw/. Known malware families are already building capabilities to harvest these file structures.

Lesson for builders: Use your platform's secret storage (macOS Keychain, Windows Credential Manager, Linux secret-service). Never store credentials in plaintext, even for local-first applications.

Summary

Vulnerability	Root Cause	Lesson
CVE-2026-25253 (RCE)	Unvalidated gatewayUrl and missing WebSocket origin validation	Validate origins and authenticate every network interface
ClawHub malicious skills	No skill content scanning	Treat third-party prompts as untrusted input
Plaintext credentials	No OS secret store integration	Use platform-native credential storage

Note: These issues do not diminish OpenClaw's architectural innovations. They highlight that security for AI agents requires the same rigor as any networked application, and that agent-specific attack vectors (prompt injection via skills) require new defensive patterns.

What to Take Away: A Builder's Checklist

Here are the concrete patterns from OpenClaw's architecture worth adopting in your own agent systems:

Per-session serial queues. Default to serial execution within a session. Opt into parallelism only when provably safe. This prevents an entire class of race condition bugs.
Structured session keys. Scope isolation with workspace:channel:userId, not just a user ID. This prevents cross-context data leaks.
Channel adapter pattern. If you integrate with more than one platform, normalize messages before they reach your agent logic. Do this early.
Skills as markdown. For extensibility, markdown-first beats code-first. Lower friction, agent-authorable, hot-reloadable, debuggable.
Progressive skill disclosure. Load skill names and descriptions upfront (low tokens). Load full skill content only when activated. Keep your base context lean.
Human-readable memory. Store agent state in formats you can inspect and edit: Markdown, YAML, JSONL. The debugging advantage is worth the trade-off versus opaque vector stores.
Hybrid search. Combine vector similarity with keyword search. Use SQLite (sqlite-vec + FTS5) if you want to stay local with no external dependencies.
Accessibility trees over screenshots. For browser automation, parse the accessibility tree. It is cheaper, faster, and more accurate for LLM reasoning.
Heartbeat pattern. For proactive agents, a simple cron + checklist file is enough. You do not need a complex scheduling system.
Authenticate everything. WebSocket, HTTP, local sockets. If it accepts connections, it needs origin validation and authentication. Local-first does not mean security-optional.

Conclusion

OpenClaw is not a framework you adopt — it is an architecture you study.

The project's core insight is that a personal AI agent is fundamentally a gateway problem, not a model problem. Getting the runtime right (queuing, channel normalization, memory, extensibility) matters more than which LLM you use.

The Lane Queue and Skills system are the two patterns with the broadest applicability. If you take nothing else from this article, implement per-session serial execution and consider markdown-based extensibility for your agents.

The security story is equally important. Agent systems have a larger attack surface than traditional applications because they accept natural language input from multiple untrusted sources, including their own extension ecosystems. Build with that in mind.

Build agents that are reliable before they are clever. OpenClaw got that part right.

Resources

OpenClaw GitHub Repository: Source code and documentation
OpenClaw Official Documentation: Architecture guides and API reference
OpenClaw Official Website: Project overview and getting started
CVE-2026-25253 Details: Full vulnerability disclosure
Agent Skills Specification: The skills format OpenClaw uses at scale

How to Build and Deploy an Agent Skill from Scratch: Build your own skills using the format OpenClaw adopted
Don't Let Your AI Agent Forget: Smarter Strategies for Summarizing Message History: Memory management patterns that complement OpenClaw's approach
Writing Effective Tools for AI Agents: Production Lessons from Anthropic: Tool design principles for the capabilities layer

How to Build and Deploy an Agent Skill from Scratch

Ali Ibrahim — Mon, 16 Feb 2026 13:00:00 +0000

Introduction

AI agents are increasingly capable, but they often lack the specialized knowledge needed for real work. Your agent might write code, but does it know your team's deployment process? It can analyze data, but does it understand your company's reporting standards?

Agent Skills solve this by packaging domain expertise, workflows, and context into portable folders that agents can discover and use on demand.

What you'll build:

In this guide, you'll create a skill that teaches AI agents how to generate beautiful financial reports with charts and insights—building on Cameron AI, the personal finance assistant from our previous articles on agent prompting and tool design.

Here is an overview of the report your agent will generate with this skill:

Example report generated by the Cameron Expense Reporter skill.

What Are Agent Skills?

Agent Skills are folders containing instructions, scripts, and resources that AI tools can discover and use. Think of them as training documents for your AI assistant, except they load automatically when relevant.

The key insight is progressive disclosure. When your agent starts, it only loads the name and description of each skill—roughly 100 tokens per skill. This means you can have dozens of skills available without bloating your context window. Your agent stays fast and focused, pulling in detailed guidance only when needed.

Skills vs MCP: Complementary, Not Competing

If you've worked with MCP servers, you might wonder how skills fit in. Here's the distinction:

Aspect	Agent Skills	MCP Servers
Purpose	Teach workflows and knowledge	Provide tools and capabilities
Format	Single SKILL.md file	Server code (TypeScript, Python)
Example	"How to write agent prompts"	"Fetch URL", "Query database"

Skills provide the knowledge; MCP provides the capabilities. A skill might instruct an agent to "fetch the API documentation and analyze it", while an MCP server provides the actual fetch tool.

For a deeper dive, see the official Agent Skills specification.

Understanding SKILL.md Format

Every skill is a folder containing a SKILL.md file. This file has two parts: YAML frontmatter and markdown instructions.

Image credit: Anthropic

Required Fields

Field	Max Length	Description
`name`	64 chars	Lowercase letters, numbers, and hyphens only
`description`	1024 chars	When and why to use this skill

The description is crucial: it's what the AI uses to decide when to activate your skill. Include specific keywords that match how users naturally ask for help.

Note: The specification supports additional optional fields (license, compatibility, allowed-tools, etc.). See the official documentation for the complete list.
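
The limits in the table above are easy to check before you publish a skill. The following is a rough sketch of such a check; `validateFrontmatter` is a hypothetical helper, not part of any official tooling, and it covers only the two required fields.

```javascript
// Hypothetical validator for the two required fields, based on the limits in the table above.
function validateFrontmatter({ name, description }) {
  const errors = []

  if (typeof name !== 'string' || name.length === 0) {
    errors.push('name is required')
  } else {
    if (name.length > 64) errors.push('name exceeds 64 characters')
    if (!/^[a-z0-9-]+$/.test(name)) {
      errors.push('name may only contain lowercase letters, numbers, and hyphens')
    }
  }

  if (typeof description !== 'string' || description.trim().length === 0) {
    errors.push('description is required')
  } else if (description.length > 1024) {
    errors.push('description exceeds 1024 characters')
  }

  return errors
}

const validResult = validateFrontmatter({
  name: 'cameron-expense-reporter',
  description: 'Generate financial reports with charts.',
})
const invalidResult = validateFrontmatter({ name: 'Cameron Reporter', description: '' })

console.log(validResult) // []
console.log(invalidResult)
```

The second call reports two problems: the name contains uppercase letters and a space, and the description is empty.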

Building the Cameron Expense Reporter Skill

Let's build a skill for Cameron AI, a personal finance assistant that helps users manage budgets and track expenses.

For simplicity, we'll assume that Cameron has file system access via MCP and can execute JavaScript in a sandboxed environment, which is the infrastructure needed to discover and run skills.

The skill we're building will teach Cameron how to transform raw expense data into professional visualizations with Chart.js.

Context: If you're new to Cameron AI, check out:

The Art of Agent Prompting: How Cameron's prompts are designed
Writing Tools for AI Agents: How Cameron's cameron_get_expenses tool works

What you'll build: A complete skill with:

6-step visualization workflow (understand request → gather data → format → generate chart → add insights → compose report)
Helper scripts for Chart.js configuration and data formatting
Reference guides for chart type selection
HTML template for professional reports

Full implementation: The complete skill is available on GitHub. We'll show function signatures and key patterns here.

Step 1: Create the Skill Directory

Create a folder structure for your skill:

mkdir -p my-skills/cameron-expense-reporter
cd my-skills/cameron-expense-reporter

# Create subdirectories for supporting files
mkdir -p scripts references assets

This skill will use:

scripts/: Reusable JavaScript utilities for charts and data formatting
references/: Decision guides and complete examples
assets/: HTML templates for professional output

Step 2: Write the Frontmatter

Create SKILL.md with frontmatter that clearly describes when to activate this skill:

---
name: cameron-expense-reporter
description: Generate financial reports with charts using Chart.js. Use when users ask to visualize spending, show trends, create expense charts, analyze spending patterns, compare categories, track budget progress, or generate financial reports. Handles chart type selection (bar/line/pie), data formatting (currency, dates, aggregation), Chart.js configuration, and insight generation. Works with expense data from tools like cameron_get_expenses or similar financial data sources.
---

Key decisions:

name: Scoped to Cameron's domain with clear purpose (we can also just use expense-reporter)
description: Includes natural trigger phrases users would say ("visualize spending", "show trends", "expense charts")

The description is critical: it determines when the AI loads this skill. Include synonyms and common phrasings.

Step 3: Define the Workflow

After frontmatter, add the skill title and a 6-step workflow:

# Cameron Expense Reporter

Generate beautiful, insightful financial reports with charts and written analysis.

## Workflow

### Step 1: Understand the Request

Identify the user's intent to select the appropriate visualization:

- Category comparison → Bar chart
- Time-series trends → Line chart
- Distribution/proportions → Pie chart
- Budget tracking → Line chart with budget reference line

### Step 2: Gather Expense Data

Use the `cameron_get_expenses` tool to retrieve relevant data...

### Step 3: Prepare and Format Data

Use helper scripts for data transformation...

### Step 4: Generate Chart Configuration

Use helper scripts to create Chart.js configurations...

### Step 5: Generate Insights

Provide written analysis alongside the chart...

### Step 6: Compose the Report

Generate HTML, markdown, or inline visualization...

Why 6 steps: Each represents a distinct decision point in the visualization workflow. This guides the AI through: intent recognition → data retrieval → transformation → rendering → analysis → output.

Step 4: Add Supporting Scripts

For complex logic, extract reusable functions into scripts. Create two JavaScript utilities:

scripts/format_financial_data.js — Data transformation:

/**
 * Aggregate expenses by category
 */
function aggregateByCategory(expenses) {
  return expenses.reduce((acc, expense) => {
    const category = expense.category || 'Other'
    acc[category] = (acc[category] || 0) + expense.amount
    return acc
  }, {})
}

/**
 * Prepare expense data for visualization
 */
function prepareChartData(expenses, aggregationType = 'category', sortByValue = true) {
  let aggregated

  switch (aggregationType) {
    case 'category':
      aggregated = aggregateByCategory(expenses)
      break
    case 'month':
      aggregated = aggregateByMonth(expenses)
      break
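    case 'week':
      aggregated = aggregateByWeek(expenses)
      break
    default:
      // Fail loudly instead of passing undefined on to toChartFormat
      throw new Error(`Unknown aggregationType: ${aggregationType}`)
  }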

  if (sortByValue && aggregationType === 'category') {
    aggregated = sortByAmount(aggregated)
  }

  return toChartFormat(aggregated)
}

// Also includes: aggregateByMonth, aggregateByWeek, formatCurrency,
// calculatePercentageChange, formatDateRange, toChartFormat, sortByAmount
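
The helpers listed in the comment above are not shown in this article. As a rough idea of what two of them could look like, here is a sketch of `sortByAmount` and `toChartFormat`. These bodies are assumptions written for illustration; the versions in the GitHub repository may differ. The `{ labels, data }` output shape is chosen to match the parameters that `generateBarChart` accepts.

```javascript
// Assumed implementations: the real helpers live in the GitHub repository and may differ.

// Sort an aggregated { category: total } object by total, largest first
function sortByAmount(aggregated) {
  return Object.fromEntries(
    Object.entries(aggregated).sort(([, a], [, b]) => b - a)
  )
}

// Split an aggregated object into the parallel arrays a chart expects
function toChartFormat(aggregated) {
  return {
    labels: Object.keys(aggregated),
    data: Object.values(aggregated),
  }
}

// Sample totals, made up for illustration
const totals = { Groceries: 320.5, Dining: 410, Transport: 95.25 }
const chartData = toChartFormat(sortByAmount(totals))
console.log(chartData)
// { labels: [ 'Dining', 'Groceries', 'Transport' ], data: [ 410, 320.5, 95.25 ] }
```

One caveat of this approach: JavaScript objects keep insertion order only for non-numeric string keys, so category names like "2024" would be reordered. An array of `[label, value]` pairs avoids that if it matters for your data.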

scripts/generate_chart_config.js — Chart.js configuration:

/**
 * Generate a bar chart configuration for categorical spending data
 */
function generateBarChart({ title, labels, data, currency = '$' }) {
  return {
    type: 'bar',
    data: {
      labels: labels,
      datasets: [
        {
          label: 'Spending',
          data: data,
          backgroundColor: 'rgba(59, 130, 246, 0.8)',
          borderColor: 'rgba(59, 130, 246, 1)',
          borderWidth: 1,
        },
      ],
    },
    options: {
      responsive: true,
      maintainAspectRatio: false,
      plugins: {
        title: { display: true, text: title, font: { size: 16, weight: 'bold' } },
        tooltip: {
          callbacks: {
            label: function (context) {
              return currency + context.parsed.y.toFixed(2)
            },
          },
        },
      },
      scales: {
        y: {
          beginAtZero: true,
          ticks: { callback: (value) => currency + value.toFixed(0) },
        },
      },
    },
  }
}

// Also includes: generateLineChart, generatePieChart
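
The workflow also calls for a line chart with a budget reference line. The sketch below shows one way `generateLineChart` could produce that configuration. The signature and the `budget` parameter are assumptions; the repository's version may be different. It uses only standard Chart.js line dataset options (`borderDash`, `pointRadius`, `tension`, `fill`).

```javascript
// Hypothetical sketch: the repository's generateLineChart may use a different signature.
function generateLineChart({ title, labels, data, budget = null }) {
  const datasets = [
    {
      label: 'Spending',
      data: data,
      borderColor: 'rgba(59, 130, 246, 1)',
      backgroundColor: 'rgba(59, 130, 246, 0.2)',
      tension: 0.3,
    },
  ]

  // Draw the budget as a flat dashed red line across every label
  if (budget !== null) {
    datasets.push({
      label: 'Budget',
      data: labels.map(() => budget),
      borderColor: 'rgba(239, 68, 68, 1)',
      borderDash: [6, 6],
      pointRadius: 0,
      fill: false,
    })
  }

  return {
    type: 'line',
    data: { labels, datasets },
    options: {
      responsive: true,
      maintainAspectRatio: false,
      plugins: { title: { display: true, text: title } },
      scales: { y: { beginAtZero: true } },
    },
  }
}

const lineConfig = generateLineChart({
  title: 'Monthly Spending vs Budget',
  labels: ['Jan', 'Feb', 'Mar'],
  data: [1200, 1350, 1100],
  budget: 1250,
})
console.log(lineConfig.data.datasets.length) // 2
```

Representing the budget as a second dataset keeps the sketch free of plugins; an annotation plugin would be an alternative if you need labels on the reference line.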

Why scripts: Instead of inlining complex logic in SKILL.md, extract it into reusable functions the AI can read and execute.

Full implementations: See the complete skill on GitHub for all utility functions.

Step 5: Add Reference Guides

For decision logic, use reference documents the AI loads on-demand:

references/chart-types-guide.md — Chart selection decision tree:

## Decision Tree

### Categorical Comparison ("Which category did I spend most on?")

**Use: Bar Chart**

Best for:

- Comparing spending across categories
- Showing top spending categories

Example queries:

- "What are my biggest expenses?"
- "Show spending by category"

### Time-Series Trends ("How is my spending changing?")

**Use: Line Chart**

Best for:

- Showing spending trends over weeks/months
- Comparing to budget over time

Example queries:

- "Show my spending trends over the year"
- "How has my spending changed?"

### Distribution/Proportions ("Where does my money go?")

**Use: Pie Chart**

Best for:

- Showing percentage breakdown
- Understanding spending distribution

Example queries:

- "Where does my money go?"
- "What percentage do I spend on dining?"

Why references: Separates decision-making guidance from the main workflow. The SKILL.md references this guide: "For chart selection, see [references/chart-types-guide.md]".
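
In the skill itself, the agent makes this choice by reading the guide. If you want the same mapping available in code, for example as a fallback inside a script, it reduces to a small lookup. The intent labels and the bar-chart default below are assumptions made for this sketch.

```javascript
// Illustrative only: intent labels are hypothetical names for the four cases in the workflow.
const CHART_TYPE_BY_INTENT = new Map([
  ['category-comparison', 'bar'],
  ['time-series', 'line'],
  ['distribution', 'pie'],
  ['budget-tracking', 'line'],
])

function selectChartType(intent) {
  // Fall back to a bar chart when the intent is not recognized (an assumption of this sketch)
  return CHART_TYPE_BY_INTENT.get(intent) ?? 'bar'
}

console.log(selectChartType('distribution')) // pie
console.log(selectChartType('something-else')) // bar
```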

The skill also includes:

references/chartjs-examples.md: Complete HTML examples for each chart type
assets/report-template.html: Professional HTML template with placeholders
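
To illustrate how a template with placeholders can be filled, here is a minimal sketch. The `{{key}}` placeholder syntax, the field names, and the `renderReport` helper are all assumptions; check assets/report-template.html in the repository for the actual format.

```javascript
// Hypothetical: the {{placeholder}} syntax and field names are assumptions,
// not necessarily what assets/report-template.html uses.
function renderReport(template, values) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (placeholder, key) =>
    Object.prototype.hasOwnProperty.call(values, key) ? String(values[key]) : placeholder
  )
}

const reportTemplate = '<h1>{{title}}</h1><p>{{summary}}</p><footer>{{generatedAt}}</footer>'
const reportHtml = renderReport(reportTemplate, {
  title: 'January Spending',
  summary: 'Dining was the largest category.',
})
console.log(reportHtml)
// <h1>January Spending</h1><p>Dining was the largest category.</p><footer>{{generatedAt}}</footer>
```

Unknown placeholders are left in place so that missing values are easy to spot in the output. The sketch does not escape HTML, so values that come from user data should be escaped before they are inserted.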

Step 6: Reference Supporting Files in SKILL.md

In your workflow, tell the AI when to use these resources:

### Step 1: Understand the Request

Identify the user's intent to select the appropriate visualization.

For detailed chart selection guidance, see [references/chart-types-guide.md](references/chart-types-guide.md).

### Step 2: Gather Expense Data

Use the `cameron_get_expenses` tool (or equivalent) to retrieve relevant data.

### Step 3: Prepare and Format Data

**Load the formatting utilities:**

<!-- Read and use scripts/format_financial_data.js -->

### Step 4: Generate Chart Configuration

**Load the chart generator:**

<!-- Read and use scripts/generate_chart_config.js -->

Progressive disclosure: Reference files are loaded on-demand by the AI when needed. You don't inline everything in SKILL.md.

Step 7: Add Best Practices to SKILL.md

Include guidance on edge cases and performance:

## Best Practices

**Data preparation:**

- Always validate date ranges before querying
- Handle empty results gracefully
- Round currency to 2 decimal places for display

**Chart configuration:**

- Use consistent color scheme (blue primary, red for warnings)
- Set responsive: true for all charts
- Format currency in tooltips and axis ticks

**Insight generation:**

- Lead with the most important finding
- Use concrete numbers, not vague language ("Dining increased 15%" not "You spent more on dining")
- Provide context (percentages, comparisons)

**Performance:**

- For large datasets (>1000 expenses), aggregate before visualizing
- Use `response_format: 'concise'` when fetching chart data
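
Several of these practices translate directly into small helpers. The sketch below shows possible versions of `formatCurrency` and `calculatePercentageChange` (both named in the script listing earlier, but with bodies assumed here), plus a hypothetical `describeChange` that produces a concrete, number-led insight and handles a missing baseline.

```javascript
// Assumed implementations: the repository versions of these helpers may differ.

// Format a number as currency, rounded to 2 decimal places
function formatCurrency(amount, currency = 'USD', locale = 'en-US') {
  return new Intl.NumberFormat(locale, { style: 'currency', currency }).format(amount)
}

// Percentage change from previous to current, or null when there is no baseline
function calculatePercentageChange(previous, current) {
  if (!previous) return null
  return ((current - previous) / previous) * 100
}

// Hypothetical helper: build a one-line insight with concrete numbers
function describeChange(category, previous, current) {
  const change = calculatePercentageChange(previous, current)
  if (change === null) {
    return `${category}: ${formatCurrency(current)} (no earlier data to compare)`
  }
  const direction = change >= 0 ? 'increased' : 'decreased'
  return `${category} ${direction} ${Math.abs(change).toFixed(0)}% to ${formatCurrency(current)}`
}

console.log(describeChange('Dining', 200, 230))
// Dining increased 15% to $230.00
console.log(describeChange('Travel', 0, 120))
// Travel: $120.00 (no earlier data to compare)
```

Returning `null` for a zero or missing baseline avoids a division by zero and lets the caller decide how to phrase the gap.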

Step 8: Validate the Complete Skill

Verify your skill has all components:

cameron-expense-reporter/
├── SKILL.md                          ✓ Main instructions with 6-step workflow
├── scripts/
│   ├── generate_chart_config.js      ✓ Bar, line, pie chart generators
│   └── format_financial_data.js      ✓ Aggregation and formatting utilities
├── references/
│   ├── chart-types-guide.md          ✓ Chart selection decision tree
│   └── chartjs-examples.md           ✓ Complete working examples
└── assets/
    └── report-template.html          ✓ HTML template for reports

Validation checklist:

[ ] SKILL.md has frontmatter with clear trigger phrases
[ ] Workflow is step-by-step and actionable
[ ] Supporting scripts have clear function signatures
[ ] References are linked from main SKILL.md with relative paths
[ ] All examples use consistent patterns (Chart.js v4, Tailwind colors)
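
The structural part of this checklist can be automated. The following is a minimal Node.js sketch that reports which of the files from the tree above are missing. `findMissingFiles` is a hypothetical helper, and the check only confirms that the files exist, not that their contents are valid.

```javascript
const fs = require('node:fs')
const path = require('node:path')

// File list taken from the directory tree above
const REQUIRED_FILES = [
  'SKILL.md',
  'scripts/generate_chart_config.js',
  'scripts/format_financial_data.js',
  'references/chart-types-guide.md',
  'references/chartjs-examples.md',
  'assets/report-template.html',
]

// Hypothetical helper: return the expected files that are missing from skillDir
function findMissingFiles(skillDir, requiredFiles = REQUIRED_FILES) {
  return requiredFiles.filter((file) => !fs.existsSync(path.join(skillDir, file)))
}

const missingFiles = findMissingFiles('cameron-expense-reporter')
if (missingFiles.length === 0) {
  console.log('All expected files are present')
} else {
  console.log('Missing files:', missingFiles.join(', '))
}
```

Run it from the parent folder of the skill (my-skills/ in this guide) so that the relative path resolves.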

Get the complete skill: GitHub repository

Testing Your Skill

Now let's install and test the skill.

Install in Claude

The quickest way to test is with Claude.ai or Claude Desktop:

Note: If you are using the final GitHub repo skill, make sure to remove the expenses.csv file from the root of the skill folder before zipping and uploading to Claude.

Zip your skill folder: zip -r cameron-expense-reporter.zip cameron-expense-reporter/
Go to Settings > Capabilities and ensure "Code execution and file creation" is enabled
Scroll to the Skills section and click Upload skill
Upload your ZIP file

Upload your skill in Claude's settings.

For details, see Using Skills in Claude.

Other tools: You're not limited to Claude; any skills-compatible agent works. For example, in Cursor, place your skill folder in .cursor/skills/. You can also use skills.sh to install skills across tools with a single command: npx skills add <owner/repo>.

Test with Sample Data

Download the sample expenses CSV and try this prompt:

Here are my expenses [attach the CSV file]. Can you create a report using the cameron-expense-reporter skill?

Expected behavior:

Claude activates the cameron-expense-reporter skill
Reads the CSV data
Selects an appropriate chart type (bar chart for category comparison)
Generates a Chart.js visualization with formatted data
Provides written insights about spending patterns
Outputs a complete HTML report

Claude generates a report with a bar chart comparing spending.

Claude generates a pie chart showing spending distribution.

Best Practices

Now that you've built your first skill, here are some tips for creating effective skills:

Keep it focused. A skill should do one thing well. If your SKILL.md exceeds 500 lines, consider splitting into multiple skills or using supporting files.

Description keywords matter. The description field determines when the AI activates your skill. Include synonyms and common phrasings for your use case.

Use supporting files strategically. Complex skills benefit from:

scripts/: Reusable code the AI can execute (JavaScript, Python, shell scripts)
references/: Decision trees, documentation, examples (loaded on-demand)
assets/: Templates, images, static files (used in output, not loaded into context)

Show function signatures, not full implementations. In SKILL.md, include:

// Good: Clear signature with purpose
function aggregateByCategory(expenses) { ... }

// Avoid: Inline implementation
// Keep actual logic in scripts/ directory

Balance brevity with completeness. The cameron-expense-reporter SKILL.md is ~290 lines:

~100 lines: Workflow steps
~100 lines: Best practices and common patterns
~90 lines: Quick references and troubleshooting

Test with real scenarios. Run your skill against actual tasks you face. Adjust the description and instructions based on what works.

Using the Skill Creator

We built the skill manually to understand each component, but you don't have to. Anthropic provides a skill-creator skill that can generate or iterate on skills for you: just describe what you want and let the AI handle the structure and formatting.

Beyond the Tutorial

The skill we built here is for learning purposes. Before shipping a skill to production or sharing it with other users, there are additional concerns to consider, especially if you're building custom agents.

Evaluation: Test your skill across diverse prompts to see how reliably the agent activates it, follows the workflow, and produces correct output. A single successful test isn't enough; edge cases and ambiguous requests will reveal gaps in your instructions.

Context management: Monitor how your agent's context grows when a skill is active. Loading full instructions plus reference files plus scripts can consume significant tokens. If your agent uses multiple skills in one session, context pressure becomes a real concern.

Infrastructure (custom agents): Coding tools like Cursor and Claude handle skill discovery and code execution out of the box. For custom agents, you may need to evaluate whether your MCP file server and sandbox are robust enough—particularly around file access permissions, execution timeouts, and error handling when scripts fail.

Conclusion

You've just created a sophisticated skill that transforms Cameron AI into a financial visualization expert. The SKILL.md file, along with its supporting scripts, reference guides, and templates, works across Cursor, Claude Code, and any custom agent with the right infrastructure.

This is the power of open standards: write once, use everywhere, whether in coding assistants or custom agents you build.

The skill you built demonstrates key patterns for production-ready agent skills:

Progressive disclosure: Only load detailed references when needed
Reusable scripts: Extract complex logic into executable functions
Decision trees: Reference guides for consistent choices
Professional output: Templates for polished deliverables

What workflow does your agent repeat most often? That's your next skill.

Enjoying content like this? Sign up for Agent Briefings, where I share insights and news on building and scaling AI agents.

Resources

Agent Skills Specification — Official open standard
Example Skills — Official skill examples
Vercel Skills — Registry of popular skills

The Art of Agent Prompting: Anthropic's Playbook — Prompt design patterns behind Cameron AI
Writing Tools for AI Agents — How to design effective tools like cameron_get_expenses
Create Your First MCP Server in 5 Minutes — Add tool capabilities to complement your skills
