DEV Community: Nilam Bora

Your Website Has a New User — And It's Not Human. A Hands-On Guide to WebMCP

Nilam Bora — Sat, 23 May 2026 15:41:38 +0000

This is a submission for the Google I/O Writing Challenge

The Web Just Got a New Kind of Visitor

Here's a question nobody was asking five years ago: what happens when the primary user of your website isn't a person?

At Google I/O 2026, sitting in my home office watching the Chrome session, I got my answer — and it wasn't what I expected. While everyone was buzzing about Gemini 3.5 and Antigravity 2.0, the Chrome team quietly dropped what I think is the most consequential announcement for web developers: WebMCP.

WebMCP (Web Model Context Protocol) is a proposed W3C standard that lets websites expose JavaScript functions as structured tools that AI agents can call directly — no scraping, no DOM simulation, no brittle Puppeteer scripts. Just a clean, typed API living right inside the browser tab.

I've spent the last few days building with it. This isn't a summary of the keynote. This is what I learned when I actually opened chrome://flags and started writing code.

What WebMCP Actually Is (In 60 Seconds)

If you've worked with MCP (Model Context Protocol), you know the concept: give AI agents structured tools to call instead of making them guess what to click on a screen.

WebMCP takes that idea and moves it into the browser. Instead of running a separate Node.js or Python server, you register tools directly on your webpage using a new browser API: navigator.modelContext.

Here's the fundamental mental model shift:

	Traditional MCP	WebMCP
Runs on	Your backend server	The browser tab
Auth	API keys, OAuth tokens	User's existing session & cookies
Sees	Your database, internal APIs	The page's JavaScript context
Best for	Headless automation	Human-in-the-loop workflows
Infrastructure	New server required	Zero additional infrastructure

They're complementary, not competing. MCP is for your backend. WebMCP is for your frontend. But if you're a web developer who's been wondering "how do I make my site work with AI agents?" — WebMCP is the answer Google just handed you.

Setting Up: 5 Minutes to Your First WebMCP Tool

Prerequisites

Chrome 149+ (currently in Canary/Dev channel, or behind a flag in stable)
A basic web page
That's it. No npm packages. No build tools. No server.

Step 1: Enable the Flag

For local development, navigate to:

chrome://flags/#enable-webmcp-testing

Set it to Enabled, then relaunch Chrome.

For production deployment, you'll want to register for the origin trial at developer.chrome.com/origintrials — search for "WebMCP", register your origin, and include the token as a <meta> tag in your page's <head>.

Step 2: Open the WebMCP DevTools Pane

Chrome DevTools now has a dedicated WebMCP pane under the Application tab. This gives you:

Available Tools: Every tool registered on the page, with invocation counters
Invoked Tools Log: Chronological record of agent-tool interactions with input/output inspection
Manual Test: Execute any tool directly, bypassing agent logic
Schema Validation: Catches malformed JSON schemas before agents hit them

Alternatively, you can also install the Model Context Tool Inspector extension from the Chrome Web Store for a more visual workflow.

Step 3: Register Your First Tool

Create an index.html file:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>WebMCP Demo</title>
</head>
<body>
  <h1>🛠️ My Agent-Ready Page</h1>
  <div id="output"></div>

  <script>
    // Check if the API is available
    if ('modelContext' in navigator) {

      navigator.modelContext.registerTool({
        name: 'get_weather',
        description: 'Returns the current weather for a given city. Use this when the user asks about weather conditions.',
        inputSchema: {
          type: 'object',
          properties: {
            city: {
              type: 'string',
              description: 'The city name, e.g. "Tokyo" or "San Francisco"'
            },
            units: {
              type: 'string',
              enum: ['celsius', 'fahrenheit'],
              description: 'Temperature unit preference'
            }
          },
          required: ['city']
        },
        execute: async ({ city, units = 'celsius' }) => {
          // In a real app, this would call your existing API
          const mockData = {
            'tokyo': { temp: 22, condition: 'Partly Cloudy' },
            'san francisco': { temp: 18, condition: 'Foggy' },
            'london': { temp: 15, condition: 'Overcast' }
          };

          const data = mockData[city.toLowerCase()];
          if (!data) return { error: `No weather data for ${city}` };

          const temp = units === 'fahrenheit' 
            ? Math.round(data.temp * 9/5 + 32) 
            : data.temp;

          return {
            city,
            temperature: `${temp}°${units === 'fahrenheit' ? 'F' : 'C'}`,
            condition: data.condition,
            timestamp: new Date().toISOString()
          };
        }
      });

      console.log('✅ WebMCP tool registered: get_weather');

    } else {
      console.warn('WebMCP not available. Enable chrome://flags/#enable-webmcp-testing');
    }
  </script>
</body>
</html>

Open this in Chrome 149 with the flag enabled. Open DevTools console. You should see:

✅ WebMCP tool registered: get_weather

Congratulations — your page is now agent-ready. Any AI agent that supports WebMCP (like Gemini in Chrome) can discover and call this tool instead of trying to scrape your page.

Going Deeper: The Declarative Approach

The imperative JavaScript API gives you full control, but WebMCP also offers a declarative approach for common form-based workflows. This is where it gets really elegant.

Instead of writing JavaScript, you add attributes directly to your HTML forms:

<form toolname="search_products" 
      tooldescription="Search the product catalog by name, category, or price range">

  <label for="query">Search</label>
  <input type="text" id="query" name="query" 
         placeholder="What are you looking for?" required>

  <label for="category">Category</label>
  <select id="category" name="category">
    <option value="">All Categories</option>
    <option value="electronics">Electronics</option>
    <option value="books">Books</option>
    <option value="clothing">Clothing</option>
  </select>

  <label for="max_price">Max Price ($)</label>
  <input type="number" id="max_price" name="max_price" min="0" step="0.01">

  <button type="submit">Search</button>
</form>

The browser automatically generates the JSON schema from your form structure — input types, names, required fields, select options, min/max constraints. The agent gets a typed tool contract without you writing a single line of schema code.

This is honestly brilliant. If you have an existing website with forms, you can make it agent-accessible by adding two HTML attributes. That's it.

Building Something Real: A Task Manager with 3 WebMCP Tools

Let me walk through something more practical. I built a simple task manager that exposes three tools to AI agents:

// Tool 1: Add a task
navigator.modelContext.registerTool({
  name: 'add_task',
  description: 'Creates a new task in the task list. Returns the created task with its ID.',
  inputSchema: {
    type: 'object',
    properties: {
      title: { type: 'string', description: 'The task title' },
      priority: { 
        type: 'string', 
        enum: ['low', 'medium', 'high', 'urgent'],
        description: 'Priority level' 
      },
      due_date: { 
        type: 'string', 
        description: 'Due date in YYYY-MM-DD format (optional)' 
      }
    },
    required: ['title']
  },
  execute: async ({ title, priority = 'medium', due_date }) => {
    const task = {
      id: crypto.randomUUID(),
      title,
      priority,
      due_date: due_date || null,
      completed: false,
      created_at: new Date().toISOString()
    };

    tasks.push(task);
    renderTasks(); // Update the UI
    return { success: true, task };
  }
});

// Tool 2: List tasks with filters
navigator.modelContext.registerTool({
  name: 'list_tasks',
  description: 'Returns all tasks, optionally filtered by completion status or priority.',
  inputSchema: {
    type: 'object',
    properties: {
      status: { 
        type: 'string', 
        enum: ['all', 'pending', 'completed'],
        description: 'Filter by status' 
      },
      priority: { 
        type: 'string', 
        enum: ['low', 'medium', 'high', 'urgent'],
        description: 'Filter by priority' 
      }
    }
  },
  execute: async ({ status = 'all', priority } = {}) => {
    let filtered = [...tasks];

    if (status === 'pending') filtered = filtered.filter(t => !t.completed);
    if (status === 'completed') filtered = filtered.filter(t => t.completed);
    if (priority) filtered = filtered.filter(t => t.priority === priority);

    return { 
      total: filtered.length, 
      tasks: filtered 
    };
  }
});

// Tool 3: Complete a task
navigator.modelContext.registerTool({
  name: 'complete_task',
  description: 'Marks a specific task as completed by its ID.',
  inputSchema: {
    type: 'object',
    properties: {
      task_id: { type: 'string', description: 'The UUID of the task to complete' }
    },
    required: ['task_id']
  },
  execute: async ({ task_id }) => {
    const task = tasks.find(t => t.id === task_id);
    if (!task) return { error: 'Task not found' };

    task.completed = true;
    renderTasks(); // Update the UI in real-time
    return { success: true, task };
  }
});

What's happening here is subtle but important: when an AI agent calls add_task, the UI updates in real-time because renderTasks() runs inside the user's browser. The user watches their task list populate while the agent works. The agent and the human share the same view. This is fundamentally different from a headless MCP server processing requests in the background.

Testing Your Tools (Without an Agent)

You don't need a full AI agent to test your WebMCP implementation. Use the programmatic API:

// List all registered tools
const tools = await navigator.modelContext.getTools();
console.log('Registered tools:', tools.map(t => t.name));

// Execute a tool manually
const result = await navigator.modelContext.executeTool(
  'add_task', 
  { title: 'Write WebMCP article', priority: 'urgent' }
);
console.log('Result:', result);

Or use the Model Context Tool Inspector extension — it gives you a visual panel to browse tools, fill in parameters, and hit "Execute". During development, this was my most-used tool.

The Security Model: What You Need to Know

WebMCP isn't a free-for-all. The security considerations are thoughtful:

Same-Origin Policy: Tools are strictly isolated by origin. Cross-origin tools cannot be accessed, enumerated, or invoked — even from iframes. Third-party widgets can't interfere with your app's tools.
CSP Integration: WebMCP respects Content Security Policy headers. If your CSP blocks inline scripts, tool registration via inline <script> tags won't work (use external scripts instead).
HTTPS Required: Tools can only be registered on secure origins. No HTTP.
Human-in-the-Loop: This is the big one. The spec provides a requestUserInteraction() API for sensitive or destructive actions (payments, deletions, data exports). When invoked, the browser — not your website — renders a native consent dialog. Your site cannot suppress, restyle, or auto-click it. It's a mandatory choke-point.

   execute: async (params, { requestUserInteraction }) => {
     // For dangerous operations, require explicit user consent
     await requestUserInteraction(async () => {
       return `Delete all ${params.count} items? This cannot be undone.`;
     });
     // Only reaches here if user approved
     return await deleteItems(params.count);
   }

Agent Traceability: When an agent invokes a tool that makes HTTP requests, Chrome automatically adds a Sec-WebMCP-Agent: 1 header. On the server side, form submissions set SubmitEvent.agentInvoked = true. This means you can detect agent traffic, set up agent-specific rate limits, and build audit logs.
Permission-First Design: Nothing is exposed by default. You must explicitly call registerTool() to make anything available. It follows a least-privilege model and supports Permissions Policy headers for fine-grained access control.

This is a sane, web-platform-native security model. It doesn't invent new trust boundaries — it inherits the browser's existing ones and adds agent-specific layers on top.

What I Think Is Underrated About WebMCP

1. Zero-Auth for Agents

This is the killer feature that nobody's talking about enough.

With traditional MCP, you need to solve authentication for the AI agent separately — API keys, OAuth flows, token management. With WebMCP, the agent inherits the user's existing session. If the user is logged into your app, the agent can act on their behalf, within the same session, using the same cookies.

No new auth infrastructure. No token management. No security review for a new API surface. The user's login is the auth.

2. Progressive Enhancement

WebMCP follows the web platform's best tradition: your site works fine without it. The if ('modelContext' in navigator) check means you can ship WebMCP tools today — users on older browsers simply don't see them. No polyfills, no graceful degradation code. It's additive.

3. The Declarative API Lowers the Bar Massively

Not every web developer wants to write JSON Schema by hand. The <form toolname="..." tooldescription="..."> approach means a WordPress theme developer, a Shopify store owner, or a junior dev on their first project can make their site agent-ready. This is how you get ecosystem adoption.

What's Still Missing (Honest Take)

No technology ships perfect. Here's what I noticed during my hands-on time:

Tool Discovery is Unclear

How does an agent find which websites have WebMCP tools? There's no registry, no manifest file, no equivalent of robots.txt for agents. Right now, the agent has to navigate to your page first, and then discover tools. That's fine for "help me on this website" flows, but limits proactive agent discovery.

No Inter-Tool Dependencies

What if complete_task should only be callable after list_tasks returns results? There's no way to express tool ordering or dependencies in the schema. Agents have to figure this out from descriptions alone.

Debugging Could Go Further

The DevTools WebMCP pane is solid for basic inspection, but there's no integration with the Performance timeline yet. You can't profile tool execution latency alongside your app's rendering pipeline. For a spec aiming at production adoption, this will need to come.

The Origin Trial Boundary

This is experimental. Chrome 149 only. The spec lives in the W3C Web Machine Learning Community Group (repo: github.com/webmachinelearning/webmcp) — which is a Community Group Report, not a formal W3C Standard or even on the Standards Track yet. No Firefox, no Safari. If you're building production features on this today, you're building on sand. If you're prototyping and providing feedback to shape the spec — you're in exactly the right place.

Who Should Care About This Today

If you are...	What to do
A web developer	Prototype one tool. Get familiar with `registerTool()`. Ship behind a feature flag.
An e-commerce dev	The declarative form API is built for you. `<form toolname="search_products">` is a one-line upgrade.
A framework author	Start thinking about WebMCP primitives. React/Vue/Svelte components that auto-register tools will be huge.
A product manager	Know that your website will have two interfaces soon: one for humans, one for agents. Plan accordingly.

The Bigger Picture

WebMCP isn't just a Chrome API. It's a signal about where the web is heading.

For 30 years, the web has been a visual medium — we write HTML for human eyes, style it with CSS for human aesthetics, and add JavaScript for human interactions. WebMCP introduces a parallel interface layer: the same page, the same code, but now also speaking machine.

The websites that embrace this won't just be "AI-friendly" — they'll be the ones that AI agents actually recommend, prefer, and route users toward. Because an agent that can reliably call book_flight({ from: 'SFO', to: 'NRT', date: '2026-07-01' }) will always choose that over scraping a booking page and hoping the DOM hasn't changed.

The question isn't whether the web will become agent-ready. It's whether your website will be ready when the agents arrive.

Start with navigator.modelContext.registerTool(). Start today.

I'm Nilam — a developer who builds things and occasionally writes about them. If you found this useful, the 💜 reaction helps this article reach more developers. Got questions about WebMCP? Drop them in the comments — I'll answer everything I can from my hands-on experience.

The Anatomy of a Self-Improving AI Agent — How Hermes Agent's Closed Learning Loop Actually Works

Nilam Bora — Sat, 23 May 2026 15:00:23 +0000

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent

Every AI agent framework solves the same problem: how to make an LLM use tools. Hermes Agent asks a different question entirely — how do you make an LLM that used a tool today use it better tomorrow?

That question sounds simple. The engineering required to answer it is not.

I've spent the past few weeks dissecting Hermes Agent's architecture, reading through its source code, studying the GEPA evolution pipeline, and mapping how its three-layer memory system actually works under the hood. What I found isn't just another agentic wrapper around API calls — it's a genuinely novel approach to building AI systems that compound in capability over time.

This article is a technical deep dive. No "getting started" tutorial, no surface-level overview. We're going straight into the internals — the Closed Learning Loop, the SKILL.md format, the progressive disclosure system that keeps token costs sane, and the DSPy + GEPA pipeline that lets Hermes evolve its own skills without a single GPU.

Let's crack it open.

The Goldfish Problem: Why Most Agent Frameworks Start From Zero Every Time

Here's a thought experiment. You ask an AI agent to scrape a website, parse the data, and save it to a CSV. It takes 12 tool calls, hits two errors, retries with different selectors, and eventually succeeds.

Tomorrow, you ask it to scrape a different website.

It starts from scratch. Same errors. Same retries. Same 12 tool calls. It learned nothing from yesterday.

This is what I call the Goldfish Problem — and it plagues every major agentic framework:

Framework	Mental Model	Learns From Past Tasks?
LangChain / LangGraph	Graph-based state machine	❌ No — state resets per workflow
CrewAI	Role-based multi-agent teams	❌ No — crews are stateless
AutoGen	Conversational multi-agent dialogue	❌ No — conversation history only
Hermes Agent	Persistent self-improving runtime	✅ Yes — Closed Learning Loop

LangGraph gives you surgical control over state machines. CrewAI lets you prototype multi-agent teams fast. AutoGen enables agent-to-agent dialogue. These are good tools for their intended use cases.

But none of them treat past execution as training data for future execution. They orchestrate. They don't learn.

Hermes Agent does both.

The Closed Learning Loop — Execute, Evaluate, Extract, Retrieve

The core innovation in Hermes Agent is a four-phase cycle that Nous Research calls the Closed Learning Loop. Here's how it works:

┌──────────────────────────────────────────────────┐
│                                                  │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│   │          │    │          │    │          │   │
│   │ EXECUTE  │───▶│ EVALUATE │───▶│ EXTRACT  │  │
│   │          │    │          │    │          │   │
│   └──────────┘    └──────────┘    └──────────┘  │
│        ▲                               │         │
│        │          ┌──────────┐         │         │
│        │          │          │         │         │
│        └──────────│ RETRIEVE │◀────────┘         │
│                   │          │                    │
│                   └──────────┘                    │
│                                                  │
└──────────────────────────────────────────────────┘

Phase 1: Execute

The agent performs a task using its available tools — web search, browser automation, terminal commands, file operations, code execution. Standard agentic behavior. Nothing unusual here.

Phase 2: Evaluate

After the task completes, the agent doesn't just move on. It reviews its own execution trace — every tool call, every reasoning step, every error and recovery. It asks: What worked? What failed? What took more steps than necessary?

This is the critical divergence from other frameworks. Most agents treat completion as the end. Hermes treats it as the beginning of learning.

Phase 3: Extract

If the agent identifies a reusable pattern — a sequence of tool calls that solved a class of problems, an error-handling strategy that recovered gracefully, a workflow that could be templated — it codifies that pattern into a Skill.

Skills are written as Markdown files (SKILL.md) with structured YAML frontmatter. They're human-readable, version-controllable, and composable. More on this in the next section.

Phase 4: Retrieve

The next time the agent encounters a similar task, it searches its skill library before acting. If a relevant skill exists, it loads the skill's instructions and follows the proven workflow — skipping the trial-and-error phase entirely.

Then the cycle repeats. The skill gets refined. Edge cases get covered. The agent gets measurably faster and more reliable at recurring tasks.

The neuroscience parallel is striking. This is essentially how human procedural memory works. You don't consciously think through every step of riding a bike — your brain retrieves a compressed motor program built from hundreds of past attempts. Hermes does the same thing with tool-calling sequences.

And critically, this is architecturally different from the three common approaches to "giving agents memory":

RAG retrieves information. Hermes retrieves procedures.
Fine-tuning bakes knowledge into weights. Hermes stores it in editable text files.
Prompt caching saves tokens. Hermes saves entire workflows.

SKILL.md — The Anatomy of an Agent's Procedural Memory

This is where Hermes Agent gets genuinely interesting from an engineering perspective. Let's look at what a Skill actually is.

Every skill lives in ~/.hermes/skills/ as a directory containing a SKILL.md file. Here's a realistic example:

---
name: web-scraping-with-retry
description: >
  Scrapes web pages using browser automation with intelligent retry
  logic, selector fallbacks, and anti-detection measures.
version: 1.2.0
author: hermes-agent (auto-generated)
license: MIT
metadata:
  tags: [web-scraping, browser, automation, retry]
  related_skills: [data-parsing-csv, browser-stealth-config]
  platform: [linux, macos]
  min_tools: [browser, file]
  trigger_conditions:
    - "user requests web scraping"
    - "task involves extracting data from websites"
  success_rate: 0.94
  total_executions: 47
---

# Web Scraping with Retry

## Overview
This skill handles web scraping tasks with built-in resilience.
It was originally created after a session where naive scraping
failed due to dynamic content loading and rate limiting.

## Procedure

### Step 1: Assess the Target
Before scraping, check for:
- robots.txt restrictions
- Dynamic rendering (SPA vs server-rendered)
- Rate limiting headers (X-RateLimit-*)

### Step 2: Choose Strategy
- **Static HTML**: Use `fetch_url` tool directly
- **Dynamic/SPA**: Use `browser_navigate` + wait for selectors
- **Rate-limited**: Add 2-5 second delays between requests

### Step 3: Implement Fallback Selectors
Always prepare 3 levels of selectors:
1. Semantic: `[data-testid="product-price"]`
2. Structural: `.product-card > .price-container > span`
3. XPath fallback: `//div[contains(@class,"price")]`

### Step 4: Error Recovery
If a selector fails:
- Wait 3 seconds and retry (content may still be loading)
- Try next fallback selector
- If all selectors fail, capture a screenshot and report

## Anti-Patterns to Avoid
- Never scrape without checking robots.txt first
- Never use fixed sleep() — use element-wait conditions
- Never store raw HTML when structured data is available

Notice what's happening here. This isn't a prompt template. It isn't a function signature. It's a structured decision-making guide that encodes judgment — when to use which strategy, what to watch out for, how to recover from specific failure modes.

The agent wrote this itself, after 47 executions of scraping tasks, refining the procedure each time. The success_rate: 0.94 in the metadata isn't a fabrication — it's tracked from actual execution outcomes.

Progressive Disclosure: How Hermes Keeps Token Costs Sane

Here's the engineering problem: if you have 200 skills, you can't load all of them into the context window. That would burn tens of thousands of tokens before the agent even starts working.

Hermes solves this with a three-level progressive disclosure system:

Level 0 — The Index (Always Loaded)
The agent sees only skill names and one-line descriptions. This is a compact lookup table that costs maybe 500 tokens for dozens of skills:

Available Skills:
- web-scraping-with-retry: Scrapes web pages with retry logic and fallbacks
- data-parsing-csv: Parses messy CSV/TSV data into clean structured formats
- git-pr-review: Reviews pull requests for code quality and security issues
- email-digest-summary: Summarizes email threads into actionable bullet points

Level 1 — Full Skill (Loaded On Demand)
When the agent decides a skill is relevant to the current task, it loads the full SKILL.md content. Now it has the complete procedure, anti-patterns, and decision logic.

Level 2 — Deep References (Rare)
Some skills include a references/ subdirectory with example scripts, configuration templates, or API documentation. These are loaded only when the agent needs specific implementation details — like a regex pattern for parsing dates, or a template for a specific API's authentication flow.

This three-tier system means the token cost scales with relevance, not with library size. An agent with 500 skills pays roughly the same base cost as one with 50 — only the skills actually needed get loaded.

When Does the Agent Create a New Skill?

Not every task becomes a skill. Hermes uses specific triggers:

Complexity threshold: The task required 5+ tool calls to complete
Error recovery: The agent hit errors and successfully recovered — that recovery pattern is worth saving
Novelty detection: The task doesn't match any existing skill in the library
User confirmation: In some configurations, the agent asks the user before creating a skill

The threshold is intentionally high. You don't want a skill for "read a file" — that's trivial. You want skills for multi-step procedures where the agent's learned judgment actually saves time.

Self-Evolution: How DSPy + GEPA Optimize Skills Without a GPU

Manual skill creation through the Closed Learning Loop is powerful. But Nous Research pushed further with an automated evolution pipeline called GEPA — Genetic-Pareto Prompt Evolution.

This is where Hermes Agent stops being "just" an agent framework and starts looking like an AI research platform.

The GEPA Pipeline (5 Stages)

Stage 1: Execution Trace Collection
GEPA reads the agent's SQLite database of past sessions. It doesn't just look at pass/fail outcomes — it analyzes the full reasoning trace: every tool call, every decision branch, every error message and recovery attempt.

Stage 2: Reflective Failure Analysis
An LLM examines failed traces and generates what Nous calls "Actionable Side Information" — a diagnosis of why the skill failed. Not "it didn't work," but "the CSS selector assumed a class name that changed between page loads" or "the retry logic didn't account for 429 rate-limit responses."

Stage 3: Targeted Mutation
Based on the failure analysis, GEPA proposes specific text edits to the SKILL.md file. These aren't random perturbations — they're targeted fixes and improvements:

  ### Step 4: Error Recovery
  If a selector fails:
  - Wait 3 seconds and retry
+ - Check for 429 status code — if present, back off exponentially
+   (5s, 15s, 45s) before retrying
  - Try next fallback selector
  - If all selectors fail, capture a screenshot and report

Stage 4: Multi-Candidate Evaluation
GEPA generates multiple mutated variants of the skill and evaluates each against a test suite — either synthetic test cases or replayed real-world scenarios. It uses Pareto optimization across multiple dimensions (success rate, token efficiency, execution speed) rather than optimizing a single metric.

This is critical. A skill that succeeds 100% of the time but uses 50,000 tokens per run isn't necessarily better than one that succeeds 95% of the time using 5,000 tokens. Pareto selection lets you keep multiple specialized variants.

Stage 5: Human-in-the-Loop Review
The winning variant isn't automatically deployed. GEPA generates a Pull Request with the proposed changes, a diff showing exactly what changed, and the evaluation metrics that justify the change. A human reviews and approves before the skill goes live.

This safety rail is non-negotiable. Without it, you risk skill hallucination — the agent convincing itself that a broken procedure works because it evaluated against flawed test cases.

Why This Isn't Reinforcement Learning

GEPA looks superficially like RL, but it's fundamentally different:

Property	Traditional RL	GEPA
Optimization target	Neural network weights	Plain text (Markdown)
Requires GPU	Yes	No — LLM API calls only
Transparency	Black box	Fully readable diffs
Rollouts needed	Thousands	~35x fewer than GRPO
Iteration unit	Training step	Git commit

Nous Research calls this "evolutionary search over text" — and it's a surprisingly effective paradigm. You get the benefits of automated optimization without the opacity of weight updates.

The Broader Architecture: What Powers the Loop

The Closed Learning Loop doesn't exist in isolation. It's supported by a comprehensive runtime architecture:

Three-Layer Memory System

Layer 1 — Core Memory: SOUL.md (persona), MEMORY.md (facts), USER.md (preferences). Loaded every session. Think of it as the agent's "identity."
Layer 2 — Procedural Memory: The skill library. Loaded on demand via progressive disclosure. This is where the learning loop writes to.
Layer 3 — Episodic Memory: SQLite database with full session history, FTS5 full-text search, and LLM-summarized recall. The raw material that GEPA mines for improvements.

Model Agnostic by Design

Hermes doesn't lock you into a provider. It works with OpenAI, Anthropic, OpenRouter, Nous Portal, and local models via Ollama or vLLM. Swap models without changing skills — the procedural knowledge is in Markdown, not in weights.

Multi-Platform Gateway

A single Hermes process can serve you across CLI, Telegram, Discord, Slack, WhatsApp, Signal, and Email simultaneously. Your skills, memory, and learning history follow you across every surface.

~70 Built-In Tools + MCP Extensibility

Web search, browser automation (with anti-detection via Camofox), terminal, file operations, code execution, vision, image generation, and a cron scheduler for autonomous recurring tasks. Plus Model Context Protocol (MCP) support for dynamically loading external tool servers.

Sub-Agent Delegation

For complex workflows, Hermes spawns isolated child agents with their own conversation history and toolsets. Only the final result returns to the parent — preventing context window flooding while enabling parallelizable work.

When Should You Actually Use Hermes Agent?

Not every problem needs a self-improving agent. Here's an honest decision framework:

Use Hermes Agent when:

✅ You need a persistent assistant that handles recurring tasks (daily summaries, weekly reports, ongoing monitoring)
✅ You want an agent that genuinely gets better at your specific workflows over time
✅ You care about data privacy and want full self-hosted control
✅ You need multi-platform access (message your agent from Telegram, get results in Slack)
✅ You're doing AI research and want to experiment with skill evolution

Use something else when:

➡️ One-off pipeline orchestration → LangGraph gives you more control over deterministic state machines
➡️ Quick multi-agent prototyping → CrewAI's role-task-crew abstraction is faster to set up
➡️ Agent-to-agent collaboration experiments → AutoGen's conversational paradigm is more natural
➡️ You need a library, not a runtime → Hermes is an always-on process, not a function you call

The honest truth: if you're building a one-shot data pipeline that runs once and dies, Hermes is overkill. Its value compounds over time. The longer it runs, the more skills it accumulates, the more efficient it becomes. Day one, it's just another agent. Day ninety, it's an agent that knows your infrastructure, your coding style, your failure modes.

The Compounding Agent Thesis

Here's what I think most people miss about Hermes Agent, and why I believe the Closed Learning Loop matters beyond this one framework:

The real moat in AI agents isn't who has the best LLM. Models are converging — GPT-4o, Claude, Gemini, Llama, they're all remarkably capable. The real moat is accumulated institutional knowledge.

Every company, every developer, every team has specific procedures, preferences, failure modes, and tribal knowledge that no foundation model can know out of the box. The agent that captures and operationalizes that knowledge — automatically, incrementally, without manual prompt engineering — wins.

Hermes Agent's Closed Learning Loop is the first serious implementation of this idea at the framework level. It treats every task completion not as an endpoint, but as a data point for improvement. Every error recovered from becomes a guardrail. Every successful workflow becomes a reusable template. Every edge case encountered becomes a documented anti-pattern.

The result isn't just an agent that runs tools. It's an agent that builds its own playbook — and gets better every time you use it.

That's not incremental improvement. That's a fundamentally different category of AI system.

Hermes Agent is open-source under the MIT license. You can find the code at github.com/NousResearch/hermes-agent, the self-evolution pipeline at github.com/NousResearch/hermes-agent-self-evolution, and the official documentation at hermes-agent.org.

Gemma 4 Isn't an AI Model — It's a Deployment Spectrum, and That Changes Everything

Nilam Bora — Tue, 19 May 2026 06:31:54 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

The Question Nobody's Asking

Here's what I don't understand about the discourse around open-weight AI models in 2026:

Why does everyone keep reviewing them like they're standalone products?

"Gemma 4 scores X on MMLU." "Llama 4 has a 10 million token context window." "Phi-4 is the best at math per parameter." Every review reads like a spec sheet comparison. As if we're buying refrigerators.

But when I actually sat down with Gemma 4 — not the benchmarks, not the blog posts, but the actual model family, all four sizes — I realized that benchmarks were the wrong lens entirely. Google DeepMind didn't ship a model. They shipped a deployment spectrum. And that distinction, once you understand it, changes how you think about building with AI.

Let me explain what I mean.

The Old Mental Model Is Broken

For the past three years, the open-weight model conversation has been dominated by a single question: "Which model is the best?"

And "best" always meant the same thing: highest score on the hardest benchmark. We talked about models like they were Formula 1 cars — only one can win, and winning means being the fastest on the same track.

This mental model made sense in the GPT-3 era, when you had one model, it ran on a cloud server, and you called it via API. There was one deployment target. One hardware profile. One question.

But here's what's changed: AI isn't just an API anymore.

In 2026, AI runs in your browser. It runs on your phone. It runs on a Raspberry Pi monitoring your greenhouse. It runs on an air-gapped corporate laptop that's never seen the internet. It runs on a single consumer GPU at 3 AM because your side project can't afford cloud inference.

The deployment surface has shattered into a thousand fragments. And if your model family only gives you one size that works in one place — congratulations, you've built a sports car for a world that needs a vehicle fleet.

Gemma 4: One Brain, Four Bodies

This is where Gemma 4 does something I haven't seen any other model family do as deliberately. It doesn't give you one model in different sizes. It gives you four distinct architectures, each engineered from the ground up for a specific deployment reality.

Let me break this down, because the details matter:

E2B — The Spy in Your Pocket

2 billion effective parameters. Runs on a high-end phone. Runs on a Raspberry Pi 5 with 8GB of RAM. Runs without an internet connection.

The E2B isn't a watered-down version of the big model. It's a purpose-built edge model with native audio input — yes, it can process raw speech directly, no transcription pipeline needed — and 128K tokens of context. It uses a dense architecture with Per-Layer Embeddings (PLE), which means every layer gets its own learned embedding, allowing for richer representations at a fraction of the parameter count.

Think about what this means. You can build a voice-controlled assistant that runs entirely on a $75 single-board computer. No cloud calls. No API keys. No monthly bill. No data leaving the device. The user speaks, the model listens, reasons, and responds — all locally.

A year ago, this was a research demo. Today, it's a pip install away.

E4B — The Daily Driver

4 billion effective parameters. This is the model for laptops and mid-range hardware. It keeps everything the E2B has — audio, images, 128K context — but adds enough reasoning depth to handle tasks that would trip up the smaller sibling.

I think of the E4B as the Toyota Corolla of AI models. Not flashy. Not headline-grabbing. But it starts every morning, handles whatever you throw at it, and does it on hardware that hundreds of millions of people already own.

If you're building a developer tool that needs to work offline — a local code assistant, a documentation summarizer, an accessibility layer for audio content — the E4B is probably your model. Not because it's the smartest. Because it's the one your users can actually run.

26B MoE — The Clever Optimizer

This one is fascinating. 26 billion total parameters, but only 3.8 billion active per token.

The Mixture of Experts (MoE) architecture means the model has specialized "expert" subnetworks, and a router decides which experts to activate for each token. So you get the knowledge capacity of a 26B model with the inference cost of a 4B one.

In practical terms: this model runs on a single consumer GPU. A used RTX 3090 from eBay. An M-series MacBook with 32GB of unified memory. It supports video input (up to 60 seconds), 256K context, and reasoning mode.

The 26B MoE is for people who need real intelligence but can't (or won't) rent a data center. Indie devs. Startups pre-revenue. Researchers at universities that aren't Stanford. The vast majority of builders on the planet.

31B Dense — The Heavyweight

31 billion dense parameters. Full video support. 256K context. This is the model that goes toe-to-toe with GPT-4-class systems on reasoning benchmarks — ranked #3 on Arena AI's text leaderboard at release.

But here's the part that doesn't show up in benchmarks: it runs on a single workstation. Not a cluster. Not a multi-GPU rig. One machine. The kind that sits under a developer's desk.

In the Llama 4 world, getting frontier-class reasoning means deploying Maverick — a 400B parameter MoE behemoth that needs multi-GPU servers. In the Gemma 4 world, you download a GGUF file, point Ollama at it, and you're having a conversation with a model that matches or beats most closed-source alternatives.

Why the Spectrum Matters More Than Any Single Model

Here's the argument I want to make, and I want to make it clearly:

The most important innovation in Gemma 4 is not any individual model. It's that all four models share the same training lineage, the same capabilities framework, and the same API surface.

This means you can architect a system where:

The E2B runs on a user's phone, handling real-time voice commands and basic reasoning offline
The E4B runs on a laptop, processing documents and generating drafts without cloud dependencies
The 26B MoE runs on a local server, handling complex multi-step workflows with visual understanding
The 31B Dense runs on a workstation (or cloud instance during peak), providing frontier-quality reasoning when it matters most

And the code that talks to all of them is nearly identical. The prompts transfer. The function-calling schema transfers. The system instructions transfer. You're not learning four different APIs or managing four incompatible prompt formats. You're working with a single coherent model family that scales from your pocket to your server rack.

This is what I mean by "deployment spectrum." It's not four models. It's one intelligence at four resolution levels, deployable across the entire range of hardware that exists in the real world.

The Apache 2.0 Bombshell

Let me address the elephant in the room: licensing.

Gemma 4 ships under Apache 2.0. Not a "Community License" with asterisks about monthly active users. Not a custom license that lawyers need to review. Apache 2.0 — the same license as Kubernetes, TensorFlow, and Android.

For individual developers, this means: do whatever you want. Fine-tune it. Distill it. Ship it in a commercial product. Embed it in hardware. No phone call to Google required.

For enterprises, this means something even more important: digital sovereignty. You can deploy Gemma 4 on air-gapped servers inside your own data center. Patient data stays in the hospital. Financial data stays in the bank. Legal documents stay in the firm. The model runs where your data lives, not the other way around.

In a world where data regulations are tightening every quarter and "but the cloud provider says they won't look at your data" is no longer a satisfactory answer to compliance teams — Apache 2.0 isn't a feature. It's a prerequisite.

What Everyone Else Is Getting Wrong

I've read about thirty "Gemma 4 review" articles in the past few weeks. Most of them fall into one of three categories:

Benchmark table → "It's good" — Useful but boring. Scores without context.
"I ran it locally and it worked" — Great, but a thousand people have written that article.
"Gemma 4 vs. Llama 4 vs. Phi-4" — Comparison charts that miss the point because they compare each model family's flagship instead of comparing deployment strategies.

Here's what I think most people are missing:

The real competition isn't model vs. model. It's ecosystem vs. ecosystem.

When you choose Gemma 4, you're not just choosing a model. You're choosing:

Apache 2.0 — vs. Llama's Community License that restricts companies above 700M MAU, and requires specific usage obligations
Native multimodality at every size — vs. competitors where vision/audio is only available at the largest tier
Google AI Studio + Hugging Face + Kaggle + OpenRouter — four free access channels vs. competitors with one or two
Function calling and structured output baked in — vs. models where agentic features are fine-tuned on top
The Gemini API compatibility — meaning code you write for Gemma works with Gemini when you need to scale up

This ecosystem coherence is a strategic advantage that doesn't show up on any leaderboard. But it shows up in your codebase, your deployment pipeline, and your total cost of ownership.

The Reasoning Mode: Not a Gimmick

Every model in the Gemma 4 family — including the tiny E2B — supports a reasoning mode where the model generates explicit chain-of-thought tokens before producing its final answer. Up to 4,000 tokens of "thinking out loud."

I've seen people dismiss this as a gimmick, a marketing checkbox to compete with OpenAI's reasoning models. But here's why it matters for practical builders:

Reasoning mode gives you observability into the model's decision-making process.

When your agent takes a wrong action, you can look at the reasoning trace and understand why. Was it a bad premise? A logical error? A hallucination in the intermediate steps? This isn't just useful for debugging — it's essential for building trust in autonomous systems.

And the fact that even the E2B supports it means you can have an on-device agent that not only acts, but explains itself. On a phone. Offline. Under Apache 2.0.

Try finding that combination anywhere else in the market.

Where Gemma 4 Falls Short (Honest Assessment)

I'd be dishonest if I didn't address the gaps. No model family is perfect, and pretending otherwise doesn't help anyone.

1. Context Window Isn't King

Gemma 4's largest models top out at 256K tokens. That's generous by most standards, but Llama 4 Scout offers 10 million tokens. If your use case involves ingesting entire codebases, processing book-length documents in a single pass, or building RAG systems over massive corpora — Llama 4 has a structural advantage that Gemma 4 can't match.

2. The 31B Dense Is Slower Than Expected Locally

The 31B model was trained with Multi-Token Prediction (MTP) heads designed to accelerate inference. But in practice, these MTP heads are stripped from the public GGUF weights, meaning local inference speeds are slower than the architecture suggests. If you're deploying the 31B for real-time interactive use, expect to invest in quantization tuning and hardware optimization.

3. Community Ecosystem Is Still Young

Compared to Llama's massive fine-tuning ecosystem and Hugging Face's years of accumulated tooling around Meta's models, Gemma 4's community is smaller. Fewer LoRA adapters. Fewer domain-specific fine-tunes. Fewer "I tried X and here's what happened" blog posts (ironically, including this one).

This will change with time and adoption, but right now, if you need a pre-built fine-tune for medical, legal, or financial domains, you'll find more options in the Llama ecosystem.

4. Video Support Is Limited

The workstation models (26B and 31B) support video input, but capped at 60 seconds at 1 FPS. For short clips and thumbnails, this is fine. For anything resembling real video analysis — security footage, lecture recordings, sports clips — you'll need something else or a creative chunking strategy.

My Actual Recommendation

Here's what I'd tell a developer who asks me "Should I use Gemma 4?"

If you're building something that needs to run locally — on a laptop, on a phone, on edge hardware — Gemma 4 is the best option available today. Not because any single model is the absolute best at any single benchmark, but because the family gives you a coherent path from prototype to production across the entire deployment spectrum.

If you're building a cloud-only application where cost-per-token is your primary concern and you'll never need to run anything locally — you could pick any of the major open-weight families and be fine. The differences at the top end are marginal.

If you need million-token context windows — Llama 4 Scout is your model. Full stop.

If you need the absolute smallest model for the most constrained hardware — Phi-4 Mini and Gemma 4 E2B are both excellent, but Gemma 4's multimodal capabilities (especially native audio) give it an edge for real-world edge deployments.

The right answer, as always, depends on where your code needs to run. And that's precisely the point. Gemma 4 is the first model family that treats that question as fundamental rather than incidental.

The Bigger Picture

Here's the thought I keep coming back to:

The history of computing is a history of intelligence moving closer to the user.

Mainframes centralized everything. PCs put a computer on every desk. Smartphones put one in every pocket. The cloud briefly reversed the trend — pulling compute back to data centers — but the pendulum is swinging again.

AI has been a cloud-first technology for its entire commercial life. Every ChatGPT conversation, every Midjourney image, every Claude response you've ever received — processed in a data center hundreds or thousands of miles away. Your data travels there, gets processed, and the result comes back. You never own the model. You never control the pipeline. You're renting intelligence.

Gemma 4 is part of a broader movement to change that. Not "local AI" as a novelty or a hobbyist pursuit, but local AI as a genuine alternative to the cloud-dependent default. A model that can reason, see, hear, and act — running on hardware you own, under a license that doesn't restrict you, processing data that never leaves your building.

We're not there yet. Local models are still behind frontier cloud models on the hardest tasks. The tooling is still maturing. The ecosystem is still growing.

But the gap is closing faster than anyone predicted. And Gemma 4 — with its four-model deployment spectrum, its Apache 2.0 license, and its "one family, runs everywhere" philosophy — is probably the strongest argument yet that the future of AI isn't exclusively in the cloud.

It's in your pocket. On your desk. In your server room. Wherever your users and your data actually are.

That's the revolution. Not a bigger number on a benchmark. A smarter model in more places.

What's your take? Are you building with Gemma 4 locally, or is cloud inference still the default for you? I'm especially curious about edge deployment stories — if you've gotten E2B or E4B running on unconventional hardware, I'd love to hear about it in the comments.

OpenClaw Isn't a Chatbot — It's the Unix of Personal AI

Nilam Bora — Fri, 24 Apr 2026 16:06:49 +0000

This is a submission for the OpenClaw Challenge.

I almost dismissed OpenClaw the first time I heard about it.

"Another AI wrapper," I thought. "Probably a ChatGPT skin with a Telegram bot glued on top." I'd seen a dozen of these. They all promise you a personal assistant, and they all end up being a slightly less convenient way to use the same chat interface you already have open in a browser tab.

I was wrong. Spectacularly, fundamentally wrong.

After spending real time with OpenClaw — reading the source, building skills, breaking things, rebuilding them — I've come to a conclusion that might sound ridiculous at first:

OpenClaw is doing for personal AI what Unix did for computing. And if you understand why, you'll understand why it matters far more than its viral popularity suggests.

Let me explain.

The Unix Philosophy, Briefly

In the 1970s, Ken Thompson and Dennis Ritchie built an operating system around a set of principles that felt almost radical at the time:

Do one thing and do it well. Each program should handle a single task.
Programs should work together. The output of one becomes the input of another.
Text is the universal interface. Everything communicates through plain, readable streams.
Build tools, not applications. Let users compose solutions from small pieces.

These ideas didn't just survive — they conquered. Every server running your favorite website, every phone in your pocket, every cloud instance spinning up right now — all of them trace their lineage back to these four principles.

The reason Unix won wasn't because it was the most powerful system. It won because it was the most composable system. And composability scales in ways that monoliths never can.

Now Look at OpenClaw

When you strip away the hype and the viral Twitter threads and the "I let an AI run my life for a week" clickbait, OpenClaw's architecture tells a remarkably familiar story.

Principle 1: Do One Thing Well — The Skill System

The fundamental unit of OpenClaw isn't a prompt. It's a skill.

A skill is a directory containing a SKILL.md file — a plain Markdown document with YAML frontmatter that tells the agent what a particular capability is, when to use it, and how to execute it.

---
name: morning_briefing
description: Compile and deliver a morning summary of calendar, weather, and top news.
---

# Morning Briefing

When the user asks for a morning update, or when triggered by the 7:00 AM schedule:

1. Check the user's Google Calendar for today's events
2. Fetch weather for the user's configured location
3. Pull top 3 headlines from configured news sources
4. Compile into a concise summary
5. Send via the user's preferred channel

That's it. No Python class hierarchy. No plugin interface to implement. No SDK to install. A skill is a document that describes a single capability in natural language, and the agent interprets and executes it.

This is grep. This is sort. This is wc. A small, focused tool that does one thing and does it well.

Principle 2: Programs Should Work Together — Composability

Here's where it gets interesting. Skills in OpenClaw aren't isolated. They compose.

Your morning_briefing skill calls the calendar, calls the weather API, calls the news source. But each of those could be its own skill too. You might have a google_calendar skill that handles all calendar interactions, a weather_lookup skill that knows how to query multiple weather providers, and a news_digest skill that curates headlines.

The morning briefing skill doesn't need to know how any of those work internally. It just needs to know they exist.

This is piping. This is cat access.log | grep 404 | sort | uniq -c | sort -rn. Small pieces, loosely joined, producing results that no single piece could achieve alone.

And the key insight is the same one Unix taught us fifty years ago: you don't need to predict in advance what combinations users will want. You give them sharp tools and the ability to compose, and they build things you never imagined.

Principle 3: Text Is the Universal Interface — Markdown All the Way Down

OpenClaw skills are Markdown. The agent's configuration is text files in a workspace directory. Communication happens through natural language over messaging platforms. Memory is stored as structured text.

There's no proprietary format. No binary blobs. No "export your workflow as a JSON file that only our platform can read." If you can read a text file, you can understand, modify, duplicate, and share any part of an OpenClaw setup.

This is profoundly important for a reason that goes beyond convenience. Text as interface means:

Version control works. Put your skills in a Git repo. Track changes. Roll back mistakes. Branch and experiment.
Sharing works. Send someone a SKILL.md file. They drop it in their workspace. Done.
Debugging works. When something goes wrong, you read the skill instructions. They're in English. There's no stack trace to decode, no minified JavaScript to untangle.

The Unix insight was that text streams were the lowest common denominator that everything could agree on. OpenClaw's insight is that natural language is the new text stream — the interface that both humans and AI models can read, write, and reason about.

Principle 4: Build Tools, Not Applications

This is the big one. This is where OpenClaw diverges from every other "personal AI" product on the market.

Siri is an application. Alexa is an application. Google Assistant is an application. They're monolithic systems built by large teams, with fixed capabilities, governed by product roadmaps decided in boardrooms you'll never enter. You use them. You don't build with them.

OpenClaw is a toolkit. It gives you:

A runtime that can execute skills
A messaging bridge to reach you wherever you are
A memory system to maintain context
A scheduler to run things without you

What you build on top is entirely up to you. There's no "approved skill store." There's no review process. There's no waiting for a product team to decide that your use case matters.

If you've ever felt the difference between using a Mac app and piping commands together in a terminal — that controlled, curated experience versus the raw, limitless power of composition — you already understand the difference between conventional AI assistants and OpenClaw.

What This Means in Practice

Let me move from philosophy to something concrete. Here's a real workflow I built with three skills, and it illustrates why the composable approach actually matters.

The Problem: I wanted my agent to monitor a GitHub repository for new issues, triage them based on labels and content, draft an initial response, and alert me via Telegram only if the issue looks like it needs my personal attention.

With a monolithic AI assistant, this is either impossible or requires some elaborate Zapier/n8n chain with brittle webhooks and API tokens scattered across three different platforms.

With OpenClaw, it's three skills:

Skill 1: github_watcher — Polls a repo for new issues on a cron schedule, stores raw issue data.

Skill 2: issue_triage — Reads new issues, classifies them (bug/feature/question/spam), estimates complexity, and decides if they need human attention based on configurable rules.

Skill 3: smart_notify — Takes triage results and sends me a Telegram message only for issues flagged as needing my input. Includes a summary, not the raw issue dump.

Each skill is a single SKILL.md file. Each one does one thing. They compose through the agent's natural ability to chain operations. And here's the kicker — I can reuse smart_notify for completely different workflows. It doesn't know or care that it's being fed GitHub issue data. It just knows how to decide whether something is worth interrupting me about.

Try doing that with Siri.

The Honest Risks

I'd be dishonest if I wrote a love letter without mentioning the risks, and you deserve the full picture.

OpenClaw with shell access is a loaded weapon. The same power that lets it manage your files, run scripts, and automate workflows also means a poorly written skill, a hallucinating model, or a prompt injection attack could do real damage. There are documented cases of agents deleting files they shouldn't have touched, sending messages that were never intended, and racking up API bills through runaway loops.

This is not hypothetical. This is real. And if you're going to use OpenClaw seriously, you need to:

Run it on an isolated machine. A Raspberry Pi, an old laptop, a cheap VPS. Never your primary workstation with your personal files and credentials.
Audit third-party skills before installing them. Read the SKILL.md. Understand what shell commands it might execute. If you wouldn't run a random bash script from the internet, don't install a random OpenClaw skill either.
Start with read-only skills. Build things that fetch and summarize before you build things that create and delete.

The Unix parallel holds here too, by the way. rm -rf / has existed since the 1970s. Power and danger are inseparable. The answer was never to remove the power — it was to teach people to use it wisely.

Where This Goes Next

If OpenClaw's trajectory follows the Unix playbook, here's what I think happens:

Short-term: The skills ecosystem explodes. We're already seeing community-built skills on ClawHub, but we're in the "early package manager" era — think npm circa 2012. Quality is inconsistent, discoverability is poor, but the velocity is real.

Medium-term: Conventions emerge. Right now every skill author structures their SKILL.md slightly differently. We'll see community standards solidify around things like: how to declare dependencies between skills, how to specify input/output formats, how to handle errors gracefully. This is the .bashrc and Makefile era.

Long-term: Composition protocols. When your OpenClaw agent can delegate tasks to my OpenClaw agent through a standard protocol — something like the Agent2Agent (A2A) protocol that Google just pushed to production — we'll have something genuinely new. Not just personal AI, but a network of personal AI agents, each specialized, each autonomous, composing together to handle tasks that no single agent could manage alone.

We're at the beginning of that curve. And if history is any guide, the people who learn to think in composable, tool-oriented ways today will have a massive advantage when the ecosystem matures.

Getting Started: The Three-Skill Rule

If you're new to OpenClaw and want to start building, here's my recommendation: build three skills before you build anything ambitious.

Skill 1: A fetcher. Something that pulls information from an external source — weather, calendar, RSS feed, API endpoint. This teaches you how skills interact with the outside world.

Skill 2: A processor. Something that takes data and transforms it — summarize, classify, filter, reformat. This teaches you how skills reason about information.

Skill 3: A notifier. Something that delivers a result to you — Telegram message, email draft, file write. This teaches you how skills close the loop.

Once you have these three, you've built a pipeline. And once you've built one pipeline, you understand the mental model. Everything after that is just variation and refinement.

The installation is straightforward:

curl -fsSL https://openclaw.ai/install.sh | bash

The onboarding wizard handles model configuration and messaging channel setup. And from there, every skill you build is just a Markdown file in your workspace:

mkdir -p ~/.openclaw/workspace/skills/my-first-skill

Drop a SKILL.md in there, restart the gateway, and you're live. The entire feedback loop from idea to running skill can be under five minutes.

Final Thought

There's a quote from Doug McIlroy, the inventor of Unix pipes, that I think about a lot:

"This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together."

OpenClaw didn't invent this philosophy. It inherited it. And by applying these decades-old principles to the newest frontier in computing — autonomous AI agents — it's produced something that feels genuinely different from everything else in the market.

Not because it's the most powerful AI system. Not because it uses the best model. But because it gives you tools instead of an application, composition instead of configuration, and ownership instead of subscription.

The last time software was built this way, we got Linux, the internet, and everything that runs on top of them.

I'm not saying OpenClaw is the next Linux. That would be absurd.

But I am saying it's built on the same ideas. And those ideas have a pretty good track record.

If you've built something with OpenClaw or have thoughts on the composability angle, I'd genuinely love to hear about it. Drop a comment or find me on the DEV community — let's compare notes.

Forget the Flashy Keynote — The A2A Protocol Is the Real Revolution From Google Cloud Next '26

Nilam Bora — Thu, 23 Apr 2026 14:11:40 +0000

This is a submission for the Google Cloud NEXT Writing Challenge

Everyone's Talking About the Wrong Thing

Google Cloud Next '26 dropped like a thunderstorm. The internet exploded over the Apple partnership, the slick Gemini Enterprise Agent Platform demos, and 8th-Gen TPUs. And look — those are legitimately exciting. But after watching the keynotes, reading the docs, and spending a few hours actually digging into what shipped, I'm convinced the announcement that will reshape how we build software didn't even make the front page of Hacker News:

The Agent2Agent (A2A) Protocol is now in production at 150+ organizations, it's at v1.2, and it's officially governed by the Linux Foundation.

If you're a developer who builds anything that talks to other services — and let's be honest, that's all of us — this is the announcement you should be losing sleep over.

A Quick Primer: What Is A2A, and Why Should You Care?

Think about how services communicate today. We write REST endpoints. We wrangle GraphQL schemas. We negotiate API contracts across teams, build custom SDKs, and maintain brittle integration layers that eat 20-40% of our development cycles. Now scale that problem to AI agents.

In an agentic world, you don't just have your service calling their API. You have autonomous systems — agents — that need to discover each other, negotiate capabilities, delegate tasks, and report results, all without a human choreographing every step.

That's the problem A2A solves. And here's how it works, in plain English:

1. Agent Cards: The Business Card for Software

Every A2A-compliant agent publishes a discoverable "Agent Card" at a well-known URL (/.well-known/agent-card.json). This card describes:

Who the agent is — name, description, version
What it can do — its capabilities and skills
How to talk to it — endpoints, auth requirements, supported protocols

Think of it as a combination of OpenAPI spec and DNS record, but purpose-built for autonomous AI systems. Any agent on the network can discover and evaluate another agent's capabilities without a human ever writing an integration doc.

2. Communication Over Familiar Rails

A2A doesn't reinvent the wheel. It uses HTTP/HTTPS, JSON-RPC 2.0, and Server-Sent Events (SSE) for streaming. If you've written a webhook handler this decade, you already know 80% of the transport layer.

This is a deliberate and brilliant design choice. By building on existing web infrastructure, A2A inherits decades of tooling: load balancers, API gateways, observability stacks, WAFs — all of it works out of the box.

3. Task Lifecycle Management for Long-Running Work

Here's where A2A separates itself from anything that came before. It includes a structured task lifecycle with explicit states:

pending → task received, queued for execution
in-progress → agent is actively working
completed → results ready
failed → something broke, and here's why

This isn't just status tracking. It's a contract that enables agents to manage complex, multi-step workflows across organizational boundaries. A client agent can kick off a task, go handle other work, and poll or stream for results — exactly like a well-designed async job system, but standardized across the entire ecosystem.

4. Security That Enterprises Actually Need

The v1.2 update (which dropped alongside Cloud Next) added cryptographically signed agent cards for domain verification, alongside OAuth 2.0 and mTLS support. This isn't a research protocol being bolted onto production systems. It was built for production from day one.

Combined with Google Cloud's Model Armor for inline traffic sanitization, you get a security story that doesn't require security teams to reinvent the wheel for every agent deployment.

Why This Matters More Than the Gemini Enterprise Agent Platform

Don't get me wrong — the Gemini Enterprise Agent Platform (the thing that used to be Vertex AI) is impressive. The Agent Designer's no-code canvas, the Inbox for managing long-running agent workflows, the Agent Registry — all genuinely useful tools.

But here's my hot take: platforms are proprietary; protocols are permanent.

The Gemini Enterprise Agent Platform is a Google product. It's excellent, and if you're in the Google Cloud ecosystem, you should absolutely use it. But the A2A protocol is an open standard under the Linux Foundation. It's already integrated into:

Google's Agent Development Kit (ADK)
LangGraph (LangChain)
CrewAI
LlamaIndex Agents
Microsoft Semantic Kernel
AutoGen

This is the rare case where a major cloud vendor released something that helps everyone, including developers on competing platforms. That's not altruism — it's a bet that standardization grows the pie faster than lock-in. And historically, that bet tends to be right (see: HTTP, TCP/IP, OAuth, OpenTelemetry).

The Developer Impact: What Changes Right Now

Let me paint a practical picture. Say you're building a customer support system. Today, your architecture probably looks like:

User → Your App → Custom LLM Integration → Custom CRM API Wrapper 
                                          → Custom Billing API Wrapper
                                          → Custom Knowledge Base Search

Every integration is bespoke. Every connection is a maintenance liability. Every new data source requires a new adapter, new auth handling, new error recovery logic.

With A2A in the mix, it looks more like this:

User → Your Orchestrator Agent 
        → discovers CRM Agent (via Agent Card)
        → discovers Billing Agent (via Agent Card)  
        → discovers Knowledge Agent (via Agent Card)
        → delegates tasks via standard A2A protocol
        → monitors lifecycle states
        → composes results

The orchestrator doesn't need to know how the CRM agent works internally. It just needs to read the Agent Card, understand the capabilities, and communicate via the standard protocol. When Salesforce ships their own A2A-compliant agent tomorrow, your system picks it up without a single line of new integration code.

That's the revolution. Not any single agent being smarter, but all agents being able to work together without us hand-wiring every connection.

A2A vs. MCP: Complementary, Not Competing

I've seen some confusion about how A2A relates to the Model Context Protocol (MCP), so let me clarify:

	MCP	A2A
Purpose	Connect agents to tools and data	Connect agents to other agents
Relationship	Agent ↔ Resource	Agent ↔ Agent
Analogy	USB port for peripherals	TCP/IP for networked systems
Use Case	"Query my database"	"Hey CRM Agent, look up this customer"

They're complementary layers. MCP gives your agent hands and eyes. A2A gives it colleagues. The Agentic Data Cloud and Knowledge Catalog (also announced at Next '26) sit at the MCP layer — providing the context and grounding agents need. A2A sits above, orchestrating the collaboration between specialized agents.

If you're building anything non-trivial in the agentic space, you'll need both.

What I Think Is Still Missing

No protocol is perfect at v1.2, and A2A has some gaps I'd love to see addressed:

1. Discovery at Scale

Agent Cards at well-known URLs work great when you know where to look. But what about discovering agents you don't know exist? There's no standardized registry or marketplace protocol yet. Google's Agent Registry helps within the GCP ecosystem, but the open protocol needs a decentralized discovery mechanism — something like DNS for agents.

2. Economic Primitives

When Agent A delegates a task to Agent B, who pays? A2A has no built-in concept of metering, billing, or cost negotiation. As we move toward agent marketplaces (Google mentioned one in Project Mariner's Q4 2026 roadmap), this will become critical.

3. Semantic Versioning for Capabilities

Agent Cards describe capabilities, but there's no standard for versioning those capabilities. When an agent updates its skills, how do clients know what changed? We need something like semver for agent capabilities.

4. Debugging Multi-Agent Workflows

Tracing a single agent is hard enough. Tracing a conversation across 5 agents from 3 different vendors? The observability story needs work. OpenTelemetry integration for A2A traces would be a game-changer.

The Bigger Picture: Google's Bet on the Agentic Enterprise

Zoom out, and the entire Cloud Next '26 narrative clicks into place:

Gemini Enterprise Agent Platform = the factory where you build agents
Agent Designer = the blueprinting tool for non-engineers
Knowledge Catalog + Agentic Data Cloud = the fuel (trusted context from enterprise data)
Model Armor + Agentic Defense = the guardrails (security and governance)
A2A Protocol = the roads connecting everything together
8th-Gen TPUs + Virgo Network = the power grid underneath it all

The Apple partnership? It's validation that Google's AI infrastructure is best-in-class — Apple choosing Google Cloud to build its next-gen foundation models is a vote of confidence in the Virgo fabric and TPU architecture. But for us as developers, it doesn't change what we build or how we build it.

A2A does. It changes the architecture of collaboration between intelligent systems. And that's a shift that will compound for years.

What You Should Do This Week

If any of this resonated, here's my practical advice:

Read the A2A spec. It's well-written and surprisingly short. Start at google.github.io/A2A.
Build a toy Agent Card. Publish a /.well-known/agent-card.json for one of your existing services. Even if you don't build the full A2A server, the exercise of describing your service's capabilities in a machine-readable format is incredibly clarifying.
Try the ADK. Google's Agent Development Kit has native A2A support out of the box. Spin up two agents and watch them talk. There's something magical about seeing autonomous systems discover and delegate to each other.
Think about your integration tax. Look at your current codebase. How much code exists purely to connect System A to System B? That's the code A2A is designed to eliminate. Start identifying the integration seams where a standardized protocol could replace bespoke glue code.
Watch the Developer Keynote replay. The showcase of Agent Designer building a multi-agent workflow in natural language is legitimately impressive, and it demonstrates the full A2A lifecycle in action.

Final Thought

Every major platform shift has been catalyzed by a protocol, not a product. The web wasn't built on Netscape — it was built on HTTP. Mobile wasn't defined by the iPhone — it was enabled by LTE. Cloud computing wasn't created by AWS — it was powered by APIs and OAuth.

The agentic era will be no different. And A2A is the protocol that makes it possible.

Google Cloud Next '26 was packed with flashy demos and blockbuster partnerships. But the most important thing they shipped was a boring, beautiful, open protocol that lets AI agents work together without asking permission from any single vendor.

That's the one worth your attention. That's the one worth building on.

What's your take? Is A2A the game-changer I think it is, or am I overreacting? Have you tried building with the protocol yet? I'd love to hear about your experience in the comments.

I built a niche API for Indian developers because no one else did — here's the whole story

Nilam Bora — Thu, 23 Apr 2026 07:43:58 +0000

I've been building in public for a while now, and this is the project I'm most excited to share.
The problem
If you've ever worked on Indian invoicing, cheque printing, or accounting software, you've run into this: you need to print a number in words. Not just "one hundred fifty thousand" — but "Rupees One Lakh Fifty Thousand Only."
The Indian number system uses Lakhs and Crores, not Millions and Billions. Every Indian fintech app, every GST invoice generator, every cheque printing system needs this.
I searched RapidAPI. Nothing. Stack Overflow is full of half-working snippets. PyPI has abandoned packages. So I decided to fix that.
What I built
RupeesInWords is a REST API that converts any number to Indian Rupees in words.

GET /api/v1/convert?number=150000

Response:

{
  "number": 150000,
  "words": "Rupees One Lakh Fifty Thousand Only",
  "indian_formatted": "1,50,000",
  "currency_symbol": "Rupees",
  "success": true
}

Features:

Full Lakh/Crore/Paise support
Batch conversion (up to 50 numbers per request)
Indian comma formatting (1,50,000)
Multiple currency symbols (₹, Rs., INR, Rupees)
Clean error handling for edge cases

The open-core model
The core converter logic is open-source on GitHub:
👉 https://github.com/NEXUS-Lord/rupees-in-words
Pure Python. Zero dependencies. MIT license. If you want to self-host or use it as a library, it's right there.
The hosted API — with API keys, rate limiting, batch endpoints, and uptime — is on RapidAPI:
👉 https://rapidapi.com/NEXUSLord/api/rupeesinwords-indian-number-to-words
Free tier available: 100 requests/month, no credit card required.
This is a classic open-core model. The library is free. The service is monetized. The open-source version builds trust and GitHub stars — the hosted API generates revenue.

The tech stack

Python + FastAPI — automatic Swagger UI at /docs out of the box
Render.com — deploys directly from GitHub, free tier to start
RapidAPI — marketplace and billing layer

The entire infrastructure setup took less than an hour. FastAPI's auto-generated docs alone saved me hours of documentation work.
What I learned
The hardest part wasn't building the API — it was validating the idea first. Before writing a single line of code, I searched RapidAPI, GitHub, and PyPI to confirm nothing like this existed for Indian developers. That 20 minutes of research saved me from building something nobody needed.
The Indian developer market is massively underserved by API tooling. Western developers have thousands of niche APIs available. Indian developers are still copy-pasting Stack Overflow snippets for basic financial formatting. That gap is the opportunity.
What's next
I'm targeting 5,000 GitHub stars by July 2026 and building this as a proof-of-concept for the full pipeline:
Identify a niche problem → build a clean solution → open-source the core → monetize the hosted version.
If you're building Indian fintech tools, invoicing software, or anything that needs Indian number formatting — try the free tier and tell me what I'm missing. Edge cases, feature requests, weird number formats — I want to hear all of it.
GitHub: https://github.com/NEXUS-Lord/rupees-in-words
Try it free — 100 requests/month, no credit card: https://rapidapi.com/NEXUSLord/api/rupeesinwords-indian-number-to-words
Building in public. One niche API at a time.