DEV Community: pulkitgovrani

5 Things I Wish I Knew Before Building with Hermes Agent

pulkitgovrani — Mon, 25 May 2026 09:30:00 +0000

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent

I spent a week building a production system on Hermes Agent — a persistent AI memory layer for GitHub repositories called Shadow CTO. Here's what tripped me up and what I'd tell myself on day one.

1. Session IDs Are Your Most Important Design Decision

Make them meaningful from the start. A UUID is fine for experiments. For anything real, encode the domain semantics into the ID itself.

# Bad — opaque, can't debug, can't reason about isolation
session_id = str(uuid.uuid4())

# Good — self-documenting, domain-meaningful
session_id = f"repo:{owner}/{repo_name}"      # per-repository brain
session_id = f"user:{user_id}:support"         # per-user support history
session_id = f"project:{project_id}:planning"  # per-project context

Why this matters more than you think: You cannot merge sessions. If you realize six weeks in that you gave all users the same session ID instead of per-user IDs, you are starting over. There's no migration path. The accumulated memory is gone.

Spend 30 minutes mapping your domain to session IDs before writing a single API call. It's the highest-leverage design decision in a Hermes-backed system.
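
One way to keep the mapping consistent is to build every ID through a single helper rather than scattering f-strings across the codebase. The sketch below is hypothetical: `make_session_id` and its validation rules are assumptions for illustration, not part of Hermes, which only ever sees the final string. It rejects empty segments and segments that contain the `:` separator, so a malformed ID fails loudly before it reaches the API.

```python
def make_session_id(kind: str, *parts: str) -> str:
    """Build a colon-delimited session ID such as 'repo:acme/backend'.

    Hypothetical helper: the server only receives the joined string.
    """
    segments = [kind, *parts]
    for segment in segments:
        if not segment or not segment.strip():
            raise ValueError("session ID segments must be non-empty")
        if ":" in segment:
            raise ValueError(f"segment {segment!r} must not contain ':'")
    return ":".join(segments)


repo_session = make_session_id("repo", "acme/backend")
support_session = make_session_id("user", "42", "support")
```

Centralizing the format also gives you one place to change if the naming scheme ever needs a new segment.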

2. Feed Events, Not Documents

My first instinct was to dump entire files and documents into Hermes and then query them. This works but misses the point entirely.

Hermes's memory is most powerful when you feed it events — things that happened, decisions that were made, changes that occurred over time. Not static content.

# Weaker: dumping a document
await chat(
    "Here is our entire architecture documentation: [5,000 words of context]",
    session_id="repo:acme/backend",
)

# Stronger: feeding events as they happen
await chat(
    "Decision made on 2026-03-14:\n"
    "Switched from PostgreSQL to MongoDB.\n"
    "Reason: Schema flexibility needed for dynamic user preferences.\n"
    "Impact: Migration took 3 sprints, 2 rollback incidents.",
    session_id="repo:acme/backend",
)

The second approach lets Hermes build causal understanding. It knows what came before this decision and can reason about what changed as a result. By the 50th event, it has context that a document dump can never provide: sequence, causality, and pattern.

The rule of thumb: if the information has a timestamp and something happened, feed it as an event. If it's static reference material, RAG is the better tool.
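
To keep events uniform, it can help to give them a small typed shape and render the message text from it. This is a minimal sketch, assuming a `DecisionEvent` dataclass that does not exist in Hermes; the field names are illustrative, and the rendered text simply mirrors the layout of the event example above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DecisionEvent:
    """One timestamped engineering decision (hypothetical shape)."""
    when: date
    decision: str
    reason: str
    impact: str = ""

    def to_message(self) -> str:
        # Mirror the "Decision made on ..." layout used in the event example.
        lines = [
            f"Decision made on {self.when.isoformat()}:",
            self.decision,
            f"Reason: {self.reason}",
        ]
        if self.impact:
            lines.append(f"Impact: {self.impact}")
        return "\n".join(lines)


event = DecisionEvent(
    when=date(2026, 3, 14),
    decision="Switched from PostgreSQL to MongoDB.",
    reason="Schema flexibility needed for dynamic user preferences.",
    impact="Migration took 3 sprints, 2 rollback incidents.",
)
message = event.to_message()
```

The resulting `message` string is what you would pass to `chat(...)` with the repository's session ID.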

3. Your System Prompt Is a Memory Architecture Decision

Hermes is smart, but it needs to know what to do with incoming information. A generic system prompt produces shallow retention. A structured extraction prompt produces deep, queryable memory.

# Shallow — Hermes stores raw text, answers surface questions
SYSTEM = "You are a helpful assistant."

# Deep — Hermes extracts and retains structured understanding
SYSTEM = """You are the institutional memory for {repo_name}.

When you receive new information:
1. Identify if a meaningful decision was made (skip noise like typo fixes)
2. Extract the rationale — the WHY, not just the what happened
3. Note any contradictions with previous decisions you remember
4. Remember the causal chain: what problem led to this decision

Store understanding, not transcripts. When asked later, cite specifics."""

The quality difference in answer depth between these two system prompts is significant enough that I'd call it the second most important design decision after session ID granularity.

One additional tip: use temperature 0.2–0.3 for ingest (you want consistent, structured extraction) and 0.6–0.7 for Q&A (you want thoughtful synthesis, not mechanical retrieval).
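
A small sketch of how that split might look in code. The `completion_kwargs` helper and the exact values (0.2 and 0.7, picked from the ranges above) are assumptions; the model name and the `X-Hermes-Session-Id` header follow the other examples in this series. The returned dict is meant to be unpacked into `client.chat.completions.create(**kwargs)` on an OpenAI-compatible client.

```python
# Assumed values taken from the suggested ranges; tune them for your workload.
TEMPERATURE_BY_MODE = {"ingest": 0.2, "qa": 0.7}


def completion_kwargs(mode: str, message: str, session_id: str) -> dict:
    """Build keyword arguments for a chat completion call in the given mode."""
    if mode not in TEMPERATURE_BY_MODE:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "model": "hermes",
        "messages": [{"role": "user", "content": message}],
        "temperature": TEMPERATURE_BY_MODE[mode],
        "extra_headers": {"X-Hermes-Session-Id": session_id},
    }


ingest_call = completion_kwargs(
    "ingest", "Decision: removed Redis cache.", "repo:acme/backend"
)
qa_call = completion_kwargs("qa", "Why was Redis removed?", "repo:acme/backend")
```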

4. Cron Jobs Need Explicit Session Context in the Prompt

This one cost me two hours of debugging. I registered Hermes cron jobs on startup without explicitly anchoring them to a session context in the prompt. The jobs fired on schedule but gave generic, unhelpful responses because they weren't connecting to the right accumulated memory.

# This job fires but has no memory context
await hermes.create_job(
    name="daily-analysis",
    schedule="0 2 * * *",
    prompt="Identify recurring failure patterns from recent engineering decisions.",
    # Where? Whose memory? Hermes doesn't know.
)

# This job fires AND draws from the right accumulated context
await hermes.create_job(
    name="acme-backend-daily-analysis",
    schedule="0 2 * * *",
    prompt=(
        "You are the Shadow CTO for acme/backend. "         # identity
        "Using the engineering decisions stored in your memory "  # explicit memory reference
        "from this repository, identify recurring failure patterns: "
        "components that keep breaking, decisions that were reversed, "
        "or technical debt accumulating. Be specific — cite titles and dates."
    ),
)

The "You are the X for Y" clause in the prompt is what reconnects the cron job to the right session context. Without it, you're firing a prompt into a vacuum. With it, you're triggering an agent that knows who it is and what it's been watching.
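
If you register jobs for many repositories, generating the identity clause from the repository name keeps it from being forgotten. The helper below is a hypothetical sketch: `build_job` is not a Hermes function, and it only assembles the keyword arguments shown in the example above (`name`, `schedule`, `prompt`).

```python
def build_job(repo: str, label: str, task: str, schedule: str) -> dict:
    """Return job arguments whose prompt names the repository it belongs to."""
    owner_repo = repo.strip()
    if "/" not in owner_repo:
        raise ValueError("repo must look like 'owner/name'")
    prompt = (
        f"You are the Shadow CTO for {owner_repo}. "       # identity clause
        "Using the engineering decisions stored in your memory "
        f"from this repository, {task.strip()}"
    )
    return {
        "name": f"{owner_repo.replace('/', '-')}-{label}",
        "schedule": schedule,
        "prompt": prompt,
    }


job = build_job(
    "acme/backend",
    "daily-analysis",
    "identify recurring failure patterns. Be specific: cite titles and dates.",
    "0 2 * * *",
)
```

With the article's client this would be passed as `await hermes.create_job(**job)`.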

5. Streaming Is Worth the Extra Code

The non-streaming endpoint is significantly simpler to implement. For any user-facing feature, use streaming anyway.

Hermes answers thoughtfully. On complex questions about accumulated history, such as "what are the three biggest risks before the next release?", responses can take 10–20 seconds to generate. Without streaming, users see a spinner and wonder if something broke. With streaming, the first tokens appear almost immediately and the answer builds in real time, so the wait feels far shorter even though total generation time is unchanged.

Backend (FastAPI SSE):

from fastapi.responses import StreamingResponse

@router.post("/query")
async def query(body: QueryRequest):
    async def generate():
        try:
            async for chunk in hermes.stream_chat(
                messages=build_messages(body.question),
                session_id=body.session_id,
            ):
                escaped = chunk.replace("\n", "\\n")
                yield f"data: {escaped}\n\n"
        except Exception as exc:
            yield f"data: [ERROR] {exc}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream",
                             headers={"Cache-Control": "no-cache",
                                      "X-Accel-Buffering": "no"})

Frontend (React):

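// EventSource only issues GET requests, so this assumes the route is also exposed over GET
// with query parameters; to keep the POST route, read the response stream with fetch() instead.
const es = new EventSource(`/api/query?${new URLSearchParams({ question, session_id: sessionId })}`);
es.onmessage = (e) => {
    if (e.data === "[DONE]") { es.close(); return; }
    setResponse(prev => prev + e.data.replace(/\\n/g, "\n"));
};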
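
One detail worth checking in the wire format: replacing only newlines is not reversible if the model emits a literal backslash followed by `n` (common in code answers). Below is a minimal sketch of a round-trip-safe scheme, assuming you control both ends; `encode_sse` and `decode_sse` are hypothetical helper names, and a JavaScript client would need the matching decode logic.

```python
def encode_sse(chunk: str) -> str:
    """Wrap one text chunk in a single SSE 'data:' frame.

    Backslashes are escaped first so that a literal backslash followed by
    'n' in the model output is not confused with an escaped newline.
    """
    escaped = (
        chunk.replace("\\", "\\\\")
        .replace("\r", "\\r")
        .replace("\n", "\\n")
    )
    return f"data: {escaped}\n\n"


def decode_sse(frame: str) -> str:
    """Reverse encode_sse: strip the framing and undo the escapes."""
    payload = frame.removesuffix("\n\n").removeprefix("data: ")
    unescape = {"n": "\n", "r": "\r", "\\": "\\"}
    out = []
    i = 0
    while i < len(payload):
        ch = payload[i]
        if ch == "\\" and i + 1 < len(payload):
            nxt = payload[i + 1]
            out.append(unescape.get(nxt, "\\" + nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# A real newline plus a literal backslash-n inside a code snippet.
sample = "line one\nprint('a\\nb')"
frame = encode_sse(sample)
```

Sending JSON-encoded chunks in each `data:` line is a common alternative that avoids hand-written escaping altogether.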

The UX difference between streaming and non-streaming is the gap between a prototype and something that feels production-ready. It's two hours of extra work and worth every minute.

The One-Line Summary

Hermes rewards investment in two things: session ID design and ingest prompt structure. Get those right on day one and the persistent memory largely takes care of itself — you build the domain logic, Hermes carries the institutional knowledge.

Everything else in this list is recoverable. Those two aren't.

Inside Hermes Agent's Session Memory: What X-Hermes-Session-Id Actually Does

pulkitgovrani — Mon, 25 May 2026 02:30:00 +0000

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent

The header looks trivial. One line. But it's doing something architecturally significant. Here's exactly what happens when you pass X-Hermes-Session-Id to Hermes — and why it matters more than it appears.

The Naive Mental Model (and Why It's Wrong)

Most developers assume persistent session = stored chat history. Request comes in → look up conversation log → prepend to messages → send to LLM. Like a database-backed chatbot.

That's not what Hermes does.

In the naive model, the cost of each request grows linearly with the number of turns, so the cumulative cost grows quadratically:

Turn 1:   send 100 tokens
Turn 10:  send 1,000 tokens
Turn 100: send 10,000 tokens
Turn N:   send N × average_turn_length tokens

At 1,000 turns of roughly 100 tokens each, you're sending about 100,000 tokens, the length of a short novel, on every request. This is why "just store the history" breaks for long-running agents.
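
The arithmetic behind those numbers can be sketched in a few lines. The 100-token average turn is the figure implied by the table above; the function names are just for illustration.

```python
def replay_tokens(turn: int, avg_turn_tokens: int) -> int:
    """Tokens sent on a given turn when the full history is replayed."""
    return turn * avg_turn_tokens


def cumulative_replay_tokens(turns: int, avg_turn_tokens: int) -> int:
    """Total tokens sent across turns 1..turns (sum of an arithmetic series)."""
    return avg_turn_tokens * turns * (turns + 1) // 2


per_request_at_100 = replay_tokens(100, 100)        # 10,000 tokens on turn 100
total_after_100 = cumulative_replay_tokens(100, 100)  # 505,000 tokens overall
per_request_at_1000 = replay_tokens(1000, 100)      # 100,000 tokens on turn 1,000
```

The per-request figure is what hits the context window; the cumulative figure is what shows up on the bill.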

What's Actually Happening: Compressed State, Not Transcript Replay

Hermes maintains a continuously updated compressed state per session ID — not a raw transcript that grows without bound.

Prior turns are distilled into the model's retained understanding. The context window stays bounded regardless of how many turns have occurred. New inputs are processed against accumulated understanding, not against a raw replay of every prior message.

The practical effect:

# Turn 1 — explicitly stated
chat("My name is Alex. I'm building a distributed cache in Rust.")

# Turn 200 — two months and 199 interactions later
# No history sent. No RAG lookup. Just the session ID.
chat("What tech stack are we using again?")
# "You're building a distributed cache in Rust."

The model doesn't "find" that fact. It retained it.

The Session ID as a Namespace

Each unique X-Hermes-Session-Id value is a completely isolated memory namespace. Sessions never bleed into each other. This makes session IDs a first-class design primitive.

import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:11434/v1", api_key="hermes")

async def chat(message: str, session_id: str) -> str:
    response = await client.chat.completions.create(
        model="hermes",
        messages=[{"role": "user", "content": message}],
        extra_headers={"X-Hermes-Session-Id": session_id},
    )
    return response.choices[0].message.content

async def main() -> None:
    # These sessions are completely isolated brains
    await chat("Commit: removed Redis cache, caused 3 outages", "repo:acme/backend")
    await chat("Commit: added Redis cache layer for performance", "repo:widgets/frontend")

    # Each query draws only from its own session
    result = await chat("What cache decisions were made?", "repo:acme/backend")
    # Knows about the Redis removal; knows nothing about widgets/frontend
    print(result)

# `await` is only valid inside a coroutine, so run everything through asyncio.run
asyncio.run(main())

Map session IDs to your domain:

Domain	Session ID Pattern
Per-user memory	`user:{user_id}`
Per-repository memory	`repo:{owner}/{name}`
Per-customer support	`support:{customer_id}`
Per-project context	`project:{id}:v{version}`

What Gets Retained and How

Every message sent through a session is processed and distilled. Hermes prioritizes retention of:

Explicit facts — names, decisions, stated preferences, numbers

"We use PostgreSQL 15 on RDS with read replicas in us-east-1"
→ retained verbatim

Causal relationships — X was done because of Y

"Removed Redis because cache invalidation bugs caused stale product prices"
→ the causal link is retained, not just the removal

Temporal markers — when things happened relative to each other

"Tried GraphQL in Q1, reverted in Q2 due to N+1 issues"
→ the sequence and the reason are retained together

Contradictions — when new information conflicts with what's stored

Prior: "We're committed to microservices"
New: "Merged all services back into a monolith"
→ Hermes flags this as a reversal when asked about architecture decisions

This is the distinction from retrieval. RAG finds text. Hermes retains understanding of relationships between facts.

The Cron Integration: Memory Meets Autonomy

Hermes's /api/jobs endpoint connects the session memory system to time. A registered job is a prompt that fires on a schedule — and crucially, it runs through the same accumulated session context.

import httpx

# Register a job that runs against its own accumulated memory
httpx.post(
    "http://localhost:11434/api/jobs",
    headers={"Authorization": "Bearer hermes"},
    json={
        "name": "weekly-pattern-report",
        "schedule": "0 9 * * 1",
        "prompt": (
            "You are the Shadow CTO for acme/backend. "
            "Review the engineering decisions you have stored in memory "
            "from the past week. Identify any recurring failure patterns "
            "or decisions that were reversed. Prepare a concise report."
        ),
    },
)

The agent isn't querying an external database. It's asking itself what it remembers. This is the architecture that enables genuinely autonomous behavior — not polling, not retrieval, not RAG. Introspection over accumulated memory.

Streaming: The Architecture Underneath

For user-facing features, always use the streaming endpoint. Hermes reasons before answering, so on questions about accumulated history full responses can take 10–20 seconds. Streaming does not shorten generation, but the first tokens arrive almost immediately, which hides most of that wait.

# Streaming via SSE in FastAPI
async def generate_sse(session_id: str, question: str):
    stream = await client.chat.completions.create(
        model="hermes",
        messages=[{"role": "user", "content": question}],
        stream=True,
        extra_headers={"X-Hermes-Session-Id": session_id},
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            # Escape newlines for SSE wire format
            yield f"data: {delta.replace(chr(10), chr(92) + 'n')}\n\n"
    yield "data: [DONE]\n\n"

The frontend can be a standard EventSource as long as the route is served over GET (EventSource cannot send a POST body). The user sees the answer build token by token, which feels fast even when total generation takes 15 seconds.

When This Architecture Wins vs. RAG

Scenario	RAG Better	Hermes Session Better
Search across 10k static documents	✅	❌
Remember context across 6 months of activity	❌	✅
Precise source citation with page numbers	✅	⚠️
Understanding causality and sequence over time	❌	✅
"What changed and why" questions	❌	✅
Real-time document ingestion at scale	✅	⚠️
Autonomous scheduled analysis	❌	✅
Detecting reversals and contradictions	❌	✅

The OpenAI Compatibility Layer

Because Hermes exposes an OpenAI-compatible API, migrating existing OpenAI code takes very little work:

# Before — OpenAI, stateless
import os
from openai import AsyncOpenAI
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = await client.chat.completions.create(
    model="gpt-4o",
    messages=conversation_history,  # you manage this
)

# After — Hermes, persistent
from openai import AsyncOpenAI
client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="hermes",
)

response = await client.chat.completions.create(
    model="hermes",
    messages=[{"role": "user", "content": latest_message}],  # just the new message
    extra_headers={"X-Hermes-Session-Id": user_session_id},  # Hermes handles the rest
)

You drop the conversation history management. You add one header. Tool use, function calling, and streaming patterns all work unchanged.

Summary

X-Hermes-Session-Id isn't a database lookup key. It's a namespace for a persistent reasoning state that accumulates understanding rather than replaying transcripts. The cost is bounded. The knowledge compounds. The autonomy follows naturally from the scheduling integration.

That's the architectural bet Hermes is making: that the future of AI agents is stateful participants that get smarter over time, not stateless query engines that start from zero on every call.

Based on what you can build with a single header, it's a bet worth taking.

Hermes Agent vs LangChain vs CrewAI: When to Reach for Each

pulkitgovrani — Sun, 24 May 2026 18:30:00 +0000

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent

Three tools. All called "agentic frameworks." Completely different philosophies. Here's how to choose without spending a week reading documentation.

The Core Difference in One Line

LangChain — orchestration glue: chains, tools, and retrievers you wire together explicitly
CrewAI — multi-agent collaboration: roles, tasks, and crews that delegate to each other
Hermes Agent — persistent memory: a stateful AI brain that accumulates context across time

None of them are strictly better. They solve different problems.

LangChain

Best for: Connecting LLMs to tools, databases, and APIs in explicit, inspectable pipelines.

from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=Chroma(...).as_retriever()
)
qa.run("What does our refund policy say?")

Strengths:

Enormous ecosystem — 100+ integrations out of the box
Explicit control over every step of the pipeline
Best-in-class for RAG and document Q&A
Large community, extensive documentation

Weaknesses:

Stateless by default — you own the memory problem
Can become boilerplate-heavy for simple tasks
Abstractions leak under complex, branching workflows

Reach for it when: You need to wire an LLM to specific data sources with precise retrieval logic, or you need a known integration (Pinecone, Notion, Slack, etc.) working in hours, not days.

CrewAI

Best for: Multi-agent workflows where different AI "roles" collaborate on a shared goal.

from crewai import Agent, Task, Crew

researcher = Agent(
    role="Research Analyst",
    goal="Find and summarize relevant technical papers",
    backstory="Expert at finding signal in dense technical literature",
)

writer = Agent(
    role="Technical Writer",
    goal="Turn research into a clear, developer-friendly post",
    backstory="Experienced at explaining research to working developers",  # Agent expects a backstory
)

crew = Crew(agents=[researcher, writer], tasks=[...])
result = crew.kickoff()

Strengths:

Natural mental model for role-based workflows
Built-in agent delegation and handoff
Good for document generation, research pipelines, content workflows

Weaknesses:

Agents are still stateless between separate runs
Overhead for simple single-agent tasks
Less control over inter-agent communication details

Reach for it when: Your task naturally decomposes into specialist roles that hand off work to each other — research → write → review, or plan → execute → validate.

Hermes Agent

Best for: Agents that need to accumulate knowledge over time and reason about what they've learned.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="hermes")

# No conversation history in the request — Hermes holds it
response = client.chat.completions.create(
    model="hermes",
    messages=[{"role": "user", "content": "What did we ship last week?"}],
    extra_headers={"X-Hermes-Session-Id": "team-standup-bot"},
)

Strengths:

True persistent memory per session ID — accumulates indefinitely
Built-in cron scheduler — agents that run themselves on a schedule
OpenAI-compatible API — near-zero migration cost from existing code
No external vector DB needed for temporal memory
Open source and locally runnable — you control the memory

Weaknesses:

Newer ecosystem, fewer community integrations than LangChain
Memory is opaque — you can't directly inspect or edit what's stored
Less fine-grained control over retrieval than an explicit RAG pipeline

Reach for it when: Your agent needs to learn and remember across many interactions over days, weeks, or months — and you want that memory to inform autonomous behavior, not just passive Q&A.

Decision Matrix

Need	LangChain	CrewAI	Hermes
Connect LLM to a database / API	✅	⚠️	⚠️
Multi-role agent collaboration	⚠️	✅	⚠️
Persistent memory across sessions	❌	❌	✅
Scheduled autonomous tasks	❌	❌	✅
OpenAI code migration (drop-in)	⚠️	❌	✅
Ecosystem / pre-built integrations	✅	⚠️	⚠️
RAG over large document corpus	✅	⚠️	❌
"What happened and why" questions	❌	❌	✅

A Real Scenario: Engineering Knowledge Base

Let's say you want an AI that answers questions about your codebase's history.

With LangChain: You'd embed all your commits and PRs into a vector store, build a retrieval chain, and query it. Fast to set up. Answers questions about text in documents. Can't reason about causality or change over time.

With CrewAI: You'd build a researcher agent to fetch GitHub data and a writer agent to summarize it. Good for one-shot reports. Forgets everything between runs.

With Hermes: You feed every commit and PR into a persistent session as it happens. Months later you ask "why was Redis removed?" and Hermes answers from accumulated understanding — knowing what came before and after that decision, not just that the word "Redis" appears in a commit message.

Different tools. Different outcomes.

They're Not Mutually Exclusive

Use LangChain to connect Hermes to specific retrieval sources when you need both precision and memory. Use CrewAI to coordinate multiple Hermes sessions as specialist agents with distinct memory namespaces.

The frameworks compose. Pick the right primitive for each layer of the problem.

The simplest rule: if your agent needs to remember things it wasn't explicitly told in this request, reach for Hermes first.

How to Build a Persistent AI Agent with Hermes in 15 Minutes

pulkitgovrani — Sun, 24 May 2026 14:30:00 +0000

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent

Most AI integrations are stateless. Every request starts cold.
Hermes Agent is different — it remembers.

This guide walks you through spinning up Hermes locally and building a minimal agent that accumulates memory across sessions. No vector database. No RAG pipeline. Just a session ID.

Prerequisites

Docker or Python 3.11+
Basic familiarity with REST APIs
15 minutes

Step 1: Run Hermes Locally

# via Docker
docker pull nousresearch/hermes-agent
docker run -p 11434:11434 nousresearch/hermes-agent

Verify it's alive:

curl http://localhost:11434/health
# {"status":"ok"}

Step 2: Your First Stateful Chat

Hermes exposes an OpenAI-compatible /v1/chat/completions endpoint. The magic is one header: X-Hermes-Session-Id.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="hermes",
)

SESSION_ID = "my-first-agent"

def chat(message: str) -> str:
    response = client.chat.completions.create(
        model="hermes",
        messages=[{"role": "user", "content": message}],
        extra_headers={"X-Hermes-Session-Id": SESSION_ID},
    )
    return response.choices[0].message.content

Now send two messages in separate calls — no shared history in the request body:

print(chat("My name is Alex and I'm building a todo app in Go."))
# "Nice to meet you, Alex! ..."

print(chat("What language am I using?"))
# "You're using Go, as you mentioned earlier."

The second call has no conversation history in the request body. Hermes remembered anyway — because the session ID matched.

Step 3: Feed Events Over Time

Hermes memory compounds. The more you feed it, the richer its understanding becomes. Feed events as structured facts:

events = [
    "Decision: Switched auth from JWT to session cookies. Reason: race condition in token refresh under concurrent requests caused 2% of users to be logged out.",
    "Decision: Removed Redis cache layer. Reason: cache invalidation bugs caused stale data in production. Replaced with direct DB reads.",
    "Decision: Added rate limiting to /api/search. Reason: one customer was generating 40% of total API load.",
]

for event in events:
    chat(event)

# Now ask about the accumulated history
print(chat("What are the biggest reliability concerns in this codebase?"))
# Hermes synthesizes across all three events

Step 4: Register a Cron Job

Hermes has a built-in scheduler. Register a recurring autonomous task:

import httpx

httpx.post(
    "http://localhost:11434/api/jobs",
    headers={"Authorization": "Bearer hermes"},
    json={
        "name": "daily-standup",
        "schedule": "0 9 * * 1-5",
        "prompt": (
            "You are the project memory agent. Using what you remember "
            "from recent activity, generate a concise standup summary: "
            "what changed, why, and what to watch."
        ),
    },
)

That's it. Hermes now runs this prompt every weekday at 9am, drawing from whatever it has accumulated in memory — no external database, no retrieval pipeline.
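
If you want to sanity-check a schedule string before registering it, a tiny matcher is enough. This is a simplified sketch, not Hermes's scheduler: it supports only `*`, single numbers, ranges, and comma lists, and it requires every field to match (standard cron treats day-of-month and day-of-week as alternatives when both are restricted). The function names are hypothetical.

```python
from datetime import datetime


def _field_matches(field: str, value: int) -> bool:
    """Match one cron field; supports '*', 'N', 'A-B' and comma lists only."""
    if field == "*":
        return True
    for part in field.split(","):
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            if low <= value <= high:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if a five-field cron expression fires at `when`."""
    minute, hour, day, month, weekday = expr.split()
    cron_weekday = (when.weekday() + 1) % 7  # cron uses 0 = Sunday
    return (
        _field_matches(minute, when.minute)
        and _field_matches(hour, when.hour)
        and _field_matches(day, when.day)
        and _field_matches(month, when.month)
        and _field_matches(weekday, cron_weekday)
    )


monday_9am = datetime(2026, 5, 25, 9, 0)    # a Monday
saturday_9am = datetime(2026, 5, 30, 9, 0)  # a Saturday
fires_monday = cron_matches("0 9 * * 1-5", monday_9am)
fires_saturday = cron_matches("0 9 * * 1-5", saturday_9am)
```

Here `fires_monday` is true and `fires_saturday` is false, which matches the "every weekday at 9am" reading of `0 9 * * 1-5`.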

What Just Happened

Concept	How Hermes Handles It
Memory	Persistent per session ID — no client-side history needed
Scheduling	Native `/api/jobs` endpoint with cron syntax
API surface	OpenAI-compatible — drop-in for existing code
Cost	Memory stays bounded — not a growing transcript

Session ID Design Patterns

Session IDs are namespaces. Make them intentional:

# Per-user memory
session_id = f"user:{user_id}"

# Per-repository institutional memory
session_id = f"repo:{owner}/{repo_name}"

# Per-customer support history
session_id = f"support:{customer_id}"

Sessions never bleed into each other. repo:facebook/react and repo:your-team/backend are completely isolated brains.

What to Build Next

Give each user their own session ID → per-user personalization without a user profile database
Feed GitHub commits into a session over time → a codebase that explains its own history
Schedule daily analysis jobs → autonomous agents that surface insights without being asked

The pattern scales to anything that benefits from an AI that remembers what it's seen before — which turns out to be almost everything worth building.

Getting Started with Gemini API Managed Agents — The Quickest Path from Prompt to Deployed Agent

pulkitgovrani — Sun, 24 May 2026 13:30:00 +0000

This is a submission for the Google I/O 2026 Challenge: Explore Google I/O 2026

Google shipped a lot of developer tooling at I/O 2026. The thing I want to actually build with is Managed Agents in the Gemini API.

Here's the pitch: one API call provisions a fully-functional agent with a remote execution sandbox. No infrastructure setup. No managing cloud VMs. No configuring the Antigravity agent harness manually. You write the agent logic. Google handles the environment.

This guide walks through what Managed Agents actually are, how they differ from the existing Gemini API, and how to get something running.

The Problem Managed Agents Solve

Building with the Gemini API before I/O 2026 meant one of two things:

Option A — Stateless calls: Send a prompt, get a response. Fine for Q&A. Useless for anything that requires multiple steps, state between calls, or executing code.

Option B — DIY infrastructure: Spin up a VM, configure the Antigravity harness (Google's agent runtime, announced at I/O 2025 as an alpha), manage sandboxing and credential isolation yourself, deploy the whole thing. Capable, but not a 30-minute getting-started experience.

Managed Agents collapse this to a single API call. The agent harness runs on Google's infrastructure. You get a remote sandbox, tool execution, and persistent state without provisioning anything.

What You Actually Get

A Managed Agent gives you:

Remote execution sandbox — code runs in Google's infrastructure, not your machine
Persistent state — the agent maintains context between tool calls and across a session
Tool use — the agent can call functions you define, fetch URLs, run code
Parallel subagents — via Antigravity's dynamic subagent system, agents can spin up specialized sub-agents for parallelized work
Scheduled tasks — background tasks that run on a schedule without a persistent connection

This is essentially the Antigravity 2.0 stack (Google's agent-first development platform, also announced at I/O 2026) delivered as an API.

Getting Started

1. Enable Managed Agents in AI Studio

Go to Google AI Studio, create a new project, and enable the Managed Agents feature under the Experimental section. You'll need a Gemini API key.

2. Install the SDK

pip install "google-generativeai>=0.8.0"

3. Your First Managed Agent

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Define the tools your agent can use
def search_documentation(query: str) -> str:
    """Search technical documentation for a given query."""
    # Your implementation here
    return f"Documentation results for: {query}"

def run_code(code: str, language: str = "python") -> dict:
    """Execute a code snippet and return the result."""
    # The Managed Agent sandbox handles actual execution
    return {"stdout": "...", "stderr": "", "exit_code": 0}

# Create the agent — Managed Agents handles the rest
agent = genai.ManagedAgent(
    model="gemini-3.5-flash",
    tools=[search_documentation, run_code],
    system_instruction=(
        "You are a technical assistant. When asked about code, "
        "search the docs first, then write and test code to verify your answer."
    ),
)

# Run the agent — it will use tools, iterate, and return a complete answer
result = agent.run(
    "Show me how to implement rate limiting in FastAPI with Redis"
)
print(result.text)

The agent call is blocking — it runs until the agent decides it has a complete answer. Under the hood, Gemini 3.5 Flash is orchestrating tool calls, synthesizing results, and iterating automatically.
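
To make "orchestrating tool calls and iterating" concrete, here is a toy version of that loop in plain Python. It is a sketch under stated assumptions, not Google's implementation: `scripted_model` is a stand-in for the LLM that requests one tool call and then answers, and `run_agent` is a hypothetical name.

```python
from typing import Callable


def search_documentation(query: str) -> str:
    """Same placeholder tool as in the example above."""
    return f"Documentation results for: {query}"


def scripted_model(history: list[dict]) -> dict:
    """Stand-in for the LLM: ask for one tool call, then answer."""
    tool_results = [m for m in history if m["role"] == "tool"]
    if not tool_results:
        return {"tool": "search_documentation", "args": {"query": "rate limiting"}}
    return {"text": f"Answer based on: {tool_results[-1]['content']}"}


def run_agent(
    prompt: str,
    tools: dict[str, Callable[..., str]],
    model: Callable[[list[dict]], dict],
    max_steps: int = 5,
) -> str:
    """Loop: ask the model, run any requested tool, feed the result back."""
    history = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        decision = model(history)
        if "text" in decision:  # the model decided it has a complete answer
            return decision["text"]
        result = tools[decision["tool"]](**decision["args"])
        history.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not finish within max_steps")


answer = run_agent(
    "Show me how to implement rate limiting in FastAPI with Redis",
    {"search_documentation": search_documentation},
    scripted_model,
)
```

The `max_steps` cap is the part worth copying into real code: a blocking call that loops until the model is satisfied needs an upper bound.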

Async and Streaming

For production use, you'll want async and streaming so your application stays responsive:

import asyncio

async def run_agent_async():
    agent = genai.ManagedAgent(
        model="gemini-3.5-flash",
        tools=[search_documentation, run_code],
        system_instruction="You are a technical assistant.",
    )

    # Stream the agent's output as it works
    async for event in agent.run_stream("Explain and demonstrate async context managers"):
        if event.type == "text":
            print(event.text, end="", flush=True)
        elif event.type == "tool_call":
            print(f"\n[Calling tool: {event.tool_name}]")
        elif event.type == "tool_result":
            print(f"[Tool returned: {str(event.result)[:100]}...]")  # results may be dicts, so stringify before slicing

asyncio.run(run_agent_async())

The event stream lets you show users what the agent is doing as it works — which matters for long-running tasks where a blank spinner kills perceived responsiveness.
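
The rendering side of such a stream can be tested without any API access. The sketch below assumes an event shape with `type`, `text`, `tool_name`, and `result` fields, modeled on the example above; `AgentEvent` and `fake_run_stream` are hypothetical stand-ins, not SDK classes.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, AsyncIterator


@dataclass
class AgentEvent:
    type: str            # "text", "tool_call" or "tool_result"
    text: str = ""
    tool_name: str = ""
    result: Any = None


async def fake_run_stream() -> AsyncIterator[AgentEvent]:
    """Hypothetical stand-in for a streaming agent run."""
    yield AgentEvent(type="tool_call", tool_name="run_code")
    yield AgentEvent(type="tool_result", result={"stdout": "ok", "exit_code": 0})
    yield AgentEvent(type="text", text="All tests passed.")


def render(event: AgentEvent) -> str:
    """Turn one event into a line of UI text."""
    if event.type == "text":
        return event.text
    if event.type == "tool_call":
        return f"[Calling tool: {event.tool_name}]"
    if event.type == "tool_result":
        # Results may be dicts, so convert to text before truncating.
        return f"[Tool returned: {str(event.result)[:100]}]"
    return ""


async def collect() -> list[str]:
    return [render(event) async for event in fake_run_stream()]


lines = asyncio.run(collect())
```

Keeping `render` a pure function means the UI mapping can be unit-tested separately from the network stream.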

Scheduled Tasks

One of the capabilities that separates Managed Agents from standard API calls is scheduled background tasks:

# Register a task that runs every day at 9am
agent.schedule_task(
    prompt=(
        "Check the GitHub repository for new issues labeled 'bug'. "
        "For each new issue, search the codebase for the relevant component "
        "and add a comment with likely root causes and affected files."
    ),
    schedule="0 9 * * *",  # cron syntax
    name="daily-issue-triage",
)

This runs without a persistent connection. The Managed Agent runtime handles the scheduling, execution, and logging. You can check task status and results via the API or in AI Studio.

Parallel Subagents

For tasks that decompose naturally into parallel work, Managed Agents supports dynamic subagent spawning via the Antigravity integration:

agent = genai.ManagedAgent(
    model="gemini-3.5-flash",
    tools=[...],
    system_instruction=(
        "You orchestrate research tasks. When given a broad topic, "
        "break it into parallel sub-tasks and spawn subagents to handle each. "
        "Synthesize their results into a unified response."
    ),
    enable_subagents=True,  # enables dynamic subagent spawning
)

result = agent.run(
    "Research the current state of WebAssembly runtimes: "
    "performance benchmarks, language support, production adoption, "
    "and future roadmap. Cover all angles in parallel."
)

The orchestrating agent decomposes the task, spawns specialized subagents for each sub-problem, and synthesizes the results — handling the parallelism automatically.
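
The underlying pattern is a fan-out followed by a fan-in. A minimal sketch with `asyncio.gather`, assuming a placeholder `subagent` coroutine in place of real model calls; none of these names come from the Managed Agents API.

```python
import asyncio


async def subagent(topic: str) -> str:
    """Hypothetical subagent: a real one would call a model or tools."""
    await asyncio.sleep(0)  # yield control, standing in for network I/O
    return f"findings on {topic}"


async def orchestrate(question: str, subtopics: list[str]) -> str:
    # Fan out: one coroutine per subtopic, all running concurrently.
    findings = await asyncio.gather(*(subagent(t) for t in subtopics))
    # Fan in: gather() preserves input order, so zip lines the results up.
    sections = [f"- {t}: {f}" for t, f in zip(subtopics, findings)]
    return f"{question}\n" + "\n".join(sections)


report = asyncio.run(
    orchestrate(
        "State of WebAssembly runtimes",
        ["performance benchmarks", "language support"],
    )
)
```

In a managed setup the decomposition itself is done by the model; the sketch only shows the concurrency shape that results.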

Deploy to Cloud Run in One Click

If you've built your agent in AI Studio, the I/O 2026 update adds a one-click Deploy to Cloud Run option. First two apps are free, no payment setup required. This is the fastest path from prototype to a publicly accessible endpoint.

For programmatic deployment:

# From the Antigravity CLI (also new at I/O 2026)
antigravity deploy my-agent \
  --runtime managed \
  --region us-central1 \
  --model gemini-3.5-flash

What Antigravity 2.0 Adds on Top

If you want more control than the Managed Agents API gives you, Antigravity 2.0 is the full platform:

Antigravity CLI for spinning up specialized subagents from the command line
Cross-platform terminal sandboxing with credential masking and hardened Git policies
Firebase integration for full-stack apps with auth and storage
Exports to Netlify from Google Stitch (the new UI design tool)

Managed Agents is Antigravity for developers who don't want to manage infrastructure. Antigravity 2.0 is for developers who want full control.

The Honest Assessment

What's genuinely good: The abstractions are right. Infrastructure should be invisible for most agent use cases. Managed Agents makes the 80% case — a stateful, tool-using agent that runs in the cloud — genuinely easy.

What to watch: The sandbox is Google's infrastructure. For agents that need access to internal systems, private databases, or custom toolchains, the Managed Agents sandbox has limits. The Antigravity CLI path gives you more flexibility there, but it's also more setup.

The pricing model matters: Scheduled tasks and long-running agents accumulate costs differently than stateless API calls. Test thoroughly before deploying agents that run on cron schedules against production data.

Bottom Line

Managed Agents is the "batteries included" path to building production agents on Gemini. One API call, a list of tools, and a system prompt is all you need to go from idea to a deployed, stateful, tool-using agent.

The scheduling and subagent features are where it stops being a demo feature and starts being infrastructure worth building on. The combination of Gemini 3.5 Flash's speed with Managed Agents' infrastructure is what makes the "agentic era" theme from Google's I/O keynote feel grounded rather than aspirational.

Build something with it this week. The time between "I wonder if this is possible" and a deployed endpoint is now measured in minutes, not days.

Links: Google AI Studio · Gemini API Docs · Antigravity 2.0 — Google I/O Developer Highlights · Google I/O 2026

Fine-Tuning Gemma 4 for Function Calling with TRL's New Multimodal Tool Support

pulkitgovrani — Sun, 24 May 2026 12:42:07 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Gemma 4 ships with native function calling built in — trained from scratch, not prompt-engineered. But "built in" and "tuned for your specific tools" are different things.

If you have a set of internal APIs, a specific tool schema, or edge-case behaviors that the base model handles inconsistently, fine-tuning on your own function-calling data is the right move. TRL (Transformer Reinforcement Learning library) added multimodal tool response support in the same release window as Gemma 4, making this the first time you can fine-tune a multimodal model on tool use — including image outputs from tools.

This guide walks through the full pipeline: data format, fine-tuning with QLoRA, and evaluation.

What TRL's Multimodal Tool Support Actually Adds

Before this update, TRL's SFTTrainer (Supervised Fine-Tuning) could train on text tool calls and text tool responses. The new version adds:

Image outputs from tools — a tool that returns an image (chart, rendered diagram, screenshot) can now be part of the training trajectory
Multi-turn tool use sequences — full call → response → reasoning → next call chains
Gemma 4 chat template — native support for Gemma 4's system/user/assistant/tool format

For pure text tool calling, the older approach still works. The multimodal support matters when your agent needs to reason about visual outputs from tool calls.

Prerequisites

# Quote the version specifiers so the shell does not treat ">" as a redirect
pip install "trl>=0.12.0" "transformers>=4.50" datasets peft accelerate bitsandbytes

For 4-bit quantization (recommended for consumer GPUs):

pip install "bitsandbytes>=0.43.0"

Data Format

Your training data needs to be structured as multi-turn conversations with tool calls and responses. TRL's SFTTrainer expects the Gemma 4 chat format:

# Each example is a list of messages
example = [
    {
        "role": "system",
        "content": "You are a data analysis assistant with access to a Python code executor and a charting tool."
    },
    {
        "role": "user",
        "content": "What's the trend in our monthly revenue for Q1 2026?"
    },
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "type": "function",
                "function": {
                    "name": "execute_python",
                    "arguments": '{"code": "import pandas as pd\\ndf = pd.read_csv(\'revenue.csv\')\\nprint(df[df[\'quarter\']==\'Q1\'].groupby(\'month\')[\'revenue\'].sum())"}'
                }
            }
        ]
    },
    {
        "role": "tool",
        "name": "execute_python",
        "content": "month\nJanuary     842000\nFebruary    891000\nMarch       967000\nName: revenue, dtype: int64"
    },
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "type": "function",
                "function": {
                    "name": "create_chart",
                    "arguments": '{"data": [842000, 891000, 967000], "labels": ["Jan", "Feb", "Mar"], "type": "line", "title": "Q1 2026 Monthly Revenue"}'
                }
            }
        ]
    },
    {
        "role": "tool",
        "name": "create_chart",
        "content": [
            {"type": "image", "image": "<PIL.Image object>"},  # chart image
            {"type": "text", "text": "Chart generated successfully"}
        ]
    },
    {
        "role": "assistant",
        "content": "Q1 revenue shows a clear upward trend: January $842K → February $891K → March $967K. That's 14.8% growth over the quarter, with acceleration in March (+8.5% vs February's +5.8%). The chart shows a steepening curve, which is a positive signal."
    }
]

The key difference from text-only tool calling: the tool role message can now contain {"type": "image", ...} content — the image output from the tool becomes part of the training trajectory.
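
Because the trainer consumes these conversations as-is, it can help to sanity-check each example before training. The sketch below is one possible check, not part of TRL: `validate_conversation` is a hypothetical helper that verifies every `arguments` string parses as JSON and that every `tool` message answers a call made by the preceding assistant turn.

```python
import json

def validate_conversation(messages):
    """Return a list of problems found in one training example (empty list = OK)."""
    problems = []
    pending = []  # tool names requested by the most recent assistant turn
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role == "assistant":
            pending = []
            for call in msg.get("tool_calls") or []:
                fn = call.get("function", {})
                try:
                    json.loads(fn.get("arguments", ""))
                except (TypeError, json.JSONDecodeError):
                    problems.append(
                        f"message {i}: arguments for {fn.get('name')!r} are not valid JSON"
                    )
                pending.append(fn.get("name"))
        elif role == "tool":
            if msg.get("name") in pending:
                pending.remove(msg.get("name"))
            else:
                problems.append(
                    f"message {i}: tool response {msg.get('name')!r} has no matching call"
                )
    return problems
```

Running it over every example and dropping (or fixing) the ones that return a non-empty list is a cheap way to catch malformed arguments before they end up in the training signal.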

Building a Training Dataset

For fine-tuning to work, you need enough examples to cover your tool schemas and the edge cases you care about. A minimum viable dataset for a specific domain is usually 100-500 examples.

from datasets import Dataset

def build_training_examples():
    """
    Build your domain-specific function calling examples.
    Each example should demonstrate:
    1. Correct tool selection
    2. Correct argument formatting
    3. Correct reasoning from tool output
    4. Edge case handling (tool errors, multiple calls, no tool needed)
    """
    examples = []

    # Example: correct tool selection
    examples.append([
        {"role": "system", "content": "You have access to: get_weather(city: str), search_web(query: str), calculator(expression: str)"},
        {"role": "user", "content": "What's 15% of $847?"},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [{"type": "function", "function": {"name": "calculator", "arguments": '{"expression": "847 * 0.15"}'}}]
        },
        {"role": "tool", "name": "calculator", "content": "127.05"},
        {"role": "assistant", "content": "15% of $847 is **$127.05**."}
    ])

    # Example: no tool needed
    examples.append([
        {"role": "system", "content": "You have access to: get_weather(city: str), search_web(query: str), calculator(expression: str)"},
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."}
        # No tool call — model should answer directly from knowledge
    ])

    # Example: tool error handling
    examples.append([
        {"role": "system", "content": "You have access to: get_weather(city: str)"},
        {"role": "user", "content": "What's the weather in Atlantis?"},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [{"type": "function", "function": {"name": "get_weather", "arguments": '{"city": "Atlantis"}'}}]
        },
        {"role": "tool", "name": "get_weather", "content": '{"error": "City not found"}'},
        {"role": "assistant", "content": "I couldn't retrieve weather for Atlantis — it wasn't found in the weather service. Could you double-check the city name?"}
    ])

    return examples

raw_examples = build_training_examples()
dataset = Dataset.from_dict({"messages": raw_examples})

QLoRA Fine-Tuning with SFTTrainer

import torch
from transformers import AutoTokenizer, AutoModelForImageTextToText, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

model_id = "google/gemma-4-E4B-it"

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA config — target the attention and MLP projection layers
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~20M || all params: ~8B || trainable%: ~0.25%

# Training config
training_config = SFTConfig(
    output_dir="./gemma4-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,        # effective batch size: 8
    learning_rate=2e-4,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    max_seq_length=4096,
    dataset_text_field=None,              # we're using messages format
    remove_unused_columns=False,
)

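trainer = SFTTrainer(
    model=model,                 # already wrapped by get_peft_model above, so peft_config is not passed again
    args=training_config,
    train_dataset=dataset,
    processing_class=tokenizer,  # TRL 0.12+ takes processing_class; the older tokenizer= argument is deprecated
)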

trainer.train()
trainer.save_model("./gemma4-finetuned")

Memory requirements on E4B:

4-bit quantized base model: ~4GB
LoRA adapters + optimizer states: ~6GB
Activations + gradient checkpointing: ~4GB
Total: ~14GB — fits on a 16GB consumer GPU

Evaluating Tool Call Accuracy

After fine-tuning, evaluation should measure the things that matter:

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="./gemma4-finetuned",
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

import json
import re

def evaluate_tool_calling(test_cases: list[dict]) -> dict:
    results = {"correct_tool": 0, "correct_args": 0, "no_hallucination": 0, "total": 0}

    for case in test_cases:
        generated = pipe(case["messages"], max_new_tokens=256)[0]["generated_text"]
        # With chat-style input the pipeline returns the whole conversation as a
        # list of messages; the model's reply is the last entry.
        response = generated[-1]["content"] if isinstance(generated, list) else generated
        response = response or ""

        # Check: did it call the right tool?
        expected_tool = case["expected_tool"]
        called_right_tool = expected_tool in response if expected_tool else "tool_calls" not in response

        # Check: were the arguments well-formed JSON?
        # (These patterns assume tool calls are serialized as JSON in the reply.)
        args_match = re.search(r'"arguments":\s*"({.*?})"', response)
        # When no call is expected and none was made, there is nothing to validate.
        valid_args = not expected_tool and args_match is None
        if args_match:
            try:
                json.loads(args_match.group(1).encode().decode('unicode_escape'))
                valid_args = True
            except (json.JSONDecodeError, UnicodeDecodeError):
                pass

        # Check: did it hallucinate a tool not in the schema?
        available_tools = case.get("available_tools", [])
        called_tools = re.findall(r'"name":\s*"(\w+)"', response)
        hallucinated = any(t not in available_tools for t in called_tools)

        results["total"] += 1
        results["correct_tool"] += int(called_right_tool)
        results["correct_args"] += int(valid_args)
        results["no_hallucination"] += int(not hallucinated)

    if results["total"] == 0:
        return {}
    return {k: v/results["total"] for k, v in results.items() if k != "total"}

metrics = evaluate_tool_calling(test_cases)
print(f"Correct tool selection: {metrics['correct_tool']:.1%}")
print(f"Valid argument JSON:    {metrics['correct_args']:.1%}")
print(f"No hallucinated tools:  {metrics['no_hallucination']:.1%}")
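
The call above uses a `test_cases` list that the snippet never defines. Judging from the keys the function reads (`messages`, `expected_tool`, `available_tools`), each case could look like the sketch below. The prompts are hypothetical and simply mirror the training examples from earlier; the list needs to be defined before `evaluate_tool_calling` is called, and the cases should be held out from the training set.

```python
# Hypothetical held-out cases; the keys match what evaluate_tool_calling reads.
SYSTEM_PROMPT = "You have access to: get_weather(city: str), calculator(expression: str)"
AVAILABLE_TOOLS = ["get_weather", "calculator"]

test_cases = [
    {
        # A tool call is expected here
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "What's 15% of $847?"},
        ],
        "expected_tool": "calculator",
        "available_tools": AVAILABLE_TOOLS,
    },
    {
        # No tool is needed; the model should answer directly
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "What's the capital of France?"},
        ],
        "expected_tool": None,
        "available_tools": AVAILABLE_TOOLS,
    },
]
```

Mixing cases that expect a call with cases that expect none keeps the "correct tool selection" metric from rewarding a model that calls a tool every time.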

What Fine-Tuning Buys You Here

Gemma 4's base function-calling capability (86.4% agentic tool use on benchmarks) is already strong. Fine-tuning is worth doing when:

Your tool schema is unusual. If your tools have nested objects, enum parameters, or optional fields that the base model handles inconsistently, SFT on your schema stabilizes behavior.

You need edge case control. "When no tool is needed, answer directly" is a policy decision. "When the tool returns an error, do X not Y" is a policy decision. Fine-tuning encodes these policies reliably.

You have domain-specific tool semantics. A create_report function in your system means something specific to your domain. The base model doesn't know that.

You need multimodal tool outputs. If your pipeline includes tools that return images (charts, rendered documents, screenshots), the TRL multimodal support is the only path to training on those trajectories.

Exporting the LoRA Adapter

# Merge LoRA into base model for deployment
from peft import PeftModel

base_model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)
merged_model = PeftModel.from_pretrained(base_model, "./gemma4-finetuned")
merged_model = merged_model.merge_and_unload()
merged_model.save_pretrained("./gemma4-merged")
tokenizer.save_pretrained("./gemma4-merged")

# Or keep as adapter for smaller disk footprint
# Just load the base + apply the adapter at inference time

The LoRA adapter is ~80MB. The base model is ~8GB. For deployment, keeping them separate is often more practical — you can swap adapters for different tool schemas without reloading the base model.

The Combination That Makes This Interesting

Gemma 4 with native function calling + thinking mode + multimodal tool outputs + fine-tuning for your specific schema is a stack that simply didn't exist six months ago in open-weight form.

An agent that reasons about image outputs from tools, fine-tuned to your internal APIs, running locally with no data leaving your infrastructure: that's the practical combination this tutorial enables.

The pieces are all available. This is how you connect them.

Gemma 4 Scored 89.2% on AIME. Here's Why That Number Should Change How You Think About Open-Source AI

pulkitgovrani — Sun, 24 May 2026 12:41:25 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

AIME (the American Invitational Mathematics Examination) is an invitation-only exam for roughly the top 5% of scorers on the AMC high school contests in the US. Every answer is an integer from 0 to 999, but getting there takes multi-step reasoning and a comfort with number theory, combinatorics, and geometry that most adults don't have.

Gemma 3 scored 20.8% on AIME 2026.

Gemma 4 scored 89.2%.

That's not an incremental improvement. That's a qualitative category change — and it happened in one model generation, in an open-weight model that runs on a consumer GPU.

Here's what I think that actually means.

The Numbers, In Full

Don't just take AIME. Look at the whole picture:

Benchmark	Gemma 3 27B	Gemma 4 31B	What it measures
AIME 2026	20.8%	89.2%	Competition math
GPQA Diamond	42.4%	84.3%	Expert science QA
Codeforces ELO	110	2150	Competitive programming
Agentic Tool Use	6.6%	86.4%	Multi-step tool calling
MMLU Pro	—	85.2%	Professional knowledge
LiveCodeBench v6	—	80.0%	Real-world coding

A Codeforces ELO of 2150 falls in the Master band (2100-2299), a tier that only a small fraction of rated competitors reach; Grandmaster starts at 2400. Gemma 3 at ELO 110 was essentially a beginner. Gemma 4 at 2150 would outscore most professional software engineers in a competitive programming contest.

The agentic tool use jump — 6.6% to 86.4% — is the one that matters most for developers. That's not an academic benchmark. That's the model's ability to chain together tool calls, handle errors, and complete multi-step tasks autonomously. An agent that succeeds 86% of the time on agentic tasks is a practical agent. One that succeeds 6.6% of the time is a toy.

What Changed Between Gemma 3 and Gemma 4

This wasn't just more compute and more data. The architectural and training changes were substantial:

Thinking mode. Gemma 4 was trained with chain-of-thought reasoning built in: it can spend 4,000 or more tokens working through a problem before committing to an answer. AIME at 20.8% is what you get from a model that answers immediately. AIME at 89.2% is what you get from a model that has 4,000 tokens of scratch paper.

Native function calling. Agentic tool use going from 6.6% to 86.4% is almost entirely explained by this. Gemma 3 wasn't trained for function calling — it was prompted into it. Gemma 4 was trained with tool use as a first-class objective.

MoE architecture. The 26B A4B MoE model achieves 88.3% on AIME, nearly matching the 31B dense model, while activating only 4B parameters per token. One plausible reading is that expert routing is doing real work, with the router sending math-heavy tokens to the experts that handle them best.

256K context. Multi-step reasoning problems often require holding a complex state across many reasoning steps. More context = less information loss as the reasoning chain grows.

None of these are incremental improvements to the same approach. They're a different approach.

The Open-Source Gap Is Closing Faster Than Anyone Expected

A year ago, the conventional wisdom was: open-source models are 6-12 months behind the frontier, they'll stay there, and for anything serious you need GPT-4 or Claude.

Here's what Gemma 4 31B benchmarks against:

Model	AIME 2026	GPQA Diamond	Codeforces ELO
Gemma 4 31B	89.2%	84.3%	2150
GPT-4o (May '24)	~56%	~53%	~900
Claude 3.5 Sonnet	~68%	~65%	~1200

I want to be careful here: these benchmarks are not identical versions tested simultaneously, and model capabilities change with updates. The point isn't "Gemma 4 beats GPT-4o on everything." It's that a locally-runnable, open-weight model is now in the same conversation as frontier commercial models on the hardest reasoning tasks.

A year ago that was not true. The gap was not closing this fast.

Why This Matters for Developers Specifically

You can run reasoning-capable AI on your hardware.

The 26B A4B fits on a 16GB GPU at 4-bit quantization and achieves 88.3% AIME. That's not a cloud service. That's not a $200/month subscription. It's an Ollama command.

The data never leaves your machine.

For a lot of real reasoning tasks — auditing financial models, analyzing proprietary codebases, processing sensitive documents — the reason you don't send them to GPT-4 isn't cost. It's that the data can't leave your infrastructure. A locally-runnable model at this capability level removes that barrier.

The licensing is Apache 2.0.

Build with it commercially. Fine-tune it. Distribute it. The benchmark improvement doesn't come with new licensing restrictions.

Agents that actually work.

86.4% agentic tool use success rate means you can build multi-step automated pipelines that are reliable enough to deploy. 6.6% means you're debugging agent failures constantly. This is the practical inflection point for building AI agents with an open-weight model.

The Honest Counterpoint

Benchmark performance and real-world task performance are not the same thing.

AIME 89.2% tells you the model can solve structured, well-defined math problems with clear right/wrong answers. It says less about:

Ambiguous tasks where the "correct" answer is subjective
Novel problem types not represented in training
Long-running, multi-day autonomous tasks
Tasks that require external world knowledge past the training cutoff

Codeforces ELO 2150 tells you the model writes excellent competitive programming solutions. It says less about:

Large-scale software architecture decisions
Debugging complex distributed systems
Understanding poorly-documented legacy code

The model is genuinely excellent at structured reasoning. It's not a general-purpose replacement for a senior engineer. These things are both true simultaneously.

What I Think This Actually Signals

The story of AI progress so far has been: capability concentrates at the frontier, the frontier is closed, open-source catches up slowly. The assumption baked into most developer tooling decisions is that if you need serious capability, you go to the API.

Gemma 4 disrupts that assumption. Not completely — the absolute frontier is still ahead — but enough to change the calculus for a large class of applications.

If your application needs:

Math or logical reasoning
Code generation and review
Tool-calling agents
Structured information extraction

...then Gemma 4 is now a legitimate option where it simply wasn't before. Not a compromise. Not "good enough for prototyping." A legitimate option.

The AIME score is a proxy for something more important: the capability level where local, private, open-weight AI becomes the right choice for production use cases, not just experimentation. Gemma 4 crossed it.

That's the story the 89.2% is telling.

Gemma 4's Audio and Video Inputs: A Hands-On Guide Nobody Has Written Yet

pulkitgovrani — Sun, 24 May 2026 12:40:48 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Most coverage of Gemma 4's multimodal capabilities stops at images. That's understandable — image input is the most obvious thing to demo. But Gemma 4 E2B and E4B ship with two more input modalities that are genuinely novel for a local, open-weight model: native audio input (up to 30 seconds) and video input (up to 60 seconds via frame sampling).

This guide covers what these actually support, how to use them in code, and what practical tasks they open up — with honest notes on where the current implementation has limits.

The Architecture: What Makes Audio Work

Audio input in E2B and E4B is handled by a dedicated encoder — a USM-style conformer with approximately 300M parameters, trained separately and connected to the language model via a projection layer.

This is not "transcribe the audio then feed the transcript." The audio encoder produces continuous representations that the language model processes alongside text tokens. The model can reason about tone, pace, and audio quality — not just words.

The vision encoder (~150M params for E2B/E4B) handles both images and video frames using the same architecture, with variable aspect ratio support.

The 26B A4B and 31B models do not have audio input — only image and video. Audio is E2B/E4B exclusive.

Setup

pip install transformers>=4.50 torch accelerate pillow librosa soundfile

For video frame extraction:

pip install opencv-python

Audio Input: Transcription, QA, and More

Basic audio question answering

import torch
import librosa
import numpy as np
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "google/gemma-4-E4B-it"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load audio; sr=16000 makes librosa resample to 16 kHz (its default would be 22.05 kHz)
audio_array, sr = librosa.load("interview.wav", sr=16000, mono=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_array},
            {"type": "text", "text": "What is the main topic of this audio clip? Summarize in 3 bullet points."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=512)

response = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)

Transcription with speaker context

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_array},
            {"type": "text", "text": "Transcribe this audio verbatim. Mark speaker changes with [Speaker A] and [Speaker B]."}
        ]
    }
]

Sentiment and tone analysis

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_array},
            {"type": "text", "text": "Analyze the emotional tone of this audio. Is the speaker confident, uncertain, frustrated? What specific vocal cues indicate this?"}
        ]
    }
]

This is where native audio beats transcript-based approaches — the model can respond to pacing, hesitation, pitch changes, and delivery, not just words.

Video Input: Understanding Motion and Sequence

Video is handled by sampling frames at a configurable rate and processing them through the vision encoder. The model receives frames as a sequence, giving it temporal context.

Frame extraction helper

import cv2
import numpy as np
from PIL import Image

def extract_frames(video_path: str, num_frames: int = 16) -> list[Image.Image]:
    """Sample `num_frames` evenly spaced frames from a video."""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    frames = []

    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame_rgb))

    cap.release()
    return frames

Video QA

frames = extract_frames("cooking_tutorial.mp4", num_frames=16)

# Build content with all frames + question
content = [{"type": "image", "image": frame} for frame in frames]
content.append({"type": "text", "text": "What dish is being prepared? List the ingredients used in order of appearance."})

messages = [{"role": "user", "content": content}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=1024)

print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Step-by-step action description

content = [{"type": "image", "image": frame} for frame in frames]
content.append({
    "type": "text",
    "text": "Describe the sequence of actions in this video as numbered steps. Be specific about what happens between each frame."
})

Combining Audio and Video

For videos with audio tracks, you can pass both:

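# Reading audio straight from an .mp4 needs an ffmpeg-capable backend; if it fails, extract the track to WAV first
audio_array, sr = librosa.load("presentation.mp4", sr=16000, mono=True)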
frames = extract_frames("presentation.mp4", num_frames=12)

content = [{"type": "image", "image": frame} for frame in frames]
content.append({"type": "audio", "audio": audio_array})
content.append({
    "type": "text",
    "text": "The audio is from the same presentation shown in the frames. Does the speaker's verbal explanation match what's shown on the slides? Note any discrepancies."
})

This is the most novel use case — cross-modal consistency checking. You can ask: "Does the speaker's tone match the content?" or "What's happening visually when the speaker sounds uncertain?"

Practical Applications

Customer support QA: Feed audio recordings of support calls. Ask: "Was this issue resolved? What was the customer's emotional state at the start vs end?"

Video content moderation: Sample frames from user-uploaded video. Ask: "Does this video contain [category]? Describe what you see."

Lecture summarization: Audio of a lecture + slides as frames. Ask: "Summarize the key points from this lecture, connecting what the speaker said with what was shown."

Meeting notes from recordings: Audio transcription with speaker attribution and topic segmentation — without sending audio to an external API.

Security camera analysis: Frame sequence from a clip. Ask: "Describe the sequence of events. Is this activity consistent with normal patterns?"

Honest Limitations

The 30-second audio cap is real. For anything longer, you need to chunk the audio and either process chunks independently or summarize and chain the outputs. There's no built-in long-audio support in the current E2B/E4B checkpoint.
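
One way to do the chunking is to compute window boundaries up front and slice the array. The sketch below assumes 16 kHz mono audio as loaded earlier; `chunk_bounds` and its `overlap_seconds` parameter are hypothetical names, and a small overlap is just one option for avoiding words cut at a boundary.

```python
def chunk_bounds(num_samples, sr=16000, max_seconds=30.0, overlap_seconds=0.0):
    """Return (start, end) sample indices for windows no longer than max_seconds."""
    window = int(max_seconds * sr)
    step = window - int(overlap_seconds * sr)
    if window <= 0 or step <= 0:
        raise ValueError("max_seconds must be positive and larger than overlap_seconds")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the clip
        start += step
    return bounds
```

Usage would look like `chunks = [audio_array[s:e] for s, e in chunk_bounds(len(audio_array))]`, with each chunk sent through the same `messages` structure as before and the per-chunk answers summarized in a final text-only call.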

Frame count affects VRAM significantly. Each video frame is processed through the vision encoder and consumes context window budget. At 16 frames, a 60-second clip is fine on 16GB. At 32 frames, you'll push E4B's limits. Be conservative with frame counts.

No timestamp awareness. When processing frames, the model doesn't receive explicit timestamps. You can embed timestamps in text captions alongside frames, but it's manual:

content = []
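duration = 60  # clip length in seconds
for i, frame in enumerate(frames):
    # extract_frames samples from the first frame to the last, so spread timestamps over [0, duration]
    timestamp = (i / max(len(frames) - 1, 1)) * duration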
    content.append({"type": "image", "image": frame})
    content.append({"type": "text", "text": f"[{timestamp:.1f}s]"})
content.append({"type": "text", "text": "Your question here"})

Audio quality matters more than you'd expect. The encoder handles background noise reasonably well, but heavily compressed audio (low-bitrate voice memos, phone call recordings) reduces transcription quality noticeably compared to clean recordings.

The Part That's Actually New

Audio input in a locally-runnable, Apache 2.0, open-weight model is new. Before Gemma 4, building a pipeline that processes audio through a local model meant separate transcription (Whisper) → text → language model. Two models, two inference steps, text-only reasoning.

Gemma 4 E4B collapses that into one model that can reason across modalities simultaneously. The cross-modal consistency check (does the audio match the video?) simply wasn't possible with the separate-model approach.

That's not a small thing.

Gemma 4 26B A4B: What "Mixture of Experts" Actually Means for Your Inference Budget

pulkitgovrani — Sun, 24 May 2026 12:40:10 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Gemma 4's most interesting model isn't the 31B flagship. It's the 26B A4B — a Mixture-of-Experts model that activates only 4 billion parameters per token while delivering performance nearly identical to the dense 31B.

If that sounds like magic, it's not. But the engineering behind it is worth understanding, because it changes what hardware you need to run a near-frontier model locally.

Dense vs MoE: The Core Difference

In a standard dense transformer (like Gemma 4 31B), every token that passes through the model activates every parameter. All 31 billion of them, every forward pass.

In a Mixture-of-Experts model, the network is split into a large pool of "expert" sub-networks. Each token is routed — by a learned gating function — to a small subset of those experts. Only the selected experts do computation for that token.

The Gemma 4 26B A4B has:

128 total expert sub-networks
8 experts activated per token (the "A4B" in the name refers to the ~4B parameters active per token)
26B total parameters in the full model

During inference, you're doing the compute of roughly a 4B model. But the model has 26B parameters of learned knowledge available to route between.

Dense 31B:  [token] → ALL 31B params → output
            Compute per token: proportional to 31B params

MoE 26B A4B: [token] → router → 8 of 128 experts → output
             Compute per token: proportional to ~4B active params
             But knowledge from: 26B params
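
The figures in the diagram can be turned into ratios. This is a small sketch using only the numbers stated in this section (128 experts, 8 active, 26B total, ~4B active):

```python
total_experts, active_experts = 128, 8
total_params_b, active_params_b = 26, 4  # billions, as stated above

expert_share = active_experts / total_experts   # share of experts used per token
param_share = active_params_b / total_params_b  # share of parameters used per token

print(f"experts active per token:    {expert_share:.1%}")
print(f"parameters active per token: {param_share:.1%}")
```

The parameter share (about 15%) comes out larger than the expert share (about 6%). A likely explanation, assumed here rather than taken from Gemma 4's documentation, is that attention layers, embeddings, and the router itself run for every token regardless of which experts are selected.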

Why This Matters for VRAM

This is where things get practical. VRAM requirements are dominated by parameter count in memory, not by compute per token.

The 26B A4B still needs to hold all 26B parameters in memory — or at least the layers that might be needed for any given batch. At bfloat16, that's ~52GB. At 4-bit quantization (Q4_K_M), it's roughly 13-14GB.

Compare to the dense 31B at 4-bit: ~17-18GB.

So you save meaningful VRAM versus the dense 31B, and you get near-identical output quality. The tradeoff compared to a true 4B dense model: you need roughly 5x the VRAM (about 14GB versus 3GB at 4-bit), in exchange for much stronger benchmark results.

Model	Active params	VRAM (bf16)	VRAM (Q4)	AIME 2026
Gemma 4 E4B	4.5B	~9GB	~3GB	—
Gemma 4 26B A4B	4B active	~52GB	~14GB	88.3%
Gemma 4 31B	31B	~62GB	~17GB	89.2%

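For the 26B A4B: a 16GB consumer GPU (such as the RTX 4080) can run it at 4-bit, and a 24GB card (such as the RTX 4090) leaves more room for context. A Mac with 32GB unified memory runs it comfortably at 4-bit; 8-bit needs about 26GB for the weights alone, which is a tight fit. No multi-GPU setup required.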
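
The VRAM columns follow from simple arithmetic: parameter count times bits per parameter. A rough sketch, counting weights only and using the parameter counts from the table (`weight_memory_gb` is a hypothetical helper name):

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8

models = [("Gemma 4 E4B", 4.5), ("Gemma 4 26B A4B", 26), ("Gemma 4 31B", 31)]
for name, params in models:
    bf16 = weight_memory_gb(params, 16)
    q4 = weight_memory_gb(params, 4)
    print(f"{name}: bf16 ~{bf16:.0f} GB, 4-bit ~{q4:.1f} GB")
```

The bf16 results match the table. The 4-bit results come out a little lower than the table's figures, which is expected: practical 4-bit formats such as Q4_K_M store scaling factors and keep some tensors at higher precision, and the KV cache and activations need memory on top of the weights.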

Running the 26B A4B Locally

Ollama

ollama pull gemma4:26b
ollama run gemma4:26b

Ollama handles quantization automatically. On a 16GB GPU it applies Q4 by default.

llama.cpp

# Download the quantized GGUF
huggingface-cli download unsloth/gemma-4-26b-a4b-it-GGUF \
  --local-dir ./gemma4-26b \
  --include "gemma-4-26b-a4b-it-Q4_K_M.gguf"

# Run
llama-server \
  -m ./gemma4-26b/gemma-4-26b-a4b-it-Q4_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 40 \
  --host 0.0.0.0 \
  --port 8080

MLX (Apple Silicon)

pip install mlx-lm
mlx_lm.generate \
  --model mlx-community/gemma-4-26b-a4b-it-4bit \
  --prompt "Explain the tradeoffs between B-trees and LSM-trees for write-heavy workloads" \
  --max-tokens 1024

On an M3 Max (128GB), this runs at 30-40 tokens/second. On an M4 Pro (48GB), 20-30 t/s at 4-bit.

How the Router Works

The gating network is a small learned linear layer that maps each token's hidden state to a score over all 128 experts. The top-8 scoring experts are selected, their outputs are weighted by their gate scores, and the weighted sum is the layer output.

# Simplified MoE forward pass (conceptual)
def moe_forward(x, experts, gate):
    # x: [batch, seq_len, hidden_dim]
    scores = gate(x)                          # [batch, seq_len, 128]
    top_k_scores, top_k_idx = scores.topk(8)  # select 8 experts
    top_k_scores = F.softmax(top_k_scores, dim=-1)

    output = torch.zeros_like(x)
    for i, expert_idx in enumerate(top_k_idx.unbind(-1)):
        expert_output = experts[expert_idx](x)
        output += top_k_scores[..., i:i+1] * expert_output

    return output

The interesting part: different experts specialize during training. Some become better at code, some at reasoning, some at factual recall. The router learns to dispatch accordingly — a form of implicit task routing without any explicit labeling.

The Latency Picture: Where MoE Wins and Where It Doesn't

MoE wins on throughput (batch inference): When processing many requests simultaneously, the reduced compute per token means you can serve more requests per second on the same hardware.

MoE is roughly equal on single-token latency: The routing overhead is small, and you're still doing full attention across the sequence.

MoE loses on memory bandwidth: All 26B parameters sit in VRAM. If your GPU's memory bandwidth is the bottleneck (common on consumer GPUs), you pay the full bandwidth cost even though you only activate 4B params per forward pass.

The practical upshot: for a local inference server handling multiple concurrent users, the 26B A4B is a better choice than the 31B dense. For single-user interactive use, they'll feel similar. For production batch jobs, the 26B A4B has a clear advantage.

Comparing 26B A4B vs Dense Alternatives

The natural comparison isn't just "vs Gemma 4 31B" — it's "vs everything else you could run at this VRAM budget."

At ~14GB Q4:

Gemma 4 26B A4B (AIME 88.3%, Codeforces 2100+ ELO)
Llama 3.3 70B Q2 (heavy quantization, quality degrades significantly)
Qwen 2.5 14B (good, but single architecture with ~70B-scale quality gap)
Mistral Small 22B Q4 (~12GB, strong but narrower multimodal support)

For reasoning tasks specifically — math, code, multi-step logic — the 26B A4B has a substantial quality lead over everything else in the 14-16GB VRAM bracket. That's the genuine breakthrough of the MoE architecture here.

When to Use 26B A4B vs 31B Dense

Use 26B A4B when:

You're running on a single 16GB consumer GPU
You need high throughput (multiple concurrent users)
Your tasks are reasoning-heavy (math, code, logic) — this is where MoE specialization shines
You're on Apple Silicon with 32-64GB unified memory

Use 31B dense when:

You have 24GB+ VRAM (or multi-GPU)
You need maximum consistency across diverse task types
You're doing long-context work where 256K context matters and you have the VRAM headroom
You're fine-tuning — dense models are easier to fine-tune than MoE (routing can shift unpredictably during LoRA adaptation)

The Architecture Bet That Paid Off

MoE is not new — it's been in research since the 1990s and production since Switch Transformer. What's new is Google making it work at this quality level in an open-weight model at a size that fits on consumer hardware.

The 26B A4B is the clearest evidence yet that "bigger dense model" is not the only path to capability. For the developer who doesn't have a multi-GPU server, this model is the reason Gemma 4 is meaningfully different from every open-weight release before it.

Gemini Omni's Conversational Video Editing Is a Paradigm Shift — And Nobody's Ready for It

pulkitgovrani — Sun, 24 May 2026 12:40:00 +0000

This is a submission for the Google I/O 2026 Challenge: Explore Google I/O 2026

At Google I/O 2026, Google announced Gemini Omni: a unified multimodal model that generates ~10-second video clips with synchronized audio from text, image, and audio inputs.

Every tech company has a video generation model now. That's not the story.

The story is conversational editing — and if you actually sit with what it means, it changes how you think about the entire creative workflow.

What Conversational Editing Is

Traditional video editing — including AI-assisted editing up to now — works like this: you have a timeline. You make cuts. You apply effects. You render. Every change is a discrete operation on a fixed artifact.

Gemini Omni works differently. You describe what you want changed in natural language, and the model re-renders the scene understanding the change in context.

Not a filter. Not a compositing layer. A re-render.

"Remove the person standing in the background" → the background fills in correctly based on what should physically be there
"Change the lighting to late afternoon golden hour" → the shadows, highlights, and color temperature all shift consistently
"Make this look like it was shot in the 1970s" → the grain, color grading, and aspect ratio update coherently

The model understands physics, geometry, and temporal consistency. It's not image-editing each frame independently. It's reasoning about the scene.

Why This Is a Bigger Deal Than Video Generation

Video generation — "create a 10-second clip of a sunset over mountains" — is impressive and already commoditized. Multiple models do this.

Conversational editing changes the workflow:

With traditional AI video generation:
Generate → unhappy with result → regenerate with different prompt → still not quite right → try a different model → accept an imperfect result

With conversational editing:
Generate → "the lighting is wrong" → Omni fixes the lighting in the same clip → "actually make the mountains more dramatic" → Omni adjusts → done

The difference isn't the quality of any single generation. It's that you can iterate on a video the same way you iterate on text in a document. Ctrl+Z exists for video now. You can have an artistic direction conversation with the model.

The Digital Avatar Feature Is the Genuinely Surprising Part

Buried slightly beneath the conversational editing headline is digital avatar creation in Gemini Omni Flash.

You record yourself — speaking numbers, looking in different directions — and Gemini Omni creates an avatar with:

Consistent identity across scenes (your face, your voice)
Consistent voice preservation
The ability to say things you didn't record

The deepfake-prevention onboarding is real: Google requires the recording step specifically to make it difficult to create avatars of people who haven't consented. It's not perfect protection, but it's more friction than nothing.

What this unlocks for legitimate use: creators who want consistent video output without filming themselves every time. A course creator could record their avatar once and produce lecture videos without appearing on camera for each one. The identity consistency across Gemini Omni Flash's output is the technical property that makes this viable rather than gimmicky.

Who Gets Access and When

All Google AI Plus, Pro, and Ultra subscribers: rolling out via the Gemini app and Google Flow now
YouTube Shorts and YouTube Create App: rolling out free to all users
The free rollout to YouTube is the most significant distribution decision — it puts conversational video editing in front of hundreds of millions of creators, not just paid subscribers

The YouTube integration is where I'd watch carefully. YouTube Shorts is already competing with TikTok and Instagram Reels for creator attention. If conversational editing becomes a native Shorts feature, it changes the production floor for short-form video the same way Instagram's filters changed the production floor for photos in 2012 — anyone can produce something that looks intentional.

The Integration with Google Flow

Google Flow is Google's AI creative studio for video, and it received a major update at I/O 2026 with Gemini Omni and Veo 3.1.

The combination is interesting:

Veo 3.1 handles high-quality, cinematic generation
Gemini Omni handles the conversational editing layer on top
Flow Tools adds custom AI agents that can execute multi-step editing workflows
Available on Android (beta) and iOS for Flow Music

What this looks like in practice: you generate a base scene in Veo 3.1, hand it to Gemini Omni for conversational refinement, and a Flow agent handles the export and packaging. That's a complete short-form video production pipeline in one tool, accessible from a phone.

My Honest Critique

The 10-second clip limit is the real constraint.

Conversational editing on a 10-second clip is a demo. Conversational editing on a 5-minute video is a product. The technical challenges of maintaining temporal and identity consistency across minutes of footage are substantially harder than 10 seconds. Google will get there, but "when" is the question the I/O 2026 announcement didn't answer.

Physics understanding is selective.

The demos showed impressive scene-level reasoning — lighting, background filling, shadow direction. But the model's physics understanding has failure modes. Complex interactions (liquid, cloth physics, realistic human motion in unusual poses) are where current video models still produce uncanny results. Conversational editing fixes obvious changes cleanly; nuanced corrections are still hit-or-miss.

The creative control ceiling is low for professional work.

For social media creators, conversational editing is transformative. For professional film and video production, the control granularity is still far below what editors expect. "Make the lighting more dramatic" is a natural language command — a DaVinci Resolve node graph for lift/gamma/gain with specific RGB values is not. The two audiences are different.

The Takeaway

Gemini Omni's headline is video generation. The real story is that editing — historically the highest-skill, highest-time part of video production — is becoming conversational.

That's not an incremental improvement. It's a workflow change. The question for every creator, developer building on the API, and product team thinking about video features is: what does your product look like when editing is a chat interface rather than a timeline?

The answer to that question is worth thinking about now, before everyone else does.

Links: Introducing Gemini Omni — Google Blog · Google Flow · Google I/O 2026

Gemma 4's Thinking Mode: A Practical Guide to the `<|think|>` Token

pulkitgovrani — Sun, 24 May 2026 12:38:37 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Gemma 4 ships with built-in reasoning — a configurable chain-of-thought that runs before the model gives you an answer. It's not a separate model, not a system prompt trick, and not a post-processing layer. It's a control token trained into the model from scratch.

Here's how to actually use it, when it's worth enabling, and how to tune the thinking budget so you're not burning 4,000 tokens on a yes/no question.

What Thinking Mode Is

When reasoning is enabled, Gemma 4 generates an internal chain-of-thought — up to 4,000+ tokens of "working out loud" — before producing its final answer. The reasoning tokens are visible in the output but clearly delimited, so you can surface them to users or strip them silently depending on your use case.

This is the same pattern as DeepSeek-R1 and Claude's extended thinking, now available locally on an Apache 2.0 model.

Enabling It: Two Ways

Method 1 — System prompt token (any backend)

Add <|think|> to your system prompt. Works with any backend that serves Gemma 4 — Ollama, llama.cpp, vLLM, LM Studio:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-E4B-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {
        "role": "system",
        "content": "<|think|>"  # enables thinking mode
    },
    {
        "role": "user",
        "content": "A train leaves Chicago at 9am going 80mph. Another leaves New York at 10am going 100mph. Chicago to New York is 790 miles. When do they meet?"
    }
]

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Method 2 — `enable_thinking=True` in Transformers

pipe = pipeline(
    "text-generation",
    model="google/gemma-4-E4B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

result = pipe(
    "Prove that the sum of two odd numbers is always even.",
    enable_thinking=True,        # ← the flag
    max_thinking_tokens=2048,    # ← cap the budget
    max_new_tokens=512,
)

thinking = result[0]["thinking"]   # the CoT tokens
answer = result[0]["generated_text"]

Understanding the Output Structure

With thinking mode enabled, the model's output has a clear structure:

<thinking>
Let me work through this step by step.

The train from Chicago travels at 80mph starting at 9am.
The train from New York travels at 100mph starting at 10am.

After 1 hour (10am), the Chicago train has covered 80 miles.
Remaining distance: 790 - 80 = 710 miles.

Both trains are now moving toward each other at a combined speed of 80 + 100 = 180mph.
Time to meet: 710 / 180 ≈ 3.94 hours after 10am.

3.94 hours = 3 hours 56 minutes → they meet at approximately 1:56pm.
</thinking>

The trains meet at approximately **1:56 PM**.

Working: After the first hour (Chicago train covers 80mi), 710 miles remain. 
Combined closing speed: 180mph. Time: 710/180 ≈ 3h56m after 10am.

The <thinking> block is the model's scratch pad. The content after it is the final answer — concise and clean because the model already did the work.

Controlling the Thinking Budget

The thinking budget is the most important knob. More thinking = more tokens = slower + more expensive. Calibrate it to the task.

# Cheap: no thinking — straightforward questions
pipe("What is the capital of France?", enable_thinking=False)

# Light: 512 token budget — moderate reasoning tasks
pipe("Write a regex to match ISO 8601 dates", enable_thinking=True, max_thinking_tokens=512)

# Full: 4096 token budget — complex math, multi-step logic, code architecture
pipe(
    "Design a rate limiting algorithm that handles burst traffic, "
    "distributed deployments, and graceful degradation",
    enable_thinking=True,
    max_thinking_tokens=4096
)

Rule of thumb:
| Task type | Thinking budget |
|-----------|----------------|
| Factual recall, simple Q&A | 0 (disabled) |
| Code generation, regex, structured output | 256–512 |
| Math, logic puzzles, multi-step reasoning | 1024–2048 |
| Architecture decisions, complex proofs, planning | 2048–4096 |

Using Thinking Mode with Ollama

If you're running Gemma 4 locally via Ollama, pass the system prompt manually:

import ollama

response = ollama.chat(
    model="gemma4:4b",
    messages=[
        {"role": "system", "content": "<|think|>"},
        {"role": "user", "content": "What's the most efficient sorting algorithm for nearly-sorted arrays and why?"}
    ]
)

print(response["message"]["content"])

Note: Ollama doesn't yet expose a max_thinking_tokens parameter directly. To cap the budget, add an explicit instruction in your system prompt:

{"role": "system", "content": "<|think|> Keep your reasoning under 300 words."}

Streaming the Thinking Tokens

For user-facing applications, streaming lets you show the thinking as it happens — useful when the reasoning itself is part of the value:

import anthropic  # or any SSE client

async def stream_with_thinking(question: str):
    async for chunk in pipe.stream(
        question,
        enable_thinking=True,
        max_thinking_tokens=2048,
    ):
        if chunk["type"] == "thinking":
            print(f"[thinking] {chunk['text']}", end="", flush=True)
        elif chunk["type"] == "text":
            print(chunk["text"], end="", flush=True)

This is the pattern for building a "show your work" UI — math tutors, debugging assistants, step-by-step explainers.

When Thinking Mode Actually Helps vs When It Doesn't

Helps significantly:

Math and logic problems where the path to the answer matters
Code debugging — the model traces through the logic before proposing a fix
Multi-constraint planning — "schedule these tasks given these dependencies"
Ambiguous questions where the model needs to interpret before answering

Doesn't help much:

Factual retrieval — the answer is in the weights, thinking doesn't change it
Creative writing — longer reasoning doesn't make prose better
Simple classification or entity extraction — deterministic tasks don't benefit from CoT
Low-latency production endpoints where 4K extra tokens is a hard no

Actually hurts:

Tasks with strict output format requirements — the model sometimes "wanders" in the thinking block in ways that bleed into structured output. If you're doing JSON extraction, disable thinking and use a format constraint instead.

The Practical Upside

Thinking mode is the feature that makes Gemma 4 genuinely competitive on reasoning benchmarks — AIME 2026 at 89.2%, GPQA Diamond at 84.3%, Codeforces ELO at 2150. Those numbers aren't from the base model answering cold. They're from a model that has space to work through the problem before committing.

The fact that this runs locally, on your hardware, with Apache 2.0 licensing, is the part that changes what's practical to build. Reasoning-capable AI in your pipeline without an API call, without data leaving your infrastructure, without a per-token bill for 4,000 reasoning tokens.

That's the actual story of the <|think|> token.

WebMCP Is the Most Underrated Announcement from Google I/O 2026

pulkitgovrani — Sun, 24 May 2026 12:30:00 +0000

This is a submission for the Google I/O 2026 Challenge: Explore Google I/O 2026

Everyone's talking about Jules, Gemini Omni, and the $100 AI Ultra price cut. Nobody's talking about WebMCP.

That's a mistake. WebMCP might be the most consequential thing Google announced at I/O 2026 — not because of what it does today, but because of what it standardizes for the next decade of web development.

What WebMCP Is

WebMCP is a proposed open web standard for browser-based AI agents. The short version: it lets developers expose structured tools — JavaScript functions, HTML forms, page actions — so that AI agents can interact with websites reliably and programmatically, not by scraping the DOM and hoping for the best.

An experimental origin trial launches in Chrome 149, with Gemini in Chrome support following shortly after.

The name is a nod to MCP (Model Context Protocol, the standard Anthropic introduced for tool-use with AI). WebMCP extends that concept to the browser, making the web itself the context layer for agents.

The Problem It Solves

Right now, browser AI agents — including Google's own Project Mariner — work by watching what's on the screen and clicking things. This is impressive engineering, but it's fragile in a predictable way:

The agent sees a button that looks like "Book Flight" — but is it the right one?
A UI redesign breaks the agent's understanding of the page
Dynamic content loaded after interaction confuses the agent's mental model
Multi-step workflows with modal dialogs and redirects cause agents to lose state

This is the difference between a screen-reader compatibility layer and a proper API. Both can "read the page." Only one is reliable at scale.

WebMCP is the proper API layer.

How It Works

Developers annotate their page with structured tool definitions — essentially declaring: "here are the actions this page can perform, and here's the shape of their inputs and outputs."

// WebMCP tool definition (simplified from the origin trial spec)
navigator.webmcp.registerTool({
  name: "book_flight",
  description: "Book a flight given origin, destination, and date",
  parameters: {
    origin: { type: "string", description: "IATA airport code" },
    destination: { type: "string", description: "IATA airport code" },
    date: { type: "string", format: "YYYY-MM-DD" },
  },
  handler: async ({ origin, destination, date }) => {
    // Your existing booking logic
    return await bookingApi.reserve({ origin, destination, date });
  }
});

An AI agent using WebMCP doesn't need to parse the DOM or simulate clicks. It calls book_flight with the right parameters and gets a typed response. The difference in reliability is the difference between regex-scraping HTML and calling a REST endpoint.

Why This Is Bigger Than It Looks

1. It decouples agent reliability from UI stability.

Today, every time a developer redesigns a checkout flow or moves a button, any agent that relied on the old UI layout breaks. With WebMCP, the UI can change freely — the tool interface is what the agent uses, and that's maintained separately from the visual layer. Agents become UI-independent.

2. It creates a new surface for developers to target.

Right now, developers think about two audiences: humans using the UI, and other services using the API. WebMCP adds a third: AI agents acting on behalf of humans. Once the standard matures, you'll see developer platforms adding "WebMCP support" as a feature the same way they added mobile responsiveness and REST APIs.

3. It's an open standard, not a Google proprietary thing.

The origin trial is in Chrome 149, but Google has submitted WebMCP to the W3C process. This is the play for cross-browser standardization, similar to how Service Workers or the Fetch API landed. If it gets traction, Firefox and Safari will implement it. A web with a universal agent interaction layer is a qualitatively different web.

4. It's the infrastructure layer for agentic apps.

Every company building "AI agents that browse the web" is currently solving the same DOM-parsing problem independently. WebMCP gives everyone a shared substrate. The comparison I keep coming back to: it's like when REST became the norm and everyone stopped building bespoke XML-RPC protocols. The tooling compounds when there's a standard.

The Honest Critique

WebMCP solves the right problem but it requires developer buy-in to work. An AI agent landing on a site with no WebMCP annotations still falls back to DOM parsing — the standard only helps where developers have opted in.

The history of web standards is littered with great ideas that died because adoption was too slow and the fallback behavior meant developers could ignore it. Schema.org structured data is technically everywhere; practically, most sites half-implement it and nobody maintains it.

WebMCP will face the same adoption problem. The question is whether Google ships enough agent-visible traffic to make WebMCP support table stakes for developers who want their products to be agent-reachable. If "show up in AI agent search results" becomes a revenue driver the way "show up in Google Search" is, adoption follows automatically.

What Developers Should Do Right Now

The origin trial means you can start experimenting in Chrome 149 today. I'd recommend:

Follow the W3C proposal — it's public and the discussion is active. Standards shaped in the early stages are standards you influence.
Read the Project Mariner demos — they show the practical ceiling of DOM-based agents, which clarifies exactly why the structured layer matters.
Think about your product's "agent surface" — which actions in your product would you want an AI agent to be able to invoke on a user's behalf? That list is your WebMCP tool catalog.

The web has had two major eras: static pages, then APIs. WebMCP is the infrastructure for the third — a web that AI agents can use as confidently as humans do. Get ahead of it.

Links: Chrome at I/O 2026 · WebMCP Origin Trial — Chrome 149 · Google I/O 2026 Dev Keynote