DEV Community: ww-w.ai

Google I/O Review (5/5) — What Disappointed and What Surprised

ww-w.ai — Wed, 20 May 2026 20:11:16 +0000

The Other Side of Google I/O 2026 — What Disappointed and What Surprised

Parts 1 through 4 covered Flash economics, serverless agents, Gemini Omni, and the Gemini CLI shutdown. This final piece covers the rest: three things that disappointed and one thing that surprised everyone, including me.

Disappointment 1: Antigravity 2.0 — Great Vision, Brutal Execution

The demo was spectacular: an OS built by 93 agents over 12 hours, plus a playable Doom clone, all on stage. Agents were treated as first-class deployment targets, with versioning, rollback, and observability baked in. The vision is sound.

Then the forced update shipped.

No migration path. No opt-out. Developers who were mid-sprint woke up to broken builds. Existing projects that worked on Monday stopped compiling on Tuesday.

The Google AI Forum filled up fast. Thread titles include "Incompetence 10.0" and "The WORST IDE for development." Source control and the local terminal — tools most developers consider non-negotiable — were missing from the default installation. Users had to manually hunt through settings to re-enable them.

This is the part that stings. Source control is not a power-user feature. A local terminal is not optional. When these are absent from a default install, it reads like the product was built by people who do not ship code themselves. Or worse — it reads like a team optimized for the demo reel and forgot the people who would use the product on Wednesday.

When your platform update breaks the projects people built on your previous platform update, the 12-hour demo stops mattering.

Disappointment 2: Gemini Spark — A Gmail Bot, Not an Agent

Google pitched Gemini Spark as a 24/7 autonomous agent. What shipped: Gmail, Google Docs, Sheets, Slides, Google Calendar. Plus a handful of consumer integrations — Canva, OpenTable, Instacart.

No Slack. No GitHub. No Linear. No Jira. For any developer whose workflow lives outside Google Workspace, Spark has zero surface area.

Here is the irony. The same company, on the same day, shipped the Managed Agents API with dozens of external integrations — GitHub, Jira, Stripe, Linear, Notion, MongoDB — plus native MCP support and Claude model compatibility. The API side built a genuinely open platform. The consumer product stayed inside Google's walled garden.

Same company. Same keynote. Opposite directions.

If the Managed Agents API team and the Spark team compared notes, you would not guess they work in the same building.

Now compare Spark to what already exists. Claude Code and OpenAI Codex both operate inside your actual development environment — your files, your terminal, your version control. Spark operates inside Google's apps. The conceptual overlap is real, but the practical utility gap is wide. Spark does not meet developers where they already work. It asks them to move into Google's house first.

Disappointment 3: Android 17 — The Invisible OS

Android was barely mentioned during the two-hour keynote. At Google I/O — the event that used to be about Android.

The major Android announcements? They happened the week before, off-site, at a pre-recorded event called "The Android Show." The features that did reach the I/O stage — smarter notifications, on-device summarization — are incremental. No new design language. No major API surface. No surprise.

The entire two-hour keynote was Gemini, AI, agents. Android felt like an extra in someone else's movie.

The read is straightforward: Google is betting its future on AI services, not on the operating system that carries them. Whether that is the right call depends on your vantage point. But doing it at I/O — in front of the Android developer community, the people who build the apps that make Android worth using — felt like Google forgot to impress the people in the room.

The Bright Spot Nobody Expected: Project Aura

In the middle of a keynote dominated by software, Google showed hardware. And it might have been the best thing on stage.

Project Aura is a collaboration between Google and XREAL — Android XR glasses that split the compute off the face. The glasses weigh roughly 80-90g. The heavy lifting happens in a tethered puck running a Snapdragon XR2+ Gen 2 chip. The result: an OLED display with a 70-degree field of view, wide enough to show three apps side by side, in a frame light enough to wear for hours.

Where Aura Sits

Spec	Meta Ray-Ban	Project Aura	Apple Vision Pro
Weight	52g	80-90g (glasses only)	750g
Display	None	OLED, 70-degree FOV	Micro-OLED, 90-degree FOV
Apps side by side	N/A	3	Unlimited (spatial)
Compute	On-frame (Snapdragon AR1)	Tethered puck (XR2+ Gen 2)	On-device (M2 + R1)
Text readability	N/A	Sharp (hands-on reports)	Excellent
Price	$299	TBA (expected well below $3,499)	$3,499
Primary use	Camera + audio	AR workspace + media	Spatial computing

Meta Ray-Ban is light but has no display — it is a camera and speaker on your face. Apple Vision Pro has an extraordinary display but weighs 750g and costs as much as a laptop. Aura lands in the gap: actual visual output, actual wearable weight, at a price that is expected to undercut Vision Pro by a significant margin.

Hands-on reviewers at I/O reported text was sharp and pixels were not visible, a meaningful upgrade from the CES 2026 prototype shown earlier this year. But the demo that drew the most attention was not about the specs. It was this: connect the glasses to a laptop via DisplayPort, and they become a large virtual monitor. No physical screen. No desk. Multiple reviewers called it the most practical demo of the entire event.

Google also announced a Developer Catalyst Program giving developers early access to devkits. AR/XR glasses live or die on the app ecosystem. Hardware without software is a paperweight. Getting devkits into hands early is the right move.

The broader signal is what makes Aura interesting beyond the product itself. AI has been a software story — models, APIs, tokens, agents. Aura is AI expanding into the physical interface layer. If a developer can carry a full workspace in a glasses case — no monitor, no desk, no office — the implications for remote work and mobile development go beyond what a new model release can offer.

Global launch is targeted for 2026. No price yet. That is the one caveat worth watching — if the price lands above $1,000, the sweet-spot argument weakens considerably.

The Series in Five Lines

Part 1: Gemini 3.5 "Flash" Is Pro in Disguise. The "cheap model" costs 15x what Flash cost three generations ago.
Part 2: Google Just Made Serverless Agents Real. Deploy, scale, monitor. One CLI command.
Part 3: Gemini Omni Is a Learned Physics Engine. The best demo.
Part 4: Google Quietly Killed Gemini CLI. The worst goodbye, and the open-source burial.
Part 5: What Disappointed and What Surprised. You are here.

Four wins. Four misses. One hardware surprise. All announced in the same 48 hours.

This wraps the Google I/O 2026 series. If you sat through the keynote live or tested any of these products hands-on — what surprised you? Drop a comment or find me on GitHub.

Google I/O Review (4/5) — Google Quietly Killed Gemini CLI

ww-w.ai — Wed, 20 May 2026 19:09:56 +0000

Google Quietly Killed Gemini CLI While Everyone Was Celebrating I/O

Part 4 of 5 in the Google I/O 2026 Review series.

There is a term in media strategy called "bad news burial." You wait for a high-traffic news cycle — a holiday, a natural disaster, an election night — and drop the announcement you don't want people to read. The hope is that the noise drowns it out.

On May 19, during Google I/O Day 1, while developers were still digesting Flash 3.5 benchmarks and the Managed Agents API, Google published a blog post announcing that Gemini CLI will be discontinued on June 18, 2026.

Not on stage. Not in the keynote. A blog post and a GitHub Discussion, timed to land under the loudest news cycle of the developer year.

Gemini CLI gave you 1,000 agent requests per day. Antigravity CLI gives you 20.

The Timing Was a Choice

Every major announcement at I/O got a keynote slot. Flash 3.5 beating Pro on benchmarks — keynote. Managed Agents API with 30+ integrations — keynote. Even Project Aura's 80-gram XR glasses got stage time.

The discontinuation of an Apache 2.0 open-source CLI used by thousands of developers? Blog post. Buried in the I/O news flood.

This matters because the announcement was not a minor change. It was a license model reversal, a free tier reduction of 98%, and a 30-day shutdown notice — all rolled into one. Any one of those would deserve its own conversation. Together, they constitute one of the sharpest reversals in developer tooling this year.

What Changed — The Numbers

Here is what developers are losing and what they are getting, based on what Google has disclosed:

Feature	Gemini CLI	Antigravity CLI
License	Apache 2.0 (open source)	Closed source ("possibility" of open-sourcing mentioned, no commitment)
Free tier	1,000 requests/day + 60 RPM	20 requests/day (free individual plan, $0/mo)
Reduction	—	98% fewer free requests
Agent Client Protocol (ACP)	Supported	Community reports suggest not yet available
Project memory	Supported	Community reports suggest not yet supported for markdown files
Ctrl+C behavior	Normal exit	Some users report unreliable exit (Discussion #27274)
Documentation	Community-maintained, extensive	Sparse at launch
Shutdown notice	—	~30 days (May 19 → June 18)
Enterprise	Supported	Maintained

Free tier numbers from Gemini CLI GitHub README and Antigravity pricing.

The enterprise tier is maintained. Individual developers and small teams — the people who built their daily workflows around 1,000 free requests — get 20. That is not a "reduced free tier." That is a rounding error.
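
The two headline figures in the table, the 98% cut and the roughly 30-day notice, can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the quotas and dates quoted in this post; the constant and function names are illustrative.

```python
from datetime import date

# Free-tier quotas quoted in the comparison above (requests per day).
GEMINI_CLI_FREE_PER_DAY = 1_000
ANTIGRAVITY_FREE_PER_DAY = 20

# Announcement and shutdown dates as stated in this post.
ANNOUNCED = date(2026, 5, 19)
SHUTDOWN = date(2026, 6, 18)


def reduction_percent(old: int, new: int) -> float:
    """Percentage drop from the old quota to the new one."""
    return (old - new) / old * 100


free_tier_drop = reduction_percent(GEMINI_CLI_FREE_PER_DAY, ANTIGRAVITY_FREE_PER_DAY)
notice_days = (SHUTDOWN - ANNOUNCED).days

print(f"Free tier reduction: {free_tier_drop:.0f}%")
print(f"Shutdown notice: {notice_days} days")
```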

The Open Source Question

Gemini CLI was not just open source by license. It was open source by practice. Thousands of pull requests and issues from external contributors. Bug fixes, extensions, documentation improvements — the community was building the product alongside Google.

That is the implicit contract of open source: you contribute labor under an open license, and the project stays open and accessible. Apache 2.0 does not legally require this. But the social contract does, and breaking it has consequences that no license text accounts for.

The code those contributors wrote under Apache 2.0 is now feeding a closed-source product that the same contributors can barely use — 20 requests a day does not support any real development workflow.

Legally, nothing was stolen. Socially, something was taken.

The Pattern

This is not a new playbook. The sequence:

Launch open source with a generous free tier. Attract developers. Build community.
Accumulate contributions. External developers improve the product at zero cost under an open license.
Transition to closed source. The community-built product becomes proprietary. Free access drops to nominal levels.
Monetize through enterprise. Meaningful access requires paid licenses.

The "Google Graveyard" meme resurfaced immediately in the Hacker News thread. But this is different from shutting down a consumer app. Consumer apps have users. Open-source projects have contributors — people who invested engineering time into something they were told would remain open.

What Developers Are Doing Right Now

The HN and GitHub threads paint a clear migration picture:

Claude Code is the most frequently mentioned alternative. Developers cite the plugin/skills ecosystem and extensibility.
OpenAI Codex CLI gets mentions from developers who want to stay with a major provider.
Local/self-hosted alternatives — Ollama-based setups, open-weight model wrappers — are attracting developers who now distrust cloud-dependent CLIs entirely.
Gemini CLI forks from the last Apache 2.0 commit exist, though a fork without Google's model access has uncertain long-term viability.

The irony: Google's move may have driven more adoption to competitors than any competitor's marketing campaign could have achieved.

What This Does Not Solve

The 60 RPM question. Gemini CLI offered 60 requests per minute. Google has not clearly disclosed whether Antigravity maintains this rate limit for paid tiers. If you are evaluating a switch, verify this before committing.

The fork path. Gemini CLI's Apache 2.0 code is still available. A community fork is technically possible. Practically, a fork without Gemini model access has limited utility — someone would need to wire it to alternative providers, and that is a significant effort with no clear owner.

Whether Antigravity improves. Google mentioned the "possibility" of open-sourcing Antigravity in the future. Missing features might ship quickly. The free tier might expand. But "possible" is not a commitment, and developers building workflows need commitments.

Google's internal reasoning. Serving 1,000 free requests per day per user at scale is not cheap. The economics may have forced this decision. But the execution — 30-day notice, missing features in the replacement, a free tier reduced to near-irrelevance, and the timing — turned a business decision into a trust problem.

The Lesson

If your workflow depends on a vendor-controlled tool, you have two options:

Accept the dependency and price in the risk that the vendor changes terms. Budget migration time from day one.
Build on open standards and self-hostable tools where the switching cost stays low.

Neither option is wrong. But pretending the risk does not exist — that is the mistake Google just made visible.
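
One way to keep switching costs low, whichever option you pick, is to make workflow code depend on a small interface of your own rather than on a vendor SDK. The sketch below is only an illustration of that idea: `CompletionBackend`, `HostedBackend`, and `LocalBackend` are hypothetical names, and the backends return canned strings instead of calling any real model.

```python
from typing import Protocol


class CompletionBackend(Protocol):
    """Hypothetical minimal interface the rest of a workflow depends on."""

    def complete(self, prompt: str) -> str: ...


class HostedBackend:
    """Stand-in for a vendor-hosted model; a real one would call the vendor's API."""

    def __init__(self, vendor: str) -> None:
        self.vendor = vendor

    def complete(self, prompt: str) -> str:
        return f"[{self.vendor}] {prompt}"


class LocalBackend:
    """Stand-in for a self-hosted model."""

    def complete(self, prompt: str) -> str:
        return f"[local] {prompt}"


def summarize_diff(backend: CompletionBackend, diff: str) -> str:
    """Workflow code sees only the interface, never a vendor SDK."""
    return backend.complete(f"Summarize this diff: {diff}")


print(summarize_diff(HostedBackend("vendor-a"), "+ fix typo"))
print(summarize_diff(LocalBackend(), "+ fix typo"))
```

Swapping providers then means writing one new backend class, not touching every call site.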

The developers who contributed to Gemini CLI under Apache 2.0 did nothing wrong. They participated in open source the way it is supposed to work. What failed was not the license. It was the assumption that a trillion-dollar company's incentives would stay aligned with theirs.

Remember this the next time a major provider launches something generous and open. The question is not whether it is good today. The question is: what happens when your workflow depends on it, and the economics change?

Part 5 will cover the overall I/O scorecard — what the four wins and four misses tell us about where Google is heading.

If you migrated off Gemini CLI already, I'd like to hear what you moved to and what the transition cost was. Drop a comment or find me on GitHub.

Google I/O Review (3/5) — Gemini Omni Is a Learned Physics Engine

ww-w.ai — Wed, 20 May 2026 19:08:30 +0000

Gemini Omni Is a Learned Physics Engine — Like Unity, But the Rules Aren't Coded

Google I/O 2026 Review — Part 3 of 5

Most video generation models fake physics. They learn what gravity looks like — a ball falls, a cloth drapes — and reproduce the visual pattern. Push the scene past what the training data covered and things break. A marble doesn't bounce right. Shadows point the wrong way after a lighting edit. Swap a background and the character morphs into someone else.

Gemini Omni does something different. It maintains physics and identity across frames — not because someone coded gravity = 9.8 into the system, but because the model built an internal representation of how the physical world works.

That distinction matters more than the demo reel suggests.

The Demos That Stopped the Room

Three demos at I/O 2026 showed what Omni can do.

Hand-drawn character to animation. Someone sketched a character on paper, uploaded it, and Omni turned it into a 10-second animated story. Not a static image with parallax — an actual animation with movement, expression changes, and a coherent scene.

Marble physics. A marble bouncing down a chain-reaction track. Gravity pulled it at the right rate. Bounce trajectories matched the angle of impact. Each bounce produced a distinct sound, including a bell ring at the end. The physics weren't approximate. They looked simulated.

Claymation protein folding. A single prompt generated an educational video showing protein folding in claymation style. The clay texture stayed consistent across the sequence. The folding motion followed biologically plausible mechanics. One prompt. No keyframes. No rigging.

One reviewer at ChatPRD called it "the most impressive demo of the day." Having watched the full keynote and the hands-on sessions, I think that's fair.

What Makes This Different from Sora

Every video generation model can produce impressive isolated clips. The test is what happens when you edit.

Change the background in a Sora-generated scene, and the character often drifts — subtle changes to face shape, clothing color, body proportions. The model doesn't know the character is supposed to stay the same. It's generating each frame based on visual similarity to the previous frame, not based on an understanding that this is the same entity.

Omni maintains identity after edits. Swap the background from a forest to a kitchen. Change the lighting from warm to cold. Replace a prop. The character stays the same — same face, same proportions, same clothing. Google's claim is that the model maintains a persistent representation of objects and their properties, independent of the scene context.

This is the hardest problem in video generation and the reason most generated videos feel uncanny. They look right for 3 seconds. Then something shifts.

The Unity Analogy — And Why It Matters

Here is the mental model I keep coming back to.

In Unity or Unreal, physics works because engineers wrote the rules. Rigidbody.AddForce() applies Newtonian mechanics. Collision detection uses mathematical bounding volumes. Gravity is a constant. The engine simulates a world by executing code.

Omni does something conceptually similar — it maintains physics across frames — but through a different mechanism. The rules aren't coded. They're learned. The model internalized how gravity, light, momentum, and material properties behave by processing enormous amounts of video data. It built what researchers call a world model: an internal representation of physical laws that it applies when generating new frames.

Think of it this way:

Aspect	Game engine (Unity)	Learned physics (Omni)
Physics rules	Explicitly coded (`F = ma`)	Implicitly learned from data
Object identity	Tracked via object IDs	Maintained via internal representation
Edit behavior	Deterministic — same input, same output	Probabilistic — but consistent within a generation
Novel scenarios	Only what the code handles	Generalizes from training data patterns
Failure mode	Crashes or glitches visibly	Degrades subtly (uncanny valley)

The game engine approach has known limits and known strengths. You can trust the physics because you wrote the physics. The learned approach trades that certainty for generality — it can handle scenarios nobody anticipated, because it doesn't need someone to write the collision handler first.
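
To make the "explicitly coded" column concrete, here is a minimal sketch of what hand-written physics looks like: a ball dropped onto a floor, with gravity as a constant and the bounce as an explicit rule. It is a toy written for this comparison, not Unity's implementation, and the restitution value and time step are assumed.

```python
GRAVITY = -9.8       # m/s^2, hardcoded constant
RESTITUTION = 0.8    # fraction of speed kept after each bounce (assumed value)
DT = 0.01            # seconds per simulation step (assumed value)


def simulate_drop(height: float, steps: int) -> list[float]:
    """Return the ball's height at each step using semi-implicit Euler integration."""
    y, v = height, 0.0
    heights = []
    for _ in range(steps):
        v += GRAVITY * DT      # gravity is the only force acting on the ball
        y += v * DT
        if y < 0.0:            # explicit collision rule with the floor
            y = 0.0
            v = -v * RESTITUTION
        heights.append(y)
    return heights


heights = simulate_drop(height=1.0, steps=300)
print(f"lowest point: {min(heights):.2f} m, highest point: {max(heights):.2f} m")
```

Every behavior here traces back to a line someone wrote. Running it twice gives the same trajectory, and anything the code does not handle, such as a wall or a second ball, simply does not exist in this world.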

The phrase I wrote in my full I/O review keeps sticking: "Like Unity, but the rules aren't coded. They're understood."

Practical Impact: Who Cares Beyond the Demo Reel

Three concrete use cases where this changes cost structures.

YouTube thumbnails and short-form video. A solo creator who currently pays $200-500 for a 30-second product animation can describe the scene in a prompt. If Omni delivers even 70% of the quality at near-zero marginal cost, the economics of content production shift for every small creator and indie team.

Product walkthrough videos. SaaS companies spend $5,000-15,000 per explainer video (script, motion graphics, voiceover, revisions). A world model that understands object permanence means you can generate a walkthrough, swap the UI screenshots for the next version, and the video stays coherent. The revision cycle collapses.

Educational content. The claymation protein-folding demo is not a party trick. If a biology teacher can prompt "show me mitosis in stop-motion clay style, 30 seconds" and get something accurate enough for a classroom, that's a production studio in a text box.

The common thread: Omni reduces the cost of visual storytelling from "hire a team" to "write a paragraph." Not for Hollywood. Not for AAA games. For the long tail of content that nobody could afford to produce before.

What It Can't Do Yet

This section matters more than the demo reel.

It's still in preview. Google showed curated demos on stage. We have not seen the failure cases — the weird hand, the physics glitch, the moment where identity drifts on frame 87. Every generative model looks incredible in a keynote. The question is what happens on the 50th generation you run on your own.

Long-form is unproven. The demos were 10 seconds. What happens at one minute? Two minutes? Five? World models degrade over time — small errors in frame N compound by frame N+100. Whether Omni maintains coherence over longer durations is an open question. Omni Flash clips are capped at 10 seconds; Sora supports up to 60.

Production-grade quality is not validated. "Impressive demo" and "I can ship this to customers" are different bars. Color accuracy, resolution consistency, artifact rates under varied prompts — none of these have been tested at scale by external users.

The pricing is unknown. A world model that generates physically consistent video is computationally expensive. If Omni pricing follows the Flash trajectory — where prices have climbed steeply across Flash generations — the cost math could limit adoption to enterprises.
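
The long-form concern above is easier to see with numbers. The sketch below assumes each frame adds a small, fixed relative error that compounds multiplicatively. The frame rate and the 0.5% per-frame figure are hypothetical values chosen for illustration, not measurements of Omni or any other model.

```python
def compounded_drift(per_frame_drift: float, frames: int) -> float:
    """Total relative drift after `frames` frames if each frame adds a fixed relative error."""
    return (1.0 + per_frame_drift) ** frames - 1.0


FPS = 24                  # assumed frame rate
PER_FRAME_DRIFT = 0.005   # hypothetical 0.5% drift per frame, not a measured value

for seconds in (10, 60):
    frames = seconds * FPS
    drift = compounded_drift(PER_FRAME_DRIFT, frames)
    print(f"{seconds:>2}s ({frames} frames): accumulated drift {drift:.1f}")
```

Under these assumptions the drift at 60 seconds is orders of magnitude larger than at 10 seconds, which is why a clean 10-second demo says little about minute-long clips.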

Where This Fits in the Bigger Picture

Omni is not a video editor. It's not a motion graphics tool. It's a world simulator that outputs video. That framing changes what you compare it to.

Sora and Runway are video generators — they turn text into pixels. Omni is closer to a physics engine that happens to render its output as video frames. The difference is whether the system understands the scene or merely paints it.

If that understanding holds up outside curated demos — and that's a genuine if — the implications go beyond content creation. Robotics simulation, architectural visualization, scientific modeling, game prototyping. Any field that needs "show me what would happen if..." becomes a potential use case.

For now, it's a preview. An impressive one. But a preview.

What I'm watching for next: Public API access, pricing, and the first independent benchmarks on identity persistence across 60+ second clips. The demo set a bar. The product needs to clear it.

If you're tracking Gemini Omni or have tested other world-model approaches, I'd like to hear what you've seen. Comments or GitHub.


Google I/O Review (2/5) — Google Just Made Serverless Agents Real

ww-w.ai — Wed, 20 May 2026 19:07:32 +0000

Google Just Made Serverless Agents Real

Part 2 of 5 — Google I/O 2026 Review

Every developer who has shipped an agent demo knows the feeling. The prototype works. The Loom video gets likes. Then someone asks: "Cool — how do I use this with 500 real users?"

That question kills most agent projects.

The gap between demo and production is not about prompts or tool definitions. It is about infrastructure — container orchestration, autoscaling policies, health checks, token budget enforcement, multi-turn state management, and log aggregation. The same gap that existed between "I wrote a web app" and "this web app handles 10,000 concurrent users" before EC2, Cloud Run, and Lambda showed up.

At I/O 2026, Google shipped the answer. The Managed Agents API does for agents what Cloud Functions did for backend workloads. Deploy, scale, monitor, pay per execution. No cluster. No YAML. One CLI command.

I called it the most consequential announcement from I/O in my Part 1 review. This post explains why.

The Demo-to-Production Gap

Building an agent is easy now. LangChain, CrewAI, AutoGen, Claude Code — pick a framework, define tools, write a system prompt, and you have a working prototype in an afternoon.

Running that agent for real users is a different discipline entirely. Here is what production demands that demos do not:

Concern	Demo	Production
Scaling	Your laptop	1 to 10,000 concurrent sessions
State	In-memory dict	Persistent multi-turn across sessions
Monitoring	Print statements	Token consumption, latency p95, error rates, cost attribution
Rollback	Ctrl+Z	Version pinning, canary deploys, instant rollback
Tool auth	Hardcoded API keys	Scoped service accounts, secret rotation
Cost control	"I'll watch it"	Per-agent token budgets, kill switches

Most indie developers and small teams get stuck somewhere in this table. The agent works. The infrastructure to run it does not exist yet. So the project stays a demo.
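
To give a sense of what one row of that table involves, here is a minimal sketch of a per-agent token budget with a kill switch. `TokenBudget` and `BudgetExceeded` are hypothetical names written for this post; they do not describe how any managed platform implements cost control.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent tries to spend past its token budget."""


class TokenBudget:
    """Hypothetical per-agent budget; not any platform's real mechanism."""

    def __init__(self, agent_id: str, max_tokens: int) -> None:
        self.agent_id = agent_id
        self.max_tokens = max_tokens
        self.used = 0
        self.killed = False

    def charge(self, tokens: int) -> int:
        """Record usage and return remaining tokens; trip the kill switch on overrun."""
        if self.killed:
            raise BudgetExceeded(f"{self.agent_id} is disabled")
        if self.used + tokens > self.max_tokens:
            self.killed = True
            raise BudgetExceeded(f"{self.agent_id} exceeded {self.max_tokens} tokens")
        self.used += tokens
        return self.max_tokens - self.used


budget = TokenBudget("support-bot", max_tokens=1_000)
print(budget.charge(400))   # 600 tokens remaining
try:
    budget.charge(700)      # would overrun, so the kill switch trips
except BudgetExceeded as err:
    print(err)
```

Even this toy version leaves out persistence, concurrency, and alerting, which is where the production work actually goes.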

Cloud Functions, But for Agents

Google's move is to compress that entire table into a managed runtime. The mental model is straightforward: if you have used Cloud Functions or Cloud Run, you already understand the deployment pattern. The difference is that the runtime is agent-aware — it understands tool call chains, token budgets, and conversation state natively.

Here is what a deploy looks like with the actual Agents CLI:

# Install Agents CLI
uvx google-agents-cli

# Scaffold for Cloud Run deployment
agents-cli scaffold enhance -d cloud_run

# Provision infrastructure
agents-cli infra single-project

# Deploy
agents-cli deploy

That replaces a Kubernetes cluster, an autoscaler config, a Prometheus stack, and a custom token-tracking pipeline. For a solo builder, this is the difference between "I need a DevOps hire" and "I need a terminal."

30+ Integrations Out of the Box

The tool registry ships with pre-built connectors. Not "we plan to support" — shipping in preview:

Category	Integrations
Dev tools	GitHub, GitLab, Jira, Linear
Productivity	Notion, Google Workspace, Slack, Asana
Data	MongoDB, BigQuery
Payments & CRM	Stripe, Salesforce (via StackOne), HubSpot (via StackOne)
Cloud	GCP services (Cloud Storage, Pub/Sub, Cloud SQL)
Communication	Twilio (via StackOne)

This matters because the hardest part of building a useful agent is not the LLM call — it is connecting the agent to the systems where work actually happens. A customer support agent that cannot read your ticket system or update your CRM is a chatbot, not an agent.

MCP Native — The Interoperability Play

The platform speaks Model Context Protocol natively. Two things follow from this:

First, existing REST APIs can be wrapped as MCP tools through Apigee without rewriting. If you have an internal API, you do not need to build a custom connector. Apigee generates the MCP schema, and the agent can call it like any other tool.

Second, tool definitions are portable to any MCP-compatible client. The platform is not model-locked at the protocol level. Your agent's tool definitions and conversation flows work with any client that speaks MCP.

This is a deliberate architectural choice. Google controls the runtime, the governance, and the registry. But the tool layer speaks an open protocol. That is a more nuanced lock-in story than "everything is proprietary" or "everything is open."
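
For readers who have not seen one, an MCP tool is described by a name, a description, and a JSON Schema for its input. The sketch below shows what a portable definition might look like. The tool `lookup_ticket` and its fields are hypothetical, and this is not output from Apigee or the Managed Agents API.

```python
import json

# Hypothetical tool; only the three top-level keys follow the MCP tool shape.
lookup_ticket_tool = {
    "name": "lookup_ticket",
    "description": "Fetch a support ticket by its ID.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "ticket_id": {"type": "string", "description": "Ticket identifier"},
        },
        "required": ["ticket_id"],
    },
}


def missing_arguments(tool: dict, arguments: dict) -> list[str]:
    """Return required argument names absent from a call.

    This is a partial check for illustration, not full JSON Schema validation.
    """
    return [name for name in tool["inputSchema"]["required"] if name not in arguments]


print(json.dumps(lookup_ticket_tool, indent=2))
print(missing_arguments(lookup_ticket_tool, {}))
```

Because the definition is plain JSON, it is the part of an agent that can move between MCP-compatible clients without a rewrite.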

The Lock-In, Honestly

I want to be specific about what ties you to GCP and what does not.

Locked to GCP:

The agent runtime itself — execution, scaling, health checks
The governance layer — who can deploy, what tools an agent can access, audit logs
The tool registry format — how connectors are packaged and versioned

Portable:

Your prompts and system instructions
Tool definitions (if you use MCP, they work elsewhere)
Conversation flow logic
The LLM choice (through MCP interoperability)

The pattern is familiar from Cloud Functions: your function code is portable, but the trigger bindings, IAM policies, and monitoring integrations are not. You can move your logic. You cannot move your operational wrapper.

Worth pricing in before going all-in. Especially if you are an indie developer building on a platform that could change pricing or terms — which, as Google demonstrated the same day with Gemini CLI, is not a theoretical concern.

Agent-First: What Antigravity 2.0 Signals

The Managed Agents API did not arrive in isolation. Antigravity 2.0, Google's next-generation development platform, explicitly treats agents as first-class deployment targets with versioning, rollback, and observability. The on-stage demo showed an OS built by 93 agents over 12 hours, plus a playable Doom clone.

The execution had problems (forced updates broke existing projects — I covered this in Part 1). But the directional signal is clear: Google sees agents not as a feature of its cloud, but as a deployment primitive alongside containers and functions.

That is new. AWS has SageMaker endpoints and Bedrock agents, but neither ships a dedicated agent CLI. Azure has AI Studio, but it lives in a separate portal. Google is among the first major clouds to ship a purpose-built agents-cli that takes an agent from scaffold to production in four commands.

What This Means for Indie Developers

Here is the before-and-after for a solo builder who wants to ship a production agent:

Before Managed Agents API:

1. Write agent logic
2. Containerize (Dockerfile, multi-stage builds)
3. Set up Kubernetes or Cloud Run
4. Configure autoscaling policies
5. Build token tracking and cost monitoring
6. Implement health checks
7. Set up log aggregation (ELK, Datadog, etc.)
8. Handle multi-turn state persistence
9. Manage secret rotation for tool credentials
10. Build a deployment pipeline

After:

1. Write agent logic
2. agents-cli deploy

Steps 2-10 are not eliminated — they are absorbed by the platform. The same compression Cloud Functions brought to backend workloads now applies to agents.

One caveat: the API is in preview. Pricing is not finalized. Production SLAs are not published. I would not migrate a revenue-critical agent today. But for new projects, the build-vs-buy calculation just changed fundamentally.

The Bigger Picture

Google I/O 2026 had a clear thesis: agents are infrastructure now, not experiments. The Managed Agents API, Antigravity 2.0's agent-first deployment, and the 30+ pre-built integrations all point the same direction — the "cool demo" era of AI agents is ending. The "runs in production at scale" era is starting.

For indie developers, the barrier just dropped from "hire a DevOps team" to "learn one CLI command." That is not hype. That is Cloud Functions, 2016, happening again.

Part 3 of this series covers Gemini Omni — the learned physics engine for video that stopped the room at I/O. Follow me on dev.to to catch it when it drops.

If you are building agents and evaluating managed platforms — or if you have tried the preview — I would like to hear your experience. Comments or GitHub.


Google I/O Review (1/5) — Gemini 3.5 'Flash' Costs 15x More Than Flash 2.0. It's Pro in Disguise

ww-w.ai — Wed, 20 May 2026 19:06:38 +0000

Gemini 3.5 "Flash" Costs 15x More Than Flash 2.0 — It's Pro in Disguise

Google I/O 2026 Review — Part 1 of 5

The keynote crowd cheered. Sundar Pichai announced that Gemini 3.5 Flash outperforms Gemini 3.1 Pro on multiple benchmarks. The narrative was clean: the lightweight, cheap model just beat the flagship. The start of "the agentic Gemini era."

Then I opened the pricing page.

Flash and Pro Are Neighbors Now

Model	Input (per 1M tokens)	Output (per 1M tokens)
Gemini 3.5 Flash	$1.50	$9.00
Gemini 3.1 Pro	$2.00	$12.00

Source: Google AI pricing, accessed 2026-05-19.

Flash at $1.50/$9.00. Pro at $2.00/$12.00. Flash is 25% cheaper than Pro on input and 25% cheaper on output. These are not different tiers. They are neighbors. Two years ago, Flash cost a fraction of Pro. Now they share the same block.

If someone showed you these two price points without labels, you would guess they are variants of the same model class. You would be right.

How Flash Got Here: Three Generations of Price Creep

Model	Input (per 1M tokens)	Output (per 1M tokens)	vs 2.0 Flash (Input)	vs 2.0 Flash (Output)
1.5 Flash	$0.075	$0.30	0.75x	0.75x
2.0 Flash	$0.10	$0.40	1x (baseline)	1x (baseline)
2.5 Flash	$0.30	$2.50	3x	6.25x
3.0 Flash	$0.50	$3.00	5x	7.5x
3.5 Flash	$1.50	$9.00	15x	22.5x

Source: Google AI pricing. All prices are standard (non-batch) per 1M tokens.

From 2.0 Flash to 3.5 Flash: input price rose 15x ($0.10 to $1.50). Output price rose 22.5x ($0.40 to $9.00). A model called "Flash" now costs fifteen times what Flash cost three generations ago.
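The multiples in the table can be re-derived from the listed prices. A minimal Python sketch, using only the figures above (the function and variable names are illustrative):

```python
# Standard (non-batch) prices per 1M tokens, copied from the table above.
PRICES = {
    "1.5 Flash": (0.075, 0.30),
    "2.0 Flash": (0.10, 0.40),
    "2.5 Flash": (0.30, 2.50),
    "3.0 Flash": (0.50, 3.00),
    "3.5 Flash": (1.50, 9.00),
}

def multiples(baseline="2.0 Flash"):
    """Return {model: (input_multiple, output_multiple)} relative to the baseline."""
    base_in, base_out = PRICES[baseline]
    return {
        name: (round(p_in / base_in, 2), round(p_out / base_out, 2))
        for name, (p_in, p_out) in PRICES.items()
    }

for name, (m_in, m_out) in multiples().items():
    print(f"{name}: input {m_in}x, output {m_out}x")
```

Changing the baseline argument shows the same creep from any other starting generation.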

The trajectory is clear. Flash did not stay in the lightweight lane. It grew into the price range that Pro used to occupy.

The Name Didn't Change. The Economics Did.

Here is what I think actually happened: Google shipped Pro-level performance and put the Flash label on it.

The benchmarks are real. Flash 3.5 does outperform Pro 3.1 on the metrics Google showed. But outperforming Pro while costing nearly the same as Pro is not "the cheap model won." It is "the expensive model got a new name."

Think about it from Google's side. If they had called it Pro 3.5 at $1.50/$9.00, the story would be: "Google cut Pro pricing by 25%." Accurate, useful, but not a keynote moment. By calling it Flash, the story becomes: "Flash beat Pro!" That is a keynote moment. Same product economics, different narrative.

Pichai himself leaned into the framing. He used the word "tokenmaxxing" during the keynote (more tokens, more context, more throughput): "Some out there might call this tokenmaxxing," he said. The naming is part of that narrative. Flash sounds lightweight and affordable. The pricing page tells a different story.

So Is This Bad? Not Exactly.

I want to be fair. The absolute price matters more than the brand name.

Pro-level performance at $1.50/$9.00 is genuinely useful. Consider an agent workload: a customer support bot handling 50,000 conversations per day. Assuming one response of roughly 500 output tokens per conversation, the daily output token cost at Pro 3.1 ($12.00 per 1M) versus Flash 3.5 ($9.00 per 1M) is:

50,000 conversations x 500 output tokens = 25M output tokens/day
At Pro 3.1: 25 x $12.00 = $300/day
At Flash 3.5: 25 x $9.00 = $225/day

That is $75/day saved, or roughly $2,250/month — with the same or better benchmark performance. For agent-heavy workloads running at scale, this price point opens real economic headroom.
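The same arithmetic as a small Python sketch, so you can plug in your own volumes. It mirrors the calculation above: output tokens only, one response per conversation, and a 30-day month. The function name is illustrative.

```python
def daily_output_cost(conversations, tokens_per_response, price_per_million):
    """Daily output-token cost in dollars, assuming one response per conversation."""
    total_tokens = conversations * tokens_per_response
    return total_tokens / 1_000_000 * price_per_million

pro = daily_output_cost(50_000, 500, 12.00)    # Gemini 3.1 Pro output price
flash = daily_output_cost(50_000, 500, 9.00)   # Gemini 3.5 Flash output price
saved_per_day = pro - flash

print(f"Pro: ${pro:.2f}/day, Flash: ${flash:.2f}/day")
print(f"Saved: ${saved_per_day:.2f}/day, ${saved_per_day * 30:.2f}/month")
```

Input tokens are left out on purpose. For a real estimate you would add them with the $2.00 and $1.50 input prices.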

The win is not that "Flash beat Pro." The win is that Pro-grade inference got 25% cheaper. That is a quieter story, but a more honest one.

Benchmarks vs. Production: The Usual Caveat

One thing the keynote did not cover: benchmark performance and production performance are different conversations. Benchmarks test isolated capabilities — reasoning, coding, knowledge retrieval — under controlled conditions. Production workloads add latency variance, context window pressure, tool-call chains, and failure modes that benchmarks do not measure.

I have not tested Flash 3.5 in production yet. Nobody outside Google has had enough time to. If you are making infrastructure decisions based on the keynote benchmarks alone, you are making them on incomplete data. Wait for the community benchmarks. Wait for your own evals.

Gemma 4: A Quick Note from Local Testing

On a related note — I have been running Gemma 4 (2.3B) locally for on-device-llm-wiki, a zero-cost, fully offline knowledge engine. In our internal reasoning benchmark across on-device and cloud models, Gemma 4 scored 66/85 — outperforming Granite 3.4B (52), Qwen3 4B (28), and SmolLM2 1.7B (35). For reference, Claude Haiku 4.5 scored 76. A free, local 2B model reaching 87% of a commercial cloud model's reasoning score — while beating a 4B competitor by more than 2x — is not incremental. It is a generational leap.

If Flash 3.5 carries the same generational improvement at cloud scale, the performance claims are plausible. Gemma is the open-weight sibling of the Gemini family, and quality gains in one tend to reflect in the other. But plausible is not confirmed — that requires production testing, not keynote slides.

What I Think You Should Do

Read the pricing page, not the keynote. The pricing page is the source of truth. Marketing narratives are not.
Run your own evals. If you are considering Flash 3.5 for production, test it on your workloads. Benchmark suites test what benchmark suites test.
Compare to the actual competition. Flash 3.5 at $1.50/$9.00 competes with Claude Sonnet 4 ($3/$15), GPT-4.1 ($2/$8), and other mid-to-high tier models. Compare apples to apples at the price point, not at the brand name.
Track the trajectory. Flash went from $0.10/$0.40 to $1.50/$9.00 in three generations. If the pattern holds, Flash 4.0 will cost what Pro costs today. Plan accordingly.

The Bottom Line

Google told a story about the cheap model beating the expensive one. The pricing page tells a story about the expensive model getting a cheaper name. Both stories have truth in them. The benchmarks are real. The price convergence is real. Which story matters more depends on what you are building.

For me, the useful takeaway is simpler: Pro-level performance is now available at $1.50/$9.00. That is good for anyone running agents at scale. Just do not call it cheap — it is 15x more expensive than the Flash you remember.

This is Part 1 of a 5-part Google I/O 2026 review series. Next up: Managed Agents API — serverless agents arrive, but so does GCP lock-in.

If you have tested Flash 3.5 against Pro on your own workloads, I would like to hear the numbers. Drop a comment or find me on GitHub.

AI Agents Are About to Need Government-Issued IDs

ww-w.ai — Tue, 12 May 2026 08:26:01 +0000

AI Agents Are Getting Government IDs — Courtesy of the World's Most Powerful Spy Alliance

In the first week of May, the most powerful intelligence alliance on the planet told the tech industry: your AI agents need passports.

Between May 1 and May 3, the Five Eyes nations — the United States, the United Kingdom, Australia, Canada, and New Zealand — published joint guidelines titled "Careful Adoption of Agentic AI Services."

If the name doesn't ring a bell: Five Eyes is the world's most powerful espionage alliance, founded in 1946 under the UKUSA Agreement. These five nations share intercepted communications intelligence — this is the same network behind the NSA global surveillance programs revealed by Edward Snowden.

The authoring bodies include CISA (the US Cybersecurity and Infrastructure Security Agency), the NSA, and the UK's National Cyber Security Centre (NCSC), along with partner agencies from each member country.

This is the first time these governments have taken a coordinated, public stance on how AI agents should be governed in production environments.

Let me say upfront: I agree with the direction. The engineering recommendations in this document are solid, and they would have prevented real disasters, like the Cursor agent that wiped a production database in 9 seconds last month. But when you stop and ask why a spy alliance, rather than a tech standards body like IEEE or NIST, published AI agent guidelines, that is where the story gets uncomfortable.

Let me walk you through both sides.

What the Guidelines Actually Say

The document is surprisingly concrete for a government publication. It does not deal in vague platitudes about "responsible AI." Instead it lays out specific operational requirements:

Agent identity provisioning. Every agent must have a unique, verifiable identity. No more anonymous processes hiding behind a shared API key.
Audit logging. Every action an agent takes must be recorded in a tamper-evident log. If an agent deletes a database table, there needs to be a trail that says which agent, when, under whose authority.
Delegation chains. When Agent A instructs Agent B to perform a task, the chain of authority must be traceable end-to-end. Think of it like a digital chain of custody.
Human checkpoints. System designs must include points where a human can intervene, review, or override an agent's planned action before it executes.
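To make "tamper-evident" concrete: one common technique is a hash chain, where each log entry's hash covers the previous entry's hash. The guidelines as summarized here do not prescribe this; it is a minimal sketch under that assumption, and the field names and helper functions are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(body):
    # Canonical JSON so the same entry always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def append_entry(log, agent_id, action, authorized_by):
    """Append an entry whose hash covers its content and the previous hash."""
    body = {
        "agent_id": agent_id,
        "action": action,
        "authorized_by": authorized_by,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    log.append({**body, "hash": _digest(body)})

def verify(log):
    """Return True if every entry matches its hash and links to its predecessor."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True
```

Editing any earlier entry changes its hash and breaks every link after it, which is what makes the tampering visible. A real deployment would also need timestamps and an externally stored copy of the latest hash, since this sketch cannot detect entries cut from the end.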

If you have been building agentic systems, none of these ideas are radical. Most experienced teams already implement some version of these patterns. What is new is that a coalition of five national governments is now saying: this is the baseline.

So far, so reasonable. Now let's talk about who is behind that baseline.

Wait — These Are the Snowden Guys?

Before we go further, it is worth pausing on who published this.

In 2013, Edward Snowden — a contractor working for the NSA — leaked thousands of classified documents revealing that Five Eyes agencies had been secretly collecting phone records, emails, and internet activity of ordinary citizens on a massive scale. The NSA's PRISM program was pulling data directly from the servers of Google, Facebook, Apple, and Microsoft. Britain's GCHQ was tapping undersea fiber optic cables to intercept global internet traffic. The Five Eyes nations were also spying on each other's citizens as a workaround — if US law prohibited the NSA from surveilling Americans, they could ask Britain's GCHQ to do it instead and share the results.

The public reaction was enormous. Governments were embarrassed. Tech companies scrambled to encrypt everything. Congress held hearings. The EU threatened to suspend data-sharing agreements. Snowden fled to Russia.

That was 13 years ago. The same agencies are now telling you how your AI agents should behave.

Why a Spy Alliance — Not a Tech Standards Body

So here is the question worth asking: why did these agencies publish AI agent guidelines — and not IEEE, NIST, or the ISO?

These agencies exist to do one thing: monitor communications and figure out who did what. Every phone call, email, and data packet that crosses a border — they want to be able to intercept it, read it, and trace it back to a person. They have spent 80 years and billions of dollars building the infrastructure to do exactly that.

Now imagine a world where millions of AI agents are autonomously making API calls, sending messages, executing code, and moving data across borders — all hiding behind a single shared API key. No name. No identity. No trail. From the perspective of an intelligence agency, that is a nightmare. It is like trying to wiretap a phone call when you do not even know who is on the line.

That is what this guideline is really about.

The identity provisioning requirement means every AI agent gets a name that intelligence agencies can track — just like every phone gets a number.
The audit logging requirement means every action an agent takes is recorded — just like every phone call generates a metadata record.
The delegation chain requirement means you can trace who told the agent to act — just like tracing who ordered a wire transfer.

None of this makes the guidelines wrong. The engineering recommendations are genuinely sound. But here is my interpretation:

These guidelines do make AI agents safer — but could they also be the first step in extending the same surveillance infrastructure that already covers human communications to cover AI agent communications too?

The same agencies that were caught monitoring your emails now want to make sure your AI agents are not invisible to them. Whether you see that as responsible governance or surveillance overreach probably depends on how you felt about the Snowden revelations.

A Practical Guide — Courtesy of Spies

Regardless of where it comes from, the engineering itself is worth learning from. If you are building agents, these are points worth considering:

Per-agent identity. A unique credential per agent instance instead of a shared API key means you can pinpoint which agent acted when something goes wrong.
Tamper-evident logging. Recording every action and decision (not just errors) and making logs auditable by a third party increases transparency.
Delegation chain tracking. Mapping the authority path from Agent A → B → C means you can answer "who authorized this?"
Human checkpoints. A review step before high-impact actions (database writes, external APIs, financial transactions) could have prevented incidents like the Cursor wipe.
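The last two points can be sketched together: resolve the authority path for an agent, then pause for a human before anything high-impact runs. This is an illustrative Python sketch, not something taken from the guidelines; the action categories, the `delegations` mapping, and the `approve` callback are all assumptions.

```python
HIGH_IMPACT = {"db_write", "external_api", "payment"}  # illustrative categories

def authority_path(delegations, agent):
    """Walk 'who instructed whom' links back to the originating principal."""
    path = [agent]
    while path[-1] in delegations:
        parent = delegations[path[-1]]
        if parent in path:
            raise ValueError("delegation cycle detected")
        path.append(parent)
    return list(reversed(path))

def run_action(agent, action_type, execute, delegations, approve):
    """Execute an action, pausing for human approval when it is high impact."""
    path = authority_path(delegations, agent)
    if action_type in HIGH_IMPACT and not approve(agent, action_type, path):
        return ("blocked", path)
    return ("done", execute())

# Each key was instructed by its value: alice -> agent-a -> agent-b -> agent-c.
delegations = {"agent-c": "agent-b", "agent-b": "agent-a", "agent-a": "alice"}
print(authority_path(delegations, "agent-c"))
```

In practice `approve` would be whatever puts a person in the loop, such as a ticket, a chat prompt, or a CLI confirmation. Passing it the full path means the reviewer sees who authorized the request, not just which agent is asking.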

These principles make your system more robust regardless of regulation. Just remember where they came from.

To wrap up: all of this reminds me of Q handing James Bond his gadgets. Turns out, when it comes to cutting-edge agent technology, the spy agencies are still leading the way.

What's your take? New perspectives after reading this, security issues you've hit while building agents, or just your reaction to spy agencies publishing AI guidelines — drop anything in the comments.

Lorem Ipsum Makes LLMs Smarter. No, Seriously.

ww-w.ai — Mon, 11 May 2026 17:32:06 +0000

You know Lorem Ipsum. The placeholder text designers have been slapping into mockups since the 1960s. Turns out, it might be one of the most effective tools for making language models better at math.

A paper dropped last week — "Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration" (Huang et al., May 2026) — and the core finding is wild: prepending random Lorem Ipsum text before math problems during reinforcement learning training produces models that solve problems they otherwise never could.

Let me walk through why this works, because it is genuinely clever once you see the mechanism.

The Problem: When Every Answer Is Wrong, Nobody Learns

Modern LLM training uses reinforcement learning after the initial pretraining phase. One popular method is GRPO (Group Relative Policy Optimization), where you sample multiple candidate answers for a question, then reward the good ones and penalize the bad ones.

Here is the catch. For hard questions, all sampled answers might be wrong. When that happens, every candidate gets the same score. The relative advantage between them collapses to zero. No gradient. No learning signal. The model just shrugs and moves on.

This is called the zero-advantage problem, and it hits hardest on the exact questions you want the model to learn most — the difficult ones sitting at the frontier of its capability.
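The collapse is easy to see in code. The sketch below uses the usual group-relative normalization (reward minus group mean, divided by group standard deviation); the binary 0/1 rewards and the group size of 8 are assumptions for illustration, not values from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

all_wrong = [0.0] * 8            # hard question: every sample fails
one_right = [0.0] * 7 + [1.0]    # a single sample succeeds

print(group_relative_advantages(all_wrong))   # every advantage is zero
print(group_relative_advantages(one_right))   # the lone success stands out
```

With identical rewards, every numerator is zero, so there is nothing to push the policy in any direction. One correct sample in the group is enough to bring the signal back.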

Previous fixes tried resampling (just roll the dice again) or adjusting reward scaling. They help a little, but fundamentally you are still asking the same question the same way, hoping for a different result.

The Fix: Just Jam Some Latin In There

LoPE — Lorem Perturbation for Exploration — does something that sounds like a prank. When the model fails on a hard question, LoPE prepends a randomly assembled chunk of Lorem Ipsum text before the prompt and resamples.

So instead of:

Solve: What is the integral of x^2 from 0 to 3?

The model sees:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Solve: What is the integral of x^2 from 0 to 3?

And somehow, this works. The nonsense prefix perturbs the model's internal state just enough to push it down different reasoning paths. Think of it like giving a stuck hiker a gentle shove in a random direction — sometimes that is all you need to find a trail you could not see before.
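Here is a rough sketch of the resampling step as described above. It is not the authors' implementation: the word list, the prefix length, the group size, and the `sample_fn` and `is_correct` callbacks are all placeholders, and it does not cover how the perturbed samples are used in the policy update.

```python
import random

LOREM_WORDS = ("lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
               "eiusmod tempor incididunt ut labore et dolore magna aliqua").split()

def lorem_prefix(rng, n_words=12):
    """Assemble a random, meaningless Latin-looking sentence."""
    words = [rng.choice(LOREM_WORDS) for _ in range(n_words)]
    return " ".join(words).capitalize() + "."

def sample_with_perturbation(prompt, sample_fn, is_correct, group_size=8, seed=0):
    """Sample a group; if every answer is wrong, resample behind a Lorem prefix."""
    rng = random.Random(seed)
    answers = [sample_fn(prompt) for _ in range(group_size)]
    if any(is_correct(a) for a in answers):
        return prompt, answers
    perturbed = lorem_prefix(rng) + "\n" + prompt
    return perturbed, [sample_fn(perturbed) for _ in range(group_size)]
```

The question itself is never modified. Only the text in front of it changes, which is why the cost of trying this is so low.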

Why Latin and Not Just Random Characters?

The authors tested this systematically. Not all perturbations are equal. What works:

Latin-based vocabulary (Lorem Ipsum words)
Low perplexity (around 25) — the text needs to "look like language" to the model, even if it is meaningless

What does not work well:

Random character strings (too alien, the model just ignores them or breaks)
High-perplexity gibberish
Perturbations in the model's primary training language (too much semantic interference)

Lorem Ipsum hits a sweet spot: familiar enough that the model processes it normally, foreign enough that it does not contaminate the actual reasoning task. It nudges the hidden states without hijacking them.

The Numbers

Tested on Qwen3-4B-Base across standard math benchmarks:

Benchmark	Standard GRPO	LoPE	Change (points)
MATH-500	77.80	82.60	+4.80
AMC	47.76	58.21	+10.45 (+22% relative)
AIME 2024	16.41	19.90	+3.49
Overall avg	49.37	53.99	+4.62

On the 7B model, the gap widens further: +6.20 points over standard GRPO.

But the most interesting result is qualitative. On a set of 352 hard questions, LoPE uniquely solved 50 questions that no other method could crack. These were not marginal improvements on borderline problems. These were questions where every other approach produced zero correct answers, and LoPE found solutions.

The mechanism shows up clearly in the advantage signal. For those rare successful trajectories on hard problems, LoPE amplifies the advantage by 2.1x to 5.0x compared to standard resampling. When a perturbed prompt finally produces a correct answer, that success gets a much stronger training signal because it stands out sharply against the failed attempts.

Why This Matters for Practitioners

Three takeaways if you work with LLMs:

1. Exploration is still an unsolved problem. We talk a lot about scaling data and compute, but how models explore the solution space during RL training is arguably more important and much less understood. LoPE is evidence that we are leaving performance on the table.

2. Prompt sensitivity is a feature, not a bug. The fact that meaningless prefix text can unlock entirely different reasoning chains tells us something deep about how these models navigate their latent space. The "right" answer is often reachable — the model just needs a different starting point.

3. Simple methods can beat complex ones. LoPE is almost embarrassingly simple to implement. No architecture changes. No reward model modifications. Just prepend some Lorem Ipsum during resampling. If you are doing RL fine-tuning, this is a near-zero-cost experiment to try.

The broader lesson: sometimes the best interventions do not add information. They add noise in exactly the right way.

Paper Link

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
Huang, Huang, Li, Cai, Yang, Huang (Washington University in St. Louis) — May 7, 2026

Note: This is an arXiv preprint — not yet peer-reviewed. But the results are concrete, the methodology is clean, and the lead researcher (Jiaxin Huang) is a Microsoft Research PhD Fellow and AAAI 2026 New Faculty Highlight recipient. Worth watching.

Image Source: Huang et al., "Nonsense Helps" (arXiv:2605.05566), CC BY-NC-SA 4.0

Delete the Vercel Claude Code Plugin. Here's Why I Did.

ww-w.ai — Mon, 11 May 2026 13:46:49 +0000

TL;DR

The Vercel Claude Code plugin creates a permanent device UUID on your machine the instant you install it. No notification. No expiry. No rotation.
Session starts, tool calls, skill matches — all sent to telemetry.vercel.com. Default ON, no consent prompt. Prompt metadata (matched skill + score) included.
What's worse: they built a consent dialog for prompt text collection. But clicking "No thanks" only stops prompt text. All other telemetry keeps running. Most users will think they opted out of everything.
The documentation exists — buried eight directories deep inside ~/.claude/plugins/cache/. Nobody reads it. Documented ≠ Informed.

What I Found

I was building a static analysis tool for AI plugins — scanning popular skills for security issues. Regex pattern matching plus dual-LLM cross-verification.

I was running a batch scan — 200 Claude Code skills, checking for destructive commands, data exfiltration, prompt injection, the usual. On skill #147, the scanner flagged something in ~/.claude/. Not in some random GitHub repo. On my own machine.

I didn't suspect Vercel for a second. I assumed the flag was a false positive in my own skill. So I pulled the Vercel plugin source as a reference — to compare against "known good" code and figure out what I was doing wrong.

Then I read the Vercel source. Here's what I found.

The Evidence

All file paths and line numbers reference vercel-plugin v0.32.7, located at ~/.claude/plugins/cache/vercel/vercel-plugin/0.32.7/.

Every session start sends this:

// session-start-profiler.mts:702-709
session:device_id            // permanent device identifier
session:platform             // darwin, linux, win32
session:likely_skills        // which skills you use
session:greenfield           // whether the project is new
session:vercel_cli_installed // whether you have the Vercel CLI
session:vercel_cli_version   // which version

Every tool call you make — any tool, not just Vercel's:

// pretooluse-skill-inject.mts:969-971
tool_call:tool_name          // which tool you just called

Every time a skill matches your prompt:

// pretooluse-skill-inject.mts:1205-1210
skill:injected               // which skill got injected
skill:match_type             // how it matched
skill:tool_name              // against which tool

Every prompt you submit:

// user-prompt-submit-skill-inject.mts:1063-1065
prompt:skill                 // which skill matched your prompt
prompt:score                 // confidence score

All of it flows to a single endpoint:

https://telemetry.vercel.com/api/vercel-plugin/v1/events

None of it asked for your permission.

The permanent device ID

This is the part that should make you check your machine right now. Run this:

cat ~/.claude/vercel-plugin-device-id

You'll see something like:

473d7060-5a37-4ebb-9082-b09a983c****

A UUID. Created the instant you installed the plugin. Silently. No notification. It never expires. It never rotates. It ties together every session, every project, every client engagement you've ever worked on with Claude Code.

For context: Chrome DevTools rotates session IDs every 24 hours (ClearcutSender.ts:35,68-70). Vercel's device ID never expires. Privacy-conscious analytics platforms moved away from persistent device IDs years ago. This one lasts forever.

Dozens of telemetry events per coding session. All tied to a permanent fingerprint. All default-on.

"But It's in the README"

Technically, yes. The plugin's README.md has a ## Telemetry section. It explains what's collected and how to disable it.

But does anyone seriously think that counts as consent?

Walk through what actually happens:

You install the plugin.
It prints a success message.
You start coding.

At no point does any text appear on your screen about telemetry. No prompt. No checkbox. No banner. Nothing. Meanwhile, in the background: ~/.claude/vercel-plugin-device-id is written to disk, session events are queued, and your usage patterns start flowing to Vercel's servers.

The README is sitting in ~/.claude/plugins/cache/vercel/vercel-plugin/0.32.7/. Eight directories deep inside a hidden folder. Nobody browses there.

GDPR defines valid consent as "freely given, specific, informed, and unambiguous." Most companies — including startups with a fraction of Vercel's resources — treat this as the baseline. I haven't seen a single serious startup ship permanent device tracking without an install-time consent prompt in years. It's just not done anymore.

Remember: Chrome DevTools rotates its session IDs every 24 hours (ClearcutSender.ts:35,68-70). That's the standard. Vercel's device ID never rotates. Never expires. Created once, lives forever.

This is not a gray area. This is not "technically compliant." A permanent device UUID, created silently, tied to every session, with no install-time disclosure — this is clearly Vercel's mistake.

I used this plugin daily for months. I had no idea. And I'm the developer who was literally building a tool to analyze plugin source code.

The Part That's Even More Absurd — I Never Consented

Here's what makes this worse. The plugin actually has a consent dialog — for prompt text collection:

// user-prompt-submit-telemetry.mts:58-61
prompt:text  // full prompt content, up to 100KB — OPT-IN ONLY

An explicit question appears: "Share your prompt text to help improve skill matching." You can say yes or no. Your choice is saved.

So they know how to build consent flows. They built the infrastructure. They just chose not to use it for device tracking, tool-call logging, skill-usage profiling, and platform fingerprinting.

And here's the trap: if you click "No thanks," you think you've opted out. You haven't. Base telemetry — everything in the previous section — keeps running. The README even says so: "base telemetry remains on by default."

But you already clicked "No thanks." In your mind, the matter is settled. That's not a documentation gap. That's a dark pattern.

How to Protect Yourself

Do this now. It takes 60 seconds.

1. Check if you're affected

ls ~/.claude/vercel-plugin-device-id

If the file exists, you have a permanent tracking UUID on your machine.

2. Disable telemetry

Add this to your shell profile (.zshrc, .bashrc, etc.):

export VERCEL_PLUGIN_TELEMETRY=off

Then reload (or source ~/.bashrc if you use bash):

source ~/.zshrc

3. Or just uninstall the plugin entirely

If you don't need it, remove it. One fewer thing sending data you didn't agree to.
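If you want to script the checks from steps 1 and 2, here is a small Python sketch. It uses only the file path and environment variable named above; the function name is mine. It tells you what is on disk and what is set in the current environment, not what the plugin actually does with that setting.

```python
import os
from pathlib import Path

def telemetry_status(home=None, env=None):
    """Report whether the device-ID file exists and whether the opt-out variable is set."""
    home = Path(home) if home is not None else Path.home()
    env = os.environ if env is None else env
    return {
        "device_id_present": (home / ".claude" / "vercel-plugin-device-id").is_file(),
        "telemetry_disabled": env.get("VERCEL_PLUGIN_TELEMETRY") == "off",
    }

print(telemetry_status())
```

Run it from a fresh terminal after editing your shell profile, so the environment it reads is the one your next coding session will get.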

What Should Change

Two proposals. Design standards, not policy demands.

1. Surface telemetry at install time. One prompt. Plain language. "This plugin collects [X] and sends it to [Y]. OK?" The user sees it. The user decides. This is four lines of install-time code. Vercel already has the consent infrastructure. They use it for prompt text. Extend it to everything else.

2. Treat data flows as API surface. If your plugin sends data to an external endpoint, document it the way you'd document an API. What data. Where it goes. How often. How to stop it. Put this in the install output, not in a README eight directories deep.
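To show how little code the first proposal needs, here is a hypothetical install-time prompt in Python. It is not Vercel's code and does not reflect how the plugin's installer works; the function name, wording, and default are assumptions.

```python
def ask_telemetry_consent(collected, endpoint, read=input):
    """Show what is collected and where it goes; anything but an explicit yes means no."""
    print(f"This plugin collects {', '.join(collected)} and sends it to {endpoint}.")
    answer = read("Allow telemetry? [y/N] ").strip().lower()
    return answer in ("y", "yes")
```

The important design choice is the default: pressing Enter without typing anything opts out.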

These aren't radical ideas. Homebrew notifies you on first run. VS Code notifies you on first launch. It's already the industry standard. The Vercel plugin just doesn't.

Check your ~/.claude/ directory right now. What did you find? Drop it in the comments.

We Need a RunCat for the AI Era

ww-w.ai — Tue, 05 May 2026 15:36:00 +0000

A 16-pixel hero in your macOS menu bar. Watches LLM traffic. That's it.

RunCat told us the CPU was busy. Nothing tells us the agent is.

You remember RunCat — the kitten in your menu bar that runs faster when your CPU is busy. Almost a decade old. Adorable. Useful. Asks nothing of you.

AI-native development needs the same thing for a different signal. Not CPU. Agent traffic. Is there a live LLM request flowing right now, or is everything quiet?

That's why I built AgentRunner.

We need a RunCat for the AI era. So I made one.

A 16-pixel hero in your macOS menu bar. Runs when your agent's actually working. Idle when it isn't. That's the whole UI.

Seven things it's built around

1. The menu bar is where you already glance. Same place as the clock. No extra tab, no extra window, no "I'll open the dashboard later."

2. Below the noise floor. <1% CPU, ~20MB RAM. Native SwiftUI. A monitor that becomes its own monitoring problem is a joke.

3. Flashy "live agent dashboards" don't last. Animated traffic, live token deltas, color-coded latency heatmaps: fun for a week, closed and forgotten by the next sprint. RunCat ran for a decade because it asked you nothing. Same spirit here.

4. Detailed analysis belongs in a different tool. Token spend, cache misses, run history — that needs report depth. That's what cc-token-saver is for, and it gets its own post next. AgentRunner = glance. cc-token-saver = report. Don't make one app try to be both.

5. Vendor-neutral by design. It watches LLM traffic, not Claude traffic. Claude Code, Codex, Cursor, Windsurf, local LLaMA via Ollama, any agent loop hitting a model endpoint over HTTPS. No API key, no per-vendor SDK.

6. Local-only. Zero telemetry. Detection happens on your machine. The app does not phone home. No analytics SDK, no event ping. An agent monitor that ships your data anywhere doesn't deserve trust.

7. Idle vs Active. Binary. That's the entire UI. RunCat gave us a kitten that ran when CPU spiked. AgentRunner gives you a 16-pixel hero that runs when LLM traffic flows. Same spirit. Useful. Small. Invisible until you glance at it.

Get it

Repo: https://github.com/ww-w-ai/AgentRunner
License: Apache-2.0
Requires: macOS 13+

cc-token-saver post: coming next.