DEV Community: frog404

Playwright can test Chrome extensions. So why does my AI agent still need my help?

frog404 — Mon, 20 Jul 2026 07:44:45 +0000

What happens when you let an AI write a browser extension

I've been building Chrome extensions with AI coding agents — Claude Code, Cursor, whatever. Writing the code is the fast part. The problem is everything after.

"Build it, load it into the browser, open the popup, click the button, check the console."

That part? All me. Every time. The agent can write the code, but it can't see what the code looks like on screen. It can't click anything. It can't read the logs. So the human becomes the agent's eyes and hands.

At first I accepted this as the cost of doing business. Then I hit the UI-polishing phase and it turned into hell. Every "move that button 20px right" or "make the font 14px" meant: build → load → screenshot → paste it into chat → "how's this?" → agent edits code → build → load → screenshot. Forever.

And this isn't just a Chrome extension problem. Electron apps, VSCode extensions — anything with a UI that doesn't open in a normal browser tab has the same failure mode.

"Just use Playwright" — okay, let's talk about that

For a normal web app, sure. Playwright opens the page, screenshots it, clicks things. Tell your agent "verify it with Playwright" and it does.

Playwright supports Chrome extensions too — there are official docs, and they work. But the setup is nothing like a regular page. You need launchPersistentContext to load the extension, extract the extension ID dynamically from the Service Worker URL, then navigate to chrome-extension://{id}/popup.html. Popups have a different lifecycle from normal pages, and MV3 Service Workers get suspended when idle — Playwright handles most restarts transparently these days, but it's still one more extension-specific concern the agent has to account for.
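The ID-extraction step is the least obvious part of that setup, so here is a minimal sketch of it in isolation. It assumes you already have the service worker URL as a string (in a Playwright script it would come from the persistent context). The function names are made up for illustration and are not part of Playwright.

```python
from urllib.parse import urlparse


def extension_id_from_worker_url(worker_url: str) -> str:
    """Return the extension ID embedded in a service worker URL."""
    parsed = urlparse(worker_url)
    # The ID is the host part of a chrome-extension:// URL.
    if parsed.scheme != "chrome-extension" or not parsed.netloc:
        raise ValueError(f"not an extension URL: {worker_url!r}")
    return parsed.netloc


def popup_url(worker_url: str, page: str = "popup.html") -> str:
    """Build the chrome-extension:// URL for a page of the same extension."""
    return f"chrome-extension://{extension_id_from_worker_url(worker_url)}/{page}"
```

The resulting URL is what the test would then navigate to. Everything around it (launching the persistent context, waiting for the worker to register) still has to be written separately.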

As test code written by a human, this is all fine. Follow the docs and it works.

The problem is asking an agent to do it. All I want to say is "build and check it." Instead, the agent has to write Playwright setup code, understand extension-specific context management, write a test, run it, and parse the output. That's a lot of machinery for "does the button look right?"

Electron is similar. Playwright has (experimental) Electron support — electron.launch gets you basic UI automation. But native OS dialogs like dialog.showOpenDialog() live outside the DOM, so you need to inject mocks. Want to observe IPC between main and renderer? Build your own instrumentation. And Electron bugs love to hide on the far side of IPC, so seeing only the window is half a verification.

VSCode extensions have a different official testing path entirely. The standard setup uses @vscode/test-electron to launch an Extension Development Host. VSCode itself is an Electron app, so you can drive the window through Playwright's Electron support, but commands, notifications, output channels, and the Problems panel all need VSCode-specific integration on top.

So: every platform has good primitives, and agent-facing tools like Playwright MCP already cover driving regular browser pages. What I couldn't find was a single agent-facing interface that covered Chrome extension lifecycle management (build, reload, Service Workers, content scripts), Electron IPC, and VSCode-specific operations under one operation model.

The primitives already existed. What was missing was a unified development loop an agent could call without rebuilding the platform-specific setup every time.

So I built it.

KamoX: hide the platform mess behind a local HTTP API

KamoX is a local HTTP API server. It's not a Playwright replacement — think of it as a layer that wraps Playwright and each platform's native tooling behind endpoints an agent can hit with curl.

POST /rebuild — build and reload
POST /check-ui — screenshot + DOM info + console errors, in one call
POST /playwright/element — click buttons, fill forms
GET /logs — retrieve logs

/check-ui returns the screenshot path plus the page title, body text, HTML, console logs, and page errors as text. So even an agent that can't look at images can reason about UI state from the text:

{
  "success": true,
  "data": {
    "loaded": true,
    "screenshot": "/project/.kamox/screenshots/popup_1234567890.png",
    "dom": { "title": "Popup Title", "bodyText": "...", "html": "<html>...</html>" },
    "logs": [],
    "errors": [],
    "performance": { "loadTime": 240 }
  }
}
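To make "reason about UI state from the text" concrete, here is a rough sketch of how a script or agent might turn that response into a list of problems. The field names come from the sample above. The shape of the entries in `errors` is an assumption (they are only formatted as text here), and `ui_problems` is a hypothetical helper, not something KamoX ships.

```python
def ui_problems(response: dict) -> list[str]:
    """List human-readable problems found in a /check-ui style response."""
    if not response.get("success"):
        return ["request failed"]
    data = response.get("data", {})
    problems = []
    if not data.get("loaded"):
        problems.append("page did not load")
    # Assumption: each entry in "errors" can be rendered as plain text.
    problems += [f"page error: {err}" for err in data.get("errors", [])]
    if not data.get("dom", {}).get("bodyText", "").strip():
        problems.append("body text is empty")
    return problems
```

An empty list means nothing in the text looks wrong. Whether the popup actually looks right is still a question for the screenshot.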

A typical agent loop looks like this:

# after editing code, rebuild
curl -X POST http://localhost:3000/rebuild

# check the UI (keepOpen leaves the popup up)
curl -X POST http://localhost:3000/check-ui \
  -H "Content-Type: application/json" \
  -d '{"keepOpen": true}'

# click a button in the open popup
curl -X POST http://localhost:3000/playwright/element \
  -H "Content-Type: application/json" \
  -d '{"pageType": "popup", "selector": "#save-btn", "action": "click"}'

# read the logs
curl http://localhost:3000/logs

Why HTTP instead of MCP?

A deliberate choice. I picked HTTP as the core interface because it's easy to inspect, easy to script, and callable from any agent that can run shell commands and reach localhost — plus the same API works from plain scripts and CI, not just agents. Requests and responses are trivially debuggable by a human with curl.

An MCP adapter could be layered on top later, but the underlying platform API doesn't need to depend on any particular client protocol. For a local dev server, plain HTTP is the simplest possible integration surface.
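As an illustration of the "plain scripts and CI" point, the same loop as the curl commands can be written with nothing but the Python standard library. This is a sketch under assumptions: the port is the one from the examples above, `/logs` is assumed to return JSON like `/check-ui` does, and the helper names are invented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # port taken from the curl examples


def build_request(method, path, payload=None):
    """Create (but do not send) a request for one endpoint."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def call(method, path, payload=None, timeout=60):
    """Send the request and decode the JSON body of the reply."""
    request = build_request(method, path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)


def verify_once():
    """One rebuild, check-ui, logs pass; needs a running server."""
    call("POST", "/rebuild")
    ui = call("POST", "/check-ui", {"keepOpen": True})
    logs = call("GET", "/logs")  # assumption: the log endpoint replies with JSON
    return ui, logs
```

Because `build_request` is separate from `call`, the request shape can be checked in a unit test without a server running.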

Why a plugin architecture?

Chrome extensions, Electron apps, and VSCode extensions launch differently and debug differently. But what the agent wants is identical: build, look at the screen, press a button, read the logs.

So the core owns the common API, and platform-specific dirt lives in plugins: waking idle Service Workers and reading their logs via CDP (Chrome), IPC monitoring and native-dialog mocks (Electron), command execution and Problems-panel access (VSCode).

kamox chrome --auto-build
kamox electron --entryPoint main.js
kamox vscode --project-path .

One command starts the server; the agent takes it from there.
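The core/plugin split can be sketched roughly like this. It is an illustration of the idea only, not KamoX's actual code: the class names, the method names, and the `error` field are all assumptions, and the plugin is a fake that drives nothing.

```python
from abc import ABC, abstractmethod


class PlatformPlugin(ABC):
    """What the core expects from every platform (hypothetical interface)."""

    @abstractmethod
    def rebuild(self) -> dict: ...

    @abstractmethod
    def check_ui(self) -> dict: ...

    @abstractmethod
    def logs(self) -> list[str]: ...


class FakeChromePlugin(PlatformPlugin):
    """Stand-in plugin; a real one would build and drive the browser."""

    def __init__(self):
        self._logs = []

    def rebuild(self):
        self._logs.append("rebuilt and reloaded extension")
        return {"success": True}

    def check_ui(self):
        return {"success": True, "data": {"loaded": True, "errors": []}}

    def logs(self):
        return list(self._logs)


class Core:
    """Maps the shared routes onto whichever plugin was chosen at startup."""

    def __init__(self, plugin: PlatformPlugin):
        self._routes = {
            ("POST", "/rebuild"): plugin.rebuild,
            ("POST", "/check-ui"): plugin.check_ui,
            ("GET", "/logs"): lambda: {"success": True, "data": plugin.logs()},
        }

    def handle(self, method: str, path: str) -> dict:
        handler = self._routes.get((method, path))
        if handler is None:
            return {"success": False, "error": f"no route for {method} {path}"}
        return handler()
```

Swapping `FakeChromePlugin` for an Electron or VSCode plugin would leave `Core` and the routes untouched, which is the property the agent cares about.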

What actually changed

The biggest shift: I can now say "hit localhost:3000 and check your work."

Before, every CSS tweak meant me building, opening the browser, screenshotting, and pasting. Now the agent runs /rebuild, pulls a screenshot and DOM info from /check-ui, clicks through with /playwright/element if something looks off, grabs errors from /logs, and fixes its own code. Just eliminating the screenshot-and-paste ritual changed how the whole loop feels.

It's not perfect, and I'd rather be upfront about that:

KamoX opens popup.html as a regular page. DOM and visuals are verifiable, but behaviors specific to a real toolbar popup, such as focus handling and auto-close, aren't reproduced.
"Is this design actually good?" is still a human call.
Extension permissions and platform-specific sharp edges still exist.
KamoX is a local development tool — don't expose its HTTP port to untrusted networks.

But I no longer spend hours acting as my agent's eyes and hands, and on UI-heavy work the difference is obvious.

There's also kamox guide, which prints an LLM-optimized API reference you can point your CLAUDE.md or .cursorrules at. Scenario files let you define preconditions. Honestly though, 90% of usage is /rebuild + /check-ui + /logs.

Try it

npm install -g kamox
kamox chrome --auto-build

Chrome extension, Electron, and VSCode modes are implemented (VSCode mode is currently tested mainly on Windows). MIT licensed.

GitHub: https://github.com/iwabuchi404/kamox

I built an AI-friendly knowledge base so coding agents stop forgetting project context

frog404 — Thu, 02 Jul 2026 13:49:58 +0000

I have 20+ personal projects. Every time I start a new AI session, I have to re-explain the project structure, what was decided last time, and what to do next. Switching between Claude Code, Codex, and other tools makes it worse — the context doesn't follow.

I built ContextMixer to fix this: a knowledge base where AI agents search, read, and update project documents through MCP or REST API, and humans review them in a web UI.

Why not Notion / Obsidian / local Markdown?

Local Markdown + Git worked for CLI agents but couldn't be accessed from chat-based tools like Claude.ai.

Notion had MCP support, but fetching pages was slow and token-heavy. Its rich block structure is great for humans but inefficient for AI access.

Obsidian is local-first, which rules out chat-based access.

I needed something accessible from both chat and CLI, lightweight for AI, and editable by humans. Nothing fit, so I built my own. It runs on Cloudflare Workers + D1 + R2, entirely within the free tier.

Progressive retrieval

The core design idea: agents choose how much to read.

View	Returns	Tokens
`meta`	title + summary	tens
`outline`	heading structure	hundreds
`section`	one specific section	hundreds–thousands
`full`	entire document	all

A typical flow for "check the auth design":

1. list_docs(collection_id)        → document list with metadata
2. get_doc(doc_id, view="outline") → heading structure
3. get_doc(doc_id, view="section", section="Auth") → just what's needed

Three API calls, but hundreds of tokens instead of thousands. Most queries resolve at meta → section without ever reading the full document.
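To show what the four views could mean for a Markdown document, here is a small self-contained sketch. It assumes a document is a dict with `title`, `summary`, and a Markdown `body`, and that sections are delimited by `#` headings. ContextMixer's real storage model and section matching may differ, and `get_view` is a hypothetical name.

```python
from typing import Optional


def _heading(line: str):
    """Return (level, text) for a Markdown heading line, else None."""
    stripped = line.lstrip("#")
    level = len(line) - len(stripped)
    if level == 0 or not stripped.startswith(" "):
        return None
    return level, stripped.strip()


def get_view(doc: dict, view: str, section: Optional[str] = None):
    """Return only as much of the document as the requested view needs."""
    lines = doc["body"].splitlines()
    if view == "meta":
        return {"title": doc["title"], "summary": doc["summary"]}
    if view == "outline":
        return [line for line in lines if _heading(line)]
    if view == "full":
        return doc["body"]
    if view != "section":
        raise ValueError(f"unknown view: {view!r}")

    collected, level = [], None
    for line in lines:
        heading = _heading(line)
        if level is None:
            if heading and heading[1] == section:
                level = heading[0]
                collected.append(line)
        elif heading and heading[0] <= level:
            break  # next heading at the same or a higher level ends the section
        else:
            collected.append(line)
    if level is None:
        raise KeyError(section)
    return "\n".join(collected).strip()
```

A section keeps its nested subsections, so asking for "Auth" also returns anything filed under it. One known gap in this sketch: a `#` comment line inside a fenced code block would be mistaken for a heading.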

AI Cortex: structuring project memory

ContextMixer is a generic container. The structure is up to you.

I keep four documents per project:

context — current phase, recent work, unresolved issues, handoff notes for the next session
spec — confirmed goals, requirements, tech stack (nothing tentative)
decisions — design choices with reasoning, including rejected alternatives
notes — library pitfalls, bug workarounds, implementation tips

I call this pattern AI Cortex. The agent reads context at session start and updates it at the end. The next session picks up where the last one left off.

The biggest win: I no longer re-explain previous sessions. The agent reads context and resumes. decisions also helps — "why did we drop Vue?" is answered by the document, not by me repeating myself.

But letting AI write to a knowledge base has risks. Without rules, agents write tentative ideas into spec, create unconfirmed decisions, or duplicate documents. Write permissions aren't enough: you need writing rules. In my setup, CLAUDE.md specifies things like "don't write to decisions without user confirmation" and "search before creating new documents." Still a work in progress.
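The two rules quoted above live in a prompt file, but the same checks could also be enforced in code before a write is accepted. The following is only a sketch of that idea: `WriteRequest`, `review_write`, and the exact-title duplicate check are assumptions made for illustration, not how ContextMixer works.

```python
from dataclasses import dataclass

# Only "decisions" is gated here, matching the quoted rule; adding "spec"
# would be a natural extension but is not something the rules state.
NEEDS_CONFIRMATION = {"decisions"}


@dataclass
class WriteRequest:
    doc: str                        # "context", "spec", "decisions", or "notes"
    title: str
    user_confirmed: bool = False
    creates_document: bool = False  # True when the agent wants a new document


def review_write(request: WriteRequest, existing_titles: list[str]) -> list[str]:
    """Return rule violations; an empty list means the write may proceed."""
    violations = []
    if request.doc in NEEDS_CONFIRMATION and not request.user_confirmed:
        violations.append(f"writing to {request.doc} needs user confirmation")
    known = {title.strip().lower() for title in existing_titles}
    if request.creates_document and request.title.strip().lower() in known:
        violations.append("a document with this title already exists; update it instead")
    return violations
```

A title comparison only catches exact duplicates. The "search before creating" rule covers near-duplicates too, which is why it is still easier to state in prose than to enforce.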

Karpathy's LLM Wiki

After I started building ContextMixer, I came across Karpathy's LLM Wiki pattern — where an LLM builds and maintains a structured wiki from raw sources instead of doing RAG lookups every time.

It resonated, but the use cases differ. LLM Wiki is a research library: accumulate and organize what you've read. ContextMixer is a project whiteboard: manage ongoing work across sessions, tools, and access points.

Links

📦 GitHub
🔗 Demo
📄 Project page (detailed design notes)
📝 Japanese article on Zenn

AI agents don't need to remember everything. They need a good place to read from.