אחיה כהן

Posted on Apr 1

7 Things I Learned Building a Safari Browser Automation Tool That Chrome Can't Do

#ai #webdev #javascript #discuss

Every browser automation tool assumes you're using Chrome.

Playwright? Chrome. Puppeteer? Chrome. Selenium? Technically supports others, but let's be real -- Chrome. Even the new wave of AI-powered browser tools (Chrome DevTools MCP, Browserbase) are all Chromium under the hood.

I use Safari as my daily browser. I have 47 tabs open right now with active sessions -- Gmail, GitHub, Ahrefs, my hosting dashboards. When I started building AI agents that needed to interact with web pages, every tool told me the same thing: "Just use Chrome."

So I spent the last two weeks building Safari MCP -- a native Safari automation server with 80 tools, running entirely through AppleScript and JavaScript injection. No Chrome. No Puppeteer. No headless browser.

Here are 7 things I learned that surprised me.

1. WebKit on macOS Is Not What You Think It Is

When Playwright says it supports WebKit, it's running a custom build of WebKit in a separate process. It's WebKit the engine, not Safari the application.

The real Safari on macOS runs inside the operating system's rendering pipeline. It shares resources with the window server, uses the system's DNS resolver, benefits from Apple's Intelligent Tracking Prevention, and -- this is the part that matters for automation -- it has access to your actual cookies, sessions, and logins.

The practical difference: when my AI agent needs to check Google Search Console, it just... opens it. No login flow. No stored credentials. No OAuth dance. Safari already has my Google session from this morning.

This is what "native" actually means. Not "runs on macOS" -- but "runs as macOS."

2. The CPU Difference Is Real: ~60% Less Than Chrome

I wasn't expecting this to be significant. I was wrong.

Running Chrome DevTools Protocol (via Chrome DevTools MCP) on my M2 MacBook Pro, Activity Monitor showed Chrome Helper processes eating 30-45% CPU during automation tasks. The fans spun up. My laptop got hot.

Safari MCP doing the same tasks: 10-15% CPU. No fan noise. The likely reason isn't that Safari is "more efficient" in some abstract sense -- it's that Safari is already running as my daily browser. There's no second browser to launch, no extra set of helper processes to keep warm, no DevTools protocol overhead.

For AI agents that run for hours -- scraping data, filling forms, monitoring dashboards -- this isn't a nice-to-have. It's the difference between a usable laptop and a space heater.

3. Apple's Private Entitlement Wall (The Thing That Almost Killed the Project)

My first approach was obvious: use Safari's Web Inspector Protocol. Safari has a remote debugging protocol -- you can see it in the Develop menu. It's how Safari's built-in DevTools work.

I spent days trying to connect to it programmatically. Here's what I found:

Safari's Web Inspector uses an XPC service (com.apple.WebKit.WebContent) that requires a private Apple entitlement to connect. This entitlement is only granted to Apple-signed binaries -- Safari itself and Xcode's Instruments.

You cannot get this entitlement as a third-party developer. There's no API to request it. No workaround. Apple has deliberately locked down programmatic access to Safari's debugging protocol.

This is the wall that stops every "Safari automation" attempt. It's why Selenium's Safari driver is perpetually limited. It's why no one has built a Puppeteer-for-Safari.

I had to find another way in.

4. AppleScript + A Swift Daemon Gets You ~5ms Per Command

The "other way in" turned out to be hiding in plain sight: AppleScript's do JavaScript command.

tell application "Safari" to do JavaScript "document.title" in tab 1 of window 1

This runs arbitrary JavaScript in any Safari tab. It's been in macOS for over a decade. The catch: spawning osascript as a subprocess takes ~80ms per call. For a single command that's fine. For an AI agent issuing hundreds of commands to fill a form, it's painfully slow.
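To make the one-liner above concrete, here is a minimal sketch of assembling that AppleScript source from JavaScript before handing it to osascript. The helper names are hypothetical and not taken from Safari MCP. It assumes only that AppleScript string literals need backslashes and double quotes escaped. Safari also has to have "Allow JavaScript from Apple Events" enabled in its developer settings before it will accept the command.

```javascript
// Hypothetical helpers, not part of Safari MCP.
// AppleScript string literals treat backslash and double quote as special,
// so both are escaped (backslash first) before the JavaScript is embedded.
function escapeForAppleScript(text) {
  return text.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
}

// Builds the AppleScript source for one `do JavaScript` call.
// AppleScript tab and window indices are 1-based.
function buildDoJavaScript(js, tabIndex = 1, windowIndex = 1) {
  const valid = (n) => Number.isInteger(n) && n >= 1;
  if (!valid(tabIndex) || !valid(windowIndex)) {
    throw new RangeError('tab and window indices are 1-based integers');
  }
  return (
    `tell application "Safari" to do JavaScript "${escapeForAppleScript(js)}" ` +
    `in tab ${tabIndex} of window ${windowIndex}`
  );
}

console.log(buildDoJavaScript('document.title'));
console.log(buildDoJavaScript('document.querySelector("h1").textContent', 2, 1));
```

The returned string is what would be passed to `osascript -e`. Escaping matters as soon as the injected JavaScript contains its own string literals, which is nearly always.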

The solution: a persistent Swift daemon (safari-helper.swift -- 301 lines) that keeps an NSAppleScript instance alive in-process. The AI agent sends JSON lines over stdin, the daemon executes them and returns results.

Result: ~5ms per command instead of ~80ms. A 16x speedup from a 301-line Swift file.
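The JSON-lines exchange with the daemon can be sketched from the client side as follows. This is an illustration under assumptions, not the project's actual wire format: the message fields (`id`, `script`, `result`, `error`) and the `JsonLineChannel` class are invented here. The idea it shows is matching each reply to its request by id, and buffering output so a reply split across chunks still parses.

```javascript
// Sketch of a JSON-lines client. `writeLine` is whatever sends one line to the
// daemon's stdin; the message fields (id, script, result, error) are assumptions.
class JsonLineChannel {
  constructor(writeLine) {
    this.writeLine = writeLine;
    this.nextId = 1;
    this.pending = new Map(); // id -> { resolve, reject }
    this.buffer = '';
  }

  // Send one command and return a promise for its reply.
  send(script) {
    const id = this.nextId++;
    const promise = new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
    });
    this.writeLine(JSON.stringify({ id, script }) + '\n');
    return promise;
  }

  // Feed raw stdout chunks; a chunk may hold a partial line or several lines.
  receive(chunk) {
    this.buffer += chunk;
    let newline;
    while ((newline = this.buffer.indexOf('\n')) !== -1) {
      const line = this.buffer.slice(0, newline);
      this.buffer = this.buffer.slice(newline + 1);
      if (line.trim() === '') continue;
      const message = JSON.parse(line);
      const waiter = this.pending.get(message.id);
      if (!waiter) continue; // reply for an unknown or already-settled id
      this.pending.delete(message.id);
      if (message.error) waiter.reject(new Error(message.error));
      else waiter.resolve(message.result);
    }
  }
}
```

In a real setup, `writeLine` would wrap the `stdin.write` of a process started with Node's `child_process.spawn`, and `receive` would be called from that process's `stdout` `data` event. The saving comes from paying the process startup cost once instead of on every command.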

The entire codebase is ~6,000 lines across 4 files. Two production dependencies (@modelcontextprotocol/sdk for MCP, ws for the optional Extension WebSocket). That's it.

5. The Tab Ownership Problem (AI Agents Will Destroy Your Work)

This one cost me a full day of lost work before I solved it.

Here's the scenario: you're writing an email in Safari tab 6. Your AI agent is automating something in tab 5. The agent needs to navigate somewhere -- and it navigates in your email tab instead. Your half-written email is gone. The form state is destroyed. There's no undo.

This happens because tab indices shift. Close a tab, and every index after it changes. The agent cached "my tab is index 5", but after you close one of the earlier tabs its tab is index 4, and your email tab is now index 5.

My solution: tab tracking by URL. Every command resolves the target tab by its URL, not its index. If the URL doesn't match, the command fails safely instead of hitting the wrong tab. The agent maintains a list of tabs it opened (via safari_new_tab) and refuses to touch any tab it didn't create.

Before EVERY command:
1. Resolve tab by URL, not cached index
2. Verify this is a tab the agent opened
3. If mismatch -> fail safely, never navigate in user's tab

I added this as a hard rule in my MCP configuration: the AI agent must call safari_list_tabs at the start of every session, track which tabs it opens, and verify ownership before every interaction.
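The three checks above can be sketched as a small guard object. This is a minimal illustration, not Safari MCP's implementation: the `TabGuard` class is hypothetical, and the tab list shape (`[{ index, url }]`) is an assumption about what a call like safari_list_tabs might return.

```javascript
// Sketch of the ownership check. Tab lists are modelled as [{ index, url }];
// that shape is an assumption, not the tool's documented output.
class TabGuard {
  constructor() {
    this.ownedUrls = new Set();
  }

  // Call when the agent opens a tab itself (e.g. after safari_new_tab).
  recordOpened(url) {
    this.ownedUrls.add(url);
  }

  // Call after the agent navigates one of its own tabs.
  recordNavigation(oldUrl, newUrl) {
    if (!this.ownedUrls.delete(oldUrl)) {
      throw new Error(`Refusing to track navigation: ${oldUrl} is not agent-owned`);
    }
    this.ownedUrls.add(newUrl);
  }

  // Resolve the live index for a URL from a fresh tab list, never a cached index.
  resolve(url, currentTabs) {
    if (!this.ownedUrls.has(url)) {
      throw new Error(`Refusing to touch ${url}: not opened by the agent`);
    }
    const matches = currentTabs.filter((tab) => tab.url === url);
    if (matches.length !== 1) {
      throw new Error(`Expected exactly one tab at ${url}, found ${matches.length}`);
    }
    return matches[0].index;
  }
}

const guard = new TabGuard();
guard.recordOpened('https://example.com/form');
const tabs = [
  { index: 1, url: 'https://mail.example.com/compose' },
  { index: 2, url: 'https://example.com/form' },
];
console.log(guard.resolve('https://example.com/form', tabs)); // 2
```

Requiring exactly one match is a deliberate choice in this sketch: if the user happens to have the same URL open in their own tab, the command fails instead of guessing which tab belongs to the agent.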

It sounds paranoid. It is paranoid. And it's the only way to safely share a browser between a human and an AI agent.

6. Shadow DOM, React State, and CSP -- Why I Had to Build a Safari Extension

AppleScript's do JavaScript is powerful, but it runs in the page's JavaScript context. Three things break it:

Closed Shadow DOM. Reddit, many web components, and design systems use mode: 'closed' shadow roots. JavaScript running in the page context literally cannot see inside them -- element.shadowRoot returns null. The only way in is through a browser extension's content script, which has access to the internal shadow tree.

React's internal value tracker. If you set input.value = 'hello' on a React-controlled input, React ignores it. React tracks the "last known value" via an internal _valueTracker property on the DOM element. You have to reset this tracker before dispatching the input event, or React's synthetic event system thinks nothing changed. I learned this the hard way on LinkedIn, where every form is React-controlled.

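// The hack that makes React forms work:
const previous = input.value;
input.value = 'hello'; // React's patched setter also updates the tracker
const tracker = input._valueTracker;
if (tracker) tracker.setValue(previous); // make the tracker stale so React sees a change
input.dispatchEvent(new Event('input', { bubbles: true }));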

Content Security Policy. Strict CSP headers block dynamic code execution and inline scripts. Some sites (banking, enterprise tools) restrict even more aggressively. The Extension runs in the MAIN world with elevated privileges, bypassing CSP restrictions that would block AppleScript-injected JavaScript.

This led to the dual-engine architecture: the Safari Extension handles modern SPAs and CSP-strict sites (5-20ms per command via HTTP polling), while AppleScript handles everything else (~5ms via the Swift daemon). The system automatically falls back between them.
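One way to picture the fallback is a loop over engines in preference order. The sketch below is an assumption about the general shape, not the project's code: `runWithFallback` and the `{ name, run }` engine objects are hypothetical stand-ins for the Extension and AppleScript paths.

```javascript
// Sketch of engine fallback. Each engine is { name, run(command) } where run
// returns a promise; the engine objects are stand-ins, not the real ones.
async function runWithFallback(engines, command) {
  const failures = [];
  for (const engine of engines) {
    try {
      const result = await engine.run(command);
      return { engine: engine.name, result };
    } catch (error) {
      failures.push(`${engine.name}: ${error.message}`);
    }
  }
  throw new Error(`All engines failed (${failures.join('; ')})`);
}

// Stand-in engines: the extension is "disconnected", AppleScript works.
const extensionEngine = {
  name: 'extension',
  run: async () => { throw new Error('not connected'); },
};
const appleScriptEngine = {
  name: 'applescript',
  run: async (command) => `ran ${command}`,
};

runWithFallback([extensionEngine, appleScriptEngine], 'click #submit')
  .then((outcome) => console.log(outcome.engine, outcome.result));
```

A caveat with any retry-on-another-engine design: a command with side effects (a click, a form submit) may have partly run before the first engine reported failure, so blind retries are safest for read-only or idempotent commands.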

7. CGEvent Window Targeting -- Clicking Without Stealing Focus

The final boss: some sites (Airtable, complex React apps) don't respond to synthetic JavaScript clicks. They check event.isTrusted -- a read-only property that's true only for events the browser generates from real user input, never for events dispatched from JavaScript.
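The behaviour is easy to reproduce with the standard `EventTarget` and `Event` globals. The short demo below uses a made-up handler that mimics the kind of check these sites perform. It runs in a browser console or in a recent Node.js.

```javascript
// EventTarget and Event are standard globals in browsers and Node 15+.
const target = new EventTarget();
let observed = null;

// A handler like the ones described above: ignore anything not trusted.
target.addEventListener('click', (event) => {
  observed = event.isTrusted ? 'handled' : 'ignored';
});

// Dispatching from script always produces an untrusted event.
target.dispatchEvent(new Event('click', { bubbles: true }));
console.log(observed); // "ignored"
```

Since `isTrusted` is read-only and set by the browser, no amount of JavaScript injection can flip it. The click has to arrive as real input.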

The obvious solution -- simulate a real mouse click via macOS accessibility APIs -- has a nasty side effect: it moves your physical cursor and brings Safari to the foreground. If you're typing in VS Code while your agent works, suddenly your cursor jumps and Safari appears on top.

The fix lives in safari-helper.swift and uses a largely undocumented CGEvent feature:

// CGEventField 91 = kCGMouseEventWindowUnderMousePointer
// (not in Apple's public headers)
let kWindowField = CGEventField(rawValue: 91)!
event.setIntegerValueField(kWindowField, value: windowId)

By setting the window ID on the CGEvent, the click is delivered directly to Safari's window -- without moving the mouse cursor, without activating the window, without stealing focus. The event registers as isTrusted: true in the browser.

This field isn't in Apple's public CGEvent documentation. I found it by reading Chromium's source code (they use the same trick for their own window targeting) and then confirmed the raw field numbers work on macOS Sequoia.

The Result

Safari MCP is open source (MIT), installs via npm install -g safari-mcp or Homebrew, and works with Claude Code, Claude Desktop, Cursor, Windsurf, and any MCP-compatible client.

By the numbers:

80 tools (navigation, forms, screenshots, network mocking, cookies, accessibility, and more)
~5ms per command via the persistent Swift daemon
~6,000 lines of code across 4 files
2 production dependencies
~60% less CPU than Chrome-based alternatives
MIT license

I use it daily at Achiya Automation for everything from filling client forms to monitoring dashboards to running SEO audits -- all without Chrome ever touching my system.

What I'd Do Differently

If I started over, I'd skip the XPC/Web Inspector rabbit hole entirely and go straight to AppleScript + Extension. I lost three days on the private entitlement wall before accepting it wasn't going to work.

I'd also build tab tracking from day one. The "navigate in the wrong tab" disaster happened because I treated it as an edge case. It's not an edge case -- it's the default failure mode when an AI agent shares a browser with a human.

The Question I Keep Coming Back To

Every MCP server I've seen for browser automation is Chrome-first. Playwright MCP, Chrome DevTools MCP, Browserbase -- all Chromium.

But most Mac developers I know use Safari as their daily browser. And the AI agent use case is fundamentally different from testing: you're not running in CI, you're running on your machine, with your sessions, while you're actively working.

For those of you building AI agents that interact with browsers: what's the biggest pain point you've hit with focus stealing, session management, or CPU overhead -- and did you solve it, or just live with Chrome eating your battery?

I'm genuinely curious whether the "just use headless Chrome" consensus holds when the agent runs on your personal laptop for 8 hours a day.
