אחיה כהן

Posted on Jun 29

My MCP server had 32 green tests. Not one of them had ever called a tool.

#testing #opensource #mcp #devtools

For weeks my CI badge was green and I believed it.

Safari MCP is an open-source tool that lets an AI coding agent drive a real, logged-in Safari — click, type, read the page, switch tabs. It registers 96 tools. The test suite ran on every push across three Node versions and came back 32 passed, 0 failed. Green is green. I shipped on it.

Then I went to extract a chunk of index.js into its own module, and while staring at the diff I asked a question I should have asked months earlier:

Which of these 32 tests would fail if I broke the security boundary?

The answer was none of them.

What the tests actually tested

I read the suite line by line. Two of the tests carried almost all the weight:

server starts and lists all registered tools — boots the server, asserts the tool count.
valid schemas + unique names — every tool has a schema, no duplicate names.

The rest covered the string-escaping and JS-injection helpers. All useful. All real. And all of them answered the same kind of question: does the thing exist, and is it shaped correctly?

Not one of them answered: does calling it do the right thing?

It's the test-suite equivalent of checking into a hotel by confirming the building has 96 doors with correct room numbers — and never once trying a key in a lock.
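To make the gap concrete, here is a minimal sketch of a shape check and a behavioral check run against the same registry. It is an illustration, not Safari MCP's code: the two tools, their handlers, and the helper names `shapeChecksPass` and `behaviorCheckPasses` are all hypothetical, and the bug in `echo_upper` is planted on purpose.

```javascript
// Hypothetical two-tool registry; names and handlers are illustrative only.
const tools = [
  {
    name: "echo_upper",
    inputSchema: { type: "object", properties: { text: { type: "string" } } },
    // Planted bug: returns the input unchanged instead of upper-casing it.
    handler: ({ text }) => text,
  },
  {
    name: "add",
    inputSchema: {
      type: "object",
      properties: { a: { type: "number" }, b: { type: "number" } },
    },
    handler: ({ a, b }) => a + b,
  },
];

// Shape checks: tool count, unique names, every tool has an object schema.
function shapeChecksPass(registry) {
  const names = registry.map((t) => t.name);
  const uniqueNames = new Set(names).size === names.length;
  const allHaveSchemas = registry.every(
    (t) => t.inputSchema && t.inputSchema.type === "object"
  );
  return registry.length === 2 && uniqueNames && allHaveSchemas;
}

// Behavioral check: actually call the tool and assert on what it returns.
function behaviorCheckPasses(registry) {
  const tool = registry.find((t) => t.name === "echo_upper");
  return tool.handler({ text: "hi" }) === "HI";
}

console.log(shapeChecksPass(tools));     // prints: true  (the suite is green)
console.log(behaviorCheckPasses(tools)); // prints: false (the tool is broken)
```

Every shape check passes on a tool that does the wrong thing. Only the check that calls the handler notices.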

The boundary nobody was watching

The most security-critical code in Safari MCP is tab ownership. The rule is simple to say and easy to get subtly wrong: the agent may only touch tabs it opened. It must never navigate, click, or read a tab the human opened — that's someone's half-written email, their banking session, their unsaved work.

That logic lived in a tangle of module-local state: a map of owned tabs, a TTL so stale entries expire, a blank-URL sentinel for tabs mid-load, a matcher that decides whether https://app.example.com/org is "the same" tab as https://app.example.com/org-evil.

Read that last one again. /org vs /org-evil. If the matcher is even slightly too loose — a startsWith where it needed a path-boundary check — the agent could decide it "owns" a look-alike tab and start typing into it.

There was not a single test exercising that comparison. The suite stayed 100% green while the security boundary had zero behavioral coverage. A regression there wouldn't have turned CI red. It would have left CI green and wrong, the worst color a test suite can be.
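The /org versus /org-evil comparison can be sketched as follows. This is not the project's actual matcher: `looseMatch` and `boundaryMatch` are hypothetical names, and the sketch assumes that a sub-path such as /org/settings should count as the same tab, which the real rule may or may not do. It relies only on the standard `URL` class.

```javascript
// Too loose: a bare prefix test treats /org-evil as a continuation of /org.
function looseMatch(ownedUrl, candidateUrl) {
  return candidateUrl.startsWith(ownedUrl);
}

// Stricter (assumed rule): same origin, and the candidate path must equal
// the owned path or continue it at a "/" boundary.
function boundaryMatch(ownedUrl, candidateUrl) {
  let ownedParsed;
  let candidateParsed;
  try {
    ownedParsed = new URL(ownedUrl);
    candidateParsed = new URL(candidateUrl);
  } catch {
    return false; // unparseable URLs never match
  }
  if (ownedParsed.origin !== candidateParsed.origin) return false;

  const ownedPath = ownedParsed.pathname;
  const base = ownedPath.endsWith("/") ? ownedPath : ownedPath + "/";
  return (
    candidateParsed.pathname === ownedPath ||
    candidateParsed.pathname.startsWith(base)
  );
}

const owned = "https://app.example.com/org";
const lookalike = "https://app.example.com/org-evil";

console.log(looseMatch(owned, lookalike));             // prints: true  (wrong)
console.log(boundaryMatch(owned, lookalike));          // prints: false
console.log(boundaryMatch(owned, owned + "/settings")); // prints: true
```

Comparing parsed origins also rejects a look-alike host such as app.example.com.evil.io, which a raw string prefix test on the origin alone would accept.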

Why "green and wrong" is worse than red

A red build is honest. It stops you. The failure is the feature.

A green build that proves nothing gives you the feeling of safety without the substance — and you make decisions on that feeling. You refactor confidently. You merge contributor PRs confidently. You tell users the boundary holds. Every one of those is a small bet placed on a test that was never actually watching the thing you care about.

This is the same failure mode I keep running into in this project, wearing a different costume each time:

A macOS API that accepted my click and silently delivered it nowhere — "success" that did nothing.
A README that claimed 80 tools while the code had 96 — a fact nobody had pinned down.
And now a test suite that reported confidence it hadn't earned.

The pattern is always the same: the system doesn't fail loudly. It quietly does less, and the signal you're trusting keeps saying "fine."

The fix: make the boundary fail loudly

Before extracting anything, I wrote the tests that should have existed from day one — behavioral tests that call the ownership logic and assert on its decisions:

/org does not own /org-evil (the path-boundary case).
An entry past its TTL is no longer owned.
The blank-URL sentinel is treated as in-flight, not as a match.
Ownership survives the kind of state round-trip the refactor was about to perform.

Nine of them. The suite went from 32 to 41 (it's 46 today, after a later round for macOS compatibility). More importantly: now if I loosen that matcher by one character, a test goes red and names the boundary in the failure message. The security rule finally has a tripwire.
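The TTL, sentinel, and round-trip cases from the list above can be sketched against a toy ownership store. Everything here is an assumption made for illustration: the class `OwnedTabs`, the `about:blank` sentinel value, the 60-second TTL, the three status strings, and the JSON snapshot format are not taken from the real module. The one design choice worth copying is the injectable clock, which lets a test expire an entry without sleeping.

```javascript
// All names and values below are hypothetical, not Safari MCP's real ones.
const BLANK_URL = "about:blank"; // assumed sentinel for a tab still loading
const TTL_MS = 60000;            // assumed expiry window

class OwnedTabs {
  constructor(now = () => Date.now()) {
    this.now = now;           // injectable clock so tests control time
    this.entries = new Map(); // tabId -> { url, openedAt }
  }

  open(tabId, url = BLANK_URL) {
    this.entries.set(tabId, { url, openedAt: this.now() });
  }

  // Returns "owned", "in-flight", or "not-owned".
  status(tabId, currentUrl) {
    const entry = this.entries.get(tabId);
    if (!entry) return "not-owned";
    if (this.now() - entry.openedAt > TTL_MS) {
      this.entries.delete(tabId); // stale entries expire
      return "not-owned";
    }
    if (entry.url === BLANK_URL) return "in-flight"; // never a URL match
    // Exact equality keeps this sketch short; a path-boundary matcher
    // would go here in a fuller version.
    return entry.url === currentUrl ? "owned" : "not-owned";
  }

  snapshot() {
    return JSON.stringify([...this.entries]);
  }

  static restore(json, now) {
    const store = new OwnedTabs(now);
    store.entries = new Map(JSON.parse(json));
    return store;
  }
}

let clock = 0;
const tabs = new OwnedTabs(() => clock);
tabs.open(1, "https://app.example.com/org");
tabs.open(2); // still loading, stored with the sentinel

console.log(tabs.status(1, "https://app.example.com/org")); // prints: owned
console.log(tabs.status(2, "about:blank"));                 // prints: in-flight

// Round-trip the state, as a refactor that moves the store might.
const restored = OwnedTabs.restore(tabs.snapshot(), () => clock);
console.log(restored.status(1, "https://app.example.com/org")); // prints: owned

clock = TTL_MS + 1; // advance past the TTL
console.log(restored.status(1, "https://app.example.com/org")); // prints: not-owned
```

Each printed line corresponds to one decision a behavioral test would assert on, so loosening any branch in `status` changes an observable result.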

Only then did I do the refactor, extracting the state layer into its own module, and the new tests stayed green across the move, which is exactly the confidence I'd been pretending to have.

The rule I'd give past-me

Counting your tools is not testing your tools. Schema validation is not behavior. A green suite tells you what it checks — and stays silent about everything it doesn't, in the most reassuring tone possible.

So the question to ask of any test suite, especially one you've been trusting:

What is the single worst thing that could break in this codebase — and would a test go red if it did?

If the honest answer is "no," your CI badge isn't lying. You just never asked it the right question.

Safari MCP is open source — the ownership tests are in test/ownership-state.test.mjs if you want to see what "test the boundary" looks like in practice. More on what I'm building at achiya-automation.com.

What's the security boundary in your project that your test suite has never once exercised?

Top comments (2)

Adam Lewis • Jul 1

Worth asking why the suite drifted to 32 shape-checks before you caught it. When an agent writes the tests, that's the default it produces: it can see the code's structure but not the rule that it may only touch tabs it opened. That rule was in your head, so nothing generated a test for it. The ownership tests you added fix it, but the durable version is writing the boundary down where the agent reads it, as a check it has to satisfy, so the behavioural test isn't the one thing you have to remember to write by hand.

אחיה כהן • Jul 2

You've named the mechanism better than I did. The agent generated tests for everything that was legible from the code — schemas, registration, names — and the ownership rule wasn't legible anywhere. It lived as a convention smeared across a handful of handlers, so nothing (human or agent) had an anchor to generate a test from.

What stuck after the incident: the boundary got extracted into its own module with behavioral tests asserting the rule itself (one client can't touch another's tabs, TTL expiry, the blank-origin sentinel), and the repo's agent instructions now state the invariant explicitly, so anything generating code or tests reads it before touching that layer.

One nuance I'd add: writing it down where the agent reads it only works if the written form can fail. A prose rule gets paraphrased away; a named invariant with a test that goes red survives. Same lesson as the tool-count drift from the earlier post — the durable facts are the ones something re-derives.