For weeks my CI badge was green and I believed it.
Safari MCP is an open-source tool that lets an AI coding agent drive a real, logged-in Safari — click, type, read the page, switch tabs. It registers 96 tools. The test suite ran on every push across three Node versions and came back 32 passed, 0 failed. Green is green. I shipped on it.
Then I went to extract a chunk of index.js into its own module, and while staring at the diff I asked a question I should have asked months earlier:
Which of these 32 tests would fail if I broke the security boundary?
The answer was none of them.
What the tests actually tested
I read the suite line by line. Two of the tests carried almost all the weight:
- server starts and lists all registered tools: boots the server, asserts the tool count.
- valid schemas + unique names: every tool has a schema, no duplicate names.
The rest were string-escaping and JS-injection helpers. All useful. All real. And all of them answered the same kind of question: does the thing exist and is it shaped correctly?
Not one of them answered: does calling it do the right thing?
It's the test-suite equivalent of checking into a hotel by confirming the building has 96 doors with correct room numbers — and never once trying a key in a lock.
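To make the distinction concrete, here is a minimal sketch rather than the project's actual suite. The two-tool registry, the handler signature, and the function names are all hypothetical. One tool is deliberately broken so that the shape check passes while the behavior check fails.

```javascript
// Hypothetical registry; names, schemas, and handlers are illustrative only.
const tools = [
  { name: "navigate", schema: { type: "object" }, handler: () => "ok" },
  // Deliberately broken: well-formed on paper, but the handler returns nothing.
  { name: "read_page", schema: { type: "object" }, handler: () => undefined },
];

// Shape check: does every tool exist, with a schema and a unique name?
function shapeCheckPasses(registry) {
  const names = new Set(registry.map((tool) => tool.name));
  const allHaveSchemas = registry.every(
    (tool) => typeof tool.schema === "object" && tool.schema !== null
  );
  return names.size === registry.length && allHaveSchemas;
}

// Behavior check: does calling each tool produce a usable result?
function behaviorCheckPasses(registry) {
  return registry.every((tool) => typeof tool.handler({}) === "string");
}

console.log(shapeCheckPasses(tools));    // true
console.log(behaviorCheckPasses(tools)); // false
```

A suite made only of the first kind of check stays green no matter what the handlers do.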
The boundary nobody was watching
The most security-critical code in Safari MCP is tab ownership. The rule is simple to say and easy to get subtly wrong: the agent may only touch tabs it opened. It must never navigate, click, or read a tab the human opened — that's someone's half-written email, their banking session, their unsaved work.
That logic lived in a tangle of module-local state: a map of owned tabs, a TTL so stale entries expire, a blank-URL sentinel for tabs mid-load, a matcher that decides whether https://app.example.com/org is "the same" tab as https://app.example.com/org-evil.
Read that last one again. /org vs /org-evil. If the matcher is even slightly too loose — a startsWith where it needed a path-boundary check — the agent could decide it "owns" a look-alike tab and start typing into it.
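A minimal sketch of that difference, assuming ownership is decided by comparing URLs and that sub-paths of an owned URL count as the same tab. The names looseOwns and strictOwns are hypothetical and are not the project's real matcher.

```javascript
// Loose matcher: a raw prefix test, the kind of bug described above.
function looseOwns(ownedUrl, candidateUrl) {
  return candidateUrl.startsWith(ownedUrl);
}

// Stricter matcher: same origin, and the owned path must end at a "/" boundary.
function strictOwns(ownedUrl, candidateUrl) {
  let owned;
  let candidate;
  try {
    owned = new URL(ownedUrl);
    candidate = new URL(candidateUrl);
  } catch {
    return false; // An unparseable URL is never treated as owned.
  }
  if (owned.origin !== candidate.origin) return false;

  // Drop trailing slashes so "/org/" and "/org" compare the same way.
  const base = owned.pathname.replace(/\/+$/, "");
  const path = candidate.pathname;
  return path === base || path.startsWith(base + "/");
}

const owned = "https://app.example.com/org";
console.log(looseOwns(owned, "https://app.example.com/org-evil"));     // true
console.log(strictOwns(owned, "https://app.example.com/org-evil"));    // false
console.log(strictOwns(owned, "https://app.example.com/org/settings")); // true
```

Comparing parsed origins also rejects look-alike hosts such as https://app.example.com.evil.net/, which a raw prefix test on https://app.example.com would accept.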
There was not a single test exercising that comparison. The suite was 100% green the whole time, and for that whole time the security boundary had zero behavioral coverage. A regression there wouldn't have turned CI red. It would have turned CI green and wrong, the worst color a test suite can be.
Why "green and wrong" is worse than red
A red build is honest. It stops you. The failure is the feature.
A green build that proves nothing gives you the feeling of safety without the substance — and you make decisions on that feeling. You refactor confidently. You merge contributor PRs confidently. You tell users the boundary holds. Every one of those is a small bet placed on a test that was never actually watching the thing you care about.
This is the same failure mode I keep running into in this project, wearing a different costume each time:
- A macOS API that accepted my click and silently delivered it nowhere — "success" that did nothing.
- A README that claimed 80 tools while the code had 96 — a fact nobody had pinned down.
- And now a test suite that reported confidence it hadn't earned.
The pattern is always the same: the system doesn't fail loudly. It quietly does less, and the signal you're trusting keeps saying "fine."
The fix: make the boundary fail loudly
Before extracting anything, I wrote the tests that should have existed from day one — behavioral tests that call the ownership logic and assert on its decisions:
- /org does not own /org-evil (the path-boundary case).
- An entry past its TTL is no longer owned.
- The blank-URL sentinel is treated as in-flight, not as a match.
- Ownership survives the kind of state round-trip the refactor was about to perform.
Nine of them. The suite went from 32 to 41 (it's 46 today, after a later round for macOS compatibility). More importantly: now if I loosen that matcher by one character, a test goes red and names the boundary in the failure message. The security rule finally has a tripwire.
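The TTL, sentinel, and round-trip cases from the list above can be sketched like this. Everything here is an assumption made for illustration: the createOwnershipState name, the about:blank sentinel, the 60-second TTL, the exact-URL comparison, and the JSON snapshot format are not taken from the real module. The clock is injected so that a test can expire an entry without waiting.

```javascript
// Hypothetical names and values throughout; the real module may differ.
const BLANK_URL = "about:blank"; // assumed sentinel for a tab that is still loading
const TTL_MS = 60_000;           // assumed lifetime of an ownership entry

function createOwnershipState(now = Date.now) {
  const owned = new Map(); // tabId -> { url, touchedAt }

  return {
    claim(tabId, url = BLANK_URL) {
      owned.set(tabId, { url, touchedAt: now() });
    },
    // Returns "owned", "in-flight", or "not-owned", so a tab that is still
    // loading can never be mistaken for a confirmed match.
    check(tabId, currentUrl) {
      const entry = owned.get(tabId);
      if (!entry) return "not-owned";
      if (now() - entry.touchedAt > TTL_MS) {
        owned.delete(tabId); // stale entries expire
        return "not-owned";
      }
      if (entry.url === BLANK_URL) return "in-flight";
      return entry.url === currentUrl ? "owned" : "not-owned";
    },
    snapshot() {
      return JSON.stringify([...owned]);
    },
    restore(json) {
      owned.clear();
      for (const [tabId, entry] of JSON.parse(json)) owned.set(tabId, entry);
    },
  };
}

// Usage with a fake clock that the caller controls.
let fakeNow = 0;
const state = createOwnershipState(() => fakeNow);
state.claim(1, "https://app.example.com/org");
state.claim(2); // no URL yet: the tab is mid-load

console.log(state.check(1, "https://app.example.com/org")); // "owned"
console.log(state.check(2, "https://example.com/"));        // "in-flight"

// Round-trip: move the state into a fresh instance and check again.
const moved = createOwnershipState(() => fakeNow);
moved.restore(state.snapshot());
console.log(moved.check(1, "https://app.example.com/org")); // "owned"

fakeNow = TTL_MS + 1; // advance past the TTL
console.log(moved.check(1, "https://app.example.com/org")); // "not-owned"
```

Returning a distinct "in-flight" value instead of a boolean forces every caller to decide what to do with a half-loaded tab, rather than letting it pass as a match by accident.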
Only then did I do the refactor, extracting the state layer into its own module. The new tests passed unchanged across the move, which is exactly the confidence I'd been pretending to have.
The rule I'd give past-me
Counting your tools is not testing your tools. Schema validation is not behavior. A green suite tells you what it checks — and stays silent about everything it doesn't, in the most reassuring tone possible.
So the question to ask of any test suite, especially one you've been trusting:
What is the single worst thing that could break in this codebase — and would a test go red if it did?
If the honest answer is "no," your CI badge isn't lying. You just never asked it the right question.
Safari MCP is open source — the ownership tests are in test/ownership-state.test.mjs if you want to see what "test the boundary" looks like in practice. More on what I'm building at achiya-automation.com.
What's the security boundary in your project that your test suite has never once exercised?