DEV Community

אחיה כהן

My MCP server had 32 green tests. Not one of them had ever called a tool.

For weeks my CI badge was green and I believed it.

Safari MCP is an open-source tool that lets an AI coding agent drive a real, logged-in Safari — click, type, read the page, switch tabs. It registers 96 tools. The test suite ran on every push across three Node versions and came back 32 passed, 0 failed. Green is green. I shipped on it.

Then I went to extract a chunk of index.js into its own module, and while staring at the diff I asked a question I should have asked months earlier:

Which of these 32 tests would fail if I broke the security boundary?

The answer was none of them.

What the tests actually tested

I read the suite line by line. Two of the tests carried almost all the weight:

  • server starts and lists all registered tools — boots the server, asserts the tool count.
  • valid schemas + unique names — every tool has a schema, no duplicate names.

The rest were string-escaping and JS-injection helpers. All useful. All real. And all of them answered the same kind of question: does the thing exist and is it shaped correctly?

Not one of them answered: does calling it do the right thing?

It's the test-suite equivalent of checking into a hotel by confirming the building has 96 doors with correct room numbers — and never once trying a key in a lock.
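
A minimal sketch of that gap, assuming a hypothetical one-tool registry. The tool name `echo_upper`, its schema, and both check functions are invented for illustration and are not taken from Safari MCP:

```javascript
// Hypothetical stand-in for a tool registry; names and shapes are illustrative only.
const tools = [
  {
    name: "echo_upper",
    inputSchema: { type: "object", properties: { text: { type: "string" } } },
    // Deliberately broken handler: it should upper-case the text but returns it unchanged.
    handler: ({ text }) => text,
  },
];

// Existence check: passes as long as the tool is registered and shaped correctly.
function existenceCheck(registry) {
  const names = registry.map((t) => t.name);
  return (
    registry.length === 1 &&
    new Set(names).size === names.length &&
    registry.every((t) => t.inputSchema && t.inputSchema.type === "object")
  );
}

// Behavioral check: actually calls the tool and inspects what comes back.
function behaviorCheck(registry) {
  const tool = registry.find((t) => t.name === "echo_upper");
  return tool.handler({ text: "hi" }) === "HI";
}

console.log("existence check passes:", existenceCheck(tools)); // true
console.log("behavior check passes:", behaviorCheck(tools)); // false
```

The handler is broken on purpose. The count-and-schema check still passes, and only the check that calls the tool notices.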

The boundary nobody was watching

The most security-critical code in Safari MCP is tab ownership. The rule is simple to say and easy to get subtly wrong: the agent may only touch tabs it opened. It must never navigate, click, or read a tab the human opened — that's someone's half-written email, their banking session, their unsaved work.

That logic lived in a tangle of module-local state: a map of owned tabs, a TTL so stale entries expire, a blank-URL sentinel for tabs mid-load, a matcher that decides whether https://app.example.com/org is "the same" tab as https://app.example.com/org-evil.

Read that last one again. /org vs /org-evil. If the matcher is even slightly too loose — a startsWith where it needed a path-boundary check — the agent could decide it "owns" a look-alike tab and start typing into it.
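
To make the difference concrete, here is a sketch of the two kinds of matcher. Both functions are hypothetical: the article does not show the real Safari MCP matcher, and whether ownership should extend to sub-paths at all is a policy choice I am assuming here for illustration.

```javascript
// Too loose: a plain prefix comparison on the raw URL string.
function looseSameTab(ownedUrl, candidateUrl) {
  return candidateUrl.startsWith(ownedUrl);
}

// Stricter: same origin, and the owned path must end at a path boundary.
function strictSameTab(ownedUrl, candidateUrl) {
  let owned;
  let candidate;
  try {
    owned = new URL(ownedUrl);
    candidate = new URL(candidateUrl);
  } catch {
    return false; // unparseable URLs never match
  }
  if (owned.origin !== candidate.origin) return false;
  const base = owned.pathname.replace(/\/+$/, ""); // drop trailing slashes
  const path = candidate.pathname;
  // Match the exact path, or a child path that continues after a "/".
  return path === base || path.startsWith(base + "/");
}

const owned = "https://app.example.com/org";
const lookAlike = "https://app.example.com/org-evil";

console.log(looseSameTab(owned, lookAlike)); // true: the look-alike is treated as owned
console.log(strictSameTab(owned, lookAlike)); // false
console.log(strictSameTab(owned, "https://app.example.com/org/settings")); // true
```

The only difference that matters is the `"/"` appended before the prefix comparison. Delete that one character and the look-alike tab matches again.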

There was not a single test exercising that comparison. The suite stayed 100% green for the entire time the security boundary had zero behavioral coverage. A regression there wouldn't have turned CI red. It would have left CI green and wrong, the worst color a test suite can be.

Why "green and wrong" is worse than red

A red build is honest. It stops you. The failure is the feature.

A green build that proves nothing gives you the feeling of safety without the substance — and you make decisions on that feeling. You refactor confidently. You merge contributor PRs confidently. You tell users the boundary holds. Every one of those is a small bet placed on a test that was never actually watching the thing you care about.

This is the same failure mode I keep running into in this project, wearing a different costume each time.

The pattern is always the same: the system doesn't fail loudly. It quietly does less, and the signal you're trusting keeps saying "fine."

The fix: make the boundary fail loudly

Before extracting anything, I wrote the tests that should have existed from day one — behavioral tests that call the ownership logic and assert on its decisions:

  • /org does not own /org-evil (the path-boundary case).
  • An entry past its TTL is no longer owned.
  • The blank-URL sentinel is treated as in-flight, not as a match.
  • Ownership survives the kind of state round-trip the refactor was about to perform.

Nine of them. The suite went from 32 to 41 (it's 46 today, after a later round for macOS compatibility). More importantly: now if I loosen that matcher by one character, a test goes red and names the boundary in the failure message. The security rule finally has a tripwire.
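
The TTL, sentinel, and round-trip cases can be sketched against a small stand-in for the state layer. Everything here is an assumption made for illustration: the `OwnedTabs` class, the `about:blank` sentinel value, the status strings, and the `check` helper are not the code in test/ownership-state.test.mjs, and a real suite would use its test runner's assertions instead.

```javascript
const BLANK_URL = "about:blank"; // assumed sentinel for a tab that is still loading

class OwnedTabs {
  constructor({ ttlMs, now = Date.now }) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, so tests never have to sleep
    this.tabs = new Map(); // tabId -> { url, touchedAt }
  }

  claim(tabId, url = BLANK_URL) {
    this.tabs.set(tabId, { url, touchedAt: this.now() });
  }

  // Returns "owned", "in-flight", or "not-owned".
  status(tabId) {
    const entry = this.tabs.get(tabId);
    if (!entry) return "not-owned";
    if (this.now() - entry.touchedAt > this.ttlMs) {
      this.tabs.delete(tabId); // expired entries are dropped
      return "not-owned";
    }
    return entry.url === BLANK_URL ? "in-flight" : "owned";
  }

  // Used by JSON.stringify; the clock function is intentionally left out.
  toJSON() {
    return { ttlMs: this.ttlMs, tabs: [...this.tabs] };
  }

  static fromJSON(data, now = Date.now) {
    const state = new OwnedTabs({ ttlMs: data.ttlMs, now });
    state.tabs = new Map(data.tabs);
    return state;
  }
}

// Tiny assertion helper that names the boundary in its failure message.
function check(condition, message) {
  if (!condition) throw new Error("ownership boundary: " + message);
}

let clock = 0;
const state = new OwnedTabs({ ttlMs: 1000, now: () => clock });
state.claim("tab-1", "https://app.example.com/org");
state.claim("tab-2"); // claimed while the URL is still blank

check(state.status("tab-1") === "owned", "a claimed tab should be owned");
check(state.status("tab-2") === "in-flight", "a blank URL is in-flight, not a match");
check(state.status("tab-human") === "not-owned", "an unclaimed tab is never owned");

// Round-trip the state the way a refactor might move it between modules.
const restored = OwnedTabs.fromJSON(JSON.parse(JSON.stringify(state)), () => clock);
check(restored.status("tab-1") === "owned", "ownership should survive a round-trip");

clock = 1001; // one millisecond past the TTL
check(restored.status("tab-1") === "not-owned", "an entry past its TTL is not owned");
```

Injecting the clock is the design choice that keeps the TTL test fast and deterministic: the test moves time forward by assignment instead of waiting for it.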

Only then did I do the refactor, extracting the state layer into its own module. The new tests stayed green across the move, which is exactly the confidence I'd been pretending to have.

The rule I'd give past-me

Counting your tools is not testing your tools. Schema validation is not behavior. A green suite tells you what it checks — and stays silent about everything it doesn't, in the most reassuring tone possible.

So the question to ask of any test suite, especially one you've been trusting:

What is the single worst thing that could break in this codebase — and would a test go red if it did?

If the honest answer is "no," your CI badge isn't lying. You just never asked it the right question.


Safari MCP is open source — the ownership tests are in test/ownership-state.test.mjs if you want to see what "test the boundary" looks like in practice. More on what I'm building at achiya-automation.com.

What's the security boundary in your project that your test suite has never once exercised?
