Three weeks ago I published safari-mcp — a macOS-native Safari automation server that speaks the Model Context Protocol. 84 tools, AppleScript + optional extension for speed, keeps Safari logins, zero Chrome overhead. Today it's in the VS Code and Cursor marketplaces.
Then I saw HKUDS/CLI-Anything — a 29k-star project that auto-wraps open-source software as agent-ready CLIs. Their pitch: "Make ALL software agent-native." Their main example is DOMShell wrapped as cli-anything-browser — a shell-pipeable interface for Chrome automation.
I wanted to know: is wrapping safari-mcp as a CLI actually worth it? Or is it pure theater — re-exposing a working MCP server as a strictly worse interface?
So I built the harness (PR #212) and benchmarked it live against the direct MCP path. Real Safari, real macOS, measured on 2026-04-10.
Here's what I found.
TL;DR
| | MCP (direct stdio) | CLI (subprocess per call) | Winner |
|---|---|---|---|
| Per-call latency | 119ms | 3,023ms | MCP, 25× |
| 5-op workflow | 2.7s | 15.2s | MCP, 5.6× |
| Tokens per API call (tool defs) | 7,986 | 95 | CLI, 84× |
| Output accuracy | identical | identical | tie |
- If your agent speaks MCP (Claude Code, Cursor, Cline, Windsurf, Continue, OpenClaw, any MCP-aware client) — use the MCP directly. The CLI is strictly slower.
- If you need to drive it from bash, CI, cron, or an agent that doesn't speak MCP — use the CLI. The token savings compound; at Claude Opus pricing, a 100-turn session saves ~$12 in tool-definition overhead alone.
That's the whole story. If you only wanted the numbers, you can stop here. If you want the methodology, the edge cases, and the bugs I hit along the way, read on.
What I actually built
The harness (`safari/agent-harness/`) is a schema-driven CLI generator:

- Offline Zod parser (`scripts/extract_tools.py`) reads safari-mcp's source and emits `resources/tools.json` — the full schema for all 84 tools. Depth-aware, handles nested `z.array(z.object({...})).describe("outer")` correctly.
- Runtime Click generator (`safari_cli.py`) loads the registry at import time and builds one Click subcommand per MCP tool. Argument names, types, enum choices, required flags, and descriptions are all pulled from the schema. Zero manual mapping.
- Parity test suite (`test_parity.py`) iterates the registry and verifies every tool is reachable, every param is wired correctly, and every enum matches. If the registry and the CLI ever drift, the tests scream.
The CLI surface ends up looking like this:
$ cli-anything-safari tools count
84
$ cli-anything-safari tools describe safari_click
Name: safari_click
CLI command: tool click
Description: Click element. Use ref (from snapshot), selector, text, or x/y...
Parameters:
--ref (string, optional)
--selector (string, optional)
--text (string, optional)
--x (number, optional)
--y (number, optional)
$ cli-anything-safari --json tool snapshot
"ref=0_0 body\nref=0_1 div\nref=0_2 navigation \"Sidebar\"\n..."
Same interface as the upstream MCP, just behind `click.command(...)` calls.
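Conceptually the generator is a short loop over the registry. Here's a minimal sketch of the pattern; the registry entry is hypothetical and the real `safari_cli.py` also wires up types, enum choices, and defaults:

```python
import json
import click


@click.group()
def cli():
    """Sketch of a schema-driven CLI: one Click subcommand per registry entry."""


def make_command(tool):
    # Build Click options straight from the registry entry, so argument
    # names and help text never need manual mapping.
    params = [
        click.Option([f"--{p['name']}"],
                     required=p.get("required", False),
                     help=p.get("description", ""))
        for p in tool["params"]
    ]

    def callback(**kwargs):
        # The real harness forwards these over stdio to safari-mcp;
        # here we just echo what would be sent.
        args = {k: v for k, v in kwargs.items() if v is not None}
        click.echo(json.dumps({"tool": tool["name"], "args": args}))

    return click.Command(tool["name"], params=params, callback=callback,
                         help=tool.get("description", ""))


# Hypothetical registry entry, shaped like a simplified tools.json record.
REGISTRY = [{
    "name": "safari_click",
    "description": "Click element.",
    "params": [{"name": "selector", "required": False,
                "description": "CSS selector"}],
}]

for tool in REGISTRY:
    cli.add_command(make_command(tool))
```

Running `cli safari_click --selector "#btn"` would echo the tool name and arguments; swap the echo for an MCP call and you have the real thing.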
The benchmark setup
Both paths hit the same safari-mcp server in the end. The difference is the connection model:
MCP direct: Python → stdio (persistent) → safari-mcp → Safari
CLI: Python → subprocess → npx → Node → safari-mcp → Safari
For MCP I used mcp.ClientSession with a persistent stdio connection, measuring only the call_tool() round-trip (initialization amortized). For CLI I measured subprocess.run([...]) wall time. Both had one warmup call that I discarded.
The benchmark script is at /tmp/benchmark_cli_vs_mcp.py (not committed because it's scratch); the key loop is:
import subprocess
import time

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# params: StdioServerParameters for launching safari-mcp (set up earlier)
mcp_times, cli_times = [], []

# MCP: persistent session, N calls
async with stdio_client(params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()
        await session.call_tool(tool_name, args)  # warmup, discarded
        for _ in range(n):
            t0 = time.perf_counter()
            await session.call_tool(tool_name, args)
            mcp_times.append((time.perf_counter() - t0) * 1000)

# CLI: spawn per call
subprocess.run(CLI + ["tool", tool_short_name] + args_list,
               capture_output=True, text=True)  # warmup, discarded
for _ in range(n):
    t0 = time.perf_counter()
    subprocess.run(CLI + ["tool", tool_short_name] + args_list,
                   capture_output=True, text=True)
    cli_times.append((time.perf_counter() - t0) * 1000)
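Reducing the raw timings to summary rows used a helper along these lines (the sample values here are illustrative, not the actual ten-call runs):

```python
import statistics


def summarize(samples_ms):
    """Collapse raw per-call timings (in ms) into min/median/mean/max rows."""
    return {
        "min": min(samples_ms),
        "median": statistics.median(samples_ms),
        "mean": round(statistics.fmean(samples_ms), 1),
        "max": max(samples_ms),
    }


# Illustrative samples only; the table in the post comes from the real runs.
mcp = summarize([113.3, 118.9, 119.5, 123.7])
cli = summarize([2970.2, 3020.0, 3026.1, 3097.2])
ratios = {k: round(cli[k] / mcp[k], 1) for k in mcp}
```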
Latency — MCP wins by 25×
Ten calls of safari_list_tabs (warm cache, same Safari state):
| | MCP (ms) | CLI (ms) | ratio |
|---|---|---|---|
| min | 113.3 | 2970.2 | 26.2× |
| median | 119.5 | 3026.1 | 25.3× |
| mean | 119.3 | 3022.7 | 25.3× |
| max | 123.7 | 3097.2 | 25.0× |
CLI calls land at ~3 seconds every single time, with almost no variance. That consistency is the giveaway: the bottleneck is not safari_list_tabs itself — it's the ~2.9 seconds that go into npx resolution, Node.js startup, safari-mcp initialization, and MCP handshake for every fresh subprocess.
MCP amortizes all of that across a single persistent session. Once the session is up, each additional tool call is just the ~100ms AppleScript operation.
For interactive reactive workflows — agents that take each result and decide the next step — MCP is the obvious choice. Every round-trip matters.
Workflow — MCP still wins on reactive sequences
I ran a 5-op workflow (snapshot → read_page → list_tabs → snapshot → read_page) three ways:
| Path | Time |
|---|---|
| MCP (persistent, 5 ops) | 2,714 ms |
| CLI (5 sequential spawns) | 15,285 ms |
| CLI (1 shell pipeline, 5 ops) | 15,153 ms |
Shell pipelining — cli-anything-safari tool X && cli-anything-safari tool Y — does not help. Every && still spawns a fresh npx subprocess. The overhead per step is unchanged.
The only way to amortize the cost is to drive the Python API directly (`from cli_anything.safari.utils.safari_backend import call`). If you do that, you're back to roughly MCP-class numbers, because you're just using the MCP Python SDK under a different name.
The CLI's per-call cost is structural. You cannot pipeline your way out of it.
Tokens — CLI wins by 84×
This is where the picture inverts. When an LLM uses MCP tools, every API call includes the full tool definitions in the request. For safari-mcp that's 84 tools × ~95 tokens each = ~7,986 tokens on every turn.
I measured this with the real tools.json and the cl100k_base tokenizer (tiktoken):
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
with open("resources/tools.json") as f:
    tools = json.load(f)  # the extracted registry of 84 tool schemas
mcp_response = {"tools": [
    {"name": t["name"], "description": t["description"], "inputSchema": t["inputSchema"]}
    for t in tools
]}
tokens = len(enc.encode(json.dumps(mcp_response)))
# 7,986 tokens for 84 tools
The CLI path sends ~95 tokens — just the bash tool definition. The agent learns the CLI surface by running cli-anything-safari tools list --json once (5,236 tokens, one-time) and the info sits in the conversation context.
At Claude Opus pricing ($15/MTok input, no caching):
| Session length | MCP overhead | CLI overhead | Savings |
|---|---|---|---|
| 10 turns | $1.20 | $0.09 | $1.11 |
| 100 turns | $11.98 | $0.22 | $11.76 |
| 1,000 turns | $119.79 | $1.60 | $118.19 |
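Those rows follow from a simple per-turn model, sketched below. The assumptions are mine: the CLI pays the one-time 5,236-token `tools list` dump plus ~95 tokens per turn. Under that model the 1,000-turn CLI figure lands a few cents under the table's $1.60, so the published number presumably includes a little extra per-turn framing.

```python
# Rough model of the cost table: tool-definition overhead only, Opus input pricing.
PRICE_PER_TOKEN = 15 / 1_000_000   # $15 per million input tokens, no caching
MCP_TOOL_DEFS = 7_986              # tool definitions resent on every turn
CLI_BASH_DEF = 95                  # per-turn bash tool definition
CLI_ONE_TIME = 5_236               # one-time `tools list --json` dump in context


def mcp_overhead(turns):
    return MCP_TOOL_DEFS * turns * PRICE_PER_TOKEN


def cli_overhead(turns):
    return (CLI_BASH_DEF * turns + CLI_ONE_TIME) * PRICE_PER_TOKEN


for turns in (10, 100, 1_000):
    print(f"{turns:>5} turns: MCP ${mcp_overhead(turns):.2f}  CLI ${cli_overhead(turns):.2f}")
```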
Prompt caching narrows this considerably. Anthropic prices cache writes at 1.25× the base input rate ($18.75/MTok for Opus) and cache reads at 0.1× ($1.50/MTok), so every turn after the first gets roughly a 10× discount on the tool definitions. With caching the MCP cost drops from ~$12 to ~$1.50 per 100-turn session. Still more expensive than the CLI, but the gap is much smaller.
The takeaway: for short, reactive sessions (where you care about UX and per-call latency), MCP wins hands-down. For long, scripted sessions at scale (where tool-definition overhead becomes a real line item), the CLI's token efficiency is genuine and measurable.
Accuracy — tie
Both paths call the same safari-mcp server. Both go through the same AppleScript → Safari chain. The CLI is a thin subprocess wrapper that serializes the MCP CallToolResult.content into stdout via a small _unwrap() helper:
def _unwrap(result):
    parts = []
    for item in result.content:
        text = getattr(item, "text", None)
        if text is not None:
            try:
                parts.append(json.loads(text))
            except (json.JSONDecodeError, ValueError):
                parts.append(text)
            continue
        # ImageContent: returned by screenshot tools
        data = getattr(item, "data", None)
        if data is not None:
            parts.append({"type": "image", "data": data,
                          "mimeType": getattr(item, "mimeType", "application/octet-stream")})
            continue
        parts.append(item)
    return parts[0] if len(parts) == 1 else parts
Byte-identical output verified live: the Unicode tab titles returned by cli-anything-safari --json tool list-tabs match the direct MCP output character-for-character, including right-to-left Hebrew.
The bugs that took 5 review rounds to find
I'm not going to pretend the first draft was clean. The schema-driven generator had real bugs that five passes of review (two my own, three by an adversarial code-reviewer agent) surfaced one by one:
1. Nested `.describe()` leaked. For `z.array(z.object({selector: z.string().describe("CSS selector")})).describe("Array of {selector, value} pairs")`, the naive regex picked the inner `"CSS selector"` as the outer field's description. Four tools had wrong help text. Fixed by walking modifier chains at depth 0 only.
2. Nested `.optional()` leaked. Same root cause, different effect — `safari_mock_route.response` and `safari_run_script.steps` were marked optional because an inner field had `.optional()`. The actual MCP schema marks them required. This one silently produced wrong JSON schemas; the fix was depth-aware modifier detection everywhere.
3. `_unwrap()` silently dropped screenshot output. It only handled `TextContent`, not `ImageContent`. For two tools (`safari_screenshot`, `safari_screenshot_element`), the CLI returned `null` with exit code 0 instead of the base64 JPEG. Caught on the fourth review round, after I'd already declared "100% compliance."
4. `safari_evaluate`'s parameter name is `script`, not `code`. The tool description said "JavaScript code to execute," so I wrote every documentation example as `--code "document.title"`. The parser auto-generated the CLI correctly from the schema (`--script`), so the CLI worked, but every doc example in SKILL.md, README.md, and my test file was wrong. Caught on the fourth review round when the reviewer cross-referenced docs against the schema.
5. `doubleClick: z.boolean().default(false)` serialized the default as the string `"false"`. Not broken at runtime (Click ignores it) but wrong in the bundled JSON schema. Fixed by adding a `_coerce_default()` step that parses JS barewords into their Python equivalents.
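The fix for the first two bugs boils down to tracking nesting depth while scanning the Zod chain and only honoring a modifier at depth 0. A stripped-down sketch of the idea (the real `extract_tools.py` handles more syntax than this):

```python
import re


def top_level_describe(expr):
    """Return the .describe("...") string applied at depth 0 of a Zod chain,
    skipping any .describe() nested inside array/object element schemas.
    (Simplified sketch; the same depth check applies to .optional().)"""
    depth = 0
    in_str = False
    i = 0
    while i < len(expr):
        ch = expr[i]
        if in_str:
            if ch == "\\":          # skip escaped character inside a string
                i += 2
                continue
            if ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        elif depth == 0 and expr.startswith(".describe(", i):
            m = re.match(r'\.describe\("([^"]*)"\)', expr[i:])
            if m:
                return m.group(1)   # the modifier that applies to the whole field
        i += 1
    return None
```

On the `z.array(z.object({...}))` example from bug 1, the inner `.describe("CSS selector")` sits at depth 3 and gets skipped; the outer one at depth 0 wins.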
Every bug except #5 had a corresponding regression test added to `test_parity.py` after the fix. The suite is now 24 tests, including explicit assertions like:
def test_evaluate_param_is_script_not_code(self):
    """Regression: prior versions used 'code' by mistake."""
    tool = self.registry.get("safari_evaluate")
    assert "script" in {p.name for p in tool.params}
    assert "code" not in {p.name for p in tool.params}
The lesson I kept re-learning: if you wrote the code, you can't review it yourself. You read your own docs through your own mental model of what the code does. You need an adversary — either a human with fresh eyes or an agent with no context — to catch the bugs your mental model papers over.
When to use which
Decision tree (read left to right):
Does your agent speak MCP natively?
├── Yes → Use safari-mcp directly. 25× faster, better UX.
└── No
├── Is this a one-off / interactive script?
│ └── Yes → Use cli-anything-safari. jq-pipeable.
├── Long-running automation, cost matters?
│ └── Yes → cli-anything-safari. Token savings compound.
├── CI / cron / non-interactive automation?
│ └── Yes → cli-anything-safari. Subprocess-friendly.
└── Everything else → try MCP first, fall back to CLI if needed.
For Claude Code, Cursor, Cline, Windsurf, Continue, OpenClaw, VS Code MCP — all MCP-native. Use safari-mcp directly:
npm install -g safari-mcp
# Then add to your MCP client config and restart it.
For Codex CLI, GitHub Copilot CLI, older agent frameworks, shell scripts, cron jobs:
# After the CLI-Anything PR merges
pip install cli-anything-safari
cli-anything-safari tools list
cli-anything-safari tool navigate --url https://example.com
What I'd do differently next time
1. Write the benchmark first. I built the CLI, shipped it, and then benchmarked it. If I'd measured first, I would have avoided ~3 review rounds of "is this even useful?" angst. The answer is nuanced (MCP for latency, CLI for tokens and reach), but I couldn't see that without the numbers.
2. Schema-driven from day one. The original plan was to hand-wrap 20 curated tools with a `raw` escape hatch for the rest. That would have been ~1,500 lines of code I'd be maintaining forever. The schema-driven approach is ~300 lines and maintains itself.
3. Spin up an adversarial reviewer earlier. I used an independent code-reviewer agent on review rounds 2–4. It caught bugs I'd read past a dozen times. Should have used it on round 1.
4. Token cost is a first-class metric for MCP design. I was thinking about MCP vs CLI in pure latency terms. The token-cost-at-scale axis is genuinely the more important one for long agent sessions, and I should have been measuring it from the start.
Links
- safari-mcp repo: https://github.com/achiya-automation/safari-mcp
- CLI-Anything PR #212: https://github.com/HKUDS/CLI-Anything/pull/212
- Direct harness path (after merge): https://github.com/HKUDS/CLI-Anything/tree/main/safari/agent-harness
If you run into edge cases — or have a better benchmark setup I should run — open an issue on the repo or reply here.
This post is part of my safari-mcp series. Previous posts: Why I built an MCP server for Safari, Chrome DevTools MCP vs Safari, 7 things I learned building Safari automation.