Maxim Saplin

Posted on Jun 10

CLI over MCP: a small Chrome DevTools experiment in Copilot CLI

#programming #mcp #agents #ai

Avoids 5k tokens of upfront MCP schema bloat

I ran the same browser smoke task through two paths: direct Chrome DevTools MCP and a custom CLI skill around mcp2cli. In GitHub Copilot CLI with gpt-5.3-codex-medium, direct Chrome DevTools MCP added about 5k tokens of upfront context before the agent did any work. The runtime table is too small and too noisy to rank the tools. The useful question is where the agent pays to discover the browser-control surface.

The mcp2cli README says it can “Save 96-99% of the tokens wasted on tool schemas every turn.” That is a strong claim, and frankly I did not expect numbers like that. What resonates with me is the CLI part: (a) a CLI does not pollute the system prompt, and (b) given a choice between the gh CLI and GitHub MCP, the former is the better option because the model already knows the tool and fewer tokens go to JSON schemas and tool calls.

I have used Chrome DevTools MCP a lot, so I chose it as the test bed for mcp2cli. This came in handy because I started my experiments with the minimal pi coding agent, which does not bundle any MCP integration, just the basic bash tool, and I was happy not to bloat my install with a dedicated MCP plugin. In this case, though, I compared MCP vs CLI in the fully fledged GitHub Copilot CLI.

Tool discovery is part of the experiment. Native MCP gives the agent a tool surface by loading schemas into context. A CLI wrapper makes the agent discover the surface the way it discovers any other command-line tool: list, search, ask for help, run a small probe, write down what worked. That changes where the discovery cost lands.

The Setup

I ran this in GitHub Copilot CLI using gpt-5.3-codex-medium:

Copilot stock MCP servers were disabled.
The app under test was a private Python/Streamlit codebase.
The browser task was the same 9-step smoke test in both variants.
One variant used direct Chrome DevTools MCP.
Another variant used a custom skill that wraps Chrome DevTools MCP via mcp2cli.
The custom skill itself started as an ad-hoc agent task: I pointed pi with gpt-5.4-mini at the Chrome DevTools MCP and mcp2cli repos, asked it to prepare a skill wrapping the MCP, then validated and later polished it with gpt-5.3-codex-high in GitHub Copilot CLI.

Copilot CLI is not a tiny harness. A blank run was already around 19k tokens before the agent touched the app. By contrast, pi starts close to zero in a fresh dialog. So a 5k tool-schema tax looks different depending on where you are standing.

  ○ ○ ◌ ◌ ● · · · · ·   gpt-5.3-codex · 19k/400k tokens (5%)
  · · · · · · · · · ·   ○ System Prompt           8.7k   (2%)
  · · · · · · · · · ·   ○ Custom Instructions     1.3k  (<1%)
  · · · · · · · · · ·   ◌ System Tools            8.8k   (2%)
  · · · · · · · · · ·   ● MCP Tools                155  (<1%)
  · · ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◉ Messages                   0   (0%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            239.5k  (60%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

The first version of the skill was written by an agent from a prompt containing the two GitHub links (Chrome DevTools MCP and mcp2cli): it was bootstrapped from public docs plus runtime checks through the CLI. For this MCP server, that was enough because the workflow I needed was narrow: start a session, navigate, inspect page state, interact, clean up. A more complex MCP server would probably need the server running side by side while the skill is being built, so the agent can discover actual runtime behavior instead of trusting docs and schemas.

Context Bloat

Mode	Blank total	MCP tools line	Difference vs CLI path
CLI skill path	19k	155	baseline
Direct Chrome DevTools MCP	24k	4.9k	+5k

Direct Chrome DevTools MCP added about 5k upfront context in this Copilot CLI setup. If you enabled two more MCP servers of similar size, you would expect roughly another 10k of context before the user prompt and before any useful work.
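The arithmetic behind that estimate can be sketched as below. This is a back-of-the-envelope model, not a measurement: it takes the 19k blank total and the 4.9k MCP tools line from the table above and assumes that every additional server costs about as much as Chrome DevTools MCP. The function name and the "similar size" assumption are mine.

```python
# Figures taken from the context-bloat table above (Copilot CLI, blank session).
BLANK_CONTEXT_TOKENS = 19_000      # CLI skill path, no direct MCP server loaded
SCHEMA_TOKENS_PER_SERVER = 4_900   # "MCP Tools" line with Chrome DevTools MCP enabled


def upfront_context(server_count: int, per_server: int = SCHEMA_TOKENS_PER_SERVER) -> int:
    """Estimate context size before the first user prompt.

    Assumption: each extra MCP server adds roughly the same schema size
    as Chrome DevTools MCP did in this setup.
    """
    if server_count < 0:
        raise ValueError("server_count must be non-negative")
    return BLANK_CONTEXT_TOKENS + server_count * per_server


if __name__ == "__main__":
    for n in range(4):
        print(f"{n} MCP server(s): ~{upfront_context(n) / 1000:.1f}k tokens upfront")
```

Real servers vary a lot in schema size, so treat the per-server constant as a placeholder to replace with your own measured "MCP Tools" line.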

The Runs

I did 3 runs per setup, using exactly the same prompt and expecting the agent to drive Google Chrome and look into each page:

Mode	Attempt	Total context	Messages	Runtime	Outcome	Notes
CLI skill	#1	39k	20.5k	not recorded	not summarized	context stats only
CLI skill	#2	37k	18.1k	259s	9/9 pass	checkbox flake, recovered via retry and fill-form
CLI skill	#3	38k	18.9k	141s	9/9 pass	checkbox UID failed, label click worked
Direct MCP	#1	40k	16.1k	not recorded	not summarized	context stats only
Direct MCP	#2	62k	38.7k	~101s	9/9 pass	fastest recorded completed run
Direct MCP	#3	79k	55.9k	241s	9/9 pass	agent used long waits; at least one 120s-scale delay path showed up

Direct MCP produced the fastest recorded completed run. The CLI skill had more stable message growth. MCP attempt #3 wandered into long waits and ended much heavier than its previous run.
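To put a number on "more stable message growth", here is a small sketch that summarizes the Messages column of the table above. With three runs per mode it is descriptive only; the helper names are mine and nothing here is a statistical test.

```python
from statistics import mean

# "Messages" column from the runs table, in thousands of tokens.
MESSAGES_K = {
    "CLI skill": [20.5, 18.1, 18.9],
    "Direct MCP": [16.1, 38.7, 55.9],
}


def spread(values: list[float]) -> float:
    """Difference between the heaviest and the lightest run."""
    return max(values) - min(values)


def summarize(samples: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Mean and min-max spread per mode, rounded to one decimal."""
    return {
        mode: {"mean": round(mean(values), 1), "spread": round(spread(values), 1)}
        for mode, values in samples.items()
    }


if __name__ == "__main__":
    for mode, stats in summarize(MESSAGES_K).items():
        print(f"{mode}: mean {stats['mean']}k, spread {stats['spread']}k")
```

The CLI runs sit within a couple of thousand tokens of each other, while the direct MCP runs span tens of thousands, which is the pattern the paragraph above describes.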

I would not rank the tools from this sample. The model’s path through a long browser trace can dominate the interface choice. One stale UID, one wait loop, one unnecessary reload, one over-eager snapshot, and your neat comparison starts to rot. Context engineering can end up being local patching, while the agent’s random walk is the key factor in how long and how costly the session gets.
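The recovery pattern that showed up in the runs (direct checkbox click times out, clicking the label works) can be written down as a bounded fallback, so the agent does not drift into open-ended wait loops. This is a minimal sketch: `click` is a hypothetical callable standing in for whatever actually performs the click, `InteractionTimeout` is a made-up exception, and the uids are the ones reported in CLI attempt #3, used only as sample values.

```python
from typing import Callable, Iterable


class InteractionTimeout(Exception):
    """Hypothetical error: the element never became interactive in time."""


def click_with_fallback(
    click: Callable[[str], None],
    uids: Iterable[str],
    attempts_per_uid: int = 2,
) -> str:
    """Try each candidate uid a bounded number of times; return the one that worked."""
    candidates = list(uids)
    last_error = None
    for uid in candidates:
        for _ in range(attempts_per_uid):
            try:
                click(uid)
                return uid
            except InteractionTimeout as exc:
                last_error = exc  # remember the failure and move on
    raise InteractionTimeout(f"no candidate was clickable: {candidates}") from last_error


if __name__ == "__main__":
    def fake_click(uid: str) -> None:
        # Simulates the flaky checkbox: the checkbox uid always times out.
        if uid == "3_37":
            raise InteractionTimeout(uid)

    print(click_with_fallback(fake_click, ["3_37", "3_38"]))
```

Capping the attempts per candidate keeps the worst case predictable, which matters more for token cost than any single clever retry.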

Smoke Test Prompt

Middle part cut due to private nature of the repo:

Run a browser smoke test of the local app and provide a concise execution report suitable for comparing token usage across different browser-driving approaches.

Goal:
- Verify the app can be launched and basic navigation/interactions work.
- Keep actions read-only where possible.
- If a step fails, continue with the next step and report failure details.

Setup:
1. Start the app from workspace root:
   [private repo command omitted]
2. Use the local URL shown by the app.

...

Evidence and reporting format:
- For each step, output: PASS/FAIL, short reason, and one concrete UI evidence string.
- Include a final summary with:
  - total steps, passed, failed
  - elapsed runtime
  - estimated tokens consumed if available from your runtime, otherwise "not available"
  - any flaky points encountered

Constraints:
- Do not modify application data unless a step explicitly requires a harmless UI toggle.
- Do not use screenshots unless needed for a failed-step diagnosis.
- Prefer structured text evidence from page state over visual descriptions.
- Clean up any browser/session resources and stop the app process when done.

Where Anthropic’s MCP article fits

At the end of 2025, Anthropic published a post, Code execution with MCP: Building more efficient agents, describing two token problems with direct MCP usage:

Tool definitions overload the context window.
Intermediate tool results get passed through the model.

Their preferred answer is code execution: let the agent write code, load only the tool interfaces it needs, filter data outside the model, and return small results.
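The "filter data outside the model" part can be illustrated with a toy example. The snapshot shape below (nested nodes with `uid`, `role`, `name`, `children`) is an assumption made for illustration, not the actual Chrome DevTools MCP output format, and `summarize_snapshot` is a hypothetical helper.

```python
import json


def summarize_snapshot(snapshot: dict, wanted_roles: set[str]) -> list[dict]:
    """Walk a nested node tree and keep only uid/role/name for the wanted roles."""
    hits = []
    stack = [snapshot]
    while stack:
        node = stack.pop()
        if node.get("role") in wanted_roles:
            hits.append(
                {"uid": node["uid"], "role": node["role"], "name": node.get("name", "")}
            )
        # Reversed so that children are visited in document order.
        stack.extend(reversed(node.get("children", [])))
    return hits


if __name__ == "__main__":
    page = {
        "uid": "1_0", "role": "RootWebArea", "name": "Demo app",
        "children": [
            {"uid": "1_1", "role": "heading", "name": "Timeline"},
            {"uid": "1_2", "role": "checkbox", "name": "Review only"},
            {"uid": "1_3", "role": "generic", "name": "", "children": [
                {"uid": "1_4", "role": "button", "name": "Reload"},
            ]},
        ],
    }
    small = summarize_snapshot(page, {"checkbox", "button"})
    print(json.dumps(small))
    print(len(json.dumps(page)), "->", len(json.dumps(small)), "characters")
```

Only the filtered list would be handed back to the model; the full tree stays in the execution environment.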

mcp2cli is not exactly that architecture. But it rhymes with the same idea. It keeps the full MCP tool surface outside the model by default and gives the agent a shell interface it can inspect and call as needed. I expected the tool to also do some optimization of tool results (after all, JSON is quite heavy), but I did not observe any token savings there.

Tool Discovery

Direct MCP and a CLI wrapper differ in execution and discovery.

With native MCP, the client usually hands the model a set of tool definitions. That is convenient. The agent can see what exists. It can call the browser tool directly. In Copilot CLI, that convenience showed up as about 5k tokens of additional upfront context for Chrome DevTools MCP.

With the CLI path, the agent has to explore. It can list available commands, search by keyword, inspect command help, run a tiny call, and keep only the working pattern in its notes or skill file. That is more work, but it is also progressive disclosure. The model does not need the whole browser automation surface in context if the task only needs navigation, snapshots, form fills, and cleanup.
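A toy model of that trade-off, assuming a made-up five-tool catalogue: loading every definition up front versus listing names, searching, and pulling help for a single tool. The catalogue, descriptions, and function names are hypothetical and do not reproduce the real Chrome DevTools MCP schema or mcp2cli's commands; character counts stand in for tokens.

```python
import json

# Hypothetical, simplified tool catalogue used only for this illustration.
TOOLS = {
    "navigate_page": {
        "description": "Navigate the current page to a URL.",
        "parameters": {"url": {"type": "string", "description": "Target URL"}},
    },
    "take_snapshot": {
        "description": "Capture a text snapshot of the page.",
        "parameters": {},
    },
    "click": {
        "description": "Click an element identified by uid.",
        "parameters": {"uid": {"type": "string", "description": "Element uid"}},
    },
    "fill_form": {
        "description": "Fill several inputs at once.",
        "parameters": {"elements": {"type": "array", "description": "uid/value pairs"}},
    },
    "wait_for": {
        "description": "Wait until the given text appears on the page.",
        "parameters": {"text": {"type": "string", "description": "Text to wait for"}},
    },
}


def full_schema() -> str:
    """Native-MCP style: every tool definition goes into context up front."""
    return json.dumps(TOOLS, indent=2)


def list_tools() -> str:
    """CLI-style step 1: names only."""
    return "\n".join(sorted(TOOLS))


def search_tools(keyword: str) -> list[str]:
    """CLI-style step 2: find candidate tools by keyword."""
    kw = keyword.lower()
    return sorted(
        name for name, spec in TOOLS.items()
        if kw in name.lower() or kw in spec["description"].lower()
    )


def tool_help(name: str) -> str:
    """CLI-style step 3: full details for one tool only."""
    return json.dumps({name: TOOLS[name]}, indent=2)


if __name__ == "__main__":
    upfront = len(full_schema())
    progressive = len(list_tools()) + len(tool_help("navigate_page"))
    print(f"upfront: {upfront} chars, progressive: {progressive} chars")
    print("search 'form':", search_tools("form"))
```

The progressive path stays cheaper only while the task touches a small part of the surface; an agent that ends up asking for help on every tool pays the full schema anyway, plus the extra turns.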

Speaking of wrapping MCPs in CLIs, I can see two options. One is my approach: point an agent at the mcp2cli and target MCP docs and have it cook an ad-hoc wrapper skill. The other is to use the generic mcp2cli as is.

For more complicated MCP servers, I would not rely on docs alone. I would want the target MCP server available during skill creation, and I would want the agent to test the wrapper against real commands before calling the skill redistributable. The moment auth, pagination, binary outputs, huge payloads, mutation safety, or weird error messages enter the picture, the skill needs runtime scars.

Btw, Claude Code now bundles a CLI_EXPERIMENTAL_MODE toggle that addresses the bloated system prompt caused by using many MCPs.

Conclusion

I would not claim that this experiment proves mcp2cli saves 96-99% in real browser work. I would claim this:

mcp2cli works, and I like that there is a tool that makes it easy to shim MCP into a CLI.
The CLI skill path is leaner at startup.
The CLI skill avoided the roughly 5k-token tool-surface load that direct MCP added up front.
Native MCP pays more of the discovery cost upfront; the CLI skill pushes discovery into command inspection and tested workflow notes.
Long agent traces are noisy enough that path variance can swamp interface choice.

For deep debugging, I still want direct Chrome DevTools MCP available. It exposes a serious browser surface: navigation, input automation, snapshots, screenshots, console, network, performance, memory tooling, and more.

For repeatable smoke tests in a shell-first agent, I like the CLI wrapper.

Raw context windows

CLI / Blank

● Blank

  ○ ○ ◌ ◌ ● · · · · ·   gpt-5.3-codex · 19k/400k tokens (5%)
  · · · · · · · · · ·   ○ System Prompt           8.7k   (2%)
  · · · · · · · · · ·   ○ Custom Instructions     1.3k  (<1%)
  · · · · · · · · · ·   ◌ System Tools            8.8k   (2%)
  · · · · · · · · · ·   ● MCP Tools                155  (<1%)
  · · ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◉ Messages                   0   (0%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            239.5k  (60%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

CLI / Attempt #1

● Attempt #1

  ○ ○ ◌ ◌ ● ◉ ◉ ◉ ◉ ·   gpt-5.3-codex · 39k/400k tokens (10%)
  · · · · · · · · · ·   ○ System Prompt          10.0k   (2%)
  · · · · · · · · · ·   ◌ System Tools            8.8k   (2%)
  · · · · · · · · · ·   ● MCP Tools                155  (<1%)
  · · · · · ◎ ◎ ◎ ◎ ◎   ◉ Messages               20.5k   (5%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            219.0k  (55%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

CLI / Attempt #2

● Attempt #2

○ ○ ◌ ◌ ● ◉ ◉ ◉ · ·   gpt-5.3-codex · 37k/400k tokens (9%)
· · · · · · · · · ·   ○ System Prompt          10.0k   (2%)
· · · · · · · · · ·   ◌ System Tools            8.8k   (2%)
· · · · · · · · · ·   ● MCP Tools                155  (<1%)
· · · · · ◎ ◎ ◎ ◎ ◎   ◉ Messages               18.1k   (5%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            221.4k  (55%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

Summary: total steps 9, passed 9, failed 0.
Elapsed runtime: 259s (~4m19s).
Flaky points: intermittent Timeline checkbox interaction timeouts (element did not become interactive within timeout); recovered via retry and fill-form. Initial root snapshot also needed explicit wait before full UI became visible.

CLI / Attempt #3

● Attempt #3

○ ○ ◌ ◌ ● ◉ ◉ ◉ · ·   gpt-5.3-codex · 38k/400k tokens (9%)
· · · · · · · · · ·   ○ System Prompt          10.0k   (2%)
· · · · · · · · · ·   ◌ System Tools            8.8k   (2%)
· · · · · · · · · ·   ● MCP Tools                155  (<1%)
· · · · · ◎ ◎ ◎ ◎ ◎   ◉ Messages               18.9k   (5%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            220.6k  (55%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

Summary: total steps 9, passed 9, failed 0; elapsed runtime 141s (~2m21s); estimated tokens consumed not available; flaky points: one checkbox interaction timeout when clicking checkbox uid directly (uid 3_37), resolved by clicking its label uid (uid 3_38) and proceeding.

MCP / Blank

● Blank

  ○ ○ ◌ ◌ ● · · · · ·   gpt-5.3-codex · 24k/400k tokens (6%)
  · · · · · · · · · ·   ○ System Prompt           8.7k   (2%)
  · · · · · · · · · ·   ○ Custom Instructions     1.3k  (<1%)
  · · · · · · · · · ·   ◌ System Tools            8.7k   (2%)
  · · · · · · · · · ·   ● MCP Tools               4.9k   (1%)
  · · ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◉ Messages                   0   (0%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            234.9k  (59%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

MCP / Attempt #1

● Attempt #1

○ ○ ◌ ◌ ● ◉ ◉ ◉ · ·   gpt-5.3-codex · 40k/400k tokens (10%)
· · · · · · · · · ·   ○ System Prompt          10.0k   (2%)
· · · · · · · · · ·   ◌ System Tools            8.7k   (2%)
· · · · · · · · · ·   ● MCP Tools               4.9k   (1%)
· · · · · ◎ ◎ ◎ ◎ ◎   ◉ Messages               16.1k   (4%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            218.7k  (55%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

MCP / Attempt #2

● Attempt #2

○ ○ ◌ ◌ ● ◉ ◉ ◉ ◉ ◉   gpt-5.3-codex · 62k/400k tokens (16%)
◉ ◉ · · · · · · · ·   ○ System Prompt          10.0k   (2%)
· · · · · · · · · ·   ◌ System Tools            8.7k   (2%)
· · · · · · · · · ·   ● MCP Tools               4.9k   (1%)
· · · · · ◎ ◎ ◎ ◎ ◎   ◉ Messages               38.7k  (10%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            196.1k  (49%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

Summary: total steps 9, passed 9, failed 0.
Elapsed runtime: ~101s.
Flaky points: Timeline checkbox ("Только для проверки") did not respond to direct click twice (interaction timeout); state change required fallback interaction and page reload to restore normal entry list rendering.

MCP / Attempt #3

● Attempt #3

○ ○ ◌ ◌ ● ◉ ◉ ◉ ◉ ◉   gpt-5.3-codex · 79k/400k tokens (20%)
◉ ◉ ◉ ◉ ◉ · · · · ·   ○ System Prompt          10.0k   (2%)
· · · · · · · · · ·   ◌ System Tools            8.7k   (2%)
· · · · · · · · · ·   ● MCP Tools               4.9k   (1%)
· · · · · ◎ ◎ ◎ ◎ ◎   ◉ Messages               55.9k  (14%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            179.0k  (45%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

Summary

 - Total steps: 9  
 - Passed: 9  
 - Failed: 0  
 - Elapsed runtime: 241s (4m 1s)  
 - Estimated tokens consumed: not available  
 - Flaky points observed:  
  1. Checkbox click initially failed due to a non-interactive/stale UID; succeeded after fresh snapshot + label click.  
  2. Final wait_for on Главная timed out once; page was already navigated and confirmed by subsequent snapshot.

App process and browser page were cleaned up at the end.

Top comments (9)

Theo Valmis • Jun 15

The MCP-versus-CLI framing has a hidden variable you point at but don't fully name: whether the tool is in the model's training distribution. gh and curl win as CLIs because discovery already happened at training time, the model has seen them ten thousand times, so the schema tax buys nothing. But that advantage is specific to popular tools. For a bespoke internal CLI the model has never seen, "discover by list, search, probe" can cost more turns than MCP's flat schema, because the model is trial-and-erroring a surface nothing described. So the real axis isn't MCP vs CLI, it's familiar surface vs novel surface. CLI for tools the model already knows, a schema for the ones it doesn't. The worst quadrant is a custom CLI with no schema and no training prior: you pay discovery cost with no help, which is exactly where the 96-99% savings would invert.

Hussein Mahdi • Jun 17

Honest writeup, nice that you didn't fake a winner 😅 when the runs were so noisy. The schema-tax savings are the real takeaway: the CLI path wins at startup because the model already knows shells. Did the tool results get smaller too, or just the upfront schemas? That's usually the bigger cost in a long browser run.

Maxim Saplin • Jun 18

Nope, no tool output compression, it just passes through the JSON from MCP

Mykola Kondratiuk • Jun 15

5k tokens of upfront schema discovery before the agent does anything is the cold-start cost i keep seeing with MCP setups. does it reset per call or is it amortized across a session? that changes the tradeoff.

Maxim Saplin • Jun 16

It's part of system prompt hence attached to every LLM query

Mykola Kondratiuk • Jun 16

so not cold-start cost, fixed overhead per query. 5k tokens every call is actually worse than a one-time discovery - it just gets baked into the baseline token burn silently.

mote • Jun 13

MCP as a protocol for agent-to-tool communication is interesting, but the DevTools angle is where it gets weird for me.

DevTools is designed for human debugging — you open it, you click around, you read state. Agents don't need that. They need structured, typed interfaces with clear input/output contracts. MCP's JSON-RPC model is closer to what an agent actually wants than a DevTools CDP hook.

The experiment is smart, but I wonder if you're essentially building a human-readable interface for an agent rather than giving the agent what it actually needs — a schema, not a screenshot.

Curious what the Copilot CLI actually does with this — does it parse DOM state or just log it?

Alex Shev • Jun 11

This is a useful distinction. MCP gives tools a common interface, but the terminal is still where many dev workflows become inspectable and repeatable.

The strongest pattern is probably not CLI versus MCP, but a CLI workflow that can expose enough structure for agents while still leaving developers with logs, commands, and artifacts they can verify.

Aliaksei Zelianouski • Jun 10

The token tax is real, but what actually sold me is simpler: the model already knows CLIs. gh, curl, ls - zero schema, nothing to teach it. An MCP server makes it relearn a tool surface every session. I built my agent's browser control as a plain Python CLI over CDP instead of a Chrome MCP for that reason - no schema in the context, every command readable and swappable. Flip it for small or local models though. Discovering a CLI surface is an agentic task they fall down on, and a strict tool schema is the scaffolding that keeps their tool calls well-formed.