DEV Community

Maxim Saplin

CLI over MCP: a small Chrome DevTools experiment in Copilot CLI

I ran the same browser smoke task through two paths: direct Chrome DevTools MCP and a custom CLI skill around mcp2cli. In GitHub Copilot CLI with gpt-5.3-codex-medium, direct Chrome DevTools MCP added about 5k tokens of upfront context before the agent did any work. The runtime table is too small and too noisy to rank the tools. The useful question is where the agent pays to discover the browser-control surface.

The mcp2cli README says it can “Save 96-99% of the tokens wasted on tool schemas every turn.” That is a strong claim, and frankly I did not expect numbers like that. It is the CLI part that resonates with me: (a) a CLI does not pollute the system prompt, and (b) if you choose between the gh CLI and GitHub MCP, the former tends to be better because the model already knows the tool and fewer tokens are wasted on JSON schemas and tool calls.

I have used Chrome DevTools MCP a lot, so I chose it as the test bed for mcp2cli. This came in handy because I started my experiments with the minimal pi coding agent, which doesn't bundle any MCP integration, just the basic bash tool, and I was very happy not to bloat my install with a dedicated MCP plugin. For this comparison, though, I measured MCP vs CLI in the fully fledged GitHub Copilot CLI.

Tool discovery is part of the experiment. Native MCP gives the agent a tool surface by loading schemas into context. A CLI wrapper makes the agent discover the surface the way it discovers any other command-line tool: list, search, ask for help, run a small probe, write down what worked. That changes where the discovery cost lands.

The Setup

I ran this in GitHub Copilot CLI using gpt-5.3-codex-medium:

  • Copilot stock MCP servers were disabled.
  • The app under test was a private Python/Streamlit codebase.
  • The browser task was the same 9-step smoke test in both variants.
  • One variant used direct Chrome DevTools MCP.
  • Another variant used a custom skill that wraps Chrome DevTools MCP via mcp2cli.
  • The custom skill itself started as an ad-hoc agent task: I pointed pi with gpt-5.4-mini at the Chrome DevTools MCP and mcp2cli repos, asked it to prepare a skill wrapping the MCP, then validated and later polished it with gpt-5.3-codex-high in GitHub Copilot CLI.

Copilot CLI is not a tiny harness. A blank run was already around 19k tokens before the agent touched the app. By contrast, pi starts close to zero in a fresh dialog. So a 5k tool-schema tax looks different depending on where you are standing.

  ○ ○ ◌ ◌ ● · · · · ·   gpt-5.3-codex · 19k/400k tokens (5%)
  · · · · · · · · · ·   ○ System Prompt           8.7k   (2%)
  · · · · · · · · · ·   ○ Custom Instructions     1.3k  (<1%)
  · · · · · · · · · ·   ◌ System Tools            8.8k   (2%)
  · · · · · · · · · ·   ● MCP Tools                155  (<1%)
  · · ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◉ Messages                   0   (0%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            239.5k  (60%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

The first version of the skill was written by an agent from a prompt sharing the two GitHub links (Chrome DevTools MCP and mcp2cli). It was bootstrapped from public docs plus runtime checks through the CLI. For this MCP server, that was enough because the workflow I needed was narrow: start a session, navigate, inspect page state, interact, clean up. A more complex MCP server would probably need the server running side by side while the skill is being built, so the agent can discover actual runtime behavior instead of trusting docs and schemas.

Context Bloat

Mode                        Blank total   MCP tools line   Difference vs CLI path
CLI skill path              19k           155              baseline
Direct Chrome DevTools MCP  24k           4.9k             +5k

Direct Chrome DevTools MCP added about 5k upfront context in this Copilot CLI setup. If you enabled two more MCP servers of similar size, you would expect roughly another 10k of context before the user prompt and before any useful work.
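The arithmetic behind that estimate is simple enough to sketch. The snippet below assumes every additional server costs about the same 4.9k tokens as Chrome DevTools MCP did here, which is a guess rather than a measurement, and the function name is made up for illustration.

```python
# Rough projection of upfront context in Copilot CLI, using the blank-run
# numbers from the table above. Assumption: each extra MCP server costs about
# as much as Chrome DevTools MCP did (4.9k tokens); real servers will differ.

BLANK_CLI_TOKENS = 19_000      # blank run on the CLI skill path
CHROME_DEVTOOLS_MCP = 4_900    # "MCP Tools" line with the server enabled
CONTEXT_WINDOW = 400_000       # window size shown in the context readout


def projected_upfront(n_servers: int, per_server: int = CHROME_DEVTOOLS_MCP) -> int:
    """Tokens spent before the first user prompt with n similar servers."""
    return BLANK_CLI_TOKENS + n_servers * per_server


for n in range(4):
    total = projected_upfront(n)
    share = 100 * total / CONTEXT_WINDOW
    print(f"{n} server(s): ~{total / 1000:.1f}k tokens upfront ({share:.1f}% of window)")
```

In a harness that starts near zero, the same per-server cost would be a much larger share of the starting context, which is why the baseline matters when reading these numbers.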

The Runs

I did three runs per setup, using exactly the same prompt and expecting the agent to drive Google Chrome and look into each page:

Mode        Attempt  Total context  Messages  Runtime       Outcome         Notes
CLI skill   #1       39k            20.5k     not recorded  not summarized  context stats only
CLI skill   #2       37k            18.1k     259s          9/9 pass        checkbox flake, recovered via retry and fill-form
CLI skill   #3       38k            18.9k     141s          9/9 pass        checkbox UID failed, label click worked
Direct MCP  #1       40k            16.1k     not recorded  not summarized  context stats only
Direct MCP  #2       62k            38.7k     ~101s         9/9 pass        fastest recorded completed run
Direct MCP  #3       79k            55.9k     241s          9/9 pass        agent used long waits; at least one 120s-scale delay path showed up

Direct MCP produced the fastest recorded completed run. The CLI skill had more stable message growth. MCP attempt #3 wandered into long waits and ended much heavier than its previous run.
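To put a number on "more stable message growth", here is a small sketch that summarizes the Messages column from the runs table. The values are copied from that table; the `summarize` helper is mine, and three runs per mode is far too few to treat the result as a statistic.

```python
from statistics import mean

# Message-token totals (in thousands) copied from the runs table.
messages_k = {
    "CLI skill": [20.5, 18.1, 18.9],
    "Direct MCP": [16.1, 38.7, 55.9],
}


def summarize(values: list[float]) -> tuple[float, float]:
    """Return (mean, max - min) for one mode's message totals."""
    return mean(values), max(values) - min(values)


for mode, values in messages_k.items():
    avg, spread = summarize(values)
    print(f"{mode}: mean {avg:.1f}k, spread {spread:.1f}k over {len(values)} runs")
```

The spread is the interesting part: the CLI skill runs stay within about 2.4k of each other, while the direct MCP runs differ by almost 40k, and the lightest run of all six is a direct MCP one.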

I would not rank the tools from this sample. The model’s path through a long browser trace can dominate the interface choice. One stale UID, one wait loop, one unnecessary reload, one over-eager snapshot, and your neat comparison starts to rot. Context engineering can end up being local patching, while the agent’s random walk remains the key factor in how long and how costly the session is.

Smoke Test Prompt

Middle part cut due to private nature of the repo:

Run a browser smoke test of the local app and provide a concise execution report suitable for comparing token usage across different browser-driving approaches.

Goal:
- Verify the app can be launched and basic navigation/interactions work.
- Keep actions read-only where possible.
- If a step fails, continue with the next step and report failure details.

Setup:
1. Start the app from workspace root:
   [private repo command omitted]
2. Use the local URL shown by the app.

...

Evidence and reporting format:
- For each step, output: PASS/FAIL, short reason, and one concrete UI evidence string.
- Include a final summary with:
  - total steps, passed, failed
  - elapsed runtime
  - estimated tokens consumed if available from your runtime, otherwise "not available"
  - any flaky points encountered

Constraints:
- Do not modify application data unless a step explicitly requires a harmless UI toggle.
- Do not use screenshots unless needed for a failed-step diagnosis.
- Prefer structured text evidence from page state over visual descriptions.
- Clean up any browser/session resources and stop the app process when done.

Where Anthropic’s MCP article fits

At the end of 2025, Anthropic published a post, Code execution with MCP: Building more efficient agents, describing two token problems with direct MCP usage:

  1. Tool definitions overload the context window.
  2. Intermediate tool results get passed through the model.

Their preferred answer is code execution: let the agent write code, load only the tool interfaces it needs, filter data outside the model, and return small results.
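A toy version of the "filter data outside the model" idea might look like the sketch below. Nothing here calls a real MCP client or Anthropic API: `fake_snapshot` is a stand-in for a large tool result, `filter_for_model` is a hypothetical helper, and character counts are only a crude proxy for tokens.

```python
# Illustrative only: fake_snapshot stands in for a large tool result.
# No real MCP client or model API is involved.

def fake_snapshot(n_rows: int = 500) -> list[str]:
    """Pretend page snapshot: one text line per UI element."""
    lines = [f'uid=1_{i} StaticText "row {i}"' for i in range(n_rows)]
    lines.append('uid=1_900 checkbox "Timeline filter"')
    return lines


def filter_for_model(lines: list[str], keyword: str, limit: int = 5) -> list[str]:
    """Keep only the lines matching the keyword, capped at `limit`."""
    hits = [line for line in lines if keyword.lower() in line.lower()]
    return hits[:limit]


snapshot = fake_snapshot()
small = filter_for_model(snapshot, "checkbox")

raw_chars = sum(len(line) for line in snapshot)
kept_chars = sum(len(line) for line in small)
print(f"raw result: {len(snapshot)} lines, {raw_chars} chars")
print(f"passed to model: {len(small)} lines, {kept_chars} chars")
```

The point of the pattern is that the full result never enters the conversation; only the filtered lines do.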

mcp2cli is not exactly that architecture, but it rhymes with the same idea. It keeps the full MCP tool surface outside the model by default and gives the agent a shell interface it can inspect and call as needed. I expected the tool to also optimize tool results (JSON is quite heavy, after all), but I didn't observe any token savings there.

Tool Discovery

Direct MCP and a CLI wrapper differ in execution and discovery.

With native MCP, the client usually hands the model a set of tool definitions. That is convenient. The agent can see what exists. It can call the browser tool directly. In Copilot CLI, that convenience showed up as about 5k tokens of additional upfront context for Chrome DevTools MCP.

With the CLI path, the agent has to explore. It can list available commands, search by keyword, inspect command help, run a tiny call, and keep only the working pattern in its notes or skill file. That is more work, but it is also progressive disclosure. The model does not need the whole browser automation surface in context if the task only needs navigation, snapshots, form fills, and cleanup.
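The difference between the two discovery styles can be simulated in a few lines. This is a sketch under loud assumptions: the tool names echo ones mentioned in this post, but the schemas are invented, the `list_tools` / `search_tools` / `tool_help` functions are hypothetical and do not mirror mcp2cli's actual commands, and character counts stand in for tokens.

```python
import json

# Hypothetical tool registry. The schemas are made up for illustration and
# are NOT the real Chrome DevTools MCP schemas.
TOOLS = {
    "navigate_page": {
        "description": "Navigate the current page to a URL.",
        "params": {"url": "string"},
    },
    "take_snapshot": {
        "description": "Return a text snapshot of the page.",
        "params": {},
    },
    "fill_form": {
        "description": "Fill several form fields at once.",
        "params": {"elements": "array of {uid, value}"},
    },
    "wait_for": {
        "description": "Wait until the given text appears on the page.",
        "params": {"text": "string"},
    },
}


def list_tools() -> list[str]:
    """Cheap first step: names only."""
    return sorted(TOOLS)


def search_tools(keyword: str) -> list[str]:
    """Names whose name or description mentions the keyword."""
    kw = keyword.lower()
    return [
        name for name in sorted(TOOLS)
        if kw in name.lower() or kw in TOOLS[name]["description"].lower()
    ]


def tool_help(name: str) -> str:
    """Full schema for a single tool, fetched only when needed."""
    return json.dumps({name: TOOLS[name]}, indent=2)


# Native-MCP style: every schema is loaded upfront.
upfront = len(json.dumps(TOOLS, indent=2))
# CLI style: list the names, then read help for the one tool the task needs.
on_demand = len("\n".join(list_tools())) + len(tool_help("fill_form"))

print(f"all schemas upfront: {upfront} chars")
print(f"names + one help page: {on_demand} chars")
```

The saving only holds while the task touches a small part of the surface. An agent that ends up reading help for every tool pays the full cost anyway, plus the extra turns spent asking.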

Speaking of wrapping MCPs in CLIs, I can see two options. One is my approach: point an agent at the mcp2cli and target MCP docs and have it cook an ad-hoc wrapper skill. The other is to use mcp2cli on its own as a generic wrapper.

For more complicated MCP servers, I would not rely on docs alone. I would want the target MCP server available during skill creation, and I would want the agent to test the wrapper against real commands before calling the skill redistributable. The moment auth, pagination, binary outputs, huge payloads, mutation safety, or weird error messages enter the picture, the skill needs runtime scars.

Btw, Claude Code now bundles a CLI_EXPERIMENTAL_MODE toggle that addresses system prompt bloat caused by using many MCPs.

Conclusion

I would not claim that this experiment proves mcp2cli saves 96-99% in real browser work. I would claim this:

  • mcp2cli works, and I like that there is a tool that makes it easy to shim an MCP into a CLI.
  • The CLI skill path is leaner at startup: about 19k tokens versus 24k with direct MCP.
  • The CLI skill avoided the roughly 5k-token tool-surface load that direct MCP adds upfront.
  • Native MCP pays more of the discovery cost upfront; the CLI skill pushes discovery into command inspection and tested workflow notes.
  • Long agent traces are noisy enough that path variance can swamp interface choice.

For deep debugging, I still want direct Chrome DevTools MCP available. It exposes a serious browser surface: navigation, input automation, snapshots, screenshots, console, network, performance, memory tooling, and more.

For repeatable smoke tests in a shell-first agent, I like the CLI wrapper.

Raw context windows

CLI / Blank

● Blank

  ○ ○ ◌ ◌ ● · · · · ·   gpt-5.3-codex · 19k/400k tokens (5%)
  · · · · · · · · · ·   ○ System Prompt           8.7k   (2%)
  · · · · · · · · · ·   ○ Custom Instructions     1.3k  (<1%)
  · · · · · · · · · ·   ◌ System Tools            8.8k   (2%)
  · · · · · · · · · ·   ● MCP Tools                155  (<1%)
  · · ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◉ Messages                   0   (0%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            239.5k  (60%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

CLI / Attempt #1

● Attempt #1

  ○ ○ ◌ ◌ ● ◉ ◉ ◉ ◉ ·   gpt-5.3-codex · 39k/400k tokens (10%)
  · · · · · · · · · ·   ○ System Prompt          10.0k   (2%)
  · · · · · · · · · ·   ◌ System Tools            8.8k   (2%)
  · · · · · · · · · ·   ● MCP Tools                155  (<1%)
  · · · · · ◎ ◎ ◎ ◎ ◎   ◉ Messages               20.5k   (5%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            219.0k  (55%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

CLI / Attempt #2

● Attempt #2

○ ○ ◌ ◌ ● ◉ ◉ ◉ · ·   gpt-5.3-codex · 37k/400k tokens (9%)
· · · · · · · · · ·   ○ System Prompt          10.0k   (2%)
· · · · · · · · · ·   ◌ System Tools            8.8k   (2%)
· · · · · · · · · ·   ● MCP Tools                155  (<1%)
· · · · · ◎ ◎ ◎ ◎ ◎   ◉ Messages               18.1k   (5%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            221.4k  (55%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

Summary: total steps 9, passed 9, failed 0.
Elapsed runtime: 259s (~4m19s).
Flaky points: intermittent Timeline checkbox interaction timeouts (element did not become interactive within timeout); recovered via retry and fill-form. Initial root snapshot also needed explicit wait before full UI became visible.

CLI / Attempt #3

● Attempt #3

○ ○ ◌ ◌ ● ◉ ◉ ◉ · ·   gpt-5.3-codex · 38k/400k tokens (9%)
· · · · · · · · · ·   ○ System Prompt          10.0k   (2%)
· · · · · · · · · ·   ◌ System Tools            8.8k   (2%)
· · · · · · · · · ·   ● MCP Tools                155  (<1%)
· · · · · ◎ ◎ ◎ ◎ ◎   ◉ Messages               18.9k   (5%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            220.6k  (55%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

Summary: total steps 9, passed 9, failed 0; elapsed runtime 141s (~2m21s); estimated tokens consumed not available; flaky points: one checkbox interaction timeout when clicking checkbox uid directly (uid 3_37), resolved by clicking its label uid (uid 3_38) and proceeding.

MCP / Blank

● Blank

  ○ ○ ◌ ◌ ● · · · · ·   gpt-5.3-codex · 24k/400k tokens (6%)
  · · · · · · · · · ·   ○ System Prompt           8.7k   (2%)
  · · · · · · · · · ·   ○ Custom Instructions     1.3k  (<1%)
  · · · · · · · · · ·   ◌ System Tools            8.7k   (2%)
  · · · · · · · · · ·   ● MCP Tools               4.9k   (1%)
  · · ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◉ Messages                   0   (0%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            234.9k  (59%)
  ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

MCP / Attempt #1

● Attempt #1

○ ○ ◌ ◌ ● ◉ ◉ ◉ · ·   gpt-5.3-codex · 40k/400k tokens (10%)
· · · · · · · · · ·   ○ System Prompt          10.0k   (2%)
· · · · · · · · · ·   ◌ System Tools            8.7k   (2%)
· · · · · · · · · ·   ● MCP Tools               4.9k   (1%)
· · · · · ◎ ◎ ◎ ◎ ◎   ◉ Messages               16.1k   (4%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            218.7k  (55%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

MCP / Attempt #2

● Attempt #2

○ ○ ◌ ◌ ● ◉ ◉ ◉ ◉ ◉   gpt-5.3-codex · 62k/400k tokens (16%)
◉ ◉ · · · · · · · ·   ○ System Prompt          10.0k   (2%)
· · · · · · · · · ·   ◌ System Tools            8.7k   (2%)
· · · · · · · · · ·   ● MCP Tools               4.9k   (1%)
· · · · · ◎ ◎ ◎ ◎ ◎   ◉ Messages               38.7k  (10%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            196.1k  (49%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

Summary: total steps 9, passed 9, failed 0.
Elapsed runtime: ~101s.
Flaky points: Timeline checkbox ("Только для проверки", i.e. "For verification only") did not respond to direct click twice (interaction timeout); state change required fallback interaction and page reload to restore normal entry list rendering.

MCP / Attempt #3

● Attempt #3

○ ○ ◌ ◌ ● ◉ ◉ ◉ ◉ ◉   gpt-5.3-codex · 79k/400k tokens (20%)
◉ ◉ ◉ ◉ ◉ · · · · ·   ○ System Prompt          10.0k   (2%)
· · · · · · · · · ·   ◌ System Tools            8.7k   (2%)
· · · · · · · · · ·   ● MCP Tools               4.9k   (1%)
· · · · · ◎ ◎ ◎ ◎ ◎   ◉ Messages               55.9k  (14%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   · Free Space            179.0k  (45%)
◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎ ◎   ◎ Buffer                141.6k  (35%)

Summary

 - Total steps: 9  
 - Passed: 9  
 - Failed: 0  
 - Elapsed runtime: 241s (4m 1s)  
 - Estimated tokens consumed: not available  
 - Flaky points observed:  
  1. Checkbox click initially failed due to non-interactive/stale UID; succeeded after fresh snapshot + label click.  
  2. Final wait_for on Главная ("Home") timed out once; page was already navigated and confirmed by subsequent snapshot.

App process and browser page were cleaned up at the end.
