Wes

My MCP Server Was 2x Larger Than Playwright. Now It's 136x Smaller.

I built Charlotte, an open source browser MCP server that gives AI agents structured access to web pages. Navigation, observation, interaction, auditing. 40 tools across 6 categories, powered by headless Chromium.

When I first benchmarked it against Playwright MCP (Microsoft's browser MCP server), I expected Charlotte to be smaller. Structured representations should beat raw accessibility trees, right?

Charlotte was bigger. Not by a little. By a lot.

The Embarrassing First Benchmark

I ran both servers against the same pages and compared what the agent actually received. Here's what I found:

Page             Charlotte v0.1.3    Playwright MCP    Ratio
─────────────────────────────────────────────────────────────
Wikipedia          1,957,532 ch       1,040,636 ch     1.88x LARGER
GitHub repo          150,541 ch          80,297 ch     1.88x LARGER
Hacker News          156,897 ch          61,230 ch     2.56x LARGER

My structured, "agent-optimized" server was returning almost twice as much data as the tool that just dumps the accessibility tree. On Hacker News it was 2.5x worse.

I had a choice: quietly ignore the numbers, or figure out why.

Where the Bloat Was Hiding

The problem wasn't the architecture. Charlotte's structured observation model (landmarks, headings, interactive elements with stable IDs) was sound. The problem was serialization. I was sending everything, all the time, whether the agent needed it or not.

Three things were killing me:

1. Verbose JSON formatting. Pretty-printed output with indentation and newlines. Nice for human debugging. Terrible for token budgets. The accessibility tree I was building was genuinely more detailed than Playwright's, but all that detail was being formatted with whitespace that cost tokens and conveyed nothing.

2. Empty fields everywhere. Every element had every possible property, even when those properties were null or empty. A simple link with just text and an href was being serialized with empty arrays for children, null values for bounding boxes, empty strings for ARIA attributes. Most of the payload was describing the absence of things.

3. No concept of "how much do you need?" Charlotte returned the full page representation on every call. Navigate to Wikipedia? Here's every heading, every link, every paragraph, every image, every form element, every table cell. The agent just wanted to know what page it was on.

Phase 1: Compact Serialization (38% Reduction)

First pass was mechanical. Switch to compact JSON. Strip null and empty fields. Remove whitespace. A link that was taking 340 characters dropped to under 100 without losing any information.
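The whole pass fits in a few lines. Here's an illustrative sketch (not Charlotte's actual code) of recursively pruning empty fields before compact serialization:

```typescript
// Recursively drop nulls, empty strings, empty arrays, and empty objects,
// then serialize with no indentation. Meaningful falsy values (0, false)
// are kept.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function prune(value: Json): Json | undefined {
  if (value === null || value === "") return undefined;
  if (Array.isArray(value)) {
    const kept = value.map(prune).filter((v): v is Json => v !== undefined);
    return kept.length > 0 ? kept : undefined;
  }
  if (typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [k, v] of Object.entries(value)) {
      const p = prune(v);
      if (p !== undefined) out[k] = p;
    }
    return Object.keys(out).length > 0 ? out : undefined;
  }
  return value;
}

// JSON.stringify with no indent argument emits zero whitespace.
function serialize(element: Json): string {
  return JSON.stringify(prune(element) ?? {});
}

// A link that once carried empty arrays and nulls...
const verbose: Json = {
  role: "link", text: "Docs", href: "/docs",
  children: [], bounds: null, aria: { label: "", describedBy: null },
};
// ...collapses to just the fields that say something.
console.log(serialize(verbose)); // {"role":"link","text":"Docs","href":"/docs"}
```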

This alone brought Charlotte from 2x larger than Playwright to roughly the same size. Progress, but not the goal.

Phase 2: Three Detail Levels (Additional 30% Reduction)

This was the architectural insight that changed everything. Agents don't read pages the way humans do. They orient first, then drill down. So Charlotte should support that workflow natively:

  • Minimal returns landmarks, headings, and interactive element counts grouped by landmark. On Hacker News that's 336 characters. The agent sees "main: 47 links, 0 buttons" and knows what it's working with.
  • Summary adds content summaries, form structures, and error states.
  • Full includes all visible text content and complete element data.

Navigate defaults to minimal. The agent orients cheaply, calls find to locate specific elements, and only requests full detail when it actually needs to read content. Most browsing sessions never need full detail on most pages.

The key: the full element data is still there internally. find, click, type, and diff all work against it. The agent just doesn't pay to see it until it asks.
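Here's a hedged sketch of how the detail levels can work. The field names and shapes are my own invention, not Charlotte's actual API; the point is that the full model is always built, and the detail level only controls what gets serialized out:

```typescript
type Detail = "minimal" | "summary" | "full";

interface PageElement { id: string; role: string; text: string; landmark: string }

interface PageModel {
  url: string;
  headings: string[];
  elements: PageElement[]; // complete data, always available internally to find/click/diff
  errors: string[];
}

function observe(page: PageModel, detail: Detail = "minimal") {
  // Orientation data: interactive element counts grouped by landmark.
  const counts: Record<string, Record<string, number>> = {};
  for (const el of page.elements) {
    const group = (counts[el.landmark] ??= {});
    group[el.role] = (group[el.role] ?? 0) + 1;
  }
  if (detail === "minimal") {
    // Landmarks, headings, counts. Nothing else.
    return { url: page.url, headings: page.headings, interactive: counts };
  }
  if (detail === "summary") {
    // Layers on error states (and, in Charlotte, content summaries and forms).
    return { url: page.url, headings: page.headings, interactive: counts, errors: page.errors };
  }
  // Full: complete element data included.
  return { url: page.url, headings: page.headings, elements: page.elements, errors: page.errors };
}
```

Navigate would call this with the default, so the agent pays for minimal unless it asks for more.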

Phase 3: Interactive Summaries (Additional 8% Reduction)

The last piece was replacing element arrays with grouped summaries at minimal detail. Instead of returning 1,847 individual link objects on Wikipedia, minimal shows:

{"main": {"link": 1847, "button": 3}}

The agent knows there are 1,847 links in the main landmark. If it needs them, it calls find({ type: "link" }) and gets exactly the ones it wants, filtered by text, landmark, or both. It never has to receive all 1,847 just to find the one it cares about.
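The find side of that bargain might look like this. The shapes are hypothetical, but the filters match what the post describes: type, text, landmark, or any combination:

```typescript
interface El { id: string; role: string; text: string; landmark: string }

// Filter the internal element model; only matches are serialized back
// to the agent. Omitted query fields match everything.
function find(elements: El[], q: { type?: string; text?: string; landmark?: string }): El[] {
  return elements.filter((el) =>
    (!q.type || el.role === q.type) &&
    (!q.landmark || el.landmark === q.landmark) &&
    (!q.text || el.text.toLowerCase().includes(q.text.toLowerCase()))
  );
}

// e.g. find(elements, { type: "link", text: "edit" }) returns the one
// link the agent cares about, not all 1,847.
```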

The Result

Page             Charlotte v0.2.0    Playwright MCP     Advantage
──────────────────────────────────────────────────────────────────
Wikipedia             7,667 ch       1,040,636 ch       136x smaller
Hacker News             336 ch          61,230 ch       182x smaller
GitHub repo           3,185 ch          80,297 ch        25x smaller
httpbin form            364 ch           2,255 ch         6x smaller

From 2x larger to 136x smaller: a 76% total reduction across the three optimization phases (38% + 30% + 8%), compounded by minimal being the default detail level.

The cost difference is tangible. A 100-page browsing session on Claude Opus went from being roughly comparable to Playwright's cost to costing a fraction:

100-page session on Claude Opus:
  Charlotte:   ~$0.09
  Playwright:  ~$15.30

What I Learned

Measure before you believe your own pitch. I was confident Charlotte would be smaller. The architecture was designed for it. But design intent doesn't survive contact with serialization. If I hadn't run the benchmarks honestly, I'd still be shipping a server that was 2x worse than the competition while claiming it was better.

Agents aren't humans. The biggest win wasn't compression. It was recognizing that agents interact with pages in phases: orient, focus, act, verify. Building the tool around that workflow (detail levels, semantic find, structural diffing) reduced tokens not by squeezing the same data smaller, but by never sending data the agent didn't need yet.

Token efficiency is a first-class design concern for MCP servers. Every character your server returns enters the agent's context window and competes with conversation history, reasoning space, and other tools. MCP server developers should be benchmarking their output size the way web developers benchmark page load time.
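A minimal harness for that kind of benchmarking might look like this. Note the ~4 characters per token is a rough heuristic, not a real tokenizer; use your model provider's tokenizer for billing math:

```typescript
// Measure what a tool response actually costs in context: raw characters
// plus a rough token estimate.
function measure(name: string, payload: unknown) {
  const text = typeof payload === "string" ? payload : JSON.stringify(payload);
  const chars = text.length;
  const tokens = Math.ceil(chars / 4); // heuristic only
  return { name, chars, tokens };
}

// Run it against every tool's output on a few representative pages and
// watch the numbers, the same way you'd watch page weight.
console.log(measure("observe(minimal)", { main: { link: 47, button: 0 } }));
```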

And Then We Went Further

With v0.4.0 (just shipped), we turned the same lens on tool definitions. Every MCP tool has a schema (name, description, input parameters) that gets injected into the agent's context on every API round-trip. With 40 tools loaded, that's ~7,200 tokens of overhead per call.

Not every session needs 40 tools. An agent browsing a site doesn't need dev_serve or dev_audit. An agent filling a form doesn't need viewport management or network throttling.

v0.4.0 introduces startup profiles:

charlotte --profile browse    # 22 tools (new default)
charlotte --profile core      # 7 tools (lightweight)
charlotte --profile full      # 40 tools (old behavior)
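Profile selection can be sketched as a registry of tool groups and a profile-to-groups map. The registry shape and most tool names below are placeholders of mine; dev_serve, dev_audit, and dev_inject are from the post:

```typescript
const TOOL_GROUPS: Record<string, string[]> = {
  // Hypothetical grouping; only the dev_mode names are Charlotte's real tools.
  core: ["navigate", "observe", "find", "click", "type", "diff", "close"],
  browse: ["scroll", "back", "screenshot"], // placeholder names
  dev_mode: ["dev_serve", "dev_audit", "dev_inject"],
};

const PROFILES: Record<string, string[]> = {
  core: ["core"],
  browse: ["core", "browse"],
  full: Object.keys(TOOL_GROUPS), // everything
};

// Only the selected profile's tool schemas are ever advertised to the
// agent, so unused definitions never enter the context window.
function toolsForProfile(profile: string): string[] {
  const groups = PROFILES[profile] ?? PROFILES["browse"]; // browse is the default
  return groups.flatMap((g) => TOOL_GROUPS[g] ?? []);
}
```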

And a runtime meta-tool that lets agents toggle tool groups mid-session:

charlotte:tools enable dev_mode     → activates dev_serve, dev_audit, dev_inject
charlotte:tools disable dev_mode    → deactivates them

No restart needed. The MCP SDK sends notifications/tools/list_changed and Claude Code picks up changes immediately.
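The runtime toggle reduces to a set of enabled groups plus a notification. Again a sketch with my own shapes, not the real SDK surface:

```typescript
// Groups currently advertised to the agent.
const enabledGroups = new Set<string>(["core", "browse"]);

// Handler for the charlotte:tools meta-tool. Mutates the set, then asks
// the server to re-advertise its tool list; in the real server that is
// where the notifications/tools/list_changed message goes out.
function handleToolsMeta(
  action: "enable" | "disable",
  group: string,
  notify: () => void,
) {
  if (action === "enable") enabledGroups.add(group);
  else enabledGroups.delete(group);
  notify();
}
```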

Measured results:

  • browse profile: 48% less tool definition overhead vs full
  • core profile: 77% less tool definition overhead vs full

Over a 100-page browsing session, that's roughly 1.4 million fewer definition tokens with browse vs full. In a real end-to-end form interaction benchmark (12 calls), core profile cut total tokens from 91k to 25k. A 72% reduction.

The pattern is the same one that worked for page representations: don't send what the agent doesn't need yet, and make it easy to ask for more when it does.

Try It

Charlotte is MIT licensed, published on npm, and listed in the MCP registry and several third-party directories.

{
  "mcpServers": {
    "charlotte": {
      "command": "npx",
      "args": ["-y", "@ticktockbent/charlotte"]
    }
  }
}

If you're building MCP servers, benchmark your output. You might be surprised what you find. I was.
