Your AI agent just made 47 tool calls. How much did that cost?
If you answered "uh... no idea," you're not alone. Most developers building with MCP tools are flying blind when it comes to observability. Your AI client talks to MCP servers, tools get called, tokens get consumed, and your bill quietly climbs. But where is all that spend going?
Enter benchmark-broccoli: a transparent MCP proxy that sits between your AI client and any MCP server, measuring tokens, estimated cost, and latency for every single tool call. Think of it as the missing dashboard for your AI infrastructure, showing you exactly what's happening under the hood, in real time.
What Is MCP Tool-Call Monitoring (And Why You Need It)
The Model Context Protocol (MCP) is Anthropic's standard for connecting AI applications to data sources and tools. Your AI client (Opencode, Claude, Cursor, VS Code) calls MCP servers (filesystem, database, GraphQL, custom tools) to get things done. The catch: out of the box, you get almost no signal about those tool round-trips. You see only the model's usual chat tokens, not what each tools/call request actually cost you in time, tokens, or money.
MCP tool-call monitoring solves this by intercepting the communication between client and server, logging every interaction, and giving you actionable metrics:
- Token counts — Input/output tokens per call (using tiktoken's o200k_base encoding)
- Cost estimates — Per-call and per-session pricing across 12+ AI models
- Latency tracking — How long each tool call actually takes
- Session grouping — Automatic clustering of related calls by time gap
- Schema overhead — The hidden cost of listTools() payloads
Without this layer, you're optimizing in the dark. With it, you know exactly which tools are expensive, which prompts are inefficient, and where to focus your optimization efforts.
How benchmark-broccoli Works
benchmark-broccoli is a stdio MCP proxy written in TypeScript. You configure your AI client to run it instead of your real MCP server command, with the real command after --. Tool results are not rewritten—the proxy forwards work to the upstream server and only observes JSON payloads to count tokens and time. Metrics land in an append-only log: calls.jsonl (one JSON object per line).
The Architecture in 30 Seconds
Your AI Client (Cursor, Claude Desktop, etc.)
↓
benchmark-broccoli proxy (measures & logs)
↓
Any MCP Server (filesystem, database, custom)
The proxy:
- Receives MCP requests from your client via stdio
- Spawns the real MCP server as a child process
- Forwards requests, intercepts responses
- Counts tokens, estimates cost, records latency
- Appends structured data to calls.jsonl
- Streams updates to the live dashboard via Server-Sent Events
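The measure-and-log step above can be sketched in a few lines. This is an illustrative reconstruction, not benchmark-broccoli's actual source; measureCall and CallRecord are hypothetical names, and the record is trimmed to a subset of the fields the real log contains:

```typescript
import { appendFileSync } from "node:fs";

// Hypothetical record shape, trimmed from the calls.jsonl example below.
interface CallRecord {
  timestamp: string;
  tool: string;
  latencyMs: number;
}

// Wrap a forwarded tool call: time it, let the result pass through
// untouched, and append one JSON object per line to the log.
async function measureCall<T>(
  tool: string,
  forward: () => Promise<T>,
  logPath = "calls.jsonl",
): Promise<T> {
  const start = Date.now();
  const result = await forward(); // delegate to the real MCP server
  const record: CallRecord = {
    timestamp: new Date().toISOString(),
    tool,
    latencyMs: Date.now() - start,
  };
  appendFileSync(logPath, JSON.stringify(record) + "\n"); // append-only log
  return result;
}
```

The key property is that the result is returned unchanged: the proxy observes, it never rewrites.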
The result? A live dashboard that auto-updates as your AI works, showing you every tool call, every token, and every dollar (or fraction thereof).
Key Features That Actually Matter
1. Works With Any MCP Server (Seriously, Any)
Unlike monitoring solutions tied to specific tools, benchmark-broccoli is MCP-native. If it speaks the Model Context Protocol, you can measure it:
- Official / community servers (e.g. filesystem, Git, fetch) and bridges like mcp-remote for URL-based MCP
- Domain-specific servers you or others maintain (e.g. Postgres, MongoDB, or GraphQL — whatever your MCP server implements)
- Custom servers you wrote yourself
No SDK changes. No code modifications. Just proxy it.
2. Per-Query Session Tracking
Your AI doesn't make one tool call; it makes bursts of calls per prompt. benchmark-broccoli automatically groups these into sessions using a configurable time-gap heuristic (default: 30 seconds).
Each session shows:
- Total cost across all calls
- Schema overhead (attributed once per proxy restart)
- Call sequence and timing
- User identifier (from the MCP_USER env var)
This is game-changing for understanding per-prompt efficiency. Instead of seeing "347 calls today," you see "12 sessions, the first one cost $0.47, the second cost $0.02."
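The time-gap heuristic is simple enough to sketch. This is an illustrative version, not benchmark-broccoli's actual implementation; groupIntoSessions is a hypothetical name, and call timestamps are assumed to arrive in order:

```typescript
// Consecutive calls separated by no more than the gap (default 30 s)
// belong to the same session; a larger gap starts a new one.
// Assumes timestampsMs is sorted ascending, as an append-only log would be.
function groupIntoSessions(timestampsMs: number[], gapMs = 30_000): number[][] {
  const sessions: number[][] = [];
  for (const t of timestampsMs) {
    const current = sessions[sessions.length - 1];
    if (current !== undefined && t - current[current.length - 1] <= gapMs) {
      current.push(t); // continues the current burst
    } else {
      sessions.push([t]); // gap exceeded (or first call): new session
    }
  }
  return sessions;
}
```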
3. Multi-Model Cost Comparison
Not sure whether to use Claude Sonnet or GPT-4o? Switch models in the dashboard and see costs recalculate instantly using real token counts.
Built-in pricing for 12+ models, including:
- Claude 4 (Sonnet, Opus)
- Claude 3.5 (Sonnet, Haiku)
- GPT-4o, GPT-4o-mini
- GPT-4.1 (full, mini, nano)
- Gemini 2.5 (Pro, Flash)
Example: A session that cost $0.42 on Claude Sonnet would cost $2.10 on Claude Opus or $0.11 on Haiku. Suddenly, model selection becomes data-driven instead of gut-feel.
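The recalculation itself is just arithmetic over the recorded token counts. A minimal sketch, with placeholder per-million-token prices rather than benchmark-broccoli's actual table (vendor rates change; check the current rate cards):

```typescript
// Illustrative USD prices per million tokens — placeholders, not the
// project's real pricing data.
const PRICING: Record<string, { inputPerM: number; outputPerM: number }> = {
  "claude-sonnet-4": { inputPerM: 3.0, outputPerM: 15.0 },
  "gpt-4o-mini": { inputPerM: 0.15, outputPerM: 0.6 },
};

// Same token counts, different model: cost = tokens * rate / 1M.
function estimateCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = PRICING[model];
  if (!p) throw new Error(`no pricing entry for ${model}`);
  return (inputTokens * p.inputPerM + outputTokens * p.outputPerM) / 1_000_000;
}
```

Because the token counts are already logged, switching models in the dashboard only re-runs this arithmetic; nothing is re-measured.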
4. Live Dashboard (Dark Theme, Obviously)
The dashboard at http://127.0.0.1:3000 shows:
- Session cards — Collapsible, ordered by recency, with aggregated metrics
- Per-tool breakdown — Which tools are getting called, how often, at what cost
- Real-time updates — SSE-powered, no refresh needed
- Export to CSV — Download call history for deeper analysis
It's built with vanilla JS and a brutally clean dark UI. No framework bloat, just fast, functional observability.
5. JSONL Export for Downstream Analysis
Every call gets appended to calls.jsonl in structured format:
{
"timestamp": "2026-04-15T18:32:41.123Z",
"tool": "read_file",
"inputTokens": 1247,
"outputTokens": 523,
"cost": 0.0119,
"latencyMs": 342,
"sessionId": "session_abc123",
"user": "dev-team"
}
Feed this into your data warehouse, plot it in Grafana, or build custom alerts. The data is yours.
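Because the log is one JSON object per line, downstream analysis is a few lines of code. A sketch, assuming the record shape shown above (costByTool is a hypothetical helper, not part of the project):

```typescript
// Sum estimated cost per tool from the lines of a calls.jsonl file.
function costByTool(jsonlLines: string[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const line of jsonlLines) {
    if (line.trim() === "") continue; // tolerate trailing blank lines
    const { tool, cost } = JSON.parse(line) as { tool: string; cost: number };
    totals.set(tool, (totals.get(tool) ?? 0) + cost);
  }
  return totals;
}
```

In practice you would feed it something like fs.readFileSync("calls.jsonl", "utf8").split("\n").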
Getting Started in 3 Minutes
Step 1: Install
git clone https://github.com/Shriya-Chauhan/benchmark-broccoli.git
cd benchmark-broccoli
npm install
Step 2: Configure Your AI Client
Point your MCP config at the proxy. Example for Cursor (~/.cursor/mcp.json):
{
"mcpServers": {
"my-server": {
"command": "npx",
"args": [
"tsx", "/absolute/path/to/benchmark-broccoli/src/index.ts",
"--",
"npx", "-y", "mcp-remote", "https://my-mcp-server.example.com/mcp"
],
"env": {
"COST_MODEL": "claude-sonnet-4-20250514"
}
}
}
}
Everything after -- is your real server command. The proxy sits in front.
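The -- convention is easy to picture: everything before the separator configures the proxy, everything after it is the command to spawn. An illustrative sketch (splitAtDashDash and the example flag are hypothetical, not the proxy's real CLI):

```typescript
// Split argv at the "--" separator: proxy options on the left,
// the real MCP server command on the right.
function splitAtDashDash(argv: string[]): {
  proxyArgs: string[];
  serverCommand: string[];
} {
  const i = argv.indexOf("--");
  if (i === -1 || i === argv.length - 1) {
    throw new Error("expected the real MCP server command after --");
  }
  return { proxyArgs: argv.slice(0, i), serverCommand: argv.slice(i + 1) };
}
```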
Step 3: Launch the Dashboard
npm start
# → [dashboard] http://127.0.0.1:3000
Open it in your browser. Use your AI client normally. Watch the metrics roll in.
Who Is This For?
You should use benchmark-broccoli if you:
- Build AI agents or assistants with MCP servers
- Want to optimize prompt efficiency based on data
- Need to justify model choices with actual cost numbers
- Are tired of surprise AI bills
You probably don't need it if:
- You make <10 MCP calls per day (not enough signal)
- You don't care about costs (lucky you)
Performance & Overhead
Latency impact: ~2-15ms per tool call (token counting + JSONL append)
Memory footprint: <50MB for the proxy process
Dashboard overhead: Negligible — SSE pushes updates, no polling
For 1,000 calls/day, you're looking at ~10-15 seconds of added latency total. The observability is worth it.
Roadmap & Contributing
benchmark-broccoli is Apache 2.0 licensed and actively maintained by Shriya Chauhan.
Current priorities:
- Accounts & identity — Sign-up / sign-in, user profiles, and per-user (or per-team) dashboards instead of only MCP_USER in .env
- Alerts — Notify when estimated cost or latency crosses thresholds (per session, per tool, or daily cap)
- Comparative session analysis — Side-by-side runs for A/B prompt or workflow testing (same tools, different instructions)
- Support for streaming tool responses
Want to contribute?
- Fork the repo
- Create a feature branch (git checkout -b feature/my-change)
- Run npm test and npm run typecheck
- Push and open a PR
FAQ
Q: Does it support streaming responses?
A: Currently, it measures full request/response pairs. Streaming support is on the roadmap.
Q: What if my MCP server uses custom authentication?
A: The proxy is transparent — it forwards everything. Pass auth credentials via env vars in your MCP config, and the proxy will pass them through.
Q: Can I track multiple MCP servers at once?
A: Yes! Run one proxy instance per server, each writing to a different JSONL file. Point the dashboard at whichever file you want to visualize (or aggregate them externally).
Q: Is the token count 100% accurate?
A: It's ~95-98% accurate for Claude 3+, GPT-4+, and Gemini models using tiktoken. Edge cases (special tokens, legacy encodings) may vary slightly, but it's accurate enough for cost estimation and optimization.
Conclusion: Stop Flying Blind
If you're building with MCP and you don't have observability, you're leaving money and performance on the table. benchmark-broccoli gives you the metrics you need in under 5 minutes of setup.
Start now:
git clone https://github.com/Shriya-Chauhan/benchmark-broccoli.git
cd benchmark-broccoli
npm install && npm start
Your future self (and your finance team) will thank you.
Written with ❤️ for the MCP community. Star on GitHub · Report Issues