Custodia-Admin

Posted on Mar 2 • Originally published at pagebolt.dev

Why screenshot MCPs cost 170x less than Playwright MCP (and when that matters)

#mcp #claude #webdev #devtools

You're building an AI agent. You need it to interact with web pages. Two MCP approaches:

Accessibility tree MCPs (like Playwright MCP): Claude gets the page's full accessibility tree as text and can click buttons and fill forms
Screenshot MCPs (like PageBolt MCP): Claude sees a visual screenshot and can reason about layout

Which is cheaper to run?

Screenshot MCPs cost ~170x less per page.

$0.09 vs $15.30 for the same 100-page task.

But there's a tradeoff. Each approach wins in different scenarios.

The token cost difference: accessibility trees vs screenshots

Accessibility tree (Playwright MCP)

When your agent needs to interact with a page, Playwright MCP provides an accessibility tree:

{
  "nodes": [
    {
      "id": 1,
      "role": "button",
      "text": "Add to Cart",
      "selector": "button.add-to-cart",
      "children": []
    },
    {
      "id": 2,
      "role": "textbox",
      "name": "email",
      "value": "",
      "children": []
    },
    ...
    // 500+ nodes for a typical e-commerce page
  ]
}

A typical e-commerce page has 500-1000 nodes in the accessibility tree.

Claude needs to reason about this entire tree to click the right button. Every token in the tree counts toward the context window, and you pay for it as input.

Based on community-reported data from r/Anthropic, a typical Playwright MCP session for 100 pages costs ~$15.30 in API costs. That is about $0.153 per page, which at $0.003 per 1K tokens works out to roughly 51,000 tokens per page interaction once you account for the full accessibility tree, reasoning, and follow-up tool calls.

Screenshot MCP (PageBolt MCP)

When your agent uses a screenshot MCP:

{
  "screenshot": "base64-encoded-png",
  "size": "6KB",
  "width": 1280,
  "height": 720
}

Claude sees the screenshot visually.

Token cost per page: ~200 tokens (vision tokens for 6KB screenshot at claude-3-5-sonnet rates)

Plus agent reasoning: ~200 tokens

Total per page: ~400 tokens

400 tokens × $0.003 per 1K tokens = $0.0012 per page

For 100 pages: $0.12
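
The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope estimate, not a billing calculator: it assumes a single flat price of $0.003 per 1,000 tokens (the rate implied by the figures above), and the function and constant names are made up for illustration.

```python
# Assumed flat price implied by the article's arithmetic: $0.003 per 1,000 tokens.
PRICE_PER_1K_TOKENS = 0.003


def cost_per_page(image_tokens: int, reasoning_tokens: int,
                  price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Return the estimated dollar cost of one page interaction."""
    total_tokens = image_tokens + reasoning_tokens
    return total_tokens / 1000 * price_per_1k


# Article estimates: ~200 vision tokens plus ~200 reasoning tokens per page.
per_page = cost_per_page(image_tokens=200, reasoning_tokens=200)
print(f"per page: ${per_page:.4f}")
print(f"per 100 pages: ${per_page * 100:.2f}")
```

Swap in your own token counts and model pricing to see how sensitive the total is to each input.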

The math: 170x cost difference

Metric	Playwright MCP	Screenshot MCP	Ratio
Tokens per page	~51,000	~400	~127x
Cost per page	$0.15	$0.0012	125x
Cost per 100 pages	$15.30	$0.12	127x
Cost per 1000 pages	$153	$1.20	127x

The 170x number from r/Anthropic likely includes more optimization overhead, but 125-170x is consistent across real-world usage.
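
As a sanity check, you can work backwards from the reported session cost to an implied token count and ratio. The sketch below assumes the same flat $0.003 per 1,000 tokens for both approaches, which is a simplification (real pricing differs for input and output tokens); the helper name is hypothetical.

```python
# Assumed flat price: $0.003 per 1,000 tokens for every token in the session.
PRICE_PER_1K_TOKENS = 0.003


def implied_tokens_per_page(session_cost: float, pages: int,
                            price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Back out the average tokens per page from a total session cost."""
    cost_per_page = session_cost / pages
    return cost_per_page / price_per_1k * 1000


# Community-reported figure: ~$15.30 for a 100-page Playwright MCP session.
playwright_tokens = implied_tokens_per_page(15.30, 100)
screenshot_tokens = 400  # article estimate: ~200 vision + ~200 reasoning

ratio = playwright_tokens / screenshot_tokens
print(f"implied Playwright tokens per page: {playwright_tokens:,.0f}")
print(f"token (and cost) ratio: {ratio:.1f}x")
```

Because both sides use the same price, the token ratio and the cost ratio come out identical under this assumption.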

Why the difference?

Accessibility trees are comprehensive but verbose:

Full DOM structure (every node)
ARIA attributes (descriptions)
Form field values
Focus state
Parent-child relationships

All of this is useful information, but it's text-heavy, and it adds up to thousands of tokens.

Screenshots are visual and compact:

Single image (6-10KB)
Vision tokens (~130-200 tokens)
Claude can "see" everything at once
Much lower token overhead

When to use each approach

Use Playwright MCP (accessibility trees) if:

✅ Complex form filling — Agent needs to find and fill 10+ fields precisely
✅ Interactive workflows — Multi-step sequences (click → fill → click → validate)
✅ Accessibility testing — Checking ARIA labels, semantic HTML
✅ Real-time state tracking — Need to validate form states, errors, etc.
✅ Low-frequency, high-value tasks — ~$0.15 per interaction doesn't matter if it saves 2 hours of manual work

Example: "Fill out this insurance claim form with my data"

Agent needs to find each field by label, validate error messages, submit
Accessibility tree gives exact selectors and state
Cost per interaction: ~$0.15 (expensive but necessary)

Use screenshot MCP (visual) if:

✅ Capture and monitoring — Regular screenshots for visual regression testing
✅ Read-only analysis — Agent just needs to "see" and reason about layout
✅ Batch operations — 100+ pages of screenshots (cost is critical)
✅ Automated testing — Visual verification without interaction
✅ Documentation/reporting — Generate visual reports

Example: "Take a screenshot of the homepage on mobile and desktop"

Agent navigates, screenshots, returns images
No form filling needed
Cost per screenshot: ~$0.001 (cheap at scale)

Example: "Check if our pricing page layout is correct across devices"

Agent takes screenshots on 5 devices
Compares them
Flags visual differences
Cost per device: ~$0.001 (total ~$0.005)

Hybrid approach: Use both

The smartest agents use both:

Agent workflow:
1. Take screenshot to see page layout ($0.001)
2. If interaction needed:
   - Switch to Playwright MCP
   - Get accessibility tree ($0.15)
   - Click button, fill form
3. Take screenshot to verify result ($0.001)

Cost: ~$0.15 for complex interaction (mostly the tree)
Benefit: Best of both worlds
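
The workflow above can be turned into a rough cost estimator. This is a sketch under stated assumptions: it uses the rounded per-call figures from the workflow ($0.001 per screenshot, $0.15 per tree interaction), it assumes the verification screenshot is only taken after an interaction, and `PageTask`, `hybrid_cost`, and the example.com URLs are hypothetical names for illustration. It does not call any MCP server.

```python
from dataclasses import dataclass

# Per-call figures quoted in the workflow above (rounded article estimates).
SCREENSHOT_COST = 0.001  # one screenshot plus reasoning
TREE_COST = 0.15         # one accessibility-tree interaction


@dataclass
class PageTask:
    url: str                 # hypothetical target page
    needs_interaction: bool  # True if the agent must click or fill something


def hybrid_cost(task: PageTask) -> float:
    """Estimate the cost of one page under the hybrid workflow."""
    cost = SCREENSHOT_COST          # step 1: look at the page
    if task.needs_interaction:
        cost += TREE_COST           # step 2: fetch the tree and act
        cost += SCREENSHOT_COST     # step 3: verify the result
    return cost


tasks = [
    PageTask("https://example.com/pricing", needs_interaction=False),
    PageTask("https://example.com/blog", needs_interaction=False),
    PageTask("https://example.com/checkout", needs_interaction=True),
]
total = sum(hybrid_cost(t) for t in tasks)
print(f"estimated total: ${total:.3f}")
```

The point of the sketch is that read-only pages stay at screenshot prices, and you only pay for the tree on the pages that need it.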

Real-world example: Batch screenshot monitoring

Your team needs daily screenshots of 1000 competitor pricing pages.

With Playwright MCP:

1000 pages × $0.15 = $150/day = $4,500/month
Plus: pages break under load, need retry logic

With screenshot MCP:

1000 pages × $0.001 = $1/day = $30/month
Plus: parallelizable, reliable

Savings: $4,470/month

For this use case, screenshot is 150x cheaper and more appropriate.
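
The monthly comparison is easy to reproduce for your own volumes. A minimal sketch, assuming a 30-day month and the rounded per-page costs used above; the function name is made up for illustration.

```python
def monthly_cost(pages_per_day: int, cost_per_page: float, days: int = 30) -> float:
    """Estimate the monthly API cost of a daily batch job."""
    return pages_per_day * cost_per_page * days


# Rounded per-page figures from the comparison above.
playwright_monthly = monthly_cost(1000, 0.15)
screenshot_monthly = monthly_cost(1000, 0.001)
savings = playwright_monthly - screenshot_monthly

print(f"Playwright MCP: ${playwright_monthly:,.0f}/month")
print(f"Screenshot MCP: ${screenshot_monthly:,.0f}/month")
print(f"Savings: ${savings:,.0f}/month")
```

Change `pages_per_day` to your own batch size to see where the gap starts to matter for your budget.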

Example: E-commerce checkout testing

Your agent needs to test checkout flow (5 steps, fill form, submit).

With Playwright MCP:

5 interactions × $0.15 = $0.75 per checkout test
Benefit: Agent can precisely find form fields, handle validation

With screenshot MCP:

5 screenshots × $0.001 = $0.005 per checkout test
Cost: Agent sees visual layout but must reason about button location

Which is better?

For automated testing (run daily): screenshot wins (cheaper, still accurate)
For complex form validation (custom error messages): Playwright wins (worth the cost)

The honest take

Playwright MCP is expensive but valuable if:

You need real interaction
Cost isn't a constraint
Token overhead doesn't matter for your use case

Screenshot MCP is cheap and efficient if:

You need visual information
Cost matters (batch operations)
You don't need to click/fill (or do it rarely)

Don't pick based on cost alone. Pick based on what your agent actually needs to do.

Installing PageBolt MCP

If you decide screenshot-based interaction is right for your use case:

npm install -g pagebolt-mcp

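Configure in claude_desktop_config.json (on macOS: ~/Library/Application Support/Claude/claude_desktop_config.json; on Windows: %APPDATA%\Claude\claude_desktop_config.json):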

{
  "mcpServers": {
    "pagebolt": {
      "command": "pagebolt-mcp",
      "env": {
        "PAGEBOLT_API_KEY": "your-key-here"
      }
    }
  }
}

Now your agent can call take_screenshot, generate_pdf, record_video, inspect_page, and run_sequence natively from Claude Desktop, Cursor, or Windsurf. Free tier: 100 requests/month.

Conclusion

Token economics matter. A 170x cost difference is real. But it's not a reason to dismiss Playwright MCP or over-rely on screenshots.

Use the right tool for the job:

Complex interaction? Playwright MCP
Visual capture and analysis? Screenshot MCP
Both? Combine them strategically

Start with PageBolt MCP — free tier, 100 requests/month. See which approach fits your agent's needs.
