DEV Community

Custodia-Admin

Posted on • Originally published at pagebolt.dev

Why screenshot MCPs cost 170x less than Playwright MCP (and when that matters)

You're building an AI agent. You need it to interact with web pages. Two MCP approaches:

  1. Accessibility tree MCPs (like Playwright MCP): Claude gets the page's accessibility tree as structured text and can click buttons and fill forms
  2. Screenshot MCPs (like PageBolt MCP): Claude sees a visual screenshot and can reason about layout

Which is cheaper to run?

Screenshot MCPs cost ~170x less per page.

$0.09 vs $15.30 for the same task.

But there's a tradeoff. Each approach wins in different scenarios.

The token cost difference: accessibility trees vs screenshots

Accessibility tree (Playwright MCP)

When your agent needs to interact with a page, Playwright MCP provides an accessibility tree:

{
  "nodes": [
    {
      "id": 1,
      "role": "button",
      "text": "Add to Cart",
      "selector": "button.add-to-cart",
      "children": []
    },
    {
      "id": 2,
      "role": "textbox",
      "name": "email",
      "value": "",
      "children": []
    },
    ...
    // 500+ nodes for a typical e-commerce page
  ]
}

A typical e-commerce page has 500-1000 nodes in the accessibility tree.

Claude needs to reason about this entire tree to click the right button. Every node is serialized into the context window, so every node adds input tokens.
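To get a feel for how quickly a serialized tree grows, here is a rough sketch that builds 500 synthetic nodes shaped like the example above and estimates their token count. The 4-characters-per-token ratio is a common rule of thumb rather than Claude's actual tokenizer, and `make_node` and `estimate_tokens` are hypothetical helpers written for this illustration.

```python
import json

CHARS_PER_TOKEN = 4  # assumption: rough heuristic, NOT the real tokenizer


def make_node(node_id: int) -> dict:
    """Build a synthetic node shaped like the example tree above."""
    return {
        "id": node_id,
        "role": "button",
        "text": "Add to Cart",
        "selector": "button.add-to-cart",
        "children": [],
    }


def estimate_tokens(payload: dict) -> int:
    """Approximate the token count from the serialized JSON length."""
    return len(json.dumps(payload)) // CHARS_PER_TOKEN


# 500 nodes is the low end of the range quoted for an e-commerce page.
tree = {"nodes": [make_node(i) for i in range(1, 501)]}
print(f"~{estimate_tokens(tree):,} tokens for {len(tree['nodes'])} nodes")
```

Real snapshots vary a lot in shape and verbosity, so treat this as an order-of-magnitude check and measure real payloads with your provider's token counter.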

Based on community-reported data from r/Anthropic, a typical Playwright MCP session for 100 pages costs ~$15.30 in API costs, or about $0.15 per page. At $3 per million tokens, that implies up to ~51,000 tokens per page once you account for the full accessibility tree, reasoning, and follow-up tool calls.
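You can work backwards from the reported spend to a token count. The sketch below assumes a flat $3 per million tokens (the $0.003 per 1K rate used later in this post) and ignores the higher price of output tokens, so it gives an upper bound rather than an exact figure. `implied_tokens_per_page` is a hypothetical helper.

```python
def implied_tokens_per_page(
    total_cost_usd: float, pages: int, usd_per_million_tokens: float
) -> float:
    """Convert a total API bill into an implied token count per page."""
    cost_per_page = total_cost_usd / pages
    return cost_per_page / usd_per_million_tokens * 1_000_000


# Community-reported figure: ~$15.30 for a 100-page Playwright MCP session.
tokens = implied_tokens_per_page(15.30, 100, 3.0)
print(f"{tokens:,.0f} tokens per page")  # 51,000 tokens per page
```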

Screenshot MCP (PageBolt MCP)

When your agent uses a screenshot MCP:

{
  "screenshot": "base64-encoded-png",
  "size": "6KB",
  "width": 1280,
  "height": 720
}

Claude sees the screenshot visually.

Token cost per page: ~200 tokens (vision tokens for a small 6KB screenshot at claude-3-5-sonnet rates; image token counts scale with pixel dimensions, so larger captures cost more)

Plus agent reasoning: ~200 tokens

Total per page: ~400 tokens

400 tokens × $0.003 per 1K tokens = $0.0012 per page

For 100 pages: $0.12
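The same arithmetic as a small sketch, so you can plug in your own token counts. The $0.003 per 1K tokens rate is the one used above; `page_cost` is a hypothetical helper, and the 200/200 split is this post's estimate rather than a measured value.

```python
USD_PER_1K_TOKENS = 0.003  # rate used in the math above


def page_cost(tokens_per_page: int, usd_per_1k: float = USD_PER_1K_TOKENS) -> float:
    """Cost in USD for one page at a flat per-1K-token price."""
    return tokens_per_page / 1000 * usd_per_1k


vision_tokens = 200     # estimate for the screenshot itself
reasoning_tokens = 200  # estimate for the agent's reasoning
per_page = page_cost(vision_tokens + reasoning_tokens)

print(f"per page:  ${per_page:.4f}")        # per page:  $0.0012
print(f"100 pages: ${per_page * 100:.2f}")  # 100 pages: $0.12
```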

The math: 170x cost difference

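| Metric | Playwright MCP | Screenshot MCP | Ratio |
| --- | --- | --- | --- |
| Tokens per page (implied at $3 per million tokens) | ~51,000 | ~400 | ~127x |
| Cost per page | ~$0.153 | $0.0012 | ~127x |
| Cost per 100 pages | $15.30 | $0.12 | ~127x |
| Cost per 1000 pages | $153 | $1.20 | ~127x |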

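The 170x number from r/Anthropic ($15.30 vs $0.09) likely reflects slightly different assumptions: $0.09 per 100 pages works out to about 300 tokens per screenshot page rather than the 400 estimated here. Either way, the gap is roughly two orders of magnitude.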

Why the difference?

Accessibility trees are comprehensive but verbose:

  • Full DOM structure (every node)
  • ARIA attributes (descriptions)
  • Form field values
  • Focus state
  • Parent-child relationships

All of this is useful information, but it's text-heavy, and it adds up to thousands of tokens per snapshot.

Screenshots are visual and compact:

  • Single image (6-10KB)
  • Vision tokens (~130-200 tokens)
  • Claude can "see" everything at once
  • Much lower token overhead

When to use each approach

Use Playwright MCP (accessibility trees) if:

  • Complex form filling: the agent needs to find and fill 10+ fields precisely
  • Interactive workflows: multi-step sequences (click → fill → click → validate)
  • Accessibility testing: checking ARIA labels and semantic HTML
  • Real-time state tracking: you need to validate form states, errors, etc.
  • Low-frequency, high-value tasks: ~$0.15 per page doesn't matter if it saves 2 hours of manual work

Example: "Fill out this insurance claim form with my data"

  • Agent needs to find each field by label, validate error messages, submit
  • Accessibility tree gives exact selectors and state
  • Cost per interaction: ~$0.15 (expensive next to a screenshot, but necessary)

Use screenshot MCP (visual) if:

  • Capture and monitoring: regular screenshots for visual regression testing
  • Read-only analysis: the agent just needs to "see" and reason about layout
  • Batch operations: 100+ pages of screenshots (cost is critical)
  • Automated testing: visual verification without interaction
  • Documentation/reporting: generate visual reports

Example: "Take a screenshot of the homepage on mobile and desktop"

  • Agent navigates, screenshots, returns images
  • No form filling needed
  • Cost per screenshot: ~$0.001 (cheap at scale)

Example: "Check if our pricing page layout is correct across devices"

  • Agent takes screenshots on 5 devices
  • Compares them
  • Flags visual differences
  • Cost per device: ~$0.001 (total ~$0.005)

Hybrid approach: Use both

The smartest agents use both:

Agent workflow:
1. Take screenshot to see page layout ($0.001)
2. If interaction needed:
   - Switch to Playwright MCP
   - Get accessibility tree ($0.15)
   - Click button, fill form
3. Take screenshot to verify result ($0.001)

Cost: ~$0.15 for complex interaction (mostly the tree)
Benefit: Best of both worlds
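The workflow above could be sketched roughly like this. Nothing here calls a real MCP server: `Step`, `plan_workflow`, and the tool labels are hypothetical names for illustration, and the per-step costs are the estimates from this post.

```python
from dataclasses import dataclass

SCREENSHOT_COST = 0.001  # USD per screenshot (estimate from this post)
TREE_COST = 0.15         # USD per accessibility-tree interaction (estimate)


@dataclass
class Step:
    tool: str        # hypothetical label: "screenshot" or "playwright"
    action: str
    cost_usd: float


def plan_workflow(needs_interaction: bool) -> list[Step]:
    """Start with a cheap screenshot; pay for the tree only when needed."""
    steps = [Step("screenshot", "inspect page layout", SCREENSHOT_COST)]
    if needs_interaction:
        steps.append(Step("playwright", "read tree, click, fill", TREE_COST))
        steps.append(Step("screenshot", "verify result", SCREENSHOT_COST))
    return steps


def total_cost(steps: list[Step]) -> float:
    return sum(step.cost_usd for step in steps)


for flag in (False, True):
    plan = plan_workflow(flag)
    print(f"interaction={flag}: {len(plan)} steps, ${total_cost(plan):.3f}")
```

This sketch skips the verification screenshot when nothing on the page was changed, which is one possible reading of step 3; always taking it adds about $0.001 per run.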

Real-world example: Batch screenshot monitoring

Your team needs daily screenshots of 1000 competitor pricing pages.

With Playwright MCP:

  • 1000 pages × $0.15 = $150/day = $4,500/month
  • Plus: pages break under load, need retry logic

With screenshot MCP:

  • 1000 pages × $0.001 = $1/day = $30/month
  • Plus: parallelizable, reliable

Savings: $4,470/month

For this use case, a screenshot MCP is 150x cheaper and more appropriate.
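Here is the monthly math as a sketch you can rerun with your own volumes. It assumes a 30-day month and the rounded per-page costs used in this example; `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(pages_per_day: int, usd_per_page: float, days: int = 30) -> float:
    """Total monthly spend for a fixed daily page volume."""
    return pages_per_day * usd_per_page * days


playwright = monthly_cost(1000, 0.15)   # rounded Playwright MCP cost per page
screenshot = monthly_cost(1000, 0.001)  # rounded screenshot MCP cost per page

print(f"Playwright MCP: ${playwright:,.0f}/month")               # $4,500/month
print(f"Screenshot MCP: ${screenshot:,.0f}/month")               # $30/month
print(f"Savings:        ${playwright - screenshot:,.0f}/month")  # $4,470/month
```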

Example: E-commerce checkout testing

Your agent needs to test checkout flow (5 steps, fill form, submit).

With Playwright MCP:

  • 5 interactions × $0.15 = $0.75 per checkout test
  • Benefit: Agent can precisely find form fields, handle validation

With screenshot MCP:

  • 5 screenshots × $0.001 = $0.005 per checkout test
  • Tradeoff: the agent sees the visual layout but must infer where buttons and fields are

Which is better?

  • For automated testing (run daily): screenshot wins (cheaper, still accurate)
  • For complex form validation (custom error messages): Playwright wins (worth the cost)

The honest take

Playwright MCP is expensive but valuable if:

  • You need real interaction
  • Cost isn't a constraint
  • Token overhead doesn't matter for your use case

Screenshot MCP is cheap and efficient if:

  • You need visual information
  • Cost matters (batch operations)
  • You don't need to click/fill (or do it rarely)

Don't pick based on cost alone. Pick based on what your agent actually needs to do.

Installing PageBolt MCP

If you decide screenshot-based interaction is right for your use case:

npm install -g pagebolt-mcp

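Configure in claude_desktop_config.json (on macOS, ~/Library/Application Support/Claude/claude_desktop_config.json; on Windows, %APPDATA%\Claude\claude_desktop_config.json):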

{
  "mcpServers": {
    "pagebolt": {
      "command": "pagebolt-mcp",
      "env": {
        "PAGEBOLT_API_KEY": "your-key-here"
      }
    }
  }
}
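If you already have other MCP servers configured, you want to merge the new entry rather than overwrite the file. A minimal sketch of that merge, working on an in-memory dict: `add_mcp_server` is a hypothetical helper, and the `other` server entry is made up for the example.

```python
import json


def add_mcp_server(config: dict, name: str, command: str, env: dict) -> dict:
    """Return a copy of config with one more entry under mcpServers."""
    servers = dict(config.get("mcpServers", {}))
    servers[name] = {"command": command, "env": dict(env)}
    return {**config, "mcpServers": servers}


# Hypothetical existing config with one server already registered.
existing = {"mcpServers": {"other": {"command": "other-mcp"}}}

updated = add_mcp_server(
    existing,
    name="pagebolt",
    command="pagebolt-mcp",
    env={"PAGEBOLT_API_KEY": "your-key-here"},
)
print(json.dumps(updated, indent=2))
```

To apply it for real, you would load the config file with `json.load`, pass it through the helper, and write the result back with `json.dump`.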

Now your agent can call take_screenshot, generate_pdf, record_video, inspect_page, and run_sequence natively from Claude Desktop, Cursor, or Windsurf. Free tier: 100 requests/month.

Conclusion

Token economics matter. A 170x cost difference is real. But it's not a reason to dismiss Playwright MCP or over-rely on screenshots.

Use the right tool for the job:

  • Complex interaction? Playwright MCP
  • Visual capture and analysis? Screenshot MCP
  • Both? Combine them strategically

Start with PageBolt MCP — free tier, 100 requests/month. See which approach fits your agent's needs.
