Custodia-Admin

Posted on Mar 22 • Edited on Mar 25 • Originally published at pagebolt.dev

How I Built a Claude Tool That Screenshots Every Page It Visits

#claude #mcp #aiagents #webdev

Claude is incredibly powerful. I use it in Claude Desktop to browse websites, fill forms, extract data, and monitor pages. But my browser tools only hand it text.

When Claude clicks a button, it can't see the result. When Claude submits a form, it can't see the confirmation page. It's working blind.

So I built an MCP tool that changes that. After every browser action, it screenshots the page and shows Claude what it's looking at.

Now Claude can actually see.

The Problem: Claude's Tool Calls Are Invisible

Here's what using Claude with browser tools feels like:

I ask Claude: "Check if that button is clickable"
Claude calls my inspect_page tool
Tool returns: { button_found: true, clickable: true }
Claude continues
But Claude has no idea what the button looks like

Claude is working from text descriptions. If my tool makes a mistake, Claude doesn't know. If the page renders differently than I described, Claude goes off the rails.

The Solution: Visual Feedback Loop

I added a screenshot tool. After Claude takes any action (click, fill, navigate), the tool screenshots the page and returns the image to Claude.

Now Claude sees the page, not just a text description.

Claude can see it's looking at a form:
"Screenshot shows form with fields: Name, Email, Phone"
(Not: "Tool returned: { fields_found: 3 }")

This changes everything. Claude can verify its own actions.
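To make the difference concrete, here is a minimal sketch of the two kinds of tool result. It assumes the MCP content-block shapes (a `text` block with a `text` field, and an `image` block with base64 `data` plus a `mimeType`). The helper names `textResult` and `imageResult` are hypothetical and not part of any SDK.

```javascript
// Hypothetical helpers: build MCP-style tool results by hand.

// Text-only result: Claude gets a description and has to trust it.
function textResult(summary) {
  return { content: [{ type: 'text', text: JSON.stringify(summary) }] };
}

// Image result: Claude gets the rendered page plus a short caption.
function imageResult(pngBuffer, url) {
  return {
    content: [
      { type: 'image', data: pngBuffer.toString('base64'), mimeType: 'image/png' },
      { type: 'text', text: `Screenshot of ${url}` },
    ],
  };
}

// Placeholder bytes for illustration only (the PNG signature, not a full image).
const fakePng = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

const blind = textResult({ fields_found: 3 });
const visual = imageResult(fakePng, 'https://example.com/form');

console.log(blind.content.map((block) => block.type));
console.log(visual.content.map((block) => block.type));
```

The first result gives Claude a claim about the page. The second gives it the page.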

Building the MCP Tool

Here's the MCP tool I built:

const { Server } = require('@modelcontextprotocol/sdk/server/index.js');
const { StdioServerTransport } = require('@modelcontextprotocol/sdk/server/stdio.js');
const {
  ListToolsRequestSchema,
  CallToolRequestSchema,
} = require('@modelcontextprotocol/sdk/types.js');
// fetch is a global in Node 18+, so no HTTP client import is needed
// (node-fetch v3 is ESM-only and cannot be loaded with require()).

const PAGEBOLT_API_KEY = process.env.PAGEBOLT_API_KEY;

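const server = new Server(
  {
    name: 'claude-screenshot-tool',
    version: '1.0.0',
  },
  {
    // Declare tool support; without it the SDK rejects the tool handlers below
    capabilities: { tools: {} },
  }
);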

// List available tools
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: 'screenshot_page',
        description: 'Take a screenshot of the page at the given URL so Claude can see what it looks like',
        inputSchema: {
          type: 'object',
          properties: {
            url: {
              type: 'string',
              description: 'URL of the page to screenshot',
            },
            fullPage: {
              type: 'boolean',
              description: 'Capture full page (including lazy-loaded content)',
              default: false,
            },
          },
          required: ['url'],
        },
      },
    ],
  };
});

// Handle tool calls
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'screenshot_page') {
    const { url, fullPage } = request.params.arguments;

    try {
      const response = await fetch('https://pagebolt.dev/api/v1/screenshot', {
        method: 'POST',
        headers: {
          'x-api-key': `${PAGEBOLT_API_KEY}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          url: url,
          format: 'png',
          width: 1280,
          height: 720,
          fullPage: fullPage || false,
          blockBanners: true,
        }),
      });

      if (!response.ok) {
        // Flag the result as an error so Claude does not treat it as a success
        return {
          isError: true,
          content: [
            {
              type: 'text',
              text: `Screenshot failed: ${response.status}`,
            },
          ],
        };
      }

      const buffer = await response.arrayBuffer();
      const base64 = Buffer.from(buffer).toString('base64');

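      // MCP image blocks carry base64 data and a mimeType directly
      // (the nested "source" object belongs to the Anthropic Messages API, not MCP)
      return {
        content: [
          {
            type: 'image',
            data: base64,
            mimeType: 'image/png',
          },
          {
            type: 'text',
            text: `Screenshot of ${url} captured. Claude can now see the page.`,
          },
        ],
      };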
    } catch (error) {
      // Network or parsing failure: report it as a tool error
      return {
        isError: true,
        content: [
          {
            type: 'text',
            text: `Error: ${error.message}`,
          },
        ],
      };
    }
  }

  return {
    isError: true,
    content: [{ type: 'text', text: `Tool not found: ${request.params.name}` }],
  };
});

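const transport = new StdioServerTransport();
server.connect(transport).catch((error) => {
  // stdout carries the protocol, so report startup failures on stderr
  console.error('Failed to start MCP server:', error);
  process.exit(1);
});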

The key parts:

ListToolsRequestSchema: tells Claude what tools are available
CallToolRequestSchema: handles tool calls from Claude
Return image as base64: Claude receives the screenshot as an image it can inspect
blockBanners: true: hides cookie popups so Claude sees clean pages
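One piece the handler above skips is checking its arguments before spending an API request. The following is a minimal sketch of that step, assuming only `http` and `https` URLs should be accepted. `buildScreenshotBody` is a hypothetical helper, and the body fields simply mirror the ones already used in the handler, so check them against the PageBolt docs before relying on them.

```javascript
// Hypothetical helper: validate tool arguments and build the request body.
function buildScreenshotBody(args) {
  const { url, fullPage = false } = args ?? {};
  if (typeof url !== 'string') {
    throw new TypeError('url must be a string');
  }

  const parsed = new URL(url); // throws a TypeError on malformed URLs
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new TypeError(`unsupported protocol: ${parsed.protocol}`);
  }

  // Same fields and defaults as the fetch call in the handler
  return {
    url: parsed.href,
    format: 'png',
    width: 1280,
    height: 720,
    fullPage: Boolean(fullPage),
    blockBanners: true,
  };
}

console.log(buildScreenshotBody({ url: 'https://example.com/form' }));
```

Inside the handler, the result could be passed straight to `JSON.stringify` for the request body, with the thrown error caught by the existing `try`/`catch`.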

How Claude Uses It

Once I install this MCP tool in Claude Desktop, I can ask Claude to use it:

Me: "Navigate to example.com/form and take a screenshot"

Claude: navigates and calls the tool

Tool: returns screenshot as image

Claude: (seeing the image) "I can see a form with three fields: Name, Email, and Phone. Let me fill them in."

Claude can now see what it's doing. It can verify its actions. It can handle unexpected page layouts.
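Under the hood, "calls the tool" means the MCP client sends a JSON-RPC 2.0 message to the server over stdio. Here is a rough sketch of what that message looks like for this tool. `toolCallRequest` is a hypothetical helper for illustration; in practice Claude Desktop builds the message for you.

```javascript
// Hypothetical helper: the JSON-RPC 2.0 message an MCP client sends to invoke a tool.
function toolCallRequest(id, name, args) {
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: { name, arguments: args },
  };
}

const request = toolCallRequest(1, 'screenshot_page', {
  url: 'https://example.com/form',
  fullPage: false,
});

// The stdio transport sends each message as a single line of JSON.
console.log(JSON.stringify(request));
```

The `params.name` and `params.arguments` fields are exactly what the handler reads as `request.params.name` and `request.params.arguments`.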

Real Example: Form Filling With Visual Verification

Without visual feedback, Claude's form-filling workflow looks like:

Navigate to form page
Fill "name" field with "John"
Fill "email" field with "john@example.com"
Fill "phone" field with "555-1234"
Click submit button
Hope the form submitted

With visual feedback:

Navigate to form page
Take screenshot → Claude sees the form and its field labels
Fill "name" field
Take screenshot → Claude sees the "Name" field is filled
Fill "email" field
Take screenshot → Claude sees both fields filled
Fill "phone" field
Take screenshot → Claude sees all three fields filled and locates the submit button
Click submit button
Take screenshot → Claude sees the confirmation message
Claude confirms: "Form successfully submitted. I can see the thank you page."

No guessing. No blind steps. Claude sees what's happening.
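The pattern is simply "act, then look". As a rough sketch, here is a planner that inserts a screenshot step after every browser action. The step shapes and the `withScreenshots` name are made up for illustration; in a real session Claude decides when to call the tool.

```javascript
// Hypothetical planner: insert a screenshot step after every browser action.
function withScreenshots(actions) {
  return actions.flatMap((action) => [
    action,
    { type: 'screenshot', after: action.type },
  ]);
}

const plan = withScreenshots([
  { type: 'navigate', url: 'https://example.com/form' },
  { type: 'fill', field: 'name', value: 'John' },
  { type: 'fill', field: 'email', value: 'john@example.com' },
  { type: 'fill', field: 'phone', value: '555-1234' },
  { type: 'click', target: 'submit' },
]);

// Print the interleaved plan, one step type per line
for (const step of plan) {
  console.log(step.type);
}
```

Five actions become ten steps, so a screenshot-after-everything workflow doubles the number of tool calls. That is worth keeping in mind against a request quota.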

Why This Matters

Claude is an agent now. It can browse, fill forms, extract data, monitor pages. But agents need feedback loops. Screenshots are that feedback.

When Claude can see:

✅ It makes better decisions (error detection)
✅ It recovers from mistakes (can see what went wrong)
✅ It adapts to page changes (sees dynamic content)
✅ It gains confidence in its own actions (sees results)

Installation

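Save the MCP tool code to claude-screenshot-tool.js and install its dependencies in the same folder (at minimum, npm install @modelcontextprotocol/sdk)
Set API key (for running the server from a terminal):

   export PAGEBOLT_API_KEY=your_key_here

Add to Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json on macOS). Claude Desktop launches the server itself and may not inherit variables exported in your shell, so pass the key in the env block too:

   {
     "mcpServers": {
       "claude-screenshot": {
         "command": "node",
         "args": ["/path/to/claude-screenshot-tool.js"],
         "env": {
           "PAGEBOLT_API_KEY": "your_key_here"
         }
       }
     }
   }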

Restart Claude Desktop
Claude can now call the screenshot_page tool
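If the tool does not show up after a restart, the config entry is the first thing to check. Below is a small sketch of such a check. It assumes the API key is passed through an `env` block on the server entry, since a GUI-launched Claude Desktop generally does not see variables exported in a terminal. `configProblems` is a hypothetical helper, not part of Claude Desktop.

```javascript
// Hypothetical check: report what is missing from one mcpServers entry.
function configProblems(config, serverName) {
  const entry = config?.mcpServers?.[serverName];
  if (!entry) {
    return [`no mcpServers entry named "${serverName}"`];
  }

  const problems = [];
  if (typeof entry.command !== 'string') {
    problems.push('command is missing');
  }
  if (!Array.isArray(entry.args) || entry.args.length === 0) {
    problems.push('args is empty');
  }
  if (!entry.env?.PAGEBOLT_API_KEY) {
    problems.push('env.PAGEBOLT_API_KEY is not set');
  }
  return problems;
}

// Sample entry without an env block, to show what the check reports
const sample = {
  mcpServers: {
    'claude-screenshot': {
      command: 'node',
      args: ['/path/to/claude-screenshot-tool.js'],
    },
  },
};

console.log(configProblems(sample, 'claude-screenshot'));
```

To run it against a real file, the parsed contents of claude_desktop_config.json could be passed in place of `sample`.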

Real-World Use Cases

Use Case 1: Website Monitoring

Claude navigates a monitoring dashboard
Takes screenshots at each step
Detects visual changes (red alerts, status changes)
Reports findings with visual evidence
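A cheap first pass for the "detects visual changes" step is to fingerprint each screenshot and only ask Claude to look when the fingerprint changes. This sketch uses a SHA-256 hash of the raw image bytes, which is a deliberately crude stand-in: any single-pixel difference counts as a change, so it suits static dashboards better than pages with clocks or animations. The function names are hypothetical.

```javascript
const { createHash } = require('node:crypto');

// Hash the raw image bytes into a short, comparable fingerprint.
function fingerprint(imageBuffer) {
  return createHash('sha256').update(imageBuffer).digest('hex');
}

// True when two screenshots differ by even a single byte.
function hasChanged(previous, current) {
  return fingerprint(previous) !== fingerprint(current);
}

// Stand-in buffers for illustration; real input would be the PNG bytes
// returned by the screenshot API on two consecutive checks.
const before = Buffer.from('dashboard: all green');
const after = Buffer.from('dashboard: 1 alert');

console.log(hasChanged(before, before));
console.log(hasChanged(before, after));
```

Only the screenshots that changed then need to go to Claude, which keeps the visual review focused on the steps where something actually happened.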

Use Case 2: Competitor Analysis

Claude visits competitor websites
Screenshots pricing pages, feature lists
Compares visually across competitors
Summarizes findings with screenshot evidence

Use Case 3: Form Automation

Claude fills complex forms
Verifies each field visually
Detects validation errors (sees red text, error messages)
Retries intelligently based on visual feedback

Use Case 4: Content Extraction

Claude navigates a site
Screenshots key pages
Extracts content with visual context
Higher accuracy because it can see layout

Pricing

Plan	Requests/Month	Cost	Best For
Free	100	$0	Testing Claude tools
Starter	5,000	$29	Small projects
Growth	25,000	$79	Production Claude agents
Scale	100,000	$199	Enterprise AI workflows
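To size a plan, it helps to turn the quotas into runs. The sketch below copies the numbers from the table and assumes one `screenshot_page` call consumes one request, which is my assumption and not something the table states. The helper names are hypothetical.

```javascript
// Plan quotas and prices copied from the table above, cheapest first.
const plans = [
  { name: 'Free', requests: 100, cost: 0 },
  { name: 'Starter', requests: 5000, cost: 29 },
  { name: 'Growth', requests: 25000, cost: 79 },
  { name: 'Scale', requests: 100000, cost: 199 },
];

// Assumption: one screenshot_page call consumes one request.
function runsPerMonth(plan, screenshotsPerRun) {
  return Math.floor(plan.requests / screenshotsPerRun);
}

// Cheapest plan whose quota covers the expected monthly screenshots, or null.
function cheapestPlanFor(monthlyScreenshots) {
  return plans.find((plan) => plan.requests >= monthlyScreenshots) ?? null;
}

for (const plan of plans) {
  console.log(`${plan.name}: ${runsPerMonth(plan, 5)} runs at 5 screenshots each`);
}
```

With five screenshots per form-filling run, as in the workflow earlier, the Free plan covers 20 runs a month under that assumption.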