Custodia-Admin

Posted on Mar 26 • Originally published at pagebolt.dev

How to Give Your AI Agent Eyes: Screenshot and Visual Verification via API

#aiagents #llm #verification #screenshotapi

Your AI agent is blind.

You tell it: "Click the submit button and confirm the form was accepted."

The agent makes the API call. Gets back status 200. Continues to the next task.

But you have no idea what actually happened. Did the form submit? Is the page still showing an error? Did the agent hallucinate the success?

AI agents need eyes. They need to see what happened after every action.

Without visual verification, you're building on faith. With it, you have proof.

The Problem: Text-Only Agents Can't Verify Actions

Here's what typical agent workflows look like:

# Agent calls browser tool
response = tool.click_button(selector="#submit")
# Response: { "success": true, "status": 200 }

# Agent has NO IDEA:
# - Did the page actually change?
# - Is there an error message visible?
# - Did the form data actually save?
# - Is the agent looking at the right page?

# Agent just assumes success and moves on
agent.next_task()

Result: Agents make wrong decisions based on incomplete information. They hallucinate success. They miss errors that humans would catch instantly.

You need visual verification loops — after every action, the agent sees what happened.

The Solution: Three Visual Verification Patterns

Pattern 1: Post-Action Verification Screenshot

After your agent clicks a button, take a screenshot, have the model extract the key information from the image, and then decide what to do next.

from anthropic import Anthropic
import requests
import base64

client = Anthropic()

PAGEBOLT_API_KEY = "YOUR_API_KEY"
PAGEBOLT_BASE_URL = "https://pagebolt.dev/api/v1"

def agent_interact_with_verification(url, action_description, initial_prompt):
    """
    AI agent interacts with a website and uses visual verification
    to understand what happened.
    """

    # Step 1: Take initial screenshot
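    response = requests.post(
        f"{PAGEBOLT_BASE_URL}/screenshot",
        json={"url": url},
        headers={"x-api-key": PAGEBOLT_API_KEY},
        timeout=30
    )
    response.raise_for_status()  # Fail fast instead of sending an error body to the model

    # The response body is assumed to be raw PNG bytes
    initial_screenshot = base64.standard_b64encode(response.content).decode()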

    # Step 2: Ask Claude what it sees
    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": initial_screenshot,
                        },
                    },
                    {
                        "type": "text",
                        "text": initial_prompt
                    }
                ],
            }
        ],
    )

    agent_decision = message.content[0].text
    print(f"Agent sees: {agent_decision}")

    # Step 3: Execute action (simulate browser interaction)
    # In production, you'd call your actual browser automation tool
    print(f"Executing: {action_description}")

    # Step 4: Verify the action with another screenshot
    response = requests.post(
        f"{PAGEBOLT_BASE_URL}/screenshot",
        json={"url": url},  # URL after action in real scenario
        headers={"x-api-key": PAGEBOLT_API_KEY},
        timeout=30
    )
    response.raise_for_status()  # Fail fast instead of verifying against an error body

    verification_screenshot = base64.standard_b64encode(response.content).decode()

    # Step 5: Ask Claude: Did the action work?
    verification = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": verification_screenshot,
                        },
                    },
                    {
                        "type": "text",
                        "text": f"I just performed this action: {action_description}\n\nDid it succeed? What changed? Any errors?"
                    }
                ],
            }
        ],
    )

    result = verification.content[0].text
    print(f"Verification result: {result}")

    return {
        "initial_assessment": agent_decision,
        "action": action_description,
        "verification": result,
        "success": "success" in result.lower() or "worked" in result.lower()
    }

# Usage
result = agent_interact_with_verification(
    url="https://example.com/checkout",
    action_description="Clicked the 'Complete Purchase' button",
    initial_prompt="What do you see on this page? Is the checkout form ready to submit?"
)

print(f"\nFinal result: {result}")
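
The `success` flag in the example above comes from a keyword match on free-form text, which is easy to fool: "unsuccessful" contains "success". One alternative, sketched here, is to ask the model for a small JSON object and parse that instead. The prompt wording, the `VERDICT_INSTRUCTIONS` constant, and the `parse_verdict` helper are illustrative assumptions, not part of PageBolt or the Anthropic SDK.

```python
import json

# Hypothetical instruction to append to the verification prompt
VERDICT_INSTRUCTIONS = (
    "Reply with only a JSON object like "
    '{"succeeded": true, "errors": [], "summary": "..."}'
)

def parse_verdict(reply_text):
    """Parse the model's reply into a verdict dict; anything unparseable counts as failure."""
    # Models sometimes wrap JSON in prose, so take the outermost pair of braces
    start = reply_text.find("{")
    end = reply_text.rfind("}")
    if start == -1 or end <= start:
        return {"succeeded": False, "errors": ["no JSON in reply"], "summary": reply_text}
    try:
        data = json.loads(reply_text[start:end + 1])
    except json.JSONDecodeError:
        return {"succeeded": False, "errors": ["invalid JSON in reply"], "summary": reply_text}

    errors = data.get("errors")
    return {
        # Only a literal JSON true counts as success
        "succeeded": data.get("succeeded") is True,
        "errors": errors if isinstance(errors, list) else [],
        "summary": str(data.get("summary", "")),
    }

verdict = parse_verdict('Sure: {"succeeded": true, "errors": [], "summary": "Order confirmed"}')
print(verdict["succeeded"], verdict["summary"])
```

Treating unparseable replies as failures keeps the agent from moving on when the verification step itself went wrong.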

Pattern 2: CSS Selector Discovery with /inspect

Don't guess selectors. Use /inspect to reliably find page elements.

import requests
import json

# Reuses `client` and PAGEBOLT_API_KEY defined in Pattern 1

def find_elements_reliably(url):
    """
    Use PageBolt /inspect endpoint to discover page structure
    and get reliable CSS selectors without guessing.
    """

    response = requests.post(
        "https://pagebolt.dev/api/v1/inspect",
        json={"url": url},
        headers={"x-api-key": PAGEBOLT_API_KEY},
        timeout=30
    )
    response.raise_for_status()  # Surface HTTP errors before parsing the body

    page_structure = response.json()

    # page_structure contains:
    # {
    #   "buttons": [
    #     { "text": "Submit", "selector": "#submit-btn-123", "visible": true },
    #     { "text": "Cancel", "selector": ".btn-cancel", "visible": true }
    #   ],
    #   "forms": [...],
    #   "inputs": [...],
    #   "headings": [...]
    # }

    return page_structure

def agent_action_with_selector_discovery(url, action_goal):
    """
    Agent asks: "What elements exist on this page?"
    Uses /inspect to get all clickable elements with their selectors.
    Chooses the right one based on the goal.
    """

    # Get page structure
    page_map = find_elements_reliably(url)

    # Convert to natural language for Claude
    page_description = json.dumps(page_map, indent=2)

    decision = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=256,
        messages=[
            {
                "role": "user",
                "content": f"""
                Here's the page structure:
                {page_description}

                Goal: {action_goal}

                Which element should I interact with? Respond with ONLY the CSS selector.
                """
            }
        ]
    )

    selector = decision.content[0].text.strip()
    print(f"Agent chose selector: {selector}")

    # The model chose from selectors that /inspect reported, but confirm the
    # returned string really appears in page_map before clicking with it
    return selector

# Usage
selector = agent_action_with_selector_discovery(
    url="https://example.com",
    action_goal="Click the button that says 'Add to Cart'"
)
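
The model's reply is still free text, so it can return a selector that was never in the page map, or wrap a valid one in backticks. A minimal sketch of a check before clicking follows. It assumes the page map has the shape shown in the commented example (top-level lists of dicts with a `selector` key); `known_selectors` and `validate_selector` are hypothetical helpers.

```python
def known_selectors(page_map):
    """Collect every selector listed in an /inspect-style page map."""
    selectors = set()
    for elements in page_map.values():
        if not isinstance(elements, list):
            continue
        for element in elements:
            if isinstance(element, dict) and "selector" in element:
                selectors.add(element["selector"])
    return selectors

def validate_selector(selector, page_map):
    """Return the cleaned selector if the page map lists it, otherwise raise ValueError."""
    # Remove whitespace, backticks, and quotes the model may put around its answer
    cleaned = selector.strip().strip("`'\"")
    if cleaned not in known_selectors(page_map):
        raise ValueError(f"Selector not found in page map: {cleaned!r}")
    return cleaned

# Example page map in the assumed shape
example_map = {
    "buttons": [
        {"text": "Submit", "selector": "#submit-btn-123", "visible": True},
        {"text": "Cancel", "selector": ".btn-cancel", "visible": True},
    ],
    "forms": [],
}
print(validate_selector(" `#submit-btn-123`\n", example_map))
```

Raising on an unknown selector lets the agent ask the model again instead of clicking something that does not exist.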

Pattern 3: Video Recording as Compliance Audit Trail

For regulated industries (finance, healthcare, legal), record agent sessions as proof of what happened.

import requests

def record_agent_session_as_video(steps):
    """
    Record a full agent session (multiple clicks, form fills, etc.)
    as a video. Perfect for compliance audits: prove to regulators
    exactly what your agent did.
    """

    # Record to video
    response = requests.post(
        "https://pagebolt.dev/api/v1/record",
        json={
            "steps": steps,
            "format": "mp4",
            "frame": {"enabled": True, "style": "macos"},
            "audioGuide": {
                "enabled": True,
                "script": "Agent navigating to account history. {{1}} Clicking export button. {{2}} Export complete."
            }
        },
        headers={"x-api-key": PAGEBOLT_API_KEY}
    )
    response.raise_for_status()  # Surface HTTP errors before reading the body

    video_url = response.json().get("video_url")
    return video_url

# Usage: build the step sequence and pass it in
audit_video = record_agent_session_as_video([
    {"action": "navigate", "url": "https://example.com/account"},
    {"action": "wait", "ms": 2000},  # Wait for page load
    {"action": "click", "selector": "#view-history"},
    {"action": "wait", "ms": 1000},
    {"action": "screenshot", "name": "history-page"},
    {"action": "click", "selector": "#export-btn"},
    {"action": "wait", "ms": 2000},
    {"action": "screenshot", "name": "export-complete"}
])

print(f"Compliance video: {audit_video}")
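
The step dictionaries follow a regular shape, so they can be generated rather than written by hand. The sketch below assumes the step format used in the example (`navigate`, `wait`, `click`, and `screenshot` actions with `url`, `ms`, `selector`, and `name` keys). `build_click_steps` is a hypothetical helper, and the real schema should be checked against PageBolt's documentation.

```python
def build_click_steps(start_url, clicks, load_wait_ms=2000, click_wait_ms=1000):
    """
    Build a step list: navigate, wait, then click/wait/screenshot per target.

    `clicks` is a list of (selector, screenshot_name) pairs.
    """
    steps = [
        {"action": "navigate", "url": start_url},
        {"action": "wait", "ms": load_wait_ms},
    ]
    for selector, screenshot_name in clicks:
        steps.append({"action": "click", "selector": selector})
        steps.append({"action": "wait", "ms": click_wait_ms})
        steps.append({"action": "screenshot", "name": screenshot_name})
    return steps

steps = build_click_steps(
    "https://example.com/account",
    [("#view-history", "history-page"), ("#export-btn", "export-complete")],
)
print(len(steps))
```

The resulting list can be sent as the `steps` field of the /record request body.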

Real-World: AI Agent Verification Loop

Here's a skeleton of an agent verification loop. Executing the browser action itself is left to your automation tool:

import base64
import json

import requests
from anthropic import Anthropic

class VerifiedAgent:
    def __init__(self, api_key):
        self.api_key = api_key
        self.pagebolt_api = "https://pagebolt.dev/api/v1"
        self.client = Anthropic()

    def verify_action(self, url, action):
        """Take screenshot, show to Claude, verify success."""
        response = requests.post(
            f"{self.pagebolt_api}/screenshot",
            json={"url": url},
            headers={"x-api-key": self.api_key}
        )
        screenshot = base64.standard_b64encode(response.content).decode()

        verification = self.client.messages.create(
            model="claude-opus-4-6",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": screenshot}},
                    {"type": "text", "text": f"Action: {action}. Did it succeed? Report status."}
                ]
            }]
        )

        return verification.content[0].text

    def discover_elements(self, url):
        """Use /inspect to find clickable elements."""
        response = requests.post(
            f"{self.pagebolt_api}/inspect",
            json={"url": url},
            headers={"x-api-key": self.api_key}
        )
        return response.json()

    def execute_with_verification(self, url, goal, max_steps=5):
        """Execute a goal with verification at each step."""
        current_url = url
        step = 0

        while step < max_steps:
            # Discover what's on the page
            elements = self.discover_elements(current_url)

            # Ask Claude what to do next
            plan = self.client.messages.create(
                model="claude-opus-4-6",
                max_tokens=256,
                messages=[{
                    "role": "user",
                    "content": f"Goal: {goal}\n\nAvailable elements: {json.dumps(elements)}\n\nWhat's the next action?"
                }]
            )

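            action = plan.content[0].text
            print(f"Step {step}: {action}")

            # Execute the action here with your browser automation tool
            # (omitted in this example, as in Pattern 1)

            # Verify the action worked
            result = self.verify_action(current_url, action)
            print(f"Result: {result}")

            # Naive keyword check: "unsuccessful" also contains "success",
            # so prefer a structured verdict in production
            if "success" in result.lower() or "complete" in result.lower():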
                print("Goal achieved!")
                return True

            step += 1

        return False

# Usage
agent = VerifiedAgent(api_key="YOUR_API_KEY")
success = agent.execute_with_verification(
    url="https://example.com",
    goal="Fill out the feedback form and submit it"
)
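
The loop above plans and verifies, but it does not show the step that performs the action. One way to make the full plan, act, verify cycle explicit is to pass the three stages in as functions, which also lets the loop run offline with stand-ins. `run_verified_loop` and the callable signatures are assumptions for illustration, not part of any library.

```python
def run_verified_loop(plan_fn, act_fn, verify_fn, max_steps=5):
    """
    Repeat plan -> act -> verify until verify_fn reports the goal is done.

    plan_fn(history) -> action description (str)
    act_fn(action) -> None; performs the action in the browser
    verify_fn(action) -> (step_ok: bool, goal_done: bool)
    Returns (goal_done, history), where history records each step.
    """
    history = []
    for step in range(max_steps):
        action = plan_fn(history)
        act_fn(action)
        step_ok, goal_done = verify_fn(action)
        history.append({"step": step, "action": action, "ok": step_ok})
        if goal_done:
            return True, history
    return False, history

# Stand-in functions so the loop can be exercised without a browser or API
performed = []
planned = iter(["fill form", "click submit"])
done, history = run_verified_loop(
    plan_fn=lambda history: next(planned),
    act_fn=performed.append,
    verify_fn=lambda action: (True, action == "click submit"),
)
print(done, len(history))
```

Passing `history` to the planner means a failed step is visible when the next action is chosen, so the agent can retry or change course.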

Why This Matters

Scenario	Without Verification	With Verification
Agent clicks button	Assumes it worked	Takes screenshot, confirms
Form submission	Hopes page updated	Sees success message or error
Hallucination detection	Agent continues blindly	Agent recognizes mistake, retries
Audit trail	"The agent ran" (unproven)	Video proof of every action
Debugging failures	Guess what went wrong	Watch the video, see exact issue

Getting Started

1. Sign up: pagebolt.dev (100 free requests/month)
2. Get API key: Copy it from the dashboard
3. Choose pattern: Post-action verification, /inspect, or video recording
4. Integrate: Copy the matching code example above
5. Deploy: Your agents now have visual awareness

Your AI agents will go from "hoping for the best" to "seeing what actually happened."

Try it free — 100 requests/month, no credit card. Start now.
