Custodia-Admin

Posted on • Originally published at pagebolt.dev

Visual verification for AI agents: how to confirm web actions actually worked

AI agents that click buttons and fill forms have a blind spot: they cannot tell whether an action worked unless they check the result. A form submission might return an HTTP 200 but render an error message. A login might redirect to an unexpected page. A delete might require a confirmation step the agent didn't anticipate.

The fix is visual verification: after each action, screenshot the current state and let the model evaluate whether the expected outcome happened.

The problem with optimistic agents

// This agent is optimistic — it assumes actions work
const result = await agent.invoke("Login to https://example.com with email test@example.com password abc123");

// The agent says "Done" but has no way to confirm login succeeded
// The actual page might show "Invalid credentials" or require 2FA

Add a verification step

import fetch from "node-fetch";

const PAGEBOLT_KEY = process.env.PAGEBOLT_API_KEY;

async function screenshotPage(url) {
  const res = await fetch("https://pagebolt.dev/api/v1/screenshot", {
    method: "POST",
    headers: { "x-api-key": PAGEBOLT_KEY, "Content-Type": "application/json" },
    body: JSON.stringify({ url, blockBanners: true }),
  });
  if (!res.ok) throw new Error(`Screenshot failed: ${res.status}`);
  return Buffer.from(await res.arrayBuffer());
}

async function inspectPage(url) {
  const res = await fetch("https://pagebolt.dev/api/v1/inspect", {
    method: "POST",
    headers: { "x-api-key": PAGEBOLT_KEY, "Content-Type": "application/json" },
    body: JSON.stringify({ url }),
  });
  if (!res.ok) throw new Error(`Inspect failed: ${res.status}`);
  return res.json();
}

// Verification tool — screenshot the result and ask the model to evaluate it
async function verifyAction(url, expectedOutcome) {
  const screenshot = await screenshotPage(url);
  const inspection = await inspectPage(url);

  // Pass screenshot + inspection to the model with a specific question
  return {
    screenshot,          // raw PNG bytes
    elements: inspection.elements,
    question: `After the action, is the current page consistent with: "${expectedOutcome}"?
    Look for: success messages, error messages, redirects, modal dialogs, or missing expected content.
    Answer: Yes/No and explain what you see.`,
  };
}
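
The object returned by verifyAction still has to be turned into something the model can read. A minimal sketch of that step, assuming the Anthropic Messages image block format used later in this article; `buildVerificationMessage` and `maxElements` are hypothetical names, and the `{ tag, selector, text }` element shape is an assumption about the inspect response:

```javascript
// Sketch: turn the object returned by verifyAction() into one user message.
// Assumes `screenshot` is a Buffer of PNG bytes and each element looks like
// { tag, selector, text } (the inspect response shape is an assumption).
function buildVerificationMessage({ screenshot, elements, question }, maxElements = 20) {
  // Summarise a bounded number of elements so the prompt stays small
  const summary = (elements || [])
    .slice(0, maxElements)
    .map((el) => `${el.tag || "?"} ${el.selector || ""} "${(el.text || "").slice(0, 60)}"`)
    .join("\n");

  return {
    role: "user",
    content: [
      {
        type: "image",
        source: { type: "base64", media_type: "image/png", data: screenshot.toString("base64") },
      },
      { type: "text", text: `${question}\n\nInteractive elements:\n${summary}` },
    ],
  };
}
```

The returned object can be pushed onto the `messages` array of a `messages.create` call. Capping the element list is a design choice to keep token usage predictable on busy pages.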

Full loop: act → verify → retry

import Anthropic from "@anthropic-ai/sdk";
import fetch from "node-fetch";

const claude = new Anthropic();
const PAGEBOLT_KEY = process.env.PAGEBOLT_API_KEY;

// Tool definitions
const tools = [
  {
    name: "navigate_and_screenshot",
    description: "Navigate to a URL and take a screenshot of the result",
    input_schema: {
      type: "object",
      properties: {
        url: { type: "string", description: "URL to navigate to" },
        expected: { type: "string", description: "What you expect to see (used to verify success)" },
      },
      required: ["url"],
    },
  },
  {
    name: "inspect_page",
    description: "Get all interactive elements and CSS selectors on the current page. Use before filling forms or clicking buttons.",
    input_schema: {
      type: "object",
      properties: {
        url: { type: "string" },
      },
      required: ["url"],
    },
  },
  {
    name: "run_sequence_and_verify",
    description: "Run a multi-step browser sequence and screenshot the final state for verification",
    input_schema: {
      type: "object",
      properties: {
        url: { type: "string", description: "Starting URL" },
        steps: {
          type: "array",
          description: "Browser steps to execute",
          items: {
            type: "object",
            properties: {
              action: { type: "string", enum: ["click", "fill", "navigate", "wait", "screenshot"] },
              selector: { type: "string" },
              value: { type: "string" },
              url: { type: "string" },
              ms: { type: "number" },
            },
            required: ["action"],
          },
        },
        expected: { type: "string", description: "Expected state after all steps complete" },
      },
      required: ["url", "steps"],
    },
  },
];

async function executeTool(name, input) {
  const headers = { "x-api-key": PAGEBOLT_KEY, "Content-Type": "application/json" };

  if (name === "navigate_and_screenshot") {
    const res = await fetch("https://pagebolt.dev/api/v1/screenshot", {
      method: "POST",
      headers,
      body: JSON.stringify({ url: input.url, blockBanners: true }),
    });
    if (!res.ok) return { error: `Screenshot failed: ${res.status}` };
    const bytes = Buffer.from(await res.arrayBuffer());
    return {
      screenshot_b64: bytes.toString("base64"),
      url: input.url,
      expected: input.expected,
      note: "Use the screenshot to verify whether the expected outcome was achieved.",
    };
  }

  if (name === "inspect_page") {
    const res = await fetch("https://pagebolt.dev/api/v1/inspect", {
      method: "POST",
      headers,
      body: JSON.stringify({ url: input.url }),
    });
    if (!res.ok) return { error: `Inspect failed: ${res.status}` };
    const data = await res.json();
    return {
      elementCount: data.elements?.length,
      elements: (data.elements || []).slice(0, 30).map((el) => ({
        tag: el.tag,
        role: el.role,
        text: (el.text || "").slice(0, 80),
        selector: el.selector,
      })),
    };
  }

  if (name === "run_sequence_and_verify") {
    // Build sequence steps + screenshot at end
    const sequenceSteps = [
      { action: "navigate", url: input.url },
      ...input.steps,
      { action: "screenshot" },
    ];

    const res = await fetch("https://pagebolt.dev/api/v1/sequence", {
      method: "POST",
      headers,
      body: JSON.stringify({ steps: sequenceSteps }),
    });
    if (!res.ok) return { error: `Sequence failed: ${res.status} ${await res.text()}` };
    const data = await res.json();
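    // Take the final screenshot: the steps may include earlier ones, and find() would return the first
    const lastScreenshot = (data.outputs || []).filter((o) => o.type === "screenshot").pop();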
    return {
      outputs: data.outputs?.length,
      screenshot_b64: lastScreenshot?.data,
      expected: input.expected,
      note: "Check the screenshot to verify whether the expected outcome was achieved.",
    };
  }

  return { error: `Unknown tool: ${name}` };
}

async function runVerifyingAgent(task) {
  console.log(`\nTask: ${task}\n`);

  const messages = [{
    role: "user",
    content: `${task}

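IMPORTANT: After every action that changes page state, capture a screenshot of the result and verify it looks correct. run_sequence_and_verify returns a screenshot of the final state of the same sequence; use navigate_and_screenshot for plain page loads. Do not assume actions succeeded without visual confirmation.`,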
  }];

  let iterations = 0;
  const MAX_ITERATIONS = 10;

  while (iterations < MAX_ITERATIONS) {
    iterations++;

    const response = await claude.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 4096,
      tools,
      messages,
    });

    messages.push({ role: "assistant", content: response.content });

    // If done, return final response
    if (response.stop_reason === "end_turn") {
      const textContent = response.content.find((c) => c.type === "text");
      return textContent?.text || "Task complete";
    }

    // Execute tool calls
    const toolUses = response.content.filter((c) => c.type === "tool_use");
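    // No tool calls and no end_turn (for example max_tokens): report it instead of "Max iterations reached"
    if (toolUses.length === 0) return `Stopped without a final answer (stop_reason: ${response.stop_reason})`;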

    const toolResults = await Promise.all(
      toolUses.map(async (tu) => {
        console.log(`  → ${tu.name}(${JSON.stringify(tu.input).slice(0, 80)}...)`);
        const result = await executeTool(tu.name, tu.input);

        // If the tool returned a screenshot, include it as an image for the model to see
        if (result.screenshot_b64) {
          return {
            type: "tool_result",
            tool_use_id: tu.id,
            content: [
              {
                type: "image",
                source: { type: "base64", media_type: "image/png", data: result.screenshot_b64 },
              },
              {
                type: "text",
                text: `Screenshot captured. Expected: "${result.expected || 'not specified'}". Evaluate whether this matches expectations.`,
              },
            ],
          };
        }

        return {
          type: "tool_result",
          tool_use_id: tu.id,
          content: JSON.stringify(result),
        };
      })
    );

    messages.push({ role: "user", content: toolResults });
  }

  return "Max iterations reached";
}

// Run it
const outcome = await runVerifyingAgent(
  "Go to https://example.com, inspect the page to find all links, " +
  "then screenshot the page and tell me if there are any broken layout issues."
);
console.log("\nResult:", outcome);
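
The agent's final answer is free text, which is awkward to branch on in code. A small sketch of a verdict parser, assuming the prompt asks the model to begin its reply with Yes or No (as the verifyAction question does); `parseVerdict` is a hypothetical helper, not part of any SDK:

```javascript
// Sketch: extract a Yes/No verdict from the start of a model reply.
// Returns verified: true | false, or null when no clear verdict is found.
function parseVerdict(answer) {
  const text = answer || "";
  // Allow leading punctuation/markdown and an optional "Answer:" prefix
  const match = /^\W*(?:answer\s*:\s*)?(yes|no)\b/i.exec(text);
  if (!match) return { verified: null, explanation: text.trim() };
  return {
    verified: match[1].toLowerCase() === "yes",
    // Drop the verdict and any separator that follows it
    explanation: text.slice(match[0].length).replace(/^[\s.,:;*-]+/, "").trim(),
  };
}
```

Treating an unparseable reply as `null` rather than `false` lets the caller distinguish "verification failed" from "the model did not answer the question", which usually calls for a re-prompt rather than a retry of the browser action.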

Verification patterns

Pattern 1: check for expected text

// After form submission, verify the success message appeared
const verification = await runVerifyingAgent(
  "Submit the contact form at https://example.com/contact with " +
  "name: 'Test User', email: 'test@test.com', message: 'Hello'. " +
  "After submitting, screenshot the page and confirm a success message appeared."
);
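
A cheap deterministic text check can run alongside the model's visual judgment. The sketch below classifies visible page text against phrase lists; the phrase lists and `classifyPageText` are hypothetical, and it assumes you already have the page text as a string (for example, joined from the `text` fields of inspected elements, which may miss non-interactive content):

```javascript
// Hypothetical phrase lists; tune them to the site under test.
const SUCCESS_PHRASES = ["thank you", "message sent", "success"];
const ERROR_PHRASES = ["error", "invalid", "required", "try again"];

// Sketch: classify page text as success, error, or unknown.
// Errors win over successes so that a mixed page is never reported as fine.
function classifyPageText(pageText) {
  const text = (pageText || "").toLowerCase();
  const hits = (phrases) => phrases.filter((p) => text.includes(p));
  const errors = hits(ERROR_PHRASES);
  const successes = hits(SUCCESS_PHRASES);
  if (errors.length > 0) return { status: "error", matched: errors };
  if (successes.length > 0) return { status: "success", matched: successes };
  return { status: "unknown", matched: [] };
}
```

An "unknown" result is the case where the screenshot and the model's evaluation carry the decision; substring matching alone is too blunt to be the only check.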

Pattern 2: check page title or URL changed

// After login, verify redirect happened
const verification = await runVerifyingAgent(
  "Log in to https://example.com/login. After submitting credentials, " +
  "screenshot the result and verify the URL changed to /dashboard (not still on /login)."
);
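
A screenshot does not necessarily show the address bar, so a URL check is more reliable when the final URL is available as data. A sketch using the standard `URL` class, assuming your browser automation reports the URL the page ended on (the tools in this article do not return it, so `finalUrl` is an assumed input and `checkRedirect` a hypothetical helper):

```javascript
// Sketch: compare the starting URL with the URL the browser ended on.
// `expectedPath` is a path prefix such as "/dashboard".
function checkRedirect(startUrl, finalUrl, expectedPath) {
  const start = new URL(startUrl);
  const end = new URL(finalUrl, start); // also resolves relative final URLs
  return {
    changed: end.href !== start.href,
    onExpectedPath:
      end.pathname === expectedPath || end.pathname.startsWith(expectedPath + "/"),
    sameOrigin: end.origin === start.origin,
    finalPath: end.pathname,
  };
}
```

Checking `onExpectedPath` rather than only `changed` matters for logins: a failed attempt often lands on something like /login?error=1, which is a changed URL but not a successful redirect.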

Pattern 3: element presence check

// Inspect after action to verify element appeared/disappeared
const verification = await runVerifyingAgent(
  "Click the 'Delete account' button at https://example.com/settings. " +
  "After clicking, inspect the page and verify a confirmation dialog appeared " +
  "(look for elements with 'confirm' or 'are you sure' text)."
);
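
Whether an element appeared or disappeared can also be checked by diffing two inspections, one taken before the action and one after. A minimal sketch, assuming each element has the `{ selector, text }` fields used in this article's tool output and that selectors are stable between the two inspections; `diffElements` and `findByText` are hypothetical helpers:

```javascript
// Sketch: diff two element lists by selector.
function diffElements(before, after) {
  const beforeKeys = new Set((before || []).map((el) => el.selector));
  const afterKeys = new Set((after || []).map((el) => el.selector));
  return {
    appeared: (after || []).filter((el) => !beforeKeys.has(el.selector)),
    disappeared: (before || []).filter((el) => !afterKeys.has(el.selector)),
  };
}

// Filter elements whose text matches a pattern.
// Pass a regex without the "g" flag, since test() on a global regex is stateful.
function findByText(elements, pattern) {
  return (elements || []).filter((el) => pattern.test(el.text || ""));
}

// Usage with made-up data: a confirmation dialog appears after clicking Delete
const before = [{ selector: "#delete", text: "Delete account" }];
const after = [
  { selector: "#delete", text: "Delete account" },
  { selector: "#confirm-yes", text: "Yes, I'm sure" },
  { selector: "#confirm-no", text: "Cancel" },
];
const { appeared } = diffElements(before, after);
const confirmButtons = findByText(appeared, /confirm|are you sure|i'm sure/i);
```

Searching only the elements that appeared avoids matching text that was already on the page before the click.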

Why this matters for production agents

Unverified agents produce silent failures — they report success but leave the system in an unknown state. Visual verification closes the loop:

  1. Act → run the browser action
  2. Capture → screenshot the result
  3. Evaluate → model assesses whether expected outcome occurred
  4. Retry or escalate → if verification fails, retry with a different approach or surface the failure
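
The four steps above can be sketched as a small retry wrapper. This is an illustration under stated assumptions, not a definitive implementation: `actWithVerification` is a hypothetical name, `act` is any function that performs the browser action and returns the captured state, and `verify` is any function that returns `{ verified, explanation }`:

```javascript
// Sketch: act, capture, evaluate, then retry or escalate.
async function actWithVerification({ act, verify, maxAttempts = 3 }) {
  const failures = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Steps 1 and 2: run the action and capture the resulting state.
    // The attempt number is passed in so `act` can vary its approach.
    const state = await act(attempt);
    // Step 3: evaluate the captured state against the expected outcome
    const verdict = await verify(state);
    if (verdict.verified === true) {
      return { status: "verified", attempts: attempt, verdict };
    }
    failures.push({ attempt, reason: verdict.explanation });
  }
  // Step 4: out of attempts, so surface the failures instead of reporting success
  return { status: "escalate", attempts: maxAttempts, failures };
}
```

Only an explicit `verified === true` counts as success here, so an ambiguous or missing verdict is treated as a failed attempt rather than silently passing.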

This is the difference between an agent that says "done" and one that's actually done.


Try it free — 100 requests/month, no credit card. → Get started in 2 minutes
