
Custodia-Admin

Posted on • Originally published at pagebolt.dev

What Your AI Agent Actually Sees vs What You Think It Sees


You ask Claude to navigate a checkout page and verify the price is displayed.

Claude reports: "The price is visible on the page. I can see $99 next to the product name."

You check the page yourself. The price is there. Claude was right.

But here's what actually happened: Claude never saw the price. It only received HTML text. The HTML contained <span>$99</span>. Claude parsed that and reported it as "visible".

But what if CSS hid it? What if JavaScript hadn't loaded yet? What if the price was in the HTML but rendered off-screen or behind a modal?

Claude would still report: "I see $99." Even though the user looking at the screen sees nothing.

This is the blind spot. AI agents operate on text, not visuals. They hallucinate about what they "see".


The Agent Vision Problem

When you say "Look at this webpage", an AI agent:

  1. Gets the HTML markup (text)
  2. Parses it (looking for keywords, patterns)
  3. Reasons about what "should" be there
  4. Confidently reports what it "sees"

It never actually sees anything.

CSS might hide elements: display: none removes the element from the rendered page entirely. The agent still sees the HTML. The user sees blank space.

JavaScript might load data dynamically. Agent sees the initial HTML. User sees loaded content seconds later.

Modals might overlay content. Agent sees HTML for the covered element. User sees modal blocking it.

Result: agent confidence in what it "sees" is completely disconnected from visual reality.
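Here's a minimal sketch of that failure mode. The "agent" below is a stand-in, not a real LLM: a text-only parser built on Python's html.parser that checks whether a string appears in the markup. It happily reports a price that no user could ever see:

```python
from html.parser import HTMLParser

# A naive "agent" that only reads markup text.
# It has no rendering engine, so CSS and JS are invisible to it.
class NaiveAgent(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        # Collect every text node, hidden or not
        self.text.append(data)

    def sees(self, needle):
        return any(needle in chunk for chunk in self.text)

# The price is in the markup but hidden from every human viewer
html = '<span style="display: none;">$99</span>'
agent = NaiveAgent()
agent.feed(html)

print(agent.sees("$99"))  # True -- the agent "sees" a price no user can see
```

The parser is doing its job correctly. The problem is the question it's answering: "is this string in the markup?" is not the same question as "is this visible on screen?"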


Real Example: The Invisible Form Field

You ask an agent to validate a form:

"Check if the email field is interactive."

The agent receives HTML:

<input type="email" id="email" disabled style="display: none;">

The agent parses this and thinks:

  • Field exists ✓
  • Type is email ✓
  • Attributes are correct ✓
  • Conclusion: "Email field is present and properly configured."

The agent reports: "Email field is ready."

But you're looking at the page. There's no email field visible. It's hidden by display: none and disabled anyway.

The agent hallucinated about what it "saw". The HTML was there, but the visual reality was different.
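You can catch this particular case by actually checking the attributes the checklist above glossed over. A minimal sketch (it only handles inline styles and the disabled attribute; external stylesheets, JavaScript, and overlays still require a real rendering engine, which is the point of this article):

```python
from html.parser import HTMLParser

# Inspect the attributes the naive checklist ignored.
# Inline-style-only: this cannot see external CSS, JS, or modals.
class FieldInspector(HTMLParser):
    def __init__(self, field_id):
        super().__init__()
        self.field_id = field_id
        self.result = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id") != self.field_id:
            return
        hidden = "display: none" in (attrs.get("style") or "")
        disabled = "disabled" in attrs  # boolean attribute, value is None
        self.result = {
            "present": True,
            "hidden": hidden,
            "disabled": disabled,
            "interactive": not hidden and not disabled,
        }

html = '<input type="email" id="email" disabled style="display: none;">'
inspector = FieldInspector("email")
inspector.feed(html)
print(inspector.result["interactive"])  # False -- present, but not usable
```

Even this smarter check only covers what's written in the markup itself. The moment visibility depends on a stylesheet, a script, or an overlapping element, markup inspection runs out of road.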


The Solution: Screenshots as Ground Truth

Stop asking agents to reason about HTML. Show them what actually rendered.

import anthropic
import json
import urllib.request

client = anthropic.Anthropic()

def get_visual_proof(url):
    """Capture what the page actually looks like"""
    api_key = "YOUR_API_KEY"  # pagebolt.dev

    payload = json.dumps({"url": url}).encode()
    req = urllib.request.Request(
        'https://pagebolt.dev/api/v1/screenshot',
        data=payload,
        headers={'x-api-key': api_key, 'Content-Type': 'application/json'},
        method='POST'
    )

    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def verify_form_visibility(url):
    """Agent validates form — with visual proof instead of HTML guessing"""

    # Get visual evidence first
    screenshot = get_visual_proof(url)

    # Now ask agent to look at what actually rendered
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Look at this screenshot of a form. Tell me: Is the email field visible and interactive?"
                    },
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot["image"]
                        }
                    }
                ]
            }
        ]
    )

    return {
        "url": url,
        "screenshot": screenshot["image"],
        "agent_analysis": response.content[0].text,
        "confidence": "HIGH (based on visual proof, not HTML guessing)"
    }

# Run the validation
result = verify_form_visibility("https://example.com/checkout")
print("Validation Result:")
print(json.dumps({
    "url": result["url"],
    "visual_analysis": result["agent_analysis"],
    "confidence": result["confidence"]
}, indent=2))

What changed:

  • Agent no longer guesses based on HTML
  • Agent analyzes actual visual rendering
  • If field is hidden by CSS, agent sees nothing (correct)
  • If field is disabled, agent sees disabled state (correct)
  • No more hallucination about what it "sees"

Why This Matters at Scale

Single agents with hallucinations are one problem. But multi-agent systems amplify the issue.

Agent A hallucinates about what it "saw". Agent B hallucinates about something different. Agent C reports contradictory findings. Your workflow fails because no agent actually saw anything.

Screenshots create ground truth. All agents reference the same visual reality. No more hallucination.
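One way to make that shared reality explicit (a sketch, not part of any particular API): hash the screenshot bytes and attach the hash to every agent's finding, so you can verify that all agents analyzed the same visual state.

```python
import hashlib

# Pin each agent's finding to the exact screenshot it analyzed.
# If two findings cite the same evidence id, they saw the same pixels.
def ground_truth_record(screenshot_b64: str, agent_name: str, finding: str) -> dict:
    evidence_id = hashlib.sha256(screenshot_b64.encode()).hexdigest()[:12]
    return {"agent": agent_name, "evidence": evidence_id, "finding": finding}

shot = "iVBORw0KGgo..."  # base64 PNG (truncated placeholder)
a = ground_truth_record(shot, "agent-a", "email field not visible")
b = ground_truth_record(shot, "agent-b", "email field not visible")
print(a["evidence"] == b["evidence"])  # True -- both cite the same visual state
```

Disagreements between agents then become debuggable: either they saw different screenshots (different evidence ids) or they interpreted the same image differently.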


Try It Now

  1. Get API key at pagebolt.dev (free: 100 requests/month)
  2. Add screenshots to your agent verification tasks
  3. Show agents visual proof instead of HTML
  4. Watch hallucination disappear

Your agents will actually know what they're looking at.

Stop guessing. Start seeing.
