What Your AI Agent Actually Sees vs What You Think It Sees
You ask Claude to navigate a checkout page and verify the price is displayed.
Claude reports: "The price is visible on the page. I can see $99 next to the product name."
You check the page yourself. The price is there. Claude was right.
But here's what actually happened: Claude never saw the price. It only received HTML text. The HTML contained `<span>$99</span>`. Claude parsed that and reported it as "visible".
But what if CSS hid it? What if JavaScript hadn't loaded yet? What if the price was in the HTML but rendered off-screen or behind a modal?
Claude would still report: "I see $99." Even though the user looking at the screen sees nothing.
This is the blind spot. AI agents operate on text, not visuals. They hallucinate about what they "see".
The Agent Vision Problem
When you say "Look at this webpage", an AI agent:
- Gets the HTML markup (text)
- Parses it (looking for keywords, patterns)
- Reasons about what "should" be there
- Confidently reports what it "sees"
It never actually sees anything.
CSS might hide elements: `display: none` removes the element from the rendered page. The agent still sees the HTML. The user sees blank space.
JavaScript might load data dynamically. Agent sees the initial HTML. User sees loaded content seconds later.
Modals might overlay content. Agent sees HTML for the covered element. User sees modal blocking it.
Result: agent confidence in what it "sees" is completely disconnected from visual reality.
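Here's a minimal sketch of the blind spot, using only Python's standard library. A text-only "agent" extracts whatever it can read from the markup; because it never renders CSS, it happily reports a price that `display: none` hides from every real user:

```python
from html.parser import HTMLParser

# Minimal sketch: a text-only "agent" extracting whatever it can read
# from raw markup. It has no rendering step, so CSS is invisible to it.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

# The price is inside a display: none container -- no user ever sees it.
html = '<div style="display: none;"><span>$99</span></div>'

parser = TextExtractor()
parser.feed(html)

print(parser.chunks)  # ['$99'] -- the "agent" reports a price the user never sees
```

The extractor's confident answer is correct about the HTML and wrong about the screen, which is exactly the gap described above.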
Real Example: The Invisible Form Field
You ask an agent to validate a form:
"Check if the email field is interactive."
The agent receives HTML:
```html
<input type="email" id="email" disabled style="display: none;">
```
The agent parses this and thinks:
- Field exists ✓
- Type is email ✓
- Attributes are correct ✓
- Conclusion: "Email field is present and properly configured."
The agent reports: "Email field is ready."
But you're looking at the page. There's no email field visible. It's hidden by `display: none` and disabled anyway.
The agent hallucinated about what it "saw". The HTML was there, but the visual reality was different.
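To make the failure concrete, here's a hedged sketch of what an attribute-level check would have to look at beyond "the field exists". Even this only catches inline styles; stylesheet rules, overlays, and JavaScript changes stay invisible without actually rendering the page:

```python
from html.parser import HTMLParser

# Sketch: checking beyond existence. An honest HTML-only check must at
# least flag `disabled` and inline `display: none` -- and admit that
# external CSS and JavaScript are invisible without a rendered page.
class FieldInspector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.report = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("id") == "email":
            self.report = {
                "exists": True,
                "disabled": "disabled" in attrs,
                "inline_hidden": "display: none" in (attrs.get("style") or ""),
            }

inspector = FieldInspector()
inspector.feed('<input type="email" id="email" disabled style="display: none;">')

print(inspector.report)
# {'exists': True, 'disabled': True, 'inline_hidden': True}
```

A check like this would have correctly reported the field as non-interactive, but it still can't prove what the user sees. That's what the screenshot below is for.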
The Solution: Screenshots as Ground Truth
Stop asking agents to reason about HTML. Show them what actually rendered.
```python
import anthropic
import json
import urllib.request

client = anthropic.Anthropic()

def get_visual_proof(url):
    """Capture what the page actually looks like."""
    api_key = "YOUR_API_KEY"  # pagebolt.dev
    payload = json.dumps({"url": url}).encode()
    req = urllib.request.Request(
        "https://pagebolt.dev/api/v1/screenshot",
        data=payload,
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def verify_form_visibility(url):
    """Validate the form with visual proof instead of HTML guessing."""
    # Get visual evidence first
    screenshot = get_visual_proof(url)

    # Now ask the agent to look at what actually rendered
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Look at this screenshot of a form. Tell me: Is the email field visible and interactive?",
                    },
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot["image"],
                        },
                    },
                ],
            }
        ],
    )
    return {
        "url": url,
        "screenshot": screenshot["image"],
        "agent_analysis": response.content[0].text,
        "confidence": "HIGH (based on visual proof, not HTML guessing)",
    }

# Run the validation
result = verify_form_visibility("https://example.com/checkout")
print("Validation Result:")
print(json.dumps({
    "url": result["url"],
    "visual_analysis": result["agent_analysis"],
    "confidence": result["confidence"],
}, indent=2))
```
What changed:
- Agent no longer guesses based on HTML
- Agent analyzes actual visual rendering
- If the field is hidden by CSS, the agent sees nothing (correct)
- If the field is disabled, the agent sees the disabled state (correct)
- No more hallucinating about what it "sees"
Why This Matters at Scale
Single agents with hallucinations are one problem. But multi-agent systems amplify the issue.
Agent A hallucinates about what it "saw". Agent B hallucinates about something different. Agent C reports contradictory findings. Your workflow fails because no agent actually saw anything.
Screenshots create ground truth. All agents reference the same visual reality. No more hallucination.
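One way to make that ground truth operational is to tie every agent's finding to a fingerprint of the exact screenshot it analyzed, then flag claims that disagree about the same element. This is a minimal illustrative sketch; the function and field names are assumptions, not part of any real API:

```python
import hashlib

# Sketch: anchor every finding to the evidence it came from, then flag
# contradictory claims about the same element. All names here are
# illustrative, not part of any real multi-agent framework.
def screenshot_id(png_bytes):
    """Stable fingerprint so every finding points at the same evidence."""
    return hashlib.sha256(png_bytes).hexdigest()[:12]

def find_contradictions(findings):
    """findings: list of dicts with "agent", "evidence", "element", "claim"."""
    seen = {}
    conflicts = []
    for f in findings:
        key = (f["evidence"], f["element"])
        if key in seen and seen[key] != f["claim"]:
            conflicts.append((f["element"], seen[key], f["claim"]))
        seen.setdefault(key, f["claim"])
    return conflicts

# Two agents looked at the same screenshot but disagree about one element.
evidence = screenshot_id(b"...png bytes from the screenshot API...")
findings = [
    {"agent": "A", "evidence": evidence, "element": "email-field", "claim": "visible"},
    {"agent": "B", "evidence": evidence, "element": "email-field", "claim": "hidden"},
]
print(find_contradictions(findings))  # [('email-field', 'visible', 'hidden')]
```

With a shared artifact to point at, a disagreement becomes a detectable bug instead of three confident, incompatible reports.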
Try It Now
- Get an API key at pagebolt.dev (free: 100 requests/month)
- Add screenshots to your agent verification tasks
- Show agents visual proof instead of HTML
- Watch hallucination disappear
Your agents will actually know what they're looking at.
Stop guessing. Start seeing.