Part 2: Moving from toy scripts to enterprise architecture using Qwen2-VL, Set-of-Mark, and Playwright.
Series: Building AI Web Agents
In Part 1, "I Tried to Teach AI to Click Buttons, and It Missed by 500 Pixels," I shared the painful reality of building my first web agent. I fed a screenshot to a standard multimodal model, asked for coordinates, and watched it hallucinate a click into the white void of a webpage.
It turns out, guessing pixel coordinates is a fragile game.
If you are just playing around, a 70% success rate is cool. If you are building an enterprise agent to automate 1,000 tasks, a 70% success rate means 300 failed tasks, and that is a disaster.
In this post, I'm breaking down the architecture that actually works in production: The Generator-Executor Pattern, powered by Qwen2-VL and Set-of-Mark (SoM).
The Problem: The "Blind" Brain vs. The "Blurry" Eye
Why did my first agent fail? It faced two massive walls that every developer in this space hits eventually:
1. Pure LLMs are Blind
A text-only model (like GPT-4 or Llama 3) fed nothing but the page's HTML fails on modern web apps. It can't see inside <canvas> elements or deep Shadow DOM trees, and it misses visual trickery like a popup advertisement physically covering the "Login" button.
2. Standard VLMs are Blurry
Most older vision models squash your beautiful 4K screenshot into a tiny square. A "Submit" button becomes a smudge. If the model can't distinctly see the button, it definitely can't give you accurate coordinates for it.
The Solution: The Hybrid "Neuro-Symbolic" Stack
To build a production-grade agent, we stop asking the AI to act. Instead, we ask the AI to plan, and let dumb code do the acting.
We need three components:
- The Eye (Qwen2-VL): Specifically this model because it uses Naive Dynamic Resolution (it doesn't squash images) and Multimodal RoPE, a.k.a. M-RoPE (it understands 2D position natively).
- The Map (Set-of-Mark): We don't ask for pixels; we ask for labels.
- The Hand (Playwright): Deterministic execution code.
Here is the Generator-Executor workflow that moves us from "Toy" to "Tool":
┌─────────────────────────────────────────────────────┐
│               GENERATOR-EXECUTOR FLOW               │
└─────────────────────────────────────────────────────┘

🌐 Web Page Loaded
   │
   ▼
📍 STEP 1: SET-OF-MARK INJECTION
   ├─► Inject JavaScript into the page
   ├─► Add red numbered badges to all interactive elements
   └─► Build selector map: {id: selector}
   │
   ▼
📸 Take Screenshot (with numbered labels)
   │
   ▼
🧠 STEP 2: GENERATOR (Qwen2-VL)
   ├─► AI analyzes the screenshot
   ├─► Identifies the target element
   └─► Returns: "Target ID = 42"
   │
   ▼
🎯 STEP 3: EXECUTOR (Playwright)
   ├─► Retrieve selector from window.som_map[42]
   ├─► Snapshot BEFORE state (URL, DOM)
   └─► Execute: page.click(selector)
   │
   ▼
✅ STEP 4: VERIFICATION LOOP
   ├─► Did URL change? ──────────────► ✓ SUCCESS
   ├─► Did DOM change significantly? ─► ✓ SUCCESS
   └─► No change detected? ──────────► ⚠️ SILENT FAILURE
                                          │
                                          ▼
                                 🤖 AI Visual Judge
                                    ├─► Compare before/after screenshots
                                    └─► Analyze what happened
                                          │
                                          ▼
                                    Decision Point:
                                    ├─► Retry ──► loop back to STEP 2
                                    └─► Fail ───► ❌ Report Error & Stop

✓ SUCCESS ──► Continue to next task
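In Python, the whole loop fits in a couple of dozen lines. Below is a minimal sketch of the orchestration: mark_page, get_target_id, and robust_click are the helpers built in Steps 1-3 below, visual_judge is the Step 4 fallback, and the retry count and return strings are illustrative choices rather than a fixed API.

# agent.py - the Generator-Executor loop, end to end (sketch)
from playwright.sync_api import sync_playwright

MAX_RETRIES = 2

def run_step(url: str, user_goal: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        for attempt in range(MAX_RETRIES + 1):
            # STEP 1: inject Set-of-Mark labels, capture the marked screenshot
            before_png = mark_page(page, path=f"before_{attempt}.png")

            # STEP 2: the Generator turns "goal + screenshot" into a label ID
            target_id = get_target_id(before_png, user_goal)

            # STEP 3: the Executor clicks deterministically and self-checks
            result = robust_click(page, target_id, model)  # model = Qwen2-VL loaded in Step 2
            if result.startswith("SUCCESS"):
                return result
            if result.startswith("CRITICAL"):
                continue  # element wasn't clickable; re-mark the page and retry

            # STEP 4: silent failure -> the visual judge breaks the tie
            page.screenshot(path=f"after_{attempt}.png")
            if visual_judge(before_png, f"after_{attempt}.png", user_goal):
                return "SUCCESS: Confirmed by visual judge"
            # otherwise loop back to STEP 2 and retry

        return "FAIL: Retries exhausted"

The rest of this post builds each of those helpers, one step at a time.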
Step 1: Visual Grounding (The Setup)
First, we solve the coordinate problem. Instead of asking the AI to guess pixels, we inject JavaScript to label every interactive element with a big red number.
This is called Set-of-Mark (SoM) prompting.
The Injection Script (JavaScript)
// inject_som.js
// This runs inside the browser via Playwright
function markElements() {
  let id = 0;
  // Select everything clickable
  const elements = document.querySelectorAll('button, a, input, [role="button"]');
  elements.forEach(el => {
    const rect = el.getBoundingClientRect();
    if (rect.width === 0 || rect.height === 0) return; // Skip invisible stuff
    id++;

    // Create the visual badge
    const badge = document.createElement('div');
    badge.style.position = 'absolute';
    badge.style.left = (rect.left + window.scrollX) + 'px'; // rect is viewport-relative
    badge.style.top = (rect.top + window.scrollY) + 'px';
    badge.style.background = 'red';
    badge.style.color = 'white';
    badge.style.fontSize = '12px';
    badge.style.zIndex = '10000';
    badge.textContent = id;
    document.body.appendChild(badge);

    // CRITICAL: Map ID back to a unique selector for code usage
    if (!window.som_map) window.som_map = {};
    window.som_map[id] = getUniqueSelector(el);
  });
}

// Minimal selector builder (a production version would also handle
// shadow roots, iframes, and attribute-based selectors)
function getUniqueSelector(el) {
  if (el.id) return '#' + CSS.escape(el.id);
  const path = [];
  let node = el;
  while (node && node !== document.body) {
    const index = Array.from(node.parentNode.children).indexOf(node) + 1;
    path.unshift(node.tagName.toLowerCase() + ':nth-child(' + index + ')');
    node = node.parentNode;
  }
  return path.length ? 'body > ' + path.join(' > ') : 'body';
}
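Getting this script into the page from Python takes only a few lines of Playwright glue. This is a sketch of the mark_page helper used in the loop above; it assumes the JavaScript is saved as inject_som.js next to the Python file.

# Python side of Step 1: inject the labels, then capture what the AI will see
def mark_page(page, path="marked.png"):
    # Load inject_som.js into the page; this defines markElements()
    # and getUniqueSelector() in the page's global scope
    page.add_script_tag(path="inject_som.js")
    # Draw the red badges and build window.som_map
    page.evaluate("markElements()")
    # The Generator only ever sees this labeled screenshot
    page.screenshot(path=path)
    return path

Because window.som_map lives inside the page itself, the Executor can later resolve any ID the model returns back to a real selector with a single page.evaluate call.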
Step 2: The Generator (Qwen2-VL)
Now we take a screenshot of the page with those red numbers on it and ask Qwen2-VL a multiple-choice question: "The user wants to log in. Which number is the button?"
This changes the task from regression (hard math: predicting exact pixel coordinates) to classification (easy reading: picking a visible number).
# The Python Manager
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the specialist model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

def get_target_id(screenshot_path, user_goal):
    prompt = (
        f"User Goal: {user_goal}. Look at the screenshot with red numbered boxes. "
        "Return ONLY the number of the element needed."
    )
    # ... standard Qwen2-VL inference code ...
    return predicted_number  # e.g., 42
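If you want to fill in that elided inference step, here is one minimal sketch. It follows the standard Hugging Face chat-template recipe for Qwen2-VL and then parses the number out of the reply; the max_new_tokens value and the regex fallback are my own defensive choices, not part of the model's API.

import re
from PIL import Image

def get_target_id(screenshot_path, user_goal):
    prompt = (
        f"User Goal: {user_goal}. Look at the screenshot with red numbered boxes. "
        "Return ONLY the number of the element needed."
    )
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }]

    # Build the chat-formatted prompt and pack text + image together
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image = Image.open(screenshot_path)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to(model.device)

    # Short, deterministic generation: we only want a number back
    output_ids = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    answer = processor.batch_decode(
        output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )[0]

    # Be defensive: pull the first integer out of whatever the model says
    match = re.search(r"\d+", answer)
    return int(match.group()) if match else None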
Step 3: The Executor & The "Silent Failure" Check
This is where production agents die. You click a button... and nothing happens. Did it fail? Or was the site just slow? Or was the button a dud?
We can't just click(). We need a Predict-Verify Loop.
- Predict: Ask the AI before clicking: "If I click 'Save', what should happen?" (Expectation: Network Request).
- Verify: Check if that actually happened.
def robust_click(page, target_id, qwen_model):
    # 1. Retrieve the selector from our JS Map (100% precision)
    selector = page.evaluate(f"window.som_map[{target_id}]")

    # 2. Snapshot state BEFORE action
    url_before = page.url

    # 3. EXECUTE
    try:
        page.click(selector)
    except Exception:
        return "CRITICAL FAIL: Element not clickable"

    # Let the page settle. Some SPAs never reach "networkidle",
    # so a timeout here is not a failure by itself.
    try:
        page.wait_for_load_state("networkidle", timeout=3000)
    except Exception:
        pass

    # 4. VERIFY (The Judge)
    # Did the URL change?
    if page.url != url_before:
        return "SUCCESS: Navigation detected"

    # Did the DOM change significantly?
    # If not, we trigger the AI Judge to compare Before/After screenshots
    return "WARNING: Silent Failure - Needs AI Visual Inspection"
The Verdict: Do We Need Reinforcement Learning?
I initially thought I needed Reinforcement Learning (RL) to train a "Super Agent." I was wrong.
For 95% of use cases, RL is a trap. It's complex, expensive, and hard to debug (if the agent makes a typo, do you punish it?).
The "State-of-the-Art" right now isn't a smarter brain; it's a better system.
- Set-of-Mark fixes the vision.
- Qwen2-VL fixes the reasoning.
- Verification Loops fix the reliability.
By moving to this Generator-Executor pattern, my agent stopped missing by 500 pixels. It now hits the target every single time—because it's not guessing pixels anymore. It's reading map coordinates.
This is Part 2 of my journey into Multimodal AI. In Part 3, I'll be deploying this onto a live server to see how much it costs to run 10,000 steps.