Max Luong

Why Do AI Agents Fail in Production? (And How to Fix the "Silent Click")

Part 2: Moving from toy scripts to enterprise architecture using Qwen2-VL, Set-of-Mark, and Playwright.

Series: Building AI Web Agents

In Part 1 (I Tried to Teach AI to Click Buttons, and It Missed by 500 Pixels), I shared the painful reality of building my first web agent. I fed a screenshot to a standard multimodal model, asked for coordinates, and watched it hallucinate a click into the white void of a webpage.

It turns out, guessing (X, Y) pixel coordinates is a fragile game.

If you are just playing around, a 70% success rate is cool. If you are building an enterprise agent to automate 1,000 tasks, a 70% success rate is a disaster.

In this post, I'm breaking down the architecture that actually works in production: The Generator-Executor Pattern, powered by Qwen2-VL and Set-of-Mark (SoM).

The Problem: The "Blind" Brain vs. The "Blurry" Eye

Why did my first agent fail? It faced two massive walls that every developer in this space hits eventually:

1. Pure LLMs are Blind

A text-only model (like GPT-4 or Llama 3) that only reads the HTML fails on modern web apps. It can't see inside <canvas> elements or deep Shadow DOMs, and it can't detect visual trickery (like a popup advertisement physically covering the "Login" button).

2. Standard VLMs are Blurry

Most older vision models squash your beautiful 4K screenshot into a tiny 336×336 square. A "Submit" button becomes a smudge. If the model can't distinctly see the button, it definitely can't give you accurate coordinates for it.

The Solution: The Hybrid "Neuro-Symbolic" Stack

To build a production-grade agent, we stop asking the AI to act. Instead, we ask the AI to plan, and let dumb code do the acting.

We need three components:

  1. The Eye (Qwen2-VL): Specifically this model because it uses Naive Dynamic Resolution (it doesn't squash images) and M-RoPE (it understands 2D position natively).
  2. The Map (Set-of-Mark): We don't ask for pixels; we ask for labels.
  3. The Hand (Playwright): Deterministic execution code.

Here is the Generator-Executor workflow that moves us from "Toy" to "Tool":

┌─────────────────────────────────────────────────────────────┐
│                    GENERATOR-EXECUTOR FLOW                   │
└─────────────────────────────────────────────────────────────┘

    🌐 Web Page Loaded
         │
         ▼
    📍 STEP 1: SET-OF-MARK INJECTION
         │
         ├─► Inject JavaScript into page
         ├─► Add red numbered badges to all interactive elements
         └─► Build selector map: {id: selector}
         │
         ▼
    📸 Take Screenshot (with numbered labels)
         │
         ▼
    🧠 STEP 2: GENERATOR (Qwen2-VL)
         │
         ├─► AI analyzes screenshot
         ├─► Identifies target element
         └─► Returns: "Target ID = 42"
         │
         ▼
    🎯 STEP 3: EXECUTOR (Playwright)
         │
         ├─► Retrieve selector from window.som_map[42]
         ├─► Snapshot BEFORE state (URL, DOM)
         └─► Execute: page.click(selector)
         │
         ▼
    ✅ STEP 4: VERIFICATION LOOP
         │
         ├─► Did URL change? ──────────────► ✓ SUCCESS
         │
         ├─► Did DOM change significantly? ─► ✓ SUCCESS
         │
         └─► No change detected? ──────────► ⚠️  SILENT FAILURE
                  │
                  ▼
            🤖 AI Visual Judge
                  │
                  ├─► Compare before/after screenshots
                  ├─► Analyze what happened
                  │
                  ▼
            Decision Point:
                  │
                  ├─► Retry ────────────► (loop back to Step 2)
                  │
                  └─► Fail ─────────────► ❌ Report Error & Stop

    ✓ SUCCESS ──► Continue to next task

Step 1: Visual Grounding (The Setup)

First, we solve the coordinate problem. Instead of asking the AI to guess pixels, we inject JavaScript to label every interactive element with a big red number.

This is called Set-of-Mark (SoM) prompting.

The Injection Script (JavaScript)

// inject_som.js
// This runs inside the browser via Playwright
function markElements() {
  let id = 0;
  // Select everything clickable
  const elements = document.querySelectorAll('button, a, input, [role="button"]');

  elements.forEach(el => {
    const rect = el.getBoundingClientRect();
    if (rect.width === 0 || rect.height === 0) return; // Skip invisible stuff
    id++;

    // Create the visual badge (page coordinates, so account for scroll offset)
    const badge = document.createElement('div');
    badge.style.position = 'absolute';
    badge.style.left = (rect.left + window.scrollX) + 'px';
    badge.style.top = (rect.top + window.scrollY) + 'px';
    badge.style.background = 'red';
    badge.style.color = 'white';
    badge.style.fontSize = '12px';
    badge.style.zIndex = '10000';
    badge.textContent = id;
    document.body.appendChild(badge);

    // CRITICAL: Map ID back to a unique selector for code usage
    if (!window.som_map) window.som_map = {};
    window.som_map[id] = getUniqueSelector(el);
  });
}

// Minimal selector helper (a sketch; swap in your own if you already have one)
function getUniqueSelector(el) {
  if (el.id) return '#' + CSS.escape(el.id);
  const path = [];
  while (el && el !== document.body) {
    const index = Array.from(el.parentNode.children).indexOf(el) + 1;
    path.unshift(el.tagName.toLowerCase() + ':nth-child(' + index + ')');
    el = el.parentNode;
  }
  return 'body > ' + path.join(' > ');
}
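The JavaScript above only draws the marks; the Python side still has to load it, run it, and capture the labeled screenshot. Here is a minimal sketch of that glue code (the label_and_capture name and the example URL are mine, not from the original script):

# step1_driver.py, a sketch of the Playwright glue for Step 1
from playwright.sync_api import sync_playwright

def label_and_capture(page, screenshot_path="labeled.png"):
    # Define markElements() (and its helper) in the page's main world...
    page.add_script_tag(path="inject_som.js")
    # ...then run it to draw the badges and build window.som_map
    page.evaluate("markElements()")
    # This is the image the Generator (Qwen2-VL) will reason over
    page.screenshot(path=screenshot_path)
    return screenshot_path

if __name__ == "__main__":
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com")  # placeholder URL
        label_and_capture(page)
        browser.close()

Loading the file with add_script_tag keeps inject_som.js in one place; page.evaluate then just calls the function it defined.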

Step 2: The Generator (Qwen2-VL)

Now, we take a screenshot of those red numbers. We ask Qwen2-VL a multiple-choice question: "User wants to Login. Which Number is the button?"

This changes the task from Regression (hard math) to Classification (easy reading).

# The Python Manager
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the specialist model
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

def get_target_id(screenshot_path, user_goal):
    prompt = f"User Goal: {user_goal}. Look at the screenshot with red numbered boxes. Return ONLY the number of the element needed."

    # ... standard Qwen2-VL inference code ...

    return predicted_number # e.g., 42
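The "standard Qwen2-VL inference code" comment hides a fair amount of boilerplate. For reference, here is a hedged sketch of what that step usually looks like with the qwen_vl_utils helper package (pip install qwen-vl-utils); ask_qwen is my own wrapper name, it reuses the model and processor loaded above, and exact arguments can shift between transformers versions:

# A sketch of the inference boilerplate hidden behind the "..." above
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

def ask_qwen(image_paths, prompt, max_new_tokens=16):
    # One user turn containing the screenshot(s) plus the instruction
    content = [{"type": "image", "image": f"file://{p}"} for p in image_paths]
    content.append({"type": "text", "text": prompt})
    messages = [{"role": "user", "content": content}]

    # Standard Qwen2-VL preprocessing: chat template + vision inputs
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to(model.device)

    # Generate a short answer and strip the prompt tokens from the output
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0].strip()

With this in place, get_target_id can simply return int(ask_qwen([screenshot_path], prompt)), plus a little parsing for the times the model answers "42." instead of "42".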

Step 3: The Executor & The "Silent Failure" Check

This is where production agents die. You click a button... and nothing happens. Did it fail? Or was the site just slow? Or was the button a dud?

We can't just click(). We need a Predict-Verify Loop.

  • Predict: Ask the AI before clicking: "If I click 'Save', what should happen?" (Expectation: Network Request).
  • Verify: Check if that actually happened.
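The Predict half can be a single extra question to the same Generator. A sketch, reusing the ask_qwen wrapper from Step 2 (predict_expectation and the three expectation labels are my own, not from the post):

def predict_expectation(screenshot_path, target_id, user_goal):
    # Ask BEFORE acting, so the verifier knows what evidence to look for afterwards
    prompt = (
        f"User Goal: {user_goal}. I am about to click the element labeled [{target_id}]. "
        "What should happen? Answer with one word: NAVIGATION, DOM_CHANGE, or NOTHING."
    )
    return ask_qwen([screenshot_path], prompt)

The Verify half is the robust_click helper below, and the prediction tells you which of its checks should fire: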
from playwright.sync_api import TimeoutError as PlaywrightTimeout

def robust_click(page, target_id, qwen_model):
    # 1. Retrieve the selector from our JS Map (100% precision)
    selector = page.evaluate(f"window.som_map[{target_id}]")

    # 2. Snapshot state BEFORE action
    url_before = page.url
    dom_before = page.content()

    # 3. EXECUTE
    try:
        page.click(selector, timeout=3000)
    except Exception:  # strict-mode violation, detached node, unclickable...
        return "CRITICAL FAIL: Element not clickable"

    # Let the page settle; a timeout here is not a failure by itself
    try:
        page.wait_for_load_state("networkidle", timeout=3000)
    except PlaywrightTimeout:
        pass

    # 4. VERIFY (The Judge)
    # Did the URL change?
    if page.url != url_before:
        return "SUCCESS: Navigation detected"

    # Did the DOM change significantly? (crude heuristic: serialized-size delta)
    if abs(len(page.content()) - len(dom_before)) > 500:
        return "SUCCESS: DOM mutation detected"

    # If not, we trigger the AI Judge to compare Before/After screenshots
    return "WARNING: Silent Failure - Needs AI Visual Inspection"
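The last branch, the AI Visual Judge, is what catches the outcomes a URL or DOM diff can't: a toast, an inline validation error, a modal that swallowed the click. A minimal sketch, again reusing ask_qwen (visual_judge is my own name; the before/after screenshots are whatever you saved around the click):

def visual_judge(before_png, after_png, user_goal):
    # The model sees both screenshots and must judge the outcome visually
    prompt = (
        f"User Goal: {user_goal}. The first image is BEFORE the click, the second is AFTER. "
        "Did the click visibly move us toward the goal? "
        "Answer with exactly one word: SUCCESS, RETRY, or FAIL."
    )
    return ask_qwen([before_png, after_png], prompt)

Wire its RETRY verdict back into Step 2 (re-label, re-screenshot, re-ask) and cap the attempts so a genuinely dead button ends in a clean error instead of an infinite loop.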

The Verdict: Do We Need Reinforcement Learning?

I initially thought I needed Reinforcement Learning (RL) to train a "Super Agent." I was wrong.

For 95% of use cases, RL is a trap. It's complex, expensive, and hard to debug (if the agent makes a typo, do you punish it?).

The "State-of-the-Art" right now isn't a smarter brain; it's a better system.

  • Set-of-Mark fixes the vision.
  • Qwen2-VL fixes the reasoning.
  • Verification Loops fix the reliability.

By moving to this Generator-Executor pattern, my agent stopped missing by 500 pixels. It now hits the target every single time—because it's not guessing pixels anymore. It's reading map coordinates.
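For completeness, one full agent step wired together looks roughly like this (a sketch: label_and_capture and visual_judge are the helper sketches above, get_target_id and robust_click come from Steps 2 and 3, model is the Qwen2-VL model loaded in Step 2, and the retry cap is arbitrary):

def run_step(page, user_goal, max_attempts=3):
    for attempt in range(max_attempts):
        # Step 1: label the page and capture the screenshot the Generator sees
        before = label_and_capture(page, f"step_{attempt}_before.png")

        # Step 2: the Generator picks a badge number
        target_id = get_target_id(before, user_goal)

        # Step 3: deterministic execution plus the cheap checks
        result = robust_click(page, target_id, model)
        if result.startswith("SUCCESS"):
            return result

        # Step 4: possible silent failure, so let the visual judge decide
        after = f"step_{attempt}_after.png"
        page.screenshot(path=after)
        verdict = visual_judge(before, after, user_goal)
        if verdict.startswith("SUCCESS"):
            return "SUCCESS: Confirmed visually"
        if verdict.startswith("FAIL"):
            break  # do not burn retries on a dead end

    return "FAIL: Could not complete step"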


This is Part 2 of my journey into Multimodal AI. In Part 3, I'll be deploying this onto a live server to see how much it costs to run 10,000 steps.
