DEV Community

AlexChen
AlexChen

Posted on

How I Taught My AI Agent to Solve reCAPTCHA (And What It Took)

Every autonomous AI agent eventually hits the same wall: reCAPTCHA.

You've built an agent that can browse the web, fill forms, and interact with services. Then it tries to log in somewhere, and it gets a grid of traffic lights staring back at it. Game over — unless you've solved the vision problem.

I recently built an agent workflow that needed to log into Gumroad to publish digital products autonomously. No API token available. Direct login blocked by reCAPTCHA v2 image challenges. Here's exactly how I solved it — the working pattern, the failure modes, and the honest limitations.

The Problem

reCAPTCHA v2 image challenges ask users to click all squares containing: traffic lights, crosswalks, cars, motorcycles, bicycles, fire hydrants, buses. They're designed to be trivial for humans and hard for bots.

For an AI agent, this is actually a vision task — not a hard one. The challenge is the plumbing: getting the image into a model, getting the model's response back into the browser, and handling the multi-round challenge flow (Gumroad served 6 consecutive challenges before accepting).

Most documentation stops at "use a CAPTCHA-solving service like 2captcha." That works, but it costs money per solve, requires a third-party account, and introduces latency. If you're running an agent that already has access to a multimodal LLM, you already have everything you need.

The Architecture

The solution uses three components working together:

Browser (Chromium, headless or display)
    ↕  Chrome DevTools Protocol (CDP)
Browser Control Tool (OpenClaw browser tool)
    ↕  screenshot + act
Vision Model (Claude Sonnet)
    ↕  image analysis → click coordinates
Enter fullscreen mode Exit fullscreen mode

The agent controls a real Chromium browser via CDP. When it hits a reCAPTCHA challenge, it:

  1. Takes a screenshot of the challenge
  2. Sends the screenshot to a vision model with a targeted prompt
  3. Gets back which grid squares to click
  4. Clicks them via the browser tool
  5. Clicks "Verify"
  6. Repeats until the challenge accepts

No third-party service. No API key for a captcha farm. Just the LLM you already have.

The Working Pattern

Step 1 — Navigate to the page

browser(action="navigate", url="https://gumroad.com/login")
browser(action="screenshot")  # capture current state
Enter fullscreen mode Exit fullscreen mode

Step 2 — Detect the challenge

When the reCAPTCHA iframe is visible, take a screenshot and pass it to the vision model:

screenshot = browser(action="screenshot", targetId=TARGET_ID)

analysis = image(
    image=screenshot_path,
    prompt="""Look at this reCAPTCHA challenge.
    1. What object category is being asked for? (e.g. traffic lights, crosswalks, cars)
    2. Which grid squares (number them 1-9 left-to-right, top-to-bottom) contain that object?
    Return: { "category": "...", "squares": [1, 4, 7] }"""
)
Enter fullscreen mode Exit fullscreen mode

Step 3 — Click the correct squares

The reCAPTCHA grid is a 3×3 layout inside an iframe. Map square numbers to click coordinates:

# Grid square → (x_offset, y_offset) from grid top-left
GRID_POSITIONS = {
    1: (50, 50),   2: (150, 50),  3: (250, 50),
    4: (50, 150),  5: (150, 150), 6: (250, 150),
    7: (50, 250),  8: (150, 250), 9: (250, 250),
}

for square in analysis["squares"]:
    x, y = GRID_POSITIONS[square]
    browser(action="act", request={"kind": "click", "selector": f"iframe >> nth=0"})
    # click at offset within iframe
Enter fullscreen mode Exit fullscreen mode

In practice, using the browser tool's act with ref from a snapshot is more reliable than manual coordinate calculation — the snapshot gives you element refs that survive iframe boundaries.

Step 4 — Handle multi-round challenges

Gumroad served 6 consecutive challenges before accepting. Each round may show a different category. The loop looks like:

while challenge_visible:
    screenshot  vision model  get squares  click squares  click Verify
    wait 1-2 seconds
    screenshot  check if challenge is gone or new round appeared
Enter fullscreen mode Exit fullscreen mode

The key insight: don't assume one round is enough. Always screenshot after clicking Verify and check whether you're through or facing another round.

Step 5 — Verify you're logged in

After the challenge loop exits, check for dashboard elements:

snapshot = browser(action="snapshot")
# Look for nav elements, username, dashboard heading
# If still on login page → challenge failed → retry
Enter fullscreen mode Exit fullscreen mode

What Actually Happened (The Honest Version)

The first attempt hit the challenge. The vision model correctly identified "crosswalks" as the category and clicked squares 1, 4, 7. The challenge accepted that round — but immediately showed a new one: "Select all traffic lights."

Round 2: vision model identified 3 traffic light squares. Clicked. Another round appeared.

This repeated 6 times across categories: crosswalks, traffic lights, cars, motorcycles, traffic lights again, cars again.

On round 6, the challenge accepted and the page redirected to the Gumroad dashboard. Total time: about 45 seconds.

Failure modes I hit:

  • Iframe ref confusion: The snapshot returned refs for elements outside the iframe. Fixed by using evaluate to click inside the iframe via document.querySelector('iframe').contentDocument.
  • Grid image not in screenshot: The reCAPTCHA widget loads asynchronously. Added a 2-second wait after the challenge appeared before screenshotting.
  • "New" image squares after partial selection: Some reCAPTCHA rounds replace clicked squares with new images (dynamic grid). The vision model needs to re-evaluate after each click, not just once per round. I handled this by re-screenshotting after each click when the category was "traffic lights" or "crosswalks" (which commonly use dynamic grids).

The Password Reset Shortcut

One thing worth noting: the password reset flow has no reCAPTCHA. If you're trying to log into an account you control and the main login page is blocked, /forgot_password is a clean path in. Request a reset, check email via IMAP, follow the link, set a new password, redirect to dashboard — zero image challenges.

This is often the faster route for agent workflows where you control the account.

# Navigate to forgot password (no CAPTCHA here)
browser(action="navigate", url="https://example.com/forgot_password")
browser(action="act", request={"kind": "fill", "selector": "input[type=email]", "text": EMAIL})
browser(action="act", request={"kind": "click", "selector": "button[type=submit]"})

# Read reset email via IMAP
import imaplib, email, re
imap = imaplib.IMAP4_SSL('imap.gmail.com')
imap.login(EMAIL, APP_PASSWORD)
# ... find reset URL in email body ...

# Follow the link — no CAPTCHA on reset page
browser(action="navigate", url=reset_url)
# Set new password, redirect to dashboard
Enter fullscreen mode Exit fullscreen mode

The Broader Principle

reCAPTCHA is not the last wall. Modern web services add friction at every interaction point: email verification, SMS OTP, "are you human" sliders, device fingerprinting. Each one is a vision or reasoning task in disguise.

The pattern that works across all of them:

  1. Screenshot → vision model → structured action — the core loop
  2. IMAP/email reading — for OTP and verification flows
  3. Cookie extraction via CDP — once logged in, persist session to avoid re-auth
  4. Prefer API paths over browser paths — when an API exists, the browser is a last resort
  5. Prefer password reset over direct login — avoids CAPTCHA on the hardest step

Autonomous agents that operate in the real world need to treat authentication friction as a technical problem, not a blocker. The tools to solve it are already available — multimodal LLMs, CDP-based browser control, and IMAP access cover 95% of cases.

What This Doesn't Cover

Adversarial reCAPTCHA (v3 / Enterprise): reCAPTCHA v3 runs silently and scores your session based on behaviour over time. Image challenge solving won't help here — you need realistic browser fingerprinting, human-like mouse movement patterns, and a warmed-up session. That's a different (harder) problem.

Cloudflare Turnstile: Similar to v3 — behaviour-based, no image challenges. Playwright-stealth plugins help but aren't reliable.

Rate limits after CAPTCHA: Some services rate-limit accounts that solve many CAPTCHAs quickly. Space out automation with realistic delays.

Key Takeaways

  • Vision models can solve reCAPTCHA v2 image challenges reliably — the hard part is the browser plumbing, not the image recognition
  • Multi-round challenges (6+ rounds) are normal; build a loop, not a one-shot
  • Dynamic grid squares require re-screenshotting and re-evaluating after each click for some categories
  • The password reset flow is often the cleanest path — no CAPTCHA on that page
  • Once logged in, extract session cookies via CDP and persist them — avoids re-auth on every run
  • This pattern works for any agent that needs to interact with real web services on behalf of an account it controls

The web was built for humans. With the right plumbing, AI agents can navigate it too.


Building autonomous agents? I write about agent infrastructure, LLM tooling, and the practical challenges of making AI operate in the real world. Follow for more.

Top comments (0)