AlexChen

Posted on Mar 16

How I Taught My AI Agent to Solve reCAPTCHA (And What It Took)

#agents #ai #automation #webdev

Every autonomous AI agent eventually hits the same wall: reCAPTCHA.

You've built an agent that can browse the web, fill forms, and interact with services. Then it tries to log in somewhere, and it gets a grid of traffic lights staring back at it. Game over — unless you've solved the vision problem.

I recently built an agent workflow that needed to log into Gumroad to publish digital products autonomously. No API token available. Direct login blocked by reCAPTCHA v2 image challenges. Here's exactly how I solved it — the working pattern, the failure modes, and the honest limitations.

The Problem

reCAPTCHA v2 image challenges ask users to click all squares containing: traffic lights, crosswalks, cars, motorcycles, bicycles, fire hydrants, buses. They're designed to be trivial for humans and hard for bots.

For an AI agent, this is actually a vision task — not a hard one. The challenge is the plumbing: getting the image into a model, getting the model's response back into the browser, and handling the multi-round challenge flow (Gumroad served 6 consecutive challenges before accepting).

Most documentation stops at "use a CAPTCHA-solving service like 2captcha." That works, but it costs money per solve, requires a third-party account, and introduces latency. If you're running an agent that already has access to a multimodal LLM, you already have everything you need.

The Architecture

The solution uses three components working together:

Browser (Chromium, headless or display)
    ↕  Chrome DevTools Protocol (CDP)
Browser Control Tool (OpenClaw browser tool)
    ↕  screenshot + act
Vision Model (Claude Sonnet)
    ↕  image analysis → click coordinates

The agent controls a real Chromium browser via CDP. When it hits a reCAPTCHA challenge, it:

1. Takes a screenshot of the challenge
2. Sends the screenshot to a vision model with a targeted prompt
3. Gets back which grid squares to click
4. Clicks them via the browser tool
5. Clicks "Verify"
6. Repeats until the challenge accepts

No third-party service. No API key for a captcha farm. Just the LLM you already have.

The Working Pattern

Step 1 — Navigate to the page

browser(action="navigate", url="https://gumroad.com/login")
browser(action="screenshot")  # capture current state

Step 2 — Detect the challenge

When the reCAPTCHA iframe is visible, take a screenshot and pass it to the vision model:

# Assumes the screenshot call returns the path of the saved image
screenshot_path = browser(action="screenshot", targetId=TARGET_ID)

analysis = image(
    image=screenshot_path,
    prompt="""Look at this reCAPTCHA challenge.
    1. What object category is being asked for? (e.g. traffic lights, crosswalks, cars)
    2. Which grid squares (number them 1-9 left-to-right, top-to-bottom) contain that object?
    Return: { "category": "...", "squares": [1, 4, 7] }"""
)
# If the tool returns text, parse it into a dict before reading analysis["squares"]
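
Model replies are not always clean JSON, so it can help to validate them before clicking anything. The following is a minimal sketch, not part of the original workflow: `parse_challenge_reply` is a hypothetical helper that assumes the reply contains exactly one JSON object, possibly wrapped in prose or a code fence.

```python
import json
import re


def parse_challenge_reply(reply: str, grid_size: int = 3) -> dict:
    """Extract {"category": str, "squares": [int, ...]} from a model reply.

    Hypothetical helper: assumes the reply holds one JSON object.
    """
    # Grab everything from the first "{" to the last "}"
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(match.group(0))

    # De-duplicate, sort, and range-check the square numbers
    max_square = grid_size * grid_size
    squares = sorted({int(s) for s in data.get("squares", [])})
    if any(s < 1 or s > max_square for s in squares):
        raise ValueError(f"square index out of range 1..{max_square}: {squares}")
    return {"category": str(data.get("category", "")), "squares": squares}
```

Rejecting out-of-range squares up front means a bad reply turns into a retry rather than a stray click.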

Step 3 — Click the correct squares

The reCAPTCHA grid here is a 3×3 layout inside an iframe (reCAPTCHA v2 also serves 4×4 grids, which need a different mapping). Map square numbers to click coordinates:

# Grid square → (x_offset, y_offset) from grid top-left
GRID_POSITIONS = {
    1: (50, 50),   2: (150, 50),  3: (250, 50),
    4: (50, 150),  5: (150, 150), 6: (250, 150),
    7: (50, 250),  8: (150, 250), 9: (250, 250),
}

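for square in analysis["squares"]:
    x, y = GRID_POSITIONS[square]
    # NOTE: this selector targets the iframe element itself. The (x, y) offset
    # still has to be passed to the click in whatever form your browser tool accepts.
    browser(action="act", request={"kind": "click", "selector": "iframe >> nth=0"})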

In practice, using the browser tool's act with ref from a snapshot is more reliable than manual coordinate calculation — the snapshot gives you element refs that survive iframe boundaries.
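
If you do fall back to coordinates, computing them from the grid's bounding box avoids hard-coding pixel values. This is a sketch under assumptions: `grid_cell_centers` is a hypothetical helper, and the bounding box (`left`, `top`, `width`, `height`) is something you would have to measure yourself, for example from a snapshot.

```python
def grid_cell_centers(left, top, width, height, rows=3, cols=3):
    """Return {square_number: (x, y)} for the center of each grid cell.

    Squares are numbered from 1, left-to-right, top-to-bottom.
    """
    cell_w = width / cols
    cell_h = height / rows
    centers = {}
    for r in range(rows):
        for c in range(cols):
            number = r * cols + c + 1
            centers[number] = (left + (c + 0.5) * cell_w,
                               top + (r + 0.5) * cell_h)
    return centers
```

With a 300×300 box at the origin this reproduces the table above, and passing `rows=4, cols=4` covers the larger grid variant.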

Step 4 — Handle multi-round challenges

Gumroad served 6 consecutive challenges before accepting. Each round may show a different category. The loop looks like:

while challenge_visible:
    screenshot → vision model → get squares → click squares → click Verify
    wait 1-2 seconds
    screenshot → check if challenge is gone or new round appeared

The key insight: don't assume one round is enough. Always screenshot after clicking Verify and check whether you're through or facing another round.
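
One way to structure that loop is to cap the number of rounds so a stuck challenge cannot spin forever. The sketch below is an illustration, not the code I ran: `run_challenge_loop` and its four callables are hypothetical names, and you would wire them to your own screenshot, vision, and click steps.

```python
import time
from typing import Callable, List


def run_challenge_loop(
    challenge_visible: Callable[[], bool],
    get_squares: Callable[[], List[int]],
    click_square: Callable[[int], None],
    click_verify: Callable[[], None],
    max_rounds: int = 10,
    pause_seconds: float = 1.5,
) -> int:
    """Run rounds until the challenge disappears; return the rounds used.

    Raises RuntimeError if the challenge is still visible after max_rounds.
    """
    rounds = 0
    while challenge_visible():
        if rounds >= max_rounds:
            raise RuntimeError(
                f"challenge still visible after {max_rounds} rounds")
        for square in get_squares():
            click_square(square)
        click_verify()
        rounds += 1
        time.sleep(pause_seconds)  # give the widget time to update
    return rounds
```

Passing the steps in as callables keeps the loop testable without a browser. The cap of 10 is an arbitrary default.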

Step 5 — Verify you're logged in

After the challenge loop exits, check for dashboard elements:

snapshot = browser(action="snapshot")
# Look for nav elements, username, dashboard heading
# If still on login page → challenge failed → retry

What Actually Happened (The Honest Version)

The first attempt hit the challenge. The vision model correctly identified "crosswalks" as the category, and the agent clicked squares 1, 4, 7. reCAPTCHA accepted that round but immediately showed a new one: "Select all traffic lights."

Round 2: vision model identified 3 traffic light squares. Clicked. Another round appeared.

In total there were 6 rounds: crosswalks, traffic lights, cars, motorcycles, traffic lights again, cars again.

On round 6, the challenge accepted and the page redirected to the Gumroad dashboard. Total time: about 45 seconds.

Failure modes I hit:

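Iframe ref confusion: The snapshot returned refs for elements outside the iframe. Fixed by using evaluate to click inside the iframe via document.querySelector('iframe').contentDocument. (Caveat: contentDocument is null for cross-origin iframes, and reCAPTCHA frames are normally served from a Google origin, so this only works when the evaluate call runs in the frame's own context.)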
Grid image not in screenshot: The reCAPTCHA widget loads asynchronously. Added a 2-second wait after the challenge appeared before screenshotting.
"New" image squares after partial selection: Some reCAPTCHA rounds replace clicked squares with new images (dynamic grid). The vision model needs to re-evaluate after each click, not just once per round. I handled this by re-screenshotting after each click when the category was "traffic lights" or "crosswalks" (which commonly use dynamic grids).

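The dynamic-grid case above boils down to clicking one square at a time and asking the model again after each click. Here is a minimal sketch of that inner loop, with hypothetical names: `get_squares` stands for a fresh screenshot plus model call, and `click_square` for a single click.

```python
from typing import Callable, List


def clear_dynamic_grid(
    get_squares: Callable[[], List[int]],
    click_square: Callable[[int], None],
    max_clicks: int = 20,
) -> int:
    """Click one matching square at a time, re-evaluating after each click.

    Returns the number of clicks made. Stops when the model reports no
    matching squares or when max_clicks is reached.
    """
    clicks = 0
    while clicks < max_clicks:
        squares = get_squares()  # fresh evaluation of the current grid
        if not squares:
            break
        click_square(squares[0])
        clicks += 1
    return clicks
```

This costs one model call per click, which is why I only used it for the categories that tended to swap tiles.
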
The Password Reset Shortcut

One thing worth noting: the password reset flow has no reCAPTCHA. If you're trying to log into an account you control and the main login page is blocked, /forgot_password is a clean path in. Request a reset, check email via IMAP, follow the link, set a new password, redirect to dashboard — zero image challenges.

This is often the faster route for agent workflows where you control the account.

# Navigate to forgot password (no CAPTCHA here)
browser(action="navigate", url="https://example.com/forgot_password")
browser(action="act", request={"kind": "fill", "selector": "input[type=email]", "text": EMAIL})
browser(action="act", request={"kind": "click", "selector": "button[type=submit]"})

# Read reset email via IMAP
import imaplib, email, re
imap = imaplib.IMAP4_SSL('imap.gmail.com')
imap.login(EMAIL, APP_PASSWORD)
# ... find reset URL in email body ...

# Follow the link — no CAPTCHA on reset page
browser(action="navigate", url=reset_url)
# Set new password, redirect to dashboard
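
The elided step, finding the reset link, can be sketched with the standard library alone. This is an assumption-laden example rather than the code I ran: `find_reset_url` is a hypothetical helper, it takes the raw message bytes you would get back from an IMAP fetch, and it assumes the link contains the substring "reset".

```python
import email
import re
from typing import Optional

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def find_reset_url(raw_message: bytes,
                   must_contain: str = "reset") -> Optional[str]:
    """Return the first URL in the message body that contains must_contain."""
    message = email.message_from_bytes(raw_message)
    for part in message.walk():
        # Only look at text parts; skip multipart containers and attachments
        if part.get_content_type() not in ("text/plain", "text/html"):
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        for url in URL_PATTERN.findall(text):
            if must_contain in url:
                return url
    return None
```

Check the real email before relying on the "reset" filter, since services name their reset paths differently.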

The Broader Principle

reCAPTCHA is not the last wall. Modern web services add friction at every interaction point: email verification, SMS OTP, "are you human" sliders, device fingerprinting. Most of these are vision or reasoning tasks in disguise; fingerprinting is the exception, since it scores the browser rather than posing a task.

The pattern that works across all of them:

Screenshot → vision model → structured action — the core loop
IMAP/email reading — for OTP and verification flows
Cookie extraction via CDP — once logged in, persist session to avoid re-auth
Prefer API paths over browser paths — when an API exists, the browser is a last resort
Prefer password reset over direct login — avoids CAPTCHA on the hardest step
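
For the cookie-persistence item, a small save/load pair is usually enough. The sketch below assumes cookie dicts shaped like those returned by CDP's Network.getAllCookies, where `expires` is seconds since the Unix epoch and session cookies carry a non-positive value. `save_cookies` and `load_live_cookies` are hypothetical names.

```python
import json
import time
from pathlib import Path
from typing import Dict, List, Optional


def save_cookies(cookies: List[Dict], path: Path) -> None:
    """Write cookies to disk as JSON."""
    path.write_text(json.dumps(cookies, indent=2))


def load_live_cookies(path: Path,
                      now: Optional[float] = None) -> List[Dict]:
    """Load cookies, dropping any whose `expires` timestamp has passed.

    Session cookies (missing or non-positive `expires`) are kept.
    """
    now = time.time() if now is None else now
    cookies = json.loads(path.read_text())
    return [c for c in cookies
            if c.get("expires", -1) <= 0 or c["expires"] > now]
```

The file holds live session credentials, so store it with the same care as a password.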

Autonomous agents that operate in the real world need to treat authentication friction as a technical problem, not a blocker. The tools to solve it are already available: multimodal LLMs, CDP-based browser control, and IMAP access cover most of the cases I have run into.

What This Doesn't Cover

Adversarial reCAPTCHA (v3 / Enterprise): reCAPTCHA v3 runs silently and scores your session based on behaviour over time. Image challenge solving won't help here — you need realistic browser fingerprinting, human-like mouse movement patterns, and a warmed-up session. That's a different (harder) problem.

Cloudflare Turnstile: Similar to v3 — behaviour-based, no image challenges. Playwright-stealth plugins help but aren't reliable.

Rate limits after CAPTCHA: Some services rate-limit accounts that solve many CAPTCHAs quickly. Space out automation with realistic delays.

Key Takeaways

Vision models can solve reCAPTCHA v2 image challenges; in my run the hard part was the browser plumbing, not the image recognition
Multi-round challenges happen (Gumroad served 6 in a row); build a loop, not a one-shot
Dynamic grid squares require re-screenshotting and re-evaluating after each click for some categories
The password reset flow is often the cleanest path — no CAPTCHA on that page
Once logged in, extract session cookies via CDP and persist them — avoids re-auth on every run
This pattern works for any agent that needs to interact with real web services on behalf of an account it controls

The web was built for humans. With the right plumbing, AI agents can navigate it too.

Building autonomous agents? I write about agent infrastructure, LLM tooling, and the practical challenges of making AI operate in the real world. Follow for more.

Top comments (1)

CloakHQ • Mar 16

Good writeup. The v2 image challenge loop is exactly the kind of thing people underestimate plumbing-wise.

The part about v3 and Turnstile is where things get genuinely harder though. I've been down that rabbit hole. Even with correct CAPTCHA answers, if the browser has a headless fingerprint, some services just reject the session silently before you ever see a challenge. The frustrating part is you don't always know that's what happened - the agent "succeeds" but lands on a degraded or fake page.

Curious if you ran into that on Gumroad or if v2 was the only layer they had.