Daniel Castillo

I built a skill that solves reCAPTCHA with an LLM — here's how it actually works

How it started

I was testing Claude Code with the Chrome DevTools MCP server for browser automation. A reCAPTCHA popped up mid-flow. I asked Claude to solve it.

It sort of worked — but it was slow, unreliable, and frequently timed out. So I did what any engineer would do: I turned it into a proper skill and let the agent iterate on it until it actually worked consistently.

Why the naive approach fails

The obvious way to automate a CAPTCHA with a browser agent is: take a snapshot of the accessibility tree → get the element UID → click(uid).

This fails for three structural reasons:

1. iframes kill the accessibility tree. reCAPTCHA renders everything inside cross-origin iframes. Elements inside these iframes show up as "ignored" in the accessibility tree with no assignable UIDs. The standard click(uid) approach simply can't see them.

2. The timer is brutal. reCAPTCHA gives you 2 minutes. Each tool call — screenshot, script evaluation, LLM analysis — takes 1-10 seconds. An unoptimized flow (e.g., taking individual screenshots of each of the 9 tiles) can exhaust the timer before you even click verify.

3. The tiles are too small. At native size, tiles in a 3×3 grid are ~100px. At that resolution, the vision model confuses visually similar objects — buses vs. cars vs. motorcycles — often enough to fail the first attempt and force a restart.

How the skill was built: iterative learning

This wasn't designed upfront. The skill evolved through trial and error — each failed attempt taught the agent something new, and the skill got updated with that knowledge.

The first version tried the accessibility tree approach and hit the iframe wall. The second version switched to evaluate_script for direct DOM access but kept timing out on 3×3 grids because it was screenshotting tiles one by one. The third version introduced the zoom trick after the agent noticed tiles were too small to classify reliably. Each iteration refined the approach until the flow was fast and reliable enough to consistently beat the 2-minute timer.

The result is a Claude Code skill — a SKILL.md with structured instructions and a set of helper scripts — that the agent loads and follows when it encounters a reCAPTCHA. In the video, CLAUDE.md references the skill so it loads automatically, making it look like the agent just figures it out on the fly. But behind the scenes, there's a well-tested playbook.

The solution: a 5-round flow

The entire flow is designed around one constraint: minimize total tool calls.

Round 1 — Click the checkbox

evaluate_script accesses iframe[0].contentDocument directly and calls .click() on #recaptcha-anchor. This bypasses the accessibility tree entirely — it's the only reliable way to interact with cross-origin iframe content.
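As a minimal sketch, the Round 1 script body looks roughly like this. The `#recaptcha-anchor` selector and the `iframe[0]` position come from the flow above; the function wrapper, return shape, and `doc` parameter (standing in for the page document) are illustrative:

```javascript
// Sketch of the Round 1 evaluate_script body. The #recaptcha-anchor
// selector is what reCAPTCHA renders; the wrapper and return shape
// are illustrative. `doc` stands in for the page document.
function clickRecaptchaCheckbox(doc) {
  // The checkbox lives inside the first reCAPTCHA iframe.
  const frame = doc.querySelectorAll('iframe')[0];
  const inner = frame && frame.contentDocument;
  if (!inner) return { clicked: false, reason: 'iframe not accessible' };
  const anchor = inner.querySelector('#recaptcha-anchor');
  if (!anchor) return { clicked: false, reason: 'anchor not found' };
  anchor.click(); // direct DOM click, no accessibility-tree UID needed
  return { clicked: true };
}
```

Returning a structured result instead of throwing lets the agent decide whether to retry or fall back without burning another tool call on diagnosis.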

Round 2 — Detect the challenge

Another evaluate_script, this time on the challenge iframe (usually iframe[2]), reads .rc-imageselect-desc to get the challenge text ("Select all images with traffic lights") and counts td[role="button"] to determine grid size: 9 tiles = 3×3, 16 tiles = 4×4. If it returns state: 'loading', it waits 1 second and retries.
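A sketch of that detection script, assuming the selectors named above (`.rc-imageselect-desc`, `td[role="button"]`); the return shape and `frameDoc` parameter (the challenge iframe's document) are illustrative:

```javascript
// Sketch of the Round 2 detection body. Selectors are the ones
// reCAPTCHA uses; the return shape is illustrative. `frameDoc` is
// the challenge iframe's document.
function readChallenge(frameDoc) {
  const desc = frameDoc.querySelector('.rc-imageselect-desc');
  if (!desc) return { state: 'loading' }; // caller waits 1s and retries
  const tileCount = frameDoc.querySelectorAll('td[role="button"]').length;
  return {
    state: 'ready',
    text: desc.textContent.trim(),
    grid: tileCount === 16 ? '4x4' : '3x3', // 9 tiles = 3x3, 16 = 4x4
    tileCount,
  };
}
```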

Round 3 — Analyze the images (the core of the skill)

This is where the strategy diverges based on grid size:

For 3×3 grids: Apply iframe.style.transform = 'scale(2)' via evaluate_script, doubling tile size from ~100px to ~200px. Take a single fullPage=true screenshot. Launch one sub-agent that receives the image + challenge text and returns matching indices (e.g., MATCHES: 3, 5, 8).

Why not fetch individual tile images? The 9 tiles are actually a single CSS sprite with different background-position values. Fetching the image URL gives you the complete sprite, not individual tiles. Individual screenshots by UID would cost 9 tool calls. One zoomed full-page screenshot solves both problems. This was one of those things the agent discovered through iteration — the first attempt tried fetching tile URLs and got the same image 9 times.
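The sub-agent's reply still has to be turned into tile indices. A small parser for the `MATCHES:` format might look like this (the reply format is from the flow above; the parser itself is an illustrative sketch):

```javascript
// Parse a sub-agent reply like "MATCHES: 3, 5, 8" into tile indices.
// The MATCHES: convention comes from the skill's flow; this parser
// is a sketch, not the skill's actual code.
function parseMatches(reply) {
  const m = reply.match(/MATCHES:\s*([\d,\s]+)/);
  if (!m) return []; // no matches reported
  return m[1]
    .split(',')
    .map(s => parseInt(s.trim(), 10))
    .filter(n => Number.isInteger(n)); // drop blanks and non-numbers
}
```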

For 4×4 grids: Zooming doesn't help enough — 16 tiles are still too dense in a single screenshot. Instead: take_snapshot(verbose=true) to get UIDs for all 16 tiles, then launch 4 sub-agents in parallel (one per row). Each agent screenshots its 4 tiles individually by UID and reports matches. Four rows analyzed simultaneously instead of 16 sequential tool calls. Estimated time: 30-45 seconds.
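Splitting the 16 tile UIDs into per-row batches for the parallel sub-agents is simple chunking (a sketch; the helper name is mine):

```javascript
// Chunk 16 tile UIDs into 4 rows so each sub-agent analyzes one row.
// Helper name and shape are illustrative.
function chunkRows(uids, cols = 4) {
  const rows = [];
  for (let i = 0; i < uids.length; i += cols) {
    rows.push(uids.slice(i, i + cols));
  }
  return rows;
}
```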

Round 4 — Select and verify

A single evaluate_script that does two things: clicks all matching tiles (tiles[i].click() for each index) and then clicks #recaptcha-verify-button. Before clicking, it resets the zoom — otherwise click coordinates don't map correctly to the unscaled element positions.

Combining tile selection and verify into one script call saves an entire round. Another lesson from iteration — early versions did these as separate calls and kept hitting the timer.
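A sketch of that combined script body, with the reset-then-click ordering as the key detail (selectors are from the flow above; the wrapper and return shape are illustrative):

```javascript
// Sketch of the combined Round 4 evaluate_script body: reset the zoom,
// click all matching tiles, then click verify, in one round trip.
// Selector names come from the flow; the wrapper is illustrative.
function selectAndVerify(frameDoc, challengeIframe, matchIndices) {
  // Undo the scale(2) zoom first so clicks map to unscaled positions.
  if (challengeIframe) challengeIframe.style.transform = '';
  const tiles = frameDoc.querySelectorAll('td[role="button"]');
  let clicked = 0;
  for (const i of matchIndices) {
    if (tiles[i]) { tiles[i].click(); clicked++; }
  }
  const verify = frameDoc.querySelector('#recaptcha-verify-button');
  if (verify) verify.click(); // verify always goes last
  return { clicked, verified: Boolean(verify) };
}
```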

Round 5 — Detect the result

evaluate_script interrogates the DOM to determine the outcome. This is a state machine with 6 possible states:

| State | DOM signal | Action |
| --- | --- | --- |
| SUCCESS | #recaptcha-anchor[aria-checked="true"] | Done — submit the form |
| NEW_IMAGES | .rc-imageselect-error-dynamic-more visible | Analyze only the replaced tiles |
| WRONG_ANSWER | .rc-imageselect-incorrect-response visible | Back to Round 2 |
| SELECT_MORE | .rc-imageselect-error-select-more visible | Analyze unselected tiles |
| EXPIRED | .rc-anchor-error-msg contains "expired" | Back to Round 1 |
| ERROR | .rc-anchor-error-msg contains "error" | Reload page |

The trickiest state is NEW_IMAGES: reCAPTCHA sometimes replaces only the selected tiles with new images and asks you to evaluate those too. The skill detects this, snapshots only the changed tiles, and runs Round 3 specifically for those — without restarting the entire challenge.

Visibility detection uses offsetParent !== null as a proxy, since Google keeps all error elements in the DOM but hides inactive ones with display: none.
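Put together, the Round 5 check reads as a chain of those signals (selectors are the ones in the table above; the function shape and the UNKNOWN fallback are mine):

```javascript
// Sketch of the Round 5 state check. Selectors come from the state
// table; the function shape and UNKNOWN fallback are illustrative.
// `anchorDoc` is the checkbox iframe's document, `challengeDoc` the
// challenge iframe's document.
function detectResult(anchorDoc, challengeDoc) {
  // offsetParent !== null as a visibility proxy: hidden error nodes
  // stay in the DOM with display: none.
  const visible = el => el !== null && el.offsetParent !== null;
  const anchor = anchorDoc.querySelector('#recaptcha-anchor');
  if (anchor && anchor.getAttribute('aria-checked') === 'true') return 'SUCCESS';
  if (visible(challengeDoc.querySelector('.rc-imageselect-error-dynamic-more'))) return 'NEW_IMAGES';
  if (visible(challengeDoc.querySelector('.rc-imageselect-incorrect-response'))) return 'WRONG_ANSWER';
  if (visible(challengeDoc.querySelector('.rc-imageselect-error-select-more'))) return 'SELECT_MORE';
  const err = anchorDoc.querySelector('.rc-anchor-error-msg');
  if (err && /expired/i.test(err.textContent)) return 'EXPIRED';
  if (err && /error/i.test(err.textContent)) return 'ERROR';
  return 'UNKNOWN';
}
```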

The numbers

Happy path for a 3×3 grid: 5 tool calls + 1 sub-agent analysis.

checkbox → detect → zoom+screenshot+agent → select+verify → detect_result

Total estimated time: 20-30 seconds.

For a 4×4 grid with parallel agents: 30-45 seconds.

Both well within the 2-minute timer.

The stack

  • Claude Code as the main agent (claude-sonnet-4-6)
  • chrome-devtools-mcp (official MCP server) — exposes the browser as a tool
  • Tools used:
    • evaluate_script — iframe interaction (the only reliable method)
    • take_screenshot with fullPage: true — zoomed 3×3 grid capture
    • take_screenshot with uid — individual tile capture for 4×4
    • take_snapshot with verbose: true — UID discovery for 4×4
    • navigate_page with type: reload — error recovery
    • new_page with isolatedContext — cookie-free sessions for testing
  • Sub-agents launched via the Agent tool (claude-sonnet-4-6) — parallel visual analysis

What this means for security

This isn't a theoretical attack. It's a working implementation using publicly available tools. The entire approach costs fractions of a cent per solve.

A quick timeline for context:

  • 2009: Google acquires reCAPTCHA. CAPTCHAs train OCR and digitize books.
  • 2014: reCAPTCHA v2 switches to image challenges, feeding Google's computer vision pipeline.
  • 2018: reCAPTCHA v3 drops visual challenges entirely, shifts to behavioral scoring.
  • 2025: An LLM solves the visual challenge end-to-end in under 30 seconds.

If CAPTCHA is still your primary bot mitigation layer, it's time to revisit your threat model. Rate limiting, device fingerprinting, anomaly detection, and proof-of-work challenges don't depend on the assumption that machines can't solve perceptual puzzles.

The bots won. Update your threat model.


The skill was built iteratively — each failure taught the agent something new about iframes, timing, tile resolution, and reCAPTCHA's state machine. No CAPTCHA-solving APIs, no external dependencies. Just an LLM with a browser, a set of scripts, and enough failed attempts to figure it out.

Have you run into similar findings with LLMs and browser automation? I'd love to hear about it in the comments.
