MiniKao

Posted on May 19

The 10% CAPTCHA problem in QA — and why your AI solver should refuse Google login

#testing #mcp #python #automation

The 10% that ruins QA day

You've automated the login flow. Your Playwright suite hums along. Then a CAPTCHA shows up and the whole thing collapses.

The honest answer from any QA engineer who's done this for more than six months is: stop trying to solve the CAPTCHA. Configure the test environment so it never appears. Test mode keys. Backend bypass tokens. Feature flags. IP allowlist on staging. The list of "right ways" is long and almost all of them are boring.

That works for ninety percent of testing. Then there's the remaining ten percent:

A B2B integration test where the third party owns the CAPTCHA and won't change their config for you
A client engagement with written authorization to test the production system, but no access to the backend
A staging environment that intentionally mirrors prod CAPTCHA behavior to catch UX regressions
A mobile webview test that an IP allowlist can't cover
An accessibility audit that needs to actually see the challenge to test screen-reader behavior

For those, every shortcut violates someone's terms of service or your engagement contract. So we built mk-qa-master v0.7.0: a pair of MCP tools that let an AI client read a reCAPTCHA v2 image grid and click the right tiles — but only after a consent gate, never against third-party login portals, and never retaining the screenshot beyond the active cycle.

This post is about why the safety design matters more than the AI magic.

The three-tier CAPTCHA strategy

The strategy lives in the built-in QA knowledge layer (get_qa_context(section="CAPTCHA")) so every test the AI generates respects the same hierarchy:

Tier	Approach	When to use
1 — bypass	reCAPTCHA test keys, feature flags, IP allowlist, test-mode headers	Default. Covers ~90% of cases.
2 — degrade	Mark as `external_dependency`, skip downstream assertions	When you can't change the backend but the test isn't about the CAPTCHA itself.
3 — AI visual judgment	This feature.	Only when 1 + 2 don't fit.

Tier 1 is the "boring" answer and it's right almost every time. Google publishes test keys that always return success. Cloudflare Turnstile does the same. hCaptcha does the same. Your staging env can use them in seconds.
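As a rough illustration of Tier 1, here is a minimal sketch of environment-based key selection. The `APP_ENV`, `CAPTCHA_SITE_KEY` and `CAPTCHA_SECRET` variable names and the `captcha_keys` helper are hypothetical and not part of mk-qa-master, and the key values are placeholders to be replaced with the test keys your provider documents.

```python
import os
from typing import Mapping, Optional

# Placeholders: paste the test keys published in your provider's docs.
TEST_KEYS = {
    "site_key": "<provider-test-site-key>",
    "secret": "<provider-test-secret>",
}


def captcha_keys(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return test keys everywhere except production (hypothetical helper)."""
    env = os.environ if env is None else env
    if env.get("APP_ENV", "development") == "production":
        # Real keys come from the environment, never from source control.
        return {
            "site_key": env["CAPTCHA_SITE_KEY"],
            "secret": env["CAPTCHA_SECRET"],
        }
    return dict(TEST_KEYS)
```

With something like this in place, staging and CI never render a real challenge, so Tier 2 and Tier 3 never come into play.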

Tier 2 is for when the CAPTCHA is on the way to what you're really testing — say, you want to verify the post-login dashboard, not the auth flow. Mark the auth step as external_dependency, prove independently that the dashboard renders correctly with a seeded session, and you've decoupled the concern.

Tier 3 is what this release is about. It's the last resort, and we designed it like one.

What v0.7.0 actually ships

Two atomic MCP tools:

inspect_visual_challenge(confirm: bool = False)
  # Returns: screenshot of the challenge frame (base64),
  # challenge text, 3x3 or 4x4 tile grid metadata.
  # Refuses on forbidden domains.
  # Requires QA_VISUAL_CHALLENGE_CONSENT=true.

solve_visual_challenge(
    tile_indices: list[int],   # AI client's tile selection
    confirm: bool = False
)
  # Executes the click chain for the chosen tiles + Verify.
  # Returns: status (passed/failed), token, hint.
  # Same gates as inspect.

The AI client (Claude, Gemini, GPT-4V, whichever) is the actual solver. mk-qa-master is just eyes and hands: it screenshots, it accepts a list of indices, it clicks. The intelligence about which tiles contain a bicycle lives in the multimodal model.

That separation matters: it means the QA tool doesn't ship a CAPTCHA-solving ML model, doesn't compete with services like 2Captcha, doesn't accumulate know-how about how to beat specific challenge types. It just enables an AI client that already has vision to do its job inside a Playwright session.
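The "hands" half is mostly geometry. The sketch below shows one way a tile index could be turned into a click point, assuming row-major numbering that starts at 0 in the top-left corner. That numbering, the `tile_center` helper and the `(left, top, width, height)` box format are assumptions for illustration, not the tool's documented behavior.

```python
def tile_center(index, grid_size, box):
    """Return the (x, y) centre of a tile inside the grid's bounding box.

    index     -- tile number, row-major from 0 at the top-left (assumed)
    grid_size -- 3 for a 3x3 grid, 4 for a 4x4 grid
    box       -- (left, top, width, height) of the whole grid in page pixels
    """
    if not 0 <= index < grid_size * grid_size:
        raise ValueError(f"tile index {index} is outside a {grid_size}x{grid_size} grid")
    left, top, width, height = box
    row, col = divmod(index, grid_size)
    tile_w = width / grid_size
    tile_h = height / grid_size
    # Aim for the middle of the tile rather than an edge.
    return (left + (col + 0.5) * tile_w, top + (row + 0.5) * tile_h)
```

In a Playwright session, a point like this could be handed to `page.mouse.click(x, y)`. Validating the index first means a bad tile list from the model fails loudly instead of clicking somewhere outside the grid.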

The safety design

When you read the implementation, ~40% of the code is feature logic. The other 60% is restraint.

Consent gate. Default off. Nothing happens until you set:

QA_VISUAL_CHALLENGE_CONSENT=true

And every tool call requires confirm=true on top of that. Two locks, deliberately.
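A minimal sketch of the two-lock idea, assuming the gate is a plain function that raises before any browser work starts. Only the `QA_VISUAL_CHALLENGE_CONSENT` variable and the `confirm` flag come from this post; the `require_gates` name and the error wording are hypothetical.

```python
import os


def require_gates(confirm, env=None):
    """Raise PermissionError unless both locks are open (hypothetical helper)."""
    env = os.environ if env is None else env
    # Lock 1: operator-level consent, set once in the server environment.
    if env.get("QA_VISUAL_CHALLENGE_CONSENT", "").strip().lower() != "true":
        raise PermissionError("QA_VISUAL_CHALLENGE_CONSENT is not set to true")
    # Lock 2: per-call confirmation, supplied by the AI client on every call.
    if not confirm:
        raise PermissionError("call again with confirm=true after reading the acceptable-use text")
```

Keeping the checks in this order means a client that passes `confirm=true` still gets nowhere on a server whose operator never opted in.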

Per-call disclaimer. The first call surfaces the acceptable-use text in the error message:

ACCEPTABLE USE
This tool is intended for QA testing on:
- Sites you own
- Client sites where you have explicit written authorization
- Test environments where Tier 1 bypass is unavailable

DO NOT USE THIS TOOL ON:
- Third-party sites you do not own
- Production sites without explicit authorization
- Sites where automated access violates TOS or local law

If you're the kind of engineer who'd skip a disclaimer, you'll see it three times before you can call this thing for real.

Hard-stop domains. Some places are refused regardless of consent flag:

_FORBIDDEN_DOMAINS = frozenset({
    "accounts.google.com",
    "login.microsoftonline.com",
    "id.apple.com",
    "appleid.apple.com",
    "facebook.com",
    "login.live.com",
    "login.yahoo.com",
    "twitter.com/login",
    "x.com/login",
})

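These are third-party identity portals, and there is no legitimate QA reason to script a CAPTCHA solver against someone else's login portal. The match is suffix-based on the host, so subdomains of a listed host are refused as well, while a lookalike such as accounts.google.com.evil is not mistaken for the real accounts.google.com.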
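One way to implement a host suffix check is sketched below. It is not the project's actual code: the `is_forbidden` name and the shortened `FORBIDDEN_HOSTS` set are stand-ins, and the sketch covers host-only entries. Entries that carry a path, such as `x.com/login`, would need an extra path comparison that is not shown here.

```python
from urllib.parse import urlsplit

# Shortened, host-only stand-in for the real list.
FORBIDDEN_HOSTS = frozenset({
    "accounts.google.com",
    "login.microsoftonline.com",
    "facebook.com",
})


def is_forbidden(url):
    """True if the URL's host is a listed host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    # Compare on label boundaries: "m.facebook.com" matches "facebook.com",
    # but "notfacebook.com" and "accounts.google.com.evil" do not.
    return any(host == d or host.endswith("." + d) for d in FORBIDDEN_HOSTS)
```

Matching on `"." + d` rather than a bare `endswith(d)` is what keeps `notfacebook.com` from being treated as Facebook.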

Optional authorized-domains allowlist. For added discipline:

QA_VISUAL_CHALLENGE_AUTHORIZED_DOMAINS=client-staging.example.com,internal-app.example.com

When set, the tool refuses on any host that isn't on this list. Recommended for client engagements where you want a hard contract trail.
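A sketch of how such an allowlist could be parsed and applied, assuming exact host matching and assuming that an unset variable means this particular check is skipped. The `parse_allowlist` and `is_authorized` helpers are hypothetical names.

```python
from typing import Optional


def parse_allowlist(raw: Optional[str]) -> frozenset:
    """Split a comma-separated value into lowercase hostnames, dropping blanks."""
    if not raw:
        return frozenset()
    return frozenset(h.strip().lower() for h in raw.split(",") if h.strip())


def is_authorized(host: str, raw: Optional[str]) -> bool:
    """With no allowlist configured this check passes; otherwise exact match only."""
    allowed = parse_allowlist(raw)
    return not allowed or host.lower() in allowed
```

Exact matching is the stricter choice for a contract trail: a new subdomain on the client side has to be added to the list explicitly before the tool will touch it.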

Privacy. Screenshots live only during the active inspect → solve cycle. Telemetry logs the boolean outcome — never the screenshot, never the challenge text, never the tile selection. You don't accumulate a corpus of solved CAPTCHAs.
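The retention rule can be pictured as a scoped object that is wiped when the cycle ends. The sketch below is one hypothetical way to express that in Python, not the project's implementation: the `challenge_cycle` context manager, the logger name and the dictionary layout are all assumptions.

```python
import logging
from contextlib import contextmanager

log = logging.getLogger("visual_challenge")


@contextmanager
def challenge_cycle(screenshot):
    """Hold challenge data for one inspect-to-solve cycle, then discard it."""
    state = {"screenshot": screenshot, "passed": False}
    try:
        yield state
    finally:
        passed = bool(state.get("passed"))
        # Drop the screenshot and anything else attached during the cycle.
        state.clear()
        # Only the boolean outcome is recorded.
        log.info("visual challenge finished: passed=%s", passed)
```

Because the cleanup sits in `finally`, the screenshot is dropped even when the solve step raises.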

What a real session looks like

Inside any MCP-compatible client (Claude Desktop, Cursor, Codex CLI, Gemini CLI, Cline...):

You: "Run the checkout suite. If a CAPTCHA blocks the test, resolve it
      so the rest of the flow can continue."

Claude:
  → run_tests()
  → ✗ failed at step 'click Checkout' — CAPTCHA modal detected

  → inspect_visual_challenge(confirm=true)
  → returns: screenshot + grid metadata + challenge text
            ("Select all images with traffic lights")

  → [Claude looks at the image, identifies tiles 0, 2, 5]

  → solve_visual_challenge(tile_indices=[0, 2, 5], confirm=true)
  → returns: { status: "passed", token: "...", hint: "CAPTCHA verified.
              Resume your test." }

  → run_failed()   # retry the failed step
  → ✓ checkout completes

The shape mirrors how MCP composes other capabilities — analyze, generate, run, advise. The AI orchestrates; the server just runs each step.

Scope, on purpose

v0.7.0 covers reCAPTCHA v2 image-grid only. That's deliberate:

reCAPTCHA v3 has no visible challenge — it's a behavioral risk score. There's nothing to inspect.
Cloudflare Turnstile mostly runs invisibly. Same story.
hCaptcha lands in v0.7.1 once the same safety machinery is fully ported over to its tile layout.
Behavioral CAPTCHA (mouse pattern, keystroke timing) is permanently out of scope. That's an anti-bot arms race we have no interest in feeding.

This is a feature designed to be retired as the web moves on. When test keys become universally available and behavioral risk scoring takes over, this entire module should become unnecessary. We're fine with that.

Why MCP

A few people have asked why we packaged this as an MCP tool instead of a pytest fixture or a Playwright plugin. Two reasons:

The intelligence lives in the AI client, not in the server. MCP makes that split clean: the server exposes capabilities, and the client (which already has vision and reasoning) decides how to use them. A pytest fixture would have to choose a vision provider, manage credentials, and run inference. None of that is the test runner's job.
Composition with the rest of the QA loop. mk-qa-master already exposes analyze_url, generate_test, run_tests, get_optimization_plan. Putting the visual solver on the same MCP surface means the AI can chain it naturally: detect failure → inspect → solve → re-run. No glue code.

If you want the longer pitch on why MCP is the right shape for QA tooling, the README walks through it. Short version: AI clients should orchestrate testing the way a senior engineer would, and MCP is the cleanest way to give them the building blocks.

Try it

pip install mk-qa-master  # or: uvx mk-qa-master

In your MCP client config:

{
  "mcpServers": {
    "mk-qa-master": {
      "command": "uvx",
      "args": ["mk-qa-master"],
      "env": {
        "QA_RUNNER": "pytest",
        "QA_PROJECT_ROOT": "/path/to/your/tests",
        "QA_VISUAL_CHALLENGE_CONSENT": "true"
      }
    }
  }
}

The repo includes examples/sample_captcha_fixture/ — a local HTML page wired up with Google's public reCAPTCHA test keys so you can verify the end-to-end inspect/solve loop without ever touching a real production CAPTCHA.

What's next

v0.7.1 — hCaptcha support, same safety machinery
v0.8.0 — get_optimization_plan gains a "CAPTCHA pressure" metric that tells you when your suite is leaning too hard on Tier 3 and should be moved back to Tier 1
Always — no telemetry export of challenge content; no centralized solver model; no third-party identity portal support

If you find a domain that should be on the hard-stop list and isn't, open an issue. If you find a use case where Tier 1 / Tier 2 should work but the docs don't make it obvious, that's the higher-impact bug — the goal is for this feature to be used less, not more.

If mk-qa-master saved your QA flow, a coffee keeps the late-night CAPTCHA debugging going. Star the repo, file an issue, send a Maestro flow that broke — they're all the same to me.

Repo: kao273183/mcp-test-runner
PyPI: mk-qa-master
Glama: glama.ai/mcp/servers/kao273183/mcp-test-runner
Landing page: mcp.chenjundigital.com