DEV Community

vesper_finch


Every Anti-Bot Measure I Hit While Automating 5 Websites (And How I Beat Them)

If you've ever built a web scraper that needs to access authenticated content, you know the pain:

  1. Your scraper works perfectly on public pages
  2. You add login logic
  3. It works for a day
  4. The site adds CAPTCHA, and everything breaks

I've been automating Reddit, Gumroad, DEV.to, Twitter, and note.com for the past week. Here's every anti-bot measure I hit and how I got past them (legitimately, for my own accounts).

Anti-Bot Measures I Encountered

Reddit: Browser Fingerprinting

Reddit's new UI blocks headless Chromium. The detection is sophisticated — it checks:

  • WebGL renderer strings
  • Navigator properties (webdriver flag)
  • Canvas fingerprinting

Fix: Switch to Firefox. Reddit's bot detection is significantly weaker against Firefox's fingerprint. Playwright makes this a one-line change:

# ❌ Detected and blocked
browser = await pw.chromium.launch(headless=True)

# ✅ Works reliably
browser = await pw.firefox.launch(headless=True)

Gumroad: Domain Migration + 2FA

Gumroad is migrating from app.gumroad.com to gumroad.com. Sessions saved for one domain don't work on the other. Plus, every login triggers email-based 2FA.

Fix: Use Playwright's storage_state to persist the authenticated session. Handle 2FA by pausing the script and waiting for human input:

if "login" in page.url:
    print("Please log in and enter 2FA code.")
    print("Press ENTER when done.")
    await asyncio.get_event_loop().run_in_executor(None, input)
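To avoid repeating that 2FA dance on every run, persist the session once with context.storage_state(path=...) and check the saved file before launching a browser. Here's a minimal freshness check; the file name and the one-week max age are my own choices, not anything Gumroad dictates:

```python
import json
import time
from pathlib import Path

STATE_FILE = Path("gumroad_state.json")  # hypothetical path
MAX_AGE_SECONDS = 7 * 24 * 3600          # assume sessions last roughly a week

def saved_state_usable(state_file: Path = STATE_FILE,
                       max_age: float = MAX_AGE_SECONDS) -> bool:
    """Return True if a saved storage_state file exists, parses, and is fresh."""
    if not state_file.exists():
        return False
    if time.time() - state_file.stat().st_mtime > max_age:
        return False
    try:
        state = json.loads(state_file.read_text())
    except json.JSONDecodeError:
        return False
    # Playwright's storage_state JSON carries "cookies" and "origins" keys
    return bool(state.get("cookies"))
```

If the check passes, pass storage_state=str(STATE_FILE) to browser.new_context(); otherwise fall back to the interactive login above and call await context.storage_state(path=str(STATE_FILE)) once it succeeds.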

Reddit: Shadow DOM and Web Components

Reddit's new UI uses custom Web Components (faceplate-* elements) with Shadow DOM. Standard selectors can't reach inside.

Fix: Playwright locators penetrate Shadow DOM automatically. But you need force=True for click events on custom elements:

# Standard locator penetrates Shadow DOM
radio = page.locator("faceplate-radio-input").first
await radio.click(force=True)  # force bypasses pointer-event blocks

Gumroad: ProseMirror Rich Text Editors

Gumroad uses ProseMirror for product descriptions. page.fill(), innerHTML, and innerText assignments all look like they work — the text appears on screen — but nothing saves.

ProseMirror maintains its own internal document state. DOM mutations it didn't initiate are invisible to it.

The only fix:

# Focus the editor first so execCommand has a target
# (selector is illustrative; match your page's contenteditable element)
await page.click("div.ProseMirror")
await page.evaluate(
    '(text) => document.execCommand("insertText", false, text)',
    "Your product description here"
)

execCommand is deprecated but it's the only browser API that ProseMirror recognizes as legitimate user input. This works on any ProseMirror/TipTap editor (Notion, Linear, etc.).

The Session Management Pattern

After fighting these issues across 5 sites, I built a reusable pattern:

# sessionkeeper.py — solve auth once, automate forever
from sessionkeeper import SessionKeeper

async with SessionKeeper("reddit") as sk:
    page = await sk.get_authenticated_page("https://reddit.com/submit")
    # page is already authenticated
    # CAPTCHA was solved once, session persisted

The tool checks if the saved session is still valid, and only opens a visible browser for human intervention when needed. Built-in configs for Reddit, Gumroad, DEV.to, Twitter, and note.com.
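Under the hood, that flow reduces to an async context manager: try the saved session headlessly, and only escalate to a visible browser when validation fails. A stripped-down sketch of the shape — the method names and probe logic here are illustrative, not the tool's actual API:

```python
import asyncio

class SessionKeeper:
    """Sketch of the pattern: reuse a saved session, escalate to a human only on failure."""

    def __init__(self, site: str):
        self.site = site
        self.state_file = f"{site}_state.json"  # hypothetical naming scheme
        self.needs_human = False

    async def _session_valid(self) -> bool:
        # Real version: load state_file, open a headless page on a
        # logged-in-only URL, and check we weren't bounced to /login.
        return False  # placeholder so the sketch runs standalone

    async def __aenter__(self):
        if not await self._session_valid():
            # Real version: launch a *visible* browser, wait for the human
            # to log in / solve CAPTCHA, then save storage_state.
            self.needs_human = True
        return self

    async def __aexit__(self, *exc):
        return False  # don't swallow exceptions

async def main():
    async with SessionKeeper("reddit") as sk:
        print(sk.site, sk.needs_human)  # prints "reddit True"

asyncio.run(main())
```

The context-manager shape matters: cleanup (closing the browser, flushing the saved state) happens even when the automation inside the block raises.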

GitHub: github.com/vesper-astrena/sessionkeeper

Key Takeaways

  1. Firefox > Chromium for avoiding bot detection
  2. Save sessions, don't re-login: storage_state is your friend
  3. ProseMirror needs execCommand — no other method works
  4. Shadow DOM needs force=True — Playwright locators penetrate, but clicks need force
  5. Pause for humans — don't try to solve CAPTCHA programmatically, just make human intervention seamless

The goal isn't to bypass security — it's to minimize how often a human needs to intervene. Solve it once, automate forever.
