DEV Community

Nico Reyes


Got CAPTCHA'd on page 47. Every single time.

I thought I was being clever with my scraping setup. 10 requests per minute, rotating user agents, residential proxies. Page 47, CAPTCHA. Page 48, CAPTCHA. Page 49, CAPTCHA. Fun times.
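For context, a throttled setup along those lines could look something like the sketch below. It is not the exact code from this scraper: `USER_AGENTS`, `fetch_pages`, and the injected `get` callable are names made up for illustration, and the proxy dict is whatever your provider hands you.

```python
import itertools
import time

# Hypothetical values for illustration; none of these come from the original scraper
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]
REQUESTS_PER_MINUTE = 10


def seconds_between_requests(requests_per_minute):
    """Convert a requests-per-minute budget into a fixed delay in seconds."""
    return 60.0 / requests_per_minute


def rotating_headers(user_agents):
    """Yield header dicts forever, cycling through the given user agents."""
    for agent in itertools.cycle(user_agents):
        yield {"User-Agent": agent}


def fetch_pages(urls, get, proxies=None, sleep=time.sleep):
    """Fetch each URL in order, pausing between requests and rotating the user agent.

    `get` is any callable with the same keyword arguments as requests.get.
    """
    delay = seconds_between_requests(REQUESTS_PER_MINUTE)
    headers = rotating_headers(USER_AGENTS)
    responses = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # no need to wait before the very first request
        responses.append(get(url, headers=next(headers), proxies=proxies, timeout=30))
    return responses
```

With the `requests` library you would call it as `fetch_pages(urls, get=requests.get, proxies={...})`. Passing `get` and `sleep` in as arguments just makes it easy to dry-run the loop without hitting the site.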

The site decided I was a bot once I hit some threshold. Didn't matter that I was going slow. Didn't matter that I looked like a real browser. CAPTCHA wall, every time, starting at page 47.

First thing I tried: more delays. 30 seconds between requests. Still got CAPTCHA'd on page 47. Interesting.

Then I tried different proxy providers. Three of them. Same result, same page number. At this point I was convinced the threshold was tied to my IP somehow, but no, same thing happened with fresh IPs.
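If you want to check for yourself that the cutoff follows the page number rather than the delay or the IP, one option is to record where each configuration first gets blocked. This is a rough sketch under a couple of assumptions: each `fetch_page` callable is hypothetical and would wrap one configuration (longer delay, different proxy provider) and return the page HTML, and a CAPTCHA page is detected by a plain `'captcha'` substring check.

```python
def first_captcha_page(fetch_page, max_pages):
    """Return the first page number whose HTML mentions a CAPTCHA, or None."""
    for page in range(1, max_pages + 1):
        html = fetch_page(page)
        if "captcha" in html.lower():
            return page
    return None


def compare_runs(fetchers, max_pages):
    """Crawl with each named configuration and report where each one got blocked."""
    return {
        name: first_captcha_page(fetch_page, max_pages)
        for name, fetch_page in fetchers.items()
    }
```

If every configuration reports the same page number, the trigger is probably not request rate or IP reputation.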

Turns out the site just really, really doesn't like automated scraping. And I was too stubborn to give up.

Ended up biting the bullet and paying for 2captcha. If you're not familiar, they solve CAPTCHAs using actual humans (or very good models, unclear). The integration looked like this:

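import requests
from twocaptcha import TwoCaptcha

solver = TwoCaptcha('YOUR_API_KEY')

# Placeholder; use the same headers as the rest of your scraper
headers = {'User-Agent': 'Mozilla/5.0'}

def extract_between(text, start, end):
    # Return the text between the first `start` marker and the next `end`
    begin = text.index(start) + len(start)
    return text[begin:text.index(end, begin)]

def scrape_with_captcha_handling(url):
    response = requests.get(url, headers=headers)

    if 'captcha' in response.text:
        # Pull the sitekey out of the challenge markup
        sitekey = extract_between(response.text, 'data-sitekey="', '"')

        # Send to 2captcha and block until a token comes back
        result = solver.turnstile(sitekey=sitekey, url=url)

        # Retry with token (the /verify path and field name are specific to this site)
        response = requests.post(
            url + '/verify',
            data={'captcha_token': result['code']},
            headers=headers,
        )

    # parse_page is the scraper's own parsing function (not shown)
    return parse_page(response)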

Not exactly elegant. But it worked.

The key insight that nobody wants to hear: if a site throws CAPTCHAs at you, maybe just don't scrape it. I know, I know, not helpful. But honestly? If a site is that aggressive about blocking bots, there's usually a reason. Either they're protecting something valuable, or they're worried about liability, or both.

That said, if you really need the data and the site doesn't offer an API, 2captcha is there as a last resort. Costs money, adds latency, but you get your data.
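To see what "costs money, adds latency" means for your own run, you can wrap the solver call and keep a tally. A minimal sketch, assuming you supply the per-solve price yourself (no real pricing is baked in) and that `SolveStats` is just an illustrative helper, not part of the 2captcha client:

```python
import time


class SolveStats:
    """Track how many solve attempts were made and how long they took in total."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.solves = 0
        self.total_seconds = 0.0

    def timed(self, solve, *args, **kwargs):
        """Call solve(*args, **kwargs), recording one attempt and its duration."""
        start = self.clock()
        try:
            return solve(*args, **kwargs)
        finally:
            # Counted even if the solve raises, since the wait still happened
            self.solves += 1
            self.total_seconds += self.clock() - start

    def estimated_cost(self, price_per_solve):
        """Multiply the attempt count by a per-solve price that you supply."""
        return self.solves * price_per_solve
```

In the scraper above that would be `result = stats.timed(solver.turnstile, sitekey=sitekey, url=url)`, and after the crawl `stats.total_seconds / stats.solves` gives the average wait per CAPTCHA.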

Still worth it in my case, though.
