DEV Community

agenthustler

How to Handle Cloudflare Protection in Web Scraping

The Cloudflare Challenge

Cloudflare protects over 20% of all websites. If you have ever seen a "Checking your browser" page or a CAPTCHA challenge while scraping, you have encountered Cloudflare's bot detection. Let's understand how it works and how to get past it.

How Cloudflare Detects Bots

Cloudflare uses multiple layers of detection:

  1. JavaScript challenges — forces browsers to execute JS and prove they are real
  2. TLS fingerprinting — checks if the TLS handshake matches a real browser
  3. Browser fingerprinting — canvas, WebGL, fonts, plugins
  4. Behavioral analysis — mouse movements, click patterns, timing
  5. IP reputation — datacenter IPs are flagged immediately
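Before choosing a bypass method, it helps to recognize when a request has been challenged. A minimal heuristic sketch (the header and marker strings below are common signals, not an exhaustive or official list):

```python
def looks_like_cloudflare_challenge(status_code: int, headers: dict, body: str) -> bool:
    """Heuristic check for a Cloudflare challenge or block page."""
    # Challenge/block pages typically return 403 or 503 served by "cloudflare"
    if status_code in (403, 503) and headers.get("Server", "").lower() == "cloudflare":
        return True
    # Interstitial pages carry recognizable titles/markers
    markers = ("Just a moment", "Checking your browser", "cf-challenge")
    return any(m in body for m in markers)
```

Running this on every response lets you fail fast and escalate to a heavier method instead of parsing a challenge page as if it were real content.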

Method 1: Undetected ChromeDriver

The undetected-chromedriver library patches Selenium to avoid detection:

import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

def scrape_cloudflare_site(url):
    options = uc.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")

    driver = uc.Chrome(options=options)

    try:
        driver.get(url)

        # Wait for Cloudflare challenge to resolve
        WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )

        # Additional wait for JS rendering
        time.sleep(5)

        # Check if we passed the challenge
        if "Just a moment" in driver.title:
            print("Still blocked by Cloudflare")
            return None

        return driver.page_source
    finally:
        driver.quit()

html = scrape_cloudflare_site("https://example-cf-protected.com")
if html:
    print(f"Got {len(html)} bytes of content")

Method 2: Playwright with Stealth

Playwright with stealth tweaks tends to be more reliable than Selenium:

import asyncio
from playwright.async_api import async_playwright

async def bypass_cloudflare(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ]
        )

        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            locale="en-US",
            timezone_id="America/New_York",
        )

        # Remove webdriver flag
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
        """)

        page = await context.new_page()

        response = await page.goto(url, wait_until="networkidle")

        # Wait for challenge to complete
        for _ in range(10):
            title = await page.title()
            if "Just a moment" not in title:
                break
            await asyncio.sleep(3)

        content = await page.content()
        cookies = await context.cookies()

        await browser.close()

        # Save cf_clearance cookie for future requests
        cf_cookie = next(
            (c for c in cookies if c["name"] == "cf_clearance"), None
        )

        return content, cf_cookie

html, cookie = asyncio.run(bypass_cloudflare("https://example.com"))
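Once you have a cf_clearance cookie, you can often reuse it for lightweight follow-up requests instead of keeping a browser open. A sketch of building such a session with requests; note this is an assumption-laden shortcut, since Cloudflare ties the cookie to the User-Agent (and often the IP and TLS fingerprint) that solved the challenge, so mismatches get re-challenged:

```python
import requests

def session_with_clearance(cf_cookie: dict, user_agent: str) -> requests.Session:
    """Build a requests session carrying a cf_clearance cookie from a browser run.

    cf_cookie is a Playwright-style cookie dict with "name", "value", "domain".
    The user_agent MUST match the browser that solved the challenge.
    """
    session = requests.Session()
    session.cookies.set(
        cf_cookie["name"], cf_cookie["value"], domain=cf_cookie["domain"]
    )
    session.headers["User-Agent"] = user_agent
    return session
```

Usage: `s = session_with_clearance(cookie, ua)` followed by `s.get(url).text` for subsequent pages on the same domain.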

Method 3: Using a Scraping API

The most reliable approach for production is a dedicated API that handles Cloudflare automatically:

import requests

def scrape_with_api(url):
    """Use ScraperAPI to bypass Cloudflare automatically."""
    resp = requests.get(
        "https://api.scraperapi.com",
        params={
            "api_key": "YOUR_KEY",
            "url": url,
            "render": "true",
            "country_code": "us"
        }
    )
    return resp.text

# Works on most Cloudflare-protected sites
html = scrape_with_api("https://cloudflare-protected-site.com")

ScraperAPI maintains a pool of browser instances and residential IPs that can bypass most Cloudflare configurations.

Method 4: TLS Fingerprint Matching

Cloudflare fingerprints TLS connections. Python's requests library has a distinctive fingerprint. Use curl_cffi to mimic real browsers:

from curl_cffi import requests as cf_requests

def fetch_with_browser_tls(url):
    """Use curl_cffi to impersonate Chrome's TLS fingerprint."""
    resp = cf_requests.get(
        url,
        impersonate="chrome120",
        headers={
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
        }
    )
    return resp.text

html = fetch_with_browser_tls("https://cf-protected-site.com")

Method 5: Residential Proxies

Datacenter IPs are instantly flagged. Use residential proxies from ThorData to appear as a real home user:

import requests

proxies = {
    "http": "http://user:pass@residential.thordata.com:9000",
    "https": "http://user:pass@residential.thordata.com:9000"
}

resp = requests.get(
    "https://cf-protected-site.com",
    proxies=proxies,
    headers={"User-Agent": "Mozilla/5.0 ..."}
)

Combining Methods for Maximum Success

import requests

class CloudflareBypass:
    def __init__(self, scraper_api_key=None):
        self.api_key = scraper_api_key

    def fetch(self, url):
        # Try methods in order of speed/cost
        for method in [self._try_curl_cffi, self._try_scraper_api]:
            result = method(url)
            if result and "Just a moment" not in result:
                return result
        return None

    def _try_curl_cffi(self, url):
        try:
            from curl_cffi import requests as cf
            resp = cf.get(url, impersonate="chrome120")
            return resp.text if resp.status_code == 200 else None
        except Exception:
            return None

    def _try_scraper_api(self, url):
        if not self.api_key:
            return None
        resp = requests.get("https://api.scraperapi.com", params={
            "api_key": self.api_key, "url": url, "render": "true"
        })
        return resp.text if resp.status_code == 200 else None

Monitoring Success Rates

Track which methods work for which sites with ScrapeOps. Cloudflare regularly updates their detection, so what works today may not work tomorrow.
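A lightweight way to start is an in-process tracker (a hypothetical sketch, not the ScrapeOps SDK) that counts outcomes per (domain, method) pair, so your fallback chain can prefer whatever currently works for each site:

```python
from collections import defaultdict
from urllib.parse import urlparse

class SuccessTracker:
    """Track per-domain, per-method success rates for bypass attempts."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"ok": 0, "fail": 0})

    def record(self, url: str, method: str, success: bool):
        key = (urlparse(url).netloc, method)
        self.stats[key]["ok" if success else "fail"] += 1

    def success_rate(self, url: str, method: str) -> float:
        s = self.stats[(urlparse(url).netloc, method)]
        total = s["ok"] + s["fail"]
        return s["ok"] / total if total else 0.0
```

Feeding these numbers into your method ordering closes the loop: when Cloudflare ships an update and a method's rate drops, the fallback chain adapts without a code change.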

Key Takeaways

  • Start with curl_cffi for TLS fingerprint matching — it is free and fast
  • Use residential proxies for IP reputation issues
  • Fall back to browser automation for JavaScript challenges
  • Use a scraping API for production reliability
  • Always monitor your success rates and adapt

Cloudflare is an arms race. The most reliable long-term strategy is using a managed service that keeps up with Cloudflare's changes so you do not have to.
