The Cloudflare Challenge
Cloudflare protects over 20% of all websites. If you have ever seen a "Checking your browser" page or a CAPTCHA challenge while scraping, you have encountered Cloudflare's bot detection. Let's understand how it works and how to get past it.
How Cloudflare Detects Bots
Cloudflare uses multiple layers of detection:
- JavaScript challenges — forces browsers to execute JS and prove they are real
- TLS fingerprinting — checks if the TLS handshake matches a real browser
- Browser fingerprinting — canvas, WebGL, fonts, plugins
- Behavioral analysis — mouse movements, click patterns, timing
- IP reputation — datacenter IPs are flagged immediately
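Before choosing a bypass method, it helps to tell a challenge page apart from real content. The sketch below is a heuristic, not Cloudflare's own logic: the function name `looks_like_challenge` is hypothetical, and the marker strings are assumptions based on what challenge pages commonly contain.

```python
def looks_like_challenge(status_code, headers, body):
    """Heuristically decide whether a response is a Cloudflare challenge page."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {key.lower(): value for key, value in headers.items()}

    # Cloudflare marks challenged responses with a cf-mitigated header.
    if lowered.get("cf-mitigated", "").lower() == "challenge":
        return True

    # Challenge pages are normally served with a 403 or 503 status ...
    if status_code not in (403, 503):
        return False

    # ... and carry one of these marker strings in the HTML (assumed markers).
    markers = ("Just a moment", "Checking your browser", "challenge-platform")
    return any(marker in body for marker in markers)
```

A scraper can call this on every response and only escalate to a heavier method when it returns True.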
Method 1: Undetected ChromeDriver
The undetected-chromedriver library patches Selenium to avoid detection:
import undetected_chromedriver as uc
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait


def scrape_cloudflare_site(url, timeout=30):
    options = uc.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    driver = uc.Chrome(options=options)
    try:
        driver.get(url)
        # The challenge page already has a <body>, so waiting for that tag
        # proves nothing. Wait for the "Just a moment" title to go away.
        try:
            WebDriverWait(driver, timeout).until(
                lambda d: "Just a moment" not in d.title
            )
        except TimeoutException:
            print("Still blocked by Cloudflare")
            return None
        return driver.page_source
    finally:
        driver.quit()


html = scrape_cloudflare_site("https://example-cf-protected.com")
if html:
    print(f"Got {len(html)} characters of content")
Method 2: Playwright with Stealth
Playwright with a few manual stealth tweaks (a realistic context and a patched navigator.webdriver flag) is often more reliable than Selenium:
import asyncio
from playwright.async_api import async_playwright


async def bypass_cloudflare(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ],
        )
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            locale="en-US",
            timezone_id="America/New_York",
        )
        # Remove webdriver flag
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
        """)
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")
        # Wait for challenge to complete
        for _ in range(10):
            title = await page.title()
            if "Just a moment" not in title:
                break
            await asyncio.sleep(3)
        content = await page.content()
        cookies = await context.cookies()
        await browser.close()
    # Save cf_clearance cookie for future requests
    cf_cookie = next(
        (c for c in cookies if c["name"] == "cf_clearance"), None
    )
    return content, cf_cookie


html, cookie = asyncio.run(bypass_cloudflare("https://example.com"))
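The saved cf_clearance cookie can be replayed from a plain HTTP client so that later requests skip the browser. A minimal sketch follows, assuming the cookie is the dict Playwright returns (with `name`, `value`, and `expires` keys). The helper names `clearance_headers` and `cookie_is_fresh` are hypothetical. In practice the cookie appears to be honored only when it is sent with the same User-Agent, and usually the same IP, that solved the challenge, so the function takes the User-Agent explicitly.

```python
import time


def clearance_headers(cf_cookie, user_agent):
    """Build headers that replay a cf_clearance cookie with any HTTP client."""
    if not cf_cookie:
        return None
    return {
        "Cookie": f"{cf_cookie['name']}={cf_cookie['value']}",
        # Must match the user_agent passed to browser.new_context().
        "User-Agent": user_agent,
    }


def cookie_is_fresh(cf_cookie, now=None, margin=60):
    """Return True while the cookie has more than `margin` seconds left."""
    expires = cf_cookie.get("expires", -1)
    if expires == -1:  # Playwright uses -1 for session cookies
        return True
    current = time.time() if now is None else now
    return expires - margin > current
```

When `cookie_is_fresh` returns False, rerun the browser flow to obtain a new cookie rather than retrying with the stale one.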
Method 3: Using a Scraping API
The most reliable approach for production is a dedicated API that handles Cloudflare automatically:
import requests


def scrape_with_api(url):
    """Use ScraperAPI to bypass Cloudflare automatically."""
    resp = requests.get(
        "https://api.scraperapi.com",
        params={
            "api_key": "YOUR_KEY",
            "url": url,
            "render": "true",
            "country_code": "us",
        },
        timeout=70,  # rendered requests can take a while
    )
    resp.raise_for_status()  # fail loudly instead of returning an error page
    return resp.text


# Works on most Cloudflare-protected sites
html = scrape_with_api("https://cloudflare-protected-site.com")
ScraperAPI maintains a pool of browser instances and residential IPs that can bypass most Cloudflare configurations.
Method 4: TLS Fingerprint Matching
Cloudflare fingerprints TLS connections. Python's requests library has a distinctive fingerprint. Use curl_cffi to mimic real browsers:
from curl_cffi import requests as cf_requests


def fetch_with_browser_tls(url):
    """Use curl_cffi to impersonate Chrome's TLS fingerprint."""
    resp = cf_requests.get(
        url,
        impersonate="chrome120",
        headers={
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
        },
    )
    return resp.text


html = fetch_with_browser_tls("https://cf-protected-site.com")
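To see why the client library matters, it helps to look at how a TLS fingerprint can be derived. JA3 is one widely used public scheme: it hashes the fields a client offers in its ClientHello. The sketch below illustrates the idea and is not a description of Cloudflare's exact checks. The function names are hypothetical, and the field values in the usage note are made-up inputs, not a real browser's handshake.

```python
import hashlib

# GREASE placeholders (0x0a0a, 0x1a1a, ..., 0xfafa) are randomised per
# connection, so JA3 leaves them out of the fingerprint.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Join ClientHello fields in JA3 order: commas between fields, dashes within."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join([
        str(tls_version),
        join(ciphers),
        join(extensions),
        join(curves),
        join(point_formats),
    ])


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """MD5 of the JA3 string, as a 32-character hex digest."""
    text = ja3_string(tls_version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()
```

For example, `ja3_string(771, [4865, 4866], [0, 23], [29], [0])` yields `771,4865-4866,0-23,29,0`. Reordering the cipher list changes the hash, which is why a client with a fixed, non-browser cipher order is easy to single out.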
Method 5: Residential Proxies
Datacenter IP ranges tend to have poor reputation scores and are often challenged on the first request. Use residential proxies from ThorData to appear as a real home user:
import requests

proxies = {
    "http": "http://user:pass@residential.thordata.com:9000",
    "https": "http://user:pass@residential.thordata.com:9000",
}

resp = requests.get(
    "https://cf-protected-site.com",
    proxies=proxies,
    headers={"User-Agent": "Mozilla/5.0 ..."},
)
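Proxy credentials often contain characters such as "@" or ":" that break a hand-written proxy URL. The helper below is a small sketch that percent-encodes them before building the `proxies` mapping. The name `build_proxies` and the host in the usage note are placeholders, not a real provider endpoint.

```python
from urllib.parse import quote


def build_proxies(user, password, host, port):
    """Return a requests-style proxies dict with safely encoded credentials."""
    # safe="" forces every reserved character (including "@" and ":") to be encoded.
    auth = f"{quote(user, safe='')}:{quote(password, safe='')}"
    proxy_url = f"http://{auth}@{host}:{port}"
    # The same HTTP proxy endpoint carries both plain and TLS traffic.
    return {"http": proxy_url, "https": proxy_url}
```

Calling `build_proxies("user", "p@ss:word", "proxy.example.com", 9000)` produces a mapping that can be passed as `proxies=` to `requests.get`.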
Combining Methods for Maximum Success
import requests


class CloudflareBypass:
    def __init__(self, scraper_api_key=None):
        self.api_key = scraper_api_key

    def fetch(self, url):
        # Try methods in order of speed/cost
        for method in [self._try_curl_cffi, self._try_scraper_api]:
            result = method(url)
            if result and "Just a moment" not in result:
                return result
        return None

    def _try_curl_cffi(self, url):
        try:
            from curl_cffi import requests as cf
            resp = cf.get(url, impersonate="chrome120")
            return resp.text if resp.status_code == 200 else None
        except Exception:
            return None

    def _try_scraper_api(self, url):
        if not self.api_key:
            return None
        try:
            resp = requests.get(
                "https://api.scraperapi.com",
                params={"api_key": self.api_key, "url": url, "render": "true"},
                timeout=70,
            )
        except requests.RequestException:
            # Treat network errors like any other failed attempt
            return None
        return resp.text if resp.status_code == 200 else None
Monitoring Success Rates
Track which methods work for which sites with ScrapeOps. Cloudflare regularly updates its detection, so what works today may not work tomorrow.
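For a sense of what such tracking involves, here is a minimal in-memory sketch. It is a local stand-in and does not use the ScrapeOps API. The class name `SuccessTracker` and the method labels in the usage note are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


class SuccessTracker:
    """Count successes and attempts per (domain, method) pair."""

    def __init__(self):
        # (domain, method) -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, url, method, success):
        key = (urlparse(url).netloc, method)
        self._stats[key][1] += 1
        if success:
            self._stats[key][0] += 1

    def rate(self, domain, method):
        """Success rate between 0 and 1, or None if nothing was recorded."""
        successes, attempts = self._stats.get((domain, method), (0, 0))
        return successes / attempts if attempts else None

    def best_method(self, domain):
        """Method with the highest success rate for this domain, if any."""
        rates = {
            method: successes / attempts
            for (seen_domain, method), (successes, attempts) in self._stats.items()
            if seen_domain == domain and attempts
        }
        return max(rates, key=rates.get) if rates else None
```

Calling `tracker.record(url, "curl_cffi", result is not None)` after each attempt is enough to feed it, and `best_method` can then decide which method to try first for a given site.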
Key Takeaways
- Start with curl_cffi for TLS fingerprint matching; it is free and fast
- Use residential proxies for IP reputation issues
- Fall back to browser automation for JavaScript challenges
- Use a scraping API for production reliability
- Always monitor your success rates and adapt
Cloudflare is an arms race. The most reliable long-term strategy is using a managed service that keeps up with Cloudflare's changes so you do not have to.