Web Scraping Without Getting Banned in 2026: The Complete Anti-Bot Bypass Guide
Getting blocked is the #1 frustration for web scrapers. You write the code, it works for 10 minutes, then you're staring at a 403 or a Cloudflare challenge page. This guide covers every technique that actually works in 2026 — from basic rate limiting to defeating Turnstile — so you can scrape 100 to 50,000 records without getting banned.
Why You're Getting Blocked: The Real Reasons
Before fixes, you need to understand what's detecting you. Modern anti-bot systems check multiple signals simultaneously:
TLS Fingerprint — Your Python requests library sends a TLS handshake that looks nothing like a browser. Sites like Cloudflare identify it in milliseconds.
HTTP/2 Fingerprint — Browsers use HTTP/2 with specific frame ordering. requests uses HTTP/1.1 by default, which is an instant giveaway.
Browser fingerprint — Headless Chrome has detectable properties: navigator.webdriver=true, missing plugins, wrong screen dimensions, no GPU renderer.
Behavioral signals — Too fast, too regular, no mouse movement, no scroll events, straight-line navigation patterns.
IP reputation — Datacenter IPs (AWS, GCP, Azure, Hetzner) are pre-blocked on most serious sites. Even residential IPs get flagged if they hit too fast.
Knowing which layer is detecting you tells you what to fix.
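A quick way to apply this: classify the blocked response before changing anything. Below is a minimal heuristic sketch — the marker strings are common challenge-page fragments I'm assuming here, not an official API, so treat the output as a hint rather than a diagnosis:

```python
# Heuristic classifier: given an HTTP status code and response body, guess
# which anti-bot layer is rejecting you. Marker strings are common
# Cloudflare/challenge-page fragments, not an exhaustive list.
def classify_block(status_code: int, body: str) -> str:
    body_lower = body.lower()
    if status_code == 429:
        return "rate-limit"            # slow down, add delays/proxies
    if "cf-turnstile" in body_lower or "challenge-platform" in body_lower:
        return "js-challenge"          # needs a real browser or a solver
    if status_code in (403, 503):
        return "fingerprint-or-ip"     # try curl-cffi and better proxies
    return "ok" if status_code == 200 else "unknown"
```

Run it on the first blocked response you get and fix the layer it points at before touching anything else.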
Layer 1: TLS and HTTP Fingerprinting (Fix This First)
The single biggest win for most scrapers: stop using plain requests.
Use curl-cffi to impersonate real browsers
```python
from curl_cffi import requests

session = requests.Session()

# Impersonate Chrome 120 — matches real browser TLS fingerprint
response = session.get(
    "https://target-site.com/data",
    impersonate="chrome120"
)
# Other options: chrome110, safari17_0, edge101

print(response.status_code)  # 200 instead of 403
```
Install: `pip install curl-cffi`
curl-cffi uses libcurl under the hood and replicates the exact TLS cipher suite order, extension list, and HTTP/2 SETTINGS frame that Chrome sends. Many sites that block requests outright will pass curl-cffi straight through.
httpx with HTTP/2
For sites that check HTTP version but not TLS fingerprint deeply:
```python
import httpx

with httpx.Client(http2=True) as client:
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
    }
    response = client.get("https://target-site.com", headers=headers)
```
Layer 2: Rotating Proxies the Right Way
Most scrapers use proxies wrong. Here's what actually works:
Proxy type matters
| Proxy Type | Detection Risk | Cost | Good For |
|---|---|---|---|
| Datacenter | Very High | Low | Public data, no protection |
| Residential | Low | Medium | Most protected sites |
| Mobile (4G) | Very Low | High | Strictest anti-bot |
| ISP (static residential) | Low | Medium-High | Consistent sessions |
For 100-5000 records on a protected site, residential proxies are the sweet spot.
Python proxy rotation with backoff
```python
import random
import time
from itertools import cycle

import requests

PROXIES = [
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
    "http://user:pass@proxy3:8080",
]
proxy_pool = cycle(PROXIES)

def get_random_ua():
    agents = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    ]
    return random.choice(agents)

def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        proxy = next(proxy_pool)
        try:
            response = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=15,
                headers={"User-Agent": get_random_ua()},
            )
            if response.status_code == 200:
                return response
            if response.status_code == 429:
                # Rate limited — back off, then retry with the next proxy
                time.sleep(2 ** attempt + random.uniform(0, 1))
            # Any other status: loop around and retry with the next proxy
        except requests.RequestException:
            time.sleep(1)
    return None
```
Request timing — the key that most guides skip
```python
import random
import time

def human_delay(min_sec=1.5, max_sec=4.0):
    """Simulate human browsing pace."""
    time.sleep(random.uniform(min_sec, max_sec))

# Between pages
human_delay(2, 5)

# Between sites (session warm-up)
human_delay(5, 10)
```
- At 100 records: comfortable at 1 request per 2 seconds
- At 1,000 records: use 3-5 second delays plus proxy rotation
- At 5,000+ records: rotate proxies every 50-100 requests, add session reuse
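Those scaling rules can be encoded as a small policy helper. This is a sketch — the thresholds and windows just mirror the guidelines above, so tune them for your target:

```python
import random

# Map scrape volume to a (min, max) delay window, following the rule of
# thumb above: small jobs can go fast, large jobs need longer jittered pauses.
def delay_window(total_records: int) -> tuple[float, float]:
    if total_records <= 100:
        return (1.5, 2.5)   # ~1 request every 2 seconds
    if total_records <= 1000:
        return (3.0, 5.0)
    return (5.0, 10.0)      # 5,000+: pair with proxy rotation

def jittered_delay(total_records: int) -> float:
    """Return a randomized sleep duration inside the window."""
    low, high = delay_window(total_records)
    return random.uniform(low, high)
```

Call `time.sleep(jittered_delay(n))` between requests so the pace scales with the job instead of being hardcoded.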
Layer 3: Playwright for JavaScript-Heavy Sites
When the target site requires JavaScript execution (React, Vue, Angular), you need a real browser:
Stealth Playwright setup
```python
import random
import time

from playwright.sync_api import sync_playwright

def create_stealth_browser():
    p = sync_playwright().start()
    browser = p.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",
            "--no-sandbox",
            "--disable-dev-shm-usage",
        ],
    )
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        viewport={"width": 1280, "height": 800},
        locale="en-US",
        timezone_id="America/New_York",
        # Add a proxy here if needed:
        # proxy={"server": "http://proxy:8080", "username": "u", "password": "p"},
    )
    # Patch common webdriver-detection properties
    context.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
        window.chrome = {runtime: {}};
    """)
    return p, browser, context

def scrape_page(url):
    p, browser, context = create_stealth_browser()
    page = context.new_page()

    # Human-like: navigate, wait, scroll
    page.goto(url, wait_until="networkidle")
    time.sleep(random.uniform(1.5, 3))

    # Scroll to simulate reading
    page.evaluate("window.scrollBy(0, 300)")
    time.sleep(random.uniform(0.5, 1.5))

    content = page.content()
    browser.close()
    p.stop()
    return content
```
Install: `pip install playwright && playwright install chromium`
Layer 4: Defeating Cloudflare Turnstile
Cloudflare Turnstile (the non-interactive "I'm not a robot" check) is the hardest challenge in 2026. It runs JavaScript fingerprinting, behavior analysis, and sometimes visual challenges. There are three viable approaches:
Option A: Avoid it entirely (fastest, free)
Many sites have unprotected API endpoints even when the HTML is protected:
```python
import requests

# Instead of scraping the HTML page:
#   https://shop.example.com/products
# try the API directly:
response = requests.get(
    "https://shop.example.com/api/products",
    headers={"Accept": "application/json"},
)

# Or the GraphQL endpoint:
response = requests.post(
    "https://shop.example.com/graphql",
    json={"query": "{ products { id name price } }"},
)
```
Open your browser's Network tab, filter for XHR/Fetch requests, and look for JSON responses. A large share of protected sites expose clean internal APIs this way.
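Once you've noted candidate requests from the Network tab, a small filter can shortlist likely API endpoints. This is a sketch over `(url, content_type)` pairs you capture yourself; the path hints are common conventions, not guarantees:

```python
# Filter captured network requests down to likely JSON API endpoints.
# Each entry is (url, content_type) as seen in the browser's Network tab.
def find_api_candidates(requests_seen):
    candidates = []
    for url, content_type in requests_seen:
        is_json = "application/json" in content_type
        looks_like_api = any(hint in url for hint in ("/api/", "/graphql", ".json"))
        if is_json or looks_like_api:
            candidates.append(url)
    return candidates

captured = [
    ("https://shop.example.com/products", "text/html"),
    ("https://shop.example.com/api/products?page=1", "application/json"),
    ("https://shop.example.com/graphql", "application/json"),
]
print(find_api_candidates(captured))  # the two JSON endpoints, not the HTML page
```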
Option B: Turnstile solving services ($0.001-$0.002 per solve)
When you must solve the challenge:
```python
import time

import requests

SOLVER_API_KEY = "your_2captcha_or_anticaptcha_key"

def solve_turnstile(page_url, sitekey):
    # Submit the task
    r = requests.post("https://api.2captcha.com/createTask", json={
        "clientKey": SOLVER_API_KEY,
        "task": {
            "type": "TurnstileTaskProxyless",
            "websiteURL": page_url,
            "websiteKey": sitekey,
        },
    })
    task_id = r.json()["taskId"]

    # Poll for the result (up to ~150 seconds)
    for _ in range(30):
        time.sleep(5)
        result = requests.post("https://api.2captcha.com/getTaskResult", json={
            "clientKey": SOLVER_API_KEY,
            "taskId": task_id,
        }).json()
        if result["status"] == "ready":
            return result["solution"]["token"]
    raise Exception("Solving timeout")

# Use the token in your request
token = solve_turnstile("https://target.com", "0x4AAAAAAABxxxxxxx")
response = requests.post(
    "https://target.com/submit",
    data={"cf-turnstile-response": token, "other_field": "value"},
)
```
Cost at scale: 1000 solves = ~$1-2. Services: 2captcha, Anti-Captcha, CapSolver.
Option C: Headless browser with stealth
For low-volume scraping where you need to interact with the full page:
```python
from playwright.sync_api import sync_playwright

def bypass_turnstile_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # False helps pass more checks
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
        )
        context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        """)
        page = context.new_page()
        page.goto(url)

        # Wait for Turnstile to complete — when it auto-solves for a
        # real-looking browser, the widget fills a hidden input named
        # "cf-turnstile-response" with the token
        page.wait_for_function(
            "() => document.querySelector('input[name=\"cf-turnstile-response\"]')?.value",
            timeout=30000,
        )

        # Now extract whatever you need
        data = page.evaluate("() => document.querySelector('#data').innerText")
        browser.close()
        return data
```
Layer 5: Session Management and Cookies
The pattern that breaks most scrapers: treating every request as stateless.
```python
import json
import random
import time

import requests

class PersistentSession:
    def __init__(self, proxy=None):
        self.session = requests.Session()
        if proxy:
            self.session.proxies = {"http": proxy, "https": proxy}
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Macintosh...) Chrome/120.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
        })

    def warm_up(self, base_url):
        """Visit the homepage first to build a cookie session."""
        self.session.get(base_url)
        time.sleep(random.uniform(2, 4))

    def get(self, url):
        return self.session.get(url)

    def save_cookies(self, path):
        # Cookie objects aren't JSON-serializable; store the fields we need
        cookies = [
            {"name": c.name, "value": c.value, "domain": c.domain, "path": c.path}
            for c in self.session.cookies
        ]
        with open(path, "w") as f:
            json.dump(cookies, f)

    def load_cookies(self, path):
        with open(path) as f:
            for cookie in json.load(f):
                self.session.cookies.set(**cookie)

# Usage
scraper = PersistentSession(proxy="http://user:pass@residential-proxy:8080")
scraper.warm_up("https://target-site.com")  # Build session
data = scraper.get("https://target-site.com/data/page/1")
```
Key points:
- Always visit the homepage before the target page
- Reuse the same session for a site (keep cookies)
- Add `Referer` headers that match the actual navigation path
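The Referer point can be made concrete: build the headers so each request claims to come from the page a human would actually have navigated from. A minimal sketch (the URLs in the usage are hypothetical):

```python
# Build the Referer header for each hop of a navigation path: the first
# request has no Referer, each later request refers to the previous URL.
def referer_chain(urls: list[str]) -> list[dict]:
    headers_per_hop = []
    referer = None
    for url in urls:
        headers_per_hop.append({"Referer": referer} if referer else {})
        referer = url  # the next request claims to come from this page
    return headers_per_hop

path = [
    "https://target-site.com/",
    "https://target-site.com/products",
    "https://target-site.com/products/42",
]
# Pair each dict with session.get(url, headers=h) so the chain mirrors
# homepage -> listing -> detail.
for url, h in zip(path, referer_chain(path)):
    print(url, h)
```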
Practical Rate Limits by Site Protection Level
| Protection Level | Examples | Safe Request Rate |
|---|---|---|
| None | Most blogs, news | 1 req/sec |
| Basic (rate limit only) | Small e-commerce | 1 req/3-5s |
| Moderate (Cloudflare Basic) | Mid-size retail | 1 req/5-10s + proxy rotation |
| Heavy (Turnstile + JS checks) | LinkedIn, Amazon | Solving service + 10-30s delays |
| Maximum (behavioral AI) | Ticketmaster, airlines | Mobile proxies + full browser |
Quick Decision Tree: Which Approach to Use
```
Is the data in a JSON API?
  YES → use requests + proxy, skip the browser entirely
  NO ↓
Does the page require JavaScript to render content?
  NO → use curl-cffi (impersonate Chrome)
  YES ↓
Is there a Cloudflare Turnstile challenge?
  NO → use Playwright with stealth patches
  YES ↓
Volume > 1000 requests?
  YES → use a solving service (2captcha/CapSolver)
  NO → use headless=False Playwright (often auto-solves)
```
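The same tree can be encoded as a helper function — a sketch where the inputs are facts you establish yourself by inspecting the target in DevTools:

```python
# Mirror the decision tree above, top to bottom, and return the
# recommended approach as a short label.
def pick_approach(has_json_api: bool, needs_js: bool,
                  has_turnstile: bool, request_volume: int) -> str:
    if has_json_api:
        return "requests + proxy"
    if not needs_js:
        return "curl-cffi (impersonate Chrome)"
    if not has_turnstile:
        return "Playwright with stealth patches"
    if request_volume > 1000:
        return "solving service"
    return "headless=False Playwright"
```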
Tools Reference
| Tool | Use Case | Install |
|---|---|---|
| `curl-cffi` | TLS fingerprint bypass | `pip install curl-cffi` |
| `httpx[http2]` | HTTP/2 support | `pip install "httpx[http2]"` |
| `playwright` | JavaScript rendering | `pip install playwright` |
| `undetected-chromedriver` | Selenium alternative | `pip install undetected-chromedriver` |
| `scrapy-rotating-proxies` | Scrapy proxy rotation | `pip install scrapy-rotating-proxies` |
When to Use a Managed Scraping Service
Writing all this yourself makes sense for 1-3 targets you know well. For production scraping across many sites, maintaining anti-bot bypass code becomes its own full-time job — Cloudflare updates every few weeks, browser fingerprints shift, proxy IPs get burned.
A managed approach lets you focus on the data pipeline while the infrastructure handles detection. The tradeoff is cost vs. maintenance time.
Summary
The 80/20 of not getting banned:
1. Use `curl-cffi` — fixes TLS fingerprinting immediately, handles 60% of blocks
2. Add residential proxies — fixes IP reputation, 20% more coverage
3. Slow down — 2-5 second delays eliminate most rate-limit blocks
4. Warm up sessions — visit the homepage first, reuse cookies, add Referer
5. Check for hidden APIs — often cleaner than scraping HTML at all
If you're hitting Cloudflare Turnstile specifically, solving services cost ~$1 per 1000 solves and integrate in under 20 lines of Python.
The hardest targets (airlines, ticketing, financial data) need mobile proxies and full behavioral simulation. For most business use cases — competitor data, lead generation, market research — the techniques above are more than enough.
Take the next step
Skip the setup. Production-ready tools for scraping without bans:
Apify Scrapers Bundle — $29 one-time
Instant download. Documented. Ready to deploy.