Web Scraping Without Getting Banned in 2026: The Complete Anti-Bot Bypass Guide
Getting blocked is the #1 frustration for web scrapers. You write the code, it works for 10 minutes, then you're staring at a 403 or a Cloudflare challenge page. This guide covers every technique that actually works in 2026 — from basic rate limiting to defeating Turnstile — so you can scrape 100 to 50,000 records without getting banned.
Why You're Getting Blocked: The Real Reasons
Before fixes, you need to understand what's detecting you. Modern anti-bot systems check multiple signals simultaneously:
TLS fingerprint: Python's requests library sends a TLS handshake that looks nothing like a browser's. Services like Cloudflare identify it in milliseconds.
HTTP/2 fingerprint: Browsers use HTTP/2 with specific frame ordering. requests only speaks HTTP/1.1, which is an instant giveaway.
Browser fingerprint: Headless Chrome exposes detectable properties such as navigator.webdriver=true, missing plugins, wrong screen dimensions, and no GPU renderer.
Behavioral signals: Too fast, too regular, no mouse movement, no scroll events, straight-line navigation patterns.
IP reputation: Datacenter IPs (AWS, GCP, Azure, Hetzner) are pre-blocked on most serious sites. Even residential IPs get flagged if they hit too fast.
Knowing which layer is detecting you tells you what to fix.
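One way to act on that is a small triage helper that looks at a blocked response and guesses which layer fired. This is a rough sketch, not a definitive classifier: the function name classify_block and the body markers it searches for are assumptions for illustration, and real sites vary widely.

```python
def classify_block(status: int, headers: dict, body: str) -> str:
    """Guess which detection layer produced a response (heuristic only)."""
    h = {k.lower(): str(v).lower() for k, v in headers.items()}
    text = body.lower()

    if status == 429:
        return "rate-limit"          # pacing problem: slow down or rotate
    if h.get("cf-mitigated") == "challenge" or "just a moment" in text:
        return "js-challenge"        # needs a real browser or a solver
    if "captcha" in text or "cf-turnstile" in text:
        return "captcha"
    if status == 403:
        return "fingerprint-or-ip"   # TLS/HTTP fingerprint or IP reputation
    return "ok" if status < 400 else "unknown"


print(classify_block(403, {"cf-mitigated": "challenge"}, ""))
```

A "fingerprint-or-ip" result points at Layers 1 and 2 below, "rate-limit" at request timing, and the challenge results at Layers 3 and 4.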
Layer 1: TLS and HTTP Fingerprinting (Fix This First)
The single biggest win for most scrapers: stop using plain requests.
Use curl-cffi to impersonate real browsers
from curl_cffi import requests

session = requests.Session()

# Impersonate Chrome 120 so the TLS fingerprint matches a real browser
response = session.get(
    "https://target-site.com/data",
    impersonate="chrome120",
)
# Other options: chrome110, safari17_0, edge101
print(response.status_code)  # often 200 where plain requests gets a 403
Install: pip install curl-cffi
curl-cffi binds to curl-impersonate, a patched libcurl build, and replicates the TLS cipher suite order, extension list, and HTTP/2 SETTINGS frame that Chrome sends. Many sites that block requests outright will pass curl-cffi straight through.
httpx with HTTP/2
For sites that check HTTP version but not TLS fingerprint deeply:
import httpx  # HTTP/2 support needs: pip install httpx[http2]

with httpx.Client(http2=True) as client:
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
    }
    response = client.get("https://target-site.com", headers=headers)
Layer 2: Rotating Proxies the Right Way
Most scrapers use proxies wrong. Here's what actually works:
Proxy type matters
| Proxy Type | Detection Risk | Cost | Good For |
|---|---|---|---|
| Datacenter | Very High | Low | Public data, no protection |
| Residential | Low | Medium | Most protected sites |
| Mobile (4G) | Very Low | High | Strictest anti-bot |
| ISP (static residential) | Low | Medium-High | Consistent sessions |
For 100-5000 records on a protected site, residential proxies are the sweet spot.
Python proxy rotation with backoff
import requests
import time
import random
from itertools import cycle

PROXIES = [
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
    "http://user:pass@proxy3:8080",
]
proxy_pool = cycle(PROXIES)

def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        proxy = next(proxy_pool)
        try:
            response = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=15,
                headers={"User-Agent": get_random_ua()}
            )
            if response.status_code == 200:
                return response
            elif response.status_code == 429:
                # Rate limited: back off exponentially, then try a different proxy
                time.sleep(2 ** attempt + random.uniform(0, 1))
        except requests.RequestException:
            time.sleep(1)
    return None

def get_random_ua():
    agents = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    ]
    return random.choice(agents)
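A plain cycle keeps handing out proxies that have already been burned. One possible refinement, sketched here under the assumption that a short streak of failures means a proxy is dead, is a pool that retires bad proxies. The ProxyPool class and its max_failures threshold are hypothetical names for illustration, not part of any library.

```python
import random


class ProxyPool:
    """Rotate proxies and retire any that fail too many times in a row."""

    def __init__(self, proxies, max_failures=3):
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures

    def pick(self):
        # Only proxies below the failure threshold are eligible
        alive = [p for p, n in self.failures.items() if n < self.max_failures]
        if not alive:
            raise RuntimeError("all proxies are burned")
        return random.choice(alive)

    def report(self, proxy, ok):
        # A success resets the streak; a failure extends it
        self.failures[proxy] = 0 if ok else self.failures[proxy] + 1


pool = ProxyPool(["http://user:pass@proxy1:8080", "http://user:pass@proxy2:8080"])
proxy = pool.pick()
pool.report(proxy, ok=False)  # call after each request with the outcome
```

Inside a retry loop, pick() would replace next(proxy_pool), and report() would be called with ok=True on a 200 and ok=False on a block or connection error.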
Request timing — the key that most guides skip
import time
import random

def human_delay(min_sec=1.5, max_sec=4.0):
    """Simulate human browsing pace"""
    time.sleep(random.uniform(min_sec, max_sec))

# Between pages
human_delay(2, 5)

# Between sites (session warm-up)
human_delay(5, 10)
At 100 records: comfortable at 1 request/2 seconds
At 1000 records: use 3-5 second delays + proxy rotation
At 5000+ records: rotate proxies every 50-100 requests, add session reuse
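These delays add up quickly, so it helps to estimate the run time before starting. The sketch below uses the mean of a uniform delay and ignores network time. The function name estimated_hours is made up for this example, and the 3-5 second range for the 5000-record row is an assumption, since the guidance above gives no delay for that tier.

```python
def estimated_hours(n_records: int, min_delay: float, max_delay: float) -> float:
    """Expected hours spent sleeping for a single-threaded job."""
    mean_delay = (min_delay + max_delay) / 2
    return n_records * mean_delay / 3600


# (records, min delay, max delay); the last row's delay is an assumption
for n, lo, hi in [(100, 2, 2), (1000, 3, 5), (5000, 3, 5)]:
    print(n, round(estimated_hours(n, lo, hi), 2))
# 100 0.06
# 1000 1.11
# 5000 5.56
```

A five-hour single-threaded run is one reason larger jobs tend to split work across several sessions, each with its own proxy.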
Layer 3: Playwright for JavaScript-Heavy Sites
When the target site requires JavaScript execution (React, Vue, Angular), you need a real browser:
Stealth Playwright setup
from playwright.sync_api import sync_playwright
import time, random

def create_stealth_browser():
    p = sync_playwright().start()
    browser = p.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",
            "--no-sandbox",
            "--disable-dev-shm-usage",
        ]
    )
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        viewport={"width": 1280, "height": 800},
        locale="en-US",
        timezone_id="America/New_York",
        # Add proxy here if needed:
        # proxy={"server": "http://proxy:8080", "username": "u", "password": "p"}
    )
    # Patch webdriver detection
    context.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
        window.chrome = {runtime: {}};
    """)
    return p, browser, context

def scrape_page(url):
    p, browser, context = create_stealth_browser()
    try:
        page = context.new_page()
        # Human-like: navigate, wait, scroll
        page.goto(url, wait_until="networkidle")
        time.sleep(random.uniform(1.5, 3))
        # Scroll to simulate reading
        page.evaluate("window.scrollBy(0, 300)")
        time.sleep(random.uniform(0.5, 1.5))
        return page.content()
    finally:
        # Always release the browser, even if navigation fails
        browser.close()
        p.stop()
Install: pip install playwright && playwright install chromium
Layer 4: Defeating Cloudflare Turnstile
Cloudflare Turnstile (Cloudflare's CAPTCHA replacement, usually non-interactive) is among the hardest challenges in 2026. It runs JavaScript fingerprinting and behavior analysis, and sometimes falls back to an interactive checkbox. There are three viable approaches:
Option A: Avoid it entirely (fastest, free)
Many sites have unprotected API endpoints even when the HTML is protected:
import requests

# Instead of scraping the HTML page:
# https://shop.example.com/products
# Try the API directly:
response = requests.get(
    "https://shop.example.com/api/products",
    headers={"Accept": "application/json"}
)

# Or the GraphQL endpoint:
response = requests.post(
    "https://shop.example.com/graphql",
    json={"query": "{ products { id name price } }"}
)
Open your browser's Network tab, filter for XHR/Fetch requests, and look for JSON responses. About 60% of protected sites expose clean APIs this way.
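If the Network tab is busy, the same hunt can be done offline. Browsers can export the captured traffic as a HAR file, which is plain JSON. The sketch below lists every URL whose response was served as JSON. The function name json_endpoints and the file name shop.har are placeholders for illustration.

```python
def json_endpoints(har: dict) -> list:
    """Return unique request URLs from a HAR export whose responses were JSON."""
    found = []
    for entry in har.get("log", {}).get("entries", []):
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        url = entry.get("request", {}).get("url", "")
        if "json" in mime.lower() and url and url not in found:
            found.append(url)
    return found


# Usage, assuming a HAR file exported from the Network tab:
# import json
# with open("shop.har") as f:
#     for url in json_endpoints(json.load(f)):
#         print(url)
```

Anything it prints is a candidate for the direct API calls shown above, though some endpoints will still need the cookies or headers the browser sent.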
Option B: Turnstile solving services ($0.001-$0.002 per solve)
When you must solve the challenge:
import requests
import time

SOLVER_API_KEY = "your_2captcha_or_anticaptcha_key"

def solve_turnstile(page_url, sitekey):
    # Submit task
    r = requests.post("https://api.2captcha.com/createTask", json={
        "clientKey": SOLVER_API_KEY,
        "task": {
            "type": "TurnstileTaskProxyless",
            "websiteURL": page_url,
            "websiteKey": sitekey
        }
    })
    task_id = r.json()["taskId"]
    # Poll for result
    for _ in range(30):
        time.sleep(5)
        result = requests.post("https://api.2captcha.com/getTaskResult", json={
            "clientKey": SOLVER_API_KEY,
            "taskId": task_id
        }).json()
        if result["status"] == "ready":
            return result["solution"]["token"]
    raise Exception("Solving timeout")

# Use the token in your request
token = solve_turnstile("https://target.com", "0x4AAAAAAABxxxxxxx")
response = requests.post(
    "https://target.com/submit",
    data={"cf-turnstile-response": token, "other_field": "value"}
)
Cost at scale: 1000 solves = ~$1-2. Services: 2captcha, Anti-Captcha, CapSolver.
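The arithmetic behind that estimate is simple enough to script when budgeting a job. In this sketch, solver_budget and the challenge_rate parameter (the fraction of requests that actually trigger a challenge) are assumptions for illustration. Only the per-solve price range comes from the figures above.

```python
def solver_budget(n_requests: int, challenge_rate: float, price_per_solve: float) -> float:
    """Dollars spent on solves when only a fraction of requests hit a challenge."""
    return n_requests * challenge_rate * price_per_solve


# Every request challenged, 1000 requests, at both ends of the price range
low = solver_budget(1000, 1.0, 0.001)
high = solver_budget(1000, 1.0, 0.002)
print(f"${low:.2f} to ${high:.2f}")  # $1.00 to $2.00
```

If session reuse means only a share of requests need a fresh token, lowering challenge_rate shows how much that saves.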
Option C: Real (headed) browser with stealth
For low-volume scraping where you need to interact with the full page:
from playwright.sync_api import sync_playwright

def bypass_turnstile_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # False helps pass more checks
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
        )
        context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        """)
        page = context.new_page()
        page.goto(url)
        # Wait for Turnstile to complete (it often auto-solves for real browsers).
        # Assumption: on success the widget fills its hidden
        # cf-turnstile-response input with the token.
        page.wait_for_function(
            "() => document.querySelector('[name=\"cf-turnstile-response\"]')?.value",
            timeout=30000,
        )
        # Now extract whatever you need
        data = page.evaluate("() => document.querySelector('#data').innerText")
        browser.close()
        return data
Layer 5: Session Management and Cookies
The pattern that breaks most scrapers: treating every request as stateless.
import json
import random
import time

import requests

class PersistentSession:
    def __init__(self, proxy=None):
        self.session = requests.Session()
        if proxy:
            self.session.proxies = {"http": proxy, "https": proxy}
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Macintosh...) Chrome/120.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
        })

    def warm_up(self, base_url):
        """Visit homepage first to build cookie session"""
        self.session.get(base_url)
        time.sleep(random.uniform(2, 4))

    def get(self, url):
        return self.session.get(url)

    def save_cookies(self, path):
        # Cookie objects are not JSON-serializable, so store plain dicts
        cookies = [
            {"name": c.name, "value": c.value, "domain": c.domain, "path": c.path}
            for c in self.session.cookies
        ]
        with open(path, "w") as f:
            json.dump(cookies, f)

    def load_cookies(self, path):
        with open(path) as f:
            for cookie in json.load(f):
                self.session.cookies.set(**cookie)

# Usage
scraper = PersistentSession(proxy="http://user:pass@residential-proxy:8080")
scraper.warm_up("https://target-site.com")  # Build session
data = scraper.get("https://target-site.com/data/page/1")
Key points:
- Always visit the homepage before the target page
- Reuse the same session for a site (keep cookies)
- Add Referer headers that match the actual navigation path
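The Referer point is easy to get wrong by hard-coding one value. A small tracker that remembers the previous URL keeps the header consistent with the path actually taken. This is a minimal sketch: NavigationTrail is a hypothetical helper, and it simplifies Sec-Fetch-Site to three values, ignoring the same-site case a real browser also sends.

```python
from urllib.parse import urlsplit


class NavigationTrail:
    """Track the last visited URL so each request carries a plausible Referer."""

    def __init__(self):
        self.last_url = None

    def headers_for(self, url: str) -> dict:
        headers = {}
        if self.last_url is None:
            headers["Sec-Fetch-Site"] = "none"  # first visit, like a typed URL
        else:
            prev, cur = urlsplit(self.last_url), urlsplit(url)
            same = (prev.scheme, prev.netloc) == (cur.scheme, cur.netloc)
            headers["Referer"] = self.last_url
            headers["Sec-Fetch-Site"] = "same-origin" if same else "cross-site"
        self.last_url = url
        return headers


trail = NavigationTrail()
print(trail.headers_for("https://target-site.com/"))
print(trail.headers_for("https://target-site.com/data/page/1"))
```

With the PersistentSession class above, the returned dict could be passed as headers= on each session.get call.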
Practical Rate Limits by Site Protection Level
| Protection Level | Examples | Safe Request Rate |
|---|---|---|
| None | Most blogs, news | 1 req/sec |
| Basic (rate limit only) | Small e-commerce | 1 req/3-5s |
| Moderate (Cloudflare Basic) | Mid-size retail | 1 req/5-10s + proxy rotation |
| Heavy (Turnstile + JS checks) | LinkedIn, Amazon | Solving service + 10-30s delays |
| Maximum (behavioral AI) | Ticketmaster, airlines | Mobile proxies + full browser |
Quick Decision Tree: Which Approach to Use
Is the data in a JSON API?
YES → Use requests + proxy, skip browser entirely
NO ↓
Does the page require JavaScript to render content?
NO → Use curl-cffi (impersonate Chrome)
YES ↓
Is there a Cloudflare Turnstile challenge?
NO → Use Playwright with stealth patches
YES ↓
Volume > 1000 requests?
YES → Use solving service (2captcha/CapSolver)
NO → Use headless=False Playwright (often auto-solves)
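The same tree can be written as a function, which is handy when a scraper has to pick a strategy per target. The function name, parameter names, and return labels are placeholders chosen for this sketch. The branching mirrors the tree above.

```python
def choose_approach(json_api: bool, needs_js: bool, turnstile: bool, volume: int) -> str:
    """Walk the decision tree top to bottom and return the suggested approach."""
    if json_api:
        return "requests + proxy"
    if not needs_js:
        return "curl-cffi"
    if not turnstile:
        return "playwright-stealth"
    # Turnstile present: the choice depends on request volume
    return "solving-service" if volume > 1000 else "playwright-headed"


print(choose_approach(json_api=False, needs_js=True, turnstile=True, volume=200))
# playwright-headed
```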
Tools Reference
| Tool | Use Case | Install |
|---|---|---|
| curl-cffi | TLS fingerprint bypass | pip install curl-cffi |
| httpx[http2] | HTTP/2 support | pip install httpx[http2] |
| playwright | JavaScript rendering | pip install playwright |
| undetected-chromedriver | Selenium alternative | pip install undetected-chromedriver |
| scrapy-rotating-proxies | Scrapy proxy rotation | pip install scrapy-rotating-proxies |
When to Use a Managed Scraping Service
Writing all this yourself makes sense for 1-3 targets you know well. For production scraping across many sites, maintaining anti-bot bypass code becomes its own full-time job — Cloudflare updates every few weeks, browser fingerprints shift, proxy IPs get burned.
A managed approach lets you focus on the data pipeline while the infrastructure handles detection. The tradeoff is cost vs. maintenance time.
Summary
The 80/20 of not getting banned:
- Use curl-cffi: fixes TLS fingerprinting immediately, handles 60% of blocks
- Add residential proxies: fixes IP reputation, 20% more coverage
- Slow down: 2-5 second delays eliminate most rate-limit blocks
- Warm up sessions: visit homepage first, reuse cookies, add Referer
- Check for hidden APIs: often cleaner than scraping HTML at all
If you're hitting Cloudflare Turnstile specifically, solving services cost ~$1 per 1000 solves and integrate in under 20 lines of Python.
The hardest targets (airlines, ticketing, financial data) need mobile proxies and full behavioral simulation. For most business use cases — competitor data, lead generation, market research — the techniques above are more than enough.