Getting blocked is the number one frustration in web scraping. You write a perfect parser, test it on 10 pages, deploy it — and within an hour, every request returns a 403 or a CAPTCHA page.
After scraping millions of pages across hundreds of sites, here's everything I've learned about staying unblocked in 2026. These techniques work whether you're using Python, Node.js, or any other language.
Understanding Why You Get Blocked
Before diving into solutions, understand what you're up against. Modern anti-bot systems detect scrapers through:
- IP reputation: Too many requests from one IP
- Browser fingerprinting: Missing or inconsistent browser signatures
- Behavioral analysis: Inhuman request patterns (too fast, too regular)
- TLS fingerprinting: HTTP clients have different TLS signatures than real browsers
- JavaScript challenges: Checking if a real browser engine is executing JS
Each technique below addresses one or more of these detection vectors.
1. Rotate User Agents (and Do It Properly)
The most basic mistake: using the default `python-requests/2.31.0` user agent. Nearly every anti-bot system blocks it on sight.
But just setting a Chrome user agent isn't enough either. You need to rotate through realistic, current user agents.
```python
import random
import requests

# A small pool of current, realistic user agents (update these periodically)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:134.0) "
    "Gecko/20100101 Firefox/134.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/18.2 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
]

headers = {
    "User-Agent": random.choice(USER_AGENTS),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
}

response = requests.get("https://example.com", headers=headers)
```
Pro tip: Use ScrapeOps' free Fake Browser Headers API to get always-updated, realistic header sets instead of maintaining your own list.
2. Implement Smart Rate Limiting
Hitting a site with 100 requests per second is the fastest way to get banned. Real users don't browse that fast.
```python
import time
import random

import requests

def polite_request(url, session, min_delay=1.0, max_delay=3.0):
    # Randomized delay so requests don't arrive at fixed intervals
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)
    try:
        response = session.get(url, timeout=15)
        if response.status_code == 429:
            # Note: Retry-After may also be an HTTP-date; this handles
            # only the delay-in-seconds form
            retry_after = int(response.headers.get("Retry-After", 60))
            print(f"Rate limited. Waiting {retry_after}s...")
            time.sleep(retry_after)
            return session.get(url, timeout=15)
        return response
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return None
```
Key rules:
- Randomize delays (never use a fixed `time.sleep(2)`)
- Respect `Retry-After` headers
- Back off exponentially on repeated failures
- Scrape during off-peak hours for the target site's timezone
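The "back off exponentially" rule can be sketched as a small helper. This is a "full jitter" variant; the base and cap values here are arbitrary choices, not from any standard:

```python
import random

def backoff_delay(failures, base=1.0, cap=60.0):
    """Return a randomized delay (seconds) for the given failure count.

    Delay grows as base * 2**failures up to a cap, with full jitter so
    concurrent scrapers don't all retry in lockstep.
    """
    return random.uniform(0, min(cap, base * (2 ** failures)))
```

Call it as `time.sleep(backoff_delay(attempt))` inside your retry loop.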
3. Use Residential Proxies
Datacenter IPs are cheap but easily detected. Residential proxies route through real ISP addresses, making your requests look like they come from regular home internet users.
```python
import requests

proxy_url = "http://USER:PASS@proxy.thordata.com:9000"
proxies = {
    "http": proxy_url,
    "https": proxy_url,
}

# headers is the browser-like header dict from technique 1
response = requests.get(
    "https://target-site.com/data",
    proxies=proxies,
    headers=headers,
    timeout=30,
)
```
For cost-effective residential proxies, ThorData offers rates starting at $0.60/GB — significantly cheaper than enterprise alternatives while maintaining good IP quality.
If you want managed proxy rotation without configuring it yourself, ScraperAPI handles rotation, retries, and geo-targeting automatically through a simple API call.
4. Set Complete HTTP Headers
A real browser sends 10-15 headers with every request. A scraper using a bare `requests.get(url)` sends two or three. Anti-bot systems notice this discrepancy.
```python
import random
from urllib.parse import urlparse

def get_browser_headers(url):
    # USER_AGENTS is the list defined in technique 1
    parsed = urlparse(url)
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
        "Referer": f"{parsed.scheme}://{parsed.netloc}/",
    }
```
The Sec-Fetch-* headers are particularly important in 2026 — many anti-bot systems check for these. Missing them is a dead giveaway.
5. Handle JavaScript Rendering
Many modern websites require JavaScript to render their content. If you're only getting empty pages or "Please enable JavaScript" messages, you need a headless browser.
```python
from playwright.sync_api import sync_playwright

def scrape_js_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ],
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36"
            ),
        )
        page = context.new_page()
        # Hide the automation flag before any page script runs
        page.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', "
            "{get: () => undefined});"
        )
        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()
        return content
```
Important stealth tips:
- Remove the `navigator.webdriver` flag
- Set a realistic viewport size
- Use `--disable-blink-features=AutomationControlled`
- Add random mouse movements for heavily protected sites
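The random-mouse-movement tip can be sketched as a point generator. This is an illustrative helper (not part of Playwright): it interpolates a jittered path between two coordinates that you could replay with `page.mouse.move()`:

```python
import random

def human_mouse_path(start, end, steps=20):
    """Generate a jittered sequence of (x, y) points from start to end."""
    x0, y0 = start
    x1, y1 = end
    points = []
    for i in range(1, steps + 1):
        t = i / steps
        # Linear interpolation plus a small random wobble per step
        x = x0 + (x1 - x0) * t + random.uniform(-3, 3)
        y = y0 + (y1 - y0) * t + random.uniform(-3, 3)
        points.append((x, y))
    return points
```

In a Playwright session you would iterate the points: `for x, y in human_mouse_path((0, 0), (640, 360)): page.mouse.move(x, y)`.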
For sites with aggressive anti-bot (Cloudflare, Akamai), consider using ScraperAPI with render=true — they maintain browser farms optimized for bypassing these protections.
6. Solve CAPTCHAs Gracefully
When you hit a CAPTCHA, you have three options:
- Avoid it entirely — Better proxies and headers often prevent CAPTCHAs from triggering
- Use a CAPTCHA solving service — Services like 2Captcha or Anti-Captcha solve them for $1-3 per 1,000
- Use a managed scraping API — Services like ScraperAPI handle CAPTCHAs automatically
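Whichever option you choose, you first need to recognize a CAPTCHA or challenge page. A simple heuristic sketch (the marker strings and status codes here are assumptions; tune them for your targets):

```python
# Assumed markers for common CAPTCHA/challenge pages; extend as needed
CAPTCHA_MARKERS = (
    "g-recaptcha",      # Google reCAPTCHA widget
    "h-captcha",        # hCaptcha widget
    "cf-challenge",     # Cloudflare challenge page
    "are you a human",
)

def looks_like_captcha(status_code, html):
    """Heuristic check: does this response look like a CAPTCHA/challenge?"""
    if status_code in (403, 429, 503):
        return True
    body = html.lower()
    return any(marker in body for marker in CAPTCHA_MARKERS)
```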
```python
import requests
import time
import random

def scrape_with_captcha_retry(url, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(
            url,
            proxies=get_next_proxy(),          # your proxy rotation helper
            headers=get_browser_headers(url),  # from technique 4
        )
        if "captcha" not in response.text.lower() and response.status_code == 200:
            return response
        print(f"CAPTCHA detected, rotating proxy (attempt {attempt + 1})")
        time.sleep(random.uniform(5, 15))
    return None
```
7. Respect robots.txt (Mostly)
This isn't just about ethics — it's practical. Sites that see you ignoring robots.txt are more likely to deploy aggressive blocking.
```python
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

def can_scrape(url, user_agent="*"):
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
        return rp.can_fetch(user_agent, url)
    except Exception:
        # If robots.txt can't be fetched, assume scraping is allowed
        return True
```
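`RobotFileParser` can also surface a site's `Crawl-delay` directive, which makes a sensible floor for your request delays. A sketch using `parse()` on a made-up robots.txt, so it runs offline:

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content for illustration
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

delay = rp.crawl_delay("*")  # 5 seconds for this sample; None if unset
allowed = rp.can_fetch("*", "https://example.com/products")
```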
8. Use Session Persistence
Real browsers maintain cookies and sessions. Scrapers that don't maintain them look suspicious.
```python
import requests
import time
import random

session = requests.Session()

# First, visit the homepage to get cookies
session.get(
    "https://target-site.com",
    headers=get_browser_headers("https://target-site.com"),
)

time.sleep(random.uniform(1, 3))

# Then navigate to the page you actually want
response = session.get(
    "https://target-site.com/data/page-1",
    headers=get_browser_headers("https://target-site.com/data/page-1"),
)
```
This mimics how real users browse: they don't land directly on page 47 of search results — they start from the homepage and navigate.
9. Handle Errors and Adapt
The best scrapers adapt to blocking in real-time:
```python
import time
import random

def adaptive_scraper(urls, session):
    consecutive_failures = 0
    base_delay = 1.0
    for url in urls:
        # Exponential backoff with jitter, capped at base * 2**5
        delay = base_delay * (2 ** min(consecutive_failures, 5))
        delay += random.uniform(0, delay * 0.5)
        time.sleep(delay)
        headers = get_browser_headers(url)
        response = session.get(url, headers=headers, timeout=15)
        if response.status_code == 200:
            consecutive_failures = 0
            yield url, response
        elif response.status_code in (403, 429, 503):
            consecutive_failures += 1
            print(
                f"Blocked ({response.status_code}). "
                f"Backing off {delay:.1f}s. "
                f"Failures: {consecutive_failures}"
            )
            if consecutive_failures >= 5:
                print("Too many failures. Rotating proxy/session...")
                session = create_new_session()  # your session/proxy factory
                consecutive_failures = 0
```
Putting It All Together
Here's a production-ready scraping template combining all techniques above:
```python
import requests
import random
import time
from dataclasses import dataclass

@dataclass
class ScraperConfig:
    min_delay: float = 1.5
    max_delay: float = 4.0
    max_retries: int = 3
    proxy_url: str | None = None

def create_scraper(config: ScraperConfig) -> requests.Session:
    session = requests.Session()
    if config.proxy_url:
        session.proxies = {
            "http": config.proxy_url,
            "https": config.proxy_url,
        }
    return session

def scrape_url(url, session, config):
    for attempt in range(config.max_retries):
        delay = random.uniform(config.min_delay, config.max_delay)
        time.sleep(delay)
        headers = get_browser_headers(url)  # from technique 4
        try:
            response = session.get(url, headers=headers, timeout=15)
            if response.status_code == 200:
                return response
            if response.status_code == 429:
                wait = int(response.headers.get("Retry-After", 30))
                time.sleep(wait)
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
    return None

# Usage (target_urls and process() are your own)
config = ScraperConfig(
    proxy_url="http://USER:PASS@proxy.thordata.com:9000"
)
session = create_scraper(config)
for url in target_urls:
    result = scrape_url(url, session, config)
    if result:
        process(result)
```
When to Use Managed Scraping Platforms
If you're scraping at scale (thousands of pages daily), managing proxies, headers, and anti-bot evasion yourself becomes a full-time job. That's where managed platforms shine.
Apify provides ready-made scraping actors with built-in proxy rotation, retry logic, and data storage. For common scraping targets, using a pre-built actor is faster and cheaper than building from scratch.
For proxy-specific management, ThorData gives you affordable residential proxies, while ScrapeOps adds monitoring on top so you can see exactly which proxies and techniques are working.
Summary Checklist
Before deploying any scraper, verify you have:
- Rotating, up-to-date user agents
- Complete browser-like headers (including Sec-Fetch headers)
- Randomized delays between requests
- Residential proxy rotation for sensitive targets
- JavaScript rendering capability for SPA sites
- Error handling with exponential backoff
- Session persistence with cookies
- CAPTCHA handling strategy
- robots.txt awareness
Web scraping is an arms race, but these fundamentals haven't changed much over the years. Master them, and you'll scrape the vast majority of websites without issues.
What's your biggest scraping challenge in 2026? Drop a comment below — happy to help troubleshoot.