agenthustler
How to Avoid Getting Blocked While Scraping in 2026 (Complete Guide)

Every web scraper eventually hits the wall: your requests start returning 403s, CAPTCHAs appear on every page, or your IP gets blacklisted entirely. In 2026, anti-bot systems are more sophisticated than ever — but they're not unbeatable. This guide covers every layer of bot detection and how to work around each one.

Why Sites Block Scrapers

Modern anti-bot systems don't rely on a single check. They use layered detection that examines multiple signals simultaneously:

  1. IP reputation — Is this IP address from a datacenter? Has it shown suspicious request patterns before?
  2. TLS fingerprinting — Does the TLS handshake match a real browser, or does it look like a Python script?
  3. JavaScript challenges — Can the client execute JavaScript and return the expected result?
  4. Browser fingerprinting — Do the browser properties (screen size, fonts, WebGL, canvas) look like a real user?
  5. Behavioral analysis — Is the browsing pattern human-like (mouse movements, scroll patterns, timing)?
  6. CAPTCHAs — The last resort when other signals are ambiguous.

Services like Cloudflare, Akamai Bot Manager, and PerimeterX combine these layers. To avoid detection, you need to address multiple layers simultaneously.

Layer 1: IP Rotation and Proxy Types

The most fundamental anti-blocking strategy is rotating your IP address. But not all proxies are equal.

Datacenter Proxies

  • What: IPs from cloud providers (AWS, GCP, DigitalOcean).
  • Cost: $1-5 per GB.
  • Detection rate: High. Most anti-bot systems maintain databases of datacenter IP ranges.
  • Use when: Scraping sites with minimal protection.

Residential Proxies

  • What: IPs from real ISPs, assigned to home users.
  • Cost: $5-15 per GB.
  • Detection rate: Low. These IPs look identical to regular users.
  • Use when: Scraping sites with strong anti-bot protection.

ISP Proxies

  • What: Datacenter-hosted IPs registered to ISPs.
  • Cost: $3-8 per GB.
  • Detection rate: Medium. Faster than residential but less detectable than datacenter.
  • Use when: You need speed and moderate stealth.
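Whichever tier you choose, the mechanics of rotation are the same: spread requests across a pool so no single IP carries suspicious volume. A minimal round-robin sketch (the proxy URLs below are placeholders, not real endpoints):

```python
from itertools import cycle

# Hypothetical proxy endpoints; substitute your provider's gateway URLs
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

_pool = cycle(PROXIES)

def next_proxy():
    """Return the next proxy in the pool, in the dict format `requests` expects."""
    proxy = next(_pool)
    return {"http": proxy, "https": proxy}

# Usage: requests.get(url, proxies=next_proxy(), timeout=30)
```

Real providers usually expose a single gateway that rotates for you, but a local pool like this is useful when you buy static IPs and rotate them yourself.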

A service like ScraperAPI handles proxy rotation automatically — you send requests through their endpoint and they select the right proxy type, rotate IPs, and handle retries. It's the easiest way to start if you don't want to manage proxy infrastructure yourself.
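If you roll your own client instead, retries belong right next to rotation: on a block response, back off exponentially with jitter rather than hammering the same endpoint. A minimal sketch, assuming a `fetch` callable you supply that returns a status code and body:

```python
import random
import time

def fetch_with_backoff(fetch, url, max_retries=4, base_delay=1.0):
    """Retry fetch(url) on 403/429 block responses with exponential backoff.

    `fetch` is any callable returning (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status not in (403, 429):
            return status, body
        if attempt < max_retries:
            # Double the delay each attempt, with +/-50% jitter: ~1s, 2s, 4s, 8s
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
    return status, body
```

In a full setup you would also rotate to a fresh proxy on each retry, since a 403 often means that particular IP is burned.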

Layer 2: Rate Limiting and Request Patterns

Even with good proxies, predictable request patterns will get you flagged.

Strategies:

  • Random delays. Never use fixed intervals. Add randomized delays between 2-8 seconds using random.uniform(2, 8).
  • Request queuing. Use a queue with configurable concurrency instead of firing all requests at once. Libraries like asyncio.Semaphore in Python work well.
  • Session management. Maintain cookies and sessions across requests. Anti-bot systems flag clients that don't maintain state.
  • Respect robots.txt. Not just for ethics — sites monitor for bots that ignore it.
  • Vary request headers. Rotate User-Agent strings and include realistic headers (Accept, Accept-Language, Accept-Encoding).
import random
import asyncio

async def scrape_with_delays(urls, semaphore):
    for url in urls:
        # Hold the semaphore only while the request is in flight
        async with semaphore:
            await fetch(url)  # your fetch() implementation
        # Randomized delay between requests, never a fixed interval
        await asyncio.sleep(random.uniform(2, 8))

# Limit to 5 concurrent in-flight requests across all workers
sem = asyncio.Semaphore(5)
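The header-rotation point deserves its own sketch: build an internally consistent header set once per session, not per request, since a client whose User-Agent changes mid-session is itself a bot signal. The User-Agent strings below are illustrative examples, not guaranteed-current browser versions:

```python
import random

# Illustrative desktop User-Agent strings; keep these current in real use
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

def make_headers():
    """Build a realistic, internally consistent header set for one session."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    }
```

In practice, install these on a requests.Session with session.headers.update(make_headers()) and reuse that session for the whole crawl, so cookies accumulate the way they would for a real visitor.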

Layer 3: Browser Fingerprinting and Stealth

When sites use JavaScript-based detection, you need a real browser — but a default Playwright or Puppeteer instance leaks signals that identify it as automated.

Common detection vectors:

  • navigator.webdriver is set to true
  • Missing browser plugins
  • Viewport size set to unusual dimensions
  • Missing or inconsistent WebGL renderer info
  • Canvas fingerprint doesn't match the claimed browser

Stealth solutions:

  • playwright-stealth / puppeteer-stealth: Patches that override common detection vectors.
  • Undetected-chromedriver: Modified ChromeDriver that avoids detection.
  • Custom browser profiles: Create persistent profiles with realistic browser history, cookies, and settings.
import asyncio
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # Patch the common detection vectors before navigating
        await stealth_async(page)
        await page.goto('https://target-site.com')
        await browser.close()

asyncio.run(main())

Key tip: Headless mode is more detectable than headed mode. If you're running on a server, use xvfb (virtual display) with headed mode for better stealth.

Layer 4: CAPTCHA Solving

When all other detection layers are ambiguous, sites deploy CAPTCHAs. Here are your options:

Service        Cost (per 1,000 solves)   Speed    Accuracy
2Captcha       $2.99                     10-30s   ~95%
Anti-Captcha   $2.00                     10-25s   ~96%
CapSolver      $1.50                     5-15s    ~94%

Most CAPTCHA solvers work the same way: you send the CAPTCHA image or sitekey, they return the solution. Integration is straightforward:

import time
import requests

def solve_recaptcha(sitekey, url, api_key):
    # Submit the task
    resp = requests.post('https://2captcha.com/in.php', data={
        'key': api_key,
        'method': 'userrecaptcha',
        'googlekey': sitekey,
        'pageurl': url,
    })
    if not resp.text.startswith('OK|'):
        raise RuntimeError(f'Submit failed: {resp.text}')
    task_id = resp.text.split('|')[1]

    # Poll for the result (up to ~150 seconds)
    for _ in range(30):
        time.sleep(5)
        result = requests.get(
            f'https://2captcha.com/res.php?key={api_key}&action=get&id={task_id}'
        )
        if result.text.startswith('OK|'):
            return result.text.split('|')[1]
        if result.text != 'CAPCHA_NOT_READY':
            raise RuntimeError(f'Solve failed: {result.text}')
    return None  # timed out

Better approach: Avoid CAPTCHAs entirely by solving the earlier detection layers. CAPTCHAs are expensive and slow — they should be your last resort, not your primary strategy.

Layer 5: Using Managed Scraping Platforms

If you'd rather not deal with proxies, fingerprinting, and CAPTCHAs yourself, managed platforms handle all of this under the hood.

Apify Actors, for example, come with built-in proxy rotation, browser management, and anti-detection. Actors like the LinkedIn Jobs Scraper and Reddit Scraper handle anti-bot challenges automatically — you just configure the input and get clean data.

The advantage is clear: instead of spending days building and maintaining anti-detection logic, you use a tested solution. The tradeoff is cost per compute unit versus your engineering time.

The Nuclear Option: Residential Proxy + Stealth Browser

For the most heavily protected sites, you need the full stack:

  1. Residential proxy with session persistence (sticky sessions)
  2. Playwright in headed mode with stealth patches
  3. Realistic browsing patterns — visit the homepage first, navigate naturally, scroll
  4. Persistent browser profile with cookies from previous visits
  5. Random delays between all actions
  6. CAPTCHA solver as a fallback

This is expensive ($10-20+ per GB in proxy costs plus compute) and slow, but it defeats virtually all current anti-bot systems. Reserve it for high-value targets where the data justifies the cost.
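The "realistic browsing patterns" and "random delays" steps above can be approximated with a small pacing helper: most gaps between actions are short, with an occasional longer pause as if the user is reading. A rough sketch with made-up timing constants you should tune per target:

```python
import random

def human_pause(short=(0.5, 2.5), long=(6.0, 15.0), long_chance=0.15):
    """Return a delay in seconds mimicking human gaps between actions.

    Mostly short pauses between clicks and scrolls, with an occasional
    longer 'reading' pause. The ranges here are illustrative defaults.
    """
    if random.random() < long_chance:
        return random.uniform(*long)
    return random.uniform(*short)

# Usage inside a browser automation script:
#   await page.mouse.wheel(0, 400)
#   await asyncio.sleep(human_pause())
```

Sleeping for `human_pause()` after every navigation, scroll, and click produces timing distributions far closer to real traffic than a flat `sleep(3)` loop.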

Recommended Proxy Provider

For reliable residential proxies that won't break the budget, ThorData is worth evaluating. They offer residential proxy bandwidth at competitive per-GB pricing with wide geographic coverage — ideal for scraping targets that block datacenter IPs like Cloudflare-protected sites, Crunchbase, or LinkedIn.

Summary: Anti-Blocking Checklist

  • [ ] Rotate IPs with residential proxies for protected sites
  • [ ] Add random delays (2-8s) between requests
  • [ ] Rotate User-Agent strings and headers
  • [ ] Use stealth browser plugins when JavaScript rendering is needed
  • [ ] Maintain sessions and cookies across requests
  • [ ] Respect rate limits and robots.txt
  • [ ] Use CAPTCHA solvers only as a last resort
  • [ ] Consider managed platforms (Apify) to skip the infrastructure work
  • [ ] Monitor your success rate and adapt when detection changes

The key insight is that anti-blocking is not a single technique — it's a layered approach that matches the layered detection systems you're facing. Start with the simplest solution (proxy rotation + delays) and add complexity only when needed.
