agenthustler
How to Avoid Getting Blocked While Scraping in 2026 (Complete Guide)

Every web scraper eventually hits the wall: your requests start returning 403s, CAPTCHAs appear on every page, or your IP gets blacklisted entirely. In 2026, anti-bot systems are more sophisticated than ever — but they're not unbeatable. This guide covers every layer of bot detection and how to work around each one.

Why Sites Block Scrapers

Modern anti-bot systems don't rely on a single check. They use layered detection that examines multiple signals simultaneously:

  1. IP reputation — Is this IP address from a datacenter? Has it shown suspicious request patterns before?
  2. TLS fingerprinting — Does the TLS handshake match a real browser, or does it look like a Python script?
  3. JavaScript challenges — Can the client execute JavaScript and return the expected result?
  4. Browser fingerprinting — Do the browser properties (screen size, fonts, WebGL, canvas) look like a real user?
  5. Behavioral analysis — Is the browsing pattern human-like (mouse movements, scroll patterns, timing)?
  6. CAPTCHAs — The last resort when other signals are ambiguous.

Services like Cloudflare, Akamai Bot Manager, and PerimeterX combine these layers. To avoid detection, you need to address multiple layers simultaneously.

Layer 1: IP Rotation and Proxy Types

The most fundamental anti-blocking strategy is rotating your IP address. But not all proxies are equal.

Datacenter Proxies

  • What: IPs from cloud providers (AWS, GCP, DigitalOcean).
  • Cost: $1-5 per GB.
  • Detection rate: High. Most anti-bot systems maintain databases of datacenter IP ranges.
  • Use when: Scraping sites with minimal protection.

Residential Proxies

  • What: IPs from real ISPs, assigned to home users.
  • Cost: $5-15 per GB.
  • Detection rate: Low. These IPs look identical to regular users.
  • Use when: Scraping sites with strong anti-bot protection.

ISP Proxies

  • What: Datacenter-hosted IPs registered to ISPs.
  • Cost: $3-8 per GB.
  • Detection rate: Medium. Faster than residential but less detectable than datacenter.
  • Use when: You need speed and moderate stealth.
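Whichever tier you choose, the mechanics of rotation are the same: spread requests across a pool so no single IP carries suspicious volume. A minimal round-robin sketch (the proxy URLs below are placeholders, not real endpoints):

```python
from itertools import cycle

# Hypothetical proxy endpoints; substitute your provider's gateway URLs
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

_pool = cycle(PROXIES)

def next_proxy():
    """Return the next proxy in the pool, in the dict format `requests` expects."""
    proxy = next(_pool)
    return {"http": proxy, "https": proxy}

# Usage: requests.get(url, proxies=next_proxy(), timeout=30)
```

Real providers usually expose a single gateway that rotates for you, but a local pool like this is useful when you buy static IPs and rotate them yourself.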

A service like ScraperAPI handles proxy rotation automatically — you send requests through their endpoint and they select the right proxy type, rotate IPs, and handle retries. It's the easiest way to start if you don't want to manage proxy infrastructure yourself.
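If you roll your own client instead, retries belong right next to rotation: on a block response, back off exponentially with jitter rather than hammering the same endpoint. A minimal sketch, assuming a `fetch` callable you supply that returns a status code and body:

```python
import random
import time

def fetch_with_backoff(fetch, url, max_retries=4, base_delay=1.0):
    """Retry fetch(url) on 403/429 block responses with exponential backoff.

    `fetch` is any callable returning (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status not in (403, 429):
            return status, body
        if attempt < max_retries:
            # Double the delay each attempt, with +/-50% jitter: ~1s, 2s, 4s, 8s
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
    return status, body
```

In a full setup you would also rotate to a fresh proxy on each retry, since a 403 often means that particular IP is burned.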

Layer 2: Rate Limiting and Request Patterns

Even with good proxies, predictable request patterns will get you flagged.

Strategies:

  • Random delays. Never use fixed intervals. Add randomized delays between 2-8 seconds using random.uniform(2, 8).
  • Request queuing. Use a queue with configurable concurrency instead of firing all requests at once. Libraries like asyncio.Semaphore in Python work well.
  • Session management. Maintain cookies and sessions across requests. Anti-bot systems flag clients that don't maintain state.
  • Respect robots.txt. Not just for ethics — sites monitor for bots that ignore it.
  • Vary request headers. Rotate User-Agent strings and include realistic headers (Accept, Accept-Language, Accept-Encoding).
import random
import asyncio

async def scrape_with_delays(urls, semaphore):
    for url in urls:
        # Hold the semaphore only while the request is in flight
        async with semaphore:
            await fetch(url)  # your fetch() implementation
        # Randomized delay between requests, never a fixed interval
        await asyncio.sleep(random.uniform(2, 8))

# Limit to 5 concurrent in-flight requests across all workers
sem = asyncio.Semaphore(5)
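The header-rotation point deserves its own sketch: build an internally consistent header set once per session, not per request, since a client whose User-Agent changes mid-session is itself a bot signal. The User-Agent strings below are illustrative examples, not guaranteed-current browser versions:

```python
import random

# Illustrative desktop User-Agent strings; keep these current in real use
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

def make_headers():
    """Build a realistic, internally consistent header set for one session."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    }
```

In practice, install these on a requests.Session with session.headers.update(make_headers()) and reuse that session for the whole crawl, so cookies accumulate the way they would for a real visitor.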

Layer 3: Browser Fingerprinting and Stealth

When sites use JavaScript-based detection, you need a real browser — but a default Playwright or Puppeteer instance leaks signals that identify it as automated.

Common detection vectors:

  • navigator.webdriver is set to true
  • Missing browser plugins
  • Viewport size set to unusual dimensions
  • Missing or inconsistent WebGL renderer info
  • Canvas fingerprint doesn't match the claimed browser

Stealth solutions:

  • playwright-stealth / puppeteer-stealth: Patches that override common detection vectors.
  • Undetected-chromedriver: Modified ChromeDriver that avoids detection.
  • Custom browser profiles: Create persistent profiles with realistic browser history, cookies, and settings.
import asyncio
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # Patch the common detection vectors before navigating
        await stealth_async(page)
        await page.goto('https://target-site.com')
        await browser.close()

asyncio.run(main())

Key tip: Headless mode is more detectable than headed mode. If you're running on a server, use xvfb (virtual display) with headed mode for better stealth.

Layer 4: CAPTCHA Solving

When all other detection layers are ambiguous, sites deploy CAPTCHAs. Here are your options:

Service        Cost (per 1,000 solves)   Speed    Accuracy
2Captcha       $2.99                     10-30s   ~95%
Anti-Captcha   $2.00                     10-25s   ~96%
CapSolver      $1.50                     5-15s    ~94%

Most CAPTCHA solvers work the same way: you send the CAPTCHA image or sitekey, they return the solution. Integration is straightforward:

import time
import requests

def solve_recaptcha(sitekey, url, api_key):
    # Submit the task
    resp = requests.post('https://2captcha.com/in.php', data={
        'key': api_key,
        'method': 'userrecaptcha',
        'googlekey': sitekey,
        'pageurl': url,
    })
    if not resp.text.startswith('OK|'):
        raise RuntimeError(f'Submit failed: {resp.text}')
    task_id = resp.text.split('|')[1]

    # Poll for the result (up to ~150 seconds)
    for _ in range(30):
        time.sleep(5)
        result = requests.get(
            f'https://2captcha.com/res.php?key={api_key}&action=get&id={task_id}'
        )
        if result.text.startswith('OK|'):
            return result.text.split('|')[1]
        if result.text != 'CAPCHA_NOT_READY':
            raise RuntimeError(f'Solve failed: {result.text}')
    return None  # timed out

Better approach: Avoid CAPTCHAs entirely by solving the earlier detection layers. CAPTCHAs are expensive and slow — they should be your last resort, not your primary strategy.

Layer 5: Using Managed Scraping Platforms

If you'd rather not deal with proxies, fingerprinting, and CAPTCHAs yourself, managed platforms handle all of this under the hood.

Apify Actors, for example, come with built-in proxy rotation, browser management, and anti-detection. Actors like the LinkedIn Jobs Scraper and Reddit Scraper handle anti-bot challenges automatically — you just configure the input and get clean data.

The advantage is clear: instead of spending days building and maintaining anti-detection logic, you use a tested solution. The tradeoff is cost per compute unit versus your engineering time.

The Nuclear Option: Residential Proxy + Stealth Browser

For the most heavily protected sites, you need the full stack:

  1. Residential proxy with session persistence (sticky sessions)
  2. Playwright in headed mode with stealth patches
  3. Realistic browsing patterns — visit the homepage first, navigate naturally, scroll
  4. Persistent browser profile with cookies from previous visits
  5. Random delays between all actions
  6. CAPTCHA solver as a fallback

This is expensive ($10-20+ per GB in proxy costs plus compute) and slow, but it defeats virtually all current anti-bot systems. Reserve it for high-value targets where the data justifies the cost.
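The "realistic browsing patterns" and "random delays" steps above can be approximated with a small pacing helper: most gaps between actions are short, with an occasional longer pause as if the user is reading. A rough sketch with made-up timing constants you should tune per target:

```python
import random

def human_pause(short=(0.5, 2.5), long=(6.0, 15.0), long_chance=0.15):
    """Return a delay in seconds mimicking human gaps between actions.

    Mostly short pauses between clicks and scrolls, with an occasional
    longer 'reading' pause. The ranges here are illustrative defaults.
    """
    if random.random() < long_chance:
        return random.uniform(*long)
    return random.uniform(*short)

# Usage inside a browser automation script:
#   await page.mouse.wheel(0, 400)
#   await asyncio.sleep(human_pause())
```

Sleeping for `human_pause()` after every navigation, scroll, and click produces timing distributions far closer to real traffic than a flat `sleep(3)` loop.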

Recommended Proxy Provider

For reliable residential proxies that won't break the budget, ThorData is worth evaluating. They offer residential proxy bandwidth at competitive per-GB pricing with wide geographic coverage — ideal for scraping targets that block datacenter IPs like Cloudflare-protected sites, Crunchbase, or LinkedIn.

Summary: Anti-Blocking Checklist

  • [ ] Rotate IPs with residential proxies for protected sites
  • [ ] Add random delays (2-8s) between requests
  • [ ] Rotate User-Agent strings and headers
  • [ ] Use stealth browser plugins when JavaScript rendering is needed
  • [ ] Maintain sessions and cookies across requests
  • [ ] Respect rate limits and robots.txt
  • [ ] Use CAPTCHA solvers only as a last resort
  • [ ] Consider managed platforms (Apify) to skip the infrastructure work
  • [ ] Monitor your success rate and adapt when detection changes

The key insight is that anti-blocking is not a single technique — it's a layered approach that matches the layered detection systems you're facing. Start with the simplest solution (proxy rotation + delays) and add complexity only when needed.
