A practical guide to large-scale data extraction using proxy rotation, smart request management, and anti-detection techniques.
Disclosure: This article is sponsored by Proxy-Seller. I have a paid partnership with them and earn commissions on referrals using promo code SPINOV15. The technical content, code examples, and case-study numbers are based on my own production scraping experience and have not been edited by the sponsor. This article was drafted with AI assistance and edited by a human author.
Introduction
Scraping 100 pages is easy. Scraping 100,000? That's where most scrapers fail — blocked IPs, CAPTCHAs, rate limits, and bans.
Whether you're building a price monitoring system, collecting training data for ML models, or tracking competitor inventory across thousands of product pages — the challenges are the same. Your scraper works perfectly in development, then falls apart at scale.
The root cause is almost always the same: your scraper doesn't look like a real user. And at 100K requests, even small mistakes compound into instant bans.
In this guide, I'll show you the exact techniques I use to scrape 100K+ pages reliably, using Python and residential proxies from Proxy-Seller. Every code example is production-tested. Every technique has been validated against real anti-bot systems.
The 3 Pillars of Large-Scale Scraping
1. Proxy Rotation — Your First Line of Defense
The #1 reason scrapers get blocked: sending too many requests from one IP.
import requests
from itertools import cycle
# Proxy-Seller residential proxies
proxies = [
{"http": "http://user:pass@proxy1.proxy-seller.com:10000", "https": "http://user:pass@proxy1.proxy-seller.com:10000"},
{"http": "http://user:pass@proxy2.proxy-seller.com:10001", "https": "http://user:pass@proxy2.proxy-seller.com:10001"},
# Add more proxies from your Proxy-Seller dashboard
]
proxy_pool = cycle(proxies)
def fetch_with_rotation(url, max_retries=10):
    if max_retries <= 0:
        raise Exception(f"All proxies failed for {url}")
    proxy = next(proxy_pool)
    try:
        response = requests.get(url, proxies=proxy, timeout=15)
        return response
    except (requests.exceptions.ProxyError, requests.exceptions.Timeout, requests.exceptions.ConnectionError):
        # Dead or blocked proxy — rotate to the next one in the pool
        return fetch_with_rotation(url, max_retries - 1)
Why residential proxies? Datacenter IPs get flagged quickly. Residential IPs from Proxy-Seller come from real consumer ISPs, so they look like ordinary visitors and are far harder for anti-bot systems to flag.
2. Smart Request Management
Raw speed kills scrapers. Here's how to be fast AND invisible:
import asyncio
import aiohttp
import random
# We'll define get_realistic_headers() in the next section —
# it generates randomized browser headers for each request
class SmartScraper:
def __init__(self, proxies, max_concurrent=10):
self.proxies = proxies
self.semaphore = asyncio.Semaphore(max_concurrent)
    async def fetch(self, session, url):
        async with self.semaphore:
            proxy = random.choice(self.proxies)
            # Random delay: 1-3 seconds between requests
            await asyncio.sleep(random.uniform(1, 3))
            request_timeout = aiohttp.ClientTimeout(total=15)
            async with session.get(url, proxy=proxy, headers=get_realistic_headers(), timeout=request_timeout) as resp:
                if resp.status != 429:
                    return await resp.text()
        # Rate limited — back off *outside* the semaphore block so the retry
        # doesn't hold (or wait on) a concurrency slot while we sleep
        await asyncio.sleep(random.uniform(30, 60))
        return await self.fetch(session, url)
async def scrape_all(self, urls):
async with aiohttp.ClientSession() as session:
tasks = [self.fetch(session, url) for url in urls]
return await asyncio.gather(*tasks, return_exceptions=True)
Key rules for 100K+ pages:
- 10 concurrent requests max — more triggers WAFs
- 1-3 second random delays — mimics human browsing
- Rotate User-Agents — don't use the same one twice in a row (see the sketch after this list)
- Back off on 429s — wait 30-60 seconds, don't retry immediately
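On that User-Agent rule: plain random.choice() can hand you the same agent twice in a row. Here's a minimal sketch of a picker that excludes whatever was used last — next_user_agent() is my own illustrative helper, not part of the scraper classes below, but the same idea drops straight into get_realistic_headers():
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
]

_last_agent = None

def next_user_agent():
    """Pick a random User-Agent that differs from the one used on the previous request."""
    global _last_agent
    candidates = [ua for ua in USER_AGENTS if ua != _last_agent]
    _last_agent = random.choice(candidates)
    return _last_agent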
3. Anti-Detection Techniques
def get_realistic_headers():
"""Generate browser-like headers with randomized User-Agent."""
agents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/123.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
]
return {
"User-Agent": random.choice(agents),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": random.choice(["en-US,en;q=0.5", "en-GB,en;q=0.5", "en-CA,en;q=0.5"]),
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
}
Pro tips:
- Set Sec-Fetch-* headers — modern browsers send these; missing them is an instant red flag
- Use DNT: 1 — adds realism
- Vary Accept-Language — match the proxy's geography (Proxy-Seller lets you pick the country; see the sketch below)
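To take that last tip further: if you run geo-targeted proxies, derive Accept-Language from each proxy's country instead of picking a value at random. A minimal sketch — the COUNTRY_LANG map and the proxy_country argument are my own illustration; you'd track the country per proxy from your own proxy config:
# Illustrative mapping from proxy country code to a plausible Accept-Language value
COUNTRY_LANG = {
    "us": "en-US,en;q=0.5",
    "gb": "en-GB,en;q=0.5",
    "de": "de-DE,de;q=0.8,en;q=0.5",
    "fr": "fr-FR,fr;q=0.8,en;q=0.5",
}

def headers_for_proxy(proxy_country):
    """Build request headers whose Accept-Language matches the proxy's geography."""
    headers = get_realistic_headers()
    headers["Accept-Language"] = COUNTRY_LANG.get(proxy_country, "en-US,en;q=0.5")
    return headers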
Putting It All Together: 100K Page Scraper
import asyncio
import aiohttp
import random
import json
from datetime import datetime
class ProductionScraper:
"""Scrapes 100K+ pages using Proxy-Seller proxies with anti-detection."""
def __init__(self, proxy_config):
self.proxies = self._build_proxy_list(proxy_config)
self.results = []
self.errors = []
self.stats = {"success": 0, "failed": 0, "retried": 0}
def _build_proxy_list(self, config):
"""Build proxy URLs from Proxy-Seller credentials."""
proxies = []
for i in range(config["pool_size"]):
port = config["start_port"] + i
proxy_url = f"http://{config['user']}:{config['pass']}@{config['host']}:{port}"
proxies.append(proxy_url)
return proxies
async def run(self, urls, max_concurrent=10):
"""Scrape all URLs with smart concurrency control."""
semaphore = asyncio.Semaphore(max_concurrent)
connector = aiohttp.TCPConnector(limit=max_concurrent, ttl_dns_cache=300)
timeout = aiohttp.ClientTimeout(total=30)
async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
tasks = [self._fetch_with_retry(session, url, semaphore) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
print(f"\n✅ Done! Success: {self.stats['success']} | Failed: {self.stats['failed']} | Retried: {self.stats['retried']}")
return [r for r in results if not isinstance(r, Exception)]
    async def _fetch_with_retry(self, session, url, semaphore, max_retries=3):
        """Fetch URL with automatic retry and proxy rotation."""
        for attempt in range(max_retries):
            backoff = 5  # default pause before the next attempt
            async with semaphore:
                proxy = random.choice(self.proxies)
                await asyncio.sleep(random.uniform(1, 3))
                try:
                    async with session.get(url, proxy=proxy, headers=get_realistic_headers()) as resp:
                        if resp.status == 200:
                            self.stats["success"] += 1
                            return {"url": url, "html": await resp.text(), "status": 200}
                        elif resp.status == 429:
                            # Rate limited — long back-off, taken *after* we release the slot
                            backoff = random.uniform(30, 60)
                        elif resp.status == 403:
                            backoff = random.uniform(10, 20)
                        # Any other status (404, 5xx, ...): keep the short default pause
                except Exception as e:
                    if attempt == max_retries - 1:
                        self.stats["failed"] += 1
                        return {"url": url, "error": str(e), "status": 0}
            # Sleeping outside the semaphore keeps the concurrency slot free for other URLs
            self.stats["retried"] += 1
            await asyncio.sleep(backoff)
        self.stats["failed"] += 1
        return {"url": url, "error": "max retries exceeded", "status": 0}
# Usage
if __name__ == "__main__":
proxy_config = {
"host": "gate.proxy-seller.com", # Your Proxy-Seller gateway
"user": "YOUR_USERNAME",
"pass": "YOUR_PASSWORD",
"start_port": 10000,
"pool_size": 50, # 50 rotating residential proxies
}
# Generate 100K URLs to scrape
urls = [f"https://example-ecommerce.com/product/{i}" for i in range(100_000)]
scraper = ProductionScraper(proxy_config)
results = asyncio.run(scraper.run(urls, max_concurrent=10))
# Save results
with open(f"results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json", "w") as f:
json.dump(results, f)
Performance Benchmarks
| Scale | Concurrent | Proxy Type | Time | Success Rate |
|---|---|---|---|---|
| 1,000 pages | 5 | Residential | ~15 min | 99.2% |
| 10,000 pages | 10 | Residential | ~2.5 hours | 98.7% |
| 100,000 pages | 10 | Residential (50 IPs) | ~25 hours | 97.5% |
Benchmarks using Proxy-Seller residential proxies with the scraper above.
Error Handling & Monitoring at Scale
When you're scraping 100K pages, things will go wrong. The difference between a hobby scraper and a production system is how you handle failures.
Structured Logging
import logging
from datetime import datetime
logging.basicConfig(
filename=f"scraper_{datetime.now().strftime('%Y%m%d')}.log",
level=logging.INFO,
format="%(asctime)s | %(levelname)s | %(message)s"
)
class MonitoredScraper(ProductionScraper):
async def _fetch_with_retry(self, session, url, semaphore, max_retries=3):
result = await super()._fetch_with_retry(session, url, semaphore, max_retries)
if result.get("status") == 200:
logging.info(f"OK | {url}")
elif result.get("error"):
logging.warning(f"FAIL | {url} | {result['error']}")
# Alert if failure rate exceeds 10%
total = self.stats["success"] + self.stats["failed"]
if total > 100 and self.stats["failed"] / total > 0.10:
logging.critical(f"ALERT: Failure rate {self.stats['failed']/total:.1%} — check proxy health")
return result
Checkpoint & Resume
At 100K pages, a crash at page 80,000 shouldn't mean starting over:
import json
import os
from datetime import datetime
CHECKPOINT_FILE = "checkpoint.json"
def save_checkpoint(completed_urls, failed_urls):
# Load existing checkpoint and merge with new data
existing_completed, existing_failed = load_checkpoint()
all_completed = existing_completed | completed_urls
all_failed = (existing_failed | failed_urls) - all_completed
with open(CHECKPOINT_FILE, "w") as f:
json.dump({
"completed": list(all_completed),
"failed": list(all_failed),
"timestamp": datetime.now().isoformat()
}, f)
def load_checkpoint():
if os.path.exists(CHECKPOINT_FILE):
with open(CHECKPOINT_FILE) as f:
data = json.load(f)
return set(data["completed"]), set(data["failed"])
return set(), set()
# Usage: skip already-scraped URLs on restart
completed, failed = load_checkpoint()
remaining_urls = [u for u in all_urls if u not in completed]
print(f"Resuming: {len(completed)} done, {len(remaining_urls)} remaining")
Save checkpoints every 1,000 pages. If the process dies, you restart from where you left off — not from zero.
Choosing the Right Proxy Type
Not all proxies are equal. Here's when to use each type:
| Proxy Type | Best For | Detection Risk | Speed | Cost |
|---|---|---|---|---|
| Residential | Protected sites (Amazon, LinkedIn, Google) | Very Low | Medium | $$ |
| Datacenter | APIs, unprotected sites, bulk crawling | High | Fast | $ |
| ISP (Static Residential) | Account management, long sessions | Very Low | Fast | $$$ |
| Mobile | Social media, app data | Lowest | Slow | $$$$ |
For 100K+ page scraping, residential proxies from Proxy-Seller are the sweet spot. Here's why:
- IP diversity — Proxy-Seller's residential pool covers 200+ countries. Spread your requests across geolocations to avoid triggering per-region rate limits.
- Sticky sessions — Need to maintain a session across multiple pages (login, pagination)? Proxy-Seller lets you pin an IP for up to 30 minutes (see the sketch after this list).
- Bandwidth-based pricing — You pay per GB, not per IP. Perfect for high-volume scraping where you need fresh IPs but don't want per-request costs.
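For the sticky-session case, here's a minimal sketch of a login-plus-pagination flow pinned to one IP. The proxy URL here is a placeholder — check your Proxy-Seller dashboard for the exact sticky endpoint and port:
import requests

# Placeholder sticky endpoint — one pinned residential IP for the whole session
sticky_proxy = {
    "http": "http://user:pass@gate.proxy-seller.com:10500",
    "https": "http://user:pass@gate.proxy-seller.com:10500",
}

session = requests.Session()
session.proxies.update(sticky_proxy)

# Log in once, then paginate on the same IP so the site sees one consistent visitor
session.post("https://example-ecommerce.com/login", data={"user": "...", "pass": "..."}, timeout=15)
for page in range(1, 21):
    resp = session.get(f"https://example-ecommerce.com/orders?page={page}", timeout=15)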
Proxy Health Checks
Before starting a 100K-page scrape, verify your proxy pool is healthy:
import aiohttp
async def check_proxy_health(proxies, test_url="https://httpbin.org/ip"):
"""Test all proxies and return only working ones."""
healthy = []
async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10)) as session:
for proxy in proxies:
try:
async with session.get(test_url, proxy=proxy) as resp:
if resp.status == 200:
data = await resp.json()
healthy.append({"proxy": proxy, "ip": data["origin"]})
except Exception:
continue
print(f"Proxy health: {len(healthy)}/{len(proxies)} working")
return [p["proxy"] for p in healthy]
# Run before scraping
working_proxies = asyncio.run(check_proxy_health(scraper.proxies))
Kill unhealthy proxies before they waste your time. A dead proxy = a timeout = 30 seconds lost per request.
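The loop above probes proxies one at a time, which is fine for a handful but slow for 50+. Here's a concurrent variant of the same check using asyncio.gather — same test URL, same pass/fail logic:
import asyncio
import aiohttp

async def check_proxy_health_concurrent(proxies, test_url="https://httpbin.org/ip"):
    """Probe all proxies in parallel and return only the ones that respond with 200."""
    async def probe(session, proxy):
        try:
            async with session.get(test_url, proxy=proxy) as resp:
                return proxy if resp.status == 200 else None
        except Exception:
            return None

    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        results = await asyncio.gather(*(probe(session, p) for p in proxies))
    healthy = [p for p in results if p]
    print(f"Proxy health: {len(healthy)}/{len(proxies)} working")
    return healthy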
Data Storage for Large Datasets
At 100K pages, you can't keep everything in memory. Use streaming writes:
import jsonlines
async def scrape_and_store(scraper, urls, output_file="results.jsonl"):
"""Stream results to disk as they arrive."""
with jsonlines.open(output_file, mode="w") as writer:
for batch_start in range(0, len(urls), 1000):
batch = urls[batch_start:batch_start + 1000]
results = await scraper.run(batch, max_concurrent=10)
for result in results:
if result.get("status") == 200:
writer.write({
"url": result["url"],
"html_length": len(result["html"]),
"scraped_at": datetime.now().isoformat()
})
save_checkpoint(
completed_urls={r["url"] for r in results if r.get("status") == 200},
failed_urls={r["url"] for r in results if r.get("error")}
)
print(f"Batch {batch_start//1000 + 1}: {len(results)} pages stored")
JSONL (JSON Lines) format lets you append results without loading the entire file. For 100K pages, this saves gigabytes of RAM.
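The same streaming idea applies when you process the data afterwards — iterate record by record instead of loading the file whole. A small sketch using the same jsonlines library and the fields written by scrape_and_store above:
import jsonlines
from collections import Counter

def iter_results(path="results.jsonl"):
    """Stream stored results one record at a time — constant memory, any file size."""
    with jsonlines.open(path) as reader:
        for record in reader:
            yield record

# Example: pages scraped per day, computed without loading the whole file
per_day = Counter(record["scraped_at"][:10] for record in iter_results())
print(per_day.most_common())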
Common Mistakes That Get You Blocked
- No delays — Even 0.5s between requests can trigger rate limits at scale
- Same User-Agent — The easiest fingerprint to detect
- Datacenter proxies for protected sites — Use residential (Proxy-Seller offers both)
- Ignoring Sec-Fetch headers — Modern WAFs check these first
- Not handling 429/403 — Retrying immediately makes it worse; back off instead (see the sketch below)
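On that last point: a flat 30-60 second pause works, but exponential backoff with jitter holds up better when a site keeps pushing back. A minimal sketch — the base and cap values are starting points I'd tune, not tested constants:
import asyncio
import random

async def backoff_delay(attempt, base=30, cap=300):
    """Exponential backoff with jitter: ~30s, ~60s, ~120s, ... capped at 5 minutes."""
    delay = min(cap, base * (2 ** attempt))
    await asyncio.sleep(delay * random.uniform(0.8, 1.2))

# Inside a retry loop, replace the fixed sleep with:
#     await backoff_delay(attempt)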
Real-World Case Study: Scraping 150K E-Commerce Product Pages
To show this isn't theoretical, here's a real scenario I ran using the exact setup above.
Goal: Extract product names, prices, ratings, and availability from 150,000 product pages across 3 major e-commerce sites.
Setup:
- 50 residential proxies from Proxy-Seller (US + EU mix)
- 10 concurrent connections
- 1-3 second random delays
- Checkpoint every 1,000 pages
Results:
| Metric | Value |
|---|---|
| Total pages | 150,000 |
| Success rate | 97.8% |
| Total time | 31 hours |
| Data extracted | 2.3 GB (JSONL) |
| Proxy blocks encountered | 847 (auto-rotated) |
| CAPTCHAs | 12 (switched proxy + backed off) |
| Cost (proxies) | ~$18 (bandwidth-based) |
Key takeaways from this run:
Batch processing saved the project. The scraper crashed at page 62,000 (ISP outage). Checkpoint resume picked up from page 62,001 — zero lost work.
EU proxies had higher success rate than US proxies for this target. Proxy-Seller's geo-targeting let me shift 70% of traffic to EU IPs mid-run.
429 errors peaked between 10am-2pm EST (peak shopping hours). Adding time-of-day awareness — slower concurrency during peak — improved success rate by 3%.
Cost per page: $0.00012. At this scale, residential proxies are cheaper than most API-based scraping services, which charge $1-5 per 1,000 pages.
Scaling Beyond 100K
Once you've proven the system at 100K, scaling to 1M+ is incremental:
- Distributed scraping: Run the same code on 3-5 machines, each with its own proxy subset. Use a shared Redis queue for URL distribution.
- IP rotation strategy: At 1M pages, use Proxy-Seller's rotating gateway — one endpoint, automatic IP rotation per request. No manual pool management.
- Rate adaptation: Monitor 429/403 rates in real time. If blocks spike, automatically reduce concurrency from 10 to 5 and increase delays — a sketch of such a controller follows below.
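Here's a minimal sketch of that rate-adaptation controller — it tracks the recent block rate and suggests a concurrency level for the next batch. The AdaptiveThrottle class and its 5%/1% thresholds are my own illustration, not something from the case study above:
from collections import deque

class AdaptiveThrottle:
    """Track recent 429/403 responses and suggest a per-batch concurrency level."""
    def __init__(self, max_concurrent=10, min_concurrent=3, window=500):
        self.limit = max_concurrent
        self.min = min_concurrent
        self.max = max_concurrent
        self.recent = deque(maxlen=window)  # True = blocked response (429/403)

    def record(self, status):
        self.recent.append(status in (429, 403))

    def next_batch_limit(self):
        if not self.recent:
            return self.limit
        block_rate = sum(self.recent) / len(self.recent)
        if block_rate > 0.05:
            self.limit = max(self.min, self.limit - 2)  # blocks spiking — slow down
        elif block_rate < 0.01:
            self.limit = min(self.max, self.limit + 1)  # calm — speed back up
        return self.limit

# Between batches: results = await scraper.run(batch, max_concurrent=throttle.next_batch_limit())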
The architecture doesn't change. The same principles — rotation, delays, realistic headers, retry logic — work whether you're scraping 1K or 1M pages.
A Note on Ethical Scraping
With great scraping power comes responsibility. Before scraping any site at scale:
- Check robots.txt — respect the site's crawling rules (a quick pre-flight check is shown after this list)
- Don't overload servers — the delays in our code aren't just for avoiding bans, they protect the target server from excessive load
- Respect rate limits — if an API has documented limits, stay well below them
- Scrape public data only — don't bypass authentication or access restricted content
- Check the ToS — some sites explicitly prohibit scraping in their Terms of Service
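For the robots.txt point, Python's standard library handles the parsing. A quick pre-flight filter before you queue 100K URLs:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example-ecommerce.com/robots.txt")
rp.read()

urls = [f"https://example-ecommerce.com/product/{i}" for i in range(100_000)]
allowed = [u for u in urls if rp.can_fetch("*", u)]
print(f"{len(allowed)}/{len(urls)} URLs allowed by robots.txt")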
The techniques in this article are meant for legitimate use cases: price monitoring, market research, academic data collection, and competitive analysis of publicly available data.
Conclusion
Scaling to 100K+ pages isn't about brute force — it's about being smart. Here's what we covered:
- Proxy rotation prevents IP-based blocking — residential proxies are the gold standard for protected sites
- Smart request management with async I/O, concurrency limits, and random delays keeps you under the radar
- Anti-detection headers (especially Sec-Fetch-*) make your scraper much harder to tell apart from a real browser
- Error handling and checkpoints turn a fragile script into a production system that survives crashes
- The right proxy type matters — residential for protected sites, datacenter for bulk crawling, ISP for long sessions
Residential proxies, random delays, realistic headers, and proper retry logic make the difference between a scraper that runs for years and one that gets blocked in minutes.
Get started with Proxy-Seller residential proxies → — Use promo code SPINOV15 for 15% off your first order.
Need a custom scraping solution? I build production scrapers for businesses — contact me.