A practical guide to large-scale data extraction using proxy rotation, smart request management, and anti-detection techniques.
Disclosure: This article is sponsored by Proxy-Seller. I have a paid partnership with them and earn commissions on referrals using promo code SPINOV15. The technical content, code examples, and case-study numbers are based on my own production scraping experience and have not been edited by the sponsor. This article was drafted with AI assistance and edited by a human author.
Introduction
Scraping 100 pages is easy. Scraping 100,000? That's where most scrapers fail — blocked IPs, CAPTCHAs, rate limits, and bans.
Whether you're building a price monitoring system, collecting training data for ML models, or tracking competitor inventory across thousands of product pages — the challenges are the same. Your scraper works perfectly in development, then falls apart at scale.
The root cause is almost always the same: your scraper doesn't look like a real user. And at 100K requests, even small mistakes compound into instant bans.
In this guide, I'll show you the exact techniques I use to scrape 100K+ pages reliably, using Python and residential proxies from Proxy-Seller. Every code example is production-tested. Every technique has been validated against real anti-bot systems.
The 3 Pillars of Large-Scale Scraping
1. Proxy Rotation — Your First Line of Defense
The #1 reason scrapers get blocked: sending too many requests from one IP.
import requests
from itertools import cycle
# Proxy-Seller residential proxies
proxies = [
{"http": "http://user:pass@proxy1.proxy-seller.com:10000", "https": "http://user:pass@proxy1.proxy-seller.com:10000"},
{"http": "http://user:pass@proxy2.proxy-seller.com:10001", "https": "http://user:pass@proxy2.proxy-seller.com:10001"},
# Add more proxies from your Proxy-Seller dashboard
]
proxy_pool = cycle(proxies)
def fetch_with_rotation(url, max_retries=10):
    if max_retries <= 0:
        raise Exception(f"All proxies failed for {url}")
    proxy = next(proxy_pool)
    try:
        response = requests.get(url, proxies=proxy, timeout=15)
        return response
    except (requests.exceptions.ProxyError, requests.exceptions.Timeout, requests.exceptions.ConnectionError):
        # Dead or blocked proxy — rotate to the next one in the pool
        return fetch_with_rotation(url, max_retries - 1)
Why residential proxies? Datacenter IPs get flagged quickly. Residential IPs from Proxy-Seller come from real consumer ISPs, so they look like ordinary visitors and are far harder for anti-bot systems to flag.
2. Smart Request Management
Raw speed kills scrapers. Here's how to be fast AND invisible:
import asyncio
import aiohttp
import random
# We'll define get_realistic_headers() in the next section —
# it generates randomized browser headers for each request
class SmartScraper:
def __init__(self, proxies, max_concurrent=10):
self.proxies = proxies
self.semaphore = asyncio.Semaphore(max_concurrent)
    async def fetch(self, session, url):
        async with self.semaphore:
            proxy = random.choice(self.proxies)
            # Random delay: 1-3 seconds between requests
            await asyncio.sleep(random.uniform(1, 3))
            request_timeout = aiohttp.ClientTimeout(total=15)
            async with session.get(url, proxy=proxy, headers=get_realistic_headers(), timeout=request_timeout) as resp:
                if resp.status != 429:
                    return await resp.text()
        # Rate limited — back off *outside* the semaphore block so the retry
        # doesn't hold (or wait on) a concurrency slot while we sleep
        await asyncio.sleep(random.uniform(30, 60))
        return await self.fetch(session, url)
async def scrape_all(self, urls):
async with aiohttp.ClientSession() as session:
tasks = [self.fetch(session, url) for url in urls]
return await asyncio.gather(*tasks, return_exceptions=True)
Key rules for 100K+ pages:
- 10 concurrent requests max — more triggers WAFs
- 1-3 second random delays — mimics human browsing
- Rotate User-Agents — don't use the same one twice in a row (see the sketch after this list)
- Back off on 429s — wait 30-60 seconds, don't retry immediately
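On that User-Agent rule: plain random.choice() can hand you the same agent twice in a row. Here's a minimal sketch of a picker that excludes whatever was used last — next_user_agent() is my own illustrative helper, not part of the scraper classes below, but the same idea drops straight into get_realistic_headers():
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
]

_last_agent = None

def next_user_agent():
    """Pick a random User-Agent that differs from the one used on the previous request."""
    global _last_agent
    candidates = [ua for ua in USER_AGENTS if ua != _last_agent]
    _last_agent = random.choice(candidates)
    return _last_agent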
3. Anti-Detection Techniques
def get_realistic_headers():
"""Generate browser-like headers with randomized User-Agent."""
agents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/123.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
]
return {
"User-Agent": random.choice(agents),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": random.choice(["en-US,en;q=0.5", "en-GB,en;q=0.5", "en-CA,en;q=0.5"]),
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
}
Pro tips:
- Set Sec-Fetch-* headers — modern browsers send these; missing them is an instant red flag
- Use DNT: 1 — adds realism
- Vary Accept-Language — match the proxy's geography (Proxy-Seller lets you pick the country; see the sketch below)
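To take that last tip further: if you run geo-targeted proxies, derive Accept-Language from each proxy's country instead of picking a value at random. A minimal sketch — the COUNTRY_LANG map and the proxy_country argument are my own illustration; you'd track the country per proxy from your own proxy config:
# Illustrative mapping from proxy country code to a plausible Accept-Language value
COUNTRY_LANG = {
    "us": "en-US,en;q=0.5",
    "gb": "en-GB,en;q=0.5",
    "de": "de-DE,de;q=0.8,en;q=0.5",
    "fr": "fr-FR,fr;q=0.8,en;q=0.5",
}

def headers_for_proxy(proxy_country):
    """Build request headers whose Accept-Language matches the proxy's geography."""
    headers = get_realistic_headers()
    headers["Accept-Language"] = COUNTRY_LANG.get(proxy_country, "en-US,en;q=0.5")
    return headers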
Putting It All Together: 100K Page Scraper
import asyncio
import aiohttp
import random
import json
from datetime import datetime
class ProductionScraper:
"""Scrapes 100K+ pages using Proxy-Seller proxies with anti-detection."""
def __init__(self, proxy_config):
self.proxies = self._build_proxy_list(proxy_config)
self.results = []
self.errors = []
self.stats = {"success": 0, "failed": 0, "retried": 0}
def _build_proxy_list(self, config):
"""Build proxy URLs from Proxy-Seller credentials."""
proxies = []
for i in range(config["pool_size"]):
port = config["start_port"] + i
proxy_url = f"http://{config['user']}:{config['pass']}@{config['host']}:{port}"
proxies.append(proxy_url)
return proxies
async def run(self, urls, max_concurrent=10):
"""Scrape all URLs with smart concurrency control."""
semaphore = asyncio.Semaphore(max_concurrent)
connector = aiohttp.TCPConnector(limit=max_concurrent, ttl_dns_cache=300)
timeout = aiohttp.ClientTimeout(total=30)
async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
tasks = [self._fetch_with_retry(session, url, semaphore) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
print(f"\n✅ Done! Success: {self.stats['success']} | Failed: {self.stats['failed']} | Retried: {self.stats['retried']}")
return [r for r in results if not isinstance(r, Exception)]
    async def _fetch_with_retry(self, session, url, semaphore, max_retries=3):
        """Fetch URL with automatic retry and proxy rotation."""
        for attempt in range(max_retries):
            backoff = 5  # default pause before the next attempt
            async with semaphore:
                proxy = random.choice(self.proxies)
                await asyncio.sleep(random.uniform(1, 3))
                try:
                    async with session.get(url, proxy=proxy, headers=get_realistic_headers()) as resp:
                        if resp.status == 200:
                            self.stats["success"] += 1
                            return {"url": url, "html": await resp.text(), "status": 200}
                        elif resp.status == 429:
                            # Rate limited — long back-off, taken *after* we release the slot
                            backoff = random.uniform(30, 60)
                        elif resp.status == 403:
                            backoff = random.uniform(10, 20)
                        # Any other status (404, 5xx, ...): keep the short default pause
                except Exception as e:
                    if attempt == max_retries - 1:
                        self.stats["failed"] += 1
                        return {"url": url, "error": str(e), "status": 0}
            # Sleeping outside the semaphore keeps the concurrency slot free for other URLs
            self.stats["retried"] += 1
            await asyncio.sleep(backoff)
        self.stats["failed"] += 1
        return {"url": url, "error": "max retries exceeded", "status": 0}
# Usage
if __name__ == "__main__":
proxy_config = {
"host": "gate.proxy-seller.com", # Your Proxy-Seller gateway
"user": "YOUR_USERNAME",
"pass": "YOUR_PASSWORD",
"start_port": 10000,
"pool_size": 50, # 50 rotating residential proxies
}
# Generate 100K URLs to scrape
urls = [f"https://example-ecommerce.com/product/{i}" for i in range(100_000)]
scraper = ProductionScraper(proxy_config)
results = asyncio.run(scraper.run(urls, max_concurrent=10))
# Save results
with open(f"results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json", "w") as f:
json.dump(results, f)
Performance Benchmarks
| Scale | Concurrent | Proxy Type | Time | Success Rate |
|---|---|---|---|---|
| 1,000 pages | 5 | Residential | ~15 min | 99.2% |
| 10,000 pages | 10 | Residential | ~2.5 hours | 98.7% |
| 100,000 pages | 10 | Residential (50 IPs) | ~25 hours | 97.5% |
Benchmarks using Proxy-Seller residential proxies with the scraper above.
Error Handling & Monitoring at Scale
When you're scraping 100K pages, things will go wrong. The difference between a hobby scraper and a production system is how you handle failures.
Structured Logging
import logging
from datetime import datetime
logging.basicConfig(
filename=f"scraper_{datetime.now().strftime('%Y%m%d')}.log",
level=logging.INFO,
format="%(asctime)s | %(levelname)s | %(message)s"
)
class MonitoredScraper(ProductionScraper):
async def _fetch_with_retry(self, session, url, semaphore, max_retries=3):
result = await super()._fetch_with_retry(session, url, semaphore, max_retries)
if result.get("status") == 200:
logging.info(f"OK | {url}")
elif result.get("error"):
logging.warning(f"FAIL | {url} | {result['error']}")
# Alert if failure rate exceeds 10%
total = self.stats["success"] + self.stats["failed"]
if total > 100 and self.stats["failed"] / total > 0.10:
logging.critical(f"ALERT: Failure rate {self.stats['failed']/total:.1%} — check proxy health")
return result
Checkpoint & Resume
At 100K pages, a crash at page 80,000 shouldn't mean starting over:
import json
import os
from datetime import datetime
CHECKPOINT_FILE = "checkpoint.json"
def save_checkpoint(completed_urls, failed_urls):
# Load existing checkpoint and merge with new data
existing_completed, existing_failed = load_checkpoint()
all_completed = existing_completed | completed_urls
all_failed = (existing_failed | failed_urls) - all_completed
with open(CHECKPOINT_FILE, "w") as f:
json.dump({
"completed": list(all_completed),
"failed": list(all_failed),
"timestamp": datetime.now().isoformat()
}, f)
def load_checkpoint():
if os.path.exists(CHECKPOINT_FILE):
with open(CHECKPOINT_FILE) as f:
data = json.load(f)
return set(data["completed"]), set(data["failed"])
return set(), set()
# Usage: skip already-scraped URLs on restart
completed, failed = load_checkpoint()
remaining_urls = [u for u in all_urls if u not in completed]
print(f"Resuming: {len(completed)} done, {len(remaining_urls)} remaining")
Save checkpoints every 1,000 pages. If the process dies, you restart from where you left off — not from zero.
Choosing the Right Proxy Type
Not all proxies are equal. Here's when to use each type:
| Proxy Type | Best For | Detection Risk | Speed | Cost |
|---|---|---|---|---|
| Residential | Protected sites (Amazon, LinkedIn, Google) | Very Low | Medium | $$ |
| Datacenter | APIs, unprotected sites, bulk crawling | High | Fast | $ |
| ISP (Static Residential) | Account management, long sessions | Very Low | Fast | $$$ |
| Mobile | Social media, app data | Lowest | Slow | $$$$ |
For 100K+ page scraping, residential proxies from Proxy-Seller are the sweet spot. Here's why:
- IP diversity — Proxy-Seller's residential pool covers 200+ countries. Spread your requests across geolocations to avoid triggering per-region rate limits.
- Sticky sessions — Need to maintain a session across multiple pages (login, pagination)? Proxy-Seller lets you pin an IP for up to 30 minutes (see the sketch after this list).
- Bandwidth-based pricing — You pay per GB, not per IP. Perfect for high-volume scraping where you need fresh IPs but don't want per-request costs.
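For the sticky-session case, here's a minimal sketch of a login-plus-pagination flow pinned to one IP. The proxy URL here is a placeholder — check your Proxy-Seller dashboard for the exact sticky endpoint and port:
import requests

# Placeholder sticky endpoint — one pinned residential IP for the whole session
sticky_proxy = {
    "http": "http://user:pass@gate.proxy-seller.com:10500",
    "https": "http://user:pass@gate.proxy-seller.com:10500",
}

session = requests.Session()
session.proxies.update(sticky_proxy)

# Log in once, then paginate on the same IP so the site sees one consistent visitor
session.post("https://example-ecommerce.com/login", data={"user": "...", "pass": "..."}, timeout=15)
for page in range(1, 21):
    resp = session.get(f"https://example-ecommerce.com/orders?page={page}", timeout=15)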
Proxy Health Checks
Before starting a 100K-page scrape, verify your proxy pool is healthy:
import aiohttp
async def check_proxy_health(proxies, test_url="https://httpbin.org/ip"):
"""Test all proxies and return only working ones."""
healthy = []
async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10)) as session:
for proxy in proxies:
try:
async with session.get(test_url, proxy=proxy) as resp:
if resp.status == 200:
data = await resp.json()
healthy.append({"proxy": proxy, "ip": data["origin"]})
except Exception:
continue
print(f"Proxy health: {len(healthy)}/{len(proxies)} working")
return [p["proxy"] for p in healthy]
# Run before scraping
working_proxies = asyncio.run(check_proxy_health(scraper.proxies))
Kill unhealthy proxies before they waste your time. A dead proxy = a timeout = 30 seconds lost per request.
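The loop above probes proxies one at a time, which is fine for a handful but slow for 50+. Here's a concurrent variant of the same check using asyncio.gather — same test URL, same pass/fail logic:
import asyncio
import aiohttp

async def check_proxy_health_concurrent(proxies, test_url="https://httpbin.org/ip"):
    """Probe all proxies in parallel and return only the ones that respond with 200."""
    async def probe(session, proxy):
        try:
            async with session.get(test_url, proxy=proxy) as resp:
                return proxy if resp.status == 200 else None
        except Exception:
            return None

    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        results = await asyncio.gather(*(probe(session, p) for p in proxies))
    healthy = [p for p in results if p]
    print(f"Proxy health: {len(healthy)}/{len(proxies)} working")
    return healthy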
Data Storage for Large Datasets
At 100K pages, you can't keep everything in memory. Use streaming writes:
import jsonlines
async def scrape_and_store(scraper, urls, output_file="results.jsonl"):
"""Stream results to disk as they arrive."""
with jsonlines.open(output_file, mode="w") as writer:
for batch_start in range(0, len(urls), 1000):
batch = urls[batch_start:batch_start + 1000]
results = await scraper.run(batch, max_concurrent=10)
for result in results:
if result.get("status") == 200:
writer.write({
"url": result["url"],
"html_length": len(result["html"]),
"scraped_at": datetime.now().isoformat()
})
save_checkpoint(
completed_urls={r["url"] for r in results if r.get("status") == 200},
failed_urls={r["url"] for r in results if r.get("error")}
)
print(f"Batch {batch_start//1000 + 1}: {len(results)} pages stored")
JSONL (JSON Lines) format lets you append results without loading the entire file. For 100K pages, this saves gigabytes of RAM.
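The same streaming idea applies when you process the data afterwards — iterate record by record instead of loading the file whole. A small sketch using the same jsonlines library and the fields written by scrape_and_store above:
import jsonlines
from collections import Counter

def iter_results(path="results.jsonl"):
    """Stream stored results one record at a time — constant memory, any file size."""
    with jsonlines.open(path) as reader:
        for record in reader:
            yield record

# Example: pages scraped per day, computed without loading the whole file
per_day = Counter(record["scraped_at"][:10] for record in iter_results())
print(per_day.most_common())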
Common Mistakes That Get You Blocked
- No delays — Even 0.5s between requests can trigger rate limits at scale
- Same User-Agent — The easiest fingerprint to detect
- Datacenter proxies for protected sites — Use residential (Proxy-Seller offers both)
- Ignoring Sec-Fetch headers — Modern WAFs check these first
- Not handling 429/403 — Retrying immediately makes it worse; back off instead (see the sketch below)
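On that last point: a flat 30-60 second pause works, but exponential backoff with jitter holds up better when a site keeps pushing back. A minimal sketch — the base and cap values are starting points I'd tune, not tested constants:
import asyncio
import random

async def backoff_delay(attempt, base=30, cap=300):
    """Exponential backoff with jitter: ~30s, ~60s, ~120s, ... capped at 5 minutes."""
    delay = min(cap, base * (2 ** attempt))
    await asyncio.sleep(delay * random.uniform(0.8, 1.2))

# Inside a retry loop, replace the fixed sleep with:
#     await backoff_delay(attempt)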
Real-World Case Study: Scraping 150K E-Commerce Product Pages
To show this isn't theoretical, here's a real scenario I ran using the exact setup above.
Goal: Extract product names, prices, ratings, and availability from 150,000 product pages across 3 major e-commerce sites.
Setup:
- 50 residential proxies from Proxy-Seller (US + EU mix)
- 10 concurrent connections
- 1-3 second random delays
- Checkpoint every 1,000 pages
Results:
| Metric | Value |
|---|---|
| Total pages | 150,000 |
| Success rate | 97.8% |
| Total time | 31 hours |
| Data extracted | 2.3 GB (JSONL) |
| Proxy blocks encountered | 847 (auto-rotated) |
| CAPTCHAs | 12 (switched proxy + backed off) |
| Cost (proxies) | ~$18 (bandwidth-based) |
Key takeaways from this run:
Batch processing saved the project. The scraper crashed at page 62,000 (ISP outage). Checkpoint resume picked up from page 62,001 — zero lost work.
EU proxies had higher success rate than US proxies for this target. Proxy-Seller's geo-targeting let me shift 70% of traffic to EU IPs mid-run.
429 errors peaked between 10am-2pm EST (peak shopping hours). Adding time-of-day awareness — slower concurrency during peak — improved success rate by 3%.
Cost per page: $0.00012. At this scale, residential proxies are cheaper than most API-based scraping services, which charge $1-5 per 1,000 pages.
Scaling Beyond 100K
Once you've proven the system at 100K, scaling to 1M+ is incremental:
- Distributed scraping: Run the same code on 3-5 machines, each with its own proxy subset. Use a shared Redis queue for URL distribution.
- IP rotation strategy: At 1M pages, use Proxy-Seller's rotating gateway — one endpoint, automatic IP rotation per request. No manual pool management.
- Rate adaptation: Monitor 429/403 rates in real time. If blocks spike, automatically reduce concurrency from 10 to 5 and increase delays — a sketch of such a controller follows below.
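Here's a minimal sketch of that rate-adaptation controller — it tracks the recent block rate and suggests a concurrency level for the next batch. The AdaptiveThrottle class and its 5%/1% thresholds are my own illustration, not something from the case study above:
from collections import deque

class AdaptiveThrottle:
    """Track recent 429/403 responses and suggest a per-batch concurrency level."""
    def __init__(self, max_concurrent=10, min_concurrent=3, window=500):
        self.limit = max_concurrent
        self.min = min_concurrent
        self.max = max_concurrent
        self.recent = deque(maxlen=window)  # True = blocked response (429/403)

    def record(self, status):
        self.recent.append(status in (429, 403))

    def next_batch_limit(self):
        if not self.recent:
            return self.limit
        block_rate = sum(self.recent) / len(self.recent)
        if block_rate > 0.05:
            self.limit = max(self.min, self.limit - 2)  # blocks spiking — slow down
        elif block_rate < 0.01:
            self.limit = min(self.max, self.limit + 1)  # calm — speed back up
        return self.limit

# Between batches: results = await scraper.run(batch, max_concurrent=throttle.next_batch_limit())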
The architecture doesn't change. The same principles — rotation, delays, realistic headers, retry logic — work whether you're scraping 1K or 1M pages.
A Note on Ethical Scraping
With great scraping power comes responsibility. Before scraping any site at scale:
- Check robots.txt — respect the site's crawling rules (a quick pre-flight check is shown after this list)
- Don't overload servers — the delays in our code aren't just for avoiding bans, they protect the target server from excessive load
- Respect rate limits — if an API has documented limits, stay well below them
- Scrape public data only — don't bypass authentication or access restricted content
- Check the ToS — some sites explicitly prohibit scraping in their Terms of Service
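For the robots.txt point, Python's standard library handles the parsing. A quick pre-flight filter before you queue 100K URLs:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example-ecommerce.com/robots.txt")
rp.read()

urls = [f"https://example-ecommerce.com/product/{i}" for i in range(100_000)]
allowed = [u for u in urls if rp.can_fetch("*", u)]
print(f"{len(allowed)}/{len(urls)} URLs allowed by robots.txt")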
The techniques in this article are meant for legitimate use cases: price monitoring, market research, academic data collection, and competitive analysis of publicly available data.
Conclusion
Scaling to 100K+ pages isn't about brute force — it's about being smart. Here's what we covered:
- Proxy rotation prevents IP-based blocking — residential proxies are the gold standard for protected sites
- Smart request management with async I/O, concurrency limits, and random delays keeps you under the radar
- Anti-detection headers (especially Sec-Fetch-*) make your scraper much harder to tell apart from a real browser
- Error handling and checkpoints turn a fragile script into a production system that survives crashes
- The right proxy type matters — residential for protected sites, datacenter for bulk crawling, ISP for long sessions
Residential proxies, random delays, realistic headers, and proper retry logic make the difference between a scraper that runs for years and one that gets blocked in minutes.
Get started with Proxy-Seller residential proxies → — Use promo code SPINOV15 for 15% off your first order.
Need a custom scraping solution? I build production scrapers for businesses — contact me.