Last month, a marketing agency asked me to scrape 50,000 product listings daily. Their budget? Zero for infrastructure.
I thought they were joking. They weren't.
Here's how I built a production pipeline that runs for free, handles failures gracefully, and has been running for 30 days straight without intervention.
The Problem
Most scraping tutorials show you requests + BeautifulSoup on a single page. That's like teaching someone to cook by boiling water.
Real scraping at scale means:
- Rate limiting (or getting IP-banned in 30 seconds)
- Retry logic (sites go down, connections drop)
- Data validation (garbage in = garbage out)
- Storage (50K items/day × 30 days = you need a plan)
- Monitoring (how do you know it's still working at 3 AM?)
The Architecture
```
[Scheduler] → [Queue]  →  [Workers]  →  [Validator] → [Storage]
     ↓           ↓            ↓              ↓             ↓
    Cron     Redis/File   Async Pool    JSON Schema   SQLite + S3
     ↓           ↓            ↓              ↓             ↓
    Free        Free       Free tier       Free          Free
```
Layer 1: Smart Scheduling
```python
class AdaptiveScheduler:
    def __init__(self, base_interval=60):
        self.base_interval = base_interval
        self.success_streak = 0
        self.fail_streak = 0

    @property
    def interval(self):
        # Speed up when things are working
        if self.success_streak > 10:
            return self.base_interval * 0.5
        # Slow down on failures (exponential backoff, capped at 1 hour)
        if self.fail_streak > 0:
            return min(self.base_interval * (2 ** self.fail_streak), 3600)
        return self.base_interval

    def record_success(self):
        self.success_streak += 1
        self.fail_streak = 0

    def record_failure(self):
        self.fail_streak += 1
        self.success_streak = 0
```
This alone saved the project. Instead of fixed intervals, the scraper adapts. Fast when the site is responsive, slow when it's struggling.
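To make the backoff concrete, here is the same interval logic as a standalone function. `next_interval` is a hypothetical mirror of the `interval` property, written as a pure function so the curve is easy to check:

```python
def next_interval(base, success_streak, fail_streak, cap=3600):
    """Pure-function mirror of AdaptiveScheduler.interval."""
    if success_streak > 10:
        return base * 0.5          # fast lane when things are working
    if fail_streak > 0:
        return min(base * (2 ** fail_streak), cap)  # exponential backoff
    return base

# With base_interval=60, each failure doubles the wait: 120, 240, 480, ...
assert next_interval(60, 0, 1) == 120
assert next_interval(60, 0, 6) == 3600  # 60 * 64 = 3840 hits the 1-hour cap
assert next_interval(60, 12, 0) == 30   # a long success streak halves the wait
```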
Layer 2: Async Worker Pool
```python
import asyncio

import aiohttp


class WorkerPool:
    def __init__(self, max_concurrent=5, rate_limit=2.0):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.rate_limit = rate_limit

    async def fetch(self, session, url, retries=3):
        async with self.semaphore:
            # Retry in a loop rather than recursing, so we never try to
            # re-acquire the semaphore we are already holding
            for _ in range(retries + 1):
                await asyncio.sleep(self.rate_limit)  # Be polite
                try:
                    timeout = aiohttp.ClientTimeout(total=30)
                    async with session.get(url, timeout=timeout) as resp:
                        if resp.status == 200:
                            return await resp.json()
                        if resp.status == 429:
                            await asyncio.sleep(60)  # Rate limited: back off, retry
                            continue
                        return {'error': f'HTTP {resp.status}', 'url': url}
                except Exception as e:
                    return {'error': str(e), 'url': url}
            return {'error': 'rate limited after retries', 'url': url}

    async def run(self, urls):
        async with aiohttp.ClientSession() as session:
            tasks = [self.fetch(session, url) for url in urls]
            return await asyncio.gather(*tasks)
```
Key insight: 5 concurrent workers with 2-second delays beat 100 workers hammering the server. You get sustainable throughput without getting banned.
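The throughput claim is easy to sanity-check with back-of-the-envelope arithmetic. The 0.5-second average response time below is an assumption, not a measurement:

```python
max_concurrent = 5
rate_limit = 2.0    # politeness sleep per request (seconds)
fetch_time = 0.5    # assumed average response time (hypothetical)

# Each worker completes one request every (rate_limit + fetch_time) seconds
per_worker = 1 / (rate_limit + fetch_time)
throughput = max_concurrent * per_worker   # requests per second
daily = throughput * 86_400                # requests per day

assert abs(throughput - 2.0) < 1e-9       # 2 req/s
assert abs(daily - 172_800) < 1e-6        # ~173K req/day, well over the 50K target
```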
Layer 3: Data Validation
This is where most pipelines fail. You scrape 50,000 items and 30% are garbage.
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    name: str
    price: float
    url: str
    category: Optional[str] = None

    def validate(self) -> bool:
        if not self.name or len(self.name) < 2:
            return False
        if self.price <= 0 or self.price > 1_000_000:
            return False
        if not self.url.startswith('http'):
            return False
        return True


def clean_pipeline(raw_items):
    valid, rejected = [], []
    for item in raw_items:
        try:
            product = Product(**item)
        except TypeError:  # missing or unexpected fields
            rejected.append(item)
            continue
        if product.validate():
            valid.append(product)
        else:
            rejected.append(item)
    if raw_items:  # guard against an empty batch
        rejection_rate = len(rejected) / len(raw_items) * 100
        if rejection_rate > 20:
            print(f'WARNING: {rejection_rate:.1f}% rejection rate, check selectors')
    return valid, rejected
```
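One subtle failure mode: if the site was down and every fetch errored, the batch can be empty, and a naive percentage calculation divides by zero. The same logic with an empty-batch guard, as a tiny illustrative helper (`rejection_rate` is not part of the pipeline above):

```python
def rejection_rate(valid_count, rejected_count):
    """Percentage of items rejected; safe on an empty batch."""
    total = valid_count + rejected_count
    return rejected_count / total * 100 if total else 0.0

assert rejection_rate(75, 25) == 25.0  # 25% rejected: selectors likely drifting
assert rejection_rate(3, 1) == 25.0    # small batches work the same way
assert rejection_rate(0, 0) == 0.0     # empty batch, no ZeroDivisionError
```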
Layer 4: Free Storage Strategy
```python
import json
import sqlite3
from datetime import datetime
from pathlib import Path


class StorageManager:
    def __init__(self, db_path='scraper.db'):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute('''
            CREATE TABLE IF NOT EXISTS products (
                id INTEGER PRIMARY KEY,
                name TEXT,
                price REAL,
                url TEXT UNIQUE,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        ''')

    def upsert(self, products):
        for p in products:
            self.conn.execute(
                'INSERT OR REPLACE INTO products (name, price, url) VALUES (?, ?, ?)',
                (p.name, p.price, p.url)
            )
        self.conn.commit()

    def export_daily(self):
        cursor = self.conn.execute(
            "SELECT * FROM products WHERE date(scraped_at) = date('now')"
        )
        rows = cursor.fetchall()
        path = Path(f'exports/{datetime.now():%Y-%m-%d}.json')
        path.parent.mkdir(exist_ok=True)
        path.write_text(json.dumps(rows, indent=2))
        return len(rows)
```
SQLite handles 50K inserts/day easily. Daily JSON exports go to a free cloud storage bucket.
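The `url UNIQUE` constraint is what makes `INSERT OR REPLACE` behave like an upsert: re-scraping a URL overwrites the old row instead of duplicating it. A self-contained check against an in-memory database (table and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL, url TEXT UNIQUE)'
)

sql = 'INSERT OR REPLACE INTO products (name, price, url) VALUES (?, ?, ?)'
conn.execute(sql, ('Widget', 9.99, 'https://example.com/widget'))
conn.execute(sql, ('Widget', 12.49, 'https://example.com/widget'))  # re-scrape, new price

rows = conn.execute('SELECT name, price, url FROM products').fetchall()
assert rows == [('Widget', 12.49, 'https://example.com/widget')]  # one row, updated
```

One caveat: `INSERT OR REPLACE` deletes and re-inserts, so the row gets a new `id`; if other tables reference that id, a true `ON CONFLICT ... DO UPDATE` upsert is safer.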
Results After 30 Days
| Metric | Value |
|---|---|
| Total items scraped | 1,247,000 |
| Uptime | 99.7% |
| Infrastructure cost | $0 |
| Average scrape time | 12 min/batch |
| Data quality (valid %) | 94.2% |
| IP bans | 0 |
The zero-ban rate is the achievement I'm most proud of. Polite scraping works.
What I'd Do Differently
- Start with validation — I added it after 3 days of garbage data
- Monitor from day 1 — silent failures are the worst kind
- Use rotating user agents — even polite scrapers need variety
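For the user-agent point, a minimal rotator is enough to start. The UA strings below are placeholders, not a curated list; maintain your own pool in production:

```python
import itertools

# Placeholder UA strings; swap in a maintained pool for real use
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]
_cycle = itertools.cycle(USER_AGENTS)

def next_headers():
    """Round-robin through the pool so consecutive requests differ."""
    return {'User-Agent': next(_cycle)}

# Consecutive calls hand back different user agents
assert next_headers()['User-Agent'] != next_headers()['User-Agent']
```

In the worker pool, this would plug in as `session.get(url, headers=next_headers(), ...)`.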
Want This Running in 5 Minutes?
I've packaged similar scrapers as ready-to-use actors on Apify Store. No setup, no infrastructure, just results.
Or grab the full source from my web scraping toolkit on GitHub.
What's your biggest scraping challenge? Rate limiting? Anti-bot detection? Data quality? Drop it in the comments — I've probably hit the same wall. 👇
Building developer tools at spinov001-art.github.io — 190+ open source repos and counting.