Part 1: Planning Your Scraping Project
Before you write a single line of code, answer these questions:
1. What exactly are you scraping?
- Specific fields (titles, dates, URLs, content)?
- How many pages/items initially? How many over 3 months?
- Is the data structured (JSON API) or unstructured (HTML)?
2. What's the target site's ToS and robots.txt?
- Some sites explicitly forbid scraping. Respect that.
- Check yoursite.com/robots.txt and the Terms of Service.
- If they offer an API, use it—it's always better than scraping.
3. What are the scale requirements?
- 100 items/week? Use a simple hourly cron job.
- 100,000 items/week? You need rate limiting, rotating proxies, and distributed workers.
- Millions? Consider managed platforms like Apify or hiring a specialist.
4. Where will you store the data?
- JSON files for quick prototypes
- CSV for spreadsheet analysis
- PostgreSQL/MongoDB for queryable datasets
- S3 for long-term archival
5. How will you handle failures?
- What happens if the site changes its HTML structure?
- What if your IP gets blocked?
- What if a scheduled job crashes at 2 AM?
Answer these first. It saves refactoring later.
Part 2: Choosing Your Scraping Tool
Not all scraping tools are created equal. Here's the 2026 landscape:
requests + BeautifulSoup (Simple HTTP pages)
Use this if:
- The site serves static HTML
- No JavaScript rendering needed
- You're scraping < 1,000 items/day
```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('div', class_='item')
```
Pros: Fast, simple, lightweight
Cons: Useless against JavaScript-heavy sites; no handling for dynamically rendered content
Selenium or Playwright (JavaScript-rendered pages)
Use this if:
- The site heavily relies on JavaScript
- You need to interact with the page (click, scroll, fill forms)
- You're comfortable trading speed for reliability
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://example.com')
        content = await page.content()
        await browser.close()
        return content  # Parse content with BeautifulSoup

asyncio.run(scrape())
```
Pros: Handles JavaScript, can interact with pages
Cons: Slow, memory-heavy, overkill for static sites
Official APIs (Best option)
If the site offers an API:
- Use it. Always.
- APIs are faster, more reliable, and respect the site's infrastructure.
- Most modern platforms (Twitter, Bluesky, Substack, HackerNews) have APIs or documented endpoints.
When APIs are unavailable, managed scraping platforms like Apify become valuable. They handle proxies, retries, and JavaScript rendering so you don't have to:
- Bluesky posts: Apify Bluesky Scraper
- Substack newsletters: Apify Substack Scraper
- HackerNews submissions: Apify HackerNews Scraper
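As a concrete illustration of the API-first approach, HackerNews publishes a free official API at hacker-news.firebaseio.com. A minimal sketch of fetching top stories through it instead of scraping the HTML (the `limit` parameter and the returned dict shape are choices made here for the example):

```python
import requests

HN_API = 'https://hacker-news.firebaseio.com/v0'

def top_stories(limit: int = 5) -> list:
    """Fetch top story IDs, then resolve each ID to its item record."""
    ids = requests.get(f'{HN_API}/topstories.json', timeout=10).json()
    stories = []
    for story_id in ids[:limit]:
        item = requests.get(f'{HN_API}/item/{story_id}.json', timeout=10).json()
        stories.append({'title': item.get('title', ''), 'url': item.get('url', '')})
    return stories

# Usage: stories = top_stories(limit=10)
```

No HTML parsing, no breakage when the markup changes, and the endpoint is explicitly offered for programmatic use.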
Part 3: Building a Robust Scraping Pipeline
Here's a production-grade scraping pipeline that handles errors, rate limits, and storage:
```python
import requests
import json
import time
import logging
from datetime import datetime
from typing import List, Dict
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Configure logging
logging.basicConfig(
    filename='scraper.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

class ScrapingPipeline:
    def __init__(self, output_file: str, max_retries: int = 3, backoff_factor: float = 0.3):
        """Initialize the scraping pipeline."""
        self.output_file = output_file
        self.max_retries = max_retries
        self.backoff_factor = backoff_factor
        self.session = self._create_session()

    def _create_session(self) -> requests.Session:
        """Create a requests session with retry logic."""
        session = requests.Session()
        # Configure retry strategy
        retry_strategy = Retry(
            total=self.max_retries,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["GET", "POST"],
            backoff_factor=self.backoff_factor
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
        return session

    def fetch_page(self, url: str) -> str:
        """Fetch a page with error handling."""
        try:
            logging.info(f"Fetching {url}")
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as e:
            logging.error(f"Failed to fetch {url}: {str(e)}")
            raise

    def parse_items(self, html: str) -> List[Dict]:
        """Parse HTML and extract items. Override this in subclasses."""
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html, 'html.parser')
        items = []
        # Example: Parse items with class 'article'
        for article in soup.find_all('div', class_='article'):
            item = {
                'title': article.find('h2').text.strip() if article.find('h2') else '',
                'url': article.find('a')['href'] if article.find('a') else '',
                'timestamp': datetime.now().isoformat(),
            }
            items.append(item)
        return items

    def save_items(self, items: List[Dict], format: str = 'json'):
        """Save items to file in the specified format."""
        if format == 'json':
            with open(self.output_file, 'w') as f:
                json.dump(items, f, indent=2)
        elif format == 'csv':
            import csv
            if items:
                keys = items[0].keys()
                with open(self.output_file, 'w', newline='') as f:
                    writer = csv.DictWriter(f, fieldnames=keys)
                    writer.writeheader()
                    writer.writerows(items)
        logging.info(f"Saved {len(items)} items to {self.output_file}")

    def load_existing_data(self) -> List[Dict]:
        """Load previously scraped data to avoid duplicates."""
        try:
            with open(self.output_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return []

    def deduplicate_items(self, new_items: List[Dict], existing_items: List[Dict], key: str = 'url') -> List[Dict]:
        """Remove items that already exist in the dataset."""
        existing_keys = {item[key] for item in existing_items if key in item}
        return [item for item in new_items if item.get(key) not in existing_keys]

    def run(self, urls: List[str], format: str = 'json'):
        """Execute the scraping pipeline."""
        all_items = []
        existing_items = self.load_existing_data()
        for url in urls:
            try:
                html = self.fetch_page(url)
                items = self.parse_items(html)
                new_items = self.deduplicate_items(items, existing_items)
                all_items.extend(new_items)
                # Respect rate limits: wait 2 seconds between requests
                time.sleep(2)
            except Exception as e:
                logging.error(f"Error processing {url}: {str(e)}")
                continue
        # Combine with existing data and save
        combined = existing_items + all_items
        self.save_items(combined, format=format)
        logging.info(f"Pipeline complete. Total items: {len(combined)}")
        return combined

# Usage example
if __name__ == '__main__':
    pipeline = ScrapingPipeline(
        output_file='scraped_data.json',
        max_retries=3,
        backoff_factor=0.5
    )
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3',
    ]
    pipeline.run(urls, format='json')
```
This pipeline includes:
- Automatic retries with exponential backoff for transient failures
- Rate limiting (2-second delays between requests)
- Error logging for debugging
- Deduplication to avoid storing duplicates
- Multiple output formats (JSON or CSV)
- Session management with proper User-Agent headers
Part 4: Handling Rate Limits and Blocks
Every site has limits. Respect them or get blocked.
Common strategies:
- Respect delay requests: If a site returns 429 (Too Many Requests), increase your wait time.
```python
if response.status_code == 429:
    retry_after = int(response.headers.get('Retry-After', 60))
    logging.warning(f"Rate limited. Waiting {retry_after} seconds")
    time.sleep(retry_after)
```
- Rotate User-Agents: Some sites block specific user agents.
```python
import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]
headers = {'User-Agent': random.choice(user_agents)}
```
- Use rotating proxies: For large-scale scraping, consider proxy services.
```python
proxies = {
    'http': 'http://proxy1.com:8080',
    'https': 'http://proxy1.com:8080',
}
response = session.get(url, proxies=proxies)
```
- Space out requests: Never hammer a site. Aim for 1-2 requests per second max.
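One way to enforce that spacing is a small throttle that sleeps just long enough to keep a minimum interval between successive requests. A sketch (the `Throttle` class and its `min_interval` parameter are illustrative, not a library API; tune the interval per site):

```python
import time

class Throttle:
    """Guarantee at least `min_interval` seconds between successive calls."""
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only if the previous call was less than min_interval ago
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

throttle = Throttle(min_interval=0.5)
for _ in range(3):
    throttle.wait()  # blocks only when requests come faster than one per 0.5s
```

Call `throttle.wait()` immediately before each `session.get()`; unlike a blanket `time.sleep(2)`, it doesn't waste time when parsing already took longer than the interval.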
When blocks happen:
- Check the site's API or terms of service first.
- If they forbid scraping, stop and find an alternative data source.
- If they allow it, slow down further and use proxies.
- If an actor/tool consistently works (like Apify), consider outsourcing the scraping.
Part 5: Scheduling Automatic Runs
Your pipeline shouldn't require manual execution every time. Automate it.
Option 1: System Cron (Linux/Mac)
```bash
# Run daily at 2 AM
0 2 * * * /usr/bin/python3 /path/to/scraper.py

# Run every 6 hours
0 */6 * * * /usr/bin/python3 /path/to/scraper.py
```
Option 2: APScheduler (Python)
For more complex scheduling within Python:
```python
import time
from apscheduler.schedulers.background import BackgroundScheduler

def scheduled_scrape():
    pipeline = ScrapingPipeline('output.json')
    pipeline.run(urls=['https://example.com/page1'])

scheduler = BackgroundScheduler()
scheduler.add_job(scheduled_scrape, 'interval', hours=6)
scheduler.start()

# Keep the scheduler running
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    scheduler.shutdown()
```
Option 3: Apify (Managed Scraping)
For production workloads, Apify eliminates the operational overhead:
- Handles retries, proxies, and JavaScript rendering automatically
- Supports scheduling directly from the platform
- Monitors performance and alerts on failures
- No infrastructure to manage on your end
Use Apify for:
- Bluesky posts: Apify Bluesky Scraper
- Substack newsletters: Apify Substack Scraper
- HackerNews submissions: Apify HackerNews Scraper
Part 6: Error Handling and Monitoring
Production scrapers fail. Build resilience in.
Essential error handling:
```python
import time
import logging
import requests
from functools import wraps

def retry_on_exception(max_attempts: int = 3, delay: int = 5):
    """Decorator to retry a function on exception."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_attempts - 1:
                        raise
                    logging.warning(f"Attempt {attempt + 1} failed. Retrying in {delay}s: {str(e)}")
                    time.sleep(delay)
        return wrapper
    return decorator

@retry_on_exception(max_attempts=3, delay=5)
def fetch_with_retries(url: str):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text
```
Monitor critical metrics:
- Items scraped per run
- Error rate
- Average response time
- IP blocks or CAPTCHA challenges
- Data quality (missing fields, incomplete records)
Store these in a simple metrics file or database. If error rates spike, pause and investigate.
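A metrics file can be as simple as one JSON line appended after each run, which stays greppable and plottable without any database. A sketch (the file name and field names here are illustrative choices, not a fixed schema):

```python
import json
from datetime import datetime, timezone

def record_run_metrics(path: str, items_scraped: int, errors: int, duration_s: float):
    """Append one JSON line per run so trends are easy to inspect later."""
    entry = {
        'run_at': datetime.now(timezone.utc).isoformat(),
        'items_scraped': items_scraped,
        'errors': errors,
        'duration_s': round(duration_s, 2),
    }
    with open(path, 'a') as f:  # append mode: one line per run, never overwrite
        f.write(json.dumps(entry) + '\n')

# Example: record a run that scraped 120 items with 2 errors in 48.7 seconds
record_run_metrics('metrics.jsonl', items_scraped=120, errors=2, duration_s=48.7)
```

Reading the last few lines of the file before each run is enough to detect an error-rate spike and skip or alert instead of scraping blindly.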
Part 7: Data Storage Strategy
Choose based on your query patterns:
| Format | Best For | Pros | Cons |
|---|---|---|---|
| JSON | Small datasets, quick prototypes | Human-readable, nested structures | Not queryable, slower for large files |
| CSV | Spreadsheet analysis, tabular data | Excel/Sheets compatible | No nested data, flat structure only |
| SQLite | Local development, < 1M rows | No server setup, ACID transactions | Single machine only, slow for concurrent access |
| PostgreSQL | Production, concurrent access, complex queries | Powerful, scalable, backups | Infrastructure overhead |
| MongoDB | Nested/unstructured data, rapid iteration | Flexible schema, scalable | No strong guarantees, higher memory use |
Start with JSON. Graduate to a database when you need to query across millions of records.
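When you outgrow JSON, SQLite is the lowest-friction step up: no server, and the schema itself can enforce deduplication. A sketch using Python's built-in sqlite3 module (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect('scraped.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS items (
        url TEXT PRIMARY KEY,  -- PRIMARY KEY lets INSERT OR IGNORE skip duplicates
        title TEXT,
        scraped_at TEXT
    )
""")

items = [
    ('https://example.com/a', 'First post', '2026-01-01T00:00:00'),
    ('https://example.com/a', 'First post', '2026-01-01T00:00:00'),  # duplicate URL
    ('https://example.com/b', 'Second post', '2026-01-02T00:00:00'),
]
# INSERT OR IGNORE silently skips rows whose url already exists
conn.executemany('INSERT OR IGNORE INTO items VALUES (?, ?, ?)', items)
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM items').fetchone()[0]
print(count)  # 2 — the duplicate row was ignored
conn.close()
```

This replaces the load-everything-then-deduplicate pattern from Part 3: the database does the duplicate check per row, so memory use stays flat as the dataset grows.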
Part 8: Putting It All Together
Here's a complete example scraping HackerNews or a similar site:
```python
import requests
from bs4 import BeautifulSoup
import json
import logging
import time
from datetime import datetime

logging.basicConfig(level=logging.INFO)

class HNScrapingPipeline:
    def __init__(self):
        self.base_url = 'https://news.ycombinator.com'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

    def scrape_top_stories(self, pages: int = 3) -> list:
        """Scrape top stories from HackerNews."""
        stories = []
        for page in range(1, pages + 1):
            try:
                url = f'{self.base_url}/?p={page}' if page > 1 else self.base_url
                response = requests.get(url, headers=self.headers, timeout=10)
                response.raise_for_status()
                soup = BeautifulSoup(response.content, 'html.parser')
                rows = soup.find_all('tr', class_='athing')
                for row in rows:
                    title_cell = row.find('span', class_='titleline')
                    if title_cell:
                        link = title_cell.find('a')
                        story = {
                            'title': link.text,
                            'url': link.get('href', ''),
                            'source': 'hackernews',
                            'scraped_at': datetime.now().isoformat(),
                        }
                        stories.append(story)
                logging.info(f"Scraped page {page}: {len(rows)} stories")
                time.sleep(2)  # Rate limiting
            except Exception as e:
                logging.error(f"Error scraping page {page}: {str(e)}")
                continue
        return stories

    def save(self, stories: list, filename: str = 'hn_stories.json'):
        """Save stories to JSON."""
        with open(filename, 'w') as f:
            json.dump(stories, f, indent=2)
        logging.info(f"Saved {len(stories)} stories to {filename}")

# Run it
if __name__ == '__main__':
    pipeline = HNScrapingPipeline()
    stories = pipeline.scrape_top_stories(pages=3)
    pipeline.save(stories)
```
For HackerNews and other popular platforms, consider using our Apify actors instead: Apify HackerNews Scraper. They handle edge cases and updates automatically.
Key Takeaways
- Plan before coding. Answer the 5 planning questions first.
- Use APIs when available. Scraping is a last resort.
- Build for failure. Retries, logging, and deduplication save you from disasters.
- Respect rate limits. A 2-second delay prevents IP blocks.
- Store strategically. JSON for prototypes, databases for production.
- Automate scheduling. Cron, APScheduler, or Apify—pick one and forget about it.
- Monitor obsessively. If you're not tracking errors, you won't know when it breaks.
For complex scraping at scale—Bluesky, Substack, HackerNews—managed platforms like Apify eliminate months of operational burden.
Stay Updated
Web scraping tools and best practices evolve constantly. Subscribe to The Data Collector to stay ahead of API changes, new scraping techniques, and real-world case studies from the scraping community.
Subscribe to The Data Collector
Get the latest on web data, scraping strategies, and how to build sustainable data pipelines. No fluff, just practical techniques that work in 2026.
Disclosure: This post contains affiliate links. I may earn a commission if you sign up through my links, at no extra cost to you.
Compare web scraping APIs:
- ScraperAPI — 5,000 free credits, 50+ countries, structured data parsing
- Scrape.do — From $29/mo, strong Cloudflare bypass
- ScrapeOps — Proxy comparison + monitoring dashboard
Need custom web scraping? Email hustler@curlship.com — fast turnaround, fair pricing.