Introduction
Ever wondered how to reliably scrape dynamic web content without getting blocked?
Many web scrapers fail because they don't account for anti-bot defenses or for pages that change under them. This article walks through RepoHunter, an open-source Python framework that demonstrates a production-grade scraping architecture with anti-detection techniques.
What You'll Learn
- 🛡️ Defensive parsing strategies that survive layout mutations
- 🤖 Headless browser automation with stealth technology
- 🏗️ System design principles for resilient scrapers
- 📊 Real-world example: GitHub trending repository discovery
The Problem: Why Standard Scraping Fails
Typical Web Scraper Architecture (Naive Approach)
# ❌ BAD: Naive scraper (breaks easily)
import requests
from bs4 import BeautifulSoup

response = requests.get("https://github.com/trending")
soup = BeautifulSoup(response.content, 'html.parser')

# Problem: no browser headers, no pacing, no retry; repeated runs are easy to flag as a bot
# Result: HTTP 429 (Too Many Requests) or a blocked IP
Why This Fails
- User-Agent Detection: GitHub checks whether the request comes from a real browser
- Rate Limiting: Rapid requests trigger IP blocking
- JavaScript Rendering: GitHub uses dynamic rendering; the requests library gets empty HTML
- Layout Mutations: GitHub changes CSS selectors; the parser breaks
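Telling these failure modes apart is easier if each response is classified before it is parsed. The following is a minimal sketch, not part of RepoHunter: `classify_response`, its labels, and the `marker` substring are all hypothetical names chosen for illustration.

```python
def classify_response(status_code: int, html: str, marker: str = "<article") -> str:
    """Label a response with its most likely failure mode.

    `marker` is an assumed substring that only appears when the repository
    list was actually rendered; adjust it for the page being scraped.
    """
    if status_code == 429:
        return "rate_limited"            # server asked us to slow down
    if status_code in (401, 403):
        return "blocked"                 # request was rejected outright
    if status_code != 200:
        return "http_error"
    if not html.strip():
        return "empty_body"              # nothing came back to parse
    if marker not in html:
        return "layout_or_render_miss"   # page loaded, expected content absent
    return "ok"


if __name__ == "__main__":
    print(classify_response(429, ""))
    print(classify_response(200, "<html><body></body></html>"))
    print(classify_response(200, "<article><h2><a href='/a/b'>a/b</a></h2></article>"))
```

Logging this label on every request makes it clear whether a broken run needs slower pacing, a real browser, or new selectors.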
The Solution: Defensive Parsing Architecture
RepoHunter implements a multi-layered resilience strategy:
Layer 1: Headless Browser with Anti-Fingerprinting
# ✅ GOOD: Use a headless browser with stealth
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Anti-fingerprinting: hide the navigator.webdriver automation flag
    page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined,
        });
    """)
    page.goto("https://github.com/trending")
    html = page.content()
    browser.close()
Why this works:
- Headless browser renders JavaScript
- Anti-fingerprinting bypasses navigator.webdriver detection
- Real browser signals make blocking less likely
Layer 2: Defensive CSS Selector Strategy
# ✅ Sequential fallback selectors (resilient to layout changes)
import re
from bs4 import BeautifulSoup

class DefensiveParser:
    def extract_repositories(self, html):
        """Return repository paths such as '/owner/repo'."""
        soup = BeautifulSoup(html, 'html.parser')
        # Primary selector (current GitHub layout)
        repos = soup.select('article h1 a')
        if not repos:  # Fallback 1: GitHub mutated layout
            repos = soup.select('h2 a[href*="/"]')
        if not repos:  # Fallback 2: Different class names
            repos = soup.select('[itemprop="name"] a')
        if repos:
            # Normalize to href strings so every branch returns the same type
            return [a['href'] for a in repos if a.has_attr('href')]
        # Fallback 3: Regex pattern matching on the raw HTML
        return re.findall(r'href="(/[^/"]+/[^/"]+)"', html)
Design Principle: Single Responsibility - all selector knowledge lives in the parser, so a layout change means editing its selector list, not the fetching or reporting code.
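The same fallback idea can be written as data rather than a chain of `if` statements. The sketch below is an illustration under assumptions, not RepoHunter's code: `first_match` and `STRATEGIES` are hypothetical names, and the strategies use standard-library regexes as stand-ins for CSS selectors so the example runs without extra dependencies.

```python
import re
from typing import Callable, List, Optional, Tuple

Strategy = Tuple[str, Callable[[str], List[str]]]


def first_match(html: str, strategies: List[Strategy]) -> Tuple[Optional[str], List[str]]:
    """Run strategies in order and return (name, results) for the first hit."""
    for name, extract in strategies:
        results = extract(html)
        if results:
            return name, results
    return None, []


# Regex stand-ins for the CSS selectors, ordered from most to least specific.
STRATEGIES: List[Strategy] = [
    ("article_h1", lambda h: re.findall(r'<article[^>]*>\s*<h1[^>]*>\s*<a[^>]*href="([^"]+)"', h)),
    ("any_h2", lambda h: re.findall(r'<h2[^>]*>\s*<a[^>]*href="([^"]+)"', h)),
    ("any_repo_link", lambda h: re.findall(r'href="(/[^/"]+/[^/"]+)"', h)),
]

if __name__ == "__main__":
    html = '<h2 class="h3"><a href="/octo/demo">octo / demo</a></h2>'
    name, repos = first_match(html, STRATEGIES)
    print(name, repos)  # any_h2 ['/octo/demo']
```

Returning the name of the strategy that matched is the useful part: when the primary strategy stops winning, the logs show that the layout has drifted before the last fallback fails too.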
Layer 3: Exception Handling & Graceful Degradation
# ✅ GOOD: Graceful failure handling
import logging
from time import sleep

import requests

logger = logging.getLogger(__name__)

def fetch_with_retry(url, max_retries=3, fetch=requests.get):
    # fetch is any callable that returns an object with a status_code attribute
    for attempt in range(max_retries):
        try:
            response = fetch(url)
            if response.status_code == 200:
                return response
            # 429 (rate limited) and other non-200 responses fall through to the backoff
            logger.warning(f"Attempt {attempt + 1}: HTTP {response.status_code}")
        except Exception as e:
            logger.error(f"Attempt {attempt + 1}: {e}")
        if attempt < max_retries - 1:
            # Exponential backoff: 1s, 2s, 4s, ...
            sleep(2 ** attempt)
    logger.warning(f"Failed after {max_retries} attempts")
    return None
Design Pattern: Retry pattern with exponential backoff (prevents hammering the server).
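To make the backoff schedule concrete, the delays can be computed separately from the retry loop. This is a sketch under assumptions: `backoff_delays`, the 60-second cap, and the optional jitter are illustrative additions, not something the original code does.

```python
import random
from typing import List, Optional


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 60.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one delay per retry: base * 2**attempt, capped, optionally jittered."""
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        if rng is not None:
            # "Full jitter": pick uniformly in [0, ceiling] so that many
            # clients do not retry in lockstep.
            delays.append(rng.uniform(0, ceiling))
        else:
            delays.append(ceiling)
    return delays


if __name__ == "__main__":
    print(backoff_delays(5))            # [1.0, 2.0, 4.0, 8.0, 16.0]
    print(backoff_delays(8, cap=30.0))  # growth stops at the cap
```

The cap matters for long-running scrapers: without it, the tenth retry would wait over eight minutes.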
Real-World Implementation: RepoHunter
Architecture Overview
┌─────────────────────────────────────────┐
│ RepoHunter Main Flow │
├─────────────────────────────────────────┤
│ │
│ 1. Headless Browser Launch │
│ └─ Anti-fingerprinting enabled │
│ │
│ 2. GitHub Trending Request │
│ └─ Real browser signals │
│ │
│ 3. Defensive Parser │
│ └─ Sequential CSS fallbacks │
│ │
│ 4. Data Extraction │
│ └─ repo name, stars, language │
│ │
│ 5. JSON/Markdown Output │
│ └─ daily_trends.md report │
│ │
└─────────────────────────────────────────┘
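The five steps in the diagram can be wired together as three injected stages. The sketch below is hypothetical (`run_pipeline` and the stand-in stages are not taken from the RepoHunter source); it only shows how the flow composes, with stubs in place of the browser and parser.

```python
from typing import Callable, Dict, List

Repo = Dict[str, str]


def run_pipeline(fetch_html: Callable[[str], str],
                 parse: Callable[[str], List[Repo]],
                 render: Callable[[List[Repo]], str],
                 url: str = "https://github.com/trending") -> str:
    """Steps 1-2 fetch the page, steps 3-4 parse it, step 5 renders the report."""
    html = fetch_html(url)
    repos = parse(html)
    return render(repos)


# Stand-in stages so the wiring can be exercised without a browser.
def fake_fetch(url: str) -> str:
    return "<html>stub</html>"


def fake_parse(html: str) -> List[Repo]:
    return [{"name": "octo/demo", "stars": "42", "language": "Python"}]


def fake_render(repos: List[Repo]) -> str:
    return "\n".join(f"{r['name']} ({r['language']}, {r['stars']} stars)" for r in repos)


if __name__ == "__main__":
    print(run_pipeline(fake_fetch, fake_parse, fake_render))
```

Because each stage is passed in, the browser, the parser, and the report writer can be replaced or tested independently.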
Key Components
Component 1: Multiplatform Binary Validation
# ✅ Deterministic path handling across OS
import os
import platform
from pathlib import Path

class BinaryValidator:
    def __init__(self, binary_name):
        # platform.system() is e.g. 'Linux', 'Darwin', or 'Windows'
        self.binary_path = Path(f"./obscura-{platform.system().lower()}/{binary_name}")

    def is_executable(self):
        # Checks the execute bit on Unix; on Windows this is effectively an existence check
        return os.access(self.binary_path, os.X_OK)

    def validate(self):
        if not self.binary_path.exists():
            raise FileNotFoundError(f"Binary not found: {self.binary_path}")
        if not self.is_executable():
            raise PermissionError(f"Binary not executable: {self.binary_path}")
        return True
Design Principle: Separation of Concerns - validation logic isolated from execution.
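The two checks behave differently depending on file permissions, which is easy to see with a temporary file. This is a standalone sketch: `check_binary` and the file name `demo-binary` are hypothetical, and the "not executable" result assumes a Unix-like system.

```python
import os
import stat
import tempfile
from pathlib import Path


def check_binary(path: Path) -> str:
    """Return 'missing', 'not_executable', or 'ok' for the given path."""
    if not path.exists():
        return "missing"
    if not os.access(path, os.X_OK):
        return "not_executable"
    return "ok"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        binary = Path(tmp) / "demo-binary"       # hypothetical binary name
        print(check_binary(binary))              # file does not exist yet
        binary.write_text("#!/bin/sh\necho hi\n")
        binary.chmod(0o644)                      # readable, no execute bit
        print(check_binary(binary))              # rejected on Unix
        binary.chmod(binary.stat().st_mode | stat.S_IXUSR)
        print(check_binary(binary))              # execute bit now set
```

Returning a status instead of raising keeps the sketch easy to print; the class above raises because a missing binary should stop the run.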
Component 2: Configurable Scan Policies
# ✅ Strategy pattern: swap algorithms at runtime
from abc import ABC, abstractmethod
from time import sleep

class ScanPolicy(ABC):
    @abstractmethod
    def get_interval(self):
        pass

class FastPolicy(ScanPolicy):
    def get_interval(self):
        return 600  # 10 minutes

class SlowPolicy(ScanPolicy):
    def get_interval(self):
        return 43200  # 12 hours

class RepoHunter:
    def __init__(self, policy: ScanPolicy):
        self.policy = policy

    def run(self):
        while True:
            self.scan()
            sleep(self.policy.get_interval())

    def scan(self):
        pass  # Scanning implementation goes here
Design Pattern: Strategy pattern - adds/removes policies without modifying RepoHunter.
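As an illustration of adding policies without touching `RepoHunter`, the sketch below defines two more policies and a name-to-class registry. `NormalPolicy`, `OncePolicy`, `POLICIES`, `policy_from_name`, and the convention that `None` means "run once" are all assumptions made for this example, not confirmed details of the project.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Type


class ScanPolicy(ABC):
    @abstractmethod
    def get_interval(self) -> Optional[int]:
        """Seconds between scans; None means run once and stop (assumed convention)."""


class FastPolicy(ScanPolicy):
    def get_interval(self) -> Optional[int]:
        return 600      # 10 minutes


class NormalPolicy(ScanPolicy):
    def get_interval(self) -> Optional[int]:
        return 3600     # 1 hour


class SlowPolicy(ScanPolicy):
    def get_interval(self) -> Optional[int]:
        return 43200    # 12 hours


class OncePolicy(ScanPolicy):
    def get_interval(self) -> Optional[int]:
        return None     # single execution


# Hypothetical registry: supporting a new policy means adding one entry here.
POLICIES: Dict[str, Type[ScanPolicy]] = {
    "fast": FastPolicy,
    "normal": NormalPolicy,
    "slow": SlowPolicy,
    "once": OncePolicy,
}


def policy_from_name(name: str) -> ScanPolicy:
    """Build the policy selected by name, e.g. from a CLI prompt."""
    try:
        return POLICIES[name.lower()]()
    except KeyError:
        raise ValueError(f"Unknown scan policy: {name!r}") from None


if __name__ == "__main__":
    for name in POLICIES:
        print(name, policy_from_name(name).get_interval())
```

A run loop that supports the `None` convention would need to check for it before sleeping; the `run` method shown earlier does not.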
Key Learnings: What Works in Production
✅ What Makes RepoHunter Resilient
| Strategy | Benefit |
|---|---|
| Sequential fallback selectors | Survives CSS layout mutations |
| Headless browser + anti-fingerprinting | Bypasses bot detection |
| Exponential backoff retry | Handles rate limiting gracefully |
| Deterministic path handling | Cross-platform binary execution |
| Strategy pattern configuration | Runtime policy switching |
❌ Common Mistakes to Avoid
❌ Single CSS selector (breaks on layout change)
❌ requests library only (can't render JavaScript)
❌ No retry logic (fails on transient errors)
❌ Hardcoded paths (breaks on different OS)
❌ Monolithic scraper (hard to test/maintain)
Installation & Usage
Get Started with RepoHunter
# Clone
git clone https://github.com/AxthonyV/RepoHunter.git
cd RepoHunter
# Install dependencies
pip install -r requirements.txt
# Run
python repo_hunter.py
Select scan policy:
- Fast: 10 minutes
- Normal: 1 hour
- Slow: 12 hours
- Once: Single execution
Example Output
# Trending Repositories (2026-06-07)
| Repository | Stars | Language | Description |
|-----------|-------|----------|-------------|
| openai/gpt-4 | 15,234 | Python | Advanced LLM Framework |
| vercel/next.js | 12,456 | JavaScript | React Framework |
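A report in this shape can be produced with a small rendering function. The sketch below assumes each repository is a dict with `name`, `stars`, `language`, and `description` keys; that schema and the name `render_report` are hypothetical, not taken from the RepoHunter source.

```python
from datetime import date
from typing import Dict, List


def render_report(repos: List[Dict[str, object]], day: date) -> str:
    """Render repositories as a markdown table under a dated heading."""
    lines = [
        f"# Trending Repositories ({day.isoformat()})",
        "| Repository | Stars | Language | Description |",
        "|-----------|-------|----------|-------------|",
    ]
    for repo in repos:
        # Escape pipes so a description cannot break the table row.
        description = str(repo.get("description", "")).replace("|", "\\|")
        lines.append(
            f"| {repo['name']} | {repo['stars']:,} | {repo['language']} | {description} |"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        {"name": "vercel/next.js", "stars": 12456,
         "language": "JavaScript", "description": "React Framework"},
    ]
    print(render_report(sample, date(2026, 6, 7)))
```

Keeping star counts as integers until render time lets the same data be sorted or written as JSON without re-parsing formatted strings.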
Architecture & Design Principles
RepoHunter demonstrates several SOLID principles:
Single Responsibility Principle (SRP)
- Parser handles HTML extraction
- BinaryValidator handles binary validation
- RepoHunter orchestrates flow
Open/Closed Principle (OCP)
- Add new ScanPolicy without modifying RepoHunter
- Add new parsing logic without breaking existing code
Dependency Inversion (DIP)
- RepoHunter depends on abstract ScanPolicy, not concrete implementations
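One practical payoff of depending on an abstraction is testability: a fake policy and a fake sleep function let the run loop be exercised instantly. The sketch below is hypothetical; `BoundedHunter`, `FixedPolicy`, `max_scans`, and `sleep_fn` are illustrative names, and the real `RepoHunter.run` loops forever rather than stopping after a fixed number of scans.

```python
from typing import Callable, List


class FixedPolicy:
    """Test double: any object with get_interval() satisfies the dependency."""

    def __init__(self, seconds: int):
        self.seconds = seconds

    def get_interval(self) -> int:
        return self.seconds


class BoundedHunter:
    """Hypothetical RepoHunter variant with a scan limit and injectable sleep."""

    def __init__(self, policy, sleep_fn: Callable[[float], None], max_scans: int):
        self.policy = policy
        self.sleep_fn = sleep_fn
        self.max_scans = max_scans
        self.scans_done = 0

    def scan(self) -> None:
        self.scans_done += 1  # stand-in for the real scanning work

    def run(self) -> int:
        for i in range(self.max_scans):
            self.scan()
            if i < self.max_scans - 1:
                # Wait between scans, but not after the last one.
                self.sleep_fn(self.policy.get_interval())
        return self.scans_done


if __name__ == "__main__":
    waits: List[float] = []
    hunter = BoundedHunter(FixedPolicy(5), sleep_fn=waits.append, max_scans=3)
    print(hunter.run(), waits)  # 3 [5, 5]
```

Recording the requested waits in a list, instead of actually sleeping, checks the scheduling logic without slowing the test down.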
Educational Value
RepoHunter is perfect for learning:
✅ Web Scraping Architecture — Real-world resilience patterns
✅ Design Patterns — Strategy pattern in action
✅ Systems Programming — Cross-platform binary handling
✅ Error Handling — Graceful failure & retry logic
✅ Clean Code — Well-structured, testable components
Conclusion
Web scraping at scale requires defensive architecture, not just quick scripts. RepoHunter demonstrates production-grade patterns:
- Headless browsers for JavaScript rendering
- Sequential fallbacks for resilience
- Graceful degradation for error handling
- Design patterns for maintainability
Next Steps
- 🔍 Explore RepoHunter on GitHub
- 📚 Read the complete architecture documentation
- 🤝 Star the repo if you find it valuable
- 💬 Leave feedback or contribute improvements
Resources
- BeautifulSoup Documentation
- Playwright Documentation
- Gang of Four - Design Patterns
- Robert Martin - Clean Code
Questions? Feedback? Connect on GitHub 🚀