AxthonyV

Posted on Jun 7

Building Anti-Bot Detection Systems: How RepoHunter Scrapes GitHub Trending Repositories

#python #webscraping #automation #architecture

Introduction

Ever wondered how to reliably scrape dynamic web content without getting blocked?

Most web scrapers fail because they don't account for anti-bot defenses. This article walks through RepoHunter, an open-source Python framework that demonstrates production-grade scraping architecture with detection-evasion techniques.

What You'll Learn

🛡️ Defensive parsing strategies that survive layout mutations
🤖 Headless browser automation with stealth technology
🏗️ System design principles for resilient scrapers
📊 Real-world example: GitHub trending repository discovery

The Problem: Why Standard Scraping Fails

Typical Web Scraper Architecture (Naive Approach)

# ❌ BAD: Naive scraper (will fail)
import requests
from bs4 import BeautifulSoup

response = requests.get("https://github.com/trending")
soup = BeautifulSoup(response.content, 'html.parser')

# Problem: the default python-requests User-Agent identifies this client as a script
# Result: frequent polling can trigger HTTP 429 (Too Many Requests) or an IP block

Why This Fails

User-Agent Detection: the server can check whether a request comes from a real browser
Rate Limiting: rapid requests can trigger HTTP 429 responses or IP blocking
JavaScript Rendering: content that is rendered client-side is missing from the HTML the requests library receives
Layout Mutations: when GitHub changes its markup, hard-coded CSS selectors stop matching and the parser breaks
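The rate-limiting point is the easiest one to address before reaching for a browser. The sketch below spaces requests out with a minimum interval. The `Throttle` class and its parameters are hypothetical names used for illustration and are not taken from RepoHunter; the clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable for testing
        self._sleep = sleep   # injectable for testing
        self._last = None     # time of the previous call, if any

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
        return now
```

Calling `throttle.wait()` before each page load keeps the scraper under a self-imposed request rate, whatever the scan policy is.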

The Solution: Defensive Parsing Architecture

RepoHunter implements a multi-layered resilience strategy:

Layer 1: Headless Browser with Anti-Fingerprinting

# ✅ GOOD: Use headless browser with stealth
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch a headless Chromium instance and open a page
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Anti-fingerprinting: hide the navigator.webdriver automation flag
    page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined,
        });
    """)

    page.goto("https://github.com/trending")
    html = page.content()
    browser.close()

Why this works:

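A headless browser executes JavaScript, so the fully rendered DOM is available
The init script hides the navigator.webdriver flag that basic bot checks read
The request carries real browser signals, which makes blocking less likely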

Layer 2: Defensive CSS Selector Strategy

# ✅ Sequential fallback selectors (resilient to layout changes)
import re
from bs4 import BeautifulSoup

class DefensiveParser:
    def extract_repositories(self, html):
        """Return repository paths (href strings) using sequential fallbacks."""
        soup = BeautifulSoup(html, 'html.parser')

        # Primary selector (current GitHub layout)
        repos = soup.select('article h1 a')

        if not repos:  # Fallback 1: GitHub mutated layout
            repos = soup.select('h2 a[href*="/"]')

        if not repos:  # Fallback 2: Different class names
            repos = soup.select('[itemprop="name"] a')

        if repos:
            # Normalize Tag objects to href strings so every path returns the same type
            return [a['href'] for a in repos if a.has_attr('href')]

        # Fallback 3: Regex pattern matching for /owner/repo style paths
        return re.findall(r'href="(/[^/"]+/[^/"]+)"', html)

Design Principle: Single Responsibility - the parser absorbs layout changes, so the rest of the pipeline stays untouched.
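One way to keep the fallback chain open for extension is to hold the strategies in a list rather than in nested `if` statements. The following is a minimal sketch using only the standard library. The extractor functions and `first_match` are hypothetical names for illustration, and the regexes are simplified stand-ins for the CSS selectors above, not RepoHunter's actual rules.

```python
import re


def by_h2_heading(html):
    # Hypothetical extractor: a link that directly follows an <h2> opening tag.
    return re.findall(r'<h2[^>]*>\s*<a[^>]*href="(/[^/"]+/[^/"]+)"', html)


def by_any_repo_link(html):
    # Hypothetical last resort: any href shaped like /owner/repo.
    return re.findall(r'href="(/[^/"]+/[^/"]+)"', html)


def first_match(html, extractors):
    """Run extractors in order; return (name, results) for the first non-empty one."""
    for extractor in extractors:
        results = extractor(html)
        if results:
            return extractor.__name__, results
    return None, []


EXTRACTORS = [by_h2_heading, by_any_repo_link]
```

Returning the name of the extractor that matched makes layout drift visible: if the logs show the last-resort extractor firing, the primary selector probably needs updating.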

Layer 3: Exception Handling & Graceful Degradation

# ✅ GOOD: Graceful failure handling
import logging
from time import sleep

import requests

logger = logging.getLogger(__name__)

def fetch_with_retry(url, max_retries=3, fetch=requests.get):
    # `fetch` is any callable that returns an object with a .status_code
    for attempt in range(max_retries):
        try:
            response = fetch(url)
            if response.status_code == 200:
                return response
            if response.status_code == 429:
                # Rate limited: exponential backoff (1s, 2s, 4s, ...)
                sleep(2 ** attempt)
                continue
            # Any other status is logged and retried
            logger.warning(f"Attempt {attempt + 1}: unexpected status {response.status_code}")
        except Exception as e:
            logger.error(f"Attempt {attempt + 1}: {e}")

    logger.warning(f"Failed after {max_retries} attempts")
    return None

Design Pattern: Retry pattern with exponential backoff (prevents hammering the server).
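Two common refinements to plain exponential backoff are random jitter, so that several clients do not retry in lockstep, and honoring the `Retry-After` header that servers may send with a 429. The sketch below shows both. `backoff_delay` and `delay_for` are hypothetical helper names, and only the integer-seconds form of `Retry-After` is handled; an HTTP-date value falls back to the computed delay.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt))."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def delay_for(attempt, retry_after=None, **kwargs):
    """Prefer the server's Retry-After value when it is a plain number of seconds."""
    if retry_after is not None:
        try:
            return float(max(0, int(retry_after)))
        except ValueError:
            pass  # e.g. an HTTP-date; fall through to the computed backoff
    return backoff_delay(attempt, **kwargs)
```

In a retry loop, `sleep(delay_for(attempt, response.headers.get("Retry-After")))` would replace the fixed `sleep(2 ** attempt)`, assuming the response object exposes headers as a mapping.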

Real-World Implementation: RepoHunter

Architecture Overview

┌─────────────────────────────────────────┐
│    RepoHunter Main Flow                 │
├─────────────────────────────────────────┤
│                                         │
│  1. Headless Browser Launch             │
│     └─ Anti-fingerprinting enabled      │
│                                         │
│  2. GitHub Trending Request             │
│     └─ Real browser signals             │
│                                         │
│  3. Defensive Parser                    │
│     └─ Sequential CSS fallbacks         │
│                                         │
│  4. Data Extraction                     │
│     └─ repo name, stars, language       │
│                                         │
│  5. JSON/Markdown Output                │
│     └─ daily_trends.md report           │
│                                         │
└─────────────────────────────────────────┘
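The flow in the diagram can be sketched as a function that wires three injectable stages together. This is an illustration under assumptions, not RepoHunter's actual entry point: `Repo`, `run_pipeline`, and the stage signatures are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    name: str
    stars: int
    language: str


def run_pipeline(fetch_html, parse, render, url="https://github.com/trending"):
    """Wire the stages together; each stage is a plain callable."""
    html = fetch_html(url)   # steps 1-2: browser launch and page request
    if not html:
        return None          # nothing fetched, so nothing to report
    repos = parse(html)      # steps 3-4: defensive parsing and extraction
    return render(repos)     # step 5: report generation
```

Because every stage is passed in, each one can be replaced with a stub in tests, which is what keeps the scraper from becoming monolithic.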

Key Components

Component 1: Multiplatform Binary Validation

# ✅ Deterministic path handling across OS
import os
import platform
from pathlib import Path

class BinaryValidator:
    def __init__(self, binary_name):
        self.binary_path = Path(f"./obscura-{platform.system().lower()}/{binary_name}")

    def is_executable(self):
        # os.X_OK tests the execute bit on Unix; on Windows it effectively only confirms the file exists
        return os.access(self.binary_path, os.X_OK)

    def validate(self):
        if not self.binary_path.exists():
            raise FileNotFoundError(f"Binary not found: {self.binary_path}")
        if not self.is_executable():
            raise PermissionError(f"Binary not executable: {self.binary_path}")
        return True

Design Principle: Separation of Concerns - validation logic isolated from execution.
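For the executable check itself, the standard library's `shutil.which` is an alternative worth knowing: it applies the execute-permission test and, on Windows, also tries the extensions listed in `PATHEXT` (so a lookup for a bare name can find the `.exe`). The sketch below restricts the search to one directory. `find_executable` is a hypothetical helper, not part of RepoHunter.

```python
import shutil
from pathlib import Path


def find_executable(directory, name):
    """Return the Path to an executable called `name` in `directory`, or None."""
    found = shutil.which(name, path=str(directory))
    return Path(found) if found else None
```

A validator could call this once and raise `FileNotFoundError` when it returns `None`, instead of checking existence and permissions in two separate steps.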

Component 2: Configurable Scan Policies

# ✅ Strategy pattern: swap algorithms at runtime
from abc import ABC, abstractmethod
from time import sleep

class ScanPolicy(ABC):
    @abstractmethod
    def get_interval(self):
        pass

class FastPolicy(ScanPolicy):
    def get_interval(self):
        return 600  # 10 minutes

class SlowPolicy(ScanPolicy):
    def get_interval(self):
        return 43200  # 12 hours

class RepoHunter:
    def __init__(self, policy: ScanPolicy):
        self.policy = policy

    def run(self):
        while True:
            self.scan()
            sleep(self.policy.get_interval())

    def scan(self):
        pass  # Scanning implementation goes here

Design Pattern: Strategy pattern - policies can be added or removed without modifying RepoHunter.
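The usage menu later in this post also lists a Normal (1 hour) and a Once (single execution) policy, which the snippet above does not cover. One possible way to model them, sketched here under assumptions, is to let `get_interval()` return `None` to signal "stop after this scan". The `run` function, its `sleep_fn` and `max_scans` parameters, and the `None` convention are hypothetical and may differ from how RepoHunter handles it.

```python
from abc import ABC, abstractmethod
from time import sleep
from typing import Optional


class ScanPolicy(ABC):
    @abstractmethod
    def get_interval(self) -> Optional[int]:
        """Seconds to wait before the next scan, or None to stop."""


class NormalPolicy(ScanPolicy):
    def get_interval(self):
        return 3600  # 1 hour


class OncePolicy(ScanPolicy):
    def get_interval(self):
        return None  # single execution


def run(scan, policy, sleep_fn=sleep, max_scans=None):
    """Scan repeatedly until the policy returns None or max_scans is reached."""
    count = 0
    while True:
        scan()
        count += 1
        interval = policy.get_interval()
        if interval is None or (max_scans is not None and count >= max_scans):
            return count
        sleep_fn(interval)
```

Injecting `sleep_fn` keeps the loop testable: a test can pass `list.append` and assert on the recorded intervals instead of waiting an hour.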

Key Learnings: What Works in Production

✅ What Makes RepoHunter Resilient

Strategy	Benefit
Sequential fallback selectors	Survives CSS layout mutations
Headless browser + anti-fingerprinting	Bypasses bot detection
Exponential backoff retry	Handles rate limiting gracefully
Deterministic path handling	Cross-platform binary execution
Strategy pattern configuration	Runtime policy switching

❌ Common Mistakes to Avoid

❌ Single CSS selector (breaks on layout change)
❌ requests library only (can't render JavaScript)
❌ No retry logic (fails on transient errors)
❌ Hardcoded paths (breaks on different OS)
❌ Monolithic scraper (hard to test/maintain)

Installation & Usage

Get Started with RepoHunter

# Clone
git clone https://github.com/AxthonyV/RepoHunter.git
cd RepoHunter

# Install dependencies
pip install -r requirements.txt

# Run
python repo_hunter.py

Select scan policy:

Fast: 10 minutes
Normal: 1 hour
Slow: 12 hours
Once: Single execution

Example Output

# Trending Repositories (2026-06-07)

| Repository | Stars | Language | Description |
|-----------|-------|----------|-------------|
| openai/gpt-4 | 15,234 | Python | Advanced LLM Framework |
| vercel/next.js | 12,456 | JavaScript | React Framework |
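A report in this shape can be produced with a few lines of string formatting. The sketch below is an assumption about how such a renderer might look, not RepoHunter's code: `render_report` and the dictionary keys (`name`, `stars`, `language`, `description`) are hypothetical.

```python
from datetime import date


def render_report(repos, day):
    """Render repo dicts as a Markdown table like the sample report."""
    lines = [
        f"# Trending Repositories ({day.isoformat()})",
        "",
        "| Repository | Stars | Language | Description |",
        "|-----------|-------|----------|-------------|",
    ]
    for repo in repos:
        # Escape pipes so a description cannot break the table layout.
        description = repo.get("description", "").replace("|", "\\|")
        language = repo.get("language") or "n/a"
        lines.append(
            f"| {repo['name']} | {repo['stars']:,} | {language} | {description} |"
        )
    return "\n".join(lines) + "\n"
```

Escaping the pipe character matters in practice because repository descriptions are free text and can contain Markdown syntax.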

Architecture & Design Principles

RepoHunter demonstrates several SOLID principles:

Single Responsibility Principle (SRP)

Parser handles HTML extraction
BinaryValidator handles binary validation
RepoHunter orchestrates flow

Open/Closed Principle (OCP)

Add new ScanPolicy without modifying RepoHunter
Add new parsing fallbacks without breaking existing ones

Dependency Inversion (DIP)

RepoHunter depends on abstract ScanPolicy, not concrete implementations

Educational Value

RepoHunter is perfect for learning:

✅ Web Scraping Architecture — Real-world resilience patterns
✅ Design Patterns — Strategy pattern in action
✅ Systems Programming — Cross-platform binary handling
✅ Error Handling — Graceful failure & retry logic
✅ Clean Code — Well-structured, testable components

Conclusion

Web scraping at scale requires defensive architecture, not just quick scripts. RepoHunter demonstrates production-grade patterns:

Headless browsers for JavaScript rendering
Sequential fallbacks for resilience
Graceful degradation for error handling
Design patterns for maintainability

Next Steps

🔍 Explore RepoHunter on GitHub
📚 Read the complete architecture documentation
🤝 Star the repo if you find it valuable
💬 Leave feedback or contribute improvements

Resources

Questions? Feedback? Connect on GitHub 🚀
