DEV Community

AxthonyV

Building Anti-Bot Detection Systems: How RepoHunter Scrapes GitHub Trending Repositories

Introduction

Ever wondered how to reliably scrape dynamic web content without getting blocked?

Many web scrapers fail because they don't account for anti-bot defenses. This article walks through RepoHunter, an open-source Python framework that demonstrates a production-oriented scraping architecture with detection-evasion techniques.

What You'll Learn

  • 🛡️ Defensive parsing strategies that survive layout mutations
  • 🤖 Headless browser automation with stealth technology
  • 🏗️ System design principles for resilient scrapers
  • 📊 Real-world example: GitHub trending repository discovery

The Problem: Why Standard Scraping Fails

Typical Web Scraper Architecture (Naive Approach)

# ❌ BAD: Naive scraper (will fail)
import requests
from bs4 import BeautifulSoup

response = requests.get("https://github.com/trending")
soup = BeautifulSoup(response.content, 'html.parser')

# Problem: no browser-like headers, no JavaScript execution, no retry logic
# Likely result under repeated requests: HTTP 429 (Too Many Requests) or a blocked IP

Why This Fails

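  1. User-Agent Detection: servers can inspect request headers, and the default python-requests User-Agent identifies the client as a script
  2. Rate Limiting: rapid, repeated requests can trigger HTTP 429 responses or IP blocking
  3. JavaScript Rendering: requests does not execute JavaScript, so any content rendered client-side is missing from the HTML it receives
  4. Layout Mutations: GitHub changes its markup over time, and a parser tied to one CSS selector breaks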
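
The rate-limiting point can be made concrete with a small client-side throttle. This is a minimal sketch, not code from RepoHunter: the `Throttle` class, its `min_interval` parameter, and the injectable `clock` and `sleep` hooks are all illustrative names chosen here.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum gap between consecutive requests (hypothetical helper)."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Block until min_interval has passed since the previous call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            elapsed = now - self._last_call
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        # Record the moment the request is actually allowed to proceed
        self._last_call = now + delay
        return delay
```

Calling `throttle.wait()` before each request keeps the scraper under a self-imposed request rate. Injecting the clock and sleep functions makes the timing logic testable without real delays.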

The Solution: Defensive Parsing Architecture

RepoHunter implements a multi-layered resilience strategy:

Layer 1: Headless Browser with Anti-Fingerprinting

# ✅ GOOD: Use headless browser with stealth
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch a headless Chromium instance and open a new page
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Anti-fingerprinting: hide the navigator.webdriver automation flag
    page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined,
        });
    """)

    page.goto("https://github.com/trending")
    html = page.content()
    browser.close()

Why this works:

  • Headless browser renders JavaScript
  • The init script hides the navigator.webdriver flag that basic detection scripts check
  • More realistic browser signals mean fewer blocks, though no single tweak defeats every detector

Layer 2: Defensive CSS Selector Strategy

# ✅ Sequential fallback selectors (resilient to layout changes)
import re
from bs4 import BeautifulSoup

class DefensiveParser:
    def extract_repositories(self, html):
        soup = BeautifulSoup(html, 'html.parser')

        # Primary selector (current GitHub layout)
        repos = soup.select('article h1 a')

        if not repos:  # Fallback 1: GitHub mutated layout
            repos = soup.select('h2 a[href*="/"]')

        if not repos:  # Fallback 2: Different class names
            repos = soup.select('[itemprop="name"] a')

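        if repos:
            # Normalize Tag objects to href strings so every branch returns list[str]
            return [a['href'] for a in repos if a.get('href')]

        # Fallback 3: Regex pattern matching for /owner/repo paths
        return re.findall(r'href="(/[^/"]+/[^/"]+)"', html)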

Design Principle: Single Responsibility - parser adapts to layout changes without modifying core logic.
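
One way to extend the fallback idea is to make the chain data-driven and report which strategy matched, so a fallback firing can be logged as an early sign that the primary selector has gone stale. The sketch below is an assumption-laden illustration rather than RepoHunter's code: `extract_with_fallbacks`, `regex_extractor`, and the `STRATEGIES` list are hypothetical names, and regular expressions stand in for CSS selectors only to keep the example dependency-free.

```python
import re
from typing import Callable, List, Sequence, Tuple

Extractor = Callable[[str], List[str]]


def regex_extractor(pattern: str) -> Extractor:
    """Build an extractor that returns every capture-group match of pattern."""
    compiled = re.compile(pattern)
    return lambda html: compiled.findall(html)


def extract_with_fallbacks(
    html: str, strategies: Sequence[Tuple[str, Extractor]]
) -> Tuple[str, List[str]]:
    """Try each named strategy in order; return the first non-empty result.

    The strategy name is returned alongside the results so the caller can
    log which fallback produced them.
    """
    for name, extract in strategies:
        results = extract(html)
        if results:
            return name, results
    return "none", []


# Hypothetical strategies, ordered from most to least specific
STRATEGIES = [
    ("heading-link", regex_extractor(r'<h2[^>]*>\s*<a[^>]*href="(/[^/"]+/[^/"]+)"')),
    ("any-repo-link", regex_extractor(r'href="(/[^/"]+/[^/"]+)"')),
]
```

The same function works with BeautifulSoup-based extractors, since each strategy is just a callable that takes HTML and returns a list of strings.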


Layer 3: Exception Handling & Graceful Degradation

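# ✅ GOOD: Graceful failure handling
import logging
from time import sleep

import requests

logger = logging.getLogger(__name__)

def fetch_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            # requests.get stands in for the fetch step here; swap in the
            # browser-based fetch where JavaScript rendering is needed
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response
            # 429 (rate limited) or another error status: log, then back off
            logger.warning(f"Attempt {attempt + 1}: HTTP {response.status_code}")
        except requests.RequestException as e:
            logger.error(f"Attempt {attempt + 1}: {e}")

        # Exponential backoff before the next attempt: 1s, 2s, 4s, ...
        if attempt < max_retries - 1:
            sleep(2 ** attempt)

    logger.warning(f"Failed after {max_retries} attempts")
    return None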

Design Pattern: Retry pattern with exponential backoff (prevents hammering server).
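
A common refinement of exponential backoff is to cap the delay and add random jitter so that many clients retrying at once do not synchronize. The helper below is a sketch under stated assumptions: `backoff_delays` and its `base`, `cap`, `jitter`, and `rng` parameters are names introduced here, not part of RepoHunter.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_retries: int,
    base: float = 1.0,
    cap: float = 60.0,
    jitter: bool = True,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait (in seconds) to apply after each failed attempt.

    The ceiling for attempt n is min(cap, base * 2**n). With jitter enabled,
    each delay is drawn uniformly from [0, ceiling] ("full jitter").
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling) if jitter else ceiling)
    return delays
```

Computing the schedule as a pure function keeps the retry loop simple: it only has to iterate over the delays and sleep between attempts.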


Real-World Implementation: RepoHunter

Architecture Overview

┌─────────────────────────────────────────┐
│    RepoHunter Main Flow                 │
├─────────────────────────────────────────┤
│                                         │
│  1. Headless Browser Launch             │
│     └─ Anti-fingerprinting enabled      │
│                                         │
│  2. GitHub Trending Request             │
│     └─ Real browser signals             │
│                                         │
│  3. Defensive Parser                    │
│     └─ Sequential CSS fallbacks         │
│                                         │
│  4. Data Extraction                     │
│     └─ repo name, stars, language       │
│                                         │
│  5. JSON/Markdown Output                │
│     └─ daily_trends.md report           │
│                                         │
└─────────────────────────────────────────┘

Key Components

Component 1: Multiplatform Binary Validation

# ✅ Deterministic path handling across OS
import os
import platform
from pathlib import Path

class BinaryValidator:
    def __init__(self, binary_name):
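        # Build the path with Path operators so separators are correct on every OS
        self.binary_path = Path(f"obscura-{platform.system().lower()}") / binary_name

    def is_executable(self):
        # os.access checks the execute bit on Unix; on Windows it is a weak
        # check that generally passes for any existing file
        return os.access(self.binary_path, os.X_OK)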

    def validate(self):
        if not self.binary_path.exists():
            raise FileNotFoundError(f"Binary not found: {self.binary_path}")
        if not self.is_executable():
            raise PermissionError(f"Binary not executable: {self.binary_path}")
        return True

Design Principle: Separation of Concerns - validation logic isolated from execution.
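
Because Windows has no execute bit, `os.access(path, os.X_OK)` generally reports any existing file as executable there. A stricter, OS-aware check could look like the following sketch. The function name `looks_executable` and the suffix list are assumptions made for illustration, and a real Windows system exposes its own list through the `PATHEXT` environment variable.

```python
import os
from pathlib import Path
from typing import Sequence

# Assumed default; real Windows systems expose their own list via %PATHEXT%
WINDOWS_EXECUTABLE_SUFFIXES = (".exe", ".bat", ".cmd", ".com")


def looks_executable(
    path: Path,
    system: str,
    windows_suffixes: Sequence[str] = WINDOWS_EXECUTABLE_SUFFIXES,
) -> bool:
    """Best-effort executability check keyed on the OS name.

    system is expected to be platform.system().lower(), for example
    "linux", "darwin", or "windows".
    """
    if not path.is_file():
        return False
    if system == "windows":
        # Windows has no execute bit; the file extension decides
        return path.suffix.lower() in windows_suffixes
    # POSIX: defer to the permission bits for the current user
    return os.access(path, os.X_OK)
```

Passing the OS name as a parameter, rather than reading it inside the function, lets both branches be tested on a single machine.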


Component 2: Configurable Scan Policies

# ✅ Strategy pattern: swap algorithms at runtime
from abc import ABC, abstractmethod
from time import sleep

class ScanPolicy(ABC):
    @abstractmethod
    def get_interval(self):
        pass

class FastPolicy(ScanPolicy):
    def get_interval(self):
        return 600  # 10 minutes

class SlowPolicy(ScanPolicy):
    def get_interval(self):
        return 43200  # 12 hours

class RepoHunter:
    def __init__(self, policy: ScanPolicy):
        self.policy = policy

    def run(self):
        while True:
            self.scan()
            sleep(self.policy.get_interval())

    def scan(self):
        pass  # Scanning implementation goes here

Design Pattern: Strategy pattern - policies can be added or removed without modifying RepoHunter.
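
The usage section of this article also lists Normal (1 hour) and Once (single execution) options, which the code above does not show. The sketch below illustrates one way such policies might fit the same interface. `NormalPolicy`, `OncePolicy`, `run_scans`, and the convention that a `None` interval means "stop after this scan" are all assumptions introduced here, not confirmed details of RepoHunter.

```python
from abc import ABC, abstractmethod
from time import sleep
from typing import Callable, Optional


class ScanPolicy(ABC):
    @abstractmethod
    def get_interval(self) -> Optional[int]:
        """Seconds to wait between scans, or None to stop after one scan."""


class NormalPolicy(ScanPolicy):
    def get_interval(self) -> Optional[int]:
        return 3600  # 1 hour


class OncePolicy(ScanPolicy):
    def get_interval(self) -> Optional[int]:
        return None  # Single execution


def run_scans(
    scan: Callable[[], None],
    policy: ScanPolicy,
    wait: Callable[[float], None] = sleep,
    max_scans: Optional[int] = None,
) -> int:
    """Run scan() on the policy's schedule and return the number of scans."""
    count = 0
    while max_scans is None or count < max_scans:
        scan()
        count += 1
        interval = policy.get_interval()
        if interval is None:
            break  # The policy asked for a single execution
        wait(interval)
    return count
```

The `wait` and `max_scans` parameters exist so the loop can be exercised in tests without sleeping for an hour or running forever.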


Key Learnings: What Works in Production

✅ What Makes RepoHunter Resilient

| Strategy | Benefit |
|----------|---------|
| Sequential fallback selectors | Survives CSS layout mutations |
| Headless browser + anti-fingerprinting | Bypasses bot detection |
| Exponential backoff retry | Handles rate limiting gracefully |
| Deterministic path handling | Cross-platform binary execution |
| Strategy pattern configuration | Runtime policy switching |

❌ Common Mistakes to Avoid

❌ Single CSS selector (breaks on layout change)
❌ requests library only (can't render JavaScript)
❌ No retry logic (fails on transient errors)
❌ Hardcoded paths (breaks on different OS)
❌ Monolithic scraper (hard to test/maintain)

Installation & Usage

Get Started with RepoHunter

# Clone
git clone https://github.com/AxthonyV/RepoHunter.git
cd RepoHunter

# Install dependencies
pip install -r requirements.txt

# Run
python repo_hunter.py

Select scan policy:

  • Fast: 10 minutes
  • Normal: 1 hour
  • Slow: 12 hours
  • Once: Single execution

Example Output

# Trending Repositories (2026-06-07)

| Repository | Stars | Language | Description |
|-----------|-------|----------|-------------|
| openai/gpt-4 | 15,234 | Python | Advanced LLM Framework |
| vercel/next.js | 12,456 | JavaScript | React Framework |
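
A report in this shape could be produced by a small rendering function. The following is a sketch only: the `TrendingRepo` dataclass and `render_report` function are hypothetical names, and the field set is inferred from the table columns above rather than taken from RepoHunter's source.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TrendingRepo:
    name: str          # "owner/repo"
    stars: int
    language: str
    description: str


def render_report(repos: Iterable[TrendingRepo], date: str) -> str:
    """Render repositories as a Markdown table in the layout shown above."""
    lines = [
        f"# Trending Repositories ({date})",
        "",
        "| Repository | Stars | Language | Description |",
        "|-----------|-------|----------|-------------|",
    ]
    for repo in repos:
        # Escape pipes so free-text descriptions cannot break the table
        description = repo.description.replace("|", "\\|")
        lines.append(
            f"| {repo.name} | {repo.stars:,} | {repo.language} | {description} |"
        )
    return "\n".join(lines)
```

Keeping rendering separate from extraction means the same list of `TrendingRepo` objects could also be serialized to JSON without touching the parser.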

Architecture & Design Principles

RepoHunter demonstrates several SOLID principles:

Single Responsibility Principle (SRP)

  • Parser handles HTML extraction
  • BinaryValidator handles binary validation
  • RepoHunter orchestrates flow

Open/Closed Principle (OCP)

  • Add new ScanPolicy without modifying RepoHunter
  • Add new parsing logic without breaking existing

Dependency Inversion (DIP)

  • RepoHunter depends on abstract ScanPolicy, not concrete implementations

Educational Value

RepoHunter is perfect for learning:

  • Web Scraping Architecture: Real-world resilience patterns
  • Design Patterns: Strategy pattern in action
  • Systems Programming: Cross-platform binary handling
  • Error Handling: Graceful failure & retry logic
  • Clean Code: Well-structured, testable components


Conclusion

Web scraping at scale requires defensive architecture, not just quick scripts. RepoHunter demonstrates production-grade patterns:

  1. Headless browsers for JavaScript rendering
  2. Sequential fallbacks for resilience
  3. Graceful degradation for error handling
  4. Design patterns for maintainability

Next Steps


Resources


Questions? Feedback? Connect on GitHub 🚀
