
Tisone Kironget

Scrapamoja: A Python Web Scraping Framework

I've been building Scrapamoja — a Python scraping framework built on one idea: you shouldn't need a different tool for every site you scrape.

🔗 GitHub: https://github.com/TisoneK/scrapamoja

What is Scrapamoja?

Scrapamoja blends the English word Scrape with the Swahili word Pamoja — meaning together. Scrape together. One scraper, many sites. One framework, many contributors.

It's also a quiet nod to Moja, Swahili for one — the idea that you shouldn't need a different tool for every site you want to scrape. One framework should be enough, and it should be good enough that anyone can extend it.

That philosophy shapes everything about how Scrapamoja is built. It's not a scraper — it's the infrastructure that makes scrapers reliable: handling anti-bot measures, selector drift, network failures, and browser resource leaks so you don't have to. New sites can be added by anyone, existing ones improved by the community, and the whole thing grows stronger the more people contribute to it.

Scrape together. Build together.


Core Framework Capabilities

🎯 Intelligent Selector Engine

The selector engine is the heart of Scrapamoja. Instead of brittle single-selector lookups, it uses a multi-strategy approach — CSS, XPath, and text-based selectors can all be defined for the same element. Each strategy is weighted, and the engine picks the best match with a confidence score. When a selector fails, it falls back gracefully rather than crashing. Selectors are defined in YAML, not hardcoded, making them easy to maintain without touching Python.

Site → Sport → Status → Context → Element
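To make the idea concrete, here is a minimal sketch of weighted multi-strategy resolution. The names (`SelectorStrategy`, `resolve`) and the confidence formula are illustrative assumptions, not Scrapamoja's actual API:

```python
# Hypothetical sketch: try several selector strategies for one element,
# score each hit by its weight, and return the most confident match.
from dataclasses import dataclass


@dataclass
class SelectorStrategy:
    kind: str          # "css", "xpath", or "text"
    expression: str
    weight: float      # relative trust in this strategy


def resolve(strategies, find):
    """Return (match, confidence) for the best hit, or None if all fail.

    `find` is a callable (kind, expression) -> match-or-None supplied by
    the browser layer. A failed strategy is skipped, never fatal.
    """
    best = None
    total = sum(s.weight for s in strategies) or 1.0
    for s in sorted(strategies, key=lambda s: s.weight, reverse=True):
        match = find(s.kind, s.expression)
        if match is not None:
            confidence = s.weight / total
            if best is None or confidence > best[1]:
                best = (match, confidence)
    return best  # None tells the caller to fall back gracefully
```

The key design point is that a `None` result is an ordinary return value, not an exception, so a missing element degrades into a fallback path instead of a crash.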

🛡️ Resilience System

Built around the assumption that things will go wrong. Automatic retries with exponential backoff, failure classification (network vs. selector vs. parse errors), checkpoint-based recovery so long scrapes can resume, and a coordinator that ensures graceful shutdown even mid-scrape.
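As a rough sketch of the retry half of that story, the loop below retries only failures classified as transient, with exponential backoff between attempts. The exception names and `with_retries` helper are assumptions for illustration, not the framework's real classes:

```python
# Hedged sketch: exponential backoff for retryable failures only.
import time


class NetworkError(Exception):
    """Transient: timeouts, connection resets."""


class SelectorError(Exception):
    """Usually permanent: the page changed, retrying won't help."""


RETRYABLE = (NetworkError,)


def with_retries(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on a retryable failure, back off 0.5s, 1s, 2s, ..."""
    for attempt in range(attempts):
        try:
            return fn()
        except RETRYABLE:
            if attempt == attempts - 1:
                raise  # exhausted: surface the real error
            sleep(base_delay * (2 ** attempt))
```

Classifying before retrying matters: retrying a selector mismatch just burns time, while retrying a dropped connection usually succeeds.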

🕵️ Stealth & Anti-Detection

A dedicated stealth module handles fingerprint randomization, human-like behavior simulation, consent popup handling, and proxy rotation. Sites that actively fight scrapers become manageable targets.
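A toy sketch of the fingerprint-randomization idea: pick a fresh user agent and viewport per session so consecutive sessions don't look identical. The pools and function name here are made up for illustration:

```python
# Illustrative fingerprint randomization: rotate identifying traits
# per session. Real pools would be much larger and kept current.
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Example/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Example/1.0",
]
VIEWPORTS = [(1366, 768), (1440, 900), (1920, 1080)]


def random_fingerprint(rng=random):
    """Return one randomized browser identity for a new session."""
    return {
        "user_agent": rng.choice(USER_AGENTS),
        "viewport": rng.choice(VIEWPORTS),
    }
```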

🔍 Snapshot Debugging

When a scrape fails, Scrapamoja captures a full snapshot: the page HTML, a screenshot, structured logs, and selector resolution traces — all correlated by session ID. Debugging a failure means looking at exactly what the browser saw, not guessing.

📊 Telemetry & Observability

Structured JSON logging with correlation IDs, built-in metrics collection (execution time, success rates, selector confidence distributions), and alerting hooks. Production scrapers need production-grade monitoring.
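A minimal sketch of what one such log line could look like; the field names are illustrative, not Scrapamoja's real schema:

```python
# Hedged sketch: one structured JSON log line carrying a correlation ID,
# so every event from a scrape session can be grepped and joined later.
import json
import time


def log_event(correlation_id, event, **fields):
    """Serialize one structured log record as a JSON string."""
    record = {
        "ts": time.time(),
        "correlation_id": correlation_id,
        "event": event,
        **fields,
    }
    return json.dumps(record)
```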

🌐 Browser Lifecycle Management

Browser and page pooling, session state persistence, tab management, resource monitoring (memory, CPU), and corruption detection. Long-running scrapers won't leak memory or leave orphaned browser processes.
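The pooling idea can be sketched as a bounded pool that reuses idle browsers and disposes of the overflow. `BrowserPool` and its methods are illustrative names, not the framework's actual classes:

```python
# Illustrative bounded browser pool: reuse idle instances, close extras
# so long-running scrapers don't accumulate orphaned processes.
from collections import deque


class BrowserPool:
    def __init__(self, factory, max_size=3):
        self._factory = factory   # callable that launches a browser
        self._idle = deque()
        self._max = max_size
        self.created = 0          # how many real launches happened

    def acquire(self):
        if self._idle:
            return self._idle.popleft()   # reuse an idle browser
        self.created += 1
        return self._factory()

    def release(self, browser):
        if len(self._idle) < self._max:
            self._idle.append(browser)    # keep warm for reuse
        else:
            browser.close()               # over capacity: dispose
```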

🔀 Hybrid Extraction Modes

Scrapamoja chooses the optimal extraction method based on each target site's architecture:

| Mode | Description | Use Case |
| --- | --- | --- |
| DOM Mode (default) | Navigate with browser, extract from HTML | Sites requiring full rendering |
| Direct API Mode | Skip browser, call APIs directly | Open APIs, millisecond latency |
| Network Interception | Capture API responses during browser session | Sites requiring browser initialization |
| Hybrid Mode | Browser once to harvest session, then direct HTTP | Sites requiring authenticated sessions |
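The decision could be sketched as a small dispatch over a site profile. The `SiteProfile` fields and mode strings below mirror the table but are illustrative assumptions, not the framework's real API:

```python
# Hedged sketch: pick an extraction mode from a site's characteristics.
from dataclasses import dataclass


@dataclass
class SiteProfile:
    has_open_api: bool = False
    needs_rendering: bool = False
    needs_auth_session: bool = False


def choose_mode(profile):
    """Map a site profile to one of the four extraction modes."""
    if profile.has_open_api and not profile.needs_auth_session:
        return "direct_api"        # skip the browser entirely
    if profile.needs_auth_session:
        return "hybrid"            # browser once to harvest the session
    if profile.needs_rendering:
        return "dom"               # full browser rendering (the default)
    return "network_interception"  # let the browser init, capture its API calls
```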

If you're building on top of the web, I'd love to hear what you're scraping.

👉 https://github.com/TisoneK/scrapamoja
