
Tisone Kironget

Scrapamoja: A Python Web Scraping Framework

I've been building Scrapamoja — a Python scraping framework built on one idea: you shouldn't need a different tool for every site you scrape.

🔗 GitHub: https://github.com/TisoneK/scrapamoja

What is Scrapamoja?

Scrapamoja blends the English word Scrape with the Swahili word Pamoja — meaning together. Scrape together. One scraper, many sites. One framework, many contributors.

It's also a quiet nod to Moja, Swahili for one — the idea that you shouldn't need a different tool for every site you want to scrape. One framework should be enough, and it should be good enough that anyone can extend it.

That philosophy shapes everything about how Scrapamoja is built. It's not a scraper — it's the infrastructure that makes scrapers reliable: handling anti-bot measures, selector drift, network failures, and browser resource leaks so you don't have to. New sites can be added by anyone, existing ones improved by the community, and the whole thing grows stronger the more people contribute to it.

Scrape together. Build together.


Core Framework Capabilities

🎯 Intelligent Selector Engine

The selector engine is the heart of Scrapamoja. Instead of brittle single-selector lookups, it uses a multi-strategy approach — CSS, XPath, and text-based selectors can all be defined for the same element. Each strategy is weighted, and the engine picks the best match with a confidence score. When a selector fails, it falls back gracefully rather than crashing. Selectors are defined in YAML, not hardcoded, making them easy to maintain without touching Python.

Site → Sport → Status → Context → Element
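To make the idea concrete, here is a minimal sketch of weighted multi-strategy resolution. The names (`SelectorStrategy`, `resolve`) and the confidence formula are illustrative assumptions, not Scrapamoja's actual API:

```python
# Hypothetical sketch: try several selector strategies for one element,
# score each hit by its weight, and return the most confident match.
from dataclasses import dataclass


@dataclass
class SelectorStrategy:
    kind: str          # "css", "xpath", or "text"
    expression: str
    weight: float      # relative trust in this strategy


def resolve(strategies, find):
    """Return (match, confidence) for the best hit, or None if all fail.

    `find` is a callable (kind, expression) -> match-or-None supplied by
    the browser layer. A failed strategy is skipped, never fatal.
    """
    best = None
    total = sum(s.weight for s in strategies) or 1.0
    for s in sorted(strategies, key=lambda s: s.weight, reverse=True):
        match = find(s.kind, s.expression)
        if match is not None:
            confidence = s.weight / total
            if best is None or confidence > best[1]:
                best = (match, confidence)
    return best  # None tells the caller to fall back gracefully
```

The key design point is that a `None` result is an ordinary return value, not an exception, so a missing element degrades into a fallback path instead of a crash.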

🛡️ Resilience System

Built around the assumption that things will go wrong. Automatic retries with exponential backoff, failure classification (network vs. selector vs. parse errors), checkpoint-based recovery so long scrapes can resume, and a coordinator that ensures graceful shutdown even mid-scrape.
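As a rough sketch of the retry half of that story, the loop below retries only failures classified as transient, with exponential backoff between attempts. The exception names and `with_retries` helper are assumptions for illustration, not the framework's real classes:

```python
# Hedged sketch: exponential backoff for retryable failures only.
import time


class NetworkError(Exception):
    """Transient: timeouts, connection resets."""


class SelectorError(Exception):
    """Usually permanent: the page changed, retrying won't help."""


RETRYABLE = (NetworkError,)


def with_retries(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on a retryable failure, back off 0.5s, 1s, 2s, ..."""
    for attempt in range(attempts):
        try:
            return fn()
        except RETRYABLE:
            if attempt == attempts - 1:
                raise  # exhausted: surface the real error
            sleep(base_delay * (2 ** attempt))
```

Classifying before retrying matters: retrying a selector mismatch just burns time, while retrying a dropped connection usually succeeds.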

🕵️ Stealth & Anti-Detection

A dedicated stealth module handles fingerprint randomization, human-like behavior simulation, consent popup handling, and proxy rotation. Sites that actively fight scrapers become manageable targets.
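A toy sketch of the fingerprint-randomization idea: pick a fresh user agent and viewport per session so consecutive sessions don't look identical. The pools and function name here are made up for illustration:

```python
# Illustrative fingerprint randomization: rotate identifying traits
# per session. Real pools would be much larger and kept current.
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Example/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Example/1.0",
]
VIEWPORTS = [(1366, 768), (1440, 900), (1920, 1080)]


def random_fingerprint(rng=random):
    """Return one randomized browser identity for a new session."""
    return {
        "user_agent": rng.choice(USER_AGENTS),
        "viewport": rng.choice(VIEWPORTS),
    }
```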

🔍 Snapshot Debugging

When a scrape fails, Scrapamoja captures a full snapshot: the page HTML, a screenshot, structured logs, and selector resolution traces — all correlated by session ID. Debugging a failure means looking at exactly what the browser saw, not guessing.

📊 Telemetry & Observability

Structured JSON logging with correlation IDs, built-in metrics collection (execution time, success rates, selector confidence distributions), and alerting hooks. Production scrapers need production-grade monitoring.
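A minimal sketch of what one such log line could look like; the field names are illustrative, not Scrapamoja's real schema:

```python
# Hedged sketch: one structured JSON log line carrying a correlation ID,
# so every event from a scrape session can be grepped and joined later.
import json
import time


def log_event(correlation_id, event, **fields):
    """Serialize one structured log record as a JSON string."""
    record = {
        "ts": time.time(),
        "correlation_id": correlation_id,
        "event": event,
        **fields,
    }
    return json.dumps(record)
```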

🌐 Browser Lifecycle Management

Browser and page pooling, session state persistence, tab management, resource monitoring (memory, CPU), and corruption detection. Long-running scrapers won't leak memory or leave orphaned browser processes.
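The pooling idea can be sketched as a bounded pool that reuses idle browsers and disposes of the overflow. `BrowserPool` and its methods are illustrative names, not the framework's actual classes:

```python
# Illustrative bounded browser pool: reuse idle instances, close extras
# so long-running scrapers don't accumulate orphaned processes.
from collections import deque


class BrowserPool:
    def __init__(self, factory, max_size=3):
        self._factory = factory   # callable that launches a browser
        self._idle = deque()
        self._max = max_size
        self.created = 0          # how many real launches happened

    def acquire(self):
        if self._idle:
            return self._idle.popleft()   # reuse an idle browser
        self.created += 1
        return self._factory()

    def release(self, browser):
        if len(self._idle) < self._max:
            self._idle.append(browser)    # keep warm for reuse
        else:
            browser.close()               # over capacity: dispose
```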

🔀 Hybrid Extraction Modes

Scrapamoja chooses the optimal extraction method based on each target site's architecture:

| Mode | Description | Use Case |
| --- | --- | --- |
| DOM Mode (default) | Navigate with browser, extract from HTML | Sites requiring full rendering |
| Direct API Mode | Skip browser, call APIs directly | Open APIs, millisecond latency |
| Network Interception | Capture API responses during browser session | Sites requiring browser initialization |
| Hybrid Mode | Browser once to harvest session, then direct HTTP | Sites requiring authenticated sessions |
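The decision could be sketched as a small dispatch over a site profile. The `SiteProfile` fields and mode strings below mirror the table but are illustrative assumptions, not the framework's real API:

```python
# Hedged sketch: pick an extraction mode from a site's characteristics.
from dataclasses import dataclass


@dataclass
class SiteProfile:
    has_open_api: bool = False
    needs_rendering: bool = False
    needs_auth_session: bool = False


def choose_mode(profile):
    """Map a site profile to one of the four extraction modes."""
    if profile.has_open_api and not profile.needs_auth_session:
        return "direct_api"        # skip the browser entirely
    if profile.needs_auth_session:
        return "hybrid"            # browser once to harvest the session
    if profile.needs_rendering:
        return "dom"               # full browser rendering (the default)
    return "network_interception"  # let the browser init, capture its API calls
```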

If you're building on top of the web, I'd love to hear what you're scraping.

👉 https://github.com/TisoneK/scrapamoja
