John Wakaba
🌍 Automating Africa’s Energy Data Collection Using Python, Playwright (and Why Playwright?), and MongoDB (2000–2024)

⚡ Introduction

In today’s data-driven world, access to reliable and structured energy data is critical for decision-making, research, and policy planning.

However, most open data platforms in Africa — such as the Africa Energy Portal (AEP) — present information in dashboard views, which makes large-scale analysis tedious.

To address this challenge, I built a fully automated ETL (Extract, Transform, Load) pipeline that:

  • Scrapes energy indicators for all African countries (2000–2024),
  • Formats and validates the data for consistency,
  • And stores it in a MongoDB database for easy access and analysis.

This project uses Python, Playwright, and MongoDB, with automation powered by the lightweight dependency manager uv.


🧩 Problem Statement

While the Africa Energy Portal provides valuable country-level datasets, it does not offer a bulk download option.

Researchers, analysts, and energy planners need historical time-series data — such as:

  • Electricity generation and consumption
  • Renewable energy contribution
  • Access to clean cooking
  • Population electrification (rural vs urban)

Manually downloading data for 50+ African countries and 20+ years would take days — not counting inconsistencies in data formats and missing years.

The solution: automate it end-to-end.


🧠 Project Goals

  1. Extract data for all African countries directly from the AEP website.
  2. Transform it into a structured, tabular format for analysis.
  3. Store it efficiently in MongoDB for scalability and retrieval.
  4. Validate data completeness and consistency across countries and indicators.
  5. Export the final cleaned dataset for analysis and sharing.

⚙️ Tools & Technologies

| Purpose | Tool / Library | Role |
|---------|----------------|------|
| Web scraping | Playwright | Automates browser-based data capture |
| Environment & dependency management | uv | Manages the virtual environment and packages |
| Data storage | MongoDB | Stores country-wise metrics and yearly data |
| Data validation & analysis | pandas, pydantic | Cleans and structures data |
| Export | openpyxl | Saves Excel files |
| Scripting | Python | Glue for the entire ETL process |

🔄 ETL Pipeline Overview

The pipeline consists of four modular stages:

Stage 1 – Data Extraction

  • Uses Playwright to navigate to each country’s profile page.
  • Intercepts the /get-country-data XHR response.
  • Extracts JSON payloads containing all available indicators and yearly values (a code sketch follows the example record below).

Each JSON record includes:

```json
{
  "country": "Kenya",
  "metric": "Population with access to electricity - National",
  "sector": "ELECTRICITY ACCESS",
  "yearly": {
    "2015": 19.65,
    "2016": 25.73,
    "2022": 42.62
  }
}
```
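
For illustration, here is a minimal sketch of what this stage can look like with Playwright’s sync API. The slug list, helper name, and output file are my own choices for the example, not the exact production script:

```python
# stage1_extract.py: illustrative sketch of the extraction stage
import json
from playwright.sync_api import sync_playwright

COUNTRY_SLUGS = ["kenya", "nigeria", "ghana"]  # assumed subset; the real run covers 50+ countries

def extract_country(page, slug: str) -> dict:
    # Wait for the XHR the site fires after the template loads, and grab its JSON body.
    with page.expect_response(
        lambda r: "get-country-data" in r.url and r.status == 200
    ) as resp_info:
        page.goto(f"https://africa-energy-portal.org/country/{slug}", timeout=60_000)
    return resp_info.value.json()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    raw = {slug: extract_country(page, slug) for slug in COUNTRY_SLUGS}
    browser.close()

with open("raw_country_data.json", "w", encoding="utf-8") as f:
    json.dump(raw, f, indent=2)
```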

Stage 2 – Data Formatting

  • Converts raw JSON into a tabular schema: ["country", "country_serial", "metric", "unit", "sector", "sub_sector", "sub_sub_sector", "source_link", "source", "2000", ..., "2024"]
  • Ensures each row represents one metric for one country.
  • Fills missing years with null values to maintain consistency.
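
A simplified sketch of this formatting step (the helper name is mine; the columns follow the schema listed above):

```python
# stage2_format.py: flatten one raw record into the tabular schema (sketch)
YEARS = [str(y) for y in range(2000, 2025)]

META_COLS = ["country", "country_serial", "metric", "unit", "sector",
             "sub_sector", "sub_sub_sector", "source_link", "source"]

def to_row(record: dict) -> dict:
    """One row = one metric for one country, with a column per year (None if missing)."""
    row = {col: record.get(col) for col in META_COLS}
    yearly = record.get("yearly", {})
    for year in YEARS:
        row[year] = yearly.get(year)  # missing years stay None, stored as null later
    return row
```

Each dictionary produced here maps directly onto one MongoDB document in the next stage.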

Stage 3 – Data Storage

  • Inserts formatted records into MongoDB using pymongo.
  • Adds a unique index (country, metric, source) to prevent duplicates.
  • Upserts records — ensuring updates don’t create duplicates (a sketch follows the example document below).

Each MongoDB document looks like this:

```json
{
  "country": "Kenya",
  "metric": "Access to Clean Cooking%",
  "source": "Tracking SDG7/WBG",
  "2000": null,
  "2015": 11.9,
  "2020": 23.6,
  "2024": null
}
```
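
A hedged sketch of the storage step with pymongo (the database and collection names are placeholders chosen for the example):

```python
# stage3_store.py: upsert formatted rows into MongoDB (sketch; db/collection names assumed)
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
collection = client["aep_energy"]["indicators"]

# Unique compound index so re-runs update existing documents instead of duplicating them.
collection.create_index(
    [("country", ASCENDING), ("metric", ASCENDING), ("source", ASCENDING)],
    unique=True,
)

def upsert_row(row: dict) -> None:
    collection.update_one(
        {"country": row["country"], "metric": row["metric"], "source": row.get("source")},
        {"$set": row},
        upsert=True,
    )
```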

Stage 4 – Validation

  • Identifies missing years or inconsistent units.
  • Detects countries with incomplete datasets.
  • Exports a detailed validation_report.csv that flags issues automatically (a sketch of this check follows the sample output below).

Sample output:
| issue_type | country | metric | details |
|-------------|----------|--------|----------|
| MISSING_YEARS | Kenya | Access to Clean Cooking% | 2000–2014, 2023–2024 |
| UNIT_INCONSISTENCY | ALL | Electricity Access | %; MW |
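
For context, here is a minimal sketch of how the missing-years part of that report could be produced (rows follow the Stage 2 schema; the helper names are mine):

```python
# validate.py: flag rows with missing years and write validation_report.csv (sketch)
import csv

YEARS = [str(y) for y in range(2000, 2025)]

def missing_years(row: dict) -> list:
    return [y for y in YEARS if row.get(y) is None]

def write_report(rows: list, path: str = "validation_report.csv") -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["issue_type", "country", "metric", "details"])
        writer.writeheader()
        for row in rows:
            gaps = missing_years(row)
            if gaps:
                writer.writerow({
                    "issue_type": "MISSING_YEARS",
                    "country": row["country"],
                    "metric": row["metric"],
                    "details": ", ".join(gaps),
                })
```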


🧾 Data Export

Once the ETL pipeline finishes, data is exported to both CSV and Excel formats for analysis.

```bash
uv run python export_to_csv.py
```

Output files:

  • reports/exports/energy_data.csv
  • reports/exports/energy_data.xlsx
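
For reference, a sketch of what an export script along these lines could contain (the connection string and collection names mirror the storage sketch above and are assumptions):

```python
# export_to_csv.py: dump the MongoDB collection to CSV and Excel (sketch)
import pandas as pd
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["aep_energy"]["indicators"]

# Drop the internal _id field and write both formats (Excel output needs openpyxl installed).
df = pd.DataFrame(list(collection.find({}, {"_id": 0})))
df.to_csv("reports/exports/energy_data.csv", index=False)
df.to_excel("reports/exports/energy_data.xlsx", index=False)
```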

🎭 Why Playwright Was Essential for This Project

Many would wonder:

“Why not just use requests or BeautifulSoup to get the data?”

The Africa Energy Portal (AEP) website is highly dynamic — it doesn’t serve raw data directly in the page HTML. Instead, when you open a country page like
https://africa-energy-portal.org/country/kenya,
the browser first loads a basic template, and then JavaScript makes a hidden request to an internal endpoint:

```
POST https://africa-energy-portal.org/get-country-data
```

This is the actual source of all the energy statistics.

🔹 1. Dynamic JavaScript Rendering

The data (electricity access, renewables, etc.) is fetched asynchronously after the page loads.
Traditional libraries like requests only download the HTML shell, missing all those dynamic values.

Playwright, however, executes JavaScript in a real browser, allowing it to:

  • Wait until the data request is made,
  • Intercept the /get-country-data response,
  • Capture the full JSON payload in real time.

This gives us clean, structured data instead of messy HTML scraping.

🔹 2. Cloudflare and Anti-Bot Protection

The AEP website uses Cloudflare security, which blocks automated clients that don’t behave like browsers.
When we tried using requests, we got frequent 403 (Forbidden) and 500 (Server Error) responses.

Playwright solves this because it:

  • Runs a full Chromium browser (just like Chrome or Edge).
  • Sends real headers, cookies, and browsing patterns.
  • Behaves far more like a normal visitor than a bare HTTP client does.

This made it the only reliable way to consistently access data without breaking terms of service or scraping hidden content.

🔹 3. Network Interception

Playwright allows us to listen for specific network calls using:

```python
page.expect_response(lambda r: "get-country-data" in r.url and r.status == 200)
```
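
In practice that predicate is used as a context manager around the navigation, roughly like this (country_url is a placeholder):

```python
with page.expect_response(
    lambda r: "get-country-data" in r.url and r.status == 200
) as resp_info:
    page.goto(country_url)  # the navigation triggers the hidden XHR
payload = resp_info.value.json()  # the intercepted JSON body
```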

That means we can:

  • Capture the JSON the site uses internally,
  • Save it instantly,
  • And avoid parsing the visual layout at all.

It’s essentially like “catching the data packet” as it flies through the browser.

🔹 4. Reliability & Control

Because we were scraping 50+ countries, we needed:

  • Timeout handling (some pages take 30–60s to load),
  • Retries for network issues,
  • Throttling to avoid overwhelming the server.

Playwright gives us those controls — ensuring we get consistent, ethical, and stable scraping.
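
A rough sketch of the kind of retry and throttling wrapper this implies (attempt counts and delays are illustrative; extract_country is the Stage 1 helper sketched earlier):

```python
# Retry a country fetch a few times, pausing between attempts (sketch).
import time
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

def fetch_with_retries(page, slug: str, attempts: int = 3, delay_s: float = 5.0):
    for attempt in range(1, attempts + 1):
        try:
            return extract_country(page, slug)
        except PlaywrightTimeoutError:
            print(f"[{slug}] timed out (attempt {attempt}/{attempts}), retrying...")
            time.sleep(delay_s)
    return None  # give up after the final attempt

# Throttle between countries so the server isn't hammered:
# for slug in COUNTRY_SLUGS:
#     data = fetch_with_retries(page, slug)
#     time.sleep(2)
```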

🔹 5. Clean ETL Integration

With Playwright, our Stage 1 data extraction directly produces structured JSON, like:

```json
{
  "country": "Kenya",
  "metric": "Access to Clean Cooking%",
  "sector": "Basic data",
  "yearly": {
    "2015": 11.9,
    "2016": 13.8,
    "2022": 30.0
  }
}
```

That JSON is ready for transformation and MongoDB storage, making the pipeline efficient end-to-end.

In short:
Playwright wasn’t just a choice — it was a necessity.
It allowed this ETL project to move from fragile scraping to robust, browser-level automation, ensuring accurate and repeatable extraction of Africa’s energy data. ⚡

⚠️ Challenges Faced

| Challenge | Description |
|-----------|-------------|
| Cloudflare protection | The AEP website blocked simple HTTP requests (403, 500). Solved by using Playwright’s browser simulation to mimic human behavior. |
| Slow response times | Some pages took >30 seconds to return data. Added retry logic and longer timeouts. |
| Inconsistent URL naming | Country URLs (like cote-d’ivoire vs cote-divoire) required slug normalization logic (see the sketch below this table). |
| Incomplete datasets | Some countries lacked data for certain years, handled via validation. |
| Browser resource use | Playwright’s real browser automation was resource-heavy; introduced throttling to manage load. |
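
As an example of the slug normalization mentioned in the table, here is a hypothetical helper (not the exact logic used in the project):

```python
# Turn a country name into the URL slug style the portal expects (illustrative).
import re
import unicodedata

def country_slug(name: str) -> str:
    # Strip accents, drop apostrophes, and collapse everything else to hyphens.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    ascii_name = ascii_name.lower().replace("'", "")
    return re.sub(r"[^a-z0-9]+", "-", ascii_name).strip("-")

# country_slug("Côte d'Ivoire") -> "cote-divoire"
```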

📊 Results

  • ✅ Successfully extracted data for 50 African countries
  • ✅ Collected 500+ indicators covering 2000–2024
  • ✅ All records stored in MongoDB with proper schema
  • ✅ Automated validation caught missing and inconsistent data
  • ✅ Exportable formats ready for visualization and analysis

💡 Key Takeaways

  • Automating data extraction from protected websites is possible using browser-level automation (Playwright).
  • Designing modular ETL stages makes maintenance and debugging easier.
  • Data validation is just as important as extraction — raw data is rarely clean.
  • Storing data in MongoDB offers flexibility for hierarchical (nested) data structures.

🧠 Future Work

  • Build an interactive dashboard using Streamlit or Power BI.
  • Automate periodic updates (monthly/quarterly).
  • Add country-level time-series visualization modules.
