Haji Rufai

Posted on May 23

Building an African Economic Data Pipeline with Python, DuckDB & World Bank API

#dataengineering #python #sql #tutorial

Every data engineer knows the struggle: finding a project that's both technically impressive and genuinely useful. Today I'll walk you through AfriData Pipeline — a production-grade ETL system that extracts economic data for all 54 African countries, loads it into a DuckDB analytical warehouse, and serves an interactive dashboard.

No paid APIs. No cloud services required. Just Python, DuckDB, and free public data.

Why This Project?

Africa's economy is growing fast, but finding clean, consolidated economic data is surprisingly hard. The World Bank has an amazing free API with 16,000+ indicators — but raw API responses need serious engineering to become useful.

This project demonstrates:

ETL pipeline design with proper error handling and retries
Dimensional modeling (star schema) in DuckDB
Data quality engineering — automated checks for completeness, validity, and freshness
Full-stack delivery — from raw API to interactive dashboard

Architecture Overview

World Bank API v2 → Extract (httpx) → Transform (Python) → Load (DuckDB)
                                                               ↓
                                            Export JSON → Static Dashboard (Vercel)

The pipeline processes 13,500 data points (54 countries × 10 indicators × 25 years) in under 50 seconds.
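
The `python -m pipeline.main all` command used later suggests the stages can be run by name. As a rough sketch of how such an entry point might be wired, the dispatcher below runs one stage or all of them in order. The stage names, `run_pipeline`, and `main` are hypothetical and are not taken from the project's source.

```python
import argparse
from typing import Callable


def run_pipeline(stage: str, stages: dict[str, Callable[[], None]]) -> list[str]:
    """Run one named stage, or every stage in insertion order for 'all'."""
    names = list(stages) if stage == "all" else [stage]
    for name in names:
        stages[name]()
    return names


def main(argv=None) -> None:
    # Hypothetical stage names that mirror the arrows in the diagram above.
    stages = {
        "extract": lambda: print("extract: World Bank API -> raw records"),
        "transform": lambda: print("transform: clean rows + YoY change"),
        "load": lambda: print("load: rows -> DuckDB star schema"),
        "export": lambda: print("export: DuckDB -> dashboard JSON"),
    }
    parser = argparse.ArgumentParser(description="AfriData pipeline (sketch)")
    parser.add_argument("stage", choices=[*stages, "all"])
    run_pipeline(parser.parse_args(argv).stage, stages)
```

Calling `main(["all"])` runs the four placeholder stages in order; a real pipeline would swap the lambdas for the actual stage functions.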

The Data: 10 Key Indicators

I selected indicators that tell a comprehensive economic story:

Indicator	Category	Why It Matters
GDP (US$)	Economy	Total economic output
GDP Growth (%)	Economy	Economic momentum
Population	Demographics	Scale context
Inflation (CPI)	Economy	Cost of living pressure
Unemployment	Labor	Job market health
Life Expectancy	Health	Quality of life proxy
Internet Users (%)	Technology	Digital readiness
Electricity Access (%)	Infrastructure	Development foundation
Literacy Rate (%)	Education	Human capital
FDI Inflows (% GDP)	Investment	External confidence
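
Each row in the table corresponds to a World Bank series code. The post does not list the codes the pipeline requests, so the mapping below is an assumption based on the commonly used World Bank series for each concept, and the `INDICATORS` name and its layout are hypothetical.

```python
# Assumed mapping from the table above to World Bank series codes.
# The project may use different series (for example another unemployment measure).
INDICATORS = {
    "NY.GDP.MKTP.CD": ("GDP (US$)", "Economy"),
    "NY.GDP.MKTP.KD.ZG": ("GDP Growth (%)", "Economy"),
    "SP.POP.TOTL": ("Population", "Demographics"),
    "FP.CPI.TOTL.ZG": ("Inflation (CPI)", "Economy"),
    "SL.UEM.TOTL.ZS": ("Unemployment", "Labor"),
    "SP.DYN.LE00.IN": ("Life Expectancy", "Health"),
    "IT.NET.USER.ZS": ("Internet Users (%)", "Technology"),
    "EG.ELC.ACCS.ZS": ("Electricity Access (%)", "Infrastructure"),
    "SE.ADT.LITR.ZS": ("Literacy Rate (%)", "Education"),
    "BX.KLT.DINV.WD.GD.ZS": ("FDI Inflows (% GDP)", "Investment"),
}


def indicators_by_category(indicators: dict) -> dict:
    """Group indicator codes by category, preserving insertion order."""
    grouped = {}
    for code, (_name, category) in indicators.items():
        grouped.setdefault(category, []).append(code)
    return grouped
```

Keeping the codes in one dictionary means the extract loop, the `dim_indicator` table, and the dashboard labels can all be driven from a single source of truth.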

Building the Extract Layer

The World Bank API v2 is beautifully simple — no auth required, JSON responses, and you can batch multiple countries in one request:

import httpx
import time

WB_BASE = "https://api.worldbank.org/v2"
MAX_RETRIES = 3

def extract_indicator(client: httpx.Client, indicator_code: str,
                      country_codes: str) -> list[dict]:
    """Fetch one indicator for a semicolon-separated list of country codes."""
    url = (f"{WB_BASE}/country/{country_codes}/indicator/{indicator_code}"
           f"?format=json&date=2000:2024&per_page=10000")

    for attempt in range(MAX_RETRIES):
        try:
            resp = client.get(url, timeout=60)
            resp.raise_for_status()
            data = resp.json()
            # World Bank returns [metadata, records]
            if isinstance(data, list) and len(data) == 2:
                return data[1] or []
        except (httpx.HTTPStatusError, httpx.TransportError):
            pass  # timeouts and connection errors are retried as well
        # Failed request or unexpected payload: back off 2s, 4s, 8s
        time.sleep(2 * (2 ** attempt))
    return []

Key design decisions:

Exponential backoff on failures (2s, 4s, 8s)
Single request per indicator — semicolon-separated country codes let us fetch all 54 countries at once
60-second timeout — some indicators return large payloads
0.5s delay between indicators — respect the free API
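
The per-indicator loop is not shown above, so here is a minimal sketch of how the single-request-per-indicator and 0.5s-delay decisions could fit together. `extract_all` and its `fetch`, `pause`, and `sleep` parameters are hypothetical names; `fetch` stands in for a call to `extract_indicator`.

```python
import time
from typing import Callable


def extract_all(indicator_codes: list[str],
                country_codes: list[str],
                fetch: Callable[[str, str], list[dict]],
                pause: float = 0.5,
                sleep: Callable[[float], None] = time.sleep) -> dict[str, list[dict]]:
    """Fetch every indicator with one request each, pausing between requests."""
    joined = ";".join(country_codes)  # e.g. "NGA;KEN;ZAF"
    results: dict[str, list[dict]] = {}
    for i, code in enumerate(indicator_codes):
        if i > 0:
            sleep(pause)  # be polite to the free API between indicators
        results[code] = fetch(code, joined)
    return results
```

With an open `httpx.Client`, this could be called as `extract_all(codes, countries, lambda code, cc: extract_indicator(client, code, cc))`. Passing `fetch` and `sleep` in as arguments keeps the loop easy to test without touching the network.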

The Star Schema

DuckDB is perfect for this: blazing fast analytics, zero configuration, and a single portable file.

dim_country ◄──── fact_indicators ────► dim_indicator
     │                  │
     └────────── dim_date ──────────────┘

import duckdb

def create_schema(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS fact_indicators (
            country_key  INTEGER,
            indicator_key INTEGER,
            date_key     INTEGER,
            value        DOUBLE,
            yoy_change   DOUBLE,
            extracted_at TIMESTAMP DEFAULT current_timestamp,
            PRIMARY KEY (country_key, indicator_key, date_key)
        )
    """)
    # Plus dim_country (54 rows), dim_indicator (10 rows), dim_date (25 rows)
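
The three dimension tables are only mentioned in a comment, so the following is a sketch of what they might look like. The column names, and the choice of using the calendar year as `date_key`, are assumptions; `conn` is expected to be a DuckDB connection such as the one returned by `duckdb.connect(...)`.

```python
def create_dimensions(conn) -> None:
    """Create the three dimension tables; `conn` is a DuckDB connection."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS dim_country (
            country_key INTEGER PRIMARY KEY,
            iso3_code   VARCHAR,   -- e.g. 'NGA'
            name        VARCHAR
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS dim_indicator (
            indicator_key INTEGER PRIMARY KEY,
            code          VARCHAR, -- World Bank series code
            name          VARCHAR,
            category      VARCHAR
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS dim_date (
            date_key      INTEGER PRIMARY KEY,  -- assumed to equal the year
            calendar_year INTEGER
        )
    """)


def seed_dim_date(conn, first_year: int = 2000, last_year: int = 2024) -> int:
    """Insert one row per year and return the number of rows written."""
    years = [(y, y) for y in range(first_year, last_year + 1)]
    conn.execute("DELETE FROM dim_date")  # keeps the seed step re-runnable
    conn.executemany("INSERT INTO dim_date VALUES (?, ?)", years)
    return len(years)
```

With the default range, `seed_dim_date` writes the 25 rows mentioned in the comment above (2000 through 2024).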

The transform layer also computes the year-over-year percentage change for every data point whose previous-year value exists and is non-zero:

def calculate_yoy(current, previous):
    if current is not None and previous is not None and previous != 0:
        return round(((current - previous) / abs(previous)) * 100, 2)
    return None
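
To show how that function might be applied to a whole series, here is a small sketch that takes the World Bank records for one country and one indicator (each record carries a string `date` and a numeric or null `value`) and attaches the change to each year. `add_yoy` and the output row layout are hypothetical.

```python
def calculate_yoy(current, previous):
    # Same rule as above: percentage change relative to |previous|
    if current is not None and previous is not None and previous != 0:
        return round(((current - previous) / abs(previous)) * 100, 2)
    return None


def add_yoy(records: list[dict]) -> list[dict]:
    """Sort one country/indicator series by year and attach YoY change."""
    by_year = {int(r["date"]): r["value"] for r in records}
    rows = []
    for year in sorted(by_year):
        rows.append({
            "year": year,
            "value": by_year[year],
            # A missing previous year yields None rather than a misleading jump
            "yoy_change": calculate_yoy(by_year[year], by_year.get(year - 1)),
        })
    return rows


series = [
    {"date": "2021", "value": 110.0},
    {"date": "2020", "value": 100.0},
    {"date": "2022", "value": None},
]
rows = add_yoy(series)
# 2020 has no previous year (None), 2021 is +10.0, 2022 has no value (None)
```

Looking up `year - 1` explicitly, instead of comparing with the previous row, avoids computing a "year-over-year" change across a gap in the data.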

Data Quality Framework

This is what separates a toy project from a production one. The quality framework scores three dimensions:

1. Completeness — What percentage of expected data points are non-null?

Literacy Rate: only 18% complete (data is sparse)
Population: 100% complete (every country, every year)

2. Validity — Are values within expected ranges?

Life expectancy: 25-95 years ✅
GDP: $1M - $10T ✅
Inflation: -30% to 10,000% (yes, hyperinflation happens) ✅

3. Freshness — How recent is the latest data?

GDP: 2024 ✅
Literacy: 2021 ⚠️ (surveys are infrequent)

The final score: 95.8/100, with completeness dragging it down slightly because literacy data is sparse (expected for survey-based indicators).
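
The three dimensions can be sketched as small pure functions. This is an illustration under stated assumptions and not the project's actual scoring code: the function names, the inclusive range check, and the two-year `max_lag` threshold are all hypothetical, and the weighting behind the overall score is not shown because the post does not describe it.

```python
def completeness(values: list) -> float:
    """Percentage of expected data points that are non-null."""
    if not values:
        return 0.0
    return round(100 * sum(v is not None for v in values) / len(values), 1)


def validity(values: list, low: float, high: float) -> float:
    """Percentage of non-null values inside the inclusive range [low, high]."""
    present = [v for v in values if v is not None]
    if not present:
        return 100.0  # nothing present, so nothing can be invalid
    return round(100 * sum(low <= v <= high for v in present) / len(present), 1)


def freshness(latest_year, reference_year: int, max_lag: int = 2) -> bool:
    """True when the newest observation is at most `max_lag` years old."""
    return latest_year is not None and reference_year - latest_year <= max_lag


life_expectancy = [61.2, None, 63.0, 120.0]
# completeness(life_expectancy)          -> 75.0 (3 of 4 present)
# validity(life_expectancy, 25, 95)      -> 66.7 (120.0 is out of range)
# freshness(2021, reference_year=2024)   -> False (3 years old)
```

Validity is measured only over values that are present, so a sparse indicator is penalized once (in completeness) and not twice.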

Interactive Dashboard

The dashboard is a static site (HTML + Tailwind CSS + Chart.js + Leaflet.js) that loads pre-exported JSON files:

Features:

🗺️ Choropleth map — click any African country, toggle between indicators
📈 Country comparison — compare up to 6 countries over 25 years
🏆 Rankings table — sortable by any indicator
🌙 Dark mode — full theme support
📱 Responsive — works on mobile

The dashboard reads four JSON files exported by the pipeline:

country_profiles.json — all data per country (897KB)
rankings.json — pre-sorted rankings per indicator
summary_stats.json — aggregate statistics
quality_report.json — transparency on data quality
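
As an illustration of the export step, the sketch below builds a per-indicator ranking from each country's latest value and writes it as compact JSON. The actual structure of `rankings.json` is not shown in the post, so the `rank`/`country`/`value` layout and both function names are assumptions.

```python
import json
from pathlib import Path


def build_rankings(latest_values: dict[str, dict]) -> dict[str, list[dict]]:
    """Rank countries per indicator, highest value first, skipping nulls."""
    rankings = {}
    for indicator, by_country in latest_values.items():
        present = [(c, v) for c, v in by_country.items() if v is not None]
        present.sort(key=lambda cv: cv[1], reverse=True)
        rankings[indicator] = [
            {"rank": i, "country": c, "value": v}
            for i, (c, v) in enumerate(present, start=1)
        ]
    return rankings


def export_json(payload: dict, path: Path) -> None:
    """Write compact JSON; smaller files load faster in the browser."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, separators=(",", ":")), encoding="utf-8")
```

A call such as `export_json(build_rankings(latest), Path("dashboard/data/rankings.json"))` would place the file where the workflow's `git add dashboard/data/` step picks it up.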

Automated Daily Refresh

A GitHub Actions workflow runs the pipeline daily at 6 AM UTC:

name: Daily ETL Pipeline
on:
  schedule:
    - cron: '0 6 * * *'
  workflow_dispatch:

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt
      - run: python -m pipeline.main all
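      # Pushing requires the workflow's GITHUB_TOKEN to have contents: write permission
      - run: |
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git add dashboard/data/
          git diff --cached --quiet || git commit -m "chore: update data"
          git push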

Fresh data → committed JSON → Vercel auto-deploys. Zero manual intervention.

Key Takeaways

Free APIs are underrated — The World Bank API has incredible depth. No auth, no rate limits worth worrying about, and 25+ years of history.
DuckDB is a game-changer for small-to-medium analytical workloads. Zero setup, a single file, and analytical queries over 13K+ rows finish in milliseconds.
Data quality isn't optional — Even with a trusted source like the World Bank, you'll find missing data, sparse indicators, and surprises. Build quality checks into the pipeline, not as an afterthought.
Static dashboards scale — By pre-computing JSON at ETL time, the dashboard is just a static site. No backend, no database connection, no server costs. Deploy to Vercel for free.
Star schemas still matter — Even in a world of data lakes and denormalized tables, dimensional modeling makes your data queryable and understandable.

Try It Yourself

The entire project is open source:

GitHub: hajirufai/afridata-pipeline
Stack: Python 3.12, httpx, DuckDB, Chart.js, Leaflet.js, Tailwind CSS

git clone https://github.com/hajirufai/afridata-pipeline.git
cd afridata-pipeline
pip install -r requirements.txt
python -m pipeline.main all
cd dashboard && python -m http.server 8080

Data engineering doesn't have to be about massive Spark clusters and cloud bills. Sometimes the best projects start with a free API and a clear question.

What economic indicators would you add? Drop a comment below!
