DEV Community

Haji Rufai

Building an African Economic Data Pipeline with Python, DuckDB & World Bank API

Every data engineer knows the struggle: finding a project that's both technically impressive and genuinely useful. Today I'll walk you through AfriData Pipeline — a production-grade ETL system that extracts economic data for all 54 African countries, loads it into a DuckDB analytical warehouse, and serves an interactive dashboard.

No paid APIs. No cloud services required. Just Python, DuckDB, and free public data.

Why This Project?

Africa's economy is growing fast, but finding clean, consolidated economic data is surprisingly hard. The World Bank has an amazing free API with 16,000+ indicators — but raw API responses need serious engineering to become useful.

This project demonstrates:

  • ETL pipeline design with proper error handling and retries
  • Dimensional modeling (star schema) in DuckDB
  • Data quality engineering — automated checks for completeness, validity, and freshness
  • Full-stack delivery — from raw API to interactive dashboard

Architecture Overview

World Bank API v2 → Extract (httpx) → Transform (Python) → Load (DuckDB)
                                                               ↓
                                            Export JSON → Static Dashboard (Vercel)

The pipeline processes 13,500 data points (54 countries × 10 indicators × 25 years) in under 50 seconds.

The Data: 10 Key Indicators

I selected indicators that tell a comprehensive economic story:

| Indicator | Category | Why It Matters |
|---|---|---|
| GDP (US$) | Economy | Total economic output |
| GDP Growth (%) | Economy | Economic momentum |
| Population | Demographics | Scale context |
| Inflation (CPI) | Economy | Cost of living pressure |
| Unemployment | Labor | Job market health |
| Life Expectancy | Health | Quality of life proxy |
| Internet Users (%) | Technology | Digital readiness |
| Electricity Access (%) | Infrastructure | Development foundation |
| Literacy Rate (%) | Education | Human capital |
| FDI Inflows (% GDP) | Investment | External confidence |
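
For orientation, here is how the table could translate into World Bank series codes. This is a sketch, not the project's confirmed configuration: the `INDICATORS` name is hypothetical, and the codes are the standard World Development Indicators IDs that most plausibly match each row.

```python
# Hypothetical mapping from the indicators above to World Bank series codes.
# The codes are standard World Development Indicators IDs; the project's
# actual selection may differ.
INDICATORS = {
    "NY.GDP.MKTP.CD": ("GDP (US$)", "Economy"),
    "NY.GDP.MKTP.KD.ZG": ("GDP Growth (%)", "Economy"),
    "SP.POP.TOTL": ("Population", "Demographics"),
    "FP.CPI.TOTL.ZG": ("Inflation (CPI)", "Economy"),
    "SL.UEM.TOTL.ZS": ("Unemployment", "Labor"),
    "SP.DYN.LE00.IN": ("Life Expectancy", "Health"),
    "IT.NET.USER.ZS": ("Internet Users (%)", "Technology"),
    "EG.ELC.ACCS.ZS": ("Electricity Access (%)", "Infrastructure"),
    "SE.ADT.LITR.ZS": ("Literacy Rate (%)", "Education"),
    "BX.KLT.DINV.WD.GD.ZS": ("FDI Inflows (% GDP)", "Investment"),
}

# Print a quick overview: code, display name, category
for code, (name, category) in INDICATORS.items():
    print(f"{code:<22} {name:<24} {category}")
```

Keeping the code as the dictionary key means the same structure can drive both the extract loop and the rows of dim_indicator.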

Building the Extract Layer

The World Bank API v2 is beautifully simple — no auth required, JSON responses, and you can batch multiple countries in one request:

import httpx
import time

WB_BASE = "https://api.worldbank.org/v2"
MAX_RETRIES = 3

def extract_indicator(client: httpx.Client, indicator_code: str, 
                      country_codes: str) -> list[dict]:
    url = (f"{WB_BASE}/country/{country_codes}/indicator/{indicator_code}"
           f"?format=json&date=2000:2024&per_page=10000")

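    for attempt in range(MAX_RETRIES):
        try:
            resp = client.get(url, timeout=60)
            resp.raise_for_status()
            data = resp.json()
            # World Bank returns [metadata, records]; records is null when empty
            if isinstance(data, list) and len(data) == 2:
                return data[1] or []
            # Any other shape (such as an error payload) falls through to a retry
        except (httpx.HTTPStatusError, httpx.TransportError, ValueError):
            pass  # HTTP error, timeout/network failure, or invalid JSON: retry
        # Exponential backoff: 2s, 4s, 8s
        time.sleep(2 * (2 ** attempt))
    return []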

Key design decisions:

  • Exponential backoff on failures (2s, 4s, 8s)
  • Single request per indicator — semicolon-separated country codes let us fetch all 54 countries at once
  • 60-second timeout — some indicators return large payloads
  • 0.5s delay between indicators — respect the free API
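
The last two decisions live outside `extract_indicator`, in whatever loop calls it. A minimal sketch of such a loop, assuming a hypothetical `extract_all` helper (the name, its parameters, and the injectable `fetch` and `sleep` callables are illustrative, not taken from the project):

```python
import time
from typing import Callable


def extract_all(
    fetch: Callable[[str, str], list[dict]],
    indicator_codes: list[str],
    iso_codes: list[str],
    pause_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, list[dict]]:
    """Fetch every indicator with one request each, pausing between requests."""
    # The World Bank API accepts several countries separated by semicolons.
    country_param = ";".join(iso_codes)
    results: dict[str, list[dict]] = {}
    for i, code in enumerate(indicator_codes):
        if i > 0:
            sleep(pause_seconds)  # pause between indicators, not before the first
        results[code] = fetch(code, country_param)
    return results
```

In the real pipeline `fetch` would wrap the function above, for example `lambda code, countries: extract_indicator(client, code, countries)` with a shared `httpx.Client()`. Passing `fetch` and `sleep` in as arguments keeps the loop testable without touching the network.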

The Star Schema

DuckDB fits this workload well: fast in-process analytical queries, zero configuration, and a single portable database file.

dim_country ◄──── fact_indicators ────► dim_indicator
     │                  │
     └────────── dim_date ──────────────┘

The fact table is created like this:

import duckdb

def create_schema(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS fact_indicators (
            country_key  INTEGER,
            indicator_key INTEGER,
            date_key     INTEGER,
            value        DOUBLE,
            yoy_change   DOUBLE,
            extracted_at TIMESTAMP DEFAULT current_timestamp,
            PRIMARY KEY (country_key, indicator_key, date_key)
        )
    """)
    # Plus dim_country (54 rows), dim_indicator (10 rows), dim_date (25 rows)
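
The dimension tables are not shown above, so here is one way they could look. Treat this as a sketch under assumptions: the column names (`iso3`, `code`, `category`, `year`) and the `create_dimensions` helper are hypothetical, chosen only to match the keys used in `fact_indicators`.

```python
# Hypothetical DDL for the three dimension tables referenced by fact_indicators.
DIMENSION_DDL = [
    """
    CREATE TABLE IF NOT EXISTS dim_country (
        country_key INTEGER PRIMARY KEY,
        iso3        VARCHAR,
        name        VARCHAR
    )
    """,
    """
    CREATE TABLE IF NOT EXISTS dim_indicator (
        indicator_key INTEGER PRIMARY KEY,
        code          VARCHAR,
        name          VARCHAR,
        category      VARCHAR
    )
    """,
    """
    CREATE TABLE IF NOT EXISTS dim_date (
        date_key INTEGER PRIMARY KEY,
        year     INTEGER
    )
    """,
]


def create_dimensions(conn) -> None:
    """Run each CREATE TABLE on any connection object that exposes execute()."""
    for ddl in DIMENSION_DDL:
        conn.execute(ddl)
```

With DuckDB installed, `create_dimensions(duckdb.connect(":memory:"))` builds the tables in an in-memory database, which is handy for trying out joins against the fact table.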

The transform layer also computes year-over-year change for each data point, returning None when either value is missing or the previous value is zero:

def calculate_yoy(current, previous):
    if current is not None and previous is not None and previous != 0:
        return round(((current - previous) / abs(previous)) * 100, 2)
    return None
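
To show where that function fits, here is a sketch that turns raw API records into rows with a `yoy_change` column. The `with_yoy` helper and the output keys are hypothetical; `countryiso3code`, `date`, and `value` are fields of the World Bank response. It assumes "previous" means the prior calendar year, so a gap in the series yields None rather than a change against an older observation.

```python
from collections import defaultdict


def calculate_yoy(current, previous):
    if current is not None and previous is not None and previous != 0:
        return round(((current - previous) / abs(previous)) * 100, 2)
    return None


def with_yoy(records: list[dict]) -> list[dict]:
    """Flatten raw API records and attach year-over-year change per country."""
    # Group values by country, keyed by integer year
    by_country = defaultdict(dict)
    for rec in records:
        by_country[rec["countryiso3code"]][int(rec["date"])] = rec["value"]

    rows = []
    for iso3, series in sorted(by_country.items()):
        for year in sorted(series):
            value = series[year]
            previous = series.get(year - 1)  # None when the prior year is absent
            rows.append({
                "iso3": iso3,
                "year": year,
                "value": value,
                "yoy_change": calculate_yoy(value, previous),
            })
    return rows


# Synthetic sample, deliberately out of order and with one null value
sample = [
    {"countryiso3code": "NGA", "date": "2021", "value": 110.0},
    {"countryiso3code": "NGA", "date": "2020", "value": 100.0},
    {"countryiso3code": "NGA", "date": "2022", "value": None},
]
for row in with_yoy(sample):
    print(row)
```

The API returns years as strings and in no guaranteed order, so the sketch converts and sorts them before computing any change.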

Data Quality Framework

This is what separates a toy project from a production one. The quality framework scores three dimensions:

1. Completeness — What percentage of expected data points are non-null?

Literacy Rate: only 18% complete (data is sparse)
Population: 100% complete (every country, every year)
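
A completeness check of this kind can be sketched in a few lines. The `completeness` function is hypothetical, and the sample series is synthetic, sized only to reproduce the 18% figure quoted for literacy.

```python
def completeness(values: list, expected: int) -> float:
    """Percentage of expected data points that are present and non-null."""
    if expected <= 0:
        return 0.0
    present = sum(1 for v in values if v is not None)
    return round(100 * present / expected, 1)


# 54 countries x 25 years = 1,350 expected points per indicator
expected_points = 54 * 25

# Synthetic series: 243 observed values, the rest missing
sparse = [1.0] * 243 + [None] * 1107
print(completeness(sparse, expected_points))  # 18.0
```

Dividing by the expected count rather than the number of rows returned matters here, because the API may simply omit years it has no record for.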

2. Validity — Are values within expected ranges?

Life expectancy: 25-95 years ✅
GDP: $1M - $10T ✅
Inflation: -30% to 10,000% (yes, hyperinflation happens) ✅
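
The range check could look roughly like this. The bounds come from the list above; the dictionary keys and the `out_of_range` helper are hypothetical names.

```python
# Expected ranges per indicator (bounds taken from the examples above)
VALID_RANGES = {
    "life_expectancy": (25, 95),      # years
    "gdp_usd": (1e6, 1e13),           # $1M to $10T
    "inflation_pct": (-30, 10_000),   # wide enough for hyperinflation
}


def out_of_range(indicator: str, values: list) -> list:
    """Return the non-null values that fall outside the expected range."""
    low, high = VALID_RANGES[indicator]
    return [v for v in values if v is not None and not (low <= v <= high)]


# Synthetic example: 120 is not a plausible life expectancy
print(out_of_range("life_expectancy", [52.3, None, 120, 61.0]))  # [120]
```

Nulls are skipped on purpose, since missing data is already counted by the completeness check.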

3. Freshness — How recent is the latest data?

GDP: 2024 ✅
Literacy: 2021 ⚠️ (surveys are infrequent)
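
Freshness reduces to finding the latest year with a real value and comparing it to a reference year. In this sketch both function names and the two-year `max_lag` threshold are assumptions, not the project's actual rule.

```python
from typing import Optional


def latest_year(observations: dict) -> Optional[int]:
    """Most recent year that has a non-null value, or None if there is none."""
    years = [year for year, value in observations.items() if value is not None]
    return max(years) if years else None


def is_stale(latest: Optional[int], reference_year: int, max_lag: int = 2) -> bool:
    """Flag an indicator whose newest data is more than max_lag years old."""
    return latest is None or reference_year - latest > max_lag


# Synthetic series: the 2022 entry exists but is null
series = {2020: 71.2, 2021: 72.0, 2022: None}
newest = latest_year(series)
print(newest, is_stale(newest, reference_year=2024))  # 2021 True
```

Under that threshold, data last observed in 2021 is flagged when the reference year is 2024, while data from 2024 passes.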

The final score: 95.8/100 — with completeness dragging slightly due to sparse literacy data (expected for survey-based indicators).

Interactive Dashboard

The dashboard is a static site (HTML + Tailwind CSS + Chart.js + Leaflet.js) that loads pre-exported JSON files.

Features:

  • 🗺️ Choropleth map — click any African country, toggle between indicators
  • 📈 Country comparison — compare up to 6 countries over 25 years
  • 🏆 Rankings table — sortable by any indicator
  • 🌙 Dark mode — full theme support
  • 📱 Responsive — works on mobile

The dashboard reads four JSON files exported by the pipeline:

  • country_profiles.json — all data per country (897KB)
  • rankings.json — pre-sorted rankings per indicator
  • summary_stats.json — aggregate statistics
  • quality_report.json — transparency on data quality
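
As an illustration of the export step, here is a sketch that writes rankings.json from the latest value per country. The `export_rankings` function, its input shape, and the JSON layout (`rank`, `iso3`, `value`) are assumptions; the project's real file format may differ.

```python
import json
from pathlib import Path


def export_rankings(latest_values: dict, out_dir: Path) -> Path:
    """Write rankings.json: per indicator, countries sorted highest to lowest."""
    rankings = {}
    for indicator, by_country in latest_values.items():
        ordered = sorted(by_country.items(), key=lambda kv: kv[1], reverse=True)
        rankings[indicator] = [
            {"rank": i, "iso3": iso3, "value": value}
            for i, (iso3, value) in enumerate(ordered, start=1)
        ]
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "rankings.json"
    path.write_text(json.dumps(rankings, indent=2), encoding="utf-8")
    return path
```

Sorting at export time is what lets the dashboard render the rankings table without any computation in the browser beyond re-sorting on click.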

Automated Daily Refresh

A GitHub Actions workflow runs the pipeline daily at 6 AM UTC:

name: Daily ETL Pipeline
on:
  schedule:
    - cron: '0 6 * * *'
  workflow_dispatch:

# Allow the workflow's GITHUB_TOKEN to push the refreshed data
permissions:
  contents: write

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt
      - run: python -m pipeline.main all
      - run: |
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git add dashboard/data/
          git diff --cached --quiet || git commit -m "chore: update data"
          git push

Fresh data → committed JSON → Vercel auto-deploys. Zero manual intervention.

Key Takeaways

  1. Free APIs are underrated — The World Bank API has incredible depth. No auth, no rate limits worth worrying about, and 25+ years of history.

  2. DuckDB is a game-changer for small-to-medium analytical workloads. Zero setup, single file, and it handles 13K+ rows with analytical queries in milliseconds.

  3. Data quality isn't optional — Even with a trusted source like the World Bank, you'll find missing data, sparse indicators, and surprises. Build quality checks into the pipeline, not as an afterthought.

  4. Static dashboards scale — By pre-computing JSON at ETL time, the dashboard is just a static site. No backend, no database connection, no server costs. Deploy to Vercel for free.

  5. Star schemas still matter — Even in a world of data lakes and denormalized tables, dimensional modeling makes your data queryable and understandable.

Try It Yourself

The entire project is open source:

git clone https://github.com/hajirufai/afridata-pipeline.git
cd afridata-pipeline
pip install -r requirements.txt
python -m pipeline.main all
cd dashboard && python -m http.server 8080

Data engineering doesn't have to be about massive Spark clusters and cloud bills. Sometimes the best projects start with a free API and a clear question.


What economic indicators would you add? Drop a comment below!
