Ryan Giggs

Building the Sovereign Debt Observatory: An End-to-End ELT Pipeline on World Bank Debt Data for Low and Middle-Income Countries

Introduction

Global sovereign debt is one of the most consequential datasets in existence. It shapes foreign policy, determines credit ratings, drives IMF bailout decisions, and affects the daily lives of billions of people in developing countries. The World Bank publishes this data openly — 130+ countries, 27 years of history, updated quarterly — yet there is no ready-made analytical layer on top of it.

If you want to answer a question like "which African countries have the highest ratio of private nonguaranteed debt to total external debt, and how has that changed since 2010?", you have to manually download Excel files from multiple World Bank portals, clean inconsistent column names, handle missing values, and stitch everything together in a spreadsheet. Every time the data updates, you do it again.

That is the problem this project solves.

The Sovereign Debt Observatory is an end-to-end ELT pipeline that ingests World Bank external debt data, lands it in a cloud data lake, transforms it in BigQuery using dbt Cloud, orchestrates everything quarterly with Apache Airflow, and surfaces the answers in a Looker Studio dashboard.

This article walks through every layer of the pipeline — the architecture decisions, the technical challenges, and how I solved them.

The Five Questions This Pipeline Answers

Before writing a single line of code, I defined the analytical questions the pipeline needed to answer. This kept every decision grounded in purpose rather than technology for its own sake.

  1. How is gross external debt distributed across public, publicly guaranteed, private nonguaranteed, and multilateral sectors per country?
  2. Which countries carry the highest short-term external debt exposure and how has that changed since 2010?
  3. What share of external debt is foreign-currency denominated and where is that ratio worsening?
  4. How has regional external debt stock evolved from 1998 to 2025 across Africa, Latin America, East Asia, South Asia, Europe and Central Asia, and the Middle East?
  5. Which countries face the heaviest debt service pressure relative to their total debt position?

Architecture Overview

The pipeline follows a modern ELT pattern — extract and load first, transform inside the warehouse.

World Bank API (IDS source 2)
        |
        v
PySpark job (Docker container)
        |
        v
Google Cloud Storage — raw Parquet, partitioned by extracted_date
        |
        v
BigQuery external tables (raw dataset)
        |
        v
dbt Cloud — staging views + mart tables
        |
        v
Looker Studio dashboard
Orchestrated by Apache Airflow on Docker Compose.
Infrastructure provisioned by Terraform.

Why ELT and not ETL?

In ETL, transformation happens before loading — your Spark job does the cleaning, aggregation, and business logic before writing to the warehouse. In ELT, raw data lands untransformed and the warehouse does all the heavy lifting.

For this project, ELT is the right choice for three reasons. First, raw data is preserved in GCS indefinitely — if analytical requirements change six months from now, I just write a new dbt model without re-running the ingestion layer. Second, BigQuery is optimized for analytical SQL transformations at scale — it is far better at this than PySpark running in a Docker container on a local machine. Third, dbt gives us version-controlled, tested, documented transformations that are readable by anyone with SQL knowledge.

PySpark's job in this pipeline is purely mechanical: hit the API, paginate, write Parquet. No business logic. No aggregations. Pure extract and load.

Data Sources

International Debt Statistics (IDS) — World Bank source 2

The IDS database is the flagship World Bank debt dataset. It covers external debt stocks and flows for low- and middle-income countries, with annual data going back to 1998. I access it through the wbgapi Python library, which wraps the World Bank Indicators API v2.

I ingest nine series:

| Series Code | Description |
| --- | --- |
| DT.DOD.DECT.CD | Total external debt stocks |
| DT.DOD.DLXF.CD | Long-term external debt |
| DT.DOD.DPNG.CD | Private nonguaranteed debt |
| DT.DOD.MIBR.CD | PPG IBRD loans |
| DT.DOD.DPPG.CD | Public and publicly guaranteed debt |
| DT.DOD.DIMF.CD | IMF credit |
| DT.DOD.PVLX.CD | Present value of external debt |
| DT.DOD.MWBG.CD | IBRD loans and IDA credits |
| DT.DOD.MIDA.CD | PPG IDA loans |

One important lesson from building this: the World Bank has multiple API source databases, and not all of them support the standard wbgapi query format. Sources 22 (QEDS SDDS), 23 (QEDS GDDS), and 54 (JEDH) all return JSON decode errors when queried programmatically — they use a separate DataBank backend. Only source 2 (IDS) reliably supports the Indicators API. This cost me several hours of debugging.

Quarterly External Debt Statistics SDDS (QEDS)

QEDS provides quarterly debt payment schedule data broken down by sector and maturity. Unlike IDS, QEDS does not support programmatic API access in the standard format. The World Bank provides bulk Excel downloads for each supplementary table instead.

I download five Excel files directly:

  • Table 1.5 — Net external debt position by sector
  • Table 3 — Debt service payment schedule by sector
  • Table 3.2 — Debt service by sector and instrument
  • Table 2.1 — Foreign currency and domestic currency debt
  • Table 1.6 — Reconciliation of positions and flows
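Under the hood, `pandas.read_excel(path, sheet_name=None)` returns a dict of DataFrames, one per sheet. Here is a minimal sketch of the read-all-sheets-and-concatenate step, with a toy dict standing in for a real workbook (sheet and column names are illustrative):

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel(path, sheet_name=None),
# which yields a {sheet_name: DataFrame} mapping for a multi-sheet workbook.
sheets = {
    "Table_1_5": pd.DataFrame({"Country": ["KEN", "GHA"], "2021Q4": [1.2, 3.4]}),
    "Table_3":   pd.DataFrame({"Country": ["KEN"], "2021Q4": [0.7]}),
}

frames = []
for sheet_name, df in sheets.items():
    df = df.copy()
    df["source_sheet"] = sheet_name  # keep provenance of each row
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(len(combined))  # 3
```

Tagging each row with its source sheet makes it possible to filter by table later in dbt.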

Infrastructure as Code with Terraform

Every GCP resource in this project is provisioned by Terraform. The core resources are a GCS bucket for the data lake and three BigQuery datasets — raw, staging, and mart.

resource "google_storage_bucket" "data_lake" {
  name          = var.gcs_bucket_name
  location      = var.region
  force_destroy = true

  lifecycle_rule {
    action { type = "Delete" }
    condition { age = 90 }
  }

  versioning { enabled = true }
}

resource "google_bigquery_dataset" "raw" {
  dataset_id                 = "raw"
  location                   = var.location
  delete_contents_on_destroy = true
}

The 90-day lifecycle rule on GCS automatically deletes old partitions, keeping storage costs near zero. The entire GCP footprint for this project costs less than $0.05 per month — BigQuery's free tier covers 1 TB of queries per month and 10 GB of storage, which is far more than this dataset requires.

A setup_gcp.sh script handles the one-time bootstrap — creating the GCP project, enabling APIs, creating the service account, and granting IAM roles. The billing account ID is passed as an environment variable so it never appears in version-controlled files.

BILLING_ACCOUNT=your-billing-id bash scripts/setup_gcp.sh

Ingestion Layer — PySpark on Docker

JEDH / IDS Ingestion

The IDS ingestion script uses wbgapi to fetch all nine series across all available countries from 1998 to 2025. The API returns data in wide format — one row per country per series, with year columns as separate fields. PySpark writes this to GCS as Snappy-compressed Parquet, partitioned by extraction date.

One critical issue I hit: the World Bank API returns year columns as bare integers (1998, 1999, etc.). BigQuery rejects column names that start with numbers. The fix was to prefix all year columns with year_ before writing:

# `combined` is the wide-format pandas DataFrame holding all nine series
combined.columns = [
    f"year_{col}" if str(col).isdigit() else col
    for col in combined.columns
]

This produces clean column names like year_1998, year_1999 that BigQuery accepts without complaint.
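Run end to end, the rename looks like this — a minimal sketch with a toy DataFrame standing in for the real API response (column names are illustrative):

```python
import pandas as pd

# Toy wide-format frame mimicking the shape of the wbgapi response:
# bare-integer year columns alongside string dimension columns.
combined = pd.DataFrame(
    {"economy": ["KEN"], "series": ["DT.DOD.DECT.CD"], 1998: [6.8e9], 1999: [6.5e9]}
)

# BigQuery rejects column names that start with a digit, so prefix them
combined.columns = [
    f"year_{col}" if str(col).isdigit() else col
    for col in combined.columns
]

print(list(combined.columns))  # ['economy', 'series', 'year_1998', 'year_1999']
```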

QEDS Ingestion

The QEDS ingestion downloads five Excel files from World Bank DataBank, reads all sheets from each file using pandas.read_excel, and concatenates them into a single DataFrame.

The Excel files have messy column names — spaces, brackets, special characters. A clean_column_name function normalizes everything:

import re

def clean_column_name(col):
    col = str(col)
    col = re.sub(r'\s*\[.*?\]', '', col)  # remove [YR2021Q4] suffixes
    col = col.strip()
    col = re.sub(r'[^a-zA-Z0-9_]', '_', col)  # non-alphanumerics become underscores
    col = re.sub(r'_+', '_', col)             # collapse runs of underscores
    col = col.strip('_')
    if col and col[0].isdigit():
        col = 'q_' + col  # prefix quarter columns: q_2021q4
    return col.lower()

I also hit an OOM error trying to write the QEDS data through PySpark — 76,835 rows with 90+ columns across 214 sheets was too much for the JVM heap in the Docker container. The fix was to bypass Spark entirely for QEDS and write directly to GCS using the google-cloud-storage Python client:

import io

from google.cloud import storage

# df is the combined QEDS DataFrame; GCS_BUCKET and output_path come from config
buffer = io.BytesIO()
df.to_parquet(buffer, index=False, engine="pyarrow")
buffer.seek(0)

client = storage.Client()
bucket = client.bucket(GCS_BUCKET)
blob = bucket.blob(output_path)
blob.upload_from_file(buffer, content_type="application/octet-stream")

No JVM, no Spark executor, no OOM. The lesson: use the right tool for the job. PySpark is excellent for large distributed datasets. For a 76K-row DataFrame from an Excel file, plain pandas and the GCS Python client is simpler and more reliable.

Docker Image

The ingestion image is built on eclipse-temurin:17-jdk-jammy rather than a plain Python image. This is because PySpark requires Java, and the python:3.11-slim base image uses Debian Trixie which does not carry openjdk-17-jdk in its default repositories. The Temurin image ships Java 17 out of the box, which is exactly what Spark 3.5.1 needs.

The GCS connector JAR is downloaded at build time:

RUN curl --progress-bar -L \
    https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar \
    -o ${SPARK_HOME}/jars/gcs-connector-hadoop3-latest.jar

Credentials are never baked into the image. They are mounted at runtime as a volume:

docker run --rm \
  -v /path/to/key.json:/app/credentials/key.json \
  -e GOOGLE_APPLICATION_CREDENTIALS=/app/credentials/key.json \
  sovereign-debt-ingestion:v1 python3 extract_jedh.py

Orchestration with Apache Airflow

Airflow runs on Docker Compose using the official apache/airflow:2.9.2 image. The stack includes a Celery executor with Redis as the message broker and Postgres as the metadata database.

The DAG runs quarterly — on the 1st of January, April, July, and October at 06:00 UTC:

schedule_interval="0 6 1 1,4,7,10 *"
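Since the schedule is a fixed quarterly pattern, the four run times it produces in a given year can be sanity-checked with the standard library alone (a quick illustrative helper, not part of the DAG):

```python
from datetime import datetime, timezone

# "0 6 1 1,4,7,10 *" means 06:00 UTC on the 1st of Jan, Apr, Jul, Oct
def quarterly_runs(year):
    return [
        datetime(year, month, 1, 6, 0, tzinfo=timezone.utc)
        for month in (1, 4, 7, 10)
    ]

for run in quarterly_runs(2025):
    print(run.isoformat())
```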

Task flow:

extract_load_jedh >> extract_load_qeds

Both tasks use the DockerOperator to spin up the ingestion container on the host machine, which means Airflow itself does not need PySpark, Java, or any data dependencies — it just tells Docker to run the job.
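A DAG file wiring this up might look like the following sketch. It assumes the apache-airflow-providers-docker package; the dag_id and start_date are illustrative, while the image name, script name, and schedule come from this project:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="sovereign_debt_ingestion",     # illustrative name
    schedule_interval="0 6 1 1,4,7,10 *",  # quarterly, 06:00 UTC
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract_load_jedh = DockerOperator(
        task_id="extract_load_jedh",
        image="sovereign-debt-ingestion:v1",
        command="python3 extract_jedh.py",
        docker_url="unix://var/run/docker.sock",  # talk to the host daemon
        auto_remove="success",                    # clean up finished containers
    )
    # extract_load_qeds is defined the same way, then chained:
    # extract_load_jedh >> extract_load_qeds
```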

The Docker socket problem

The DockerOperator communicates with the host Docker daemon through the Unix socket at /var/run/docker.sock. This requires two things: the socket must be mounted into the Airflow worker container, and the worker must have permission to use it.

The socket mount goes in docker-compose.yml under the common volumes section:

volumes:
  - /var/run/docker.sock:/var/run/docker.sock

The permission fix requires adding the Docker group ID to the Airflow worker. On my machine the Docker group ID is 984:

group_add:
  - "984"

Without the group add, the worker can see the socket but gets PermissionError(13, 'Permission denied'). This is a common Airflow + Docker-in-Docker gotcha that took several debugging sessions to resolve.

Transformation with dbt Cloud

Staging Layer

The staging models are materialized as views — no storage cost, no latency, just a SQL lens over the raw external tables.

stg_jedh does the heavy lifting: it unpivots the wide year-column format into long format using BigQuery's UNPIVOT operator:

unpivot(
    value for year in (
        year_1998, year_1999, year_2000, ..., year_2025
    )
)

Then extracts the year integer from the column name:

cast(replace(year, 'year_', '') as int64) as year

And adds a human-readable series description via a CASE statement, turning DT.DOD.DECT.CD into Total external debt stocks.

The result is a clean long-format table: one row per country per series per year.
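For intuition, the same wide-to-long reshape in pandas (a sketch for comparison, not part of the pipeline) is a `melt` plus a string strip:

```python
import pandas as pd

wide = pd.DataFrame({
    "country_code": ["KEN"],
    "series_code": ["DT.DOD.DECT.CD"],
    "year_1998": [6.8e9],
    "year_1999": [6.5e9],
})

# melt is pandas' equivalent of BigQuery's UNPIVOT
long = wide.melt(
    id_vars=["country_code", "series_code"],
    var_name="year",
    value_name="debt_value_usd",
)
# 'year_1998' -> 1998, mirroring the SQL replace() + cast
long["year"] = long["year"].str.replace("year_", "", regex=False).astype(int)

print(long[["year", "debt_value_usd"]])
```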

stg_qeds is simpler — it selects the clean columns, uses SAFE_CAST to convert quarter values to float64, and filters out null countries and series codes. SAFE_CAST is preferable to CAST here because it returns NULL on failure rather than throwing an error, which is the right behaviour for messy Excel data.
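SAFE_CAST's null-on-failure semantics have a straightforward Python analogue, shown here as an illustrative helper (not in the repo) that could pre-validate values before they reach the warehouse:

```python
def safe_cast_float(value):
    """Return float(value), or None when the value cannot be parsed,
    mirroring BigQuery's SAFE_CAST(... AS FLOAT64)."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

print(safe_cast_float("123.4"))  # 123.4
print(safe_cast_float(".."))     # None — non-numeric placeholders fail to parse
print(safe_cast_float(None))     # None
```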

Mart Layer

The mart models are materialized as tables with partitioning and clustering for query efficiency.

mart_debt_stocks is partitioned by year and clustered by country_code and series_code. It enriches the staging data with YoY percentage change and debt as a percentage of total external debt per country per year — both computed using window functions:

lag(b.debt_value_usd) over (
    partition by b.country_code, b.series_code
    order by b.year
) as prev_year_value,

safe_divide(
    b.debt_value_usd - lag(b.debt_value_usd) over (
        partition by b.country_code, b.series_code
        order by b.year
    ),
    lag(b.debt_value_usd) over (
        partition by b.country_code, b.series_code
        order by b.year
    )
) * 100 as yoy_change_pct
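The same lag-based YoY calculation can be expressed in pandas for comparison — a toy sketch with made-up numbers, since the pipeline does this in BigQuery:

```python
import pandas as pd

df = pd.DataFrame({
    "country_code": ["KEN"] * 3,
    "series_code": ["DT.DOD.DECT.CD"] * 3,
    "year": [2019, 2020, 2021],
    "debt_value_usd": [100.0, 110.0, 99.0],
})

# groupby + shift is pandas' equivalent of lag() over (partition by ... order by year)
grp = df.sort_values("year").groupby(["country_code", "series_code"])
df["prev_year_value"] = grp["debt_value_usd"].shift(1)
df["yoy_change_pct"] = (
    (df["debt_value_usd"] - df["prev_year_value"]) / df["prev_year_value"] * 100
)

print(df["yoy_change_pct"].round(1).tolist())  # [nan, 10.0, -10.0]
```

Like safe_divide in BigQuery, the division here yields NaN rather than an error for the first year of each series, where there is no prior value.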

mart_regional_debt assigns countries to six World Bank regions using a CASE statement on ISO3 country codes, then aggregates total, average, max, and min debt stocks per region per series per year. This powers the regional trajectory time series on the dashboard.

mart_debt_service computes total annual debt payments and average quarterly payments per country from the QEDS payment schedule data, enabling debt service pressure analysis.

dbt Tests

All models have not_null tests on primary dimension columns. Running dbt test after every model change ensures data quality is enforced at the transformation layer rather than discovered downstream in the dashboard.

The Dashboard

The Looker Studio dashboard has two pages, both connected directly to the BigQuery mart tables.

Page 1 — Global Debt Overview answers Q4 at a glance. A time series chart using mart_regional_debt shows six regional debt trajectories from 1998 to 2024. Africa's trajectory is notably steeper post-2010. A bar chart shows the top 20 countries by total external debt for the selected year. A scorecard shows total global external debt.

Page 2 — Country Deep-Dive answers Q1, Q2, and Q5. A stacked bar chart shows debt composition by sector over time for any selected country. A line chart shows short-term vulnerability trends. A table sorted by total 2021 payments shows which countries face the most acute debt service pressure.

Live dashboard: https://lookerstudio.google.com/reporting/7fc18e9e-a5c6-4616-b920-b5b4bddf2264

Key Technical Lessons

1. Know which World Bank API sources support programmatic access. Only source 2 (IDS) reliably supports the Indicators API v2 via wbgapi. Sources 22, 23, and 54 use a different DataBank backend and return JSON decode errors. For QEDS data, use the bulk Excel downloads instead.

2. BigQuery rejects column names starting with numbers. Prefix them before writing Parquet. A simple list comprehension handles this: f"year_{col}" if str(col).isdigit() else col.

3. Use the right tool for the data size. PySpark is excellent for large datasets but overkill for a 76K-row Excel file. The GCS Python client with pandas and PyArrow is simpler, faster, and doesn't OOM.

4. Docker-in-Docker with Airflow requires explicit socket mounting and group permissions. Mount /var/run/docker.sock and add the Docker group ID to the worker container. Without the group add, every task fails with PermissionError(13, 'Permission denied').

5. ELT separates concerns cleanly. When the analytical questions evolved during development, I only needed to update dbt models — never the ingestion layer. This separation is the most valuable architectural decision in the project.

What I Would Do Differently

If I were building this again, I would use the World Bank DataBank bulk download API to get historical QEDS time series data instead of point-in-time Excel files. The current QEDS data only has a handful of quarters because the Excel files are snapshots of the latest publication. A proper time series would require downloading and archiving each quarterly release.

I would also add a load_to_bigquery step that loads Parquet directly into native BigQuery tables rather than using external tables. External tables work well but they require manually updating the source URI list each time a new partition is added. A native table with partitioning handles this automatically.

Conclusion

The Sovereign Debt Observatory took the World Bank's raw debt data from scattered Excel files and API endpoints to a fully automated, tested, and documented analytical pipeline. Every component is reproducible — Terraform provisions the infrastructure, Docker packages the ingestion environment, Airflow schedules the runs, and dbt documents and tests the transformations.

The full source code is on GitHub: https://github.com/Derrick-Ryan-Giggs/sovereign-debt-observatory

If you have questions about any part of the implementation, drop them in the comments. I am happy to go deeper on any layer.


Ryan Derrick Giggs is a data engineering practitioner and technical writer based in Nairobi, Kenya. He is currently completing the DataTalksClub Data Engineering Zoomcamp 2026.

LinkedIn: https://linkedin.com/in/ryan-giggs-a19330265
GitHub: https://github.com/Derrick-Ryan-Giggs
