Ryan Giggs

Posted on Jun 8

I Built a Data Lakehouse to Map Kenya's Healthcare Inequality — Here's What I Learned

#dataengineering #opensource #python #healthtech

A child born in Westlands, Nairobi has access to 7.23 health facilities per 10,000 people.

A child born in Embakasi North, also in Nairobi, has 0.62.

Same city. Same county. One-tenth the access.

That number did not come from a government report. It did not come from a consultant's PowerPoint. It came from a pipeline I built from scratch, joining three public datasets that nobody had systematically connected before — the Ministry of Health facility registry, the KNBS 2019 Census, and HDX county boundary GeoJSON.

This is the story of how I built it, why I made the architecture choices I made, and what I would do differently.

The Problem

Kenya has 47 counties and over 20,000 registered health facilities. The Ministry of Health publishes facility data through the Kenya Master Health Facility Registry (KMHFR). The Kenya National Bureau of Statistics publishes county population data. The Humanitarian Data Exchange publishes county boundary GeoJSON for choropleth mapping.

Three datasets. Three different sources. All public. None of them joined.

Without joining them, you cannot answer the most basic questions:

Which counties are most underserved relative to their population?
Which counties have no maternity facilities? No ART centres? No TB clinics?
How many facilities would Bungoma need to reach the national baseline of 3 per 10,000 people?
Has the facility landscape changed over time — are counties gaining or losing services?

The data to answer all of this exists. It just has not been assembled. That is the problem this pipeline solves.

The Stack

Here is what I built with, and more importantly, why:

Infrastructure as Code: OpenTofu — Provisions two MinIO buckets before the stack starts. OpenTofu is the community-maintained open-source fork of Terraform following HashiCorp's BSL license change.

Orchestration: Apache Airflow (CeleryExecutor + Redis) — Monthly DAG: three ingestion tasks run in parallel, a barrier gate waits for all three to complete, then dbt runs.

Data Lake: MinIO — S3-compatible object storage running in Docker. Raw NDJSON files land here first, partitioned by year/month/day.

Table Format: Apache Iceberg — ACID transactions, schema evolution, and time travel on top of files. The SCD2 snapshot tracks which facilities gained or lost maternity/ART/TB services month over month.

Catalog: Iceberg REST Catalog — SQLite-backed, persists across Docker restarts via a named volume.

Query Engine: Trino 480 — Federated SQL over Iceberg tables stored in MinIO.

Transformations: dbt Core + dbt-trino — Staging models, four mart models, one SCD2 snapshot, 30 passing data quality tests, one seed file.

Dashboards: Apache Superset 5.0.0 — Eight charts on one dashboard.

Everything runs in Docker Compose on my HP EliteBook 840. Zero cloud spend.

The Architecture Tradeoff I Need To Be Honest About

Here is where I am going to say something that might surprise you.

This stack is overkill for 20,000 rows of data.

Trino is a distributed query engine built for petabyte-scale federated queries across dozens of data sources. My biggest table has 20,391 rows. DuckDB — a single embedded analytics database that runs in a Python process — would query that in milliseconds without any infrastructure at all.

CeleryExecutor with Redis is designed for Airflow deployments with many concurrent tasks across multiple workers. I have three DAGs that run once a month. LocalExecutor handles that just fine with zero additional services.

The Iceberg REST catalog backed by SQLite is, genuinely, the worst of both worlds at this scale — you get all of Iceberg's operational complexity without most of its benefits.

So why did I build it this way?

Because this is the architecture I would run at scale — and I needed to prove I can operate it.

If this pipeline were ingesting facility data from all 54 African Union member states, with real-time updates from DHIS2, and serving 200 analysts querying simultaneously — you would need exactly this stack. Trino for federated querying. CeleryExecutor so ingestion tasks do not block dashboard queries. Iceberg for time travel and schema evolution as the data model matures.

The data volume is small. The architecture patterns are production-grade. I built it this way intentionally, because the goal of a portfolio project is not just to answer a data question — it is to demonstrate that you can operate the systems that answer data questions at companies where the stakes are real.

When I sit in a data engineering interview and someone asks "have you worked with Trino and Iceberg?" the answer is yes, in anger, debugging catalog persistence issues and permission conflicts at 11pm. That is worth more than a hypothetical answer.

What I Actually Found

After all of it was running, here is what the data showed:

National level:

Bungoma is Kenya's most underserved county: 1.88 facilities per 10,000 people, serving 1.67 million people
Samburu is the only county with a critical TB gap — just 1 TB facility for 310,327 people
Every county in Kenya has at least one emergency facility — that surprised me
Mandera has the highest raw ratio (6.26 per 10k) — but that is sparse population density, not good healthcare access. Ratios lie without context.

Nairobi sub-county drill-down:

Starehe (CBD): 7.23 facilities per 10,000 people
Embakasi North: 0.62 facilities per 10,000 people
The most underserved sub-county has eleven times fewer facilities per capita than the most served, within the same city

The Dashboard

Eight charts. One story. Here is what each one shows.

Chart 1 — Facilities per 10,000 People by County

The primary national scorecard. Every bar is one of Kenya's 47 counties. The spike you see around Lamu reflects its small population relative to its facility count — a reminder that raw ratios must be read alongside absolute population figures. The counties sitting below 2.0 on the left — Trans Nzoia, Kakamega — are large-population counties being severely underserved in absolute terms. The Y axis is facilities per 10,000 people. The X axis is sorted alphabetically; the ranked view lives in the next chart.

Chart 2 — Underserved Counties Ranking

All 47 counties ranked from most to least underserved. The severity score column is highlighted red for scores of 3 (critical) and pink for scores of 2 — Bungoma and Samburu both carry severity score 3. The facilities_needed_to_baseline column shows how many additional facilities each county would need just to reach a minimum of 3 per 10,000 people. Bungoma needs 21. The tb_gap_flag column confirms Samburu as the only county in Kenya with a critical TB service gap. This table is the most action-oriented output in the entire pipeline — it is a prioritised to-do list for the Ministry of Health.

Chart 3 — Maternity Coverage by County

Maternity facilities per 100,000 people across all 47 counties. The spike at Lamu (~52) is the small-population effect again. More meaningful are the counties between 10 and 17 on the left side — these are where maternal healthcare access is most constrained on a population-adjusted basis. Kenya's maternal mortality remains high in underserved counties, and this chart makes that structural gap visible and quantified for the first time in a reproducible pipeline.

Chart 4 — Service Coverage Rates by County

Four services compared side by side for every county: TB (blue), maternity (dark red), ART/HIV treatment (orange-red), and emergency care (amber). The tall blue TB bars dominate because TB diagnosis points are integrated into a wider range of facility types nationally. The very short amber emergency bars across most counties are a quiet crisis in themselves — emergency care coverage is thin across the country. This is the most information-dense chart in the dashboard and rewards close examination county by county.

Chart 5 — Nairobi Sub-County Facility Density

This is the chart that stops people. Embakasi North: 0.62 facilities per 10,000 people. Starehe: 7.23. Same city. The bar on the far right (Starehe/Westlands cluster) towers over everything else. Embakasi North, the short bar third from the left, has 291,760 people and only 18 health facilities. This chart is the most visceral data visualisation in the project — it shows that healthcare inequality is not just a rural versus urban story, but an intra-city story playing out at sub-county level inside Nairobi.

Chart 6 — Nairobi Sub-County Service Gaps

The full service breakdown for all 17 Nairobi sub-counties, sorted by facilities per 10k ascending — most underserved first. The table shows 2024 projected population (2019 census + 2.3% annual KNBS growth rate) alongside raw counts of maternity, ART, and emergency facilities per sub-county. What stands out is that every sub-county shows no_maternity_flag = false and no_emergency_flag = false. Nairobi's inequality is not a service absence crisis — it is a density crisis. Every sub-county has some maternity and emergency facilities. Not nearly enough of them.

Chart 7 — Total Counties Mapped

A single bold number: 47. All of Kenya's counties accounted for. This is not just cosmetic — a custom dbt test (assert_county_count_equals_47) enforces this in code. If any pipeline run produces fewer than 47 counties due to data quality issues, join failures, or ingestion gaps, the test fails and the pipeline stops before writing bad data to the mart tables.

Chart 8 — Kenya Facility Density Map

A full choropleth of all 47 Kenya counties, colour-coded by facility density. Light yellow is least served (0.9–1.8 per 10k). Dark maroon is most served (4.4–5.3 per 10k). The dense cluster of small coloured counties in the southwest reflects the high-population Rift Valley and central counties. The large pale northern counties — Turkana, Marsabit, Wajir, Mandera — are sparse in both population and facilities. Building this map required significant engineering: Superset's built-in Kenya map only has the pre-2013 provincial boundaries (8 provinces, not 47 counties), so I built a custom virtual dataset wrapping each county's raw GeoJSON geometry as a full Feature object to make deck.gl render it correctly.

The Engineering Problems That Nearly Broke Me

The Superset logout loop

Both Airflow and Superset were logging me out within 20 seconds of login. The root causes turned out to be three separate things simultaneously: multiple Airflow gunicorn workers not sharing session state, a Superset 5.0 regression breaking database session persistence, and Flask generating a random SECRET_KEY on every container restart — invalidating every session cookie on every docker compose restart.

Fix: AIRFLOW__WEBSERVER__WORKERS=1, a fixed SECRET_KEY in .env from openssl rand -hex 32, and a before_request Flask hook forcing session.permanent = True on every request.

The Iceberg catalog disappearing on restart

Every restart wiped all Trino schema and table knowledge. The REST catalog was using in-memory SQLite by default.

Fix: CATALOG_URI=jdbc:sqlite:/catalog/iceberg_catalog.db with a named Docker volume. Two config lines. One hour of debugging.

The dbt target/ permission conflict

The VS Code dbt extension creates dbt/target/ with host user ownership (UID 1000). Airflow runs as UID 50000. Every dbt run failed with Permission denied.

Fix: declare dbt-target as a named Docker volume mounted over the bind-mounted ./dbt directory. Named volumes are owned by the Docker daemon, not the host user.

The KMHFR API redirect

Original pipeline hit kmhfl.health.go.ke. That domain 301-redirects to kmhfr.health.go.ke, which returned just the URL path as plain text — not JSON. Three weeks of failed ingestion before finding the correct public API endpoint: api.kmhfr.health.go.ke/api/public/facilities/.

The Superset Country Map with 8 old provinces

Superset's built-in Kenya GeoJSON has the pre-2013 provincial boundaries — 8 provinces, not 47 counties. No ISO code mapping fixes this. The solution was switching to deck.gl Polygon with a custom SQL virtual dataset wrapping each county's geometry as a GeoJSON Feature object, with explicit CASE mapping for three county name mismatches across datasets.

What I Would Do Differently

Use DuckDB instead of Trino for a project of this scale. Use LocalExecutor for Airflow. Start with the data model before the infrastructure.

The data model is what creates value. The infrastructure is just plumbing. I spent more time debugging the plumbing than thinking about what the data should actually say.

What's Next

The pipeline runs. The dashboards are live. The data tells a real story.

The next step is getting it to the people who can act on it: Ministry of Health Kenya, UNICEF Kenya, Amref Health Africa, the Council of Governors, Code for Kenya. The data is public. The pipeline is open source. Anyone can run it.

If you work at any of those organisations and want to talk about this data — reach out.

Github : github.com/Derrick-Ryan-Giggs/kenya-health-pipeline