Building a Real-Time Air Quality Data Pipeline for Mombasa & Nairobi
The Invisible Problem We Breathe
If you’ve ever driven through Nairobi at rush hour or felt the coastal haze in Mombasa, you’ve likely wondered:
“What exactly am I breathing right now?”
Air pollution often hides in plain sight, invisible but deadly. As a data engineer passionate about real-world impact, I decided to build a system that could listen to the air and tell us the truth in real time.
That journey became the Real-Time Air Quality Pipeline:
a streaming data architecture that fetches hourly pollutant readings, processes them instantly, and makes them queryable within seconds — all built with open-source tools.
Project Overview
This pipeline fetches air quality data (PM2.5, PM10, CO, NO₂, SO₂, Ozone, UV Index) from the Open-Meteo API for Nairobi and Mombasa, then streams it through a real-time pipeline using Kafka, MongoDB, Debezium, and Cassandra.
What It Does
- Fetches data hourly
- Streams via Kafka
- Stores raw data in MongoDB
- Uses Debezium for CDC (Change Data Capture)
- Writes processed data to Cassandra for analytics
- Fully containerized using Docker Compose
End Result: Live, queryable data on Kenya’s air quality — updated every hour.
Architecture Overview
(Open-Meteo API)
↓
[Producer - Python]
↓
[Kafka Topic - air_quality_data]
↓
[Consumer - MongoDB Writer]
↓
[MongoDB - Raw Data Storage]
↓
[Debezium CDC Connector]
↓
[Kafka CDC Topic]
↓
[Cassandra Consumer]
↓
[Cassandra - Analytics Storage]
Each block is a service, communicating in real time via Kafka topics.
Together, they form a streaming ecosystem that can handle continuous data without breaking a sweat.
Setting the Pipeline in Motion
1. Start the System
docker-compose up -d
After a few minutes, you’ll see all 9 containers running:
mongo, zookeeper, kafka, kafka-ui, mongo-connector, producer, consumer, cassandra, and cassandra-consumer.
2. Initialize Databases
MongoDB Replica Set Setup
bash storage/init-replica-set.sh
Cassandra Schema Initialization
docker exec -i cassandra cqlsh < storage/cassandra_setup.cql
3. Register Debezium CDC Connector
Debezium monitors MongoDB for new data, captures changes, and streams them out.
bash streaming/register-connector.sh
Once registered, every new air quality record inserted in MongoDB automatically triggers a CDC event.
4. System Health Check
bash health-check.sh
When all checks pass — the real-time pipeline is alive!
Deep Dive — The Data Flow in Action
Step 1: The Producer (Python)
Fetches data from Open-Meteo every hour, ensuring we only publish complete readings.
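The fetch-and-validate logic can be sketched roughly like this. It is a minimal illustration, not the repo's actual code: the city coordinates, helper names, and the completeness check are assumptions, though the pollutant field names match Open-Meteo's air-quality API parameters.

```python
import json
from urllib.request import urlopen

# Coordinates and helper names are illustrative, not taken from the repo.
CITIES = {"Nairobi": (-1.286, 36.817), "Mombasa": (-4.043, 39.668)}
FIELDS = ["pm2_5", "pm10", "carbon_monoxide", "nitrogen_dioxide",
          "sulphur_dioxide", "ozone", "uv_index"]

def build_url(lat, lon):
    """Open-Meteo air-quality endpoint requesting hourly pollutant fields."""
    return ("https://air-quality-api.open-meteo.com/v1/air-quality"
            f"?latitude={lat}&longitude={lon}&hourly={','.join(FIELDS)}")

def is_complete(reading):
    """Only publish readings where every pollutant field is present."""
    return all(reading.get(f) is not None for f in FIELDS)

def fetch_latest(city):
    """Fetch the most recent hourly reading for one city."""
    lat, lon = CITIES[city]
    data = json.load(urlopen(build_url(lat, lon)))
    hourly = data["hourly"]
    idx = len(hourly["time"]) - 1  # most recent hour in the response
    reading = {f: hourly[f][idx] for f in FIELDS}
    reading.update(city=city, timestamp=hourly["time"][idx])
    return reading

if __name__ == "__main__":
    from kafka import KafkaProducer  # kafka-python
    producer = KafkaProducer(bootstrap_servers="kafka:9092",
                             value_serializer=lambda v: json.dumps(v).encode())
    for city in CITIES:
        reading = fetch_latest(city)
        if is_complete(reading):  # ensure we only publish complete readings
            producer.send("air_quality_data", reading)
    producer.flush()
```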
Step 2: The Consumer (MongoDB Writer)
Consumes messages from Kafka and writes them as raw JSON into MongoDB.
Each entry contains pollutant levels, timestamps, and metadata for each city.
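A sketch of that consumer, with the document-shaping step pulled into a helper. The connection strings and field names are assumptions for illustration; `inserted_at` is the pipeline-entry timestamp discussed later in this post.

```python
import json
from datetime import datetime, timezone

def to_document(message_value):
    """Wrap a raw Kafka message as a MongoDB document, stamping the time
    it entered the pipeline. Field name is illustrative."""
    doc = dict(message_value)
    doc["inserted_at"] = datetime.now(timezone.utc).isoformat()
    return doc

if __name__ == "__main__":
    from kafka import KafkaConsumer   # kafka-python
    from pymongo import MongoClient   # pymongo
    consumer = KafkaConsumer(
        "air_quality_data",
        bootstrap_servers="kafka:9092",
        value_deserializer=lambda v: json.loads(v.decode()))
    coll = (MongoClient("mongodb://mongo:27017/?replicaSet=rs0")
            .air_quality.air_quality_raw)
    for msg in consumer:
        coll.insert_one(to_document(msg.value))  # raw JSON into MongoDB
```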
Step 3: Debezium CDC Connector
Debezium detects new inserts in MongoDB and publishes “change events” to a Kafka CDC topic.
Step 4: Cassandra Consumer
Reads CDC events, cleans the data, skips incomplete values, and inserts time-series records into Cassandra.
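The clean-and-skip logic can be sketched as a small parser for Debezium change events. In the MongoDB connector's envelope, `payload.after` carries the inserted document as a JSON string and `op` is `"c"` for inserts; the required-field list and table layout below are assumptions based on the queries later in this post.

```python
import json

# Fields assumed required before a row is written to Cassandra.
REQUIRED = ["city", "timestamp", "pm2_5", "pm10", "ozone"]

def extract_insert(event):
    """Pull the inserted document out of a Debezium change event.
    Returns None for non-insert events or incomplete readings."""
    payload = event.get("payload") or {}
    if payload.get("op") != "c" or not payload.get("after"):
        return None
    doc = json.loads(payload["after"])
    if any(doc.get(f) is None for f in REQUIRED):
        return None  # skip incomplete values, as described above
    return doc

if __name__ == "__main__":
    from cassandra.cluster import Cluster  # cassandra-driver
    session = Cluster(["cassandra"]).connect("air_quality_analytics")
    insert = session.prepare(
        "INSERT INTO air_quality_readings (city, timestamp, pm2_5, pm10, ozone) "
        "VALUES (?, ?, ?, ?, ?)")
    # Then, for each valid doc pulled off the CDC topic:
    # session.execute(insert, (doc["city"], doc["timestamp"],
    #                          doc["pm2_5"], doc["pm10"], doc["ozone"]))
```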
Monitoring & Dashboards
With Kafka UI, you can see your streaming data live:
- All active topics
- Real-time message flow for each city
- Consumers processing messages without lag
Querying the Data
Raw Data in MongoDB
db.air_quality_raw.find().sort({_id: -1}).limit(5)
Raw data including PM2.5, ozone, and NO₂ readings.
Analytics Data in Cassandra
SELECT city, timestamp, pm2_5, pm10, ozone
FROM air_quality_analytics.air_quality_readings
WHERE city='Nairobi' LIMIT 5;
Structured air quality readings optimized for analysis.
Timestamp Insight
Each record has two timestamps:
- timestamp: when the reading was captured
- inserted_at: when it entered the pipeline
This lets you track latency and data freshness — crucial for real-time systems.
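Given the two timestamps, per-record latency is a simple subtraction. A minimal sketch, assuming both fields are stored as ISO-8601 strings:

```python
from datetime import datetime

def ingest_latency_seconds(timestamp, inserted_at):
    """Seconds between when a reading was captured (timestamp) and when
    it entered the pipeline (inserted_at), both ISO-8601 strings."""
    captured = datetime.fromisoformat(timestamp)
    ingested = datetime.fromisoformat(inserted_at)
    return (ingested - captured).total_seconds()
```

Aggregating this value over a day of records gives a quick freshness report for the whole pipeline.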
What You’ll Learn
Building this pipeline teaches core data engineering concepts:
| Concept | What You’ll Learn |
|---|---|
| Streaming Systems | How to build and manage real-time Kafka pipelines |
| CDC (Change Data Capture) | Tracking database changes with Debezium |
| Multi-Database Architecture | Choosing MongoDB for raw data, Cassandra for analytics |
| Distributed Systems | Managing replication and eventual consistency |
| Containerization | Deploying complex pipelines with Docker Compose |
Future Enhancements
This is just the beginning — imagine expanding this into a nationwide environmental dashboard.
Next Steps:
- Add more cities (Kisumu, Eldoret, Nakuru)
- Create Grafana dashboards for AQI visualization
- Add SMS or Slack alerts for dangerous readings
- Integrate ML for forecasting and anomaly detection
- Build an API or GraphQL endpoint for app developers
Why It Matters
Data shouldn’t live in spreadsheets — it should live in motion.
By streaming real-time air quality data, we can give cities, developers, and citizens live awareness of environmental health.
Projects like this can inform policy, support research, and raise awareness about what’s really in our air.
“Data is the new air — you can’t see it, but everything depends on it.”
Learn More & Contribute
Author: Samwel Oliver
GitHub: @25thOliver
Email: oliversamwel33@gmail.com
Explore the Full Project on GitHub