Building a Real-Time Air Quality Data Pipeline for Mombasa & Nairobi
The Invisible Problem We Breathe
If you’ve ever driven through Nairobi at rush hour or felt the coastal haze in Mombasa, you’ve likely wondered:
“What exactly am I breathing right now?”
Air pollution often hides in plain sight, invisible but deadly. As a data engineer passionate about real-world impact, I decided to build a system that could listen to the air and tell us the truth in real time.
That journey became the Real-Time Air Quality Pipeline:
a streaming data architecture that fetches hourly pollutant readings, processes them instantly, and makes them queryable within seconds — all built with open-source tools.
Project Overview
This pipeline fetches air quality data (PM2.5, PM10, CO, NO₂, SO₂, Ozone, UV Index) from the Open-Meteo API for Nairobi and Mombasa, then streams it through a real-time pipeline using Kafka, MongoDB, Debezium, and Cassandra.
What It Does
- Fetches data hourly
- Streams via Kafka
- Stores raw data in MongoDB
- Uses Debezium for CDC (Change Data Capture)
- Writes processed data to Cassandra for analytics
- Fully containerized using Docker Compose
End Result: Live, queryable data on Kenya’s air quality — updated every hour.
Architecture Overview
(Open-Meteo API)
↓
[Producer - Python]
↓
[Kafka Topic - air_quality_data]
↓
[Consumer - MongoDB Writer]
↓
[MongoDB - Raw Data Storage]
↓
[Debezium CDC Connector]
↓
[Kafka CDC Topic]
↓
[Cassandra Consumer]
↓
[Cassandra - Analytics Storage]
Each block is a service, communicating in real time via Kafka topics.
Together, they form a streaming ecosystem that can handle continuous data without breaking a sweat.
Setting the Pipeline in Motion
1. Start the System
docker-compose up -d
After a few minutes, you’ll see all 9 containers running:
mongo, zookeeper, kafka, kafka-ui, mongo-connector, producer, consumer, cassandra, and cassandra-consumer.
2. Initialize Databases
MongoDB Replica Set Setup
bash storage/init-replica-set.sh
Cassandra Schema Initialization
docker exec -i cassandra cqlsh < storage/cassandra_setup.cql
3. Register Debezium CDC Connector
Debezium monitors MongoDB for new data, captures changes, and streams them out.
bash streaming/register-connector.sh
Once registered, every new air quality record inserted in MongoDB automatically triggers a CDC event.
4. System Health Check
bash health-check.sh
When all checks pass — the real-time pipeline is alive!
Deep Dive — The Data Flow in Action
Step 1: The Producer (Python)
Fetches data from Open-Meteo every hour, ensuring we only publish complete readings.
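The fetch-and-validate logic can be sketched roughly like this. It is a minimal illustration, not the repo's actual code: the city coordinates, helper names, and the completeness check are assumptions, though the pollutant field names match Open-Meteo's air-quality API parameters.

```python
import json
from urllib.request import urlopen

# Coordinates and helper names are illustrative, not taken from the repo.
CITIES = {"Nairobi": (-1.286, 36.817), "Mombasa": (-4.043, 39.668)}
FIELDS = ["pm2_5", "pm10", "carbon_monoxide", "nitrogen_dioxide",
          "sulphur_dioxide", "ozone", "uv_index"]

def build_url(lat, lon):
    """Open-Meteo air-quality endpoint requesting hourly pollutant fields."""
    return ("https://air-quality-api.open-meteo.com/v1/air-quality"
            f"?latitude={lat}&longitude={lon}&hourly={','.join(FIELDS)}")

def is_complete(reading):
    """Only publish readings where every pollutant field is present."""
    return all(reading.get(f) is not None for f in FIELDS)

def fetch_latest(city):
    """Fetch the most recent hourly reading for one city."""
    lat, lon = CITIES[city]
    data = json.load(urlopen(build_url(lat, lon)))
    hourly = data["hourly"]
    idx = len(hourly["time"]) - 1  # most recent hour in the response
    reading = {f: hourly[f][idx] for f in FIELDS}
    reading.update(city=city, timestamp=hourly["time"][idx])
    return reading

if __name__ == "__main__":
    from kafka import KafkaProducer  # kafka-python
    producer = KafkaProducer(bootstrap_servers="kafka:9092",
                             value_serializer=lambda v: json.dumps(v).encode())
    for city in CITIES:
        reading = fetch_latest(city)
        if is_complete(reading):  # ensure we only publish complete readings
            producer.send("air_quality_data", reading)
    producer.flush()
```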
Step 2: The Consumer (MongoDB Writer)
Consumes messages from Kafka and writes them as raw JSON into MongoDB.
Each entry contains pollutant levels, timestamps, and metadata for each city.
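A sketch of that consumer, with the document-shaping step pulled into a helper. The connection strings and field names are assumptions for illustration; `inserted_at` is the pipeline-entry timestamp discussed later in this post.

```python
import json
from datetime import datetime, timezone

def to_document(message_value):
    """Wrap a raw Kafka message as a MongoDB document, stamping the time
    it entered the pipeline. Field name is illustrative."""
    doc = dict(message_value)
    doc["inserted_at"] = datetime.now(timezone.utc).isoformat()
    return doc

if __name__ == "__main__":
    from kafka import KafkaConsumer   # kafka-python
    from pymongo import MongoClient   # pymongo
    consumer = KafkaConsumer(
        "air_quality_data",
        bootstrap_servers="kafka:9092",
        value_deserializer=lambda v: json.loads(v.decode()))
    coll = (MongoClient("mongodb://mongo:27017/?replicaSet=rs0")
            .air_quality.air_quality_raw)
    for msg in consumer:
        coll.insert_one(to_document(msg.value))  # raw JSON into MongoDB
```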
Step 3: Debezium CDC Connector
Debezium detects new inserts in MongoDB and publishes “change events” to a Kafka CDC topic.
Step 4: Cassandra Consumer
Reads CDC events, cleans the data, skips incomplete values, and inserts time-series records into Cassandra.
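The clean-and-skip logic can be sketched as a small parser for Debezium change events. In the MongoDB connector's envelope, `payload.after` carries the inserted document as a JSON string and `op` is `"c"` for inserts; the required-field list and table layout below are assumptions based on the queries later in this post.

```python
import json

# Fields assumed required before a row is written to Cassandra.
REQUIRED = ["city", "timestamp", "pm2_5", "pm10", "ozone"]

def extract_insert(event):
    """Pull the inserted document out of a Debezium change event.
    Returns None for non-insert events or incomplete readings."""
    payload = event.get("payload") or {}
    if payload.get("op") != "c" or not payload.get("after"):
        return None
    doc = json.loads(payload["after"])
    if any(doc.get(f) is None for f in REQUIRED):
        return None  # skip incomplete values, as described above
    return doc

if __name__ == "__main__":
    from cassandra.cluster import Cluster  # cassandra-driver
    session = Cluster(["cassandra"]).connect("air_quality_analytics")
    insert = session.prepare(
        "INSERT INTO air_quality_readings (city, timestamp, pm2_5, pm10, ozone) "
        "VALUES (?, ?, ?, ?, ?)")
    # Then, for each valid doc pulled off the CDC topic:
    # session.execute(insert, (doc["city"], doc["timestamp"],
    #                          doc["pm2_5"], doc["pm10"], doc["ozone"]))
```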
Monitoring & Dashboards
With Kafka UI, you can see your streaming data live:
- All active topics
- Real-time message flow for each city
- Consumers processing messages without lag
Querying the Data
Raw Data in MongoDB
db.air_quality_raw.find().sort({_id: -1}).limit(5)
Raw data including PM2.5, ozone, and NO₂ readings.
Analytics Data in Cassandra
SELECT city, timestamp, pm2_5, pm10, ozone
FROM air_quality_analytics.air_quality_readings
WHERE city='Nairobi' LIMIT 5;
Structured air quality readings optimized for analysis.
Timestamp Insight
Each record has two timestamps:
- timestamp: when the reading was captured
- inserted_at: when it entered the pipeline
This lets you track latency and data freshness — crucial for real-time systems.
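Given the two timestamps, per-record latency is a simple subtraction. A minimal sketch, assuming both fields are stored as ISO-8601 strings:

```python
from datetime import datetime

def ingest_latency_seconds(timestamp, inserted_at):
    """Seconds between when a reading was captured (timestamp) and when
    it entered the pipeline (inserted_at), both ISO-8601 strings."""
    captured = datetime.fromisoformat(timestamp)
    ingested = datetime.fromisoformat(inserted_at)
    return (ingested - captured).total_seconds()
```

Aggregating this value over a day of records gives a quick freshness report for the whole pipeline.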
What You’ll Learn
Building this pipeline teaches core data engineering concepts:
| Concept | What You’ll Learn |
|---|---|
| Streaming Systems | How to build and manage real-time Kafka pipelines |
| CDC (Change Data Capture) | Tracking database changes with Debezium |
| Multi-Database Architecture | Choosing MongoDB for raw data, Cassandra for analytics |
| Distributed Systems | Managing replication and eventual consistency |
| Containerization | Deploying complex pipelines with Docker Compose |
Future Enhancements
This is just the beginning — imagine expanding this into a nationwide environmental dashboard.
Next Steps:
- Add more cities (Kisumu, Eldoret, Nakuru)
- Create Grafana dashboards for AQI visualization
- Add SMS or Slack alerts for dangerous readings
- Integrate ML for forecasting and anomaly detection
- Build an API or GraphQL endpoint for app developers
Why It Matters
Data shouldn’t live in spreadsheets — it should live in motion.
By streaming real-time air quality data, we can give cities, developers, and citizens live awareness of environmental health.
Projects like this can inform policy, support research, and raise awareness about what’s really in our air.
“Data is the new air — you can’t see it, but everything depends on it.”
Learn More & Contribute
Author: Samwel Oliver
GitHub: @25thOliver
Email: oliversamwel33@gmail.com
Explore the Full Project on GitHub