From smog to streams: how data engineering helps us breathe easier.

Building a Real-Time Air Quality Data Pipeline for Mombasa & Nairobi


The Invisible Problem We Breathe

If you’ve ever driven through Nairobi at rush hour or felt the coastal haze in Mombasa, you’ve likely wondered:

“What exactly am I breathing right now?”

Air pollution often hides in plain sight, invisible but deadly. As a data engineer passionate about real-world impact, I decided to build a system that could listen to the air and tell us the truth in real time.

That journey became the Real-Time Air Quality Pipeline:
a streaming data architecture that fetches hourly pollutant readings, processes them instantly, and makes them queryable within seconds — all built with open-source tools.


Project Overview

This pipeline fetches air quality data (PM2.5, PM10, CO, NO₂, SO₂, Ozone, UV Index) from the Open-Meteo API for Nairobi and Mombasa, then streams it through a real-time pipeline using Kafka, MongoDB, Debezium, and Cassandra.

What It Does

  • Fetches data hourly
  • Streams via Kafka
  • Stores raw data in MongoDB
  • Uses Debezium for CDC (Change Data Capture)
  • Writes processed data to Cassandra for analytics
  • Fully containerized using Docker Compose

End Result: Live, queryable data on Kenya’s air quality — updated every hour.


Architecture Overview

  [Open-Meteo API]
       ↓
  [Producer - Python]
       ↓
  [Kafka Topic - air_quality_data]
       ↓
  [Consumer - MongoDB Writer]
       ↓
  [MongoDB - Raw Data Storage]
       ↓
  [Debezium CDC Connector]
       ↓
  [Kafka CDC Topic]
       ↓
  [Cassandra Consumer]
       ↓
  [Cassandra - Analytics Storage]

Each block is a separate service, communicating in real time via Kafka topics.
Together, they form a streaming ecosystem that can handle continuous data without breaking a sweat.


Setting the Pipeline in Motion

1. Start the System

docker-compose up -d

Docker Compose starting all services

After a few minutes, you’ll see all 9 containers running:

  • mongo, zookeeper, kafka, kafka-ui, mongo-connector, producer, consumer, cassandra, and cassandra-consumer.

All services live in Docker
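If you're wondering how these services hang together, here is an abbreviated docker-compose.yml sketch. Image tags, ports, and environment variables are illustrative, not the project's exact configuration:

# Illustrative excerpt; see the repository for the full docker-compose.yml
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on: [zookeeper]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092

  mongo:
    image: mongo:6
    command: ["--replSet", "rs0"]   # Debezium reads the oplog, so MongoDB runs as a replica set

  producer:
    build: ./producer
    depends_on: [kafka]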


2. Initialize Databases

MongoDB Replica Set Setup

bash storage/init-replica-set.sh

MongoDB replica set initialized successfully
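Why a replica set for a single MongoDB node? Debezium captures changes from MongoDB's oplog, which only exists in replica-set mode. A minimal version of the script boils down to something like this (container name, replica set name, and host are assumptions):

# Minimal sketch of what init-replica-set.sh does; the real script may differ
docker exec mongo mongosh --eval '
  rs.initiate({
    _id: "rs0",
    members: [{ _id: 0, host: "mongo:27017" }]
  })
'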

Cassandra Schema Initialization

docker exec -i cassandra cqlsh < storage/cassandra_setup.cql
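For reference, a minimal cassandra_setup.cql could look like the sketch below. The keyspace and table names match the analytics query later in this post; the full column list and replication settings are assumptions:

-- Illustrative schema sketch; see storage/cassandra_setup.cql for the real definition
CREATE KEYSPACE IF NOT EXISTS air_quality_analytics
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS air_quality_analytics.air_quality_readings (
  city text,
  timestamp timestamp,
  pm2_5 double,
  pm10 double,
  carbon_monoxide double,
  nitrogen_dioxide double,
  sulphur_dioxide double,
  ozone double,
  uv_index double,
  inserted_at timestamp,
  PRIMARY KEY (city, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);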

3. Register Debezium CDC Connector

Debezium monitors MongoDB for new data, captures changes, and streams them out.

bash streaming/register-connector.sh

Debezium connector registered and active

Once registered, every new air quality record inserted in MongoDB automatically triggers a CDC event.
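Under the hood, register-connector.sh usually just POSTs a connector config to Kafka Connect's REST API. The sketch below assumes Debezium 2.x property names, Connect listening on port 8083, and an air_quality database; check streaming/register-connector.sh for the real values:

# Illustrative registration call; connector name, database, and property keys are assumptions
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "mongo-air-quality-connector",
    "config": {
      "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
      "mongodb.connection.string": "mongodb://mongo:27017/?replicaSet=rs0",
      "topic.prefix": "cdc",
      "database.include.list": "air_quality",
      "collection.include.list": "air_quality.air_quality_raw"
    }
  }'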


4. System Health Check

bash health-check.sh

All services healthy and connected

When all checks pass — the real-time pipeline is alive!
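If you want to write a similar check yourself, it can be as simple as poking each service and failing loudly when one doesn't answer. A rough sketch (container names from the list above, ports and CLI flags assumed):

#!/usr/bin/env bash
# Rough health-check sketch; the project's real script is health-check.sh
set -e

echo "MongoDB:" && docker exec mongo mongosh --quiet --eval 'db.runCommand({ ping: 1 })'
echo "Kafka topics:" && docker exec kafka kafka-topics --bootstrap-server localhost:9092 --list
echo "Debezium connectors:" && curl -s http://localhost:8083/connectors
echo "Cassandra:" && docker exec cassandra cqlsh -e "DESCRIBE KEYSPACES"
echo "All checks passed"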


Deep Dive — The Data Flow in Action

Step 1: The Producer (Python)

Fetches data from Open-Meteo every hour, ensuring we only publish complete readings.

Producer fetching and publishing new air quality data
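Conceptually it's a small loop: call the Open-Meteo air-quality endpoint for each city, keep only readings where every pollutant is present, and publish JSON to Kafka. The sketch below uses kafka-python and the public Open-Meteo air-quality API; the topic name matches the architecture diagram, but field handling and error handling are simplified:

# Simplified producer sketch; the real producer adds scheduling, logging, and retries
import json, time
import requests
from kafka import KafkaProducer

CITIES = {
    "Nairobi": (-1.2921, 36.8219),
    "Mombasa": (-4.0435, 39.6682),
}
POLLUTANTS = "pm2_5,pm10,carbon_monoxide,nitrogen_dioxide,sulphur_dioxide,ozone,uv_index"

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def fetch_city(lat, lon):
    resp = requests.get(
        "https://air-quality-api.open-meteo.com/v1/air-quality",
        params={"latitude": lat, "longitude": lon, "hourly": POLLUTANTS},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

while True:
    for city, (lat, lon) in CITIES.items():
        hourly = fetch_city(lat, lon).get("hourly", {})
        # Take the most recent value of each hourly series (including its timestamp)
        latest = {k: v[-1] for k, v in hourly.items() if v}
        # Publish only if every pollutant is present in the latest reading
        if all(latest.get(p) is not None for p in POLLUTANTS.split(",")):
            producer.send("air_quality_data", {"city": city, **latest})
    producer.flush()
    time.sleep(3600)  # hourly cadence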


Step 2: The Consumer (MongoDB Writer)

Consumes messages from Kafka and writes them as raw JSON into MongoDB.

MongoDB consumer writing data

Each entry contains pollutant levels, timestamps, and metadata for each city.
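The writer is the producer's mirror image: subscribe to the topic, deserialize, stamp the record with an insertion time, and persist it. A minimal sketch with kafka-python and pymongo (the database and collection names are assumptions, matched to the query shown later):

# Minimal MongoDB-writer sketch; database and collection names are assumptions
import json
from datetime import datetime, timezone
from kafka import KafkaConsumer
from pymongo import MongoClient

consumer = KafkaConsumer(
    "air_quality_data",
    bootstrap_servers="kafka:9092",
    group_id="mongo-writer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

collection = MongoClient("mongodb://mongo:27017/?replicaSet=rs0")["air_quality"]["air_quality_raw"]

for message in consumer:
    record = message.value
    record["inserted_at"] = datetime.now(timezone.utc)  # pipeline-entry timestamp
    collection.insert_one(record)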


Step 3: Debezium CDC Connector

Debezium detects new inserts in MongoDB and publishes “change events” to a Kafka CDC topic.

Debezium CDC connector running
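For MongoDB sources, Debezium wraps each change in an envelope: the inserted document travels as a JSON string in the after field, alongside the operation type and a timestamp. A trimmed, illustrative event (the exact structure depends on the Debezium version and converter settings):

{
  "payload": {
    "op": "c",
    "ts_ms": 1730000000000,
    "source": { "db": "air_quality", "collection": "air_quality_raw" },
    "after": "{\"city\": \"Nairobi\", \"pm2_5\": 18.4, \"pm10\": 33.1}"
  }
}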


Step 4: Cassandra Consumer

Reads CDC events, cleans the data, skips incomplete values, and inserts time-series records into Cassandra.

Cassandra consumer logs showing inserted readings
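In sketch form, that means unwrapping the Debezium envelope, parsing the embedded document, skipping anything incomplete, and writing one row per reading. The CDC topic, keyspace, and column names below follow the earlier examples and are assumptions:

# Simplified Cassandra-consumer sketch; topic, keyspace, and columns follow the earlier examples
import json
from datetime import datetime, timezone
from kafka import KafkaConsumer
from cassandra.cluster import Cluster

consumer = KafkaConsumer(
    "cdc.air_quality.air_quality_raw",      # CDC topic name is an assumption
    bootstrap_servers="kafka:9092",
    group_id="cassandra-writer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

session = Cluster(["cassandra"]).connect("air_quality_analytics")
insert = session.prepare(
    "INSERT INTO air_quality_readings (city, timestamp, pm2_5, pm10, ozone, inserted_at) "
    "VALUES (?, ?, ?, ?, ?, ?)"
)

REQUIRED = ("city", "time", "pm2_5", "pm10", "ozone")

for message in consumer:
    value = message.value or {}
    payload = value.get("payload", value)     # handle envelopes with or without a schema wrapper
    if payload.get("op") != "c" or not payload.get("after"):
        continue                               # only handle inserts
    doc = json.loads(payload["after"])         # Debezium ships the Mongo document as a JSON string
    if any(doc.get(field) is None for field in REQUIRED):
        continue                               # skip incomplete readings
    session.execute(insert, (
        doc["city"],
        datetime.fromisoformat(doc["time"]),
        doc["pm2_5"], doc["pm10"], doc["ozone"],
        datetime.now(timezone.utc),
    ))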


Monitoring & Dashboards

With Kafka UI, you can see your streaming data live.

Kafka UI topics overview
Kafka UI displaying all active topics.

Live messages in the air_quality_data topic
Real-time message flow for each city.

Kafka consumer group statuses
Consumers processing messages without lag.


Querying the Data

Raw Data in MongoDB

db.air_quality_raw.find().sort({_id: -1}).limit(5)

MongoDB query showing pollutant values
Raw data including PM2.5, ozone, and NO₂ readings.


Analytics Data in Cassandra

SELECT city, timestamp, pm2_5, pm10, ozone 
FROM air_quality_analytics.air_quality_readings 
WHERE city='Nairobi' LIMIT 5;

Cassandra analytics results
Structured air quality readings optimized for analysis.


Timestamp Insight

Each record has two timestamps:

  • timestamp: when the reading was captured
  • inserted_at: when it entered the pipeline

This lets you track latency and data freshness — crucial for real-time systems.
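For example, a quick way to eyeball end-to-end latency is to diff the two timestamps on a few recent rows (a rough sketch against the Cassandra table queried above):

# Rough freshness check; table and column names follow the earlier examples
from cassandra.cluster import Cluster

session = Cluster(["localhost"]).connect("air_quality_analytics")
rows = session.execute(
    "SELECT city, timestamp, inserted_at FROM air_quality_readings "
    "WHERE city='Nairobi' LIMIT 5"
)
for row in rows:
    lag = row.inserted_at - row.timestamp
    print(f"{row.city} reading at {row.timestamp}: ingested {lag} later")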


What You’ll Learn

Building this pipeline teaches core data engineering concepts:

  • Streaming Systems: how to build and manage real-time Kafka pipelines
  • CDC (Change Data Capture): tracking database changes with Debezium
  • Multi-Database Architecture: choosing MongoDB for raw data and Cassandra for analytics
  • Distributed Systems: managing replication and eventual consistency
  • Containerization: deploying complex pipelines with Docker Compose

Future Enhancements

This is just the beginning — imagine expanding this into a nationwide environmental dashboard.

Next Steps:

  • Add more cities (Kisumu, Eldoret, Nakuru)
  • Create Grafana dashboards for AQI visualization
  • Add SMS or Slack alerts for dangerous readings
  • Integrate ML for forecasting and anomaly detection
  • Build an API or GraphQL endpoint for app developers

Why It Matters

Data shouldn’t live in spreadsheets — it should live in motion.

By streaming real-time air quality data, we can give cities, developers, and citizens live awareness of environmental health.
Projects like this can inform policy, support research, and raise awareness about what’s really in our air.

“Data is the new air — you can’t see it, but everything depends on it.”


Learn More & Contribute

Author: Samwel Oliver
GitHub: @25thOliver
Email: oliversamwel33@gmail.com

Explore the Full Project on GitHub

