Hey Devs,

I'm Mohamed Hussain S, currently working as an Associate Data Engineer Intern.

After building a batch pipeline with Airflow and Postgres, I wanted to step into the real-time data world, so I created this lightweight Kafka-to-ClickHouse pipeline.

If you're curious how streaming data pipelines actually work (beyond just theory), this one's for you.
## What This Project Does

- Generates mock user data (name, email, age)
- Sends each message to a Kafka topic called `user-signups` (see the producer sketch below)
- A ClickHouse Kafka engine table listens for those messages
- A materialized view pushes clean data into a persistent table
- All of this runs in Docker for easy setup and teardown
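Here's a minimal sketch of what that producer can look like. It assumes the kafka-python client and a broker on localhost:9092; the names list, record shape, and message count are illustrative guesses, not the repo's exact code.

```python
import json
import random

from kafka import KafkaProducer  # pip install kafka-python

# Serialize each record as a JSON line so ClickHouse can read it as JSONEachRow
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

NAMES = ["Alice", "Bob", "Carol", "Dave"]  # illustrative only

def mock_user():
    """Build one fake signup record: name, email, age."""
    name = random.choice(NAMES)
    return {
        "name": name,
        "email": f"{name.lower()}@example.com",
        "age": random.randint(18, 65),
    }

for _ in range(10):
    producer.send("user-signups", mock_user())

producer.flush()  # block until all buffered messages are delivered
```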
It's super lightweight and totally beginner-friendly: perfect for learning how Kafka and ClickHouse can work together.
## Tech Stack

- Python: Kafka producer to simulate user signups
- Kafka: distributed streaming platform
- ClickHouse: OLAP database with native Kafka support
- Docker: to spin up Kafka, Zookeeper, and ClickHouse
- SQL: to define engine tables and views in ClickHouse
## Project Structure

```
kafka-clickhouse-pipeline/
├── producer/             # Python Kafka producer
├── clickhouse-setup.sql  # SQL to set up ClickHouse tables
├── docker-compose.yml    # All services defined here
├── screenshots/          # CLI outputs, topic messages, etc.
└── README.md             # Everything documented here
```
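For context, a minimal docker-compose.yml for this kind of three-service stack might look roughly like the sketch below. The image tags, ports, and listener settings are my assumptions, not necessarily what the repo uses:

```yaml
version: "3.8"

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Two listeners: one for other containers (kafka:29092), one for the host (localhost:9092)
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
      KAFKA_LISTENERS: INTERNAL://0.0.0.0:29092,EXTERNAL://0.0.0.0:9092
      KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka:29092,EXTERNAL://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  clickhouse:
    image: clickhouse/clickhouse-server:24.3
    depends_on:
      - kafka
    ports:
      - "8123:8123"   # HTTP interface
      - "9000:9000"   # native client
```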
## How It Works

- Run `docker-compose up` to spin up Kafka, Zookeeper, and ClickHouse
- Run the SQL file to create the Kafka engine table, the materialized view, and the target `users` table (a sketch follows this list)
- Start the Python producer, which sends mock user data to Kafka
- ClickHouse listens to the topic and stores the data via the materialized view
- Boom: your real-time pipeline is up and running!
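If you haven't seen ClickHouse's Kafka engine before, the setup SQL plausibly looks something like this. The `users` table name comes from the post; the queue/view names, column types, and settings are my guesses:

```sql
-- Kafka engine table: a live subscription to the topic, not storage
CREATE TABLE users_queue
(
    name  String,
    email String,
    age   UInt8
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:29092',
         kafka_topic_list = 'user-signups',
         kafka_group_name = 'clickhouse-users',
         kafka_format = 'JSONEachRow';

-- Persistent target table
CREATE TABLE users
(
    name  String,
    email String,
    age   UInt8
)
ENGINE = MergeTree
ORDER BY name;

-- Materialized view: continuously moves rows from the queue into users
CREATE MATERIALIZED VIEW users_mv TO users AS
SELECT name, email, age
FROM users_queue;
```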
## Example Output

A single message sent to Kafka looks like this:

```json
{"name": "Alice", "email": "alice@example.com", "age": 24}
```
And the `users` table in ClickHouse will store it like this:

| name  | email             | age |
|-------|-------------------|-----|
| Alice | alice@example.com | 24  |
Check the `screenshots/` folder in the repo to see the whole thing in action.
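To spot-check ingestion yourself, a plain query against the target table (assuming the schema sketched above) is enough:

```sql
-- Confirm the materialized view is landing rows in the persistent table
SELECT name, email, age
FROM users
ORDER BY name;
```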
## Key Learnings

- How Kafka producers work with Python
- Setting up Kafka topics and brokers in Docker
- How ClickHouse can natively consume Kafka messages
- How materialized views automate transformation and insertion
- Containerized orchestration made simple with Docker
## What's Next?

- Add a proper Kafka consumer (Python-based) as an alternative to ClickHouse's built-in ingestion (rough sketch below)
- Add logging, retries, and dead-letter queue logic
- Simulate more complex streaming use cases like page visits
- Plug in Grafana for real-time metrics from ClickHouse
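That standalone consumer could start as small as this, again assuming kafka-python; the group id is a placeholder:

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Read the topic from the beginning and print each signup record
consumer = KafkaConsumer(
    "user-signups",
    bootstrap_servers="localhost:9092",
    group_id="signup-printer",  # placeholder group id
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    print(message.value)  # e.g. {'name': 'Alice', 'email': 'alice@example.com', 'age': 24}
```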
## Why You Should Try This

If you're exploring real-time data engineering:

- Start with Kafka and Python: it's intuitive and powerful
- ClickHouse's Kafka engine plus a materialized view is a powerful combination
- Docker lets you test and learn without messing up your local setup

This small project helped me understand the data flow in real-time systems, not just conceptually but hands-on.
## Repo

GitHub Repo:
## About Me

Mohamed Hussain S
Associate Data Engineer Intern
LinkedIn | GitHub

Building in public, one stream at a time.