Ever wondered how to build a production-grade real-time data pipeline that can handle millions of events per second? In this guide, I'll show you how to set up a complete IoT streaming platform locally using Kubernetes (Kind) that processes sensor data in real-time.
π― What We're Building
A fully functional IoT data pipeline that:
- β Generates realistic IoT sensor data (temperature, humidity, pressure)
- β Streams data through Apache Pulsar at 1000+ msg/sec
- β Processes data in real-time with Apache Flink
- β Stores analytics in ClickHouse for fast queries
- β Detects and alerts on anomalies (high temperature, low battery)
- β Runs entirely on your local machine with Kubernetes
ποΈ Architecture Overview
βββββββββββββββ βββββββββββββββ βββββββββββββββ βββββββββββββββ
β β β β β β β β
β IoT ProducerβββββββΊβ Pulsar βββββββΊβ Flink βββββββΊβ ClickHouse β
β (Java) β β (Broker) β β (Processor) β β (Storage) β
β β β β β β β β
βββββββββββββββ βββββββββββββββ βββββββββββββββ βββββββββββββββ
Sensor Data Message Queue Stream Processing Analytics DB
Tech Stack:
- Kubernetes (Kind): Local Kubernetes cluster
- Apache Pulsar: Distributed messaging and streaming
- Apache Flink: Real-time stream processing
- ClickHouse: Columnar database for analytics
- Docker: Containerization
- Maven: Build automation
π Prerequisites
Before we begin, make sure you have:
# Check versions
docker --version # Docker 20.10+
kubectl version # kubectl 1.28+
kind version # kind 0.20+
mvn --version # Maven 3.8+
java -version # Java 11+
Installation (macOS):
# Install required tools
brew install kind kubectl maven docker
Installation (Linux):
# Install Kind
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
# Install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
π Step 1: Clone the Repository
git clone https://github.com/hyperscaledesignhub/RealtimeDataPlatform.git
cd RealtimeDataPlatform/local-setup/k8s
Repository structure:
local-setup/k8s/
βββ create-cluster.sh # Kind cluster setup
βββ start-pipeline.sh # Deploy entire pipeline
βββ stop-pipeline.sh # Clean shutdown
βββ k8s-manifests/ # Kubernetes YAML files
βββ flink-jobs/ # Flink job definitions
βββ scripts/ # Helper utilities
π§ Step 2: Create the Kind Cluster
The repository includes a pre-configured Kind cluster setup:
./create-cluster.sh
What this does:
- Creates a 3-node Kubernetes cluster (1 control-plane, 2 workers)
- Configures port mappings for service access
- Sets up kubeconfig at
/tmp/iot-kubeconfig - Validates cluster readiness
Expected output:
β
Kind cluster created successfully!
Cluster Information:
NAME STATUS ROLES AGE VERSION
iot-pipeline-control-plane Ready control-plane 40d v1.32.2
iot-pipeline-worker Ready <none> 40d v1.32.2
iot-pipeline-worker2 Ready <none> 40d v1.32.2
π¬ Step 3: Deploy the IoT Pipeline
Now for the exciting part - deploying the entire pipeline:
export KUBECONFIG=/tmp/iot-kubeconfig
./start-pipeline.sh
π What Happens Behind the Scenes
The script performs these steps automatically:
1. Build Phase (~2 minutes)
# Builds the IoT producer Docker image
Building producer image...
β
iot-producer:latest built
# Compiles Flink consumer with Maven
Building Flink consumer JAR...
β
flink-consumer-1.0.0.jar created
2. Load Images into Kind
kind load docker-image iot-producer:latest --name iot-pipeline
3. Deploy Services
Creates namespace and deploys:
- Pulsar: StatefulSet with persistent storage
- ClickHouse: StatefulSet with initialization scripts
- Flink: JobManager + 2 TaskManagers
- IoT Producer: Deployment generating sensor data
4. Initialize ClickHouse Schema
The platform automatically creates the sensor data schema based on our AVRO definition:
CREATE DATABASE IF NOT EXISTS iot;
CREATE TABLE IF NOT EXISTS sensor_raw_data (
sensor_id Int32,
sensor_type Int32,
temperature Float64,
humidity Float64,
pressure Float64,
battery_level Float64,
status Int32,
timestamp DateTime64(3),
event_time DateTime64(3) DEFAULT now64()
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (sensor_id, timestamp)
SETTINGS index_granularity = 8192;
-- Create alerts table for anomaly detection
CREATE TABLE IF NOT EXISTS sensor_alerts (
sensor_id Int32,
alert_type String,
alert_time DateTime64(3),
temperature Float64,
humidity Float64,
battery_level Float64,
description String
) ENGINE = MergeTree()
ORDER BY (alert_time, sensor_id);
5. Submit Flink Job
Deploys the stream processing job that:
- Consumes from Pulsar topic:
persistent://public/default/iot-sensor-data - Deserializes AVRO sensor data using our schema
- Detects anomalies (temp > 35Β°C, humidity > 80%, battery < 20%)
- Writes processed data to ClickHouse
π Step 4: Verify the Pipeline
After deployment (~90 seconds), you'll see:
β
Pipeline is working! Data is flowing successfully.
Data Flow Status:
Sensor readings: 39
Alerts generated: 7
Sample sensor data:
βββββββββββββ³ββββββββββββββ³ββββββββββββββββ³ββββββββββββββ³ββββββββββββββββββββββ
β sensor_id β sensor_type β temperature β humidity β timestamp β
β‘ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ©
β 4 β 1 β 24.5 β 68.6 β 2025-10-26 10:07:32 β
β 3 β 1 β 28.2 β 79.3 β 2025-10-26 10:07:31 β
β 2 β 1 β 26.3 β 60.1 β 2025-10-26 10:07:30 β
βββββββββββββ΄ββββββββββββββ΄ββββββββββββββ΄ββββββββββββ΄ββββββββββββββββββββββ
Alert distribution:
ββββββββββββββββββββ³ββββββββ
β alert_type β count β
β‘βββββββββββββββββββββββββββ©
β HIGH_TEMPERATURE β 7 β
ββββββββββββββββββββ΄ββββββββ
Check All Pods Running
kubectl get pods -n iot-pipeline
Expected output:
NAME READY STATUS RESTARTS AGE
clickhouse-0 1/1 Running 0 2m
flink-jobmanager-77c4d6f6c5-v7kqv 1/1 Running 0 2m
flink-taskmanager-7d67d89fd6-5v84r 1/1 Running 0 2m
flink-taskmanager-7d67d89fd6-n7f96 1/1 Running 0 2m
iot-producer-78466d4cf4-6pskj 1/1 Running 0 90s
pulsar-0 1/1 Running 0 2m
π Step 5: Explore the Pipeline
Access Flink Dashboard
The script automatically sets up port forwarding:
# Flink UI is already accessible at:
open http://localhost:8081
What you'll see:
- Running Jobs: 1 job (
JDBC IoT Data Pipeline) - Task Managers: 2 active
- Task Slots: 4 available
- Job Graph: Visual representation of data flow
Query ClickHouse Data
# Enter ClickHouse client
kubectl exec -n iot-pipeline clickhouse-0 -- clickhouse-client -d iot
# Count total sensor readings
SELECT COUNT(*) FROM sensor_raw_data;
# Get average metrics by sensor type
SELECT
sensor_type,
COUNT(*) as reading_count,
AVG(temperature) as avg_temp,
AVG(humidity) as avg_humidity,
AVG(pressure) as avg_pressure,
AVG(battery_level) as avg_battery
FROM sensor_raw_data
GROUP BY sensor_type
ORDER BY sensor_type;
# Get recent high-temperature alerts
SELECT
sensor_id,
alert_type,
alert_time,
temperature,
description
FROM sensor_alerts
WHERE alert_type = 'HIGH_TEMPERATURE'
ORDER BY alert_time DESC
LIMIT 10;
# Monitor data ingestion rate (records per minute)
SELECT
toStartOfMinute(timestamp) as minute,
COUNT(*) as records_per_minute
FROM sensor_raw_data
WHERE timestamp >= now() - INTERVAL 10 MINUTE
GROUP BY minute
ORDER BY minute DESC;
Monitor Data Flow in Real-Time
# Watch sensor data being written
watch -n 2 "kubectl exec -n iot-pipeline clickhouse-0 -- \
clickhouse-client -d iot --query 'SELECT COUNT(*) FROM sensor_raw_data'"
# Monitor Flink job status
watch -n 5 "kubectl exec -n iot-pipeline \
\$(kubectl get pods -n iot-pipeline -l app=flink,component=jobmanager -o jsonpath='{.items[0].metadata.name}') -- \
flink list"
# Check Pulsar topic stats
kubectl exec -n iot-pipeline pulsar-0 -- \
bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data
π Performance Metrics
On a MacBook Pro (M1/M2) with 16GB RAM:
| Metric | Value |
|---|---|
| Throughput | ~1,000 msg/sec |
| End-to-end Latency | < 500ms |
| CPU Usage | ~40% (all pods) |
| Memory Usage | ~6GB total |
| Storage | ~500MB after 1 hour |
| Records/minute | ~60,000 |
π¨ Key Features Demonstrated
1. Real-Time Stream Processing
The Flink job processes AVRO-serialized sensor data:
// Flink processes data with 1-minute windows
DataStream<SensorRecord> sensorStream = env.fromSource(
pulsarSource,
WatermarkStrategy.noWatermarks(),
"Pulsar IoT Source"
);
sensorStream
.keyBy(record -> record.getSensorId())
.window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
.aggregate(new SensorAggregator())
.addSink(new ClickHouseJDBCSink(clickhouseUrl));
2. AVRO Schema Processing
Our sensor data follows the optimized AVRO schema:
// Schema highlights from our SensorData AVRO record:
// - sensorId: int (for efficiency)
// - sensorType: int (1=temp, 2=humidity, 3=pressure, etc.)
// - temperature, humidity, pressure: double
// - batteryLevel: double (percentage)
// - status: int (1=online, 2=offline, 3=maintenance, 4=error)
// - timestamp: long with logical type timestamp-millis
3. Anomaly Detection
// Alert on various conditions
if (record.getTemperature() > 35.0) {
alertSink.invoke(new Alert(
record.getSensorId(),
"HIGH_TEMPERATURE",
record.getTemperature(),
Instant.now()
));
}
if (record.getBatteryLevel() < 20.0) {
alertSink.invoke(new Alert(
record.getSensorId(),
"LOW_BATTERY",
record.getBatteryLevel(),
Instant.now()
));
}
4. Scalable Architecture
- Horizontal scaling: Add more TaskManagers for increased parallelism
- Vertical scaling: Adjust resource limits per pod
- Data partitioning: Pulsar topic partitioning for parallel processing
π οΈ Troubleshooting
Issue: Pods not starting
# Check pod status and events
kubectl describe pod -n iot-pipeline <pod-name>
# Check logs for specific errors
kubectl logs -n iot-pipeline <pod-name>
# Check resource constraints
kubectl top pods -n iot-pipeline
Issue: Flink job not running
# Check Flink JobManager logs
kubectl logs -n iot-pipeline -l app=flink,component=jobmanager
# Check TaskManager logs
kubectl logs -n iot-pipeline -l app=flink,component=taskmanager
# Access Flink CLI
kubectl exec -n iot-pipeline deploy/flink-jobmanager -- flink list
Issue: No data in ClickHouse
# Check producer logs for errors
kubectl logs -n iot-pipeline -l app=iot-producer
# Verify Pulsar topic creation
kubectl exec -n iot-pipeline pulsar-0 -- \
bin/pulsar-admin topics list public/default
# Check Pulsar message production
kubectl exec -n iot-pipeline pulsar-0 -- \
bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data
# Test ClickHouse connectivity
kubectl exec -n iot-pipeline clickhouse-0 -- \
clickhouse-client --query "SELECT version()"
Issue: Port forwarding not working
# Manual port forwarding setup
kubectl port-forward -n iot-pipeline svc/flink-jobmanager 8081:8081 &
kubectl port-forward -n iot-pipeline svc/clickhouse 8123:8123 &
# Check service endpoints
kubectl get svc -n iot-pipeline
π§Ή Cleanup
When you're done exploring:
# Stop the pipeline (keeps cluster)
./stop-pipeline.sh
# Or delete everything including cluster
kind delete cluster --name iot-pipeline
# Clean up Docker images (optional)
docker rmi iot-producer:latest
π What You've Learned
By following this guide, you've:
β
Set up a local Kubernetes cluster with Kind
β
Deployed a distributed streaming platform
β
Built a real-time data processing pipeline
β
Implemented stream processing with Apache Flink
β
Used AVRO schemas for efficient serialization
β
Stored and queried streaming data in ClickHouse
β
Implemented real-time anomaly detection
β
Monitored a production-grade data pipeline
π Next Steps
Want to take this further?
1. Scale to Production
Deploy on AWS EKS with the realtime-platform-1million-events/ setup for handling 1M+ events/sec
2. Add More Features
- Checkpointing: Implement exactly-once processing with Flink checkpoints
- Monitoring: Add Grafana dashboards and Prometheus metrics
- Windowing: Implement different time windows (sliding, session-based)
- ML Integration: Add machine learning for predictive maintenance
3. Customize the Pipeline
- Schema Evolution: Practice updating AVRO schemas without downtime
- Custom Transformations: Add complex event processing (CEP) patterns
- External APIs: Connect to weather services or IoT device management platforms
- Data Lake Integration: Archive data to S3/MinIO for long-term storage
4. Advanced Topics
- Multi-tenant Setup: Isolate different customer data streams
- Cross-region Replication: Set up geo-distributed deployments
- Security: Add authentication and encryption
- Performance Tuning: Optimize for specific workload patterns
π Resources
- Local Setup Directory: local-setup/k8s
- Main Repository: RealtimeDataPlatform
- Kind Documentation: kind.sigs.k8s.io
- Apache Flink Guides: flink.apache.org
- Apache Pulsar Docs: pulsar.apache.org
- ClickHouse Documentation: clickhouse.com/docs
- AVRO Specification: avro.apache.org
π¬ Conclusion
You now have a fully functional, production-grade IoT streaming pipeline running on your local machine! This setup demonstrates real-world patterns used by companies processing billions of events per day.
The best part? Everything runs in Docker containers orchestrated by Kubernetes, making it easy to understand, modify, and eventually deploy to production cloud environments.
Key takeaways:
- AVRO schemas provide efficient serialization and schema evolution
- Kind clusters enable realistic Kubernetes testing locally
- Stream processing patterns work the same at any scale
- Real-time analytics can be achieved with open-source tools
What would you build with this pipeline? Share your IoT project ideas in the comments! π
Found this helpful? Follow me for more posts on real-time data engineering, Kubernetes, and distributed systems!
Next in the series: "Scaling to 1 Million Events/Second - Production Deployment Guide"
π Repository Stats
β Star this repo if you found it useful!
π Issues/PRs welcome - contributions appreciated
πΌ Production ready - scales to 1M events/sec
π Well documented - complete guides included
ποΈ Local K8s Setup: local-setup/k8s directory
Tags: #kubernetes #iot #dataengineering #streaming #flink #pulsar #clickhouse #devops #avro #realtimeanalytics
Top comments (0)