Building a Real-Time Data Platform with Kubernetes (Kind) - A Complete Local Setup Guide

Ever wondered how to build a production-grade real-time data pipeline that can handle millions of events per second? In this guide, I'll show you how to set up a complete IoT streaming platform locally using Kubernetes (Kind) that processes sensor data in real-time.

🎯 What We're Building

A fully functional IoT data pipeline that:

  • ✅ Generates realistic IoT sensor data (temperature, humidity, pressure)
  • ✅ Streams data through Apache Pulsar at 1,000+ msg/sec
  • ✅ Processes data in real-time with Apache Flink
  • ✅ Stores analytics in ClickHouse for fast queries
  • ✅ Detects and alerts on anomalies (high temperature, low battery)
  • ✅ Runs entirely on your local machine with Kubernetes

πŸ—οΈ Architecture Overview

┌─────────────┐      ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│             │      │             │      │             │      │             │
│ IoT Producer├─────►│   Pulsar    ├─────►│    Flink    ├─────►│ ClickHouse  │
│  (Java)     │      │  (Broker)   │      │ (Processor) │      │  (Storage)  │
│             │      │             │      │             │      │             │
└─────────────┘      └─────────────┘      └─────────────┘      └─────────────┘
   Sensor Data     Message Queue      Stream Processing     Analytics DB

Tech Stack:

  • Kubernetes (Kind): Local Kubernetes cluster
  • Apache Pulsar: Distributed messaging and streaming
  • Apache Flink: Real-time stream processing
  • ClickHouse: Columnar database for analytics
  • Docker: Containerization
  • Maven: Build automation

📋 Prerequisites

Before we begin, make sure you have:

# Check versions
docker --version      # Docker 20.10+
kubectl version --client   # kubectl 1.28+
kind version         # kind 0.20+
mvn --version        # Maven 3.8+
java -version        # Java 11+

Installation (macOS):

# Install required tools
brew install kind kubectl maven
brew install --cask docker   # Docker Desktop

Installation (Linux):

# Install Kind
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

# Install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/

🚀 Step 1: Clone the Repository

git clone https://github.com/hyperscaledesignhub/RealtimeDataPlatform.git
cd RealtimeDataPlatform/local-setup/k8s

Repository structure:

local-setup/k8s/
├── create-cluster.sh         # Kind cluster setup
├── start-pipeline.sh         # Deploy entire pipeline
├── stop-pipeline.sh          # Clean shutdown
├── k8s-manifests/            # Kubernetes YAML files
├── flink-jobs/               # Flink job definitions
└── scripts/                  # Helper utilities

🔧 Step 2: Create the Kind Cluster

The repository includes a pre-configured Kind cluster setup:

./create-cluster.sh

What this does:

  • Creates a 3-node Kubernetes cluster (1 control-plane, 2 workers)
  • Configures port mappings for service access
  • Sets up kubeconfig at /tmp/iot-kubeconfig
  • Validates cluster readiness
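
Under the hood, the node layout comes from a standard Kind config. Here's a minimal sketch of the idea (a hedged approximation; the repo's actual config, port mappings, and node image may differ):

# Hypothetical equivalent of what create-cluster.sh sets up
cat <<'EOF' | kind create cluster --name iot-pipeline \
  --kubeconfig /tmp/iot-kubeconfig --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      - containerPort: 30081   # assumed NodePort for the Flink UI
        hostPort: 8081
  - role: worker
  - role: worker
EOF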

Expected output:

✅ Kind cluster created successfully!

Cluster Information:
NAME                         STATUS   ROLES           AGE   VERSION
iot-pipeline-control-plane   Ready    control-plane   1m    v1.32.2
iot-pipeline-worker          Ready    <none>          1m    v1.32.2
iot-pipeline-worker2         Ready    <none>          1m    v1.32.2

🎬 Step 3: Deploy the IoT Pipeline

Now for the exciting part - deploying the entire pipeline:

export KUBECONFIG=/tmp/iot-kubeconfig
./start-pipeline.sh

πŸ” What Happens Behind the Scenes

The script performs these steps automatically:

1. Build Phase (~2 minutes)

# Builds the IoT producer Docker image
Building producer image...
✅ iot-producer:latest built

# Compiles Flink consumer with Maven
Building Flink consumer JAR...
✅ flink-consumer-1.0.0.jar created

2. Load Images into Kind

kind load docker-image iot-producer:latest --name iot-pipeline
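
You can confirm the image actually landed on the nodes (a quick sanity check; Kind nodes are just containers named after the cluster):

# Inspect the node's container runtime image cache
docker exec iot-pipeline-worker crictl images | grep iot-producer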

3. Deploy Services

Creates namespace and deploys:

  • Pulsar: StatefulSet with persistent storage
  • ClickHouse: StatefulSet with initialization scripts
  • Flink: JobManager + 2 TaskManagers
  • IoT Producer: Deployment generating sensor data
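
In manifest terms, the deploy phase boils down to something like this (a sketch; the script also waits for each component to become ready, and the label selector here is an assumption):

# Roughly what start-pipeline.sh does with k8s-manifests/
kubectl create namespace iot-pipeline
kubectl apply -n iot-pipeline -f k8s-manifests/
kubectl wait -n iot-pipeline --for=condition=Ready pod -l app=pulsar --timeout=300s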

4. Initialize ClickHouse Schema

The platform automatically creates the sensor data schema based on our AVRO definition:

CREATE DATABASE IF NOT EXISTS iot;

CREATE TABLE IF NOT EXISTS sensor_raw_data (
    sensor_id Int32,
    sensor_type Int32,
    temperature Float64,
    humidity Float64,
    pressure Float64,
    battery_level Float64,
    status Int32,
    timestamp DateTime64(3),
    event_time DateTime64(3) DEFAULT now64()
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (sensor_id, timestamp)
SETTINGS index_granularity = 8192;

-- Create alerts table for anomaly detection
CREATE TABLE IF NOT EXISTS sensor_alerts (
    sensor_id Int32,
    alert_type String,
    alert_time DateTime64(3),
    temperature Float64,
    humidity Float64,
    battery_level Float64,
    description String
) ENGINE = MergeTree()
ORDER BY (alert_time, sensor_id);
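
A quick way to confirm the schema landed:

kubectl exec -n iot-pipeline clickhouse-0 -- \
  clickhouse-client --query "SHOW TABLES FROM iot"
# Expect: sensor_alerts, sensor_raw_data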

5. Submit Flink Job

Deploys the stream processing job that:

  • Consumes from Pulsar topic: persistent://public/default/iot-sensor-data
  • Deserializes AVRO sensor data using our schema
  • Detects anomalies (temp > 35°C, humidity > 80%, battery < 20%)
  • Writes processed data to ClickHouse
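
Submission itself is essentially one command against the JobManager (a sketch; the in-container JAR path is an assumption):

kubectl exec -n iot-pipeline deploy/flink-jobmanager -- \
  flink run -d /opt/flink/usrlib/flink-consumer-1.0.0.jar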

📊 Step 4: Verify the Pipeline

After deployment (~90 seconds), you'll see:

✅ Pipeline is working! Data is flowing successfully.

Data Flow Status:
Sensor readings: 39
Alerts generated: 7

Sample sensor data:
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ sensor_id ┃ sensor_type ┃ temperature ┃ humidity ┃ timestamp           ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│         4 │           1 │        24.5 │     68.6 │ 2025-10-26 10:07:32 │
│         3 │           1 │        28.2 │     79.3 │ 2025-10-26 10:07:31 │
│         2 │           1 │        26.3 │     60.1 │ 2025-10-26 10:07:30 │
└───────────┴─────────────┴─────────────┴──────────┴─────────────────────┘

Alert distribution:
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ alert_type       ┃ count ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ HIGH_TEMPERATURE │     7 │
└──────────────────┴───────┘

Check All Pods Running

kubectl get pods -n iot-pipeline

Expected output:

NAME                                 READY   STATUS    RESTARTS   AGE
clickhouse-0                         1/1     Running   0          2m
flink-jobmanager-77c4d6f6c5-v7kqv    1/1     Running   0          2m
flink-taskmanager-7d67d89fd6-5v84r   1/1     Running   0          2m
flink-taskmanager-7d67d89fd6-n7f96   1/1     Running   0          2m
iot-producer-78466d4cf4-6pskj        1/1     Running   0          90s
pulsar-0                             1/1     Running   0          2m

πŸ” Step 5: Explore the Pipeline

Access Flink Dashboard

The script automatically sets up port forwarding:

# Flink UI is already accessible at:
open http://localhost:8081

What you'll see:

  • Running Jobs: 1 job (JDBC IoT Data Pipeline)
  • Task Managers: 2 active
  • Task Slots: 4 available
  • Job Graph: Visual representation of data flow
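
If you prefer the terminal, the same job status is available from Flink's REST API on the forwarded port:

# List running jobs and their status
curl -s http://localhost:8081/jobs
# e.g. {"jobs":[{"id":"<job-id>","status":"RUNNING"}]}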

Query ClickHouse Data

# Enter the ClickHouse client (interactive)
kubectl exec -it -n iot-pipeline clickhouse-0 -- clickhouse-client -d iot

-- Count total sensor readings
SELECT COUNT(*) FROM sensor_raw_data;

-- Get average metrics by sensor type
SELECT
    sensor_type,
    COUNT(*) AS reading_count,
    AVG(temperature) AS avg_temp,
    AVG(humidity) AS avg_humidity,
    AVG(pressure) AS avg_pressure,
    AVG(battery_level) AS avg_battery
FROM sensor_raw_data
GROUP BY sensor_type
ORDER BY sensor_type;

-- Get recent high-temperature alerts
SELECT
    sensor_id,
    alert_type,
    alert_time,
    temperature,
    description
FROM sensor_alerts
WHERE alert_type = 'HIGH_TEMPERATURE'
ORDER BY alert_time DESC
LIMIT 10;

-- Monitor data ingestion rate (records per minute)
SELECT
    toStartOfMinute(timestamp) AS minute,
    COUNT(*) AS records_per_minute
FROM sensor_raw_data
WHERE timestamp >= now() - INTERVAL 10 MINUTE
GROUP BY minute
ORDER BY minute DESC;

Monitor Data Flow in Real-Time

# Watch sensor data being written
watch -n 2 "kubectl exec -n iot-pipeline clickhouse-0 -- \
  clickhouse-client -d iot --query 'SELECT COUNT(*) FROM sensor_raw_data'"

# Monitor Flink job status
watch -n 5 "kubectl exec -n iot-pipeline \
  \$(kubectl get pods -n iot-pipeline -l app=flink,component=jobmanager -o jsonpath='{.items[0].metadata.name}') -- \
  flink list"

# Check Pulsar topic stats
kubectl exec -n iot-pipeline pulsar-0 -- \
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data

📈 Performance Metrics

On a MacBook Pro (M1/M2) with 16GB RAM:

Metric               Value
------------------   --------------------
Throughput           ~1,000 msg/sec
End-to-end Latency   < 500 ms
CPU Usage            ~40% (all pods)
Memory Usage         ~6 GB total
Storage              ~500 MB after 1 hour
Records/minute       ~60,000
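
The throughput figure is easy to re-derive yourself (a rough check based on ingested rows; assumes the producer and database clocks roughly agree):

# Average ingest rate over the last minute, in messages per second
kubectl exec -n iot-pipeline clickhouse-0 -- clickhouse-client -d iot --query \
  "SELECT count() / 60 AS msgs_per_sec
   FROM sensor_raw_data
   WHERE timestamp >= now() - INTERVAL 1 MINUTE"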

🎨 Key Features Demonstrated

1. Real-Time Stream Processing

The Flink job processes AVRO-serialized sensor data:

// Key imports (excerpt; env and pulsarSource are set up earlier in the job)
// import org.apache.flink.api.common.eventtime.WatermarkStrategy;
// import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
// import org.apache.flink.streaming.api.windowing.time.Time;

// Flink processes data with 1-minute windows
DataStream<SensorRecord> sensorStream = env.fromSource(
    pulsarSource,
    WatermarkStrategy.noWatermarks(),
    "Pulsar IoT Source"
);

sensorStream
    .keyBy(record -> record.getSensorId())
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .aggregate(new SensorAggregator())
    .addSink(new ClickHouseJDBCSink(clickhouseUrl));

2. AVRO Schema Processing

Our sensor data follows the optimized AVRO schema:

// Schema highlights from our SensorData AVRO record:
// - sensorId: int (for efficiency)
// - sensorType: int (1=temp, 2=humidity, 3=pressure, etc.)
// - temperature, humidity, pressure: double
// - batteryLevel: double (percentage)
// - status: int (1=online, 2=offline, 3=maintenance, 4=error)
// - timestamp: long with logical type timestamp-millis
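
Spelled out as a schema file, that corresponds to roughly this .avsc (a reconstruction from the field list above; the record name and namespace are assumptions, not the repo's exact file):

cat > SensorData.avsc <<'EOF'
{
  "type": "record",
  "name": "SensorData",
  "namespace": "com.example.iot",
  "fields": [
    {"name": "sensorId",     "type": "int"},
    {"name": "sensorType",   "type": "int"},
    {"name": "temperature",  "type": "double"},
    {"name": "humidity",     "type": "double"},
    {"name": "pressure",     "type": "double"},
    {"name": "batteryLevel", "type": "double"},
    {"name": "status",       "type": "int"},
    {"name": "timestamp",    "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
EOF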

3. Anomaly Detection

// Alert on various conditions
if (record.getTemperature() > 35.0) {
    alertSink.invoke(new Alert(
        record.getSensorId(),
        "HIGH_TEMPERATURE",
        record.getTemperature(),
        Instant.now()
    ));
}

if (record.getBatteryLevel() < 20.0) {
    alertSink.invoke(new Alert(
        record.getSensorId(),
        "LOW_BATTERY",
        record.getBatteryLevel(),
        Instant.now()
    ));
}

4. Scalable Architecture

  • Horizontal scaling: Add more TaskManagers for increased parallelism
  • Vertical scaling: Adjust resource limits per pod
  • Data partitioning: Pulsar topic partitioning for parallel processing
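
For example, horizontal scaling is one command away (deployment name taken from the pod listing in Step 4):

# Add two more TaskManagers for extra task slots
kubectl scale deployment flink-taskmanager -n iot-pipeline --replicas=4

Each TaskManager contributes its configured task slots, so the job's maximum parallelism grows with the replica count.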

πŸ› οΈ Troubleshooting

Issue: Pods not starting

# Check pod status and events
kubectl describe pod -n iot-pipeline <pod-name>

# Check logs for specific errors
kubectl logs -n iot-pipeline <pod-name>

# Check resource constraints
kubectl top pods -n iot-pipeline

Issue: Flink job not running

# Check Flink JobManager logs
kubectl logs -n iot-pipeline -l app=flink,component=jobmanager

# Check TaskManager logs
kubectl logs -n iot-pipeline -l app=flink,component=taskmanager

# Access Flink CLI
kubectl exec -n iot-pipeline deploy/flink-jobmanager -- flink list

Issue: No data in ClickHouse

# Check producer logs for errors
kubectl logs -n iot-pipeline -l app=iot-producer

# Verify Pulsar topic creation
kubectl exec -n iot-pipeline pulsar-0 -- \
  bin/pulsar-admin topics list public/default

# Check Pulsar message production
kubectl exec -n iot-pipeline pulsar-0 -- \
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data

# Test ClickHouse connectivity
kubectl exec -n iot-pipeline clickhouse-0 -- \
  clickhouse-client --query "SELECT version()"

Issue: Port forwarding not working

# Manual port forwarding setup
kubectl port-forward -n iot-pipeline svc/flink-jobmanager 8081:8081 &
kubectl port-forward -n iot-pipeline svc/clickhouse 8123:8123 &

# Check service endpoints
kubectl get svc -n iot-pipeline

🧹 Cleanup

When you're done exploring:

# Stop the pipeline (keeps cluster)
./stop-pipeline.sh

# Or delete everything including cluster
kind delete cluster --name iot-pipeline

# Clean up Docker images (optional)
docker rmi iot-producer:latest

🎓 What You've Learned

By following this guide, you've:

✅ Set up a local Kubernetes cluster with Kind

✅ Deployed a distributed streaming platform

✅ Built a real-time data processing pipeline

✅ Implemented stream processing with Apache Flink

✅ Used AVRO schemas for efficient serialization

✅ Stored and queried streaming data in ClickHouse

✅ Implemented real-time anomaly detection

✅ Monitored a production-grade data pipeline

🚀 Next Steps

Want to take this further?

1. Scale to Production

Deploy on AWS EKS with the realtime-platform-1million-events/ setup for handling 1M+ events/sec

2. Add More Features

  • Checkpointing: Implement exactly-once processing with Flink checkpoints
  • Monitoring: Add Grafana dashboards and Prometheus metrics
  • Windowing: Implement different time windows (sliding, session-based)
  • ML Integration: Add machine learning for predictive maintenance

3. Customize the Pipeline

  • Schema Evolution: Practice updating AVRO schemas without downtime
  • Custom Transformations: Add complex event processing (CEP) patterns
  • External APIs: Connect to weather services or IoT device management platforms
  • Data Lake Integration: Archive data to S3/MinIO for long-term storage

4. Advanced Topics

  • Multi-tenant Setup: Isolate different customer data streams
  • Cross-region Replication: Set up geo-distributed deployments
  • Security: Add authentication and encryption
  • Performance Tuning: Optimize for specific workload patterns


💬 Conclusion

You now have a fully functional, production-grade IoT streaming pipeline running on your local machine! This setup demonstrates real-world patterns used by companies processing billions of events per day.

The best part? Everything runs in Docker containers orchestrated by Kubernetes, making it easy to understand, modify, and eventually deploy to production cloud environments.

Key takeaways:

  • AVRO schemas provide efficient serialization and schema evolution
  • Kind clusters enable realistic Kubernetes testing locally
  • Stream processing patterns work the same at any scale
  • Real-time analytics can be achieved with open-source tools

What would you build with this pipeline? Share your IoT project ideas in the comments! 👇


Found this helpful? Follow me for more posts on real-time data engineering, Kubernetes, and distributed systems!

Next in the series: "Scaling to 1 Million Events/Second - Production Deployment Guide"


📊 Repository Stats

⭐ Star this repo if you found it useful!

πŸ› Issues/PRs welcome - contributions appreciated

💼 Production ready - scales to 1M events/sec

📖 Well documented - complete guides included

πŸ—οΈ Local K8s Setup: local-setup/k8s directory

Tags: #kubernetes #iot #dataengineering #streaming #flink #pulsar #clickhouse #devops #avro #realtimeanalytics
