HyperscaleDesignHub

Posted on Oct 26

Building a Real-Time Data Platform with Kubernetes (Kind) - A Complete Local Setup Guide

#tutorial #iot #dataengineering #kubernetes

Ever wondered how to build a production-grade real-time data pipeline that can handle millions of events per second? In this guide, I'll show you how to set up a complete IoT streaming platform locally using Kubernetes (Kind) that processes sensor data in real-time.

🎯 What We're Building

A fully functional IoT data pipeline that:

✅ Generates realistic IoT sensor data (temperature, humidity, pressure)
✅ Streams data through Apache Pulsar at 1000+ msg/sec
✅ Processes data in real-time with Apache Flink
✅ Stores analytics in ClickHouse for fast queries
✅ Detects and alerts on anomalies (high temperature, low battery)
✅ Runs entirely on your local machine with Kubernetes

🏗️ Architecture Overview

┌─────────────┐      ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│             │      │             │      │             │      │             │
│ IoT Producer├─────►│   Pulsar    ├─────►│    Flink    ├─────►│ ClickHouse  │
│  (Java)     │      │  (Broker)   │      │ (Processor) │      │  (Storage)  │
│             │      │             │      │             │      │             │
└─────────────┘      └─────────────┘      └─────────────┘      └─────────────┘
   Sensor Data     Message Queue      Stream Processing     Analytics DB

Tech Stack:

Kubernetes (Kind): Local Kubernetes cluster
Apache Pulsar: Distributed messaging and streaming
Apache Flink: Real-time stream processing
ClickHouse: Columnar database for analytics
Docker: Containerization
Maven: Build automation

📋 Prerequisites

Before we begin, make sure you have:

# Check versions
docker --version      # Docker 20.10+
kubectl version      # kubectl 1.28+
kind version         # kind 0.20+
mvn --version        # Maven 3.8+
java -version        # Java 11+

Installation (macOS):

# Install required tools
brew install kind kubectl maven docker

Installation (Linux):

# Install Kind
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

# Install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/

🚀 Step 1: Clone the Repository

git clone https://github.com/hyperscaledesignhub/RealtimeDataPlatform.git
cd RealtimeDataPlatform/local-setup/k8s

Repository structure:

local-setup/k8s/
├── create-cluster.sh          # Kind cluster setup
├── start-pipeline.sh          # Deploy entire pipeline
├── stop-pipeline.sh           # Clean shutdown
├── k8s-manifests/            # Kubernetes YAML files
├── flink-jobs/               # Flink job definitions
└── scripts/                  # Helper utilities

🔧 Step 2: Create the Kind Cluster

The repository includes a pre-configured Kind cluster setup:

./create-cluster.sh

What this does:

Creates a 3-node Kubernetes cluster (1 control-plane, 2 workers)
Configures port mappings for service access
Sets up kubeconfig at /tmp/iot-kubeconfig
Validates cluster readiness

Expected output:

✅ Kind cluster created successfully!

Cluster Information:
NAME                         STATUS   ROLES           AGE   VERSION
iot-pipeline-control-plane   Ready    control-plane   40d   v1.32.2
iot-pipeline-worker          Ready    <none>          40d   v1.32.2
iot-pipeline-worker2         Ready    <none>          40d   v1.32.2

🎬 Step 3: Deploy the IoT Pipeline

Now for the exciting part - deploying the entire pipeline:

export KUBECONFIG=/tmp/iot-kubeconfig
./start-pipeline.sh

🔍 What Happens Behind the Scenes

The script performs these steps automatically:

1. Build Phase (~2 minutes)

# Builds the IoT producer Docker image
Building producer image...
✅ iot-producer:latest built

# Compiles Flink consumer with Maven
Building Flink consumer JAR...
✅ flink-consumer-1.0.0.jar created

2. Load Images into Kind

kind load docker-image iot-producer:latest --name iot-pipeline

3. Deploy Services

Creates namespace and deploys:

Pulsar: StatefulSet with persistent storage
ClickHouse: StatefulSet with initialization scripts
Flink: JobManager + 2 TaskManagers
IoT Producer: Deployment generating sensor data

4. Initialize ClickHouse Schema

The platform automatically creates the sensor data schema based on our AVRO definition:

CREATE DATABASE IF NOT EXISTS iot;

CREATE TABLE IF NOT EXISTS sensor_raw_data (
    sensor_id Int32,
    sensor_type Int32,
    temperature Float64,
    humidity Float64,
    pressure Float64,
    battery_level Float64,
    status Int32,
    timestamp DateTime64(3),
    event_time DateTime64(3) DEFAULT now64()
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (sensor_id, timestamp)
SETTINGS index_granularity = 8192;

-- Create alerts table for anomaly detection
CREATE TABLE IF NOT EXISTS sensor_alerts (
    sensor_id Int32,
    alert_type String,
    alert_time DateTime64(3),
    temperature Float64,
    humidity Float64,
    battery_level Float64,
    description String
) ENGINE = MergeTree()
ORDER BY (alert_time, sensor_id);

5. Submit Flink Job

Deploys the stream processing job that:

Consumes from Pulsar topic: persistent://public/default/iot-sensor-data
Deserializes AVRO sensor data using our schema
Detects anomalies (temp > 35°C, humidity > 80%, battery < 20%)
Writes processed data to ClickHouse

📊 Step 4: Verify the Pipeline

After deployment (~90 seconds), you'll see:

✅ Pipeline is working! Data is flowing successfully.

Data Flow Status:
Sensor readings: 39
Alerts generated: 7

Sample sensor data:
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ sensor_id ┃ sensor_type ┃ temperature ┃ humidity ┃ timestamp           ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│        4  │           1 │        24.5 │     68.6 │ 2025-10-26 10:07:32 │
│        3  │           1 │        28.2 │     79.3 │ 2025-10-26 10:07:31 │
│        2  │           1 │        26.3 │     60.1 │ 2025-10-26 10:07:30 │
└───────────┴─────────────┴─────────────┴───────────┴─────────────────────┘

Alert distribution:
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ alert_type       ┃ count ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ HIGH_TEMPERATURE │     7 │
└──────────────────┴───────┘

Check All Pods Running

kubectl get pods -n iot-pipeline

Expected output:

NAME                                 READY   STATUS    RESTARTS   AGE
clickhouse-0                         1/1     Running   0          2m
flink-jobmanager-77c4d6f6c5-v7kqv    1/1     Running   0          2m
flink-taskmanager-7d67d89fd6-5v84r   1/1     Running   0          2m
flink-taskmanager-7d67d89fd6-n7f96   1/1     Running   0          2m
iot-producer-78466d4cf4-6pskj        1/1     Running   0          90s
pulsar-0                             1/1     Running   0          2m

🔍 Step 5: Explore the Pipeline

Access Flink Dashboard

The script automatically sets up port forwarding:

# Flink UI is already accessible at:
open http://localhost:8081

What you'll see:

Running Jobs: 1 job (JDBC IoT Data Pipeline)
Task Managers: 2 active
Task Slots: 4 available
Job Graph: Visual representation of data flow

Query ClickHouse Data

# Enter ClickHouse client
kubectl exec -n iot-pipeline clickhouse-0 -- clickhouse-client -d iot

# Count total sensor readings
SELECT COUNT(*) FROM sensor_raw_data;

# Get average metrics by sensor type
SELECT 
    sensor_type,
    COUNT(*) as reading_count,
    AVG(temperature) as avg_temp,
    AVG(humidity) as avg_humidity,
    AVG(pressure) as avg_pressure,
    AVG(battery_level) as avg_battery
FROM sensor_raw_data
GROUP BY sensor_type
ORDER BY sensor_type;

# Get recent high-temperature alerts
SELECT 
    sensor_id,
    alert_type,
    alert_time,
    temperature,
    description
FROM sensor_alerts
WHERE alert_type = 'HIGH_TEMPERATURE'
ORDER BY alert_time DESC
LIMIT 10;

# Monitor data ingestion rate (records per minute)
SELECT 
    toStartOfMinute(timestamp) as minute,
    COUNT(*) as records_per_minute
FROM sensor_raw_data
WHERE timestamp >= now() - INTERVAL 10 MINUTE
GROUP BY minute
ORDER BY minute DESC;

Monitor Data Flow in Real-Time

# Watch sensor data being written
watch -n 2 "kubectl exec -n iot-pipeline clickhouse-0 -- \
  clickhouse-client -d iot --query 'SELECT COUNT(*) FROM sensor_raw_data'"

# Monitor Flink job status
watch -n 5 "kubectl exec -n iot-pipeline \
  \$(kubectl get pods -n iot-pipeline -l app=flink,component=jobmanager -o jsonpath='{.items[0].metadata.name}') -- \
  flink list"

# Check Pulsar topic stats
kubectl exec -n iot-pipeline pulsar-0 -- \
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data

📈 Performance Metrics

On a MacBook Pro (M1/M2) with 16GB RAM:

Metric	Value
Throughput	~1,000 msg/sec
End-to-end Latency	< 500ms
CPU Usage	~40% (all pods)
Memory Usage	~6GB total
Storage	~500MB after 1 hour
Records/minute	~60,000

🎨 Key Features Demonstrated

1. Real-Time Stream Processing

The Flink job processes AVRO-serialized sensor data:

// Flink processes data with 1-minute windows
DataStream<SensorRecord> sensorStream = env.fromSource(
    pulsarSource,
    WatermarkStrategy.noWatermarks(),
    "Pulsar IoT Source"
);

sensorStream
    .keyBy(record -> record.getSensorId())
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .aggregate(new SensorAggregator())
    .addSink(new ClickHouseJDBCSink(clickhouseUrl));

2. AVRO Schema Processing

Our sensor data follows the optimized AVRO schema:

// Schema highlights from our SensorData AVRO record:
// - sensorId: int (for efficiency)
// - sensorType: int (1=temp, 2=humidity, 3=pressure, etc.)
// - temperature, humidity, pressure: double
// - batteryLevel: double (percentage)
// - status: int (1=online, 2=offline, 3=maintenance, 4=error)
// - timestamp: long with logical type timestamp-millis

3. Anomaly Detection

// Alert on various conditions
if (record.getTemperature() > 35.0) {
    alertSink.invoke(new Alert(
        record.getSensorId(),
        "HIGH_TEMPERATURE",
        record.getTemperature(),
        Instant.now()
    ));
}

if (record.getBatteryLevel() < 20.0) {
    alertSink.invoke(new Alert(
        record.getSensorId(),
        "LOW_BATTERY",
        record.getBatteryLevel(),
        Instant.now()
    ));
}

4. Scalable Architecture

Horizontal scaling: Add more TaskManagers for increased parallelism
Vertical scaling: Adjust resource limits per pod
Data partitioning: Pulsar topic partitioning for parallel processing

🛠️ Troubleshooting

Issue: Pods not starting

# Check pod status and events
kubectl describe pod -n iot-pipeline <pod-name>

# Check logs for specific errors
kubectl logs -n iot-pipeline <pod-name>

# Check resource constraints
kubectl top pods -n iot-pipeline

Issue: Flink job not running

# Check Flink JobManager logs
kubectl logs -n iot-pipeline -l app=flink,component=jobmanager

# Check TaskManager logs
kubectl logs -n iot-pipeline -l app=flink,component=taskmanager

# Access Flink CLI
kubectl exec -n iot-pipeline deploy/flink-jobmanager -- flink list

Issue: No data in ClickHouse

# Check producer logs for errors
kubectl logs -n iot-pipeline -l app=iot-producer

# Verify Pulsar topic creation
kubectl exec -n iot-pipeline pulsar-0 -- \
  bin/pulsar-admin topics list public/default

# Check Pulsar message production
kubectl exec -n iot-pipeline pulsar-0 -- \
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data

# Test ClickHouse connectivity
kubectl exec -n iot-pipeline clickhouse-0 -- \
  clickhouse-client --query "SELECT version()"

Issue: Port forwarding not working

# Manual port forwarding setup
kubectl port-forward -n iot-pipeline svc/flink-jobmanager 8081:8081 &
kubectl port-forward -n iot-pipeline svc/clickhouse 8123:8123 &

# Check service endpoints
kubectl get svc -n iot-pipeline

🧹 Cleanup

When you're done exploring:

# Stop the pipeline (keeps cluster)
./stop-pipeline.sh

# Or delete everything including cluster
kind delete cluster --name iot-pipeline

# Clean up Docker images (optional)
docker rmi iot-producer:latest

🎓 What You've Learned

By following this guide, you've:

✅ Set up a local Kubernetes cluster with Kind

✅ Deployed a distributed streaming platform

✅ Built a real-time data processing pipeline

✅ Implemented stream processing with Apache Flink

✅ Used AVRO schemas for efficient serialization

✅ Stored and queried streaming data in ClickHouse

✅ Implemented real-time anomaly detection

✅ Monitored a production-grade data pipeline

🚀 Next Steps

Want to take this further?

1. Scale to Production

Deploy on AWS EKS with the realtime-platform-1million-events/ setup for handling 1M+ events/sec

2. Add More Features

Checkpointing: Implement exactly-once processing with Flink checkpoints
Monitoring: Add Grafana dashboards and Prometheus metrics
Windowing: Implement different time windows (sliding, session-based)
ML Integration: Add machine learning for predictive maintenance

3. Customize the Pipeline

Schema Evolution: Practice updating AVRO schemas without downtime
Custom Transformations: Add complex event processing (CEP) patterns
External APIs: Connect to weather services or IoT device management platforms
Data Lake Integration: Archive data to S3/MinIO for long-term storage

4. Advanced Topics

Multi-tenant Setup: Isolate different customer data streams
Cross-region Replication: Set up geo-distributed deployments
Security: Add authentication and encryption
Performance Tuning: Optimize for specific workload patterns

📚 Resources

Local Setup Directory: local-setup/k8s
Main Repository: RealtimeDataPlatform
Kind Documentation: kind.sigs.k8s.io
Apache Flink Guides: flink.apache.org
Apache Pulsar Docs: pulsar.apache.org
ClickHouse Documentation: clickhouse.com/docs
AVRO Specification: avro.apache.org

💬 Conclusion

You now have a fully functional, production-grade IoT streaming pipeline running on your local machine! This setup demonstrates real-world patterns used by companies processing billions of events per day.

The best part? Everything runs in Docker containers orchestrated by Kubernetes, making it easy to understand, modify, and eventually deploy to production cloud environments.

Key takeaways:

AVRO schemas provide efficient serialization and schema evolution
Kind clusters enable realistic Kubernetes testing locally
Stream processing patterns work the same at any scale
Real-time analytics can be achieved with open-source tools

What would you build with this pipeline? Share your IoT project ideas in the comments! 👇

Found this helpful? Follow me for more posts on real-time data engineering, Kubernetes, and distributed systems!

Next in the series: "Scaling to 1 Million Events/Second - Production Deployment Guide"

📊 Repository Stats

⭐ Star this repo if you found it useful!

🐛 Issues/PRs welcome - contributions appreciated

💼 Production ready - scales to 1M events/sec

📖 Well documented - complete guides included

🏗️ Local K8s Setup: local-setup/k8s directory

Tags: #kubernetes #iot #dataengineering #streaming #flink #pulsar #clickhouse #devops #avro #realtimeanalytics