Building real-time streaming platforms can be expensive. Most production setups cost thousands of dollars per month. But what if you need to process 50,000 events per second at a reasonable cost?
In this guide, I'll show you how to deploy a production-grade event streaming platform on AWS EKS that costs $1,250/month while handling moderate-scale real-time data processing.
What We're Building
A complete, production-ready streaming platform that:
- Processes 50,000 events per second in real-time
- AWS infrastructure cost: ~$1,250/month
- Uses t3 instance types for cost efficiency
- Runs on AWS EKS with managed Kubernetes
- Supports multiple domains: E-commerce, Finance, IoT, Gaming, Logistics
- Provides real-time analytics with sub-second latency
- Includes monitoring dashboards with Grafana
- Offers easy scalability to 1M events/sec if needed
Infrastructure Cost
AWS Infrastructure Cost: ~$1,250/month
This includes all compute instances (t3.medium to t3.xlarge), EKS cluster management, storage (EBS gp3), networking (NAT Gateway, Load Balancer), and monitoring services required for a production-ready 50K events/sec streaming platform.
Architecture Overview
+----------------------------------------------------------------+
|                  AWS EKS Cluster (us-west-2)                    |
|                  bench-low-infra (Kubernetes 1.31)              |
|                                                                 |
|  PRODUCER            PULSAR              FLINK                  |
|  (t3.medium)   -->   (t3.large)    -->   (t3.large)             |
|  Java/AVRO           ZK+Broker+BK        JobManager +           |
|  1K msg/sec/pod      EBS storage         TaskManager            |
|                      50K msg/sec         1-min windows          |
|                                              |                  |
|                                              v                  |
|                                          CLICKHOUSE             |
|                                          (t3.xlarge)            |
|                                          EBS storage            |
|                                          Analytics DB           |
|                                                                 |
|  Supporting: VPC, S3 (checkpoints), ECR, IAM, EBS CSI           |
+----------------------------------------------------------------+
Tech Stack:
- Kubernetes: AWS EKS 1.31
- Message Broker: Apache Pulsar 3.1
- Stream Processing: Apache Flink 1.18
- Analytics DB: ClickHouse 24.x
- Storage: EBS gp3 (cost-optimized, no expensive NVMe)
- Infrastructure: Terraform
- Monitoring: Grafana + Prometheus
Prerequisites
Before starting, ensure you have:
# Install required tools (macOS)
brew install awscli terraform kubectl helm
# Or on Linux
# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# Install Terraform
wget https://releases.hashicorp.com/terraform/1.6.0/terraform_1.6.0_linux_amd64.zip
unzip terraform_1.6.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/
# Configure AWS credentials
aws configure
# Enter: AWS Access Key, Secret Key, Region (us-west-2), Output format (json)
Required AWS Permissions:
- EC2, VPC, EKS, S3, IAM, ECR (full access)
- Estimated: ~$1,250/month for complete setup
Step-by-Step Deployment
Step 1: Clone the Repository
git clone https://github.com/hyperscaledesignhub/RealtimeDataPlatform.git
cd RealtimeDataPlatform/realtime-platform-50k-events
Repository structure:
realtime-platform-50k-events/
├── terraform/        # AWS infrastructure
├── producer-load/    # Event data generation
├── pulsar-load/      # Apache Pulsar deployment
├── flink-load/       # Apache Flink processing
├── clickhouse-load/  # ClickHouse analytics DB
└── monitoring/       # Grafana dashboards
Step 2: Deploy AWS Infrastructure
# Move into the Terraform directory (see repository structure above) and initialize
cd terraform
terraform init
# Review what will be created
terraform plan
# Deploy infrastructure (takes ~15-20 minutes)
terraform apply
# Type 'yes' when prompted
What gets created:
- VPC with public/private subnets (10.1.0.0/16)
- EKS cluster bench-low-infra (k8s 1.31)
- 9 node groups with t3 instances:
  - Producer: t3.medium (1 node, scales to 3)
  - Pulsar ZooKeeper: t3.small (3 nodes)
  - Pulsar Broker: t3.large (3 nodes)
  - Pulsar BookKeeper: t3.large (4 nodes)
  - Pulsar Proxy: t3.small (2 nodes)
  - Flink JobManager: t3.large (1 node)
  - Flink TaskManager: t3.large (6 nodes)
  - ClickHouse: t3.xlarge (4 nodes)
  - General: t3.small (1 node)
- S3 bucket for Flink checkpoints
- ECR repositories for container images
- IAM roles and policies
- EBS CSI driver for persistent volumes
- NAT Gateway for internet access
Configure kubectl:
aws eks update-kubeconfig --region us-west-2 --name bench-low-infra
# Verify nodes are ready
kubectl get nodes
Expected output:
NAME STATUS ROLES AGE VERSION
ip-10-1-x-x.us-west-2.compute.internal Ready <none> 5m v1.31.0-eks-...
ip-10-1-x-x.us-west-2.compute.internal Ready <none> 5m v1.31.0-eks-...
...
Step 3: Deploy Apache Pulsar (Message Broker)
cd pulsar-load
# Deploy Pulsar with Helm
./deploy.sh
# Monitor deployment (takes ~5-10 minutes)
kubectl get pods -n pulsar -w
# Press Ctrl+C when all pods are Running
What this deploys:
- ZooKeeper: 3 replicas (metadata management)
- Broker: 3 replicas (message routing)
- BookKeeper: 4 replicas (message storage on EBS)
- Proxy: 2 replicas (load balancing)
- Grafana: Monitoring dashboard
- Victoria Metrics: Metrics storage
Verify Pulsar is healthy:
kubectl get pods -n pulsar | grep -E "zookeeper|broker|bookkeeper"
# All pods should show "Running" status
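For a quick end-to-end check beyond pod status, a minimal Pulsar Java client can publish a test message through the proxy. This is a sketch, not code from the repo: the service URL and topic name are assumptions (the proxy service name depends on your Helm release; check `kubectl get svc -n pulsar`), and it needs the pulsar-client dependency on the classpath.

import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class PulsarSmokeTest {
    public static void main(String[] args) throws Exception {
        // Service URL is an assumption -- adjust to your proxy service / port-forward
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://pulsar-proxy.pulsar.svc.cluster.local:6650")
                .build();
             Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/iot-sensor-data")
                .create()) {
            // A successful send confirms ZooKeeper, brokers, and bookies are wired up
            MessageId id = producer.send("smoke-test");
            System.out.println("Published test message: " + id);
        }
    }
}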
Step 4: Deploy ClickHouse (Analytics Database)
cd ../clickhouse-load
# Install ClickHouse operator and cluster
./00-install-clickhouse.sh
# Wait for ClickHouse pods (~3-5 minutes)
kubectl get pods -n clickhouse -w
# Create database schema
./00-create-schema-all-replicas.sh
What this creates:
- ClickHouse cluster: 4 nodes (2 shards × 2 replicas)
- Database: benchmark
- Table: sensors_local (optimized for IoT sensor data)
- Storage: EBS gp3 volumes (200 GB per node)
- Retention: 30 days TTL
Test ClickHouse:
# Connect to ClickHouse
kubectl exec -it -n clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 -- clickhouse-client
# Run test query
SELECT version();
# Exit with Ctrl+D
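If other services will talk to ClickHouse programmatically (as the Flink job does), a small JDBC check is also useful. This is a minimal sketch assuming you port-forward the ClickHouse HTTP port (8123) to localhost and have the clickhouse-jdbc driver on the classpath; the hostname, default user, and empty password are assumptions, not values from the repo.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClickHouseSmokeTest {
    public static void main(String[] args) throws Exception {
        // Assumes: kubectl port-forward to the ClickHouse service on 8123
        String url = "jdbc:clickhouse://localhost:8123/benchmark";
        try (Connection conn = DriverManager.getConnection(url, "default", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT version()")) {
            if (rs.next()) {
                System.out.println("ClickHouse version: " + rs.getString(1));
            }
        }
    }
}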
Step 5: Deploy Apache Flink (Stream Processing)
cd ../flink-load
# Build and push Flink consumer image to ECR
./build-and-push.sh
# Deploy Flink cluster
./deploy.sh
# Submit Flink job
kubectl apply -f flink-job-deployment.yaml
# Monitor Flink job (~2-3 minutes to start)
kubectl get pods -n flink-benchmark -w
What this deploys:
- Flink JobManager: 1 replica (job coordination)
- Flink TaskManager: 6 replicas (data processing)
- S3 checkpoints: Every 1 minute
- Job: JDBC IoT Data Pipeline (AVRO deserialization)
Flink Job Details:
// Stream processing pipeline using the SensorData AVRO schema
DataStream<SensorRecord> sensorStream = env.fromSource(
        pulsarSource,
        WatermarkStrategy.noWatermarks(),
        "Pulsar IoT Source"
);

// 1-minute aggregation windows
sensorStream
        .keyBy(record -> record.getSensorId())
        .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
        .aggregate(new SensorAggregator())
        .addSink(new ClickHouseJDBCSink(clickhouseUrl));
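The deployment above mentions S3 checkpoints every minute and (later in this guide) exactly-once semantics. The snippet below is a sketch of how that checkpointing setup typically looks in Flink 1.18; the bucket path reuses the bench-low-infra-state bucket referenced later for savepoints, but the exact path used by the repo's job may differ.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Checkpoint every 60 seconds with exactly-once guarantees (matches "every 1 minute" above)
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
// Persist checkpoint state to S3 -- bucket/path are assumptions based on the savepoint bucket
env.getCheckpointConfig().setCheckpointStorage("s3://bench-low-infra-state/checkpoints");
// Leave some breathing room between checkpoints under sustained load
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);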
Step 6: Deploy IoT Producer (Event Generation)
cd ../producer-load
# Build and deploy IoT producer
./deploy.sh
# Scale producers based on desired throughput
kubectl scale deployment iot-producer -n iot-pipeline --replicas=50
# Monitor producer status
kubectl get pods -n iot-pipeline -l app=iot-producer
Producer capabilities:
- AVRO serialization: Uses optimized SensorData schema
- Multi-sensor types: Temperature, humidity, pressure, motion, light, CO2, noise
- Configurable throughput: 1,000 events/sec per pod
- Realistic data: Battery levels, device status, geographic distribution
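To make the AVRO serialization point concrete, here is a rough sketch of what a single send from a producer pod might look like with the Pulsar Java client. SensorData and its builder methods are hypothetical, modeled on the columns queried later (sensorId, sensorType, temperature, humidity, batteryLevel, timestamp); the actual generated class in producer-load may differ, and the proxy URL is an assumption.

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://pulsar-proxy.pulsar.svc.cluster.local:6650") // assumed proxy URL
        .build();

// Schema.AVRO derives the wire format from the (assumed) generated SensorData class
Producer<SensorData> producer = client.newProducer(Schema.AVRO(SensorData.class))
        .topic("persistent://public/default/iot-sensor-data")
        .blockIfQueueFull(true)
        .create();

SensorData reading = SensorData.newBuilder()   // hypothetical builder from the AVRO schema
        .setSensorId("sensor-0001")
        .setSensorType("temperature")
        .setTemperature(22.4)
        .setHumidity(41.0)
        .setBatteryLevel(87.0)
        .setTimestamp(System.currentTimeMillis())
        .build();

producer.sendAsync(reading); // async sends help each pod sustain ~1,000 events/sec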
Step 7: Verify the Complete Pipeline
After all components are deployed (~10-15 minutes total), verify data flow:
# Check producer is generating data
kubectl logs -n iot-pipeline -l app=iot-producer --tail=10
# Verify Pulsar has messages
kubectl exec -n pulsar pulsar-broker-0 -- \
bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data
# Check Flink is processing
kubectl logs -n flink-benchmark deployment/iot-flink-job --tail=20
# Query ClickHouse for data
kubectl exec -n clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 -- \
clickhouse-client --query "SELECT COUNT(*) FROM benchmark.sensors_local"
Expected data flow:
Producer: 50,000 events/sec generation
  ↓
Pulsar: Message ingestion and buffering
  ↓
Flink: Real-time stream processing with 1-minute windows
  ↓
ClickHouse: Data storage and analytics queries

End-to-end latency: < 2 seconds
Monitoring and Analytics
Access Grafana Dashboard
# Set up port forwarding
kubectl port-forward -n pulsar svc/grafana 3000:3000 &
# Open in browser
open http://localhost:3000
# Login: admin/admin
Key metrics to monitor:
- Pulsar: Message throughput, backlog size, storage usage
- Flink: Checkpoint duration, processing latency, job health
- ClickHouse: Query performance, insert rate, storage growth
- Infrastructure: CPU, memory, disk I/O across all nodes
Sample Analytics Queries
# Connect to ClickHouse
kubectl exec -it -n clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 -- clickhouse-client
-- Query examples based on our SensorData AVRO schema
USE benchmark;
-- Count total sensor readings
SELECT COUNT(*) as total_readings FROM sensors_local;
-- Average metrics by sensor type
SELECT
sensorType,
COUNT(*) as reading_count,
AVG(temperature) as avg_temp,
AVG(humidity) as avg_humidity,
AVG(batteryLevel) as avg_battery
FROM sensors_local
GROUP BY sensorType
ORDER BY sensorType;
-- Identify sensors with low battery
SELECT
sensorId,
sensorType,
AVG(batteryLevel) as avg_battery,
MIN(batteryLevel) as min_battery
FROM sensors_local
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY sensorId, sensorType
HAVING avg_battery < 25.0
ORDER BY avg_battery ASC;
-- Hourly data ingestion rate
SELECT
toStartOfHour(timestamp) as hour,
COUNT(*) as records_per_hour
FROM sensors_local
WHERE timestamp >= now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour DESC;
-- Temperature anomaly detection
SELECT
sensorId,
timestamp,
temperature,
status
FROM sensors_local
WHERE temperature > 35.0
AND timestamp >= now() - INTERVAL 1 HOUR
ORDER BY timestamp DESC
LIMIT 100;
Performance Benchmarks
Real-World Performance Metrics
On this cost-optimized setup, you can expect:
| Metric | Value | Notes |
|---|---|---|
| Message Throughput | 50,000 events/sec | Sustained rate with 50 producer pods |
| End-to-end Latency | < 2 seconds | Producer → ClickHouse |
| Query Performance | < 500ms | Analytical queries on 1B+ records |
| CPU Utilization | 60-70% | Across all node groups |
| Memory Usage | ~80% | Optimized for t3 instance types |
| Storage Growth | ~10 GB/hour | With 30-day TTL retention |
| Availability | 99.9%+ | Multi-AZ deployment |
Cost vs. Performance Analysis
Infrastructure Efficiency:
- $1,250/month for 50K events/sec
- Roughly $0.01 per million events processed (50,000 events/sec is ~130 billion events/month, so $1,250 ÷ 130B ≈ $0.00000001 per event)
- Production-ready reliability and monitoring
- Linear scaling path to 1M events/sec

Scaling Characteristics:
- The same architecture scales from 1K → 1M events/sec
- Infrastructure-as-Code deployment
- Easy migration path to an enterprise setup
- Predictable monthly costs
Troubleshooting
Common Issues and Solutions
Issue: High Memory Usage
# Check memory usage across nodes
kubectl top nodes
# Scale up instances if needed
# Edit terraform.tfvars
instance_type = "t3.2xlarge" # Upgrade from t3.xlarge
terraform apply
Issue: Pods Stuck in Pending
# Check node availability
kubectl get nodes
# Check pod events
kubectl describe pod <pod-name> -n <namespace>
# Scale up nodes if needed
# Edit terraform.tfvars, increase desired_size
terraform apply
Issue: Flink Job Failing
# Check Flink logs
kubectl logs -n flink-benchmark deployment/iot-flink-job
# Common issues:
# - ClickHouse not ready: Wait 2-3 minutes
# - Pulsar not accessible: Check network policies
# - Out of memory: Scale up TaskManagers
Issue: No Data in ClickHouse
# 1. Check producer is running
kubectl get pods -n iot-pipeline -l app=iot-producer
# 2. Check Pulsar has data
kubectl exec -n pulsar pulsar-broker-0 -- \
bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data
# 3. Check Flink is processing
kubectl logs -n flink-benchmark deployment/iot-flink-job | tail -50
# 4. Verify ClickHouse connectivity
kubectl exec -n clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 -- \
clickhouse-client --query "SELECT 1"
Issue: Optimizing Costs
# Enable spot instances for cost savings (can reduce to ~$400-500/month)
use_spot_instances = true
# Reduce node counts during development/testing
desired_size = 1 # Instead of 3-6
# Use smaller instance types for development
instance_type = "t3.small" # Instead of t3.large
# Schedule auto-shutdown for non-production environments
# Use AWS Instance Scheduler
Cleanup
When you're done testing:
# Delete all Kubernetes resources
kubectl delete namespace iot-pipeline flink-benchmark clickhouse pulsar
# Destroy AWS infrastructure
terraform destroy
# Type 'yes' when prompted
# This will:
# - Delete EKS cluster
# - Delete VPC and subnets
# - Delete S3 buckets (after emptying)
# - Delete IAM roles
# - Delete ECR repositories
Warning: This is irreversible! Make sure to back up any data first.
Production Best Practices
1. Optimize Instance Usage
# terraform.tfvars - Consider spot instances for cost savings
use_spot_instances = true # Can reduce costs by 60-70%
Benefits of optimization:
- Spot instances can reduce costs significantly
- Right-sizing instances based on actual usage
- Auto-scaling to handle variable workloads
Current setup uses on-demand instances for:
- Guaranteed availability and performance
- Simplified operations and management
- Predictable monthly costs ($1,250)
2. Set Up Alerts
# CloudWatch alarms (via Terraform)
- CPU utilization > 80%
- Disk usage > 85%
- Pod crashes > 3 in 5 minutes
- Flink checkpoint failures
3. Implement Data Retention
-- ClickHouse TTL (30 days)
ALTER TABLE benchmark.sensors_local
MODIFY TTL timestamp + INTERVAL 30 DAY;
-- Pulsar retention (7 days)
bin/pulsar-admin namespaces set-retention public/default \
--size 100G --time 7d
4. Enable Backups
# ClickHouse backups to S3
clickhouse-backup create daily_backup
clickhouse-backup upload daily_backup
# Flink savepoints
kubectl exec -n flink-benchmark <jobmanager-pod> -- \
flink savepoint <job-id> s3://bench-low-infra-state/savepoints
5. Use Auto-Scaling
# HorizontalPodAutoscaler for producers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iot-producer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: iot-producer
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
What You've Learned
By following this guide, you've:
- Deployed a production-grade streaming platform on AWS
- Configured cost-optimized infrastructure with t3 instances
- Set up real-time stream processing with Apache Flink
- Implemented exactly-once semantics with checkpointing
- Built a scalable message broker with Apache Pulsar
- Configured an analytics database with ClickHouse
- Enabled monitoring and observability with Grafana
- Learned cost optimization strategies (up to 90% savings over typical production setups)
Next Steps
1. Customize for Your Domain
Modify the producer to generate your specific event types:
// Edit: producer-load/src/main/java/com/iot/pipeline/producer/
public class EventDataProducer {
private EventData generateEvent() {
// Your custom event generation logic
return new EventData(
customerId,
orderId,
amount,
timestamp
);
}
}
2. Add Custom Flink Transformations
// Edit: flink-load/flink-consumer/src/main/java/com/iot/pipeline/flink/
sensorStream
.filter(record -> record.getTemperature() > 30.0) // Custom filter
.map(record -> new AlertRecord(record)) // Custom transformation
.addSink(new AlertSink()); // Custom sink
3. Implement Advanced Analytics
-- Create materialized views in ClickHouse
CREATE MATERIALIZED VIEW hourly_aggregates
ENGINE = AggregatingMergeTree()
ORDER BY (sensorId, hour)
AS SELECT
sensorId,
toStartOfHour(timestamp) as hour,
avg(temperature) as avg_temp,
max(temperature) as max_temp
FROM benchmark.sensors_local
GROUP BY sensorId, hour;
4. Scale to Production
When ready for production scale:
- Enable spot instances for cost savings
- Set up automated backups
- Configure CloudWatch alarms
- Implement log aggregation (CloudWatch Logs)
- Set up CI/CD pipeline
- Enable AWS Shield for DDoS protection
Resources
- 50K Events Repository: realtime-platform-50k-events
- Main Repository: RealtimeDataPlatform
- AWS EKS Documentation: docs.aws.amazon.com/eks
- Apache Flink: flink.apache.org
- Apache Pulsar: pulsar.apache.org
- ClickHouse: clickhouse.com/docs
- Terraform: terraform.io/docs
Conclusion
You now have a production-grade, cost-optimized streaming platform running on AWS for about $1,250/month. This setup demonstrates real-world patterns used by companies processing millions of events per day, optimized here for moderate scale and budget constraints.
The beauty of this architecture is its flexibility:
- Start small (1K events/sec, lower cost)
- Grow to moderate (50K events/sec, ~$1,250/mo)
- Scale to enterprise (1M events/sec, higher cost)
All with the same codebase and deployment patterns!
Key takeaways:
- AWS EKS deployment provides managed Kubernetes for production workloads
- t3 instances deliver excellent price/performance for streaming workloads
- $1,250/month infrastructure cost for 50K events/sec processing
- AVRO schemas enable efficient serialization at scale
- Production-ready with monitoring, alerting, and auto-scaling
What would you build with this platform? Share your use case in the comments!
Found this helpful? Follow me for more posts on cloud architecture, real-time data engineering, and cost optimization strategies!
Next in the series: "Scaling to 1 Million Events/Second - Enterprise Production Guide"
Support This Project
- Star the repo if you found it useful!
- Report issues - help us improve
- Production-tested - used in real workloads
- Well-documented - complete guides included
- Cost-optimized - save 90% on infrastructure
Performance Summary
| Metric | Value |
|---|---|
| Throughput | 50,000 events/sec |
| Latency | < 2 seconds end-to-end |
| Monthly Cost | $1,250 (on-demand instances) |
| Storage | ~1TB (30 days retention) |
| Availability | 99.9% (multi-AZ deployment) |
| Scalability | 1K → 1M events/sec |
| Setup Time | ~45 minutes |
Tags: #aws #eks #streaming #costsavings #dataengineering #pulsar #flink #clickhouse #terraform #avro #realtimeanalytics
Top comments (2)
Wow, this is a seriously impressive architecture! Handling 50K events/sec is a massive scale, and the cost breakdown is super insightful. It's fascinating to see what's possible with EKS and Kafka for high-throughput streaming.
My own serverless project (tarihasistani.com.tr) operates at the completely opposite end of the spectrum: it's a low-traffic, event-driven AI platform built on Lambda and API Gateway, designed for extreme low cost ('pay-per-use').
This makes me wonder about the 'middle ground.' In your opinion, at what point does a simple, serverless event-driven architecture (like Lambda + SQS) break down? Is there a certain 'events/sec' threshold or a level of processing complexity where you'd say 'stop using serverless, it's time to move to Kubernetes (EKS)'?
Thank you @oguzkhan80. In your case, once the event rate starts increasing, your Lambda hits its limits, and once it does, incoming events get lost. The other factor is how heavy each invocation is. If the Lambda is lightweight (just read the event and store it in a database), there shouldn't be many issues. But if you're doing per-event processing that demands more CPU and memory, you'll hit problems much sooner as the event rate grows. For example, with lightweight processing you might handle up to ~5,000 events/sec, while with heavy processing you may only reach ~1,000 events/sec. Since your current event rate is low, Lambda is good enough, but keep monitoring the rate and switch architectures at the appropriate time.