Building real-time streaming platforms can be expensive. Most production setups cost thousands of dollars per month. But what if you need to process 50,000 events per second at a reasonable cost?
In this guide, I'll show you how to deploy a production-grade event streaming platform on AWS EKS that costs $1,250/month while handling moderate-scale real-time data processing.
What We're Building
A complete, production-ready streaming platform that:
- Processes 50,000 events per second in real-time
- AWS infrastructure cost: ~$1,250/month
- Uses t3 instance types for cost efficiency
- Runs on AWS EKS with managed Kubernetes
- Supports multiple domains: E-commerce, Finance, IoT, Gaming, Logistics
- Provides real-time analytics with sub-second latency
- Includes monitoring dashboards with Grafana
- Offers easy scalability to 1M events/sec if needed
Infrastructure Cost
AWS Infrastructure Cost: ~$1,250/month
This includes all compute instances (t3.medium to t3.xlarge), EKS cluster management, storage (EBS gp3), networking (NAT Gateway, Load Balancer), and monitoring services required for a production-ready 50K events/sec streaming platform.
Architecture Overview
+----------------------------------------------------------------+
|                  AWS EKS Cluster (us-west-2)                    |
|                  bench-low-infra (Kubernetes 1.31)              |
|                                                                 |
|  PRODUCER            PULSAR              FLINK                  |
|  (t3.medium)   -->   (t3.large)    -->   (t3.large)             |
|  Java/AVRO           ZK+Broker+BK        JobManager +           |
|  1K msg/sec/pod      EBS storage         TaskManager            |
|                      50K msg/sec         1-min windows          |
|                                              |                  |
|                                              v                  |
|                                          CLICKHOUSE             |
|                                          (t3.xlarge)            |
|                                          EBS storage            |
|                                          Analytics DB           |
|                                                                 |
|  Supporting: VPC, S3 (checkpoints), ECR, IAM, EBS CSI           |
+----------------------------------------------------------------+
Tech Stack:
- Kubernetes: AWS EKS 1.31
- Message Broker: Apache Pulsar 3.1
- Stream Processing: Apache Flink 1.18
- Analytics DB: ClickHouse 24.x
- Storage: EBS gp3 (cost-optimized, no expensive NVMe)
- Infrastructure: Terraform
- Monitoring: Grafana + Prometheus
Prerequisites
Before starting, ensure you have:
# Install required tools (macOS)
brew install awscli terraform kubectl helm
# Or on Linux
# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# Install Terraform
wget https://releases.hashicorp.com/terraform/1.6.0/terraform_1.6.0_linux_amd64.zip
unzip terraform_1.6.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/
# Configure AWS credentials
aws configure
# Enter: AWS Access Key, Secret Key, Region (us-west-2), Output format (json)
Required AWS Permissions:
- EC2, VPC, EKS, S3, IAM, ECR (full access)
- Estimated: ~$1,250/month for complete setup
Step-by-Step Deployment
Step 1: Clone the Repository
git clone https://github.com/hyperscaledesignhub/RealtimeDataPlatform.git
cd RealtimeDataPlatform/realtime-platform-50k-events
Repository structure:
realtime-platform-50k-events/
├── terraform/        # AWS infrastructure
├── producer-load/    # Event data generation
├── pulsar-load/      # Apache Pulsar deployment
├── flink-load/       # Apache Flink processing
├── clickhouse-load/  # ClickHouse analytics DB
└── monitoring/       # Grafana dashboards
Step 2: Deploy AWS Infrastructure
# Move into the Terraform directory (see repository structure above) and initialize
cd terraform
terraform init
# Review what will be created
terraform plan
# Deploy infrastructure (takes ~15-20 minutes)
terraform apply
# Type 'yes' when prompted
What gets created:
- VPC with public/private subnets (10.1.0.0/16)
- EKS cluster bench-low-infra (k8s 1.31)
- 9 node groups with t3 instances:
  - Producer: t3.medium (1 node, scales to 3)
  - Pulsar ZooKeeper: t3.small (3 nodes)
  - Pulsar Broker: t3.large (3 nodes)
  - Pulsar BookKeeper: t3.large (4 nodes)
  - Pulsar Proxy: t3.small (2 nodes)
  - Flink JobManager: t3.large (1 node)
  - Flink TaskManager: t3.large (6 nodes)
  - ClickHouse: t3.xlarge (4 nodes)
  - General: t3.small (1 node)
- S3 bucket for Flink checkpoints
- ECR repositories for container images
- IAM roles and policies
- EBS CSI driver for persistent volumes
- NAT Gateway for internet access
Configure kubectl:
aws eks update-kubeconfig --region us-west-2 --name bench-low-infra
# Verify nodes are ready
kubectl get nodes
Expected output:
NAME STATUS ROLES AGE VERSION
ip-10-1-x-x.us-west-2.compute.internal Ready <none> 5m v1.31.0-eks-...
ip-10-1-x-x.us-west-2.compute.internal Ready <none> 5m v1.31.0-eks-...
...
Step 3: Deploy Apache Pulsar (Message Broker)
cd pulsar-load
# Deploy Pulsar with Helm
./deploy.sh
# Monitor deployment (takes ~5-10 minutes)
kubectl get pods -n pulsar -w
# Press Ctrl+C when all pods are Running
What this deploys:
- ZooKeeper: 3 replicas (metadata management)
- Broker: 3 replicas (message routing)
- BookKeeper: 4 replicas (message storage on EBS)
- Proxy: 2 replicas (load balancing)
- Grafana: Monitoring dashboard
- Victoria Metrics: Metrics storage
Verify Pulsar is healthy:
kubectl get pods -n pulsar | grep -E "zookeeper|broker|bookkeeper"
# All pods should show "Running" status
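For a quick end-to-end check beyond pod status, a minimal Pulsar Java client can publish a test message through the proxy. This is a sketch, not code from the repo: the service URL and topic name are assumptions (the proxy service name depends on your Helm release; check `kubectl get svc -n pulsar`), and it needs the pulsar-client dependency on the classpath.

import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class PulsarSmokeTest {
    public static void main(String[] args) throws Exception {
        // Service URL is an assumption -- adjust to your proxy service / port-forward
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://pulsar-proxy.pulsar.svc.cluster.local:6650")
                .build();
             Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/iot-sensor-data")
                .create()) {
            // A successful send confirms ZooKeeper, brokers, and bookies are wired up
            MessageId id = producer.send("smoke-test");
            System.out.println("Published test message: " + id);
        }
    }
}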
Step 4: Deploy ClickHouse (Analytics Database)
cd ../clickhouse-load
# Install ClickHouse operator and cluster
./00-install-clickhouse.sh
# Wait for ClickHouse pods (~3-5 minutes)
kubectl get pods -n clickhouse -w
# Create database schema
./00-create-schema-all-replicas.sh
What this creates:
- ClickHouse cluster: 4 nodes (2 shards × 2 replicas)
- Database: benchmark
- Table: sensors_local (optimized for IoT sensor data)
- Storage: EBS gp3 volumes (200 GB per node)
- Retention: 30 days TTL
Test ClickHouse:
# Connect to ClickHouse
kubectl exec -it -n clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 -- clickhouse-client
# Run test query
SELECT version();
# Exit with Ctrl+D
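If other services will talk to ClickHouse programmatically (as the Flink job does), a small JDBC check is also useful. This is a minimal sketch assuming you port-forward the ClickHouse HTTP port (8123) to localhost and have the clickhouse-jdbc driver on the classpath; the hostname, default user, and empty password are assumptions, not values from the repo.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClickHouseSmokeTest {
    public static void main(String[] args) throws Exception {
        // Assumes: kubectl port-forward to the ClickHouse service on 8123
        String url = "jdbc:clickhouse://localhost:8123/benchmark";
        try (Connection conn = DriverManager.getConnection(url, "default", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT version()")) {
            if (rs.next()) {
                System.out.println("ClickHouse version: " + rs.getString(1));
            }
        }
    }
}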
Step 5: Deploy Apache Flink (Stream Processing)
cd ../flink-load
# Build and push Flink consumer image to ECR
./build-and-push.sh
# Deploy Flink cluster
./deploy.sh
# Submit Flink job
kubectl apply -f flink-job-deployment.yaml
# Monitor Flink job (~2-3 minutes to start)
kubectl get pods -n flink-benchmark -w
What this deploys:
- Flink JobManager: 1 replica (job coordination)
- Flink TaskManager: 6 replicas (data processing)
- S3 checkpoints: Every 1 minute
- Job: JDBC IoT Data Pipeline (AVRO deserialization)
Flink Job Details:
// Stream processing pipeline using the SensorData AVRO schema
DataStream<SensorRecord> sensorStream = env.fromSource(
        pulsarSource,
        WatermarkStrategy.noWatermarks(),
        "Pulsar IoT Source"
);

// 1-minute aggregation windows
sensorStream
        .keyBy(record -> record.getSensorId())
        .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
        .aggregate(new SensorAggregator())
        .addSink(new ClickHouseJDBCSink(clickhouseUrl));
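The deployment above mentions S3 checkpoints every minute and (later in this guide) exactly-once semantics. The snippet below is a sketch of how that checkpointing setup typically looks in Flink 1.18; the bucket path reuses the bench-low-infra-state bucket referenced later for savepoints, but the exact path used by the repo's job may differ.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Checkpoint every 60 seconds with exactly-once guarantees (matches "every 1 minute" above)
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
// Persist checkpoint state to S3 -- bucket/path are assumptions based on the savepoint bucket
env.getCheckpointConfig().setCheckpointStorage("s3://bench-low-infra-state/checkpoints");
// Leave some breathing room between checkpoints under sustained load
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);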
Step 6: Deploy IoT Producer (Event Generation)
cd ../producer-load
# Build and deploy IoT producer
./deploy.sh
# Scale producers based on desired throughput
kubectl scale deployment iot-producer -n iot-pipeline --replicas=50
# Monitor producer status
kubectl get pods -n iot-pipeline -l app=iot-producer
Producer capabilities:
- AVRO serialization: Uses optimized SensorData schema
- Multi-sensor types: Temperature, humidity, pressure, motion, light, CO2, noise
- Configurable throughput: 1,000 events/sec per pod
- Realistic data: Battery levels, device status, geographic distribution
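To make the AVRO serialization point concrete, here is a rough sketch of what a single send from a producer pod might look like with the Pulsar Java client. SensorData and its builder methods are hypothetical, modeled on the columns queried later (sensorId, sensorType, temperature, humidity, batteryLevel, timestamp); the actual generated class in producer-load may differ, and the proxy URL is an assumption.

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://pulsar-proxy.pulsar.svc.cluster.local:6650") // assumed proxy URL
        .build();

// Schema.AVRO derives the wire format from the (assumed) generated SensorData class
Producer<SensorData> producer = client.newProducer(Schema.AVRO(SensorData.class))
        .topic("persistent://public/default/iot-sensor-data")
        .blockIfQueueFull(true)
        .create();

SensorData reading = SensorData.newBuilder()   // hypothetical builder from the AVRO schema
        .setSensorId("sensor-0001")
        .setSensorType("temperature")
        .setTemperature(22.4)
        .setHumidity(41.0)
        .setBatteryLevel(87.0)
        .setTimestamp(System.currentTimeMillis())
        .build();

producer.sendAsync(reading); // async sends help each pod sustain ~1,000 events/sec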
Step 7: Verify the Complete Pipeline
After all components are deployed (~10-15 minutes total), verify data flow:
# Check producer is generating data
kubectl logs -n iot-pipeline -l app=iot-producer --tail=10
# Verify Pulsar has messages
kubectl exec -n pulsar pulsar-broker-0 -- \
bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data
# Check Flink is processing
kubectl logs -n flink-benchmark deployment/iot-flink-job --tail=20
# Query ClickHouse for data
kubectl exec -n clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 -- \
clickhouse-client --query "SELECT COUNT(*) FROM benchmark.sensors_local"
Expected data flow:
Producer: 50,000 events/sec generation
  ↓
Pulsar: Message ingestion and buffering
  ↓
Flink: Real-time stream processing with 1-minute windows
  ↓
ClickHouse: Data storage and analytics queries

End-to-end latency: < 2 seconds
Monitoring and Analytics
Access Grafana Dashboard
# Set up port forwarding
kubectl port-forward -n pulsar svc/grafana 3000:3000 &
# Open in browser
open http://localhost:3000
# Login: admin/admin
Key metrics to monitor:
- Pulsar: Message throughput, backlog size, storage usage
- Flink: Checkpoint duration, processing latency, job health
- ClickHouse: Query performance, insert rate, storage growth
- Infrastructure: CPU, memory, disk I/O across all nodes
Sample Analytics Queries
# Connect to ClickHouse
kubectl exec -it -n clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 -- clickhouse-client
-- Query examples based on our SensorData AVRO schema
USE benchmark;
-- Count total sensor readings
SELECT COUNT(*) as total_readings FROM sensors_local;
-- Average metrics by sensor type
SELECT
sensorType,
COUNT(*) as reading_count,
AVG(temperature) as avg_temp,
AVG(humidity) as avg_humidity,
AVG(batteryLevel) as avg_battery
FROM sensors_local
GROUP BY sensorType
ORDER BY sensorType;
-- Identify sensors with low battery
SELECT
sensorId,
sensorType,
AVG(batteryLevel) as avg_battery,
MIN(batteryLevel) as min_battery
FROM sensors_local
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY sensorId, sensorType
HAVING avg_battery < 25.0
ORDER BY avg_battery ASC;
-- Hourly data ingestion rate
SELECT
toStartOfHour(timestamp) as hour,
COUNT(*) as records_per_hour
FROM sensors_local
WHERE timestamp >= now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour DESC;
-- Temperature anomaly detection
SELECT
sensorId,
timestamp,
temperature,
status
FROM sensors_local
WHERE temperature > 35.0
AND timestamp >= now() - INTERVAL 1 HOUR
ORDER BY timestamp DESC
LIMIT 100;
Performance Benchmarks
Real-World Performance Metrics
On this cost-optimized setup, you can expect:
| Metric | Value | Notes |
|---|---|---|
| Message Throughput | 50,000 events/sec | Sustained rate with 50 producer pods |
| End-to-end Latency | < 2 seconds | Producer → ClickHouse |
| Query Performance | < 500ms | Analytical queries on 1B+ records |
| CPU Utilization | 60-70% | Across all node groups |
| Memory Usage | ~80% | Optimized for t3 instance types |
| Storage Growth | ~10 GB/hour | With 30-day TTL retention |
| Availability | 99.9%+ | Multi-AZ deployment |
Cost vs. Performance Analysis
Infrastructure Efficiency:
- $1,250/month for 50K events/sec
- Roughly $0.01 per million events processed (50,000 events/sec is ~130 billion events/month, so $1,250 ÷ 130B ≈ $0.00000001 per event)
- Production-ready reliability and monitoring
- Linear scaling path to 1M events/sec

Scaling Characteristics:
- The same architecture scales from 1K → 1M events/sec
- Infrastructure-as-Code deployment
- Easy migration path to an enterprise setup
- Predictable monthly costs
Troubleshooting
Common Issues and Solutions
Issue: High Memory Usage
# Check memory usage across nodes
kubectl top nodes
# Scale up instances if needed
# Edit terraform.tfvars
instance_type = "t3.2xlarge" # Upgrade from t3.xlarge
terraform apply
Issue: Pods Stuck in Pending
# Check node availability
kubectl get nodes
# Check pod events
kubectl describe pod <pod-name> -n <namespace>
# Scale up nodes if needed
# Edit terraform.tfvars, increase desired_size
terraform apply
Issue: Flink Job Failing
# Check Flink logs
kubectl logs -n flink-benchmark deployment/iot-flink-job
# Common issues:
# - ClickHouse not ready: Wait 2-3 minutes
# - Pulsar not accessible: Check network policies
# - Out of memory: Scale up TaskManagers
Issue: No Data in ClickHouse
# 1. Check producer is running
kubectl get pods -n iot-pipeline -l app=iot-producer
# 2. Check Pulsar has data
kubectl exec -n pulsar pulsar-broker-0 -- \
bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data
# 3. Check Flink is processing
kubectl logs -n flink-benchmark deployment/iot-flink-job | tail -50
# 4. Verify ClickHouse connectivity
kubectl exec -n clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 -- \
clickhouse-client --query "SELECT 1"
Issue: Optimizing Costs
# Enable spot instances for cost savings (can reduce to ~$400-500/month)
use_spot_instances = true
# Reduce node counts during development/testing
desired_size = 1 # Instead of 3-6
# Use smaller instance types for development
instance_type = "t3.small" # Instead of t3.large
# Schedule auto-shutdown for non-production environments
# Use AWS Instance Scheduler
Cleanup
When you're done testing:
# Delete all Kubernetes resources
kubectl delete namespace iot-pipeline flink-benchmark clickhouse pulsar
# Destroy AWS infrastructure
terraform destroy
# Type 'yes' when prompted
# This will:
# - Delete EKS cluster
# - Delete VPC and subnets
# - Delete S3 buckets (after emptying)
# - Delete IAM roles
# - Delete ECR repositories
Warning: This is irreversible! Make sure to back up any data first.
Production Best Practices
1. Optimize Instance Usage
# terraform.tfvars - Consider spot instances for cost savings
use_spot_instances = true # Can reduce costs by 60-70%
Benefits of optimization:
- Spot instances can reduce costs significantly
- Right-sizing instances based on actual usage
- Auto-scaling to handle variable workloads
Current setup uses on-demand instances for:
- Guaranteed availability and performance
- Simplified operations and management
- Predictable monthly costs ($1,250)
2. Set Up Alerts
# CloudWatch alarms (via Terraform)
- CPU utilization > 80%
- Disk usage > 85%
- Pod crashes > 3 in 5 minutes
- Flink checkpoint failures
3. Implement Data Retention
-- ClickHouse TTL (30 days)
ALTER TABLE benchmark.sensors_local
MODIFY TTL timestamp + INTERVAL 30 DAY;
-- Pulsar retention (7 days)
bin/pulsar-admin namespaces set-retention public/default \
--size 100G --time 7d
4. Enable Backups
# ClickHouse backups to S3
clickhouse-backup create daily_backup
clickhouse-backup upload daily_backup
# Flink savepoints
kubectl exec -n flink-benchmark <jobmanager-pod> -- \
flink savepoint <job-id> s3://bench-low-infra-state/savepoints
5. Use Auto-Scaling
# HorizontalPodAutoscaler for producers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iot-producer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: iot-producer
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
What You've Learned
By following this guide, you've:
- Deployed a production-grade streaming platform on AWS
- Configured cost-optimized infrastructure with t3 instances
- Set up real-time stream processing with Apache Flink
- Implemented exactly-once semantics with checkpointing
- Built a scalable message broker with Apache Pulsar
- Configured an analytics database with ClickHouse
- Enabled monitoring and observability with Grafana
- Learned cost optimization strategies (up to 90% savings over typical production setups)
Next Steps
1. Customize for Your Domain
Modify the producer to generate your specific event types:
// Edit: producer-load/src/main/java/com/iot/pipeline/producer/
public class EventDataProducer {
private EventData generateEvent() {
// Your custom event generation logic
return new EventData(
customerId,
orderId,
amount,
timestamp
);
}
}
2. Add Custom Flink Transformations
// Edit: flink-load/flink-consumer/src/main/java/com/iot/pipeline/flink/
sensorStream
.filter(record -> record.getTemperature() > 30.0) // Custom filter
.map(record -> new AlertRecord(record)) // Custom transformation
.addSink(new AlertSink()); // Custom sink
3. Implement Advanced Analytics
-- Create materialized views in ClickHouse
CREATE MATERIALIZED VIEW hourly_aggregates
ENGINE = AggregatingMergeTree()
ORDER BY (sensorId, hour)
AS SELECT
sensorId,
toStartOfHour(timestamp) as hour,
avg(temperature) as avg_temp,
max(temperature) as max_temp
FROM benchmark.sensors_local
GROUP BY sensorId, hour;
4. Scale to Production
When ready for production scale:
- Enable spot instances for cost savings
- Set up automated backups
- Configure CloudWatch alarms
- Implement log aggregation (CloudWatch Logs)
- Set up CI/CD pipeline
- Enable AWS Shield for DDoS protection
Resources
- 50K Events Repository: realtime-platform-50k-events
- Main Repository: RealtimeDataPlatform
- AWS EKS Documentation: docs.aws.amazon.com/eks
- Apache Flink: flink.apache.org
- Apache Pulsar: pulsar.apache.org
- ClickHouse: clickhouse.com/docs
- Terraform: terraform.io/docs
Conclusion
You now have a production-grade, cost-optimized streaming platform running on AWS for about $1,250/month. This setup demonstrates real-world patterns used by companies processing millions of events per day, optimized here for moderate scale and budget constraints.
The beauty of this architecture is its flexibility:
- Start small (1K events/sec, lower cost)
- Grow to moderate (50K events/sec, ~$1,250/mo)
- Scale to enterprise (1M events/sec, higher cost)
All with the same codebase and deployment patterns!
Key takeaways:
- AWS EKS deployment provides managed Kubernetes for production workloads
- t3 instances deliver excellent price/performance for streaming workloads
- $1,250/month infrastructure cost for 50K events/sec processing
- AVRO schemas enable efficient serialization at scale
- Production-ready with monitoring, alerting, and auto-scaling
What would you build with this platform? Share your use case in the comments!
Found this helpful? Follow me for more posts on cloud architecture, real-time data engineering, and cost optimization strategies!
Next in the series: "Scaling to 1 Million Events/Second - Enterprise Production Guide"
Support This Project
- Star the repo if you found it useful!
- Report issues - help us improve
- Production-tested - used in real workloads
- Well-documented - complete guides included
- Cost-optimized - save 90% on infrastructure
Performance Summary
| Metric | Value |
|---|---|
| Throughput | 50,000 events/sec |
| Latency | < 2 seconds end-to-end |
| Monthly Cost | $1,250 (on-demand instances) |
| Storage | ~1TB (30 days retention) |
| Availability | 99.9% (multi-AZ deployment) |
| Scalability | 1K → 1M events/sec |
| Setup Time | ~45 minutes |
Tags: #aws #eks #streaming #costsavings #dataengineering #pulsar #flink #clickhouse #terraform #avro #realtimeanalytics
Top comments (2)
Wow, this is a seriously impressive architecture! Handling 50K events/sec is a massive scale, and the cost breakdown is super insightful. It's fascinating to see what's possible with EKS and Kafka for high-throughput streaming.
My own serverless project (tarihasistani.com.tr) operates at the completely opposite end of the spectrum: it's a low-traffic, event-driven AI platform built on Lambda and API Gateway, designed for extreme low cost ('pay-per-use').
This makes me wonder about the 'middle ground.' In your opinion, at what point does a simple, serverless event-driven architecture (like Lambda + SQS) break down? Is there a certain 'events/sec' threshold or a level of processing complexity where you'd say 'stop using serverless, it's time to move to Kubernetes (EKS)'?
Thank you @oguzkhan80. In your case, once the event rate starts increasing, your Lambda hits its limits, and once it does, incoming events get lost. The other factor is how heavy each invocation is. If the Lambda is lightweight (just read the event and store it in a database), there shouldn't be many issues. But if you're doing per-event processing that demands more CPU and memory, you'll hit problems much sooner as the event rate grows. For example, with lightweight processing you might handle up to ~5,000 events/sec, while with heavy processing you may only reach ~1,000 events/sec. Since your current event rate is low, Lambda is good enough, but keep monitoring the rate and switch architectures at the appropriate time.