AWS EKS Deployment: Real-Time Data Streaming Platform - 50K Events/Sec for $1,250/Month

Building real-time streaming platforms can be expensive. Most production setups cost thousands of dollars per month. But what if you need to process 50,000 events per second at a reasonable cost?

In this guide, I'll show you how to deploy a production-grade event streaming platform on AWS EKS that costs $1,250/month while handling moderate-scale real-time data processing.

🎯 What We're Building

A complete, production-ready streaming platform that:

  • ✅ Processes 50,000 events per second in real-time
  • ✅ AWS infrastructure cost: ~$1,250/month
  • ✅ Uses t3 instance types for cost efficiency
  • ✅ Runs on AWS EKS with managed Kubernetes
  • ✅ Supports multiple domains: E-commerce, Finance, IoT, Gaming, Logistics
  • ✅ Provides real-time analytics with sub-second latency
  • ✅ Includes monitoring dashboards with Grafana
  • ✅ Offers easy scalability to 1M events/sec if needed

💰 Infrastructure Cost

AWS Infrastructure Cost: ~$1,250/month

This includes all compute instances (t3.medium to t3.xlarge), EKS cluster management, storage (EBS gp3), networking (NAT Gateway, Load Balancer), and monitoring services required for a production-ready 50K events/sec streaming platform.

πŸ—οΈ Architecture Overview

┌────────────────────────────────────────────────────────────────┐
│                    AWS EKS Cluster (us-west-2)                 │
│                    bench-low-infra (k8s 1.31)                  │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌──────────────┐    ┌──────────────┐    ┌─────────────────┐   │
│  │   PRODUCER   │───▶│    PULSAR    │───▶│      FLINK      │   │
│  │  (t3.medium) │    │  (t3.large)  │    │   (t3.large)    │   │
│  │              │    │              │    │                 │   │
│  │ Java/AVRO    │    │ ZK+Broker+BK │    │ JobManager +    │   │
│  │ 1K msg/sec   │    │ EBS Storage  │    │ TaskManager     │   │
│  │ per pod      │    │ 50K msg/sec  │    │ 1-min windows   │   │
│  └──────────────┘    └──────────────┘    └────────┬────────┘   │
│                                                   │            │
│                      ┌────────────────────────────┘            │
│                      ▼                                         │
│               ┌─────────────────┐                              │
│               │   CLICKHOUSE    │                              │
│               │  (t3.xlarge)    │                              │
│               │                 │                              │
│               │  EBS Storage    │                              │
│               │  Analytics DB   │                              │
│               └─────────────────┘                              │
│                                                                │
│  Supporting: VPC, S3 (checkpoints), ECR, IAM, EBS CSI          │
└────────────────────────────────────────────────────────────────┘

Tech Stack:

  • Kubernetes: AWS EKS 1.31
  • Message Broker: Apache Pulsar 3.1
  • Stream Processing: Apache Flink 1.18
  • Analytics DB: ClickHouse 24.x
  • Storage: EBS gp3 (cost-optimized, no expensive NVMe)
  • Infrastructure: Terraform
  • Monitoring: Grafana + Prometheus

📋 Prerequisites

Before starting, ensure you have:

# Install required tools (macOS)
brew install awscli terraform kubectl helm

# Or on Linux
# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Install Terraform
wget https://releases.hashicorp.com/terraform/1.6.0/terraform_1.6.0_linux_amd64.zip
unzip terraform_1.6.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/

# Configure AWS credentials
aws configure
# Enter: AWS Access Key, Secret Key, Region (us-west-2), Output format (json)

Required AWS Permissions:

  • EC2, VPC, EKS, S3, IAM, ECR (full access)
  • Estimated: ~$1,250/month for complete setup

🚀 Step-by-Step Deployment

Step 1: Clone the Repository

git clone https://github.com/hyperscaledesignhub/RealtimeDataPlatform.git
cd RealtimeDataPlatform/realtime-platform-50k-events

Repository structure:

realtime-platform-50k-events/
├── terraform/                # AWS infrastructure
├── producer-load/            # Event data generation
├── pulsar-load/              # Apache Pulsar deployment
├── flink-load/               # Apache Flink processing
├── clickhouse-load/          # ClickHouse analytics DB
└── monitoring/               # Grafana dashboards

Step 2: Deploy AWS Infrastructure

# Initialize Terraform
terraform init

# Review what will be created
terraform plan

# Deploy infrastructure (takes ~15-20 minutes)
terraform apply
# Type 'yes' when prompted
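
Before applying, you can tune the main knobs through Terraform variables. The exact variable names are defined in the repo's terraform/ configuration, so treat this terraform.tfvars sketch as illustrative only:

# terraform.tfvars -- illustrative values; confirm the variable names in terraform/variables.tf
aws_region         = "us-west-2"
cluster_name       = "bench-low-infra"
use_spot_instances = false   # flip to true for the ~60-70% savings discussed later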

What gets created:

  • ✅ VPC with public/private subnets (10.1.0.0/16)
  • ✅ EKS cluster bench-low-infra (k8s 1.31)
  • ✅ 9 node groups with t3 instances (a Terraform sketch follows this list):
    • Producer: t3.medium (1 node, scales to 3)
    • Pulsar ZooKeeper: t3.small (3 nodes)
    • Pulsar Broker: t3.large (3 nodes)
    • Pulsar BookKeeper: t3.large (4 nodes)
    • Pulsar Proxy: t3.small (2 nodes)
    • Flink JobManager: t3.large (1 node)
    • Flink TaskManager: t3.large (6 nodes)
    • ClickHouse: t3.xlarge (4 nodes)
    • General: t3.small (1 node)
  • ✅ S3 bucket for Flink checkpoints
  • ✅ ECR repositories for container images
  • ✅ IAM roles and policies
  • ✅ EBS CSI driver for persistent volumes
  • ✅ NAT Gateway for internet access
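
As a rough idea of how one of these node groups is expressed, here is a minimal Terraform sketch of an EKS managed node group. The resource arguments are standard aws_eks_node_group arguments, but the IAM role, subnet reference, and labels are assumptions for illustration — the actual definitions live in the repo's terraform/ directory:

# Illustrative sketch, not the repo's exact code
resource "aws_eks_node_group" "flink_taskmanager" {
  cluster_name    = "bench-low-infra"
  node_group_name = "flink-taskmanager"
  node_role_arn   = aws_iam_role.eks_nodes.arn    # assumed IAM role resource
  subnet_ids      = module.vpc.private_subnets    # assumed VPC module output

  instance_types = ["t3.large"]
  capacity_type  = "ON_DEMAND"

  scaling_config {
    desired_size = 6
    min_size     = 1
    max_size     = 8
  }

  labels = {
    workload = "flink-taskmanager"   # matched by nodeSelectors in the workload manifests (assumed)
  }
}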

Configure kubectl:

aws eks update-kubeconfig --region us-west-2 --name bench-low-infra

# Verify nodes are ready
kubectl get nodes

Expected output:

NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-1-x-x.us-west-2.compute.internal      Ready    <none>   5m    v1.31.0-eks-...
ip-10-1-x-x.us-west-2.compute.internal      Ready    <none>   5m    v1.31.0-eks-...
...

Step 3: Deploy Apache Pulsar (Message Broker)

cd pulsar-load

# Deploy Pulsar with Helm
./deploy.sh

# Monitor deployment (takes ~5-10 minutes)
kubectl get pods -n pulsar -w
# Press Ctrl+C when all pods are Running

What this deploys:

  • ZooKeeper: 3 replicas (metadata management)
  • Broker: 3 replicas (message routing)
  • BookKeeper: 4 replicas (message storage on EBS)
  • Proxy: 2 replicas (load balancing)
  • Grafana: Monitoring dashboard
  • Victoria Metrics: Metrics storage

Verify Pulsar is healthy:

kubectl get pods -n pulsar | grep -E "zookeeper|broker|bookkeeper"

# All pods should show "Running" status
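
If the deploy scripts do not already create the producer's topic, you can pre-create it as a partitioned topic so the 50K msg/sec load spreads across brokers. This is a hedged example — the partition count is an assumption, not a value taken from the repo:

# Optional: pre-create the topic with multiple partitions (count is illustrative)
kubectl exec -n pulsar pulsar-broker-0 -- \
  bin/pulsar-admin topics create-partitioned-topic \
  persistent://public/default/iot-sensor-data --partitions 16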

Step 4: Deploy ClickHouse (Analytics Database)

cd ../clickhouse-load

# Install ClickHouse operator and cluster
./00-install-clickhouse.sh

# Wait for ClickHouse pods (~3-5 minutes)
kubectl get pods -n clickhouse -w

# Create database schema
./00-create-schema-all-replicas.sh

What this creates:

  • ClickHouse cluster: 4 nodes (2 shards × 2 replicas)
  • Database: benchmark
  • Table: sensors_local (optimized for IoT sensor data; a schema sketch follows this list)
  • Storage: EBS gp3 volumes (200 GB per node)
  • Retention: 30 days TTL
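
For orientation, here is a hedged sketch of what sensors_local could look like. The authoritative DDL is applied by 00-create-schema-all-replicas.sh; the column list below is only inferred from the queries used later in this post:

-- Illustrative schema sketch, not the repo's exact DDL
CREATE TABLE IF NOT EXISTS benchmark.sensors_local
(
    sensorId     String,
    sensorType   LowCardinality(String),
    timestamp    DateTime,
    temperature  Float64,
    humidity     Float64,
    batteryLevel Float64,
    status       LowCardinality(String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/sensors_local', '{replica}')
PARTITION BY toDate(timestamp)
ORDER BY (sensorId, timestamp)
TTL timestamp + INTERVAL 30 DAY;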

Test ClickHouse:

# Connect to ClickHouse
kubectl exec -it -n clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 -- clickhouse-client

# Run test query
SELECT version();

# Exit with Ctrl+D

Step 5: Deploy Apache Flink (Stream Processing)

cd ../flink-load

# Build and push Flink consumer image to ECR
./build-and-push.sh

# Deploy Flink cluster
./deploy.sh

# Submit Flink job
kubectl apply -f flink-job-deployment.yaml

# Monitor Flink job (~2-3 minutes to start)
kubectl get pods -n flink-benchmark -w

What this deploys:

  • Flink JobManager: 1 replica (job coordination)
  • Flink TaskManager: 6 replicas (data processing)
  • S3 checkpoints: Every 1 minute
  • Job: JDBC IoT Data Pipeline (AVRO deserialization)

Flink Job Details:

// Stream processing pipeline using the SensorData AVRO schema
DataStream<SensorRecord> sensorStream = env.fromSource(
    pulsarSource,
    WatermarkStrategy.noWatermarks(),
    "Pulsar IoT Source"
);

// 1-minute aggregation windows
sensorStream
    .keyBy(record -> record.getSensorId())
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .aggregate(new SensorAggregator())
    .addSink(new ClickHouseJDBCSink(clickhouseUrl));
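
The SensorAggregator and the checkpoint settings live in flink-load/flink-consumer/; the sketch below shows one plausible shape for the aggregator, assuming the SensorRecord getters used above and a hypothetical AggregatedReading output type:

// Minimal sketch, not the repo's exact code.
// In the job's main(), checkpointing roughly like:
//   env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);  // 1-minute S3 checkpoints
import org.apache.flink.api.common.functions.AggregateFunction;

public class SensorAggregator
        implements AggregateFunction<SensorRecord, SensorAggregator.Acc, AggregatedReading> {

    public static class Acc {
        String sensorId;
        long count;
        double tempSum;
    }

    @Override
    public Acc createAccumulator() { return new Acc(); }

    @Override
    public Acc add(SensorRecord record, Acc acc) {
        acc.sensorId = record.getSensorId();
        acc.count++;
        acc.tempSum += record.getTemperature();
        return acc;
    }

    @Override
    public AggregatedReading getResult(Acc acc) {
        // One row per sensor per 1-minute window; the JDBC sink writes it to ClickHouse
        return new AggregatedReading(acc.sensorId, acc.count, acc.tempSum / Math.max(acc.count, 1));
    }

    @Override
    public Acc merge(Acc a, Acc b) {
        a.count += b.count;
        a.tempSum += b.tempSum;
        return a;
    }
}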

Step 6: Deploy IoT Producer (Event Generation)

cd ../producer-load

# Build and deploy IoT producer
./deploy.sh

# Scale producers based on desired throughput
kubectl scale deployment iot-producer -n iot-pipeline --replicas=50

# Monitor producer status
kubectl get pods -n iot-pipeline -l app=iot-producer

Producer capabilities:

  • AVRO serialization: Uses optimized SensorData schema (a producer sketch follows this list)
  • Multi-sensor types: Temperature, humidity, pressure, motion, light, CO2, noise
  • Configurable throughput: 1,000 events/sec per pod
  • Realistic data: Battery levels, device status, geographic distribution
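
To make the send path concrete, here is a minimal sketch using the Pulsar Java client with an AVRO schema. The service URL and the generateSensorReading() helper are assumptions for illustration; the real implementation (including rate limiting) is in producer-load/:

// Illustrative sketch, not the repo's exact code
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://pulsar-proxy.pulsar.svc.cluster.local:6650")  // assumed in-cluster proxy address
        .build();

Producer<SensorData> producer = client.newProducer(Schema.AVRO(SensorData.class))
        .topic("persistent://public/default/iot-sensor-data")
        .blockIfQueueFull(true)
        .create();

for (int i = 0; i < 1_000; i++) {                 // roughly one second's worth of events for one pod
    producer.sendAsync(generateSensorReading());  // hypothetical helper that builds a SensorData record
}
producer.flush();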

📊 Step 7: Verify the Complete Pipeline

After all components are deployed (~10-15 minutes total), verify data flow:

# Check producer is generating data
kubectl logs -n iot-pipeline -l app=iot-producer --tail=10

# Verify Pulsar has messages
kubectl exec -n pulsar pulsar-broker-0 -- \
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data

# Check Flink is processing
kubectl logs -n flink-benchmark deployment/iot-flink-job --tail=20

# Query ClickHouse for data
kubectl exec -n clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 -- \
  clickhouse-client --query "SELECT COUNT(*) FROM benchmark.sensors_local"

Expected data flow:

✅ Producer: 50,000 events/sec generation
✅ Pulsar: Message ingestion and buffering
✅ Flink: Real-time stream processing with 1-minute windows
✅ ClickHouse: Data storage and analytics queries
✅ End-to-end latency: < 2 seconds

πŸ” Monitoring and Analytics

Access Grafana Dashboard

# Set up port forwarding
kubectl port-forward -n pulsar svc/grafana 3000:3000 &

# Open in browser
open http://localhost:3000
# Login: admin/admin

Key metrics to monitor (sample PromQL queries follow this list):

  • Pulsar: Message throughput, backlog size, storage usage
  • Flink: Checkpoint duration, processing latency, job health
  • ClickHouse: Query performance, insert rate, storage growth
  • Infrastructure: CPU, memory, disk I/O across all nodes
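
A few hedged PromQL starting points for these dashboards — the exact metric names depend on which exporters the Pulsar and Flink charts enable, so verify them in Grafana's Explore view first:

# Illustrative queries; confirm metric names against your metrics store
sum(pulsar_rate_in)                           # total ingest rate (msg/s) across topics
sum(pulsar_msg_backlog)                       # total backlog across topics
flink_jobmanager_job_lastCheckpointDuration   # duration of the last Flink checkpoint (ms)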

Sample Analytics Queries

-- Connect to ClickHouse
kubectl exec -it -n clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 -- clickhouse-client

-- Query examples based on our SensorData AVRO schema
USE benchmark;

-- Count total sensor readings
SELECT COUNT(*) as total_readings FROM sensors_local;

-- Average metrics by sensor type
SELECT 
    sensorType,
    COUNT(*) as reading_count,
    AVG(temperature) as avg_temp,
    AVG(humidity) as avg_humidity,
    AVG(batteryLevel) as avg_battery
FROM sensors_local
GROUP BY sensorType
ORDER BY sensorType;

-- Identify sensors with low battery
SELECT 
    sensorId,
    sensorType,
    AVG(batteryLevel) as avg_battery,
    MIN(batteryLevel) as min_battery
FROM sensors_local
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY sensorId, sensorType
HAVING avg_battery < 25.0
ORDER BY avg_battery ASC;

-- Hourly data ingestion rate
SELECT 
    toStartOfHour(timestamp) as hour,
    COUNT(*) as records_per_hour
FROM sensors_local
WHERE timestamp >= now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour DESC;

-- Temperature anomaly detection
SELECT 
    sensorId,
    timestamp,
    temperature,
    status
FROM sensors_local
WHERE temperature > 35.0 
  AND timestamp >= now() - INTERVAL 1 HOUR
ORDER BY timestamp DESC
LIMIT 100;

📈 Performance Benchmarks

Real-World Performance Metrics

On this cost-optimized setup, you can expect:

| Metric             | Value             | Notes                                |
| ------------------ | ----------------- | ------------------------------------ |
| Message Throughput | 50,000 events/sec | Sustained rate with 50 producer pods |
| End-to-end Latency | < 2 seconds       | Producer → ClickHouse                |
| Query Performance  | < 500ms           | Analytical queries on 1B+ records    |
| CPU Utilization    | 60-70%            | Across all node groups               |
| Memory Usage       | ~80%              | Optimized for t3 instance types      |
| Storage Growth     | ~10 GB/hour       | With 30-day TTL retention            |
| Availability       | 99.9%+            | Multi-AZ deployment                  |

Cost vs. Performance Analysis

💰 Infrastructure Efficiency:
- $1,250/month for 50K events/sec
- Roughly $0.01 per million events processed (about 130 billion events/month)
- Production-ready reliability and monitoring
- Linear scaling to 1M events/sec

📊 Scaling Characteristics:
- Same architecture scales from 1K → 1M events/sec
- Infrastructure-as-Code deployment
- Easy migration path to enterprise setup
- Predictable monthly costs

🛠️ Troubleshooting

Common Issues and Solutions

Issue: High Memory Usage

# Check memory usage across nodes
kubectl top nodes

# Scale up instances if needed
# Edit terraform.tfvars
instance_type = "t3.2xlarge"  # Upgrade from t3.xlarge
terraform apply

Issue: Pods Stuck in Pending

# Check node availability
kubectl get nodes

# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Scale up nodes if needed
# Edit terraform.tfvars, increase desired_size
terraform apply

Issue: Flink Job Failing

# Check Flink logs
kubectl logs -n flink-benchmark deployment/iot-flink-job

# Common issues:
# - ClickHouse not ready: Wait 2-3 minutes
# - Pulsar not accessible: Check network policies
# - Out of memory: Scale up TaskManagers

Issue: No Data in ClickHouse

# 1. Check producer is running
kubectl get pods -n iot-pipeline -l app=iot-producer

# 2. Check Pulsar has data
kubectl exec -n pulsar pulsar-broker-0 -- \
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data

# 3. Check Flink is processing
kubectl logs -n flink-benchmark deployment/iot-flink-job | tail -50

# 4. Verify ClickHouse connectivity
kubectl exec -n clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 -- \
  clickhouse-client --query "SELECT 1"

Tip: Optimizing Costs

# Enable spot instances for cost savings (can reduce to ~$400-500/month)
use_spot_instances = true

# Reduce node counts during development/testing
desired_size = 1  # Instead of 3-6

# Use smaller instance types for development
instance_type = "t3.small"  # Instead of t3.large

# Schedule auto-shutdown for non-production environments
# Use AWS Instance Scheduler

🧹 Cleanup

When you're done testing:

# Delete all Kubernetes resources
kubectl delete namespace iot-pipeline flink-benchmark clickhouse pulsar

# Destroy AWS infrastructure
terraform destroy
# Type 'yes' when prompted

# This will:
# - Delete EKS cluster
# - Delete VPC and subnets
# - Delete S3 buckets (after emptying)
# - Delete IAM roles
# - Delete ECR repositories

⚠️ Warning: This is irreversible! Make sure to backup any data first.

💡 Production Best Practices

1. Optimize Instance Usage

# terraform.tfvars - Consider spot instances for cost savings
use_spot_instances = true  # Can reduce costs by 60-70%

Benefits of optimization:

  • Spot instances can reduce costs significantly
  • Right-sizing instances based on actual usage
  • Auto-scaling to handle variable workloads

Current setup uses on-demand instances for:

  • Guaranteed availability and performance
  • Simplified operations and management
  • Predictable monthly costs ($1,250)

2. Set Up Alerts

# CloudWatch alarms (via Terraform)
- CPU utilization > 80%
- Disk usage > 85%
- Pod crashes > 3 in 5 minutes
- Flink checkpoint failures
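
As a hedged example of what one of these alarms could look like in Terraform — the resource name, threshold, and SNS topic are illustrative, not taken from the repo:

resource "aws_cloudwatch_metric_alarm" "node_cpu_high" {
  alarm_name          = "bench-low-infra-node-cpu-high"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80
  period              = 300
  evaluation_periods  = 2
  alarm_description   = "Average node CPU above 80% for 10 minutes"
  # alarm_actions     = [aws_sns_topic.alerts.arn]   # assumed SNS topic for notifications
}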

3. Implement Data Retention

-- ClickHouse TTL (30 days)
ALTER TABLE benchmark.sensors_local 
MODIFY TTL timestamp + INTERVAL 30 DAY;

-- Pulsar retention (7 days)
bin/pulsar-admin namespaces set-retention public/default \
  --size 100G --time 7d

4. Enable Backups

# ClickHouse backups to S3
clickhouse-backup create daily_backup
clickhouse-backup upload daily_backup

# Flink savepoints
kubectl exec -n flink-benchmark <jobmanager-pod> -- \
  flink savepoint <job-id> s3://bench-low-infra-state/savepoints

5. Use Auto-Scaling

# HorizontalPodAutoscaler for producers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iot-producer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: iot-producer
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

🎓 What You've Learned

By following this guide, you've:

✅ Deployed a production-grade streaming platform on AWS

✅ Configured cost-optimized infrastructure with t3 instances

✅ Set up real-time stream processing with Apache Flink

✅ Implemented exactly-once semantics with checkpointing

✅ Built scalable message broker with Apache Pulsar

✅ Configured analytics database with ClickHouse

✅ Enabled monitoring and observability with Grafana

✅ Learned cost optimization strategies (up to ~90% savings compared with a typical enterprise-grade setup)

🚀 Next Steps

1. Customize for Your Domain

Modify the producer to generate your specific event types:

// Edit: producer-load/src/main/java/com/iot/pipeline/producer/
public class EventDataProducer {
    private EventData generateEvent() {
        // Your custom event generation logic
        return new EventData(
            customerId,
            orderId,
            amount,
            timestamp
        );
    }
}

2. Add Custom Flink Transformations

// Edit: flink-load/flink-consumer/src/main/java/com/iot/pipeline/flink/
sensorStream
    .filter(record -> record.getTemperature() > 30.0)  // Custom filter
    .map(record -> new AlertRecord(record))            // Custom transformation
    .addSink(new AlertSink());                         // Custom sink

3. Implement Advanced Analytics

-- Create materialized views in ClickHouse
CREATE MATERIALIZED VIEW hourly_aggregates
ENGINE = AggregatingMergeTree()
ORDER BY (sensorId, hour)
AS SELECT
    sensorId,
    toStartOfHour(timestamp) as hour,
    avgState(temperature) as avg_temp,  -- AggregatingMergeTree stores aggregate-function states
    maxState(temperature) as max_temp
FROM benchmark.sensors_local
GROUP BY sensorId, hour;
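
Because the view stores aggregate-function states, reads need the matching -Merge combinators to finalize them:

-- Reading from the materialized view
SELECT
    sensorId,
    hour,
    avgMerge(avg_temp) AS avg_temp,
    maxMerge(max_temp) AS max_temp
FROM hourly_aggregates
GROUP BY sensorId, hour
ORDER BY hour DESC
LIMIT 10;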

4. Scale to Production

When ready for production scale:

  • Enable spot instances for cost savings
  • Set up automated backups
  • Configure CloudWatch alarms
  • Implement log aggregation (CloudWatch Logs)
  • Set up CI/CD pipeline
  • Enable AWS Shield for DDoS protection

📚 Resources

💬 Conclusion

You now have a production-grade, cost-optimized streaming platform running on AWS for about $1,250/month! This setup demonstrates real-world patterns used by companies processing millions of events per day, but optimized for moderate scale and budget constraints.

The beauty of this architecture is its flexibility:

  • Start small (1K events/sec, lower cost)
  • Grow to moderate (50K events/sec, ~$1,250/mo)
  • Scale to enterprise (1M events/sec, higher cost)

All with the same codebase and deployment patterns!

Key takeaways:

  • AWS EKS deployment provides managed Kubernetes for production workloads
  • t3 instances deliver excellent price/performance for streaming workloads
  • $1,250/month infrastructure cost for 50K events/sec processing
  • AVRO schemas enable efficient serialization at scale
  • Production-ready with monitoring, alerting, and auto-scaling

What would you build with this platform? Share your use case in the comments! 👇


Found this helpful? Follow me for more posts on cloud architecture, real-time data engineering, and cost optimization strategies!

Next in the series: "Scaling to 1 Million Events/Second - Enterprise Production Guide"


🌟 Support This Project

⭐ Star the repo if you found it useful!

πŸ› Report issues - help us improve

💼 Production-tested - used in real workloads

📖 Well-documented - complete guides included

💰 Cost-optimized - save 90% on infrastructure


📊 Performance Summary

| Metric       | Value                        |
| ------------ | ---------------------------- |
| Throughput   | 50,000 events/sec            |
| Latency      | < 2 seconds end-to-end       |
| Monthly Cost | $1,250 (on-demand instances) |
| Storage      | ~1TB (30 days retention)     |
| Availability | 99.9% (multi-AZ deployment)  |
| Scalability  | 1K → 1M events/sec           |
| Setup Time   | ~45 minutes                  |

Tags: #aws #eks #streaming #costsavings #dataengineering #pulsar #flink #clickhouse #terraform #avro #realtimeanalytics

Top comments (2)

Oguzhan Bassari:

Wow, this is a seriously impressive architecture! Handling 50K events/sec is a massive scale, and the cost breakdown is super insightful. It's fascinating to see what's possible with EKS and Kafka for high-throughput streaming.

My own serverless project (tarihasistani.com.tr) operates at the completely opposite end of the spectrum – it's a low-traffic, event-driven AI platform built on Lambda and API Gateway, designed for extreme low cost ('pay-per-use').

This makes me wonder about the 'middle ground.' In your opinion, at what point does a simple, serverless event-driven architecture (like Lambda + SQS) break down? Is there a certain 'events/sec' threshold or a level of processing complexity where you'd say 'stop using serverless, it's time to move to Kubernetes (EKS)'?

HyperscaleDesignHub (author):

Thank you @oguzkhan80. In your case, once the event rate starts increasing, your Lambda hits its limits, and once it does, incoming events get lost. The other factor is how heavy the Lambda itself is: if it is lightweight — just reading an event and storing it in a DB — there shouldn't be many issues. But if you are doing real processing of each event, and that processing demands more CPU and memory, you will hit problems much sooner as the rate grows. For example, lightweight processing might handle up to 5,000 events/sec, while heavy processing might only reach about 1,000 events/sec. Since your current event rate is low, Lambda is good enough for now — just keep monitoring the event rate and migrate the architecture at the appropriate time.