When tasked with building a real-time data streaming platform capable of processing 1 million events per second, we faced a critical decision: build a self-hosted solution using open-source technologies, or leverage AWS managed services for convenience.
This article details how we built our self-hosted real-time data streaming platform and achieved a 90% cost reduction compared to equivalent AWS managed services, while maintaining enterprise-grade performance and reliability.
The result: a production-ready platform processing 1M events/sec for about $24,592/month, versus an estimated $243,033/month with AWS managed services.
You can find the complete implementation in our RealtimeDataPlatform repository.
Here's how we did it and the lessons learned along the way.
Architecture Comparison
Self-Hosted Stack Architecture
┌──────────────────────────────────────────────────────────────────────────────────┐
│                           Self-Hosted Stack on AWS EC2                           │
├──────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│ ┌───────────────┐    ┌───────────────┐    ┌───────────────┐    ┌───────────────┐ │
│ │ PRODUCER      │───▶│ PULSAR        │───▶│ FLINK         │───▶│ CLICKHOUSE    │ │
│ │               │    │               │    │               │    │               │ │
│ │ IoT Sensors   │    │ Open Source   │    │ Open Source   │    │ Open Source   │ │
│ │ AVRO Data     │    │ Message       │    │ Stream        │    │ Analytics     │ │
│ │               │    │ Broker        │    │ Processing    │    │ Database      │ │
│ │ 4x c5.4xl     │    │ 6x i7i.8xl    │    │ 4x c5.4xl     │    │ 6x r6id.4xl   │ │
│ │ Full Control  │    │ Self-Managed  │    │ Self-Managed  │    │ Self-Managed  │ │
│ └───────────────┘    └───────────────┘    └───────────────┘    └───────────────┘ │
│                                                                                  │
│                              Monthly Cost: ~$24,592                              │
└──────────────────────────────────────────────────────────────────────────────────┘
AWS Managed Stack Architecture
┌──────────────────────────────────────────────────────────────────────────────────┐
│                            AWS Managed Services Stack                            │
├──────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│ ┌───────────────┐    ┌───────────────┐    ┌───────────────┐    ┌───────────────┐ │
│ │ PRODUCER      │───▶│ MSK           │───▶│ KINESIS       │───▶│ REDSHIFT      │ │
│ │               │    │               │    │               │    │               │ │
│ │ IoT Sensors   │    │ Managed       │    │ Data Analytics│    │ Serverless    │ │
│ │ AVRO Data     │    │ Streaming     │    │ for Flink     │    │ Analytics     │ │
│ │               │    │ for Kafka     │    │               │    │ Warehouse     │ │
│ │ 4x c5.4xl     │    │ AWS Managed   │    │ AWS Managed   │    │ AWS Managed   │ │
│ │ Hands-off     │    │ Serverless    │    │ Serverless    │    │ Serverless    │ │
│ └───────────────┘    └───────────────┘    └───────────────┘    └───────────────┘ │
│                                                                                  │
│                             Monthly Cost: ~$243,033                              │
└──────────────────────────────────────────────────────────────────────────────────┘
The Self-Hosted Stack: Control and Cost-Effectiveness
In this scenario, we deploy our entire stack on Amazon EC2 instances. This gives us maximum control over the configuration and tuning of each component.
The Technology Stack
- Message Broker: Apache Pulsar
- Stream Processing: Apache Flink
- Analytics Database: ClickHouse
Infrastructure Details
Apache Pulsar Cluster:
- 6× i7i.8xlarge instances (32 vCPU, 256GB RAM, 2×3.75TB NVMe)
- Co-located brokers and bookies
- NVMe device separation for journal and ledger storage
- Cost: ~$12,960/month
Apache Flink Cluster:
- 4× c5.4xlarge instances (16 vCPU, 32GB RAM)
- 64-way parallelism matching Pulsar partitions
- 1:1 slot-to-CPU ratio for optimal performance
- Cost: ~$2,400/month
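The parallelism figure follows directly from the instance count and the 1:1 slot-to-CPU ratio. A minimal sketch of that arithmetic, assuming one set of task slots per EC2 instance; the `FlinkCluster` class and `check_parallelism` helper are hypothetical names used only for illustration, not part of Flink or of our repository:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlinkCluster:
    """Hypothetical description of the Flink cluster sizing."""
    instances: int           # 4 c5.4xlarge instances in the article
    vcpus_per_instance: int  # 16 vCPUs on a c5.4xlarge
    slots_per_vcpu: int = 1  # the 1:1 slot-to-CPU ratio

    @property
    def total_slots(self) -> int:
        return self.instances * self.vcpus_per_instance * self.slots_per_vcpu


def check_parallelism(cluster: FlinkCluster, pulsar_partitions: int) -> int:
    """Return the job parallelism, or raise if slots and partitions differ."""
    if cluster.total_slots != pulsar_partitions:
        raise ValueError(
            f"{cluster.total_slots} slots cannot map 1:1 onto "
            f"{pulsar_partitions} partitions"
        )
    return cluster.total_slots


cluster = FlinkCluster(instances=4, vcpus_per_instance=16)
parallelism = check_parallelism(cluster, pulsar_partitions=64)
print(f"Job parallelism: {parallelism}")
```

With matching counts, each Flink subtask can read from exactly one Pulsar partition and no slot sits idle.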
ClickHouse Cluster:
- 6× r6id.4xlarge instances (16 vCPU, 128GB RAM, 950GB NVMe)
- Distributed analytics with real-time ingestion
- Optimized for sub-second query performance
- Cost: ~$7,200/month
Producer Infrastructure:
- 4Γ c5.4xlarge instances for load generation
- AVRO serialization for efficient data transfer
- Cost: ~$1,920/month
The Cost Breakdown
Running cost-estimation tooling against the Terraform configuration for this setup gives the following breakdown:
| Component | Instance Type | Count | Monthly Cost |
|---|---|---|---|
| Pulsar (Broker+Bookie) | i7i.8xlarge | 6 | $12,960 |
| ClickHouse | r6id.4xlarge | 6 | $7,200 |
| Flink | c5.4xlarge | 4 | $2,400 |
| Producers | c5.4xlarge | 4 | $1,920 |
| Supporting Infrastructure | Various | - | $112 |
| Total | | | $24,592 |
Total Estimated Monthly Cost: $24,592
The primary cost drivers are the large EC2 instances required to handle the 1 million events/sec workload, particularly for the Pulsar brokers and ClickHouse nodes.
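To make the cost drivers concrete, here is a small sketch that totals the table and computes each component's share. The dollar figures are the ones from the table above; the dictionary and variable names are ours for illustration:

```python
# Monthly costs in USD, copied from the breakdown table.
monthly_cost_usd = {
    "Pulsar (Broker+Bookie)": 12_960,
    "ClickHouse": 7_200,
    "Flink": 2_400,
    "Producers": 1_920,
    "Supporting Infrastructure": 112,
}

total = sum(monthly_cost_usd.values())
shares = {name: cost / total for name, cost in monthly_cost_usd.items()}

# Print components from most to least expensive.
for name, cost in sorted(monthly_cost_usd.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<28} ${cost:>7,}  {shares[name]:6.1%}")
print(f"{'Total':<28} ${total:>7,}")
```

Pulsar alone accounts for a little over half of the bill, and Pulsar plus ClickHouse together for roughly 82%.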
The AWS Managed Stack: Convenience at a Premium
In this approach, we replace our self-hosted components with their AWS-native counterparts. This offloads the operational burden of managing the infrastructure to AWS.
The Technology Stack
- Managed Kafka: Amazon MSK (Managed Streaming for Apache Kafka)
- Managed Flink: Amazon Managed Service for Apache Flink (formerly Amazon Kinesis Data Analytics for Apache Flink)
- ClickHouse Equivalent: Amazon Redshift Serverless
Service Configuration & Assumptions
Amazon MSK:
- 1KB average event size
- 1-day data retention
- High throughput configuration
- Multi-AZ deployment for reliability
Kinesis Data Analytics for Flink:
- 64 Kinesis Processing Units (KPUs)
- Continuous processing (24/7)
- Auto-scaling enabled
Amazon Redshift Serverless:
- 1-month data retention
- High-performance analytics workload
- On-demand scaling
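These assumptions translate into data volumes that explain much of the managed-service bill. A rough sizing sketch, assuming 1KB means 1,000 bytes, a 30-day month, and uncompressed data; the replication factor of 3 is our own illustrative assumption, not a figure from the configuration above:

```python
EVENTS_PER_SEC = 1_000_000
EVENT_SIZE_BYTES = 1_000   # "1KB average event size", taken as 1,000 bytes
SECONDS_PER_DAY = 86_400
TB = 10**12                # decimal terabyte


def ingest_bytes_per_sec() -> int:
    """Sustained ingest rate in bytes per second."""
    return EVENTS_PER_SEC * EVENT_SIZE_BYTES


def retained_bytes(retention_days: int, replication_factor: int = 1) -> int:
    """Uncompressed bytes held for the given retention window."""
    return ingest_bytes_per_sec() * SECONDS_PER_DAY * retention_days * replication_factor


msk_raw_tb = retained_bytes(1) / TB                               # 1-day retention
msk_replicated_tb = retained_bytes(1, replication_factor=3) / TB  # assumed RF=3
warehouse_raw_tb = retained_bytes(30) / TB                        # 1-month retention

print(f"Ingest rate:           {ingest_bytes_per_sec() / 10**9:.1f} GB/s")
print(f"MSK, 1 day (raw):      {msk_raw_tb:,.1f} TB")
print(f"MSK, 1 day (RF=3):     {msk_replicated_tb:,.1f} TB")
print(f"Warehouse, 30 days:    {warehouse_raw_tb:,.1f} TB")
```

Compression and downsampling would reduce these numbers in practice, so treat them as upper bounds for capacity discussions.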
The Cost Breakdown
Estimating the cost for a managed stack at this scale requires several assumptions. Based on AWS pricing and typical configurations:
| Service | Configuration | Monthly Cost |
|---|---|---|
| Amazon MSK | High throughput, 1-day retention | $30,525 |
| Kinesis Data Analytics | 64 KPUs, continuous processing | $81,180 |
| Amazon Redshift Serverless | 1-month retention, analytics | $131,328 |
| Total | | $243,033 |
Total Estimated Monthly Cost: ~$243,033
The Cost Comparison: A Dramatic Difference
Let's put these numbers side-by-side:
| Approach | Monthly Cost | Cost per Million Events |
|---|---|---|
| Self-Hosted on EC2 | $24,592 | $0.0094 |
| AWS Managed Services | $243,033 | $0.0925 |
| Difference | +888% | +888% |
Per-event costs assume a 730-hour month at a sustained 1M events/sec.
The difference is stark: the AWS managed stack costs roughly 10 times as much as the self-hosted approach for this high-throughput scenario.
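For readers who want to rerun the comparison with their own figures, here is a sketch of how the per-event cost and the relative difference can be derived. It assumes a sustained 1M events/sec and a 730-hour month (a common AWS billing convention); a different month length shifts the per-event figures slightly. The function and constant names are ours:

```python
EVENTS_PER_SEC = 1_000_000
HOURS_PER_MONTH = 730  # assumption: AWS-style average month


def cost_per_million_events(monthly_cost_usd: float) -> float:
    """USD per one million events at the sustained rate."""
    events_per_month = EVENTS_PER_SEC * 3600 * HOURS_PER_MONTH
    return monthly_cost_usd / (events_per_month / 1_000_000)


self_hosted = 24_592
managed = 243_033

ratio = managed / self_hosted                       # how many times the cost
premium_pct = (ratio - 1) * 100                     # extra cost relative to self-hosted
reduction_pct = (1 - self_hosted / managed) * 100   # savings relative to managed

print(f"Self-hosted: ${cost_per_million_events(self_hosted):.4f} per million events")
print(f"Managed:     ${cost_per_million_events(managed):.4f} per million events")
print(f"Managed is {ratio:.2f}x the cost (+{premium_pct:.0f}%)")
print(f"Self-hosting saves {reduction_pct:.1f}%")
```

Note the distinction between the cost multiple (about 9.9×) and the percentage increase (about 888%); the headline 90% saving is the same comparison expressed relative to the managed price.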
Cost Analysis by Component
Messaging Layer:
- Self-hosted Pulsar: $12,960/month
- AWS MSK: $30,525/month
- Premium: 136% (about 2.4× the self-hosted cost)
Stream Processing:
- Self-hosted Flink: $2,400/month
- Kinesis Data Analytics: $81,180/month
- Premium: 3,283% (about 33.8× the self-hosted cost)
Analytics Storage:
- Self-hosted ClickHouse: $7,200/month
- Redshift Serverless: $131,328/month
- Premium: 1,724% (about 18.2× the self-hosted cost)
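The per-layer comparison can be reproduced from the two monthly figures for each layer. The sketch below separates the cost multiple (managed divided by self-hosted) from the premium (the extra cost as a percentage of self-hosted), since the two are easy to conflate; the helper names are ours:

```python
# (self-hosted USD/month, managed USD/month) per layer, from the lists above.
layers = {
    "Messaging": (12_960, 30_525),
    "Stream Processing": (2_400, 81_180),
    "Analytics Storage": (7_200, 131_328),
}


def cost_multiple(self_hosted: float, managed: float) -> float:
    """How many times the self-hosted cost the managed service is."""
    return managed / self_hosted


def premium_pct(self_hosted: float, managed: float) -> float:
    """Extra managed cost as a percentage of the self-hosted cost."""
    return (managed - self_hosted) / self_hosted * 100


for name, (self_hosted, managed) in layers.items():
    print(
        f"{name:<18} {cost_multiple(self_hosted, managed):5.1f}x  "
        f"premium {premium_pct(self_hosted, managed):8.1f}%"
    )
```

Stream processing shows the largest gap by far, so it is the first layer worth examining when looking for savings.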
Beyond the Numbers: Understanding the Trade-offs
So, why would anyone choose the managed stack given the massive price difference? The answer lies in the trade-offs between cost and operational overhead.
The Case for Self-Hosting
Advantages:
- Dramatic Cost Savings: 90% lower infrastructure costs
- Complete Control: Fine-grained tuning and optimization
- No Vendor Lock-in: Portable across cloud providers
- Technology Choice: Use cutting-edge open-source features
- Performance Optimization: Custom configurations for specific workloads
Challenges:
- High Operational Overhead: Full responsibility for infrastructure management
- Expertise Required: Deep knowledge of distributed systems needed
- Time Investment: Significant setup and maintenance effort
- Scaling Complexity: Manual scaling and capacity planning
- Security Responsibility: Comprehensive security management required
The Case for Managed Services
Advantages:
- Reduced Operational Overhead: AWS handles infrastructure management
- Built-in Scalability: Auto-scaling and high availability
- Faster Time to Market: Rapid deployment without infrastructure setup
- Enterprise Features: Built-in monitoring, security, and compliance
- Support: Professional support from AWS
Challenges:
- Significant Cost Premium: Roughly 10× the cost for high-throughput workloads
- Vendor Lock-in: Tied to AWS ecosystem
- Limited Control: Constrained by service limitations
- Feature Lag: May not have latest open-source features
When to Choose Each Approach
Choose Self-Hosted When:
- Cost is Critical: Operating at scale where managed service costs become prohibitive
- Performance is Key: Need maximum performance through custom tuning
- Team Expertise: Have an experienced platform engineering team
- Long-term Investment: Building for sustained high-volume workloads
- Multi-cloud Strategy: Want to avoid vendor lock-in
Choose Managed Services When:
- Speed to Market: Need to ship quickly without infrastructure complexity
- Small Team: Limited platform engineering resources
- Variable Workloads: Unpredictable or seasonal traffic patterns
- Compliance Focus: Need built-in enterprise compliance features
- Prototype/MVP: Testing concepts before committing to self-hosted infrastructure
Hybrid Approaches & Optimization Strategies
Cost Optimization for Self-Hosted
- Reserved Instances: 40-60% savings with 1-3 year commitments
- Spot Instances: Up to 70% savings for fault-tolerant components
- Right-sizing: Regular capacity planning and instance optimization
- Auto-scaling: Implement demand-based scaling
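To get a feel for what Reserved Instances could mean for this stack, here is a rough sketch that applies the 40-60% range to the EC2 portion of the bill. It assumes the discount applies uniformly to every instance and that the $112 of supporting infrastructure is not discounted; real savings depend on instance family, term, and payment option:

```python
SUPPORTING_MONTHLY_USD = 112
EC2_MONTHLY_USD = 24_592 - SUPPORTING_MONTHLY_USD  # instance costs only


def with_discount(monthly_cost_usd: float, discount: float) -> float:
    """Apply a fractional discount (0.40 means 40% off) to a monthly cost."""
    if not 0 <= discount <= 1:
        raise ValueError("discount must be between 0 and 1")
    return monthly_cost_usd * (1 - discount)


scenarios = [
    ("On-demand", 0.0),
    ("Reserved, low end", 0.40),
    ("Reserved, high end", 0.60),
]

for label, discount in scenarios:
    total = with_discount(EC2_MONTHLY_USD, discount) + SUPPORTING_MONTHLY_USD
    print(f"{label:<20} ${total:>9,.0f}/month")
```

Spot pricing is left out on purpose: it only suits components that tolerate interruption, so it cannot be applied across the whole bill the way a reservation can.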
Hybrid Architecture Considerations
Example hybrid approach:
- Message Ingestion: AWS MSK (managed complexity)
- Stream Processing: Self-hosted Flink (cost optimization)
- Analytics Storage: Self-hosted ClickHouse (performance optimization)
- Monitoring: AWS CloudWatch (convenience)
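As a rough illustration of what such a split might cost, the sketch below mixes the per-layer figures from the cost analysis earlier in this article. It covers only the three core layers and deliberately excludes producers, supporting infrastructure, and CloudWatch charges, none of which we have estimated for a hybrid setup; the `stack_cost` helper is hypothetical:

```python
# Per-layer monthly costs in USD, from the component cost analysis.
SELF_HOSTED = {"messaging": 12_960, "stream_processing": 2_400, "analytics": 7_200}
MANAGED = {"messaging": 30_525, "stream_processing": 81_180, "analytics": 131_328}


def stack_cost(choice: dict[str, str]) -> int:
    """Sum layer costs; `choice` maps each layer to 'self' or 'managed'."""
    total = 0
    for layer, mode in choice.items():
        if mode == "managed":
            total += MANAGED[layer]
        elif mode == "self":
            total += SELF_HOSTED[layer]
        else:
            raise ValueError(f"unknown mode {mode!r} for layer {layer!r}")
    return total


hybrid = stack_cost(
    {"messaging": "managed", "stream_processing": "self", "analytics": "self"}
)
all_self = stack_cost({layer: "self" for layer in SELF_HOSTED})
all_managed = stack_cost({layer: "managed" for layer in MANAGED})

print(f"All self-hosted: ${all_self:,}/month")
print(f"Hybrid:          ${hybrid:,}/month")
print(f"All managed:     ${all_managed:,}/month")
```

Under these assumptions the hybrid lands much closer to the self-hosted total than to the managed one, because messaging is the layer with the smallest managed premium.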
ROI Analysis: Break-Even Points
Total Cost of Ownership (TCO) Considerations
Self-Hosted Additional Costs:
- Platform engineering team: ~$400K-600K/year (2-3 engineers)
- Operations overhead: ~20-30% additional management time
- Training and certifications: ~$20K/year
Managed Services Hidden Benefits:
- Reduced hiring needs
- Faster feature delivery
- Lower operational risk
Break-Even Analysis
For our 1M events/sec workload:
- Cost difference: $218K/month ($2.6M/year)
- Engineering team cost: ~$500K/year
- Net savings with self-hosted: ~$2.1M/year
Even after accounting for the engineering team, the break-even analysis strongly favors self-hosting for high-throughput, sustained workloads.
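The arithmetic behind these figures, plus a rough break-even throughput, can be sketched as follows. The break-even estimate assumes both stacks' costs scale linearly with event rate, which is a simplification of ours (real clusters scale in steps and have fixed minimums), so treat it as an order-of-magnitude guide:

```python
MANAGED_MONTHLY_USD = 243_033
SELF_HOSTED_MONTHLY_USD = 24_592
TEAM_ANNUAL_USD = 500_000      # engineering team cost used in the analysis
BASELINE_EVENTS_PER_SEC = 1_000_000

monthly_difference = MANAGED_MONTHLY_USD - SELF_HOSTED_MONTHLY_USD
annual_infra_savings = monthly_difference * 12
net_annual_savings = annual_infra_savings - TEAM_ANNUAL_USD

# Assumption: infrastructure savings shrink linearly with throughput,
# while the team cost stays fixed.
break_even_fraction = TEAM_ANNUAL_USD / annual_infra_savings
break_even_events_per_sec = break_even_fraction * BASELINE_EVENTS_PER_SEC

print(f"Monthly difference:   ${monthly_difference:,}")
print(f"Annual infra savings: ${annual_infra_savings:,}")
print(f"Net annual savings:   ${net_annual_savings:,}")
print(f"Break-even rate:      ~{break_even_events_per_sec:,.0f} events/sec")
```

Below the break-even rate, the fixed cost of the team outweighs the infrastructure savings under this linear model.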
Conclusion
The choice between self-hosting and managed services is not a one-size-fits-all decision, but the cost implications are dramatic at scale.
Key Takeaways
- For High-Throughput Workloads: Self-hosting can provide 90% cost savings
- Expertise Matters: Success requires skilled platform engineering teams
- Scale is Key: The larger your workload, the more self-hosting makes financial sense
- Time Horizon: Long-term, sustained workloads favor self-hosting
Decision Framework
Choose Self-Hosted If:
- Processing >100K events/sec sustained
- Have platform engineering expertise
- Cost optimization is critical
- Long-term workload (>2 years)
Choose Managed Services If:
- Getting started or prototyping
- Small engineering team
- Variable/unpredictable workloads
- Time to market is critical
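The framework can be written down as a simple rule. This is a hypothetical sketch: it treats the four self-hosted criteria as all being required and falls back to managed services otherwise, which is one strict reading of the lists above rather than a rule we apply mechanically:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical inputs mirroring the decision criteria."""
    events_per_sec: int
    sustained: bool
    has_platform_team: bool
    cost_critical: bool
    horizon_years: float


def recommend(workload: Workload) -> str:
    """Return 'self-hosted' only when every self-hosted criterion holds."""
    self_hosted_criteria = [
        workload.events_per_sec > 100_000 and workload.sustained,
        workload.has_platform_team,
        workload.cost_critical,
        workload.horizon_years > 2,
    ]
    return "self-hosted" if all(self_hosted_criteria) else "managed"


our_platform = Workload(
    events_per_sec=1_000_000,
    sustained=True,
    has_platform_team=True,
    cost_critical=True,
    horizon_years=3,
)
prototype = Workload(
    events_per_sec=5_000,
    sustained=False,
    has_platform_team=False,
    cost_critical=False,
    horizon_years=0.5,
)

print(recommend(our_platform))
print(recommend(prototype))
```

In practice the criteria carry different weights for different teams, so a rule like this is best used as a checklist to start the conversation.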
For a high-throughput workload of 1 million events per second, the cost of managed services can be substantial. It's crucial to weigh the significant cost premium against the benefits of offloading the operational complexity to your cloud provider.
The bottom line: At enterprise scale, self-hosting open-source streaming infrastructure can deliver massive cost savings while providing superior performance and control, provided you have the team to manage it effectively.
Resources
- Complete Implementation: RealtimeDataPlatform/realtime-platform-1million-events
- Infrastructure Code: Terraform configurations and Helm charts
- Cost Analysis Tools: AWS Pricing Calculator, Infracost
- Performance Benchmarks: Pulsar vs Kafka, ClickHouse Performance
Have you made this choice in your organization? What factors influenced your decision? Share your experience in the comments!
Follow me for more deep dives on cloud architecture, cost optimization, and distributed systems!