DEV Community

HyperscaleDesignHub

Real-Time Data Streaming Platform: How We Built a Self-Hosted Platform with 90% Cost Reduction vs AWS Managed Services

When tasked with building a real-time data streaming platform capable of processing 1 million events per second, we faced a critical decision: build a self-hosted solution using open-source technologies, or leverage AWS managed services for convenience.

This article details how we built our self-hosted real-time data streaming platform and achieved a 90% cost reduction compared to equivalent AWS managed services, while maintaining enterprise-grade performance and reliability.

The result: A production-ready platform processing 1M events/sec for $24,592/month instead of $243,033/month with AWS managed services.

You can find the complete implementation in our RealtimeDataPlatform repository.

Here's how we did it and the lessons learned along the way.

🏗️ Architecture Comparison

Self-Hosted Stack Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                    Self-Hosted Stack on AWS EC2                          │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌────────────┐ │
│  │   PRODUCER  │───▶│   PULSAR    │───▶│    FLINK    │───▶│ CLICKHOUSE │ │
│  │             │    │             │    │             │    │            │ │
│  │ IoT Sensors │    │ Open Source │    │ Open Source │    │ Open Source│ │
│  │ AVRO Data   │    │ Message     │    │ Stream      │    │ Analytics  │ │
│  │             │    │ Broker      │    │ Processing  │    │ Database   │ │
│  │ 4x c5.4xl   │    │ 6x i7i.8xl  │    │ 4x c5.4xl   │    │ 6x r6id.4xl│ │
│  │ Full Control│    │ Self-Managed│    │ Self-Managed│    │Self-Managed│ │
│  └─────────────┘    └─────────────┘    └─────────────┘    └────────────┘ │
│                                                                          │
│  Monthly Cost: ~$24,592                                                  │
└──────────────────────────────────────────────────────────────────────────┘

AWS Managed Stack Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                    AWS Managed Services Stack                            │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌────────────┐ │
│  │   PRODUCER  │───▶│     MSK     │───▶│   KINESIS   │───▶│  REDSHIFT  │ │
│  │             │    │             │    │             │    │            │ │
│  │ IoT Sensors │    │ Managed     │    │ Data        │    │ Serverless │ │
│  │ AVRO Data   │    │ Streaming   │    │ Analytics   │    │ Analytics  │ │
│  │             │    │ for Kafka   │    │ for Flink   │    │ Warehouse  │ │
│  │ 4x c5.4xl   │    │ AWS Managed │    │ AWS Managed │    │ AWS Managed│ │
│  │ Hands-off   │    │ Serverless  │    │ Serverless  │    │ Serverless │ │
│  └─────────────┘    └─────────────┘    └─────────────┘    └────────────┘ │
│                                                                          │
│  Monthly Cost: ~$243,033                                                 │
└──────────────────────────────────────────────────────────────────────────┘

💰 The Self-Hosted Stack: Control and Cost-Effectiveness

In this scenario, we deploy our entire stack on Amazon EC2 instances. This gives us maximum control over the configuration and tuning of each component.

The Technology Stack

  • Message Broker: Apache Pulsar
  • Stream Processing: Apache Flink
  • Analytics Database: ClickHouse

Infrastructure Details

Apache Pulsar Cluster:

  • 6× i7i.8xlarge instances (32 vCPU, 256GB RAM, 2×3.75TB NVMe)
  • Co-located brokers and bookies
  • NVMe device separation for journal and ledger storage
  • Cost: ~$12,960/month

Apache Flink Cluster:

  • 4× c5.4xlarge instances (16 vCPU, 32GB RAM)
  • 64-way parallelism matching Pulsar partitions
  • 1:1 slot-to-CPU ratio for optimal performance
  • Cost: ~$2,400/month
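
The slot math behind the 64-way parallelism can be checked in a few lines of Python. This is a minimal sketch using only the figures listed above. The constant and function names are illustrative, and the 64-partition count is inferred from the "matching Pulsar partitions" note rather than taken from an actual topic configuration.

```python
# Figures from the Flink cluster description above.
TASK_MANAGERS = 4           # c5.4xlarge instances
VCPUS_PER_INSTANCE = 16
SLOTS_PER_VCPU = 1          # the 1:1 slot-to-CPU ratio
PULSAR_PARTITIONS = 64      # assumption: equal to the stated 64-way parallelism


def total_task_slots(task_managers: int, vcpus: int, slots_per_vcpu: int) -> int:
    """Total Flink task slots available across the cluster."""
    return task_managers * vcpus * slots_per_vcpu


slots = total_task_slots(TASK_MANAGERS, VCPUS_PER_INSTANCE, SLOTS_PER_VCPU)
partitions_per_slot = PULSAR_PARTITIONS / slots

print(f"task slots: {slots}")
print(f"partitions per slot: {partitions_per_slot:.1f}")
```

With one partition per slot, every source subtask reads exactly one partition, so no slot sits idle and none is oversubscribed.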

ClickHouse Cluster:

  • 6× r6id.4xlarge instances (16 vCPU, 128GB RAM, 950GB NVMe)
  • Distributed analytics with real-time ingestion
  • Optimized for sub-second query performance
  • Cost: ~$7,200/month

Producer Infrastructure:

  • 4× c5.4xlarge instances for load generation
  • AVRO serialization for efficient data transfer
  • Cost: ~$1,920/month

The Cost Breakdown

Running infrastructure analysis tools against the Terraform configuration for this setup produces the following estimate:

| Component | Instance Type | Count | Monthly Cost |
|---|---|---|---|
| Pulsar (Broker+Bookie) | i7i.8xlarge | 6 | $12,960 |
| ClickHouse | r6id.4xlarge | 6 | $7,200 |
| Flink | c5.4xlarge | 4 | $2,400 |
| Producers | c5.4xlarge | 4 | $1,920 |
| Supporting Infrastructure | Various | - | $112 |
| Total | | | $24,592 |

Total Estimated Monthly Cost: $24,592

The primary cost drivers are the large EC2 instances required to handle the 1 million events/sec workload, particularly for the Pulsar brokers and ClickHouse nodes.
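
As a quick sanity check on the breakdown, the sketch below sums the component costs from the table and prints each component's share of the total. The dictionary keys are just labels; the dollar figures are the ones quoted above.

```python
# Monthly cost per component in USD, copied from the table above.
monthly_cost_usd = {
    "Pulsar (broker + bookie)": 12_960,
    "ClickHouse": 7_200,
    "Flink": 2_400,
    "Producers": 1_920,
    "Supporting infrastructure": 112,
}

total = sum(monthly_cost_usd.values())

# Print components from most to least expensive with their share of the bill.
for name, cost in sorted(monthly_cost_usd.items(), key=lambda kv: -kv[1]):
    print(f"{name:<28} ${cost:>7,}  {cost / total:6.1%}")
print(f"{'Total':<28} ${total:>7,}")

# Combined share of the two largest cost drivers.
pulsar_clickhouse_share = (
    monthly_cost_usd["Pulsar (broker + bookie)"] + monthly_cost_usd["ClickHouse"]
) / total
```

Pulsar and ClickHouse together come to roughly 82% of the monthly bill, which is why they are the first place to look for savings.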

☁️ The AWS Managed Stack: Convenience at a Premium

In this approach, we replace our self-hosted components with their AWS-native counterparts. This offloads the operational burden of managing the infrastructure to AWS.

The Technology Stack

  • Managed Kafka: Amazon MSK (Managed Streaming for Apache Kafka)
  • Managed Flink: Amazon Kinesis Data Analytics for Apache Flink (since renamed Amazon Managed Service for Apache Flink)
  • ClickHouse Equivalent: Amazon Redshift Serverless

Service Configuration & Assumptions

Amazon MSK:

  • 1KB average event size
  • 1-day data retention
  • High throughput configuration
  • Multi-AZ deployment for reliability
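
To see what these assumptions imply for sizing, the sketch below converts the event rate and event size into ingest bandwidth and retained volume. It treats 1KB as 1,000 bytes (an assumption, since the text does not say whether KB or KiB is meant) and ignores replication, compression, and protocol overhead.

```python
EVENTS_PER_SEC = 1_000_000
EVENT_SIZE_BYTES = 1_000             # assumption: "1KB" read as 1,000 bytes
RETENTION_SECONDS = 24 * 60 * 60     # 1-day retention

# Raw ingest bandwidth and the volume held for one retention window.
ingest_bytes_per_sec = EVENTS_PER_SEC * EVENT_SIZE_BYTES
retained_bytes = ingest_bytes_per_sec * RETENTION_SECONDS

print(f"ingest: {ingest_bytes_per_sec / 1e9:.1f} GB/s")
print(f"retained (before replication): {retained_bytes / 1e12:.1f} TB")
```

At this rate the cluster has to absorb about 1 GB/s and hold roughly 86 TB per day before replication, which explains why the messaging layer is a significant line item in both stacks.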

Kinesis Data Analytics for Flink:

  • 64 Kinesis Processing Units (KPUs)
  • Continuous processing (24/7)
  • Auto-scaling enabled

Amazon Redshift Serverless:

  • 1-month data retention
  • High-performance analytics workload
  • On-demand scaling

The Cost Breakdown

Estimating the cost for a managed stack at this scale requires several assumptions. Based on AWS pricing and typical configurations:

| Service | Configuration | Monthly Cost |
|---|---|---|
| Amazon MSK | High throughput, 1-day retention | $30,525 |
| Kinesis Data Analytics | 64 KPUs, continuous processing | $81,180 |
| Amazon Redshift Serverless | 1-month retention, analytics | $131,328 |
| Total | | $243,033 |

Total Estimated Monthly Cost: ~$243,033

📊 The Cost Comparison: A Dramatic Difference

Let's put these numbers side-by-side:

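| Approach | Monthly Cost | Cost per Million Events |
|---|---|---|
| Self-Hosted on EC2 | $24,592 | $0.0095 |
| AWS Managed Services | $243,033 | $0.0938 |
| Difference | +888% | +888% |

Per-million figures assume a sustained 1M events/sec over a 30-day month (about 2.59 trillion events).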

The difference is stark: the AWS managed stack costs roughly 10 times as much as the self-hosted approach for this high-throughput scenario.
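
The headline ratios can be recomputed from the two monthly totals. This is a minimal sketch; the 30-day month used to convert events per second into events per month is an assumption, and the function name is illustrative.

```python
SELF_HOSTED_MONTHLY_USD = 24_592
MANAGED_MONTHLY_USD = 243_033
EVENTS_PER_SEC = 1_000_000
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumption: 30-day month

million_events_per_month = EVENTS_PER_SEC * SECONDS_PER_MONTH / 1_000_000


def cost_per_million_events(monthly_cost_usd: float) -> float:
    """USD spent per one million events processed."""
    return monthly_cost_usd / million_events_per_month


# How many times more the managed stack costs.
ratio = MANAGED_MONTHLY_USD / SELF_HOSTED_MONTHLY_USD
# Increase going from self-hosted to managed, relative to self-hosted.
increase_pct = (MANAGED_MONTHLY_USD - SELF_HOSTED_MONTHLY_USD) / SELF_HOSTED_MONTHLY_USD * 100
# Savings going from managed to self-hosted, relative to managed.
savings_pct = (MANAGED_MONTHLY_USD - SELF_HOSTED_MONTHLY_USD) / MANAGED_MONTHLY_USD * 100

print(f"self-hosted: ${cost_per_million_events(SELF_HOSTED_MONTHLY_USD):.4f} per million events")
print(f"managed:     ${cost_per_million_events(MANAGED_MONTHLY_USD):.4f} per million events")
print(f"ratio: {ratio:.2f}x, increase: {increase_pct:.0f}%, savings: {savings_pct:.0f}%")
```

Note that "about 10x the cost" and "about 90% savings" describe the same gap from two different baselines: the increase is measured against the self-hosted bill, the savings against the managed bill.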

Cost Analysis by Component

Messaging Layer:

  • Self-hosted Pulsar: $12,960/month
  • AWS MSK: $30,525/month
  • Managed cost: about 2.4x self-hosted (a 136% premium)

Stream Processing:

  • Self-hosted Flink: $2,400/month
  • Kinesis Data Analytics: $81,180/month
  • Managed cost: about 33.8x self-hosted (a premium of roughly 3,280%)

Analytics Storage:

  • Self-hosted ClickHouse: $7,200/month
  • Redshift Serverless: $131,328/month
  • Managed cost: about 18.2x self-hosted (a 1,724% premium)
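
The per-layer gap can be expressed either as a cost multiple (managed divided by self-hosted) or as a premium (the extra cost relative to self-hosted). The sketch below computes both from the figures above; the helper names are illustrative.

```python
# (self-hosted USD/month, managed USD/month) per layer, from the lists above.
layers = {
    "Messaging": (12_960, 30_525),
    "Stream processing": (2_400, 81_180),
    "Analytics storage": (7_200, 131_328),
}


def cost_multiple(self_hosted: float, managed: float) -> float:
    """Managed cost expressed as a multiple of the self-hosted cost."""
    return managed / self_hosted


def premium_pct(self_hosted: float, managed: float) -> float:
    """Extra cost of the managed option, as a percentage of self-hosted."""
    return (managed - self_hosted) / self_hosted * 100


for name, (self_hosted, managed) in layers.items():
    print(
        f"{name:<18} {cost_multiple(self_hosted, managed):5.1f}x  "
        f"premium {premium_pct(self_hosted, managed):,.0f}%"
    )
```

A multiple of 2.4x corresponds to a premium of about 136%, not 235%, so it helps to state which of the two measures a figure refers to.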

🤔 Beyond the Numbers: Understanding the Trade-offs

So, why would anyone choose the managed stack given the massive price difference? The answer lies in the trade-offs between cost and operational overhead.

The Case for Self-Hosting

Advantages:

✅ Dramatic Cost Savings: About 90% lower infrastructure costs
✅ Complete Control: Fine-grained tuning and optimization
✅ No Vendor Lock-in: Portable across cloud providers
✅ Technology Choice: Use cutting-edge open-source features
✅ Performance Optimization: Custom configurations for specific workloads

Challenges:

❌ High Operational Overhead: Full responsibility for infrastructure management
❌ Expertise Required: Deep knowledge of distributed systems needed
❌ Time Investment: Significant setup and maintenance effort
❌ Scaling Complexity: Manual scaling and capacity planning
❌ Security Responsibility: Comprehensive security management required

The Case for Managed Services

Advantages:

✅ Reduced Operational Overhead: AWS handles infrastructure management
✅ Built-in Scalability: Auto-scaling and high availability
✅ Faster Time to Market: Rapid deployment without infrastructure setup
✅ Enterprise Features: Built-in monitoring, security, and compliance
✅ Support: Professional support from AWS

Challenges:

❌ Significant Cost Premium: 10x higher costs for high-throughput workloads
❌ Vendor Lock-in: Tied to AWS ecosystem
❌ Limited Control: Constrained by service limitations
❌ Feature Lag: May not have latest open-source features

🎯 When to Choose Each Approach

Choose Self-Hosted When:

  • Cost is Critical: Operating at scale where managed service costs become prohibitive
  • Performance is Key: Need maximum performance through custom tuning
  • Team Expertise: Have experienced platform engineering team
  • Long-term Investment: Building for sustained high-volume workloads
  • Multi-cloud Strategy: Want to avoid vendor lock-in

Choose Managed Services When:

  • Speed to Market: Need to ship quickly without infrastructure complexity
  • Small Team: Limited platform engineering resources
  • Variable Workloads: Unpredictable or seasonal traffic patterns
  • Compliance Focus: Need built-in enterprise compliance features
  • Prototype/MVP: Testing concepts before committing to self-hosted infrastructure

💡 Hybrid Approaches & Optimization Strategies

Cost Optimization for Self-Hosted

  1. Reserved Instances: 40-60% savings with 1-3 year commitments
  2. Spot Instances: Up to 70% savings for fault-tolerant components
  3. Right-sizing: Regular capacity planning and instance optimization
  4. Auto-scaling: Implement demand-based scaling
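
To put the reserved-instance range in context, the sketch below applies a discount to the self-hosted bill. It assumes, optimistically, that the whole $24,592 is eligible for the discount; the `eligible_share` parameter is a hypothetical knob for modelling a smaller eligible portion.

```python
ON_DEMAND_MONTHLY_USD = 24_592


def discounted_cost(monthly_cost: float, discount: float, eligible_share: float = 1.0) -> float:
    """Monthly cost after applying `discount` to the eligible share of spend."""
    if not 0 <= discount <= 1 or not 0 <= eligible_share <= 1:
        raise ValueError("discount and eligible_share must be between 0 and 1")
    eligible = monthly_cost * eligible_share
    return monthly_cost - eligible * discount


# Reserved Instances: 40-60% savings, per the list above.
reserved_low_discount = discounted_cost(ON_DEMAND_MONTHLY_USD, 0.40)
reserved_high_discount = discounted_cost(ON_DEMAND_MONTHLY_USD, 0.60)

print(f"40% discount: ${reserved_low_discount:,.2f}/month")
print(f"60% discount: ${reserved_high_discount:,.2f}/month")
```

Calling `discounted_cost(ON_DEMAND_MONTHLY_USD, 0.70, eligible_share=0.2)` would model Spot pricing on a fifth of the spend; which components actually tolerate interruption depends on the deployment.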

Hybrid Architecture Considerations

# Example hybrid approach
Message Ingestion: AWS MSK (managed complexity)
Stream Processing: Self-hosted Flink (cost optimization)
Analytics Storage: Self-hosted ClickHouse (performance optimization)
Monitoring: AWS CloudWatch (convenience)
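
A rough cost for this hybrid layout can be assembled from the per-component figures quoted earlier. This is a sketch under simplifying assumptions: it reuses the article's MSK, Flink, and ClickHouse numbers unchanged and leaves out CloudWatch, producers, and supporting infrastructure, for which no hybrid-specific figures are given.

```python
# Monthly USD per layer, reusing figures quoted earlier in the article.
hybrid_monthly_usd = {
    "Message ingestion (Amazon MSK)": 30_525,
    "Stream processing (self-hosted Flink)": 2_400,
    "Analytics storage (self-hosted ClickHouse)": 7_200,
}

# Three-layer totals for the two pure approaches, for comparison.
self_hosted_three_layers = 12_960 + 2_400 + 7_200
managed_three_layers = 30_525 + 81_180 + 131_328

hybrid_total = sum(hybrid_monthly_usd.values())

print(f"self-hosted: ${self_hosted_three_layers:,}/month")
print(f"hybrid:      ${hybrid_total:,}/month")
print(f"managed:     ${managed_three_layers:,}/month")
```

Under these assumptions the hybrid lands much closer to the self-hosted bill than to the fully managed one, because the two most expensive managed layers are the ones replaced.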

📈 ROI Analysis: Break-Even Points

Total Cost of Ownership (TCO) Considerations

Self-Hosted Additional Costs:

  • Platform engineering team: ~$400K-600K/year (2-3 engineers)
  • Operations overhead: ~20-30% additional management time
  • Training and certifications: ~$20K/year

Managed Services Hidden Benefits:

  • Reduced hiring needs
  • Faster feature delivery
  • Lower operational risk

Break-Even Analysis

For our 1M events/sec workload:

  • Cost difference: $218K/month ($2.6M/year)
  • Engineering team cost: ~$500K/year
  • Net savings with self-hosted: ~$2.1M/year

The break-even point strongly favors self-hosting for high-throughput, sustained workloads.
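
The break-even arithmetic above can be reproduced as follows. The last step, which estimates the throughput at which the infrastructure savings just cover the team cost, assumes that both bills scale linearly with event rate. That is a strong simplification and is not claimed by the figures above.

```python
SELF_HOSTED_MONTHLY_USD = 24_592
MANAGED_MONTHLY_USD = 243_033
TEAM_COST_PER_YEAR_USD = 500_000
EVENTS_PER_SEC = 1_000_000

# Yearly infrastructure savings and what remains after paying the team.
annual_infra_savings = (MANAGED_MONTHLY_USD - SELF_HOSTED_MONTHLY_USD) * 12
net_annual_savings = annual_infra_savings - TEAM_COST_PER_YEAR_USD

# Assumption: costs scale linearly with throughput, so savings do too.
savings_per_event_per_sec = annual_infra_savings / EVENTS_PER_SEC
break_even_events_per_sec = TEAM_COST_PER_YEAR_USD / savings_per_event_per_sec

print(f"annual infrastructure savings: ${annual_infra_savings:,}")
print(f"net savings after team cost:   ${net_annual_savings:,}")
print(f"break-even throughput (linear model): {break_even_events_per_sec:,.0f} events/sec")
```

Under the linear model, self-hosting stops paying for a $500K/year team somewhere below 200K events/sec, so the advantage depends heavily on sustained volume.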

🎓 Conclusion

The choice between self-hosting and managed services is not a one-size-fits-all decision, but the cost implications are dramatic at scale.

Key Takeaways

  1. For High-Throughput Workloads: Self-hosting can provide 90% cost savings
  2. Expertise Matters: Success requires skilled platform engineering teams
  3. Scale is Key: The larger your workload, the more self-hosting makes financial sense
  4. Time Horizon: Long-term, sustained workloads favor self-hosting

Decision Framework

Choose Self-Hosted If:

  • Processing >100K events/sec sustained
  • Have platform engineering expertise
  • Cost optimization is critical
  • Long-term workload (>2 years)

Choose Managed Services If:

  • Getting started or prototyping
  • Small engineering team
  • Variable/unpredictable workloads
  • Time to market is critical

For a high-throughput workload of 1 million events per second, the cost of managed services can be substantial. It's crucial to weigh the significant cost premium against the benefits of offloading the operational complexity to your cloud provider.

The bottom line: At enterprise scale, self-hosting open-source streaming infrastructure can deliver massive cost savings while providing superior performance and control, provided you have the team to manage it effectively.

Have you made this choice in your organization? What factors influenced your decision? Share your experience in the comments! 👇

Follow me for more deep dives on cloud architecture, cost optimization, and distributed systems!

Tags: #aws #cost #architecture #streaming #devops #realtime #pulsar #flink #clickhouse #msk #kinesis
