Building a real-time streaming platform that processes 1 million events per second taught me lessons that no tutorial or documentation could. After months of optimization, debugging, and scaling our self-hosted platform on AWS, here are the hard-won insights that saved us 90% in costs and countless hours of troubleshooting.
You can find our complete implementation in the RealtimeDataPlatform repository.
🎯 The Reality Check: Scale Changes Everything
What I thought: "If it works at 10K events/sec, it'll work at 1M events/sec."
What I learned: Scale isn't linear. At 1M events/sec, everything breaks differently.
The Surprising Bottlenecks
Expected Bottlenecks → Actual Bottlenecks
CPU/Memory → Network I/O
Application Logic → Storage I/O patterns
Compute Resources → Configuration limits
Code Efficiency → Infrastructure design
Example: Our Flink job processed 890K events/sec from Pulsar backlog but only 600K live. The bottleneck? Pulsar's storage configuration, not Flink's processing power.
💾 Storage: The Hidden Performance Killer
Challenge #1: The NVMe Device Separation Discovery
Initial setup: Single NVMe device for both journal and ledger storage in BookKeeper.
Problem: Write latency spiked to 15ms, and throughput capped out at 400K events/sec.
Solution: Separate NVMe devices for different I/O patterns.
# Game-changing configuration
/dev/nvme1n1 → Journal (WAL) - Sequential writes
/dev/nvme2n1 → Ledgers (Data) - Random reads/writes
Result: Latency dropped to 2.1ms, throughput jumped to 1M+ events/sec.
Lesson: I/O pattern separation matters more than raw storage speed.
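In BookKeeper terms, the split comes down to two settings in the bookie configuration. A minimal sketch, assuming the mount points are the two devices above (the paths themselves are illustrative):
# bookkeeper.conf - give each I/O pattern its own device
journalDirectories=/mnt/journal    # on /dev/nvme1n1: sequential WAL writes
ledgerDirectories=/mnt/ledgers     # on /dev/nvme2n1: random data reads/writes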
Challenge #2: The journalSyncData Trade-off
The dilemma: Enable journalSyncData for safety vs. disable for performance.
# The 10x performance decision
journalSyncData: "false" # Risk: data loss on power failure
# Gain: 10x latency improvement
Lesson: Know your data's value. For IoT telemetry, we chose speed over perfect durability. For financial data, we wouldn't.
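If you make the same call, the journal's group-commit settings are worth reviewing alongside it. A hedged sketch (values are illustrative; defaults vary by BookKeeper version):
# bookkeeper.conf - trading durability for latency
journalSyncData=false          # ack before fsync; a power failure can lose the last few ms of writes
journalMaxGroupWaitMSec=2      # batch journal writes for up to 2ms (group commit)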
🔧 Parallelism: More Isn't Always Better
Challenge #3: The Slot-to-CPU Ratio Mystery
Initial thinking: "More parallelism = better performance"
Reality check: Our pipeline had different resource needs:
Source (I/O 🟢) → keyBy → Window (CPU 🔴) → Aggregate (CPU 🔴) → Sink (I/O 🟢)
Wrong approach: 2:1 slot-to-CPU ratio (32 slots on 16 vCPUs)
- Result: CPU starvation during aggregation
Right approach: 1:1 slot-to-CPU ratio (16 slots on 16 vCPUs)
- Result: Dedicated CPU per CPU-intensive task
Lesson: Match your slot configuration to your workload's compute pattern, not theoretical maximums.
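The change itself is one line of TaskManager configuration. A sketch, assuming 16-vCPU nodes like ours:
# flink-conf.yaml - 1:1 slots to vCPUs
taskmanager.numberOfTaskSlots: 16   # one slot per vCPU on a 16-vCPU TaskManager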
Challenge #4: The Parallelism-Partition Matching Rule
Discovery: Mismatched parallelism and partitions kills performance.
❌ Wrong:
Pulsar partitions: 8
Flink parallelism: 64
Result: 56 idle Flink tasks

✅ Right:
Pulsar partitions: 64
Flink parallelism: 64
Result: Perfect work distribution
Lesson: Match your parallelism to your source partition count (or an even divisor of it) for optimal resource utilization.
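Keeping the two sides in sync is straightforward. A sketch with a placeholder topic name and job jar:
# Create the topic with a partition count equal to job parallelism
pulsar-admin topics create-partitioned-topic \
  persistent://public/default/iot-events -p 64

# Submit the Flink job with matching parallelism
flink run -p 64 iot-pipeline.jar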
🕵️ Debugging: The Backlog Test Technique
Challenge #5: Finding the Real Bottleneck
The mystery: System processing 600K events/sec but target was 1M.
The experiment: Stop all producers, let Flink catch up from backlog.
# The revealing test
kubectl scale deployment iot-producer --replicas=0
# Result: Flink consumed 890K events/sec from backlog!
# Conclusion: Pulsar was the bottleneck, not Flink
Lesson: The backlog test reveals your true system capacity and identifies which component is actually limiting throughput.
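While producers are paused, you can watch the backlog drain to read off your true capacity. A sketch using pulsar-admin (the topic name is a placeholder):
# Sample the backlog every 10s; the drain rate is your real consumption capacity
watch -n 10 'pulsar-admin topics stats persistent://public/default/iot-events | grep msgBacklog'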
💰 Cost Optimization: The Managed Services Reality
Challenge #6: The 10x Cost Shock
Our self-hosted platform: $24,592/month
Equivalent AWS managed services:
- MSK: $30,525/month
- Kinesis Data Analytics: $81,180/month
- Redshift Serverless: $131,328/month
- Total: $243,033/month
The math: 243,033 / 24,592 ≈ 9.9, so managed services would have cost nearly 10x our self-hosted bill at this scale.
Lesson: At high throughput, managed services pricing becomes prohibitive. The break-even point strongly favors self-hosting for sustained, high-volume workloads.
🏗️ Architecture: Instance Selection Matters
Challenge #7: Right-Sizing for Performance vs. Cost
Wrong approach: Many small instances
- 16× c5.large instances
- Complex networking, management overhead
Right approach: Fewer, larger instances
- 4× c5.4xlarge instances
- Better price/performance, simpler operations
Lesson: Bigger instances often provide better performance-per-dollar and reduce operational complexity.
Challenge #8: The i7i.8xlarge Sweet Spot
Why i7i.8xlarge became our standard:
- 32 vCPUs, 256GB RAM
- 2× 3.75TB NVMe devices (perfect for separation)
- Latest generation CPU performance
- $2,160/month (better than alternatives)
Lesson: The latest generation instances often provide the best performance-per-dollar despite higher unit costs.
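Specs like these are easy to verify before standardizing on an instance type. A quick AWS CLI sketch:
# Confirm vCPUs, memory, and instance-store disks straight from the EC2 API
aws ec2 describe-instance-types --instance-types i7i.8xlarge \
  --query 'InstanceTypes[0].[VCpuInfo.DefaultVCpus,MemoryInfo.SizeInMiB,InstanceStorageInfo.Disks]'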
📊 Monitoring: What Actually Matters
The Metrics That Saved Us
Instead of generic CPU/memory metrics, focus on:
# Backpressure - Your canary in the coal mine
flink_taskmanager_job_task_backPressuredTimeMsPerSecond
# True throughput - Not just input rate
rate(flink_taskmanager_job_task_numRecordsOut[1m])
# Storage performance - Often the real bottleneck
rate(bookie_journal_JOURNAL_ADD_ENTRY_count[1m])
Lesson: Domain-specific metrics matter more than generic infrastructure metrics for identifying real problems.
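To make the backpressure metric actionable, here's a sketch of a Prometheus alerting rule you could hang off it (the threshold and durations are illustrative; the metric reports backpressured milliseconds per second, so 500 means 50% of the time):
# prometheus-rules.yaml - alert when a task spends >50% of its time backpressured
groups:
  - name: flink-backpressure
    rules:
      - alert: FlinkTaskBackpressured
        expr: flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Flink task backpressured >50% for 5m; check the downstream operator first"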
📋 The Five Universal Truths of Streaming at Scale
1. Storage I/O Patterns Trump Raw Performance
Separate your sequential writes from random reads. Always.
2. Configuration Limits Hit Before Resource Limits
You'll hit default timeouts, queue sizes, and connection limits before CPU/memory limits.
3. The Bottleneck Moves
Optimize one component, and the bottleneck shifts to the next weakest link.
4. Test with Realistic Data
Synthetic loads behave differently than real-world data patterns.
5. Cost Scales Non-Linearly
At high throughput, managed services become exponentially more expensive than self-hosting.
π‘ What I'd Do Differently
Start with these decisions:
✅ Separate storage devices from day one
✅ Match parallelism to partitions immediately
✅ Use 1:1 slot-to-CPU for CPU-bound workloads
✅ Implement backlog testing early
✅ Choose latest-generation instances
✅ Plan for self-hosting at scale
🏁 The Bottom Line
Building a real-time streaming platform at scale taught me that the fundamentals matter more than the fancy features. Storage I/O patterns, proper parallelism matching, and understanding your actual bottlenecks will get you further than any advanced configuration.
Our results:
- 1,040,000 events/sec sustained throughput
- $24,592/month infrastructure cost (90% savings vs managed)
- <2ms p99 latency end-to-end
- 99.95% uptime in production
The biggest lesson? Scale reveals the truth about your architecture. What works at small scale often breaks in unexpected ways at large scale. Plan for it, test for it, and measure everything that matters.
📚 Resources
- Complete Implementation: RealtimeDataPlatform
- Pulsar Performance Guide: Our Pulsar optimization journey
- Flink Tuning Details: Our Flink scaling experience
What's your biggest streaming challenge? Have you hit similar bottlenecks at scale? Share your war stories in the comments! 👇
Follow me for more real-world lessons from building distributed systems at scale.
Tags: #streaming #realtime #scale #performance #aws #pulsar #flink #architecture #lessons