Building a real-time streaming platform that processes 1 million events per second taught me lessons that no tutorial or documentation could. After months of optimization, debugging, and scaling our self-hosted platform on AWS, here are the hard-won insights that saved us 90% in costs and countless hours of troubleshooting.
You can find our complete implementation in the RealtimeDataPlatform repository.
🎯 The Reality Check: Scale Changes Everything
What I thought: "If it works at 10K events/sec, it'll work at 1M events/sec."
What I learned: Scale isn't linear. At 1M events/sec, everything breaks differently.
The Surprising Bottlenecks
Expected Bottlenecks → Actual Bottlenecks
CPU/Memory → Network I/O
Application Logic → Storage I/O patterns
Compute Resources → Configuration limits
Code Efficiency → Infrastructure design
Example: Our Flink job processed 890K events/sec from Pulsar backlog but only 600K live. The bottleneck? Pulsar's storage configuration, not Flink's processing power.
💾 Storage: The Hidden Performance Killer
Challenge #1: The NVMe Device Separation Discovery
Initial setup: Single NVMe device for both journal and ledger storage in BookKeeper.
Problem: Write latency spiked to 15ms, and throughput capped out at 400K events/sec.
Solution: Separate NVMe devices for different I/O patterns.
# Game-changing configuration
/dev/nvme1n1 → Journal (WAL) - Sequential writes
/dev/nvme2n1 → Ledgers (Data) - Random reads/writes
Result: Latency dropped to 2.1ms, throughput jumped to 1M+ events/sec.
Lesson: I/O pattern separation matters more than raw storage speed.
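In BookKeeper terms, the split comes down to two settings in the bookie configuration. A minimal sketch, assuming the mount points are the two devices above (the paths themselves are illustrative):
# bookkeeper.conf - give each I/O pattern its own device
journalDirectories=/mnt/journal    # on /dev/nvme1n1: sequential WAL writes
ledgerDirectories=/mnt/ledgers     # on /dev/nvme2n1: random data reads/writes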
Challenge #2: The journalSyncData Trade-off
The dilemma: Enable journalSyncData for safety vs. disable for performance.
# The 10x performance decision
journalSyncData: "false" # Risk: data loss on power failure
# Gain: 10x latency improvement
Lesson: Know your data's value. For IoT telemetry, we chose speed over perfect durability. For financial data, we wouldn't.
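If you make the same call, the journal's group-commit settings are worth reviewing alongside it. A hedged sketch (values are illustrative; defaults vary by BookKeeper version):
# bookkeeper.conf - trading durability for latency
journalSyncData=false          # ack before fsync; a power failure can lose the last few ms of writes
journalMaxGroupWaitMSec=2      # batch journal writes for up to 2ms (group commit)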
🔧 Parallelism: More Isn't Always Better
Challenge #3: The Slot-to-CPU Ratio Mystery
Initial thinking: "More parallelism = better performance"
Reality check: Our pipeline had different resource needs:
Source (I/O 🟢) → keyBy → Window (CPU 🔴) → Aggregate (CPU 🔴) → Sink (I/O 🟢)
Wrong approach: 2:1 slot-to-CPU ratio (32 slots on 16 vCPUs)
- Result: CPU starvation during aggregation
Right approach: 1:1 slot-to-CPU ratio (16 slots on 16 vCPUs)
- Result: Dedicated CPU per CPU-intensive task
Lesson: Match your slot configuration to your workload's compute pattern, not theoretical maximums.
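The change itself is one line of TaskManager configuration. A sketch, assuming 16-vCPU nodes like ours:
# flink-conf.yaml - 1:1 slots to vCPUs
taskmanager.numberOfTaskSlots: 16   # one slot per vCPU on a 16-vCPU TaskManager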
Challenge #4: The Parallelism-Partition Matching Rule
Discovery: Mismatched parallelism and partitions kills performance.
❌ Wrong:
Pulsar partitions: 8
Flink parallelism: 64
Result: 56 idle Flink tasks

✅ Right:
Pulsar partitions: 64
Flink parallelism: 64
Result: Perfect work distribution
Lesson: Match your parallelism to your source partition count (or an even divisor of it) for optimal resource utilization.
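Keeping the two sides in sync is straightforward. A sketch with a placeholder topic name and job jar:
# Create the topic with a partition count equal to job parallelism
pulsar-admin topics create-partitioned-topic \
  persistent://public/default/iot-events -p 64

# Submit the Flink job with matching parallelism
flink run -p 64 iot-pipeline.jar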
🕵️ Debugging: The Backlog Test Technique
Challenge #5: Finding the Real Bottleneck
The mystery: System processing 600K events/sec but target was 1M.
The experiment: Stop all producers, let Flink catch up from backlog.
# The revealing test
kubectl scale deployment iot-producer --replicas=0
# Result: Flink consumed 890K events/sec from backlog!
# Conclusion: Pulsar was the bottleneck, not Flink
Lesson: The backlog test reveals your true system capacity and identifies which component is actually limiting throughput.
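While producers are paused, you can watch the backlog drain to read off your true capacity. A sketch using pulsar-admin (the topic name is a placeholder):
# Sample the backlog every 10s; the drain rate is your real consumption capacity
watch -n 10 'pulsar-admin topics stats persistent://public/default/iot-events | grep msgBacklog'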
💰 Cost Optimization: The Managed Services Reality
Challenge #6: The 10x Cost Shock
Our self-hosted platform: $24,592/month
Equivalent AWS managed services:
- MSK: $30,525/month
- Kinesis Data Analytics: $81,180/month
- Redshift Serverless: $131,328/month
- Total: $243,033/month
The math: 243,033 / 24,592 ≈ 9.9, so managed services would have cost nearly 10x our self-hosted bill at this scale.
Lesson: At high throughput, managed services pricing becomes prohibitive. The break-even point strongly favors self-hosting for sustained, high-volume workloads.
🏗️ Architecture: Instance Selection Matters
Challenge #7: Right-Sizing for Performance vs. Cost
Wrong approach: Many small instances
- 16× c5.large instances
- Complex networking, management overhead
Right approach: Fewer, larger instances
- 4× c5.4xlarge instances
- Better price/performance, simpler operations
Lesson: Bigger instances often provide better performance-per-dollar and reduce operational complexity.
Challenge #8: The i7i.8xlarge Sweet Spot
Why i7i.8xlarge became our standard:
- 32 vCPUs, 256GB RAM
- 2× 3.75TB NVMe devices (perfect for separation)
- Latest generation CPU performance
- $2,160/month (better than alternatives)
Lesson: The latest generation instances often provide the best performance-per-dollar despite higher unit costs.
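Specs like these are easy to verify before standardizing on an instance type. A quick AWS CLI sketch:
# Confirm vCPUs, memory, and instance-store disks straight from the EC2 API
aws ec2 describe-instance-types --instance-types i7i.8xlarge \
  --query 'InstanceTypes[0].[VCpuInfo.DefaultVCpus,MemoryInfo.SizeInMiB,InstanceStorageInfo.Disks]'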
📊 Monitoring: What Actually Matters
The Metrics That Saved Us
Instead of generic CPU/memory metrics, focus on:
# Backpressure - Your canary in the coal mine
flink_taskmanager_job_task_backPressuredTimeMsPerSecond
# True throughput - Not just input rate
rate(flink_taskmanager_job_task_numRecordsOut[1m])
# Storage performance - Often the real bottleneck
rate(bookie_journal_JOURNAL_ADD_ENTRY_count[1m])
Lesson: Domain-specific metrics matter more than generic infrastructure metrics for identifying real problems.
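To make the backpressure metric actionable, here's a sketch of a Prometheus alerting rule you could hang off it (the threshold and durations are illustrative; the metric reports backpressured milliseconds per second, so 500 means 50% of the time):
# prometheus-rules.yaml - alert when a task spends >50% of its time backpressured
groups:
  - name: flink-backpressure
    rules:
      - alert: FlinkTaskBackpressured
        expr: flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Flink task backpressured >50% for 5m; check the downstream operator first"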
📋 The Five Universal Truths of Streaming at Scale
1. Storage I/O Patterns Trump Raw Performance
Separate your sequential writes from random reads. Always.
2. Configuration Limits Hit Before Resource Limits
You'll hit default timeouts, queue sizes, and connection limits before CPU/memory limits.
3. The Bottleneck Moves
Optimize one component, and the bottleneck shifts to the next weakest link.
4. Test with Realistic Data
Synthetic loads behave differently than real-world data patterns.
5. Cost Scales Non-Linearly
At high throughput, managed services become exponentially more expensive than self-hosting.
π‘ What I'd Do Differently
Start with these decisions:
✅ Separate storage devices from day one
✅ Match parallelism to partitions immediately
✅ Use 1:1 slot-to-CPU for CPU-bound workloads
✅ Implement backlog testing early
✅ Choose latest-generation instances
✅ Plan for self-hosting at scale
🏁 The Bottom Line
Building a real-time streaming platform at scale taught me that the fundamentals matter more than the fancy features. Storage I/O patterns, proper parallelism matching, and understanding your actual bottlenecks will get you further than any advanced configuration.
Our results:
- 1,040,000 events/sec sustained throughput
- $24,592/month infrastructure cost (90% savings vs managed)
- <2ms p99 latency end-to-end
- 99.95% uptime in production
The biggest lesson? Scale reveals the truth about your architecture. What works at small scale often breaks in unexpected ways at large scale. Plan for it, test for it, and measure everything that matters.
📚 Resources
- Complete Implementation: RealtimeDataPlatform
- Pulsar Performance Guide: Our Pulsar optimization journey
- Flink Tuning Details: Our Flink scaling experience
What's your biggest streaming challenge? Have you hit similar bottlenecks at scale? Share your war stories in the comments! 👇
Follow me for more real-world lessons from building distributed systems at scale.
Tags: #streaming #realtime #scale #performance #aws #pulsar #flink #architecture #lessons