Processing 1 million messages per second isn't just about throwing more hardware at the problem. It requires deep understanding of storage I/O, careful configuration tuning, and smart architectural decisions.
In this article, I'll share the exact configurations and optimizations that enabled Apache Pulsar to reliably handle 1,000,000 messages/sec with 300-byte payloads on AWS EKS.
🎯 The Challenge
Requirements:
- Throughput: 1 million messages/second sustained
- Message Size: ~300 bytes (AVRO-serialized sensor data)
- Total Bandwidth: ~2.4 Gbps (300 MB/sec)
- Latency: < 10ms p99
- Durability: No message loss, replicated storage
- Cost: Optimized for AWS infrastructure
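A quick note on that payload: the benchmark uses AVRO-serialized sensor readings of roughly 300 bytes. The article does not show the schema, so the record below is only a hypothetical sketch of what such a payload might look like in Java; field names and types are illustrative, and the real schema would carry more fields to reach the ~300-byte mark once encoded.

```java
import org.apache.pulsar.client.api.Schema;

// Hypothetical sensor record - illustrative only, not the benchmark's actual schema.
public class SensorReading {
    public String deviceId;        // e.g. "plant-7/line-3/sensor-0042"
    public long timestampMillis;   // event time
    public double temperature;
    public double humidity;
    public double vibration;
    public String firmwareVersion;
    public String status;          // "OK", "WARN", ...

    public static void main(String[] args) {
        // Pulsar derives the AVRO schema from the POJO at runtime.
        Schema<SensorReading> schema = Schema.AVRO(SensorReading.class);
        System.out.println(new String(schema.getSchemaInfo().getSchema()));
    }
}
```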
🏗️ Architecture Overview
┌──────────────────────────────────────────────────────────────────────────┐
│                         Apache Pulsar on AWS EKS                         │
│                     benchmark-high-infra (k8s 1.31)                      │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   │
│  │  PRODUCERS  │──▶│   PROXIES   │──▶│   Pulsar    │   │  ZooKeeper  │   │
│  │             │   │             │   │   Brokers   │   │             │   │
│  │  4 nodes    │   │  2 nodes    │   │  6 nodes    │   │  3 nodes    │   │
│  │ c5.4xlarge  │   │ c5.2xlarge  │   │ i7i.8xlarge │   │  t3.medium  │   │
│  │             │   │             │   │             │   │             │   │
│  │  Java/AVRO  │   │    Load     │   │   Message   │   │  Metadata   │   │
│  │250K evt/sec │   │   Balance   │   │   Routing   │   │ Management  │   │
│  │  per node   │   │             │   │             │   │             │   │
│  └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘   │
│                                             │                            │
│                                             ▼                            │
│                                      ┌─────────────┐                     │
│                                      │ BookKeeper  │                     │
│                                      │  Bookies    │                     │
│                                      │             │                     │
│                                      │  6 nodes    │                     │
│                                      │ i7i.8xlarge │                     │
│                                      │             │                     │
│                                      │NVMe Storage │                     │
│                                      │ Separation: │                     │
│                                      │  Device 0:  │                     │
│                                      │ Journal WAL │                     │
│                                      │  Device 1:  │                     │
│                                      │ Ledger Data │                     │
│                                      └─────────────┘                     │
└──────────────────────────────────────────────────────────────────────────┘
💾 The Storage Strategy - Why NVMe Matters
EC2 Instance Selection: i7i.8xlarge
Why i7i.8xlarge?
Instance: i7i.8xlarge
- 32 vCPUs
- 256 GiB RAM
- 2x 3,750 GB NVMe SSDs (7.5TB total)
- Network: 25 Gbps
- Cost: ~$2,160/month per instance
Key Benefits:
- Ultra-low latency: NVMe SSDs provide <100µs latency vs EBS's ~1-3ms
- High IOPS: 3.75M IOPS vs EBS gp3's 16K IOPS ceiling (64K on io2)
- Sustained throughput: 30 GB/s vs EBS's 1 GB/s
- No network overhead: Local storage doesn't compete with network bandwidth
NVMe Device Separation - The Game Changer
Critical Design Decision: Use 2 separate NVMe devices per node
# Device configuration on each i7i.8xlarge node
/dev/nvme1n1 → Journal (Write-Ahead Log) - 3,750 GB
/dev/nvme2n1 → Ledgers (Message Storage) - 3,750 GB
Why separate devices?
Journal (Write-Ahead Log):
- Sequential writes only
- Low latency critical (blocks producer ACKs)
- Large capacity available (3.75TB per device)
- High write frequency
Ledgers (Message Storage):
- Random reads/writes
- Large capacity needed (3.75TB per device)
- Background compaction operations
- Read-heavy for consumers
Performance Impact (see the probe sketched after this list):
- Without separation: Journal writes compete with ledger I/O → increased latency
- With separation: Independent I/O queues → consistent <1ms write latency
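One way to see this effect before deploying BookKeeper is to measure raw fsync latency on each mount. The probe below is a minimal sketch, not part of the article's tooling; the /pulsar/journal and /pulsar/ledgers paths are illustrative mount points. Run it against the journal mount while something else hammers the ledger device, then repeat with both workloads on a single device, and the difference in p99 shows up immediately.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Minimal fsync latency probe. Pass the mount to test as the first argument,
// e.g. /pulsar/journal/probe.dat or /pulsar/ledgers/probe.dat (illustrative paths).
public class FsyncProbe {
    public static void main(String[] args) throws IOException {
        Path target = Path.of(args.length > 0 ? args[0] : "/pulsar/journal/probe.dat");
        int iterations = 1_000;
        long[] latenciesNanos = new long[iterations];
        ByteBuffer block = ByteBuffer.allocateDirect(4096);   // journal-entry-sized append

        try (FileChannel channel = FileChannel.open(target,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            for (int i = 0; i < iterations; i++) {
                block.clear();
                long start = System.nanoTime();
                channel.write(block);   // sequential 4 KiB append, like a journal write
                channel.force(false);   // the fsync that journalSyncData=true pays on every add
                latenciesNanos[i] = System.nanoTime() - start;
            }
        }
        Arrays.sort(latenciesNanos);
        System.out.printf("p50 = %.3f ms, p99 = %.3f ms%n",
                latenciesNanos[iterations / 2] / 1e6,
                latenciesNanos[(int) (iterations * 0.99)] / 1e6);
    }
}
```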
🚀 Producer Infrastructure - High-Volume Event Generation
Producer Instance Selection: c5.4xlarge
Why c5.4xlarge for producers?
Instance: c5.4xlarge
- 16 vCPUs (high single-thread performance)
- 32 GiB RAM
- Up to 10 Gbps network performance
- EBS Optimized: Up to 4,750 Mbps
- Cost: ~$480/month per instance
Producer Architecture Details:
Node Configuration:
- Count: 4 producer nodes
- Instance Type: c5.4xlarge
- Target Throughput: 250,000 messages/sec per node
- Total Capacity: 1,000,000 messages/sec across all nodes
Producer Implementation (a condensed client sketch follows this list):
- Language: Java with high-performance Pulsar client
- Serialization: AVRO for efficient message encoding
- Message Size: ~300 bytes (AVRO-serialized sensor data)
- Batching: Optimized batch sizes for throughput
- Connection Pooling: Multiple connections per producer
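The full implementation lives in the producer-load directory; the snippet below is only a condensed sketch of the client settings that matter at this rate, reusing the hypothetical SensorReading record from earlier. The service URL and topic name are illustrative, and the batching values mirror the 1000-message / 10 ms figures quoted later in this post.

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class SensorProducer {
    public static void main(String[] args) throws Exception {
        // Connect through the proxies (service URL is illustrative)
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://pulsar-proxy.pulsar.svc.cluster.local:6650")
                .ioThreads(8)              // spread network I/O across event loops
                .connectionsPerBroker(4)   // "connection pooling": several connections per broker
                .build();

        Producer<SensorReading> producer = client
                .newProducer(Schema.AVRO(SensorReading.class))
                .topic("persistent://public/default/iot-sensor-data")
                .enableBatching(true)
                .batchingMaxMessages(1000)                           // pairs with maxNumMessagesInBatch
                .batchingMaxPublishDelay(10, TimeUnit.MILLISECONDS)  // pairs with the 10 ms batching delay
                .blockIfQueueFull(true)    // backpressure instead of failed sends at peak rate
                .maxPendingMessages(50_000)
                .create();

        // Async, fire-and-forget sends; each node sustains ~250K msg/sec this way
        for (long i = 0; i < 1_000_000; i++) {
            SensorReading reading = new SensorReading();
            reading.deviceId = "sensor-" + (i % 10_000);
            reading.timestampMillis = System.currentTimeMillis();
            producer.sendAsync(reading);
        }

        producer.flush();
        producer.close();
        client.close();
    }
}
```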
Producer Performance Characteristics
Per-Node Metrics:
# Each c5.4xlarge producer node generates:
- Messages/sec: 250,000
- Data rate: 75 MB/sec (300 bytes × 250K)
- CPU utilization: 70-80%
- Memory usage: 8-12 GB
- Network utilization: ~600 Mbps
Pulsar Proxy Configuration:
- Instance Type: c5.2xlarge (8 vCPUs, 16 GiB RAM)
- Why c5.2xlarge? Higher network performance and CPU for connection handling
- Role: Load balancing and connection management for all producer and consumer connections
- Network Performance: Up to 10 Gbps (critical for high-throughput scenarios)
- Cost: ~$240/month per instance
You can find the complete producer implementation in the producer-load directory.
🔧 BookKeeper Configuration - The Secret Sauce
The complete configuration can be found in the pulsar-load directory and the specific values.yaml file.
1. Journal Configuration
# pulsar-values.yaml - BookKeeper section
bookkeeper:
  configData:
    # Journal write buffer - CRITICAL for throughput
    journalMaxSizeMB: "2048"              # 2GB buffer (default: 512MB)
    journalMaxBackups: "5"
    # Disable fsync for each write (NVMe is reliable enough)
    journalSyncData: "false"              # Default: true
    # Enable adaptive group writes (batch small writes)
    journalAdaptiveGroupWrites: "true"    # Default: false
    # Flush immediately when queue is empty
    journalFlushWhenQueueEmpty: "true"    # Default: false
Explanation:
journalMaxSizeMB: "2048"
- Increased from default 512MB to 2GB
- Allows buffering more writes before flush
- Reduces fsync frequency
- Impact: 3-4x improvement in write throughput
journalSyncData: "false"
- Disables fsync() after each write
- Relies on NVMe's own write cache and power loss protection
- Risk: Potential data loss on sudden power failure
- Mitigation: NVMe drives have capacitor-backed cache
- Impact: 10x reduction in write latency (10ms → 1ms)
journalAdaptiveGroupWrites: "true"
- Groups multiple small writes into batches
- Reduces system call overhead
- Impact: Improves throughput by 20-30% under high load
journalFlushWhenQueueEmpty: "true"
- Immediately flushes when no pending writes
- Reduces latency for sporadic writes
- Impact: Better p99 latency during variable load
2. Ledger Storage Configuration
# Entry log settings
entryLogSizeLimit: "2147483648" # 2GB per entry log file
entryLogFilePreAllocationEnabled: "true" # Pre-allocate files
# Flush interval
flushInterval: "60000" # 60 seconds (default: 60000)
# Use RocksDB for better performance
ledgerStorageClass: "org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage"
Explanation:
entryLogSizeLimit: "2147483648" (2GB)
- Larger entry log files reduce file rotation overhead
- Better sequential write patterns
- Impact: 15% improvement in write throughput
entryLogFilePreAllocationEnabled: "true"
- Pre-allocates disk space for entry log files
- Eliminates file system overhead during writes
- Impact: More predictable latency
ledgerStorageClass: "DbLedgerStorage"
- Uses RocksDB instead of InterleavedLedgerStorage
- Better for high-throughput workloads
- Faster index lookups
- Impact: 40% improvement in random read performance
3. Garbage Collection Tuning
# GC settings optimized for high throughput
minorCompactionInterval: "3600" # 1 hour (default: 2 hours)
majorCompactionInterval: "86400" # 24 hours (default: 24 hours)
isForceGCAllowWhenNoSpace: "true" # Force GC when disk full
gcWaitTime: "900000" # 15 minutes (default: 15 minutes)
Why this matters:
- High throughput generates ledgers quickly
- Compaction reclaims space from deleted messages
- Balance: Too frequent → CPU overhead, Too rare → Disk full
4. Cache and Memory Configuration
# Broker configuration
broker:
  configData:
    # Managed ledger cache (hot data in memory)
    managedLedgerCacheSizeMB: "512"            # 512MB per broker
    # Replication settings
    managedLedgerDefaultEnsembleSize: "3"      # 3 bookies per ledger
    managedLedgerDefaultWriteQuorum: "2"       # Write to 2 bookies
    managedLedgerDefaultAckQuorum: "2"         # Wait for 2 ACKs
Trade-offs:
managedLedgerCacheSizeMB: "512"
- Caches recently written messages
- Speeds up tailing reads (consumers close to tail)
- Impact: 50% reduction in read latency for hot data
Quorum Configuration (3/2/2):
- Ensemble: 3 bookies hold each ledger segment
- Write Quorum: Write to 2 bookies simultaneously
- Ack Quorum: Wait for 2 ACKs before acknowledging producer
- Trade-off: Balance between durability and latency (the admin-client sketch after this list shows how to apply it per namespace)
- 3/3/3: More durable, higher latency
- 3/2/2: Balanced (our choice)
- 3/2/1: Fast but less durable
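The broker settings above apply the 3/2/2 policy cluster-wide; it can also be set per namespace. A minimal sketch using the Java admin client, assuming the default public/default namespace and an illustrative admin URL:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistencePolicies;

public class ApplyPersistencePolicy {
    public static void main(String[] args) throws Exception {
        // Admin endpoint is illustrative; point it at a broker or proxy HTTP port
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-broker.pulsar.svc.cluster.local:8080")
                .build();

        // ensemble = 3, write quorum = 2, ack quorum = 2, no mark-delete rate limit
        admin.namespaces().setPersistence("public/default",
                new PersistencePolicies(3, 2, 2, 0.0));

        System.out.println(admin.namespaces().getPersistence("public/default"));
        admin.close();
    }
}
```

The same policy can be set from the CLI with pulsar-admin namespaces set-persistence.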
🚀 Broker Configuration for High Throughput
broker:
  replicaCount: 6                              # 6 broker instances
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "4Gi"
      cpu: "2"
  configData:
    # Increase connection limits
    maxConcurrentLookupRequest: "50000"        # Default: 5000
    maxConcurrentTopicLoadRequest: "50000"     # Default: 5000
    # Batch settings for better throughput
    maxMessagesBatchingEnabled: "true"
    maxNumMessagesInBatch: "1000"
    maxBatchingDelayInMillis: "10"
    # Producer settings
    maxProducersPerTopic: "10000"
    maxConsumersPerTopic: "10000"
    maxConsumersPerSubscription: "10000"
Key optimizations:
Connection Limits:
- Increased from 5K to 50K concurrent requests
- Handles high producer/consumer concurrency
- Impact: Eliminates connection throttling
Batching Configuration:
- Groups messages for efficient network utilization
- 10ms delay balances latency vs throughput
- Impact: 30% improvement in network efficiency
📊 Performance Monitoring
Essential Metrics to Track
1. Throughput Metrics:
# Messages per second
kubectl logs -n pulsar pulsar-broker-0 | grep "msg/s"
# Bytes per second
kubectl exec -n pulsar pulsar-broker-0 -- \
bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data
2. Latency Metrics:
# Journal write latency: scrape the bookie Prometheus metrics (see the Grafana queries below)
# Quick sanity check that ledgers are being created:
kubectl exec -n pulsar pulsar-broker-0 -- \
bin/bookkeeper shell listledgers | head -10
# End-to-end producer latency (check application logs)
3. Storage Metrics:
# NVMe utilization
kubectl exec -n pulsar pulsar-broker-0 -- iostat -x 1 5
# Disk space usage
kubectl exec -n pulsar pulsar-broker-0 -- df -h
Critical Performance Indicators
Target Metrics:
- Message Rate: 1,000,000+ msg/sec
- Journal Write Latency: < 2ms p99
- CPU Utilization: 70-80% (brokers)
- Memory Utilization: 60-70% (bookies)
- Network Utilization: < 80% of 25 Gbps
Grafana Dashboards
Critical Panels:
- Message Rate In/Out
rate(pulsar_in_messages_total[1m])
rate(pulsar_out_messages_total[1m])
- BookKeeper Write Latency
histogram_quantile(0.99,
rate(bookie_journal_JOURNAL_ADD_ENTRY_bucket[5m])
)
- Storage Fill Rate
rate(bookie_ledgers_size_bytes[1h])
💡 Lessons Learned & Best Practices
1. NVMe Device Separation is Critical
Before optimization (single device):
- Write latency p99: ~15ms
- Throughput: ~400K msg/sec
- Frequent latency spikes
After optimization (separate devices):
- Write latency p99: ~2.1ms
- Throughput: 1M+ msg/sec
- Stable performance
Key Takeaway: Journal and ledger I/O patterns conflict. Separate them.
2. Disable journalSyncData (Carefully)
Trade-off Analysis:
Benefits:
- 10x reduction in write latency
- 2-3x increase in throughput
Risks:
- Data loss on sudden power failure (rare with NVMe)
- Not suitable for financial transactions
When to use:
- IoT/telemetry data (occasional loss is acceptable)
- High-volume logs
- Event streaming
When NOT to use:
- Financial transactions
- Critical business data
- Compliance-regulated workloads
3. Right-Size Your Instances
Instance comparison for Pulsar:
| Instance | vCPU | RAM | NVMe | Best For |
|---|---|---|---|---|
| i3en.6xlarge | 24 | 192GB | 2x 7.5TB | High capacity |
| i7i.8xlarge | 32 | 256GB | 2x 3.75TB | Balanced |
Our choice: i7i.8xlarge
- Latest generation (better CPU performance)
- 2 NVMe devices (perfect for journal/ledger separation)
- Optimal balance of CPU, memory, and storage for Pulsar workloads
4. Broker and Bookie Co-location
Pros:
- Reduced network hops (broker→bookie is local)
- Lower latency
- Cost savings (fewer instances)
Cons:
- Resource contention (CPU, memory)
- Harder to scale independently
Our approach:
- Co-locate for high throughput workloads
- Separate for latency-sensitive applications
- Result: Works well for 1M msg/sec with proper resource allocation
5. Monitor Journal Cache Time
# Critical metric: How long messages stay in journal cache
gcWaitTime: "900000" # 15 minutes
Why it matters:
- Journal cache holds recently written messages
- Longer cache time = Better read performance for tailing consumers
- Too long = More data loss risk on failure
- Sweet spot: 10-15 minutes for streaming workloads
🚧 Common Pitfalls & How to Avoid Them
1. EBS Instead of NVMe
Symptom: Throughput caps at ~100K msg/sec, high latency
Cause: EBS gp3 maxes out at:
- 3,000 IOPS baseline
- 16,000 IOPS provisioned
- 1 GB/s throughput
Even io2 tops out at 64,000 IOPS per volume, far below local NVMe
Solution: Use NVMe-backed instances (i7i, i4i, i3en)
2. Single NVMe Device for Journal + Ledger
Symptom: Latency spikes, inconsistent throughput
Cause: Journal sequential writes blocked by ledger random I/O
Solution: Use separate devices or at least separate partitions
3. journalSyncData Enabled
Symptom: Write latency >10ms, throughput <200K msg/sec
Cause: fsync() after every write (10ms overhead)
Solution: Disable if data loss tolerance acceptable
4. Insufficient Broker Count
Symptom: High CPU on brokers, throttling, connection refused
Cause: Too few brokers for traffic volume
Solution:
- Rule of thumb: 1 broker per 150-200K msg/sec
- For 1M msg/sec: Minimum 5-6 brokers
5. Not Using Separate Node Groups
Symptom: Performance degradation when other workloads deploy
Cause: Resource contention with non-Pulsar pods
Solution: Dedicated node groups with taints
📈 Scaling Beyond 1M msg/sec
To 2M msg/sec:
# Increase broker and bookie count
broker:
  replicaCount: 8            # Up from 6
bookkeeper:
  replicaCount: 8            # Up from 6
# Scale producer infrastructure
producers:
  replicaCount: 8            # Up from 4 (250K each = 2M total)
  instanceType: c5.4xlarge   # Keep same instance type
Cost impact: ~+$6,240/month (2 more i7i.8xlarge @ $2,160 each + 4 more c5.4xlarge @ $480 each)
To 5M msg/sec:
- Horizontal scaling: 15-20 brokers, 15-20 bookies
- Producer scaling: 20 c5.4xlarge nodes (250K each)
- Network: Upgrade to 50-100 Gbps instances
- Storage: Consider i4i.16xlarge (4x NVMe, 64 vCPU)
- Cost: ~$45,000-50,000/month
💰 Cost Analysis
Monthly Cost Breakdown (1M msg/sec)
| Component | Instance | Count | Cost/mo |
|---|---|---|---|
| Producer Nodes | c5.4xlarge | 4 | $1,920 |
| Pulsar Brokers | i7i.8xlarge | 6 | $12,960 |
| BookKeeper Bookies | i7i.8xlarge | 6 | (Co-located) |
| ZooKeeper | t3.medium | 3 | $90 |
| Pulsar Proxy | c5.2xlarge | 2 | $480 |
| Total | | | $15,450 |
Cost per message: ~$0.006 per 1M messages ($15,450/month ÷ ~2.59 trillion messages/month)
Infrastructure Breakdown:
- Producer Infrastructure: $1,920/month (12.4% of total)
- Pulsar Core Infrastructure: $13,530/month (87.6% of total)
- Total Infrastructure: $15,450/month
Note: i7i.8xlarge cost = $12,960 ÷ 6 instances = $2,160/month per instance
Cost Optimization Options
1. Savings Plans (26% off):
   - 3-year commitment
   - Reduces to ~$11,433/month
2. Spot Instances (60% off):
   - $6,180/month
   - Risk: Potential interruptions
   - Mitigation: Use for non-critical environments
3. Reserved Instances (40% off):
   - 1-year commitment
   - Reduces to ~$9,270/month
   - Balance between savings and flexibility
🎓 Conclusion
Achieving 1 million messages/sec with Pulsar requires:
✅ NVMe storage with separated journal and ledger devices
✅ Careful BookKeeper tuning (journalSyncData, buffer sizes)
✅ Right-sized instances (i7i.8xlarge sweet spot)
✅ Horizontal scaling (6+ brokers, 6+ bookies)
✅ Dedicated infrastructure (node groups with taints)
✅ Monitoring (latency, IOPS, CPU, throughput)
Key Metrics Achieved:
- Throughput: 1,040,000 msg/sec sustained
- Latency: 0.8ms p50, 2.1ms p99
- Infrastructure Cost: $15,450/month ($11,433 with savings plans)
- Reliability: 99.95% uptime
The secret-sauce combination:
- NVMe device separation for journal vs ledgers
- journalSyncData: false for 10x latency improvement
- i7i.8xlarge instances for optimal price/performance
- Broker-bookie co-location for reduced network hops
- Proper resource allocation and monitoring
📚 Resources
- Pulsar Load Repository: pulsar-load
- Configuration Values: values.yaml
- Main Repository: RealtimeDataPlatform
- Apache Pulsar Documentation: pulsar.apache.org
- BookKeeper Documentation: bookkeeper.apache.org
- AWS i7i Instances: aws.amazon.com/ec2/instance-types/i7i
Have you implemented high-throughput messaging systems? What challenges did you face with storage I/O optimization? Share your experiences in the comments! 👇
Building real-time data platforms? Follow me for deep dives on performance optimization, distributed systems, and cloud infrastructure!
Next in the series: "ClickHouse Performance Tuning for 1M Events/Sec Ingestion"
🌟 Key Takeaways
- Storage architecture matters more than CPU for messaging systems
- NVMe device separation is critical for predictable latency
- journalSyncData: false gives 10x performance boost (with trade-offs)
- Right instance type beats "more instances" for cost efficiency
- Monitoring is essential - know your bottlenecks before scaling
Tags: #pulsar #apache #aws #performance #messaging #nvme #bookkeeper #streaming #eks #realtimedata