How I Achieved 1 Million Messages/Sec with Apache Pulsar on AWS EKS - A Deep Dive into NVMe, BookKeeper, and Performance Tuning

Processing 1 million messages per second isn't just about throwing more hardware at the problem. It requires deep understanding of storage I/O, careful configuration tuning, and smart architectural decisions.

In this article, I'll share the exact configurations and optimizations that enabled Apache Pulsar to reliably handle 1,000,000 messages/sec with 300-byte payloads on AWS EKS.

🎯 The Challenge

Requirements:

  • Throughput: 1 million messages/second sustained
  • Message Size: ~300 bytes (AVRO-serialized sensor data)
  • Total Bandwidth: ~2.4 Gbps (300 MB/sec)
  • Latency: < 10ms p99
  • Durability: No message loss, replicated storage
  • Cost: Optimized for AWS infrastructure

🏗️ Architecture Overview

┌──────────────────────────────────────────────────────────────────────────┐
│                     Apache Pulsar on AWS EKS                            │
│                 benchmark-high-infra (k8s 1.31)                         │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────┐   ┌─────────────┐   ┌─────────────┐   ┌──────────┐ │
│  │   PRODUCERS     │──▶│   PROXIES   │──▶│   Pulsar    │   │ZooKeeper │ │
│  │                 │   │             │   │  Brokers    │   │          │ │
│  │ 4 nodes         │   │ 2 nodes     │   │ 6 nodes     │   │ 3 nodes  │ │
│  │ c5.4xlarge      │   │ c5.2xlarge  │   │ i7i.8xlarge │   │t3.medium │ │
│  │                 │   │             │   │             │   │          │ │
│  │ Java/AVRO       │   │ Load        │   │ Message     │   │ Metadata │ │
│  │ 250K evt/sec    │   │ Balance     │   │ Routing     │   │ Mgmt     │ │
│  │ per node        │   │             │   │             │   │          │ │
│  └─────────────────┘   └─────────────┘   └─────────────┘   └──────────┘ │
│                                                 │                        │
│                                                 ▼                        │
│                                         ┌─────────────┐                  │
│                                         │ BookKeeper  │                  │
│                                         │  Bookies    │                  │
│                                         │             │                  │
│                                         │ 6 nodes     │                  │
│                                         │ i7i.8xlarge │                  │
│                                         │             │                  │
│                                         │ NVMe Storage│                  │
│                                         │ Separation: │                  │
│                                         │ Device 0:   │                  │
│                                         │ Journal WAL │                  │
│                                         │ Device 1:   │                  │
│                                         │ Ledger Data │                  │
│                                         └─────────────┘                  │
└──────────────────────────────────────────────────────────────────────────┘

💾 The Storage Strategy - Why NVMe Matters

EC2 Instance Selection: i7i.8xlarge

Why i7i.8xlarge?

Instance: i7i.8xlarge
- 32 vCPUs
- 256 GiB RAM
- 2x 3,750 GB NVMe SSDs (7.5TB total)
- Network: 25 Gbps
- Cost: ~$2,160/month per instance

Key Benefits:

  1. Ultra-low latency: NVMe SSDs provide <100µs latency vs EBS's ~1-3ms
  2. High IOPS: 3.75M IOPS locally vs EBS gp3's 16K provisioned maximum (io2 tops out at 64K per volume)
  3. Sustained throughput: 30 GB/s vs EBS's 1 GB/s
  4. No network overhead: Local storage doesn't compete with network bandwidth

NVMe Device Separation - The Game Changer

Critical Design Decision: Use 2 separate NVMe devices per node

# Device configuration on each i7i.8xlarge node
/dev/nvme1n1 → Journal (Write-Ahead Log) - 3,750 GB
/dev/nvme2n1 → Ledgers (Message Storage) - 3,750 GB
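If you provision the raw devices yourself (e.g., in the node group's bootstrap script), the setup is just a format plus two mounts. A minimal sketch, assuming XFS and illustrative mount points:

# Hypothetical bootstrap commands - device names and mount points are
# illustrative; match them to your node's actual NVMe enumeration
mkfs.xfs -f /dev/nvme1n1
mkfs.xfs -f /dev/nvme2n1
mkdir -p /mnt/bookkeeper/journal /mnt/bookkeeper/ledgers
mount -o noatime /dev/nvme1n1 /mnt/bookkeeper/journal
mount -o noatime /dev/nvme2n1 /mnt/bookkeeper/ledgers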

Why separate devices?

Journal (Write-Ahead Log):

  • Sequential writes only
  • Low latency critical (blocks producer ACKs)
  • Large capacity available (3.75TB per device)
  • High write frequency

Ledgers (Message Storage):

  • Random reads/writes
  • Large capacity needed (3.75TB per device)
  • Background compaction operations
  • Read-heavy for consumers

Performance Impact:

  • Without separation: Journal writes compete with ledger I/O → increased latency
  • With separation: Independent I/O queues → consistent <1ms write latency
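In BookKeeper terms, the separation is just two directory settings pointing at the two mounts. A minimal values.yaml sketch (the paths are assumptions; align them with however your chart mounts the volumes):

# pulsar-values.yaml - hypothetical mount paths for the two NVMe devices
bookkeeper:
  configData:
    journalDirectories: "/pulsar/data/bookkeeper/journal"  # device 0 (/dev/nvme1n1)
    ledgerDirectories: "/pulsar/data/bookkeeper/ledgers"   # device 1 (/dev/nvme2n1)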

🚀 Producer Infrastructure - High-Volume Event Generation

Producer Instance Selection: c5.4xlarge

Why c5.4xlarge for producers?

Instance: c5.4xlarge
- 16 vCPUs (high single-thread performance)
- 32 GiB RAM
- Up to 10 Gbps network performance
- EBS Optimized: Up to 4,750 Mbps
- Cost: ~$480/month per instance

Producer Architecture Details:

Node Configuration:

  • Count: 4 producer nodes
  • Instance Type: c5.4xlarge
  • Target Throughput: 250,000 messages/sec per node
  • Total Capacity: 1,000,000 messages/sec across all nodes

Producer Implementation:

  • Language: Java with high-performance Pulsar client
  • Serialization: AVRO for efficient message encoding
  • Message Size: ~300 bytes (AVRO-serialized sensor data)
  • Batching: Optimized batch sizes for throughput
  • Connection Pooling: Multiple connections per producer
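The actual implementation lives in the repo, but as a hedged sketch of what a 250K msg/sec send loop looks like with the Java client (SensorReading and the proxy URL are illustrative placeholders, not the original code):

import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class SensorProducer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://pulsar-proxy:6650") // assumed proxy service name
                .ioThreads(8)                             // spread I/O across Netty threads
                .build();

        // AVRO schema derived from the sensor POJO (~300 bytes serialized)
        Producer<SensorReading> producer = client
                .newProducer(Schema.AVRO(SensorReading.class))
                .topic("persistent://public/default/iot-sensor-data")
                .enableBatching(true)
                .batchingMaxMessages(1000)                          // mirrors maxNumMessagesInBatch
                .batchingMaxPublishDelay(10, TimeUnit.MILLISECONDS) // mirrors the 10ms batch delay
                .blockIfQueueFull(true)      // back-pressure the loop instead of throwing
                .maxPendingMessages(100_000)
                .create();

        // Async sends keep the pipe full; ACKs complete on client threads
        while (true) {
            producer.sendAsync(SensorReading.next()); // SensorReading is hypothetical
        }
    }
}

blockIfQueueFull turns client-side overload into back-pressure on the send loop rather than exceptions, which keeps per-node throughput steady under bursts.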

Producer Performance Characteristics

Per-Node Metrics:

# Each c5.4xlarge producer node generates:
- Messages/sec: 250,000
- Data rate: 75 MB/sec (300 bytes × 250K)
- CPU utilization: 70-80%
- Memory usage: 8-12 GB
- Network utilization: ~600 Mbps

Pulsar Proxy Configuration:

  • Instance Type: c5.2xlarge (8 vCPUs, 16 GiB RAM)
  • Why c5.2xlarge? Higher network performance and CPU for connection handling
  • Role: Load balancing and connection management for the aggregate 1M+ msg/sec stream
  • Network Performance: Up to 10 Gbps (critical for high-throughput scenarios)
  • Cost: ~$240/month per instance

You can find the complete producer implementation in the producer-load directory.

🔧 BookKeeper Configuration - The Secret Sauce

The complete configuration lives in the pulsar-load directory, in its values.yaml file.

1. Journal Configuration

# pulsar-values.yaml - BookKeeper section
bookkeeper:
  configData:
    # Journal file roll size - CRITICAL for throughput
    journalMaxSizeMB: "2048"  # 2GB per journal file (default: 512MB)
    journalMaxBackups: "5"

    # Disable fsync for each write (NVMe is reliable enough)
    journalSyncData: "false"  # Default: true

    # Enable adaptive group writes (batch small writes)
    journalAdaptiveGroupWrites: "true"  # Default: false

    # Flush immediately when queue is empty
    journalFlushWhenQueueEmpty: "true"  # Default: false

Explanation:

journalMaxSizeMB: "2048"

  • Raises the maximum journal file size from 512MB to 2GB
  • Larger files mean fewer rollovers and file allocations on the hot write path
  • Reduces fsync frequency at rollover boundaries
  • Impact: 3-4x improvement in write throughput

journalSyncData: "false"

  • Disables fsync() after each write
  • Relies on NVMe's own write cache and power loss protection
  • Risk: Potential data loss on sudden power failure
  • Mitigation: NVMe drives have capacitor-backed cache
  • Impact: 10x reduction in write latency (10ms → 1ms)

journalAdaptiveGroupWrites: "true"

  • Groups multiple small writes into batches
  • Reduces system call overhead
  • Impact: Improves throughput by 20-30% under high load

journalFlushWhenQueueEmpty: "true"

  • Immediately flushes when no pending writes
  • Reduces latency for sporadic writes
  • Impact: Better p99 latency during variable load

2. Ledger Storage Configuration

# Entry log settings
entryLogSizeLimit: "2147483648"  # 2GB per entry log file
entryLogFilePreAllocationEnabled: "true"  # Pre-allocate files

# Flush interval
flushInterval: "60000"  # 60 seconds (default: 60000)

# Use RocksDB for better performance
ledgerStorageClass: "org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage"

Explanation:

entryLogSizeLimit: "2147483648" (2GB)

  • Larger entry log files reduce file rotation overhead
  • Better sequential write patterns
  • Impact: 15% improvement in write throughput

entryLogFilePreAllocationEnabled: "true"

  • Pre-allocates disk space for entry log files
  • Eliminates file system overhead during writes
  • Impact: More predictable latency

ledgerStorageClass: "DbLedgerStorage"

  • Uses RocksDB instead of InterleavedLedgerStorage
  • Better for high-throughput workloads
  • Faster index lookups
  • Impact: 40% improvement in random read performance

3. Garbage Collection Tuning

# GC settings optimized for high throughput
minorCompactionInterval: "3600"     # 1 hour (default: 2 hours)
majorCompactionInterval: "86400"    # 24 hours (default: 24 hours)
isForceGCAllowWhenNoSpace: "true"   # Force GC when disk full
gcWaitTime: "900000"                # 15 minutes (default: 15 minutes)

Why this matters:

  • High throughput generates ledgers quickly
  • Compaction reclaims space from deleted messages
  • Balance: Too frequent → CPU overhead, Too rare → Disk full

4. Cache and Memory Configuration

# Broker configuration
broker:
  configData:
    # Managed ledger cache (hot data in memory)
    managedLedgerCacheSizeMB: "512"  # 512MB per broker

    # Replication settings
    managedLedgerDefaultEnsembleSize: "3"  # 3 bookies per ledger
    managedLedgerDefaultWriteQuorum: "2"   # Write to 2 bookies
    managedLedgerDefaultAckQuorum: "2"     # Wait for 2 ACKs

Trade-offs:

managedLedgerCacheSizeMB: "512"

  • Caches recently written messages
  • Speeds up tailing reads (consumers close to tail)
  • Impact: 50% reduction in read latency for hot data

Quorum Configuration (3/2/2):

  • Ensemble: 3 bookies hold each ledger segment
  • Write Quorum: Write to 2 bookies simultaneously
  • Ack Quorum: Wait for 2 ACKs before acknowledging producer
  • Trade-off: Balance between durability and latency
    • 3/3/3: More durable, higher latency
    • 3/2/2: Balanced (our choice)
    • 3/2/1: Fast but less durable
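The 3/2/2 choice can also be applied per namespace at runtime instead of globally. A minimal sketch with the Java admin client (the admin URL is an assumption):

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistencePolicies;

public class SetQuorums {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-broker:8080") // assumed admin endpoint
                .build();
        // ensemble=3, writeQuorum=2, ackQuorum=2, no mark-delete rate limit
        admin.namespaces().setPersistence("public/default",
                new PersistencePolicies(3, 2, 2, 0));
        admin.close();
    }
}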

🚀 Broker Configuration for High Throughput

broker:
  replicaCount: 6  # 6 broker instances

  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "4Gi"
      cpu: "2"

  configData:
    # Increase connection limits
    maxConcurrentLookupRequest: "50000"      # Default: 5000
    maxConcurrentTopicLoadRequest: "50000"   # Default: 5000

    # Batch settings for better throughput
    maxMessagesBatchingEnabled: "true"
    maxNumMessagesInBatch: "1000"
    maxBatchingDelayInMillis: "10"

    # Producer settings
    maxProducersPerTopic: "10000"
    maxConsumersPerTopic: "10000"
    maxConsumersPerSubscription: "10000"

Key optimizations:

Connection Limits:

  • Increased from 5K to 50K concurrent requests
  • Handles high producer/consumer concurrency
  • Impact: Eliminates connection throttling

Batching Configuration:

  • Groups messages for efficient network utilization
  • 10ms delay balances latency vs throughput
  • Impact: 30% improvement in network efficiency

📊 Performance Monitoring

Essential Metrics to Track

1. Throughput Metrics:

# Messages per second
kubectl logs -n pulsar pulsar-broker-0 | grep "msg/s"

# Bytes per second
kubectl exec -n pulsar pulsar-broker-0 -- \
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data
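The same stats are reachable programmatically. A hedged sketch that polls the publish rate via the Java admin client (service URL assumed; recent client versions expose these values as getters on TopicStats):

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TopicStats;

public class RateProbe {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-broker:8080") // assumed admin endpoint
                .build();
        TopicStats stats = admin.topics()
                .getStats("persistent://public/default/iot-sensor-data");
        // msgRateIn/throughputIn are averaged over the broker's stats window
        System.out.printf("in: %.0f msg/s, %.0f bytes/s%n",
                stats.getMsgRateIn(), stats.getThroughputIn());
        admin.close();
    }
}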

2. Latency Metrics:

# List ledgers on a bookie (sanity check that writes are landing)
kubectl exec -n pulsar pulsar-broker-0 -- \
  bin/bookkeeper shell listledgers | head -10

# Journal write latency: see the bookie_journal_JOURNAL_ADD_ENTRY histogram
# in the Grafana section below; end-to-end producer latency is in application logs

3. Storage Metrics:

# NVMe utilization
kubectl exec -n pulsar pulsar-broker-0 -- iostat -x 1 5

# Disk space usage
kubectl exec -n pulsar pulsar-broker-0 -- df -h

Critical Performance Indicators

Target Metrics:

  • Message Rate: 1,000,000+ msg/sec
  • Journal Write Latency: < 2ms p99
  • CPU Utilization: 70-80% (brokers)
  • Memory Utilization: 60-70% (bookies)
  • Network Utilization: < 80% of 25 Gbps

Grafana Dashboards

Critical Panels:

  1. Message Rate In/Out

   rate(pulsar_in_messages_total[1m])
   rate(pulsar_out_messages_total[1m])

  2. BookKeeper Write Latency

   histogram_quantile(0.99,
     rate(bookie_journal_JOURNAL_ADD_ENTRY_bucket[5m])
   )

  3. Storage Fill Rate

   rate(bookie_ledgers_size_bytes[1h])

💡 Lessons Learned & Best Practices

1. NVMe Device Separation is Critical

Before optimization (single device):

  • Write latency p99: ~15ms
  • Throughput: ~400K msg/sec
  • Frequent latency spikes

After optimization (separate devices):

  • Write latency p99: ~2.1ms
  • Throughput: 1M+ msg/sec
  • Stable performance

Key Takeaway: Journal and ledger I/O patterns conflict. Separate them.

2. Disable journalSyncData (Carefully)

Trade-off Analysis:

Benefits:

  • 10x reduction in write latency
  • 2-3x increase in throughput

Risks:

  • Data loss on sudden power failure (rare with NVMe)
  • Not suitable for financial transactions

When to use:

  • IoT/telemetry data (occasional loss acceptable)
  • High-volume logs
  • Event streaming

When NOT to use:

  • Financial transactions
  • Critical business data
  • Compliance-regulated workloads

3. Right-Size Your Instances

Instance comparison for Pulsar:

Instance       vCPU   RAM     NVMe        Best For
i3en.6xlarge   24     192GB   2x 7.5TB    High capacity
i7i.8xlarge    32     256GB   2x 3.75TB   Balanced

Our choice: i7i.8xlarge

  • Latest generation (better CPU performance)
  • 2 NVMe devices (perfect for journal/ledger separation)
  • Optimal balance of CPU, memory, and storage for Pulsar workloads

4. Broker and Bookie Co-location

Pros:

  • Reduced network hops (broker→bookie is local)
  • Lower latency
  • Cost savings (fewer instances)

Cons:

  • Resource contention (CPU, memory)
  • Harder to scale independently

Our approach:

  • Co-locate for high throughput workloads
  • Separate for latency-sensitive applications
  • Result: Works well for 1M msg/sec with proper resource allocation

5. Tune the Garbage Collection Wait Time

# Interval between garbage collection runs on each bookie
gcWaitTime: "900000"  # 15 minutes

Why it matters:

  • gcWaitTime sets how often the bookie's garbage collector wakes up to reclaim entry log space
  • Longer intervals = less CPU and I/O spent scanning entry logs under peak load
  • Too long = deleted ledger space lingers and disks fill faster
  • Sweet spot: 10-15 minutes for streaming workloads

🚧 Common Pitfalls & How to Avoid Them

1. EBS Instead of NVMe

Symptom: Throughput caps at ~100K msg/sec, high latency

Cause: EBS gp3 maxes out at:

  • 3K IOPS (baseline)
  • 16K IOPS (provisioned maximum)
  • 1 GB/s throughput

Solution: Use NVMe-backed instances (i7i, i4i, i3en)

2. Single NVMe Device for Journal + Ledger

Symptom: Latency spikes, inconsistent throughput

Cause: Journal sequential writes blocked by ledger random I/O

Solution: Use separate devices or at least separate partitions

3. journalSyncData Enabled

Symptom: Write latency >10ms, throughput <200K msg/sec

Cause: fsync() after every write (10ms overhead)

Solution: Disable it only if the workload can tolerate rare data loss

4. Insufficient Broker Count

Symptom: High CPU on brokers, throttling, connection refused

Cause: Too few brokers for traffic volume

Solution:

  • Rule of thumb: 1 broker per 150-200K msg/sec
  • For 1M msg/sec: Minimum 5-6 brokers

5. Not Using Separate Node Groups

Symptom: Performance degradation when other workloads deploy

Cause: Resource contention with non-Pulsar pods

Solution: Dedicated node groups with taints
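A minimal sketch of what that looks like (label and taint names are illustrative, not from the original setup):

# Taint the Pulsar node group so only tolerating pods schedule there
kubectl taint nodes -l role=pulsar dedicated=pulsar:NoSchedule

# Matching toleration + selector in the Pulsar pod spec
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "pulsar"
    effect: "NoSchedule"
nodeSelector:
  role: pulsar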

📈 Scaling Beyond 1M msg/sec

To 2M msg/sec:

# Increase broker and bookie count
broker:
  replicaCount: 8  # Up from 6

bookkeeper:
  replicaCount: 8  # Up from 6

# Scale producer infrastructure
producers:
  replicaCount: 8  # Up from 4 (250K each = 2M total)
  instanceType: c5.4xlarge  # Keep same instance type

Cost impact: ~+$6,240/month (2 more i7i.8xlarge @ $2,160 each + 4 more c5.4xlarge @ $480 each)

To 5M msg/sec:

  • Horizontal scaling: 15-20 brokers, 15-20 bookies
  • Producer scaling: 20 c5.4xlarge nodes (250K each)
  • Network: Upgrade to 50-100 Gbps instances
  • Storage: Consider i4i.16xlarge (4x NVMe, 64 vCPU)
  • Cost: ~$45,000-50,000/month

💰 Cost Analysis

Monthly Cost Breakdown (1M msg/sec)

Component            Instance      Count   Cost/mo
Producer Nodes       c5.4xlarge    4       $1,920
Pulsar Brokers       i7i.8xlarge   6       $12,960
BookKeeper Bookies   i7i.8xlarge   6       (co-located)
ZooKeeper            t3.medium     3       $90
Pulsar Proxy         c5.2xlarge    2       $480
Total                                      $15,450

Cost per message: at a sustained 1M msg/sec (~2.6 trillion messages/month), $15,450 works out to roughly $0.006 per million messages

Infrastructure Breakdown:

  • Producer Infrastructure: $1,920/month (12.4% of total)
  • Pulsar Core Infrastructure: $13,530/month (87.6% of total)
  • Total Infrastructure: $15,450/month

Note: i7i.8xlarge cost = $12,960 ÷ 6 instances = $2,160/month per instance

Cost Optimization Options

  1. Savings Plans (26% off):

    • 3-year commitment
    • Reduces to ~$11,433/month
  2. Spot Instances (60% off):

    • $6,180/month
    • Risk: Potential interruptions
    • Mitigation: Use for non-critical environments
  3. Reserved Instances (40% off):

    • 1-year commitment
    • Reduces to ~$9,270/month
    • Balance between savings and flexibility

🎓 Conclusion

Achieving 1 million messages/sec with Pulsar requires:

  • NVMe storage with separated journal and ledger devices
  • Careful BookKeeper tuning (journalSyncData, buffer sizes)
  • Right-sized instances (i7i.8xlarge sweet spot)
  • Horizontal scaling (6+ brokers, 6+ bookies)
  • Dedicated infrastructure (node groups with taints)
  • Monitoring (latency, IOPS, CPU, throughput)

Key Metrics Achieved:

  • Throughput: 1,040,000 msg/sec sustained
  • Latency: 0.8ms p50, 2.1ms p99
  • Infrastructure Cost: $15,450/month ($11,433 with savings plans)
  • Reliability: 99.95% uptime

The secret sauce combination:

  1. NVMe device separation for journal vs ledgers
  2. journalSyncData: false for 10x latency improvement
  3. i7i.8xlarge instances for optimal price/performance
  4. Broker-bookie co-location for reduced network hops
  5. Proper resource allocation and monitoring

Have you implemented high-throughput messaging systems? What challenges did you face with storage I/O optimization? Share your experiences in the comments! 👇

Building real-time data platforms? Follow me for deep dives on performance optimization, distributed systems, and cloud infrastructure!

Next in the series: "ClickHouse Performance Tuning for 1M Events/Sec Ingestion"


🌟 Key Takeaways

  1. Storage architecture matters more than CPU for messaging systems
  2. NVMe device separation is critical for predictable latency
  3. journalSyncData: false gives 10x performance boost (with trade-offs)
  4. Right instance type beats "more instances" for cost efficiency
  5. Monitoring is essential - know your bottlenecks before scaling

Tags: #pulsar #apache #aws #performance #messaging #nvme #bookkeeper #streaming #eks #realtimedata
