War Story: Scaling a 2026 IoT Platform to 10M Devices with Pulsar 3.2 and Kafka Connect
By 2026, our industrial IoT platform was handling 1 million connected devices across smart manufacturing sites and municipal sensor networks. We’d signed contracts to onboard 9 million more devices in 18 months, and our existing ZooKeeper-based Kafka cluster was already buckling under its 500k messages per second (msg/s) peak load.
The Breaking Point
The first sign of trouble came in Q1 2026: a firmware update rollout to 200k industrial sensors caused a 3x spike in telemetry traffic. Our Kafka brokers hit 90% disk I/O utilization, producer latency jumped to 800ms, and we dropped 0.2% of messages, which is unacceptable for safety-critical industrial workloads. We had already over-provisioned brokers, so adding more wasn’t cutting it: ZooKeeper-dependent controller failover took around 15 minutes, and partition rebalancing during scaling caused hours of downtime.
Why Pulsar 3.2?
We evaluated next-gen event streaming tools, and Apache Pulsar 3.2 stood out. Its segment-based storage architecture decouples compute (brokers) from storage (BookKeeper), so we could scale each independently. Pulsar’s support for MQTT (the dominant IoT protocol in 2026) via the MQTT-on-Pulsar (MoP) protocol handler eliminated our need for a separate MQTT broker, cutting latency by 40% in early tests. Most importantly, Pulsar’s Kafka protocol compatibility through the Kafka-on-Pulsar (KoP) protocol handler meant we could reuse most of our existing producer/consumer code. But we still needed to migrate 1M existing devices and 12 months of historical data without downtime.
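To give a sense of how little application code had to change, here’s a minimal sketch of one of our legacy Kafka producers repointed at Pulsar. It assumes the KoP protocol handler is enabled on the Pulsar brokers and listening on the usual Kafka port; the broker address and topic name are placeholders, not our real ones.

```java
// Minimal sketch: an unchanged Kafka producer pointed at Pulsar's
// Kafka-compatible listener (assumes the KoP protocol handler is enabled;
// the address and topic below are placeholders).
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TelemetryProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The only change from the legacy setup: bootstrap.servers now points
        // at the Pulsar brokers' Kafka protocol endpoint.
        props.put("bootstrap.servers", "pulsar-broker.internal:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());
        props.put("acks", "all");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            byte[] frame = new byte[] {0x01, 0x02, 0x03};  // stand-in telemetry frame
            producer.send(new ProducerRecord<>("factory-telemetry", "device-42", frame));
            producer.flush();
        }
    }
}
```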
Enter Kafka Connect
We used Kafka Connect as the bridge between our legacy Kafka cluster and the new Pulsar 3.2 deployment. We built a custom Pulsar Sink Connector for Kafka Connect that pulled data from legacy Kafka topics and wrote to Pulsar topics, with Pulsar’s broker-side message deduplication enabled to get effectively exactly-once delivery and avoid duplicate telemetry data. For the historical data migration, we used the Kafka Connect Replicator plugin to batch-copy 12 months of topic data over 72 hours, with rate limiting to avoid overwhelming either cluster.
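The real connector also handled topic routing, schema mapping, and retries, but the core of the sink task was small. Here’s a simplified sketch, assuming the Connect worker is configured with ByteArrayConverter (so record values arrive as byte arrays) and using illustrative config keys like pulsar.service.url and pulsar.topic, which are not the names of any published connector.

```java
// Simplified sketch of the custom Pulsar sink task. Config keys, topic
// mapping, and error handling are illustrative; the production connector
// did considerably more.
import java.util.Collection;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.TypedMessageBuilder;

public class PulsarSinkTask extends SinkTask {
    private PulsarClient client;
    private Producer<byte[]> producer;

    @Override
    public void start(Map<String, String> props) {
        try {
            client = PulsarClient.builder()
                    .serviceUrl(props.get("pulsar.service.url"))  // e.g. pulsar://pulsar-broker:6650
                    .build();
            producer = client.newProducer(Schema.BYTES)
                    .topic(props.get("pulsar.topic"))
                    .enableBatching(true)    // batch small telemetry frames
                    .blockIfQueueFull(true)  // apply back-pressure instead of failing sends
                    .create();
        } catch (Exception e) {
            throw new ConnectException("Failed to create Pulsar producer", e);
        }
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            // Async publish; broker-side deduplication keeps redeliveries from
            // turning into duplicate telemetry.
            TypedMessageBuilder<byte[]> msg = producer.newMessage().value((byte[]) record.value());
            if (record.key() != null) {
                msg.key(record.key().toString());
            }
            msg.sendAsync();
        }
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        try {
            producer.flush();  // drain buffered sends before Connect commits offsets
        } catch (Exception e) {
            throw new ConnectException("Flush to Pulsar failed", e);
        }
    }

    @Override
    public void stop() {
        try {
            if (producer != null) producer.close();
            if (client != null) client.close();
        } catch (Exception e) {
            throw new ConnectException("Failed to close Pulsar client", e);
        }
    }

    @Override
    public String version() {
        return "1.0";
    }
}
```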
Challenges We Hit (and Solved)
1. MQTT Payload Mismatches
Early tests of the MQTT-on-Pulsar (MoP) ingestion path showed 5% of sensor payloads being rejected: our legacy system used a custom binary encoding for industrial sensor data, but the default MQTT handler expected JSON. We built a lightweight edge-side transcoder that converted binary payloads to Protobuf at the gateway level, which Pulsar’s built-in schema registry then validated automatically against our Protobuf schema. This cut invalid message rates to 0.001%.
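The transcoder itself was a few dozen lines per payload type. The sketch below shows the idea; the binary layout (device id, timestamp, value) and the SensorReading class, which stands in for a hypothetical protoc-generated Protobuf class, are illustrative rather than our actual wire format or schema.

```java
// Gateway-side transcoder sketch. SensorReading stands in for a hypothetical
// protoc-generated Protobuf class; the fixed binary layout below is
// illustrative, not the real sensor wire format.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BinaryToProtobufTranscoder {

    public byte[] transcode(byte[] legacyFrame) {
        ByteBuffer buf = ByteBuffer.wrap(legacyFrame).order(ByteOrder.BIG_ENDIAN);
        long deviceId = buf.getLong();     // 8-byte device identifier
        long epochMillis = buf.getLong();  // 8-byte sample timestamp
        double value = buf.getDouble();    // 8-byte measurement

        // Re-encode as Protobuf so the broker's schema registry can validate it.
        SensorReading reading = SensorReading.newBuilder()
                .setDeviceId(deviceId)
                .setTimestampMillis(epochMillis)
                .setValue(value)
                .build();
        return reading.toByteArray();
    }
}
```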
2. Connector Throughput Limits
Our initial Kafka Connect to Pulsar pipeline maxed out at 200k msg/s, well short of our 500k msg/s peak traffic. We tuned the Pulsar Sink Connector’s batch size (from 100 to 1000 messages) and enabled Pulsar’s async batching, which pushed throughput to 1.2M msg/s per connector node. We deployed 5 connector nodes in Kubernetes, giving us 6M msg/s of total capacity, more than enough for our 10M device target (which peaks at 4M msg/s).
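Expressed against the Pulsar Java client, the producer-side tuning inside the sink task looked roughly like the sketch below. The numbers mirror what we describe above, but the exact values in our connector were config-driven rather than hard-coded, and the compression setting is an extra illustration, not part of the tuning described in this post.

```java
// Producer tuning sketch for the sink connector. Assumes an existing
// PulsarClient; the values mirror the tuning described in the text.
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.CompressionType;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public final class TunedProducerFactory {

    public static Producer<byte[]> create(PulsarClient client, String topic) throws Exception {
        return client.newProducer(Schema.BYTES)
                .topic(topic)
                .enableBatching(true)                               // async batching on
                .batchingMaxMessages(1000)                          // up from the initial 100
                .batchingMaxPublishDelay(5, TimeUnit.MILLISECONDS)  // cap latency added per batch
                .maxPendingMessages(50_000)                         // larger async in-flight queue
                .blockIfQueueFull(true)                             // back-pressure the Connect task
                .compressionType(CompressionType.LZ4)               // cheap win for telemetry payloads
                .create();
    }
}
```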
3. Data Consistency Across Clusters
During migration, we ran both Kafka and Pulsar clusters in parallel for 4 weeks. We built a custom reconciliation service that compared message hashes across both clusters every 15 minutes, alerting us to any mismatches. We only decommissioned the Kafka cluster once reconciliation showed 100% consistency for 7 consecutive days.
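The reconciliation service consumed from both clusters and bucketed messages into 15-minute windows per topic; the comparison at its core was an order-insensitive digest, roughly like the sketch below (the consumers for each cluster are omitted, and the per-topic windowing key is implied).

```java
// Core of the reconciliation check: an order-insensitive digest over all
// messages seen in a window, computed once per cluster and then compared.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;

public final class WindowReconciler {

    /** XOR of per-message SHA-256 hashes, so message order does not matter. */
    public static byte[] windowDigest(Iterable<Map.Entry<String, byte[]>> messages) throws Exception {
        byte[] combined = new byte[32];
        for (Map.Entry<String, byte[]> msg : messages) {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            sha.update(msg.getKey().getBytes(StandardCharsets.UTF_8));  // message key
            sha.update(msg.getValue());                                  // message payload
            byte[] hash = sha.digest();
            for (int i = 0; i < combined.length; i++) {
                combined[i] ^= hash[i];
            }
        }
        return combined;
    }

    public static boolean windowsMatch(byte[] kafkaDigest, byte[] pulsarDigest) {
        return MessageDigest.isEqual(kafkaDigest, pulsarDigest);  // constant-time compare
    }
}
```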
The 10M Device Milestone
By Q3 2027, we’d onboarded all 10M devices. Our Pulsar 3.2 cluster now handles 4M msg/s peak traffic with average producer latency of 12ms, 99.9% availability, and zero message loss. We’ve since scaled BookKeeper storage independently 3 times to accommodate 18 months of retained telemetry data, with no broker downtime. Kafka Connect still runs for one legacy integration we can’t yet migrate, but 98% of our traffic flows through Pulsar natively.
Lessons Learned
- Decoupled compute/storage (Pulsar’s core advantage) is non-negotiable for IoT scale: we scaled brokers 4x and storage 10x independently, avoiding overprovisioning.
- Kafka Connect is a powerful migration tool, but tune batch sizes and async settings early — default configs won’t cut it for 1M+ msg/s workloads.
- Run parallel clusters with reconciliation for 4+ weeks for mission-critical migrations: cutting over in one night is too risky for IoT workloads with long data retention needs.
We’re now exploring Pulsar 3.2’s new edge sync features to push stream processing down to gateway devices, which should cut our cloud ingress costs by 30% in 2028. The 2026 scaling push was the hardest technical challenge our team has faced — but Pulsar and Kafka Connect got us over the line.