The Challenge:
Our data processing pipeline faced continuous high-volume traffic, not just occasional spikes. Traditional Kubernetes HPA using CPU and memory metrics couldn't effectively scale based on actual message backlog. We needed a solution that could handle sustained traffic while guaranteeing zero message loss.
The KEDA + Kafka Solution:
KEDA (Kubernetes Event-Driven Autoscaler) monitors Kafka consumer group lag and scales pods based on real message queue depth, not just resource utilization. Combined with Kafka's offset commit strategies and built-in retry mechanisms, we achieved:
✅ Smart scaling based on actual consumer lag and message backlog
✅ Zero message loss through offset management and retry logic
✅ Reliable processing even during pod scaling events
✅ Optimal resource utilization for sustained high-traffic periods
✅ Cloud-agnostic deployment across multiple environments
How It Works:
When message backlog grows in Kafka topics, KEDA automatically scales up processing pods. Messages are only marked as processed after successful completion, and failures trigger automatic retries. During scaling events, Kafka's consumer group rebalancing ensures continuous, consistent processing without data loss.
The Technical Setup:
Our Kafka topic is configured with 24 partitions to handle high-volume parallel processing. Here's our KEDA ScaledObject configuration:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
  namespace: data-processing
spec:
  scaleTargetRef:
    name: data-processor-deployment
  minReplicaCount: 6        # Baseline: ~4 partitions per pod (24 / 6)
  maxReplicaCount: 30       # Upper bound; effectively capped at the 24-partition count while allowIdleConsumers is false
  pollingInterval: 30       # Check consumer lag every 30 seconds
  cooldownPeriod: 300       # Wait 5 min before scaling to zero (only relevant when minReplicaCount is 0)
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-broker-1:9092,kafka-broker-2:9092,kafka-broker-3:9092
        consumerGroup: data-processing-group
        topic: high-volume-events
        lagThreshold: '1000'            # Target average lag per replica
        activationLagThreshold: '100'   # Lag needed before the scaler reports active (matters for scale from/to zero)
        offsetResetPolicy: earliest
        allowIdleConsumers: 'false'     # Never run more consumers than partitions
```
Partition and Scaling Strategy:
- 24 Kafka partitions - Enables high parallel processing throughput
- Min 6 replicas - Ensures ~4 partitions per pod at baseline (24/6)
- Max 30 replicas - Headroom for a 1:1 partition-to-pod ratio at peak; with allowIdleConsumers: false KEDA effectively tops out at the 24-partition count, so the extra headroom only pays off if we add partitions later
- Lag threshold 1000 - KEDA targets an average lag of ~1,000 messages per replica, so desired replicas ≈ total lag / 1,000 (worked through just below)
- Polling interval 30s - KEDA checks consumer lag every 30 seconds
- Cooldown 5 min - cooldownPeriod only governs scaling back to zero; with minReplicaCount: 6, the HPA's scale-down stabilization window (300 s by default) is what damps scale-up/down oscillations
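To make that lag math concrete, here is a rough sketch of the replica calculation KEDA drives through the HPA. The function and the numbers are purely illustrative, not code from our pipeline:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas approximates the calculation the KEDA-managed HPA performs:
// replicas ≈ ceil(totalLag / lagThreshold), clamped to the configured replica
// bounds and, because allowIdleConsumers is false, to the partition count.
func desiredReplicas(totalLag, lagThreshold, minReplicas, maxReplicas, partitions int) int {
	replicas := int(math.Ceil(float64(totalLag) / float64(lagThreshold)))
	if replicas < minReplicas {
		replicas = minReplicas
	}
	if replicas > maxReplicas {
		replicas = maxReplicas
	}
	if replicas > partitions { // allowIdleConsumers: 'false'
		replicas = partitions
	}
	return replicas
}

func main() {
	fmt.Println(desiredReplicas(18000, 1000, 6, 30, 24)) // 18 pods for 18,000 messages of lag
	fmt.Println(desiredReplicas(90000, 1000, 6, 30, 24)) // capped at 24 (one pod per partition)
	fmt.Println(desiredReplicas(500, 1000, 6, 30, 24))   // never drops below the 6-pod baseline
}
```

In practice the HPA also applies its own tolerance and stabilization rules, so real replica counts move a little more conservatively than this.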
Key Implementation Details:
- Manual offset commit - Consumer.Offsets.AutoCommit.Enable = false, so Kafka never advances past an offset we haven't confirmed
- session.MarkMessage() + session.Commit() - the offset is marked and committed only after successful processing (see the consumer sketch after this list)
- Exponential backoff - 1s, 2s, 4s delays between retry attempts
- DLQ fallback - Failed messages sent to dead letter queue after 3 retries
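Putting those pieces together, here is a minimal sketch of the consumer loop using Sarama. processMessage and publishToDLQ stand in for our real handler and DLQ producer, broker and topic names are taken from the config above, and the details are illustrative rather than our exact production code:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/IBM/sarama"
)

// handler implements sarama.ConsumerGroupHandler with manual offset commits,
// retries with exponential backoff, and a DLQ fallback.
type handler struct{}

func (h *handler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (h *handler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

func (h *handler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	const maxRetries = 3
	for msg := range claim.Messages() {
		err := processMessage(msg)
		for attempt := 1; err != nil && attempt <= maxRetries; attempt++ {
			time.Sleep(time.Duration(1<<(attempt-1)) * time.Second) // 1s, 2s, 4s backoff
			err = processMessage(msg)
		}
		if err != nil {
			publishToDLQ(msg) // give up after 3 retries so the partition isn't blocked
		}
		// Mark and commit the offset only once the message is handled; a crash
		// before this point means the message is simply redelivered (at-least-once).
		sess.MarkMessage(msg, "")
		sess.Commit()
	}
	return nil
}

func processMessage(msg *sarama.ConsumerMessage) error { /* business logic */ return nil }
func publishToDLQ(msg *sarama.ConsumerMessage)         { /* produce to a DLQ topic */ }

func main() {
	cfg := sarama.NewConfig()
	cfg.Consumer.Offsets.AutoCommit.Enable = false // offsets only move via MarkMessage + Commit
	cfg.Consumer.Offsets.Initial = sarama.OffsetOldest

	brokers := []string{"kafka-broker-1:9092", "kafka-broker-2:9092", "kafka-broker-3:9092"}
	group, err := sarama.NewConsumerGroup(brokers, "data-processing-group", cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer group.Close()

	for {
		// Consume returns whenever the group rebalances (e.g. during KEDA scaling);
		// looping rejoins the group and resumes from the last committed offsets.
		if err := group.Consume(context.Background(), []string{"high-volume-events"}, &handler{}); err != nil {
			log.Printf("consume error: %v", err)
		}
	}
}
```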
The Results:
A production-grade data processing system that scales intelligently, guarantees message delivery, and optimizes resources automatically. Every message gets processed, every time — whether traffic is steady or surging.
If you're building data pipelines on Kubernetes and need to handle serious volume with absolute reliability, the KEDA + Kafka combination delivers event-driven autoscaling that actually understands your workload.