The Challenge:
Our data processing pipeline faced continuous high-volume traffic, not just occasional spikes. Traditional Kubernetes HPA using CPU and memory metrics couldn't effectively scale based on actual message backlog. We needed a solution that could handle sustained traffic while guaranteeing zero message loss.
The KEDA + Kafka Solution:
KEDA (Kubernetes Event-Driven Autoscaler) monitors Kafka consumer group lag and scales pods based on real message queue depth, not just resource utilization. Combined with Kafka's offset commit strategies and built-in retry mechanisms, we achieved:
✅ Smart scaling based on actual consumer lag and message backlog
✅ Zero message loss through offset management and retry logic
✅ Reliable processing even during pod scaling events
✅ Optimal resource utilization for sustained high-traffic periods
✅ Cloud-agnostic deployment across multiple environments
How It Works:
When message backlog grows in Kafka topics, KEDA automatically scales up processing pods. Messages are only marked as processed after successful completion, and failures trigger automatic retries. During scaling events, Kafka's consumer group rebalancing ensures continuous, consistent processing without data loss.
The Technical Setup:
Our Kafka topic is configured with 24 partitions to handle high-volume parallel processing. Here's our KEDA ScaledObject configuration:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
  namespace: data-processing
spec:
  scaleTargetRef:
    name: data-processor-deployment
  minReplicaCount: 6        # Baseline: ~4 partitions per pod (24 / 6)
  maxReplicaCount: 30       # Upper bound; effectively capped at the 24-partition count while allowIdleConsumers is false
  pollingInterval: 30       # Check consumer lag every 30 seconds
  cooldownPeriod: 300       # Wait 5 min before scaling to zero (only relevant when minReplicaCount is 0)
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-broker-1:9092,kafka-broker-2:9092,kafka-broker-3:9092
        consumerGroup: data-processing-group
        topic: high-volume-events
        lagThreshold: '1000'            # Target average lag per replica
        activationLagThreshold: '100'   # Lag needed before the scaler reports active (matters for scale from/to zero)
        offsetResetPolicy: earliest
        allowIdleConsumers: 'false'     # Never run more consumers than partitions
```
Partition and Scaling Strategy:
- 24 Kafka partitions - Enables high parallel processing throughput
- Min 6 replicas - Ensures ~4 partitions per pod at baseline (24/6)
- Max 30 replicas - Headroom for a 1:1 partition-to-pod ratio at peak; with allowIdleConsumers: false KEDA effectively tops out at the 24-partition count, so the extra headroom only pays off if we add partitions later
- Lag threshold 1000 - KEDA targets an average lag of ~1,000 messages per replica, so desired replicas ≈ total lag / 1,000 (worked through just below)
- Polling interval 30s - KEDA checks consumer lag every 30 seconds
- Cooldown 5 min - cooldownPeriod only governs scaling back to zero; with minReplicaCount: 6, the HPA's scale-down stabilization window (300 s by default) is what damps scale-up/down oscillations
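To make that lag math concrete, here is a rough sketch of the replica calculation KEDA drives through the HPA. The function and the numbers are purely illustrative, not code from our pipeline:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas approximates the calculation the KEDA-managed HPA performs:
// replicas ≈ ceil(totalLag / lagThreshold), clamped to the configured replica
// bounds and, because allowIdleConsumers is false, to the partition count.
func desiredReplicas(totalLag, lagThreshold, minReplicas, maxReplicas, partitions int) int {
	replicas := int(math.Ceil(float64(totalLag) / float64(lagThreshold)))
	if replicas < minReplicas {
		replicas = minReplicas
	}
	if replicas > maxReplicas {
		replicas = maxReplicas
	}
	if replicas > partitions { // allowIdleConsumers: 'false'
		replicas = partitions
	}
	return replicas
}

func main() {
	fmt.Println(desiredReplicas(18000, 1000, 6, 30, 24)) // 18 pods for 18,000 messages of lag
	fmt.Println(desiredReplicas(90000, 1000, 6, 30, 24)) // capped at 24 (one pod per partition)
	fmt.Println(desiredReplicas(500, 1000, 6, 30, 24))   // never drops below the 6-pod baseline
}
```

In practice the HPA also applies its own tolerance and stabilization rules, so real replica counts move a little more conservatively than this.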
Key Implementation Details:
- Manual offset commit - Consumer.Offsets.AutoCommit.Enable = false, so Kafka never advances past an offset we haven't confirmed
- session.MarkMessage() + session.Commit() - the offset is marked and committed only after successful processing (see the consumer sketch after this list)
- Exponential backoff - 1s, 2s, 4s delays between retry attempts
- DLQ fallback - Failed messages sent to dead letter queue after 3 retries
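Putting those pieces together, here is a minimal sketch of the consumer loop using Sarama. processMessage and publishToDLQ stand in for our real handler and DLQ producer, broker and topic names are taken from the config above, and the details are illustrative rather than our exact production code:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/IBM/sarama"
)

// handler implements sarama.ConsumerGroupHandler with manual offset commits,
// retries with exponential backoff, and a DLQ fallback.
type handler struct{}

func (h *handler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (h *handler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

func (h *handler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	const maxRetries = 3
	for msg := range claim.Messages() {
		err := processMessage(msg)
		for attempt := 1; err != nil && attempt <= maxRetries; attempt++ {
			time.Sleep(time.Duration(1<<(attempt-1)) * time.Second) // 1s, 2s, 4s backoff
			err = processMessage(msg)
		}
		if err != nil {
			publishToDLQ(msg) // give up after 3 retries so the partition isn't blocked
		}
		// Mark and commit the offset only once the message is handled; a crash
		// before this point means the message is simply redelivered (at-least-once).
		sess.MarkMessage(msg, "")
		sess.Commit()
	}
	return nil
}

func processMessage(msg *sarama.ConsumerMessage) error { /* business logic */ return nil }
func publishToDLQ(msg *sarama.ConsumerMessage)         { /* produce to a DLQ topic */ }

func main() {
	cfg := sarama.NewConfig()
	cfg.Consumer.Offsets.AutoCommit.Enable = false // offsets only move via MarkMessage + Commit
	cfg.Consumer.Offsets.Initial = sarama.OffsetOldest

	brokers := []string{"kafka-broker-1:9092", "kafka-broker-2:9092", "kafka-broker-3:9092"}
	group, err := sarama.NewConsumerGroup(brokers, "data-processing-group", cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer group.Close()

	for {
		// Consume returns whenever the group rebalances (e.g. during KEDA scaling);
		// looping rejoins the group and resumes from the last committed offsets.
		if err := group.Consume(context.Background(), []string{"high-volume-events"}, &handler{}); err != nil {
			log.Printf("consume error: %v", err)
		}
	}
}
```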
The Results:
A production-grade data processing system that scales intelligently, guarantees message delivery, and optimizes resources automatically. Every message gets processed, every time — whether traffic is steady or surging.
If you're building data pipelines on Kubernetes and need to handle serious volume with absolute reliability, the KEDA + Kafka combination delivers event-driven autoscaling that actually understands your workload.