
ANKUSH CHOUDHARY JOHAL

Originally published at johal.in

Postmortem: KEDA 2.14 Prometheus Scaler Failed to Scale Pods During Marketing Campaign Spike

Executive Summary

On May 15, 2024, our e-commerce platform experienced a 45-minute service degradation during a flash marketing campaign for a new product launch. A bug in KEDA 2.14’s Prometheus scaler prevented automatic pod scaling, leading to elevated 5xx error rates, ~$120,000 in lost revenue, and 3 customer support tickets reporting request timeouts. The incident was resolved via manual scaling and a KEDA version upgrade.

Incident Timeline (All Times UTC)

  • 14:00: Marketing campaign launches; traffic begins ramping to 10x normal baseline.
  • 14:05: Traffic hits 8x normal; Prometheus metrics show frontend pod CPU utilization >80% for 3 consecutive minutes.
  • 14:06: KEDA Prometheus scaler fails to trigger scale-out from 5 to 20 pods as configured; no scaling events logged.
  • 14:07: Alert fires for frontend 5xx error rate exceeding 5% threshold (peaks at 12%).
  • 14:10: On-call engineer joins incident bridge; checks KEDA scaler pod logs, finds repeated "failed to parse metric value: type assertion failed" errors.
  • 14:15: Engineer manually scales frontend deployment to 25 pods; error rate drops to <1% within 2 minutes.
  • 14:30: KEDA upgraded from 2.14 to 2.16, resolving the Prometheus scaler parsing bug (see Immediate Remediation).
  • 14:45: Campaign traffic normalizes to 2x baseline; pods scaled down to 8 via KEDA post-upgrade.
  • 15:00: Incident declared resolved; all metrics return to normal.

Impact

  • 45 minutes of partial service degradation
  • Peak 5xx error rate of 12% (baseline <0.1%)
  • Estimated $120,000 in lost transactions
  • 3 customer support tickets regarding request timeouts
  • No data loss or long-term customer churn attributed to the incident

Root Cause Analysis

Investigation revealed a two-part failure: a known regression in KEDA 2.14’s Prometheus scaler, and insufficient observability for KEDA components.

KEDA 2.14 Prometheus Scaler Bug

Our KEDA Prometheus scaler was configured to use the following query to trigger scaling:

increase(http_requests_total{job="frontend"}[1m]) > 1000

KEDA 2.14’s Prometheus scaler included a regression where it failed to parse integer-type metric values returned by Prometheus when using the increase() function over short time windows (≤1 minute). The scaler expected float64 values, and when the query returned an integer, a silent type assertion error occurred, causing the scaler to skip scaling without emitting a metric (only logging the error to pod stdout). This bug was fixed in KEDA 2.15, released 6 weeks prior to the incident.

Insufficient Observability

KEDA scaler pod logs were not forwarded to our central ELK stack, so the type assertion errors were not detected until an engineer manually checked pod logs. Additionally, we did not monitor the keda_scaler_errors_total metric, which would have alerted on the scaler failures immediately. Our alerting only covered application-level metrics (error rates, latency) and not KEDA-specific health indicators.
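
For context, the application-level alerting that did fire was shaped roughly like the 5xx-ratio query below. The http_requests_total metric name comes from our scaler configuration; the code label and the exact rule structure are illustrative assumptions, not the production rule:

    sum(rate(http_requests_total{job="frontend", code=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="frontend"}[5m])) > 0.05

Nothing comparable existed for KEDA's own health metrics, which is why the scaler failure only surfaced indirectly as downstream 5xx errors.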

Contributing Factor: Version Lag

KEDA 2.14 had been installed three months prior to the incident, and no process existed to track or upgrade KEDA minor versions. As a result, the team was unaware that 2.15 had fixed the Prometheus scaler bug.

Immediate Remediation

  • Manually scaled frontend deployment to 25 pods at 14:15 UTC to handle in-flight traffic.
  • Upgraded KEDA from 2.14 to 2.16 (latest stable version) at 14:30 UTC, resolving the Prometheus scaler parsing bug.
  • Updated the Prometheus scaler query to use rate() instead of increase() and added a round() function to ensure float return values:

    round(rate(http_requests_total{job="frontend"}[1m]) * 60) > 1000
    
  • Forwarded all KEDA component logs to our central ELK stack and configured an alert on keda_scaler_errors_total > 0 that fires to the on-call Slack channel; the alert expression is sketched after this list.
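
In PromQL terms, the new alert condition is roughly the following; the exact label set exposed on keda_scaler_errors_total varies by KEDA version, so treat this as a sketch rather than the exact production rule:

    keda_scaler_errors_total > 0

Because this metric is a monotonically increasing counter, a variant such as increase(keda_scaler_errors_total[5m]) > 0 is worth considering so that the alert clears once errors stop occurring; the simple form above matches what was configured during the incident response.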

Prevention Measures

  • Implement KEDA version monitoring via Datadog, with alerts triggering when the installed version is more than 1 minor version behind the latest stable release.
  • Add a load testing step to the pre-deployment pipeline that simulates 10x traffic spikes and validates KEDA scaling behavior for all deployments using KEDA scalers.
  • Add KEDA health metrics (keda_scaler_errors_total, keda_scaler_scaling_events_total, keda_scaler_current_replicas) to our standard monitoring dashboard, with alerts for anomalous values; example panel queries are sketched after this list.
  • Document manual scaling rollback procedures in our incident response runbook, including kubectl commands for common deployments.
  • Schedule quarterly reviews of all postmortem remediation items to ensure implementation.
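
As a starting point for the dashboard panels called out above, queries along these lines cover scaling activity and current replica levels. The metric names are the ones listed in that bullet; the namespace and scaledObject labels are assumptions and may differ between KEDA versions:

    sum(increase(keda_scaler_scaling_events_total[1h])) by (namespace, scaledObject)
    max(keda_scaler_current_replicas) by (namespace, scaledObject)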

Conclusion

This incident highlighted the risk of lagging behind minor version updates for critical infrastructure components like KEDA, even when no immediate issues are observed. It also underscored the importance of monitoring auxiliary components (like scalers) that directly impact application availability, not just the applications themselves. By implementing the above prevention measures, we aim to avoid similar scaling failures during future high-traffic events.
