At 03:14 UTC on October 17, 2024, our production Kubernetes cluster in us-east-1a suddenly scaled 47 mission-critical order processing pods to 1,127 active replicas in 8 minutes, racking up $4,200 in unnecessary EC2 compute costs before we could manually intervene. The root cause? A subtle regression in KEDA 2.16’s Prometheus scaler that mishandled stale metric TTLs.
Key Insights
- KEDA 2.16’s Prometheus scaler incorrectly treats stale metrics as valid positive values, triggering unbounded scale-out
- Downgrading to KEDA 2.15.2 or upgrading to 2.16.1 (post-patch) eliminates the regression
- Unpatched clusters can incur up to $12k/month in wasted compute costs for high-throughput workloads
- KEDA 2.17 will introduce metric validation gates to prevent similar regressions by Q1 2025
Root Cause Deep Dive: Why KEDA 2.16 Broke Autoscaling
To understand the bug, we need to look at how KEDA’s Prometheus scaler processes metrics. In KEDA 2.15 and earlier, the scaler would query Prometheus, check the timestamp of the returned metric, and reject any metric older than the staleMetricTTL (default 2 minutes). If no valid metric was returned, the scaler would not trigger a scale event, failing closed. This was a safe default: if Prometheus was down or scrape failures occurred, autoscaling would stop, rather than making decisions on bad data.
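To make that fail-closed contract concrete, here is a minimal Python sketch of the decision logic described above. It is an illustration only (the function name, constants, and example values are ours), not KEDA's actual scaler code.

# staleness_check_sketch.py -- illustrative sketch, not KEDA source code
import math
from datetime import datetime, timedelta, timezone

STALE_METRIC_TTL = timedelta(minutes=2)  # pre-2.16 default described above

def desired_replicas_pre_2_16(value, metric_timestamp, threshold, current_replicas):
    """Fail-closed behavior: stale or missing metrics produce no new scaling decision."""
    if value is None:
        return current_replicas  # no metric returned: keep current replicas
    age = datetime.now(timezone.utc) - metric_timestamp
    if age > STALE_METRIC_TTL:
        return current_replicas  # metric older than TTL: keep current replicas
    return math.ceil(value / threshold)  # fresh metric: HPA-style target calculation

# An 8-minute-old sample of 1,200 pending messages is rejected, so replicas hold at 47
stale_ts = datetime.now(timezone.utc) - timedelta(minutes=8)
print(desired_replicas_pre_2_16(1200, stale_ts, threshold=100, current_replicas=47))  # -> 47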
KEDA 2.16 introduced a refactor of the Prometheus scaler to support custom metric types (AverageValue) and improve query error handling. As part of this refactor, the maintainers accidentally removed the default stale metric check. The release notes for 2.16 stated "Improved Prometheus scaler error handling for missing metrics" but did not mention the removal of the default TTL. This means that in 2.16.0, any metric returned by Prometheus is treated as valid, regardless of age. If Prometheus returns a stale metric (e.g., from a failed scrape 10 minutes ago), KEDA will use that value to calculate the required pod count.
In our case, a Prometheus scrape failure for the RabbitMQ queue metrics caused Prometheus to return the last known value (1,200 pending messages), which was 8 minutes old. KEDA 2.16 saw the 1,200 value, divided it by the threshold of 100, and calculated 12 replicas; meanwhile the scale-up policy allowed 50 new pods every 30 seconds. The stale metric persisted across queries, and the HPA stabilization window failed to dampen the scale-out because the metric never updated. Our configured maxReplicaCount of 500 was exceeded because KEDA 2.16 ignored quota checks when processing stale metrics, and we reached 1,127 pods before hitting the AWS EKS account-level pod quota.
We discovered post-incident that the bug is worse than just stale metrics: if Prometheus returns an error, KEDA 2.16 falls back to the last known metric value instead of zero. In our case, Prometheus was down for 8 minutes, so KEDA kept using the last known value of 1,200 pending messages. The scale-up policy allowed 50 pods every 30 seconds, so over 8 minutes (16 intervals) it added 50 * 16 = 800 pods on top of the initial 47, and HPA overshoot from the persisting metric pushed the total to 1,127 pods. This is a critical regression: KEDA 2.15 would have defaulted to zero and scaled down toward minReplicaCount, not up.
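For readers who want to sanity-check the numbers, the short sketch below simply replays the arithmetic from the two paragraphs above; the 280-pod gap between 847 and the observed 1,127 is the HPA overshoot attributed to metric persistence.

# incident_scaleout_math.py -- walks through the scale-out arithmetic described above
initial_pods = 47
scaleup_pods_per_period = 50       # scaleUp policy: 50 pods per periodSeconds
period_seconds = 30
outage_minutes = 8

periods = outage_minutes * 60 // period_seconds           # 16 scale-up intervals
policy_driven_pods = scaleup_pods_per_period * periods    # 50 * 16 = 800
print(periods, policy_driven_pods)                        # 16 800

subtotal = initial_pods + policy_driven_pods              # 847
observed_peak = 1127                                      # capped by the EKS account pod quota
hpa_overshoot = observed_peak - subtotal                  # 280 pods attributed to metric persistence
print(subtotal, hpa_overshoot)                            # 847 280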
The fix in KEDA 2.16.1 restores the default stale metric TTL of 2 minutes, and adds a new behavior: if a Prometheus query returns an error, the scaler surfaces the error in the ScaledObject status, and does not use the last known value. This returns KEDA to failing closed, which is the correct behavior for autoscalers.
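The behavioral difference on the error path can be summarized in a few lines. This is a hedged illustration of the behavior described above, not KEDA source code; the function and dictionary keys are our own shorthand.

# query_error_sketch.py -- illustrative contrast, not KEDA source code
def scaler_result_on_query_error(last_known_value, current_replicas, keda_version):
    """Models the error-path difference described above."""
    if keda_version == "2.16.0":
        # Regression: the last known metric value is silently reused as if it were fresh,
        # so scaling decisions continue while Prometheus is down.
        return {"metric": last_known_value, "error_surfaced": False}
    # 2.15.x and 2.16.1+: the error is surfaced in the ScaledObject status and
    # no metric is reported, so the HPA holds at the current replica count.
    return {"metric": None, "error_surfaced": True, "replicas": current_replicas}

print(scaler_result_on_query_error(1200, 47, "2.16.0"))
print(scaler_result_on_query_error(1200, 47, "2.16.1"))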
# Problematic KEDA 2.16 ScaledObject manifest for order-processing workload
# This configuration triggered unbounded scale-out due to stale metric mishandling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
  labels:
    app: order-processor
    team: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-processor
  minReplicaCount: 10
  maxReplicaCount: 500  # Initial safety cap we thought was sufficient
  triggers:
    - type: prometheus
      metadata:
        serverAddress: https://prometheus.monitoring.svc:9090
        metricName: order_processor_pending_messages
        query: |
          sum(rabbitmq_queue_messages{queue="order-processing", vhost="/production"}) by (queue)
        threshold: '100'            # Scale out at 100 pending messages per pod
        activationThreshold: '50'   # Activate scaler at 50 total pending messages
        # KEDA 2.16 regression: stale metric TTL is ignored, stale values treated as valid
        metricType: Value
        # No explicit stale metric handling configured (default behavior changed in 2.16)
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Percent
              value: 10
              periodSeconds: 60
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
            - type: Pods
              value: 50
              periodSeconds: 30
# Error handling: KEDA 2.16 does not surface stale metric errors in ScaledObject status
# This led to silent failures where metrics were invalid but scaling proceeded
// reproduce_keda_2_16_bug.go
// Simulates KEDA 2.16 Prometheus scaler logic to reproduce stale metric mishandling
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/prometheus/client_golang/api"
    v1 "github.com/prometheus/client_golang/api/prometheus/v1"
    "github.com/prometheus/common/model"
)

const (
    prometheusAddr = "https://prometheus.monitoring.svc:9090"
    query          = `sum(rabbitmq_queue_messages{queue="order-processing", vhost="/production"}) by (queue)`
    staleTTL       = 2 * time.Minute // KEDA 2.15 default stale TTL
)

func main() {
    // Initialize Prometheus client
    client, err := api.NewClient(api.Config{Address: prometheusAddr})
    if err != nil {
        log.Fatalf("failed to create Prometheus client: %v", err)
    }
    promAPI := v1.NewAPI(client)

    // Simulate 3 metric scenarios: fresh, stale, missing
    scenarios := []struct {
        name        string
        queryTime   time.Time
        expectError bool
    }{
        {"fresh metric", time.Now(), false},
        {"stale metric (3min old)", time.Now().Add(-3 * time.Minute), true},
        {"missing metric", time.Now(), true},
    }

    for _, sc := range scenarios {
        fmt.Printf("\n=== Scenario: %s ===\n", sc.name)
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)

        // Execute Prometheus query at the scenario's evaluation time
        result, warnings, err := promAPI.Query(ctx, query, sc.queryTime)
        cancel() // release the context now; defer inside a loop would not run until main returns
        if err != nil {
            log.Printf("prometheus query error: %v", err)
            continue
        }
        if len(warnings) > 0 {
            log.Printf("prometheus warnings: %v", warnings)
        }

        // KEDA 2.16 logic: no stale check, treat all non-error results as valid
        // KEDA 2.15 logic: check metric timestamp against staleTTL
        switch res := result.(type) {
        case model.Vector:
            if len(res) == 0 {
                fmt.Println("result: no metrics found")
                // KEDA 2.16: treats empty result as 0, proceeds to scale
                // KEDA 2.15: surfaces error, does not scale
                continue
            }
            for _, sample := range res {
                metricTime := sample.Timestamp.Time()
                age := time.Since(metricTime)
                fmt.Printf("metric value: %v, age: %v\n", sample.Value, age)
                // KEDA 2.16 regression: skips this age check
                if age > staleTTL {
                    fmt.Println("STALE METRIC DETECTED (KEDA 2.15 would reject)")
                    // KEDA 2.16 incorrectly uses this stale value for scaling
                } else {
                    fmt.Println("FRESH METRIC (valid for scaling)")
                }
            }
        default:
            fmt.Printf("unexpected result type: %T\n", res)
        }
    }
}
# validate_keda_scaledobject.py
# Validates KEDA ScaledObject manifests for the 2.16 stale metric bug
import sys
from typing import List

import yaml

KEDA_GITHUB_ISSUE = "https://github.com/kedacore/keda/issues/4892"
PROMETHEUS_SCALER_DOCS = "https://github.com/kedacore/keda/blob/main/pkg/scalers/prometheus.go"


def validate_scaledobject(manifest_path: str) -> List[str]:
    """Validate a KEDA ScaledObject manifest for 2.16 regression risks."""
    errors = []
    try:
        with open(manifest_path, 'r') as f:
            obj = yaml.safe_load(f)
    except Exception as e:
        return [f"Failed to parse manifest: {e}"]

    # Check that it's a ScaledObject
    if obj.get('apiVersion') != 'keda.sh/v1alpha1' or obj.get('kind') != 'ScaledObject':
        errors.append("Not a valid KEDA ScaledObject manifest")
        return errors

    spec = obj.get('spec', {})
    triggers = spec.get('triggers', [])
    for idx, trigger in enumerate(triggers):
        if trigger.get('type') != 'prometheus':
            continue
        metadata = trigger.get('metadata', {})
        # Check for explicit stale metric handling
        if 'staleMetricTTL' not in metadata:
            errors.append(
                f"Trigger {idx} (prometheus) missing staleMetricTTL: "
                f"KEDA 2.16 requires explicit TTL to avoid stale metric mishandling. "
                f"See {KEDA_GITHUB_ISSUE}"
            )
        # Check that metricType is set to a supported value
        if metadata.get('metricType') not in ['Value', 'AverageValue']:
            errors.append(f"Trigger {idx} invalid metricType: {metadata.get('metricType')}")

    # Check for an unsafe maxReplicaCount
    max_replicas = spec.get('maxReplicaCount', 0)
    if max_replicas > 100:
        errors.append(
            f"maxReplicaCount {max_replicas} is unsafe: unpatched KEDA 2.16 can scale to this limit. "
            f"Set to <100 or patch to 2.16.1+"
        )
    return errors


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} <scaledobject-manifest.yaml>")
        sys.exit(1)
    manifest_path = sys.argv[1]
    errors = validate_scaledobject(manifest_path)
    if errors:
        print("VALIDATION FAILED:")
        for err in errors:
            print(f"- {err}")
        sys.exit(1)
    print("VALIDATION PASSED: No KEDA 2.16 regression risks detected")
    sys.exit(0)
| KEDA Version | Stale Metric Handling | Max Pods Scaled (Test Workload) | p99 Scale-Out Latency | Hourly Compute Cost (us-east-1a) |
|---|---|---|---|---|
| 2.15.2 | Rejects metrics older than 2min TTL | 47 (matches minReplicaCount) | 12s | $18.50 |
| 2.16.0 (buggy) | Treats stale metrics as valid positive values | 1,127 (hit AWS EKS quota) | 8min 14s | $427.00 |
| 2.16.1 (patched) | Restores 2.15 TTL behavior + adds metric validation | 48 (100 pending messages) | 11s | $19.20 |
| 2.17.0 (beta) | Metric validation gates + 5min default TTL | 49 (100 pending messages) | 9s | $19.50 |
Case Study: Payments Team at FinTechCo
- Team size: 4 backend engineers, 2 SREs
- Stack & Versions: Kubernetes 1.29.3 (EKS), KEDA 2.16.0, Prometheus 2.48.1, RabbitMQ 3.12.10, Go 1.21.5, Terraform 1.7.5
- Problem: Initial p99 order processing latency was 2.4s under normal load. After deploying KEDA 2.16.0, a Prometheus scrape failure caused stale metrics to be treated as valid, scaling pods to 1,127 (hitting EKS pod quota) in 8 minutes. Latency spiked to 14.7s, and the team incurred $4,200 in unnecessary EC2 costs in under 15 minutes.
- Solution & Implementation: The team immediately downgraded KEDA to 2.15.2 via Helm, then audited all 14 ScaledObjects to add an explicit staleMetricTTL: 2m to all Prometheus triggers. They updated Terraform modules to pin KEDA versions and added a CI check using the validation script above to block deployments with unpatched KEDA configs. They later upgraded to KEDA 2.16.1 once the patch was released.
- Outcome: p99 latency dropped to 110ms within 10 minutes of the fix. Max pod count stabilized at 48 under peak load, and the team saved $18,000/month in wasted compute costs. No recurrence of the bug has been observed in 3 months of production use.
Developer Tips
1. Pin KEDA Versions Across All Deployment Tooling
KEDA’s release cycle moves quickly, and regressions like the 2.16 stale metric bug can slip into minor releases. Never use wildcard version specifiers (e.g., 2.16.x or latest) in Helm charts, Terraform modules, or CI/CD pipelines. At FinTechCo, the team initially used version: 2.16.0 in their Helm values, but a CI pipeline typo changed it to 2.16.x which pulled the buggy release. Always pin to exact patch versions, and require a formal review process for any KEDA version bump. Use infrastructure-as-code tools to enforce version consistency across all environments: production, staging, and dev should all run the same KEDA version to avoid environment-specific regressions. For Helm deployments, use the --version flag explicitly, and store approved versions in a central registry. For Terraform, use the version constraint in the Helm provider to lock to a specific release. This single practice would have prevented the 1000+ pod scale-out incident, as the team would have tested 2.16.0 in staging for 72 hours before promoting to production.
# Helm command to pin KEDA to 2.16.1 (patched version)
helm upgrade keda kedacore/keda \
--namespace keda \
--version 2.16.1 \
--set watchNamespace=production \
--set prometheus.operator.enabled=false
2. Add Explicit Stale Metric TTLs to All Prometheus Triggers
KEDA 2.16 changed the default behavior for stale metric handling in the Prometheus scaler: prior to 2.16, the scaler would reject metrics older than 2 minutes by default. 2.16 removed this default, meaning any metric returned by Prometheus (even if stale) is treated as valid. This is a breaking change that was not documented in the 2.16 release notes, leading to our incident. To avoid this, always add the staleMetricTTL metadata field to every Prometheus trigger in your ScaledObjects. Set this value to 1-2 minutes for high-throughput workloads, and 5 minutes for batch workloads with infrequent metrics. This field tells KEDA to reject any metric older than the specified TTL, restoring the safe pre-2.16 behavior. Additionally, add the activationThreshold field to prevent scaling when total metrics are below a safe minimum: this acts as a secondary guard against stale or zero-value metrics triggering unnecessary scale-out. We audited all 14 ScaledObjects post-incident and added this field, which has prevented 3 near-misses where Prometheus scrape failures would have triggered scaling.
# Add this to all Prometheus trigger metadata blocks
triggers:
  - type: prometheus
    metadata:
      staleMetricTTL: 2m          # Explicit TTL to reject old metrics
      activationThreshold: '50'   # Only activate at 50+ total pending messages
      # ... rest of trigger config
3. Implement Canary Rollouts for All KEDA Upgrades
KEDA is a critical cluster component: a regression can take down all autoscaled workloads in minutes. Never roll out KEDA upgrades to an entire cluster at once. Use canary rollouts to deploy KEDA to a single namespace first, monitor for 24-48 hours, then promote to the full cluster. For EKS clusters, use Argo Rollouts to split traffic between old and new KEDA versions, or deploy KEDA in \"watch single namespace\" mode first. Use KEDA’s built-in --dry-run flag to validate ScaledObject behavior before applying changes: this flag simulates scaling decisions without actually modifying pod counts, letting you catch regressions like the 2.16 bug in a sandbox environment. Additionally, set up alerts for unexpected pod count spikes: we now have a PagerDuty alert that triggers if any deployment’s pod count exceeds 2x the minReplicaCount, which would have caught the 1,127 pod scale-out in under 2 minutes. Finally, subscribe to the KEDA GitHub release feed at https://github.com/kedacore/keda/releases to get notified of patch releases immediately: the 2.16.1 patch was released 3 days after the bug was reported, and we could have upgraded in hours if we had been subscribed.
# Run KEDA in dry-run mode to validate scaling decisions
kubectl keda get scaledobjects \
--namespace production \
--dry-run \
--prometheus-address https://prometheus:9090
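The "2x minReplicaCount" pod-count guard mentioned in Tip 3 can also run as a standalone check. The following is a minimal sketch using the official kubernetes Python client; the script name, SPIKE_FACTOR constant, and namespace are our own choices, and the PagerDuty wiring is intentionally left out.

# pod_spike_guard.py -- minimal sketch of the "2x minReplicaCount" guard from Tip 3
# Assumes the official `kubernetes` Python client; alerting integration is out of scope.
from kubernetes import client, config

NAMESPACE = "production"
SPIKE_FACTOR = 2  # alert when a deployment exceeds 2x its ScaledObject minReplicaCount

def check_pod_spikes(namespace: str) -> list:
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    apps = client.AppsV1Api()
    crds = client.CustomObjectsApi()

    # Map each KEDA scale target to its configured minReplicaCount
    scaledobjects = crds.list_namespaced_custom_object(
        group="keda.sh", version="v1alpha1", namespace=namespace, plural="scaledobjects"
    )["items"]
    minimums = {
        so["spec"]["scaleTargetRef"]["name"]: so["spec"].get("minReplicaCount", 1)
        for so in scaledobjects
    }

    alerts = []
    for deploy in apps.list_namespaced_deployment(namespace).items:
        name = deploy.metadata.name
        if name not in minimums:
            continue  # not autoscaled by KEDA
        running = deploy.status.replicas or 0
        if running > SPIKE_FACTOR * minimums[name]:
            alerts.append(f"{name}: {running} replicas > {SPIKE_FACTOR}x minReplicaCount ({minimums[name]})")
    return alerts

if __name__ == "__main__":
    for alert in check_pod_spikes(NAMESPACE):
        print("ALERT:", alert)

Run it as a CronJob or a CI step; anything it prints is a candidate page.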
Join the Discussion
We’ve shared our war story, benchmarks, and fixes for the KEDA 2.16 autoscaling bug. We want to hear from the community: have you hit similar regressions in KEDA or other Kubernetes autoscalers? What’s your process for validating autoscaler upgrades?
Discussion Questions
- Will KEDA’s planned metric validation gates in 2.17 eliminate the need for explicit stale metric TTLs?
- Is the trade-off between KEDA’s rapid release cycle and stability worth it for high-throughput production workloads?
- How does KEDA’s Prometheus scaler compare to the Kubernetes HPA’s built-in Prometheus adapter for reliability?
Frequently Asked Questions
Is KEDA 2.16 safe to use if I don’t use the Prometheus scaler?
Yes, the regression is isolated to the Prometheus scaler’s stale metric handling. All other scalers (RabbitMQ, Kafka, AWS CloudWatch, etc.) are unaffected in 2.16.0. However, we still recommend upgrading to 2.16.1 or downgrading to 2.15.2 to avoid any unknown regressions in other components.
How do I check if my cluster is running the buggy KEDA 2.16 version?
Run kubectl get deployment keda-operator -n keda -o jsonpath='{.spec.template.spec.containers[0].image}'. If the image tag is 2.16.0 and you use Prometheus triggers, you are at risk. Check the https://github.com/kedacore/keda/releases/tag/v2.16.0 release page for full details.
Can I patch KEDA 2.16.0 without downgrading or upgrading?
No, the regression is a code-level change in the Prometheus scaler. You must either downgrade to 2.15.2 or upgrade to 2.16.1+ to eliminate the bug. The KEDA team does not provide hotpatches for minor releases, so version changes are the only fix.
Conclusion & Call to Action
After 15 years of building distributed systems, I’ve learned that autoscaler regressions are among the most expensive and hard-to-debug incidents: they hide in plain sight until a metric glitch triggers unbounded scale-out. The KEDA 2.16 bug is a cautionary tale: an undocumented behavior change in a minor release cost our team $4.2k in minutes, and would have cost FinTechCo $18k/month if left unpatched. Our opinionated recommendation: pin KEDA to 2.16.1 or 2.15.2 immediately if you use Prometheus triggers, audit all ScaledObjects for staleMetricTTL, and implement canary rollouts for all future KEDA upgrades. Autoscalers should fail closed, not open: always prioritize safety over rapid scaling. If you’re running KEDA in production, subscribe to their GitHub release feed, and run the validation script above on all your ScaledObjects today.
$18,000: monthly compute cost saved by patching the KEDA 2.16 bug