Gledis Lami

Posted on Jun 24

Trust the Forecast, But Verify: Confidence-Gated Autoscaling for Kubernetes

#programming #devops #kubernetes #learning

A benchmark-driven look at a predictive-reactive hybrid that recovers most of a reactive safety net's protection without paying its standing cost, evaluated against HPA, KEDA, and pure forecasting across seven workloads including a real Wikipedia trace.

Introduction

Autoscaling is supposed to be the easy win of running services on Kubernetes: set a target, let the Horizontal Pod Autoscaler add and remove pods, move on. In practice, anyone who has run a latency-sensitive service under real traffic knows the rough edges.

Reactive autoscalers, HPA included, only move after a metric has already crossed a threshold. By the time CPU or request rate has climbed far enough to trigger a scale-up, the queue has already built, the tail latency has already spiked, and the new pods still have to schedule, pull images, and warm up before they serve anything. That gap between "decision" and "ready capacity" is where SLOs go to die. The usual reactions make it worse: shrink the stabilization window and the autoscaler oscillates, flapping pods up and down on noise; widen it and it reacts even later. Provision generously to be safe and you pay for headroom that sits idle most of the day.

So the standing tradeoff is blunt. React late and break latency during bursts, or over-provision and pay for capacity you rarely use. Forecasting promises a way out: if you can see the surge coming, you can have the pods ready before the traffic arrives. The catch is that forecasts are sometimes wrong, and a controller that trusts a wrong forecast fails exactly when it matters. This article is about a controller that forecasts aggressively when its forecast has been reliable and falls back to a reactive safety net when it has not, and about benchmarking that behavior honestly.

Problem statement

Three observations frame the design.

HPA struggles because it is purely reactive and metric-lagged. CPU utilisation is a proxy for load, it trails actual demand, and the control law is proportional to an error that only exists after the SLO is already at risk. KEDA scaling on request rate is a better signal, but it is still reactive: it sizes to load that has already arrived.

Forecasting can help, but only conditionally. A predictive controller that sizes to a forecast of the next interval's load can provision ahead of a ramp. On smooth, seasonal traffic this is a clear win. On an unforecastable burst or a regime change, the forecast misses and the predictive controller under-provisions, breaking the SLO precisely when load is hardest.

Hybrids are the standard fix, and they overpay. The common remedy is a static hybrid: take the maximum of the predictive recommendation and a reactive floor sized to the last observed load. The floor restores safety, but it is always on, so it charges the full reactive premium on every interval, including the long stretches where the forecast was accurate and the floor was redundant.

The question this work asks: can the floor be made to cost something only when it is actually needed?

Architecture

The system is a closed control loop around a request-serving Deployment. One interval of the loop senses load and latency, decides a replica count, and actuates it.

                       control loop, one tick per interval (e.g. 30 s)
        ┌───────────────────────────────────────────────────────────────┐
        │                                                                 │
        ▼                                                                 │
   traffic  ──►  Service (N pods)  ──►  Prometheus  ──►  Controller  ──────┘
 (trace replay      │  serves requests     scrapes:        │
  or real load)     │  p99 latency rises   RPS, p99,       ├─ forecaster (offline, replayed)
                    │  with utilisation    replica count   ├─ confidence gate (recent error)
                    │                                       ├─ predictive plan (MPC)
                    └──────────  scale to N'  ◄─────────────┴─ reactive floor → max(), rate-limited

Components:

System under test. A request-serving workload where tail latency rises with utilisation. In the controlled benchmark this is a discrete-event M/M/c queue; on the cluster it is a small instrumented HTTP service whose per-replica service rate and p99 are calibrated to the same parameters, so the two paths are comparable.
Metrics. Prometheus scrapes request rate, p99 latency, and current replica count on the control cadence. The controller reads three scalars per tick.
Forecaster. A time-series model produces a short-horizon arrival-rate forecast. It runs out of band (the heavy model is computed offline and replayed into the loop), which keeps a large dependency off the control path and makes runs deterministic.
Controller. Combines a predictive plan, a confidence-gated reactive floor, and a ramp limit into a single replica decision.
Actuator. Writes the decision to the Deployment's replica count (the Kubernetes API on the cluster, a direct setter in simulation).

The important design choice for benchmarking: the forecaster is held identical across every controller we compare. Whatever differences we measure are due to how the predictive and reactive signals are combined, not to a better or worse forecast.

Forecasting approach

The controller needs a short look-ahead of the arrival rate, not a perfect one. We use a time-series foundation model (TimesFM) as the forecaster, computed offline and replayed into the loop so the control path stays light and reproducible. The specific model matters less than two properties the controller relies on:

A short horizon. The plan looks a handful of intervals ahead. Long horizons buy little here for reasons covered in the horizon-analysis section: under a normal scaling rate limit and instantaneous-enough actuation, accurate one-step sizing captures most of the benefit, and the deeper plan is often dormant.
A usable error signal. Because the forecast is produced each interval and the actual load is observed one interval later, the controller can measure how wrong it has recently been. That realised error is the input to the confidence gate.

Prediction cadence matches the control interval: one forecast, one decision, one actuation per tick. The scaling signal is not the forecast directly; it is the replica count that the predictive plan and the gated floor jointly imply.
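The realised-error signal can be sketched in a few lines. This is a minimal illustration, assuming a fixed-size rolling window and a relative error; the class name, the window length, and the choice to report 0.0 before any forecast has been scored are assumptions made for the example, not details taken from the controller.

```python
from collections import deque


class OneStepErrorTracker:
    """Rolling mean of the relative one-step forecast error (illustrative)."""

    def __init__(self, window: int = 10, eps: float = 1e-9) -> None:
        self.errors = deque(maxlen=window)  # oldest entries fall off automatically
        self.eps = eps                      # guards against division by zero load
        self.pending = None                 # forecast made last tick, not yet scored

    def update(self, load_now: float, forecast_next: float) -> float:
        """Score last tick's forecast against load_now, then store the new one."""
        if self.pending is not None:
            rel_err = abs(self.pending - load_now) / max(load_now, self.eps)
            self.errors.append(rel_err)
        self.pending = forecast_next
        return self.mean_error()

    def mean_error(self) -> float:
        # Assumption: an empty history reports 0.0. A cautious caller may
        # prefer to treat "no history yet" as low confidence instead.
        return sum(self.errors) / len(self.errors) if self.errors else 0.0


tracker = OneStepErrorTracker(window=3)
tracker.update(load_now=100.0, forecast_next=110.0)  # nothing to score yet
print(tracker.update(load_now=100.0, forecast_next=120.0))
```

The second call scores the earlier forecast of 110 against an observed load of 100, so the tracker reports a relative error of about 0.1.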

Hybrid scaling strategy

The core idea is a reactive floor whose aggressiveness is gated by how much the forecaster has earned trust recently.

When the forecaster's recent one-step error is low, the controller is confident and keeps the floor light, letting the efficient predictive term lead and the replica count sit close to pure prediction. When the recent error rises, from drift or an unforecastable burst, confidence drops and the floor tightens: it targets a lower utilisation, which adds headroom and pulls the decision toward the safe reactive sizing. The final action is the larger of the predictive and floor recommendations, then clamped by the per-interval ramp limit.

In pseudocode, with the tuning constants abstracted:

# Confidence-gated reactive floor (illustrative; constants tuned offline).
err = rolling_mean(abs(prev_forecast - load_now) / load_now)   # recent 1-step error
confidence = clamp((E_HI - err) / (E_HI - E_LO), 0.0, 1.0)     # 1.0 = fully trust forecast

# high utilisation target (light floor) when confident, low target (more headroom) when not
util_target = U_LOW + confidence * (U_HIGH - U_LOW)

reactive_floor = ceil(load_now / (mu * util_target))           # KEDA-style sizing, gated
predictive     = mpc_plan(forecast)[0]                          # predictive recommendation

replicas = clamp_ramp(max(predictive, reactive_floor), r_prev)  # take higher, rate-limit

Operationally this behaves like an SRE's instinct. Trust the model while it is right, and the moment it starts being wrong, widen the safety margin until it proves itself again. The fallback is automatic and continuous rather than a hard mode switch, and it uses a signal the controller already has: its own recent accuracy.
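For readers who want to run the gate, here is a self-contained sketch of the pseudocode above. The constants are hypothetical placeholders chosen only so the example executes (the tuned values are not published), the predictive recommendation is passed in as a plain number rather than computed by an MPC, and the symmetric ramp limit is an assumption.

```python
import math

# Hypothetical constants for illustration only; not the tuned values.
E_LO, E_HI = 0.05, 0.30      # error band mapped onto confidence 1.0 .. 0.0
U_LOW, U_HIGH = 0.50, 0.80   # conservative vs light utilisation target
MU = 50.0                    # per-replica service rate (req/s), as in the benchmark
MAX_STEP = 3                 # assumed ramp limit (replicas per interval)


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def gated_replicas(err: float, load_now: float, predictive: int, r_prev: int) -> int:
    """Replica count for this tick.

    err        -- rolling mean of the recent relative one-step forecast error
    load_now   -- observed arrival rate (req/s)
    predictive -- replicas recommended by the predictive plan (computed elsewhere)
    r_prev     -- replicas currently deployed
    """
    confidence = clamp((E_HI - err) / (E_HI - E_LO), 0.0, 1.0)
    util_target = U_LOW + confidence * (U_HIGH - U_LOW)
    reactive_floor = math.ceil(load_now / (MU * util_target))
    wanted = max(predictive, reactive_floor, 1)
    # Ramp limit: move at most MAX_STEP replicas per interval in either direction.
    return int(clamp(wanted, r_prev - MAX_STEP, r_prev + MAX_STEP))


# Same load and same predictive plan, different recent forecast error.
print(gated_replicas(err=0.02, load_now=390.0, predictive=9, r_prev=9))
print(gated_replicas(err=0.40, load_now=390.0, predictive=9, r_prev=9))
```

With a low recent error the floor is sized at 80% utilisation and asks for 10 replicas; with a high recent error it is sized at 50%, asks for 16, and the ramp limit caps this tick at 12.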

Benchmark methodology

We evaluate in two consistent environments driven by the same controller code.

Controlled benchmark (primary). A seeded discrete-event M/M/c simulator, deterministic for a given seed, is the ground truth. It serves each interval's Poisson arrivals on the deployed replicas and reports the realised p99, so the controller is graded against simulated reality, not against the analytical model it optimises. This path lets us hold the forecaster fixed, sweep policies cleanly, and average over five random seeds per configuration. Service parameters: per-replica rate 50 req/s, SLO of p99 latency at or below 200 ms, control interval 30 s, replica ramp limit of a few pods per interval.
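To make the service parameters concrete, the sketch below uses the textbook M/M/c (Erlang C) result to find the smallest replica count whose analytical p99 response time meets the 200 ms SLO at 50 req/s per replica. It assumes first-come-first-served service and steady state, and it is a standard queueing calculation rather than the controller's own sizing code, which is not published.

```python
import math


def erlang_c(c: int, a: float) -> float:
    """Probability an arrival must queue in M/M/c, offered load a = lam/mu < c."""
    b = 1.0
    for k in range(1, c + 1):          # Erlang B recursion, numerically stable
        b = a * b / (k + a * b)
    rho = a / c
    return b / (1.0 - rho * (1.0 - b))


def response_tail(c: int, lam: float, mu: float, t: float) -> float:
    """P(response time > t) for a stable FCFS M/M/c queue (wait + service)."""
    if lam <= 0:
        return math.exp(-mu * t)       # no queueing, service time only
    pw = erlang_c(c, lam / mu)
    theta = c * mu - lam               # decay rate of the waiting time
    if math.isclose(theta, mu):
        wait_plus_service = math.exp(-mu * t) * (1.0 + mu * t)
    else:
        wait_plus_service = (
            mu * math.exp(-theta * t) - theta * math.exp(-mu * t)
        ) / (mu - theta)
    return (1.0 - pw) * math.exp(-mu * t) + pw * wait_plus_service


def min_replicas(lam: float, mu: float = 50.0, slo_s: float = 0.2, q: float = 0.99) -> int:
    """Smallest c whose analytical q-quantile response time is within slo_s.

    Terminates only if service time alone can meet the SLO, i.e.
    exp(-mu * slo_s) <= 1 - q, which holds for the defaults.
    """
    c = max(1, math.floor(lam / mu) + 1)   # first stable replica count
    while response_tail(c, lam, mu, slo_s) > 1.0 - q:
        c += 1
    return c


for load in (100.0, 250.0, 500.0):
    print(load, min_replicas(load))
```

The loop starts at the first stable replica count, so the result always leaves utilisation below 1; how much further it climbs depends on how heavy the queueing tail is at that load.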

Kubernetes reference deployment (validation). The identical controller drives a real Deployment through Prometheus on a single-node k3s/kind cluster, with kube-prometheus-stack scraping on the control cadence and an instrumented service calibrated to the same per-replica rate and SLO. CPU-based HPA and KEDA request-rate scaling run as the reactive baselines on the same service.

Workloads. Seven arrival-rate traces chosen to span the regimes that separate these controllers:

periodic and forecastable: a multi-cycle diurnal, a weekday_weekend two-scale diurnal, and a sawtooth ramp;
aperiodic and hard: a single flash_crowd spike and a multi_burst series of instantaneous steps;
semi-real and real: a real_week composite and wikipedia, real hourly English Wikipedia pageview totals rescaled to the load regime.

Metrics. SLO-violation rate (fraction of intervals whose realised p99 exceeds the SLO), cost in replica-hours, and scaling churn (sum of absolute replica changes). Statistics are the mean over five seeds; the simulator's per-interval latency is stable at these arrival counts, so seed variance is small.
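The three metrics reduce to a few lines over the per-interval series. The sketch below assumes one realised p99 value and one replica count per control interval; the function name and argument layout are illustrative.

```python
def summarise_run(p99_s, replicas, slo_s=0.2, interval_s=30.0):
    """Return (violation_rate, replica_hours, churn) for one run.

    p99_s    -- realised p99 latency per interval, in seconds
    replicas -- replica count deployed in each interval
    """
    if len(p99_s) != len(replicas) or not replicas:
        raise ValueError("need two non-empty series of equal length")
    # Fraction of intervals whose realised p99 exceeds the SLO.
    violation_rate = sum(p > slo_s for p in p99_s) / len(p99_s)
    # Each interval contributes replicas * interval length, converted to hours.
    replica_hours = sum(replicas) * interval_s / 3600.0
    # Sum of absolute replica changes between consecutive intervals.
    churn = sum(abs(b - a) for a, b in zip(replicas, replicas[1:]))
    return violation_rate, replica_hours, churn


print(summarise_run([0.10, 0.25, 0.15, 0.30], [2, 4, 4, 3]))
```

In this toy run two of four intervals breach the SLO, and the replica count changes by 2 and then by 1, giving a churn of 3.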

Baselines

CPU HPA. Native Kubernetes Horizontal Pod Autoscaler on CPU utilisation. The canonical reactive controller, and the one most teams actually run.
KEDA (request rate). Event-driven scaling on requests per second per replica. A stronger reactive baseline than CPU for request-serving workloads, and the closest cluster analog of the simulator's reactive policy.
Static provisioning / always-on floor. The reactive floor left permanently active, representing both the "over-provision to be safe" approach and the static predictive-reactive hybrid.
Pure predictive (MPC). The forecast-driven controller with no reactive floor.
Oracle. A perfect-foresight controller, included only as an upper bound on what any forecaster could achieve.

Results

The table reports SLO-violation rate with cost in replica-hours in parentheses, mean of five seeds, with the forecaster held identical across the three deployable controllers. Pure predictive is the efficient-but-fragile baseline; the static floor is the safe-but-expensive one; gated is the confidence-gated hybrid.

Workload	Pure predictive	Static floor	Confidence-gated
diurnal	0.015 (30.3)	0.000 (39.0)	0.003 (32.5)
weekday_weekend	0.016 (20.2)	0.000 (25.9)	0.003 (21.6)
sawtooth	0.020 (18.7)	0.000 (24.1)	0.004 (20.0)
flash_crowd	0.033 (13.4)	0.017 (17.5)	0.018 (14.6)
multi_burst	0.058 (11.7)	0.029 (14.8)	0.033 (12.9)
real_week	0.048 (20.1)	0.015 (25.6)	0.026 (21.7)
wikipedia	0.104 (30.5)	0.006 (40.5)	0.020 (35.0)

Reading the numbers:

Latency / SLO. Against pure prediction, the gate cuts the SLO-violation rate by 1.8x to 5.2x. The real Wikipedia trace is the sharpest case: pure prediction misses the SLO on 10.4% of intervals, the gate brings that to 2.0%, a five-fold reduction. The static floor is marginally safer still (0.6%), but at a price.

Cost. That safety costs the gate only 7% to 15% more than pure prediction. Compared to the static floor, the gate is 13% to 17% cheaper for near-equal safety. On Wikipedia the static floor spends 16% more than the gate to shave the last 1.4 points of violation rate. The reason is structural: the static floor pays the full reactive premium on every interval, while the gate pays it only when recent forecast error justifies it. On the cleanly forecastable workloads (diurnal, weekday_weekend, sawtooth), the gate's violation rates (0.003 to 0.004) are within 0.4 percentage points of the static floor's zero, at roughly 17% lower cost.

Pod churn. The gate also lowers scaling churn relative to the static floor on six of seven workloads (for example, a churn of 130 versus 165 on diurnal). The exception is multi_burst, where the repeated short bursts make the gate toggle between light and conservative and churn rises (125 versus 91). This is the one honest regression, and it is picked up again under Lessons learned.

Responsiveness vs the frontier. Plot cost on the x-axis against violation rate on the y-axis and the gate sits at the knee of the cost/SLO frontier: left of the static and reactive points (cheaper), below the pure-predictive points (safer). It does not strictly dominate either baseline. Pure prediction is always cheaper, and the static floor is always at least as safe. The contribution is the knee, the operating point a practitioner usually wants.
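The ratios quoted above can be recomputed from the table. The sketch below copies the table entries verbatim and derives the violation reduction and the two cost comparisons per workload. Because the table entries are rounded, a recomputed ratio can differ in the last digit from the figures quoted in the text.

```python
# (violation rate, replica-hours) per controller, copied from the results table.
RESULTS = {
    "diurnal":         {"pure": (0.015, 30.3), "static": (0.000, 39.0), "gated": (0.003, 32.5)},
    "weekday_weekend": {"pure": (0.016, 20.2), "static": (0.000, 25.9), "gated": (0.003, 21.6)},
    "sawtooth":        {"pure": (0.020, 18.7), "static": (0.000, 24.1), "gated": (0.004, 20.0)},
    "flash_crowd":     {"pure": (0.033, 13.4), "static": (0.017, 17.5), "gated": (0.018, 14.6)},
    "multi_burst":     {"pure": (0.058, 11.7), "static": (0.029, 14.8), "gated": (0.033, 12.9)},
    "real_week":       {"pure": (0.048, 20.1), "static": (0.015, 25.6), "gated": (0.026, 21.7)},
    "wikipedia":       {"pure": (0.104, 30.5), "static": (0.006, 40.5), "gated": (0.020, 35.0)},
}


def compare(row):
    """Derive the headline ratios for one workload row."""
    (pure_v, pure_c), (_, static_c), (gated_v, gated_c) = (
        row["pure"], row["static"], row["gated"]
    )
    return {
        "violation_cut": pure_v / gated_v,                    # pure vs gated, as a factor
        "premium_vs_pure_pct": 100.0 * (gated_c / pure_c - 1.0),
        "saving_vs_static_pct": 100.0 * (1.0 - gated_c / static_c),
    }


for name, row in RESULTS.items():
    r = compare(row)
    print(f"{name:16s} {r['violation_cut']:.1f}x  "
          f"+{r['premium_vs_pure_pct']:.1f}%  -{r['saving_vs_static_pct']:.1f}%")
```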

Forecast horizon analysis

A natural assumption is that a predictive autoscaler wants a long horizon so it can pre-scale for surges far ahead. Our experiments push back on that.

The look-ahead in a receding-horizon controller only changes the decision when the action it takes now is driven by a future requirement rather than the current one. That happens only when the controller cannot simply realise the needed capacity at the moment it is needed. There are two reasons it cannot: a per-interval ramp limit (you can add only so many pods per interval), and an actuation delay (new pods take time to become ready). Absent both, the greedy one-step action is already optimal and the longer plan is dormant.

This has direct operational consequences:

Short horizons are usually enough. Under a typical ramp limit and reasonably fast pod startup, accurate one-step sizing captures most of the benefit. A longer horizon mainly adds forecast error (predictions degrade with distance) for little gain.
Lead time, not horizon depth, handles startup lag. If pods take time to warm up, what matters is that the controller sizes for the load the new pods will face when they are ready, that is, that it accounts for the delay. A controller that plans far ahead but acts for "now" still lands its capacity late. Modelling the delay is the lever; lengthening the horizon is not.
The horizon earns its keep on sharp, ramp-bound change. Instantaneous steps that exceed the ramp limit are the case where starting to scale several intervals early genuinely pays. There the useful horizon is about the time it takes to ramp into the change, and no longer.

For the confidence-gated controller specifically, this is why a short horizon plus a reactive floor is a good combination: the predictive term handles forecastable change one step at a time, and the gated floor, not a longer horizon, is what absorbs the surprises.
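The argument about when look-ahead matters can be illustrated with a small calculation. Assume the controller already knows the replicas needed at each future step, can add at most `ramp_limit` replicas per interval, and that pods requested now become ready `lead` intervals later. The function below is a simplified feasibility check written for this illustration, not the receding-horizon optimiser used in the study.

```python
def replicas_to_request(need, ramp_limit, lead=0):
    """Replicas to request now so every future need stays reachable.

    need       -- replicas required at steps 0, 1, 2, ... of the look-ahead
    ramp_limit -- maximum replicas that can be added per interval
    lead       -- actuation delay in intervals (0 means instantly ready)
    """
    horizon = need[lead:]          # capacity requested now first serves step `lead`
    if not horizon:
        raise ValueError("look-ahead must extend past the actuation delay")
    # To reach need[k] after k more ramp-limited steps, we must already hold
    # at least need[k] - k * ramp_limit when the requested pods become ready.
    return max(n - k * ramp_limit for k, n in enumerate(horizon))


step_ahead = [4, 4, 4, 12]                                  # a sharp step three intervals out
print(replicas_to_request(step_ahead, ramp_limit=3))        # 4: the ramp can catch up
print(replicas_to_request(step_ahead, ramp_limit=2))        # 6: must start scaling early
print(replicas_to_request([4, 6, 8, 10], ramp_limit=100, lead=2))  # 8: size for ready time
```

With a generous ramp limit and no delay the answer collapses to the current need, which is the sense in which the longer plan is dormant. The horizon only changes the decision in the second case, where the step exceeds what the ramp can cover, and the third case shows that a startup delay is handled by shifting the target, not by looking further ahead.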

Code snippets

The reactive baselines are stock Kubernetes. CPU HPA, with a fast scale-up and the default anti-flap window on scale-down:

# Baseline: native HPA on CPU utilisation
spec:
  minReplicas: 1
  maxReplicas: 60
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
  behavior:
    scaleUp:   { stabilizationWindowSeconds: 0 }
    scaleDown: { stabilizationWindowSeconds: 300 }
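For reference, the scaling rule behind that manifest is the documented HPA ratio formula: desired replicas equal the ceiling of current replicas times the ratio of observed to target metric, with no change while the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces only that rule. It ignores stabilization windows, pod readiness, and missing metrics, and the function name is illustrative.

```python
import math


def hpa_desired_replicas(current_replicas, current_util, target_util=70.0,
                         tolerance=0.1, min_replicas=1, max_replicas=60):
    """Core HPA ratio rule, without stabilization or readiness handling."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas            # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(hpa_desired_replicas(4, 90.0))   # 6: CPU well above the 70% target
print(hpa_desired_replicas(4, 72.0))   # 4: inside the tolerance band
print(hpa_desired_replicas(10, 35.0))  # 5: half the target, scale down
```

The rule only produces a larger count once utilisation has already moved away from the target, which is the lag the article describes.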

A stronger reactive baseline, KEDA scaling on request rate per replica:

# Baseline: KEDA request-rate scaling (Prometheus trigger)
triggers:
  - type: prometheus
    metadata:
      threshold: "35"   # ~ mu * target utilisation
      query: sum(rate(http_requests_total{service="frontend"}[1m]))

The signals the controller reads each interval are plain PromQL:

# arrival rate (req/s)
sum(rate(http_requests_total{service="frontend"}[1m]))

# tail latency (p99, seconds)
histogram_quantile(0.99, sum by (le)(rate(http_request_duration_seconds_bucket[1m])))
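Reading those signals from Python needs only the Prometheus HTTP API's instant-query endpoint, `/api/v1/query`. The sketch below is one minimal way to do it with the standard library. The server address is an assumption to adjust for your cluster, and the helper names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROM_URL = "http://localhost:9090"   # assumed address; adjust for your cluster


def instant_query_url(promql: str, base: str = PROM_URL) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def first_scalar(payload: dict) -> float:
    """Extract the first sample value from an instant-query response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    result = payload["data"]["result"]
    if not result:
        raise LookupError("query returned no series")
    _timestamp, value = result[0]["value"]   # the value arrives as a string
    return float(value)


def read_signal(promql: str) -> float:
    """Run one instant query against PROM_URL and return a single float."""
    with urlopen(instant_query_url(promql), timeout=5) as resp:
        return first_scalar(json.load(resp))


# Parsing a canned response, so the example runs without a live server.
sample = {"status": "success",
          "data": {"resultType": "vector",
                   "result": [{"metric": {}, "value": [1700000000.0, "123.5"]}]}}
print(first_scalar(sample))   # 123.5
```

A controller would call `read_signal` once per tick for each query; note that `histogram_quantile` can return NaN when no requests were observed in the window, which the caller should handle.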

Reproducing the controlled benchmark is a single command per experiment; results are written as CSV and plotted by a separate script:

# run the confidence-gated comparison across all workloads and seeds
PYTHONPATH=src:. python -m experiments.confidence_hybrid   # -> runs/confidence_hybrid.csv
PYTHONPATH=src:. python paper/make_figures.py      # -> violations.png, frontier.png

The confidence-gate decision itself is the pseudocode shown earlier. The full controller (the receding-horizon optimiser, the queueing-model sizing, and the tuned gate constants) is part of unpublished thesis work and is deliberately left out here.

Lessons learned

Gate on the signal you already have. The forecaster's own recent error is a free, online confidence signal. It needs no extra model and no labels, and it captures exactly the failure mode that hurts: the forecast being wrong right now.
Defaults travelled. A single gate setting worked across all seven workloads with no per-workload tuning. The frontier knee did not depend on hand-fitting thresholds to each trace, which is the difference between a result and a parlour trick.
Hold the forecaster fixed. The most important methodological decision was making the forecaster identical across controllers. It is the only way to attribute a difference to the scaling policy rather than to prediction quality.
Churn is a real cost, and gating can add it. On choppy multi-burst load the gate toggled and produced more scaling actions than the static floor. Safety and churn are not the same axis, and a hybrid can win one while losing the other.
Keep the heavy model off the control path. Running the forecaster offline and replaying it kept runs deterministic and the loop dependency-light, which made five-seed sweeps and apples-to-apples comparisons practical.

Limitations

This is a benchmark study, and honesty about its scope matters.

Synthetic and rescaled workloads. Six of the seven traces are synthetic or, in the case of the real_week composite, semi-real, and the seventh is a real trace rescaled to the load regime. They are chosen to stress specific behaviours, not to represent any one production service.
A shared queueing model. The simulator and the controller's sizing both assume M/M/c, so absolute violation rates are likely optimistic. The comparisons are fair because every controller shares the model and the forecaster, but the numbers should be read as relative comparisons between controllers, not as predictions of production violation rates.
Forecast uncertainty is the whole game. The gate helps to the exact extent that recent error predicts current error. On load that is forecastable until it suddenly is not, the gate tightens only after the first miss has been observed, so it reacts one interval late, in the same interval a plain reactive floor would have responded.
Cloud variability is abstracted. Pod startup, scheduling latency, node autoscaling, and multi-tenant noise are modelled coarsely or held fixed. A production deployment adds variance the controlled benchmark does not capture.
Single service. The study is one Deployment. Coupled multi-tier systems, where scaling one service shifts another's load, are out of scope.

Conclusion

Reactive autoscaling is late by construction, and the standard fix, a permanent reactive floor on top of a forecast, buys safety by paying for it on every interval. Gating that floor by the forecaster's recent reliability recovers most of the safety while paying for it only when the forecast is actually unreliable. Across seven workloads including a real trace, the confidence-gated hybrid cut the pure-predictive SLO-violation rate by 1.8x to 5.2x for 7% to 15% more cost, came close to a static floor's safety for 13% to 17% less cost, and lowered scaling churn on six of seven workloads. It lands on the knee of the cost/SLO frontier, which is usually the point you want.

For practitioners, the takeaways are concrete. A request-rate signal beats CPU for request-serving workloads. A short forecast horizon plus a reactive floor beats a long horizon. And if you run a predictive autoscaler, gating its safety net on its own recent accuracy is a cheap way to stop paying for protection you are not using.
