<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sam Farid</title>
    <description>The latest articles on DEV Community by Sam Farid (@sf_fc).</description>
    <link>https://dev.to/sf_fc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2163242%2Fae4c8ff6-a17c-4cb1-9826-2865bddabff3.jpg</url>
      <title>DEV Community: Sam Farid</title>
      <link>https://dev.to/sf_fc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sf_fc"/>
    <language>en</language>
    <item>
      <title>Use KEDA Scaling Modifiers to Manage AI Infrastructure</title>
      <dc:creator>Sam Farid</dc:creator>
      <pubDate>Thu, 03 Oct 2024 19:02:44 +0000</pubDate>
      <link>https://dev.to/sf_fc/use-keda-scaling-modifiers-to-manage-ai-infrastructure-44eg</link>
      <guid>https://dev.to/sf_fc/use-keda-scaling-modifiers-to-manage-ai-infrastructure-44eg</guid>
      <description>&lt;p&gt;&lt;em&gt;TL;DR KEDA’s new scaling modifiers have unlocked some incredible tactics to dynamically manage infrastructure. We think scaling modifiers are particularly good for handling the size, cost, and complexity of AI infrastructure. Here are a few examples of scaling modifiers in action&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;What are KEDA's Scaling Modifiers?&lt;/h2&gt;

&lt;p&gt;KEDA extends Kubernetes’ native Horizontal Pod Autoscaling to handle custom metrics and events. With HPAs, you’re usually adding and subtracting pod replicas based on the memory and CPU utilization of your workload. With KEDA, scaling can instead be triggered by an event such as a push notification to your users, scaling your backend to match the resulting spike in traffic.&lt;/p&gt;
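
&lt;p&gt;For context, a KEDA trigger is wired up through a ScaledObject that points at your workload and one or more event sources. A minimal sketch (the Deployment name and Prometheus query here are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: backend-scaler
spec:
  scaleTargetRef:
    name: backend                # hypothetical Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        query: sum(rate(push_notifications_total[1m]))   # hypothetical metric
        threshold: "100"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;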

&lt;p&gt;&lt;a href="https://github.com/kedacore/keda-docs/pull/1460" rel="noopener noreferrer"&gt;Scaling Modifiers went GA in KEDA 2.15&lt;/a&gt;; they let you declare powerful, flexible scaling criteria with formulas and conditions. We’ve been playing around with scaling modifiers in the month since their release, and we think they’re game-changing for managing the complexity of today’s AI infrastructure.&lt;/p&gt;

&lt;p&gt;Here are three applications to show off the Power and Glory of scaling modifiers. &lt;/p&gt;

&lt;h2&gt;Dynamic Model Retraining on Validation Failure and Available GPUs&lt;/h2&gt;

&lt;p&gt;Model retraining is typically done on a predefined schedule, but this has downsides. If retraining isn’t necessary yet, you’re running too early and wasting resources; on the other hand, if the model’s validation scores drift quickly, you’re running too late and are already serving bad results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89xuaffkuuhg38wdandb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89xuaffkuuhg38wdandb.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need a mechanism that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Triggers only when necessary, i.e. when validation scores have drifted&lt;/li&gt;
&lt;li&gt;Checks that GPUs are available for retraining&lt;/li&gt;
&lt;li&gt;Checks that there’s enough time to run without potentially interfering with production serving&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what this could look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;advanced&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;scalingModifiers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;formula&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_drift&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0.1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;available_gpus&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;request_queue_length&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1000&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;
      &lt;span class="na"&gt;activationTarget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The activationTarget is the value the formula must reach to scale the workload from 0 to 1 replica. When validation has drifted, GPUs are free, and the request queue is short, the formula evaluates to 1 and a new retraining job kicks off.&lt;/p&gt;
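
&lt;p&gt;For the formula above to resolve, each variable has to match the &lt;code&gt;name&lt;/code&gt; of a trigger defined on the same ScaledObject. Here’s a sketch of what those trigger definitions could look like (the Prometheus queries are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;triggers:
  - type: prometheus
    name: validation_drift              # referenced by name in the formula
    metadata:
      serverAddress: http://prometheus:9090
      query: model_validation_drift     # hypothetical metric
      threshold: "1"
  - type: prometheus
    name: available_gpus
    metadata:
      serverAddress: http://prometheus:9090
      query: sum(gpu_available_total)   # hypothetical metric
      threshold: "1"
  - type: prometheus
    name: request_queue_length
    metadata:
      serverAddress: http://prometheus:9090
      query: sum(request_queue_depth)   # hypothetical metric
      threshold: "1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;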

&lt;h2&gt;Scale Model Infrastructure on Tokens, Not Traffic&lt;/h2&gt;

&lt;p&gt;End-to-end response latency isn’t a useful standalone proxy for whether a model is overloaded, because the length of a response can vary wildly based on the prompt. A model’s load is better reflected by the latency of generating individual tokens.&lt;/p&gt;

&lt;p&gt;Here’s an example of how to scale up model serving replicas when token latency is creeping up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;advanced&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scalingModifiers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;formula&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(tokens_per_min&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pod_count)"&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.001"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The per-pod rate of token generation drops as model replicas become unhealthy, so this formula watches the inverse of that rate, creating more replicas as token latency rises. With a target of 0.001, KEDA holds each pod at roughly 1,000 tokens per minute: if per-pod throughput falls to 500 tokens per minute, the formula rises to 0.002 and KEDA adds replicas until the ratio recovers.&lt;/p&gt;

&lt;h2&gt;Model Rebalancing for Changing Traffic&lt;/h2&gt;

&lt;p&gt;During periods of high traffic, costs and latencies can balloon if you stick with the same model for every request. In a pinch, consider swapping in a smaller, cheaper model during peak traffic to maintain throughput and cap costs.&lt;/p&gt;

&lt;p&gt;This formula scales up replicas serving a cheaper model, triggering only at high throughput:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;advanced&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;scalingModifiers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;formula&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tokens_per_min&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5000&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;request_rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;
      &lt;span class="na"&gt;activationTarget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
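
&lt;p&gt;This formula would live on a ScaledObject targeting the cheaper model’s Deployment, which sits at zero replicas until throughput crosses the threshold. A sketch, with hypothetical names and queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: small-model-scaler
spec:
  scaleTargetRef:
    name: small-model-serving   # hypothetical Deployment for the cheaper model
  minReplicaCount: 0            # stays scaled to zero outside of peak traffic
  maxReplicaCount: 10
  advanced:
    scalingModifiers:
      formula: "tokens_per_min &amp;gt;= 5000 ? request_rate : 0"
      activationTarget: "1"
      target: "100"
  triggers:
    - type: prometheus
      name: tokens_per_min
      metadata:
        serverAddress: http://prometheus:9090
        query: sum(rate(tokens_generated_total[1m])) * 60   # hypothetical metric
        threshold: "1"
    - type: prometheus
      name: request_rate
      metadata:
        serverAddress: http://prometheus:9090
        query: sum(rate(http_requests_total[1m]))           # hypothetical metric
        threshold: "1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;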



&lt;h2&gt;Use KEDA to Automate and Scale AI Infrastructure&lt;/h2&gt;

&lt;p&gt;Scaling Modifiers are new, but we think they’ll quickly become a core tool for anyone managing large-scale infrastructure.&lt;/p&gt;

&lt;p&gt;We’d like to give a huge congrats to the KEDA core team (Jeff, Jorge, Zbynek, Jan) on this release, and recommend checking out their talk from &lt;a href="https://www.youtube.com/watch?v=_5_njiPr5vg" rel="noopener noreferrer"&gt;KubeCon Europe&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks for reading! I'll be posting more about KEDA and workload management at &lt;a href="https://www.flightcrew.io/blog" rel="noopener noreferrer"&gt;https://www.flightcrew.io/blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>keda</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
