<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sam Farid</title>
    <description>The latest articles on DEV Community by Sam Farid (@sf_fc).</description>
    <link>https://dev.to/sf_fc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2163242%2Fae4c8ff6-a17c-4cb1-9826-2865bddabff3.jpg</url>
      <title>DEV Community: Sam Farid</title>
      <link>https://dev.to/sf_fc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sf_fc"/>
    <language>en</language>
    <item>
      <title>Use KEDA Scaling Modifiers to Manage AI Infrastructure</title>
      <dc:creator>Sam Farid</dc:creator>
      <pubDate>Thu, 03 Oct 2024 19:02:44 +0000</pubDate>
      <link>https://dev.to/sf_fc/use-keda-scaling-modifiers-to-manage-ai-infrastructure-44eg</link>
      <guid>https://dev.to/sf_fc/use-keda-scaling-modifiers-to-manage-ai-infrastructure-44eg</guid>
      <description>&lt;p&gt;&lt;em&gt;TL;DR KEDA’s new scaling modifiers have unlocked some incredible tactics to dynamically manage infrastructure. We think scaling modifiers are particularly good for handling the size, cost, and complexity of AI infrastructure. Here are a few examples of scaling modifiers in action&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;What are KEDA's Scaling Modifiers?&lt;/h2&gt;

&lt;p&gt;KEDA extends Kubernetes’ native Horizontal Pod Autoscaling to handle custom metrics and events. With HPAs, you’re usually adding and subtracting pod replicas based on the memory and CPU utilization of your workload. With KEDA, scaling can instead be triggered by an event such as a push notification to your users, scaling your backend to match the resulting spike in traffic.&lt;/p&gt;
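
&lt;p&gt;For context, a KEDA trigger is wired up through a ScaledObject that points at your workload and one or more event sources. A minimal sketch (the Deployment name and Prometheus query here are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: backend-scaler
spec:
  scaleTargetRef:
    name: backend                # hypothetical Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        query: sum(rate(push_notifications_total[1m]))   # hypothetical metric
        threshold: "100"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;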

&lt;p&gt;&lt;a href="https://github.com/kedacore/keda-docs/pull/1460" rel="noopener noreferrer"&gt;Scaling Modifiers went GA in KEDA 2.15&lt;/a&gt;; they let you declare powerful, flexible scaling criteria with formulas and conditions. We’ve been playing around with scaling modifiers in the month since their release, and we think they’re game-changing for managing the complexity of today’s AI infrastructure.&lt;/p&gt;

&lt;p&gt;Here are three applications to show off the Power and Glory of scaling modifiers. &lt;/p&gt;

&lt;h2&gt;Dynamic Model Retraining on Validation Failure and Available GPUs&lt;/h2&gt;

&lt;p&gt;Model retraining is typically done on a predefined schedule, but this has downsides. If retraining isn’t necessary yet, you’re running too early and wasting resources; on the other hand, if the model’s validation scores drift quickly, you’re running too late and are already serving bad results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89xuaffkuuhg38wdandb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89xuaffkuuhg38wdandb.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need a mechanism that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Triggers only when necessary, i.e. when validation scores have drifted&lt;/li&gt;
&lt;li&gt;Checks that GPUs are available for retraining&lt;/li&gt;
&lt;li&gt;Checks that there’s enough time to run without potentially interfering with production serving&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what this could look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;advanced&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;scalingModifiers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;formula&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_drift&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0.1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;available_gpus&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;request_queue_length&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1000&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;
      &lt;span class="na"&gt;activationTarget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The activationTarget is the value the formula must reach to scale the workload from 0 to 1 replica. When validation has drifted, GPUs are free, and the request queue is short, the formula evaluates to 1 and a new retraining job kicks off.&lt;/p&gt;
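
&lt;p&gt;For the formula above to resolve, each variable has to match the &lt;code&gt;name&lt;/code&gt; of a trigger defined on the same ScaledObject. Here’s a sketch of what those trigger definitions could look like (the Prometheus queries are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;triggers:
  - type: prometheus
    name: validation_drift              # referenced by name in the formula
    metadata:
      serverAddress: http://prometheus:9090
      query: model_validation_drift     # hypothetical metric
      threshold: "1"
  - type: prometheus
    name: available_gpus
    metadata:
      serverAddress: http://prometheus:9090
      query: sum(gpu_available_total)   # hypothetical metric
      threshold: "1"
  - type: prometheus
    name: request_queue_length
    metadata:
      serverAddress: http://prometheus:9090
      query: sum(request_queue_depth)   # hypothetical metric
      threshold: "1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;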

&lt;h2&gt;Scale Model Infrastructure on Tokens, Not Traffic&lt;/h2&gt;

&lt;p&gt;End-to-end response latency isn’t a useful standalone proxy for whether a model is overloaded, because the length of a response can vary wildly based on the prompt. A model’s load is better reflected by the latency of generating individual tokens.&lt;/p&gt;

&lt;p&gt;Here’s an example of how to scale up model serving replicas when token latency is creeping up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;advanced&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scalingModifiers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;formula&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(tokens_per_min&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pod_count)"&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.001"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The per-pod rate of token generation drops as model replicas become unhealthy, so this formula watches the inverse of that rate, creating more replicas as token latency rises. With a target of 0.001, KEDA holds each pod at roughly 1,000 tokens per minute: if per-pod throughput falls to 500 tokens per minute, the formula rises to 0.002 and KEDA adds replicas until the ratio recovers.&lt;/p&gt;

&lt;h2&gt;Model Rebalancing for Changing Traffic&lt;/h2&gt;

&lt;p&gt;During periods of high traffic, costs and latencies can balloon if you stick with the same model for every request. In a pinch, consider swapping in a smaller, cheaper model during peak traffic to maintain throughput and cap costs.&lt;/p&gt;

&lt;p&gt;This formula scales up replicas serving a cheaper model, triggering only at high throughput:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;advanced&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;scalingModifiers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;formula&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tokens_per_min&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5000&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;request_rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;
      &lt;span class="na"&gt;activationTarget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
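
&lt;p&gt;This formula would live on a ScaledObject targeting the cheaper model’s Deployment, which sits at zero replicas until throughput crosses the threshold. A sketch, with hypothetical names and queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: small-model-scaler
spec:
  scaleTargetRef:
    name: small-model-serving   # hypothetical Deployment for the cheaper model
  minReplicaCount: 0            # stays scaled to zero outside of peak traffic
  maxReplicaCount: 10
  advanced:
    scalingModifiers:
      formula: "tokens_per_min &amp;gt;= 5000 ? request_rate : 0"
      activationTarget: "1"
      target: "100"
  triggers:
    - type: prometheus
      name: tokens_per_min
      metadata:
        serverAddress: http://prometheus:9090
        query: sum(rate(tokens_generated_total[1m])) * 60   # hypothetical metric
        threshold: "1"
    - type: prometheus
      name: request_rate
      metadata:
        serverAddress: http://prometheus:9090
        query: sum(rate(http_requests_total[1m]))           # hypothetical metric
        threshold: "1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;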



&lt;h2&gt;Use KEDA to Automate and Scale AI Infrastructure&lt;/h2&gt;

&lt;p&gt;Scaling Modifiers are new, but we think they’ll quickly become a core tool for anyone managing large-scale infrastructure.&lt;/p&gt;

&lt;p&gt;We’d like to give a huge congrats to the KEDA core team (Jeff, Jorge, Zbynek, Jan) on this release, and recommend checking out their talk from &lt;a href="https://www.youtube.com/watch?v=_5_njiPr5vg" rel="noopener noreferrer"&gt;KubeCon Europe&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks for reading! I'll be posting more about KEDA and workload management at &lt;a href="https://www.flightcrew.io/blog" rel="noopener noreferrer"&gt;https://www.flightcrew.io/blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>keda</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
