<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lukas Brunner</title>
    <description>The latest articles on DEV Community by Lukas Brunner (@lukas_brunner).</description>
    <link>https://dev.to/lukas_brunner</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3887850%2Fad139a3f-c71b-447d-9898-085dcf3ec0e8.jpg</url>
      <title>DEV Community: Lukas Brunner</title>
      <link>https://dev.to/lukas_brunner</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lukas_brunner"/>
    <language>en</language>
    <item>
      <title>Detecting Silent Model Failure: Drift Monitoring That Actually Works</title>
      <dc:creator>Lukas Brunner</dc:creator>
      <pubDate>Thu, 21 May 2026 06:52:38 +0000</pubDate>
      <link>https://dev.to/lukas_brunner/detecting-silent-model-failure-drift-monitoring-that-actually-works-5ge0</link>
      <guid>https://dev.to/lukas_brunner/detecting-silent-model-failure-drift-monitoring-that-actually-works-5ge0</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Most drift monitoring setups alert on the wrong thing. Feature distribution drift is cheap to compute and almost always misleading. Prediction drift plus a delayed ground-truth feedback loop catches the failures that actually cost money. Here is the setup I use at Yokoy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A model that returns HTTP 200 with a plausible-looking float is the worst kind of broken. No exception, no pager, no Slack message. The metric only moves three weeks later when finance reviews the numbers.&lt;/p&gt;

&lt;p&gt;I have spent the last two years rebuilding the monitoring story for our expense classification models. What follows is what I kept after throwing out the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mistake I keep seeing
&lt;/h2&gt;

&lt;p&gt;Teams instrument input feature drift first because it is the easiest thing to compute. Pull yesterday's feature values, pull today's, run a KS test on each column, alert when p &amp;lt; 0.05.&lt;/p&gt;

&lt;p&gt;This generates noise. A lot of noise.&lt;/p&gt;

&lt;p&gt;Features drift constantly for reasons that have nothing to do with model quality. A new customer onboards, the merchant category distribution shifts, you get a Slack ping at 03:00 for something that does not matter. After two weeks of this, on-call mutes the channel. After four weeks, the channel is deleted.&lt;/p&gt;

&lt;p&gt;The problem is not the test. The problem is that input drift is a weak proxy for what you actually care about: whether model performance degraded.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to monitor instead
&lt;/h2&gt;

&lt;p&gt;Three signals, ranked by cost and value.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Compute cost&lt;/th&gt;
&lt;th&gt;Latency to detect&lt;/th&gt;
&lt;th&gt;False positive rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input feature drift&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prediction distribution drift&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance vs delayed labels&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Days to weeks&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Prediction drift is the underrated one. If your model starts returning a different distribution of outputs without you shipping new weights, something upstream changed. Could be the feature pipeline. Could be a provider returning malformed embeddings. Could be a real population shift. All of these are worth investigating.&lt;/p&gt;

&lt;p&gt;The detection logic is short:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;wasserstein_distance&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prediction_drift_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reference&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;wasserstein_distance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reference&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# reference = predictions from the validation window when the model was promoted
# current = predictions from the last 24h of production traffic
# alert when score exceeds the 99th percentile of bootstrapped baseline scores
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
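&lt;p&gt;The comments in the snippet above mention a bootstrapped baseline without showing it. A minimal sketch of one way to build that threshold follows. It is an illustration under assumptions, not the production code: &lt;code&gt;wasserstein_1d&lt;/code&gt; is a standard-library stand-in for &lt;code&gt;scipy.stats.wasserstein_distance&lt;/code&gt; that only handles equal sample sizes, and &lt;code&gt;bootstrap_threshold&lt;/code&gt;, the sample sizes, and the synthetic scores are all hypothetical.&lt;/p&gt;

```python
import random


def wasserstein_1d(a, b):
    """1-D Wasserstein distance between two samples of equal size."""
    if len(a) != len(b):
        raise ValueError("this simplified version needs equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def bootstrap_threshold(reference, n_rounds=200, sample_size=500, pct=0.99, seed=0):
    """Quantile of distances between pairs of resamples of the reference window.

    Both resamples come from the same data, so the scores show how large the
    distance gets from sampling noise alone.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        a = rng.choices(reference, k=sample_size)
        b = rng.choices(reference, k=sample_size)
        scores.append(wasserstein_1d(a, b))
    scores.sort()
    return scores[min(int(pct * n_rounds), n_rounds - 1)]


# Synthetic stand-ins for real prediction scores (assumption for the demo).
rng = random.Random(1)
reference = [rng.betavariate(2, 5) for _ in range(5000)]
threshold = bootstrap_threshold(reference)

# Simulated production window whose scores moved up by 0.2.
current = [min(1.0, rng.betavariate(2, 5) + 0.2) for _ in range(500)]
score = wasserstein_1d(rng.choices(reference, k=len(current)), current)
alert = score > threshold
```

&lt;p&gt;The baseline compares samples of the same size as the production comparison, so the threshold and the daily score are on the same footing.&lt;/p&gt;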



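&lt;p&gt;I use Wasserstein distance rather than KS for prediction monitoring. With large samples the KS test has enough power to flag differences far too small to matter, and production samples are large: at 500k predictions per day, KS rejects the null hypothesis for shifts nobody cares about. Wasserstein distance measures how far the distribution moved rather than whether a difference is detectable, so it does not grow just because the sample did.&lt;/p&gt;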
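&lt;p&gt;To put a number on the KS sensitivity, here is a small worked calculation using the standard large-sample approximation for the two-sample KS critical value. The function name is hypothetical, and the assumption that both windows hold 500k predictions is mine.&lt;/p&gt;

```python
import math


def ks_critical_distance(n, m, alpha=0.05):
    """Smallest CDF gap the two-sample KS test rejects at level alpha.

    Uses the asymptotic approximation D > c(alpha) * sqrt((n + m) / (n * m))
    with c(alpha) = sqrt(-ln(alpha / 2) / 2).
    """
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


small_windows = ks_critical_distance(1_000, 1_000)      # about 0.061
large_windows = ks_critical_distance(500_000, 500_000)  # about 0.0027
```

&lt;p&gt;At 500k samples per window, a gap of roughly 0.3 percentage points between the two empirical CDFs is already enough to reject at p = 0.05.&lt;/p&gt;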

&lt;h2&gt;
  
  
  The feedback loop is non-negotiable
&lt;/h2&gt;

&lt;p&gt;For expense classification, ground truth arrives when a human approves or corrects the prediction. Median latency is four days. P95 is three weeks.&lt;/p&gt;

&lt;p&gt;We log every prediction with a join key and write it to a Parquet table partitioned by date. When labels arrive, a nightly Kubeflow pipeline joins them and computes per-segment performance: accuracy per merchant category, per country, per customer tier.&lt;/p&gt;

&lt;p&gt;The per-segment view is what surfaces the failures. Aggregate accuracy stays at 94% while accuracy on a specific Swiss VAT category collapses to 71%. The aggregate view would never have caught it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified pipeline component spec&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;compute-segmented-metrics&lt;/span&gt;
  &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;predictions_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gs://yokoy-ml/predictions/dt={{date}}&lt;/span&gt;
    &lt;span class="na"&gt;labels_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gs://yokoy-ml/labels/dt={{date}}&lt;/span&gt;
  &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metrics_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gs://yokoy-ml/metrics/dt={{date}}&lt;/span&gt;
  &lt;span class="na"&gt;segments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;merchant_category&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;country&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;customer_tier&lt;/span&gt;
  &lt;span class="na"&gt;resource_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;16Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cost: roughly 12 minutes of compute per day on our volume. The value: every regression we caught in the last 18 months was caught here, not by drift monitoring.&lt;/p&gt;
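&lt;p&gt;The join and the per-segment computation can be sketched in a few lines. This is a simplified in-memory illustration, not the Kubeflow component: the function name, the field names &lt;code&gt;join_key&lt;/code&gt; and &lt;code&gt;predicted&lt;/code&gt;, and the sample rows are assumptions.&lt;/p&gt;

```python
from collections import defaultdict


def segmented_accuracy(predictions, labels, segment_keys):
    """Accuracy per (dimension, value) cell, using only rows whose label arrived.

    predictions: list of dicts holding "join_key", "predicted" and segment fields.
    labels: dict mapping join_key to the human-confirmed class.
    Returns {(dimension, value): (accuracy, sample_count)}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in predictions:
        truth = labels.get(row["join_key"])
        if truth is None:
            continue  # label still pending, pick it up in a later run
        for key in segment_keys:
            cell = (key, row[key])
            totals[cell] += 1
            hits[cell] += int(row["predicted"] == truth)
    return {cell: (hits[cell] / totals[cell], totals[cell]) for cell in totals}


predictions = [
    {"join_key": "p1", "predicted": "travel", "country": "DE", "customer_tier": "smb"},
    {"join_key": "p2", "predicted": "meals", "country": "DE", "customer_tier": "enterprise"},
    {"join_key": "p3", "predicted": "travel", "country": "CH", "customer_tier": "smb"},
    {"join_key": "p4", "predicted": "meals", "country": "CH", "customer_tier": "smb"},
    {"join_key": "p5", "predicted": "travel", "country": "CH", "customer_tier": "smb"},
]
labels = {"p1": "travel", "p2": "meals", "p3": "meals", "p4": "meals"}  # p5 pending
metrics = segmented_accuracy(predictions, labels, ["country", "customer_tier"])
```

&lt;p&gt;In this toy data, overall accuracy on labelled rows is 75%, while the CH cell sits at 50%. Returning the sample count next to each accuracy makes it easy to ignore cells that are too small to trust.&lt;/p&gt;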

&lt;h2&gt;
  
  
  Where input drift still earns its place
&lt;/h2&gt;

&lt;p&gt;I have not fully abandoned input drift. It is useful as a debugging tool after the fact. When per-segment accuracy drops, the first question is which features moved. Having the historical drift scores already computed means the investigation starts with a query instead of a backfill.&lt;/p&gt;

&lt;p&gt;So compute it, store it, do not alert on it.&lt;/p&gt;
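&lt;p&gt;A rough sketch of what "compute it, store it, do not alert on it" could look like. Everything here is an assumption made for illustration: the SQLite table, the function names, the two features, and the date. The KS statistic is computed by hand to keep the example dependency-free.&lt;/p&gt;

```python
import sqlite3
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def store_scores(conn, day, scores):
    """Persist one drift score per feature per day; no alerting happens here."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feature_drift ("
        "day TEXT, feature TEXT, score REAL, PRIMARY KEY (day, feature))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO feature_drift VALUES (?, ?, ?)",
        [(day, feature, score) for feature, score in scores.items()],
    )
    conn.commit()


def top_movers(conn, first_day, last_day, limit=3):
    """Features with the highest peak drift score in the given date range."""
    return conn.execute(
        "SELECT feature, MAX(score) AS peak FROM feature_drift "
        "WHERE day BETWEEN ? AND ? GROUP BY feature ORDER BY peak DESC LIMIT ?",
        (first_day, last_day, limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
baseline = {"amount": [10, 12, 15, 20, 22], "n_line_items": [1, 1, 2, 3, 2]}
today = {"amount": [30, 35, 41, 44, 50], "n_line_items": [1, 2, 2, 3, 1]}
store_scores(conn, "2026-05-20", {f: ks_statistic(baseline[f], today[f]) for f in baseline})
movers = top_movers(conn, "2026-05-01", "2026-05-31")
```

&lt;p&gt;When a segment regresses, &lt;code&gt;top_movers&lt;/code&gt; over the affected dates answers "which features moved" without recomputing anything.&lt;/p&gt;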

&lt;h2&gt;
  
  
  A note on LLM-based features
&lt;/h2&gt;

&lt;p&gt;We added an LLM-derived feature last year for invoice text classification, routed through a gateway in front of multiple providers (Bifrost handles this for us, though others like LiteLLM or Portkey cover the same ground). The drift profile changed immediately. Provider model updates, even minor ones, shift the feature distribution in ways you cannot see from your side.&lt;/p&gt;

&lt;p&gt;Lesson: pin the provider model version explicitly. Treat a provider model change as a feature pipeline change. Re-run the validation set. This sounds obvious until the day a default model alias updates and you find out from the metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Per-segment monitoring has a cardinality problem. Cross three segment dimensions with 50, 30, and 5 values and you get 50 × 30 × 5 = 7,500 cells. Most are empty or have too few samples for meaningful metrics. We use a minimum sample threshold of 100 per cell per day and accept that issues in long-tail segments take longer to detect.&lt;/p&gt;
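&lt;p&gt;The cell arithmetic and the sample threshold, written out as a small sketch. The helper names and the example counts are hypothetical; only the cardinalities and the threshold of 100 come from the text.&lt;/p&gt;

```python
from math import prod

MIN_SAMPLES_PER_CELL = 100  # threshold stated in the text


def total_cells(cardinalities):
    """Number of cells when every segment dimension is crossed with the others."""
    return prod(cardinalities)


def split_by_sample_size(cell_counts, min_samples=MIN_SAMPLES_PER_CELL):
    """Separate cells with enough labelled samples from those that must wait."""
    evaluable = {c: n for c, n in cell_counts.items() if n >= min_samples}
    deferred = {c: n for c, n in cell_counts.items() if min_samples > n}
    return evaluable, deferred


cells = total_cells([50, 30, 5])

# Hypothetical labelled-sample counts for three cells on one day.
cell_counts = {
    ("grocery", "DE", "smb"): 1840,
    ("hotel", "CH", "enterprise"): 212,
    ("taxi", "AT", "smb"): 37,
}
evaluable, deferred = split_by_sample_size(cell_counts)
```

&lt;p&gt;Keeping the deferred cells visible, rather than silently dropping them, shows how much of the traffic is not being monitored on a given day.&lt;/p&gt;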

&lt;p&gt;Delayed labels mean delayed detection. For models where the label takes weeks, you need a complementary fast signal. Prediction drift fills part of that gap but it is a leading indicator, not a measurement.&lt;/p&gt;

&lt;p&gt;Wasserstein distance has no native interpretation in production units. You bootstrap a baseline and alert on deviation from it. This works but it is not as crisp as "accuracy dropped 3 points."&lt;/p&gt;

&lt;p&gt;Storing every prediction with features for joinability is expensive. We compress aggressively and tier old partitions to cold storage after 90 days. Plan the storage cost before you build it, not after.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/" rel="noopener noreferrer"&gt;Monitoring Machine Learning Models in Production (Christopher Samiullah)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.evidentlyai.com/" rel="noopener noreferrer"&gt;Evidently AI documentation on drift detection methods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.google.com/machine-learning/guides/rules-of-ml" rel="noopener noreferrer"&gt;Google's Rules of Machine Learning, especially rules 8 and 32&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://research.google/pubs/the-ml-test-score-a-rubric-for-ml-production-readiness-and-technical-debt-reduction/" rel="noopener noreferrer"&gt;The ML Test Score paper (Breck et al., Google)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kubeflow.org/docs/components/pipelines/" rel="noopener noreferrer"&gt;Kubeflow Pipelines documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mlops</category>
      <category>machinelearning</category>
      <category>infrastructure</category>
      <category>sre</category>
    </item>
    <item>
      <title>The Rise of Inference Optimization: The Real LLM Infra Trend Shaping 2026</title>
      <dc:creator>Lukas Brunner</dc:creator>
      <pubDate>Sun, 19 Apr 2026 20:43:01 +0000</pubDate>
      <link>https://dev.to/lukas_brunner/the-rise-of-inference-optimization-the-real-llm-infra-trend-shaping-2026-4e4o</link>
      <guid>https://dev.to/lukas_brunner/the-rise-of-inference-optimization-the-real-llm-infra-trend-shaping-2026-4e4o</guid>
      <description>&lt;p&gt;The large language model space is moving fast, but one trend is quietly defining the next phase of AI: inference optimization. While headlines focus on bigger models and benchmark wins, the real innovation is happening behind the scenes. Teams are no longer asking how to build smarter models. They are asking how to run them efficiently, cheaply, and at scale.&lt;/p&gt;

&lt;p&gt;If you care about LLM infrastructure, this shift matters more than the next model release.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Inference Optimization Is Taking Over
&lt;/h2&gt;

&lt;p&gt;Training a model is expensive, but it is a one-time cost. Inference is forever. Every user query, every API call, every generated token adds to ongoing compute costs. For companies deploying LLMs in production, inference quickly becomes the dominant expense.&lt;/p&gt;

&lt;p&gt;This is why optimization is now the priority. Reducing latency, lowering cost per token, and improving throughput directly impact margins and user experience. A model that is slightly less capable but twice as fast is often the better business decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Techniques Driving This Trend
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Model Quantization
&lt;/h3&gt;

&lt;p&gt;Quantization reduces the precision of model weights, which significantly lowers memory usage and speeds up inference. Moving from 16-bit to 8-bit or even 4-bit precision roughly halves or quarters the memory needed for weights, often with only a small quality loss.&lt;/p&gt;

&lt;p&gt;This is especially important for edge deployments and cost-sensitive applications.&lt;/p&gt;
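&lt;p&gt;As a rough illustration of the idea, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. Real inference stacks use per-channel or per-group scales and fused kernels; the function names and the example weights are assumptions.&lt;/p&gt;

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each restored weight is off by at most half a quantization step.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
```

&lt;p&gt;Small weights such as 0.003 collapse to zero here, which is the kind of quality loss that grows as the bit width shrinks.&lt;/p&gt;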

&lt;h3&gt;
  
  
  Smart Routing and Model Cascades
&lt;/h3&gt;

&lt;p&gt;Not every query needs a top-tier model. Smart routing systems analyze incoming requests and decide which model should handle them. Simple queries go to smaller, cheaper models. Complex ones are escalated.&lt;/p&gt;

&lt;p&gt;This approach, often called model cascading, reduces overall costs without sacrificing quality where it matters.&lt;/p&gt;
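&lt;p&gt;A cascade can be sketched as a loop over models ordered from cheapest to most capable. The stub models, the confidence values, and the 0.8 cutoff below are placeholders; how a real system estimates confidence is the hard part and is not shown.&lt;/p&gt;

```python
from typing import Callable

# A model takes a query and returns (answer, confidence between 0 and 1).
Model = Callable[[str], tuple[str, float]]


def cascade(query: str, tiers: list[Model], min_confidence: float = 0.8):
    """Try each tier in order and stop at the first confident answer.

    Returns the answer and the index of the tier that produced it. If no tier
    is confident, the last (most capable) tier's answer is returned.
    """
    if not tiers:
        raise ValueError("at least one model is required")
    answer = ""
    for index, model in enumerate(tiers):
        answer, confidence = model(query)
        if confidence >= min_confidence:
            return answer, index
    return answer, len(tiers) - 1


def small_model(query: str):
    # Stub: pretends to be confident only on short queries.
    words = len(query.split())
    return "small: " + query, 0.95 if 6 > words else 0.4


def large_model(query: str):
    # Stub: always confident.
    return "large: " + query, 0.99


easy = cascade("reset password", [small_model, large_model])
hard = cascade(
    "compare these two contracts and list every conflicting clause",
    [small_model, large_model],
)
```

&lt;p&gt;Note that an escalated query pays for both models, so the cutoff has to be tuned against the share of traffic that escalates.&lt;/p&gt;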

&lt;h3&gt;
  
  
  KV Cache Optimization
&lt;/h3&gt;

&lt;p&gt;Key-value (KV) caching is critical for fast autoregressive generation. By storing the attention keys and values already computed for earlier tokens, the system only has to process the newest token at each step instead of re-encoding the whole context.&lt;/p&gt;

&lt;p&gt;Efficient cache management can substantially reduce latency, especially in chat-based applications where context grows over time.&lt;/p&gt;
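&lt;p&gt;A toy single-head attention layer makes the saving concrete. This is a deliberately simplified sketch: the elementwise "projection" stands in for a real weight matrix, and all names and numbers are hypothetical.&lt;/p&gt;

```python
import math


def project(token_vec, weight):
    """Toy projection: one weight per dimension instead of a full matrix."""
    return [x * w for x, w in zip(token_vec, weight)]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


class CachedAttention:
    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.keys, self.values = [], []
        self.kv_projections = 0

    def step(self, token_vec):
        # Only the new token is projected; earlier keys and values are reused.
        self.keys.append(project(token_vec, self.w_k))
        self.values.append(project(token_vec, self.w_v))
        self.kv_projections += 1
        return attend(project(token_vec, self.w_q), self.keys, self.values)


def uncached_step(prefix, w_q, w_k, w_v):
    """Recompute keys and values for the entire prefix at every step."""
    keys = [project(t, w_k) for t in prefix]
    values = [project(t, w_v) for t in prefix]
    return attend(project(prefix[-1], w_q), keys, values), len(prefix)


w_q, w_k, w_v = [0.5, -1.0], [1.5, 0.3], [0.2, 0.8]
sequence = [[1.0, 0.5], [0.2, -0.3], [0.7, 0.1], [-0.4, 0.9]]

layer = CachedAttention(w_q, w_k, w_v)
cached_outputs = [layer.step(token) for token in sequence]

uncached_outputs, recomputed = [], 0
for t in range(len(sequence)):
    output, cost = uncached_step(sequence[: t + 1], w_q, w_k, w_v)
    uncached_outputs.append(output)
    recomputed += cost
```

&lt;p&gt;Both paths produce the same outputs, but the cached layer projects 4 tokens while the uncached one projects 1 + 2 + 3 + 4 = 10. The gap grows quadratically with context length, and the price is the memory held by the cache.&lt;/p&gt;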

&lt;h3&gt;
  
  
  Speculative Decoding
&lt;/h3&gt;

&lt;p&gt;Speculative decoding is gaining traction as a way to accelerate generation. A smaller draft model proposes several candidate tokens, and the larger model verifies them together in a single pass. Every candidate the larger model agrees with is a token it did not have to generate one step at a time.&lt;/p&gt;

&lt;p&gt;Because the larger model still approves every token, this technique can speed up generation without compromising output quality.&lt;/p&gt;
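&lt;p&gt;The accept-or-correct loop can be sketched for the greedy case. This is a simplified illustration with toy deterministic "models"; production implementations verify all candidates in one batched forward pass, use a probabilistic acceptance rule for sampling, and usually gain one extra token when every candidate is accepted. All names are hypothetical.&lt;/p&gt;

```python
from typing import Callable, Sequence

NextToken = Callable[[Sequence[str]], str]


def speculative_decode(target: NextToken, draft: NextToken, prompt, n_new, k=4):
    """Greedy speculative decoding.

    Returns the generated sequence and the number of verification rounds,
    each of which stands for one forward pass of the large model.
    """
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while goal > len(tokens):
        # The cheap draft model proposes up to k tokens one after another.
        proposal = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # The target checks each position. It is called in a loop here for
        # clarity; a real system scores all positions in a single pass.
        target_passes += 1
        accepted = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)
            if guess != expected:
                break  # keep the target's token and discard the rest
        tokens.extend(accepted)
    return tokens, target_passes


def target_model(seq):
    # Toy target: the next token depends only on the sequence length.
    return "abc"[len(seq) % 3]


def draft_model(seq):
    # Toy draft: agrees with the target except at every fifth position.
    return "abc"[len(seq) % 3] if len(seq) % 5 else "x"


output, passes = speculative_decode(target_model, draft_model, ["a"], n_new=10)
```

&lt;p&gt;The output matches what the target model alone would produce with greedy decoding, but it takes 4 verification rounds instead of 10 sequential steps. The speedup depends entirely on how often the draft model guesses right.&lt;/p&gt;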

&lt;h2&gt;
  
  
  The Tradeoffs You Cannot Ignore
&lt;/h2&gt;

&lt;p&gt;Optimization is not free. Every gain comes with a tradeoff.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggressive quantization can degrade output quality
&lt;/li&gt;
&lt;li&gt;Routing systems can introduce inconsistency
&lt;/li&gt;
&lt;li&gt;Response caching can serve stale or repetitive answers, and KV caches consume memory that grows with context length
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The challenge is finding the right balance for your use case. There is no universal setup. What works for a consumer chatbot may fail in a high-accuracy enterprise workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Trend Matters for Builders
&lt;/h2&gt;

&lt;p&gt;For developers and companies, inference optimization is no longer optional. It is a competitive advantage.&lt;/p&gt;

&lt;p&gt;Lower costs mean you can serve more users. Faster responses improve engagement. Efficient systems unlock new product experiences that were previously too expensive to run.&lt;/p&gt;

&lt;p&gt;In short, infrastructure decisions are now product decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The future of LLMs will not be defined by who has the biggest model. It will be defined by who can run models the smartest way.&lt;/p&gt;

&lt;p&gt;Inference optimization is where that battle is happening right now. If you are building in this space, this is the layer you cannot afford to ignore.&lt;/p&gt;

&lt;p&gt;Focus less on chasing model hype and more on mastering the systems that make those models usable at scale. That is where the real leverage is.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
