<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: mlops</title>
    <description>The latest articles tagged 'mlops' on DEV Community.</description>
    <link>https://dev.to/t/mlops</link>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tag/mlops"/>
    <language>en</language>
    <item>
      <title>MLOps Training in Hyderabad | MLOps Training Course</title>
      <dc:creator>vamsi visualpath</dc:creator>
      <pubDate>Tue, 30 Jun 2026 11:14:29 +0000</pubDate>
      <link>https://dev.to/vamsi_visualpath_826a9ad2/mlops-training-in-hyderabad-mlops-training-course-13ao</link>
      <guid>https://dev.to/vamsi_visualpath_826a9ad2/mlops-training-in-hyderabad-mlops-training-course-13ao</guid>
      <description>&lt;p&gt;🚀 𝗕𝗲𝗰𝗼𝗺𝗲 𝗮𝗻 𝗜𝗻-𝗗𝗲𝗺𝗮𝗻𝗱 𝗠𝗟𝗢𝗽𝘀 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿 𝘄𝗶𝘁𝗵 𝗥𝗲𝗮𝗹-𝗧𝗶𝗺𝗲 𝗣𝗿𝗼𝗷𝗲𝗰𝘁𝘀!&lt;br&gt;
🎯 Join Visualpath's #MLOps Training and gain hands-on experience in building, deploying, and managing Machine Learning pipelines with industry-leading tools and real-world projects.&lt;/p&gt;

&lt;p&gt;✨ 𝗪𝗵𝗮𝘁 𝗬𝗼𝘂’𝗹𝗹 𝗟𝗲𝗮𝗿𝗻:&lt;br&gt;
 ✅ Python for Machine Learning&lt;br&gt;
 ✅ Kubeflow &amp;amp; MLflow&lt;br&gt;
 ✅ Docker, Kubernetes &amp;amp; Git&lt;br&gt;
 ✅ CI/CD for ML Pipelines&lt;br&gt;
 ✅ AWS EKS Deployment&lt;br&gt;
 ✅ Prometheus &amp;amp; Grafana Monitoring&lt;br&gt;
 ✅ Real-Time Industry Projects &amp;amp; Use Cases&lt;/p&gt;

&lt;p&gt;💥 𝗝𝗼𝗶𝗻 𝗢𝘂𝗿 𝗙𝗥𝗘𝗘 𝗟𝗶𝘃𝗲 𝗗𝗲𝗺𝗼 and master the complete MLOps lifecycle.&lt;/p&gt;

&lt;p&gt;📞 𝗖𝗮𝗹𝗹: +91 7032290546&lt;br&gt;
🌐 𝗩𝗶𝘀𝗶𝘁: &lt;a href="https://www.visualpath.in/mlops-training.html" rel="noopener noreferrer"&gt;https://www.visualpath.in/mlops-training.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔥 𝗟𝗲𝗮𝗿𝗻 𝗳𝗿𝗼𝗺 𝗜𝗻𝗱𝘂𝘀𝘁𝗿𝘆 𝗘𝘅𝗽𝗲𝗿𝘁𝘀, 𝗕𝘂𝗶𝗹𝗱 𝗥𝗲𝗮𝗹-𝗧𝗶𝗺𝗲 𝗣𝗿𝗼𝗷𝗲𝗰𝘁𝘀, 𝗮𝗻𝗱 𝗕𝗲𝗰𝗼𝗺𝗲 𝗮𝗻 𝗜𝗻-𝗗𝗲𝗺𝗮𝗻𝗱 𝗠𝗟𝗢𝗽𝘀 𝗣𝗿𝗼𝗳𝗲𝘀𝘀𝗶𝗼𝗻𝗮𝗹!&lt;/p&gt;

&lt;h1&gt;
  
  
  MLOps #MachineLearning #ArtificialIntelligence #MLEngineering #DataScience #OnlineTraining #CorporateTraining #Docker #Kubernetes #AWS #Python #DevOps #CloudComputing #AI #Visualpath
&lt;/h1&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>python</category>
    </item>
    <item>
      <title>Why Your AI Observability Stack Is Missing the Most Important Metric</title>
      <dc:creator>zhongqiyue</dc:creator>
      <pubDate>Tue, 30 Jun 2026 01:17:27 +0000</pubDate>
      <link>https://dev.to/__c1b9e06dc90a7e0a676b/why-your-ai-observability-stack-is-missing-the-most-important-metric-1m7b</link>
      <guid>https://dev.to/__c1b9e06dc90a7e0a676b/why-your-ai-observability-stack-is-missing-the-most-important-metric-1m7b</guid>
      <description>&lt;p&gt;I spent last week debugging why our AI-powered customer support bot was giving increasingly strange answers. The model hadn't changed. The prompts were identical. The infrastructure was stable.&lt;/p&gt;

&lt;p&gt;So what was different?&lt;/p&gt;

&lt;p&gt;I checked the logs. I checked the embeddings. I even re-ran the evaluation suite — everything passed. But real users were complaining about hallucinated product recommendations.&lt;/p&gt;

&lt;p&gt;The breakthrough came when I stopped looking at the model and started looking at something nobody measures: &lt;strong&gt;context drift&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The metric nobody tracks
&lt;/h2&gt;

&lt;p&gt;Every AI observability tool I've used focuses on the same three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Latency (p50, p95, p99)&lt;/li&gt;
&lt;li&gt;Token usage and cost&lt;/li&gt;
&lt;li&gt;Error rates and timeouts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are all infrastructure metrics. They tell you whether the system is &lt;em&gt;working&lt;/em&gt;, not whether it's producing &lt;em&gt;good outputs&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Our bot was responding in 800ms, using 200 tokens, with zero errors. By every metric that mattered, it was performing perfectly.&lt;/p&gt;

&lt;p&gt;Yet it was slowly becoming useless.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I found
&lt;/h2&gt;

&lt;p&gt;I started tracking something simple: the semantic similarity between current prompts and the training corpus. Over three weeks, the average similarity dropped from 0.87 to 0.62.&lt;/p&gt;

&lt;p&gt;Users were asking about products we'd launched, features we'd added, and edge cases we'd never anticipated. The model was doing its best — but its best was calibrated for a different world.&lt;/p&gt;

&lt;p&gt;The observability stack saw zero anomalies because nothing broke. The system was functioning exactly as designed. It was just designing for a moving target.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern I built
&lt;/h2&gt;

&lt;p&gt;I created a simple monitoring layer that tracks three new signals:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output variance over time.&lt;/strong&gt; Not variance within a single request, but variance across days. If your model's output distribution shifts significantly (measured by embedding distance), that's your early warning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt-embedding drift.&lt;/strong&gt; Every incoming prompt gets embedded and compared against a rolling window of historical prompts. When the average distance crosses a threshold, you know the user base is evolving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feedback signal lag.&lt;/strong&gt; Most systems collect user feedback (thumbs up/down, corrections). But that feedback arrives hours or days after the problematic output. I built a pipeline that correlates feedback signals with the prompt-drift metric — and it turned out drift preceded bad feedback by an average of 4 days.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for your stack
&lt;/h2&gt;

&lt;p&gt;If you're building AI applications in 2026, here's what I'd add to your observability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedding-based output monitoring&lt;/strong&gt;: Sample 1% of outputs daily, embed them, and track distribution shifts. A simple PCA projection over time reveals when your model's "personality" is drifting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt similarity windows&lt;/strong&gt;: Maintain a rolling buffer of the last 10,000 prompts. Compare new prompts against this buffer. Alert when similarity drops below a threshold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correlation dashboards&lt;/strong&gt;: Plot drift metrics alongside business metrics (conversion, retention, CSAT). You'll often find that model quality degradation shows up in business numbers &lt;em&gt;before&lt;/em&gt; it shows up in error rates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated re-calibration triggers&lt;/strong&gt;: When drift exceeds a threshold, automatically trigger a re-evaluation of your prompts and, if necessary, a fine-tuning cycle.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The uncomfortable truth
&lt;/h2&gt;

&lt;p&gt;Most AI observability tools solve the wrong problem. They help you detect when your model crashes, not when your model becomes &lt;em&gt;incrementally worse&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Incremental degradation is harder to detect because it doesn't trigger alerts. It doesn't show up as an error. It's a slow bleed that only reveals itself when you're measuring the right things.&lt;/p&gt;

&lt;p&gt;I'm still iterating on this approach. Next up: building an automated system that detects drift patterns and suggests which prompts need updating.&lt;/p&gt;

&lt;p&gt;What metrics do you track that others don't? I'd love to hear what you've discovered.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ai</category>
      <category>mlops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>We added synthetic data to our eval set. The pass rate rose, and so did our production incidents.</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Mon, 29 Jun 2026 16:56:20 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/we-added-synthetic-data-to-our-eval-set-the-pass-rate-rose-and-so-did-our-production-incidents-1350</link>
      <guid>https://dev.to/maya_andersson_dev/we-added-synthetic-data-to-our-eval-set-the-pass-rate-rose-and-so-did-our-production-incidents-1350</guid>
      <description>&lt;p&gt;We needed a bigger eval set, so we generated one. A model wrote a few thousand test cases that looked like our traffic, we scored against them, the pass rate went up, and we felt good. Then production incidents went up too, on exactly the inputs the synthetic set said we handled. The test set had grown and its predictive value had dropped, at the same time.&lt;/p&gt;

&lt;p&gt;That is the trap with synthetic eval data, and it is not a tooling problem. Generating cases is easy now. Every framework will hand you a thousand. The hard part, the part none of the generators do for you, is proving the synthetic set behaves like the traffic you actually get. A test set that does not match your distribution is not a smaller version of production. It is a different test, and it can pass while production fails.&lt;/p&gt;

&lt;p&gt;So when I compare the tools that generate eval data, I do not grade them on how many cases they spit out, or how clean the prompts are. I grade them on one question: how much do they help me check that the generated set looks like reality before I trust a number it produces?&lt;/p&gt;

&lt;h2&gt;
  
  
  The criterion, stated precisely
&lt;/h2&gt;

&lt;p&gt;A synthetic eval set is trustworthy when two things hold. First, coverage: the cases span the same kinds of inputs your real traffic contains, in roughly the same proportions, including the messy and rare ones. Second, difficulty calibration: the synthetic cases are about as hard as real cases, so the pass rate on synthetic data tracks the pass rate on real data.&lt;/p&gt;

&lt;p&gt;Both are measurable, and neither is measured by default. Coverage you check by embedding real and synthetic inputs and comparing the distributions, or by labeling both with the same taxonomy and comparing the histograms. Calibration you check by holding out a labeled slice of real data and confirming the model's pass rate on it lands near its pass rate on the synthetic set. If those two numbers diverge, the synthetic set is lying to you, and no amount of volume fixes it.&lt;/p&gt;

&lt;p&gt;That is the lens for everything below.&lt;/p&gt;

&lt;h2&gt;
  
  
  The generators, by how much they help you validate
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DeepEval (Synthesizer).&lt;/strong&gt; Strong, controllable generation: it builds test cases from documents or from scratch, with knobs for evolution and complexity. The generation is good. What it does not hand you is the distribution-match check against your real traffic. You generate, then you validate the realism yourself. Worth reading alongside the synthetic-data-for-evaluation literature, for example the Self-Instruct work (Wang et al., arXiv:2212.10560), which is honest that generated instructions drift in diversity and difficulty unless you correct for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Promptfoo.&lt;/strong&gt; Dataset and test-case generation wired into a CI-first tool, so the generated cases drop straight into a gate. Convenient for getting volume into a pipeline fast. The realism question is still yours: it will generate and run, but it does not compare the generated set's distribution to production for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Giskard.&lt;/strong&gt; Comes at it from the risk angle, generating adversarial and edge cases to surface failures rather than to mirror average traffic. That is a different and useful goal, finding what breaks, but do not confuse a stress set with a representative set. An eval set built only from Giskard-style probes will over-represent the hard tail, which is great for hardening and misleading for estimating real-world pass rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ragas.&lt;/strong&gt; For RAG specifically, it generates question-answer test sets from your documents, including multi-hop questions. Good fit if your system is retrieval-shaped. The generated questions still need the same coverage check: documents you own are not the same distribution as questions users actually ask.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future AGI.&lt;/strong&gt; The thing it does differently is integration, not the generator itself. It is an end-to-end open-source platform, and synthetic data generation lives inside the same Datasets and evaluation surface that runs your evals and holds your traces, so the generated set, the eval that scores it, and the production traces you would validate it against are in one place rather than three. The repo is github.com/future-agi/future-agi. Be clear on what that does and does not buy you: it does not auto-prove your synthetic set matches production any more than the others do, that check is still methodology you run. What it removes is the stitching, because comparing synthetic-set behavior to real-trace behavior is a lot easier when both already live in the same system than when you are exporting CSVs between a generator, an eval library, and a tracing tool. On raw generation controllability, DeepEval's Synthesizer is at least as configurable.&lt;/p&gt;

&lt;p&gt;The honest summary across all five: every one of them generates, and not one of them validates realism as the default first step. The validation is the work, and it is on you regardless of which generator you pick.&lt;/p&gt;

&lt;h2&gt;
  
  
  The procedure I actually run
&lt;/h2&gt;

&lt;p&gt;Tool aside, this is the sequence, and steps 1 and 4 are the ones teams skip.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pull a real sample. A few hundred genuine production inputs, with their outcomes if you have them.&lt;/li&gt;
&lt;li&gt;Generate the synthetic set with whichever tool fits your shape.&lt;/li&gt;
&lt;li&gt;Embed both real and synthetic inputs, compare the distributions. If the synthetic set clusters somewhere your real traffic does not, or misses a cluster real traffic has, fix the generation prompts and regenerate.&lt;/li&gt;
&lt;li&gt;Hold out a labeled real slice. Score the model on it and on the synthetic set. If the two pass rates differ by more than a few points, the synthetic set is miscalibrated and its pass rate is not a proxy for anything. Do not trust it until they converge.&lt;/li&gt;
&lt;li&gt;Only then use the synthetic set for volume, and keep the real slice as the anchor you re-check against.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The generator changes how pleasant steps 2 and 3 are. It does not change whether you have to do 1, 4, and 5.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why not just use real data and skip synthetic entirely?&lt;/strong&gt; &lt;br&gt;
Because real data is often scarce, imbalanced, or sensitive, and you cannot get enough of the rare cases that matter. Synthetic data is a reasonable way to fill those gaps. The point is not to avoid it, it is to validate it before you trust a number it produces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much real data do I need to validate the synthetic set?&lt;/strong&gt;&lt;br&gt;
Enough to estimate a distribution and a pass rate with a usable confidence interval, which is usually a few hundred examples, not tens of thousands. The validation slice is smaller than the synthetic set it is checking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the single most common failure?&lt;/strong&gt; &lt;br&gt;
Difficulty miscalibration. Generated cases skew easy, because models write clean, unambiguous inputs and real users do not. The pass rate looks great and means nothing. The held-out real slice is what catches this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does generating adversarial cases count as a synthetic eval set?&lt;/strong&gt;&lt;br&gt;
It is a stress set, not a representative one. Use it to harden the system, not to estimate real-world pass rate. Keep the two sets and the two questions separate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open question
&lt;/h2&gt;

&lt;p&gt;Distribution-match has a chicken-and-egg problem on genuinely new features, where you have little or no real traffic yet, so there is nothing to validate the synthetic set against. You are forced to trust generated data precisely when you can least check it. I do not have a clean answer here. The best I have is to treat the synthetic pass rate on a brand-new feature as a smoke test rather than a measurement, and to re-validate aggressively the moment real traffic arrives. If you have a principled way to bound how wrong a synthetic set can be before you have any real data to compare against, I would genuinely like to see it.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Master MLOps Course | MLOps Online Training</title>
      <dc:creator>vamsi visualpath</dc:creator>
      <pubDate>Mon, 29 Jun 2026 08:00:56 +0000</pubDate>
      <link>https://dev.to/vamsi_visualpath_826a9ad2/master-mlops-course-mlops-online-training-4jll</link>
      <guid>https://dev.to/vamsi_visualpath_826a9ad2/master-mlops-course-mlops-online-training-4jll</guid>
      <description>&lt;p&gt;MLOps Lifecycle Explained: From Model Development to Production&lt;br&gt;
Introduction&lt;br&gt;
Modern machine learning projects need more than building a good model. They also need testing, deployment, monitoring, and regular updates. The MLOps Course helps learners understand this complete process and prepares them for real production environments.&lt;br&gt;
This guide explains the complete MLOps lifecycle using simple language. It covers every important stage, useful tools, practical examples, and common challenges.&lt;br&gt;
What Is MLOps Lifecycle?&lt;br&gt;
The MLOps lifecycle is the complete process of creating, deploying, managing, and improving machine learning models. It combines machine learning, software engineering, and DevOps practices.&lt;br&gt;
The lifecycle ensures that models stay accurate, reliable, and useful after deployment.&lt;br&gt;
The main stages include:&lt;br&gt;
• Data collection&lt;br&gt;
• Data preparation&lt;br&gt;
• Model development&lt;br&gt;
• Model validation&lt;br&gt;
• Deployment&lt;br&gt;
• Monitoring&lt;br&gt;
• Model retraining&lt;br&gt;
• Version management&lt;br&gt;
Each stage supports the next one. Together, they create a reliable production workflow.&lt;br&gt;
Why Is MLOps Lifecycle Important in 2026?&lt;br&gt;
Machine learning projects continue to grow across many industries. Organizations now require faster deployments and stable production systems. Without a proper lifecycle, models often become outdated or fail after deployment.&lt;br&gt;
A structured lifecycle helps teams:&lt;br&gt;
• Reduce manual work&lt;br&gt;
• Improve collaboration&lt;br&gt;
• Deliver models faster&lt;br&gt;
• Maintain model quality&lt;br&gt;
• Detect performance issues early&lt;br&gt;
• Support continuous improvement&lt;br&gt;
Many professionals choose MLOps Online Training to learn these industry practices through guided projects and practical workflows.&lt;br&gt;
Key Features or Components of MLOps Lifecycle&lt;br&gt;
Several important components keep the lifecycle efficient. These components work together from development to production.&lt;br&gt;
Key components include:&lt;br&gt;
• Data collection from trusted sources&lt;br&gt;
• Data cleaning and pre-processing&lt;br&gt;
• Feature engineering&lt;br&gt;
• Model training&lt;br&gt;
• Model evaluation&lt;br&gt;
• Experiment tracking&lt;br&gt;
• Version control&lt;br&gt;
• Automated testing&lt;br&gt;
• Continuous integration&lt;br&gt;
• Continuous deployment&lt;br&gt;
• Model monitoring&lt;br&gt;
• Model retraining&lt;br&gt;
Each component helps maintain consistency throughout the project.&lt;br&gt;
How Does MLOps Lifecycle Work?&lt;br&gt;
The lifecycle follows a continuous workflow. Every stage supports model improvement. &lt;br&gt;
The typical process looks like this:&lt;br&gt;
• Collect raw business data.&lt;br&gt;
• Clean and prepare the dataset.&lt;br&gt;
• Train multiple machine learning models.&lt;br&gt;
• Compare model performance.&lt;br&gt;
• Select the best model.&lt;br&gt;
• Test the model before deployment.&lt;br&gt;
• Deploy the model into production.&lt;br&gt;
• Monitor predictions and system health.&lt;br&gt;
• Retrain the model when new data becomes available.&lt;br&gt;
For example, an online shopping company may retrain its recommendation model every month to match changing customer behaviour.&lt;br&gt;
Step-by-Step Guide to MLOps Lifecycle&lt;br&gt;
Following a structured process reduces deployment risks. Each step has a clear purpose.&lt;br&gt;
Step 1: Define the business problem&lt;br&gt;
Identify the goal before collecting data.&lt;br&gt;
Step 2: Collect data&lt;br&gt;
Gather quality data from trusted systems.&lt;br&gt;
Step 3: Prepare the data&lt;br&gt;
Remove errors and create useful features.&lt;br&gt;
Step 4: Train models&lt;br&gt;
Build several models using different algorithms.&lt;br&gt;
Step 5: Evaluate performance&lt;br&gt;
Measure accuracy, precision, recall, and other metrics.&lt;br&gt;
Step 6: Deploy the model&lt;br&gt;
Move the approved model into production.&lt;br&gt;
Step 7: Monitor continuously&lt;br&gt;
Track prediction quality and system performance.&lt;br&gt;
Step 8: Retrain regularly&lt;br&gt;
Update models whenever business data changes.&lt;br&gt;
Best Tools and Technologies for MLOps Lifecycle in 2026&lt;br&gt;
Modern MLOps uses many automation tools. Each tool supports a specific task.&lt;br&gt;
Popular MLOps tools include:&lt;br&gt;
• MLflow for experiment tracking&lt;br&gt;
• Kubeflow for pipeline management&lt;br&gt;
• Docker for containerization&lt;br&gt;
• Kubernetes for orchestration&lt;br&gt;
• Git for version control&lt;br&gt;
• Jenkins for CI/CD automation&lt;br&gt;
• Airflow for workflow scheduling&lt;br&gt;
• TensorFlow Extended (TFX) for production pipelines&lt;br&gt;
• Prometheus for monitoring&lt;br&gt;
• Grafana for dashboards&lt;br&gt;
Tool selection depends on project size and infrastructure.&lt;br&gt;
Real-World Use Cases of MLOps Lifecycle&lt;br&gt;
Many industries depend on reliable machine learning operations.&lt;br&gt;
Common examples include:&lt;br&gt;
• Banks detect fraud using continuously monitored models.&lt;br&gt;
• Hospitals improve medical predictions through regular retraining.&lt;br&gt;
• Retail companies update recommendation systems.&lt;br&gt;
• Manufacturing predicts equipment failures.&lt;br&gt;
• Insurance companies automate claim analysis.&lt;br&gt;
• Logistics firms improve delivery route planning.&lt;br&gt;
These examples show why production management matters as much as model development.&lt;br&gt;
Benefits of MLOps Lifecycle&lt;br&gt;
A structured lifecycle provides long-term value.&lt;br&gt;
Important benefits include:&lt;br&gt;
• Faster model deployment&lt;br&gt;
• Better collaboration&lt;br&gt;
• Higher model quality&lt;br&gt;
• Easier maintenance&lt;br&gt;
• Improved scalability&lt;br&gt;
• Better compliance&lt;br&gt;
• Continuous monitoring&lt;br&gt;
• Reduced operational risk&lt;br&gt;
• Faster issue detection&lt;br&gt;
• Reliable production systems&lt;br&gt;
Professionals looking for practical implementation often explore MLOps Training in Hyderabad to gain hands-on experience with these workflows.&lt;br&gt;
Challenges, Best Practices, and Future Trends&lt;br&gt;
Although MLOps offers many advantages, teams still face several challenges.&lt;br&gt;
Common challenges include:&lt;br&gt;
• Poor data quality&lt;br&gt;
• Model drift&lt;br&gt;
• Infrastructure complexity&lt;br&gt;
• Limited automation&lt;br&gt;
• Security concerns&lt;br&gt;
Best practices include:&lt;br&gt;
• Automate testing whenever possible.&lt;br&gt;
• Track every model version.&lt;br&gt;
• Monitor production continuously.&lt;br&gt;
• Document every workflow.&lt;br&gt;
• Retrain models using fresh data.&lt;br&gt;
• Build reusable pipelines.&lt;br&gt;
Looking ahead to 2026, organizations continue adopting AI-assisted monitoring, automated retraining, stronger governance, and better cloud-native deployment practices.&lt;br&gt;
FAQs&lt;br&gt;
Q. What Is the MLOps Lifecycle?&lt;br&gt;
A. It is the complete process of building, deploying, monitoring, and improving machine learning models for reliable production systems.&lt;br&gt;
Q. What Are the Key Stages of the MLOps Lifecycle?&lt;br&gt;
A. Data preparation, training, testing, deployment, monitoring, retraining, and version control keep models accurate throughout their lifecycle.&lt;br&gt;
Q. Why Is the MLOps Lifecycle Important for Production Machine Learning?&lt;br&gt;
A. It improves reliability, supports automation, reduces failures, and helps teams deliver production-ready machine learning solutions faster.&lt;br&gt;
Q. What Tools Are Commonly Used in the MLOps Lifecycle?&lt;br&gt;
A. MLflow, Kubeflow, Docker, Kubernetes, Git, and Jenkins are common tools. Visualpath training institute covers practical usage.&lt;br&gt;
Q. How Does the MLOps Lifecycle Differ from a Traditional Machine Learning Workflow?&lt;br&gt;
A. Traditional workflows end after training. MLOps adds deployment, monitoring, automation, retraining, and production management. Visualpath explains these stages.&lt;br&gt;
Conclusion&lt;br&gt;
The MLOps lifecycle connects machine learning development with reliable production operations. It helps teams build, deploy, monitor, and improve models through a structured workflow.&lt;br&gt;
By following every lifecycle stage, organizations can reduce errors, improve collaboration, and maintain model quality over time. Learning these practices also builds valuable industry skills for modern AI and machine learning careers.&lt;br&gt;
Visualpath is the leading and best software and online training institute in Hyderabad&lt;br&gt;
For More Information about MLOps Online Training&lt;br&gt;
Contact Call/WhatsApp: +91-7032290546&lt;br&gt;
Visit: &lt;a href="https://www.visualpath.in/mlops-course.html" rel="noopener noreferrer"&gt;https://www.visualpath.in/mlops-course.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>mlopscourse</category>
      <category>mlopsonlinetraining</category>
      <category>ai</category>
    </item>
    <item>
      <title>MLOps: Building a CI/CD Pipeline for ML Models on Azure Databricks</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Sun, 28 Jun 2026 15:03:49 +0000</pubDate>
      <link>https://dev.to/jubinsoni/mlops-building-a-cicd-pipeline-for-ml-models-on-azure-databricks-18a4</link>
      <guid>https://dev.to/jubinsoni/mlops-building-a-cicd-pipeline-for-ml-models-on-azure-databricks-18a4</guid>
      <description>&lt;p&gt;Most ML teams are great at training models. Very few are great at shipping them. The gap between a notebook that works and a model that reliably serves production traffic is where most ML projects stall.&lt;/p&gt;

&lt;p&gt;In this tutorial I'll walk through building a proper CI/CD pipeline for ML models on &lt;strong&gt;Azure Databricks&lt;/strong&gt; using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MLflow&lt;/strong&gt; for experiment tracking, model versioning, and the model registry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Asset Bundles (DABs)&lt;/strong&gt; for infrastructure-as-code and job deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure DevOps / GitHub Actions&lt;/strong&gt; as the CI/CD orchestrator&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta Lake&lt;/strong&gt; as the feature and validation data store&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Model Serving&lt;/strong&gt; for the production endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The use case is the same churn prediction model from the previous post, but this time we're focusing entirely on how it gets from a notebook to a production endpoint reliably and repeatably.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fio4kbk38e63w2lqdw3y4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fio4kbk38e63w2lqdw3y4.png" alt="Architecture description" width="791" height="3109"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  CI/CD Flow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ftgng9c463v6zv9aubxfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ftgng9c463v6zv9aubxfw.png" alt="CI/CD Flow description" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Pipeline Stage Breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;th&gt;Gate to next stage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GitHub Actions&lt;/td&gt;
&lt;td&gt;Lint, unit tests, bundle validate&lt;/td&gt;
&lt;td&gt;All tests green&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Train&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Databricks Job&lt;/td&gt;
&lt;td&gt;Full training run, MLflow logging&lt;/td&gt;
&lt;td&gt;Job exits 0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Register&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MLflow Registry&lt;/td&gt;
&lt;td&gt;Model versioned and moved to Staging&lt;/td&gt;
&lt;td&gt;Auto on train success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Validate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Databricks Job&lt;/td&gt;
&lt;td&gt;Metric thresholds checked on holdout set&lt;/td&gt;
&lt;td&gt;ROC-AUC &amp;gt;= 0.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Promote&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MLflow Registry&lt;/td&gt;
&lt;td&gt;Model moved to Production&lt;/td&gt;
&lt;td&gt;Validation passes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deploy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Model Serving&lt;/td&gt;
&lt;td&gt;Endpoint updated to new model version&lt;/td&gt;
&lt;td&gt;Promotion complete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Databricks Lakehouse Monitoring&lt;/td&gt;
&lt;td&gt;Drift and accuracy tracked post-deploy&lt;/td&gt;
&lt;td&gt;Ongoing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Step 1 — Project Structure with Databricks Asset Bundles
&lt;/h2&gt;

&lt;p&gt;Databricks Asset Bundles (DABs) let you define your jobs, clusters, and pipelines as code and deploy them via CLI. This is your IaC layer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# databricks.yml&lt;/span&gt;
&lt;span class="na"&gt;bundle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;churn-prediction&lt;/span&gt;

&lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev&lt;/span&gt;

&lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;development&lt;/span&gt;
    &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;workspace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://adb-xxxx.azuredatabricks.net&lt;/span&gt;

  &lt;span class="na"&gt;staging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;workspace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://adb-xxxx.azuredatabricks.net&lt;/span&gt;

  &lt;span class="na"&gt;production&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
    &lt;span class="na"&gt;workspace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://adb-xxxx.azuredatabricks.net&lt;/span&gt;

&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;training_job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;churn-training-${var.env}&lt;/span&gt;
      &lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;feature_engineering&lt;/span&gt;
          &lt;span class="na"&gt;notebook_task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;notebook_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./notebooks/01_feature_engineering.py&lt;/span&gt;
          &lt;span class="na"&gt;new_cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;spark_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;14.3.x-scala2.12&lt;/span&gt;
            &lt;span class="na"&gt;node_type_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Standard_DS3_v2&lt;/span&gt;
            &lt;span class="na"&gt;num_workers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;train_and_register&lt;/span&gt;
          &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;feature_engineering&lt;/span&gt;
          &lt;span class="na"&gt;notebook_task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;notebook_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./notebooks/02_train_and_register.py&lt;/span&gt;
          &lt;span class="na"&gt;new_cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;spark_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;14.3.x-scala2.12&lt;/span&gt;
            &lt;span class="na"&gt;node_type_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Standard_DS3_v2&lt;/span&gt;
            &lt;span class="na"&gt;num_workers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

    &lt;span class="na"&gt;validation_job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;churn-validation-${var.env}&lt;/span&gt;
      &lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;validate_model&lt;/span&gt;
          &lt;span class="na"&gt;notebook_task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;notebook_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./notebooks/03_validate_and_promote.py&lt;/span&gt;
          &lt;span class="na"&gt;new_cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;spark_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;14.3.x-scala2.12&lt;/span&gt;
            &lt;span class="na"&gt;node_type_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Standard_DS3_v2&lt;/span&gt;
            &lt;span class="na"&gt;num_workers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 2 — Training Job with MLflow Autologging
&lt;/h2&gt;

&lt;p&gt;Keep the training notebook clean and focused. Let MLflow autologging handle the heavy lifting on metric and param capture.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# notebooks/02_train_and_register.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow.sklearn&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mlflow.models.signature&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;infer_signature&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GradientBoostingClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;roc_auc_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;precision_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recall_score&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta.tables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_experiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/churn-prediction/ci-cd-pipeline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sklearn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;autolog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_input_examples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_model_signatures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read Gold features and capture Delta version for reproducibility
&lt;/span&gt;&lt;span class="n"&gt;gold_table&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn.gold.features&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;delta_version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gold_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;features_pdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn.gold.features&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feature_date = current_date()&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;FEATURE_COLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_sessions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;distinct_products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events_last_30d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events_last_90d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;days_since_last_event&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_tenure_days&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_events_per_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recency_tier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;engagement_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;TARGET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churned&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;features_pdf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;FEATURE_COLS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;features_pdf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TARGET&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stratify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n_estimators&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_depth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;learning_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subsample&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GradientBoostingClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;y_prob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta_feature_version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delta_version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;git_sha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;git_sha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# passed in by CI
&lt;/span&gt;    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;roc_auc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="nf"&gt;roc_auc_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_prob&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;f1_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="nf"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;precision&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;precision_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recall&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="nf"&gt;recall_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;signature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;infer_signature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sklearn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;artifact_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn-gbm&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;signature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;registered_model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn-prediction-gbm&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;await_registration_for&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Pass run ID to downstream tasks via job output
&lt;/span&gt;    &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;taskValues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;run_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Training complete. Run ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3 — Validation Job: Gate on Metrics Before Promoting
&lt;/h2&gt;

&lt;p&gt;Never auto-promote based on a successful training run alone. Always validate on a held-out or recent dataset and check against a threshold before touching the Production alias.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# notebooks/03_validate_and_promote.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MlflowClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;roc_auc_score&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MlflowClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;MODEL_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn-prediction-gbm&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Pick up run_id from the upstream training task
&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;taskValues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;taskKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train_and_register&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;run_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Thresholds — fail the job if any metric is below these
&lt;/span&gt;&lt;span class="n"&gt;THRESHOLDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;roc_auc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="mf"&gt;0.80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;f1_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;0.72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;precision&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Validating metrics against thresholds...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;failures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;THRESHOLDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PASS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FAIL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (threshold: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; below threshold &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Validation failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Promote to Production alias if all thresholds pass
&lt;/span&gt;&lt;span class="n"&gt;model_version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search_model_versions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;filter_string&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_id=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_registered_model_alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Production&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model version &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; promoted to Production alias.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 4 — GitHub Actions CI/CD Workflow
&lt;/h2&gt;

&lt;p&gt;This is the glue that ties everything together. One workflow handles PR validation; the other handles deployment on merge to main.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/mlops.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MLOps CI/CD&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;DATABRICKS_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.DATABRICKS_HOST }}&lt;/span&gt;
  &lt;span class="na"&gt;DATABRICKS_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DATABRICKS_TOKEN }}&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ci&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Lint and Test&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Python&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.11'&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install dependencies&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install databricks-cli databricks-sdk pytest flake8 mlflow&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Lint&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flake8 notebooks/ src/ --max-line-length=120&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Unit tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pytest tests/unit/ -v&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Validate DAB bundle&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;databricks bundle validate --target staging&lt;/span&gt;

  &lt;span class="na"&gt;train-and-deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Train, Validate, Deploy&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ci&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.ref == 'refs/heads/main'&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Databricks CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install databricks-cli databricks-sdk&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy bundle to staging&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;databricks bundle deploy --target staging&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run training job&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;JOB_RUN_ID=$(databricks bundle run training_job \&lt;/span&gt;
            &lt;span class="s"&gt;--var "git_sha=${{ github.sha }}" \&lt;/span&gt;
            &lt;span class="s"&gt;--output json | jq -r '.run_id')&lt;/span&gt;
          &lt;span class="s"&gt;echo "TRAINING_RUN_ID=$JOB_RUN_ID" &amp;gt;&amp;gt; $GITHUB_ENV&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run validation job&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;databricks bundle run validation_job --target staging&lt;/span&gt;
          &lt;span class="s"&gt;echo "Validation passed. Promoting to production."&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy bundle to production&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;databricks bundle deploy --target production&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update Model Serving endpoint&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;python scripts/update_serving_endpoint.py \&lt;/span&gt;
            &lt;span class="s"&gt;--model-name churn-prediction-gbm \&lt;/span&gt;
            &lt;span class="s"&gt;--alias Production&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 5 — Update the Serving Endpoint
&lt;/h2&gt;

&lt;p&gt;The final step in the pipeline updates the Databricks Model Serving endpoint to point at the newly promoted Production model version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scripts/update_serving_endpoint.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;databricks.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WorkspaceClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;databricks.sdk.service.serving&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ServedModelInput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EndpointCoreConfigInput&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MlflowClient&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_serving_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WorkspaceClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MlflowClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Resolve alias to concrete version
&lt;/span&gt;    &lt;span class="n"&gt;model_version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_model_version_by_alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deploying &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; version &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (alias: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;endpoint_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn-prediction-endpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;served_model&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ServedModelInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;workload_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Small&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;scale_to_zero_enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Update existing endpoint
&lt;/span&gt;        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;serving_endpoints&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;served_models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;served_model&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Endpoint &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;endpoint_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; updated to version &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Create if it doesn't exist yet
&lt;/span&gt;        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;serving_endpoints&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;EndpointCoreConfigInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;served_models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;served_model&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Endpoint &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;endpoint_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; created with version &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--model-name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--alias&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;update_serving_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Tool Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Role in pipeline&lt;/th&gt;
&lt;th&gt;Why not the alternative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Databricks Asset Bundles&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IaC for jobs and clusters&lt;/td&gt;
&lt;td&gt;Terraform Databricks provider (more verbose, no native notebook support), manual UI (not reproducible)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MLflow Registry aliases&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Production / Staging promotion&lt;/td&gt;
&lt;td&gt;Stage-based promotion (deprecated in MLflow 2.x), manual version tracking (error prone)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub Actions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CI/CD orchestrator&lt;/td&gt;
&lt;td&gt;Azure DevOps (works equally well, swap yml syntax), Jenkins (more ops overhead)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metric gate in validation job&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automated quality check&lt;/td&gt;
&lt;td&gt;Manual review (blocks velocity), no gate at all (risky)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Databricks Model Serving&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed REST endpoint&lt;/td&gt;
&lt;td&gt;AKS deployment (more control, much more ops), Azure ML endpoints (extra service dependency)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;dbutils.jobs.taskValues&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pass run_id between tasks&lt;/td&gt;
&lt;td&gt;Environment variables (not available cross-task in DABs), hardcoded run lookup (fragile)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Things to Watch in Production
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pin your Databricks Runtime version.&lt;/strong&gt; Using &lt;code&gt;14.3.x-scala2.12&lt;/code&gt; in your bundle ensures every training run uses the same Spark and library versions. Floating versions (&lt;code&gt;latest&lt;/code&gt;) cause silent library drift that breaks reproducibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Store secrets in Azure Key Vault, not GitHub Secrets alone.&lt;/strong&gt; GitHub Secrets work fine for CI tokens but for long-lived service principal credentials that Databricks jobs use at runtime, back them with Azure Key Vault and reference them via Databricks secret scopes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set a metric baseline from your current production model.&lt;/strong&gt; Your thresholds (ROC-AUC &amp;gt;= 0.80) should be relative to what the current Production model achieves on the same holdout set, not an arbitrary number. Add a step in the validation job that fetches the current Production model's metrics and gates the new model against those.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tag every MLflow run with git SHA.&lt;/strong&gt; Logging &lt;code&gt;git_sha&lt;/code&gt; as a param in every training run means you can always trace a model artifact back to the exact code version that produced it. Critical for incident response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scale to zero on serving endpoints.&lt;/strong&gt; For non-latency-critical models, enable &lt;code&gt;scale_to_zero_enabled=True&lt;/code&gt; on your serving endpoint. It cuts cost dramatically for endpoints that don't receive traffic 24/7.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;The pattern here is straightforward: code change triggers CI, CI triggers a training job, training job registers a model, a validation job gates on metrics, and only then does the model get promoted and deployed. Nothing manual, nothing skipped.&lt;/p&gt;

&lt;p&gt;What makes this production-grade rather than just automated is the combination of Delta versioning for feature reproducibility, MLflow aliases for clean promotion semantics, and metric-gated promotion so a worse model can never silently replace a better one.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/en/dev-tools/bundles/index.html" rel="noopener noreferrer"&gt;Databricks Asset Bundles Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlflow.org/docs/latest/model-registry.html#registering-an-mlflow-model" rel="noopener noreferrer"&gt;MLflow Model Registry — Aliases and Tags&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/en/machine-learning/model-serving/index.html" rel="noopener noreferrer"&gt;Databricks Model Serving&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://databricks-sdk-py.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Databricks SDK for Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/en/dev-tools/bundles/ci-cd.html" rel="noopener noreferrer"&gt;GitHub Actions — CI/CD for Databricks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlflow.org/docs/latest/tracking/autolog.html" rel="noopener noreferrer"&gt;MLflow Autologging&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes" rel="noopener noreferrer"&gt;Azure Key Vault — Databricks Secret Scopes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/en/lakehouse-monitoring/index.html" rel="noopener noreferrer"&gt;Databricks Lakehouse Monitoring&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>azure</category>
      <category>databricks</category>
      <category>mlops</category>
      <category>cicd</category>
    </item>
    <item>
      <title>How to Evaluate AI Agents: Trajectory Evals That Work</title>
      <dc:creator>sagar jain</dc:creator>
      <pubDate>Sun, 28 Jun 2026 09:30:10 +0000</pubDate>
      <link>https://dev.to/sagar_jain4010/how-to-evaluate-ai-agents-trajectory-evals-that-work-dhc</link>
      <guid>https://dev.to/sagar_jain4010/how-to-evaluate-ai-agents-trajectory-evals-that-work-dhc</guid>
      <description>&lt;p&gt;You cannot evaluate an agent by checking its final answer. A multi-step agent can reach the right output through a broken path, calling the wrong tool, recovering by luck, taking eight steps where two would do, and a final-answer check waves it through. Then the same broken path fails on the next input and you have no idea why. Agent evaluation has to grade the &lt;em&gt;trajectory&lt;/em&gt;, not just the destination.&lt;/p&gt;

&lt;p&gt;We build and ship AI agents, and the eval harness is the part that separates the agents that survive a model upgrade from the ones that silently regress the day the provider ships a new version.&lt;/p&gt;

&lt;h2&gt;
  
  
  Score the path, not just the answer
&lt;/h2&gt;

&lt;p&gt;A useful agent eval covers the whole trajectory with several dimensions, not one number:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool correctness:&lt;/strong&gt; did it call the right tools? A deterministic check, exact tool names against expected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Argument correctness:&lt;/strong&gt; were the parameters right? Also deterministic where you can specify required fields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step efficiency:&lt;/strong&gt; did it take a reasonable number of steps, or wander?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan adherence and plan quality:&lt;/strong&gt; did it follow a sensible plan, and was the plan good to begin with?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task completion and reasoning quality:&lt;/strong&gt; did it actually finish the job, and was the reasoning sound?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important split: use &lt;strong&gt;deterministic checks&lt;/strong&gt; for anything with a crisp right answer (tool names, required parameters, expected outputs) and save &lt;strong&gt;LLM-as-judge&lt;/strong&gt; for the subjective stuff. Don't pay a judge model to check something a string comparison can verify.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-agent regressions hide in the sub-agents
&lt;/h2&gt;

&lt;p&gt;If you've got an orchestrator with sub-agents, a top-level score will lie to you. The orchestrator can look fine while a sub-agent quietly degrades, because the system recovered or the bad output got averaged away. You need &lt;strong&gt;span-level evaluation&lt;/strong&gt;: grade each sub-agent's span on its own. Most production regressions in multi-agent systems live in exactly the sub-agent nobody's eval was watching.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM-as-judge is useful and quietly biased
&lt;/h2&gt;

&lt;p&gt;LLM-as-judge is the right tool for subjective criteria, and it's riddled with biases you have to actively counter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Position bias.&lt;/strong&gt; Judges favor whichever answer came first, sometimes heavily. Flipping the order can flip the verdict. &lt;strong&gt;Fix:&lt;/strong&gt; evaluate both orderings and average, or randomize position.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-preference.&lt;/strong&gt; A judge tends to prefer outputs from its own model family. &lt;strong&gt;Fix:&lt;/strong&gt; use a judge that's maximally &lt;em&gt;different&lt;/em&gt; from the model you're grading, or require cross-family consensus.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verbosity bias.&lt;/strong&gt; Longer answers get rated higher regardless of substance. &lt;strong&gt;Fix:&lt;/strong&gt; control for length, or instruct the judge to ignore it and spot-check that it does.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Properly calibrated, with biases controlled and validated against human labels, LLM-as-judge reaches strong agreement with human preferences, about the level humans agree with each other. The judge is reliable once you've done the work to calibrate it. It is not reliable out of the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  Calibrate against humans, then trust the automation
&lt;/h2&gt;

&lt;p&gt;The step teams skip is calibration. Before you trust a rubric, hand-label a set of examples and check that your judge agrees with your humans. If it doesn't, the rubric is ambiguous or the judge is biased, and either way your green dashboard is fiction. Humans calibrate the grader; the grader scales the humans. And watch for eval-set contamination: if benchmark examples leaked into training data, you're measuring memorization, not capability. Keep a held-out set you generated yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Offline evals miss drift, so run online too
&lt;/h2&gt;

&lt;p&gt;A test suite you run before deploy catches known failures. It does not catch the new ways real traffic breaks your agent. Run streaming evals on a sample of production traffic with drift detection and alerting. Offline evals are your regression net; online evals are how you find the failures you didn't know to write a test for. This is the runtime version of the same investment we argued for on AI-written code: &lt;a href="https://www.shantiinfosoft.com/blog/ai-writes-4x-code-qa-layer/" rel="noopener noreferrer"&gt;AI writes 4x the code, here's the QA layer that stops 4x the bugs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Grade the trajectory: tool correctness, argument correctness, step efficiency, plan quality, completion. Not just the final answer.&lt;/li&gt;
&lt;li&gt;Deterministic checks for crisp things (tool names, params); LLM-as-judge for subjective things.&lt;/li&gt;
&lt;li&gt;Evaluate sub-agents at the span level. Top-level scores hide sub-agent regressions.&lt;/li&gt;
&lt;li&gt;LLM judges have position, self-preference, and verbosity biases. Counter them, then trust them.&lt;/li&gt;
&lt;li&gt;Calibrate judges against human labels, keep a held-out set, and run online evals to catch drift.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why isn't final-answer accuracy enough?&lt;/strong&gt;&lt;br&gt;
Because an agent can get the right answer through a broken path that fails next time. Trajectory evals catch the broken path before it costs you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I trust LLM-as-judge?&lt;/strong&gt;&lt;br&gt;
After calibration, yes, for subjective criteria. Control for position and verbosity bias, use a different model family, and validate against human labels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need online evals if I have a good offline suite?&lt;/strong&gt;&lt;br&gt;
Yes. Offline catches known regressions; online catches drift and novel real-world failures your tests never anticipated.&lt;/p&gt;

&lt;p&gt;If you're standing up an eval harness for agents and wrestling with judge calibration, that's a problem we like. Happy to swap rubrics and harness designs with anyone building agents at &lt;a href="https://shantiinfosoft.com" rel="noopener noreferrer"&gt;Shanti Infosoft&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>testing</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Azure Databricks for MLOps and Feature Engineering at Scale with Apache Spark, Delta Lake, and MLflow</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Sun, 28 Jun 2026 01:35:55 +0000</pubDate>
      <link>https://dev.to/jubinsoni/azure-databricks-for-feature-engineering-at-scale-with-apache-spark-delta-lake-and-mlflow-3k4n</link>
      <guid>https://dev.to/jubinsoni/azure-databricks-for-feature-engineering-at-scale-with-apache-spark-delta-lake-and-mlflow-3k4n</guid>
      <description>&lt;p&gt;Raw data doesn't win model competitions. Features do. And when your raw data is tens of billions of rows sitting across multiple sources, you can't afford to run pandas in a notebook and call it a day.&lt;/p&gt;

&lt;p&gt;In this tutorial I'll walk through building a production-grade feature engineering pipeline on &lt;strong&gt;Azure Databricks&lt;/strong&gt; using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Spark&lt;/strong&gt; for distributed transformation at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta Lake&lt;/strong&gt; for reliable, versioned feature storage with ACID guarantees&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLflow&lt;/strong&gt; for tracking feature pipeline runs, parameters, and the models trained on top of them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The use case is a customer churn prediction system, but the patterns apply to any ML feature pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fcgb1wc07dd1wvcu9olx7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fcgb1wc07dd1wvcu9olx7.png" alt="Architecture description" width="660" height="1969"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pipeline follows the &lt;strong&gt;Medallion Architecture&lt;/strong&gt; — a layered approach where data gets progressively cleaner and more feature-ready as it moves from Bronze to Silver to Gold. MLflow sits across all three layers tracking every run.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pipeline Flow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F1x17oavxvgp49urqx4x3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F1x17oavxvgp49urqx4x3.png" alt="Pipeline description" width="800" height="216"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer Breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Delta Table&lt;/th&gt;
&lt;th&gt;What happens here&lt;/th&gt;
&lt;th&gt;Typical latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bronze&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;churn.bronze.events&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Raw ingest, no transforms, append only&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Silver&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;churn.silver.customers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Deduplication, null handling, schema enforcement&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gold&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;churn.gold.features&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Aggregations, window functions, encoding&lt;/td&gt;
&lt;td&gt;Minutes to hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MLflow Run&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Training, metric logging, artifact storage&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Registry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Versioned model store, stage promotion&lt;/td&gt;
&lt;td&gt;On demand&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Step 1 — Bronze Layer: Raw Ingest
&lt;/h2&gt;

&lt;p&gt;The Bronze layer is append-only. No transforms. No business logic. Just get the data in and preserve it exactly as it arrived so you can always replay from source.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lit&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta.tables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Read raw events from ADLS Gen2 / Event Hub / source of choice
&lt;/span&gt;&lt;span class="n"&gt;raw_events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abfss://raw@yourstorage.dfs.core.windows.net/events/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add ingestion metadata — never mutate source columns
&lt;/span&gt;&lt;span class="n"&gt;bronze_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_ingested_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
                       &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events_api&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Write to Bronze Delta table — append only, no overwrites
&lt;/span&gt;&lt;span class="n"&gt;bronze_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mergeSchema&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn.bronze.events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bronze rows written: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bronze_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why append-only?&lt;/strong&gt; If your downstream pipeline produces bad features, you want to replay from Bronze without re-ingesting from source. Overwriting Bronze breaks that ability.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Step 2 — Silver Layer: Clean and Validate
&lt;/h2&gt;

&lt;p&gt;Silver is where you enforce schema, handle nulls, deduplicate, and standardize. Think of it as your canonical, trusted dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta.tables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;

&lt;span class="n"&gt;bronze&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn.bronze.events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;silver_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isNotNull&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isNotNull&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropDuplicates&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_ts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="nf"&gt;to_timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country_code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isNull&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UNKNOWN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_ts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country_code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_ingested_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Upsert into Silver using Delta MERGE — idempotent on re-runs
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isDeltaTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn.silver.customers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;silver_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn.silver.customers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;silver_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tgt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;silver_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;src&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tgt.customer_id = src.customer_id AND tgt.event_id = src.event_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;whenNotMatchedInsertAll&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;silver_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn.silver.customers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Silver table updated. Total rows: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn.silver.customers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3 — Gold Layer: Feature Engineering
&lt;/h2&gt;

&lt;p&gt;This is the heart of the pipeline. We compute aggregated, windowed, and encoded features that the model will actually train on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;countDistinct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datediff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;min&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;current_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;when&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.window&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;

&lt;span class="n"&gt;silver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn.silver.customers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ------------------------------------------------------------------
# 1. Aggregate features per customer over 30 / 90 day windows
# ------------------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;today&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;current_date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;agg_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;days_since_event&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;datediff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_ts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;countDistinct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_sessions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;countDistinct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;distinct_products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;_sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;days_since_event&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events_last_30d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;_sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;days_since_event&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events_last_90d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;_max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_ts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;last_event_ts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;_min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_ts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;first_event_ts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;days_since_last_event&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;datediff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;last_event_ts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_tenure_days&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="nf"&gt;datediff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;first_event_ts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_events_per_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_tenure_days&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# ------------------------------------------------------------------
# 2. Encode churn risk tier as ordinal feature
# ------------------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;feature_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agg_features&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recency_tier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;days_since_last_event&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# active
&lt;/span&gt;       &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;days_since_last_event&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# at risk
&lt;/span&gt;       &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;                                   &lt;span class="c1"&gt;# churned
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;engagement_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events_last_30d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events_last_90d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_tenure_days&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ------------------------------------------------------------------
# 3. Write to Gold feature store — overwrite with partition by date
# ------------------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;feature_df&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;feature_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;current_date&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replaceWhere&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feature_date = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn.gold.features&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gold features written: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;feature_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 4 — MLflow: Track the Training Run
&lt;/h2&gt;

&lt;p&gt;With features in Gold, we hand off to MLflow to train, track, and register the model. Notice we log the Delta table version so we can always reproduce exactly which feature snapshot trained which model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow.sklearn&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mlflow.models.signature&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;infer_signature&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GradientBoostingClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;roc_auc_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_experiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/churn-prediction/feature-pipeline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read Gold features — capture Delta version for reproducibility
&lt;/span&gt;&lt;span class="n"&gt;gold_table&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn.gold.features&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;delta_version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gold_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;features_pdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn.gold.features&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;FEATURE_COLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_sessions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;distinct_products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events_last_30d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events_last_90d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;days_since_last_event&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_tenure_days&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_events_per_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recency_tier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;engagement_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;TARGET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churned&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;features_pdf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;FEATURE_COLS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;features_pdf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TARGET&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gbm-features-v&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;delta_version&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n_estimators&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_depth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;learning_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GradientBoostingClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y_prob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Log everything
&lt;/span&gt;    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;roc_auc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;roc_auc_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_prob&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;f1_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta_feature_version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delta_version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;feature_columns&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FEATURE_COLS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;training_rows&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Log model with signature
&lt;/span&gt;    &lt;span class="n"&gt;signature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;infer_signature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sklearn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;artifact_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn-gbm&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;signature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;registered_model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn-prediction-gbm&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Run ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ROC-AUC: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;roc_auc_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Feature Delta version logged: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;delta_version&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Bonus: Delta Lake Time Travel for Feature Reproducibility
&lt;/h2&gt;

&lt;p&gt;One of the best things about Delta Lake is time travel. If a model behaves unexpectedly in production, you can reload the exact feature snapshot it was trained on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Reload the exact feature version that trained a specific model run
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your-run-id-here&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;feature_version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta_feature_version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Rehydrate that exact feature snapshot
&lt;/span&gt;&lt;span class="n"&gt;historical_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;versionAsOf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature_version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;churn.gold.features&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded feature snapshot from Delta version &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;feature_version&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Row count: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;historical_features&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# You can now retrain on the exact same data to reproduce the result
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Service Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Role in pipeline&lt;/th&gt;
&lt;th&gt;Why not the alternative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distributed feature computation&lt;/td&gt;
&lt;td&gt;Pandas (single node, OOM at scale), Dask (less native Databricks integration)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Feature storage with versioning&lt;/td&gt;
&lt;td&gt;Parquet (no ACID, no time travel), Hive tables (no merge support)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MLflow Tracking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Experiment and param logging&lt;/td&gt;
&lt;td&gt;Manual logging (not reproducible), W&amp;amp;B (extra cost, less native on Databricks)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MLflow Registry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Model versioning and promotion&lt;/td&gt;
&lt;td&gt;Custom model store (more ops overhead)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Medallion Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pipeline layer separation&lt;/td&gt;
&lt;td&gt;Flat pipelines (hard to debug, no replay capability)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delta MERGE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Idempotent Silver upserts&lt;/td&gt;
&lt;td&gt;Overwrite (destroys history), append (creates duplicates)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Things to Watch in Production
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Shuffle partitions matter.&lt;/strong&gt; Spark defaults to 200 shuffle partitions which is fine for small data but will bottleneck at scale. Set &lt;code&gt;spark.conf.set("spark.sql.shuffle.partitions", "auto")&lt;/code&gt; on Databricks Runtime 10+ or tune it manually to &lt;code&gt;2-3x your core count&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Z-ordering on Gold features.&lt;/strong&gt; If you're querying Gold by &lt;code&gt;customer_id&lt;/code&gt; frequently, add &lt;code&gt;OPTIMIZE churn.gold.features ZORDER BY (customer_id)&lt;/code&gt; after the write. This co-locates related data and cuts query times dramatically on large tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log Delta version in every MLflow run.&lt;/strong&gt; This is non-negotiable for reproducibility. Without it you can't prove which feature snapshot trained which model, which becomes a compliance problem in regulated industries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster autoscaling for feature jobs.&lt;/strong&gt; Feature engineering jobs tend to have spiky resource needs (big during aggregation, small during writes). Enable autoscaling on your Databricks cluster and set a min/max node count rather than a fixed size.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;The combination of Spark, Delta Lake, and MLflow on Databricks gives you a feature engineering pipeline that is reproducible (Delta time travel + MLflow param logging), scalable (Spark handles billions of rows), and auditable (every run is tracked, every feature version is stored).&lt;/p&gt;

&lt;p&gt;The Medallion Architecture keeps the pipeline modular — you can rerun just the Gold layer if you change a feature definition without touching Bronze or Silver, and MLflow ties model performance back to the exact feature version that produced it.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/databricks/" rel="noopener noreferrer"&gt;Azure Databricks Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.delta.io/latest/index.html" rel="noopener noreferrer"&gt;Delta Lake — The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-window.html" rel="noopener noreferrer"&gt;Apache Spark SQL — Window Functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlflow.org/docs/latest/tracking.html" rel="noopener noreferrer"&gt;MLflow Tracking Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlflow.org/docs/latest/model-registry.html" rel="noopener noreferrer"&gt;MLflow Model Registry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.databricks.com/glossary/medallion-architecture" rel="noopener noreferrer"&gt;Medallion Architecture on Databricks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.delta.io/latest/delta-utility.html#history" rel="noopener noreferrer"&gt;Delta Lake Time Travel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/en/machine-learning/feature-store/index.html" rel="noopener noreferrer"&gt;Databricks Feature Store Overview&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>azure</category>
      <category>databricks</category>
      <category>spark</category>
      <category>mlops</category>
    </item>
    <item>
      <title>What Is an Agent Registry? (And What We Broke Before We Had One)</title>
      <dc:creator>Sahajmeet Kaur</dc:creator>
      <pubDate>Sat, 27 Jun 2026 06:30:00 +0000</pubDate>
      <link>https://dev.to/sahajmeet_kaur_/what-is-an-agent-registry-and-what-we-broke-before-we-had-one-37jn</link>
      <guid>https://dev.to/sahajmeet_kaur_/what-is-an-agent-registry-and-what-we-broke-before-we-had-one-37jn</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AI agent registry is a centralized catalog of every agent in your organization — what each agent does, what tools it can access, what version is running, who owns it, and how to call it&lt;/li&gt;
&lt;li&gt;It's to agents what a container registry is to Docker images or what a service mesh is to microservices — the layer that makes distributed components governable&lt;/li&gt;
&lt;li&gt;We hit the "which agents do we have?" wall at 14 agents across 3 teams. That's when the registry stopped being a nice-to-have&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;About four months into our agentic AI buildout, our head of security asked a question I couldn't answer: "Can you give me a list of every AI agent running in production, what systems they have access to, and what version of each is currently deployed?"&lt;/p&gt;

&lt;p&gt;I had a rough mental model. I knew about the agents my team had built. I had a vague idea of what the data engineering team had shipped. The product team had recently added two agents I'd heard about secondhand.&lt;/p&gt;

&lt;p&gt;I spent the better part of a day pulling together a spreadsheet. By the time I finished, one of the agents I'd listed had already been replaced by a newer version. Two of them had been granted access to an internal API I hadn't known about.&lt;/p&gt;

&lt;p&gt;The spreadsheet was outdated before I sent it.&lt;/p&gt;

&lt;p&gt;That was our forcing function for building a proper agent registry. This post is what I wish I'd read before that conversation happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  What an agent registry is
&lt;/h2&gt;

&lt;p&gt;An agent registry is a centralized catalog of AI agents — a single source of truth that tracks every agent deployed in your organization, its capabilities, its integrations, its ownership, and its current state.&lt;/p&gt;

&lt;p&gt;The analogy that landed for me: it's to agents what a container registry (Docker Hub, ECR, GCR) is to container images. When you have three containers running, you don't need a registry — you know what you have. When you have 40 containers across six teams, you need a registry to know what's running, who owns it, what version is deployed, and what depends on what.&lt;/p&gt;

&lt;p&gt;Agents are the same. At two or three agents, a shared Notion doc is sufficient. At 14 agents across three teams, you need infrastructure that tracks state, not a doc that someone last edited last month.&lt;/p&gt;

&lt;p&gt;A registry stores metadata for each agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identity and ownership&lt;/strong&gt; — which team built it, who's the current owner, what's the canonical name&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capabilities&lt;/strong&gt; — what the agent can do, expressed as a standard interface (increasingly via the Model Context Protocol, so other agents can discover and call it without custom integration)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool and model access&lt;/strong&gt; — which MCP servers it's authorized to use, which models it can call, what permissions it holds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version and deployment state&lt;/strong&gt; — which version is currently in production, what changed, when it was last updated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability metadata&lt;/strong&gt; — success rate, latency, last error, evaluation scores if you're running evals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access policy&lt;/strong&gt; — which other agents or services are authorized to call this agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last one is what distinguishes a registry from a spreadsheet: it's not just a catalog, it's the enforcement point for agent-to-agent communication.&lt;/p&gt;




&lt;h2&gt;
  
  
  What goes wrong without one
&lt;/h2&gt;

&lt;p&gt;We ran without a registry for longer than we should have. Here's what actually broke.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shadow agents.&lt;/strong&gt; Three separate teams had independently built agents that called our internal data API. None of them knew about the others. When we introduced rate limits on that API, two of the agents started failing intermittently — and we spent a week debugging what we thought was a data API problem before realizing the actual problem was three agents competing for quota we'd only budgeted for one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version confusion at 2am.&lt;/strong&gt; An agent went into production with a bug. We rolled back. The rollback was applied to one environment but not the other. For six hours, our staging environment had the fixed version and production had the broken one, because there was no single source of truth for which version was where. The incident took longer to resolve than it should have because different team members were looking at different version references.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The offboarding gap.&lt;/strong&gt; When an engineer left the team, we revoked their credentials for the systems we knew about. Three weeks later, a contractor reported that an internal Jira webhook was still firing from an agent they'd built. The agent had been registered nowhere. It was running on a piece of infrastructure they'd stood up themselves, using credentials that hadn't been included in the offboarding checklist because nobody knew the agent existed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M×N integration hell.&lt;/strong&gt; Each new agent that needed to call tools had to build its own integration with each tool. Eight agents, six tools: 48 potential integration points, each with its own credential management, error handling, and retry logic. When a tool API changed, we had to find and update every agent that used it manually.&lt;/p&gt;

&lt;p&gt;The registry fixes all four of these. Shadow agents can't exist if registration is a prerequisite for deployment. Version state is tracked centrally. Offboarding is "revoke this agent's access in the registry." M×N integrations collapse to each tool being registered once, each agent pointing to the registry.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a registry is not
&lt;/h2&gt;

&lt;p&gt;Worth being explicit, because I conflated some things early on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's not a deployment platform.&lt;/strong&gt; The registry tracks what's running, but it doesn't run the agents. Deployment is a separate concern — Kubernetes, a container orchestrator, whatever your team uses. The registry is the catalog; deployment is the execution layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's not an orchestration framework.&lt;/strong&gt; LangGraph, CrewAI, AutoGen — those handle how agents coordinate with each other. The registry handles what agents exist and whether they're authorized to talk to each other at all. These are complementary, not competing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's not an MCP server list.&lt;/strong&gt; An MCP server registry catalogs available tools. An agent registry catalogs available agents. Both are useful. Both are needed. TrueFoundry calls the combination of the two a unified MCP and Agents Registry — one place where you can see both the tools agents can use and the agents themselves. That unification matters because the governance question is really "which agents can call which tools" — you need both catalogs to answer it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's not just a spreadsheet.&lt;/strong&gt; The spreadsheet version of an agent catalog is a snapshot. A proper registry is stateful — it connects to your observability layer and shows live performance, not last-week's-update performance. When TrueFoundry's registry shows you an agent's success rate, it's pulling from real-time telemetry, not a manually updated field.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture pattern that makes it work
&lt;/h2&gt;

&lt;p&gt;The pattern that made everything cleaner: every agent registers with the gateway using the Model Context Protocol. Once registered, the agent looks like a standard MCP endpoint to every other agent in the system. A LangGraph agent and a CrewAI agent and a custom HTTP service all appear as the same kind of thing to the orchestrator — they're all just callable endpoints with a defined schema.&lt;/p&gt;

&lt;p&gt;This is what solves the M×N problem architecturally. Each tool is registered once. Each agent is registered once. The registry maps which agents can call which tools. Agents don't need to know how to integrate with Jira or Slack or your internal data API directly — they call the registry endpoint, and the registry handles routing, credentials, and access control.&lt;/p&gt;

&lt;p&gt;The other pattern that mattered: the registry as the access control enforcement point. Before this, access control for agent-to-agent calls lived in application code — each agent decided for itself whether to accept a call. That's as reliable as it sounds. Moving access control to the registry layer means it's enforced centrally, consistently, and not dependent on each individual agent implementation being correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we ended up using
&lt;/h2&gt;

&lt;p&gt;After the security audit incident, we evaluated a few options and landed on &lt;a href="https://www.truefoundry.com/blog/ai-agent-registry" rel="noopener noreferrer"&gt;TrueFoundry's Agent Registry&lt;/a&gt;. I can explain specifically what mattered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unified agent and MCP catalog.&lt;/strong&gt; Every agent and every tool visible in one place. When the security team asks "which agents have access to the internal data API," the answer is a query, not a two-day investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framework-agnostic registration.&lt;/strong&gt; We have agents on LangGraph, one on CrewAI, and two custom HTTP services. The registry handles all of them through a standard registration interface. Once registered, governance policies apply regardless of what framework built the agent — the same RBAC rules, the same audit trail, the same access policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live performance tracking.&lt;/strong&gt; The registry shows each agent's success rate, average latency, and last error pulled from the observability layer. We set a routing rule: for production code changes, only route to agents with &amp;gt;90% success rate on the latest eval run. The registry enforces this automatically rather than requiring a human to check before deploying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A2A communication via MCP.&lt;/strong&gt; When an agent needs to call another agent, it goes through the registry. The registry checks whether the calling agent is authorized to invoke the target agent, handles the call, and logs the interaction with both agent identities. The over-privileged sub-agent problem — where a spawned agent inherits more permissions than it should — is closed at the registry layer.&lt;/p&gt;

&lt;p&gt;The tradeoff: TrueFoundry is Kubernetes-native, so there's real infrastructure investment if you're not already on K8s. For a team of 5 with 3 agents, a YAML file is probably enough. The inflection point for us was around 10 agents across multiple teams with compliance requirements.&lt;/p&gt;




&lt;h2&gt;
  
  
  When you actually need one
&lt;/h2&gt;

&lt;p&gt;The honest answer: you need a registry before you think you do, and you'll know you needed it earlier after you don't have one.&lt;/p&gt;

&lt;p&gt;Some concrete signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can't answer "which agents do we have in production" without asking multiple people&lt;/li&gt;
&lt;li&gt;A team deploys an agent and you find out about it from a runaway cost alert rather than a check-in&lt;/li&gt;
&lt;li&gt;An engineer leaves and you realize you don't know what credentials their agents were using&lt;/li&gt;
&lt;li&gt;Two teams built agents that do similar things because neither knew the other existed&lt;/li&gt;
&lt;li&gt;You want to introduce rate limits or access controls on an internal system and don't know how many agents are calling it&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  If any of those describe your situation, the registry conversation is overdue. If none of them do yet, you're probably still small enough that the overhead isn't justified.
&lt;/h2&gt;

&lt;p&gt;What pushed you toward building or adopting a registry — and what does your current agent catalog look like? Curious whether most teams are still on the spreadsheet version or if the registry infrastructure has actually caught up to the agent deployment pace. Drop it in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>mlops</category>
      <category>devops</category>
    </item>
    <item>
      <title>LiteLLM vs OpenRouter: I Used Both. Here's Where Each One Actually Broke.</title>
      <dc:creator>Sahajmeet Kaur</dc:creator>
      <pubDate>Fri, 26 Jun 2026 06:30:00 +0000</pubDate>
      <link>https://dev.to/sahajmeet_kaur_/litellm-vs-openrouter-i-used-both-heres-where-each-one-actually-broke-53gb</link>
      <guid>https://dev.to/sahajmeet_kaur_/litellm-vs-openrouter-i-used-both-heres-where-each-one-actually-broke-53gb</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LiteLLM and OpenRouter are not competing products - LiteLLM is a self-hosted open-source proxy you run yourself, OpenRouter is a managed cloud aggregator. The comparison only makes sense if you understand which problem you're actually trying to solve&lt;/li&gt;
&lt;li&gt;LiteLLM's ceiling: SSO and team-level budget enforcement are behind the enterprise license, Redis dependency for distributed rate limiting has a failure mode worth knowing about, YAML config gets unwieldy at scale&lt;/li&gt;
&lt;li&gt;OpenRouter's ceiling: everything lives in OpenRouter's infrastructure, no self-hosted models, no team-level governance, a 5.5% credit purchase fee that compounds at high volume&lt;/li&gt;
&lt;li&gt;Where we landed: neither was the right long-term answer for our setup - this post explains why&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;When I started evaluating LLM routing options about a year ago, most of the "LiteLLM vs OpenRouter" content I found was comparing features in a matrix and calling it a day. It wasn't that useful because it missed the more important question: these tools have fundamentally different architectures, different deployment models, and different ceilings. Picking between them is less "which has more features" and more "which problem are you actually trying to solve right now."&lt;/p&gt;

&lt;p&gt;I ran LiteLLM in staging for about six weeks and used OpenRouter for a parallel workload. Here's what I actually found.&lt;/p&gt;




&lt;h2&gt;
  
  
  What each tool is (the architecture distinction that matters)
&lt;/h2&gt;

&lt;p&gt;Before any feature comparison: LiteLLM and OpenRouter are not the same category of thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM&lt;/strong&gt; is an open-source Python library and proxy server you host yourself. It gives you a unified, OpenAI-compatible API in front of 100+ model providers. You pip install it, run it as a Docker container, and it lives in your infrastructure. You own the uptime, the scaling, and the configuration. The Anthropic and OpenAI credentials live in your environment. Nothing leaves your network unless you tell it to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenRouter&lt;/strong&gt; is a managed cloud service. You create an account, buy credits, and point your OpenAI SDK at &lt;code&gt;https://openrouter.ai/api/v1&lt;/code&gt; with an OpenRouter API key. You don't run anything. The model request goes through OpenRouter's infrastructure, which routes to whichever provider serves that model. Their business model is a 5.5% fee on credit purchases, with provider token rates passed through without markup.&lt;/p&gt;

&lt;p&gt;The practical implication: if you need your prompts to stay inside your infrastructure, OpenRouter is immediately off the table. If you want zero infrastructure overhead and just want to access 200+ models through one API key in the next ten minutes, LiteLLM has a steeper setup curve than OpenRouter.&lt;/p&gt;

&lt;p&gt;Once you understand that distinction, the comparison becomes a lot cleaner.&lt;/p&gt;




&lt;h2&gt;
  
  
  LiteLLM: where it's genuinely good and where it breaks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What works well
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Provider coverage and SDK compatibility.&lt;/strong&gt; LiteLLM supports 100+ providers - OpenAI, Anthropic, AWS Bedrock, Google Vertex, Mistral, Groq, Cohere, Together AI, Ollama, and more through a single OpenAI-compatible format. You write standard OpenAI SDK code once, and routing to a different provider is a model string change. For teams with self-hosted models, this is particularly useful because LiteLLM routes to your own endpoints with the same interface as cloud providers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load balancing across deployments.&lt;/strong&gt; You can define multiple deployments of the same model across providers or regions, and LiteLLM load-balances across them with configurable strategies: simple-shuffle, least-busy, latency-based, cost-based. This is the right level of control for teams managing both cloud and self-hosted infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Virtual keys with per-key budgets.&lt;/strong&gt; Each virtual key can have its own budget and rate limit. For a small team where one engineer owns the gateway config, this is enough. You issue a key per service, set a budget, done.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it breaks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;YAML at scale.&lt;/strong&gt; LiteLLM config is YAML. For a solo engineer with three models, it's fine. For a platform team managing 40 engineers across four squads with different model access requirements, it becomes a coordination problem. Every time a squad needs a new model routing rule, someone has to edit the same YAML file, test the change, and redeploy. We had two merge conflicts in one week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSO is Enterprise only.&lt;/strong&gt; We needed Okta. That's behind the enterprise license. The open-source version doesn't support corporate SSO. For most organizations past a certain size, this is a hard requirement, not a preference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Redis dependency.&lt;/strong&gt; Distributed rate limiting in LiteLLM requires Redis. This is fine in normal operation. The edge case: if Redis has an availability issue, LiteLLM's rate limiting can fail open - requests go through with no limits enforced. In a runaway job scenario, this means your safety net disappears at exactly the wrong moment. We tested this. It behaved as documented, which means the behavior is intentional but it's worth understanding before you depend on it in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team-level budget enforcement.&lt;/strong&gt; Per-key budgets work. Per-team budgets that span multiple keys with a shared ceiling — the kind of thing a platform team needs to charge back spend to different business units - require more config work and, the enterprise tier handles this cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Solo engineers and small teams prototyping self-hosted model access. MIT license, zero vendor relationship, full infrastructure control. The SSO and governance features are there if you pay for the enterprise tier - budget for that if you're running more than 10 engineers through it.&lt;/p&gt;




&lt;h2&gt;
  
  
  OpenRouter: where it's genuinely good and where it breaks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What works well
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Zero setup to first request.&lt;/strong&gt; Create account, buy credits, change base URL. That's it. No infrastructure to run, no container to maintain, no YAML to write. For rapid prototyping or a hackathon, this is the right level of effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model breadth.&lt;/strong&gt; 300+ models accessible through one API key. Including models that would otherwise require separate API accounts with separate providers — Mistral, Nous, Perplexity, and others available through OpenRouter before they had easy direct API access. For experimentation across frontier models, this is genuinely useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intelligent routing options.&lt;/strong&gt; OpenRouter's routing suffixes are a nice abstraction: &lt;code&gt;:nitro&lt;/code&gt; routes to highest-throughput provider, &lt;code&gt;:floor&lt;/code&gt; routes to cheapest, &lt;code&gt;:online&lt;/code&gt; injects web search results. You can also pass a &lt;code&gt;models&lt;/code&gt; array with fallback priority. For teams that don't want to think about provider selection, the defaults work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unified billing.&lt;/strong&gt; One invoice, one credit balance, across every provider you're using. For teams where multi-provider accounting is a headache, this is real simplification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it breaks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Everything lives in OpenRouter's infrastructure.&lt;/strong&gt; Your prompts, your responses, your API keys - all pass through OpenRouter's systems. For teams with data residency requirements, regulated workloads, or compliance obligations that specify where inference data can travel, this is a hard blocker. There's no self-hosted option and no VPC deployment path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 5.5% credit fee compounds.&lt;/strong&gt; OpenRouter charges 5.5% on credit purchases. Provider token rates pass through without markup. On low volumes, this is fine. At $50k/month in inference spend, you're paying $2,750/month to OpenRouter in platform fees on top of model costs. At $200k/month, it's $11,000/month. The math is worth doing before you commit to this as your production routing layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No team-level governance.&lt;/strong&gt; OpenRouter doesn't have a concept of "team A can only use these models" or "developer X has a $500/month cap." Access control is per API key. Budget management is at the account level. For a solo developer this is fine. For a platform team managing 40 engineers with different access requirements, you're building governance on top of OpenRouter rather than getting it from OpenRouter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No self-hosted model support.&lt;/strong&gt; If you're running a fine-tuned model on your own infrastructure, OpenRouter can't route to it. Your routing split between OpenRouter (for cloud providers) and some other system (for your own models) means split observability, split cost tracking, and split governance. We had this problem and it was worse than it sounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Individual developers and small teams who want fast access to many models with zero infrastructure. Also genuinely useful as the cloud-provider routing layer for teams that pair it with a self-hosted solution for their own models - though that means managing two systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Head-to-head on the things that matter in production
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;OpenRouter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted (Docker, pip)&lt;/td&gt;
&lt;td&gt;Managed cloud only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data residency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your infrastructure&lt;/td&gt;
&lt;td&gt;OpenRouter's infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Provider coverage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100+ (incl. self-hosted)&lt;/td&gt;
&lt;td&gt;300+ (cloud only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-hosted model support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SSO / OKTA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enterprise license&lt;/td&gt;
&lt;td&gt;Enterprise tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per-team budget caps&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited without Enterprise&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limiting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Redis-backed (fail-open risk)&lt;/td&gt;
&lt;td&gt;Managed (their infra)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic caching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ (Redis)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Guardrails&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Basic hooks&lt;/td&gt;
&lt;td&gt;Not native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compliance certs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open-source + Enterprise license&lt;/td&gt;
&lt;td&gt;5.5% credit purchase fee&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP / agent support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Config model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;YAML file&lt;/td&gt;
&lt;td&gt;Dashboard + API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Good for prototyping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅✅ (easier)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Good for 40+ engineers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;With Enterprise license&lt;/td&gt;
&lt;td&gt;With governance workarounds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Where we went after hitting both ceilings
&lt;/h2&gt;

&lt;p&gt;We ran LiteLLM for about six weeks. The YAML config problem was manageable. The SSO requirement wasn't - we needed Okta and weren't going to pay the enterprise license for a gateway that still had the Redis failure-open edge case and no native self-hosted model observability.&lt;/p&gt;

&lt;p&gt;We used OpenRouter for a parallel data enrichment workload during the same period. It was excellent for the first two months. Then the workload scaled, the data residency question came from legal, and the 5.5% fee at our run rate became a real number on a real spreadsheet.&lt;/p&gt;

&lt;p&gt;Neither tool was wrong. Both were right for earlier stages of what we were building. The problem was that we'd outgrown the ceiling of both at roughly the same time.&lt;/p&gt;

&lt;p&gt;We ended up on &lt;a href="https://www.truefoundry.com/ai-gateway" rel="noopener noreferrer"&gt;TrueFoundry's AI Gateway&lt;/a&gt;. The specific things that mattered for our situation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In-memory rate limiting, no Redis dependency.&lt;/strong&gt; Auth, budget checks, and rate limits all happen in-memory in the gateway process - no external dependency in the hot path, no failure-open edge case under Redis load. The benchmarks show ~3–4ms added latency at 350+ RPS on a single vCPU, which matched our own testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full VPC deployment.&lt;/strong&gt; Everything runs inside our Kubernetes cluster. No inference data, no control plane traffic leaves our infrastructure. This answered the legal/compliance question cleanly - no carve-outs, no "the dashboard is SaaS but the inference is on-prem" nuance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted and cloud models unified.&lt;/strong&gt; Our Llama deployment and our OpenAI and Anthropic traffic go through the same gateway endpoint. Same cost attribution dashboard, same rate limiting, same audit trail. No split observability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-team budgets enforced on the hot path.&lt;/strong&gt; When a team hits their token budget, subsequent requests return rate-limit errors before spend accumulates. The enforcement happens before the API call, not as an alert after.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSO out of the box.&lt;/strong&gt; Okta via SAML, no enterprise license gating.&lt;/p&gt;

&lt;p&gt;The tradeoff: If you're a two-person team shipping fast, LiteLLM or OpenRouter will get you further faster. The decision point for us was when compliance requirements and multi-team governance became real - that's when the infrastructure investment in a proper gateway started paying off.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to pick between them for your situation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use LiteLLM if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want full infrastructure control and MIT-licensed open source&lt;/li&gt;
&lt;li&gt;You have self-hosted models that need to route through the same system as your cloud providers&lt;/li&gt;
&lt;li&gt;You're comfortable managing YAML config and owning the gateway's uptime&lt;/li&gt;
&lt;li&gt;You can absorb the enterprise license cost when you need SSO and team governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use OpenRouter if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want zero infrastructure to manage and the fastest path to first request&lt;/li&gt;
&lt;li&gt;You need access to many models, including newer ones from smaller providers&lt;/li&gt;
&lt;li&gt;Your workload doesn't have data residency or compliance requirements&lt;/li&gt;
&lt;li&gt;You're fine with account-level billing and don't need per-team governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Consider moving beyond both when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Legal or compliance asks where your inference data lives and "OpenRouter's servers" isn't acceptable&lt;/li&gt;
&lt;li&gt;You have self-hosted models that need the same governance as your cloud provider traffic&lt;/li&gt;
&lt;li&gt;Multiple teams need separate budget caps enforced before they spend, not after&lt;/li&gt;
&lt;li&gt;The Redis failure-open scenario is a real risk for your rate limiting SLA&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;What pushed you toward LiteLLM or OpenRouter — and what made you stay or leave? Has anyone found a clean way to unify governance across both (self-hosted via LiteLLM + cloud via OpenRouter) without running two separate observability stacks. Drop it in the comments.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>devops</category>
      <category>mlops</category>
    </item>
    <item>
      <title>I checked six LLM-as-judge tools against human labels. The scoreboard was the wrong thing to read.</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Thu, 25 Jun 2026 17:51:07 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/i-checked-six-llm-as-judge-tools-against-human-labels-the-scoreboard-was-the-wrong-thing-to-read-2imp</link>
      <guid>https://dev.to/maya_andersson_dev/i-checked-six-llm-as-judge-tools-against-human-labels-the-scoreboard-was-the-wrong-thing-to-read-2imp</guid>
      <description>&lt;p&gt;Most LLM-as-judge comparisons rank tools by which one gives you a number fastest. That is the wrong axis. A judge you have not validated against human labels is not a measurement, it is a vibe with a decimal point. So I ran six tools the way a methodologist would: not "which one scores," but "which one helps me prove the score is trustworthy."&lt;/p&gt;

&lt;p&gt;Trust here has a specific meaning. An LLM judge inherits known failure modes: position bias (it favors the first answer it sees), verbosity bias (it rewards longer outputs), and self-preference (it scores outputs from its own model family higher). None of these show up in the score itself. They show up only when you compare the judge against a human-labeled set and compute agreement. The standard instrument for that is Cohen's kappa, not raw accuracy, because raw accuracy lies whenever your classes are imbalanced.&lt;/p&gt;

&lt;p&gt;So the criterion I graded each tool on was simple: how much friction does it put between me and a confusion matrix against human labels?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepEval (G-Eval).&lt;/strong&gt; The broadest eval breadth of the group, honestly. Chain-of-thought scoring via G-Eval, a pytest-style harness, a large catalog of metrics. It is the tool I reach for when I want coverage. What it does not do for you is the human-agreement step. You write the judge, you collect the labels, you compute kappa yourself. Reference: Liu et al., "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" (arXiv:2303.16634), which is worth reading precisely because it measures Spearman correlation with human judgment rather than asserting it. (G-Eval is the paper's method; DeepEval is the tool that implements it.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confident AI.&lt;/strong&gt; The hosted layer on top of DeepEval. Adds storage, sharing, a dashboard. The validation gap is identical, because it is the same engine underneath. You get a nicer place to keep results, not a built-in human-agreement workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidently.&lt;/strong&gt; Strong on report dashboards and drift detection. If your problem is "the judge looked fine in March and I want to know when it drifts," this fits. It is monitoring-shaped, not validation-shaped. It will not hand you a kappa against a held-out human set as a first-class step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Braintrust.&lt;/strong&gt; The side-by-side run-comparison UI is genuinely useful for spotting where two judge configurations disagree. That is disagreement-spotting, which is upstream of validation but not the same as it. Seeing two columns diverge tells you something is off, not whether either column agrees with a human.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Promptfoo.&lt;/strong&gt; Treats judges as test assertions. Lightweight, CI-friendly, easy to wire into a pipeline. Thin on judge-versus-human statistics by design, it is a testing tool, not a measurement-theory tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future AGI.&lt;/strong&gt; Sits in the middle of this list, not at the top of it. It is an end-to-end open-source platform rather than an eval-only tool, and its evaluation surface is hybrid: deterministic functions, grounded checks, and LLM-as-judge under one interface. The hybrid framing is the interesting part for this question, because the deterministic and grounded paths give you cheaper anchors to sanity-check the judge path against. It still does not crown itself the answer to the human-agreement problem. You bring the labels. DeepEval has broader raw eval breadth; Future AGI trades some of that breadth for the hybrid local-plus-judge structure. (Source: github.com/future-agi/future-agi.)&lt;/p&gt;

&lt;p&gt;The finding across all six: not one of them treats "compute judge agreement with human labels and show me the confusion matrix" as the default first action. Every tool optimizes for producing a score. The validation is left as an exercise for the user, which is exactly the part most teams skip.&lt;/p&gt;

&lt;p&gt;Here is the procedure I actually run, regardless of tool:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hand-label 200 examples on the dimension I care about. Two annotators where I can afford it, so I can also measure human-human agreement.&lt;/li&gt;
&lt;li&gt;Run the candidate judge on the identical 200.&lt;/li&gt;
&lt;li&gt;Compute Cohen's kappa, not accuracy.&lt;/li&gt;
&lt;li&gt;Deploy the judge only when kappa clears roughly 0.6, and even then I read the confusion matrix to see which class it gets wrong.&lt;/li&gt;
&lt;li&gt;Rewrite the rubric against those errors and re-measure.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The tool choice changes how pleasant steps 2 through 5 are. It does not change whether you have to do them.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why Cohen's kappa instead of accuracy?&lt;/strong&gt; Accuracy is inflated by class imbalance. If 90 percent of your examples are "pass," a judge that says "pass" every time scores 90 percent accuracy and zero usefulness. Kappa corrects for agreement that would happen by chance, so it does not reward that degenerate strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What kappa is good enough?&lt;/strong&gt; There is no universal threshold, but I treat roughly 0.6 as the floor for deploying a judge on a non-trivial dimension, and I want to see where the disagreements land before trusting it. Lower can be acceptable on genuinely subjective dimensions, see the open question below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need 200 labels specifically?&lt;/strong&gt; No. 200 is a practical balance between annotation cost and a confusion matrix you can actually read. The point is a held-out human set, not the exact count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can one tool just do the validation for me?&lt;/strong&gt; None of the six I tested ship human-agreement-with-confusion-matrix as the default workflow. They produce scores; you supply and compare the labels.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open question
&lt;/h2&gt;

&lt;p&gt;Cohen's kappa assumes a meaningful ground truth to agree with. On highly subjective dimensions (helpfulness, tone, "did this answer feel complete"), human annotators themselves often only reach kappa of 0.4 to 0.5 with each other. A judge cannot beat the ceiling set by human-human disagreement. So how should we report a judge's kappa relative to the human-human kappa on the same set, and is there a clean way to estimate the subjectivity ceiling of a dimension before we spend the labeling budget? If you have a method you trust here, I would like to see it.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>evaluation</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Request tagging for LLM evals with Bifrost dimension headers</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Thu, 25 Jun 2026 16:01:58 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/request-tagging-for-llm-evals-with-bifrost-dimension-headers-38li</link>
      <guid>https://dev.to/marcuswwchen/request-tagging-for-llm-evals-with-bifrost-dimension-headers-38li</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Request tagging with Bifrost dimension headers (&lt;code&gt;x-bf-dim-*&lt;/code&gt;) stamps checkpoint and run metadata onto every LLM eval call, so you slice scores by model version instead of guessing which change moved the aggregate.&lt;/p&gt;

&lt;p&gt;We ran roughly 12,000 eval requests across four fine-tuned checkpoints last sprint, and when aggregate accuracy moved three points I couldn't tell which checkpoint produced which response. Our eval harness stored prompts and scores in one table; the routing layer recorded latency and provider somewhere else, and nothing carried the experiment ID end to end. We moved the eval traffic behind &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;the open-source AI gateway&lt;/a&gt; from Maxim AI, and used its custom dimension headers to stamp each request with the checkpoint and run ID. Request tagging turned a join-by-timestamp guessing game into a filter.&lt;/p&gt;

&lt;h2&gt;
  
  
  What request tagging means for LLM evals
&lt;/h2&gt;

&lt;p&gt;Request tagging attaches key-value metadata to each LLM API call so downstream logs, traces, and metrics can be grouped by that metadata. In Bifrost, any header prefixed &lt;code&gt;x-bf-dim-*&lt;/code&gt; becomes a custom dimension that is auto-forwarded to logs, traces, and Prometheus, which lets you group eval scores by checkpoint, prompt version, or suite without modifying your harness.&lt;/p&gt;

&lt;p&gt;I lead the fine-tuning and evaluation team at Nexus Labs, a Series B company building enterprise agent automation. Our problem was attribution, not measurement. A scoring function that returns 0.81 is useless if you can't tie that number to &lt;code&gt;agentqa-v7-lora-r16&lt;/code&gt; versus &lt;code&gt;agentqa-v6&lt;/code&gt;. Most eval setups solve this by threading an experiment ID through every layer of application code, which breaks the moment someone forgets a kwarg. Pushing the metadata into a request header at the gateway means the harness stays dumb and the dimension travels with the request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stamping requests with x-bf-dim headers
&lt;/h2&gt;

&lt;p&gt;Bifrost is a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for the OpenAI base URL, so the only change to our harness was the &lt;code&gt;base_url&lt;/code&gt; and three extra headers. The gateway holds the provider keys, so the client API key is unused.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unused-bifrost-holds-keys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;eval_case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;extra_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-bf-dim-checkpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agentqa-v7-lora-r16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-bf-dim-run-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval-2026-06-19-batch3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-bf-dim-suite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool-routing-adversarial&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every request in that batch now carries three dimensions. When the scorer writes its verdict, I don't need to correlate anything by hand; the gateway already recorded the dimensions next to the latency, token counts, and resolved provider. The same endpoint fronts &lt;a href="https://docs.getbifrost.ai/providers/supported-providers/overview" rel="noopener noreferrer"&gt;20+ providers&lt;/a&gt;, so when I shadow a hosted model against a self-hosted checkpoint, both legs of the comparison get tagged identically and land in the same store.&lt;/p&gt;

&lt;h2&gt;
  
  
  Slicing eval results in observability
&lt;/h2&gt;

&lt;p&gt;The dimensions are only useful if the read path is cheap. Bifrost writes telemetry through &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;async observability&lt;/a&gt; with under 0.1ms of added overhead, using SQLite by default and Postgres for production volume. The sinks include Prometheus, OpenTelemetry, Datadog, and BigQuery, so I query the same dimensions from whichever tool the rest of the team already watches.&lt;/p&gt;

&lt;p&gt;In practice I pull a Prometheus query grouped by &lt;code&gt;checkpoint&lt;/code&gt; and &lt;code&gt;suite&lt;/code&gt;, then compute per-slice accuracy from the scorer table joined on &lt;code&gt;run_id&lt;/code&gt;. That is where the three-point aggregate move resolved: checkpoint v7 gained on the general suite and lost on the adversarial tool-routing suite, which the average had flattened. This kind of per-segment attribution is the whole reason I distrust single-number eval reports. Aggregate metrics are a summary statistic, and summary statistics hide structure by design. The methodology argument is old; the &lt;a href="https://arxiv.org/abs/2211.09110" rel="noopener noreferrer"&gt;HELM evaluation work&lt;/a&gt; made the case for multi-metric, multi-scenario reporting years ago. Tagging at the gateway is the plumbing that makes per-scenario reporting cheap enough to actually do on every run.&lt;/p&gt;

&lt;p&gt;One detail that saved me time: the dimensions are arbitrary strings, so I tag prompt-template hashes too. When a template edit slipped into a run, the &lt;code&gt;prompt_hash&lt;/code&gt; dimension showed two distinct values inside one supposedly clean batch, and I caught a contaminated comparison before it reached a decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;This is not free infrastructure. Bifrost runs as a separate Go service, so you operate one more process, and a serious deployment needs Postgres rather than the default SQLite once you push real eval volume through it. If your stack is pure Python and you want everything in-process, a library like LiteLLM keeps fewer moving parts, at the cost of the gateway-level telemetry I'm describing here. Bifrost's ecosystem is also younger than LiteLLM's, so you will find fewer community examples for edge integrations.&lt;/p&gt;

&lt;p&gt;The dimension headers are forwarded, not validated. Nothing stops a typo in &lt;code&gt;x-bf-dim-checkpoint&lt;/code&gt; from creating a phantom slice, so I keep the tag values in one constants module and assert against it in the harness. Cluster-mode horizontal scaling is an &lt;a href="https://www.getmaxim.ai/bifrost/enterprise" rel="noopener noreferrer"&gt;enterprise feature&lt;/a&gt;, not part of the open-source core, which matters if your eval fleet outgrows a single instance. For a four-checkpoint sprint on one box, none of this bit me. Know your scale before you assume it won't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Request tagging with &lt;code&gt;x-bf-dim-*&lt;/code&gt; dimension headers moved attribution out of my eval code and into the gateway, which is where it belongs when many checkpoints and suites share one pipeline. The model was never the hard part. Knowing which model produced which number was. If you want to see the tagging and observability path end to end, book a demo: &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;https://getmaxim.ai/bifrost/book-a-demo&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;Bifrost observability docs&lt;/a&gt; for the metrics and sink configuration&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/providers/supported-providers/overview" rel="noopener noreferrer"&gt;Supported providers overview&lt;/a&gt; for the unified endpoint&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;Bifrost buyer's guide&lt;/a&gt; for deployment patterns&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2211.09110" rel="noopener noreferrer"&gt;HELM: Holistic Evaluation of Language Models&lt;/a&gt; on multi-metric eval reporting&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://opentelemetry.io/docs/" rel="noopener noreferrer"&gt;OpenTelemetry documentation&lt;/a&gt; for trace and metric standards&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>llm</category>
    </item>
    <item>
      <title>Async inference for long-running diffusion jobs through Bifrost</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Thu, 25 Jun 2026 14:53:19 +0000</pubDate>
      <link>https://dev.to/elise_moreau/async-inference-for-long-running-diffusion-jobs-through-bifrost-4lo7</link>
      <guid>https://dev.to/elise_moreau/async-inference-for-long-running-diffusion-jobs-through-bifrost-4lo7</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Async inference through Bifrost lets long-running diffusion jobs submit and poll with the &lt;code&gt;x-bf-async&lt;/code&gt; header, so SDXL batches survive the 60-second proxy timeouts that were killing our product-photo pipeline.&lt;/p&gt;

&lt;p&gt;A large product-variant batch in our pipeline at Photoroom takes 70 to 110 seconds to render across &lt;a href="https://arxiv.org/abs/2307.01952" rel="noopener noreferrer"&gt;SDXL&lt;/a&gt;, and our &lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html#connection-idle-timeout" rel="noopener noreferrer"&gt;AWS ALB closes any connection idle past 60 seconds by default&lt;/a&gt;. When we increased batch sizes to cut per-image GPU cost, the synchronous calls began returning 504s before the diffusion step finished. Clients retried on the 504, which double-queued the same render and roughly doubled GPU load during peak hours. We moved the generation traffic behind &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;the open-source AI gateway&lt;/a&gt; from Maxim AI, and switched the slow jobs to async inference so the HTTP connection no longer has to stay open for the full render.&lt;/p&gt;

&lt;h2&gt;
  
  
  What async inference means at an AI gateway
&lt;/h2&gt;

&lt;p&gt;Async inference at an AI gateway lets a client submit a generation job, receive a job ID, and poll for the result instead of holding one HTTP connection open for the whole compute. &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; exposes this with the &lt;code&gt;x-bf-async: true&lt;/code&gt; request header and an &lt;code&gt;x-bf-async-id&lt;/code&gt; returned on submission, so a 100-second diffusion call decouples from any proxy or load-balancer idle limit between the client and the gateway.&lt;/p&gt;

&lt;p&gt;The nuance here is that the GPU work does not get faster. What changes is the connection model. A synchronous request ties the success of a 100-second render to a TCP connection staying healthy for 100 seconds across two network hops. Async breaks that coupling: the submit call returns in milliseconds, and the poll calls are short and idempotent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Submitting and polling jobs with x-bf-async
&lt;/h2&gt;

&lt;p&gt;The submit request looks like a normal call through the OpenAI-compatible endpoint, with one extra header. Bifrost runs as a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt;, so our existing image client only changed at the header layer, not the request body.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Submit a long-running generation job&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/v1/images/generations &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-async: true"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{\n    "model": "openai/gpt-image-1",\n    "prompt": "studio product shot, white seamless background",\n    "n": 8\n  }'&lt;/span&gt;
&lt;span class="c"&gt;# Response returns: x-bf-async-id: job_8f2c...&lt;/span&gt;

&lt;span class="c"&gt;# Poll for the result with the returned job id&lt;/span&gt;
curl http://localhost:8080/v1/images/generations &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-async-id: job_8f2c..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To be precise about what we measured: the submit call returns before the model starts decoding, so the client thread is free in well under a second. The poll interval we settled on is two seconds, which keeps the queue worker cheap without adding noticeable tail latency on completion. We retired the old retry-on-504 logic entirely, because there is no long-held connection left to fail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tagging and observing jobs in flight
&lt;/h2&gt;

&lt;p&gt;Once jobs run detached, you need a way to attribute each one, otherwise a slow render is invisible until a customer complains. Bifrost forwards custom dimension headers prefixed &lt;code&gt;x-bf-dim-*&lt;/code&gt; into logs, traces, and Prometheus, so we tag every submission with the team and the experiment that created it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-dim-team: catalog-enrichment"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-dim-experiment: sdxl-batch-v3"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those tags land in the &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt; layer, which Bifrost writes asynchronously at under 0.1ms overhead per request. We now graph time-to-completion per experiment instead of one aggregate, which is how we found that one prompt template was three times slower than the rest of the batch. For cost attribution across teams, we pair the dimension tags with scoped &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys&lt;/a&gt; so each business unit carries its own budget against the same provider pool.&lt;/p&gt;

&lt;p&gt;Routing also mattered here. The gateway unifies &lt;a href="https://docs.getbifrost.ai/providers/supported-providers/overview" rel="noopener noreferrer"&gt;20+ providers&lt;/a&gt; behind one endpoint, and the same async mechanism works whether the job lands on a self-hosted SDXL deployment or a hosted image model, so we can fail a batch over without rewriting the client.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Async is the wrong default for fast paths. An interactive thumbnail that renders in 900ms gains nothing from submit-and-poll; you add a second round trip and a polling loop for a job that would have finished inside the original connection. We only route batches above roughly 30 seconds of expected render time through &lt;code&gt;x-bf-async&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The honest limitation on the Bifrost side is operational. Production deployments need Postgres backing the gateway, and you self-host the whole thing, which is real infrastructure to run and patch rather than a managed endpoint. The benchmark numbers are strong: Bifrost sustains &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;5,000 RPS on a single instance at 100% success with about 11µs of overhead on a t3.xlarge&lt;/a&gt;, but those figures describe a node you operate. The ecosystem is also younger than older proxies like LiteLLM, so some integration paths have fewer community examples to copy from. For our team the trade was clearly worth it, since the alternative was tuning load-balancer timeouts per route and still losing jobs at the tail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Async inference did not make our diffusion models faster; it made long renders survivable by removing the dependency on a single long-lived connection. The &lt;code&gt;x-bf-async&lt;/code&gt; submit-and-poll model, plus dimension tags for attribution, turned a class of intermittent 504s into a measurable queue we can reason about. If you run image or video generation jobs that routinely cross your proxy timeout, this is the pattern I would try first.&lt;/p&gt;

&lt;p&gt;If you want to see async inference and the rest of the gateway against your own workload, book a demo: &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;https://getmaxim.ai/bifrost/book-a-demo&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;Bifrost observability docs&lt;/a&gt; for the async write path and metrics sinks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;Bifrost benchmarks&lt;/a&gt; for the overhead and throughput figures&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2307.01952" rel="noopener noreferrer"&gt;SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html#connection-idle-timeout" rel="noopener noreferrer"&gt;AWS Application Load Balancer connection idle timeout&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>computervision</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
