Atelje Vagabond

Posted on • Originally published at blog.ateljevagabond.se
Why You Need MLOps: When CI/CD for Machine Learning Becomes Mandatory

For six months, the team did everything right. They had a brilliant lead data scientist, Dr. Alan. They had a unique dataset. After hundreds of experiments and countless hours of training, they finally hit the magic number.

Accuracy: 94%.

The model converged. The investor demo was flawless. The funding was secured. The excitement in the room was palpable.

Then, they made the decision that breaks almost every new ML team:

"It works on Alan's machine. Let's just wrap it in an API and ship it to production tomorrow."

Six months of hard science was about to collide with the hard reality of software engineering.

[Comic: the journey from a successful model in a notebook to a chaotic, expensive production failure, and finally to a disciplined MLOps pipeline.]

The comic above isn't just a funny illustration; it is the autobiography of thousands of companies trying to deploy machine learning for the first time.

What followed for Dr. Alan's team wasn't a model failure. It was a system failure. The live data didn't match the clean training data. Predictions started silently degrading. Cloud costs exploded because GPU instances were left running idle. When things broke, no one could reproduce the exact combination of code and data that built the original model.

They learned an expensive lesson: A model in a Jupyter notebook is a hypothesis. A model in production is an obligation.

The Engineering Necessity of MLOps

The transition from a research prototype to a live production service introduces engineering challenges that break traditional software deployment methodologies.

Machine Learning Operations (MLOps) isn't just a buzzword or a set of "best practices." It is the engineering discipline required to apply DevOps principles—continuous integration, continuous delivery, infrastructure-as-code—specifically to the unique lifecycle of machine learning.

Unlike traditional software CI/CD, which primarily manages code versions, MLOps must manage three distinct, intertwined artifact types:

  1. Code: The training scripts, feature engineering logic, and serving wrappers.
  2. Data: The training datasets, validation splits, and live inference data.
  3. Models: The serialized artifacts (pickles, ONNX files), container images, and hyperparameters.

The complexity multiplies because a single production "rollout" must atomically coordinate all three. Furthermore, you need the ability to roll back any one artifact independently without causing cascading failures in the others.
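The three-artifact coordination problem can be sketched as a minimal, immutable rollout record; the version identifiers below are hypothetical, not a specific registry's format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rollout:
    """Immutable record tying the three artifact types together."""
    code_commit: str      # git SHA of training/serving code
    dataset_version: str  # content hash or registry tag of the training data
    model_version: str    # registry tag of the serialized model artifact

def rollback(history: list[Rollout], artifact: str, to_version: str) -> Rollout:
    """Roll back ONE artifact independently by creating a *new* rollout
    that keeps the other two pinned to their current versions."""
    current = history[-1]
    fields = {
        "code_commit": current.code_commit,
        "dataset_version": current.dataset_version,
        "model_version": current.model_version,
    }
    fields[artifact] = to_version
    new = Rollout(**fields)
    history.append(new)
    return new

# Hypothetical deployment history with one live rollout.
history = [Rollout("a1b2c3", "sales-2024-06", "churn-v12")]
# Roll back only the model; code and data stay exactly as deployed.
r = rollback(history, "model_version", "churn-v11")
```

Because each rollback produces a new immutable record instead of mutating history, the rollback itself stays traceable like any other deployment.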

Defining the Architectural Threshold: When Is MLOps Mandatory?

How do you know when you've moved past the "prototype" phase and need a formal MLOps framework? It's not based on your model's accuracy score. It is determined by the operational tempo and complexity of your system.

If your system meets these criteria, MLOps is no longer optional; it's a mandatory architectural requirement.

| System Characteristic | Local/Prototype Stage | Production Stage (MLOps Required) |
| --- | --- | --- |
| Deployment Frequency | Ad hoc (manual, periodic updates) | High velocity (weekly, daily, or automated triggers) |
| Data Variability | Fixed, frozen CSVs or tables | Streaming data, semi-structured inputs, evident concept drift |
| System Scale | Single user, local laptop GPU | Distributed throughput, many concurrent users, auto-scaling clusters |
| Auditability | Unknown (it's somewhere in a notebook) | Fully traceable lineage for compliance (banking, healthcare) |
| Failure Mode | Easily reproduced and debugged locally | Non-deterministic, difficult to trace (e.g., training-serving skew) |

The engineering threshold is crossed the moment the cost of manual monitoring, debugging, and firefighting exceeds the cost of building robust automation tooling.

The Key Architectural Pain Points MLOps Solves

Without a formal MLOps architecture, your system accumulates specific types of technical debt that degrade reliability and performance over time.

[Diagram: manual, error-prone ML deployments versus structured MLOps with feature stores and CI/CD pipelines.]

1. Training-Serving Skew (The Silent Killer)

This is the most critical engineering failure. It occurs when the logic used to calculate features during training differs from the logic used during real-time inference.

For example, if your data scientist calculates a "7-day rolling average" in Pandas for training, but the production engineer reimplements that logic in Java for the serving API, tiny discrepancies will creep in. The model receives data in production that is mathematically different from what it saw during training, leading to junk predictions despite a high accuracy score. MLOps solves this through standardized feature stores that ensure consistent logic.
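One standard fix — the core idea behind a feature store — is a single shared feature function that both pipelines import. A minimal sketch in plain Python, standing in for the Pandas/Java split described above:

```python
def rolling_avg_7d(values: list[float]) -> float:
    """Single source of truth for the '7-day rolling average' feature.
    The training pipeline and the serving API both import THIS function,
    so the logic cannot silently diverge between environments."""
    window = values[-7:]
    return sum(window) / len(window)

# Training time: feature computed over historical rows.
train_feature = rolling_avg_7d([10, 12, 11, 13, 12, 14, 15, 16])

# Serving time: the exact same code path, fed the live window.
serve_feature = rolling_avg_7d([12, 11, 13, 12, 14, 15, 16])

assert train_feature == serve_feature  # identical logic, identical result
```

The point is organizational as much as technical: there is exactly one implementation to review, test, and version, rather than two that must be kept in sync by hand.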

2. Model Drift and Data Decay

Model performance degrades over time not because the model "breaks," but because the world changes. The statistical properties of the input data shift. Without automated monitoring and automated retraining triggers, your model will confidently serve obsolete predictions.
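A crude drift signal can be as simple as watching how far the live input mean wanders from the training distribution; real systems typically use PSI or Kolmogorov-Smirnov tests, and the threshold below is an arbitrary illustration:

```python
import statistics

def drift_score(train_sample: list[float], live_sample: list[float]) -> float:
    """How many training-set standard deviations the live mean has shifted.
    A crude stand-in for PSI / KS-test style monitoring."""
    mu = statistics.mean(train_sample)
    sigma = statistics.stdev(train_sample)
    return abs(statistics.mean(live_sample) - mu) / sigma

train = [100, 102, 98, 101, 99, 103, 97]
live = [120, 125, 118, 122, 121, 119, 124]  # the world changed

if drift_score(train, live) > 3.0:  # threshold is a tuning choice
    print("drift detected: trigger retraining pipeline")
```

In a real pipeline this check runs on a schedule against logged inference inputs, and crossing the threshold fires the retraining trigger rather than a print statement.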

3. Reproducibility Failure

When an incident occurs in production at 3 AM, can your team immediately reproduce the exact state—the specific code commit, the exact slice of data, the hyperparameters, and library dependencies—that led to that deployed model? If not, you don't have a production system; you have a black box. MLOps ensures every deployed artifact is immutable and traceable back to its origin.
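A minimal version of that traceability is a manifest written at training time and stored next to the model artifact; field names and paths here are illustrative, not a specific registry's schema:

```python
import hashlib
import json

def training_manifest(code_commit, data_path, data_bytes, hyperparams, deps):
    """Everything needed to reproduce a model build, captured at the
    moment of training and stored alongside the serialized model."""
    return {
        "code_commit": code_commit,
        "data_path": data_path,
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "hyperparams": hyperparams,
        "dependencies": deps,
    }

manifest = training_manifest(
    code_commit="a1b2c3d",
    data_path="s3://bucket/train/2024-06.parquet",  # hypothetical path
    data_bytes=b"...raw training data...",
    hyperparams={"lr": 0.01, "epochs": 20},
    deps={"scikit-learn": "1.4.2"},
)
print(json.dumps(manifest, indent=2))
```

Hashing the data rather than merely naming it matters: at 3 AM, the hash tells you whether the file at that path is still the file the model was trained on.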

The Cloud Native MLOps Stack: A Deep Dive

Modern MLOps architectures rely on cloud-native managed services to handle the heavy lifting of compute scheduling and container management, allowing teams to focus on the workflow logic.

While many clouds offer solutions, the choice often comes down to existing infrastructure and compliance needs.

[Diagram: architectural comparison of the MLOps stacks of Google Cloud Vertex AI and Microsoft Azure Machine Learning.]

⚙️ Google Cloud Platform: Vertex AI

Vertex AI emphasizes unified workflows where pipeline steps run as isolated containers.

  • Functionality Focus: Vertex AI Pipelines is the core orchestration engine, supporting Kubeflow Pipelines (KFP) and TFX. It allows you to define your workflow as a Directed Acyclic Graph (DAG): Data Validation → Feature Engineering → Training → Evaluation.
  • Key Component Notes: The Vertex AI Feature Store (V2) is now built on BigQuery for offline storage. It uses timestamp-based resolution to ensure point-in-time correctness, reducing training-serving skew.
  • Important Deprecation Warning: Be aware that Google's Legacy Feature Store API will be shut down in early 2027. Furthermore, the "Optimized online serving" for V2 is also deprecated; Google is directing users toward Bigtable online serving for low-latency scenarios. Plan any new architecture accordingly.
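As a rough sketch of the DAG idea (generic Python, not the actual KFP or Vertex AI SDK), the pipeline above reduces to dependency-ordered execution — each step runs only after everything it depends on has completed:

```python
# Each step lists the steps it depends on; names mirror the DAG above.
steps = {
    "data_validation": [],
    "feature_engineering": ["data_validation"],
    "training": ["feature_engineering"],
    "evaluation": ["training"],
}

def topological_order(dag: dict[str, list[str]]) -> list[str]:
    """Resolve a dependency-ordered execution plan via depth-first search."""
    order, done = [], set()
    def visit(step):
        if step in done:
            return
        for dep in dag[step]:
            visit(dep)
        done.add(step)
        order.append(step)
    for step in dag:
        visit(step)
    return order

print(topological_order(steps))
```

In Vertex AI Pipelines, each of these steps would additionally be an isolated container with typed inputs and outputs, which is what makes individual steps cacheable and retryable.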

☁️ Microsoft Azure: Azure Machine Learning (Azure ML)

Azure ML shines in regulated enterprise environments due to its deep integration with Azure's governance and security fabric.

  • Architectural Strength: Security is paramount. Role-Based Access Control (RBAC) is handled via Microsoft Entra ID (formerly Azure AD), meaning identity policy is centrally managed rather than replicated in the ML platform.
  • Automation: It relies heavily on event-driven automation via Azure Event Grid. Events like ModelRegistered or RunCompleted can trigger downstream pipelines, automated validation checks, or deployments.
  • Lifecycle Management: Azure ML has strong model registry capabilities with full lineage tracking. While it supports MLflow, native deployment workflows often rely on Azure SDK v2 tags to manage lifecycle states (e.g., tagging a model as candidate vs production).
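The event-driven pattern can be illustrated with a toy dispatcher — generic Python, not the Azure SDK or Event Grid API; only the event name mirrors the one above:

```python
# Toy publish/subscribe dispatcher illustrating event-driven automation.
handlers: dict[str, list] = {}

def on(event_type):
    """Register a handler to run whenever an event of this type fires."""
    def register(fn):
        handlers.setdefault(event_type, []).append(fn)
        return fn
    return register

@on("ModelRegistered")
def run_validation(event):
    return f"validating {event['model']} ({event['tag']})"

@on("ModelRegistered")
def notify_team(event):
    return f"notify: new candidate {event['model']}"

def dispatch(event_type, event):
    """Fan an event out to every registered handler, in order."""
    return [h(event) for h in handlers.get(event_type, [])]

results = dispatch("ModelRegistered", {"model": "churn-v13", "tag": "candidate"})
```

The architectural win is decoupling: the registry emits the event without knowing which validation checks, deployments, or notifications hang off it.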

Operational Caveats: The Hidden Costs of ML Infrastructure

Before rolling out any architecture, you must address the elephant in the room: Operational Expenditure (OpEx).

Scaling ML infrastructure—especially GPU-accelerated distributed training—introduces massive cost variability that shocks organizations unprepared for it.

  • The GPU Price Tag: On Google Cloud, a single node with 8x H100 GPUs (the current gold standard for LLM work) can run upwards of $90 per hour on-demand. A 24-hour training run is a $2,000+ event. If you need multi-node distributed training, those costs multiply with every additional node.
  • The "Zombie Cluster" Problem: In Azure ML, if compute clusters are configured with a minimum node count greater than zero, those nodes run continuously, regardless of whether a job is active. Without automated teardown triggers on job completion (or failure!), idle GPU hours will accumulate silently. You won't know until the five-figure bill arrives at the end of the month.
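A back-of-the-envelope sketch of the zombie-cluster math (the hourly rate and utilization figures are illustrative, not quoted cloud prices):

```python
def monthly_idle_cost(nodes: int, hourly_rate: float, utilization: float):
    """Cost of a cluster whose minimum node count keeps it running 24/7.
    Returns (total monthly cost, portion spent while no job was active)."""
    hours = 24 * 30  # always-on, all month
    total = nodes * hourly_rate * hours
    wasted = total * (1 - utilization)
    return total, wasted

# One GPU node at an assumed $90/hr, with jobs actually running 15% of the time.
total, wasted = monthly_idle_cost(nodes=1, hourly_rate=90.0, utilization=0.15)
print(f"monthly bill: ${total:,.0f}, of which idle: ${wasted:,.0f}")
```

Under these assumptions the bill lands around $65k, and roughly 85% of it bought nothing — which is exactly the silent accumulation the teardown triggers are meant to stop.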

Architectural Requirement: Operational budget planning and FinOps practices must be integrated from Day 1. You need automated cluster teardown triggers, strict GPU utilization alerts, and scheduled pipeline runs to avoid on-demand burst pricing.

Implementing strict Cloud Cost Optimization and FinOps Practices is just as critical as the ML code itself.

Common Mistakes When Adopting MLOps

The failures we see are rarely related to the math; they are related to the process.

1. Assigning MLOps to the wrong team.
MLOps sits at the intersection of Data Science, Data Engineering, and DevOps. Handing the entire responsibility to a data science team with no infrastructure experience, or a DevOps team with no ML exposure, is a recipe for disaster. Pipelines will technically "run," but they won't be robust.

2. Skipping the Data Audit.
MLOps is architecture built on data assumptions. If you build pipelines before auditing your data reality—schema consistency, null distributions, ingestion latency—you will build a very expensive system that automates the ingestion of garbage data.

3. Treating MLOps as a "One-Time Setup."
MLOps infrastructure is not static. As noted in the Vertex AI section above, cloud APIs deprecate, SDK versions end-of-life, and pricing models change. If you don't budget for ongoing platform engineering maintenance, your pipelines will break within 18 months.

Summary: MLOps is System Resilience

MLOps isn't a product you buy; it's the discipline that shifts machine learning from a research science experiment into a reliable, scalable production service.

Ultimately, the infrastructure choices made during the prototype phase—feature contracts, data formats, registry design—become severe technical debt at scale. Establishing a rigorous foundation for your MLOps and Data Management ensures your system survives production loads without requiring continuous, expensive rework.
