TL;DR: Apply the same GitOps discipline you use for application code to ML model deployments, and you get version history, rollback, and promotion gates that actually work, instead of the SSH-and-pray workflow most teams are still running.
## The Problem
There's a model running in production right now that nobody on your team can explain. It was trained six weeks ago, deployed by someone who's since moved to a different team, and the only record of what version it is lives in a Slack message that's been buried under 4,000 other messages.
When it starts making bad predictions, what's your rollback plan? If your answer involves SSHing into a server, editing a config file by hand, and hoping the right weights get loaded, you're in the majority. That doesn't make it less of a disaster.
I spent the better part of last year helping platform teams get their ML deployment story straight. The pattern I kept seeing: teams had decent model training pipelines, reasonable experiment tracking in MLflow, and then a complete gap between "model registered" and "model serving traffic." The gap got filled with shell scripts, manual steps, and a whole lot of tribal knowledge.
The fix isn't a new tool. It's applying discipline you already have from application deployments to the model deployment layer.
Before we moved to GitOps for model deployments, a typical promotion cycle looked like this. A data scientist trains a new version, registers it in MLflow, then files a ticket. A platform engineer picks up the ticket, SSHs into the model server, updates the model path, restarts the serving process, and manually validates that predictions look reasonable. Start to finish: 4 to 6 hours on a good day, longer when the engineer is in meetings or the server is being weird.
Rollback? There was no rollback. The best-case scenario was that someone remembered what the previous model path was.
## What Most Teams Try First (And Why It Fails)
The first instinct is usually scripts. Someone writes a deploy.sh that takes a model version as an argument, connects to the serving infrastructure, and handles the update. This is better than pure manual steps, but it fails in a few predictable ways.
First, scripts don't have memory. You can run deploy.sh with model version 47, then run it again with version 51, and there's no audit trail of who ran what or why. When something goes wrong, you're back to grepping through logs and asking around.
Second, scripts don't handle promotion gates. You can't encode "this model can only go to production if it passed staging validation for 24 hours" in a shell script without it becoming a sprawling mess that nobody wants to maintain.
Third, and this one bites hardest: scripts assume the current state. If someone manually changes something on the serving infrastructure, your script has no way of detecting that drift. The next run might succeed or fail unpredictably depending on what changed and when.
MLflow solves the experiment tracking and model registry side well. You get version numbers, artifact storage in S3, stage transitions (Staging, Production), and a clean API. What MLflow doesn't give you is a Kubernetes-native way to declare "this cluster should be running model version 47 right now" and enforce that continuously.
That's where KServe and ArgoCD come in.
## The Architecture
The full stack has five layers working together.
MLflow + S3 handle model artifacts. Every trained model version gets registered with MLflow, which stores the artifact URI pointing to a path in S3. The URI looks something like s3://ml-models-prod/fraud-detector/v47/model.pkl. MLflow's registry gives you a version number and stage metadata. The actual weights live in S3.
KServe InferenceService is the Kubernetes abstraction for serving. Instead of managing a Pod or Deployment by hand, you define an InferenceService custom resource that describes what model to load, from where, and how to scale. KServe handles the rest: downloading the artifact from S3, loading it into the serving framework (Triton, TorchServe, SKLearn Server), and exposing an HTTP endpoint.
Git holds the desired state. A values.yaml file in your repository specifies which model version each environment should run. Promoting from staging to production is a PR that bumps a version number. The PR is the change review, the approval gate, and the audit trail all at once.
ArgoCD reconciles the cluster to match what's in Git. When the PR merges, ArgoCD detects the change and applies the updated KServe InferenceService. If someone manually changes the InferenceService on the cluster, ArgoCD detects the drift and reverts it.
Istio manages traffic splitting. During canary promotion, a VirtualService routes 10% of traffic to the new model version while 90% continues to the stable version. If metrics look good after a soak period, you update the weights and do a full cutover.
Prometheus collects serving metrics. Latency (p99 in particular), throughput, and prediction distribution histograms give you the signals needed to decide whether a canary is healthy or needs to be rolled back.
## The Workflow
Here's how a model promotion actually works end to end.
A data scientist trains a new model, evaluates it against the validation set, and if it passes threshold, registers it in MLflow:
```python
import mlflow
from mlflow.tracking import MlflowClient

with mlflow.start_run() as run:
    # `model` is the trained sklearn estimator from the training step
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_metrics({"f1_score": 0.94, "auc": 0.97})
    run_id = run.info.run_id

client = MlflowClient()
model_uri = f"runs:/{run_id}/model"
# Assumes the "fraud-detector" registered model already exists;
# otherwise create it first with mlflow.register_model(model_uri, "fraud-detector")
mv = client.create_model_version("fraud-detector", model_uri, run_id)
# mv.version == "47"
```
That registration triggers a CI pipeline (GitHub Actions or Tekton, depending on your setup) that opens a pull request bumping the version in the dev environment's values file.
values.yaml structure:

```yaml
environments:
  dev:
    model:
      name: fraud-detector
      version: "47"
      storageUri: "s3://ml-models-prod/fraud-detector/v47"
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
      limits:
        cpu: "4"
        memory: "8Gi"
    minReplicas: 1
    maxReplicas: 3
  staging:
    model:
      name: fraud-detector
      version: "45"
      storageUri: "s3://ml-models-prod/fraud-detector/v45"
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
      limits:
        cpu: "4"
        memory: "8Gi"
    minReplicas: 2
    maxReplicas: 5
  prod:
    model:
      name: fraud-detector
      version: "43"
      storageUri: "s3://ml-models-prod/fraud-detector/v43"
    resources:
      requests:
        cpu: "4"
        memory: "8Gi"
      limits:
        cpu: "8"
        memory: "16Gi"
    minReplicas: 5
    maxReplicas: 20
```
KServe InferenceService (stable):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
  namespace: ml-serving-prod
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  predictor:
    serviceAccountName: kserve-s3-sa
    sklearn:
      storageUri: "s3://ml-models-prod/fraud-detector/v43"
      resources:
        requests:
          cpu: "4"
          memory: "8Gi"
        limits:
          cpu: "8"
          memory: "16Gi"
    minReplicas: 5
    maxReplicas: 20
    scaleTarget: 80
    scaleMetric: concurrency
```
KServe InferenceService (canary rollout). Note this isn't a second resource: KServe canaries by updating the same InferenceService with the new storageUri and a `canaryTrafficPercent`, while the previous revision keeps serving the remaining traffic:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
  namespace: ml-serving-prod
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  predictor:
    serviceAccountName: kserve-s3-sa
    sklearn:
      storageUri: "s3://ml-models-prod/fraud-detector/v47"
      resources:
        requests:
          cpu: "4"
          memory: "8Gi"
        limits:
          cpu: "8"
          memory: "16Gi"
    minReplicas: 1
    maxReplicas: 5
    canaryTrafficPercent: 10
```
ArgoCD ApplicationSet for multi-environment management:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fraud-detector-serving
  namespace: argocd
spec:
  generators:
  - list:
      elements:
      - env: dev
        cluster: dev-cluster
        namespace: ml-serving-dev
      - env: staging
        cluster: staging-cluster
        namespace: ml-serving-staging
      - env: prod
        cluster: prod-cluster
        namespace: ml-serving-prod
  template:
    metadata:
      name: "fraud-detector-{{env}}"
    spec:
      project: ml-serving
      source:
        repoURL: https://github.com/org/ml-gitops
        targetRevision: HEAD
        path: "environments/{{env}}"
        helm:
          valueFiles:
          - values.yaml
      destination:
        name: "{{cluster}}"   # cluster name as registered in ArgoCD
        namespace: "{{namespace}}"
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
        - CreateNamespace=true
        - RespectIgnoreDifferences=true
```
Istio VirtualService for canary traffic split:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fraud-detector-vs
  namespace: ml-serving-prod
spec:
  hosts:
  - fraud-detector.ml-serving-prod.svc.cluster.local
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: fraud-detector-predictor-canary
        port:
          number: 80
      weight: 100
  - route:
    - destination:
        host: fraud-detector-predictor-default
        port:
          number: 80
      weight: 90
    - destination:
        host: fraud-detector-predictor-canary
        port:
          number: 80
      weight: 10
```
After the PR merges to dev, ArgoCD picks up the change within 3 minutes (the default sync interval) and applies the updated InferenceService. The model downloads from S3, the serving pod comes up, and the endpoint starts responding. At this point you can run your automated evaluation suite against the dev endpoint.
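That evaluation can start with something as simple as a POST against KServe's v1 inference protocol. A sketch, assuming the sklearn server's default `:predict` route; the hostname and feature vector are illustrative:

```python
import json
import urllib.request

def predict_url(host: str, model: str) -> str:
    # KServe v1 protocol route: POST /v1/models/<name>:predict
    return f"http://{host}/v1/models/{model}:predict"

def build_payload(feature_rows) -> bytes:
    # The sklearn server expects {"instances": [[...], ...]}
    return json.dumps({"instances": feature_rows}).encode()

# Usage (network call, shown for illustration):
# req = urllib.request.Request(
#     predict_url("fraud-detector.ml-serving-dev.example.com", "fraud-detector"),
#     data=build_payload([[0.1, 3200.0, 1, 0]]),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read())
```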
Promoting to staging is another PR. A human reviews it, checks the dev evaluation results, and approves. Merge, ArgoCD syncs, done. Production promotion follows the same pattern but includes an additional step: the canary InferenceService gets deployed first with 10% traffic, and a GitHub Actions workflow monitors Prometheus metrics for a configured soak period (we use 2 hours for most models) before opening the full-cutover PR automatically.
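The soak-period gate boils down to comparing canary metrics against the stable baseline. A minimal sketch of the decision logic, with illustrative thresholds (the real workflow pulls these numbers from the Prometheus HTTP API):

```python
def canary_healthy(canary_p99_s: float, stable_p99_s: float,
                   canary_error_rate: float,
                   max_error_rate: float = 0.01,
                   max_latency_ratio: float = 1.2) -> bool:
    """Pass the canary only if its error rate is under the SLA and its
    p99 latency is within 20% of the stable model's."""
    if canary_error_rate > max_error_rate:
        return False
    return canary_p99_s <= stable_p99_s * max_latency_ratio
```

If the check passes at the end of the soak window, the workflow opens the full-cutover PR; if it fails at any point, it opens a revert PR instead and pages the on-call.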
## Drift Detection
Prediction drift is the sneaky failure mode. The model is technically serving, latency looks fine, but the distribution of predictions has shifted because the input data changed. You won't catch this with a liveness probe.
KServe's serving runtimes expose request-level metrics (latency, throughput) to Prometheus out of the box. Prediction-distribution metrics like the `fraud_detector_prediction_mean` series used below come from a thin custom exporter or transformer sitting alongside the predictor. You then define alerting rules that fire when the distribution deviates beyond a threshold from the baseline captured at deployment time.
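The alert below compares a short-window prediction mean against yesterday's baseline. The same check in miniature, as a plain-Python sketch of what the PromQL expresses:

```python
from collections import deque

class DriftCheck:
    """Rolling mean of recent predictions compared to a fixed baseline mean."""

    def __init__(self, baseline_mean: float, threshold: float = 0.15, window: int = 1000):
        self.baseline = baseline_mean
        self.threshold = threshold
        self.recent = deque(maxlen=window)   # bounded window of recent predictions

    def observe(self, prediction: float) -> bool:
        """Record a prediction; return True if drift exceeds the threshold."""
        self.recent.append(prediction)
        current = sum(self.recent) / len(self.recent)
        return abs(current - self.baseline) > self.threshold
```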
Prometheus PrometheusRule for drift alerting:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fraud-detector-drift
  namespace: ml-serving-prod
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
  - name: fraud-detector.drift
    interval: 2m
    rules:
    - alert: PredictionDriftDetected
      expr: |
        abs(
          avg_over_time(fraud_detector_prediction_mean[10m])
          - avg_over_time(fraud_detector_prediction_mean[60m] offset 1d)
        ) > 0.15
      for: 10m
      labels:
        severity: warning
        model: fraud-detector
        env: prod
      annotations:
        summary: "Prediction distribution shift detected for fraud-detector"
        description: "Mean prediction shifted by {{ $value | humanizePercentage }} from yesterday's baseline. Check for input data schema changes."
    - alert: ModelLatencyHigh
      expr: |
        histogram_quantile(0.99,
          sum(rate(fraud_detector_request_duration_seconds_bucket[5m])) by (le)
        ) > 0.5
      for: 5m
      labels:
        severity: critical
        model: fraud-detector
        env: prod
      annotations:
        summary: "p99 latency above 500ms for fraud-detector"
        description: "p99 latency is {{ $value }}s. SLA threshold is 500ms."
    - alert: ModelErrorRateHigh
      expr: |
        rate(fraud_detector_request_total{status_code=~"5.."}[5m])
        /
        rate(fraud_detector_request_total[5m]) > 0.01
      for: 5m
      labels:
        severity: critical
        model: fraud-detector
        env: prod
      annotations:
        summary: "Error rate above 1% for fraud-detector"
```
When one of these alerts fires, Alertmanager routes it to PagerDuty (or your notification channel of choice). The on-call engineer's first action is to check whether a canary is active. If it is, rolling back is a single command:
```shell
git revert --no-edit HEAD   # use `git revert -m 1 HEAD` if it's a merge commit
git push origin main
```
ArgoCD detects the revert within 3 minutes and redeploys the previous InferenceService version. In practice, our rollbacks averaged 4 minutes from decision to stable serving.
## Results
| Metric | Before | After |
|---|---|---|
| Time to deploy new model version | 4 to 6 hours | 8 minutes to production canary |
| Rollback capability | None (manual rebuild) | `git revert`, avg 4 minutes |
| Drift detection time | 6 hours (user reports) | 15 minutes (automated alert) |
| Deployment audit trail | Slack messages | Full Git history with PR reviews |
| Environment parity | Best effort | Enforced via ApplicationSet |
| Config drift prevention | None | ArgoCD selfHeal |
The number that surprised me most was the drift detection improvement. We caught a data schema change within 15 minutes on the new system. The same type of change previously went undetected for 6 hours before a user complaint surfaced it. That's not a monitoring win, it's a business outcome.
## Lessons Learned
Start with the values.yaml contract. The shape of that file is the most important design decision you'll make. Get the team to agree on it before writing any ArgoCD config. Everything else follows from it.
S3 artifact URIs in the InferenceService spec, not model names. MLflow stage names ("Production", "Staging") are mutable. If you reference a stage name in your InferenceService spec, two different model versions could map to the same stage name over time, and your Git history loses meaning. Reference the explicit S3 URI with the version number baked in.
selfHeal is non-negotiable. Turn it on in your ArgoCD sync policy. Without selfHeal, a manual kubectl edit on the InferenceService will drift silently and nobody will notice until it matters.
Canary soak time depends on your traffic volume. For a high-volume fraud model processing 50k requests per minute, 30 minutes of canary is enough to get statistically significant signal. For a low-volume model processing 100 requests per day, a 10% canary is nearly useless: 2 hours of soak routes roughly one request through the new version. Adjust accordingly, or route specific customers to the canary instead of relying on random percentage splitting.
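A quick way to sanity-check a soak time before committing to one — a hypothetical helper that computes how many requests the canary will actually see:

```python
def canary_sample_size(requests_per_minute: float, soak_minutes: float,
                       canary_fraction: float = 0.10) -> int:
    """Expected number of requests routed to the canary during the soak."""
    return int(requests_per_minute * soak_minutes * canary_fraction)

# High-volume fraud model: 50k req/min, 30 min soak at 10% -> 150,000 requests
# Low-volume model: 100 req/day (~0.07 req/min), 2 h soak at 10% -> under 1 request
```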
Model cold start affects canary rollouts. Large models take time to download from S3 and load into memory. A 2GB model on a cold node might take 3 to 4 minutes before it's ready to serve. Account for this in your readiness probe timeouts and don't let your monitoring system flag the canary as failing during the startup window.
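One way to budget for that download-and-load window is a generous readiness probe on the predictor container, assuming your KServe version lets you override the container spec. The field names below are the stock Kubernetes probe API; the values are illustrative:

```yaml
# Fragment of a predictor container spec: give a 2GB model up to
# ~5 minutes (30 failures x 10s period) to pull from S3 and load
# before Kubernetes marks the pod unready.
readinessProbe:
  httpGet:
    path: /v1/models/fraud-detector   # KServe v1 model readiness endpoint
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 30
```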
## Try It Yourself
The repository structure I've described looks like this:
```
ml-gitops/
├── environments/
│   ├── dev/
│   │   ├── values.yaml
│   │   └── templates/
│   │       ├── inference-service.yaml
│   │       └── virtual-service.yaml
│   ├── staging/
│   │   ├── values.yaml
│   │   └── templates/
│   └── prod/
│       ├── values.yaml
│       └── templates/
├── base/
│   ├── inference-service-template.yaml
│   └── prometheus-rules.yaml
└── applicationset.yaml
```
Prerequisites before you start:
- Kubernetes cluster (1.28 or newer)
- KServe 0.12 or newer installed
- ArgoCD 2.9 or newer installed
- Istio 1.20 or newer installed
- MLflow tracking server accessible from the cluster
- S3 bucket with appropriate IRSA or Workload Identity configured for KServe pods
The ArgoCD ApplicationSet in this post assumes a Helm-based templating approach where each environment folder contains a values.yaml and a templates directory with the InferenceService and VirtualService manifests. You could also use Kustomize overlays. The concepts are identical.
Start with dev only. Get one model version deploying cleanly through ArgoCD before adding staging and prod. Add the canary workflow only after the basic promotion gate is working reliably.
The jump from "it works in dev" to "it's reliable in prod" is mostly about the Prometheus alerting and the canary soak automation. Those two pieces are what make the system trustworthy enough for the team to stop second-guessing every deployment.