TL;DR: I replaced ad-hoc model deployments with a fully declarative GitOps pipeline using KServe and ArgoCD. Every model version lives in Git, every change goes through a PR, and rollbacks take one git revert.
The Problem
Every ML team I've worked with has the same dirty secret: their model deployments are snowflakes.
The Python script that "works on the data scientist's machine." The Slack message that says "hey can you deploy the new model." The SSH session into the GPU node that nobody documented. Meanwhile, the same team's microservices are humming along with ArgoCD, automated rollbacks, PR-gated deploys, full audit trails.
That gap is embarrassing, and it's completely unnecessary.
KServe got accepted into CNCF as an Incubating project in September 2025. The tooling to close this gap is mature enough for production. Here's what the actual problem looks like in practice:
- Someone manually SSHes into a node and runs a deployment script. No record of what version went live.
- A model update silently replaces the previous one. There's no rollback path.
- Two data scientists think different model versions are running in staging. Both are right, sort of.
- An incident happens. Nobody can tell what changed or when.
I've lived through all of these. The fix isn't a better runbook or more Slack discipline. It's treating model deployments the same way we treat application deployments.
What I Tried First (And Why It Failed)
Attempt 1: Wrapping deployments in shell scripts
The first instinct was to write a deploy_model.sh that calls kubectl apply with the right image tag. This is better than nothing, but it's not GitOps. The script lives somewhere, gets edited ad-hoc, and there's still no PR-gated workflow. The script is the new snowflake.
Attempt 2: Baking models into Docker images
The idea: train the model, package the weights into a Docker image, deploy the image via a normal Deployment. This works surprisingly well for small models under a few hundred MB. It breaks down fast once the model is 2GB, let alone 14GB. Your Docker build times blow up, your registry costs climb, and now your CI pipeline is bottlenecked on model artifact size.
More importantly, you lose the semantic layer. Your Git history shows model:sha256-abc123 instead of fraud-detector/v2.5.0 sklearn 2 replicas 50 RPS target. The config and the artifact are fused. That's hard to review and harder to reason about.
Attempt 3: What actually worked
Separate the artifact from the config. The model weights live in S3, content-addressed and immutable. Git holds the pointer and all the serving configuration. A Kubernetes controller keeps the cluster in sync with what Git says. That's it.
The Solution
The stack I use and recommend:
| Layer | Tool | Why |
|---|---|---|
| Model serving | KServe v0.14+ | Kubernetes-native CRD, multi-framework, built-in canary |
| GitOps controller | ArgoCD | Declarative sync, health checks, rollback |
| Model storage | S3 | Content-addressable, versioned, immutable |
| Model versioning | MLflow | Tracks lineage from training to deployment |
| Ingress | Istio | Traffic splitting for canary rollouts |
| Secrets | AWS IRSA | No credentials in Git, ever |
KServe is the linchpin. It exposes a single InferenceService CRD that ArgoCD manages like any other Kubernetes resource.
Step 1: Install KServe
# cert-manager is a prerequisite
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.0/cert-manager.yaml
kubectl create ns kserve
helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd \
--version v0.14.1 \
--namespace kserve
helm install kserve oci://ghcr.io/kserve/charts/kserve \
--version v0.14.1 \
--namespace kserve \
--set kserve.controller.deploymentMode=RawDeployment
I use RawDeployment mode. It uses standard Kubernetes Deployments and Services instead of Knative, which means fewer moving parts, better compatibility with existing Prometheus and HPA setups, and no cold-start complexity on the critical path.
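Before moving on, it's worth sanity-checking that the controller came up cleanly. Assuming the namespace and release names from the install commands above:

```shell
# Controller pod should be Running
kubectl get pods -n kserve

# The InferenceService CRD should be registered
kubectl get crd inferenceservices.serving.kserve.io

# Confirm the deployment mode the controller was configured with
kubectl get configmap inferenceservice-config -n kserve \
  -o jsonpath='{.data.deploy}'
```

The last command should show RawDeployment as the default mode; if it doesn't, the Helm value didn't take.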
Step 2: Structure your Git repo
models/
├── base/
│ └── kustomization.yaml
├── fraud-detector/
│ ├── kustomization.yaml
│ ├── inference-service.yaml
│ └── service-account.yaml
├── image-classifier/
│ ├── kustomization.yaml
│ └── inference-service.yaml
└── overlays/
├── staging/
│ └── kustomization.yaml
└── production/
└── kustomization.yaml
Kustomize overlays let you parameterize resource limits, replica counts, and model URIs per environment without duplicating YAML.
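As a sketch, a production overlay might pull in the model bases and patch replica counts per environment. The specific patch targets and values here are illustrative:

```yaml
# models/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ml-serving
resources:
  - ../../fraud-detector
  - ../../image-classifier
patches:
  # Production runs more replicas than staging; everything else inherits from base
  - target:
      kind: InferenceService
      name: fraud-detector
    patch: |-
      - op: replace
        path: /spec/predictor/minReplicas
        value: 3
```

The staging overlay is identical in shape, just with smaller numbers.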
Step 3: Define the InferenceService
This is the core resource. Here's a real example for a scikit-learn fraud detection model stored in S3:
# models/fraud-detector/inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: fraud-detector
namespace: ml-serving
labels:
app: fraud-detector
team: ml-platform
model-version: "2.4.1"
annotations:
serving.kserve.io/deploymentMode: RawDeployment
spec:
predictor:
minReplicas: 2
maxReplicas: 10
scaleTarget: 50
scaleMetric: rps
serviceAccountName: kserve-s3-sa
model:
modelFormat:
name: sklearn
storageUri: "s3://prod-ml-models/fraud-detector/v2.4.1"
resources:
requests:
cpu: "500m"
memory: "1Gi"
limits:
cpu: "2"
memory: "4Gi"
env:
- name: SKLEARN_SERVER_WORKERS
value: "2"
The storageUri is the version pointer. Bumping v2.4.1 to v2.5.0 and raising a PR is your deploy-new-model workflow.
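In practice, a model deploy is a PR whose entire diff looks like this:

```diff
 metadata:
   labels:
-    model-version: "2.4.1"
+    model-version: "2.5.0"
 spec:
   predictor:
     model:
-      storageUri: "s3://prod-ml-models/fraud-detector/v2.4.1"
+      storageUri: "s3://prod-ml-models/fraud-detector/v2.5.0"
```

Two lines changed, both reviewable, both revertible.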
For GPU workloads:
# models/image-classifier/inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: image-classifier
namespace: ml-serving
labels:
model-version: "1.3.0"
spec:
predictor:
minReplicas: 1
maxReplicas: 4
serviceAccountName: kserve-s3-sa
model:
modelFormat:
name: pytorch
storageUri: "s3://prod-ml-models/image-classifier/v1.3.0"
runtimeVersion: "23.08-py3"
resources:
requests:
cpu: "2"
memory: "8Gi"
nvidia.com/gpu: "1"
limits:
cpu: "4"
memory: "16Gi"
nvidia.com/gpu: "1"
nodeSelector:
accelerator: nvidia-a10g
Step 4: Wire up the S3 service account
Don't put AWS credentials in manifests. Use IRSA on EKS:
# models/fraud-detector/service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: kserve-s3-sa
namespace: ml-serving
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/kserve-model-reader
The IAM role needs s3:GetObject and s3:ListBucket on your model bucket. KServe's storage initializer picks up the IRSA token automatically.
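The policy itself is small. A minimal sketch, assuming the bucket name from the manifests above:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::prod-ml-models/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::prod-ml-models"
    }
  ]
}
```

Scope it to the model bucket only; the serving pods have no business reading anything else.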
Step 5: Create the ArgoCD Application
# argocd/apps/ml-models.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: ml-models
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: ml-platform
source:
repoURL: https://github.com/phonotech/ml-manifests
targetRevision: main
path: models/overlays/production
destination:
server: https://kubernetes.default.svc
namespace: ml-serving
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
- RespectIgnoreDifferences=true
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
ignoreDifferences:
- group: serving.kserve.io
kind: InferenceService
jsonPointers:
- /status
- /metadata/annotations/serving.kserve.io~1deploymentMode
The ignoreDifferences block is critical. KServe's controller writes back to the InferenceService status and some annotations. Without it, ArgoCD will perpetually detect drift and attempt to re-sync, creating a noisy feedback loop.
Step 6: The deployment workflow
Here's what a model update looks like end to end:
- The data scientist trains a new model, registers the artifact in MLflow, and uploads the weights to s3://prod-ml-models/fraud-detector/v2.5.0/
- They open a PR updating storageUri and the model-version label in inference-service.yaml
- The PR gets reviewed and merged to main
- ArgoCD detects the diff within 3 minutes (or immediately with webhooks) and syncs the new InferenceService spec
- KServe's storage initializer pulls the new weights into the pod
- The new revision comes up healthy and traffic cuts over
The model version is in Git history. You can git revert it. You can see exactly what changed between v2.4.1 and v2.5.0 in the PR diff.
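A rollback is just Git plus an optional manual sync if you don't want to wait for the polling interval. Assuming the repo layout above and an authenticated argocd CLI (the commit ref is a placeholder):

```shell
# Revert the merge commit of the bad model PR
git revert <merge-commit-of-bad-model-pr>
git push origin main

# Optional: trigger an immediate sync instead of waiting for ArgoCD polling
argocd app sync ml-models
```

With selfHeal enabled, even the manual sync is optional; the revert alone is enough.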
To trigger ArgoCD immediately via webhook from GitHub Actions:
# .github/workflows/sync-models.yaml
name: Notify ArgoCD on model manifest change
on:
push:
branches: [main]
paths:
- 'models/**'
jobs:
sync:
runs-on: ubuntu-latest
steps:
- name: Trigger ArgoCD sync
run: |
curl -s -X POST \
-H "Authorization: Bearer ${{ secrets.ARGOCD_TOKEN }}" \
https://argocd.internal.ca/api/v1/applications/ml-models/sync
Canary rollouts
KServe's built-in canary support is where this pattern earns its keep.
# Step 1: Deploy canary at 10% traffic
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: fraud-detector
namespace: ml-serving
spec:
predictor:
canaryTrafficPercent: 10
model:
modelFormat:
name: sklearn
storageUri: "s3://prod-ml-models/fraud-detector/v2.5.0"
resources:
requests:
cpu: "500m"
memory: "1Gi"
KServe automatically routes 90% to the last stable revision and 10% to v2.5.0. If the new model performs well, merge another PR bumping canaryTrafficPercent to 50, then promote to 100 by removing the field. If the canary is bad, set canaryTrafficPercent: 0 to pin back to stable immediately.
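When the built-in canary is active, kubectl surfaces the split directly on the InferenceService; the exact columns and status fields vary somewhat across KServe versions:

```shell
# The PREV/LATEST columns show the current traffic split between revisions
kubectl get inferenceservice fraud-detector -n ml-serving

# Full traffic status, including which revisions are serving
kubectl get inferenceservice fraud-detector -n ml-serving \
  -o jsonpath='{.status.components.predictor.traffic}'
```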
In RawDeployment mode, which this setup uses, you handle the canary split at the Istio level instead:
# istio/virtualservice-fraud-detector.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: fraud-detector
namespace: ml-serving
spec:
hosts:
- fraud-detector.ml-serving.svc.cluster.local
http:
- route:
- destination:
host: fraud-detector-v2-4-1-predictor
port:
number: 8080
weight: 90
- destination:
host: fraud-detector-v2-5-0-predictor
port:
number: 8080
weight: 10
Both the InferenceService and the VirtualService are in Git. The traffic split is in Git. Everything is auditable and revertible.
Results
I won't pretend I have clean before/after numbers from a single project because this pattern spans multiple engagements. Here's what consistently holds:
| Metric | Before | After |
|---|---|---|
| Model deployment method | Manual SSH or ad-hoc scripts | PR-gated, Git-backed |
| Audit trail | None or Slack history | Full Git history |
| Rollback time | 30 minutes to hours | One git revert, seconds |
| Canary traffic split | Not possible without Istio knowledge | Config field in YAML |
| Time to detect config drift | Never (no baseline) | Continuous, ArgoCD UI |
| Secret management | Often hard-coded or in .env files | IRSA, no credentials in Git |
The operational improvement that surprises people most: the on-call burden drops significantly when you can answer "what version is running, what changed, who approved it" in under 30 seconds by looking at Git.
Lessons Learned
1. The ignoreDifferences config is not optional. Skip it and you'll spend a weekend wondering why ArgoCD is perpetually out of sync when nothing real has changed. KServe mutates its own resources. Tell ArgoCD which fields to ignore.
2. Model size determines your storage strategy. Under 500MB, the default S3 init container approach is fine. Over a few GB, you need a shared model cache PVC or a pre-baked image. Planning this up front saves a painful migration later.
3. Always set nodeSelector for GPU workloads. Without it, your InferenceService might land on a CPU node and silently fall back to CPU inference. Set the affinity, set the tolerations, pin it.
4. Start with RawDeployment mode. Knative is powerful but it adds complexity. Get the core pattern working first, then add Knative if you genuinely need scale-to-zero economics.
5. GitOps creates friction on purpose. The PR workflow adds a step that direct kubectl apply doesn't. That step is the point. If your team resents the friction, they haven't lived through the 2am incident where nobody knows what changed.
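On lesson 2: for the multi-GB case, KServe can also pull from a PVC instead of S3, which lets a ReadOnlyMany volume act as a shared, pre-staged model cache. A fragment as a sketch; the claim name and path are assumptions:

```yaml
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      # pvc://<claim-name>/<path-inside-volume>
      storageUri: "pvc://model-cache/image-classifier/v1.3.0"
```

The pointer-in-Git semantics stay identical; only the storage initializer's source changes.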
Try It Yourself
The five things you actually need to get started:
- KServe installed (Helm, RawDeployment mode, cert-manager prerequisite)
- A models-manifests repo with InferenceService YAML per model and Kustomize overlays for environments
- An ArgoCD Application pointing at overlays/production, with selfHeal: true and ignoreDifferences on KServe status fields
- IRSA or Workload Identity for S3 access
- Branch protection on main so model version bumps require PR review
The canary rollout and GitHub Actions webhook are enhancements. Get the core working first.


