Mateen Anjum

GitOps for ML Model Deployment: A Real Pipeline, Not a Toy Demo

TL;DR: I replaced ad-hoc model deployments with a fully declarative GitOps pipeline using KServe and ArgoCD. Every model version lives in Git, every change goes through a PR, and rollbacks take one git revert.


The Problem

Every ML team I've worked with has the same dirty secret: their model deployments are snowflakes.

The Python script that "works on the data scientist's machine." The Slack message that says "hey can you deploy the new model." The SSH session into the GPU node that nobody documented. Meanwhile, the same team's microservices are humming along with ArgoCD, automated rollbacks, PR-gated deploys, full audit trails.

That gap is embarrassing, and it's completely unnecessary.

KServe got accepted into CNCF as an Incubating project in September 2025. The tooling to close this gap is mature enough for production. Here's what the actual problem looks like in practice:

  • Someone manually SSHes into a node and runs a deployment script. No record of what version went live.
  • A model update silently replaces the previous one. There's no rollback path.
  • Two data scientists think different model versions are running in staging. Both are right, sort of.
  • An incident happens. Nobody can tell what changed or when.

I've lived through all of these. The fix isn't a better runbook or more Slack discipline. It's treating model deployments the same way we treat application deployments.


What I Tried First (And Why It Failed)

Attempt 1: Wrapping deployments in shell scripts

The first instinct was to write a deploy_model.sh that calls kubectl apply with the right image tag. This is better than nothing, but it's not GitOps. The script lives somewhere, gets edited ad-hoc, and there's still no PR-gated workflow. The script is the new snowflake.

Attempt 2: Baking models into Docker images

The idea: train the model, package the weights into a Docker image, deploy the image via a normal Deployment. This works surprisingly well for small models under a few hundred MB. It breaks down fast when the model is 2GB or 14GB. Your Docker build times blow up, your registry costs climb, and now your CI pipeline is bottlenecked on model artifact size.

More importantly, you lose the semantic layer. Your Git history shows model:sha256-abc123 instead of fraud-detector v2.5.0, sklearn, 2 replicas, 50 RPS target. The config and the artifact are fused. That's hard to review and harder to reason about.

Attempt 3: What actually worked

Separate the artifact from the config. The model weights live in S3, content-addressed and immutable. Git holds the pointer and all the serving configuration. A Kubernetes controller keeps the cluster in sync with what Git says. That's it.


The Solution

The stack I use and recommend:

| Layer | Tool | Why |
|---|---|---|
| Model serving | KServe v0.14+ | Kubernetes-native CRD, multi-framework, built-in canary |
| GitOps controller | ArgoCD | Declarative sync, health checks, rollback |
| Model storage | S3 | Content-addressable, versioned, immutable |
| Model versioning | MLflow | Tracks lineage from training to deployment |
| Ingress | Istio | Traffic splitting for canary rollouts |
| Secrets | AWS IRSA | No credentials in Git, ever |

KServe is the linchpin. It exposes a single InferenceService CRD that ArgoCD manages like any other Kubernetes resource.

Step 1: Install KServe

# cert-manager is a prerequisite
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.0/cert-manager.yaml

kubectl create ns kserve

helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd \
  --version v0.14.1 \
  --namespace kserve

helm install kserve oci://ghcr.io/kserve/charts/kserve \
  --version v0.14.1 \
  --namespace kserve \
  --set kserve.controller.deploymentMode=RawDeployment

I use RawDeployment mode. It uses standard Kubernetes Deployments and Services instead of Knative, which means fewer moving parts, better compatibility with existing Prometheus and HPA setups, and no cold-start complexity on the critical path.

Step 2: Structure your Git repo

models/
├── base/
│   └── kustomization.yaml
├── fraud-detector/
│   ├── kustomization.yaml
│   ├── inference-service.yaml
│   └── service-account.yaml
├── image-classifier/
│   ├── kustomization.yaml
│   └── inference-service.yaml
└── overlays/
    ├── staging/
    │   └── kustomization.yaml
    └── production/
        └── kustomization.yaml

Kustomize overlays let you parameterize resource limits, replica counts, and model URIs per environment without duplicating YAML.
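To make the overlay idea concrete, here's a sketch of what a staging overlay might look like. The staging bucket name, release candidate tag, and replica count are assumptions for illustration, not values from a real repo:

```yaml
# models/overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../fraud-detector
  - ../../image-classifier
patches:
  # Staging runs a single replica and points at a staging bucket
  - target:
      kind: InferenceService
      name: fraud-detector
    patch: |-
      - op: replace
        path: /spec/predictor/minReplicas
        value: 1
      - op: replace
        path: /spec/predictor/model/storageUri
        value: s3://staging-ml-models/fraud-detector/v2.5.0-rc1
```

The base manifests stay untouched; each environment only declares its deltas.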

Step 3: Define the InferenceService

This is the core resource. Here's a real example for a scikit-learn fraud detection model stored in S3:

# models/fraud-detector/inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
  namespace: ml-serving
  labels:
    app: fraud-detector
    team: ml-platform
    model-version: "2.4.1"
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    scaleTarget: 50
    scaleMetric: rps
    serviceAccountName: kserve-s3-sa
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://prod-ml-models/fraud-detector/v2.4.1"
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
      env:
        - name: SKLEARN_SERVER_WORKERS
          value: "2"

The storageUri is the version pointer. Bumping v2.4.1 to v2.5.0 and raising a PR is your deploy-new-model workflow.
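The entire PR diff for a version bump is a few lines. A sketch against the manifest above:

```diff
--- a/models/fraud-detector/inference-service.yaml
+++ b/models/fraud-detector/inference-service.yaml
@@
-    model-version: "2.4.1"
+    model-version: "2.5.0"
@@
-      storageUri: "s3://prod-ml-models/fraud-detector/v2.4.1"
+      storageUri: "s3://prod-ml-models/fraud-detector/v2.5.0"
```

That diff is the whole deploy. Reviewers see exactly what changes and nothing else.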

For GPU workloads:

# models/image-classifier/inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: image-classifier
  namespace: ml-serving
  labels:
    model-version: "1.3.0"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 4
    serviceAccountName: kserve-s3-sa
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://prod-ml-models/image-classifier/v1.3.0"
      runtimeVersion: "23.08-py3"
      resources:
        requests:
          cpu: "2"
          memory: "8Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"
          memory: "16Gi"
          nvidia.com/gpu: "1"
    nodeSelector:
      accelerator: nvidia-a10g

Step 4: Wire up the S3 service account

Don't put AWS credentials in manifests. Use IRSA on EKS:

# models/fraud-detector/service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kserve-s3-sa
  namespace: ml-serving
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/kserve-model-reader

The IAM role needs s3:GetObject and s3:ListBucket on your model bucket. KServe's storage initializer picks up the IRSA token automatically.
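For reference, a minimal policy document for that role might look like this. The bucket name matches the examples above; adjust it to your own:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::prod-ml-models",
        "arn:aws:s3:::prod-ml-models/*"
      ]
    }
  ]
}
```

Read-only, scoped to one bucket. The serving cluster never gets write access to model artifacts.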

Step 5: Create the ArgoCD Application

# argocd/apps/ml-models.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-models
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: ml-platform
  source:
    repoURL: https://github.com/phonotech/ml-manifests
    targetRevision: main
    path: models/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - RespectIgnoreDifferences=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
  ignoreDifferences:
    - group: serving.kserve.io
      kind: InferenceService
      jsonPointers:
        - /status
        - /metadata/annotations/serving.kserve.io~1deploymentMode

The ignoreDifferences block is critical. KServe's controller writes back to the InferenceService status and some annotations. Without it, ArgoCD will perpetually detect drift and attempt to re-sync, creating a noisy feedback loop.

Step 6: The deployment workflow

Here's what a model update looks like end to end:

  1. Data scientist trains a new model, registers the artifact in MLflow, uploads weights to s3://prod-ml-models/fraud-detector/v2.5.0/
  2. They open a PR updating storageUri and the model-version label in inference-service.yaml
  3. PR gets reviewed and merged to main
  4. ArgoCD detects the diff within 3 minutes (or immediately with webhooks), syncs the new InferenceService spec
  5. KServe's storage initializer pulls the new weights into the pod
  6. New revision comes up healthy, traffic cuts over

The model version is in Git history. You can git revert it. You can see exactly what changed between v2.4.1 and v2.5.0 in the PR diff.
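The rollback claim is easy to sanity-check locally. A minimal sketch in a throwaway repo, using a one-line stand-in for the real InferenceService manifest:

```shell
# Simulate the PR-gated deploy flow in a scratch git repo
set -euo pipefail
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "dev"

# "Deploy" v2.4.1, then "bump" to v2.5.0
echo 'storageUri: s3://prod-ml-models/fraud-detector/v2.4.1' > inference-service.yaml
git add . && git commit -qm "deploy fraud-detector v2.4.1"
sed 's/v2.4.1/v2.5.0/' inference-service.yaml > tmp && mv tmp inference-service.yaml
git commit -qam "bump fraud-detector to v2.5.0"

# Rollback is one revert; ArgoCD would sync the result automatically
git revert --no-edit HEAD > /dev/null
grep v2.4.1 inference-service.yaml && echo "rolled back"
```

In the real pipeline the revert commit lands on main (via PR or a fast-tracked merge), and ArgoCD syncs the cluster back to the previous model within its polling interval.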

To trigger ArgoCD immediately via webhook from GitHub Actions:

# .github/workflows/sync-models.yaml
name: Notify ArgoCD on model manifest change
on:
  push:
    branches: [main]
    paths:
      - 'models/**'

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger ArgoCD sync
        run: |
          curl -s -X POST \
            -H "Authorization: Bearer ${{ secrets.ARGOCD_TOKEN }}" \
            https://argocd.internal.ca/api/v1/applications/ml-models/sync

Canary rollouts

KServe's built-in canary support is where this pattern earns its keep.

# Step 1: Deploy canary at 10% traffic
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
  namespace: ml-serving
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://prod-ml-models/fraud-detector/v2.5.0"
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"

KServe automatically routes 90% to the last stable revision and 10% to v2.5.0. If the new model performs well, merge another PR bumping canaryTrafficPercent to 50, then promote to 100 by removing the field. If the canary is bad, set canaryTrafficPercent: 0 to pin back to stable immediately.
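Promotion is just another PR. A sketch of the diff bumping the canary from 10% to 50%:

```diff
--- a/models/fraud-detector/inference-service.yaml
+++ b/models/fraud-detector/inference-service.yaml
@@
 spec:
   predictor:
-    canaryTrafficPercent: 10
+    canaryTrafficPercent: 50
```

Each traffic step is a reviewed, revertible commit, which is exactly what you want during a risky rollout.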

Note that canaryTrafficPercent relies on KServe's revision tracking, which is a Serverless (Knative) mode feature. In RawDeployment mode, you handle the canary split at the Istio level instead:

# istio/virtualservice-fraud-detector.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fraud-detector
  namespace: ml-serving
spec:
  hosts:
    - fraud-detector.ml-serving.svc.cluster.local
  http:
    - route:
        - destination:
            host: fraud-detector-v2-4-1-predictor
            port:
              number: 8080
          weight: 90
        - destination:
            host: fraud-detector-v2-5-0-predictor
            port:
              number: 8080
          weight: 10

Both the InferenceService and the VirtualService are in Git. The traffic split is in Git. Everything is auditable and revertible.


Results

I won't pretend I have clean before/after numbers from a single project because this pattern spans multiple engagements. Here's what consistently holds:

| Metric | Before | After |
|---|---|---|
| Model deployment method | Manual SSH or ad-hoc scripts | PR-gated, Git-backed |
| Audit trail | None or Slack history | Full Git history |
| Rollback time | 30 minutes to hours | One git revert, seconds |
| Canary traffic split | Not possible without Istio knowledge | Config field in YAML |
| Time to detect config drift | Never (no baseline) | Continuous, ArgoCD UI |
| Secret management | Often hard-coded or in .env files | IRSA, no credentials in Git |

The operational improvement that surprises people most: the on-call burden drops significantly when you can answer "what version is running, what changed, who approved it" in under 30 seconds by looking at Git.


Lessons Learned

1. The ignoreDifferences config is not optional. Skip it and you'll spend a weekend wondering why ArgoCD is perpetually out of sync when nothing real has changed. KServe mutates its own resources. Tell ArgoCD which fields to ignore.

2. Model size determines your storage strategy. Under 500MB, the default S3 init container approach is fine. Over a few GB, you need a shared model cache PVC or a pre-baked image. Planning this up front saves a painful migration later.

3. Always set nodeSelector for GPU workloads. Without it, your InferenceService might land on a CPU node and silently fall back to CPU inference. Set the affinity, set the tolerations, pin it.
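A sketch of the relevant predictor-level fields. The nvidia.com/gpu taint key assumes your GPU node group is tainted the conventional way; check your cluster's actual taints:

```yaml
# Fragment of spec.predictor for a GPU-backed InferenceService
spec:
  predictor:
    # Pin to GPU nodes explicitly
    nodeSelector:
      accelerator: nvidia-a10g
    # Tolerate the GPU node taint so the pod can actually schedule there
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
```

Combined with the nvidia.com/gpu resource request, this makes GPU placement explicit instead of accidental.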

4. Start with RawDeployment mode. Knative is powerful but it adds complexity. Get the core pattern working first, then add Knative if you genuinely need scale-to-zero economics.

5. GitOps creates friction on purpose. The PR workflow adds a step that direct kubectl apply doesn't. That step is the point. If your team resents the friction, they haven't lived through the 2am incident where nobody knows what changed.


Try It Yourself

The five things you actually need to get started:

  1. KServe installed (Helm, RawDeployment mode, cert-manager prerequisite)
  2. A models-manifests repo with InferenceService YAML per model, Kustomize overlays for environments
  3. ArgoCD Application pointing at overlays/production, selfHeal: true, with ignoreDifferences on KServe status fields
  4. IRSA or Workload Identity for S3 access
  5. Branch protection on main so model version bumps require PR review

The canary rollout and GitHub Actions webhook are enhancements. Get the core working first.

