Sodiq Jimoh
Beyond InferenceService Readiness: 5 GitOps Failure Modes That Break KServe Deployments

A sequel to my KServe readiness post — five GitOps control-plane failure modes with exact terminal output, diagnostics, and repeatable fixes for ArgoCD + KServe stacks.

This post is a follow-up to my earlier KServe piece on endpoint readiness:

👉 Why Your KServe InferenceService Won't Become Ready: Four Production Failures and Fixes

That article focused on why an InferenceService may not become Ready.

This one zooms out to a broader question:

What breaks when the GitOps control plane itself is unstable?

Most GitOps + AI serving tutorials still focus on the happy path — install ArgoCD, apply KServe, deploy InferenceService, done. But in real platform work, the happy path is the easy part.

The hard part is when your app is OutOfSync, the webhook has no endpoints, and everything looks healthy except the thing you actually need.

This post covers the five failure modes that repeatedly broke KServe deployments in a real production-grade platform build, with exact terminal output, root causes, and the fixes that worked.

All failures come from hands-on implementation work documented here:
Project repo: github.com/sodiq-code/neuroscale-platform


The platform context

Stack:

  • ArgoCD — GitOps reconciliation
  • KServe — model serving (InferenceService, runtimes)
  • Knative + Kourier — serving networking
  • Kyverno — policy guardrails
  • Backstage — self-service PR generation

GitOps root app:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: neuroscale-infrastructure
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/sodiq-code/neuroscale-platform.git
    targetRevision: main
    path: infrastructure/apps
```
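The snippet above shows only the source block; a working Application also needs a destination and, typically, a sync policy. A fuller sketch, where the destination and syncPolicy values are assumptions rather than copies from the repo:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: neuroscale-infrastructure
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/sodiq-code/neuroscale-platform.git
    targetRevision: main
    path: infrastructure/apps
  destination:
    server: https://kubernetes.default.svc   # in-cluster
    namespace: argocd
  syncPolicy:
    automated:
      prune: true      # app-of-apps: child apps are pruned when removed from Git
      selfHeal: true
```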

Failure Mode 1: Webhook Has No Endpoints — Sync Fails Cluster-Wide

Time lost: ~1 hour | Impact: All InferenceService operations blocked

Symptom

ArgoCD syncs child apps and hits this:

```shell
$ kubectl -n argocd describe application ai-model-alpha
...
Message: admission webhook
  "inferenceservice.kserve-webhook-server.validator.webhook"
  denied the request: Internal error occurred:
  no endpoints available for service "kserve-webhook-server-service"
```

Meanwhile the KServe controller pod shows only 1 of 2 containers ready:

```shell
$ kubectl -n kserve get pods
NAME                                        READY   STATUS
kserve-controller-manager-8d7c5b9f4-xr2lm   1/2     Running

$ kubectl -n kserve describe pod kserve-controller-manager-8d7c5b9f4-xr2lm
...
  kube-rbac-proxy:
    State:   Waiting
    Reason:  ImagePullBackOff
    Image:   gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
Events:
  Warning  Failed  kubelet
    Failed to pull image: unexpected status code 403 Forbidden
```

Root Cause

The kube-rbac-proxy sidecar inside kserve-controller-manager was pulling from gcr.io/kubebuilder/, a registry that restricted access in late 2025. The manager container itself was healthy, but with the sidecar stuck in ImagePullBackOff the pod never became Ready, so the webhook Service had no endpoints to route admission requests to. Result: every InferenceService create or update was blocked cluster-wide.

Fix

Remove the sidecar via Kustomize strategic merge patch:

```yaml
# infrastructure/serving-stack/patches/
#   kserve-controller-kube-rbac-proxy-image.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kserve-controller-manager
  namespace: kserve
spec:
  template:
    spec:
      containers:
        - name: kube-rbac-proxy
          $patch: delete
```
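For the patch to take effect it has to be wired into the stack's Kustomize build. A minimal sketch, assuming the upstream KServe release manifest is pulled in as a remote resource (the layout is illustrative, not copied from the repo):

```yaml
# infrastructure/serving-stack/kustomization.yaml (illustrative layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - https://github.com/kserve/kserve/releases/download/v0.12.1/kserve.yaml
patches:
  - path: patches/kserve-controller-kube-rbac-proxy-image.yaml
```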

Verify webhook endpoints are restored after re-sync:

```shell
$ kubectl -n kserve get endpoints kserve-webhook-server-service
NAME                           ENDPOINTS          AGE
kserve-webhook-server-service  10.42.0.23:9443    45s
```
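If you still need the RBAC proxy rather than removing it, an alternative is overriding the sidecar's image to a mirror. The sketch below assumes quay.io/brancz/kube-rbac-proxy is reachable and carries the same tag; verify the image and digest before relying on it:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kserve-controller-manager
  namespace: kserve
spec:
  template:
    spec:
      containers:
        - name: kube-rbac-proxy
          # assumed mirror location -- confirm availability and tag first
          image: quay.io/brancz/kube-rbac-proxy:v0.13.1
```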

Lesson

When webhook endpoints are missing, your app YAML is never the real problem. Diagnose the controller first. An external registry access change can silently kill your entire admission layer cluster-wide with no obvious error in the app itself.


Failure Mode 2: CRD Deleted by a Misapplied Patch — All Endpoints Gone Instantly

Time lost: 4 minutes recovery | Impact: SEV-1 equivalent — all InferenceServices deleted

Symptom

All InferenceService objects disappeared silently:

```shell
$ kubectl -n default get inferenceservices
No resources found in default namespace.

$ kubectl -n argocd get application demo-iris-2
NAME          SYNC STATUS   HEALTH STATUS
demo-iris-2   OutOfSync     Missing
```

Root Cause

A Kustomize patch file named remove-inferenceservice-crd.yaml was mistakenly applied directly with kubectl apply -f instead of being used as a build-time patch inside kustomization.yaml. The file contained a $patch: delete directive:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: inferenceservices.serving.kserve.io
$patch: delete
```

When applied directly, it deleted the actual CRD from Kubernetes. When a CRD is deleted, Kubernetes immediately garbage-collects every custom resource of that type. Every InferenceService was gone within seconds.

Fix

Restore the CRD immediately:

```shell
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.12.1/kserve.yaml

kubectl wait crd/inferenceservices.serving.kserve.io \
  --for=condition=Established --timeout=60s

kubectl -n argocd patch application demo-iris-2 \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
```

Lesson

$patch: delete in a Kustomize file is a build-time instruction — it tells kustomize build to omit that resource from output. It must never be applied directly with kubectl apply -f. Ambiguous filenames like remove-inferenceservice-crd.yaml are dangerous footguns. In a production cluster with 50 deployed models this would be a full SEV-1.

⚠️ Rule: Any file containing $patch: delete must only ever be referenced inside a kustomization.yaml patches block, never applied directly.
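That rule can also be enforced mechanically. A hypothetical wrapper (safe_apply is not part of the project repo) that refuses to apply any file carrying the directive:

```shell
# Refuse to `kubectl apply -f` any manifest containing a Kustomize
# build-time directive such as `$patch: delete`.
safe_apply() {
  local f
  for f in "$@"; do
    if grep -q '\$patch: delete' "$f"; then
      echo "refusing to apply ${f}: \$patch: delete is a Kustomize build-time directive" >&2
      return 1
    fi
  done
  for f in "$@"; do
    kubectl apply -f "$f" || return 1
  done
}
```

Wiring a check like this into the same CI step or shell profile that runs kubectl turns the footgun into a hard stop.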


Failure Mode 3: Permanent OutOfSync Due to Label Key Mismatch

Time lost: 2 weeks undetected | Impact: CI was green while policy enforcement was silently broken

Symptom

A PR is merged, ArgoCD syncs, but the InferenceService stays OutOfSync/Degraded:

```shell
$ kubectl -n argocd get application demo-iris-2
NAME          SYNC STATUS   HEALTH STATUS
demo-iris-2   OutOfSync     Degraded
```

Kyverno denies the resource at admission:

```shell
Error from server: error when creating "STDIN":
  admission webhook "clusterpolice.kyverno.svc" denied the request:
  resource InferenceService/default/test-model was blocked due to the following policies
  require-standard-labels-inferenceservice:
    check-owner-and-cost-center-on-isvc: 'validation error:
    InferenceService resources must set metadata.labels.owner and
    metadata.labels.cost-center.
    rule check-owner-and-cost-center-on-isvc failed at path
    /metadata/labels/cost-center/'
```

But the label is present in the manifest:

```shell
$ kubectl -n default get inferenceservice demo-iris-2 \
    -o jsonpath='{.metadata.labels}' | python3 -m json.tool
{
    "owner": "platform-team",
    "costCenter": "ai-platform"
}
```

Root Cause

costCenter (camelCase) and cost-center (kebab-case) are completely different Kubernetes label keys. The Backstage template skeleton was generating costCenter, while the Kyverno policy required cost-center, so every manifest the template produced was rejected at admission time.

Additionally, kyverno-cli apply exits with code 0 even when policy violations are found. CI was checking $? rather than ${PIPESTATUS[0]}, so the CI step appeared green while enforcement was completely broken for two weeks.

Fix

Standardize on kebab-case throughout (Kubernetes convention):

```yaml
# Backstage template skeleton
# apps/${{ values.name }}/inference-service.yaml
labels:
  owner: platform-team
  cost-center: ai-platform   # was: costCenter
```
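Because label keys are exact strings, even a cheap textual pre-merge check would have caught the mismatch. A hypothetical helper (not in the repo) that asserts required kebab-case keys appear in a rendered manifest:

```shell
# Fail if any required label key is missing from a rendered manifest.
# Intentionally literal: `cost-center` will NOT match `costCenter`.
require_labels() {
  local file=$1 key
  shift
  for key in "$@"; do
    if ! grep -Eq "^[[:space:]]*${key}:" "$file"; then
      echo "missing required label key '${key}' in ${file}" >&2
      return 1
    fi
  done
}

# Example: require_labels rendered/inference-service.yaml owner cost-center
```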

Fix the CI Kyverno check to catch actual violations:

```shell
set +e
docker run --rm -v "$PWD:/work" -w /work ghcr.io/kyverno/kyverno-cli:v1.12.5 \
  apply infrastructure/kyverno/policies/*.yaml \
  --resource "${app_files[@]}" \
  2>&1 | tee /tmp/kyverno-output.txt
kyverno_exit="${PIPESTATUS[0]}"
set -e

if [ "${kyverno_exit}" -ne 0 ] \
    || grep -qE "^FAIL" /tmp/kyverno-output.txt \
    || grep -qE "fail: [1-9][0-9]*" /tmp/kyverno-output.txt; then
  echo "Kyverno policy violations detected. Failing CI."
  exit 1
fi
```

Lesson

$? captures the exit code of tee, not kyverno; ${PIPESTATUS[0]} captures kyverno's actual exit code. "Guardrails exist" and "guardrails enforce" are different states. The most dangerous failure mode for a policy system is the silent false pass: everything looks green while nothing is actually being enforced.
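The masking is easy to reproduce without any of the Kyverno machinery; a minimal bash repro:

```shell
# In a pipeline, $? reports the LAST command's status (tee here),
# while ${PIPESTATUS[0]} reports the first command's.
false | tee /dev/null
naive=$?                         # 0 -- tee succeeded, the failure is masked

false | tee /dev/null
real=${PIPESTATUS[0]}            # 1 -- false's actual exit status

echo "naive=${naive} real=${real}"   # prints: naive=0 real=1
```

Note that PIPESTATUS is bash-specific; POSIX sh needs `set -o pipefail` (where supported) or temp-file plumbing instead.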


Failure Mode 4: Kyverno Install Breaks ArgoCD Reconciliation Loop

Time lost: 2–5 minutes per cluster | Impact: All ArgoCD apps enter Unknown state

Symptom

After adding Kyverno to the platform, previously healthy apps enter Unknown state:

```shell
$ kubectl -n argocd get applications
NAME                       SYNC STATUS   HEALTH STATUS
neuroscale-infrastructure  Synced        Healthy
serving-stack              Unknown       Unknown    # was Healthy 10 minutes ago
policy-guardrails          Synced        Healthy

$ kubectl -n argocd describe application serving-stack
...
Message: rpc error: code = Unavailable desc = connection refused
```

Root Cause

Kyverno installs its own ValidatingWebhookConfiguration and MutatingWebhookConfiguration during install. While Kyverno is initializing, the webhook configurations are registered but point to endpoints that do not exist yet. During this window, any kubectl apply operation — including ArgoCD's sync reconciliation loop — times out waiting for a response from a not-yet-running webhook server. This cascades into the ArgoCD repo-server losing its connection.

Fix

Add a Kyverno webhookAnnotations ConfigMap patch to suppress automatic webhook registration during the initialization window:

```yaml
# infrastructure/kyverno/kustomization.yaml
patches:
  - target:
      kind: ConfigMap
      name: kyverno
    patch: |-
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: kyverno
        namespace: kyverno
      data:
        webhookAnnotations: "{}"
```

After Kyverno reaches Running state, force a hard refresh:

```shell
kubectl -n argocd patch application serving-stack \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
```

Lesson

Adding a policy engine to an existing cluster can disrupt every other ArgoCD-managed application during the install window. In production, that means a maintenance window or a canary install strategy, and Kyverno must be fully healthy before any other component syncs.
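One way to encode that ordering in an app-of-apps layout is ArgoCD sync waves on the child Application manifests. A sketch; the wave value and spec fields are illustrative, not copied from the repo:

```yaml
# Child app for Kyverno syncs in an earlier wave than the serving stack,
# so the policy engine settles before anything else reconciles.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: policy-guardrails
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # before the default wave 0
spec:
  source:
    repoURL: https://github.com/sodiq-code/neuroscale-platform.git
    targetRevision: main
    path: infrastructure/kyverno
  destination:
    server: https://kubernetes.default.svc
    namespace: kyverno
```

Within the root app's sync, ArgoCD applies resources in ascending wave order and waits for health between waves, so the Kyverno child Application is created before later-wave apps.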


Failure Mode 5: Stale Admission Webhook Blocks All Resource Creation

Time lost: 30+ minutes | Impact: All Deployments in the namespace silently blocked

Symptom

After fixing the repo-server, apps sync but Deployments never appear:

```shell
$ kubectl get applications -n argocd
NAME                       SYNC STATUS   HEALTH STATUS
neuroscale-infrastructure  Synced        Healthy
test-app                   Synced        Progressing   # stuck

$ kubectl get deploy -n default
No resources found in default namespace.
```

ArgoCD shows the Deployment as "synced" but it does not exist — a contradiction. Checking conditions:

```shell
$ kubectl -n argocd get application test-app -o yaml | grep -A 20 conditions
  conditions:
  - message: 'Failed sync attempt: one or more objects failed to apply,
      reason: Internal error occurred: failed calling webhook
      "validate.nginx.ingress.kubernetes.io":
      failed to call webhook: Post
      "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/...":
      dial tcp 10.96.x.x:443: connect: connection refused'
    type: SyncError
```

Root Cause

A ValidatingWebhookConfiguration from a previous cluster experiment was still registered but pointing to a service that no longer existed. Kubernetes admission webhooks are cluster-scoped. The stale ingress-nginx webhook was intercepting every resource creation attempt and failing them — the error only appears in ArgoCD events, not on the Deployment itself.
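How loudly a stale webhook fails depends on its failurePolicy. A sketch of the relevant field; the object below is illustrative and trimmed (a real ValidatingWebhookConfiguration also needs rules, sideEffects, and admissionReviewVersions):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: ingress-nginx-admission
webhooks:
  - name: validate.nginx.ingress.kubernetes.io
    failurePolicy: Fail   # unreachable backend => matching requests are denied
    clientConfig:
      service:
        name: ingress-nginx-controller-admission
        namespace: ingress-nginx
        port: 443
```

With failurePolicy: Ignore the same stale entry would be skipped silently instead of blocking creation, trading an outage for unenforced validation.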

Fix

```shell
# Discover stale webhooks
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Delete the stale one
kubectl delete validatingwebhookconfiguration ingress-nginx-admission

# Force ArgoCD to retry
kubectl -n argocd patch application test-app \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
```

After deletion:

```shell
$ kubectl get deploy -n default
NAME          READY   UP-TO-DATE   AVAILABLE   AGE
nginx-test    1/1     1            1           23s
```

Lesson

A stale webhook left over from a previous workload can silently block all matching resource creation with no obvious error message; the admission failure surfaces only in ArgoCD's events and sync conditions, not on the resource itself. Always check for stale webhooks before blaming manifests.


The Triage Sequence That Saves Hours

When a KServe app is failing in ArgoCD, run this exact order before touching any manifest:

```shell
# 1. Environment gate — if this fails, stop and fix environment first
kubectl get nodes
kubectl -n argocd get applications

# 2. Control-plane health
kubectl -n kserve get deploy,pods,svc,endpoints
kubectl get crd | grep serving.kserve.io

# 3. Controller logs
kubectl -n kserve logs deploy/kserve-controller-manager --tail=100

# 4. Webhook availability
kubectl -n kserve get endpoints kserve-webhook-server-service

# 5. Stale webhooks
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# 6. App-level sync error detail
kubectl -n argocd get application <app-name> -o yaml | grep -A 20 conditions
```

Only after every step above passes should you edit app manifests.


Why This Matters for Platform Teams

A platform is credible when it supports both:

  • Self-service delivery — the Golden Path works
  • Self-service recovery — failures are understandable and fixable without a platform expert

Most teams build the first and postpone the second. That creates operational debt fast.

The fix is not more dashboards. It is better failure-model documentation, tighter GitOps guardrails, and the discipline to document what breaks — not just what works.

A platform is not "done" when the happy path works. It's done when the failure path is understandable and recoverable.


What I Would Improve Next

  • Pre-merge CI assertions for probe and resource fields in rendered manifests
  • Explicit dependency ordering using ArgoCD sync waves to prevent Kyverno install disruption
  • Conformance checks for Helm dependency values nesting to catch silently ignored overrides
  • Policy test fixtures that verify both pass and fail cases in CI

Jimoh Sodiq Bolaji | Platform Engineer | Technical Content Engineer | Abuja, Nigeria
NeuroScale Platform · Dev.to
