A sequel to my KServe readiness post — five GitOps control-plane failure modes with exact terminal output, diagnostics, and repeatable fixes for ArgoCD + KServe stacks.
This post is a follow-up to my earlier KServe piece on endpoint readiness:
👉 Why Your KServe InferenceService Won't Become Ready: Four Production Failures and Fixes
That article focused on why an InferenceService may not become Ready.
This one zooms out to a broader question:
What breaks when the GitOps control plane itself is unstable?
Most GitOps + AI serving tutorials still focus on the happy path — install ArgoCD, apply KServe, deploy InferenceService, done. But in real platform work, the happy path is the easy part.
The hard part is when your app is OutOfSync, the webhook has no endpoints, and everything looks healthy except the thing you actually need.
This post covers the five failure modes that repeatedly broke KServe deployments in a real production-grade platform build, with exact terminal output, root causes, and the fixes that worked.
All failures come from hands-on implementation work documented here:
Project repo: github.com/sodiq-code/neuroscale-platform
The platform context
Stack:
- ArgoCD — GitOps reconciliation
- KServe — model serving (InferenceService, runtimes)
- Knative + Kourier — serving networking
- Kyverno — policy guardrails
- Backstage — self-service PR generation
GitOps root app:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: neuroscale-infrastructure
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/sodiq-code/neuroscale-platform.git
    targetRevision: main
    path: infrastructure/apps
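The spec above is abridged. A complete Application also needs a spec.destination, and an app-of-apps root typically runs with automated sync. Every field below other than spec.source is an illustrative assumption, not copied from the repo:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: neuroscale-infrastructure
  namespace: argocd
spec:
  project: default                            # assumed
  source:
    repoURL: https://github.com/sodiq-code/neuroscale-platform.git
    targetRevision: main
    path: infrastructure/apps
  destination:                                # assumed values
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true                             # delete resources removed from Git
      selfHeal: true                          # revert manual cluster drift
```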
Failure Mode 1: Webhook Has No Endpoints — Sync Fails Cluster-Wide
Time lost: ~1 hour | Impact: All InferenceService operations blocked
Symptom
ArgoCD syncs child apps and hits this:
$ kubectl -n argocd describe application ai-model-alpha
...
Message: admission webhook
"inferenceservice.kserve-webhook-server.validator.webhook"
denied the request: Internal error occurred:
no endpoints available for service "kserve-webhook-server-service"
Meanwhile the KServe controller pod shows only 1 of 2 containers ready:
$ kubectl -n kserve get pods
NAME READY STATUS
kserve-controller-manager-8d7c5b9f4-xr2lm 1/2 Running
$ kubectl -n kserve describe pod kserve-controller-manager-8d7c5b9f4-xr2lm
...
kube-rbac-proxy:
State: Waiting
Reason: ImagePullBackOff
Image: gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
Events:
Warning Failed kubelet
Failed to pull image: unexpected status code 403 Forbidden
Root Cause
The kube-rbac-proxy sidecar inside kserve-controller-manager was pulling from gcr.io/kubebuilder/ — a registry that restricted access in late 2025. The manager container was healthy, but because the sidecar was not running, the pod never became fully ready, so the webhook service had no endpoints to route to. Result: every InferenceService apply or update was blocked cluster-wide.
Fix
Remove the sidecar via Kustomize strategic merge patch:
# infrastructure/serving-stack/patches/
# kserve-controller-kube-rbac-proxy-image.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kserve-controller-manager
  namespace: kserve
spec:
  template:
    spec:
      containers:
        - name: kube-rbac-proxy
          $patch: delete
Verify webhook endpoints are restored after re-sync:
$ kubectl -n kserve get endpoints kserve-webhook-server-service
NAME ENDPOINTS AGE
kserve-webhook-server-service 10.42.0.23:9443 45s
Lesson
When webhook endpoints are missing, your app YAML is never the real problem. Diagnose the controller first. An external registry access change can silently kill your entire admission layer cluster-wide with no obvious error in the app itself.
Failure Mode 2: CRD Deleted by a Misapplied Patch — All Endpoints Gone Instantly
Time lost: 4 minutes recovery | Impact: SEV-1 equivalent — all InferenceServices deleted
Symptom
All InferenceService objects disappeared silently:
$ kubectl -n default get inferenceservices
No resources found in default namespace.
$ kubectl -n argocd get application demo-iris-2
NAME SYNC STATUS HEALTH STATUS
demo-iris-2 OutOfSync Missing
Root Cause
A Kustomize patch file named remove-inferenceservice-crd.yaml was mistakenly applied directly with kubectl apply -f instead of being used as a build-time patch inside kustomization.yaml. The file contained a $patch: delete directive:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: inferenceservices.serving.kserve.io
$patch: delete
When applied directly, it deleted the actual CRD from Kubernetes. When a CRD is deleted, Kubernetes immediately garbage-collects every custom resource of that type. Every InferenceService was gone within seconds.
Fix
Restore the CRD immediately:
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.12.1/kserve.yaml
kubectl wait crd/inferenceservices.serving.kserve.io \
--for=condition=Established --timeout=60s
kubectl -n argocd patch application demo-iris-2 \
--type merge \
-p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
Lesson
$patch: delete in a Kustomize file is a build-time instruction — it tells kustomize build to omit that resource from output. It must never be applied directly with kubectl apply -f. Ambiguous filenames like remove-inferenceservice-crd.yaml are dangerous footguns. In a production cluster with 50 deployed models this would be a full SEV-1.
⚠️ Rule: Any file containing $patch: delete must only ever be referenced inside a kustomization.yaml patches block, never applied directly.
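For reference, the safe wiring looks like a standard kustomization patches block. The paths below are assumptions about the repo layout, used only to show the shape:

```yaml
# infrastructure/serving-stack/kustomization.yaml (illustrative; paths assumed)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - upstream/kserve.yaml                            # rendered KServe install
patches:
  - path: patches/remove-inferenceservice-crd.yaml  # the $patch: delete file
```

Here `kustomize build` consumes the directive and simply omits the CRD from its rendered output; kubectl never sees the delete instruction at all.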
Failure Mode 3: Permanent OutOfSync Due to Label Key Mismatch
Time lost: 2 weeks undetected | Impact: CI was green while policy enforcement was silently broken
Symptom
A PR is merged, ArgoCD syncs, but the InferenceService stays OutOfSync/Degraded:
$ kubectl -n argocd get application demo-iris-2
NAME SYNC STATUS HEALTH STATUS
demo-iris-2 OutOfSync Degraded
Kyverno denies the resource at admission:
Error from server: error when creating "STDIN":
admission webhook "clusterpolice.kyverno.svc" denied the request:
resource InferenceService/default/test-model was blocked due to the following policies
require-standard-labels-inferenceservice:
check-owner-and-cost-center-on-isvc: 'validation error:
InferenceService resources must set metadata.labels.owner and
metadata.labels.cost-center.
rule check-owner-and-cost-center-on-isvc failed at path
/metadata/labels/cost-center/'
But the label is present in the manifest:
$ kubectl -n default get inferenceservice demo-iris-2 \
-o jsonpath='{.metadata.labels}' | python3 -m json.tool
{
"owner": "platform-team",
"costCenter": "ai-platform"
}
Root Cause
costCenter (camelCase) and cost-center (kebab-case) are completely different Kubernetes label keys. The Backstage template skeleton was generating costCenter while the Kyverno policy required cost-center, so every generated manifest failed admission even though it looked correct at a glance. CI never caught it, because the CI policy check was itself broken — the mismatch only surfaced at admission time.
Additionally, kyverno-cli apply exits with code 0 even when policy violations are found. CI was checking $? rather than ${PIPESTATUS[0]}, so the CI step appeared green while enforcement was completely broken for two weeks.
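Based on the admission message, the denying policy was presumably shaped like the sketch below. This is a reconstruction for illustration; the actual file in the repo may differ:

```yaml
# Hypothetical reconstruction of the denying Kyverno policy, derived from the
# names in the admission error above — not the repo's actual file.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-standard-labels-inferenceservice
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-owner-and-cost-center-on-isvc
      match:
        any:
          - resources:
              kinds:
                - InferenceService
      validate:
        message: >-
          InferenceService resources must set metadata.labels.owner and
          metadata.labels.cost-center.
        pattern:
          metadata:
            labels:
              owner: "?*"          # any non-empty value
              cost-center: "?*"    # kebab-case key — costCenter does not match
```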
Fix
Standardize on kebab-case throughout (Kubernetes convention):
# Backstage template skeleton
# apps/${{ values.name }}/inference-service.yaml
labels:
  owner: platform-team
  cost-center: ai-platform  # was: costCenter
Fix the CI Kyverno check to catch actual violations:
set +e
docker run --rm -v "$PWD:/work" -w /work ghcr.io/kyverno/kyverno-cli:v1.12.5 \
  apply infrastructure/kyverno/policies/*.yaml \
  --resource "${app_files[@]}" \
  2>&1 | tee /tmp/kyverno-output.txt
kyverno_exit="${PIPESTATUS[0]}"
set -e

if [ "${kyverno_exit}" -ne 0 ] \
  || grep -qE "^FAIL" /tmp/kyverno-output.txt \
  || grep -qE "fail: [1-9][0-9]*" /tmp/kyverno-output.txt; then
  echo "Kyverno policy violations detected. Failing CI."
  exit 1
fi
Lesson
$? captures the exit code of tee, not kyverno. ${PIPESTATUS[0]} captures kyverno's actual exit code. "Guardrails exist" and "guardrails enforce" are different states. The most dangerous failure mode for a policy system is silent false positives — everything looks green while nothing is actually being enforced.
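The difference is easy to demonstrate outside CI. This minimal bash repro (bash-specific, since PIPESTATUS is a bash array) stands in for the failing kyverno step with `false`:

```shell
#!/usr/bin/env bash
# Reproduce the false green: in a pipeline, $? is the exit code of the LAST
# command (tee), not of the first command (the failing check).
false | tee /tmp/demo-output.txt > /dev/null
plain_status=$?                  # 0 — tee succeeded, masking the failure

false | tee /tmp/demo-output.txt > /dev/null
real_status="${PIPESTATUS[0]}"   # 1 — exit code of `false`, the first command

echo "plain=${plain_status} real=${real_status}"   # prints: plain=0 real=1
```

Any check of the form `cmd | tee log; if [ $? -ne 0 ]` has this hole; capture `${PIPESTATUS[0]}` immediately after the pipeline, before any other command overwrites it.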
Failure Mode 4: Kyverno Install Breaks ArgoCD Reconciliation Loop
Time lost: 2–5 minutes per cluster | Impact: All ArgoCD apps enter Unknown state
Symptom
After adding Kyverno to the platform, previously healthy apps enter Unknown state:
$ kubectl -n argocd get applications
NAME SYNC STATUS HEALTH STATUS
neuroscale-infrastructure Synced Healthy
serving-stack Unknown Unknown # was Healthy 10 minutes ago
policy-guardrails Synced Healthy
$ kubectl -n argocd describe application serving-stack
...
Message: rpc error: code = Unavailable desc = connection refused
Root Cause
Kyverno installs its own ValidatingWebhookConfiguration and MutatingWebhookConfiguration during install. While Kyverno is initializing, the webhook configurations are registered but point to endpoints that do not exist yet. During this window, any kubectl apply operation — including ArgoCD's sync reconciliation loop — times out waiting for a response from a not-yet-running webhook server. This cascades into the ArgoCD repo-server losing its connection.
Fix
Add a Kyverno webhookAnnotations ConfigMap patch to suppress automatic webhook registration during the initialization window:
# infrastructure/kyverno/kustomization.yaml
patches:
  - target:
      kind: ConfigMap
      name: kyverno
    patch: |-
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: kyverno
        namespace: kyverno
      data:
        webhookAnnotations: "{}"
After Kyverno reaches Running state, force a hard refresh:
kubectl -n argocd patch application serving-stack \
--type merge \
-p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
Lesson
Adding a policy engine to an existing cluster disrupts all other ArgoCD-managed applications during the install window. In production, this requires a maintenance window or a canary install strategy. Kyverno must be fully healthy before any other component syncs.
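One way to encode that ordering is ArgoCD sync waves on the child Applications under the app-of-apps root. The sketch below is illustrative; the annotation values are assumptions, not taken from the repo:

```yaml
# Sketch: sync Kyverno in an earlier wave than anything that must pass admission.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: policy-guardrails
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # Kyverno installs first
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: serving-stack
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "0"    # does not sync until wave -1 is healthy
```

Sync waves order resources within a single parent Application's sync, so this works when both child Applications are managed by the same root app.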
Failure Mode 5: Stale Admission Webhook Blocks All Resource Creation
Time lost: 30+ minutes | Impact: All Deployments in the namespace silently blocked
Symptom
After fixing the repo-server, apps sync but Deployments never appear:
$ kubectl get applications -n argocd
NAME SYNC STATUS HEALTH STATUS
neuroscale-infrastructure Synced Healthy
test-app Synced Progressing # stuck
$ kubectl get deploy -n default
No resources found in default namespace.
ArgoCD shows the Deployment as "synced" but it does not exist — a contradiction. Checking conditions:
$ kubectl -n argocd get application test-app -o yaml | grep -A 20 conditions
  conditions:
  - message: 'Failed sync attempt: one or more objects failed to apply,
      reason: Internal error occurred: failed calling webhook
      "validate.nginx.ingress.kubernetes.io":
      failed to call webhook: Post
      "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/...":
      dial tcp 10.96.x.x:443: connect: connection refused'
    type: SyncError
Root Cause
A ValidatingWebhookConfiguration from a previous cluster experiment was still registered but pointing to a service that no longer existed. Kubernetes admission webhooks are cluster-scoped. The stale ingress-nginx webhook was intercepting every resource creation attempt and failing them — the error only appears in ArgoCD events, not on the Deployment itself.
Fix
# Discover stale webhooks
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
# Delete the stale one
kubectl delete validatingwebhookconfiguration ingress-nginx-admission
# Force ArgoCD to retry
kubectl -n argocd patch application test-app \
--type merge \
-p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
After deletion:
$ kubectl get deploy -n default
NAME READY UP-TO-DATE AVAILABLE AGE
nginx-test 1/1 1 1 23s
Lesson
A stale webhook from a previous workload silently blocks all resource creation in the affected namespace for hours without any obvious error message. The admission error only appears in ArgoCD events logs, not on the resource itself. Always check for stale webhooks before blaming manifests.
The Triage Sequence That Saves Hours
When a KServe app is failing in ArgoCD, run this exact order before touching any manifest:
# 1. Environment gate — if this fails, stop and fix environment first
kubectl get nodes
kubectl -n argocd get applications
# 2. Control-plane health
kubectl -n kserve get deploy,pods,svc,endpoints
kubectl get crd | grep serving.kserve.io
# 3. Controller logs
kubectl -n kserve logs deploy/kserve-controller-manager --tail=100
# 4. Webhook availability
kubectl -n kserve get endpoints kserve-webhook-server-service
# 5. Stale webhooks
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
# 6. App-level sync error detail
kubectl -n argocd get application <app-name> -o yaml | grep -A 20 conditions
Only after every step above passes should you edit app manifests.
Why This Matters for Platform Teams
A platform is credible when it supports both:
- Self-service delivery — the Golden Path works
- Self-service recovery — failures are understandable and fixable without a platform expert
Most teams build the first and postpone the second. That creates operational debt fast.
The fix is not more dashboards. It is better failure-model documentation, tighter GitOps guardrails, and the discipline to document what breaks — not just what works.
A platform is not "done" when the happy path works. It's done when the failure path is understandable and recoverable.
What I Would Improve Next
- Pre-merge CI assertions for probe and resource fields in rendered manifests
- Explicit dependency ordering using ArgoCD sync waves to prevent Kyverno install disruption
- Conformance checks for Helm dependency values nesting to catch silently ignored overrides
- Policy test fixtures that verify both pass and fail cases in CI
See Also
- docs/REALITY_CHECK_MILESTONE_1_GITOPS_SPINE.md — ArgoCD spine failures with exact terminal output
- docs/REALITY_CHECK_MILESTONE_4_GUARDRAILS.md — Kyverno CI false-green and the ${PIPESTATUS[0]} fix
- infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md — full incident postmortem with 12-section RCA
- docs/REALITY_CHECK_MILESTONE_2_KSERVE_SERVING.md — the kube-rbac-proxy failure in full detail
Jimoh Sodiq Bolaji | Platform Engineer | Technical Content Engineer | Abuja, Nigeria
NeuroScale Platform · Dev.to