This article is about a CI system that looked healthy for two weeks while
enforcing absolutely nothing.
The symptom was simple: a PR with a deliberately non-compliant Kubernetes
manifest was passing CI. The Kyverno policy check step showed green.
The manifest was missing a required label. Kyverno should have blocked it.
It did not.
This is the story of why, and the exact fix.
This is part of a series on building a production-hardened AI inference
platform on Kubernetes:
- Part 1: Why Your KServe InferenceService Won't Become Ready
- Part 2: 5 GitOps Failure Modes That Break KServe Deployments
- Part 3: Nine Ways Backstage Breaks Before Your Developer Portal Works
Project repo:
github.com/sodiq-code/neuroscale-platform
The context: what I was enforcing
The NeuroScale platform requires every InferenceService and Deployment
in the default namespace to carry two labels:
- owner — which team owns the workload
- cost-center — which budget the resource consumption is charged to
These labels feed directly into OpenCost for cost attribution. Without them,
workloads appear as uncategorised spend. With Kyverno admission policies
enforcing them at the cluster level, every resource is guaranteed to carry
cost attribution metadata.
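For contrast, here is a sketch of what a compliant manifest looks like. The workload name and the cost-center value are illustrative, not from the repo, and the spec is omitted:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: demo-model          # hypothetical workload name
  namespace: default
  labels:
    owner: platform-team    # owning team
    cost-center: ml-infra   # hypothetical budget code
# spec omitted for brevity
```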
The Kyverno ClusterPolicy looks like this:
# infrastructure/kyverno/policies/
# require-standard-labels-inferenceservice.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-standard-labels-inferenceservice
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-owner-and-cost-center-on-isvc
      match:
        any:
          - resources:
              kinds:
                - InferenceService
      validate:
        message: >-
          InferenceService resources must set
          metadata.labels.owner and metadata.labels.cost-center.
        pattern:
          metadata:
            labels:
              owner: "?*"
              cost-center: "?*"
The "?*" pattern requires each label to be present with a non-empty value. Admission enforcement works: apply an InferenceService without those labels and Kyverno blocks it at the API server:
$ kubectl apply -f bad-model.yaml
Error from server: admission webhook
"clusterpolice.kyverno.svc" denied the request:
resource InferenceService/default/bad-model was blocked
due to the following policies
require-standard-labels-inferenceservice:
check-owner-and-cost-center-on-isvc:
'validation error: InferenceService resources must set
metadata.labels.owner and metadata.labels.cost-center.'
That part worked correctly. The CI part did not.
The false-green: what it looked like
The CI workflow ran kyverno-cli against every changed manifest in apps/
on every pull request. The intent was to catch non-compliant manifests before
merge — shift-left enforcement before the manifest ever reached the cluster.
The original CI command:
docker run --rm -v "$PWD:/work" -w /work \
  ghcr.io/kyverno/kyverno-cli:v1.12.5 \
  apply infrastructure/kyverno/policies/*.yaml \
  --resource "${app_files[@]}"
A test PR was created with a manifest that deliberately lacked cost-center:
# apps/test-bad-model/inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: test-bad-model
  namespace: default
  labels:
    owner: platform-team
    # cost-center intentionally missing
Expected result: CI fails.
Actual result: CI passed.
$ git push origin feature/test-bad-policy
# ... CI runs ...
# Result: validate-policies-against-app-manifests: ✅ PASSED
The violation was printed to stdout. The job still showed green.
Root Cause Part 1: kyverno-cli apply exits 0 on violations
This is the core issue. kyverno-cli apply in the version used (v1.12.x)
exits with code 0 when it finds policy violations. It prints them to
stdout, but returns a success exit code.
You can verify this directly:
docker run --rm -v "$PWD:/work" -w /work \
ghcr.io/kyverno/kyverno-cli:v1.12.5 \
apply infrastructure/kyverno/policies/*.yaml \
--resource apps/test-bad-model/inference-service.yaml
# Output:
# PASS: 0, FAIL: 1, WARN: 0, ERROR: 0, SKIP: 0
#
# policy require-standard-labels-inferenceservice ->
# resource default/InferenceService/test-bad-model
# FAIL: check-owner-and-cost-center-on-isvc
#
echo $?
0 # <-- exits 0 despite the FAIL
The violation is visible in stdout. The exit code is 0. Any CI step that
only checks the exit code will report success.
Root Cause Part 2: $? captures tee, not kyverno
The CI command piped output through tee to capture it for logging:
docker run ... kyverno-cli apply ... \
2>&1 | tee /tmp/kyverno-output.txt
Even if kyverno-cli had exited non-zero, $? in bash captures the exit
code of the last command in the pipe — which is tee. tee always
exits 0 if it can write to the file.
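The tee-swallows-the-exit-code behavior is easy to reproduce with a toy pipeline; nothing here is Kyverno-specific, just bash:

```shell
#!/usr/bin/env bash
# A failing command piped through tee, which succeeds.
false | tee /dev/null
echo "\$? is $?"                           # 0 -- tee's exit code, not false's

# Re-run the pipeline, then read the first stage's status instead.
false | tee /dev/null
echo "PIPESTATUS[0] is ${PIPESTATUS[0]}"   # 1 -- the real failure
```

Note that PIPESTATUS is reset by every command, so it must be read (or saved into a variable) immediately after the pipeline.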
This means two separate problems were stacked:
- kyverno-cli apply exits 0 on violations (Kyverno behavior)
- $? captures tee's exit code, not kyverno's (bash pipe behavior)
Either problem alone would have caused the false-green.
Together they made the enforcement completely invisible.
The fix: dual check with ${PIPESTATUS[0]}
${PIPESTATUS[0]} captures the exit code of the first command in a
pipe, regardless of what the subsequent commands return. Combined with
stdout parsing for violation markers, this creates a reliable enforcement
check.
set +e
docker run --rm -v "$PWD:/work" -w /work \
  ghcr.io/kyverno/kyverno-cli:v1.12.5 \
  apply infrastructure/kyverno/policies/*.yaml \
  --resource "${app_files[@]}" \
  2>&1 | tee /tmp/kyverno-output.txt
kyverno_exit="${PIPESTATUS[0]}"
set -e

if [ "${kyverno_exit}" -ne 0 ] \
  || grep -qE "^FAIL" /tmp/kyverno-output.txt \
  || grep -qE "fail: [1-9][0-9]*" /tmp/kyverno-output.txt; then
  echo "Kyverno policy violations detected. Failing CI."
  exit 1
fi
Why two checks instead of one:
The ${PIPESTATUS[0]} check handles cases where kyverno-cli itself
exits non-zero — which may happen in future versions or on error conditions.
The stdout grep checks handle the current v1.12.x behavior where violations
print FAIL to stdout but exit 0. Together they cover both current behavior
and future behavior changes.
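The detection logic itself can be exercised without running kyverno-cli at all, by feeding it a saved output file. The sample text below mimics the v1.12.x output format shown earlier; the exit code is hard-wired to 0 to simulate the current CLI behavior:

```shell
#!/usr/bin/env bash
set -u

# Simulated kyverno-cli output containing one violation.
cat > /tmp/kyverno-output.txt <<'EOF'
PASS: 0, FAIL: 1, WARN: 0, ERROR: 0, SKIP: 0
policy require-standard-labels-inferenceservice ->
  resource default/InferenceService/test-bad-model
FAIL: check-owner-and-cost-center-on-isvc
EOF

kyverno_exit=0   # what v1.12.x actually returns on a violation

# Same dual check as the CI step: exit code OR violation markers in stdout.
if [ "${kyverno_exit}" -ne 0 ] \
  || grep -qE "^FAIL" /tmp/kyverno-output.txt \
  || grep -qE "fail: [1-9][0-9]*" /tmp/kyverno-output.txt; then
  echo "violation detected"
fi
```

The "^FAIL" pattern fires on the per-rule FAIL line even though the exit code is 0, which is exactly the case the original CI step missed.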
Why set +e before the command:
Without set +e, a non-zero exit from kyverno would immediately abort the
script before ${PIPESTATUS[0]} could be captured. set +e disables
errexit temporarily so we can capture and evaluate the exit code explicitly.
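An alternative to the set +e / ${PIPESTATUS[0]} dance is bash's pipefail option, which makes a pipeline return the last non-zero status of any stage, so tee can no longer mask an upstream failure. This is a design note rather than the workflow's actual code, and it would not help on v1.12.x by itself, since the CLI exits 0 on violations, so the stdout grep is still required either way:

```shell
#!/usr/bin/env bash
# With pipefail, the pipeline's status reflects the failing stage,
# not the final (successful) tee.
set -o pipefail

false | tee /dev/null
echo "pipeline status: $?"   # 1 (from false), not tee's 0
```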
Verification: proving the fix works
After the fix, the same test PR with the missing cost-center label now
fails CI correctly:
$ git push origin feature/test-non-compliant
# CI output:
validate-policies-against-app-manifests
Running kyverno policy check...
PASS: 0, FAIL: 1, WARN: 0, ERROR: 0, SKIP: 0
policy require-standard-labels-inferenceservice ->
resource default/InferenceService/test-bad-model
FAIL: check-owner-and-cost-center-on-isvc
Kyverno policy violations detected. Failing CI.
# Result: validate-policies-against-app-manifests: ❌ FAILED
A compliant manifest with both labels passes both the CI gate and cluster admission:
$ kubectl apply -f apps/demo-iris-2/inference-service.yaml
# CI result: validate-policies-against-app-manifests: ✅ PASSED
The complete GitHub Actions workflow step
Here is the full implementation used in the NeuroScale platform:
# .github/workflows/guardrails-checks.yaml
- name: Validate policies against app manifests
  run: |
    app_files=()
    while IFS= read -r -d '' f; do
      app_files+=("--resource" "$f")
    done < <(find apps/ -name "*.yaml" -print0)

    if [ ${#app_files[@]} -eq 0 ]; then
      echo "No app manifests found. Skipping policy check."
      exit 0
    fi

    set +e
    docker run --rm -v "$PWD:/work" -w /work \
      ghcr.io/kyverno/kyverno-cli:v1.12.5 \
      apply infrastructure/kyverno/policies/*.yaml \
      "${app_files[@]}" \
      2>&1 | tee /tmp/kyverno-output.txt
    kyverno_exit="${PIPESTATUS[0]}"
    set -e

    echo "--- Kyverno output ---"
    cat /tmp/kyverno-output.txt
    echo "--- Exit code: ${kyverno_exit} ---"

    if [ "${kyverno_exit}" -ne 0 ] \
      || grep -qE "^FAIL" /tmp/kyverno-output.txt \
      || grep -qE "fail: [1-9][0-9]*" /tmp/kyverno-output.txt; then
      echo "Kyverno policy violations detected. Failing CI."
      exit 1
    fi
    echo "All manifests passed policy checks."
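The find/while loop that builds app_files can be exercised locally against a scratch directory. The paths below are throwaway, not repo paths; the point is the NUL-delimited read, which survives filenames containing spaces or newlines:

```shell
#!/usr/bin/env bash
set -eu

# Scratch directory standing in for apps/.
tmpdir=$(mktemp -d)
touch "${tmpdir}/a.yaml" "${tmpdir}/b.yaml"

# Same pattern as the workflow: each file contributes a "--resource" flag
# plus its path, collected NUL-delimited from find.
app_files=()
while IFS= read -r -d '' f; do
  app_files+=("--resource" "$f")
done < <(find "${tmpdir}" -name "*.yaml" -print0)

echo "collected ${#app_files[@]} args"   # 2 files -> 4 args
rm -rf "${tmpdir}"
```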
Why this matters beyond a single platform
The kyverno-cli apply exit code behavior is not a bug — it is documented
behavior for the apply subcommand. But it is not prominently surfaced in
the getting-started documentation, and most CI examples in the wild use
the exit code check alone.
If your team is using Kyverno for compliance or security enforcement in CI,
and your CI step looks like this:
kyverno apply policies/ --resource manifests/
if [ $? -ne 0 ]; then
echo "Policy violation detected"
exit 1
fi
then your enforcement is silently not enforcing. The admission webhook at the cluster level still blocks violations, but your shift-left CI gate does not. Developers will only discover policy violations after merging, when the ArgoCD sync fails, not before.
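To see concretely why that naive gate never fires, substitute a stand-in for the CLI. fake_kyverno below is not the real tool; it just mimics the v1.12.x behavior of reporting a violation on stdout while exiting 0:

```shell
#!/usr/bin/env bash
# Stand-in for kyverno-cli v1.12.x: prints a FAIL summary, exits 0.
fake_kyverno() {
  echo "PASS: 0, FAIL: 1, WARN: 0, ERROR: 0, SKIP: 0"
  return 0
}

# The naive exit-code-only gate from the snippet above.
fake_kyverno
if [ $? -ne 0 ]; then
  echo "Policy violation detected"
  exit 1
fi
echo "gate passed despite the violation"   # this is what CI sees
```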
The distinction between "guardrails exist" and "guardrails enforce" is
exactly what separates platform engineering from platform theater.
The two-layer enforcement model
The fix is not just about the CI step. The NeuroScale platform uses two
enforcement layers that work together:
Layer 1 — PR time (shift-left): kyverno-cli in CI catches violations
before merge. Developers get fast feedback without needing a running cluster.
Layer 2 — Admission time (shift-down): Kyverno admission webhook blocks
non-compliant resources at the Kubernetes API server. Even if CI is bypassed
or misconfigured, nothing non-compliant reaches the cluster.
PR opened
↓
CI: kyverno-cli apply + ${PIPESTATUS[0]} check
↓ (blocks here if non-compliant)
PR merged
↓
ArgoCD sync
↓
Kyverno admission webhook
↓ (blocks here as second layer)
Resource created in cluster
Layer 1 gives fast developer feedback. Layer 2 is the safety net.
Both are required. Neither alone is sufficient.
Debugging Commands Reference
# Test a policy manually against a manifest
docker run --rm -v "$PWD:/work" -w /work \
ghcr.io/kyverno/kyverno-cli:v1.12.5 \
apply infrastructure/kyverno/policies/require-standard-labels-inferenceservice.yaml \
--resource apps/demo-iris-2/inference-service.yaml
# Check Kyverno admission webhook registrations
kubectl get validatingwebhookconfigurations | grep kyverno
kubectl get mutatingwebhookconfigurations | grep kyverno
# Verify Kyverno pods are healthy
kubectl -n kyverno get pods
kubectl -n kyverno get endpoints kyverno-svc
# List all installed ClusterPolicies and their enforcement mode
kubectl get clusterpolicies -o wide
# Review Kyverno admission decisions in controller logs
kubectl -n kyverno logs deploy/kyverno --tail=50 | \
grep -i "admit\|deny\|block"
What I Would Add Next
- A kyverno test subcommand integration in CI, for cases that require explicit pass/fail test fixtures rather than live policy simulation
- Background scan results surfaced as PR comments, using the Kyverno policy report API
- Separate policy validation for Deployment and InferenceService resources, to give more specific failure messages per resource type
See Also
- infrastructure/kyverno/policies/ — all ClusterPolicy definitions
- .github/workflows/guardrails-checks.yaml — complete CI workflow
- docs/REALITY_CHECK_MILESTONE_4_GUARDRAILS.md — full failure documentation
- docs/REALITY_CHECK_MILESTONE_5_COST_PROXY.md — where the fix was implemented
Jimoh Sodiq Bolaji | Platform Engineer | Technical Content Engineer
| Abuja, Nigeria
NeuroScale Platform