I spent an entire Saturday afternoon debugging why my CloudNativePG (CNPG) database cluster refused to initialize, only to find out my own security policies were killing the initdb jobs. I had a "require-resource-limits" policy active across the cluster. It sounded like a great idea: no pod enters the cluster without explicit CPU and memory limits. The documentation makes this look like a five-minute win for cluster stability.
What the docs don't tell you is that many Kubernetes Operators, including CNPG, spawn temporary Jobs or Pods that don't always inherit the limits you've defined in the primary custom resource. The admission controller saw a pod without limits, deemed it "illegal," and blocked it. The operator just kept retrying, and I kept wondering why my database was stuck in a pending state with no obvious error in the operator logs.
This is the gap between "Policy-as-Code" as a concept and Policy-as-Code in a real production environment. If you've ever tried to enforce standards across a multi-node cluster, you've probably looked at OPA Gatekeeper or Kyverno. I've used both. One requires you to learn a specialized language (Rego) that feels like a full-time job, and the other uses YAML.
Why you'd choose a Policy Engine
You reach this decision point when your cluster grows beyond a few hand-rolled manifests. Once you're using ArgoCD to scale your apps, you stop caring about individual pods and start caring about invariants.
These invariants usually fall into a few buckets:
- No one runs a container as root.
- Every deployment has a specific set of labels for monitoring.
- Resource limits are enforced so one runaway AI agent doesn't starve the rest of the node.
- Sidecars are automatically injected without manually editing every deployment.
You can do some of this with Pod Security Admission (PSA), but PSA is a blunt instrument. It checks pods against one of three fixed profiles per namespace and either admits or rejects them; it can't express custom rules and it never changes the request. A policy engine running as an admission webhook can mutate the request on the fly. If a developer forgets a security context, the controller doesn't just reject the pod; it injects the correct one.
Option A: OPA Gatekeeper
Gatekeeper is the industry standard for large-scale enterprises. It's built on Open Policy Agent (OPA), and its primary strength is its absolute precision.
Strengths
The logic is decoupled from the Kubernetes API. Because it uses Rego, you can write incredibly complex queries. If you need a policy that says "Allow this pod only if the user is in the 'dev' group AND the time is between 9 AM and 5 PM AND the image comes from a specific signed registry," Gatekeeper can do it.
Weaknesses
The learning curve is a cliff. Rego is a declarative query language, and if you've never used it, you'll spend more time fighting the syntax than actually securing your cluster. Debugging a failing Rego policy is a nightmare because the error messages are often opaque.
When it shines
Gatekeeper is for environments where compliance is a legal requirement. If you're in a highly regulated industry where you have to demonstrate your security posture to an auditor with precise, testable rules, the overhead of Rego is worth it.
Option B: Kyverno
Kyverno is the choice for those of us who just want things to work without learning a new language. It uses YAML for everything.
Strengths
It's native to Kubernetes. If you can write a Pod manifest, you can write a Kyverno policy. It handles mutation, validation, and generation. The "generation" part is a killer feature: you can tell Kyverno that whenever a new namespace is created, it should automatically generate a NetworkPolicy and a LimitRange for that namespace.
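To make the generation feature concrete, here is a minimal sketch of a policy that reacts to new namespaces. The policy name, rule names, and memory values are placeholders I picked for illustration, and the default-deny NetworkPolicy is just one plausible baseline, not a recommendation for every cluster.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: namespace-defaults            # hypothetical policy name
spec:
  rules:
  - name: generate-default-deny
    match:
      any:
      - resources:
          kinds:
          - Namespace
    generate:
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      name: default-deny
      # For a Namespace trigger, the target namespace is the object's own name
      namespace: "{{request.object.metadata.name}}"
      synchronize: false
      data:
        spec:
          podSelector: {}             # empty selector: every pod in the namespace
          policyTypes:
          - Ingress                   # no ingress rules listed, so all ingress is denied
  - name: generate-limit-range
    match:
      any:
      - resources:
          kinds:
          - Namespace
    generate:
      apiVersion: v1
      kind: LimitRange
      name: default-limit-range
      namespace: "{{request.object.metadata.name}}"
      synchronize: false
      data:
        spec:
          limits:
          - type: Container
            default:
              memory: 512Mi           # assumed value, tune per cluster
            defaultRequest:
              memory: 256Mi           # assumed value, tune per cluster
```

Each rule fires once when the namespace is created. Existing namespaces are not touched unless you configure the policy to handle them.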
Weaknesses
YAML has limits. While Kyverno is powerful, it can't match the raw computational logic of Rego for extremely complex edge cases. It's also easier to accidentally create "mutation loops" where a policy changes a resource, which triggers the policy again, ad infinitum.
When it shines
It's perfect for the GitOps-driven homelab or mid-sized production environment. It integrates cleanly with manifest validation pipelines and doesn't require a dedicated "Policy Engineer" to maintain.
Decision Framework
| Criterion | OPA Gatekeeper | Kyverno |
|---|---|---|
| Language | Rego (Specialized) | YAML (K8s Native) |
| Learning Curve | Steep | Shallow |
| Mutation | Possible, but complex | First-class citizen |
| Resource Generation | No | Yes |
| Performance | Extremely high | High |
| Configuration | ConstraintTemplates + Constraints | ClusterPolicies / Policies |
| Ideal User | Compliance/Security Teams | DevOps/Platform Engineers |
My Pick and Why
I use Kyverno. I've tried the "right way" with OPA, but in a lean environment, the cognitive load of Rego is a liability. I'd rather spend my time optimizing my AI agent orchestration than debugging a query language.
However, using Kyverno without a strategy is a fast track to a broken cluster. To make it actually work, you have to move away from the "happy path" and account for infrastructure overhead.
The "Infrastructure Exclusion" Pattern
The biggest mistake I made early on was applying policies globally. I had a policy that required all pods to have a specific security context. Suddenly, my Traefik ingress and ArgoCD controllers started crashing because they needed specific capabilities (like NET_ADMIN) that my policy explicitly forbade.
The fix is to implement a strict exclusion list. You cannot treat your infrastructure components the same way you treat your application workloads. I now use a combination of namespace exclusions and label-based filters to ensure that the "plumbing" of the cluster stays functional.
Here is how I handled the CNPG issue. Instead of a blanket "require limits" policy that blocks everything, I added an exclusion for any resource tagged by the CNPG operator.
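```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  rules:
  - name: require-resource-limits
    match:
      any:
      - resources:
          kinds:
          - Pod
    # Skip every pod the CNPG operator has labelled as part of a cluster
    exclude:
      any:
      - resources:
          selector:
            matchLabels:
              cnpg.io/cluster: "*"
    generate:
      apiVersion: v1
      kind: LimitRange
      name: default-limit-range
      # Create the LimitRange in the namespace of the pod that triggered the rule
      namespace: "{{request.namespace}}"
      data:
        spec:
          limits:
          - type: Container
            max:
              memory: 512Mi
```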
This policy ensures that most pods get a default limit range, but it stays out of the way of the database operator's internal jobs.
Handling Security Contexts without Breaking the Cluster
Another common pitfall is forcing security contexts on pods that actually need to run as root to perform system-level tasks. I've seen this happen with storage drivers and network plugins.
I prefer a "mutate-then-validate" approach. I use Kyverno to inject a sane default security context for everything, and then I create a small set of exceptions for the system namespaces.
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: default-security-context
spec:
  rules:
  - name: set-default-security-context
    match:
      any:
      - resources:
          kinds:
          - Pod
    # I use mutate here instead of generate to ensure the pod
    # spec is modified before it hits the scheduler
    mutate:
      patchStrategicMerge:
        spec:
          securityContext:
            runAsUser: 1000
            runAsGroup: 1000
            fsGroup: 2000
            supplementalGroups: [2001]
```
If you apply this globally, you'll likely break your CNI or your CSI driver. You must exclude kube-system and any namespace where you've deployed low-level infrastructure.
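As a rough sketch, the exclusion can be expressed as an exclude block that sits next to match inside the rule. Only kube-system is a given here; the other namespace names are assumptions about where the infrastructure lives and will differ per cluster.

```yaml
# Fragment: belongs inside the rule, at the same level as `match`
exclude:
  any:
  - resources:
      namespaces:
      - kube-system
      - traefik        # hypothetical: wherever the ingress controller runs
      - argocd         # hypothetical: wherever ArgoCD runs
      - cnpg-system    # hypothetical: wherever the CNPG operator runs
```

A namespace list is the coarse option. Where infrastructure shares a namespace with application workloads, a label selector is the finer-grained alternative.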
The Danger of synchronize: true
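Kyverno's generate rules have a setting called synchronize. When set to true, Kyverno keeps every generated resource in step with the policy: change the rule and the change is pushed to everything it generated, and manual edits to those resources are reverted. This sounds great in theory, but in practice, it can create a synchronization nightmare.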
I once had a policy generating NetworkPolicies for every new namespace. I changed the policy to add a new rule, and Kyverno attempted to update every single NetworkPolicy in the cluster simultaneously. This caused a spike in API server latency and, for a few minutes, left some of my internal services unreachable because the policies were in a state of flux.
My rule of thumb now is to avoid synchronize: true for high-churn resources. If you need to update a generated resource across the cluster, it's safer to trigger a rolling update via your GitOps pipeline than to let the admission controller try to rewrite the cluster state on the fly.
Orphaned Resources and the Cleanup Gap
Policy engines are great at creating things, but they're often bad at cleaning them up. I ran into this with a dashboard app called Homarr. I had a policy that generated certain config maps for the dashboard. When I deleted the application via the API, the generated resources stayed behind.
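This led to "phantom" items appearing in my dashboard UI. The application was gone, but the configuration lived on in the etcd store. Kyverno doesn't always track the lifecycle of generated resources perfectly, and with synchronize turned off it leaves them in place when the resource that triggered them is deleted.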
If you find yourself with orphaned records in your database or config stores, you might have to go in manually. For Homarr, I had to run a few SQL queries to purge the dead references:
```sql
-- Clean up orphaned item_layout and item records
DELETE FROM item_layout WHERE itemId NOT IN (SELECT id FROM item);
DELETE FROM item WHERE app_id NOT IN (SELECT id FROM app);
```
It's a reminder that while "Policy-as-Code" automates the deployment, it doesn't always automate the decommissioning.
Integration with the Wider Stack
A policy engine shouldn't exist in a vacuum. I've found that the most stable setups link Kyverno with other infrastructure tools. For example, I use it to ensure that any ingress resource created in the cluster has the correct annotations for cert-manager and Cloudflare DNS-01.
Instead of reminding every developer to add the cert-manager.io/cluster-issuer annotation, I wrote a mutation policy that adds it automatically if the ingress is in a production namespace. This removes the human element from the TLS chain.
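A minimal sketch of that mutation, under two assumptions that are mine rather than anything fixed by cert-manager: production namespaces carry an environment: production label, and the ClusterIssuer is called letsencrypt-prod.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-cluster-issuer-annotation   # hypothetical policy name
spec:
  rules:
  - name: add-cluster-issuer
    match:
      any:
      - resources:
          kinds:
          - Ingress
          # Assumed convention: production namespaces are labelled like this
          namespaceSelector:
            matchLabels:
              environment: production
    mutate:
      patchStrategicMerge:
        metadata:
          annotations:
            # The +() anchor only adds the annotation when it is missing,
            # so an explicit choice by the developer is left alone
            +(cert-manager.io/cluster-issuer): letsencrypt-prod   # hypothetical issuer name
```

The add-if-absent anchor matters here: without it, the policy would overwrite an issuer that someone set deliberately.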
Similarly, I use Kyverno to enforce that all SealedSecrets are tagged with an owner label. This makes it significantly easier to track who owns which secret when I'm auditing the cluster for old, unused credentials.
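The owner check could look roughly like the sketch below. The label key owner and the policy name are illustrative. Note that newer Kyverno releases move the failure action into the rule itself, so check which form your version expects.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-sealedsecret-owner    # hypothetical policy name
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-owner-label
    match:
      any:
      - resources:
          kinds:
          - SealedSecret
    validate:
      message: "Every SealedSecret needs an owner label."
      pattern:
        metadata:
          labels:
            owner: "?*"               # any non-empty value
```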
Lessons Learned
The biggest takeaway from my time with admission controllers is that the "happy path" is a lie. The documentation shows you how to block a pod, but it doesn't show you the three hours of debugging you'll do when a system-critical operator gets blocked by that same policy.
I've learned to follow three strict rules:
- Test in a sandbox. Never apply a new ClusterPolicy to a production cluster without running it in audit mode first. Kyverno's audit mode lets you see what would have been blocked without actually blocking it.
- Exclude the plumbing. Your infrastructure (Traefik, ArgoCD, CNPG, etc.) should almost always be exempt from general application policies.
- Keep it simple. If a policy requires more than a few lines of complex YAML logic, it's probably time to ask if that constraint should be handled at the CI/CD level rather than the admission level.
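Put together, the first two rules might look like this sketch: a validating version of the limits policy that runs in audit mode and skips CNPG-labelled pods. The policy name is made up, and the spec-level validationFailureAction field is the older form, which newer Kyverno releases replace with a rule-level setting.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-limits-audit          # hypothetical policy name
spec:
  # Audit records violations in policy reports instead of rejecting the pod
  validationFailureAction: Audit
  background: true
  rules:
  - name: check-container-limits
    match:
      any:
      - resources:
          kinds:
          - Pod
    exclude:
      any:
      - resources:
          selector:
            matchLabels:
              cnpg.io/cluster: "*"    # leave the database operator's pods alone
    validate:
      message: "CPU and memory limits are required."
      pattern:
        spec:
          containers:
          - resources:
              limits:
                cpu: "?*"
                memory: "?*"
```

Once the policy reports stop showing surprises, switching Audit to Enforce is a one-line change.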
I've moved toward using manifest validation in CI to catch the obvious errors before they ever hit the API server. This reduces the load on the admission controller and provides faster feedback to the person writing the YAML.
If you're building out your own infrastructure and need help designing a secure, automated pipeline for AI agents or industrial systems, you can check out my services. I focus on the gap between the documentation and the actual production reality, which is usually where the most expensive bugs live.