I spent a weekend debugging a "permission denied" error in a custom controller that only appeared when the pod migrated to a different node. The fix wasn't in the code, but in a ClusterRoleBinding that I'd lazily set to cluster-admin six months prior, which had since been partially overridden by a namespace-level policy I forgot I implemented. It was a reminder that "just give it admin" is a technical debt bomb that eventually explodes.
If you're running a small homelab, cluster-admin is tempting. It's the path of least resistance. But once you start deploying AI agents that can execute code or industrial IoT pipelines that touch physical hardware, a compromised pod with cluster-wide permissions is a catastrophe. You need a way to give your apps exactly what they need to function and nothing more.
What I tried first
My first instinct with RBAC was to use ClusterRoles for everything. I figured if I defined the permissions once at the cluster level, I wouldn't have to keep rewriting the same Role YAML for every new namespace I created. I'd create a ClusterRole for "pod-reader" and then bind it to the ServiceAccount in each namespace.
This worked until I realized I was creating a massive auditing nightmare. I had no easy way to see which specific pods in which namespaces had these permissions without grepping through dozens of ClusterRoleBindings.
Then I tried the opposite: creating hyper-specific Roles for every single microservice. I ended up with a YAML sprawl that was impossible to maintain. I was manually updating 15 different Role objects just to add a patch permission to a deployment. I was essentially treating RBAC like a manual checklist rather than a system.
The gap was in the middle. I needed a pattern that was scalable but strictly scoped.
The actual solution
The goal is to move from "it works" to "it's secure." This requires three pieces: a dedicated ServiceAccount, a scoped Role (or ClusterRole), and a RoleBinding that bridges them.
1. The Minimalist Service Account
Stop using the default service account. If you don't specify one, Kubernetes assigns the default SA in that namespace. If you've accidentally granted permissions to that default account, every single pod in that namespace now has those permissions.
# serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: agent-runtime-sa
  namespace: ai-workloads
automountServiceAccountToken: true # Only true if the pod actually needs to talk to the K8s API
For any pod that doesn't need to query the API server, I set automountServiceAccountToken: false instead. This prevents the token from being mounted into the pod's filesystem, removing one more attack vector.
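As a minimal sketch of the opt-out case, here is a pod that never talks to the API server. The pod name static-web, the ServiceAccount static-web-sa, and the image are hypothetical placeholders; the fields that matter are serviceAccountName and automountServiceAccountToken.

```yaml
# pod-no-token.yaml (illustrative; names and image are placeholders)
apiVersion: v1
kind: Pod
metadata:
  name: static-web
  namespace: ai-workloads
spec:
  serviceAccountName: static-web-sa   # hypothetical dedicated SA, assumed to exist already
  automountServiceAccountToken: false # pod-level setting; takes precedence over the SA's value
  containers:
    - name: web
      image: nginx:1.27 # placeholder image
```

Setting the field on the pod as well as on the ServiceAccount is a belt-and-braces choice: the pod-level value wins if the two disagree, so the manifest documents its own intent.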
2. Scoping the Role
Instead of cluster-admin, I define the exact verbs and resources. For an AI agent that needs to monitor its own pods but not touch secrets, the Role looks like this:
# role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-monitor-role
  namespace: ai-workloads
rules:
  - apiGroups: [""] # The core API group
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list"]
If the agent needs to operate across multiple namespaces but still keep limited permissions, I define a ClusterRole once and bind it with a RoleBinding (not a ClusterRoleBinding) in each namespace it needs. This is the key distinction: a RoleBinding that references a ClusterRole grants that role's permissions only within the binding's own namespace.
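A sketch of that ClusterRole-plus-RoleBinding pattern might look like the following. The pod-reader name echoes the earlier example; the iot-pipeline namespace and the binding name are assumptions made up for illustration.

```yaml
# clusterrole-pod-reader.yaml (illustrative)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-reader # cluster-scoped, so there is no namespace field
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
# One RoleBinding per namespace the agent should be able to read
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: agent-pod-reader   # hypothetical binding name
  namespace: iot-pipeline  # hypothetical second namespace; the grant applies here only
subjects:
  - kind: ServiceAccount
    name: agent-runtime-sa
    namespace: ai-workloads # the subject may live in a different namespace than the binding
roleRef:
  kind: ClusterRole # points at the cluster-level definition
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

The rules are written once, and each additional namespace costs one small RoleBinding rather than a copy of the whole Role.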
3. The Binding
This is where we connect the identity (SA) to the permissions (Role).
# rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: agent-monitor-binding
  namespace: ai-workloads
subjects:
  - kind: ServiceAccount
    name: agent-runtime-sa
    namespace: ai-workloads
roleRef:
  kind: Role
  name: agent-monitor-role
  apiGroup: rbac.authorization.k8s.io
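One way to check that the binding grants what you expect and nothing more is to ask the API server directly with a SubjectAccessReview. The manifest below is a sketch; the file name is made up, and it assumes your own user is allowed to create SubjectAccessReviews.

```yaml
# sar-secrets.yaml (illustrative)
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  # ServiceAccounts authenticate as system:serviceaccount:<namespace>:<name>
  user: system:serviceaccount:ai-workloads:agent-runtime-sa
  resourceAttributes:
    namespace: ai-workloads
    group: ""         # core API group
    resource: secrets
    verb: get         # the agent should NOT be able to do this
```

Submitting it with kubectl create -f sar-secrets.yaml -o yaml should return the object with status.allowed filled in. With the Role above as the only grant, I would expect false for secrets and true if you swap in resource: pods. The shorthand for the same question is kubectl auth can-i get secrets -n ai-workloads --as=system:serviceaccount:ai-workloads:agent-runtime-sa.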
Scaling with Policy-as-Code
Manually writing these for every service is tedious. I've started using Kyverno to automate the enforcement of these patterns. If a Job is created without a specific ServiceAccount, or if it's using the default account, Kyverno can either block it or automatically generate the required RBAC.
I implemented a policy that ensures all batch/jobs are linked to a scoped role, which is particularly useful for the ephemeral nature of AI training jobs or data processing tasks.
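# kyverno-policy.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: enforce-job-rbac
spec:
  # Enforce rejects violating Jobs at admission; the default (Audit) only reports them.
  # Newer Kyverno releases move this setting to validate.failureAction on the rule,
  # so check which form your version expects.
  validationFailureAction: Enforce
  background: true
  rules:
    - name: require-scoped-sa-on-jobs
      match:
        any:
          - resources:
              kinds:
                - Job
      validate:
        message: "Jobs must use a dedicated ServiceAccount. The 'default' account is forbidden."
        pattern:
          spec:
            template:
              spec:
                serviceAccountName: "!default"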
This forces the engineer (me) to actually think about the permissions before the pod ever hits the scheduler. You can read more about how I use these controllers in my post on Kyverno Admission Controllers.
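For the "automatically generate" side, a Kyverno generate rule can create a ServiceAccount whenever a matching namespace appears. This is an untested sketch rather than a policy I'd paste into production: the policy name, the workload-type label, and the job-runner-sa name are all hypothetical, and Kyverno's own controller needs RBAC permission to create whatever it generates.

```yaml
# kyverno-generate-sa.yaml (illustrative sketch; names and label are hypothetical)
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-job-runner-sa
spec:
  rules:
    - name: create-job-runner-sa
      match:
        any:
          - resources:
              kinds:
                - Namespace
              selector:
                matchLabels:
                  workload-type: batch # only namespaces carrying this label trigger the rule
      generate:
        apiVersion: v1
        kind: ServiceAccount
        name: job-runner-sa
        namespace: "{{request.object.metadata.name}}" # the namespace being created
        synchronize: true # keep the generated object aligned with this policy
        data:
          automountServiceAccountToken: false # opt in per pod only where the API is needed
```

The same shape should extend to generating a Role and RoleBinding, but that means granting Kyverno the permissions it hands out, so it deserves the same scrutiny as any other privileged controller.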
Why this works
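The logic here is about reducing the blast radius. In a standard K8s setup, the default service account is a liability. By creating a unique SA for every workload, you create a clear audit trail. With API server audit logging enabled, every request is recorded under the caller's username, system:serviceaccount:<namespace>:<name>, so you can see exactly which identity triggered an action. (kubectl get events won't show this; events describe what happened to an object, not which identity made the API call.)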
Using RoleBindings instead of ClusterRoleBindings is the most important part of this architecture. A ClusterRoleBinding is a global hammer. A RoleBinding is a scalpel. Even if you use a ClusterRole (which defines the what), the RoleBinding defines the where.
For complex AI agent workflows, I've moved toward a two-tier system. One SA handles the orchestration (higher privilege, limited to the control plane) and another handles the execution (near-zero privilege, strictly isolated). I detailed this approach in my post on Agent Credential Management.
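A rough sketch of that split, under my own assumptions about what each tier needs: every name here is hypothetical, and the orchestrator's verbs are a guess at a typical "launch and watch Jobs" workload rather than a prescription.

```yaml
# two-tier-sa.yaml (illustrative; all names are hypothetical)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: agent-orchestrator-sa
  namespace: ai-workloads
automountServiceAccountToken: true # the orchestrator calls the API to launch work
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: agent-executor-sa
  namespace: ai-workloads
automountServiceAccountToken: false # executors get no API token and no bindings
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-orchestrator-role
  namespace: ai-workloads
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: agent-orchestrator-binding
  namespace: ai-workloads
subjects:
  - kind: ServiceAccount
    name: agent-orchestrator-sa
    namespace: ai-workloads
roleRef:
  kind: Role
  name: agent-orchestrator-role
  apiGroup: rbac.authorization.k8s.io
```

The executor SA deliberately has no Role or RoleBinding at all. Code running under it can be as untrusted as you like, because there is nothing for a stolen identity to do.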
Lessons learned
The biggest surprise was how often third-party Helm charts ignore least-privilege. I've deployed several "industry standard" operators that requested cluster-admin by default. I've learned to always check the values.yaml for rbac.create: true and then manually inspect the templates to see what they're actually asking for.
I also hit a wall with resourceNames. You can actually restrict a Role to a specific instance of a resource:
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["agent-config-v1"] # Only this specific ConfigMap
    verbs: ["get"]
This is incredibly powerful but brittle. If you rotate your ConfigMap name, your application breaks with a 403. I only use resourceNames for critical secrets or global configs that never change.
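For the secrets case, a complete Role might look like the sketch below. The role name and the secret name agent-api-key are hypothetical.

```yaml
# role-named-secret.yaml (illustrative; names are hypothetical)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-secret-reader
  namespace: ai-workloads
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["agent-api-key"] # only this one Secret
    verbs: ["get"] # get only; list and watch are deliberately not granted
```

Two caveats worth knowing: create requests can't be restricted by resourceNames, because the object's name may not be known when the request is authorized, and list or watch with resourceNames only behaves as you'd hope when the client filters on metadata.name. Sticking to get keeps the rule easy to reason about.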
If I were to do this over again from the start, I would have implemented the Kyverno policies on day one. Trying to retroactively fix RBAC across a cluster with 50+ deployments is a nightmare of "break-fix-repeat."
The takeaway is simple: start with zero permissions. Add one verb at a time until the pod stops crashing. It's slower, but it's the only way to be sure you haven't left a backdoor open. If you're building similar high-stakes infrastructure, you might want to look into infrastructure consulting to avoid these common pitfalls.