Most RBAC tutorials show you how to apply a Role and run kubectl auth can-i. Then they call it done.
That never sat right with me. In production, your workload doesn't authenticate using your kubeconfig. It authenticates using a ServiceAccount token mounted inside the pod. So if you've never tested RBAC from inside a running container, you haven't actually tested RBAC.
This project fixes that. I built a minimal but realistic RBAC setup for an observability tool, validated it from inside a live deployment, and then intentionally broke it to understand what failure actually looks like at the API server level.
The full source is here: github.com/adil-khan-723/K8s-RBAC
The Setup
Everything lives inside a dedicated observability namespace. The workload — a test deployment — runs under a purpose-built ServiceAccount called log-reader-sa. A namespace-scoped Role defines exactly what that identity is allowed to do. A RoleBinding connects the two.
observability (namespace)
│
├── log-reader-sa ← Dedicated ServiceAccount
├── log-reader-role ← Namespace-scoped Role
├── log-reader-binding ← Binds SA → Role
└── testing (Deployment) ← Live workload for validation
No ClusterRoles. No ClusterRoleBindings. Everything contained.
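The manifests behind this structure look roughly like the sketch below. Names and the namespace come from the article; the rule values are reconstructed from the permission table later in the post, so treat this as an illustration and check the repo for the exact files.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: log-reader-sa
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: log-reader-role
  namespace: observability
rules:
  # Read-only access to pods; each verb must be declared explicitly
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  # Subresource declared separately -- pods does NOT imply pods/log
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
  # Deployments live in the "apps" API group
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: log-reader-binding
  namespace: observability
subjects:
  - kind: ServiceAccount
    name: log-reader-sa
    namespace: observability
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: log-reader-role
```

Note that nothing here is cluster-scoped: all three objects carry a namespace.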
Why Each Decision Was Made
Never Use the Default ServiceAccount
The default ServiceAccount exists in every namespace automatically. Using it means every workload in that namespace shares the same identity. If one workload gets permissions, every other workload riding the default SA inherits them silently.
In any environment with more than one workload, this is a privilege creep problem waiting to happen. The fix is simple: create a dedicated ServiceAccount per workload.
Namespace-Scoped Role, Not ClusterRole
A ClusterRole bound with a ClusterRoleBinding grants access across every namespace — current and future. Even a ClusterRole bound with a RoleBinding still references a cluster-level object, which creates reuse risks and makes auditing harder.
A namespace-local Role and RoleBinding keeps the permission surface completely contained. If you can't explain why a workload needs cluster-wide access, it doesn't need a ClusterRole.
Verbs Are Not Optional
RBAC doesn't have a "read-only mode" switch. You have to declare every verb individually. The role grants get, list, and watch — and nothing else. Write verbs (create, update, patch, delete) are not included. Kubernetes does not default to denying write operations if you forget; it denies everything you don't explicitly allow.
Subresources Are Not Inherited
This is the one that catches people the most.
Access to pods does not grant access to pods/log. The API server treats them as completely separate targets. A Role missing pods/log applies cleanly with no warning from kubectl apply, then fails loudly at runtime when your monitoring tool tries to pull logs and gets a 403.
Every subresource you need must appear explicitly in the rules.
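You can see this separation without deploying anything by impersonating the ServiceAccount with kubectl auth can-i (the names below assume this project's setup, and the commands need a live cluster):

```shell
# pods and pods/log are evaluated as distinct resources by the API server.
# With the full Role applied, both should print "yes"; drop the pods/log
# rule and the second check flips to "no" while the first still passes.
kubectl auth can-i get pods \
  --as=system:serviceaccount:observability:log-reader-sa -n observability

kubectl auth can-i get pods/log \
  --as=system:serviceaccount:observability:log-reader-sa -n observability
```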
What Was Granted
| Resource | Verbs Granted |
|---|---|
| pods | get, list, watch |
| pods/log | get |
| deployments | get, list |
| secrets | Denied |
| Everything else | Denied |
Testing From Inside the Pod
This is the part most tutorials skip. I deployed a real workload under log-reader-sa and tested API calls directly from inside the container.
| Test | Result |
|---|---|
| List pods in namespace | ✅ Allowed |
| Get pod logs | ✅ Allowed |
| Access secrets | ❌ 403 Forbidden |
| Delete a pod | ❌ 403 Forbidden |
| Access resources outside namespace | ❌ 403 Forbidden |
Testing this way matters because it confirms the ServiceAccount token is correctly mounted, the API server is reachable from inside the pod, and the policy behaves exactly as written — not as assumed.
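The in-pod checks can be reproduced with nothing more than curl and the mounted token. The paths below are the standard in-pod ServiceAccount mount locations; the container image is assumed to have curl available. This is a sketch of the approach, not the exact script from the repo:

```shell
# Standard in-pod ServiceAccount mount paths
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
NS=$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)
API=https://kubernetes.default.svc

# Allowed by the Role: list pods in the namespace (expect HTTP 200)
curl -s -o /dev/null -w "%{http_code}\n" --cacert "$CACERT" \
  -H "Authorization: Bearer $TOKEN" "$API/api/v1/namespaces/$NS/pods"

# Not in the Role: read secrets (expect HTTP 403)
curl -s -o /dev/null -w "%{http_code}\n" --cacert "$CACERT" \
  -H "Authorization: Bearer $TOKEN" "$API/api/v1/namespaces/$NS/secrets"
```

If the first call returns anything other than 200, check connectivity and the token before touching the Role.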
Breaking It On Purpose
I removed pods/log from the Role rules to simulate a common production misconfiguration. The result was immediate: a 403 Forbidden response every time log retrieval was attempted.
This turned into a useful debugging exercise. There are four failure types that look similar on the surface but require completely different fixes:
- 401 Unauthorized — the identity wasn't recognized. The token is missing, expired, or invalid.
- 403 Forbidden — the identity was recognized and the request reached the API server, but the action isn't permitted. This is an RBAC problem.
- 404 Not Found — the resource doesn't exist. Not an authorization issue at all.
- Connection refused / timeout — the API server was never reached. Networking problem, not RBAC.
When you see a 403, you've already confirmed that the workload has connectivity, the token is valid, and the API server is up. The investigation starts and ends with the Role definition.
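A rough debugging sequence for that case, assuming this project's object names and a working kubectl context:

```shell
# Inspect the Role's rules and confirm the binding actually points at the SA
kubectl describe role log-reader-role -n observability
kubectl describe rolebinding log-reader-binding -n observability

# Re-check the exact verb/resource pair that returned 403
kubectl auth can-i get pods/log \
  --as=system:serviceaccount:observability:log-reader-sa -n observability
```

If can-i says "no" here, the fix is a missing rule in the Role; if it says "yes" but the pod still gets a 403, the workload is likely not running under the ServiceAccount you think it is.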
The Security Picture
If this workload were compromised, the damage is bounded:
- Read pod metadata and logs within observability — yes
- Access secrets — no
- Modify or delete anything — no
- Move laterally to other namespaces — no
- Escalate to cluster-level access — no
This is what blast radius limitation looks like in practice. The attacker gets read access to one namespace. That's a recoverable incident. Cluster-admin on a compromised workload is not.
What This Project Reinforced
RBAC in Kubernetes is an authorization layer evaluated at the API server for every request. The evaluation checks four things: who is making the request, what verb they're using, what resource they're targeting, and which namespace it's in.
Roles define what is permitted. Bindings attach identities to those permissions. Neither inherits anything. Neither assumes anything. Everything must be declared.
The habits worth building:
- Start with zero permissions and add only what you can justify
- Test from inside the workload, not just from your terminal
- List subresources explicitly — they are never implied
- Know the difference between a 401, 403, and 404
- Give every workload its own ServiceAccount
- Avoid ClusterRoleBindings unless the requirement is genuinely cluster-wide
Source
Full manifests and project structure: github.com/adil-khan-723/K8s-RBAC
If you've been writing RBAC policy without testing it from inside a running pod, this is a good starting point for closing that gap.