Automate Kubernetes RBAC Audits with Python and a Custom Controller

#python #kubernetes #devops

Originally published on kuryzhev.cloud

The Night Our Cluster Opened Its Doors to Everyone

A single kubectl apply from a well-meaning junior developer bound cluster-admin to the CI service account — and our Kubernetes cluster stayed wide open for 11 days before anyone noticed. No alert fired. No pipeline failed. No one said a word. The misconfiguration was discovered only because a compliance review happened to trigger a manual RBAC audit, completely unrelated to the incident itself. That audit was the only reason we found it at all.

The setup at the time was a multi-tenant cluster with 40-plus namespaces, shared by six product teams. The developer had been debugging a Helm deployment failure and needed broader permissions temporarily. They added the binding, fixed the issue, and forgot to clean it up. Completely understandable. Completely dangerous. For those 11 days, any workload running under that CI service account — or anything that could impersonate it — had full cluster-admin rights. In a cluster that hosts a payments namespace and a production environment for a regulated service, that is not a theoretical risk. That is an open door.

The real lesson was not about the developer. It was about us. We had no Kubernetes RBAC compliance automation in place. Manual review of RBAC objects at scale is structurally impossible to do reliably. With 40+ namespaces, hundreds of RoleBindings, and teams deploying multiple times per day, the gap between "what we think the permissions look like" and "what they actually are" grows every single day. This article is about the Python controller we built to close that gap permanently.

Why RBAC Drift Happens Faster Than You Can Review It

Kubernetes RBAC has a design property that most people do not think about until it bites them: it is purely additive. There are no deny rules, so every extra binding can only widen access. When you run kubectl apply on a RoleBinding, it never removes excess permissions that were granted out-of-band through other bindings. Helm chart upgrades can silently widen permission scopes between versions, and nobody notices because the upgrade succeeds and the app works. The permissions just quietly expand.

We identified three root causes in our postmortem. First, there was no enforcement at admission time. A developer could bind any role to any subject and the API server would accept it without question. Second, we had no diff-based alerting on RBAC objects — changes to ClusterRoleBindings were not treated as security-relevant events, so they never triggered anything. Third, and most critically, ClusterRoleBindings live outside namespace-scoped review cycles. Most teams think about permissions at the namespace level. ClusterRoleBindings are invisible in that mental model.
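A diff-based alert of the kind the postmortem found missing could start from something like the sketch below. It compares two snapshots of bindings, each reduced to a plain dict keyed by binding name. The snapshot shape and the diff_bindings name are assumptions made for illustration; they are not part of the controller shown later.

```python
from typing import Dict, List, Tuple

# Each binding is reduced to (role name, sorted tuple of "kind/name" subjects).
Snapshot = Dict[str, Tuple[str, Tuple[str, ...]]]


def diff_bindings(old: Snapshot, new: Snapshot) -> Dict[str, List[str]]:
    """Return binding names that were added, removed, or changed between snapshots."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(n for n in set(old) & set(new) if old[n] != new[n])
    return {"added": added, "removed": removed, "changed": changed}


before = {"ci-view": ("view", ("ServiceAccount/ci",))}
after = {
    "ci-view": ("view", ("ServiceAccount/ci",)),
    "ci-debug": ("cluster-admin", ("ServiceAccount/ci",)),
}
print(diff_bindings(before, after))
# {'added': ['ci-debug'], 'removed': [], 'changed': []}
```

Any non-empty "added" or "changed" list for cluster-scoped bindings is the kind of event that would have been worth a notification during those 11 days.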

The two most dangerous patterns we see in the wild are bindings to the system:masters group and wildcard rules (verbs: ["*"], often paired with resources: ["*"]). Both are completely legal YAML. Both pass kubectl apply without any warning. Wildcards are particularly insidious because they keep growing after the role is written: if a new resource type is later added to the API server, any rule with wildcard resources automatically covers it, and wildcard verbs extend to every verb those resources support. No one had to make a decision. The permission just appeared.
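Wildcards can hide in more than one field of a rule. As a rough illustration, the sketch below checks a single policy rule, represented as a plain dict in the same shape as the YAML, for "*" in verbs, resources, and apiGroups. The wildcard_findings helper is hypothetical and separate from the controller code later in the article.

```python
from typing import Dict, List


def wildcard_findings(rule: Dict[str, List[str]]) -> List[str]:
    """Return which fields of a single RBAC policy rule contain the "*" wildcard."""
    findings = []
    for field in ("verbs", "resources", "apiGroups"):
        if "*" in rule.get(field, []):
            findings.append(f"wildcard_{field}")
    return findings


cluster_admin_like = {"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]}
read_only = {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]}
print(wildcard_findings(cluster_admin_like))
# ['wildcard_verbs', 'wildcard_resources', 'wildcard_apiGroups']
print(wildcard_findings(read_only))
# []
```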

There is also the system:anonymous and system:unauthenticated subject problem. Before hardening, default GKE clusters can have bindings to these subjects. If you have never audited for them, run this one-liner right now before reading further:

kubectl get clusterrolebindings -o json \
  | jq '[.items[] | select(any(.subjects[]?; .name == "system:anonymous" or .name == "system:unauthenticated"))]'

If that returns anything other than an empty array, stop and fix it. Then come back and build the controller.

Building the Audit Engine — A Python Controller That Catches What Humans Miss

The controller we built uses the kubernetes Python client at version 28.1.0, which maps to the Kubernetes 1.28 API. One critical gotcha here: mismatched client versions cause silent deserialization errors on newer CRD fields. Pin the version and do not let dependabot auto-upgrade it without a test run against your cluster version first.
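One way to make that pin harder to break by accident is a startup check that compares the client's major version with the cluster's minor version. The sketch below only shows the comparison; client_matches_cluster is a hypothetical helper, and it assumes you feed it kubernetes.__version__ and the minor field returned by client.VersionApi().get_code().

```python
import re


def client_matches_cluster(client_version: str, server_minor: str) -> bool:
    """True when the client's major version equals the cluster's minor version.

    client_version looks like "28.1.0"; server_minor looks like "28" or "28+"
    (some managed clusters append a "+" suffix to the minor version).
    """
    client_major = int(client_version.split(".")[0])
    match = re.match(r"\d+", server_minor)
    if match is None:
        raise ValueError(f"unrecognised server minor version: {server_minor!r}")
    return client_major == int(match.group())


print(client_matches_cluster("28.1.0", "28+"))
# True
```

Whether a mismatch should abort startup or only log a warning is a judgment call; failing loudly fits the "do not auto-upgrade without a test run" rule above.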

The architecture is straightforward. An audit loop re-lists ClusterRoleBinding and ClusterRole resources every 300 seconds; RoleBinding objects can be run through the same check_binding function, although the listing below only covers the cluster-scoped objects. Policy rules live in a ConfigMap at kube-system/rbac-audit-policy. When a violation is detected, the controller writes an RBACViolation custom resource (API group audit.kuryzhev.cloud, version v1alpha1) and exposes Prometheus metrics on port 8080. Here is the controller implementation:

# rbac_auditor_controller.py
# Requires: kubernetes==28.1.0, prometheus_client==0.20.0
# Deploy as a Deployment in kube-system with a dedicated ServiceAccount

import time
import yaml
import logging
from kubernetes import client, config
from kubernetes.client.exceptions import ApiException
from prometheus_client import start_http_server, Counter, Gauge

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rbac-auditor")

# --- Prometheus metrics ---
VIOLATIONS_TOTAL = Counter(
    "rbac_violations_total",
    "Total RBAC violations detected",
    ["namespace", "kind", "name", "rule"]
)
LAST_AUDIT_TS = Gauge("rbac_last_audit_timestamp", "Unix timestamp of last audit cycle")
BINDINGS_SCANNED = Counter("rbac_bindings_scanned_total", "Total bindings evaluated")

# --- Load policy from ConfigMap ---
def load_policy(core_v1: client.CoreV1Api) -> dict:
    cm = core_v1.read_namespaced_config_map("rbac-audit-policy", "kube-system")
    return yaml.safe_load(cm.data["policy.yaml"])

# --- Core violation checks ---
def check_binding(binding, policy: dict) -> list[str]:
    violations = []
    subjects = binding.subjects or []

    # Rule 1: Forbidden subjects (e.g. system:anonymous)
    forbidden_subjects = policy.get("forbidden_subjects", [])
    for subject in subjects:
        if subject.name in forbidden_subjects:
            violations.append(f"forbidden_subject:{subject.name}")

    # Rule 2: Binding references a forbidden ClusterRole by name
    role_ref = binding.role_ref
    if role_ref.name in policy.get("forbidden_roles", []):
        violations.append(f"forbidden_role_ref:{role_ref.name}")

    return violations

def check_clusterrole(role: client.V1ClusterRole, policy: dict) -> list[str]:
    # policy is accepted for symmetry with check_binding; these checks are not configurable yet
    full_verbs = {"get", "list", "watch", "create", "update", "patch", "delete"}
    violations = []
    for rule in (role.rules or []):
        verbs = set(rule.verbs or [])
        # Flag wildcard verbs once per role, even if several rules use them
        if "*" in verbs and "wildcard_verb" not in violations:
            violations.append("wildcard_verb")
        # Flag the full explicit verb set, which is close to a wildcard in practice
        if full_verbs <= verbs and "full_verb_set_equivalent_to_wildcard" not in violations:
            violations.append("full_verb_set_equivalent_to_wildcard")
    return violations

# --- Write RBACViolation CRD ---
def write_violation_crd(custom_api: client.CustomObjectsApi, name: str,
                         namespace: str, kind: str, rules: list[str]):
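    body = {
        "apiVersion": "audit.kuryzhev.cloud/v1alpha1",
        "kind": "RBACViolation",
        "metadata": {
            # Lowercase, replace ":" and truncate to 63 chars to keep the name short and DNS-friendly
            "name": f"{kind}-{name}".lower().replace(":", "-")[:63],
            # Must match the namespace passed to create_namespaced_custom_object below,
            # otherwise the API server rejects the request
            "namespace": "kube-system",
        },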
        "spec": {
            "sourceKind": kind,
            "sourceName": name,
            "sourceNamespace": namespace,
            "violatedRules": rules,
        }
    }
    try:
        custom_api.create_namespaced_custom_object(
            group="audit.kuryzhev.cloud", version="v1alpha1",
            namespace="kube-system", plural="rbacviolations", body=body
        )
        logger.info(f"Created RBACViolation for {kind}/{name}: {rules}")
    except ApiException as e:
        if e.status == 409:
            # Already exists from a previous cycle — not an error
            logger.debug(f"RBACViolation for {kind}/{name} already exists, skipping")
        else:
            raise

# --- Main watch loop ---
def run_audit_loop():
    config.load_incluster_config()  # Use load_kube_config() for local dev
    rbac_v1 = client.RbacAuthorizationV1Api()
    core_v1 = client.CoreV1Api()
    custom_api = client.CustomObjectsApi()

    start_http_server(8080)  # Expose /metrics for Prometheus scraping
    logger.info("RBAC Auditor started, metrics on :8080")

    while True:
        try:
            policy = load_policy(core_v1)
            # Every cycle performs a full re-list instead of resuming a watch, so no
            # resourceVersion is stored that could go stale and trigger 410 Gone

            # Audit ClusterRoleBindings — most dangerous scope, audit first
            crbs = rbac_v1.list_cluster_role_binding()
            for crb in crbs.items:
                BINDINGS_SCANNED.inc()
                violations = check_binding(crb, policy)
                if violations:
                    for v in violations:
                        VIOLATIONS_TOTAL.labels(
                            namespace="cluster-scoped",
                            kind="ClusterRoleBinding",
                            name=crb.metadata.name,
                            rule=v
                        ).inc()
                    write_violation_crd(custom_api, crb.metadata.name,
                                        None, "ClusterRoleBinding", violations)

            # Audit ClusterRoles for wildcard verbs
            crs = rbac_v1.list_cluster_role()
            for cr in crs.items:
                violations = check_clusterrole(cr, policy)
                if violations:
                    for v in violations:
                        VIOLATIONS_TOTAL.labels(
                            namespace="cluster-scoped",
                            kind="ClusterRole",
                            name=cr.metadata.name,
                            rule=v
                        ).inc()
                    write_violation_crd(custom_api, cr.metadata.name,
                                        None, "ClusterRole", violations)

            LAST_AUDIT_TS.set(time.time())
            logger.info("Audit cycle complete. Sleeping 300s.")
            # Fixed 300s interval between full re-lists, to avoid hammering the API server
            # Running full re-list every 30s on 500+ bindings adds ~120ms latency per cycle
            time.sleep(300)

        except ApiException as e:
            logger.error(f"API error during audit: {e.status} {e.reason}")
            time.sleep(30)

if __name__ == "__main__":
    run_audit_loop()
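
The 63-character truncation in write_violation_crd can map two long source names that share a prefix onto the same object name, in which case the second violation would hit the 409 branch and be skipped. One possible way around that, sketched here with a hypothetical helper called violation_object_name that is not part of the controller above, is to append a short hash of the full name:

```python
import hashlib
import re


def violation_object_name(kind: str, name: str, max_len: int = 63) -> str:
    """Build a lowercase, DNS-label-safe object name that stays distinct after truncation."""
    raw = f"{kind}-{name}".lower()
    # Replace anything outside [a-z0-9-] (for example ":" or ".") with "-".
    safe = re.sub(r"[^a-z0-9-]+", "-", raw).strip("-")
    # A short hash of the untruncated input keeps long names with a shared prefix apart.
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:8]
    prefix = safe[: max_len - len(digest) - 1].rstrip("-")
    if not prefix:
        return digest
    return f"{prefix}-{digest}"


print(len(violation_object_name("ClusterRoleBinding", "system:controller:" + "x" * 100)))
# 63
```

The trade-off is that names become slightly less readable, which is why sourceKind and sourceName in the spec remain the fields to query on.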

Watch out for the 410 Gone response from the API server. It is returned when a watch tries to resume from a resourceVersion that etcd has already compacted, and it silently kills your watch loop if you do not handle it. The fix is to throw away the stale resourceVersion and do a full re-list rather than trying to resume the watch; the listing above takes the blunt route and re-lists everything on every audit cycle. We learned this the hard way when the controller appeared healthy but had stopped receiving events for six hours.
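
If you do move from periodic re-lists to a real watch, the 410 handling could look roughly like the sketch below. It is deliberately library-agnostic: list_fn and watch_fn are hypothetical callables you would wrap around the client's list call and watch.Watch().stream(...), and the only assumption about the error is that it carries a status attribute, as kubernetes.client.exceptions.ApiException does.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple


def resilient_events(
    list_fn: Callable[[], Tuple[List[Any], str]],
    watch_fn: Callable[[str], Iterable[Any]],
    max_relists: int = 3,
) -> Iterator[Tuple[str, Any]]:
    """Yield ("SYNC", obj) for listed objects and ("EVENT", event) for watch events.

    list_fn returns (items, resource_version). watch_fn(resource_version) yields
    events and may raise an exception whose .status is 410 when that version has
    been compacted away; in that case we re-list and start a fresh watch.
    """
    relists = 0
    while relists <= max_relists:
        items, resource_version = list_fn()
        for obj in items:
            yield "SYNC", obj
        try:
            for event in watch_fn(resource_version):
                yield "EVENT", event
            return  # The watch ended normally (for example on a timeout).
        except Exception as exc:
            if getattr(exc, "status", None) != 410:
                raise  # Anything other than 410 Gone is not ours to swallow.
            relists += 1
    raise RuntimeError("gave up after repeated 410 Gone responses")
```

Bounding the number of re-lists is a design choice: it turns a permanently failing watch into a crash that the pod's restart policy and the staleness alert can pick up, instead of a quiet loop.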

The other gotcha: do not grant the auditor service account cluster-admin just because it needs to read RBAC objects. I see this in tutorials constantly and it makes me cringe every time. The auditor only needs get, list, watch on RBAC resources, read access to its policy ConfigMap, and write access to its own RBACViolation objects. Here are the policy ConfigMap, ServiceAccount, and RBAC manifests with correctly scoped permissions:

# rbac-auditor-rbac.yaml
# Apply with: kubectl apply -f deploy/rbac-auditor-rbac.yaml

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rbac-audit-policy
  namespace: kube-system
data:
  policy.yaml: |
    forbidden_subjects:
      - "system:anonymous"
      - "system:unauthenticated"
    forbidden_roles:
      - "cluster-admin"
    # Namespaces where ClusterRoleBinding is never expected
    restricted_namespaces:
      - "production"
      - "payments"

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rbac-auditor
  namespace: kube-system
automountServiceAccountToken: true  # Required for in-cluster config

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: rbac-auditor-reader
rules:
  # Minimal read-only access — do NOT grant cluster-admin here (common mistake #1)
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["rbac-audit-policy"]  # Only the policy ConfigMap, not every ConfigMap
    verbs: ["get"]
  # Write access limited to our own CRD group (the controller only writes in kube-system)
  - apiGroups: ["audit.kuryzhev.cloud"]
    resources: ["rbacviolations"]
    verbs: ["get", "list", "create", "update", "patch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: rbac-auditor-binding
  annotations:
    # Flag for 90-day rotation review — the controller itself will enforce this
    audit.kuryzhev.cloud/reviewed-at: "2024-11-01"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: rbac-auditor-reader
subjects:
  - kind: ServiceAccount
    name: rbac-auditor
    namespace: kube-system
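
The restricted_namespaces list in the policy is not consumed by the controller listing earlier in this article. One possible reading of it, and this is an assumption rather than the documented behaviour, is that service accounts from those namespaces should never appear as subjects of a ClusterRoleBinding. Under that reading, a check over plain subject dicts might look like this (restricted_subjects is a hypothetical helper):

```python
from typing import Dict, List


def restricted_subjects(subjects: List[Dict[str, str]], restricted: List[str]) -> List[str]:
    """Return "namespace/name" for ServiceAccount subjects living in a restricted namespace."""
    hits = []
    for subject in subjects:
        if subject.get("kind") == "ServiceAccount" and subject.get("namespace") in restricted:
            hits.append(f"{subject['namespace']}/{subject['name']}")
    return hits


subjects = [
    {"kind": "ServiceAccount", "name": "ci", "namespace": "payments"},
    {"kind": "Group", "name": "system:masters"},
    {"kind": "ServiceAccount", "name": "rbac-auditor", "namespace": "kube-system"},
]
print(restricted_subjects(subjects, ["production", "payments"]))
# ['payments/ci']
```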

If the auditor pod starts and you immediately see Forbidden: User "system:serviceaccount:kube-system:rbac-auditor" cannot list resource "clusterrolebindings" in the logs, the ClusterRoleBinding was not applied correctly. Check namespace alignment first: the namespace listed under subjects in the ClusterRoleBinding must be the namespace the ServiceAccount actually lives in. That accounts for 80% of these failures.

Wiring It Into Your Pipeline — From Detection to Remediation

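Detection without action is just expensive logging. The controller exposes three Prometheus metrics: rbac_violations_total, a counter labeled by namespace, kind, name, and rule; rbac_last_audit_timestamp, a gauge so you can alert if the auditor itself stops running; and rbac_bindings_scanned_total to track coverage. The counter is incremented on every audit cycle in which a violation is still present, so the alerting rule we use fires when rbac_violations_total has kept increasing for more than 15 minutes (for example increase(rbac_violations_total[10m]) > 0 held for 15 minutes), giving a violation a few cycles to be resolved before paging someone. Alerting on the raw value exceeding 0 would never clear, because a counter only goes up.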

In our Flux setup, annotated bindings trigger a webhook that opens a GitHub issue with the offending manifest diff and the specific policy rule that was violated. The developer who owns the namespace gets assigned automatically based on a CODEOWNERS-style mapping. This is important: the goal is not to block deployments automatically everywhere. The goal is to make the violation visible and attributed within minutes, not days.

Auto-remediation is opt-in and deliberately scoped. The controller will only delete a RoleBinding if it carries the label audit.kuryzhev.cloud/auto-remediate: "true". We never touch unlabeled production bindings automatically. I made the mistake of enabling aggressive auto-remediation in a previous job and watched it delete a legitimate emergency access binding during an incident. Never again. Humans should approve remediation for anything in production scope.
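
The opt-in rule boils down to a small guard that runs before any delete call. The sketch below is one way to express it; may_auto_remediate is a hypothetical helper, and treating "production" and "payments" as always-protected namespaces is an assumption borrowed from the policy ConfigMap rather than something the controller listing implements.

```python
from typing import Dict, Iterable, Optional

AUTO_REMEDIATE_LABEL = "audit.kuryzhev.cloud/auto-remediate"


def may_auto_remediate(
    kind: str,
    labels: Optional[Dict[str, str]],
    namespace: str,
    protected_namespaces: Iterable[str] = ("production", "payments"),
) -> bool:
    """Allow automatic deletion only for explicitly labelled, non-protected RoleBindings."""
    if kind != "RoleBinding":
        return False  # Cluster-scoped objects always go through a human.
    if namespace in protected_namespaces:
        return False  # Production scope needs human approval regardless of labels.
    return (labels or {}).get(AUTO_REMEDIATE_LABEL) == "true"


print(may_auto_remediate("RoleBinding", {AUTO_REMEDIATE_LABEL: "true"}, "dev-team-a"))
# True
print(may_auto_remediate("RoleBinding", {AUTO_REMEDIATE_LABEL: "true"}, "payments"))
# False
```

Defaulting every branch to False keeps the failure mode boring: a missing label or an unexpected kind results in a ticket, not a deletion.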

For CI pipeline integration, we run rbac-police at version 0.3.2 as a gate against rendered Helm manifests before any cluster deployment. The command is straightforward:

# Run in CI before helm upgrade — fails pipeline on any HIGH severity finding
rbac-police eval --severity HIGH --fail-on-findings ./manifests/

This catches the obvious mistakes before they ever reach the cluster. The in-cluster controller then handles drift that happens through out-of-band changes, kubectl one-liners during incidents, and Helm upgrades that quietly expand scopes. You need both layers.

Prevention Checklist — Stop the Drift Before It Starts

After running this system in production for several months, here is the checklist we now enforce on every cluster. Each item has a reason. Skip none of them.

1. Deploy OPA Gatekeeper with wildcard verb constraints at admission time. The K8sNoWildcardVerbs and K8sRestrictClusterAdmin constraint templates from the open-policy-agent/gatekeeper-library block the binding at kubectl apply time, before it ever reaches etcd. This is your first line of defense. The in-cluster controller is your second. You want both.

2. Run rbac-police eval in every CI pipeline that renders Helm charts. Use --severity HIGH --fail-on-findings. Do not let it run in warn-only mode — warn-only modes get ignored within two weeks of deployment. If it finds something, the pipeline fails. Period.

3. Add a 90-day review annotation to every ClusterRoleBinding at creation time. The annotation key is audit.kuryzhev.cloud/reviewed-at. The controller flags any binding where this annotation is missing or older than 90 days. This forces a human to periodically confirm that a binding is still needed. Most "temporary" bindings survive indefinitely because no one ever looks at them again.

4. Never watch only RoleBindings and ignore ClusterRoleBindings. This is the single most common mistake I see in home-grown audit scripts. The most dangerous over-permissioned bindings are almost always cluster-scoped. If your audit does not cover ClusterRoleBindings, it covers maybe 40% of the actual risk surface.

5. Protect the RBACViolation CRD itself with RBAC. If developers can delete RBACViolation objects, they can silently clear audit findings without fixing the underlying binding. The CRD should be read-only for all non-auditor service accounts. We found one developer had done exactly this in a dev cluster to make the noise go away before a demo. The binding stayed. The violation disappeared. That is worse than having no audit at all.

6. Set automountServiceAccountToken: false on every pod that does not need to call the API server; the auditor and other controllers are the exceptions. Most workloads do not need API server access. Defaulting to no token mount reduces the blast radius of any container escape significantly.

7. Alert on rbac_last_audit_timestamp staleness, not just on violations. If the auditor itself goes down — OOM kill, eviction, image pull failure — you want to know immediately. A silent auditor is worse than no auditor because it creates false confidence. We alert if the timestamp is more than 10 minutes old.
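
The 90-day review rule from item 3 is simple enough to sketch. The function below takes a binding's annotations as a plain dict and reports whether the review is missing or overdue. review_is_stale is a hypothetical helper, and treating an unparseable date as stale is an assumption on my part rather than confirmed controller behaviour.

```python
from datetime import date, timedelta
from typing import Dict, Optional

REVIEW_ANNOTATION = "audit.kuryzhev.cloud/reviewed-at"


def review_is_stale(
    annotations: Optional[Dict[str, str]],
    today: date,
    max_age_days: int = 90,
) -> bool:
    """True when the review annotation is missing, unparseable, or older than max_age_days."""
    value = (annotations or {}).get(REVIEW_ANNOTATION)
    if value is None:
        return True
    try:
        reviewed = date.fromisoformat(value)  # Expects YYYY-MM-DD, as in the manifest.
    except ValueError:
        return True
    return today - reviewed > timedelta(days=max_age_days)


binding_annotations = {REVIEW_ANNOTATION: "2024-11-01"}
print(review_is_stale(binding_annotations, today=date(2024, 12, 1)))
# False
print(review_is_stale(binding_annotations, today=date(2025, 3, 1)))
# True
```

Passing today in explicitly, instead of calling date.today() inside the function, keeps the check easy to test.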

Kubernetes RBAC compliance automation is not a one-time project. It is a continuous process. The controller we built did not eliminate human judgment. It made human judgment possible again by cutting the noise down to something manageable. We went from "we audit RBAC quarterly if we remember" to "we know about every violation within five minutes." That 11-day window closed to under one audit cycle. That is the outcome that matters.