Olivia Craft

Posted on May 3

CLAUDE.md for Kubernetes: 13 Rules That Make AI Write Production-Safe K8s

#kubernetes #devops #claudeai #aitools

You ask Claude to "write a Deployment for this service" and 30 seconds later you have a manifest with image: myapp:latest, no resource limits, no probes, a cluster-admin ClusterRoleBinding, and DB_PASSWORD: "supersecret" sitting in plain YAML. The AI didn't fail. It pattern-matched on the millions of half-baked tutorial manifests on the public internet — because nobody told it which patterns are unacceptable in production.

A CLAUDE.md at the root of your infra repo fixes that. Claude Code reads it on every task. Cursor, Aider, Codex, and other tools that respect context files can be pointed at the same content. Below are 13 rules I drop into every Kubernetes repo. Each one closes a class of incident waiting to happen, and each one is short enough that Claude won't ignore it.

Rule 1 — No `cluster-admin` for Workloads

Why: Binding a workload's ServiceAccount to cluster-admin is the textbook lateral-movement path. One compromised pod and an attacker owns every namespace, every secret, every node. Workloads get the least privilege required, scoped to a Role inside their own namespace.

Bad:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: app-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: my-app
    namespace: default

Good:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: payments
  name: payments-reader
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["payments-tls"]
    verbs: ["get"]
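
A Role grants nothing until something is bound to it. Here is a minimal sketch of the matching RoleBinding, assuming the workload runs under a ServiceAccount called payments-api in the payments namespace (the binding name and the ServiceAccount name are illustrative, not taken from a real cluster):

```yaml
# Sketch: binds the payments-reader Role to a single ServiceAccount
# in the same namespace. "payments-api" and "payments-reader-binding"
# are assumed names for illustration.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-reader-binding
  namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: payments-reader
subjects:
  - kind: ServiceAccount
    name: payments-api
    namespace: payments
```

Because both objects are namespaced, the grant cannot leak outside payments even if the ServiceAccount token is stolen.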

Rule for CLAUDE.md:

## RBAC
- Workloads never bind to `cluster-admin` or any wildcard ClusterRole.
- Default to `Role` + `RoleBinding` scoped to the workload's namespace.
- Use `resourceNames:` to narrow access to specific objects whenever possible.
- ClusterRoles are only for cluster-scoped controllers, with explicit justification in a comment.

Rule 2 — Every Container Sets `requests` AND `limits`

Why: Without requests, the scheduler treats the pod as zero-cost and packs it onto saturated nodes. Without limits, a runaway container can starve its neighbors of CPU and consume node memory until the kernel starts OOM-killing other pods. The two together are the cheapest reliability lever in Kubernetes.

Bad:

containers:
  - name: api
    image: myorg/api:1.4.2

Good:

containers:
  - name: api
    image: myorg/api:1.4.2
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"

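For memory-sensitive JVM or Node services, set memory requests == limits. The scheduler then reserves everything the container is allowed to use, so the pod is not an early eviction target when the node runs short of memory. A container that exceeds its own limit is still OOMKilled, so size the limit from observed peak usage.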

Rule for CLAUDE.md:

## Resource Requests & Limits
- Every container declares both `resources.requests` and `resources.limits` for cpu and memory.
- Memory: `requests == limits` so the scheduler reserves what the container may use (Guaranteed QoS also requires CPU `requests == limits`).
- CPU: requests below limits is fine; never omit requests.
- Numbers come from observed usage, not guesses — flag any container without prior load data.

Rule 3 — Pin Image Tags, Never `:latest`

Why: :latest is non-reproducible. Two pods of the same Deployment can run different binaries, rollbacks become impossible, and CI builds drift from production silently. Pin to semver or, better, an immutable digest.

Bad:

image: myorg/api:latest

Good:

image: myorg/api:1.4.2
# even better:
image: myorg/api@sha256:8f1c4b2e7a93...

Rule for CLAUDE.md:

## Image Tags
- Never `:latest` in any manifest, Helm value, or kustomize overlay.
- Production manifests pin to MAJOR.MINOR.PATCH and prefer `@sha256:` digests.
- Tag must match an artifact pushed by CI — no manual `docker push` to prod tags.

Rule 4 — Run as Non-Root with a Hardened `securityContext`

Why: A container running as UID 0 with default capabilities is one CVE away from the host. Drop to a non-root user, drop all capabilities, disable privilege escalation, and make the root filesystem read-only. The cost is one block of YAML; the payoff is cutting most container-escape exploits at the knees.

Bad:

containers:
  - name: api
    image: myorg/api:1.4.2
    # implicitly root, all caps, writable rootfs

Good:

spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: api
          image: myorg/api:1.4.2
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]

If the app needs scratch space, mount an emptyDir at /tmp — don't loosen readOnlyRootFilesystem.
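
As a rough sketch of that scratch-space pattern, the excerpt below mounts an emptyDir at /tmp while the root filesystem stays read-only. The volume name tmp and the 64Mi size cap are assumptions chosen for illustration:

```yaml
# Sketch: writable /tmp on an otherwise read-only root filesystem.
# The volume name "tmp" and the 64Mi cap are illustrative assumptions.
spec:
  template:
    spec:
      containers:
        - name: api
          image: myorg/api:1.4.2
          securityContext:
            readOnlyRootFilesystem: true
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir:
            sizeLimit: 64Mi
```

The sizeLimit is optional, but it keeps a misbehaving process from filling the node's disk through /tmp.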

Rule for CLAUDE.md:

## securityContext
- Pod-level: `runAsNonRoot: true`, explicit `runAsUser`, `seccompProfile.type: RuntimeDefault`.
- Container-level: `allowPrivilegeEscalation: false`, `readOnlyRootFilesystem: true`, `capabilities.drop: [ALL]`.
- Add capabilities back only with a YAML comment explaining the specific syscall need.
- Writable paths use `emptyDir` volumes — never relax `readOnlyRootFilesystem`.

Rule 5 — Every Workload Has Both `readinessProbe` and `livenessProbe`

Why: Without a readiness probe, rolling updates send traffic to pods that haven't finished starting — users see 502s during every deploy. Without a liveness probe, a deadlocked process stays "Running" forever and never restarts. The two probes do different jobs; you need both.

Bad:

containers:
  - name: api
    image: myorg/api:1.4.2
    ports:
      - containerPort: 8080

Good:

containers:
  - name: api
    image: myorg/api:1.4.2
    ports:
      - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /healthz/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /healthz/live
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 20
      failureThreshold: 5
    startupProbe:
      httpGet:
        path: /healthz/live
        port: 8080
      failureThreshold: 30
      periodSeconds: 5

/healthz/ready should fail when downstream deps are out — DB unreachable, queue down. /healthz/live should fail only when the process itself is wedged. Wire startupProbe for slow boots so liveness doesn't kill cold starts.

Rule for CLAUDE.md:

## Probes
- Every long-running workload defines both `readinessProbe` and `livenessProbe`.
- Readiness reflects dependency health (DB, cache); liveness reflects process health only.
- Use `startupProbe` for any service with boot time over 10s instead of inflating liveness `initialDelaySeconds`.
- Never reuse the same handler for both probes when downstream deps are part of readiness.

Rule 6 — Secrets via `secretKeyRef`, Never Inline

Why: Plaintext values in manifests end up in git, in Helm release history, in kubectl get -o yaml dumps, and in screenshots in incident channels. Reference a Secret object instead — and ideally have External Secrets Operator or Vault inject it, so the cluster Secret itself isn't living in git either.

Bad:

env:
  - name: DB_PASSWORD
    value: "supersecret123"
  - name: STRIPE_KEY
    value: "sk_live_8a2b..."

Good:

env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: password
  - name: STRIPE_KEY
    valueFrom:
      secretKeyRef:
        name: payments-stripe
        key: api_key

For the Secret itself, prefer ExternalSecrets pulling from AWS Secrets Manager / Vault / GCP Secret Manager — the cluster Secret is then auto-rotated and never sits in git.
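
To make that concrete, here is a hedged sketch of an ExternalSecret that would produce the db-credentials Secret referenced above. The store name, the remote key path, and the refresh interval are all assumptions, and the apiVersion depends on which External Secrets Operator release is installed:

```yaml
# Sketch only: store name, remote key, and refresh interval are assumptions.
# Check the apiVersion against the operator version running in your cluster.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: payments
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager   # hypothetical store set up by the platform team
  target:
    name: db-credentials        # the cluster Secret that secretKeyRef points at
  data:
    - secretKey: password       # key inside the generated Secret
      remoteRef:
        key: prod/payments/db   # hypothetical path in the external store
        property: password
```

Only this pointer lives in git. The secret value stays in the external store and is re-synced on the refresh interval.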

Rule for CLAUDE.md:

## Secrets
- Never put plaintext credentials in env, ConfigMap, or Helm values.
- All sensitive env vars use `valueFrom.secretKeyRef`.
- Cluster Secrets are managed by ExternalSecrets / Vault / SOPS — never `kubectl create secret` in production.
- Mount file-based secrets to `/var/run/secrets/<name>` with `defaultMode: 0400`.
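
The file-based mount in the last bullet could look roughly like this. The sketch reuses the payments-tls Secret name from Rule 1; the volume name is an assumption:

```yaml
# Sketch: mounts the payments-tls Secret as read-only files.
# The volume name is an assumption; the Secret name comes from Rule 1.
spec:
  template:
    spec:
      containers:
        - name: api
          image: myorg/api:1.4.2
          volumeMounts:
            - name: payments-tls
              mountPath: /var/run/secrets/payments-tls
              readOnly: true
      volumes:
        - name: payments-tls
          secret:
            secretName: payments-tls
            defaultMode: 0400   # octal: owner read-only
```

If the container runs as a non-root user, the files may only be readable when the pod also sets an fsGroup, as in the Rule 4 example, so test the mount with the real UID before shipping.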

Rule 7 — Default-Deny Network Policies

Why: Out of the box, every pod can talk to every other pod across every namespace. That's fine for kubectl run but unacceptable in production: a compromised frontend can reach the database, Redis, internal admin panels, anything. Lock the network down to known flows with NetworkPolicy.

Bad: rely on the default open mesh.

Good — start with default-deny, then allow specific flows:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: payments-api
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: web
          podSelector:
            matchLabels:
              app.kubernetes.io/name: web-frontend
      ports:
        - protocol: TCP
          port: 8080

Egress policies are the half people skip — and the half attackers love. Block them too unless you've explicitly allow-listed the destination.

Rule for CLAUDE.md:

## NetworkPolicy
- Every production namespace ships with a `default-deny-all` ingress AND egress policy.
- Per-workload allow policies are explicit about source/destination by label, port, and protocol.
- Egress is allowed only to known dependencies (DB, internal services, external APIs by FQDN via Cilium/CNI features).
- DNS to `kube-dns` must be explicitly allowed in the egress allow-list.
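
The DNS bullet is the one that bites first after a default-deny egress policy lands, so here is a sketch of the allow rule. It assumes cluster DNS runs in kube-system with the label k8s-app: kube-dns, which is common for CoreDNS installs but worth verifying on your cluster:

```yaml
# Sketch: allow DNS lookups from every pod in the namespace.
# Assumes cluster DNS pods live in kube-system and carry k8s-app: kube-dns.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: payments
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

NetworkPolicies are additive, so this sits alongside default-deny-all and opens only port 53 toward the DNS pods.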

Rule 8 — One `ServiceAccount` Per Workload

Why: The default ServiceAccount is shared across every workload that doesn't specify one. Audit logs become useless ("who did this?" — "default, in five namespaces"), and any RoleBinding granted to it leaks across apps. Each workload gets its own SA, named after the workload.

Bad:

spec:
  template:
    spec:
      # uses default — shared, ambient, hard to audit
      containers: [...]

Good:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api
  namespace: payments
automountServiceAccountToken: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: payments
spec:
  template:
    spec:
      serviceAccountName: payments-api
      automountServiceAccountToken: false  # if app doesn't talk to the API server
      containers: [...]

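If the app does need an API token, set automountServiceAccountToken: true on the SA and drop the pod-level field. The pod-level value overrides the ServiceAccount's, so a leftover false on the pod would still block the token.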

Rule for CLAUDE.md:

## ServiceAccount
- Every Deployment / StatefulSet / DaemonSet sets `serviceAccountName:` explicitly.
- The `default` SA is never used by application workloads.
- `automountServiceAccountToken: false` unless the app calls the Kubernetes API.
- ServiceAccount name matches the workload name for auditability.

Rule 9 — Helm: Values for Config, Templates for Structure

Why: Hardcoding env-specific numbers in templates/ is how you end up forking charts per environment. Smuggling structure into values.yaml is how diffs become unreviewable. Templates describe shape; values describe the dial settings.

Bad: replicas: 5 literal in templates/deployment.yaml, plus a 400-line values.yaml containing whole sub-manifests.

Good:

# values.yaml
replicaCount: 3
image:
  repository: myorg/payments-api
  tag: 1.4.2
resources:
  requests: { cpu: 100m, memory: 128Mi }
  limits:   { cpu: 500m, memory: 512Mi }
probes:
  ready: { path: /healthz/ready, port: 8080 }
  live:  { path: /healthz/live,  port: 8080 }

# templates/deployment.yaml (excerpt)
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: api
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          resources: {{ toYaml .Values.resources | nindent 12 }}
          readinessProbe:
            httpGet:
              path: {{ .Values.probes.ready.path }}
              port: {{ .Values.probes.ready.port }}

Per-environment values live in values-prod.yaml, values-staging.yaml — never inside the templates.
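
A per-environment file only needs the dials that differ from the base. The sketch below is a hypothetical values-prod.yaml; the numbers are placeholders, not sizing recommendations:

```yaml
# values-prod.yaml (hypothetical overrides; numbers are placeholders)
replicaCount: 5
image:
  tag: 1.4.2
resources:
  requests: { cpu: 250m, memory: 512Mi }
  limits:   { cpu: "1", memory: 512Mi }
```

Pass it after the base file, for example helm upgrade --install payments-api ./chart -f values.yaml -f values-prod.yaml. Helm merges the maps and later files win, so image.repository and the probe settings still come from values.yaml.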

Rule for CLAUDE.md:

## Helm Discipline
- Templates contain shape; values contain configuration only.
- No environment-specific literals in `templates/` — extract to `values-<env>.yaml`.
- Use named templates (`_helpers.tpl`) for repeated label/annotation blocks.
- `helm lint` and `helm template | kubeconform` run in CI for every chart.

Rule 10 — Validate Manifests Before `apply`

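Why: Typos in apiVersion, missing required fields, and quoted booleans should never reach the cluster. Without checks in CI, the first validation happens when kubectl apply hits the production API server, and problems the schema tolerates surface at 3am when the rollout silently stalls. Catch them in CI with static analysis and server-side dry-run.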

Bad:

kubectl apply -f .

Good:

# server-side dry-run hits real admission controllers (PSA, OPA/Gatekeeper, Kyverno):
kubectl apply --dry-run=server -f manifests/

# schema validation against your cluster's API version:
kubeconform -strict -summary -kubernetes-version 1.29.0 manifests/

# opinionated linting (probes, limits, image tags):
kube-linter lint manifests/

# Helm:
helm lint ./chart
helm template ./chart -f values-prod.yaml | kubeconform -strict -

Run all of these checks in CI and block merges on failure.
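
One way to wire the checks into a pipeline is sketched below. It assumes GitHub Actions, a hypothetical workflow file name, and a runner that already has kubeconform, kube-linter, helm, and kubectl installed; the install steps and cluster credentials are left out on purpose:

```yaml
# .github/workflows/validate-manifests.yaml (hypothetical file name)
# Assumes GitHub Actions and that kubeconform, kube-linter, helm, and
# kubectl are already available on the runner.
name: validate-manifests
on:
  pull_request:
    paths:
      - "manifests/**"
      - "chart/**"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Schema validation
        run: kubeconform -strict -summary -kubernetes-version 1.29.0 manifests/
      - name: Lint manifests
        run: kube-linter lint manifests/
      - name: Lint and render chart
        run: |
          set -o pipefail   # fail the step if helm template fails mid-pipe
          helm lint ./chart
          helm template ./chart -f values-prod.yaml | kubeconform -strict -
      # Server-side dry-run needs credentials for a real cluster;
      # how those are provided is environment-specific and omitted here.
      - name: Server-side dry-run
        run: kubectl apply --dry-run=server -f manifests/
```

Mark the job as a required status check on the default branch so a red run actually blocks the merge.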

Rule for CLAUDE.md:

## Manifest Validation
- CI runs: `kubeconform -strict`, `kube-linter lint`, and `kubectl apply --dry-run=server` on every PR.
- Helm charts additionally run `helm lint` and `helm template | kubeconform`.
- New rules from `kube-linter` / Gatekeeper land as warnings first, then blocking — never silently disabled.
- Local `kubectl apply` against prod is a break-glass action, logged and reviewed.

Rule 11 — Declarative `apply` Only — No `edit`, No `create`

Why: Imperative commands drift from git. kubectl edit deployment in production is how 3am incidents start: somebody fixes the running cluster, forgets the YAML, and the next CI deploy reverts the fix. GitOps means the repo is the source of truth, period.

Bad:

kubectl edit deployment api
kubectl create deployment api --image=myorg/api:1.4.2
kubectl scale deployment api --replicas=10

Good:

# 1. change the YAML in git
# 2. open a PR, get it reviewed
# 3. merge, let ArgoCD/Flux reconcile — or:
kubectl apply -f deployments/api.yaml

Diff first, then apply:

kubectl diff -f deployments/api.yaml
kubectl apply -f deployments/api.yaml

Rule for CLAUDE.md:

## GitOps Discipline
- Production changes go through a PR — never `kubectl edit` / `create` / `scale` directly.
- The cluster mirrors a git repo via ArgoCD or Flux. Drift is a bug.
- `kubectl diff` before any manual `apply`; capture the diff in the change record.
- Break-glass commands are logged in an incident channel within 15 minutes.
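
For the "cluster mirrors a git repo" bullet, this is roughly what an Argo CD Application looks like. The repoURL and path are hypothetical placeholders, and the sketch assumes Argo CD is installed in the argocd namespace:

```yaml
# Sketch: Argo CD Application that keeps the payments namespace in sync with git.
# repoURL and path are hypothetical placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/infra.git
    targetRevision: main
    path: deployments/payments
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # delete objects that were removed from git
      selfHeal: true   # revert manual changes made in the cluster
```

With selfHeal on, a stray kubectl edit gets reverted by the controller, which turns "drift is a bug" from a policy into a mechanism.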

Rule 12 — Pin Rollout Timing and Disruption Budgets Explicitly

Why: maxUnavailable: 25% and terminationGracePeriodSeconds: 30 are the defaults, not decisions. State them in the manifest so reviewers can reason about availability, and pair the Deployment with a PodDisruptionBudget so node drains don't take the whole service down.

Bad: rely on implicit defaults; no PDB.

Good:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: api
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]  # kubelet sends SIGTERM after the hook returns
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: payments-api

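The preStop sleep 10 gives endpoint controllers and load balancers time to remove the pod from rotation before SIGTERM arrives, so new requests stop landing on a pod that is shutting down. The kubelet sends SIGTERM itself once the hook returns, and the hook's runtime counts against terminationGracePeriodSeconds. The app still has to finish in-flight requests when it receives SIGTERM.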

Rule for CLAUDE.md:

## Rollout & Disruption
- Production Deployments declare `strategy.rollingUpdate` explicitly: `maxSurge` and `maxUnavailable`.
- Stateless services use `maxUnavailable: 0` to keep capacity flat during rollouts.
- Every multi-replica workload has a `PodDisruptionBudget` (`minAvailable: N-1` or `maxUnavailable: 1`).
- `terminationGracePeriodSeconds` and a `preStop` sleep are required for HTTP services.

Rule 13 — Use the `app.kubernetes.io/*` Label Schema

Why: Without consistent labels, you can't query ("show me everything in the payments platform"), you can't bill back ("how much CPU did team-payments use this month?"), and your dashboards stay broken. The app.kubernetes.io/* set is the recommended convention in the Kubernetes documentation, and tools such as Lens, ArgoCD, Datadog, and Grafana can key off it.

Bad:

metadata:
  labels:
    app: api

Good:

metadata:
  labels:
    app.kubernetes.io/name: payments-api
    app.kubernetes.io/instance: payments-api-prod
    app.kubernetes.io/version: "1.4.2"
    app.kubernetes.io/component: backend
    app.kubernetes.io/part-of: payments-platform
    app.kubernetes.io/managed-by: helm
    # team / cost-center labels for billback:
    team: payments
    cost-center: cc-2204

Apply the same set to the Service, ServiceAccount, ConfigMap, and PDB so selectors stay consistent.
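
As an illustration, a Service carrying the same label set might look like the sketch below. The Service port 80 is an assumption; the target port 8080 follows the probe examples earlier in the post:

```yaml
# Sketch: Service with the shared label set; the selector uses name + instance only.
# Port 80 is an assumed Service port; 8080 follows the earlier container examples.
apiVersion: v1
kind: Service
metadata:
  name: payments-api
  namespace: payments
  labels:
    app.kubernetes.io/name: payments-api
    app.kubernetes.io/instance: payments-api-prod
    app.kubernetes.io/version: "1.4.2"
    app.kubernetes.io/component: backend
    app.kubernetes.io/part-of: payments-platform
    app.kubernetes.io/managed-by: helm
    team: payments
    cost-center: cc-2204
spec:
  selector:
    app.kubernetes.io/name: payments-api
    app.kubernetes.io/instance: payments-api-prod
  ports:
    - name: http
      port: 80
      targetPort: 8080
```

The version label stays out of the selector on purpose: during a rollout, old and new pods carry different versions and both need to receive traffic.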

Rule for CLAUDE.md:

## Labels
- Every object carries the full `app.kubernetes.io/*` set: name, instance, version, component, part-of, managed-by.
- `team` and `cost-center` labels are mandatory for billback and ownership routing.
- The Service `selector` matches `app.kubernetes.io/name` + `app.kubernetes.io/instance` — never a single `app:` label.
- Label values are lowercase kebab-case; never include secrets, PII, or environment URLs.

How to Use These Rules

1. Drop a CLAUDE.md at the root of your infra repo, next to your manifests/, helm/, or kustomize/ directory.
2. Paste the rules above. Keep what fits, edit what doesn't, add anything specific to your stack (CNI, service mesh, GitOps controller).
3. Restart Claude Code in the project so it picks up the new context file.

CLAUDE.md is a per-repo contract. Vague guidance ("write secure manifests") gets ignored. Concrete guidance ("runAsNonRoot: true, readOnlyRootFilesystem: true, capabilities.drop: [ALL] on every container") changes every output the model produces.

The same file works for Cursor, Aider, Codex, Copilot Workspace, and any AI that respects context files. Symlink it from .cursorrules and AGENTS.md if you want belt-and-braces coverage.

Want the Full Pack?

These 13 rules are one chapter of the CLAUDE.md Rules Pack — 35+ stacks (Go, Rust, Python, FastAPI, Next.js, React Native, Terraform, Docker, Kubernetes, Postgres, and more) of production-tested AI guardrails, packaged as drop-in CLAUDE.md files.

→ Get the pack on Gumroad — one-time payment, lifetime updates.

Free CLAUDE.md sample (Kubernetes edition): gist.github.com/oliviacraft

— Olivia · @OliviaCraftLat
