Olivia Craft

Cursor Rules for Kubernetes: AI-Assisted K8s Guide 2026

Kubernetes is the platform where a single missing line in a YAML file can take down every pod in a namespace, silently OOM-kill a service for three weeks before anyone notices the spike in 5xxs, or expose the production Postgres password in a ConfigMap that ships to every region. The manifest renders. kubectl apply returns deployment.apps/api configured. The pods go Running. Six hours later the node runs out of memory because nothing set a limit, the readiness probe passes immediately because someone copied a httpGet: / from a tutorial and / returns 200 even when the downstream database is on fire, and the rollout that was supposed to be zero-downtime dumps every in-flight request because there's no PodDisruptionBudget and the cluster autoscaler drained the node during peak traffic.

Then you add an AI assistant.

Cursor and Claude Code were trained on a decade of Kubernetes YAML, most of it blog-post snippets that predate best practices — image: nginx:latest with no digest, Deployment manifests with no resources: block, Service objects that expose type: LoadBalancer to the open internet, Secret data inlined in the repo as base64 (which is not encryption), kubectl apply -f . from a developer laptop instead of a GitOps pipeline, securityContext missing entirely so containers run as root with every Linux capability, livenessProbe and readinessProbe configured identically so a flap in one restarts the pod instead of pulling it out of the service, and namespaces with no NetworkPolicy so every pod can talk to every other pod in the cluster. Ask for "a Deployment for my Node API," and you get a manifest that works on minikube and detonates in production the first time a node reboots.

The fix is .cursorrules — one file in the repo that tells the AI what idiomatic, production-grade Kubernetes looks like. Eight rules below, each with the failure mode, the rule that prevents it, and a before/after. The complete copy-paste .cursorrules file is at the end. Examples target modern Kubernetes (1.28+) with standard tooling (kubectl, Helm, Kustomize, Argo CD), but the rules apply equally to EKS, GKE, AKS, and self-managed clusters.


How Cursor Rules Work for Kubernetes Projects

Cursor reads project rules from two locations: .cursorrules (a single file at the repo root, still supported) and .cursor/rules/*.mdc (modular files with frontmatter, recommended for any non-trivial cluster repo). For Kubernetes I recommend modular rules so that platform-level conventions (RBAC, NetworkPolicy) don't bleed into application-level manifests, and so Helm chart authoring rules don't fire when you're editing raw YAML:

.cursor/rules/
  k8s-manifests.mdc      # Deployment, Service, Ingress conventions
  k8s-resources.mdc      # requests, limits, QoS, HPA, VPA
  k8s-probes.mdc         # liveness, readiness, startup probe discipline
  k8s-security.mdc       # securityContext, PSA, NetworkPolicy, RBAC
  k8s-secrets.mdc        # external secret managers, sealed secrets
  k8s-rollout.mdc        # strategy, PDB, surge, unavailable, maxReplicas
  k8s-observability.mdc  # labels, logs, metrics, namespaces
  k8s-gitops.mdc         # no kubectl apply in prod, Argo/Flux patterns

Frontmatter controls activation: globs: ["**/*.yaml", "**/*.yml", "**/Chart.yaml", "**/kustomization.yaml"] with alwaysApply: false.
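
For instance, k8s-manifests.mdc might open like this (a sketch; tune the description and globs to your repo layout):

---
description: Deployment, Service, and Ingress conventions for this repo
globs: ["**/*.yaml", "**/*.yml"]
alwaysApply: false
---

Every Deployment pins images by digest, sets resources on every
container, and carries the six recommended labels.

Now the rules.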


Rule 1: Declarative GitOps — Never kubectl apply Against Production

The single most common AI failure in Kubernetes is "here's the kubectl command to deploy it." Cursor cheerfully emits kubectl apply -f deployment.yaml or kubectl edit deploy api and calls that a deployment strategy. It is not. It is an imperative change that leaves no audit trail, drifts from the repo within hours, and means the cluster state is whatever the last sleepy human typed at 3am. Production Kubernetes runs one way: the git repo is the source of truth, a controller (Argo CD, Flux) reconciles the cluster to the repo, and humans never touch the cluster directly except to observe it.

The rule:

Every resource in a non-dev cluster has a manifest in git, reconciled
by a GitOps controller (Argo CD, Flux). Drift is auto-healed or alerted.

Forbidden in production: kubectl apply / create / edit / patch / scale /
delete, and helm install/upgrade from a developer laptop.
Allowed: kubectl get / describe / logs / top / rollout status /
port-forward for observation; exec only as audited break-glass
access (it is not read-only).

Every manifest includes the six recommended labels
(app.kubernetes.io/{name,instance,version,component,part-of,managed-by})
and omits server-populated fields (status, clusterIP, nodeName).

CI runs kubeconform, kube-linter, and a dry-run diff on every PR.
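
That last line as a PR gate, sketched as a GitHub Actions job (the workflow layout, paths, and install steps are assumptions; adapt to your CI):

# .github/workflows/validate-manifests.yml (hypothetical)
name: validate-manifests
on: pull_request
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install validators  # Go toolchain is preinstalled on ubuntu-latest
        run: |
          go install github.com/yannh/kubeconform/cmd/kubeconform@latest
          go install golang.stackrox.io/kube-linter/cmd/kube-linter@latest
          echo "$HOME/go/bin" >> "$GITHUB_PATH"
      - name: Schema validation  # -strict rejects unknown fields
        run: kubeconform -strict -summary apps/
      - name: Best-practice lint
        run: kube-linter lint apps/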

Bad — imperative commands, no audit trail, drift from repo:

kubectl create deployment api --image=myregistry/api:latest
kubectl scale deployment api --replicas=5
# Two weeks later
kubectl edit deployment api              # bumps memory — never recorded
kubectl set image deployment/api api=myregistry/api:v1.4.3

Six months later, nobody knows what's in the cluster. The repo's deployment.yaml still says replicas: 3 and image: myregistry/api:v1.2.0.

Good — declarative manifest committed to the repo, synced by Argo CD:

# argocd/applications/api.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-production
  namespace: argocd
spec:
  project: payments
  source:
    repoURL: https://github.com/acme/infra
    targetRevision: main
    path: apps/api/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated: { prune: true, selfHeal: true }
    syncOptions: [ServerSideApply=true]

The image bump from 1.4.3 to 1.4.4 is a PR. The replica increase from 5 to 8 is a PR. The rollback is git revert. The cluster matches the repo at every moment, and the history of every change is in git log.


Rule 2: Resource Requests and Limits — On Every Container, No Exceptions

A Deployment with no resources: block is a loaded gun pointed at your cluster. Without requests, the scheduler treats the pod as needing zero resources and packs it onto any node with nominal room — so three pods pile onto one node, all thrash for CPU, and latency triples. Without limits, a single misbehaving pod allocates until the kernel OOM-killer fires, and the killer rarely picks the process you'd prefer. AI models skip resources entirely because half of their training data is "hello world" manifests and the other half is copy-pasted from the same tutorial where somebody set memory: 2Gi and never revisited it.

The rule:

Every container (including init and sidecar) in every Deployment,
StatefulSet, DaemonSet, Job, or CronJob MUST set:
  resources.requests.cpu
  resources.requests.memory
  resources.limits.memory

resources.limits.cpu is optional on latency-sensitive services
(CPU throttling hurts p99 tail latency). Set it for batch/offline
workloads that shouldn't disturb neighbors.

Choosing values:
  - Start from p99 under load + 20% headroom.
  - Memory request == memory limit: usage can never exceed what was
    reserved, so the pod is not first in line for eviction under node
    memory pressure. (Full Guaranteed QoS also needs cpu request == limit.)
  - CPU request matches steady-state, not peak.

Forbidden: omitting resources, burstable memory (request < limit on
latency-sensitive services), or "looked fine on my laptop" guesses.

Every namespace has a LimitRange (defaults) and a ResourceQuota (cap).

Bad — no resources, scheduler and kernel decide the outcome:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: api
          image: myregistry/api:v1.4.3
          ports:
            - containerPort: 8080

BestEffort QoS. First to be evicted under any node pressure. Nothing stops a memory leak from eating the whole node.

Good — explicit requests and limits, memory request equal to its limit:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: payments
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: api
          image: myregistry/api@sha256:c2f7b8a9...
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              memory: "512Mi"  # == request for Guaranteed memory QoS
              # no cpu limit — CPU throttling hurts tail latency
          ports:
            - containerPort: 8080
# namespaces/payments/limitrange.yaml — enforces defaults cluster-wide
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: payments
spec:
  limits:
    - type: Container
      default:
        memory: "256Mi"
      defaultRequest:
        cpu: "100m"
        memory: "256Mi"
      max:
        cpu: "4"
        memory: "8Gi"

Now a manifest without resources: inherits sensible defaults, and one that tries to request 32 cores is rejected at the API server.
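
The rule also asks for a ResourceQuota as the namespace-wide cap. A sketch (the numbers are illustrative; size them to the team's footprint):

# namespaces/payments/resourcequota.yaml (namespace-wide cap)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.memory: 64Gi
    pods: "100"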


Rule 3: Probes — Liveness, Readiness, and Startup, Each Tuned for Its Job

livenessProbe and readinessProbe do different things, and AI-generated manifests almost always configure them identically (or both to httpGet: /). This is actively harmful. The liveness probe decides "restart this container"; the readiness probe decides "pull this pod out of the Service." If they check the same endpoint with the same thresholds, a downstream database blip restarts a pod that would have recovered in five seconds, and a slow-booting pod gets traffic before it's ready. Startup probes (added in 1.16) solve the "my JVM takes 45 seconds to warm up" problem that otherwise forces you to set a useless initialDelaySeconds: 60 on liveness.

The rule:

Every traffic-serving container has THREE probes with THREE jobs:
  - readinessProbe /readyz — "should this pod receive traffic?"
      May check downstream deps (DB, cache) with short timeouts.
      Failure removes the pod from Service endpoints.
  - livenessProbe /livez — "should this container be restarted?"
      In-process check only. NEVER checks downstream deps —
      a DB blip must not restart every pod in a loop.
  - startupProbe — "has the app booted?" Liveness and readiness
      don't evaluate until startup succeeds. Prefer this over
      livenessProbe.initialDelaySeconds for slow boots.

Tuning: readiness fast (period 5-10s, threshold 2-3); liveness slow
(period 10-30s, threshold 3-5); startup's period * threshold covers
expected max boot time. timeoutSeconds 1-2 for cheap checks.

Never: same endpoint for liveness and readiness; exec probes that fork
a shell; omitting probes because "the process exits on failure"
(containers hang more often than they exit cleanly).

Bad — liveness and readiness identical, checks database, no startup probe:

containers:
  - name: api
    image: myregistry/api@sha256:...
    livenessProbe:
      httpGet:
        path: /
        port: 8080
      initialDelaySeconds: 60  # hacky — masks slow boot
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 10

/ probably returns 200 even when the app is broken. The 60-second delay hides slow boots. A database glitch restarts every pod in a loop.

Good — three probes, each doing one job:

containers:
  - name: api
    image: myregistry/api@sha256:...
    startupProbe:
      httpGet:
        path: /livez
        port: 8080
      periodSeconds: 5
      failureThreshold: 12  # up to 60s to start
    livenessProbe:
      httpGet:
        path: /livez
        port: 8080
      periodSeconds: 20
      timeoutSeconds: 2
      failureThreshold: 3   # ~60s to declare unhealthy
    readinessProbe:
      httpGet:
        path: /readyz        # checks DB + cache reachability
        port: 8080
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 2
// /livez — in-process only. Cheap. Doesn't depend on anything external.
http.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
})

// /readyz — actually verify we can serve a request.
http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
    defer cancel()
    if err := db.PingContext(ctx); err != nil {
        http.Error(w, "db unavailable", http.StatusServiceUnavailable)
        return
    }
    if err := cache.PingContext(ctx); err != nil {
        http.Error(w, "cache unavailable", http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
})

Now a DB outage removes the pod from the Service (readiness fails) but doesn't restart it in a loop (liveness still passes). Startup is bounded at ~60s without hacky delays.


Rule 4: Image Pinning — By Digest, Not Tag, Never :latest

image: myregistry/api:latest is the most common way AI-generated manifests ship a broken pod at 2am. The tag is mutable — whoever pushed last wins, and the pod that restarts at 2am runs a different binary than the pod that started at 4pm. imagePullPolicy: Always makes it worse by pulling on every restart. Even semantic tags (:v1.4.3) can be retagged at the registry. The only way to know what's running is to pin by digest (@sha256:...): digests are content-addressed, and the OCI spec guarantees they are immutable.

The rule:

Production manifests pin images BY DIGEST:
  image: myregistry/api@sha256:c2f7b8a9d...

Semantic tags are acceptable in dev overlays; production overlays
resolve to a digest. :latest and no-tag references are forbidden
outside developer sandboxes.

imagePullPolicy: IfNotPresent when digest-pinned (immutable content
means no reason to re-pull). `Always` with a mutable tag re-pulls
on every restart and may fetch a different binary.

Supply chain: all production images come from a private registry,
are scanned (Trivy, Grype) and signed (Cosign). Kyverno or Gatekeeper
rejects unsigned images at admission. Use renovate + a digest resolver
to bump tag and digest in lockstep on PRs.

Prefer workload identity (IRSA on EKS, Workload Identity on GKE) to
imagePullSecrets for private-registry auth.

Bad — mutable tag, pulls every restart, no supply-chain check:

spec:
  containers:
    - name: api
      image: myregistry/api:latest
      imagePullPolicy: Always

Good — digest-pinned, supply-chain gated:

spec:
  containers:
    - name: api
      # Source image: myregistry/api:v1.4.3
      # Digest resolved at release time; updated via renovate PR.
      image: myregistry/api@sha256:c2f7b8a9d4e1f5a2b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1
      imagePullPolicy: IfNotPresent

A Kyverno ClusterPolicy with verifyImages matching myregistry/* and your Cosign public key rejects any unsigned image at admission. Now you know exactly which bytes are running, and the cluster refuses to run anything that wasn't signed by your release pipeline.
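
That ClusterPolicy, sketched (the registry pattern and the public key are placeholders for your own):

# policies/verify-image-signatures.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: require-cosign-signature
      match:
        any:
          - resources:
              kinds: [Pod]
      verifyImages:
        - imageReferences:
            - "myregistry/*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...
                      -----END PUBLIC KEY-----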


Rule 5: Security Context — Non-Root, Read-Only Root Filesystem, Drop Capabilities

A Kubernetes pod with no securityContext runs as root with every Linux capability, writable root filesystem, and no seccomp profile. That's the default because Kubernetes values backwards compatibility; it is not the setting you want in production. AI models happily generate manifests without a security context because most tutorials skip it. The pod runs fine — until the first CVE in a base image turns a read vulnerability into a container escape.

The rule:

Every Pod has a restrictive pod-level securityContext; every container
tightens it further. Baseline (Pod Security Admission "restricted"):

Pod:       runAsNonRoot: true
           runAsUser, runAsGroup, fsGroup: non-zero
           seccompProfile.type: RuntimeDefault
Container: allowPrivilegeEscalation: false
           privileged: false
           readOnlyRootFilesystem: true
           capabilities.drop: [ALL]

Writable dirs (/tmp, /var/cache) use emptyDir volumes.

Every namespace has the PSA label:
  pod-security.kubernetes.io/enforce: restricted

Forbidden: hostNetwork/hostPID/hostIPC, hostPath volumes (rare CSI
exceptions), privileged: true anywhere, adding NET_ADMIN / SYS_ADMIN
without a code-comment justification.

Every namespace has a default-deny NetworkPolicy + explicit allow rules.

Bad — root container, writable fs, every capability:

spec:
  containers:
    - name: api
      image: myregistry/api@sha256:...
      # no securityContext — runs as root with ALL caps, writable root fs
      ports:
        - containerPort: 8080

Good — hardened, least-privilege:

spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    runAsGroup: 10001
    fsGroup: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: api
      image: myregistry/api@sha256:...
      securityContext:
        allowPrivilegeEscalation: false
        privileged: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /var/cache/app
      ports:
        - containerPort: 8080
  volumes:
    - name: tmp
      emptyDir: {}
    - name: cache
      emptyDir: {}
# namespaces/payments/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
# namespaces/payments/default-deny.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]

Now the pod cannot write to its root filesystem, cannot escalate, cannot add capabilities, and cannot talk to any pod outside the explicit allow rules.
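
Default-deny is half the story: each permitted flow then gets an explicit allow. A sketch admitting ingress-controller traffic to the API pods (the ingress-nginx namespace label is an assumption; match your controller's):

# namespaces/payments/allow-ingress-to-api.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-api
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: api
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080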


Rule 6: Secrets — External Managers and Ephemeral Mounts, Never Inline

Base64 is not encryption. kind: Secret with data: {password: cGFzc3dvcmQ=} committed to a git repo is a plaintext password visible to anyone with read access. AI-generated manifests do this constantly because the Secret API is named "Secret" and the tutorials say "use Secrets." The production pattern is: secrets live in a dedicated manager (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager), an operator (External Secrets Operator, Vault agent injector, Secrets Store CSI Driver) projects them into the pod at runtime, and the git repo references them by name — never by value.

The rule:

Secret MATERIAL (passwords, keys, certs) never appears in YAML,
Helm values, Kustomize secretGenerator literals, or inline env vars.

Secrets live in an external manager (AWS Secrets Manager, GCP Secret
Manager, Azure Key Vault, Vault). The cluster pulls them via:
  - External Secrets Operator (ExternalSecret → Secret)
  - Secrets Store CSI Driver (files at runtime)
  - Vault agent injector (sidecar writes to the pod fs)

Sealed Secrets or SOPS-encrypted secrets are acceptable for pure
GitOps when external managers aren't available.

Consumption: prefer mounted files over env vars (env vars leak via
process listings, crash dumps, /proc/<pid>/environ). If env vars
are unavoidable, use valueFrom.secretKeyRef — never a literal value.

RBAC scopes secrets:get/list/watch to specific service accounts.
No human has secrets:list in production; break-glass access is
audited and time-boxed.

Every secret has a rotation cadence. ExternalSecret.refreshInterval
is set (e.g., 1h). Pods pick up new values via sidecar refresh or
a rolling restart triggered by a secret-checksum annotation.

Bad — password in git as base64:

apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
  namespace: payments
type: Opaque
data:
  username: cG9zdGdyZXM=           # "postgres"
  password: c3VwZXJzZWNyZXQxMjM=   # "supersecret123"
---
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: api
          env:
            - name: DB_PASSWORD
              value: "supersecret123"  # and also inline. twice.

Good — External Secrets Operator pulls from AWS Secrets Manager:

# apps/api/externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-db
  namespace: payments
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets
    kind: ClusterSecretStore
  target:
    name: api-db
    creationPolicy: Owner
  data:
    - secretKey: username
      remoteRef: { key: payments/api/db, property: username }
    - secretKey: password
      remoteRef: { key: payments/api/db, property: password }
# Deployment consumption — reference, never literal
env:
  - name: DB_USER
    valueFrom: { secretKeyRef: { name: api-db, key: username } }
  - name: DB_PASSWORD
    valueFrom: { secretKeyRef: { name: api-db, key: password } }

The git repo has the reference, not the value. Rotating the password is a change in AWS Secrets Manager; the next refresh pulls it. Prefer a CSI mount at /var/run/secrets/db over env vars when the app can be taught to read files — env vars leak through /proc/<pid>/environ.
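
When the app reads files, the same ESO-managed Secret mounts as a volume. A minimal fragment:

# Deployment fragment: secret as files, not env vars
containers:
  - name: api
    volumeMounts:
      - name: db-creds
        mountPath: /var/run/secrets/db
        readOnly: true
volumes:
  - name: db-creds
    secret:
      secretName: api-db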


Rule 7: Rollouts — Strategy, PodDisruptionBudget, HPA, and Graceful Shutdown

A Kubernetes rollout that looks safe can still drop traffic: the default RollingUpdate with maxUnavailable: 25% combined with a cluster autoscaler event can remove every pod on a draining node simultaneously. The Deployment.spec.strategy is only half the story — you also need a PodDisruptionBudget so voluntary disruptions respect your availability target, an HPA so replica count scales with load, and a container that respects SIGTERM so in-flight requests drain instead of getting dropped.

The rule:

Every user-facing Deployment MUST have all four of:
  - spec.strategy.rollingUpdate with EXPLICIT maxSurge and
    maxUnavailable (critical services: maxUnavailable: 0).
  - A sibling PodDisruptionBudget sized to survive a node drain.
  - A HorizontalPodAutoscaler with explicit min/max and a metric target.
  - terminationGracePeriodSeconds set to the app's real drain time.

Graceful shutdown: on SIGTERM, fail readiness first, let the Service
remove the pod, finish in-flight work, then exit. A preStop `sleep 5-10`
covers kube-proxy iptables propagation.

Strategy: RollingUpdate for stateless services; Recreate only when
two versions cannot run simultaneously; Argo Rollouts / Flagger for
canary or blue-green on high-stakes paths.

Production replicas >= 2. Singletons must carry a comment justifying
why. Never terminationGracePeriodSeconds: 0, never skip the PDB
("we have many replicas" doesn't help when the scheduler happened to
pack those replicas onto the one node being drained).

Bad — default rolling update, no PDB, no HPA, no graceful shutdown:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  # no strategy specified — defaults to maxSurge/maxUnavailable: 25%
  template:
    spec:
      containers:
        - name: api
          image: myregistry/api@sha256:...

Good — explicit strategy, PDB, HPA, graceful shutdown:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: payments
spec:
  replicas: 3  # initial value; the HPA below owns the replica count (min 3)
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: myregistry/api@sha256:...
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
          # SIGTERM handler in the app drains connections over up to 50s
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: payments
spec:
  minAvailable: 2   # we need at least 2 pods at all times
  selector:
    matchLabels:
      app.kubernetes.io/name: api
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
  namespace: payments
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }

The app handles SIGTERM: flip readiness to failing, wait ~5s for kube-proxy to propagate, then srv.Shutdown(ctx) with a timeout matching the drain window. Node drains, deploys, and autoscaler downscales all keep serving.
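
A minimal sketch of that shutdown sequence in Go (assuming a plain net/http server; the 5s and 50s windows mirror the manifest above):

package main

import (
    "context"
    "net/http"
    "os"
    "os/signal"
    "sync/atomic"
    "syscall"
    "time"
)

func main() {
    var ready atomic.Bool
    ready.Store(true)

    mux := http.NewServeMux()
    mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
        if !ready.Load() {
            http.Error(w, "draining", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })

    srv := &http.Server{Addr: ":8080", Handler: mux}
    go srv.ListenAndServe()

    stop := make(chan os.Signal, 1)
    signal.Notify(stop, syscall.SIGTERM)
    <-stop // SIGTERM from the kubelet

    ready.Store(false)          // readiness fails; the Service drops the pod
    time.Sleep(5 * time.Second) // let endpoint removal propagate

    ctx, cancel := context.WithTimeout(context.Background(), 50*time.Second)
    defer cancel()
    _ = srv.Shutdown(ctx) // drain in-flight requests inside the grace window
}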


Rule 8: Observability — Structured Labels, Logs, Metrics, and Namespace Hygiene

A Kubernetes workload you can't observe is a workload you can't debug at 3am. AI-generated manifests routinely omit the standard labels, ship logs in a random format (or to files inside the container that nobody collects), and expose no metrics endpoint — which means kubectl logs is the only troubleshooting tool and incidents blow past SLO before anyone understands what's wrong.

The rule:

Every resource carries the six recommended labels:
  app.kubernetes.io/{name, instance, version, component, part-of, managed-by}

Every Pod:
  - Logs to stdout/stderr in STRUCTURED JSON with ts, level, msg,
    trace_id, span_id, request_id. Never log to files inside the container.
  - Exposes /metrics in Prometheus format on a dedicated port
    (9090/9102), scraped by a ServiceMonitor or PodMonitor.
  - Propagates OpenTelemetry trace_id/span_id and includes them
    in every log line.

Namespaces: one per (team, environment) — payments-production and
payments-staging are separate namespaces, never the same. Each has
a ResourceQuota, a LimitRange, a default-deny NetworkPolicy, and
a dedicated ServiceAccount per workload (never `default`).

Services: stable meaningful names (api, db, cache). Internal services
use ClusterIP; external traffic goes through Ingress or a Gateway.
Never type: LoadBalancer on an internal service.

Forbidden: shipping without labels, "line soup" unstructured logs,
skipping metrics ("the LB knows" — it doesn't know queue depth,
GC pauses, or anything inside your process), deploying to `default`.

Bad — no structured labels, no metrics, logs to a file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  labels:
    app: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: myregistry/api@sha256:...
          volumeMounts:
            - name: logs
              mountPath: /var/log/api
      volumes:
        - name: logs
          emptyDir: {}

app: api is not enough — queries like "all payments components in production" fail. /var/log/api is invisible to the cluster log pipeline.

Good — full labels, stdout JSON logs, Prometheus metrics:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: payments
  labels:
    app.kubernetes.io/name: api
    app.kubernetes.io/instance: api-production
    app.kubernetes.io/version: "1.4.3"
    app.kubernetes.io/component: backend
    app.kubernetes.io/part-of: payments
    app.kubernetes.io/managed-by: argocd
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: api
      app.kubernetes.io/instance: api-production
  template:
    metadata:
      labels:
        app.kubernetes.io/name: api
        app.kubernetes.io/instance: api-production
        app.kubernetes.io/version: "1.4.3"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: api
      containers:
        - name: api
          image: myregistry/api@sha256:...
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090
          env:
            - name: LOG_FORMAT
              value: json
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: api
  namespace: payments
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: api
  podMetricsEndpoints:
    - port: metrics
      interval: 30s
// Structured, stdout, one event per line.
log.Info("request",
    "method", r.Method,
    "path", r.URL.Path,
    "status", status,
    "duration_ms", dur.Milliseconds(),
    "trace_id", trace.SpanContextFromContext(r.Context()).TraceID().String(),
)

Now the log aggregator indexes by trace_id, Prometheus scrapes /metrics without extra configuration, and "show me all components of payments running version 1.4.3" is a one-line Kubernetes label selector.
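
That last query, concretely:

kubectl get pods -A \
  -l 'app.kubernetes.io/part-of=payments,app.kubernetes.io/version=1.4.3'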


The Complete .cursorrules File for Kubernetes

Drop this into your repo root as .cursorrules, or split into .cursor/rules/*.mdc files. It's the consolidated version of every rule above plus the tooling defaults that go with them.

# Kubernetes Cursor Rules

## GitOps
- Production cluster state is reconciled from a git repo by Argo CD or Flux.
- Never emit kubectl apply / kubectl edit / kubectl patch / kubectl scale /
  helm install / helm upgrade instructions for production.
- Every resource has a committed manifest. Drift is auto-healed or alerted.
- CI runs kubeconform, kube-linter, and a dry-run diff on every PR.

## Resources
- Every container sets resources.requests.cpu, resources.requests.memory,
  and resources.limits.memory.
- Memory request == memory limit for latency-sensitive services
  (usage never exceeds the reservation, so no memory-pressure eviction).
- Do NOT set CPU limits on latency-sensitive services (throttling hurts p99).
- Every namespace has a LimitRange (defaults) and a ResourceQuota (cap).
- Sidecars and init containers also declare resources.

## Probes
- livenessProbe checks "should this container be restarted" — in-process only.
- readinessProbe checks "should this pod receive traffic" — may verify
  downstream dependencies with short timeouts.
- startupProbe bounds slow boots; never use initialDelaySeconds as a workaround.
- Different endpoints for liveness vs readiness. /livez and /readyz.
- Never use exec probes that spawn a shell in the hot loop.

## Images
- Production manifests pin images by digest (image: repo@sha256:...).
- Tag-only or :latest images are rejected in any non-dev overlay.
- imagePullPolicy: IfNotPresent when digest-pinned.
- All production images come from a private registry, scanned, and signed.
- Image signatures enforced by Kyverno / Gatekeeper / Connaisseur at admission.

## Security
- Every pod sets runAsNonRoot, runAsUser, runAsGroup, fsGroup, seccompProfile.
- Every container sets allowPrivilegeEscalation: false, privileged: false,
  readOnlyRootFilesystem: true, capabilities.drop: [ALL].
- Writable dirs use emptyDir volumes mounted at known paths (/tmp, /var/cache).
- hostNetwork, hostPID, hostIPC, hostPath volumes are forbidden with rare
  documented exceptions.
- Every namespace has the Pod Security Admission label at `restricted`.
- Every namespace has a default-deny NetworkPolicy + explicit allow rules.

## Secrets
- Secret MATERIAL never lives in the git repo in plaintext OR base64.
- Secrets come from an external manager (AWS/GCP/Azure/Vault) via the
  External Secrets Operator or Secrets Store CSI Driver.
- Sealed Secrets or SOPS-encrypted are acceptable for GitOps when external
  managers are unavailable.
- Prefer mounted files over env vars for secret consumption.
- RBAC scopes `secrets:get/list/watch` to specific service accounts only.

## Rollouts
- Every Deployment sets strategy.rollingUpdate.maxSurge and maxUnavailable
  explicitly. Critical services use maxUnavailable: 0.
- Every user-facing workload has a PodDisruptionBudget.
- Every user-facing workload has a HorizontalPodAutoscaler with min/max
  replicas and a metric target.
- terminationGracePeriodSeconds is set to the app's real drain time.
- Containers handle SIGTERM: fail readiness, drain, exit cleanly.
- preStop hook with a sleep covers kube-proxy propagation.
- Production replicas >= 2. Singletons require a comment justifying them.

## Observability
- Every resource has the six recommended labels (name, instance, version,
  component, part-of, managed-by).
- Logs are structured JSON on stdout/stderr. Never log to files.
- /metrics endpoint exposed on a dedicated port, scraped by a PodMonitor
  or ServiceMonitor.
- Trace propagation via OpenTelemetry; trace_id present in every log line.
- One namespace per (team, environment). Never mix prod and staging.
- Every workload has its own ServiceAccount; never use `default`.
- External traffic via Ingress / Gateway, not type: LoadBalancer on
  internal Services.

## Tooling
- kubeconform --strict on every YAML.
- kube-linter run against every chart / kustomization.
- Polaris or Datree for policy checks in CI.
- Helm charts use helm template | kubeconform; no chart-only validation.
- Kustomize overlays have a base/ and overlays/{dev,staging,production}.
- All manifests live on a branch protected with required reviews.

Real Examples: AI-Generated Manifests Before and After Rules

Here's what changes in practice when the rules above are loaded into Cursor.

Example 1: "Give me a Deployment and Service for my Node API."

Without rules — typical AI output:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: mycompany/api:latest
          ports:
            - containerPort: 3000
          env:
            - name: DB_PASSWORD
              value: "supersecret"
---
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 3000

Sin count: :latest, no resources, no probes, no securityContext, password inline, default namespace, type: LoadBalancer on an internal service, only the app label.

With rules in .cursorrules — same prompt, production-grade output:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: payments
  labels: &labels
    app.kubernetes.io/name: api
    app.kubernetes.io/instance: api-production
    app.kubernetes.io/version: "1.4.3"
    app.kubernetes.io/component: backend
    app.kubernetes.io/part-of: payments
    app.kubernetes.io/managed-by: argocd
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate: { maxSurge: 1, maxUnavailable: 0 }
  selector:
    matchLabels:
      app.kubernetes.io/name: api
      app.kubernetes.io/instance: api-production
  template:
    metadata:
      labels: *labels
    spec:
      serviceAccountName: api
      terminationGracePeriodSeconds: 60
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        fsGroup: 10001
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: api
          image: mycompany/api@sha256:c2f7b8a9d...
          ports:
            - { name: http,    containerPort: 3000 }
            - { name: metrics, containerPort: 9090 }
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: ["ALL"] }
          resources:
            requests: { cpu: "200m", memory: "256Mi" }
            limits:   { memory: "256Mi" }
          env:
            - name: DB_PASSWORD
              valueFrom: { secretKeyRef: { name: api-db, key: password } }
          volumeMounts:
            - { name: tmp, mountPath: /tmp }
          lifecycle:
            preStop: { exec: { command: ["/bin/sh", "-c", "sleep 10"] } }
          startupProbe:    { httpGet: { path: /livez, port: http }, periodSeconds: 5,  failureThreshold: 12 }
          livenessProbe:   { httpGet: { path: /livez, port: http }, periodSeconds: 20, failureThreshold: 3 }
          readinessProbe:  { httpGet: { path: /readyz, port: http }, periodSeconds: 5, failureThreshold: 2 }
      volumes:
        - { name: tmp, emptyDir: {} }
---
apiVersion: v1
kind: Service
# YAML anchors don't cross `---` document boundaries, so the labels repeat here
metadata:
  name: api
  namespace: payments
  labels:
    app.kubernetes.io/name: api
    app.kubernetes.io/instance: api-production
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: api
    app.kubernetes.io/instance: api-production
  ports:
    - { name: http, port: 80, targetPort: http }
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: api, namespace: payments }
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: api
      app.kubernetes.io/instance: api-production

Labels. Digest pin. Probes for each job. PDB. Hardened security context. Secrets via reference. Metrics port. No type: LoadBalancer on an internal service.

The same prompt for "a CronJob that runs a nightly sync" produces a manifest with concurrencyPolicy: Forbid, backoffLimit, ttlSecondsAfterFinished, startingDeadlineSeconds, the same hardened securityContext, and resource limits — instead of a bare schedule + image: sync:latest.
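
The shape of that CronJob, sketched (the schedule and values are illustrative):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-sync
  namespace: payments
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid     # never overlap runs
  startingDeadlineSeconds: 300  # skip a missed window instead of piling up
  jobTemplate:
    spec:
      backoffLimit: 2
      ttlSecondsAfterFinished: 86400  # clean up finished Jobs after a day
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: sync
              image: myregistry/sync@sha256:...
              resources:
                requests: { cpu: "100m", memory: "128Mi" }
                limits:   { cpu: "500m", memory: "128Mi" }  # batch: CPU limit is fine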


Get the Full Pack

These eight rules cover the highest-leverage Kubernetes patterns where AI assistants consistently fail — the ones that turn into incidents, audit findings, and 3am pages. Drop them into .cursorrules and you'll see the difference on the very next prompt.

If you want the same depth for Terraform, Docker, Go, Java, Rust, TypeScript, Python, React, Next.js, and more — all the rules I've packaged from a year of refining Cursor configs across production clusters — they're all at:

oliviacraft.lat

One pack. Twenty-plus languages and frameworks. Battle-tested rules with before/after examples for each. Stop fighting your AI assistant and start shipping production-grade manifests on the first try.
