DEV Community

Young Gao

Kubernetes Security Hardening for Production AI Workloads in 2026


Running AI workloads on Kubernetes introduces security challenges that don't exist in traditional deployments. GPU access requires privileged containers. Model serving endpoints face adversarial inputs. Training jobs pull untrusted data. And ML pipelines often run with far more permissions than they need.

This guide covers practical, copy-paste security configurations for production AI/ML workloads on Kubernetes.

Why AI Workloads Are Different

Standard K8s security guides assume stateless web services. AI workloads break those assumptions:

| Challenge | Traditional App | AI Workload |
| --- | --- | --- |
| Container privileges | None needed | GPU access requires device plugins |
| Data access | Database credentials | Training datasets (TB+), model registries |
| Network surface | HTTP endpoints | gRPC model serving + metrics + training coordination |
| Resource abuse | CPU/memory limits | GPU memory exhaustion, VRAM leaks |
| Supply chain | Package dependencies | Model files can contain executable code |
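
The supply-chain row deserves emphasis. A minimal sketch, using only the Python standard library and a hypothetical marker path, of why a pickled model file is effectively code:

```python
import os
import pickle

MARKER = "/tmp/pickle-demo-marker"  # hypothetical path, used only for this demo

class MaliciousModel:
    """A 'model checkpoint' whose mere deserialization runs attacker-chosen code."""
    def __reduce__(self):
        # pickle stores (callable, args) and invokes it during loads();
        # a real payload would exfiltrate credentials or open a shell.
        return (os.system, (f"touch {MARKER}",))

blob = pickle.dumps(MaliciousModel())  # what a poisoned .pkl file contains
pickle.loads(blob)                     # loading alone executes the payload
print(os.path.exists(MARKER))          # True: code ran without being "called"
```

This is why the rest of this guide treats model files as untrusted input, not inert data.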

1. Pod Security Standards for ML Pods

Don't run training pods as root. Even with GPU requirements, you can restrict privileges:

# ml-namespace.yaml — Enforce restricted pod security
apiVersion: v1
kind: Namespace
metadata:
  name: ml-training
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
---
# For GPU workloads that need device access, use baseline instead
apiVersion: v1
kind: Namespace
metadata:
  name: ml-serving
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted

Training pod with minimal privileges:

apiVersion: v1
kind: Pod
metadata:
  name: training-job
  namespace: ml-training
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: trainer
    image: registry.internal/ml-trainer:v2.1@sha256:abc123...
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: "32Gi"
        cpu: "8"
      requests:
        nvidia.com/gpu: 1
        memory: "16Gi"
        cpu: "4"
    volumeMounts:
    - name: training-data
      mountPath: /data
      readOnly: true
    - name: model-output
      mountPath: /output
    - name: tmp
      mountPath: /tmp
  volumes:
  - name: training-data
    persistentVolumeClaim:
      claimName: training-dataset
      readOnly: true
  - name: model-output
    persistentVolumeClaim:
      claimName: model-artifacts
  - name: tmp
    emptyDir:
      sizeLimit: 10Gi

Key points:

  • readOnlyRootFilesystem: true prevents writing to the container filesystem
  • Training data mounted as readOnly
  • tmp uses emptyDir with a size limit to prevent disk exhaustion
  • Image pinned by digest, not tag
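
These constraints can also be checked mechanically before a manifest ever reaches the cluster. A minimal sketch, assuming the pod spec has been parsed into plain dicts (real enforcement belongs to Pod Security admission or a policy engine, not ad-hoc scripts):

```python
def violations(pod_spec: dict) -> list[str]:
    """Return restricted-profile violations for a pod spec (subset of checks)."""
    problems = []
    if not pod_spec.get("securityContext", {}).get("runAsNonRoot"):
        problems.append("pod must set runAsNonRoot: true")
    for c in pod_spec.get("containers", []):
        ctx = c.get("securityContext", {})
        if ctx.get("allowPrivilegeEscalation", True):  # defaults to true if unset
            problems.append(f"{c['name']}: allowPrivilegeEscalation must be false")
        if ctx.get("capabilities", {}).get("drop") != ["ALL"]:
            problems.append(f"{c['name']}: must drop ALL capabilities")
        if not ctx.get("readOnlyRootFilesystem"):
            problems.append(f"{c['name']}: root filesystem must be read-only")
    return problems

good = {
    "securityContext": {"runAsNonRoot": True},
    "containers": [{"name": "trainer", "securityContext": {
        "allowPrivilegeEscalation": False,
        "readOnlyRootFilesystem": True,
        "capabilities": {"drop": ["ALL"]}}}],
}
bad = {"containers": [{"name": "trainer"}]}
print(violations(good))       # []
print(len(violations(bad)))   # 4
```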

2. Network Policies for Model Serving

Model serving endpoints should only accept traffic from authorized services:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: model-serving-ingress
  namespace: ml-serving
spec:
  podSelector:
    matchLabels:
      app: model-server
  policyTypes:
  - Ingress
  - Egress
  ingress:
  # Only allow traffic from the API gateway
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: api-gateway
      podSelector:
        matchLabels:
          app: gateway
    ports:
    - protocol: TCP
      port: 8080  # HTTP inference
    - protocol: TCP
      port: 8081  # gRPC inference
  egress:
  # Allow DNS resolution
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
  # Allow pulling models from registry
  - to:
    - ipBlock:
        cidr: 10.0.50.0/24  # Internal model registry
    ports:
    - protocol: TCP
      port: 443
  # Block everything else — no internet access for serving pods

Training pods need stricter network rules:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-network-policy
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      app: training-job
  policyTypes:
  - Ingress
  - Egress
  ingress: []  # No inbound connections
  egress:
  # DNS only
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
  # Training data storage only
  - to:
    - ipBlock:
        cidr: 10.0.100.0/24  # Data lake
    ports:
    - protocol: TCP
      port: 443
  # Model registry for saving results
  - to:
    - ipBlock:
        cidr: 10.0.50.0/24
    ports:
    - protocol: TCP
      port: 443
  # BLOCK ALL INTERNET ACCESS
  # Training pods should never reach the internet
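
The deny-by-default intent of these egress rules can be illustrated in a few lines. A sketch using the stdlib `ipaddress` module and the hypothetical CIDRs from the policies above:

```python
import ipaddress

# Egress allowlist mirroring the training policy: data lake + model registry only.
ALLOWED_EGRESS = [
    ipaddress.ip_network("10.0.100.0/24"),  # data lake
    ipaddress.ip_network("10.0.50.0/24"),   # model registry
]

def egress_allowed(dst_ip: str) -> bool:
    """Deny by default; allow only destinations inside an allowlisted CIDR."""
    addr = ipaddress.ip_address(dst_ip)
    return any(addr in net for net in ALLOWED_EGRESS)

print(egress_allowed("10.0.50.17"))   # True: internal model registry
print(egress_allowed("142.250.4.1"))  # False: public internet, blocked
```

Everything not explicitly allowed is dropped, which is exactly how the NetworkPolicy behaves once any egress rule exists for the pod.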

3. RBAC for ML Pipeline Service Accounts

The #1 RBAC mistake in ML platforms: giving training jobs cluster-admin because "it's easier."

# Least-privilege service account for training jobs
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-trainer
  namespace: ml-training
  annotations:
    # If using AWS IRSA or GCP Workload Identity
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/ml-trainer-role
automountServiceAccountToken: false  # Don't mount token unless needed
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-trainer-role
  namespace: ml-training
rules:
# Can read ConfigMaps for hyperparameters
# (note: resourceNames cannot restrict "list", so grant "get" only)
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get"]
  resourceNames: ["training-config", "model-hyperparams"]
# Can read the PVCs it mounts (artifacts are written through the volume, not the API)
- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get"]
# Can write training metrics
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create"]
# CANNOT: list secrets, create pods, exec into containers
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-trainer-binding
  namespace: ml-training
subjects:
- kind: ServiceAccount
  name: ml-trainer
  namespace: ml-training
roleRef:
  kind: Role
  name: ml-trainer-role
  apiGroup: rbac.authorization.k8s.io

Model serving gets a separate, even more restricted account:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: model-server
  namespace: ml-serving
automountServiceAccountToken: false
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: model-server-role
  namespace: ml-serving
rules:
# Can only read model configs
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get"]
  resourceNames: ["model-registry-credentials", "serving-config"]
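
To see why these narrowly scoped rules are safe, it helps to walk through how a request is evaluated. A simplified mirror of the trainer role's logic (not the real apiserver authorizer, just the evaluation order: resource, verb, then resourceNames):

```python
# Rules transcribed from the hypothetical ml-trainer Role above.
RULES = [
    {"resources": {"configmaps"}, "verbs": {"get"},
     "resourceNames": {"training-config", "model-hyperparams"}},
    {"resources": {"persistentvolumeclaims"}, "verbs": {"get"}},
    {"resources": {"events"}, "verbs": {"create"}},
]

def allowed(resource, verb, name=None):
    """A request is allowed only if some rule matches every dimension."""
    for rule in RULES:
        if resource not in rule["resources"] or verb not in rule["verbs"]:
            continue
        names = rule.get("resourceNames")
        if names and name not in names:
            continue
        return True
    return False

print(allowed("configmaps", "get", "training-config"))  # True
print(allowed("secrets", "list"))   # False: the role never mentions secrets
print(allowed("pods", "create"))    # False: training jobs cannot spawn pods
```

RBAC is purely additive, so anything no rule mentions is denied, which is what makes small roles like these effective.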

4. Securing GPU Node Pools

GPU nodes are expensive and powerful. Isolate them:

# gpu-nodepool-config.yaml
# Taint GPU nodes so only ML workloads can schedule on them
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    node-type: gpu
    gpu-type: a100
spec:
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
# Only ML workloads tolerate the GPU taint
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
spec:
  template:
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Equal
        value: "true"
        effect: NoSchedule
      nodeSelector:
        node-type: gpu
      # ... rest of pod spec

Monitor GPU utilization to detect cryptomining:

# Prometheus alert for GPU abuse
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-security-alerts
spec:
  groups:
  - name: gpu-security
    rules:
    - alert: UnexpectedGPUUsage
      expr: |
        DCGM_FI_DEV_GPU_UTIL > 90
        and on(pod)
        kube_pod_labels{label_app!~"training-job|model-server"}
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Non-ML pod using GPU at >90% for 5+ minutes"
        description: "Pod {{ $labels.pod }} is consuming GPU resources unexpectedly. Possible cryptomining."
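
The PromQL above joins GPU utilization against pod labels and alerts on the mismatch. The same logic in plain code, over hypothetical sample scrape data:

```python
import re

# Hypothetical scrape: per-pod GPU utilization plus each pod's app label.
gpu_util = {"training-job-abc": 97.0, "mystery-pod-1": 95.0, "model-server-x": 40.0}
pod_app_label = {"training-job-abc": "training-job",
                 "mystery-pod-1": "miner",
                 "model-server-x": "model-server"}

# Mirrors the label_app!~"training-job|model-server" matcher in the alert.
ALLOWED_APPS = re.compile(r"^(training-job|model-server)$")

def suspicious_pods(threshold=90.0):
    """Pods burning GPU above threshold whose app label is not an ML workload."""
    return [pod for pod, util in gpu_util.items()
            if util > threshold and not ALLOWED_APPS.match(pod_app_label.get(pod, ""))]

print(suspicious_pods())  # ['mystery-pod-1']
```

The legitimate trainer at 97% is excluded by its label; only the unlabelled high-utilization pod fires the alert.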

5. Runtime Security with Falco

Falco rules for ML-specific threats:

# falco-ml-rules.yaml
- rule: Model File Download from Untrusted Source
  desc: Detect model files downloaded from non-approved registries
  condition: >
    spawned_process and
    (proc.name in (curl, wget, python3, python)) and
    (proc.cmdline contains "huggingface" or
     proc.cmdline contains ".safetensors" or
     proc.cmdline contains ".onnx" or
     proc.cmdline contains ".pkl") and
    not proc.cmdline contains "registry.internal"
  output: >
    Model download from untrusted source
    (user=%user.name command=%proc.cmdline container=%container.name)
  priority: WARNING

- rule: Pickle Deserialization in ML Container
  desc: Detect pickle.load/loads which can execute arbitrary code
  condition: >
    spawned_process and
    container and
    proc.name = python3 and
    (proc.cmdline contains "pickle.load" or
     proc.cmdline contains "torch.load" or
     proc.cmdline contains "joblib.load")
  output: >
    Unsafe model deserialization detected
    (command=%proc.cmdline container=%container.name image=%container.image.repository)
  priority: CRITICAL

- rule: GPU Device Access by Non-ML Container
  desc: Non-ML containers should not access GPU devices
  condition: >
    open_read and
    fd.name startswith /dev/nvidia and
    not container.image.repository in (approved_ml_images)
  output: >
    Unauthorized GPU access (container=%container.name file=%fd.name)
  priority: CRITICAL

- list: approved_ml_images
  items:
  - registry.internal/ml-trainer
  - registry.internal/model-server
  - registry.internal/ml-pipeline
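
Falco catches unsafe loads at runtime; inside your own loaders you can also refuse dangerous pickles outright with a restricted unpickler, a pattern documented in the Python `pickle` docs. A sketch with a hypothetical module allowlist:

```python
import io
import os
import pickle

SAFE_MODULES = {"numpy", "collections"}  # hypothetical allowlist for your artifacts

class RestrictedUnpickler(pickle.Unpickler):
    """Refuse to resolve any global outside an explicit module allowlist."""
    def find_class(self, module, name):
        if module.split(".")[0] not in SAFE_MODULES:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain data round-trips fine:
print(restricted_loads(pickle.dumps({"weights": [0.1, 0.2]})))

# A payload smuggling os.system by reference is rejected instead of executed:
try:
    restricted_loads(pickle.dumps((os.system,)))
except pickle.UnpicklingError as e:
    print("rejected:", e)
```

Prefer formats like safetensors where possible; treat this as defense in depth for legacy .pkl artifacts, not a reason to keep shipping them.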

6. Secrets Management for Model Registries

Never mount registry credentials as environment variables:

# Use External Secrets Operator with your vault
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: model-registry-credentials
  namespace: ml-serving
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: model-registry-credentials
  data:
  - secretKey: registry-token
    remoteRef:
      key: ml/model-registry
      property: access-token
  - secretKey: signing-key
    remoteRef:
      key: ml/model-signing
      property: cosign-private-key
# Mount as file, not env var — prevents leaking via /proc/<pid>/environ
containers:
- name: model-server
  volumeMounts:
  - name: registry-creds
    mountPath: /secrets/registry
    readOnly: true
volumes:
- name: registry-creds
  secret:
    secretName: model-registry-credentials
    defaultMode: 0400  # Owner read-only
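
A serving process can verify this discipline at startup before touching the credential. A hypothetical startup guard, simulated here with a temp file standing in for the mounted secret:

```python
import os
import stat
import tempfile

def assert_private(path: str) -> None:
    """Refuse to start if a mounted secret is readable by group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        raise PermissionError(f"{path} has loose permissions {oct(mode)}")

# Simulate the defaultMode: 0400 mount from the pod spec above:
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"registry-token")
    secret_path = f.name
os.chmod(secret_path, 0o400)

assert_private(secret_path)       # passes: owner read-only
token = open(secret_path).read()  # read from the file, never from the environment
print(token)
```

Failing fast on loose permissions turns a silent misconfiguration into an obvious crash at deploy time.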

The Complete Checklist

| Layer | Control | Config |
| --- | --- | --- |
| Namespace | Pod Security Standards | restricted for training, baseline for serving |
| Pod | Non-root, read-only FS | runAsNonRoot, readOnlyRootFilesystem |
| Network | Deny-all + allowlist | No internet for training pods |
| RBAC | Least-privilege SA | No cluster-admin, no secret listing |
| GPU | Taints + node selectors | Only ML pods on GPU nodes |
| Runtime | Falco rules | Detect pickle, unauthorized GPU, untrusted downloads |
| Secrets | External Secrets Operator | Mount as files, auto-rotate |
| Images | Digest pinning | Never use :latest for ML images |

Key Takeaways

  1. Training pods should never have internet access — they pull data from internal storage, not the open internet.
  2. GPU nodes need dedicated taints — prevents resource theft and cryptomining.
  3. Model files are code — monitor for unsafe deserialization (pickle, torch.load) with Falco.
  4. Separate service accounts per pipeline stage — training, serving, and monitoring each get minimal RBAC.
  5. Pin everything by digest — container images, model files, and base images.
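
The digest-pinning takeaway extends to model artifacts, not just container images. A sketch of load-time verification with `hashlib` (stand-in bytes, not a real artifact; in practice the pinned digest is recorded when the model is published):

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    """Digest in the same sha256:<hex> form used for image pinning."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

model_bytes = b"fake model weights"   # stand-in for a .safetensors file
pinned = sha256_digest(model_bytes)   # what would live in your serving config

def load_model(data: bytes, expected: str) -> bytes:
    actual = sha256_digest(data)
    if actual != expected:
        raise ValueError(f"digest mismatch: {actual} != {expected}")
    return data

load_model(model_bytes, pinned)                   # loads: digest matches the pin
try:
    load_model(model_bytes + b"tamper", pinned)   # any modification is rejected
except ValueError:
    print("blocked tampered artifact")
```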

The organizations running GPU clusters without these controls are one compromised model file away from a full cluster takeover.


Based on production Kubernetes security audits for AI/ML platforms running on EKS, GKE, and bare-metal GPU clusters.
