DEV Community

Young Gao

Kubernetes Security Hardening for Production AI Workloads in 2026


Running AI workloads on Kubernetes introduces security challenges that don't exist in traditional deployments. GPU access requires privileged containers. Model serving endpoints face adversarial inputs. Training jobs pull untrusted data. And ML pipelines often run with far more permissions than they need.

This guide covers practical, copy-paste security configurations for production AI/ML workloads on Kubernetes.

Why AI Workloads Are Different

Standard K8s security guides assume stateless web services. AI workloads break those assumptions:

| Challenge | Traditional App | AI Workload |
| --- | --- | --- |
| Container privileges | None needed | GPU access requires device plugins |
| Data access | Database credentials | Training datasets (TB+), model registries |
| Network surface | HTTP endpoints | gRPC model serving + metrics + training coordination |
| Resource abuse | CPU/memory limits | GPU memory exhaustion, VRAM leaks |
| Supply chain | Package dependencies | Model files can contain executable code |
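
The supply-chain row deserves emphasis. A minimal sketch, using only the Python standard library and a hypothetical marker path, of why a pickled model file is effectively code:

```python
import os
import pickle

MARKER = "/tmp/pickle-demo-marker"  # hypothetical path, used only for this demo

class MaliciousModel:
    """A 'model checkpoint' whose mere deserialization runs attacker-chosen code."""
    def __reduce__(self):
        # pickle stores (callable, args) and invokes it during loads();
        # a real payload would exfiltrate credentials or open a shell.
        return (os.system, (f"touch {MARKER}",))

blob = pickle.dumps(MaliciousModel())  # what a poisoned .pkl file contains
pickle.loads(blob)                     # loading alone executes the payload
print(os.path.exists(MARKER))          # True: code ran without being "called"
```

This is why the rest of this guide treats model files as untrusted input, not inert data.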

1. Pod Security Standards for ML Pods

Don't run training pods as root. Even with GPU requirements, you can restrict privileges:

# ml-namespace.yaml — Enforce restricted pod security
apiVersion: v1
kind: Namespace
metadata:
  name: ml-training
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
---
# For GPU workloads that need device access, use baseline instead
apiVersion: v1
kind: Namespace
metadata:
  name: ml-serving
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted

Training pod with minimal privileges:

apiVersion: v1
kind: Pod
metadata:
  name: training-job
  namespace: ml-training
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: trainer
    image: registry.internal/ml-trainer:v2.1@sha256:abc123...
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: "32Gi"
        cpu: "8"
      requests:
        nvidia.com/gpu: 1
        memory: "16Gi"
        cpu: "4"
    volumeMounts:
    - name: training-data
      mountPath: /data
      readOnly: true
    - name: model-output
      mountPath: /output
    - name: tmp
      mountPath: /tmp
  volumes:
  - name: training-data
    persistentVolumeClaim:
      claimName: training-dataset
      readOnly: true
  - name: model-output
    persistentVolumeClaim:
      claimName: model-artifacts
  - name: tmp
    emptyDir:
      sizeLimit: 10Gi

Key points:

  • readOnlyRootFilesystem: true prevents writing to the container filesystem
  • Training data mounted as readOnly
  • tmp uses emptyDir with a size limit to prevent disk exhaustion
  • Image pinned by digest, not tag
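
These constraints can also be checked mechanically before a manifest ever reaches the cluster. A minimal sketch, assuming the pod spec has been parsed into plain dicts (real enforcement belongs to Pod Security admission or a policy engine, not ad-hoc scripts):

```python
def violations(pod_spec: dict) -> list[str]:
    """Return restricted-profile violations for a pod spec (subset of checks)."""
    problems = []
    if not pod_spec.get("securityContext", {}).get("runAsNonRoot"):
        problems.append("pod must set runAsNonRoot: true")
    for c in pod_spec.get("containers", []):
        ctx = c.get("securityContext", {})
        if ctx.get("allowPrivilegeEscalation", True):  # defaults to true if unset
            problems.append(f"{c['name']}: allowPrivilegeEscalation must be false")
        if ctx.get("capabilities", {}).get("drop") != ["ALL"]:
            problems.append(f"{c['name']}: must drop ALL capabilities")
        if not ctx.get("readOnlyRootFilesystem"):
            problems.append(f"{c['name']}: root filesystem must be read-only")
    return problems

good = {
    "securityContext": {"runAsNonRoot": True},
    "containers": [{"name": "trainer", "securityContext": {
        "allowPrivilegeEscalation": False,
        "readOnlyRootFilesystem": True,
        "capabilities": {"drop": ["ALL"]}}}],
}
bad = {"containers": [{"name": "trainer"}]}
print(violations(good))       # []
print(len(violations(bad)))   # 4
```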

2. Network Policies for Model Serving

Model serving endpoints should only accept traffic from authorized services:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: model-serving-ingress
  namespace: ml-serving
spec:
  podSelector:
    matchLabels:
      app: model-server
  policyTypes:
  - Ingress
  - Egress
  ingress:
  # Only allow traffic from the API gateway
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: api-gateway
      podSelector:
        matchLabels:
          app: gateway
    ports:
    - protocol: TCP
      port: 8080  # HTTP inference
    - protocol: TCP
      port: 8081  # gRPC inference
  egress:
  # Allow DNS resolution
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
  # Allow pulling models from registry
  - to:
    - ipBlock:
        cidr: 10.0.50.0/24  # Internal model registry
    ports:
    - protocol: TCP
      port: 443
  # Block everything else — no internet access for serving pods

Training pods need stricter network rules:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-network-policy
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      app: training-job
  policyTypes:
  - Ingress
  - Egress
  ingress: []  # No inbound connections
  egress:
  # DNS only
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
  # Training data storage only
  - to:
    - ipBlock:
        cidr: 10.0.100.0/24  # Data lake
    ports:
    - protocol: TCP
      port: 443
  # Model registry for saving results
  - to:
    - ipBlock:
        cidr: 10.0.50.0/24
    ports:
    - protocol: TCP
      port: 443
  # BLOCK ALL INTERNET ACCESS
  # Training pods should never reach the internet
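
The deny-by-default intent of these egress rules can be illustrated in a few lines. A sketch using the stdlib `ipaddress` module and the hypothetical CIDRs from the policies above:

```python
import ipaddress

# Egress allowlist mirroring the training policy: data lake + model registry only.
ALLOWED_EGRESS = [
    ipaddress.ip_network("10.0.100.0/24"),  # data lake
    ipaddress.ip_network("10.0.50.0/24"),   # model registry
]

def egress_allowed(dst_ip: str) -> bool:
    """Deny by default; allow only destinations inside an allowlisted CIDR."""
    addr = ipaddress.ip_address(dst_ip)
    return any(addr in net for net in ALLOWED_EGRESS)

print(egress_allowed("10.0.50.17"))   # True: internal model registry
print(egress_allowed("142.250.4.1"))  # False: public internet, blocked
```

Everything not explicitly allowed is dropped, which is exactly how the NetworkPolicy behaves once any egress rule exists for the pod.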

3. RBAC for ML Pipeline Service Accounts

The #1 RBAC mistake in ML platforms: giving training jobs cluster-admin because "it's easier."

# Least-privilege service account for training jobs
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-trainer
  namespace: ml-training
  annotations:
    # If using AWS IRSA or GCP Workload Identity
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/ml-trainer-role
automountServiceAccountToken: false  # Don't mount token unless needed
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-trainer-role
  namespace: ml-training
rules:
# Can read ConfigMaps for hyperparameters
# (note: resourceNames cannot restrict "list", so grant "get" only)
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get"]
  resourceNames: ["training-config", "model-hyperparams"]
# Can read the PVCs it mounts (artifacts are written through the volume, not the API)
- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get"]
# Can write training metrics
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create"]
# CANNOT: list secrets, create pods, exec into containers
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-trainer-binding
  namespace: ml-training
subjects:
- kind: ServiceAccount
  name: ml-trainer
  namespace: ml-training
roleRef:
  kind: Role
  name: ml-trainer-role
  apiGroup: rbac.authorization.k8s.io

Model serving gets a separate, even more restricted account:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: model-server
  namespace: ml-serving
automountServiceAccountToken: false
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: model-server-role
  namespace: ml-serving
rules:
# Can only read model configs
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get"]
  resourceNames: ["model-registry-credentials", "serving-config"]
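
To see why these narrowly scoped rules are safe, it helps to walk through how a request is evaluated. A simplified mirror of the trainer role's logic (not the real apiserver authorizer, just the evaluation order: resource, verb, then resourceNames):

```python
# Rules transcribed from the hypothetical ml-trainer Role above.
RULES = [
    {"resources": {"configmaps"}, "verbs": {"get"},
     "resourceNames": {"training-config", "model-hyperparams"}},
    {"resources": {"persistentvolumeclaims"}, "verbs": {"get"}},
    {"resources": {"events"}, "verbs": {"create"}},
]

def allowed(resource, verb, name=None):
    """A request is allowed only if some rule matches every dimension."""
    for rule in RULES:
        if resource not in rule["resources"] or verb not in rule["verbs"]:
            continue
        names = rule.get("resourceNames")
        if names and name not in names:
            continue
        return True
    return False

print(allowed("configmaps", "get", "training-config"))  # True
print(allowed("secrets", "list"))   # False: the role never mentions secrets
print(allowed("pods", "create"))    # False: training jobs cannot spawn pods
```

RBAC is purely additive, so anything no rule mentions is denied, which is what makes small roles like these effective.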

4. Securing GPU Node Pools

GPU nodes are expensive and powerful. Isolate them:

# gpu-nodepool-config.yaml
# Taint GPU nodes so only ML workloads can schedule on them
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    node-type: gpu
    gpu-type: a100
spec:
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
# Only ML workloads tolerate the GPU taint
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
spec:
  template:
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Equal
        value: "true"
        effect: NoSchedule
      nodeSelector:
        node-type: gpu
      # ... rest of pod spec

Monitor GPU utilization to detect cryptomining:

# Prometheus alert for GPU abuse
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-security-alerts
spec:
  groups:
  - name: gpu-security
    rules:
    - alert: UnexpectedGPUUsage
      expr: |
        DCGM_FI_DEV_GPU_UTIL > 90
        and on(pod)
        kube_pod_labels{label_app!~"training-job|model-server"}
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Non-ML pod using GPU at >90% for 5+ minutes"
        description: "Pod {{ $labels.pod }} is consuming GPU resources unexpectedly. Possible cryptomining."
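
The PromQL above joins GPU utilization against pod labels and alerts on the mismatch. The same logic in plain code, over hypothetical sample scrape data:

```python
import re

# Hypothetical scrape: per-pod GPU utilization plus each pod's app label.
gpu_util = {"training-job-abc": 97.0, "mystery-pod-1": 95.0, "model-server-x": 40.0}
pod_app_label = {"training-job-abc": "training-job",
                 "mystery-pod-1": "miner",
                 "model-server-x": "model-server"}

# Mirrors the label_app!~"training-job|model-server" matcher in the alert.
ALLOWED_APPS = re.compile(r"^(training-job|model-server)$")

def suspicious_pods(threshold=90.0):
    """Pods burning GPU above threshold whose app label is not an ML workload."""
    return [pod for pod, util in gpu_util.items()
            if util > threshold and not ALLOWED_APPS.match(pod_app_label.get(pod, ""))]

print(suspicious_pods())  # ['mystery-pod-1']
```

The legitimate trainer at 97% is excluded by its label; only the unlabelled high-utilization pod fires the alert.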

5. Runtime Security with Falco

Falco rules for ML-specific threats:

# falco-ml-rules.yaml
- rule: Model File Download from Untrusted Source
  desc: Detect model files downloaded from non-approved registries
  condition: >
    spawned_process and
    (proc.name in (curl, wget, python3, python)) and
    (proc.cmdline contains "huggingface" or
     proc.cmdline contains ".safetensors" or
     proc.cmdline contains ".onnx" or
     proc.cmdline contains ".pkl") and
    not proc.cmdline contains "registry.internal"
  output: >
    Model download from untrusted source
    (user=%user.name command=%proc.cmdline container=%container.name)
  priority: WARNING

- rule: Pickle Deserialization in ML Container
  desc: Detect pickle.load/loads which can execute arbitrary code
  condition: >
    spawned_process and
    container and
    proc.name = python3 and
    (proc.cmdline contains "pickle.load" or
     proc.cmdline contains "torch.load" or
     proc.cmdline contains "joblib.load")
  output: >
    Unsafe model deserialization detected
    (command=%proc.cmdline container=%container.name image=%container.image.repository)
  priority: CRITICAL

- rule: GPU Device Access by Non-ML Container
  desc: Non-ML containers should not access GPU devices
  condition: >
    open_read and
    fd.name startswith /dev/nvidia and
    not container.image.repository in (approved_ml_images)
  output: >
    Unauthorized GPU access (container=%container.name file=%fd.name)
  priority: CRITICAL

- list: approved_ml_images
  items:
  - registry.internal/ml-trainer
  - registry.internal/model-server
  - registry.internal/ml-pipeline
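
Falco catches unsafe loads at runtime; inside your own loaders you can also refuse dangerous pickles outright with a restricted unpickler, a pattern documented in the Python `pickle` docs. A sketch with a hypothetical module allowlist:

```python
import io
import os
import pickle

SAFE_MODULES = {"numpy", "collections"}  # hypothetical allowlist for your artifacts

class RestrictedUnpickler(pickle.Unpickler):
    """Refuse to resolve any global outside an explicit module allowlist."""
    def find_class(self, module, name):
        if module.split(".")[0] not in SAFE_MODULES:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain data round-trips fine:
print(restricted_loads(pickle.dumps({"weights": [0.1, 0.2]})))

# A payload smuggling os.system by reference is rejected instead of executed:
try:
    restricted_loads(pickle.dumps((os.system,)))
except pickle.UnpicklingError as e:
    print("rejected:", e)
```

Prefer formats like safetensors where possible; treat this as defense in depth for legacy .pkl artifacts, not a reason to keep shipping them.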

6. Secrets Management for Model Registries

Never mount registry credentials as environment variables:

# Use External Secrets Operator with your vault
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: model-registry-credentials
  namespace: ml-serving
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: model-registry-credentials
  data:
  - secretKey: registry-token
    remoteRef:
      key: ml/model-registry
      property: access-token
  - secretKey: signing-key
    remoteRef:
      key: ml/model-signing
      property: cosign-private-key
# Mount as file, not env var — prevents leaking via /proc/<pid>/environ
containers:
- name: model-server
  volumeMounts:
  - name: registry-creds
    mountPath: /secrets/registry
    readOnly: true
volumes:
- name: registry-creds
  secret:
    secretName: model-registry-credentials
    defaultMode: 0400  # Owner read-only
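
A serving process can verify this discipline at startup before touching the credential. A hypothetical startup guard, simulated here with a temp file standing in for the mounted secret:

```python
import os
import stat
import tempfile

def assert_private(path: str) -> None:
    """Refuse to start if a mounted secret is readable by group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        raise PermissionError(f"{path} has loose permissions {oct(mode)}")

# Simulate the defaultMode: 0400 mount from the pod spec above:
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"registry-token")
    secret_path = f.name
os.chmod(secret_path, 0o400)

assert_private(secret_path)       # passes: owner read-only
token = open(secret_path).read()  # read from the file, never from the environment
print(token)
```

Failing fast on loose permissions turns a silent misconfiguration into an obvious crash at deploy time.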

The Complete Checklist

| Layer | Control | Config |
| --- | --- | --- |
| Namespace | Pod Security Standards | restricted for training, baseline for serving |
| Pod | Non-root, read-only FS | runAsNonRoot, readOnlyRootFilesystem |
| Network | Deny-all + allowlist | No internet for training pods |
| RBAC | Least-privilege SA | No cluster-admin, no secret listing |
| GPU | Taints + node selectors | Only ML pods on GPU nodes |
| Runtime | Falco rules | Detect pickle, unauthorized GPU, untrusted downloads |
| Secrets | External Secrets Operator | Mount as files, auto-rotate |
| Images | Digest pinning | Never use :latest for ML images |

Key Takeaways

  1. Training pods should never have internet access — they pull data from internal storage, not the open internet.
  2. GPU nodes need dedicated taints — prevents resource theft and cryptomining.
  3. Model files are code — monitor for unsafe deserialization (pickle, torch.load) with Falco.
  4. Separate service accounts per pipeline stage — training, serving, and monitoring each get minimal RBAC.
  5. Pin everything by digest — container images, model files, and base images.
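
The digest-pinning takeaway extends to model artifacts, not just container images. A sketch of load-time verification with `hashlib` (stand-in bytes, not a real artifact; in practice the pinned digest is recorded when the model is published):

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    """Digest in the same sha256:<hex> form used for image pinning."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

model_bytes = b"fake model weights"   # stand-in for a .safetensors file
pinned = sha256_digest(model_bytes)   # what would live in your serving config

def load_model(data: bytes, expected: str) -> bytes:
    actual = sha256_digest(data)
    if actual != expected:
        raise ValueError(f"digest mismatch: {actual} != {expected}")
    return data

load_model(model_bytes, pinned)                   # loads: digest matches the pin
try:
    load_model(model_bytes + b"tamper", pinned)   # any modification is rejected
except ValueError:
    print("blocked tampered artifact")
```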

The organizations running GPU clusters without these controls are one compromised model file away from a full cluster takeover.


Based on production Kubernetes security audits for AI/ML platforms running on EKS, GKE, and bare-metal GPU clusters.
