Kubernetes Security Hardening for Production AI Workloads in 2026
Running AI workloads on Kubernetes introduces security challenges that don't exist in traditional deployments. GPU access pushes teams toward privileged containers (it shouldn't, as we'll see). Model serving endpoints face adversarial inputs. Training jobs pull untrusted data. And ML pipelines often run with far more permissions than they need.
This guide covers practical, copy-paste security configurations for production AI/ML workloads on Kubernetes.
Why AI Workloads Are Different
Standard K8s security guides assume stateless web services. AI workloads break those assumptions:
| Challenge | Traditional App | AI Workload |
|---|---|---|
| Container privileges | None needed | GPU access requires device plugins |
| Data access | Database credentials | Training datasets (TB+), model registries |
| Network surface | HTTP endpoints | gRPC model serving + metrics + training coordination |
| Resource abuse | CPU/memory limits | GPU memory exhaustion, VRAM leaks |
| Supply chain | Package dependencies | Model files can contain executable code |
1. Pod Security Standards for ML Pods
Don't run training pods as root. Even with GPU requirements, you can restrict privileges:
```yaml
# ml-namespace.yaml - enforce restricted pod security
apiVersion: v1
kind: Namespace
metadata:
  name: ml-training
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
---
# For GPU workloads that need device access, use baseline instead
apiVersion: v1
kind: Namespace
metadata:
  name: ml-serving
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
```
Training pod with minimal privileges:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job
  namespace: ml-training
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: trainer
      image: registry.internal/ml-trainer:v2.1@sha256:abc123...
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: "32Gi"
          cpu: "8"
        requests:
          nvidia.com/gpu: 1
          memory: "16Gi"
          cpu: "4"
      volumeMounts:
        - name: training-data
          mountPath: /data
          readOnly: true
        - name: model-output
          mountPath: /output
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: training-data
      persistentVolumeClaim:
        claimName: training-dataset
        readOnly: true
    - name: model-output
      persistentVolumeClaim:
        claimName: model-artifacts
    - name: tmp
      emptyDir:
        sizeLimit: 10Gi
```
Key points:

- `readOnlyRootFilesystem: true` prevents writes to the container filesystem
- Training data is mounted `readOnly`
- `/tmp` uses an `emptyDir` with a size limit to prevent disk exhaustion
- The image is pinned by digest, not a mutable tag
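The checks above are easy to enforce in CI before a manifest ever reaches the cluster. Here is a minimal sketch of such a linter; the function name and the exact rule set are illustrative, and the pod spec is assumed to be already parsed into a dict (e.g. via `yaml.safe_load`):

```python
def check_pod_security(pod_spec):
    """Return a list of violations of the hardening settings this guide enforces."""
    violations = []
    ctx = pod_spec.get("securityContext", {})
    if not ctx.get("runAsNonRoot"):
        violations.append("pod: set runAsNonRoot: true")
    if ctx.get("seccompProfile", {}).get("type") != "RuntimeDefault":
        violations.append("pod: use the RuntimeDefault seccomp profile")
    for c in pod_spec.get("containers", []):
        cctx = c.get("securityContext", {})
        # allowPrivilegeEscalation defaults to true, so absence is a violation
        if cctx.get("allowPrivilegeEscalation", True):
            violations.append(f"{c['name']}: set allowPrivilegeEscalation: false")
        if not cctx.get("readOnlyRootFilesystem"):
            violations.append(f"{c['name']}: set readOnlyRootFilesystem: true")
        if "ALL" not in cctx.get("capabilities", {}).get("drop", []):
            violations.append(f"{c['name']}: drop ALL capabilities")
    return violations

# The hardened trainer spec from above passes; a bare debug pod fails 5 checks.
hardened = {
    "securityContext": {"runAsNonRoot": True, "runAsUser": 1000,
                        "seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [{"name": "trainer",
                    "securityContext": {"allowPrivilegeEscalation": False,
                                        "readOnlyRootFilesystem": True,
                                        "capabilities": {"drop": ["ALL"]}}}],
}
print(check_pod_security(hardened))  # []
print(check_pod_security({"containers": [{"name": "debug"}]}))
```

In practice you would run this over every rendered manifest in the pipeline; admission controllers like Pod Security Admission are the backstop, but failing fast in CI gives engineers a better error message.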
2. Network Policies for Model Serving
Model serving endpoints should only accept traffic from authorized services:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: model-serving-ingress
  namespace: ml-serving
spec:
  podSelector:
    matchLabels:
      app: model-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Only allow traffic from the API gateway
    - from:
        - namespaceSelector:
            matchLabels:
              # Auto-set by Kubernetes since 1.22 - no manual namespace labeling needed
              kubernetes.io/metadata.name: api-gateway
          podSelector:
            matchLabels:
              app: gateway
      ports:
        - protocol: TCP
          port: 8080 # HTTP inference
        - protocol: TCP
          port: 8081 # gRPC inference
  egress:
    # Allow DNS resolution (UDP plus TCP for large responses)
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow pulling models from registry
    - to:
        - ipBlock:
            cidr: 10.0.50.0/24 # Internal model registry
      ports:
        - protocol: TCP
          port: 443
  # Block everything else - no internet access for serving pods
```
Training pods need stricter network rules:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-network-policy
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      app: training-job
  policyTypes:
    - Ingress
    - Egress
  ingress: [] # No inbound connections
  egress:
    # DNS only
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
    # Training data storage only
    - to:
        - ipBlock:
            cidr: 10.0.100.0/24 # Data lake
      ports:
        - protocol: TCP
          port: 443
    # Model registry for saving results
    - to:
        - ipBlock:
            cidr: 10.0.50.0/24
      ports:
        - protocol: TCP
          port: 443
  # BLOCK ALL INTERNET ACCESS:
  # no rule matches public ranges, so training pods never reach the internet
```
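A cheap way to keep the no-internet guarantee from regressing is to lint every NetworkPolicy's egress CIDRs against an approved-ranges list in CI. A sketch, using only the standard library; `APPROVED` mirrors the data-lake and registry ranges used in this guide, and the policy is assumed to be a parsed manifest dict:

```python
import ipaddress

# The only internal ranges egress ipBlocks may target (example values from this guide)
APPROVED = [ipaddress.ip_network(c) for c in ("10.0.50.0/24", "10.0.100.0/24")]

def egress_cidrs_approved(policy):
    """True only if every egress ipBlock falls inside an approved internal range."""
    for rule in policy.get("spec", {}).get("egress", []):
        for peer in rule.get("to", []):
            block = peer.get("ipBlock")
            if block is None:
                continue  # pod/namespace selector peers are checked elsewhere
            net = ipaddress.ip_network(block["cidr"])
            if not any(net.subnet_of(approved) for approved in APPROVED):
                return False
    return True

good = {"spec": {"egress": [{"to": [{"ipBlock": {"cidr": "10.0.100.0/24"}}]}]}}
bad = {"spec": {"egress": [{"to": [{"ipBlock": {"cidr": "0.0.0.0/0"}}]}]}}
print(egress_cidrs_approved(good), egress_cidrs_approved(bad))  # True False
```

Checking `subnet_of` against an explicit allowlist is stricter than testing "is this a private range": it also catches a typo that points egress at the wrong internal subnet.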
3. RBAC for ML Pipeline Service Accounts
The #1 RBAC mistake in ML platforms: giving training jobs cluster-admin because "it's easier."
```yaml
# Least-privilege service account for training jobs
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-trainer
  namespace: ml-training
  annotations:
    # If using AWS IRSA or GCP Workload Identity
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/ml-trainer-role
automountServiceAccountToken: false # Don't mount the token unless it's needed
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-trainer-role
  namespace: ml-training
rules:
  # Can read named ConfigMaps for hyperparameters.
  # Note: resourceNames only restricts "get"; "list" ignores it, so don't grant list.
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
    resourceNames: ["training-config", "model-hyperparams"]
  # Can read PVC status (artifacts are written through the volume mount, not the API)
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get"]
  # Can write training events
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create"]
  # CANNOT: list secrets, create pods, exec into containers
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-trainer-binding
  namespace: ml-training
subjects:
  - kind: ServiceAccount
    name: ml-trainer
    namespace: ml-training
roleRef:
  kind: Role
  name: ml-trainer-role
  apiGroup: rbac.authorization.k8s.io
```
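A review script can flag risky grants mechanically instead of relying on eyeballing every Role. A sketch matching the "CANNOT" list above; the `RISKY` set is an illustrative choice, not an exhaustive policy, and the role is assumed to be a parsed manifest dict:

```python
# (resource, verb) pairs a training-job Role should never hold.
# Unrestricted "get" on secrets counts as risky; narrowed by resourceNames it's allowed.
RISKY = {
    ("secrets", "list"), ("secrets", "get"),
    ("pods", "create"), ("pods/exec", "create"),
    ("*", "*"),
}

def risky_grants(role):
    found = []
    for rule in role.get("rules", []):
        names_restricted = bool(rule.get("resourceNames"))
        for resource in rule.get("resources", []):
            for verb in rule.get("verbs", []):
                if (resource, verb) in RISKY and not names_restricted:
                    found.append((resource, verb))
    return found

trainer_role = {"rules": [
    {"resources": ["configmaps"], "verbs": ["get"],
     "resourceNames": ["training-config"]},
    {"resources": ["events"], "verbs": ["create"]},
]}
sloppy_role = {"rules": [{"resources": ["secrets"], "verbs": ["get", "list"]}]}

print(risky_grants(trainer_role))  # []
print(risky_grants(sloppy_role))   # [('secrets', 'get'), ('secrets', 'list')]
```

The same idea works at runtime: `kubectl auth can-i --as=system:serviceaccount:ml-training:ml-trainer list secrets -n ml-training` should print `no` after this Role is bound.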
Model serving gets a separate, even more restricted account:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: model-server
  namespace: ml-serving
automountServiceAccountToken: false
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: model-server-role
  namespace: ml-serving
rules:
  # Can only read the named model configs and registry credentials
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get"]
    resourceNames: ["model-registry-credentials", "serving-config"]
```
4. Securing GPU Node Pools
GPU nodes are expensive and powerful. Isolate them:
```yaml
# gpu-nodepool-config.yaml
# Taint GPU nodes so only ML workloads can schedule on them.
# (Equivalent imperative form: kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule)
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    node-type: gpu
    gpu-type: a100
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
---
# Only ML workloads tolerate the GPU taint
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
spec:
  template:
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      nodeSelector:
        node-type: gpu
      # ... rest of pod spec
```
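The taint and the nodeSelector only work as a pair: the toleration lets a pod onto tainted GPU nodes, and the nodeSelector keeps it off everything else. A tiny sanity check (sketch, hypothetical function name) that a pod template carries both halves:

```python
def gpu_scheduling_ok(pod_spec):
    """True only if the pod tolerates the GPU taint AND selects GPU nodes."""
    tolerates = any(
        t.get("key") == "nvidia.com/gpu" and t.get("effect") == "NoSchedule"
        for t in pod_spec.get("tolerations", [])
    )
    selects_gpu = pod_spec.get("nodeSelector", {}).get("node-type") == "gpu"
    return tolerates and selects_gpu

spec = {
    "tolerations": [{"key": "nvidia.com/gpu", "operator": "Equal",
                     "value": "true", "effect": "NoSchedule"}],
    "nodeSelector": {"node-type": "gpu"},
}
print(gpu_scheduling_ok(spec))                                    # True
print(gpu_scheduling_ok({"nodeSelector": {"node-type": "gpu"}}))  # False
```

A pod with the selector but no toleration simply never schedules, which is a confusing failure mode for ML engineers; catching it in CI is kinder than a Pending pod.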
Monitor GPU utilization to detect cryptomining:
```yaml
# Prometheus alert for GPU abuse
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-security-alerts
spec:
  groups:
    - name: gpu-security
      rules:
        - alert: UnexpectedGPUUsage
          expr: |
            DCGM_FI_DEV_GPU_UTIL > 90
            and on(pod)
            kube_pod_labels{label_app!~"training-job|model-server"}
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Non-ML pod using GPU at >90% for 5+ minutes"
            description: "Pod {{ $labels.pod }} is consuming GPU resources unexpectedly. Possible cryptomining."
```
5. Runtime Security with Falco
Falco rules for ML-specific threats:
```yaml
# falco-ml-rules.yaml
# Note: lists must be defined before the rules that reference them
- list: approved_ml_images
  items:
    - registry.internal/ml-trainer
    - registry.internal/model-server
    - registry.internal/ml-pipeline

- rule: Model File Download from Untrusted Source
  desc: Detect model files downloaded from non-approved registries
  condition: >
    spawned_process and
    (proc.name in (curl, wget, python3, python)) and
    (proc.cmdline contains "huggingface" or
     proc.cmdline contains ".safetensors" or
     proc.cmdline contains ".onnx" or
     proc.cmdline contains ".pkl") and
    not proc.cmdline contains "registry.internal"
  output: >
    Model download from untrusted source
    (user=%user.name command=%proc.cmdline container=%container.name)
  priority: WARNING

- rule: Pickle Deserialization in ML Container
  desc: Detect pickle.load/loads, which can execute arbitrary code
  condition: >
    spawned_process and
    container and
    proc.name = python3 and
    (proc.cmdline contains "pickle.load" or
     proc.cmdline contains "torch.load" or
     proc.cmdline contains "joblib.load")
  output: >
    Unsafe model deserialization detected
    (command=%proc.cmdline container=%container.name image=%container.image.repository)
  priority: CRITICAL

- rule: GPU Device Access by Non-ML Container
  desc: Non-ML containers should not access GPU devices
  condition: >
    open_read and
    fd.name startswith /dev/nvidia and
    not container.image.repository in (approved_ml_images)
  output: >
    Unauthorized GPU access (container=%container.name file=%fd.name)
  priority: CRITICAL
```
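Why the pickle rule is CRITICAL: unpickling executes code chosen by whoever produced the file. The defensive pattern, when you cannot avoid pickle entirely, is an `Unpickler` subclass that refuses any global not on an explicit allowlist. A minimal sketch (the allowlist here is illustrative):

```python
import io
import pickle

class AllowlistUnpickler(pickle.Unpickler):
    # Only these (module, name) globals may be resolved during load
    ALLOWED = {("builtins", "list"), ("builtins", "dict"), ("builtins", "set")}

    def find_class(self, module, name):
        if (module, name) not in self.ALLOWED:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)

# Plain data structures load fine...
print(AllowlistUnpickler(io.BytesIO(pickle.dumps([1, 2, 3]))).load())  # [1, 2, 3]

# ...but a payload that smuggles in os.system is rejected instead of executed.
class Exploit:
    def __reduce__(self):
        import os
        return (os.system, ("echo pwned",))

try:
    AllowlistUnpickler(io.BytesIO(pickle.dumps(Exploit()))).load()
except pickle.UnpicklingError as exc:
    print("rejected:", exc)
```

Better still is to avoid the problem class of formats: prefer safetensors for weights, and treat any `.pkl` or default `torch.load` artifact as untrusted input.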
6. Secrets Management for Model Registries
Never mount registry credentials as environment variables:
```yaml
# Use External Secrets Operator with your vault
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: model-registry-credentials
  namespace: ml-serving
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: model-registry-credentials
  data:
    - secretKey: registry-token
      remoteRef:
        key: ml/model-registry
        property: access-token
    - secretKey: signing-key
      remoteRef:
        key: ml/model-signing
        property: cosign-private-key
```

Mount the resulting Secret as a file, not an environment variable; env vars leak via `/proc/<pid>/environ`, crash dumps, and child processes:

```yaml
# Pod spec fragment
containers:
  - name: model-server
    volumeMounts:
      - name: registry-creds
        mountPath: /secrets/registry
        readOnly: true
volumes:
  - name: registry-creds
    secret:
      secretName: model-registry-credentials
      defaultMode: 0400 # Owner read-only
```
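On the consuming side, read the token from the mounted file on each use rather than caching it in an environment variable at startup; re-reading also picks up the operator's hourly rotation without a pod restart. A sketch, with the path matching the volumeMount above:

```python
from pathlib import Path

def registry_token(secret_dir="/secrets/registry"):
    """Read the registry token from its mounted secret file."""
    # Secret writers commonly append a trailing newline; strip it.
    return (Path(secret_dir) / "registry-token").read_text().strip()
```

Calling `registry_token()` per request (or behind a short-lived cache) means a rotated credential propagates within the mount's sync window instead of living until the next deploy.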
The Complete Checklist
| Layer | Control | Config |
|---|---|---|
| Namespace | Pod Security Standards | `restricted` for training, `baseline` for serving |
| Pod | Non-root, read-only FS | `runAsNonRoot`, `readOnlyRootFilesystem` |
| Network | Deny-all + allowlist | No internet for training pods |
| RBAC | Least privilege SA | No cluster-admin, no secret listing |
| GPU | Taints + node selectors | Only ML pods on GPU nodes |
| Runtime | Falco rules | Detect pickle, unauthorized GPU, untrusted downloads |
| Secrets | External Secrets Operator | Mount as files, auto-rotate |
| Images | Digest pinning | Never use :latest for ML images |
Key Takeaways
- Training pods should never have internet access — they pull data from internal storage, not the open internet.
- GPU nodes need dedicated taints — prevents resource theft and cryptomining.
- Model files are code — monitor for unsafe deserialization (pickle, torch.load) with Falco.
- Separate service accounts per pipeline stage — training, serving, and monitoring each get minimal RBAC.
- Pin everything by digest — container images, model files, and base images.
The organizations running GPU clusters without these controls are one compromised model file away from a full cluster takeover.
Based on production Kubernetes security audits for AI/ML platforms running on EKS, GKE, and bare-metal GPU clusters.