Stop Running LLM Workloads on Vanilla Kubernetes

#llm #kubernetes #devops #security

TL;DR: Kubernetes schedules LLM workloads well, but it does not give them the isolation boundary they need once they start calling tools, executing code, or handling tenant data.

Open Source Summit North America made one thing obvious: the cloud native crowd has moved from "can Kubernetes run LLM workloads?" to "what breaks when we trust Kubernetes too much?"

That is the right question.

The default Kubernetes security model assumes a pod is mostly an application packaging unit. It gives you namespaces, cgroups, seccomp, AppArmor, service accounts, admission control, and network policy. All of that matters. None of it changes the central fact that normal containers share the host kernel.

For a stateless API, that tradeoff is usually fine. For an LLM tool runner that can read files, call APIs, invoke Python, shell out to package managers, and chain actions across systems, that boundary starts looking thin.

The uncomfortable version is this: vanilla Kubernetes is orchestration, not containment.

The Problem

LLM inference by itself is not the scary part. A model server that receives a prompt and returns tokens is mostly a specialized API service with GPU scheduling problems.

The risk changes when the workload gains agency:

Prompt input
  -> retrieval
  -> tool selection
  -> code execution
  -> network call
  -> file write
  -> another tool call

At that point, the workload is no longer just serving traffic. It is interpreting untrusted text and turning it into actions.

That is why the recent CNCF security conversation around AI sandboxing matters. Kubernetes can restart a failed pod, route around a bad node, and roll a deployment. It cannot understand whether a prompt is trying to turn a tool into an escape path. It also cannot turn a shared kernel into a hard tenant boundary.

What I Tried First

My first instinct was the usual Kubernetes hardening stack:

apiVersion: v1
kind: Pod
metadata:
  name: llm-worker
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: worker
      image: ghcr.io/example/llm-worker:latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]

That should still be the baseline. The mistake is treating it as the finish line.

Pod Security Standards reduce obvious footguns. NetworkPolicy controls blast radius. RBAC prevents a compromised workload from casually listing secrets or mutating the cluster. Admission policies keep the platform honest.
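Pod Security Standards are applied per namespace, through labels read by the built-in Pod Security admission controller. A minimal sketch, assuming the agents live in a namespace called `llm-agents` (a hypothetical name, not something from the setup above):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llm-agents  # hypothetical namespace name
  labels:
    # Reject pods that violate the "restricted" profile.
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # Also surface violations as client warnings and audit annotations.
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

The hardened pod spec above should pass this profile, since it already sets `runAsNonRoot`, a `RuntimeDefault` seccomp profile, `allowPrivilegeEscalation: false`, and drops all capabilities.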

But an LLM agent running untrusted code is not just a badly configured web pod. It is closer to a multi-tenant execution service. That needs a runtime boundary, not only a YAML checklist.

The Runtime Choice

The Kubernetes primitive that makes this manageable is RuntimeClass.

Instead of creating one magical "secure cluster," you route workloads by risk:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata

Then each workload declares the boundary it needs:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tool-using-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tool-using-agent
  template:
    metadata:
      labels:
        app: tool-using-agent
    spec:
      runtimeClassName: kata
      serviceAccountName: llm-agent
      containers:
        - name: agent
          image: ghcr.io/example/tool-agent:2026.05
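A declaration like `runtimeClassName: kata` only helps if nobody can forget it. One way to enforce it is a ValidatingAdmissionPolicy, which is built into Kubernetes 1.30 and later. This is a sketch under assumptions: the namespace label `workload-risk: agent` and both resource names are hypothetical, and the allowed runtime names match the RuntimeClass objects defined above.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-sandboxed-runtime  # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
  validations:
    # CEL: the pod must name one of the sandboxed runtime classes.
    - expression: "has(object.spec.runtimeClassName) && object.spec.runtimeClassName in ['gvisor', 'kata']"
      message: "Pods in agent namespaces must set runtimeClassName to gvisor or kata."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-sandboxed-runtime-agents  # hypothetical name
spec:
  policyName: require-sandboxed-runtime
  validationActions: ["Deny"]
  matchResources:
    # Only namespaces carrying this (hypothetical) label are checked.
    namespaceSelector:
      matchLabels:
        workload-risk: agent
```

Because the policy matches pods rather than Deployments, a rejected workload shows up as a failed pod creation in the ReplicaSet events, not as a rejected `kubectl apply`.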

My rule of thumb:

| Workload | Runtime | Why |
| --- | --- | --- |
| Plain inference API | `runc` or `gvisor` | Low tool risk, latency-sensitive |
| Retrieval worker with narrow egress | `gvisor` | Better syscall boundary with less operational change |
| Agent that calls tools | `kata` | VM boundary per pod, Kubernetes-friendly |
| Arbitrary code execution | Firecracker-style microVM | Treat it like untrusted tenant compute |

gVisor is the easiest first step because it integrates as an OCI runtime through runsc. Kata is the better fit when the isolation requirement is stronger and a VM per pod is acceptable. Firecracker is the most interesting boundary for code execution, but it is also the one I would least casually bolt onto an existing cluster without a real operations plan.
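A VM per pod has a cost the scheduler should know about. RuntimeClass has two optional fields for this: `overhead`, which is added to the pod's resource accounting, and `scheduling`, which steers pods to nodes that actually have the runtime installed. The following is a sketch of an extended `kata` definition; the overhead numbers are placeholders to be replaced with measured values, and the node label is hypothetical.

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
overhead:
  podFixed:
    # Placeholder values, not measurements. Counted against node capacity
    # and namespace quota on top of the containers' own requests.
    memory: "160Mi"
    cpu: "250m"
scheduling:
  nodeSelector:
    # Hypothetical label for nodes where the Kata runtime is installed.
    sandbox.example.com/kata: "true"
```

With `scheduling.nodeSelector` set, a pod that asks for `kata` cannot land on a node where the handler is missing, which would otherwise fail at container creation time.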

The Minimum Policy Set

The runtime is only one layer. I would not run LLM workloads without this set:

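apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-worker-egress
spec:
  # Applies to the agent pods from the Deployment above.
  podSelector:
    matchLabels:
      app: tool-using-agent
  policyTypes: ["Egress"]
  egress:
    # Model gateway over TLS. kubernetes.io/metadata.name is set automatically
    # on every namespace; a bare "name" label only matches if you add it yourself.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: model-gateway
      ports:
        - protocol: TCP
          port: 443
    # OTLP gRPC telemetry.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: telemetry
      ports:
        - protocol: TCP
          port: 4317
    # DNS. Assumes cluster DNS runs in kube-system; without this rule the
    # pod cannot resolve the service names it is allowed to reach.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53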
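An allow list like this only constrains the pods its `podSelector` matches. Any other pod in the namespace stays wide open unless something selects it too. A common companion, sketched here, is a namespace-wide default deny so that every new workload starts with no traffic and has to earn its exceptions. The policy name is an arbitrary choice.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all  # arbitrary name
spec:
  # An empty selector matches every pod in the namespace.
  podSelector: {}
  # Listing both types with no rules denies all ingress and egress.
  policyTypes: ["Ingress", "Egress"]
```

NetworkPolicy objects are additive, so the egress allow list still applies on top of this. Both only take effect if the cluster's network plugin enforces NetworkPolicy.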

Also make the service account boring:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: llm-agent
automountServiceAccountToken: false

If the workload does not need Kubernetes API access, do not mount a token. If it does, bind only the exact verbs it needs.
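For the case where the agent does need the API, this is roughly what "only the exact verbs" looks like. It is a sketch with hypothetical names: the `llm-agents` namespace, the `agent-tools` ConfigMap, and the Role name are all invented for illustration.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: llm-agent-read-config  # hypothetical
  namespace: llm-agents        # hypothetical
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    # Pin the rule to one named object instead of every ConfigMap.
    resourceNames: ["agent-tools"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: llm-agent-read-config
  namespace: llm-agents
subjects:
  - kind: ServiceAccount
    name: llm-agent
    namespace: llm-agents
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: llm-agent-read-config
```

A workload that uses this binding needs a token after all. Setting `automountServiceAccountToken: true` in that pod's spec overrides the service account default for that pod only.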

Benchmark Plan

I am not going to fake GPU numbers from a laptop. The benchmark needs a real GPU node before I publish final performance claims.

This is the matrix I would fill in:

| Runtime | Cold start p50 | Cold start p95 | Tokens per second | RSS overhead | Notes |
| --- | --- | --- | --- | --- | --- |
| runc | TODO | TODO | TODO | TODO | baseline |
| gVisor | TODO | TODO | TODO | TODO | syscall boundary |
| Kata | TODO | TODO | TODO | TODO | VM per pod |
| Firecracker | TODO | TODO | TODO | TODO | strongest code runner candidate |

The important part is measuring the right things. Startup time matters for bursty agents. Throughput matters for inference. RSS overhead matters because GPU nodes are already expensive. Operational failure modes matter more than all three.
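For the cold start columns, one low-effort probe is a Job that starts the same image many times in sequence under a given runtime. This is a sketch, not a finished harness: the Job name and completion count are arbitrary, and it assumes the image ships a `true` binary. Run one copy per RuntimeClass and compare.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: coldstart-kata  # arbitrary name; one Job per runtime under test
spec:
  completions: 20  # number of samples
  parallelism: 1   # one pod at a time so starts do not compete
  backoffLimit: 0
  template:
    spec:
      runtimeClassName: kata  # swap for gvisor, or remove for the runc baseline
      restartPolicy: Never
      containers:
        - name: probe
          image: ghcr.io/example/llm-worker:latest
          # Exit immediately; only the startup path is being measured.
          command: ["true"]
```

Cold start per sample is then the gap between the pod's `metadata.creationTimestamp` and the container's `startedAt` in the pod status. Those timestamps have one-second resolution, so this is only good for coarse comparisons; finer numbers need the probe to log its own start time.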

The Takeaway

If you are running a normal model server, Kubernetes plus standard hardening may be enough.

If you are running tool-using agents, code execution, tenant prompts, or workloads with broad egress, plain pods are the wrong abstraction. Use Kubernetes for scheduling. Use sandboxed runtimes for containment. Keep policy enforcement outside the model path where possible.

Kubernetes is still the control plane. It just should not be the only security boundary.