Compute Containers

#kata #containers

Kata Containers is a secure container runtime that runs each container (or pod) inside a lightweight virtual machine, combining the isolation of VMs with the speed and developer experience of containers.

Goal: strengthen multi-tenant isolation beyond what namespaces/cgroups alone can safely provide, especially against kernel-level escapes.

Approach: replace the usual low-level runtime (e.g., runc) with a runtime that boots a micro-VM per sandbox and runs the container inside it.

Result: similar workflow (docker, containerd, Kubernetes), but each workload gets its own guest kernel and userspace boundary.

How it started

Announced in December 2017 as a merger of two secure container efforts, Intel’s Clear Containers and Hyper.sh’s runV, both of which used hardware virtualization to isolate containers inside lightweight virtual machines.

Hosted by the OpenStack Foundation (now the OpenInfra Foundation), positioned as a community-driven way to blend VM-grade isolation with container workflows.

Default runtimes such as runc already offer strong performance and a smooth developer experience. However, in multi-tenant or untrusted workload scenarios, a simple container boundary may not provide adequate isolation since all containers share the same host kernel. Kata Containers introduces an additional layer of defense by leveraging hardware virtualization to isolate each workload. This design makes it significantly more difficult for a compromised container to escape or impact other containers and the host system.

Description

Prior - software isolation between containers

With Kata - hardware (hypervisor driven) isolation between containers

At a high level, a Kata deployment introduces:

Hypervisor layer: QEMU, Cloud Hypervisor, or Firecracker, using KVM to create a minimal VM per pod or container.

Guest assets: a trimmed guest kernel and rootfs optimized for fast boot and small memory footprint.

Kata agent: runs inside the guest; receives commands from the host runtime to create namespaces, mounts, and processes for the container.

Runtime integration: an OCI-compatible runtime that containerd/Docker/Kubelet can call instead of runc.

Faster container creation: the rootfs/images for the container are prepared on one side while the sandbox (VM) for the pod is created on the other. Hotplug then joins the two and the pod is started.

Container-sized footprint: there is no need to allocate real memory up front; nvdimm is used to present virtual memory to the guest as a device. KSM (kernel samepage merging) additionally deduplicates identical memory pages, giving much higher density than traditional VMs.

Networking: MACVTAP bridges the container's veth pair to the VM's TAP device. tc rules can additionally be applied to perform traffic transforms.

Storage: volumes can be backed either by a block device or by 9pfs. When block-device storage is available on the host, virtio-blk gives performance much closer to bare-metal (native) than the 9pfs path. See https://dev.to/gansvv/storage-for-kata-containers-9pfs-vs-virtio-blk-f6n

CPU/Memory multi-tenancy:

Networking support for K8s multi-tenancy:

From the workload’s perspective, it is still a container; from the host’s perspective, it is a VM with a single-tenant container payload.
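The KSM density point above can be made a little more concrete. The following is a minimal sketch, assuming a Linux host that exposes the standard KSM counters under /sys/kernel/mm/ksm; the helper names `read_ksm_counter` and `ksm_saved_bytes` are made up for these notes and are not part of Kata. The kernel's `pages_sharing` counter is a rough indication of how many pages KSM has deduplicated, so multiplying it by the page size approximates the memory saved.

```python
import os

KSM_DIR = "/sys/kernel/mm/ksm"  # standard Linux sysfs location for KSM counters


def read_ksm_counter(name, base=KSM_DIR):
    """Read one integer KSM counter (e.g. 'pages_sharing') from sysfs."""
    with open(os.path.join(base, name)) as f:
        return int(f.read().strip())


def ksm_saved_bytes(pages_sharing, page_size=4096):
    """Approximate bytes saved by KSM: deduplicated pages times page size.

    page_size defaults to 4 KiB, which is an assumption; pass the real
    value (e.g. os.sysconf("SC_PAGE_SIZE")) on hosts that differ.
    """
    return pages_sharing * page_size


if __name__ == "__main__":
    try:
        sharing = read_ksm_counter("pages_sharing")
    except (OSError, ValueError):
        sharing = 0  # KSM not available or not readable on this host
    saved_mib = ksm_saved_bytes(sharing) / (1024 * 1024)
    print(f"KSM is saving roughly {saved_mib:.1f} MiB")
```

On a host running many Kata VMs that boot the same guest kernel and rootfs, this number is one quick way to see how much of the per-VM overhead is being shared.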

Containers

Firecracker offers only a minimal set of emulated devices and no device passthrough, so GPUs don’t work with Firecracker.

The outer runtime manages the lifecycle of the VM. The inner runtime is OCI compliant and adheres to CNI/CSI/CRI, which enables k8s networking and storage functionality.

Fortify applications running on cluster:

Kata has already been validated with GPUs on bare metal:

Running a confidential environment: the components must be trustworthy, meaning the kernel, guest image, and memory are in a specific known state. The workload owner can then review the attestation report and decide what to do, e.g., release secrets into that environment.
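The "review the report, then release secrets" decision can be sketched as follows. This is only an illustration under stated assumptions: the report is modelled as a plain dict of measurements, the keys (`kernel`, `guest_image`) and the function names are hypothetical, and the sketch skips verifying the hardware signature on the report, which a real verifier must do first.

```python
import hmac


def evidence_matches(report, reference):
    """Return True only if every reference measurement appears in the
    report with exactly the expected value (constant-time comparison)."""
    for key, expected in reference.items():
        actual = report.get(key)
        if actual is None:
            return False
        if not hmac.compare_digest(str(actual).encode(), str(expected).encode()):
            return False
    return True


def release_secret(report, reference, secret):
    """Hand out the secret only to an environment in the known-good state."""
    return secret if evidence_matches(report, reference) else None


if __name__ == "__main__":
    # Hypothetical reference values the workload owner trusts.
    reference = {"kernel": "abc123", "guest_image": "def456"}
    good = {"kernel": "abc123", "guest_image": "def456"}
    bad = {"kernel": "abc123", "guest_image": "tampered"}
    print(release_secret(good, reference, b"db-password"))
    print(release_secret(bad, reference, b"db-password"))
```

The design choice is deny by default: a missing or mismatching measurement yields no secret, so an environment in an unknown state never receives one.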

RATS reference arch and attestation:

Runtime class on the pod spec: kata-qemu-gpu-snp for AMD (SEV-SNP) systems, tdx for Intel TDX systems, cca for Arm systems.
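As a rough illustration of selecting the runtime class on a pod, the sketch below builds a minimal pod manifest as a Python dict and prints it as JSON (which kubectl accepts alongside YAML). `runtimeClassName` is the standard pod spec field; the function name, pod name, and image are placeholders for these notes, and the only runtime class name used is the kata-qemu-gpu-snp one mentioned above.

```python
import json


def kata_pod_manifest(name, image, runtime_class):
    """Build a minimal pod manifest that asks for a specific RuntimeClass."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Tells the kubelet/CRI which runtime handler to use for this pod.
            "runtimeClassName": runtime_class,
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    # "demo" and "example.com/app:latest" are placeholder values.
    manifest = kata_pod_manifest("demo", "example.com/app:latest", "kata-qemu-gpu-snp")
    print(json.dumps(manifest, indent=2))
```

The named RuntimeClass must already exist in the cluster and point at a handler configured on the nodes; otherwise the pod is rejected or stays unscheduled.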

Confidential containers: encrypted/signed container images (host no longer sees the images), confidential containers for every pod.

Terminology

GDR, GDS: GPUDirect RDMA (Remote Direct Memory Access) and GPUDirect Storage (GDS) are NVIDIA technologies for high-performance data transfer. They enable direct paths between GPUs and network/storage devices, bypassing the CPU to cut latency and raise bandwidth, which is crucial for AI/HPC. GPUDirect RDMA connects GPUs to network interfaces (NICs) or storage adapters, while GDS specifically links GPUs directly to local or remote storage (like NVMe-oF). Both use PCIe.
Cold plug vs hot plug VFIO: VFIO devices are hot-plugged on a bridge by default. In a confidential compute environment, hot-plugging can compromise security, so Kata supports cold-plugging VFIO devices to a bridge-port, root-port, or switch-port. In VFIO (Virtual Function I/O), "cold plug" means attaching a device (like a GPU) to a VM before the VM starts, while "hot plug" means attaching it while the VM is running. Cold plug is the standard, reliable passthrough path; hot plug is a more dynamic, but potentially more complex, way to add or remove devices live.
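The cold-plug rule for confidential environments can be expressed as a small sanity check. This is a sketch under assumptions: the setting names `hot_plug_vfio` and `cold_plug_vfio` and the "no-port" value (meaning disabled) are assumed here and should be checked against the Kata configuration file of the version in use; only bridge-port, root-port, and switch-port come from the notes above.

```python
# Port types from the notes above, plus an assumed "no-port" meaning disabled.
VALID_PORTS = {"no-port", "bridge-port", "root-port", "switch-port"}


def check_vfio_plug(hot_plug, cold_plug, confidential):
    """Return a list of problems with a VFIO plug configuration.

    An empty list means the settings look consistent with the rule that
    confidential guests should get their devices cold-plugged.
    """
    problems = []
    for label, value in (("hot_plug_vfio", hot_plug), ("cold_plug_vfio", cold_plug)):
        if value not in VALID_PORTS:
            problems.append(f"{label}: unknown port type {value!r}")
    if confidential and hot_plug != "no-port":
        problems.append("hot plug should be disabled for a confidential guest")
    if confidential and cold_plug == "no-port":
        problems.append("confidential guest has no cold-plug port for its device")
    return problems


if __name__ == "__main__":
    print(check_vfio_plug("no-port", "root-port", confidential=True))
    print(check_vfio_plug("bridge-port", "no-port", confidential=True))
```

A check like this could run before deploying a GPU workload, so that a hot-plug default does not slip into a confidential setup unnoticed.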

Good References