Michael Levan

Posted on May 31 • Originally published at cloudnativedeepdive.com

Agent Substrate: The Agentic AI Isolation Layer On K8s

#ai #programming #kubernetes #agentskills

Isolating/sandboxing Agents gives you the ability to run agentic workflows in a safe, secure, and governed way. Without it, your Agents can access just about anything you can, on top of performing any type of web research and API calls.

With sandboxing solving this agentic issue, the next question is "where and how will sandboxes run?" and that's where Substrate comes into play.

In this blog, you'll learn about what Substrate is and how to deploy it in GKE.

Prerequisites

To follow along with this blog post from a hands-on perspective, you will need:

A GCP account
A GKE cluster

What Is Agent Substrate

There are two things that Kubernetes is incredibly good at out of the box:

Orchestration
Clustering worker nodes to ensure users have a pool of GPU, CPU, and memory

What can be built on top of k8s, but isn't there out of the box, is more efficient hardware resource management, lower latency, and the implementation of Agentic workflows (e.g., running and isolating Agents). However, the primitives of Kubernetes (Pods, autoscaling, clustering of Worker Nodes) are still very much needed, so there needs to be a tool/platform for the Agentic era that builds on top of what we know as k8s today: something that has its own Control Plane/management layer but still uses what Kubernetes has to offer.

That's where Agent Substrate comes into play.

Under the hood, Substrate uses gVisor (the same sandbox used by the Agent Sandbox project from the CNCF SIG), a container sandbox developed by Google that focuses on security, isolation, and efficiency (e.g., not taking up a ton of hardware resources).

Substrate Internals

There are four main parts to Substrate:

ate-api-server (control plane)
atenet-router (the Envoy/DNS router)
valkey (the state store)
pod-certificate-controller (the in-cluster mTLS signer)

On top of these sit the "agent-like Actors" and the Workers they run on.

You will also see atelet, a per-node agent (a DaemonSet that runs on every worker node). It manages Worker Pods, drives runsc checkpoint/restore, and streams snapshots to and from the GCS bucket that you will create in an upcoming section.

System Components

The four workloads listed above each mount a podCertificate volume. The Pod certs let these components (or rather, the Pods running them) receive auto-issued, auto-rotated TLS certs so they can do mTLS between each other.

💡

Per Google: Pod Certificates is a native Kubernetes feature that automatically issues short-lived X.509 TLS certificates directly to running Pods. Introduced as an alpha feature in Kubernetes v1.34 and advanced to Beta in v1.35, this capability allows workloads to authenticate to the kube-apiserver and establish mutual TLS (mTLS) with other workloads natively.

Pod Certificates are a hard requirement for Agent Substrate, as they're how Substrate gives each component an auto-rotated per-pod mTLS identity. The pod declares a podCertificate projected volume source, which triggers a PodCertificateRequest, and the signer fulfills it. The kubelet projects (and auto-rotates) the credential bundle into the pod, and that volume must be mounted for the pod to run.
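To make that flow a bit more concrete, here is a minimal sketch of a Pod that declares a podCertificate projected volume. It is not taken from the Substrate manifests. The Pod name, image, mount path, and signer name (example.com/demo-signer) are all hypothetical, and the field names follow the upstream Kubernetes Pod Certificates documentation as I understand it, so check them against your cluster version before relying on them.

```shell
# Write an illustrative Pod manifest to the file path given as $1.
# Hypothetical values: pod name, image, mount path, and signerName.
write_podcert_example() {
  cat > "$1" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: podcert-demo
spec:
  containers:
  - name: main
    image: debian
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: pod-certs
      mountPath: /var/run/pod-certs
      readOnly: true
  volumes:
  - name: pod-certs
    projected:
      sources:
      - podCertificate:
          signerName: example.com/demo-signer
          keyType: ED25519
          credentialBundlePath: credentialbundle.pem
EOF
}
```

Calling write_podcert_example podcert-demo.yaml gives you a file to read through or to pass to kubectl apply --dry-run=server -f podcert-demo.yaml. On a cluster without the feature enabled, expect the API server to reject or ignore the podCertificate source.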

To clarify the distinction between the two kinds of identity:

Pod certs == identity for Substrate's own infrastructure (the four pods above). This is what needs Pod Certificates.
Actor identity == the SessionIdentity gRPC service (MintJWT/MintCert), backed by the session-id JWT/CA pool secrets. Actor/worker/ateom pods do not mount podCertificate volumes.

So the feature isn't about giving agents certs, it's about the platform securing itself.

Actors

Substrate runs Agent-like workloads called “actors”. It then maps the actors onto what Substrate calls "workers", which are k8s Pods. With workers, you get:

Functionality for managing the actor lifecycle (e.g., create, destroy, suspend, and resume actors)
The ability to assign actors to workers in real time
The ability to route incoming traffic to actors

Because of Substrate's efficiency in how Actors run, you can run a plethora of Actors on a Single Worker. Google tested this with 250 Stateful Actors across only 8 Pods (the Workers).

Interacting With Substrate

Because Substrate has its own management plane and resources, you can interact with it via its own command-line tool, ate, which runs as a kubectl sub-command.

For example: kubectl ate (more on this in the upcoming configuration sections).

Environment Configuration Needs/Prereqs

There are a few things that you will need configured for your Google Kubernetes Engine (GKE) cluster, GCP environment, and CLI tools.

gcloud and all of the auth that goes with it to manage your GCP and GKE environment on the terminal.

export PROJECT_ID=<your-project-id>

gcloud auth login
gcloud auth application-default login --project="$PROJECT_ID"
gcloud auth configure-docker gcr.io

The required APIs for Substrate.

gcloud services enable \
  cloudresourcemanager.googleapis.com \
  container.googleapis.com \
  networkconnectivity.googleapis.com \
  serviceusage.googleapis.com \
  storage.googleapis.com \
  --project="$PROJECT_ID"

The Agent Substrate repo cloned down in your local environment. You can clone it from here.
Local tools on your terminal
1. Go (v1.26.3 or above)
2. kubectl
3. git
4. openssl for converging the Valkey CA cert (more on that later)
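Before moving on, it can save time to confirm the prerequisites above are actually in place. The two helpers below are a minimal sketch, not part of the Substrate repo: missing_tools and missing_apis are names I made up, and the second one assumes you pipe in one enabled service name per line.

```shell
# Print every name from the argument list that is not an executable on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Print each required API (arguments) that is absent from the list of
# enabled services read from stdin, one service name per line.
# The body runs in a subshell so its variables do not leak.
missing_apis() (
  enabled="$(cat)"
  for api in "$@"; do
    printf '%s\n' "$enabled" | grep -qxF "$api" || echo "$api"
  done
)
```

For example, missing_tools gcloud go kubectl git openssl prints nothing when everything is installed. For the APIs, something like gcloud services list --enabled --project="$PROJECT_ID" --format="value(config.name)" piped into missing_apis container.googleapis.com storage.googleapis.com should print only the services you still need to enable.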

Why Use GKE or Kind?

The podCertificate projected volume source is code in the kubelet/apiserver, but it's behind feature gates that default to off as of k8s 1.36. To use it, you need to turn them on via the k8s API Server. Something like:

--feature-gates=PodCertificateRequest=true,ClusterTrustBundle=true,ClusterTrustBundleProjection=true
--runtime-config=certificates.k8s.io/v1beta1=true

The problem is that not all managed k8s services (for example, AKS) allow you to turn on this feature. GKE does, because it provides a "knob" out of the box, and unmanaged/raw k8s clusters (Kind, kubeadm, etc.) do as well because you manage the configuration yourself.
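If you go the Kind route, the same gates can be expressed in a kind cluster config instead of raw API Server flags. The helper below is a sketch under a few assumptions: write_kind_config is a made-up name, the output path is up to you, and the node image you use must ship a Kubernetes version that actually includes these gates.

```shell
# Write a kind cluster config to the file path given as $1. It turns on the
# Pod Certificate feature gates and the certificates.k8s.io/v1beta1 API group.
write_kind_config() {
  cat > "$1" <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  PodCertificateRequest: true
  ClusterTrustBundle: true
  ClusterTrustBundleProjection: true
runtimeConfig:
  "certificates.k8s.io/v1beta1": "true"
EOF
}
```

After write_kind_config kind-podcerts.yaml, you would create the cluster with kind create cluster --config kind-podcerts.yaml. The rest of this post sticks with GKE.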

ko

ko is a build tool for Go container images from Google. It builds an image straight from Go source without a Dockerfile or a Docker daemon, then pushes it to your KO_DOCKER_REPO. Valkey (the state store) can be deployed for you by an install script, so you don't have to install it manually.

Configure Your Environment

With the prereqs, environment configs, and explanations of Agent Substrate and its components, let's get hands-on and deploy the Substrate environment.

Within the substrate directory that you cloned, run the following:

cp hack/ate-dev-env.sh.example .ate-dev-env.sh

Edit .ate-dev-env.sh with your environment configs. Since you already have a GKE cluster per the Prerequisites section, you will only need the following in the file:

  # --- Project / identity ---
  export PROJECT_ID=my-substrate-proj
  export PROJECT_NUMBER="$(gcloud projects describe "${PROJECT_ID}" --format="value(projectNumber)")"

  # --- Your existing cluster ---
  export CLUSTER_NAME=substrate-poc
  export CLUSTER_LOCATION=us-central1-c
  # Set to your kubeconfig context so install-ate.sh skips `gcloud container clusters get-credentials`:
  export KUBECTL_CONTEXT=gke_my-substrate-proj_us-central1-c_substrate-poc

  # --- Snapshot bucket (GCE_REGION is the BUCKET's region, not the cluster's) ---
  export GCE_REGION=us-central1
  export BUCKET_NAME=snapshot-substrate-test-${PROJECT_ID}

  # --- Image registry for ko ---
  export KO_DOCKER_REPO="gcr.io/${PROJECT_ID}/ate-images"
  export KO_DEFAULTPLATFORMS=linux/amd64
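A typo in this file tends to surface much later as a confusing gcloud or kubectl error, so a quick sanity check can help. The helpers below are a minimal sketch and not part of the Substrate repo (gke_context and unset_vars are hypothetical names). The first one relies on the context naming pattern that GKE credentials use, gke_<project>_<location>_<cluster>, which matches the KUBECTL_CONTEXT value shown above.

```shell
# Compose the kubeconfig context name for a GKE cluster:
# gke_<project>_<location>_<cluster>.
gke_context() {
  printf 'gke_%s_%s_%s\n' "$1" "$2" "$3"
}

# Print the name of every listed variable that is unset or empty.
unset_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"   # indirect lookup of the variable called $name
    [ -n "$value" ] || echo "$name"
  done
}
```

After sourcing the file, unset_vars PROJECT_ID PROJECT_NUMBER CLUSTER_NAME CLUSTER_LOCATION BUCKET_NAME KO_DOCKER_REPO should print nothing, and gke_context "$PROJECT_ID" "$CLUSTER_LOCATION" "$CLUSTER_NAME" should match your KUBECTL_CONTEXT.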

Derive the two identities from the project values you set in the file above.

export ATELET_PRINCIPAL="principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/ate-system/sa/atelet"
export NODE_SA="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"

Ensure the GKE cluster has the Pod Certificate beta APIs and Workload Identity enabled.

source .ate-dev-env.sh

gcloud container clusters update "$CLUSTER_NAME" \
  --location="$CLUSTER_LOCATION" --project="$PROJECT_ID" \
  --enable-kubernetes-unstable-apis=certificates.k8s.io/v1beta1/podcertificaterequests,certificates.k8s.io/v1beta1/clustertrustbundles

gcloud container clusters update "$CLUSTER_NAME" \
  --location="$CLUSTER_LOCATION" --project="$PROJECT_ID" \
  --workload-pool="${PROJECT_ID}.svc.id.goog"
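Both updates can take a few minutes, and it's worth confirming they landed before going further. The following is a sketch that assumes the variables from .ate-dev-env.sh are already sourced. The function names are made up for this post, and the describe field path (workloadIdentityConfig.workloadPool) is my understanding of the cluster resource, so adjust it if your gcloud output differs.

```shell
# Succeeds when the cluster's Workload Identity pool is the expected one.
workload_pool_ok() {
  actual="$(gcloud container clusters describe "$CLUSTER_NAME" \
    --location="$CLUSTER_LOCATION" --project="$PROJECT_ID" \
    --format='value(workloadIdentityConfig.workloadPool)')"
  [ "$actual" = "${PROJECT_ID}.svc.id.goog" ]
}

# Succeeds when the API server lists the podcertificaterequests resource
# in the certificates.k8s.io group.
podcert_api_served() {
  kubectl api-resources --api-group=certificates.k8s.io -o name \
    | grep -q '^podcertificaterequests'
}
```

You could run them as workload_pool_ok && podcert_api_served && echo "cluster ready". If either fails, re-check the two gcloud container clusters update commands above.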

Create a snapshot bucket for your Actors.

gcloud storage buckets create "gs://${BUCKET_NAME}" \
  --project="$PROJECT_ID" --location="$GCE_REGION" --uniform-bucket-level-access

Grant atelet the IAM permissions it needs to interact with the bucket.

gcloud storage buckets add-iam-policy-binding "gs://${BUCKET_NAME}" \
  --member="$ATELET_PRINCIPAL" --role=roles/storage.objectAdmin
gcloud storage buckets add-iam-policy-binding "gs://${BUCKET_NAME}" \
  --member="$ATELET_PRINCIPAL" --role=roles/storage.bucketViewer

Grant project-level IAM permissions for the GKE nodes and atelet.

gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:${NODE_SA}" --role=roles/storage.objectViewer
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:${NODE_SA}" --role=roles/artifactregistry.reader

gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="$ATELET_PRINCIPAL" --role=roles/storage.objectAdmin
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="$ATELET_PRINCIPAL" --role=roles/artifactregistry.reader
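IAM bindings are easy to get subtly wrong, so here is a small sketch for checking what a member actually ended up with at the project level. The function names are hypothetical, PROJECT_ID is assumed to be set, and the flatten/filter/format combination is a common gcloud idiom rather than anything Substrate-specific. Members whose identifiers contain special characters (such as the principal:// form) may need extra quoting inside the filter.

```shell
# List the project-level roles bound to the member given as $1.
roles_for_member() {
  gcloud projects get-iam-policy "$PROJECT_ID" \
    --flatten='bindings[].members' \
    --filter="bindings.members:$1" \
    --format='value(bindings.role)'
}

# Succeeds when member $1 holds role $2 at the project level.
has_project_role() {
  roles_for_member "$1" | grep -qxF "$2"
}
```

For instance, has_project_role "serviceAccount:${NODE_SA}" roles/artifactregistry.reader should succeed once the bindings above are applied.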

New Node Pools

Mounting the Pod Certificate volume is a kubelet (node-level) capability, and a node's kubelet config is fixed when the node is created. Enabling the beta APIs on the control plane doesn't retroactively apply to nodes that already exist. Since this was an existing cluster, its nodes predate the enablement, so they have to be recreated to pick up the feature. The simplest way to get fresh nodes is a new node pool (a same-version upgrade won't recreate them because the nodes already match the control-plane version).

Create a node pool that uses a c3 machine type.

  gcloud container node-pools create substrate-pool \
    --cluster="$CLUSTER_NAME" --location="$CLUSTER_LOCATION" --project="$PROJECT_ID" \
    --machine-type=c3-standard-4 --num-nodes=1 \
    --workload-metadata=GKE_METADATA

Wait for the node pools.

kubectl get nodes -l cloud.google.com/gke-nodepool=substrate-pool
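If you would rather block until the new nodes are usable instead of re-running the get command, kubectl wait can do that. This is a minimal sketch: wait_for_pool is a made-up helper name, and the 10 minute default timeout is an arbitrary choice.

```shell
# Block until every node in the pool named $1 reports Ready.
# $2 is an optional timeout (defaults to 10m).
wait_for_pool() {
  kubectl wait --for=condition=Ready nodes \
    -l "cloud.google.com/gke-nodepool=$1" --timeout="${2:-10m}"
}
```

Run wait_for_pool substrate-pool before deleting the old pool. Note that kubectl wait returns an error if no nodes match the label yet, so give the pool a moment to register its nodes first.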

Delete the old node pools.

  gcloud container node-pools delete default-pool \
    --cluster="$CLUSTER_NAME" --location="$CLUSTER_LOCATION" --project="$PROJECT_ID"

With the cluster environment configured and installed, let's install Agent Substrate.

Installing Substrate

Within the substrate directory, you will see the install-ate.sh file under the hack directory. It builds the core images (via ko, pushed to KO_DOCKER_REPO) and deploys the Agent Substrate control plane/management plane and node components:

The CRDs
ate-api-server (control plane)
pod-certificate-controller (in-cluster mTLS signer that fulfills the PodCertificateRequests)
atelet (node DaemonSet)
atenet (DNS + Envoy router)
valkey (dynamic state store)

Run the following command:

./hack/install-ate.sh --deploy-ate-system

You'll see the installation in progress.

Wait for the system Pods to come up.

kubectl get pods -n ate-system --watch
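Watching the list works, but when something is stuck it helps to see only the Pods that aren't healthy yet. The filter below is a small sketch (not_ready_pods is a hypothetical name) that assumes the default kubectl column layout of NAME, READY, STATUS, RESTARTS, AGE.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the name
# and status of every Pod whose STATUS is neither Running nor Completed.
not_ready_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Use it as kubectl get pods -n ate-system --no-headers piped into not_ready_pods. Empty output means every system Pod is up.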

Once the Pods are running, Substrate is installed.

Install The Substrate CLI

With the Substrate system up and running, you need a way to interact with its control/management plane. To do that, you'll use the ate sub-command.

Install the command.

go install ./cmd/kubectl-ate

Add the binary to your path (the example below assumes zsh; use ~/.bashrc or your shell's equivalent otherwise).

echo 'export PATH="$PATH:$(go env GOPATH)/bin"' >> ~/.zshrc
source ~/.zshrc
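The reason the path matters is that kubectl discovers plugins by looking for executables named kubectl-<name> on your PATH, which is how kubectl-ate becomes kubectl ate. If you re-source your shell config often, the echo approach above can append the same directory repeatedly. Here is a sketch of an idempotent alternative; path_append_once is a made-up helper name, not something from the Substrate repo.

```shell
# Append the directory given as $1 to PATH only if it is not already there.
path_append_once() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already on PATH, nothing to do
    *) PATH="$PATH:$1" ;;
  esac
}
```

You could define this in your shell config and then call path_append_once "$(go env GOPATH)/bin" right after it.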

Test out the sub-command.

kubectl ate --help

You now have ate installed and are ready to interact with Agent Substrate.

Wrapping Up

As the Agentic AI era continues to change how we think about Agents, the systems that we run them on will change as well. The next phase of those systems is Sandboxes, which will continue to rise in popularity for many organizations because they give you the ability to isolate Agents from an ingress and egress perspective, and to control what actions Agents can take with the tools available to them. I see Sandboxes becoming especially important as autonomous Agents become more relevant.
