<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matt Camp</title>
    <description>The latest articles on DEV Community by Matt Camp (@mattcamp).</description>
    <link>https://dev.to/mattcamp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1463166%2F336b4922-976f-420d-87e6-7dc4d93a9176.png</url>
      <title>DEV Community: Matt Camp</title>
      <link>https://dev.to/mattcamp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mattcamp"/>
    <language>en</language>
    <item>
      <title>Orchestrating Secure AI Agents on Amazon EKS</title>
      <dc:creator>Matt Camp</dc:creator>
      <pubDate>Fri, 27 Mar 2026 12:24:23 +0000</pubDate>
      <link>https://dev.to/mattcamp/orchestrating-secure-ai-agents-on-amazon-eks-50kh</link>
      <guid>https://dev.to/mattcamp/orchestrating-secure-ai-agents-on-amazon-eks-50kh</guid>
      <description>&lt;p&gt;&lt;strong&gt;Subtitle:&lt;/strong&gt; How we went from scaling video analysis on EKS to running autonomous coding agents in a custom agent harness, and why Kubernetes was the obvious choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  The backstory
&lt;/h3&gt;

&lt;p&gt;A couple of years ago, AWS published a &lt;a href="https://aws.amazon.com/solutions/case-studies/unitary-eks-case-study/" rel="noopener noreferrer"&gt;case study&lt;/a&gt; about how our team at Unitary scales Amazon EKS with Karpenter. Three engineers managing 1,000+ nodes at peak, processing 26 million videos a day, 50-70% cost reduction with Spot Instances. It was a good story about what a small team can do with the right infrastructure.&lt;/p&gt;

&lt;p&gt;What that case study didn't cover is what happened next. As our engineering team grew, we started leaning heavily on AI coding agents (Cursor, then Claude Code and OpenAI Codex) to keep pace with development across multiple customer projects. And we hit a wall that will be familiar to anyone running these tools at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem with AI coding agents in production
&lt;/h3&gt;

&lt;p&gt;If you've used Claude Code or Codex, you know the experience: the agent is powerful, but it needs you there. You're approving tool calls, answering questions, watching the terminal. Running &lt;code&gt;--dangerously-skip-permissions&lt;/code&gt; on a developer machine is too risky for most teams. One bad tool call with production credentials is a serious incident. So the human sits there, babysitting.&lt;/p&gt;

&lt;p&gt;This works for individual productivity. Developers are good at multi-tasking; you can review output in one terminal while doing other work. But it doesn't scale to a team running agents across multiple codebases, and it doesn't work when nobody is watching. The agent that loops on a failing test for an hour doesn't care that you stepped into a meeting.&lt;/p&gt;

&lt;p&gt;We needed the same operational maturity for our AI coding tools that we'd built for our ML inference pipelines: safe to run unattended, with automated guardrails replacing the human in the loop, and able to scale. That meant Kubernetes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why EKS was the natural fit
&lt;/h3&gt;

&lt;p&gt;We'd already solved the hard scaling problems on EKS. Karpenter handles node provisioning. We know how to run mixed workloads across Spot and On-Demand. Our team understands the operational model.&lt;/p&gt;

&lt;p&gt;AI coding agents have a different resource profile from ML inference. They're long-running (minutes to hours), I/O-heavy rather than GPU-bound, and each one needs an isolated environment with repository access and API credentials. But the Kubernetes primitives are the same: pods for isolation, Jobs for lifecycle management, Secrets for credentials, NetworkPolicies for egress control.&lt;/p&gt;

&lt;p&gt;So we built &lt;a href="https://github.com/unitaryai/osmia" rel="noopener noreferrer"&gt;Osmia&lt;/a&gt;, an open-source orchestration layer that turns these primitives into a managed AI coding agent platform. We've released it under Apache 2.0.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture on EKS
&lt;/h3&gt;

&lt;p&gt;Osmia runs as a single controller pod that watches for incoming tasks (from ticket systems, webhooks, or direct API calls) and translates each one into a Kubernetes Job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlzlw5rzigaq0ieqjeuj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlzlw5rzigaq0ieqjeuj.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each agent pod runs as non-root with a read-only root filesystem and all Linux capabilities dropped. You can optionally layer on gVisor or Kata for defence in depth. Credentials are scoped per task via IRSA (we use IAM Roles for Service Accounts for any AWS-side access rather than static credentials). An optional NetworkPolicy restricts outbound traffic to HTTPS and SSH, which is enough for git operations and API calls and nothing else. A watchdog monitors every running agent and terminates jobs that exceed their cost ceiling.&lt;/p&gt;
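
&lt;p&gt;As a rough sketch (names and field values here are illustrative, not Osmia's exact defaults), that hardening maps onto standard Kubernetes settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Container-level hardening (fragment of the agent pod spec)
securityContext:
  runAsNonRoot: true
  runAsUser: 10000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
---
# Optional egress lockdown for agent pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: osmia-agent-egress   # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: osmia-agent       # hypothetical label
  policyTypes: ["Egress"]
  egress:
    - ports:
        - { port: 443, protocol: TCP }   # HTTPS: git over https, API calls
        - { port: 22, protocol: TCP }    # SSH: git over ssh
        - { port: 53, protocol: UDP }    # DNS; most egress policies also need name resolution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;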

&lt;p&gt;The controller itself is a standard Go binary using controller-runtime, the same framework that powers most Kubernetes operators. On EKS, it runs as a Deployment with a single replica.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session persistence.&lt;/strong&gt; Agent sessions can be persisted across retries and continuations using PVC-backed storage. Two backends are available: &lt;code&gt;shared-pvc&lt;/code&gt; uses a single ReadWriteMany PVC (EFS on EKS) with per-task subdirectories - simpler to operate. &lt;code&gt;per-taskrun-pvc&lt;/code&gt; dynamically creates and deletes a dedicated ReadWriteOnce PVC (EBS gp3 on EKS) per task run - stronger isolation. Session data includes the agent's conversation history (&lt;code&gt;~/.claude/&lt;/code&gt;) and optionally the workspace, so retry pods can resume with &lt;code&gt;--resume&lt;/code&gt; rather than starting from scratch.&lt;/p&gt;
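
&lt;p&gt;A hedged sketch of that configuration (the key names below are illustrative; only the two backend names come from the description above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Illustrative shape; check the Osmia docs for the real schema
session_persistence:
  enabled: true
  backend: shared-pvc        # one RWX PVC (EFS on EKS), per-task subdirectories
  # backend: per-taskrun-pvc # dedicated RWO PVC (EBS gp3) per task run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;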

&lt;h3&gt;
  
  
  Secrets management on EKS
&lt;/h3&gt;

&lt;p&gt;Agent pods need credentials: API keys for the AI engine, repository access tokens, and sometimes task-specific secrets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native AWS Secrets Manager.&lt;/strong&gt; Osmia ships a built-in AWS Secrets Manager backend that reads secrets directly via the AWS SDK v2. On EKS with IRSA, no credential configuration is needed - the SDK picks up the pod identity token automatically. The setup is three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an IAM role with &lt;code&gt;secretsmanager:GetSecretValue&lt;/code&gt; permission:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"secretsmanager:GetSecretValue"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:secretsmanager:eu-west-1:123456789:secret:osmia/*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Annotate the Osmia controller's ServiceAccount with the role ARN:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789:role/osmia-secrets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Configure the backend in &lt;code&gt;osmia-config.yaml&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;secret_resolver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backends&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws-sm"&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws-secrets-manager"&lt;/span&gt;
      &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eu-west-1"&lt;/span&gt;
        &lt;span class="na"&gt;cache_ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5m"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Secret references use the &lt;code&gt;aws-sm://secret-name#json-field&lt;/code&gt; URI format. If your secret is a JSON object (which is typical in Secrets Manager), the &lt;code&gt;#field&lt;/code&gt; fragment extracts a specific key. If the secret is a plain string, omit the fragment. The backend caches values in memory with a configurable TTL (default 5 minutes) to avoid hitting the Secrets Manager API on every job creation. Multiple secret references pointing at the same secret name share a single cached API call.&lt;/p&gt;
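
&lt;p&gt;For example (the secret names and environment variable mapping here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;env:
  ANTHROPIC_API_KEY: "aws-sm://osmia/anthropic-api-key#api_key"  # JSON secret: extract one field
  DEPLOY_TOKEN: "aws-sm://osmia/deploy-token"                    # plain-string secret: no fragment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;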

&lt;p&gt;For multi-account deployments, the backend supports cross-account access via STS AssumeRole - set &lt;code&gt;assume_role_arn&lt;/code&gt; to the target account's role, and the backend handles credential refresh automatically.&lt;/p&gt;

&lt;p&gt;You can also run multiple backends simultaneously. A team might keep their AI engine API keys in K8s Secrets (simpler to rotate) while pulling task-specific database credentials from Secrets Manager. The multi-backend resolver dispatches by URI scheme, so the two coexist without any changes to the agent configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;secret_resolver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backends&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k8s"&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k8s"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws-sm"&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws-secrets-manager"&lt;/span&gt;
      &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eu-west-1"&lt;/span&gt;
  &lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;allowed_schemes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k8s"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws-sm"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;blocked_env_patterns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS_*"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;External Secrets Operator (zero-code alternative).&lt;/strong&gt; If your team already runs &lt;a href="https://external-secrets.io/" rel="noopener noreferrer"&gt;External Secrets Operator&lt;/a&gt;, you can continue using it. ESO syncs secrets from AWS Secrets Manager into Kubernetes Secrets on a configurable refresh interval. Osmia's built-in K8s backend reads those synced secrets with no code changes. Authentication is via IRSA on the ESO service account, so no static AWS credentials are involved.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ExternalSecret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;osmia-anthropic-key&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;osmia&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;refreshInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;
  &lt;span class="na"&gt;secretStoreRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-secrets-manager&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterSecretStore&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;osmia-anthropic-key&lt;/span&gt;
  &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api_key&lt;/span&gt;
      &lt;span class="na"&gt;remoteRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;osmia/anthropic-api-key&lt;/span&gt;
        &lt;span class="na"&gt;property&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api_key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;HashiCorp Vault.&lt;/strong&gt; Osmia also ships a built-in Vault backend for teams using Vault with Kubernetes auth. Configure it via the &lt;code&gt;vault://&lt;/code&gt; scheme in the same &lt;code&gt;secret_resolver.backends&lt;/code&gt; array.&lt;/p&gt;
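
&lt;p&gt;Configuration follows the same shape as the other backends; the &lt;code&gt;config&lt;/code&gt; keys below are illustrative rather than the documented schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;secret_resolver:
  backends:
    - scheme: "vault"
      backend: "vault"
      config:
        address: "https://vault.internal:8200"   # illustrative Vault address
        auth_role: "osmia"                       # illustrative Kubernetes auth role
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;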

&lt;h3&gt;
  
  
  What Karpenter gives us here
&lt;/h3&gt;

&lt;p&gt;Agent pods are bursty. A Monday morning might bring 30 tickets; a Saturday brings none. Karpenter handles this the same way it handled our inference scaling: provisioning nodes as demand rises and consolidating as it falls.&lt;/p&gt;

&lt;p&gt;For agent workloads specifically, we configure Karpenter NodePools that prefer cost-optimised instance types (agent pods need CPU and memory, not GPUs). Unlike our inference workloads, we run agent pods on On-Demand instances. A Spot reclamation on a job that's been running for 30 minutes means you lose all the token spend and progress. The job can restart (agent tasks are idempotent), but you're paying twice. Spot made sense for our short-lived, stateless video processing. It doesn't make sense for agents that run for tens of minutes.&lt;/p&gt;
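
&lt;p&gt;A minimal Karpenter NodePool along those lines might look like this (the labels and node class name are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: osmia-agents
spec:
  template:
    metadata:
      labels:
        workload: osmia-agent            # matched by the agent pods' nodeSelector
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]          # no Spot: a reclamation wastes token spend
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                    # illustrative EC2NodeClass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;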

&lt;h3&gt;
  
  
  The intelligence layer
&lt;/h3&gt;

&lt;p&gt;The part that goes beyond basic orchestration:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time trajectory scoring.&lt;/strong&gt; Every tool call streams from the agent pod as NDJSON events. The controller scores whether the agent is making progress or stuck in a loop, calling &lt;code&gt;run_tests&lt;/code&gt; five times with the same failure, thrashing between files without converging. When the score drops below threshold, the controller intervenes: injects a hint, or terminates the job before it burns through the budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-codebase memory.&lt;/strong&gt; After each task, the controller extracts facts, patterns, and issues from the agent's work and stores them in a knowledge graph. The next task on the same codebase gets that context injected into its prompt. Facts decay over time; stale knowledge is pruned automatically. This isn't per-session memory. It's team-wide, cross-task, and persistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engine routing.&lt;/strong&gt; We track per-engine success rates by task type. A documentation task might route to a different engine than a complex refactoring task, based on historical performance data rather than intuition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human-in-the-loop continuation.&lt;/strong&gt; When a long-running agent exhausts its turn limit, the controller doesn't silently fail or blindly retry. Instead it pauses the task and sends a Slack approval request showing the operator the turn count, cost so far, and a progress summary with Continue and Stop buttons. On approval, a new pod resumes the session with full conversation history via &lt;code&gt;--resume&lt;/code&gt;. On rejection, the task fails cleanly with the operator's username recorded. This is configurable per-engine (&lt;code&gt;continuation_prompt: true&lt;/code&gt;, &lt;code&gt;max_continuations: 3&lt;/code&gt;) and requires session persistence to be enabled.&lt;/p&gt;
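
&lt;p&gt;Using the keys mentioned above, enabling this for an engine looks roughly like the following (the surrounding structure is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;engines:
  claude-code:
    continuation_prompt: true   # pause and ask in Slack instead of failing
    max_continuations: 3        # hard cap on resume cycles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;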

&lt;p&gt;&lt;strong&gt;AWS Bedrock.&lt;/strong&gt; The Cline engine supports &lt;code&gt;provider: "bedrock"&lt;/code&gt; for teams that want all LLM traffic to stay within their AWS account. Combined with the native Secrets Manager backend for credentials, this enables an all-AWS deployment (EKS + Bedrock + Secrets Manager) with no API calls leaving the AWS network boundary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment
&lt;/h3&gt;

&lt;p&gt;Getting Osmia running on an existing EKS cluster is a Helm install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add osmia https://unitaryai.github.io/osmia
helm &lt;span class="nb"&gt;install &lt;/span&gt;osmia osmia/osmia &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; osmia-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; values-eks.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The EKS-specific values overlay configures IRSA annotations on the service account, ALB ingress for the webhook endpoint (so GitHub/GitLab can deliver events), Karpenter-compatible node selectors for agent pods, and the API key secret name (&lt;code&gt;engines.claude-code.auth.api_key_secret&lt;/code&gt;). We're working on publishing a complete &lt;code&gt;examples/aws/&lt;/code&gt; directory in the repository with EKS-specific deployment guides, IRSA configuration, and ALB setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we learned
&lt;/h3&gt;

&lt;p&gt;A few things surprised us.&lt;/p&gt;

&lt;p&gt;Agent pods need more memory than we initially expected. Context windows grow large, and the agent processes (Claude Code, Codex CLI) keep substantial state in memory. We settled on 4Gi requests for most workloads. Undersizing causes OOMKills that look like agent failures.&lt;/p&gt;
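
&lt;p&gt;In pod-spec terms, that sizing looks like this (the CPU figure and the limit headroom are illustrative, not our exact values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;resources:
  requests:
    memory: 4Gi    # undersizing causes OOMKills that look like agent failures
    cpu: "1"       # illustrative: agents are I/O-heavy, not CPU-bound
  limits:
    memory: 6Gi    # illustrative headroom above the request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;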

&lt;p&gt;NetworkPolicies turned out to matter more than sandboxing for most threat models. A compromised agent that can make arbitrary outbound HTTP requests is more dangerous than one that can read files on its own filesystem. Egress control is the higher-priority control to enable.&lt;/p&gt;

&lt;p&gt;Spot does not work well for agent workloads. A reclaimed agent job loses all its progress and token spend. Restarting is safe but expensive. We moved agent pods to On-Demand and kept Spot for our shorter-lived workloads where interruption cost is low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;fsGroup is mandatory for freshly formatted EBS volumes.&lt;/strong&gt; Agent pods run as non-root (UID 10000). When a new EBS volume is attached, the kubelet formats it and the resulting filesystem is owned by root so the non-root container can't write to it. The fix is &lt;code&gt;fsGroup: 10000&lt;/code&gt; on the pod security context, which tells Kubernetes to chown the mounted volume on attach. This is now the default in the Osmia Helm chart, but it's a common stumbling block when running non-root workloads against freshly provisioned EBS.&lt;/p&gt;
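
&lt;p&gt;Concretely, the pod security context needs something like this (the group ID pairing is an illustrative choice):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;securityContext:
  runAsUser: 10000
  runAsGroup: 10000   # illustrative
  fsGroup: 10000      # kubelet chowns the mounted volume on attach
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;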

&lt;p&gt;&lt;strong&gt;ReadWriteOnce PVCs require careful deduplication in the job spec.&lt;/strong&gt; When two VolumeMount entries reference the same PVC claim name, the kubelet volume manager deadlocks. &lt;code&gt;NodePublishVolume&lt;/code&gt; is never called and the pod stays in &lt;code&gt;ContainerCreating&lt;/code&gt; indefinitely. The Osmia job builder now deduplicates PVC-backed volumes automatically. Related: the Helm chart defaults to &lt;code&gt;Recreate&lt;/code&gt; deployment strategy for the controller, because &lt;code&gt;RollingUpdate&lt;/code&gt; triggers a &lt;code&gt;Multi-Attach&lt;/code&gt; error when the incoming pod tries to mount a &lt;code&gt;ReadWriteOnce&lt;/code&gt; volume before the outgoing pod releases it.&lt;/p&gt;

&lt;p&gt;Structured logging (via Go's &lt;code&gt;slog&lt;/code&gt;) paid off immediately. Every task run produces a structured audit trail. When a task produces an unexpected result, you can trace exactly what happened without guessing.&lt;/p&gt;

&lt;h3&gt;
  
  
  From ML pipelines to AI agents
&lt;/h3&gt;

&lt;p&gt;The core insight is that running AI coding agents at scale is an infrastructure problem, and it's one that Kubernetes (and EKS specifically) is well-suited to solve. The same team that managed 1,000 nodes for video analysis now manages autonomous coding agents with the same tools, the same operational model, and the same security posture.&lt;/p&gt;

&lt;p&gt;If you're already running workloads on EKS and experimenting with AI coding agents, you're closer to production-grade agent orchestration than you might think.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/unitaryai/osmia" rel="noopener noreferrer"&gt;github.com/unitaryai/osmia&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Docs: &lt;a href="https://unitaryai.github.io/osmia" rel="noopener noreferrer"&gt;unitaryai.github.io/osmia&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Original EKS case study: &lt;a href="https://aws.amazon.com/solutions/case-studies/unitary-eks-case-study/" rel="noopener noreferrer"&gt;aws.amazon.com/solutions/case-studies/unitary-eks-case-study&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>aws</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>DeepRacer-for-Cloud v5.2.2 now available with new real-time training metrics</title>
      <dc:creator>Matt Camp</dc:creator>
      <pubDate>Fri, 03 May 2024 08:37:03 +0000</pubDate>
      <link>https://dev.to/aws-builders/deepracer-for-cloud-v522-now-available-with-new-real-time-training-metrics-7ki</link>
      <guid>https://dev.to/aws-builders/deepracer-for-cloud-v522-now-available-with-new-real-time-training-metrics-7ki</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofvifo3uoh151lprgb3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofvifo3uoh151lprgb3j.png" alt="graph panel" width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/aws-deepracer-community/deepracer-for-cloud"&gt;DeepRacer-for-Cloud&lt;/a&gt; provides a great way for developers to train DeepRacer models on EC2 (or other cloud compute instances, or even local servers) however many users have noticed that unlike the official AWS console it didn't provide the kind of friendly web UI showing the current state of training. &lt;/p&gt;

&lt;p&gt;While there are some fantastic &lt;a href="https://github.com/aws-deepracer-community/deepracer-analysis"&gt;log analysis notebooks&lt;/a&gt; available, these can be a little tricky to set up and often require re-loading vast amounts of log data to get a refreshed view of the metrics.&lt;/p&gt;

&lt;p&gt;DeepRacer-for-Cloud v5.2.2 is now available and adds an exciting new feature: real-time metrics visualisation using Grafana.&lt;/p&gt;

&lt;p&gt;Under the hood this involves creating three new containers for Telegraf, InfluxDB, and Grafana.&lt;/p&gt;

&lt;p&gt;The Robomaker simulation workers send the training metrics to Telegraf, which aggregates and stores them in the InfluxDB time-series database. Grafana provides a presentation layer for interactive dashboards.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ggez3119mghyb62b0ok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ggez3119mghyb62b0ok.png" alt="telegraf to influx to grafana" width="673" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;To use this new feature you will need &lt;a href="https://github.com/aws-deepracer-community/deepracer-for-cloud/releases/tag/v5.2.2"&gt;v5.2.2 of DeepRacer-for-Cloud&lt;/a&gt;, and also the v5.2.2 Robomaker container image. &lt;/p&gt;

&lt;h3&gt;
  
  
  Updating DeepRacer-for-Cloud
&lt;/h3&gt;

&lt;p&gt;If you're installing DRfC for the first time then it should already download the correct image and templates, but if you're upgrading an existing install then you'll need to do a few steps:&lt;/p&gt;

&lt;p&gt;If you installed DRfC the recommended way by cloning the GitHub repo then you should do a &lt;code&gt;git pull&lt;/code&gt; on the master branch to fetch the latest updates. &lt;/p&gt;

&lt;p&gt;To enable real-time metrics you need to add two additional lines to your &lt;code&gt;system.env&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DR_TELEGRAF_HOST=telegraf
DR_TELEGRAF_PORT=8092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In almost all cases you can paste these directly in without modifying the values, as the hostname will reference the telegraf container running inside Docker. &lt;/p&gt;

&lt;p&gt;If this is a fresh install, these lines may already be present in the template and only need to be uncommented.&lt;/p&gt;

&lt;h3&gt;
  
  
  Updating the Robomaker container image
&lt;/h3&gt;

&lt;p&gt;First pull the updated container image from DockerHub. Use the cpu or gpu tag as appropriate for your system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker pull awsdeepracercommunity/deepracer-robomaker:5.2.2-cpu

or

docker pull awsdeepracercommunity/deepracer-robomaker:5.2.2-gpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then update the &lt;code&gt;DR_ROBOMAKER_IMAGE&lt;/code&gt; line in &lt;code&gt;system.env&lt;/code&gt; to point at the new image tag you just pulled.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DR_ROBOMAKER_IMAGE=5.2.1-cpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Starting the metrics stack
&lt;/h3&gt;

&lt;p&gt;You can then start the metrics containers using &lt;code&gt;dr-start-metrics&lt;/code&gt;. (You might need to log in again or reload your shell to pick up the new changes in &lt;code&gt;bin/activate.sh&lt;/code&gt;.)&lt;/p&gt;

&lt;p&gt;This will start the three new containers. If it's the first time starting the metrics stack then Grafana will need to run some database migrations that can take 30-60 seconds before the web UI is available.&lt;/p&gt;

&lt;h2&gt;
  
  
  Collecting metrics
&lt;/h2&gt;

&lt;p&gt;As long as the two Telegraf lines have been added to &lt;code&gt;system.env&lt;/code&gt; and you have v5.2.2 of the Robomaker container, all you have to do is start training normally and the metrics will be generated automatically. &lt;/p&gt;

&lt;h2&gt;
  
  
  Using the dashboards
&lt;/h2&gt;

&lt;p&gt;Once the metrics stack is running you should be able to access the Grafana web UI on port 3000 (e.g. &lt;a href="http://localhost:3000"&gt;http://localhost:3000&lt;/a&gt; if running locally).&lt;/p&gt;

&lt;p&gt;Grafana initially starts with an admin user provisioned (username &lt;code&gt;admin&lt;/code&gt;, password &lt;code&gt;admin&lt;/code&gt;). It will prompt you to choose a new password when you first log in, so you should do this right away. &lt;/p&gt;

&lt;p&gt;A template dashboard is provided to show how to access basic DeepRacer training metrics. You can use this dashboard as a base to build your own more customised dashboards.&lt;/p&gt;

&lt;p&gt;After connecting to the Grafana Web UI with a browser use the menu to browse to the Dashboards section.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjiwzj43t6l5gzw4sf9m2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjiwzj43t6l5gzw4sf9m2.png" alt="Grafana dashboards screenshot" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The provided dashboard, called &lt;code&gt;DeepRacer Training template&lt;/code&gt;, should be visible, showing graphs of reward, progress, and completed lap times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6o9507h992cja1ynucd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6o9507h992cja1ynucd2.png" alt="Graph panels with data" width="800" height="822"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As this is an automatically provisioned dashboard you are not able to save changes to it; however, you can copy it by clicking the small cog icon to enter the dashboard settings page and then clicking &lt;code&gt;Save as&lt;/code&gt; to make an editable copy.&lt;/p&gt;

&lt;p&gt;Grafana dashboards are interactive - you can hover over datapoints to see more details, and you can click and drag on a graph panel to zoom in.&lt;/p&gt;

&lt;p&gt;You can also change the time range using the selector box at the top right, and select an auto-refresh period from the selector next to it. &lt;/p&gt;

&lt;p&gt;A full user guide on how to work the dashboards is available on the &lt;a href="https://grafana.com/docs/grafana/latest/dashboards/use-dashboards/"&gt;Grafana website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Currently we record metrics for training and evaluation sessions such as reward, progress, and average and best lap times, but in the future we'll be adding even more metrics and dashboards.&lt;/p&gt;

</description>
      <category>deepracer</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
