DEV Community

Matt Camp

Orchestrating Secure AI Agents on Amazon EKS

How we went from scaling video analysis on EKS to running autonomous coding agents in a custom agent harness, and why Kubernetes was the obvious choice.

The backstory

A couple of years ago, AWS published a case study about how our team at Unitary scales Amazon EKS with Karpenter. Three engineers managing 1,000+ nodes at peak, processing 26 million videos a day, 50-70% cost reduction with Spot Instances. It was a good story about what a small team can do with the right infrastructure.

What that case study didn't cover is what happened next. As our engineering team grew, we started leaning heavily on AI coding agents (Cursor, then Claude Code and OpenAI Codex) to keep pace with development across multiple customer projects. And we hit a wall that will be familiar to anyone running these tools at scale.

The problem with AI coding agents in production

If you've used Claude Code or Codex, you know the experience: the agent is powerful, but it needs you there. You're approving tool calls, answering questions, watching the terminal. Running --dangerously-skip-permissions on a developer machine is too risky for most teams. One bad tool call with production credentials is a serious incident. So the human sits there, babysitting.

This works for individual productivity. Developers are good at multi-tasking; you can review output in one terminal while doing other work. But it doesn't scale to a team running agents across multiple codebases, and it doesn't work when nobody is watching. The agent that loops on a failing test for an hour doesn't care that you stepped into a meeting.

We needed the same operational maturity for our AI coding tools that we'd built for our ML inference pipelines: safe to run unattended, with automated guardrails replacing the human in the loop, and able to scale. That meant Kubernetes.

Why EKS was the natural fit

We'd already solved the hard scaling problems on EKS. Karpenter handles node provisioning. We know how to run mixed workloads across Spot and On-Demand. Our team understands the operational model.

AI coding agents have a different resource profile from ML inference. They're long-running (minutes to hours), I/O-heavy rather than GPU-bound, and each one needs an isolated environment with repository access and API credentials. But the Kubernetes primitives are the same: pods for isolation, Jobs for lifecycle management, Secrets for credentials, NetworkPolicies for egress control.

So we built Osmia, an open-source orchestration layer that turns these primitives into a managed AI coding agent platform. We've released it under Apache 2.0.

Architecture on EKS

Osmia runs as a single controller pod that watches for incoming tasks (from ticket systems, webhooks, or direct API calls) and translates each one into a Kubernetes Job.

Each agent pod runs as non-root with a read-only root filesystem and all Linux capabilities dropped. You can optionally layer on gVisor or Kata for defence in depth. Credentials are scoped per task via IRSA (we use IAM Roles for Service Accounts for any AWS-side access rather than static credentials). An optional NetworkPolicy restricts outbound traffic to HTTPS and SSH, which is enough for git operations and API calls and nothing else. A watchdog monitors every running agent and terminates jobs that exceed their cost ceiling.
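The pod hardening and egress policy described above can be sketched in standard Kubernetes manifests. This is illustrative only: the names, labels, and selector are assumptions, not the exact manifests Osmia generates, and the policy below also allows DNS, which a real egress policy typically needs.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: agent-example
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10000
  containers:
    - name: agent
      image: example/agent:latest
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]      # no Linux capabilities at all
---
# Egress limited to HTTPS and SSH (git + API traffic), plus DNS.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-egress
spec:
  podSelector:
    matchLabels:
      app: osmia-agent       # assumed label
  policyTypes: ["Egress"]
  egress:
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 443
        - protocol: TCP
          port: 22
```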

The controller itself is a standard Go binary using controller-runtime, the same framework that powers most Kubernetes operators. On EKS, it runs as a Deployment with a single replica.

Session persistence. Agent sessions can be persisted across retries and continuations using PVC-backed storage. Two backends are available:

  - shared-pvc uses a single ReadWriteMany PVC (EFS on EKS) with per-task subdirectories - simpler to operate.
  - per-taskrun-pvc dynamically creates and deletes a dedicated ReadWriteOnce PVC (EBS gp3 on EKS) per task run - stronger isolation.

Session data includes the agent's conversation history (~/.claude/) and optionally the workspace, so retry pods can resume with --resume rather than starting from scratch.
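For the shared-pvc backend, the underlying volume on EKS is an EFS-backed ReadWriteMany claim. A minimal sketch, assuming an `efs-sc` StorageClass backed by the aws-efs-csi-driver:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: osmia-sessions
spec:
  accessModes: ["ReadWriteMany"]   # many agent pods, one shared volume
  storageClassName: efs-sc         # assumes aws-efs-csi-driver is installed
  resources:
    requests:
      storage: 20Gi                # EFS is elastic; this value is nominal
```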

Secrets management on EKS

Agent pods need credentials: API keys for the AI engine, repository access tokens, and sometimes task-specific secrets.

Native AWS Secrets Manager. Osmia ships a built-in AWS Secrets Manager backend that reads secrets directly via the AWS SDK v2. On EKS with IRSA, no credential configuration is needed - the SDK picks up the pod identity token automatically. The setup is three steps:

  1. Create an IAM role with secretsmanager:GetSecretValue permission:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:eu-west-1:123456789:secret:osmia/*"
    }
  ]
}
  2. Annotate the Osmia controller's ServiceAccount with the role ARN:
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/osmia-secrets
  3. Configure the backend in osmia-config.yaml:
secret_resolver:
  backends:
    - scheme: "aws-sm"
      backend: "aws-secrets-manager"
      config:
        region: "eu-west-1"
        cache_ttl: "5m"

Secret references use the aws-sm://secret-name#json-field URI format. If your secret is a JSON object (which is typical in Secrets Manager), the #field fragment extracts a specific key. If the secret is a plain string, omit the fragment. The backend caches values in memory with a configurable TTL (default 5 minutes) to avoid hitting the Secrets Manager API on every job creation. Multiple secret references pointing at the same secret name share a single cached API call.
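The URI format is simple enough to sketch a parser for. This is illustrative Go using the standard library, not Osmia's actual resolver code:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseSecretRef splits an aws-sm://secret-name#json-field reference into
// its secret name and optional JSON field. Illustrative sketch only.
func parseSecretRef(ref string) (name, field string, err error) {
	u, err := url.Parse(ref)
	if err != nil {
		return "", "", err
	}
	if u.Scheme != "aws-sm" {
		return "", "", fmt.Errorf("unexpected scheme %q", u.Scheme)
	}
	// Host plus any path segments form the secret name, since Secrets
	// Manager names may contain slashes (e.g. osmia/anthropic-api-key).
	name = strings.TrimSuffix(u.Host+u.Path, "/")
	return name, u.Fragment, nil
}

func main() {
	name, field, _ := parseSecretRef("aws-sm://osmia/anthropic-api-key#api_key")
	fmt.Println(name, field) // prints: osmia/anthropic-api-key api_key
}
```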

For multi-account deployments, the backend supports cross-account access via STS AssumeRole - set assume_role_arn to the target account's role, and the backend handles credential refresh automatically.
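A cross-account configuration along those lines might look like the following; the role ARN is a placeholder for the target account's role:

```yaml
secret_resolver:
  backends:
    - scheme: "aws-sm"
      backend: "aws-secrets-manager"
      config:
        region: "eu-west-1"
        # Placeholder: substitute the role in the target account.
        assume_role_arn: "arn:aws:iam::210987654321:role/osmia-secrets-reader"
```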

You can also run multiple backends simultaneously. A team might keep their AI engine API keys in K8s Secrets (simpler to rotate) while pulling task-specific database credentials from Secrets Manager. The multi-backend resolver dispatches by URI scheme, so the two coexist without any changes to the agent configuration:

secret_resolver:
  backends:
    - scheme: "k8s"
      backend: "k8s"
    - scheme: "aws-sm"
      backend: "aws-secrets-manager"
      config:
        region: "eu-west-1"
  policy:
    allowed_schemes: ["k8s", "aws-sm"]
    blocked_env_patterns: ["AWS_*"]

External Secrets Operator (zero-code alternative). If your team already runs External Secrets Operator, you can continue using it. ESO syncs secrets from AWS Secrets Manager into Kubernetes Secrets on a configurable refresh interval. Osmia's built-in K8s backend reads those synced secrets with no code changes. Authentication is via IRSA on the ESO service account, so no static AWS credentials are involved.

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: osmia-anthropic-key
  namespace: osmia
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: osmia-anthropic-key
  data:
    - secretKey: api_key
      remoteRef:
        key: osmia/anthropic-api-key
        property: api_key

HashiCorp Vault. Osmia also ships a built-in Vault backend for teams using Vault with Kubernetes auth. Configure it via the vault:// scheme in the same secret_resolver.backends array.
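Assuming the backend identifier is also `vault` (check the Osmia docs for the exact name; the config keys below are assumptions), the entry mirrors the AWS example:

```yaml
secret_resolver:
  backends:
    - scheme: "vault"
      backend: "vault"                        # backend name is an assumption
      config:
        address: "https://vault.internal:8200"  # assumed key names
        auth_method: "kubernetes"
```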

What Karpenter gives us here

Agent pods are bursty. A Monday morning might bring 30 tickets; a Saturday brings none. Karpenter handles this the same way it handled our inference scaling: provisioning nodes as demand rises and consolidating as it falls.

For agent workloads specifically, we configure Karpenter NodePools that prefer cost-optimised instance types (agent pods need CPU and memory, not GPUs). Unlike our inference workloads, we run agent pods on On-Demand instances. A Spot reclamation on a job that's been running for 30 minutes means you lose all the token spend and progress. The job can restart (agent tasks are idempotent), but you're paying twice. Spot made sense for our short-lived, stateless video processing. It doesn't make sense for agents that run for tens of minutes.
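A Karpenter v1 NodePool in that spirit; the name, instance categories, and node class are illustrative, not our exact configuration:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: osmia-agents
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]     # no Spot for long-running agent jobs
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m"]        # CPU/memory optimised, no GPUs
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m            # scale back down after Monday's burst
```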

The intelligence layer

The part that goes beyond basic orchestration:

Real-time trajectory scoring. Every tool call streams from the agent pod as NDJSON events. The controller scores whether the agent is making progress or stuck in a loop, calling run_tests five times with the same failure, thrashing between files without converging. When the score drops below threshold, the controller intervenes: injects a hint, or terminates the job before it burns through the budget.

Per-codebase memory. After each task, the controller extracts facts, patterns, and issues from the agent's work and stores them in a knowledge graph. The next task on the same codebase gets that context injected into its prompt. Facts decay over time; stale knowledge is pruned automatically. This isn't per-session memory. It's team-wide, cross-task, and persistent.

Engine routing. We track per-engine success rates by task type. A documentation task might route to a different engine than a complex refactoring task, based on historical performance data rather than intuition.

Human-in-the-loop continuation. When a long-running agent exhausts its turn limit, the controller doesn't silently fail or blindly retry. Instead it pauses the task and sends a Slack approval request showing the operator the turn count, cost so far, and a progress summary with Continue and Stop buttons. On approval, a new pod resumes the session with full conversation history via --resume. On rejection, the task fails cleanly with the operator's username recorded. This is configurable per-engine (continuation_prompt: true, max_continuations: 3) and requires session persistence to be enabled.
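Using the two settings named above, a per-engine continuation policy might be configured like this (the nesting under engines follows the key path used elsewhere in this post; verify against your chart version, and remember session persistence must be enabled):

```yaml
engines:
  claude-code:
    continuation_prompt: true   # pause and ask in Slack instead of failing
    max_continuations: 3        # at most three resumed sessions per task
```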

AWS Bedrock. The Cline engine supports provider: "bedrock" for teams that want all LLM traffic to stay within their AWS account. Combined with the native Secrets Manager backend for credentials, an all-AWS deployment (EKS + Bedrock + Secrets Manager) is possible, with no API calls leaving the AWS network boundary.

Deployment

Getting Osmia running on an existing EKS cluster is a Helm install:

helm repo add osmia https://unitaryai.github.io/osmia
helm install osmia osmia/osmia \
  --namespace osmia-system \
  --create-namespace \
  -f values-eks.yaml

The EKS-specific values overlay configures IRSA annotations on the service account, ALB ingress for the webhook endpoint (so GitHub/GitLab can deliver events), Karpenter-compatible node selectors for agent pods, and the API key secret name (engines.claude-code.auth.api_key_secret). We're working on publishing a complete examples/aws/ directory in the repository with EKS-specific deployment guides, IRSA configuration, and ALB setup.
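As a rough shape of that overlay: of the keys below, only engines.claude-code.auth.api_key_secret is named in this post; the rest are assumptions pending the examples/aws/ directory.

```yaml
serviceAccount:
  annotations:
    # IRSA: role assumed by the controller's service account
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/osmia-controller
ingress:
  enabled: true
  className: alb                            # ALB for the webhook endpoint
agent:
  nodeSelector:
    karpenter.sh/nodepool: osmia-agents     # assumed label; match your NodePool
engines:
  claude-code:
    auth:
      api_key_secret: "aws-sm://osmia/anthropic-api-key#api_key"
```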

What we learned

A few things surprised us.

Agent pods need more memory than we initially expected. Language-model context windows are large, and the agent processes (Claude Code, Codex CLI) keep substantial state in memory. We settled on 4Gi requests for most workloads. Undersizing causes OOMKills that look like agent failures.

NetworkPolicies turned out to matter more than sandboxing for most threat models. A compromised agent that can make arbitrary outbound HTTP requests is more dangerous than one that can read files on its own filesystem. Egress control is the higher-priority control to enable.

Spot does not work well for agent workloads. A reclaimed agent job loses all its progress and token spend. Restarting is safe but expensive. We moved agent pods to On-Demand and kept Spot for our shorter-lived workloads where interruption cost is low.

fsGroup is mandatory for freshly formatted EBS volumes. Agent pods run as non-root (UID 10000). When a new EBS volume is attached, the kubelet formats it and the resulting filesystem is owned by root so the non-root container can't write to it. The fix is fsGroup: 10000 on the pod security context, which tells Kubernetes to chown the mounted volume on attach. This is now the default in the Osmia Helm chart, but it's a common stumbling block when running non-root workloads against freshly provisioned EBS.
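In pod-spec terms, the fix is a standard Kubernetes securityContext:

```yaml
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10000
    fsGroup: 10000   # kubelet chowns the mounted EBS volume to this GID on attach
```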

ReadWriteOnce PVCs require careful deduplication in the job spec. When two VolumeMount entries reference the same PVC claim name, the kubelet volume manager deadlocks. NodePublishVolume is never called and the pod stays in ContainerCreating indefinitely. The Osmia job builder now deduplicates PVC-backed volumes automatically. Related: the Helm chart defaults to Recreate deployment strategy for the controller, because RollingUpdate triggers a Multi-Attach error when the incoming pod tries to mount a ReadWriteOnce volume before the outgoing pod releases it.

Structured logging (via Go's slog) paid off immediately. Every task run produces a structured audit trail. When a task produces an unexpected result, you can trace exactly what happened without guessing.

From ML pipelines to AI agents

The core insight is that running AI coding agents at scale is an infrastructure problem, and it's one that Kubernetes (and EKS specifically) is well-suited to solve. The same team that managed 1,000 nodes for video analysis now manages autonomous coding agents with the same tools, the same operational model, and the same security posture.

If you're already running workloads on EKS and experimenting with AI coding agents, you're closer to production-grade agent orchestration than you might think.


