<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matt Camp</title>
    <description>The latest articles on DEV Community by Matt Camp (@mattcamp).</description>
    <link>https://dev.to/mattcamp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1463166%2F336b4922-976f-420d-87e6-7dc4d93a9176.png</url>
      <title>DEV Community: Matt Camp</title>
      <link>https://dev.to/mattcamp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mattcamp"/>
    <language>en</language>
    <item>
      <title>Orchestrating Secure AI Agents on Amazon EKS</title>
      <dc:creator>Matt Camp</dc:creator>
      <pubDate>Fri, 27 Mar 2026 12:24:23 +0000</pubDate>
      <link>https://dev.to/mattcamp/orchestrating-secure-ai-agents-on-amazon-eks-50kh</link>
      <guid>https://dev.to/mattcamp/orchestrating-secure-ai-agents-on-amazon-eks-50kh</guid>
      <description>&lt;p&gt;&lt;strong&gt;Subtitle:&lt;/strong&gt; How we went from scaling video analysis on EKS to running autonomous coding agents in a custom agent harness, and why Kubernetes was the obvious choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  The backstory
&lt;/h3&gt;

&lt;p&gt;A couple of years ago, AWS published a &lt;a href="https://aws.amazon.com/solutions/case-studies/unitary-eks-case-study/" rel="noopener noreferrer"&gt;case study&lt;/a&gt; about how our team at Unitary scales Amazon EKS with Karpenter. Three engineers managing 1,000+ nodes at peak, processing 26 million videos a day, 50-70% cost reduction with Spot Instances. It was a good story about what a small team can do with the right infrastructure.&lt;/p&gt;

&lt;p&gt;What that case study didn't cover is what happened next. As our engineering team grew, we started leaning heavily on AI coding agents (Cursor, then Claude Code and OpenAI Codex) to keep pace with development across multiple customer projects. And we hit a wall that will be familiar to anyone running these tools at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem with AI coding agents in production
&lt;/h3&gt;

&lt;p&gt;If you've used Claude Code or Codex, you know the experience: the agent is powerful, but it needs you there. You're approving tool calls, answering questions, watching the terminal. Running &lt;code&gt;--dangerously-skip-permissions&lt;/code&gt; on a developer machine is too risky for most teams. One bad tool call with production credentials is a serious incident. So the human sits there, babysitting.&lt;/p&gt;

&lt;p&gt;This works for individual productivity. Developers are good at multi-tasking; you can review output in one terminal while doing other work. But it doesn't scale to a team running agents across multiple codebases, and it doesn't work when nobody is watching. The agent that loops on a failing test for an hour doesn't care that you stepped into a meeting.&lt;/p&gt;

&lt;p&gt;We needed the same operational maturity for our AI coding tools that we'd built for our ML inference pipelines: safe to run unattended, with automated guardrails replacing the human in the loop, and able to scale. That meant Kubernetes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why EKS was the natural fit
&lt;/h3&gt;

&lt;p&gt;We'd already solved the hard scaling problems on EKS. Karpenter handles node provisioning. We know how to run mixed workloads across Spot and On-Demand. Our team understands the operational model.&lt;/p&gt;

&lt;p&gt;AI coding agents have a different resource profile from ML inference. They're long-running (minutes to hours), I/O-heavy rather than GPU-bound, and each one needs an isolated environment with repository access and API credentials. But the Kubernetes primitives are the same: pods for isolation, Jobs for lifecycle management, Secrets for credentials, NetworkPolicies for egress control.&lt;/p&gt;

&lt;p&gt;So we built &lt;a href="https://github.com/unitaryai/osmia" rel="noopener noreferrer"&gt;Osmia&lt;/a&gt;, an open-source orchestration layer that turns these primitives into a managed AI coding agent platform. We've released it under Apache 2.0.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture on EKS
&lt;/h3&gt;

&lt;p&gt;Osmia runs as a single controller pod that watches for incoming tasks (from ticket systems, webhooks, or direct API calls) and translates each one into a Kubernetes Job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlzlw5rzigaq0ieqjeuj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlzlw5rzigaq0ieqjeuj.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each agent pod runs as non-root with a read-only root filesystem and all Linux capabilities dropped. You can optionally layer on gVisor or Kata for defence in depth. Credentials are scoped per task via IRSA (we use IAM Roles for Service Accounts for any AWS-side access rather than static credentials). An optional NetworkPolicy restricts outbound traffic to HTTPS and SSH, which is enough for git operations and API calls and nothing else. A watchdog monitors every running agent and terminates jobs that exceed their cost ceiling.&lt;/p&gt;
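
&lt;p&gt;As a rough sketch (names and field values here are illustrative, not Osmia's exact defaults), that hardening maps onto standard Kubernetes settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Container-level hardening (fragment of the agent pod spec)
securityContext:
  runAsNonRoot: true
  runAsUser: 10000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
---
# Optional egress lockdown for agent pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: osmia-agent-egress   # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: osmia-agent       # hypothetical label
  policyTypes: ["Egress"]
  egress:
    - ports:
        - { port: 443, protocol: TCP }   # HTTPS: git over https, API calls
        - { port: 22, protocol: TCP }    # SSH: git over ssh
        - { port: 53, protocol: UDP }    # DNS; most egress policies also need name resolution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;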

&lt;p&gt;The controller itself is a standard Go binary using controller-runtime, the same framework that powers most Kubernetes operators. On EKS, it runs as a Deployment with a single replica.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session persistence.&lt;/strong&gt; Agent sessions can be persisted across retries and continuations using PVC-backed storage. Two backends are available: &lt;code&gt;shared-pvc&lt;/code&gt; uses a single ReadWriteMany PVC (EFS on EKS) with per-task subdirectories - simpler to operate. &lt;code&gt;per-taskrun-pvc&lt;/code&gt; dynamically creates and deletes a dedicated ReadWriteOnce PVC (EBS gp3 on EKS) per task run - stronger isolation. Session data includes the agent's conversation history (&lt;code&gt;~/.claude/&lt;/code&gt;) and optionally the workspace, so retry pods can resume with &lt;code&gt;--resume&lt;/code&gt; rather than starting from scratch.&lt;/p&gt;
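
&lt;p&gt;A hedged sketch of that configuration (the key names below are illustrative; only the two backend names come from the description above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Illustrative shape; check the Osmia docs for the real schema
session_persistence:
  enabled: true
  backend: shared-pvc        # one RWX PVC (EFS on EKS), per-task subdirectories
  # backend: per-taskrun-pvc # dedicated RWO PVC (EBS gp3) per task run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;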

&lt;h3&gt;
  
  
  Secrets management on EKS
&lt;/h3&gt;

&lt;p&gt;Agent pods need credentials: API keys for the AI engine, repository access tokens, and sometimes task-specific secrets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native AWS Secrets Manager.&lt;/strong&gt; Osmia ships a built-in AWS Secrets Manager backend that reads secrets directly via the AWS SDK v2. On EKS with IRSA, no credential configuration is needed - the SDK picks up the pod identity token automatically. The setup is three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an IAM role with &lt;code&gt;secretsmanager:GetSecretValue&lt;/code&gt; permission:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"secretsmanager:GetSecretValue"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:secretsmanager:eu-west-1:123456789:secret:osmia/*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Annotate the Osmia controller's ServiceAccount with the role ARN:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789:role/osmia-secrets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Configure the backend in &lt;code&gt;osmia-config.yaml&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;secret_resolver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backends&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws-sm"&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws-secrets-manager"&lt;/span&gt;
      &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eu-west-1"&lt;/span&gt;
        &lt;span class="na"&gt;cache_ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5m"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Secret references use the &lt;code&gt;aws-sm://secret-name#json-field&lt;/code&gt; URI format. If your secret is a JSON object (which is typical in Secrets Manager), the &lt;code&gt;#field&lt;/code&gt; fragment extracts a specific key. If the secret is a plain string, omit the fragment. The backend caches values in memory with a configurable TTL (default 5 minutes) to avoid hitting the Secrets Manager API on every job creation. Multiple secret references pointing at the same secret name share a single cached API call.&lt;/p&gt;
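
&lt;p&gt;For example (the secret names and environment variable mapping here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;env:
  ANTHROPIC_API_KEY: "aws-sm://osmia/anthropic-api-key#api_key"  # JSON secret: extract one field
  DEPLOY_TOKEN: "aws-sm://osmia/deploy-token"                    # plain-string secret: no fragment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;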

&lt;p&gt;For multi-account deployments, the backend supports cross-account access via STS AssumeRole - set &lt;code&gt;assume_role_arn&lt;/code&gt; to the target account's role, and the backend handles credential refresh automatically.&lt;/p&gt;

&lt;p&gt;You can also run multiple backends simultaneously. A team might keep their AI engine API keys in K8s Secrets (simpler to rotate) while pulling task-specific database credentials from Secrets Manager. The multi-backend resolver dispatches by URI scheme, so the two coexist without any changes to the agent configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;secret_resolver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backends&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k8s"&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k8s"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws-sm"&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws-secrets-manager"&lt;/span&gt;
      &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eu-west-1"&lt;/span&gt;
  &lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;allowed_schemes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k8s"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws-sm"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;blocked_env_patterns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS_*"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;External Secrets Operator (zero-code alternative).&lt;/strong&gt; If your team already runs &lt;a href="https://external-secrets.io/" rel="noopener noreferrer"&gt;External Secrets Operator&lt;/a&gt;, you can continue using it. ESO syncs secrets from AWS Secrets Manager into Kubernetes Secrets on a configurable refresh interval. Osmia's built-in K8s backend reads those synced secrets with no code changes. Authentication is via IRSA on the ESO service account, so no static AWS credentials are involved.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ExternalSecret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;osmia-anthropic-key&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;osmia&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;refreshInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;
  &lt;span class="na"&gt;secretStoreRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-secrets-manager&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterSecretStore&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;osmia-anthropic-key&lt;/span&gt;
  &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api_key&lt;/span&gt;
      &lt;span class="na"&gt;remoteRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;osmia/anthropic-api-key&lt;/span&gt;
        &lt;span class="na"&gt;property&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api_key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;HashiCorp Vault.&lt;/strong&gt; Osmia also ships a built-in Vault backend for teams using Vault with Kubernetes auth. Configure it via the &lt;code&gt;vault://&lt;/code&gt; scheme in the same &lt;code&gt;secret_resolver.backends&lt;/code&gt; array.&lt;/p&gt;
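
&lt;p&gt;Configuration follows the same shape as the other backends; the &lt;code&gt;config&lt;/code&gt; keys below are illustrative rather than the documented schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;secret_resolver:
  backends:
    - scheme: "vault"
      backend: "vault"
      config:
        address: "https://vault.internal:8200"   # illustrative Vault address
        auth_role: "osmia"                       # illustrative Kubernetes auth role
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;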

&lt;h3&gt;
  
  
  What Karpenter gives us here
&lt;/h3&gt;

&lt;p&gt;Agent pods are bursty. A Monday morning might bring 30 tickets; a Saturday brings none. Karpenter handles this the same way it handled our inference scaling: provisioning nodes as demand rises and consolidating as it falls.&lt;/p&gt;

&lt;p&gt;For agent workloads specifically, we configure Karpenter NodePools that prefer cost-optimised instance types (agent pods need CPU and memory, not GPUs). Unlike our inference workloads, we run agent pods on On-Demand instances. A Spot reclamation on a job that's been running for 30 minutes means you lose all the token spend and progress. The job can restart (agent tasks are idempotent), but you're paying twice. Spot made sense for our short-lived, stateless video processing. It doesn't make sense for agents that run for tens of minutes.&lt;/p&gt;
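
&lt;p&gt;A minimal Karpenter NodePool along those lines might look like this (the labels and node class name are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: osmia-agents
spec:
  template:
    metadata:
      labels:
        workload: osmia-agent            # matched by the agent pods' nodeSelector
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]          # no Spot: a reclamation wastes token spend
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                    # illustrative EC2NodeClass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;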

&lt;h3&gt;
  
  
  The intelligence layer
&lt;/h3&gt;

&lt;p&gt;The part that goes beyond basic orchestration:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time trajectory scoring.&lt;/strong&gt; Every tool call streams from the agent pod as NDJSON events. The controller scores whether the agent is making progress or stuck in a loop, calling &lt;code&gt;run_tests&lt;/code&gt; five times with the same failure, thrashing between files without converging. When the score drops below threshold, the controller intervenes: injects a hint, or terminates the job before it burns through the budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-codebase memory.&lt;/strong&gt; After each task, the controller extracts facts, patterns, and issues from the agent's work and stores them in a knowledge graph. The next task on the same codebase gets that context injected into its prompt. Facts decay over time; stale knowledge is pruned automatically. This isn't per-session memory. It's team-wide, cross-task, and persistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engine routing.&lt;/strong&gt; We track per-engine success rates by task type. A documentation task might route to a different engine than a complex refactoring task, based on historical performance data rather than intuition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human-in-the-loop continuation.&lt;/strong&gt; When a long-running agent exhausts its turn limit, the controller doesn't silently fail or blindly retry. Instead it pauses the task and sends a Slack approval request showing the operator the turn count, cost so far, and a progress summary with Continue and Stop buttons. On approval, a new pod resumes the session with full conversation history via &lt;code&gt;--resume&lt;/code&gt;. On rejection, the task fails cleanly with the operator's username recorded. This is configurable per-engine (&lt;code&gt;continuation_prompt: true&lt;/code&gt;, &lt;code&gt;max_continuations: 3&lt;/code&gt;) and requires session persistence to be enabled.&lt;/p&gt;
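
&lt;p&gt;Using the keys mentioned above, enabling this for an engine looks roughly like the following (the surrounding structure is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;engines:
  claude-code:
    continuation_prompt: true   # pause and ask in Slack instead of failing
    max_continuations: 3        # hard cap on resume cycles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;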

&lt;p&gt;&lt;strong&gt;AWS Bedrock.&lt;/strong&gt; The Cline engine supports &lt;code&gt;provider: "bedrock"&lt;/code&gt; for teams that want all LLM traffic to stay within their AWS account. Combined with the native Secrets Manager backend for credentials, this enables an all-AWS deployment (EKS + Bedrock + Secrets Manager) with no API calls leaving the AWS network boundary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment
&lt;/h3&gt;

&lt;p&gt;Getting Osmia running on an existing EKS cluster is a Helm install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add osmia https://unitaryai.github.io/osmia
helm &lt;span class="nb"&gt;install &lt;/span&gt;osmia osmia/osmia &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; osmia-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; values-eks.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The EKS-specific values overlay configures IRSA annotations on the service account, ALB ingress for the webhook endpoint (so GitHub/GitLab can deliver events), Karpenter-compatible node selectors for agent pods, and the API key secret name (&lt;code&gt;engines.claude-code.auth.api_key_secret&lt;/code&gt;). We're working on publishing a complete &lt;code&gt;examples/aws/&lt;/code&gt; directory in the repository with EKS-specific deployment guides, IRSA configuration, and ALB setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we learned
&lt;/h3&gt;

&lt;p&gt;A few things surprised us.&lt;/p&gt;

&lt;p&gt;Agent pods need more memory than we initially expected. Context windows grow large, and the agent processes (Claude Code, Codex CLI) keep substantial state in memory. We settled on 4Gi requests for most workloads. Undersizing causes OOMKills that look like agent failures.&lt;/p&gt;
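
&lt;p&gt;In pod-spec terms, that sizing looks like this (the CPU figure and the limit headroom are illustrative, not our exact values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;resources:
  requests:
    memory: 4Gi    # undersizing causes OOMKills that look like agent failures
    cpu: "1"       # illustrative: agents are I/O-heavy, not CPU-bound
  limits:
    memory: 6Gi    # illustrative headroom above the request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;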

&lt;p&gt;NetworkPolicies turned out to matter more than sandboxing for most threat models. A compromised agent that can make arbitrary outbound HTTP requests is more dangerous than one that can read files on its own filesystem. Egress control is the higher-priority control to enable.&lt;/p&gt;

&lt;p&gt;Spot does not work well for agent workloads. A reclaimed agent job loses all its progress and token spend. Restarting is safe but expensive. We moved agent pods to On-Demand and kept Spot for our shorter-lived workloads where interruption cost is low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;fsGroup is mandatory for freshly formatted EBS volumes.&lt;/strong&gt; Agent pods run as non-root (UID 10000). When a new EBS volume is attached, the kubelet formats it and the resulting filesystem is owned by root so the non-root container can't write to it. The fix is &lt;code&gt;fsGroup: 10000&lt;/code&gt; on the pod security context, which tells Kubernetes to chown the mounted volume on attach. This is now the default in the Osmia Helm chart, but it's a common stumbling block when running non-root workloads against freshly provisioned EBS.&lt;/p&gt;
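
&lt;p&gt;Concretely, the pod security context needs something like this (the group ID pairing is an illustrative choice):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;securityContext:
  runAsUser: 10000
  runAsGroup: 10000   # illustrative
  fsGroup: 10000      # kubelet chowns the mounted volume on attach
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;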

&lt;p&gt;&lt;strong&gt;ReadWriteOnce PVCs require careful deduplication in the job spec.&lt;/strong&gt; When two VolumeMount entries reference the same PVC claim name, the kubelet volume manager deadlocks. &lt;code&gt;NodePublishVolume&lt;/code&gt; is never called and the pod stays in &lt;code&gt;ContainerCreating&lt;/code&gt; indefinitely. The Osmia job builder now deduplicates PVC-backed volumes automatically. Related: the Helm chart defaults to &lt;code&gt;Recreate&lt;/code&gt; deployment strategy for the controller, because &lt;code&gt;RollingUpdate&lt;/code&gt; triggers a &lt;code&gt;Multi-Attach&lt;/code&gt; error when the incoming pod tries to mount a &lt;code&gt;ReadWriteOnce&lt;/code&gt; volume before the outgoing pod releases it.&lt;/p&gt;

&lt;p&gt;Structured logging (via Go's &lt;code&gt;slog&lt;/code&gt;) paid off immediately. Every task run produces a structured audit trail. When a task produces an unexpected result, you can trace exactly what happened without guessing.&lt;/p&gt;

&lt;h3&gt;
  
  
  From ML pipelines to AI agents
&lt;/h3&gt;

&lt;p&gt;The core insight is that running AI coding agents at scale is an infrastructure problem, and it's one that Kubernetes (and EKS specifically) is well-suited to solve. The same team that managed 1,000 nodes for video analysis now manages autonomous coding agents with the same tools, the same operational model, and the same security posture.&lt;/p&gt;

&lt;p&gt;If you're already running workloads on EKS and experimenting with AI coding agents, you're closer to production-grade agent orchestration than you might think.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/unitaryai/osmia" rel="noopener noreferrer"&gt;github.com/unitaryai/osmia&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Docs: &lt;a href="https://unitaryai.github.io/osmia" rel="noopener noreferrer"&gt;unitaryai.github.io/osmia&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Original EKS case study: &lt;a href="https://aws.amazon.com/solutions/case-studies/unitary-eks-case-study/" rel="noopener noreferrer"&gt;aws.amazon.com/solutions/case-studies/unitary-eks-case-study&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>aws</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>DeepRacer-for-Cloud v5.2.2 now available with new real-time training metrics</title>
      <dc:creator>Matt Camp</dc:creator>
      <pubDate>Fri, 03 May 2024 08:37:03 +0000</pubDate>
      <link>https://dev.to/aws-builders/deepracer-for-cloud-v522-now-available-with-new-real-time-training-metrics-7ki</link>
      <guid>https://dev.to/aws-builders/deepracer-for-cloud-v522-now-available-with-new-real-time-training-metrics-7ki</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofvifo3uoh151lprgb3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofvifo3uoh151lprgb3j.png" alt="graph panel" width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/aws-deepracer-community/deepracer-for-cloud"&gt;DeepRacer-for-Cloud&lt;/a&gt; provides a great way for developers to train DeepRacer models on EC2 (or other cloud compute instances, or even local servers) however many users have noticed that unlike the official AWS console it didn't provide the kind of friendly web UI showing the current state of training. &lt;/p&gt;

&lt;p&gt;While there are some fantastic &lt;a href="https://github.com/aws-deepracer-community/deepracer-analysis"&gt;log analysis notebooks&lt;/a&gt; available, these can be a little tricky to set up and often require re-loading vast amounts of log data to get a refreshed view of the metrics.&lt;/p&gt;

&lt;p&gt;DeepRacer-for-Cloud v5.2.2 is now available and adds an exciting new feature: real-time metrics visualisation using Grafana.&lt;/p&gt;

&lt;p&gt;Under the hood this involves creating three new containers for Telegraf, InfluxDB, and Grafana.&lt;/p&gt;

&lt;p&gt;The Robomaker simulation workers send the training metrics to Telegraf, which aggregates and stores them in the InfluxDB time-series database. Grafana provides a presentation layer for interactive dashboards.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ggez3119mghyb62b0ok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ggez3119mghyb62b0ok.png" alt="telegraf to influx to grafana" width="673" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;To use this new feature you will need &lt;a href="https://github.com/aws-deepracer-community/deepracer-for-cloud/releases/tag/v5.2.2"&gt;v5.2.2 of DeepRacer-for-Cloud&lt;/a&gt;, and also the v5.2.2 Robomaker container image. &lt;/p&gt;

&lt;h3&gt;
  
  
  Updating DeepRacer-for-Cloud
&lt;/h3&gt;

&lt;p&gt;If you're installing DRfC for the first time then it should already download the correct image and templates, but if you're upgrading an existing install then you'll need to do a few steps:&lt;/p&gt;

&lt;p&gt;If you installed DRfC the recommended way by cloning the GitHub repo then you should do a &lt;code&gt;git pull&lt;/code&gt; on the master branch to fetch the latest updates. &lt;/p&gt;

&lt;p&gt;To enable real-time metrics you need to add two additional lines to your &lt;code&gt;system.env&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DR_TELEGRAF_HOST=telegraf
DR_TELEGRAF_PORT=8092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In almost all cases you can paste these directly in without modifying the values, as the hostname will reference the telegraf container running inside Docker. &lt;/p&gt;

&lt;p&gt;If this is a fresh install, these lines may already be present in the template and only need to be uncommented.&lt;/p&gt;

&lt;h3&gt;
  
  
  Updating the Robomaker container image
&lt;/h3&gt;

&lt;p&gt;First pull the updated container image from DockerHub. Use the cpu or gpu tag as appropriate for your system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker pull awsdeepracercommunity/deepracer-robomaker:5.2.2-cpu

or

docker pull awsdeepracercommunity/deepracer-robomaker:5.2.2-gpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then update the &lt;code&gt;DR_ROBOMAKER_IMAGE&lt;/code&gt; line in &lt;code&gt;system.env&lt;/code&gt; to point at the new image tag you just pulled.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DR_ROBOMAKER_IMAGE=5.2.1-cpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Starting the metrics stack
&lt;/h3&gt;

&lt;p&gt;You can then start the metrics containers using &lt;code&gt;dr-start-metrics&lt;/code&gt;. (You might need to log in again or reload your shell to pick up the new changes in &lt;code&gt;bin/activate.sh&lt;/code&gt;.)&lt;/p&gt;

&lt;p&gt;This will start the three new containers. If it's the first time starting the metrics stack then Grafana will need to run some database migrations that can take 30-60 seconds before the web UI is available.&lt;/p&gt;

&lt;h2&gt;
  
  
  Collecting metrics
&lt;/h2&gt;

&lt;p&gt;As long as the two Telegraf lines have been added to &lt;code&gt;system.env&lt;/code&gt; and you have v5.2.2 of the Robomaker container, all you have to do is start training normally and the metrics will be generated automatically. &lt;/p&gt;

&lt;h2&gt;
  
  
  Using the dashboards
&lt;/h2&gt;

&lt;p&gt;Once the metrics stack is running you should be able to access the Grafana web UI on port 3000 (e.g. &lt;a href="http://localhost:3000"&gt;http://localhost:3000&lt;/a&gt; if running locally).&lt;/p&gt;

&lt;p&gt;Grafana initially starts with an admin user provisioned (username &lt;code&gt;admin&lt;/code&gt;, password &lt;code&gt;admin&lt;/code&gt;). It will prompt you to choose a new password when you first log in, so you should do this right away. &lt;/p&gt;

&lt;p&gt;A template dashboard is provided to show how to access basic DeepRacer training metrics. You can use this dashboard as a base to build your own more customised dashboards.&lt;/p&gt;

&lt;p&gt;After connecting to the Grafana Web UI with a browser use the menu to browse to the Dashboards section.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjiwzj43t6l5gzw4sf9m2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjiwzj43t6l5gzw4sf9m2.png" alt="Grafana dashboards screenshot" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The provided dashboard, called &lt;code&gt;DeepRacer Training template&lt;/code&gt;, should be visible, showing graphs of reward, progress, and completed lap times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6o9507h992cja1ynucd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6o9507h992cja1ynucd2.png" alt="Graph panels with data" width="800" height="822"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As this is an automatically provisioned dashboard you are not able to save changes to it; however, you can copy it by clicking the small cog icon to enter the dashboard settings page and then clicking &lt;code&gt;Save as&lt;/code&gt; to make an editable copy.&lt;/p&gt;

&lt;p&gt;Grafana dashboards are interactive - you can hover over datapoints to see more details, and you can click and drag on a graph panel to zoom in.&lt;/p&gt;

&lt;p&gt;You can also change the time range using the selector box at the top right, and select an auto-refresh period from the selector next to it. &lt;/p&gt;

&lt;p&gt;A full user guide on how to work the dashboards is available on the &lt;a href="https://grafana.com/docs/grafana/latest/dashboards/use-dashboards/"&gt;Grafana website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Currently we record metrics for training and evaluation sessions such as reward, progress, and average and best lap times, but in the future we'll be adding even more metrics and dashboards.&lt;/p&gt;

</description>
      <category>deepracer</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
