
C.K.Sun


How We Use AWS CDK to Deploy OpenClaw for Enterprise Teams — API Key Management Without the Chaos

We wanted every employee in the company to use OpenClaw — not just engineers. Product managers writing specs, designers prototyping, ops teams automating workflows. The problem wasn't adoption — it was operations.

Hundreds of employees, each needing their own API key in a local config file. Within a week, keys were committed to git. Within a month, finance couldn't figure out who was spending what. When Anthropic rotated a key, we had to notify everyone individually. We needed infrastructure, not process.

We solved it with four AWS CDK stacks and a 100-line sidecar proxy. Here's how.

The Architecture: Four CDK Stacks

Everything deploys in order with nx deploy:

DEPLOY_ENV=prod nx deploy teamclaw-foundation-infra    # VPC, EFS, ECR, Secrets Manager
DEPLOY_ENV=prod nx deploy teamclaw-cluster-infra       # ECS Fargate, ALB, CloudFront
DEPLOY_ENV=prod nx deploy teamclaw-control-plane-infra # Cognito, DynamoDB, Lifecycle Lambda
DEPLOY_ENV=prod nx deploy teamclaw-admin-infra         # API Gateway, 44 Lambda handlers

After deployment, IT configures API keys in the admin panel. Employees sign up with their company email. That's it — no per-user key setup.

Stack 1: Foundation

VPC with public/private subnets, EFS for per-user data persistence, ECR for container images, and Secrets Manager for the API key pool.

The key design decision: one Secrets Manager secret holds all provider keys as a JSON pool.

// foundation.stack.ts
const apiKeysSecret = new aws_secretsmanager.Secret(this, 'ApiKeysSecret', {
  secretName: `${deployEnv}/teamclaw/api-keys`,
  description: 'Shared API key pool for TeamClaw',
});

The secret format:

{
  "providers": {
    "anthropic": {
      "authType": "apiKey",
      "keys": ["sk-ant-key1...", "sk-ant-key2...", "sk-ant-key3..."]
    },
    "openai": {
      "authType": "apiKey",
      "keys": ["sk-proj-key1...", "sk-proj-key2..."]
    }
  }
}

Adding a key is one CLI call. No container restart — the sidecar's 60-second cache handles it.
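The rotation itself is just a JSON edit on the secret string. A minimal sketch of that edit, assuming a hypothetical `addProviderKey` helper — in practice you would wrap it with `GetSecretValue`/`PutSecretValue` from the Secrets Manager SDK or `aws secretsmanager put-secret-value` from the CLI, both omitted here:

```typescript
// Sketch: append a key to the pool JSON. The surrounding Secrets Manager
// calls are omitted so the transformation itself is clear.
interface KeyPool {
  providers: Record<string, { authType: string; keys: string[] }>;
}

function addProviderKey(secretString: string, provider: string, newKey: string): string {
  const pool: KeyPool = JSON.parse(secretString);
  // Create the provider entry if this is the first key for it.
  const entry = pool.providers[provider] ?? { authType: 'apiKey', keys: [] };
  entry.keys.push(newKey); // the sidecar picks this up within its 60s cache window
  pool.providers[provider] = entry;
  return JSON.stringify(pool, null, 2);
}
```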

Stack 2: Cluster

ECS Fargate cluster, ALB with per-user path-based routing, CloudFront for WSS termination, and the ECS Task Definition with two containers.

// cluster.stack.ts
const taskDefinition = new aws_ecs.FargateTaskDefinition(this, 'UserTaskDef', {
  family: `teamclaw-user-${deployEnv}`,
  cpu: 1024,
  memoryLimitMiB: 2048,
  taskRole,
  executionRole,
});

// Container 1: Unmodified OpenClaw
taskDefinition.addContainer('teamclaw', {
  image: aws_ecs.ContainerImage.fromRegistry(`${ecrUri}:latest`),
  portMappings: [{ containerPort: 18789 }],
});

// Container 2: Sidecar proxy for credential injection
taskDefinition.addContainer('proxy-sidecar', {
  image: aws_ecs.ContainerImage.fromRegistry(`${sidecarUri}:latest`),
  portMappings: [{ containerPort: 3000 }],
});

The sidecar is the key piece. OpenClaw's config points all providers to http://localhost:3000/anthropic, http://localhost:3000/openai, etc. The sidecar intercepts every API call, strips the dummy auth header, injects the real key from Secrets Manager, and forwards upstream. The real API key never touches the OpenClaw process.

The task role is scoped tight — least privilege:

// Secrets Manager: read-only on the key pool
taskRole.addToPolicy(new aws_iam.PolicyStatement({
  actions: ['secretsmanager:GetSecretValue'],
  resources: [`arn:aws:secretsmanager:${region}:${account}:secret:${deployEnv}/teamclaw/*`],
}));

// EFS: mount user data
taskRole.addToPolicy(new aws_iam.PolicyStatement({
  actions: ['elasticfilesystem:ClientMount', 'elasticfilesystem:ClientWrite'],
  resources: [`arn:aws:elasticfilesystem:${region}:${account}:file-system/*`],
}));

// DynamoDB: sidecar logs usage per request
taskRole.addToPolicy(new aws_iam.PolicyStatement({
  actions: ['dynamodb:PutItem'],
  resources: [`arn:aws:dynamodb:${region}:${account}:table/teamclaw-usage-${deployEnv}`],
}));

Stack 3: Control Plane

Cognito for employee auth, DynamoDB for state, and a Lifecycle Lambda that starts and stops user containers.

When an employee logs in, the user-session Lambda checks DynamoDB. If no container exists, it invokes the Lifecycle Lambda which:

  1. Creates an EFS access point at /users/{userId}
  2. Calls ecs:RunTask to start a Fargate task
  3. Polls DescribeTasks until the container gets a private IP
  4. Creates a per-user ALB target group and listener rule
  5. Records the task ARN and IP in DynamoDB

// Lifecycle Lambda — least privilege for ECS orchestration
lifecycleLambda.addToRolePolicy(new aws_iam.PolicyStatement({
  actions: ['ecs:RunTask', 'ecs:StopTask', 'ecs:DescribeTasks'],
  resources: [
    `arn:aws:ecs:${region}:${account}:cluster/teamclaw-${deployEnv}`,
    `arn:aws:ecs:${region}:${account}:task/teamclaw-${deployEnv}/*`,
    `arn:aws:ecs:${region}:${account}:task-definition/teamclaw-*-${deployEnv}:*`,
  ],
}));
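Steps 2-3 can be sketched as a poll-until-IP loop. `EcsLike` and `startUserTask` below are hypothetical stand-ins for the AWS SDK v3 ECS client, flattened so the orchestration logic is clear; real code would dig the IP out of the task's ENI attachment details:

```typescript
// Hypothetical slimmed-down interface standing in for the SDK's
// RunTaskCommand / DescribeTasksCommand responses.
interface EcsLike {
  runTask(params: { taskDefinition: string }): Promise<{ taskArn: string }>;
  describeTasks(taskArn: string): Promise<{ privateIp?: string }>;
}

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Start a Fargate task for the user, then poll until the awsvpc ENI
// reports a private IP that step 4 can register in the ALB target group.
async function startUserTask(ecs: EcsLike, userId: string): Promise<string> {
  const { taskArn } = await ecs.runTask({ taskDefinition: 'teamclaw-user' });
  for (let attempt = 0; attempt < 30; attempt++) {
    const { privateIp } = await ecs.describeTasks(taskArn);
    if (privateIp) return privateIp;
    await sleep(10); // real code would wait ~2s between polls
  }
  throw new Error(`task for ${userId} never received a private IP`);
}
```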

An EventBridge rule triggers idle checking every 15 minutes. Containers idle for 30+ minutes get stopped — real cost savings when Fargate bills per-second.

new aws_events.Rule(this, 'IdleCheckRule', {
  schedule: aws_events.Schedule.rate(Duration.minutes(15)),
}).addTarget(new aws_events_targets.LambdaFunction(lifecycleLambda, {
  event: aws_events.RuleTargetInput.fromObject({ action: 'check-idle', userId: 'system' }),
}));
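The decision the Lambda makes on each check-idle tick reduces to a timestamp comparison. A sketch under assumed names — the session record shape and `tasksToStop` are illustrative, only the 30-minute threshold comes from the design above:

```typescript
// One DynamoDB session record per running container (epoch milliseconds).
interface SessionRecord {
  taskArn: string;
  lastActiveAt: number;
}

const IDLE_LIMIT_MS = 30 * 60 * 1000; // 30 minutes

// Return the task ARNs idle past the limit; each would then get ecs:StopTask.
function tasksToStop(records: SessionRecord[], nowMs: number): string[] {
  return records
    .filter((r) => nowMs - r.lastActiveAt >= IDLE_LIMIT_MS)
    .map((r) => r.taskArn);
}
```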

Stack 4: Admin API

An API Gateway HTTP API with a Cognito JWT authorizer fronting 44 Lambda handlers. Users, teams, containers, API keys, integrations, analytics — all managed through the admin panel.

Cross-stack communication uses SSM parameters exclusively — no CDK exports:

// Stack 2 writes:
new aws_ssm.StringParameter(this, 'ClusterNameParam', {
  parameterName: `/tc/${deployEnv}/ecs/clusterName`,
  stringValue: cluster.clusterName,
});

// Stack 3 reads:
const clusterName = aws_ssm.StringParameter.valueForStringParameter(
  this, `/tc/${deployEnv}/ecs/clusterName`
);

This pattern means stacks are fully independent. No deployment ordering issues, no "stack is in a failed state" nightmares.

The Sidecar: How API Key Isolation Works

The sidecar proxy runs on localhost:3000 inside each container. OpenClaw thinks it's talking to the AI provider. Its config has:

providers: {
  anthropic: { baseUrl: 'http://localhost:3000/anthropic', apiKey: 'proxy-managed' },
  openai:    { baseUrl: 'http://localhost:3000/openai',    apiKey: 'proxy-managed' },
}

The sidecar parses the provider from the URL, reads the key pool from Secrets Manager (cached 60s), round-robins across keys, strips the dummy auth, injects the real key, and forwards upstream. Usage gets logged to DynamoDB per request.
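The per-request core of that logic fits in one function. This is a sketch, not the repo's code: `routeRequest` and the generic `authorization` header are assumptions (real providers differ, e.g. Anthropic uses `x-api-key`), and the actual upstream fetch is omitted:

```typescript
// Pool shape mirrors the Secrets Manager JSON shown earlier.
interface KeyPool {
  providers: Record<string, { keys: string[] }>;
}

// Per-provider round-robin cursor, kept in the sidecar's memory.
const counters: Record<string, number> = {};

// Map e.g. /anthropic/v1/messages -> provider "anthropic" + upstream path,
// and replace the dummy 'proxy-managed' credential with a real pooled key.
function routeRequest(path: string, pool: KeyPool) {
  const [, provider, ...rest] = path.split('/');
  const keys = pool.providers[provider]?.keys;
  if (!keys?.length) throw new Error(`no keys for provider ${provider}`);
  const i = (counters[provider] = ((counters[provider] ?? -1) + 1) % keys.length);
  return {
    provider,
    upstreamPath: '/' + rest.join('/'),
    headers: { authorization: `Bearer ${keys[i]}` }, // real key, injected here
  };
}
```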

The container entrypoint explicitly unsets all provider env vars as its first action:

#!/bin/sh
unset ANTHROPIC_API_KEY OPENAI_API_KEY GOOGLE_API_KEY
node /scripts/generate-config.js
exec openclaw gateway run --port 18789 --auth trusted-proxy

If a skill or MCP tool reads process.env, it finds nothing. The real key only exists in the sidecar's memory, fetched at runtime from Secrets Manager.

Config Hierarchy: Global → Team → User

OpenClaw reads a single openclaw.json. We generate it at startup by merging three tiers:

const globalConfig = loadJson('/efs/system/global-config.json');    // Admin guardrails
const teamConfig   = loadJson(`/efs/teams/${teamId}/team-config.json`); // Team standards
const userConfig   = loadJson(`/efs/users/${userId}/user-config.json`); // Personal prefs

const merged = deepMerge(deepMerge(deepMerge(baseConfig, globalConfig), teamConfig), userConfig);
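A minimal `deepMerge` in the spirit of that step, as an assumption rather than the repo's implementation: later tiers override earlier ones, plain objects merge recursively, arrays and scalars are replaced outright:

```typescript
type Json = Record<string, unknown>;

// Recursively merge `override` into `base` without mutating either.
function deepMerge(base: Json, override: Json): Json {
  const out: Json = { ...base };
  for (const [k, v] of Object.entries(override)) {
    const prev = out[k];
    if (
      v && typeof v === 'object' && !Array.isArray(v) &&
      prev && typeof prev === 'object' && !Array.isArray(prev)
    ) {
      out[k] = deepMerge(prev as Json, v as Json); // nested objects merge
    } else {
      out[k] = v; // scalars and arrays: the later tier wins
    }
  }
  return out;
}
```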

Same for SOUL.md — OpenClaw's system prompt. Admin sets guardrails, team sets coding standards, user sets preferences. All layered, zero conflicts.

What We Learned

ALB listener rules cap at 100 per listener. Each user gets a routing rule. At 100 concurrent users, you need to request a limit increase or rearchitect.

A dedicated Fargate task costs ~$12/user/month if it runs 24/7. With the 30-minute idle stop, real-world cost is ~$3-4/user/month.

The SSM parameter pattern for cross-stack refs is worth it. It's more verbose than CDK exports, but there's zero coupling between stacks.

The sidecar pattern works for any container-based AI tool, not just OpenClaw. If you're running any AI agent that needs API keys, a localhost proxy that injects credentials from Secrets Manager is a clean, provider-agnostic solution.

Source

Apache 2.0: github.com/ChenKuanSun/teamclaw — Nx monorepo, 4 CDK stacks, 992 tests.
