Terraform Your AWS AgentCore

There is no shortage of content on agentic workflows and AI today. But most of it stops at the concept. Today we focus on what actually matters in production: the right platform to host your non-deterministic workloads and the right way to ship it. Enter Amazon Bedrock AgentCore and Terraform.

Introduction

Hey Folks! Today we are going all-in on Amazon Bedrock AgentCore and deploying it the right way, with Terraform.

Amazon Bedrock AgentCore brings together everything you need to run production AI agents on AWS: managed runtimes, a unified tool Gateway, persistent memory, Cedar-based policy enforcement, and identity management. When we set out to build a reference implementation that exercises every one of these capabilities, the deployment choice was obvious. Terraform has been shaping cloud infrastructure for over a decade, and in the age of GenAI it's still the right answer. One monorepo, one terraform apply: three AgentCore runtimes, a Gateway connecting 21 tools, Cognito M2M auth, Cedar-based policy enforcement, and persistent session memory, all wired together through a single dependency graph.

This post covers the architecture, what that command actually orchestrates under the hood, where the hashicorp/aws provider still has gaps in AgentCore coverage, and how we bridged them without stepping outside the IaC boundary.


What is AgentCore?

Before we get into the deployment story, let's briefly cover what we're deploying on top of, because AgentCore is more than just a runtime.

Amazon Bedrock AgentCore is AWS's managed platform for running production AI agents. Instead of stitching together Lambda functions, API Gateway routes, DynamoDB tables, and custom auth flows just to give your agent somewhere to live, AgentCore handles that layer for you. Here's what it brings:

  • AgentCore Runtime - a secure, serverless hosting environment for your agent code, purpose-built for long-running agentic workloads (up to 8 hours), ARM64-based, scales automatically
  • AgentCore Gateway - a fully managed service that converts APIs, Lambda functions, and MCP servers into tools accessible by your agent, with a single unified endpoint, handling both ingress and egress authentication
  • AgentCore Memory - persistent session context across invocations, with short-term (within session) and long-term (across sessions) memory, and summarization strategies for older interactions
  • Policy in AgentCore - Cedar-based access control enforced at the Gateway boundary, controlling which tools an agent can call; deterministic checks that run outside the model, before each call
  • AgentCore Identity - credential and identity management for automated workloads, supporting both user-delegated OAuth and machine-to-machine client credentials flows

What We Built

Full source for reference: github.com/tarlan-huseynov/agentcore-monorepo

The premise was straightforward: what if you could describe the infrastructure you wanted in plain English?

"Create an SQS queue called order-events in eu-central-1 with a 5-minute visibility timeout."

The agent figures out the CloudFormation schema, generates the desired state, explains what it's about to create, waits for your confirmation, and calls the Cloud Control API, which supports over 1,100 AWS resource types. Ask it what it's costing you and it queries Cost Explorer. Ask it what's in the logs and it searches CloudWatch.

The result is an Infrastructure Bootstrapper Agent, a Strands-based AI agent running on Amazon Bedrock AgentCore that manages real AWS infrastructure, analyzes costs, and searches CloudWatch Logs.

The architecture spans three AgentCore runtimes connected through a Gateway:

User Query
    |
Main Runtime          Strands Agent + AgentCore Memory
    |
AgentCore Gateway     Unified tool connectivity + Policy enforcement
    |-- CCAPI Runtime [MCP]         Cloud Control API (14 tools)
    |-- Cost Explorer Runtime [MCP]  AWS Cost Explorer (7 tools)

Standing this up from scratch involves: 3 agent runtimes, 1 Gateway, 2 Gateway targets, 5 IAM roles, 1 Cognito User Pool with M2M client, 1 OAuth2 credential provider, 1 Memory resource with summarization strategy, 3 CloudWatch log groups, 1 S3 bucket, and a Cedar policy engine.

The runtimes:

  1. Main Orchestrator with Strands SDK (main agent)
  2. AgentCore runtime for CCAPI MCP
  3. AgentCore runtime for Cost Explorer MCP

Architecture

AgentCore Infrastructure Bootstrapper Architecture Diagram


Why Not Console or CLI?

The console works fine for a spike. The moment you need to reproduce it (different region, colleague onboarding, teardown after a demo), you're reconstructing from screenshots and memory.

The AWS CLI is scriptable but stateless. You run aws bedrock-agentcore-control create-agent-runtime, it works, you move on. A week later you have no idea whether the policy engine is still attached to the Gateway. terraform plan tells you; aws bedrock-agentcore-control get-* tells you one resource at a time.

The real problem with a multi-resource setup like this is ordering and dependencies. The Gateway needs the runtimes before targets can be registered, the policy engine needs the Gateway, and the main runtime needs the Gateway URL injected as an env var at creation time. Getting this right with scripts means maintaining your own dependency logic. With Terraform, you declare the references and the graph sorts it out. 😎
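A minimal sketch of how plain attribute references encode that ordering (resource and attribute names here are illustrative, mirroring the snippets later in this post):

```hcl
# Created only after the MCP runtimes exist, because its
# target registration references their ARNs downstream.
resource "aws_bedrockagentcore_gateway" "main" {
  name = "infra-bootstrapper-gateway"
}

# Reading gateway_url creates an implicit edge in the graph:
# Terraform orders the main runtime after the Gateway, no
# depends_on or hand-written sequencing required.
resource "aws_bedrockagentcore_agent_runtime" "main" {
  environment_variables = {
    GATEWAY_URL = aws_bedrockagentcore_gateway.main.gateway_url
  }
}
```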


What One terraform apply Actually Does

The entry point uses the standard hashicorp/aws provider:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 6.32"
    }
  }
}

From there, terraform apply runs in dependency order:

  1. S3 bucket created
  2. null_resource.build packages the Python dependencies for ARM64, uploads three ZIPs
  3. IAM roles, Cognito User Pool, OAuth2 credential provider created in parallel
  4. All three runtimes created, pointing at their S3 ZIPs
  5. AgentCore Gateway created
  6. null_resource.gateway_targets registers runtimes as Gateway targets
  7. AgentCore Memory + summarization strategy created
  8. null_resource.policy_setup creates the Cedar policy engine and attaches it to the Gateway

No manual steps. No README instructions that someone skips.
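Step 2 is the only imperative piece in that chain. A sketch of what such a build resource can look like, assuming a scripts/build.sh packaging script (the trigger expression and script name are illustrative, not the repo's exact code):

```hcl
resource "null_resource" "build" {
  # Combined hash of the agent source tree: any code change
  # produces a new hash, which re-runs the build and flows
  # into every resource that references this trigger.
  triggers = {
    source_hash = sha256(join("", [
      for f in fileset("${path.module}/../src", "**/*.py") :
      filesha256("${path.module}/../src/${f}")
    ]))
  }

  provisioner "local-exec" {
    # Builds the three ARM64 ZIPs and uploads them to S3.
    command     = "bash scripts/build.sh"
    working_dir = "${path.module}/.."
  }
}
```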

The _CODE_VERSION Trick

AgentCore caches the S3 ZIP when a runtime is first created. Uploading a new ZIP to S3 does nothing on its own; the runtime won't pick it up unless its config changes.

The fix is injecting the source hash as an environment variable. When code changes, the hash changes, the env var changes, and AgentCore detects a config update and re-fetches from S3:

resource "aws_bedrockagentcore_agent_runtime" "main" {
  environment_variables = {
    # When code changes -> hash changes -> env var changes ->
    # AgentCore detects config update and re-fetches ZIP from S3
    _CODE_VERSION = null_resource.build.triggers.source_hash
    GATEWAY_URL   = aws_bedrockagentcore_gateway.main.gateway_url
  }
}

Each MCP runtime has its own hash. Change ccapi_entrypoint.py and only the CCAPI runtime redeploys.
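That per-runtime granularity falls out of keying each runtime's _CODE_VERSION to its own file hash, roughly like this (resource and path names illustrative):

```hcl
resource "aws_bedrockagentcore_agent_runtime" "ccapi" {
  environment_variables = {
    # Changes only when the CCAPI entrypoint itself changes,
    # so the other two runtimes are left untouched.
    _CODE_VERSION = filesha256("${path.module}/../mcp_servers/ccapi_entrypoint.py")
  }
}
```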


The Gaps and Their Real Costs

The hashicorp/aws provider is mature and battle-tested. AgentCore is a new service and full provider coverage hasn't caught up yet. Two resource types required CLI workarounds, and one required a lifecycle hack. 😊

Gap 1: Gateway Targets, Missing grantType

The Gateway uses Cognito Bearer tokens to authenticate outbound calls to MCP runtimes. That requires grantType: CLIENT_CREDENTIALS on the target credential config. As of ~> 6.32, the provider drops this field silently: no error, it just doesn't work.

Workaround: manage targets via a null_resource (last resort solution) that calls an AWS CLI script, triggered by content hashes:

resource "null_resource" "gateway_targets" {
  triggers = {
    ccapi_arn  = aws_bedrockagentcore_agent_runtime.ccapi.agent_runtime_arn
    cost_arn   = aws_bedrockagentcore_agent_runtime.cost_explorer.agent_runtime_arn
    ccapi_code = filesha256("${path.module}/../mcp_servers/ccapi_entrypoint.py")
    cost_code  = filesha256("${path.module}/../mcp_servers/cost_entrypoint.py")
  }

  provisioner "local-exec" {
    command     = "bash scripts/setup_targets.sh"
    working_dir = "${path.module}/.."
    environment = {
      GATEWAY_ID              = aws_bedrockagentcore_gateway.main.gateway_id
      CCAPI_RUNTIME_ARN       = aws_bedrockagentcore_agent_runtime.ccapi.agent_runtime_arn
      CREDENTIAL_PROVIDER_ARN = aws_bedrockagentcore_oauth2_credential_provider.gateway_m2m.credential_provider_arn
      SCOPES                  = "mcp/invoke"
      REGION                  = local.region
    }
  }
}

The cost: null_resource outputs are opaque to terraform plan. You can't preview what the script will do before it runs, and debugging means reading shell output in the apply log. The script itself has to be idempotent: check-before-create logic you write and maintain.

Gap 2: Policy in AgentCore, No Resource Exists

There is no aws_bedrockagentcore_policy_engine resource in the provider at all. Same approach, null_resource with a shell script that calls bedrock-agentcore-control via AWS CLI:

resource "null_resource" "policy_setup" {
  triggers = {
    policy_hash = filesha256("${path.module}/policies/safety.cedar")
    gateway_id  = aws_bedrockagentcore_gateway.main.gateway_id
  }

  provisioner "local-exec" {
    command     = "bash scripts/setup_policy.sh"
    working_dir = "${path.module}/.."
    environment = {
      GATEWAY_ID  = aws_bedrockagentcore_gateway.main.gateway_id
      GATEWAY_ARN = aws_bedrockagentcore_gateway.main.gateway_arn
    }
  }

  lifecycle {
    replace_triggered_by = [
      aws_bedrockagentcore_gateway.main,
      null_resource.gateway_targets,
    ]
  }
}

That replace_triggered_by is easy to miss. Without it, any Gateway update silently detaches the policy engine; Terraform won't flag it, the next terraform plan shows nothing wrong, and your Cedar safety rules are quietly gone until you notice.

The cost: terraform destroy doesn't clean up the policy engine or targets. You need separate teardown scripts or manual cleanup.
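One way to narrow that gap without leaving Terraform is a destroy-time provisioner on the same null_resource. A sketch, assuming a scripts/teardown_policy.sh cleanup script (the script name is an assumption; the repo may handle teardown differently):

```hcl
resource "null_resource" "policy_setup" {
  # ... triggers and the creation provisioner stay as-is ...

  provisioner "local-exec" {
    # Runs on `terraform destroy` and on replacement. Destroy-time
    # provisioners can only reference self, so the Gateway ID has
    # to come from the triggers map, not another resource.
    when        = destroy
    command     = "bash scripts/teardown_policy.sh"
    working_dir = "${path.module}/.."
    environment = {
      GATEWAY_ID = self.triggers.gateway_id
    }
  }
}
```

The self-only restriction is why gateway_id was already worth putting in triggers: it doubles as the handle the teardown script needs.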

Gap 3: Gateway Drift on Two Fields

description and protocol_configuration aren't read back from the API after creation, so every terraform plan shows them as changed. Without the ignore_changes below, each apply would re-create the Gateway and silently detach the policy engine every single time:

lifecycle {
  ignore_changes = [description, protocol_configuration]
}

The cost: those two fields are now unmanaged by Terraform. Console changes to them won't be caught by terraform plan.


Why the Workarounds Are Still Worth It

Every script is hash-triggered, idempotent, and wired into the same dependency graph as the rest of the stack. They run at the right time, in the right order, automatically, as part of terraform apply, not after it.

The state blindspot is real but narrow: two resources that change rarely and have simple, easy-to-verify state. The alternative, a README with post-apply manual steps, is the worse trade-off in practice because those steps get skipped.

When the provider catches up, the null_resource blocks become clean resource declarations and the shell scripts get deleted. 🚀


Bonus: The ARM64 Gotcha

AgentCore runtimes run on Graviton (ARM64). Build your Python dependencies on an x86 machine and you'll get silent import errors at runtime. The packaging script uses uv with explicit platform targeting:

uv pip install \
  --python-platform aarch64-manylinux2014 \
  --python-version "3.12" \
  --target="$BUILD_DIR" \
  --only-binary=:all: \
  -r requirements.txt  # dependency list; filename illustrative

--only-binary=:all: is the critical flag. Without it, packages without ARM64 wheels fall back to compiling from source on your host architecture.

One more thing: AgentCore extracts code to /var/task, which is read-only. One of the upstream MCP packages writes a schema cache to its own package directory at import time: an instant PermissionError on startup. The fix is a patched file that redirects the cache to /tmp, applied during packaging before the ZIP is built.


Farewell 😊


We covered what AgentCore brings to the table, why Terraform is still the right deployment story for it, where the hashicorp/aws provider currently has gaps in AgentCore coverage, and how to bridge them without leaving the IaC boundary.

Keep building, keep automating, and let the dependency graph do the ordering! 🚀
