hugolesta

Posted on Jul 3

Give Your LLM Hands: Bedrock Agents, Lambda Tools, and the MCP Pattern in Terraform

#terraform #aws #bedrock #agents

The Knowledge Base from the previous post gives the LLM memory. But memory is passive — the model can tell you what the runbook says, but it can't go and create the Confluence page, look up the Jira ticket, or ping Slack. For that you need an agent: a model that can reason over a goal, decide which tool to call, call it, inspect the result, and loop until the task is done.

This is what Bedrock Agents does. The model driving it is Anthropic Claude; the "tools" it invokes are Lambda functions you own; the schema that tells the model what each function is called, what it does, and which parameters it accepts lives in Terraform as a function_schema block. It's the same idea as the Model Context Protocol (MCP): named tools with typed parameters, just deployed as AWS-managed infrastructure instead of a local sidecar.

The agent covered here — an incident reporter — takes a PagerDuty incident ID, fetches details, searches Jira for related tickets, writes a Confluence incident page, and posts a Slack notification. Four action groups, four Lambdas, one Knowledge Base, one Terraform file.

Architecture

flowchart TD
    A["Caller — incident ID"] -->|"InvokeAgent"| B["Bedrock Agent — Claude Sonnet — eu cross-region"]
    B -->|"Retrieve context"| C[("Knowledge Base — Aurora pgvector")]
    B -->|"action group"| D["Lambda — pagerduty-tools — get_incident — list_incidents — get_incident_alerts"]
    B -->|"action group"| E["Lambda — jira-tools — search_issues — get_issue"]
    B -->|"action group"| F["Lambda — confluence-tools — create_page — get_page — update_page"]
    B -->|"action group"| G["Lambda — slack-tools — send_message — find_channel"]
    D -->|"PagerDuty API"| H["PagerDuty"]
    E -->|"Jira API"| I["Jira"]
    F -->|"Confluence API"| J["Confluence"]
    G -->|"Slack API"| K["Slack"]
    D & E & F & G -->|"tool results"| B
    B -->|"final response"| A

Each Lambda reads its API credentials from Secrets Manager at runtime — no credentials in environment variables, no secrets in Terraform state beyond the ARN.

The Lambda tools pattern

Bedrock calls this concept "action groups". I find it cleaner to think of it as MCP: the agent has a set of named functions with typed parameters, it decides which ones to call based on the task, calls them, and incorporates the responses. The Lambda is the implementation; the Terraform function_schema block is the schema the model reads.

Every tool Lambda in this setup follows the same structure:

A module call that provisions the Lambda (runtime, memory, timeout, S3 source, env vars)
An aws_lambda_permission that allows bedrock.amazonaws.com to invoke it
An aws_bedrockagent_agent_action_group that declares the function schema

The permission and the action group are separate resources — the permission is at the Lambda level, the action group is at the agent level. Both are required.

Secrets handling

Each Lambda that calls an external API gets the secret name as an environment variable, not the secret value. The Lambda's IAM role has secretsmanager:GetSecretValue on that specific secret ARN. This means:

Rotating a token is a Secrets Manager operation, not a Terraform apply
The secret value never appears in a plan output or state file
CloudWatch logs show PAGERDUTY_SECRET_NAME=pagerduty-api-token, not a token
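Inside the Lambda, the lookup could look something like the sketch below. It is not the code from the tools in this post: SecretCache, fetch_from_secrets_manager, get_pagerduty_token and the 300-second TTL are illustrative assumptions. The only names taken from the setup are the PAGERDUTY_SECRET_NAME variable and the Secrets Manager GetSecretValue call. The cache lives in module scope so warm invocations reuse the value, and the TTL keeps a rotated token from being ignored forever.

```python
import os
import time
from typing import Callable, Dict, Tuple


def fetch_from_secrets_manager(secret_name: str) -> str:
    """Fetch the current secret string from AWS Secrets Manager."""
    import boto3  # imported lazily so the cache logic can be exercised without AWS

    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_name)["SecretString"]


class SecretCache:
    """Hypothetical helper: keeps secrets in memory and refetches after a TTL."""

    def __init__(
        self,
        fetch: Callable[[str], str] = fetch_from_secrets_manager,
        ttl_seconds: float = 300.0,  # assumed value, tune to your rotation policy
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, secret_name: str) -> str:
        now = self._clock()
        cached = self._entries.get(secret_name)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: no Secrets Manager call
        value = self._fetch(secret_name)
        self._entries[secret_name] = (now, value)
        return value


# Module scope: reused across warm invocations of the same Lambda container.
_secrets = SecretCache()


def get_pagerduty_token() -> str:
    # The env var holds the secret NAME, never the secret value.
    return _secrets.get(os.environ["PAGERDUTY_SECRET_NAME"])
```

Injecting fetch and clock is only there to make the expiry logic easy to test; in the Lambda you would just call get_pagerduty_token().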

locals {
  lambdas = {
    pagerduty  = "bedrock-agent-pagerduty-tools"
    jira       = "bedrock-agent-jira-tools"
    confluence = "bedrock-agent-confluence-tools"
    slack      = "bedrock-agent-slack-tools"
  }
}

Lambda module + permission (PagerDuty as example)

module "lambda_pagerduty_tools" {
  source  = "your-org/lambda/aws"
  version = "~> 2.15.0"

  env     = local.env
  project = local.project
  vcs     = local.vcs
  owner   = local.owner

  function_name      = local.lambdas.pagerduty
  handler            = "lambda_function.lambda_handler"
  runtime            = "python3.13"
  networking_enabled = false
  cloudwatch_enabled = true
  memory_size        = 256
  timeout            = 30

  s3_bucket = "your-artifacts-bucket"
  s3_key    = "agents/${local.lambdas.pagerduty}/${local.lambdas.pagerduty}.zip"

  env_vars = {
    PAGERDUTY_SECRET_NAME = module.pagerduty_secret.secret_name
    LOG_LEVEL             = "INFO"
  }

  depends_on = [module.pagerduty_secret]
}

resource "aws_lambda_permission" "bedrock_pagerduty_tools" {
  statement_id  = "AllowBedrockInvoke"
  action        = "lambda:InvokeFunction"
  function_name = "${local.project}-${local.lambdas.pagerduty}-${local.env}"
  principal     = "bedrock.amazonaws.com"
  source_arn    = "arn:aws:bedrock:${local.aws_region}:${local.aws_account_id}:agent/*"
  depends_on    = [module.lambda_pagerduty_tools]
}

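The source_arn uses agent/*, a wildcard over all agents in this account and region. For tighter scoping, replace the wildcard with the specific agent's ARN (aws_bedrockagent_agent.incident_reporter.agent_arn). In this layout the agent resource does not reference the Lambda or its permission, so that reference should not create a cycle; if your own wiring does make the agent depend on the permission, you'll have to break the loop with a two-phase apply.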

The Agent resource

resource "aws_bedrockagent_agent" "incident_reporter" {
  agent_name                  = "${local.project}-incident-reporter-agent-${local.env}"
  agent_resource_role_arn     = aws_iam_role.incident_reporter_agent.arn
  foundation_model            = "eu.anthropic.claude-sonnet-4-6"
  description                 = "Orchestrates PagerDuty, Jira, Confluence, and Slack for automated incident documentation"
  idle_session_ttl_in_seconds = 600

  instruction = <<-EOT
    You are an AI Incident Reporter for the Cloud Platform team.

    When given a PagerDuty incident, complete ALL of the following steps in order:

    1. FETCH INCIDENT DETAILS
       - Use the pagerduty-tools action group to get full incident details and alerts
       - Extract: title, description, severity, affected service, timeline, assignees

    2. FIND RELATED JIRA TICKETS
       - Use the jira-tools action group to search for related tickets
       - DO NOT create new Jira tickets — only retrieve existing ones

    3. CREATE CONFLUENCE INCIDENT PAGE
       - Use the confluence-tools action group to create a new incident page
       - Include: title, severity, timeline, PagerDuty link, related Jira tickets

    4. NOTIFY SLACK
       - Use the slack-tools action group to send a message to the incident channel
       - Include: summary, severity, Confluence URL, Jira links

    5. RETURN SUMMARY
       - Provide a final summary with all links created

    Guidelines:
    - Always complete all steps — do not skip any
    - If a step fails, report the error and continue with remaining steps
    - Use the knowledge base to understand infrastructure context when relevant
  EOT

  tags = local.default_tags

  depends_on = [aws_iam_role_policy_attachment.incident_reporter_agent]
}

A few things worth noting about this resource:

foundation_model: the eu. prefix selects a cross-region inference profile, which lets the agent use model capacity across several EU regions (for example eu-west-1, eu-central-1, eu-west-3) automatically. Without this prefix you're locked to a single region's available capacity. Use it if you're in an EU region.

instruction: this is the system prompt. It's what tells the model the task structure, step ordering, and guard rails. Keep it procedural and explicit: models tend to follow numbered steps far more reliably than loose prose. Vague instructions produce vague agents.

idle_session_ttl_in_seconds: how long a conversation session stays alive between calls. 600 seconds (10 minutes) is reasonable for a human-in-the-loop flow; drop it lower if you're doing fully automated invocations where sessions shouldn't accumulate.

Agent IAM

The agent's execution role needs three permissions:

data "aws_iam_policy_document" "incident_reporter_agent" {
  statement {
    sid    = "InvokeModel"
    effect = "Allow"
    actions = [
      "bedrock:InvokeModel",
      "bedrock:InvokeModelWithResponseStream",
    ]
    resources = ["*"]
  }

  statement {
    sid    = "KBRetrieve"
    effect = "Allow"
    actions = ["bedrock:Retrieve"]
    resources = [
      "arn:aws:bedrock:${local.aws_region}:${local.aws_account_id}:knowledge-base/${var.knowledge_base_id}",
    ]
  }

  statement {
    sid    = "InvokeLambdas"
    effect = "Allow"
    actions = ["lambda:InvokeFunction"]
    resources = [
      "arn:aws:lambda:${local.aws_region}:${local.aws_account_id}:function:${local.project}-${local.lambdas.pagerduty}-${local.env}",
      "arn:aws:lambda:${local.aws_region}:${local.aws_account_id}:function:${local.project}-${local.lambdas.jira}-${local.env}",
      "arn:aws:lambda:${local.aws_region}:${local.aws_account_id}:function:${local.project}-${local.lambdas.confluence}-${local.env}",
      "arn:aws:lambda:${local.aws_region}:${local.aws_account_id}:function:${local.project}-${local.lambdas.slack}-${local.env}",
    ]
  }
}

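InvokeModel on * is the simplest option that works. Bedrock Agents uses this permission to invoke the foundation model internally, and with a cross-region inference prefix the request can be served from any region in the profile, so a single-region model ARN is not enough. If you want some constraint, scope the statement to arn:aws:bedrock:*::foundation-model/* plus the inference profile ARN in your own account (arn:aws:bedrock:${local.aws_region}:${local.aws_account_id}:inference-profile/*). The foundation-model pattern on its own may be rejected when the agent calls the model through the profile.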

Action groups with function schemas

This is where the MCP analogy is most visible. The function_schema block is effectively a tool manifest: the model reads it at runtime to understand what each function is named, what parameters it expects, and what they mean.

resource "aws_bedrockagent_agent_action_group" "pagerduty_tools" {
  agent_id          = aws_bedrockagent_agent.incident_reporter.agent_id
  agent_version     = "DRAFT"
  action_group_name = "pagerduty-tools"
  description       = "PagerDuty tools: get_incident, list_incidents, get_incident_alerts"

  action_group_executor {
    lambda = "arn:aws:lambda:${local.aws_region}:${local.aws_account_id}:function:${local.project}-${local.lambdas.pagerduty}-${local.env}"
  }

  function_schema {
    member_functions {
      functions {
        name        = "get_incident"
        description = "Retrieve full details of a single PagerDuty incident by its ID"
        parameters {
          map_block_key = "incident_id"
          type          = "string"
          description   = "The PagerDuty incident ID (e.g. P1234AB)"
          required      = true
        }
      }
      functions {
        name        = "list_incidents"
        description = "List PagerDuty incidents, optionally filtered by status"
        parameters {
          map_block_key = "statuses"
          type          = "string"
          description   = "Comma-separated statuses to filter by: triggered, acknowledged, resolved"
          required      = false
        }
        parameters {
          map_block_key = "limit"
          type          = "integer"
          description   = "Maximum number of incidents to return (default: 20)"
          required      = false
        }
      }
      functions {
        name        = "get_incident_alerts"
        description = "Get all alerts associated with a PagerDuty incident"
        parameters {
          map_block_key = "incident_id"
          type          = "string"
          description   = "The PagerDuty incident ID"
          required      = true
        }
      }
    }
  }

  depends_on = [
    module.lambda_pagerduty_tools,
    aws_lambda_permission.bedrock_pagerduty_tools,
  ]
}
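On the Lambda side, each function declared above has to be matched by code that reads Bedrock's event and answers in the envelope the agent expects. With a function_schema action group, the event carries the action group name, the function name, and the parameters as a list of name/type/value entries (values arrive as strings); the response echoes actionGroup and function and puts the result under functionResponse. The sketch below is a minimal illustration, not the Lambda from this post: parse_parameters, build_response, HANDLERS and the stubbed get_incident body are all hypothetical.

```python
import json
from typing import Any, Callable, Dict


def parse_parameters(event: Dict[str, Any]) -> Dict[str, Any]:
    """Turn Bedrock's [{name, type, value}, ...] list into a typed dict."""
    casters: Dict[str, Callable[[str], Any]] = {
        "integer": int,
        "number": float,
        "boolean": lambda v: str(v).lower() == "true",
    }
    params: Dict[str, Any] = {}
    for p in event.get("parameters", []):
        cast = casters.get(p.get("type", "string"), str)  # strings and arrays stay as text
        params[p["name"]] = cast(p["value"])
    return params


def build_response(event: Dict[str, Any], body: Any) -> Dict[str, Any]:
    """Wrap a tool result in the response envelope for function-schema action groups."""
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {
                "responseBody": {"TEXT": {"body": json.dumps(body)}},
            },
        },
    }


def get_incident(incident_id: str) -> Dict[str, Any]:
    # Placeholder: a real tool would call the PagerDuty API here.
    return {"id": incident_id, "status": "triggered"}


# Keys must match the function names declared in function_schema.
HANDLERS: Dict[str, Callable[..., Any]] = {"get_incident": get_incident}


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    handler = HANDLERS.get(event["function"])
    if handler is None:
        return build_response(event, {"error": f"unknown function {event['function']}"})
    return build_response(event, handler(**parse_parameters(event)))
```

Keeping the dispatch table keyed by the schema's function names means a rename in Terraform that isn't mirrored in the Lambda shows up as an explicit "unknown function" result rather than a silent mismatch.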

The Jira action group is read-only by design: search_issues and get_issue only. This isn't enforced at the Lambda level (you could build a write-capable Lambda and only expose read functions here), but the intent is captured in both the description field and the agent's instruction. Two layers of "don't create tickets" beat one.

resource "aws_bedrockagent_agent_action_group" "jira_tools" {
  agent_id          = aws_bedrockagent_agent.incident_reporter.agent_id
  agent_version     = "DRAFT"
  action_group_name = "jira-tools"
  description       = "Jira tools (read-only): search_issues, get_issue — no ticket creation"

  action_group_executor {
    lambda = "arn:aws:lambda:${local.aws_region}:${local.aws_account_id}:function:${local.project}-${local.lambdas.jira}-${local.env}"
  }

  function_schema {
    member_functions {
      functions {
        name        = "search_issues"
        description = "Search Jira issues using JQL"
        parameters {
          map_block_key = "jql"
          type          = "string"
          description   = "JQL query string"
          required      = false
        }
        parameters {
          map_block_key = "max_results"
          type          = "integer"
          description   = "Maximum number of results to return (default: 10)"
          required      = false
        }
        parameters {
          map_block_key = "fields"
          type          = "string"
          description   = "Comma-separated list of fields to include"
          required      = false
        }
      }
      functions {
        name        = "get_issue"
        description = "Retrieve full details of a single Jira issue by its key"
        parameters {
          map_block_key = "issue_key"
          type          = "string"
          description   = "The Jira issue key (e.g. OPS-1234)"
          required      = true
        }
      }
    }
  }

  depends_on = [
    module.lambda_jira_tools,
    aws_lambda_permission.bedrock_jira_tools,
  ]
}
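If you do want the read-only intent enforced inside the Lambda as well, one option is an explicit allowlist checked before any Jira call is made. The sketch below is an assumption about how that could look, not part of the original setup: READ_ONLY_FUNCTIONS and reject_if_not_read_only are hypothetical names. It uses the optional responseState field of the function response; REPROMPT hands the message back to the model so it can choose another tool, while FAILURE would abort the session instead.

```python
import json
from typing import Any, Dict, Optional

# Mirrors the two functions exposed by the jira-tools action group.
READ_ONLY_FUNCTIONS = frozenset({"search_issues", "get_issue"})


def reject_if_not_read_only(event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a REPROMPT response for functions outside the allowlist, else None."""
    name = event.get("function", "")
    if name in READ_ONLY_FUNCTIONS:
        return None  # allowed: the caller continues with normal dispatch
    message = {"error": f"'{name}' is not permitted: jira-tools is read-only"}
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup", ""),
            "function": name,
            "functionResponse": {
                "responseState": "REPROMPT",
                "responseBody": {"TEXT": {"body": json.dumps(message)}},
            },
        },
    }
```

Called as the first line of the handler (return the result if it is not None), this gives a third layer on top of the description and the instruction, and it keeps working even if someone later adds a write function to the schema by mistake.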

Attaching the Knowledge Base

One resource to connect the Knowledge Base built in the previous post:

resource "aws_bedrockagent_agent_knowledge_base_association" "incident_reporter_kb" {
  agent_id             = aws_bedrockagent_agent.incident_reporter.agent_id
  knowledge_base_id    = var.knowledge_base_id
  description          = "Internal infrastructure and platform documentation"
  knowledge_base_state = "ENABLED"
}

With this association in place, the agent can query the Knowledge Base during orchestration and feed the retrieved chunks into its reasoning alongside tool results. The instruction's "use the knowledge base to understand infrastructure context when relevant" is the trigger: the model decides when to retrieve, not you.

Agent alias and SSM parameters

resource "aws_bedrockagent_agent_alias" "incident_reporter" {
  agent_id         = aws_bedrockagent_agent.incident_reporter.agent_id
  agent_alias_name = "live"
  description      = "Live alias for the Incident Reporter agent"
  tags             = local.default_tags

  depends_on = [
    aws_bedrockagent_agent_action_group.pagerduty_tools,
    aws_bedrockagent_agent_action_group.jira_tools,
    aws_bedrockagent_agent_action_group.confluence_tools,
    aws_bedrockagent_agent_action_group.slack_tools,
    aws_bedrockagent_agent_knowledge_base_association.incident_reporter_kb,
  ]
}

resource "aws_ssm_parameter" "incident_reporter_agent_id" {
  name  = "/${local.env}/incident-reporter/agent_id"
  type  = "String"
  value = aws_bedrockagent_agent.incident_reporter.agent_id
}

resource "aws_ssm_parameter" "incident_reporter_agent_alias_id" {
  name  = "/${local.env}/incident-reporter/agent_alias_id"
  type  = "String"
  value = aws_bedrockagent_agent_alias.incident_reporter.agent_alias_id
}

The alias is required to invoke the agent from application code: you always invoke via agentId + agentAliasId. A named alias decouples callers from the internal version counter. Edits to action groups land on the DRAFT version; to ship them you publish a new numbered version and point the alias at it, and callers keep using the same alias ID without any configuration change.

The depends_on on the alias resource is load-bearing. Without it, Terraform might try to create the alias before all action groups exist, which causes Bedrock to create the alias pointing at an agent that's still missing tools.
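For callers, the two SSM parameters are all that's needed to reach the agent. Here is a rough sketch of what an invocation might look like with boto3, assuming the parameter paths defined above. The function names, the prompt text, and the choice of a fresh session ID per call are illustrative, not taken from the original project. InvokeAgent returns an event stream, so the response text has to be assembled from the chunk events.

```python
import uuid
from typing import Any, Dict, Iterable


def collect_completion(events: Iterable[Dict[str, Any]]) -> str:
    """Join the chunk events of an InvokeAgent stream into one string."""
    parts = []
    for event in events:
        chunk = event.get("chunk")
        if chunk and "bytes" in chunk:
            parts.append(chunk["bytes"])
    # Decode once at the end so a multi-byte character split across chunks is safe.
    return b"".join(parts).decode("utf-8")


def report_incident(incident_id: str, env: str) -> str:
    """Hypothetical caller: look up the agent IDs in SSM, then invoke the agent."""
    import boto3  # imported lazily so collect_completion can be tested without AWS

    ssm = boto3.client("ssm")
    runtime = boto3.client("bedrock-agent-runtime")

    def param(name: str) -> str:
        path = f"/{env}/incident-reporter/{name}"
        return ssm.get_parameter(Name=path)["Parameter"]["Value"]

    response = runtime.invoke_agent(
        agentId=param("agent_id"),
        agentAliasId=param("agent_alias_id"),
        sessionId=str(uuid.uuid4()),  # one session per incident run (assumption)
        inputText=f"Document PagerDuty incident {incident_id}",
    )
    return collect_completion(response["completion"])
```

Because callers only ever see the alias ID from SSM, repointing the alias at a new version needs no change on their side.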

Field notes

agent_version = "DRAFT" on all action groups. Action groups always attach to DRAFT. When you publish a new agent version (via alias routing), Bedrock snapshots the DRAFT config. If you try to attach action groups to a numbered version, the API rejects it. This trips up people who try to version-lock action groups for safety.
The function description matters more than you'd think. The model reads description fields to decide which tool to call and when. Vague descriptions ("does stuff with PagerDuty") produce agents that hallucinate which tool to invoke. Write descriptions as if you're writing API docs for a junior engineer — precise verbs, clear scope.
Parameter types are limited: string, integer, boolean, number, array. No nested objects. If your Lambda needs structured input (a list of filter criteria, a nested config), serialize it as JSON in a string parameter and document the shape in the description. Ugly but it works.
Lambda cold starts extend agent response time noticeably. Each tool call is a synchronous Lambda invocation, so four cold starts in a chain add four cold-start penalties to a single agent response. Configure provisioned concurrency on the most-called tools if response time matters.
The depends_on on the alias is not optional. Terraform's parallelism will race the alias creation against action group creation without it. The apply succeeds but the agent alias captures an incomplete version with missing tools. You won't notice until runtime.
Cross-region inference (eu. prefix) changes the IAM picture. With the cross-region prefix, the model ARN the agent resolves to at runtime isn't necessarily in your account's region: it's wherever AWS routes capacity. InvokeModel on * is the pragmatic choice here. If you scope it down, cover both arn:aws:bedrock:*::foundation-model/* and the inference profile ARN in your own account, since the call goes through the profile.
Secrets Manager reads happen at every Lambda invocation if the handler fetches the secret on each call. For high-throughput agents this adds latency and cost. Cache the secret in the Lambda's global scope with a TTL if invocations are frequent; just don't cache indefinitely or token rotation won't take effect.
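To make the parameter-type note concrete: since the schema has no nested object type, the Lambda ends up decoding strings. The helpers below are a minimal sketch of that decoding, with hypothetical names (parse_list_parameter, parse_json_parameter) and a made-up filter shape; they are not from the original tools. The first accepts either a JSON array or the comma-separated form used by parameters such as statuses and fields. The second decodes a JSON object and checks for the keys you documented in the parameter description.

```python
import json
from typing import Any, Dict, Iterable, List


def parse_list_parameter(raw: str) -> List[Any]:
    """Accept a JSON array string or a comma-separated string."""
    text = raw.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)  # raises ValueError on malformed JSON
    return [item.strip() for item in text.split(",") if item.strip()]


def parse_json_parameter(raw: str, required_keys: Iterable[str] = ()) -> Dict[str, Any]:
    """Decode a JSON object passed as a string parameter and check its shape."""
    value = json.loads(raw)
    if not isinstance(value, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required_keys if key not in value]
    if missing:
        raise ValueError(f"missing keys: {', '.join(missing)}")
    return value
```

Raising ValueError with a specific message is deliberate: returned to the agent as the tool result, it gives the model something concrete to correct on its next attempt.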

The Knowledge Base gave the LLM a read-only window into your documentation. The agent and its Lambda tools give it hands: it can fetch, create, and notify across your tooling stack in a single invocation. The infrastructure for it is one Terraform file, four Lambda deployments, and an IAM role with three statements.

Top comments (1)

Raju Dandigam • Jul 3

Framing the Terraform function_schema layer as the MCP pattern in managed AWS infrastructure is a useful translation for teams that understand cloud primitives better than agent buzzwords. The incident-reporter example also lands because it shows the real complexity: not “call one tool,” but coordinate PagerDuty, Jira, Confluence, and Slack without losing the shape of the action chain. In practice that is where tool-call tracing becomes just as important as the schema itself, because once one action group misfires you need to know whether the bug was in selection, arguments, or downstream state. That is the same observability gap agent-inspect is aimed at from the local-first side. Curious whether you see Bedrock’s action groups staying ergonomic once the number of tools grows, or whether teams eventually need a higher-level harness around them.