DEV Community: AWS Heroes

Lambda Managed Instances with Terraform: Multi-Concurrency, High Memory, and Compute Options

Darryl Ruggles — Fri, 29 May 2026 23:45:10 +0000

Lambda has always been one request at a time per execution environment. Your function starts, processes a single invocation, and sits idle until the next one arrives. If you need to handle a thousand concurrent requests, Lambda spins up a thousand execution environments - each with its own memory, its own cold start, and its own per-GB-second bill.

Lambda Managed Instances changes that model. Announced at re:Invent 2025 and expanded with 32 GB memory / 16 vCPU support in March 2026, LMI runs your functions on EC2 instances in your VPC with AWS handling provisioning, patching, scaling, and load balancing. Each execution environment handles multiple concurrent requests. You keep the Lambda programming model and gain EC2 hardware selection and pricing.

I built a product similarity engine to explore how this works in practice. The handler loads a product catalog with Nova embeddings via Bedrock into memory, uses Amazon Nova Multimodal Embeddings to embed incoming search queries, and computes cosine similarity across categories in parallel using ThreadPoolExecutor. It's the kind of workload that doesn't fit well on standard Lambda - sustained throughput, memory-intensive, with a mix of I/O (Bedrock API calls) and CPU (vector math) that benefits from multi-concurrency and configurable memory-to-vCPU ratios. The project uses Terraform for infrastructure, Python 3.14 with Powertools for observability, and the embedding model is configurable (Nova by default, Titan Text Embeddings V2 as an alternative).

The source code is on GitHub: lambda-managed-instances-similarity-engine

The AWS Compute Continuum

Before diving into the implementation, it helps to understand where Lambda Managed Instances fits in the AWS compute landscape. The options form a continuum from fully managed to fully self-managed:

	Standard Lambda	Lambda Managed Instances	ECS Express Mode	ECS Fargate	EKS
Scaling	Per-invocation, instant	Async, CPU-based and concurrency saturation	Traffic-based, auto	Task-based, minutes	Pod-based, minutes
Concurrency	1 per environment	Multiple per environment	Configurable	Configurable	Configurable
Pricing	Per-request + GB-second	Per-request + EC2 + 15% mgmt fee	Fargate + ALB	Per-vCPU-hour	EC2/Fargate + control plane
Commitment discounts	None	Savings Plans, Reserved Instances	Fargate Savings Plans	Fargate Savings Plans	EC2 Savings, RIs
Cold start	Milliseconds-seconds	Tens of seconds (new instances)	Minutes	Minutes	Minutes
Max invocation	15 minutes	15 minutes (environments long-lived, instances rotated by Lambda)	No limit	No limit	No limit
VPC	Optional	Required	Required	Required	Required
Memory	Up to 10 GB	Up to 32 GB (configurable vCPU ratio)	Configurable	Configurable	Configurable
Ops burden	Zero	Low	Low	Medium	High

When to choose Lambda Managed Instances:

Sustained, predictable throughput (hundreds or thousands of requests per second)
Workloads that benefit from specific EC2 instance types (Graviton4, high-bandwidth networking)
Memory-intensive functions that exceed standard Lambda's 10 GB limit or need configurable memory-to-vCPU ratios
Cost optimization at scale (10M+ invocations/month where EC2 pricing with Savings Plans beats per-GB-second)
Functions that load large datasets into memory and reuse them across requests (embeddings, models, reference data)

When standard Lambda is still better:

Bursty, unpredictable traffic patterns
Low to moderate throughput (standard Lambda's per-invocation pricing wins)
Functions that need instant scaling (LMI scales asynchronously based on CPU utilization and execution-environment saturation; if traffic more than doubles within 5 minutes you may see throttles while capacity catches up)

I've written about several of these compute options in previous projects. My ECS deep dive covers Fargate and ECS Express Mode. The Serverless Data Processor demonstrates Step Functions with both Lambda and Fargate. My Powertools best practices article covers the observability patterns used in this project.

Architecture

The architecture has three layers:

Capacity Provider - The foundation. Defines the VPC configuration, instance requirements (architecture, instance types), and scaling policies. Capacity providers define both the security boundary and the failure blast radius of your workload. All functions assigned to the same capacity provider share EC2 instances and must be mutually trusted. This uses container-based isolation, not Firecracker. A compromised function on a shared capacity provider can affect every other function on the same instances. Separate untrusted workloads, regulated workloads, and production from non-production into distinct capacity providers.

Managed Instances - EC2 instances launched and managed by Lambda in your VPC. They're visible in the EC2 console (tagged as managed by Lambda) but you don't SSH into them, patch them, or configure autoscaling groups - Lambda handles all of that. The lifecycle includes a 14-day rotation for security compliance.

Execution Environments - Containers running your function code on the managed instances. Each environment handles multiple concurrent requests. For Python, each concurrency slot is a separate process with its own memory space.

Networking - VPC connectivity is mandatory. Without proper outbound connectivity, functions execute but logs and traces are silently lost. This project uses private subnets with a NAT Gateway for telemetry transmission and Bedrock API access. For production, consider VPC endpoints to keep traffic on the AWS network.

Two-Level Concurrency

This is what makes Lambda Managed Instances architecturally different from standard Lambda. There are two levels of parallelism:

Level 1 - LMI manages for you: Multiple processes handle concurrent requests. Python's LMI runtime spawns a separate process for each concurrency slot (default: 16 per vCPU). Each process has its own memory space, its own global variables, and its own boto3 clients. No shared mutable state between processes. Scaling decisions are based on both execution environment saturation and CPU utilization, not request count alone.

Level 2 - You manage yourself: Within each request, you can use ThreadPoolExecutor to parallelize I/O operations. If your handler needs to search 5 product categories, you can search them in parallel rather than sequentially.

Combined, this means a single execution environment with 1 vCPU and 10 concurrent processes, each running 4 search threads, can have 40 category searches in flight concurrently. On standard Lambda, you'd need 10 separate execution environments to handle those 10 concurrent requests, each paying per-GB-second for its own copy of the catalog in memory.

Each process receives a request, calls Bedrock to embed the query text, then fans out across categories using ThreadPoolExecutor. The catalog data (loaded from DynamoDB at process init) stays in memory across all requests handled by that process.

Why LMI Instead of Standard Lambda

This workload is a poor fit for standard Lambda and a strong fit for LMI. Here's why:

In-memory catalog at scale. Each process loads the product catalog with embedding vectors into memory at initialization. A 100K product catalog with 384-dimensional vectors is roughly 150 MB per process. With 10 concurrent processes, that's 1.5 GB for catalog data alone. Standard Lambda's maximum is 10 GB total, and you pay per-GB-second for every millisecond of that memory. LMI gives you up to 32 GB with configurable memory-to-vCPU ratios, and you pay EC2 instance pricing regardless of how much memory your function uses.

Multi-concurrency amortizes catalog loading. On standard Lambda, 10 concurrent requests means 10 independent execution environments, each cold-starting and loading the catalog into its own memory, each paying per-GB-second. On LMI, those 10 requests run as 10 processes on one EC2 instance. The catalog loads once per process at init time and stays warm for all subsequent requests routed to that process. At sustained throughput, this eliminates the repeated cold-start penalty.

Sustained throughput economics. A product recommendation API serving a storefront has predictable, sustained traffic - hundreds of requests per second during business hours. Each request involves a Bedrock API call for query embedding (I/O), cosine similarity across categories (CPU), and structured logging (I/O). At 10M+ invocations per month, EC2 pricing with Savings Plans is 60-72% cheaper than standard Lambda's per-GB-second model.

Configurable memory-to-vCPU ratio. This workload is memory-heavy (large catalog) with moderate CPU needs (vector math on 384 dimensions). The 4:1 memory-to-vCPU ratio gives 4 GB of memory per vCPU - enough for the catalog plus Bedrock client overhead. Standard Lambda locks you into a fixed ratio where more memory always means proportionally more CPU and higher cost.

Why Not Fargate?

This project could run on ECS Fargate. The handler logic would move into a FastAPI app, the catalog would load at container startup, and an ALB would handle routing. It would work fine. But the infrastructure footprint would be significantly larger:

	Lambda Managed Instances	ECS Fargate
Application code	Single handler function	Web framework + Dockerfile + health checks
Infrastructure	Capacity provider + function	Cluster + task def + service + ALB + target group + listener rules
Auto-scaling	Built into capacity provider	Application Auto Scaling policies (target tracking, step scaling)
Event triggers	Native (SQS, EventBridge, API Gateway, S3)	Requires separate wiring per trigger
Terraform lines	~200 across 4 modules	~400-500 with ALB, ECR, auto-scaling
Container image	Not needed (zip deployment)	Required (Dockerfile, ECR push, image lifecycle)

For teams already comfortable with Lambda, LMI is the path of least resistance to get EC2 pricing and multi-concurrency without learning container orchestration. You keep the programming model you know and gain the hardware flexibility you need. The reverse is also true: for teams already invested in ECS, Fargate may remain the more operationally familiar choice - the muscle memory, dashboards, deployment pipelines, and on-call runbooks are already in place.

Where Fargate or EKS would be the better choice: custom native dependencies that exceed Lambda layer limits (PyTorch, large ML models), persistent connections (WebSocket, gRPC), specialized instance types not supported by LMI, or workloads that need the Kubernetes ecosystem. I covered Fargate patterns in my ECS deep dive and Kabob Store projects. My EKS Auto Mode article covers Karpenter-based scaling.

One specific area where EKS with Karpenter is significantly more sophisticated: scaling down and cost optimization at idle. LMI's scale-down is conservative - in my testing, 2 EC2 instances remained running overnight with zero traffic (1 per AZ). There's no minimum instance setting, no consolidation, and no way to force scale-to-zero short of deleting the function version or capacity provider. Karpenter, by contrast, actively consolidates workloads onto fewer nodes, replaces larger instances with smaller ones when demand drops, and can use Spot instances for fault-tolerant workloads. If your traffic has significant idle periods (nights, weekends), this difference matters for cost. LMI's simplicity comes at the price of less intelligent scaling.

Setting It Up with Terraform

The complete infrastructure is organized into four Terraform modules: networking, IAM, capacity provider, and Lambda. Every IAM policy follows least privilege, and the configuration follows the AWS Well-Architected Framework Security and Cost Optimization pillars. All resources use official HashiCorp providers (hashicorp/aws and hashicorp/archive where applicable) - no community modules or third-party providers.

For a fully production-hardened deployment, you'd also want to address the Reliability, Performance Efficiency, and Operational Excellence pillars more explicitly. The AWS Serverless Applications Lens emphasizes thinking in concurrent requests, sharing nothing, designing for failures and duplicates, and using versions and aliases for safe reversible deployments. Concretely:

Multi-AZ deployment - subnets in at least two AZs (this demo does)
Encryption at rest with customer-managed KMS keys - on the capacity provider (kms_key_arn on aws_lambda_capacity_provider), DynamoDB (server_side_encryption with kms_key_arn), and CloudWatch Logs (kms_key_id on the log group)
VPC endpoints instead of NAT Gateway (covered later in this article)
Invoke through an alias, not the published version directly - The demo invokes the qualified ARN of the published function version (function:name:1). For production, create an alias (prod, live, stable) pointing to a specific version and have callers invoke the alias ARN. Aliases enable instant rollback by updating one pointer, support traffic-shifting deployments (10% to a new version, then 50%, then 100%), and decouple caller code from version numbers.
Idempotency for downstream side effects - Because Lambda may retry or duplicate events, handlers must remain idempotent - even when using long-lived in-memory state. The Powertools idempotency utility uses DynamoDB to deduplicate requests by a configurable key. For this similarity engine the Bedrock embedding call is a read operation and the only state change is logging, so idempotency is less critical. For handlers that write to DynamoDB, send notifications, or charge a payment, idempotency is essential because LMI's at-least-once delivery semantics mean retries can produce duplicate side effects. The in-memory catalog is read-only and shared across requests, but any per-request state that produces side effects needs deduplication.
CloudWatch alarms on LMI-specific metrics (covered in the Observability section)

The demo includes the basics. The production hardening above is straightforward incremental work.

Provider Configuration

terraform {
  required_version = ">= 1.11.0"

  required_providers {
    aws = {
      # Validated with AWS provider v6.x (tested with 6.31+)
      source  = "hashicorp/aws"
      version = "~> 6.31"
    }
    archive = {
      source  = "hashicorp/archive"
      version = "~> 2.7"
    }
  }
}

provider "aws" {
  region  = var.aws_region
  profile = var.aws_profile  # Set via AWS_PROFILE env var or -var flag

  default_tags {
    tags = {
      Project     = "lambda-managed-instances"
      Environment = var.environment
      ManagedBy   = "terraform"
    }
  }
}

The ~> 6.31 constraint pins to the current stable major (6.31.0 at the time of writing) without locking too tightly. memory_size values above 10240 MB require hashicorp/aws 6.29.0 or later - earlier releases had a schema validator that capped memory_size at 10 GB even for LMI functions (fixed in #46065). Without a recent provider, attempting to set 16 GB or 32 GB on an LMI function fails at terraform plan with a confusing validation error.

IAM: The Two-Role Model

Lambda Managed Instances requires two separate IAM roles - a deliberate separation of concerns:

Operator Role - Allows Lambda to manage EC2 instances on your behalf. Your function code never gets these permissions.

resource "aws_iam_role" "operator" {
  name = "${var.project_name}-${var.environment}-operator"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Service = "lambda.amazonaws.com"
      }
      Action = "sts:AssumeRole"
      Condition = {
        StringEquals = {
          "aws:SourceAccount" = var.account_id
        }
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "operator" {
  role       = aws_iam_role.operator.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaManagedEC2ResourceOperator"
}

Execution Role - Scoped to only what the function needs. No EC2 permissions, no wildcard resources. Bedrock access is limited to specific embedding model families.

# DynamoDB - least privilege: handler only does Query on the category-index GSI.
# The seed script runs locally with the operator's credentials and uses its own
# IAM identity for PutItem - not this role.
resource "aws_iam_role_policy" "execution_dynamodb" {
  name = "dynamodb-access"
  role = aws_iam_role.execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["dynamodb:Query"]
      Resource = [
        var.dynamodb_table,
        "${var.dynamodb_table}/index/*"
      ]
    }]
  })
}

# Bedrock - scoped to the specific configured embedding model
resource "aws_iam_role_policy" "execution_bedrock" {
  name = "bedrock-embeddings"
  role = aws_iam_role.execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["bedrock:InvokeModel"]
      Resource = [
        "arn:aws:bedrock:${var.aws_region}::foundation-model/${var.embedding_model_id}"
      ]
    }]
  })
}

The runtime function only needs dynamodb:Query because _load_catalog() queries the category-index GSI rather than scanning the table. No PutItem, no GetItem, no Scan. The seed script (scripts/seed_catalog.py) runs locally on the developer's machine with their own IAM identity - it never assumes the function execution role, so the runtime role doesn't need write permissions. The Bedrock policy is scoped to the exact model ARN configured via var.embedding_model_id, not a wildcard. This is what "least privilege" looks like when you actually walk through the code.

Capacity Provider

The capacity provider defines the EC2 infrastructure where your functions run. The scaling_mode = "Manual" with a target CPU utilization policy gives you control over scaling behavior while still letting Lambda handle the mechanics.

resource "aws_lambda_capacity_provider" "main" {
  name = "${var.project_name}-${var.environment}"

  vpc_config {
    subnet_ids         = var.subnet_ids
    security_group_ids = [var.security_group_id]
  }

  permissions_config {
    capacity_provider_operator_role_arn = var.operator_role_arn
  }

  instance_requirements {
    architectures = [var.instance_architecture]  # "arm64" for Graviton
  }

  capacity_provider_scaling_config {
    max_vcpu_count = var.max_vcpu_count
    scaling_mode   = "Manual"

    scaling_policies = [{
      predefined_metric_type = "LambdaCapacityProviderAverageCPUUtilization"
      target_value           = var.target_cpu_utilization
    }]
  }
}

The capacity provider supports two scaling modes: Auto and Manual. Auto mode is hands-off - Lambda picks an internal target CPU utilization and scales based on AWS-chosen defaults, with no explicit scaling_policies block needed. I chose Manual mode for this project because it lets me set an explicit target (50% in the demo config) so the scaling behavior is predictable and tunable. With a lower target, the capacity provider scales out faster and maintains more headroom for traffic bursts. For a production workload where you trust AWS to pick reasonable defaults, Auto mode is simpler and a valid choice.

Lambda Function with Capacity Provider

Four key differences from a standard Lambda function:

capacity_provider_config attaches the function to LMI
publish = true is required - LMI runs on published versions
memory_size minimum is 2048 MB (2 GB / 1 vCPU)
execution_environment_memory_gib_per_vcpu controls the memory-to-vCPU ratio (new in March 2026)

locals {
  powertools_layer_arn = (
    var.instance_architecture == "arm64"
    ? "arn:aws:lambda:${var.aws_region}:017000801446:layer:AWSLambdaPowertoolsPythonV3-python314-arm64:${var.powertools_layer_version}"
    : "arn:aws:lambda:${var.aws_region}:017000801446:layer:AWSLambdaPowertoolsPythonV3-python314-x86_64:${var.powertools_layer_version}"
  )

  powertools_env_vars = {
    POWERTOOLS_SERVICE_NAME      = "${var.project_name}-handler"
    POWERTOOLS_METRICS_NAMESPACE = var.metrics_namespace
    POWERTOOLS_LOG_LEVEL         = var.log_level
  }
}

resource "aws_lambda_function" "handler" {
  function_name = "${var.project_name}-${var.environment}-handler"
  role          = var.execution_role_arn
  handler       = "handler.lambda_handler"
  runtime       = "python3.14"
  architectures = [var.instance_architecture]
  memory_size   = var.lambda_memory_size
  timeout       = 30
  publish       = true

  filename         = data.archive_file.function.output_path
  source_code_hash = data.archive_file.function.output_base64sha256

  layers = [local.powertools_layer_arn]

  capacity_provider_config {
    lambda_managed_instances_capacity_provider_config {
      capacity_provider_arn                    = var.capacity_provider_arn
      execution_environment_memory_gib_per_vcpu = var.memory_gib_per_vcpu
      per_execution_environment_max_concurrency = var.max_concurrency_per_environment
    }
  }

  logging_config {
    log_format            = "JSON"
    application_log_level = var.log_level
    system_log_level      = "WARN"
    log_group             = aws_cloudwatch_log_group.function.name
  }

  tracing_config {
    mode = "Active"
  }

  environment {
    variables = merge(local.powertools_env_vars, {
      DYNAMODB_TABLE      = aws_dynamodb_table.products.name
      ENVIRONMENT         = var.environment
      EMBEDDING_MODEL_ID  = var.embedding_model_id
      EMBEDDING_DIMENSION = tostring(var.embedding_dimension)
    })
  }
}

The memory_gib_per_vcpu setting is powerful. LMI enforces a minimum of 1 vCPU per execution environment, so the ratio determines how much memory you get for that minimum. Examples at the 8 GB level:

2:1 ratio = 8 GB / 4 vCPUs (compute-heavy: batch processing, data crunching)
4:1 ratio = 8 GB / 2 vCPUs (balanced: API handlers, typical workloads)
8:1 ratio = 8 GB / 1 vCPU (memory-heavy: large in-memory datasets, caching)

The product similarity engine uses 4 GB at 4:1 - 1 vCPU per environment, which is the smallest balanced configuration that fits the catalog plus 10 worker processes.

A Note on Packaging Dependencies

The Powertools layer is pinned to a specific version (minimum 3.23.0 - the first release that officially supports LMI). For everything else, follow AWS's guidance for Python Lambda functions: package all dependencies, including boto3 and botocore, with the function rather than relying on the runtime's bundled copies. Even though boto3 is available in the runtime, package it with your function to avoid version drift. The runtime's boto3 is updated on AWS's schedule, not yours, and version drift between local development and the runtime can produce subtle bugs that are hard to reproduce. For production zip deployments, pip install --target build/ boto3 botocore and ship them in the zip. The demo uses the runtime's boto3 for simplicity, but production code should not.

Multi-Concurrency by Language

LMI supports five runtimes today: Python 3.13+, Node.js 22+, Java 21+, .NET 8+, and Rust on the OS-only runtime. All modern runtimes (Python 3.12+) are based on Amazon Linux 2023, replacing AL2 ahead of its June 2026 end-of-life. Every language handles multi-concurrency differently, and the differences matter - they change how you write the handler, what concurrency bugs you have to worry about, and how memory scales.

Runtime	Concurrency Model	What This Means for Your Handler
Python	Multiple processes per environment	Full isolation - each process has its own memory, globals, and boto3 clients. No thread-safety concerns. Memory multiplies linearly with concurrency.
Node.js	Worker threads with async dispatch	Each worker thread can also handle async requests concurrently. Requires safe handling of shared state.
Java	Single process with OS threads	Multiple threads execute the handler simultaneously in shared memory. Requires explicit thread-safe code: synchronized collections, no shared mutable state, atomic operations. The hardest model to get right.
.NET	.NET Tasks with async processing	Same patterns as ASP.NET Core - thread-safe data structures, no static mutable state.
Rust	Single process, Tokio async tasks	Compile-time enforcement: handlers must implement `Clone + Send`. The compiler catches concurrency bugs that other languages catch at runtime (or in production).

Python is the simplest model because there's no shared memory between concurrent requests. The trade-off is per-process memory multiplication. Java is the hardest because thread safety becomes a concern on every line that touches shared state. Rust is the safest because the compiler refuses to let you write non-thread-safe code in the first place.

This blog focuses on the Python implementation. The patterns shown here (process isolation, ThreadPoolExecutor for parallel I/O within a request, memory tuning around per_execution_environment_max_concurrency) are specific to how Python's LMI runtime works. The architecture concepts (capacity providers, scaling, networking, IAM) apply identically across all five languages, but the handler code patterns would differ if you were writing in Java or Rust.

The Python Handler

Python's LMI runtime uses multiple processes (not threads) for multi-concurrency. Each concurrent request runs in a separate process with its own memory space. Global variables, module-level caches, and boto3 clients are completely isolated between processes. This is simpler than the thread-based and async models above because there are no shared-memory concurrency concerns.

This blog uses Python 3.14, the newest supported version. Note that Lambda's Python 3.14 ships with the JIT and free-threaded mode disabled, so the GIL is still in effect.

The one shared resource: /tmp. All processes in an execution environment share the same /tmp directory. Use request-scoped filenames to prevent collisions.

Handler Structure with Powertools

Following the Powertools best practices pattern - Logger, Tracer, and Metrics decorators in the correct order:

import json
import math
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
import boto3
from aws_lambda_powertools import Logger, Metrics, Tracer
from aws_lambda_powertools.metrics import MetricUnit
from aws_lambda_powertools.utilities.typing import LambdaContext

logger = Logger()
tracer = Tracer()
metrics = Metrics()

# Module-level init runs ONCE PER PROCESS.
# With 10 concurrent processes, this runs 10 times.
# Each process loads its own catalog copy and boto3 clients.
PROCESS_ID = os.getpid()
AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")
EMBEDDING_MODEL_ID = os.environ.get(
    "EMBEDDING_MODEL_ID", "amazon.nova-2-multimodal-embeddings-v1:0"
)
EMBEDDING_DIMENSION = int(os.environ.get("EMBEDDING_DIMENSION", "384"))

dynamodb = boto3.resource("dynamodb", region_name=AWS_REGION)
table = dynamodb.Table(os.environ["DYNAMODB_TABLE"])
bedrock_runtime = boto3.client("bedrock-runtime", region_name=AWS_REGION)
_catalog: dict[str, list[dict]] = {}


@tracer.capture_method
def _load_catalog() -> None:
    """Load product catalog once per process. Uses Query (least privilege)."""
    if _catalog:  # already loaded in this process
        return
    # ... query DynamoDB by category and populate _catalog ...


@logger.inject_lambda_context(log_event=True)
@tracer.capture_lambda_handler
@metrics.log_metrics(capture_cold_start_metric=True)
def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.append_keys(process_id=PROCESS_ID)
    _load_catalog()  # no-op after first call in this process

    # Extract params from event body or direct invocation
    body = event.get("body")
    params = json.loads(body) if isinstance(body, str) else (body or event)
    top_k = int(params.get("top_k", 5))

    # Step 1: Embed the query text via Bedrock (I/O-bound)
    query_embedding = _embed_query(params["query"])

    # Step 2: Search categories in parallel (CPU-bound)
    results: dict = {}
    categories = params.get("categories", list(_catalog.keys()))
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = {
            executor.submit(_search_category, cat, query_embedding, top_k): cat
            for cat in categories
        }
        for future in as_completed(futures):
            results[futures[future]] = future.result()

    metrics.add_metric(name="SearchRequests", unit=MetricUnit.Count, value=1)
    return {"statusCode": 200, "body": json.dumps({"results": results})}

Bedrock Embedding - Configurable Model

The query text is embedded via Amazon Bedrock before similarity search. The model is configurable via the EMBEDDING_MODEL_ID environment variable - Nova Multimodal Embeddings by default, with Titan Text Embeddings V2 as an alternative:

@tracer.capture_method
def _embed_query_nova(text: str) -> list[float]:
    """Nova Multimodal Embeddings request format."""
    request_body = {
        "taskType": "SINGLE_EMBEDDING",
        "singleEmbeddingParams": {
            "embeddingPurpose": "TEXT_RETRIEVAL",
            "embeddingDimension": EMBEDDING_DIMENSION,
            "text": {"truncationMode": "END", "value": text},
        },
    }
    response = bedrock_runtime.invoke_model(
        body=json.dumps(request_body),
        modelId=EMBEDDING_MODEL_ID,
        accept="application/json",
        contentType="application/json",
    )
    response_body = json.loads(response["body"].read())
    return response_body["embeddings"][0]["embedding"]

Product embeddings are generated at seed time using GENERIC_INDEX purpose and stored in DynamoDB alongside the product data. Query embeddings use TEXT_RETRIEVAL purpose at runtime. Nova supports 4 dimension sizes (256, 384, 1024, 3072) - trading off accuracy against memory and compute cost. The demo uses 384 dimensions as a practical balance.

Cosine Similarity - The CPU Bottleneck

The vector similarity computation is the compute-intensive core after the Bedrock call returns. For production, use NumPy - it's 10-50x faster than a pure Python loop and releases the GIL during C-level operations, which makes the ThreadPoolExecutor pattern actually parallel:

import numpy as np

def _cosine_similarity(query: np.ndarray, catalog: np.ndarray) -> np.ndarray:
    """Production version: batch operation across all products in a category."""
    # query shape: (D,), catalog shape: (N, D)
    norms = np.linalg.norm(catalog, axis=1) * np.linalg.norm(query)
    return np.dot(catalog, query) / np.where(norms == 0, 1, norms)

The pure Python version is included in the demo as an educational fallback (no NumPy dependency, easier to read):

def _cosine_similarity_pure(vec_a: list[float], vec_b: list[float]) -> float:
    """Educational version: shows the math, no dependencies."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

The handler code, the capacity provider, the Terraform - none of it would need to change to run on an instance type with hardware-accelerated vector operations. The capacity provider's instance type selection is the only variable.

Process Memory Multiplication

This is the most important thing to understand about Python LMI. Each process loads its own copy of the catalog:

10 concurrent processes x 200 MB catalog = 2 GB just for catalog data

The MemoryUtilization CloudWatch metric tracks total memory consumption across all processes. If you're loading large datasets and running high concurrency, you'll hit memory limits. Tune with:

Reduce PerExecutionEnvironmentMaxConcurrency (fewer processes, less memory)
Increase memory_size (more memory per environment)
Use 8:1 memory_gib_per_vcpu ratio (more memory, fewer vCPUs)
Use shared /tmp as a cross-process cache (load once, read from all processes)

Observability

LMI publishes its own CloudWatch metrics in the AWS/Lambda namespace at 5-minute granularity. The capacity-provider-level metrics describe overall instance utilization; the execution-environment-level metrics describe per-function resource consumption.

Capacity provider metrics (dimensions: CapacityProviderName, InstanceType):

CPUUtilization - CPU usage across all instances in the capacity provider
MemoryUtilization - Memory usage across all instances
vCPUAllocated / vCPUAvailable - Used vs available vCPU count
MemoryAllocated / MemoryAvailable - Used vs available memory

Execution environment metrics (dimensions: FunctionName, CapacityProviderName, Resource):

ExecutionEnvironmentConcurrency - Active concurrent requests per environment
ExecutionEnvironmentConcurrencyLimit - Configured maximum concurrency per environment
ExecutionEnvironmentCPUUtilization - CPU usage of this function's environments
ExecutionEnvironmentMemoryUtilization - Memory usage of this function's environments

Alarms to Set First

If you only set three alarms when adopting LMI, set these:

Capacity provider CPU utilization - Alarm when sustained CPU exceeds your scaling target (e.g., > 80% for 10 minutes if your target is 50%). This indicates the capacity provider is failing to scale out fast enough or has hit max_vcpu_count.
Execution environment concurrency vs limit - Alarm when ExecutionEnvironmentConcurrency reaches ExecutionEnvironmentConcurrencyLimit for sustained periods. This means processes are saturated and incoming requests are being throttled or queued.
Execution environment memory utilization - Alarm when memory exceeds 80%. With Python's per-process memory multiplication, hitting memory limits causes new process spawns to fail (InitResourceExhausted) rather than gradual degradation. Catch this before it happens.

These three cover the LMI-specific failure modes that standard Lambda alarms (Errors, Throttles, Duration) won't catch.

Deployment

Prerequisites

AWS CLI configured with a profile (export AWS_PROFILE=your-profile)
Terraform >= 1.11
Python 3.14+ with boto3 (for the seed script)
Amazon Nova Multimodal Embeddings model enabled in your AWS account (Bedrock console, Model Access)

Deploy

# Clone the repo
git clone https://github.com/RDarrylR/lambda-managed-instances-similarity-engine.git
cd lambda-managed-instances-similarity-engine

# Configure
cp infrastructure/terraform.tfvars.example infrastructure/terraform.tfvars
# Edit terraform.tfvars with your values

# Deploy infrastructure
make init
make apply

# Seed the product catalog
make seed

# Invoke
make invoke

Cost Analysis

Lambda Managed Instances pricing is fundamentally different from standard Lambda. Understanding when each model wins is the key decision.

Standard Lambda pricing (arm64/Graviton):

$0.20 per million requests
$0.0000133334 per GB-second (arm64)
No minimum charge, no idle cost

Lambda Managed Instances pricing:

$0.20 per million requests (same)
EC2 on-demand instance pricing (varies by type)
15% management fee on the EC2 on-demand price
No per-invocation duration charge

The critical difference: standard Lambda charges per GB-second of execution. LMI charges for EC2 time regardless of how many requests you serve. At low volume, you're paying for idle EC2 capacity. At high volume, that fixed EC2 cost is amortized across millions of requests.

Break-Even: Standard Lambda vs LMI

Consider this workload: 4 GB memory, 200ms average duration, sustained traffic.

Standard Lambda cost per request (arm64):

Compute: 4 GB x 0.2s = 0.8 GB-seconds x $0.0000133334 = $0.00001067
Request: $0.0000002
Total: ~$0.0000109 per request

LMI on a c7g.medium (1 vCPU, 2 GB, ~$0.034/hr on-demand):

EC2 + 15% fee: $0.034 x 1.15 = $0.0391/hr
With 10 concurrent processes and 200ms per request, each process handles ~5 req/sec
Instance throughput: ~50 req/sec = ~180,000 req/hr
Cost per request: $0.0391 / 180,000 = ~$0.000000217

At this throughput, LMI is roughly 50x cheaper per request than standard Lambda. But the EC2 cost runs 24/7 whether you have traffic or not.

Monthly Cost Comparison

Monthly Requests	Instances Needed	Standard Lambda (arm64)	LMI On-Demand (c7g.medium)	LMI + 1yr Savings Plan
1M	1	$11	$28 + $0.20 = $28	~$18
10M	1	$109	$28 + $2.00 = $30	~$20
50M	1	$546	$28 + $10.00 = $38	~$28
100M	1	$1,091	$28 + $20.00 = $48	~$38
500M	4	$5,456	$112 + $100.00 = $212	~$172

A single c7g.medium tops out around ~130M requests/month at 50 req/sec sustained. Beyond that, instance count scales roughly linearly with load - 500M req/month requires approximately 4 instances. The LMI columns reflect the actual instance count needed at each volume.

The break-even is around 2.5M requests/month at this memory and duration profile. Below that, standard Lambda wins because you pay nothing when idle. Above that, LMI wins and the advantage grows with volume.

Commitment Discounts Change the Math

LMI supports EC2 Savings Plans and Reserved Instances. Standard Lambda supports Compute Savings Plans (up to 17% discount on duration). The discount gap is significant:

Commitment	Standard Lambda Discount	LMI Discount (EC2)
None (on-demand)	0%	0%
1-year Compute Savings Plan	Up to 17%	Up to 36%
3-year Compute Savings Plan	Up to 17%	Up to 56%
1-year EC2 Reserved Instance	N/A	Up to 40%
3-year EC2 Reserved Instance	N/A	Up to 60%

For predictable production workloads with steady traffic, a 3-year commitment on LMI can reduce costs by 60% on the EC2 portion. Standard Lambda's maximum discount is 17%. This difference widens the gap at scale.

Hidden Costs

Don't forget the supporting infrastructure that LMI requires and standard Lambda doesn't:

NAT Gateway: ~$32/month + $0.045/GB data transfer (required for VPC telemetry)
VPC endpoints (if used instead of NAT): ~$7.20/month per endpoint per AZ
DynamoDB: On-demand reads for catalog loading (minimal for small catalogs, significant at scale)
Bedrock: Nova Multimodal Embeddings per-token pricing for each query embedding
CloudWatch: Log storage and metric costs increase with concurrency

For low-volume workloads, these fixed costs can exceed the compute savings. Factor them into your total cost of ownership.

When Each Pricing Model Wins

Standard Lambda wins when:

Traffic is bursty or unpredictable (you pay nothing at zero traffic)
Monthly volume is below the break-even threshold (~2-3M requests for this workload profile)
You can't commit to 1-year or 3-year terms
You don't need VPC connectivity (avoids NAT Gateway cost)

LMI wins when:

Traffic is sustained and predictable (the EC2 cost is fully amortized)
Monthly volume exceeds 5-10M requests
You can commit to Savings Plans or Reserved Instances
You need more than 10 GB memory or specific instance types
You're already paying for VPC infrastructure

For this demo, expect to pay for:

NAT Gateway (~$0.045/hour + data transfer)
EC2 instances (varies by type, auto-selected by Lambda)
DynamoDB on-demand reads (minimal for this catalog size)
Bedrock embedding calls (per-token pricing for each query)

CLEANUP (IMPORTANT!!)

This infrastructure costs real money while running - approximately $2-4/day even with zero traffic (NAT Gateway + EC2 managed instances). Don't forget about it.

Make sure to destroy all resources when you're done:

make destroy

If the capacity provider fails to delete (it can take a few minutes to drain instances), wait and retry. Verify in the AWS console that no EC2 instances tagged with your project name are still running.

Networking: Three Supported Patterns

LMI requires VPC connectivity - the function execution environments need outbound network access for telemetry transmission and any AWS service calls. AWS documents three supported connectivity patterns:

Public subnets with an internet gateway - simplest, suitable for dev/test only
Private subnets with NAT Gateway - the pattern this demo uses
Private subnets with VPC endpoints - the most AWS-aligned production pattern

NAT Gateway (used in this demo)

Simple to set up - one resource, all outbound traffic routes through it
~$32/month base + $0.045/GB data transfer
Traffic leaves your VPC, crosses the public internet (encrypted), then re-enters AWS
Single point of failure unless you deploy one per AZ (~$64/month for 2-AZ HA)

VPC Endpoints (recommended for production)

For production, the most AWS-aligned pattern is one VPC endpoint per service per AZ. Traffic stays entirely on the AWS network and never touches the public internet. The endpoint set must cover every service the function calls - if you forget one, the function fails silently or hangs. For this workload, that means:

Endpoint	Type	Required For
`com.amazonaws.{region}.logs`	Interface	CloudWatch Logs (Powertools logger output)
`com.amazonaws.{region}.monitoring`	Interface	CloudWatch Metrics (Powertools metrics)
`com.amazonaws.{region}.xray`	Interface	X-Ray tracing (Powertools tracer)
`com.amazonaws.{region}.bedrock-runtime`	Interface	Bedrock embedding API calls
`com.amazonaws.{region}.dynamodb`	Gateway	DynamoDB catalog queries (free, no per-AZ charge)

Critical security group detail: Interface endpoints have their own security groups. They must allow inbound HTTPS (port 443) from the function's security group. The function security group must allow outbound HTTPS to the endpoint security groups. If you skip this, DNS resolves but the connection is silently blocked.

Endpoints should be deployed in each AZ used by the capacity provider to avoid cross-AZ latency and data transfer costs. If your capacity provider has subnets in us-east-1a and us-east-1b, every interface endpoint also needs ENIs in both AZs. This is the same Cross-AZ Tax pattern from my previous blog - cross-AZ data transfer charges apply when traffic from a function in us-east-1a hits an endpoint ENI in us-east-1b. Provision endpoints per AZ to keep traffic local.

Cost math: ~$7.20/month per interface endpoint per AZ. With 4 interface endpoints across 2 AZs, that's ~$58/month - roughly double the single NAT Gateway, but cheaper than 2-AZ NAT Gateway HA. The DynamoDB gateway endpoint is free. At high data transfer volumes (more than ~900 GB/month through the NAT Gateway), endpoints become cheaper because there's no per-GB data transfer surcharge for in-region traffic.

When endpoints win on security: Always. Traffic never leaves the AWS network. You can attach endpoint policies to restrict which resources each endpoint can access (e.g., limit the Bedrock endpoint to specific model ARNs). This aligns with the AWS Well-Architected Security Pillar - minimize the attack surface.

The Terraform for VPC endpoints is straightforward but verbose. I left it out of this demo to keep the focus on LMI itself. A follow-up project could add a networking_mode variable that switches between NAT Gateway and VPC endpoints.

A few things to watch for:

VPC connectivity isn't optional. Lambda Managed Instances requires a VPC. Without outbound connectivity (NAT Gateway or VPC endpoints), your function executes but logs and traces are silently lost. You'll debug a working function with no visible output. This is documented but easy to miss.
Scaling is asynchronous. LMI scales based on CPU utilization and execution-environment saturation, not per-invocation demand. Unlike standard Lambda, scaling isn't triggered by incoming requests - it's driven by resource consumption inside existing execution environments. Because scaling reacts to resource pressure instead of incoming traffic, inefficient code or high memory usage can delay scaling and increase throttling risk. The Scaler component decides when to add or remove instances, and instance launches aren't instant. Lambda maintains headroom so traffic can roughly double within minutes without immediate throttling, but if your traffic more than doubles within 5 minutes, you may see 429 throttles while capacity catches up. This is fundamentally different from standard Lambda's near-instant scaling. Plan for it with the target CPU utilization setting - lower values maintain more headroom.
Process memory multiplies. With Python, each concurrency slot is a separate process. Because Python uses process-based concurrency, memory usage scales linearly with concurrency - each worker process consumes its own memory. With Python, concurrency isn't "free" - each additional request increases memory consumption linearly. If your function uses 500 MB of memory and you set concurrency to 16, that's 8 GB of memory consumed per execution environment. Monitor the MemoryUtilization metric and tune accordingly.
publish = true is required. LMI runs on published function versions, not $LATEST. If you forget this, Terraform applies successfully but the function doesn't run on managed instances. Every code change needs a new published version.
Capacity providers are security boundaries, not isolation boundaries. Functions sharing a capacity provider run in containers on the same EC2 instances. This isn't Firecracker isolation. Separate untrusted workloads into separate capacity providers.
Powertools minimum version matters. Lambda Managed Instances requires Powertools for AWS Lambda (Python) version 3.23.0 or later. Pin the layer version in Terraform rather than using latest.
LMI doesn't scale to zero. Unlike standard Lambda where you pay nothing at zero traffic, LMI keeps a baseline of warm EC2 instances running for high availability. AWS launches a baseline of three managed instances for availability across AZs when you publish a function version with a capacity provider. In my testing with 2 AZs configured, 2 instances remained active overnight with zero traffic, but the documented baseline is three. There's no minimum instance setting, no Karpenter-style consolidation, and no way to force scale-to-zero short of deleting the function version or capacity provider. This is a meaningful cost difference for dev/test environments where you might leave infrastructure running between sessions. Run make destroy when you're not actively using the infrastructure, or design your dev environments to use standard Lambda where idle cost is zero.
Quotas to plan around. LMI has its own service quotas: 1 request per second on capacity provider write APIs (Create/Update/Delete - rate-limited to prevent infrastructure churn), 100 function versions per capacity provider, and 1,000 capacity providers per account per region. These are soft limits but worth knowing when you start automating capacity provider management or running multiple environments.

SAM Support

If you came in from the AWS Serverless plugin angle and are wondering whether SAM supports LMI - yes, it does. AWS::Serverless::CapacityProvider is the SAM resource equivalent to aws_lambda_capacity_provider. The SAM template syntax is more concise but follows the same model: capacity provider definition, function with CapacityProviderConfig property, and IAM roles. I chose Terraform for this project because the LMI Terraform path is less documented in the wild and I wanted to fill that gap, but SAM is a perfectly valid choice if your team already uses it.

Instance Type Selection

The capacity provider's instance_requirements block controls which EC2 instance types Lambda selects. By default, Lambda chooses the best fit automatically. You can constrain this with allowed_instance_types or excluded_instance_types.

Today, the interesting choice is between arm64 (Graviton4 - better price/performance for most workloads) and x86_64. But the architecture of Lambda Managed Instances - your function code running in containers on EC2 instances you specify - means the compute capabilities available to your functions expand with every new EC2 instance type AWS makes available for LMI.

The product similarity engine in this project calls Bedrock for query embeddings (I/O-bound) and then computes cosine similarity on CPU (compute-bound). The handler code isn't coupled to a specific compute architecture. The embedding call is behind a clean interface (_embed_query). The similarity computation is pure math. The instance type is a configuration parameter, not an application concern.

This is the practical difference between Lambda Managed Instances and standard Lambda. Standard Lambda abstracts the hardware entirely - you get what AWS gives you. Lambda Managed Instances lets you choose, and that choice extends to whatever EC2 instance types AWS makes available.

Wrapping Up

Lambda Managed Instances fills the gap between standard Lambda and ECS Fargate. The handler function and event-driven invocation pattern stay the same, but you gain EC2 hardware selection, multi-concurrency, configurable memory-to-vCPU ratios, and commitment-based pricing.

The key decisions:

Use it for sustained, predictable throughput where EC2 pricing beats per-GB-second
Choose your memory-to-vCPU ratio based on whether your workload is compute-bound or memory-bound
Understand the process model for your language - Python uses processes (simple, no shared-memory concerns), Java uses OS threads (requires thread-safe code), Node.js uses worker threads with async dispatch, .NET uses Tasks, and Rust uses Tokio async tasks (handlers must be Clone + Send)
Monitor MemoryUtilization because process memory multiplies with concurrency

The full Terraform configuration, Python handler, seed script, and Makefile are in the GitHub repository.

Resources

Lambda Managed Instances Documentation
Lambda Managed Instances - Python Runtime Guide
32 GB Memory / 16 vCPU Announcement (March 2026)
Capacity Provider Documentation
Amazon Nova Multimodal Embeddings - Embedding model used in this project
Terraform aws_lambda_capacity_provider
Powertools for AWS Lambda Best Practices - Observability patterns used in this project
Elastic Container Service - My Default Choice for Containers on AWS - ECS Fargate and Express Mode comparison
Serverless Data Processor - Step Functions with Lambda and Fargate

Connect with me on X, Bluesky, LinkedIn, GitHub, Medium, Dev.to, or the AWS Community. Check out more of my projects at darryl-ruggles.cloud and join the Believe In Serverless community.

Building AI Agents with Spring AI and Amazon Bedrock AgentCore - Part 5 Deploy MCP client for Conference application on AgentCore Runtime

Vadym Kazulkin — Tue, 26 May 2026 14:40:49 +0000

Introduction

In part 2, we explained how to deploy and run our conference search application on the Amazon Bedrock AgentCore Runtime as the MCP server. In this article, we'll develop the (MCP-) client, capable of talking to our application running on AgentCore Runtime. Later, in part 3, we developed the (MCP-) client, capable of talking to our application running on AgentCore Runtime. In part 4, we looked at how to provide the MCP Tools for the Conference application via AgentCore Gateway in a centralized way.

As we saw in previous articles, the local MCP client for the Conference application, to talk to AgentCore Runtime or Gateway, became quite big. If we have many customers using such a client, changing and operating it can become quite challenging. That's why, in this article, we look at how to deploy and run our MCP client on AgentCore Runtime.

Implement the MCP client for the Conference application to be deployable on AgentCore Runtime

We'll reuse the MCP client based on Spring AI that we implemented in parts 3 and 4. But as we need to make some small changes to deploy it on AgentCore Runtime, I created a new spring-ai-1.1-conference-app-agent-bedrock-agentcore-runtime. It consists of the agent and Infrastructure as Code subfolders.

Let's first look at the changes that we need to make to the client. AgentCore Runtime also supports the HTTP protocol contract, which we'll use to deploy our MCP client and talk to it. This contract puts some requirements on the client:

Container requirements:

Host : 0.0.0.0
Port : 8080 - Standard port for HTTP-based agent communication
Platform : ARM64 Docker container - Required for compatibility with the AgentCore Runtime environment. I usually borrow t4g small EC2 instance on AWS to build it.

Path requirements:

/invocations endpoint: POST endpoint for agent interactions
/ping endpoint: GET endpoint for health checks

You can read more about this topic in the HTTP protocol contract article.

The only changes we need to make to our REST Controller are to implement these path requirements. If we use asynchronous communication, the entry point looks like:

@PostMapping(value = "/invocations", consumes = { "*/*" })
public Flux<String> invocations(@RequestBody String prompt) {
   var token = getAuthTokenViaHttpClient();
   var client = McpClient.async(getMcpClientTransport(token)).build();
   ...
  var toolCallbacks = concatWithStream(asyncMcpToolCallbackProvider
 .getToolCallbacks(), ToolCallbacks.from(new DateTimeTools()));

  return this.chatClient.prompt().user(prompt)
    .toolCallbacks(toolCallbacks)
        .stream().content();
}

For synchronous communication, the entry point looks like:

@PostMapping(value = "/invocations", consumes = { "*/*" })
public String invocations(@RequestBody String prompt) {
   var token = getAuthTokenViaHttpClient();
   var client = McpClient.async(getMcpClientTransport(token)).build();
   ...
  var toolCallbacks = concatWithStream(asyncMcpToolCallbackProvider
 .getToolCallbacks(), ToolCallbacks.from(new DateTimeTools()));

  return this.chatClient.prompt().user(prompt)
    .toolCallbacks(toolCallbacks)
        .call().content();
}

For adding the path to /ping, we have different options. We can either add such a simple method:

@GetMapping("/ping")
public String ping() {
   return "{\"status\": \"healthy\"}";
}

Or use Spring Boot Actuator service and add some properties to the application.properties:

management.endpoints.web.exposure.include=health
management.endpoints.web.base-path=/
management.endpoints.web.path-mapping.health=ping

As we need to deploy our MCP client as an ARM64 Docker container, I also added a simple Docker file:

FROM amazoncorretto:25

COPY target/spring-ai-1.1-conference-app-agent-bedrock-agentcore-runtime-0.0.1-SNAPSHOT.jar app.jar
ENTRYPOINT ["java","-jar","/app.jar"]

Let's build the Docker file and upload it to the Amazon Elastic Container Registry:

# build the application
mvn clean package 

# build the Docker image
sudo docker build --no-cache -t spring-ai-conference-app-agent-bedrock-agentcore-runtime:v1 

# Login to ECR
aws ecr get-login-password --region {region} | sudo docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com  

# Create ECR repository (if it doesn't exist)
aws ecr create-repository --repository-name spring-ai-conference-app-agent-bedrock-agentcore-runtime --image-scanning-configuration scanOnPush=true --region {region}  

# Tag the Docker image
sudo docker tag spring-ai-conference-app-agent-bedrock-agentcore-runtime:v1 {account_id}.dkr.ecr.{region}.amazonaws.com/spring-ai-conference-app-agent-bedrock-agentcore-runtime:v1

# Push the Docker Image to the ECR repository
sudo docker push {account_id}.dkr.ecr.{region}.amazonaws.com/spring-ai-conference-app-agent-bedrock-agentcore-runtime:v1

Please replace AWS {account_id} and {region} with our own values. Also, your version may not be v1 but a different one.

We can also build the Docker image by using Buildpack support built into Spring instead of a Dockerfile. Just use the Maven task spring-boot:build-image.

We don't need to make any other changes on the MCP client itself.

Let's now cover the IaC part with CDK for Java, which I implemented in RuntimeWithMCPStack stack. We've already covered many steps in creating the CDK App and Stack, and even the AgentCore Runtime with the MCP protocol, in part 2. For a more detailed explanation, I refer to this article.

First, let's take a look at the creation of the AgentCore Runtime:

 Runtime.Builder.create(this, "MCPRuntime-125")
   .runtimeName(appName.replace("-", "_")+ "_runtime")
   .protocolConfiguration(ProtocolType.HTTP)
   .description("AgenCore Runtime with MCP protocol for running conference app")
      ...
   .build();

Here we set some common properties, such as the runtime name, description, and protocol (in our case, HTTP).

Now let's look at the relevant code parts to assign this code artifact to the AgentCore Runtime:

var ecrImageURI= ConventionalDefaults.
    getContextVariableValueWithReplacedAccountId(this, "ecrImageURIForConferenceSearchAndApplicationAgent");            

var agentRuntimeArtifact =    
    AgentRuntimeArtifact.fromImageUri(ecrImageURI);
   ....

Runtime.Builder.create(this, "MCPRuntime-125")
    .agentRuntimeArtifact(agentRuntimeArtifact)
    ...
    .build();

First, we get the value of the variable ecrImageURI, which points to the imageURI in the ECR we pushed previously. This is typically done in the cdk.json:

{
  "app": "mvn -e -q compile exec:java",
  "context": {
      ""ecrImageUR": "{AWS_ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/spring-ai-conference-app-agent-bedrock-agentcore-runtime:v1"
 }
}

Please adjust the value so that it matches your imageURI. We use the placeholder {AWS_ACCOUNT_ID} there. The reason for it is that I don't want to expose the AWS account ID publicly. That's why I wrote the following utility method getContextVariableValueWithReplacedAccountId in the ConventionalDefaults class to replace the placeholder with the real value:

static String getContextVariableValueWithReplacedAccountId(Stack stack, String contextVariableName) {
   var awsAccountId=(String)stack.getNode().tryGetContext("awsAccountId");
   if(awsAccountId == null || awsAccountId.trim().isEmpty()) {
      System.out.println("please provide your aws account id as as content to the call, for example: cdk deploy -c awsAccountId=1234567890101");
    }
   var contextVariableValue= getContextVariableValue(stack, contextVariableName);
   return replaceAWSAccountID(contextVariableValue, awsAccountId);
 }

static String getContextVariableValue(Stack stack, String contextVariableName) {
   return (String)stack.getNode().tryGetContext(contextVariableName);
 }

private static String replaceAWSAccountID(String configParam, String awsAccountId) {
   return configParam.replace("{AWS_ACCOUNT_ID}", awsAccountId);
}

Then we create AgentRuntimeArtifact from the image URI and set it as AgentCore Runtime agentRuntimeArtifact property.

Now let's cover the next part - defining the IAM execution role. It's very difficult to automate this part as it takes plenty of time. If I find it, I'll provide the IaC part in the future :). I refer you to the article IAM Permissions for AgentCore Runtime for more information. You can also read my article Amazon Bedrock AgentCore Runtime - Part 2 Using Bedrock AgentCore Runtime Starter Toolkit with Strands Agents SDK, where I explained this part. In that article, we developed the agent in Python with the Strands Agents framework and deployed it on AgentCore Runtime.

Once we have defined the IAM role, we need to configure it in the cdk.json:

{
  "app": "mvn -e -q compile exec:java",
  "context": {
      "roleArnForTheAgentCoreRuntime": "arn:aws:iam::{AWS_ACCOUNT_ID}:role/service-role/spring-ai-conference-search-application-agentcore-runtime-role",
    ....
   }
}

We use the placeholder for the AWS account ID as explained above. Here is the relevant code to grab the value of the roleArnForTheAgentCoreRuntime variable and set it to the execution role of the Runtime from the RuntimeWithMCPStack:

var roleArnForTheAgentCoreRuntime=ConventionalDefaults
   .getContextVariableValueWithReplacedAccountId(this, "roleArnForTheAgentCoreRuntime");

var role=Role.fromRoleArn(this,"roleArnForTheAgentCoreRuntimeRole", roleArnForTheAgentCoreRuntime);

Runtime.Builder.create(this, "MCPRuntime-123")
   .runtimeName(appName.replace("-", "_")+ "_runtime")
   ...
  .executionRole(role)
  .authorizerConfiguration(RuntimeAuthorizerConfiguration.usingIAM())
  .build();

Here, we also use an IAM authorizer for the inbound AgentCore Runtime authentication. This is the default authentication and authorization mechanism that works automatically without additional configuration. You can also use JSON Web Tokens (JWT) as we showed in part 2.

Now we are ready to deploy our MCP client on the AgentCore Runtime. The command to do it is:

cdk deploy -c awsAccountId={YOUR_AWS_ACCOUINT_ID}

Here is how the AgentCore Runtime looks in the console after its creation:

We'll need the Runtime ARN, which we see in the output of this command. Or we can grab it in the service console.

Now we still need to write a client that communicates with our MCP client on the Runtime. I provided such an InvokeRuntimeAgent client written in Java, but you can use any programming language for which AWS provides a (bedrockagentcore) SDK:

private static final String AGENT_RUNTIME_ARN="arn:aws:bedrock-agentcore:us-east-1:{AWS_ACCOUNT_ID}:runtime/spring_ai_conference_search_application_runtime-143wvBghklZ";

void main() throws Exception {

 String payload =
   "{\"prompt\":\"Please provide me with the list of conferences, including their IDs, 
with the Java topic happening in 2027, with the call for papers open today. 
Also, provide me with the list of my talks with this topic in the title. 
Finally, for each conference and talk retrieved, apply individually for the conference.\"}";

  var httpClient=ApacheHttpClient.builder()
      .connectionTimeout(Duration.ofMinutes(5))
      .socketTimeout(Duration.ofMinutes(5))
      .build();

  var bedrockAgentCoreClient = BedrockAgentCoreClient.builder()
      .region(Region.US_EAST_1)
      .httpClient(httpClient)
      .build();

  var invokeAgentRuntimeRequest = InvokeAgentRuntimeRequest.builder()                 
     .agentRuntimeArn(replaceAWSAccountID(AGENT_RUNTIME_ARN))                               
     .qualifier("DEFAULT")
     .contentType("application/json")
     .payload(SdkBytes.fromUtf8String(payload)).build();
      try (var responseStream = bedrockAgentCoreClient
     .invokeAgentRuntime(invokeAgentRuntimeRequest)) {
     var text = new String(responseStream.readAllBytes(), StandardCharsets.UTF_8); 
        System.out.println(text);
}

Let's go step-by-step through it. First of all, we define RUNTIME_ARN, which we deployed in the step before. Please still use the {AWS_ACCOUNT_ID} placeholder, which will be dynamically replaced with your AWS Account ID. When we create BedrockAgentCoreClient. We also explicitly set the Apache HTTP client with the extended connection and socket timeouts. Default 30-second timeouts maybe to short for communication with the Runtime. Then we create InvokeAgentRuntimeRequest and set the agent Runtime ARN, qualifier (always DEFAULT), content type, and payload. The payload is our prompt. You can see the examples of the prompts in part 4 as we're communicating with the same MCP client, but deployed elsewhere. When we invoke the invokeAgentRuntime method on the bedrockAgentCoreClient by providing the invokeAgentRuntimeRequest and convert the agent response to a string.

Conclusion

In this article, we looked at how to deploy and run our MCP client on AgentCore Runtime. With that, our MCP client now scales nicely within the Runtime.

Of course, you can create a nicer client by providing UI for entering the prompt and providing the agent response as a result. My goal was only to demonstrate how to implement such a client. Now we can change and redeploy our MCP client based on Spring AI on the AgentCore Runtime as often as we want. The client code remains unchanged as long as the Runtime ARN remains unchanged.

Starting from the next article, we'll look at the Spring AI AgentCore functionality. Spring AI AgentCore SDK is an open-source library that brings Amazon Bedrock AgentCore capabilities into Spring AI through familiar patterns: annotations, auto-configuration, and composable advisors.

If you like my content, please follow me on GitHub and give my repositories a star!

Please also check out my website for more technical content and upcoming public speaking activities.

Getting Claude Code off my laptop and onto shared compute

Danielle Heberling — Sat, 23 May 2026 10:12:03 +0000

Running Claude Code on my own machine was easy. Getting it onto shared compute my whole team could trigger was the hard part. There's plenty written about the local side. A lot less about the team side.

I made that move because of how a broken deploy plays out for us. I'm the only DevOps engineer on my team. A CloudFormation deploy fails. A Slack notification fires. And more often than not, someone pings me to ask what went wrong.

I get why. AWS isn't everyone's day to day, and a CREATE_FAILED event with a rollback behind it isn't the friendliest thing to read. The pings weren't the real problem, though. A broken deploy that hinges on one person doesn't scale.

So I decided to build my way out of it. I'd give the team a starting point on a broken deploy without pinging me. It wouldn't fix the problem, but it'd tell them what broke and where to start.

What I built

The result is a tool I'm calling the cfn-investigator. I put a thinned down version on GitHub as headless-claude-on-aws. It's narrowed to CloudFormation only and meant as a jumping-off point, not a copy of what I run at work. Same idea, rebuilt from scratch. It's close enough that you could follow it, learn from it, or fork it as a base.

The shape is small. It's a CodeBuild project that runs Claude Code headlessly. You hand it a failing stack name, optionally with the commit you suspect. It reads the stack state through the AWS MCP server with a read only role, works out the likely cause, and writes a short analysis. The example logs it to CloudWatch with a one line spot to forward it anywhere. Mine posts it in the Slack thread where the alert fired, right under the question.

One design choice worth calling out is how it handles confidence. The system prompt tells it to be honest, including an "unsure" option that ranks hypotheses instead of inventing a clean answer. A ranked shortlist beats a confident wrong guess.

Why it looks the way it does

It is not "best practice."

I picked CodeBuild over Lambda or Fargate, and handed Claude Code an Anthropic API key instead of routing through Bedrock. None were textbook choices. They got me to a working prototype fastest. CodeBuild matched the job. Clone the source, run a script, post the result somewhere. That's what the investigator does. The rest of the reasoning, including why I skipped Bedrock, is in the README.

If I'm being real, the biggest factor was knowing I'd be the only one responsible for this. So I optimized for two things, shipping something that worked and keeping it boring enough to maintain alone. Fancy was a liability. That's an engineering trade-off, not an accident.

The imperfect but working part

This is the part I actually care about.

The repo is a pile of YAML and bash. The IAM is broader than it should be (it uses AWS managed ReadOnlyAccess, which you'd want to scope down). The tools get installed fresh on every run instead of baked into an image. The two role split scopes the MCP server's AWS calls, not Claude itself.

And it works. Last week a deploy failed and the investigator flagged a missing environment variable on a Fargate task definition. The developer saw the message, added the variable, and redeployed without pinging anyone. A broken deploy comes with a starting point attached now, so the next move doesn't wait on one person.

In my opinion we've gotten a little precious about reference architectures. There's a strong pull to wait until you can build the clean, fully managed, perfectly scoped version. But the clean version often doesn't exist yet, or isn't mature, or would take three times as long to ship. Meanwhile the messy version that you actually understand and can keep running yourself is sitting right there, solving the real problem today.

What you might want instead

The reason I had to write all that YAML is that the managed options either didn't exist or weren't mature when I started.

That's changed. Before I write more YAML next time, I want to look at Claude on AWS, Claude Managed Agents, and the Claude Agent SDK. Any of those would let you skip most of the plumbing I built by hand. I haven't used them for real yet, so I can't tell you how they hold up, but they're the first place I'd look now.

I'm sharing my version for the cases where the managed path isn't a fit, and as a concrete example you can pull apart.

Take a look

The code is up at github.com/deeheber/headless-claude-on-aws. The README walks through deploying it, populating the secrets, and kicking off a run.

If you build something like this, I'd love to hear how it went. Especially the parts that didn't work.

Amazon Q Developer CLI ahora es Kiro CLI — ¿Qué cambió y por qué importa?

Carlos Cortez 🇵🇪 [AWS Hero] — Thu, 21 May 2026 01:03:15 +0000

Amazon Q Developer CLI ahora es Kiro CLI — ¿Qué cambió y por qué importa?

Si llevas un tiempo en el ecosistema AWS y usas herramientas de desarrollo con IA, probablemente ya notaste el cambio: Amazon Q Developer CLI ya no existe como tal. Ahora se llama Kiro CLI. Y no, no es solo un rebrand de nombre — es un cambio de filosofía completo.

Vamos a explorar qué pasó, qué cambió realmente, y por qué creo que esto importa más de lo que parece.

Un poco de contexto: ¿qué era Amazon Q Developer CLI?

Amazon Q Developer era el asistente de IA de AWS para desarrolladores. Tenía una versión en el IDE (VS Code, JetBrains), una versión en la consola de AWS, y también una CLI que te permitía interactuar con tu entorno desde la terminal usando lenguaje natural.

La idea era buena: preguntarle a un agente directamente desde tu terminal cosas como "¿qué instancias EC2 tengo corriendo en us-east-1?" o "genera un script para limpiar buckets S3 sin versioning". Útil, pero limitado en su enfoque — era básicamente un chatbot en tu terminal.

Entonces, ¿qué es Kiro?

Kiro es un IDE agéntico construido sobre VS Code, lanzado por AWS. Pero lo que muchos no saben es que también tiene una CLI — Kiro CLI — que reemplaza directamente a Amazon Q Developer CLI.

Lo interesante es que Kiro no es solo "Q con otro nombre". El cambio refleja una evolución real en cómo AWS piensa las herramientas de desarrollo:

Amazon Q Developer CLI	Kiro CLI
Asistente conversacional en terminal	Agente con contexto del proyecto
Respuestas puntuales	Spec-driven + MCP-driven + Steering-driven
Foco en queries rápidas	Foco en workflows completos
Sin memoria de proyecto	Entiende tu arquitectura y convenciones
Comando: `q chat`	Comando: `kiro-cli chat`

La idea aquí es que Kiro no solo responde preguntas — razona sobre tu proyecto, lee tus steering files, se conecta a herramientas externas vía MCP, y actúa en consecuencia.

El cambio más importante: de "asistente" a "agente"

En la práctica esto significa que Kiro CLI opera con capacidades que Amazon Q Developer CLI nunca tuvo de forma nativa:

Core Features de Kiro CLI

Interactive Chat — Conversaciones en lenguaje natural directamente en tu terminal con kiro-cli chat. Puede leer y escribir archivos, ejecutar comandos, y razonar sobre tu código.
Custom Agents — Puedes crear y desplegar agentes especializados para tus workflows específicos. No estás limitado al agente genérico.
MCP Integration — Conecta herramientas y fuentes de datos externas a través del Model Context Protocol. Esto es enorme — puedes conectar Kiro CLI a servidores MCP de CloudWatch, MSK, OpenSearch, Okta, y muchos más.
Smart Hooks — Automatiza workflows con hooks inteligentes que se ejecutan antes o después de comandos específicos.
Agent Steering — Guía al agente con las mejores prácticas y preferencias de tu equipo usando steering files. Esto es lo que hace que Kiro entienda tu contexto, no solo el contexto genérico.
Auto Complete — Sugerencias inteligentes de comandos con contexto mientras escribes en la terminal.

# Antes con Amazon Q Developer CLI
q chat "lista mis funciones Lambda en us-east-1"

# Ahora con Kiro CLI — con contexto de proyecto y MCP
kiro-cli chat
> revisa las funciones Lambda del proyecto y sugiere optimizaciones basándote en las métricas de CloudWatch

La diferencia no es solo sintáctica. Kiro sabe qué proyecto es, tiene acceso a tus herramientas vía MCP, y puede actuar sobre eso.

¿Cómo instalar Kiro CLI hoy?

La instalación es directa:

# macOS / Linux
curl -fsSL https://cli.kiro.dev/install | bash

# Windows (PowerShell)
irm 'https://cli.kiro.dev/install.ps1' | iex

# Verificar instalación
kiro-cli --version

Una vez instalado, empezar es así de simple:

# Navega a tu proyecto
cd mi-proyecto

# Inicia Kiro CLI
kiro-cli

# O directamente al chat
kiro-cli chat

Otros comandos útiles

# Traducir lenguaje natural a comandos bash
kiro-cli translate "muestra los últimos 10 logs de mi función Lambda"

# Habilitar sugerencias inline (requiere zsh)
kiro-cli inline enable

# Deshabilitar sugerencias inline
kiro-cli inline disable

Lo interesante es que kiro-cli translate convierte tu instrucción en el comando bash correspondiente sin ejecutarlo — tú decides si lo corres o no. Perfecto para aprender comandos complejos de AWS CLI.

Kiro CLI en CloudShell

Si no quieres instalar nada localmente, Kiro CLI ya viene disponible en AWS CloudShell. Solo abre CloudShell y ejecuta kiro-cli.

Las sugerencias inline en CloudShell requieren Z shell:

# Cambiar a zsh en CloudShell
zsh

# Las sugerencias inline se habilitan automáticamente
# Para deshabilitarlas:
kiro-cli inline disable

Agent Steering: dale contexto persistente a Kiro CLI

Esta es una de las features más importantes de Kiro CLI y la que marca la diferencia real con lo que teníamos en Amazon Q Developer CLI. Los steering files son archivos markdown que le dan a Kiro conocimiento persistente sobre tu proyecto, tu stack, y las convenciones de tu equipo.

La idea aquí es simple: en vez de re-explicar tu proyecto cada vez que abres una sesión, escribes un steering file una vez y Kiro lo lee automáticamente en cada interacción.

¿Qué poner en un steering file?

Un steering file es un .md que vive en tu proyecto (típicamente en .kiro/steering/) y puede contener:

Stack tecnológico — qué lenguajes, frameworks y servicios usas
Convenciones del equipo — naming conventions, patrones de diseño, estructura de carpetas
Contexto de infraestructura — nombres de instancias, ubicación de logs, usuarios del sistema
Requisitos de compliance — estándares de seguridad, accesibilidad, auditoría
Reglas de negocio — lógica específica de tu dominio que el agente debe respetar

Ejemplo real: steering file para un proyecto serverless

# Project Context

## Stack
- Runtime: Python 3.12 on AWS Lambda
- API: Amazon API Gateway (REST)
- Database: Amazon DynamoDB (single-table design)
- IaC: AWS CDK (Python)

## Conventions
- All Lambda handlers go in `src/handlers/`
- Business logic goes in `src/services/` — never in handlers
- Use structured logging with aws-lambda-powertools
- DynamoDB access patterns use PK/SK with GSI1

## Security
- All API endpoints require IAM authorization
- No hardcoded credentials — use environment variables from Secrets Manager
- Input validation on every handler entry point

Lo que hace esto particularmente poderoso es que Kiro CLI usa este contexto en cada interacción. Si le pides que genere un nuevo endpoint, va a seguir tus convenciones automáticamente — handlers en src/handlers/, lógica en src/services/, con powertools y validación de input. Sin que tengas que repetirlo.

Steering files en la práctica

El blog de AWS sobre Oracle EBS con Kiro CLI muestra un caso real: usan steering files para darle a Kiro el conocimiento de su entorno Oracle — patrones de nombres de instancias, usuarios del OS, ubicación de logs y scripts. Así, cuando preguntan "¿está sano el concurrent manager?", Kiro ya sabe dónde buscar sin que se lo expliquen cada vez.

Para equipos, esto es oro. Un nuevo developer se une, clona el repo, y Kiro CLI ya tiene todo el contexto del proyecto en los steering files. La curva de onboarding se reduce dramáticamente.

Spec-Driven Development: ¿disponible en Kiro CLI?

Acá hay que ser claros porque es una pregunta que muchos se hacen. Spec-Driven Development es una feature del Kiro IDE, no de Kiro CLI.

Según la documentación oficial de Kiro, las Specs viven bajo la sección de documentación del IDE (/docs/specs/), y no aparecen en la sidebar de la CLI. La CLI tiene: Chat, Custom Agents, MCP, Hooks, Steering, Autocomplete, y Headless — pero no Specs.

¿Qué es Spec-Driven Development?

Para los que no lo conocen, es el workflow estrella de Kiro IDE. Funciona así:

Le describes tu idea al agente en lenguaje natural
Kiro genera requirements estructurados (en formato EARS)
Kiro crea un design document con arquitectura, modelos de datos, APIs
Kiro produce un plan de implementación con tareas concretas y ordenadas
Kiro ejecuta cada tarea — escribe código, tests, documentación

Cada spec genera tres archivos clave: requirements.md, design.md, y tasks.md. Hay dos tipos de specs: Feature Specs (para funcionalidades nuevas) y Bugfix Specs (para diagnosticar y corregir bugs).

¿Por qué no está en la CLI?

Mi lectura es que Spec-Driven Development requiere una experiencia visual que la terminal no puede ofrecer fácilmente — la navegación entre archivos de spec, la vista de progreso de tareas, y la interacción con el design document son inherentemente visuales. La CLI está optimizada para workflows más directos: chat, automatización, MCP, y headless.

¿Qué usar entonces?

Necesidad	Herramienta
Desarrollar features completas con specs	Kiro IDE
Chat agéntico desde la terminal	Kiro CLI
Automatización y CI/CD	Kiro CLI (headless)
Conectar herramientas externas vía MCP	Kiro CLI o Kiro IDE
Steering files para contexto de equipo	Kiro CLI o Kiro IDE
Troubleshooting rápido de AWS	Kiro CLI

Mi recomendación: usa ambos. Kiro IDE para desarrollo de features con specs, y Kiro CLI para todo lo que haces en la terminal — troubleshooting, automatización, operaciones, y CI/CD. Los steering files funcionan en ambos, así que tu contexto de proyecto se comparte.

MCP: lo que hace a Kiro CLI realmente poderoso

El Model Context Protocol es un estándar abierto que permite a agentes de IA conectarse de forma segura con herramientas externas, fuentes de datos y servicios. En la práctica esto significa que puedes extender las capacidades de Kiro CLI conectándolo a servidores MCP especializados.

Algunos ejemplos reales que ya existen:

Servidor MCP	Qué hace
CloudWatch MCP	Consulta métricas, logs y alarmas con lenguaje natural
Amazon MSK MCP	Administra clusters de Kafka — topics, configuraciones, health
AWS Diagram MCP	Genera diagramas de arquitectura AWS desde prompts
OpenSearch MCP	Busca índices, inspecciona estado del cluster, diagnósticos
Okta MCP	Gestión de identidades — usuarios, grupos, permisos
AWS Documentation MCP	Busca y lee documentación de AWS en contexto

La configuración de un servidor MCP se hace en ~/.kiro/settings/mcp.json. Una vez configurado, Kiro CLI tiene acceso a las herramientas del servidor y las usa automáticamente cuando son relevantes para tu pregunta.

Lo que hace esto particularmente poderoso es que puedes combinar múltiples servidores MCP. Imagina preguntarle a Kiro CLI: "¿por qué mi aplicación está lenta?" y que automáticamente consulte CloudWatch para métricas, OpenSearch para logs, y te dé un diagnóstico completo — todo desde tu terminal.

Los métodos de login — esto es clave

Acá es donde la cosa se pone interesante de verdad, porque Kiro CLI hereda el modelo de autenticación de Amazon Q Developer pero con matices que importan. Hay cuatro formas de conectarte, y cada una te da acceso a cosas diferentes:

1. Builder ID (Free) — para empezar rápido

Es la forma más simple. Te creas un AWS Builder ID gratis (con tu email, Google, Apple, GitHub o Amazon) y listo, ya puedes usar Kiro CLI y el IDE.

La limitación: tienes límites mensuales de uso y solo funciona en el IDE y la CLI. No tienes acceso a la consola de AWS ni a features avanzados. Pero para proyectos personales y exploración, es más que suficiente.

2. Builder ID + Pro — más límites, tu propia cuenta AWS

Acá es donde empieza a ponerse bueno. Puedes hacer upgrade de tu Builder ID al tier Pro conectándolo a tu propia cuenta de AWS. Esto te da límites de uso mucho más altos.

# Inicia chat con Kiro CLI
kiro-cli chat

# Dentro del chat, escribe:
/subscribe
# Esto abre la consola de AWS para confirmar la suscripción Pro

El punto clave es que con Builder ID Pro tienes límites más altos, pero no todas las features Pro. Algunas features avanzadas solo están disponibles vía IAM Identity Center.

3. IAM Identity Center (Free) — para equipos y organizaciones

Si tu empresa ya usa IAM Identity Center (antes AWS SSO), puedes autenticarte con tu identidad corporativa. Esto te da acceso a la consola de AWS, apps y websites de AWS — algo que Builder ID no puede hacer.

Ideal si tu admin ya configuró el Identity Center en la organización.

4. IAM Identity Center + Pro — el combo completo 🔥

Y acá está el gold standard. Tu admin te suscribe a Amazon Q Developer Pro vía IAM Identity Center, y tienes acceso a todo: CLI, IDE, consola, apps de AWS, features avanzados, límites altos, y control empresarial.

Lo que hace esto particularmente poderoso es que tu empresa tiene control total: puede suscribir usuarios en bulk, trackear uso, cancelar suscripciones, y tú como developer tienes la experiencia completa en todos los canales.

La guía rápida de acceso

🆓 Builder ID — Free

✅ CLI · ✅ IDE · ❌ Consola AWS · ❌ Features Pro
→ Para empezar rápido con proyectos personales

💰 Builder ID — Pro

✅ CLI · ✅ IDE · ❌ Consola AWS · ⚠️ Features Pro parciales
→ Límites más altos, pero no el suite completo

🏢 IAM Identity Center — Free

❌ CLI · ❌ IDE · ✅ Consola AWS · ❌ Features Pro
→ Solo consola, ideal si tu admin aún no activó Pro

🔥 IAM Identity Center — Pro

✅ CLI · ✅ IDE · ✅ Consola AWS · ✅ Features Pro completos
→ La experiencia completa — el gold standard

Mi recomendación sincera: si eres developer individual, empieza con Builder ID Free y cuando sientas los límites, haz upgrade a Pro con tu cuenta AWS. Si estás en una empresa, pídele a tu admin que configure IAM Identity Center con Pro — es la experiencia más completa y además puedes usarlo desde Kiro IDE con toda la potencia.

Headless Mode: Kiro CLI en CI/CD

Una capacidad nueva que no existía en Q Developer CLI es el modo headless — puedes ejecutar prompts de forma no interactiva usando API keys. Esto abre la puerta a integrar Kiro CLI en pipelines de CI/CD.

En la práctica esto significa que puedes automatizar tareas como:

Revisión de código automatizada en PRs
Generación de documentación en cada merge
Análisis de seguridad como paso del pipeline
Generación de tests para código nuevo

¿Debería migrar ya?

Sí, y sin dudarlo. Amazon Q Developer CLI ya no recibe actualizaciones activas. Todo el desarrollo está en Kiro CLI.

Pero más allá de la migración técnica, lo que me parece más valioso es el cambio de mentalidad que propone Kiro: dejar de usar la IA como un buscador glorificado y empezar a usarla como un agente que entiende tu contexto, se conecta a tus herramientas, y trabaja contigo en workflows reales.

La migración en sí es simple:

# 1. Instalar Kiro CLI
curl -fsSL https://cli.kiro.dev/install | bash

# 2. Donde antes usabas 'q chat', ahora usas:
kiro-cli chat

# 3. Donde usabas 'q translate', ahora:
kiro-cli translate "tu instrucción aquí"

El takeaway principal

El cambio de Amazon Q Developer CLI a Kiro CLI no es cosmético. Es una señal clara de hacia dónde va AWS con sus herramientas de desarrollo: agentes con contexto persistente, conectados a tus herramientas vía MCP, que entienden las convenciones de tu equipo vía steering files, y que pueden actuar — no solo responder.

Las capacidades clave que ganamos:

kiro-cli chat — Chat agéntico con capacidad de leer/escribir archivos y ejecutar comandos
kiro-cli translate — Traduce lenguaje natural a bash
kiro-cli inline — Sugerencias inteligentes mientras escribes
MCP Integration — Conecta herramientas externas (CloudWatch, MSK, OpenSearch, etc.)
Agent Steering — Dale contexto persistente a Kiro con las prácticas y convenciones de tu equipo
Custom Agents — Crea agentes especializados para tus workflows
Smart Hooks — Automatización pre/post comandos
Headless Mode — Integración con CI/CD vía API keys

Y un punto importante: Spec-Driven Development es exclusivo del Kiro IDE. Si quieres el workflow completo de specs (requirements → design → tasks → implementación), necesitas el IDE. La CLI es para chat, automatización, MCP, y operaciones desde la terminal. Ambos comparten steering files, así que tu contexto de proyecto funciona en los dos.

Mi recomendación: instala Kiro CLI, crea un steering file para tu proyecto más activo, conecta un servidor MCP relevante para tu stack, y experimenta. La curva de aprendizaje es corta y el salto de productividad es real.

Yo soy Carlos Cortez y esto es Breaking the Cloud — nos vemos pronto.

Sígueme en:

🔗 LinkedIn
🐦 X / Twitter
💻 GitHub
📝 Dev.to
🦸 AWS Community
✍️ Medium

Strands Agents + AgentCore Runtime - a perfect match

Matt Lewis — Wed, 20 May 2026 21:17:54 +0000

This is the third in a series of posts documenting the architecture, implementation, and lessons learned from building the AWS Briefing Agent - a personalised AWS assistant deployed on Amazon Bedrock AgentCore Runtime.

Part 1: Building a Full-Stack AI Agent on Bedrock AgentCore
Part 2: Data Ingestion: RSS Feeds, Knowledge Base, S3 Vectors, and Metadata Filtering
Part 3: Strands Agents + AgentCore Runtime - a perfect match
Part 4: Adding Memory to the Agent
Part 5: Experimenting with API Gateway
Part 6: Observability and Evaluations
Part 7: Third Party Integrations - Identity, Gateway and Slack Notifications

The initial implementation of the AWS Briefing Agent called the AWS News Feed RSS feed on every invocation. After setting up an Amazon Bedrock Knowledge Base, the next step was to refactor the code to take advantage of an agentic framework. The decision was made to adopt Strands Agents SDK as an open source SDK that helps you build and run AI agents in just a few lines of code. In our case, switching to the Knowledge Base and adopting Strands Agents SDK helped us to reduce the number of lines of code in our implementation logic by 75%.

Using Strands Agents SDK

The core of the Strands Agents code is straightforward and shown in the code snippet below:

from strands import Agent
from strands.models import BedrockModel
from strands.agent.conversation_manager import SlidingWindowConversationManager
from strands_tools import retrieve
from agent.tools.slack_formatter.tool import format_slack_message

model = BedrockModel(
    guardrail_id=GUARDRAIL_ID,
    guardrail_version=GUARDRAIL_VERSION,
    guardrail_trace="enabled",
)

agent = Agent(
    system_prompt=_load_system_prompt(),
    model=model,
    tools=[retrieve, format_slack_message] + gateway_tools,
    session_manager=session_manager,
    conversation_manager=SlidingWindowConversationManager(
        window_size=20,
        should_truncate_results=True,
        per_turn=True,
    ),
    callback_handler=None,
)

result = agent(message)

We start by importing a number of classes and functions from two packages (strands-agents and strands-agents-tools) and one local module. Agent is the core class for the agent itself, BedrockModel is the model provider, SlidingWindowConversationManager controls how conversation history is trimmed, and retrieve is a pre-built tool that is used to query a Bedrock Knowledge Base. The format_slack_message is a local custom tool within this project - a Python function decorated with the @tool annotation.

We instantiate the BedrockModel() without specifying a model_id. At this point, Strands uses its default model, which is current Claude Sonnet on Bedrock. We include details of a Bedrock Guardrail when we instantiate the model, purely to demonstrate the use of guardrails which we cover this later in the blog post.

Finally, we create the agent by wiring together its core components.

Deploy to Amazon Bedrock AgentCore Runtime

The AgentCore Runtime Python SDK provides a lightweight wrapper that helps to deploy your agent function as HTTP services

# Import the runtime
from bedrock_agentcore.runtime import BedrockAgentCoreApp

# Initialise the app
app = BedrockAgentCoreApp()

# Decorate the function
@app.entrypoint
def invoke(payload: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Entry point for AgentCore Runtime."""
    message = payload.get("prompt", payload.get("message", ""))
    ...
    return response

BedrockAgentCoreApp wraps your function in an HTTP server that listens om port 8080 with two endpoints:

/invocations - a POST endpoint for agent interactions. This gets invoked when customers call the InvokeAgentRuntime action with the payload in JSON format
/ping - a GET endpoint for health checks to verify your agent is operational and ready to handle requests

The @app.entrypoint decorator registers your invoke function as the handler for incoming requests. When AgentCore Runtime receives a request, it deserialises the JSON body into payload, provides a context object (with session_id, request_headers, etc.), calls your function, and serialises the returned dict back as the HTTP response.

Using the Container Build

When using the @aws/agentcore CLI and running agentcore deploy, the CLI needs to turn the Python source code into a runnable container image on AgentCore Runtime. This is controlled by the build field in the agentcore.json file. The default setting is CodeZip, in which the CLI zips up the Python source code, uploads it, and AgentCore resolves dependencies using uv --no-build. This is fast but has a hard constraint, as every dependency must have a pre-built wheel. In our code, we have a package that only ships source distributions, which required us to switch to the Container build setting. This also makes our build more production-ready.

When you run agentcore deploy with the Container build type, the CLI synthesis a CloudFormation stack that includes a CodeBuild project, an ECR repository, the AgentCore Runtime resource, and IAM roles. The CLI packages the codeLocation directory (agent/) and uploads it to S3 as the CodeBuild source artefact. CodeBuild pulls the provided Dockerfile and builds the container image. You can see all the steps in the CodeBuild project below:

After the image builds successfully, CodeBuild tags it and pushes it to the ECR repository as shown below:

The stack updates the Runtime resource to point at the new ECR image URI. AgentCore pulls the image from ECR the next time it starts a container for an invocation.

Built-In Conversation Managers

In the Strands Agents SDK, the user messages and agent responses are all added to the context. As the conversation grows within a session, this starting having a material impact on response times. We modified the default SlidingWindowConversationManager manager:

reducing the windowSize from the default of 40 to 20. This sets the maximum number of messages to keep
setting the per_turn parameter to false. This runs the sliding window before every model call within the same invocation, rather than waiting until after the agent loop completes.

This reduced the average response time from around 80 seconds down to 15 seconds.

Adding Bedrock Guardrails

Amazon Bedrock Guardrails are designed to help you safely build and deploy responsible generative AI applications with confidence. We decided to include a guardrail in the architecture, to understand where it fits in and what it can provide.

The guardrail itself was defined in CDK with content filters (sexual, violence, hate, insults, misconduct and prompt attack), a topic policy (deny off-topic sports questions), and a managed profanity word list:

# ----------------------------------------------------------------
# Bedrock Guardrail — content safety for the agent
# ----------------------------------------------------------------
guardrail = bedrock.CfnGuardrail(
    self,
    "BriefingAgentGuardrail",
    name="briefing-agent-guardrail",
    description="Content safety guardrail for the AWS Briefing Agent",
    blocked_input_messaging="I'm sorry, I can't process that request. Please rephrase your question about AWS announcements.",
    blocked_outputs_messaging="I'm sorry, I can't provide that response. Let me try a different approach.",
    content_policy_config=bedrock.CfnGuardrail.ContentPolicyConfigProperty(
        filters_config=[
            bedrock.CfnGuardrail.ContentFilterConfigProperty(
                type="SEXUAL",
                input_strength="HIGH",
                output_strength="HIGH",
            ),
            bedrock.CfnGuardrail.ContentFilterConfigProperty(
                type="VIOLENCE",
                input_strength="HIGH",
                output_strength="HIGH",
            ),
            # HATE, INSULTS, MISCONDUCT, PROMPT_ATTACK
        ],
    ),
    topic_policy_config=bedrock.CfnGuardrail.TopicPolicyConfigProperty(
        topics_config=[
            bedrock.CfnGuardrail.TopicConfigProperty(
                name="Sports",
                definition="Questions about sports scores, match results, player transfers, league standings, fixtures, or any sporting events.",
                type="DENY",
            ),
        ],
    ),
    word_policy_config=bedrock.CfnGuardrail.WordPolicyConfigProperty(
        managed_word_lists_config=[
            bedrock.CfnGuardrail.ManagedWordsConfigProperty(
                type="PROFANITY",
            ),
        ],
    ),
)

When the agent is invoked, the request first reaches the AgentCore Runtime and runs the handler code first. The guardrail itself is only applied when the handler makes the Bedrock inference call. Bedrock evaluates the input before running the model inference, and then inspects the output before returning it. We did encounter some interesting behaviour when implementing the guardrail.

IAM Permission Gap

The first invocation after adding the guardrail failed with:

AccessDeniedException: User is not authorized to perform: bedrock:ApplyGuardrail
on resource: arn:aws:bedrock:eu-west-1.xxx

The AgentCore execution role (auto-created by the @aws/agentcore-cdk construct) includes bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream, but not bedrock:ApplyGuardrail. The construct doesn’t know about guardrails — they’re a Bedrock feature, not an AgentCore feature. We ended up having to use the aws iam put-role-policy CLI command to add the missing permission

Topic policies can false-positive on legitimate queries

The initial topic policy denied "questions not related to AWS services, cloud computing, or technology". The intention was that it would be easy to demonstrate, and would ensure that the user input was relevant. However, when the user asked questions such as "what are the top announcements today", the classifier ended up deciding this was a blocked topic. In the end, to demonstrate how topic policies work, we changed it to explicitly deny sporting questions.

Guardrail versions can be deleted by CDK updates

When we updated the topic policy, we changed the version description for the guardrail. The CDK stack updated the guardrail version resource, so that CloudFormation deleted version 1 and created version 2. Unfortunately, the version number is also defined in the agentcore.json file. This meant that the AgentCore Runtime container still had version 1 baked into its environment, which meant calls now failed with the following exception:

ValidationException: The guardrail identifier or version provided in the request does not exist.

In the end it was a case of having to update the version number in agentcore.json, redeploy the agent, and start a new session.

Data Ingestion: RSS Feeds, Knowledge Base, S3 Vectors, and Metadata Filtering

Matt Lewis — Wed, 20 May 2026 21:16:06 +0000

This is the second in a series of posts documenting the architecture, implementation, and lessons learned from building the AWS Briefing Agent - a personalised AWS assistant deployed on Amazon Bedrock AgentCore Runtime.

Part 1: Building a Full-Stack AI Agent on Bedrock AgentCore
Part 2: Data Ingestion: RSS Feeds, Knowledge Base, S3 Vectors, and Metadata Filtering
Part 3: Strands Agents + AgentCore Runtime - a perfect match
Part 4: Adding Memory to the Agent
Part 5: Experimenting with API Gateway
Part 6: Observability and Evaluations
Part 7: Third Party Integrations - Identity, Gateway and Slack Notifications

When I started building the AWS Briefing Agent, the first version queried the AWS What's New RSS feed on every invocation. This worked in terms of showing the agent could return tailored information back to the client. However, it was costly and wasteful, with the same data fetched repeatedly, which added latency to every invocation. The RSS feed also only covers recent information, and it was likely we would want to start searching for releases that had been launched in the past 6 months or more. The next step therefore, was to separate the retrieval by the agent from the ingestion.

Amazon Bedrock Knowledge Base

One of the key design goals was to allow the agent to match a natural language query "what's new in Bedrock this week?" against a large corpus of documents to return the most semantically similar results. This is where Amazon Bedrock Knowledge Base comes into its own. It allows the agent to use RAG (Retrieval-Augmented Generation). By querying the Knowledge Base, we can retrieve relevant documents at query time, and then inject them into the prompt as context. The LLM then generates a response from this retrieved information which we know to be factual.

The python CDK code that creates the Knowledge Base is shown below:

knowledge_base = bedrock.CfnKnowledgeBase(
    self,
    "AnnouncementKnowledgeBase",
    name="aws-briefing-agent-announcements",
    ...
    knowledge_base_configuration=bedrock.CfnKnowledgeBase.KnowledgeBaseConfigurationProperty(
        type="VECTOR",
        vector_knowledge_base_configuration=bedrock.CfnKnowledgeBase.VectorKnowledgeBaseConfigurationProperty(
            embedding_model_arn=f"arn:aws:bedrock:{self.region}::foundation-model/amazon.titan-embed-text-v2:0",
        ),
    ),
    storage_configuration=bedrock.CfnKnowledgeBase.StorageConfigurationProperty(
        type="S3_VECTORS",
        s3_vectors_configuration=bedrock.CfnKnowledgeBase.S3VectorsConfigurationProperty(
            index_name="announcements",
            vector_bucket_arn=f"arn:aws:s3vectors:{self.region}:{self.account}:bucket/briefing-agent-vectors",
        ),
    ),
)

This declares the embeddings model to be used as amazon.titan-embed-text-v2:0 and the vector store as being of type S3_VECTORS. There is no code required to handle aspects such as embeddings. Instead, Bedrock manages all of this for us.

Amazon S3 Vectors

Amazon Bedrock Knowledge Bases support several vector stores. A vector store is the retrieval engine that makes RAG work. It stores documents as numerical embeddings (vectors) that are generated by an embeddings model. At query time, the user's question is embedded, and the vector store finds documents whose embeddings are closest in meaning.

The prototype uses Amazon S3 Vectors as the underlying vector store. S3 Vectors provides cost-effective, elastic, and durable vector storage at up to 90% lower costs for uploading, storing, and querying vectors than alternatives such as OpenSearch Serverless. There is no infrastructure to manage, and it still provides a sub-second query latency which is acceptable for this use case.

Scheduling the Ingestion

The ingestion pipeline is run every 6 hours using Amazon EventBridge Scheduler. This service provides capabilities such as built-in retry policies, time zone support, and dead-letter queues. The schedule triggers an AWS Lambda function that carries out the required processing. This includes:

Lists existing document hashes in S3
Fetches the AWS What’s New RSS feed (~100 announcements)
Fetches 13 AWS blog RSS feeds (aws, machine-learning, compute, security, database, containers, devops, networking, storage, infrastructure-and-automation, developer, big-data, iot)
Fetches the AWS Security Bulletins RSS feed
For each new blog post, fetches the canonical URL and extracts the full article body using a stdlib HTML parser
Parses publication dates into YYYYMMDD integers
Writes .txt and .metadata.json files per new item to S3
Triggers a Bedrock KB ingestion job

Deduplication and Incremental Writes

When the ingestion pipeline runs, most of the content in the various RSS feeds is not new. It was important to find a way to prevent re-fetching and re-writing hundreds of announcements every 6 hours.

To support this, we created an MD5 hash of the blog posts URL, truncated to 12 hex characters. This hash is used as the S3 filename. The sample code snippet is shown below:

def write_to_s3(items, existing_keys=None):
    existing = existing_keys or set()
    for item in items:
        url_hash = hashlib.md5(item["link"].encode()).hexdigest()[:12]
        if url_hash in existing:
            continue # Already in S3, skip
        # ... write doc + metadata files

At startup, get_existing_keys() lists all the .txt files in S3 and extracts the hash from each filename into a set. When processing the blog posts, the Lambda functions computes the URL hash and checks to see if it is already in the set. If it already exists, then it has been ingested in a previous run, and there is no need to re-fetch the page. If the hash does not exist, then the function fetches the page, extracts the content, and writes to S3. The hash gives a stable, deterministic filename derived from the URL. The same URL always produces the same hash.

Chunking Strategy

The chunking strategy is set on the Data Source resource in the CDK stack as shown below:

data_source = bedrock.CfnDataSource(
    self,
    "AnnouncementDataSource",
    name="aws-announcements-s3",
    knowledge_base_id=knowledge_base.attr_knowledge_base_id,
    data_source_configuration=bedrock.CfnDataSource.DataSourceConfigurationProperty(
        type="S3",
        s3_configuration=bedrock.CfnDataSource.S3DataSourceConfigurationProperty(
            bucket_arn=data_bucket.bucket_arn,
        ),
    ),
    vector_ingestion_configuration=bedrock.CfnDataSource.VectorIngestionConfigurationProperty(
        chunking_configuration=bedrock.CfnDataSource.ChunkingConfigurationProperty(
            chunking_strategy="SEMANTIC",
            semantic_chunking_configuration=bedrock.CfnDataSource.SemanticChunkingConfigurationProperty(
                breakpoint_percentile_threshold=92,
                buffer_size=1,
                max_tokens=600,
            ),
        ),
    ),
)

We utilise a SEMANTIC chunking strategy. This uses the embedding model itself to decide where to split. The following three parameters control this behaviour:

breakpoint_percentile_threshold=92 - controls the percentile threshold that will result in a split. A higher threshold requires sentences to be more distinguishable to split the document into different chunks.
max_tokens=600 - the maximum number of tokens that should be included in a single chunk, while honoring sentence boundaries.
buffer_size=1 - for a given sentence, the buffer size defines the number of surrounding sentences to be added for embeddings creation. A larger buffer size might capture more context but can also introduce noise, while a smaller buffer size might miss important context but ensures more precise chunking.

Filtering by Date

One of the goals in writing the agent was that a user could ask to constrain information by how recent it is e.g. "what is new in the past 7 days?".

To help achieve this, at ingestion time for each document, we create an associated metadata.json sidecar file that attaches structured, filterable attributes to a document so the agent can narrow search results without relying only on semantic similarity. An example companion file is shown below:

{
  "metadataAttributes": {
    "published_date": 20260415,
    "service": "amazon-bedrock",
    "category": "artificial-intelligence",
    "source_type": "announcement"
  }
}

During the Knowledge Base sync, Bedrock reads this sidecar and attaches those attributes to every vector chunk generated from that document. At query time, the agent can combine semantic search with metadata filters:

"What's new in Bedrock this week?" → vector similarity for "Bedrock" + greaterThanOrEquals filter on published_date
"Show me security bulletins" → vector similarity + equals filter on source_type: "security-bulletin"
"Lambda announcements from the last month" → vector similarity + filters on both service and published_date

Without the metadata file, the agent would get the most semantically similar results regardless of date or service — so a question about "this week" might return announcements from 3 months ago that happen to be textually similar. The metadata filters let the agent constrain results to the correct time window or service before ranking by relevance.

The naming convention (.metadata.json) is a Bedrock KB convention — it automatically associates the sidecar with its parent document during ingestion. No code links them; the filename pattern is enough.

Bedrock Knowledge Base metadata supports four types: STRING, NUMBER, BOOLEAN and STRING_LIST. There is no native data type. The comparison operators (greaterThan, greaterThanOrEquals, lessThan, lessThanOrEquals) only work with NUMBER. Our original implementation stored published_date as a string ("2026-05-14"). When the agent tried to filter, we got back the following exception:

ValidationException: The filter value type provided isn't supported
for the given operation: GREATER_THAN_OR_EQUALS

The fix was to store dates as YYYYMMDD numbers (so using "20260514" instead of "2026-05-14"). We also inject today's date into the system prompt at runtime so the LLM can easily calculate relative dates.

Note that Amazon S3 Vectors has a strict 2 KB limit on filterable metadata per vector. We found the Bedrock Knowledge Base internal metadata keys (AMAZON_BEDROCK_TEXT and AMAZON_BEDROCK_METADATA) were set as filterable by default, which caused frequent ValidationException errors. The fix was mark both of these keys as non-filterable when creating the vector index:

vector_index = s3vectors.CfnIndex(
    self, "AnnouncementVectorIndex",
    index_name="announcements",
    vector_bucket_name=vector_bucket.vector_bucket_name,
    dimension=1024,  # Titan Embed Text v2
    distance_metric="cosine",
    data_type="float32",
    metadata_configuration=s3vectors.CfnIndex.MetadataConfigurationProperty(
        non_filterable_metadata_keys=[
            "AMAZON_BEDROCK_TEXT",
            "AMAZON_BEDROCK_METADATA",
        ],
    ),
)

This meant the only filterable metadata is contained in the .metadata.json fields, which are the only fields we filter on.

The next post covers how we used an agentic framework (Strands Agents SDK) in combination with AgentCore to really start bringing the briefing agent to life.

Building a Full-Stack AI Agent on Amazon Bedrock AgentCore

Matt Lewis — Wed, 20 May 2026 21:14:23 +0000

This is the first in a series of posts documenting the architecture, implementation, and lessons learned from building the AWS Briefing Agent - a personalised AWS assistant deployed on Amazon Bedrock AgentCore Runtime.

Part 1: Building a Full-Stack AI Agent on Bedrock AgentCore
Part 2: Data Ingestion: RSS Feeds, Knowledge Base, S3 Vectors, and Metadata Filtering
Part 3: Strands Agents + AgentCore Runtime - a perfect match
Part 4: Adding Memory to the Agent
Part 5: Experimenting with API Gateway
Part 6: Observability and Evaluations
Part 7: Third Party Integrations - Identity, Gateway and Slack Notifications

Why build an agent?

The last few years have seen a rapid shift from Generative to Agentic AI. Most of us will remember our first experience with ChatGPT where we entered a prompt and got a response back. This was impressive at the time, but was reliant on a user typing a prompt and reacting to the response. We then saw the emergence of early AI agents that could break down tasks into smaller steps and execute them independently. Over the past year, this has evolved into fully autonomous multi-agent systems capable of completing complex tasks with minimal or even no human supervision.

This shift is accelerating quickly. Gartner predicts that by 2028, more than a third of all enterprise software apps will include Agentic AI, and at least 15% of day-to-day work decisions will be made autonomously by AI agents. For organisations, the question is no longer whether agents will become part of enterprise systems, but how to build them securely, reliably and operate them at scale. From an AWS perspective, Amazon Bedrock AgentCore provides a way to help enterprises achieve this goal.

I decided to build an agent utilising AgentCore and its supporting capabilities and which served a purpose ... helping me keep up to date with all the latest announcements from AWS. This agent brings together Memory, Observability, Gateway, Identity, Evaluations and Registry alongside AgentCore Runtime. It allows the agent to personalise briefings just for me from 13 different RSS feeds including What's New, Blog Posts and Security Bulletins. I can get a daily update, as well as automatically post any briefings I'm really interested in to a Slack channel. And I learnt a lot in the process. This blog series covers my experience in building out this agent.

Why AgentCore Runtime?

Amazon Bedrock AgentCore is an AWS service that has been designed specifically for the task of hosting agents. A common saying I keep on hearing is that Bedrock AgentCore is to agentic applications what AWS Lambda is to event driven applications.

At the heart is AgentCore Runtime, which provides the secure runtime for executing the agent code. AgentCore Runtime provides session-based isolation, where every session is assigned a dedicated Firecracker microVM with isolated CPU, memory and filesystem resources (the same lightweight virtualisation technology that underpins AWS Lambda and AWS Fargate). When the session finishes, the LLM's state information is copied to long-term memory and the entire microVM is destroyed. There is no shared state between sessions, which prevents any cross-session data leakage.

AgentCore Runtime is framework-agnostic and supports all popular frameworks such as Strands Agents, LangGraph and CrewAI. It also works with any LLM, such as models offered by Amazon Bedrock, Anthropic Claude, Google Gemini and OpenAI or even hosted on-premises. It supports long sessions up to 8 hours, which means it can handle complex multi-step tasks or time-consuming background processes. Unlike traditional compute services that charge for pre-allocated resources, AgentCore Runtime uses consumption-based pricing where you only pay for active CPU and memory usage. With this, I/O wait and idle time is free, and you're only charged for actual resource consumption calculated at per-second increments. The runtime automatically scales from zero to thousands of concurrent sessions on demand, with no capacity planning needed, and includes reliability features like checkpointing to recover gracefully from interruptions.

AWS Briefing Agent Architecture

A high-level architecture overview of the AWS Briefing Agent is shown below:

AWS Briefing Agent Client is a next.js static site hosted on AWS Amplify Hosting. It integrates directly with Amazon Cognito using the amazon-cognito-identity-js SDK, implementing a full sign-in, sign-up and email verification flow.
AWS Briefing Agent itself is a Python application built with the Strands Agents SDK and deployed to AgentCore Runtime as a Docker container. The @aws/agentcore CLI handles the full deployment lifecycle. When you run agentcore deploy, the CLI triggers AWS CodeBuild to build the Docker image (ARM64), pushes it to Amazon ECR, and deploys it to AgentCore Runtime.
AgentCore Memory provides persistent user knowledge across sessions using two built-in memory strategies. The SEMANTIC memory strategy extracts factual information and knowledge from conversations that have taken place e.g. that a user works with Lambda and EKS. The USER_PREFERENCE memory strategy identifies and extracts user preferences from conversations e.g. that the user prefers technical deep dives. The agent retrieves relevant memory records at the start of each invocation and injects them as context, enabling personalised briefings from the first message of a new session.
AgentCore Observability is used to instrument all Bedrock API calls, tool invocations and memory operations. This is carried out entirely by setting enableOtel: true in the runtime config and using the opentelemetry-instrument wrapper command. Spans show up in CloudWatch Transaction Search and the CloudWatch GenAI Observability dashboard is populated with the sessions and traces, and provides the ability to drill into individual invocations.
AgentCore Evaluations is configured to run online quality assessments against agent responses using built-in evaluators for Helpfulness, Goal Success Rate, and Correctness. These are shown in the front-end to give an indication on how well the agent is performing for each user.
Bedrock Knowledge Base is created and backed by Amazon S3 Vectors that stores all announcements, blog posts and security bulletins. An ingestion Lambda runs every 6 hours that writes each item as a .txt file alongside a metadata.json file to the S3 bucket, before triggering a Knowledge Base sync. The agent queries the KB via the Strands retrieve tool with metadata filters for date ranges and service names, enabling questions like "what's new in Bedrock this week?"
AgentCore Gateway exposes a managed MCP (Model Context Protocol) endpoint that the agent connects to at runtime for tool discovery. The Slack integration is defined as an OpenAPI spec pointing at the Slack chat.postMessage API, and is registered as a Gateway target. The agent discovers available tools dynamically via the MCP protocol. The Gateway handles authentication and credential injection for this integration with Slack, attaching the stored bot token as a Bearer header on outbound Slack API calls.
AgentCore Identity stores the Slack bot token as an API key credential in its token vault (encrypted at rest via Secrets Manager). When the agent calls the tool to send a briefing to Slack, AgentCore Identity retrieves the bot token and injects it into the outbound request automatically. The agent code never sees or handles the token directly.
AgentCore Registry is a governed catalog for agents, MCP servers, tools, skills, and custom resources. Teams can publish resources, control access through approval workflows, and enable both humans and AI agents to discover tools using semantic and keyword search. Once the Slack integration was working, the briefing agent and the Slack tool where registered in the AgentCore Registry. This makes the tool discoverable by other agents in the organisation.

AWS Briefing Agent in Action

We create a new user and login to the home screen for the AWS Briefing Agent front end. The first time we use the agent, we are asked to provide information about our interests and the type of briefing style we are interested in. These get added to memory, so that the agent can personalise its responses:

We can provide the details of the services we are most interested to the agent. At this point, the agent will pull back the top announcements that it has retrieved from the Knowledge Base, and display them in a briefing summary.

We have also integrated with Slack through Gateway. This means we can ask the Briefing Agent to post the details to our Slack channel:

This means that when we go to our Slack channel, we can see a new message with our briefing, alongside all the links we can click to take us to the original blog posts and announcement articles.

In the next post we cover design decisions made to ingest the data into a Bedrock Knowledge Base to support the agent

Building AI Agents with Spring AI and Amazon Bedrock AgentCore - Part 4 Provide MCP tools for Conference application via AgentCore Gateway

Vadym Kazulkin — Mon, 18 May 2026 14:09:12 +0000

Introduction

In part 2, we explained how to deploy and run our conference search application on the Amazon Bedrock AgentCore Runtime as the MCP server. In this article, we'll develop the (MCP-) client, capable of talking to our application running on AgentCore Runtime. Later, in part 3, we developed the (MCP-) client, capable of talking to our application running on AgentCore Runtime. In this article, we'll look at another alternative to AgentCore Runtime to host MCP servers on AgentCore - AgentCore Gateway.

Provide the MCP Tools for the Conference application via AgentCore Gateway

Let's imagine a hypothetical situation: we not only want to search for the conferences, but also create, search, and apply for the talks for them. With this, our conference application now supports not only the attendee role but also the speakers. This is the reason why I added functionality to support conference search by the open call for papers criteria, see part 2. This is required for the conference speakers to determine whether it's still possible to apply for the conference with their talks.

When searching for conferences, we didn't have a public API, which is why we created MCP. On the other hand, for creating, searching, and applying the talks for the conferences, we indeed have a public API. Let's assume this API is hosted on the Amazon API Gateway. But it could also be any external application that exposes an OpenAPI specification. How to implement such a use case? Of course, we can use Amazon Bedrock AgentCore Gateway to securely connect our API to the AgentCore Gateway. The AgentCore Gateway can expose API functionality as MCP tools. But with this, we'll need to authenticate and hold the connection to multiple sources: AgentCore Runtime and Gateway. Without a centralized approach, customers face significant challenges: discovering and sharing tools across organizations becomes fragmented, managing authentication across multiple MCP servers grows increasingly complex, and maintaining separate gateway instances for each server quickly becomes unmanageable.

The centralized approach, which exposes all the tools from the central (MCP server) endpoint, would be a much better solution for our use case. Luckily, AgentCore Gateway helps to solve these challenges by treating existing MCP servers as native targets. This gives us a single point of control for routing, authentication, and tool management. It makes it as simple to integrate MCP servers as to add other targets to the gateway. AgentCore made it possible by supporting multiple targets. Those are, as of now: OpenAPI, Smithy, Amazon API Gateway, AWS Lambda, MCP Servers, and Integrations:

Conference Talks and Applications Demo

For creating, searching, and applying the talks for the conferences, I implemented a small conference-talks-and-applications-demo:

I currently don't use any database to store the talks and conference applications for simplicity reasons. My goal is only to demonstrate the approach.

I maintain a static list of the talks in the GetConferenceTalksByTitleSubstring class. The search consists of looking for the provided substring of the title.
When creating a new talk, I generate its random ID between 1 and 100 in CreateConferenceTalk class and return the talk with ID, title, and description.
When applying for a talk for a specific conference, I simply acknowledge that the application is created in CreateConferenceTalk class.

I prefer to use AWS SAM as IaC for pure Serverless applications. Unfortunately, AWS SAM doesn't provide any IaC for Amazon Bedrock AgentCore yet. Also, SAM has some limitations, as it's, for example, not possible to create the response codes for each API. And those response codes are required by the OpenAPI specification to be present. That's why I created OpenAPI spec on my own for it. We can refer to this specification when defining the API like this:

MyApi:
  Type: AWS::Serverless::Api
  Properties:
    StageName: !Ref Stage
    DefinitionBody:
        Fn::Transform
           Name: AWS::Include
           Parameters:
             Location: ConferenceTalksAndApplicationsAppAPI-OpenAPISpec.yaml

We also secured our API with an API key, whose value is by definition passed as the HTTP header parameter "x-api-key". This will play a role when we configure the outbound authentication of the AgentCore Gateway API Gateway target:

MyApiKey: 
  Type: AWS::ApiGateway::ApiKey
  ....
  Properties: 
    Name: "ConferenceTalksAndApplicationsAppAPIKey"
    Description: "ConferenceTalksAndApplicationsApp API Key"
    Enabled: true
    GenerateDistinctId: false
    Value: a6ZbcDgjkQW10BN56ASR25
    ...

We also defined an API stage with the name prod. Now, we can deploy this application by executing sam deploy -g, and we will see the individual URL in the response. For example, https://k370s19lk3.execute-api.us-east-1.amazonaws.com/prod. We'll need the REST API ID, which is in our case k370s19lk3, later when creating the IaC for the AgentCore Gateway.

Create AgentCore Gateway with different targets

In part 2, we started to create the IaC for the Conference (Search) application. It consisted mainly of the AgentCore Runtime with the MCP protocol and everything needed for that, like the Cognito User (Client) Pool. We used CDK for Java for it. We'll now call this application the Conference application, as we are extending its functionality beyond the search. Our goal is now to create AgentCore Gateway with 2 targets:

existing AgentCore Runtime with MCP protocol for the conference search (MCP) tools
conference talks and applications demo deployed on Amazon Gateway API to expose all its APIs as (MCP) tools.

You can find the full source code in the GatewayTargetStack class.

Let's go step-by-step through it. We first create the AgentCore Gateway itself:

var gateway= Gateway.Builder.create(this, "Gateway-123")
      .gatewayName(appName.replace("_", "-")+ "-gateway")
      .authorizerConfiguration(CustomJwtAuthorizer.Builder
        .create().allowedClients(
          List.of(UserClientPoolStack.
               userPoolClient.getUserPoolClientId()))
      .discoveryUrl(UserClientPoolStack.COGNITO_DISCOVERY_URL).build())
            .role(RuntimeWithMCPStack.role)
            .description("AgenCore Runtime with MCP protocol for running conference search app")
           .build();

The most interesting part is configuring the custom JWT authorizer as an inbound authentication. Here we reuse the Cognito User (Client) Pool created in part 2. We set the same user client pool ID and discovery URL. We also reuse the same AWS IAM role that we used to create AgentCore Runtime in part 2. Please also read the Getting started with Policy in AgentCore in addition to the resources from part 2 on how to create one.

Now, let's create the AgentCore Gateway target of our MCP Server running on AgentCore Runtime:

GatewayTarget.Builder.create(this, "MCP-Target-123")           
     .targetConfiguration(McpServerTargetConfiguration.create(endpoint))         
     .credentialProviderConfigurations(oauthCredentialProviderConfigs)
         .gatewayTargetName("mcp-target")
         .description("AgentCore Runtime MCP Server Target ")
         .gateway(gateway)
         .build();

We set McpServerTargetConfiguration, which defines that the Gateway target is the MCP Server running on AgentCore Runtime. Also, we set the target name and description, and provide the AgentCore Gateway to which this target belongs. We need to set the endpoint URL, which always follows the same schema:

var endpoint="https://bedrock-agentcore."+region + 
        ".amazonaws.com/runtimes/" +
        RuntimeWithMCPStack.runtime.getAgentRuntimeId()+
       "/invocations? 
       qualifier=DEFAULT&accountId="+Stack.of(this).getAccount();

We obtain the runtime ID property from the created AgentCore Runtime in the RuntimeWithMCPStack stack. The next part is to configure the outbound authentication. This means to configure how the Agentcore Gateway MCP target authenticates with the AgentCore Runtime with the MCP protocol. For this, we need to use AgentCore Identity.
As described in the following issue, it's currently not possible to create the AgentCore Identity with CloudFormation. That's why CDK also can't provide this functionality. That's why we need to create it manually and then provide the configuration for this stack. Let's secure it with the existing OAuth Client. Let's go to AgentCore Identity and click on "Add Outbound Auth" -> "Add OAuth Client". Then select "Custom Provider" -> "Discovery URL" :

We can reuse the Cognito User Pool Client ID, Client Secret, and Discovery URL from part 2.

After we created the AgentCore Identity, let's grab its ARN:

The Client Secret will be automatically stored as a Secret in the AWS Secrets Manager. Let's also grab Secret ARN:

Now, let's configure both in the cdk.json :

{
  "app": "mvn -e -q compile exec:java",
  "context": {
      "agentcoreIdentityOutboundOAuthArn": "arn:aws:bedrock-agentcore:us-east-1:{AWS_ACCOUNT_ID}:token-vault/default/oauth2credentialprovider/resource-provider-oauth-gateway",
      "oAuthSecretArn": "arn:aws:secretsmanager:us-east-1:{AWS_ACCOUNT_ID}:secret:bedrock-agentcore-identity!default/oauth2/resource-provider-oauth-gateway-ba3b089d-toYfaV",
  ....
   }
}

Please replace both values with your individual ARNs. I explained in part 2 how we handle the AWS Account ID. Now, let's create and configure the credential provider:

// CloudFormation, see the issue https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/2391
  var oAuthProviderArn = ConventionalDefaults.getContextVariableValueWithReplacedAccountId(this, "agentcoreIdentityOutboundOAuthArn");
  var oAuthSecretArn = ConventionalDefaults.getContextVariableValueWithReplacedAccountId(this, "oAuthSecretArn");

  var oauthCredentialProviderConfigs = List.of(GatewayCredentialProvider
      .fromOauthIdentityArn(OAuthConfiguration.builder()
          .providerArn(oAuthProviderArn)
          .secretArn(oAuthSecretArn)
          .scopes(List.of())
          .build()));

  GatewayTarget.Builder.create(this, "MCP-Target-123")           
  .credentialProviderConfigurations(oauthCredentialProviderConfigs)
  ...
 .build();

We grab the AgentCore Identity and Secret ARNs and use them to create an OAuth Credential Provider. We then set it when creating the AgentCore Target credential provider configuration.

Now we are done with creating the AgentCore MCP Target. The next step is to create an Amazon API Gateway target. Please also read the article AgentCore Gateway Amazon API Gateway stages to gain an understanding of how AgentCore Gateway obtains the OpenAPI spec from the Amazon Gateway stage.

First of all, let's define the API stage name in the cdk.json:

{
  "app": "mvn -e -q compile exec:java",
  "context": {
      "restApiStageName": "prod"
   }
}

We'll pass the restApiId via the console parameter. We created it above when we deployed the conference talks and applications demo. Similar to AWS Account ID, which is public, we don't want to configure it in cdk.json:

  var restApiId=(String)this.getNode().tryGetContext("restApiId");

  var restApiStageName=ConventionalDefaults.getContextVariableValue(this, "restApiStageName");

  GatewayTarget.Builder.create(this, "APIGATEWAY-Target-123")         
    .targetConfiguration((ApiGatewayTargetConfiguration.Builder.create()
    .restApi(RestApi.fromRestApiId(this, "APIGATEWAY-ID", restApiId))
         .stage(restApiStageName).build()))
            ...
           .gatewayTargetName("apigateway-target")
           .description("Amazon ApiGateway Target ")
           .gateway(gateway)
           .build();

Here, we create the AgentCore Gateway Target as an Amazon API Gateway Target, set the target name and description. We also provide the REST API ID, stage, and AgentCore Gateway to which this target belongs.

We can define the tool filters. With that, we can shrink what Amazon API Gateway APIs will be exposed as MCP tools:

GatewayTarget.Builder.create(this, "APIGATEWAY-Target-123")           
  .targetConfiguration((ApiGatewayTargetConfiguration.Builder.create()
     .apiGatewayToolConfiguration(ApiGatewayToolConfiguration.builder()
       .toolFilters(List.of(
        ApiGatewayToolFilter.builder()
            .filterPath("/talks/{titleSubstring}")                                
                .methods(List.of(ApiGatewayHttpMethod.GET))                 
                .build(),
        ApiGatewayToolFilter.builder()
                .filterPath("/apply")
                .methods(List.of(ApiGatewayHttpMethod.POST))
                .build(),
        ApiGatewayToolFilter.builder()
                .filterPath("/talks")                    
                .methods(List.of(ApiGatewayHttpMethod.POST))
                .build()))

In our example, we expose all 3 APIs (/apply, /talks, //talks/{titleSubstring}) as MCP tools.

Next, let's use the tool override to give the MCP tools the proper names and descriptions:

GatewayTarget.Builder.create(this, "APIGATEWAY-Target-123")           
  .targetConfiguration((ApiGatewayTargetConfiguration.Builder.create()
     .apiGatewayToolConfiguration(ApiGatewayToolConfiguration.builder()
      .toolOverrides(List.of(
          ApiGatewayToolOverride.builder()
            .method(ApiGatewayHttpMethod.POST)
            .name("apply-to-conferences-w-conference-id-talk-id")
            .path("/apply")
            .description("apply to the conference with conference Id and talk Id")
            .build(), 
         ApiGatewayToolOverride.builder()
           .method(ApiGatewayHttpMethod.POST)
           .name("create-new-talk")
           .path("/talks")
           .description("create a new talk with talk Id, title and description")
           .build(),                    
         ApiGatewayToolOverride.builder()
           .method(ApiGatewayHttpMethod.GET)
           .name("get-talks-by-title-substring")
           .path("/talks/{titleSubstring}")
           .description("get talks by their title substring")
           .build()))

With that, LLM can easily find the right tool for the job.

The last part is to define how AgentCore Gateway handles the outbound authentication to the Amazon API Gateway. As described above and in the following issue, it's currently not possible to create the AgentCore Identity with CloudFormation. That's why CDK also can't provide this functionality. That's why we need to create it manually and then provide the configuration for this stack. Let's secure this Target with the API Key, as it is how we secured our Amazon Gateway API. Let's go to AgentCore Identity and click on "Add Outbound Auth" -> "Add API Key" :

Please put the same API Key that we used to secure our API. We defined it in the SAM template.

After we created the AgentCore Identity, let's grab its ARN:

The Client Secret will be automatically stored as a Secret in the AWS Secrets Manager. Let's also grab Secret ARN:

Now, let's configure both in the cdk.json:

{
  "app": "mvn -e -q compile exec:java",
  "context": {
      "agentcoreIdentityOutboundApiKeyArn": "arn:aws:bedrock-agentcore:us-east-1:{AWS_ACCOUNT_ID}:token-vault/default/apikeycredentialprovider/resource-provider-api-key-gateway",
      "apiKeySecretArn": "arn:aws:secretsmanager:us-east-1:{AWS_ACCOUNT_ID}:secret:bedrock-agentcore-identity!default/apikey/resource-provider-api-key-gateway-02d581b0-L9scmD",  ....
   }
}

Please replace both values with your individual ARNs. I explained in part 2 how we handle the AWS Account ID. Now, let's create and configure the credential provider:

var apiKeyProviderArn = ConventionalDefaults.getContextVariableValueWithReplacedAccountId(this, "agentcoreIdentityOutboundApiKeyArn");

var apiKeySecretArn = ConventionalDefaults.getContextVariableValueWithReplacedAccountId(this, "apiKeySecretArn");

var apiKeyProviderConfigs = List.of(GatewayCredentialProvider          
     .fromApiKeyIdentityArn(ApiKeyCredentialProviderProps.builder()
               .providerArn(apiKeyProviderArn)
               .secretArn(apiKeySecretArn)
               .credentialLocation(ApiKeyCredentialLocation                    
                   .header(ApiKeyAdditionalConfiguration.builder()
                     .credentialParameterName("x-api-key")
                     .credentialPrefix(" ")
                     .build()))
               .build()));

GatewayTarget.Builder.create(this, "APIGATEWAY-Target-123")           
  .targetConfiguration((ApiGatewayTargetConfiguration.Builder.create()
  ...
  .credentialProviderConfigurations(apiKeyProviderConfigs)
  ...
  build();

We grab the AgentCore Identity and Secret ARNs and use them to create an API Key Credential Provider. Then we define to set the credentials within the HTTP header with the name x-api-key. This is how we secured the Amazon API Gateway. Another option that AgentCore Gateway supports is to set them as query parameters. We then set them when creating the AgentCore Target credential provider configuration.

To deploy the AgentCore Gateway, please invoke cdk deploy spring-ai-conference-search-agentcore-gateway-with-mcp-server-target-stack -c awsAccountId={YOUR_AWS_ACCOUNT_ID} -c restApiId={YOUR_API_ID}:

After having successfully executed the AgentCore Gateway deployment, we'll see our Gateway in the console:

We need to grab the Gateway URL, which ends with /mcp. We also see both Gateway targets we created:

This AgentCore exposes 7 MCP tools in total:

4 tools for the conference search provided by the MCP server from part 2 and deployed on AgentCore Runtime.
3 tools to create a talk, search for existing talks, and apply for the conference with the talk. This 3 tools are provided through the Amazon API Gateway we deployed in this article.

Now, let's extend our Conference Application MCP client that we developed in part 3, so it can use this AgentCore Gateway MCP endpoint.

The important remaining topic is designing the IAM role and permissions so that AgentCore Gateway can handle inbound and outbound authentication and communicate with the Amazon API Gateway. I'll refer you to the articles, which cover those topics:

Extend our local Conference Application MCP client

In part 3, we developed a generic local MCP client capable of talking to each MCP server. I decided to extend it to be able to configure the AgentCore Gateway endpoint. This gives us the following options:

by configuring the amazon.bedrock.agentcore.runtime.id property in the application.properties to be not a blank string, we'll still connect to the MCP server running on AgentCore Runtime. It exposes only 4 MCP tools for the conference search.
by configuring the amazon.bedrock.agentcore.gateway.url property in the application.properties to be not a blank string, we'll connect to the AgentCore Gateway created previously, which exposes all 7 MCP tools. This is how we'll use it to show what is possible with that. Please make sure that amazon.bedrock.agentcore.runtime.id= is set to an empty string.
by configuring both properties, amazon.bedrock.agentcore.runtime.id take precedence. This is how I implemented the logic in the SpringAIAgentController class:

private String getMCPServerEndpoint() {
   if(!AGENTCORE_RUNTIME_ID.isBlank()) {
    return "https://bedrock-agentcore." + awsRegion + ".amazonaws.com/runtimes/"
       + AGENTCORE_RUNTIME_ID + "/invocations?qualifier=DEFAULT&accountId=" + this.getAccountId();
    } else if (!AGENTCORE_GATEWAY_URL.isBlank()) {
       return AGENTCORE_GATEWAY_URL;
    } else throw new RuntimeException(" no AgentCore Runtime Id or AgentCore Gateway URL defined");
}

You can change this logic if you wish.

Now we can use CURL or HTTPie to send some prompts. For example:

"Please provide me with the list of conferences, including their IDs, with Java topics happening in 2027, with the call for papers open today. Also, provide me with the list of my talks with this topic in the title. Finally, for each conference and talk retrieved, apply individually for the conference".

Here is an example of the request with HTTPie:

http GET http://localhost:8080/conference?prompt="Please provide me with the list of conferences, including their IDs, with Java topics happening in 2027, with the call for papers open today. Also, provide me with the list of my talks with this topic in the title. Finally, for each conference and talk retrieved, apply individually for the conference." Content-Type:text/plain.

Here is the correct LLM response:

Let's try another prompt:

http GET http://localhost:8080/conference?prompt="Please create a talk with a cool title (max 60 characters long) and description (max 300 characters long) about using Spring AI on the Amazon Bedrock AgentCore service. Then provide me with the list of conferences, including their IDs, with Java topics happening in 2026 and 2027, with the call for papers open today. Finally, for each conference, apply individually for it with the talk just created." Content-Type:text/plain.

Here is the correct LLM response again:

Cool, we created AgentCore Gateway, which gives us centralized access to the MCP tools that we need or the agent needs to accomplish the goal.

Conclusion

In this article, we looked at how to provide the MCP Tools for the Conference application via AgentCore Gateway in a centralized way.

As we saw in this and previous articles, the local MCP client for the Conference application, to talk to AgentCore Runtime or Gateway, became quite big. If we have many customers using such a client, changing and operating it can become quite challenging. That's why, in the next article, we look at how to deploy and run our MCP client on AgentCore Runtime.

If you like my content, please follow me on GitHub and give my repositories a star!

Please also check out my website for more technical content and upcoming public speaking activities.

One model is a guess. Three that agree is a plan.

Anton Babenko — Mon, 18 May 2026 08:04:46 +0000

Why I shipped multi-model consensus as a plugin, plus two quieter tools that keep agents honest.

Updated (28.5.2026): /consensus is now a 3-stage loop. Stage 2 adds a blind cross-review where each external rates the others' answers, anonymized, before Claude adjudicates. Pattern from karpathy/llm-council.

Updated (26.5.2026): Grok (xAI) joins GPT and Gemini as the third external provider, Gemini 3 is the default via Google's Antigravity CLI, two new experts (Researcher and Debugger) bring the count to seven, reviews are severity-graded, and /consensus now forces Claude to commit a blind verdict in the transcript before dispatching - arbiter-mediated, not pure democracy.

Ask one model to plan something hard - a migration, a refactor, a cutover - and you get a fluent, confident answer. Fluency is not correctness. A single model is articulate and alone, and being alone is the problem: nothing in the loop disagrees with it, so it rationalizes its first guess into a plan.

The expensive failures with coding agents are almost never syntax. They are plans that read well and were wrong: the wrong abstraction, the missed blast radius, the migration step that bricks state. You find out three hours into execution, not at review time.

I have been running the fix for the last few months - on Terraform modules, and on the everyday work of running compliance.tf, not only on code. It is not a bigger model. It is making models disagree on purpose and then forcing them to resolve it. That is what consensus does, and it is the reason the agent-plugins repo exists. Two other tools ship with it; they are narrower, and I will get to them.

One model is a guess

A single model samples one distribution with no adversary in the room. Two independent models rarely make the same mistake on a plan. Where they diverge is, almost exactly, the risky part of the plan - the assumption nobody checked. A consensus loop turns that disagreement from noise into a signal you can act on.

What `/consensus` actually does

/consensus runs GPT (via Codex), Gemini 3 (via Antigravity), and Grok (via the xAI API) against the same artifact, with Claude as the arbiter. Each round has three stages.

Stage 1 - parallel verdicts. Claude posts its own verdict (APPROVE / REQUEST CHANGES / REJECT) into the transcript first, blind, before any external sees the work. The pre-commitment sits there in writing, so Claude's judgment cannot drift later to match what the others say. Then GPT, Gemini, and Grok review the artifact in parallel, each in a fresh thread, single-shot. None sees another's review.

Stage 2 - blind cross-review. Each external then rates the OTHER externals' answers, identity stripped best-effort. Votes of "not viable" become candidate critical issues the arbiter has to weigh. This catches the case where Stage 1 looks like agreement but is really three reviewers each rationalizing past the same hole. Pattern adapted from karpathy/llm-council. Stage 2 fires every round 1, and after that only when Stage 1 disagreed or the previous Stage 2 surfaced an accepted issue.

Stage 3 - arbiter adjudication. Claude reconciles the Stage 1 verdicts, the Stage 2 candidate issues, and its own blind verdict. Every objection is accepted, dismissed with a recorded reason, or deferred. Claude revises the artifact and the loop runs again, up to five rounds. It converges only when every responding external approves and Claude's pre-committed verdict agrees, with reasons on both sides where it walked back. If the group cannot agree, it says so plainly instead of faking it.

Click for the detailed diagram with bias guards and per-model flow.

Independence is not only about which model. It is sharper when each reviewer wears a different hat. claude-delegator ships seven expert profiles - Architect, Plan Reviewer, Scope Analyst, Code Reviewer, Security Analyst, Researcher, and Debugger.

Combine the axes. A Security Analyst on Gemini and an Architect on GPT fight about different things than one model reviewing twice. Different profiles catch different categories of mistake.

For a migration plan I run the Plan Reviewer broadly and add a Security Analyst pass on top. "Is this safe to run" and "is this the right shape" get argued by separate reviewers, not averaged into one bland verdict.

There is a second kind of contamination worth naming: me. In consensus, every round sends the reviewers the same artifact text, cold, in a fresh thread. They never see my triage, my running verdict, or how I framed the previous round - only the artifact and bounded round metadata. My judgment is applied after they report, not baked into what they receive.

The ask-* commands are the opposite by design: the expert gets exactly the prompt I assembled from the conversation - what I chose to hand it. Fast for a second opinion, but the input is mine, not independent. consensus keeps the input independent and pays for it in extra rounds.

It does not have to be a plan. The loop runs on anything you can put in text - a design, a runbook, a decision memo, a spec. Plans are simply where I reach for it most, and where it converges fastest. The looser and fuzzier the input, the more rounds it takes to agree, so non-plan runs tend to run longer - worth it when the answer matters, overkill for a quick lookup. For that, single-shot ask-gpt, ask-gemini, ask-grok, and ask-all (which fans out to all three in parallel) are right there. consensus is for when it has to be right.

Nothing about this is Terraform, or even code. That generality is why it is the headline.

The quieter two

Same release, narrower scope:

code-intelligence - a language-agnostic skill. Agents grab text grep when they should ask the language server, and silently swap tools when one is missing, then report "found all references." This encodes search precedence: the language server (LSP) for symbols, rg/ripgrep for exact text, an embedding/semantic grep (such as mgrep) for fuzzy discovery - and a hard rule to disclose any substitution on the first line of the reply. You learn the moment coverage drops.
terraform-skill - routes a Terraform/OpenTofu request to its real failure mode (identity churn, blast radius, state corruption) before emitting HCL. It is terraform-ls aware: it knows the language server has no rename provider, so it runs the safe manual reference workflow instead of a blind find-replace. It is approaching 2,000 GitHub stars - the part of this post I am quietly proud of.

Those two are discipline for the agent's hands. Consensus is discipline for its judgment, and judgment generalizes further.

A note on claude-delegator

I did not invent the delegation layer. claude-delegator is a fork of Jarrod Watts' original (MIT, upstream currently quiet), fully based on his design - I kept the structure and the license.

What I added is what months of daily use exposed: a Gemini bridge that wraps Google's Antigravity CLI (agy) with auto-gemini-3 as the routing default and recovers an answer the CLI flushed to disk after a soft timeout instead of failing the call; a fresh Grok bridge over the xAI API that is advisory-only but reads attached files via the xAI Files API (with TTL-based cleanup); two more experts (Researcher and Debugger) on top of the original five; severity-graded reviews so three parallel reports merge cleanly; and a hardened /consensus loop where Claude pre-commits a blind verdict before any external sees the artifact, with a Stage 2 blind cross-review on top (adapted from karpathy/llm-council). Plus the bundled ask-gpt / ask-gemini / ask-grok / ask-all / consensus commands, so the workflow ships with the plugin instead of living in my dotfiles. The seven expert prompts borrow from oh-my-openagent and PAL; both credited in the README.

Credit for the foundation is his; the bug-fixing scar tissue is mine.

Why a plugin, not a blog post

One honest caveat, because I have hit it myself: skills are model-triggered, which makes them soft. Packaging this as a plugin improves reuse and discoverability. It does not guarantee the agent obeys every time - hard enforcement (a real pre-tool gate) is a separate, still-open problem.

I keep finding and fixing bugs in all three. That constant repair is the only reason I trust them enough to write this - and I would rather say so than oversell the fix.

The takeaway

Stop accepting the first confident plan from one model. Make them argue, and only move when they stop. The release is at github.com/antonbabenko/agent-plugins; consensus ships in claude-delegator from the same marketplace. The cheapest review is the one that happens before you execute.

You don't 3D print a house. You print your tools.

Luca Bianchi — Fri, 15 May 2026 12:17:19 +0000

Vibe coding is to engineering what 3D printing is to making, and that's exactly why it matters.

There's a recurring debate in our industry about whether vibe coding will replace serious software engineering. Both camps are framing the question wrong. The right reference is the desktop 3D printer.

When 3D printing went mainstream a decade ago, the same two camps showed up. One predicted print-on-demand cars and houses; the other dismissed it as a toy. What happened was stranger. 3D printing didn't replace manufacturing. It collapsed the cost of bespoke tooling. A specific bracket for a specific shelf in a specific corner of your specific workshop used to be a project. Now it's a Sunday afternoon. Nobody prints a load-bearing wall. Everyone prints jigs, fixtures, replacement knobs, and tools tailored to the exact job at hand.

That's the mental model for vibe coding.

The analogy, precisely

Production systems engineering still works the way it did. Distributed transactions, security boundaries, multi-region failover, and regulatory-grade audit trails- none of that becomes "vibe-able." The constraints are the same: correctness, throughput, observability, blast radius. If anything, the bar has gone up because the cost of writing plausible-looking wrong code has dropped to zero.

But there's a category of software that used to sit in a no-man's-land: too specific to be worth packaging as a product, too tedious to write by hand for a single use, too critical to skip entirely. Internal scripts that should have safety rails. CLIs you run twice a year. Migration tools tied to the exact shape of your stack. These were the software brackets and jigs, necessary, valuable, and almost always either skipped or built poorly because nobody had the budget for them.

That's where vibe coding shines. It's a workshop tool that brings industrial-grade results to one-off problems. The cost of bespoke, well-built tooling has collapsed to the point where it makes economic sense to build it for a single use.

The instance: a Route 53 migration

Last week, I needed to move a Route 53-hosted zone from one AWS account to another. Standard enterprise hygiene, wrong account ownership, billing consolidation, the usual story. The problem itself is straightforward if you know Route 53: you can't transfer a hosted zone directly between accounts. You list the records in the source, create a new zone in the destination, replay the records into it, then cut over the registrar's NS delegation.

Each step has small traps. The apex NS records and the SOA record are auto-generated by AWS and will be rejected on import. Pagination on ListResourceRecordSets uses a three-field cursor: name, type, and set identifier, not a simple token. The ChangeResourceRecordSets API has a hard cap of 1000 changes per call, but it gives much better error messages if you batch smaller changes. Private zones require VPC re-association and are a separate problem. None of these is hard. They're just sharp edges that someone running this once is statistically guaranteed to hit.

Pre-2023, my options were three. Run it manually through the console, slow, fat-finger-prone, no audit trail. Write a one-off Bash script with AWS CLI calls, faster, but every safety check I want to add is another hour. Build a proper internal tool, justified for a team running this monthly, hard to justify for a one-time job.

The 3D-printer-for-tools answer is option four: build the proper tool anyway, because building it is no longer expensive.

Spec first, then vibe

This is the part people get wrong about vibe coding. Because the model writes fast, vague specs produce a lot of confidently wrong code very quickly. The discipline shifts from typing to specifying.

The dialogue that produced this tool started with the problem, and implementation followed. I described what I was trying to do: move a hosted zone safely between accounts, with the registrar transfer as a separate concern. The conversation forced a series of decisions before any code existed.

Scope: public hosted zones only. Private zones are moved to v2 because cross-account VPC association is a different problem with different failure modes, and conflating them in v1 dilutes the design.

Trust model: never mutate anything until the operator has confirmed which account they're talking to. STS GetCallerIdentity runs on both source and destination credentials at startup; the account IDs and caller ARNs are shown in plain text, and the operator confirms before the tool proceeds.

Credential surface: named AWS profiles and environment variables, nothing else. No baked-in keys, no custom config files. The credential chain is the SDK's; the tool just picks where to source from.

Reversibility: the tool stops short of the irreversible step. It replicates the zone, records it, then prints the new name servers and stops. Updating the registrar's NS delegation is a manual final step, deliberately, because that's the cutover moment, and a human should be the one who pulls that lever.

Failure modes: which records get skipped (apex NS, SOA), what batch size to use (100, not 1000, clearer error messages outweigh the marginal call count), how pagination is handled (full marker-based loops, not "first page is probably fine").

These decisions were made in prose, before any TypeScript existed. The model is excellent at translating that prose into code; it is much less reliable at making these decisions for you. Spec-driven vibe coding means the operator writes the spec, the model writes the code, and the operator reviews both for fidelity.

Route53 Migration Tool

The result is a CLI called route53-aws-to-aws-transfer, written in TypeScript with the AWS SDK v3. The structure mirrors the spec. A credentials module resolves either a named profile or environment variables and runs an STS identity check, returning the validated account ID and caller ARN so the CLI layer can show them to the operator for confirmation. Two independent credential resolutions happen, one for the source account, one for the destination, because conflating them is the most likely operator error and the easiest to prevent at the boundary.

A Route53 module wraps the SDK calls the migration actually needs: paginated zone listing filtered to public zones, paginated record-set listing with the three-field cursor that the API requires and the SDK doesn't abstract, zone creation with a unique caller reference, and the change-set builder that explicitly drops apex NS and SOA records before batching the rest into UPSERT calls.

An orchestration module sequences these against the operator's confirmed inputs and emits structured progress. The CLI layer uses @inquirer/prompts for the interactive flow, chalk for the highlighting that draws the eye to account IDs and name-server lists, and ora for the spinners that make long pagination loops feel like progress rather than a hang.

The whole tool is around 400 lines of TypeScript. It does one thing. It does it with the safety rails I'd expect from an internal platform team's tooling. It will probably run three times in its life, and that's fine, because the cost of building it correctly was lower than the cost of running the migration carelessly even once.

What this means if you're running engineering

The economic shift this represents is small in the aggregate but large in the aggregate. The list of things that were previously "not worth building properly" is enormous: data migration scripts, one-off ETL jobs, internal admin CLIs, environment-bootstrap tools, audit-report generators, ad-hoc dashboards, throwaway integrations between two SaaS products you happen to use. Every team has dozens. Most are currently either absent, and the work is being done by hand, or present in a form that's basically a liability: Bash, no tests, no logging, run from someone's laptop.

If you're a CTO or a tech lead, the practical question is what quality bar you hold for built-once tooling. My answer for my own team is the same as our production bar, minus the scale concerns. STS validation, structured error handling, idempotency where the underlying API allows it, and no silent failures. The model can hold that bar if you specify it. It absolutely won't hold if you don't.

This is also where vibe coding stops being a private hobby and starts being a team practice. The artifacts are small enough to review properly. The specs are short enough to write down. The economics work even for tooling a single engineer will use once, which means there's no longer an excuse for the un-toolable middle. That category just collapsed.

Where to go from here?

What desktop 3D printing did for the workshop, vibe coding is doing for software. Production engineering is still production engineering, with all the disciplines that it requires. What returns is a layer we lost when software industrialized: the ability to make exactly the tool you need, for exactly the job in front of you, at a quality level you would have respected even from a professional. Vibe-code the Route 53 migration tool you needed this morning. The alternative is doing the migration without it. The workshop is back. Print your tools.

How I Monitor AI Agents: CloudWatch for Infra, Arize Phoenix for Traces and OpenTelemetry, LLM-as-Judge for Quality

Carlos Cortez 🇵🇪 [AWS Hero] — Thu, 14 May 2026 00:56:26 +0000

How I Monitor My AI Agents: CloudWatch for Infra, Arize Phoenix for Traces, LLM-as-Judge for Quality

AI agents are not regular software. They reason, they call tools, they make decisions — and they can fail in ways that a simple health check will never catch. The response was technically successful, but was it actually helpful? The agent called the right tool, but did it interpret the result correctly? Traditional monitoring doesn't answer these questions.

That's why I built a three-layer observability stack for my AI agents, and today I'm walking you through exactly how it works.

📓 Full working notebook: All the code in this post is validated and executable in the companion Jupyter notebook — including setup, tracing, evals, and cleanup. here as well: https://github.com/breakingthecloud/observability-ai-agents-phoenix-otel-strands

The Problem with Monitoring AI Agents

Here's the thing: when your agent answers "I don't have weather data for Paris" — is that a failure? Technically no, the agent ran fine. But from a user perspective, it's a miss. Traditional monitoring would show 200 OK, low latency, zero errors. Everything looks green. But the user didn't get what they needed.

You need three layers of observability to actually understand what's happening:

User Query → Strands Agent → Tool Calls → Bedrock (Claude)
     ↓              ↓              ↓
  Phoenix      CloudWatch      Phoenix Evals
 (AI traces)  (infra metrics)  (quality scores)

Layer	Tool	What it answers
AI Traces	Arize Phoenix	What did the agent think? Which tools did it call? What was the full LLM input/output?
Infrastructure	Amazon CloudWatch	Is the system healthy? How fast? How much is it costing me?
Quality Evals	Phoenix + LLM-as-Judge	Was the response actually good? Helpful? Accurate?

The Stack

Strands Agents SDK — AWS's open-source framework for building agents
Amazon Bedrock — Claude Sonnet 4.6 as the foundation model
Arize Phoenix — Open-source AI observability, runs locally, zero accounts needed
Amazon CloudWatch — Metrics, alarms, dashboards
OpenTelemetry — The glue that connects everything

What's interesting is that Phoenix runs entirely on your machine — localhost:6006. No cloud accounts, no API keys for the observability layer. You get a full tracing UI for free.

Layer 1: AI Traces with Arize Phoenix

The first thing you need is visibility into what your agent is actually doing. Not just "it responded in 2 seconds" but the full reasoning chain: what the LLM received, what it decided, which tools it called, and what it returned.

Setting Up the Tracing Pipeline

Three steps: launch Phoenix, configure OpenTelemetry, instrument Bedrock.

import phoenix as px
from opentelemetry import trace as trace_api
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.bedrock import BedrockInstrumentor

# 1. Launch Phoenix locally
session = px.launch_app()  # UI at http://localhost:6006

# 2. Configure OTel to send traces to Phoenix
tracer_provider = trace_sdk.TracerProvider()
tracer_provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)
trace_api.set_tracer_provider(tracer_provider)

# 3. Auto-instrument all Bedrock API calls
BedrockInstrumentor().instrument(tracer_provider=tracer_provider)

That's it. Every Bedrock call your agent makes is now traced automatically. No decorators on your business logic, no manual span creation. OpenInference handles it.

Building the Agent

The agent itself is straightforward with Strands:

import boto3
from strands import Agent
from strands.models.bedrock import BedrockModel

session = boto3.Session(profile_name="cc", region_name="us-east-1")

def get_weather(city: str) -> str:
    """Get current weather for a city."""
    weather_data = {
        "Lima": "☀️ 22°C, clear skies",
        "New York": "🌧️ 15°C, rainy",
        "Tokyo": "⛅ 18°C, partly cloudy",
    }
    return weather_data.get(city, f"Weather data not available for {city}")

model = BedrockModel(model_id="us.anthropic.claude-sonnet-4-6", boto_session=session)

agent = Agent(
    model=model,
    tools=[get_weather],
    system_prompt="You are a helpful weather assistant. Use the get_weather tool."
)

Now when you run agent("What's the weather in Lima and Tokyo?"), Phoenix captures the entire trace tree: the agent span, the LLM calls, the tool invocations, the final response. All visible in the UI at localhost:6006.

Exploring Traces Programmatically

You don't have to use the UI. Phoenix exposes everything as DataFrames:

from phoenix.client import Client

traces_df = Client().spans.get_spans_dataframe()
traces_df["latency_ms"] = (traces_df["end_time"] - traces_df["start_time"]).dt.total_seconds() * 1000

print(f"Total spans captured: {len(traces_df)}")
traces_df[["name", "span_kind", "latency_ms", "status_code"]].head(10)

This gives you every span — agent, LLM, tool — with timing, status, and the full input/output attributes. Perfect for building custom analytics or feeding into your own dashboards.

Layer 2: Infrastructure Monitoring with CloudWatch

Phoenix tells you what the agent is thinking. CloudWatch tells you if the system is healthy. Different questions, both critical.

The AgentMonitor Class

I built a simple wrapper that publishes four metrics per agent invocation:

cloudwatch = session.client("cloudwatch", region_name="us-east-1")

class AgentMonitor:
    def __init__(self, namespace="AI/Agents"):
        self.namespace = namespace
        self.cw = cloudwatch

    def track(self, agent_name: str, latency_ms: float, tokens: int,
              success: bool, tool_calls: int = 0):
        metrics = [
            {"MetricName": "Latency", "Value": latency_ms, "Unit": "Milliseconds"},
            {"MetricName": "TokensUsed", "Value": tokens, "Unit": "Count"},
            {"MetricName": "Success", "Value": 1 if success else 0, "Unit": "Count"},
            {"MetricName": "ToolCalls", "Value": tool_calls, "Unit": "Count"},
        ]
        dims = [{"Name": "AgentName", "Value": agent_name}]
        for m in metrics:
            m["Dimensions"] = dims
        self.cw.put_metric_data(Namespace=self.namespace, MetricData=metrics)

Usage is clean — wrap your agent call:

monitor = AgentMonitor()
start = time.time()
try:
    result = agent("What's the weather in Lima?")
    latency = (time.time() - start) * 1000
    monitor.track("weather-agent", latency, tokens=150, success=True, tool_calls=1)
except Exception as e:
    latency = (time.time() - start) * 1000
    monitor.track("weather-agent", latency, tokens=0, success=False, tool_calls=0)

Smart Alarms

Two alarms that catch the most common issues:

# Alert when response time is consistently high
cloudwatch.put_metric_alarm(
    AlarmName="Agent-High-Latency",
    MetricName="Latency", Namespace="AI/Agents",
    Statistic="Average", Period=300, EvaluationPeriods=3,
    Threshold=10000.0,  # 10 seconds
    ComparisonOperator="GreaterThanThreshold",
    Dimensions=[{"Name": "AgentName", "Value": "weather-agent"}],
)

# Alert when error rate exceeds 5%
cloudwatch.put_metric_alarm(
    AlarmName="Agent-High-Error-Rate",
    MetricName="Success", Namespace="AI/Agents",
    Statistic="Average", Period=300, EvaluationPeriods=2,
    Threshold=0.95,
    ComparisonOperator="LessThanThreshold",
    Dimensions=[{"Name": "AgentName", "Value": "weather-agent"}],
)

The concept is straightforward: latency catches performance degradation, error rate catches reliability issues. These two alarms alone will catch 80% of production problems.

Layer 3: LLM-as-Judge Evals

This is the layer most people skip — and it's the most important one. Your agent can be fast, reliable, and still give terrible answers. You need automated quality evaluation.

The idea: use another LLM to judge the quality of your agent's responses. It's not perfect, but it's infinitely better than no evaluation at all.

Setting Up the Evaluator

Phoenix evals v3 uses a provider-based LLM wrapper. For Bedrock, it goes through litellm:

from phoenix.evals import LLM, create_evaluator, evaluate_dataframe
from phoenix.client import Client
import pandas as pd

# Get LLM spans from Phoenix
spans_df = Client().spans.get_spans_dataframe()
llm_spans = spans_df[spans_df["span_kind"] == "LLM"].copy()

# Build eval dataframe
eval_data = pd.DataFrame({
    "input": llm_spans["attributes.input.value"].fillna("").values,
    "output": llm_spans["attributes.output.value"].fillna("").values,
})
eval_data = eval_data[eval_data["output"].str.len() > 0].reset_index(drop=True)

# Create the judge
eval_model = LLM(provider="bedrock", model="us.anthropic.claude-sonnet-4-6")

@create_evaluator(name="helpfulness", source="llm")
def helpfulness(input: str, output: str) -> float:
    """Rate how helpful the agent response is on a scale of 0 to 1."""
    prompt = (
        f"Rate the helpfulness of this AI response on a scale of 0.0 to 1.0.\n"
        f"User asked: {input}\n"
        f"AI responded: {output}\n"
        f"Return ONLY a number between 0.0 and 1.0."
    )
    result = eval_model.generate_text(prompt=prompt)
    try:
        return float(result.strip())
    except ValueError:
        return 0.5

# Run evaluation
results = evaluate_dataframe(dataframe=eval_data, evaluators=[helpfulness])

The cool part here is the @create_evaluator decorator — it turns a simple function into a full evaluator that Phoenix understands. You can create as many as you need: helpfulness, accuracy, safety, tone, whatever matters for your use case.

Pushing Scores Back to Phoenix

The evaluation results are useful in a DataFrame, but they're even more useful when attached to the actual traces in Phoenix:

import json as _json

score_col = [c for c in results.columns if "_score" in c][0]
scores = results[score_col].apply(
    lambda x: _json.loads(x).get("value", 0) if isinstance(x, str) else 0
)

annotations = pd.DataFrame({
    "span_id": llm_spans["context.span_id"].values[:len(scores)],
    "score": scores.values,
    "label": scores.apply(lambda s: "good" if s >= 0.7 else "needs_review").values,
    "explanation": [f"Helpfulness score: {s:.2f}" for s in scores.values],
})

Client().spans.log_span_annotations_dataframe(
    dataframe=annotations,
    annotation_name="helpfulness",
    annotator_kind="LLM",
)

Now when you open Phoenix UI and click on any LLM span, you see the helpfulness score right there in the Annotations tab. Traces + quality scores in one place.

What This Costs

One question I always get: what does this cost to run?

Component	Where	Cost
Phoenix	Your machine (localhost)	$0
Bedrock (agent calls)	AWS, pay-per-request	~$0.003 per query
Bedrock (eval judge)	AWS, pay-per-request	~$0.003 per eval
CloudWatch alarms	AWS	~$0.20/month
CloudWatch custom metrics	AWS	~$0.30/month

For development and testing, you're looking at less than $1/month for the AWS side. Phoenix is completely free and local.

The Main Takeaway

Observability for AI agents requires thinking in three layers:

Traces (Phoenix) — What is the agent doing? What's the full reasoning chain?
Infra metrics (CloudWatch) — Is the system healthy? Fast? Within budget?
Quality evals (LLM-as-Judge) — Are the responses actually good?

Most teams only do layer 2. Some add layer 1. Almost nobody does layer 3 — and that's where the real insights are. A fast, reliable agent that gives bad answers is worse than a slow one that gives good answers, because you won't even know there's a problem.

My advice: start with Phoenix traces (it's free and local), add CloudWatch for the basics (latency, errors, tokens), and then build at least one LLM-as-Judge evaluator for whatever quality dimension matters most to your users. You can set this up in an afternoon and it will save you weeks of debugging blind.

Connect with me:

LinkedIn - Let's discuss AI observability and agent architectures
X/Twitter - Follow for AWS, GenAI, and agentic AI updates
GitHub - Check out the full notebook and more
Dev.to - More technical deep-dives
AWS Community - Join the conversation

I'm Carlos Cortez, this is Breaking the Cloud, and today we made our agents observable. See you in the next one!

Building AI Agents with Spring AI and Amazon Bedrock AgentCore - Part 3 Develop local MCP client for Conference application

Vadym Kazulkin — Mon, 11 May 2026 15:03:58 +0000

Introduction

Develop local MCP client for Conference application

You can find the source code of the MCP client in my spring-ai-1.1-conference-app-agent-local repository.

Let's go step-by-step through it.

First, in pom.xml, we include, among others, those dependencies:

spring-ai-bom - to include the general Spring AI functionality.
spring-boot-starter-web - as we develop the MCP client as a web application.
spring-ai-starter-model-bedrock-converse -as we use foundational models on Amazon Bedrock.
spring-ai-starter-mcp-client-webflux - to develop an asynchronous Spring AI MCP Client. We can use spring-ai-starter-mcp-client to develop a synchronous one.

SpringAIConferenceLocalMCPClient class is the main entry point to our application.

Second, in application.properties, we define some properties. Those are Spring AI-related:

spring.ai.bedrock.aws.region=us-east-1
spring.ai.bedrock.aws.timeout=10m
spring.ai.bedrock.converse.chat.options.max-tokens=100
spring.ai.bedrock.converse.chat.options.model=amazon.nova-lite-v1:0
spring.ai.mcp.client.type=ASYNC

We define the region where we host our application, and the timeout when talking to the Amazon Bedrock models. Then we also set the default Amazon Bedrock to use and a maximal number of tokens, and the MCP client type to ASYNC. We can also set SYNC instead, but we need to use another Spring AI MCP client dependency as described above.

We also include some application-related properties:

cognito.user.pool.name=UserPoolForAgentCoreMCP
cognito.user.pool.client.name=UserPoolClientWithUserAndPasswordForAgentCoreMCP
cognito.auth.token.resource.server.id=AgentCoreResourceServerId
amazon.bedrock.agentcore.runtime.id=spring_ai_conference_search_agentcore_runtime-6dnMIL9455

These are individual properties, whose values we need to set from the deployment of the Conference search MCP server. We described the configuration, creation process, and those properties of the MCP server in part 2.

Please ignore other properties like amazon.bedrock.agentcore.gateway.url as we will need them when we extend our application in the next articles.

The whole application logic is in the SpringAIAgentController class.

We inject the values of individual properties and build AWS service clients (STS and Cognito). This is how we create the ChatClient, which is the main interface of Spring AI to talk to the LLMs:

public SpringAIAgentController(ChatClient.Builder builder, ChatMemory chatMemory, @Value("${aws.region}") String awsRegion) {
   var options = ToolCallingChatOptions.builder()
    .model("us.anthropic.claude-sonnet-4-6")
    .maxTokens(2000)
    .build();

    this.chatClient = builder.defaultOptions(options).build();
}

We show here that we can optionally build ToolCallingChatOptions and override the default model name and the maximum number of tokens defined in application.properties. Then, we build the ChatClient, and can optionally set ToolCallingChatOptions.

Below is how the code for the method looks, which will receive the prompt from the user:

@GetMapping(value = "/conference", consumes = "text/plain")
public Flux<String> conferenceSearch(@RequestParam String prompt) {
  var token = getAuthTokenViaHttpClient();
  var client = McpClient.async(getMcpClientTransport(token)).build();
  client.initialize();
  var toolsResult = client.listTools(); 
  for (var tool : toolsResult.block().tools()) { 
    logger.info("tool found " + tool); 
  }

  var asyncMcpToolCallbackProvider = AsyncMcpToolCallbackProvider.builder()
     .mcpClients(client)
     .build();


  return this.chatClient.prompt().user(prompt)  
        .tools(new DateTimeTools())
        .toolCallbacks(asyncMcpToolCallbackProvider.getToolCallbacks())
        .stream()
        .content();
}

Let's break this code down and explain it. First, we need to obtain the JWT token:

var token = getAuthTokenViaHttpClient();

This, in turn, uses a bunch of Amazon Cognito services to achieve this goal:

private String getAuthTokenViaHttpClient() {
  var userPool = getUserPool();
  var userPoolClient = getUserPoolClient(userPool);
  var userPoolClientType = describeUserPoolClient(userPoolClient);
  var userPoolId = userPool.id();
  userPoolId = userPoolId.replace("_", "").toLowerCase();
  var url = "https://" + userPoolId + ".auth." + Region.US_EAST_1.id() + ".amazoncognito.com/oauth2/token";

  var SCOPE_STRING = RESOURCE_SERVER_ID + "/*";
  var entity = "grant_type=client_credentials&" + "client_id=" + userPoolClientType.clientId() + "&"
    + "client_secret=" + userPoolClientType.clientSecret() + "&" + "scope=" + SCOPE_STRING;

  try (var httpClient = HttpClients.createDefault()) {
     var httpPost = ClassicRequestBuilder.post(url)
       .setHeader("Content-Type", "application/x-www-form-urlencoded").setEntity(entity).build();
     return httpClient.execute(httpPost, new AuthTokenResponseHandler());       
}

Here, we use the configuration of the user (client ) names and the resource server ID from application.properties to obtain the user (client) pool. Then we construct the URL and the body (entity) of the HTTP request to obtain the authentication token. After it, we execute this request and obtain the token from the response:

private class AuthTokenResponseHandler implements HttpClientResponseHandler<String> {
@Override
  public String handleResponse(ClassicHttpResponse response) throws HttpException, IOException {
    var inputStream = response.getEntity().getContent();
    var responseString = new String(inputStream.readAllBytes(), StandardCharsets.UTF_8);
    var responseMap = objectMapper.readValue(responseString, new TypeReference<Map<String, Object>>() {});
    return (String) responseMap.get("access_token");
   }
}

After we have obtained the token, we're ready to create the (asynchronous as configured) MCP client:

  var client = McpClient.async(getMcpClientTransport(token)).build();

Let's describe what happens when we invoke the getMcpClientTransport method:

private McpClientTransport getMcpClientTransport(String token) {        
  var MCP_SERVER_ENDPOINT= this.getMCPServerEndpoint();
  var headerValue = "Bearer " + token;
  var webClientBuilder = WebClient.builder()
     .defaultHeader("Authorization", headerValue)
     .defaultHeader("accept","application/json, text/event-stream")
     .defaultHeader("Content-Type","application/json");
     return WebClientStreamableHttpTransport
       .builder(webClientBuilder)
       .endpoint(MCP_SERVER_ENDPOINT).build();
}

We first construct the MCP_SERVER_ENDPOINT URL from the in application.properties configured AgentCore Runtime ID. In the next article, I'll add the use case to also add the AgentCore Gateway URL. Then, we create the WebClientBuilder by passing some HTTP headers, including the bearer token. After it, we create WebClientStreamableHttpTransport and set the web client builder and the MCP server endpoint. It's important to use the HTTP Streamable web client because AgentCore Runtime (and Gateway) only supports it.

Now we are ready to initialize our MCP client and obtain the list of tools from it:

client.initialize();
var toolsResult = client.listTools(); 
for (var tool : toolsResult.block().tools()) { 
    logger.info("tool found " + tool); 
}

We get all 4 tools that our Conference Search application from part 2 exposes, which we deployed on AgentCore Runtime.

Next, we need to create the list of tool callbacks from the MCP Client to pass to the ChatClient:

var asyncMcpToolCallbackProvider = AsyncMcpToolCallbackProvider.builder()
     .mcpClients(client)
     .build();

If you don't need all the tools, you can filter them and, for example, only leave those tools whose name contains Conference_Search_Tool_By_Topic as a substring, as shown below:

var asyncMcpToolCallbackProvider = 
AsyncMcpToolCallbackProvider.builder().mcpClients(client)
  .toolFilter(new McpToolFilter() {              
       @Override public boolean test(McpConnectionInfo info, Tool tool) { 
        return tool.name().toLowerCase()
                .contains("Conference_Search_Tool_By_Topic"); 
             } 
         }
      )
    .build();

Next, to enable search prompts such as "Please provide me with the list of conferences including their IDs, with Java topic happening in 2027, with call for papers open today", we need to obtain the current date. LLM doesn't know the current date, and for this, I wrote a small tool with the name DateTimeTools :

@Tool(description = "Get the current date ")
String getLocalDate() {
        return LocalDate.now().toString();       
}

It contains only one tool to get the current date. Then, we pass this local tool to the ChatClient by invoking the tools method. We also pass the tool callback list from the AsyncMcpToolCallbackProvider by invoking the toolCallbacks method. The last step is to use the ChatClient with the given prompt and tool (callbacks) to produce an answer to the prompt. This answer will be streamed back to the user:

  return this.chatClient.prompt().user(prompt)  
   .tools(new DateTimeTools())
   .toolCallbacks(asyncMcpToolCallbackProvider.getToolCallbacks())
   .stream()
   .content();

Let's build our application with mvn clean package and start it with: mvn spring-boot:run.

Now we can use CURL or HTTPie to send some prompts. For example:

"Please provide me with the list of conferences, including their IDs, with Java topics happening in 2027".

Here is an example of the request with HTTPie:

http GET http://localhost:8080/conference?prompt="Please provide me with the list of conferences, including their IDs, with Java topics happening in 2027" Content-Type:text/plain.

Here is the correct LLM response:

As you can see from the description and logs, LLM used the tool Conference_Search_Tool_By_Topic_And_Date from the MCP server to produce the answer. Let's try another prompt:

http GET http://localhost:8080/conference?prompt="Please provide me with the list of conferences, including their IDs, with Java topics happening in 2026 and 2027, with the call for papers open today" Content-Type:text/plain.

Here is the correct LLM response again:

As you can see from the description and logs, LLM used the tools to produce the answer. Conference_Search_Tool_By_Topic_Date_CFP_Open from the MCP server and the local tool Get_The_Current_Date to produce the answer.

Conclusion

In this article, we developed the (MCP-) client, capable of talking to our application running on AgentCore Runtime. In the next article, we'll look at another alternative to AgentCore Runtime to host MCP servers on AgentCore - AgentCore Gateway. We'll also compare both alternatives. In one of the next articles, I'll show you how to deploy and run this MCP client on the AgentCore Runtime as well, using the HTTP protocol. It's not always appropriate to work with the client locally.

If you like my content, please follow me on GitHub and give my repositories a star!

Please also check out my website for more technical content and upcoming public speaking activities.