With the CDK infrastructure in place (Part 2), we need an actual agent to run inside it.
The agent is a Python application that:
- Exposes an HTTP endpoint AgentCore can call
- Uses the Strands Agents SDK to run a Bedrock-backed reasoning loop
- Integrates with AgentCore Memory for persistent context
- Uses Bedrock Guardrails on every invocation
The full source is in apps/customer-service-agent/ in the demo repo.
Why Strands over LangChain or LlamaIndex?
When I started this project, LangChain was the default answer for "I need to build an agent." I used it, ran into friction, and switched to Strands. Here's why:
Strands is AWS-native. It's built to integrate directly with Bedrock services — prompt caching, guardrail configs, tool definitions. With LangChain, you write adapter code to bridge from LangChain abstractions down to raw Bedrock APIs. With Strands, you're calling the Bedrock API directly through a thin, intentional abstraction.
Tool definitions are simpler. In LangChain, you define tools with StructuredTool.from_function() or subclass BaseTool. In Strands, you decorate a function with @tool and the docstring becomes the description:
```python
# LangChain approach
from langchain.tools import StructuredTool
from pydantic import BaseModel, Field

class OrderLookupInput(BaseModel):
    order_id: str = Field(description="The order ID")

def lookup_order_status(order_id: str) -> str:
    ...

tool = StructuredTool.from_function(
    func=lookup_order_status,
    name="lookup_order_status",
    description="Look up the current status of an order",
    args_schema=OrderLookupInput,
)
```

```python
# Strands approach
from strands import tool

@tool
def lookup_order_status(order_id: str) -> str:
    """Look up the current status of a customer order by order ID."""
    ...
```
Active development matches AgentCore. Strands is developed at a cadence that tracks AgentCore releases. New AgentCore features show up in Strands before they make it to LangChain adapters.
Project structure
```
customer_service_agent/
├── __init__.py
├── config.py    # Settings from env vars
├── prompts.py   # System prompt
├── tools.py     # @tool definitions
├── memory.py    # AgentCore Memory boto3 helpers
├── agent.py     # BedrockModel setup + streaming
└── main.py      # FastAPI app
```
config.py — environment variables
Everything the agent needs is injected as environment variables by AgentCore. In production, you set EnvironmentVariables on the CfnRuntime resource in CDK. Locally, you use .env.local.
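For local runs, a minimal `.env.local` might look like the following. All values here are hypothetical placeholders; pydantic-settings matches them case-insensitively onto the `Settings` fields:

```shell
# .env.local (hypothetical placeholder values for local development)
# pydantic-settings maps these case-insensitively onto the Settings fields
AGENTCORE_MEMORY_ID=CustomerServiceMemory-abc123
BEDROCK_GUARDRAIL_ID=gr0abcd1234
BEDROCK_GUARDRAIL_VERSION=1
AWS_REGION=us-east-1
ENVIRONMENT=dev
```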
```python
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    agentcore_memory_id: str = ""
    bedrock_guardrail_id: str = ""
    bedrock_guardrail_version: str = "1"
    aws_region: str = "us-east-1"
    environment: str = "dev"

    # Primary: Claude Sonnet 4.6
    primary_model_id: str = "anthropic.claude-sonnet-4-6-20251001-v1:0"

    # Background: Nova Pro (~15x cheaper per token)
    background_model_id: str = "amazon.nova-pro-v1:0"

    class Config:
        env_file = ".env.local"

settings = Settings()
```
The dual-model strategy
The agent uses two Bedrock models for different tasks:
Claude Sonnet 4.6 for main conversations — best reasoning, multi-step tool use, nuanced responses. More expensive but worth it for the customer-facing output.
Amazon Nova Pro for background tasks — ~15x cheaper per token. Ideal for:
- Classifying intent before routing
- Summarising long conversation history
- Generating internal labels/tags
- Any task where "good enough" is sufficient
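A routing helper for this split can be very small. The sketch below is illustrative, not part of the Strands API: the function name and task labels are invented, while the model IDs mirror the config defaults above.

```python
# Hypothetical routing helper (not part of the Strands API): send internal
# background tasks to the cheap model, everything customer-facing to Claude.
PRIMARY_MODEL_ID = "anthropic.claude-sonnet-4-6-20251001-v1:0"
BACKGROUND_MODEL_ID = "amazon.nova-pro-v1:0"

# Invented task labels for illustration
BACKGROUND_TASKS = {"classify_intent", "summarise_history", "generate_tags"}

def choose_model_id(task: str) -> str:
    """Pick a model for a task: 'good enough' work goes to Nova Pro."""
    return BACKGROUND_MODEL_ID if task in BACKGROUND_TASKS else PRIMARY_MODEL_ID
```

Centralising the choice in one function keeps the cost policy in a single place as the list of background tasks grows.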
Prompt caching — the 90% cost saving
This is the most impactful optimisation in the whole system.
Prompt caching works like this: you mark part of your prompt as a "cacheable prefix". Bedrock caches those tokens server-side for ~5 minutes. On subsequent calls that use the same prefix, you pay the cache read price instead of the full input token price.
For Claude Sonnet 4.6:
- Cache write: $3.00 per 1M input tokens (same as normal)
- Cache read: $0.30 per 1M input tokens
- Saving: 90% on cached tokens
The system prompt is the perfect candidate for caching — it's the same on every request in a session:
```python
from strands.models import BedrockModel
from botocore.config import Config

# Adaptive retry — Bedrock throttles hard under load
boto_config = Config(
    retries={"max_attempts": 5, "mode": "adaptive"},
    read_timeout=120,
)

primary_model = BedrockModel(
    model_id="anthropic.claude-sonnet-4-6-20251001-v1:0",
    boto_config=boto_config,
    additional_request_fields={
        # Enable prompt caching (Anthropic beta feature on Bedrock)
        "anthropic_beta": ["prompt-caching-2024-07-31"],
    },
    guardrail_config={
        "guardrailIdentifier": settings.bedrock_guardrail_id,
        "guardrailVersion": settings.bedrock_guardrail_version,
    },
)

# System prompt with cache_control: ephemeral
# This marks the prompt as a cacheable prefix for Bedrock
cached_system_prompt = [
    {
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # Cache this prefix
    }
]
```
For a 1,500-token system prompt at 100 sessions/day with 5 turns each (500 invocations):
- Without caching: 750,000 system prompt tokens/day × $3/1M = $2.25/day just for system prompts
- With caching: turn 1 of each session writes the cache (150,000 tokens × $3/1M = $0.45), turns 2-5 read it (600,000 tokens × $0.30/1M = $0.18) → ~$0.63/day
- Saving: ~$1.60/day, roughly $590/year on just the system prompt
The saving scales linearly with session length and request volume.
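To run this arithmetic for your own traffic, a small helper is enough. It is pure arithmetic, and it assumes (as above) that cache writes are priced like normal input tokens and that every turn after the first lands within the ~5-minute cache window; the function name is mine, not from any SDK.

```python
# Estimate daily system-prompt cost in USD. Assumes turn 1 of each session
# writes the cache and turns 2..n read it within the ~5-minute cache window.
# Prices are per 1M tokens.
def daily_prompt_cost(prompt_tokens: int, sessions: int, turns: int,
                      input_price: float = 3.00, read_price: float = 0.30,
                      cached: bool = False) -> float:
    if not cached:
        return prompt_tokens * sessions * turns * input_price / 1_000_000
    writes = prompt_tokens * sessions * input_price / 1_000_000              # turn 1 per session
    reads = prompt_tokens * sessions * (turns - 1) * read_price / 1_000_000  # turns 2..n
    return writes + reads
```

For example, 1,500 tokens at 100 sessions of 5 turns works out to $2.25/day uncached versus roughly $0.63/day cached under these assumptions.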
Tool definitions
Strands tools are just decorated Python functions. The function signature defines the input schema, and the docstring is sent to the model as the tool description:
```python
from strands import tool

@tool
def lookup_order_status(order_id: str) -> str:
    """
    Look up the current status and details of a customer order by order ID.
    Use this when a customer asks about their order, delivery, or shipment.

    Args:
        order_id: The order ID (format: ORD-XXXXXX)

    Returns:
        Order details including status, items, and estimated delivery date.
    """
    # Your implementation here
    ...

@tool
def search_product_faq(query: str) -> str:
    """
    Search the product FAQ and policy knowledge base for answers to customer questions.
    ...
    """
    ...
```
Tools are passed to the Agent constructor. Strands handles the tool invocation loop — calling the tool when the model decides to use it, feeding the result back, and continuing the reasoning loop until the model produces a final response.
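That loop is conceptually simple. Here is a stripped-down sketch of the idea, not the actual Strands internals: the message shapes and the model callable are invented for illustration.

```python
# Conceptual sketch of a tool-use loop (illustrative shapes, not Strands
# internals): call the model, run any tool it requests, append the result,
# and repeat until the model returns plain text.
def run_tool_loop(model, tools: dict, messages: list) -> str:
    while True:
        reply = model(messages)  # the model decides: final text or a tool call
        if reply["type"] == "text":
            return reply["text"]  # final answer, stop looping
        # Execute the requested tool and feed the result back to the model
        result = tools[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "name": reply["tool"], "content": result})
```

The real framework adds schema validation, parallel tool calls, and streaming on top, but the control flow is this: model, tool, model, until no more tool calls.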
AgentCore Memory integration
AgentCore Memory provides persistent storage across sessions without you building any of the storage infrastructure.
The three strategy types:
- Semantic — stores facts and user profile information. Consolidates information like "user prefers email contact", "user is on premium plan".
- Summary — stores compressed session history. "On 2025-03-15 user reported late delivery of order ORD-001234. Issue was resolved."
- UserPreference — stores interaction style. "User prefers brief responses without extra detail."
The memory client is a standard boto3 client:
```python
import boto3
from botocore.config import Config

memory_client = boto3.client(
    "bedrock-agentcore-memory",
    config=Config(retries={"max_attempts": 5, "mode": "adaptive"}, read_timeout=30),
)

# Store a conversation turn
memory_client.create_event(
    memoryId=settings.agentcore_memory_id,
    actorId=actor_id,      # Identifies the user (e.g., user ID from JWT)
    sessionId=session_id,  # Identifies the conversation session
    messages=[
        {"role": "USER", "content": [{"text": user_message}]},
        {"role": "ASSISTANT", "content": [{"text": assistant_message}]},
    ],
)

# Retrieve relevant memories before each invocation
response = memory_client.retrieve_memory_records(
    memoryId=settings.agentcore_memory_id,
    actorId=actor_id,
    searchQuery=user_message,  # Semantic search over stored memories
    maxResults=5,
)
memories = response.get("memoryRecords", [])
```
The retrieved memories are prepended to the user message as context:
```python
memory_context = "\n".join(
    f"- {record['content']['text']}"
    for record in memories
    if record.get("content", {}).get("text")
)

enriched_message = f"[Past context:]\n{memory_context}\n\n{user_message}"
```
The streaming agent loop
The agent produces a streaming response via the Strands stream_async method. Each chunk is forwarded as an SSE event:
```python
async def stream_agent_response(user_message, actor_id, session_id):
    # 1. Retrieve memories
    memories = retrieve_relevant_memories(actor_id=actor_id, query=user_message)
    enriched_message = prepend_memory_context(memories, user_message)

    # 2. Build agent with tools
    agent = Agent(
        model=primary_model,
        system_prompt=cached_system_prompt,
        tools=[lookup_order_status, search_product_faq],
    )

    # 3. Stream response
    full_response_parts = []
    async for chunk in agent.stream_async(enriched_message):
        if chunk.get("type") == "text":
            text = chunk.get("text", "")
            if text:
                full_response_parts.append(text)
                yield f"data: {text}\n\n"  # SSE format

    # 4. Store turn in memory
    store_conversation_turn(
        actor_id=actor_id,
        session_id=session_id,
        user_message=user_message,
        assistant_message="".join(full_response_parts),
    )
    yield "data: [DONE]\n\n"
```
The adaptive retry config
Bedrock throttles hard when you exceed your model's throughput quotas (requests and tokens per minute). Without retry logic, throttled requests fail immediately.
mode: "adaptive" uses a token bucket algorithm — it monitors the throttle rate and automatically backs off when it detects pressure:
```python
from botocore.config import Config

boto_config = Config(
    retries={"max_attempts": 5, "mode": "adaptive"},
    read_timeout=120,  # Streaming responses can take 30-90s for complex tool chains
    connect_timeout=10,
)
```
The difference between "standard" and "adaptive" retry modes:
- standard: fixed exponential backoff between retries
- adaptive: adjusts retry rate based on observed throttling, converges to a sustainable rate faster
For agentic workloads that run multi-step tool chains — and thus make many Bedrock calls in sequence — "adaptive" consistently outperforms "standard".
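To build intuition for what adaptive mode does, here is a toy client-side token bucket that backs off when it observes throttles. This is a conceptual sketch only, not botocore's actual implementation; the class name and tuning constants are invented.

```python
import time

# Toy token-bucket limiter illustrating adaptive backoff (not botocore's
# implementation): capacity refills at `rate` tokens/second, each observed
# throttle halves the rate, each success nudges it back up.
class AdaptiveBucket:
    def __init__(self, rate: float = 5.0, capacity: float = 5.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self) -> bool:
        """Take one token if available; callers wait and retry otherwise."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def record(self, throttled: bool):
        # Back off sharply on throttles, recover slowly on successes,
        # so the send rate converges toward what the service will sustain.
        self.rate = max(0.5, self.rate * 0.5) if throttled else min(10.0, self.rate * 1.1)
```

The asymmetry (halve on failure, +10% on success) is what lets the client converge on a sustainable rate instead of oscillating between bursts and throttle storms.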
Putting it all together
The agent.py file wires everything together:
```python
primary_model = BedrockModel(
    model_id=settings.primary_model_id,
    boto_config=boto_config,
    additional_request_fields={"anthropic_beta": ["prompt-caching-2024-07-31"]},
    guardrail_config={"guardrailIdentifier": settings.bedrock_guardrail_id, ...},
)

cached_system_prompt = [{"text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}]

async def stream_agent_response(user_message, actor_id, session_id):
    memories = retrieve_relevant_memories(actor_id, user_message)
    enriched = prepend_memory_context(memories, user_message)
    agent = Agent(
        model=primary_model,
        system_prompt=cached_system_prompt,
        tools=[lookup_order_status, search_product_faq],
    )
    full_response = []
    async for chunk in agent.stream_async(enriched):
        if chunk.get("type") == "text" and chunk.get("text"):
            full_response.append(chunk["text"])
            yield f"data: {chunk['text']}\n\n"
    store_conversation_turn(actor_id, session_id, user_message, "".join(full_response))
    yield "data: [DONE]\n\n"
```
In Part 4, we set up the local Docker dev environment so you can iterate on the agent code without deploying to AWS on every change.
→ Continue to Part 4: Running Locally with Docker
Originally published at rajmurugan.com. This is Part 3 of the Ultimate Guide to Building AI Agents on AWS with Bedrock AgentCore series.