At 2 AM, your on-call engineer has four browser tabs open: CloudWatch Logs, CloudWatch Metrics, a runbook wiki, and Slack. They are synthesizing evidence manually — and every fragmented minute is MTTR climbing. Building an AI agent to close that gap sounds simple until you realize you are actually wiring a runtime, a JWT-gated API layer, an MCP transport, memory persistence, guardrails, observability, and an evaluation harness. This post walks through a production-shaped template that does that wiring once — so you swap four files and ship your own domain.
The 7-day demo cost to run the full stack was $2.11 USD.
What this article is: A teardown of a fork-and-ship CDK template for multi-agent systems on Bedrock AgentCore. The built-in exemplar is an SRE incident-response system running against seeded demo fixtures in CloudWatch — not real production data. That's intentional: synthetic fixtures prove the pattern works end-to-end so you can swap in your own data sources with confidence.
To adapt it to your domain: 4 file swaps — MCP server, sub-agent, orchestrator prompt, fixtures. Everything else (Runtime, Gateway, Memory, Guardrails, OTEL, eval harness) doesn't move. Jump to Adapting to Your Domain if you want that first.
## The Problem: Manual Incident Response Does Not Scale
When an incident fires, three things break down simultaneously:
- Responders gather evidence from disconnected windows (logs, metrics, runbooks)
- Operational knowledge lives in heads and wikis, not in the workflow
- Synthesis happens manually under pressure — inconsistent and slow
The fix is a single orchestration path: specialized agents gather evidence in parallel, synthesize once, and return a structured answer. That is what this template implements.
## Architecture: Strands Agents-as-Tools on AgentCore
Important distinction: This project uses Strands' agents-as-tools pattern — four sub-agents as in-process `@tool` functions inside a single container. This is architecturally different from Amazon Bedrock Agents' managed multi-agent collaboration feature (separate Agent resources wired via `AssociateAgentCollaborator`). The trade-off is intentional: agents-as-tools means zero inter-agent network hops, the same call stack, and identical local/deployed behavior. The managed Bedrock Agents approach earns its complexity when you need cross-team ownership or independent release cycles.
```
User → Cognito JWT → AgentCore Gateway → AgentCore Runtime (ARM64)
                                              │
                            Orchestrator (any LLM via Strands)
          ┌────────────────┬────────────────┬────────────────┐
    log_analyst    metrics_analyst    runbook_agent    security_auditor
          │                │                │
        CW MCP           CW MCP        Lambda MCP
          └────────────────┴────────────────┘
                           │
           CloudWatch Logs + Metrics + DynamoDB
                           │
            OTEL → CloudWatch Gen AI Observability
```
The orchestrator registers the four sub-agents in its `tools=[]` list. The LLM selects which to call based on their docstrings — no hardcoded dispatch logic:
```python
from pathlib import Path

from strands import Agent

# Project-local helpers (module paths vary): strands_bedrock_model,
# memory_enabled, build_session_manager, and the four @tool sub-agents.

def build_orchestrator(*, session_id: str | None = None, actor_id: str | None = None) -> Agent:
    """Strands orchestrator — four sub-agents exposed as @tool functions."""
    system_prompt = (Path(__file__).parent / "prompts" / "orchestrator.md").read_text(encoding="utf-8")
    agent_kwargs: dict[str, object] = {}
    if memory_enabled():
        agent_kwargs["session_manager"] = build_session_manager(
            session_id=session_id,
            actor_id=actor_id,
        )
    return Agent(
        model=strands_bedrock_model(),  # swappable — one env var
        system_prompt=system_prompt,
        tools=[log_analyst, metrics_analyst, runbook_agent, security_auditor_agent],
        **agent_kwargs,
    )
```
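A quick invocation sketch (the IDs and query below are illustrative, not repo fixtures):

```python
# Build once per incident session, then ask in natural language.
agent = build_orchestrator(session_id="incident-4711", actor_id="oncall-engineer")
result = agent("Checkout is returning 5xx since 02:00 — what changed and what do we do?")
print(result)  # synthesized answer from whichever sub-agents the LLM routed to
```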
## Adapting to Your Domain: Four File Swaps
Everything outside these four paths is domain-agnostic scaffolding — do not touch it:
| Swap | From | To |
|---|---|---|
| Custom MCP server | `mcp_custom/runbook_server/` | `mcp_custom/<your_domain>_server/` |
| Sub-agent | `agent/sub_agents/runbook.py` | `agent/sub_agents/<your_domain>.py` |
| Orchestrator prompt | `agent/prompts/orchestrator.md` | Add one tool entry + one routing rule (additive only) |
| Fixtures + eval cases | `fixtures/scenarios/` + `eval/test_cases.jsonl` | Your 3 canonical queries |
After the four swaps: `make test && make lint` → `make phase1-demo-debug` → `DOCKER_BUILDKIT=0 make phase4-deploy`.
The scaffolding — Runtime, Gateway, Memory, Guardrails, OTEL, eval harness — does not move. See docs/ADAPT.md for the step-by-step checklist and a worked Jira triage example.
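For concreteness, a new-domain sub-agent is just another `@tool` function wrapping a scoped Strands agent. A minimal sketch for the Jira triage example mentioned in `docs/ADAPT.md` — the function name, prompt path, and `jira_mcp_client()` factory are illustrative, not the repo's actual code:

```python
from pathlib import Path

from strands import Agent, tool

@tool
def jira_triage_agent(query: str) -> str:
    """Triage Jira tickets: classify severity, detect duplicates, suggest an owner.

    This docstring is the tool description the orchestrator LLM routes on.
    """
    prompt = (Path(__file__).parent.parent / "prompts" / "jira_triage.md").read_text(encoding="utf-8")
    mcp = jira_mcp_client()  # hypothetical factory, mirrors cloudwatch_mcp_client()
    with mcp:
        agent = Agent(
            model=strands_bedrock_model(),  # same swappable model factory
            system_prompt=prompt,
            tools=mcp.list_tools_sync(),
        )
        return str(agent(query))
```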
## Session & Memory Model
AgentCore provides two distinct persistence layers — keeping these separate is important:
| Layer | Scope | What it stores | Lifetime |
|---|---|---|---|
| Runtime session (microVM) | Single invocation | In-flight context, tool outputs, reasoning trace | 15-min idle / 8-hr max |
| AgentCore Memory | Cross-session | Conversation history (session-window, sliding-window, or long-term summarization) | Configurable TTL |
Each invocation runs in a dedicated microVM with isolated CPU, memory, and filesystem. When the session ends, the microVM is terminated and memory is sanitized — no cross-session data contamination, even with non-deterministic AI processes. AgentCore Memory is opt-in (`AGENTCORE_MEMORY_ENABLED=true`); the session ID propagates through every OTEL span automatically.
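The opt-in gate itself can be a one-line environment check — a plausible sketch of `memory_enabled()` (the repo's actual helper may differ):

```python
import os

def memory_enabled() -> bool:
    """AgentCore Memory attaches only when explicitly opted in."""
    return os.getenv("AGENTCORE_MEMORY_ENABLED", "false").lower() == "true"
```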
## MCP as Transport and Policy Layer
`log_analyst` and `metrics_analyst` share one CloudWatch MCP server process. Specialization happens through per-agent tool filters — one server, two different tool surfaces, zero duplication:
```python
from mcp import StdioServerParameters, stdio_client
from strands.tools.mcp import MCPClient

# ToolFilters and _mcp_subprocess_env are project-local helpers.

def cloudwatch_mcp_client(*, tool_filters: ToolFilters) -> MCPClient:
    """Same MCP server, different tool surface per sub-agent."""
    return MCPClient(
        lambda: stdio_client(
            StdioServerParameters(
                command="uvx",
                args=["awslabs.cloudwatch-mcp-server@latest"],
                env=_mcp_subprocess_env(),
            )
        ),
        startup_timeout=120,
        tool_filters=tool_filters,  # ← the only difference between sub-agents
    )
```
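Usage then might look like this — the tool names are illustrative (not the CloudWatch MCP server's verified tool list), and `ToolFilters(allowed=...)` assumes the project type's constructor:

```python
# Two sub-agents, one server process, two scoped tool surfaces.
logs_client = cloudwatch_mcp_client(
    tool_filters=ToolFilters(allowed=["describe_log_groups", "execute_log_insights_query"])
)
metrics_client = cloudwatch_mcp_client(
    tool_filters=ToolFilters(allowed=["get_metric_data", "get_active_alarms"])
)
```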
The runbook server uses a dual-shape design — local stdio in Phase 1, Gateway-registered Lambda target in Phase 2+. The sub-agent code does not change between modes; only the transport env var changes.
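A sketch of what that dual-shape factory could look like — the Gateway URL env var, the `_cognito_jwt()` token helper, and the server module path are assumptions, not the repo's actual code:

```python
import os

from mcp import StdioServerParameters, stdio_client
from mcp.client.streamable_http import streamablehttp_client
from strands.tools.mcp import MCPClient

def runbook_mcp_client() -> MCPClient:
    """Phase 1: local stdio subprocess. Phase 2+: Gateway-registered Lambda target."""
    if os.getenv("AGENT_TRANSPORT_MODE", "stdio") == "gateway":
        return MCPClient(
            lambda: streamablehttp_client(
                os.environ["GATEWAY_MCP_URL"],
                headers={"Authorization": f"Bearer {_cognito_jwt()}"},  # hypothetical helper
            )
        )
    return MCPClient(
        lambda: stdio_client(
            StdioServerParameters(command="python", args=["-m", "mcp_custom.runbook_server"])
        )
    )
```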
## Why Not Step Functions at the Core?
AWS Prescriptive Guidance is explicit: Step Functions handles deterministic, rule-based flows. AgentCore handles AI-native orchestration where the LLM is the workflow engine. Mixing them at the reasoning layer adds latency without benefit.
In this template, Step Functions belongs at the edges — nightly eval harness, human-in-the-loop approval flows, infra lifecycle — not between the orchestrator and sub-agents.
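For the nightly eval harness, that edge placement could look like the following CDK sketch (Python, inside a stack class) — the construct IDs, schedule, and `eval_fn` Lambda are assumptions, not the repo's actual stacks:

```python
from aws_cdk import Duration
from aws_cdk import aws_events as events
from aws_cdk import aws_events_targets as targets
from aws_cdk import aws_stepfunctions as sfn
from aws_cdk import aws_stepfunctions_tasks as tasks

# Deterministic outer workflow: a scheduled state machine drives the eval
# harness; the AI reasoning stays inside the AgentCore runtime it invokes.
nightly_eval = sfn.StateMachine(
    self, "NightlyEval",
    definition_body=sfn.DefinitionBody.from_chainable(
        tasks.LambdaInvoke(self, "RunEvalHarness", lambda_function=eval_fn)
    ),
    timeout=Duration.hours(1),
)
events.Rule(
    self, "NightlyEvalSchedule",
    schedule=events.Schedule.cron(hour="2", minute="0"),
).add_target(targets.SfnStateMachine(nightly_eval))
```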
| Pattern | Right fit |
|---|---|
| Single agent, all tools | Simplest — context pressure grows as tools scale |
| Agents-as-tools (this repo) | Single team, one container, LLM routes, local debuggable |
| A2A choreography | Cross-team ownership, independent release cycles |
| Step Functions + agents | Deterministic outer workflow, AI inner reasoning |
## Enterprise Security: Three-Layer Least-Privilege Boundary

Every request enters through an AgentCore Gateway provisioned with a custom JWT authorizer:
```python
import boto3

# Control-plane client for AgentCore Gateway provisioning.
client = boto3.client("bedrock-agentcore-control")

client.create_gateway(
    name=gateway_name,
    protocolType="MCP",
    roleArn=config["role_arn"],
    authorizerType="CUSTOM_JWT",
    authorizerConfiguration={
        "customJWTAuthorizer": {
            "discoveryUrl": _issuer_url(region, config["user_pool_id"]),
            "allowedClients": [config["client_id"]],
        }
    },
)
```
Three explicit boundaries, each independently enforced:
| Layer | Mechanism | What it prevents |
|---|---|---|
| Identity | Cognito JWT — `discoveryUrl` + `allowedClients` validated on every request | Unauthenticated callers |
| Authorization | Gateway IAM service role (`roleArn`) scoped to registered targets only | Lateral movement to unregistered services |
| Transport enforcement | `AGENT_TRANSPORT_MODE=gateway` in the runtime container | Local stdio bypass in production |
Bedrock Guardrails are wired separately at the model layer (`agent/guardrails.py`) and provisioned via CDK (`infrastructure/stacks/guardrail_stack.py`) — covering input/output filtering independent of the transport layer.
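A plausible shape for that model-layer wiring, assuming Strands' `BedrockModel` guardrail parameters and these env var names (the repo's actual factory may differ):

```python
import os

from strands.models import BedrockModel

def strands_bedrock_model() -> BedrockModel:
    """One place to swap models; the guardrail rides on the model layer."""
    return BedrockModel(
        model_id=os.environ["BEDROCK_MODEL_ID"],
        guardrail_id=os.environ["GUARDRAIL_ID"],            # output of guardrail_stack.py
        guardrail_version=os.environ["GUARDRAIL_VERSION"],
        guardrail_trace="enabled",
    )
```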
## Honest Eval: What the Scores Actually Mean
The AgentCore LLM-as-judge eval runs three scenarios against the deployed runtime:
| Scenario | Status | GoalSuccessRate | Helpfulness |
|---|---|---|---|
| `debug_external_dep_01` | COMPLETED | 0.0 | 0.83 — Very Helpful |
| `debug_external_dep_02` | COMPLETED | 0.0 | 0.67 — Moderately Helpful |
| `debug_external_dep_03` | COMPLETED | 0.0 | 0.67 — Moderately Helpful |
| Error count | — | 0 | — |
GoalSuccessRate 0.0 is a fixture alignment gap, not a system failure. The evaluator matches exact strings ("Stripe," "503," "CircuitBreakerOpen") against agent responses. The agent reasons in natural language ("payment provider," "upstream errors") — the semantics match, the strings don't. Updating `expected_markers` in `eval/test_cases.jsonl` to match the agent's vocabulary fixes this without touching the agent.
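For example (one JSON object per line; field names other than `expected_markers` are illustrative, not the repo's actual schema):

```json
{"id": "debug_external_dep_01", "query": "Why are checkout requests failing?", "expected_markers": ["payment provider", "upstream errors", "circuit breaker"]}
```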
Helpfulness 0.83 is the meaningful signal — the LLM judge rated the response as something that would actually help an SRE. The runbook was matched, the mitigation steps were numbered and actionable, and the analysis was coherent.
Surfacing this gap explicitly rather than hiding it is the point: partial confidence is a design principle here, not an error state. When evidence is unavailable, the system returns [Partial] — data not retrieved instead of fabricating an answer.
## Observed Cost: 7-Day Demo Window
| Layer | Approx. cost | Pricing model |
|---|---|---|
| AgentCore Runtime | Majority of total | Consumption-based — billed on active CPU only, not LLM wait time |
| AgentCore Gateway | Small | Per-request |
| AgentCore Memory | Small | Storage + retrieval ops |
| Bedrock Guardrails | Small | Per text-unit processed |
| Cognito (Auth) | Negligible | MAU-based |
| Total (7 days) | $2.11 USD | Full stack including all layers |
The consumption-based Runtime pricing is the key lever: you are not charged while the container waits on model responses. For SRE use cases where invocations are event-driven (not continuous), the economics are favorable.
## Why Strands Agents Over LangChain or CrewAI?
Strands Agents is an open-source SDK published by AWS with first-class AgentCore Runtime integration:
- OTEL built-in via ADOT auto-instrumentation — no middleware to configure; spans appear in CloudWatch Gen AI Observability automatically
- Typed `@tool` contracts — sub-agents are plain Python functions; their docstrings become tool descriptions the LLM uses for routing
- MCP tool filtering via a single `tool_filters=` kwarg — one server, scoped tool surface per sub-agent
- Model-agnostic — swap the model ID in one place (`strands_bedrock_model()`); Claude, Nova, and others all work
LangChain and CrewAI are valid choices for different constraint sets. Strands fits here because the target is AgentCore Runtime, not a generic cloud environment.
## Closing
The hard part of building agentic systems on AWS is not writing the agent logic — it is wiring runtime, auth, MCP, memory, guardrails, observability, and eval into a coherent system you can actually ship and trust. Every one of those layers is already wired here: microVM session isolation per invocation, Cognito JWT gating, OTEL to CloudWatch Gen AI Observability, LLM-as-judge evaluation via AgentCore's on-demand eval API, and CDK IaC for all infrastructure.
Fork it. Swap `mcp_custom/runbook_server/` for your domain's data source. Update the orchestrator prompt. Ship. The other eleven services do not move.
Repo: agentcore-multiagent-framework · Adapt guide: `docs/ADAPT.md` · Run it: follow the "First-time deployed demo (recommended path)" section in README.md (CDK deploy → token/runtime deploy → seed → demo queries) · Local-only fallback: `make phase1-demo-debug`