At 2 AM, your on-call engineer has four browser tabs open: CloudWatch Logs, CloudWatch Metrics, a runbook wiki, and Slack. They are synthesizing evidence manually — and every fragmented minute is MTTR climbing. Building an AI agent to close that gap sounds simple until you realize you are actually wiring a runtime, a JWT-gated API layer, an MCP transport, memory persistence, guardrails, observability, and an evaluation harness. This post walks through a production-shaped template that does that wiring once — so you swap four files and ship your own domain.
The 7-day demo cost to run the full stack was $2.11 USD.
What this article is: A teardown of a fork-and-ship CDK template for multi-agent systems on Bedrock AgentCore. The built-in exemplar is an SRE incident-response system running against seeded demo fixtures in CloudWatch — not real production data. That's intentional: synthetic fixtures prove the pattern works end-to-end so you can swap in your own data sources with confidence.
To adapt it to your domain: 4 file swaps — MCP server, sub-agent, orchestrator prompt, fixtures. Everything else (Runtime, Gateway, Memory, Guardrails, OTEL, eval harness) doesn't move. Jump to Adapting to Your Domain if you want that first.
## The Problem: Manual Incident Response Does Not Scale
When an incident fires, three things break down simultaneously:
- Responders gather evidence from disconnected windows (logs, metrics, runbooks)
- Operational knowledge lives in heads and wikis, not in the workflow
- Synthesis happens manually under pressure — inconsistent and slow
The fix is a single orchestration path: specialized agents gather evidence in parallel, synthesize once, and return a structured answer. That is what this template implements.
## Architecture: Strands Agents-as-Tools on AgentCore
Important distinction: This project uses Strands' agents-as-tools pattern — four sub-agents as in-process `@tool` functions inside a single container. This is architecturally different from Amazon Bedrock Agents' managed multi-agent collaboration feature (separate Agent resources wired via `AssociateAgentCollaborator`). The trade-off is intentional: agents-as-tools means zero inter-agent network hops, the same call stack, and identical local/deployed behavior. The managed Bedrock Agents approach earns its complexity when you need cross-team ownership or independent release cycles.
```
User → Cognito JWT → AgentCore Gateway → AgentCore Runtime (ARM64)
                                              │
                            Orchestrator (any LLM via Strands)
          ┌────────────────┬────────────────┬────────────────┐
    log_analyst    metrics_analyst    runbook_agent    security_auditor
          │                │                │
        CW MCP           CW MCP        Lambda MCP
          └────────────────┴────────────────┘
                           │
           CloudWatch Logs + Metrics + DynamoDB
                           │
            OTEL → CloudWatch Gen AI Observability
```
The orchestrator registers the four sub-agents in its `tools=[]` list. The LLM selects which to call based on their docstrings — no hardcoded dispatch logic:
```python
from pathlib import Path

from strands import Agent

# Project-local helpers (module paths vary): strands_bedrock_model,
# memory_enabled, build_session_manager, and the four @tool sub-agents.

def build_orchestrator(*, session_id: str | None = None, actor_id: str | None = None) -> Agent:
    """Strands orchestrator — four sub-agents exposed as @tool functions."""
    system_prompt = (Path(__file__).parent / "prompts" / "orchestrator.md").read_text(encoding="utf-8")
    agent_kwargs: dict[str, object] = {}
    if memory_enabled():
        agent_kwargs["session_manager"] = build_session_manager(
            session_id=session_id,
            actor_id=actor_id,
        )
    return Agent(
        model=strands_bedrock_model(),  # swappable — one env var
        system_prompt=system_prompt,
        tools=[log_analyst, metrics_analyst, runbook_agent, security_auditor_agent],
        **agent_kwargs,
    )
```
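A quick invocation sketch (the IDs and query below are illustrative, not repo fixtures):

```python
# Build once per incident session, then ask in natural language.
agent = build_orchestrator(session_id="incident-4711", actor_id="oncall-engineer")
result = agent("Checkout is returning 5xx since 02:00 — what changed and what do we do?")
print(result)  # synthesized answer from whichever sub-agents the LLM routed to
```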
## Adapting to Your Domain: Four File Swaps
Everything outside these four paths is domain-agnostic scaffolding — do not touch it:
| Swap | From | To |
|---|---|---|
| Custom MCP server | `mcp_custom/runbook_server/` | `mcp_custom/<your_domain>_server/` |
| Sub-agent | `agent/sub_agents/runbook.py` | `agent/sub_agents/<your_domain>.py` |
| Orchestrator prompt | `agent/prompts/orchestrator.md` | Add one tool entry + one routing rule (additive only) |
| Fixtures + eval cases | `fixtures/scenarios/` + `eval/test_cases.jsonl` | Your 3 canonical queries |
After the four swaps: `make test && make lint` → `make phase1-demo-debug` → `DOCKER_BUILDKIT=0 make phase4-deploy`.
The scaffolding — Runtime, Gateway, Memory, Guardrails, OTEL, eval harness — does not move. See docs/ADAPT.md for the step-by-step checklist and a worked Jira triage example.
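For concreteness, a new-domain sub-agent is just another `@tool` function wrapping a scoped Strands agent. A minimal sketch for the Jira triage example mentioned in `docs/ADAPT.md` — the function name, prompt path, and `jira_mcp_client()` factory are illustrative, not the repo's actual code:

```python
from pathlib import Path

from strands import Agent, tool

@tool
def jira_triage_agent(query: str) -> str:
    """Triage Jira tickets: classify severity, detect duplicates, suggest an owner.

    This docstring is the tool description the orchestrator LLM routes on.
    """
    prompt = (Path(__file__).parent.parent / "prompts" / "jira_triage.md").read_text(encoding="utf-8")
    mcp = jira_mcp_client()  # hypothetical factory, mirrors cloudwatch_mcp_client()
    with mcp:
        agent = Agent(
            model=strands_bedrock_model(),  # same swappable model factory
            system_prompt=prompt,
            tools=mcp.list_tools_sync(),
        )
        return str(agent(query))
```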
## Session & Memory Model
AgentCore provides two distinct persistence layers — keeping these separate is important:
| Layer | Scope | What it stores | Lifetime |
|---|---|---|---|
| Runtime session (microVM) | Single invocation | In-flight context, tool outputs, reasoning trace | 15-min idle / 8-hr max |
| AgentCore Memory | Cross-session | Conversation history (session-window, sliding-window, or long-term summarization) | Configurable TTL |
Each invocation runs in a dedicated microVM with isolated CPU, memory, and filesystem. When the session ends, the microVM is terminated and memory is sanitized — no cross-session data contamination, even with non-deterministic AI processes. AgentCore Memory is opt-in (`AGENTCORE_MEMORY_ENABLED=true`); the session ID propagates through every OTEL span automatically.
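The opt-in gate itself can be a one-line environment check — a plausible sketch of `memory_enabled()` (the repo's actual helper may differ):

```python
import os

def memory_enabled() -> bool:
    """AgentCore Memory attaches only when explicitly opted in."""
    return os.getenv("AGENTCORE_MEMORY_ENABLED", "false").lower() == "true"
```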
## MCP as Transport and Policy Layer
`log_analyst` and `metrics_analyst` share one CloudWatch MCP server process. Specialization happens through per-agent tool filters — one server, two different tool surfaces, zero duplication:
```python
from mcp import StdioServerParameters, stdio_client
from strands.tools.mcp import MCPClient

# ToolFilters and _mcp_subprocess_env are project-local helpers.

def cloudwatch_mcp_client(*, tool_filters: ToolFilters) -> MCPClient:
    """Same MCP server, different tool surface per sub-agent."""
    return MCPClient(
        lambda: stdio_client(
            StdioServerParameters(
                command="uvx",
                args=["awslabs.cloudwatch-mcp-server@latest"],
                env=_mcp_subprocess_env(),
            )
        ),
        startup_timeout=120,
        tool_filters=tool_filters,  # ← the only difference between sub-agents
    )
```
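Usage then might look like this — the tool names are illustrative (not the CloudWatch MCP server's verified tool list), and `ToolFilters(allowed=...)` assumes the project type's constructor:

```python
# Two sub-agents, one server process, two scoped tool surfaces.
logs_client = cloudwatch_mcp_client(
    tool_filters=ToolFilters(allowed=["describe_log_groups", "execute_log_insights_query"])
)
metrics_client = cloudwatch_mcp_client(
    tool_filters=ToolFilters(allowed=["get_metric_data", "get_active_alarms"])
)
```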
The runbook server uses a dual-shape design — local stdio in Phase 1, Gateway-registered Lambda target in Phase 2+. The sub-agent code does not change between modes; only the transport env var changes.
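A sketch of what that dual-shape factory could look like — the Gateway URL env var, the `_cognito_jwt()` token helper, and the server module path are assumptions, not the repo's actual code:

```python
import os

from mcp import StdioServerParameters, stdio_client
from mcp.client.streamable_http import streamablehttp_client
from strands.tools.mcp import MCPClient

def runbook_mcp_client() -> MCPClient:
    """Phase 1: local stdio subprocess. Phase 2+: Gateway-registered Lambda target."""
    if os.getenv("AGENT_TRANSPORT_MODE", "stdio") == "gateway":
        return MCPClient(
            lambda: streamablehttp_client(
                os.environ["GATEWAY_MCP_URL"],
                headers={"Authorization": f"Bearer {_cognito_jwt()}"},  # hypothetical helper
            )
        )
    return MCPClient(
        lambda: stdio_client(
            StdioServerParameters(command="python", args=["-m", "mcp_custom.runbook_server"])
        )
    )
```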
## Why Not Step Functions at the Core?
AWS Prescriptive Guidance is explicit: Step Functions handles deterministic, rule-based flows. AgentCore handles AI-native orchestration where the LLM is the workflow engine. Mixing them at the reasoning layer adds latency without benefit.
In this template, Step Functions belongs at the edges — nightly eval harness, human-in-the-loop approval flows, infra lifecycle — not between the orchestrator and sub-agents.
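For the nightly eval harness, that edge placement could look like the following CDK sketch (Python, inside a stack class) — the construct IDs, schedule, and `eval_fn` Lambda are assumptions, not the repo's actual stacks:

```python
from aws_cdk import Duration
from aws_cdk import aws_events as events
from aws_cdk import aws_events_targets as targets
from aws_cdk import aws_stepfunctions as sfn
from aws_cdk import aws_stepfunctions_tasks as tasks

# Deterministic outer workflow: a scheduled state machine drives the eval
# harness; the AI reasoning stays inside the AgentCore runtime it invokes.
nightly_eval = sfn.StateMachine(
    self, "NightlyEval",
    definition_body=sfn.DefinitionBody.from_chainable(
        tasks.LambdaInvoke(self, "RunEvalHarness", lambda_function=eval_fn)
    ),
    timeout=Duration.hours(1),
)
events.Rule(
    self, "NightlyEvalSchedule",
    schedule=events.Schedule.cron(hour="2", minute="0"),
).add_target(targets.SfnStateMachine(nightly_eval))
```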
| Pattern | Right fit |
|---|---|
| Single agent, all tools | Simplest — context pressure grows as tools scale |
| Agents-as-tools (this repo) | Single team, one container, LLM routes, local debuggable |
| A2A choreography | Cross-team ownership, independent release cycles |
| Step Functions + agents | Deterministic outer workflow, AI inner reasoning |
## Enterprise Security: Three-Layer Least-Privilege Boundary

Every request enters through an AgentCore Gateway provisioned with a custom JWT authorizer:
```python
import boto3

# Control-plane client for AgentCore Gateway provisioning.
client = boto3.client("bedrock-agentcore-control")

client.create_gateway(
    name=gateway_name,
    protocolType="MCP",
    roleArn=config["role_arn"],
    authorizerType="CUSTOM_JWT",
    authorizerConfiguration={
        "customJWTAuthorizer": {
            "discoveryUrl": _issuer_url(region, config["user_pool_id"]),
            "allowedClients": [config["client_id"]],
        }
    },
)
```
Three explicit boundaries, each independently enforced:
| Layer | Mechanism | What it prevents |
|---|---|---|
| Identity | Cognito JWT — `discoveryUrl` + `allowedClients` validated on every request | Unauthenticated callers |
| Authorization | Gateway IAM service role (`roleArn`) scoped to registered targets only | Lateral movement to unregistered services |
| Transport enforcement | `AGENT_TRANSPORT_MODE=gateway` in the runtime container | Local stdio bypass in production |
Bedrock Guardrails are wired separately at the model layer (`agent/guardrails.py`) and provisioned via CDK (`infrastructure/stacks/guardrail_stack.py`) — covering input/output filtering independent of the transport layer.
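A plausible shape for that model-layer wiring, assuming Strands' `BedrockModel` guardrail parameters and these env var names (the repo's actual factory may differ):

```python
import os

from strands.models import BedrockModel

def strands_bedrock_model() -> BedrockModel:
    """One place to swap models; the guardrail rides on the model layer."""
    return BedrockModel(
        model_id=os.environ["BEDROCK_MODEL_ID"],
        guardrail_id=os.environ["GUARDRAIL_ID"],            # output of guardrail_stack.py
        guardrail_version=os.environ["GUARDRAIL_VERSION"],
        guardrail_trace="enabled",
    )
```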
## Honest Eval: What the Scores Actually Mean
The AgentCore LLM-as-judge eval runs three scenarios against the deployed runtime:
| Scenario | Status | GoalSuccessRate | Helpfulness |
|---|---|---|---|
| `debug_external_dep_01` | COMPLETED | 0.0 | 0.83 — Very Helpful |
| `debug_external_dep_02` | COMPLETED | 0.0 | 0.67 — Moderately Helpful |
| `debug_external_dep_03` | COMPLETED | 0.0 | 0.67 — Moderately Helpful |
| Error count | — | 0 | — |
GoalSuccessRate 0.0 is a fixture alignment gap, not a system failure. The evaluator matches exact strings ("Stripe," "503," "CircuitBreakerOpen") against agent responses. The agent reasons in natural language ("payment provider," "upstream errors") — the semantics match, the strings don't. Updating `expected_markers` in `eval/test_cases.jsonl` to match the agent's vocabulary fixes this without touching the agent.
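For example (one JSON object per line; field names other than `expected_markers` are illustrative, not the repo's actual schema):

```json
{"id": "debug_external_dep_01", "query": "Why are checkout requests failing?", "expected_markers": ["payment provider", "upstream errors", "circuit breaker"]}
```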
Helpfulness 0.83 is the meaningful signal — the LLM judge rated the response as something that would actually help an SRE. The runbook was matched, the mitigation steps were numbered and actionable, and the analysis was coherent.
Surfacing this gap explicitly rather than hiding it is the point: partial confidence is a design principle here, not an error state. When evidence is unavailable, the system returns [Partial] — data not retrieved instead of fabricating an answer.
## Observed Cost: 7-Day Demo Window
| Layer | Approx. cost | Pricing model |
|---|---|---|
| AgentCore Runtime | Majority of total | Consumption-based — billed on active CPU only, not LLM wait time |
| AgentCore Gateway | Small | Per-request |
| AgentCore Memory | Small | Storage + retrieval ops |
| Bedrock Guardrails | Small | Per text-unit processed |
| Cognito (Auth) | Negligible | MAU-based |
| Total (7 days) | $2.11 USD | Full stack including all layers |
The consumption-based Runtime pricing is the key lever: you are not charged while the container waits on model responses. For SRE use cases where invocations are event-driven (not continuous), the economics are favorable.
## Why Strands Agents Over LangChain or CrewAI?
Strands Agents is an open-source SDK published by AWS with first-class AgentCore Runtime integration:
- OTEL built-in via ADOT auto-instrumentation — no middleware to configure; spans appear in CloudWatch Gen AI Observability automatically
- Typed `@tool` contracts — sub-agents are plain Python functions; their docstrings become tool descriptions the LLM uses for routing
- MCP tool filtering via a single `tool_filters=` kwarg — one server, scoped tool surface per sub-agent
- Model-agnostic — swap the model ID in one place (`strands_bedrock_model()`); Claude, Nova, and others all work
LangChain and CrewAI are valid choices for different constraint sets. Strands fits here because the target is AgentCore Runtime, not a generic cloud environment.
## Closing
The hard part of building agentic systems on AWS is not writing the agent logic — it is wiring runtime, auth, MCP, memory, guardrails, observability, and eval into a coherent system you can actually ship and trust. Every one of those layers is already wired here: microVM session isolation per invocation, Cognito JWT gating, OTEL to CloudWatch Gen AI Observability, LLM-as-judge evaluation via AgentCore's on-demand eval API, and CDK IaC for all infrastructure.
Fork it. Swap `mcp_custom/runbook_server/` for your domain's data source. Update the orchestrator prompt. Ship. The other eleven services do not move.
Repo: agentcore-multiagent-framework · Adapt guide: `docs/ADAPT.md` · Run it: follow the "First-time deployed demo (recommended path)" section in README.md (CDK deploy → token/runtime deploy → seed → demo queries) · Local-only fallback: `make phase1-demo-debug`