Two months ago, our DevOps team set out to build an AWS governance agent. Something that could look across a multi-account AWS organization, find orphaned resources, flag security issues, check tag compliance, and tell you where you're bleeding money — in plain English.
We had AWS Strands Agents SDK, Amazon Bedrock AgentCore, and a reasonable amount of optimism.
What followed was two months of building, tearing down, and rebuilding. Three fundamentally different architectures. 18,000 lines of code written and then deleted. And a final system that's simpler than any of the ones that came before it.
This is the story of how we got there.
## Iteration 1: "The LLM Will Figure It Out"
The first version was the obvious one. Give the LLM a set of AWS API tools — describe_instances, list_security_groups, get_cost_and_usage — and let it call them directly.
We built an AgentRouter that received user queries, a CoordinatorAgent that managed multi-agent flow, and wired it all to boto3 calls. The LLM would receive a question like "find unused security groups in our production VPC," reason about which APIs to call, and chain them together.
It worked. Sort of.
The problem wasn't that the LLM couldn't call AWS APIs. It could. The problem was that AWS APIs are inconsistent, paginated, rate-limited, and return wildly different response shapes across services. A question about orphaned EBS volumes required the LLM to:
- List all volumes
- Filter for `available` state
- Cross-reference with instance attachments
- Check if any are in use by ASGs or launch templates
- Handle pagination across all of these
The LLM would sometimes get this right. Sometimes it'd miss the pagination. Sometimes it'd hallucinate an API parameter. Every query was a fresh adventure in whether the model remembered the exact shape of the describe_volumes response.
We were spending tokens on API exploration instead of governance analysis.
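For contrast, the deterministic version of just the first two steps (enumerate with pagination, filter for the available state) is straightforward in boto3; the function names here are ours, and the ASG and launch-template cross-checks are omitted:

```python
def find_detached_volumes(volumes):
    """Rule: a volume is a candidate if it's 'available' and unattached."""
    return [
        v for v in volumes
        if v.get("State") == "available" and not v.get("Attachments")
    ]

def list_all_volumes(region="us-east-1"):
    """Enumerate every EBS volume, handling pagination explicitly."""
    import boto3  # needs AWS credentials; imported lazily for the sketch
    ec2 = boto3.client("ec2", region_name=region)
    volumes = []
    for page in ec2.get_paginator("describe_volumes").paginate():
        volumes.extend(page["Volumes"])
    return volumes

# candidates = find_detached_volumes(list_all_volumes())
```

Ten lines of code the model had to mentally re-derive on every single query.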
**Lesson:** Giving an LLM raw AWS APIs is like giving someone a phone book and asking them to plan a city. The information is there, but the abstraction is wrong.
## Iteration 2: The 8,300-Line Orchestrator
If the LLM couldn't be trusted to navigate AWS APIs on its own, we'd give it structure.
We built a five-stage deterministic pipeline:
- **Classify** — determine the intent (security audit, cost review, orphan detection)
- **Reason** — extract entities (account IDs, VPC names, resource types)
- **Route** — select the right tools and agents
- **Execute** — run the tools in the correct order
- **Synthesize** — compile results into a coherent response
This was a hybrid approach. Deterministic control flow with LLM reasoning at each stage. A SemanticIntentClassifier replaced keyword routing. A SupervisorAgent managed the pipeline. We added VPC disambiguation, cross-account routing, hallucination guards.
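In skeleton form, the pipeline looked roughly like this (stage names from the list above; the dataclass and stage callables are illustrative, not our production code):

```python
from dataclasses import dataclass, field

@dataclass
class PipelineResult:
    intent: str = ""
    entities: dict = field(default_factory=dict)
    tools: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    answer: str = ""

def run_pipeline(query, classify, extract, route, execute, synthesize):
    """Five deterministic stages, each wrapping its own LLM call."""
    r = PipelineResult()
    r.intent = classify(query)             # 1. Classify
    r.entities = extract(query, r.intent)  # 2. Reason
    r.tools = route(r.intent, r.entities)  # 3. Route
    r.outputs = [execute(t, r.entities) for t in r.tools]  # 4. Execute
    r.answer = synthesize(query, r.outputs)                # 5. Synthesize
    return r
```

Every stage boundary became a place where routing logic could be wrong, which is exactly where the maintenance burden came from.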
It felt like progress. The pipeline was predictable. Tests could target each stage. We could reason about failure modes.
But the orchestrator kept growing. Intent taxonomies needed constant updates. Every new query pattern required new routing logic. The classifier would misroute edge cases, and fixing one route would break another. VPC name resolution alone went through four bug-fix cycles.
By late February, the orchestrator was 8,300 lines across seven modules. The SupervisorAgent had been decomposed, recomposed, and decomposed again. We'd built an entire reasoning engine on top of an LLM that was already a reasoning engine.
We were fighting the model instead of using it.
**Lesson:** A deterministic pipeline that wraps an LLM is still deterministic. You get the rigidity of hardcoded flows with the unpredictability of language models. The worst of both worlds.
## Iteration 3: The Satellite Architecture
Around the same time, we had a data freshness problem. Direct API calls were slow, rate-limited, and gave you a point-in-time snapshot. We wanted something closer to a continuously updated inventory.
So we built a satellite architecture:
- Lambda scanners deployed across every AWS account via StackSets
- Each scanner would enumerate resources on a schedule, enrich them with metadata and relationships
- Results flowed into S3 Vectors — four buckets with 52 vector indexes, organized by domain (core, data, compute, ops)
- An embedding service (Titan Embed Text v2, 1024 dimensions) vectorized everything
- The agent would query the vector store instead of calling AWS APIs directly
The infrastructure was impressive on a whiteboard. Service-aware vector indexes. Cross-account routing. Relationship graphs embedded alongside resource metadata. A reconciliation pipeline to handle eventual consistency.
In practice:
**Cost:** Titan embedding calls for every resource in every account on every scan. Lambda execution time kept climbing — we bumped timeouts from 10 to 15 to 20 minutes. 57 VPC endpoints to give Lambdas access to S3 Vectors.

**Consistency:** Vector deduplication was a constant battle. Resources would appear twice after re-scans. Cross-account vector routing had subtle bugs where resources from one account would surface in queries about another.

**Freshness:** The thing we built this to solve. Scans ran on schedules, so the vector store was always behind reality. Users would ask about a security group that was created an hour ago and get nothing back.

**Complexity:** 3,700 lines of scanner code across eight modules. A 611-line S3 Vectors client. A 388-line relationship index. A 286-line embedding service. Infrastructure stacks for Step Functions, S3 Vectors buckets, scanner Lambdas, and all the IAM plumbing to connect them.
We had built a distributed data pipeline to feed an LLM that still couldn't reliably answer "show me the unused EBS volumes."
**Lesson:** When your data layer is more complex than your analysis layer, you've probably over-solved the wrong problem.
## The Moment It Clicked
In late February, we stepped back and asked a different question: What if we stopped building infrastructure and started using what AWS already provides?
AWS Config Aggregator already has a continuously updated inventory of resources across all accounts. It supports SQL queries. It's maintained by AWS. It doesn't need Lambdas, embeddings, or vector stores.
And Strands SDK already handles tool selection, invocation, and chaining. It doesn't need a five-stage pipeline to decide which tool to call.
On February 27, we deleted the 8,300-line orchestrator and replaced it with a single Strands agent. On March 2, we deleted the entire scanner infrastructure and replaced S3 Vectors with Config Aggregator queries.
The diff was dramatic: 18,000 lines removed. The new agent was roughly 1,500 lines.
## What We Built Instead
The current system has one Strands agent running Haiku. No multi-agent orchestration. No intent classification. No vector stores. The LLM picks tools from a registry, and the tools handle the complexity of talking to AWS.
Here's what makes it work:
### Data: Config Aggregator + Resource Explorer 2
Instead of building our own data layer, we query AWS's:
```sql
SELECT resourceId, resourceType, configuration
WHERE resourceType = 'AWS::EC2::SecurityGroup'
  AND accountId = 'XXXXXXXXXXXX'
```
Config Aggregator covers ~85% of resource types. Resource Explorer 2 fills the gaps. For anything with zero results, a direct API fallback fires automatically. No Lambdas. No embeddings. No eventual consistency problems.
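The fallback cascade is genuinely small. Here's a sketch of the idea: the ordering logic is ours, and while `select_aggregate_resource_config` is the real boto3 call for Config advanced queries, the wiring around it is illustrative:

```python
def query_with_fallback(expression, sources):
    """Try each data source in order; return the first non-empty result.

    `sources` is an ordered list of callables (Config Aggregator,
    Resource Explorer 2, direct API) so the cascade stays testable.
    """
    for source in sources:
        results = source(expression)
        if results:
            return results
    return []

def config_aggregator_source(aggregator_name):
    """Build a source backed by AWS Config advanced queries."""
    import boto3  # needs AWS credentials; imported lazily for the sketch
    client = boto3.client("config")

    def run(expression):
        resp = client.select_aggregate_resource_config(
            ConfigurationAggregatorName=aggregator_name,
            Expression=expression,
        )
        return resp.get("Results", [])

    return run
```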
### Tools: Dispatchers Over Individual Functions
Early versions exposed 70+ individual tools to the model. Each tool call required the LLM to pick from a massive schema — roughly 30K tokens of tool definitions per request. The model would get confused, pick the wrong tool, or combine tools incorrectly.
We consolidated into 8 domain dispatchers:
```python
@tool
def security_tool(action: str, **params):
    """Security operations: findings, sg_rules, nacl_analysis,
    kms_keys, compliance_check, ..."""
    return dispatch(action, params, _ACTIONS)
```
One security_tool with an action parameter replaces 11 individual tools. The LLM sees 8 clear categories instead of 70 ambiguous options. Tool schema dropped by 64% — about 10K fewer tokens per request.
### Analysis: Context Builders, Not LLM Reasoning
The earlier architectures asked the LLM to analyze raw AWS API responses. "Here are 200 security groups, figure out which ones are orphaned."
Now, deterministic context builders pre-process the data:
- Orphan detection applies known rules (no attachments, no references, default VPC)
- Security analysis flags known-bad patterns (0.0.0.0/0 ingress, missing encryption)
- Cost analysis identifies savings opportunities from utilization data
The LLM receives pre-analyzed context and focuses on what it's good at: explaining findings in plain language and prioritizing recommendations.
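As a sketch, an orphan context builder is just deterministic rules over inventory data (the field names and rules shown are illustrative):

```python
def build_orphan_context(security_groups):
    """Apply deterministic orphan rules so the LLM only has to
    explain and prioritize, not detect."""
    findings = []
    for sg in security_groups:
        reasons = []
        if not sg.get("attached_enis"):
            reasons.append("no network interfaces attached")
        if not sg.get("referenced_by"):
            reasons.append("not referenced by other security groups")
        if sg.get("vpc_is_default"):
            reasons.append("lives in the default VPC")
        if reasons:
            findings.append({"id": sg["id"], "reasons": reasons})
    return findings
```

The model then gets a short list of pre-flagged findings instead of 200 raw API responses.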
### Skills: Progressive Disclosure
15 domain skills (cost optimization, security analysis, network intelligence, etc.) load on-demand based on query context. Each skill gates which tools the agent can access and provides domain-specific guidance.
This means the model doesn't see irrelevant tool instructions. A cost question loads cost-optimization guidance and cost-related tools. The context stays focused.
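A simplified sketch of the gating idea, with keyword matching standing in for whatever matching logic you'd actually use and a hypothetical skill registry:

```python
SKILLS = {
    # Hypothetical registry: each skill gates tools and adds guidance.
    "cost_optimization": {
        "tools": ["cost_tool", "discover_resources"],
        "guidance": "Focus on unused resources and rightsizing.",
    },
    "security_analysis": {
        "tools": ["security_tool", "discover_resources"],
        "guidance": "Check ingress rules, encryption, and IAM exposure.",
    },
}

def load_skills(query, skills=SKILLS):
    """Return only the tools and guidance the query actually needs."""
    matched = [name for name in skills
               if any(word in query.lower() for word in name.split("_"))]
    tools = sorted({t for n in matched for t in skills[n]["tools"]})
    guidance = [skills[n]["guidance"] for n in matched]
    return tools, guidance
```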
### The Well-Architected Sub-Agent
For deep assessments, we spawn a separate Strands agent running Sonnet (the main agent runs Haiku). This sub-agent has its own tool set — 20+ check functions for EBS, RDS, IAM, S3 — and its own system prompt tuned for Well-Architected Framework analysis.
It's the one place where we use a more powerful model, and only when the user explicitly asks for an assessment.
## Strands SDK: What Actually Matters
After using Strands through three architectural iterations, here's what ended up being load-bearing:
**Prompt caching** (`CacheConfig(strategy="auto")`): Our system prompt is substantial — core orchestration plus dynamically loaded skills. Caching it across invocations cut latency and cost meaningfully.

**The Plugin API:** A single GovernancePlugin class registers hooks for before/after tool calls, after model calls, and after invocation. This gives us telemetry, token tracking, and post-response validation without touching the agent's core logic.

**`event.resume`:** After the agent finishes responding, a hook can inspect the output and inject a follow-up. We use this for orphan analysis validation — if the agent's response doesn't match our deterministic findings, the hook sets `event.resume` with a correction prompt and the agent self-corrects. No infinite loops, no separate validation agent.

**Streaming with `agent.stream_async()`:** Real-time progress in Slack. Users see which tools are running, partial results, and the final analysis as it generates. This turned a 30-second wait into a 30-second experience of watching the agent work.

**Agent cancellation:** Long governance scans can take a while. Users can cancel mid-flight, and the agent cleans up gracefully.

**Conversation summarization:** Governance conversations can run long — "now check the other account," "what about the network side." The SDK's `SummarizingConversationManager` keeps conversation history manageable while preserving critical context like account IDs and prior findings.
## AgentCore: Production Without the Ops
AgentCore handles the parts we didn't want to build:
- Runtime hosting: Docker container on Graviton, managed by AgentCore. No ECS cluster to maintain.
- Memory: Short-term and long-term session memory with episodic strategies. Cross-session learning — the agent remembers what it found in previous conversations and injects those reflections into new sessions.
- JWT authentication: Corporate identity provider integration for user identity. The agent knows who's asking and can scope responses to their permissions.
- Guardrails: Bedrock Guardrails filter content to prevent secrets leakage in governance responses.
## Bringing It to Slack with AG-UI
An agent that lives behind an API endpoint is useful. An agent that lives in the channel where your team already works is adopted.
We built a Slack bot as a separate service — Slack Bolt running in Socket Mode on ECS Fargate. When a user messages the bot, it calls the AgentCore-hosted agent over HTTPS. But early versions had a problem: governance scans take 15–30 seconds. Users would type a question, stare at a typing indicator, and wonder if anything was happening.
AgentCore supports AG-UI (Agent-User Interface), a streaming protocol that surfaces what the agent is doing in real time. We built a custom Strands-to-AG-UI adapter that translates the SDK's internal events into an SSE stream:
```
POST /invocations
Accept: text/event-stream

→ TOOL_CALL_START: discover_resources (scanning 3 accounts...)
→ TEXT_DELTA: "Found 47 security groups across..."
→ TOOL_CALL_START: security_tool (analyzing findings...)
→ TEXT_DELTA: "12 groups have unrestricted ingress..."
```
The same endpoint serves both modes. Accept: text/event-stream gets the SSE stream; a normal request gets synchronous JSON. The Slack bot consumes the stream and progressively updates the Slack message — users see tools firing, partial results appearing, and the final analysis building in real time.
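The negotiation itself is a one-line decision, and each streamed event is an ordinary SSE frame. A sketch (event names mirror the trace above; the helper names are ours):

```python
import json

def choose_mode(accept_header):
    """Pick streaming vs. synchronous JSON from the Accept header."""
    return "sse" if "text/event-stream" in (accept_header or "") else "json"

def format_sse(event_type, data):
    """Serialize one AG-UI-style event as a Server-Sent Events frame."""
    return f"event: {event_type}\ndata: {json.dumps(data)}\n\n"
```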
A few things we learned the hard way about streaming:
- Never clear your accumulated response text when a new tool call starts. The agent can call tools mid-response, and clearing the buffer silently drops everything it said before the tool call.
- Track unique tool stages, not individual calls, for progress display. Otherwise you flood the Slack message with duplicate updates.
- Track text offsets after both tool-call-start and tool-call-end events. Text can appear between tool calls, and missing either offset creates gaps in the streamed output.
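Those three lessons can be captured in one small accumulator. A sketch, not the actual adapter code:

```python
class StreamAccumulator:
    """Accumulate streamed text across tool calls without losing it."""

    def __init__(self):
        self.text = []    # every text delta, in arrival order
        self.stages = []  # unique tool stages for progress display

    def on_text_delta(self, delta):
        # Text can arrive before, between, or after tool calls;
        # always append, never reset.
        self.text.append(delta)

    def on_tool_call_start(self, tool_name):
        # Lesson 1: do NOT clear self.text here.
        # Lesson 2: track unique stages, not every individual call.
        if tool_name not in self.stages:
            self.stages.append(tool_name)

    def render(self):
        body = "".join(self.text)
        if not self.stages:
            return body
        return f"[{' → '.join(self.stages)}]\n{body}"
```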
The result: what used to be a 30-second black box is now a 30-second live feed of the agent working through the problem. Users trust it more because they can see it thinking.
## The Numbers
| Metric | Satellite Architecture | Current |
|---|---|---|
| Lines of code | ~25,000 | ~7,000 |
| Tool definitions exposed to LLM | 70+ | 25 (8 dispatchers + 17 individual) |
| Tool schema tokens per request | ~30K | ~10K |
| Infrastructure components | S3 Vectors (4 buckets), 52 indexes, Lambda scanners, Step Functions, 57 VPC endpoints | Config Aggregator (AWS-managed), 20 VPC endpoints |
| Data freshness | Minutes to hours (scan schedule) | Real-time (Config Aggregator) |
| Embedding costs | Per-resource per-scan | Zero |
| Live test pass rate | Inconsistent | 109/109 (100%) |
## What We'd Tell Ourselves Two Months Ago
"Not enough coverage" isn't always a good reason to build your own. We dismissed Config Aggregator early because our research showed it didn't cover every resource type we wanted to query. So we built an entire scanning pipeline to get 100% coverage. What we discovered later: Config Aggregator covers ~85% of resource types, Resource Explorer 2 fills most of the gaps, and a simple direct API fallback handles the rest. Three lines of fallback logic replaced 3,700 lines of scanner code. Perfect coverage wasn't worth the complexity cost.
Don't build orchestration on top of orchestration. Strands SDK is already a tool-use loop. Adding a five-stage pipeline on top of it added complexity without adding capability.
Token budget is an architecture constraint. 70 tools meant 30K tokens of schema per request. The dispatcher pattern wasn't just cleaner code — it was a 64% reduction in per-request cost.
Deterministic preprocessing, LLM synthesis. The best division of labor: code handles data collection and known-rule analysis. The LLM handles explanation, prioritization, and natural language. Don't ask the model to do what a SQL query can do better.
Ship the boring version first. Every complex architecture we built was an attempt to solve problems we hadn't actually encountered yet. The current system handles real governance queries from real users. That's the bar.
Built with AWS Strands Agents SDK and Amazon Bedrock AgentCore.

