I've spent the last 9 months shipping Bedrock AgentCore into four different ANZ enterprises (plus one internal PoC that crashed and burned).
This isn't a hello-world tutorial – it's the bruises, the invoices, and the 3 a.m. CloudWatch alarms that finally made the thing stick.
If you're about to promote an agent past the "demo for the board" stage, steal this checklist – it will save you at least one rollback.
the numbers we actually saw
| Pattern | Use-case | 10 k q/mo cost | p95 latency | Notes |
|---|---|---|---|---|
| Single agent | Simple Q&A | ~AUD 180 | 2.1 s | Hallucinated once traffic > 2 k/day |
| Supervisor + 3 subs | HR triage | ~AUD 420 | 4.3 s | 60 % less duplicate Lambda code |
| AgentCore Runtime | SRE co-pilot | ~AUD 620 | 3.8 s | GitOps deploy, full traces |
| Guardrail-wrapped | Student chat | ~AUD 520 | 4.9 s | PII blocked, compliance happy |
Supervisor pattern is the only one that survived a production spike without a hot-fix.
Single agents are great for a sprint demo – and terrible for anything that hits the internet.
Managed agents vs. AgentCore Runtime – pick one before 10 k users
I drew this on a whiteboard for our CFO after she saw the second invoice:
Rule we now write into every SoW:
PoC = managed. Day-1 prod = Runtime.
The moment you need a custom MCP tool or a side-car Lambda, the console becomes a drag.
Ground-truth data – skip it and you'll ship a liar
Our first Kindo chatbot went live with 37 manually-written examples.
Two weeks later a student asked "What grade do I need to pass?" and the agent calmly invented a 42 % cutoff (it's 50 %).
Cue 4 a.m. rollback.
We fixed it the boring way:
- Exported 18 k real (de-identified) chat logs.
- LLM-expanded edge cases: "give me 50 ways to ask about vacation pay".
- Human reviewed, 1 200 kept.
- Added sessionAttributes (studentID, semester) so the agent could look up live data.
Accuracy jumped from 67 % → 92 % and the support ticket queue dropped by half.
# pytest harness we run in CI
tests = json.load(open("ground_truth.json"))
for t in tests:
out = agent.invoke(t["input"], sessionAttributes=t["attrs"])
assert out["answer"] == t["expected"]
Supervisor pattern that actually compiles
Payroll bot rewrite: one supervisor + three specialised subs (policy, leave-balances, tickets).
60 % less copy-paste Lambda code, and we could unit-test each sub in isolation.
from agentcore import Agent, app
supervisor = Agent(
model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",
instructions="You are a router. Never answer directly – always delegate to the correct sub-agent."
)
@app.entrypoint
def lambda_handler(event, _):
return supervisor.invoke(event["prompt"])
Gateway MCP let us plug ServiceNow REST APIs without re-writing the OpenAPI schema – biggest time-saver of the sprint.
Guardrails – the checkbox that saved our audit
First deploy forgot guardrails.
Next day a student pasted their email + TFN into the chat and the agent happily parroted it back in the response.
Security team put a red sticker on my laptop.
Now we enforce org-level guardrails before any agent alias hits prod:
| Filter | Block % | Mask % | AUD / mo |
|---|---|---|---|
| PII (email, TFN) | 2.1 | 8.4 | 32 |
| Custom finance terms | 1.7 | 3.2 | 22 |
| Hate/violence | 0.3 | – | 12 |
| Total | 4.1 | 11.6 | 66 |
Cheap insurance.
IaC + observability – or you'll debug in the console at 2 a.m.
We template everything in CDK (Python). One cdk deploy spins up:
- AgentCore Runtime container
- Lambda layers for Powertools & boto3 latest
- X-Ray traces, CloudWatch dashboards, alarms
| Metric | Target | Alarm |
|---|---|---|
| Task success | ≥ 95 % | < 90 % |
| p95 latency | ≤ 5 s | > 10 s |
| Token spend | ≤ AUD 70/day | > AUD 140 |
| PII leak count | 0 | > 0 |
Routing loops show up as 30 s p99 spikes – impossible to spot without traces.
10-line deploy checklist we paste into every PR
- [ ] 200+ ground-truth conversations in
tests/ground_truth.json - [ ] Supervisor agent uses Sonnet; subs pinned to Haiku for cost
- [ ] Guardrails alias attached (BLOCK PII, MASK custom)
- [ ]
agentcore deploy --stage prod --approve - [ ] Powertools tracer + metrics on every handler
- [ ] CloudWatch alarm for "PII leak > 0" – page the on-call
- [ ] A/B toggle for Haiku fallback if token spend > budget
Starter repo we fork every time:
https://github.com/awslabs/amazon-bedrock-agentcore-samples
80 % of the boilerplate is done in 90 min – the rest is ground-truth grunt work.
If you're riding the agent hype-wave right now, remember: the demo is the easy 10 %.
These notes are for the other 90 % – the invoices, the guardrails, the 3 a.m. pages.
Steal what you need, add your own scars, and ship something that won't hallucinate when the CFO asks it a question.
Happy building, and may your p95 always be under 5 s.

Top comments (0)