Speaker: Tan Xin @ AWS Amarathon 2025
Summary by Amazon Nova
Challenge
Traditional Workflow During a Failure
Check cloud resource status
Failure occurs
View incident history
Check configuration
Check alerts
Assign work order
Analyze logs
Hypothesize root cause
Search for solutions
Check metrics
Analyze dependencies
View recent updates
Search call chains
View O&M manual
Query service dashboard
Execute mitigation measures
Notify colleagues
Query abnormal metrics
Read manual
Monitor recovery status
Core Challenges
Timely loss mitigation: resolve issues within an acceptable timeframe
Fault isolation: limit failures within isolation boundaries to prevent cascading effects on other services, reducing the scope of failure impact
Ensure services meet user expectations and SLAs
Solution Evolution
- Evolution from single agent to multi-agent solutions
Ideal Solution
System Notifications and Alerts
Alarm Metric: 5xx rate over 30%
Alarm Detail: Service, Endpoint, Triggered Time
Supplementary Root Cause Analysis
Key Findings: API error rate 100%, DB no error, S3 bucket policy was updated.
Immediate Action: Check whether the S3 bucket policy was set to Deny
Confidence: 90%
Confirm Automatic Operations
Message: System Recovered.
RCA: S3 bucket policy was set to Deny
MTTR: 5 mins
Reduce failure recovery time from hours to minutes
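The alarm rule and the MTTR figure above can be illustrated with a small sketch. It is not taken from the talk: the `Alarm` class and the `check_5xx_alarm` and `mttr_minutes` helpers are hypothetical names, and the only values borrowed from the slides are the 30% threshold and the alarm fields (service, endpoint, triggered time).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Alarm:
    """Hypothetical alarm record carrying the fields listed on the slide."""
    service: str
    endpoint: str
    triggered_at: datetime
    error_rate: float


def check_5xx_alarm(
    service: str,
    endpoint: str,
    status_codes: List[int],
    now: datetime,
    threshold: float = 0.30,  # "5xx rate over 30%" from the slide
) -> Optional[Alarm]:
    """Return an Alarm when the share of 5xx responses exceeds the threshold."""
    if not status_codes:
        return None
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    rate = errors / len(status_codes)
    if rate > threshold:
        return Alarm(service, endpoint, now, rate)
    return None


def mttr_minutes(triggered_at: datetime, recovered_at: datetime) -> float:
    """Time to recovery for a single incident, in minutes."""
    return (recovered_at - triggered_at).total_seconds() / 60
```

An agent-based pipeline would attach the root cause analysis and the confirmed automatic operation to such an alarm; the sketch covers only the detection rule and the recovery-time arithmetic.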
SRE Expert Work Scope
Why did my User Service error rate reach 5% in the past hour?
Because the RDS MySQL instance hit 12 "connection limit exceeded" errors in the past hour
Become familiar with the current system
Find service correlations
Analyze logs
Analyze audit logs
Analyze configuration
Analyze metrics
Consider Two Questions
If the system complexity is high, the troubleshooting workflow is long, and the log volume is large, can a single agent work smoothly?
Is it possible to clone SRE expert experience into agents to replace the Q&A method, allowing agents to make autonomous decisions and actions?
Multi-Agent Architecture Design
"Planner" creates workflows
"Executor" is responsible for executing assigned tasks
"Evaluator" assesses whether each step's result is useful and returns that assessment to the "Planner" for subsequent planning
Final results are also reviewed by the "Evaluator" before being returned
The "Planner" can adjust the process based on feedback from the "Evaluator"
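The Planner / Executor / Evaluator loop described above can be sketched in a few lines. This is a minimal, framework-free illustration rather than the speaker's implementation: the class shapes, the `investigate` function, the tool names, and the toy "contains the word anomaly" usefulness rule are all assumptions made for the example. The tool outputs reuse the key findings from the earlier slide.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    task: str
    output: str
    useful: bool


class Planner:
    """Holds an ordered plan and revises it from Evaluator feedback."""

    def __init__(self, plan: List[str], follow_ups: Dict[str, List[str]]):
        self.queue = list(plan)
        self.follow_ups = follow_ups  # task -> tasks to add if that task was useful

    def next_task(self) -> Optional[str]:
        return self.queue.pop(0) if self.queue else None

    def feedback(self, step: Step) -> None:
        # A useful step moves its follow-up tasks to the front of the queue.
        if step.useful:
            self.queue = self.follow_ups.get(step.task, []) + self.queue


class Executor:
    """Runs the tool assigned to a task (hypothetical tools, no real API calls)."""

    def __init__(self, tools: Dict[str, Callable[[], str]]):
        self.tools = tools

    def run(self, task: str) -> str:
        return self.tools[task]()


class Evaluator:
    def assess(self, task: str, output: str) -> Step:
        # Toy rule (assumption): a result is useful if it reports an anomaly.
        return Step(task, output, useful="anomaly" in output)

    def review(self, steps: List[Step]) -> str:
        # Final review before the result is returned to the caller.
        findings = [s.output for s in steps if s.useful]
        return "; ".join(findings) if findings else "no root cause found"


def investigate(planner: Planner, executor: Executor, evaluator: Evaluator,
                max_steps: int = 10) -> Tuple[str, List[Step]]:
    history: List[Step] = []
    for _ in range(max_steps):
        task = planner.next_task()
        if task is None:
            break
        step = evaluator.assess(task, executor.run(task))
        planner.feedback(step)  # the Planner adjusts the plan from feedback
        history.append(step)
    return evaluator.review(history), history


tools = {
    "check_api_metrics": lambda: "anomaly: API error rate 100%",
    "check_db": lambda: "DB no error",
    "check_s3_policy": lambda: "anomaly: S3 bucket policy was updated",
}
planner = Planner(
    plan=["check_api_metrics", "check_db"],
    follow_ups={"check_api_metrics": ["check_s3_policy"]},
)
report, history = investigate(planner, Executor(tools), Evaluator())
```

In a real system each role would be backed by a model call with its own prompt, tools, and permissions, which is where the clearer responsibility boundaries of the multi-agent design come from. The `max_steps` cap is a simple guard against a plan that never terminates.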
Multi-Agent vs Single Agent
Suitable for more complex tasks
Clearer responsibility and permission boundaries
Easier context engineering
More convenient scaling
AgentCore Best Practices
Introduction to Agent-Specific Runtime Environment
Challenges from "trial" to "implementation"
Challenges from PoC to production environment implementation
Performance
Elasticity
Security
Creating business value
Compliance
Agent Runtime Environment v1.0
INTERFACES & PROTOCOLS (MCP/A2A)
Agent Deployment
Agent Framework
Large Language Model
Memory
Prompts
Tools/Resources
Guardrails
Observability
Evaluation
Agent Runtime Environment v2.0
INTERFACES & PROTOCOLS (MCP/A2A)
Agent Deployment
Amazon Bedrock AgentCore
Agent Framework
Large Language Model
Runtime
Memory
Prompts
Tools/Resources
Identity
Tools
Guardrails
Gateway
Observability
Evaluation
SPECIALIZED
- Amazon Q Agents
FULLY-MANAGED
- Amazon Bedrock Agents
DIY
- OSS Frameworks, Strands Agents SDK