Multi-Agent on AgentCore: Accelerating Fault Diagnosis and Recovery in Distributed Systems

#aws #cloud #beginners #productivity

Speaker: Tan Xin @ AWS Amarathon 2025

Summary by Amazon Nova

Challenge

Traditional Workflow During a Failure

Core Challenges

Timely loss mitigation: resolve issues within an acceptable timeframe
Fault isolation: limit failures within isolation boundaries to prevent cascading effects on other services, thereby reducing the scope of failure impact
Ensure services meet user expectations and SLAs

Solution Evolution

Ideal Solution

System Notifications and Alerts

Supplementary Root Cause Analysis

Key Findings: the API error rate is 100%, the database shows no errors, and an S3 bucket policy was updated.

Confirm Automatic Operations

SRE Expert Work Scope

Q: Why did my User Service error rate reach 5% in the past hour?
A: Because the RDS MySQL instance exceeded its connection limit 12 times in the past hour.
Become familiar with the current system
Find service correlations
Analyze logs
Analyze audit logs
Analyze configuration
Analyze metrics
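
To make the "analyze logs" step concrete, here is a minimal sketch of how the answer in the example above could be derived: count how often a given error message appears within the past hour. The log format (an ISO timestamp followed by a message), the sample lines, and the function name `count_recent` are all assumptions made for illustration; they do not come from the talk. The message "Too many connections" is the text MySQL reports when its connection limit is reached.

```python
from datetime import datetime, timedelta


def count_recent(lines, needle, now, window=timedelta(hours=1)):
    """Count log lines containing `needle` whose timestamp falls in [now - window, now]."""
    count = 0
    for line in lines:
        # Assumed format: "<ISO timestamp> <message>"
        stamp, _, message = line.partition(" ")
        when = datetime.fromisoformat(stamp)
        if now - window <= when <= now and needle in message:
            count += 1
    return count


# Made-up sample data, for illustration only.
SAMPLE = [
    "2025-01-01T09:30:00 Too many connections",  # older than one hour
    "2025-01-01T10:15:00 Too many connections",
    "2025-01-01T10:20:00 query ok",
    "2025-01-01T10:45:00 Too many connections",
]
NOW = datetime(2025, 1, 1, 11, 0, 0)

RECENT_ERRORS = count_recent(SAMPLE, "Too many connections", NOW)
```

An agent that answers questions like the one above would wrap this kind of query as a tool and combine its result with metrics and configuration data.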

Consider Two Questions

If the system complexity is high, the troubleshooting workflow is long, and the log volume is large, can a single agent work smoothly?
Is it possible to encode SRE experts' experience into agents so that, instead of answering questions one at a time, the agents make decisions and take actions autonomously?

Multi-Agent Architecture Design

"Planner" creates workflows
"Executor" is responsible for executing assigned tasks
"Evaluator" is responsible for assessing whether each step's result is beneficial, returning to the "Planner" for subsequent planning
Final results must also be reviewed by the "Evaluator" before they are returned
The "Planner" can adjust the process based on feedback from the "Evaluator"
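
The roles above can be sketched as a small control loop. This is only an illustrative sketch, not the speaker's implementation and not an AgentCore API: the class names, the task names, the stub tools, and the "no error" usefulness rule are all hypothetical, and a real system would back each role with an LLM-driven agent. The stub outputs reuse the key findings quoted earlier in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepResult:
    task: str
    output: str
    useful: bool = False


class Planner:
    """Creates the workflow and adjusts it based on evaluator feedback."""

    def __init__(self) -> None:
        self.queue: list[str] = []

    def create_workflow(self, incident: str) -> None:
        # Hypothetical fixed starting plan; a real planner would reason about the incident.
        self.queue = ["check_metrics", "check_logs"]

    def next_task(self) -> Optional[str]:
        return self.queue.pop(0) if self.queue else None

    def adjust(self, result: StepResult) -> None:
        # If the logs gave nothing useful, widen the search to audit logs.
        if result.task == "check_logs" and not result.useful:
            self.queue.append("check_audit_logs")


class Executor:
    """Runs the task it is assigned using a table of tools."""

    def __init__(self, tools: dict[str, Callable[[], str]]) -> None:
        self.tools = tools

    def run(self, task: str) -> StepResult:
        return StepResult(task=task, output=self.tools[task]())


class Evaluator:
    """Judges each step and reviews the final result before it is returned."""

    def assess(self, result: StepResult) -> StepResult:
        # Toy rule: a step is useful unless it reports "no error".
        result.useful = "no error" not in result.output.lower()
        return result

    def review(self, results: list[StepResult]) -> list[str]:
        return [r.output for r in results if r.useful]


def diagnose(incident: str, tools: dict[str, Callable[[], str]], max_steps: int = 10) -> list[str]:
    planner, executor, evaluator = Planner(), Executor(tools), Evaluator()
    planner.create_workflow(incident)
    results: list[StepResult] = []
    for _ in range(max_steps):  # hard cap so the loop always terminates
        task = planner.next_task()
        if task is None:
            break
        result = evaluator.assess(executor.run(task))
        planner.adjust(result)  # feedback flows back to the planner
        results.append(result)
    return evaluator.review(results)  # final review before returning


# Stub tools standing in for real metric, log, and audit-log queries.
STUB_TOOLS = {
    "check_metrics": lambda: "API error rate 100%",
    "check_logs": lambda: "DB no error",
    "check_audit_logs": lambda: "S3 bucket policy was updated",
}

FINDINGS = diagnose("API errors", STUB_TOOLS)
```

In this run the log check is judged not useful, so the planner appends an audit-log check, which is the kind of feedback-driven adjustment the list above describes.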

Multi-Agent vs Single Agent