
Eliana Lam for AWS Community On Air


Multi-Agent on AgentCore: Accelerating Fault Diagnosis and Recovery in Distributed Systems

Speaker: Tan Xin @ AWS Amarathon 2025

Summary by Amazon Nova



Challenge

Traditional Workflow During a Failure

  • Check cloud resource status

  • Failure occurs

  • View incident history

  • Check configuration

  • Check alerts

  • Assign work order

  • Analyze logs

  • Hypothesize root cause

  • Search for solutions

  • Check metrics

  • Analyze dependencies

  • View recent updates

  • Search call chains

  • View O&M manual

  • Query service dashboard

  • Execute mitigation measures

  • Notify colleagues

  • Query abnormal metrics

  • Read manual

  • Monitor recovery status

Core Challenges

  • Timely loss mitigation: resolve issues within an acceptable timeframe so that services meet user expectations and SLAs

  • Fault isolation: limit failures within isolation boundaries to prevent cascading effects on other services, thereby reducing the scope of failure impact



Solution Evolution

  • Evolution from single agent to multi-agent solutions

Ideal Solution

System Notifications and Alerts

  • Alarm Metric: 5xx rate over 30%

  • Alarm Detail: Service, Endpoint, Triggered Time
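
As an illustration of the alert above, here is a minimal Python sketch of a 5xx-rate check. The 30% threshold and the alarm fields (service, endpoint, triggered time) come from the slide. The `Alarm` class, the `check_5xx_alarm` function, and the idea of passing in a list of status codes are assumptions made for this example, not the speaker's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Alarm:
    # Fields mirror the "Alarm Detail" bullet; error_rate is added for context.
    service: str
    endpoint: str
    triggered_time: datetime
    error_rate: float


def check_5xx_alarm(service, endpoint, status_codes, threshold=0.30, now=None) -> Optional[Alarm]:
    """Return an Alarm when the share of 5xx responses exceeds the threshold."""
    if not status_codes:
        return None  # no traffic in the window, nothing to evaluate
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    rate = errors / len(status_codes)
    if rate > threshold:
        return Alarm(service, endpoint, now or datetime.now(timezone.utc), rate)
    return None


# Hypothetical window of responses: 2 of 4 are 5xx, so the alarm fires.
alarm = check_5xx_alarm("user-service", "/login", [200, 500, 503, 200])
```

In a real deployment this check would more likely live in the monitoring system than in application code; the sketch only shows the arithmetic behind the alarm.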

Supplementary Root Cause Analysis

  • Key Findings: API error rate 100%, no DB errors, S3 bucket policy was updated

  • Immediate Action: Check whether the S3 bucket policy was set to Deny

  • Confidence: 90%
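
One way to picture the root cause analysis result above is as a small record whose confidence value decides whether an operator is asked to confirm an automatic action. The following Python sketch assumes this; the `RcaFinding` class, the `next_step` function, the returned labels, and the 0.8 cut-off are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class RcaFinding:
    key_findings: list[str]
    immediate_action: str
    confidence: float  # 0.0 to 1.0


def next_step(finding: RcaFinding, min_confidence: float = 0.8) -> str:
    """Decide what to do with a finding; the 0.8 cut-off is an assumed value."""
    if finding.confidence >= min_confidence:
        return "ask_operator_to_confirm"
    return "keep_investigating"


# Values taken from the slide's example.
finding = RcaFinding(
    key_findings=["API error rate 100%", "DB no error", "S3 bucket policy was updated"],
    immediate_action="Check whether the S3 bucket policy was set to Deny",
    confidence=0.90,
)
decision = next_step(finding)
```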

Confirm Automatic Operations

  • Message: System Recovered.

  • RCA: S3 bucket policy was set to Deny

  • MTTR: 5 mins

  • Reduce failure recovery time from hours to minutes

SRE Expert Work Scope

  • Example question: Why did my User Service error rate reach 5% in the past hour?

  • Example answer: Because the RDS MySQL instance experienced 12 connection limit exceeded issues in the past hour

  • Become familiar with the current system

  • Find service correlations

  • Analyze logs

  • Analyze audit logs

  • Analyze configuration

  • Analyze metrics

Consider Two Questions

  1. If the system complexity is high, the troubleshooting workflow is long, and the log volume is large, can a single agent work smoothly?

  2. Can SRE expert experience be encoded into agents so that, instead of answering questions one at a time, they make autonomous decisions and take action?



Multi-Agent Architecture Design

  • "Planner" creates workflows

  • "Executor" is responsible for executing assigned tasks

  • "Evaluator" assesses whether each step's result is useful and returns that assessment to the "Planner" for subsequent planning

  • Final results are also reviewed by the "Evaluator" before being returned

  • The "Planner" can adjust the process based on feedback from the "Evaluator"
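
The Planner, Executor, and Evaluator roles described above can be sketched as a simple loop in Python. This is a simplified illustration with hard-coded stand-ins for the three agents, not the architecture shown in the talk: the function names, task names, and canned results are assumptions, and a real system would back each role with an LLM-driven agent and add the final Evaluator review.

```python
def run_diagnosis(plan_next, execute, evaluate, max_steps=10):
    """Loop: Planner picks a task, Executor runs it, Evaluator scores the result."""
    history = []  # (task, result, useful) tuples fed back to the Planner
    for _ in range(max_steps):
        task = plan_next(history)
        if task is None:  # Planner decides nothing more is needed
            break
        result = execute(task)
        useful = evaluate(task, result)
        history.append((task, result, useful))
    return history


def plan_next(history):
    # Hypothetical Planner: stop once the Evaluator flagged a useful finding,
    # otherwise pick the next task that has not been tried yet.
    if any(useful for _, _, useful in history):
        return None
    done = {task for task, _, _ in history}
    for task in ("check_metrics", "analyze_logs", "analyze_audit_logs"):
        if task not in done:
            return task
    return None


def execute(task):
    # Hypothetical Executor: canned findings taken from the slide's example.
    canned = {
        "check_metrics": "API error rate 100%",
        "analyze_logs": "DB no error",
        "analyze_audit_logs": "S3 bucket policy was updated",
    }
    return canned[task]


def evaluate(task, result):
    # Hypothetical Evaluator: a recent policy change counts as a useful lead.
    return "policy was updated" in result


history = run_diagnosis(plan_next, execute, evaluate)
```

Keeping the three roles as separate callables reflects the separation of responsibilities in the list above: the Planner only sees the history, so it can change course based on the Evaluator's feedback without knowing how tasks are executed.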

Multi-Agent vs Single Agent

  1. Suitable for more complex tasks

  2. Clearer responsibility and permission boundaries

  3. Easier context engineering

  4. More convenient scaling

AgentCore Best Practices

Introduction to Agent-Specific Runtime Environment

  • Challenges in moving from trial (PoC) to production implementation

  • Performance

  • Elasticity

  • Security

  • Creating business value

  • Compliance



Agent Runtime Environment v1.0

  • INTERFACES & PROTOCOLS (MCP/A2A)

  • Agent Deployment

  • Agent Framework

  • Large Language Model

  • Memory

  • Prompts

  • Tools/Resources

  • Guardrails

  • Observability

  • Evaluation

Agent Runtime Environment v2.0

  • INTERFACES & PROTOCOLS (MCP/A2A)

  • Agent Deployment

  • Amazon Bedrock

  • AgentCore

  • Agent Framework

  • Large Language Model Runtime

  • Memory

  • Prompts

  • Tools/Resources

  • Identity Tools

  • Guardrails

  • Gateway

  • Observability

  • Evaluation

SPECIALIZED

  • Amazon Q Agents

FULLY-MANAGED

  • Amazon Bedrock Agents

DIY

  • OSS Frameworks, Strands Agents SDK


Team:

AWS FSI Customer Acceleration Hong Kong

AWS Amarathon Fan Club

AWS Community Builder Hong Kong
