Topology-Aware AI Agents for Observability: Automating SLO Breach Root Cause Analysis
Modern cloud systems are complex distributed architectures where a single user journey may depend on dozens of services running across multiple infrastructure layers.
When a Service Level Objective (SLO) breach occurs, identifying the root cause often requires navigating logs, metrics, traces, service dependencies, and infrastructure relationships.
In many organizations, this investigation is still manual and time-consuming.
In a recent project, I explored how AI agents can automate incident investigation by combining:
- Observability data
- Service topology
- Kubernetes infrastructure context
- Historical incident knowledge
- Graph-based reasoning
This approach reduced investigation time from 20–30 minutes to under a minute for certain SLO breaches.
This article introduces the concept of Topology-Aware AI Agents and how such a system can be implemented using AWS services and graph-based system modeling.
The Problem: Traditional Incident Investigation
When an SLO breach occurs, SRE teams typically perform the following steps:
- Identify the impacted user journey
- Check monitoring dashboards
- Inspect logs and traces
- Identify impacted services
- Traverse upstream and downstream dependencies
- Correlate incidents with infrastructure problems
In large microservice environments, this investigation becomes difficult because:
- Logs lack system-wide context
- Metrics show symptoms but not relationships
- Service dependencies are hard to traverse quickly
- Infrastructure and application layers are often disconnected
Even with powerful observability tools, humans still perform most correlation tasks manually.
Why Logs Alone Are Not Enough for AI
Many AI troubleshooting systems rely on Retrieval-Augmented Generation (RAG) over logs or documentation.
However, logs alone do not provide system relationships.
Example log entry:
Payment API latency spike
Without topology context, an AI system cannot determine:
- Which upstream service triggered the issue
- Which downstream dependency failed
- Whether the issue originated from infrastructure or application layers
To solve this, we need structural knowledge about the system architecture.
Introducing Topology-Aware AI Agents
A Topology-Aware AI Agent combines three major sources of context:
Observability Data
+
Service Topology
+
Historical Incident Knowledge
The agent uses this combined knowledge to automatically:
- Identify impacted services
- Traverse dependency graphs
- Correlate incidents
- Suggest root causes
This transforms incident troubleshooting from log searching into graph-based reasoning.
Platform Context: Microservices Running on Amazon EKS
In this environment, the application platform was built on Kubernetes, running on Amazon Elastic Kubernetes Service (EKS).
Each user request travels across multiple layers:
User Request
↓
API Gateway / Entry Service
↓
Microservices running on Kubernetes
↓
Databases / external dependencies
Each microservice runs in containers deployed in Kubernetes pods.
To enable automated incident analysis, the system needed visibility into:
- Cloud infrastructure
- Kubernetes resources
- Application services
- Runtime service interactions
- Observability signals
These relationships were modeled in a graph database.
Building the Service Relationship Graph
The system used Neo4j to build a knowledge graph representing the full platform topology.
The graph captured relationships across multiple layers:
- Cloud infrastructure
- Kubernetes platform
- Application services
- Service interactions
- Historical incidents
This structure allowed the AI agent to reason about how failures propagate across the system.
Modeling the Infrastructure Layer
The first layer of the graph represents the cloud infrastructure.
Example nodes:
Cloud Provider
AWS Account
Region
Availability Zone
Host (EC2)
Example relationships:
AWS Account
│
DEPLOYS
▼
EKS Cluster
│
RUNS_ON
▼
EC2 Worker Node
This enables the system to correlate incidents with infrastructure-level problems such as:
- node failures
- CPU saturation
- network issues
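As a minimal sketch, these infrastructure relationships can be represented as typed edges before they are loaded into Neo4j. The node names and relationship types below are illustrative, not the production schema:

```python
# Minimal in-memory sketch of the infrastructure layer as typed edges.
# Node and relationship names are illustrative, not the actual schema.
edges = [
    ("AWS Account", "DEPLOYS", "EKS Cluster"),
    ("EKS Cluster", "RUNS_ON", "EC2 Worker Node"),
    ("EC2 Worker Node", "IN_ZONE", "us-east-1a"),
]

def neighbors(source, rel_type):
    """Follow one relationship type outward from a node."""
    return [dst for src, rel, dst in edges if src == source and rel == rel_type]

# Which worker nodes does the cluster run on?
print(neighbors("EKS Cluster", "RUNS_ON"))  # → ['EC2 Worker Node']
```

The same edge list can later be translated into Cypher `MERGE` statements to populate the real graph.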
Modeling the Kubernetes Platform
The next layer represents Kubernetes resources running on the EKS cluster.
Example nodes:
EKS Cluster
Namespace
Pod
Container
Process Group
Example relationships:
EKS Cluster
│
CONTAINS
▼
Namespace
│
CONTAINS
▼
Pod
│
RUNS
▼
Container
Each container instance is mapped to a process group representing a running microservice instance.
This structure allows the graph to capture runtime relationships between services and infrastructure nodes.
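One useful query over this containment chain is resolving which cluster a given container ultimately belongs to. A minimal sketch (with hypothetical resource names) walks child-to-parent links upward:

```python
# Sketch: resolve the owning cluster of a container by walking the
# Kubernetes containment chain upward. Resource names are hypothetical.
parent = {
    "checkout-container": ("RUNS", "checkout-pod-7f9c"),
    "checkout-pod-7f9c": ("CONTAINS", "checkout-namespace"),
    "checkout-namespace": ("CONTAINS", "prod-eks-cluster"),
}

def owning_cluster(node):
    """Follow parent links until the top of the containment chain."""
    while node in parent:
        _, node = parent[node]
    return node

print(owning_cluster("checkout-container"))  # → prod-eks-cluster
```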
Modeling Application Services
At the application level, the graph represents each microservice as a service node.
Example nodes:
Service
API
Database
External Dependency
Services are connected to the runtime processes executing them.
Example relationship:
Checkout Service
│
RUNS_AS
▼
Process Group
│
HOSTED_ON
▼
Kubernetes Pod
This mapping enables the system to trace incidents from application failures down to infrastructure components.
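This application-to-infrastructure trace can be sketched as following a fixed chain of relationship types. The `SCHEDULED_ON` edge and the concrete node names are assumptions added for illustration:

```python
# Sketch: trace a service down to the infrastructure it runs on by
# following a chain of edges. Edge types and node names are illustrative;
# SCHEDULED_ON is an assumed pod-to-node relationship.
edges = {
    ("Checkout Service", "RUNS_AS"): "checkout-process-group",
    ("checkout-process-group", "HOSTED_ON"): "checkout-pod-7f9c",
    ("checkout-pod-7f9c", "SCHEDULED_ON"): "ec2-worker-node-1",
}

def trace_down(service, path_rels):
    """Follow a fixed chain of relationship types from a service node."""
    node, path = service, [service]
    for rel in path_rels:
        node = edges[(node, rel)]
        path.append(node)
    return path

print(trace_down("Checkout Service", ["RUNS_AS", "HOSTED_ON", "SCHEDULED_ON"]))
```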
Modeling Caller–Callee Relationships
One of the most critical aspects of the topology graph is capturing service interaction flows.
Microservices communicate through APIs, forming caller–callee relationships.
Example:
Checkout Service
│
CALLS
▼
Payment Service
│
CALLS
▼
Payment Database
These relationships represent the actual runtime service communication paths.
By modeling these relationships, the AI agent can identify:
- downstream dependencies
- cascading failures
- shared services impacting multiple user journeys
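Finding every downstream dependency of a breached service is a breadth-first traversal over `CALLS` edges. A minimal sketch with illustrative service names:

```python
from collections import deque

# Sketch: find all downstream dependencies of a service via breadth-first
# traversal over CALLS edges. Service names are illustrative.
calls = {
    "Checkout Service": ["Payment Service", "Inventory Service"],
    "Payment Service": ["Payment Database"],
    "Inventory Service": [],
    "Payment Database": [],
}

def downstream(service):
    """Return all services reachable via CALLS edges, nearest first."""
    seen, queue, order = {service}, deque([service]), []
    while queue:
        for callee in calls.get(queue.popleft(), []):
            if callee not in seen:
                seen.add(callee)
                order.append(callee)
                queue.append(callee)
    return order

print(downstream("Checkout Service"))
# → ['Payment Service', 'Inventory Service', 'Payment Database']
```

In the real system the equivalent traversal would be a Cypher path query against Neo4j rather than an in-memory walk.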
Linking Observability Data to the Graph
Observability signals such as logs and errors are attached to graph nodes.
Example:
Payment Service
│
HAS_ERROR
▼
Timeout Exception
Infrastructure events can also be attached:
EC2 Worker Node
│
HAS_EVENT
▼
CPU Spike
This allows the agent to correlate:
- infrastructure issues
- application errors
- service dependencies
within a single reasoning model.
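The correlation step can be sketched as joining the two kinds of attached signals through a shared infrastructure node. The data below is illustrative:

```python
# Sketch: correlate an application error with an infrastructure event by
# checking whether both attach to nodes on the same host. Data is illustrative.
errors = {"Payment Service": ["Timeout Exception"]}
events = {"ec2-worker-node-1": ["CPU Spike"]}
hosted_on = {"Payment Service": "ec2-worker-node-1"}

def correlated_signals(service):
    """Pair a service's errors with events on the host it runs on."""
    host = hosted_on.get(service)
    return [(err, evt) for err in errors.get(service, [])
                       for evt in events.get(host, [])]

print(correlated_signals("Payment Service"))
# → [('Timeout Exception', 'CPU Spike')]
```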
Learning from Historical Incidents
Each investigated incident is also stored in the graph.
Example structure:
Incident
├ impacted service
├ root cause
├ infrastructure correlation
└ resolution
Over time, this builds a knowledge graph of operational incidents.
The AI agent can then detect patterns such as:
- recurring failures
- common dependency issues
- infrastructure patterns impacting multiple services
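Retrieving relevant history can be sketched as matching a new breach against stored incident records by service and error signature. The records and field names here are hypothetical:

```python
# Sketch: find historical incidents matching a new breach's service/error
# signature. Incident records and field names are hypothetical.
history = [
    {"service": "Payment Service", "error": "Timeout Exception",
     "root_cause": "Database connection pool exhaustion"},
    {"service": "Search Service", "error": "OOMKilled",
     "root_cause": "Memory limit set too low"},
]

def similar_incidents(service, error):
    """Return past incidents with the same service/error signature."""
    return [i for i in history if i["service"] == service and i["error"] == error]

matches = similar_incidents("Payment Service", "Timeout Exception")
print(matches[0]["root_cause"])  # → Database connection pool exhaustion
```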
Architecture Overview
A simplified architecture for this approach looks like this:
SLO Breach Alert
│
▼
Event Trigger (Monitoring / EventBridge)
│
▼
Incident AI Agent
│
├── Service Topology Graph (Neo4j)
├── Observability Data (Logs / Traces)
└── Historical Incident Knowledge
│
▼
LLM Reasoning
│
▼
Root Cause Hypothesis
AWS services that can support this architecture include:
- Amazon EKS
- AWS Lambda
- Amazon EventBridge
- Amazon Bedrock
- Amazon OpenSearch
- Amazon Neptune (as a managed graph alternative)
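The entry point of the flow above can be sketched as a Lambda-style handler wired to an EventBridge SLO-breach rule. The event shape and field names are assumptions for illustration, not an actual monitoring payload:

```python
import json

# Sketch of a Lambda-style handler triggered by an EventBridge SLO-breach
# rule. The event shape and field names are assumed for illustration.
def handler(event, context=None):
    detail = event.get("detail", {})
    service = detail.get("impacted_service", "unknown")
    # In the full system this step would query the topology graph, fetch
    # observability signals, and invoke the LLM for a hypothesis.
    return {"statusCode": 200,
            "body": json.dumps({"investigating": service})}

result = handler({"detail": {"impacted_service": "Checkout Service"}})
print(result["body"])
```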
Agent Workflow
When a new SLO breach occurs, the AI agent performs the following steps.
Step 1 — Detect SLO Breach
Monitoring tools trigger an alert event.
Step 2 — Identify Impacted Services
The agent queries the service topology graph.
Step 3 — Traverse Dependencies
The graph traversal identifies:
- upstream services
- downstream dependencies
- infrastructure nodes
Step 4 — Retrieve Observability Signals
Logs and errors are retrieved from observability platforms.
Step 5 — LLM Reasoning
Structured context is sent to the LLM.
Example prompt:
SLO breach detected in Checkout Service
Impacted services:
Checkout Service
Payment Service
Payment Database
Recent errors:
Timeout errors in Payment Service
Historical incident:
Database connection pool exhaustion
The LLM then generates a root cause hypothesis.
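Assembling that structured context can be sketched as a simple template filled from the traversal results. The template wording is illustrative:

```python
# Sketch: assemble the structured context block sent to the LLM from the
# traversal and retrieval results. Template wording is illustrative.
def build_prompt(breach_service, impacted, errors, history):
    lines = [f"SLO breach detected in {breach_service}", "", "Impacted services:"]
    lines += impacted
    lines += ["", "Recent errors:"] + errors
    lines += ["", "Historical incident:"] + history
    return "\n".join(lines)

prompt = build_prompt(
    "Checkout Service",
    ["Checkout Service", "Payment Service", "Payment Database"],
    ["Timeout errors in Payment Service"],
    ["Database connection pool exhaustion"],
)
print(prompt.splitlines()[0])  # → SLO breach detected in Checkout Service
```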
Results from the Prototype
In the prototype implementation:
Manual investigation time:
20–30 minutes
AI-assisted investigation:
Under 1 minute
For a specific platinum user journey SLO, the agent achieved:
~52% correlation accuracy between SLO breaches and underlying service problems.
While not perfect, it significantly accelerates incident triage.
Why Graph-Based Observability Matters
Traditional observability focuses on:
- metrics
- logs
- traces
However, modern systems also require relationship awareness.
Graph-based models enable:
- dependency reasoning
- cross-service correlation
- historical incident learning
Combining graph knowledge with LLM reasoning enables a new class of systems:
AI-assisted incident response agents.
Future Directions
This concept can evolve further with:
- autonomous remediation agents
- continuous incident learning
- multi-agent observability systems
- integration with CI/CD pipelines
As distributed architectures continue to grow in complexity, topology-aware AI agents may become an essential part of SRE operations.
Final Thoughts
AI-powered incident investigation is still in its early stages.
However, combining:
- observability data
- service topology graphs
- Kubernetes infrastructure knowledge
- historical incident intelligence
- LLM reasoning
creates a powerful approach to automated root cause analysis.
Topology-aware AI agents represent a promising direction for improving SRE productivity and incident response time in modern cloud-native systems.
If you're exploring AI for SRE, observability, or incident automation, I would love to hear your thoughts or experiences.