Roopa Venkatesh

Topology-Aware AI Agents for Observability: Automating SLO Breach Root Cause Analysis

Modern cloud systems are complex distributed architectures where a single user journey may depend on dozens of services running across multiple infrastructure layers.

When a Service Level Objective (SLO) breach occurs, identifying the root cause often requires navigating logs, metrics, traces, service dependencies, and infrastructure relationships.

In many organizations, this investigation is still manual and time-consuming.

In a recent project, I explored how AI agents can automate incident investigation by combining:

  • Observability data
  • Service topology
  • Kubernetes infrastructure context
  • Historical incident knowledge
  • Graph-based reasoning

This approach reduced investigation time from 20–30 minutes to under a minute for certain SLO breaches.

This article introduces the concept of Topology-Aware AI Agents and how such a system can be implemented using AWS services and graph-based system modeling.


The Problem: Traditional Incident Investigation

When an SLO breach occurs, SRE teams typically perform the following steps:

  1. Identify the impacted user journey
  2. Check monitoring dashboards
  3. Inspect logs and traces
  4. Identify impacted services
  5. Traverse upstream and downstream dependencies
  6. Correlate incidents with infrastructure problems

In large microservice environments, this investigation becomes difficult because:

  • Logs lack system-wide context
  • Metrics show symptoms but not relationships
  • Service dependencies are hard to traverse quickly
  • Infrastructure and application layers are often disconnected

Even with powerful observability tools, humans still perform most correlation tasks manually.


Why Logs Alone Are Not Enough for AI

Many AI troubleshooting systems rely on Retrieval-Augmented Generation (RAG) over logs or documentation.

However, logs alone do not provide system relationships.

Example log entry:
Payment API latency spike

Without topology context, an AI system cannot determine:

  • Which upstream service triggered the issue
  • Which downstream dependency failed
  • Whether the issue originated from infrastructure or application layers

To solve this, we need structural knowledge about the system architecture.


Introducing Topology-Aware AI Agents

A Topology-Aware AI Agent combines three major sources of context:

Observability Data + Service Topology + Historical Incident Knowledge

The agent uses this combined knowledge to automatically:

  • Identify impacted services
  • Traverse dependency graphs
  • Correlate incidents
  • Suggest root causes

This transforms incident troubleshooting from log searching into graph-based reasoning.


Platform Context: Microservices Running on Amazon EKS

In this environment, the application platform was built using Kubernetes running on Amazon Elastic Kubernetes Service (EKS).

Each user request travels across multiple layers:

User Request
  ↓
API Gateway / Entry Service
  ↓
Microservices running on Kubernetes
  ↓
Databases / external dependencies

Each microservice runs in containers inside Kubernetes pods.

To enable automated incident analysis, the system needed visibility into:

  • Cloud infrastructure
  • Kubernetes resources
  • Application services
  • Runtime service interactions
  • Observability signals

These relationships were modeled in a graph database.


Building the Service Relationship Graph

The system used Neo4j to build a knowledge graph representing the full platform topology.

The graph captured relationships across multiple layers:

  • Cloud infrastructure
  • Kubernetes platform
  • Application services
  • Service interactions
  • Historical incidents

This structure allowed the AI agent to reason about how failures propagate across the system.


Modeling the Infrastructure Layer

The first layer of the graph represented the cloud infrastructure.

Example nodes:

Cloud Provider
AWS Account
Region
Availability Zone
Host (EC2)

Example relationships:

(AWS Account)-[:DEPLOYS]->(EKS Cluster)-[:RUNS_ON]->(EC2 Worker Node)

This enables the system to correlate incidents with infrastructure-level problems such as:

  • node failures
  • CPU saturation
  • network issues
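As a rough sketch of how this layer could be loaded into Neo4j, the snippet below renders (source, relationship, target) triples as Cypher MERGE statements. The node labels, IDs, and the `to_cypher` helper are illustrative assumptions, not a real schema; a production loader would also use parameterized queries rather than string interpolation.

```python
# Hypothetical sketch: emit Cypher MERGE statements for the infrastructure
# layer from (source, relationship, target) triples. All labels and IDs
# are invented for illustration.

INFRA_TRIPLES = [
    (("AwsAccount", "123456789012"), "DEPLOYS", ("EksCluster", "prod-cluster")),
    (("EksCluster", "prod-cluster"), "RUNS_ON", ("Ec2Node", "i-0abc123")),
]

def to_cypher(triples):
    """Render each triple as a MERGE statement a Neo4j session could run."""
    stmts = []
    for (src_label, src_id), rel, (dst_label, dst_id) in triples:
        stmts.append(
            f"MERGE (a:{src_label} {{id: '{src_id}'}}) "
            f"MERGE (b:{dst_label} {{id: '{dst_id}'}}) "
            f"MERGE (a)-[:{rel}]->(b)"
        )
    return stmts

for stmt in to_cypher(INFRA_TRIPLES):
    print(stmt)
```

The same triple format extends naturally to the Kubernetes and application layers described next.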

Modeling the Kubernetes Platform

The next layer represents Kubernetes resources running on the EKS cluster.

Example nodes:

EKS Cluster
Namespace
Pod
Container
Process Group

Example relationships:

(EKS Cluster)-[:CONTAINS]->(Namespace)-[:CONTAINS]->(Pod)-[:RUNS]->(Container)

Each container instance is mapped to a process group representing a running microservice instance.

This structure allows the graph to capture runtime relationships between services and infrastructure nodes.


Modeling Application Services

At the application level, the graph represents each microservice as a service node.

Example nodes:

Service
API
Database
External Dependency

Services are connected to the runtime processes executing them.

Example relationship:

(Checkout Service)-[:RUNS_AS]->(Process Group)-[:HOSTED_ON]->(Kubernetes Pod)

This mapping enables the system to trace incidents from application failures down to infrastructure components.
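That trace can be sketched as a walk over typed edges in an in-memory copy of the topology. The edge data, node names, and the extra `SCHEDULED_ON` hop to the worker node are assumptions made up for this example.

```python
# Illustrative sketch: trace a service down to infrastructure by following
# a fixed chain of relationship types. Edge data and the SCHEDULED_ON hop
# are hypothetical.

EDGES = {
    ("Checkout Service", "RUNS_AS"): "checkout-process-group",
    ("checkout-process-group", "HOSTED_ON"): "checkout-pod-7f9c",
    ("checkout-pod-7f9c", "SCHEDULED_ON"): "ec2-worker-node-1",
}

def trace_down(node, rel_chain):
    """Follow each relationship type in order, collecting every hop."""
    path = [node]
    for rel in rel_chain:
        node = EDGES.get((node, rel))
        if node is None:
            break
        path.append(node)
    return path

path = trace_down("Checkout Service", ["RUNS_AS", "HOSTED_ON", "SCHEDULED_ON"])
print(" -> ".join(path))
```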


Modeling Caller–Callee Relationships

One of the most critical aspects of the topology graph is capturing service interaction flows.

Microservices communicate through APIs, forming caller–callee relationships.

Example:

(Checkout Service)-[:CALLS]->(Payment Service)-[:CALLS]->(Payment Database)

These relationships represent the actual runtime service communication paths.

By modeling these relationships, the AI agent can identify:

  • downstream dependencies
  • cascading failures
  • shared services impacting multiple user journeys
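The downstream-dependency traversal above can be approximated with a breadth-first search over CALLS edges. This sketch uses an in-memory adjacency map instead of a live Neo4j query, and the service names are illustrative.

```python
from collections import deque

# Minimal sketch of downstream-dependency traversal over CALLS edges.
# Adjacency data is invented for the example.

CALLS = {
    "Checkout Service": ["Payment Service", "Inventory Service"],
    "Payment Service": ["Payment Database"],
    "Inventory Service": ["Inventory Database"],
}

def downstream(service):
    """BFS all services reachable via CALLS from the given service."""
    seen, queue = set(), deque([service])
    while queue:
        for callee in CALLS.get(queue.popleft(), []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

print(sorted(downstream("Checkout Service")))
```

Reversing the edge direction gives the upstream callers, which is how shared services impacting multiple user journeys can be surfaced.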

Linking Observability Data to the Graph

Observability signals such as logs and errors are attached to graph nodes.

Example:

(Payment Service)-[:HAS_ERROR]->(Timeout Exception)

Infrastructure events can also be attached:

(EC2 Worker Node)-[:HAS_EVENT]->(CPU Spike)

This allows the agent to correlate:

  • infrastructure issues
  • application errors
  • service dependencies

within a single reasoning model.
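A minimal sketch of that correlation step, assuming the topology links each service to its hosting node; all signal data and names here are hypothetical:

```python
# Sketch: correlate application errors with events on the node hosting a
# service. HAS_ERROR / HAS_EVENT / HOSTED_ON mirror the graph edges above,
# flattened into dicts for illustration.

HAS_ERROR = {"Payment Service": ["Timeout Exception"]}
HAS_EVENT = {"ec2-worker-node-1": ["CPU Spike"]}
HOSTED_ON = {"Payment Service": "ec2-worker-node-1"}

def correlate(service):
    """Gather app-level errors plus events on the service's host node."""
    node = HOSTED_ON.get(service)
    return {
        "service": service,
        "errors": HAS_ERROR.get(service, []),
        "node": node,
        "node_events": HAS_EVENT.get(node, []),
    }

print(correlate("Payment Service"))
```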


Learning from Historical Incidents

Each investigated incident is also stored in the graph.

Example structure:

Incident
├ impacted service
├ root cause
├ infrastructure correlation
└ resolution

Over time, this builds a knowledge graph of operational incidents.

The AI agent can then detect patterns such as:

  • recurring failures
  • common dependency issues
  • infrastructure patterns impacting multiple services
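One simple form of pattern detection is counting recurring root causes across stored incident records. The incident list below is invented for illustration; a real query would aggregate over Incident nodes in the graph.

```python
from collections import Counter

# Sketch: mine recurring root causes from stored incident records.
# Incident data is hypothetical.

INCIDENTS = [
    {"service": "Payment Service", "root_cause": "connection pool exhaustion"},
    {"service": "Checkout Service", "root_cause": "node CPU saturation"},
    {"service": "Payment Service", "root_cause": "connection pool exhaustion"},
]

def recurring_causes(incidents, min_count=2):
    """Return root causes seen at least min_count times."""
    counts = Counter(i["root_cause"] for i in incidents)
    return [cause for cause, n in counts.items() if n >= min_count]

print(recurring_causes(INCIDENTS))
```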

Architecture Overview

A simplified architecture for this approach looks like this:

SLO Breach Alert
  ↓
Event Trigger (Monitoring / EventBridge)
  ↓
Incident AI Agent
  ├── Service Topology Graph (Neo4j)
  ├── Observability Data (Logs / Traces)
  └── Historical Incident Knowledge
  ↓
LLM Reasoning
  ↓
Root Cause Hypothesis

AWS services that can support this architecture include:

  • Amazon EKS
  • AWS Lambda
  • Amazon EventBridge
  • Amazon Bedrock
  • Amazon OpenSearch
  • Amazon Neptune (as a managed graph alternative)
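The entry point could be an AWS Lambda function invoked by an EventBridge rule on an SLO-breach event. The handler below is a stubbed sketch: the event shape, the `lookup_impacted_services` helper, and its return values are assumptions, and the graph query and Bedrock call are replaced with placeholders.

```python
# Hypothetical Lambda handler triggered by an EventBridge rule on an
# SLO-breach event. Graph lookup and LLM call are stubbed; the event
# schema shown is an assumption, not a real EventBridge contract.

def lookup_impacted_services(service):
    # In a real system this would be a Neo4j / Amazon Neptune query.
    return [service, "Payment Service", "Payment Database"]

def handler(event, context):
    service = event["detail"]["service"]
    impacted = lookup_impacted_services(service)
    # A real implementation would send structured context to Amazon Bedrock
    # and return the model's root cause hypothesis.
    return {"slo_breach": service, "impacted_services": impacted}

print(handler({"detail": {"service": "Checkout Service"}}, None))
```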

Agent Workflow

When a new SLO breach occurs, the AI agent performs the following steps.

Step 1 — Detect SLO Breach

Monitoring tools trigger an alert event.

Step 2 — Identify Impacted Services

The agent queries the service topology graph.

Step 3 — Traverse Dependencies

The graph traversal identifies:

  • upstream services
  • downstream dependencies
  • infrastructure nodes

Step 4 — Retrieve Observability Signals

Logs and errors are retrieved from observability platforms.

Step 5 — LLM Reasoning

Structured context is sent to the LLM.

Example prompt:

SLO breach detected in Checkout Service

Impacted services:
Checkout Service
Payment Service
Payment Database

Recent errors:
Timeout errors in Payment Service

Historical incident:
Database connection pool exhaustion

The LLM then generates a root cause hypothesis.
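Assembling that structured context can be as simple as rendering the agent's graph-query results into a template. The `build_prompt` function and its field names are illustrative, not part of the original system.

```python
# Sketch: render graph-query results into the structured prompt shown
# above. Function and field names are hypothetical.

def build_prompt(breach_service, impacted, errors, history):
    lines = [f"SLO breach detected in {breach_service}", "", "Impacted services:"]
    lines += impacted
    lines += ["", "Recent errors:"] + errors
    lines += ["", "Historical incident:"] + history
    return "\n".join(lines)

prompt = build_prompt(
    "Checkout Service",
    ["Checkout Service", "Payment Service", "Payment Database"],
    ["Timeout errors in Payment Service"],
    ["Database connection pool exhaustion"],
)
print(prompt)
```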


Results from the Prototype

In the prototype implementation:

Manual investigation time: 20–30 minutes
AI-assisted investigation: under 1 minute

For a specific platinum user journey SLO, the agent achieved:

~52% correlation accuracy between SLO breaches and underlying service problems.

While not perfect, it significantly accelerates incident triage.


Why Graph-Based Observability Matters

Traditional observability focuses on:

  • metrics
  • logs
  • traces

However, modern systems also require relationship awareness.

Graph-based models enable:

  • dependency reasoning
  • cross-service correlation
  • historical incident learning

Combining graph knowledge with LLM reasoning enables a new class of systems:

AI-assisted incident response agents.


Future Directions

This concept can evolve further with:

  • autonomous remediation agents
  • continuous incident learning
  • multi-agent observability systems
  • integration with CI/CD pipelines

As distributed architectures continue to grow in complexity, topology-aware AI agents may become an essential part of SRE operations.


Final Thoughts

AI-powered incident investigation is still in its early stages.

However, combining:

  • observability data
  • service topology graphs
  • Kubernetes infrastructure knowledge
  • historical incident intelligence
  • LLM reasoning

creates a powerful approach to automated root cause analysis.

Topology-aware AI agents represent a promising direction for improving SRE productivity and incident response time in modern cloud-native systems.


If you're exploring AI for SRE, observability, or incident automation, I would love to hear your thoughts or experiences.
