Roopa Venkatesh

Topology-Aware AI Agents for Observability: Automating SLO Breach Root Cause Analysis

Modern cloud systems are complex distributed architectures where a single user journey may depend on dozens of services running across multiple infrastructure layers.

When a Service Level Objective (SLO) breach occurs, identifying the root cause often requires navigating logs, metrics, traces, service dependencies, and infrastructure relationships.

In many organizations, this investigation is still manual and time-consuming.

In a recent project, I explored how AI agents can automate incident investigation by combining:

  • Observability data
  • Service topology
  • Kubernetes infrastructure context
  • Historical incident knowledge
  • Graph-based reasoning

This approach reduced investigation time from 20–30 minutes to under a minute for certain SLO breaches.

This article introduces the concept of Topology-Aware AI Agents and how such a system can be implemented using AWS services and graph-based system modeling.


The Problem: Traditional Incident Investigation

When an SLO breach occurs, SRE teams typically perform the following steps:

  1. Identify the impacted user journey
  2. Check monitoring dashboards
  3. Inspect logs and traces
  4. Identify impacted services
  5. Traverse upstream and downstream dependencies
  6. Correlate incidents with infrastructure problems

In large microservice environments, this investigation becomes difficult because:

  • Logs lack system-wide context
  • Metrics show symptoms but not relationships
  • Service dependencies are hard to traverse quickly
  • Infrastructure and application layers are often disconnected

Even with powerful observability tools, humans still perform most correlation tasks manually.


Why Logs Alone Are Not Enough for AI

Many AI troubleshooting systems rely on Retrieval-Augmented Generation (RAG) over logs or documentation.

However, logs alone do not provide system relationships.

Example log entry:
Payment API latency spike

Without topology context, an AI system cannot determine:

  • Which upstream service triggered the issue
  • Which downstream dependency failed
  • Whether the issue originated from infrastructure or application layers

To solve this, we need structural knowledge about the system architecture.


Introducing Topology-Aware AI Agents

A Topology-Aware AI Agent combines three major sources of context:

Observability Data + Service Topology + Historical Incident Knowledge

The agent uses this combined knowledge to automatically:

  • Identify impacted services
  • Traverse dependency graphs
  • Correlate incidents
  • Suggest root causes

This transforms incident troubleshooting from log searching into graph-based reasoning.


Platform Context: Microservices Running on Amazon EKS

In this environment, the application platform was built using Kubernetes running on Amazon Elastic Kubernetes Service (EKS).

Each user request travels across multiple layers:

User Request
  ↓
API Gateway / Entry Service
  ↓
Microservices running on Kubernetes
  ↓
Databases / external dependencies

Each microservice runs in containers inside Kubernetes pods.

To enable automated incident analysis, the system needed visibility into:

  • Cloud infrastructure
  • Kubernetes resources
  • Application services
  • Runtime service interactions
  • Observability signals

These relationships were modeled in a graph database.


Building the Service Relationship Graph

The system used Neo4j to build a knowledge graph representing the full platform topology.

The graph captured relationships across multiple layers:

  • Cloud infrastructure
  • Kubernetes platform
  • Application services
  • Service interactions
  • Historical incidents

This structure allowed the AI agent to reason about how failures propagate across the system.


Modeling the Infrastructure Layer

The first layer of the graph represented the cloud infrastructure.

Example nodes:

Cloud Provider
AWS Account
Region
Availability Zone
Host (EC2)

Example relationships:

(AWS Account)-[:DEPLOYS]->(EKS Cluster)-[:RUNS_ON]->(EC2 Worker Node)

This enables the system to correlate incidents with infrastructure-level problems such as:

  • node failures
  • CPU saturation
  • network issues
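As a rough sketch of how this layer could be loaded into Neo4j, the snippet below renders (source, relationship, target) triples as Cypher MERGE statements. The node labels, IDs, and the `to_cypher` helper are illustrative assumptions, not a real schema; a production loader would also use parameterized queries rather than string interpolation.

```python
# Hypothetical sketch: emit Cypher MERGE statements for the infrastructure
# layer from (source, relationship, target) triples. All labels and IDs
# are invented for illustration.

INFRA_TRIPLES = [
    (("AwsAccount", "123456789012"), "DEPLOYS", ("EksCluster", "prod-cluster")),
    (("EksCluster", "prod-cluster"), "RUNS_ON", ("Ec2Node", "i-0abc123")),
]

def to_cypher(triples):
    """Render each triple as a MERGE statement a Neo4j session could run."""
    stmts = []
    for (src_label, src_id), rel, (dst_label, dst_id) in triples:
        stmts.append(
            f"MERGE (a:{src_label} {{id: '{src_id}'}}) "
            f"MERGE (b:{dst_label} {{id: '{dst_id}'}}) "
            f"MERGE (a)-[:{rel}]->(b)"
        )
    return stmts

for stmt in to_cypher(INFRA_TRIPLES):
    print(stmt)
```

The same triple format extends naturally to the Kubernetes and application layers described next.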

Modeling the Kubernetes Platform

The next layer represents Kubernetes resources running on the EKS cluster.

Example nodes:

EKS Cluster
Namespace
Pod
Container
Process Group

Example relationships:

(EKS Cluster)-[:CONTAINS]->(Namespace)-[:CONTAINS]->(Pod)-[:RUNS]->(Container)

Each container instance is mapped to a process group representing a running microservice instance.

This structure allows the graph to capture runtime relationships between services and infrastructure nodes.


Modeling Application Services

At the application level, the graph represents each microservice as a service node.

Example nodes:

Service
API
Database
External Dependency

Services are connected to the runtime processes executing them.

Example relationship:

(Checkout Service)-[:RUNS_AS]->(Process Group)-[:HOSTED_ON]->(Kubernetes Pod)

This mapping enables the system to trace incidents from application failures down to infrastructure components.
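That trace can be sketched as a walk over typed edges in an in-memory copy of the topology. The edge data, node names, and the extra `SCHEDULED_ON` hop to the worker node are assumptions made up for this example.

```python
# Illustrative sketch: trace a service down to infrastructure by following
# a fixed chain of relationship types. Edge data and the SCHEDULED_ON hop
# are hypothetical.

EDGES = {
    ("Checkout Service", "RUNS_AS"): "checkout-process-group",
    ("checkout-process-group", "HOSTED_ON"): "checkout-pod-7f9c",
    ("checkout-pod-7f9c", "SCHEDULED_ON"): "ec2-worker-node-1",
}

def trace_down(node, rel_chain):
    """Follow each relationship type in order, collecting every hop."""
    path = [node]
    for rel in rel_chain:
        node = EDGES.get((node, rel))
        if node is None:
            break
        path.append(node)
    return path

path = trace_down("Checkout Service", ["RUNS_AS", "HOSTED_ON", "SCHEDULED_ON"])
print(" -> ".join(path))
```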


Modeling Caller–Callee Relationships

One of the most critical aspects of the topology graph is capturing service interaction flows.

Microservices communicate through APIs, forming caller–callee relationships.

Example:

(Checkout Service)-[:CALLS]->(Payment Service)-[:CALLS]->(Payment Database)

These relationships represent the actual runtime service communication paths.

By modeling these relationships, the AI agent can identify:

  • downstream dependencies
  • cascading failures
  • shared services impacting multiple user journeys
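The downstream-dependency traversal above can be approximated with a breadth-first search over CALLS edges. This sketch uses an in-memory adjacency map instead of a live Neo4j query, and the service names are illustrative.

```python
from collections import deque

# Minimal sketch of downstream-dependency traversal over CALLS edges.
# Adjacency data is invented for the example.

CALLS = {
    "Checkout Service": ["Payment Service", "Inventory Service"],
    "Payment Service": ["Payment Database"],
    "Inventory Service": ["Inventory Database"],
}

def downstream(service):
    """BFS all services reachable via CALLS from the given service."""
    seen, queue = set(), deque([service])
    while queue:
        for callee in CALLS.get(queue.popleft(), []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

print(sorted(downstream("Checkout Service")))
```

Reversing the edge direction gives the upstream callers, which is how shared services impacting multiple user journeys can be surfaced.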

Linking Observability Data to the Graph

Observability signals such as logs and errors are attached to graph nodes.

Example:

(Payment Service)-[:HAS_ERROR]->(Timeout Exception)

Infrastructure events can also be attached:

(EC2 Worker Node)-[:HAS_EVENT]->(CPU Spike)

This allows the agent to correlate:

  • infrastructure issues
  • application errors
  • service dependencies

within a single reasoning model.
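A minimal sketch of that correlation step, assuming the topology links each service to its hosting node; all signal data and names here are hypothetical:

```python
# Sketch: correlate application errors with events on the node hosting a
# service. HAS_ERROR / HAS_EVENT / HOSTED_ON mirror the graph edges above,
# flattened into dicts for illustration.

HAS_ERROR = {"Payment Service": ["Timeout Exception"]}
HAS_EVENT = {"ec2-worker-node-1": ["CPU Spike"]}
HOSTED_ON = {"Payment Service": "ec2-worker-node-1"}

def correlate(service):
    """Gather app-level errors plus events on the service's host node."""
    node = HOSTED_ON.get(service)
    return {
        "service": service,
        "errors": HAS_ERROR.get(service, []),
        "node": node,
        "node_events": HAS_EVENT.get(node, []),
    }

print(correlate("Payment Service"))
```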


Learning from Historical Incidents

Each investigated incident is also stored in the graph.

Example structure:

Incident
├ impacted service
├ root cause
├ infrastructure correlation
└ resolution

Over time, this builds a knowledge graph of operational incidents.

The AI agent can then detect patterns such as:

  • recurring failures
  • common dependency issues
  • infrastructure patterns impacting multiple services
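One simple form of pattern detection is counting recurring root causes across stored incident records. The incident list below is invented for illustration; a real query would aggregate over Incident nodes in the graph.

```python
from collections import Counter

# Sketch: mine recurring root causes from stored incident records.
# Incident data is hypothetical.

INCIDENTS = [
    {"service": "Payment Service", "root_cause": "connection pool exhaustion"},
    {"service": "Checkout Service", "root_cause": "node CPU saturation"},
    {"service": "Payment Service", "root_cause": "connection pool exhaustion"},
]

def recurring_causes(incidents, min_count=2):
    """Return root causes seen at least min_count times."""
    counts = Counter(i["root_cause"] for i in incidents)
    return [cause for cause, n in counts.items() if n >= min_count]

print(recurring_causes(INCIDENTS))
```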

Architecture Overview

A simplified architecture for this approach looks like this:

SLO Breach Alert
  ↓
Event Trigger (Monitoring / EventBridge)
  ↓
Incident AI Agent
  ├── Service Topology Graph (Neo4j)
  ├── Observability Data (Logs / Traces)
  └── Historical Incident Knowledge
  ↓
LLM Reasoning
  ↓
Root Cause Hypothesis

AWS services that can support this architecture include:

  • Amazon EKS
  • AWS Lambda
  • Amazon EventBridge
  • Amazon Bedrock
  • Amazon OpenSearch
  • Amazon Neptune (as a managed graph alternative)
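The entry point could be an AWS Lambda function invoked by an EventBridge rule on an SLO-breach event. The handler below is a stubbed sketch: the event shape, the `lookup_impacted_services` helper, and its return values are assumptions, and the graph query and Bedrock call are replaced with placeholders.

```python
# Hypothetical Lambda handler triggered by an EventBridge rule on an
# SLO-breach event. Graph lookup and LLM call are stubbed; the event
# schema shown is an assumption, not a real EventBridge contract.

def lookup_impacted_services(service):
    # In a real system this would be a Neo4j / Amazon Neptune query.
    return [service, "Payment Service", "Payment Database"]

def handler(event, context):
    service = event["detail"]["service"]
    impacted = lookup_impacted_services(service)
    # A real implementation would send structured context to Amazon Bedrock
    # and return the model's root cause hypothesis.
    return {"slo_breach": service, "impacted_services": impacted}

print(handler({"detail": {"service": "Checkout Service"}}, None))
```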

Agent Workflow

When a new SLO breach occurs, the AI agent performs the following steps.

Step 1 — Detect SLO Breach

Monitoring tools trigger an alert event.

Step 2 — Identify Impacted Services

The agent queries the service topology graph.

Step 3 — Traverse Dependencies

The graph traversal identifies:

  • upstream services
  • downstream dependencies
  • infrastructure nodes

Step 4 — Retrieve Observability Signals

Logs and errors are retrieved from observability platforms.

Step 5 — LLM Reasoning

Structured context is sent to the LLM.

Example prompt:

SLO breach detected in Checkout Service

Impacted services:
Checkout Service
Payment Service
Payment Database

Recent errors:
Timeout errors in Payment Service

Historical incident:
Database connection pool exhaustion

The LLM then generates a root cause hypothesis.
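Assembling that structured context can be as simple as rendering the agent's graph-query results into a template. The `build_prompt` function and its field names are illustrative, not part of the original system.

```python
# Sketch: render graph-query results into the structured prompt shown
# above. Function and field names are hypothetical.

def build_prompt(breach_service, impacted, errors, history):
    lines = [f"SLO breach detected in {breach_service}", "", "Impacted services:"]
    lines += impacted
    lines += ["", "Recent errors:"] + errors
    lines += ["", "Historical incident:"] + history
    return "\n".join(lines)

prompt = build_prompt(
    "Checkout Service",
    ["Checkout Service", "Payment Service", "Payment Database"],
    ["Timeout errors in Payment Service"],
    ["Database connection pool exhaustion"],
)
print(prompt)
```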


Results from the Prototype

In the prototype implementation:

Manual investigation time: 20–30 minutes
AI-assisted investigation: under 1 minute

For a specific platinum user journey SLO, the agent achieved:

~52% correlation accuracy between SLO breaches and underlying service problems.

While not perfect, it significantly accelerates incident triage.


Why Graph-Based Observability Matters

Traditional observability focuses on:

  • metrics
  • logs
  • traces

However, modern systems also require relationship awareness.

Graph-based models enable:

  • dependency reasoning
  • cross-service correlation
  • historical incident learning

Combining graph knowledge with LLM reasoning enables a new class of systems:

AI-assisted incident response agents.


Future Directions

This concept can evolve further with:

  • autonomous remediation agents
  • continuous incident learning
  • multi-agent observability systems
  • integration with CI/CD pipelines

As distributed architectures continue to grow in complexity, topology-aware AI agents may become an essential part of SRE operations.


Final Thoughts

AI-powered incident investigation is still in its early stages.

However, combining:

  • observability data
  • service topology graphs
  • Kubernetes infrastructure knowledge
  • historical incident intelligence
  • LLM reasoning

creates a powerful approach to automated root cause analysis.

Topology-aware AI agents represent a promising direction for improving SRE productivity and incident response time in modern cloud-native systems.


If you're exploring AI for SRE, observability, or incident automation, I would love to hear your thoughts or experiences.
