Ananya S

Posted on May 26

How I Built an AI-Powered Incident RCA Platform with LangGraph and RAG

#ai #langgraph #programming #productivity

It’s 2:13 AM.

A payment API suddenly starts failing in production.

Customers can’t complete transactions. Alerts begin firing everywhere. Dashboards turn red. Kubernetes pods restart unexpectedly. Database connections start timing out.

And somewhere, an exhausted engineer opens Datadog and starts scrolling through thousands of logs trying to answer a single question:

“What actually broke?”

Modern systems generate enormous amounts of telemetry:

logs
alerts
traces
metrics
infrastructure events

The problem isn’t the lack of monitoring anymore.

The problem is making sense of the chaos quickly enough during an outage.

That idea became the starting point for OpsMind AI, an AI-powered incident root cause analysis (RCA) platform inspired by real-world DevOps and Site Reliability Engineering (SRE) workflows.

The goal was ambitious but simple:

Upload observability logs → identify probable root cause → generate remediation recommendations automatically.

The Core Problem

In modern distributed systems, a single failure rarely stays isolated.

A database lock might cause:

API latency spikes
gateway timeouts
downstream service crashes
Kubernetes restarts

During incidents, engineers manually jump between:

Grafana dashboards
Datadog alerts
New Relic traces
raw log streams

trying to correlate failures across services.

This process is:

time-consuming
mentally exhausting
highly dependent on experience

I wanted to explore whether multi-agent AI systems could assist in this process.

Not just by summarizing logs, but by actually:

retrieving similar historical incidents
classifying incident severity
reconstructing event timelines
identifying affected services
generating RCA explanations
suggesting remediation steps

Enter OpsMind AI

OpsMind AI simulates an AI-driven observability assistant for SRE and DevOps teams.

The platform processes observability logs through a LangGraph-based multi-agent workflow that orchestrates specialized agents for different operational tasks.

Instead of relying on a single monolithic LLM prompt, the system breaks incident investigation into multiple coordinated reasoning stages.

System Architecture

The workflow begins by ingesting logs from simulated monitoring platforms such as:

Datadog
Grafana
New Relic

The logs are normalized and passed into a multi-agent orchestration pipeline.
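As a rough illustration of what that normalization step might look like, the sketch below maps records from each source into one shared shape. The `LogEvent` class and the per-source field names are hypothetical and chosen only for this example; they are not the real export formats of these platforms, nor necessarily the schema OpsMind AI uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogEvent:
    """Common shape every source record is converted into (hypothetical schema)."""
    timestamp: datetime
    source: str
    service: str
    level: str
    message: str


# Hypothetical per-source field names; real platform exports will differ.
FIELD_MAPS = {
    "datadog": {"ts": "timestamp", "service": "service", "level": "status", "message": "message"},
    "grafana": {"ts": "time", "service": "app", "level": "level", "message": "line"},
    "newrelic": {"ts": "occurred_at", "service": "entity", "level": "severity", "message": "description"},
}


def normalize(source: str, record: dict) -> LogEvent:
    """Convert one raw record into a LogEvent using the source's field map."""
    fields = FIELD_MAPS[source]
    # Assumes ISO 8601 timestamps; naive timestamps are treated as UTC.
    ts = datetime.fromisoformat(record[fields["ts"]])
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    return LogEvent(
        timestamp=ts,
        source=source,
        service=record[fields["service"]],
        level=str(record[fields["level"]]).upper(),
        message=record[fields["message"]].strip(),
    )
```

Once every record has the same fields, the downstream agents no longer need to know which platform a log line came from.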

The architecture consists of:

Retrieval Agent

Searches historical incidents using FAISS vector similarity search.

Incident Classification Agent

Identifies:

incident type
severity level
monitoring source
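To make the output of this stage concrete, here is a minimal rule-based sketch. It is a stand-in only: the keyword rules and severity thresholds are hypothetical, and the real agent may well use an LLM prompt instead.

```python
import re

# Hypothetical keyword rules; a real classifier could be an LLM prompt or a trained model.
INCIDENT_RULES = [
    ("kubernetes_crashloop", re.compile(r"crashloopbackoff|oomkilled", re.I)),
    ("db_connection_exhaustion", re.compile(r"connection pool|too many connections", re.I)),
    ("api_rate_limiting", re.compile(r"rate limit|\b429\b", re.I)),
]

ERROR_LEVEL = re.compile(r"\b(ERROR|FATAL|CRITICAL)\b")


def classify(lines, source="unknown"):
    """Return incident type, severity, and monitoring source for a batch of log lines."""
    text = "\n".join(lines)
    incident_type = "unknown"
    for name, pattern in INCIDENT_RULES:
        if pattern.search(text):
            incident_type = name
            break

    # Hypothetical thresholds based on how many error-level lines appear.
    error_count = sum(1 for line in lines if ERROR_LEVEL.search(line))
    if error_count >= 10:
        severity = "critical"
    elif error_count >= 3:
        severity = "high"
    elif error_count >= 1:
        severity = "medium"
    else:
        severity = "low"

    return {"incident_type": incident_type, "severity": severity, "source": source}
```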

RCA Agent

Performs root cause analysis and generates remediation recommendations using LLM reasoning.

Timeline & Impact Analysis

Reconstructs operational event sequences and identifies affected downstream services.
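One simple way to approach both tasks, sketched here under assumptions: sort events by timestamp to get the timeline, then walk a service dependency map to find what sits downstream of the failing service. The `DEPENDENTS` map and the event fields are hypothetical examples, not data from the project.

```python
from collections import deque
from datetime import datetime

# Hypothetical dependency map: service -> services that depend on it.
DEPENDENTS = {
    "postgres": ["payments-api"],
    "payments-api": ["api-gateway", "checkout-web"],
    "api-gateway": [],
    "checkout-web": [],
}


def build_timeline(events):
    """Order events chronologically by their ISO 8601 timestamp."""
    return sorted(events, key=lambda e: datetime.fromisoformat(e["ts"]))


def affected_services(root, dependents):
    """Breadth-first walk from the failing service to everything downstream of it."""
    seen = []
    queue = deque(dependents.get(root, []))
    while queue:
        service = queue.popleft()
        # Skip the root and anything already visited so cycles cannot loop forever.
        if service != root and service not in seen:
            seen.append(service)
            queue.extend(dependents.get(service, []))
    return seen
```

With the map above, a failure in `postgres` reaches `payments-api` first and then the two services that call it, which matches the cascading pattern described earlier.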

Evaluation Layer

Measures:

retrieval accuracy
RCA quality
latency
incident correlation confidence

The frontend dashboard was built using Streamlit to simulate an operational observability console.

Why RAG Was Important Here

One of the most interesting parts of the project was integrating retrieval-augmented generation.

Production incidents often repeat patterns:

database pool exhaustion
API rate limiting
Kubernetes OOM crashes
retry storms
deadlocks

Instead of asking the LLM to reason from scratch every time, OpsMind AI retrieves semantically similar historical incidents from a FAISS vector index and uses them as contextual memory during RCA generation.

Grounding the prompt in past incidents made the generated analyses noticeably more consistent.
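The retrieve-then-prompt idea can be sketched without any external dependencies. Note the swap: this example uses a toy bag-of-words vector and a brute-force cosine search as a stand-in for SentenceTransformers embeddings and a FAISS index. The incident records and the prompt wording are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical historical incidents; a real store would hold embedded incident records.
HISTORY = [
    {"id": "INC-001", "summary": "database connection pool exhausted causing api timeouts",
     "fix": "raise pool size and add connection timeouts"},
    {"id": "INC-002", "summary": "kubernetes pod oom killed and restarted in crash loop",
     "fix": "increase memory limit and fix the leak"},
    {"id": "INC-003", "summary": "retry storm after rate limiting on payment api",
     "fix": "add exponential backoff with jitter"},
]


def embed(text):
    """Toy bag-of-words vector; stands in for a sentence embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, history, k=2):
    """Return the k incidents whose summaries are most similar to the query."""
    q = embed(query)
    ranked = sorted(history, key=lambda inc: cosine(q, embed(inc["summary"])), reverse=True)
    return ranked[:k]


def build_prompt(log_excerpt, similar):
    """Place retrieved incidents ahead of the current logs as context for the LLM."""
    context = "\n".join(f"- {inc['id']}: {inc['summary']} (fix: {inc['fix']})" for inc in similar)
    return (
        "Similar past incidents:\n" + context
        + "\n\nCurrent logs:\n" + log_excerpt
        + "\n\nIdentify the most probable root cause and suggest remediation."
    )
```

The LLM then sees how comparable incidents were diagnosed and fixed before it reasons about the new one.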

Building the Multi-Agent Workflow

The orchestration layer uses LangGraph to model incident analysis as a graph of specialized AI agents.

This made the workflow:

modular
explainable
easier to visualize

One thing I particularly enjoyed was building the animated agent execution dashboard where each agent executes sequentially:

Retrieval Agent
Classification Agent
RCA Agent
Timeline Agent
Impact Analysis Agent

Watching the workflow execute in real time made the system feel much closer to an actual operational AI assistant than to just another chatbot interface.
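The sequential pattern can be illustrated in plain Python, without LangGraph itself: each agent is a function that reads a shared state and returns a partial update, and a small runner applies them in order. This is only a sketch of the shape. The agent bodies are placeholders, and the state keys and function names are hypothetical.

```python
from typing import Callable

State = dict
Agent = Callable[[State], dict]


def retrieval_agent(state: State) -> dict:
    return {"similar_incidents": ["INC-001"]}  # placeholder result


def classification_agent(state: State) -> dict:
    return {"severity": "high" if "ERROR" in state["logs"] else "low"}


def rca_agent(state: State) -> dict:
    count = len(state["similar_incidents"])
    return {"root_cause": f"placeholder RCA informed by {count} past incident(s)"}


def timeline_agent(state: State) -> dict:
    # Assumes each line starts with a sortable timestamp.
    return {"timeline": sorted(state["logs"].splitlines())}


def impact_agent(state: State) -> dict:
    return {"affected_services": []}  # placeholder result


PIPELINE = [
    ("retrieval", retrieval_agent),
    ("classification", classification_agent),
    ("rca", rca_agent),
    ("timeline", timeline_agent),
    ("impact", impact_agent),
]


def run_pipeline(logs: str, pipeline=PIPELINE) -> State:
    """Run each agent in order, merging its partial update into the shared state."""
    state: State = {"logs": logs, "trace": []}
    for name, agent in pipeline:
        state.update(agent(state))
        state["trace"].append(name)  # execution order, e.g. for a dashboard
    return state
```

Because every agent only reads and writes named keys on the shared state, individual stages can be swapped or reordered without touching the others.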

Simulating Real Production Incidents

Since real enterprise observability data isn’t publicly available, I generated synthetic production-style incident logs for:

Kubernetes CrashLoopBackOff failures
database connection exhaustion
API rate limiting failures
downstream gateway crashes

The architecture was intentionally designed so that simulated connectors can later be replaced with real monitoring APIs.
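One way such a swap could be arranged, sketched under assumptions: the pipeline depends only on a small connector interface, and the simulated source is just one implementation of it. The `LogConnector` protocol, the class names, and the log templates are hypothetical and invented for this example.

```python
import random
from typing import Protocol


class LogConnector(Protocol):
    """Anything that can hand back a batch of log lines."""

    def fetch(self, limit: int) -> list[str]: ...


class SimulatedConnector:
    """Generates synthetic log lines for one failure scenario."""

    # Hypothetical templates; not copied from any real platform's log format.
    TEMPLATES = {
        "crashloop": "ERROR kubelet pod=payments-{n} Back-off restarting failed container (CrashLoopBackOff)",
        "db_exhaustion": "ERROR payments-api request={n} could not obtain connection: pool exhausted",
        "rate_limit": "WARN api-gateway client={n} HTTP 429 rate limit exceeded",
    }

    def __init__(self, scenario: str, seed: int = 0):
        self.scenario = scenario
        self.rng = random.Random(seed)  # seeded so runs are reproducible

    def fetch(self, limit: int) -> list[str]:
        template = self.TEMPLATES[self.scenario]
        return [template.format(n=self.rng.randint(1000, 9999)) for _ in range(limit)]


def collect(connector: LogConnector, limit: int = 5) -> list[str]:
    """The pipeline only sees the interface, never the concrete source."""
    return connector.fetch(limit)
```

A connector backed by a real monitoring API would implement the same `fetch` method, so `collect` and everything after it would stay unchanged.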

Evaluation Was Surprisingly Hard

One unexpected realization during development:

Building the RCA pipeline was easier than evaluating it.

It’s very easy to generate convincing AI explanations.

It’s much harder to measure:

whether the RCA is actually correct
whether retrieval is meaningful
whether severity classification is reliable

That’s why I added an evaluation layer measuring:

Retrieval Accuracy
RCA Match Accuracy
Severity Accuracy
Average Latency
Correlation Confidence
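As a sketch of how some of these numbers could be computed from labelled test cases, the example below uses simple definitions that are my assumptions, since the exact formulas are not spelled out here: retrieval accuracy as hit rate, and RCA and severity accuracy as exact label matches. The result keys are hypothetical, and correlation confidence is left out because its definition is not given.

```python
from statistics import mean


def accuracy(pairs):
    """Fraction of (predicted, expected) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(1 for predicted, expected in pairs if predicted == expected) / len(pairs)


def evaluate(results):
    """Aggregate per-case results (hypothetical keys) into summary metrics."""
    # Retrieval counts as a hit when the known-relevant incident was retrieved.
    hits = [1.0 if r["relevant_id"] in r["retrieved_ids"] else 0.0 for r in results]
    return {
        "retrieval_accuracy": mean(hits),
        "rca_match_accuracy": accuracy((r["predicted_cause"], r["expected_cause"]) for r in results),
        "severity_accuracy": accuracy((r["predicted_severity"], r["expected_severity"]) for r in results),
        "average_latency_s": mean([r["latency_s"] for r in results]),
    }
```

Exact matching is a strict proxy for RCA quality: a correct explanation phrased differently from the label would count as a miss, which is part of why this kind of evaluation is hard.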

Adding evaluation made the project feel more engineering-focused than prompt-driven.

Tech Stack

Python
Streamlit
LangGraph
FAISS
SentenceTransformers
Groq LLM API
Pandas

Building Under Hackathon Constraints

OpsMind AI was originally built during a short-duration engineering hackathon focused on AI agents and developer infrastructure workflows.

One interesting challenge was balancing:

ambitious system design ideas
realistic implementation scope
evaluation reliability
UI polish
deployment constraints

I wanted the project to feel less like a simple LLM wrapper and more like an actual operational intelligence platform, which is why I focused heavily on:

multi-agent orchestration
retrieval systems
evaluation metrics
workflow visualization
observability-inspired architecture

Even within a constrained timeline, building the system end-to-end — from synthetic telemetry generation to agent orchestration and evaluation — was an incredibly valuable learning experience.

What I Learned

This project taught me a lot about:

observability systems
multi-agent orchestration
RAG pipelines
AI evaluation strategies
operational intelligence workflows

More importantly, it changed how I think about AI systems.

The interesting challenge wasn’t generating text.

It was designing systems that:

reason through operational data
coordinate specialized agents
retrieve contextual memory
produce actionable outputs

That feels much closer to how real-world AI systems will evolve.

Demo & Repository

GitHub Repository

https://github.com/Anucool419/OpsMind-AI

Future Improvements

Some things I’d love to explore next:

real-time telemetry ingestion
live Datadog/New Relic integrations
Slack incident alerting
autonomous remediation workflows
distributed tracing support
long-term incident memory systems

Conclusion

What started as a simple idea — “Can AI help investigate production incidents faster?” — turned into a much deeper exploration of how intelligent systems can assist engineering operations.

The most interesting part of building OpsMind AI wasn’t the UI or even the LLM integration.

It was understanding how modern operational systems actually behave:

cascading failures
noisy telemetry
infrastructure dependencies
repeated incident patterns
operational uncertainty

This project made me realize that the future of AI in engineering is not just about chat interfaces.

It’s about building systems that can:

reason over complex environments
retrieve operational memory
coordinate specialized agents
assist humans during high-pressure decision making

OpsMind AI is still a prototype, but building it gave me a much deeper appreciation for:

observability engineering
SRE workflows
AI orchestration systems
evaluation-driven AI development

And honestly, that combination of AI and systems engineering is one of the most exciting areas to explore right now. If you have suggestions for improvements or experiences of your own to share, I'd love to hear them.

Thanks for reading.
