Anuj Gupta for Razorpay

Project Viveka: A Multi-Agent AI That Does Root Cause Analysis in Under 90 Seconds

If you've ever been on-call for production systems, you know the 2 AM drill. An alert fires. You groggily open your laptop, check the incident dashboard, jump into Grafana to examine metrics, dig through Coralogix logs looking for error spikes, SSH into Kubernetes to check pod health, review recent deployments, correlate across six different data sources, and thirty minutes later you're still trying to figure out what's actually wrong.

At Razorpay, where payment infrastructure processes billions of rupees daily, this manual investigation dance was costing us precious time during every incident.

The industry obsesses over Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR), but there's a critical metric in between that often gets overlooked: Mean Time to Investigate (MTTI). This is the gap between knowing something is broken and understanding why it's broken.

Traditional incident response spends the majority of time in this investigation phase, manually following runbooks, querying systems, and correlating signals. By the time you understand the root cause, you've already burned through the minutes that matter most for customer impact.

That's why we built Project Viveka, a multi-agent AI system that automates the entire investigation workflow. When an alert fires, Viveka orchestrates specialist agents across our observability stack, correlates evidence, and produces a structured root cause analysis with supporting data in under 90 seconds.

The name comes from Sanskrit meaning "discernment" or "wisdom," which felt appropriate for a system designed to cut through observability noise and find signal.

The Investigation Problem: When Manual Triage Doesn't Scale

Before diving into how Viveka works, let's talk about why incident investigation is so painful at scale. The challenge isn't having too little data; it's having too much data in too many disconnected systems, and needing human intelligence to connect the dots.

Our observability stack spans multiple systems. Zenduty handles alert routing and incident coordination. Grafana and VictoriaMetrics provide metrics dashboards and PromQL queries. Coralogix aggregates logs from hundreds of services. Kubernetes provides pod health and deployment information. AWS surfaces infrastructure-level signals about compute, load balancers, and networking. Each system has valuable information, but they don't talk to each other automatically.

[Image: Incident response]

When an alert fires for something like "Payment success rate dropped below 50%," an engineer follows a mental runbook. Check recent deployments. Look at error logs. Examine pod restarts. Query database metrics. Check downstream dependencies. Cross-reference all these signals to form a hypothesis about what's wrong.

This manual correlation is where time disappears. Each check takes minutes, involves context switching between tools, and requires the engineer to remember how different signals relate to each other. Moreover, the quality of investigation depends heavily on who's on-call. Experienced engineers know exactly which signals matter for which alerts. Junior engineers might check irrelevant systems or miss critical correlations. This inconsistency means similar incidents get diagnosed differently depending on who's investigating.

The consequences are measurable. MTTI stays high because no system automatically correlates signals across observability tools; engineers spend 20-40 minutes just figuring out what's wrong before they can start fixing it. Diagnosis is inconsistent because different engineers investigate the same symptoms differently. Knowledge stays siloed because the correlation logic lives in people's heads rather than in documented playbooks. And after-hours incidents remain painful because automated systems can detect problems but can't explain them, requiring human intervention regardless of the hour.

Solution Architecture: Multi-Agent Orchestration

Our response was to encode the investigation workflow into an AI system that thinks like an experienced SRE. Rather than a single monolithic AI trying to understand all observability signals, we built a multi-agent system where specialized agents handle different domains, orchestrated by a Supervisor that coordinates the investigation.

[Image: Multi-agent orchestration]

The Supervisor Agent is built on LangGraph, a framework for creating stateful multi-agent workflows. It receives incident context, retrieves relevant knowledge from our RAG systems, creates an investigation plan based on alert runbooks, delegates tasks to specialist agents, and synthesizes their findings into a coherent root cause analysis. Think of it as the incident commander making strategic decisions about what to investigate and how to correlate findings.
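To make the orchestration concrete, here is a minimal sketch of what a LangGraph supervisor flow for this kind of investigation could look like. The state fields, node names, and stubbed logic are assumptions for illustration, not Viveka's actual implementation.

```python
# A minimal sketch of a LangGraph supervisor flow for this kind of investigation.
# State fields, node names, and the stubbed logic are illustrative assumptions.
from typing import List, TypedDict

from langgraph.graph import END, StateGraph


class InvestigationState(TypedDict):
    alert: dict            # incoming alert payload
    context: dict          # RAG-retrieved service info and runbook
    plan: List[str]        # ordered list of checks to run
    evidence: List[dict]   # structured findings from specialist agents
    rca: str               # final synthesized root cause analysis


def retrieve_context(state: InvestigationState) -> dict:
    # Pull service architecture and the alert-specific runbook (stubbed here).
    return {"context": {"service": "payments", "runbook": ["check recent deployments"]}}


def plan_investigation(state: InvestigationState) -> dict:
    # Turn runbook steps into delegable tasks for the specialist agents.
    return {"plan": state["context"]["runbook"]}


def run_specialists(state: InvestigationState) -> dict:
    # In the real system this fans out to the Kubernetes/AWS/Coralogix/PromQL agents.
    return {"evidence": [{"agent": "kubernetes", "found": "3 pod restarts", "confidence": 0.8}]}


def synthesize_rca(state: InvestigationState) -> dict:
    # Correlate stored evidence into a single hypothesis and summary.
    return {"rca": "Likely cause: recent deployment (see evidence)."}


builder = StateGraph(InvestigationState)
builder.add_node("retrieve_context", retrieve_context)
builder.add_node("plan", plan_investigation)
builder.add_node("investigate", run_specialists)
builder.add_node("synthesize", synthesize_rca)
builder.set_entry_point("retrieve_context")
builder.add_edge("retrieve_context", "plan")
builder.add_edge("plan", "investigate")
builder.add_edge("investigate", "synthesize")
builder.add_edge("synthesize", END)
graph = builder.compile()

# result = graph.invoke({"alert": {"name": "Payment success rate dropped below 50%"}})
```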

The Specialist Agents are domain experts. The Kubernetes Agent knows how to check pod health, identify failed rollouts, and spot resource constraints. The AWS Agent understands infrastructure patterns like load balancer saturation, network issues, or compute degradation. The Coralogix Agent analyzes logs for error spikes, exception patterns, and anomalous behavior. The PromQL Tool queries metrics to understand performance degradation, latency increases, or throughput drops. Each agent is narrowly focused and excels within its own domain.
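One way to keep specialists interchangeable is to give them a shared interface that always returns structured findings. The class and method names below are hypothetical, not taken from Viveka's codebase.

```python
# Hypothetical shared interface for the specialist agents; names are illustrative.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Finding:
    agent: str         # which specialist produced this
    checked: str       # what was inspected
    found: str         # what was observed
    confidence: float  # 0.0 - 1.0


class SpecialistAgent(Protocol):
    name: str

    async def investigate(self, task: str, context: dict) -> Finding:
        """Run one bounded check and return structured evidence."""
        ...


class KubernetesAgent:
    name = "kubernetes"

    async def investigate(self, task: str, context: dict) -> Finding:
        # A real agent would query the cluster API for the service's namespace;
        # the restart count here is a stub.
        restarts = 3
        return Finding(self.name, task, f"{restarts} pod restarts in the last 10 minutes", 0.8)
```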

The RAG systems provide contextual memory. Application Info contains service architecture, dependencies, ownership, and common failure modes. Alert Runbooks store diagnostic procedures specific to each alert type. When investigating a payment service alert, the Supervisor retrieves that service's architecture and the specific runbook for payment success rate degradation. This contextual grounding prevents generic responses and ensures investigations follow proven procedures.
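As a rough illustration of the retrieval step, here is a sketch using Chroma as a stand-in vector store. The post doesn't name the actual RAG backend, so the library choice, collection names, and sample documents are all assumptions.

```python
# Sketch of the context-retrieval step using Chroma as a stand-in vector store.
# The backend, collection names, and documents are assumptions for illustration.
import chromadb

client = chromadb.Client()
app_info = client.get_or_create_collection("application_info")
runbooks = client.get_or_create_collection("alert_runbooks")

# Index a couple of illustrative documents.
app_info.add(
    ids=["payments-svc"],
    documents=["payments-service: depends on cards-gateway and primary Postgres; owned by Payments team"],
)
runbooks.add(
    ids=["payment-success-rate"],
    documents=["1. Check recent deployments 2. Scan error logs 3. Query success-rate metrics 4. Check pod health"],
)


def retrieve_context(alert_name: str, service: str) -> dict:
    """Pull the service profile and the alert-specific runbook for the Supervisor."""
    svc = app_info.query(query_texts=[service], n_results=1)
    rb = runbooks.query(query_texts=[alert_name], n_results=1)
    return {
        "application_info": svc["documents"][0],
        "runbook": rb["documents"][0],
    }


context = retrieve_context("Payment success rate dropped below 50%", "payments-service")
```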

The Memory system is crucial for correlation. After each agent completes its investigation, results get stored as structured evidence: what was checked, what was found, confidence level, and supporting data. Once all agents finish, the Supervisor reviews all stored evidence together, identifies patterns and correlations, resolves conflicts between signals, and constructs the most likely hypothesis based on collective evidence.
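The evidence records might look something like the sketch below. The field names mirror the description above (what was checked, what was found, a one-line note, confidence), but the exact schema is an assumption.

```python
# Assumed shape of the structured evidence the agents write to Memory.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Evidence:
    agent: str             # which specialist produced it
    checked: str           # input: what was checked
    found: str             # output: what was found
    note: str              # one-line interpretation
    confidence: float      # 0.0 - 1.0
    observed_at: datetime  # used later for temporal correlation


@dataclass
class InvestigationMemory:
    records: List[Evidence] = field(default_factory=list)

    def store(self, evidence: Evidence) -> None:
        self.records.append(evidence)

    def timeline(self) -> List[Evidence]:
        # Ordered view the Supervisor reasons over during synthesis.
        return sorted(self.records, key=lambda e: e.observed_at)
```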

The Investigation Workflow: From Alert to Answer

Let me walk through exactly what happens when an incident triggers Viveka. Understanding the step-by-step flow reveals why this approach dramatically reduces investigation time.

[Image: Investigation sequence diagram]

Step 1: Context Retrieval. When the alert arrives, the Supervisor immediately pulls relevant information from both RAG collections. Application Info provides the payment service's architecture, dependencies, recent changes, and ownership. Alert Runbook provides the specific diagnostic procedure for payment success rate alerts. This contextual loading takes 2-3 seconds and ensures the investigation is targeted rather than generic.

Step 2: Investigation Planning. The Supervisor parses the runbook and creates a task plan. For a payment success rate alert, the plan might specify: check recent deployments, analyze error logs for payment failures, query success rate metrics over time, examine pod health and restarts, validate database connection health. This planning phase takes 1-2 seconds and produces a prioritized list of checks.
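A simple way to picture the planning step is a mapping from runbook checks to the agent that should run them, as in this sketch. The routing table and task fields are assumptions, not Viveka's actual planner.

```python
# Illustrative planning step: map runbook checks to specialist agents with a priority.
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    description: str
    agent: str       # which specialist should run it
    priority: int    # lower number = weighted earlier in synthesis


# Assumed keyword-to-agent routing; the real system follows the alert runbook.
ROUTING = {
    "deployment": "kubernetes",
    "pod": "kubernetes",
    "log": "coralogix",
    "success rate": "promql",
    "database": "promql",
    "load balancer": "aws",
}


def plan_from_runbook(runbook_steps: List[str]) -> List[Task]:
    tasks = []
    for i, step in enumerate(runbook_steps):
        agent = next((a for kw, a in ROUTING.items() if kw in step.lower()), "supervisor")
        tasks.append(Task(description=step, agent=agent, priority=i))
    return sorted(tasks, key=lambda t: t.priority)


plan = plan_from_runbook([
    "Check recent deployments",
    "Analyze error logs for payment failures",
    "Query success rate metrics over time",
    "Examine pod health and restarts",
    "Validate database connection health",
])
```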

Step 3: Parallel Investigation. Here's where the architecture shines. The Supervisor delegates tasks to multiple agents simultaneously. While the Kubernetes Agent checks pod health, the Coralogix Agent analyzes logs, the PromQL Tool queries metrics, and the AWS Agent validates infrastructure. These investigations happen in parallel with a per-agent timeout of 5-8 seconds. Total investigation time is bounded by the slowest agent, not the sum of all agents.
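The fan-out itself can be expressed with plain asyncio: run every agent concurrently, cap each one with a timeout, and degrade gracefully if one is slow. The agent stubs and the 8-second bound below are illustrative, not Viveka's implementation.

```python
# Sketch of the parallel fan-out with a per-agent timeout.
import asyncio
from typing import Dict

PER_AGENT_TIMEOUT_S = 8  # matches the 5-8 second range described above


async def run_agent(name: str, task: str) -> Dict:
    # Placeholder for a real specialist call (cluster API, Coralogix, PromQL, AWS).
    await asyncio.sleep(0.1)
    return {"agent": name, "task": task, "found": "ok"}


async def run_with_timeout(name: str, task: str) -> Dict:
    try:
        return await asyncio.wait_for(run_agent(name, task), timeout=PER_AGENT_TIMEOUT_S)
    except asyncio.TimeoutError:
        # A slow agent degrades gracefully instead of blocking the investigation.
        return {"agent": name, "task": task, "found": None, "error": "timeout"}


async def investigate(tasks: Dict[str, str]) -> list:
    # Total wall-clock time is bounded by the slowest agent, not the sum of all agents.
    return await asyncio.gather(*(run_with_timeout(a, t) for a, t in tasks.items()))


results = asyncio.run(investigate({
    "kubernetes": "check pod health and restarts",
    "coralogix": "scan error logs for payment failures",
    "promql": "query payment success rate over the last hour",
    "aws": "check load balancer errors and target health",
}))
```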

Step 4: Evidence Storage. As each agent completes, it writes structured evidence to Memory: the input (what was checked), the output (what was found), a one-line note (interpretation), and confidence level. This structured storage is critical because it creates a fact base that the Supervisor can reason over during synthesis.

Step 5: Correlation and Hypothesis Scoring. With all evidence collected, the Supervisor builds an incident timeline ordered by timestamp. Deployment at T, error spike at T+2 minutes, pod restarts at T+5 minutes. Temporal correlation is powerful; events happening in sequence suggest causality. The Supervisor generates multiple hypotheses (bad deployment, infrastructure issue, downstream dependency failure) and scores each based on evidence count, temporal correlation, and historical patterns. The highest-scoring hypothesis becomes the primary explanation.
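A toy version of that scoring might weight each piece of evidence by its confidence and its temporal proximity to the alert. The formula, weights, and sample evidence below are assumptions, not Viveka's actual model.

```python
# Toy hypothesis scoring: confidence weighted by temporal proximity to the alert.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class EvidenceItem:
    supports: str          # hypothesis this evidence points at
    observed_at: datetime
    confidence: float


def score_hypotheses(evidence: List[EvidenceItem], alert_time: datetime) -> Dict[str, float]:
    scores: Dict[str, float] = {}
    for e in evidence:
        # Evidence close to the alert in time is weighted more heavily.
        minutes_away = abs((e.observed_at - alert_time) / timedelta(minutes=1))
        temporal_weight = 1.0 / (1.0 + minutes_away)
        scores[e.supports] = scores.get(e.supports, 0.0) + e.confidence * temporal_weight
    # Highest-scoring hypothesis becomes the primary explanation.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


alert_time = datetime(2024, 1, 1, 2, 0)
ranked = score_hypotheses(
    [
        EvidenceItem("bad deployment", alert_time - timedelta(minutes=2), 0.9),
        EvidenceItem("bad deployment", alert_time + timedelta(minutes=5), 0.8),
        EvidenceItem("infrastructure issue", alert_time + timedelta(minutes=20), 0.4),
    ],
    alert_time,
)
```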

Step 6: RCA Generation. The Supervisor generates a human-readable summary including the root cause hypothesis, confidence score, key supporting evidence with citations, reasoning trail explaining the conclusion, and recommended next actions. This isn't just "here's what's wrong" but "here's what's wrong, here's the evidence, here's what you should do."
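The final summary could be rendered from those fields with something as simple as the template below. The layout, service name, and version numbers are invented for illustration.

```python
# Illustrative RCA rendering; the scenario details are invented, not real incident data.
def render_rca(hypothesis: str, confidence: float, evidence: list, actions: list) -> str:
    lines = [
        f"*Root cause (confidence {confidence:.0%}):* {hypothesis}",
        "*Supporting evidence:*",
        *[f"  - [{e['agent']}] {e['found']}" for e in evidence],
        "*Recommended next actions:*",
        *[f"  - {a}" for a in actions],
    ]
    return "\n".join(lines)


rca_text = render_rca(
    "Bad deployment of payments-service v2.14 shortly before the alert",
    0.85,
    [
        {"agent": "kubernetes", "found": "rollout of v2.14 completed 2 minutes before the alert"},
        {"agent": "coralogix", "found": "exception spike in payment-authorize after the rollout"},
    ],
    ["Roll back payments-service to the previous version", "Confirm success-rate recovery in Grafana"],
)
```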

Step 7: Slack Posting. The RCA gets posted to the relevant team's Slack channel in a threaded format under the original alert. This keeps conversation tied to the incident and provides visibility to the entire team. Engineers can review the analysis, provide feedback on accuracy, and discuss remediation approaches without switching tools.
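Posting the threaded reply is a single Slack API call. The sketch below uses slack_sdk, though the post doesn't specify the client library; the channel handling and environment variable name are assumptions.

```python
# Sketch of posting the RCA as a threaded reply under the original alert message.
import os

from slack_sdk import WebClient

slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])


def post_rca(channel_id: str, alert_message_ts: str, rca_text: str) -> None:
    # thread_ts keeps the analysis threaded under the alert in the team's channel.
    slack.chat_postMessage(channel=channel_id, text=rca_text, thread_ts=alert_message_ts)
```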

Why Multi-Agent Architecture Matters

You might wonder why we chose a multi-agent approach rather than a single powerful model. The answer reveals fundamental insights about building reliable AI systems for production operations.

Specialization beats generalization for complex domains. A single model trying to understand Kubernetes, AWS infrastructure, log patterns, metrics interpretation, and incident correlation would need enormous context and struggle with domain-specific nuances. Specialist agents can be optimized for their specific task, use domain-specific reasoning patterns, and maintain focused expertise.

Parallel execution dramatically reduces latency. A sequential investigation checking systems one after another would take minutes. Parallel agent execution means total time is bounded by the slowest check (typically 5-8 seconds), not the sum of all checks. This parallelism is critical for hitting the 90-second target.

Bounded context prevents token overflow. Each agent receives only the context it needs for its specific check. The Kubernetes Agent gets pod names and namespace, not the entire application architecture. This focused context prevents token limit issues that plague single-agent approaches with comprehensive context.

Memory-based synthesis reduces hallucinations. Rather than asking the LLM to remember everything from the investigation, we store facts in Memory and have the Supervisor reason over concrete evidence. This grounds the analysis in observable data rather than model-generated speculation.

Compositional improvement over time. When we improve an individual agent (better log analysis, more sophisticated metrics queries), all investigations automatically benefit. This modularity makes the system easier to enhance incrementally.

Results and Impact

The shift to AI-powered investigation produced measurable improvements. MTTI dropped by approximately 80%. Investigations that previously took 20-40 minutes of manual correlation now complete in 90 seconds. MTTR improved by 50-60% because faster investigation means faster remediation. Engineers can act on findings immediately rather than spending half their time figuring out what's wrong.

Consistency improved dramatically. Every alert of a given type follows the same investigation procedure, checking the same systems and correlating the same signals. Junior and senior engineers see identical analysis quality. This consistency also improves knowledge sharing; when the investigation is documented automatically, everyone learns from each incident.

After-hours coverage became genuine. Previously, automated alerting still required human investigation. Now the investigation happens automatically, and the on-call engineer receives a complete RCA alongside the alert. In many cases, the recommended action is clear enough that remediation can start immediately without additional investigation.

The system completes and posts its analysis within roughly 90 seconds for most alerts, and teams rate its accuracy through feedback in Slack threads. These ratings feed back into improving the RAG knowledge base and refining runbooks over time.

The Bottom Line

Project Viveka demonstrates that AI-powered incident investigation is practical, reliable, and dramatically faster than manual approaches. By encoding investigation workflows into orchestrated multi-agent systems, we've automated the most time-consuming phase of incident response while maintaining the quality and thoroughness of human investigation.

The 80% MTTI reduction isn't just a number; it represents minutes saved during every incident, which compounds into hours saved weekly and days saved annually. More importantly, it changes how engineers experience on-call. Rather than starting from scratch with every alert, they receive structured analysis immediately and can focus on remediation rather than diagnosis.

The multi-agent architecture is key to this success. Specialization enables domain expertise, parallelism enables speed, memory enables accurate correlation, and modularity enables continuous improvement. This isn't a single AI trying to do everything; it's a coordinated system of specialist AIs working together like an experienced SRE team.

If your organisation faces similar incident investigation challenges, the lessons apply broadly. Build specialist agents for different observability domains, orchestrate them with explicit workflows, ground analysis in stored evidence rather than model memory, and integrate directly into team communication channels. The technology exists, the patterns are proven, and the benefits are measurable.

editor: @paaarth96
