Engineering Autonomous Root Cause Analysis: Beyond LLM Heuristics
The challenge of automating on-call response is fundamentally a problem of signal-to-noise ratio and verifiable execution. While Large Language Models (LLMs) have demonstrated exceptional capabilities in code generation and textual reasoning, they struggle significantly with the "OpenRCA" problem—performing root cause analysis (RCA) on live telemetry data. The primary failure mode for naive AI integrations is the "hallucinatory path," where an agent attempts to infer causality from sparse or noisy metrics without a bounded problem space.
At Relvy, we have architected a system that shifts the paradigm from generative "problem solving" to deterministic, runbook-oriented execution. This article explores the engineering requirements for building a reliable, autonomous on-call agent that avoids the pitfalls of generic LLM agents.
The Problem: Why Generative RCA Fails
Current benchmarks indicate that even high-parameter models like Claude 3.5 Sonnet or GPT-4o struggle with RCA, often yielding accuracy metrics below 40%. The failure arises from three specific technical constraints:
- Context Overflow via High-Cardinality Data: Standard observability stacks generate terabytes of time-series data. Simply passing raw logs or unsampled spans into an LLM context window causes "attention dilution," where the model fails to prioritize the critical signal among thousands of noise events.
- Lack of Enterprise Context: An LLM does not know that a specific latency spike on `Endpoint_A` is "normal" behavior due to a cron job, whereas the same spike on `Endpoint_B` is a catastrophic failure.
- The Exploration Cost: In a production incident, time-to-mitigation (TTM) is the primary metric. A non-deterministic agent that explores irrelevant failure hypotheses consumes the limited incident window and damages trust.
Architectural Solution: Runbook-Anchored Agentic Workflows
To address these constraints, we moved away from open-ended reasoning. Instead, we anchor the agent in a Runbook State Machine. By constraining the agent to defined, deterministic steps, we transform an "unbounded investigation" into a "verification sequence."
1. Telemetry Abstraction Layers
We implemented a layer that performs pre-analysis before the LLM sees the data. Instead of raw logs, the agent interacts with specialized "tool interfaces" that provide summarized insights.
```python
# Conceptual tool interface for telemetry analysis
from statistics import mean, stdev

def detect_outliers(data, z_threshold=3.0):
    """Flag data points whose Z-score exceeds the threshold."""
    if len(data) < 2:
        return []
    mu, sigma = mean(data), stdev(data)
    if sigma == 0:
        return []
    return [x for x in data if abs(x - mu) / sigma > z_threshold]

class TelemetryTool:
    def __init__(self, datasource):
        self.ds = datasource

    def analyze_anomaly(self, metric_name, time_range):
        """
        Uses statistical anomaly detection (Z-score or STL decomposition)
        rather than asking the LLM to 'look for spikes'.
        """
        data = self.ds.get_metrics(metric_name, time_range)
        anomalies = detect_outliers(data)
        # Return a summarized representation rather than raw data points
        return {"anomalies_detected": len(anomalies), "anomalous_points": anomalies}

    def correlate_with_deployment(self, timestamp):
        """
        Query CI/CD metadata for code changes shipped shortly before the incident.
        """
        return self.ds.get_recent_commits(before=timestamp, limit=5)
```
By using these targeted tools, we reduce the token load significantly. The agent receives a structured JSON object describing the anomaly, which acts as a "ground truth" anchor, preventing the hallucination of non-existent error patterns.
2. Deterministic Reasoning via Runbook Graphs
We define runbooks as Directed Acyclic Graphs (DAGs). Each node represents a specific diagnostic action. When an alert fires, the Relvy agent traverses the DAG based on the results of the preceding step.
If a diagnostic step yields a result that exceeds a confidence threshold (e.g., an 80% correlation between a latency spike and a specific deployment ID), the agent moves to the mitigation phase. If the confidence is low, the agent surfaces a "notebook" for the human engineer, highlighting the ambiguous data.
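The traversal logic can be sketched as follows. This is a minimal illustration, not Relvy's actual schema: the `RunbookNode` structure, the diagnostic callables, and the 0.8 threshold are all assumptions standing in for the real runbook definition.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RunbookNode:
    """One diagnostic step in the runbook DAG (illustrative schema)."""
    name: str
    diagnostic: Callable[[], float]                # returns a confidence in [0, 1]
    on_confident: Optional["RunbookNode"] = None   # next step if confidence is high

def traverse(node: Optional[RunbookNode], threshold: float = 0.8):
    """Walk the DAG, recording each step; escalate to a human on low confidence."""
    trail = []
    while node is not None:
        confidence = node.diagnostic()
        trail.append((node.name, confidence))
        if confidence >= threshold:
            node = node.on_confident
        else:
            # Low confidence: stop and surface a notebook for the human engineer
            trail.append(("escalate_to_human", confidence))
            break
    return trail
```

A high-confidence path walks straight to mitigation; any ambiguous step short-circuits into the human-review branch, which is the behavior described above.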
Implementation: The Tooling Layer
Relvy utilizes a local-first deployment architecture (Docker/Helm) to minimize both security exposure and latency when accessing internal observability stacks like Datadog, Prometheus, or Honeycomb. The agent operates within the customer’s VPC, ensuring that proprietary codebases and sensitive telemetry do not leave the infrastructure perimeter.
The agentic loop is implemented via a specialized controller that manages three distinct threads:
- Observation Loop: Regularly polls observability sinks for anomalous state changes.
- Reasoning Thread: Uses a RAG-augmented LLM to match the current incident signature against existing runbook definitions.
- Action/Execution Layer: Executes approved CLI commands or API calls to perform mitigation (e.g., rolling back a deployment, restarting a service, or adjusting traffic weights).
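The three loops above can be sketched as queue-connected workers. This is a simplified sketch, not the actual controller: the function names, the one-second poll interval, and the plain `queue.Queue` hand-off are assumptions for illustration.

```python
import queue
import threading

def observation_loop(poll, alerts: queue.Queue, stop: threading.Event):
    """Regularly poll observability sinks and enqueue anomalous state changes."""
    while not stop.is_set():
        for alert in poll():
            alerts.put(alert)
        stop.wait(1.0)  # poll interval

def reasoning_loop(match_runbook, alerts: queue.Queue,
                   actions: queue.Queue, stop: threading.Event):
    """Match each incident signature against runbook definitions."""
    while not stop.is_set():
        try:
            alert = alerts.get(timeout=1.0)
        except queue.Empty:
            continue
        runbook = match_runbook(alert)  # e.g. a RAG lookup over runbook docs
        if runbook is not None:
            actions.put((alert, runbook))

def action_loop(execute, actions: queue.Queue, stop: threading.Event):
    """Execute approved mitigation steps (CLI commands or API calls)."""
    while not stop.is_set():
        try:
            alert, runbook = actions.get(timeout=1.0)
        except queue.Empty:
            continue
        execute(alert, runbook)
```

Decoupling the threads with queues means a slow reasoning step cannot block observation, and every mitigation passes through a single, auditable execution point.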
Designing for Trust: The Notebook UI
In high-stakes environments, a "black box" AI is unacceptable. We built a notebook-style output interface to maintain transparency. Every autonomous action taken by the agent is logged as a cell in the notebook, containing the input data, the reasoning process, and the resulting visualization.
```json
{
  "step": "Check Endpoint Latency",
  "status": "completed",
  "data": {
    "avg_latency": "450ms",
    "p99_latency": "1200ms",
    "anomaly_confirmed": true
  },
  "agent_thought": "P99 latency has deviated from the 7-day moving average by 3.2 standard deviations. Initiating segment analysis by shard ID."
}
```
This record allows engineers to review the agent's work post-incident. If the agent makes a wrong turn, the user can modify the runbook YAML configuration, essentially "training" the agent for future incidents without needing to re-fine-tune the base model.
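Concretely, that "training" step might look like patching one field in a runbook definition. The schema below is hypothetical (Relvy's real YAML layout is not shown here), and the definition is shown as a Python dict for a dependency-free sketch; in practice it would live in the runbook's YAML file.

```python
# Hypothetical runbook definition; in practice this is the runbook YAML.
runbook = {
    "name": "api-latency-spike",
    "steps": [
        {"id": "check_latency", "threshold": 0.8},
        {"id": "correlate_deploy", "threshold": 0.8},
    ],
}

def patch_step(runbook, step_id, **overrides):
    """Apply field overrides to one step, e.g. after a post-incident review."""
    for step in runbook["steps"]:
        if step["id"] == step_id:
            step.update(overrides)
            return runbook
    raise KeyError(f"no step named {step_id}")

# After reviewing a wrong turn, raise the confidence bar for that step:
patch_step(runbook, "correlate_deploy", threshold=0.9)
```

The point is that the correction is a declarative config change reviewed like any other diff, not a model retraining job.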
Overcoming the "Cold Start" Problem
One of the significant hurdles in adopting automated on-call tools is the lack of initial runbooks. We address this through an "observation-first" mode. When installed, Relvy monitors alerts and suggests candidate runbooks based on historical incident patterns.
We utilize a technique where we retrospectively analyze resolved tickets. By feeding historical incident logs and the associated mitigation actions into the agent, we can generate a baseline "Draft Runbook." The engineering team then simply reviews and approves these drafts. This significantly reduces the overhead of adopting Relvy in legacy environments where documentation is either outdated or non-existent.
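A minimal sketch of that retrospective pass, assuming tickets carry an alert signature and an ordered list of mitigation steps (the field names and the "most frequent sequence wins" heuristic are illustrative assumptions, not Relvy's actual pipeline):

```python
from collections import Counter, defaultdict

def draft_runbooks(resolved_tickets):
    """
    Group historical tickets by alert signature and propose a draft runbook
    from the most frequent mitigation sequence seen for each signature.
    Each ticket: {"signature": str, "mitigation_steps": [str, ...]}.
    """
    by_signature = defaultdict(Counter)
    for ticket in resolved_tickets:
        by_signature[ticket["signature"]][tuple(ticket["mitigation_steps"])] += 1
    drafts = []
    for signature, sequences in by_signature.items():
        steps, count = sequences.most_common(1)[0]
        drafts.append({
            "signature": signature,
            "steps": list(steps),
            "evidence_count": count,
            "status": "draft",  # awaiting human review and approval
        })
    return drafts
```

Each draft carries an evidence count so reviewers can see how often the proposed sequence actually resolved the incident before approving it.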
The Role of Local Execution
The critical distinction in our architecture is the decision to keep the agentic reasoning and tool execution as close to the data as possible. By installing Relvy within the user's environment, we solve two problems simultaneously:
- Security and Compliance: Data-at-rest stays within the perimeter. Only anonymized metadata is sent to the orchestration layer for agent planning.
- Latency: The agent interacts with internal APIs (Kubernetes, AWS, Datadog) over high-speed local networks, which is crucial when an incident is causing cascading failures.
Conclusion: Moving Towards Autonomous Resilience
The shift toward autonomous on-call is not about replacing human engineers; it is about automating the "drudge work" of the investigation. By combining deterministic runbook workflows with specialized observability tools, Relvy provides a structured environment where AI can perform RCA effectively, accurately, and safely.
The next evolution of this technology will likely involve cross-service dependency mapping, where the agent automatically maps an alert in a frontend service to a failing downstream microservice, further shortening the path to resolution.
For organizations looking to integrate autonomous on-call capabilities into their existing infrastructure, or for deep dives into building out scalable observability pipelines, we are available to assist. Our team specializes in bridging the gap between high-volume telemetry and actionable, AI-driven automation. Visit https://www.mgatc.com for consulting services.
Originally published in Spanish at www.mgatc.com/blog/relvy-ai-automated-on-call-runbooks/