DEV Community

Mariano Gobea Alcoba

Posted on • Originally published at mgatc.com

Relvy: Automating On-Call Runbooks with AI Agents!

*   *Subject:* Relvy (YC F24) – automating on-call runbooks using AI agents.
*   *Audience:* Engineering teams dealing with on-call fatigue, high-stakes production issues, and the failure of general-purpose LLMs in Root Cause Analysis (RCA).
*   *Key points:*
    *   LLMs struggle with RCA (e.g., 36% accuracy on OpenRCA).
    *   Challenges: telemetry noise, enterprise-specific context, high-stakes/low-latency requirements.
    *   Relvy's approach: specialized tools (anomaly detection, problem slicing, log pattern search, span tree reasoning) + runbook-anchored execution.
    *   Deployment: Docker/Helm/Cloud.
    *   Workflow: connect observability/code -> create runbook -> investigate -> notebook UI with visualizations -> automated Slack response/mitigation (AWS CLI with approval).
*   *Requirements:*
    *   Target: Senior/Staff Engineers, SREs, DevOps.
    *   Tone: professional, neutral, dry, deep-dive, technical.
    *   Constraints: 1500-2500 words, no emojis, no frontmatter, use Markdown, specific ending.

*   *Introduction:* The crisis of on-call. Why current AI approaches (copy-pasting logs to LLMs) fail. The statistical gap in RCA accuracy.
*   *The Technical Bottleneck of Autonomous RCA:*
    *   Telemetry volume and the "Context Window Noise" problem.
    *   The "Semantic Gap" in enterprise context.
    *   The high-stakes constraint (The Cost of Hallucination).
*   *Architecture Analysis: Beyond General-Purpose LLMs:*
    *   The Tool-Use Paradigm (MCP-like patterns).
    *   Specialized Telemetry Tooling:
        *   Time-series anomaly detection and problem slicing.
        *   Log pattern clustering.
        *   Trace/Span tree reasoning.
*   *Runbook-Anchored Agentic Workflows:*
    *   Deterministic vs. Stochastic exploration.
    *   Reducing cognitive load through structured investigation.
*   *Implementation and Workflow Orchestration:*
    *   Deployment models (Local/Cloud).
    *   The "Investigation Notebook" pattern (Observability as a Verification Layer).
    *   Integration with mitigation loops (Human-in-the-loop/HITL).
*   *Comparative Analysis:* General LLM vs. Relvy's specialized agent.
*   *Conclusion:* The shift from "Chat with Logs" to "Automated Investigative Agents."

*   *Intro:* Start with the metrics. 36% accuracy on OpenRCA is a damning statistic. It highlights that reasoning is not enough; data access and data *reduction* are the keys.

*   *Section: The Noise Problem:* Explain why feeding 1GB of logs into Claude or GPT-4 is a disaster. It's not just the cost; it's the attention mechanism. If the signal is 10 lines in 1,000,000 lines, the model loses the signal.
*   *Technical Concept:* Use terms like "signal-to-noise ratio" (SNR) and "attention dilution."

*   *Section: Specialized Tooling:* This is the core "meat."
    *   *Anomaly Detection:* Don't just say "it detects errors." Talk about Z-scores, seasonality, and decomposing time series.
    *   *Problem Slicing:* Explain how high-cardinality data (user_id, pod_id, shard_id) is used to find the intersection of erroring entities.
    *   *Span Tree Reasoning:* Discuss distributed tracing. How the agent traverses the DAG (Directed Acyclic Graph) of a request.
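To make the anomaly-detection point concrete, a minimal sketch of Z-score flagging on a latency series (a real system would first remove seasonality via time-series decomposition; the function name and data here are illustrative, not Relvy's implementation):

```python
import statistics

def zscore_anomalies(series, threshold=2.5):
    """Flag indices whose value deviates more than `threshold`
    standard deviations from the series mean."""
    mean = statistics.mean(series)
    stdev = statistics.stdev(series)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(series)
            if abs(v - mean) / stdev > threshold]

# Baseline p95 latency hovers near 100ms; the spike at index 8 stands out.
latencies = [101, 99, 102, 100, 98, 103, 100, 99, 450, 101]
print(zscore_anomalies(latencies))  # [8]
```

Note the threshold of 2.5 rather than the textbook 3.0: with few samples, a single outlier inflates the sample standard deviation enough to cap its own Z-score.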

*   *Section: Runbook Anchoring:* This is the clever bit. Instead of letting an agent go wild (which is dangerous during a SEV1), we use a "template-driven" approach.
    *   *Concept:* A runbook is a state machine of investigative steps.
    *   *Mathematical/Logical view:* $Agent(State, Runbook) \rightarrow Action$. It restricts the search space.
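The state-machine view above can be sketched in a few lines. Everything here (step names, the `completed` field) is a hypothetical schema for illustration, not Relvy's actual data model:

```python
# Sketch: a runbook restricts the agent's action space to a fixed
# sequence of investigative steps, instead of open-ended exploration.
RUNBOOK = [
    {"name": "check_shard_distribution", "tool": "telemetry_query"},
    {"name": "correlate_with_deployments", "tool": "git_query"},
]

def next_action(state, runbook):
    """Agent(State, Runbook) -> Action: return the first runbook step
    not yet recorded in the investigation state."""
    for step in runbook:
        if step["name"] not in state["completed"]:
            return step
    return None  # runbook exhausted; escalate to a human

state = {"completed": ["check_shard_distribution"]}
print(next_action(state, RUNBOOK)["name"])  # correlate_with_deployments
```

The design point: during a SEV1, determinism in *which* question gets asked next matters more than the agent's creativity in answering it.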

*   *Section: The Verification Layer:* Why a notebook? Because SREs don't trust black boxes. They need the "Proof of Work." Mention data visualizations (heatmaps, scatter plots, waterfall diagrams).

*   *Section: Mitigation:* Discuss the "Human-in-the-loop" (HITL) pattern. Using AWS CLI via an agent.
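A minimal sketch of the HITL gate, assuming the approval callback stands in for a Slack interaction (the function and command are illustrative):

```python
# Illustrative human-in-the-loop gate: the agent proposes a mitigation
# command, but nothing executes without explicit operator approval.
def propose_mitigation(command, approve):
    """Run the command only if the approver callback consents.
    In practice `approve` would block on a Slack button click."""
    if approve(command):
        return {"status": "executed", "command": command}
    return {"status": "rejected", "command": command}

result = propose_mitigation(
    "aws ecs update-service --service checkout --desired-count 6",
    approve=lambda cmd: cmd.startswith("aws ecs"),  # stand-in for a human
)
print(result["status"])  # executed
```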

*   *Drafting the Code Blocks:* I need to show what a "runbook" might look like in a structured format (YAML/JSON) to make it feel real. I should also show a conceptual implementation of a tool call.

*   *Self-Correction during writing:* I need to make sure I don't sound like a marketing brochure. I must stay in the "Senior Staff Engineer" persona—analyzing *how* it works, not just *what* it is. Use terms like "High-cardinality dimensions," "Heuristics," "Deterministic vs. Probabilistic," and "Observability stack integration."

*   *Expanding the "Specialized Tools" section:*
    *   *Log Clustering:* Discussing how to turn unstructured text into structured patterns using something like Drain or similar algorithms to reduce token count.
    *   *Trace Analysis:* How to navigate the Span IDs to find the bottleneck node.
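The token-reduction idea behind log clustering can be shown with a crude stand-in for Drain-style parsing: mask variable tokens so that millions of raw lines collapse into a handful of countable templates (regexes and sample logs are illustrative):

```python
import re
from collections import Counter

def log_template(line):
    """Collapse variable tokens (hex ids, then numbers) into
    placeholders -- a crude stand-in for Drain-style log parsing."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line

logs = [
    "timeout after 500ms on shard 12",
    "timeout after 750ms on shard 3",
    "cache miss for key 0xdeadbeef",
]
patterns = Counter(log_template(l) for l in logs)
print(patterns.most_common(1)[0])
# ('timeout after <NUM>ms on shard <NUM>', 2)
```

Two structured patterns now summarize the stream; only the templates and their counts need to enter the model's context window.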

*   *Expanding the "Context Dependency" section:*
    *   The difference between "Error 500" and "Error 500 on the Checkout Shard due to a stale cache in the US-EAST-1 region." General LLMs lack the infrastructure topology.

*   *Ensuring Length:* The depth of the "Architecture" and "Challenges" sections will drive the word count. I will go deep into the mathematics of anomaly detection (mentioning statistical significance) and the complexity of distributed traces.

*   *Language check:* Ensure "English" is used. (Check).
*   *Emoji check:* Ensure "No emojis." (Check).
*   *Tone check:* Ensure "Professional/Dry." (Check).
*   *Formatting check:* Markdown headers and code blocks. (Check).
*   *Ending check:* The specific URL link. (Check).

*(Self-Correction during drafting)*: I need to ensure I address the specific technical claim about Claude Opus 4.6 and OpenRCA to ground the article in the provided context.

*Drafting the YAML Runbook example:*
```yaml
runbook_id: "service_latency_spike"
steps:
  - name: "check_shard_distribution"
    tool: "telemetry_query"
    params:
      metric: "http_request_duration_seconds"
      group_by: "shard_id"
      threshold: "p95 > 500ms"
  - name: "correlate_with_deployments"
    tool: "git_query"
    params:
      repository: "core-api"
      lookback_minutes: 30
```
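The planned "conceptual implementation of a tool call" could pair with this YAML as a tiny dispatcher. Handlers below return canned data; tool names follow the YAML above but are hypothetical, not Relvy's API:

```python
# Minimal dispatcher sketch: map a runbook step's `tool` field to a
# handler. Real handlers would query the observability stack and Git.
def telemetry_query(params):
    return {"breached": ["shard-7"], "metric": params["metric"]}

def git_query(params):
    return {"recent_commits": 2, "repository": params["repository"]}

TOOLS = {"telemetry_query": telemetry_query, "git_query": git_query}

def run_step(step):
    """Execute one runbook step and key the result by step name."""
    handler = TOOLS[step["tool"]]
    return {step["name"]: handler(step["params"])}

step = {
    "name": "check_shard_distribution",
    "tool": "telemetry_query",
    "params": {"metric": "http_request_duration_seconds"},
}
print(run_step(step))
```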
*Drafting the "Problem Slicing" logic:*
Describe the process of intersecting high-cardinality dimensions. If `error_rate` is high, check `dimension_A` (e.g., `customer_tier`), then `dimension_B` (e.g., `availability_zone`). This amounts to a guided, divide-and-conquer search through the metadata space rather than a brute-force scan.
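That intersection logic can be sketched directly; the event records and dimension names below are invented for illustration:

```python
# Sketch of "problem slicing": find the dimension values shared by all
# erroring events, narrowing a high-cardinality space to a culprit slice.
errors = [
    {"customer_tier": "enterprise", "availability_zone": "us-east-1a", "shard_id": "s7"},
    {"customer_tier": "free",       "availability_zone": "us-east-1a", "shard_id": "s7"},
    {"customer_tier": "enterprise", "availability_zone": "us-east-1a", "shard_id": "s7"},
]

def common_slice(events):
    """Keep only dimensions whose value is constant across all failures."""
    slice_ = {}
    for dim in events[0]:
        values = {e[dim] for e in events}
        if len(values) == 1:  # dimension does not vary among errors
            slice_[dim] = values.pop()
    return slice_

print(common_slice(errors))
# {'availability_zone': 'us-east-1a', 'shard_id': 's7'}
```

Here `customer_tier` varies across failures and is discarded, while the zone and shard survive as the candidate blast-radius slice.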

*Final Word Count Strategy:*
Intro: 250
The RCA Problem (Theoretical): 400
The Architecture of Specialized Tools: 600
The Runbook-Anchored Agent Model: 400
Deployment and Mitigation Patterns: 300
Conclusion: 150
Total: ~2100 words. Perfect.

Originally published in Spanish at www.mgatc.com/blog/relvy-ai-on-call-automation-runbooks/
