Relvy (YC F24) – Automating on-call runbooks using AI agents.
Engineering teams dealing with on-call fatigue, high-stakes production issues, and the failure of general-purpose LLMs in Root Cause Analysis (RCA).
* LLMs struggle with RCA (e.g., 36% accuracy on OpenRCA).
* Challenges: Telemetry noise, enterprise-specific context, high-stakes/low-latency requirements.
* Relvy's Approach: Specialized tools (anomaly detection, problem slicing, log pattern search, span tree reasoning) + Runbook-anchored execution.
* Deployment: Docker/Helm/Cloud.
* Workflow: Connect observability/code -> Create runbook -> Investigate -> Notebook UI with visualizations -> Automated Slack response/Mitigation (AWS CLI with approval).
* Target: Senior/Staff Engineers, SREs, DevOps.
* Tone: Professional, neutral, dry, deep-dive, technical.
* Constraints: 1500-2500 words, no emojis, no frontmatter, use Markdown, specific ending.
* *Introduction:* The crisis of on-call. Why current AI approaches (copy-pasting logs to LLMs) fail. The statistical gap in RCA accuracy.
* *The Technical Bottleneck of Autonomous RCA:*
* Telemetry volume and the "Context Window Noise" problem.
* The "Semantic Gap" in enterprise context.
* The high-stakes constraint (The Cost of Hallucination).
* *Architecture Analysis: Beyond General-Purpose LLMs:*
* The Tool-Use Paradigm (MCP-like patterns).
* Specialized Telemetry Tooling:
* Time-series anomaly detection and problem slicing.
* Log pattern clustering.
* Trace/Span tree reasoning.
* *Runbook-Anchored Agentic Workflows:*
* Deterministic vs. Stochastic exploration.
* Reducing cognitive load through structured investigation.
* *Implementation and Workflow Orchestration:*
* Deployment models (Local/Cloud).
* The "Investigation Notebook" pattern (Observability as a Verification Layer).
* Integration with mitigation loops (Human-in-the-loop/HITL).
* *Comparative Analysis:* General LLM vs. Relvy's specialized agent.
* *Conclusion:* The shift from "Chat with Logs" to "Automated Investigative Agents."
* *Intro:* Start with the metrics. 36% accuracy on OpenRCA is a damning statistic. It highlights that reasoning is not enough; data access and data *reduction* are the keys.
* *Section: The Noise Problem:* Explain why feeding 1GB of logs into Claude or GPT-4 is a disaster. It's not just the cost; it's the attention mechanism. If the signal is 10 lines in 1,000,000 lines, the model loses the signal.
* *Technical Concept:* Use terms like "signal-to-noise ratio" (SNR) and "attention dilution."
* *Section: Specialized Tooling:* This is the core "meat."
* *Anomaly Detection:* Don't just say "it detects errors." Talk about Z-scores, seasonality, and decomposing time series.
* *Problem Slicing:* Explain how high-cardinality data (user_id, pod_id, shard_id) is used to find the intersection of erroring entities.
* *Span Tree Reasoning:* Discuss distributed tracing. How the agent traverses the DAG (Directed Acyclic Graph) of a request.
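The span-tree reasoning above can be made concrete with a small sketch: given a flat list of spans, compute each span's self time (its own duration minus the duration of its direct children) and surface the maximum. The span names and the flat input format here are hypothetical; a real agent would read spans from an OpenTelemetry-style backend.

```python
from collections import defaultdict

def find_bottleneck(spans):
    """Return (span, self_time_ms) for the span with the largest
    self time: own duration minus the duration of its children."""
    children = defaultdict(list)
    for s in spans:
        children[s["parent_id"]].append(s)
    best = None
    for s in spans:
        child_time = sum(c["duration_ms"] for c in children[s["span_id"]])
        self_time = s["duration_ms"] - child_time
        if best is None or self_time > best[1]:
            best = (s, self_time)
    return best

# Toy trace: the request spends most of its time inside db.query.
spans = [
    {"span_id": "a", "parent_id": None, "name": "GET /checkout", "duration_ms": 900},
    {"span_id": "b", "parent_id": "a",  "name": "auth.verify",   "duration_ms": 50},
    {"span_id": "c", "parent_id": "a",  "name": "db.query",      "duration_ms": 800},
    {"span_id": "d", "parent_id": "c",  "name": "cache.get",     "duration_ms": 20},
]
span, self_time = find_bottleneck(spans)
print(span["name"], self_time)  # db.query 780
```

Self time, rather than raw duration, is what matters: the root span always has the longest wall-clock duration, but the bottleneck is the node that *spends* the time itself.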
* *Section: Runbook Anchoring:* This is the clever bit. Instead of letting an agent go wild (which is dangerous during a SEV1), we use a "template-driven" approach.
* *Concept:* A runbook is a state machine of investigative steps.
* *Mathematical/Logical view:* $Agent(State, Runbook) \rightarrow Action$. It restricts the search space.
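The $Agent(State, Runbook) \rightarrow Action$ restriction above can be sketched as a tiny state machine: the agent may only choose among the transitions the current runbook step defines. All step and tool names here are hypothetical placeholders.

```python
# Runbook as a state machine: each state names one tool and the
# legal transitions out of it. The agent cannot invent actions.
RUNBOOK = {
    "start":     {"tool": "telemetry_query", "next": {"anomaly": "slice", "clean": "done"}},
    "slice":     {"tool": "problem_slicer",  "next": {"isolated": "correlate", "diffuse": "done"}},
    "correlate": {"tool": "git_query",       "next": {"found": "done", "not_found": "done"}},
}

def next_action(state, observation):
    """Agent(State, Runbook) -> Action: the search space is limited
    to the transitions the runbook defines for the current state."""
    step = RUNBOOK[state]
    new_state = step["next"].get(observation, "done")
    return step["tool"], new_state

tool, state = next_action("start", "anomaly")
print(tool, state)  # telemetry_query slice
```

The point of the sketch is the `next` table: a stochastic planner could wander anywhere in tool space during a SEV1, whereas this policy can only traverse edges a human wrote down.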
* *Section: The Verification Layer:* Why a notebook? Because SREs don't trust black boxes. They need the "Proof of Work." Mention data visualizations (heatmaps, scatter plots, waterfall diagrams).
* *Section: Mitigation:* Discuss the "Human-in-the-loop" (HITL) pattern. Using AWS CLI via an agent.
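The HITL gate can be illustrated with a minimal sketch: the agent proposes a mitigation command, but nothing executes until an operator approves it. The command string, approval mechanism, and dry-run flag are assumptions for illustration; a real system would post the proposal to Slack and execute under a scoped credential.

```python
import shlex
import subprocess

def propose_mitigation(command, approved_by=None, dry_run=True):
    """Gate a shell command behind explicit human approval.
    Without an approver, the proposal is only queued."""
    if approved_by is None:
        return {"status": "pending_approval", "command": command}
    if dry_run:
        # Approved, but execution is still suppressed in dry-run mode.
        return {"status": "approved", "command": command, "executed": False}
    result = subprocess.run(shlex.split(command), capture_output=True, text=True)
    return {"status": "approved", "executed": True, "rc": result.returncode}

# The agent proposes; no human has approved yet, so nothing runs.
print(propose_mitigation("aws ecs update-service --force-new-deployment"))
```

The design choice worth calling out: approval is a *parameter of execution*, not a log line after the fact, so the unapproved path cannot reach `subprocess.run` at all.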
* *Drafting the Code Blocks:* I need to show what a "runbook" might look like in a structured format (YAML/JSON) to make it feel real. I should also show a conceptual implementation of a tool call.
* *Self-Correction during writing:* I need to make sure I don't sound like a marketing brochure. I must stay in the "Senior Staff Engineer" persona—analyzing *how* it works, not just *what* it is. Use terms like "High-cardinality dimensions," "Heuristics," "Deterministic vs. Probabilistic," and "Observability stack integration."
* *Expanding the "Specialized Tools" section:*
* *Log Clustering:* Discussing how to turn unstructured text into structured patterns using something like Drain or similar algorithms to reduce token count.
* *Trace Analysis:* How to navigate the Span IDs to find the bottleneck node.
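The log-clustering step above can be sketched naively in the spirit of Drain: mask variable tokens (numbers, hex ids) so that millions of raw lines collapse into a handful of templates, which is what actually cuts the token count an LLM has to attend to. The regexes here are illustrative only; production parsers use tree-based matching, not two substitutions.

```python
import re
from collections import Counter

# Order matters: mask hex ids before bare numbers.
MASKS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def template(line):
    """Collapse a raw log line into its structural template."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

logs = [
    "timeout after 5000 ms on shard 7",
    "timeout after 5001 ms on shard 9",
    "connection reset by peer 0xdeadbeef",
]
counts = Counter(template(l) for l in logs)
for tpl, n in counts.most_common():
    print(n, tpl)
```

Three lines become two templates; at production scale the same idea turns gigabytes of logs into a frequency table that fits comfortably in a context window.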
* *Expanding the "Context Dependency" section:*
* The difference between "Error 500" and "Error 500 on the Checkout Shard due to a stale cache in the US-EAST-1 region." General LLMs lack the infrastructure topology.
* *Ensuring Length:* The depth of the "Architecture" and "Challenges" sections will drive the word count. I will go deep into the mathematics of anomaly detection (mentioning statistical significance) and the complexity of distributed traces.
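The z-score mechanics referenced above can be grounded in a toy detector: a point is anomalous if it sits more than `threshold` standard deviations from the mean of the preceding window. This is a deliberate simplification; a real pipeline would first remove seasonality (e.g. via STL decomposition) before scoring residuals.

```python
import statistics

def zscore_anomalies(series, window=10, threshold=3.0):
    """Indices whose value deviates from the trailing window's mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        ref = series[i - window:i]
        mu = statistics.fmean(ref)
        sigma = statistics.stdev(ref)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Stable ~100ms latency, then a spike at index 11.
latency = [100, 102, 99, 101, 100, 98, 103, 100, 99, 101, 100, 450]
print(zscore_anomalies(latency))  # [11]
```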
* *Language check:* Ensure "English" is used. (Check).
* *Emoji check:* Ensure "No emojis." (Check).
* *Tone check:* Ensure "Professional/Dry." (Check).
* *Formatting check:* Markdown headers and code blocks. (Check).
* *Ending check:* The specific URL link. (Check).
*(Self-Correction during drafting)*: I need to ensure I address the specific technical claim about Claude Opus 4.6 and OpenRCA to ground the article in the provided context.
*Drafting the YAML Runbook example:*
```yaml
runbook_id: "service_latency_spike"
steps:
  - name: "check_shard_distribution"
    tool: "telemetry_query"
    params:
      metric: "http_request_duration_seconds"
      group_by: "shard_id"
      threshold: "p95 > 500ms"
  - name: "correlate_with_deployments"
    tool: "git_query"
    params:
      repository: "core-api"
      lookback_minutes: 30
```
*Drafting the "Problem Slicing" logic:*
Describe the process of intersecting high-cardinality dimensions. If `error_rate` is high, check `dimension_A` (e.g., `customer_tier`), then `dimension_B` (e.g., `availability_zone`), and intersect the values that over-index on errors. This is effectively a guided dimensional search through the metadata space that isolates the failing slice.
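That slicing logic can be sketched as follows: for each high-cardinality dimension, compare per-value error rates against the global rate, and keep the values whose errors concentrate. The field names and the 1.5x threshold are hypothetical illustration, not a tuned heuristic.

```python
from collections import defaultdict

def slice_errors(events, dimensions):
    """For each dimension, return the values whose error rate is
    well above the global error rate (here: > 1.5x)."""
    global_rate = sum(e["error"] for e in events) / len(events)
    findings = {}
    for dim in dimensions:
        errors, totals = defaultdict(int), defaultdict(int)
        for e in events:
            totals[e[dim]] += 1
            errors[e[dim]] += e["error"]
        findings[dim] = [v for v in totals
                         if errors[v] / totals[v] > 1.5 * global_rate]
    return findings

# Errors concentrate in one AZ but are spread evenly across tiers.
events = [
    {"az": "us-east-1", "tier": "free", "error": 1},
    {"az": "us-east-1", "tier": "paid", "error": 1},
    {"az": "us-west-2", "tier": "free", "error": 0},
    {"az": "us-west-2", "tier": "paid", "error": 0},
]
print(slice_errors(events, ["az", "tier"]))
```

Here `az` localizes the fault (`us-east-1` over-indexes) while `tier` does not, which is exactly the intersection signal the agent is after.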
*Final Word Count Strategy:*
Intro: 250
The RCA Problem (Theoretical): 400
The Architecture of Specialized Tools: 600
The Runbook-Anchored Agent Model: 400
Deployment and Mitigation Patterns: 300
Conclusion: 150
Total: ~2100 words. Perfect.
Originally published in Spanish at www.mgatc.com/blog/relvy-ai-on-call-automation-runbooks/