DEV Community

Mymoon Shaik

How We Built RootLens: An AI Root Cause Analysis System Using Coral SQL

🚨 Introduction: The Problem Every Engineer Faces

Every engineering team eventually hits the same painful wall: production incidents.

A service goes down. Alerts fire everywhere. Logs, dashboards, and notifications start flooding in.

And suddenly, engineers are doing this:

Opening Sentry to check errors
Jumping to Datadog for metrics
Searching GitHub for recent deployments
Scrolling Slack for “what changed?” messages

Each tool holds a piece of the truth, but none of them connect the dots.

The real problem is not lack of data.
It is fragmentation.

Root cause analysis becomes a manual, stressful, and time-consuming process that can easily take 30–60 minutes per incident.

I wanted to fix that.

💡 The Idea: What if AI Could Do RCA in Seconds?

The core idea behind RootLens is simple:

What if we could automatically connect all engineering signals and identify the root cause of an incident instantly?

Instead of engineers manually correlating data, an AI system should:

Detect recent deployments
Match them with error spikes
Correlate with infrastructure metrics
Read incident discussions
And produce a final root cause report

That is how RootLens was born.

⚙️ What is RootLens?

RootLens is an AI-powered root cause analysis agent that automatically identifies the most likely cause of production incidents.

It connects:

GitHub → Pull requests & commits
Sentry → Errors & stack traces
Datadog → System metrics
Slack → Incident conversations

And produces a complete incident breakdown in under 10 seconds.

πŸ—οΈ Architecture: How RootLens Works

At a high level, RootLens follows this pipeline:

Incident Triggered
↓
RootLens AI Agent
↓
CORAL SQL LAYER
↓
GitHub ─ Sentry ─ Datadog ─ Slack
↓
Cross-Source JOIN Query
↓
AI Analysis (LLM)
↓
Root Cause Report + Dashboard
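
The pipeline above can be sketched as a small orchestration function. This is a minimal illustration, not the RootLens implementation: `run_pipeline`, `RootCauseReport`, and the two callables standing in for the Coral query and the LLM call are hypothetical names, and the stubs return made-up data so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence

Row = Mapping[str, object]


@dataclass
class RootCauseReport:
    """Hypothetical report shape; the real RootLens output format is an assumption."""
    incident_id: str
    summary: str
    evidence: Optional[Row]


def run_pipeline(
    incident_id: str,
    run_query: Callable[[str], Sequence[Row]],  # stand-in for the Coral SQL layer
    analyze: Callable[[Row], str],              # stand-in for the LLM call
) -> RootCauseReport:
    rows = run_query(incident_id)
    if not rows:
        # Nothing correlated: say so rather than guessing a cause.
        return RootCauseReport(incident_id, "No correlated deployment found.", None)
    top = rows[0]  # the query is assumed to return the best candidate first
    return RootCauseReport(incident_id, analyze(top), top)


# Stubs so the sketch runs without any external service.
def fake_query(incident_id: str) -> Sequence[Row]:
    return [{"pr_title": "Redis config change", "minutes_to_failure": 7}]


def fake_analyze(row: Row) -> str:
    return f"Likely cause: '{row['pr_title']}' ({row['minutes_to_failure']} min before first error)"


report = run_pipeline("INC-1", fake_query, fake_analyze)
print(report.summary)
```

Passing the query and analysis steps in as callables keeps each stage swappable, which makes the flow easy to exercise with stubs before any real source is connected.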

The most important component in this system is Coral.

🧠 The Core Innovation: Coral SQL Layer

Without Coral, building this system would require:

Writing 4 separate API integrations
Handling authentication for each tool
Managing pagination and rate limits
Normalizing inconsistent schemas
Writing custom logic to join data

This is weeks of engineering effort.
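
To make the schema normalization item concrete, here is a rough sketch of the kind of glue code a hand-rolled integration tends to need. It assumes one source reports timestamps as ISO 8601 strings and another as epoch seconds in a string; the payload fragments and field names are illustrative only, not taken from any tool's actual API response.

```python
from datetime import datetime, timezone


def from_iso(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2024-05-01T12:00:00Z' into UTC."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00")).astimezone(timezone.utc)


def from_epoch(value: str) -> datetime:
    """Parse epoch seconds given as a string, such as '1714564800.000000', into UTC."""
    return datetime.fromtimestamp(float(value), tz=timezone.utc)


# Hypothetical payload fragments; field names are illustrative only.
deploy = {"merged_at": "2024-05-01T12:00:00Z"}
message = {"ts": "1714565220.000000"}

deploy_time = from_iso(deploy["merged_at"])
message_time = from_epoch(message["ts"])

# Only once both values are timezone-aware UTC datetimes can they be compared.
minutes_apart = (message_time - deploy_time).total_seconds() / 60
print(minutes_apart)
```

Multiply this by every field and every source, and the "weeks of effort" estimate becomes easier to believe.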

With Coral, everything changes.

We use a single SQL query across all systems.

🧪 Example: Root Cause Query

Here is the core query powering RootLens:

SELECT
    g.title         AS pr_title,
    g.author        AS pr_author,
    g.merged_at     AS deploy_time,
    s.error_message AS first_error,
    s.first_seen    AS error_start,
    DATEDIFF('minute', g.merged_at, s.first_seen) AS minutes_to_failure,
    d.cpu_spike     AS cpu_at_incident,
    d.error_rate    AS error_rate_percent,
    sl.text         AS team_discussion,
    sl.author       AS who_responded

FROM github.pull_requests g

JOIN sentry.issues s
    ON s.first_seen BETWEEN g.merged_at AND DATEADD('hour', 1, g.merged_at)
    AND s.level = 'fatal'

JOIN datadog.metrics d
    ON d.timestamp BETWEEN g.merged_at AND s.first_seen
    AND d.service = g.repository

JOIN slack.messages sl
    ON sl.channel = '#incidents'
    AND sl.timestamp >= s.first_seen
    AND sl.timestamp <= DATEADD('hour', 2, s.first_seen)

WHERE g.merged_at >= DATEADD('hour', -2, NOW())

ORDER BY minutes_to_failure ASC
LIMIT 1;

This single query:

Finds recent deployments
Correlates them with fatal errors
Matches system metric spikes
Pulls incident conversation context
Ranks the most likely root cause
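
The time-window join at the heart of this query can be tried locally without Coral. The sketch below is an approximation under stated assumptions: it uses Python's built-in `sqlite3` module with two in-memory tables instead of live sources, swaps `DATEADD` and `DATEDIFF` for SQLite's `datetime()` and `strftime()` functions, covers only the deployment and error sources, and runs on invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pull_requests (title TEXT, author TEXT, merged_at TEXT);
    CREATE TABLE issues (error_message TEXT, level TEXT, first_seen TEXT);
    """
)

# Invented sample data: three merges and two errors on the same day.
conn.executemany(
    "INSERT INTO pull_requests VALUES (?, ?, ?)",
    [
        ("Update README", "bob", "2024-05-01 10:00:00"),
        ("Bump logging library", "carol", "2024-05-01 11:30:00"),
        ("Tune Redis pool size", "alice", "2024-05-01 12:00:00"),
    ],
)
conn.executemany(
    "INSERT INTO issues VALUES (?, ?, ?)",
    [
        ("Deprecation warning", "warning", "2024-05-01 12:02:00"),
        ("ConnectionError: Redis timeout", "fatal", "2024-05-01 12:07:00"),
    ],
)

QUERY = """
SELECT g.title, g.author, s.error_message,
       (CAST(strftime('%s', s.first_seen) AS INTEGER)
        - CAST(strftime('%s', g.merged_at) AS INTEGER)) / 60 AS minutes_to_failure
FROM pull_requests g
JOIN issues s
  ON s.first_seen BETWEEN g.merged_at AND datetime(g.merged_at, '+1 hour')
 AND s.level = 'fatal'
ORDER BY minutes_to_failure ASC
LIMIT 1
"""

# Two merges fall within an hour before the fatal error; the closest one wins.
suspect = conn.execute(QUERY).fetchone()
print(suspect)
```

Ranking by the shortest gap between merge and first fatal error is a heuristic: it surfaces the most recent change before the failure, which is a strong suspect but not proof of causation.
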
🧩 How Coral Makes This Possible

Coral acts as a cross-source query engine.

It handles:

🔐 Authentication across tools
📄 Schema mapping between systems
📦 Pagination automatically
🔗 Cross-source JOIN execution
⚡ Returning clean structured data

Instead of raw API noise, the AI receives ready-to-analyze structured context.

This is critical: without structured data, LLMs would struggle to reliably correlate signals.
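
As a rough illustration of what "ready-to-analyze structured context" can look like, the sketch below renders one joined result row as labeled evidence in a prompt. The `build_prompt` function, the prompt wording, and the sample row are all hypothetical; the actual prompt RootLens sends is not shown in this post.

```python
from typing import Mapping


def build_prompt(row: Mapping[str, object]) -> str:
    """Render one joined result row as labeled evidence for an LLM."""
    evidence = "\n".join(f"- {key}: {value}" for key, value in row.items())
    return (
        "You are helping with incident triage.\n"
        "Evidence from one joined query row:\n"
        f"{evidence}\n"
        "State the most likely root cause and cite the fields you relied on."
    )


# Invented row using column aliases from the query above.
row = {
    "pr_title": "Redis config change",
    "first_error": "ConnectionError: Redis timeout",
    "minutes_to_failure": 7,
}
prompt = build_prompt(row)
print(prompt)
```

Because every value arrives already named and aligned in time, the model is asked to explain a correlation rather than to discover one in raw logs.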

🚀 Demo Flow: What Happens in Real Time

A PR is merged (e.g., a Redis config change)
The system starts failing
Sentry reports fatal errors
Datadog shows a CPU spike
The Slack channel lights up with alerts
RootLens runs the Coral query
The AI analyzes the result
A root cause report is generated

Output includes:

guilty PR
first error trace
system metrics spike
Slack discussion context
confidence score

All in under 10 seconds.
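
How the confidence score is computed is not covered here. One simple heuristic, offered only as an assumption for illustration, is to let confidence decay linearly with the gap between the merge and the first fatal error, reaching zero at the edge of the one-hour window used in the query. The `confidence` function and its parameters are hypothetical.

```python
def confidence(minutes_to_failure: float, window_minutes: float = 60.0) -> float:
    """Linear decay: 1.0 when the error follows the merge immediately,
    0.0 at the edge of the correlation window or outside it."""
    if minutes_to_failure < 0 or minutes_to_failure > window_minutes:
        return 0.0
    return 1.0 - minutes_to_failure / window_minutes


for gap in (0, 7, 30, 60, 90):
    print(gap, round(confidence(gap), 2))
```

A production score would likely weigh more signals, such as whether the stack trace touches files changed in the PR, but a time-based term like this one is an easy starting point.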

📊 Impact: Before vs After RootLens

| Metric | Before | After |
|---|---|---|
| Time to root cause | 30–60 min | < 10 sec |
| Tools opened | 4–6 | 0 |
| Context switching | High | None |
| Postmortem writing | Manual | Auto-generated |
| Engineer stress | High | Low |

🔥 Key Learnings

Building RootLens taught me:

  1. Observability data is powerful but fragmented

Each tool holds critical context, but none of them talk to each other.

  2. Correlation is harder than detection

Detecting errors is easy. Linking them to deployments is the real challenge.

  3. AI is only as good as its context

Structured, joined data dramatically improves LLM reasoning.

  4. Unified query layers change everything

Coral transforms multi-system complexity into a single query interface.

🧭 Final Thoughts

RootLens is not just an AI tool.

It is a shift in how we think about debugging production systems.

Instead of manually hunting for root causes, we can now ask:

“What broke and why?”

And get a precise answer in seconds.

That is the future of incident analysis.

πŸ΄β€β˜ οΈ Built for

Pirates of the Coral-bean Hackathon
Track: Enterprise Agent
Powered by Coral SQL
