Khadirullah Mohammad

Posted on May 31 • Originally published at khadirullah.com

Building a DevOps Incident Investigator with Coral SQL — From 15 Minutes to 15 Seconds

#devops #coral #hackathon #python

🏴‍☠️ Built for the Pirates of the Coral-bean hackathon by WeMakeDevs | May 25–31, 2026

TL;DR

Built a DevOps Incident Investigator using Coral SQL that correlates GitHub PRs, Sentry incidents, and Slack incident context using a single SQL query.

Coral turns operational debugging into a single SQL query across distributed systems.

Results:

📉 Incident triage reduced from ~15 minutes to ~15 seconds.
🪸 Custom Coral source spec created for internal APIs.
🤖 AI-generated root cause analysis and Slack alerts.
💻 Includes both a CLI and a Web Dashboard.

Built in 4 days for the Pirates of the Coral-bean hackathon.

The Problem

As a DevOps engineer, I've experienced the pain of incident investigation firsthand — switching between GitHub, Sentry, and Slack at 2 AM trying to figure out what broke production.

GitHub — "Which PR was deployed last?"
Sentry — "What errors are spiking?"
Slack — "What's the team saying?"

That's 3 tabs, 3 APIs, and 15 minutes of context-switching before you even understand what happened. This is what modern incident response looks like without unified observability.

The Solution

When I discovered Coral, an open-source tool that lets you query any API with SQL, I knew exactly what to build.

One command. Three sources. Full incident picture.

python3 investigator.py --query all --owner khadirullah --repo demo-payment-api

The Magic: Cross-Source JOINs

Coral allows querying GitHub and Sentry together in a single statement. For this prototype, I use deployment timing to surface likely suspect PRs that were merged around the same time incidents first appeared.

SELECT p.number AS pr_number, p.title AS pr_title,
       p.user__login AS pr_author, p.merged_at,
       i.short_id AS sentry_id, i.title AS error_title,
       i.level AS error_level, i.first_seen AS error_first_seen
FROM github.pulls p
JOIN sentry.issues i ON p.merged_at IS NOT NULL
WHERE p.owner = 'khadirullah' AND p.repo = 'demo-payment-api'
  AND p.state = 'closed'
ORDER BY i.first_seen DESC, p.merged_at DESC LIMIT 15

🔥 This is the money shot. PRs merged on the same day as errors appeared, side by side. "PR #820 merged at 12:18 PM, errors started at 12:20 PM" — instant suspect identification.

Why I Chose Coral (and Why It's a Game-Changer)

Normally, if you want to pull data from GitHub, Sentry, and Slack, you end up installing multiple SDKs (PyGithub, sentry-sdk, slack-sdk), learning different APIs, handling pagination and retries yourself, and writing glue code to correlate everything manually.

(Note: SDKs like sentry-sdk are excellent for emitting telemetry from applications, but they are still siloed when it comes to querying and correlating operational data across multiple platforms during an incident.)

What Coral Replaced

Before Coral:

GitHub SDK ──┐
Sentry SDK ──┤──→ Custom Glue Code + Pagination + Rate Limiting
Slack SDK  ──┘

After Coral:

Coral SQL ──→ SELECT * FROM github JOIN sentry JOIN slack

By treating the entire operational stack as a unified query layer, I was able to build my agent using far fewer SDKs and almost no glue code. Coral handles the authentication, pagination, and schema mapping locally, allowing me to focus on the actual business logic: writing cross-source JOINs and generating AI analysis.

Day 1: Setting Up Coral & Connecting Sources

Installing Coral

curl -fsSL https://withcoral.com/install.sh | sh
coral --version
# coral 0.3.0+96d61f7

One command. That's it.

Connecting GitHub

ℹ️ For the GitHub token, I created a classic PAT at github.com/settings/tokens with zero scopes selected. For public repos, you don't need any permissions — the token just bumps your API rate limit from 60 to 5,000 requests per hour.

GITHUB_TOKEN=ghp_XXXXX coral source add github

362 tables connected instantly — issues, pull requests, commits, repos, actions.

Connecting Sentry

Finding the Sentry token wasn't obvious. Sentry doesn't have a simple "API Tokens" page — you create tokens through Custom Integrations:

Settings → Custom Integrations → Create New Integration (Internal Integration)
Name it coral-hackathon with minimal read-only permissions (Project: Read, Issue & Event: Read, Organization: Read)
Copy the generated token and connect:

SENTRY_TOKEN=sntrys_XXXXX SENTRY_ORG=my-org coral source add sentry

12 tables connected — events, issues, projects, deployments.

I verified the tables were available with a quick SELECT schema_name, table_name FROM coral.tables.

Connecting Slack

Slack was the trickiest. Coral's pre-filled link creates a Slack app, but it only sets up User Token Scopes with PKCE — tokens aren't displayed in the UI.

⚠️ The fix: Add Bot Token Scopes instead of User Token Scopes. In OAuth & Permissions, scroll to Bot Token Scopes and add: channels:history, channels:read, groups:history, groups:read, users:read. Then reinstall the app — a Bot User OAuth Token (xoxb-...) appears!

coral source add --interactive slack
# Paste xoxb-... token when prompted

All Sources Connected! 🎉

Source	Tables	What It Gives Us
GitHub	362	PRs, commits, issues, actions
Sentry	12	Errors, events, projects
Slack	2	Channels, users
Total	376	All queryable with SQL

Architecture

Here's how the entire system fits together:

Two interfaces (CLI + Web Dashboard) sit on top of Coral SQL, which unifies 4 data sources into a single query layer. The AI layer uses Gemini Flash for root cause analysis and natural language → SQL generation.

Day 2: Writing SQL Queries & Building the CLI

Generating Test Data for Sentry

First roadblock: Sentry was empty. I wrote generate_errors.py to send 10 realistic DevOps errors.

Errors include: ConnectionError (PostgreSQL max connections), MemoryError (OOM kill), RuntimeError (K8s CrashLoopBackOff), TimeoutError (30s timeout), and more.

The SQL Queries

Query 1: Deployments — "What was recently deployed?"

SELECT number, title, user__login AS author, merged_at
FROM github.pulls
WHERE owner = 'khadirullah' AND repo = 'demo-payment-api'
  AND merged_at IS NOT NULL
ORDER BY merged_at DESC LIMIT 10

Query 2: Incidents — "What errors are happening?"

SELECT short_id, title, level, count AS event_count, first_seen, last_seen
FROM sentry.issues
ORDER BY last_seen DESC LIMIT 10

Query 3: Correlation — "Suspect deployment identification"

(As shown in the introduction, this cross-source JOIN correlates GitHub PRs with Sentry errors based on timestamps).

Query 4: Team Overview — Another cross-source JOIN:

SELECT u.name AS username, u.real_name, u.is_admin,
       c.name AS channel_name, c.num_members
FROM slack.users u CROSS JOIN slack.channels c
WHERE u.deleted = false AND c.is_archived = false

Building the CLI

The CLI wraps everything in a clean, colorful interface with zero pip dependencies — Python stdlib only (subprocess, argparse, json, urllib):

python3 investigator.py --query all \
  --owner khadirullah --repo demo-payment-api \
  --slack-token $SLACK_TOKEN

Troubleshooting Gotchas

⚠️ Sentry Project ID: My first query failed with "Invalid project parameter. Values must be numbers." I was using the slug (python) but Sentry wants the numeric ID. Fixed with:
SELECT id, slug, name FROM sentry.projects
⚠️ Slack Bot Permissions: Got not_in_channel error when fetching messages. The bot had channels:history but wasn't in the channel. Then hit missing_scope — needed channels:join scope. Quick fix: add scope → reinstall app → copy new token.

Day 3: Web Dashboard + AI Integration

Why Build a Dashboard?

A CLI is functional, but judges have 5 minutes. They need to see the data. I built a Flask dashboard with a dark glassmorphism design.

The Dashboard

Features:

Stats Row — Live counters for deployments, incidents, correlations, risky PRs
Incident Timeline — Horizontal scrolling event sequence
Deployments Table — Merged PRs with authors and timestamps
Sentry Incidents — Severity-colored cards with "🤖 Analyze" button
Correlation View — Visual PR ↔ Error mapping
Slack Messages — Chat-style from #incidents
Risky PRs — Risk-scored with green/yellow/red bars
Team Overview — Member cards from Slack users × channels JOIN

AI Root Cause Analysis

Click 🤖 Analyze on any Sentry error.

For example, PROD-41A (PostgreSQL max connections):

Root Cause: PR #487 introduced a new connection pooling layer that eagerly opens connections on pod startup. With 4 pods each opening 25 connections, the default max_connections=100 limit is immediately exhausted.

Immediate Fix: 1) Roll back PR #487, 2) Increase max_connections to 200, 3) Restart affected pods

Confidence: 94% (Coral system score)

(Note: This example is generated from demo data and illustrates the style of analysis produced by the assistant.)

In live mode with a Gemini API key, it calls Gemini Flash in real-time. Pre-computed demos ensure instant results without API waits.

Settings & Token Security

All 4 API tokens encrypted with Fernet symmetric encryption before writing to disk
"Delete All Tokens" does a 3-pass random overwrite
Demo → Live toggle with per-panel badges

Demo Mode

The dashboard starts in Demo Mode — pre-loaded data, zero setup. Switching to live is 4 clicks:

⚙️ Settings → enter tokens
💾 Save
Click DEMO badge → LIVE
🔄 Refresh

If a live call fails, panels gracefully fall back to demo data.

Day 4: Competitive Upgrades — Going Beyond

Custom Source Spec: Querying Internal APIs

Enterprises don't just use public SaaS tools. A real incident investigator needs to query internal microservices. This is where Coral's Custom Source Specs shine.

I built a companion project (demo-payment-api) and wrote a YAML spec to teach Coral how to talk to it:

name: payment_api
version: 0.1.0
dsl_version: 3
backend: http
base_url: "http://localhost:5001"

auth:
  type: HeaderAuth
  headers:
    - name: Authorization
      template: "Bearer {{input.PAYMENT_API_TOKEN}}"

tables:
  - name: health
    request:
      method: GET
      path: /api/health
    columns:
      - name: status
        type: Utf8
        expr: { kind: path, path: [status] }
      - name: response_time_ms
        type: Int64
        expr: { kind: path, path: [response_time_ms] }

The YAML acts as a "translator" — it tells Coral how to map SQL concepts (tables, columns) to HTTP concepts (endpoints, JSON paths):

User ──→ Coral: SELECT status FROM payment_api.health
Coral ──→ YAML: Look up "health" table
YAML ──→ Coral: GET /api/health
Coral ──→ API: HTTP GET http://localhost:5001/api/health
API ──→ Coral: {"status": "healthy", "response_time_ms": 42}
Coral ──→ User: | status  | response_time_ms |
                | healthy |       42         |

# Lint the spec
coral source lint coral-config/payment-api.yaml

# Add to Coral
PAYMENT_API_TOKEN=mock_123 coral source add --file coral-config/payment-api.yaml

# Query like a database!
coral sql "SELECT * FROM payment_api.health"

ℹ️ This proves the tool can connect to any internal enterprise service — not just the big SaaS providers. Read more in our dedicated Chart New Waters deep-dive.

Natural Language to SQL (`/api/ask`)

Type a question in plain English → AI generates Coral SQL → executes it → returns results:

User: "Show me critical errors from last week"
  → Gemini Flash
  → SELECT * FROM sentry.issues WHERE level = 'error' LIMIT 10
  → Coral SQL Engine
  → Results Table

The backend feeds the live Coral schema (all tables + columns) to Gemini, so it generates accurate SQL every time.

Automated Slack Alerts

Investigation is only half the battle — communication is the other half. After AI generates a root cause analysis, users can click "📢 Send to Slack" to push the report directly to #incidents:

🚨 Sentry Error → 🤖 AI Analysis → 📢 Send to Slack → #incidents channel → Team sees report

If the token only has read permissions, the UI catches the missing_scope error and gracefully prompts the user to add chat:write scope.

Handling Slack Messages (TVF vs Custom Fallback)

Coral exposes Slack messages via a function-like table interface (TVF), which requires passing the exact channel ID directly into the SQL query: SELECT * FROM slack.messages(channel => 'C12345678').

However, I needed something more robust for an automated dashboard. If the bot isn't already in the incident channel, the Slack API returns a strict not_in_channel error. Instead of failing, I deliberately built a custom Python fallback that catches this error, dynamically forces the bot to join the channel, fetches the messages, and then uses Coral to pull the slack.users table to map the raw User IDs (e.g., UXXXXXXX) to real human names.

Fetching the Team Overview

With the messages handled resiliently, I still needed to display the active Incident Response team. Instead of making separate API calls for users and channels, I let Coral grab the entire team landscape in a single round-trip using a CROSS JOIN:

SELECT u.name AS username, u.real_name, u.is_admin,
       c.name AS channel_name, c.num_members
FROM slack.users u CROSS JOIN slack.channels c
WHERE u.deleted = false AND c.is_archived = false

A quick Python iteration separates the results, giving us the full team and channel rosters instantly without juggling multiple Slack SDK requests.

Why Sentry (Not Just SonarQube & Trivy)

A question I got: "Why do you need Sentry if you have SonarQube and Trivy?" The answer:

Before Deploy:  SonarQube → "Code COULD break"
                Trivy     → "Has known CVEs"
                      ↓
                   Deploy
                      ↓
After Deploy:   Sentry    → "App JUST broke for 1,203 users"

Tool	Stage	Catches
SonarQube	Pre-deploy (CI)	Code smells, potential bugs
Trivy	Pre-deploy (CI)	Known CVEs in dependencies
Sentry	Post-deploy (Runtime)	Real crashes, right now, with stack traces + user impact

The Incident Investigator bridges all three stages — correlating code changes (GitHub) with runtime errors (Sentry) and team communication (Slack).

Setting Up Slack Alerts (Webhooks vs Bot Tokens)

We use both approaches:

Feature	Incoming Webhook	Bot Token (xoxb-)
Direction	App → Slack (one-way)	App ↔ Slack (two-way)
Use case	Post alerts	Read messages + respond
Setup	Just a URL	OAuth scopes
Best for	Simple alerting	Complex integrations

Webhook → Demo Payment API posts error alerts to #incidents
Bot Token → Incident Investigator reads messages from #incidents via Coral SQL

The complete pipeline:

Error in Payment API
  ├──→ Sentry captures exception ──→ Coral SQL queries ──┐
  └──→ Slack webhook fires ──→ #incidents channel ──────┤
                                                         ├──→ Incident Investigator
  GitHub PRs ──────────────────────────────────────────┘       │
                                                               ▼
                                                        Gemini AI Analysis
                                                               │
                                                               ▼
                                                        📢 Push to Slack #incidents

What I Learned

About Coral

Zero-scope GitHub tokens work — just bumps your rate limit for public repos
Sentry wants numeric project IDs — not slugs, query sentry.projects first
Slack's Coral source is limited — only channels and users, no messages. But the bot token works with direct API calls
Cross-source JOINs are the killer feature — GitHub × Sentry in one SQL statement is genuinely powerful

About DevOps Incident Response

The hardest part of incident response isn't fixing the problem — it's finding the right information. We spend more time context-switching than debugging. The investigator answers three questions fast:

What was deployed recently?
What errors appeared after deployment?
What's the team saying about it?

That's 80% of the first 15 minutes of any incident.

About Hackathons

Ship fast, iterate later — the Discord advice was spot-on
Document as you go — writing the blog alongside the code was more efficient
Errors are content — every missing_scope became a blog section
Keep scope tight, then expand — CLI first, dashboard second, AI third

Limitations

Correlation heuristic: GitHub ↔ Sentry correlation is currently based on deployment timing, not deterministic tracing.
AI as an assistant: Root cause analysis is AI-assisted and should be treated as a hypothesis, not absolute truth.
Slack capabilities: Slack messages are retrieved through a Python fallback integration because the native Coral Slack source currently exposes only users and channels.
API rate limits: In a high-traffic live incident, direct API calls to Slack and GitHub could hit rate limits, requiring an intermediate caching layer.
Schema mismatches: Edge cases in custom internal APIs may occasionally fail to map perfectly to Coral's static YAML types without custom data coercions.

The Result

In testing, the workflow reduced incident investigation from roughly 15 minutes of manual context-switching to less than 15 seconds for an initial deployment-to-error correlation query.

The Impact

Before (The Old Way):

❌ Open GitHub to check recent PRs
❌ Open Sentry to check recent errors
❌ Open Slack to read team discussions
❌ Manually correlate timestamps across 3 tabs

After (The Coral Way):

✅ Run one command or view one dashboard
✅ Instantly see PRs and Errors side-by-side
✅ Get AI-generated root cause analysis

Technical Achievements

Feature	Details
6 SQL Queries	Deployments, incidents, correlation, risky PRs, team, health
2 Cross-Source JOINs	GitHub × Sentry, Slack users × channels
1 Custom Source Spec	`payment-api.yaml` for internal microservice
AI Analysis	Gemini-powered root cause + fix suggestions
NL-to-SQL	Ask questions in English, get Coral SQL results
2 Interfaces	CLI (zero deps) + Web Dashboard (Flask)
Security	Fernet symmetric encryption at rest
Docker + CI	Production-ready packaging + GitHub Actions

Try It

git clone https://github.com/khadirullah/devops-incident-investigator
cd devops-incident-investigator
pip install flask google-generativeai cryptography
python3 app.py
# Open http://localhost:5000 — works instantly with demo data!

🔗 View on GitHub

Future Work

Direct Release Correlation: Correlate Sentry releases directly to GitHub commits using commit SHAs.
More Sources: Add Grafana and Kubernetes log sources to the SQL engine.
Automated Postmortems: Use the AI layer to generate and publish full incident postmortems automatically.

Built with Coral for the Pirates of the Coral-bean hackathon by WeMakeDevs.

Follow the journey: khadirullah.com | GitHub

Originally published on khadirullah.com

Top comments (2)

Harjot Singh • May 31

The 15-minutes-to-15-seconds win is real and the reason it works is the correlation, not the speed: incident triage is slow because the truth is scattered across the PR that shipped, the Sentry spike, and the Slack thread where someone already half-diagnosed it, and a human is just doing expensive manual joins across tabs. Collapsing that into one query over distributed sources is the actual unlock. The AI root-cause-analysis layer is where I'd add one note of caution though: correlation makes "what changed around the time it broke" trivial, but the model proposing a root cause is a hypothesis, not a verdict, and a confidently-wrong RCA in an incident can send people debugging the wrong thing under pressure. The pattern I'd keep is AI-surfaces-the-correlated-evidence, human-confirms-causation, the speed should buy better judgment, not replace it. That correlate-fast-but-verify-the-conclusion discipline is exactly how I think about agent-assisted ops in Moonshift. Does it rank the likely cause with the evidence attached, or commit to a single root cause in the Slack alert?

Khadirullah Mohammad • Jul 3

Hey, sorry for the late reply on this. Things got busy and I wanted to respond properly rather than rush it.

Really appreciate the thoughtful breakdown. The “AI surfaces evidence, human confirms causation” framing is exactly right.

To answer your question directly, the current version does produce a single root cause with a confidence score rather than a ranked set of hypotheses. That’s a fair point, and I agree that showing multiple possibilities with supporting evidence would be a meaningful improvement for the next iteration.

Just to give some context, this was a hackathon prototype built in about four days. The goal was to demonstrate a concept: that Coral SQL can unify scattered operational data from GitHub PRs, Sentry errors, and Slack conversations into a single queryable layer, and that layering AI on top of that can significantly improve incident triage.

On the AI reliability side, I completely agree with your concern. But that’s not specific to this project, it’s true for any AI-assisted system today. No model can guarantee a correct root cause, which is why the blog also explicitly mentions that RCA is AI-assisted and should be treated as a hypothesis, not truth.

The real value here is the correlation layer. The system surfaces linked PRs, error spikes, and Slack context in a single timeline so engineers can see everything in one place. The AI is just an additional layer to help summarize and guide attention, not to replace human judgment. The engineer always remains the final decision maker.

So yes, it currently outputs a single likely cause, but everything is reviewed by humans before any action is taken. And I agree with you that ranking hypotheses with evidence would be a better direction going forward.

Thanks again for taking the time to write such a detailed and thoughtful comment. Really appreciate it.