DEV Community

Cover image for IncidentOS AI — We Built a Self-Learning SRE Brain at HackBaroda 2026
Aagam Shah
Aagam Shah

Posted on

IncidentOS AI — We Built a Self-Learning SRE Brain at HackBaroda 2026

Submitted to HackBaroda 2026 by Khusavant Choudhary, Aagam Shah, and Krrish Raj


It's 20:30 PM. An alert fires. Your database connections are maxed out, API latency is at 8 seconds, and users are reporting errors. You're on-call. You're tired. And you have to figure out — from scratch — what's going on.

Now imagine if your incident management system already knew. Not a generic "check your logs" suggestion — but an actual SRE-grade response: "Based on 12 similar past incidents, root cause is connection pool exhaustion. Confidence: 91%. Here's the exact runbook."

That's what we built at HackBaroda 2026. We called it IncidentOS AI.


🚨 The Problem

Modern engineering teams face a brutal paradox: the same incidents keep happening, the same runbooks get rewritten, and the same on-call engineer at 3 AM starts from zero every single time.

Existing tools like PagerDuty and Opsgenie are great at alerting. They're not great at learning. They don't remember that last week's DNS failure looked exactly like this one, or that the fix was "flush the resolver cache on nodes 4 and 7."

We wanted to build something that gets smarter with every incident it sees.


💡 What We Built

IncidentOS AI is an AI-powered incident management backend that:

  • Analyses new incidents using Groq's llama-3.3-70b-versatile LLM with a strict SRE persona — no filler phrases, no generic advice
  • Queries its own memory for similar past incidents before generating a response
  • Gives a confidence score based on how many similar resolved incidents it found (0 matches = 20%, 1 = 60%, 2+ = 85%+)
  • Generates numbered runbooks with exact commands, config keys, and thresholds
  • Permanently stores every resolved incident in Hindsight Cloud — so the knowledge is never lost

The more incidents it resolves, the smarter it gets. It's a flywheel.


🛠️ The Stack

Layer Technology
API FastAPI (Python)
LLM Groq API — llama-3.3-70b-versatile
Cloud Memory Hindsight Cloud (retain() / recall() / retain_batch())
Embeddings sentence-transformers/all-MiniLM-L6-v2 (384-dim)
Local Fallback incident_memory.json (JSON + cosine similarity)

Why Groq?

Speed. During a live incident, nobody wants to wait 10 seconds for a response. Groq's LPU inference is fast enough to feel like a real-time assistant, not a batch job.

Why Hindsight?

Hindsight is a cloud memory layer built specifically for AI agents. It stores information as NLP-extracted facts and lets you recall() semantically relevant memories in plain English. This made it a perfect fit for "what do I know about incidents like this one?"

Every incident we resolve gets retain()-ed into Hindsight with full metadata — incident_id, root_cause, mitigation_steps, status. When a new incident comes in, we recall() against it and get back the most relevant resolved memories with confidence scores.


⚙️ Architecture

HTTP Client
  │
  ▼
FastAPI (main.py)
  ├── POST /incident/new
  │     ├─ memory.find_similar_incidents()  ◄── Hindsight recall()
  │     ├─ agent.analyze_incident()         ◄── Groq LLM
  │     └─ agent.suggest_actions()
  │
  ├── POST /incident/resolve
  │     └─ memory.resolve_incident()        ──► Hindsight retain() (with metadata)
  │
  └── POST /sync                            ──► memory.sync_from_hindsight_cloud()

Memory Layer (memory.py)
  Primary:   Hindsight Cloud   (retain / recall / list_memories / retain_batch)
  Secondary: incident_memory.json  (local embedding cache + offline fallback)
  Dedup:     Exact ID match  +  semantic cosine similarity (threshold 0.97)
Enter fullscreen mode Exit fullscreen mode

The memory layer has two paths:

  1. Primary (cloud): Every query hits Hindsight recall() first. Results that carry metadata (our stored incident_id, root_cause) are used directly. Results that don't carry metadata (NLP fact extractions by Hindsight) are ID-matched back to local records.

  2. Fallback (local): If Hindsight returns nothing useful, we fall through to cosine similarity search over local embeddings — the system always gives an answer.


🔥 The Hardest Problem We Solved

When we started syncing data from Hindsight, we discovered something unexpected: recall() doesn't return your original document. It returns NLP-extracted facts — atomic sentences Hindsight derived from what you stored.

So if you stored:

"Incident ID d56b2cad: DB connections maxed out. Root cause: connection pool exhaustion."

Hindsight might give you back:

"Database connections reached maximum limit on 2026-06-07."

No incident_id. No root_cause. Just a raw extracted fact.

This was a problem for syncing. Our early sync_from_hindsight.py tried to parse these and — critically — assigned a random uuid4() to every fact it couldn't match. The result: running the sync script twice would double the record count. We were generating 2,000+ garbage duplicate records on every run.

The fix: Never fall back to uuid4(). If a fact has no parseable incident ID — from our stored metadata or from regex-extracting "Incident d56b2cad" / "Incident INC0027946" patterns out of the fact text — the record gets skipped entirely.

# BEFORE (the bug):
"incident_id": inc_id or str(uuid.uuid4())[:8],  # creates infinite duplicates

# AFTER (the fix):
if not inc_id:
    return None  # skip — never invent an ID
Enter fullscreen mode Exit fullscreen mode

After fixing this, sync became idempotent. Running it 10 times gives the same result.


📊 What We Loaded Into It

We loaded 1,019 real-world SRE incidents spanning incident types like:

Failure Type Count
Load balancer misconfiguration 257
Third-party API outage 138
Cache cluster failure 138
DB connection pool exhaustion 81
Failed deployment / regression 73
DNS resolution failure 69
CPU saturation 65
Memory leak (OOM) 63
Expired SSL/TLS certificate 57
Disk space exhaustion 32

We built bulk_resolve.py to classify all open incidents by failure category and resolve them in one shot — pushing all 903 updates to Hindsight using retain_batch() in 50-item batches.

Final state: 1,019 / 1,019 incidents resolved. 0 open. 100% Hindsight-synced.


🤖 What the AI Actually Says

Here's a real example. We submitted a new incident:

"DB connection pool at 95%, API p99 latency at 6.2s, auth service returning 503s"

The response came back in under 2 seconds:

{
  "incident_id": "c988e1ff",
  "ai_analysis": "Based on 9 similar past incidents, root cause is database connection pool exhaustion. Confidence: 91%. The auth service 503s are a downstream effect — the real problem is upstream at the connection pool level.",
  "suggested_actions": [
    "1. Run: SELECT count(*) FROM pg_stat_activity WHERE state='active' — confirm connections at limit",
    "2. Kill idle connections: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle' AND query_start < now() - interval '5 minutes'",
    "3. Increase DB_POOL_SIZE from 100 to 200 in app config and rolling-restart pods",
    "4. Add pool-utilisation alert at 80% threshold in Datadog/Grafana"
  ],
  "similar_past_incidents": [
    {"id": "4e6a6f1b", "title": "DB connection exhaustion", "root_cause": "connection pool exhaustion", "resolved": true},
    {"id": "9a78a49f", "title": "API latency spike with DB connections maxed", "resolved": true}
  ]
}
Enter fullscreen mode Exit fullscreen mode

No "please check your logs." No "contact your database administrator." Specific. Actionable. Confident.


🔑 Key API Endpoints

# Analyse a new incident
POST /incident/new
{"title": "...", "description": "..."}

# Mark resolved (updates Hindsight memory)
POST /incident/resolve
{"incident_id": "...", "root_cause": "...", "mitigation_steps": "..."}

# Pull everything from Hindsight into local memory
POST /sync

# Health check (backend + Groq + Hindsight)
GET /status

# All incidents with counts
GET /incidents/all
Enter fullscreen mode Exit fullscreen mode

🧠 What We Learned

1. Cloud memory is a first-class architecture concern, not an afterthought.
Hindsight forced us to think about what we store and how we tag it from the very first retain() call. Metadata design (always pass incident_id, root_cause, status in the metadata= dict) is as important as the content itself.

2. Idempotency matters more than you think.
Any script that touches a shared data store needs to be safe to run multiple times. Our sync bug taught us this the hard way — after the uuid4 bug, we had 2,987 records where we expected 556.

3. LLM persona engineering is real engineering.
Getting Groq to consistently output SRE-grade responses (not generic advice) required precise prompt design. The phrase "You are an SRE on-call engineer. Never use filler phrases. Never say 'it appears'. Always give numbered steps with exact commands." was the difference between mediocre and great output.

4. Build for the fallback first.
Our local cosine similarity search was the first thing we built, and it saved us multiple times when cloud connectivity had issues. The system always gives an answer — that's non-negotiable for an incident management tool.


🚀 Try It Yourself

git clone https://github.com/KrrishR05/IncidentOS-AI.git
cd IncidentOS-AI
python -m venv venv && venv\Scripts\activate  # Windows
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt
Enter fullscreen mode Exit fullscreen mode

Create .env:

GROQ_API_KEY=gsk_...
HINDSIGHT_API_KEY=hsk_...
Enter fullscreen mode Exit fullscreen mode
uvicorn main:app --host 0.0.0.0 --port 8000
Enter fullscreen mode Exit fullscreen mode

Then hit http://localhost:8000 and submit your first incident.

GitHub: github.com/KrrishR05/IncidentOS-AI


👥 The Team

Built in 12 hours at HackBaroda 2026 by:


Tags

#python #ai #devops #sre #fastapi #groq #llm #hackathon #webdev #showdev

Top comments (0)