<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aagam Shah</title>
    <description>The latest articles on DEV Community by Aagam Shah (@aagam_1910).</description>
    <link>https://dev.to/aagam_1910</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3972519%2Fb13c01b2-e6f1-40c3-9c18-9c8f40c0cbb4.png</url>
      <title>DEV Community: Aagam Shah</title>
      <link>https://dev.to/aagam_1910</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aagam_1910"/>
    <language>en</language>
    <item>
      <title>IncidentOS AI — We Built a Self-Learning SRE Brain at HackBaroda 2026</title>
      <dc:creator>Aagam Shah</dc:creator>
      <pubDate>Sun, 07 Jun 2026 14:06:52 +0000</pubDate>
      <link>https://dev.to/aagam_1910/incidentos-ai-we-built-a-self-learning-sre-brain-at-hackbaroda-2026-1in2</link>
      <guid>https://dev.to/aagam_1910/incidentos-ai-we-built-a-self-learning-sre-brain-at-hackbaroda-2026-1in2</guid>
      <description>&lt;p&gt;&lt;em&gt;Submitted to HackBaroda 2026 by &lt;a href="https://www.linkedin.com/in/khusavant-choudhary-546b48369/" rel="noopener noreferrer"&gt;Khusavant Choudhary&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/aagam-shah-b9561833b/" rel="noopener noreferrer"&gt;Aagam Shah&lt;/a&gt;, and &lt;a href="https://www.linkedin.com/in/krrish-r-59m18/" rel="noopener noreferrer"&gt;Krrish Raj&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;It's 20:30 PM. An alert fires. Your database connections are maxed out, API latency is at 8 seconds, and users are reporting errors. You're on-call. You're tired. And you have to figure out — from scratch — what's going on.&lt;/p&gt;

&lt;p&gt;Now imagine if your incident management system already &lt;em&gt;knew&lt;/em&gt;. Not a generic "check your logs" suggestion — but an actual SRE-grade response: "Based on 12 similar past incidents, root cause is &lt;strong&gt;connection pool exhaustion&lt;/strong&gt;. Confidence: 91%. Here's the exact runbook."&lt;/p&gt;

&lt;p&gt;That's what we built at &lt;strong&gt;HackBaroda 2026&lt;/strong&gt;. We called it &lt;strong&gt;IncidentOS AI&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚨 The Problem
&lt;/h2&gt;

&lt;p&gt;Modern engineering teams face a brutal paradox: the same incidents keep happening, the same runbooks get rewritten, and the same on-call engineer at 3 AM starts from zero every single time.&lt;/p&gt;

&lt;p&gt;Existing tools like PagerDuty and Opsgenie are great at &lt;em&gt;alerting&lt;/em&gt;. They're not great at &lt;em&gt;learning&lt;/em&gt;. They don't remember that last week's DNS failure looked exactly like this one, or that the fix was "flush the resolver cache on nodes 4 and 7."&lt;/p&gt;

&lt;p&gt;We wanted to build something that &lt;strong&gt;gets smarter with every incident it sees&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 What We Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;IncidentOS AI&lt;/strong&gt; is an AI-powered incident management backend that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analyses new incidents&lt;/strong&gt; using Groq's &lt;code&gt;llama-3.3-70b-versatile&lt;/code&gt; LLM with a strict SRE persona — no filler phrases, no generic advice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queries its own memory&lt;/strong&gt; for similar past incidents before generating a response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gives a confidence score&lt;/strong&gt; based on how many similar resolved incidents it found (0 matches = 20%, 1 = 60%, 2+ = 85%+)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generates numbered runbooks&lt;/strong&gt; with exact commands, config keys, and thresholds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permanently stores every resolved incident&lt;/strong&gt; in Hindsight Cloud — so the knowledge is never lost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The more incidents it resolves, the smarter it gets. It's a flywheel.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ The Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FastAPI (Python)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Groq API — &lt;code&gt;llama-3.3-70b-versatile&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hindsight Cloud (&lt;code&gt;retain()&lt;/code&gt; / &lt;code&gt;recall()&lt;/code&gt; / &lt;code&gt;retain_batch()&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embeddings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/code&gt; (384-dim)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Local Fallback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;incident_memory.json&lt;/code&gt; (JSON + cosine similarity)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why Groq?
&lt;/h3&gt;

&lt;p&gt;Speed. During a live incident, nobody wants to wait 10 seconds for a response. Groq's LPU inference is fast enough to feel like a real-time assistant, not a batch job.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Hindsight?
&lt;/h3&gt;

&lt;p&gt;Hindsight is a cloud memory layer built specifically for AI agents. It stores information as NLP-extracted facts and lets you &lt;code&gt;recall()&lt;/code&gt; semantically relevant memories in plain English. This made it a perfect fit for "what do I know about incidents like this one?"&lt;/p&gt;

&lt;p&gt;Every incident we resolve gets &lt;code&gt;retain()&lt;/code&gt;-ed into Hindsight with full metadata — &lt;code&gt;incident_id&lt;/code&gt;, &lt;code&gt;root_cause&lt;/code&gt;, &lt;code&gt;mitigation_steps&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;. When a new incident comes in, we &lt;code&gt;recall()&lt;/code&gt; against it and get back the most relevant resolved memories with confidence scores.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚙️ Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP Client
  │
  ▼
FastAPI (main.py)
  ├── POST /incident/new
  │     ├─ memory.find_similar_incidents()  ◄── Hindsight recall()
  │     ├─ agent.analyze_incident()         ◄── Groq LLM
  │     └─ agent.suggest_actions()
  │
  ├── POST /incident/resolve
  │     └─ memory.resolve_incident()        ──► Hindsight retain() (with metadata)
  │
  └── POST /sync                            ──► memory.sync_from_hindsight_cloud()

Memory Layer (memory.py)
  Primary:   Hindsight Cloud   (retain / recall / list_memories / retain_batch)
  Secondary: incident_memory.json  (local embedding cache + offline fallback)
  Dedup:     Exact ID match  +  semantic cosine similarity (threshold 0.97)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The memory layer has two paths:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Primary (cloud)&lt;/strong&gt;: Every query hits Hindsight &lt;code&gt;recall()&lt;/code&gt; first. Results that carry metadata (our stored &lt;code&gt;incident_id&lt;/code&gt;, &lt;code&gt;root_cause&lt;/code&gt;) are used directly. Results that don't carry metadata (NLP fact extractions by Hindsight) are ID-matched back to local records.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fallback (local)&lt;/strong&gt;: If Hindsight returns nothing useful, we fall through to cosine similarity search over local embeddings — the system always gives an answer.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🔥 The Hardest Problem We Solved
&lt;/h2&gt;

&lt;p&gt;When we started syncing data from Hindsight, we discovered something unexpected: &lt;code&gt;recall()&lt;/code&gt; doesn't return your original document. It returns &lt;strong&gt;NLP-extracted facts&lt;/strong&gt; — atomic sentences Hindsight derived from what you stored.&lt;/p&gt;

&lt;p&gt;So if you stored:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Incident ID d56b2cad: DB connections maxed out. Root cause: connection pool exhaustion."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Hindsight might give you back:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Database connections reached maximum limit on 2026-06-07."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No &lt;code&gt;incident_id&lt;/code&gt;. No &lt;code&gt;root_cause&lt;/code&gt;. Just a raw extracted fact.&lt;/p&gt;

&lt;p&gt;This was a problem for syncing. Our early &lt;code&gt;sync_from_hindsight.py&lt;/code&gt; tried to parse these and — critically — &lt;strong&gt;assigned a random &lt;code&gt;uuid4()&lt;/code&gt; to every fact it couldn't match&lt;/strong&gt;. The result: running the sync script twice would double the record count. We were generating 2,000+ garbage duplicate records on every run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Never fall back to &lt;code&gt;uuid4()&lt;/code&gt;. If a fact has no parseable incident ID — from our stored metadata &lt;em&gt;or&lt;/em&gt; from regex-extracting &lt;code&gt;"Incident d56b2cad"&lt;/code&gt; / &lt;code&gt;"Incident INC0027946"&lt;/code&gt; patterns out of the fact text — the record gets &lt;strong&gt;skipped entirely&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BEFORE (the bug):
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;inc_id&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# creates infinite duplicates
&lt;/span&gt;
&lt;span class="c1"&gt;# AFTER (the fix):
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;inc_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# skip — never invent an ID
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After fixing this, sync became idempotent. Running it 10 times gives the same result.&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 What We Loaded Into It
&lt;/h2&gt;

&lt;p&gt;We loaded &lt;strong&gt;1,019 real-world SRE incidents&lt;/strong&gt; spanning incident types like:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Type&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Load balancer misconfiguration&lt;/td&gt;
&lt;td&gt;257&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Third-party API outage&lt;/td&gt;
&lt;td&gt;138&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache cluster failure&lt;/td&gt;
&lt;td&gt;138&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DB connection pool exhaustion&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failed deployment / regression&lt;/td&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS resolution failure&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU saturation&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory leak (OOM)&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Expired SSL/TLS certificate&lt;/td&gt;
&lt;td&gt;57&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk space exhaustion&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We built &lt;code&gt;bulk_resolve.py&lt;/code&gt; to classify all open incidents by failure category and resolve them in one shot — pushing all 903 updates to Hindsight using &lt;code&gt;retain_batch()&lt;/code&gt; in 50-item batches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final state: 1,019 / 1,019 incidents resolved. 0 open. 100% Hindsight-synced.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🤖 What the AI Actually Says
&lt;/h2&gt;

&lt;p&gt;Here's a real example. We submitted a new incident:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"DB connection pool at 95%, API p99 latency at 6.2s, auth service returning 503s"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The response came back in under 2 seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"incident_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"c988e1ff"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ai_analysis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Based on 9 similar past incidents, root cause is database connection pool exhaustion. Confidence: 91%. The auth service 503s are a downstream effect — the real problem is upstream at the connection pool level."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"suggested_actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"1. Run: SELECT count(*) FROM pg_stat_activity WHERE state='active' — confirm connections at limit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"2. Kill idle connections: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle' AND query_start &amp;lt; now() - interval '5 minutes'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"3. Increase DB_POOL_SIZE from 100 to 200 in app config and rolling-restart pods"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"4. Add pool-utilisation alert at 80% threshold in Datadog/Grafana"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"similar_past_incidents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4e6a6f1b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DB connection exhaustion"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"root_cause"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"connection pool exhaustion"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"resolved"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"9a78a49f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"API latency spike with DB connections maxed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"resolved"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No "please check your logs." No "contact your database administrator." Specific. Actionable. Confident.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔑 Key API Endpoints
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Analyse a new incident&lt;/span&gt;
POST /incident/new
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"..."&lt;/span&gt;, &lt;span class="s2"&gt;"description"&lt;/span&gt;: &lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Mark resolved (updates Hindsight memory)&lt;/span&gt;
POST /incident/resolve
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"incident_id"&lt;/span&gt;: &lt;span class="s2"&gt;"..."&lt;/span&gt;, &lt;span class="s2"&gt;"root_cause"&lt;/span&gt;: &lt;span class="s2"&gt;"..."&lt;/span&gt;, &lt;span class="s2"&gt;"mitigation_steps"&lt;/span&gt;: &lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Pull everything from Hindsight into local memory&lt;/span&gt;
POST /sync

&lt;span class="c"&gt;# Health check (backend + Groq + Hindsight)&lt;/span&gt;
GET /status

&lt;span class="c"&gt;# All incidents with counts&lt;/span&gt;
GET /incidents/all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧠 What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Cloud memory is a first-class architecture concern, not an afterthought.&lt;/strong&gt;&lt;br&gt;
Hindsight forced us to think about &lt;em&gt;what&lt;/em&gt; we store and &lt;em&gt;how&lt;/em&gt; we tag it from the very first &lt;code&gt;retain()&lt;/code&gt; call. Metadata design (always pass &lt;code&gt;incident_id&lt;/code&gt;, &lt;code&gt;root_cause&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt; in the &lt;code&gt;metadata=&lt;/code&gt; dict) is as important as the content itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Idempotency matters more than you think.&lt;/strong&gt;&lt;br&gt;
Any script that touches a shared data store needs to be safe to run multiple times. Our sync bug taught us this the hard way — after the uuid4 bug, we had 2,987 records where we expected 556.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. LLM persona engineering is real engineering.&lt;/strong&gt;&lt;br&gt;
Getting Groq to consistently output SRE-grade responses (not generic advice) required precise prompt design. The phrase &lt;code&gt;"You are an SRE on-call engineer. Never use filler phrases. Never say 'it appears'. Always give numbered steps with exact commands."&lt;/code&gt; was the difference between mediocre and great output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Build for the fallback first.&lt;/strong&gt;&lt;br&gt;
Our local cosine similarity search was the first thing we built, and it saved us multiple times when cloud connectivity had issues. The system always gives an answer — that's non-negotiable for an incident management tool.&lt;/p&gt;


&lt;h2&gt;
  
  
  🚀 Try It Yourself
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/KrrishR05/IncidentOS-AI.git
&lt;span class="nb"&gt;cd &lt;/span&gt;IncidentOS-AI
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; venv&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\a&lt;/span&gt;ctivate  &lt;span class="c"&gt;# Windows&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;torch &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cpu
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Create &lt;code&gt;.env&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GROQ_API_KEY=gsk_...
HINDSIGHT_API_KEY=hsk_...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn main:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then hit &lt;code&gt;http://localhost:8000&lt;/code&gt; and submit your first incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/KrrishR05/IncidentOS-AI" rel="noopener noreferrer"&gt;github.com/KrrishR05/IncidentOS-AI&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  👥 The Team
&lt;/h2&gt;

&lt;p&gt;Built in 12 hours at &lt;strong&gt;HackBaroda 2026&lt;/strong&gt; by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.linkedin.com/in/khusavant-choudhary-546b48369/" rel="noopener noreferrer"&gt;Khusavant Choudhary&lt;/a&gt;&lt;/strong&gt; (&lt;a class="mentioned-user" href="https://dev.to/khusavant"&gt;@khusavant&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.linkedin.com/in/aagam-shah-b9561833b/" rel="noopener noreferrer"&gt;Aagam Shah&lt;/a&gt;&lt;/strong&gt; (&lt;a class="mentioned-user" href="https://dev.to/aagam_1910"&gt;@aagam_1910&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.linkedin.com/in/krrish-r-59m18/" rel="noopener noreferrer"&gt;Krrish Raj&lt;/a&gt;&lt;/strong&gt; (&lt;a class="mentioned-user" href="https://dev.to/krrish_r5"&gt;@krrish_r5&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Tags
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;#python&lt;/code&gt; &lt;code&gt;#ai&lt;/code&gt; &lt;code&gt;#devops&lt;/code&gt; &lt;code&gt;#sre&lt;/code&gt; &lt;code&gt;#fastapi&lt;/code&gt; &lt;code&gt;#groq&lt;/code&gt; &lt;code&gt;#llm&lt;/code&gt; &lt;code&gt;#hackathon&lt;/code&gt; &lt;code&gt;#webdev&lt;/code&gt; &lt;code&gt;#showdev&lt;/code&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>fastapi</category>
      <category>hackathon</category>
    </item>
  </channel>
</rss>
