Praveen Ballari

Posted on May 24

I Got Tired of 35-Minute Incident Reviews — So I Built an AI SRE Copilot

#devops #ai #sre #webdev

I got tired of spending 35 minutes debugging the same production incidents.

So I built an AI incident response copilot.

Every outage followed the same pattern:

Scroll through logs
Google obscure error messages
Debate root cause in Slack
Write the same post-mortem template again

The engineering cost wasn’t just downtime.
It was repeated cognitive load.

So this week I built OperatorMesh — a lightweight AI-powered incident response platform designed for SRE and DevOps workflows.

What makes it different isn’t “AI”.
It’s confidence calibration and failure transparency.

Most AI tools pretend to know everything.

I wanted the opposite:

show uncertainty honestly
explain rejected hypotheses
identify missing signals
separate diagnosis confidence from remediation confidence

Because at 2 AM, “probably correct” and “safe to execute” are not the same thing.
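
To make that distinction concrete, here is a minimal sketch of keeping the two scores separate and gating on both. It is written in Python for brevity; the `TriageResult` fields, the `recommend` helper, and both thresholds are illustrative assumptions, not OperatorMesh's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class TriageResult:
    root_cause: str
    diagnosis_confidence: float    # how likely the root cause is right (0..1)
    remediation: str
    remediation_confidence: float  # how safe the action is to run (0..1)
    rejected_hypotheses: list[str] = field(default_factory=list)
    missing_signals: list[str] = field(default_factory=list)


def recommend(result: TriageResult,
              diagnosis_floor: float = 0.7,      # hypothetical threshold
              remediation_floor: float = 0.9) -> str:  # hypothetical threshold
    """Return 'investigate', 'review', or 'execute' from the two scores."""
    # A weak diagnosis means more evidence is needed before anything else.
    if result.diagnosis_confidence < diagnosis_floor:
        return "investigate"
    # A plausible diagnosis is not enough: the fix itself must look safe,
    # and no known evidence gaps may remain.
    if result.remediation_confidence < remediation_floor or result.missing_signals:
        return "review"
    return "execute"
```

With these example thresholds, a diagnosis at 0.82 paired with a remediation at 0.6 comes back as `"review"`: believable enough to act on, but only with a human looking at the fix first.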

Here’s what I shipped:

🚨 AI Incident Triage

Paste logs or alerts and get:

root cause analysis in plain English
confidence scoring
ranked remediation actions
rejected hypotheses
missing evidence/signals

One real example:
A PostgreSQL connection pool exhaustion issue was diagnosed in 19 seconds with 82% confidence.

🔍 Pre-Mortem Deploy Scanner

Describe a deployment before shipping it.

The system predicts:

deployment safety score
likely failure modes
rollback triggers
at-risk services
monitoring priorities

It flagged a dangerous database migration before deployment: a non-concurrent index creation on a large table, the kind of change that can block writes while the index builds.
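
For that specific failure mode, even a plain static check can help alongside the AI scan. The sketch below assumes a PostgreSQL-style migration, where `CREATE INDEX` without `CONCURRENTLY` blocks writes to the table until the build finishes. The function name is hypothetical and the statement splitting is deliberately naive, so treat it as an illustration rather than a real SQL parser.

```python
import re

# Matches CREATE [UNIQUE] INDEX and captures the optional CONCURRENTLY keyword.
_CREATE_INDEX = re.compile(
    r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\s+(CONCURRENTLY\b)?",
    re.IGNORECASE,
)


def find_blocking_index_builds(migration_sql: str) -> list[str]:
    """Return CREATE INDEX statements that lack CONCURRENTLY."""
    findings = []
    # Naive split on ';' (ignores semicolons inside strings or comments).
    for statement in migration_sql.split(";"):
        match = _CREATE_INDEX.search(statement)
        if match and match.group(1) is None:
            # Collapse whitespace so the finding prints on one line.
            findings.append(" ".join(statement.split()))
    return findings
```

Feeding it a migration with one plain and one concurrent index build returns only the plain one, which is the statement worth a second look before shipping.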

💥 Blast Radius Predictor

Describe a failing service.

The system estimates:

cascade severity
dependency impact chain
T+5 / T+15 / T+30 failure progression
highest-priority stabilization action

In one simulated auth-service outage, it correctly flagged the immediate risk of JWT/session validation failures across dependent systems.
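
The dependency impact chain is the part that is easiest to picture in code. As a rough sketch, assuming the service graph is available as a simple mapping from each service to the services that depend on it (the `dependents` map and `blast_radius` name are made up for illustration), a breadth-first walk gives every impacted service and how many hops away it sits. Hop distance is only a crude stand-in for the time-based T+5 / T+15 / T+30 progression described above.

```python
from collections import deque


def blast_radius(dependents: dict[str, list[str]], failed: str) -> dict[str, int]:
    """Map each impacted service to its hop distance from the failed one."""
    distance = {failed: 0}
    queue = deque([failed])
    while queue:
        service = queue.popleft()
        for downstream in dependents.get(service, []):
            # First visit is the shortest path in a breadth-first walk.
            if downstream not in distance:
                distance[downstream] = distance[service] + 1
                queue.append(downstream)
    del distance[failed]  # report only the services affected by the failure
    return distance


# Hypothetical topology: who depends on whom.
example_graph = {
    "auth": ["api-gateway", "billing"],
    "api-gateway": ["web"],
    "billing": [],
}
impact = blast_radius(example_graph, "auth")
```

Here `impact` puts `api-gateway` and `billing` one hop from the auth failure and `web` two hops away.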

📄 Post-Mortem Auto-Draft

This was built purely from frustration.

It generates:

executive summary
timeline
root cause analysis
contributing factors
action items
lessons learned

No more writing post-mortems from scratch after midnight incidents.

🔄 On-Call Handoff Briefing

Summarizes an entire shift into a 60-second briefing:

current system state
resolved incidents
active risks
watch metrics
escalation context

Less Slack archaeology.
Less context loss between engineers.

Technical Stack

I’m building this solo.

Current stack:

Netlify serverless functions
Supabase auth + storage
Multi-provider AI fallback routing
Vanilla HTML/CSS/JS frontend
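
Multi-provider fallback routing can be sketched in a few lines. The version below is Python rather than the JavaScript this stack implies, and the provider callables, the `complete_with_fallback` helper, and the `AllProvidersFailed` exception are all hypothetical names. It tries providers in order and moves on whenever one raises.

```python
from typing import Callable

# A provider is any callable that takes a prompt and returns the reply text.
Provider = Callable[[str], str]


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain errored."""


def complete_with_fallback(
    prompt: str, providers: list[tuple[str, Provider]]
) -> tuple[str, str]:
    """Return (provider_name, reply) from the first provider that succeeds."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any provider error triggers the fallback
            errors.append(f"{name}: {exc}")
    # Surface every failure instead of hiding it behind a generic message.
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name along with the reply keeps it visible which model actually answered, which matters when comparing confidence scores across providers.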

Current infrastructure cost:
Under $50/month.

Ironically, the hardest part wasn’t the infrastructure.

It was prompt engineering.

The biggest challenge was enforcing:

structured JSON outputs
confidence calibration
deterministic formatting
honest failure handling
low-hallucination behavior
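
Prompting alone rarely guarantees the first and fourth items, so a validation layer after the model call is a common companion. The sketch below assumes a reply shape with `root_cause`, `confidence`, and `missing_signals` fields. That schema and the `parse_triage_reply` name are illustrative guesses, not the format OperatorMesh actually uses. The point is that a malformed reply becomes an explicit failure instead of a guess.

```python
import json

# Assumed reply schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "root_cause": str,
    "confidence": (int, float),
    "missing_signals": list,
}


def parse_triage_reply(raw: str) -> dict:
    """Validate a model reply; return an explicit failure instead of guessing."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"invalid JSON: {exc.msg}"}
    if not isinstance(data, dict):
        return {"ok": False, "error": "expected a JSON object"}
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            return {"ok": False, "error": f"missing or mistyped field: {name}"}
    confidence = data["confidence"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not 0.0 <= confidence <= 1.0:
        return {"ok": False, "error": "confidence outside [0, 1]"}
    return {"ok": True, "result": data}
```

A caller can then retry, fall back to another provider, or show the error as-is, rather than rendering a half-parsed diagnosis.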

One thing I learned quickly:
LLMs become dramatically more useful for operational tooling when they’re allowed to admit uncertainty.

That single design decision improved trust more than anything else.

What still needs work

Response latency is still too high (~19 seconds average)
Streaming output is not implemented yet
Slack integration is still in progress
No real production users yet

Right now I’m optimizing for feedback, not scale.

If you work in SRE, DevOps, platform engineering, or incident response — I’d genuinely love feedback from people who deal with production failures daily.

What’s missing?
What would make something like this genuinely useful in production?

I’m building in public and documenting the journey as I go.

— Praveen
