I got tired of spending 35 minutes debugging the same production incidents.
So I built an AI incident response copilot.
Every outage followed the same pattern:
- Scroll through logs
- Google obscure error messages
- Debate root cause in Slack
- Write the same post-mortem template again
The engineering cost wasn’t just downtime.
It was repeated cognitive load.
So this week I built OperatorMesh — a lightweight AI-powered incident response platform designed for SRE and DevOps workflows.
What makes it different isn’t “AI”.
It’s confidence calibration and failure transparency.
Most AI tools pretend to know everything.
I wanted the opposite:
- show uncertainty honestly
- explain rejected hypotheses
- identify missing signals
- separate diagnosis confidence from remediation confidence
Because at 2 AM, “probably correct” and “safe to execute” are not the same thing.
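One way to picture that separation is a result object that carries two independent scores. This is a minimal Python sketch, not OperatorMesh's actual code: the class names, field names, thresholds, and every example value other than the 82% diagnosis figure are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RemediationAction:
    description: str
    confidence: float  # 0.0-1.0: how safe this action is believed to be


@dataclass
class TriageResult:
    root_cause: str
    diagnosis_confidence: float  # 0.0-1.0: how likely the diagnosis is right
    actions: List[RemediationAction]
    rejected_hypotheses: List[str]
    missing_signals: List[str]


def safe_to_execute(result: TriageResult, action: RemediationAction,
                    min_diagnosis: float = 0.8,
                    min_remediation: float = 0.9) -> bool:
    """Both scores must clear their own (hypothetical) threshold."""
    return (result.diagnosis_confidence >= min_diagnosis
            and action.confidence >= min_remediation)


# Illustrative values: a fairly confident diagnosis, a risky fix.
restart = RemediationAction("Restart the connection pooler", 0.55)
result = TriageResult(
    root_cause="PostgreSQL connection pool exhaustion",
    diagnosis_confidence=0.82,
    actions=[restart],
    rejected_hypotheses=["Network partition"],
    missing_signals=["Pool wait-time metrics"],
)
print(safe_to_execute(result, restart))  # False
```

The diagnosis clears its threshold, but the action does not, so the fix stays a suggestion for a human rather than something to run automatically.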
Here’s what I shipped:
🚨 AI Incident Triage
Paste logs or alerts and get:
- root cause analysis in plain English
- confidence scoring
- ranked remediation actions
- rejected hypotheses
- missing evidence/signals
One real example:
A PostgreSQL connection pool exhaustion issue was diagnosed in 19 seconds with 82% confidence.
🔍 Pre-Mortem Deploy Scanner
Describe a deployment before shipping it.
The system predicts:
- deployment safety score
- likely failure modes
- rollback triggers
- at-risk services
- monitoring priorities
Before deployment, it caught a dangerous database migration: non-concurrent index creation on a large table.
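For context on why that matters: in PostgreSQL, a plain CREATE INDEX blocks writes to the table until the build finishes, while CREATE INDEX CONCURRENTLY avoids that lock. The scanner described here works from a deployment description, so the following is only a deterministic illustration of the same check, assuming PostgreSQL-style SQL. The function name and regex are hypothetical, and the naive split on semicolons ignores SQL comments and string literals.

```python
import re

# CREATE [UNIQUE] INDEX where the next token is not CONCURRENTLY.
NON_CONCURRENT_INDEX = re.compile(
    r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\s+(?!CONCURRENTLY\b)\S",
    re.IGNORECASE,
)


def find_blocking_index_statements(migration_sql: str) -> list:
    """Return statements that build an index without CONCURRENTLY."""
    flagged = []
    # Naive statement split: good enough for a sketch, not a SQL parser.
    for statement in migration_sql.split(";"):
        statement = statement.strip()
        if statement and NON_CONCURRENT_INDEX.search(statement):
            flagged.append(statement)
    return flagged


migration = """
CREATE INDEX idx_orders_user ON orders (user_id);
CREATE INDEX CONCURRENTLY idx_orders_date ON orders (created_at);
"""
print(find_blocking_index_statements(migration))
# ['CREATE INDEX idx_orders_user ON orders (user_id)']
```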
💥 Blast Radius Predictor
Describe a failing service.
The system estimates:
- cascade severity
- dependency impact chain
- T+5 / T+15 / T+30 failure progression
- highest-priority stabilization action
In one simulated auth-service outage, it correctly identified the immediate risk: JWT/session validation failing across dependent systems.
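The "dependency impact chain" idea can be illustrated with a plain graph walk. This sketch is an assumption-laden stand-in, not how the predictor works internally: the service map is invented, and hop count is only a rough proxy for how soon a service is affected, not a T+5 / T+15 / T+30 estimate.

```python
from collections import deque

# Hypothetical map: service -> services that depend on it directly.
DEPENDENTS = {
    "auth-service": ["api-gateway", "billing"],
    "api-gateway": ["web-frontend"],
    "billing": ["invoicing"],
}


def blast_radius(failing: str, dependents: dict) -> dict:
    """Breadth-first search: map each affected service to its hop distance."""
    hops = {failing: 0}
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            if downstream not in hops:  # also guards against cycles
                hops[downstream] = hops[current] + 1
                queue.append(downstream)
    del hops[failing]  # report only the services downstream of the failure
    return hops


print(blast_radius("auth-service", DEPENDENTS))
# {'api-gateway': 1, 'billing': 1, 'web-frontend': 2, 'invoicing': 2}
```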
📄 Post-Mortem Auto-Draft
This was built purely from frustration.
It generates:
- executive summary
- timeline
- root cause analysis
- contributing factors
- action items
- lessons learned
No more writing post-mortems from scratch after midnight incidents.
🔄 On-Call Handoff Briefing
Summarizes an entire shift into a 60-second briefing:
- current system state
- resolved incidents
- active risks
- watch metrics
- escalation context
Less Slack archaeology.
Less context loss between engineers.
Technical Stack
I’m building this solo.
Current stack:
- Netlify serverless functions
- Supabase auth + storage
- Multi-provider AI fallback routing
- Vanilla HTML/CSS/JS frontend
Current infrastructure cost:
Under $50/month.
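Multi-provider fallback routing can be as simple as trying providers in order and moving on when one fails. A minimal Python sketch under stated assumptions: the function name, the exception class, and the stub providers are hypothetical, and a real provider call would be an HTTP request inside the serverless function.

```python
from typing import Callable, List, Tuple

# A provider is a (name, call) pair; call takes a prompt, returns text.
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has errored."""


def complete_with_fallback(prompt: str,
                           providers: List[Provider]) -> Tuple[str, str]:
    """Return (provider_name, response) from the first provider that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # timeouts, rate limits, bad responses
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers standing in for real API clients.
def flaky(prompt: str) -> str:
    raise TimeoutError("timed out")


def stable(prompt: str) -> str:
    return "ok: " + prompt


print(complete_with_fallback("triage this", [("primary", flaky),
                                             ("secondary", stable)]))
# ('secondary', 'ok: triage this')
```

Returning the provider name alongside the response makes it possible to log which provider actually answered.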
Ironically, the hardest part wasn’t the infrastructure.
It was prompt engineering.
The biggest challenge was enforcing:
- structured JSON outputs
- confidence calibration
- deterministic formatting
- honest failure handling
- low hallucination rates
One thing I learned quickly:
LLMs become dramatically more useful for operational tooling when they’re allowed to admit uncertainty.
That single design decision improved trust more than anything else.
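Prompting alone does not guarantee structured output, so a validation step on the model's response helps with both the JSON and the honest-failure goals. The sketch below assumes a hypothetical response shape (`root_cause`, `confidence`, `missing_signals`). It rejects malformed output with an explicit error instead of guessing at a repair.

```python
import json

# Hypothetical fields the triage prompt asks the model to return.
REQUIRED_FIELDS = ("root_cause", "confidence", "missing_signals")


def parse_triage_output(raw: str) -> dict:
    """Validate model output; fail explicitly rather than patch it up."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"ok": False, "error": "model output was not valid JSON"}
    if not isinstance(data, dict):
        return {"ok": False, "error": "expected a JSON object"}
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        return {"ok": False, "error": "missing fields: " + ", ".join(missing)}
    confidence = data["confidence"]
    is_number = (isinstance(confidence, (int, float))
                 and not isinstance(confidence, bool))
    if not is_number or not 0 <= confidence <= 1:
        return {"ok": False,
                "error": "confidence must be a number between 0 and 1"}
    return {"ok": True, "result": data}


good = '{"root_cause": "pool exhaustion", "confidence": 0.82, "missing_signals": []}'
print(parse_triage_output(good)["ok"])        # True
print(parse_triage_output("Sure! Here is..."))
# {'ok': False, 'error': 'model output was not valid JSON'}
```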
What still needs work
- Response latency is still too high (~19 seconds average)
- Streaming output is not implemented yet
- Slack integration is still in progress
- No real production users yet
Right now I’m optimizing for feedback, not scale.
If you work in SRE, DevOps, platform engineering, or incident response — I’d genuinely love feedback from people who deal with production failures daily.
What’s missing?
What would make something like this genuinely useful in production?
I’m building in public and documenting the journey as I go.
— Praveen