Ravi Teja Reddy Mandala
I Replaced My On-Call Runbook with AI — Here’s What Happened in Production

Last month I tried something risky.

Instead of waking up at 3AM to debug production incidents, I experimented with an AI assistant handling the first layer of incident triage.

No runbook.
No manual log digging.
Just AI analyzing alerts, logs, and metrics.

Here’s what actually happened in production.


The Problem Every On-Call Engineer Knows

If you've ever been on call, you know the routine.

PagerDuty fires.

You open logs.

You check dashboards.

You run the same 5 commands.

Every single time.

The process is predictable, but it still requires a human in the loop.

So I asked a simple question:

Why can't AI do the first layer of incident investigation?


The Idea

Instead of engineers performing repetitive triage, I built a simple AI incident assistant.

The AI receives alerts and performs initial debugging steps automatically.

The architecture looked like this:

```
Alert → AI Agent → Log Analysis → Root Cause Guess → Suggested Fix
```

Tools used:

  • OpenAI API
  • GitHub Actions
  • Kubernetes logs
  • Prometheus metrics
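The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: it assumes the alert arrives as a dict, that logs and metrics have already been collected as plain text, and that the model call goes through the OpenAI chat completions API. Names like `build_prompt` and `triage` are mine, not from the post.

```python
# Sketch of the alert -> AI triage flow. Assumptions (not from the
# original setup): alert is a dict, logs/metrics are pre-collected text.

SYSTEM_PROMPT = "You are a Site Reliability Engineer assistant."

def build_prompt(alert: dict, logs: str, metrics: str) -> str:
    """Assemble the triage prompt from the alert context."""
    return (
        f"Alert: {alert['name']} (severity: {alert.get('severity', 'unknown')})\n\n"
        "Analyze the following production logs and metrics.\n\n"
        f"Logs:\n{logs}\n\nMetrics:\n{metrics}\n\n"
        "Tasks:\n"
        "1. Identify possible root causes\n"
        "2. Classify incident severity\n"
        "3. Suggest debugging steps\n"
        "4. Provide likely remediation\n"
    )

def triage(alert: dict, logs: str, metrics: str) -> str:
    """Send the assembled prompt to the model and return its analysis.
    Hypothetical wiring -- needs the `openai` package and an API key."""
    from openai import OpenAI  # imported lazily; the pure parts need no key
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": build_prompt(alert, logs, metrics)},
        ],
    )
    return resp.choices[0].message.content
```

Keeping prompt assembly separate from the API call means the interesting part stays testable without hitting the network.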

The AI Prompt

The core of the system was surprisingly simple.

```
You are a Site Reliability Engineer assistant.

Analyze the following production logs and metrics.

Tasks:
1. Identify possible root causes
2. Classify incident severity
3. Suggest debugging steps
4. Provide likely remediation
```

This prompt runs every time a critical alert fires.
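Since the model replies in free text, the pipeline needs a way to act on its severity classification. Here's one hedged approach I could imagine bolting on: scan the reply for a severity label and page a human only for the serious ones. The "Severity: Px" convention is an assumption for illustration, not something the original setup necessarily used.

```python
import re

# Hypothetical post-processing: pull the severity label out of the
# model's free-text reply so the pipeline can decide whether to page.
# The "Severity: Px" convention is an assumption, not from the post.

SEVERITY_PATTERN = re.compile(
    r"severity\s*[:\-]\s*(P[0-4]|critical|high|medium|low)",
    re.IGNORECASE,
)

def extract_severity(ai_reply: str) -> str:
    """Return the first severity label found, or 'unknown' if none."""
    match = SEVERITY_PATTERN.search(ai_reply)
    return match.group(1).upper() if match else "unknown"

def should_page_human(ai_reply: str) -> bool:
    """Page immediately for P0/P1/critical; otherwise just file the analysis."""
    return extract_severity(ai_reply) in {"P0", "P1", "CRITICAL"}
```

Defaulting to "unknown" when no label parses is deliberate: an unparseable reply should fall back to the normal paging path, not silently suppress an incident.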


Real Incident Example

Incident: API latency spike

Logs showed increased response times.

The AI analyzed the logs and returned:

Possible Root Cause
Redis latency increase due to connection pool saturation.

Suggested Debugging Steps

  • Check Redis CPU usage
  • Inspect connection pool metrics
  • Verify recent deployment changes

Suggested Fix
Scale Redis replicas or increase connection pool size.

Time to initial diagnosis: 3 minutes

Typical human triage time: 15–20 minutes
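The "connection pool saturation" hypothesis is also cheap to verify numerically: compare in-use connections to the configured pool size. A tiny helper along these lines could close the loop on the AI's suggestion — the metric names and the 90% threshold are illustrative, not the actual values from the incident.

```python
# One way to quantify the pool-saturation hypothesis: compare in-use
# connections against the configured pool size. Threshold and metric
# sources are illustrative assumptions.

def pool_saturation(in_use: int, pool_size: int) -> float:
    """Fraction of the connection pool currently in use (0.0 to 1.0)."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return in_use / pool_size

def needs_scaling(in_use: int, pool_size: int, threshold: float = 0.9) -> bool:
    """True when the pool is saturated enough to justify scaling."""
    return pool_saturation(in_use, pool_size) >= threshold
```

With Prometheus in the stack, `in_use` could come from an exporter gauge queried over the HTTP API; the exact query depends on which Redis exporter you run.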


What Worked Surprisingly Well

The AI was very good at:

  • Pattern recognition in logs
  • Suggesting common infrastructure fixes
  • Identifying deployment-related issues

It reduced time spent on basic incident investigation dramatically.


What Failed

AI is not perfect.

Twice it suggested completely wrong root causes.

Example:

It blamed database contention when the real issue was a misconfigured feature flag.

Lesson learned:

Never allow AI to make production changes automatically.

AI should assist engineers, not replace them.


The Future of On-Call Engineering

The biggest realization was this:

AI doesn't replace engineers.

It replaces the boring parts of operations.

The repetitive steps.
The predictable debugging paths.
The manual log searching.

The future of SRE might look like this:

```
Alert → AI Investigation → Engineer Decision
```

Engineers focus on solving real problems.

AI handles the repetitive investigation.


Final Thoughts

After running this experiment for a few weeks, one thing became clear.

AI is incredibly useful for incident triage.

Not perfect.

But powerful enough to reduce on-call fatigue significantly.

And honestly…

Anything that reduces 3AM debugging sessions is worth exploring.


If you're experimenting with AI in DevOps or SRE workflows, I'd love to hear what you're building.
