Ravi Teja Reddy Mandala
I Replaced My On-Call Runbook with AI — Here’s What Happened in Production

Last month I tried something risky.

Instead of waking up at 3AM to debug production incidents, I experimented with an AI assistant handling the first layer of incident triage.

No runbook.
No manual log digging.
Just AI analyzing alerts, logs, and metrics.

Here’s what actually happened in production.


The Problem Every On-Call Engineer Knows

If you've ever been on call, you know the routine.

PagerDuty fires.

You open logs.

You check dashboards.

You run the same 5 commands.

Every single time.

The process is predictable, but it still requires a human in the loop.

So I asked a simple question:

Why can't AI do the first layer of incident investigation?


The Idea

Instead of engineers performing repetitive triage, I built a simple AI incident assistant.

The AI receives alerts and performs initial debugging steps automatically.

The architecture looked like this:

Alert → AI Agent → Log Analysis → Root Cause Guess → Suggested Fix

Tools used:

  • OpenAI API
  • GitHub Actions
  • Kubernetes logs
  • Prometheus metrics
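The pipeline above can be sketched as a small Python handler. This is illustrative, not my exact production code: the model name (`gpt-4o`) and function names are assumptions, and log/metric collection is treated as already done upstream (in my setup, a GitHub Actions job gathered Kubernetes logs and Prometheus query results before this step).

```python
# Illustrative sketch of the Alert → AI Agent step (names are assumptions,
# not the original production code). Logs and metrics are assumed to be
# collected upstream, e.g. via kubectl logs and a Prometheus range query.

TRIAGE_PROMPT = """You are a Site Reliability Engineer assistant.

Analyze the following production logs and metrics.

Tasks:
1. Identify possible root causes
2. Classify incident severity
3. Suggest debugging steps
4. Provide likely remediation

Alert: {alert}

Logs:
{logs}

Metrics:
{metrics}
"""

def build_triage_request(alert: str, logs: str, metrics: str) -> dict:
    """Assemble a chat-completion payload for the OpenAI API."""
    return {
        "model": "gpt-4o",  # model choice is an assumption for this sketch
        "messages": [{
            "role": "user",
            "content": TRIAGE_PROMPT.format(alert=alert, logs=logs, metrics=metrics),
        }],
    }
```

The payload can then be sent with the OpenAI Python client (e.g. `client.chat.completions.create(**payload)` on the v1+ SDK); the response text is what gets analyzed downstream.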

The AI Prompt

The core of the system was surprisingly simple.

You are a Site Reliability Engineer assistant.

Analyze the following production logs and metrics.

Tasks:
1. Identify possible root causes
2. Classify incident severity
3. Suggest debugging steps
4. Provide likely remediation

This prompt runs every time a critical alert fires.
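The model replies in free text, so routing on severity or surfacing debugging steps needs a small parser. Here is a hedged sketch that assumes the reply uses headings close to what the prompt asks for (the heading patterns are my guess, not a guarantee of the model's output format):

```python
# Sketch of parsing the model's free-text reply into structured fields.
# The heading patterns below assume the reply roughly mirrors the four
# tasks in the prompt; real replies may need looser matching.
import re

SECTION_PATTERNS = {
    "root_causes": r"root cause",
    "severity": r"severity",
    "debug_steps": r"debugging steps",
    "remediation": r"remediation",
}

def parse_triage_reply(text: str) -> dict:
    """Split the reply into sections keyed by the four requested tasks."""
    result = {}
    current = None
    for line in text.splitlines():
        matched = False
        for key, pattern in SECTION_PATTERNS.items():
            if re.search(pattern, line, re.IGNORECASE):
                current = key          # heading line starts a new section
                result[current] = []
                matched = True
                break
        if not matched and current and line.strip():
            result[current].append(line.strip())
    return {k: "\n".join(v) for k, v in result.items()}
```

With fields extracted, the pipeline can page a human for high-severity incidents and attach the suggested steps to the alert for everything else.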


Real Incident Example

Incident: API latency spike

Logs showed increased response times.

The AI analyzed the logs and returned:

Possible Root Cause
Redis latency increase due to connection pool saturation.

Suggested Debugging Steps

  • Check Redis CPU usage
  • Inspect connection pool metrics
  • Verify recent deployment changes

Suggested Fix
Scale Redis replicas or increase connection pool size.

Time to initial diagnosis:

3 minutes

Typical human triage time:

15–20 minutes


What Worked Surprisingly Well

The AI was very good at:

  • Pattern recognition in logs
  • Suggesting common infrastructure fixes
  • Identifying deployment-related issues

It reduced time spent on basic incident investigation dramatically.


What Failed

AI is not perfect.

Twice it suggested completely wrong root causes.

Example:

It blamed database contention when the real issue was a misconfigured feature flag.

Lesson learned:

Never allow AI to make production changes automatically.

AI should assist engineers, not replace them.
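One way to enforce that boundary in code, sketched with hypothetical function and channel names: the agent's output can only ever be posted for human review, and any suggestion that looks like an action is blocked outright.

```python
# Guardrail sketch (hypothetical names): the assistant may surface
# suggestions to engineers, but can never execute a production change.

ALLOWED_ACTIONS = {"post_suggestion"}  # deliberately excludes anything like "apply_fix"

def handle_ai_suggestion(suggestion: dict) -> str:
    """Route an AI suggestion: surface it for review, never execute it."""
    action = suggestion.get("action", "post_suggestion")
    if action not in ALLOWED_ACTIONS:
        return f"blocked: '{action}' requires human approval"
    return f"posted to #incidents: {suggestion['summary']}"
```

The allow-list is the point: adding an executing action requires a deliberate code change and review, not just a different model output.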


The Future of On-Call Engineering

The biggest realization was this:

AI doesn't replace engineers.

It replaces the boring parts of operations.

The repetitive steps.
The predictable debugging paths.
The manual log searching.

The future of SRE might look like this:

Alert → AI Investigation → Engineer Decision

Engineers focus on solving real problems.

AI handles the repetitive investigation.


Final Thoughts

After running this experiment for a few weeks, one thing became clear.

AI is incredibly useful for incident triage.

Not perfect.

But powerful enough to reduce on-call fatigue significantly.

And honestly…

Anything that reduces 3AM debugging sessions is worth exploring.


If you're experimenting with AI in DevOps or SRE workflows, I'd love to hear what you're building.

Top comments (16)

Benjamin Nguyen • Edited

Nice! I like this project. AI agents are great for building things like this these days. Companies should create applications that detect malicious attacks and isolate incidents automatically. We need guardrails around AI. My Sentinel project is an early cybersecurity concept where the application detects malicious attacks and runs all day long.

Ravi Teja Reddy Mandala

Thanks, Benjamin, really appreciate this!

Totally agree on guardrails. That’s actually one of the biggest gaps I noticed early on: without constraints, AI tends to over-suggest or miss critical signals.

Your point about isolating malicious activity is interesting, especially integrating incident triage with security detection. In my setup, I focused more on reliability signals (timeouts, retries, missing observability), but combining that with security signals would make it much more powerful.

Curious, how are you handling false positives in your sentinel project? That’s been one of the tougher challenges on my side.

Benjamin Nguyen

Nice! Yes, I did.

Ravi Teja Reddy Mandala

Nice! Curious to hear what worked well for you vs. where it struggled.

Benjamin Nguyen

The main issue I faced was how quickly Gemini 3 Flash ran out of tokens on a previous project. That didn’t happen with my current project (Sentinel).

Ravi Teja Reddy Mandala

That’s interesting. I’ve seen similar behavior with token limits depending on context size and prompt patterns.

In my case, breaking workflows into smaller steps and adding retrieval (instead of passing full context each time) helped a lot with token efficiency and consistency.

Curious, did you change anything in your architecture between the previous project and Sentinel, or do you think it’s mostly model-related?

Benjamin Nguyen

Interesting! I should clarify something about my Arctic AI project from January. I ended up restructuring the system and switching models because Gemini 3 Flash kept running out of tokens for the workload. For that project I moved back to Gemini 2.5 Flash, which handled the token demands much better. I’ve also heard from other people on Dev.to who ran into token issues with Gemini 3 Flash.

What’s funny is that I never had any issues with my system or with my Sentinel projects when using Gemini 3 Flash—only the Arctic project pushed it past its limits.

Ravi Teja Reddy Mandala

That makes sense, it sounds like your Arctic workload was hitting the upper bounds of context and chaining more aggressively.

I have seen similar patterns where certain use cases, especially long reasoning chains or heavy context stitching, expose limits that do not show up in typical flows like Sentinel-type systems.

In those cases, moving to a hybrid approach with retrieval, tighter prompt windows, and step-wise execution usually stabilizes things much more than just switching models.

Curious, was the Arctic system doing more multi-step reasoning or large context aggregation compared to Sentinel?

Benjamin Nguyen

Yes, it was! Gemini 3 Flash was pulling information from three to five websites to generate a summary. I had to correct a mistake in the code, and after refreshing, Gemini 3 Flash reported that the model had run out of tokens. That’s why I switched back to Gemini 2.5 Flash.

Benjamin Nguyen

I am cautious with all of the new Gemini 3 models.

Ravi Teja Reddy Mandala

Makes sense, that multi-source aggregation can hit token limits pretty quickly.

I’ve seen similar behavior when agents try to compress too much context into a single pass. Breaking it into smaller steps or adding retrieval usually helps stabilize things.

Curious how Gemini 2.5 Flash is handling that workload for you now 👍

Benjamin Nguyen

Nice! To borrow the expression, it’s night and day with Gemini 2.5 Flash. I never had any issues with that model, and I never ran out of tokens with it.

Ravi Teja Reddy Mandala

That’s great to hear, sounds like a much more stable setup 👍

Yeah, 2.5 Flash seems to handle context and token management much better. Curious if you’re still doing multi-source aggregation the same way, or if the model just handles it more efficiently now.

Benjamin Nguyen

Honestly, I don’t use Gemini 2.5 Flash anymore, but it handled multi-source workloads better and more efficiently than Gemini 3 Flash.

Ravi Teja Reddy Mandala

Got it, that makes sense 👍

Yeah, I’ve noticed similar trade-offs between the newer models and stability. Always interesting to see how different setups behave in real use cases.

Benjamin Nguyen

yeah!