Ravi Teja Reddy Mandala
I Replaced My On-Call Runbook with AI — Here’s What Happened in Production

Last month I tried something risky.

Instead of waking up at 3AM to debug production incidents, I experimented with an AI assistant handling the first layer of incident triage.

No runbook.
No manual log digging.
Just AI analyzing alerts, logs, and metrics.

Here’s what actually happened in production.


The Problem Every On-Call Engineer Knows

If you've ever been on call, you know the routine.

PagerDuty fires.

You open logs.

You check dashboards.

You run the same 5 commands.

Every single time.

The process is predictable, but it still requires a human in the loop.

So I asked a simple question:

Why can't AI do the first layer of incident investigation?


The Idea

Instead of engineers performing repetitive triage, I built a simple AI incident assistant.

The AI receives alerts and performs initial debugging steps automatically.

The architecture looked like this:

Alert → AI Agent → Log Analysis → Root Cause Guess → Suggested Fix

Tools used:

  • OpenAI API
  • GitHub Actions
  • Kubernetes logs
  • Prometheus metrics
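The pipeline above can be sketched as a small Python handler. This is illustrative, not my exact production code: the model name (`gpt-4o`) and function names are assumptions, and log/metric collection is treated as already done upstream (in my setup, a GitHub Actions job gathered Kubernetes logs and Prometheus query results before this step).

```python
# Illustrative sketch of the Alert → AI Agent step (names are assumptions,
# not the original production code). Logs and metrics are assumed to be
# collected upstream, e.g. via kubectl logs and a Prometheus range query.

TRIAGE_PROMPT = """You are a Site Reliability Engineer assistant.

Analyze the following production logs and metrics.

Tasks:
1. Identify possible root causes
2. Classify incident severity
3. Suggest debugging steps
4. Provide likely remediation

Alert: {alert}

Logs:
{logs}

Metrics:
{metrics}
"""

def build_triage_request(alert: str, logs: str, metrics: str) -> dict:
    """Assemble a chat-completion payload for the OpenAI API."""
    return {
        "model": "gpt-4o",  # model choice is an assumption for this sketch
        "messages": [{
            "role": "user",
            "content": TRIAGE_PROMPT.format(alert=alert, logs=logs, metrics=metrics),
        }],
    }
```

The payload can then be sent with the OpenAI Python client (e.g. `client.chat.completions.create(**payload)` on the v1+ SDK); the response text is what gets analyzed downstream.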

The AI Prompt

The core of the system was surprisingly simple.

You are a Site Reliability Engineer assistant.

Analyze the following production logs and metrics.

Tasks:
1. Identify possible root causes
2. Classify incident severity
3. Suggest debugging steps
4. Provide likely remediation

This prompt runs every time a critical alert fires.
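The model replies in free text, so routing on severity or surfacing debugging steps needs a small parser. Here is a hedged sketch that assumes the reply uses headings close to what the prompt asks for (the heading patterns are my guess, not a guarantee of the model's output format):

```python
# Sketch of parsing the model's free-text reply into structured fields.
# The heading patterns below assume the reply roughly mirrors the four
# tasks in the prompt; real replies may need looser matching.
import re

SECTION_PATTERNS = {
    "root_causes": r"root cause",
    "severity": r"severity",
    "debug_steps": r"debugging steps",
    "remediation": r"remediation",
}

def parse_triage_reply(text: str) -> dict:
    """Split the reply into sections keyed by the four requested tasks."""
    result = {}
    current = None
    for line in text.splitlines():
        matched = False
        for key, pattern in SECTION_PATTERNS.items():
            if re.search(pattern, line, re.IGNORECASE):
                current = key          # heading line starts a new section
                result[current] = []
                matched = True
                break
        if not matched and current and line.strip():
            result[current].append(line.strip())
    return {k: "\n".join(v) for k, v in result.items()}
```

With fields extracted, the pipeline can page a human for high-severity incidents and attach the suggested steps to the alert for everything else.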


Real Incident Example

Incident: API latency spike

Logs showed increased response times.

The AI analyzed the logs and returned:

Possible Root Cause
Redis latency increase due to connection pool saturation.

Suggested Debugging Steps

  • Check Redis CPU usage
  • Inspect connection pool metrics
  • Verify recent deployment changes

Suggested Fix
Scale Redis replicas or increase connection pool size.

Time to initial diagnosis:

3 minutes

Typical human triage time:

15–20 minutes


What Worked Surprisingly Well

The AI was very good at:

  • Pattern recognition in logs
  • Suggesting common infrastructure fixes
  • Identifying deployment-related issues

It reduced time spent on basic incident investigation dramatically.


What Failed

AI is not perfect.

Twice it suggested completely wrong root causes.

Example:

It blamed database contention when the real issue was a misconfigured feature flag.

Lesson learned:

Never allow AI to make production changes automatically.

AI should assist engineers, not replace them.
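One way to enforce that boundary in code, sketched with hypothetical function and channel names: the agent's output can only ever be posted for human review, and any suggestion that looks like an action is blocked outright.

```python
# Guardrail sketch (hypothetical names): the assistant may surface
# suggestions to engineers, but can never execute a production change.

ALLOWED_ACTIONS = {"post_suggestion"}  # deliberately excludes anything like "apply_fix"

def handle_ai_suggestion(suggestion: dict) -> str:
    """Route an AI suggestion: surface it for review, never execute it."""
    action = suggestion.get("action", "post_suggestion")
    if action not in ALLOWED_ACTIONS:
        return f"blocked: '{action}' requires human approval"
    return f"posted to #incidents: {suggestion['summary']}"
```

The allow-list is the point: adding an executing action requires a deliberate code change and review, not just a different model output.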


The Future of On-Call Engineering

The biggest realization was this:

AI doesn't replace engineers.

It replaces the boring parts of operations.

The repetitive steps.
The predictable debugging paths.
The manual log searching.

The future of SRE might look like this:

Alert → AI Investigation → Engineer Decision

Engineers focus on solving real problems.

AI handles the repetitive investigation.


Final Thoughts

After running this experiment for a few weeks, one thing became clear.

AI is incredibly useful for incident triage.

Not perfect.

But powerful enough to reduce on-call fatigue significantly.

And honestly…

Anything that reduces 3AM debugging sessions is worth exploring.


If you're experimenting with AI in DevOps or SRE workflows, I'd love to hear what you're building.

Top comments (16)

Benjamin Nguyen • Edited

Nice! I like this project. AI agents are great for building things like this these days. Companies should create applications that detect malicious attacks and isolate incidents automatically. We need guardrails around AI. My Sentinel project is an early cybersecurity concept where the application detects malicious attacks and runs all day long.

Ravi Teja Reddy Mandala

Thanks, Benjamin, really appreciate this!

Totally agree on guardrails. That’s actually one of the biggest gaps I noticed early on: without constraints, AI tends to over-suggest or miss critical signals.

Your point about isolating malicious activity is interesting, especially integrating incident triage with security detection. In my setup, I focused more on reliability signals (timeouts, retries, missing observability), but combining that with security signals would make it much more powerful.

Curious, how are you handling false positives in your sentinel project? That’s been one of the tougher challenges on my side.

Benjamin Nguyen

Nice! Yes, I did.

Ravi Teja Reddy Mandala

Nice! Curious to hear what worked well for you vs. where it struggled.

Benjamin Nguyen

The main issue I faced was how quickly Gemini 3 Flash ran out of tokens on a previous project. That didn’t happen with my current project (Sentinel).

Ravi Teja Reddy Mandala

That’s interesting. I’ve seen similar behavior with token limits depending on context size and prompt patterns.

In my case, breaking workflows into smaller steps and adding retrieval (instead of passing full context each time) helped a lot with token efficiency and consistency.

Curious, did you change anything in your architecture between the previous project and Sentinel, or do you think it’s mostly model-related?

Benjamin Nguyen

Interesting! I should clarify something about my Arctic AI project from January. I ended up restructuring the system and switching models because Gemini 3 Flash kept running out of tokens for the workload. For that project I moved back to Gemini 2.5 Flash, which handled the token demands much better. I’ve also heard from other people on Dev.to who ran into token issues with Gemini 3 Flash.

What’s funny is that I never had any issues with my system or with my Sentinel projects when using Gemini 3 Flash—only the Arctic project pushed it past its limits.

Ravi Teja Reddy Mandala

That makes sense, it sounds like your Arctic workload was hitting the upper bounds of context and chaining more aggressively.

I have seen similar patterns where certain use cases, especially long reasoning chains or heavy context stitching, expose limits that do not show up in typical flows like Sentinel-type systems.

In those cases, moving to a hybrid approach with retrieval, tighter prompt windows, and step-wise execution usually stabilizes things much more than just switching models.

Curious, was the Arctic system doing more multi-step reasoning or large context aggregation compared to Sentinel?

Benjamin Nguyen

Yes, it was! Gemini 3 Flash was pulling information from three to five websites to generate a summary. I had to correct a mistake in the code, and after refreshing, Gemini 3 Flash reported that the model had run out of tokens. That’s why I switched back to Gemini 2.5 Flash.

Benjamin Nguyen

I am cautious with all of the new Gemini 3 models.

Ravi Teja Reddy Mandala

Makes sense, that multi-source aggregation can hit token limits pretty quickly.

I’ve seen similar behavior when agents try to compress too much context into a single pass. Breaking it into smaller steps or adding retrieval usually helps stabilize things.

Curious how Gemini 2.5 Flash is handling that workload for you now 👍

Benjamin Nguyen

Nice! To borrow the expression, it’s night and day with Gemini 2.5 Flash. I never had any issues with that model, and I never ran out of tokens with it.

Ravi Teja Reddy Mandala

That’s great to hear, sounds like a much more stable setup 👍

Yeah, 2.5 Flash seems to handle context and token management much better. Curious if you’re still doing multi-source aggregation the same way, or if the model just handles it more efficiently now.

Benjamin Nguyen

Honestly, I don’t use Gemini 2.5 Flash anymore, but it handled multi-source workloads better and more efficiently than Gemini 3 Flash.

Ravi Teja Reddy Mandala

Got it, that makes sense 👍

Yeah, I’ve noticed similar trade-offs between the newer models and stability. Always interesting to see how different setups behave in real use cases.

Benjamin Nguyen

yeah!