DEV Community


I Replaced My On-Call Runbook with AI — Here’s What Happened in Production

Ravi Teja Reddy Mandala on March 12, 2026

Last month I tried something risky. Instead of waking up at 3AM to debug production incidents, I experimented with an AI assistant handling the fi...
 
Benjamin Nguyen

Nice! I like this project. AI agents are great for coding these days. Companies should build applications that detect malicious attacks and isolate the incident automatically. We need to put guardrails around AI. My Sentinel project is an early cybersecurity concept where the application detects malicious attacks and runs around the clock.

Ravi Teja Reddy Mandala

Thanks, Benjamin, really appreciate this!

Totally agree on guardrails. That's actually one of the biggest gaps I noticed early on: without constraints, AI tends to over-suggest or miss critical signals.

Your point about isolating malicious activity is interesting, especially integrating incident triage with security detection. In my setup, I focused more on reliability signals (timeouts, retries, missing observability), but combining that with security signals would make it much more powerful.
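For illustration, the reliability-signal triage I'm describing can be sketched roughly like this. The thresholds and field names here are hypothetical, not my actual setup:

```python
# Hypothetical sketch: triage an incident on reliability signals
# (timeouts, retries, missing observability) before involving an
# AI assistant. Thresholds and metric names are illustrative.

def triage(metrics: dict) -> list[str]:
    """Return the reliability signals an incident trips."""
    signals = []
    if metrics.get("timeout_rate", 0.0) > 0.05:   # more than 5% of requests timing out
        signals.append("timeouts")
    if metrics.get("retry_rate", 0.0) > 0.10:     # more than 10% of requests retried
        signals.append("retries")
    if not metrics.get("has_traces", True):       # service emits no traces at all
        signals.append("missing-observability")
    return signals

# Example: heavy timeouts, normal retries, no tracing configured
flagged = triage({"timeout_rate": 0.20, "retry_rate": 0.01, "has_traces": False})
```

A security-focused version would just add more signal checks (auth failures, anomalous source IPs) to the same list.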

Curious, how are you handling false positives in your Sentinel project? That's been one of the tougher challenges on my side.

Benjamin Nguyen

Nice! Yes, I did.

Ravi Teja Reddy Mandala

Nice! Curious to hear what worked well for you vs. where it struggled.

Benjamin Nguyen

The main issue I faced was how quickly Gemini 3 Flash ran out of tokens on a previous project. That hasn't happened with my current project (Sentinel).

Ravi Teja Reddy Mandala

That’s interesting. I’ve seen similar behavior with token limits depending on context size and prompt patterns.

In my case, breaking workflows into smaller steps and adding retrieval (instead of passing full context each time) helped a lot with token efficiency and consistency.
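To make the retrieval idea concrete, here's a minimal sketch of scoring runbook snippets against the incident text and sending only the top matches, instead of the full context. Everything here (the scoring function, the sample runbook) is illustrative, not my production code:

```python
# Hypothetical sketch of "retrieval instead of full context":
# rank runbook snippets by keyword overlap with the incident and
# keep only the top matches, cutting the tokens sent per call.

def score_snippet(snippet: str, incident: str) -> int:
    """Count how many incident keywords appear in the snippet."""
    incident_words = set(incident.lower().split())
    return sum(1 for word in snippet.lower().split() if word in incident_words)

def select_context(snippets: list[str], incident: str, top_k: int = 2) -> list[str]:
    """Keep only the top_k most relevant snippets for the prompt."""
    ranked = sorted(snippets, key=lambda s: score_snippet(s, incident), reverse=True)
    return ranked[:top_k]

runbook = [
    "Restart the payment service if retries exceed the threshold.",
    "Rotate TLS certificates before expiry.",
    "Check upstream timeouts when latency spikes.",
]
incident = "payment service retries spiking with upstream timeouts"
context = select_context(runbook, incident)  # only the two relevant snippets
```

In practice you'd swap the keyword overlap for embeddings, but even something this crude keeps irrelevant runbook pages out of the prompt.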

Curious, did you change anything in your architecture between the previous project and Sentinel, or do you think it’s mostly model-related?