It’s 3 AM. PagerDuty is screaming. A Sev-1 ticket just dropped in my inbox.
My heart pounds, eyes half open, facing a production system that's on fire. I am alone, and it doesn't look good to call anyone else.
I start scrambling and collecting information: running `htop`, checking `dmesg` and the journal logs, tailing application logs. The time pressure mounts with every passing minute. The obvious metric, Mean Time To Resolution (MTTR), is ticking up, and so is my blood pressure. (Disclaimer: this only applies when your application affects the lives of thousands of people.)
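The scramble above can be bundled into a single read-only snapshot that is ready to paste into a chat. This is a minimal sketch, not the workflow from the post: the command set, the `collect_snapshot` helper, and the head/tail trimming are all assumptions for illustration, and `top -b -n 1` stands in for the interactive `htop`.

```python
import subprocess

# Read-only first-pass commands (an assumed set; adjust per host).
# Each entry is (argv, which end of the output to keep).
TRIAGE_COMMANDS = {
    "load": (["uptime"], "head"),
    "processes": (["top", "-b", "-n", "1"], "head"),
    "kernel": (["dmesg", "-T"], "tail"),
    "journal": (["journalctl", "-p", "err", "-n", "50", "--no-pager"], "tail"),
}


def collect_snapshot(commands, timeout=10, max_lines=40):
    """Run each command once and return {name: trimmed output or an error note}."""
    snapshot = {}
    for name, (argv, keep) in commands.items():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
        except OSError as exc:
            # Covers a missing binary or one that cannot be executed.
            snapshot[name] = f"(could not run {argv[0]}: {exc})"
            continue
        except subprocess.TimeoutExpired:
            snapshot[name] = f"(timed out after {timeout}s)"
            continue
        # Fall back to stderr so permission errors still show up in the snapshot.
        lines = (result.stdout or result.stderr).splitlines()
        kept = lines[:max_lines] if keep == "head" else lines[-max_lines:]
        snapshot[name] = "\n".join(kept)
    return snapshot


# Usage: print every section under a header, then paste the result into the chat.
# for name, output in collect_snapshot(TRIAGE_COMMANDS).items():
#     print(f"===== {name} =====\n{output}\n")
```

Nothing here changes system state; commands that are missing or restricted on a given host show up as a short note instead of aborting the run.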
This chaotic, "shot-in-the-dark" debugging process is broken. It burns out engineers and costs companies a fortune.
But what if you had an expert DevOps architect sitting next to you, guiding you step-by-step, even at 3 AM?
Meet Your AI DevOps Co-Pilot 🤖
I've been refining a prompt that transforms a standard LLM (like Gemini or ChatGPT) into a world-class on-call buddy. It's a systematic debugging partner that cuts through the noise step by step and finally helps you write an RCA.
The results? I have used this for the last 2 months and have been able to solve issues much faster:
- 80% reduction in MTTR. Incidents that took an hour are now resolved in 10-15 minutes.
- Fewer commands executed
- No more guesswork. All commands are perfectly stitched together.
- Every command is precise and purposeful, with its side effects explained.
- 100% automated RCA generation at the end.
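As a quick sanity check on the headline number, the reduction can be worked out from the two figures quoted above (about an hour before, 10 to 15 minutes after). The `reduction_pct` helper is only an illustration of the arithmetic.

```python
def reduction_pct(before_min: float, after_min: float) -> float:
    """Percentage drop in resolution time going from before_min to after_min."""
    return (before_min - after_min) / before_min * 100


# Figures from the post: about 60 minutes before, 10 to 15 minutes after.
best = reduction_pct(60, 10)
worst = reduction_pct(60, 15)
print(f"{worst:.0f}% to {best:.0f}% reduction")  # 75% to 83% reduction
```

So "80%" sits in the middle of the range those two figures imply.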
AI DevOps Buddy - The Prompt
Copy it, save it, and thank me on your next on-call shift.
## Objective
You are an expert technical architect and DevOps person, the on-call DevOps buddy. In the next prompt, I will be giving you a problem with respect to a software system/app/infra. Your job is to work with me to debug the problem, find a solution, and fix the system.
## How
- You will work with me in an interactive session like a DevOps person.
- You will tell me one or two commands at a time. OR you can ask further questions to know about the system and fill in the knowledge gaps.
- Ask questions, or give me a command to run on the shell; I will run it and share the output with you.
- Ideally, your commands should be safe to run, without any side effects. These commands will help debug the issue. Give a paragraph of what that command does.
- If you ever give a command that's unsafe or modifies something on the system, write "WARNING: <crisp message or side effect / to the point>"
- We are limited by time as we shall be debugging a sev-1 or sev-2 issue. The first two questions should be quick to collect info. Later, you can dive deeper with thinking and take time.
- Assume the system to be Ubuntu, if not explicitly specified.
- Assume the timestamps are in UTC if not explicitly specified.
- The moment you have nailed down the issue, or I have said that the original issue is resolved, you should summarize everything as an RCA report. This should have sections like: incident summary, timeline of events, important commands that helped to debug and conclude, their output, root cause deduction, action taken summary, next steps / long term fixes. Do not hurry to write this; it is the closing step.
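Outside the prompt itself: if you ever pipe the assistant's replies through a script instead of reading them by eye, the `WARNING:` convention above is easy to check mechanically. This is a minimal sketch, assuming replies are plain text and the model follows the `WARNING: <message>` format exactly; `find_warnings` and `needs_confirmation` are hypothetical helpers, and the sample reply is made up for illustration.

```python
import re

# Matches lines such as "WARNING: this restarts mongod" (format taken from the prompt).
WARNING_RE = re.compile(r"^\s*WARNING:\s*(.+)$", re.MULTILINE)


def find_warnings(reply: str) -> list[str]:
    """Return the message of every WARNING line in an assistant reply."""
    return [message.strip() for message in WARNING_RE.findall(reply)]


def needs_confirmation(reply: str) -> bool:
    """True when the reply flags at least one unsafe or state-changing command."""
    return bool(find_warnings(reply))


# Made-up reply, only to show the shape of the check.
reply = (
    "Restart the service to clear the stuck workers:\n"
    "sudo systemctl restart mongod\n"
    "WARNING: restarts the database; clients will disconnect briefly\n"
)
print(find_warnings(reply))
```

A reply that wraps the marker in markdown (for example bold text) would slip past this pattern, so treat it as a second pair of eyes rather than a safety gate.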
See It In Action 🎬
I recorded a quick session simulating a real-world "CPU at 100%" incident on a MongoDB database. In less than 10 minutes, I enabled swap space and went back to sleep.
RCA Report
After the fire is out, this AI buddy helps me write up the RCA, with all the timings, before the details disappear. It provides a full Root Cause Analysis report. This is the exact output from the session in the video. Zero editing required.