DEV Community

Vijaya Bollu

# I Built a DevOps Chatbot That Checks My Live App for Failures — Here's How It Works

## Why I Built This

Every DevOps engineer has had the 2am moment. Something is broken. You don't know what. You SSH in, check logs, Google the error, open five tabs, still nothing clear. Thirty minutes later you find it — a config flag someone changed, a slow query, a dependency timing out.

I wanted to ask an AI instead. Not a generic ChatGPT that gives you textbook answers, but an AI connected to my actual running system that can check what's broken right now.

So I built the AI DevOps Copilot — Project 01 of my 30-project AI + DevOps YouTube series.


## How It Works

The system has four layers:

1. LangChain agent (the brain)
Uses `create_tool_calling_agent` with Llama 3.1 via Groq. When you ask a question, the agent decides whether to answer from knowledge or call a tool. General DevOps questions → instant answer. Questions about the live app → tool call.

2. ChromaDB RAG (the knowledge base)
Nine runbook documents embedded into a vector database — Docker troubleshooting, AWS debugging, Kubernetes, Terraform, Linux performance, security, and more. The agent searches these for context when answering general questions.

3. Tool layer (the live connection)
Four tools: `restaurant_monitor` (hits the live restaurant app API), `log_search` (searches application logs), `github_search` (searches repos), and `devops_docs` (searches the runbook vector store).

4. FastAPI + SSE streaming (the interface)
The agent runs in a thread executor, streaming tokens back through an asyncio Queue with a sentinel pattern. The Next.js frontend connects via Server-Sent Events and renders each token as it arrives.
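The thread-executor-plus-queue hand-off in layer 4 can be sketched with stdlib asyncio alone. This is an illustrative sketch, not the project's actual code: `run_agent` stands in for the blocking LangChain call, and the sentinel object tells the async side the stream is done.

```python
import asyncio

SENTINEL = object()  # marks end-of-stream on the queue

def run_agent(question: str, loop: asyncio.AbstractEventLoop, queue: asyncio.Queue) -> None:
    """Stand-in for the blocking agent call: emits tokens, then the sentinel."""
    for token in ["The ", "menu ", "API ", "is ", "slow."]:
        # thread-safe hand-off from the worker thread into the event loop
        loop.call_soon_threadsafe(queue.put_nowait, token)
    loop.call_soon_threadsafe(queue.put_nowait, SENTINEL)

async def stream_answer(question: str):
    """Async generator an SSE endpoint would iterate over."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    # run the blocking agent in the default thread executor
    fut = loop.run_in_executor(None, run_agent, question, loop, queue)
    while True:
        token = await queue.get()
        if token is SENTINEL:
            break
        yield token
    await fut  # surface any exception raised in the worker

async def main() -> str:
    answer = "".join([t async for t in stream_answer("what is broken?")])
    print(answer)
    return answer

asyncio.run(main())  # prints: The menu API is slow.
```

The sentinel matters because `None` or `""` could be a legitimate token; a unique `object()` can never collide with model output.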

Data flow:

```
User question → FastAPI → LangChain agent → tool call or RAG search → streamed response → Next.js UI
```
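As a toy illustration of the first branch in that flow: the real decision is made by the LLM's native function calling, but keyword matching can stand in for it here. Everything below is hypothetical, not the project's code.

```python
def route(question: str) -> str:
    """Toy router: the real choice is made by the LLM's function calling,
    not keywords -- this only illustrates the two paths."""
    live_signals = ("right now", "currently", "restaurant app", "logs")
    if any(signal in question.lower() for signal in live_signals):
        return "tool_call"    # e.g. restaurant_monitor or log_search
    return "rag_search"       # answer from the runbook vector store

print(route("What is wrong with the restaurant app right now?"))  # tool_call
print(route("How do I fix a CrashLoopBackOff pod?"))              # rag_search
```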

## Demo / Results

I deployed a chaos engineering demo app — a restaurant application with injectable failure modes. One failure mode: `slow_menu`, which adds a 2-second artificial delay to `GET /api/menu`.
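Server-side, an injectable failure mode like this can be as simple as a registry the handlers consult before responding. The names below mirror the demo app but this is a minimal sketch, not its actual code:

```python
import time

# Hypothetical registry of injectable failure modes
FAILURES = {"slow_menu": {"active": False, "delay_s": 2.0}}

def get_menu() -> dict:
    """Handler for GET /api/menu with the chaos hook applied."""
    mode = FAILURES["slow_menu"]
    if mode["active"]:
        time.sleep(mode["delay_s"])   # artificial latency injection
    return {"items": ["pizza", "pasta"], "status": "ok"}

def set_failure(name: str, active: bool) -> None:
    """What POST /api/admin/failures/<name>/{enable,disable} would toggle."""
    FAILURES[name]["active"] = active

set_failure("slow_menu", True)
start = time.perf_counter()
get_menu()
print(f"with slow_menu active: {time.perf_counter() - start:.1f}s")  # ~2.0s
```

Because the toggle is just state, disabling it takes effect on the very next request — which is why verification right after the fix is cheap.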

Input: Menu page spinning, customers waiting. I ask the copilot: "What is wrong with the restaurant app right now?"

What happened:

```
Tool called: restaurant_monitor → action: failures
API response: slow_menu: ACTIVE (2s delay on /api/menu)
```

AI answer:
```
The slow_menu failure mode is currently ACTIVE.
This injects a 2-second delay into the Menu API.

Fix:
1. Operator dashboard → Failures → Disable slow_menu
2. Or: POST /api/admin/failures/slow_menu/disable
3. Verify: menu should load in <100ms
```

Time to diagnosis: ~8 seconds.

After disabling the toggle, the menu loaded instantly. The entire incident lifecycle — detection, diagnosis, fix, verification — took under 2 minutes.


## What Surprised Me

**1. `create_react_agent` kept hitting iteration limits.**
I started with the ReAct text-parsing agent. Llama 3.1-8b kept failing the "Thought/Action/Observation" format, exhausting all 10 iterations on simple questions. Switching to `create_tool_calling_agent` — which uses native LLM function calling — fixed this completely. The model knows how to call functions; it doesn't know how to produce exact ReAct formatting.
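Why the format matters: a ReAct agent regex-parses the model's raw text, and any drift from the expected layout fails the parse and burns an iteration. A simplified version of that parse (the real LangChain parser is more forgiving, but the failure mode is the same):

```python
import re

# Simplified ReAct output parser: expects exact "Action:" / "Action Input:" lines
ACTION_RE = re.compile(r"Action:\s*(\w+)\s*\nAction Input:\s*(.+)", re.DOTALL)

def parse_react(text: str):
    m = ACTION_RE.search(text)
    if not m:
        # in a real agent loop this triggers another iteration
        raise ValueError("could not parse tool call")
    return m.group(1), m.group(2).strip()

good = "Thought: check the app\nAction: restaurant_monitor\nAction Input: failures"
bad = "I'll call restaurant_monitor with input failures."  # small models drift like this

print(parse_react(good))  # ('restaurant_monitor', 'failures')
try:
    parse_react(bad)
except ValueError as e:
    print("parse failed:", e)
```

Native function calling sidesteps this entirely: the tool name and arguments arrive as structured data, so there is nothing to regex.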

**2. Tool call chunks were streaming to the user as garbage text.**
`on_llm_new_token` in LangChain fires for every token — including internal tool call encodings like `{"name": "restaurant_monitor", "arguments":`. Fixed by checking `chunk.tool_call_chunks` and skipping those tokens. Without this, users see raw JSON blobs in the chat.
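The filter itself is one conditional. The `Chunk` class below is a stand-in for LangChain's chunk object (only the `tool_call_chunks` attribute name comes from the real library; the rest is illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Stand-in for a LangChain ChatGenerationChunk-like object."""
    text: str = ""
    tool_call_chunks: list = field(default_factory=list)

def visible_tokens(chunks):
    """Yield only tokens the user should see, skipping tool-call encodings."""
    for chunk in chunks:
        if chunk.tool_call_chunks:  # internal function-call fragment: suppress
            continue
        if chunk.text:
            yield chunk.text

stream = [
    Chunk(text='{"name": "restaurant_monitor",',
          tool_call_chunks=[{"name": "restaurant_monitor"}]),
    Chunk(text="The "),
    Chunk(text="menu is slow."),
]
print("".join(visible_tokens(stream)))  # The menu is slow.
```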

**3. `host.docker.internal` doesn't resolve on EC2 Linux.**
Added `extra_hosts: ["host.docker.internal:host-gateway"]` to Docker Compose — still failed. Fixed by using the EC2 private IP directly: `RESTAURANT_API_URL=http://172.31.90.69:8010` in the `.env` file. Simple, obvious in hindsight.

**4. Groq free tier is 6,000 tokens per minute.**
With `llama-3.3-70b-versatile`, three questions hit the limit. Switched to `llama-3.1-8b-instant` — much faster, lower token usage, still very capable for DevOps Q&A. For a demo or portfolio project, this is the right call.


## Try It

📁 GitHub: https://github.com/ThinkWithOps/ai-devops-systems-lab/tree/main/projects/01-ai-devops-copilot

🎬 Video walkthrough: https://youtu.be/a50334Szt5g

```bash
# Clone and run locally
git clone https://github.com/ThinkWithOps/ai-devops-systems-lab
cd ai-devops-systems-lab/projects/01-ai-devops-copilot
cp backend/.env.example backend/.env
# Add your GROQ_API_KEY to backend/.env
docker compose up -d
```

Open http://localhost:3000 — the copilot is running.

This is Project 01 of a 30-project series building AI + DevOps systems from scratch.


What's the most painful part of your on-call experience that you wish an AI could handle? Drop it in the comments — it might become Project 02.

