DEV Community

Vijaya Bollu

# I Built a DevOps Chatbot That Checks My Live App for Failures — Here's How It Works

## Why I Built This

Every DevOps engineer has had the 2am moment. Something is broken. You don't know what. You SSH in, check logs, Google the error, open five tabs, still nothing clear. Thirty minutes later you find it — a config flag someone changed, a slow query, a dependency timing out.

I wanted to ask an AI instead. Not a generic ChatGPT that gives you textbook answers, but an AI connected to my actual running system that can check what's broken right now.

So I built the AI DevOps Copilot — Project 01 of my 30-project AI + DevOps YouTube series.


## How It Works

The system has four layers:

1. LangChain agent (the brain)
Uses `create_tool_calling_agent` with Llama 3.1 via Groq. When you ask a question, the agent decides whether to answer from knowledge or call a tool. General DevOps questions → instant answer. Questions about the live app → tool call.

2. ChromaDB RAG (the knowledge base)
Nine runbook documents embedded into a vector database — Docker troubleshooting, AWS debugging, Kubernetes, Terraform, Linux performance, security, and more. The agent searches these for context when answering general questions.

3. Tool layer (the live connection)
Four tools: `restaurant_monitor` (hits the live restaurant app API), `log_search` (searches application logs), `github_search` (searches repos), and `devops_docs` (searches the runbook vector store).

4. FastAPI + SSE streaming (the interface)
The agent runs in a thread executor, streaming tokens back through an asyncio Queue with a sentinel pattern. The Next.js frontend connects via Server-Sent Events and renders each token as it arrives.
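The thread-executor-plus-queue hand-off in layer 4 can be sketched with stdlib asyncio alone. This is an illustrative sketch, not the project's actual code: `run_agent` stands in for the blocking LangChain call, and the sentinel object tells the async side the stream is done.

```python
import asyncio

SENTINEL = object()  # marks end-of-stream on the queue

def run_agent(question: str, loop: asyncio.AbstractEventLoop, queue: asyncio.Queue) -> None:
    """Stand-in for the blocking agent call: emits tokens, then the sentinel."""
    for token in ["The ", "menu ", "API ", "is ", "slow."]:
        # thread-safe hand-off from the worker thread into the event loop
        loop.call_soon_threadsafe(queue.put_nowait, token)
    loop.call_soon_threadsafe(queue.put_nowait, SENTINEL)

async def stream_answer(question: str):
    """Async generator an SSE endpoint would iterate over."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    # run the blocking agent in the default thread executor
    fut = loop.run_in_executor(None, run_agent, question, loop, queue)
    while True:
        token = await queue.get()
        if token is SENTINEL:
            break
        yield token
    await fut  # surface any exception raised in the worker

async def main() -> str:
    answer = "".join([t async for t in stream_answer("what is broken?")])
    print(answer)
    return answer

asyncio.run(main())  # prints: The menu API is slow.
```

The sentinel matters because `None` or `""` could be a legitimate token; a unique `object()` can never collide with model output.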

Data flow:

```
User question → FastAPI → LangChain agent → tool call or RAG search → streamed response → Next.js UI
```
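As a toy illustration of the first branch in that flow: the real decision is made by the LLM's native function calling, but keyword matching can stand in for it here. Everything below is hypothetical, not the project's code.

```python
def route(question: str) -> str:
    """Toy router: the real choice is made by the LLM's function calling,
    not keywords -- this only illustrates the two paths."""
    live_signals = ("right now", "currently", "restaurant app", "logs")
    if any(signal in question.lower() for signal in live_signals):
        return "tool_call"    # e.g. restaurant_monitor or log_search
    return "rag_search"       # answer from the runbook vector store

print(route("What is wrong with the restaurant app right now?"))  # tool_call
print(route("How do I fix a CrashLoopBackOff pod?"))              # rag_search
```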

## Demo / Results

I deployed a chaos engineering demo app — a restaurant application with injectable failure modes. One failure mode: `slow_menu`, which adds a 2-second artificial delay to `GET /api/menu`.
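Server-side, an injectable failure mode like this can be as simple as a registry the handlers consult before responding. The names below mirror the demo app but this is a minimal sketch, not its actual code:

```python
import time

# Hypothetical registry of injectable failure modes
FAILURES = {"slow_menu": {"active": False, "delay_s": 2.0}}

def get_menu() -> dict:
    """Handler for GET /api/menu with the chaos hook applied."""
    mode = FAILURES["slow_menu"]
    if mode["active"]:
        time.sleep(mode["delay_s"])   # artificial latency injection
    return {"items": ["pizza", "pasta"], "status": "ok"}

def set_failure(name: str, active: bool) -> None:
    """What POST /api/admin/failures/<name>/{enable,disable} would toggle."""
    FAILURES[name]["active"] = active

set_failure("slow_menu", True)
start = time.perf_counter()
get_menu()
print(f"with slow_menu active: {time.perf_counter() - start:.1f}s")  # ~2.0s
```

Because the toggle is just state, disabling it takes effect on the very next request — which is why verification right after the fix is cheap.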

Input: Menu page spinning, customers waiting. I ask the copilot: "What is wrong with the restaurant app right now?"

What happened:

```
Tool called: restaurant_monitor → action: failures
API response: slow_menu: ACTIVE (2s delay on /api/menu)
```

AI answer:
```
The slow_menu failure mode is currently ACTIVE.
This injects a 2-second delay into the Menu API.

Fix:
1. Operator dashboard → Failures → Disable slow_menu
2. Or: POST /api/admin/failures/slow_menu/disable
3. Verify: menu should load in <100ms
```

Time to diagnosis: ~8 seconds.

After disabling the toggle, the menu loaded instantly. The entire incident lifecycle — detection, diagnosis, fix, verification — took under 2 minutes.


## What Surprised Me

**1. `create_react_agent` kept hitting iteration limits.**
I started with the ReAct text-parsing agent. Llama 3.1-8b kept failing the "Thought/Action/Observation" format, exhausting all 10 iterations on simple questions. Switching to `create_tool_calling_agent` — which uses native LLM function calling — fixed this completely. The model knows how to call functions; it doesn't know how to produce exact ReAct formatting.
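Why the format matters: a ReAct agent regex-parses the model's raw text, and any drift from the expected layout fails the parse and burns an iteration. A simplified version of that parse (the real LangChain parser is more forgiving, but the failure mode is the same):

```python
import re

# Simplified ReAct output parser: expects exact "Action:" / "Action Input:" lines
ACTION_RE = re.compile(r"Action:\s*(\w+)\s*\nAction Input:\s*(.+)", re.DOTALL)

def parse_react(text: str):
    m = ACTION_RE.search(text)
    if not m:
        # in a real agent loop this triggers another iteration
        raise ValueError("could not parse tool call")
    return m.group(1), m.group(2).strip()

good = "Thought: check the app\nAction: restaurant_monitor\nAction Input: failures"
bad = "I'll call restaurant_monitor with input failures."  # small models drift like this

print(parse_react(good))  # ('restaurant_monitor', 'failures')
try:
    parse_react(bad)
except ValueError as e:
    print("parse failed:", e)
```

Native function calling sidesteps this entirely: the tool name and arguments arrive as structured data, so there is nothing to regex.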

**2. Tool call chunks were streaming to the user as garbage text.**
`on_llm_new_token` in LangChain fires for every token — including internal tool call encodings like `{"name": "restaurant_monitor", "arguments":`. Fixed by checking `chunk.tool_call_chunks` and skipping those tokens. Without this, users see raw JSON blobs in the chat.
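The filter itself is one conditional. The `Chunk` class below is a stand-in for LangChain's chunk object (only the `tool_call_chunks` attribute name comes from the real library; the rest is illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Stand-in for a LangChain ChatGenerationChunk-like object."""
    text: str = ""
    tool_call_chunks: list = field(default_factory=list)

def visible_tokens(chunks):
    """Yield only tokens the user should see, skipping tool-call encodings."""
    for chunk in chunks:
        if chunk.tool_call_chunks:  # internal function-call fragment: suppress
            continue
        if chunk.text:
            yield chunk.text

stream = [
    Chunk(text='{"name": "restaurant_monitor",',
          tool_call_chunks=[{"name": "restaurant_monitor"}]),
    Chunk(text="The "),
    Chunk(text="menu is slow."),
]
print("".join(visible_tokens(stream)))  # The menu is slow.
```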

**3. `host.docker.internal` doesn't resolve on EC2 Linux.**
Added `extra_hosts: ["host.docker.internal:host-gateway"]` to Docker Compose — still failed. Fixed by using the EC2 private IP directly: `RESTAURANT_API_URL=http://172.31.90.69:8010` in the `.env` file. Simple, obvious in hindsight.

**4. Groq free tier is 6,000 tokens per minute.**
With `llama-3.3-70b-versatile`, three questions hit the limit. Switched to `llama-3.1-8b-instant` — much faster, lower token usage, still very capable for DevOps Q&A. For a demo or portfolio project, this is the right call.


## Try It

📁 GitHub: https://github.com/ThinkWithOps/ai-devops-systems-lab/tree/main/projects/01-ai-devops-copilot

🎬 Video walkthrough: https://youtu.be/a50334Szt5g

```bash
# Clone and run locally
git clone https://github.com/ThinkWithOps/ai-devops-systems-lab
cd ai-devops-systems-lab/projects/01-ai-devops-copilot
cp backend/.env.example backend/.env
# Add your GROQ_API_KEY to backend/.env
docker compose up -d
```

Open http://localhost:3000 — the copilot is running.

This is Project 01 of a 30-project series building AI + DevOps systems from scratch.


What's the most painful part of your on-call experience that you wish an AI could handle? Drop it in the comments — it might become Project 02.

