<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vijaya Rajeev Bollu</title>
    <description>The latest articles on DEV Community by Vijaya Rajeev Bollu (@vijaya_bollu).</description>
    <link>https://dev.to/vijaya_bollu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3808851%2F28130c4b-5e4a-4938-94b6-370a13c89737.png</url>
      <title>DEV Community: Vijaya Rajeev Bollu</title>
      <link>https://dev.to/vijaya_bollu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vijaya_bollu"/>
    <language>en</language>
    <item>
      <title>I Built an MCP Server That Lets Claude Control My Kubernetes Cluster</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Thu, 09 Apr 2026 13:52:04 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/i-built-an-mcp-server-that-lets-claude-control-my-kubernetes-cluster-mfp</link>
      <guid>https://dev.to/vijaya_bollu/i-built-an-mcp-server-that-lets-claude-control-my-kubernetes-cluster-mfp</guid>
      <description>&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;Every DevOps incident I've dealt with follows the same pattern.&lt;/p&gt;

&lt;p&gt;Something breaks. I open five terminal tabs. I run kubectl, check the AWS console, tail Docker logs, and manually piece together what went wrong — all while the clock is ticking.&lt;/p&gt;

&lt;p&gt;The tools all exist. They're just completely disconnected from each other.&lt;/p&gt;

&lt;p&gt;I wanted one interface where I could ask a question and have something reach into all of those tools at once and explain what it found. So I built an MCP server that gives Claude Desktop direct access to my real infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The server runs locally on my laptop. Claude Desktop launches it on startup and communicates over stdio (standard input/output).&lt;/p&gt;

&lt;p&gt;When I type a question, Claude reads the 14 registered tool descriptions, decides which ones to call, and sends a JSON request to my server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"call_tool"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_failing_pods"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"namespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;}}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My server executes against real infrastructure and returns structured text. Claude synthesises everything into a plain English response.&lt;/p&gt;

&lt;p&gt;The key thing: &lt;strong&gt;Claude never connects to AWS or Kubernetes directly.&lt;/strong&gt; It only talks to my Python server. My server uses local credentials (&lt;code&gt;~/.aws/credentials&lt;/code&gt;, &lt;code&gt;~/.kube/config&lt;/code&gt;) to reach the real systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;14 tools across 4 categories:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes:&lt;/strong&gt; pod status, failing pods, logs, restart deployment, describe pod&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS:&lt;/strong&gt; cost report, EC2 instances, CloudWatch alarms, S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker:&lt;/strong&gt; list containers, container logs, restart container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform:&lt;/strong&gt; run plan, check state&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Demo / Results
&lt;/h2&gt;

&lt;p&gt;I typed: &lt;em&gt;"Give me a health report of my infrastructure"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claude called &lt;code&gt;get_pod_status&lt;/code&gt;, &lt;code&gt;get_aws_cost&lt;/code&gt;, &lt;code&gt;get_cloudwatch_alarms&lt;/code&gt;, and &lt;code&gt;list_containers&lt;/code&gt; simultaneously and returned a unified summary — Kubernetes pods, AWS costs, Docker containers, CloudWatch alarms — all in one response.&lt;/p&gt;

&lt;p&gt;One pod was in CrashLoopBackOff with 14 restarts. I typed:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Pull the logs from broken-app and tell me why it's crashing"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claude diagnosed it instantly: a busybox container with no long-running process exits as soon as it starts, so Kubernetes restarts it over and over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; 45 minutes of manual investigation across 5 tabs.&lt;br&gt;
&lt;strong&gt;After:&lt;/strong&gt; 10 seconds, one chat window.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The tool description is everything.&lt;/strong&gt;&lt;br&gt;
Claude picks tools based on their description, not their name. A vague description causes Claude to call the wrong tool or skip it entirely. My first version had &lt;code&gt;get_failing_pods&lt;/code&gt; described as &lt;em&gt;"Get pods"&lt;/em&gt;. Claude never called it. I changed it to &lt;em&gt;"List only pods in Failed, CrashLoopBackOff, OOMKilled, or Error state"&lt;/em&gt; and it worked perfectly every time. The description is the API contract between you and Claude.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Parallel tool calls are not guaranteed — they're emergent.&lt;/strong&gt;&lt;br&gt;
I didn't write any code to call tools in parallel. Claude decided to do that on its own when the question was broad enough ("health report"). For narrow questions ("which pods are failing?") Claude called only one tool. The parallelism comes from Claude's reasoning, not from your server. This means your tool descriptions need to be distinct enough that Claude can make that judgment call correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Sync SDKs in an async server are a silent performance killer.&lt;/strong&gt;&lt;br&gt;
boto3, the Kubernetes Python client, and the Docker SDK are all blocking. If you call them directly inside an async MCP handler, you stall the entire event loop. Every tool call wraps the SDK call in &lt;code&gt;asyncio.to_thread()&lt;/code&gt;. Without this, Claude would time out waiting for responses when multiple tools were called simultaneously.&lt;/p&gt;
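&lt;p&gt;A minimal sketch of the pattern, where &lt;code&gt;fetch_pods&lt;/code&gt; is a hypothetical stand-in for any blocking SDK call:&lt;/p&gt;

```python
import asyncio
import time

def fetch_pods(namespace: str) -> str:
    # Stand-in for a blocking SDK call (e.g. the Kubernetes client).
    time.sleep(0.2)
    return f"pods in {namespace}"

async def call_tool(namespace: str) -> str:
    # Run the blocking call in a worker thread so the event loop
    # stays free while several tools are called at once.
    return await asyncio.to_thread(fetch_pods, namespace)

async def main() -> list:
    # Two concurrent calls finish in roughly 0.2s instead of 0.4s.
    return await asyncio.gather(call_tool("default"), call_tool("kube-system"))
```

&lt;p&gt;Calling &lt;code&gt;fetch_pods&lt;/code&gt; directly inside the handler would freeze every other in-flight tool call for the duration of the request.&lt;/p&gt;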

&lt;p&gt;&lt;strong&gt;4. Error handling scope matters more than you think.&lt;/strong&gt;&lt;br&gt;
My first version handled SDK failures at module level, at import time. If boto3 failed to connect to AWS at startup, the entire server crashed — Kubernetes and Docker tools went down too. The fix: catch errors inside each tool call, not at import time. One broken service should never take down the whole server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The MCP protocol is simpler than it looks.&lt;/strong&gt;&lt;br&gt;
There are really only two things you implement: &lt;code&gt;list_tools()&lt;/code&gt; (return tool names + descriptions + input schema) and &lt;code&gt;call_tool()&lt;/code&gt; (receive tool name + arguments, return text). Everything else is handled by the MCP SDK. Building a tool is 20 lines of Python.&lt;/p&gt;
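&lt;p&gt;A framework-free sketch of those two handlers (the real MCP SDK wires them to JSON-RPC over stdio for you; the schema shape here is illustrative):&lt;/p&gt;

```python
# Registry of tools; the description is what Claude actually reads.
TOOLS = {
    "get_failing_pods": {
        "description": "List only pods in Failed, CrashLoopBackOff, "
                       "OOMKilled, or Error state",
        "input_schema": {"namespace": "string"},
        "handler": lambda namespace="default": f"no failing pods in {namespace}",
    },
}

def list_tools() -> list:
    # Return names, descriptions, and input schemas for Claude to read.
    return [
        {"name": n, "description": t["description"], "input_schema": t["input_schema"]}
        for n, t in TOOLS.items()
    ]

def call_tool(name: str, arguments: dict) -> str:
    # Receive a tool name plus arguments, return text.
    return TOOLS[name]["handler"](**arguments)
```
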


&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Single EC2 instance.&lt;/strong&gt; The demo infrastructure runs on one t3.small. In a real multi-node Kubernetes cluster, the pod information would be spread across namespaces and nodes. The tools work against any kubeconfig — but the demo is scoped to what's in one cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read-heavy, not write-heavy.&lt;/strong&gt; Most tools read state — pod status, AWS costs, Docker containers. The only write operation is &lt;code&gt;restart_deployment&lt;/code&gt;. There's no tool to edit a deployment YAML, push a config change, or create resources. Giving Claude write access to production infrastructure is a separate (and more serious) conversation about guardrails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude restarts but can't fix root causes.&lt;/strong&gt; If a pod is in CrashLoopBackOff because of a bad container image, Claude can restart it — but it'll crash again immediately. Claude diagnosed the issue correctly and warned me, but the actual fix (updating the deployment YAML and pushing a correct image) is outside the scope of the tools I gave it. The tools match what I was comfortable giving an AI access to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No authentication on the MCP server.&lt;/strong&gt; The server runs locally over stdio — there's no network exposure. But if you ever expose this over a network (e.g., run the server on EC2 and connect Claude Desktop remotely), you'd need to add authentication. Don't do that without thinking it through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS credentials are local.&lt;/strong&gt; The server uses whatever credentials are in &lt;code&gt;~/.aws/credentials&lt;/code&gt;. If you're running this on a shared machine, those credentials are accessible to anything running as your user. Use IAM roles with minimal permissions scoped to read-only for Cost Explorer, EC2, CloudWatch, and S3.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/ThinkWithOps/ai-devops-systems-lab.git" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps/ai-devops-systems-lab.git&lt;/a&gt;&lt;br&gt;
Demo: &lt;a href="https://youtu.be/2XsNcKSa28s" rel="noopener noreferrer"&gt;https://youtu.be/2XsNcKSa28s&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ThinkWithOps/ai-devops-systems-lab.git
&lt;span class="nb"&gt;cd &lt;/span&gt;03-ai-devops-mcp-server
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Test with mock data — no real infra needed&lt;/span&gt;
&lt;span class="nv"&gt;KUBE_MOCK_MODE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true &lt;/span&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the MCP server to your Claude Desktop config and restart. Full setup guide in the README.&lt;/p&gt;
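&lt;p&gt;For reference, the Claude Desktop config entry lives under the &lt;code&gt;mcpServers&lt;/code&gt; key; the server name and path below are placeholders:&lt;/p&gt;

```json
{
  "mcpServers": {
    "devops": {
      "command": "python",
      "args": ["/path/to/server.py"]
    }
  }
}
```
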

&lt;p&gt;This is Project 03 of the ThinkWithOps AI + DevOps series.&lt;/p&gt;




&lt;p&gt;What would you control first if Claude had access to your infrastructure? Drop it in the comments.&lt;/p&gt;




</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built an AI That Understands Any GitHub Repo Using LangChain and ChromaDB</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Wed, 01 Apr 2026 16:54:53 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/-i-built-an-ai-that-understands-any-github-repo-using-langchain-and-chromadb-m30</link>
      <guid>https://dev.to/vijaya_bollu/-i-built-an-ai-that-understands-any-github-repo-using-langchain-and-chromadb-m30</guid>
      <description>&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;Every time I join a new codebase, the first few days are the same: open the repo, stare at folders, try to figure out which service does what, read half a file, get interrupted, lose context, start over.&lt;/p&gt;

&lt;p&gt;GitHub's built-in search is keyword-only. ChatGPT has never seen your repo. Teammates are busy. Documentation is either missing or out of date.&lt;/p&gt;

&lt;p&gt;I wanted a tool that could answer &lt;em&gt;"how does checkout work?"&lt;/em&gt; from the actual code — not from training data, not from docs, but from the real source files.&lt;/p&gt;

&lt;p&gt;So I built one.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The system is built around a RAG (Retrieval-Augmented Generation) pipeline. The idea: instead of asking an LLM to answer from memory, you first retrieve the most relevant code chunks, then ask the LLM to answer using only those chunks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingest flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clone the GitHub repo locally&lt;/li&gt;
&lt;li&gt;Walk every file and split into overlapping chunks (~500 tokens, 50-token overlap)&lt;/li&gt;
&lt;li&gt;Convert each chunk to a vector embedding using &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; (Sentence Transformers — local, free)&lt;/li&gt;
&lt;li&gt;Store embeddings + metadata in ChromaDB&lt;/li&gt;
&lt;/ol&gt;
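&lt;p&gt;The overlap logic can be sketched in a few lines; this character-based version is a simplified stand-in for the token-based splitter:&lt;/p&gt;

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list:
    # Consecutive chunks share `overlap` characters, so context that
    # spans a boundary appears in both chunks instead of being cut.
    step = size - overlap
    chunks = []
    start = 0
    while len(text) > start:
        chunks.append(text[start:start + size])
        start += step
    return chunks
```
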

&lt;p&gt;&lt;strong&gt;Query flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Embed the user's question with the same model&lt;/li&gt;
&lt;li&gt;ChromaDB cosine similarity search → top-5 most relevant chunks&lt;/li&gt;
&lt;li&gt;Inject chunks into a LangChain prompt&lt;/li&gt;
&lt;li&gt;LLM generates an answer with source file citations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The frontend is a Next.js split-pane layout — file tree on the left, chat on the right. The ingest endpoint streams progress via Server-Sent Events so the user sees live updates instead of a spinner.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# LangChain RAG chain (simplified)
&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;qa_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RetrievalQA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_chain_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chain_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stuff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_source_documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;qa_chain&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM only ever sees 5 chunks — not the whole repo. That keeps answers fast and grounded.&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo / Results
&lt;/h2&gt;

&lt;p&gt;I indexed the ShopFlow ecommerce demo app — ~847 files across 12 services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; "How does the checkout flow work?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The checkout flow works as follows:
1. Cart validation — CartService checks inventory availability (src/services/cart.py:142)
2. Payment processing — PaymentService calls Stripe API with order total (src/services/payment.py:89)
3. Order creation — OrderService writes to PostgreSQL with status PENDING (src/models/order.py:67)
4. Inventory update — InventoryService decrements stock counts (src/services/inventory.py:203)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Index time: ~30 seconds. Answer latency: ~3 seconds. API cost: $0 (self-hosted on AWS EC2).&lt;/p&gt;




&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. FastAPI path parameters break on slashes&lt;/strong&gt;&lt;br&gt;
Repo names like &lt;code&gt;owner/repo&lt;/code&gt; contain a &lt;code&gt;/&lt;/code&gt; — FastAPI treats it as a path separator and routes to the wrong endpoint. Fix: declare the parameter as &lt;code&gt;{repo_name:path}&lt;/code&gt;. One character. Found it via a 404 in production after deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. TypeScript types drift silently from backend SSE events&lt;/strong&gt;&lt;br&gt;
The backend was emitting &lt;code&gt;file&lt;/code&gt;, &lt;code&gt;file_path&lt;/code&gt;, and &lt;code&gt;indexed_at&lt;/code&gt; fields in the SSE stream that the frontend TypeScript interface didn't declare. No error in local dev — only failed during &lt;code&gt;next build&lt;/code&gt; inside Docker. A shared OpenAPI-generated type contract would have caught this at development time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Docker layer caching hides real code&lt;/strong&gt;&lt;br&gt;
After pushing a fix, the Docker build was still serving old code because it cached the &lt;code&gt;COPY . .&lt;/code&gt; layer. Always run &lt;code&gt;docker-compose build --no-cache&lt;/code&gt; when a fix isn't appearing after git pull.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Chunk overlap matters more than chunk size&lt;/strong&gt;&lt;br&gt;
I initially had no overlap between chunks. The AI would give incomplete answers for questions that spanned function boundaries — the relevant context was split across two chunks that were never returned together. Adding 50-token overlap fixed most of these cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ThinkWithOps/ai-devops-systems-lab
&lt;span class="nb"&gt;cd &lt;/span&gt;projects/02-ai-github-repo-explainer
&lt;span class="nb"&gt;cp &lt;/span&gt;backend/.env.example backend/.env
&lt;span class="c"&gt;# Add your GROQ_API_KEY or leave blank to use Ollama&lt;/span&gt;
docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://&amp;lt;your-ec2-ip&amp;gt;:3000&lt;/code&gt;, paste any public GitHub URL, and start asking questions.&lt;/p&gt;

&lt;p&gt;🔗 GitHub: &lt;a href="https://github.com/ThinkWithOps/ai-devops-systems-lab" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps/ai-devops-systems-lab&lt;/a&gt;&lt;br&gt;
📺 Full build walkthrough: &lt;a href="https://youtu.be/a6376K9Lm00" rel="noopener noreferrer"&gt;https://youtu.be/a6376K9Lm00&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is Project 02 of a 30-project AI + DevOps series. Each project is a real deployed system — not a tutorial snippet.&lt;/p&gt;




&lt;p&gt;What's the most confusing codebase you've ever had to onboard into? What would you have asked an AI first?&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>chromadb</category>
      <category>devops</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built a DevOps Chatbot That Checks My Live App for Failures — Here's How It Works</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Tue, 31 Mar 2026 16:50:21 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/-i-built-a-devops-chatbot-that-checks-my-live-app-for-failures-heres-how-it-works-h78</link>
      <guid>https://dev.to/vijaya_bollu/-i-built-a-devops-chatbot-that-checks-my-live-app-for-failures-heres-how-it-works-h78</guid>
      <description>&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;Every DevOps engineer has had the 2am moment. Something is broken. You don't know what. You SSH in, check logs, Google the error, open five tabs, still nothing clear. Thirty minutes later you find it — a config flag someone changed, a slow query, a dependency timing out.&lt;/p&gt;

&lt;p&gt;I wanted to ask an AI instead. Not a generic ChatGPT that gives you textbook answers, but an AI connected to my actual running system that can check what's broken right now.&lt;/p&gt;

&lt;p&gt;So I built the AI DevOps Copilot — Project 01 of my 30-project AI + DevOps YouTube series.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The system has four layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. LangChain agent (the brain)&lt;/strong&gt;&lt;br&gt;
Uses &lt;code&gt;create_tool_calling_agent&lt;/code&gt; with Llama 3.1 via Groq. When you ask a question, the agent decides whether to answer from knowledge or call a tool. General DevOps questions → instant answer. Questions about the live app → tool call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. ChromaDB RAG (the knowledge base)&lt;/strong&gt;&lt;br&gt;
Nine runbook documents embedded into a vector database — Docker troubleshooting, AWS debugging, Kubernetes, Terraform, Linux performance, security, and more. The agent searches these for context when answering general questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Tool layer (the live connection)&lt;/strong&gt;&lt;br&gt;
Four tools: &lt;code&gt;restaurant_monitor&lt;/code&gt; (hits the live restaurant app API), &lt;code&gt;log_search&lt;/code&gt; (searches application logs), &lt;code&gt;github_search&lt;/code&gt; (searches repos), and &lt;code&gt;devops_docs&lt;/code&gt; (searches the runbook vector store).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. FastAPI + SSE streaming (the interface)&lt;/strong&gt;&lt;br&gt;
The agent runs in a thread executor, streaming tokens back through an asyncio Queue with a sentinel pattern. The Next.js frontend connects via Server-Sent Events and renders each token as it arrives.&lt;/p&gt;
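&lt;p&gt;The queue-and-sentinel pattern, reduced to a self-contained sketch (the real version yields SSE events instead of collecting a list):&lt;/p&gt;

```python
import asyncio

SENTINEL = None  # pushed by the worker when the agent finishes

async def stream_tokens(run_agent) -> list:
    # The blocking agent runs in a worker thread and hands each token
    # to the event loop through a Queue; the consumer drains the queue
    # until it sees the sentinel.
    queue: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()

    def emit(token):
        loop.call_soon_threadsafe(queue.put_nowait, token)

    def worker():
        run_agent(emit)
        emit(SENTINEL)

    task = asyncio.create_task(asyncio.to_thread(worker))
    tokens = []
    while True:
        token = await queue.get()
        if token is SENTINEL:
            break
        tokens.append(token)
    await task
    return tokens
```
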

&lt;p&gt;Data flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User question → FastAPI → LangChain agent → tool call or RAG search → streamed response → Next.js UI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Demo / Results
&lt;/h2&gt;

&lt;p&gt;I deployed a chaos engineering demo app — a restaurant application with injectable failure modes. One failure mode: &lt;code&gt;slow_menu&lt;/code&gt;, which adds a 2-second artificial delay to &lt;code&gt;GET /api/menu&lt;/code&gt;.&lt;/p&gt;
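&lt;p&gt;The failure-toggle idea reduces to a tiny sketch (the demo app's real toggle injects a 2-second delay; this one uses 0.1s to stay fast):&lt;/p&gt;

```python
import time

# Toggleable failure modes, as in the demo app; names are illustrative.
FAILURES = {"slow_menu": False}

def get_menu() -> dict:
    start = time.monotonic()
    if FAILURES["slow_menu"]:
        time.sleep(0.1)  # artificial delay injected by the toggle
    return {"items": ["pizza", "pasta"], "elapsed": time.monotonic() - start}
```

&lt;p&gt;Flipping the toggle off makes the endpoint fast again, which is exactly what the copilot verifies after the fix.&lt;/p&gt;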

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Menu page spinning, customers waiting. I ask the copilot: "What is wrong with the restaurant app right now?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tool called: restaurant_monitor → action: failures
API response: slow_menu: ACTIVE (2s delay on /api/menu)

AI answer:
The slow_menu failure mode is currently ACTIVE.
This injects a 2-second delay into the Menu API.

Fix:
1. Operator dashboard → Failures → Disable slow_menu
2. Or: POST /api/admin/failures/slow_menu/disable
3. Verify: menu should load in &amp;lt;100ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Time to diagnosis: ~8 seconds.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After disabling the toggle, the menu loaded instantly. The entire incident lifecycle — detection, diagnosis, fix, verification — took under 2 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. &lt;code&gt;create_react_agent&lt;/code&gt; kept hitting iteration limits.&lt;/strong&gt;&lt;br&gt;
I started with the ReAct text-parsing agent. Llama 3.1-8b kept failing the "Thought/Action/Observation" format, exhausting 10 iterations on simple questions. Switching to &lt;code&gt;create_tool_calling_agent&lt;/code&gt; — which uses native LLM function calling — fixed this completely. The model knows how to call functions; it doesn't know how to produce exact ReAct formatting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Tool call chunks were streaming to the user as garbage text.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;on_llm_new_token&lt;/code&gt; in LangChain fires for every token — including internal tool call encodings like &lt;code&gt;{"name": "restaurant_monitor", "arguments":&lt;/code&gt;. Fixed by checking &lt;code&gt;chunk.tool_call_chunks&lt;/code&gt; and skipping those tokens. Without this, users see raw JSON blobs in the chat.&lt;/p&gt;
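&lt;p&gt;The filtering boils down to one check; &lt;code&gt;Chunk&lt;/code&gt; here is a minimal stand-in for LangChain's streamed chunk object:&lt;/p&gt;

```python
class Chunk:
    # Minimal stand-in for a streamed LLM chunk.
    def __init__(self, text, tool_call_chunks=()):
        self.text = text
        self.tool_call_chunks = list(tool_call_chunks)

def visible_tokens(chunks) -> str:
    # Skip any chunk carrying tool-call fragments so the user never
    # sees raw JSON function-call encodings in the chat.
    return "".join(c.text for c in chunks if not c.tool_call_chunks)
```
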

&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;host.docker.internal&lt;/code&gt; doesn't resolve on EC2 Linux.&lt;/strong&gt;&lt;br&gt;
Added &lt;code&gt;extra_hosts: - "host.docker.internal:host-gateway"&lt;/code&gt; to Docker Compose — still failed. Fixed by using the EC2 private IP directly: &lt;code&gt;RESTAURANT_API_URL=http://172.31.90.69:8010&lt;/code&gt; in the &lt;code&gt;.env&lt;/code&gt; file. Simple, obvious in hindsight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Groq free tier is 6,000 tokens per minute.&lt;/strong&gt;&lt;br&gt;
With &lt;code&gt;llama-3.3-70b-versatile&lt;/code&gt;, three questions hit the limit. Switched to &lt;code&gt;llama-3.1-8b-instant&lt;/code&gt; — much faster, lower token usage, still very capable for DevOps Q&amp;amp;A. For a demo or portfolio project, this is the right call.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;📁 GitHub: &lt;a href="https://github.com/ThinkWithOps/ai-devops-systems-lab/tree/main/projects/01-ai-devops-copilot" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps/ai-devops-systems-lab/tree/main/projects/01-ai-devops-copilot&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎬 Video walkthrough: &lt;a href="https://youtu.be/a50334Szt5g" rel="noopener noreferrer"&gt;https://youtu.be/a50334Szt5g&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone and run locally&lt;/span&gt;
git clone https://github.com/ThinkWithOps/ai-devops-systems-lab
&lt;span class="nb"&gt;cd &lt;/span&gt;projects/01-ai-devops-copilot
&lt;span class="nb"&gt;cp &lt;/span&gt;backend/.env.example backend/.env
&lt;span class="c"&gt;# Add your GROQ_API_KEY to .env&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:3000&lt;/code&gt; — the copilot is running.&lt;/p&gt;

&lt;p&gt;This is Project 01 of a 30-project series building AI + DevOps systems from scratch.&lt;/p&gt;




&lt;p&gt;What's the most painful part of your on-call experience that you wish an AI could handle? Drop it in the comments — it might become Project 02.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
`

---
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>langchain</category>
      <category>python</category>
    </item>
    <item>
      <title>How I Built an AI That Diagnoses GitHub Actions Failures Automatically</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Fri, 20 Mar 2026 19:17:38 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/how-i-built-an-ai-that-diagnoses-github-actions-failures-automatically-2d7j</link>
      <guid>https://dev.to/vijaya_bollu/how-i-built-an-ai-that-diagnoses-github-actions-failures-automatically-2d7j</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;GitHub Actions failure logs are noisy. Finding the actual error in 500 lines of output takes time you don't have during an incident. You get a red X, click through multiple pages, scroll past runner setup noise and dependency install output, land on the real error — and then you have to figure out what it means and what to do about it. I was doing this loop manually too often, so I automated it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;The tool fetches your repository's failed workflow runs via the GitHub API, extracts the relevant error sections from job logs, pulls in the workflow YAML for context, and sends everything to Ollama running locally. You get back a root cause, an explanation of why it happened, exact YAML changes to make, and steps to prevent it from happening again — in about 15 seconds.&lt;/p&gt;
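&lt;p&gt;The GitHub side of that flow is a single REST call; a sketch of how the request might be assembled (pagination and retries omitted):&lt;/p&gt;

```python
def failed_runs_request(owner: str, repo: str, token: str) -> tuple:
    # Build the GitHub REST call for failed workflow runs; the tool
    # then sends this with an HTTP client and walks the results.
    url = f"https://api.github.com/repos/{owner}/{repo}/actions/runs"
    params = {"status": "failure", "per_page": 5}
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    return url, params, headers
```
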

&lt;p&gt;Zero cloud costs. Your logs and code never leave your machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 5 Failure Types It Handles
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dependency conflicts&lt;/strong&gt; — version mismatches, packages missing from requirements.txt or package.json&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing secrets&lt;/strong&gt; — env vars referenced in the workflow but not configured in repository settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission errors&lt;/strong&gt; — &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; scope issues, OIDC misconfiguration, action not allowed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker build failures&lt;/strong&gt; — base image not found, build context issues, registry auth failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flaky tests&lt;/strong&gt; — identifies non-deterministic failures vs real bugs based on error patterns and exit codes&lt;/li&gt;
&lt;/ol&gt;
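&lt;p&gt;Before prompting the model, a keyword pass can route a log excerpt to one of these five buckets; the patterns below are illustrative, not the tool's actual list:&lt;/p&gt;

```python
# Heuristic keyword patterns per failure type (illustrative only).
PATTERNS = {
    "dependency_conflict": ["resolutionimpossible", "eresolve", "version conflict"],
    "missing_secret": ["secret not found", "bad credentials"],
    "permission_error": ["permission denied", "resource not accessible"],
    "docker_build": ["failed to solve", "manifest unknown"],
    "flaky_test": ["timeouterror", "connectionreset"],
}

def classify_failure(log_excerpt: str) -> str:
    lowered = log_excerpt.lower()
    for failure_type, needles in PATTERNS.items():
        if any(needle in lowered for needle in needles):
            return failure_type
    return "unknown"
```
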




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The flow is straightforward: GitHub REST API → Python log parser → Ollama.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GitHub API (workflow runs + job logs + YAML)
         │
         ▼
  extract_error_from_logs()   ← keyword scan, context windows, dedup
         │
         ▼
  analyze_failure()           ← structured prompt to Ollama
         │
         ▼
  Terminal report             ← root cause + YAML changes + prevention
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The log extraction step is where most of the work happens. Raw GitHub Actions logs are thousands of lines — runner diagnostics, apt output, pip install progress bars, none of which is useful. The parser scans for a list of error keywords (&lt;code&gt;error:&lt;/code&gt;, &lt;code&gt;failed&lt;/code&gt;, &lt;code&gt;Traceback&lt;/code&gt;, &lt;code&gt;exit code&lt;/code&gt;, &lt;code&gt;permission denied&lt;/code&gt;, etc.), then captures 8 lines of context before and 12 after each hit, deduplicates overlapping windows, and caps the result at 1500 characters before sending to the AI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_error_from_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;error_keywords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Traceback&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fatal:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;command not found&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;permission denied&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;exit code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;returned non-zero&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;error_sections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;seen_lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;error_keywords&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen_lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;seen_lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;error_sections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_sections&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;... (truncated)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI prompt includes the workflow name, failed job name, the extracted error section, and up to 1000 characters of the workflow YAML. That YAML context is what lets the AI suggest specific YAML fixes rather than generic advice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a GitHub Actions expert analyzing a failed CI/CD workflow.

Workflow: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;workflow_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Failed Job: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;workflow_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

ERROR LOGS:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;limited_logs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

**ROOT CAUSE:** [each specific error, all of them]
**WHY THIS HAPPENED:** [plain language explanation]
**HOW TO FIX:** [numbered, actionable steps]
**YAML CHANGES:** [exact changes needed]
**PREVENTION:** [how to avoid this next time]
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Most Useful Feature: Failure Classification
&lt;/h2&gt;

&lt;p&gt;The thing that saves the most time isn't root cause identification — it's the flaky vs real distinction.&lt;/p&gt;

&lt;p&gt;When a test fails with a non-deterministic error (race condition, network timeout, port already in use), re-running the workflow is the right call. When a test fails because you introduced a bug, re-running wastes 5 minutes and doesn't help. Before this tool, I'd often re-run first and only look at logs after the second failure confirmed it wasn't flaky. That's a 5-10 minute delay every time.&lt;/p&gt;

&lt;p&gt;The AI picks this up from patterns in the error output. A &lt;code&gt;ConnectionRefused&lt;/code&gt; or &lt;code&gt;socket timeout&lt;/code&gt; alongside a test failure is a different signal than a clean &lt;code&gt;AssertionError: expected 200, got 404&lt;/code&gt;. The prompt explicitly asks for this classification, and Llama 3.2 handles it reliably — it's the kind of pattern matching that LLMs are genuinely good at.&lt;/p&gt;

&lt;p&gt;The accuracy isn't perfect (the README documents ~85%), but it's good enough that I check the classification before deciding whether to re-run or investigate.&lt;/p&gt;
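The signal the prompt asks the AI for can be approximated with a plain keyword heuristic. This is an illustrative sketch only (the signal lists and function name are mine, not the project's code), but it shows why the classification is tractable pattern matching:

```python
# Illustrative flaky-vs-real heuristic -- not the project's actual code.
FLAKY_SIGNALS = [
    'connection refused', 'connectionrefused', 'socket timeout',
    'timed out', 'address already in use', 'temporarily unavailable',
]
REAL_BUG_SIGNALS = [
    'assertionerror', 'typeerror', 'modulenotfounderror', 'syntaxerror',
]

def classify_failure(error_text: str) -> str:
    """Crude classification from error text: 'real', 'likely-flaky', or 'unknown'."""
    text = error_text.lower()
    if any(sig in text for sig in REAL_BUG_SIGNALS):
        return 'real'          # deterministic failure: investigate, don't re-run
    if any(sig in text for sig in FLAKY_SIGNALS):
        return 'likely-flaky'  # infrastructure noise: a re-run may pass
    return 'unknown'
```

The LLM does the same thing with far richer context, which is why it handles cases a keyword list would miss.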




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt; — a failed workflow log containing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Run pytest tests/
ERROR: No module named 'requests'
FAILED tests/test_api.py::test_health - ModuleNotFoundError
error: Process completed with exit code 1.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;🔍 GITHUB ACTIONS FAILURE ANALYSIS
Repository: myname/myrepo | Workflow: CI Build | Failed Job: test

🤖 AI DIAGNOSIS:

&lt;span class="gs"&gt;**ROOT CAUSE:**&lt;/span&gt;
pytest is failing because the 'requests' library is not installed.
The test imports it, but it's not listed in requirements.txt, so
pip install -r requirements.txt doesn't include it.

&lt;span class="gs"&gt;**WHY THIS HAPPENED:**&lt;/span&gt;
The package works locally because you have it installed globally on your
machine. In the GitHub Actions runner, only packages in requirements.txt
get installed — nothing else is available.

&lt;span class="gs"&gt;**HOW TO FIX:**&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Add 'requests' to requirements.txt: requests&amp;gt;=2.32.0
&lt;span class="p"&gt;2.&lt;/span&gt; Commit and push — the workflow will pick it up automatically.
&lt;span class="p"&gt;3.&lt;/span&gt; If requests is only needed for tests, add a requirements-dev.txt
   and install it separately in a dedicated workflow step.

&lt;span class="gs"&gt;**YAML CHANGES:**&lt;/span&gt;
No YAML changes needed. The fix is in requirements.txt.

&lt;span class="gs"&gt;**PREVENTION:**&lt;/span&gt;
Run your CI workflow locally with a clean virtualenv before pushing:
  python -m venv .venv &amp;amp;&amp;amp; source .venv/bin/activate
  pip install -r requirements.txt
  pytest tests/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That took 14 seconds. The manual version (open the run page, scroll the logs, find the error, Google it, work out the fix) takes 10-15 minutes minimum.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ThinkWithOps/ai-devops-projects" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps/ai-devops-projects&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Video:&lt;/strong&gt; &lt;a href="https://youtu.be/EwgdZ8KmBJg" rel="noopener noreferrer"&gt;https://youtu.be/EwgdZ8KmBJg&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Prerequisites: Ollama running with llama3.2 pulled&lt;/span&gt;
git clone https://github.com/ThinkWithOps/ai-devops-projects
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-devops-projects/04-ai-github-actions-healer
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Set your GitHub token (needs repo + workflow scopes)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_token_here"&lt;/span&gt;

&lt;span class="c"&gt;# Run against your repo&lt;/span&gt;
python src/github_actions_healer.py &lt;span class="nt"&gt;--repo&lt;/span&gt; your-username/your-repo

&lt;span class="c"&gt;# Save report to file&lt;/span&gt;
python src/github_actions_healer.py &lt;span class="nt"&gt;--repo&lt;/span&gt; owner/repo &lt;span class="nt"&gt;--output&lt;/span&gt; report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Project 4 in my AI+DevOps series — all tools run locally with Ollama, zero cloud AI costs. Project 3 was an AI AWS Cost Detective, Project 5 is an AI Terraform Code Generator. Links in my profile.&lt;/p&gt;




&lt;p&gt;What GitHub Actions failure do you dread the most? For me it's the OIDC / GITHUB_TOKEN permission errors — the error message never tells you which specific permission is missing. The AI actually handles those surprisingly well. Drop yours in the comments.&lt;/p&gt;

</description>
      <category>github</category>
      <category>devops</category>
      <category>cicd</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built a Local AI Terraform Generator and Tested It By Actually Deploying to AWS — Here Are the Results</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Fri, 13 Mar 2026 15:29:11 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/-i-built-a-local-ai-terraform-generator-and-tested-it-by-actually-deploying-to-aws-here-are-the-n4m</link>
      <guid>https://dev.to/vijaya_bollu/-i-built-a-local-ai-terraform-generator-and-tested-it-by-actually-deploying-to-aws-here-are-the-n4m</guid>
      <description>&lt;h2&gt;
  
  
  The Idea
&lt;/h2&gt;

&lt;p&gt;Every time I needed a new AWS resource, I'd spend 20 minutes reading Terraform docs just to get the syntax right for something I'd done before. I wanted to type plain English and get working HCL back. But I also didn't want to just &lt;em&gt;generate&lt;/em&gt; code — I wanted to know if it actually deploys. So I tested every resource by running &lt;code&gt;terraform apply&lt;/code&gt; against a real AWS account.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;You describe infrastructure in plain English. The tool sends it to a local Llama 3.2 model via Ollama, which returns four Terraform files. Those files get saved to a &lt;code&gt;generated/&lt;/code&gt; folder, ready for &lt;code&gt;terraform init&lt;/code&gt; and &lt;code&gt;terraform apply&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Plain English → Python → Ollama (local) → Parse HCL → main.tf + variables.tf + outputs.tf + tfvars.example
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key piece is the prompt. Getting consistent, parseable HCL out of an LLM required a very specific structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a Terraform/OpenTofu expert. Generate production-ready infrastructure code.

USER REQUEST:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

PROVIDER: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

CRITICAL:
- Generate ONLY valid Terraform/HCL code
- NO markdown formatting or code blocks
- Start each file with a comment showing the filename
- Separate files with: ### FILENAME ###

Format your response like this:

### main.tf ###
terraform {{
  required_version = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;= 1.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
  required_providers {{
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; = {{
      source  = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hashicorp/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
      version = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~&amp;gt; 5.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    }}
  }}
}}

[rest of main.tf code]

### variables.tf ###
[variables code]

### outputs.tf ###
[outputs code]

### terraform.tfvars.example ###
[example values]

Generate production-ready code now:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;### FILENAME ###&lt;/code&gt; markers are what make the response parseable. The script splits on &lt;code&gt;###&lt;/code&gt;, reads the filename, grabs everything after it until the next marker, and writes that to disk. There's also a fallback parser for when the model goes off-script and wraps things in code blocks anyway.&lt;/p&gt;
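That split-on-marker step can be sketched in a few lines. This is a minimal illustration of the idea, assuming the `### FILENAME ###` format shown in the prompt; the function name and regex are mine, and the real script also carries the fallback parser mentioned above:

```python
import re

def parse_generated_files(response: str) -> dict:
    """Split an LLM response on '### filename ###' markers into {filename: content}.

    Illustrative sketch of the parsing described above, not the project's code.
    """
    files = {}
    # re.split with one capture group yields: [preamble, name1, body1, name2, body2, ...]
    parts = re.split(r'###\s*([\w.\-]+)\s*###', response)
    for name, body in zip(parts[1::2], parts[2::2]):
        files[name] = body.strip() + '\n'
    return files
```

Each entry can then be written straight to disk in the `generated/` folder.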




&lt;h2&gt;
  
  
  Test Results: 10 Resources Tested
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Generated&lt;/th&gt;
&lt;th&gt;Validated&lt;/th&gt;
&lt;th&gt;Deployed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;EC2 instance&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 bucket&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM role&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC + subnets&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security group&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RDS instance&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALB&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS task&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️ needed fix&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex module&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️ needed fix&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;8/10 deployed first try. 2/10 needed minor manual fixes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ECS task definition had an incorrect &lt;code&gt;network_mode&lt;/code&gt; value for Fargate (which requires &lt;code&gt;awsvpc&lt;/code&gt;). The complex multi-resource module had a missing &lt;code&gt;depends_on&lt;/code&gt; for the security group. Both were one-line fixes once &lt;code&gt;terraform validate&lt;/code&gt; pointed at them.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It's Good At vs Where It Struggles
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard single-resource configs (EC2, S3, IAM, RDS) — near-perfect every time&lt;/li&gt;
&lt;li&gt;Wiring dependencies correctly between resources it knows well&lt;/li&gt;
&lt;li&gt;Generating &lt;code&gt;variables.tf&lt;/code&gt; with descriptions and sensible defaults&lt;/li&gt;
&lt;li&gt;Adding tags and naming conventions without being asked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Struggles with:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very new AWS resources where Llama's training data is thin&lt;/li&gt;
&lt;li&gt;Complex modules with many interdependent resources — sometimes misses a &lt;code&gt;depends_on&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Provider version pinning — occasionally suggests a deprecated argument from an older AWS provider version&lt;/li&gt;
&lt;li&gt;ECS/EKS specifics — these configs are dense and the model sometimes gets task definition fields wrong&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Honest assessment: treat it like a junior engineer who's read all the Terraform docs but hasn't deployed much to production. Good first draft, always needs a review.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Prompt Engineering That Made It Work
&lt;/h2&gt;

&lt;p&gt;Three things made the difference between garbage output and deployable HCL:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Kill the markdown.&lt;/strong&gt; LLMs love wrapping code in fenced &lt;code&gt;```hcl&lt;/code&gt; blocks. That breaks the file parser completely. The explicit instruction &lt;code&gt;NO markdown formatting or code blocks&lt;/code&gt; eliminated this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Show the exact format in the prompt.&lt;/strong&gt; Telling the model to use &lt;code&gt;### main.tf ###&lt;/code&gt; as a separator, and including the &lt;code&gt;terraform {}&lt;/code&gt; block structure directly in the prompt, anchored the output format. Without this, every response looked slightly different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Demand variables explicitly.&lt;/strong&gt; Early versions hardcoded values like &lt;code&gt;instance_type = "t3.micro"&lt;/code&gt; directly in &lt;code&gt;main.tf&lt;/code&gt;. Adding &lt;code&gt;Use variables for configurable values&lt;/code&gt; to the requirements section fixed this — now everything configurable lands in &lt;code&gt;variables.tf&lt;/code&gt; with proper descriptions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Local Matters for IaC Generation
&lt;/h2&gt;

&lt;p&gt;Your Terraform descriptions contain your architecture. "Create a VPC with private subnets, an RDS cluster for our auth service, and an ECS task that pulls from our private ECR registry" — that's a roadmap of your production infrastructure. Sending that to a cloud API means it leaves your machine.&lt;/p&gt;

&lt;p&gt;Running Ollama locally means the description, the generated code, and any sensitive context like account IDs or naming patterns stay on your machine. For anything touching production infrastructure, that's not optional.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ThinkWithOps" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Live deploy demo:&lt;/strong&gt; &lt;a href="https://youtu.be/nhhZqrCEhOA" rel="noopener noreferrer"&gt;https://youtu.be/nhhZqrCEhOA&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ThinkWithOps/ai-devops-projects
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-devops-projects/05-ai-terraform-generator
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Generate code&lt;/span&gt;
python src/terraform_generator.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--description&lt;/span&gt; &lt;span class="s2"&gt;"EC2 instance with S3 bucket for logs"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--provider&lt;/span&gt; aws

&lt;span class="c"&gt;# Deploy it&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;generated/
&lt;span class="nb"&gt;cp &lt;/span&gt;terraform.tfvars.example terraform.tfvars
terraform init &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; terraform plan &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Project 5 in my AI+DevOps series — all tools run locally with Ollama, zero cloud API costs.&lt;/p&gt;




&lt;p&gt;What's your current Terraform workflow? I'm curious whether people are using Copilot for this, manually writing it, or something else entirely — drop it in the comments.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>devops</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built an AI Kubernetes Pod Debugger — Diagnoses CrashLoopBackOff, OOMKilled, and More in Seconds</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Fri, 13 Mar 2026 15:14:28 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/-i-built-an-ai-kubernetes-pod-debugger-diagnoses-crashloopbackoff-oomkilled-and-more-in-seconds-24k2</link>
      <guid>https://dev.to/vijaya_bollu/-i-built-an-ai-kubernetes-pod-debugger-diagnoses-crashloopbackoff-oomkilled-and-more-in-seconds-24k2</guid>
      <description>&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;K8s error messages are designed for cluster operators, not for the developer who just shipped a feature and now has a pod stuck in &lt;code&gt;CrashLoopBackOff&lt;/code&gt; at 11pm. &lt;code&gt;kubectl get pods&lt;/code&gt; tells you something is broken. It doesn't tell you why, and it definitely doesn't tell you what to do about it. I kept doing the same 4-command loop — &lt;code&gt;get pods&lt;/code&gt;, &lt;code&gt;describe pod&lt;/code&gt;, &lt;code&gt;logs&lt;/code&gt;, Google the error — and thought there had to be a better way.&lt;/p&gt;

&lt;p&gt;So I automated the loop and added an AI in the middle.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 6 Failure Types It Handles
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ImagePullBackOff&lt;/strong&gt; — wrong image name, missing credentials, private registry issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrashLoopBackOff&lt;/strong&gt; — app crash on startup, missing env vars, bad config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OOMKilled&lt;/strong&gt; — container exceeding memory limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pending&lt;/strong&gt; — insufficient cluster resources, node selector issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed&lt;/strong&gt; — job completion errors, config map missing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Init Container failures&lt;/strong&gt; — init container exits non-zero&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The tool runs three kubectl commands in sequence, then hands everything to Ollama.&lt;/p&gt;

&lt;p&gt;First, &lt;code&gt;kubectl get pods -o json&lt;/code&gt; scans the namespace and flags any pod that isn't &lt;code&gt;Running&lt;/code&gt;, isn't &lt;code&gt;ready&lt;/code&gt;, or has restarts &amp;gt; 0. For each unhealthy pod, it grabs the last 50 lines of logs and the Events section from &lt;code&gt;kubectl describe&lt;/code&gt;. Both get truncated before hitting the AI — logs capped at 2000 chars, events at 1000 — to stay within a reasonable context window.&lt;/p&gt;
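The unhealthy-pod scan over that JSON can be sketched like this. The field paths follow standard `kubectl get pods -o json` output, but the function itself is an illustration, not the project's actual implementation:

```python
import json

def find_unhealthy_pods(pods_json: str) -> list:
    """Flag pods that aren't Running, aren't ready, or have restarts > 0.

    pods_json is the raw output of `kubectl get pods -o json`.
    Illustrative sketch of the scan described above.
    """
    unhealthy = []
    for pod in json.loads(pods_json)['items']:
        statuses = pod['status'].get('containerStatuses', [])
        # A pod with no container statuses yet (e.g. Pending) counts as not ready.
        ready = all(cs.get('ready') for cs in statuses) if statuses else False
        restarts = sum(cs.get('restartCount', 0) for cs in statuses)
        phase = pod['status'].get('phase')
        if phase != 'Running' or not ready or restarts > 0:
            unhealthy.append({
                'name': pod['metadata']['name'],
                'status': phase,
                'ready': ready,
                'restarts': restarts,
            })
    return unhealthy
```

Each flagged pod then gets its logs and events collected for the prompt below.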

&lt;p&gt;Then everything goes into a structured prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_pod_failure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logs_sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No logs available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;events_sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No events available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a Kubernetes expert helping debug pod failures.

Pod Information:
- Name: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
- Status: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
- Ready: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ready&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
- Restart Count: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;restarts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Recent Logs (last 50 lines):
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;logs_sample&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Recent Events:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;events_sample&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Provide a diagnosis in this format:

**ROOT CAUSE:**
[Explain in 1-2 sentences what&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s causing the pod to fail, using simple terms]

**WHY THIS HAPPENS:**
[Explain why this error occurs, use an analogy if helpful]

**HOW TO FIX:**
[Provide specific kubectl commands or YAML changes to fix the issue]
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask_ollama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_remove_repeated_suffix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;_remove_repeated_suffix&lt;/code&gt; at the end strips duplicated content that Llama 3.2 sometimes generates — more on that below.&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo: CrashLoopBackOff Diagnosis
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt; — what the cluster shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get pods
NAME          READY   STATUS             RESTARTS   AGE
broken-pod    0/1     CrashLoopBackOff   4          3m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Logs show: &lt;code&gt;panic: runtime error: invalid memory address or nil pointer dereference&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output from the AI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔍 KUBERNETES POD DEBUG REPORT
Pod: broken-pod | Status: CrashLoopBackOff | Restarts: 4

🤖 AI DIAGNOSIS:

**ROOT CAUSE:**
The application is crashing immediately on startup due to a nil pointer
dereference — it's trying to use a variable that was never initialized,
likely a missing environment variable or config value.

**WHY THIS HAPPENS:**
Think of it like a recipe that calls for eggs, but your fridge is empty.
The app tries to read a config value that doesn't exist, gets back nothing,
and then crashes when it tries to use it. Kubernetes keeps restarting it,
hoping it'll work — it won't until the missing config is there.

**HOW TO FIX:**
1. Check what env vars the container expects:
   kubectl describe pod broken-pod | grep -A 10 'Environment'
2. Add the missing values to your deployment YAML under 'env:'
3. Or create a ConfigMap and reference it:
   kubectl create configmap app-config --from-literal=KEY=value
4. Reapply: kubectl apply -f your-deployment.yaml

💡 SUGGESTED NEXT STEP:
   kubectl logs broken-pod --previous
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That diagnosis took 12 seconds. The manual version of this took me 25 minutes the first time I hit a nil pointer crash in K8s.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;kubectl logs&lt;/code&gt; silently fails on crashed containers.&lt;/strong&gt; If a container has already exited, the default &lt;code&gt;kubectl logs&lt;/code&gt; command returns nothing — no error, just empty output. The fix is the &lt;code&gt;--previous&lt;/code&gt; flag, which fetches logs from the last terminated container. I found this out the hard way when the AI kept saying "No logs available" for CrashLoopBackOff pods. The code now tries normal logs first, then falls back automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CalledProcessError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Try to get previous logs if container already crashed
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kubectl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pod_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--previous&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--tail=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
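&lt;p&gt;Put together, the whole log-fetching routine looks roughly like this — a simplified sketch, not the exact repo code (&lt;code&gt;logs_command&lt;/code&gt; is a helper name I'm using for illustration; the real version adds timeouts and error formatting):&lt;/p&gt;

```python
import subprocess

def logs_command(pod: str, namespace: str = "default",
                 tail: int = 50, previous: bool = False) -> list:
    # Build the kubectl invocation; --previous reads the last
    # terminated container's logs instead of the current (empty) one.
    cmd = ["kubectl", "logs", pod, "-n", namespace, f"--tail={tail}"]
    if previous:
        cmd.append("--previous")
    return cmd

def get_pod_logs(pod: str, namespace: str = "default", tail: int = 50) -> str:
    # Try the live container first, then fall back to the crashed one.
    for previous in (False, True):
        result = subprocess.run(logs_command(pod, namespace, tail, previous),
                                capture_output=True, text=True)
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout
    return "No logs available"
```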



&lt;p&gt;&lt;strong&gt;Llama 3.2 sometimes repeats its own output.&lt;/strong&gt; Occasionally the model generates a full response and then starts over from the beginning, appending a duplicate. For a terminal tool this looks terrible. I had to write a suffix deduplication function that checks if the second half of the response is a repeat of any trailing segment of the first half, and strips it. Not something I expected to need when I started this project.&lt;/p&gt;
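&lt;p&gt;A minimal version of that idea — my illustrative sketch, not the exact code from the repo — compares the trailing characters against the text immediately before them and trims exact repeats:&lt;/p&gt;

```python
def remove_repeated_suffix(text: str) -> str:
    # If the last `size` characters exactly duplicate the `size`
    # characters right before them, drop the trailing copy and re-check.
    # The floor of 20 avoids stripping short accidental repeats.
    n = len(text)
    for size in range(n // 2, 20, -1):
        if text[n - size:] == text[n - 2 * size: n - size]:
            return remove_repeated_suffix(text[: n - size])
    return text
```

It's quadratic in the worst case, but model responses are a few kilobytes at most, so in practice it's instant.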

&lt;p&gt;&lt;strong&gt;Minikube hangs silently when the base image is cached.&lt;/strong&gt; On Windows with Docker Desktop, &lt;code&gt;minikube start&lt;/code&gt; can freeze indefinitely trying to pull an image that's already on disk. The fix — &lt;code&gt;minikube start --base-image=gcr.io/k8s-minikube/kicbase:v0.0.49&lt;/code&gt; — forces it to use the local cache. This isn't documented prominently and cost me an hour of debugging a debugging tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ThinkWithOps" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Demo video:&lt;/strong&gt; &lt;a href="https://youtu.be/LFF-987-uhA" rel="noopener noreferrer"&gt;https://youtu.be/LFF-987-uhA&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Prerequisites: Minikube running, Ollama + llama3.2 pulled&lt;/span&gt;
git clone https://github.com/ThinkWithOps/ai-devops-projects
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-devops-projects/02-ai-k8s-debugger
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Deploy a broken pod to test with&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; demo/broken-pod.yaml

&lt;span class="c"&gt;# Run the debugger&lt;/span&gt;
python src/k8s_debugger.py

&lt;span class="c"&gt;# Or target a specific pod&lt;/span&gt;
python src/k8s_debugger.py &lt;span class="nt"&gt;--pod&lt;/span&gt; broken-pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Project 2 in my AI+DevOps series — all tools run locally with Ollama, zero cloud costs. Project 1 was an AI Docker vulnerability scanner, Project 3 is an AI AWS Cost Detective. Links in my profile.&lt;/p&gt;




&lt;p&gt;What K8s error do you dread seeing the most? For me it's still &lt;code&gt;Pending&lt;/code&gt; with node affinity issues — the AI actually handles that one better than I expected. Drop yours in the comments.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built an AI AWS Cost Detective That Found $900/Year in Waste — Here's How</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Sun, 08 Mar 2026 15:22:25 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/i-built-an-ai-aws-cost-detective-that-found-900year-in-waste-heres-how-1ll4</link>
      <guid>https://dev.to/vijaya_bollu/i-built-an-ai-aws-cost-detective-that-found-900year-in-waste-heres-how-1ll4</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;AWS Cost Explorer shows you data. It doesn't tell you what to do about it. I was paying $127/month and knew I was wasting money but couldn't quickly identify where.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the AI Found
&lt;/h2&gt;

&lt;p&gt;Running the tool against my own account uncovered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;EC2 Waste:&lt;/strong&gt; A t3.small running 24/7 — used maybe 2 hours a day for testing. That's $45/month for 22 hours of idle compute every single day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EBS Volumes:&lt;/strong&gt; Three EBS volumes still attached to stopped instances. No data being written, no instance using them. $8/month evaporating for nothing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NAT Gateway:&lt;/strong&gt; A NAT Gateway from an old VPC setup I'd completely forgotten. Nothing routing through it. $12/month for a network door with no traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshots:&lt;/strong&gt; Automated snapshots from an RDS instance I deleted months ago. The database was gone but the snapshots kept accumulating — $10/month.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Total: $75/month = $900/year&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The tool chains three things together: boto3 fetches your AWS costs and resource counts, Python shapes the data, and Ollama (local Llama 3.2) turns it into actionable recommendations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS Cost Explorer API  →  Python (boto3)  →  Ollama  →  Structured report
     (billing data)        (resource counts)   (local LLM)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
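&lt;p&gt;The first hop in that pipeline is a single Cost Explorer call. Here's a sketch of the fetch-and-flatten step — &lt;code&gt;parse_cost_response&lt;/code&gt; is my illustrative name for it, and the boto3 call is shown as a comment since it needs live AWS credentials:&lt;/p&gt;

```python
# resp = boto3.client("ce").get_cost_and_usage(
#     TimePeriod={"Start": "2026-02-06", "End": "2026-03-08"},
#     Granularity="MONTHLY",
#     Metrics=["UnblendedCost"],
#     GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
# )

def parse_cost_response(resp: dict) -> list:
    # Flatten GetCostAndUsage output into [{'service': ..., 'cost': ...}],
    # summing across time periods and sorting by cost, highest first.
    totals = {}
    for period in resp.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    return sorted(
        ({"service": s, "cost": c} for s, c in totals.items()),
        key=lambda d: d["cost"], reverse=True,
    )
```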



&lt;p&gt;First it pulls 30 days of costs grouped by service, then counts your live resources (EC2 instances, EBS volumes, S3 buckets, RDS databases, Lambda functions). Both datasets go into the AI prompt together — because cost numbers without resource context give you vague answers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_costs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;total_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;top_services&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;services_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_services&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;resources_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are an AWS cost optimization expert.

COST SUMMARY (Last 30 Days):
Total Cost: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_cost&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Top Services by Cost:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;services_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Resources Currently Running:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resources_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Provide recommendations in this format:

**COST ANALYSIS:**
**HIDDEN COSTS DETECTED:**
**OPTIMIZATION RECOMMENDATIONS:**
**ESTIMATED SAVINGS:**
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask_ollama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The structured output format in the prompt is what makes the response actually parseable and useful — not just a wall of text.&lt;/p&gt;
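&lt;p&gt;The &lt;code&gt;ask_ollama&lt;/code&gt; helper that both tools share is just a thin wrapper over Ollama's local HTTP API. A minimal stdlib-only sketch (assuming Ollama's default port, 11434):&lt;/p&gt;

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(prompt: str, model: str = "llama3.2") -> dict:
    # stream=False makes Ollama return one JSON object
    # instead of a stream of token chunks
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt: str, model: str = "llama3.2") -> str:
    payload = json.dumps(build_request(prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```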




&lt;h2&gt;
  
  
  Setting Up Read-Only AWS Access
&lt;/h2&gt;

&lt;p&gt;This is important — the tool only needs read permissions. Here's the minimal IAM policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ce:GetCostAndUsage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ec2:Describe*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"s3:ListAllMyBuckets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"rds:Describe*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"lambda:List*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a dedicated IAM user (&lt;code&gt;cost-detective&lt;/code&gt;), attach this policy, generate an access key, and run &lt;code&gt;aws configure&lt;/code&gt;. The tool never writes anything to your account — worst case it reads data you didn't expect it to.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;Two things I didn't expect:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Cost Explorer API is almost free — but not quite.&lt;/strong&gt; I assumed querying billing data would be expensive. It's actually $0.01 per API request, so a full run of this tool costs about a penny.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS returns cost values as Python &lt;code&gt;Decimal&lt;/code&gt;, not &lt;code&gt;float&lt;/code&gt;.&lt;/strong&gt; This one is a quiet killer — &lt;code&gt;json.dumps()&lt;/code&gt; will crash when you try to save a report because the standard JSON encoder doesn't handle &lt;code&gt;Decimal&lt;/code&gt;. Had to write a custom encoder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DecimalEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONEncoder&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DecimalEncoder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The traceback just says &lt;code&gt;Object of type Decimal is not JSON serializable&lt;/code&gt; — it doesn't tell you which field is the problem, or why boto3 handed you a &lt;code&gt;Decimal&lt;/code&gt; in the first place.&lt;/p&gt;
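&lt;p&gt;Using the encoder is one extra argument to &lt;code&gt;json.dumps&lt;/code&gt;. A quick repro of both the failure and the fix:&lt;/p&gt;

```python
import json
from decimal import Decimal

class DecimalEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Decimal):
            return float(obj)
        return super().default(obj)

report = {"service": "AmazonEC2", "cost": Decimal("45.17")}

# json.dumps(report)  -> TypeError: Object of type Decimal is not JSON serializable
print(json.dumps(report, cls=DecimalEncoder))  # {"service": "AmazonEC2", "cost": 45.17}
```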




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;Being honest about what this doesn't handle well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data is 24–48 hours delayed.&lt;/strong&gt; Cost Explorer isn't real-time. If you just spun up a resource today, it won't show up yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single region by default.&lt;/strong&gt; Resource counts only scan &lt;code&gt;us-east-1&lt;/code&gt;. Multi-region setups need extra config.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Doesn't catch everything.&lt;/strong&gt; Very small charges (under $0.01) are filtered out. Some hidden costs — like cross-AZ data transfer — aren't obvious from the Cost Explorer groupings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI recommendations need verification.&lt;/strong&gt; The tool identifies patterns and suggests actions, but you should always sanity-check before terminating anything. I almost deleted an EBS volume that was actually still in use by a snapshot restore.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ThinkWithOps" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Demo video:&lt;/strong&gt; &lt;a href="https://youtu.be/rg1Vnjjt9xk" rel="noopener noreferrer"&gt;https://youtu.be/rg1Vnjjt9xk&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ThinkWithOps/ai-devops-projects
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-devops-projects/03-ai-aws-cost-detective
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python src/aws_cost_detective.py

&lt;span class="c"&gt;# Save report to JSON&lt;/span&gt;
python src/aws_cost_detective.py &lt;span class="nt"&gt;--output&lt;/span&gt; report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Project 3 in my AI+DevOps series. Project 4 is an AI GitHub Actions Auto-Healer — it reads failing CI logs and suggests fixes. Link in my profile.&lt;/p&gt;




&lt;p&gt;What's the most unexpected thing hiding in your AWS bill? I'd have never noticed that NAT Gateway without the tool surfacing it. Drop yours in the comments.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>ai</category>
      <category>cloud</category>
    </item>
    <item>
      <title>How I Built a Local AI Docker Vulnerability Scanner (No API Costs, No Cloud)</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Fri, 06 Mar 2026 19:04:47 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/how-i-built-a-local-ai-docker-vulnerability-scanner-no-api-costs-no-cloud-3ef7</link>
      <guid>https://dev.to/vijaya_bollu/how-i-built-a-local-ai-docker-vulnerability-scanner-no-api-costs-no-cloud-3ef7</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a Local AI Docker Vulnerability Scanner (No API Costs, No Cloud)
&lt;/h1&gt;




&lt;h2&gt;
  
  
  The Problem with Trivy Output
&lt;/h2&gt;

&lt;p&gt;Running Trivy gives you a wall of CVE numbers. Most developers copy-paste them into Google and spend 20 minutes figuring out if each one actually matters for their use case.&lt;/p&gt;

&lt;p&gt;I built a tool that fixes this.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A local AI wrapper around Trivy that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scans any Docker image&lt;/li&gt;
&lt;li&gt;Takes the raw CVE output&lt;/li&gt;
&lt;li&gt;Feeds it to Ollama (local LLM — no API costs)&lt;/li&gt;
&lt;li&gt;Returns plain English explanations + specific fix recommendations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Interesting Finding
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nginx:1.27-alpine: 14 vulnerabilities
nginx:alpine:       3 vulnerabilities
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same base image family — yet the pinned version had nearly 5x the CVEs (14 vs 3). The AI caught this pattern and recommended comparing tag variants automatically.&lt;/p&gt;
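&lt;p&gt;Once vulnerabilities are extracted, comparing image variants is trivial — for example, a quick severity count per scan (a small helper sketch of mine, not code from the repo):&lt;/p&gt;

```python
from collections import Counter

def severity_summary(vulns: list) -> dict:
    # vulns: list of dicts with a 'severity' key, the shape produced
    # by the extraction step in the code walkthrough below
    return dict(Counter(v["severity"] for v in vulns))
```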




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.11&lt;/li&gt;
&lt;li&gt;Trivy (vulnerability scanner)&lt;/li&gt;
&lt;li&gt;Ollama + Llama 3.2 (local LLM)&lt;/li&gt;
&lt;li&gt;Zero cloud dependencies&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How It Works (Code Walkthrough)
&lt;/h2&gt;

&lt;p&gt;The scanner has three moving parts: Trivy does the heavy lifting of CVE detection, Python orchestrates everything, and Ollama explains what it all means.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Scan with Trivy and parse the JSON:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scan_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trivy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HIGH,CRITICAL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_name&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_vulnerabilities&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scan_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;vulnerabilities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;seen_vulns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# deduplicate by CVE ID
&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scan_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;vuln&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Vulnerabilities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
            &lt;span class="n"&gt;vuln_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VulnerabilityID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vuln_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen_vulns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="n"&gt;seen_vulns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vuln_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;vulnerabilities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vuln_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;package&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PkgName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fixed_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FixedVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Not available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;vulnerabilities&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2 — Send each CVE to Ollama for a plain English explanation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;explain_vulnerability&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a security expert explaining vulnerabilities to developers.

Vulnerability Details:
- ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
- Package: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;package&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (version &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)
- Severity: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
- Title: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Explain in 2-3 sentences:
1. What this vulnerability means in simple terms
2. Why it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s dangerous
3. How to fix it (fixed version: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed_version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)

Keep it concise and actionable. Use analogies if helpful.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ollama_host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
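&lt;p&gt;One thing the snippet above doesn't show: if the Ollama daemon isn't running, &lt;code&gt;requests.post&lt;/code&gt; raises and the whole scan dies mid-loop. A minimal guard I'd wrap around it (hypothetical helper, not part of the scanner itself) falls back to a plain one-liner instead:&lt;/p&gt;

```python
import requests

def safe_explain(scanner, vuln):
    """Hypothetical guard around explain_vulnerability: if the local Ollama
    server is unreachable (or the 60-second timeout fires), return a plain
    fallback line instead of letting the whole scan crash."""
    try:
        return scanner.explain_vulnerability(vuln)
    except requests.exceptions.RequestException:
        return (f"(AI explanation unavailable) {vuln['severity']} issue in "
                f"{vuln['package']}; upgrade to {vuln['fixed_version']}.")
```

&lt;p&gt;&lt;code&gt;RequestException&lt;/code&gt; is the base class for both connection errors and timeouts, so one except clause covers both failure modes.&lt;/p&gt;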



&lt;p&gt;&lt;strong&gt;Step 3 — Generate an overall summary with structured output:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The summary prompt forces Ollama into a key-value format so we can parse it reliably and build a comparison command on the fly — more on that in the next section.&lt;/p&gt;
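&lt;p&gt;For illustration, a structured summary prompt along these lines (my reconstruction, not the scanner's verbatim prompt text) could be built like this:&lt;/p&gt;

```python
def build_summary_prompt(image, counts):
    """Illustrative sketch of a structured-output prompt (a reconstruction,
    not the scanner's exact wording). Demanding KEY: value pairs and giving
    the model a literal example keeps its reply machine-parseable."""
    return f"""Summarize the scan of {image}: {counts.get('CRITICAL', 0)} critical, {counts.get('HIGH', 0)} high findings.
Respond ONLY with KEY: value pairs, one per line, exactly like this example:
SECURITY_POSTURE: poor
VARIANTS_TO_TEST: nginx:alpine, nginx:stable
No prose before or after the pairs."""
```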




&lt;h2&gt;
  
  
  The Trickiest Part
&lt;/h2&gt;

&lt;p&gt;Getting Ollama to return structured output consistently was harder than expected. Free-form responses were great for individual CVE explanations, but the security summary needed to be &lt;em&gt;parseable&lt;/em&gt; — I needed specific fields like &lt;code&gt;SECURITY_POSTURE&lt;/code&gt; and &lt;code&gt;VARIANTS_TO_TEST&lt;/code&gt; to programmatically build the comparison command.&lt;/p&gt;

&lt;p&gt;The solution was strict prompt formatting: I told the model to respond in &lt;code&gt;KEY: value&lt;/code&gt; pairs and gave it an explicit example. Then I split each line on &lt;code&gt;:&lt;/code&gt; and built a dict. When parsing failed I fell back to a hardcoded comparison command. The other challenge was Llama 3.2 sometimes repeating itself — I solved that with a deduplication pass that checks for repeated section headers (&lt;code&gt;**1.&lt;/code&gt;, &lt;code&gt;**Vulnerability&lt;/code&gt;, etc.) and drops them before printing.&lt;/p&gt;
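&lt;p&gt;The parse-with-fallback and dedup passes described above can be sketched like this (the fallback command and helper names are mine; the key names and header markers come from the description):&lt;/p&gt;

```python
FALLBACK_COMMAND = "python src/docker_scanner.py nginx:alpine"  # assumed fallback

def parse_summary(text):
    """Split each 'KEY: value' line at the FIRST colon, so values that
    themselves contain colons (image tags like nginx:alpine) survive."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def build_comparison_command(summary_text):
    """Use VARIANTS_TO_TEST when the model cooperated; otherwise fall back."""
    variants = parse_summary(summary_text).get("VARIANTS_TO_TEST")
    if not variants:
        return FALLBACK_COMMAND  # parsing failed: hardcoded fallback
    return "python src/docker_scanner.py " + " ".join(
        v.strip() for v in variants.split(","))

def drop_repeats(text, markers=("**1.", "**Vulnerability")):
    """Dedup pass: drop section-header lines that have already appeared,
    which is how Llama 3.2's self-repetition shows up."""
    seen, kept = set(), []
    for line in text.splitlines():
        if line.startswith(markers) and line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)
```

&lt;p&gt;&lt;code&gt;str.partition&lt;/code&gt; rather than &lt;code&gt;str.split&lt;/code&gt; is the load-bearing choice here: splitting on every colon would mangle image tags in the values.&lt;/p&gt;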




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; — Raw Trivy output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CVE-2024-1234 (CRITICAL)
Package: openssl 1.1.1k
Description: Use-after-free in X509_verify_cert function
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;😕 &lt;em&gt;"What does this mean? Do I need to care about this?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt; — AI-enhanced output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🤖 AI Explanation:
This is like leaving your house key under the doormat.
OpenSSL handles your HTTPS connections, and this bug lets
attackers potentially decrypt traffic. Fix: update your
Dockerfile base image to get openssl 1.1.1w or later.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ &lt;em&gt;"Got it, I'll update the base image today."&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Avg scan time&lt;/td&gt;
&lt;td&gt;15–30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI explanation per CVE&lt;/td&gt;
&lt;td&gt;~3 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud API cost&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Images tested&lt;/td&gt;
&lt;td&gt;50+ (nginx, node, python, ubuntu)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Manual CVE triage that used to take 20+ minutes per image now takes under a minute for the top 5 vulnerabilities.&lt;/p&gt;
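&lt;p&gt;Picking a "top 5" boils down to ranking findings by severity before sending anything to the model. A sketch (the rank table is my assumption, using Trivy's standard severity labels, not necessarily the scanner's exact logic):&lt;/p&gt;

```python
# Assumed severity ordering; these are Trivy's standard severity labels.
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3, "UNKNOWN": 4}

def top_vulnerabilities(vulns, n=5):
    """Return the n most severe findings, worst first; unknown labels sort last."""
    return sorted(vulns, key=lambda v: SEVERITY_RANK.get(v.get("severity"), 5))[:n]
```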




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ThinkWithOps/ai-devops-projects/tree/main/01-ai-docker-scanner" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps/ai-devops-projects/tree/main/01-ai-docker-scanner&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Full demo video:&lt;/strong&gt; &lt;a href="https://youtu.be/J6fmU6t9jUU" rel="noopener noreferrer"&gt;https://youtu.be/J6fmU6t9jUU&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Prerequisites: Docker, Trivy, Ollama + llama3.2 pulled&lt;/span&gt;
git clone https://github.com/ThinkWithOps/ai-devops-projects.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-devops-projects/01-ai-docker-scanner
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python src/docker_scanner.py nginx:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is Project 1 in my AI+DevOps series. Next I built an AI K8s Pod Debugger — link in my profile.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>devops</category>
      <category>ai</category>
      <category>security</category>
    </item>
  </channel>
</rss>
