<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vijaya Rajeev Bollu</title>
    <description>The latest articles on DEV Community by Vijaya Rajeev Bollu (@vijaya_bollu).</description>
    <link>https://dev.to/vijaya_bollu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3808851%2F28130c4b-5e4a-4938-94b6-370a13c89737.png</url>
      <title>DEV Community: Vijaya Rajeev Bollu</title>
      <link>https://dev.to/vijaya_bollu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vijaya_bollu"/>
    <language>en</language>
    <item>
      <title>I Built an MCP Server That Lets Claude Control My Kubernetes Cluster</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Thu, 09 Apr 2026 13:52:04 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/i-built-an-mcp-server-that-lets-claude-control-my-kubernetes-cluster-mfp</link>
      <guid>https://dev.to/vijaya_bollu/i-built-an-mcp-server-that-lets-claude-control-my-kubernetes-cluster-mfp</guid>
      <description>&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;Every DevOps incident I've dealt with follows the same pattern.&lt;/p&gt;

&lt;p&gt;Something breaks. I open five terminal tabs. I run kubectl, check the AWS console, tail Docker logs, and manually piece together what went wrong — all while the clock is ticking.&lt;/p&gt;

&lt;p&gt;The tools all exist. They're just completely disconnected from each other.&lt;/p&gt;

&lt;p&gt;I wanted one interface where I could ask a question and have something reach into all of those tools at once and explain what it found. So I built an MCP server that gives Claude Desktop direct access to my real infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The server runs locally on my laptop. Claude Desktop launches it on startup and communicates over stdio (standard input/output).&lt;/p&gt;

&lt;p&gt;When I type a question, Claude reads the 14 registered tool descriptions, decides which ones to call, and sends a JSON request to my server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"call_tool"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_failing_pods"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"namespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;}}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My server executes against real infrastructure and returns structured text. Claude synthesises everything into a plain English response.&lt;/p&gt;

&lt;p&gt;The key thing: &lt;strong&gt;Claude never connects to AWS or Kubernetes directly.&lt;/strong&gt; It only talks to my Python server. My server uses local credentials (&lt;code&gt;~/.aws/credentials&lt;/code&gt;, &lt;code&gt;~/.kube/config&lt;/code&gt;) to reach the real systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;14 tools across 4 categories:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes:&lt;/strong&gt; pod status, failing pods, logs, restart deployment, describe pod&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS:&lt;/strong&gt; cost report, EC2 instances, CloudWatch alarms, S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker:&lt;/strong&gt; list containers, container logs, restart container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform:&lt;/strong&gt; run plan, check state&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Demo / Results
&lt;/h2&gt;

&lt;p&gt;I typed: &lt;em&gt;"Give me a health report of my infrastructure"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claude called &lt;code&gt;get_pod_status&lt;/code&gt;, &lt;code&gt;get_aws_cost&lt;/code&gt;, &lt;code&gt;get_cloudwatch_alarms&lt;/code&gt;, and &lt;code&gt;list_containers&lt;/code&gt; simultaneously and returned a unified summary — Kubernetes pods, AWS costs, Docker containers, CloudWatch alarms — all in one response.&lt;/p&gt;

&lt;p&gt;One pod was in CrashLoopBackOff with 14 restarts. I typed:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Pull the logs from broken-app and tell me why it's crashing"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claude diagnosed it instantly: a busybox container with no long-running process exits as soon as it starts, so Kubernetes restarts it over and over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; 45 minutes of manual investigation across 5 tabs.&lt;br&gt;
&lt;strong&gt;After:&lt;/strong&gt; 10 seconds, one chat window.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The tool description is everything.&lt;/strong&gt;&lt;br&gt;
Claude picks tools based on their description, not their name. A vague description causes Claude to call the wrong tool or skip it entirely. My first version had &lt;code&gt;get_failing_pods&lt;/code&gt; described as &lt;em&gt;"Get pods"&lt;/em&gt;. Claude never called it. I changed it to &lt;em&gt;"List only pods in Failed, CrashLoopBackOff, OOMKilled, or Error state"&lt;/em&gt; and it worked perfectly every time. The description is the API contract between you and Claude.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Parallel tool calls are not guaranteed — they're emergent.&lt;/strong&gt;&lt;br&gt;
I didn't write any code to call tools in parallel. Claude decided to do that on its own when the question was broad enough ("health report"). For narrow questions ("which pods are failing?") Claude called only one tool. The parallelism comes from Claude's reasoning, not from your server. This means your tool descriptions need to be distinct enough that Claude can make that judgment call correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Sync SDKs in an async server are a silent performance killer.&lt;/strong&gt;&lt;br&gt;
boto3, the Kubernetes Python client, and the Docker SDK are all blocking. If you call them directly inside an async MCP handler, you stall the entire event loop. Every tool call wraps the SDK call in &lt;code&gt;asyncio.to_thread()&lt;/code&gt;. Without this, Claude would time out waiting for responses when multiple tools were called simultaneously.&lt;/p&gt;
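&lt;p&gt;A minimal sketch of the pattern, where &lt;code&gt;fetch_pods&lt;/code&gt; is a hypothetical stand-in for any blocking SDK call:&lt;/p&gt;

```python
import asyncio
import time

def fetch_pods(namespace: str) -> str:
    # Stand-in for a blocking SDK call (e.g. the Kubernetes client).
    time.sleep(0.2)
    return f"pods in {namespace}"

async def call_tool(namespace: str) -> str:
    # Run the blocking call in a worker thread so the event loop
    # stays free while several tools are called at once.
    return await asyncio.to_thread(fetch_pods, namespace)

async def main() -> list:
    # Two concurrent calls finish in roughly 0.2s instead of 0.4s.
    return await asyncio.gather(call_tool("default"), call_tool("kube-system"))
```

&lt;p&gt;Calling &lt;code&gt;fetch_pods&lt;/code&gt; directly inside the handler would freeze every other in-flight tool call for the duration of the request.&lt;/p&gt;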

&lt;p&gt;&lt;strong&gt;4. Error handling scope matters more than you think.&lt;/strong&gt;&lt;br&gt;
My first version handled SDK failures at module level, at import time. If boto3 failed to connect to AWS at startup, the entire server crashed — Kubernetes and Docker tools went down too. The fix: catch errors inside each tool call, not at import time. One broken service should never take down the whole server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The MCP protocol is simpler than it looks.&lt;/strong&gt;&lt;br&gt;
There are really only two things you implement: &lt;code&gt;list_tools()&lt;/code&gt; (return tool names + descriptions + input schema) and &lt;code&gt;call_tool()&lt;/code&gt; (receive tool name + arguments, return text). Everything else is handled by the MCP SDK. Building a tool is 20 lines of Python.&lt;/p&gt;
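&lt;p&gt;A framework-free sketch of those two handlers (the real MCP SDK wires them to JSON-RPC over stdio for you; the schema shape here is illustrative):&lt;/p&gt;

```python
# Registry of tools; the description is what Claude actually reads.
TOOLS = {
    "get_failing_pods": {
        "description": "List only pods in Failed, CrashLoopBackOff, "
                       "OOMKilled, or Error state",
        "input_schema": {"namespace": "string"},
        "handler": lambda namespace="default": f"no failing pods in {namespace}",
    },
}

def list_tools() -> list:
    # Return names, descriptions, and input schemas for Claude to read.
    return [
        {"name": n, "description": t["description"], "input_schema": t["input_schema"]}
        for n, t in TOOLS.items()
    ]

def call_tool(name: str, arguments: dict) -> str:
    # Receive a tool name plus arguments, return text.
    return TOOLS[name]["handler"](**arguments)
```
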


&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Single EC2 instance.&lt;/strong&gt; The demo infrastructure runs on one t3.small. In a real multi-node Kubernetes cluster, the pod information would be spread across namespaces and nodes. The tools work against any kubeconfig — but the demo is scoped to what's in one cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read-heavy, not write-heavy.&lt;/strong&gt; Most tools read state — pod status, AWS costs, Docker containers. The only write operation is &lt;code&gt;restart_deployment&lt;/code&gt;. There's no tool to edit a deployment YAML, push a config change, or create resources. Giving Claude write access to production infrastructure is a separate (and more serious) conversation about guardrails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude restarts but can't fix root causes.&lt;/strong&gt; If a pod is in CrashLoopBackOff because of a bad container image, Claude can restart it — but it'll crash again immediately. Claude diagnosed the issue correctly and warned me, but the actual fix (updating the deployment YAML and pushing a correct image) is outside the scope of the tools I gave it. The tools match what I was comfortable giving an AI access to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No authentication on the MCP server.&lt;/strong&gt; The server runs locally over stdio — there's no network exposure. But if you ever expose this over a network (e.g., run the server on EC2 and connect Claude Desktop remotely), you'd need to add authentication. Don't do that without thinking it through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS credentials are local.&lt;/strong&gt; The server uses whatever credentials are in &lt;code&gt;~/.aws/credentials&lt;/code&gt;. If you're running this on a shared machine, those credentials are accessible to anything running as your user. Use IAM roles with minimal permissions scoped to read-only for Cost Explorer, EC2, CloudWatch, and S3.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/ThinkWithOps/ai-devops-systems-lab.git" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps/ai-devops-systems-lab.git&lt;/a&gt;&lt;br&gt;
Demo: &lt;a href="https://youtu.be/2XsNcKSa28s" rel="noopener noreferrer"&gt;https://youtu.be/2XsNcKSa28s&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ThinkWithOps/ai-devops-systems-lab.git
&lt;span class="nb"&gt;cd &lt;/span&gt;03-ai-devops-mcp-server
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Test with mock data — no real infra needed&lt;/span&gt;
&lt;span class="nv"&gt;KUBE_MOCK_MODE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true &lt;/span&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the MCP server to your Claude Desktop config and restart. Full setup guide in the README.&lt;/p&gt;
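&lt;p&gt;For reference, the Claude Desktop config entry lives under the &lt;code&gt;mcpServers&lt;/code&gt; key; the server name and path below are placeholders:&lt;/p&gt;

```json
{
  "mcpServers": {
    "devops": {
      "command": "python",
      "args": ["/path/to/server.py"]
    }
  }
}
```
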

&lt;p&gt;This is Project 03 of the ThinkWithOps AI + DevOps series.&lt;/p&gt;




&lt;p&gt;What would you control first if Claude had access to your infrastructure? Drop it in the comments.&lt;/p&gt;




</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built an AI That Understands Any GitHub Repo Using LangChain and ChromaDB</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Wed, 01 Apr 2026 16:54:53 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/-i-built-an-ai-that-understands-any-github-repo-using-langchain-and-chromadb-m30</link>
      <guid>https://dev.to/vijaya_bollu/-i-built-an-ai-that-understands-any-github-repo-using-langchain-and-chromadb-m30</guid>
      <description>&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;Every time I join a new codebase, the first few days are the same: open the repo, stare at folders, try to figure out which service does what, read half a file, get interrupted, lose context, start over.&lt;/p&gt;

&lt;p&gt;GitHub's built-in search is keyword-only. ChatGPT has never seen your repo. Teammates are busy. Documentation is either missing or out of date.&lt;/p&gt;

&lt;p&gt;I wanted a tool that could answer &lt;em&gt;"how does checkout work?"&lt;/em&gt; from the actual code — not from training data, not from docs, but from the real source files.&lt;/p&gt;

&lt;p&gt;So I built one.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The system is built around a RAG (Retrieval-Augmented Generation) pipeline. The idea: instead of asking an LLM to answer from memory, you first retrieve the most relevant code chunks, then ask the LLM to answer using only those chunks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingest flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clone the GitHub repo locally&lt;/li&gt;
&lt;li&gt;Walk every file and split into overlapping chunks (~500 tokens, 50-token overlap)&lt;/li&gt;
&lt;li&gt;Convert each chunk to a vector embedding using &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; (Sentence Transformers — local, free)&lt;/li&gt;
&lt;li&gt;Store embeddings + metadata in ChromaDB&lt;/li&gt;
&lt;/ol&gt;
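&lt;p&gt;The overlap logic can be sketched in a few lines; this character-based version is a simplified stand-in for the token-based splitter:&lt;/p&gt;

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list:
    # Consecutive chunks share `overlap` characters, so context that
    # spans a boundary appears in both chunks instead of being cut.
    step = size - overlap
    chunks = []
    start = 0
    while len(text) > start:
        chunks.append(text[start:start + size])
        start += step
    return chunks
```
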

&lt;p&gt;&lt;strong&gt;Query flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Embed the user's question with the same model&lt;/li&gt;
&lt;li&gt;ChromaDB cosine similarity search → top-5 most relevant chunks&lt;/li&gt;
&lt;li&gt;Inject chunks into a LangChain prompt&lt;/li&gt;
&lt;li&gt;LLM generates an answer with source file citations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The frontend is a Next.js split-pane layout — file tree on the left, chat on the right. The ingest endpoint streams progress via Server-Sent Events so the user sees live updates instead of a spinner.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# LangChain RAG chain (simplified)
&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;qa_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RetrievalQA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_chain_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chain_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stuff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_source_documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;qa_chain&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM only ever sees 5 chunks — not the whole repo. That keeps answers fast and grounded.&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo / Results
&lt;/h2&gt;

&lt;p&gt;I indexed the ShopFlow ecommerce demo app — ~847 files across 12 services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; "How does the checkout flow work?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The checkout flow works as follows:
1. Cart validation — CartService checks inventory availability (src/services/cart.py:142)
2. Payment processing — PaymentService calls Stripe API with order total (src/services/payment.py:89)
3. Order creation — OrderService writes to PostgreSQL with status PENDING (src/models/order.py:67)
4. Inventory update — InventoryService decrements stock counts (src/services/inventory.py:203)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Index time: ~30 seconds. Answer latency: ~3 seconds. API cost: $0 (self-hosted on AWS EC2).&lt;/p&gt;




&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. FastAPI path parameters break on slashes&lt;/strong&gt;&lt;br&gt;
Repo names like &lt;code&gt;owner/repo&lt;/code&gt; contain a &lt;code&gt;/&lt;/code&gt; — FastAPI treats it as a path separator and routes to the wrong endpoint. Fix: declare the parameter as &lt;code&gt;{repo_name:path}&lt;/code&gt;. One character. Found it via a 404 in production after deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. TypeScript types drift silently from backend SSE events&lt;/strong&gt;&lt;br&gt;
The backend was emitting &lt;code&gt;file&lt;/code&gt;, &lt;code&gt;file_path&lt;/code&gt;, and &lt;code&gt;indexed_at&lt;/code&gt; fields in the SSE stream that the frontend TypeScript interface didn't declare. No error in local dev — only failed during &lt;code&gt;next build&lt;/code&gt; inside Docker. A shared OpenAPI-generated type contract would have caught this at development time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Docker layer caching hides real code&lt;/strong&gt;&lt;br&gt;
After pushing a fix, the Docker build was still serving old code because it cached the &lt;code&gt;COPY . .&lt;/code&gt; layer. Always run &lt;code&gt;docker-compose build --no-cache&lt;/code&gt; when a fix isn't appearing after git pull.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Chunk overlap matters more than chunk size&lt;/strong&gt;&lt;br&gt;
I initially had no overlap between chunks. The AI would give incomplete answers for questions that spanned function boundaries — the relevant context was split across two chunks that were never returned together. Adding 50-token overlap fixed most of these cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ThinkWithOps/ai-devops-systems-lab
&lt;span class="nb"&gt;cd &lt;/span&gt;projects/02-ai-github-repo-explainer
&lt;span class="nb"&gt;cp &lt;/span&gt;backend/.env.example backend/.env
&lt;span class="c"&gt;# Add your GROQ_API_KEY or leave blank to use Ollama&lt;/span&gt;
docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://&amp;lt;your-ec2-ip&amp;gt;:3000&lt;/code&gt;, paste any public GitHub URL, and start asking questions.&lt;/p&gt;

&lt;p&gt;🔗 GitHub: &lt;a href="https://github.com/ThinkWithOps/ai-devops-systems-lab" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps/ai-devops-systems-lab&lt;/a&gt;&lt;br&gt;
📺 Full build walkthrough: &lt;a href="https://youtu.be/a6376K9Lm00" rel="noopener noreferrer"&gt;https://youtu.be/a6376K9Lm00&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is Project 02 of a 30-project AI + DevOps series. Each project is a real deployed system — not a tutorial snippet.&lt;/p&gt;




&lt;p&gt;What's the most confusing codebase you've ever had to onboard into? What would you have asked an AI first?&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>chromadb</category>
      <category>devops</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built a DevOps Chatbot That Checks My Live App for Failures — Here's How It Works</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Tue, 31 Mar 2026 16:50:21 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/-i-built-a-devops-chatbot-that-checks-my-live-app-for-failures-heres-how-it-works-h78</link>
      <guid>https://dev.to/vijaya_bollu/-i-built-a-devops-chatbot-that-checks-my-live-app-for-failures-heres-how-it-works-h78</guid>
      <description>&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;Every DevOps engineer has had the 2am moment. Something is broken. You don't know what. You SSH in, check logs, Google the error, open five tabs, still nothing clear. Thirty minutes later you find it — a config flag someone changed, a slow query, a dependency timing out.&lt;/p&gt;

&lt;p&gt;I wanted to ask an AI instead. Not a generic ChatGPT that gives you textbook answers, but an AI connected to my actual running system that can check what's broken right now.&lt;/p&gt;

&lt;p&gt;So I built the AI DevOps Copilot — Project 01 of my 30-project AI + DevOps YouTube series.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The system has four layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. LangChain agent (the brain)&lt;/strong&gt;&lt;br&gt;
Uses &lt;code&gt;create_tool_calling_agent&lt;/code&gt; with Llama 3.1 via Groq. When you ask a question, the agent decides whether to answer from knowledge or call a tool. General DevOps questions → instant answer. Questions about the live app → tool call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. ChromaDB RAG (the knowledge base)&lt;/strong&gt;&lt;br&gt;
Nine runbook documents embedded into a vector database — Docker troubleshooting, AWS debugging, Kubernetes, Terraform, Linux performance, security, and more. The agent searches these for context when answering general questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Tool layer (the live connection)&lt;/strong&gt;&lt;br&gt;
Four tools: &lt;code&gt;restaurant_monitor&lt;/code&gt; (hits the live restaurant app API), &lt;code&gt;log_search&lt;/code&gt; (searches application logs), &lt;code&gt;github_search&lt;/code&gt; (searches repos), and &lt;code&gt;devops_docs&lt;/code&gt; (searches the runbook vector store).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. FastAPI + SSE streaming (the interface)&lt;/strong&gt;&lt;br&gt;
The agent runs in a thread executor, streaming tokens back through an asyncio Queue with a sentinel pattern. The Next.js frontend connects via Server-Sent Events and renders each token as it arrives.&lt;/p&gt;
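&lt;p&gt;The queue-and-sentinel pattern, reduced to a self-contained sketch (the real version yields SSE events instead of collecting a list):&lt;/p&gt;

```python
import asyncio

SENTINEL = None  # pushed by the worker when the agent finishes

async def stream_tokens(run_agent) -> list:
    # The blocking agent runs in a worker thread and hands each token
    # to the event loop through a Queue; the consumer drains the queue
    # until it sees the sentinel.
    queue: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()

    def emit(token):
        loop.call_soon_threadsafe(queue.put_nowait, token)

    def worker():
        run_agent(emit)
        emit(SENTINEL)

    task = asyncio.create_task(asyncio.to_thread(worker))
    tokens = []
    while True:
        token = await queue.get()
        if token is SENTINEL:
            break
        tokens.append(token)
    await task
    return tokens
```
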

&lt;p&gt;Data flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User question → FastAPI → LangChain agent → tool call or RAG search → streamed response → Next.js UI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Demo / Results
&lt;/h2&gt;

&lt;p&gt;I deployed a chaos engineering demo app — a restaurant application with injectable failure modes. One failure mode: &lt;code&gt;slow_menu&lt;/code&gt;, which adds a 2-second artificial delay to &lt;code&gt;GET /api/menu&lt;/code&gt;.&lt;/p&gt;
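&lt;p&gt;The failure-toggle idea reduces to a tiny sketch (the demo app's real toggle injects a 2-second delay; this one uses 0.1s to stay fast):&lt;/p&gt;

```python
import time

# Toggleable failure modes, as in the demo app; names are illustrative.
FAILURES = {"slow_menu": False}

def get_menu() -> dict:
    start = time.monotonic()
    if FAILURES["slow_menu"]:
        time.sleep(0.1)  # artificial delay injected by the toggle
    return {"items": ["pizza", "pasta"], "elapsed": time.monotonic() - start}
```

&lt;p&gt;Flipping the toggle off makes the endpoint fast again, which is exactly what the copilot verifies after the fix.&lt;/p&gt;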

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Menu page spinning, customers waiting. I ask the copilot: "What is wrong with the restaurant app right now?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tool called: restaurant_monitor → action: failures
API response: slow_menu: ACTIVE (2s delay on /api/menu)

AI answer:
The slow_menu failure mode is currently ACTIVE.
This injects a 2-second delay into the Menu API.

Fix:
1. Operator dashboard → Failures → Disable slow_menu
2. Or: POST /api/admin/failures/slow_menu/disable
3. Verify: menu should load in &amp;lt;100ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Time to diagnosis: ~8 seconds.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After disabling the toggle, the menu loaded instantly. The entire incident lifecycle — detection, diagnosis, fix, verification — took under 2 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. &lt;code&gt;create_react_agent&lt;/code&gt; kept hitting iteration limits.&lt;/strong&gt;&lt;br&gt;
I started with the ReAct text-parsing agent. Llama 3.1-8b kept failing the "Thought/Action/Observation" format, exhausting 10 iterations on simple questions. Switching to &lt;code&gt;create_tool_calling_agent&lt;/code&gt; — which uses native LLM function calling — fixed this completely. The model knows how to call functions; it doesn't know how to produce exact ReAct formatting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Tool call chunks were streaming to the user as garbage text.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;on_llm_new_token&lt;/code&gt; in LangChain fires for every token — including internal tool call encodings like &lt;code&gt;{"name": "restaurant_monitor", "arguments":&lt;/code&gt;. Fixed by checking &lt;code&gt;chunk.tool_call_chunks&lt;/code&gt; and skipping those tokens. Without this, users see raw JSON blobs in the chat.&lt;/p&gt;
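&lt;p&gt;The filtering boils down to one check; &lt;code&gt;Chunk&lt;/code&gt; here is a minimal stand-in for LangChain's streamed chunk object:&lt;/p&gt;

```python
class Chunk:
    # Minimal stand-in for a streamed LLM chunk.
    def __init__(self, text, tool_call_chunks=()):
        self.text = text
        self.tool_call_chunks = list(tool_call_chunks)

def visible_tokens(chunks) -> str:
    # Skip any chunk carrying tool-call fragments so the user never
    # sees raw JSON function-call encodings in the chat.
    return "".join(c.text for c in chunks if not c.tool_call_chunks)
```
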

&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;host.docker.internal&lt;/code&gt; doesn't resolve on EC2 Linux.&lt;/strong&gt;&lt;br&gt;
Added &lt;code&gt;extra_hosts: - "host.docker.internal:host-gateway"&lt;/code&gt; to Docker Compose — still failed. Fixed by using the EC2 private IP directly: &lt;code&gt;RESTAURANT_API_URL=http://172.31.90.69:8010&lt;/code&gt; in the &lt;code&gt;.env&lt;/code&gt; file. Simple, obvious in hindsight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Groq free tier is 6,000 tokens per minute.&lt;/strong&gt;&lt;br&gt;
With &lt;code&gt;llama-3.3-70b-versatile&lt;/code&gt;, three questions hit the limit. Switched to &lt;code&gt;llama-3.1-8b-instant&lt;/code&gt; — much faster, lower token usage, still very capable for DevOps Q&amp;amp;A. For a demo or portfolio project, this is the right call.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;📁 GitHub: &lt;a href="https://github.com/ThinkWithOps/ai-devops-systems-lab/tree/main/projects/01-ai-devops-copilot" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps/ai-devops-systems-lab/tree/main/projects/01-ai-devops-copilot&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎬 Video walkthrough: &lt;a href="https://youtu.be/a50334Szt5g" rel="noopener noreferrer"&gt;https://youtu.be/a50334Szt5g&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone and run locally&lt;/span&gt;
git clone https://github.com/ThinkWithOps/ai-devops-systems-lab
&lt;span class="nb"&gt;cd &lt;/span&gt;projects/01-ai-devops-copilot
&lt;span class="nb"&gt;cp &lt;/span&gt;backend/.env.example backend/.env
&lt;span class="c"&gt;# Add your GROQ_API_KEY to .env&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:3000&lt;/code&gt; — the copilot is running.&lt;/p&gt;

&lt;p&gt;This is Project 01 of a 30-project series building AI + DevOps systems from scratch.&lt;/p&gt;




&lt;p&gt;What's the most painful part of your on-call experience that you wish an AI could handle? Drop it in the comments — it might become Project 02.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
`

---
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>langchain</category>
      <category>python</category>
    </item>
    <item>
      <title>How I Built an AI That Diagnoses GitHub Actions Failures Automatically</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Fri, 20 Mar 2026 19:17:38 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/how-i-built-an-ai-that-diagnoses-github-actions-failures-automatically-2d7j</link>
      <guid>https://dev.to/vijaya_bollu/how-i-built-an-ai-that-diagnoses-github-actions-failures-automatically-2d7j</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;GitHub Actions failure logs are noisy. Finding the actual error in 500 lines of output takes time you don't have during an incident. You get a red X, click through multiple pages, scroll past runner setup noise and dependency install output, land on the real error — and then you have to figure out what it means and what to do about it. I was doing this loop manually too often, so I automated it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;The tool fetches your repository's failed workflow runs via the GitHub API, extracts the relevant error sections from job logs, pulls in the workflow YAML for context, and sends everything to Ollama running locally. You get back a root cause, an explanation of why it happened, exact YAML changes to make, and steps to prevent it from happening again — in about 15 seconds.&lt;/p&gt;
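&lt;p&gt;The GitHub side of that flow is a single REST call; a sketch of how the request might be assembled (pagination and retries omitted):&lt;/p&gt;

```python
def failed_runs_request(owner: str, repo: str, token: str) -> tuple:
    # Build the GitHub REST call for failed workflow runs; the tool
    # then sends this with an HTTP client and walks the results.
    url = f"https://api.github.com/repos/{owner}/{repo}/actions/runs"
    params = {"status": "failure", "per_page": 5}
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    return url, params, headers
```
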

&lt;p&gt;Zero cloud costs. Your logs and code never leave your machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 5 Failure Types It Handles
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dependency conflicts&lt;/strong&gt; — version mismatches, packages missing from requirements.txt or package.json&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing secrets&lt;/strong&gt; — env vars referenced in the workflow but not configured in repository settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission errors&lt;/strong&gt; — &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; scope issues, OIDC misconfiguration, action not allowed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker build failures&lt;/strong&gt; — base image not found, build context issues, registry auth failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flaky tests&lt;/strong&gt; — identifies non-deterministic failures vs real bugs based on error patterns and exit codes&lt;/li&gt;
&lt;/ol&gt;
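&lt;p&gt;Before prompting the model, a keyword pass can route a log excerpt to one of these five buckets; the patterns below are illustrative, not the tool's actual list:&lt;/p&gt;

```python
# Heuristic keyword patterns per failure type (illustrative only).
PATTERNS = {
    "dependency_conflict": ["resolutionimpossible", "eresolve", "version conflict"],
    "missing_secret": ["secret not found", "bad credentials"],
    "permission_error": ["permission denied", "resource not accessible"],
    "docker_build": ["failed to solve", "manifest unknown"],
    "flaky_test": ["timeouterror", "connectionreset"],
}

def classify_failure(log_excerpt: str) -> str:
    lowered = log_excerpt.lower()
    for failure_type, needles in PATTERNS.items():
        if any(needle in lowered for needle in needles):
            return failure_type
    return "unknown"
```
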




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The flow is straightforward: GitHub REST API → Python log parser → Ollama.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GitHub API (workflow runs + job logs + YAML)
         │
         ▼
  extract_error_from_logs()   ← keyword scan, context windows, dedup
         │
         ▼
  analyze_failure()           ← structured prompt to Ollama
         │
         ▼
  Terminal report             ← root cause + YAML changes + prevention
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The log extraction step is where most of the work happens. Raw GitHub Actions logs are thousands of lines — runner diagnostics, apt output, pip install progress bars, none of which is useful. The parser scans for a list of error keywords (&lt;code&gt;error:&lt;/code&gt;, &lt;code&gt;failed&lt;/code&gt;, &lt;code&gt;Traceback&lt;/code&gt;, &lt;code&gt;exit code&lt;/code&gt;, &lt;code&gt;permission denied&lt;/code&gt;, etc.), then captures 8 lines of context before and 12 after each hit, deduplicates overlapping windows, and caps the result at 1500 characters before sending to the AI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_error_from_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;error_keywords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Traceback&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fatal:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;command not found&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;permission denied&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;exit code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;returned non-zero&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;error_sections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;seen_lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;error_keywords&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen_lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;seen_lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;error_sections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_sections&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;... (truncated)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI prompt includes the workflow name, failed job name, the extracted error section, and up to 1000 characters of the workflow YAML. That YAML context is what lets the AI suggest specific YAML fixes rather than generic advice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a GitHub Actions expert analyzing a failed CI/CD workflow.

Workflow: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;workflow_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Failed Job: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;workflow_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

ERROR LOGS:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;limited_logs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

**ROOT CAUSE:** [each specific error, all of them]
**WHY THIS HAPPENED:** [plain language explanation]
**HOW TO FIX:** [numbered, actionable steps]
**YAML CHANGES:** [exact changes needed]
**PREVENTION:** [how to avoid this next time]
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Most Useful Feature: Failure Classification
&lt;/h2&gt;

&lt;p&gt;The thing that saves the most time isn't root cause identification — it's the flaky vs real distinction.&lt;/p&gt;

&lt;p&gt;When a test fails with a non-deterministic error (race condition, network timeout, port already in use), re-running the workflow is the right call. When a test fails because you introduced a bug, re-running wastes 5 minutes and doesn't help. Before this tool, I'd often re-run first and only look at logs after the second failure confirmed it wasn't flaky. That's a 5-10 minute delay every time.&lt;/p&gt;

&lt;p&gt;The AI picks this up from patterns in the error output. A &lt;code&gt;ConnectionRefused&lt;/code&gt; or &lt;code&gt;socket timeout&lt;/code&gt; alongside a test failure is a different signal than a clean &lt;code&gt;AssertionError: expected 200, got 404&lt;/code&gt;. The prompt explicitly asks for this classification, and Llama 3.2 handles it reliably — it's the kind of pattern matching that LLMs are genuinely good at.&lt;/p&gt;

&lt;p&gt;The accuracy isn't perfect (the README documents ~85%), but it's good enough that I check the classification before deciding whether to re-run or investigate.&lt;/p&gt;
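The signal the prompt asks the AI for can be approximated with a plain keyword heuristic. This is an illustrative sketch only (the signal lists and function name are mine, not the project's code), but it shows why the classification is tractable pattern matching:

```python
# Illustrative flaky-vs-real heuristic -- not the project's actual code.
FLAKY_SIGNALS = [
    'connection refused', 'connectionrefused', 'socket timeout',
    'timed out', 'address already in use', 'temporarily unavailable',
]
REAL_BUG_SIGNALS = [
    'assertionerror', 'typeerror', 'modulenotfounderror', 'syntaxerror',
]

def classify_failure(error_text: str) -> str:
    """Crude classification from error text: 'real', 'likely-flaky', or 'unknown'."""
    text = error_text.lower()
    if any(sig in text for sig in REAL_BUG_SIGNALS):
        return 'real'          # deterministic failure: investigate, don't re-run
    if any(sig in text for sig in FLAKY_SIGNALS):
        return 'likely-flaky'  # infrastructure noise: a re-run may pass
    return 'unknown'
```

The LLM does the same thing with far richer context, which is why it handles cases a keyword list would miss.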




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt; — a failed workflow log containing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Run pytest tests/
ERROR: No module named 'requests'
FAILED tests/test_api.py::test_health - ModuleNotFoundError
error: Process completed with exit code 1.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;🔍 GITHUB ACTIONS FAILURE ANALYSIS
Repository: myname/myrepo | Workflow: CI Build | Failed Job: test

🤖 AI DIAGNOSIS:

&lt;span class="gs"&gt;**ROOT CAUSE:**&lt;/span&gt;
pytest is failing because the 'requests' library is not installed.
The test imports it, but it's not listed in requirements.txt, so
pip install -r requirements.txt doesn't include it.

&lt;span class="gs"&gt;**WHY THIS HAPPENED:**&lt;/span&gt;
The package works locally because you have it installed globally on your
machine. In the GitHub Actions runner, only packages in requirements.txt
get installed — nothing else is available.

&lt;span class="gs"&gt;**HOW TO FIX:**&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Add 'requests' to requirements.txt: requests&amp;gt;=2.32.0
&lt;span class="p"&gt;2.&lt;/span&gt; Commit and push — the workflow will pick it up automatically.
&lt;span class="p"&gt;3.&lt;/span&gt; If requests is only needed for tests, add a requirements-dev.txt
   and install it separately in a dedicated workflow step.

&lt;span class="gs"&gt;**YAML CHANGES:**&lt;/span&gt;
No YAML changes needed. The fix is in requirements.txt.

&lt;span class="gs"&gt;**PREVENTION:**&lt;/span&gt;
Run your CI workflow locally with a clean virtualenv before pushing:
  python -m venv .venv &amp;amp;&amp;amp; source .venv/bin/activate
  pip install -r requirements.txt
  pytest tests/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That took 14 seconds. The manual version (open the run page, scroll the logs, find the error, Google it, work out the fix) takes 10-15 minutes minimum.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ThinkWithOps/ai-devops-projects" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps/ai-devops-projects&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Video:&lt;/strong&gt; &lt;a href="https://youtu.be/EwgdZ8KmBJg" rel="noopener noreferrer"&gt;https://youtu.be/EwgdZ8KmBJg&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Prerequisites: Ollama running with llama3.2 pulled&lt;/span&gt;
git clone https://github.com/ThinkWithOps/ai-devops-projects
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-devops-projects/04-ai-github-actions-healer
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Set your GitHub token (needs repo + workflow scopes)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_token_here"&lt;/span&gt;

&lt;span class="c"&gt;# Run against your repo&lt;/span&gt;
python src/github_actions_healer.py &lt;span class="nt"&gt;--repo&lt;/span&gt; your-username/your-repo

&lt;span class="c"&gt;# Save report to file&lt;/span&gt;
python src/github_actions_healer.py &lt;span class="nt"&gt;--repo&lt;/span&gt; owner/repo &lt;span class="nt"&gt;--output&lt;/span&gt; report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Project 4 in my AI+DevOps series — all tools run locally with Ollama, zero cloud AI costs. Project 3 was an AI AWS Cost Detective, Project 5 is an AI Terraform Code Generator. Links in my profile.&lt;/p&gt;




&lt;p&gt;What GitHub Actions failure do you dread the most? For me it's the OIDC / GITHUB_TOKEN permission errors — the error message never tells you which specific permission is missing. The AI actually handles those surprisingly well. Drop yours in the comments.&lt;/p&gt;

</description>
      <category>github</category>
      <category>devops</category>
      <category>cicd</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built a Local AI Terraform Generator and Tested It By Actually Deploying to AWS — Here Are the Results</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Fri, 13 Mar 2026 15:29:11 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/-i-built-a-local-ai-terraform-generator-and-tested-it-by-actually-deploying-to-aws-here-are-the-n4m</link>
      <guid>https://dev.to/vijaya_bollu/-i-built-a-local-ai-terraform-generator-and-tested-it-by-actually-deploying-to-aws-here-are-the-n4m</guid>
      <description>&lt;h2&gt;
  
  
  The Idea
&lt;/h2&gt;

&lt;p&gt;Every time I needed a new AWS resource, I'd spend 20 minutes reading Terraform docs just to get the syntax right for something I'd done before. I wanted to type plain English and get working HCL back. But I also didn't want to just &lt;em&gt;generate&lt;/em&gt; code — I wanted to know if it actually deploys. So I tested every resource by running &lt;code&gt;terraform apply&lt;/code&gt; against a real AWS account.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;You describe infrastructure in plain English. The tool sends it to a local Llama 3.2 model via Ollama, which returns four Terraform files. Those files get saved to a &lt;code&gt;generated/&lt;/code&gt; folder, ready for &lt;code&gt;terraform init&lt;/code&gt; and &lt;code&gt;terraform apply&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Plain English → Python → Ollama (local) → Parse HCL → main.tf + variables.tf + outputs.tf + tfvars.example
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key piece is the prompt. Getting consistent, parseable HCL out of an LLM required a very specific structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a Terraform/OpenTofu expert. Generate production-ready infrastructure code.

USER REQUEST:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

PROVIDER: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

CRITICAL:
- Generate ONLY valid Terraform/HCL code
- NO markdown formatting or code blocks
- Start each file with a comment showing the filename
- Separate files with: ### FILENAME ###

Format your response like this:

### main.tf ###
terraform {{
  required_version = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;= 1.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
  required_providers {{
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; = {{
      source  = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hashicorp/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
      version = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~&amp;gt; 5.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    }}
  }}
}}

[rest of main.tf code]

### variables.tf ###
[variables code]

### outputs.tf ###
[outputs code]

### terraform.tfvars.example ###
[example values]

Generate production-ready code now:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;### FILENAME ###&lt;/code&gt; markers are what make the response parseable. The script splits on &lt;code&gt;###&lt;/code&gt;, reads the filename, grabs everything after it until the next marker, and writes that to disk. There's also a fallback parser for when the model goes off-script and wraps things in code blocks anyway.&lt;/p&gt;
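That split-on-marker step can be sketched in a few lines. This is a minimal illustration of the idea, assuming the `### FILENAME ###` format shown in the prompt; the function name and regex are mine, and the real script also carries the fallback parser mentioned above:

```python
import re

def parse_generated_files(response: str) -> dict:
    """Split an LLM response on '### filename ###' markers into {filename: content}.

    Illustrative sketch of the parsing described above, not the project's code.
    """
    files = {}
    # re.split with one capture group yields: [preamble, name1, body1, name2, body2, ...]
    parts = re.split(r'###\s*([\w.\-]+)\s*###', response)
    for name, body in zip(parts[1::2], parts[2::2]):
        files[name] = body.strip() + '\n'
    return files
```

Each entry can then be written straight to disk in the `generated/` folder.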




&lt;h2&gt;
  
  
  Test Results: 10 Resources Tested
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Generated&lt;/th&gt;
&lt;th&gt;Validated&lt;/th&gt;
&lt;th&gt;Deployed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;EC2 instance&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 bucket&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM role&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC + subnets&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security group&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RDS instance&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALB&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS task&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️ needed fix&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex module&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️ needed fix&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;8/10 deployed first try. 2/10 needed minor manual fixes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ECS task definition had an incorrect &lt;code&gt;network_mode&lt;/code&gt; value for Fargate (which requires &lt;code&gt;awsvpc&lt;/code&gt;). The complex multi-resource module had a missing &lt;code&gt;depends_on&lt;/code&gt; for the security group. Both were one-line fixes once &lt;code&gt;terraform validate&lt;/code&gt; pointed at them.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It's Good At vs Where It Struggles
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard single-resource configs (EC2, S3, IAM, RDS) — near-perfect every time&lt;/li&gt;
&lt;li&gt;Wiring dependencies correctly between resources it knows well&lt;/li&gt;
&lt;li&gt;Generating &lt;code&gt;variables.tf&lt;/code&gt; with descriptions and sensible defaults&lt;/li&gt;
&lt;li&gt;Adding tags and naming conventions without being asked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Struggles with:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very new AWS resources where Llama's training data is thin&lt;/li&gt;
&lt;li&gt;Complex modules with many interdependent resources — sometimes misses a &lt;code&gt;depends_on&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Provider version pinning — occasionally suggests a deprecated argument from an older AWS provider version&lt;/li&gt;
&lt;li&gt;ECS/EKS specifics — these configs are dense and the model sometimes gets task definition fields wrong&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Honest assessment: treat it like a junior engineer who's read all the Terraform docs but hasn't deployed much to production. Good first draft, always needs a review.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Prompt Engineering That Made It Work
&lt;/h2&gt;

&lt;p&gt;Three things made the difference between garbage output and deployable HCL:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Kill the markdown.&lt;/strong&gt; LLMs love wrapping code in fenced &lt;code&gt;```hcl&lt;/code&gt; blocks. That breaks the file parser completely. The explicit instruction &lt;code&gt;NO markdown formatting or code blocks&lt;/code&gt; eliminated this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Show the exact format in the prompt.&lt;/strong&gt; Telling the model to use &lt;code&gt;### main.tf ###&lt;/code&gt; as a separator, and including the &lt;code&gt;terraform {}&lt;/code&gt; block structure directly in the prompt, anchored the output format. Without this, every response looked slightly different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Demand variables explicitly.&lt;/strong&gt; Early versions hardcoded values like &lt;code&gt;instance_type = "t3.micro"&lt;/code&gt; directly in &lt;code&gt;main.tf&lt;/code&gt;. Adding &lt;code&gt;Use variables for configurable values&lt;/code&gt; to the requirements section fixed this — now everything configurable lands in &lt;code&gt;variables.tf&lt;/code&gt; with proper descriptions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Local Matters for IaC Generation
&lt;/h2&gt;

&lt;p&gt;Your Terraform descriptions contain your architecture. "Create a VPC with private subnets, an RDS cluster for our auth service, and an ECS task that pulls from our private ECR registry" — that's a roadmap of your production infrastructure. Sending that to a cloud API means it leaves your machine.&lt;/p&gt;

&lt;p&gt;Running Ollama locally means the description, the generated code, and any sensitive context like account IDs or naming patterns stay on your machine. For anything touching production infrastructure, that's not optional.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ThinkWithOps" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Live deploy demo:&lt;/strong&gt; &lt;a href="https://youtu.be/nhhZqrCEhOA" rel="noopener noreferrer"&gt;https://youtu.be/nhhZqrCEhOA&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ThinkWithOps/ai-devops-projects
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-devops-projects/05-ai-terraform-generator
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Generate code&lt;/span&gt;
python src/terraform_generator.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--description&lt;/span&gt; &lt;span class="s2"&gt;"EC2 instance with S3 bucket for logs"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--provider&lt;/span&gt; aws

&lt;span class="c"&gt;# Deploy it&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;generated/
&lt;span class="nb"&gt;cp &lt;/span&gt;terraform.tfvars.example terraform.tfvars
terraform init &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; terraform plan &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Project 5 in my AI+DevOps series — all tools run locally with Ollama, zero cloud API costs.&lt;/p&gt;




&lt;p&gt;What's your current Terraform workflow? I'm curious whether people are using Copilot for this, manually writing it, or something else entirely — drop it in the comments.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>devops</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built an AI Kubernetes Pod Debugger — Diagnoses CrashLoopBackOff, OOMKilled, and More in Seconds</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Fri, 13 Mar 2026 15:14:28 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/-i-built-an-ai-kubernetes-pod-debugger-diagnoses-crashloopbackoff-oomkilled-and-more-in-seconds-24k2</link>
      <guid>https://dev.to/vijaya_bollu/-i-built-an-ai-kubernetes-pod-debugger-diagnoses-crashloopbackoff-oomkilled-and-more-in-seconds-24k2</guid>
      <description>&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;K8s error messages are designed for cluster operators, not for the developer who just shipped a feature and now has a pod stuck in &lt;code&gt;CrashLoopBackOff&lt;/code&gt; at 11pm. &lt;code&gt;kubectl get pods&lt;/code&gt; tells you something is broken. It doesn't tell you why, and it definitely doesn't tell you what to do about it. I kept doing the same 4-command loop — &lt;code&gt;get pods&lt;/code&gt;, &lt;code&gt;describe pod&lt;/code&gt;, &lt;code&gt;logs&lt;/code&gt;, Google the error — and thought there had to be a better way.&lt;/p&gt;

&lt;p&gt;So I automated the loop and added an AI in the middle.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 6 Failure Types It Handles
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ImagePullBackOff&lt;/strong&gt; — wrong image name, missing credentials, private registry issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrashLoopBackOff&lt;/strong&gt; — app crash on startup, missing env vars, bad config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OOMKilled&lt;/strong&gt; — container exceeding memory limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pending&lt;/strong&gt; — insufficient cluster resources, node selector issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed&lt;/strong&gt; — job completion errors, config map missing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Init Container failures&lt;/strong&gt; — init container exits non-zero&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The tool runs three kubectl commands in sequence, then hands everything to Ollama.&lt;/p&gt;

&lt;p&gt;First, &lt;code&gt;kubectl get pods -o json&lt;/code&gt; scans the namespace and flags any pod that isn't &lt;code&gt;Running&lt;/code&gt;, isn't &lt;code&gt;ready&lt;/code&gt;, or has restarts &amp;gt; 0. For each unhealthy pod, it grabs the last 50 lines of logs and the Events section from &lt;code&gt;kubectl describe&lt;/code&gt;. Both get truncated before hitting the AI — logs capped at 2000 chars, events at 1000 — to stay within a reasonable context window.&lt;/p&gt;
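The unhealthy-pod scan over that JSON can be sketched like this. The field paths follow standard `kubectl get pods -o json` output, but the function itself is an illustration, not the project's actual implementation:

```python
import json

def find_unhealthy_pods(pods_json: str) -> list:
    """Flag pods that aren't Running, aren't ready, or have restarts > 0.

    pods_json is the raw output of `kubectl get pods -o json`.
    Illustrative sketch of the scan described above.
    """
    unhealthy = []
    for pod in json.loads(pods_json)['items']:
        statuses = pod['status'].get('containerStatuses', [])
        # A pod with no container statuses yet (e.g. Pending) counts as not ready.
        ready = all(cs.get('ready') for cs in statuses) if statuses else False
        restarts = sum(cs.get('restartCount', 0) for cs in statuses)
        phase = pod['status'].get('phase')
        if phase != 'Running' or not ready or restarts > 0:
            unhealthy.append({
                'name': pod['metadata']['name'],
                'status': phase,
                'ready': ready,
                'restarts': restarts,
            })
    return unhealthy
```

Each flagged pod then gets its logs and events collected for the prompt below.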

&lt;p&gt;Then everything goes into a structured prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_pod_failure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logs_sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No logs available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;events_sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No events available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a Kubernetes expert helping debug pod failures.

Pod Information:
- Name: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
- Status: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
- Ready: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ready&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
- Restart Count: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;restarts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Recent Logs (last 50 lines):
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;logs_sample&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Recent Events:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;events_sample&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Provide a diagnosis in this format:

**ROOT CAUSE:**
[Explain in 1-2 sentences what&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s causing the pod to fail, using simple terms]

**WHY THIS HAPPENS:**
[Explain why this error occurs, use an analogy if helpful]

**HOW TO FIX:**
[Provide specific kubectl commands or YAML changes to fix the issue]
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask_ollama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_remove_repeated_suffix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;_remove_repeated_suffix&lt;/code&gt; at the end strips duplicated content that Llama 3.2 sometimes generates — more on that below.&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo: CrashLoopBackOff Diagnosis
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt; — what the cluster shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get pods
NAME          READY   STATUS             RESTARTS   AGE
broken-pod    0/1     CrashLoopBackOff   4          3m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Logs show: &lt;code&gt;panic: runtime error: invalid memory address or nil pointer dereference&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output from the AI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔍 KUBERNETES POD DEBUG REPORT
Pod: broken-pod | Status: CrashLoopBackOff | Restarts: 4

🤖 AI DIAGNOSIS:

**ROOT CAUSE:**
The application is crashing immediately on startup due to a nil pointer
dereference — it's trying to use a variable that was never initialized,
likely a missing environment variable or config value.

**WHY THIS HAPPENS:**
Think of it like a recipe that calls for eggs, but your fridge is empty.
The app tries to read a config value that doesn't exist, gets back nothing,
and then crashes when it tries to use it. Kubernetes keeps restarting it,
hoping it'll work — it won't until the missing config is there.

**HOW TO FIX:**
1. Check what env vars the container expects:
   kubectl describe pod broken-pod | grep -A 10 'Environment'
2. Add the missing values to your deployment YAML under 'env:'
3. Or create a ConfigMap and reference it:
   kubectl create configmap app-config --from-literal=KEY=value
4. Reapply: kubectl apply -f your-deployment.yaml

💡 SUGGESTED NEXT STEP:
   kubectl logs broken-pod --previous
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That diagnosis took 12 seconds. The manual version of this took me 25 minutes the first time I hit a nil pointer crash in K8s.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;kubectl logs&lt;/code&gt; silently fails on crashed containers.&lt;/strong&gt; If a container has already exited, the default &lt;code&gt;kubectl logs&lt;/code&gt; command returns nothing — no error, just empty output. The fix is the &lt;code&gt;--previous&lt;/code&gt; flag, which fetches logs from the last terminated container. I found this out the hard way when the AI kept saying "No logs available" for CrashLoopBackOff pods. The code now tries normal logs first, then falls back automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CalledProcessError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Try to get previous logs if container already crashed
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kubectl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pod_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--previous&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--tail=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
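&lt;p&gt;Put together, the whole log-fetching routine looks roughly like this — a simplified sketch, not the exact repo code (&lt;code&gt;logs_command&lt;/code&gt; is a helper name I'm using for illustration; the real version adds timeouts and error formatting):&lt;/p&gt;

```python
import subprocess

def logs_command(pod: str, namespace: str = "default",
                 tail: int = 50, previous: bool = False) -> list:
    # Build the kubectl invocation; --previous reads the last
    # terminated container's logs instead of the current (empty) one.
    cmd = ["kubectl", "logs", pod, "-n", namespace, f"--tail={tail}"]
    if previous:
        cmd.append("--previous")
    return cmd

def get_pod_logs(pod: str, namespace: str = "default", tail: int = 50) -> str:
    # Try the live container first, then fall back to the crashed one.
    for previous in (False, True):
        result = subprocess.run(logs_command(pod, namespace, tail, previous),
                                capture_output=True, text=True)
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout
    return "No logs available"
```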



&lt;p&gt;&lt;strong&gt;Llama 3.2 sometimes repeats its own output.&lt;/strong&gt; Occasionally the model generates a full response and then starts over from the beginning, appending a duplicate. For a terminal tool this looks terrible. I had to write a suffix deduplication function that checks if the second half of the response is a repeat of any trailing segment of the first half, and strips it. Not something I expected to need when I started this project.&lt;/p&gt;
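&lt;p&gt;A minimal version of that idea — my illustrative sketch, not the exact code from the repo — compares the trailing characters against the text immediately before them and trims exact repeats:&lt;/p&gt;

```python
def remove_repeated_suffix(text: str) -> str:
    # If the last `size` characters exactly duplicate the `size`
    # characters right before them, drop the trailing copy and re-check.
    # The floor of 20 avoids stripping short accidental repeats.
    n = len(text)
    for size in range(n // 2, 20, -1):
        if text[n - size:] == text[n - 2 * size: n - size]:
            return remove_repeated_suffix(text[: n - size])
    return text
```

It's quadratic in the worst case, but model responses are a few kilobytes at most, so in practice it's instant.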

&lt;p&gt;&lt;strong&gt;Minikube hangs silently when the base image is cached.&lt;/strong&gt; On Windows with Docker Desktop, &lt;code&gt;minikube start&lt;/code&gt; can freeze indefinitely trying to pull an image that's already on disk. The fix — &lt;code&gt;minikube start --base-image=gcr.io/k8s-minikube/kicbase:v0.0.49&lt;/code&gt; — forces it to use the local cache. This isn't documented prominently and cost me an hour of debugging a debugging tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ThinkWithOps" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Demo video:&lt;/strong&gt; &lt;a href="https://youtu.be/LFF-987-uhA" rel="noopener noreferrer"&gt;https://youtu.be/LFF-987-uhA&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Prerequisites: Minikube running, Ollama + llama3.2 pulled&lt;/span&gt;
git clone https://github.com/ThinkWithOps/ai-devops-projects
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-devops-projects/02-ai-k8s-debugger
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Deploy a broken pod to test with&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; demo/broken-pod.yaml

&lt;span class="c"&gt;# Run the debugger&lt;/span&gt;
python src/k8s_debugger.py

&lt;span class="c"&gt;# Or target a specific pod&lt;/span&gt;
python src/k8s_debugger.py &lt;span class="nt"&gt;--pod&lt;/span&gt; broken-pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Project 2 in my AI+DevOps series — all tools run locally with Ollama, zero cloud costs. Project 1 was an AI Docker vulnerability scanner, Project 3 is an AI AWS Cost Detective. Links in my profile.&lt;/p&gt;




&lt;p&gt;What K8s error do you dread seeing the most? For me it's still &lt;code&gt;Pending&lt;/code&gt; with node affinity issues — the AI actually handles that one better than I expected. Drop yours in the comments.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built an AI AWS Cost Detective That Found $900/Year in Waste — Here's How</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Sun, 08 Mar 2026 15:22:25 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/i-built-an-ai-aws-cost-detective-that-found-900year-in-waste-heres-how-1ll4</link>
      <guid>https://dev.to/vijaya_bollu/i-built-an-ai-aws-cost-detective-that-found-900year-in-waste-heres-how-1ll4</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;AWS Cost Explorer shows you data. It doesn't tell you what to do about it. I was paying $127/month and knew I was wasting money but couldn't quickly identify where.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the AI Found
&lt;/h2&gt;

&lt;p&gt;Running the tool against my own account uncovered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;EC2 Waste:&lt;/strong&gt; A t3.small running 24/7 — used maybe 2 hours a day for testing. That's $45/month for 22 hours of idle compute every single day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EBS Volumes:&lt;/strong&gt; Three EBS volumes still attached to stopped instances. No data being written, no instance using them. $8/month evaporating for nothing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NAT Gateway:&lt;/strong&gt; A NAT Gateway from an old VPC setup I'd completely forgotten. Nothing routing through it. $12/month for a network door with no traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshots:&lt;/strong&gt; Automated snapshots from an RDS instance I deleted months ago. The database was gone but the snapshots kept accumulating — $10/month.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Total: $75/month = $900/year&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The tool chains three things together: boto3 fetches your AWS costs and resource counts, Python shapes the data, and Ollama (local Llama 3.2) turns it into actionable recommendations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS Cost Explorer API  →  Python (boto3)  →  Ollama  →  Structured report
     (billing data)        (resource counts)   (local LLM)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
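&lt;p&gt;The first hop in that pipeline is a single Cost Explorer call. Here's a sketch of the fetch-and-flatten step — &lt;code&gt;parse_cost_response&lt;/code&gt; is my illustrative name for it, and the boto3 call is shown as a comment since it needs live AWS credentials:&lt;/p&gt;

```python
# resp = boto3.client("ce").get_cost_and_usage(
#     TimePeriod={"Start": "2026-02-06", "End": "2026-03-08"},
#     Granularity="MONTHLY",
#     Metrics=["UnblendedCost"],
#     GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
# )

def parse_cost_response(resp: dict) -> list:
    # Flatten GetCostAndUsage output into [{'service': ..., 'cost': ...}],
    # summing across time periods and sorting by cost, highest first.
    totals = {}
    for period in resp.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    return sorted(
        ({"service": s, "cost": c} for s, c in totals.items()),
        key=lambda d: d["cost"], reverse=True,
    )
```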



&lt;p&gt;First it pulls 30 days of costs grouped by service, then counts your live resources (EC2 instances, EBS volumes, S3 buckets, RDS databases, Lambda functions). Both datasets go into the AI prompt together — because cost numbers without resource context give you vague answers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_costs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;total_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;top_services&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;services_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_services&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;resources_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are an AWS cost optimization expert.

COST SUMMARY (Last 30 Days):
Total Cost: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_cost&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Top Services by Cost:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;services_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Resources Currently Running:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resources_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Provide recommendations in this format:

**COST ANALYSIS:**
**HIDDEN COSTS DETECTED:**
**OPTIMIZATION RECOMMENDATIONS:**
**ESTIMATED SAVINGS:**
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask_ollama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The structured output format in the prompt is what makes the response actually parseable and useful — not just a wall of text.&lt;/p&gt;
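&lt;p&gt;The &lt;code&gt;ask_ollama&lt;/code&gt; helper that both tools share is just a thin wrapper over Ollama's local HTTP API. A minimal stdlib-only sketch (assuming Ollama's default port, 11434):&lt;/p&gt;

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(prompt: str, model: str = "llama3.2") -> dict:
    # stream=False makes Ollama return one JSON object
    # instead of a stream of token chunks
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt: str, model: str = "llama3.2") -> str:
    payload = json.dumps(build_request(prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```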




&lt;h2&gt;
  
  
  Setting Up Read-Only AWS Access
&lt;/h2&gt;

&lt;p&gt;This is important — the tool only needs read permissions. Here's the minimal IAM policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ce:GetCostAndUsage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ec2:Describe*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"s3:ListAllMyBuckets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"rds:Describe*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"lambda:List*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a dedicated IAM user (&lt;code&gt;cost-detective&lt;/code&gt;), attach this policy, generate an access key, and run &lt;code&gt;aws configure&lt;/code&gt;. The tool never writes anything to your account — worst case it reads data you didn't expect it to.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;Two things I didn't expect:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Cost Explorer API is almost free — but not quite.&lt;/strong&gt; I assumed querying billing data would be expensive. It's actually $0.01 per API request, so a full run of this tool costs about a penny.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS returns cost values as Python &lt;code&gt;Decimal&lt;/code&gt;, not &lt;code&gt;float&lt;/code&gt;.&lt;/strong&gt; This one is a quiet killer — &lt;code&gt;json.dumps()&lt;/code&gt; will crash when you try to save a report because the standard JSON encoder doesn't handle &lt;code&gt;Decimal&lt;/code&gt;. Had to write a custom encoder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DecimalEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONEncoder&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DecimalEncoder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The traceback just says &lt;code&gt;Object of type Decimal is not JSON serializable&lt;/code&gt; — it doesn't tell you which field is the problem, or why boto3 handed you a &lt;code&gt;Decimal&lt;/code&gt; in the first place.&lt;/p&gt;
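&lt;p&gt;Using the encoder is one extra argument to &lt;code&gt;json.dumps&lt;/code&gt;. A quick repro of both the failure and the fix:&lt;/p&gt;

```python
import json
from decimal import Decimal

class DecimalEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Decimal):
            return float(obj)
        return super().default(obj)

report = {"service": "AmazonEC2", "cost": Decimal("45.17")}

# json.dumps(report)  -> TypeError: Object of type Decimal is not JSON serializable
print(json.dumps(report, cls=DecimalEncoder))  # {"service": "AmazonEC2", "cost": 45.17}
```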




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;Being honest about what this doesn't handle well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data is 24–48 hours delayed.&lt;/strong&gt; Cost Explorer isn't real-time. If you just spun up a resource today, it won't show up yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single region by default.&lt;/strong&gt; Resource counts only scan &lt;code&gt;us-east-1&lt;/code&gt;. Multi-region setups need extra config.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Doesn't catch everything.&lt;/strong&gt; Very small charges (under $0.01) are filtered out. Some hidden costs — like cross-AZ data transfer — aren't obvious from the Cost Explorer groupings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI recommendations need verification.&lt;/strong&gt; The tool identifies patterns and suggests actions, but you should always sanity-check before terminating anything. I almost deleted an EBS volume that was actually still in use by a snapshot restore.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ThinkWithOps" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Demo video:&lt;/strong&gt; &lt;a href="https://youtu.be/rg1Vnjjt9xk" rel="noopener noreferrer"&gt;https://youtu.be/rg1Vnjjt9xk&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ThinkWithOps/ai-devops-projects
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-devops-projects/03-ai-aws-cost-detective
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python src/aws_cost_detective.py

&lt;span class="c"&gt;# Save report to JSON&lt;/span&gt;
python src/aws_cost_detective.py &lt;span class="nt"&gt;--output&lt;/span&gt; report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Project 3 in my AI+DevOps series. Project 4 is an AI GitHub Actions Auto-Healer — it reads failing CI logs and suggests fixes. Link in my profile.&lt;/p&gt;




&lt;p&gt;What's the most unexpected thing hiding in your AWS bill? I'd have never noticed that NAT Gateway without the tool surfacing it. Drop yours in the comments.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>ai</category>
      <category>cloud</category>
    </item>
    <item>
      <title>How I Built a Local AI Docker Vulnerability Scanner (No API Costs, No Cloud)</title>
      <dc:creator>Vijaya Rajeev Bollu</dc:creator>
      <pubDate>Fri, 06 Mar 2026 19:04:47 +0000</pubDate>
      <link>https://dev.to/vijaya_bollu/how-i-built-a-local-ai-docker-vulnerability-scanner-no-api-costs-no-cloud-3ef7</link>
      <guid>https://dev.to/vijaya_bollu/how-i-built-a-local-ai-docker-vulnerability-scanner-no-api-costs-no-cloud-3ef7</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a Local AI Docker Vulnerability Scanner (No API Costs, No Cloud)
&lt;/h1&gt;




&lt;h2&gt;
  
  
  The Problem with Trivy Output
&lt;/h2&gt;

&lt;p&gt;Running Trivy gives you a wall of CVE numbers. Most developers copy-paste them into Google and spend 20 minutes figuring out if each one actually matters for their use case.&lt;/p&gt;

&lt;p&gt;I built a tool that fixes this.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A local AI wrapper around Trivy that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scans any Docker image&lt;/li&gt;
&lt;li&gt;Takes the raw CVE output&lt;/li&gt;
&lt;li&gt;Feeds it to Ollama (local LLM — no API costs)&lt;/li&gt;
&lt;li&gt;Returns plain English explanations + specific fix recommendations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Interesting Finding
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nginx:1.27-alpine: 14 vulnerabilities
nginx:alpine:       3 vulnerabilities
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same base image family — yet the pinned version had nearly 5x the CVEs (14 vs 3). The AI caught this pattern and recommended comparing tag variants automatically.&lt;/p&gt;
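&lt;p&gt;Once vulnerabilities are extracted, comparing image variants is trivial — for example, a quick severity count per scan (a small helper sketch of mine, not code from the repo):&lt;/p&gt;

```python
from collections import Counter

def severity_summary(vulns: list) -> dict:
    # vulns: list of dicts with a 'severity' key, the shape produced
    # by the extraction step in the code walkthrough below
    return dict(Counter(v["severity"] for v in vulns))
```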




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.11&lt;/li&gt;
&lt;li&gt;Trivy (vulnerability scanner)&lt;/li&gt;
&lt;li&gt;Ollama + Llama 3.2 (local LLM)&lt;/li&gt;
&lt;li&gt;Zero cloud dependencies&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How It Works (Code Walkthrough)
&lt;/h2&gt;

&lt;p&gt;The scanner has three moving parts: Trivy does the heavy lifting of CVE detection, Python orchestrates everything, and Ollama explains what it all means.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Scan with Trivy and parse the JSON:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scan_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trivy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HIGH,CRITICAL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_name&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_vulnerabilities&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scan_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;vulnerabilities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;seen_vulns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# deduplicate by CVE ID
&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scan_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;vuln&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Vulnerabilities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
            &lt;span class="n"&gt;vuln_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VulnerabilityID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vuln_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen_vulns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="n"&gt;seen_vulns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vuln_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;vulnerabilities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vuln_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;package&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PkgName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fixed_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FixedVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Not available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;vulnerabilities&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2 — Send each CVE to Ollama for a plain English explanation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;explain_vulnerability&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a security expert explaining vulnerabilities to developers.

Vulnerability Details:
- ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
- Package: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;package&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (version &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)
- Severity: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
- Title: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Explain in 2-3 sentences:
1. What this vulnerability means in simple terms
2. Why it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s dangerous
3. How to fix it (fixed version: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vuln&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed_version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)

Keep it concise and actionable. Use analogies if helpful.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ollama_host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
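&lt;p&gt;One thing the snippet above doesn't show: if the Ollama daemon isn't running, &lt;code&gt;requests.post&lt;/code&gt; raises and the whole scan dies mid-loop. A minimal guard I'd wrap around it (hypothetical helper, not part of the scanner itself) falls back to a plain one-liner instead:&lt;/p&gt;

```python
import requests

def safe_explain(scanner, vuln):
    """Hypothetical guard around explain_vulnerability: if the local Ollama
    server is unreachable (or the 60-second timeout fires), return a plain
    fallback line instead of letting the whole scan crash."""
    try:
        return scanner.explain_vulnerability(vuln)
    except requests.exceptions.RequestException:
        return (f"(AI explanation unavailable) {vuln['severity']} issue in "
                f"{vuln['package']}; upgrade to {vuln['fixed_version']}.")
```

&lt;p&gt;&lt;code&gt;RequestException&lt;/code&gt; is the base class for both connection errors and timeouts, so one except clause covers both failure modes.&lt;/p&gt;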



&lt;p&gt;&lt;strong&gt;Step 3 — Generate an overall summary with structured output:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The summary prompt forces Ollama into a key-value format so we can parse it reliably and build a comparison command on the fly — more on that in the next section.&lt;/p&gt;
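&lt;p&gt;For illustration, a structured summary prompt along these lines (my reconstruction, not the scanner's verbatim prompt text) could be built like this:&lt;/p&gt;

```python
def build_summary_prompt(image, counts):
    """Illustrative sketch of a structured-output prompt (a reconstruction,
    not the scanner's exact wording). Demanding KEY: value pairs and giving
    the model a literal example keeps its reply machine-parseable."""
    return f"""Summarize the scan of {image}: {counts.get('CRITICAL', 0)} critical, {counts.get('HIGH', 0)} high findings.
Respond ONLY with KEY: value pairs, one per line, exactly like this example:
SECURITY_POSTURE: poor
VARIANTS_TO_TEST: nginx:alpine, nginx:stable
No prose before or after the pairs."""
```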




&lt;h2&gt;
  
  
  The Trickiest Part
&lt;/h2&gt;

&lt;p&gt;Getting Ollama to return structured output consistently was harder than expected. Free-form responses were great for individual CVE explanations, but the security summary needed to be &lt;em&gt;parseable&lt;/em&gt; — I needed specific fields like &lt;code&gt;SECURITY_POSTURE&lt;/code&gt; and &lt;code&gt;VARIANTS_TO_TEST&lt;/code&gt; to programmatically build the comparison command.&lt;/p&gt;

&lt;p&gt;The solution was strict prompt formatting: I told the model to respond in &lt;code&gt;KEY: value&lt;/code&gt; pairs and gave it an explicit example. Then I split each line on &lt;code&gt;:&lt;/code&gt; and built a dict. When parsing failed I fell back to a hardcoded comparison command. The other challenge was Llama 3.2 sometimes repeating itself — I solved that with a deduplication pass that checks for repeated section headers (&lt;code&gt;**1.&lt;/code&gt;, &lt;code&gt;**Vulnerability&lt;/code&gt;, etc.) and drops them before printing.&lt;/p&gt;
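&lt;p&gt;The parse-with-fallback and dedup passes described above can be sketched like this (the fallback command and helper names are mine; the key names and header markers come from the description):&lt;/p&gt;

```python
FALLBACK_COMMAND = "python src/docker_scanner.py nginx:alpine"  # assumed fallback

def parse_summary(text):
    """Split each 'KEY: value' line at the FIRST colon, so values that
    themselves contain colons (image tags like nginx:alpine) survive."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def build_comparison_command(summary_text):
    """Use VARIANTS_TO_TEST when the model cooperated; otherwise fall back."""
    variants = parse_summary(summary_text).get("VARIANTS_TO_TEST")
    if not variants:
        return FALLBACK_COMMAND  # parsing failed: hardcoded fallback
    return "python src/docker_scanner.py " + " ".join(
        v.strip() for v in variants.split(","))

def drop_repeats(text, markers=("**1.", "**Vulnerability")):
    """Dedup pass: drop section-header lines that have already appeared,
    which is how Llama 3.2's self-repetition shows up."""
    seen, kept = set(), []
    for line in text.splitlines():
        if line.startswith(markers) and line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)
```

&lt;p&gt;&lt;code&gt;str.partition&lt;/code&gt; rather than &lt;code&gt;str.split&lt;/code&gt; is the load-bearing choice here: splitting on every colon would mangle image tags in the values.&lt;/p&gt;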




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; — Raw Trivy output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CVE-2024-1234 (CRITICAL)
Package: openssl 1.1.1k
Description: Use-after-free in X509_verify_cert function
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;😕 &lt;em&gt;"What does this mean? Do I need to care about this?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt; — AI-enhanced output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🤖 AI Explanation:
This is like leaving your house key under the doormat.
OpenSSL handles your HTTPS connections, and this bug lets
attackers potentially decrypt traffic. Fix: update your
Dockerfile base image to get openssl 1.1.1w or later.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ &lt;em&gt;"Got it, I'll update the base image today."&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Avg scan time&lt;/td&gt;
&lt;td&gt;15–30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI explanation per CVE&lt;/td&gt;
&lt;td&gt;~3 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud API cost&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Images tested&lt;/td&gt;
&lt;td&gt;50+ (nginx, node, python, ubuntu)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Manual CVE triage that used to take 20+ minutes per image now takes under a minute for the top 5 vulnerabilities.&lt;/p&gt;
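&lt;p&gt;Picking a "top 5" boils down to ranking findings by severity before sending anything to the model. A sketch (the rank table is my assumption, using Trivy's standard severity labels, not necessarily the scanner's exact logic):&lt;/p&gt;

```python
# Assumed severity ordering; these are Trivy's standard severity labels.
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3, "UNKNOWN": 4}

def top_vulnerabilities(vulns, n=5):
    """Return the n most severe findings, worst first; unknown labels sort last."""
    return sorted(vulns, key=lambda v: SEVERITY_RANK.get(v.get("severity"), 5))[:n]
```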




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ThinkWithOps/ai-devops-projects/tree/main/01-ai-docker-scanner" rel="noopener noreferrer"&gt;https://github.com/ThinkWithOps/ai-devops-projects/tree/main/01-ai-docker-scanner&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Full demo video:&lt;/strong&gt; &lt;a href="https://youtu.be/J6fmU6t9jUU" rel="noopener noreferrer"&gt;https://youtu.be/J6fmU6t9jUU&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Prerequisites: Docker, Trivy, Ollama + llama3.2 pulled&lt;/span&gt;
git clone https://github.com/ThinkWithOps/ai-devops-projects.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-devops-projects/01-ai-docker-scanner
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python src/docker_scanner.py nginx:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is Project 1 in my AI+DevOps series. Next I built an AI K8s Pod Debugger — link in my profile.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>devops</category>
      <category>ai</category>
      <category>security</category>
    </item>
  </channel>
</rss>
