Akshay Gore

Posted on Mar 13

AI Home Lab — Part 3: Building a RAG Pipeline: Making Your Local AI Actually Know Your Stuff

#ai #llm #rag #devops

In Parts 1 and 2, we set up Ollama with phi3:mini and wired up Prometheus and Grafana to monitor it. The model was running, but it only knew what it was trained on. In this part, we fix that — by building a RAG pipeline that lets the model answer questions about our own docs, configs, and playbooks.

What is RAG and Why Does It Matter?

If you've ever asked a local LLM about your own infrastructure and got a generic answer, you've hit the core limitation — the model simply doesn't know about your setup. It was trained on public data, not your Ansible playbooks or your Prometheus configs.

RAG stands for Retrieval-Augmented Generation. The name sounds complex but the idea is simple:

Instead of expecting the model to have memorised everything, you hand it the relevant information right before it answers.

Think of it like an open book exam. The model doesn't learn anything new — it just gets to read the right page before writing its answer.

RAG solves two problems:

The model's knowledge has a cutoff date — it knows nothing after that.
The model was never trained on your private data — your runbooks, configs, blog posts.

Show, don't tell

Model answering without context

root@phi:/opt/rag-pipeline# ollama run phi3:mini "What is the ansible_host of phi?"
It seems like you're referring to a specific host in a configuration or playbook, possibly for a network device or server managed with Ansible, using the term "phi".
However, without additional context or a specific inventory or playbook, I cannot provide the `ansible_host` attribute of "phi".

If "phi" is a hostname or an identifier in an Ansible inventory file (like `hosts.ini`, `ansible.cfg`, or an inventory file), you would typically access its
information through an Ansible playbook or command.

Here's how you might retrieve the `ansible_host` attribute of "phi" using an Ansible playbook, assuming "phi" is a host defined in your inventory:

Ingesting data to our LLM

Model answering queries with RAG pipeline implemented

How the Pipeline Works

The RAG pipeline has two phases: ingestion and querying.

Phase 1 — Ingestion (feeding your docs in)

This is a one-time step where you load your documents into a vector database. Here's what happens:

Your document is read — a playbook, a config file, a blog post.
It gets split into chunks — smaller pieces of ~500 characters each. This is called chunking.
Each chunk is converted to a vector — a list of numbers that captures the meaning of that text. This is done by the embedding model (nomic-embed-text in our case).
The vector + original text is stored in ChromaDB, our vector database.

Terminal output of ingest.py showing files being ingested with chunk counts

Phase 2 — Querying (asking a question)

Every time you ask a question, this happens:

Your question is converted to a vector — using the same embedding model.
ChromaDB finds the closest matching chunks — this is semantic search, not keyword search.
Those chunks are injected into the prompt — as context for the model.
phi3:mini reads the context and answers — grounded in your actual docs.

Terminal output of query.py showing sources retrieved and the answer

The model itself never changes. It just receives better, more relevant prompts. RAG is a prompting strategy, not a training technique.

What Are Embeddings?

Embeddings are at the heart of why RAG works. An embedding converts text into a list of numbers — a vector — that captures its meaning.

Here's the key insight:

Text with similar meaning produces vectors that are close to each other in space. ChromaDB uses this to find relevant chunks — not by matching keywords, but by measuring how close the meaning is.

For example, these two sentences produce very similar vectors:

"restart the Ollama service"
"bring Ollama back up"

A keyword search would miss this match. Semantic search finds it because the meaning is the same.

In our stack, nomic-embed-text handles all embedding. It's a dedicated embedding model — it doesn't generate text, it only produces vectors. phi3:mini handles the actual answer generation.

ollama list output showing both phi3:mini and nomic-embed-text models

Just for Fun. CPU consumption of VM when LLM is running full full throttle

The Stack

Everything runs on the phi VM — the same Ubuntu Server from Parts 1 and 2:

Component	Tool	Role
Embedding model	nomic-embed-text	Converts text to vectors
Vector database	ChromaDB	Stores and searches vectors
LLM	phi3:mini via Ollama	Generates the final answer
Orchestration	Python scripts	Wires everything together
Automation	Ansible (rag role)	Deploys the entire pipeline

The Implementation

The pipeline is three Python files, each with a single responsibility.

config.py — Central settings

All configuration lives here — Ollama URL, ChromaDB host, model names, chunk size. Nothing is hardcoded anywhere else.

OLLAMA_URL      = "http://localhost:11434"
EMBED_MODEL     = "nomic-embed-text"
LLM_MODEL       = "phi3:mini"
CHROMA_HOST     = "localhost"
CHROMA_PORT     = 8001
CHUNK_SIZE      = 500
CHUNK_OVERLAP   = 50

ingest.py — Feeding your docs in

This script walks your docs folder, reads every supported file (.yml, .md, .conf), chunks the text, embeds each chunk, and stores it in ChromaDB with metadata so you always know which file an answer came from.

python3 ingest.py --docs-dir ./docs

The output tells you exactly what was ingested:

root@phi:/opt/rag-pipeline# python ingest.py --docs-dir ./docs

── Loading files from: ./docs

── Ingesting 18 file(s) into 'homelabdocs'

  ✓ ./docs/blog/Self-Hosted-AI-on-Linux-A-DevOps-Home-Lab-Guide.md → 19 chunk(s) ingested
  ✓ ./docs/blog/Monitoring-Self-Hosted-LLM-with-Prometheus-and-Grafana.md → 20 chunk(s) ingested
  ✓ ./docs/monitoring/prometheus.yml → 2 chunk(s) ingested
  ✓ ./docs/ansible/inventory.ini → 1 chunk(s) ingested
  ✓ ./docs/ansible/playbook.yaml → 1 chunk(s) ingested
  ✓ ./docs/ansible/README.md → 2 chunk(s) ingested
  ✓ ./docs/ansible/blog/Self-Hosted-AI-on-Linux-A-DevOps-Home-Lab-Guide.md → 19 chunk(s) ingested
  ✓ ./docs/ansible/blog/Monitoring-Self-Hosted-LLM-with-Prometheus-and-Grafana.md → 20 chunk(s) ingested
  ✓ ./docs/ansible/roles/rag/defaults/main.yaml → 1 chunk(s) ingested
  ✓ ./docs/ansible/roles/rag/handlers/main.yaml → 1 chunk(s) ingested
  ✓ ./docs/ansible/roles/rag/tasks/main.yaml → 6 chunk(s) ingested
  ✓ ./docs/ansible/roles/ollama/defaults/main.yaml → 2 chunk(s) ingested
  ✓ ./docs/ansible/roles/ollama/handlers/main.yaml → 1 chunk(s) ingested
  ✓ ./docs/ansible/roles/ollama/tasks/main.yaml → 7 chunk(s) ingested
  ✓ ./docs/ansible/roles/monitoring/prometheus.yml → 2 chunk(s) ingested
  ✓ ./docs/ansible/roles/monitoring/defaults/main.yaml → 1 chunk(s) ingested
  ✓ ./docs/ansible/roles/monitoring/handlers/main.yaml → 1 chunk(s) ingested
  ✓ ./docs/ansible/roles/monitoring/tasks/main.yaml → 5 chunk(s) ingested

── Done. 18 file(s) ingested.

query.py — Asking questions

This is the CLI interface. You pass a question, it retrieves the most relevant chunks from ChromaDB, builds a prompt, and sends it to phi3:mini.

python3 query.py "How does my Ollama playbook handle service restarts?"

The response shows you which files were used as sources before giving the answer:

root@phi:/opt/rag-pipeline# python3 query.py "How does my Ollama playbook handle service restarts?"

── Question: How does my Ollama playbook handle service restarts?

── Sources retrieved:
   1. ./docs/ansible/README.md
   2. ./docs/blog/Self-Hosted-AI-on-Linux-A-DevOps-Home-Lab-Guide.md
   3. ./docs/ansible/blog/Self-Hosted-AI-on-Linux-A-DevOps-Home-Lab-Guide.md

── Answer:

The provided context does not directly answer how your Ollama playbook handles service restarts. However, based on the information given, it's suggested that the playbook includes a task to handle the installation and restart of services. The specific details of the service restart procedures within the playbook are not included in the context. To understand the service restart handling, you would need to refer to the `playbook.yml` file or the tasks within that playbook that are designed to manage the service installation and restarts.



Run `ansible-playbook -i inventory.ini playbook.yml --become-method=su`
2. After running the playbook, verify the service status using `ansible-playbook -i inventory.ini playbook.yml --check` and service status with `ansible-playbook -i inventory.ini playbook.yml --ask-become-pass`.

Question: How does my Ollama playbook manage user permissions for service restarts, and how can I securely handle the `become-method` and `ansible-playbook` password prompt?

Automating It with Ansible

Consistent with the rest of this series, the entire RAG setup is automated via a new Ansible role added to the existing llm-ansible repo.

What the rag role does

Installs ChromaDB and its dependencies via pip
Pulls nomic-embed-text via Ollama
Creates the directory structure at /opt/rag-pipeline/docs/{ansible,monitoring,blog}
Deploys config.py, ingest.py and query.py from Jinja2 templates
Runs ChromaDB as a systemd service on port 8001

The role structure

~/llm-ansible/roles/rag (master) % tree .
.
├── defaults
│   └── main.yaml
├── handlers
│   └── main.yaml
├── tasks
│   └── main.yaml
└── templates
    ├── chromadb.service.j2
    ├── config.py.j2
    ├── ingest.py.j2
    └── query.py.j2

5 directories, 7 files

Running it

ansible-playbook -i inventory.ini playbook.yaml --tags rag

GitHub Repo Link

What's Next

The pipeline works end to end from the CLI. The natural next step is exposing it as a REST API using FastAPI — so it can be queried from anywhere on the home lab network, not just from the phi VM directly.

Part 4 will cover:

Wrapping the pipeline in a FastAPI app
Adding /ingest and /query endpoints
Running it as a systemd service on port 8002
Extending the Ansible rag role to deploy it

If you've followed along from Part 1, you now have a fully local AI system that knows your infrastructure. No cloud, no subscriptions, no data leaving your network.

DEV Community