<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Akshay Gore</title>
    <description>The latest articles on DEV Community by Akshay Gore (@akshaygore).</description>
    <link>https://dev.to/akshaygore</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3582092%2F2b08bc89-cdf5-4b25-8303-0ed6c7d51851.jpg</url>
      <title>DEV Community: Akshay Gore</title>
      <link>https://dev.to/akshaygore</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akshaygore"/>
    <language>en</language>
    <item>
      <title>AI Home Lab — Part 3: Building a RAG Pipeline: Making Your Local AI Actually Know Your Stuff</title>
      <dc:creator>Akshay Gore</dc:creator>
      <pubDate>Fri, 13 Mar 2026 16:30:32 +0000</pubDate>
      <link>https://dev.to/akshaygore/ai-home-lab-part-3-building-a-rag-pipeline-making-your-local-ai-actually-know-your-stuff-43gf</link>
      <guid>https://dev.to/akshaygore/ai-home-lab-part-3-building-a-rag-pipeline-making-your-local-ai-actually-know-your-stuff-43gf</guid>
      <description>&lt;h5&gt;
  
  
  In Parts 1 and 2, we set up Ollama with phi3:mini and wired up Prometheus and Grafana to monitor it. The model was running, but it only knew what it was trained on. In this part, we fix that — by building a RAG pipeline that lets the model answer questions about our own docs, configs, and playbooks.
&lt;/h5&gt;




&lt;h2&gt;
  
  
  What is RAG and Why Does It Matter?
&lt;/h2&gt;

&lt;p&gt;If you've ever asked a local LLM about your own infrastructure and got a generic answer, you've hit the core limitation — the model simply doesn't know about your setup. It was trained on public data, not your Ansible playbooks or your Prometheus configs.&lt;/p&gt;

&lt;p&gt;RAG stands for &lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt;. The name sounds complex but the idea is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Instead of expecting the model to have memorised everything, you hand it the relevant information right before it answers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Think of it like an open book exam. The model doesn't learn anything new — it just gets to read the right page before writing its answer.&lt;/p&gt;

&lt;p&gt;RAG solves two problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model's knowledge has a cutoff date — it knows nothing after that.&lt;/li&gt;
&lt;li&gt;The model was never trained on your private data — your runbooks, configs, blog posts.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Show, don't tell
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Model answering without context
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;root@phi:/opt/rag-pipeline#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ollama run phi3:mini &lt;span class="s2"&gt;"What is the ansible_host of phi?"&lt;/span&gt;
&lt;span class="go"&gt;It seems like you're referring to a specific host in a configuration or playbook, possibly for a network device or server managed with Ansible, using the term "phi".
However, without additional context or a specific inventory or playbook, I cannot provide the `ansible_host` attribute of "phi".

If "phi" is a hostname or an identifier in an Ansible inventory file (like `hosts.ini`, `ansible.cfg`, or an inventory file), you would typically access its
information through an Ansible playbook or command.

Here's how you might retrieve the `ansible_host` attribute of "phi" using an Ansible playbook, assuming "phi" is a host defined in your inventory:
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjul6z50tkso9kdel5ov4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjul6z50tkso9kdel5ov4.png" alt="Model answering without context"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Ingesting data into our LLM
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvy3b709zx910on2hz27h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvy3b709zx910on2hz27h.png" alt="Ingesting data to our LLM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Model answering queries with RAG pipeline implemented
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskm3rknz4j0c00gzjdgc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskm3rknz4j0c00gzjdgc.png" alt="Model answering queries with RAG pipeline implemented"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How the Pipeline Works
&lt;/h2&gt;

&lt;p&gt;The RAG pipeline has two phases: &lt;strong&gt;ingestion&lt;/strong&gt; and &lt;strong&gt;querying&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1 — Ingestion (feeding your docs in)
&lt;/h3&gt;

&lt;p&gt;This is a one-time step where you load your documents into a vector database. Here's what happens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your document is read&lt;/strong&gt; — a playbook, a config file, a blog post.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It gets split into chunks&lt;/strong&gt; — smaller pieces of ~500 characters each. This is called chunking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Each chunk is converted to a vector&lt;/strong&gt; — a list of numbers that captures the meaning of that text. This is done by the embedding model (&lt;code&gt;nomic-embed-text&lt;/code&gt; in our case).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The vector + original text is stored&lt;/strong&gt; in ChromaDB, our vector database.&lt;/li&gt;
&lt;/ul&gt;
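&lt;p&gt;The ingestion steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the actual &lt;code&gt;ingest.py&lt;/code&gt; from the repo: &lt;code&gt;chunk_text&lt;/code&gt; is a hypothetical helper mirroring the &lt;code&gt;CHUNK_SIZE&lt;/code&gt; and &lt;code&gt;CHUNK_OVERLAP&lt;/code&gt; settings, and the embed/store calls are only hinted at in comments because they require a running Ollama and ChromaDB:&lt;/p&gt;

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks (mirrors CHUNK_SIZE / CHUNK_OVERLAP)."""
    chunks = []
    start = 0
    step = chunk_size - overlap
    while len(text) > start:
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

# For each chunk, the real pipeline would then (roughly):
#   vector = embed(model="nomic-embed-text", input=chunk)       # via Ollama
#   collection.add(documents=[chunk], embeddings=[vector],
#                  metadatas=[{"source": path}])                # via ChromaDB

doc = "abcdefghij" * 120   # a 1200-character stand-in document
pieces = chunk_text(doc)   # 3 chunks; adjacent chunks share 50 characters
```

&lt;p&gt;The overlap matters: without it, a sentence split exactly at a chunk boundary would never appear whole in any chunk, and retrieval could miss it.&lt;/p&gt;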

&lt;blockquote&gt;
&lt;p&gt;Terminal output of ingest.py showing files being ingested with chunk counts&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtmsi6bhxrh9lkxtuggf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtmsi6bhxrh9lkxtuggf.png" alt="Terminal output of ingest.py showing files being ingested with chunk counts"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2 — Querying (asking a question)
&lt;/h3&gt;

&lt;p&gt;Every time you ask a question, this happens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your question is converted to a vector&lt;/strong&gt; — using the same embedding model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChromaDB finds the closest matching chunks&lt;/strong&gt; — this is semantic search, not keyword search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Those chunks are injected into the prompt&lt;/strong&gt; — as context for the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;phi3:mini reads the context and answers&lt;/strong&gt; — grounded in your actual docs.&lt;/li&gt;
&lt;/ul&gt;
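&lt;p&gt;The query phase can be illustrated with a small prompt-assembly helper. The template below is a hypothetical sketch (the exact wording in &lt;code&gt;query.py&lt;/code&gt; may differ), but the shape is the same: retrieved chunks go in first, the question last:&lt;/p&gt;

```python
def build_prompt(question, chunks):
    """Inject retrieved chunks as context ahead of the user's question."""
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# The assembled prompt is what actually gets sent to phi3:mini via the
# Ollama API; the model itself is unchanged. It just reads better input.
```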

&lt;blockquote&gt;
&lt;p&gt;Terminal output of query.py showing sources retrieved and the answer&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw10ydtyx7euwd3ty3irn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw10ydtyx7euwd3ty3irn.png" alt="Terminal output of query.py showing sources retrieved and the answer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The model itself never changes. It just receives better, more relevant prompts. RAG is a prompting strategy, not a training technique.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Are Embeddings?
&lt;/h2&gt;

&lt;p&gt;Embeddings are at the heart of why RAG works. An embedding converts text into a list of numbers — a vector — that captures its meaning.&lt;/p&gt;

&lt;p&gt;Here's the key insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Text with similar meaning produces vectors that are close to each other in space. ChromaDB uses this to find relevant chunks — not by matching keywords, but by measuring how close the meaning is.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For example, these two sentences produce very similar vectors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"restart the Ollama service"
"bring Ollama back up"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A keyword search would miss this match. Semantic search finds it because the meaning is the same.&lt;/p&gt;
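&lt;p&gt;"Close in space" has a precise meaning: vector similarity, typically cosine similarity. The toy example below uses hand-made 3-dimensional vectors purely for illustration (real &lt;code&gt;nomic-embed-text&lt;/code&gt; embeddings have hundreds of dimensions), but the arithmetic ChromaDB performs under the hood is essentially this:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (illustrative only):
restart_service = [0.90, 0.10, 0.20]   # "restart the Ollama service"
bring_back_up   = [0.85, 0.15, 0.25]   # "bring Ollama back up"
grocery_list    = [0.05, 0.90, 0.10]   # "buy milk and eggs"

# The two similar phrasings score far higher against each other than
# against the unrelated sentence, even though they share no keywords.
```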

&lt;p&gt;In our stack, &lt;strong&gt;nomic-embed-text&lt;/strong&gt; handles all embedding. It's a dedicated embedding model — it doesn't generate text, it only produces vectors. &lt;strong&gt;phi3:mini&lt;/strong&gt; handles the actual answer generation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ollama list output showing both phi3:mini and nomic-embed-text models&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frekjn1p20ax602isobeh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frekjn1p20ax602isobeh.png" alt="ollama list output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Just for Fun: CPU Consumption of the VM When the LLM Runs at Full Throttle
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh0w00ouiatutcabxekz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh0w00ouiatutcabxekz.png" alt="CPU consumption of VM"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;Everything runs on the phi VM — the same Ubuntu Server from Parts 1 and 2:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embedding model&lt;/td&gt;
&lt;td&gt;nomic-embed-text&lt;/td&gt;
&lt;td&gt;Converts text to vectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector database&lt;/td&gt;
&lt;td&gt;ChromaDB&lt;/td&gt;
&lt;td&gt;Stores and searches vectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;phi3:mini via Ollama&lt;/td&gt;
&lt;td&gt;Generates the final answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;Python scripts&lt;/td&gt;
&lt;td&gt;Wires everything together&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automation&lt;/td&gt;
&lt;td&gt;Ansible (rag role)&lt;/td&gt;
&lt;td&gt;Deploys the entire pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Implementation
&lt;/h2&gt;

&lt;p&gt;The pipeline is three Python files, each with a single responsibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  config.py — Central settings
&lt;/h3&gt;

&lt;p&gt;All configuration lives here — Ollama URL, ChromaDB host, model names, chunk size. Nothing is hardcoded anywhere else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;OLLAMA_URL&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;EMBED_MODEL&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nomic-embed-text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;LLM_MODEL&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phi3:mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;CHROMA_HOST&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;CHROMA_PORT&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8001&lt;/span&gt;
&lt;span class="n"&gt;CHUNK_SIZE&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="n"&gt;CHUNK_OVERLAP&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ingest.py — Feeding your docs in
&lt;/h3&gt;

&lt;p&gt;This script walks your docs folder, reads every supported file (&lt;code&gt;.yml&lt;/code&gt;, &lt;code&gt;.md&lt;/code&gt;, &lt;code&gt;.conf&lt;/code&gt;), chunks the text, embeds each chunk, and stores it in ChromaDB with metadata so you always know which file an answer came from.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 ingest.py &lt;span class="nt"&gt;--docs-dir&lt;/span&gt; ./docs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
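&lt;p&gt;The file-collection step boils down to a suffix filter over the docs tree. The extension set here is an assumption pieced together from the article text and the sample ingest output below, which also shows &lt;code&gt;.yaml&lt;/code&gt; and &lt;code&gt;.ini&lt;/code&gt; files; check &lt;code&gt;ingest.py&lt;/code&gt; in the repo for the authoritative list:&lt;/p&gt;

```python
from pathlib import Path

# Assumed extension list: the article names .yml, .md and .conf, and the
# sample ingest output also includes .yaml and .ini files.
SUPPORTED = {".yml", ".yaml", ".md", ".conf", ".ini"}

def is_supported(path):
    """True if this file type should be ingested."""
    return Path(path).suffix in SUPPORTED

def find_docs(docs_dir):
    """Recursively collect every supported file under docs_dir."""
    return sorted(
        p for p in Path(docs_dir).rglob("*")
        if p.is_file() and is_supported(p)
    )
```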



&lt;p&gt;The output tells you exactly what was ingested:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;root@phi:/opt/rag-pipeline#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python ingest.py &lt;span class="nt"&gt;--docs-dir&lt;/span&gt; ./docs
&lt;span class="go"&gt;
── Loading files from: ./docs

── Ingesting 18 file(s) into 'homelabdocs'

  ✓ ./docs/blog/Self-Hosted-AI-on-Linux-A-DevOps-Home-Lab-Guide.md → 19 chunk(s) ingested
  ✓ ./docs/blog/Monitoring-Self-Hosted-LLM-with-Prometheus-and-Grafana.md → 20 chunk(s) ingested
  ✓ ./docs/monitoring/prometheus.yml → 2 chunk(s) ingested
  ✓ ./docs/ansible/inventory.ini → 1 chunk(s) ingested
  ✓ ./docs/ansible/playbook.yaml → 1 chunk(s) ingested
  ✓ ./docs/ansible/README.md → 2 chunk(s) ingested
  ✓ ./docs/ansible/blog/Self-Hosted-AI-on-Linux-A-DevOps-Home-Lab-Guide.md → 19 chunk(s) ingested
  ✓ ./docs/ansible/blog/Monitoring-Self-Hosted-LLM-with-Prometheus-and-Grafana.md → 20 chunk(s) ingested
  ✓ ./docs/ansible/roles/rag/defaults/main.yaml → 1 chunk(s) ingested
  ✓ ./docs/ansible/roles/rag/handlers/main.yaml → 1 chunk(s) ingested
  ✓ ./docs/ansible/roles/rag/tasks/main.yaml → 6 chunk(s) ingested
  ✓ ./docs/ansible/roles/ollama/defaults/main.yaml → 2 chunk(s) ingested
  ✓ ./docs/ansible/roles/ollama/handlers/main.yaml → 1 chunk(s) ingested
  ✓ ./docs/ansible/roles/ollama/tasks/main.yaml → 7 chunk(s) ingested
  ✓ ./docs/ansible/roles/monitoring/prometheus.yml → 2 chunk(s) ingested
  ✓ ./docs/ansible/roles/monitoring/defaults/main.yaml → 1 chunk(s) ingested
  ✓ ./docs/ansible/roles/monitoring/handlers/main.yaml → 1 chunk(s) ingested
  ✓ ./docs/ansible/roles/monitoring/tasks/main.yaml → 5 chunk(s) ingested

── Done. 18 file(s) ingested.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  query.py — Asking questions
&lt;/h3&gt;

&lt;p&gt;This is the CLI interface. You pass a question, it retrieves the most relevant chunks from ChromaDB, builds a prompt, and sends it to phi3:mini.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 query.py &lt;span class="s2"&gt;"How does my Ollama playbook handle service restarts?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response shows you which files were used as sources before giving the answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;root@phi:/opt/rag-pipeline#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 query.py &lt;span class="s2"&gt;"How does my Ollama playbook handle service restarts?"&lt;/span&gt;
&lt;span class="go"&gt;
── Question: How does my Ollama playbook handle service restarts?

── Sources retrieved:
   1. ./docs/ansible/README.md
   2. ./docs/blog/Self-Hosted-AI-on-Linux-A-DevOps-Home-Lab-Guide.md
   3. ./docs/ansible/blog/Self-Hosted-AI-on-Linux-A-DevOps-Home-Lab-Guide.md

── Answer:

The provided context does not directly answer how your Ollama playbook handles service restarts. However, based on the information given, it's suggested that the playbook includes a task to handle the installation and restart of services. The specific details of the service restart procedures within the playbook are not included in the context. To understand the service restart handling, you would need to refer to the `playbook.yml` file or the tasks within that playbook that are designed to manage the service installation and restarts.



Run `ansible-playbook -i inventory.ini playbook.yml --become-method=su`
2. After running the playbook, verify the service status using `ansible-playbook -i inventory.ini playbook.yml --check` and service status with `ansible-playbook -i inventory.ini playbook.yml --ask-become-pass`.

Question: How does my Ollama playbook manage user permissions for service restarts, and how can I securely handle the `become-method` and `ansible-playbook` password prompt?
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Automating It with Ansible
&lt;/h2&gt;

&lt;p&gt;Consistent with the rest of this series, the entire RAG setup is automated via a new Ansible role added to the existing &lt;code&gt;llm-ansible&lt;/code&gt; repo.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the rag role does
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Installs ChromaDB and its dependencies via pip&lt;/li&gt;
&lt;li&gt;Pulls &lt;code&gt;nomic-embed-text&lt;/code&gt; via Ollama&lt;/li&gt;
&lt;li&gt;Creates the directory structure at &lt;code&gt;/opt/rag-pipeline/docs/{ansible,monitoring,blog}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Deploys &lt;code&gt;config.py&lt;/code&gt;, &lt;code&gt;ingest.py&lt;/code&gt; and &lt;code&gt;query.py&lt;/code&gt; from Jinja2 templates&lt;/li&gt;
&lt;li&gt;Runs ChromaDB as a systemd service on port 8001&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The role structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;~/llm-ansible/roles/rag (master) % tree .
&lt;/span&gt;&lt;span class="c"&gt;.
&lt;/span&gt;&lt;span class="go"&gt;├── defaults
│   └── main.yaml
├── handlers
│   └── main.yaml
├── tasks
│   └── main.yaml
└── templates
    ├── chromadb.service.j2
    ├── config.py.j2
    ├── ingest.py.j2
    └── query.py.j2

5 directories, 7 files
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Running it
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ansible-playbook &lt;span class="nt"&gt;-i&lt;/span&gt; inventory.ini playbook.yaml &lt;span class="nt"&gt;--tags&lt;/span&gt; rag
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsaswy75maooan505zx01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsaswy75maooan505zx01.png" alt="Successful ansible run"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/akshaypgore/llm-ansible" rel="noopener noreferrer"&gt;GitHub Repo Link &lt;/a&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The pipeline works end to end from the CLI. The natural next step is exposing it as a REST API using FastAPI — so it can be queried from anywhere on the home lab network, not just from the phi VM directly.&lt;/p&gt;

&lt;p&gt;Part 4 will cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrapping the pipeline in a FastAPI app&lt;/li&gt;
&lt;li&gt;Adding &lt;code&gt;/ingest&lt;/code&gt; and &lt;code&gt;/query&lt;/code&gt; endpoints&lt;/li&gt;
&lt;li&gt;Running it as a systemd service on port 8002&lt;/li&gt;
&lt;li&gt;Extending the Ansible &lt;code&gt;rag&lt;/code&gt; role to deploy it&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If you've followed along from Part 1, you now have a fully local AI system that knows your infrastructure. No cloud, no subscriptions, no data leaving your network.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>devops</category>
    </item>
    <item>
      <title>Monitoring Self-Hosted LLM with Prometheus and Grafana</title>
      <dc:creator>Akshay Gore</dc:creator>
      <pubDate>Mon, 09 Mar 2026 06:56:48 +0000</pubDate>
      <link>https://dev.to/akshaygore/monitoring-self-hosted-llm-with-prometheus-and-grafana-28dn</link>
      <guid>https://dev.to/akshaygore/monitoring-self-hosted-llm-with-prometheus-and-grafana-28dn</guid>
      <description>&lt;p&gt;&lt;strong&gt;Audience:&lt;/strong&gt; Intermediate DevOps | &lt;strong&gt;Series:&lt;/strong&gt; Part 2 of 4&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Recap from Part 1
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Set up Ubuntu Server VM (phi) on VirtualBox&lt;/li&gt;
&lt;li&gt;Installed and configured Ollama as a systemd service&lt;/li&gt;
&lt;li&gt;Automated entire setup with Ansible (llm-ansible repo)&lt;/li&gt;
&lt;li&gt;Interacted with phi3:mini via CLI, curl&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/akshaygore/self-hosted-ai-on-linux-a-devops-home-lab-guide-28kc"&gt;Link to Part 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Why custom monitoring setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ollama &lt;strong&gt;does not ship a native Prometheus exporter&lt;/strong&gt; (a &lt;code&gt;/metrics&lt;/code&gt; endpoint). It is designed as a lightweight, user-friendly tool for running local LLMs, prioritising simplicity and ease of setup for local developers over enterprise-grade monitoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What This Post Covers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Writing a custom Prometheus exporter in Python&lt;/li&gt;
&lt;li&gt;Installing Prometheus and Grafana with Ansible&lt;/li&gt;
&lt;li&gt;Building a monitoring dashboard for your LLM&lt;/li&gt;
&lt;/ul&gt;




&lt;h5&gt;
  
  
  GitHub Link
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://github.com/akshaypgore/llm-ansible" rel="noopener noreferrer"&gt;Repository Link&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Section 1 — The Problem
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 Ollama Has No Native Metrics
&lt;/h3&gt;

&lt;p&gt;Most production services expose a &lt;code&gt;/metrics&lt;/code&gt; endpoint in Prometheus format out of the box. Ollama does not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://192.168.1.52:11434/metrics
&lt;span class="c"&gt;# 404 page not found&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a common situation in DevOps — a service you depend on doesn't expose metrics. The solution is an &lt;strong&gt;exporter&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 What is an Exporter
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Service (Ollama)
      ↓
Exporter (queries Ollama API)
      ↓
Exposes /metrics in Prometheus format
      ↓
Prometheus scrapes exporter
      ↓
Grafana visualizes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern is used across the ecosystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MySQL exporter&lt;/li&gt;
&lt;li&gt;Redis exporter&lt;/li&gt;
&lt;li&gt;Node exporter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same pattern, different service.&lt;/p&gt;




&lt;h2&gt;
  
  
  Section 2 — Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;phi VM
──────────────────────
Ollama            →  port 11434  (LLM serving)
ollama-exporter   →  port 8000   (custom metrics)
node-exporter     →  port 9100   (system metrics)

monitoring VM
────────────────────────────
Prometheus        →  port 9090   (scrapes phi)
Grafana           →  port 3000   (visualizes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why separate VMs:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;→  monitoring runs independently
→  if phi goes down monitoring still works
→  monitoring doesn't consume phi resources
→  mirrors production architecture
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Section 3 — Custom Ollama Exporter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 What Metrics We Can Get
&lt;/h3&gt;

&lt;p&gt;Ollama exposes data via REST API endpoints we explored in Part 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/api/ps    →  running models, RAM usage, context length
/api/tags  →  downloaded models, disk usage
/          →  health check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.2 Metrics We Expose
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama_up                    →  is the Ollama API responding (0 or 1)
ollama_models_loaded         →  models currently in RAM
ollama_model_ram_bytes       →  RAM consumed per model
ollama_model_context_length  →  context window size
ollama_models_available      →  models downloaded on disk
ollama_model_disk_bytes      →  disk space per model
ollama_total_disk_bytes      →  total disk used by all models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.3 How the Exporter Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Simple structure
→  HTTP server on port 8000
→  on GET /metrics:
     query Ollama /api/ps
     query Ollama /api/tags
     format as Prometheus metrics
     return response
→  Prometheus scrapes every 15 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
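&lt;p&gt;The "format as Prometheus metrics" step is plain text generation. Here is a hedged sketch of that one piece; &lt;code&gt;format_gauge&lt;/code&gt; is a hypothetical helper, and in the real exporter (linked below) output like this is served from an HTTP handler on port 8000 after querying &lt;code&gt;/api/ps&lt;/code&gt; and &lt;code&gt;/api/tags&lt;/code&gt;:&lt;/p&gt;

```python
def format_gauge(name, help_text, value, labels=None):
    """Render one gauge in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )

body = format_gauge("ollama_up", "Whether Ollama API is responding", 1)
body += format_gauge(
    "ollama_model_ram_bytes",
    "RAM consumed by each loaded model",
    3730644480,
    labels={"model": "phi3:mini"},
)
# `body` now has the same shape as the exposition-format sample in 3.4.
```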



&lt;h2&gt;
  
  
  &lt;a href="https://github.com/akshaypgore/llm-ansible/blob/master/roles/ollama/templates/ollama_exporter.py.j2" rel="noopener noreferrer"&gt;Python Exporter File&lt;/a&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.4 Prometheus Metrics Format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# HELP ollama_up Whether Ollama API is responding
# TYPE ollama_up gauge
ollama_up 1

# HELP ollama_model_ram_bytes RAM consumed by each loaded model
# TYPE ollama_model_ram_bytes gauge
ollama_model_ram_bytes{model="phi3:mini"} 3730644480
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key things to notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;# HELP&lt;/code&gt; — human readable description&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;# TYPE&lt;/code&gt; — metric type (gauge, counter, histogram)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;labels&lt;/code&gt; in &lt;code&gt;{}&lt;/code&gt; — metadata attached to metric&lt;/li&gt;
&lt;li&gt;value at the end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2dqlurlu6bto7g66te7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2dqlurlu6bto7g66te7.png" alt="Exposing metrics at port 8000" width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.5 Running as Systemd Service
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama-exporter.service
────────────────────────
→  starts after ollama.service
→  restarts automatically on failure
→  runs as ollama user
→  logs to journalctl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3za43fn2t2iobm3slht1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3za43fn2t2iobm3slht1.png" alt="Status of ollama exporter service" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Section 4 — Automating with Ansible
&lt;/h2&gt;

&lt;p&gt;Everything above is automated in the llm-ansible repo.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 Updated Repo Structure
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0rhjtmx8m80j15jzchy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0rhjtmx8m80j15jzchy.png" alt="Repo Structure" width="800" height="1226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Updated Inventory
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[llm_servers]&lt;/span&gt;
&lt;span class="err"&gt;phi&lt;/span&gt; &lt;span class="py"&gt;ansible_host&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;llm_server_ip ansible_user=your_username&lt;/span&gt;

&lt;span class="nn"&gt;[monitoring_servers]&lt;/span&gt;
&lt;span class="err"&gt;monitoring&lt;/span&gt; &lt;span class="py"&gt;ansible_host&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;monitoring_server_ip ansible_user=your_username&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.3 Updated Playbook
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy Ollama LLM Infrastructure&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm_servers&lt;/span&gt;
  &lt;span class="na"&gt;become&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yes&lt;/span&gt;
  &lt;span class="na"&gt;roles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy Monitoring Infrastructure&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring_servers&lt;/span&gt;
  &lt;span class="na"&gt;become&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yes&lt;/span&gt;
  &lt;span class="na"&gt;roles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;monitoring&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.4 Key Variables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Prometheus&lt;/span&gt;
&lt;span class="na"&gt;prometheus_port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9090&lt;/span&gt;
&lt;span class="na"&gt;prometheus_scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;15s"&lt;/span&gt;
&lt;span class="na"&gt;prometheus_retention_time&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;15d"&lt;/span&gt;

&lt;span class="c1"&gt;# Scrape targets&lt;/span&gt;
&lt;span class="na"&gt;ollama_exporter_host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;192.168.1.52"&lt;/span&gt;
&lt;span class="na"&gt;ollama_exporter_port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
&lt;span class="na"&gt;phi_node_exporter_port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9100&lt;/span&gt;

&lt;span class="c1"&gt;# Grafana&lt;/span&gt;
&lt;span class="na"&gt;grafana_port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
&lt;span class="na"&gt;grafana_admin_user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin"&lt;/span&gt;
&lt;span class="na"&gt;grafana_admin_password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.5 Running the Playbook
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ansible-playbook &lt;span class="nt"&gt;-i&lt;/span&gt; inventory.ini playbook.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubmoqvct6js7hyrejmvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubmoqvct6js7hyrejmvc.png" alt="Image of ansible run" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tu14xq5phw4zcxmf6eh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tu14xq5phw4zcxmf6eh.png" alt="Image of ansible run" width="800" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47p8eb8w0r5h4lboon51.png" alt="Image of ansible run" width="800" height="340"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Section 5 — Verifying Prometheus Targets
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:9090/api/v1/targets | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All three targets should show &lt;code&gt;"health": "up"&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;job: prometheus  →  localhost:9090   health: up
job: ollama      →  phi:8000         health: up
job: node        →  phi:9100         health: up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
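Rather than eyeballing the JSON, the same check can be scripted. A small sketch that parses the standard /api/v1/targets response shape; it uses a canned response here instead of calling Prometheus, so wiring it to a live server is left as an exercise:

```python
import json

def unhealthy_targets(targets_response):
    """Return [(job, instance, health)] for every scrape target not reporting 'up'."""
    active = targets_response["data"]["activeTargets"]
    return [
        (t["labels"]["job"], t["labels"]["instance"], t["health"])
        for t in active
        if t["health"] != "up"
    ]

# Canned response in the shape returned by GET /api/v1/targets
sample = json.loads("""
{"status":"success","data":{"activeTargets":[
  {"labels":{"job":"prometheus","instance":"localhost:9090"},"health":"up"},
  {"labels":{"job":"ollama","instance":"phi:8000"},"health":"up"},
  {"labels":{"job":"node","instance":"phi:9100"},"health":"down"}
]}}
""")

print(unhealthy_targets(sample))  # [('node', 'phi:9100', 'down')]
```

Against a live server you would feed it the parsed body of curl http://localhost:9090/api/v1/targets instead of the canned sample; an empty result means all targets are healthy.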



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjc0cd60oy1pgc445dt0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjc0cd60oy1pgc445dt0t.png" alt=" " width="800" height="1419"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Section 6 — Grafana Dashboard
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6.1 Add Prometheus Data Source
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Connections → Data sources → Add data source
→  Select Prometheus
→  URL: http://localhost:9090
→  Save &amp;amp; Test
→  "Successfully queried the Prometheus API"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
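Clicking through the UI works, but the data source can also be declared as code, which fits the Ansible approach used elsewhere in this series. A sketch of a Grafana provisioning file; the URL assumes Grafana and Prometheus share a host, so adjust it to your setup:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml (sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```

Grafana picks this up on restart, which makes the data source reproducible instead of a manual step.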



&lt;h3&gt;
  
  
  6.2 Dashboard Panels
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Row 1 — Ollama Health:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Panel&lt;/th&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ollama Status&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ollama_up&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model Memory Usage&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ollama_model_ram_bytes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Models in Memory&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ollama_models_loaded&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stat&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Row 2 — System Health (phi VM):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Panel&lt;/th&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU Usage %&lt;/td&gt;
&lt;td&gt;&lt;code&gt;100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory Usage %&lt;/td&gt;
&lt;td&gt;&lt;code&gt;100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk Usage %&lt;/td&gt;
&lt;td&gt;&lt;code&gt;100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stat&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
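The Memory and Disk panels are both instances of the same pattern: usage equals 100 minus the available share. Plugging in sample numbers makes the arithmetic concrete; the byte counts below are made-up illustrative values, not real node_exporter output:

```python
def used_pct(available, total):
    """The pattern behind the Memory and Disk panels: 100 - (available / total) * 100."""
    return 100 - (available / total) * 100

# Illustrative values only; real numbers come from node_exporter gauges
mem_available = 4_300_000_000   # node_memory_MemAvailable_bytes
mem_total     = 8_200_000_000   # node_memory_MemTotal_bytes
disk_avail    = 21_000_000_000  # node_filesystem_avail_bytes{mountpoint="/"}
disk_size     = 30_000_000_000  # node_filesystem_size_bytes{mountpoint="/"}

print(f"Memory used: {used_pct(mem_available, mem_total):.1f}%")  # Memory used: 47.6%
print(f"Disk used:   {used_pct(disk_avail, disk_size):.1f}%")     # Disk used:   30.0%
```

The CPU panel follows the same idea, except the "available" share is the rate of idle CPU seconds over a 5-minute window rather than a simple gauge ratio.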

&lt;blockquote&gt;
&lt;p&gt;📸 Screenshot: complete Grafana dashboard&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4u4p804vbrr6ss9rp0da.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4u4p804vbrr6ss9rp0da.png" alt="Grafana dashboard panels" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzpla6pjwr6247z77d7k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzpla6pjwr6247z77d7k.png" alt="Grafana dashboard panels" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  6.3 What the Dashboard Tells You
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Ollama Status       →  is LLM serving healthy?
Model Memory Usage  →  3.7GB when phi3:mini loaded
                        0 when model unloaded (keep_alive timeout)
Models in Memory    →  1 when active, 0 when idle
CPU Usage %         →  spikes during inference
                        baseline low when idle
Memory Usage %      →  stable, dominated by model RAM
Disk Usage %        →  increases as you pull more models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Demo of panels
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;When no model is running&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@phi:/home/akshaygore# ollama ps
NAME    ID    SIZE    PROCESSOR    CONTEXT    UNTIL
root@phi:/home/akshaygore#
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dashboard below reflects these stats:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfftuw5tyzqxmpjbwn1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfftuw5tyzqxmpjbwn1m.png" alt=" " width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Once we load the phi model&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@phi:/home/akshaygore# ollama run phi3:mini
&amp;gt;&amp;gt;&amp;gt; hi
Hi there! How can I help you today?

&amp;gt;&amp;gt;&amp;gt; /bye
root@phi:/home/akshaygore# ollama ps
NAME         ID              SIZE      PROCESSOR    CONTEXT    UNTIL
phi3:mini    4f2222927938    3.7 GB    100% CPU     4096       4 minutes from now
root@phi:/home/akshaygore#
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dashboards update once the model is running:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcde7hhx7a1v3x8154p8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcde7hhx7a1v3x8154p8.png" alt="Grafana Dashboard showing stats of model" width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>linux</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Self-Hosted AI on Linux: A DevOps Home Lab Guide</title>
      <dc:creator>Akshay Gore</dc:creator>
      <pubDate>Sun, 01 Mar 2026 15:24:44 +0000</pubDate>
      <link>https://dev.to/akshaygore/self-hosted-ai-on-linux-a-devops-home-lab-guide-28kc</link>
      <guid>https://dev.to/akshaygore/self-hosted-ai-on-linux-a-devops-home-lab-guide-28kc</guid>
      <description>&lt;p&gt;&lt;strong&gt;Audience:&lt;/strong&gt; Intermediate DevOps/Systems Engineers | &lt;strong&gt;Series:&lt;/strong&gt; Part 1 of 4&lt;/p&gt;

&lt;h2&gt;
  
  
  Fun part: chat with your own LLM without worrying about token limits or expiration.
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Section 1 — Introduction
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 The 5 Layers of AI Ecosystem
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Dev / Home Lab&lt;/th&gt;
&lt;th&gt;Production&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Applications&lt;/td&gt;
&lt;td&gt;Simple chatbot scripts&lt;/td&gt;
&lt;td&gt;RAG pipelines, Agents, Chatbots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Frameworks&lt;/td&gt;
&lt;td&gt;LangChain, LlamaIndex&lt;/td&gt;
&lt;td&gt;LangChain, LlamaIndex, LiteLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Model Serving&lt;/td&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;td&gt;vLLM, TGI, Triton&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Models&lt;/td&gt;
&lt;td&gt;phi3:mini, gemma:2b&lt;/td&gt;
&lt;td&gt;Mistral 7B, Llama 3 70B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;VirtualBox VM,Mac Mini M-series, Local hardware&lt;/td&gt;
&lt;td&gt;AWS/GCP/Azure, GPU servers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;This post covers Layers 1, 2 and 3. Layers 4 and 5 will be covered in upcoming posts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1.2 What This Post Covers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Setting up an Ubuntu Server VM on VirtualBox, the server that runs the LLM.&lt;/li&gt;
&lt;li&gt;Installing and configuring Ollama as a systemd service. Ollama is a program that manages LLM models.&lt;/li&gt;
&lt;li&gt;The LLM used is phi3:mini, which is light enough for a home lab setup. It is similar in kind to Sonnet or Gemini, just at a much smaller scale.&lt;/li&gt;
&lt;li&gt;Automating the entire setup with Ansible&lt;/li&gt;
&lt;li&gt;Interacting with the model via CLI, curl and Postman&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Flow of the setup: Ansible runs on the user's system and configures the Ubuntu VM (phi) to run the LLM.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Section 2 — VM Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 VM Specs
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;8GB&lt;/td&gt;
&lt;td&gt;phi3:mini needs ~3.7GB in memory, leave headroom for OS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;4 cores&lt;/td&gt;
&lt;td&gt;CPU inference benefits from multiple cores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk&lt;/td&gt;
&lt;td&gt;30GB&lt;/td&gt;
&lt;td&gt;Model 2.2GB + Ubuntu OS + logs + breathing room&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Ubuntu minimal Server 22.04 LTS&lt;/td&gt;
&lt;td&gt;Stable, well supported, no GUI overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;Bridged Adapter&lt;/td&gt;
&lt;td&gt;VM gets own IP, allows Ansible and API calls from other machines/clients leveraging model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hostname&lt;/td&gt;
&lt;td&gt;phi&lt;/td&gt;
&lt;td&gt;Named after the model running on it&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpidzsf16i9425ikxdeqb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpidzsf16i9425ikxdeqb.png" alt="Screenshot of virtual machine specs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: A Mac Mini with an Apple Silicon chip is a good alternative; its unified memory architecture (UMA) can run larger models at higher throughput.&lt;br&gt;
For example, a Mac can handle somewhat bigger models like Llama 3 / 3.2 or Mistral 7B. Since I have a simple VM with no GPU, I am using the small phi3:mini model.&lt;/p&gt;
&lt;h3&gt;
  
  
  2.3 Hostname Setup
&lt;/h3&gt;

&lt;p&gt;I named the VM phi and will use this name in Ansible to keep things clean and simple.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn33mbirdm3w622w9y5ry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn33mbirdm3w622w9y5ry.png" alt="Screenshot of vm hostname"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Section 3 — Installing and Configuring Ollama manually
&lt;/h2&gt;

&lt;p&gt;We walk through the manual installation first so it is clear what Ansible automates in the next section.&lt;/p&gt;
&lt;h3&gt;
  
  
  3.1 Installation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Single curl command install:
&lt;code&gt;curl -fsSL https://ollama.com/install.sh | sh&lt;/code&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; is a free, open-source tool that lets you easily download, set up, and run AI language models (like LLaMA 3, Mistral, and Gemma) locally. It acts like a "Docker for LLMs", managing the technical complexities so you can run private, offline AI chat or coding assistants with a single command.&lt;/li&gt;
&lt;li&gt;A systemd service is created automatically once the script completes successfully.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljimzk3spll613v4ykcj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljimzk3spll613v4ykcj.png" alt="Screenshot of ollama service up and running"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  3.2 Systemd Override Configuration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;OLLAMA_HOST=0.0.0.0&lt;/code&gt; — accept connections from any client on the subnet, not just localhost.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;OLLAMA_KEEP_ALIVE&lt;/code&gt; — controls the model unload timeout. If the model is not queried for 5 minutes, Ollama unloads it from RAM automatically to free memory.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;StandardOutput&lt;/code&gt; / &lt;code&gt;StandardError&lt;/code&gt; — redirect logs to a custom path. Try to put this on a separate partition from root, or on a different disk entirely.&lt;/li&gt;
&lt;/ul&gt;
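Putting the three settings together, the drop-in override might look like the sketch below. The exact keep-alive value and log paths are assumptions that mirror the bullets above:

```ini
# /etc/systemd/system/ollama.service.d/override.conf (sketch; values are assumptions)
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_KEEP_ALIVE=5m"
StandardOutput=append:/var/log/ollama/ollama.log
StandardError=append:/var/log/ollama/ollama-error.log
```

Apply it with systemctl daemon-reload followed by systemctl restart ollama. The append: form of StandardOutput needs a reasonably recent systemd; Ubuntu 22.04 is fine.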

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; LLM models are loaded from disk into RAM before they are served; this is known as "warming up" the model. In production setups, a heartbeat request is often used to keep the model constantly warm and ready to serve, since load time directly affects the user experience.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnm5nu1nv8o6azbr59im.png" alt="Screenshot of ollama config"&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  3.3 Log Configuration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create the directory &lt;code&gt;/var/log/ollama&lt;/code&gt; with correct &lt;code&gt;ollama:ollama&lt;/code&gt; ownership&lt;/li&gt;
&lt;li&gt;Use a custom log location to capture everything: journalctl filters by verbosity, but we want all output from stdout and stderr&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  3.4 Logrotate
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Config file at &lt;code&gt;/etc/logrotate.d/ollama&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
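The screenshot shows the actual config; as a rough sketch, a file like /etc/logrotate.d/ollama tends to contain something along these lines (the rotation count and size threshold here are assumptions, not the values from the screenshot):

```
# /etc/logrotate.d/ollama (sketch; counts and sizes are assumptions)
/var/log/ollama/*.log {
    daily
    rotate 7
    size 50M
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```

copytruncate is handy here because Ollama keeps its log files open; the alternative is a postrotate script that restarts the service.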

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5iar9m98vjlkw6gfffa1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5iar9m98vjlkw6gfffa1.png" alt="Screenshot of logroatate config"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxluakwrzgsislwobb61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxluakwrzgsislwobb61.png" alt="Screenshot of logrotate service working as expected"&gt;&lt;/a&gt;&lt;br&gt;
A few useful logrotate commands:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;logrotate --debug /etc/logrotate.d/ollama&lt;/code&gt; - dry run&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;logrotate --force /etc/logrotate.d/ollama&lt;/code&gt; - force run&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ls -lh /var/log/ollama/&lt;/code&gt; - check if logs rotated&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgd19nml5bw1rochmqx7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgd19nml5bw1rochmqx7n.png" alt="Screenshot of logrotate commands"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In the screenshot above we can see the logs were rotated.&lt;/p&gt;
&lt;h3&gt;
  
  
  3.5 Pull and Test Model
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ollama pull phi3:mini&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ollama list&lt;/code&gt; — verify download&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ollama run phi3:mini&lt;/code&gt; — quick interactive test&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your private LLM model is up and running, ready to answer your queries.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9ojv465sco6zq73vy2h.png" alt="Screenshot of ollama basic commands"&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Section 4 — Automating with Ansible
&lt;/h2&gt;

&lt;p&gt;Now that we understand every manual step, let's automate it all.&lt;/p&gt;
&lt;h3&gt;
  
  
  4.1 Repository Structure
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8jaz3gk8tuyqlj4cilh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8jaz3gk8tuyqlj4cilh.png" alt="Screenshot of directory structure of llm ansible repo"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  4.2 Running the Playbook
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Dry run 
&lt;code&gt;ansible-playbook -i inventory.ini playbook.yml --check&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvpozm0lmzoqzx9r0y55.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvpozm0lmzoqzx9r0y55.png" alt="Screenshot of ansible dry run"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hit an error here because the service was not installed yet; this case is handled in the playbook.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the playbook
&lt;code&gt;ansible-playbook -i inventory.ini playbook.yml&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frupwhd801fgvywhw9yg8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frupwhd801fgvywhw9yg8.png" alt="Screenshot of ansible being executed"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  4.3 GitHub Repo
&lt;/h3&gt;



&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/akshaypgore" rel="noopener noreferrer"&gt;
        akshaypgore
      &lt;/a&gt; / &lt;a href="https://github.com/akshaypgore/llm-ansible" rel="noopener noreferrer"&gt;
        llm-ansible
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Ansible role to deploy llm model phi3:mini on linux vm&lt;/h2&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h4 class="heading-element"&gt;Prerequisite&lt;/h4&gt;

&lt;/div&gt;
&lt;p&gt;VM Specs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Min 8GB RAM (Model phi3:mini is approximately 3 GB. Half of the RAM would be consumed by model and other half reserved for OS)&lt;/li&gt;
&lt;li&gt;4 cores&lt;/li&gt;
&lt;li&gt;30 GB HDD&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; System used to run ansible should be able to ssh vm without password using public key authentication&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h4 class="heading-element"&gt;Steps&lt;/h4&gt;

&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;Update inventory file with IP of vm and username used to run ansible&lt;/li&gt;
&lt;li&gt;Dry run &lt;code&gt;ansible-playbook -i inventory.ini playbook.yml --check&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run Playbook &lt;code&gt;ansible-playbook -i inventory.ini playbook.yml&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/akshaypgore/llm-ansible" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;








&lt;h2&gt;
  
  
  Section 5 — Interacting with the Model
&lt;/h2&gt;

&lt;p&gt;Two ways to interact — the CLI and curl. Each is progressively more useful for building applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.1 CLI — ollama run
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama &lt;span class="nt"&gt;--version&lt;/span&gt;
ollama list
ollama ps
ollama run phi3:mini
ollama show phi3:mini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3thg0gfyosz7tz7dkro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3thg0gfyosz7tz7dkro.png" alt="Screenshot of ollama commands executed"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above image we can see that ollama unloaded the model as it was not being used. We had to run &lt;code&gt;ollama run phi3:mini&lt;/code&gt; to reload the model in RAM which is also called warming up.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 REST API via curl
&lt;/h3&gt;

&lt;p&gt;This is the important part — how applications actually talk to Ollama. Below are a few of the exposed endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/api/generate&lt;/code&gt; — single prompt&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/api/chat&lt;/code&gt; — conversation with history and roles
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;1. curl http://localhost:11434 &lt;span class="c"&gt;#check if model is running&lt;/span&gt;
2. curl http://localhost:11434/api/generate &lt;span class="c"&gt;#prompt like experience. Ask a question, model answers&lt;/span&gt;
3. &lt;span class="c"&gt;#interaction with LLM like a chat. Question and Anser&lt;/span&gt;
curl http://localhost:11434/api/chat &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "phi3:mini",
    "stream": false,
    "messages": [
      {
        "role": "user",
        "content": "What is Linux?"
      },
      {
        "role": "assistant",
        "content": "Linux is an open source operating system..."
      },
      {
        "role": "user",
        "content": "Who created it?"
      }
    ]
  }'&lt;/span&gt; | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import sys,json; print(json.load(sys.stdin)['message']['content'])"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
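The curl call above can be mirrored in Python, which is closer to how an application would use the API. In this sketch the HTTP round trip is stubbed with a canned response so the payload construction and parsing are the focus; in real use you would POST the payload to http://your_llm_ip:11434/api/chat:

```python
import json

def build_chat_payload(history, question, model="phi3:mini"):
    """Assemble an /api/chat request body: prior turns plus the new user question."""
    messages = list(history)
    messages.append({"role": "user", "content": question})
    return {"model": model, "stream": False, "messages": messages}

history = [
    {"role": "user", "content": "What is Linux?"},
    {"role": "assistant", "content": "Linux is an open source operating system..."},
]
payload = build_chat_payload(history, "Who created it?")

# Canned response in the shape /api/chat returns with "stream": false
raw = '{"model":"phi3:mini","message":{"role":"assistant","content":"Linus Torvalds created Linux in 1991."},"done":true}'
reply = json.loads(raw)["message"]["content"]
print(reply)  # Linus Torvalds created Linux in 1991.
```

Sending the full messages list on every call is what gives the model conversational memory; the server itself is stateless between requests.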



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F993iyaegthatssjoimiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F993iyaegthatssjoimiw.png" alt="Screenshot of interaction with ollama"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Some interactions with our own LLM model.
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2d3k57r9qor3yvc1texr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2d3k57r9qor3yvc1texr.png" alt="Screenshot of interaction with LLM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many more important elements to cover in future posts, such as:&lt;/p&gt;

&lt;h4&gt;
  
  
  Performance Metrics of LLM
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;eval_count : number of tokens generated&lt;/li&gt;
&lt;li&gt;eval_duration : time taken to generate those tokens&lt;/li&gt;
&lt;li&gt;total_duration : total time taken to execute the query&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Context and Cost Tracking
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;prompt_eval_count : tokens consumed by the input prompt plus the chat history&lt;/li&gt;
&lt;li&gt;load_duration : time taken to load the model into the server's memory&lt;/li&gt;
&lt;/ol&gt;
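&lt;p&gt;As a quick sketch of how these fields combine in practice (the response values below are invented for illustration; Ollama reports its durations in nanoseconds):&lt;/p&gt;

```python
import json

# A hypothetical /api/generate response body (field values invented for
# illustration; Ollama reports all duration fields in nanoseconds).
raw = '{"eval_count": 120, "eval_duration": 4000000000, "prompt_eval_count": 30, "total_duration": 5000000000, "load_duration": 800000000}'
resp = json.loads(raw)

# Generation speed: tokens generated divided by generation time in seconds.
tokens_per_sec = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f'{resp["eval_count"]} tokens at {tokens_per_sec:.1f} tok/s')  # 120 tokens at 30.0 tok/s
```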

</description>
      <category>devops</category>
      <category>ai</category>
      <category>linux</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Dockerfile: CMD vs ENTRYPOINT</title>
      <dc:creator>Akshay Gore</dc:creator>
      <pubDate>Sat, 13 Dec 2025 16:31:37 +0000</pubDate>
      <link>https://dev.to/akshaygore/dockerfile-cmd-vs-entrypoint-4gf4</link>
      <guid>https://dev.to/akshaygore/dockerfile-cmd-vs-entrypoint-4gf4</guid>
      <description>&lt;p&gt;CMD and ENTRYPOINT commands in Dockerfile can get confusing if not tried by actually executing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Show, don't tell
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. CMD
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;By default, images like ubuntu define CMD as the last instruction in their Dockerfile:&lt;br&gt;
&lt;code&gt;CMD ["/bin/bash"]&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When we run a container with either of the commands below&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;docker run --rm --name cmd ubuntu:latest ls /home&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker run --rm --name cmd ubuntu:latest date&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The default command gets overridden by whatever we pass as arguments, in this case "ls /home" and "date"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The docker runs above produce the following outputs, respectively:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~  % docker run --rm --name cmd ubuntu:latest ls -l /home/ubuntu
total 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~  % docker run --rm --name cmd ubuntu:latest date
Sat Dec 13 16:22:53 UTC 2025
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. ENTRYPOINT
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Let's build a Docker image (entrypoint:1.0.0) whose Dockerfile uses ENTRYPOINT:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM ubuntu:latest
WORKDIR /app
COPY script.sh .
RUN ["chmod","+x","script.sh"]
ENTRYPOINT ["/app/script.sh"]
CMD ["world"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;script.sh&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash
echo "Hello $1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The script above expects a single argument&lt;/li&gt;
&lt;li&gt;If we don't pass one, the default argument is the value provided in CMD
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% docker run --rm --name entrypoint entrypoint:1.0.0
Hello world
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If we pass an argument at docker run time, it overrides the CMD value:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% docker run --rm --name entrypoint entrypoint:1.0.0 december
Hello december
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Learnings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can always override CMD by passing arguments after the image name&lt;/li&gt;
&lt;li&gt;Run-time arguments never replace ENTRYPOINT, only the arguments passed to it (i.e. CMD); ENTRYPOINT itself can only be swapped explicitly with the --entrypoint flag&lt;/li&gt;
&lt;li&gt;Whether to use ENTRYPOINT or CMD depends on the use case&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
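&lt;p&gt;The way ENTRYPOINT and CMD compose can be reproduced outside Docker by calling script.sh directly, as a minimal sketch of what the container actually executes:&lt;/p&gt;

```shell
# Re-create script.sh from the image above and call it the way Docker would:
# the CMD value ("world") becomes $1 when no run-time argument is given.
printf '#!/bin/bash\necho "Hello $1"\n' > script.sh
chmod +x script.sh
./script.sh world      # what `docker run entrypoint:1.0.0` effectively runs
./script.sh december   # what `docker run entrypoint:1.0.0 december` effectively runs
```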

</description>
      <category>docker</category>
      <category>containers</category>
      <category>cmd</category>
      <category>shell</category>
    </item>
    <item>
      <title>Dig command to track the process of DNS resolution</title>
      <dc:creator>Akshay Gore</dc:creator>
      <pubDate>Wed, 12 Nov 2025 15:55:54 +0000</pubDate>
      <link>https://dev.to/akshaygore/dig-command-to-track-the-process-of-dns-resolution-ii</link>
      <guid>https://dev.to/akshaygore/dig-command-to-track-the-process-of-dns-resolution-ii</guid>
      <description>&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; User/server(client) machine trying to reach nike.com&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;DNS client checks the cache on local machine for the IP of nike.com&lt;/li&gt;
&lt;li&gt;Client's machine cache doesn't have the required IP&lt;/li&gt;
&lt;li&gt;Client queries the DNS server which is provided/configured by the ISP&lt;/li&gt;
&lt;li&gt;This DNS server, also known as a recursive DNS server, checks its own cache for the IP of nike.com&lt;/li&gt;
&lt;li&gt;The recursive DNS server doesn't have the IP either, so it begins the process of locating it&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  The steps below are performed by the client to locate the IP
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Get list of root servers

&lt;ul&gt;
&lt;li&gt;The list of root servers is configured on the system by default at &lt;strong&gt;/usr/share/dns/root.hints&lt;/strong&gt; on Linux systems
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dig +short ns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;akshay-gore:~&lt;span class="nv"&gt;$ &lt;/span&gt;dig +short ns
k.root-servers.net.
c.root-servers.net.
h.root-servers.net.
i.root-servers.net.
a.root-servers.net.
m.root-servers.net.
f.root-servers.net.
d.root-servers.net.
b.root-servers.net.
l.root-servers.net.
j.root-servers.net.
g.root-servers.net.
e.root-servers.net.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Trace the path of the request from the system to nike.com's web server:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dig +trace nike.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="p"&gt;;&lt;/span&gt; &amp;lt;&amp;lt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; DiG 9.18.39-0ubuntu0.24.04.1-Ubuntu &amp;lt;&amp;lt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; +trace nike.com
&lt;span class="p"&gt;;;&lt;/span&gt; global options: +cmd
&lt;span class="nb"&gt;.&lt;/span&gt;           3090    IN  NS  h.root-servers.net.
&lt;span class="nb"&gt;.&lt;/span&gt;           3090    IN  NS  j.root-servers.net.
&lt;span class="nb"&gt;.&lt;/span&gt;           3090    IN  NS  k.root-servers.net.
&lt;span class="nb"&gt;.&lt;/span&gt;           3090    IN  NS  d.root-servers.net.
&lt;span class="nb"&gt;.&lt;/span&gt;           3090    IN  NS  m.root-servers.net.
&lt;span class="nb"&gt;.&lt;/span&gt;           3090    IN  NS  b.root-servers.net.
&lt;span class="nb"&gt;.&lt;/span&gt;           3090    IN  NS  f.root-servers.net.
&lt;span class="nb"&gt;.&lt;/span&gt;           3090    IN  NS  i.root-servers.net.
&lt;span class="nb"&gt;.&lt;/span&gt;           3090    IN  NS  c.root-servers.net.
&lt;span class="nb"&gt;.&lt;/span&gt;           3090    IN  NS  a.root-servers.net.
&lt;span class="nb"&gt;.&lt;/span&gt;           3090    IN  NS  e.root-servers.net.
&lt;span class="nb"&gt;.&lt;/span&gt;           3090    IN  NS  g.root-servers.net.
&lt;span class="nb"&gt;.&lt;/span&gt;           3090    IN  NS  l.root-servers.net.
&lt;span class="p"&gt;;;&lt;/span&gt; Received 239 bytes from 127.0.0.53#53&lt;span class="o"&gt;(&lt;/span&gt;127.0.0.53&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;in &lt;/span&gt;0 ms

com.            172800  IN  NS  a.gtld-servers.net.
com.            172800  IN  NS  b.gtld-servers.net.
com.            172800  IN  NS  c.gtld-servers.net.
com.            172800  IN  NS  d.gtld-servers.net.
com.            172800  IN  NS  e.gtld-servers.net.
com.            172800  IN  NS  f.gtld-servers.net.
com.            172800  IN  NS  g.gtld-servers.net.
com.            172800  IN  NS  h.gtld-servers.net.
com.            172800  IN  NS  i.gtld-servers.net.
com.            172800  IN  NS  j.gtld-servers.net.
com.            172800  IN  NS  k.gtld-servers.net.
com.            172800  IN  NS  l.gtld-servers.net.
com.            172800  IN  NS  m.gtld-servers.net.
com.            86400   IN  DS  19718 13 2 8ACBB0CD28F41250A80A491389424D341522D946B0DA0C0291F2D3D7 71D7805A
com.            86400   IN  RRSIG   DS 8 1 86400 20251119050000 20251106040000 61809 &lt;span class="nb"&gt;.&lt;/span&gt; tncdUkjC/m4gwK8aqbdYHV1ZD+WR3n5FJgvwM+xHj4kJMG6D5XuASX4x 2D0YrJG547HWwb1jAjDcHaRyBcJqeoHti/mcLrungu4mGMHzYeVPx/Td YrC7yk91EA8UDacZA2y1qK0pzziw+GPEUs5ny5wOIvgRrXKOZPZYif60 UPk2df0O2lqe4q8vrx8Ff4zKDs275tC2Er+hrJ6YrQ8hKdwpDgkOdrjO 2e62PctJlRFYVj6MWBmQZS85ZSXCxMgP4bCUo5no6S3at4z2bKFfWjpF GcB7MF0kGwArH/hPfudiEV3cpoGPEOmr3o53vfIv22fxBfcOSmPjHq1y &lt;span class="nv"&gt;BuDwqg&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;
&lt;span class="p"&gt;;;&lt;/span&gt; Received 1168 bytes from 2001:500:2f::f#53&lt;span class="o"&gt;(&lt;/span&gt;f.root-servers.net&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;in &lt;/span&gt;116 ms

nike.com.       172800  IN  NS  ns-n1.nike.com.
nike.com.       172800  IN  NS  ns-n2.nike.com.
nike.com.       172800  IN  NS  ns-n3.nike.com.
nike.com.       172800  IN  NS  ns-n4.nike.com.
CK0POJMG874LJREF7EFN8430QVIT8BSM.com. 900 IN NSEC3 1 1 0 - CK0Q3UDG8CEKKAE7RUKPGCT1DVSSH8LL NS SOA RRSIG DNSKEY NSEC3PARAM
CK0POJMG874LJREF7EFN8430QVIT8BSM.com. 900 IN RRSIG NSEC3 13 2 900 20251111002637 20251103231637 46539 com. qil9G2NSHYVtUASYp5W8XlMim+ieLFJ/aWJROvBKJfsjLso2rCp+GY5N vzw13ee/+aYXc2ZmkHSCrjrqPWjmAQ&lt;span class="o"&gt;==&lt;/span&gt;
1AUF57P261FM4PRA2UHSG8IOEQH8RRSD.com. 900 IN NSEC3 1 1 0 - 1AUFABRNB1AREK54RAOGOJUIHBQ6C10I NS DS RRSIG
1AUF57P261FM4PRA2UHSG8IOEQH8RRSD.com. 900 IN RRSIG NSEC3 13 2 900 20251112011923 20251105000923 46539 com. ACPLjyPFa7MlxXfIhQx74GciwjbCwvTCT1mmWdLfaP3LvMtWkOg5ku6V aRHkII5DI+1pL/KRP8idLxs91qwm0w&lt;span class="o"&gt;==&lt;/span&gt;
&lt;span class="p"&gt;;;&lt;/span&gt; Received 538 bytes from 192.54.112.30#53&lt;span class="o"&gt;(&lt;/span&gt;h.gtld-servers.net&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;in &lt;/span&gt;292 ms

nike.com.       60  IN  A   18.172.64.109
nike.com.       60  IN  A   18.172.64.17
nike.com.       60  IN  A   18.172.64.37
nike.com.       60  IN  A   18.172.64.97
nike.com.       3600    IN  NS  ns-n1.nike.com.
nike.com.       3600    IN  NS  ns-n2.nike.com.
nike.com.       3600    IN  NS  ns-n3.nike.com.
nike.com.       3600    IN  NS  ns-n4.nike.com.
&lt;span class="p"&gt;;;&lt;/span&gt; Received 245 bytes from 64:ff9b::cdfb:c343#53&lt;span class="o"&gt;(&lt;/span&gt;ns-n4.nike.com&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;in &lt;/span&gt;240 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Analyzing the output of trace command
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; lines starting with ;; in the output are comments&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The line

&lt;code&gt;;; Received 239 bytes from 127.0.0.53#53(127.0.0.53) in 0 ms&lt;/code&gt;

states that dig received a 239-byte response from the local DNS stub resolver (127.0.0.53) on port 53: the &lt;strong&gt;list of root servers&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The line

&lt;code&gt;;; Received 1168 bytes from 2001:500:2f::f#53(f.root-servers.net) in 116 ms&lt;/code&gt;

states that dig received a 1168-byte response from the root server &lt;strong&gt;f.root-servers.net&lt;/strong&gt;: the &lt;strong&gt;list of nameservers for the com. domain&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The line

&lt;code&gt;;; Received 538 bytes from 192.54.112.30#53(h.gtld-servers.net) in 292 ms&lt;/code&gt;

states that dig received a 538-byte response from the TLD nameserver &lt;strong&gt;h.gtld-servers.net&lt;/strong&gt;: the &lt;strong&gt;list of nameservers for the nike.com. domain&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The line

&lt;code&gt;;; Received 245 bytes from 64:ff9b::cdfb:c343#53(ns-n4.nike.com) in 240 ms&lt;/code&gt;

states that dig received a 245-byte response from the nameserver &lt;strong&gt;ns-n4.nike.com&lt;/strong&gt;: the &lt;strong&gt;A records for nike.com., i.e. the actual IPs of the website&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h5&gt;
  
  
  Things to notice
&lt;/h5&gt;

&lt;ol&gt;
&lt;li&gt;The list of root servers is configured on the user's system by default at &lt;strong&gt;/usr/share/dns/root.hints&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Root servers provide the nameservers for the com domain, e.g. a.gtld-servers.net.&lt;/li&gt;
&lt;li&gt;The nameserver a.gtld-servers.net. is also a top-level domain server for com. The domain a.gtld-servers.net has an A record but no NS record for a simple reason: it is the nameserver itself, and nameservers typically do not delegate authority for their own name to a different set of nameservers.&lt;/li&gt;
&lt;li&gt;The TLD server a.gtld-servers.net. provides the nameservers for nike.com: ns-n1.nike.com., ns-n3.nike.com., and so on&lt;/li&gt;
&lt;li&gt;The nameserver ns-n1.nike.com. provides the A records, i.e. the IPs of the nike.com website&lt;/li&gt;
&lt;/ol&gt;
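&lt;p&gt;The same delegation chain can be walked by hand, querying each server from the trace above directly (server names taken from that output; any root or gtld server would do):&lt;/p&gt;

```shell
# Walk the delegation chain from the +trace output manually.
# +norecurse asks each server only for what it knows (a referral), not a full answer.
dig +norecurse @a.root-servers.net NS com.        # root server refers us to the .com TLD servers
dig +norecurse @a.gtld-servers.net NS nike.com.   # TLD server refers us to nike.com's nameservers
dig @ns-n1.nike.com A nike.com                    # authoritative server answers with the A records
```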

&lt;h5&gt;
  
  
  Useful commands
&lt;/h5&gt;

&lt;p&gt;1. &lt;code&gt;dig A nike.com&lt;/code&gt;&lt;br&gt;
Provides the IP addresses of nike.com:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```
    ; &amp;lt;&amp;lt;&amp;gt;&amp;gt; DiG 9.18.39-0ubuntu0.24.04.1-Ubuntu &amp;lt;&amp;lt;&amp;gt;&amp;gt; A nike.com
    ;; global options: +cmd
    ;; Got answer:
    ;; -&amp;gt;&amp;gt;HEADER&amp;lt;&amp;lt;- opcode: QUERY, status: NOERROR, id: 36711
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 65494
    ;; QUESTION SECTION:
    ;nike.com.          IN  A

    ;; ANSWER SECTION:
    nike.com.       77  IN  A   18.172.64.17
    nike.com.       77  IN  A   18.172.64.97
    nike.com.       77  IN  A   18.172.64.109
    nike.com.       77  IN  A   18.172.64.37

    ;; Query time: 260 msec
    ;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
    ;; WHEN: Thu Nov 06 13:35:35 IST 2025
    ;; MSG SIZE  rcvd: 101
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2. &lt;code&gt;dig NS nike.com&lt;/code&gt;&lt;br&gt;
Provides the nameservers of nike.com:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```
    ; &amp;lt;&amp;lt;&amp;gt;&amp;gt; DiG 9.18.39-0ubuntu0.24.04.1-Ubuntu &amp;lt;&amp;lt;&amp;gt;&amp;gt; NS nike.com
    ;; global options: +cmd
    ;; Got answer:
    ;; -&amp;gt;&amp;gt;HEADER&amp;lt;&amp;lt;- opcode: QUERY, status: NOERROR, id: 56992
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 65494
    ;; QUESTION SECTION:
    ;nike.com.          IN  NS

    ;; ANSWER SECTION:
    nike.com.       4502    IN  NS  ns-n4.nike.com.
    nike.com.       4502    IN  NS  ns-n1.nike.com.
    nike.com.       4502    IN  NS  ns-n2.nike.com.
    nike.com.       4502    IN  NS  ns-n3.nike.com.

    ;; Query time: 215 msec
    ;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
    ;; WHEN: Thu Nov 06 13:37:12 IST 2025
    ;; MSG SIZE  rcvd: 117
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;One can also run &lt;code&gt;dig A nike.com @1.1.1.1&lt;/code&gt; or &lt;code&gt;dig NS nike.com @1.1.1.1&lt;/code&gt;, which checks the records against the DNS server 1.1.1.1.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Trying to get the NS record of a TLD server gives an empty answer: it has no NS record because it is a nameserver itself. Instead, it returns an SOA record&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dig NS a.gtld-servers.net.&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```
; &amp;lt;&amp;lt;&amp;gt;&amp;gt; DiG 9.18.39-0ubuntu0.24.04.1-Ubuntu &amp;lt;&amp;lt;&amp;gt;&amp;gt; NS a.gtld-servers.net.
;; global options: +cmd
;; Got answer:
;; -&amp;gt;&amp;gt;HEADER&amp;lt;&amp;lt;- opcode: QUERY, status: NOERROR, id: 3874
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;a.gtld-servers.net.        IN  NS

;; AUTHORITY SECTION:
gtld-servers.net.   3600    IN  SOA av4.nstld.com. nstld.verisign-grs.com. 1762388322 3600 900 1209600 86400

;; Query time: 407 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Thu Nov 06 13:41:09 IST 2025
;; MSG SIZE  rcvd: 115
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;The SOA record is one of the mandatory records in a DNS zone file. It designates the primary authoritative server and provides critical administrative details for zone transfers and caching. Each field is explained below:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Name&lt;/td&gt;
&lt;td&gt;The name of the zone (e.g., example.com.).&lt;/td&gt;
&lt;td&gt;Indicates the domain to which the SOA record applies.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTL&lt;/td&gt;
&lt;td&gt;The Time-to-Live (in seconds) for the SOA record itself.&lt;/td&gt;
&lt;td&gt;Specifies how long other servers should cache this administrative record.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Class&lt;/td&gt;
&lt;td&gt;Always IN (for Internet).&lt;/td&gt;
&lt;td&gt;Defines the protocol family; virtually always Internet.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type&lt;/td&gt;
&lt;td&gt;Always SOA (Start of Authority).&lt;/td&gt;
&lt;td&gt;Identifies the record type.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MNAME&lt;/td&gt;
&lt;td&gt;The Primary Master Name Server (e.g., ns1.example.com.).&lt;/td&gt;
&lt;td&gt;The authoritative server that holds the definitive copy of the zone file.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RNAME&lt;/td&gt;
&lt;td&gt;The Responsible Person's Email Address (e.g., hostmaster.example.com.).&lt;/td&gt;
&lt;td&gt;The administrative contact for the zone. The first dot in the name is replaced by an @ symbol when interpreted as an email address (e.g., &lt;a href="mailto:hostmaster@example.com"&gt;hostmaster@example.com&lt;/a&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serial&lt;/td&gt;
&lt;td&gt;A Version Number for the zone file (often in the format YYYYMMDDSS).&lt;/td&gt;
&lt;td&gt;Secondary name servers check this number. If it has increased, they initiate a zone transfer to update their data.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Refresh&lt;/td&gt;
&lt;td&gt;The time (in seconds) secondary servers wait before checking the primary server for a zone file update.&lt;/td&gt;
&lt;td&gt;Controls the frequency of checking for zone file changes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retry&lt;/td&gt;
&lt;td&gt;The time (in seconds) a secondary server waits to re-try contacting the primary master after a connection failure.&lt;/td&gt;
&lt;td&gt;Allows the secondary to try again quickly without waiting for the full Refresh time if the initial check fails.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Expire&lt;/td&gt;
&lt;td&gt;The time (in seconds) after which a secondary server will stop answering queries for the zone if it has been unable to contact the primary master.&lt;/td&gt;
&lt;td&gt;Prevents the secondary server from providing stale data indefinitely.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Minimum TTL&lt;/td&gt;
&lt;td&gt;The time (in seconds) to be used for caching negative responses (e.g., when a queried record or domain does not exist).&lt;/td&gt;
&lt;td&gt;Limits how long a resolver will remember that a particular name failed to resolve.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
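&lt;p&gt;As a small illustration, the SOA RDATA printed by dig can be split into these fields positionally (using the gtld-servers.net record from the output above):&lt;/p&gt;

```python
# Split the RDATA of an SOA record (as printed by dig) into its named fields.
soa = "av4.nstld.com. nstld.verisign-grs.com. 1762388322 3600 900 1209600 86400"
mname, rname, serial, refresh, retry, expire, minimum = soa.split()

print(mname)                        # primary master name server: av4.nstld.com.
print(rname.replace(".", "@", 1))   # admin contact: nstld@verisign-grs.com.
print(int(refresh), int(retry))     # 3600 900: secondaries re-check hourly, retry after 15 min
```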

&lt;h4&gt;
  
  
  Other Additional DNS related topics
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;TTL : DNS records update on a set interval defined by their Time to Live (TTL) value, which starts at the moment the record is cached, not after the last retrieval. Here is how it works:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Time to Live (TTL): every DNS record has a TTL value, set by the domain administrator on the authoritative DNS server. It tells other DNS servers (recursive resolvers and local machines) how long to cache the record before discarding it and requesting a fresh copy from the authoritative server.&lt;/li&gt;
&lt;li&gt;Caching and expiration: when a DNS server or local machine receives a record, it stores it for the duration specified by the TTL. The countdown begins immediately upon receipt.&lt;/li&gt;
&lt;li&gt;Update mechanism: once the TTL expires, the cached record is marked stale. On the next request for that name, the DNS server performs a new query to the authoritative nameserver instead of using the stale cache.&lt;/li&gt;
&lt;/ul&gt;
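&lt;p&gt;A minimal sketch of this caching behavior (a toy in-memory cache, not a real resolver):&lt;/p&gt;

```python
import time

# Toy resolver cache honoring TTL: a record is served from cache until its
# TTL (seconds, counted from the moment of insertion) elapses.
cache = {}

def put(name, ip, ttl):
    cache[name] = (ip, time.monotonic() + ttl)

def get(name):
    entry = cache.get(name)
    if entry is None:
        return None                    # not cached: a real resolver would query upstream
    ip, expires = entry
    if time.monotonic() >= expires:
        del cache[name]                # stale: drop it and force a fresh query
        return None
    return ip

put("nike.com", "18.172.64.17", ttl=60)
print(get("nike.com"))  # 18.172.64.17 (still fresh, served from cache)
```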
&lt;h3&gt;


Have a different trick for this or a related tip? Share it with the community below!
&lt;/h3&gt;


</description>
      <category>dns</category>
      <category>linux</category>
      <category>cli</category>
      <category>web</category>
    </item>
  </channel>
</rss>
