Akshay Gore

Posted on Mar 9 • Edited on Mar 13

Monitoring Self-Hosted LLM with Prometheus and Grafana

#ai #devops #linux #monitoring

Audience: Intermediate DevOps | Series: Part 2 of 4

Quick Recap from Part 1

Set up an Ubuntu Server VM (phi) on VirtualBox
Installed and configured Ollama as a systemd service
Automated the entire setup with Ansible (llm-ansible repo)
Interacted with phi3:mini via the CLI and curl
Link to Part 1

Why a Custom Monitoring Setup

Ollama does not ship a native Prometheus exporter (a /metrics endpoint). It is designed as a lightweight, user-friendly tool for running LLMs locally, so it favors simplicity and ease of setup over enterprise-style monitoring. To get metrics into Prometheus, we have to build that layer ourselves.

What This Post Covers

Writing a custom Prometheus exporter in Python
Installing Prometheus and Grafana with Ansible
Building a monitoring dashboard for your LLM

Github Link

Repository Link

Section 1 — The Problem

1.1 Ollama Has No Native Metrics

Many production services expose a /metrics endpoint in Prometheus format out of the box. Ollama does not.

curl http://192.168.1.52:11434/metrics
# 404 page not found

This is a common situation in DevOps — a service you depend on doesn't expose metrics. The solution is an exporter.

1.2 What is an Exporter

Service (Ollama)
      ↓
Exporter (queries Ollama API)
      ↓
Exposes /metrics in Prometheus format
      ↓
Prometheus scrapes exporter
      ↓
Grafana visualizes

This pattern is used across the ecosystem:

MySQL exporter
Redis exporter
Node exporter

Same pattern, different service.

Section 2 — Architecture

phi VM
──────────────────────
Ollama            →  port 11434  (LLM serving)
ollama-exporter   →  port 8000   (custom metrics)
node-exporter     →  port 9100   (system metrics)

monitoring VM
────────────────────────────
Prometheus        →  port 9090   (scrapes phi)
Grafana           →  port 3000   (visualizes)

Why separate VMs:

→  monitoring runs independently
→  if phi goes down monitoring still works
→  monitoring doesn't consume phi resources
→  mirrors production architecture

Section 3 — Custom Ollama Exporter

3.1 What Metrics We Can Get

Ollama exposes data via REST API endpoints we explored in Part 1:

/api/ps    →  running models, RAM usage, context length
/api/tags  →  downloaded models, disk usage
/          →  health check

3.2 Metrics We Expose

ollama_up                    →  is Ollama API responding (0 or 1)
ollama_models_loaded         →  models currently in RAM
ollama_model_ram_bytes       →  RAM consumed per model
ollama_model_context_length  →  context window size
ollama_models_available      →  models downloaded on disk
ollama_model_disk_bytes      →  disk space per model
ollama_total_disk_bytes      →  total disk used by all models
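As a rough sketch, the metrics above could be derived from Ollama's JSON responses like this. It assumes /api/ps and /api/tags each return a `models` list whose entries carry `name` and `size`, and that `context_length` is present only on Ollama versions that report it. The function name `collect_samples` is hypothetical and not taken from the repo's exporter.

```python
def collect_samples(ps_payload, tags_payload):
    """Turn decoded /api/ps and /api/tags JSON into metric values.

    Scalar metrics map to a number; per-model metrics map to {model_name: value}.
    Field names ("models", "name", "size", "context_length") are assumptions
    about the Ollama API responses.
    """
    loaded = ps_payload.get("models") or []        # models currently in memory
    available = tags_payload.get("models") or []   # models downloaded on disk
    return {
        "ollama_models_loaded": len(loaded),
        "ollama_model_ram_bytes": {m["name"]: m.get("size", 0) for m in loaded},
        "ollama_model_context_length": {
            m["name"]: m["context_length"] for m in loaded if "context_length" in m
        },
        "ollama_models_available": len(available),
        "ollama_model_disk_bytes": {m["name"]: m.get("size", 0) for m in available},
        "ollama_total_disk_bytes": sum(m.get("size", 0) for m in available),
    }
```

With no model loaded, /api/ps yields an empty list, so `ollama_models_loaded` is 0 and the per-model dictionaries are empty.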

3.3 How the Exporter Works

# Simple structure
→  HTTP server on port 8000
→  on GET /metrics:
   query Ollama /api/ps
   query Ollama /api/tags
   format as Prometheus metrics
   return response
→  Prometheus scrapes every 15 seconds
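The structure above can be sketched with nothing but the Python standard library. This is a minimal illustration, not the exporter from the repo: the names `OLLAMA_URL`, `fetch_json`, `build_metrics`, and `MetricsHandler` are hypothetical, it assumes Ollama listens on localhost:11434, and it renders only three of the metrics to keep the example short.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

OLLAMA_URL = "http://localhost:11434"  # assumption: exporter runs on the same VM as Ollama


def fetch_json(path, base_url=OLLAMA_URL, timeout=5):
    """GET an Ollama API path and decode the JSON body."""
    with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
        return json.load(resp)


def build_metrics(fetch=fetch_json):
    """Query Ollama and render a small subset of metrics as Prometheus text."""
    try:
        loaded = fetch("/api/ps").get("models") or []
        available = fetch("/api/tags").get("models") or []
        up = 1
    except (OSError, ValueError):
        # Connection errors and timeouts are OSError; malformed JSON is ValueError.
        loaded, available, up = [], [], 0
    lines = [
        "# TYPE ollama_up gauge",
        f"ollama_up {up}",
        "# TYPE ollama_models_loaded gauge",
        f"ollama_models_loaded {len(loaded)}",
        "# TYPE ollama_models_available gauge",
        f"ollama_models_available {len(available)}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = build_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main(port=8000):
    """Serve /metrics until interrupted. Call main() to start the exporter."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Querying Ollama on every scrape keeps the exporter stateless. If Ollama is unreachable, the scrape still succeeds and reports `ollama_up 0`, which is what lets a dashboard tell "Ollama down" apart from "exporter down".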

Python Exporter File

3.4 Prometheus Metrics Format

# HELP ollama_up Whether Ollama API is responding
# TYPE ollama_up gauge
ollama_up 1

# HELP ollama_model_ram_bytes RAM consumed by each loaded model
# TYPE ollama_model_ram_bytes gauge
ollama_model_ram_bytes{model="phi3:mini"} 3730644480

Key things to notice:

# HELP — human readable description
# TYPE — metric type (gauge, counter, histogram)
labels in {} — metadata attached to metric
value at the end
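To make the format concrete, here is a small sketch of a helper that renders one gauge family in this text format. The helper names `render_gauge` and `escape_label` are hypothetical. The escaping covers the three characters that must be escaped inside label values (backslash, double quote, newline), and HELP text is assumed to be a plain single line.

```python
def escape_label(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge family in the Prometheus text exposition format.

    samples is a list of (labels, value) pairs; pass {} for an unlabelled sample.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label(val)}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Calling `render_gauge("ollama_model_ram_bytes", "RAM consumed by each loaded model", [({"model": "phi3:mini"}, 3730644480)])` returns the three lines of the second sample block above.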

3.5 Running as Systemd Service

ollama-exporter.service
────────────────────────
→  starts after ollama.service
→  restarts automatically on failure
→  runs as ollama user
→  logs to journalctl

Section 4 — Automating with Ansible

Everything above is automated in the llm-ansible repo.

4.1 Updated Repo Structure

4.2 Updated Inventory

[llm_servers]
phi ansible_host=llm_server_ip ansible_user=your_username

[monitoring_servers]
monitoring ansible_host=monitoring_server_ip ansible_user=your_username

4.3 Updated Playbook

---
- name: Deploy Ollama LLM Infrastructure
  hosts: llm_servers
  become: yes
  roles:
    - ollama

- name: Deploy Monitoring Infrastructure
  hosts: monitoring_servers
  become: yes
  roles:
    - monitoring

4.4 Key Variables

# Prometheus
prometheus_port: 9090
prometheus_scrape_interval: "15s"
prometheus_retention_time: "15d"

# Scrape targets
ollama_exporter_host: "192.168.1.52"
ollama_exporter_port: 8000
phi_node_exporter_port: 9100

# Grafana
grafana_port: 3000
grafana_admin_user: "admin"
grafana_admin_password: "admin"

4.5 Running the Playbook

ansible-playbook -i inventory.ini playbook.yaml

Section 5 — Verifying Prometheus Targets

curl http://localhost:9090/api/v1/targets | python3 -m json.tool

All three targets should show "health": "up":

job: prometheus  →  localhost:9090   health: up
job: ollama      →  phi:8000         health: up
job: node        →  phi:9100         health: up
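The same check can be scripted. The sketch below reads the `data.activeTargets` list from the /api/v1/targets response and reports any target whose health is not "up". The names `PROMETHEUS_URL`, `fetch_targets`, `target_health`, and `unhealthy` are hypothetical, and the URL assumes the script runs on the monitoring VM.

```python
import json
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: run on the monitoring VM


def fetch_targets(base_url=PROMETHEUS_URL, timeout=5):
    """Fetch and decode the Prometheus targets API response."""
    with urllib.request.urlopen(base_url + "/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def target_health(payload):
    """Map (job, instance) -> health string from a decoded targets response."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return {
        (t["labels"].get("job", ""), t["labels"].get("instance", "")): t.get("health", "unknown")
        for t in targets
    }


def unhealthy(payload):
    """Return the sorted (job, instance) pairs that are not reporting 'up'."""
    return sorted(key for key, health in target_health(payload).items() if health != "up")
```

Running `unhealthy(fetch_targets())` on the monitoring VM should return an empty list when all three jobs are healthy.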

Section 6 — Grafana Dashboard

6.1 Add Prometheus Data Source

Connections → Data sources → Add data source
→  Select Prometheus
→  URL: http://localhost:9090
→  Save & Test
→  "Successfully queried the Prometheus API"

6.2 Dashboard Panels

Row 1 — Ollama Health:

| Panel | Query | Type |
|---|---|---|
| Ollama Status | `ollama_up` | Stat |
| Model Memory Usage | `ollama_model_ram_bytes` | Stat |
| Models in Memory | `ollama_models_loaded` | Stat |

Row 2 — System Health (phi VM):

| Panel | Query | Type |
|---|---|---|
| CPU Usage % | `100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` | Stat |
| Memory Usage % | `100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)` | Stat |
| Disk Usage % | `100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)` | Stat |
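The CPU query is the least obvious of the three, so here is a worked sketch of the arithmetic it performs. `node_cpu_seconds_total{mode="idle"}` is a per-core counter of idle seconds, `rate(...[5m])` turns it into the fraction of each second spent idle, and `avg by(instance)` averages that across cores. The function below is a simplification (it ignores counter resets and the extrapolation that `rate()` applies), and the name `cpu_usage_percent` is hypothetical.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Approximate the CPU usage query from two readings of the idle counters.

    idle_start and idle_end hold one idle-seconds counter value per core,
    sampled at the start and end of the window.
    """
    # Per-core idle rate: idle seconds gained per second of wall-clock time.
    idle_rates = [
        (end - start) / window_seconds for start, end in zip(idle_start, idle_end)
    ]
    avg_idle = sum(idle_rates) / len(idle_rates)  # avg by(instance) across cores
    return 100 - avg_idle * 100


# Two cores over a 5-minute (300 s) window: one idle for 225 s, one for 150 s.
print(cpu_usage_percent([1000.0, 2000.0], [1225.0, 2150.0], 300))  # 37.5
```

The memory and disk queries follow the same "100 minus the free percentage" shape, but work on gauges directly and need no `rate()`.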

📸 Screenshot: complete Grafana dashboard

6.3 What the Dashboard Tells You

Ollama Status       →  is LLM serving healthy?
Model Memory Usage  →  ~3.7 GB when phi3:mini is loaded
                        0 when the model is unloaded (keep_alive timeout)
Models in Memory    →  1 when active, 0 when idle
CPU Usage %         →  spikes during inference
                        baseline low when idle
Memory Usage %      →  stable, dominated by model RAM
Disk Usage %        →  increases as you pull more models

Demo of panels

When no model is running

root@phi:/home/akshaygore# ollama ps
NAME    ID    SIZE    PROCESSOR    CONTEXT    UNTIL
root@phi:/home/akshaygore#

The dashboard shows the corresponding stats with no model loaded.

Once we load the phi3:mini model

root@phi:/home/akshaygore# ollama run phi3:mini
>>> hi
Hi there! How can I help you today?

>>> /bye
root@phi:/home/akshaygore# ollama ps
NAME         ID              SIZE      PROCESSOR    CONTEXT    UNTIL
phi3:mini    4f2222927938    3.7 GB    100% CPU     4096       4 minutes from now
root@phi:/home/akshaygore#

The dashboard updates its stats once the model is running.
