In Article 1, when we first set up Prometheus + Grafana for Gemma 2B on Kubernetes, I expected to see nice dashboards with:
- Tokens per request
- Latency per inference
- Number of inferences processed
…but all we got were boring container metrics: CPU%, memory usage, restarts.
Sure, they told us the pod was alive, but nothing about the model itself.
No clue if inference was slow, if requests were timing out, or how many tokens were processed.
Debugging the Metrics Problem
We checked:
- Prometheus scraping the Ollama pod? ✅
- Grafana dashboards connected? ✅
- Metrics endpoint on Ollama? ❌
That's when we realized:
- Ollama by default doesn't expose model-level metrics.
- It only serves the API for inference, nothing else.
- Prometheus was scraping… nothing useful.
The Fix: Ollama Exporter as Sidecar
While digging through GitHub issues, we found a project: Ollama Exporter
It runs as a sidecar container inside the same pod as Ollama, talks to the Ollama API, and exposes real metrics at /metrics for Prometheus.
Basically:
```
[ Ollama Pod ]
 ├── Ollama Server   (API     → 11434)
 └── Ollama Exporter (Metrics → 11435)
```
How We Integrated It
Here's the snippet we added to the containers list in the Ollama Deployment:

```yaml
- name: ollama-exporter
  image: ghcr.io/jmorganca/ollama-exporter:latest
  ports:
    - containerPort: 11435
  env:
    - name: OLLAMA_HOST
      value: "http://localhost:11434"
```
And in Prometheus config:
```yaml
scrape_configs:
  - job_name: 'ollama'
    static_configs:
      - targets: ['ollama-service:11435']
```
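For the ollama-service:11435 target to resolve, the Service in front of the pod also has to expose the exporter port. Here's a minimal sketch, assuming the Service is named ollama-service and selects the pod by an app: ollama label:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama        # assumed pod label
  ports:
    - name: api        # Ollama inference API
      port: 11434
      targetPort: 11434
    - name: metrics    # exporter's /metrics endpoint
      port: 11435
      targetPort: 11435
```

Scraping through the Service is fine for a single replica; with more replicas you would typically switch to kubernetes_sd_configs or a ServiceMonitor so Prometheus scrapes each pod individually.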
The Metrics We Finally Got
After adding the exporter, Grafana lit up with:
| Metric Name | What It Shows |
| --- | --- |
| ollama_requests_total | Number of inference requests |
| ollama_latency_seconds | Latency per inference request |
| ollama_tokens_processed | Tokens processed per inference |
| ollama_model_load_time | Time taken to load the Gemma 2B model |
Suddenly, we had real model observability, not just pod health.
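To turn those series into reusable dashboard queries, here's a sketch of Prometheus recording rules. It assumes ollama_requests_total and ollama_tokens_processed behave as counters and ollama_latency_seconds as a gauge, so check the exporter's actual metric types before relying on it; the same expressions work directly as Grafana panel queries.

```yaml
groups:
  - name: ollama-recording
    rules:
      # Requests per second over the last 5 minutes (counter assumption)
      - record: ollama:requests:rate5m
        expr: rate(ollama_requests_total[5m])
      # Token throughput per second (counter assumption)
      - record: ollama:tokens:rate5m
        expr: rate(ollama_tokens_processed[5m])
      # Average inference latency over the last 5 minutes (gauge assumption)
      - record: ollama:latency_seconds:avg5m
        expr: avg_over_time(ollama_latency_seconds[5m])
```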
Lessons Learned
- Default Kubernetes metrics ≠ model metrics → You need a sidecar like Ollama Exporter.
- One scrape job away → Prometheus won't scrape what you don't tell it to.
- Metrics help tuning → We later used these metrics to set CPU/memory requests properly (see the sketch after this list).
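To make that last point concrete, the tuning boiled down to a resources block on the Ollama container shaped roughly like this; the numbers below are placeholders rather than our real values, and should be derived from the observed request rate, token throughput, and latency:

```yaml
# Placeholder values; size these from the metrics above rather than guessing.
resources:
  requests:
    cpu: "2"
    memory: 6Gi
  limits:
    cpu: "4"
    memory: 8Gi
```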
What's Next?
Now that we have model-level observability, the next steps are:
- Adding alerting rules for latency spikes or token errors (a rough rule is sketched after this list).
- Exporting historical metrics into long-term storage (e.g., Thanos).
- Trying multiple models (Gemma 3, LLaMA 3, Phi-3) and comparing inference latency across them.
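For the alerting item above, a first rule could look roughly like this; the 2s threshold, the 5m window, and the gauge-style use of ollama_latency_seconds are assumptions to tune against your own latency data:

```yaml
groups:
  - name: ollama-alerts
    rules:
      - alert: OllamaHighInferenceLatency
        # Fires when average inference latency stays above 2s for 5 minutes
        expr: avg_over_time(ollama_latency_seconds[5m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ollama inference latency is high"
          description: "Average Gemma 2B inference latency has been above 2s for 5 minutes."
```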
Let's Connect
If you try this setup or improve it, I'd love to hear from you!
Drop a star ⭐ on the repo if it helped you; it keeps me motivated to write more experiments like this!