Sumit Roy

📊 Adding Observability to Gemma 2B on Kubernetes with Prometheus & Grafana

In Article 1, we got Gemma 2B up and running on Kubernetes with Ollama. This post is about what it took to actually observe it.

When we first set up Prometheus + Grafana for Gemma 2B on Kubernetes, I expected to see nice dashboards with:

  • Tokens per request
  • Latency per inference
  • Number of inferences processed

…but all we got were boring container metrics: CPU%, memory usage, restarts.

Sure, they told us the pod was alive, but nothing about the model itself.
No clue if inference was slow, if requests were timing out, or how many tokens were processed.

🔍 Debugging the Metrics Problem

We checked:

  • Prometheus scraping the Ollama pod? ✅
  • Grafana dashboards connected? ✅
  • Metrics endpoint on Ollama? ❌

That's when we realized:

  • Ollama by default doesn't expose model-level metrics.
  • It only serves the API for inference, nothing else.
  • Prometheus was scraping… nothing useful.

💡 The Fix: Ollama Exporter as Sidecar

While digging through GitHub issues, we found a project: Ollama Exporter

It runs as a sidecar container inside the same pod as Ollama, talks to the Ollama API, and exposes real metrics at /metrics for Prometheus.

Basically:

```
[ Ollama Pod ]
    ├── Ollama Server   (API → 11434)
    └── Ollama Exporter (Metrics → 11435)
```

🛠 How We Integrated It

Here's the snippet we added to the Ollama deployment:

```yaml
# Added alongside the existing ollama container in the Deployment's container list
- name: ollama-exporter
  image: ghcr.io/jmorganca/ollama-exporter:latest
  ports:
    - containerPort: 11435        # the exporter's /metrics port
  env:
    - name: OLLAMA_HOST           # where the exporter finds the Ollama API
      value: "http://localhost:11434"
```

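For context, here's roughly where that sidecar sits in the full Deployment. This is a minimal sketch, assuming the official `ollama/ollama` image and `app: ollama` labels; adjust names, resources, and labels to match your own manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        # Main model server
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434   # Ollama API
        # Sidecar that turns Ollama API stats into Prometheus metrics
        - name: ollama-exporter
          image: ghcr.io/jmorganca/ollama-exporter:latest
          ports:
            - containerPort: 11435   # /metrics endpoint
          env:
            - name: OLLAMA_HOST
              value: "http://localhost:11434"   # same pod, so localhost reaches the API
```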

And in Prometheus config:

```yaml
scrape_configs:
  - job_name: 'ollama'
    static_configs:
      - targets: ['ollama-service:11435']
```
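
For that `ollama-service:11435` target to resolve, the Service in front of the Ollama pod has to expose the exporter port as well. A minimal sketch, assuming the `ollama-service` name from the scrape config and the `app: ollama` label from the Deployment sketch above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama            # must match the pod labels on the Ollama Deployment
  ports:
    - name: api
      port: 11434
      targetPort: 11434
    - name: metrics
      port: 11435
      targetPort: 11435    # the exporter sidecar, scraped by Prometheus
```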

📊 The Metrics We Finally Got

After adding the exporter, Grafana lit up with:

| Metric Name | What It Shows |
|---|---|
| `ollama_requests_total` | Number of inference requests |
| `ollama_latency_seconds` | Latency per inference request |
| `ollama_tokens_processed` | Tokens processed per inference |
| `ollama_model_load_time` | Time taken to load the Gemma 2B model |

Suddenly, we had real model observability, not just pod health.
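
To turn those series into Grafana panels, we mostly graphed simple rates. Here's a sketch of the kind of PromQL involved, written as Prometheus recording rules; the metric names come from the table above, and the assumption that `ollama_tokens_processed` is a counter is worth checking against your exporter's actual output:

```yaml
groups:
  - name: ollama-panels
    rules:
      # Inference requests per second, averaged over 5 minutes
      - record: ollama:requests_per_second:rate5m
        expr: rate(ollama_requests_total[5m])
      # Token throughput; assumes ollama_tokens_processed is a counter
      - record: ollama:tokens_per_second:rate5m
        expr: rate(ollama_tokens_processed[5m])
```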

🚀 Lessons Learned

  • Default Kubernetes metrics ≠ model metrics → you need a sidecar like Ollama Exporter.
  • One scrape job away → Prometheus won't scrape what you don't tell it to.
  • Metrics help tuning → we later used these metrics to set CPU/memory requests properly.

🔮 What's Next?

Now that we have model-level observability, the next steps are:

  • Adding alerting rules for latency spikes or token errors (a rough sketch below).
  • Exporting historical metrics into long-term storage (e.g., Thanos) and shipping logs to Loki.
  • Trying multiple models (Gemma 3, LLaMA 3, Phi-3) and comparing inference latency across them.
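
For the alerting item, a minimal starting point might look like the rules below. The thresholds are placeholders, and the first rule assumes `ollama_latency_seconds` is exposed as a histogram (with `_bucket` series); verify that against the exporter before relying on it:

```yaml
groups:
  - name: ollama-alerts
    rules:
      - alert: OllamaHighInferenceLatency
        # Assumes ollama_latency_seconds is a histogram; if the exporter
        # exposes a gauge or summary instead, swap in the matching query.
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(ollama_latency_seconds_bucket[5m]))) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 inference latency above 5s for 10 minutes"
      - alert: OllamaNoRequests
        # Catches a silently broken pipeline: traffic drops to zero.
        expr: rate(ollama_requests_total[15m]) == 0
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "No inference requests observed for 15 minutes"
```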

💬 Let's Connect

If you try this setup or improve it, I'd love to hear from you!

Drop a star ⭐ on the repo if it helped you; it keeps me motivated to write more experiments like this!
