Sumit Roy

Running Gemma 2B on Kubernetes (k3d) with Ollama: A Complete Local AI Setup

I was fascinated by how people were running large language models locally, fully offline, without depending on expensive GPU clusters or cloud APIs.

But when I tried deploying Gemma 2B manually on my machine, the process was messy:

  • The large model weights had to be downloaded manually
  • Restarting the container meant re-downloading everything
  • No orchestration or resilience: if the container died, my setup was gone

So, I asked myself:

“Can I run Gemma 2B efficiently, fully containerized, orchestrated by Kubernetes, with a clean local setup?”

The answer: Yes. Using k3d + Ollama + Kubernetes + Gemma 2B.

🎯 What You’ll Learn

  1. Deploy Gemma 2B using Ollama inside a k3d Kubernetes cluster
  2. Expose it via a service for local access
  3. Persist model weights to avoid re-downloading
  4. Basic troubleshooting for pods and containers

🛠️ Tech Stack

  • k3d: Lightweight Kubernetes cluster running inside Docker
  • Ollama: Containerized tool for running LLMs locally
  • Gemma 2B: Lightweight LLM (~1.7GB) from Google that runs locally
  • WSL2: Linux environment on Windows

📚 Concepts Before We Start

  1. What is Ollama?

Ollama is a simple tool for running LLMs locally. It:

  • Pulls models like Gemma, Llama, Phi
  • Provides a REST API for inference
  • Runs entirely offline once weights are downloaded

Example:

ollama run gemma:2b

Gives you a local chatbot with zero cloud dependency.

  2. Why Kubernetes (k3d)?

Instead of running Ollama bare-metal, we use k3d:

  • Local K8s cluster → k3d runs Kubernetes inside Docker, very lightweight
  • Pods & PVCs → Pods run containers, PVCs store model weights
  • Services → Expose the Ollama API on localhost easily

  3. Storage with PVC

Without PVCs, if your pod dies, you lose model weights.
PVC ensures models survive restarts and redeployments.
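
k3s-based clusters (which is what k3d runs) typically ship with the local-path provisioner as the default StorageClass, so the PVC we create later should bind without any extra setup. You can verify this once the cluster is up:

# Check which StorageClass will back the PVC
kubectl get storageclass

# After deploying (Step 2), confirm the claim is Bound
kubectl get pvc ollama-pvc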

🧑‍💻 Step-by-Step Setup
Step 1: Install k3d
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
k3d cluster create gemma-cluster --agents 1 --servers 1 -p "11434:11434@loadbalancer"

The -p flag maps port 11434 on your host to the cluster's load balancer so the Ollama API is reachable on localhost in Step 4 (this is the "k3d port mapping" mentioned in the troubleshooting section).
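
A quick sanity check before deploying anything:

# Confirm the cluster API is reachable and the nodes are Ready
kubectl cluster-info
kubectl get nodes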

Step 2: Deploy Ollama + Gemma 2B

Create ollama-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        volumeMounts:
        - name: model-storage
          mountPath: /root/.ollama
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: ollama-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
  - protocol: TCP
    port: 11434
    targetPort: 11434
  type: LoadBalancer

Apply it:

kubectl apply -f ollama-deployment.yaml
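
A few optional checks before moving on:

# Wait for the Ollama pod to become ready
kubectl rollout status deployment/ollama

# Confirm the PVC is Bound and the LoadBalancer service exists
kubectl get pvc ollama-pvc
kubectl get svc ollama-service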

Step 3: Pull Gemma 2B Model
kubectl exec -it deploy/ollama -- ollama pull gemma:2b
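
The pull streams ~1.7GB into /root/.ollama, which sits on the PVC, so it only has to happen once. To confirm the model is available:

# List the models stored inside the Ollama container
kubectl exec -it deploy/ollama -- ollama list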

Step 4: Test the API

curl http://localhost:11434/api/generate -d '{
  "model": "gemma:2b",
  "prompt": "Write a short poem about Kubernetes"
}'

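By default, /api/generate streams the response back as one JSON object per token. If you'd rather get a single JSON reply, disable streaming:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma:2b",
  "prompt": "Write a short poem about Kubernetes",
  "stream": false
}'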

🐞 Problems I Faced & Fixes

  1. Pod in CrashLoopBackOff → increased CPU/RAM requests in the deployment spec (see the snippet below)
  2. Model re-downloading on restart → used a PVC to persist the weights
  3. Port not accessible on localhost → used a LoadBalancer service plus k3d port mapping
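
For the CrashLoopBackOff fix, this is roughly what "increased CPU/RAM" looks like in ollama-deployment.yaml; the numbers below are just a starting point for Gemma 2B on my machine, tune them for yours:

        # Added under the ollama container entry in the Deployment spec
        resources:
          requests:
            cpu: "1"
            memory: 3Gi
          limits:
            cpu: "2"
            memory: 6Gi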

📂 Final Project Structure
gemma-k3d/
├── ollama-deployment.yaml
├── k3d-cluster-setup.sh
└── README.md

🚀 Next Steps

In the next article, we’ll add Prometheus + Grafana to monitor:

  1. CPU usage
  2. Memory usage
  3. Latency per inference
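
Until then, if you just want a rough look at resource usage, k3s bundles metrics-server by default, so this should already work:

# CPU/memory usage for the Ollama pod
kubectl top pod -l app=ollama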

💬 Let’s Connect

If you try this setup or improve it, I’d love to hear from you!

Drop a star ⭐ on the repo if it helped you — it keeps me motivated to write more experiments like this!
