I was fascinated by how people were running large language models locally, fully offline, without depending on expensive GPU clusters or cloud APIs.
But when I tried deploying Gemma 2B manually on my machine, the process was messy:
- Large model weights had to be downloaded first
- Restarting the container meant re-downloading everything
- No orchestration or resilience — if the container died, my setup was gone
So, I asked myself:
“Can I run Gemma 2B efficiently, fully containerized, orchestrated by Kubernetes, with a clean local setup?”
The answer: Yes. Using k3d + Ollama + Kubernetes + Gemma 2B.
🎯 What You’ll Learn
- Deploy Gemma 2B using Ollama inside a k3d Kubernetes cluster
- Expose it via a service for local access
- Persist model weights to avoid re-downloading
- Basic troubleshooting for pods and containers
🛠️ Tech Stack
- k3d: Lightweight Kubernetes cluster running inside Docker
- Ollama: Container for running LLMs locally
- Gemma 2B: Lightweight LLM (~1.7 GB) from Google that runs locally
- WSL2: Linux environment on Windows
📚 Concepts Before We Start
- What is Ollama?
Ollama is a simple tool for running LLMs locally:
  - Pulls models like Gemma, Llama, and Phi
  - Provides a REST API for inference
  - Runs entirely offline once the weights are downloaded
Example:
ollama run gemma:2b
Gives you a local chatbot with zero cloud dependency.
- Why Kubernetes (k3d)?
Instead of running Ollama directly on bare metal, we use k3d:
  - Local K8s cluster → k3d runs Kubernetes inside Docker, very lightweight
  - Pods & PVCs → Pods run the containers; PVCs store the model weights
  - Services → Expose the Ollama API on localhost easily
- Storage with PVC
Without a PVC, the model weights are lost whenever the pod dies.
A PVC ensures the models survive restarts and redeployments.
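One reassuring detail: k3d runs k3s under the hood, which ships with a default local-path StorageClass, so the PVC we create in Step 2 should bind automatically. Once the cluster from Step 1 is up, you can confirm this with:

kubectl get storageclass
# you should see a class named local-path marked (default), backed by rancher.io/local-path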
🧑‍💻 Step-by-Step Setup
Step 1: Install k3d
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
k3d cluster create gemma-cluster --agents 1 --servers 1
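A note on ports: the Service in Step 2 is of type LoadBalancer, and k3d only forwards its port to your host if you map it when creating the cluster. If you want curl http://localhost:11434 to work directly in Step 4, you can create the cluster with a port mapping instead (standard k3d syntax; 11434 is just the port Ollama listens on, and kubectl port-forward, shown in Step 4, works as a fallback):

k3d cluster create gemma-cluster --agents 1 --servers 1 -p "11434:11434@loadbalancer"

Either way, confirm the cluster is up before continuing:

kubectl get nodes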
Step 2: Deploy Ollama + Gemma 2B
Create ollama-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: model-storage
              mountPath: /root/.ollama
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: ollama-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
  type: LoadBalancer
Apply it:
kubectl apply -f ollama-deployment.yaml
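Before pulling the model, it's worth checking that everything came up as expected (the label and names below match the manifest above):

kubectl get pods -l app=ollama   # pod should reach Running
kubectl get pvc ollama-pvc       # should show Bound
kubectl get svc ollama-service   # shows the LoadBalancer port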
Step 3: Pull Gemma 2B Model
kubectl exec -it deploy/ollama -- ollama pull gemma:2b
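The first pull downloads the ~1.7 GB of weights onto the PVC, so it can take a few minutes depending on your connection. You can confirm the model is available afterwards with:

kubectl exec -it deploy/ollama -- ollama list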
Step 4: Test the API
curl http://localhost:11434/api/generate -d '{
"model": "gemma:2b",
"prompt": "Write a short poem about Kubernetes"
}'
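If localhost:11434 isn't reachable (for example, the cluster was created without the k3d port mapping), kubectl port-forward is a quick fallback. The request below also sets "stream": false so Ollama returns a single JSON response instead of streaming tokens:

kubectl port-forward svc/ollama-service 11434:11434   # run this in a separate terminal

curl http://localhost:11434/api/generate -d '{
  "model": "gemma:2b",
  "prompt": "Write a short poem about Kubernetes",
  "stream": false
}'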
🐞 Problems I Faced & Fixes
- Pod in CrashLoopBackOff → Increased CPU/RAM in the deployment spec (see the sketch after this list)
- Model re-downloading on restart → Used a PVC to persist the weights
- Port not accessible → Used a LoadBalancer service + k3d port mapping
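My exact numbers aren't in the manifest above, but the CrashLoopBackOff fix boils down to adding a resources block to the Ollama container. A rough sketch (the values are illustrative; tune them to what your machine can spare):

# added under spec.template.spec.containers[0] in ollama-deployment.yaml
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi

kubectl describe pod and kubectl logs on the failing pod are the quickest way to confirm whether memory was actually the culprit.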
📂 Final Project Structure
gemma-k3d/
├── ollama-deployment.yaml
├── k3d-cluster-setup.sh
└── README.md
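If you'd rather script the whole bring-up, a minimal version of k3d-cluster-setup.sh could simply wrap the commands from Steps 1 and 2 (a sketch, assuming the port mapping discussed earlier):

#!/usr/bin/env bash
set -euo pipefail

# Create the cluster with the Ollama port mapped through the k3d load balancer
k3d cluster create gemma-cluster --agents 1 --servers 1 -p "11434:11434@loadbalancer"

# Deploy Ollama, the PVC, and the Service
kubectl apply -f ollama-deployment.yaml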
🚀 Next Steps
In the next article, we’ll add Prometheus + Grafana to monitor:
- CPU usage
- Memory usage
- Latency per inference
💬 Let’s Connect
If you try this setup or improve it, I’d love to hear from you!
Drop a star ⭐ on the repo if it helped you — it keeps me motivated to write more experiments like this!