DEV Community

Daksh Goel

How to Deploy an Open Source LLM Reliably on Kubernetes (Step-by-Step)
Introduction

Running AI models in production requires more than just downloading a
model and running it locally. Anyone can run ollama run mistral in a
terminal — but what happens when that process crashes at 2am? What happens
when you need to monitor memory usage, restart failed services automatically,
or scale to handle more requests?

That is exactly what Kubernetes solves.

In this guide I will walk you through the complete process of deploying
TinyLlama (a real open source LLM) inside a production-grade Kubernetes
cluster on your local machine — with live monitoring via Grafana, a working
chatbot UI, and a head-to-head performance comparison against Claude Haiku
from Anthropic.

By the end you will have a fully working AI stack that auto-restarts on
failure, shows you real-time health metrics, and costs you nothing to run.

Full code: https://github.com/daksh777f/llm-on-kubernetes


What We Are Building

[Next.js Chatbot UI :3001]
            │
[Ollama API :11434]
            │
[Kubernetes Cluster — k3d]
 ├── llm namespace
 │    └── Ollama Pod (serves TinyLlama)
 └── monitoring namespace
      ├── Prometheus (scrapes metrics)
      └── Grafana (visualizes dashboards)

The full stack uses these tools:

  • k3d — Lightweight Kubernetes that runs entirely inside Docker. No cloud account needed.
  • Ollama — A tool that serves open source LLMs as a REST API inside a container.
  • TinyLlama — A 1.1B parameter open source model that runs in 637MB of RAM. Perfect for 8GB machines.
  • Prometheus + Grafana — Industry standard monitoring stack. Prometheus scrapes metrics, Grafana visualizes them.
  • Next.js — The chatbot frontend. Calls Ollama through an API route.

Prerequisites

Before starting, you need these installed:

  • Docker Desktop (running)
  • kubectl
  • k3d
  • Helm
  • Node.js
  • Python 3

On Windows, you can install kubectl, k3d, Helm, and Node.js with one command using Chocolatey:

choco install kubernetes-cli k3d kubernetes-helm nodejs -y

Step 1: Create the Kubernetes Cluster

We use k3d because it creates a real multi-node Kubernetes cluster
that runs entirely inside Docker containers. No cloud account, no
VM setup, no cost.

k3d cluster create llm-cluster --agents 1 --port "8080:80@loadbalancer"

This creates a cluster with one server node and one agent node. The
--port flag maps port 8080 on your host to port 80 on the cluster's
load balancer, which we will use later.

Verify both nodes are ready:

kubectl get nodes


You should see two nodes — one control-plane and one agent — both
with status Ready. That is your Kubernetes cluster running locally
inside Docker.
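If you prefer to script this check, here is a minimal Python sketch (my own illustration, not part of the repo) that parses kubectl get nodes -o json and lists the nodes reporting a Ready condition:

```python
import json
import subprocess


def ready_nodes(nodes_json: dict) -> list[str]:
    """Return the names of nodes whose Ready condition is True."""
    ready = []
    for node in nodes_json.get("items", []):
        name = node["metadata"]["name"]
        for cond in node["status"]["conditions"]:
            if cond["type"] == "Ready" and cond["status"] == "True":
                ready.append(name)
    return ready


# Usage (assumes kubectl is on your PATH and the cluster exists):
#   out = subprocess.run(["kubectl", "get", "nodes", "-o", "json"],
#                        capture_output=True, text=True, check=True)
#   print("Ready nodes:", ready_nodes(json.loads(out.stdout)))
```

For the two-node k3d cluster above, the usage snippet should print both node names.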


Step 2: Deploy Ollama + TinyLlama

Ollama is an open source tool that serves LLMs as a REST API. We
deploy it as a Kubernetes Deployment so that if the pod ever crashes,
Kubernetes automatically restarts it.

Create a file called ollama.yaml:

apiVersion: v1
kind: Namespace
metadata:
  name: llm
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
        livenessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 10
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: llm
  labels:
    app: ollama
spec:
  selector:
    app: ollama
  ports:
  - name: http
    port: 11434
    targetPort: 11434
  type: ClusterIP

Notice the livenessProbe and readinessProbe — these are what make
the deployment reliable. Kubernetes will continuously check if Ollama
is responding. If it stops responding, Kubernetes kills the pod and
starts a fresh one automatically.
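What the probe does is simple: an HTTP GET to /api/tags that must return 200. You can mimic it yourself with a short Python sketch (an illustration of the probe's behavior, assuming a local port-forward to Ollama on 11434):

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # assumes a local port-forward


def model_names(tags_json: dict) -> list[str]:
    """Extract model names from Ollama's /api/tags response body."""
    return [m["name"] for m in tags_json.get("models", [])]


def check_health(base_url: str = OLLAMA_URL) -> list[str]:
    """Mimic the liveness probe: GET /api/tags and return the served models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

If this call times out or errors, Kubernetes would count it as a failed probe and, after repeated failures, restart the pod.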

Apply the deployment:

kubectl apply -f ollama.yaml

Wait for the pod to be running:

kubectl get pods -n llm -w

Once the pod shows 1/1 Running, pull TinyLlama into it:

kubectl exec -n llm deployment/ollama -- ollama pull tinyllama

This downloads the 637MB TinyLlama model inside the running pod.

Test that it works. Keep the port-forward running in one terminal
and send the request from another (PowerShell shown):

kubectl port-forward -n llm svc/ollama-service 11434:11434

$body = '{"model":"tinyllama","prompt":"say hello","stream":false}'
Invoke-RestMethod -Uri "http://127.0.0.1:11434/api/generate" `
  -Method Post -ContentType "application/json" -Body $body

You should see a response field with text from TinyLlama. Your LLM
is running inside Kubernetes.
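The same smoke test works from macOS, Linux, or any machine with Python. A sketch using only the standard library (it assumes the port-forward above is still running):

```python
import json
import urllib.request


def generate_payload(prompt: str, model: str = "tinyllama") -> bytes:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()


def ask(prompt: str, base_url: str = "http://127.0.0.1:11434") -> str:
    """Send one prompt to Ollama and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=generate_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    # CPU inference is slow, so allow a generous timeout.
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["response"]


# ask("say hello")  # requires the port-forward to be active
```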


Step 3: Monitoring with Prometheus and Grafana

Deploying without monitoring is flying blind. We use the
kube-prometheus-stack Helm chart which installs Prometheus, Grafana,
and all the necessary exporters in a single command.

Add the Helm repo:

helm repo add prometheus-community `
  https://prometheus-community.github.io/helm-charts
helm repo update

Install the monitoring stack:

helm install monitoring prometheus-community/kube-prometheus-stack `
  --namespace monitoring `
  --create-namespace `
  --set grafana.adminPassword=admin123 `
  --set alertmanager.enabled=false `
  --set prometheus.prometheusSpec.retention=6h `
  --set prometheus.prometheusSpec.resources.requests.memory=256Mi `
  --set prometheus.prometheusSpec.resources.limits.memory=512Mi

The retention and memory --set flags keep resource usage low, which
matters on 8GB machines. Setting alertmanager.enabled=false skips
Alertmanager to save another ~200MB.

Wait for all pods to reach Running status:

kubectl get pods -n monitoring

Access Grafana by port-forwarding:

kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80

Open http://localhost:3000 and log in with admin / admin123.

Import the Kubernetes Dashboard

  1. Left sidebar → Dashboards → New → Import
  2. Enter ID 15661 → Load
  3. Select Prometheus as data source → Import

You now have a live dashboard showing cluster CPU, memory, pod counts,
and network traffic — all updating in real time.

Create a Custom Ollama Panel

  1. Dashboards → New → New Dashboard → Add visualization
  2. Select Prometheus as data source
  3. Switch to Code mode and enter: kube_pod_info{namespace="llm"}
  4. Change visualization to Stat
  5. Title: Ollama LLM Pod Status
  6. Save dashboard as LLM Monitoring

This panel shows you at a glance whether your LLM pod is up. In a
production setup, you would wire this to an alert that pages you
when it goes down.
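The same PromQL query can also be run programmatically against Prometheus's HTTP API. A small sketch (my own illustration; it assumes you have port-forwarded the Prometheus service, typically named monitoring-kube-prometheus-prometheus in this Helm release, to localhost:9090):

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the given PromQL."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def llm_pod_up(base_url: str = "http://127.0.0.1:9090") -> bool:
    """True if Prometheus currently sees at least one pod in the llm namespace."""
    url = query_url(base_url, 'kube_pod_info{namespace="llm"}')
    with urllib.request.urlopen(url, timeout=5) as resp:
        return len(json.load(resp)["data"]["result"]) > 0
```

A cron job or simple watcher calling llm_pod_up() is a crude stand-in for the alerting a production setup would use.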


Step 4: Build the Chatbot UI

The chatbot is built with Next.js and Tailwind CSS. It talks to
TinyLlama through a server-side API route that calls Ollama directly.

Create the app:

npx create-next-app@latest chatbot --typescript --tailwind --app --yes
cd chatbot

The API route (app/api/chat/route.ts) handles all communication
with Ollama:

import { NextRequest, NextResponse } from "next/server";

export async function POST(req: NextRequest) {
  const { messages } = await req.json();

  // Flatten the chat history into a single prompt string.
  const prompt =
    messages
      .map((m: { role: string; content: string }) =>
        m.role === "user" ? `User: ${m.content}` : `Assistant: ${m.content}`
      )
      .join("\n") + "\nAssistant:";

  try {
    // Requires `kubectl port-forward -n llm svc/ollama-service 11434:11434`.
    const response = await fetch("http://127.0.0.1:11434/api/generate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "tinyllama",
        prompt: prompt,
        stream: false,
      }),
    });

    const data = await response.json();
    return NextResponse.json({
      response: data.response || "No response from model.",
    });
  } catch {
    // Surface a clear error when Ollama is unreachable instead of a raw 500.
    return NextResponse.json(
      { response: "Could not reach Ollama. Is the port-forward running?" },
      { status: 502 }
    );
  }
}

Run on port 3001 since Grafana already uses port 3000:

npm run dev -- --port 3001

Open http://localhost:3001 — you have a working chatbot talking to
your Kubernetes-hosted LLM.


Step 5: Open Source vs Commercial LLM — Real Numbers

This is where it gets interesting. I wrote a Python script that ran
10 identical prompts through both TinyLlama running in my Kubernetes
cluster and Claude Haiku from Anthropic's API, measuring response
time and cost for every single query.

The 10 prompts covered a range of task types:

  • Identity questions ("Which model are you?")
  • Technical explanations ("What is Kubernetes?", "What is Docker?")
  • Coding tasks ("Write a Python function to reverse a string")
  • Creative tasks ("Write a haiku about programming")
  • General knowledge ("What is the capital of Australia?")
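The core of such a comparison boils down to a small timing harness. A simplified sketch (ask_tinyllama and ask_claude are hypothetical wrappers around the two APIs; the repo's compare.py may differ):

```python
import time
from statistics import mean


def timed(fn, prompt: str) -> tuple[str, float]:
    """Run one query through fn and return (answer, elapsed seconds)."""
    start = time.perf_counter()
    answer = fn(prompt)
    return answer, time.perf_counter() - start


def benchmark(fn, prompts: list[str]) -> float:
    """Average latency of fn across all prompts, in seconds."""
    return mean(timed(fn, p)[1] for p in prompts)


# avg_local = benchmark(ask_tinyllama, PROMPTS)
# avg_api   = benchmark(ask_claude, PROMPTS)
```

Using perf_counter rather than time.time avoids clock-adjustment artifacts when measuring short intervals.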

Results

Metric                      TinyLlama (local K8s)    Claude Haiku (API)
Average response time       20.23 seconds            0.41 seconds
Cost for 10 queries         $0.00                    ~$0.003
Cost per query              Free                     ~$0.0003
Runs on your hardware       Yes                      No
Data leaves your machine    Never                    Yes
Response quality            Good for simple tasks    Stronger reasoning
Latest knowledge            Training cutoff          More recent

The Real Story Behind These Numbers

Claude is roughly 49 times faster. That gap comes largely from hardware.
TinyLlama running on the CPU of an 8GB laptop takes about 20 seconds
per response. The same model on a GPU node in Kubernetes would answer
in under a second. Claude runs on Anthropic's optimized GPU
infrastructure and consistently responds in under half a second.

TinyLlama is completely free. For 10 queries, 100 queries, or
1 million queries — the cost is exactly zero. You pay for electricity
and hardware you already own. Claude charges per token, which at
Haiku pricing is extremely cheap (~$0.0003 per query) but adds up
at scale. At 1 million queries per day, that is $300/day vs $0/day.

Privacy is where local wins completely. Every query you send
to Claude goes over the internet to Anthropic's servers. Every query
you send to TinyLlama in your Kubernetes cluster never leaves your
machine. For healthcare, legal, or financial applications where data
privacy is non-negotiable, local is the only option.

When to Use Each

Use TinyLlama (local Kubernetes) when:

  • Data privacy is non-negotiable
  • You are cost-sensitive at scale
  • Tasks are simple: Q&A, summarization, basic code generation
  • You want full control over your AI infrastructure
  • You are building a product and do not want API dependency

Use Claude / commercial API when:

  • Response speed matters (customer-facing, real-time)
  • Tasks need strong reasoning or latest knowledge
  • You are prototyping and do not want infra overhead
  • Reliability SLAs matter more than cost

The production answer is usually both. Route privacy-sensitive
queries to your local Kubernetes LLM. Route quality-critical queries
to a commercial API. This hybrid architecture gives you the best of
both worlds.


Lessons Learned

1. Kubernetes adds real reliability that you cannot get with raw Docker.
The livenessProbe in our Ollama deployment means if the model
crashes or hangs, Kubernetes detects it within 30 seconds and
automatically restarts the pod. With a plain Docker container,
you would need to notice the crash yourself and restart it manually.

2. RAM matters more than you think with LLMs.
Mistral 7B requires 4.5GB of RAM and completely failed on my 8GB
machine because the OS and Kubernetes overhead left only 3.1GB free.
TinyLlama at 637MB ran perfectly. Always check model requirements
before deploying — a model that cannot load is worse than no model.

3. One Helm command beats hours of configuration.
The entire Prometheus + Grafana monitoring stack — which would take
hours to configure manually — installed in a single helm install
command. This is the power of the Kubernetes ecosystem.

4. The speed gap closes dramatically with a GPU.
TinyLlama takes 20 seconds on a CPU. On an NVIDIA T4 GPU (available
on GKE for ~$0.35/hour), the same model runs in under 1 second.
If you need local + fast, the answer is a GPU node, not a bigger
CPU model.

5. Port-forwarding is for development only.
In production you would use a Kubernetes Ingress with a real domain
name and TLS certificate — not port-forwarding. Everything in this
guide is the right foundation for production, but swap port-forwards
for a proper Ingress before going live.
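As an illustration of that swap (the hostname is a placeholder), a minimal Ingress routing traffic to the Ollama service through k3d's bundled load balancer might look like this:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: llm
spec:
  rules:
  - host: llm.example.com   # placeholder domain
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama-service
            port:
              number: 11434
```

In a real deployment you would also attach a TLS certificate (for example via cert-manager) rather than serving plain HTTP.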


What to Try Next

  • Add a GPU node — Deploy to GKE with a T4 GPU and watch TinyLlama go from 20s to under 1s
  • Try larger models — With a GPU, Mistral 7B or Llama 3 8B will fit comfortably
  • Add streaming responses — Ollama supports streaming; the chatbot can show tokens as they generate instead of waiting
  • Set up real alerting — Configure Grafana alerts to send a Slack message when the Ollama pod goes down
  • Deploy to cloud — Replace k3d with GKE or EKS for a production cluster with real uptime guarantees
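Streaming, for example, is a small change: with stream set to true, Ollama returns newline-delimited JSON chunks instead of one final object. A Python sketch of consuming that stream (assumes the port-forward from Step 2):

```python
import json
import urllib.request


def tokens_from_lines(lines):
    """Yield the text tokens from NDJSON lines until the final 'done' chunk."""
    for line in lines:
        chunk = json.loads(line)
        if chunk.get("done"):
            return
        yield chunk["response"]


def stream_answer(prompt: str, base_url: str = "http://127.0.0.1:11434"):
    """Print tokens as Ollama generates them, instead of waiting for the end."""
    body = json.dumps(
        {"model": "tinyllama", "prompt": prompt, "stream": True}
    ).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for token in tokens_from_lines(resp):
            print(token, end="", flush=True)
```

The Next.js route could do the same with a ReadableStream so the UI renders tokens as they arrive.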

Full Code

Everything in this guide is available on GitHub:

https://github.com/daksh777f/llm-on-kubernetes

The repo includes:

  • ollama.yaml — Kubernetes deployment for Ollama
  • chatbot/ — Complete Next.js chatbot
  • compare.py — The comparison script

Conclusion

Deploying an open source LLM reliably is not just about running
a model — it is about building infrastructure that handles failures
gracefully, gives you visibility into what is happening, and scales
when you need it to.

The stack we built today — k3d + Ollama + Prometheus + Grafana +
Next.js — is a genuine foundation for a production AI system.
TinyLlama is free, private, and good enough for a wide range of
tasks. When you need more power, swap it for Mistral or Llama 3
on a GPU node.

The future of AI is not just API calls to commercial providers.
It is open source models running on infrastructure you own and
control. Now you know exactly how to build it.


If this guide helped you, drop a star on the
GitHub repo
and share it with someone building with LLMs.
