Guatu

Posted on • Originally published at guatulabs.dev
Building Karpathy's LLM Wiki: A Production Homelab Implementation

I tried to run Karpathy's LLM Wiki on my Proxmox homelab cluster and spent three days debugging why the front-end wouldn't load. The error log said 502 Bad Gateway, but the backend was running and the API was reachable. It turned out the problem was in the reverse proxy configuration. I'd missed a single line in the Nginx config that was required for WebSockets to work properly.
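For reference, the upgrade directives I was missing looked roughly like this (the upstream name and location path are illustrative, not from the wiki's docs):

```nginx
location / {
    proxy_pass http://wiki-backend;
    # WebSockets require HTTP/1.1; the default proxied version is 1.0.
    proxy_http_version 1.1;
    # These two headers are what actually let the upgrade handshake through.
    # Without them, Nginx forwards the request as plain HTTP and the
    # handshake dies with a 502.
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}
```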

This isn't the first time I've run into an issue where the documentation says one thing and the reality is something else. That's why I'm writing this — to show you exactly what I did, what went wrong, and how I fixed it.

If you're running a multi-node Kubernetes cluster with Proxmox, or you're trying to set up a production-like AI agent environment in your homelab, this is for you. I'll walk you through the exact steps I used to deploy Karpathy's LLM Wiki in a way that mirrors real-world production setups, with the gotchas and workarounds that actually matter.

What I Tried First

I started by cloning the LLM Wiki repository and running the demo setup with Docker Compose. It worked locally, but when I tried to scale it up to Kubernetes, I hit a wall. The first thing that broke was persistent storage. I assumed the default Kubernetes emptyDir would work, but the wiki needs to persist data across restarts. I tried Longhorn, but the initial setup didn't account for how the LLM Wiki uses SQLite: the database file and its parent directory have to be writable by the application user, and the default PVC came up root-owned.

Then I tried using the initContainers approach to set the right permissions, but that wasn't enough on its own. I ended up also adding fsGroup and runAsUser settings to the pod's securityContext so they matched the SQLite process. That was a pain, but it was necessary.
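Note that fsGroup and runAsUser live in the pod's securityContext, not in the StorageClass or PVC themselves. A sketch of what worked for me (UID/GID 1001 is what my wiki image runs as; adjust for yours):

```yaml
# Deployment pod template excerpt. fsGroup makes Kubernetes chgrp the
# mounted volume to this GID on attach, so the SQLite process can write
# to it without any manual chown of the mount point.
spec:
  template:
    spec:
      securityContext:
        runAsUser: 1001
        runAsGroup: 1001
        fsGroup: 1001
```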

Next, I tried to set up the reverse proxy with Traefik, which is what I use in production. But again, the documentation didn't mention that the LLM Wiki's WebSockets needed a specific configuration. I spent a few hours trying to figure out why the chat interface was broken, until I realized the WebSocket handshake was being dropped on the way to the backend. The fix was a Traefik headers middleware that keeps the Connection: Upgrade and Upgrade: websocket headers intact.

The Actual Solution

Here's what I ended up with. This is a production-ready setup that mirrors real-world environments, not just a demo.

Prerequisites

I'm assuming you have:

  • A Proxmox cluster with Kubernetes installed
  • A working Kubernetes cluster with at least 2 nodes
  • A working Traefik ingress controller
  • A working Longhorn storage system

Deploying the LLM Wiki

I used Helm to deploy the LLM Wiki, but I had to modify the values file to match the specific needs of the application. Here's what my values.yaml looked like:

```yaml
# values.yaml
ingress:
  enabled: true
  hosts:
    - host: "wiki.example.com"
      paths:
        - path: "/"
          pathType: Prefix
  annotations:
    # Traefik middleware references take the form <namespace>-<name>@kubernetescrd
    traefik.ingress.kubernetes.io/router.middlewares: "default-default-websocket-middleware@kubernetescrd"
    traefik.ingress.kubernetes.io/router.tls: "true"

storage:
  enabled: true
  class: "longhorn"
  accessModes:
    - ReadWriteMany
  size: "10Gi"
```

Then I created the middleware in Traefik. There's no dedicated WebSocket middleware kind; a headers middleware that pins the upgrade headers does the job:

```yaml
# traefik-middleware.yaml
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: default-websocket-middleware
  namespace: default
spec:
  headers:
    customRequestHeaders:
      Connection: "Upgrade"
      Upgrade: "websocket"
```

I then attached the middleware to the Traefik IngressRoute, which I did by modifying the ingressRoute spec in the Helm chart:

```yaml
# ingressroute.yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: wiki-ingress
spec:
  entryPoints:
    - web
    - websecure
  routes:
    - match: Host(`wiki.example.com`)
      kind: Rule
      services:
        - name: wiki
          port: 80
      middlewares:
        - name: default-websocket-middleware
  # Terminate TLS at Traefik. Passthrough is only valid on IngressRouteTCP,
  # and it would bypass the middleware anyway.
  tls: {}
```

For the storage, I used Longhorn with a ReadWriteMany access mode. That was tricky: Longhorn serves RWX volumes over NFS via its share-manager, and SQLite is not safe with multiple concurrent writers on a network filesystem. It worked for me because only a single replica ever writes to the database, and because I used a Longhorn volume with the Filesystem volume mode. That's something the documentation didn't mention.
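The PVC I ended up with, for reference (the claim name is mine; the fields are standard Kubernetes):

```yaml
# wiki-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wiki-data
spec:
  storageClassName: longhorn
  accessModes:
    - ReadWriteMany   # served by Longhorn's NFS share-manager
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi
```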

Configuring the Application

The LLM Wiki itself had a few configuration options I had to set. I added the following environment variables to the deployment spec:

```yaml
env:
  - name: WIKI_DB_URL
    # Four slashes = absolute path. "sqlite:///wiki.db" would resolve
    # relative to the process working directory instead.
    value: "sqlite:////data/wiki.db"
  - name: WIKI_HOST
    value: "wiki.example.com"
  - name: WIKI_PORT
    value: "80"
```

I also had to make sure the SQLite file had the correct permissions. I used an initContainer to create the file and set the right ownership. Note that the volume is mounted at a directory, not at the file itself: mounting a volume directly at /wiki.db would turn that path into a directory and break the touch.

```yaml
initContainers:
  - name: init-sqlite
    image: busybox
    command: ["sh", "-c", "touch /data/wiki.db && chown 1001:1001 /data/wiki.db"]
    volumeMounts:
      - name: wiki-data
        mountPath: /data
```

This was necessary because the SQLite process runs as user 1001, and without the right ownership it wouldn't be able to write to the database.
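One gotcha with the WIKI_DB_URL value: the number of slashes in an SQLAlchemy-style sqlite URL determines whether the path is relative or absolute. A tiny helper of mine (not part of the wiki) illustrates the mapping:

```python
def sqlite_url_to_path(url: str) -> str:
    """Map an SQLAlchemy-style sqlite URL to a filesystem path.

    Three slashes means a path relative to the working directory,
    four slashes means an absolute path:
        sqlite:///wiki.db       -> wiki.db
        sqlite:////data/wiki.db -> /data/wiki.db
    """
    prefix = "sqlite:///"
    if not url.startswith(prefix):
        raise ValueError(f"not a sqlite URL: {url!r}")
    return url[len(prefix):]


print(sqlite_url_to_path("sqlite:///wiki.db"))        # wiki.db
print(sqlite_url_to_path("sqlite:////data/wiki.db"))  # /data/wiki.db
```

If the container's working directory isn't what you expect, a relative URL silently creates a second, empty database, which looks exactly like "my data disappeared".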

Why It Works

The key to getting this to work was understanding the specific requirements of the LLM Wiki and how they interacted with Kubernetes and Longhorn.

  • Persistent Storage: the default PVC comes up root-owned, so the SQLite process couldn't write to it. An initContainer that creates the database file and sets the right ownership, plus matching securityContext settings, fixed that.
  • Reverse Proxy Configuration: the LLM Wiki's WebSockets required a Traefik headers middleware to keep the upgrade headers intact. That's something the documentation didn't mention, but it was necessary for the chat interface to function.
  • Longhorn Configuration: the LLM Wiki uses SQLite, which isn't safe with multiple writers on shared storage. A Longhorn volume with the Filesystem volume mode, written to by a single replica, made it work.

These are the kinds of gotchas that don't appear in the documentation, but they're critical to getting the application running in a real-world environment.

Lessons Learned

This was a learning experience, and here's what I'd do differently next time:

  • Start with a Minimal Setup: I should have started with a minimal setup and then added complexity incrementally. That would have helped me identify the issues earlier.
  • Use Real-World Tools: I should have used the same tools I use in production — like Traefik and Longhorn — from the start. That would have saved me time debugging compatibility issues.
  • Test the WebSockets Early: I should have tested the WebSockets early on to make sure they worked before deploying the application. That would have saved me from a lot of frustration.

I also learned that the documentation doesn't always cover the edge cases. For example, the LLM Wiki's documentation never mentions the Traefik middleware the WebSockets need, even though the chat interface is unusable without it.

If you're trying to run the LLM Wiki in a production-like environment, I'd recommend using the same tools you use in production — like Traefik and Longhorn — and testing the WebSockets early on. That will help you avoid a lot of the gotchas I ran into.
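The check itself is simple: a successful WebSocket upgrade is an HTTP 101 response carrying the right headers. A minimal validator (my own sketch, not part of the wiki) that you can point at any response you capture:

```python
def is_websocket_upgrade(status: int, headers: dict) -> bool:
    """Return True if an HTTP response completes a WebSocket handshake.

    A successful upgrade is a 101 status with "Connection: Upgrade" and
    "Upgrade: websocket" headers; names and values are case-insensitive.
    """
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    return (
        status == 101
        and "upgrade" in lowered.get("connection", "")
        and lowered.get("upgrade") == "websocket"
    )


# A proxy that strips the upgrade headers typically returns 200 or 502 instead.
print(is_websocket_upgrade(101, {"Connection": "Upgrade", "Upgrade": "websocket"}))  # True
print(is_websocket_upgrade(502, {}))  # False
```

Running this against the headers your proxy actually returns tells you in seconds whether the handshake survived the hop, instead of clicking around a broken chat UI.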

Finally, I'd like to mention that if you're looking for help with AI agent orchestration, Kubernetes infrastructure, or industrial IoT systems, I'm available for consulting. You can find more information at https://guatulabs.com/services.
