From 3 tok/s frustration to 21 tok/s GPU-hybrid inference - a real engineer's guide to self-hosted AI that actually works.
Why Bother Running Local LLMs?
Before we get into the how, let's address the obvious question: why not just use Claude, GPT, or Gemini?
The honest answer is - for many tasks, you should. But local LLMs make sense when:
- Privacy matters. Code, internal documents, proprietary configs - none of it leaves your machine.
- Cost at scale. API calls add up fast when you're running a coding agent all day.
- Latency control. No network round-trips, no rate limits, no API downtime.
- Offline capability. Works on a plane, in a data center, behind a firewall.
- Experimentation. Swap models freely, tune inference parameters, benchmark to your heart's content.
This guide documents a real setup - not a toy demo - built specifically to run Claude Code and pi.dev against a local model, transparently, with no API key required.
The Hardware Stack
| Component | Spec |
|---|---|
| Host | Proxmox VE 8.x, kernel 6.17.x |
| CPU | 12-core (AMD/Intel) |
| RAM | 40 GB allocated to LLM container |
| GPU | NVIDIA RTX 2000 Ada Generation Laptop (8 GB VRAM) |
| Storage | 120 GB root + /mnt/models for model files |
| Network | Tailscale mesh for remote access |
The GPU is the critical piece - even a modest 8 GB card dramatically changes what's possible.
The Architecture: What I Built
Two services, two ports, one model:
- Port 11434 - llama.cpp native OpenAI-compatible API (for pi.dev, curl, anything OpenAI-compatible)
- Port 4000 - thin Python proxy translating Anthropic Messages API to OpenAI format (for Claude Code)
Part 1: The Container - Proxmox LXC Setup
Use an LXC container rather than a full VM because:
- Near-native CPU performance (no hypervisor overhead)
- Shared host kernel means GPU passthrough works with the host's NVIDIA driver
- Faster to snapshot, clone, and manage
Container Config
File: /etc/pve/lxc/103.conf
arch: amd64
cores: 12
features: nesting=1
hostname: llm-server
memory: 40000
net0: name=eth0,bridge=vmbr0,hwaddr=BC:24:11:F0:15:BA,type=veth
ostype: debian
rootfs: local-lvm:vm-103-disk-0,size=120G
swap: 0
dev0: /dev/nvidia0,uid=0,gid=0
dev1: /dev/nvidiactl,uid=0,gid=0
dev2: /dev/nvidia-uvm,uid=0,gid=0
dev3: /dev/nvidia-uvm-tools,uid=0,gid=0
The dev0 through dev3 lines are the magic - Proxmox's native device passthrough syntax. No lxc.mount.entry hacks required in newer PVE versions.
Critical gotcha: If you ever set chattr +i on /etc/resolv.conf inside the container (e.g., to prevent Tailscale from overwriting DNS), it will break Proxmox's pre-start hook, which atomically updates the DNS config. The container won't start. Fix it from the host:
pct mount 103
chattr -i /var/lib/lxc/103/rootfs/etc/resolv.conf
pct unmount 103
Part 2: The Model - Picking the Right One
Model selection for a GPU-constrained system is non-obvious. The key insight is MoE vs Dense architecture:
| Architecture | Example | Active Params/Token | CPU Speed | GPU Benefit |
|---|---|---|---|---|
| Dense | Qwen 3.6 27B | 27B | ~3.5 tok/s | High - all layers benefit |
| MoE | Qwen 3.6 35B-A3B | ~3B | ~18 tok/s | Lower - sparse routing already fast |
| MoE | Gemma 4 26B-A4B | ~4B | ~16 tok/s | Medium - GPU boosts active layers |
The counter-intuitive result: a 35B MoE model runs 5x faster than a 27B dense model on CPU because MoE only activates a small fraction of weights per token. Don't assume smaller parameter count means faster inference.
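A rough sanity check on that claim: token generation on CPU is largely memory-bandwidth bound, so decode speed is roughly memory bandwidth divided by the bytes of weights read per token. Here is a back-of-envelope sketch - the 60 GB/s bandwidth figure and ~0.57 bytes/parameter for Q4-class quants are assumptions, and the result is an upper bound that ignores attention and KV-cache traffic:
# moe_vs_dense.py - why active parameter count, not total size, sets CPU decode speed.
# Assumes decode is memory-bandwidth bound; bandwidth and bytes/param are rough guesses.
BANDWIDTH_GBS = 60          # assumed host memory bandwidth, GB/s
BYTES_PER_PARAM = 0.57      # ~4.5 bits per parameter for Q4_K-style quants

def est_tok_per_s(active_params_billion: float) -> float:
    bytes_per_token = active_params_billion * 1e9 * BYTES_PER_PARAM
    return BANDWIDTH_GBS * 1e9 / bytes_per_token

print(f"Dense, 27B active per token: ~{est_tok_per_s(27):.1f} tok/s")
print(f"MoE,   ~3B active per token: ~{est_tok_per_s(3):.1f} tok/s")
With these numbers the dense estimate lands right around the observed ~3.5 tok/s, while the MoE estimate overshoots the measured ~18 tok/s - shared layers, routing, and KV-cache reads still cost bandwidth beyond the active expert weights - but the order-of-magnitude gap is the point.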
I chose Gemma 4 26B-A4B Q4_K_XL (15.9 GB) for its:
- Strong instruction following and coding ability
- Multimodal capability (vision via mmproj)
- 262K token context window
- 4B active parameters (MoE) - fast despite large parameter count
- Available from unsloth/gemma-4-26B-it-GGUF
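To pull the GGUF onto the box, something along these lines works with the huggingface_hub package installed (pip install huggingface_hub). The filename below is taken from the llama-server config later in this guide - verify it against the repo's file listing before running:
# download_model.py - fetch the quantized GGUF into /mnt/models.
# The exact filename is an assumption copied from the service config below.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="unsloth/gemma-4-26B-it-GGUF",
    filename="gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
    local_dir="/mnt/models",
)
print("Downloaded to", path)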
Part 3: CPU-Only First - Getting llama.cpp Running
Always start CPU-only. It's simpler, debuggable, and gives you a baseline to measure GPU gains against.
Build llama.cpp
# Inside the container
apt-get install -y git cmake build-essential libopenblas-dev
git clone https://github.com/ggml-org/llama.cpp /opt/llm/llama.cpp
cd /opt/llm/llama.cpp && mkdir build && cd build
cmake -DGGML_CUDA=OFF .. && make -j$(nproc) llama-server
Systemd Service
File: /etc/systemd/system/llama-server.service
[Unit]
Description=LLM Inference Server (llama.cpp)
After=network.target
[Service]
Type=simple
User=root
ExecStart=/opt/llm/llama.cpp/build/bin/llama-server \
-m /mnt/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
-t 11 \
-c 32768 \
--batch-size 512 \
--ubatch-size 128 \
-ngl 0 \
-fa on \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--no-mmap \
--reasoning off \
--host 0.0.0.0 \
--port 11434
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Key Flags Explained
| Flag | Why |
|---|---|
| -t 11 | Use 11 of 12 cores - leave 1 for the OS |
| -c 32768 | 32K context - Claude Code's system prompt alone is ~24K tokens |
| --batch-size 512 | Larger batches = higher throughput during prompt processing |
| --cache-type-k/v q4_0 | 4-bit KV cache - 75% smaller than f16, minimal quality loss |
| --no-mmap | Load model fully into RAM - avoids slow first requests |
| --reasoning off | Disable Gemma's thinking mode - outputs go to reasoning_content by default, which breaks OpenAI clients |
| -fa on | Flash Attention - ~30% faster, same quality |
CPU baseline: ~16-18 tok/s (MoE model, 4B active params)
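To reproduce that number rather than eyeballing the server log, a minimal benchmark against the OpenAI-compatible endpoint does the job. A sketch - run it inside the container or swap in the container's IP; the model name is informational for llama-server and the measured rate includes prompt processing, so it slightly understates pure generation speed:
# bench_local.py - rough end-to-end tokens/sec against llama-server.
import json, time, urllib.request

URL = "http://localhost:11434/v1/chat/completions"
payload = {
    "model": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
    "messages": [{"role": "user", "content": "Explain what a KV cache is, in two short paragraphs."}],
    "max_tokens": 256,
    "stream": False,
}

req = urllib.request.Request(URL, data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})
start = time.time()
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
elapsed = time.time() - start

out_tokens = body["usage"]["completion_tokens"]
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.1f} tok/s")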
Part 4: GPU Passthrough - The Hard Part
This is where most guides give up or give wrong advice. Here's what actually works.
Step 1: Install the Right Driver on the Host
The host kernel was 6.17.x (PVE custom kernel). Debian's packaged NVIDIA driver (550) doesn't support kernels past ~6.8. The solution is NVIDIA's official CUDA repo for Debian 13.
# On the Proxmox host
wget https://developer.download.nvidia.com/compute/cuda/repos/debian13/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update
apt-get install -y nvidia-driver-595 nvidia-open-kernel-dkms
This installs driver 595.71.05 with DKMS support - it automatically builds kernel modules for all installed kernels including the PVE 6.17.x series.
Verify with nvidia-smi on the host. It should show your GPU.
Step 2: Add GPU Devices to the Container Config
In /etc/pve/lxc/103.conf, add:
dev0: /dev/nvidia0,uid=0,gid=0
dev1: /dev/nvidiactl,uid=0,gid=0
dev2: /dev/nvidia-uvm,uid=0,gid=0
dev3: /dev/nvidia-uvm-tools,uid=0,gid=0
Restart the container:
pct stop 103 && pct start 103
Step 3: Verify GPU Visibility Inside the Container
ls -la /dev/nvidia*
# crw-rw---- 1 root root 195, 0 /dev/nvidia0
# crw-rw---- 1 root root 195, 255 /dev/nvidiactl
# crw-rw---- 1 root root 505, 0 /dev/nvidia-uvm
# crw-rw---- 1 root root 505, 1 /dev/nvidia-uvm-tools
The devices are visible - but nvidia-smi won't work yet. You need userspace libraries.
Part 5: CUDA Inside LXC - Making the GPU Work
The container needs NVIDIA userspace libraries that exactly match the host driver version (595.71.05). Version mismatch causes nvidia-smi: Failed to initialize NVML.
Install Matching Userspace Libraries
# Inside the container - add the CUDA repo for Debian 12
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update
# Install userspace libraries pinned to host driver version
apt-get install -y libnvidia-ml1=595.71.05-1 nvidia-driver-cuda=595.71.05-1
# Verify
nvidia-smi
Expected output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05 Driver Version: 595.71.05 CUDA Version: 13.2 |
+-----------------------------------------+------------------------+----------------------+
| 0 NVIDIA RTX 2000 Ada Gene... On | 00000000:01:00.0 Off | N/A |
| N/A 53C P3 11W / 39W | 0MiB / 8188MiB | 0% Default |
Now install the CUDA toolkit for building llama.cpp:
apt-get install -y --no-install-recommends cuda-toolkit-12-6
export PATH=/usr/local/cuda/bin:$PATH
nvcc --version
Rebuild llama.cpp with CUDA
cd /opt/llm/llama.cpp
rm -rf build && mkdir build && cd build
# sm_89 = Ada Lovelace (RTX 2000 Ada, RTX 4xxx series)
# Use sm_86 for Ampere (RTX 3xxx), sm_75 for Turing (RTX 2xxx)
PATH=/usr/local/cuda/bin:$PATH cmake \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=89 \
..
make -j$(nproc) llama-server
Verify CUDA is linked:
ldd build/bin/llama-server | grep cuda
# libggml-cuda.so.0 => .../libggml-cuda.so.0
# libcudart.so.12 => .../libcudart.so.12
# libcublas.so.12 => .../libcublas.so.12
Finding the Optimal GPU Layer Count
With 8 GB VRAM and a 15.9 GB model, you can't fit everything on GPU. The math:
- Model has 30 transformer layers
- Available VRAM after system overhead: ~7.5 GB
- Per-layer cost: ~600-700 MB
- Safe layer count: 12 layers (leaves ~700 MB free for compute buffers)
Start low and increase until you hit OOM, then back off one step. Update the service with -ngl 12 and restart:
systemctl restart llama-server
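If you later change the model or quantization, the same arithmetic is worth redoing. A tiny helper using the estimates from the list above - treat the output as a starting guess, and read the real per-layer size from llama-server's startup log for anything precise:
# ngl_estimate.py - crude starting point for the -ngl value.
# All inputs are the estimates from the list above, not measured values.
VRAM_FREE_GB   = 7.5   # usable VRAM after system overhead
PER_LAYER_GB   = 0.6   # low end of the ~600-700 MB per-layer estimate
COMPUTE_BUF_GB = 0.7   # headroom for CUDA compute buffers

ngl = int((VRAM_FREE_GB - COMPUTE_BUF_GB) / PER_LAYER_GB)
print(f"Start with -ngl {ngl}")   # 11 with these numbers; this box ended up stable at 12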
GPU-hybrid result: ~21 tok/s (vs 16-18 tok/s CPU-only). The GPU hits 60%+ SM utilization during inference.
Part 6: The Proxy - Bridging Claude Code to Your LLM
Here's the problem nobody warns you about: Claude Code uses the Anthropic Messages API format, while llama.cpp serves the OpenAI Chat Completions format. They're incompatible.
The proxy handles both sync and streaming (SSE) responses - Claude Code uses streaming for the interactive terminal experience.
Key Translation Points
| Anthropic | OpenAI |
|---|---|
| system (string or content array) | messages[0] with role: system |
| content (array of blocks) | content (plain string) |
| content[0].text | choices[0].message.content |
| SSE content_block_delta events | SSE choices[0].delta.content chunks |
| stop_reason: end_turn | finish_reason: stop |
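For orientation, here is a stripped-down sketch of what that translation can look like for the non-streaming path. It is deliberately minimal - no SSE streaming, no tool use, no error handling - and the /v1/messages route and upstream port are assumptions consistent with the rest of this guide, not the actual proxy code:
# anthropic_proxy_sketch.py - minimal Anthropic Messages -> OpenAI Chat
# Completions translation (non-streaming only). Illustrative sketch, not
# the full ~150-line proxy described below.
from aiohttp import web, ClientSession

LLAMA_URL = "http://127.0.0.1:11434/v1/chat/completions"

def to_openai(req: dict) -> dict:
    messages = []
    # Anthropic "system" can be a string or a list of text blocks.
    system = req.get("system")
    if isinstance(system, list):
        system = "".join(b.get("text", "") for b in system)
    if system:
        messages.append({"role": "system", "content": system})
    # Anthropic message content is a list of blocks; flatten text blocks to a string.
    for m in req.get("messages", []):
        content = m["content"]
        if isinstance(content, list):
            content = "".join(b.get("text", "") for b in content if b.get("type") == "text")
        messages.append({"role": m["role"], "content": content})
    return {"messages": messages, "max_tokens": req.get("max_tokens", 1024), "stream": False}

def to_anthropic(resp: dict, model: str) -> dict:
    text = resp["choices"][0]["message"]["content"]
    usage = resp.get("usage", {})
    return {
        "id": resp.get("id", "msg_local"),
        "type": "message",
        "role": "assistant",
        "model": model,
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",   # map finish_reason: stop -> end_turn
        "usage": {
            "input_tokens": usage.get("prompt_tokens", 0),
            "output_tokens": usage.get("completion_tokens", 0),
        },
    }

async def handle_messages(request: web.Request) -> web.Response:
    body = await request.json()
    async with ClientSession() as session:
        async with session.post(LLAMA_URL, json=to_openai(body)) as upstream:
            data = await upstream.json()
    return web.json_response(to_anthropic(data, body.get("model", "local")))

app = web.Application()
app.router.add_post("/v1/messages", handle_messages)

if __name__ == "__main__":
    web.run_app(app, host="0.0.0.0", port=4000)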
The full proxy is ~150 lines of Python using aiohttp. Run it as a systemd service on port 4000:
# /etc/systemd/system/anthropic-proxy.service
[Unit]
Description=Anthropic API Proxy for llama.cpp
After=network.target llama-server.service
[Service]
Type=simple
User=root
ExecStart=/usr/bin/python3 /opt/anthropic-proxy.py
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
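Before wiring up Claude Code, it's worth confirming the proxy answers a raw Anthropic-style request. A quick check along these lines works - the /v1/messages path matches the sketch above, and this proxy ignores the API key entirely:
# proxy_smoke_test.py - send one Anthropic Messages request to the proxy.
import json, urllib.request

payload = {
    "model": "local",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Say hello in five words."}],
}
req = urllib.request.Request(
    "http://192.168.100.103:4000/v1/messages",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "x-api-key": "sk-no-key-required"},
)
with urllib.request.urlopen(req) as resp:
    msg = json.loads(resp.read())
print(msg["content"][0]["text"])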
Claude Code Config
File: ~/.claude/settings.json on the client machine
{
"env": {
"ANTHROPIC_BASE_URL": "http://192.168.100.103:4000",
"ANTHROPIC_API_KEY": "sk-no-key-required"
}
}
Test it:
claude -p "hi"
# Hello! How can I help you today?
Why not LiteLLM? I tried it. LiteLLM 1.83 introduced a ResponsesAPIResponse type internally that fails validation when converting back to AnthropicResponse. Requests hang silently with no error returned to the client. The 150-line custom proxy was faster to write and debug than fighting the library.
Part 7: pi.dev Integration
pi.dev speaks OpenAI format natively - no proxy needed, connect directly to port 11434.
File: ~/.pi/agent/models.json
{
"providers": {
"llama-local": {
"baseUrl": "http://192.168.100.103:11434/v1",
"api": "openai-completions",
"apiKey": "sk-dummy",
"compat": {
"supportsDeveloperRole": false,
"supportsReasoningEffort": false
},
"models": [
{
"id": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
"name": "Gemma 4 26B (Local GPU)",
"reasoning": false,
"input": ["text"],
"contextWindow": 32768,
"maxTokens": 8192,
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
}
]
}
}
}
The compat block is important - llama.cpp doesn't understand the developer role (used by pi for reasoning-capable models) or the reasoning_effort parameter. Setting both to false makes pi send standard system messages instead.
Open pi and type /model to select your local model. The file reloads automatically - no restart needed.
Performance Results
| Stage | Speed | Notes |
|---|---|---|
| CPU-only (Dense 27B) | ~3.5 tok/s | Wrong model choice - dense is slow on CPU |
| CPU-only (MoE 35B) | ~18 tok/s | Switched to MoE - massive improvement |
| CPU-only (Gemma 4 26B MoE) | ~16 tok/s | Better model quality, similar speed |
| GPU hybrid (12/30 layers) | ~21 tok/s | 30% improvement, GPU at 60%+ utilization |
| Prompt processing (prefill) | ~40 tok/s | GPU accelerates context loading significantly |
Typical response times:
- "hi" - ~1 second
- 100-token code explanation - ~5 seconds
- 500-token code generation - ~25 seconds
The GPU contributes most to prompt processing speed - loading a large codebase context into the KV cache is noticeably faster with GPU layers active.
Gotchas and Lessons Learned
1. chattr +i on resolv.conf breaks container startup
Proxmox's pre-start hook atomically renames a temp file to /etc/resolv.conf. The immutable flag blocks this. The container fails silently - only visible in lxc-start -l DEBUG logs as close (rename) atomic file failed: Operation not permitted.
2. Driver version must match exactly
Userspace libraries inside the container must match the host kernel module version exactly. A mismatch causes nvidia-smi: Failed to initialize NVML. Pin the version explicitly: apt-get install libnvidia-ml1=595.71.05-1.
3. Kernel 6.17 + Debian driver 550 = build failure
Debian's packaged NVIDIA driver 550 has no DKMS support for kernels past ~6.8. The fix is NVIDIA's official CUDA repo for Debian 13 (debian13), which ships driver 595 with working DKMS for modern kernels.
4. MoE vs Dense - the counterintuitive performance flip
A 35B MoE model genuinely outperforms a 27B dense model on CPU because sparse activation means only ~3-4B parameters are computed per token. Never assume smaller parameter count means faster inference - check the architecture first.
5. Gemma 4 thinks by default
Gemma 4 uses internal chain-of-thought thinking mode by default. With streaming, the client receives reasoning_content but empty content until thinking completes. For chat interfaces that expect immediate tokens, add --reasoning off. For code accuracy tasks, leaving it enabled is worth the latency cost.
6. LiteLLM 1.83 hangs silently
The latest LiteLLM uses a new ResponsesAPIResponse type that fails Pydantic validation when serializing to AnthropicResponse. The request completes internally but the response is never sent to the client. No error, no timeout - just silence.
7. Context window must exceed Claude Code's system prompt
Claude Code's built-in system prompt is approximately 24K tokens. A context window below 32K triggers an immediate exceed_context_size_error before any user message is processed. Set -c 32768 as the minimum.
What's Next
Short term:
- Re-enable thinking mode selectively by passing budget_tokens per request
- Add a /v1/models endpoint to the proxy for model auto-discovery
For a dedicated thinking model:
- DeepSeek-R1-Distill-Qwen-14B (~8 GB Q4) - fits almost entirely in 8 GB VRAM, estimated 30-40 tok/s, purpose-built for reasoning tasks
For bigger hardware:
- RTX 3090 or 4090 (24 GB VRAM) - entire Gemma 4-26B fits on GPU, estimated 60-80 tok/s
- A dual-GPU setup with NVLink enables running 70B models entirely on GPU
The Stack, Summarized
Model: Gemma 4 26B-A4B Q4_K_XL (15.9 GB, MoE)
Engine: llama.cpp with CUDA (sm_89, Ada Lovelace)
GPU: RTX 2000 Ada 8 GB - 12/30 layers on GPU
Speed: ~21 tok/s generation, ~40 tok/s prefill
Proxy: Python aiohttp - Anthropic <-> OpenAI translation
Clients: Claude Code (port 4000), pi.dev (port 11434)
Access: LAN (192.168.100.103) + Tailscale
Cost: $0 per query after hardware
The entire setup took about 8 hours of real iteration. Most of that time went to three of the gotchas above - the chattr trap, the kernel/driver mismatch, and the LiteLLM silent hang. Hopefully this guide saves you all of it.
Built on Proxmox 8.x · llama.cpp · NVIDIA driver 595.71.05 · CUDA 12.6 · Gemma 4 26B · May 2026

