Clint

Your AI, Your Rules: Running a Local LLM with GPU Acceleration on Proxmox

From 3 tok/s frustration to 21 tok/s GPU-hybrid inference - a real engineer's guide to self-hosted AI that actually works.


Why Bother Running Local LLMs?

Before we get into the how, let's address the obvious question: why not just use Claude, GPT, or Gemini?

The honest answer is - for many tasks, you should. But local LLMs make sense when:

  • Privacy matters. Code, internal documents, proprietary configs - none of it leaves your machine.
  • Cost at scale. API calls add up fast when you're running a coding agent all day.
  • Latency control. No network round-trips, no rate limits, no API downtime.
  • Offline capability. Works on a plane, in a data center, behind a firewall.
  • Experimentation. Swap models freely, tune inference parameters, benchmark to your heart's content.

This guide documents a real setup - not a toy demo - built specifically to run Claude Code and pi.dev against a local model, transparently, with no API key required.


The Hardware Stack

| Component | Spec |
| --- | --- |
| Host | Proxmox VE 8.x, kernel 6.17.x |
| CPU | 12-core (AMD/Intel) |
| RAM | 40 GB allocated to LLM container |
| GPU | NVIDIA RTX 2000 Ada Generation Laptop (8 GB VRAM) |
| Storage | 120 GB root + /mnt/models for model files |
| Network | Tailscale mesh for remote access |

The GPU is the critical piece - even a modest 8 GB card dramatically changes what's possible.


The Architecture: What I Built

Two services, two ports, one model:

  • Port 11434 - llama.cpp native OpenAI-compatible API (for pi.dev, curl, anything OpenAI-compatible)
  • Port 4000 - thin Python proxy translating Anthropic Messages API to OpenAI format (for Claude Code)

Part 1: The Container - Proxmox LXC Setup

Use an LXC container rather than a full VM because:

  • Near-native CPU performance (no hypervisor overhead)
  • Shared host kernel means GPU passthrough works with the host's NVIDIA driver
  • Faster to snapshot, clone, and manage

Container Config

File: /etc/pve/lxc/103.conf

arch: amd64
cores: 12
features: nesting=1
hostname: llm-server
memory: 40000
net0: name=eth0,bridge=vmbr0,hwaddr=BC:24:11:F0:15:BA,type=veth
ostype: debian
rootfs: local-lvm:vm-103-disk-0,size=120G
swap: 0
dev0: /dev/nvidia0,uid=0,gid=0
dev1: /dev/nvidiactl,uid=0,gid=0
dev2: /dev/nvidia-uvm,uid=0,gid=0
dev3: /dev/nvidia-uvm-tools,uid=0,gid=0

The dev0 through dev3 lines are the magic - Proxmox's native device passthrough syntax. No lxc.mount.entry hacks required in newer PVE versions.

Critical gotcha: If you ever set chattr +i on /etc/resolv.conf inside the container (e.g., to prevent Tailscale from overwriting DNS), it will break Proxmox's pre-start hook which atomically updates the DNS config. The container won't start. Fix it from the host:

pct mount 103
chattr -i /var/lib/lxc/103/rootfs/etc/resolv.conf
pct unmount 103

Part 2: The Model - Picking the Right One

Model selection for a GPU-constrained system is non-obvious. The key insight is MoE vs Dense architecture:

| Architecture | Example | Active Params/Token | CPU Speed | GPU Benefit |
| --- | --- | --- | --- | --- |
| Dense | Qwen 3.6 27B | 27B | ~3.5 tok/s | High - all layers benefit |
| MoE | Qwen 3.6 35B-A3B | ~3B | ~18 tok/s | Lower - sparse routing already fast |
| MoE | Gemma 4 26B-A4B | ~4B | ~16 tok/s | Medium - GPU boosts active layers |

The counter-intuitive result: a 35B MoE model runs 5x faster than a 27B dense model on CPU because MoE only activates a small fraction of weights per token. Don't assume smaller parameter count means faster inference.
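
A rough back-of-envelope check makes the gap plausible: CPU generation is roughly memory-bandwidth bound, so time per token scales with the bytes of active weights read per token. The sketch below assumes ~0.55 bytes per parameter for Q4-class quants and ~55 GB/s of usable memory bandwidth - both are illustrative assumptions, not measurements from this system.

# Back-of-envelope: CPU token generation is roughly memory-bandwidth bound.
# Each generated token requires reading every ACTIVE weight once.
BYTES_PER_PARAM_Q4 = 0.55   # ~4.4 bits/param for Q4_K-class quants (assumption)
MEM_BANDWIDTH_GBS = 55.0    # assumed usable DDR5 bandwidth on this host (assumption)

def est_tok_per_s(active_params_billion: float) -> float:
    """Tokens/sec if generation were purely limited by reading active weights."""
    bytes_per_token = active_params_billion * 1e9 * BYTES_PER_PARAM_Q4
    return MEM_BANDWIDTH_GBS * 1e9 / bytes_per_token

print(f"Dense 27B   : ~{est_tok_per_s(27):.1f} tok/s")  # ~3.7
print(f"MoE ~3B act : ~{est_tok_per_s(3):.1f} tok/s")   # ~33
print(f"MoE ~4B act : ~{est_tok_per_s(4):.1f} tok/s")   # ~25

Real numbers come in lower - attention, KV-cache reads, expert routing, and sampling all add overhead - but the ordering matches the table above.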

I chose Gemma 4 26B-A4B Q4_K_XL (15.9 GB) for its:

  • Strong instruction following and coding ability
  • Multimodal capability (vision via mmproj)
  • 262K token context window
  • 4B active parameters (MoE) - fast despite large parameter count
  • Available from unsloth/gemma-4-26B-it-GGUF

Part 3: CPU-Only First - Getting llama.cpp Running

Always start CPU-only. It's simpler, debuggable, and gives you a baseline to measure GPU gains against.

Build llama.cpp

# Inside the container
apt-get install -y git cmake build-essential libopenblas-dev

git clone https://github.com/ggml-org/llama.cpp /opt/llm/llama.cpp
cd /opt/llm/llama.cpp && mkdir build && cd build
cmake -DGGML_CUDA=OFF .. && make -j$(nproc) llama-server

Systemd Service

File: /etc/systemd/system/llama-server.service

[Unit]
Description=LLM Inference Server (llama.cpp)
After=network.target

[Service]
Type=simple
User=root
ExecStart=/opt/llm/llama.cpp/build/bin/llama-server \
  -m /mnt/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  -t 11 \
  -c 32768 \
  --batch-size 512 \
  --ubatch-size 128 \
  -ngl 0 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --no-mmap \
  --reasoning off \
  --host 0.0.0.0 \
  --port 11434
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Key Flags Explained

| Flag | Why |
| --- | --- |
| -t 11 | Use 11 of 12 cores - leave 1 for the OS |
| -c 32768 | 32K context - Claude Code's system prompt alone is ~24K tokens |
| --batch-size 512 | Larger batches = higher throughput during prompt processing |
| --cache-type-k/v q4_0 | 4-bit KV cache - 75% smaller than f16, minimal quality loss |
| --no-mmap | Load model fully into RAM - avoids slow first requests |
| --reasoning off | Disable Gemma's thinking mode - outputs go to reasoning_content by default, which breaks OpenAI clients |
| -fa on | Flash Attention - ~30% faster, same quality |

CPU baseline: ~16-18 tok/s (MoE model, 4B active params)
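
To verify the server and measure your own baseline, a minimal client against the OpenAI-compatible endpoint is enough. This is a sketch: the IP and port match the service config above, the model name is effectively ignored by llama-server, and wall-clock tok/s slightly understates pure generation speed because it includes prompt processing.

# Quick sanity check + generation-speed estimate against llama-server's
# OpenAI-compatible endpoint. Adjust the address to your container.
import time
import requests

URL = "http://192.168.100.103:11434/v1/chat/completions"

payload = {
    "model": "local",  # llama-server serves whichever model it loaded
    "messages": [{"role": "user", "content": "Explain what an LXC container is in two sentences."}],
    "max_tokens": 200,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=300)
elapsed = time.time() - start
data = resp.json()

completion_tokens = data["usage"]["completion_tokens"]
print(data["choices"][0]["message"]["content"])
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> ~{completion_tokens / elapsed:.1f} tok/s")

Rerun the same script after enabling GPU layers in Part 5 to see the before/after difference on your own hardware.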


Part 4: GPU Passthrough - The Hard Part

This is where most guides give up or give wrong advice. Here's what actually works.

Step 1: Install the Right Driver on the Host

The host kernel was 6.17.x (PVE custom kernel). Debian's packaged NVIDIA driver (550) doesn't support kernels past ~6.8. The solution is NVIDIA's official CUDA repo for Debian 13.

# On the Proxmox host
wget https://developer.download.nvidia.com/compute/cuda/repos/debian13/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update
apt-get install -y nvidia-driver-595 nvidia-open-kernel-dkms

This installs driver 595.71.05 with DKMS support - it automatically builds kernel modules for all installed kernels including the PVE 6.17.x series.

Verify with nvidia-smi on the host. It should show your GPU.

Step 2: Add GPU Devices to the Container Config

In /etc/pve/lxc/103.conf, add:

dev0: /dev/nvidia0,uid=0,gid=0
dev1: /dev/nvidiactl,uid=0,gid=0
dev2: /dev/nvidia-uvm,uid=0,gid=0
dev3: /dev/nvidia-uvm-tools,uid=0,gid=0

Restart the container:

pct stop 103 && pct start 103

Step 3: Verify GPU Visibility Inside the Container

ls -la /dev/nvidia*
# crw-rw---- 1 root root 195,   0 /dev/nvidia0
# crw-rw---- 1 root root 195, 255 /dev/nvidiactl
# crw-rw---- 1 root root 505,   0 /dev/nvidia-uvm
# crw-rw---- 1 root root 505,   1 /dev/nvidia-uvm-tools

The devices are visible - but nvidia-smi won't work yet. You need userspace libraries.


Part 5: CUDA Inside LXC - Making the GPU Work

The container needs NVIDIA userspace libraries that exactly match the host driver version (595.71.05). Version mismatch causes nvidia-smi: Failed to initialize NVML.

Install Matching Userspace Libraries

# Inside the container - add the CUDA repo for Debian 12
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update

# Install userspace libraries pinned to host driver version
apt-get install -y libnvidia-ml1=595.71.05-1 nvidia-driver-cuda=595.71.05-1

# Verify
nvidia-smi

Expected output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05   Driver Version: 595.71.05      CUDA Version: 13.2               |
+-----------------------------------------+------------------------+----------------------+
|   0  NVIDIA RTX 2000 Ada Gene...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   53C    P3             11W /   39W |       0MiB /   8188MiB |      0%      Default |

Now install the CUDA toolkit for building llama.cpp:

apt-get install -y --no-install-recommends cuda-toolkit-12-6
export PATH=/usr/local/cuda/bin:$PATH
nvcc --version

Rebuild llama.cpp with CUDA

cd /opt/llm/llama.cpp
rm -rf build && mkdir build && cd build

# sm_89 = Ada Lovelace (RTX 2000 Ada, RTX 4xxx series)
# Use sm_86 for Ampere (RTX 3xxx), sm_75 for Turing (RTX 2xxx)
PATH=/usr/local/cuda/bin:$PATH cmake \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=89 \
  ..

make -j$(nproc) llama-server

Verify CUDA is linked:

ldd build/bin/llama-server | grep cuda
# libggml-cuda.so.0 => .../libggml-cuda.so.0
# libcudart.so.12  => .../libcudart.so.12
# libcublas.so.12  => .../libcublas.so.12

Finding the Optimal GPU Layer Count

With 8 GB VRAM and a 15.9 GB model, you can't fit everything on GPU. The math:

  • Model has 30 transformer layers
  • Available VRAM after system overhead: ~7.5 GB
  • Per-layer cost: ~600-700 MB
  • Safe layer count: 12 layers (leaves ~700 MB free for compute buffers)
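
A rough way to turn that arithmetic into a starting -ngl value (a sketch using the numbers above; the compute-buffer reserve is an assumption):

# Rough -ngl estimate from the numbers above. Treat it as a starting point.
MODEL_GB     = 15.9   # Q4_K_XL file size
NUM_LAYERS   = 30
USABLE_VRAM  = 7.5    # GB left after display/system overhead
COMPUTE_RESV = 0.8    # GB assumed reserve for CUDA compute buffers (assumption)

per_layer_gb = MODEL_GB / NUM_LAYERS            # ~0.53 GB of weights per layer
ngl = int((USABLE_VRAM - COMPUTE_RESV) / per_layer_gb)
print(f"~{per_layer_gb:.2f} GB/layer -> start with -ngl {ngl}")   # -> -ngl 12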

Start low and increase until you hit OOM, then back off one step. Update the service with -ngl 12 and restart:

systemctl restart llama-server

GPU-hybrid result: ~21 tok/s (vs 16-18 tok/s CPU-only). The GPU hits 60%+ SM utilization during inference.


Part 6: The Proxy - Bridging Claude Code to Your LLM

Here's the problem nobody warns you about: Claude Code uses the Anthropic Messages API format, while llama.cpp serves the OpenAI Chat Completions format. They're incompatible.

Claude Code -> Anthropic-format proxy (port 4000) -> llama.cpp OpenAI API (port 11434)

The proxy handles both sync and streaming (SSE) responses - Claude Code uses streaming for the interactive terminal experience.

Key Translation Points

| Anthropic | OpenAI |
| --- | --- |
| system (string or content array) | messages[0] with role: system |
| content (array of blocks) | content (plain string) |
| content[0].text | choices[0].message.content |
| SSE content_block_delta events | SSE choices[0].delta.content chunks |
| stop_reason: end_turn | finish_reason: stop |
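
A heavily simplified, non-streaming sketch of that mapping is below. It follows the two public API shapes, but streaming (SSE), tool use, and error handling - which the real proxy needs - are omitted, and the upstream URL is simply the llama.cpp endpoint from earlier.

# Minimal, non-streaming Anthropic -> OpenAI translation sketch (aiohttp).
import aiohttp
from aiohttp import web

UPSTREAM = "http://127.0.0.1:11434/v1/chat/completions"  # llama.cpp OpenAI endpoint

def to_text(content):
    """Flatten an Anthropic content value (string or list of blocks) to plain text."""
    if isinstance(content, str):
        return content
    return "".join(b.get("text", "") for b in content if b.get("type") == "text")

async def messages(request: web.Request) -> web.Response:
    body = await request.json()

    # Anthropic top-level "system" -> OpenAI first message with role "system"
    oai_messages = []
    if body.get("system"):
        oai_messages.append({"role": "system", "content": to_text(body["system"])})
    for m in body.get("messages", []):
        oai_messages.append({"role": m["role"], "content": to_text(m["content"])})

    payload = {
        "model": body.get("model", "local"),
        "messages": oai_messages,
        "max_tokens": body.get("max_tokens", 1024),
        "temperature": body.get("temperature", 0.7),
    }

    async with aiohttp.ClientSession() as session:
        async with session.post(UPSTREAM, json=payload) as resp:
            oai = await resp.json()

    choice = oai["choices"][0]
    # OpenAI choices[0].message.content -> Anthropic content[0].text
    return web.json_response({
        "id": oai.get("id", "msg_local"),
        "type": "message",
        "role": "assistant",
        "model": body.get("model", "local"),
        "content": [{"type": "text", "text": choice["message"]["content"]}],
        "stop_reason": "end_turn" if choice.get("finish_reason") == "stop" else "max_tokens",
        "usage": {
            "input_tokens": oai.get("usage", {}).get("prompt_tokens", 0),
            "output_tokens": oai.get("usage", {}).get("completion_tokens", 0),
        },
    })

app = web.Application()
app.add_routes([web.post("/v1/messages", messages)])

if __name__ == "__main__":
    web.run_app(app, host="0.0.0.0", port=4000)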

The full proxy is ~150 lines of Python using aiohttp. Run it as a systemd service on port 4000:

# /etc/systemd/system/anthropic-proxy.service
[Unit]
Description=Anthropic API Proxy for llama.cpp
After=network.target llama-server.service

[Service]
Type=simple
User=root
ExecStart=/usr/bin/python3 /opt/anthropic-proxy.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Claude Code Config

File: ~/.claude/settings.json on the client machine

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://192.168.100.103:4000",
    "ANTHROPIC_API_KEY": "sk-no-key-required"
  }
}

Test it:

claude -p "hi"
# Hello! How can I help you today?

Why not LiteLLM? I tried it. LiteLLM 1.83 introduced a ResponsesAPIResponse type internally that fails validation when converting back to AnthropicResponse. Requests hang silently with no error returned to the client. The 150-line custom proxy was faster to write and debug than fighting the library.


Part 7: pi.dev Integration

pi.dev speaks OpenAI format natively - no proxy needed, connect directly to port 11434.

File: ~/.pi/agent/models.json

{
  "providers": {
    "llama-local": {
      "baseUrl": "http://192.168.100.103:11434/v1",
      "api": "openai-completions",
      "apiKey": "sk-dummy",
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false
      },
      "models": [
        {
          "id": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
          "name": "Gemma 4 26B (Local GPU)",
          "reasoning": false,
          "input": ["text"],
          "contextWindow": 32768,
          "maxTokens": 8192,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        }
      ]
    }
  }
}

The compat block is important - llama.cpp doesn't understand the developer role (used by pi for reasoning-capable models) or the reasoning_effort parameter. Setting both to false makes pi send standard system messages instead.

Open pi and type /model to select your local model. The file reloads automatically - no restart needed.


Performance Results

| Stage | Speed | Notes |
| --- | --- | --- |
| CPU-only (Dense 27B) | ~3.5 tok/s | Wrong model choice - dense is slow on CPU |
| CPU-only (MoE 35B) | ~18 tok/s | Switched to MoE - massive improvement |
| CPU-only (Gemma 4 26B MoE) | ~16 tok/s | Better model quality, similar speed |
| GPU hybrid (12/30 layers) | ~21 tok/s | 30% improvement, GPU at 60%+ utilization |
| Prompt processing (prefill) | ~40 tok/s | GPU accelerates context loading significantly |

Typical response times:

  • "hi" - ~1 second
  • 100-token code explanation - ~5 seconds
  • 500-token code generation - ~25 seconds

The GPU contributes most to prompt processing speed - loading a large codebase context into the KV cache is noticeably faster with GPU layers active.


Gotchas and Lessons Learned

1. chattr +i on resolv.conf breaks container startup

Proxmox's pre-start hook atomically renames a temp file to /etc/resolv.conf. The immutable flag blocks this. The container fails silently - only visible in lxc-start -l DEBUG logs as close (rename) atomic file failed: Operation not permitted.

2. Driver version must match exactly

Userspace libraries inside the container must match the host kernel module version exactly. A mismatch causes nvidia-smi: Failed to initialize NVML. Pin the version explicitly: apt-get install libnvidia-ml1=595.71.05-1.

3. Kernel 6.17 + Debian driver 550 = build failure

Debian's packaged NVIDIA driver 550 has no DKMS support for kernels past ~6.8. The fix is NVIDIA's official CUDA repo for Debian 13 (debian13), which ships driver 595 with working DKMS for modern kernels.

4. MoE vs Dense - the counterintuitive performance flip

A 35B MoE model genuinely outperforms a 27B dense model on CPU because sparse activation means only ~3-4B parameters are computed per token. Never assume smaller parameter count means faster inference - check the architecture first.

5. Gemma 4 thinks by default

Gemma 4 uses internal chain-of-thought thinking mode by default. With streaming, the client receives reasoning_content but empty content until thinking completes. For chat interfaces that expect immediate tokens, add --reasoning off. For code accuracy tasks, leaving it enabled is worth the latency cost.

6. LiteLLM 1.83 hangs silently

The latest LiteLLM uses a new ResponsesAPIResponse type that fails Pydantic validation when serializing to AnthropicResponse. The request completes internally but the response is never sent to the client. No error, no timeout - just silence.

7. Context window must exceed Claude Code's system prompt

Claude Code's built-in system prompt is approximately 24K tokens. A context window below 32K triggers an immediate exceed_context_size_error before any user message is processed. Set -c 32768 as the minimum.
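
To check a prompt against that budget before sending it, llama-server's native /tokenize endpoint will count tokens for you - a small sketch, assuming the server address used throughout this guide:

# Count tokens for a prompt file against the configured context window.
import sys
import requests

SERVER = "http://192.168.100.103:11434"
CTX = 32768   # must match the -c value in llama-server.service

text = open(sys.argv[1]).read()
tokens = requests.post(f"{SERVER}/tokenize", json={"content": text}).json()["tokens"]
print(f"{len(tokens)} tokens of {CTX} available")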


What's Next

Short term:

  • Re-enable thinking mode selectively by passing budget_tokens per request
  • Add a /v1/models endpoint to the proxy for model auto-discovery

For a dedicated thinking model:

  • DeepSeek-R1-Distill-Qwen-14B (~8 GB Q4) - fits almost entirely in 8 GB VRAM, estimated 30-40 tok/s, purpose-built for reasoning tasks

For bigger hardware:

  • RTX 3090 or 4090 (24 GB VRAM) - entire Gemma 4-26B fits on GPU, estimated 60-80 tok/s
  • A dual-GPU setup with NVLink enables running 70B models entirely on GPU

The Stack, Summarized

Model:    Gemma 4 26B-A4B Q4_K_XL (15.9 GB, MoE)
Engine:   llama.cpp with CUDA (sm_89, Ada Lovelace)
GPU:      RTX 2000 Ada 8 GB - 12/30 layers on GPU
Speed:    ~21 tok/s generation, ~40 tok/s prefill
Proxy:    Python aiohttp - Anthropic <-> OpenAI translation
Clients:  Claude Code (port 4000), pi.dev (port 11434)
Access:   LAN (192.168.100.103) + Tailscale
Cost:     $0 per query after hardware

The entire setup took about 8 hours of real iteration. Most of that time went to three of the gotchas above - the chattr trap, the kernel/driver mismatch, and the LiteLLM silent hang. Hopefully this guide saves you all of it.


Built on Proxmox 8.x · llama.cpp · NVIDIA driver 595.71.05 · CUDA 12.6 · Gemma 4 26B · May 2026
