From 3 tok/s frustration to 21 tok/s GPU-hybrid inference - a real engineer's guide to self-hosted AI that actually works.
Why Bother Running Local LLMs?
Before we get into the how, let's address the obvious question: why not just use Claude, GPT, or Gemini?
The honest answer is - for many tasks, you should. But local LLMs make sense when:
- Privacy matters. Code, internal documents, proprietary configs - none of it leaves your machine.
- Cost at scale. API calls add up fast when you're running a coding agent all day.
- Latency control. No network round-trips, no rate limits, no API downtime.
- Offline capability. Works on a plane, in a data center, behind a firewall.
- Experimentation. Swap models freely, tune inference parameters, benchmark to your heart's content.
This guide documents a real setup - not a toy demo - built specifically to run Claude Code and pi.dev against a local model, transparently, with no API key required.
The Hardware Stack
| Component | Spec |
|---|---|
| Host | Proxmox VE 8.x, kernel 6.17.x |
| CPU | 12-core (AMD/Intel) |
| RAM | 40 GB allocated to LLM container |
| GPU | NVIDIA RTX 2000 Ada Generation Laptop (8 GB VRAM) |
| Storage | 120 GB root + /mnt/models for model files |
| Network | Tailscale mesh for remote access |
The GPU is the critical piece - even a modest 8 GB card dramatically changes what's possible.
The Architecture: What I Built
Two services, two ports, one model:
- Port 11434 - llama.cpp native OpenAI-compatible API (for pi.dev, curl, anything OpenAI-compatible)
- Port 4000 - thin Python proxy translating Anthropic Messages API to OpenAI format (for Claude Code)
Part 1: The Container - Proxmox LXC Setup
Use an LXC container rather than a full VM because:
- Near-native CPU performance (no hypervisor overhead)
- Shared host kernel means GPU passthrough works with the host's NVIDIA driver
- Faster to snapshot, clone, and manage
Container Config
File: /etc/pve/lxc/103.conf
arch: amd64
cores: 12
features: nesting=1
hostname: llm-server
memory: 40000
net0: name=eth0,bridge=vmbr0,hwaddr=BC:24:11:F0:15:BA,type=veth
ostype: debian
rootfs: local-lvm:vm-103-disk-0,size=120G
swap: 0
dev0: /dev/nvidia0,uid=0,gid=0
dev1: /dev/nvidiactl,uid=0,gid=0
dev2: /dev/nvidia-uvm,uid=0,gid=0
dev3: /dev/nvidia-uvm-tools,uid=0,gid=0
The dev0 through dev3 lines are the magic - Proxmox's native device passthrough syntax. No lxc.mount.entry hacks required in newer PVE versions.
Critical gotcha: If you ever set chattr +i on /etc/resolv.conf inside the container (e.g., to prevent Tailscale from overwriting DNS), it will break Proxmox's pre-start hook, which atomically updates the DNS config. The container won't start. Fix it from the host:
pct mount 103
chattr -i /var/lib/lxc/103/rootfs/etc/resolv.conf
pct unmount 103
Part 2: The Model - Picking the Right One
Model selection for a GPU-constrained system is non-obvious. The key insight is MoE vs Dense architecture:
| Architecture | Example | Active Params/Token | CPU Speed | GPU Benefit |
|---|---|---|---|---|
| Dense | Qwen 3.6 27B | 27B | ~3.5 tok/s | High - all layers benefit |
| MoE | Qwen 3.6 35B-A3B | ~3B | ~18 tok/s | Lower - sparse routing already fast |
| MoE | Gemma 4 26B-A4B | ~4B | ~16 tok/s | Medium - GPU boosts active layers |
The counter-intuitive result: a 35B MoE model runs 5x faster than a 27B dense model on CPU because MoE only activates a small fraction of weights per token. Don't assume smaller parameter count means faster inference.
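A rough sanity check on that claim: token generation on CPU is largely memory-bandwidth bound, so decode speed is roughly memory bandwidth divided by the bytes of weights read per token. Here is a back-of-envelope sketch - the 60 GB/s bandwidth figure and ~0.57 bytes/parameter for Q4-class quants are assumptions, and the result is an upper bound that ignores attention and KV-cache traffic:
# moe_vs_dense.py - why active parameter count, not total size, sets CPU decode speed.
# Assumes decode is memory-bandwidth bound; bandwidth and bytes/param are rough guesses.
BANDWIDTH_GBS = 60          # assumed host memory bandwidth, GB/s
BYTES_PER_PARAM = 0.57      # ~4.5 bits per parameter for Q4_K-style quants

def est_tok_per_s(active_params_billion: float) -> float:
    bytes_per_token = active_params_billion * 1e9 * BYTES_PER_PARAM
    return BANDWIDTH_GBS * 1e9 / bytes_per_token

print(f"Dense, 27B active per token: ~{est_tok_per_s(27):.1f} tok/s")
print(f"MoE,   ~3B active per token: ~{est_tok_per_s(3):.1f} tok/s")
With these numbers the dense estimate lands right around the observed ~3.5 tok/s, while the MoE estimate overshoots the measured ~18 tok/s - shared layers, routing, and KV-cache reads still cost bandwidth beyond the active expert weights - but the order-of-magnitude gap is the point.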
I chose Gemma 4 26B-A4B Q4_K_XL (15.9 GB) for its:
- Strong instruction following and coding ability
- Multimodal capability (vision via mmproj)
- 262K token context window
- 4B active parameters (MoE) - fast despite large parameter count
- Available from unsloth/gemma-4-26B-it-GGUF
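To pull the GGUF onto the box, something along these lines works with the huggingface_hub package installed (pip install huggingface_hub). The filename below is taken from the llama-server config later in this guide - verify it against the repo's file listing before running:
# download_model.py - fetch the quantized GGUF into /mnt/models.
# The exact filename is an assumption copied from the service config below.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="unsloth/gemma-4-26B-it-GGUF",
    filename="gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
    local_dir="/mnt/models",
)
print("Downloaded to", path)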
Part 3: CPU-Only First - Getting llama.cpp Running
Always start CPU-only. It's simpler, debuggable, and gives you a baseline to measure GPU gains against.
Build llama.cpp
# Inside the container
apt-get install -y git cmake build-essential libopenblas-dev
git clone https://github.com/ggml-org/llama.cpp /opt/llm/llama.cpp
cd /opt/llm/llama.cpp && mkdir build && cd build
cmake -DGGML_CUDA=OFF .. && make -j$(nproc) llama-server
Systemd Service
File: /etc/systemd/system/llama-server.service
[Unit]
Description=LLM Inference Server (llama.cpp)
After=network.target
[Service]
Type=simple
User=root
ExecStart=/opt/llm/llama.cpp/build/bin/llama-server \
-m /mnt/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
-t 11 \
-c 32768 \
--batch-size 512 \
--ubatch-size 128 \
-ngl 0 \
-fa on \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--no-mmap \
--reasoning off \
--host 0.0.0.0 \
--port 11434
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Key Flags Explained
| Flag | Why |
|---|---|
| -t 11 | Use 11 of 12 cores - leave 1 for the OS |
| -c 32768 | 32K context - Claude Code's system prompt alone is ~24K tokens |
| --batch-size 512 | Larger batches = higher throughput during prompt processing |
| --cache-type-k/v q4_0 | 4-bit KV cache - 75% smaller than f16, minimal quality loss |
| --no-mmap | Load model fully into RAM - avoids slow first requests |
| --reasoning off | Disable Gemma's thinking mode - outputs go to reasoning_content by default, which breaks OpenAI clients |
| -fa on | Flash Attention - ~30% faster, same quality |
CPU baseline: ~16-18 tok/s (MoE model, 4B active params)
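To reproduce that number rather than eyeballing the server log, a minimal benchmark against the OpenAI-compatible endpoint does the job. A sketch - run it inside the container or swap in the container's IP; the model name is informational for llama-server and the measured rate includes prompt processing, so it slightly understates pure generation speed:
# bench_local.py - rough end-to-end tokens/sec against llama-server.
import json, time, urllib.request

URL = "http://localhost:11434/v1/chat/completions"
payload = {
    "model": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
    "messages": [{"role": "user", "content": "Explain what a KV cache is, in two short paragraphs."}],
    "max_tokens": 256,
    "stream": False,
}

req = urllib.request.Request(URL, data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})
start = time.time()
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
elapsed = time.time() - start

out_tokens = body["usage"]["completion_tokens"]
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.1f} tok/s")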
Part 4: GPU Passthrough - The Hard Part
This is where most guides give up or give wrong advice. Here's what actually works.
Step 1: Install the Right Driver on the Host
The host kernel was 6.17.x (PVE custom kernel). Debian's packaged NVIDIA driver (550) doesn't support kernels past ~6.8. The solution is NVIDIA's official CUDA repo for Debian 13.
# On the Proxmox host
wget https://developer.download.nvidia.com/compute/cuda/repos/debian13/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update
apt-get install -y nvidia-driver-595 nvidia-open-kernel-dkms
This installs driver 595.71.05 with DKMS support - it automatically builds kernel modules for all installed kernels including the PVE 6.17.x series.
Verify with nvidia-smi on the host. It should show your GPU.
Step 2: Add GPU Devices to the Container Config
In /etc/pve/lxc/103.conf, add:
dev0: /dev/nvidia0,uid=0,gid=0
dev1: /dev/nvidiactl,uid=0,gid=0
dev2: /dev/nvidia-uvm,uid=0,gid=0
dev3: /dev/nvidia-uvm-tools,uid=0,gid=0
Restart the container:
pct stop 103 && pct start 103
Step 3: Verify GPU Visibility Inside the Container
ls -la /dev/nvidia*
# crw-rw---- 1 root root 195, 0 /dev/nvidia0
# crw-rw---- 1 root root 195, 255 /dev/nvidiactl
# crw-rw---- 1 root root 505, 0 /dev/nvidia-uvm
# crw-rw---- 1 root root 505, 1 /dev/nvidia-uvm-tools
The devices are visible - but nvidia-smi won't work yet. You need userspace libraries.
Part 5: CUDA Inside LXC - Making the GPU Work
The container needs NVIDIA userspace libraries that exactly match the host driver version (595.71.05). Version mismatch causes nvidia-smi: Failed to initialize NVML.
Install Matching Userspace Libraries
# Inside the container - add the CUDA repo for Debian 12
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update
# Install userspace libraries pinned to host driver version
apt-get install -y libnvidia-ml1=595.71.05-1 nvidia-driver-cuda=595.71.05-1
# Verify
nvidia-smi
Expected output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05 Driver Version: 595.71.05 CUDA Version: 13.2 |
+-----------------------------------------+------------------------+----------------------+
| 0 NVIDIA RTX 2000 Ada Gene... On | 00000000:01:00.0 Off | N/A |
| N/A 53C P3 11W / 39W | 0MiB / 8188MiB | 0% Default |
Now install the CUDA toolkit for building llama.cpp:
apt-get install -y --no-install-recommends cuda-toolkit-12-6
export PATH=/usr/local/cuda/bin:$PATH
nvcc --version
Rebuild llama.cpp with CUDA
cd /opt/llm/llama.cpp
rm -rf build && mkdir build && cd build
# sm_89 = Ada Lovelace (RTX 2000 Ada, RTX 4xxx series)
# Use sm_86 for Ampere (RTX 3xxx), sm_75 for Turing (RTX 2xxx)
PATH=/usr/local/cuda/bin:$PATH cmake \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=89 \
..
make -j$(nproc) llama-server
Verify CUDA is linked:
ldd build/bin/llama-server | grep cuda
# libggml-cuda.so.0 => .../libggml-cuda.so.0
# libcudart.so.12 => .../libcudart.so.12
# libcublas.so.12 => .../libcublas.so.12
Finding the Optimal GPU Layer Count
With 8 GB VRAM and a 15.9 GB model, you can't fit everything on GPU. The math:
- Model has 30 transformer layers
- Available VRAM after system overhead: ~7.5 GB
- Per-layer cost: ~600-700 MB
- Safe layer count: 12 layers (leaves ~700 MB free for compute buffers)
Start low and increase until you hit OOM, then back off one step. Update the service with -ngl 12 and restart:
systemctl restart llama-server
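If you later change the model or quantization, the same arithmetic is worth redoing. A tiny helper using the estimates from the list above - treat the output as a starting guess, and read the real per-layer size from llama-server's startup log for anything precise:
# ngl_estimate.py - crude starting point for the -ngl value.
# All inputs are the estimates from the list above, not measured values.
VRAM_FREE_GB   = 7.5   # usable VRAM after system overhead
PER_LAYER_GB   = 0.6   # low end of the ~600-700 MB per-layer estimate
COMPUTE_BUF_GB = 0.7   # headroom for CUDA compute buffers

ngl = int((VRAM_FREE_GB - COMPUTE_BUF_GB) / PER_LAYER_GB)
print(f"Start with -ngl {ngl}")   # 11 with these numbers; this box ended up stable at 12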
GPU-hybrid result: ~21 tok/s (vs 16-18 tok/s CPU-only). The GPU hits 60%+ SM utilization during inference.
Part 6: The Proxy - Bridging Claude Code to Your LLM
Here's the problem nobody warns you about: Claude Code uses the Anthropic Messages API format, while llama.cpp serves the OpenAI Chat Completions format. They're incompatible.
The proxy handles both sync and streaming (SSE) responses - Claude Code uses streaming for the interactive terminal experience.
Key Translation Points
| Anthropic | OpenAI |
|---|---|
| system (string or content array) | messages[0] with role: system |
| content (array of blocks) | content (plain string) |
| content[0].text | choices[0].message.content |
| SSE content_block_delta events | SSE choices[0].delta.content chunks |
| stop_reason: end_turn | finish_reason: stop |
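For orientation, here is a stripped-down sketch of what that translation can look like for the non-streaming path. It is deliberately minimal - no SSE streaming, no tool use, no error handling - and the /v1/messages route and upstream port are assumptions consistent with the rest of this guide, not the actual proxy code:
# anthropic_proxy_sketch.py - minimal Anthropic Messages -> OpenAI Chat
# Completions translation (non-streaming only). Illustrative sketch, not
# the full ~150-line proxy described below.
from aiohttp import web, ClientSession

LLAMA_URL = "http://127.0.0.1:11434/v1/chat/completions"

def to_openai(req: dict) -> dict:
    messages = []
    # Anthropic "system" can be a string or a list of text blocks.
    system = req.get("system")
    if isinstance(system, list):
        system = "".join(b.get("text", "") for b in system)
    if system:
        messages.append({"role": "system", "content": system})
    # Anthropic message content is a list of blocks; flatten text blocks to a string.
    for m in req.get("messages", []):
        content = m["content"]
        if isinstance(content, list):
            content = "".join(b.get("text", "") for b in content if b.get("type") == "text")
        messages.append({"role": m["role"], "content": content})
    return {"messages": messages, "max_tokens": req.get("max_tokens", 1024), "stream": False}

def to_anthropic(resp: dict, model: str) -> dict:
    text = resp["choices"][0]["message"]["content"]
    usage = resp.get("usage", {})
    return {
        "id": resp.get("id", "msg_local"),
        "type": "message",
        "role": "assistant",
        "model": model,
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",   # map finish_reason: stop -> end_turn
        "usage": {
            "input_tokens": usage.get("prompt_tokens", 0),
            "output_tokens": usage.get("completion_tokens", 0),
        },
    }

async def handle_messages(request: web.Request) -> web.Response:
    body = await request.json()
    async with ClientSession() as session:
        async with session.post(LLAMA_URL, json=to_openai(body)) as upstream:
            data = await upstream.json()
    return web.json_response(to_anthropic(data, body.get("model", "local")))

app = web.Application()
app.router.add_post("/v1/messages", handle_messages)

if __name__ == "__main__":
    web.run_app(app, host="0.0.0.0", port=4000)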
The full proxy is ~150 lines of Python using aiohttp. Run it as a systemd service on port 4000:
# /etc/systemd/system/anthropic-proxy.service
[Unit]
Description=Anthropic API Proxy for llama.cpp
After=network.target llama-server.service
[Service]
Type=simple
User=root
ExecStart=/usr/bin/python3 /opt/anthropic-proxy.py
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
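Before wiring up Claude Code, it's worth confirming the proxy answers a raw Anthropic-style request. A quick check along these lines works - the /v1/messages path matches the sketch above, and this proxy ignores the API key entirely:
# proxy_smoke_test.py - send one Anthropic Messages request to the proxy.
import json, urllib.request

payload = {
    "model": "local",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Say hello in five words."}],
}
req = urllib.request.Request(
    "http://192.168.100.103:4000/v1/messages",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "x-api-key": "sk-no-key-required"},
)
with urllib.request.urlopen(req) as resp:
    msg = json.loads(resp.read())
print(msg["content"][0]["text"])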
Claude Code Config
File: ~/.claude/settings.json on the client machine
{
"env": {
"ANTHROPIC_BASE_URL": "http://192.168.100.103:4000",
"ANTHROPIC_API_KEY": "sk-no-key-required"
}
}
Test it:
claude -p "hi"
# Hello! How can I help you today?
Why not LiteLLM? I tried it. LiteLLM 1.83 introduced a ResponsesAPIResponse type internally that fails validation when converting back to AnthropicResponse. Requests hang silently with no error returned to the client. The 150-line custom proxy was faster to write and debug than fighting the library.
Part 7: pi.dev Integration
pi.dev speaks OpenAI format natively - no proxy needed, connect directly to port 11434.
File: ~/.pi/agent/models.json
{
"providers": {
"llama-local": {
"baseUrl": "http://192.168.100.103:11434/v1",
"api": "openai-completions",
"apiKey": "sk-dummy",
"compat": {
"supportsDeveloperRole": false,
"supportsReasoningEffort": false
},
"models": [
{
"id": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
"name": "Gemma 4 26B (Local GPU)",
"reasoning": false,
"input": ["text"],
"contextWindow": 32768,
"maxTokens": 8192,
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
}
]
}
}
}
The compat block is important - llama.cpp doesn't understand the developer role (used by pi for reasoning-capable models) or the reasoning_effort parameter. Setting both to false makes pi send standard system messages instead.
Open pi and type /model to select your local model. The file reloads automatically - no restart needed.
Performance Results
| Stage | Speed | Notes |
|---|---|---|
| CPU-only (Dense 27B) | ~3.5 tok/s | Wrong model choice - dense is slow on CPU |
| CPU-only (MoE 35B) | ~18 tok/s | Switched to MoE - massive improvement |
| CPU-only (Gemma 4 26B MoE) | ~16 tok/s | Better model quality, similar speed |
| GPU hybrid (12/30 layers) | ~21 tok/s | 30% improvement, GPU at 60%+ utilization |
| Prompt processing (prefill) | ~40 tok/s | GPU accelerates context loading significantly |
Typical response times:
- "hi" - ~1 second
- 100-token code explanation - ~5 seconds
- 500-token code generation - ~25 seconds
The GPU contributes most to prompt processing speed - loading a large codebase context into the KV cache is noticeably faster with GPU layers active.
Gotchas and Lessons Learned
1. chattr +i on resolv.conf breaks container startup
Proxmox's pre-start hook atomically renames a temp file to /etc/resolv.conf. The immutable flag blocks this. The container fails silently - only visible in lxc-start -l DEBUG logs as close (rename) atomic file failed: Operation not permitted.
2. Driver version must match exactly
Userspace libraries inside the container must match the host kernel module version exactly. A mismatch causes nvidia-smi: Failed to initialize NVML. Pin the version explicitly: apt-get install libnvidia-ml1=595.71.05-1.
3. Kernel 6.17 + Debian driver 550 = build failure
Debian's packaged NVIDIA driver 550 has no DKMS support for kernels past ~6.8. The fix is NVIDIA's official CUDA repo for Debian 13 (debian13), which ships driver 595 with working DKMS for modern kernels.
4. MoE vs Dense - the counterintuitive performance flip
A 35B MoE model genuinely outperforms a 27B dense model on CPU because sparse activation means only ~3-4B parameters are computed per token. Never assume smaller parameter count means faster inference - check the architecture first.
5. Gemma 4 thinks by default
Gemma 4 uses internal chain-of-thought thinking mode by default. With streaming, the client receives reasoning_content but empty content until thinking completes. For chat interfaces that expect immediate tokens, add --reasoning off. For code accuracy tasks, leaving it enabled is worth the latency cost.
6. LiteLLM 1.83 hangs silently
The latest LiteLLM uses a new ResponsesAPIResponse type that fails Pydantic validation when serializing to AnthropicResponse. The request completes internally but the response is never sent to the client. No error, no timeout - just silence.
7. Context window must exceed Claude Code's system prompt
Claude Code's built-in system prompt is approximately 24K tokens. A context window below 32K triggers an immediate exceed_context_size_error before any user message is processed. Set -c 32768 as the minimum.
What's Next
Short term:
- Re-enable thinking mode selectively by passing budget_tokens per request
- Add a /v1/models endpoint to the proxy for model auto-discovery
For a dedicated thinking model:
- DeepSeek-R1-Distill-Qwen-14B (~8 GB Q4) - fits almost entirely in 8 GB VRAM, estimated 30-40 tok/s, purpose-built for reasoning tasks
For bigger hardware:
- RTX 3090 or 4090 (24 GB VRAM) - entire Gemma 4-26B fits on GPU, estimated 60-80 tok/s
- A dual-GPU setup with NVLink enables running 70B models entirely on GPU
The Stack, Summarized
Model: Gemma 4 26B-A4B Q4_K_XL (15.9 GB, MoE)
Engine: llama.cpp with CUDA (sm_89, Ada Lovelace)
GPU: RTX 2000 Ada 8 GB - 12/30 layers on GPU
Speed: ~21 tok/s generation, ~40 tok/s prefill
Proxy: Python aiohttp - Anthropic <-> OpenAI translation
Clients: Claude Code (port 4000), pi.dev (port 11434)
Access: LAN (192.168.100.103) + Tailscale
Cost: $0 per query after hardware
The entire setup took about 8 hours of real iteration. Most of that time went to three of the gotchas above - the chattr trap, the kernel/driver mismatch, and the LiteLLM silent hang. Hopefully this guide saves you all of it.
Built on Proxmox 8.x · llama.cpp · NVIDIA driver 595.71.05 · CUDA 12.6 · Gemma 4 26B · May 2026

