Kai Bennett

Posted on May 25

I ran Claude Code on a local LLM for 4 hours — 7M tokens, $0 (would have cost $94)

#ai #llm #opensource #productivity

Last week I ran a 4-hour autonomous coding session using Claude Code — but not against the Anthropic API.

Instead, I routed it through a local llama.cpp instance running Qwen3.6-27B-MTP on my AMD GPU. The total cost: $0.

The same session on Claude Opus 4.7 would have cost ~$94.34.

Here's exactly how the stack works and how you can replicate it.

The stack

The key insight: Claude Code uses the Anthropic API format, but LiteLLM can proxy that to any OpenAI-compatible backend. Your local model never knows it's being called by Claude Code.

Claude Code
    ↓ (thinks it's the Anthropic API)
LiteLLM proxy (localhost:4000)
    ↓
llama.cpp server (localhost:8080)
    ↓
Qwen3.6-27B-MTP Q4_K_M on AMD GPU

Hardware

GPU: AMD Radeon AI PRO R9700 (RDNA3, 32 GB VRAM)
Backend: llama.cpp HIP/ROCm acceleration
Model: Qwen3.6-27B-MTP Q4_K_M + 0.6B MTP draft (speculative decoding)

Inference speeds (batch=1):

Prefill: ~200 tokens/sec
Generation: ~25-35 tokens/sec

The session that validated it

4-hour autonomous coding run — Hermes Agent doing a multi-step code migration, calling tools, editing files, looping while I watched from Telegram.

Stats:

Duration: ~4 hours
Tokens processed: 7,256,671
API cost: $0
Claude Opus 4.7 equivalent: ~$94.34

Why this matters beyond cost

No rate limits or weekly caps — Claude Code's usage limits don't apply to your own machine
Full privacy — your code never leaves your hardware
Offline capability — works with no internet once models are downloaded
Reproducibility — same model weights every time, no silent updates

Setup in 3 steps

1. Start llama.cpp server

./llama-server \
  --model Qwen3.6-27B-MTP-Q4_K_M.gguf \
  --draft-model Qwen3.6-0.6B-Q8_0.gguf \
  --speculative \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --port 8080

2. LiteLLM proxy config

model_list:
  - model_name: claude-opus-4-5
    litellm_params:
      model: openai/qwen3-27b
      api_base: http://localhost:8080/v1
      api_key: fake-key

litellm --config litellm.proxy.yaml --port 4000

3. Point Claude Code to local proxy

export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_API_KEY=fake-key
claude

Done. Claude Code now talks to your GPU.

Full stack with Hermes Agent

For agentic sessions with Telegram control, persistent task context, and tool orchestration, I use Hermes Agent on top. Full open-source setup:

github.com/KaiFelixBennett/hermes-claude-code-local

Includes: llama.cpp start scripts, LiteLLM config, Hermes Agent setup, and benchmark results.

Hardware requirements

Minimum: 16 GB VRAM for useful coding models (13B-class)
Recommended: 24+ GB for 27B-class models
NVIDIA CUDA: Supported, use CUDA llama.cpp build
Apple Silicon: Should work with Metal backend — need benchmarks!

Drop your hardware + generation speeds in the comments. Especially interested in NVIDIA and Apple Silicon numbers.

DEV Community