Last week I ran a 4-hour autonomous coding session using Claude Code — but not against the Anthropic API.
Instead, I routed it through a local llama.cpp instance running Qwen3.6-27B-MTP on my AMD GPU. The total cost: $0.
The same session on Claude Opus 4.7 would have cost ~$94.34.
Here's exactly how the stack works and how you can replicate it.
The stack
The key insight: Claude Code uses the Anthropic API format, but LiteLLM can proxy that to any OpenAI-compatible backend. Your local model never knows it's being called by Claude Code.
Claude Code
↓ (thinks it's the Anthropic API)
LiteLLM proxy (localhost:4000)
↓
llama.cpp server (localhost:8080)
↓
Qwen3.6-27B-MTP Q4_K_M on AMD GPU
Hardware
- GPU: AMD Radeon AI PRO R9700 (RDNA3, 32 GB VRAM)
- Backend: llama.cpp HIP/ROCm acceleration
- Model: Qwen3.6-27B-MTP Q4_K_M + 0.6B MTP draft (speculative decoding)
Inference speeds (batch=1):
- Prefill: ~200 tokens/sec
- Generation: ~25-35 tokens/sec
The session that validated it
4-hour autonomous coding run — Hermes Agent doing a multi-step code migration, calling tools, editing files, looping while I watched from Telegram.
Stats:
- Duration: ~4 hours
- Tokens processed: 7,256,671
- API cost: $0
- Claude Opus 4.7 equivalent: ~$94.34
Why this matters beyond cost
- No rate limits or weekly caps — Claude Code's usage limits don't apply to your own machine
- Full privacy — your code never leaves your hardware
- Offline capability — works with no internet once models are downloaded
- Reproducibility — same model weights every time, no silent updates
Setup in 3 steps
1. Start llama.cpp server
./llama-server \
--model Qwen3.6-27B-MTP-Q4_K_M.gguf \
--draft-model Qwen3.6-0.6B-Q8_0.gguf \
--speculative \
--ctx-size 32768 \
--n-gpu-layers 99 \
--port 8080
2. LiteLLM proxy config
model_list:
- model_name: claude-opus-4-5
litellm_params:
model: openai/qwen3-27b
api_base: http://localhost:8080/v1
api_key: fake-key
litellm --config litellm.proxy.yaml --port 4000
3. Point Claude Code to local proxy
export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_API_KEY=fake-key
claude
Done. Claude Code now talks to your GPU.
Full stack with Hermes Agent
For agentic sessions with Telegram control, persistent task context, and tool orchestration, I use Hermes Agent on top. Full open-source setup:
github.com/KaiFelixBennett/hermes-claude-code-local
Includes: llama.cpp start scripts, LiteLLM config, Hermes Agent setup, and benchmark results.
Hardware requirements
- Minimum: 16 GB VRAM for useful coding models (13B-class)
- Recommended: 24+ GB for 27B-class models
- NVIDIA CUDA: Supported, use CUDA llama.cpp build
- Apple Silicon: Should work with Metal backend — need benchmarks!
Drop your hardware + generation speeds in the comments. Especially interested in NVIDIA and Apple Silicon numbers.
Top comments (0)