DEV Community

Kai Bennett
Kai Bennett

Posted on

I ran Claude Code on a local LLM for 4 hours — 7M tokens, $0 (would have cost $94)

Last week I ran a 4-hour autonomous coding session using Claude Code — but not against the Anthropic API.

Instead, I routed it through a local llama.cpp instance running Qwen3.6-27B-MTP on my AMD GPU. The total cost: $0.

The same session on Claude Opus 4.7 would have cost ~$94.34.

Here's exactly how the stack works and how you can replicate it.


The stack

The key insight: Claude Code uses the Anthropic API format, but LiteLLM can proxy that to any OpenAI-compatible backend. Your local model never knows it's being called by Claude Code.

Claude Code
    ↓ (thinks it's the Anthropic API)
LiteLLM proxy (localhost:4000)
    ↓
llama.cpp server (localhost:8080)
    ↓
Qwen3.6-27B-MTP Q4_K_M on AMD GPU
Enter fullscreen mode Exit fullscreen mode

Hardware

  • GPU: AMD Radeon AI PRO R9700 (RDNA3, 32 GB VRAM)
  • Backend: llama.cpp HIP/ROCm acceleration
  • Model: Qwen3.6-27B-MTP Q4_K_M + 0.6B MTP draft (speculative decoding)

Inference speeds (batch=1):

  • Prefill: ~200 tokens/sec
  • Generation: ~25-35 tokens/sec

The session that validated it

4-hour autonomous coding run — Hermes Agent doing a multi-step code migration, calling tools, editing files, looping while I watched from Telegram.

Stats:

  • Duration: ~4 hours
  • Tokens processed: 7,256,671
  • API cost: $0
  • Claude Opus 4.7 equivalent: ~$94.34

Why this matters beyond cost

  1. No rate limits or weekly caps — Claude Code's usage limits don't apply to your own machine
  2. Full privacy — your code never leaves your hardware
  3. Offline capability — works with no internet once models are downloaded
  4. Reproducibility — same model weights every time, no silent updates

Setup in 3 steps

1. Start llama.cpp server

./llama-server \
  --model Qwen3.6-27B-MTP-Q4_K_M.gguf \
  --draft-model Qwen3.6-0.6B-Q8_0.gguf \
  --speculative \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --port 8080
Enter fullscreen mode Exit fullscreen mode

2. LiteLLM proxy config

model_list:
  - model_name: claude-opus-4-5
    litellm_params:
      model: openai/qwen3-27b
      api_base: http://localhost:8080/v1
      api_key: fake-key
Enter fullscreen mode Exit fullscreen mode
litellm --config litellm.proxy.yaml --port 4000
Enter fullscreen mode Exit fullscreen mode

3. Point Claude Code to local proxy

export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_API_KEY=fake-key
claude
Enter fullscreen mode Exit fullscreen mode

Done. Claude Code now talks to your GPU.


Full stack with Hermes Agent

For agentic sessions with Telegram control, persistent task context, and tool orchestration, I use Hermes Agent on top. Full open-source setup:

github.com/KaiFelixBennett/hermes-claude-code-local

Includes: llama.cpp start scripts, LiteLLM config, Hermes Agent setup, and benchmark results.


Hardware requirements

  • Minimum: 16 GB VRAM for useful coding models (13B-class)
  • Recommended: 24+ GB for 27B-class models
  • NVIDIA CUDA: Supported, use CUDA llama.cpp build
  • Apple Silicon: Should work with Metal backend — need benchmarks!

Drop your hardware + generation speeds in the comments. Especially interested in NVIDIA and Apple Silicon numbers.

Top comments (0)