<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kai Bennett</title>
    <description>The latest articles on DEV Community by Kai Bennett (@kaibennett_dev).</description>
    <link>https://dev.to/kaibennett_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3950936%2Ff18940c3-6578-4a31-a79a-50d8068323cd.png</url>
      <title>DEV Community: Kai Bennett</title>
      <link>https://dev.to/kaibennett_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kaibennett_dev"/>
    <language>en</language>
    <item>
      <title>I ran Claude Code on a local LLM for 4 hours — 7M tokens, $0 (would have cost $94)</title>
      <dc:creator>Kai Bennett</dc:creator>
      <pubDate>Mon, 25 May 2026 15:02:26 +0000</pubDate>
      <link>https://dev.to/kaibennett_dev/i-ran-claude-code-on-a-local-llm-for-4-hours-7m-tokens-0-would-have-cost-94-11e0</link>
      <guid>https://dev.to/kaibennett_dev/i-ran-claude-code-on-a-local-llm-for-4-hours-7m-tokens-0-would-have-cost-94-11e0</guid>
      <description>&lt;p&gt;Last week I ran a 4-hour autonomous coding session using Claude Code — but not against the Anthropic API.&lt;/p&gt;

&lt;p&gt;Instead, I routed it through a local &lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;llama.cpp&lt;/a&gt; instance running Qwen3.6-27B-MTP on my AMD GPU. The total cost: &lt;strong&gt;$0&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The same session on Claude Opus 4.7 would have cost &lt;strong&gt;~$94.34&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's exactly how the stack works and how you can replicate it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;p&gt;The key insight: Claude Code uses the Anthropic API format, but LiteLLM can proxy that to any OpenAI-compatible backend. Your local model never knows it's being called by Claude Code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code
    ↓ (thinks it's the Anthropic API)
LiteLLM proxy (localhost:4000)
    ↓
llama.cpp server (localhost:8080)
    ↓
Qwen3.6-27B-MTP Q4_K_M on AMD GPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Hardware
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GPU: AMD Radeon AI PRO R9700 (RDNA3, 32 GB VRAM)&lt;/li&gt;
&lt;li&gt;Backend: llama.cpp HIP/ROCm acceleration&lt;/li&gt;
&lt;li&gt;Model: Qwen3.6-27B-MTP Q4_K_M + 0.6B MTP draft (speculative decoding)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inference speeds (batch=1):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefill: ~200 tokens/sec&lt;/li&gt;
&lt;li&gt;Generation: ~25-35 tokens/sec&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The session that validated it
&lt;/h2&gt;

&lt;p&gt;4-hour autonomous coding run — Hermes Agent doing a multi-step code migration, calling tools, editing files, looping while I watched from Telegram.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stats:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duration: ~4 hours&lt;/li&gt;
&lt;li&gt;Tokens processed: 7,256,671&lt;/li&gt;
&lt;li&gt;API cost: &lt;strong&gt;$0&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Claude Opus 4.7 equivalent: &lt;strong&gt;~$94.34&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why this matters beyond cost
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No rate limits or weekly caps&lt;/strong&gt; — Claude Code's usage limits don't apply to your own machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full privacy&lt;/strong&gt; — your code never leaves your hardware&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline capability&lt;/strong&gt; — works with no internet once models are downloaded&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducibility&lt;/strong&gt; — same model weights every time, no silent updates&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Setup in 3 steps
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Start llama.cpp server&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; Qwen3.6-27B-MTP-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--draft-model&lt;/span&gt; Qwen3.6-0.6B-Q8_0.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--speculative&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. LiteLLM proxy config&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-opus-4-5&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/qwen3-27b&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fake-key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;litellm &lt;span class="nt"&gt;--config&lt;/span&gt; litellm.proxy.yaml &lt;span class="nt"&gt;--port&lt;/span&gt; 4000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Point Claude Code to local proxy&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:4000
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;fake-key
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done. Claude Code now talks to your GPU.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full stack with Hermes Agent
&lt;/h2&gt;

&lt;p&gt;For agentic sessions with Telegram control, persistent task context, and tool orchestration, I use Hermes Agent on top. Full open-source setup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/KaiFelixBennett/hermes-claude-code-local" rel="noopener noreferrer"&gt;github.com/KaiFelixBennett/hermes-claude-code-local&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Includes: llama.cpp start scripts, LiteLLM config, Hermes Agent setup, and benchmark results.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hardware requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimum&lt;/strong&gt;: 16 GB VRAM for useful coding models (13B-class)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommended&lt;/strong&gt;: 24+ GB for 27B-class models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA CUDA&lt;/strong&gt;: Supported, use CUDA llama.cpp build&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple Silicon&lt;/strong&gt;: Should work with Metal backend — need benchmarks!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop your hardware + generation speeds in the comments. Especially interested in NVIDIA and Apple Silicon numbers.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
