<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Megan Folsom</title>
    <description>The latest articles on DEV Community by Megan Folsom (@mfolsom).</description>
    <link>https://dev.to/mfolsom</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3542575%2F0de3e0c3-bad3-46d8-89b5-7db7b6567663.jpeg</url>
      <title>DEV Community: Megan Folsom</title>
      <link>https://dev.to/mfolsom</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mfolsom"/>
    <language>en</language>
    <item>
      <title>The Ghost in the CLI: Why Claude Code Kills Local Inference</title>
      <dc:creator>Megan Folsom</dc:creator>
      <pubDate>Wed, 18 Feb 2026 20:17:53 +0000</pubDate>
      <link>https://dev.to/mfolsom/the-ghost-in-the-cli-why-claude-code-kills-local-inference-dfc</link>
      <guid>https://dev.to/mfolsom/the-ghost-in-the-cli-why-claude-code-kills-local-inference-dfc</guid>
      <description>&lt;p&gt;It was a rainy sunday. The kind of Sunday that makes you want to stay inside with Claude Code and a good book. But I knew this wasn't going to be an ordinary Sunday the minute GLM-5 showed up and sweet talked its way into my Claude Code CLI. But Claude Code wasn't having it. You see, when you point Claude Code at a local base URL for local inference, you're inviting a poltergeist into your terminal. &lt;/p&gt;

&lt;p&gt;For a weekend project, my mission was to set up a local version of GLM-5 as a coding agent on my new M3 Ultra. My reasons for deciding to run my local quantized GLM-5 in Claude Code are documented in my companion article, &lt;a href="https://dev.to/mfolsom/finding-my-frontier-cloud-free-coding-on-glm-5-47o4"&gt;Finding my Frontier: Cloud free coding on GLM-5&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;I thought this would be straightforward. Claude Code has an &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; env var, and llama-server exposes an Anthropic Messages API endpoint. A walk in the park, right? But once I had it set up, llama-server segfaulted immediately, before my prompt even reached the model. My technical investigation led to some very interesting findings. &lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Code and Open Source Models
&lt;/h2&gt;

&lt;p&gt;Claude Code has growing support for running open source models, and the open source community is embracing it too. Ollama lets you &lt;a href="https://ollama.com/blog/launch" rel="noopener noreferrer"&gt;launch models directly in Claude Code&lt;/a&gt;. Some &lt;a href="https://platform.minimax.io/docs/coding-plan/claude-code" rel="noopener noreferrer"&gt;frontier-class open source models&lt;/a&gt; recommend it as the primary way to access them. These integrations, though, are typically optimized for cloud-hosted versions of the models, not local inference. I love the Claude Code CLI, and the idea of having its coolest features already baked into your open source coding setup is very tempting. But my job today is to dampen your enthusiasm. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Machine:&lt;/strong&gt; M3 Ultra Mac Studio, 512GB unified RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model:&lt;/strong&gt; GLM-5 IQ2_XXS (225GB, GGUF via &lt;a href="https://hf.co/unsloth/GLM-5-GGUF" rel="noopener noreferrer"&gt;unsloth/GLM-5-GGUF&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server:&lt;/strong&gt; llama-server (llama.cpp) with Metal support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal:&lt;/strong&gt; Use Claude Code with a local model instead of the Anthropic API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See my companion article, &lt;a href="https://dev.to/mfolsom/finding-my-frontier-cloud-free-coding-on-glm-5-47o4"&gt;Finding my Frontier: Cloud free coding on GLM-5&lt;/a&gt;, for the full OpenCode setup guide and the MLX vs GGUF performance story.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Model Works Fine on Its Own
&lt;/h2&gt;

&lt;p&gt;After the crash, I ran GLM-5 directly through llama-server's Anthropic Messages API, and it handled tool calling with no problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s1"&gt;'http://localhost:8080/v1/messages'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'x-api-key: none'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'anthropic-version: 2023-06-01'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "test",
    "max_tokens": 50,
    "tools": [{
      "name": "get_weather",
      "description": "Get weather for a location",
      "input_schema": {
        "type": "object",
        "properties": {
          "location": {"type": "string"}
        },
        "required": ["location"]
      }
    }],
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is 164 input tokens, 50 output tokens, and a prompt reply (pun intended) in 4.7 seconds. A 744B model doing structured tool calling on consumer hardware. The model isn't the problem here.&lt;/p&gt;
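
&lt;p&gt;For the curious, the reply comes back as a standard Anthropic Messages response carrying a &lt;code&gt;tool_use&lt;/code&gt; block. The shape below is illustrative (ids truncated), not a verbatim capture:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "id": "msg_...",
  "type": "message",
  "role": "assistant",
  "content": [
    {"type": "tool_use", "id": "toolu_...", "name": "get_weather",
     "input": {"location": "Paris"}}
  ],
  "stop_reason": "tool_use"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;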

&lt;h2&gt;
  
  
  Then I Plugged In Claude Code
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8080"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"none"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
claude &lt;span class="nt"&gt;--model&lt;/span&gt; GLM-5-UD-IQ2_XXS-00001-of-00006.gguf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dead server. Not even a useful error message.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Forensic Evidence
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw057x9j0obfm305wmdl.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw057x9j0obfm305wmdl.jpeg" alt="Going full Detectorist" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To see what was happening under the surface, I dropped a logging proxy between Claude Code and llama-server. I needed to see the exact moment the handshake turned into a death spiral.&lt;/p&gt;
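
&lt;p&gt;The logger doesn't need to be fancy. Below is a minimal sketch of that kind of pass-through proxy (my assumptions: stdlib only, llama-server on 8080, proxy listening on 9090; it prints the model and tool count of every POST, then forwards it untouched):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;#!/usr/bin/env python3
"""Minimal pass-through logger: print every request, forward it unchanged."""
import json
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler
from urllib.request import Request, urlopen

TARGET = "http://127.0.0.1:8080"  # llama-server

class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        try:
            data = json.loads(body)
        except ValueError:
            data = {}
        # Log the interesting bits before anything downstream can crash
        print(f"POST {self.path} | model={data.get('model', '?')} "
              f"| tools={len(data.get('tools', []))}", flush=True)
        req = Request(f"{TARGET}{self.path}", data=body, method="POST")
        req.add_header("Content-Type", "application/json")
        resp = urlopen(req, timeout=600)
        resp_data = resp.read()
        self.send_response(resp.status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(resp_data)))
        self.end_headers()
        self.wfile.write(resp_data)

    def log_message(self, *args):
        pass  # silence the default per-request log lines

# ThreadingHTTPServer passes requests through concurrently, so the
# flood reaches llama-server exactly as Claude Code sent it.
ThreadingHTTPServer(("127.0.0.1", 9090), LoggingProxy).serve_forever()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Point &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; at port 9090 instead of 8080 and watch the traffic scroll by.&lt;/p&gt;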

&lt;p&gt;The logs revealed a massacre.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1]  POST /v1/messages | model=claude-haiku-4-5-20251001 | tools=0    → 200 OK
[2]  POST /v1/messages | model=claude-haiku-4-5-20251001 | tools=0    → 200 OK
[3]  POST /v1/messages/count_tokens | model=GLM-5...     | tools=1    → intercepted
[4]  POST /v1/messages/count_tokens | model=GLM-5...     | tools=1    → intercepted
...
[8]  POST /v1/messages | model=claude-haiku-4-5-20251001 | tools=1    → CRASH (segfault)
[9+] Everything after → Connection refused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This revealed three separate problems. Any one of them kills the server on its own.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ghost Haiku Calls
&lt;/h3&gt;

&lt;p&gt;What on earth was Haiku doing there? I checked every configuration file; I knew for sure I hadn’t invited it. &lt;/p&gt;

&lt;p&gt;As it turns out, Claude Code is a creature of habit. It sends internal requests to &lt;code&gt;claude-haiku-4-5-20251001&lt;/code&gt; for housekeeping stuff (things like generating conversation titles, filtering tools, other background tasks). When you set &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt;, &lt;strong&gt;all&lt;/strong&gt; of those get routed to your local server.&lt;/p&gt;

&lt;p&gt;In one session I counted &lt;strong&gt;37 Haiku requests&lt;/strong&gt; before the actual inference request even got sent. Title generation, tool checking for each of 30+ MCP tools, all hitting a server that has never even heard of Haiku.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token Counting Preflight
&lt;/h3&gt;

&lt;p&gt;But that wasn't all. Before the actual inference request, Claude Code hits &lt;code&gt;/v1/messages/count_tokens&lt;/code&gt; with one request per tool group. This endpoint doesn't exist in llama-server, so it returns a 404 that Claude Code doesn't handle gracefully.&lt;/p&gt;
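
&lt;p&gt;You can check that one directly. On my llama-server build, hitting the preflight path by hand returns an error, not a token count:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -i 'http://localhost:8080/v1/messages/count_tokens' \
  -H 'Content-Type: application/json' \
  -d '{"model": "test", "messages": [{"role": "user", "content": "hi"}]}'
# -&amp;gt; HTTP/1.1 404 Not Found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;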

&lt;h3&gt;
  
  
  Concurrent Request Flood
&lt;/h3&gt;

&lt;p&gt;The gasoline on this fire is one of Claude Code's best features, parallelism, which is a concurrency mismatch for poor little llama-server. Haiku calls to the ether, count_tokens calls, and a parallel request to run the inference for your prompt, all in flight at once. A single-slot llama-server can't handle concurrent requests, which results in, you guessed it, a croaked-out "se-egfault" just before the server's untimely demise (I might have watched too many British police procedurals).&lt;/p&gt;
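
&lt;p&gt;You don't need Claude Code to trigger this failure mode; in principle, two simultaneous requests against a single-slot server are enough. A minimal repro sketch, reusing the curl call from earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Fire two requests at the same moment at a --parallel 1 server
curl -s 'http://localhost:8080/v1/messages' \
  -H 'Content-Type: application/json' -H 'anthropic-version: 2023-06-01' \
  -d '{"model": "test", "max_tokens": 10, "messages": [{"role": "user", "content": "hi"}]}' &amp;amp;
curl -s 'http://localhost:8080/v1/messages' \
  -H 'Content-Type: application/json' -H 'anthropic-version: 2023-06-01' \
  -d '{"model": "test", "max_tokens": 10, "messages": [{"role": "user", "content": "hi"}]}' &amp;amp;
wait
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;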

&lt;p&gt;The GLM-5 inference request (in this case a simple "hello"), which is actually the one I cared about, never made it to the server. It was stuck behind crashed Haiku calls and preflight requests hitting endpoints that aren't there.&lt;/p&gt;

&lt;p&gt;Here's what that looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3llgw9isnuz254zze7am.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3llgw9isnuz254zze7am.png" alt="Without Proxy: Claude Code fires Haiku, count_tokens, and GLM-5 requests in parallel at a single-slot llama-server, resulting in segfault" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Exorcism by Proxy: 180 Lines of Python
&lt;/h2&gt;

&lt;p&gt;Okay, I admit, this was a hacky fix. But it worked. Instead of waiting for upstream fixes, I wrote a proxy that sits between Claude Code and llama-server. It does three things: fakes all Haiku responses, intercepts &lt;code&gt;count_tokens&lt;/code&gt;, and serializes real requests so they don't flood the server. Here's the walkthrough.&lt;/p&gt;

&lt;h3&gt;
  
  
  The plumbing
&lt;/h3&gt;

&lt;p&gt;Standard library only. The proxy listens on port 9090 and forwards real requests to llama-server on 8080. All real inference requests go through a single-threaded queue so the server only ever sees one at a time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Smart proxy for Claude Code -&amp;gt; llama-server.
Serializes requests, intercepts count_tokens, fakes Haiku calls.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;http.server&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HTTPServer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BaseHTTPRequestHandler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urlopen&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.error&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HTTPError&lt;/span&gt;

&lt;span class="n"&gt;TARGET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:8080&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;request_queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;response_slots&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;slot_lock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;request_timestamps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The worker thread
&lt;/h3&gt;

&lt;p&gt;This is the single-file line to llama-server. Requests go into the queue; this thread sends them one at a time and stashes each response so the original handler can pick it up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;req_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;t_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TARGET&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;resp_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;resp_headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getheaders&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_start&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;req_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &amp;lt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;slot_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;response_slots&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;req_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp_headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;HTTPError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;error_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fp&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;slot_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;response_slots&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;req_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;error_body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;slot_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;response_slots&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;req_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;502&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;request_timestamps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;request_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;task_done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;req_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;counter_lock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Faking Haiku responses
&lt;/h3&gt;

&lt;p&gt;When Claude Code sends a Haiku request (title generation, tool filtering, etc.), we don't bother the model. We just send back a minimal valid Anthropic Messages API response. Claude Code gets what it needs; the model never knows it happened.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fake_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return a minimal Anthropic Messages API response.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;fake&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;msg_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;req_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop_reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_turn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop_sequence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end_headers&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The main proxy handler
&lt;/h3&gt;

&lt;p&gt;This is where the routing logic lives. Every POST gets inspected and sent down one of three paths:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;count_tokens&lt;/code&gt; requests&lt;/strong&gt; get a fake estimate and never touch the server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Haiku requests&lt;/strong&gt; get a fake response. Title generation requests get a slightly smarter fake that includes a JSON title so Claude Code's UI still works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Everything else&lt;/strong&gt; (your actual GLM-5 inference) goes into the queue and waits for the worker thread to process it.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SmartProxy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseHTTPRequestHandler&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;do_POST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;req_counter&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;counter_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;req_counter&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;req_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req_counter&lt;/span&gt;

        &lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;

        &lt;span class="c1"&gt;# 1. Intercept count_tokens
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;estimated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;estimated&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end_headers&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Fake ALL Haiku calls
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;haiku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
            &lt;span class="n"&gt;is_title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                        &lt;span class="n"&gt;is_title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;is_title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;fake_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isNewTopic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GLM-5 Chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;fake_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Real requests: serialize through queue
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;req_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tools -&amp;gt; queued&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;headers_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;headers_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;request_timestamps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;req_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;request_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;req_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;slot_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;req_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response_slots&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response_slots&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="n"&gt;status_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp_headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resp_headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfer-encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content-length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp_data&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end_headers&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Start it up
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;HTTPServer&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9090&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;SmartProxy&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;serve_forever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the whole thing as &lt;code&gt;claude-proxy.py&lt;/code&gt; and run it with &lt;code&gt;python3 claude-proxy.py&lt;/code&gt;. That's it.&lt;/p&gt;
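
&lt;p&gt;A quick smoke test before wiring up Claude Code: hit the proxy's &lt;code&gt;count_tokens&lt;/code&gt; path and make sure the fake estimate comes back. With no tools in the request, the formula above returns 500:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -s 'http://localhost:9090/v1/messages/count_tokens' \
  -H 'Content-Type: application/json' \
  -d '{"model": "test", "messages": [{"role": "user", "content": "hi"}]}'
# -&amp;gt; {"input_tokens": 500}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;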

&lt;h2&gt;
  
  
  How It Looks Now
&lt;/h2&gt;

&lt;p&gt;With the proxy in place, the picture changes completely:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjm4ecc189kzepjqa3d5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjm4ecc189kzepjqa3d5f.png" alt="With Proxy: Proxy intercepts Haiku and count_tokens with instant fakes, forwards only GLM-5 inference one at a time to llama-server" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude Code's request flow goes from 42 chaotic requests to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] haiku title gen → fake response (instant)
[2] GLM-5 | 23 tools → queued
[2] ← 200 | 17.8s
[3] haiku title gen → fake response (instant)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Performance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;TTFT (prefill)&lt;/th&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1st (cold cache)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;336.6s&lt;/strong&gt; / 24,974 tokens&lt;/td&gt;
&lt;td&gt;13.7s / 133 tok&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;350.3s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full prefill, tool defs + system prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2nd (warm cache)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0.1s&lt;/strong&gt; / 1 token&lt;/td&gt;
&lt;td&gt;17.0s / 165 tok&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17.1s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prompt cache hit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3rd&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2.2s&lt;/strong&gt; / 14 tokens&lt;/td&gt;
&lt;td&gt;15.6s / 151 tok&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17.8s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Near-instant prefill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4th&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;3.4s&lt;/strong&gt; / 96 tokens&lt;/td&gt;
&lt;td&gt;10.8s / 104 tok&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14.1s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;First turn is 5.8 minutes. Every turn after that: &lt;strong&gt;2-3 seconds to first token.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first turn is slower than OpenCode (350s vs 100s) because Claude Code sends ~25K tokens of tool definitions (23 tools including Playwright, Figma, and the built-in ones like Read, Write, Bash, Glob, Grep, etc.) compared to OpenCode's ~10K. But llama-server's prompt cache means you only pay that cost once. After the first turn the server sees the 25K token prefix hasn't changed and skips straight to the new tokens.&lt;/p&gt;
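
&lt;p&gt;The back-of-envelope math from the table makes the cache effect concrete (my numbers; throughput will vary with quant and context):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cold prefill:  24,974 tokens / 336.6s  ≈ 74 tok/s
Generation:       133 tokens /  13.7s  ≈ 9.7 tok/s
Warm turn 3:       14 tokens /   2.2s  (only the new tokens get prefilled)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;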

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;Three terminals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1: llama-server&lt;/span&gt;
llama-server &lt;span class="nt"&gt;--model&lt;/span&gt; GLM-5-UD-IQ2_XXS-00001-of-00006.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 65536 &lt;span class="nt"&gt;--parallel&lt;/span&gt; 1 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080

&lt;span class="c"&gt;# Terminal 2: proxy&lt;/span&gt;
python3 claude-proxy.py

&lt;span class="c"&gt;# Terminal 3: Claude Code&lt;/span&gt;
&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:9090"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"none"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
claude &lt;span class="nt"&gt;--model&lt;/span&gt; GLM-5-UD-IQ2_XXS-00001-of-00006.gguf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First turn will take ~6 minutes. Be patient. After that: ~15 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Still Don't Recommend It
&lt;/h2&gt;

&lt;p&gt;Claude Code's &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; feature technically supports custom endpoints, but the implementation assumes a cloud-scale API server on the other end: one that handles parallel requests, implements every endpoint in the Anthropic spec, and doesn't mind servicing dozens of lightweight Haiku calls alongside heavyweight inference.&lt;/p&gt;

&lt;p&gt;That's fine for cloud infrastructure. It's a completely broken assumption for a single-slot local server running a 225GB model. Local model support exists on paper but crashes in practice, and the failure mode (immediate segfault, no useful error message) makes it nearly impossible to diagnose without building your own proxy.&lt;/p&gt;

&lt;p&gt;This proxy is a workaround, not a fix. The real solution would be for coding agents to detect local endpoints and skip the background services that assume cloud-scale infrastructure. Until then, 180 lines of Python bridge the gap.&lt;/p&gt;
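&lt;p&gt;For reference, here's the shape of that workaround. This is a minimal, non-streaming sketch of the idea rather than the actual gist linked at the bottom (which also handles streaming and logging), and the canned reply below is my simplification of the Anthropic Messages response format:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: fake the cheap endpoints, serialize the expensive one.
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

UPSTREAM = "http://localhost:8080"   # llama-server
LOCK = threading.Lock()              # single slot upstream: one inference at a time

class Proxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        payload = json.loads(body or b"{}")

        # Token-count calls never need to touch the model
        if self.path.endswith("/count_tokens"):
            return self.reply({"input_tokens": 1})

        # Background Haiku calls (title generation etc.) get a canned answer
        if "haiku" in str(payload.get("model", "")).lower():
            return self.reply({
                "id": "fake", "type": "message", "role": "assistant",
                "model": payload["model"],
                "content": [{"type": "text", "text": "ok"}],
                "stop_reason": "end_turn",
                "usage": {"input_tokens": 1, "output_tokens": 1},
            })

        # Real GLM-5 inference: forward it, one request at a time
        with LOCK:
            req = Request(UPSTREAM + self.path, data=body,
                          headers={"Content-Type": "application/json"})
            with urlopen(req) as upstream:
                self.reply_raw(upstream.read())

    def reply(self, obj):
        self.reply_raw(json.dumps(obj).encode())

    def reply_raw(self, data):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)

ThreadingHTTPServer(("", 9090), Proxy).serve_forever()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The lock is the load-bearing part: it turns Claude Code's burst of parallel requests into a polite queue that a &lt;code&gt;--parallel 1&lt;/code&gt; server can survive.&lt;/p&gt;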

&lt;p&gt;But even with the proxy working, I still wouldn't recommend this as your daily coding setup. Claude Code was purpose-built for a specialized agentic flow that works really well with Anthropic models. Giving it to your local LLM as a hand-me-down is going to end in tears and segfaults (which you now hopefully know how to fix). Coding with this setup felt janky at best. If you want to run a local model as a coding agent, OpenCode is a much better fit. I wrote about that setup &lt;a href="https://dev.to/mfolsom/finding-my-frontier-cloud-free-coding-on-glm-5-47o4"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Turn
&lt;/h2&gt;

&lt;p&gt;So, is this the future of development? Will cloud models always stay ahead of the local open-source community?&lt;/p&gt;

&lt;p&gt;Is anyone else running Claude Code with local LLMs for production work, or do you still fall back to the cloud when the "poltergeists" start acting up?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drop your setup and your survival stories in the comments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hardware: M3 Ultra Mac Studio, 512GB | Model: unsloth/GLM-5-GGUF IQ2_XXS (225GB) | Server: llama.cpp with Metal | Proxy: &lt;a href="https://gist.github.com/mfolsom/f014118015800289bb345f1fc1b7885b" rel="noopener noreferrer"&gt;claude-proxy.py&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>claudecode</category>
      <category>localai</category>
    </item>
    <item>
      <title>Finding my Frontier: Cloud free coding on GLM-5</title>
      <dc:creator>Megan Folsom</dc:creator>
      <pubDate>Wed, 18 Feb 2026 20:17:35 +0000</pubDate>
      <link>https://dev.to/mfolsom/finding-my-frontier-cloud-free-coding-on-glm-5-47o4</link>
      <guid>https://dev.to/mfolsom/finding-my-frontier-cloud-free-coding-on-glm-5-47o4</guid>
      <description>&lt;p&gt;I unboxed my new M3 Ultra Mac Studio over the weekend and the first thing I wanted to do with it was try to fit a frontier-sized model on it. I watched some youtube videos (too many.. youtube videos of people in their homelabs. Hello Network Chuck and Alex Ziskind.) and got all fired up with visions of pretty beehive dashboards and the sound of MLX metal screaming down 512GB RAM highways.&lt;/p&gt;

&lt;p&gt;When Z.ai unexpectedly dropped GLM-5, within hours all the home-labbers (is that a word?) were headlining things like "GLM-5 replaces Opus 4.6!" I saw those headlines and thought: this is my frontier. GLM-5 would be the model to break in my new hardware and spawn minion models that would herald a new era of local agentic coding and lifestyle enhancements (no, not through openclaw. Of course not through openclaw. I'd use something...else...secure and clawless...&lt;em&gt;refuses to make eye contact&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;My opening searches for how to run this model on my hardware were almost immediately rewarded by the discovery of a community MLX build of GLM-5! I pulled it down to my local drive and set it up to run on OpenCode. It took days. What seemed like days. Not to set it up; that was super quick. To get the response back from my first prompt (which was the word "hello"). I persisted and actually sat there while the hours turned to days and GLM-5 took upwards of 30 minutes to respond to each and every prompt. I stuck it out, though, and prompted it to build its own WebUI, front-loading the prompt with a detailed waterfall-style requirements doc. This was actually painful to watch and required me to don my &lt;em&gt;infinite patience&lt;/em&gt; cloak, but all the magic cloaks in the world couldn't hide the truth: the MLX path was a dead end for agent use.&lt;/p&gt;

&lt;p&gt;One inarguable trait of mine is that I'm stubborn. I've always believed "Dead End" signs are just suggestions if you have the right vehicle. For my next move, I decided to try Unsloth's guide to running locally with a quantized GGUF. The guide was geared toward a GPU setup rather than Apple Silicon, so I had to improvise and ask Claude Code for help in spots.&lt;/p&gt;

&lt;p&gt;Once I got this running on OpenCode, I clocked 20 full minutes for the time to first token. My immediate thought was that my beefy Mac Studio just wasn't enough to run this model, quantized or not. But a few clues told me this wasn't an issue with the hardware or the model. For one thing, a direct request to llama-server via curl came back in under 5 seconds. That pointed to OpenCode as the culprit, but it was actually more nuanced than that. I decided to try running it in the Claude Code CLI to see if another CLI would fare better. That proved to be its own dead end, but what I learned from it helped me figure out what the issue was with OpenCode. If you're curious, you can find that writeup here: &lt;a href="https://dev.to/mfolsom/the-ghost-in-the-cli-why-claude-code-kills-local-inference-dfc"&gt;The Ghost in the CLI: Why Claude Code Kills Local Inference&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The proxy server I ran to capture data on my Claude Code run hinted at what was happening with OpenCode too. OpenCode was sending over 10K tokens of invisible overhead (tool definitions, policy prompts, etc.) with every request, and though OpenCode has some support for parallelizing the tool definitions, llama.cpp apparently doesn't. Everything was getting processed as one giant 10K-token prompt on the server side. Once I figured this out, I realized GLM-5 might respond faster once that prompt was cached, and that turned out to be true.&lt;/p&gt;

&lt;p&gt;Since you're still here and have sat through the whole story, you probably want the technical tea. The least I can do is give you the complete guide to how I'm now running a frontier-class model with reasonable speed and writing code with it on my local computer. Cloud-free.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardware
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Machine:&lt;/strong&gt; Apple M3 Ultra Mac Studio, 512GB unified RAM (~800 GB/s memory bandwidth)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model:&lt;/strong&gt; &lt;a href="https://hf.co/unsloth/GLM-5-GGUF" rel="noopener noreferrer"&gt;GLM-5&lt;/a&gt;, a 744B parameter MoE model from Zhipu AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coding Agent:&lt;/strong&gt; &lt;a href="https://opencode.ai" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GLM-5 is a Mixture-of-Experts model, so it only activates a fraction of its 744B parameters per token. At IQ2_XXS quantization it fits in 225GB. Plenty of headroom on this machine.&lt;/p&gt;
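&lt;p&gt;A quick sanity check on that 225GB number, using only the figures above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Effective bits per parameter for the IQ2_XXS quant.
params = 744e9        # total parameters (MoE; only a fraction are active per token)
size_bytes = 225e9    # on-disk GGUF footprint
print(size_bytes * 8 / params)   # ~2.4 bits/param averaged across all weights
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;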

&lt;h2&gt;
  
  
  Why MLX Was So Slow (And Why GGUF Isn't)
&lt;/h2&gt;

&lt;p&gt;The MLX build I used was the &lt;a href="https://hf.co/mlx-community/GLM-5-4bit" rel="noopener noreferrer"&gt;community 4-bit build&lt;/a&gt; (390GB) served through mlx-lm. That's Apple's own ML framework, purpose-built for Apple Silicon, and it was still &lt;em&gt;painfully&lt;/em&gt; slow. Here's how it stacks up against the &lt;a href="https://hf.co/unsloth/GLM-5-GGUF" rel="noopener noreferrer"&gt;Unsloth GGUF&lt;/a&gt; (IQ2_XXS, 225GB) through llama-server:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Model Size&lt;/th&gt;
&lt;th&gt;First Turn&lt;/th&gt;
&lt;th&gt;Subsequent Turns&lt;/th&gt;
&lt;th&gt;Generation Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MLX (mlx-lm)&lt;/td&gt;
&lt;td&gt;390GB&lt;/td&gt;
&lt;td&gt;~20 min&lt;/td&gt;
&lt;td&gt;~20 min&lt;/td&gt;
&lt;td&gt;~0.5 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GGUF (llama-server)&lt;/td&gt;
&lt;td&gt;225GB&lt;/td&gt;
&lt;td&gt;~10-20 min&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.6s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;14 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Yes, the GGUF has a smaller footprint, but that alone doesn't explain the painful slowness of the mlx-lm server.&lt;/p&gt;

&lt;p&gt;I believe two things were compounding the slowness: broken prompt caching in mlx-lm's server mode, and the sheer size of the hidden prompt that had to be reprocessed on every turn (more on that in the next section).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt caching.&lt;/strong&gt;  One of the key features of llama-server is that it caches the KV state of the prompt prefix between turns. The first turn chews through the full prompt. That's where the ~10-20 minutes goes (more on why it's so big in the next section). The second turn recognizes the prefix hasn't changed, skips prefill, and you're generating in &lt;strong&gt;0.3 seconds&lt;/strong&gt;. &lt;/p&gt;
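&lt;p&gt;You can watch this behavior without any CLI in the way by hitting llama-server's OpenAI-compatible endpoint twice with a shared prefix. A minimal probe, assuming the server from the setup guide below is running on port 8080 (recent llama.cpp builds cache the prompt by default):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Two requests sharing a long prefix: the second skips most of prefill.
import json, time
from urllib.request import Request, urlopen

PREFIX = "pretend this is the 10K-token system prompt. " * 200   # stand-in prefix

def ask(user_msg):
    body = json.dumps({
        "messages": [
            {"role": "system", "content": PREFIX},
            {"role": "user", "content": user_msg},
        ],
        "max_tokens": 32,
    }).encode()
    req = Request("http://localhost:8080/v1/chat/completions", data=body,
                  headers={"Content-Type": "application/json"})
    t0 = time.time()
    urlopen(req).read()
    print(user_msg, round(time.time() - t0, 1), "s")

ask("hello")         # pays full prefill on the shared prefix
ask("hello again")   # prefix KV state is reused; only the new tokens prefill
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;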

&lt;p&gt;mlx-lm &lt;em&gt;does&lt;/em&gt; have prompt caching features (&lt;code&gt;mlx_lm.cache_prompt&lt;/code&gt;, &lt;code&gt;prompt_cache&lt;/code&gt; in the Python API, etc.) but the server mode (&lt;code&gt;mlx_lm.server&lt;/code&gt;) never actually cached the prompt prefix between HTTP requests in my testing. Every turn paid the full prefill cost no matter how far into the conversation I was. There are known bugs around this: &lt;a href="https://github.com/ml-explore/mlx-examples/issues/259" rel="noopener noreferrer"&gt;mlx-lm #259&lt;/a&gt; reports different logits on repeat prompts, and LM Studio hit similar KV cache issues with their MLX engine. But a broken prompt cache triggers a domino effect as your context window builds up.  Without working prompt caching in server mode, each turn reprocesses the entire conversation history from scratch (system prompt + tool definitions + every prior message), so response times just keep climbing the longer your session runs.&lt;/p&gt;

&lt;p&gt;Fair warning: the first GGUF turn &lt;em&gt;also&lt;/em&gt; takes 10-20 minutes, so it looks identical to the MLX problem. Don't give up. Send a second message. That's when you'll see the difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Cost: 10,600 Tokens Before Your Message
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The invisible CLI prompt.&lt;/strong&gt; For your first simple "hello" message, here's what actually gets sent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /v1/chat/completions
Messages: 2
Tools field entries: 11
System message length: 10,082 chars
Tools field size: 36,614 chars
Total prompt tokens: 10,643
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Over &lt;strong&gt;10,000 tokens&lt;/strong&gt; of system prompt and tool definitions before my message even shows up. The tools do actually go into the proper &lt;code&gt;tools&lt;/code&gt; field, but llama-server's OpenAI endpoint serializes all of that into the prompt template as text, so every token has to go through prefill. That's the giant 10K prompt I mentioned earlier.&lt;/p&gt;
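&lt;p&gt;Those capture numbers hang together, for what it's worth: roughly 4.4 characters per token, which is in the normal range for English prose mixed with JSON schemas:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Chars-per-token check on the capture above.
system_chars, tools_chars = 10_082, 36_614
prompt_tokens = 10_643
print((system_chars + tools_chars) / prompt_tokens)   # ~4.4 chars/token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;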

&lt;p&gt;This is actually more bearable if you think of it as a one-time cost.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;Prefill&lt;/th&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1st (cold)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;97.0s&lt;/strong&gt; / 10,623 tok&lt;/td&gt;
&lt;td&gt;3.9s / 54 tok&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100.9s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2nd (cached)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0.3s&lt;/strong&gt; / 5 tok&lt;/td&gt;
&lt;td&gt;2.3s / 33 tok&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.6s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a coding session that goes dozens of turns, ~100 seconds of cold start is nothing. The prompt cache makes those 10,000+ tokens invisible after the first message. You need to be aware that anytime you trigger a new uncached prompt, you'll pay this cost again. Reviewing an existing codebase would be really slow here. In my mind, though, this is one giant leap for nerdy woman-kind in order to run a &lt;em&gt;frontier-class model&lt;/em&gt; on my homelab. I'll don my infinite patience cloak and I probably won't have to wear it for very long. Maybe I'll make a YouTube video in my homelab. Just kidding. Let's face it. I'm not as photogenic as Network Chuck or Alex Ziskind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Download the Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;huggingface_hub
huggingface-cli download unsloth/GLM-5-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; ~/Models/GLM-5-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*UD-IQ2_XXS*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six shards, ~225GB total. You need enough RAM for the model plus KV cache, so realistically 300GB+ (I had 512GB).&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Build and Run llama-server
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build from source with Metal support&lt;/span&gt;
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp &lt;span class="nt"&gt;-B&lt;/span&gt; llama.cpp/build &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DBUILD_SHARED_LIBS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF &lt;span class="nt"&gt;-DGGML_METAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; llama.cpp/build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;

&lt;span class="c"&gt;# Run&lt;/span&gt;
llama.cpp/build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; ~/Models/GLM-5-GGUF/UD-IQ2_XXS/GLM-5-UD-IQ2_XXS-00001-of-00006.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 65536 &lt;span class="nt"&gt;--parallel&lt;/span&gt; 1 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--ctx-size 65536&lt;/code&gt; gives you room for tool definitions + conversation history&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--parallel 1&lt;/code&gt; keeps memory usage predictable with a single inference slot&lt;/li&gt;
&lt;li&gt;Ollama can't load GLM-5 because it doesn't support sharded GGUFs yet (&lt;a href="https://github.com/ollama/ollama/issues/5245" rel="noopener noreferrer"&gt;issue #5245&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
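<p></p>&lt;p&gt;Before pointing any agent at the server, it's worth a quick smoke test. llama-server exposes a &lt;code&gt;/health&lt;/code&gt; endpoint plus the usual OpenAI-style model listing; a tiny probe, assuming the server above on port 8080:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Smoke test: is the server up, and what model id does it report?
from urllib.request import urlopen

print(urlopen("http://localhost:8080/health").read())      # ok-status JSON once the model has finished loading
print(urlopen("http://localhost:8080/v1/models").read())   # the id listed here is what OpenCode's config references
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;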

&lt;h3&gt;
  
  
  3. Configure OpenCode
&lt;/h3&gt;

&lt;p&gt;Add to &lt;code&gt;~/.config/opencode/opencode.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"$schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://opencode.ai/config.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"llama.cpp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"npm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@ai-sdk/openai-compatible"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama.cpp (local)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"baseURL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8080/v1"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GLM-5-UD-IQ2_XXS-00001-of-00006.gguf"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GLM-5 IQ2_XXS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;128000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Run
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;opencode &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"llama.cpp/GLM-5-UD-IQ2_XXS-00001-of-00006.gguf"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First message will take a while. Go get coffee. Play Doom. This will be worth it. &lt;/p&gt;

&lt;h2&gt;
  
  
  What About Claude Code?
&lt;/h2&gt;

&lt;p&gt;Claude Code also supports local models via &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt;, and llama-server has an Anthropic Messages API endpoint. But Claude Code sends a bunch of internal background requests that crash a single-slot local server before your prompt ever reaches the model. I wrote a proxy to deal with it, and debugging that proxy is actually how I figured out the 10K token overhead issue. That's a whole separate writeup: &lt;a href="https://dev.to/mfolsom/the-ghost-in-the-cli-why-claude-code-kills-local-inference-dfc"&gt;The Ghost in the CLI: Why Claude Code Kills Local Inference&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;My journey to GLM-5 was exactly like that sentence sounds: a trip to a far-off planet full of mysterious black holes, scientific conundrums, and strange alien symbols. Most of all, it was about the long passage of time. Running on an MLX server technically worked but was unusable. Prompt caching never kicked in, and the growing context meant each turn was slower than the last. Running an Unsloth quantization was the best choice for me, even though you can't hide from the tool tax. Unlike with zippy cloud models, you will notice the 10K tokens of invisible overhead, because they dominate your first local interaction (and any uncached prompts thereafter).&lt;/p&gt;

&lt;p&gt;For me this was really about adjusting my expectations. But I'm here to tell you, if you have the hardware to run it, save your tokens and get your Doom game ready. You might just unlock a doorway to the future.&lt;/p&gt;




&lt;iframe src="https://player.vimeo.com/video/1165919373" width="710" height="399"&gt;
&lt;/iframe&gt;

&lt;p&gt;&lt;em&gt;Hardware: M3 Ultra Mac Studio, 512GB | Model: unsloth/GLM-5-GGUF IQ2_XXS (225GB) | Server: llama.cpp with Metal | Agent: OpenCode&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>localai</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
