<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: René Zander</title>
    <description>The latest articles on DEV Community by René Zander (@reneza).</description>
    <link>https://dev.to/reneza</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1138713%2Fa7d8635c-22db-4dec-b156-1fb07de64a8d.jpeg</url>
      <title>DEV Community: René Zander</title>
      <link>https://dev.to/reneza</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/reneza"/>
    <language>en</language>
    <item>
      <title>Claude Code with Local LLMs and ANTHROPIC_BASE_URL: Ollama, LM Studio, llama.cpp, vLLM</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Wed, 29 Apr 2026 05:53:48 +0000</pubDate>
      <link>https://dev.to/reneza/claude-code-with-local-llms-and-anthropicbaseurl-ollama-lm-studio-llamacpp-vllm-1g6j</link>
      <guid>https://dev.to/reneza/claude-code-with-local-llms-and-anthropicbaseurl-ollama-lm-studio-llamacpp-vllm-1g6j</guid>
      <description>&lt;p&gt;&lt;em&gt;Native Anthropic endpoints, tool-call compatibility, and context-window sizing for local Claude Code.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Last tested: April 2026. See Changelog at the bottom.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR cheat sheet
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B Q4, &lt;strong&gt;32K context&lt;/strong&gt;, LM Studio or Ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B Q4 / UD-Q4, &lt;strong&gt;64K context&lt;/strong&gt;, llama.cpp or LM Studio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code minimum&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;32K context&lt;/strong&gt; (anything below is a chat demo)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best local backend&lt;/td&gt;
&lt;td&gt;LM Studio or Ollama first; llama.cpp for advanced; vLLM for servers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avoid&lt;/td&gt;
&lt;td&gt;8K / 16K context, dense 31B Gemma 4 on 32 GB machines, old llama.cpp builds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The local-Claude-Code rule of thumb
&lt;/h2&gt;

&lt;p&gt;Three things decide whether a local Claude Code session works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model quality&lt;/strong&gt; decides whether the answer is smart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-call formatting&lt;/strong&gt; decides whether Claude Code can act on the answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context length&lt;/strong&gt; decides whether the session survives past the first few edits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For local coding agents: &lt;strong&gt;32K is the floor. 64K is the sweet spot.&lt;/strong&gt; Anything below 32K is a chat demo, not Claude Code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended setup
&lt;/h2&gt;

&lt;p&gt;Use this first. Don't shop the buffet of alternatives until you've tried this one.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; LM Studio (≥ 0.4.1) or Ollama (≥ v0.14.0) — both expose a native &lt;strong&gt;Anthropic compatible local endpoint&lt;/strong&gt;, no proxy needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model:&lt;/strong&gt; &lt;code&gt;gemma4:26b-a4b&lt;/code&gt; (Gemma 4 26B-A4B-it, Q4 quant). MoE active-param ≈ 3.88 B → laptop-friendly latency, tool-use trained directly into the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context:&lt;/strong&gt; &lt;strong&gt;32K context&lt;/strong&gt; on a MacBook Air, &lt;strong&gt;64K context&lt;/strong&gt; on a MacBook Pro M5 Pro/Max with 48 GB+ RAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine:&lt;/strong&gt; 32 GB+ RAM strongly preferred. 24 GB works at 24K–32K with care.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your backend has no Anthropic-compatible mode and only exposes an &lt;strong&gt;OpenAI compatible local endpoint&lt;/strong&gt;, run LiteLLM in front (see section 7).&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Environment variables Claude Code reads
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Where Claude Code POSTs requests. Default: https://api.anthropic.com&lt;/span&gt;
&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:11434

&lt;span class="c"&gt;# Sent as auth. Local servers usually accept any non-empty value.&lt;/span&gt;
&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama

&lt;span class="c"&gt;# Map Claude Code's "claude-opus-X-Y" / "claude-sonnet-X-Y" / "claude-haiku-X-Y"&lt;/span&gt;
&lt;span class="c"&gt;# to model names your local backend serves.&lt;/span&gt;
&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_OPUS_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma4:26b-a4b
&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma4:26b-a4b
&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_HAIKU_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpt-oss:20b

claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or override per-invocation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--model&lt;/span&gt; gemma4:26b-a4b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; is set but the URL doesn't respond with the right shape, Claude Code does not fall back to the cloud. It errors out.&lt;/p&gt;
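
&lt;p&gt;A quick way to verify the shape before launching &lt;code&gt;claude&lt;/code&gt; is a direct request to the Messages route. A sketch, assuming the backend serves it at &lt;code&gt;/v1/messages&lt;/code&gt; (the path Claude Code appends to the base URL) and that any non-empty key passes auth, as most local servers allow; the model name must be one your backend actually serves:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Minimal Anthropic Messages request; some backends want "authorization: Bearer ..." instead
curl -s "$ANTHROPIC_BASE_URL/v1/messages" \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -H "x-api-key: $ANTHROPIC_AUTH_TOKEN" \
  -d '{"model":"gemma4:26b-a4b","max_tokens":32,"messages":[{"role":"user","content":"ping"}]}'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;A healthy backend answers with a JSON object containing &lt;code&gt;"type": "message"&lt;/code&gt; and a &lt;code&gt;content&lt;/code&gt; array. Anything else (a 404, HTML, an OpenAI-shaped &lt;code&gt;choices&lt;/code&gt; array) means Claude Code will error against it.&lt;/p&gt;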




&lt;h2&gt;
  
  
  2. Context length: the hidden failure mode
&lt;/h2&gt;

&lt;p&gt;Claude Code is not a chat prompt. Before your actual request, the backend sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code's system prompt (~6–10K tokens by itself)&lt;/li&gt;
&lt;li&gt;tool definitions for &lt;code&gt;Read&lt;/code&gt; / &lt;code&gt;Edit&lt;/code&gt; / &lt;code&gt;Bash&lt;/code&gt; / &lt;code&gt;Grep&lt;/code&gt; / &lt;code&gt;Glob&lt;/code&gt; / &lt;code&gt;TodoWrite&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;conversation history&lt;/li&gt;
&lt;li&gt;file excerpts and full reads&lt;/li&gt;
&lt;li&gt;diffs&lt;/li&gt;
&lt;li&gt;command output&lt;/li&gt;
&lt;li&gt;retry/error messages from failed tool calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means &lt;strong&gt;8K and 16K contexts are misleading tests.&lt;/strong&gt; They may answer a chat question, but they are not enough for reliable agentic coding. The session survives a handful of turns, then silently degrades — file edits truncate, tool calls drop arguments, the loop gets confused.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical context tiers
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;Broken for Claude Code&lt;/td&gt;
&lt;td&gt;System prompt + tools eat the window before your code arrives. Chat-only.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16K&lt;/td&gt;
&lt;td&gt;Demo only&lt;/td&gt;
&lt;td&gt;Tiny edits, short sessions. Not a real test of any model.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25K&lt;/td&gt;
&lt;td&gt;LM Studio's stated minimum&lt;/td&gt;
&lt;td&gt;Good enough for small tasks if tool calls are reliable.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;32K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Real minimum (32K context).&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ollama recommends this floor. Use as your default.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;64K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Sweet spot (64K context).&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best balance on 32GB+ machines. Handles medium repos and multi-file edits.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K+&lt;/td&gt;
&lt;td&gt;Diminishing returns&lt;/td&gt;
&lt;td&gt;Prefill latency and KV-cache memory rise hard. Worth it only on high-memory servers, and only for repo-wide reads.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Apple Silicon context presets
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Recommended context&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air M5, 16 GB&lt;/td&gt;
&lt;td&gt;16K–24K&lt;/td&gt;
&lt;td&gt;Use smaller models (≤8B). 26B-A4B is tight.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air M5, 24 GB&lt;/td&gt;
&lt;td&gt;24K–32K&lt;/td&gt;
&lt;td&gt;32K is the target; keep other apps light.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air M5, 32 GB&lt;/td&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;Best Air setup. Higher rarely beats thermal throttling.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro M5 Pro, 24 GB&lt;/td&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;Better sustained perf than Air at the same context.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro M5 Pro, 48/64 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;64K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sweet spot for serious local coding.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro M5 Max, 64/128 GB&lt;/td&gt;
&lt;td&gt;64K default, 128K experimental&lt;/td&gt;
&lt;td&gt;Use 128K for repo-wide analysis, not every edit loop.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: backend docs differ — LM Studio says "start at 25K, increase for better results," Ollama recommends 32K. &lt;strong&gt;Use 32K as the cross-backend baseline.&lt;/strong&gt; Reading "25K" as "25K is enough" is the most common mistake.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. Claude Code Ollama setup (native, v0.14.0+)
&lt;/h2&gt;

&lt;p&gt;Ollama announced Anthropic Messages API compatibility on 2026-01-16. No proxy, no LiteLLM, no nothing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set context length first — this is the most important knob&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_CONTEXT_LENGTH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;32768   &lt;span class="c"&gt;# 65536 on a Pro&lt;/span&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:11434

claude &lt;span class="nt"&gt;--model&lt;/span&gt; gemma4:26b-a4b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
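
&lt;p&gt;If you haven't pulled the model yet, do that once up front (same model tag used throughout this post):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4:26b-a4b
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;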



&lt;p&gt;Cloud-hosted Ollama models work too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--model&lt;/span&gt; glm-4.7:cloud
claude &lt;span class="nt"&gt;--model&lt;/span&gt; minimax-m2.1:cloud
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Two known limits of Ollama's Anthropic-compat layer (April 2026):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No prompt caching.&lt;/strong&gt; Anthropic's &lt;code&gt;cache_control&lt;/code&gt; doesn't apply — every Claude Code request re-processes the system prompt and conversation history from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;tool_choice&lt;/code&gt;.&lt;/strong&gt; Claude Code occasionally uses &lt;code&gt;tool_choice&lt;/code&gt; to force a specific tool call. Ollama's compat layer ignores it. When it matters, Claude Code may pick the wrong tool and get stuck in a loop.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  4. Claude Code LM Studio setup (native, 0.4.1+)
&lt;/h2&gt;

&lt;p&gt;LM Studio added the Anthropic-compatible &lt;code&gt;/v1/messages&lt;/code&gt; endpoint on 2026-01-30. Streaming, tool calls, and the Anthropic message shape are all supported natively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set context to at least 32K in the LM Studio UI (or higher; see section 2)&lt;/span&gt;
lms server start &lt;span class="nt"&gt;--port&lt;/span&gt; 1234

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:1234
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lmstudio

claude &lt;span class="nt"&gt;--model&lt;/span&gt; openai/gpt-oss-20b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For VS Code with the Claude Code extension (env vars from your shell are NOT inherited by VS Code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json-doc"&gt;&lt;code&gt;&lt;span class="c1"&gt;// .vscode/settings.json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"claudeCode.environmentVariables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ANTHROPIC_BASE_URL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:1234"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ANTHROPIC_AUTH_TOKEN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lmstudio"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LM Studio's docs say "at least 25K." Set &lt;strong&gt;32K&lt;/strong&gt;. See section 2.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Claude Code llama.cpp setup (Apple Silicon fast path for Gemma 4 26B-A4B)
&lt;/h2&gt;

&lt;p&gt;If you're on Apple Silicon and want the absolute lowest overhead with Gemma 4 26B-A4B, llama.cpp's server is faster per-token than Ollama or LM Studio. You need a recent build (one that supports &lt;code&gt;-hf&lt;/code&gt; for HuggingFace pulls and &lt;code&gt;--jinja&lt;/code&gt; for chat templates).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-hf&lt;/span&gt; ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 127.0.0.1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 65536 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--jinja&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;llama-cpp
claude &lt;span class="nt"&gt;--model&lt;/span&gt; gemma-4-26B-A4B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flags that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-c 65536&lt;/code&gt; sets &lt;strong&gt;64K context&lt;/strong&gt; (drop to &lt;code&gt;-c 32768&lt;/code&gt; on tighter machines).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-ngl 99&lt;/code&gt; offloads all layers to Metal/GPU.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--jinja&lt;/code&gt; is required for Gemma 4's chat template to render correctly. Without it, tool calls come out malformed and you'll see &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;unused49&amp;gt;&lt;/code&gt; tokens leaking into the output.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M&lt;/code&gt; pulls the GGUF straight from HuggingFace.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Caveat:&lt;/strong&gt; llama.cpp's Anthropic-compat is &lt;strong&gt;partial.&lt;/strong&gt; It works for chat and basic tool calling, but streaming shapes and some Anthropic-specific request fields are rougher than in Ollama or LM Studio. If something breaks weirdly, fall back to Ollama. llama.cpp is the speed play, not the compatibility play.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Claude Code vLLM setup (native + tool parser)
&lt;/h2&gt;

&lt;p&gt;vLLM ships an official Claude Code integration. Three things at server start: a tool-calling-capable model, &lt;code&gt;--enable-auto-tool-choice&lt;/code&gt;, and the right &lt;code&gt;--tool-call-parser&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve openai/gpt-oss-120b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--served-model-name&lt;/span&gt; my-model &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-auto-tool-choice&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tool-call-parser&lt;/span&gt; openai &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8000
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dummy
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dummy
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_OPUS_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-model
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-model
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_HAIKU_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-model

claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--tool-call-parser&lt;/code&gt; value depends on the model family — &lt;code&gt;openai&lt;/code&gt; for the gpt-oss family, &lt;code&gt;llama3_json&lt;/code&gt; for Llama 3.x, &lt;code&gt;hermes&lt;/code&gt; for Hermes. Wrong parser → tool calls return as plain text and Claude Code's edit/grep/bash tools silently no-op.&lt;/p&gt;
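
&lt;p&gt;Swapping model families means swapping the parser too. A sketch using the parser names from above; the model IDs are illustrative picks, substitute whatever you actually serve:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Llama 3.x family
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice --tool-call-parser llama3_json

# Hermes family
vllm serve NousResearch/Hermes-3-Llama-3.1-8B \
  --enable-auto-tool-choice --tool-call-parser hermes
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;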




&lt;h2&gt;
  
  
  7. LiteLLM — for fallbacks, not for translation
&lt;/h2&gt;

&lt;p&gt;With Ollama, LM Studio, llama.cpp, and vLLM all speaking native Anthropic now, LiteLLM's role changes. It's no longer "the translator" — it's the router for &lt;strong&gt;fallbacks, request logging, per-tenant keys, and rate limits.&lt;/strong&gt; Also the right answer if your only local option is an &lt;strong&gt;OpenAI compatible local endpoint&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# litellm-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/my-vllm-model&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://vllm:8000/v1&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/gemma4:26b-a4b&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://ollama:11434&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-haiku-4-5&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-haiku-4-5&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/ANTHROPIC_API_KEY&lt;/span&gt;

&lt;span class="na"&gt;router_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-5"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# local fail → cloud Haiku&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
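
&lt;p&gt;To wire it up, start the proxy on that config and point Claude Code at it. A sketch, assuming LiteLLM's proxy on port 4000 and no master key configured (in which case any non-empty token passes):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# pip install 'litellm[proxy]'
litellm --config litellm-config.yaml --port 4000

export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_AUTH_TOKEN=litellm
claude
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;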



&lt;p&gt;The single biggest win: when a local tool call silently fails, LiteLLM falls back to cloud Haiku transparently. Claude Code keeps working.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Common failures (the error strings developers google)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;tool_use parse error&lt;/code&gt; / &lt;code&gt;invalid tool call&lt;/code&gt; / &lt;code&gt;tool_use is not supported&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Three different symptoms, one root cause: the model is not emitting Anthropic-format &lt;code&gt;tool_use&lt;/code&gt; content blocks.&lt;/p&gt;

&lt;p&gt;The most deceptive symptom is the silent one — Claude Code starts, prints the model's plain-prose answer ("I would change the file like this..."), and &lt;em&gt;nothing happens&lt;/em&gt;. No file edit, no error.&lt;/p&gt;

&lt;p&gt;Common causes (April 2026):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;vLLM:&lt;/strong&gt; missing &lt;code&gt;--enable-auto-tool-choice&lt;/code&gt; or wrong &lt;code&gt;--tool-call-parser&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama:&lt;/strong&gt; model that wasn't trained for tool calling (avoid stock &lt;code&gt;llama3.x&lt;/code&gt; instruct).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp:&lt;/strong&gt; missing &lt;code&gt;--jinja&lt;/code&gt;. The chat template renders incorrectly and you see literal &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;unused49&amp;gt;&lt;/code&gt; tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LM Studio:&lt;/strong&gt; model file is fine but the loaded preset uses the wrong template.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;context length exceeded&lt;/code&gt; / model stopped mid-edit
&lt;/h3&gt;

&lt;p&gt;Claude Code's prompts overflow the configured window. The session may finish a single turn, then truncate the next file edit silently. &lt;strong&gt;Fix: raise context to at least 32K.&lt;/strong&gt; If you're already at 32K and still hitting this, the model is reading too aggressively — drop to fewer tools or shorter file reads.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;empty assistant response&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Backend returned &lt;code&gt;200 OK&lt;/code&gt; with an empty content array. Causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streaming SSE format mismatch (mostly llama.cpp).&lt;/li&gt;
&lt;li&gt;Tool-call parser swallowed the message because it couldn't parse it.&lt;/li&gt;
&lt;li&gt;Model emitted only a &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;unused49&amp;gt;&lt;/code&gt; token and the parser dropped the rest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix: switch backend (Ollama or LM Studio if you were on llama.cpp), or upgrade llama.cpp to a build with the patched Gemma 4 chat template.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;model not found&lt;/code&gt; / &lt;code&gt;404 the model X does not exist&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Claude Code asked for &lt;code&gt;claude-opus-4-7&lt;/code&gt; but the backend serves &lt;code&gt;gpt-oss:20b&lt;/code&gt; or &lt;code&gt;gemma4:26b-a4b&lt;/code&gt;. Fixes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;ANTHROPIC_DEFAULT_OPUS_MODEL&lt;/code&gt; (plus &lt;code&gt;_SONNET_&lt;/code&gt; and &lt;code&gt;_HAIKU_&lt;/code&gt;) to the backend's actual model name.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;claude --model &amp;lt;backend-name&amp;gt;&lt;/code&gt; per call.&lt;/li&gt;
&lt;li&gt;Map the names in LiteLLM (the &lt;code&gt;model_name:&lt;/code&gt; field is what Claude Code asks for; &lt;code&gt;model:&lt;/code&gt; is what gets served).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;messages: Extra inputs are not permitted&lt;/code&gt; (HTTP 422)
&lt;/h3&gt;

&lt;p&gt;Some backends are stricter than Anthropic's own. They reject Anthropic-specific fields (&lt;code&gt;cache_control&lt;/code&gt;, &lt;code&gt;thinking&lt;/code&gt;, &lt;code&gt;tools[].input_schema&lt;/code&gt;, &lt;code&gt;metadata.user_id&lt;/code&gt;). Fix: upgrade the backend, or run a small middleware proxy that strips the unsupported fields before forwarding.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; ignored / Claude Code still calls the real API
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Env var was set in &lt;code&gt;.zshrc&lt;/code&gt; &lt;em&gt;after&lt;/em&gt; the shell session started — restart the terminal.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;~/.config/claude/config.json&lt;/code&gt; or a &lt;code&gt;--api-key&lt;/code&gt; flag is overriding the env var.&lt;/li&gt;
&lt;li&gt;VS Code: env vars from your shell are NOT inherited. Use &lt;code&gt;claudeCode.environmentVariables&lt;/code&gt; in workspace settings (section 4).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check from the same shell that runs &lt;code&gt;claude&lt;/code&gt;; if this prints nothing, you have a sourcing problem:&lt;/p&gt;
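
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;echo "$ANTHROPIC_BASE_URL"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;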




&lt;h2&gt;
  
  
  9. Debug flow
&lt;/h2&gt;

&lt;p&gt;When something breaks, walk this tree before swapping backends:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Did the model load?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;No → check quant size vs RAM. 26B-A4B Q4 needs ~16 GB free; bigger quants need more.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the context at least 32K?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;No → raise to 32K (Air) or 64K (Pro). See section 2.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are tool calls malformed?&lt;/strong&gt; (Look for &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;unused49&amp;gt;&lt;/code&gt;, plain prose where you expected an edit.)

&lt;ul&gt;
&lt;li&gt;Yes → switch to native Anthropic mode (Ollama/LM Studio), or for vLLM verify &lt;code&gt;--tool-call-parser&lt;/code&gt;, or for llama.cpp add &lt;code&gt;--jinja&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does Claude Code stop mid-edit?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Yes → context exhaustion. Raise the context size if RAM allows, or cut what enters the window: shorter file reads, fewer enabled tools. See section 2.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the model hallucinating files that don't exist?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Yes → the model isn't calling &lt;code&gt;Read&lt;/code&gt; before &lt;code&gt;Edit&lt;/code&gt;. Add a CLAUDE.md rule that requires reading before editing (see the snippet after this list), or use a model with stronger tool-use training (Gemma 4 26B-A4B is solid here).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
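
&lt;p&gt;A minimal CLAUDE.md rule for the read-before-edit case; the wording is illustrative, not canonical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Editing rules
- Never Edit a file you have not Read in this session.
- If a Read fails, stop and report the path instead of guessing the contents.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;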




&lt;h2&gt;
  
  
  10. Smoke test
&lt;/h2&gt;

&lt;p&gt;Verify your setup with one prompt. Ask Claude Code:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Create a small FastAPI app with one &lt;code&gt;/health&lt;/code&gt; endpoint, add a pytest test for it, run pytest, and fix any failures.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Passes if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It reads/writes files correctly (no hallucinated paths).&lt;/li&gt;
&lt;li&gt;It runs the test command (you see real &lt;code&gt;pytest&lt;/code&gt; output).&lt;/li&gt;
&lt;li&gt;It patches a failure (e.g. missing dependency) without losing context.&lt;/li&gt;
&lt;li&gt;It does not lose tool-call format (no &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;unused49&amp;gt;&lt;/code&gt; leakage).&lt;/li&gt;
&lt;li&gt;It does not truncate after the first edit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Expected terminal feel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✓ model loaded     (gemma4:26b-a4b, Q4_K_M)
✓ context: 32768
✓ tool call parsed (Edit)
✓ edited file      (app.py)
✓ tool call parsed (Bash)
✓ tests passed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If any of those checks are missing, walk the debug flow above.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Compatibility matrix (April 2026)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Native Anthropic API&lt;/th&gt;
&lt;th&gt;Tool calls&lt;/th&gt;
&lt;th&gt;Context floor&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ollama (≥ v0.14.0)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Depends on model&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;32K context&lt;/strong&gt; (cross-backend baseline)&lt;/td&gt;
&lt;td&gt;Easiest setup. No prompt caching, no &lt;code&gt;tool_choice&lt;/code&gt; (see section 3).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LM Studio (≥ 0.4.1)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (out of the box)&lt;/td&gt;
&lt;td&gt;Stated 25K, &lt;strong&gt;use 32K&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Streaming + &lt;code&gt;tool_use&lt;/code&gt; blocks supported natively. VS Code extension takes workspace env vars.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama.cpp server&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Yes with &lt;code&gt;--jinja&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;32K&lt;/strong&gt;, &lt;strong&gt;64K context&lt;/strong&gt; on Pro&lt;/td&gt;
&lt;td&gt;Lowest overhead on Apple Silicon. Rougher Anthropic-compat. Best path for Gemma 4 26B-A4B.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes with &lt;code&gt;--enable-auto-tool-choice&lt;/code&gt; + correct parser&lt;/td&gt;
&lt;td&gt;Model-dependent&lt;/td&gt;
&lt;td&gt;Best throughput. Requires correct parser per model family.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;Routes to any backend&lt;/td&gt;
&lt;td&gt;Whatever the backend supports&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;Use for fallbacks and logging, or to wrap an OpenAI compatible local endpoint as Anthropic.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direct Ollama &amp;lt; v0.14.0&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;Upgrade.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  12. Hardware × model × context × backend (the cheat-sheet table)
&lt;/h2&gt;

&lt;p&gt;A developer should not have to infer what to use:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air M5, 16 GB&lt;/td&gt;
&lt;td&gt;Gemma 4 E4B&lt;/td&gt;
&lt;td&gt;16K–24K&lt;/td&gt;
&lt;td&gt;LM Studio&lt;/td&gt;
&lt;td&gt;usable for small tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air M5, 24 GB&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B Q4&lt;/td&gt;
&lt;td&gt;24K–32K&lt;/td&gt;
&lt;td&gt;Ollama / LM Studio&lt;/td&gt;
&lt;td&gt;good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air M5, 32 GB&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B Q4&lt;/td&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;Ollama / LM Studio&lt;/td&gt;
&lt;td&gt;best Air setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro M5 Pro, 48 GB&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B Q4/UD-Q4&lt;/td&gt;
&lt;td&gt;64K&lt;/td&gt;
&lt;td&gt;llama.cpp / LM Studio&lt;/td&gt;
&lt;td&gt;sweet spot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro M5 Max, 64 GB+&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B or 31B&lt;/td&gt;
&lt;td&gt;64K–128K&lt;/td&gt;
&lt;td&gt;llama.cpp / vLLM&lt;/td&gt;
&lt;td&gt;best local&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the single most copied table in this gist. Bookmark it.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. Gemma 4 26B-A4B: the Apple Silicon sweet spot
&lt;/h2&gt;

&lt;p&gt;For Mac local Claude Code, the standout Gemma 4 variant is &lt;strong&gt;26B-A4B-it&lt;/strong&gt;, not the dense 31B. Reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google trained tool-use directly into Gemma 4 (not bolted on as a fine-tune). It works on the first try, not after three retries.&lt;/li&gt;
&lt;li&gt;The 26B MoE activates only ~3.88 B params per inference, so latency is in the 4 B-model range — around 300 tok/sec on M2 Ultra.&lt;/li&gt;
&lt;li&gt;Strong tool-use behavior, good enough coding quality for private/local workflows.&lt;/li&gt;
&lt;li&gt;Fits at useful context sizes on high-memory MacBooks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why 26B-A4B instead of 31B?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Faster tool calls — every Claude Code turn is bottlenecked by tool-call latency, not single-shot quality.&lt;/li&gt;
&lt;li&gt;Lower active-parameter count keeps prefill cheap.&lt;/li&gt;
&lt;li&gt;Better fit for laptops — 31B dense needs more RAM and more thermal headroom.&lt;/li&gt;
&lt;li&gt;Enough quality for iterative coding; the agent loop matters more than peak IQ.&lt;/li&gt;
&lt;li&gt;31B may be better for single-shot answers — but Claude Code is many small turns, not one big answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;strong&gt;Gemma 4 local coding&lt;/strong&gt; specifically: pick 26B-A4B unless you're on a 64 GB+ Pro and you've measured that 31B Q4 actually finishes turns faster on your hardware.&lt;/p&gt;




&lt;h2&gt;
  
  
  14. Other model picks for Claude Code (April 2026)
&lt;/h2&gt;

&lt;p&gt;If Gemma 4 isn't available or you want to compare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gpt-oss:20b&lt;/code&gt;&lt;/strong&gt; — easy starting point. Tool calling reliable, runs on a single decent GPU. Recommended in Ollama's and LM Studio's official Claude Code blog posts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gpt-oss:120b&lt;/code&gt;&lt;/strong&gt; — much smarter on real codebases. The vLLM Claude Code integration page uses this as the example. Needs serious VRAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;qwen3-coder&lt;/code&gt;&lt;/strong&gt; — purpose-built for coding. Strong tool-call performance on Ollama. Frequently called the strongest local pick for Claude Code in March/April 2026 community threads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;qwen3.5&lt;/code&gt; family&lt;/strong&gt; — the 35B MoE variants are reported as the strongest agentic-coding open models in this size class. Verify tool-call support per quant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;glm-4.7-flash&lt;/code&gt; / &lt;code&gt;glm-4.7:cloud&lt;/code&gt;&lt;/strong&gt; — strong agentic coder. Available as an Ollama cloud model (no local GPU needed).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;minimax-m2.1:cloud&lt;/code&gt;&lt;/strong&gt; — newer Ollama cloud option, agentic-tuned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What to avoid: stock &lt;code&gt;llama3.x&lt;/code&gt; instruct models without tool fine-tuning. They will look like they work, then silently fail on file edits.&lt;/p&gt;




&lt;h2&gt;
  
  
  15. Setups I would avoid
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8K context.&lt;/strong&gt; Too small for Claude Code. The system prompt eats it before your code arrives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;16K context.&lt;/strong&gt; Demos only. Don't judge a model by 16K behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Old llama.cpp builds with Gemma 4.&lt;/strong&gt; No &lt;code&gt;--jinja&lt;/code&gt; or no patched chat template → &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;unused49&amp;gt;&lt;/code&gt; token leakage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;128K context on a 32 GB laptop.&lt;/strong&gt; KV cache + prefill latency tax &amp;gt; the benefit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Judging model quality before tool calls are stable.&lt;/strong&gt; Fix the parser/template first, then evaluate the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing through LiteLLM when the backend is already native Anthropic.&lt;/strong&gt; Adds a hop for nothing — only use LiteLLM for fallbacks or when wrapping an OpenAI compatible local endpoint.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  16. Reusable startup script
&lt;/h2&gt;

&lt;p&gt;Drop this in &lt;code&gt;start-claude-code-local.sh&lt;/code&gt; and &lt;code&gt;chmod +x&lt;/code&gt;. Default 32K context, override via env.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_CONTEXT_LENGTH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_CONTEXT_LENGTH&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;32768&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;://localhost:11434&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;ollama&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_OPUS_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_OPUS_MODEL&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;gemma4&lt;/span&gt;:26b-a4b&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;gemma4&lt;/span&gt;:26b-a4b&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_HAIKU_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_HAIKU_MODEL&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;gpt&lt;/span&gt;&lt;span class="p"&gt;-oss&lt;/span&gt;:20b&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Starting Ollama with context=&lt;/span&gt;&lt;span class="nv"&gt;$OLLAMA_CONTEXT_LENGTH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
ollama serve &amp;amp;
&lt;span class="nv"&gt;OLLAMA_PID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$!&lt;/span&gt;

&lt;span class="c"&gt;# Wait for Ollama to be ready&lt;/span&gt;
&lt;span class="k"&gt;until &lt;/span&gt;curl &lt;span class="nt"&gt;-sf&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="s2"&gt;/api/version"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;0.5
&lt;span class="k"&gt;done

&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Launching Claude Code → &lt;/span&gt;&lt;span class="nv"&gt;$ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Model: &lt;/span&gt;&lt;span class="nv"&gt;$ANTHROPIC_DEFAULT_OPUS_MODEL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

claude

&lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="nv"&gt;$OLLAMA_PID&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For LM Studio, swap &lt;code&gt;ollama serve&lt;/code&gt; for &lt;code&gt;lms server start --port 1234&lt;/code&gt; and update the env vars accordingly.&lt;/p&gt;
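
&lt;p&gt;The swapped lines would look like this, following the same default-override pattern (port from section 4):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lms server start --port 1234

export ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL:-http://localhost:1234}"
export ANTHROPIC_AUTH_TOKEN="${ANTHROPIC_AUTH_TOKEN:-lmstudio}"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;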

&lt;p&gt;This script (and additions for other backends as they ship) lives in the companion repo:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://github.com/renezander030/local-ai-coding-stack" rel="noopener noreferrer"&gt;github.com/renezander030/local-ai-coding-stack&lt;/a&gt; — &lt;code&gt;git clone&lt;/code&gt;, &lt;code&gt;chmod +x scripts/start-claude-code-local.sh&lt;/code&gt;, run.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  17. Production recommendation
&lt;/h2&gt;

&lt;p&gt;For real work, do not let Claude Code talk directly to a single local endpoint without a fallback path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code
   │  ANTHROPIC_BASE_URL
   ▼
LiteLLM (router + logger)
   │  primary
   ▼
Ollama / LM Studio / llama.cpp / vLLM (local)
   │  on tool-call failure or 5xx
   ▼
Cloud Claude Haiku (fallback)
   │
   ▼
Audit log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Model swaps without restarting Claude Code; transparent fallback when local tool calling silently fails; request logs you can grep when something goes wrong. Same five-contract pattern from &lt;a href="https://github.com/renezander030/agent-approval-gate" rel="noopener noreferrer"&gt;agent-approval-gate&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  18. When local models are the wrong choice
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repo-wide refactors.&lt;/strong&gt; Multi-step tool flows compound silent tool-call failures. Local fine-tunes drop accuracy fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security-sensitive edits without an approval gate.&lt;/strong&gt; Use &lt;a href="https://github.com/renezander030/agent-approval-gate" rel="noopener noreferrer"&gt;agent-approval-gate&lt;/a&gt; and the local-vs-cloud question becomes secondary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-heavy sessions&lt;/strong&gt; (50+ tool calls). Every silent failure compounds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anything billed by your time.&lt;/strong&gt; A failed local tool call costs your time; a successful Haiku call is roughly $0.001.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Local Claude Code is a fit for: chat-only assist on private code, classification/summarization sub-steps, air-gapped environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  Series
&lt;/h2&gt;

&lt;p&gt;This gist is part of &lt;strong&gt;Production AI Automation Notes&lt;/strong&gt; — a running set of repos and gists on shipping AI agents outside demos. Other entries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/renezander030/agent-approval-gate" rel="noopener noreferrer"&gt;agent-approval-gate&lt;/a&gt; — production-safe approval pattern. Drop in front of any local-model agent that touches real systems.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/renezander030/9069db775e494ffd2cdd5a09adf83add" rel="noopener noreferrer"&gt;Production AI Automation Notes #1: Agent Approval Gates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/renezander030/2898eb5f0100688f4197b5e493e156a2" rel="noopener noreferrer"&gt;CLAUDE.md — 10 rules for Claude Code, edit-time and runtime&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/renezander030/83ad49aeffa5f8749325a2b19617823f" rel="noopener noreferrer"&gt;Context7 v2 — enterprise GraphQL MCP pattern&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ollama.com/blog/claude" rel="noopener noreferrer"&gt;Ollama — Claude Code with Anthropic API compatibility (2026-01-16)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lmstudio.ai/blog/claudecode" rel="noopener noreferrer"&gt;LM Studio — Use your LM Studio Models in Claude Code (2026-01-30)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/en/stable/serving/integrations/claude_code/" rel="noopener noreferrer"&gt;vLLM — Claude Code integration docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.claude.com/en/docs/claude-code/overview" rel="noopener noreferrer"&gt;Anthropic Claude Code documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.claude.com/en/api/messages" rel="noopener noreferrer"&gt;Anthropic Messages API reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.litellm.ai/docs/anthropic_unified" rel="noopener noreferrer"&gt;LiteLLM Anthropic-compatible route docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/7178" rel="noopener noreferrer"&gt;Claude Code GitHub issue #7178 — local/self-hosted model support&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reader contributions
&lt;/h2&gt;

&lt;p&gt;If you get this working on a different Mac/RAM/model combo, comment with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;machine&lt;/li&gt;
&lt;li&gt;RAM&lt;/li&gt;
&lt;li&gt;backend&lt;/li&gt;
&lt;li&gt;model + quant&lt;/li&gt;
&lt;li&gt;context length&lt;/li&gt;
&lt;li&gt;what worked / what failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The compatibility matrix and hardware table are updated weekly from these reports.&lt;/p&gt;

&lt;h2&gt;
  
  
  Changelog
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2026-04-28
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Added TL;DR cheat sheet, Recommended setup section, smoke test, debug flow, reusable startup script, hardware × model × context × backend table.&lt;/li&gt;
&lt;li&gt;Expanded error-string section to include &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;unused49&amp;gt;&lt;/code&gt; template-leak symptoms.&lt;/li&gt;
&lt;li&gt;Added 26B-A4B vs 31B comparison bullets.&lt;/li&gt;
&lt;li&gt;Added "Setups I would avoid."&lt;/li&gt;
&lt;li&gt;Renamed Update log → Changelog.&lt;/li&gt;
&lt;li&gt;Added Gemma 4 26B-A4B context recommendations.&lt;/li&gt;
&lt;li&gt;Added MacBook Air vs Pro presets.&lt;/li&gt;
&lt;li&gt;Added 32K / 64K Claude Code guidance.&lt;/li&gt;
&lt;li&gt;Backend coverage rewritten: Ollama, LM Studio, vLLM all native Anthropic; llama.cpp added as Apple Silicon fast path.&lt;/li&gt;
&lt;li&gt;LiteLLM repositioned as fallback router (and OpenAI-compat wrapper), not translator.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2026-04-22
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Initial publish.&lt;/li&gt;
&lt;/ul&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=claude-code-with-local-llms-and-anthropicbaseurl-ollama-lm-studio-llamacpp-vllm" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Voice AI in Production: From RunPod to Hosted Kubernetes</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Thu, 23 Apr 2026 13:10:51 +0000</pubDate>
      <link>https://dev.to/reneza/voice-ai-in-production-from-runpod-to-hosted-kubernetes-7gg</link>
      <guid>https://dev.to/reneza/voice-ai-in-production-from-runpod-to-hosted-kubernetes-7gg</guid>
      <description>&lt;p&gt;Your voice model works in a demo. The same model in production stalls under concurrent load. The model file is identical. So is the GPU card. Only the deployment changed.&lt;/p&gt;

&lt;p&gt;If your TTS service runs on a single RunPod pod, you've already met this wall. You handle one request per GPU at a time. A crash costs ninety seconds to reload the model. Failover isn't in the setup. Your marketing page says "generate narration instantly." Your infrastructure says "please form an orderly queue."&lt;/p&gt;

&lt;p&gt;The gap between prototype and product sits in the infrastructure layer. The voice AI companies asking me for help want hosted Kubernetes because their engineering hours are going into pod management when they should be going into the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Single Pod Stops Working Around Four Concurrent Users
&lt;/h2&gt;

&lt;p&gt;A voice model like Qwen3-TTS loads into GPU memory once. Each inference holds that memory plus a working buffer. On an H100 you fit the model plus maybe four to eight concurrent generations before latency goes off a cliff. On a 4090, less.&lt;/p&gt;

&lt;p&gt;That number is the ceiling of your business on a single pod. You can buy a bigger GPU. You can't buy a second one attached to the same pod. The moment you need more than one machine, you're in distributed-systems territory whether you planned for it or not.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Breaks First
&lt;/h2&gt;

&lt;p&gt;Cold starts are the obvious one. A pod that dies takes ninety seconds to reload the model into VRAM, and during those ninety seconds your users hit 502s. Kubernetes with a warm pool absorbs it.&lt;/p&gt;

&lt;p&gt;Voice profile storage gets worse the moment you scale. On one pod a user's cloned voice sits on local disk. Spread that across ten pods and you need shared storage plus replication on every node that might serve that user. Miss one and the next request uses the wrong voice or errors out.&lt;/p&gt;

&lt;p&gt;Then there's the cost trap. You rent preemptible GPUs at a third the price, and one afternoon the cloud provider takes them back with two minutes' warning. A single pod goes dark. A K8s cluster with a warm replica serves the next request from a different node and nobody sees the eviction.&lt;/p&gt;

&lt;p&gt;Fine-tuning is the one that forces the decision. The moment you offer custom voice creation, you need training runs that don't block inference. That means another queue, another GPU pool, and priority rules that don't collide with live inference. A single pod can't multiplex that, and bolting it on later costs more than designing for it up front.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the K8s Layer Actually Buys You
&lt;/h2&gt;

&lt;p&gt;Keep model weights on the node, where they outlive any single pod. New pods scheduled to that node get a warm cache and start in under ten seconds instead of ninety.&lt;/p&gt;

&lt;p&gt;Not every request needs an H100. Real-time low-latency responses can run on a 4090 nodepool; premium batch generations go to H100. Nodepool labels and taints handle the routing without the application code caring.&lt;/p&gt;

&lt;p&gt;Pick queue depth as your autoscale signal. CPU metrics are useless here. GPU utilization also lies when the model is streaming. The number that maps to user-visible latency is requests waiting in the queue.&lt;/p&gt;

&lt;p&gt;Show the queue depth back to the caller. "You're number four, about forty seconds" keeps users on the line. A thirty-second timeout with no feedback teaches them your service is broken.&lt;/p&gt;

&lt;p&gt;None of this is visible in a Voicebox README.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hosted K8s Is the Service
&lt;/h2&gt;

&lt;p&gt;Voice AI companies keep asking for this because it's the gap between a model that works and a product that holds up under paying users. You can learn Kubernetes while trying to ship, but most founders can't afford both learning curves at once. Hiring a team is slow. Handing the layer off gets your engineering hours back on the model.&lt;/p&gt;

&lt;p&gt;If your voice AI product is past the demo and breaking under real traffic, I run the K8s layer so your team stays on the model. &lt;a href="https://renezander.com/#contact" rel="noopener noreferrer"&gt;Contact on the blog&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Model Is the Value. Your Pod Isn't.
&lt;/h2&gt;

&lt;p&gt;Are your engineering hours going into the model or into the pod that serves it? If the answer is the pod, you're paying to solve the wrong problem twice. Handle the infrastructure properly or hand it off. A half-built version while your competitor ships isn't a strategy.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://renezander.com/blog/voice-ai-production-kubernetes/" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Where is your engineering time actually going right now: into the model or into the pod that serves it?&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=voice-ai-in-production-from-runpod-to-hosted-kubernetes" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Ten CLAUDE.md rules for Claude Code - four edit-time, six runtime</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Thu, 23 Apr 2026 04:06:42 +0000</pubDate>
      <link>https://dev.to/reneza/ten-claudemd-rules-for-claude-code-four-edit-time-six-runtime-210g</link>
      <guid>https://dev.to/reneza/ten-claudemd-rules-for-claude-code-four-edit-time-six-runtime-210g</guid>
      <description>&lt;p&gt;Forrestchang's &lt;a href="https://github.com/forrestchang/andrej-karpathy-skills" rel="noopener noreferrer"&gt;andrej-karpathy-skills&lt;/a&gt; CLAUDE.md is four rules aimed at the moment Claude is &lt;strong&gt;writing code&lt;/strong&gt;. They work. What they don't cover is the moment Claude is &lt;strong&gt;running&lt;/strong&gt;. Once a Claude-driven pipeline goes to production, a different failure mode shows up: confident outputs, silent budget overruns, destructive side-effects, prompt injection via user input.&lt;/p&gt;

&lt;p&gt;These six extension rules are what I shipped into &lt;a href="https://github.com/renezander030/fixclaw" rel="noopener noreferrer"&gt;fixclaw&lt;/a&gt; — a Go pipeline engine where Claude drafts, classifies, and summarizes, but &lt;em&gt;never&lt;/em&gt; executes. Deterministic code does. The rules below are what made that claim stick.&lt;/p&gt;

&lt;p&gt;Merge with your own project rules. Tradeoff: these bias toward caution over autonomy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Forrestchang's four (edit-time) — unchanged
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Think Before Coding&lt;/strong&gt; — state assumptions, surface tradeoffs, ask when unclear.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity First&lt;/strong&gt; — minimum code, no speculative abstractions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surgical Changes&lt;/strong&gt; — touch only what the task requires.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal-Driven Execution&lt;/strong&gt; — define success criteria, loop until verified.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;(Full text: forrestchang/andrej-karpathy-skills/CLAUDE.md.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Six runtime rules — lessons from fixclaw
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5. Deterministic First
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Claude is for judgment calls. Plain code does everything else.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fetching, filtering, routing, persisting, dispatching — none of it is a language task. Don't ask the model to "decide if we should retry" when a status code already answers. Use the model for: classification, drafting, summarization, extraction from unstructured text. That's the whole list.&lt;/p&gt;

&lt;p&gt;The failure mode without this rule: the model makes a routing decision one week, a different routing decision the next, and you've reinvented flaky if-else at $0.003/token.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Declare Budgets, Halt On Breach
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;No silent overruns. Ever.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every AI step runs under a token budget: per-step, per-pipeline, per-day. Exceeding any of the three halts the pipeline immediately, logs the breach, and surfaces it to the operator. Budgets live in config, not in prompts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;per_step_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2048&lt;/span&gt;
  &lt;span class="na"&gt;per_pipeline_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;
  &lt;span class="na"&gt;per_day_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The failure mode without this rule: a runaway loop burns $40 overnight and you find out from the invoice.&lt;/p&gt;
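
&lt;p&gt;For illustration, a minimal Go sketch of the halt-on-breach check (names are mine, not fixclaw's actual enforcement code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package pipeline

import "fmt"

// Budgets mirrors the YAML above; exceeding any ceiling halts the run.
type Budgets struct {
    PerStep     int `yaml:"per_step_tokens"`
    PerPipeline int `yaml:"per_pipeline_tokens"`
    PerDay      int `yaml:"per_day_tokens"`
}

// Check runs after every AI step with the running totals.
func (b Budgets) Check(step, pipeline, day int) error {
    switch {
    case step &amp;gt; b.PerStep:
        return fmt.Errorf("halt: step used %d tokens, budget %d", step, b.PerStep)
    case pipeline &amp;gt; b.PerPipeline:
        return fmt.Errorf("halt: pipeline at %d tokens, budget %d", pipeline, b.PerPipeline)
    case day &amp;gt; b.PerDay:
        return fmt.Errorf("halt: daily total %d tokens, budget %d", day, b.PerDay)
    }
    return nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;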

&lt;h3&gt;
  
  
  7. Human-In-The-Loop Is A First-Class Step Type
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Label destructive actions. Require approval. No exceptions via flags.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anything touching the outside world — sending an email, updating a CRM, posting a message — is an &lt;code&gt;approval&lt;/code&gt; step, not an &lt;code&gt;ai&lt;/code&gt; step. The approval is routed to an operator channel (Slack, Telegram, whatever) with approve/edit/reject controls. The pipeline blocks until a decision is recorded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;approve-send&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;approval&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hitl&lt;/span&gt;
  &lt;span class="na"&gt;channel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;telegram&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The failure mode without this rule: a hallucinated follow-up email goes to a real customer.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Validate AI Output Against A Schema
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Unstructured strings don't belong in deterministic downstream code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every AI step declares an output schema. The runtime rejects anything that doesn't match — missing fields, wrong types, out-of-range numbers. Rejected outputs trigger a retry (under budget) or halt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;output_schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;object&lt;/span&gt;
  &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;score&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;boolean&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;string&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;maxLength&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;280&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;integer&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;minimum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;0&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;maximum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;100&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The failure mode without this rule: a boolean comes back as the string &lt;code&gt;"maybe"&lt;/code&gt; and a downstream &lt;code&gt;if&lt;/code&gt; branches the wrong way.&lt;/p&gt;
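
&lt;p&gt;A minimal Go sketch of a validator for the schema above, strict enough to catch exactly that case (illustrative, not fixclaw's actual validator):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package pipeline

import (
    "bytes"
    "encoding/json"
    "fmt"
)

// Result mirrors the output_schema. Pointer fields distinguish
// "missing" from zero values; DisallowUnknownFields rejects extras.
type Result struct {
    Match  *bool   `json:"match"`
    Reason *string `json:"reason"`
    Score  *int    `json:"score"`
}

func validate(raw []byte) (Result, error) {
    var r Result
    dec := json.NewDecoder(bytes.NewReader(raw))
    dec.DisallowUnknownFields()
    if err := dec.Decode(&amp;amp;r); err != nil {
        // match came back as the string "maybe": wrong type, rejected here
        return r, fmt.Errorf("schema reject: %w", err)
    }
    switch {
    case r.Match == nil || r.Reason == nil || r.Score == nil:
        return r, fmt.Errorf("schema reject: missing required field")
    case len(*r.Reason) &amp;gt; 280:
        return r, fmt.Errorf("schema reject: reason over 280 chars")
    case *r.Score &amp;lt; 0 || *r.Score &amp;gt; 100:
        return r, fmt.Errorf("schema reject: score out of range")
    }
    return r, nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;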

&lt;h3&gt;
  
  
  9. Sanitize Operator Input Before It Reaches A Prompt
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;User-supplied text is not trusted.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before any operator or external input enters a prompt, strip role markers (&lt;code&gt;system:&lt;/code&gt;, &lt;code&gt;assistant:&lt;/code&gt;, &lt;code&gt;&amp;lt;|im_start|&amp;gt;&lt;/code&gt; variants), enforce length limits, and normalize markdown so formatting can't break prompt boundaries. This is prompt-injection defense, not input validation — the goal is to stop an attacker from pivoting the model mid-run.&lt;/p&gt;
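
&lt;p&gt;A minimal Go sketch of such a sanitizer; the marker list and length cap are illustrative, not exhaustive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package pipeline

import (
    "regexp"
    "strings"
)

// roleMarkers catches attempts to fake a new conversational turn.
var roleMarkers = regexp.MustCompile(`(?im)^(system|assistant|user)\s*:|&amp;lt;\|im_(start|end)\|&amp;gt;`)

const maxInputLen = 4000 // hard cap before anything reaches a prompt

func sanitize(input string) string {
    s := roleMarkers.ReplaceAllString(input, "")
    s = strings.ReplaceAll(s, "```", "") // fences can break prompt boundaries
    if len(s) &amp;gt; maxInputLen {
        s = s[:maxInputLen]
    }
    return strings.TrimSpace(s)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;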

&lt;h3&gt;
  
  
  10. Log Rejections Silently
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Don't narrate to the attacker.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When input is rejected for sanitization or schema violations, log internally — never echo the rejection reason back to the source. A detailed error message is a free signal that tells the attacker which pattern to try next.&lt;/p&gt;
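
&lt;p&gt;The shape of the fix, sketched in Go under the assumption of an HTTP-style boundary (names illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package pipeline

import (
    "log"
    "net/http"
)

// reject logs the real reason internally and gives the source nothing.
func reject(w http.ResponseWriter, source, reason string) {
    log.Printf("rejected input from %s: %s", source, reason) // internal only
    http.Error(w, "request could not be processed", http.StatusBadRequest)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;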

&lt;h2&gt;
  
  
  The "working if" test
&lt;/h2&gt;

&lt;p&gt;The full ten rules are working if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Diffs are smaller and more targeted (rules 1–4).&lt;/li&gt;
&lt;li&gt;Pipeline runs have predictable token costs (rule 6).&lt;/li&gt;
&lt;li&gt;No AI output ever reaches a production side-effect without a human approval record (rule 7).&lt;/li&gt;
&lt;li&gt;Downstream code never branches on a malformed AI response (rule 8).&lt;/li&gt;
&lt;li&gt;Operator-channel logs show silent rejections rather than echoed errors (rules 9–10).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If even one of those is failing, the rule isn't enforced — it's aspirational.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published as a gist: &lt;a href="https://gist.github.com/renezander030/2898eb5f0100688f4197b5e493e156a2" rel="noopener noreferrer"&gt;https://gist.github.com/renezander030/2898eb5f0100688f4197b5e493e156a2&lt;/a&gt; — weekly gists on Claude Code, MCP, and automation at &lt;a href="https://github.com/renezander030" rel="noopener noreferrer"&gt;@renezander030&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=ten-claudemd-rules-for-claude-code-four-edit-time-six-runtime" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>95% of PII Redaction Doesn't Need an LLM. The Other 5% Is Where Your Masker Leaks.</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 21 Apr 2026 11:43:05 +0000</pubDate>
      <link>https://dev.to/reneza/95-of-pii-redaction-doesnt-need-an-llm-the-other-5-is-where-your-masker-leaks-13pp</link>
      <guid>https://dev.to/reneza/95-of-pii-redaction-doesnt-need-an-llm-the-other-5-is-where-your-masker-leaks-13pp</guid>
      <description>&lt;p&gt;A VP at an SAP shop told me recently: "Every time we copy production to our lower environments, PII leaks. And no, we're not throwing an LLM at it. That's a thousand times the compute of what we already run."&lt;/p&gt;

&lt;p&gt;He's right.&lt;/p&gt;

&lt;p&gt;Most of the PII redaction problem in enterprise data isn't a neural network problem. It's a lookup table problem. And the incumbents already solve it. SAP TDMS, Delphix, Informatica, IBM InfoSphere Optim. All schema-aware. All row-level. All deterministic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 95% Where Deterministic Wins
&lt;/h2&gt;

&lt;p&gt;In an SAP production database, the schema tells you almost everything. &lt;code&gt;KNA1-NAME1&lt;/code&gt; is a customer name. &lt;code&gt;BSEG-IBAN&lt;/code&gt; is a bank account. &lt;code&gt;USR02-BNAME&lt;/code&gt; is a user ID. A YAML rule says: "for this column type, replace with this pattern." Done.&lt;/p&gt;

&lt;p&gt;The math is brutal. A regex plus a lookup table costs microseconds per row. A 1.5B-parameter model costs 10 to 50 milliseconds per row, even on a GPU. That's three to five orders of magnitude. A nightly batch copy that finishes by morning with TDMS would take weeks with an LLM in the loop.&lt;/p&gt;

&lt;p&gt;Compute isn't even the main argument.&lt;/p&gt;

&lt;p&gt;Referential integrity is. "Anna Müller" has to become "Person_47" consistently across 200 tables. &lt;code&gt;KNA1&lt;/code&gt;, &lt;code&gt;VBAK&lt;/code&gt;, &lt;code&gt;VBKD&lt;/code&gt;, &lt;code&gt;BSEG&lt;/code&gt;, wherever the customer ID travels. Deterministic pseudonymization with an HMAC and a scoped salt gives you that for free. Neural outputs drift.&lt;/p&gt;
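
&lt;p&gt;A minimal Go sketch of that scheme; the pseudonym-table layout and counter naming are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "fmt"
)

// pseudonym returns a stable label: same value plus same salt gives the
// same output, in every table the value appears in.
func pseudonym(value, tenantSalt string, table map[string]string) string {
    mac := hmac.New(sha256.New, []byte(tenantSalt))
    mac.Write([]byte(value))
    key := fmt.Sprintf("%x", mac.Sum(nil))
    if p, ok := table[key]; ok {
        return p
    }
    p := fmt.Sprintf("Person_%d", len(table)+1)
    table[key] = p
    return p
}

func main() {
    table := map[string]string{}
    fmt.Println(pseudonym("Anna Müller", "scoped-salt", table)) // Person_1
    fmt.Println(pseudonym("Anna Müller", "scoped-salt", table)) // Person_1, again
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;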

&lt;p&gt;Auditability is. A regulator asks: "show me the rule that masked this column." A YAML rule is defensible. A model output is not.&lt;/p&gt;

&lt;p&gt;So for any SAP field with a known schema type, deterministic masking wins. Full stop. Don't let anyone sell you a neural-network-powered "modernization" of that layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where a Fine-Tuned Model Earns Its Compute
&lt;/h2&gt;

&lt;p&gt;Here's what TDMS, Delphix, and their peers silently miss.&lt;/p&gt;

&lt;p&gt;Free-text columns. &lt;code&gt;BSEG-SGTXT&lt;/code&gt;, the long-text field where someone typed "Ansprechpartner Anna Müller, Tel +49-170-...". Ticket descriptions from ServiceNow mirrored into dev. Email bodies stored as CLOBs. ADRC annotations. The column type is "text." The content is gold-mine PII.&lt;/p&gt;

&lt;p&gt;Unstructured attachments. PDFs, scanned invoices, OCR'd contracts pulled into dev via ArchiveLink. Names and IBANs mid-prose, not in a column.&lt;/p&gt;

&lt;p&gt;Schema drift. Consultants add Z-tables. The data steward hasn't classified them yet. Deterministic tools don't know the column holds PII. They pass the data through untouched.&lt;/p&gt;

&lt;p&gt;On these, rule-based tools do one of two things. They wipe the whole column, destroying test fidelity, so the dev team can't debug against realistic data. Or they miss the PII entirely, and you get a compliance incident.&lt;/p&gt;

&lt;p&gt;A German-specialized redactor earns its keep here because the alternative isn't "faster regex." It's "no coverage at all."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hybrid Architecture
&lt;/h2&gt;

&lt;p&gt;This is the part that actually ships.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A classifier pass on the SAP copy. Cheap heuristics (column-name keywords, column type, sample-value regex) flag each column as &lt;code&gt;structured_pii&lt;/code&gt;, &lt;code&gt;free_text&lt;/code&gt;, or &lt;code&gt;safe&lt;/code&gt; (sketched after this list).&lt;/li&gt;
&lt;li&gt;Deterministic masker handles &lt;code&gt;structured_pii&lt;/code&gt;. TDMS or whatever you already run.&lt;/li&gt;
&lt;li&gt;Fine-tuned LLM redactor runs &lt;em&gt;only&lt;/em&gt; on &lt;code&gt;free_text&lt;/code&gt;, attachments, and unclassified Z-columns.&lt;/li&gt;
&lt;li&gt;A consistency bridge. Both paths share a pseudonym table keyed by &lt;code&gt;HMAC(value, tenant_salt)&lt;/code&gt;. "Anna Müller" becomes "Person_47" whether she was caught by regex or by the model.&lt;/li&gt;
&lt;/ol&gt;
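
&lt;p&gt;For illustration, a minimal Go sketch of the step-1 classifier pass, with deliberately tiny keyword and pattern lists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package masking

import (
    "regexp"
    "strings"
)

// A real ruleset is far larger; the heuristics here are placeholders.
var piiColumn = regexp.MustCompile(`(?i)(name|iban|email|phone|bname)`)
var ibanShape = regexp.MustCompile(`^[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}$`)

// classifyColumn is the cheap pass: column name, type, sample values.
func classifyColumn(name, colType string, samples []string) string {
    if piiColumn.MatchString(name) {
        return "structured_pii" // handled by the deterministic masker
    }
    for _, s := range samples {
        if ibanShape.MatchString(strings.TrimSpace(s)) {
            return "structured_pii"
        }
    }
    if colType == "text" || colType == "clob" {
        return "free_text" // routed to the LLM redactor
    }
    return "safe"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;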

&lt;p&gt;Compute budget: the LLM runs on maybe 1 to 5 percent of the cells. Total cost is still dominated by the deterministic layer. You're not replacing TDMS. You're covering its blind spots.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Won't Claim
&lt;/h2&gt;

&lt;p&gt;Three things I won't sell you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The LLM is cheaper than a regex. It isn't. Ever.&lt;/li&gt;
&lt;li&gt;It replaces your incumbent masking vendor. It doesn't.&lt;/li&gt;
&lt;li&gt;A benchmark against TDMS on structured columns is meaningful. You lose that benchmark. Benchmark on free-text and attachments, where deterministic tools score near zero.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The honest pitch to the VP was this. "You're right. For the 95% structured case, keep TDMS. The model is the long-tail layer. It runs only over the free-text fields and attachments your current tools silently leak. Small job. Different problem."&lt;/p&gt;

&lt;p&gt;That's the conversation that lands. Not "replace your stack." Not "AI-powered everything."&lt;/p&gt;

&lt;p&gt;Regex for the schema. LLM for the shadows.&lt;/p&gt;

&lt;p&gt;I reserve my audits for teams ready to take action on the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cal.eu/reneza/30min" rel="noopener noreferrer"&gt;Book a 30-min call →&lt;/a&gt;&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=95-of-pii-redaction-doesnt-need-an-llm-the-other-5-is-where-your-masker-leaks" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gdpr</category>
      <category>dsvgo</category>
      <category>pii</category>
    </item>
    <item>
      <title>What llama.cpp's Pace Tells You About On-Prem LLM Readiness</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 14 Apr 2026 08:42:06 +0000</pubDate>
      <link>https://dev.to/reneza/what-llamacpps-pace-tells-you-about-on-prem-llm-readiness-eh1</link>
      <guid>https://dev.to/reneza/what-llamacpps-pace-tells-you-about-on-prem-llm-readiness-eh1</guid>
      <description>&lt;p&gt;Your team asked for GPU budget for self-hosted inference. You said "not yet" because last time you checked, the tooling wasn't production-grade. That was true 18 months ago. It's not true now, and the delay is costing you leverage you don't know you're losing.&lt;/p&gt;

&lt;p&gt;I'm writing this because most decision-makers I talk to are still running on an outdated mental model of what self-hosted LLM infrastructure looks like. The software moved. The org didn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Team That Celebrated Too Early
&lt;/h2&gt;

&lt;p&gt;I watched a team spin up on-prem inference, celebrate for a week, then watch it rot because nobody owned it. Six months later they were back on the API, having spent the budget anyway.&lt;/p&gt;

&lt;p&gt;This is the failure mode nobody talks about. The software works. It's been working for a while now. The problem is everything around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nobody owns the stack.&lt;/strong&gt; Running self-hosted inference in production means someone on your team owns model updates, hardware failures, quantization tradeoffs, and latency tuning. That's a different job than calling an API. If you don't staff it, the deployment decays.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Procurement kills momentum.&lt;/strong&gt; GPU capacity is a capital expenditure conversation, not a software download. If you don't already have data center access or cloud-GPU contracts, the blocker isn't the code. It's a procurement cycle that takes months. By the time the hardware arrives, the team that asked for it has moved on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model selection is real work.&lt;/strong&gt; The quantized model that runs great for summarization falls apart on code generation. There is no default. Every use case needs evaluation, and evaluation takes time nobody budgets for.&lt;/p&gt;

&lt;p&gt;These are solvable problems. But teams that skip them end up with on-prem deployments that nobody trusts, and leadership that says "see, I told you it wasn't ready" when the real issue was organizational, not technical.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed While You Were Waiting
&lt;/h2&gt;

&lt;p&gt;A year ago, I would have told you to hold off. Not anymore.&lt;/p&gt;

&lt;p&gt;You can now split inference across multiple GPUs without patching anything yourself. The server mode handles concurrent requests behind a load balancer. 1-bit quantization means models that needed high-end hardware run on modest configs without catastrophic quality loss.&lt;/p&gt;

&lt;p&gt;Multi-modal support landed. Speculative decoding shipped, cutting latency on long outputs. The API compatibility layer means your existing code that talks to cloud providers works against a self-hosted endpoint with a URL change.&lt;/p&gt;
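
&lt;p&gt;To make the URL change concrete: a minimal Go sketch against llama.cpp's OpenAI-compatible server, assuming the default localhost:8080 endpoint and an illustrative model name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

// Same request your cloud-provider client sends; only the base URL moved.
const endpoint = "http://localhost:8080/v1/chat/completions"

func main() {
    body := []byte(`{"model":"local","messages":[{"role":"user","content":"ping"}]}`)
    resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    out, _ := io.ReadAll(resp.Body)
    fmt.Println(string(out))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;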

&lt;p&gt;I deployed a quantized model on a client's on-prem GPU last month. Set up the server, pointed the app at it, ran inference. It worked. First try. That sentence would have been fiction two years ago.&lt;/p&gt;

&lt;p&gt;The gap between "experimental" and "production-ready" closed while most orgs were waiting for someone else to go first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Decision You're Actually Making
&lt;/h2&gt;

&lt;p&gt;This isn't a permanent binary. It's a portfolio allocation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Move workloads on-prem when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your inference volume is high enough that API costs became a material line item.&lt;/li&gt;
&lt;li&gt;You need predictable latency without network variability.&lt;/li&gt;
&lt;li&gt;Compliance or data residency requirements mandate it. But verify this. Many teams assume they need on-prem when they don't.&lt;/li&gt;
&lt;li&gt;You have an engineer who wants to own the stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stay on the API when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're prototyping or usage is unpredictable.&lt;/li&gt;
&lt;li&gt;You need frontier models not available as open weights.&lt;/li&gt;
&lt;li&gt;Nobody on your team can own the ops burden.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mistake I see most often: treating this as all-or-nothing. Start with API. Move specific workloads to self-hosted when economics or data constraints force the conversation. The infrastructure to do it properly exists now. It didn't two years ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question for Your Next Planning Cycle
&lt;/h2&gt;

&lt;p&gt;The software is ready. The open-weight models are good enough for most production use cases. The tooling matured past the point where "not ready yet" is a defensible position.&lt;/p&gt;

&lt;p&gt;The real question isn't whether the technology works. It's whether your org is set up to operate it. That's a staffing decision and a procurement decision, not a technology bet.&lt;/p&gt;

&lt;p&gt;If you're still saying "not yet," make sure you're saying it because of an actual blocker, not because of a mental model that expired a year ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related reading:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://renezander.com/guides/self-hosted-llm-vs-api/" rel="noopener noreferrer"&gt;Self-Hosted LLM vs API: when the math actually works&lt;/a&gt; — the decision framework I use with clients (&lt;a href="https://renezander.com/de/guides/self-hosted-llm-vs-api/" rel="noopener noreferrer"&gt;deutsche Version&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://renezander.com/guides/llm-api-comparison/" rel="noopener noreferrer"&gt;LLM API comparison 2026&lt;/a&gt; — Claude vs GPT vs Gemini vs Mistral vs DeepSeek for production (&lt;a href="https://renezander.com/de/guides/llm-api-comparison/" rel="noopener noreferrer"&gt;deutsche Version&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I help teams navigate this decision. If your org is evaluating self-hosted inference and you want an honest assessment of readiness, reach out.&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=what-llamacpps-pace-tells-you-about-on-prem-llm-readiness" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Your AI Content Tool Knows Your Strategy. Do You Know Where It Goes?</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 07 Apr 2026 06:59:24 +0000</pubDate>
      <link>https://dev.to/reneza/your-ai-content-tool-knows-your-strategy-do-you-know-where-it-goes-23fg</link>
      <guid>https://dev.to/reneza/your-ai-content-tool-knows-your-strategy-do-you-know-where-it-goes-23fg</guid>
      <description>&lt;p&gt;Your team is using AI for content. Everybody is. LinkedIn posts, blog drafts, internal comms, maybe some customer-facing copy too.&lt;/p&gt;

&lt;p&gt;And it works. The output is decent, the speed is real, nobody wants to go back to writing everything from scratch.&lt;/p&gt;

&lt;p&gt;But have you thought about what you are actually pasting into these tools?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prompt Is the Product
&lt;/h2&gt;

&lt;p&gt;Every time someone on your team writes a prompt, they are feeding context into a system they do not control. Brand voice guidelines. Competitive positioning notes. Messaging frameworks. That internal strategy deck someone summarized into a prompt last Tuesday.&lt;/p&gt;

&lt;p&gt;This is not hypothetical. This is what good prompts look like. The more context you give, the better the output. So people give more context. They paste in the brief. They paste in the competitor analysis. They paste in the draft that legal has not approved yet.&lt;/p&gt;

&lt;p&gt;The tool gets better because your data is better. And your data is sitting on someone else's infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trust Model Is the Problem
&lt;/h2&gt;

&lt;p&gt;Most AI content tools handle your data the same way: they promise not to train on it. That is the entire security model. A policy page. Maybe an enterprise agreement with a data processing addendum.&lt;/p&gt;

&lt;p&gt;Your data still gets processed on shared infrastructure. It still passes through systems you cannot inspect. You are trusting that the vendor's internal controls work perfectly, that no employee has access they should not have, and that every subprocessor in the chain follows the same rules.&lt;/p&gt;

&lt;p&gt;For most companies, this never becomes a visible problem. The data does not leak in a way anyone notices. The risk stays theoretical.&lt;/p&gt;

&lt;p&gt;Until it does not.&lt;/p&gt;

&lt;p&gt;A client asks where their data goes during your AI-assisted content process. Legal needs to document compliance for an audit. A competitor publishes something that looks suspiciously familiar. A new regulation drops that requires you to prove where personal data was processed, not just promise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technology Already Exists
&lt;/h2&gt;

&lt;p&gt;Here is what most people in the content space do not realize: the technology to solve this is not theoretical. It is production-ready. It has been running in cloud infrastructure for years. It just has not reached the content tooling layer yet.&lt;/p&gt;

&lt;p&gt;Three capabilities change the game:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Client-side encryption.&lt;/strong&gt; Your data gets encrypted before it leaves your browser. The server never sees plaintext. It processes encrypted inputs and returns encrypted outputs. The key stays with you. Not with the vendor. Not in their key management system. With you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confidential computing.&lt;/strong&gt; Instead of shared servers where your workload runs alongside everyone else's, your data gets processed in an isolated hardware enclave. The cloud provider cannot see inside it. The vendor cannot see inside it. The operating system cannot see inside it. Your data exists in cleartext only inside a hardware boundary that nobody else can access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attestation.&lt;/strong&gt; Cryptographic proof of what code is running in that enclave. Not a vendor's word that they are running the right version. A hardware-signed certificate that you can independently verify. You know exactly what software touched your data because the hardware tells you, not the vendor.&lt;/p&gt;
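
&lt;p&gt;For the technically curious, a minimal Go sketch of the client-side half only; in practice this runs in the browser, and the enclave and attestation pieces are hardware features a snippet cannot show:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "fmt"
)

// encrypt runs where the key lives: on the client. The server only
// ever receives the sealed bytes this function returns.
func encrypt(key, plaintext []byte) ([]byte, error) {
    block, err := aes.NewCipher(key)
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    nonce := make([]byte, gcm.NonceSize())
    if _, err := rand.Read(nonce); err != nil {
        return nil, err
    }
    return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

func main() {
    key := make([]byte, 32) // held client-side, never sent anywhere
    rand.Read(key)
    sealed, _ := encrypt(key, []byte("the strategy deck summary"))
    fmt.Printf("server sees only: %x\n", sealed)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;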

&lt;p&gt;These are not research papers. AWS Nitro Enclaves, Azure Confidential VMs, and GCP Confidential Computing have been generally available for years. The infrastructure is there. The content tools just have not caught up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Now
&lt;/h2&gt;

&lt;p&gt;Two things are converging.&lt;/p&gt;

&lt;p&gt;First, AI adoption in content workflows is no longer experimental. Teams are building real pipelines. They are feeding in real business data, not just test prompts. The volume and sensitivity of data flowing through AI tools is growing every quarter.&lt;/p&gt;

&lt;p&gt;Second, regulation is catching up. GDPR already requires you to document where personal data is processed. The EU AI Act adds requirements around transparency and risk management for AI systems. Industry-specific regulations in finance, healthcare, and legal services are getting more specific about AI data handling. "We have a DPA" is becoming insufficient.&lt;/p&gt;

&lt;p&gt;The companies that figure out verifiable AI data handling now will not be scrambling when their clients, their board, or their regulator asks how their AI content pipeline handles sensitive data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Ask Your Vendors
&lt;/h2&gt;

&lt;p&gt;You do not need to become a cryptography expert. But you should be asking three questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where does my data exist in plaintext?&lt;/strong&gt; If the answer is "on our servers," you are in the trust model. If the answer is "only inside a hardware enclave that we cannot access," you are in the proof model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I verify what code processes my data?&lt;/strong&gt; If the answer requires trusting the vendor's word, that is trust. If the answer involves a hardware attestation you can independently check, that is proof.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who holds the encryption keys?&lt;/strong&gt; If the vendor holds them, they can decrypt your data whenever they want, regardless of what the policy says. If you hold them, the vendor literally cannot access your plaintext data even if they tried.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shift from Trust to Proof
&lt;/h2&gt;

&lt;p&gt;The content industry is going to go through the same transition that payments, healthcare, and financial services already went through. The question will shift from "do you promise to protect our data?" to "can you prove it?"&lt;/p&gt;

&lt;p&gt;Right now, almost nobody in the AI content space is building with these guarantees. That gap will not last.&lt;/p&gt;

&lt;p&gt;I am building &lt;a href="https://teedian.com" rel="noopener noreferrer"&gt;Teedian&lt;/a&gt;, an AI content tool that uses exactly this architecture. Client-side encryption, confidential computing, attestation. Not as a roadmap item, but as the foundation.&lt;/p&gt;

&lt;p&gt;If you work in a regulated industry, or you handle client data in your content workflows, or you want to understand what cryptographic privacy looks like in practice, I put together a &lt;a href="https://teedian.com/#brief" rel="noopener noreferrer"&gt;short brief on teedian.com&lt;/a&gt; that walks through the architecture. Plain language, no jargon, 3 pages.&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=your-ai-content-tool-knows-your-strategy-do-you-know-where-it-goes" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>programming</category>
      <category>privacy</category>
    </item>
    <item>
      <title>Spend Your Human Thinking Tokens Where They Compound</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 31 Mar 2026 08:32:22 +0000</pubDate>
      <link>https://dev.to/reneza/spend-your-human-thinking-tokens-where-they-compound-pf1</link>
      <guid>https://dev.to/reneza/spend-your-human-thinking-tokens-where-they-compound-pf1</guid>
      <description>&lt;p&gt;More automations running. More agents deployed. More pipelines humming in the background.&lt;/p&gt;

&lt;p&gt;I run about a dozen automated jobs. Daily briefings, proposal generation, content pipelines, data syncing, monitoring alerts. They handle a lot.&lt;/p&gt;

&lt;p&gt;But the biggest improvement to my workflow this year wasn't adding more automation. It was getting honest about where my thinking actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  You Have a Token Budget Too
&lt;/h2&gt;

&lt;p&gt;LLMs have context windows. Feed in too much noise and the signal degrades. The output gets worse even though you gave it more to work with.&lt;/p&gt;

&lt;p&gt;Human attention works the same way. I have maybe 4 good hours of focused thinking per day. When I spend those hours reviewing cron output or formatting documents or triaging alerts that resolve themselves, I'm burning tokens on low-value work.&lt;/p&gt;

&lt;p&gt;The quality of my actual decisions goes down. Not because the decisions got harder, but because I already used up my thinking budget on stuff that didn't need me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I Stopped Spending
&lt;/h2&gt;

&lt;p&gt;I used to review my morning briefing line by line. Check every data point, verify every summary. Then I realized: if the briefing is wrong, I'll notice when the information doesn't match reality later that day. The cost of a slightly wrong briefing at 6:30 is near zero. The cost of spending 20 minutes checking it every morning is real.&lt;/p&gt;

&lt;p&gt;Same with monitoring. I had alerts for everything. Cache refreshes, API response times, sync completions. Most of them were informational, not actionable. I stripped it down to alerts that require a decision: something broke, something is about to expire, something needs my approval before it touches an external system.&lt;/p&gt;

&lt;p&gt;Data syncing runs on a schedule. If it fails, I get one alert. I don't watch it run. I don't check the logs unless the alert fires.&lt;/p&gt;

&lt;p&gt;First drafts of anything. Cover letters, content outlines, research summaries. The AI produces a version. Sometimes it's good enough. Sometimes I rewrite half of it. But I never start from a blank page anymore, and that alone saves the hardest type of thinking: getting started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I Still Spend Every Token
&lt;/h2&gt;

&lt;p&gt;Scoping client work. An AI can research a company, summarize a job posting, draft a proposal. But deciding whether the project is actually worth pursuing? Whether the client's problem is what they say it is? That's pattern recognition built from years of seeing projects go sideways. No automation for that.&lt;/p&gt;

&lt;p&gt;Choosing what to build next. I have a backlog of 50 things I could automate, improve, or ship. The AI can't tell me which one moves the needle this week. That decision depends on context it doesn't have: what conversations I had yesterday, what I'm optimizing for this month, what feels right.&lt;/p&gt;

&lt;p&gt;Anything with my name on it that reaches another person. Proposals get edited. Posts get rewritten. Client messages get reviewed word by word. The AI drafts. I decide what actually represents me.&lt;/p&gt;

&lt;p&gt;System design decisions. Where to draw the boundary between automatic and manual. What gets a human checkpoint and what runs unsupervised. These are the highest-leverage decisions in any AI system, and they're entirely human.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Ratio
&lt;/h2&gt;

&lt;p&gt;Maybe 20% of my working hours involve focused, high-stakes thinking. The rest is execution, coordination, and maintenance.&lt;/p&gt;

&lt;p&gt;Before I built these systems, that ratio was reversed. 80% thinking, 20% execution, and half the thinking was on tasks that didn't deserve it.&lt;/p&gt;

&lt;p&gt;The goal was never "automate everything." It was "protect the 20% that matters and make sure I'm not exhausted when I get there."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shift
&lt;/h2&gt;

&lt;p&gt;This isn't about working less. I work the same hours. But the distribution changed.&lt;/p&gt;

&lt;p&gt;I spend less time on decisions that don't compound. I spend more time on the ones that do. Client relationships, system architecture, strategic bets. The stuff where being sharp at 10 in the morning instead of burned out from triaging alerts actually changes the outcome.&lt;/p&gt;

&lt;p&gt;The question isn't how much your AI can do. It's whether you're spending your own thinking tokens on the right things.&lt;/p&gt;

&lt;p&gt;Where are you still spending attention that you probably shouldn't?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I help teams figure out where AI should run unsupervised and where humans still need to be in the loop. If that's a question your team is working through, let's talk: &lt;a href="https://cal.eu/reneza" rel="noopener noreferrer"&gt;cal.eu/reneza&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=spend-your-human-thinking-tokens-where-they-compound" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>AI Skills Are the New Boilerplate. They Solve Almost Nothing.</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 24 Mar 2026 11:13:17 +0000</pubDate>
      <link>https://dev.to/reneza/ai-skills-are-the-new-boilerplate-they-solve-almost-nothing-4obo</link>
      <guid>https://dev.to/reneza/ai-skills-are-the-new-boilerplate-they-solve-almost-nothing-4obo</guid>
      <description>&lt;p&gt;Everyone's sharing their skill libraries right now. "Here are my 20 custom slash commands." "Check out my prompt template collection." "This skill saves me 2 hours a day."&lt;/p&gt;

&lt;p&gt;I use skills too. I have about a dozen. They handle cover letters, content pipelines, code review, commit messages. Repeatable workflows where the input and output are predictable.&lt;/p&gt;

&lt;p&gt;They cover maybe 10% of what my AI system actually does.&lt;/p&gt;

&lt;p&gt;The other 90% is the part nobody shares on social media because it's ugly. It's API integrations that break when headers change. It's state management between sessions. It's error handling for when the third-party service returns garbage. It's monitoring that pages you at 6 AM because a cron failed. It's human-in-the-loop workflows where the AI proposes and you approve before anything touches production.&lt;/p&gt;

&lt;p&gt;Skills can't solve this. Every client, every codebase, every problem has different infrastructure underneath. A skill is a template. The work is everything the template doesn't cover.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Skills Actually Are
&lt;/h2&gt;

&lt;p&gt;A skill is a saved prompt with some structure. Input goes in, the agent follows instructions, output comes out. It works when the task is the same shape every time.&lt;/p&gt;

&lt;p&gt;"Generate a cover letter from this job posting." Same structure, different content. Perfect skill.&lt;/p&gt;

&lt;p&gt;"Debug why the webhook stopped firing after the API provider changed their auth flow." No skill for that. Every instance is different. The agent needs to read logs, trace requests, understand the specific integration, and propose a fix that accounts for your deployment setup. That's infrastructure knowledge, not a prompt template.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 90% Nobody Demos
&lt;/h2&gt;

&lt;p&gt;Here's what actually keeps my system running day to day.&lt;/p&gt;

&lt;p&gt;A server process that syncs data from multiple APIs, caches it locally, and exposes it to agents through a unified interface. When an API changes its response format, I fix the parser. No skill for that.&lt;/p&gt;

&lt;p&gt;Scheduled jobs that run without any agent session. They pull data, generate reports, send notifications, and alert me when something fails. The agent isn't even involved. It's just cron, a script, and an alert channel.&lt;/p&gt;

&lt;p&gt;Approval workflows where the AI researches options, presents them with rationale, and waits for a human decision before executing. The approval mechanism is buttons in a chat app. The execution layer calls APIs to star repositories, follow users, post comments. The plumbing between "AI suggested it" and "it actually happened" is custom for every use case.&lt;/p&gt;

&lt;p&gt;State that persists between sessions. Not agent memory. Infrastructure state. Cache files with TTLs. Vector indexes that get rebuilt nightly. Configuration that lives in flat files because a database would be overkill.&lt;/p&gt;

&lt;p&gt;None of this fits in a skill. It's bespoke infrastructure that exists because the specific problem required it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The skills hype creates a misleading impression of what production AI work looks like. Someone sees a collection of 30 slash commands and thinks: that's the system. It's not. It's the tip.&lt;/p&gt;

&lt;p&gt;The system is the integration layer. The error handling. The monitoring. The state management. The human-in-the-loop controls. The deployment. The part where you wake up and the thing is still running, handling edge cases the skill never anticipated.&lt;/p&gt;

&lt;p&gt;If you're evaluating someone's AI engineering capability, don't ask how many skills they have. Ask what happens when the skill fails. Ask what runs when nobody's in a session. Ask how state persists between interactions. That's where the actual engineering lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Ratio
&lt;/h2&gt;

&lt;p&gt;I spend maybe 5% of my time writing new skills. I spend the rest building and maintaining the infrastructure that makes skills useful in the first place.&lt;/p&gt;

&lt;p&gt;A skill that generates a cover letter is worthless without the task management system that tracks proposals, the message log that maintains conversation history, and the pipeline that routes everything to the right place.&lt;/p&gt;

&lt;p&gt;A skill that creates a content draft is worthless without the publishing pipeline, the banner generation, the cross-platform distribution, and the editorial calendar that decides what to write next.&lt;/p&gt;

&lt;p&gt;The skill is the last mile. The infrastructure is the entire road.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question
&lt;/h2&gt;

&lt;p&gt;Next time you see someone demo their skill collection, ask yourself: what's underneath? What happens between sessions? What runs at 4 AM? What breaks, and who gets paged?&lt;/p&gt;

&lt;p&gt;That's the 90%. That's the actual work.&lt;/p&gt;




&lt;p&gt;I build production AI infrastructure, not prompt collections. If your team needs the 90% that skills don't cover, let's talk: &lt;a href="https://cal.eu/reneza" rel="noopener noreferrer"&gt;cal.eu/reneza&lt;/a&gt;&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=ai-skills-are-the-new-boilerplate-they-solve-almost-nothing" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
      <category>devops</category>
    </item>
    <item>
      <title>How I Built a Business Email Agent with Compliance Controls in Go</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Sat, 21 Mar 2026 12:49:31 +0000</pubDate>
      <link>https://dev.to/reneza/how-i-built-a-business-email-agent-with-compliance-controls-in-go-1gci</link>
      <guid>https://dev.to/reneza/how-i-built-a-business-email-agent-with-compliance-controls-in-go-1gci</guid>
      <description>&lt;p&gt;Every few weeks another AI agent product launches that can "handle your email." Dispatch, OpenClaw, and a dozen others promise to read, summarize, and reply on your behalf.&lt;/p&gt;

&lt;p&gt;They work fine for personal use. But the moment you try to use them for business operations, three problems show up:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No spending controls.&lt;/strong&gt; The agent calls an LLM as many times as it wants. You find out what it cost at the end of the month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No approval flow.&lt;/strong&gt; It either sends emails autonomously or it doesn't. There's no "show me the draft, let me approve it" step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No audit trail.&lt;/strong&gt; If a client asks "why did your system send me this?", you have no answer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I needed an email agent for my consulting business that could triage inbound mail, draft replies, and digest threads. But I also needed to explain every action it took to a client if asked. So I built one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The core constraint: AI never executes
&lt;/h2&gt;

&lt;p&gt;The first decision was architectural. In most agent frameworks, the LLM decides what to do &lt;em&gt;and&lt;/em&gt; does it. Tool calling, function execution, chain-of-thought action loops.&lt;/p&gt;

&lt;p&gt;I went the other way. &lt;strong&gt;The LLM classifies and drafts. Deterministic code executes.&lt;/strong&gt; Every side effect (sending an email, posting a notification, writing a log) is plain Go code with explicit control flow. The LLM is a function call that takes text in and returns structured JSON out. Nothing more.&lt;/p&gt;

&lt;p&gt;This matters because it makes the system auditable. When something goes wrong, you read the pipeline definition, not a chat transcript.&lt;/p&gt;
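
&lt;p&gt;In code, that constraint is a function signature, not a framework. A minimal Go sketch (client and type names are illustrative, not fixclaw's actual API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package agent

import (
    "context"
    "encoding/json"
    "fmt"
)

// LLMClient is the entire AI surface: text in, text out. No tools.
type LLMClient interface {
    Complete(ctx context.Context, prompt string) (string, error)
}

type Classification struct {
    Priority string `json:"priority"` // HIGH | MED | LOW
    Summary  string `json:"summary"`
}

// classify never executes anything; it returns a typed result or an error.
func classify(ctx context.Context, llm LLMClient, email string) (Classification, error) {
    raw, err := llm.Complete(ctx, "Classify this email:\n"+email)
    if err != nil {
        return Classification{}, err
    }
    var c Classification
    if err := json.Unmarshal([]byte(raw), &amp;amp;c); err != nil {
        return Classification{}, fmt.Errorf("schema reject: %w", err)
    }
    return c, nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;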

&lt;h2&gt;
  
  
  Pipelines, not prompts
&lt;/h2&gt;

&lt;p&gt;Every automation is a YAML pipeline. Here's a simplified email digest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;email-digest&lt;/span&gt;
    &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30m&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fetch-unread&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deterministic&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gmail_unread&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;summarize&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ai&lt;/span&gt;
        &lt;span class="na"&gt;skill&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;email-digest&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;report&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deterministic&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notify&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three step types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;deterministic&lt;/strong&gt;: Plain code. Fetch emails, send notifications, write logs. No tokens, instant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ai&lt;/strong&gt;: Calls an LLM with a skill template. Returns structured output. Token-budgeted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;approval&lt;/strong&gt;: Pauses the pipeline and asks a human to approve, skip, or edit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: most of what an "email agent" does is not AI. It's polling an API, parsing MIME, filtering by label, formatting output. The AI part is a small step in the middle that classifies or summarizes. Treating it as a pipeline step instead of the orchestrator keeps costs down and behavior predictable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token budgets as circuit breakers
&lt;/h2&gt;

&lt;p&gt;Every AI step runs inside three budget boundaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;per_step_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2048&lt;/span&gt;
  &lt;span class="na"&gt;per_pipeline_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;
  &lt;span class="na"&gt;per_day_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a step exceeds its token limit, it fails. If a pipeline exceeds its limit, it stops. If the daily limit is hit, the engine pauses all pipelines until midnight.&lt;/p&gt;

&lt;p&gt;This is a circuit breaker, not a suggestion. I've seen agents get stuck in retry loops that burn through $50 in tokens before anyone notices. A hard ceiling at the engine level prevents that by design.&lt;/p&gt;
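
&lt;p&gt;A minimal Go sketch of the daily-ceiling behavior (illustrative, not the engine's actual code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package engine

import (
    "log"
    "time"
)

// pauseUntilMidnight is the engine's response to a spent daily budget.
// No retries, no overrides: everything sleeps until the day rolls over.
func pauseUntilMidnight() {
    now := time.Now()
    midnight := time.Date(now.Year(), now.Month(), now.Day()+1, 0, 0, 0, 0, now.Location())
    log.Printf("daily token ceiling hit; pausing pipelines until %s", midnight)
    time.Sleep(time.Until(midnight))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;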

&lt;h2&gt;
  
  
  Human-in-the-loop on every outbound action
&lt;/h2&gt;

&lt;p&gt;The email connector has three permission levels: &lt;code&gt;read&lt;/code&gt;, &lt;code&gt;draft&lt;/code&gt;, and &lt;code&gt;send&lt;/code&gt;. Even at the &lt;code&gt;send&lt;/code&gt; level, outbound emails go through an approval step first.&lt;/p&gt;

&lt;p&gt;The approval flow works through a messaging bot. The operator gets the draft with three options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Approve&lt;/strong&gt;: Send immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip&lt;/strong&gt;: Cancel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjust&lt;/strong&gt;: Edit the text, then approve.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's a 4-hour timeout. If no one approves, the action is logged and dropped. This is non-negotiable for business use. An autonomous email agent that sends on your behalf without approval is a liability.&lt;/p&gt;
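
&lt;p&gt;The timeout is a select, not a policy. A minimal Go sketch, assuming a decisions channel fed by the bot (names illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package engine

import (
    "log"
    "time"
)

// awaitApproval blocks the pipeline on a human decision.
// On timeout the action is logged and dropped, never auto-approved.
func awaitApproval(decisions chan string, draft string) (string, bool) {
    select {
    case d := &amp;lt;-decisions:
        return d, true // "approve", "skip", or an edited draft
    case &amp;lt;-time.After(4 * time.Hour):
        log.Printf("approval timed out, dropping draft: %.40s", draft)
        return "", false
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;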

&lt;p&gt;The approval channel itself has security controls: allowed user IDs, rate limiting per user, input length caps, and markdown stripping to prevent prompt boundary injection through the operator interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gmail connector
&lt;/h2&gt;

&lt;p&gt;The email connector is ~400 lines of Go. All outbound polling, no webhooks, no open ports. It handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OAuth 2.0 with automatic token refresh&lt;/li&gt;
&lt;li&gt;Base64 MIME decoding with multipart handling&lt;/li&gt;
&lt;li&gt;Thread reconstruction in chronological order&lt;/li&gt;
&lt;li&gt;RFC 2822-compliant reply construction (In-Reply-To, References headers; sketched below)&lt;/li&gt;
&lt;li&gt;Body truncation for token efficiency&lt;/li&gt;
&lt;/ul&gt;
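
&lt;p&gt;The reply construction is the piece ad-hoc clients most often get wrong. A minimal Go sketch of the headers (illustrative; real code would also fold and escape header values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package gmail

import "strings"

// buildReply threads correctly: In-Reply-To names the parent message,
// References carries the whole chain so clients keep the thread intact.
func buildReply(parentID, parentRefs, from, to, subject, body string) string {
    refs := strings.TrimSpace(parentRefs + " " + parentID)
    return "From: " + from + "\r\n" +
        "To: " + to + "\r\n" +
        "Subject: Re: " + subject + "\r\n" +
        "In-Reply-To: " + parentID + "\r\n" +
        "References: " + refs + "\r\n" +
        "\r\n" + body + "\r\n"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;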

&lt;p&gt;Permission scoping is enforced at the connector level. If the config says &lt;code&gt;permission: read&lt;/code&gt;, the connector physically cannot call the send endpoint. It's not a policy check, it's a code path that doesn't exist.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;GmailConfig&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;TokenPath&lt;/span&gt;  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`yaml:"token_path"`&lt;/span&gt;
    &lt;span class="n"&gt;Permission&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`yaml:"permission"`&lt;/span&gt; &lt;span class="c"&gt;// read | draft | send&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
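
&lt;p&gt;To make "a code path that doesn't exist" concrete, a sketch of the capability split (interface names illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package gmail

import "context"

type Message struct{ ID, Body string }

// Capability is a type, not a runtime flag. A connector constructed with
// permission "read" satisfies only Reader; Send does not exist on it.
type Reader interface {
    Unread(ctx context.Context) ([]Message, error)
}

type Sender interface {
    Reader
    Send(ctx context.Context, m Message) error
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;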



&lt;h2&gt;
  
  
  Skills: reusable AI templates
&lt;/h2&gt;

&lt;p&gt;AI steps reference skill files instead of inline prompts. A skill is a YAML template with a role, a prompt, and an optional output schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;email-digest&lt;/span&gt;
&lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;classifier&lt;/span&gt;
&lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;Summarize these unread emails. For each:&lt;/span&gt;
  &lt;span class="s"&gt;[priority] Sender - Subject - what they need&lt;/span&gt;
  &lt;span class="s"&gt;Priority: HIGH = needs response today,&lt;/span&gt;
  &lt;span class="s"&gt;MED = this week, LOW = informational&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output schemas enforce structure. If the LLM returns JSON that doesn't match the schema, the step fails instead of passing garbage downstream. This matters more than people think. An unvalidated LLM response in an automation pipeline is a bug waiting to happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Go
&lt;/h2&gt;

&lt;p&gt;Single binary. Cross-compile for the client's infrastructure. No runtime dependencies.&lt;/p&gt;

&lt;p&gt;The entire engine runs on a 2-core VPS for under $5/month. There's no database server, no message queue, no container orchestration. SQLite for state, YAML for config, one binary for the engine. Deploy with &lt;code&gt;scp&lt;/code&gt; and a systemd unit.&lt;/p&gt;

&lt;p&gt;For a tool that runs on client infrastructure, operational simplicity is a feature. Every dependency is a support ticket waiting to happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this compares to Dispatch and OpenClaw
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Dispatch&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;th&gt;FixClaw&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Target&lt;/td&gt;
&lt;td&gt;Personal productivity&lt;/td&gt;
&lt;td&gt;Personal AI agent&lt;/td&gt;
&lt;td&gt;Business operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token controls&lt;/td&gt;
&lt;td&gt;None (subscription)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Per-step, per-pipeline, per-day budgets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human approval&lt;/td&gt;
&lt;td&gt;Pause on destructive&lt;/td&gt;
&lt;td&gt;Optional&lt;/td&gt;
&lt;td&gt;Every outbound action&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data residency&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration&lt;/td&gt;
&lt;td&gt;Natural language&lt;/td&gt;
&lt;td&gt;Natural language&lt;/td&gt;
&lt;td&gt;YAML (version-controlled, auditable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Full action + decision logging&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Dispatch and OpenClaw are good products for what they do. But "what they do" is personal productivity. The moment you need to explain to a client what your automation did and why, you need governance controls that personal tools don't have.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Most agent complexity is accidental.&lt;/strong&gt; The LLM doesn't need to decide what to do next. You know what to do next. Write it in a pipeline. Use the LLM for the one step that actually requires language understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token budgets change how you design prompts.&lt;/strong&gt; When there's a hard ceiling, you stop sending full email bodies and start truncating aggressively. You pick smaller models for classification. You realize 90% of your use cases work fine with a fast, cheap model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human-in-the-loop is not a UX compromise; it's a product feature.&lt;/strong&gt; Clients don't want autonomous agents sending emails on their behalf. They want a system that does the thinking and lets them press the button. The approval step isn't a limitation. It's the reason they trust it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;FixClaw is open source and written in Go. If you're building business automations that need compliance controls, check it out on &lt;a href="https://github.com/renezander030/fixclaw" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=how-i-built-a-business-email-agent-with-compliance-controls-in-go" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>ai</category>
      <category>email</category>
      <category>automation</category>
    </item>
    <item>
      <title>Detecting When Smart Money Stops Being Smart</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 17 Mar 2026 10:26:14 +0000</pubDate>
      <link>https://dev.to/reneza/detecting-when-smart-money-stops-being-smart-21n3</link>
      <guid>https://dev.to/reneza/detecting-when-smart-money-stops-being-smart-21n3</guid>
      <description>&lt;p&gt;Following a profitable wallet is easy. Knowing when to unfollow it is where the money is.&lt;/p&gt;

&lt;p&gt;I learned this the hard way. Not in crypto, but in forex. Years ago I built trading systems that scraped signals from profitable traders and mirrored their positions. It worked until it didn't. The problem was never finding good traders to follow. The problem was staying too long after they stopped performing.&lt;/p&gt;

&lt;p&gt;The same pattern repeats in on-chain copy trading. Someone finds a wallet with a 70% win rate, follows every move, and six weeks later wonders why they're bleeding. The wallet didn't get hacked. The market shifted and the strategy stopped working.&lt;/p&gt;

&lt;p&gt;Nobody talks about this part.&lt;/p&gt;

&lt;h3&gt;
  
  
  The decay problem
&lt;/h3&gt;

&lt;p&gt;Every trading strategy has a shelf life. A wallet that's profitable today is running a strategy tuned to current market conditions. When conditions change, the strategy breaks. Sometimes gradually, sometimes overnight.&lt;/p&gt;

&lt;p&gt;In forex I watched this happen in real time. A signal provider would post a 40% return over three months. Followers would pile in during month four. By month six the drawdown had wiped most of the gains. The provider's edge was gone but the followers didn't know because they were looking at cumulative PnL, not recent performance.&lt;/p&gt;

&lt;p&gt;On-chain it's the same dynamic but harder to detect because the data is messier. You're not looking at clean trade logs. You're parsing token swaps, liquidity events, and contract interactions, then trying to figure out if the pattern still holds.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to actually measure
&lt;/h3&gt;

&lt;p&gt;When I built a detection system for this, I focused on three metrics that signal regime change:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Win rate over a rolling window.&lt;/strong&gt; Not all-time. A wallet with a 65% all-time win rate that's been running at 40% for the last three weeks is in trouble. The all-time number hides the decay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Average return per trade, recent vs historical.&lt;/strong&gt; If the average return is shrinking even while the win rate holds, the edge is thinning. The wallet is still picking winners but the magnitude is dropping. This is the earliest warning sign.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade frequency changes.&lt;/strong&gt; A wallet that was trading daily and suddenly goes quiet for a week might be recalibrating. Or it might be done. Either way, the pattern you were following no longer exists.&lt;/p&gt;

&lt;p&gt;None of these metrics work in isolation. A drop in win rate during a broad market pullback means nothing. But a drop in win rate while similar wallets maintain theirs, that's a signal.&lt;/p&gt;
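
&lt;p&gt;A minimal sketch of the first two metrics; the trade representation is an assumption for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Trade is a simplified closed position.
type Trade struct {
    Win bool    // closed in profit
    Ret float64 // fractional return
}

// rollingStats computes win rate and mean return over the last n
// trades only, so decay shows up instead of being averaged away.
func rollingStats(trades []Trade, n int) (winRate, avgRet float64) {
    if n &amp;gt; len(trades) {
        n = len(trades)
    }
    if n == 0 {
        return 0, 0
    }
    recent := trades[len(trades)-n:]
    for _, t := range recent {
        if t.Win {
            winRate++
        }
        avgRet += t.Ret
    }
    return winRate / float64(n), avgRet / float64(n)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;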

&lt;h3&gt;
  
  
  The threshold problem
&lt;/h3&gt;

&lt;p&gt;The hardest part isn't collecting the data. It's deciding what counts as a regime change versus normal variance.&lt;/p&gt;

&lt;p&gt;Set the thresholds too tight and you're unfollowing wallets after every bad week. Too loose and you catch the decline three weeks after it started. I went through several iterations before landing on something useful: compare the wallet's recent performance against its own historical baseline, not against an absolute number.&lt;/p&gt;

&lt;p&gt;A wallet that normally runs a 55% win rate dropping to 45% is a different signal than a wallet that normally runs 75% dropping to 65%. Both dropped 10 points, but the first one is within normal variance and the second one is a structural shift.&lt;/p&gt;
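
&lt;p&gt;Here's one way to encode that, sketched as a z-score test against the wallet's own baseline; the constant k is the part you iterate on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;import "math"

// regimeShift flags a wallet whose recent win rate sits more than
// k standard deviations below its own historical baseline. Binomial
// variance peaks near 50%, so a 55% wallet gets more slack than a
// 75% wallet for the same absolute drop.
func regimeShift(baseline, recent float64, nRecent int, k float64) bool {
    std := math.Sqrt(baseline * (1 - baseline) / float64(nRecent))
    return baseline-recent &amp;gt; k*std
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With 30 recent trades and k = 2, the 75% wallet trips the flag at roughly a 16-point drop while the 55% wallet gets about 18 points of room, which matches the intuition above.&lt;/p&gt;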

&lt;h3&gt;
  
  
  When to actually stop following
&lt;/h3&gt;

&lt;p&gt;The detection is one thing. Acting on it is another.&lt;/p&gt;

&lt;p&gt;The instinct is to wait. "Maybe it's just a bad week." That's the same instinct that kept me in losing forex signals for months. The data was clear but the hope was louder.&lt;/p&gt;

&lt;p&gt;What worked for me: automate the alert, not the action. The system flags when a wallet crosses its regime change threshold. I review it, check the context (is the whole market down or just this wallet?), and decide. But the flag forces the decision. Without it, I'd never look.&lt;/p&gt;

&lt;p&gt;Most people in DeFi are still following wallets based on a snapshot of past performance. They find a wallet on a leaderboard, follow it, and never re-evaluate. That's not a strategy. That's a bet that conditions never change.&lt;/p&gt;

&lt;p&gt;The edge isn't in finding smart money. There are tools for that everywhere. The edge is in knowing when smart money stops being smart for the current market, and stepping aside before the losses compound.&lt;/p&gt;

&lt;p&gt;If you're thinking about building something similar or want to talk through the detection approach, feel free to reach out: &lt;a href="https://cal.eu/reneza" rel="noopener noreferrer"&gt;cal.eu/reneza&lt;/a&gt;.&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=detecting-when-smart-money-stops-being-smart" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>blockchain</category>
      <category>defi</category>
      <category>trading</category>
      <category>web3</category>
    </item>
    <item>
      <title>Your Vector Database Decision Is Simpler Than You Think</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 17 Mar 2026 07:41:59 +0000</pubDate>
      <link>https://dev.to/reneza/your-vector-database-decision-is-simpler-than-you-think-3ape</link>
      <guid>https://dev.to/reneza/your-vector-database-decision-is-simpler-than-you-think-3ape</guid>
      <description>&lt;p&gt;Every week someone asks which vector database they should use. The answer is almost always "it depends on three things," and none of them are throughput benchmarks.&lt;/p&gt;

&lt;p&gt;I run semantic search in production on a single VPS. Over a thousand items indexed, embeddings generated on the same machine, queries return in under a second. But that setup only works because of the constraints I'm operating in. Change the constraints and the answer changes completely.&lt;/p&gt;

&lt;p&gt;Here's how I think about it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The overchoice problem
&lt;/h3&gt;

&lt;p&gt;There are dozens of vector databases now. Every one of them publishes benchmarks showing millions of vectors queried in milliseconds. That's great if you're building a search engine for the entire internet. Most of us aren't.&lt;/p&gt;

&lt;p&gt;The benchmarks test throughput at scale. What they don't test is: can this thing run on the same box as your application without eating all the memory? Can you set it up in ten minutes? Does it need a cluster?&lt;/p&gt;

&lt;p&gt;Those are the questions that actually matter when you're picking one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 1: Local device, small dataset, ephemeral
&lt;/h3&gt;

&lt;p&gt;You have a CLI tool or a local application. Your data is a few hundred markdown files or JSON documents. The user runs it on their laptop.&lt;/p&gt;

&lt;p&gt;You don't need a database. Load the vectors into memory on startup, compute cosine similarity, done. A flat array of float32 embeddings and a brute-force search will outperform any database at this scale because there's zero overhead. No process to manage, no port to configure, no persistence to worry about.&lt;/p&gt;

&lt;p&gt;Pre-compute your embeddings at build time or on first run, store them alongside the source files. When the data changes, regenerate. At a few hundred items this takes seconds.&lt;/p&gt;
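
&lt;p&gt;The entire "database" is a few lines. A sketch, assuming the embeddings are already computed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;import (
    "math"
    "sort"
)

// Doc pairs a source file with its precomputed embedding.
type Doc struct {
    Path string
    Vec  []float32
}

// cosine assumes equal-length, non-zero vectors.
func cosine(a, b []float32) float64 {
    var dot, na, nb float64
    for i := range a {
        dot += float64(a[i]) * float64(b[i])
        na += float64(a[i]) * float64(a[i])
        nb += float64(b[i]) * float64(b[i])
    }
    return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK is deliberately brute force; at a few hundred docs the whole
// scan costs less than one round trip to a database would.
func topK(docs []Doc, query []float32, k int) []Doc {
    sort.Slice(docs, func(i, j int) bool {
        return cosine(docs[i].Vec, query) &amp;gt; cosine(docs[j].Vec, query)
    })
    if k &amp;gt; len(docs) {
        k = len(docs)
    }
    return docs[:k]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;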

&lt;p&gt;The mistake people make here is reaching for a database because it feels like the "proper" way. It's not. It's unnecessary complexity for a problem that fits in a single array.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 2: Small VPS, thousands of items, needs persistence
&lt;/h3&gt;

&lt;p&gt;Now things change. Your data lives behind an API. It updates throughout the day. You need search results to reflect changes within minutes, not hours. The whole thing runs on a VPS with maybe 2GB of RAM, shared with other services.&lt;/p&gt;

&lt;p&gt;This is where a lightweight vector database process makes sense. Something that runs as a single binary, stores vectors on disk, serves queries over a local HTTP API. You don't need clustering or replication. You need something that starts fast, uses a few hundred MB of RAM, and doesn't crash when you restart other services on the same box.&lt;/p&gt;

&lt;p&gt;The key decision here: does the embedding model run locally, or do you call an API? If your VPS has enough RAM, running a small embedding model locally saves you per-request API costs and latency. If RAM is tight, use an embedding API and store only the resulting vectors locally.&lt;/p&gt;

&lt;p&gt;Change detection matters at this scale. You don't want to re-embed everything on every sync. Hash the source content, compare to what's stored, only embed what changed. This keeps your sync jobs fast and your API costs predictable.&lt;/p&gt;
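
&lt;p&gt;A sketch of that change detection, assuming you keep a map of content hashes from the last sync (where it's persisted is up to you):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;import (
    "crypto/sha256"
    "encoding/hex"
)

// changed returns the IDs whose content hash differs from the last
// sync: the only items worth re-embedding.
func changed(items, prevHashes map[string]string) []string {
    var dirty []string
    for id, content := range items {
        sum := sha256.Sum256([]byte(content))
        h := hex.EncodeToString(sum[:])
        if prevHashes[id] != h {
            dirty = append(dirty, id)
            prevHashes[id] = h
        }
    }
    return dirty
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;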

&lt;h3&gt;
  
  
  Scenario 3: Multi-service, millions of vectors, high availability
&lt;/h3&gt;

&lt;p&gt;This is where the benchmarks actually apply. Multiple services querying the same vector index. Data measured in millions of items. Uptime requirements that mean you can't tolerate a single-process restart.&lt;/p&gt;

&lt;p&gt;At this scale you need a managed service or a self-hosted cluster. Replication, sharding, automatic failover. The operational overhead is real but justified because downtime now costs money.&lt;/p&gt;

&lt;p&gt;Most teams jump here first because it looks professional. But if your dataset fits in 2GB of RAM and you have one service querying it, you're paying for complexity you don't need.&lt;/p&gt;

&lt;h3&gt;
  
  
  The three questions
&lt;/h3&gt;

&lt;p&gt;Before looking at any product page, answer these:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where does the data live?&lt;/strong&gt; If it's local files that rarely change, stay in-memory. If it's behind an API that updates constantly, you need a persistent store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much RAM can you spare?&lt;/strong&gt; This determines whether you run embeddings locally or call an API, and whether your database runs in-process or as a separate service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you need persistence or is ephemeral fine?&lt;/strong&gt; If you can regenerate everything from source in seconds, skip the database. If regeneration takes minutes or hours, persist.&lt;/p&gt;

&lt;p&gt;These three questions eliminate 80% of the options before you read a single benchmark.&lt;/p&gt;

&lt;h3&gt;
  
  
  Start from the environment
&lt;/h3&gt;

&lt;p&gt;The pattern I keep seeing is people evaluating vector databases by features and benchmarks, then trying to fit the winner into their deployment. It works the other way around. Start from your environment, your constraints, your data volume. The right answer usually becomes obvious.&lt;/p&gt;

&lt;p&gt;The comparison articles won't tell you this because they can't. They don't know if you're running on a laptop, a $5 VPS, or a Kubernetes cluster. You do.&lt;/p&gt;

&lt;p&gt;If you're building something with semantic search and want to think through which approach fits your setup, I'm always up for a quick conversation: &lt;a href="https://cal.eu/reneza" rel="noopener noreferrer"&gt;cal.eu/reneza&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vectordatabase</category>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>I Run 10 AI Agents in Production. They're All Bash Scripts.</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Thu, 12 Mar 2026 14:29:44 +0000</pubDate>
      <link>https://dev.to/reneza/i-run-10-ai-agents-in-production-theyre-all-bash-scripts-df2</link>
      <guid>https://dev.to/reneza/i-run-10-ai-agents-in-production-theyre-all-bash-scripts-df2</guid>
      <description>&lt;p&gt;A week ago I wrote about &lt;a href="https://dev.to/renezander030/lots-of-people-are-demoing-ai-agents-almost-nobodys-shipping-them-the-right-way-5c10"&gt;shipping AI agents the right way&lt;/a&gt;. That piece was about the harness: quality gates, token economics, multi-model verification. The stuff that separates demos from production.&lt;/p&gt;

&lt;p&gt;It resonated with a lot of people. But I left out the part that actually eats most of my time: keeping the boring stuff running.&lt;/p&gt;

&lt;p&gt;So let me walk you through what production AI agents actually look like when the conference talk is over.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;p&gt;I run about ten agents in production. They handle things like daily briefings, team follow-ups, task classification, weekly status reports, and job screening. None of them are impressive. All of them are useful.&lt;/p&gt;

&lt;p&gt;The architecture for every single one is the same:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trigger &amp;gt; Fetch data &amp;gt; Scoped prompt &amp;gt; Write output &amp;gt; Notify&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A scheduled job fires. The agent pulls data from a task system, a calendar, an inbox, whatever it needs. A scoped prompt with a token budget processes that data. The result gets written back or sent as a notification.&lt;/p&gt;

&lt;p&gt;That's it. No framework. No orchestration layer. No agent-to-agent communication protocol. Each agent is a standalone script with a timeout and a budget cap.&lt;/p&gt;
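
&lt;p&gt;A stripped-down sketch of that skeleton. Every path and helper name here is a stand-in, not my actual setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# One agent, one job: fetch -&amp;gt; scoped prompt -&amp;gt; write -&amp;gt; notify.
set -euo pipefail

SPEND_FILE=/var/lib/agents/daily-brief.spend
CAP_USD=2.00

# Budget cap: fail loudly before spending, not after.
spent=$(cat "$SPEND_FILE" 2&amp;gt;/dev/null || echo 0)
awk -v s="$spent" -v c="$CAP_USD" 'BEGIN { exit !(s &amp;lt; c) }' ||
  { echo "budget cap hit, refusing to run" &amp;gt;&amp;amp;2; exit 1; }

# Fetch data with a network timeout.
tasks=$(curl -sf --max-time 30 "https://tasks.example.com/api/today")

# Scoped prompt, hard wall-clock timeout around the LLM call.
brief=$(timeout 120 ./llm-call.sh "Summarize as a morning brief: $tasks")

# Write output, then notify.
printf '%s\n' "$brief" &amp;gt;&amp;gt; /var/lib/agents/daily-brief.log
./notify.sh "$brief"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;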

&lt;p&gt;The ones that run daily cost about $1 per run. The weekly planner, which needs more context, costs about $5. The job screener uses a smaller model and comes in at $0.25. Monthly total across everything: somewhere around $80.&lt;/p&gt;

&lt;h2&gt;
  
  
  State is the whole game
&lt;/h2&gt;

&lt;p&gt;The interesting part of running agents isn't the LLM call. It's everything around it.&lt;/p&gt;

&lt;p&gt;One of my agents tracks which emails it already processed. It maintains a list of the last 500 message IDs. If that list gets corrupted, every email gets reprocessed and I'm flooded with duplicate notifications. If it gets wiped, same thing.&lt;/p&gt;

&lt;p&gt;Another agent maintains a vector index. Every night, a sync job pulls all tasks, checks which ones changed since the last run, re-embeds only those, and updates the index. Smart sync means it skips tasks where only metadata changed. Without that optimization, the job would take 30 minutes instead of 30 seconds.&lt;/p&gt;

&lt;p&gt;A third agent prepends follow-up messages to task notes, separated by date markers. It never deletes old entries. The history is the context for the next run. Mess with the format and the agent loses its memory.&lt;/p&gt;

&lt;p&gt;None of this is AI work. It's state management. The LLM call is the easy part.&lt;/p&gt;

&lt;h2&gt;
  
  
  The maintenance nobody warns you about
&lt;/h2&gt;

&lt;p&gt;This is what I wish someone had told me before I started: the data your agents pull from and write to needs constant housekeeping.&lt;/p&gt;

&lt;p&gt;Task descriptions get stale. Context windows bloat because old entries pile up. Duplicate state creeps in when two agents touch the same data. Embeddings drift as your task descriptions evolve.&lt;/p&gt;

&lt;p&gt;I spend more time maintaining the data around my agents than I spend on the agents themselves. Trimming old entries, cleaning up state files, making sure the vector index stays in sync with the source of truth.&lt;/p&gt;

&lt;p&gt;It's the same kind of work you do with any data pipeline. But nobody frames it that way because "AI agent" sounds more exciting than "scheduled ETL job with an LLM in the middle."&lt;/p&gt;

&lt;h2&gt;
  
  
  Token economics shape everything
&lt;/h2&gt;

&lt;p&gt;Your architecture follows from your token budget, not the other way around.&lt;/p&gt;

&lt;p&gt;I learned this the hard way. My first agent design was a single mega-prompt that tried to do everything: read tasks, check calendar, scan inbox, generate a plan, write follow-ups. It worked. It also burned through tokens like they were free and hallucinated more than I was comfortable with.&lt;/p&gt;

&lt;p&gt;Now every agent has a single job. Small context, scoped prompt, hard token limit. The job screener doesn't need to know about my calendar. The weekly planner doesn't need to see my inbox. Isolation isn't just cheaper. It's more reliable.&lt;/p&gt;

&lt;p&gt;The cost breakdown matters because it changes your design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A $0.25 agent can run four times a day. A $5 agent runs once a week.&lt;/li&gt;
&lt;li&gt;A fast, cheap model handles screening. A slower, expensive model handles planning.&lt;/li&gt;
&lt;li&gt;If an agent costs more than $2 per run, I look at whether the prompt can be tighter.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Framework vs. no framework
&lt;/h2&gt;

&lt;p&gt;I tried frameworks. They lasted about three months.&lt;/p&gt;

&lt;p&gt;The problem wasn't capability. The frameworks could do everything I needed. The problem was that every failure mode became a framework debugging session instead of a straightforward script fix.&lt;/p&gt;

&lt;p&gt;A shell script fails: I read the log, find the error, fix the line. Done.&lt;/p&gt;

&lt;p&gt;A framework agent fails: I dig through abstraction layers, figure out which middleware swallowed the error, check whether the state manager persisted something weird, and then fix the line.&lt;/p&gt;

&lt;p&gt;Same fix at the end. Ten times the debugging surface.&lt;/p&gt;

&lt;p&gt;My current stack is scheduled jobs, shell scripts calling an LLM API, and log files. It has fewer moving parts than most people's "hello world" agent demos. And it's been running stable for months.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd build differently
&lt;/h2&gt;

&lt;p&gt;If I started over:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State management first.&lt;/strong&gt; Before writing a single prompt, I'd design how every agent tracks what it already did, what changed since last run, and where it writes output. This is the foundation everything else sits on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One agent, one job.&lt;/strong&gt; No multi-purpose agents. The overhead of running five cheap agents is lower than debugging one expensive agent that does five things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduled over event-driven.&lt;/strong&gt; For anything that happens on a predictable cadence, a timer is simpler and more reliable than a webhook chain. Event-driven has its place, but most of my agents don't need it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget caps from day one.&lt;/strong&gt; Every agent gets a hard spending limit. If it hits the cap, it fails loudly instead of running up a bill.&lt;/p&gt;




&lt;p&gt;I've been building and running these systems for a while now. If you're working through similar problems, or thinking about moving agents from demo to production, I'm always up for a conversation: &lt;a href="https://cal.eu/reneza" rel="noopener noreferrer"&gt;cal.eu/reneza&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>automation</category>
      <category>infrastructure</category>
    </item>
  </channel>
</rss>
