<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Raullen Chai</title>
    <description>The latest articles on DEV Community by Raullen Chai (@raullen_chai_76e18e9705b0).</description>
    <link>https://dev.to/raullen_chai_76e18e9705b0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2056219%2F75d4cd58-24b8-49e1-b45e-b0aed1819c29.jpg</url>
      <title>DEV Community: Raullen Chai</title>
      <link>https://dev.to/raullen_chai_76e18e9705b0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/raullen_chai_76e18e9705b0"/>
    <language>en</language>
    <item>
      <title>Gemma 4 on Apple Silicon: 85 tok/s with a pip install</title>
      <dc:creator>Raullen Chai</dc:creator>
      <pubDate>Tue, 07 Apr 2026 21:45:09 +0000</pubDate>
      <link>https://dev.to/raullen_chai_76e18e9705b0/gemma-4-on-apple-silicon-85-toks-with-a-pip-install-299a</link>
      <guid>https://dev.to/raullen_chai_76e18e9705b0/gemma-4-on-apple-silicon-85-toks-with-a-pip-install-299a</guid>
      <description>&lt;p&gt;Last week Google released Gemma 4 — their most capable open-weight model family. Within hours I had it running locally on my Mac at 85 tokens/second, with full tool calling, streaming, and an OpenAI-compatible API that works with every major AI framework.&lt;/p&gt;

&lt;p&gt;Here's how, and what the benchmarks actually look like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup: 2 commands
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;rapid-mlx
rapid-mlx serve gemma-4-26b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The server downloads the 4-bit MLX-quantized model (~14 GB) and starts an OpenAI-compatible API on &lt;code&gt;http://localhost:8000/v1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8o44i5lgw06li1a2fmy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8o44i5lgw06li1a2fmy.gif" alt="Rapid-MLX demo" width="643" height="694"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks: Gemma 4 26B on M3 Ultra
&lt;/h2&gt;

&lt;p&gt;I benchmarked three engines on the same machine (M3 Ultra, 192GB), same model (Gemma 4 26B-A4B 4-bit), same prompt:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Decode (tok/s)&lt;/th&gt;
&lt;th&gt;TTFT&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rapid-MLX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.26s&lt;/td&gt;
&lt;td&gt;MLX-native, prompt cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mlx-vlm&lt;/td&gt;
&lt;td&gt;84 tok/s&lt;/td&gt;
&lt;td&gt;0.31s&lt;/td&gt;
&lt;td&gt;VLM library (no tool calling)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;td&gt;75 tok/s&lt;/td&gt;
&lt;td&gt;0.08s&lt;/td&gt;
&lt;td&gt;llama.cpp backend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Rapid-MLX is 13% faster than Ollama on decode. Ollama has faster TTFT (it uses llama.cpp's Metal kernels for prefill), but for interactive use the decode speed is what you feel.&lt;/p&gt;

&lt;p&gt;On smaller models the gap is wider — Rapid-MLX hits 168 tok/s on Qwen3.5-4B vs Ollama's ~70 tok/s (2.4x).&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool Calling That Actually Works
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Most local inference servers either don't support tool calling, or support it for one model family. Rapid-MLX ships &lt;strong&gt;18 built-in tool call parsers&lt;/strong&gt; covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Qwen 3 / 3.5 (hermes format)&lt;/li&gt;
&lt;li&gt;Gemma 4 (native &lt;code&gt;&amp;lt;|tool_call&amp;gt;&lt;/code&gt; format)&lt;/li&gt;
&lt;li&gt;GLM-4.7, MiniMax, GPT-OSS&lt;/li&gt;
&lt;li&gt;Llama 3, Mistral, DeepSeek&lt;/li&gt;
&lt;li&gt;And more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tool calling works out of the box — no extra flags needed for supported models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "default",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"tool_calls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_weather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;city&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Tokyo&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool call arguments are properly parsed — including bare numeric values like &lt;code&gt;{a: 3, b: 4}&lt;/code&gt; that Gemma 4 emits without JSON quotes.&lt;/p&gt;
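The shipped parser is more elaborate, but the recovery trick for bare keys can be sketched in a few lines of Python (the helper name is mine, not the library's API):

```python
import json
import re

def parse_lenient_json(text):
    """Parse tool-call arguments that strict JSON rejects, e.g. the
    bare-key form '{a: 3, b: 4}' that some models emit."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Quote bare identifiers that appear in key position, then retry.
        fixed = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', text)
        return json.loads(fixed)
```

Valid JSON passes through untouched; only the fallback path rewrites keys.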

&lt;h2&gt;
  
  
  Works With Everything
&lt;/h2&gt;

&lt;p&gt;Because it's OpenAI-compatible, you can point any AI framework at it:&lt;/p&gt;

&lt;h3&gt;
  
  
  PydanticAI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic_ai.models.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIChatModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic_ai.providers.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIProvider&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIChatModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAIProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 2+2?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "4"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've verified this end-to-end with structured output (&lt;code&gt;output_type=BaseModel&lt;/code&gt;), streaming, multi-turn conversations, and multi-tool workflows. &lt;a href="https://github.com/raullenchai/Rapid-MLX/blob/main/tests/integrations/test_pydantic_ai_full.py" rel="noopener noreferrer"&gt;Test suite here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangChain
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Tool calling works
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Multiply two numbers.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 6 * 7?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# [{"name": "multiply", "args": {"a": 6, "b": 7}}]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Aider (AI pair programming)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_BASE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8000/v1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;not-needed
aider &lt;span class="nt"&gt;--model&lt;/span&gt; openai/gemma-4-26b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aider's full edit-and-commit workflow works — I tested it modifying a Python file with Gemma 4. &lt;a href="https://github.com/raullenchai/Rapid-MLX/blob/main/tests/integrations/test_aider.sh" rel="noopener noreferrer"&gt;Test script here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Full Compatibility List
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Client&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PydanticAI&lt;/td&gt;
&lt;td&gt;Tested (6/6)&lt;/td&gt;
&lt;td&gt;Streaming, structured output, multi-tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangChain&lt;/td&gt;
&lt;td&gt;Tested (6/6)&lt;/td&gt;
&lt;td&gt;Tools, streaming, structured output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;smolagents&lt;/td&gt;
&lt;td&gt;Tested (4/4)&lt;/td&gt;
&lt;td&gt;CodeAgent + ToolCallingAgent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic SDK&lt;/td&gt;
&lt;td&gt;Tested (5/5)&lt;/td&gt;
&lt;td&gt;Via &lt;code&gt;/v1/messages&lt;/code&gt; endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;Tested&lt;/td&gt;
&lt;td&gt;CLI edit-and-commit workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LibreChat&lt;/td&gt;
&lt;td&gt;Tested (4/4)&lt;/td&gt;
&lt;td&gt;Docker E2E with &lt;code&gt;librechat.yaml&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open WebUI&lt;/td&gt;
&lt;td&gt;Tested (3/4)&lt;/td&gt;
&lt;td&gt;Docker, model fetch, streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;Compatible&lt;/td&gt;
&lt;td&gt;Settings UI config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Compatible&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; env var&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continue.dev&lt;/td&gt;
&lt;td&gt;Compatible&lt;/td&gt;
&lt;td&gt;YAML config&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every "Tested" entry has an automated test script in the repo — not just "I tried it once."&lt;/p&gt;

&lt;h2&gt;
  
  
  What Model Should I Run?
&lt;/h2&gt;

&lt;p&gt;Depends on your Mac's RAM:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mac&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;16 GB MacBook Air&lt;/td&gt;
&lt;td&gt;Qwen3.5-4B&lt;/td&gt;
&lt;td&gt;168 tok/s&lt;/td&gt;
&lt;td&gt;Chat, coding, tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32 GB MacBook Pro&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B&lt;/td&gt;
&lt;td&gt;85 tok/s&lt;/td&gt;
&lt;td&gt;General purpose, tool calling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64 GB Mac Mini/Studio&lt;/td&gt;
&lt;td&gt;Qwen3.5-35B&lt;/td&gt;
&lt;td&gt;83 tok/s&lt;/td&gt;
&lt;td&gt;Smart + fast balance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;96+ GB Mac Studio/Pro&lt;/td&gt;
&lt;td&gt;Qwen3.5-122B&lt;/td&gt;
&lt;td&gt;57 tok/s&lt;/td&gt;
&lt;td&gt;Frontier intelligence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Quick alias lookup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rapid-mlx models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Under the Hood
&lt;/h2&gt;

&lt;p&gt;A few things that make this work well:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt cache&lt;/strong&gt; — Repeated system prompts (common in agent frameworks) are cached. On multi-turn conversations, only new tokens are processed. This cuts TTFT by 2-10x on follow-up messages.&lt;/p&gt;
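The cache behaves like a longest-common-prefix check over token IDs; a toy illustration of the idea, not the MLX implementation:

```python
def reusable_prefix(cached_tokens, new_tokens):
    """How many leading tokens the new prompt shares with the cached
    one -- only tokens past this point need a fresh forward pass."""
    shared = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        shared += 1
    return shared

# A follow-up turn re-sends the whole system prompt plus history,
# so almost everything is shared and TTFT drops accordingly.
```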

&lt;p&gt;&lt;strong&gt;OutputRouter&lt;/strong&gt; — A token-level state machine that separates model output into channels (content / reasoning / tool calls) in real-time. No regex post-processing, no leakage of &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags or tool markup into the content stream.&lt;/p&gt;
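Conceptually the router is a tiny state machine keyed on the model's special tags; a simplified sketch (the marker strings here are stand-ins for the real think / tool-call tags):

```python
def route_tokens(tokens, markers=None):
    """Append each ordinary token to the currently active channel;
    marker tokens only switch the channel and are never emitted."""
    if markers is None:
        markers = {
            "THINK_START": "reasoning", "THINK_END": "content",
            "TOOL_START": "tool_calls", "TOOL_END": "content",
        }
    channels = {"content": [], "reasoning": [], "tool_calls": []}
    active = "content"
    for tok in tokens:
        if tok in markers:
            active = markers[tok]
        else:
            channels[active].append(tok)
    return channels
```

Because routing happens per token, each channel can be streamed to the client the moment a token lands in it.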

&lt;p&gt;&lt;strong&gt;Auto-detection&lt;/strong&gt; — Model family, tool parser, and reasoning parser are auto-detected from the model name. No manual &lt;code&gt;--tool-parser hermes&lt;/code&gt; flags needed (though you can override).&lt;/p&gt;
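Detection amounts to a substring lookup from model name to parser; a hypothetical sketch (the parser names and table entries are illustrative, not the shipped list):

```python
def detect_tool_parser(model_name):
    """Pick a tool-call parser from the model name; fall back to a
    generic parser when nothing matches."""
    name = model_name.lower()
    table = [
        ("gemma", "gemma"),
        ("qwen", "hermes"),
        ("llama", "llama3"),
        ("mistral", "mistral"),
    ]
    for needle, parser in table:
        if needle in name:
            return parser
    return "generic"
```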

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Homebrew&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;raullenchai/rapid-mlx/rapid-mlx

&lt;span class="c"&gt;# or pip&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;rapid-mlx

&lt;span class="c"&gt;# Serve Gemma 4&lt;/span&gt;
rapid-mlx serve gemma-4-26b

&lt;span class="c"&gt;# Point any OpenAI-compatible app at http://localhost:8000/v1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Repo: &lt;a href="https://github.com/raullenchai/Rapid-MLX" rel="noopener noreferrer"&gt;github.com/raullenchai/Rapid-MLX&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built on Apple's &lt;a href="https://github.com/ml-explore/mlx" rel="noopener noreferrer"&gt;MLX framework&lt;/a&gt; and &lt;a href="https://github.com/ml-explore/mlx-lm" rel="noopener noreferrer"&gt;mlx-lm&lt;/a&gt;. Licensed Apache 2.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gemma4</category>
      <category>applesilicon</category>
      <category>mlx</category>
      <category>localai</category>
    </item>
    <item>
      <title>Stop pasting 5,000 lines of logs into Claude. Use a secure context tunnel instead</title>
      <dc:creator>Raullen Chai</dc:creator>
      <pubDate>Sun, 25 Jan 2026 05:12:19 +0000</pubDate>
      <link>https://dev.to/raullen_chai_76e18e9705b0/stop-pasting-5000-lines-of-logs-into-claude-use-a-secure-context-tunnel-instead-5559</link>
      <guid>https://dev.to/raullen_chai_76e18e9705b0/stop-pasting-5000-lines-of-logs-into-claude-use-a-secure-context-tunnel-instead-5559</guid>
      <description>&lt;h1&gt;
  
  
  Stop pasting 5,000 lines of logs into Claude. Use a secure context tunnel instead.
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #ai #productivity #cli #security&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: The "Wall of Text" Friction
&lt;/h2&gt;

&lt;p&gt;We've all been there. You're debugging a nasty crash. You have a 2MB log file. You try to paste it into ChatGPT or Claude.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;em&gt;The UI freezes.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;❌ &lt;em&gt;The text gets truncated.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;❌ &lt;em&gt;You realize you just pasted your API keys into a cloud chat history.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Solution: vnsh (Vanish)
&lt;/h2&gt;

&lt;p&gt;I built an open-source tool called &lt;strong&gt;vnsh&lt;/strong&gt;. Think of it as an ephemeral "Dropbox" designed specifically for AI agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; You pipe data in your terminal: &lt;code&gt;cat error.log | vn&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; It encrypts it locally (&lt;strong&gt;AES-256-CBC&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt; It gives you a secure link.&lt;/li&gt;
&lt;li&gt; You give that link to Claude.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Because I built a &lt;strong&gt;native Model Context Protocol (MCP)&lt;/strong&gt; server for it, Claude can actually "see" inside the encrypted link and read the file directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;If you are on Mac/Linux (Homebrew):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew tap raullenchai/vnsh
brew &lt;span class="nb"&gt;install &lt;/span&gt;vnsh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or via NPM (Node.js):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; vnsh-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "Magic" Workflow&lt;br&gt;
Next time you have a git diff that is too long to explain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git diff | vn
# Output: [https://vnsh.dev/v/abc...#k=](https://vnsh.dev/v/abc...#k=)...
Paste that URL to Claude. It stays fast, the server (me) can't read your code, and the data self-destructs in 24 hours.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Self-Hosting
&lt;/h2&gt;

&lt;p&gt;Since it deals with sensitive data, I made it host-blind: the decryption key lives in the URL hash fragment and is never sent to the server. But if you are paranoid (like me), you can self-host the whole stack on your own Cloudflare account.&lt;/p&gt;
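The host-blind property falls out of how URLs work: HTTP clients strip the fragment before sending a request, so the key never leaves the client. A stdlib sketch of the link format (the blob id and helper name are illustrative):

```python
import base64
import secrets
from urllib.parse import urlsplit

def build_share_url(blob_id):
    """Generate a fresh 256-bit key locally and put it in the URL
    fragment, which browsers never transmit to the server."""
    key = secrets.token_bytes(32)
    k = base64.urlsafe_b64encode(key).decode().rstrip("=")
    return "https://vnsh.dev/v/" + blob_id + "#k=" + k

url = build_share_url("abc123")
parts = urlsplit(url)
# parts.path is all the server ever sees; parts.fragment holds the key.
```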

&lt;p&gt;Check it out on GitHub: &lt;a href="https://github.com/raullenchai/vnsh" rel="noopener noreferrer"&gt;https://github.com/raullenchai/vnsh&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cli</category>
      <category>productivity</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Control Claude Code from Your Phone with Claw</title>
      <dc:creator>Raullen Chai</dc:creator>
      <pubDate>Sun, 18 Jan 2026 22:03:15 +0000</pubDate>
      <link>https://dev.to/raullen_chai_76e18e9705b0/control-claude-code-from-your-phone-with-claw-b8f</link>
      <guid>https://dev.to/raullen_chai_76e18e9705b0/control-claude-code-from-your-phone-with-claw-b8f</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;You're deep in a Claude Code session. It's working through a complex task.&lt;/p&gt;

&lt;p&gt;But you need to step away - grab coffee, take a call, pick up kids.&lt;/p&gt;

&lt;p&gt;What do you do? Leave it running and hope nothing goes wrong?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Claw
&lt;/h2&gt;

&lt;p&gt;I built &lt;strong&gt;Claw&lt;/strong&gt; (CLaude AnyWhere), a zero-dependency Python tool that lets you monitor and control Claude Code from any device with a browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv0ur0janrsw4vm9ek5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv0ur0janrsw4vm9ek5x.png" alt="Claw Screenshot" width="800" height="786"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;👀 &lt;strong&gt;Live terminal view&lt;/strong&gt; - see what Claude is doing in real-time&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;Quick actions&lt;/strong&gt; - tap &lt;code&gt;yes&lt;/code&gt;, &lt;code&gt;no&lt;/code&gt;, &lt;code&gt;continue&lt;/code&gt;, or &lt;code&gt;Ctrl+C&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;📱 &lt;strong&gt;Mobile-first&lt;/strong&gt; - designed for phones with pull-to-refresh&lt;/li&gt;
&lt;li&gt;🌐 &lt;strong&gt;Access anywhere&lt;/strong&gt; - &lt;code&gt;--share&lt;/code&gt; flag creates instant public URL&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
bash
  # Install
  pip install claw-cli

  # Run with remote access
  claw --share

  That's it. Open the URL on your phone. You're in control.

  How It Works

  Claw is a lightweight HTTP server that:
  1. Captures tmux pane content in real-time
  2. Sends keystrokes via tmux send-keys
  3. Serves a mobile-optimized dashboard

  No dependencies beyond Python stdlib. Works on macOS, Linux, and Windows
  (WSL).

  Try It Out

  GitHub: https://github.com/raullenchai/claw
  PyPI: https://pypi.org/project/claw-cli/

  Contributions welcome! Check out our good first issue labels.

  ---
  Built for developers who got tired of walking back to their desks 🦞
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
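The capture/control loop boils down to two tmux invocations; a minimal sketch of the pattern (the session name is illustrative):

```python
import subprocess

def tmux_capture_cmd(session):
    # '-p' prints the pane contents to stdout.
    return ["tmux", "capture-pane", "-p", "-t", session]

def tmux_send_cmd(session, text):
    # Types `text` into the pane and presses Enter.
    return ["tmux", "send-keys", "-t", session, text, "Enter"]

def capture_pane(session="claude"):
    """What the dashboard polls and streams to your phone."""
    done = subprocess.run(tmux_capture_cmd(session),
                          capture_output=True, text=True)
    return done.stdout

def send_keys(session, text):
    """Backs a quick-action button: forward 'yes', 'no', etc."""
    subprocess.run(tmux_send_cmd(session, text), check=True)
```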

</description>
      <category>opensource</category>
      <category>python</category>
      <category>cli</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
