<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maksim Danilchenko</title>
    <description>The latest articles on DEV Community by Maksim Danilchenko (@dmaxdev).</description>
    <link>https://dev.to/dmaxdev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3851903%2F271b9f0d-273c-44e2-a2c7-0d4ec886b1c5.jpeg</url>
      <title>DEV Community: Maksim Danilchenko</title>
      <link>https://dev.to/dmaxdev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dmaxdev"/>
    <language>en</language>
    <item>
      <title>How to Run Gemma 4 Locally With Ollama, llama.cpp, and vLLM</title>
      <dc:creator>Maksim Danilchenko</dc:creator>
      <pubDate>Sat, 11 Apr 2026 22:40:26 +0000</pubDate>
      <link>https://dev.to/dmaxdev/how-to-run-gemma-4-locally-with-ollama-llamacpp-and-vllm-3n44</link>
      <guid>https://dev.to/dmaxdev/how-to-run-gemma-4-locally-with-ollama-llamacpp-and-vllm-3n44</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Google Gemma 4 dropped on April 2 under Apache 2.0 and it's genuinely good: the 31B dense model hit #3 on the Arena AI leaderboard, beating models 20x its size. You can run it locally with Ollama in about two minutes, or go the llama.cpp / vLLM route if you want more control. But there are real bugs right now, especially on Apple Silicon and with tool calling. This guide covers all three options, what hardware you actually need, and the workarounds for the issues I've hit so far.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Gemma 4 Is Worth Running Locally
&lt;/h2&gt;

&lt;p&gt;I've been running local models since the Llama 2 days, and Gemma 4 is the first time an open model has made me reconsider whether I need API access to frontier models for everyday coding tasks.&lt;/p&gt;

&lt;p&gt;Look at the benchmarks. Gemma 4 31B scores 89.2% on AIME 2026 (math), 80.0% on LiveCodeBench v6 (coding), and 84.3% on GPQA Diamond (science). Gemma 3 scored 20.8%, 29.1%, and 42.4% on those same tests. Every metric roughly doubled or better in a single generation, and the math score more than quadrupled.&lt;/p&gt;

&lt;p&gt;The family comes in four sizes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Active Params&lt;/th&gt;
&lt;th&gt;Min VRAM (Q4)&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;E2B&lt;/td&gt;
&lt;td&gt;2.3B&lt;/td&gt;
&lt;td&gt;2.3B&lt;/td&gt;
&lt;td&gt;~1.5 GB&lt;/td&gt;
&lt;td&gt;Mobile, Raspberry Pi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E4B&lt;/td&gt;
&lt;td&gt;4.5B&lt;/td&gt;
&lt;td&gt;4.5B&lt;/td&gt;
&lt;td&gt;~3 GB&lt;/td&gt;
&lt;td&gt;Quick local tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;26B MoE&lt;/td&gt;
&lt;td&gt;26B&lt;/td&gt;
&lt;td&gt;3.8B&lt;/td&gt;
&lt;td&gt;~14 GB&lt;/td&gt;
&lt;td&gt;Best bang per VRAM GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31B Dense&lt;/td&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;~18 GB&lt;/td&gt;
&lt;td&gt;Maximum quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 26B MoE model is the sleeper hit here. It only activates 3.8B parameters per token but delivers reasoning quality close to the full 31B, and it fits in 14 GB of VRAM at Q4 quantization. If you're on a 16 GB GPU or a MacBook Pro with 18 GB unified memory, go with that one.&lt;/p&gt;

&lt;p&gt;All four variants ship under Apache 2.0. No usage restrictions, no commercial limitations, no weird "you can't use this to compete with Google" clauses that plagued earlier open model releases. (If you're on a Mac and want to explore Apple's built-in local AI too, see my &lt;a href="https://danilchenko.dev/posts/2026-04-06-apfel-review-free-local-ai-mac/" rel="noopener noreferrer"&gt;Apfel review&lt;/a&gt; — different beast, but it's free and already on your machine.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 1: Ollama (Easiest)
&lt;/h2&gt;

&lt;p&gt;Ollama is the fastest way to get Gemma 4 running. Two commands and you're chatting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install Ollama
&lt;/h3&gt;

&lt;p&gt;On macOS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows, download the installer from ollama.com.&lt;/p&gt;

&lt;p&gt;You need Ollama v0.20.0 or later for Gemma 4 support. Check with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pull and Run a Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The 26B MoE — best quality-to-VRAM ratio&lt;/span&gt;
ollama run gemma4:26b

&lt;span class="c"&gt;# The small but capable 4B&lt;/span&gt;
ollama run gemma4:4b

&lt;span class="c"&gt;# The full 31B dense (need 20+ GB VRAM)&lt;/span&gt;
ollama run gemma4:31b

&lt;span class="c"&gt;# Tiny model for edge devices&lt;/span&gt;
ollama run gemma4:2b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Ollama handles downloading the GGUF, quantization selection, and memory management automatically. By default it picks a quantization that fits your available memory.&lt;/p&gt;
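&lt;p&gt;To see which quantization Ollama actually picked, query its native &lt;code&gt;/api/tags&lt;/code&gt; endpoint, which lists installed models with their size and quant level. A small sketch — the parsing helper and its output format are mine, not part of Ollama:&lt;br&gt;
&lt;/p&gt;

```python
import json
import urllib.request

def summarize_models(payload):
    """Turn an Ollama /api/tags payload into readable summary lines."""
    lines = []
    for m in payload.get("models", []):
        quant = m.get("details", {}).get("quantization_level", "?")
        size_gb = m.get("size", 0) / 1e9
        lines.append(f"{m['name']}  {quant}  {size_gb:.1f} GB")
    return lines

if __name__ == "__main__":
    # Ollama's REST API listens on the same port as its OpenAI endpoint
    with urllib.request.urlopen("http://localhost:11434/api/tags") as r:
        print("\n".join(summarize_models(json.load(r))))
```

&lt;p&gt;Run it while Ollama is up and you get one line per installed model, including the quant level Ollama chose for you.&lt;/p&gt;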

&lt;h3&gt;
  
  
  Pick Your Quantization
&lt;/h3&gt;

&lt;p&gt;If you want more control over the quality/memory tradeoff:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Higher quality, more memory&lt;/span&gt;
ollama run gemma4:26b-q8_0

&lt;span class="c"&gt;# Lower memory, slightly less quality&lt;/span&gt;
ollama run gemma4:26b-q4_K_M

&lt;span class="c"&gt;# Middle ground&lt;/span&gt;
ollama run gemma4:26b-q5_K_M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the 31B model, Q4_K_M is the sweet spot. It keeps quality high while fitting in ~18 GB. Going to Q8 pushes you to ~28 GB, which means you need a 32 GB GPU or Mac with 32+ GB unified memory.&lt;/p&gt;
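&lt;p&gt;You can sanity-check whether a quant will fit before downloading it: file size is roughly parameters times bits-per-weight. A back-of-envelope sketch — the bits-per-weight figures are approximate averages for each GGUF quant type, and runtime overhead (KV cache, compute buffers) comes on top:&lt;br&gt;
&lt;/p&gt;

```python
# Rough GGUF size estimator. Bits-per-weight values are approximate
# averages per quant type; real files vary by a few percent.
BITS_PER_WEIGHT = {
    "q4_K_M": 4.85,
    "q5_K_M": 5.70,
    "q8_0": 8.50,
    "bf16": 16.0,
}

def estimate_gb(params_billion, quant):
    """Approximate GGUF file size in GB for a given quant type."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

if __name__ == "__main__":
    for quant in ("q4_K_M", "q5_K_M", "q8_0"):
        print(f"31B at {quant}: ~{estimate_gb(31, quant):.1f} GB")
```

&lt;p&gt;The Q4_K_M estimate for 31B lands right around the ~18 GB figure above; add a couple of GB of headroom for context before deciding what fits.&lt;/p&gt;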

&lt;h3&gt;
  
  
  Use the API
&lt;/h3&gt;

&lt;p&gt;Ollama exposes an OpenAI-compatible API on port 11434:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "gemma4:26b",
    "messages": [{"role": "user", "content": "Write a Python function to merge two sorted arrays"}]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works with any OpenAI SDK client. Just point the base URL to &lt;code&gt;http://localhost:11434/v1&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# required but ignored
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:26b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quicksort in 3 sentences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Known Ollama Issues (April 2026)
&lt;/h3&gt;

&lt;p&gt;I'm flagging these because they burned me:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Tool calling is broken in Ollama v0.20.0. The tool call parser crashes, and streaming drops tool calls entirely. If you need function calling, use vLLM instead for now.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you're on an M-series Mac, don't set &lt;code&gt;OLLAMA_FLASH_ATTENTION=1&lt;/code&gt;. The 31B model will hang once your prompt exceeds ~500 tokens. Ollama's defaults work fine without it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Some general knowledge prompts cause the model to spit out an infinite stream of &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt; tokens. Tokenizer bug. If it happens, stop generation and rephrase your prompt. A fix is being tracked in llama.cpp issue #21321.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
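&lt;p&gt;Until the tokenizer fix lands, you can defend against that third bug in client code by watching the stream for the runaway token and bailing out. A minimal sketch — the repeat threshold is my choice, and the sentinel string is spelled with &lt;code&gt;\x&lt;/code&gt; escapes for the angle brackets of the special token:&lt;br&gt;
&lt;/p&gt;

```python
# Workaround for the runaway-token bug: stop the stream if the
# sentinel repeats. "\x3c" / "\x3e" are the angle brackets of
# Gemma's unused24 special token.
SENTINEL = "\x3cunused24\x3e"

def guarded(stream, sentinel=SENTINEL, max_repeats=4):
    """Yield tokens from stream, swallowing the sentinel and
    stopping entirely once it repeats max_repeats times in a row."""
    run = 0
    for token in stream:
        if token == sentinel:
            run += 1
            if run == max_repeats:
                return  # bail out instead of streaming garbage forever
        else:
            run = 0
            yield token
```

&lt;p&gt;Wrap whatever token iterator your client gives you: &lt;code&gt;for tok in guarded(stream): ...&lt;/code&gt;. It costs nothing on healthy output and caps the damage when the bug triggers.&lt;/p&gt;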

&lt;h2&gt;
  
  
  Option 2: llama.cpp (More Control)
&lt;/h2&gt;

&lt;p&gt;If you want raw performance, custom quantization, or you're deploying on hardware Ollama doesn't support well, llama.cpp gives you full control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build llama.cpp
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ggerganov/llama.cpp
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON  &lt;span class="c"&gt;# or -DGGML_METAL=ON for Mac&lt;/span&gt;
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For CPU-only (no GPU acceleration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Download a GGUF Model
&lt;/h3&gt;

&lt;p&gt;Grab a pre-quantized model from Hugging Face. Unsloth provides well-tested GGUFs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 31B Q4_K_M — ~18 GB, good quality&lt;/span&gt;
huggingface-cli download unsloth/gemma-4-31B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  gemma-4-31B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; ./models

&lt;span class="c"&gt;# 26B MoE Q4_K_M — ~14 GB&lt;/span&gt;
huggingface-cli download unsloth/gemma-4-26B-MoE-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  gemma-4-26B-MoE-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; ./models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Run Inference
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; ./models/gemma-4-31B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Write a Rust function that implements a thread-safe LRU cache"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 512 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99  &lt;span class="c"&gt;# offload all layers to GPU&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;-ngl 99&lt;/code&gt; flag offloads all layers to your GPU. If you don't have enough VRAM, lower this number and llama.cpp will split layers between GPU and CPU. For the 31B Q4 model, I'd start with &lt;code&gt;-ngl 40&lt;/code&gt; on a 16 GB GPU and adjust from there.&lt;/p&gt;
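&lt;p&gt;You can turn that guesswork into arithmetic: divide the model file evenly across its layers and see how many fit after reserving room for the KV cache and compute buffers. A rough sketch — the 60-layer count and 2 GB reserve below are illustrative assumptions on my part, not published Gemma 4 specs:&lt;br&gt;
&lt;/p&gt;

```python
# Estimate a starting -ngl value for llama.cpp. All sizes in GB.
# Layer count and reserve are illustrative assumptions.

def suggest_ngl(vram_gb, model_gb, n_layers, reserve_gb=2.0):
    """How many of n_layers fit in vram_gb, keeping reserve_gb
    free for KV cache and compute buffers."""
    per_layer = model_gb / n_layers
    budget = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(budget / per_layer))

if __name__ == "__main__":
    # Hypothetical: 18 GB Q4 file, assumed 60 layers, 16 GB GPU
    print(suggest_ngl(vram_gb=16, model_gb=18, n_layers=60))
```

&lt;p&gt;With these numbers it suggests &lt;code&gt;-ngl 46&lt;/code&gt;; bump the reserve for long contexts and you land near the &lt;code&gt;-ngl 40&lt;/code&gt; starting point above.&lt;/p&gt;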

&lt;h3&gt;
  
  
  Run as a Server
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; ./models/gemma-4-31B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 8192
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you an OpenAI-compatible API at &lt;code&gt;http://localhost:8080/v1&lt;/code&gt;. Same client code as the Ollama example above, just change the port.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Tips for llama.cpp
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Gemma 4 advertises 256K context, but on consumer hardware you're realistically looking at ~20K tokens before memory pressure kills throughput. Qwen 3.5 27B manages ~190K on the same hardware, a 10x difference. Set &lt;code&gt;-c&lt;/code&gt; conservatively. (Compression techniques like &lt;a href="https://danilchenko.dev/posts/2026-03-27-google-turboquant-llm-compression-6x-zero-accuracy-loss/" rel="noopener noreferrer"&gt;Google's TurboQuant&lt;/a&gt; may help here eventually.)&lt;/li&gt;
&lt;li&gt;On Mac, use &lt;code&gt;-DGGML_METAL=ON&lt;/code&gt; during build. Metal acceleration gives 2-3x speedup over CPU on M-series chips.&lt;/li&gt;
&lt;li&gt;Increasing &lt;code&gt;-b&lt;/code&gt; (batch size) can improve throughput for server workloads. I use &lt;code&gt;-b 512&lt;/code&gt; for my setup.&lt;/li&gt;
&lt;/ul&gt;
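&lt;p&gt;The context ceiling is mostly a KV cache problem, and the cache grows linearly with context length, so you can estimate it yourself. A back-of-envelope sketch — the layer, head, and dimension counts are illustrative assumptions, not published Gemma 4 specs, and it ignores the sliding-window layers (which cap their own cache), so treat it as an upper bound:&lt;br&gt;
&lt;/p&gt;

```python
# Back-of-envelope KV cache: 2 (K and V) * layers * kv_heads *
# head_dim * context * bytes per element. Architecture numbers
# below are illustrative assumptions, not Gemma 4 specs.

def kv_cache_gb(context, n_layers=60, n_kv_heads=8, head_dim=128,
                bytes_per_elem=2):  # 2 bytes = FP16 cache
    elems = 2 * n_layers * n_kv_heads * head_dim * context
    return elems * bytes_per_elem / 1e9

if __name__ == "__main__":
    for ctx in (8_192, 20_000, 256_000):
        print(f"{ctx:7} tokens: ~{kv_cache_gb(ctx):.1f} GB")
```

&lt;p&gt;Under these assumptions a full 256K context costs tens of GB of cache on top of the weights, which is why the ~20K practical ceiling in the first bullet shouldn't surprise you.&lt;/p&gt;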

&lt;h2&gt;
  
  
  Option 3: vLLM (Production Serving)
&lt;/h2&gt;

&lt;p&gt;vLLM is the right choice if you're serving Gemma 4 to multiple users or building it into a production pipeline. It handles paged attention, continuous batching, and request scheduling automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install and Run
&lt;/h3&gt;

&lt;p&gt;The easiest path is Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ~/.cache/huggingface:/root/.cache/huggingface &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 &lt;span class="se"&gt;\&lt;/span&gt;
  vllm/vllm-openai:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; google/gemma-4-31b-it &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or install directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;vllm&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;0.20.0
vllm serve google/gemma-4-31b-it &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This starts an OpenAI-compatible API on port 8000.&lt;/p&gt;

&lt;h3&gt;
  
  
  The vLLM Performance Bug
&lt;/h3&gt;

&lt;p&gt;Fair warning: there's a known performance issue with Gemma 4 on vLLM right now. The E4B model generates at only ~9 tokens/s on an RTX 4090. That's terrible for a 4B parameter model.&lt;/p&gt;

&lt;p&gt;The root cause is Gemma 4's hybrid attention architecture. It uses 50 sliding-window layers plus 10 global attention layers, each with different head dimensions. vLLM's FlashAttention implementation can't handle this dual-dimension layout, so it falls back to a much slower Triton attention kernel.&lt;/p&gt;

&lt;p&gt;The vLLM team is tracking this in issue #38887. Until it's fixed, you'll get better throughput from llama.cpp for single-user workloads. vLLM still wins when you're serving multiple concurrent users because of its batching, but the per-request latency is worse than it should be.&lt;/p&gt;
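&lt;p&gt;If you want to check what throughput your own setup is getting, time a streamed completion. A sketch using the same OpenAI client as earlier — the one-token-per-chunk assumption is approximate, and the helper functions are mine:&lt;br&gt;
&lt;/p&gt;

```python
import time

def tokens_per_second(events):
    """Decode throughput from (timestamp, n_tokens) pairs, timed from
    the first event so prompt-processing time is excluded."""
    events = list(events)
    if len(events) == 0 or len(events) == 1:
        return 0.0
    elapsed = events[-1][0] - events[0][0]
    if elapsed == 0:
        return 0.0
    return sum(n for _, n in events[1:]) / elapsed

def benchmark(client, model="gemma4:26b", prompt="Count to fifty."):
    """Stream one completion from an OpenAI-compatible client and time it.
    Assumes roughly one token per stream chunk, which is approximate."""
    events = []
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for _chunk in stream:
        events.append((time.monotonic(), 1))
    return tokens_per_second(events)
```

&lt;p&gt;Point the client at port 8000 for vLLM, 11434 for Ollama, or 8080 for llama.cpp and compare the numbers yourself.&lt;/p&gt;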

&lt;h3&gt;
  
  
  Multi-GPU Setup
&lt;/h3&gt;

&lt;p&gt;For the 31B model on multiple GPUs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve google/gemma-4-31b-it &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 16384 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep the memory math in mind, though: at BF16 the 31B weights alone are roughly 62 GB, so two 16 GB cards won't hold them. A pair of 24 GB GPUs works with &lt;code&gt;--quantization fp8&lt;/code&gt;; avoiding quantization entirely takes two 40 GB+ cards.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which Model Should You Pick?
&lt;/h2&gt;

&lt;p&gt;After a week of running all four variants, here's my take:&lt;/p&gt;

&lt;p&gt;Most people should start with the 26B MoE. It activates only 3.8B parameters but delivers 82.3% on GPQA and 77.1% on LiveCodeBench. It fits on a single 16 GB GPU at Q4, and it handles coding assistance, general Q&amp;amp;A, and document analysis well.&lt;/p&gt;

&lt;p&gt;The 31B dense is worth the VRAM if you have it. The jump from 26B MoE to 31B dense is noticeable on hard math and complex multi-step reasoning. If you have 24 GB VRAM (RTX 3090/4090) or 32+ GB unified memory on a Mac, run this one.&lt;/p&gt;

&lt;p&gt;I reach for the E4B when I want speed. Quick code completions, simple questions where I want sub-second responses. At ~3 GB VRAM, it runs comfortably alongside everything else on my machine.&lt;/p&gt;

&lt;p&gt;The E2B? It runs on a Raspberry Pi, which is cool, but the quality gap to E4B is too large for anything beyond simple tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware Cheat Sheet
&lt;/h2&gt;

&lt;p&gt;Here's what actually works based on my testing and community reports:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;Best Model&lt;/th&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;th&gt;Tokens/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090 (24 GB)&lt;/td&gt;
&lt;td&gt;31B Dense&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~35 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3090 (24 GB)&lt;/td&gt;
&lt;td&gt;31B Dense&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~25 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4070 Ti (16 GB)&lt;/td&gt;
&lt;td&gt;26B MoE&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~30 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac M3 Pro (18 GB)&lt;/td&gt;
&lt;td&gt;26B MoE&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~15 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac M2 Ultra (64 GB)&lt;/td&gt;
&lt;td&gt;31B Dense&lt;/td&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td&gt;~20 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3060 (12 GB)&lt;/td&gt;
&lt;td&gt;E4B&lt;/td&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td&gt;~45 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raspberry Pi 5 (8 GB)&lt;/td&gt;
&lt;td&gt;E2B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;~3 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These numbers are from llama.cpp with full GPU offloading. Ollama performance is within 5-10% of these.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting Gemma 4 to Your Editor
&lt;/h2&gt;

&lt;p&gt;Once you have a local Gemma 4 instance running (Ollama, llama.cpp server, or vLLM), you can use it as a coding assistant in most editors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VS Code with Continue:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Gemma 4 26B Local"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemma4:26b"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Neovim with avante.nvim or codecompanion.nvim:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Point the OpenAI-compatible endpoint to your local server. Both plugins accept a custom base URL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Any tool that supports OpenAI API:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Base URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:11434/v1  (Ollama)&lt;/span&gt;
&lt;span class="na"&gt;Base URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:8080/v1  (llama.cpp)&lt;/span&gt;
&lt;span class="na"&gt;Base URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:8000/v1  (vLLM)&lt;/span&gt;
&lt;span class="na"&gt;API Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed"&lt;/span&gt; &lt;span class="s"&gt;(any string works)&lt;/span&gt;
&lt;span class="na"&gt;Model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4:26b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How much VRAM do I need to run Gemma 4?
&lt;/h3&gt;

&lt;p&gt;It depends on the model variant. The E2B runs in under 1.5 GB. The E4B needs about 3 GB at Q4. The 26B MoE needs ~14 GB at Q4. The 31B dense needs ~18 GB at Q4_K_M. On Macs, unified memory counts as VRAM, so a 16 GB MacBook can run the 26B MoE.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I run Gemma 4 on CPU only?
&lt;/h3&gt;

&lt;p&gt;Yes, but it's slow. llama.cpp supports CPU inference natively. Expect 2-5 tokens per second for the 26B model on a modern desktop CPU. The E4B manages ~8-12 tokens per second on CPU, which is usable for simple tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Gemma 4 better than Llama 3 for coding?
&lt;/h3&gt;

&lt;p&gt;On LiveCodeBench v6, Gemma 4 31B scores 80.0% versus Llama 3.3 70B's score in the low 60s. Gemma 4 is smaller and faster while producing better code. The 26B MoE at 77.1% also beats Llama 3.3 70B while using a fraction of the memory. And with &lt;a href="https://danilchenko.dev/posts/2026-04-08-meta-muse-spark-alexandr-wang-first-model/" rel="noopener noreferrer"&gt;Meta pivoting toward closed models with Muse Spark&lt;/a&gt;, Gemma 4 might be the best open alternative for a while.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Gemma 4 support vision and audio?
&lt;/h3&gt;

&lt;p&gt;The E2B and E4B variants support multimodal input: images and audio. The larger 26B and 31B models are text-only. If you need local vision capabilities, the E4B is your best option in the Gemma 4 family.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is Gemma 4 tool calling broken in Ollama?
&lt;/h3&gt;

&lt;p&gt;Gemma 4's hybrid attention architecture (mixing sliding-window and global attention layers with different head dimensions) exposed bugs in Ollama's tool call parser and streaming implementation. The Ollama team is working on a fix. For now, use vLLM or raw llama.cpp if you need function calling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;I've tried every major open model release since Llama 2, and Gemma 4's 26B MoE is the first one where I stopped reaching for API keys during normal coding work. 14 GB of VRAM, no license restrictions, and benchmark scores that would've been frontier-tier eighteen months ago. The tooling has rough edges right now. Tool calling in Ollama is broken, vLLM has a performance regression, and Apple Silicon users need to dodge a Flash Attention bug. Those will get fixed. The model quality won't go backwards. Start with &lt;code&gt;ollama run gemma4:26b&lt;/code&gt; and see where it gets you.&lt;/p&gt;

</description>
      <category>gemma4</category>
      <category>ollama</category>
      <category>llamacpp</category>
      <category>vllm</category>
    </item>
  </channel>
</rss>
