Jubin Soni

Run Gemma 4 on Your Laptop — A Hands-On Guide to Google's Latest Open Multimodal LLM

If you've been watching the open-source LLM space, you've probably noticed it's been a great couple of years. Llama, Mistral, Phi, Qwen — a whole zoo of models you can download and run on your own machine. Google's entry into that zoo is Gemma, and the fourth generation, Gemma 4 (released April 2, 2026), is the biggest leap yet: built from Gemini 3 research, multimodal (text + image + video + audio), 256K context, native function calling, configurable "thinking mode," and — finally — a clean Apache 2.0 license.

In this post we're going to:

  1. Understand what Gemma 4 actually is, with an architecture diagram
  2. Get it running on your laptop with Ollama in about 5 minutes
  3. Chat with it from the terminal
  4. Send it an image and ask questions about it
  5. Turn on thinking mode for harder problems
  6. Call it from a Python script like a real API
  7. Build a small project that glues it all together

No GPU rental, no API keys, no telemetry. Let's go.

Heads up: This guide assumes zero ML background. If you can install software and run a terminal command, you can do this.


What is Gemma 4?

Gemma is Google DeepMind's family of open-weight language models. "Open-weight" means the actual neural network weights — the giant matrices of numbers that make the model work — are freely downloadable. You can run them, modify them, fine-tune them, ship them in your product.

Gemma 4 brings several big changes over Gemma 3:

  • Apache 2.0 license. Earlier Gemma releases used a custom license with a Prohibited Use Policy that made some enterprise legal teams nervous. Gemma 4 is plain Apache 2.0 — unlimited commercial use, no MAU caps, no special permissions. This alone is a big deal for production deployments.
  • Mixture-of-Experts. A new 26B MoE variant activates only ~4B parameters per token, giving you 13B-class quality at 4B-class cost.
  • Thinking mode. A configurable reasoning mode where the model thinks step-by-step before answering. Toggle it on for hard problems, off for fast chat.
  • Native function calling. Built-in support for structured tool use — write an agent without needing prompt engineering hacks.
  • More modalities. Image, video frames, and (on the smaller E2B/E4B models) native audio input. Native system prompt support too.
  • Bigger context. 128K on the small models, 256K on the larger ones.

Model sizes at a glance

| Model | Disk (Ollama) | Active params | Total params | Multimodal | Context | Best for |
| --- | --- | --- | --- | --- | --- | --- |
| E2B | ~7.2 GB | ~2B | ~2.3B | text + image + audio | 128K | Phones, edge devices, browser |
| E4B | ~9.6 GB | ~4B | ~4.5B | text + image + audio | 128K | Most laptops (the sweet spot) |
| 26B A4B (MoE) | ~18 GB | ~4B | 26B | text + image | 256K | Consumer GPUs, agentic workloads |
| 31B Dense | ~20 GB | 31B | 31B | text + image | 256K | Workstations, highest-quality answers |

Two naming notes worth understanding:

  • E2B / E4B. The "E" stands for Effective parameters. These are dense edge-first models that use a trick called Per-Layer Embeddings (PLE — more on this below) to do more with fewer active parameters.
  • 26B A4B. This is the Mixture-of-Experts model. 26B parameters total, but only ~4B "activate" per forward pass. Latency and cost behave like a 4B model; quality is closer to a 13B dense model. Caveat: you still need to load all 26B into memory.
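
The active-vs-total distinction is worth making concrete. Here's a back-of-envelope sketch: memory tracks total parameters, per-token compute tracks active parameters. The ~0.7 bytes/param figure is inferred from the table above (~18 GB on disk for 26B params); it is an illustration, not an official number.

```python
# Back-of-envelope: MoE memory vs. compute.
# ~0.7 bytes/param is an assumption derived from the table above.

def memory_gb(total_params_b: float, bytes_per_param: float = 0.7) -> float:
    """Every expert must be resident in memory, even ones that never fire."""
    return total_params_b * bytes_per_param

def gflops_per_token(active_params_b: float) -> float:
    """Rough rule of thumb: ~2 FLOPs per active parameter per token."""
    return 2 * active_params_b

print(f"26B A4B (MoE): ~{memory_gb(26):.0f} GB resident, ~{gflops_per_token(4):.0f} GFLOPs/token")
print(f"31B dense:     ~{memory_gb(31):.0f} GB resident, ~{gflops_per_token(31):.0f} GFLOPs/token")
```

Same memory ballpark, roughly an eighth of the per-token compute — that's the MoE trade.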

For most readers on a laptop, E4B is the right starting point. It runs comfortably on a 16 GB Mac or any modern dev machine.

Gemma 4 vs the rest of the open-model zoo (May 2026)

| Model | Sizes | Multimodal | Context | License |
| --- | --- | --- | --- | --- |
| Gemma 4 | E2B / E4B / 26B MoE / 31B | text + image + video + audio (small) | 128K / 256K | Apache 2.0 |
| Llama 4 | various | text + image | 128K+ | Llama community license |
| Qwen 3.5 | various | text + image | 128K+ | Apache 2.0 |
| DeepSeek V4 Flash | MoE | text | 128K | MIT |

Gemma 4's pitch: the only family that spans phones to servers under Apache 2.0, with multimodal and audio in the same release.


The architecture (in plain English)

You don't need this section to use Gemma 4 — feel free to skip to the install steps. But if you've ever wondered what's actually happening when a multimodal model "sees" and "hears," here it is.
[Gemma 4 architecture diagram]

A few pieces worth understanding:

  • Three input paths. Text goes through a SentencePiece tokenizer (shared with Gemini). Images go through a vision encoder that handles variable aspect ratios and resolutions natively (no more square-only inputs like Gemma 3). On the E2B and E4B models, audio goes through a USM-style conformer encoder borrowed from Gemma 3n. All three paths produce tokens that get interleaved in a single stream — so you can freely mix text, images, and audio in any order in one prompt.
  • Alternating local/global attention. Most layers only look at a sliding window of recent tokens (cheap). A subset of layers attend to the full context (expensive but rare). This is the standard trick for keeping the KV cache from blowing up at 256K context.
  • Per-Layer Embeddings (PLE): the small-model secret. In a normal transformer, each token gets one embedding vector at input and that's all the residual stream has to work with. PLE adds a parallel pathway: for each token, every layer gets its own small conditioning vector from a lookup table. The embedding tables are large (lots of memory) but the "active" parameters per token stay small — that's why a 4-billion-active-parameter E4B can punch above its weight.
  • Mixture-of-Experts (26B A4B). The MoE layer has multiple "expert" feed-forward networks. A small router picks 2 of 8 (or similar) for each token. Total params = 26B (all loaded), active params per token = ~4B (only those fire). Pareto-optimal for quality-per-FLOP.
  • Thinking mode. When you include the special <|think|> token at the start of the system prompt, the model emits internal reasoning between <|channel>thought\n...<channel|> markers before the final answer. Disable it for fast chat; enable it for math, code, multi-step reasoning.
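
The MoE routing idea is simple enough to sketch in a few lines. This is a toy top-2 router for intuition only — the dimensions (`d_model=8`, `n_experts=8`) are made up, and it is not Gemma 4's actual implementation.

```python
import numpy as np

# Toy top-2 Mixture-of-Experts layer. Dimensions are illustrative,
# not Gemma 4's real configuration.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 8, 2

W_router = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ W_router                 # one score per expert
    top = np.argsort(logits)[-top_k:]     # router picks the 2 best experts
    w = np.exp(logits[top])
    w /= w.sum()                          # softmax over the winners only
    # Only the chosen experts run -- the other 6 contribute zero FLOPs.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_layer(rng.normal(size=d_model))
print(y.shape)  # (8,)
```

All 8 expert matrices sit in memory, but each token only pays for 2 of them — the whole 26B-total / 4B-active story in miniature.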

That's most of what's worth knowing. Now let's actually run it.


Step 1: Install Ollama

There are a few ways to run Gemma 4 locally, but the easiest by a mile is Ollama. Think of it as "Docker for LLMs" — it handles downloading the model, managing memory, GPU acceleration, and exposing a local API. You don't have to think about CUDA versions or PyTorch.

Install it:

  curl -fsSL https://ollama.com/install.sh | sh

Verify:

ollama --version

You should see a version number. Gemma 4 requires Ollama v0.20.0 or later — if you're on an older version, update first.


Step 2: Pull a Gemma 4 model

Download the default (E4B, ~9.6 GB):

ollama pull gemma4

This downloads about 9.6 GB. Grab a coffee. ☕

Other sizes if you want them:

ollama pull gemma4:e2b   # ~7.2 GB — smallest, for low-RAM machines
ollama pull gemma4:e4b   # ~9.6 GB — the default; same as `gemma4`
ollama pull gemma4:26b   # ~18 GB  — the MoE; 256K context
ollama pull gemma4:31b   # ~20 GB  — biggest dense model

Hardware reality check: On Apple Silicon, 16 GB unified memory handles E4B comfortably. NVIDIA users need the model to fit entirely in VRAM for GPU-accelerated inference. The 26B model fits on 24 GB but leaves very little headroom — treat it as the ceiling, not the target.

List what you've got:

ollama list

Step 3: Chat with it in the terminal

Easiest possible test:

ollama run gemma4

You'll get an interactive prompt:

>>> Explain what a hash map is, like I'm a junior dev.

Hit enter and watch it stream a response. To exit, type /bye.

That's it. You're running a state-of-the-art LLM locally with zero cloud dependency. Try:

  • "Write a Python function that finds duplicates in a list, with three different approaches and their tradeoffs."
  • "What's the difference between TCP and UDP? Use an analogy."
  • "Translate 'Where is the nearest train station?' into Japanese, Spanish, and Hindi."

Step 4: Send it an image

Gemma 4 can see. Drop any image file in your current directory, then:

ollama run gemma4
>>> Describe what's in this image: ./screenshot.png

Ollama loads the image, sends it through the vision encoder, and the model answers. Unlike Gemma 3 (which resized everything to 896×896), Gemma 4 handles variable aspect ratios and resolutions natively — so tall screenshots, wide diagrams, and high-res photos all work without manual cropping.

Try:

  • "What error is shown in this screenshot?" (paste a stack trace)
  • "What's the bounding box for the 'submit' button in this UI?" (Gemma 4 will answer in JSON — natively!)
  • "Read the handwriting in this note and transcribe it."

Step 5: Turn on thinking mode

For harder problems — multi-step math, complex code, logic puzzles — turn on thinking mode. Include the <|think|> token at the very start of your system prompt:

ollama run gemma4
>>> /set system "<|think|>You are a careful, methodical assistant."
>>> Three friends split a $73.42 dinner bill. Alice had a $12 appetizer, Bob had a $9 drink. The rest is shared. What does everyone pay?

The model will emit its reasoning in a <|channel>thought\n...<channel|> block before the final answer. For fast chat, leave the token out and the model answers directly.

🧠 When to use it: Code generation, math, multi-hop reasoning, agentic planning — yes. Single-turn factual questions, summarization, translation — no, it just adds latency.
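
When you call the model programmatically, you'll usually want to separate the reasoning from the final answer. A minimal parser, assuming the marker format described above (`<|channel>thought\n...<channel|>`) — if the real output format differs, adjust the regex accordingly:

```python
import re

# Split a thinking-mode response into (thought, answer), assuming the
# <|channel>thought\n ... <channel|> marker format described above.
def split_thinking(raw: str) -> tuple[str, str]:
    m = re.search(r"<\|channel>thought\n(.*?)<channel\|>", raw, re.DOTALL)
    if not m:
        return "", raw.strip()        # no thought block: it's all answer
    thought = m.group(1).strip()
    answer = raw[m.end():].strip()    # everything after the closing marker
    return thought, answer

raw = "<|channel>thought\nAlice owes $12 plus her share...<channel|>\nAlice pays $29.47."
thought, answer = split_thinking(raw)
print(answer)  # Alice pays $29.47.
```

Show users the answer; log the thought for debugging.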


Step 6: Call Gemma 4 from Python

A chat prompt is nice, but you're a developer — you want to call this thing from code. When Ollama is running, it exposes a local REST API on http://localhost:11434. There's also an official Python client.

Install it:

pip install ollama

Basic chat

import ollama

response = ollama.chat(
    model="gemma4",
    messages=[
        {"role": "system", "content": "You are a senior code reviewer. Be concise and direct."},
        {"role": "user",   "content": "Review this code:\n\ndef add(a, b):\n    return a+b"},
    ],
)

print(response["message"]["content"])

Streaming responses (ChatGPT-style)

import ollama

stream = ollama.chat(
    model="gemma4",
    messages=[{"role": "user", "content": "Write a haiku about debugging."}],
    stream=True,
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)

Sending an image

import ollama

response = ollama.chat(
    model="gemma4",
    messages=[{
        "role": "user",
        "content": "What's in this image?",
        "images": ["./my_photo.jpg"],
    }],
)

print(response["message"]["content"])

Thinking mode + function calling (the agentic combo)

This is where Gemma 4 actually starts feeling like a "real" agent. You declare your tools as JSON schemas, the model decides when to call them, and you execute the call and pass results back. No prompt engineering hacks needed.

import ollama

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Tokyo'"},
            },
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    # Pretend this hits a real API.
    return f"{city}: 22°C, partly cloudy"

response = ollama.chat(
    model="gemma4",
    messages=[
        {"role": "system", "content": "<|think|>You are a helpful weather assistant."},
        {"role": "user",   "content": "Should I bring an umbrella in Tokyo today?"},
    ],
    tools=tools,
)

# If the model wants to call a tool, execute it and feed the result back:
for tool_call in response["message"].get("tool_calls", []):
    name = tool_call["function"]["name"]
    args = tool_call["function"]["arguments"]
    if name == "get_weather":
        result = get_weather(**args)
        # Send result back for the model to finalize its answer
        followup = ollama.chat(
            model="gemma4",
            messages=[
                {"role": "user", "content": "Should I bring an umbrella in Tokyo today?"},
                response["message"],
                {"role": "tool", "content": result, "name": name},
            ],
        )
        print(followup["message"]["content"])

Raw HTTP (no Python client needed)

For any other language:

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'

Same JSON shape works from Node, Go, Rust, your shell — anything that can make an HTTP request.
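
For instance, here's the same call from Python with only the standard library — no `ollama` package. It assumes Ollama is serving on the default port 11434:

```python
import json
import urllib.request

# Build the same /api/chat payload the curl example uses.
def build_payload(model: str, prompt: str) -> bytes:
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()

# Send it to the local Ollama server and return the reply text.
def chat(prompt: str, model: str = "gemma4") -> str:
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# chat("Hello!")  # requires a running Ollama server
```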


A small project: folder-watching image describer

Here's a useful ~30-line script. It watches a folder, and any new image dropped in gets automatically described by Gemma 4. Great for accessibility tools, content moderation prototypes, or just learning.

import os, time
import ollama

WATCH_DIR = "./inbox"
os.makedirs(WATCH_DIR, exist_ok=True)
SEEN = set(os.listdir(WATCH_DIR))

print(f"📁 Watching {WATCH_DIR}/ — drop an image in to describe it.")
print("   (Ctrl+C to stop)\n")

IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".webp", ".gif")

try:
    while True:
        current = set(os.listdir(WATCH_DIR))
        new_files = sorted(current - SEEN)

        for filename in new_files:
            if not filename.lower().endswith(IMAGE_EXTS):
                continue

            path = os.path.join(WATCH_DIR, filename)
            print(f"📸 New image: {filename}")

            response = ollama.chat(
                model="gemma4",
                messages=[{
                    "role": "user",
                    "content": (
                        "Describe this image in 2-3 sentences. "
                        "Mention any visible text. Be specific."
                    ),
                    "images": [path],
                }],
            )

            print(f"{response['message']['content']}\n")

        SEEN = current
        time.sleep(2)
except KeyboardInterrupt:
    print("\nStopped.")

Run it, drag images into the inbox/ folder, and watch descriptions appear. That's a real, useful, completely local AI tool — written in 30 lines.


Things to know before shipping anything serious

A few honest caveats:

| Caveat | Why it matters |
| --- | --- |
| Hallucination | Local models still confidently make things up. Don't trust factual claims without verification. Thinking mode reduces this for reasoning tasks but doesn't eliminate it. |
| CPU latency | Expect 1–3 tokens/sec on a CPU-only laptop with E4B. A GPU gives 3–10× speedup. |
| Context costs RAM | 256K context is real, but actually filling it eats memory. Most use cases need <16K tokens. |
| MoE memory | The 26B MoE runs fast (only 4B active per token), but you still need to load all 26B into RAM. Don't confuse active params with memory footprint. |
| Audio is small-model only | E2B/E4B have native audio input. The 26B and 31B models do not. |
| Apache 2.0 ≠ no responsibilities | The license is permissive, but you're still on the hook for safety, bias, and compliance in whatever you ship. |
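
The "context costs RAM" row deserves numbers. Here's the standard KV-cache back-of-envelope; the layer and head counts below are placeholders, not Gemma 4's published configuration, and the alternating local/global attention described earlier would shrink the real figure because sliding-window layers cap their cache length.

```python
# KV-cache size estimate. Layer/head dims are ILLUSTRATIVE placeholders,
# not Gemma 4's real config -- swap in actual values if you have them.
def kv_cache_gb(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per: int = 2) -> float:
    # 2x for keys AND values; fp16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per / 1e9

for ctx in (16_000, 128_000, 256_000):
    print(f"{ctx:>8,} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```

Under these assumptions, a fully packed 256K context costs tens of gigabytes on top of the weights — which is why "supports 256K" and "you should routinely use 256K" are different claims.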

