Audio ASR in 3 languages, image understanding, full-stack app generation, coding, and agentic behavior -- all running on a MacBook M4 Pro with 24GB RAM.
Interactive version with playable audio, live charts, and the working React app: gemma4-benchmark.pages.dev
Google just released Gemma 4 -- their new family of open-source multimodal models. Four sizes, Apache-2.0 licensed, supports text + image + audio.
I spent a day testing every variant. Real audio files. Real images. Code that has to compile and run. Here is my honest report.
The Gemma 4 Family
- E2B -- Dense 2.3B, Text/Image/Audio, 4 GB at 4-bit. Phones and edge.
- E4B -- Dense 4.5B, Text/Image/Audio, 5.5 GB at 4-bit. Laptops.
- 26B-A4B -- MoE 4B active/26B total, Text/Image, 16-18 GB at 4-bit.
- 31B -- Dense 31B, Text/Image, 17-20 GB at 4-bit. Maximum quality.
Speed Benchmarks
| Runtime | E2B | E4B | 26B-A4B | 31B |
|---|---|---|---|---|
| Ollama | 95 tok/s | 57 tok/s | ~2 tok/s (swapping) | won't fit |
| Unsloth MLX | 81 tok/s (3.6 GB) | 49 tok/s (5.6 GB) | -- | -- |

Ollama is 15-20% faster; Unsloth MLX uses about 40% less memory.
Audio ASR: 3 Languages
Tested via Ollama's OpenAI-compatible endpoint. Only E2B and E4B support audio input.
Listen to all test audio samples: Audio Player
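For reference, here is a minimal sketch of how these ASR requests were shaped. The `input_audio` content part follows the OpenAI chat-completions audio format; whether Ollama forwards it to `gemma4:e4b` exactly this way is an assumption based on this test setup, not documented behavior.

```python
import base64

def build_asr_request(audio_bytes, audio_format="wav",
                      prompt="Transcribe this audio verbatim."):
    """Build an OpenAI-style chat payload with inline base64 audio."""
    return {
        "model": "gemma4:e4b",  # only E2B/E4B accept audio
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "input_audio",
                 "input_audio": {
                     "data": base64.b64encode(audio_bytes).decode("ascii"),
                     "format": audio_format,
                 }},
            ],
        }],
    }

# With a local server running (ollama serve), send it like this:
# import requests
# payload = build_asr_request(open("english_sample.wav", "rb").read())
# r = requests.post("http://localhost:11434/v1/chat/completions", json=payload)
# print(r.json()["choices"][0]["message"]["content"])
```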
English ASR
E4B (1.0s): Perfect transcription. Every word correct with punctuation.
E2B (2.8s): Garbled -- missing words, no punctuation.
French ASR
E4B (1.6s): Perfect transcription with all French accents correct.
E2B (4.1s): Fragmented, missing most of the sentence.
Arabic ASR
E4B (6.0s): Perfect Arabic transcription -- every word correct.
E2B (6.0s): Garbled -- wrong words, disordered.
Speech Translation (E4B)
French to English: "All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience..."
Arabic to English: "Hello, I am an artificial intelligence model. Today we will test speech recognition in the Arabic language..."
E4B is dramatically better than E2B for audio across all 3 languages.
Image Understanding
Test 1: Thai Temple -- Landmark Identification
E4B (54 tok/s): Thailand, Bangkok, Wat Phra Kaew (Temple of the Emerald Buddha) within the Grand Palace.
E2B (88 tok/s): Thailand, Bangkok, Grand Palace (less specific).
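The image tests use the same endpoint with a vision-style payload. The data-URI `image_url` content part below is the standard OpenAI format; the model name and prompt mirror this setup and are assumptions, not official Gemma 4 documentation.

```python
import base64

def build_image_request(image_bytes, mime="image/jpeg",
                        prompt="Which country, city, and landmark is this?"):
    """Build an OpenAI-style vision payload with an inline data-URI image."""
    data_uri = (f"data:{mime};base64,"
                + base64.b64encode(image_bytes).decode("ascii"))
    return {
        "model": "gemma4:e4b",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }],
    }
```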
Test 2: AI-Generated Tokyo + Japanese OCR
AI-generated with nano-banana / Gemini
Both models correctly read Japanese kanji: 新宿ラーメン通り (Shinjuku Ramen Street)
Test 3: Venice Seagull
E4B: "A magnificent seagull perches watchfully atop a sculpted pedestal. The backdrop is a rich study in contrasting architectural styles..."
Full-Stack App Generation
E4B generated a 155-line working React + Tailwind Task Manager:
Try it live: gemma4-benchmark.pages.dev/task_manager.html
E2B failed -- code fragments instead of single file.
Coding: Compile and Run
| Script | E2B | E4B |
|---|---|---|
| Fibonacci | PASS | PASS |
| Sieve of Eratosthenes | PASS | PASS |
| JSON processor | PASS | PASS |
| HTTP request | PASS | PASS |
| React single file | FAIL | PASS |
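For a sense of the difficulty level, here is a reference solution for the sieve task, of the kind both models produced (my own reconstruction, not verbatim model output):

```python
def sieve_of_eratosthenes(limit):
    """Return all primes <= limit using the classic sieve."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            # Mark multiples starting at p*p; smaller ones are already marked.
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return [n for n in range(limit + 1) if is_prime[n]]

print(sieve_of_eratosthenes(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```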
Agentic Multi-Step Reasoning
6-step blog platform design. Both completed 6/6 steps. E4B output was 57% longer with more detail.
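The multi-step test can be driven with a simple harness: each step is sent together with the full conversation so far, so the model builds on its own earlier answers. This is a hypothetical sketch; `ask` is a stand-in for the actual endpoint call, and the step prompts are illustrative.

```python
STEPS = [
    "Step 1: List the core entities of a blog platform.",
    "Step 2: Design the database schema for those entities.",
    # ... steps 3-6 elided
]

def run_agentic_task(steps, ask):
    """Feed steps sequentially, carrying the full history each time."""
    history = [{"role": "system",
                "content": "You are designing a blog platform step by step."}]
    replies = []
    for step in steps:
        history.append({"role": "user", "content": step})
        reply = ask(history)  # e.g. POST history to /v1/chat/completions
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```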
Why 26B Fails on 24GB
Community reports from r/LocalLLaMA suggest Gemma 4 has a KV cache memory issue (not verified on our hardware):
- 31B at 262K context: ~22GB just for KV cache (on top of model)
- Google did not adopt KV-reducing techniques from Qwen 3.5
- Workaround (llama.cpp-style flags): `--ctx-size 8192 --cache-type-k q4_0 --parallel 1`
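The community estimate can be sanity-checked with back-of-envelope arithmetic. The layer and head counts below are illustrative placeholders for a ~31B dense model, not published Gemma 4 config values:

```python
def kv_cache_gib(layers, kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
    """KV cache size in GiB: keys + values, all layers, full context."""
    total_bytes = 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem
    return total_bytes / 2**30

# Assumed config: 48 layers, 4 grouped-query KV heads of dim 128,
# fp16 cache (2 bytes/elem), 262K context.
print(kv_cache_gib(48, 4, 128, 262144))  # 24.0 GiB, near the ~22 GB report
# Quantizing the cache to 4-bit (q4_0) cuts the per-element cost ~4x.
```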
Official Benchmarks
(Google's published benchmark chart is reproduced in the interactive version: gemma4-benchmark.pages.dev)
Final Verdict
E4B -- The Sweet Spot -- 8.5/10
Perfect ASR in 3 languages. Working React app. Japanese OCR. 57 tok/s. 5.6 GB.
E2B -- Speed Demon -- 7/10
95 tok/s. 3.6 GB. Python works. Audio garbled. Failed complex HTML gen.
26B-A4B -- Heartbreaker -- 2/10 on 24GB
Amazing benchmarks (88.3% AIME). ~2 tok/s on 24GB. Needs 32GB+.
Quick Start
```shell
brew install ollama
ollama pull gemma4:e4b
ollama run gemma4:e4b
```
For a 24GB MacBook, `ollama run gemma4:e4b` is the answer.
Tested April 3, 2026. MacBook Pro M4 Pro, 24GB, macOS Sequoia.
Interactive version: gemma4-benchmark.pages.dev
Sources: Google Model Card | HuggingFace Blog | Ollama | Unsloth Guide