Google released Gemma 4 yesterday under Apache 2.0. The benchmarks looked incredible. The community went to work. Here's what we're actually seeing.
I spent the last 24 hours reading through forums, running my own fine-tuning experiments, and collecting reports from dozens of early adopters. This is a summary of the real-world findings, the open questions, and where I think this model family lands.
The Good News First
Apache 2.0 is a big deal. Previous Gemma releases used a custom Google license that technically allowed them to restrict usage. Apache 2.0 removes that uncertainty entirely. For anyone building commercial products on open models, this matters more than any benchmark number.
Multilingual quality is genuinely strong. Users testing German, Arabic, Vietnamese, and French are reporting that Gemma 4 outperforms Qwen 3.5 in non-English tasks. One user called it "in a tier of its own" for translation. Another said it "makes translategemma feel outdated instantly." For global enterprise deployments, this is a significant differentiator.
The ELO score tells a different story than benchmarks. The 31B model scored 2150 on LMArena, which puts it above GPT-OSS-120B and comparable to GPT-5-mini. But side-by-side benchmark tables show it roughly tying with Qwen 3.5 27B. The gap between ELO (human preference) and automated benchmarks suggests Gemma 4 produces responses that humans prefer even when raw accuracy is similar.
The E2B model is absurd. Multiple users confirmed that the 2.3B effective parameter model beats Gemma 3 27B on most benchmarks. A user running it on a basic i7 laptop with 32GB RAM reported it was "not only faster, it gives significantly better answers" than Qwen 3.5 4B for finance analysis.
The Problems Nobody Warned About
Inference Speed
This is the elephant in the room. Multiple users are reporting that Gemma 4's MoE model (26B-A4B) runs significantly slower than Qwen 3.5's equivalent:
- One user: 11 tokens/sec on Gemma 4 26B-A4B vs 60+ tokens/sec on Qwen 3.5 35B-A3B on the same 5060 Ti 16GB
- Another confirmed higher VRAM usage for context at the same quantization level
- Someone running on a DGX Spark asked "why is it super slow?" with no clear answer yet
For the dense 31B model, users are reporting 18-25 tokens/sec on dual NVIDIA GPUs (5070 Ti + 5060 Ti), which is reasonable but not fast.
The speed gap against Qwen 3.5 is concerning for production deployments where latency matters.
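To compare runners apples-to-apples, it helps to measure decode throughput the same way on both models. A minimal sketch, assuming a `generate_fn(prompt)` callable that returns the generated token ids (a hypothetical interface; adapt it to llama.cpp, vLLM, or whatever runner you use):

```python
import time

def tokens_per_second(generate_fn, prompt, n_runs=3):
    """Average decode throughput over several runs.

    `generate_fn(prompt)` is a stand-in for your runner's generate
    call and is assumed to return the list of generated token ids.
    Note this measures end-to-end time, so prompt processing is
    included; use a long generation to make decode dominate.
    """
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate_fn(prompt)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    return sum(rates) / len(rates)
```

Running the same harness against both Gemma 4 26B-A4B and Qwen 3.5 35B-A3B at the same quantization removes most of the "my setup is different" noise from these comparisons.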
VRAM Consumption
Gemma models have historically been VRAM-hungry for context, and Gemma 4 appears to continue this pattern. One user noted they could only fit Gemma 3 27B Q4 with 20K context on a 5090, while Qwen 3.5 27B Q4 fit with 190K context on the same card.
For the 256K context window to be useful in practice, you need significantly more VRAM than competing models at the same parameter count.
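Most of that context cost is the KV cache, which you can estimate before buying hardware. A back-of-the-envelope sketch; the layer/head/dim numbers below are illustrative assumptions, not published Gemma 4 specs, so substitute the values from the model's config.json:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_len, bytes_per_elem=2):
    """Estimate KV-cache size: one K and one V tensor per layer,
    each of shape (num_kv_heads, context_len, head_dim), at
    fp16/bf16 (2 bytes per element) by default."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative config only (assumed, not Gemma 4's real numbers):
gib = kv_cache_bytes(num_layers=48, num_kv_heads=16, head_dim=128,
                     context_len=128_000) / 2**30
# → ~46.9 GiB for 128K context with these assumed numbers
```

The formula makes the user reports plausible: models with more KV heads or less aggressive grouped-query attention pay dramatically more VRAM per token of context, independent of parameter count.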
Fine-Tuning Compatibility
As someone who attempted QLoRA fine-tuning within hours of release, I can confirm the tooling is not ready. Three issues hit immediately:
- HuggingFace Transformers didn't recognize the `gemma4` architecture (required installing from source)
- PEFT couldn't handle `Gemma4ClippableLinear`, a new layer type in the vision encoder (required a monkey-patch)
- A new `mm_token_type_ids` field is required during training even for text-only data (required a custom data collator)
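The third issue is the easiest to work around. Here's a minimal sketch of a text-only collator that injects the missing field. Two assumptions to flag: it returns plain Python lists (swap in tensor stacking for a real Trainer), and it uses a zero id to mark text tokens, which follows the convention of other multimodal models but should be verified against the released processor:

```python
def gemma4_text_collator(features):
    """Batch pre-padded text-only examples and inject the
    mm_token_type_ids field that Gemma 4 reportedly requires even
    without image inputs.

    Assumptions: all examples are already padded to the same
    length, and 0 is the "text token" type id (unverified -- check
    the model's processor). Returns plain lists for clarity;
    stack into tensors before handing to a Trainer.
    """
    batch = {key: [f[key] for f in features] for key in features[0]}
    if "mm_token_type_ids" not in batch:
        batch["mm_token_type_ids"] = [[0] * len(ids)
                                      for ids in batch["input_ids"]]
    return batch
```

Passing this as `data_collator` to the Trainer was enough to get my QLoRA run moving; once the upstream fix lands, the workaround can be deleted.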
I've filed issues on both huggingface/peft and huggingface/transformers. Both received responses within hours, and a fix for the mm_token_type_ids issue is already in progress. Unsloth also has day-one support if you prefer that path.
The community question "how easy is it to fine-tune compared to Gemma 3?" currently has no good answer beyond "harder, but solvable."
Stability Questions
One user testing the non-quantized 31B in Google AI Studio reported "infinite loops and no possibility to read text from the image." Another found that the model jailbreaks with basic system prompts. A third reported Mac hard crashes when loading either the 31B or 26B in LM Studio.
These are early reports and may be resolved with updates, but they're worth noting for anyone considering production deployment.
The Benchmark Reality
The community quickly assembled side-by-side comparisons. Here's the consolidated picture:
| Metric | Gemma 4 31B | Qwen 3.5 27B | Winner |
|---|---|---|---|
| MMLU-Pro | 85.2% | 86.1% | Qwen |
| GPQA Diamond | 84.3% | 85.5% | Qwen |
| LiveCodeBench v6 | 80.0% | 80.7% | Tie |
| Codeforces ELO | 2150 | 1899 | Gemma |
| TAU2-Bench | 76.9% | 79.0% | Qwen |
| MMMLU | 88.4% | 85.9% | Gemma |
| HLE (no tools) | 19.5% | 24.3% | Qwen |
Gemma 4 wins on competitive coding (ELO) and multilingual (MMMLU). Qwen 3.5 wins on most reasoning benchmarks. Neither is a clear overall winner.
The honest take from one top commenter: "Gemma 4 ties with Qwen, if not Qwen being slightly ahead. And Qwen 3.5 is more compute efficient too."
What the Community Is Waiting For
QAT versions. Gemma 3 QAT (quantization-aware training) models arrived weeks after the initial release. The community expects the same for Gemma 4, and these will likely improve quantized inference quality significantly.
Abliterated/uncensored versions. At least one already exists. Multiple users are requesting more. The Apache 2.0 license makes this fully legal now.
Larger models. There were rumors of a 120B model that didn't materialize. Several users expressed disappointment. A 100B+ MoE from Google could be transformative.
A 9-12B dense model. The gap between E4B (4.5B effective) and 26B MoE leaves a hole in the lineup. Gemma 3's 12B model was popular, and there's no direct upgrade path.
Where This Leaves Us
Gemma 4 is not the clear winner the benchmarks suggested. But it's not trying to be.
The real value proposition is the combination of:
- Apache 2.0 (fully permissive, no restrictions)
- Multilingual excellence (best in class for non-English)
- Base models available (fine-tuning ready on day one)
- Size diversity (2B to 31B covers edge to server)
- Native system prompts and function calling (production-ready features)
For English-only, benchmark-optimized, speed-critical deployments, Qwen 3.5 is still the better choice. For multilingual, legally unrestricted, fine-tuning-focused use cases, Gemma 4 has a compelling argument.
The speed and VRAM issues need to be addressed. The fine-tuning tooling needs a week or two to catch up. And we need QAT quantizations before the smaller models can truly compete on efficiency.
But make no mistake: releasing a 31B dense model under Apache 2.0 that rivals models 4-10x its size on human preference benchmarks is a significant moment for open AI. Google is finally competing on openness, not just capability.
I'll be publishing our fine-tuning results (including the day-zero bug fixes) and benchmark comparisons as the training run completes. Follow along if you're interested.
Nathan Maine builds AI systems for regulated industries. He is currently fine-tuning Gemma 4 31B for domain-specific deployment and has filed bug reports on huggingface/peft and huggingface/transformers for day-zero compatibility issues.
