On April 2, ARM published a blog post announcing Gemma 4 optimised for ARM devices via XNNPACK + KleidiAI, reporting 5.5× prefill speedup and 1.6× faster decode. Those numbers target Armv9 chips with SME2 — flagship phone silicon.
I wanted to see what happens on the broader ARM ecosystem. So I took Gemma 4 E2B through the full PyTorch edge deployment pipeline — torch.export → torchao quantization (INT8 dynamic activations + INT4 weights) → ExecuTorch XNNPACK backend → KleidiAI — and deployed it on a Raspberry Pi 5 (Cortex-A76, 8GB, no SME2).
As far as I can tell, this is the first publicly documented Gemma 4 deployment through ExecuTorch on any hardware.
It works. The output is token-exact: a 9/9 token match against the FP32 reference. But I hit 14 issues along the way, and the performance story on non-SME2 hardware is very different from ARM's published benchmarks.
The numbers
| Setup | Decode speed |
|---|---|
| ExecuTorch + XNNPACK on Pi 5 (8GB) | 0.87 tok/s |
| llama.cpp on Pi 5 (16GB)* | 6.71 tok/s |
| ExecuTorch + XNNPACK on Mac M1 Pro | 8.66 tok/s |
*llama.cpp figure from the potato-os/core benchmark (April 4, 2026, Pi 5 16GB). The RAM configuration differs, but decode speed is typically bound by memory bandwidth rather than capacity, so the comparison is reasonable.
The Pi 5 result is 7.7× slower than llama.cpp. But the Mac result tells a different story — on macOS arm64 where XNNPACK's fused kernel path works, ExecuTorch runs at competitive speed. The gap is specific to Linux aarch64 (Pi), not ARM in general.
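For anyone who wants to check the ratios, here is the arithmetic as a small sketch. The three speeds are copied from the table above; the dictionary keys and the `ratio` helper are names I made up for illustration. Note that the Mac-versus-Pi ratio compares different silicon, so it is only a rough indicator of how much the fused path matters.

```python
# Decode speeds in tokens/second, copied from the table above.
decode_tok_per_s = {
    "executorch_pi5": 0.87,
    "llamacpp_pi5": 6.71,
    "executorch_m1_pro": 8.66,
}


def ratio(faster: str, slower: str) -> float:
    """How many times faster `faster` decodes than `slower`."""
    return decode_tok_per_s[faster] / decode_tok_per_s[slower]


pi_gap = ratio("llamacpp_pi5", "executorch_pi5")
mac_vs_pi = ratio("executorch_m1_pro", "executorch_pi5")

print(f"llama.cpp vs ExecuTorch on Pi 5: {pi_gap:.1f}x")
print(f"ExecuTorch on M1 Pro vs Pi 5:    {mac_vs_pi:.1f}x")
```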
Why the Pi 5 is slow (one bug)
ExecuTorch 1.2.0's XNNPACK backend rejects fused INT4 subgraphs on aarch64 with xnn_status_invalid_parameter. The workaround is per_op_mode=True, which disables kernel fusion entirely. Kernel fusion is exactly where KleidiAI's INT4 matmul speedup lives — without it, every operator runs individually with full dispatch overhead.
99.5% of wall time is in C++ XNNPACK kernels, not Python. A C++ runner wouldn't help. The bottleneck is fusion, not language overhead.
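A quick Amdahl's-law estimate shows why a C++ runner is not worth chasing. This is a sketch under a generous assumption: that the entire 0.5% outside the XNNPACK kernels is Python overhead and that a C++ runner would remove all of it. The function name is mine, not part of any library.

```python
def max_speedup(removable_fraction: float) -> float:
    """Amdahl's-law upper bound when `removable_fraction` of wall time is eliminated."""
    if not 0.0 <= removable_fraction < 1.0:
        raise ValueError("removable_fraction must be in [0, 1)")
    return 1.0 / (1.0 - removable_fraction)


kernel_fraction = 0.995            # share of wall time inside XNNPACK kernels (from the profile)
python_fraction = 1.0 - kernel_fraction  # assumption: all of the remainder is removable
measured_tok_per_s = 0.87          # Pi 5 decode speed from the table

best_case = max_speedup(python_fraction)
print(f"best-case speedup: {best_case:.4f}x")
print(f"best-case decode:  {measured_tok_per_s * best_case:.3f} tok/s")
```

Even in that best case the Pi 5 would move from 0.87 to roughly 0.874 tok/s, which is nowhere near the llama.cpp figure.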
This isn't a criticism of ARM or ExecuTorch. The XNNPACK + KleidiAI pipeline is clearly fast on SME2 hardware. But the Armv8 ecosystem — Pi, older phones, embedded boards — is massive, and this is the kind of gap that only surfaces through independent testing on diverse hardware.
Three bugs that will save you days
Out of the 14 issues I documented, these three cost me the most time.
torchao 0.17 has no CPU-compatible INT4 weight-only path
The legacy int4_weight_only() factory is removed in torchao 0.17. Its replacement, Int4WeightOnlyConfig, requires Meta's mslk CUDA kernel library. Every INT4 packing format in 0.17 needs CUDA, XPU, or NPU — there is no CPU path.
If you're doing CPU-side model preparation for ExecuTorch edge deployment (which is... the primary use case), this blocks you completely.
Workaround: Use Int8DynamicActivationIntxWeightConfig(weight_dtype=torch.int4, weight_granularity=PerGroup(128)). This gives you INT8 dynamic activations with INT4 weights — the standard scheme XNNPACK and KleidiAI actually target.
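To make "INT4 weights with a group size of 128" concrete, here is a pure-Python sketch of symmetric per-group quantization. It is an illustration of the general idea only, not torchao's implementation: the symmetric range [-8, 7], the max-abs scale choice, and both function names are my assumptions.

```python
import random


def quantize_per_group(weights, group_size=128, qmin=-8, qmax=7):
    """Symmetric per-group quantization: one float scale per `group_size` weights."""
    if len(weights) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group onto qmax.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            q_values.append(max(qmin, min(qmax, q)))  # clamp to the 4-bit range
    return q_values, scales


def dequantize_per_group(q_values, scales, group_size=128):
    """Recover approximate floats: each integer times its group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


random.seed(0)
row = [random.uniform(-1.0, 1.0) for _ in range(256)]  # toy weight row, two groups

q_row, row_scales = quantize_per_group(row)
restored = dequantize_per_group(q_row, row_scales)
max_error = max(abs(a - b) for a, b in zip(row, restored))

print(f"groups: {len(row_scales)}, int range: [{min(q_row)}, {max(q_row)}]")
print(f"max round-trip error: {max_error:.4f}")
```

Smaller groups mean more scales to store but tighter error bounds, since each scale only has to cover the spread of its own group.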
torch.export.save silently corrupts large files
If you pass a pathlib.Path to torch.export.save and the export exceeds 2 GB, the zip central directory gets truncated. The save reports success. torch.export.load then fails with a cryptic PytorchStreamReader failed finding central directory error. You'll blame your model, your export config, your quantization — everything except the save call, because it told you it worked.
Workaround: Pass an open file handle instead of a Path, and verify the save immediately by reloading:
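import torch

with open("model.pt2", "wb") as f:
    # exported_program is the ExportedProgram returned by torch.export.export
    torch.export.save(exported_program, f)

# Verify immediately: reload and fail fast if the archive is truncated
torch.export.load("model.pt2")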
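If a full reload is inconvenient, for example in a CI step without the model code installed, a standard-library check can catch the same failure. This is a sketch that assumes the .pt2 file is an ordinary zip archive, as the central-directory error suggests. The helper name `archive_is_intact` is hypothetical, and the demo uses a small throwaway archive rather than a real export.

```python
import os
import tempfile
import zipfile


def archive_is_intact(path: str) -> bool:
    """Return True if `path` is a readable zip whose members all pass CRC checks."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first corrupt member, or None.
            return archive.testzip() is None
    except zipfile.BadZipFile:
        # Raised when the end-of-central-directory record is missing or damaged.
        return False


# Demo: a valid archive versus a copy with its tail chopped off.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.zip")
    with zipfile.ZipFile(good, "w") as z:
        z.writestr("data.bin", b"x" * 4096)

    bad = os.path.join(tmp, "truncated.zip")
    with open(good, "rb") as src, open(bad, "wb") as dst:
        dst.write(src.read()[:-32])  # drop the end of the central directory

    results = (archive_is_intact(good), archive_is_intact(bad))

print(results)
```

On a multi-gigabyte file `testzip()` still reads every member, so it is not free, but it needs neither torch nor the model definition.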
HuggingFace's StaticCache breaks ExecuTorch lowering
transformers.StaticCache holds KV-cache tensors as plain Python attributes, not as nn.Module buffers. During torch.export, these tensors get lifted as constants. ExecuTorch's run_decompositions then rejects them because constants can't be mutated — but the cache is mutated every forward pass.
HuggingFace's source code actually documents this (early_initialization comment), but there's no formal fix for the ExecuTorch interaction.
Workaround: Subclass StaticCache to also inherit from nn.Module. Register KV tensors and the cumulative-length counter as buffers. Wrap the layer caches in nn.ModuleList. This makes them visible to torch.export as mutable buffers instead of constants.
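The distinction between a plain attribute and a registered buffer is easy to see in isolation. The sketch below does not touch transformers.StaticCache and is not the subclass from the repo; `PlainAttrCache`, `BufferCache`, and `TinyDecoder` are hypothetical toy classes that only show which tensors PyTorch tracks as module state.

```python
import torch
from torch import nn


class PlainAttrCache:
    """Mimics the problem: the tensor is an ordinary Python attribute."""

    def __init__(self, max_len: int, dim: int):
        self.k_cache = torch.zeros(max_len, dim)


class BufferCache(nn.Module):
    """Toy per-layer cache with its tensors registered as buffers."""

    def __init__(self, max_len: int, dim: int):
        super().__init__()
        self.register_buffer("k_cache", torch.zeros(max_len, dim))
        self.register_buffer("cumulative_length", torch.zeros(1, dtype=torch.long))

    def update(self, new_keys: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # In-place writes to buffers, the kind of mutation a KV cache performs.
        self.k_cache.index_copy_(0, positions, new_keys)
        self.cumulative_length.add_(new_keys.shape[0])
        return self.k_cache


class TinyDecoder(nn.Module):
    """Holds the layer caches in an nn.ModuleList so they are discoverable."""

    def __init__(self, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList([BufferCache(8, 4) for _ in range(num_layers)])


decoder = TinyDecoder()
buffer_names = [name for name, _ in decoder.named_buffers()]
print(buffer_names)

decoder.layers[0].update(torch.ones(2, 4), torch.tensor([0, 1]))
print(int(decoder.layers[0].cumulative_length))
```

A `PlainAttrCache` held by a module would contribute nothing to `named_buffers()`, which matches the behaviour described above: its tensors are invisible as mutable state.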
The export was easier than expected
Going in, I expected torch.export to be the hardest phase. Gemma 4 E2B has unusual architecture features — embed_tokens_per_layer (2.35B params in a per-layer embedding table, which is the "E2B" trick), shared RoPE as a sibling of the decoder layers, and sliding-window attention alternating with full attention across 35 layers.
I wrote up a list of seven export hazards from source inspection: dict-typed shared KV states, dynamic getattr in rotary embeddings, a @dynamic_rope_update decorator, and more.
None of them manifested. Transformers 5.5.3's Gemma 4 implementation traces cleanly through torch.export with StaticCache (within the sliding-window constraint of seq ∈ [2, 511]). The two real Phase 3 problems were both downstream of torch.export: the pathlib save bug and the decode-loop attention mask shape.
The hardest phase was actually lowering (Phase 5) — the StaticCache mutation blocker and the XNNPACK partitioner configuration. That's where the non-obvious engineering lived.
What's in the repo
Everything needed to reproduce the full pipeline or just grab the .pte and run:
Ready to use:
- 5.14 GB .pte on HuggingFace — download and run on Pi 5
- Interactive multi-turn chat REPL with KV-cache reuse
- Full phase-by-phase reproduction recipe (Mac export → Pi deploy)

Documentation:
- RESULTS.md — complete chronology of every bug, fix, and design decision
- KNOWN_ISSUES.md — all 14 issues with repro steps and workarounds
- Architecture analysis of Gemma 4 E2B from an exporter's perspective

Who this is for
If you're deploying a custom PyTorch model on ARM hardware via ExecuTorch, this repo is a worked example of the full toolchain with honest documentation of where it breaks. Substitute your model for Gemma 4 and most of the recipe transfers.
If you just want Gemma 4 on a Pi 5, use llama.cpp. It's faster and simpler today. This project exists to test and document the official PyTorch edge path — what works, what doesn't, and what needs fixing upstream.
If you maintain ExecuTorch, torchao, or HuggingFace Transformers, the KNOWN_ISSUES.md has repro steps for each bug. Upstream issues are being filed.
Links
- GitHub: github.com/bamb00boy/Gemma4_executorch_deployment
- HuggingFace (.pte): huggingface.co/bamb00boy/gemma4-e2b-int4-executorch-pi5
- ARM's Gemma 4 blog: newsroom.arm.com/blog/gemma-4-on-arm-optimized-on-device-ai