Viik

Posted on May 25

Gemma 4 ExecuTorch Deployment on Raspberry Pi 5 and Why It's 7.7× Slower Than llama.cpp

#executorch #edgeai #raspberrypi #gemma4

On April 2, ARM published a blog post announcing Gemma 4 optimised for ARM devices via XNNPACK + KleidiAI, reporting 5.5× prefill speedup and 1.6× faster decode. Those numbers target Armv9 chips with SME2 — flagship phone silicon.

I wanted to see what happens on the broader ARM ecosystem. So I took Gemma 4 E2B through the full PyTorch edge deployment pipeline — torch.export → torchao quantization (INT8 dynamic activations + INT4 weights) → ExecuTorch XNNPACK backend → KleidiAI — and deployed it on a Raspberry Pi 5 (Cortex-A76, 8GB, no SME2).

As far as I can tell, this is the first publicly documented Gemma 4 deployment through ExecuTorch on any hardware.

It works. The output is token-exact: all 9 generated tokens match the FP32 reference. But I hit 14 issues along the way, and the performance story on non-SME2 hardware is very different from ARM's published benchmarks.

The numbers

Setup	Decode speed
ExecuTorch + XNNPACK on Pi 5 (8GB)	0.87 tok/s
llama.cpp on Pi 5 (16GB)*	6.71 tok/s
ExecuTorch + XNNPACK on Mac M1 Pro	8.66 tok/s

*llama.cpp number from potato-os/core benchmark (April 4, 2026, Pi 5 16GB). Different RAM config but decode speed is typically memory-bandwidth-bound, not capacity-bound, so the comparison is reasonable.

The Pi 5 result is 7.7× slower than llama.cpp. But the Mac result tells a different story — on macOS arm64 where XNNPACK's fused kernel path works, ExecuTorch runs at competitive speed. The gap is specific to Linux aarch64 (Pi), not ARM in general.
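For reference, the ratios follow directly from the table. The snippet below is only a worked calculation on the three published decode speeds; the dictionary keys and the 100-token example are illustrative choices, not measurements from the benchmark runs.

```python
# Decode speeds (tokens per second) copied from the table above.
decode_tok_per_s = {
    "executorch_pi5": 0.87,
    "llamacpp_pi5": 6.71,
    "executorch_m1pro": 8.66,
}


def speed_ratio(faster, slower):
    """How many times faster `faster` is than `slower`."""
    return faster / slower


pi_gap = speed_ratio(decode_tok_per_s["llamacpp_pi5"],
                     decode_tok_per_s["executorch_pi5"])
mac_vs_pi = speed_ratio(decode_tok_per_s["executorch_m1pro"],
                        decode_tok_per_s["executorch_pi5"])

# Wall time to decode a 100-token reply at each speed (illustrative length).
seconds_per_100_tokens = {name: 100 / rate
                          for name, rate in decode_tok_per_s.items()}

print(f"llama.cpp vs ExecuTorch on Pi 5: {pi_gap:.1f}x")
print(f"ExecuTorch on M1 Pro vs Pi 5:    {mac_vs_pi:.1f}x")
for name, secs in seconds_per_100_tokens.items():
    print(f"{name}: {secs:.0f} s per 100 tokens")
```

In practical terms, a 100-token reply takes roughly two minutes on the ExecuTorch Pi 5 build versus about 15 seconds with llama.cpp.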

Why the Pi 5 is slow (one bug)

ExecuTorch 1.2.0's XNNPACK backend rejects fused INT4 subgraphs on aarch64 with xnn_status_invalid_parameter. The workaround is per_op_mode=True, which disables kernel fusion entirely. Kernel fusion is exactly where KleidiAI's INT4 matmul speedup lives — without it, every operator runs individually with full dispatch overhead.
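A minimal sketch of what the workaround looks like at lowering time, assuming the standard `to_edge_transform_and_lower` flow. The function name `lower_per_op` is hypothetical, and the repo's actual partitioner configuration may pass additional options.

```python
def lower_per_op(exported_program):
    """Lower an ExportedProgram to XNNPACK one operator at a time.

    Hypothetical helper for illustration. Imports are done lazily so the
    sketch can be read (and imported) without ExecuTorch installed.
    """
    from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
        XnnpackPartitioner,
    )
    from executorch.exir import to_edge_transform_and_lower

    # per_op_mode=True delegates each supported op as its own subgraph,
    # which sidesteps the rejected fused INT4 subgraphs but also gives up
    # the fusion that the fast path depends on.
    partitioner = XnnpackPartitioner(per_op_mode=True)
    edge = to_edge_transform_and_lower(exported_program, partitioner=[partitioner])
    return edge.to_executorch()
```

The returned program manager exposes the serialized bytes as `.buffer`, which can be written to a `.pte` file.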

99.5% of wall time is in C++ XNNPACK kernels, not Python. A C++ runner wouldn't help. The bottleneck is fusion, not language overhead.
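The claim that a C++ runner wouldn't help can be checked with Amdahl's law. The sketch below assumes the 99.5% figure means everything outside the XNNPACK kernels accounts for at most 0.5% of wall time; the function name is illustrative.

```python
def overall_speedup(fraction_improved, factor=float("inf")):
    """Amdahl's law: total speedup when `fraction_improved` of the runtime
    becomes `factor` times faster and the rest is unchanged."""
    remaining = (1.0 - fraction_improved) + fraction_improved / factor
    return 1.0 / remaining


non_kernel_fraction = 1.0 - 0.995      # time spent outside XNNPACK kernels
best_case = overall_speedup(non_kernel_fraction)  # that overhead removed entirely
best_tok_per_s = 0.87 * best_case

print(f"best-case speedup from a C++ runner: {best_case:.4f}x")
print(f"best-case decode speed: {best_tok_per_s:.3f} tok/s")
```

Even with the non-kernel overhead reduced to zero, decode speed would stay under 0.88 tok/s. Closing the gap to llama.cpp has to come from the kernels themselves.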

This isn't a criticism of ARM or ExecuTorch. The XNNPACK + KleidiAI pipeline is clearly fast on SME2 hardware. But the Armv8 ecosystem — Pi, older phones, embedded boards — is massive, and this is the kind of gap that only surfaces through independent testing on diverse hardware.

Three bugs that will save you days

Out of the 14 issues I documented, these three cost me the most time.

torchao 0.17 has no CPU-compatible INT4 weight-only path

The legacy int4_weight_only() factory is removed in torchao 0.17. Its replacement, Int4WeightOnlyConfig, requires Meta's mslk CUDA kernel library. Every INT4 packing format in 0.17 needs CUDA, XPU, or NPU — there is no CPU path.

If you're doing CPU-side model preparation for ExecuTorch edge deployment (which is... the primary use case), this blocks you completely.

Workaround: Use Int8DynamicActivationIntxWeightConfig(weight_dtype=torch.int4, weight_granularity=PerGroup(128)). This gives you INT8 dynamic activations with INT4 weights — the standard scheme XNNPACK and KleidiAI actually target.
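As a rough sketch, the workaround can be applied like this. The helper name `quantize_for_xnnpack` is hypothetical, and the import paths are the ones I would expect for recent torchao releases; check them against your installed version.

```python
def quantize_for_xnnpack(model, group_size=128):
    """Apply INT8 dynamic activations + INT4 grouped weights in place.

    Hypothetical helper for illustration. Imports are lazy so the sketch
    can be read without torch or torchao installed.
    """
    import torch
    from torchao.quantization import (
        Int8DynamicActivationIntxWeightConfig,
        quantize_,
    )
    from torchao.quantization.granularity import PerGroup

    config = Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        weight_granularity=PerGroup(group_size),
    )
    # quantize_ mutates the model; by default it targets nn.Linear layers.
    quantize_(model, config)
    return model
```

This runs entirely on CPU, so it fits the export-on-a-laptop, deploy-on-a-board workflow.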

torch.export.save silently corrupts large files

If you pass a pathlib.Path to torch.export.save and the export exceeds 2 GB, the zip central directory gets truncated. The save reports success. torch.export.load then fails with a cryptic PytorchStreamReader failed finding central directory error. You'll blame your model, your export config, your quantization — everything except the save call, because it told you it worked.

Workaround: Pass an open file handle instead of a Path, and verify the save immediately by reloading:

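import torch

# exported_program is the ExportedProgram returned by torch.export.export(...)
with open("model.pt2", "wb") as f:
    torch.export.save(exported_program, f)  # file handle, not a pathlib.Path
# Verify immediately: a truncated archive fails here, not at deploy time
torch.export.load("model.pt2")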
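Reloading a multi-gigabyte export is slow, so a cheaper first check can be useful. The sketch below assumes the saved file is a standard zip archive (which the central-directory error suggests) and uses only the standard library; `archive_is_intact` is a hypothetical helper, and the demo runs on a throwaway archive rather than a real .pt2 file.

```python
import os
import tempfile
import zipfile


def archive_is_intact(path):
    """Return True if `path` opens as a zip archive.

    Opening the file makes zipfile locate and parse the central directory,
    which is the part that goes missing in the failure described above.
    """
    try:
        with zipfile.ZipFile(path) as zf:
            zf.namelist()
        return True
    except (zipfile.BadZipFile, OSError):
        return False


# Self-contained demo: build a small archive, then a truncated copy of it.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.zip")
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("data.bin", b"x" * 4096)

    # Chop off the tail, where the central directory and end record live.
    truncated = os.path.join(tmp, "truncated.zip")
    with open(good, "rb") as src, open(truncated, "wb") as dst:
        dst.write(src.read()[:-64])

    good_ok = archive_is_intact(good)
    truncated_ok = archive_is_intact(truncated)

print("intact archive passes:", good_ok)
print("truncated archive passes:", truncated_ok)
```

This only confirms the archive structure is readable. A full torch.export.load round-trip remains the stronger check.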

HuggingFace's StaticCache breaks ExecuTorch lowering

transformers.StaticCache holds KV-cache tensors as plain Python attributes, not as nn.Module buffers. During torch.export, these tensors get lifted as constants. ExecuTorch's run_decompositions then rejects them because constants can't be mutated — but the cache is mutated every forward pass.

HuggingFace's source code actually documents this (early_initialization comment), but there's no formal fix for the ExecuTorch interaction.

Workaround: Subclass StaticCache to also inherit from nn.Module. Register KV tensors and the cumulative-length counter as buffers. Wrap the layer caches in nn.ModuleList. This makes them visible to torch.export as mutable buffers instead of constants.
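To illustrate the buffer-registration idea in isolation, here is a toy cache written against plain PyTorch. It is not the StaticCache subclass from the repo: the class names, shapes, and `update` signature are all assumptions made for the example, and a real subclass has to match the interface Transformers expects.

```python
def make_buffer_kv_cache(num_layers, batch, heads, max_len, head_dim):
    """Build a toy KV cache whose tensors are registered nn.Module buffers.

    Hypothetical helper for illustration. torch is imported lazily so the
    sketch can be read without it installed.
    """
    import torch
    from torch import nn

    class LayerCache(nn.Module):
        def __init__(self):
            super().__init__()
            shape = (batch, heads, max_len, head_dim)
            # Buffers, not plain attributes: torch.export can then treat
            # in-place writes as buffer mutations instead of lifting the
            # tensors as constants.
            self.register_buffer("keys", torch.zeros(shape))
            self.register_buffer("values", torch.zeros(shape))
            self.register_buffer("cumulative_length",
                                 torch.zeros(1, dtype=torch.long))

        def update(self, k, v, cache_position):
            # Write the new keys/values at the given sequence positions.
            self.keys.index_copy_(2, cache_position, k)
            self.values.index_copy_(2, cache_position, v)
            self.cumulative_length.add_(k.shape[2])
            return self.keys, self.values

    class BufferKVCache(nn.Module):
        def __init__(self):
            super().__init__()
            # ModuleList makes every layer's buffers discoverable from the root.
            self.layers = nn.ModuleList(LayerCache() for _ in range(num_layers))

    return BufferKVCache()
```

With this layout, `named_buffers()` on the root module lists every layer's keys, values, and length counter, which is what makes them visible as mutable state during export.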

The export was easier than expected

Going in, I expected torch.export to be the hardest phase. Gemma 4 E2B has unusual architecture features — embed_tokens_per_layer (2.35B params in a per-layer embedding table, which is the "E2B" trick), shared RoPE as a sibling of the decoder layers, and sliding-window attention alternating with full attention across 35 layers.

I wrote up a list of seven export hazards from source inspection: dict-typed shared KV states, dynamic getattr in rotary embeddings, a @dynamic_rope_update decorator, and more.

None of them manifested. Transformers 5.5.3's Gemma 4 implementation traces cleanly through torch.export with StaticCache (within the sliding-window constraint of seq ∈ [2, 511]). The two real Phase 3 problems were both downstream of torch.export: the pathlib save bug and the decode-loop attention mask shape.
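One way such a sequence-length range can be expressed is a bounded dynamic dimension. This is a sketch under assumptions: the post does not show the actual export call, the helper name is hypothetical, and a real Gemma 4 export passes more inputs than just the token IDs (cache, positions, mask).

```python
def export_with_seq_range(model, example_input_ids, min_len=2, max_len=511):
    """Export `model` with a bounded dynamic sequence dimension.

    Hypothetical helper for illustration. Assumes `example_input_ids` has
    shape (batch, seq) and is the model's only positional input.
    """
    import torch

    # Dimension 1 of the first positional argument may vary within the bounds.
    seq = torch.export.Dim("seq", min=min_len, max=max_len)
    return torch.export.export(
        model,
        (example_input_ids,),
        dynamic_shapes=({1: seq},),
    )
```

Inputs outside the declared range are rejected when the exported program runs, so the bounds need to cover every prompt length you intend to serve.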

The hardest phase was actually lowering (Phase 5) — the StaticCache mutation blocker and the XNNPACK partitioner configuration. That's where the non-obvious engineering lived.

What's in the repo

Everything needed to reproduce the full pipeline or just grab the .pte and run:

Ready to use:

5.14 GB .pte on HuggingFace: download and run on Pi 5
Interactive multi-turn chat REPL with KV-cache reuse
Full phase-by-phase reproduction recipe (Mac export → Pi deploy)

Documentation:

RESULTS.md: complete chronology of every bug, fix, and design decision
KNOWN_ISSUES.md: all 14 issues with repro steps and workarounds
Architecture analysis of Gemma 4 E2B from an exporter's perspective

Who this is for

If you're deploying a custom PyTorch model on ARM hardware via ExecuTorch, this repo is a worked example of the full toolchain with honest documentation of where it breaks. Substitute your model for Gemma 4 and most of the recipe transfers.

If you just want Gemma 4 on a Pi 5, use llama.cpp. It's faster and simpler today. This project exists to test and document the official PyTorch edge path — what works, what doesn't, and what needs fixing upstream.

If you maintain ExecuTorch, torchao, or HuggingFace Transformers, the KNOWN_ISSUES.md has repro steps for each bug. Upstream issues are being filed.

DEV Community