<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Viik</title>
    <description>The latest articles on DEV Community by Viik (@bamb00boy).</description>
    <link>https://dev.to/bamb00boy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3950536%2F124eeaa6-8883-4951-bec7-3d2ea61139f3.png</url>
      <title>DEV Community: Viik</title>
      <link>https://dev.to/bamb00boy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bamb00boy"/>
    <language>en</language>
    <item>
      <title>Gemma 4 ExecuTorch Deployment on Raspberry Pi 5 and Why It's 7.7× Slower Than llama.cpp</title>
      <dc:creator>Viik</dc:creator>
      <pubDate>Mon, 25 May 2026 11:19:02 +0000</pubDate>
      <link>https://dev.to/bamb00boy/first-gemma-4-executorch-deployment-on-raspberry-pi-5-and-why-its-77x-slower-than-llamacpp-gmb</link>
      <guid>https://dev.to/bamb00boy/first-gemma-4-executorch-deployment-on-raspberry-pi-5-and-why-its-77x-slower-than-llamacpp-gmb</guid>
      <description>&lt;p&gt;On April 2, ARM published a &lt;a href="https://newsroom.arm.com/blog/gemma-4-on-arm-optimized-on-device-ai" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; announcing Gemma 4 optimised for ARM devices via XNNPACK + KleidiAI, reporting 5.5× prefill speedup and 1.6× faster decode. Those numbers target Armv9 chips with SME2 — flagship phone silicon.&lt;/p&gt;

&lt;p&gt;I wanted to see what happens on the broader ARM ecosystem. So I took Gemma 4 E2B through the full PyTorch edge deployment pipeline — &lt;code&gt;torch.export&lt;/code&gt; → torchao quantization (INT8 dynamic activations + INT4 weights) → ExecuTorch XNNPACK backend → KleidiAI — and deployed it on a Raspberry Pi 5 (Cortex-A76, 8GB, no SME2).&lt;/p&gt;

&lt;p&gt;As far as I can tell, this is the first publicly documented Gemma 4 deployment through ExecuTorch on any hardware.&lt;/p&gt;

&lt;p&gt;It works. The output matches the FP32 reference token for token (9/9 tokens). But I hit 14 issues along the way, and the performance story on non-SME2 hardware is very different from ARM's published benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Decode speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ExecuTorch + XNNPACK on Pi 5 (8GB)&lt;/td&gt;
&lt;td&gt;0.87 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama.cpp on Pi 5 (16GB)*&lt;/td&gt;
&lt;td&gt;6.71 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ExecuTorch + XNNPACK on Mac M1 Pro&lt;/td&gt;
&lt;td&gt;8.66 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*llama.cpp number from &lt;a href="https://github.com/potato-os/core/blob/main/docs/benchmarks/gemma4-pi-benchmark-2026-04-04.md" rel="noopener noreferrer"&gt;potato-os/core benchmark&lt;/a&gt; (April 4, 2026, Pi 5 16GB). The RAM capacity differs (16GB vs 8GB), but decode speed is typically bound by memory bandwidth rather than capacity, so the comparison is reasonable.&lt;/p&gt;

&lt;p&gt;The Pi 5 result is 7.7× slower than llama.cpp. But the Mac result tells a different story — on macOS arm64 where XNNPACK's fused kernel path works, ExecuTorch runs at competitive speed. The gap is specific to Linux aarch64 (Pi), not ARM in general.&lt;/p&gt;
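
&lt;p&gt;For reference, the ratios quoted here follow directly from the table. A minimal sketch of the arithmetic (the dictionary keys and the &lt;code&gt;slowdown&lt;/code&gt; helper are illustrative names, not part of the repo):&lt;/p&gt;

```python
# Decode speeds in tokens per second, copied from the table above.
decode_tok_per_s = {
    "executorch_pi5": 0.87,
    "llamacpp_pi5": 6.71,
    "executorch_m1_pro": 8.66,
}


def slowdown(baseline, candidate):
    """Return how many times slower `candidate` is than `baseline`."""
    return baseline / candidate


pi_gap = slowdown(decode_tok_per_s["llamacpp_pi5"], decode_tok_per_s["executorch_pi5"])
mac_vs_pi = slowdown(decode_tok_per_s["executorch_m1_pro"], decode_tok_per_s["executorch_pi5"])

print(f"llama.cpp vs ExecuTorch on Pi 5: {pi_gap:.1f}x")
print(f"ExecuTorch on M1 Pro vs ExecuTorch on Pi 5: {mac_vs_pi:.1f}x")
```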

&lt;h2&gt;
  
  
  Why the Pi 5 is slow (one bug)
&lt;/h2&gt;

&lt;p&gt;ExecuTorch 1.2.0's XNNPACK backend rejects fused INT4 subgraphs on aarch64 with &lt;code&gt;xnn_status_invalid_parameter&lt;/code&gt;. The workaround is &lt;code&gt;per_op_mode=True&lt;/code&gt;, which disables kernel fusion entirely. Kernel fusion is exactly where KleidiAI's INT4 matmul speedup lives — without it, every operator runs individually with full dispatch overhead.&lt;/p&gt;

&lt;p&gt;99.5% of wall time is in C++ XNNPACK kernels, not Python. A C++ runner wouldn't help. The bottleneck is fusion, not language overhead.&lt;/p&gt;

&lt;p&gt;This isn't a criticism of ARM or ExecuTorch. The XNNPACK + KleidiAI pipeline is clearly fast on SME2 hardware. But the Armv8 ecosystem — Pi, older phones, embedded boards — is massive, and this is the kind of gap that only surfaces through independent testing on diverse hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three bugs that will save you days
&lt;/h2&gt;

&lt;p&gt;Out of the 14 issues I documented, these three cost me the most time.&lt;/p&gt;

&lt;h3&gt;
  
  
  torchao 0.17 has no CPU-compatible INT4 weight-only path
&lt;/h3&gt;

&lt;p&gt;The legacy &lt;code&gt;int4_weight_only()&lt;/code&gt; factory is removed in torchao 0.17. Its replacement, &lt;code&gt;Int4WeightOnlyConfig&lt;/code&gt;, requires Meta's &lt;code&gt;mslk&lt;/code&gt; CUDA kernel library. Every INT4 packing format in 0.17 needs CUDA, XPU, or NPU — there is no CPU path.&lt;/p&gt;

&lt;p&gt;If you're doing CPU-side model preparation for ExecuTorch edge deployment (which is... the primary use case), this blocks you completely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workaround:&lt;/strong&gt; Use &lt;code&gt;Int8DynamicActivationIntxWeightConfig(weight_dtype=torch.int4, weight_granularity=PerGroup(128))&lt;/code&gt;. This gives you INT8 dynamic activations with INT4 weights — the standard scheme XNNPACK and KleidiAI actually target.&lt;/p&gt;
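
&lt;p&gt;To make the weight half of that scheme concrete, here is a minimal pure-Python sketch of INT4 weights with one scale per group of 128. It is an illustration only, not torchao's implementation: the function names are hypothetical, and the symmetric mapping onto [-7, 7] is an assumption. torchao's actual mapping type, scale dtype, and packing may differ.&lt;/p&gt;

```python
def quantize_int4_per_group(weights, group_size=128):
    """Symmetric INT4 quantization with one scale per group of weights."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Map the largest magnitude in the group onto the INT4 value 7.
        # Fall back to 1.0 for an all-zero group to avoid dividing by zero.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        # Round to the nearest code and clamp into the INT4 range [-8, 7].
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_int4_per_group(codes, scales, group_size=128):
    """Rebuild approximate float weights from INT4 codes and group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Two groups of 128 weights with different ranges get different scales.
weights = [i / 100 - 1.0 for i in range(256)]
codes, scales = quantize_int4_per_group(weights)
restored = dequantize_int4_per_group(codes, scales)
```

&lt;p&gt;Each group keeps its own scale, so a group with small weights is not forced onto the coarse grid needed by a group with large ones. Smaller groups track local ranges more closely at the cost of storing more scales.&lt;/p&gt;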

&lt;h3&gt;
  
  
  torch.export.save silently corrupts large files
&lt;/h3&gt;

&lt;p&gt;If you pass a &lt;code&gt;pathlib.Path&lt;/code&gt; to &lt;code&gt;torch.export.save&lt;/code&gt; and the export exceeds 2 GB, the zip central directory gets truncated. The save reports success. &lt;code&gt;torch.export.load&lt;/code&gt; then fails with a cryptic &lt;code&gt;PytorchStreamReader failed finding central directory&lt;/code&gt; error. You'll blame your model, your export config, your quantization — everything except the save call, because it told you it worked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workaround:&lt;/strong&gt; Pass an open file handle instead of a Path, and verify the save immediately by reloading:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model.pt2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;export&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exported_program&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Verify immediately
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;export&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model.pt2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
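
&lt;p&gt;Since the failure mode is a truncated zip central directory, a cheaper first check is possible with Python's standard &lt;code&gt;zipfile&lt;/code&gt; module before paying for a full reload. This is a sketch under the assumption that the &lt;code&gt;.pt2&lt;/code&gt; file is an ordinary zip archive, as the error message suggests. The helper name &lt;code&gt;archive_is_intact&lt;/code&gt; is hypothetical, and the check complements &lt;code&gt;torch.export.load&lt;/code&gt; rather than replacing it.&lt;/p&gt;

```python
import zipfile


def archive_is_intact(path):
    """Return True if `path` is a readable zip archive with valid entries."""
    # is_zipfile looks for the end-of-central-directory record, which is
    # exactly what goes missing when the file is truncated.
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip returns the name of the first corrupt entry, or None.
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False
```

&lt;p&gt;Calling &lt;code&gt;archive_is_intact("model.pt2")&lt;/code&gt; right after the save turns a silent corruption into an immediate, local failure instead of a confusing error several steps later.&lt;/p&gt;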



&lt;h3&gt;
  
  
  HuggingFace's StaticCache breaks ExecuTorch lowering
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;transformers.StaticCache&lt;/code&gt; holds KV-cache tensors as plain Python attributes, not as &lt;code&gt;nn.Module&lt;/code&gt; buffers. During &lt;code&gt;torch.export&lt;/code&gt;, these tensors get lifted as &lt;strong&gt;constants&lt;/strong&gt;. ExecuTorch's &lt;code&gt;run_decompositions&lt;/code&gt; then rejects them because constants can't be mutated — but the cache is mutated every forward pass.&lt;/p&gt;

&lt;p&gt;HuggingFace's source code actually documents this (&lt;code&gt;early_initialization&lt;/code&gt; comment), but there's no formal fix for the ExecuTorch interaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workaround:&lt;/strong&gt; Subclass &lt;code&gt;StaticCache&lt;/code&gt; to also inherit from &lt;code&gt;nn.Module&lt;/code&gt;. Register KV tensors and the cumulative-length counter as buffers. Wrap the layer caches in &lt;code&gt;nn.ModuleList&lt;/code&gt;. This makes them visible to &lt;code&gt;torch.export&lt;/code&gt; as mutable buffers instead of constants.&lt;/p&gt;

&lt;h2&gt;
  
  
  The export was easier than expected
&lt;/h2&gt;

&lt;p&gt;Going in, I expected &lt;code&gt;torch.export&lt;/code&gt; to be the hardest phase. Gemma 4 E2B has unusual architecture features — &lt;code&gt;embed_tokens_per_layer&lt;/code&gt; (2.35B params in a per-layer embedding table, which is the "E2B" trick), shared RoPE as a sibling of the decoder layers, and sliding-window attention alternating with full attention across 35 layers.&lt;/p&gt;

&lt;p&gt;I wrote up a list of seven export hazards from source inspection: dict-typed shared KV states, dynamic &lt;code&gt;getattr&lt;/code&gt; in rotary embeddings, a &lt;code&gt;@dynamic_rope_update&lt;/code&gt; decorator, and more.&lt;/p&gt;

&lt;p&gt;None of them manifested. Transformers 5.5.3's Gemma 4 implementation traces cleanly through &lt;code&gt;torch.export&lt;/code&gt; with &lt;code&gt;StaticCache&lt;/code&gt; (within the sliding-window constraint of &lt;code&gt;seq ∈ [2, 511]&lt;/code&gt;). The two real Phase 3 problems were both downstream of &lt;code&gt;torch.export&lt;/code&gt;: the pathlib save bug and the decode-loop attention mask shape.&lt;/p&gt;

&lt;p&gt;The hardest phase was actually lowering (Phase 5) — the StaticCache mutation blocker and the XNNPACK partitioner configuration. That's where the non-obvious engineering lived.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's in the repo
&lt;/h2&gt;

&lt;p&gt;Everything you need to reproduce the full pipeline, or to just grab the .pte and run it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/bamb00boy/gemma4-e2b-int4-executorch-pi5" rel="noopener noreferrer"&gt;5.14 GB .pte on HuggingFace&lt;/a&gt;: download and run on Pi 5&lt;/li&gt;
&lt;li&gt;Interactive multi-turn chat REPL with KV-cache reuse&lt;/li&gt;
&lt;li&gt;Full phase-by-phase reproduction recipe (Mac export → Pi deploy)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Documentation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/bamb00boy/Gemma4_executorch_deployment/blob/master/RESULTS.md" rel="noopener noreferrer"&gt;RESULTS.md&lt;/a&gt;: complete chronology of every bug, fix, and design decision&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/bamb00boy/Gemma4_executorch_deployment/blob/master/KNOWN_ISSUES.md" rel="noopener noreferrer"&gt;KNOWN_ISSUES.md&lt;/a&gt;: all 14 issues with repro steps and workarounds&lt;/li&gt;
&lt;li&gt;Architecture analysis of Gemma 4 E2B from an exporter's perspective&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Who this is for
&lt;/h2&gt;

&lt;p&gt;If you're deploying a custom PyTorch model on ARM hardware via ExecuTorch, this repo is a worked example of the full toolchain with honest documentation of where it breaks. Substitute your model for Gemma 4 and most of the recipe transfers.&lt;/p&gt;

&lt;p&gt;If you just want Gemma 4 on a Pi 5, use llama.cpp. It's faster and simpler today. This project exists to test and document the official PyTorch edge path — what works, what doesn't, and what needs fixing upstream.&lt;/p&gt;

&lt;p&gt;If you maintain ExecuTorch, torchao, or HuggingFace Transformers, the &lt;a href="https://github.com/bamb00boy/Gemma4_executorch_deployment/blob/master/KNOWN_ISSUES.md" rel="noopener noreferrer"&gt;KNOWN_ISSUES.md&lt;/a&gt; has repro steps for each bug. Upstream issues are being filed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/bamb00boy/Gemma4_executorch_deployment" rel="noopener noreferrer"&gt;github.com/bamb00boy/Gemma4_executorch_deployment&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace (.pte):&lt;/strong&gt; &lt;a href="https://huggingface.co/bamb00boy/gemma4-e2b-int4-executorch-pi5" rel="noopener noreferrer"&gt;huggingface.co/bamb00boy/gemma4-e2b-int4-executorch-pi5&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ARM's Gemma 4 blog:&lt;/strong&gt; &lt;a href="https://newsroom.arm.com/blog/gemma-4-on-arm-optimized-on-device-ai" rel="noopener noreferrer"&gt;newsroom.arm.com/blog/gemma-4-on-arm-optimized-on-device-ai&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>executorch</category>
      <category>edgeai</category>
      <category>raspberrypi</category>
      <category>gemma4</category>
    </item>
  </channel>
</rss>
