DEV Community

Discussion on: The Local Model That Doesn't Sleep: Gemma 4 + MTP as a Marathon Engine

S M Tahosin

The MTP angle is interesting; I haven't experimented with it yet. I've been focused on the vision side of Gemma 4 for edge deployment, running detection tasks on a Raspberry Pi at 7.5 W power draw. I'm curious how MTP performs on resource-constrained hardware. Have you tested it on anything smaller than a desktop?

Ertuğrul Demir (Google Developer Experts)

I haven't tried it on edge devices myself, but there are small variants such as Gemma 4 E2B, and they already seem to show some gains. I read that the draft model for these is as small as 70 MB (if I recall correctly). Here are the official speed-up numbers for the smaller models on mobile GPUs:

By the nature of MTP, you won't notice speed-ups in the initial image-understanding step, but generation speed (including reasoning) might get a small boost...

S M Tahosin

Oh nice, 70 MB for the drafter is way smaller than I expected. That actually makes me want to try pairing it with the E2B variant on the Pi and see if the reasoning step gets any faster. Right now my pipeline spends most of its time on the generation side, after the image is already processed, so even a small boost there would be noticeable. That speed-up chart is really useful too, thanks for sharing it. The Pixel TPU and Apple M4 numbers are interesting; it feels like on-device MTP is closer than most people think. Good point about image understanding not getting the speed-up, though. That makes sense, since MTP is about token prediction, not the vision encoder. Appreciate the detailed answer!
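
A rough way to reason about how much a generation-only boost helps the whole pipeline is an Amdahl's-law estimate. The sketch below is only illustrative: `pipeline_speedup` is a hypothetical helper, and the fractions and speed-up factors are made-up placeholders, not measurements from the Pi or from any published MTP benchmark.

```python
def pipeline_speedup(gen_fraction: float, gen_speedup: float) -> float:
    """Amdahl-style estimate of end-to-end speed-up.

    gen_fraction: share of total latency spent in token generation (0..1).
    gen_speedup:  factor by which generation alone gets faster (e.g. via MTP).
    The remaining time (image encoding, pre/post-processing) is assumed unchanged.
    """
    if not 0.0 <= gen_fraction <= 1.0:
        raise ValueError("gen_fraction must be between 0 and 1")
    if gen_speedup <= 0:
        raise ValueError("gen_speedup must be positive")
    return 1.0 / ((1.0 - gen_fraction) + gen_fraction / gen_speedup)


if __name__ == "__main__":
    # Placeholder values for illustration only; plug in your own timings.
    for frac in (0.5, 0.7, 0.9):
        for factor in (1.2, 1.5, 2.0):
            total = pipeline_speedup(frac, factor)
            print(f"generation share {frac:.0%}, generation x{factor}: pipeline x{total:.2f}")
```

The more of the latency sits in generation, the closer the end-to-end gain gets to the generation speed-up itself, which is why a pipeline dominated by generation would notice even a modest boost while the vision-encoder time stays fixed.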