I haven't tried it on edge devices myself, but there are small variants such as gemma4 e2b, and they already seem to show some gains. I read that the draft model for these is as small as 70 MB (if I recall correctly). Here are the official numbers for speed-ups of the smaller models on mobile GPUs:
By the nature of MTP, you won't notice speed-ups in image understanding at first, but generation speed (including reasoning) might get a small boost...
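To make the "generation only" point concrete, here is a toy sketch of greedy draft-and-verify decoding, the general idea behind pairing a small drafter with a larger model. It is not the actual Gemma or MTP implementation: `speculative_generate`, `greedy_generate`, and the two toy "models" are hypothetical names made up for illustration, and a real engine would verify all drafted tokens in one batched forward pass instead of a Python loop.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to the next token


def greedy_generate(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_generate(
    target: NextToken, draft: NextToken, prompt: List[Token], max_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify. Returns (tokens, number of verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap drafter proposes k tokens in a row.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks the proposal. Each round stands in for a single
        #    batched target pass in a real engine (assumption of this sketch).
        rounds += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
        else:
            # Every drafted token matched, so the target's next token comes for free.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new], rounds


# Toy "models" over digits 0-9: the target counts upward, the drafter
# agrees except that it guesses wrong right after a 2.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


tokens, rounds = speculative_generate(toy_target, toy_draft, [0], max_new=12)
baseline = greedy_generate(toy_target, [0], max_new=12)
```

The output is identical to plain greedy decoding with the target; only the number of target rounds drops. Nothing in this loop touches a vision encoder, which is why the image-understanding step would not be expected to speed up.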
Oh nice, 70 MB for the drafter is way smaller than I expected. That actually makes me want to try pairing it with the E2B variant on the Pi and see if the reasoning step gets any faster. Right now my pipeline spends most of its time on the generation side after the image is already processed, so even a small boost there would be noticeable. That speed-up chart is really useful too, thanks for sharing it. The Pixel TPU and Apple M4 numbers are interesting; it feels like on-device MTP is closer than most people think. Good point about image understanding not getting the speed-up, though. That makes sense, since MTP is about token prediction, not the vision encoder. Appreciate the detailed answer!
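For a rough feel of how much a generation-only boost helps a pipeline like this, an Amdahl's-law style estimate is enough. This is only a back-of-the-envelope sketch: `end_to_end_speedup` is a hypothetical helper, and the 80% generation share and 1.5x decode speed-up below are made-up example inputs, not measurements from the Pi or from the chart.

```python
def end_to_end_speedup(generation_fraction: float, generation_speedup: float) -> float:
    """Overall speed-up when only the generation share of the runtime gets faster.

    generation_fraction: share of total time spent generating tokens (0..1).
    generation_speedup:  factor by which generation alone gets faster (> 0).
    """
    if not 0.0 <= generation_fraction <= 1.0:
        raise ValueError("generation_fraction must be between 0 and 1")
    if generation_speedup <= 0:
        raise ValueError("generation_speedup must be positive")
    unchanged = 1.0 - generation_fraction  # e.g. image encoding, unaffected by MTP
    return 1.0 / (unchanged + generation_fraction / generation_speedup)


# Hypothetical inputs: 80% of the time in generation, 1.5x faster decoding.
estimate = end_to_end_speedup(generation_fraction=0.8, generation_speedup=1.5)
print(f"estimated end-to-end speed-up: {estimate:.2f}x")
```

The larger the share of time spent in generation, the closer the overall gain gets to the raw decode speed-up, which is why a generation-bound pipeline is a good fit for trying this.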