Discussion on: Has anyone actually shipped a free offline mobile LLM?

SimpleWBS

Yes, there are several small LLMs that run locally on a phone with solid performance.

Top Picks

These models (all under 4B params) run comfortably on phones with 4-8GB of RAM and hit 5-15 tokens/sec on modern devices like recent Pixels or iPhones.

Model        Size (quantized)   Best For
Gemma 2B     ~1.4GB             Chat, quick responses
Phi-3 Mini   ~2.3GB             Reasoning, code snippets
TinyLlama    ~1.7GB             General tasks, efficient
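
For a quick sanity check on whether a model will fit, the weights come out to roughly params × bits-per-weight / 8, plus headroom for the KV cache and runtime buffers. A rough sketch (the 1GB overhead allowance is an assumption, and real quantized files run a bit larger because some tensors stay at higher precision):

```kotlin
// Back-of-envelope RAM estimate for a quantized model.
// weights ≈ params * bitsPerWeight / 8; overheadGb is a rough
// allowance for KV cache and runtime buffers (assumption, not measured).
fun estimateRamGb(paramsBillion: Double, bitsPerWeight: Double, overheadGb: Double = 1.0): Double =
    paramsBillion * bitsPerWeight / 8.0 + overheadGb

fun main() {
    // Phi-3 Mini is ~3.8B params; at 4-bit that's ~1.9GB of weights,
    // which lines up with the ~2.3GB file above once the higher-precision
    // tensors are counted.
    println("Phi-3 Mini @4-bit: ~%.1fGB".format(estimateRamGb(3.8, 4.0)))
    println("Gemma 2B   @4-bit: ~%.1fGB".format(estimateRamGb(2.5, 4.0)))
}
```

The overhead figure is deliberately rough; the KV cache grows with context length, so long chats push it higher.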

How to Run

Grab PocketPal or MLC LLM from the app stores. PocketPal runs quantized GGUF files you download from Hugging Face, while MLC LLM ships models in its own pre-compiled format; either way, no cloud needed. Start small to test speed!
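
If you are wiring this into your own app rather than using those apps, grabbing a GGUF file is just an HTTPS download against Hugging Face's resolve URL. A minimal sketch; the repo and filename here are examples, so check the model card for the exact quant you want:

```kotlin
import java.io.File
import java.net.URL

// Pull a quantized GGUF file from Hugging Face over HTTPS.
// Hugging Face serves raw files at /<repo>/resolve/<branch>/<file>.
fun downloadGguf(repo: String, file: String, dest: File) {
    URL("https://huggingface.co/$repo/resolve/main/$file").openStream().use { input ->
        dest.outputStream().use { output -> input.copyTo(output) }
    }
}

fun main() {
    // Example repo/filename; Q4_K_M is a common speed/quality middle ground.
    downloadGguf(
        repo = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
        file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
        dest = File("tinyllama.Q4_K_M.gguf")
    )
}
```

These files are multi-GB, so a real app would stream to app storage with resume support rather than this one-shot copy.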

ujja • Edited

Yep, those are exactly the models I tested.

They are impressive on their own, but moving from a chat demo to an offline, speech-based app is where the cracks show. I documented the full attempt here if you are interested: Offline AI on Mobile: Tackling Whisper, LLMs, and Size Roadblocks.

A few things that caught me out:

  • Running Whisper and Phi-3 Mini together pushes memory harder than expected
  • Android asset handling gets painful fast once models get big
  • JNI plus llama.cpp works in theory, but debugging it is not fun
  • Raw tokens per second was not the main issue; latency spikes were (see the sketch after this list)
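
On that last point: averages hide the stalls, so it is worth logging per-token latency and looking at the tail rather than a single tokens/sec number. A minimal sketch of the kind of instrumentation I mean, where generateNextToken is a hypothetical stand-in for whatever your llama.cpp JNI binding actually exposes:

```kotlin
// Measure per-token decode latency rather than a single average.
// generateNextToken is a hypothetical stand-in for your real decode call.
fun measureLatenciesMs(steps: Int, generateNextToken: () -> Unit): List<Long> {
    val latencies = mutableListOf<Long>()
    repeat(steps) {
        val start = System.nanoTime()
        generateNextToken()
        latencies += (System.nanoTime() - start) / 1_000_000
    }
    return latencies
}

fun main() {
    // Fake generator: steady ~80ms per token with an occasional 400ms
    // stall, standing in for GC pauses or memory pressure.
    var step = 0
    val latencies = measureLatenciesMs(200) {
        Thread.sleep(if (step++ % 50 == 0) 400L else 80L)
    }
    val sorted = latencies.sorted()
    println("avg ~%.1f tok/s".format(1000.0 / latencies.average()))
    println("p50 ${sorted[sorted.size / 2]}ms, p99 ${sorted[sorted.size * 99 / 100]}ms")
}
```

In the fake run above the average still looks like a healthy ~11 tok/s, while the p99 exposes the 400ms stalls that actually wreck a speech-first experience.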

Tools like MLC LLM and PocketPal definitely help, but shipping this inside a real app still meant choosing between speed, size, and quality. Never all three.

Feels like we are close, just not quite there yet for offline, speech-first experiences.