Discussion on: Has anyone actually shipped a free offline mobile LLM?

SimpleWBS

Yes, there are several small LLMs that run locally on a phone with solid performance.

Top Picks

These models (all under 4B params) run comfortably on phones with 4-8GB of RAM and hit 5-15 tokens/sec on modern devices like recent Pixels or iPhones.

Model        Size (quantized)   Best For
Gemma 2B     ~1.4GB             Chat, quick responses
Phi-3 Mini   ~2.3GB             Reasoning, code snippets
TinyLlama    ~1.7GB             General tasks, efficient
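
For a quick sanity check on whether a model will fit, the weights come out to roughly params × bits-per-weight / 8, plus headroom for the KV cache and runtime buffers. A rough sketch (the 1GB overhead allowance is an assumption, and real quantized files run a bit larger because some tensors stay at higher precision):

```kotlin
// Back-of-envelope RAM estimate for a quantized model.
// weights ≈ params * bitsPerWeight / 8; overheadGb is a rough
// allowance for KV cache and runtime buffers (assumption, not measured).
fun estimateRamGb(paramsBillion: Double, bitsPerWeight: Double, overheadGb: Double = 1.0): Double =
    paramsBillion * bitsPerWeight / 8.0 + overheadGb

fun main() {
    // Phi-3 Mini is ~3.8B params; at 4-bit that's ~1.9GB of weights,
    // which lines up with the ~2.3GB file above once the higher-precision
    // tensors are counted.
    println("Phi-3 Mini @4-bit: ~%.1fGB".format(estimateRamGb(3.8, 4.0)))
    println("Gemma 2B   @4-bit: ~%.1fGB".format(estimateRamGb(2.5, 4.0)))
}
```

The overhead figure is deliberately rough; the KV cache grows with context length, so long chats push it higher.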

How to Run

Grab PocketPal or MLC LLM from the app stores. PocketPal runs quantized GGUF files you download from Hugging Face, while MLC LLM ships models in its own pre-compiled format; either way, no cloud needed. Start small to test speed!
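
If you are wiring this into your own app rather than using those apps, grabbing a GGUF file is just an HTTPS download against Hugging Face's resolve URL. A minimal sketch; the repo and filename here are examples, so check the model card for the exact quant you want:

```kotlin
import java.io.File
import java.net.URL

// Pull a quantized GGUF file from Hugging Face over HTTPS.
// Hugging Face serves raw files at /<repo>/resolve/<branch>/<file>.
fun downloadGguf(repo: String, file: String, dest: File) {
    URL("https://huggingface.co/$repo/resolve/main/$file").openStream().use { input ->
        dest.outputStream().use { output -> input.copyTo(output) }
    }
}

fun main() {
    // Example repo/filename; Q4_K_M is a common speed/quality middle ground.
    downloadGguf(
        repo = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
        file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
        dest = File("tinyllama.Q4_K_M.gguf")
    )
}
```

These files are multi-GB, so a real app would stream to app storage with resume support rather than this one-shot copy.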

ujja • Edited

Yep, those are exactly the models I tested.

They are impressive on their own, but moving from a chat demo to an offline, speech-based app is where the cracks show. I documented the full attempt here if you are interested: Offline AI on Mobile: Tackling Whisper, LLMs, and Size Roadblocks.

A few things that caught me out:

  • Running Whisper and Phi-3 Mini together pushes memory harder than expected
  • Android asset handling gets painful fast once models get big
  • JNI plus llama.cpp works in theory, but debugging it is not fun
  • Raw tokens per second was not the main issue; latency spikes were (see the sketch after this list)
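
On that last point: averages hide the stalls, so it is worth logging per-token latency and looking at the tail rather than a single tokens/sec number. A minimal sketch of the kind of instrumentation I mean, where generateNextToken is a hypothetical stand-in for whatever your llama.cpp JNI binding actually exposes:

```kotlin
// Measure per-token decode latency rather than a single average.
// generateNextToken is a hypothetical stand-in for your real decode call.
fun measureLatenciesMs(steps: Int, generateNextToken: () -> Unit): List<Long> {
    val latencies = mutableListOf<Long>()
    repeat(steps) {
        val start = System.nanoTime()
        generateNextToken()
        latencies += (System.nanoTime() - start) / 1_000_000
    }
    return latencies
}

fun main() {
    // Fake generator: steady ~80ms per token with an occasional 400ms
    // stall, standing in for GC pauses or memory pressure.
    var step = 0
    val latencies = measureLatenciesMs(200) {
        Thread.sleep(if (step++ % 50 == 0) 400L else 80L)
    }
    val sorted = latencies.sorted()
    println("avg ~%.1f tok/s".format(1000.0 / latencies.average()))
    println("p50 ${sorted[sorted.size / 2]}ms, p99 ${sorted[sorted.size * 99 / 100]}ms")
}
```

In the fake run above the average still looks like a healthy ~11 tok/s, while the p99 exposes the 400ms stalls that actually wreck a speech-first experience.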

Tools like MLC LLM and PocketPal definitely help, but shipping this inside a real app still meant choosing between speed, size, and quality. Never all three.

Feels like we are close, just not quite there yet for offline, speech-first experiences.