Yes — there are usable offline, free, lightweight mobile LLMs in the wild (e.g., running quantized LLaMA, Mistral 7B, or GGML-based variants on device), but getting them performant for speech without significant compromises in latency/accuracy is still nontrivial and depends heavily on aggressive quantization and model choice.
Most shipped examples lean on frameworks like GGML/llama.cpp with 4-bit quantization or similar, and integrate small local encoders/decoders for speech; if you need higher accuracy or larger context, you still need to accept tradeoffs or offload to the cloud.
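For a concrete sense of that pipeline, here is a minimal sketch of the usual speech-to-LLM flow using llama-cpp-python and openai-whisper as desktop stand-ins for the on-device bindings. The model filenames are placeholders; on a phone you would hit the same llama.cpp / whisper.cpp C APIs through JNI or Swift instead:

```python
# Minimal offline speech -> LLM sketch. Desktop stand-in for the
# on-device path; phones call the same llama.cpp / whisper.cpp C APIs
# through JNI (Android) or Swift bindings (iOS).
# pip install llama-cpp-python openai-whisper
import whisper
from llama_cpp import Llama

# Tiny English-only Whisper keeps the speech front-end small (~75 MB).
stt = whisper.load_model("tiny.en")
user_text = stt.transcribe("query.wav")["text"]  # query.wav is a placeholder

# 4-bit quantized GGUF checkpoint; the filename stands in for whatever
# model you actually ship with the app.
llm = Llama(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=1024,    # small context window to keep the KV cache cheap
    n_threads=4,   # mobile-class CPU budget
)

out = llm(f"User: {user_text}\nAssistant:", max_tokens=64, stop=["User:"])
print(out["choices"][0]["text"].strip())
```

Even this toy version shows where the constraints bite: `n_ctx` and `max_tokens` are deliberately tiny, which is exactly the context/speed trade-off above.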
Totally agree. This is pretty much why I asked the question.
I actually tried shipping this for a speech-first mobile app and wrote up the whole journey in an earlier post, *Offline AI on Mobile: Tackling Whisper, LLMs, and Size Roadblocks*.
What I found in practice:

- Even smaller models hit Android and iOS limits around assets, memory, and native bridging.
- Quantization helps, but conversational UX takes a hit pretty quickly.

So yeah, it is real, but only if you accept very tight constraints around context, speed, or response quality. Anything that feels like a smooth assistant still involves pretty visible trade-offs.
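To put rough numbers on the memory point, here is a back-of-envelope sketch of weight memory by precision. It ignores quantization block overhead and the KV cache, so real footprints run somewhat higher:

```python
# Why 7B-class models strain phones: weight memory alone, by precision.
# Ignores quantization block overhead and the KV cache, so actual
# footprints are somewhat larger than these figures.
PARAMS = 7e9  # Mistral-7B-class parameter count

for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    gib = PARAMS * bits / 8 / 2**30
    print(f"{name:>5}: ~{gib:.1f} GiB of weights")

# 4-bit: ~3.3 GiB of weights alone, which already strains the per-app
# memory budget many phones enforce. That is why much smaller models
# tend to be the practical choice for fully on-device apps.
```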