Yes — there are usable offline, free, lightweight mobile LLMs in the wild (e.g., running quantized LLaMA, Mistral 7B, or GGML-based variants on device), but getting them performant for speech without significant compromises in latency/accuracy is still nontrivial and depends heavily on aggressive quantization and model choice.
Most shipped examples lean on frameworks like GGML/llama.cpp with 4-bit quantization or similar, and integrate small local encoders/decoders for speech; if you need higher accuracy or larger context, you still need to accept tradeoffs or offload to the cloud.
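For a concrete sense of that pipeline, here is a minimal sketch of the usual speech-to-LLM flow using llama-cpp-python and openai-whisper as desktop stand-ins for the on-device bindings. The model filenames are placeholders; on a phone you would hit the same llama.cpp / whisper.cpp C APIs through JNI or Swift instead:

```python
# Minimal offline speech -> LLM sketch. Desktop stand-in for the
# on-device path; phones call the same llama.cpp / whisper.cpp C APIs
# through JNI (Android) or Swift bindings (iOS).
# pip install llama-cpp-python openai-whisper
import whisper
from llama_cpp import Llama

# Tiny English-only Whisper keeps the speech front-end small (~75 MB).
stt = whisper.load_model("tiny.en")
user_text = stt.transcribe("query.wav")["text"]  # query.wav is a placeholder

# 4-bit quantized GGUF checkpoint; the filename stands in for whatever
# model you actually ship with the app.
llm = Llama(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=1024,    # small context window to keep the KV cache cheap
    n_threads=4,   # mobile-class CPU budget
)

out = llm(f"User: {user_text}\nAssistant:", max_tokens=64, stop=["User:"])
print(out["choices"][0]["text"].strip())
```

Even this toy version shows where the constraints bite: `n_ctx` and `max_tokens` are deliberately tiny, which is exactly the context/speed trade-off above.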
Totally agree. This is pretty much why I asked the question.
I actually tried shipping this for a speech-first mobile app and wrote up the whole journey in an earlier post, *Offline AI on Mobile: Tackling Whisper, LLMs, and Size Roadblocks*.
What I found in practice:

- Even smaller models hit Android and iOS limits around assets, memory, and native bridging.
- Quantization helps, but conversational UX takes a hit pretty quickly.

So yeah, it is real, but only if you accept very tight constraints around context, speed, or response quality. Anything that feels like a smooth assistant still involves pretty visible trade-offs.
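To put rough numbers on the memory point, here is a back-of-envelope sketch of weight memory by precision. It ignores quantization block overhead and the KV cache, so real footprints run somewhat higher:

```python
# Why 7B-class models strain phones: weight memory alone, by precision.
# Ignores quantization block overhead and the KV cache, so actual
# footprints are somewhat larger than these figures.
PARAMS = 7e9  # Mistral-7B-class parameter count

for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    gib = PARAMS * bits / 8 / 2**30
    print(f"{name:>5}: ~{gib:.1f} GiB of weights")

# 4-bit: ~3.3 GiB of weights alone, which already strains the per-app
# memory budget many phones enforce. That is why much smaller models
# tend to be the practical choice for fully on-device apps.
```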