Yes, there are several small LLMs perfect for running locally on your phone with solid performance.
Top Picks
These models (under 4B params) fit in 4-8GB of RAM and run at 5-15 tokens/sec on modern devices like recent Pixels or iPhones.
| Model | Size | Best For |
| --- | --- | --- |
| Gemma 2B | ~1.4GB | Chat, quick responses |
| Phi-3 Mini | ~2.3GB | Reasoning, code snippets |
| TinyLlama | ~1.7GB | General tasks, efficient |
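As a rough sanity check on the sizes above: a quantized model's footprint is approximately params × bits-per-weight / 8, plus runtime overhead for the KV cache and inference engine. A quick sketch (the 1.2 overhead factor is an assumption for illustration, not a measured constant):

```python
# Back-of-envelope check that a quantized model fits in phone RAM.
# size ~= params * bits_per_weight / 8, plus overhead for KV cache and
# runtime; the 1.2 factor is an assumed fudge factor, not a measurement.
def quantized_size_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    return params_billion * bits / 8 * overhead

def fits(params_billion: float, bits: int, free_ram_gb: float) -> bool:
    return quantized_size_gb(params_billion, bits) <= free_ram_gb

# A ~2B-param model at 4-bit lands near the ~1.4GB figure in the table
print(f"{quantized_size_gb(2.0, 4):.2f} GB")       # -> 1.20 GB
print(fits(3.8, 4, free_ram_gb=4.0))               # Phi-3 Mini-sized model, 4GB budget -> True
```

The same arithmetic explains why 7B+ models are a stretch: even at 4-bit they push past 4GB once overhead is counted.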
How to Run
Grab MLC LLM or PocketPal from app stores, download quantized GGUF versions from Hugging Face, and load 'em up—no cloud needed. Start small to test speed!
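If you'd rather script it than use an app, a minimal load-and-generate sketch with llama-cpp-python (`pip install llama-cpp-python`) looks like this; the GGUF filename is a placeholder for whatever quantized model you downloaded from Hugging Face:

```python
# Sketch: run a quantized GGUF locally with llama-cpp-python.
# MODEL_PATH is a placeholder -- point it at any GGUF you've downloaded.
from pathlib import Path

MODEL_PATH = Path("gemma-2b-it.Q4_K_M.gguf")  # hypothetical filename

def generate(prompt: str, max_tokens: int = 64) -> str:
    # Imported lazily so the sketch still parses without the package installed.
    from llama_cpp import Llama
    llm = Llama(model_path=str(MODEL_PATH), n_ctx=2048, verbose=False)
    out = llm(prompt, max_tokens=max_tokens)
    return out["choices"][0]["text"]

if MODEL_PATH.exists():
    print(generate("Name three uses of a local LLM:"))
```

Keeping `n_ctx` small is part of "start small": a shorter context window cuts KV-cache memory, which matters more on a phone than on a desktop.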
Yep, those are exactly the models I tested.

They are impressive on their own, but moving from a chat demo to an offline, speech-based app is where the cracks show. I documented the full attempt here if you are interested: Offline AI on Mobile: Tackling Whisper, LLMs, and Size Roadblocks.

A few things that caught me out:

- Running Whisper and Phi-3 Mini together pushes memory harder than expected.
- Android asset handling gets painful fast once models get big.
- JNI plus llama.cpp works in theory, but debugging it is not fun.
- Tokens per second was not the main issue; latency spikes were.

Tools like MLC LLM and PocketPal definitely help, but shipping this inside a real app still meant choosing between speed, size, or quality. Never all three.

Feels like we are close, just not quite there yet for offline, speech-first experiences.
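The latency-spike point is easy to reproduce: an average tokens/sec figure can look fine while individual tokens stall. A minimal sketch of tracking per-token latency and reporting p95 alongside the mean; the `fake_stream` generator and its sleep timings are made up to simulate a periodically stalling model:

```python
# Average tokens/sec hides stalls; track per-token latency and report p95.
import statistics
import time

def per_token_latencies(token_iter):
    """Measure the wall-clock gap between consecutive tokens."""
    latencies = []
    last = time.perf_counter()
    for _ in token_iter:
        now = time.perf_counter()
        latencies.append(now - last)
        last = now
    return latencies

def p95(values):
    ordered = sorted(values)
    return ordered[int(0.95 * (len(ordered) - 1))]

def fake_stream(n=60):
    """Simulated token stream: mostly fast tokens, periodic stalls."""
    for i in range(n):
        time.sleep(0.002 if i % 10 else 0.1)  # every 10th token stalls
        yield "tok"

lat = per_token_latencies(fake_stream())
print(f"mean {statistics.mean(lat) * 1000:.0f} ms, p95 {p95(lat) * 1000:.0f} ms")
```

With a stream like this the mean stays low while p95 sits near the stall duration, which is exactly the gap between a benchmark number and how the app actually feels.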