TLDR: Modern Android flagships can run 7B parameter models locally. Here's the threshold, the app, and the one setting that matters.
The setup I tested:
Phone: ROG Phone 7 Ultimate, Snapdragon 8 Gen 2, 16GB RAM
App: Off Grid
Model: Qwen 3 4B, Q4_K_M quantization
Speed: 15–30 tokens per second
Use case: lightweight workflow triggers without touching cloud tokens
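To put that speed in perspective, here's a rough wall-clock estimate for replies of different lengths (the reply sizes are my own illustrative picks, not measurements from the post):

```python
# Wall-clock time for a reply at 15-30 tokens/sec.
# Reply lengths are illustrative assumptions.
for reply_tokens in (50, 200, 800):
    fastest = reply_tokens / 30  # seconds at 30 tok/s
    slowest = reply_tokens / 15  # seconds at 15 tok/s
    print(f"{reply_tokens:>4} tokens: ~{fastest:.0f}-{slowest:.0f} s")
```

A 50-token trigger confirmation lands in a couple of seconds; anything long-form starts to drag, which is why the use case stays lightweight.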
RAM thresholds
6GB — 1B to 3B models. Technically works. Not practically useful for anything beyond autocomplete.
8GB + Snapdragon 8 Gen 2 — 3B to 7B models. This is the useful tier (see the footprint sketch after this list).
12GB+ — 7B-class models (Llama 3.1 8B and similar) and Qwen 3 4B without thermal throttling.
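A back-of-envelope sketch of why those tiers fall where they do. The ~4.85 bits/weight figure is an approximate GGUF Q4_K_M average, and the OS reserve is my assumption, not something measured here:

```python
# Q4_K_M weight footprint by model size, plus an assumed reserve for
# Android and background apps. KV cache and runtime overhead come on top.
Q4_K_M_BPW = 4.85    # approximate average bits per weight for Q4_K_M
OS_RESERVE_GB = 3.0  # assumed headroom for the OS and other apps

for params_b in (1, 3, 4, 7):
    weights = params_b * 1e9 * Q4_K_M_BPW / 8 / 1e9
    total = weights + OS_RESERVE_GB
    print(f"{params_b}B: ~{weights:.1f} GB weights, ~{total:.1f} GB with reserve")
```

On those rough numbers a 7B Q4_K_M just squeezes into 8GB, which is why that tier works but 12GB+ is where it stops being tight.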
The app
Off Grid handles NPU routing automatically on supported Snapdragon hardware. Supports Qwen 3, Llama 3.2, Gemma 3, Phi-4, and any GGUF you want to import from local storage. First thing to do after install: go to settings, switch KV cache to q4_0. That's it. Biggest single performance gain you'll get.
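Why that setting moves the needle: quantizing the KV cache to q4_0 cuts the memory held per cached token by roughly 3.5x versus f16, which matters once context grows. A rough sketch, assuming a Qwen 3 4B-like attention layout (the layer and head counts are my assumptions, check the model card) and llama.cpp-style q4_0 cache storage at about 4.5 bits per element:

```python
# KV cache size: f16 vs q4_0, for an assumed Qwen 3 4B-like layout.
N_LAYERS = 36     # transformer blocks (assumed)
N_KV_HEADS = 8    # grouped-query attention KV heads (assumed)
HEAD_DIM = 128    # per-head dimension (assumed)

BYTES_PER_ELEM = {"f16": 2.0, "q4_0": 0.5625}  # q4_0: 4 bits + per-block scale

def kv_cache_gb(n_ctx: int, cache_type: str) -> float:
    """K and V tensors across all layers for n_ctx cached tokens."""
    elems = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * n_ctx
    return elems * BYTES_PER_ELEM[cache_type] / 1e9

for ctx in (4096, 16384):
    print(f"ctx={ctx}: f16 ~{kv_cache_gb(ctx, 'f16'):.2f} GB, "
          f"q4_0 ~{kv_cache_gb(ctx, 'q4_0'):.2f} GB")
```

At a 16K context that is the difference between roughly 2.4 GB and 0.7 GB of cache sitting next to the weights, which is most of an 8GB phone's remaining headroom.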
Google's AI Edge Gallery is the lower-friction entry point if you want to test the concept before committing: Gemma 3 on-device, minimal config, works on Android and iOS.
Quantization rule for mobile
Always Q4 or Q5. Full precision is for desktops with VRAM headroom. Q4_K_M gives you the majority of the model's capability at roughly a third of the full-precision memory footprint. The quality delta in everyday use is smaller than the numbers suggest.
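For scale, here is the approximate weight footprint of a 4B model at common GGUF quantization levels (the bits-per-weight averages are approximate, not exact):

```python
# Approximate weight footprint of a 4B-parameter model by quantization level.
PARAMS = 4e9
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

for quant, bpw in BITS_PER_WEIGHT.items():
    print(f"{quant:>7}: ~{PARAMS * bpw / 8 / 1e9:.1f} GB")
```

Q4_K_M at ~2.4 GB versus ~8 GB for F16 is the gap that makes on-device use viable at all.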
What it can't replace
Complex code review, multi-step reasoning across long context, and sustained conversation where the model needs to hold a lot of state: those still belong on the desktop or in the cloud. The phone model handles the first step. The pipeline handles the rest.
Full breakdown: https://engineeredai.net/run-local-llm-on-android-phone/