
EngineeredAI

Posted on • Originally published at engineeredai.net

Run a Local LLM on Android: What RAM Tier You Need and Which Models Actually Work

TLDR: Modern Android flagships can run 7B parameter models locally. Here's the threshold, the app, and the one setting that matters.

The setup I tested:

ROG Phone 7 Ultimate, Snapdragon 8 Gen 2, 16GB RAM. App: Off Grid. Model: Qwen 3 4B, Q4_K_M quantization. Speed: 15–30 tokens per second. Use case: lightweight workflow triggers without touching cloud tokens.
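Those throughput numbers translate directly into wait time. A quick sketch (the reply length is illustrative):

```python
def response_time_s(reply_tokens, tokens_per_s):
    # Seconds to stream a full reply at a given decode speed
    return reply_tokens / tokens_per_s

# A 150-token reply at the measured 15-30 tok/s range:
print(response_time_s(150, 15))  # 10.0 s at the slow end
print(response_time_s(150, 30))  # 5.0 s at the fast end
```

Five to ten seconds per short reply is fine for workflow triggers; it would be painful for long-form chat.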

RAM thresholds

6GB — 1B to 3B models. Technically works. Not practically useful for anything beyond autocomplete.
8GB + Snapdragon 8 Gen 2 — 3B to 7B models. This is the useful tier.
12GB+ — Llama 3.2 7B and Qwen 3 4B without thermal throttling.
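The tiers fall out of simple arithmetic. GGUF Q4-family quants store roughly 4.5–5 bits per weight; a sketch of the resulting weight footprint (the bits-per-weight figure is an approximation, and the runtime, KV cache, and OS need a few GB on top):

```python
def q4_weights_gb(params_billion, bits_per_weight=4.85):
    # Approximate in-RAM size of the quantized weights alone
    return params_billion * bits_per_weight / 8

for size in (1, 3, 4, 7):
    print(f"{size}B -> ~{q4_weights_gb(size):.1f} GB of weights")
```

A 7B model at Q4 needs about 4.2 GB for weights before cache and runtime overhead, which is why 8GB is the floor of the useful tier and 12GB buys headroom.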

The app

Off Grid handles NPU routing automatically on supported Snapdragon hardware. Supports Qwen 3, Llama 3.2, Gemma 3, Phi-4, and any GGUF you want to import from local storage. First thing to do after install: go to settings, switch KV cache to q4_0. That's it. Biggest single performance gain you'll get.
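Why that setting matters: the KV cache holds attention keys and values for every layer at every context position, and at f16 it can rival the weights for memory. A back-of-the-envelope sketch — the layer and head counts below are illustrative for a Qwen-3-4B-class model, not pulled from the app:

```python
def kv_cache_mb(layers, kv_heads, head_dim, ctx_tokens, bits_per_elem):
    # K and V tensors: 2 * layers * kv_heads * head_dim values per token
    bits = 2 * layers * kv_heads * head_dim * ctx_tokens * bits_per_elem
    return bits / 8 / 1024**2

cfg = dict(layers=36, kv_heads=8, head_dim=128, ctx_tokens=8192)
print(f"f16 cache:  {kv_cache_mb(**cfg, bits_per_elem=16):.0f} MB")
print(f"q4_0 cache: {kv_cache_mb(**cfg, bits_per_elem=4.5):.0f} MB")
```

q4_0 stores 4-bit values plus per-block scales, so the effective rate is about 4.5 bits per element — roughly a 3.5x cut in cache memory at the same context length.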
Google's AI Edge Gallery is the lower-friction entry point if you want to test the concept before committing: Gemma 3 on-device, minimal config, and it works on both Android and iOS.

Quantization rule for mobile

Always Q4 or Q5. Full precision is for desktops with VRAM headroom. Q4_K_M gives you the majority of the model's capability at roughly a third of the f16 memory footprint. The quality delta in everyday use is smaller than the numbers suggest.
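To put numbers on the trade-off, here is the precision ladder for a 7B model. The bits-per-weight values are rough figures for llama.cpp-style GGUF quants, not exact:

```python
# Approximate effective bits per weight for common GGUF precisions
BITS_PER_WEIGHT = {"f16": 16, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85}

def footprint_gb(params_billion, quant):
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for q in BITS_PER_WEIGHT:
    print(f"7B @ {q}: {footprint_gb(7, q):.1f} GB")
```

The jump from f16 to Q4_K_M takes a 7B model from ~14 GB to ~4.2 GB — the difference between "impossible on a phone" and "comfortable on the 8GB tier."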

What it can't replace

Complex code review, multi-step reasoning across long context, sustained conversation where the model needs to hold a lot of state: those still belong on the desktop or cloud. The phone model handles the first step. The pipeline handles the rest.
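That split can be made explicit in the pipeline itself. A hypothetical router — the task names and threshold are mine, not from any app:

```python
# Tasks that need long context or sustained multi-step state
HEAVY_TASKS = {"code_review", "multi_step_reasoning", "long_conversation"}

def route(task, prompt_tokens, local_ctx=8192):
    """Keep quick, stateless steps on-device; escalate heavy
    or long-context work to the desktop/cloud model."""
    if task in HEAVY_TASKS or prompt_tokens > local_ctx // 2:
        return "cloud"
    return "local"

print(route("workflow_trigger", 200))  # local
print(route("code_review", 200))       # cloud
```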

Full breakdown: https://engineeredai.net/run-local-llm-on-android-phone/
