
Mohammed Ali Chherawalla


How to Run LLMs Locally on Your Android Phone in 2026 (No Cloud, No Account)

Your Android phone has a GPU more powerful than most 2018 laptops. Modern Snapdragon chips have dedicated AI accelerators that sit idle while you pay $20/month to run AI on someone else's server. That's changing.

Off Grid is a free, open-source app that runs large language models entirely on your Android phone. No internet connection after the initial model download. No account. No data leaving your device. This guide covers how to set it up, which models to use, and what performance to expect on your specific hardware.

Play Store | GitHub

Off Grid Mobile

What You Need

Minimum hardware: 6GB RAM, ARM64 processor (any phone from the last 4 to 5 years). You can start with models as small as 80MB.

Recommended hardware: 8GB+ RAM, Snapdragon 8 Gen 2 or newer. This opens up 3B to 7B parameter models that produce genuinely useful output.

What you're giving up vs cloud AI: Cloud LLMs like ChatGPT and Claude run models with hundreds of billions of parameters on data center GPUs. Your phone runs smaller models (1B to 7B parameters). The output is less sophisticated for complex reasoning, but for everyday tasks like quick questions, summarization, drafting, and document analysis, it's surprisingly capable.

What Off Grid Can Do

Off Grid isn't just a text chatbot. It runs six AI capabilities in a single app, all on device:

Text generation. Run Qwen 3, Llama 3.2, Gemma 3, Phi-4, or any GGUF model. Streaming responses with markdown rendering. 15 to 30 tokens per second on flagship devices, 5 to 15 on mid-range.

Image generation. On-device Stable Diffusion with real-time preview. NPU-accelerated on Snapdragon (5 to 10 seconds per image). 20+ models including Absolute Reality, DreamShaper, and Anything V5.

Vision AI. Point your camera at something or attach an image and ask questions about it. SmolVLM and Qwen3-VL answer in about 7 seconds on flagship devices.

Voice transcription. On-device Whisper speech to text. Hold to record, real-time partial transcription. No audio ever leaves your phone.

Tool calling. Models that support function calling can use built-in tools: web search, calculator, date/time, device info. The model chains them together in an automatic loop with runaway prevention.
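To make the "automatic loop with runaway prevention" concrete, here's a minimal sketch of what such a loop looks like. This is not Off Grid's actual implementation; the `call_model` stub and tool names are hypothetical, standing in for a real model that decides whether to answer directly or request a tool.

```python
MAX_TOOL_CALLS = 5  # runaway prevention: hard cap on chained tool calls

# Hypothetical built-in tools, keyed by name.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def call_model(messages):
    # Stub: a real model would decide whether to answer or request a tool.
    last = messages[-1]["content"]
    if last == "What is 6 * 7?":
        return {"tool": "calculator", "args": "6 * 7"}
    return {"answer": f"The result is {last}."}

def chat(user_msg):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(MAX_TOOL_CALLS):
        reply = call_model(messages)
        if "answer" in reply:
            return reply["answer"]
        # Run the requested tool and feed the result back to the model.
        result = TOOLS[reply["tool"]](reply["args"])
        messages.append({"role": "tool", "content": result})
    return "Stopped: too many chained tool calls."
```

The cap matters because a model can get stuck requesting tools forever; without it, a single message could loop indefinitely and drain the battery.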

Document analysis. Attach PDFs, code files, CSVs, and more to your conversations.

Screenshots: Onboarding, Text Generation, Image Generation, Vision, Attachments

Which Models to Use

Off Grid's model browser filters by your device's RAM so you never download something your phone can't run. Here's what works on real hardware:

6GB RAM phones: Stick with 1B to 2B models. Qwen 3 0.6B or SmolLM3 are good starting points. Expect 5 to 10 tokens per second. Usable for short answers and simple tasks.

8GB RAM phones: The sweet spot. Qwen 3 1.5B or Phi-4 Mini give you noticeably better output quality. 10 to 20 tokens per second on recent Snapdragon chips.

12GB+ RAM phones: You can run 7B models like Llama 3.2 7B or Qwen 3 4B. These approach the quality of early ChatGPT for many tasks. 15 to 30 tokens per second on Snapdragon 8 Gen 3.
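The recommendations above boil down to a simple lookup by RAM tier. Here's an illustrative sketch of that mapping; the cutoffs and model names mirror the tiers above, but this is not Off Grid's actual filtering logic.

```python
# RAM tiers from largest to smallest; first match wins.
TIERS = [
    (12, "7B-class models (e.g. Llama 3.2 7B, Qwen 3 4B)"),
    (8,  "1.5B-3B models (e.g. Qwen 3 1.5B, Phi-4 Mini)"),
    (6,  "1B-2B models (e.g. Qwen 3 0.6B, SmolLM3)"),
]

def recommend(ram_gb):
    """Return a rough model-size recommendation for a device's RAM."""
    for min_ram, tier in TIERS:
        if ram_gb >= min_ram:
            return tier
    return "Below 6GB: stick to tiny models (80MB-500MB files)."
```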

Quantization matters. A Q4_K_M quantized model uses roughly a quarter of the memory of the same model at 16-bit precision, with minimal quality loss. Always go for Q4 or Q5 quantization on mobile.
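The memory savings are easy to estimate from bits per weight. A quick back-of-envelope calculation for an assumed 3B-parameter model (Q4_K_M averages roughly 4.5 bits per weight):

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate memory for model weights alone, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

f16_size   = weight_gb(3, 16)    # 16-bit weights: ~6.0 GB
q4_km_size = weight_gb(3, 4.5)   # Q4_K_M (~4.5 bits/weight): ~1.7 GB
```

That difference is what turns a model that won't load at all into one that runs comfortably on an 8GB phone.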

You can also import your own .gguf files from device storage if you already have models downloaded.

Hardware Acceleration

Off Grid automatically detects your hardware and uses the fastest available path:

Snapdragon 8 Gen 1+ with QNN: The dedicated Neural Processing Unit is significantly faster and more power efficient than CPU or GPU. If you have a Snapdragon 8 Gen 2 or Gen 3, this is the fastest path. Off Grid uses QNN automatically when available.

Adreno GPU via OpenCL: Available on most Snapdragon phones. Faster than CPU alone. Good fallback for older Snapdragon devices.

CPU only: Works on everything. Slower but usable for smaller models.

The KV Cache Trick That Triples Your Speed

This is the single biggest performance win most people miss. The KV cache stores the context of your conversation. By default it uses f16 (16-bit floating point). Off Grid lets you switch to q4_0 (4-bit quantization) in settings.

Going from f16 to q4_0 roughly triples inference speed with minimal quality impact on most models. The app nudges you to optimize after your first generation.
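Part of why this helps: a smaller cache means far less memory traffic per token, and on phones inference is usually memory-bandwidth bound. A rough sizing sketch, using assumed model dimensions (28 layers, 8 KV heads, head dimension 128; these are illustrative, not any specific model's real shape):

```python
def kv_cache_mb(layers, kv_heads, head_dim, ctx_tokens, bits_per_value):
    """Approximate KV cache size in MB: 2x for keys and values."""
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bits_per_value / 8 / 1e6

f16_cache  = kv_cache_mb(28, 8, 128, 4096, 16)   # ~470 MB at f16
q4_0_cache = kv_cache_mb(28, 8, 128, 4096, 4.5)  # ~132 MB at ~4.5 bits/value
```

A 3x+ smaller cache both frees RAM for the model itself and shrinks the per-token memory reads, which is where the speedup comes from.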

Memory: The Real Constraint

A phone with 8GB of RAM doesn't give you 8GB to work with: the OS and background apps use 3 to 4GB, leaving around 4GB available for inference.

A rough formula: model file size x 1.5 = actual RAM needed at runtime (the extra is for KV cache and activations). So a 4GB model file needs about 6GB of free RAM.
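The rule of thumb above is trivial to encode as a quick fit check:

```python
def runtime_ram_gb(model_file_gb, overhead=1.5):
    """Model file size x 1.5: the extra covers KV cache and activations."""
    return model_file_gb * overhead

def fits(model_file_gb, free_ram_gb):
    """Will the model load without the OS killing the app?"""
    return runtime_ram_gb(model_file_gb) <= free_ram_gb

# A 4GB model file needs about 6GB of free RAM, so it won't fit
# in the ~4GB typically free on an 8GB phone.
```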

Off Grid checks available RAM before every model load and warns you if a model won't fit, so instead of the OS silently killing the app for using too much memory, you get a human-readable warning.

Privacy: What "Local" Actually Means

Running a model locally means the computation happens on your phone's processor. After the initial model download from HuggingFace, Off Grid makes zero network requests. You can verify this by enabling airplane mode and using the app normally. Everything works.

Off Grid is open source and MIT licensed. You don't have to trust anyone's privacy claims. Read the code yourself. The app has no analytics, no telemetry, no tracking, no accounts.

For sensitive use cases like medical questions, legal document analysis, journaling, or work notes containing proprietary information, on-device AI removes the tradeoff between capability and privacy entirely.

Getting Started

  1. Install Off Grid from the Play Store
  2. Open the model browser and pick a recommended model for your device's RAM
  3. Download the model over WiFi (sizes range from 80MB to 4GB+)
  4. Turn on airplane mode to verify it works offline
  5. Start chatting

The first generation will be slower as the model loads into memory. Subsequent messages are faster. Go into settings and switch KV cache to q4_0 for the best speed.

What's Next

Qualcomm's next generation Snapdragon is expected to hit 200 tokens per second for on-device inference. Samsung's Galaxy S26 ships with built-in on-device AI. Model optimization techniques keep improving quality at smaller sizes.

Off Grid is under active development with new features shipping weekly. Tool calling, configurable KV cache, and vision support all shipped in the last month. Check the GitHub for the latest releases.

A year from now, running AI on your phone won't be a power user trick. It'll be the default.
