SoftwareDevs mvpfactory.io

Posted on • Originally published at mvpfactory.io

# Running LLMs On-Device in Android: GGUF Models, NNAPI, and the Real Performance Tradeoffs

---
title: "On-Device LLMs in Android: GGUF Models, NNAPI, and Real Performance Tradeoffs"
published: true
description: "A practical guide to shipping on-device LLM inference in production Android apps, covering GGUF quantization, NNAPI delegation, memory management, and benchmarking that reflects real user latency."
tags: android, kotlin, mobile, performance
canonical_url: https://blog.mvpfactory.co/on-device-llms-android-gguf-nnapi-performance-tradeoffs
---

## What You Will Learn

By the end of this guide, you will know how to pick the right quantization format for on-device LLM inference, build a chipset-aware backend selection strategy, manage memory pressure on mid-range Android hardware, and benchmark in a way that actually predicts what your users will experience. This comes from shipping to 200K+ devices — not from reading spec sheets.

## Prerequisites

- An Android project targeting API 26+
- Familiarity with Kotlin and Android lifecycle callbacks
- A physical test device (emulators will not give you meaningful numbers)
- A GGUF-format model (3B parameters or smaller for mobile)

## Step 1: Choose Your Quantization Format

This is the single highest-leverage decision you will make. Here is what I measured on a Pixel 8 Pro (Tensor G3) with a 3B parameter LLaMA-class model:

| Format | Model Size | RAM Usage | Tokens/sec | Perplexity | Cold Start |
|---|---|---|---|---|---|
| FP16 (baseline) | 6.0 GB | 7.2 GB | 2.1 | 8.2 | 14.3s |
| GGUF Q8_0 | 3.2 GB | 4.1 GB | 5.4 | 8.4 | 8.1s |
| **GGUF Q4_K_M** | **1.7 GB** | **2.1 GB** | **11.2** | **8.9** | **4.2s** |
| GGUF Q4_0 | 1.5 GB | 1.9 GB | 12.8 | 9.6 | 3.8s |

Let me show you a pattern I use in every project: start with Q4_K_M. It delivers 5x the throughput of FP16 with only an 8.5% perplexity increase. Q4_0 looks tempting on paper, but in A/B tests users reported noticeably more nonsensical completions. The perplexity gap does not capture how bad those feel in practice.
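The table above can be turned into a runtime picker: choose the lowest-perplexity format that fits the RAM the device can realistically spare. A minimal sketch; the RAM and perplexity figures come from my Pixel 8 Pro measurements, and `QuantFormat` / `pickFormat` are hypothetical names, not part of any library:

```kotlin
// Hypothetical picker over the benchmark table above: choose the
// lowest-perplexity (highest-quality) format that fits the RAM budget.
data class QuantFormat(val name: String, val ramGb: Double, val perplexity: Double)

val mobileFormats = listOf(
    QuantFormat("Q8_0", 4.1, 8.4),
    QuantFormat("Q4_K_M", 2.1, 8.9),
    QuantFormat("Q4_0", 1.9, 9.6),
)

// Returns null when nothing fits, i.e. fall back to a cloud endpoint.
fun pickFormat(ramBudgetGb: Double): QuantFormat? =
    mobileFormats.filter { it.ramGb <= ramBudgetGb }.minByOrNull { it.perplexity }
```

On a 6 GB device with roughly 2.2 GB to spare, this lands on Q4_K_M, matching the recommendation above; only below ~2 GB does Q4_0 become the last resort.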

## Step 2: Build a Chipset-Aware Backend Strategy

Here is the gotcha that will save you hours. Most teams benchmark on a Pixel and ship to the world. NNAPI vendor implementations vary wildly:

| Chipset | GPU Delegate | P95 Latency Variance |
|---|---|---|
| Snapdragon 8 Gen 2 | Adreno 740, solid | ±12% |
| Tensor G3 | Mali-G715, good | ±15% |
| Dimensity 9200 | Mali-G715, partial ops | ±38% |
| Exynos 2400 | Xclipse 940, inconsistent | ±52% |

The docs do not mention this, but that ±52% variance on Exynos will wreck your user experience. Users perceive variance as jank — they forgive steady-but-slower output far more than unpredictable stuttering. Here is the minimal setup to get this working:



```kotlin
// ChipsetInfo and Backend are app-side types, sketched here so the
// selection logic stands alone; the SoC-model strings are illustrative
// (e.g. Build.SOC_MODEL reports "SM8550" on Snapdragon 8 Gen 2).
data class ChipsetInfo(val socModel: String, val totalRamGb: Int) {
    fun isSnapdragon8Series() = socModel.startsWith("SM8")
    fun isTensorG3OrNewer() = socModel in setOf("Tensor G3", "Tensor G4")
}

enum class Backend { GPU, CPU }

fun selectBackend(chipset: ChipsetInfo): Backend = when {
    chipset.isSnapdragon8Series() -> Backend.GPU
    chipset.isTensorG3OrNewer() -> Backend.GPU
    chipset.totalRamGb >= 8 -> Backend.CPU // 4 threads, predictable
    else -> Backend.CPU // 2 threads, conservative
}
```


Build an allowlist for GPU delegation and default to CPU everywhere else. Predictable latency beats peak throughput every time.
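The thread counts mentioned in the snippet's comments can be made explicit too. A minimal sketch, assuming the same 8 GB threshold; `cpuThreadCount` is a hypothetical helper, and the "leave one core free" rule is my own convention for keeping the UI thread responsive:

```kotlin
// Sketch: pick an inference thread count for the CPU backend.
// 4 threads on 8 GB+ devices, 2 elsewhere; never claim every core.
fun cpuThreadCount(totalRamGb: Int, availableCores: Int): Int {
    val target = if (totalRamGb >= 8) 4 else 2
    // Cap at physical cores minus one so the UI thread keeps a core.
    return minOf(target, maxOf(1, availableCores - 1))
}
```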

## Step 3: Manage Memory on Mid-Range Hardware

63% of Android devices globally have 6 GB of RAM or less. After the OS takes its share, you often have 1.5–2 GB to work with. Three things that actually work in production:

1. **Memory-map the model file.** GGUF supports mmap natively, letting the OS page in weights on demand instead of loading the entire model into RAM.

2. **Monitor `onTrimMemory` aggressively.** Release KV cache at `TRIM_MEMORY_RUNNING_LOW` and unload the model entirely at `TRIM_MEMORY_COMPLETE`.

3. **Pre-warm selectively.** Load the model when the user navigates to the relevant feature, not at app start. Eager loading sounds smart until you are fighting the OS for memory before the user even needs inference.
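The trim policy from point 2 can be sketched as follows. The constants mirror Android's `ComponentCallbacks2` values so the logic runs self-contained here; `LlmSession` is a hypothetical wrapper around your inference runtime, not a real API:

```kotlin
// Values copied from android.content.ComponentCallbacks2 so this compiles
// outside an Android project.
const val TRIM_MEMORY_RUNNING_LOW = 10
const val TRIM_MEMORY_COMPLETE = 80

class LlmSession {
    var kvCacheLoaded = true
        private set
    var modelLoaded = true
        private set

    fun onTrimMemory(level: Int) {
        when (level) {
            TRIM_MEMORY_COMPLETE -> {
                // Backgrounded and next in line to be killed: unload everything.
                kvCacheLoaded = false
                modelLoaded = false
            }
            TRIM_MEMORY_RUNNING_LOW -> {
                // Keep the mmapped weights, drop the KV cache.
                kvCacheLoaded = false
            }
        }
    }
}
```

In a real app you would forward `Application.onTrimMemory` (or a registered `ComponentCallbacks2`) into this session object.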

## Step 4: Benchmark Like Your Users Live

Synthetic throughput on a fresh device with nothing running will make you feel great. Then your users will tell you the app is slow. Measure these instead:

- **Time-to-first-token (TTFT):** what users actually wait for. Target under 400ms.
- **P95 latency, not mean.** One bad inference ruins the session.
- **Thermal throttle recovery.** After 60 seconds of continuous inference, throughput drops 20–40%. Your benchmark must capture that tail.
- **Memory-pressure scenarios.** Run benchmarks with YouTube and Chrome in the background. That is what your users' phones actually look like.
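For the P95 point, the percentile math itself is tiny; collecting per-request latencies is up to your instrumentation. A sketch using the standard nearest-rank method:

```kotlin
import kotlin.math.ceil

// Nearest-rank P95: sort the samples, take the value at rank ceil(0.95 * n).
fun p95(latenciesMs: List<Long>): Long {
    require(latenciesMs.isNotEmpty()) { "need at least one sample" }
    val sorted = latenciesMs.sorted()
    val rank = ceil(0.95 * sorted.size).toInt()   // 1-based rank
    return sorted[rank - 1]
}
```

Report this number, not the mean; one 3-second stall buried in a fast average is exactly what P95 surfaces.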

## Gotchas

- **Q4_0 vs Q4_K_M:** The speed difference is marginal. The quality cliff is not. Always prefer Q4_K_M unless you have measured quality on your specific use case.
- **NNAPI on Samsung:** Exynos NNAPI implementations are unstable enough that I would go CPU-only on every Samsung device unless you can test on each specific SoC.
- **Pixel benchmarks are misleading:** Your Pixel 9 Pro results are irrelevant to 60%+ of your users. Run your benchmarks on a Redmi Note 13 with Spotify playing. That is your real performance floor.
- **Cold start on first install:** The first model load after install is significantly slower due to filesystem caching. Measure it separately.
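To measure that cold-start gotcha separately, a tiny timing helper is enough; time the first load after install (cold) and the next launch's load (warm) as two distinct metrics. `timedMs` is a sketch, and the model-load call it wraps is whatever your runtime exposes:

```kotlin
// Generic timing helper: run a block, return its result plus elapsed ms.
// Wrap the cold and warm model loads in this and report both numbers.
fun <T> timedMs(block: () -> T): Pair<T, Long> {
    val start = System.nanoTime()
    val result = block()
    return result to (System.nanoTime() - start) / 1_000_000
}
```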

## Wrapping Up

On-device LLM inference on Android is viable today, but only if you respect the constraints. Use GGUF Q4_K_M, build a chipset allowlist for GPU delegation, manage memory like it is scarce (because it is), and benchmark under real-world pressure. The gap between cloud and on-device latency is not just a speed difference — it unlocks entirely different interaction patterns. Autocomplete at 50ms feels like typing. At 500ms, it feels like waiting. That is a different product.

For further reading, check out the [MediaPipe LLM Inference API docs](https://developers.google.com/mediapipe/solutions/genai/llm_inference/android) and the [GGUF format specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md).
