---
title: "On-Device RAG for Android: Architecture Guide"
published: true
description: "Build a fully offline RAG pipeline on Android using ONNX Runtime embeddings, SQLite vector search, and local LLM inference with Jetpack Compose."
tags: android, kotlin, architecture, mobile
canonical_url: https://blog.mvpfactory.co/on-device-rag-for-android-architecture-guide
---
## What We're Building
A fully offline retrieval-augmented generation pipeline on Android. No server round-trips, no data leaving the device. By the end of this guide, you'll understand how to wire together quantized embedding models via ONNX Runtime Mobile, vector indexing in SQLite with sqlite-vec, smart chunking that respects mobile memory constraints, and a local LLM inference loop streaming tokens into a Compose UI.
Let me show you a pattern I use in every project that handles sensitive data — medical records, financial documents, legal contracts — where pushing embeddings to a cloud endpoint becomes a liability.
## Prerequisites
- Familiarity with Kotlin and Jetpack Compose
- Android Studio with NDK configured (for native sqlite-vec and llama.cpp bindings)
- A quantized ONNX embedding model (`all-MiniLM-L6-v2` exported to INT8)
- A GGUF-format local LLM (3B parameters, Q4-quantized)
## Step 1: Embedding with ONNX Runtime Mobile
The embedding model is the bottleneck that matters most. You need a model small enough to run in under 200ms per chunk but accurate enough for meaningful retrieval.
Take `all-MiniLM-L6-v2` (384-dimensional output, ~22M parameters), export it to ONNX, then apply INT8 dynamic quantization. This shrinks the model from ~90MB to ~23MB while preserving most retrieval quality.
```kotlin
class OnDeviceEmbedder(context: Context) {
    private val env = OrtEnvironment.getEnvironment()
    private val session: OrtSession = env.createSession(
        context.assets.open("minilm-quantized.onnx").readBytes(),
        OrtSession.SessionOptions().apply {
            addConfigEntry("session.intra_op.allow_spinning", "0")
            setIntraOpNumThreads(2)
        }
    )

    fun embed(text: String): FloatArray {
        val tokenized = tokenizer.encode(text)  // WordPiece tokenizer loaded alongside the model
        val inputTensor = OnnxTensor.createTensor(env, tokenized)
        val result = session.run(mapOf("input_ids" to inputTensor))
        return meanPooling(result)  // average token embeddings into one 384-dim vector
    }
}
```
Here's the gotcha that will save you hours: limit `intraOpNumThreads` to 2. Mobile CPUs thermal-throttle fast. Saturating all cores gives you a burst of speed followed by a cliff. Two threads sustains consistent throughput.
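The `meanPooling` step deserves a closer look: it averages the per-token embeddings into a single sentence vector, skipping padding positions. A minimal JVM-only sketch, assuming the ONNX output has already been unpacked into one `FloatArray` per token (the function name and shapes here are illustrative, not ONNX Runtime's API):

```kotlin
// Mean-pool per-token embeddings into a single sentence vector.
// tokenEmbeddings: one FloatArray of size `dim` per token;
// attentionMask: 1 for real tokens, 0 for padding.
fun meanPooling(tokenEmbeddings: List<FloatArray>, attentionMask: IntArray): FloatArray {
    val dim = tokenEmbeddings.first().size
    val pooled = FloatArray(dim)
    var count = 0
    for ((i, vec) in tokenEmbeddings.withIndex()) {
        if (attentionMask[i] == 0) continue  // skip padding positions
        for (d in 0 until dim) pooled[d] += vec[d]
        count++
    }
    if (count > 0) for (d in 0 until dim) pooled[d] /= count
    return pooled
}
```

Mean pooling with the attention mask is what the sentence-transformers export of `all-MiniLM-L6-v2` expects; pooling over padding tokens silently degrades retrieval quality.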
## Step 2: Vector Search with sqlite-vec
sqlite-vec, Alex Garcia's successor to sqlite-vss, is a zero-dependency, single-C-file SQLite extension for vector search. Unlike sqlite-vss, which pulled in Faiss, sqlite-vec gives you a smaller binary, a simpler build, and fewer native-library headaches on mobile.
```sql
CREATE VIRTUAL TABLE doc_embeddings USING vec0(
  chunk_id INTEGER PRIMARY KEY,
  embedding FLOAT[384]
);

-- Query: find top-5 nearest chunks
SELECT chunk_id, distance
FROM doc_embeddings
WHERE embedding MATCH ?
ORDER BY distance
LIMIT 5;
```
For corpora under 50,000 chunks — which covers most on-device use cases — brute-force search runs in single-digit milliseconds. You get SQLite's full transactional guarantees for free: atomic writes, crash recovery, single-file portability.
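To bind a Kotlin `FloatArray` to that `MATCH ?` parameter, it has to be serialized into a format sqlite-vec understands: JSON text or a compact blob of little-endian float32s. A sketch of the blob encoding (the binding call itself depends on your SQLite wrapper; stock Android `SQLiteDatabase` can't load extensions, so something like requery's sqlite-android is assumed):

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Serialize an embedding as a little-endian float32 blob,
// the compact binary vector format sqlite-vec accepts.
fun FloatArray.toVecBlob(): ByteArray {
    val buf = ByteBuffer.allocate(size * 4).order(ByteOrder.LITTLE_ENDIAN)
    for (v in this) buf.putFloat(v)
    return buf.array()
}
```

Bind the result as a blob parameter of the `SELECT ... WHERE embedding MATCH ?` query, and the same encoding works for inserts into the `vec0` table.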
## Step 3: Chunking for Mobile Memory
Chunking parameters that work on a server will blow your memory budget on a phone. A 512-token chunk with 50-token overlap is fine with 64GB of RAM on a server. On a device with 6–8GB shared between OS, app, and background processes, you need discipline:
- **Chunk size:** 256 tokens max
- **Overlap:** 32 tokens (12.5%)
- **Strategy:** Sentence-boundary-aware splitting, never breaking mid-sentence
- **Budget:** Keep total indexed corpus under 10,000 chunks (~15–20MB SQLite DB)
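The strategy above can be sketched with the JDK's `BreakIterator` for sentence boundaries. This is a simplified illustration: token counts are approximated by word counts (a production version would count with the embedding model's actual tokenizer), and the default chunk and overlap sizes match the budget above:

```kotlin
import java.text.BreakIterator

// Sentence-boundary-aware chunking: pack whole sentences into chunks,
// never splitting mid-sentence, carrying trailing sentences forward as overlap.
fun chunk(text: String, maxTokens: Int = 256, overlapTokens: Int = 32): List<String> {
    // Split into sentences using locale-aware boundaries.
    val it = BreakIterator.getSentenceInstance()
    it.setText(text)
    val sentences = mutableListOf<String>()
    var start = it.first()
    var end = it.next()
    while (end != BreakIterator.DONE) {
        sentences += text.substring(start, end).trim()
        start = end
        end = it.next()
    }
    // Crude token estimate: whitespace-separated words.
    fun tokens(s: String) = s.split(Regex("\\s+")).count { it.isNotEmpty() }

    val chunks = mutableListOf<String>()
    val current = mutableListOf<String>()
    var budget = 0
    for (s in sentences) {
        val t = tokens(s)
        if (budget + t > maxTokens && current.isNotEmpty()) {
            chunks += current.joinToString(" ")
            // Keep trailing sentences within the overlap budget for the next chunk.
            var carry = 0
            val overlap = current.takeLastWhile { carry += tokens(it); carry <= overlapTokens }
            current.clear()
            current += overlap
            budget = overlap.sumOf { tokens(it) }
        }
        current += s
        budget += t
    }
    if (current.isNotEmpty()) chunks += current.joinToString(" ")
    return chunks
}
```

A sentence longer than `maxTokens` becomes its own oversized chunk rather than being split, which honors the never-break-mid-sentence rule at the cost of an occasional outlier.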
## Step 4: Wiring the Inference Loop
The retrieval-to-generation pipeline in a coroutine-based architecture:
```kotlin
fun ragQuery(query: String): Flow<String> = flow {
    val queryVector = embedder.embed(query)
    val topChunks = vectorDb.search(queryVector, k = 5)
    val context = topChunks.joinToString("\n\n") { it.text }
    val prompt = """
        |Given the following context, answer the question.
        |Context: $context
        |Question: $query
    """.trimMargin()
    localLlm.generate(prompt).collect { token ->
        emit(token)
    }
}
```
On the Compose side, collect this flow into a `mutableStateOf` string. Each emitted token appends to the displayed text, giving users that streaming feel without any network dependency.
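The append-as-you-collect pattern can be sketched independently of Compose. In the UI, this state would be a `mutableStateOf` updated from a `LaunchedEffect` collecting `ragQuery(query)`; the accumulator itself is framework-free (class and method names here are assumptions for illustration):

```kotlin
// Accumulates streamed tokens into a growing answer string, mirroring
// how a Compose mutableStateOf would be updated per emitted token.
class StreamingAnswerState {
    var text: String = ""
        private set

    fun onToken(token: String) { text += token }  // called for each emitted token
    fun reset() { text = "" }                     // called when a new query starts
}
```

In the actual UI, `onToken` runs inside `ragQuery(query).collect { ... }` launched from `LaunchedEffect(query)`, and a `Text(state.text)` recomposes on each append.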
## Performance Expectations
| Operation | Typical Latency (Flagship SoC) |
|---|---|
| Embed single chunk (INT8, 256 tokens) | 30–80ms |
| Vector search (10K chunks, brute-force) | 2–8ms |
| LLM first token (3B param, Q4 quantized) | 500ms–1.5s |
| LLM token throughput | 8–15 tokens/sec |
These ranges reflect recent Snapdragon 8-series and Tensor G-series hardware. Mid-range chipsets will sit at the slower end or beyond.
## Gotchas
- **Start with embeddings, not the LLM.** Retrieval quality gates everything downstream. A mediocre LLM with excellent retrieval outperforms a strong LLM with poor retrieval. Get your embedding pipeline right first, measure recall, then layer in generation.
- **Use sqlite-vec over sqlite-vss for mobile.** Zero dependencies, smaller binary, simpler cross-compilation. For realistic on-device corpus sizes, brute-force search is fast enough. You don't need HNSW complexity on a phone.
- **The docs don't mention this, but** thermal throttling is your true constraint, not peak FLOPS. Cap ONNX Runtime threads, batch embedding work with delays between batches, and profile on real mid-range devices — not just your development flagship.
- **Don't port server-side chunking directly.** This is where most teams go wrong. Respect the memory envelope or you'll learn the hard way via OOM kills.
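The batching advice above reduces to a simple throttle-friendly loop: embed a batch, pause, continue. A minimal sketch, where batch size and cooldown are assumptions to tune against your device's thermal profile (in a coroutine you would use `delay` rather than `Thread.sleep`):

```kotlin
// Index items in small batches with a cooldown between batches,
// trading peak throughput for sustained, throttle-free performance.
fun <T, R> indexInBatches(
    items: List<T>,
    batchSize: Int = 16,
    cooldownMs: Long = 250,
    embed: (T) -> R
): List<R> {
    val results = mutableListOf<R>()
    for ((i, batch) in items.chunked(batchSize).withIndex()) {
        if (i > 0) Thread.sleep(cooldownMs)  // let the SoC cool between bursts
        batch.mapTo(results) { embed(it) }
    }
    return results
}
```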
## Wrapping Up
On-device RAG is viable today on flagship and even mid-range Android hardware. The stack — ONNX Runtime Mobile for embeddings, sqlite-vec for vector search, and llama.cpp for local generation — gives you a privacy-first pipeline with no cloud dependency. Start small, measure your embedding recall, respect the thermal envelope, and you'll have a production-ready architecture that keeps sensitive data exactly where it belongs: on the device.