
Sagar Gupta

Originally published at urjalabs.in

I Built NativeLM for Android (And Bypassed OEM RAM Lies to Do It)

Running large language models on-device is the ultimate answer to privacy. But what good is an LLM if it doesn't know about your private data?

I wanted a fully offline AI assistant — an app where I could import my own PDFs and notes, and ask a local model questions about them without a single byte leaving my phone.

So I built it. litertlm-kmp is a Kotlin Multiplatform wrapper around Google's LiteRT-LM, the on-device LLM runtime built on LiteRT (the rebranded TensorFlow Lite). Its companion app, NativeLM, is a fully on-device document RAG pipeline for Android running Gemma 4.

Here is how I assembled the local RAG pipeline, and the massive OEM memory bug I had to solve to ship it to production.


The Pipeline: Fully On-Device RAG

A traditional cloud RAG architecture sends documents to a server to be chunked, calls the OpenAI Embeddings API, stores vectors in Pinecone, and retrieves them for an OpenAI chat completion.

NativeLM's pipeline does all of this locally on the phone.

1. Import and Embed (USE-Lite)

When you import a PDF, NativeLM uses PDFBox to extract the text and splits it into 500-character chunks. Each chunk is then embedded with MediaPipe's TextEmbedder running the Universal Sentence Encoder Lite (USE-Lite) model. At ~6 MB the model is very lightweight, and its 100-dimensional embeddings keep the vector store small enough for mobile memory constraints.
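The chunking step can be sketched in plain Kotlin. This is a minimal illustration, not NativeLM's actual code: the `chunkText` helper and the whitespace normalization are assumptions of mine, and only the 500-character size comes from the description above.

```kotlin
// Hypothetical helper (not from NativeLM): fixed-size character chunking.
fun chunkText(text: String, chunkSize: Int = 500): List<String> {
    require(chunkSize > 0) { "chunkSize must be positive" }
    // Assumption: collapse whitespace runs so PDF line breaks don't waste chunk space.
    val normalized = text.trim().replace(Regex("\\s+"), " ")
    if (normalized.isEmpty()) return emptyList()
    // chunked() yields consecutive slices; only the last one may be shorter.
    return normalized.chunked(chunkSize)
}

fun main() {
    val sample = "word ".repeat(250) // 1250 characters, 1249 after trimming
    println(chunkText(sample).map { it.length }) // prints [500, 500, 249]
}
```

Fixed-size slicing can cut a sentence in half. Overlapping chunks or sentence-aware splitting are common refinements, at the cost of more vectors to store.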

2. Vector Search (ObjectBox HNSW)

The embeddings need to be queried instantly during a chat. We used ObjectBox, which natively supports HNSW (Hierarchical Navigable Small World) vector search on edge devices.

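import io.objectbox.annotation.Entity
import io.objectbox.annotation.HnswIndex
import io.objectbox.annotation.Id

@Entity
data class DocumentChunkEntity(
    @Id var id: Long = 0,       // assigned by ObjectBox on put()
    var documentId: Long = 0,   // the imported document this chunk belongs to
    var text: String = "",      // raw chunk text, shown later as a citation

    // Must match the embedder's output size (USE-Lite produces 100 floats)
    @HnswIndex(dimensions = 100)
    var embedding: FloatArray? = null
)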

3. Grounding Gemma

When a user asks a question, we embed the query using the USE-Lite model, run a kNN search against ObjectBox to retrieve the top matching chunks, and inject them into Gemma's prompt. Gemma answers the user's question using only the provided context, and the UI renders the retrieved chunks as citations.
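To make the retrieve-then-ground flow concrete, here is a self-contained sketch. It is not NativeLM's code: `Chunk`, `topK`, and `buildGroundedPrompt` are hypothetical names, the exact cosine-similarity scan is a brute-force stand-in for the ObjectBox HNSW query (which returns approximate nearest neighbors much faster), and the prompt wording is only an assumed template.

```kotlin
import kotlin.math.sqrt

// Hypothetical stand-in for a stored chunk; mirrors the fields of DocumentChunkEntity.
data class Chunk(val id: Long, val text: String, val embedding: FloatArray)

fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "dimension mismatch" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    val denom = sqrt(normA) * sqrt(normB)
    return if (denom == 0f) 0f else dot / denom
}

// Exact (brute-force) top-k; an HNSW index approximates this result without a full scan.
fun topK(query: FloatArray, chunks: List<Chunk>, k: Int): List<Chunk> =
    chunks.sortedByDescending { cosineSimilarity(query, it.embedding) }.take(k)

// Assumed prompt template; the wording is illustrative only.
fun buildGroundedPrompt(question: String, context: List<Chunk>): String = buildString {
    appendLine("Answer using only the context below. If the answer is not there, say so.")
    context.forEachIndexed { i, chunk -> appendLine("[${i + 1}] ${chunk.text}") }
    appendLine()
    append("Question: $question")
}

fun main() {
    // Toy 2-dimensional vectors; real USE-Lite embeddings have 100 dimensions.
    val chunks = listOf(
        Chunk(1, "Invoices are due in 30 days.", floatArrayOf(1f, 0f)),
        Chunk(2, "The office is closed on Sunday.", floatArrayOf(0f, 1f)),
        Chunk(3, "Late invoices incur a 2% fee.", floatArrayOf(0.9f, 0.1f))
    )
    val hits = topK(floatArrayOf(1f, 0f), chunks, k = 2)
    println(buildGroundedPrompt("When are invoices due?", hits))
}
```

Numbering the chunks in the prompt gives the UI a simple way to map the model's answer back to the retrieved passages it shows as citations.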


The Trap: OEM RAM-Expansion Lies

The RAG pipeline was working perfectly, but the app kept crashing on Xiaomi, Realme, and OPPO devices during model loading.

These OEMs have features called "Memory Extension" or "Dynamic RAM Expansion" that use swap-to-flash to artificially inflate the device's reported RAM. A phone with 6GB of physical RAM will report 8GB or even 10GB to the operating system.

If you use the standard Android API to check available memory:

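// Assumes `context` is an android.content.Context (imports: ActivityManager, Context)
val activityManager = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
val memInfo = ActivityManager.MemoryInfo()
activityManager.getMemoryInfo(memInfo)
val totalRam = memInfo.totalMem // bytes; LIES on Xiaomi/Realme/OPPO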

...you'll get the inflated number. Your model-loading code sees 8GB of total RAM, decides it's safe to load a 4GB model, and the kernel OOM-kills your process because the actual physical RAM can't handle it.

The Fix: Bypass the OS

I wrote a hardware tiering system that reads /proc/meminfo directly:

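import java.io.File

private fun detectVirtualRamExpansion(): Boolean {
    // /proc/meminfo lines look like "SwapTotal:   3145724 kB"
    val memInfo = runCatching { File("/proc/meminfo").readText() }
        .getOrNull() ?: return false // unreadable: assume no expansion
    val swapTotalKb = memInfo.lineSequence()
        .firstOrNull { it.startsWith("SwapTotal:") }
        ?.split("\\s+".toRegex())
        ?.getOrNull(1)?.toLongOrNull() ?: 0L

    // SwapTotal above 1 GB (1_048_576 kB) is treated as OEM RAM expansion
    return swapTotalKb > 1_048_576L
}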

When swap is detected above 1GB, the library forcibly downgrades the hardware tier. A device reporting 8GB with 3GB of swap gets classified as a 5GB device, and the model catalog offers the smaller Gemma variant instead.
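The downgrade arithmetic can be sketched as follows. This is a minimal sketch under stated assumptions: `readMeminfoKb`, `effectiveRamGb`, `HardwareTier`, and the tier cut-offs are hypothetical names and values chosen for illustration. Only the 1GB swap threshold and the "8GB minus 3GB of swap is a 5GB device" rule come from the text above.

```kotlin
const val KB_PER_GB = 1_048_576L

// Parse a "Key:   12345 kB" line from /proc/meminfo-style text; returns kB or null.
fun readMeminfoKb(meminfo: String, key: String): Long? =
    meminfo.lineSequence()
        .map { it.trim() }
        .firstOrNull { it.startsWith("$key:") }
        ?.split(Regex("\\s+"))
        ?.getOrNull(1)
        ?.toLongOrNull()

// Subtract swap from the reported total once swap exceeds the 1 GB threshold.
fun effectiveRamGb(reportedTotalKb: Long, swapTotalKb: Long): Double {
    val usableKb =
        if (swapTotalKb > KB_PER_GB) reportedTotalKb - swapTotalKb else reportedTotalKb
    return usableKb.coerceAtLeast(0L).toDouble() / KB_PER_GB
}

// Hypothetical tiers and cut-offs; the real catalog's thresholds are not stated in the post.
enum class HardwareTier { LOW, MID, HIGH }

fun tierFor(effectiveGb: Double): HardwareTier = when {
    effectiveGb >= 7.5 -> HardwareTier.HIGH
    effectiveGb >= 4.5 -> HardwareTier.MID
    else -> HardwareTier.LOW
}

fun main() {
    // Sample text standing in for /proc/meminfo on a device with 3 GB of swap.
    val sample = """
        MemTotal:        8388608 kB
        SwapTotal:       3145728 kB
    """.trimIndent()
    val swapKb = readMeminfoKb(sample, "SwapTotal") ?: 0L
    val gb = effectiveRamGb(reportedTotalKb = 8L * KB_PER_GB, swapTotalKb = swapKb)
    println("$gb GB -> ${tierFor(gb)}") // prints 5.0 GB -> MID
}
```

Keeping the parsing separate from the tier decision means the logic can be unit-tested with sample meminfo text instead of a physical device.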

This single fix eliminated every OOM crash on my Xiaomi and Realme test devices.


Stateful KV-Cache Sessions

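Another major issue with local AI: chat gets slower every turn. Most apps re-send the entire conversation transcript to the model on every turn, so the prompt that has to be processed before the first output token keeps growing, and time-to-first-token (TTFT) degrades roughly linearly with conversation length.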

LiteRT-LM supports keeping a KV-cache alive across turns via openChatSession(). The cache stores the key-value attention states from all prior turns, so the model only needs to process the new tokens.

// Open a persistent session — KV-cache persists across turns
val session = engine.openChatSession()

// Turn 1: full processing
session.sendMessage("What is this document about?")

// Turn 5: only processes the NEW prompt, not the full history
session.sendMessage("Give me more details.")

I built a session manager that handles the tight constraints of local KV caching transparently. Result: TTFT stays flat regardless of conversation length, with generation running at ~20 tok/s on a Snapdragon 8 Gen 2.
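A back-of-the-envelope model shows why a persistent cache keeps TTFT flat. The sketch below only counts how many tokens must be prefilled before the first output token of each turn. The `Turn` type and the token counts are made up for illustration, and it deliberately ignores context-window limits and cache eviction.

```kotlin
// Illustrative model only: tokens contributed by one conversation turn.
data class Turn(val promptTokens: Int, val replyTokens: Int)

// Stateless chat: every earlier prompt and reply is re-sent along with the new prompt.
fun statelessPrefill(history: List<Turn>, turnIndex: Int): Int =
    history.take(turnIndex).sumOf { it.promptTokens + it.replyTokens } +
        history[turnIndex].promptTokens

// Stateful chat: earlier turns are already in the KV-cache, so only the new prompt is processed.
fun statefulPrefill(history: List<Turn>, turnIndex: Int): Int =
    history[turnIndex].promptTokens

fun main() {
    // Made-up sizes: 40-token questions, 160-token answers.
    val history = List(5) { Turn(promptTokens = 40, replyTokens = 160) }
    for (i in history.indices) {
        println(
            "turn ${i + 1}: stateless=${statelessPrefill(history, i)} " +
                "stateful=${statefulPrefill(history, i)}"
        )
    }
}
```

With these numbers the stateless path prefills 840 tokens on turn 5 while the stateful path still prefills 40, which is the linear growth versus flat cost described above.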


The Result: NativeLM v0.4.0

The latest release of NativeLM exercises the full library — onboarding, model management, stateful KV-cache sessions, and the brand new fully offline Document RAG feature.

Everything is open source: github.com/sagar-develop/litertlm-kmp

If you want to try it on your device: Download the v0.4.0 APK

For more technical deep dives on shipping on-device AI, check out the Urja Labs blog.

I'd love to hear what problems you've hit running local AI on mobile. Drop a comment or open an issue on the repo.


Built at Urja Labs. Dual-licensed: AGPL-3.0 for open-source, commercial license available for proprietary distribution.

Follow the journey: LinkedIn · X / Twitter
