<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sagar Gupta</title>
    <description>The latest articles on DEV Community by Sagar Gupta (@sagar_gupta_35066c051032b).</description>
    <link>https://dev.to/sagar_gupta_35066c051032b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3962794%2F663c0422-6271-42a9-8392-8f9370a0f885.jpg</url>
      <title>DEV Community: Sagar Gupta</title>
      <link>https://dev.to/sagar_gupta_35066c051032b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sagar_gupta_35066c051032b"/>
    <language>en</language>
    <item>
      <title>I Built NativeLM for Android (And Bypassed OEM RAM Lies to Do It)</title>
      <dc:creator>Sagar Gupta</dc:creator>
      <pubDate>Mon, 01 Jun 2026 13:42:40 +0000</pubDate>
      <link>https://dev.to/sagar_gupta_35066c051032b/i-built-nativelm-for-android-and-bypassed-oem-ram-lies-to-do-it-27lg</link>
      <guid>https://dev.to/sagar_gupta_35066c051032b/i-built-nativelm-for-android-and-bypassed-oem-ram-lies-to-do-it-27lg</guid>
      <description>&lt;p&gt;Running large language models on-device is the ultimate answer to privacy. But what good is an LLM if it doesn't know about &lt;em&gt;your&lt;/em&gt; private data?&lt;/p&gt;

&lt;p&gt;I wanted a fully offline AI assistant — an app where I could import my own PDFs and notes, and ask a local model questions about them without a single byte leaving my phone.&lt;/p&gt;

&lt;p&gt;So I built it. &lt;a href="https://github.com/sagar-develop/litertlm-kmp" rel="noopener noreferrer"&gt;litertlm-kmp&lt;/a&gt; is a Kotlin Multiplatform wrapper around Google's LiteRT-LM, the LLM runtime built on LiteRT (the rebranded TensorFlow Lite). Its companion app, &lt;strong&gt;NativeLM&lt;/strong&gt;, is a fully on-device Document RAG pipeline for Android running Gemma 4.&lt;/p&gt;

&lt;p&gt;Here is how I assembled the local RAG pipeline, and the massive OEM memory bug I had to solve to ship it to production.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline: Fully On-Device RAG
&lt;/h2&gt;

&lt;p&gt;A traditional cloud RAG architecture sends documents to a server to be chunked, calls the OpenAI Embeddings API, stores vectors in Pinecone, and retrieves them for an OpenAI chat completion.&lt;/p&gt;

&lt;p&gt;NativeLM's pipeline does all of this locally on the phone.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Import and Embed (USE-Lite)
&lt;/h3&gt;

&lt;p&gt;When you import a PDF, NativeLM uses PDFBox to extract the text and splits it into 500-character chunks. Each chunk is then embedded with &lt;strong&gt;MediaPipe's TextEmbedder&lt;/strong&gt; running the Universal Sentence Encoder Lite (USE-Lite) model. The model is small (~6 MB) and produces 100-dimensional embeddings, a good fit for mobile memory constraints.&lt;/p&gt;
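
&lt;p&gt;A minimal sketch of the chunking step, under stated assumptions: &lt;code&gt;chunkText&lt;/code&gt; is a hypothetical helper rather than NativeLM's actual code, and the whitespace normalization is an assumption. Only the 500-character chunk size comes from the pipeline described above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// Hypothetical helper: fixed-size character chunking of extracted PDF text.
fun chunkText(text: String, chunkSize: Int = 500): List&amp;lt;String&amp;gt; {
    require(chunkSize &amp;gt; 0) { "chunkSize must be positive" }
    // Assumption: collapse whitespace runs so PDF line breaks don't waste chunk space.
    val normalized = text.replace(Regex("\\s+"), " ").trim()
    // chunked() returns consecutive slices; the last one may be shorter than chunkSize.
    return normalized.chunked(chunkSize)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A 1,200-character document yields three chunks of 500, 500, and 200 characters. Fixed-size slicing can cut a sentence in half, so overlapping adjacent chunks is a common refinement.&lt;/p&gt;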

&lt;h3&gt;
  
  
  2. Vector Search (ObjectBox HNSW)
&lt;/h3&gt;

&lt;p&gt;The embeddings need to be queried instantly during a chat. We used &lt;strong&gt;ObjectBox&lt;/strong&gt;, which natively supports HNSW (Hierarchical Navigable Small World) vector search on edge devices.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Entity&lt;/span&gt;
&lt;span class="kd"&gt;data class&lt;/span&gt; &lt;span class="nc"&gt;DocumentChunkEntity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nd"&gt;@Id&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="nd"&gt;@HnswIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dimensions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;FloatArray&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
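
&lt;p&gt;To make clear what the index buys you, here is a brute-force nearest-neighbour search in plain Kotlin. This is a sketch of the exact search that HNSW approximates, not ObjectBox's API. &lt;code&gt;ScoredChunk&lt;/code&gt;, &lt;code&gt;squaredDistance&lt;/code&gt;, and &lt;code&gt;bruteForceNearest&lt;/code&gt; are hypothetical names, and squared Euclidean distance is an illustrative choice of metric.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// Hypothetical result type for the sketch.
data class ScoredChunk(val text: String, val distance: Float)

// Squared Euclidean distance between two vectors of the same dimension.
fun squaredDistance(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "dimension mismatch" }
    var sum = 0f
    for (i in a.indices) {
        val d = a[i] - b[i]
        sum += d * d
    }
    return sum
}

// Exact kNN: score every chunk, sort by distance, keep the k closest.
// Cost grows linearly with the number of chunks, which is what an HNSW index avoids.
fun bruteForceNearest(
    query: FloatArray,
    chunks: List&amp;lt;Pair&amp;lt;String, FloatArray&amp;gt;&amp;gt;,
    k: Int
): List&amp;lt;ScoredChunk&amp;gt; =
    chunks.map { (text, vec) -&amp;gt; ScoredChunk(text, squaredDistance(query, vec)) }
        .sortedBy { it.distance }
        .take(k)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For a handful of notes the linear scan is fine. Once a library of PDFs produces thousands of chunks, an approximate index keeps query time low at the cost of occasionally missing a true nearest neighbour.&lt;/p&gt;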



&lt;h3&gt;
  
  
  3. Grounding Gemma
&lt;/h3&gt;

&lt;p&gt;When a user asks a question, we embed the query using the USE-Lite model, run a kNN search against ObjectBox to retrieve the top matching chunks, and inject them into Gemma's prompt. Gemma answers the user's question &lt;em&gt;using only the provided context&lt;/em&gt;, and the UI renders the retrieved chunks as citations.&lt;/p&gt;
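
&lt;p&gt;The injection step can be sketched as plain string assembly. &lt;code&gt;buildGroundedPrompt&lt;/code&gt; and the instruction wording are hypothetical, not the prompt NativeLM actually ships. The numbering is there so that each retrieved chunk can be matched to a citation in the UI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// Hypothetical helper: wrap the retrieved chunks and the user's question in one prompt.
fun buildGroundedPrompt(question: String, retrievedChunks: List&amp;lt;String&amp;gt;): String = buildString {
    appendLine("Answer the question using only the numbered context below.")
    appendLine("If the context does not contain the answer, say so.")
    appendLine()
    // Number each chunk so an answer can be traced back to its source.
    retrievedChunks.forEachIndexed { i, chunk -&amp;gt;
        appendLine("[${i + 1}] ${chunk.trim()}")
    }
    appendLine()
    append("Question: ").append(question.trim())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Telling the model to admit when the context lacks an answer is a common guard against hallucinated replies, though a small on-device model may not always follow it.&lt;/p&gt;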




&lt;h2&gt;
  
  
  The Trap: OEM RAM-Expansion Lies
&lt;/h2&gt;

&lt;p&gt;The RAG pipeline was working perfectly, but the app kept crashing on Xiaomi, Realme, and OPPO devices during model loading.&lt;/p&gt;

&lt;p&gt;These OEMs have features called "Memory Extension" or "Dynamic RAM Expansion" that use swap-to-flash to artificially inflate the device's reported RAM. A phone with 6GB of physical RAM can report 8GB or even 10GB to the operating system.&lt;/p&gt;

&lt;p&gt;If you use the standard Android API to check available memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;memInfo&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ActivityManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MemoryInfo&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;activityManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getMemoryInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memInfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;totalRam&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memInfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;totalMem&lt;/span&gt; &lt;span class="c1"&gt;// LIES on Xiaomi/Realme/OPPO&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;...you'll get the inflated number. Your model-loading code sees 8GB of total RAM, decides it's safe to load a 4GB model, and the kernel OOM-kills your process because the model does not fit in the physical RAM that is actually there.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix: Bypass the OS
&lt;/h3&gt;

&lt;p&gt;I wrote a hardware tiering system that reads &lt;code&gt;/proc/meminfo&lt;/code&gt; directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;detectVirtualRamExpansion&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;Boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;memInfo&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/proc/meminfo"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;readText&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;swapTotal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memInfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SwapTotal:"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"\\s+"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toRegex&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;getOrNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;toLongOrNull&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="mi"&gt;0L&lt;/span&gt;

    &lt;span class="c1"&gt;// If SwapTotal &amp;gt; 1GB, the OEM is using RAM expansion&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;swapTotal&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1_048_576&lt;/span&gt; &lt;span class="c1"&gt;// kB&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;SwapTotal&lt;/code&gt; exceeds 1GB, the library forcibly downgrades the hardware tier by subtracting the swap from the reported total. A device reporting 8GB with 3GB of swap gets classified as a 5GB device, and the model catalog offers the smaller Gemma variant instead.&lt;/p&gt;
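
&lt;p&gt;The downgrade arithmetic can be sketched as below. &lt;code&gt;readMeminfoKb&lt;/code&gt; and &lt;code&gt;effectiveRamKb&lt;/code&gt; are hypothetical names, and the functions take the file contents and sizes as parameters so the logic can be checked off-device. The 1GB threshold and the 8GB minus 3GB example come from the text above; everything else is an assumption.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// 1 GiB expressed in kB, the unit /proc/meminfo uses.
const val ONE_GB_KB = 1_048_576L

// Hypothetical helper: read one "Key:   12345 kB" field from /proc/meminfo text.
fun readMeminfoKb(meminfo: String, key: String): Long =
    meminfo.lineSequence()
        .firstOrNull { it.startsWith("$key:") }
        ?.split(Regex("\\s+"))
        ?.getOrNull(1)
        ?.toLongOrNull() ?: 0L

// Hypothetical helper: discount swap from the reported total once it exceeds 1GB.
fun effectiveRamKb(reportedTotalKb: Long, swapTotalKb: Long): Long =
    if (swapTotalKb &amp;gt; ONE_GB_KB) {
        (reportedTotalKb - swapTotalKb).coerceAtLeast(0L)
    } else {
        reportedTotalKb
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a reported total of 8GB and a &lt;code&gt;SwapTotal&lt;/code&gt; of 3GB, &lt;code&gt;effectiveRamKb&lt;/code&gt; returns the equivalent of 5GB. A device with only 512MB of swap is left at its reported total.&lt;/p&gt;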

&lt;p&gt;This single fix eliminated 100% of the OOM crashes on Xiaomi and Realme test devices.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stateful KV-Cache Sessions
&lt;/h2&gt;

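&lt;p&gt;Another major issue with local AI: chat gets slower every turn. Most apps re-send the entire conversation transcript to the model on every turn, so the model has to re-process the whole history before it can produce a single new token. Time-to-first-token (TTFT) then grows roughly linearly with the length of the conversation.&lt;/p&gt;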

&lt;p&gt;LiteRT-LM supports keeping a KV-cache alive across turns via &lt;code&gt;openChatSession()&lt;/code&gt;. The cache stores the key-value attention states from all prior turns, so the model only needs to process the &lt;em&gt;new&lt;/em&gt; tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Open a persistent session — KV-cache persists across turns&lt;/span&gt;
&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;session&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;openChatSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// Turn 1: full processing&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"What is this document about?"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Turn 5: only processes the NEW prompt, not the full history&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Give me more details."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
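
&lt;p&gt;A back-of-the-envelope sketch of why this matters, counting how many tokens must be prefilled before the first output token on each turn. The function names are hypothetical and the model is deliberately simplified: it ignores the attention cost over cached tokens and any context-window limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// turnTokens[i] = tokens added to the conversation on turn i
// (the new user message plus the previous reply).

// Re-sending the transcript: every turn prefills the whole history so far.
fun prefillPerTurnStateless(turnTokens: List&amp;lt;Int&amp;gt;): List&amp;lt;Int&amp;gt; =
    turnTokens.runningReduce { acc, n -&amp;gt; acc + n }

// Live KV-cache: every turn prefills only the tokens that are new.
fun prefillPerTurnCached(turnTokens: List&amp;lt;Int&amp;gt;): List&amp;lt;Int&amp;gt; = turnTokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For five turns of 50 tokens each, the stateless approach prefills 50, 100, 150, 200, and 250 tokens, while the cached session prefills 50 every time.&lt;/p&gt;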



&lt;p&gt;I built a session manager that handles the tight constraints of local KV caching transparently. Result: on a Snapdragon 8 Gen 2, TTFT stays flat regardless of conversation length, with throughput holding at ~20 tok/s.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Result: NativeLM v0.4.0
&lt;/h2&gt;

&lt;p&gt;The latest release of NativeLM exercises the full library — onboarding, model management, stateful KV-cache sessions, and the brand new fully offline Document RAG feature.&lt;/p&gt;

&lt;p&gt;Everything is open source: &lt;a href="https://github.com/sagar-develop/litertlm-kmp" rel="noopener noreferrer"&gt;github.com/sagar-develop/litertlm-kmp&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to try it on your device: &lt;a href="https://github.com/sagar-develop/litertlm-kmp/releases/tag/v0.4.0" rel="noopener noreferrer"&gt;Download the v0.4.0 APK&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more technical deep dives on shipping on-device AI, check out the &lt;a href="https://urjalabs.in/blog" rel="noopener noreferrer"&gt;Urja Labs blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'd love to hear what problems you've hit running local AI on mobile. Drop a comment or open an issue on the repo.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built at &lt;a href="https://urjalabs.in" rel="noopener noreferrer"&gt;Urja Labs&lt;/a&gt;. Dual-licensed: AGPL-3.0 for open-source, commercial license available for proprietary distribution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow the journey: &lt;a href="https://linkedin.com/in/sagarandroid" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; · &lt;a href="https://x.com/sagar8874" rel="noopener noreferrer"&gt;X / Twitter&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>android</category>
      <category>kotlin</category>
      <category>rag</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
