<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: B Sai Kiran Reddy</title>
    <description>The latest articles on DEV Community by B Sai Kiran Reddy (@sai888).</description>
    <link>https://dev.to/sai888</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3996951%2Ff0934dd2-27e4-473b-8dca-60833dff48d0.jpg</url>
      <title>DEV Community: B Sai Kiran Reddy</title>
      <link>https://dev.to/sai888</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sai888"/>
    <language>en</language>
    <item>
      <title>Running Gemma 4 On-Device on Android: Building a Private AI Assistant with LiteRT</title>
      <dc:creator>B Sai Kiran Reddy</dc:creator>
      <pubDate>Mon, 22 Jun 2026 12:49:44 +0000</pubDate>
      <link>https://dev.to/sai888/running-gemma-4-on-device-on-android-building-a-private-ai-assistant-with-litert-1a0</link>
      <guid>https://dev.to/sai888/running-gemma-4-on-device-on-android-building-a-private-ai-assistant-with-litert-1a0</guid>
      <description>&lt;h1&gt;
  
  
  Running Gemma 4 On-Device on Android: Building a Private AI Assistant with LiteRT
&lt;/h1&gt;

&lt;p&gt;I built an Android app that runs Gemma 4 entirely on-device using Google's LiteRT framework no cloud API calls, no data leaving your phone, no API costs. Here's how it works and what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why On-Device AI?
&lt;/h2&gt;

&lt;p&gt;Cloud AI is powerful, but comes with downsides: latency, privacy concerns, rate limits, and recurring API costs. Running a model locally solves all of these your data stays on the device, responses are instantaneous after the first token, and there's no per-request cost.&lt;/p&gt;

&lt;p&gt;The tradeoff is compute. A modern flagship phone can run a 2-3B parameter model comfortably, but you need the right quantization and engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model:&lt;/strong&gt; Gemma 4 E2B (Google's latest edge-optimized model)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engine:&lt;/strong&gt; LiteRT LM (Google's on-device LLM runtime)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantization:&lt;/strong&gt; Q4 (3.2GB), SFP8 (4.6GB), BF16 (9.6GB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language:&lt;/strong&gt; Kotlin&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; Multi-provider fallback (LiteRT → Gemini → Claude → OpenAI)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Model Download &amp;amp; Management
&lt;/h2&gt;

&lt;p&gt;The model is hosted on Hugging Face and downloaded on first launch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;object&lt;/span&gt; &lt;span class="nc"&gt;ModelManager&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;MODEL_URL&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
        &lt;span class="s"&gt;"https://huggingface.co/sai8151/gemma-4-e2b-it-litertlm/resolve/main/gemma-4-e2b-it.litertlm"&lt;/span&gt;

    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;getModelFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
        &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filesDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"gemma-4-e2b-it.litertlm"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;suspend&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;downloadModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;onProgress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;Unit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;client&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OkHttpClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;request&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Builder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MODEL_URL&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;response&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;body&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Download failed"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;total&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contentLength&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;input&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;byteStream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;file&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getModelFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outputStream&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
            &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;buffer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ByteArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;downloaded&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0L&lt;/span&gt;
            &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;read&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;also&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;downloaded&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt;
                &lt;span class="nf"&gt;onProgress&lt;/span&gt;&lt;span class="p"&gt;(((&lt;/span&gt;&lt;span class="n"&gt;downloaded&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toInt&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We validate the download if the file is under 1GB (indicating a partial/corrupt download), we delete it and force a re-download.&lt;br&gt;
LiteRT Integration&lt;br&gt;
Google's LiteRT LM provides an Engine and Conversation API for running LLMs on-device.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;useGpu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Backend&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GPU&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nc"&gt;Backend&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CPU&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;config&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EngineConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;modelPath&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;modelPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;visionBackend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Backend&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CPU&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;cacheDir&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cacheDir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;absolutePath&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;engine&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;conversation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createConversation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;ConversationConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;systemInstruction&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Contents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;of&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;samplerConfig&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SamplerConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;topK&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;topP&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;topP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toDouble&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toDouble&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sendMessage API is straightforward you pass the prompt and get back a response. For streaming, you can use sendMessageAsync with a Flow collector. I also built a chatWithImage method that uses Message.of(Content.ImageBytes(imageBytes), Content.Text(userMessage)) for multimodal prompts.&lt;br&gt;
Auto-Tuner: Adaptive Performance&lt;br&gt;
Different devices have wildly different performance. I built an auto-tuner that adjusts model parameters based on real-time metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;object&lt;/span&gt; &lt;span class="nc"&gt;AutoTuner&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;TunedConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;PerfMetrics&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;TunedConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Very slow device — dial everything down&lt;/span&gt;
            &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tps&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;firstLatency&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;4000&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
                &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;useGpu&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;topK&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topK&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;coerceAtLeast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;contextSize&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contextSize&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;coerceAtLeast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;// Fast device — crank it up&lt;/span&gt;
            &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tps&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;8.0&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;firstLatency&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
                &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;topK&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topK&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;coerceAtMost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;contextSize&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contextSize&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;coerceAtMost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On my realme device, I get ~6-8 tokens/sec with Q4 quantization usable for chat, summarization, and quick Q&amp;amp;A. The tuner automatically switches between CPU and GPU backends based on throughput.&lt;br&gt;
Multi-Provider Architecture&lt;br&gt;
The app isn't limited to on-device. I defined a common AiClient interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;AiClient&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;suspend&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Pair&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;,&lt;/span&gt;
        &lt;span class="n"&gt;userMessage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;onToken&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;Unit&lt;/span&gt;&lt;span class="p"&gt;)?&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;Triple&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?,&lt;/span&gt; &lt;span class="nc"&gt;PerfMetrics&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;

    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;isLocal&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;Boolean&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets me swap between LiteRtClient, GeminiClient, ClaudeClient, and OpenAiClient transparently. The app auto-fallsback: if the on-device model isn't downloaded, it offers cloud options. If the cloud API fails, it falls back further.&lt;br&gt;
Clipboard Sync (The "URL AI" feature)&lt;br&gt;
The app also polls a configurable URL endpoint originally built for clipboard sync. A companion Windows app (Flask + Cloudflare Tunnel) pushes clipboard content to a URL, and the Android app polls it periodically, feeding new content directly to the AI. This enables a workflow like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Copy text on Windows&lt;/li&gt;
&lt;li&gt;It appears in the Android app automatically&lt;/li&gt;
&lt;li&gt;The AI processes it on-device and returns results
No cloud dependency, no manual copy-paste, end-to-end private.
Performance Numbers
Quantization
Q4
SFP8
BF16
The auto-tuner starts with GPU + Q4, measures performance over the first few interactions, and adjusts accordingly.
What's Next&lt;/li&gt;
&lt;li&gt;LoRA fine-tuning directly on-device&lt;/li&gt;
&lt;li&gt;Function calling with on-device models&lt;/li&gt;
&lt;li&gt;Streaming TTS for voice responses&lt;/li&gt;
&lt;li&gt;RAG with local embeddings (FAISS on-device)
The full source code is available on GitHub (&lt;a href="https://github.com/sai5880" rel="noopener noreferrer"&gt;https://github.com/sai5880&lt;/a&gt;) and the app is published as Sentinel Link on Google Play (Alpha testing) intrested in testing can contact.
If you're building on-device AI for Android, LiteRT LM is the most mature option right now. The ecosystem is evolving fast Gemma 4, Llama 3.2, and Phi-3 all have on-device variants worth exploring.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>android</category>
      <category>gemma</category>
      <category>ondeviceai</category>
      <category>litert</category>
    </item>
  </channel>
</rss>
