<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: KRISHNA D</title>
    <description>The latest articles on DEV Community by KRISHNA D (@krishna_apex).</description>
    <link>https://dev.to/krishna_apex</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3920098%2Ff65839aa-39f8-4875-8185-cee4ceba2b7f.png</url>
      <title>DEV Community: KRISHNA D</title>
      <link>https://dev.to/krishna_apex</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/krishna_apex"/>
    <language>en</language>
    <item>
      <title>Private AI on a Normal Android Phone: Building Krexel with Gemma 4 E2B</title>
      <dc:creator>KRISHNA D</dc:creator>
      <pubDate>Wed, 20 May 2026 16:35:51 +0000</pubDate>
      <link>https://dev.to/krishna_apex/private-ai-on-a-normal-android-phone-building-krexel-with-gemma-4-e2b-473e</link>
      <guid>https://dev.to/krishna_apex/private-ai-on-a-normal-android-phone-building-krexel-with-gemma-4-e2b-473e</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;Most AI assistants you use today send your data to a server. Your messages. Your documents. Your medical reports. Your private thoughts.&lt;/p&gt;

&lt;p&gt;That's the deal. You get intelligence, they get your data.&lt;/p&gt;

&lt;p&gt;The most personal conversations people have with AI are often the exact conversations they should never have to upload anywhere.&lt;/p&gt;

&lt;p&gt;A student trying to understand a medical report about their parent. A teenager writing something private. A developer working on code they haven't patented yet. A person in a rural area with no reliable internet who just wants to learn.&lt;/p&gt;

&lt;p&gt;I wanted to break that deal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Krexel&lt;/strong&gt; is a fully offline AI productivity suite for Android, powered by Gemma 4 E2B running entirely on-device via llama.cpp.&lt;/p&gt;

&lt;p&gt;No cloud. No API keys. No internet required. &lt;strong&gt;Your data never leaves your phone.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Four features in one app:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chat AI&lt;/strong&gt; — conversational AI with visible reasoning mode&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyboard AI&lt;/strong&gt; — AI assistance inside every Android app you already use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notes AI&lt;/strong&gt; — summarize, rewrite, polish, generate code, and translate locally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translation AI&lt;/strong&gt; — 70+ languages, zero API cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Built for real-world mid-range Android phones with 6–8GB RAM — the hardware billions of people actually own. This is not a remote wrapper over a hosted model. The model runs directly on the phone itself.&lt;/p&gt;

&lt;p&gt;Krexel is proprietary. Google Play release coming soon.&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/0KgOmIEK-RE"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The demo shows offline AI chat in airplane mode, Keyboard AI inside Android apps, local translation, medical report analysis fully offline, and Gemma 4 reasoning mode running on-device.&lt;/p&gt;




&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;Krexel is proprietary, but here's the architecture that made running a ~3GB LLM across four Android surfaces actually work.&lt;/p&gt;

&lt;p&gt;Building local AI on mobile isn't just about loading a model — it's about surviving strict OS memory constraints, JNI crashes, resource contention, and UI deadlocks.&lt;/p&gt;

&lt;p&gt;The core is &lt;code&gt;SharedAIManager&lt;/code&gt; — a singleton that routes all inference requests from Chat, Keyboard, Notes, and Translation through a single serialized pipeline. One model. Four surfaces. Zero conflicts.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. The Keyboard OOM Killer
&lt;/h3&gt;

&lt;p&gt;An Android keyboard is a background system service. Load a ~3GB model inside it and the OS can kill the keyboard silently, mid-typing.&lt;/p&gt;

&lt;p&gt;The fix: the entire llama.cpp inference engine runs in a completely isolated background process. Tokens pipe back to the keyboard via Android Messenger IPC.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;generateStreaming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enableThinking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Boolean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;activeRequestId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requestId&lt;/span&gt;
    &lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;what&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KrexelAiService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MSG_GENERATE_STREAMING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;requestId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Bundle&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;putString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;KrexelAiService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;KEY_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;putBoolean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;KrexelAiService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;KEY_ENABLE_THINKING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enableThinking&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;requestId&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the LLM hits OOM, it crashes in its own sandbox. The keyboard never drops a frame.&lt;/p&gt;
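
&lt;p&gt;The snippet above tags each call with a &lt;code&gt;requestId&lt;/code&gt; but does not show how replies are matched on the way back. Here is a minimal sketch of one way to do it, in plain Kotlin with no Android dependencies: tokens that belong to a superseded request are simply dropped. &lt;code&gt;TokenRouter&lt;/code&gt;, &lt;code&gt;begin&lt;/code&gt;, &lt;code&gt;deliver&lt;/code&gt;, and &lt;code&gt;finish&lt;/code&gt; are hypothetical names used for illustration, not Krexel's actual classes.&lt;/p&gt;

```kotlin
// Hypothetical client-side router: ignores tokens from requests that are no longer active.
class TokenRouter(private val onToken: (String) -> Unit) {
    private var nextId = 0

    // -1 means "no generation in flight", matching the reset value used in the article.
    var activeRequestId = -1
        private set

    // Call when a new generation is sent to the inference process.
    fun begin(): Int {
        nextId += 1
        activeRequestId = nextId
        return nextId
    }

    // Call for every token message that arrives over IPC.
    // Returns true if the token was forwarded, false if it was stale.
    fun deliver(requestId: Int, token: String): Boolean {
        if (requestId != activeRequestId) return false
        onToken(token)
        return true
    }

    // Call when the inference process reports that a request has completed.
    fun finish(requestId: Int) {
        if (requestId == activeRequestId) activeRequestId = -1
    }
}

fun main() {
    val shown = StringBuilder()
    val router = TokenRouter { shown.append(it) }

    val first = router.begin()
    router.deliver(first, "Hel")

    // The user starts a second request before the first one finishes.
    val second = router.begin()
    router.deliver(first, "lo")   // stale, dropped
    router.deliver(second, "Hi")

    println(shown)
}
```

&lt;p&gt;In a real app the &lt;code&gt;deliver&lt;/code&gt; call would sit inside whatever handler receives the IPC messages. The point of the sketch is only that a late token from an old request never reaches the text field.&lt;/p&gt;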




&lt;h3&gt;
  
  
  2. One Model, Four Surfaces — Priority Preemption
&lt;/h3&gt;

&lt;p&gt;What happens when the keyboard is generating a suggestion and the user opens Chat? Lower-priority work gets preempted instantly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;BACKGROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// keyboard suggestions&lt;/span&gt;
    &lt;span class="nc"&gt;NORMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;      &lt;span class="c1"&gt;// chat responses&lt;/span&gt;
    &lt;span class="nc"&gt;HIGH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;        &lt;span class="c1"&gt;// interactive note editing&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isGenerating&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;currentPriority&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cancelGeneration&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
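
&lt;p&gt;The check above can be isolated as a pure function, which makes the rule easy to unit test. This is a sketch under assumptions: &lt;code&gt;shouldPreempt&lt;/code&gt; is a hypothetical helper, and what happens to a request of equal or lower priority (presumably it waits its turn) is not shown in the excerpt, so the sketch covers only the cancel decision.&lt;/p&gt;

```kotlin
enum class Priority(val level: Int) {
    BACKGROUND(0),  // keyboard suggestions
    NORMAL(1),      // chat responses
    HIGH(2),        // interactive note editing
}

// Hypothetical helper: true only when a running generation should be cancelled
// to make room for the incoming request.
fun shouldPreempt(isGenerating: Boolean, incoming: Priority, current: Priority): Boolean {
    if (!isGenerating) return false         // nothing is running, nothing to cancel
    return incoming.level > current.level   // strictly higher priority wins
}

fun main() {
    // Chat arrives while a keyboard suggestion is generating: preempt.
    println(shouldPreempt(true, Priority.NORMAL, Priority.BACKGROUND))
    // A keyboard suggestion arrives while chat is generating: do not preempt.
    println(shouldPreempt(true, Priority.BACKGROUND, Priority.NORMAL))
}
```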






&lt;h3&gt;
  
  
  3. Race-Condition Safe Generation
&lt;/h3&gt;

&lt;p&gt;Every generation acquires a mutex. State always cleans up in &lt;code&gt;finally&lt;/code&gt; — no matter what happens, no matter how fast the user taps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;result&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generationMutex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withLock&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;isGenerating&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;generateWithSystemBlocking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="p"&gt;.)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;isGenerating&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
        &lt;span class="n"&gt;activeRequestId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;currentPriority&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BACKGROUND&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
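
&lt;p&gt;To see why the &lt;code&gt;finally&lt;/code&gt; block matters, here is a small standalone sketch that forces a failure mid-generation and checks that the state still resets. It swaps the coroutine mutex for a plain &lt;code&gt;ReentrantLock&lt;/code&gt; so it runs with the standard library alone. &lt;code&gt;GenerationState&lt;/code&gt; and &lt;code&gt;runExclusive&lt;/code&gt; are hypothetical names, not part of Krexel.&lt;/p&gt;

```kotlin
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

// Hypothetical holder for the flags that the article resets in `finally`.
class GenerationState {
    private val lock = ReentrantLock()

    var isGenerating = false
        private set
    var activeRequestId = -1
        private set

    // Runs one generation at a time. The flags are reset even if `work` throws.
    fun runExclusive(requestId: Int, work: () -> String): String = lock.withLock {
        isGenerating = true
        activeRequestId = requestId
        try {
            work()
        } finally {
            isGenerating = false
            activeRequestId = -1
        }
    }
}

fun main() {
    val state = GenerationState()

    println(state.runExclusive(1) { "generated text" })

    // Simulate a failure in the middle of generation.
    try {
        state.runExclusive(2) { throw IllegalStateException("simulated failure") }
    } catch (e: IllegalStateException) {
        println("failed: ${e.message}")
    }

    // The flags are back to idle despite the exception.
    println("isGenerating=${state.isGenerating}, activeRequestId=${state.activeRequestId}")
}
```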






&lt;h3&gt;
  
  
  4. The Fast-Tap JNI Deadlock
&lt;/h3&gt;

&lt;p&gt;Rapid task switching fires commands into llama.cpp out of order. On a native C++ JNI bridge, that's a hard crash — no stack trace, no recovery.&lt;/p&gt;

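&lt;p&gt;The fix: a state machine built on Kotlin Flow. Before any new command is issued, it cancels the in-flight generation and suspends until the engine reports &lt;code&gt;ModelReady&lt;/code&gt;, giving up after a timeout (3 seconds by default).&lt;br&gt;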
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;suspend&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;waitForReadyState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeoutMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;Boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;engine&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inferenceEngine&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;InferenceEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;State&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Generating&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;cancelGeneration&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;withTimeoutOrNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeoutMs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;InferenceEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;State&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ModelReady&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No UI blocking. No JNI crashes. Clean state transitions.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Safety Without a Classifier
&lt;/h3&gt;

&lt;p&gt;There is no RAM left for a secondary safety model. Instead, a real-time token-buffering state machine evaluates the stream as it arrives, before a single character reaches the screen.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;streamResult&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;streamingFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;processToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;Safe&lt;/span&gt;       &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* emit instantly */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;Suspicious&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* hold in buffer */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;Blocked&lt;/span&gt;    &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* abort and replace */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No second model in memory. No added latency for tokens that pass as safe. Only suspicious spans are held back until they resolve one way or the other.&lt;/p&gt;
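
&lt;p&gt;To make the three branches concrete, here is a minimal sketch of a buffering filter built around a plain phrase list. It is an illustration under assumptions, not Krexel's filter: the phrase-matching rule, the &lt;code&gt;FilterResult&lt;/code&gt; types, and the constructor are all hypothetical, and it assumes lowercasing does not change string length (true for ASCII). Text is held back only while its tail could still grow into a blocked phrase.&lt;/p&gt;

```kotlin
// Hypothetical result types mirroring the Safe / Suspicious / Blocked branches.
sealed interface FilterResult
data class Safe(val text: String) : FilterResult  // flush this text to the screen
object Suspicious : FilterResult                  // keep holding, show nothing yet
object Blocked : FilterResult                     // abort generation, show a refusal

class StreamingFilter(vararg blockedPhrases: String) {
    private val phrases = blockedPhrases.map { it.lowercase() }
    private val held = StringBuilder()

    fun processToken(token: String): FilterResult {
        held.append(token)
        val text = held.toString().lowercase()

        // A full blocked phrase has appeared in the held text.
        if (phrases.any { text.contains(it) }) {
            held.clear()
            return Blocked
        }

        // Length of the longest tail that is still a prefix of some blocked phrase.
        val keep = (text.length downTo 1).firstOrNull { n ->
            phrases.any { it.startsWith(text.takeLast(n)) }
        } ?: 0

        // Everything held so far could still become a blocked phrase.
        if (keep == text.length) return Suspicious

        // Release the part that can no longer match and keep only the risky tail.
        val releaseUpTo = held.length - keep
        val safeText = held.substring(0, releaseUpTo)
        held.delete(0, releaseUpTo)
        return Safe(safeText)
    }
}

fun main() {
    val filter = StreamingFilter("bad word")
    for (token in listOf("hello ", "ba", "nana", "so b", "ad word")) {
        when (val result = filter.processToken(token)) {
            is Safe -> println("emit: ${result.text}")
            is Suspicious -> println("hold")
            is Blocked -> println("abort and replace")
        }
    }
}
```

&lt;p&gt;A production filter would also need to flush whatever is still held when the stream ends, and a phrase list is only one possible matching rule.&lt;/p&gt;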




&lt;h3&gt;
  
  
  6. Streaming Directly Into the Cursor
&lt;/h3&gt;

&lt;p&gt;Most keyboard AI tools wait for the full generation and then paste it. Krexel pipes tokens directly into the Android &lt;code&gt;InputConnection&lt;/code&gt; as they arrive, inside WhatsApp, Gmail, or Telegram, with no app switching, no internet, and no waiting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nc"&gt;FlorisImeService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;requestKeyboardAiStreaming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filteredPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;onToken&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
        &lt;span class="n"&gt;currentInputConnection&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;commitText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It doesn't feel like AI. It feels like the keyboard itself got smarter.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. Hardware-Gated Model Selection
&lt;/h3&gt;

&lt;p&gt;Shipping one model size for all devices leaves half your users with an OOM crash the first time the model loads. Krexel hard-maps the allowed model size to physical RAM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;tier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;totalRam&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;DeviceTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LOW_RAM&lt;/span&gt;    &lt;span class="c1"&gt;// max 350MB model&lt;/span&gt;
    &lt;span class="n"&gt;totalRam&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;6144&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;DeviceTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FOUR_GB&lt;/span&gt;   &lt;span class="c1"&gt;// max 550MB model&lt;/span&gt;
    &lt;span class="n"&gt;totalRam&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;DeviceTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MID_RANGE&lt;/span&gt; &lt;span class="c1"&gt;// max 1200MB model&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;            &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;DeviceTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HIGH_END&lt;/span&gt;  &lt;span class="c1"&gt;// unlocks full Gemma 4 E2B&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every device gets the best model it can actually run.&lt;/p&gt;
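
&lt;p&gt;The tier table can be paired with a simple size check before a download or load is allowed. The sketch below reuses the thresholds and size caps from the snippet above. Treating &lt;code&gt;HIGH_END&lt;/code&gt; as having no cap is an assumption, and &lt;code&gt;tierFor&lt;/code&gt; and &lt;code&gt;fits&lt;/code&gt; are hypothetical helpers. On Android the total RAM figure would typically come from &lt;code&gt;ActivityManager.MemoryInfo.totalMem&lt;/code&gt; (reported in bytes), which is also an assumption about how Krexel reads it.&lt;/p&gt;

```kotlin
// Caps in MB are taken from the comments in the article's snippet.
// A null cap for HIGH_END ("no limit") is an assumption.
enum class DeviceTier(val maxModelMb: Int?) {
    LOW_RAM(350),
    FOUR_GB(550),
    MID_RANGE(1200),
    HIGH_END(null),
}

// Same thresholds as the article, checked from the top tier down.
fun tierFor(totalRamMb: Long): DeviceTier = when {
    totalRamMb >= 8192 -> DeviceTier.HIGH_END
    totalRamMb >= 6144 -> DeviceTier.MID_RANGE
    totalRamMb >= 4096 -> DeviceTier.FOUR_GB
    else -> DeviceTier.LOW_RAM
}

// Hypothetical gate: true if a model of the given size is allowed on this tier.
fun fits(modelMb: Int, tier: DeviceTier): Boolean {
    val cap = tier.maxModelMb ?: return true
    return cap >= modelMb
}

fun main() {
    val tier = tierFor(4096)
    println("tier=$tier, 500MB model allowed=${fits(500, tier)}, 900MB model allowed=${fits(900, tier)}")
}
```

&lt;p&gt;Note the boundaries: a device reporting exactly 6144 MB lands in &lt;code&gt;MID_RANGE&lt;/code&gt;, while one reporting 6143 MB lands in &lt;code&gt;FOUR_GB&lt;/code&gt;. Phones often report slightly less than their advertised RAM, so the thresholds are worth testing against real devices.&lt;/p&gt;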




&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;This was not a default choice.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;RAM Required&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B Dense&lt;/td&gt;
&lt;td&gt;24GB+&lt;/td&gt;
&lt;td&gt;Server-grade only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B MoE&lt;/td&gt;
&lt;td&gt;18GB+&lt;/td&gt;
&lt;td&gt;Too large for phones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 E4B&lt;/td&gt;
&lt;td&gt;4GB+&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma 4 E2B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2–3GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅ Ideal for Android&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Krexel targets the hardware normal people actually own — not workstations, not cloud GPUs.&lt;/p&gt;

&lt;p&gt;The specific model: &lt;code&gt;unsloth/gemma-4-E2B-it-GGUF&lt;/code&gt; (~2.9GB). On my test device — Realme RMX5070, 7.2GB RAM, Android 16, arm64-v8a — it runs at &lt;strong&gt;5.74 tokens/sec&lt;/strong&gt;. That performance on a normal phone completely changed how I thought about local AI.&lt;/p&gt;




&lt;h3&gt;
  
  
  What Gemma 4 Specifically Unlocked
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Private Medical Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Users open blood test reports in full Airplane Mode and get plain-English explanations.&lt;/p&gt;

&lt;p&gt;Running entirely on-device gives Krexel strong privacy guarantees because documents never leave the phone.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;em&gt;Krexel is not a medical tool. AI responses are for informational purposes only — always consult a qualified doctor for medical decisions.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;2. Reasoning On-Device&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemma 4's reasoning support lets users watch reasoning chains run directly on their own hardware. Zero server round-trips. The phone itself becomes the AI computer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Offline Translation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;TRANSLATION_SYSTEM_PROMPT&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
You are a professional translator.
- Output ONLY the translated text
- No explanations, no preamble
- Preserve formatting and punctuation
- Match tone: formal stays formal
"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One model. 70+ languages. No separate engine needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. AI in Every App — Keyboard AI&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think of it like having an AI writing assistant built directly into your keyboard — the kind developers get inside their IDE, but for every app on your phone. The difference: nothing is ever sent to a server. No prompt is stored. No conversation is used to train anything. It's purely local.&lt;/p&gt;

&lt;p&gt;Grammar correction, tone rewriting, translation — all without leaving the app, all without internet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Notes AI — Your Offline Writing and Coding Assistant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Notes AI works like an AI assistant inside a code editor — suggest, explain, refactor, debug — except it never sends your code or writing anywhere. Your ideas stay yours.&lt;/p&gt;

&lt;p&gt;Students use it to understand textbook chapters, rewrite assignments, and summarize notes. Developers use it to generate snippets, explain unfamiliar code, and organize architecture thoughts. All offline. All private.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Child Safe Mode — Safety Without a Cloud&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a user is under 18, Child Safe Mode activates automatically based on date of birth. Unsafe token patterns are intercepted in real-time during streaming — before a single character reaches the screen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyiry2aovkyoar1eg9zfy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyiry2aovkyoar1eg9zfy.jpg" alt="Krexel profile screen showing date of birth locked with biometric protection" width="800" height="1047"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgisgo6tc2el8btraqrf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgisgo6tc2el8btraqrf.jpg" alt="Krexel Keyboard AI showing grammar correction inside WhatsApp" width="800" height="1115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9ofr9r5q4fvoron9ehw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9ofr9r5q4fvoron9ehw.jpg" alt="Krexel Chat AI returning blocked response — I can't assist with that request" width="800" height="855"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once set, changing the date of birth requires biometric verification. A parent sets it once, and it cannot be changed without passing that check.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;Most people on Earth don't own AI workstations. They own Android phones.&lt;/p&gt;

&lt;p&gt;India alone has 650 million smartphone users. Most use mid-range devices. Many rely entirely on their phones as their primary computing device. Many study and work in areas where connectivity is a luxury, not a given.&lt;/p&gt;

&lt;p&gt;300 million students. Most own Android phones. Most just want to learn — without their curiosity being logged on a server somewhere.&lt;/p&gt;

&lt;p&gt;Gemma 4 E2B runs on exactly these devices.&lt;/p&gt;

&lt;p&gt;Privacy should be accessible on mainstream hardware too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not bigger servers. Smarter devices.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Technical Stack
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llama.cpp             → inference engine (JNI bridge)
Gemma 4 E2B GGUF      → unsloth/gemma-4-E2B-it-GGUF
SharedAIManager       → centralized generation pipeline
ModelLoadCoordinator  → serialized loading, race-condition safe
MemoryWarningChecker  → RAM tier detection
FlorisBoard fork      → Keyboard AI
Markor fork           → Notes AI
Kotlin 2.3.0 | Min SDK: 26 | Target: 36 | arm64-v8a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Settings are stored in &lt;code&gt;EncryptedSharedPreferences&lt;/code&gt;. No API keys or server URLs are ever stored in plaintext. Model downloads happen directly inside the app via a built-in Hugging Face search.&lt;/p&gt;




&lt;h3&gt;
  
  
  Open Source Credits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FlorisBoard&lt;/strong&gt; (Apache 2.0) — Keyboard foundation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Markor&lt;/strong&gt; (Apache 2.0) — Notes foundation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt; (MIT) — Inference engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unsloth&lt;/strong&gt; — Optimized Gemma 4 E2B GGUF&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built with Kotlin · llama.cpp · Gemma 4 E2B · Android 16&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Test device: Realme RMX5070 · 7.2GB RAM · arm64-v8a&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>mobile</category>
    </item>
  </channel>
</rss>
