<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: KRISHNA D</title>
    <description>The latest articles on DEV Community by KRISHNA D (@krishna_apex).</description>
    <link>https://dev.to/krishna_apex</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3920098%2Ff65839aa-39f8-4875-8185-cee4ceba2b7f.png</url>
      <title>DEV Community: KRISHNA D</title>
      <link>https://dev.to/krishna_apex</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/krishna_apex"/>
    <language>en</language>
    <item>
      <title>Private AI on a Normal Android Phone: Building Krexel with Gemma 4 E2B</title>
      <dc:creator>KRISHNA D</dc:creator>
      <pubDate>Wed, 20 May 2026 16:35:51 +0000</pubDate>
      <link>https://dev.to/krishna_apex/private-ai-on-a-normal-android-phone-building-krexel-with-gemma-4-e2b-473e</link>
      <guid>https://dev.to/krishna_apex/private-ai-on-a-normal-android-phone-building-krexel-with-gemma-4-e2b-473e</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;What I Built&lt;/h2&gt;

&lt;p&gt;Almost every AI assistant you use today sends your data to a server. Your messages. Your documents. Your medical reports. Your private thoughts.&lt;/p&gt;

&lt;p&gt;That's the deal. You get intelligence, they get your data.&lt;/p&gt;

&lt;p&gt;The most personal conversations people have with AI are often the exact conversations they should not have to upload anywhere.&lt;/p&gt;

&lt;p&gt;I wanted to break that deal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Krexel&lt;/strong&gt; is a fully offline AI productivity suite for Android powered by Gemma 4 E2B, running entirely on-device via llama.cpp.&lt;/p&gt;

&lt;p&gt;No cloud. No API keys. No internet required. Your data never leaves your phone.&lt;/p&gt;

&lt;p&gt;Four features in one app:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chat AI&lt;/strong&gt; — conversational AI with visible reasoning mode&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyboard AI&lt;/strong&gt; — AI assistance inside every Android app you already use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notes AI&lt;/strong&gt; — summarize, rewrite, polish, and translate locally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translation AI&lt;/strong&gt; — 70+ languages, zero API cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Built for real-world mid-range Android phones with 6–8GB RAM — the hardware billions of people actually own. This is not a remote wrapper over a hosted model. The model runs directly on the phone itself.&lt;/p&gt;

&lt;p&gt;Krexel is proprietary. Google Play release coming soon.&lt;/p&gt;




&lt;h2&gt;Demo&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/0KgOmIEK-RE"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The demo shows offline AI chat in airplane mode, Keyboard AI inside Android apps, local translation, medical report analysis fully offline, and Gemma 4 reasoning mode running on-device.&lt;/p&gt;




&lt;h2&gt;Code&lt;/h2&gt;

&lt;p&gt;Krexel is proprietary, but here's the architecture that made running a ~3GB LLM across four Android surfaces actually work. Building local AI on mobile isn't just about loading a model — it's about surviving strict OS memory constraints, JNI crashes, resource contention, and UI deadlocks.&lt;/p&gt;

&lt;p&gt;The core is &lt;code&gt;SharedAIManager&lt;/code&gt; — a singleton that routes all inference requests from Chat, Keyboard, Notes, and Translation through a single serialized pipeline. One model. Four surfaces. Zero conflicts.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;1. The Keyboard OOM Killer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An Android keyboard runs as a background system service. Load a ~3GB model inside it and the OS kills the keyboard for exceeding memory limits, silently and mid-typing.&lt;/p&gt;

&lt;p&gt;The fix: the entire &lt;code&gt;llama.cpp&lt;/code&gt; inference engine runs in a completely isolated background process. Tokens pipe back to the keyboard via Android &lt;code&gt;Messenger&lt;/code&gt; IPC.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;generateStreaming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enableThinking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Boolean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;activeRequestId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requestId&lt;/span&gt;
    &lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;what&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KrexelAiService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MSG_GENERATE_STREAMING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;requestId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Bundle&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;putString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;KrexelAiService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;KEY_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;putBoolean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;KrexelAiService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;KEY_ENABLE_THINKING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enableThinking&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;requestId&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the LLM hits OOM, it crashes in its own sandbox. The keyboard never drops a frame.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;2. One Model, Four Surfaces — Priority Preemption&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What happens when the keyboard is generating a suggestion and the user opens Chat? Lower-priority work gets preempted instantly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;BACKGROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// keyboard suggestions, notification quick-replies&lt;/span&gt;
    &lt;span class="nc"&gt;NORMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;      &lt;span class="c1"&gt;// chat responses&lt;/span&gt;
    &lt;span class="nc"&gt;HIGH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;        &lt;span class="c1"&gt;// interactive note editing (user is watching and waiting)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isGenerating&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;currentPriority&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TAG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Preempting ${currentPriority.name} for ${priority.name}"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cancelGeneration&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
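&lt;p&gt;The preemption rule can be isolated as a pure function, which makes it easy to test away from the inference engine. The sketch below is an illustration, not Krexel's actual code: only the &lt;code&gt;Priority&lt;/code&gt; levels come from the snippet above, while &lt;code&gt;Decision&lt;/code&gt; and &lt;code&gt;decide&lt;/code&gt; are hypothetical names, and the assumption is that a request of equal or lower priority simply waits its turn.&lt;/p&gt;

```kotlin
// Hypothetical sketch: only the Priority levels are taken from the article.
enum class Priority(val level: Int) { BACKGROUND(0), NORMAL(1), HIGH(2) }

// What an incoming request should do (illustrative names).
enum class Decision { START, PREEMPT, WAIT }

// Pure decision rule: strictly higher priority cancels the running job,
// anything else is assumed to queue behind it.
fun decide(isGenerating: Boolean, current: Priority, incoming: Priority): Decision = when {
    !isGenerating -> Decision.START
    incoming.level > current.level -> Decision.PREEMPT
    else -> Decision.WAIT
}
```

&lt;p&gt;For example, &lt;code&gt;decide(true, Priority.BACKGROUND, Priority.NORMAL)&lt;/code&gt; yields &lt;code&gt;PREEMPT&lt;/code&gt;, which matches the case of a chat request arriving while a keyboard suggestion is still generating.&lt;/p&gt;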






&lt;p&gt;&lt;strong&gt;3. Race-Condition Safe Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every generation acquires a mutex. State always cleans up in &lt;code&gt;finally&lt;/code&gt; — no matter what happens, no matter how fast the user taps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;result&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generationMutex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withLock&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;isGenerating&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
    &lt;span class="n"&gt;activeRequestId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requestId&lt;/span&gt;
    &lt;span class="n"&gt;currentPriority&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;generateWithSystemBlocking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="p"&gt;.)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;isGenerating&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
        &lt;span class="n"&gt;activeRequestId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;currentPriority&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BACKGROUND&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
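&lt;p&gt;To show why the &lt;code&gt;finally&lt;/code&gt; block matters, here is a minimal, self-contained sketch of the same pattern. It swaps the coroutine mutex for a plain &lt;code&gt;ReentrantLock&lt;/code&gt; so it runs without extra dependencies. &lt;code&gt;GenerationState&lt;/code&gt; and &lt;code&gt;runExclusive&lt;/code&gt; are hypothetical names, not part of Krexel.&lt;/p&gt;

```kotlin
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

// Hypothetical stand-in for the state tracked around a generation call.
class GenerationState {
    private val lock = ReentrantLock()

    var isGenerating = false
        private set
    var activeRequestId = -1
        private set

    // Runs one generation at a time; state is reset even if block() throws.
    fun runExclusive(requestId: Int, block: () -> String): String = lock.withLock {
        isGenerating = true
        activeRequestId = requestId
        try {
            block()
        } finally {
            isGenerating = false
            activeRequestId = -1
        }
    }
}
```

&lt;p&gt;If &lt;code&gt;block&lt;/code&gt; throws, the exception still propagates to the caller, but &lt;code&gt;isGenerating&lt;/code&gt; and &lt;code&gt;activeRequestId&lt;/code&gt; are back at their idle values before the lock is released, so the next request never sees stale state.&lt;/p&gt;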






&lt;p&gt;&lt;strong&gt;4. The Fast-Tap JNI Deadlock&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rapid task switching fires commands into llama.cpp out of order. On a native C++ JNI bridge, that's a hard crash — no stack trace, no recovery.&lt;/p&gt;

&lt;p&gt;The fix: a Kotlin Flow state machine intercepts the busy engine, cancels the native thread, and waits for &lt;code&gt;ModelReady&lt;/code&gt; before proceeding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;suspend&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;waitForReadyState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeoutMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;Boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;engine&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inferenceEngine&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;InferenceEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;State&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Generating&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;cancelGeneration&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;withTimeoutOrNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeoutMs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;InferenceEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;State&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ModelReady&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No UI blocking. No JNI crashes. Clean state transitions.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;5. Safety Without a Classifier&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No RAM left for a secondary safety model. A real-time token buffering state machine evaluates the stream as it arrives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;streamResult&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;streamingFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;processToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;StreamingContentFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ProcessResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Safe&lt;/span&gt;       &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* emit instantly */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;StreamingContentFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ProcessResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Suspicious&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* hold in buffer */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="nc"&gt;StreamingContentFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ProcessResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Blocked&lt;/span&gt;    &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* abort &amp;amp; mask */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No classifier overhead. No added latency for tokens the filter marks safe. Blocked output is masked before it reaches the screen.&lt;/p&gt;
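&lt;p&gt;The internals of &lt;code&gt;StreamingContentFilter&lt;/code&gt; are not shown here, so the following is only a rough sketch of how a three-state token filter could work. It assumes a simple blocklist of lowercase terms: text that could still grow into a blocked term is held back, and everything else is released immediately. &lt;code&gt;PrefixHoldFilter&lt;/code&gt; and its &lt;code&gt;ProcessResult&lt;/code&gt; type are hypothetical names.&lt;/p&gt;

```kotlin
// Hypothetical result type mirroring the Safe / Suspicious / Blocked states.
sealed class ProcessResult {
    data class Safe(val text: String) : ProcessResult()
    object Suspicious : ProcessResult()
    object Blocked : ProcessResult()
}

// Assumption: blockedTerms are supplied in lowercase.
class PrefixHoldFilter(private vararg val blockedTerms: String) {
    private val held = StringBuilder()

    fun processToken(token: String): ProcessResult {
        held.append(token)
        val candidate = held.toString().lowercase()

        // A full blocked term has appeared: drop the buffer and abort.
        if (blockedTerms.any { candidate.contains(it) }) {
            held.clear()
            return ProcessResult.Blocked
        }

        // Some tail of the buffer is the start of a blocked term: keep holding.
        val mayGrowIntoBlocked = candidate.indices.any { i ->
            val tail = candidate.substring(i)
            blockedTerms.any { it.startsWith(tail) }
        }
        if (mayGrowIntoBlocked) return ProcessResult.Suspicious

        // Nothing risky pending: release everything held so far.
        val release = held.toString()
        held.clear()
        return ProcessResult.Safe(release)
    }
}
```

&lt;p&gt;With a filter built as &lt;code&gt;PrefixHoldFilter("badword")&lt;/code&gt;, the token &lt;code&gt;"ba"&lt;/code&gt; is held as suspicious, a following &lt;code&gt;"nana"&lt;/code&gt; releases &lt;code&gt;"banana"&lt;/code&gt; as safe, and the sequence &lt;code&gt;"bad"&lt;/code&gt; then &lt;code&gt;"word"&lt;/code&gt; ends in a block. A production filter would need more than substring matching, but the buffering logic has the same shape.&lt;/p&gt;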




&lt;p&gt;&lt;strong&gt;6. Streaming Directly Into the Cursor&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most keyboard AI waits for full generation then pastes. Krexel pipes tokens directly into the Android &lt;code&gt;InputConnection&lt;/code&gt; as they arrive — inside WhatsApp, Gmail, Telegram — no app switching, no internet, no waiting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nc"&gt;FlorisImeService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;requestKeyboardAiStreaming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filteredPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;onToken&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
        &lt;span class="n"&gt;currentInputConnection&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;commitText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It doesn't feel like AI. It feels like the keyboard itself got smarter.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;7. Hardware-Gated Model Selection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Shipping one model size to every device leaves low-RAM phones with an OOM crash the first time the model loads. Krexel maps the maximum model size to physical RAM, so every device gets the largest model it can actually run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;tier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;totalRam&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;DeviceTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LOW_RAM&lt;/span&gt;    &lt;span class="c1"&gt;// max 350MB model&lt;/span&gt;
    &lt;span class="n"&gt;totalRam&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;6144&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;DeviceTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FOUR_GB&lt;/span&gt;   &lt;span class="c1"&gt;// max 550MB model&lt;/span&gt;
    &lt;span class="n"&gt;totalRam&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;DeviceTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MID_RANGE&lt;/span&gt; &lt;span class="c1"&gt;// max 1200MB model&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;            &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;DeviceTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HIGH_END&lt;/span&gt;  &lt;span class="c1"&gt;// unlocks full Gemma 4 E2B&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
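&lt;p&gt;As a sketch of how the size caps in those comments might be enforced, the version below attaches each cap to its tier and adds a check before a model is offered for download. It assumes &lt;code&gt;totalRam&lt;/code&gt; is measured in megabytes, which the thresholds suggest but the snippet does not state. &lt;code&gt;maxModelMb&lt;/code&gt;, &lt;code&gt;allows&lt;/code&gt;, and &lt;code&gt;tierFor&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```kotlin
// Caps are copied from the comments above; HIGH_END has no stated cap.
enum class DeviceTier(val maxModelMb: Int?) {
    LOW_RAM(350),
    FOUR_GB(550),
    MID_RANGE(1200),
    HIGH_END(null);

    // True when a model file of the given size fits under this tier's cap.
    fun allows(modelMb: Int): Boolean {
        val cap = maxModelMb ?: return true
        return cap >= modelMb
    }
}

// Same thresholds as the original when-expression, checked from the top down.
// Assumption: totalRamMb is the device's total RAM in megabytes.
fun tierFor(totalRamMb: Int): DeviceTier = when {
    totalRamMb >= 8192 -> DeviceTier.HIGH_END
    totalRamMb >= 6144 -> DeviceTier.MID_RANGE
    totalRamMb >= 4096 -> DeviceTier.FOUR_GB
    else -> DeviceTier.LOW_RAM
}
```

&lt;p&gt;A model picker could then filter its list with &lt;code&gt;tierFor(totalRamMb).allows(modelMb)&lt;/code&gt; so that oversized models are never shown on a device that cannot load them.&lt;/p&gt;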






&lt;h2&gt;How I Used Gemma 4&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why E2B and not the others — this was not a default choice:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;RAM Required&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B Dense&lt;/td&gt;
&lt;td&gt;24GB+&lt;/td&gt;
&lt;td&gt;Server-grade only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B MoE&lt;/td&gt;
&lt;td&gt;18GB+&lt;/td&gt;
&lt;td&gt;Too large for phones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 E4B&lt;/td&gt;
&lt;td&gt;4GB+&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma 4 E2B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2–3GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅ Ideal for Android&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Krexel targets the hardware normal people actually own. Not RTX workstations. Not Mac Studios. Not cloud GPUs.&lt;/p&gt;

&lt;p&gt;The specific model: &lt;code&gt;unsloth/gemma-4-E2B-it-GGUF&lt;/code&gt; (~2.9GB). On my test device — Realme RMX5070, 7.2GB RAM, Android 16, arm64-v8a — it runs at &lt;strong&gt;5.74 tokens/sec&lt;/strong&gt;. That performance on a normal phone completely changed how I thought about local AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Gemma 4 specifically unlocked:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Private Medical Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Users open blood test reports in full Airplane Mode and get plain-English explanations entirely offline. No server. No upload. No third-party processing. A cloud-hosted assistant cannot offer this, because the report has to leave the device. With Gemma 4 on-device, users never have to choose between intelligence and privacy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Reasoning On-Device&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemma 4's &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; token support lets users watch reasoning chains run directly on their own hardware. Zero server round-trips. The phone itself becomes the AI computer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Offline Translation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;TRANSLATION_SYSTEM_PROMPT&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
You are a professional translator.
- Output ONLY the translated text
- No explanations, no preamble
- Preserve formatting and punctuation
- Match tone: formal stays formal
"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One model handles 70+ languages. No separate translation engine needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. AI in Every App&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Keyboard AI feature puts Gemma 4 directly into WhatsApp, Gmail, Telegram — grammar correction, tone rewriting, translation — without leaving the keyboard, without internet. Nothing sent externally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical Stack:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llama.cpp             → inference engine (JNI bridge)
Gemma 4 E2B GGUF      → unsloth/gemma-4-E2B-it-GGUF
SharedAIManager       → centralized generation pipeline
ModelLoadCoordinator  → serialized loading, race-condition safe
MemoryWarningChecker  → RAM tier detection
FlorisBoard fork      → Keyboard AI
Markor fork           → Notes AI
Kotlin 2.3.0 | Min SDK: 26 | Target: 36 | arm64-v8a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key decisions: ARM64-only for v1, no cloud inference, Firebase Crashlytics only, model downloads integrated directly inside the app via built-in HuggingFace search. Settings stored in &lt;code&gt;EncryptedSharedPreferences&lt;/code&gt; — API keys and server URLs never stored in plaintext.&lt;/p&gt;

&lt;p&gt;Most people on Earth don't own AI workstations. They own Android phones. Many can't afford $20/month cloud subscriptions. Many have unreliable internet. Many don't want their personal data on remote servers.&lt;/p&gt;

&lt;p&gt;Gemma 4 E2B is one of the first open models that makes private, capable AI genuinely practical on mainstream mobile hardware. Privacy is not a luxury feature. It is a baseline requirement.&lt;/p&gt;

&lt;p&gt;Not bigger servers. Smarter devices.&lt;/p&gt;




&lt;h2&gt;Open Source Credits&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FlorisBoard&lt;/strong&gt; (Apache 2.0) — Keyboard foundation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Markor&lt;/strong&gt; (Apache 2.0) — Notes foundation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt; (MIT) — Inference engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unsloth&lt;/strong&gt; — Optimized Gemma 4 E2B GGUF&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built with Kotlin · llama.cpp · Gemma 4 E2B · Android 16&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Test device: Realme RMX5070 · 7.2GB RAM · arm64-v8a&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>mobile</category>
    </item>
  </channel>
</rss>
