KRISHNA D

Posted on May 20 • Edited on May 24

Private AI on a Normal Android Phone: Building Krexel with Gemma 4 E2B

#devchallenge #gemmachallenge #gemma #mobile

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

Almost every mainstream AI assistant you use today sends your data to a server. Your messages. Your documents. Your medical reports. Your private thoughts.

That's the deal. You get intelligence, they get your data.

The most personal conversations people have with AI are often the exact conversations they should never have to upload anywhere.

A student trying to understand a medical report about their parent. A teenager writing something private. A developer working on code they haven't patented yet. A person in a rural area with no reliable internet who just wants to learn.

I wanted to break that deal.

Krexel is a fully offline AI productivity suite for Android, powered by Gemma 4 E2B running entirely on-device via llama.cpp.

No cloud. No API keys. No internet required once the model is on the device. Your data never leaves your phone.

Four features in one app:

Chat AI — conversational AI with visible reasoning mode
Keyboard AI — AI assistance inside every Android app you already use
Notes AI — summarize, rewrite, polish, generate code, and translate locally
Translation AI — 70+ languages, zero API cost

Built for real-world mid-range Android phones with 6–8GB RAM — the hardware billions of people actually own. This is not a remote wrapper over a hosted model. The model runs directly on the phone itself.

Krexel is proprietary. Google Play release coming soon.

Demo

The demo shows offline AI chat in airplane mode, Keyboard AI inside Android apps, local translation, medical report analysis fully offline, and Gemma 4 reasoning mode running on-device.

Code

Krexel is proprietary, but here's the architecture that made running a ~3GB LLM across four Android surfaces actually work.

Building local AI on mobile isn't just about loading a model — it's about surviving strict OS memory constraints, JNI crashes, resource contention, and UI deadlocks.

The core is SharedAIManager — a singleton that routes all inference requests from Chat, Keyboard, Notes, and Translation through a single serialized pipeline. One model. Four surfaces. Zero conflicts.

1. The Keyboard OOM Killer

An Android keyboard runs as a background system service. Load a ~3GB model inside it and the OS kills the keyboard silently, mid-typing.

The fix: the entire llama.cpp inference engine runs in a completely isolated background process. Tokens pipe back to the keyboard via Android Messenger IPC.

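// Assumed ID source (not part of the excerpt): a monotonically increasing counter.
private val nextRequestId = AtomicInteger(0)

fun generateStreaming(
    prompt: String,
    systemPrompt: String,
    maxTokens: Int,
    enableThinking: Boolean,
): Int {
    val requestId = nextRequestId.incrementAndGet()
    activeRequestId = requestId
    send(what = KrexelAiService.MSG_GENERATE_STREAMING, requestId = requestId, payload = Bundle().apply {
        putString(KrexelAiService.KEY_PROMPT, prompt)
        putBoolean(KrexelAiService.KEY_ENABLE_THINKING, enableThinking)
        // Excerpt: the systemPrompt and maxTokens entries are not shown.
    })
    return requestId
}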

If the LLM runs out of memory, it crashes in its own process. The keyboard keeps running.
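Because every call returns a request ID, the client side can also ignore tokens that arrive late from a request that has already been replaced. A minimal sketch of that idea, assuming each streamed token comes back tagged with its request ID; `TokenMessage` and `StreamingClient` are hypothetical names for illustration, not Krexel's classes:

```kotlin
// Hypothetical message shape: each streamed token carries the ID of the request that produced it.
data class TokenMessage(val requestId: Int, val token: String)

class StreamingClient {
    private var nextRequestId = 0
    private var activeRequestId = -1
    private val output = StringBuilder()

    // Starts a new request; tokens still in flight from the previous one become stale.
    fun startRequest(): Int {
        nextRequestId += 1
        activeRequestId = nextRequestId
        output.clear()
        return activeRequestId
    }

    // Called for every message from the inference process.
    // Returns true if the token belonged to the active request and was appended.
    fun onToken(message: TokenMessage): Boolean {
        if (message.requestId != activeRequestId) return false
        output.append(message.token)
        return true
    }

    fun text(): String = output.toString()
}
```

With this shape, a slow token from request 1 that lands after request 2 has started is dropped instead of being typed into the wrong text field.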

2. One Model, Four Surfaces — Priority Preemption

What happens when the keyboard is generating a suggestion and the user opens Chat? Lower-priority work gets preempted instantly.

enum class Priority(val level: Int) {
    BACKGROUND(0),  // keyboard suggestions
    NORMAL(1),      // chat responses
    HIGH(2),        // interactive note editing
}

if (isGenerating && priority.level > currentPriority.level) {
    provider.cancelGeneration()
}
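The preemption rule can be exercised in isolation. The following is a sketch under assumptions, not Krexel's scheduler: `PreemptingScheduler` and its `cancelRunning` callback are hypothetical, and the rule that equal or lower priority work has to wait is my reading of the check above.

```kotlin
enum class Priority(val level: Int) { BACKGROUND(0), NORMAL(1), HIGH(2) }

// Hypothetical scheduler: tracks the running request and decides whether a newcomer may preempt it.
class PreemptingScheduler(private val cancelRunning: () -> Unit) {
    private var running: Priority? = null

    // Returns true if the new request may start now (after cancelling the running one if needed).
    fun submit(priority: Priority): Boolean {
        val current = running
        if (current != null) {
            // Equal or lower priority does not preempt; the caller has to wait or retry.
            if (priority.level <= current.level) return false
            cancelRunning()
        }
        running = priority
        return true
    }

    // Marks the running request as finished so the next one can start.
    fun finish() {
        running = null
    }
}
```

A keyboard suggestion submitted at `BACKGROUND` is cancelled as soon as a `NORMAL` chat request arrives, while a second `BACKGROUND` request arriving during the chat response is turned away.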

3. Race-Condition Safe Generation

Every generation acquires a mutex. State always cleans up in finally — no matter what happens, no matter how fast the user taps.

val result = generationMutex.withLock {
    isGenerating = true
    try {
        generateWithSystemBlocking(...)
    } finally {
        isGenerating = false
        activeRequestId = -1
        currentPriority = Priority.BACKGROUND
    }
}
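The same cleanup guarantee can be shown without coroutines. This sketch swaps the coroutine mutex for a plain `ReentrantLock` so it runs on the standard library alone; `GenerationGuard` is a hypothetical name, and the point is only that `finally` resets state even when the work throws.

```kotlin
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

class GenerationGuard {
    private val lock = ReentrantLock()

    var isGenerating = false
        private set
    var activeRequestId = -1
        private set

    // Runs one generation at a time and always resets state afterwards.
    fun <T> runExclusive(requestId: Int, block: () -> T): T = lock.withLock {
        isGenerating = true
        activeRequestId = requestId
        try {
            block()
        } finally {
            // Reached on normal return and on exceptions alike.
            isGenerating = false
            activeRequestId = -1
        }
    }
}
```

If `block` throws, the exception still propagates to the caller, but the next request starts from a clean state.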

4. The Fast-Tap JNI Deadlock

Rapid task switching fires commands into llama.cpp out of order. On a native C++ JNI bridge, that's a hard crash — no stack trace, no recovery.

The fix: a Kotlin Flow state machine cancels the native thread and waits for ModelReady before proceeding.

private suspend fun waitForReadyState(timeoutMs: Long = 3000): Boolean {
    val engine = inferenceEngine ?: return false
    if (engine.state.value is InferenceEngine.State.Generating) {
        cancelGeneration()
    }
    return withTimeoutOrNull(timeoutMs) {
        engine.state.first { it is InferenceEngine.State.ModelReady }
        true
    } ?: false
}

No UI blocking. No JNI crashes. Clean state transitions.
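One way to picture the state machine is as a transition table that rejects out-of-order commands before they reach native code. This is a sketch with assumed states and commands; the real `InferenceEngine.State` may have more cases, and `nextState` is a hypothetical helper.

```kotlin
// Hypothetical engine states and commands, for illustration only.
enum class EngineState { UNLOADED, LOADING, MODEL_READY, GENERATING }
enum class Command { LOAD, GENERATE, CANCEL, UNLOAD }

// Returns the state a command leads to, or null if the command is not allowed right now.
fun nextState(state: EngineState, command: Command): EngineState? = when (state) {
    EngineState.UNLOADED -> if (command == Command.LOAD) EngineState.LOADING else null
    EngineState.LOADING -> null // nothing is accepted until loading completes
    EngineState.MODEL_READY -> when (command) {
        Command.GENERATE -> EngineState.GENERATING
        Command.UNLOAD -> EngineState.UNLOADED
        else -> null
    }
    EngineState.GENERATING -> if (command == Command.CANCEL) EngineState.MODEL_READY else null
}
```

A second `GENERATE` fired while the engine is still `GENERATING` returns null, so the caller knows to cancel and wait for `MODEL_READY` first rather than calling into the JNI bridge.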

5. Safety Without a Classifier

No RAM left for a secondary safety model. A real-time token buffering state machine evaluates the stream as it arrives — before a single character reaches the screen.

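// emit(), .text and abortAndReplace() are illustrative names for this excerpt.
when (val streamResult = streamingFilter.processToken(token)) {
    is Safe       -> emit(streamResult.text)  // emit instantly
    is Suspicious -> Unit                     // hold in buffer
    is Blocked    -> abortAndReplace()        // abort and replace
}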

No second model in RAM. Safe tokens pass straight through; only suspicious spans are held back until the next tokens resolve them.
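To make the buffering idea concrete, here is a small runnable sketch. It uses plain phrase matching against a block list, which is my stand-in and not necessarily how Krexel decides what is unsafe; the class and result types are hypothetical. Text is held only while the end of the buffer could still grow into a blocked phrase.

```kotlin
sealed interface StreamResult
data class Safe(val text: String) : StreamResult
object Suspicious : StreamResult
object Blocked : StreamResult

class StreamingFilter(private val blockedPhrases: List<String>) {
    private val pending = StringBuilder()

    fun processToken(token: String): StreamResult {
        pending.append(token)
        val text = pending.toString()
        if (blockedPhrases.any { text.contains(it, ignoreCase = true) }) {
            pending.clear()
            return Blocked
        }
        // Longest tail of the buffer that is still a proper prefix of some blocked phrase.
        val maxTail = minOf(text.length, blockedPhrases.maxOfOrNull { it.length - 1 } ?: 0)
        val hold = (maxTail downTo 1).firstOrNull { n ->
            blockedPhrases.any { it.startsWith(text.takeLast(n), ignoreCase = true) }
        } ?: 0
        val release = text.substring(0, text.length - hold)
        if (release.isEmpty()) return Suspicious
        pending.delete(0, release.length)
        return Safe(release)
    }

    // Call at end of stream: whatever is still held never completed a blocked phrase.
    fun flush(): String = pending.toString().also { pending.clear() }
}
```

With a block list of `"bad word"`, the tokens `"ba"` and `"d w"` are held, and a following `"orld"` releases `"bad world"` in one piece because the phrase never completed.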

6. Streaming Directly Into the Cursor

Most keyboard AI tools wait for full generation then paste. Krexel pipes tokens directly into the Android InputConnection as they arrive — inside WhatsApp, Gmail, Telegram — no app switching, no internet, no waiting.

FlorisImeService.requestKeyboardAiStreaming(
    prompt = filteredPrompt,
    onToken = { token ->
        currentInputConnection?.commitText(token, 1)
    }
)

It doesn't feel like AI. It feels like the keyboard itself got smarter.

7. Hardware-Gated Model Selection

One model size for all devices leaves many users with an out-of-memory crash the first time the model loads. Krexel caps the model size by the device's physical RAM.

val tier = when {
    totalRam < 4096 -> DeviceTier.LOW_RAM    // max 350MB model
    totalRam < 6144 -> DeviceTier.FOUR_GB   // max 550MB model
    totalRam < 8192 -> DeviceTier.MID_RANGE // max 1200MB model
    else            -> DeviceTier.HIGH_END  // unlocks full Gemma 4 E2B
}

Every device gets the best model it can actually run.
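As a self-contained version of the tier check: this sketch reuses the thresholds and caps from the snippet above and assumes they are in MB, with total RAM arriving in bytes. `tierForBytes` and `fitsTier` are hypothetical helpers, and treating `HIGH_END` as uncapped is my simplification.

```kotlin
enum class DeviceTier(val maxModelMb: Int?) {
    LOW_RAM(350),
    FOUR_GB(550),
    MID_RANGE(1200),
    HIGH_END(null) // null means no cap in this sketch
}

// Assumption: total RAM is given in bytes and the thresholds are in MB (1 MB = 1024 * 1024 bytes).
fun tierForBytes(totalRamBytes: Long): DeviceTier {
    val totalRamMb = totalRamBytes / (1024L * 1024L)
    return when {
        totalRamMb < 4096 -> DeviceTier.LOW_RAM
        totalRamMb < 6144 -> DeviceTier.FOUR_GB
        totalRamMb < 8192 -> DeviceTier.MID_RANGE
        else -> DeviceTier.HIGH_END
    }
}

// True if a model file of the given size fits under the tier's cap.
fun fitsTier(tier: DeviceTier, modelSizeMb: Int): Boolean =
    tier.maxModelMb?.let { modelSizeMb <= it } ?: true
```

One thing to watch with any scheme like this: the total memory a phone reports is usually somewhat below its marketed figure, so where exactly the thresholds sit decides which tier a nominal "8GB" device lands in.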

How I Used Gemma 4

This was not a default choice.

Model	RAM Required	Verdict
Gemma 4 31B Dense	24GB+	Server-grade only
Gemma 4 26B MoE	18GB+	Too large for phones
Gemma 4 E4B	4GB+	Possible
Gemma 4 E2B	2–3GB	✅ Ideal for Android

Krexel targets the hardware normal people actually own — not workstations, not cloud GPUs.

The specific model: unsloth/gemma-4-E2B-it-GGUF (~2.9GB). On my test device — Realme RMX5070, 7.2GB RAM, Android 16, arm64-v8a — it runs at 5.74 tokens/sec. That performance on a normal phone completely changed how I thought about local AI.
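To put 5.74 tokens/sec in perspective, a quick back-of-the-envelope helper. It ignores prompt processing time, and the 100-token reply length is just an example figure, not a measurement:

```kotlin
// Rough generation time for a reply, ignoring prompt processing and model load time.
fun secondsFor(tokens: Int, tokensPerSecond: Double): Double = tokens / tokensPerSecond
```

At the measured rate, `secondsFor(100, 5.74)` comes out to roughly 17.4 seconds, which is why streaming tokens as they arrive matters so much for how the app feels.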

What Gemma 4 Specifically Unlocked

1. Private Medical Analysis

Users can open a blood test report with Airplane Mode switched on and get a plain-English explanation.

Running entirely on-device gives Krexel strong privacy guarantees because documents never leave the phone.

⚠️ Krexel is not a medical tool. AI responses are for informational purposes only — always consult a qualified doctor for medical decisions.

2. Reasoning On-Device

Gemma 4's reasoning support lets users watch reasoning chains run directly on their own hardware. Zero server round-trips. The phone itself becomes the AI computer.

3. Offline Translation

const val TRANSLATION_SYSTEM_PROMPT = """
You are a professional translator.
- Output ONLY the translated text
- No explanations, no preamble
- Preserve formatting and punctuation
- Match tone: formal stays formal
"""

One model. 70+ languages. No separate engine needed.
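A system prompt like this still needs a user turn that names the target language. A minimal sketch of how the two might be combined; `ChatMessage`, `buildTranslationRequest`, and the wording of the user message are all hypothetical, not Krexel's actual request format:

```kotlin
// Hypothetical chat message shape for illustration.
data class ChatMessage(val role: String, val content: String)

// Pairs the fixed system prompt with a user turn that names the target language.
fun buildTranslationRequest(
    systemPrompt: String,
    text: String,
    targetLanguage: String,
): List<ChatMessage> = listOf(
    ChatMessage("system", systemPrompt.trim()),
    ChatMessage("user", "Translate the following text into $targetLanguage:\n\n$text"),
)
```

Keeping the rules in the system prompt and only the language and text in the user turn means the same prompt constant can serve every language pair.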

4. AI in Every App — Keyboard AI

Think of it like having an AI writing assistant built directly into your keyboard — the kind developers get inside their IDE, but for every app on your phone. The difference: nothing is ever sent to a server. No prompt is stored. No conversation is used to train anything. It's purely local.

Grammar correction, tone rewriting, translation — all without leaving the app, all without internet.

5. Notes AI — Your Offline Writing and Coding Assistant

Notes AI works like an AI assistant inside a code editor — suggest, explain, refactor, debug — except it never sends your code or writing anywhere. Your ideas stay yours.

Students use it to understand textbook chapters, rewrite assignments, and summarize notes. Developers use it to generate snippets, explain unfamiliar code, and organize architecture thoughts. All offline. All private.

6. Child Safe Mode — Safety Without a Cloud

If a user is under 18, Child Safe Mode activates automatically based on date of birth. Unsafe token patterns are intercepted in real-time during streaming — before a single character reaches the screen.

Once set, changing the date of birth requires biometric verification. A parent sets it once, and a child cannot change it without that verification.
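The age check itself is a small piece of date arithmetic. A sketch using `java.time`, which is available from API 26 (the app's stated minimum SDK); `isChildSafeModeRequired` is a hypothetical name, and `today` is passed in explicitly so the check is easy to test:

```kotlin
import java.time.LocalDate
import java.time.Period

// True while the user has not yet had their 18th birthday.
fun isChildSafeModeRequired(dateOfBirth: LocalDate, today: LocalDate): Boolean =
    Period.between(dateOfBirth, today).years < 18
```

Counting completed years with `Period` avoids off-by-one errors around birthdays that a naive year subtraction would introduce.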

Why This Matters

Most people on Earth don't own AI workstations. They own Android phones.

India alone has 650 million smartphone users. Most use mid-range devices. Many rely entirely on their phones as their primary computing device. Many study and work in areas where connectivity is a luxury, not a given.

300 million students. Most own Android phones. Most just want to learn — without their curiosity being logged on a server somewhere.

Gemma 4 E2B runs on exactly these devices.

Privacy should be accessible on mainstream hardware too.

Not bigger servers. Smarter devices.

Technical Stack

llama.cpp             → inference engine (JNI bridge)
Gemma 4 E2B GGUF      → unsloth/gemma-4-E2B-it-GGUF
SharedAIManager       → centralized generation pipeline
ModelLoadCoordinator  → serialized loading, race-condition safe
MemoryWarningChecker  → RAM tier detection
FlorisBoard fork      → Keyboard AI
Markor fork           → Notes AI
Kotlin 2.3.0 | Min SDK: 26 | Target: 36 | arm64-v8a

Settings are stored in EncryptedSharedPreferences. No API keys or server URLs are ever stored in plaintext. Model downloads happen directly inside the app via a built-in Hugging Face search.

Open Source Credits

FlorisBoard (Apache 2.0) — Keyboard foundation
Markor (Apache 2.0) — Notes foundation
llama.cpp (MIT) — Inference engine
Unsloth — Optimized Gemma 4 E2B GGUF

Built with Kotlin · llama.cpp · Gemma 4 E2B · Android 16
Test device: Realme RMX5070 · 7.2GB RAM · arm64-v8a
