KRISHNA D

Private AI on a Normal Android Phone: Building Krexel with Gemma 4 E2B

This is a submission for the Gemma 4 Challenge: Build with Gemma 4


What I Built

Most AI assistants you use today send your data to a server. Your messages. Your documents. Your medical reports. Your private thoughts.

That's the deal. You get intelligence, they get your data.

The most personal conversations people have with AI are often the exact conversations they should not have to upload anywhere.

I wanted to break that deal.

Krexel is a fully offline AI productivity suite for Android powered by Gemma 4 E2B, running entirely on-device via llama.cpp.

No cloud. No API keys. No internet required. Your data never leaves your phone.

Four features in one app:

  • Chat AI — conversational AI with visible reasoning mode
  • Keyboard AI — AI assistance inside every Android app you already use
  • Notes AI — summarize, rewrite, polish, and translate locally
  • Translation AI — 70+ languages, zero API cost

Built for real-world mid-range Android phones with 6–8GB RAM — the hardware billions of people actually own. This is not a remote wrapper over a hosted model. The model runs directly on the phone itself.

Krexel is proprietary. Google Play release coming soon.


Demo

The demo shows offline AI chat in airplane mode, Keyboard AI inside Android apps, local translation, medical report analysis fully offline, and Gemma 4 reasoning mode running on-device.


Code

Krexel is a proprietary app. The core of the system is SharedAIManager — a singleton that routes all inference requests from four separate features (Chat, Keyboard, Notes, Translation) through a single serialized pipeline.

1. Priority preemption — four features, one model, no conflicts

The hardest problem with one model serving four surfaces: what happens when the keyboard is generating a suggestion and the user opens Chat? Lower-priority work gets preempted instantly.

enum class Priority(val level: Int) {
    BACKGROUND(0),  // keyboard suggestions, notification quick-replies
    NORMAL(1),      // chat responses
    HIGH(2),        // interactive note editing (user is watching and waiting)
}

// If a lower-priority generation is running and we're higher priority, cancel it
if (isGenerating && priority.level > currentPriority.level) {
    Log.d(TAG, "Preempting ${currentPriority.name} generation for ${priority.name} request")
    provider.cancelGeneration()
}
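
To make the preemption rule concrete, here is a minimal, self-contained sketch of the same check. It is an illustration under stated assumptions, not Krexel's actual code: `GenerationProvider`, `PreemptionGate`, and `CountingProvider` are hypothetical names, and the sketch ignores threading entirely.

```kotlin
// Hypothetical stand-ins for illustration; not Krexel's real classes.
enum class Priority(val level: Int) {
    BACKGROUND(0),
    NORMAL(1),
    HIGH(2),
}

interface GenerationProvider {
    fun cancelGeneration()
}

class PreemptionGate(private val provider: GenerationProvider) {
    var isGenerating = false
    var currentPriority = Priority.BACKGROUND

    /** Cancels the running generation only if [incoming] is strictly higher priority. */
    fun preemptIfNeeded(incoming: Priority): Boolean {
        val shouldPreempt = isGenerating && incoming.level > currentPriority.level
        if (shouldPreempt) provider.cancelGeneration()
        return shouldPreempt
    }
}

// Test double that records how many times cancellation was requested.
class CountingProvider : GenerationProvider {
    var cancelCount = 0
        private set

    override fun cancelGeneration() {
        cancelCount++
    }
}
```

Because the comparison is strict, a request at the same priority as the running one waits its turn instead of cancelling it. Only a strictly higher level, such as Chat arriving while a keyboard suggestion is generating, triggers `cancelGeneration()`.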

2. Queue-based mutex — race-condition safe generation

Every generation acquires a mutex. State is always cleaned up in finally — no matter what happens.

val result = generationMutex.withLock {
    isGenerating = true
    activeRequestId = requestId
    currentPriority = priority
    try {
        generateWithSystemBlocking(...)
    } finally {
        isGenerating = false
        activeRequestId = -1
        currentPriority = Priority.BACKGROUND
    }
}
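
The cleanup guarantee can be checked in isolation. The snippet above does not show what type `generationMutex` is, so this sketch swaps in a fair `ReentrantLock` from the JVM standard library as a stand-in. `GenerationState` and `runExclusive` are hypothetical names, and the `generate` lambda stands in for the real inference call.

```kotlin
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

// Hypothetical sketch: one lock serializes generations, and finally resets state.
class GenerationState {
    // Fair mode hands the lock to the longest-waiting thread, giving queue-like ordering.
    private val lock = ReentrantLock(true)

    var isGenerating = false
        private set
    var activeRequestId = -1
        private set

    fun <T> runExclusive(requestId: Int, generate: () -> T): T = lock.withLock {
        isGenerating = true
        activeRequestId = requestId
        try {
            generate()
        } finally {
            // Runs whether generate() returned normally or threw.
            isGenerating = false
            activeRequestId = -1
        }
    }
}
```

If `generate` throws, the exception still propagates to the caller, but the next request sees a clean state (`isGenerating == false`, `activeRequestId == -1`) instead of a stale flag left behind by the failed run.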

How I Used Gemma 4

Why E2B and not the others — this was not a default choice:

| Model | RAM Required | Verdict |
|---|---|---|
| Gemma 4 31B Dense | 24GB+ | Server-grade only |
| Gemma 4 26B MoE | 18GB+ | Too large for phones |
| Gemma 4 E4B | 4GB+ | Possible |
| Gemma 4 E2B | 2–3GB | ✅ Ideal for Android |
Krexel targets the hardware normal people actually own. Not RTX workstations. Not Mac Studios. Not cloud GPUs.

The specific model: unsloth/gemma-4-E2B-it-GGUF (~2.9GB). On my test device — Realme RMX5070, 7.2GB RAM, Android 16, arm64-v8a — it runs at 5.74 tokens/sec. That performance on a normal phone completely changed how I thought about local AI.
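
To put that throughput in perspective, a rough back-of-the-envelope helper is sketched below. It assumes a constant generation rate and ignores prompt-processing time, so it gives a lower bound rather than a measurement. `estimatedSeconds` is a hypothetical name.

```kotlin
// Rough estimate only: assumes a steady decode rate and no prompt-processing cost.
fun estimatedSeconds(tokens: Int, tokensPerSecond: Double): Double {
    require(tokens >= 0) { "tokens must not be negative" }
    require(tokensPerSecond > 0.0) { "tokensPerSecond must be positive" }
    return tokens / tokensPerSecond
}
```

At 5.74 tokens/sec, a 200-token reply would take roughly 35 seconds to finish generating. Streaming the output as it arrives matters at this speed.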

Smart RAM tier detection automatically adjusts model recommendations per device:

// totalRam appears to be in MB (4096 MB = 4GB)
val tier = when {
    totalRam < 4096 -> DeviceTier.LOW_RAM   // max 350MB model
    totalRam < 6144 -> DeviceTier.FOUR_GB   // max 550MB model
    totalRam < 8192 -> DeviceTier.MID_RANGE // max 1200MB model
    else            -> DeviceTier.HIGH_END  // max 2400MB model
}
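
A self-contained version of this tiering is sketched below so the boundaries can be checked. It assumes `totalRam` is measured in MB. The `maxModelMb` field, `tierFor`, and `fitsTier` are hypothetical additions that turn the size comments into data. On Android the total could be read from `ActivityManager.MemoryInfo.totalMem` (reported in bytes), but the sketch takes a plain number to stay runnable anywhere.

```kotlin
// Hypothetical sketch: the model-size caps from the comments, attached to each tier.
enum class DeviceTier(val maxModelMb: Int) {
    LOW_RAM(350),
    FOUR_GB(550),
    MID_RANGE(1200),
    HIGH_END(2400),
}

// Maps total device RAM (in MB) to a tier using the same thresholds as above.
fun tierFor(totalRamMb: Long): DeviceTier = when {
    totalRamMb < 4096 -> DeviceTier.LOW_RAM
    totalRamMb < 6144 -> DeviceTier.FOUR_GB
    totalRamMb < 8192 -> DeviceTier.MID_RANGE
    else -> DeviceTier.HIGH_END
}

// True when a model of the given size is within the recommended cap for the device.
fun fitsTier(modelSizeMb: Int, totalRamMb: Long): Boolean =
    modelSizeMb <= tierFor(totalRamMb).maxModelMb
```

The thresholds are exclusive upper bounds, so a device reporting exactly 4096 MB lands in `FOUR_GB`, not `LOW_RAM`. Devices usually report slightly less than their marketed RAM, which is worth keeping in mind when choosing the cut-offs.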

What Gemma 4 specifically unlocked:

1. Private Medical Analysis

Users can open blood test reports with the phone in Airplane Mode and get plain-English explanations entirely offline. No server. No upload. No third-party processing. A cloud-hosted assistant cannot offer this, because the report has to leave the device to be processed. With Gemma 4 on-device, users never have to choose between intelligence and privacy.

2. Reasoning On-Device

Gemma 4's <think> token support lets users watch reasoning chains run directly on their own hardware. Zero server round-trips. The phone itself becomes the AI computer.
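
One way a UI could separate the visible reasoning from the final answer is sketched below. It assumes the model wraps its reasoning in `<think>` and a matching `</think>`. The closing tag, `ParsedReply`, and `splitReasoning` are assumptions for illustration, and the exact delimiters depend on the model's chat template.

```kotlin
// Hypothetical sketch: split raw model output into reasoning and answer.
data class ParsedReply(val reasoning: String?, val answer: String)

fun splitReasoning(raw: String): ParsedReply {
    val open = "<think>"
    val close = "</think>" // assumed closing delimiter
    val start = raw.indexOf(open)
    if (start == -1) return ParsedReply(null, raw.trim())

    val end = raw.indexOf(close, start + open.length)
    if (end == -1) {
        // Reasoning is still streaming: everything after the open tag is reasoning so far.
        return ParsedReply(
            reasoning = raw.substring(start + open.length).trim(),
            answer = raw.substring(0, start).trim(),
        )
    }

    val reasoning = raw.substring(start + open.length, end).trim()
    val answer = (raw.substring(0, start) + raw.substring(end + close.length)).trim()
    return ParsedReply(reasoning, answer)
}
```

Handling the unterminated case lets the same function run on partial output during streaming, so the reasoning panel can update token by token before the answer begins.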

3. Offline Translation

const val TRANSLATION_SYSTEM_PROMPT = """
You are a professional translator.
- Output ONLY the translated text
- No explanations, no preamble
- Preserve formatting and punctuation
- Match tone: formal stays formal
"""

One model handles 70+ languages. No separate translation engine needed.
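
How the system prompt gets paired with the text to translate is not shown in the post, so the following is only a sketch of one plausible shape. `ChatMessage`, `buildTranslationMessages`, the role names, and the wording of the user message are all assumptions.

```kotlin
// Hypothetical message shape; the real pipeline's request format is not shown in the post.
data class ChatMessage(val role: String, val content: String)

fun buildTranslationMessages(
    systemPrompt: String,
    text: String,
    targetLanguage: String,
    sourceLanguage: String? = null,
): List<ChatMessage> {
    require(text.isNotBlank()) { "Nothing to translate" }
    val direction = if (sourceLanguage != null) {
        "from $sourceLanguage to $targetLanguage"
    } else {
        "to $targetLanguage"
    }
    return listOf(
        // trim() drops the leading and trailing newlines of a triple-quoted prompt.
        ChatMessage("system", systemPrompt.trim()),
        ChatMessage("user", "Translate the following text $direction:\n\n$text"),
    )
}
```

Keeping the target language in the user message rather than in the system prompt lets one constant prompt serve every language pair.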

4. AI in Every App

The Keyboard AI feature puts Gemma 4 directly into WhatsApp, Gmail, Telegram — grammar correction, tone rewriting, translation — without leaving the keyboard, without internet. Nothing sent externally.

Technical Stack:

llama.cpp             → inference engine (JNI bridge)
Gemma 4 E2B GGUF      → unsloth/gemma-4-E2B-it-GGUF
SharedAIManager       → centralized generation pipeline
ModelLoadCoordinator  → serialized loading, race-condition safe
MemoryWarningChecker  → RAM tier detection
FlorisBoard fork      → Keyboard AI
Markor fork           → Notes AI
Kotlin 2.3.0 | Min SDK: 26 | Target: 36 | arm64-v8a

Key decisions: ARM64-only for v1, no cloud inference, Firebase Crashlytics only, and model downloads handled inside the app through built-in HuggingFace search. Settings are stored in EncryptedSharedPreferences, so API keys and server URLs are never stored in plaintext.

Most people on Earth don't own AI workstations. They own Android phones. Many can't afford $20/month cloud subscriptions. Many have unreliable internet. Many don't want their personal data on remote servers.

Gemma 4 E2B is one of the first open models that makes private, capable AI genuinely practical on mainstream mobile hardware. Privacy is not a luxury feature. It is a baseline requirement.

Not bigger servers. Smarter devices.


Open Source Credits

  • FlorisBoard (Apache 2.0) — Keyboard foundation
  • Markor (Apache 2.0) — Notes foundation
  • llama.cpp (MIT) — Inference engine
  • Unsloth — Optimized Gemma 4 E2B GGUF

Built with Kotlin · llama.cpp · Gemma 4 E2B · Android 16
Test device: Realme RMX5070 · 7.2GB RAM · arm64-v8a
