This is a submission for the Gemma 4 Challenge: Build with Gemma 4
What I Built
Nearly every mainstream AI assistant you use today sends your data to a server. Your messages. Your documents. Your medical reports. Your private thoughts.
That's the deal. You get intelligence, they get your data.
The most personal conversations people have with AI are often the exact conversations they should not have to upload anywhere.
I wanted to break that deal.
Krexel is a fully offline AI productivity suite for Android powered by Gemma 4 E2B, running entirely on-device via llama.cpp.
No cloud. No API keys. No internet required. Your data never leaves your phone.
Four features in one app:
- Chat AI — conversational AI with visible reasoning mode
- Keyboard AI — AI assistance inside every Android app you already use
- Notes AI — summarize, rewrite, polish, and translate locally
- Translation AI — 70+ languages, zero API cost
Built for real-world mid-range Android phones with 6–8GB RAM — the hardware billions of people actually own. This is not a remote wrapper over a hosted model. The model runs directly on the phone itself.
Krexel is proprietary. Google Play release coming soon.
Demo
The demo shows offline AI chat in airplane mode, Keyboard AI inside Android apps, local translation, medical report analysis fully offline, and Gemma 4 reasoning mode running on-device.
Code
Because Krexel is proprietary, only excerpts are shown here. The core of the system is SharedAIManager, a singleton that routes all inference requests from the four features (Chat, Keyboard, Notes, Translation) through a single serialized pipeline.
1. Priority preemption — four features, one model, no conflicts
The hardest problem with one model serving four surfaces: what happens when the keyboard is generating a suggestion and the user opens Chat? Lower-priority work gets preempted instantly.
enum class Priority(val level: Int) {
    BACKGROUND(0), // keyboard suggestions, notification quick-replies
    NORMAL(1),     // chat responses
    HIGH(2),       // interactive note editing (user is watching and waiting)
}

// If a lower-priority generation is running and we're higher priority, cancel it
if (isGenerating && priority.level > currentPriority.level) {
    Log.d(TAG, "Preempting ${currentPriority.name} generation for ${priority.name} request")
    provider.cancelGeneration()
}
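The preemption rule above can be pulled out into a small unit that is easy to test without a model loaded. This is a minimal sketch, not the actual SharedAIManager: `GenerationProvider`, `PreemptionGate`, and `RecordingProvider` are hypothetical names introduced for illustration, and only `Priority` and `cancelGeneration()` come from the excerpts above.

```kotlin
enum class Priority(val level: Int) {
    BACKGROUND(0), // keyboard suggestions, notification quick-replies
    NORMAL(1),     // chat responses
    HIGH(2),       // interactive note editing
}

// Hypothetical stand-in for the llama.cpp-backed provider.
interface GenerationProvider {
    fun cancelGeneration()
}

// Hypothetical helper holding just enough state to make the preemption decision.
class PreemptionGate(private val provider: GenerationProvider) {
    var isGenerating = false
        private set
    var currentPriority = Priority.BACKGROUND
        private set

    // Record that a generation at the given priority has started.
    fun onGenerationStarted(priority: Priority) {
        isGenerating = true
        currentPriority = priority
    }

    // Record that the running generation finished or was cancelled.
    fun onGenerationFinished() {
        isGenerating = false
        currentPriority = Priority.BACKGROUND
    }

    // Cancels the running generation only if the incoming request outranks it.
    // Returns true when a cancellation was issued.
    fun preemptIfNeeded(incoming: Priority): Boolean {
        if (isGenerating && incoming.level > currentPriority.level) {
            provider.cancelGeneration()
            return true
        }
        return false
    }
}

// Test double that counts cancellations instead of touching a real model.
class RecordingProvider : GenerationProvider {
    var cancelCount = 0
        private set

    override fun cancelGeneration() {
        cancelCount++
    }
}
```

With this shape, a chat request arriving during a keyboard suggestion triggers exactly one cancellation, while a request of equal or lower priority simply waits its turn.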
2. Queue-based mutex — race-condition safe generation
Every generation acquires a mutex. State is always cleaned up in finally — no matter what happens.
val result = generationMutex.withLock {
    isGenerating = true
    activeRequestId = requestId
    currentPriority = priority
    try {
        generateWithSystemBlocking(...)
    } finally {
        isGenerating = false
        activeRequestId = -1
        currentPriority = Priority.BACKGROUND
    }
}
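To see why the `finally` block matters, here is a self-contained sketch of the same pattern. It swaps in `java.util.concurrent.locks.ReentrantLock` for the mutex in the excerpt so it runs with the standard library alone; `SerializedGenerator` and `runExclusive` are hypothetical names, and the `generate` lambda stands in for the real inference call.

```kotlin
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

enum class Priority(val level: Int) { BACKGROUND(0), NORMAL(1), HIGH(2) }

// Hypothetical wrapper; ReentrantLock stands in for the mutex used in the app.
class SerializedGenerator {
    private val generationLock = ReentrantLock()

    @Volatile
    var isGenerating = false
        private set

    @Volatile
    var activeRequestId = -1
        private set

    @Volatile
    var currentPriority = Priority.BACKGROUND
        private set

    // Runs one generation at a time. The bookkeeping is reset in `finally`,
    // so it is restored even when `generate` throws.
    fun <T> runExclusive(requestId: Int, priority: Priority, generate: () -> T): T =
        generationLock.withLock {
            isGenerating = true
            activeRequestId = requestId
            currentPriority = priority
            try {
                generate()
            } finally {
                isGenerating = false
                activeRequestId = -1
                currentPriority = Priority.BACKGROUND
            }
        }
}
```

If the generation throws, the exception still reaches the caller, but the next request sees a clean idle state rather than a stale "generating" flag.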
How I Used Gemma 4
Why E2B and not the others — this was not a default choice:
| Model | RAM Required | Verdict |
|---|---|---|
| Gemma 4 31B Dense | 24GB+ | Server-grade only |
| Gemma 4 26B MoE | 18GB+ | Too large for phones |
| Gemma 4 E4B | 4GB+ | Possible |
| Gemma 4 E2B | 2–3GB | ✅ Ideal for Android |
Krexel targets the hardware normal people actually own. Not RTX workstations. Not Mac Studios. Not cloud GPUs.
The specific model: unsloth/gemma-4-E2B-it-GGUF (~2.9GB). On my test device — Realme RMX5070, 7.2GB RAM, Android 16, arm64-v8a — it runs at 5.74 tokens/sec. That performance on a normal phone completely changed how I thought about local AI.
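To put that number in perspective, a quick back-of-the-envelope estimate helps. The helper below is a hypothetical illustration: it assumes a steady decode rate and ignores prompt processing and model load time, and the token counts in the note after it are example values, not measurements.

```kotlin
// Estimated seconds to generate `tokenCount` tokens at a steady decode rate.
// The default rate is the 5.74 tokens/sec reported for the test device.
fun estimatedSeconds(tokenCount: Int, tokensPerSecond: Double = 5.74): Double {
    require(tokenCount >= 0) { "tokenCount must be non-negative" }
    require(tokensPerSecond > 0.0) { "tokensPerSecond must be positive" }
    return tokenCount / tokensPerSecond
}
```

At this rate a 20-token keyboard suggestion takes roughly 3.5 seconds and a 200-token chat reply roughly 35 seconds, which is one reason a single shared model needs clear rules about whose request runs first.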
Smart RAM tier detection automatically adjusts model recommendations per device:
val tier = when {
    totalRam < 4096 -> DeviceTier.LOW_RAM   // max 350MB model
    totalRam < 6144 -> DeviceTier.FOUR_GB   // max 550MB model
    totalRam < 8192 -> DeviceTier.MID_RANGE // max 1200MB model
    else -> DeviceTier.HIGH_END             // max 2400MB model
}
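The same mapping can be written as a pure function, which makes the boundaries easy to check. This sketch assumes `totalRam` is measured in megabytes (the 4096/6144/8192 thresholds suggest so) and attaches the size limits from the comments above to a hypothetical `maxModelMb` property; `tierForRam` and `bytesToMb` are illustrative names.

```kotlin
// Tiers from the excerpt above; `maxModelMb` is a hypothetical property that
// carries the size limits mentioned in the original comments.
enum class DeviceTier(val maxModelMb: Int) {
    LOW_RAM(350),
    FOUR_GB(550),
    MID_RANGE(1200),
    HIGH_END(2400),
}

// Maps total device RAM in megabytes to a tier using the same thresholds.
fun tierForRam(totalRamMb: Long): DeviceTier = when {
    totalRamMb < 4096 -> DeviceTier.LOW_RAM
    totalRamMb < 6144 -> DeviceTier.FOUR_GB
    totalRamMb < 8192 -> DeviceTier.MID_RANGE
    else -> DeviceTier.HIGH_END
}

// Converts a byte count to whole megabytes.
fun bytesToMb(bytes: Long): Long = bytes / (1024L * 1024L)
```

On Android, the byte count would typically come from `ActivityManager.MemoryInfo.totalMem`, which reports total RAM in bytes and would be passed through `bytesToMb` first.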
What Gemma 4 specifically unlocked:
1. Private Medical Analysis
Users can open a blood test report in full airplane mode and get a plain-English explanation entirely offline. No server. No upload. No third-party processing. A cloud-hosted assistant cannot make that guarantee, because the report has to reach a server before it can be processed. With Gemma 4 on-device, users never have to choose between intelligence and privacy.
2. Reasoning On-Device
Gemma 4's <think> token support lets users watch reasoning chains run directly on their own hardware. Zero server round-trips. The phone itself becomes the AI computer.
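Showing the reasoning separately from the answer means splitting the raw model output at the thinking markers. The sketch below is an assumption-laden illustration, not Krexel's parser: it assumes the reasoning arrives as plain text wrapped in `<think>` and a closing `</think>`, and `ParsedReply` and `splitReasoning` are hypothetical names.

```kotlin
// Result of separating a model reply into its reasoning and its final answer.
data class ParsedReply(val reasoning: String?, val answer: String)

// Splits the first <think>...</think> span from the rest of the reply.
// If there is no complete span, the whole text is treated as the answer.
fun splitReasoning(raw: String): ParsedReply {
    val open = "<think>"
    val close = "</think>" // assumed closing marker
    val start = raw.indexOf(open)
    val end = if (start >= 0) raw.indexOf(close, start + open.length) else -1
    if (start < 0 || end < 0) return ParsedReply(null, raw.trim())

    val reasoning = raw.substring(start + open.length, end).trim()
    val answer = (raw.substring(0, start) + raw.substring(end + close.length)).trim()
    return ParsedReply(reasoning, answer)
}
```

A UI could then render `reasoning` in a collapsible panel and `answer` as the main message. A streaming implementation would need extra handling for a span that has opened but not yet closed, which this sketch deliberately leaves as plain answer text.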
3. Offline Translation
const val TRANSLATION_SYSTEM_PROMPT = """
You are a professional translator.
- Output ONLY the translated text
- No explanations, no preamble
- Preserve formatting and punctuation
- Match tone: formal stays formal
"""
One model handles 70+ languages. No separate translation engine needed.
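One way a single fixed system prompt can cover many language pairs is to keep the prompt constant and name the languages in the user message. The sketch below illustrates that idea only: `TranslationRequest`, `buildTranslationRequest`, and the wording of the user message are all assumptions, not the app's actual request format.

```kotlin
// Hypothetical container for the two messages handed to the inference layer.
data class TranslationRequest(val systemPrompt: String, val userMessage: String)

// Builds one request per translation. The system prompt stays fixed and only
// the user message names the languages. The message wording is an assumption.
fun buildTranslationRequest(
    systemPrompt: String,
    text: String,
    targetLanguage: String,
    sourceLanguage: String? = null,
): TranslationRequest {
    require(text.isNotBlank()) { "text must not be blank" }
    require(targetLanguage.isNotBlank()) { "targetLanguage must not be blank" }
    val direction =
        if (sourceLanguage.isNullOrBlank()) "into $targetLanguage"
        else "from $sourceLanguage into $targetLanguage"
    return TranslationRequest(
        systemPrompt = systemPrompt.trimIndent().trim(),
        userMessage = "Translate the following text $direction:\n\n$text",
    )
}
```

Calling `buildTranslationRequest(TRANSLATION_SYSTEM_PROMPT, "Good morning", "Hindi")` would yield a request whose system prompt never changes between languages, so adding a language pair needs no new prompt or engine.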
4. AI in Every App
The Keyboard AI feature puts Gemma 4 directly into WhatsApp, Gmail, Telegram — grammar correction, tone rewriting, translation — without leaving the keyboard, without internet. Nothing sent externally.
Technical Stack:
llama.cpp → inference engine (JNI bridge)
Gemma 4 E2B GGUF → unsloth/gemma-4-E2B-it-GGUF
SharedAIManager → centralized generation pipeline
ModelLoadCoordinator → serialized loading, race-condition safe
MemoryWarningChecker → RAM tier detection
FlorisBoard fork → Keyboard AI
Markor fork → Notes AI
Kotlin 2.3.0 | Min SDK: 26 | Target: 36 | arm64-v8a
Key decisions: ARM64-only for v1, no cloud inference, crash reporting through Firebase Crashlytics only, and model downloads handled inside the app through a built-in Hugging Face search. Settings are stored in EncryptedSharedPreferences, so API keys and server URLs are never stored in plaintext.
Most people on Earth don't own AI workstations. They own Android phones. Many can't afford $20/month cloud subscriptions. Many have unreliable internet. Many don't want their personal data on remote servers.
Gemma 4 E2B is one of the first open models that makes private, capable AI genuinely practical on mainstream mobile hardware. Privacy is not a luxury feature. It is a baseline requirement.
Not bigger servers. Smarter devices.
Open Source Credits
- FlorisBoard (Apache 2.0) — Keyboard foundation
- Markor (Apache 2.0) — Notes foundation
- llama.cpp (MIT) — Inference engine
- Unsloth — Optimized Gemma 4 E2B GGUF
Built with Kotlin · llama.cpp · Gemma 4 E2B · Android 16
Test device: Realme RMX5070 · 7.2GB RAM · arm64-v8a