Running Gemma 4 On-Device on Android: Building a Private AI Assistant with LiteRT
I built an Android app that runs Gemma 4 entirely on-device using Google's LiteRT framework: no cloud API calls, no data leaving your phone, no API costs. Here's how it works and what I learned.
Why On-Device AI?
Cloud AI is powerful, but it comes with downsides: latency, privacy concerns, rate limits, and recurring API costs. Running a model locally addresses all of these: your data stays on the device, responses don't wait on a network round trip, and there's no per-request cost.
The tradeoff is compute. A modern flagship phone can run a 2-3B parameter model comfortably, but you need the right quantization and engine.
The Stack
- Model: Gemma 4 E2B (Google's latest edge-optimized model)
- Engine: LiteRT LM (Google's on-device LLM runtime)
- Quantization: Q4 (3.2GB), SFP8 (4.6GB), BF16 (9.6GB)
- Language: Kotlin
- Backend: Multi-provider fallback (LiteRT → Gemini → Claude → OpenAI)
Model Download & Management
The model is hosted on Hugging Face and downloaded on first launch:
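import android.content.Context
import java.io.File
import java.io.IOException
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import okhttp3.OkHttpClient
import okhttp3.Request

object ModelManager {
    private const val MODEL_URL =
        "https://huggingface.co/sai8151/gemma-4-e2b-it-litertlm/resolve/main/gemma-4-e2b-it.litertlm"
    private val client = OkHttpClient()

    fun getModelFile(context: Context): File =
        File(context.filesDir, "gemma-4-e2b-it.litertlm")

    // OkHttp's execute() blocks, so the whole download runs on Dispatchers.IO.
    // Note: onProgress is therefore invoked on an IO thread, not the main thread.
    suspend fun downloadModel(context: Context, onProgress: (Int) -> Unit) =
        withContext(Dispatchers.IO) {
            val request = Request.Builder().url(MODEL_URL).build()
            // use {} closes the response (and its body) even if the copy fails.
            client.newCall(request).execute().use { response ->
                if (!response.isSuccessful) throw IOException("Download failed: HTTP ${response.code}")
                val body = response.body ?: throw IOException("Download failed: empty body")
                val total = body.contentLength() // -1 when the server sends no length
                getModelFile(context).outputStream().use { output ->
                    val input = body.byteStream()
                    val buffer = ByteArray(8192)
                    var downloaded = 0L
                    var read: Int
                    while (input.read(buffer).also { read = it } != -1) {
                        output.write(buffer, 0, read)
                        downloaded += read
                        // Report a percentage only when the total size is known.
                        if (total > 0) onProgress(((downloaded * 100) / total).toInt())
                    }
                }
            }
        }
}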
We validate the download: if the file is under 1GB (indicating a partial or corrupt download), we delete it and force a re-download.
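The size check described above could look roughly like the sketch below. The function name `validateModelFile` and the constant `MIN_MODEL_BYTES` are hypothetical names chosen for illustration; only the 1GB threshold comes from the text.

```kotlin
import java.io.File

// Assumed threshold from the article: anything under 1 GB is treated as incomplete.
const val MIN_MODEL_BYTES = 1_000_000_000L

/**
 * Returns true if the file exists and is at least minBytes long.
 * Otherwise deletes the file (if present) and returns false,
 * so the caller knows it has to download the model again.
 */
fun validateModelFile(file: File, minBytes: Long = MIN_MODEL_BYTES): Boolean {
    if (file.exists() && file.length() >= minBytes) return true
    if (file.exists()) file.delete()
    return false
}
```

A caller would run this before loading the model and trigger the download again when it returns false. A size threshold only catches truncated files; comparing against the server's reported length or a checksum would be a stricter check, if one is available.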
LiteRT Integration
Google's LiteRT LM provides an Engine and Conversation API for running LLMs on-device.
val backend = if (useGpu) Backend.GPU() else Backend.CPU()
val config = EngineConfig(
    modelPath = modelPath,
    backend = backend,
    visionBackend = Backend.CPU(),
    cacheDir = context.cacheDir.absolutePath
)
val engine = Engine(config)
engine.initialize()

val conversation = engine.createConversation(
    ConversationConfig(
        systemInstruction = Contents.of(systemPrompt),
        samplerConfig = SamplerConfig(
            topK = topK,
            topP = topP.toDouble(),
            temperature = temperature.toDouble()
        )
    )
)
The sendMessage API is straightforward: you pass the prompt and get back a response. For streaming, you can use sendMessageAsync with a Flow collector. I also built a chatWithImage method that uses Message.of(Content.ImageBytes(imageBytes), Content.Text(userMessage)) for multimodal prompts.
Auto-Tuner: Adaptive Performance
Different devices have wildly different performance. I built an auto-tuner that adjusts model parameters based on real-time metrics:
object AutoTuner {
    fun tune(current: TunedConfig, metrics: PerfMetrics): TunedConfig {
        return when {
            // Very slow device: dial everything down
            metrics.tps < 2.0 || metrics.firstLatency > 4000 ->
                current.copy(
                    useGpu = false,
                    topK = (current.topK - 10).coerceAtLeast(20),
                    contextSize = (current.contextSize - 1).coerceAtLeast(2)
                )
            // Fast device: crank it up
            metrics.tps > 8.0 && metrics.firstLatency < 1500 ->
                current.copy(
                    topK = (current.topK + 5).coerceAtMost(80),
                    contextSize = (current.contextSize + 1).coerceAtMost(8)
                )
            else -> current
        }
    }
}
On my Realme device, I get ~6-8 tokens/sec with Q4 quantization, which is usable for chat, summarization, and quick Q&A. When throughput or first-token latency gets too bad, the tuner switches from the GPU backend to the CPU backend.
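The tuner depends on two measurements, tokens per second and first-token latency. The article doesn't show how they are collected, so the following is only a sketch of one way to derive them from timestamps. The `PerfMetrics` shape is assumed from the two fields the tuner reads, and `MetricsRecorder` is a hypothetical helper, not part of the app or of LiteRT LM.

```kotlin
// Assumed shape: only tps and firstLatency (in ms) appear in the article's tuner.
data class PerfMetrics(val tps: Double, val firstLatency: Long)

/** Collects token timestamps (in milliseconds) for one generation. */
class MetricsRecorder(private val startMs: Long) {
    private var firstTokenMs: Long? = null
    private var lastTokenMs: Long = startMs
    private var tokens = 0

    /** Call once per generated token with the current clock reading. */
    fun onToken(nowMs: Long) {
        if (firstTokenMs == null) firstTokenMs = nowMs
        lastTokenMs = nowMs
        tokens++
    }

    /**
     * firstLatency = time from request start to the first token.
     * tps = decode rate, counted over the intervals between the first and last token.
     */
    fun finish(): PerfMetrics {
        val first = firstTokenMs ?: return PerfMetrics(tps = 0.0, firstLatency = 0L)
        val decodeMs = lastTokenMs - first
        val tps = if (tokens > 1 && decodeMs > 0) (tokens - 1) * 1000.0 / decodeMs else 0.0
        return PerfMetrics(tps = tps, firstLatency = first - startMs)
    }
}
```

On Android the timestamps could come from `SystemClock.elapsedRealtime()`, with `onToken` called from the streaming callback. The resulting `PerfMetrics` would then be passed to `AutoTuner.tune` after each response.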
Multi-Provider Architecture
The app isn't limited to on-device. I defined a common AiClient interface:
interface AiClient {
    suspend fun chat(
        systemPrompt: String,
        history: List<Pair<String, String>>,
        userMessage: String,
        onToken: ((String) -> Unit)? = null
    ): Triple<String, String?, PerfMetrics>

    fun isLocal(): Boolean = false
}
This lets me swap between LiteRtClient, GeminiClient, ClaudeClient, and OpenAiClient transparently. The app falls back automatically: if the on-device model isn't downloaded, it offers cloud options, and if a cloud API fails, it moves on to the next provider.
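The fallback order can be illustrated with a small sketch. To keep it self-contained, it uses `ChatProvider`, a simplified synchronous stand-in for `AiClient`, and a hypothetical `chatWithFallback` helper; the app's real logic is not shown in this article and may differ.

```kotlin
// Simplified, synchronous stand-in for the article's AiClient (illustration only).
fun interface ChatProvider {
    fun chat(userMessage: String): String
}

/**
 * Tries each named provider in list order and returns (providerName, reply)
 * from the first one that succeeds. Throws if every provider fails.
 */
fun chatWithFallback(
    providers: List<Pair<String, ChatProvider>>,
    userMessage: String
): Pair<String, String> {
    val failures = mutableListOf<String>()
    for ((name, provider) in providers) {
        try {
            return name to provider.chat(userMessage)
        } catch (e: Exception) {
            // Remember why this provider failed, then try the next one.
            failures += "$name: ${e.message}"
        }
    }
    throw IllegalStateException("All providers failed: " + failures.joinToString("; "))
}
```

With the providers listed as LiteRT, Gemini, Claude, OpenAI, a missing local model would surface as an exception from the first entry and the request would move on to Gemini. In a suspending version built on `AiClient`, a `CancellationException` should be rethrown rather than treated as a provider failure.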
Clipboard Sync (The "URL AI" feature)
The app also polls a configurable URL endpoint, a feature originally built for clipboard sync. A companion Windows app (Flask + Cloudflare Tunnel) pushes clipboard content to a URL, and the Android app polls it periodically, feeding new content directly to the AI. This enables a workflow like:
- Copy text on Windows
- It appears in the Android app automatically
- The AI processes it on-device and returns results
No cloud dependency, no manual copy-paste, end-to-end private.
Performance Numbers
Quantization: Q4, SFP8, BF16
The auto-tuner starts with GPU + Q4, measures performance over the first few interactions, and adjusts accordingly.
What's Next
- LoRA fine-tuning directly on-device
- Function calling with on-device models
- Streaming TTS for voice responses
- RAG with local embeddings (FAISS on-device)
The full source code is available on GitHub (https://github.com/sai5880), and the app is published as Sentinel Link on Google Play (alpha testing). If you're interested in testing it, get in touch.
If you're building on-device AI for Android, LiteRT LM is, in my experience, the most mature option right now. The ecosystem is evolving fast: Gemma 4, Llama 3.2, and Phi-3 all have on-device variants worth exploring.