DEV Community

B Sai Kiran Reddy

Running Gemma 4 On-Device on Android: Building a Private AI Assistant with LiteRT

I built an Android app that runs Gemma 4 entirely on-device using Google's LiteRT framework: no cloud API calls, no data leaving your phone, no API costs. Here's how it works and what I learned.

Why On-Device AI?

Cloud AI is powerful, but it comes with downsides: latency, privacy concerns, rate limits, and recurring API costs. Running a model locally addresses all of these: your data stays on the device, responses need no network round trip, and there's no per-request cost.

The tradeoff is compute. A modern flagship phone can run a 2-3B parameter model comfortably, but you need the right quantization and engine.

The Stack

  • Model: Gemma 4 E2B (Google's latest edge-optimized model)
  • Engine: LiteRT LM (Google's on-device LLM runtime)
  • Quantization: Q4 (3.2GB), SFP8 (4.6GB), BF16 (9.6GB)
  • Language: Kotlin
  • Backend: Multi-provider fallback (LiteRT → Gemini → Claude → OpenAI)

Model Download & Management

The model is hosted on Hugging Face and downloaded on first launch:

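import android.content.Context
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import okhttp3.OkHttpClient
import okhttp3.Request
import java.io.File
import java.io.IOException

object ModelManager {
    private const val MODEL_URL =
        "https://huggingface.co/sai8151/gemma-4-e2b-it-litertlm/resolve/main/gemma-4-e2b-it.litertlm"

    // One shared client so connections and threads are reused across calls.
    private val client = OkHttpClient()

    fun getModelFile(context: Context): File =
        File(context.filesDir, "gemma-4-e2b-it.litertlm")

    // Blocking network and file I/O runs on Dispatchers.IO, so onProgress is
    // also invoked there; switch to the main thread before touching UI.
    suspend fun downloadModel(context: Context, onProgress: (Int) -> Unit) {
        withContext(Dispatchers.IO) {
            val request = Request.Builder().url(MODEL_URL).build()
            // use {} closes the response and its body even if the copy fails.
            client.newCall(request).execute().use { response ->
                if (!response.isSuccessful) throw IOException("Download failed: HTTP ${response.code}")
                val body = response.body ?: throw IOException("Download failed: empty body")
                val total = body.contentLength() // -1 if the server sends no Content-Length
                var lastPercent = -1

                getModelFile(context).outputStream().use { output ->
                    val input = body.byteStream()
                    val buffer = ByteArray(8192)
                    var downloaded = 0L
                    var read: Int
                    while (input.read(buffer).also { read = it } != -1) {
                        output.write(buffer, 0, read)
                        downloaded += read
                        // Report only when the total is known and the percentage changed.
                        if (total > 0) {
                            val percent = ((downloaded * 100) / total).toInt()
                            if (percent != lastPercent) {
                                lastPercent = percent
                                onProgress(percent)
                            }
                        }
                    }
                }
            }
        }
    }
}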

We validate the download: if the file is under 1GB (indicating a partial or corrupt download), we delete it and force a re-download.
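A minimal sketch of that size check, assuming the 1GB threshold described above. The names `MIN_MODEL_BYTES` and `ensureValidModel` are hypothetical and not taken from the app's source; the threshold is a parameter so the check can be exercised without a multi-gigabyte file.

```kotlin
import java.io.File

// Assumed threshold from the article: anything under 1 GB is treated as partial or corrupt.
const val MIN_MODEL_BYTES: Long = 1L * 1024 * 1024 * 1024

/**
 * Returns true if the file exists and is at least minBytes long.
 * Otherwise deletes any partial file and returns false so the caller can re-download.
 */
fun ensureValidModel(file: File, minBytes: Long = MIN_MODEL_BYTES): Boolean {
    if (file.exists() && file.length() >= minBytes) return true
    if (file.exists()) file.delete()
    return false
}
```

A caller would run `ensureValidModel(ModelManager.getModelFile(context))` before loading the engine and trigger `downloadModel` when it returns false. A size floor only catches truncated files; comparing against an expected size or checksum would be stricter, if one is available.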

LiteRT Integration

Google's LiteRT LM provides an Engine and Conversation API for running LLMs on-device.

// Main inference backend: GPU when enabled, otherwise CPU.
val backend = if (useGpu) Backend.GPU() else Backend.CPU()
val config = EngineConfig(
    modelPath = modelPath,
    backend = backend,
    visionBackend = Backend.CPU(), // vision backend stays pinned to CPU
    cacheDir = context.cacheDir.absolutePath
)
val engine = Engine(config)
engine.initialize() // loads the model from modelPath

// A conversation carries the system prompt and the sampling settings.
val conversation = engine.createConversation(
    ConversationConfig(
        systemInstruction = Contents.of(systemPrompt),
        samplerConfig = SamplerConfig(
            topK = topK,
            topP = topP.toDouble(),
            temperature = temperature.toDouble()
        )
    )
)

The sendMessage API is straightforward: you pass the prompt and get back a response. For streaming, you can use sendMessageAsync with a Flow collector. I also built a chatWithImage method that uses Message.of(Content.ImageBytes(imageBytes), Content.Text(userMessage)) for multimodal prompts.

Auto-Tuner: Adaptive Performance

Different devices have wildly different performance. I built an auto-tuner that adjusts model parameters based on real-time metrics:

object AutoTuner {
    fun tune(current: TunedConfig, metrics: PerfMetrics): TunedConfig {
        return when {
            // Very slow device — dial everything down
            metrics.tps < 2.0 || metrics.firstLatency > 4000 ->
                current.copy(
                    useGpu = false,
                    topK = (current.topK - 10).coerceAtLeast(20),
                    contextSize = (current.contextSize - 1).coerceAtLeast(2)
                )
            // Fast device — crank it up
            metrics.tps > 8.0 && metrics.firstLatency < 1500 ->
                current.copy(
                    topK = (current.topK + 5).coerceAtMost(80),
                    contextSize = (current.contextSize + 1).coerceAtMost(8)
                )
            else -> current
        }
    }
}

On my realme device, I get ~6-8 tokens/sec with Q4 quantization, which is usable for chat, summarization, and quick Q&A. The tuner switches the backend based on throughput and first-token latency, dropping from GPU to CPU when the device is too slow.
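The tuner snippet refers to `TunedConfig` and `PerfMetrics` without showing them. Here is a self-contained sketch of how the pieces could fit together. The field defaults, the meaning of `contextSize`, and the `measure` helper are my assumptions for illustration; only the thresholds and step sizes come from the snippet above.

```kotlin
// Hypothetical shapes for the types the tuner uses; field names follow the snippet.
data class TunedConfig(
    val useGpu: Boolean = true,
    val topK: Int = 40,        // assumed default
    val contextSize: Int = 4   // unit not stated in the article; assumed default
)

data class PerfMetrics(
    val tps: Double,           // decoded tokens per second
    val firstLatency: Long     // time to first token, in milliseconds
)

// Assumed helper: derives metrics from raw timings of one response.
fun measure(tokenCount: Int, firstTokenMs: Long, totalMs: Long): PerfMetrics {
    val decodeMs = (totalMs - firstTokenMs).coerceAtLeast(1L)
    return PerfMetrics(tps = tokenCount * 1000.0 / decodeMs, firstLatency = firstTokenMs)
}

// Same thresholds and step sizes as the article's AutoTuner.tune.
fun tune(current: TunedConfig, metrics: PerfMetrics): TunedConfig = when {
    // Slow device: drop to CPU and shrink sampling and context.
    metrics.tps < 2.0 || metrics.firstLatency > 4000 -> current.copy(
        useGpu = false,
        topK = (current.topK - 10).coerceAtLeast(20),
        contextSize = (current.contextSize - 1).coerceAtLeast(2)
    )
    // Fast device: widen sampling and context, up to the caps.
    metrics.tps > 8.0 && metrics.firstLatency < 1500 -> current.copy(
        topK = (current.topK + 5).coerceAtMost(80),
        contextSize = (current.contextSize + 1).coerceAtMost(8)
    )
    else -> current
}
```

Calling `tune(config, measure(tokenCount, firstTokenMs, totalMs))` after each response gives the next config. Two things stand out in this form: anything between 2 and 8 tokens/sec leaves the config untouched, and nothing in the shown logic turns the GPU back on once it has been disabled.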

Multi-Provider Architecture

The app isn't limited to on-device. I defined a common AiClient interface:

interface AiClient {
    suspend fun chat(
        systemPrompt: String,
        history: List<Pair<String, String>>,
        userMessage: String,
        onToken: ((String) -> Unit)? = null
    ): Triple<String, String?, PerfMetrics>

    fun isLocal(): Boolean = false
}

This lets me swap between LiteRtClient, GeminiClient, ClaudeClient, and OpenAiClient transparently. The app falls back automatically: if the on-device model isn't downloaded, it offers cloud options. If a cloud API fails, it falls back further.
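The fallback idea can be sketched as an ordered list of providers that is tried until one answers. This is a simplified illustration, not the app's code: `SimpleAiClient`, `isAvailable`, and `FallbackClient` are hypothetical names, and the interface is synchronous and text-only to keep the example self-contained.

```kotlin
// Hypothetical, simplified stand-in for AiClient (synchronous, text only).
interface SimpleAiClient {
    val name: String
    fun isAvailable(): Boolean           // e.g. model downloaded, or API key configured
    fun chat(userMessage: String): String
}

class FallbackClient(private val providers: List<SimpleAiClient>) {
    // Tries each provider in order; returns the name of the one that answered and its reply.
    fun chat(userMessage: String): Pair<String, String> {
        val errors = mutableListOf<String>()
        for (provider in providers) {
            if (!provider.isAvailable()) {
                errors += "${provider.name}: unavailable"
                continue
            }
            try {
                return provider.name to provider.chat(userMessage)
            } catch (e: Exception) {
                // Remember the failure and move on to the next provider.
                errors += "${provider.name}: ${e.message}"
            }
        }
        throw IllegalStateException("All providers failed: " + errors.joinToString("; "))
    }
}
```

With providers listed in the order LiteRT, Gemini, Claude, OpenAI, the on-device model is always preferred when it is available. In a suspend version of this loop, `CancellationException` should be rethrown instead of treated as a provider failure, so cancelling the request doesn't trigger a fallback.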

Clipboard Sync (The "URL AI" feature)

The app also polls a configurable URL endpoint, originally built for clipboard sync. A companion Windows app (Flask + Cloudflare Tunnel) pushes clipboard content to a URL, and the Android app polls it periodically, feeding new content directly to the AI. This enables a workflow like:

  1. Copy text on Windows
  2. It appears in the Android app automatically
  3. The AI processes it on-device and returns results

No cloud dependency, no manual copy-paste, end-to-end private.

Performance Numbers

Quantization: Q4, SFP8, BF16

The auto-tuner starts with GPU + Q4, measures performance over the first few interactions, and adjusts accordingly.

What's Next

  1. LoRA fine-tuning directly on-device
  2. Function calling with on-device models
  3. Streaming TTS for voice responses
  4. RAG with local embeddings (FAISS on-device)

The full source code is available on GitHub (https://github.com/sai5880), and the app is published as Sentinel Link on Google Play (alpha testing). If you're interested in testing it, get in touch.

If you're building on-device AI for Android, LiteRT LM is the most mature option right now. The ecosystem is evolving fast: Gemma 4, Llama 3.2, and Phi-3 all have on-device variants worth exploring.
