The era of "please wait while we process your request" is dying. In the rapidly evolving landscape of Generative AI, user expectations have shifted from mere capability to instantaneous interaction. If you are building Android applications integrated with Large Language Models (LLMs), you’ve likely encountered the "latency wall." Waiting for a model to generate a 500-word response in one go can leave your UI frozen for several seconds, leading to a user experience that feels sluggish, dated, and frustrating.
The solution lies in Streaming. By leveraging Gemini Nano, Google’s on-device LLM, and the reactive power of Kotlin Flow, developers can transform a static, "chunky" response system into a fluid, token-by-token experience. In this comprehensive guide, we will dive deep into the architecture of AICore, the mechanics of on-device inference, and the production-ready patterns required to implement streaming text outputs in modern Android apps.
The Imperative of Streaming in On-Device GenAI
In traditional REST-based API interactions, we are accustomed to the Request-Response cycle. The client sends a prompt, the server processes it entirely, and the client receives the full response. While this works for fetching a user profile or a list of products, it is catastrophic for LLM-based UX.
LLMs generate text autoregressively—meaning they predict one token at a time. A 500-word response doesn't appear out of thin air; it is built piece by piece. If your app waits for the final token before displaying anything, the Time to First Token (TTFT) is effectively the same as the time to the last token.
Streaming solves this by emitting tokens as they are generated. This provides immediate visual feedback, reducing perceived latency. In the world of Android, this necessitates a shift from standard suspend functions returning a single String to using Flow<String>. This transformation turns your AI interaction into a reactive stream that breathes life into your UI.
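To make the contrast concrete, here is a minimal sketch of the two API shapes; TextGenerator and its function names are illustrative placeholders, not part of any real SDK.

import kotlinx.coroutines.flow.Flow

// Illustrative interface only -- the names below are placeholders, not a real SDK surface.
interface TextGenerator {
    // Request-response: the caller sees nothing until the final token has been decoded.
    suspend fun generateText(prompt: String): String

    // Streaming: each emission is a decoded chunk, so the UI can start rendering immediately.
    fun streamText(prompt: String): Flow<String>
}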
AICore: The Silent Engine Behind Gemini Nano
To implement streaming effectively, we must first understand the environment where the model lives. Google’s AICore is the system-level service responsible for managing Gemini Nano. Unlike traditional libraries that you bundle within your APK, AICore resides at the OS level. This architectural choice was driven by three critical constraints:
1. Binary Size and Distribution
Even a highly quantized LLM like Gemini Nano is massive, often weighing in at several hundred megabytes. If every AI-powered app—from your note-taker to your email client—bundled its own model, a user’s device storage would be depleted in minutes. AICore acts as a shared system provider. Much like the Android WebView or Google Play Services, AICore hosts the model once, allowing multiple applications to interface with it without duplicating the storage footprint.
2. The Hardware Abstraction Layer (HAL)
Running an LLM is a computationally expensive task that requires tight orchestration between the CPU, GPU, and the NPU (Neural Processing Unit). Every System on Chip (SoC) vendor—be it Qualcomm, MediaTek, or Google’s own Tensor—has different acceleration instructions.
AICore abstracts this complexity. Think of it as CameraX for AI. Just as CameraX provides a unified API regardless of whether a device has a single lens or a triple-camera setup, AICore provides a consistent interface for developers while handling the low-level driver optimizations for the specific NPU on the user's device.
3. Lifecycle and Resource Arbitration
LLMs are memory-hungry. If three different apps tried to load Gemini Nano into RAM simultaneously, the system would likely trigger an Out-Of-Memory (OOM) event. AICore acts as the arbiter, managing the model's residency in memory. It handles the "heavy lifting" of model initialization—a process conceptually similar to a Room database migration—ensuring that the model is loaded efficiently and released when the system is under memory pressure.
Connecting Modern Kotlin Features to the AI Pipeline
Implementing a streaming architecture requires more than just a basic understanding of coroutines. We need to leverage the most advanced features of Kotlin 2.x to create a pipeline that is both efficient and maintainable.
Kotlin Flow: The Backbone of Streaming
Flow is the natural choice for streaming text. Unlike LiveData, which is tied to the UI lifecycle and only holds the "latest" value, Flow is a cold asynchronous stream. It supports powerful operators for data transformation and, crucially, handles backpressure. When AICore emits a chunk of text, Flow allows us to pipe these events from the native layer up to the UI layer with minimal overhead.
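As a small illustration of how Flow operators compose over a token stream, the sketch below uses runningFold to turn raw token emissions into a progressively growing transcript; tokenStream stands in for whatever the AI layer produces (such as the repository shown later).

import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.runningFold

// Turn a stream of raw tokens into a stream of the accumulated transcript so far.
fun accumulateTranscript(tokenStream: Flow<String>): Flow<String> =
    tokenStream.runningFold("") { transcript, token -> transcript + token }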
Context Receivers for AI Scoping
In complex AI applications, many functions need access to an AiSession or a ModelConfiguration. Passing these as parameters to every function clutters the API, while using global singletons hinders testing. Kotlin’s Context Receivers (still an experimental feature that requires a compiler opt-in) allow us to define functions that require a certain context to be present in the calling scope.
interface AiSession {
    val modelName: String
    val temperature: Float
}

// This function can only be called within an AiSession context
context(AiSession)
fun generatePrompt(userInput: String): String {
    return "Using model $modelName with temp $temperature: $userInput"
}
Kotlinx Serialization for Structured Outputs
While plain text is often all we display, production features frequently require Structured Outputs (like JSON). Using kotlinx.serialization, we can parse streaming chunks. Since a stream arrives in fragments, a common approach is a "buffer-and-parse" strategy, where the Flow accumulates a string until a complete, valid JSON object can be decoded.
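Below is a minimal sketch of that buffer-and-parse strategy. It assumes the model is asked to emit exactly one JSON object matching a hypothetical Suggestion schema and simply retries decoding as chunks arrive; a production parser would need stricter completeness checks.

import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json

@Serializable
data class Suggestion(val title: String, val body: String)

// Accumulate raw chunks and emit once the buffer decodes as a complete JSON object.
fun Flow<String>.parseWhenComplete(
    json: Json = Json { ignoreUnknownKeys = true }
): Flow<Suggestion> = flow {
    val buffer = StringBuilder()
    collect { chunk ->
        buffer.append(chunk)
        // A failed decode just means the object is not complete yet -- keep buffering.
        runCatching { json.decodeFromString<Suggestion>(buffer.toString()) }
            .onSuccess { emit(it) }
    }
}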
The "Under the Hood" Execution Flow
When you call a streaming method in the Gemini Nano SDK, a sophisticated sequence of events occurs:
- The Request: The Kotlin wrapper invokes a JNI (Java Native Interface) call to the AICore C++ runtime.
- The Token Loop: The LLM begins its autoregressive process. It predicts the next token, appends it to the sequence, and feeds that sequence back into itself.
- The Bridge: As each token is generated, AICore pushes it into a native queue. The Kotlin layer receives a callback, which is then wrapped into a flow { ... } builder.
- The Dispatch: To prevent UI stuttering, the flow is shifted to Dispatchers.Default using .flowOn(). This ensures that string concatenation and token decoding don't block the Main thread.
Implementation Blueprint: Building the Stream
Let’s look at a production-ready implementation pattern using Hilt for Dependency Injection and Jetpack Compose for the UI.
1. Setting Up Dependencies
First, ensure your build.gradle.kts is equipped with the necessary modern libraries:
dependencies {
    // Coroutines and Flow
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-android:1.8.0")

    // Jetpack Compose and Lifecycle
    implementation("androidx.lifecycle:lifecycle-viewmodel-compose:2.8.0")
    implementation("androidx.lifecycle:lifecycle-runtime-compose:2.8.0")

    // Hilt for DI
    implementation("com.google.dagger:hilt-android:2.51")
    kapt("com.google.dagger:hilt-compiler:2.51")

    // MediaPipe LLM Inference (the engine for Gemini Nano)
    implementation("com.google.mediapipe:tasks-genai:0.10.14")
}
2. The Repository Layer
The Repository is responsible for interacting with the AI engine. It abstracts the complexity of the MediaPipe or AICore API and provides a clean Flow to the rest of the app.
@Singleton
class GeminiRepository @Inject constructor(
    @ApplicationContext private val context: Context
) {
    private var llmInference: LlmInference? = null

    fun initialize() {
        val options = LlmInference.LlmInferenceOptions.builder()
            .setModelPath("/data/local/tmp/gemini_nano.bin")
            .setTemperature(0.7f)
            .build()
        llmInference = LlmInference.createFromOptions(context, options)
    }

    fun streamResponse(prompt: String): Flow<String> = callbackFlow {
        val engine = llmInference ?: throw IllegalStateException("Model not ready")
        engine.generateResponseAsync(prompt) { partialResult, done ->
            trySend(partialResult)
            if (done) channel.close()
        }
        awaitClose { /* Handle cancellation */ }
    }.flowOn(Dispatchers.Default)
}
3. The ViewModel: Managing State
The ViewModel converts the cold Flow from the repository into a hot StateFlow that the UI can observe. It also manages the "is generating" state to toggle UI elements.
@HiltViewModel
class ChatViewModel @Inject constructor(
    private val repository: GeminiRepository
) : ViewModel() {

    private val _uiState = MutableStateFlow("")
    val uiState: StateFlow<String> = _uiState.asStateFlow()

    private val _isGenerating = MutableStateFlow(false)
    val isGenerating: StateFlow<Boolean> = _isGenerating.asStateFlow()

    fun sendPrompt(prompt: String) {
        viewModelScope.launch {
            _isGenerating.value = true
            _uiState.value = "" // Clear previous response

            repository.streamResponse(prompt)
                .catch { e -> _uiState.value = "Error: ${e.message}" }
                .collect { token ->
                    _uiState.value += token
                }

            _isGenerating.value = false
        }
    }
}
4. The UI Layer: Jetpack Compose
In the UI, we use collectAsStateWithLifecycle() to observe the stream. This is the modern standard for collecting flows in Compose, as it automatically manages the collection based on the lifecycle of the Composable.
@Composable
fun ChatScreen(viewModel: ChatViewModel = hiltViewModel()) {
    val response by viewModel.uiState.collectAsStateWithLifecycle()
    val isGenerating by viewModel.isGenerating.collectAsStateWithLifecycle()
    var inputText by remember { mutableStateOf("") }

    Column(modifier = Modifier.fillMaxSize().padding(16.dp)) {
        TextField(
            value = inputText,
            onValueChange = { inputText = it },
            modifier = Modifier.fillMaxWidth(),
            enabled = !isGenerating
        )
        Button(
            onClick = { viewModel.sendPrompt(inputText) },
            enabled = !isGenerating && inputText.isNotBlank()
        ) {
            Text(if (isGenerating) "Gemini is thinking..." else "Generate")
        }
        Text(
            text = response,
            modifier = Modifier.verticalScroll(rememberScrollState())
        )
    }
}
Advanced Optimization: Avoiding Common Pitfalls
While the above implementation works, production-grade applications require a higher level of scrutiny. Here are the critical areas where developers often stumble:
1. The Threading Trap
AI inference is computationally brutal. If you accidentally trigger the inference loop on the Main thread, your app will trigger an Application Not Responding (ANR) dialog. Always use .flowOn(Dispatchers.Default) to ensure the NPU-to-JVM bridge doesn't starve the UI thread.
2. The Garbage Collection (GC) Pressure
In Kotlin, strings are immutable. Every time you perform _uiState.value += token, you are creating a new String object and discarding the old one. For a 1,000-token response, this creates massive GC pressure, which can cause "micro-stutters" in the UI.
The Fix: For extremely long outputs, consider using a StringBuilder or a List<String> of tokens, and only update the UI state at specific intervals (e.g., every 5 tokens).
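Here is a sketch of what that can look like inside sendPrompt() from the ViewModel above, assuming the same _uiState and repository; the batch size of 5 is arbitrary.

// Inside viewModelScope.launch { ... } in sendPrompt():
val builder = StringBuilder()
var pending = 0

repository.streamResponse(prompt)
    .onCompletion { _uiState.value = builder.toString() } // flush whatever is left at the end
    .catch { e -> _uiState.value = "Error: ${e.message}" }
    .collect { token ->
        builder.append(token)
        if (++pending >= 5) {            // push to the UI every 5 tokens instead of every token
            _uiState.value = builder.toString()
            pending = 0
        }
    }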
3. Lifecycle Leaks
If a user starts a prompt and then immediately navigates away from the screen, the LLM will continue to churn in the background, wasting battery and NPU cycles. By using viewModelScope.launch, the coroutine is automatically cancelled when the ViewModel is cleared. However, you must ensure your Repository's awaitClose block properly signals the underlying engine to stop generation.
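As a sketch, the awaitClose placeholder in the repository above could be filled in like this; cancelGeneration() is a hypothetical stand-in for whatever stop/cancel hook your inference engine version actually exposes, so check the API you ship against.

awaitClose {
    // Runs when the collector's scope (e.g. viewModelScope) is cancelled.
    // cancelGeneration() is a hypothetical placeholder -- call the cancellation
    // hook your engine exposes so the NPU stops producing tokens.
    engine.cancelGeneration()
}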
4. Singleton Model Management
Never initialize your LLM engine inside a Composable or instantiate it ad hoc in every class that needs it. LLMs should be managed as Singletons via Hilt. Initializing a model takes time and memory; doing it multiple times will lead to OutOfMemoryError (OOM) crashes.
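One possible wiring, building on the @Singleton GeminiRepository above, is to initialize the model once from the Application class, off the main thread. GeminiApp and the application-scope coroutine are assumptions for illustration, not part of the article's code.

import android.app.Application
import dagger.hilt.android.HiltAndroidApp
import javax.inject.Inject
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.launch

@HiltAndroidApp
class GeminiApp : Application() {

    @Inject lateinit var geminiRepository: GeminiRepository

    // Process-wide scope so initialization is not tied to any single screen.
    private val appScope = CoroutineScope(SupervisorJob() + Dispatchers.Default)

    override fun onCreate() {
        super.onCreate()
        // Load the model exactly once, off the main thread; every screen then
        // reuses the same engine instance through Hilt's @Singleton scope.
        appScope.launch { geminiRepository.initialize() }
    }
}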
Summary of Design Decisions
| Feature | Decision | Why? |
|---|---|---|
| AICore | System Service | Reduces APK size; arbitrates shared memory across apps. |
| Gemini Nano | Quantized Model | Balances reasoning capability with on-device memory limits. |
| Kotlin Flow | Cold Stream | Allows for lazy execution and efficient cancellation via viewModelScope. |
| Dispatchers.Default | Off-main-thread | Prevents UI jank during high-frequency token emissions. |
| StateFlow | UI State Holder | Ensures the UI survives configuration changes (e.g., rotation) without restarting the stream. |
Conclusion: The Future is Reactive
Streaming text with Kotlin Flow isn't just a technical choice; it's a UX necessity. By moving away from the static Request-Response model and embracing the reactive nature of on-device GenAI, we create applications that feel alive. As AICore continues to evolve and Gemini Nano becomes available on more devices, the ability to build efficient, lifecycle-aware streaming pipelines will become a core competency for every Android developer.
The transition from "Loading..." to "Typing..." is where the magic happens. By mastering the integration of AICore, Coroutines, and Flow, you are not just writing code—you are crafting the next generation of human-computer interaction.
Let's Discuss
- How do you plan to handle "Structured Outputs" (like JSON) in a streaming context where the data might be incomplete for several seconds?
- With the move toward system-level AI providers like AICore, do you think developers will eventually stop bundling smaller TFLite models altogether? Why or why not?
The concepts and code demonstrated here are drawn directly from the comprehensive roadmap laid out in the ebook On-Device GenAI with Android Kotlin: Mastering Gemini Nano, AICore, and local LLM deployment using MediaPipe and Custom TFLite models. You can find it here: Leanpub.com
Check out the other programming & AI ebooks covering Python, TypeScript, C#, Swift, and Kotlin: Leanpub.com
Android Kotlin & AI Masterclass:
Book 1: On-Device GenAI. Mastering Gemini Nano, AICore, and local LLM deployment using MediaPipe and Custom TFLite models.
Book 2: Edge AI Performance. Optimizing hardware acceleration via NPU (Neural Processing Unit), GPU, and DSP. Advanced quantization and model pruning.
Book 3: Android AI Agents. Building autonomous apps that use Tool Calling, Function Injection, and Screen Awareness to perform tasks for the user.