
Programming Central

Originally published at programmingcentral.hashnode.dev

Stop the Low Memory Killer: Mastering Memory-Efficient RAG on Android with Gemini Nano

The dream of on-device Generative AI is finally a reality. With the release of Gemini Nano and Google’s AICore, Android developers can now build applications that summarize text, suggest smart replies, and answer complex queries without ever sending data to a cloud server. But as the saying goes, "With great power comes great memory pressure."

When you move from a basic LLM implementation to a Retrieval-Augmented Generation (RAG) architecture, you aren't just running a model; you are managing a complex pipeline of embeddings, vector databases, and dynamic context windows. On a mobile device, where the Android Low Memory Killer (LMK) lurks around every corner, an inefficient RAG implementation is a one-way ticket to a crashed application and a frustrated user.

In this deep dive, we will explore how to solve the "Memory Paradox" of on-device RAG, leverage the latest Kotlin 2.x features for AI orchestration, and implement an adaptive context window that keeps your app responsive even on mid-range hardware.

The Memory Paradox of On-Device RAG

Retrieval-Augmented Generation transforms a general-purpose LLM into a domain-specific expert. By providing the model with external data (like a user’s private notes or a company’s technical manual) at inference time, we drastically reduce hallucinations and increase utility.

However, RAG introduces a severe technical conflict. To make the model "smarter," we must feed it more context. In the world of LLMs, context equals tokens. In the world of Android, tokens equal RAM. This is the Memory Paradox: the more context you provide to ensure accuracy, the higher the likelihood that the system will terminate your app to reclaim memory.

In a standard GenAI flow, memory is dominated by model weights. In a RAG-enabled app, the footprint is split into three competing domains:

  1. The Model Weights: The static parameters of Gemini Nano (typically 4-bit or 8-bit quantized).
  2. The Vector Store: The indexed embeddings of your local documents, which must be searched and partially loaded.
  3. The KV Cache (Key-Value Cache): The dynamic "short-term memory" used by the transformer architecture to store previous tokens during a session.

Understanding how to balance these three pillars is the difference between a production-ready AI app and a research prototype that crashes on 8GB RAM devices.

The Architectural Shift: From App-Centric to System-Centric AI

Historically, if you wanted to run a model on Android, you bundled a .tflite file in your assets folder. This was "App-Centric AI." If five different apps each bundled a 2GB model, the device wasted 10GB of storage and potentially gigabytes of RAM.

Google’s AICore shifts this paradigm to "System-Centric AI." AICore is a system-level service that manages Gemini Nano. Instead of your app "owning" the model, it "requests" a session from the system.

Think of it like CameraX. You don't manage the raw camera hardware or handle the fragmented complexities of the Camera2 API directly; you manage a "capture session" through a consistent, lifecycle-aware interface. AICore does the same for AI. It abstracts the underlying hardware acceleration—whether it's the GPU, NPU, or TPU—and handles model versioning and updates. This centralisation is the first step in memory optimization, as it allows the OS to manage the model's lifecycle and RAM usage globally.
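
To make this session-oriented model concrete, here is a minimal sketch of the "request, use, release" pattern. AiSession, AiSessionProvider, and withAiSession are hypothetical names used purely for illustration, not the actual AICore or AI Edge SDK surface; later snippets in this article reuse the same AiSession abstraction.

import kotlinx.coroutines.flow.Flow

// Hypothetical abstraction over a system-managed Gemini Nano session.
// The app borrows a session from the system instead of owning the model.
interface AiSession : AutoCloseable {
    fun generateResponse(prompt: String): Flow<String>
}

// Hypothetical factory that asks the system service for a session.
interface AiSessionProvider {
    suspend fun openSession(): AiSession
}

// Scope an AI operation to the lifetime of a session and always release it.
suspend fun <T> AiSessionProvider.withAiSession(block: suspend AiSession.() -> T): T {
    val session = openSession()
    return try {
        session.block()
    } finally {
        session.close() // hand the memory back to the system-level service
    }
}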

Under the Hood: Where the Bytes Actually Go

To optimize RAG, we have to look at the three primary memory consumers during a generation cycle.

1. The KV Cache: The Silent RAM Eater

When Gemini Nano processes a prompt, it doesn't re-calculate every previous word for every new word it generates. It stores the "Keys" and "Values" of previous tokens in a KV Cache.

The problem is that the KV Cache grows linearly with the sequence length. In RAG, where we inject large chunks of retrieved text into the prompt, the KV Cache can balloon into hundreds of megabytes. To combat this, AICore employs PagedAttention. Much like how a modern OS manages virtual memory using pages, PagedAttention partitions the KV cache into non-contiguous blocks. This reduces fragmentation and allows for much larger context windows than traditional contiguous allocation would permit.
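To get a feel for how quickly that cache grows, here is a back-of-envelope sizing helper. The layer count, K/V width, and fp16 precision below are illustrative assumptions, not Gemini Nano's published architecture; the point is simply that the footprint scales linearly with the number of tokens in the prompt.

// Rough KV-cache sizing: 2 (one K and one V per token) * layers * seqLen * kvDim * bytes.
// All dimensions here are illustrative assumptions.
fun kvCacheBytes(
    seqLen: Int,
    numLayers: Int = 28,     // assumed transformer depth
    kvDim: Int = 256,        // assumed per-layer K/V width (grouped-query attention)
    bytesPerValue: Int = 2   // fp16
): Long = 2L * numLayers * seqLen * kvDim * bytesPerValue

fun main() {
    val mib = 1024 * 1024
    println("512-token prompt:   ${kvCacheBytes(512) / mib} MiB")   // ~14 MiB
    println("4,096-token prompt: ${kvCacheBytes(4096) / mib} MiB")  // ~112 MiB, 8x larger
}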

2. Quantization and the SRAM Limit

Gemini Nano doesn't use 32-bit floating-point numbers for its weights. That would be far too large for a mobile device. Instead, it uses 4-bit or 8-bit quantization. This reduces the memory footprint by 4x to 8x, allowing the model to fit into the limited SRAM of a mobile NPU (Neural Processing Unit).

While quantization introduces a small amount of "noise," RAG actually helps mitigate this. By providing factual, concrete context in the prompt, the model doesn't have to rely as heavily on the high-precision recall of its internal weights. The context acts as a "cheat sheet" that compensates for the lower precision of the model's "brain."
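For readers who have not worked with quantization directly, the sketch below shows the general idea behind 8-bit affine quantization: weights are mapped onto a byte range via a scale and zero point, and a small rounding error (the "noise" mentioned above) appears when they are mapped back. This is a generic illustration, not Gemini Nano's actual quantization scheme.

import kotlin.math.roundToInt

// Generic 8-bit affine quantization: map each float weight onto 0..255.
fun quantize8Bit(weights: FloatArray): Triple<ByteArray, Float, Int> {
    val min = weights.min()
    val max = weights.max()
    val scale = ((max - min) / 255f).takeIf { it > 0f } ?: 1f
    val zeroPoint = (-min / scale).roundToInt().coerceIn(0, 255)
    val quantized = ByteArray(weights.size) { i ->
        ((weights[i] / scale).roundToInt() + zeroPoint).coerceIn(0, 255).toByte()
    }
    return Triple(quantized, scale, zeroPoint) // 1 byte per weight instead of 4
}

// Reversing the mapping reintroduces a small rounding error, the "noise"
// that retrieved context helps compensate for.
fun dequantize(q: Byte, scale: Float, zeroPoint: Int): Float =
    ((q.toInt() and 0xFF) - zeroPoint) * scale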

3. The Vector Store Overhead

RAG requires converting text into embeddings—mathematical vectors. These are typically Float32 arrays. If you have 10,000 document chunks with 768 dimensions each, you’re looking at roughly 30MB of data. While that sounds small, searching through them requires loading them into RAM and performing high-speed math.

Treating a vector index like a static singleton is a recipe for disaster. Instead, treat it with the same care as a Room database. If you load a massive index on the main thread, you get an ANR (Application Not Responding). If you load it all at once without pagination, you get a memory spike that triggers the LMK.
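A minimal sketch of that paginated approach: embeddings are read in batches on a background dispatcher and scored incrementally, so only the current top results stay on the heap. EmbeddingDao and ScoredChunk are hypothetical names (e.g. backed by a Room DAO), not a real library API.

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import kotlin.math.sqrt

// Hypothetical chunk score holder and data-access interface.
data class ScoredChunk(val id: Long, val score: Float)

interface EmbeddingDao {
    suspend fun count(): Int
    suspend fun loadBatch(offset: Int, limit: Int): List<Pair<Long, FloatArray>> // (chunkId, embedding)
}

// Score the index page by page so the full embedding set never sits on the heap at once.
suspend fun searchPaged(
    dao: EmbeddingDao,
    query: FloatArray,
    topK: Int,
    batchSize: Int = 512
): List<ScoredChunk> = withContext(Dispatchers.Default) {
    val best = ArrayList<ScoredChunk>(topK + batchSize)
    var offset = 0
    val total = dao.count()
    while (offset < total) {
        for ((id, embedding) in dao.loadBatch(offset, batchSize)) {
            best += ScoredChunk(id, cosineSimilarity(query, embedding))
        }
        // Keep only the current winners; the rest of the batch becomes garbage.
        best.sortByDescending { it.score }
        while (best.size > topK) best.removeAt(best.size - 1)
        offset += batchSize
    }
    best
}

private fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var normA = 0f; var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]; normA += a[i] * a[i]; normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}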

Connecting Modern Kotlin to AI Memory Management

Kotlin 2.x provides a sophisticated toolset for managing the multi-stage RAG pipeline (Query -> Embedding -> Search -> Augment -> Generate).

Asynchronous Orchestration with Flow

RAG is inherently a streaming process. Using Flow, we can stream the results of the vector search and the LLM response. This ensures we never hold the entire augmented prompt and the entire generated response in memory as massive strings simultaneously.
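As a minimal sketch, assuming the hypothetical AiSession from earlier emits partial text chunks, the response can be surfaced to the UI incrementally via a StateFlow rather than assembled into one detached string first:

import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow
import kotlinx.coroutines.flow.collect

// Collect the generation Flow chunk by chunk and expose it to the UI as it arrives.
class StreamingAnswerHolder {
    private val _answer = MutableStateFlow("")
    val answer: StateFlow<String> = _answer.asStateFlow()

    suspend fun collectAnswer(session: AiSession, prompt: String) {
        _answer.value = ""
        session.generateResponse(prompt).collect { chunk ->
            _answer.value += chunk // the UI renders each partial update immediately
        }
    }
}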

Context Receivers for AI Scoping

One of the most powerful (and still experimental) features in Kotlin 2.x is Context Receivers. They allow us to define functions that require a specific context—like an active AiSession—without polluting every function signature with extra parameters. This is perfect for ensuring that AI operations only occur within a valid, memory-managed session.

// Example of using Context Receivers for AI Scoping
context(AiSession)
suspend fun performRAGQuery(userQuery: String, vectorDb: VectorDatabase): String {
    // 1. Retrieve relevant context from Vector DB
    val context = vectorDb.search(userQuery, limit = 3)

    // 2. Augment the prompt
    val augmentedPrompt = "Context: $context\n\nQuestion: $userQuery"

    // 3. Use the session from the context receiver to generate
    // generateResponse is a member of AiSession
    return generateResponse(augmentedPrompt).toList().joinToString("")
}
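
A hypothetical call site, tying this back to the withAiSession helper sketched earlier: the lambda's AiSession receiver satisfies the context requirement, so performRAGQuery can only be invoked while a session is alive.

// Hypothetical usage: the session receiver provided by withAiSession
// fulfils the context(AiSession) requirement of performRAGQuery.
suspend fun answerQuestion(
    provider: AiSessionProvider,
    vectorDb: VectorDatabase,
    userQuery: String
): String = provider.withAiSession { performRAGQuery(userQuery, vectorDb) }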

Implementation: Building a Memory-Aware RAG Orchestrator

Let’s look at a production-ready implementation. This example uses a MemoryPressureMonitor to sense the device's state and adjust the RAG "Top-K" (the number of documents retrieved) dynamically.

1. The Memory Pressure Monitor

First, we need a way to tell the app how much RAM is left.

sealed class MemoryPressure {
    object Optimal : MemoryPressure()    // High RAM: Maximize context
    object Warning : MemoryPressure()    // Moderate RAM: Truncate context
    object Critical : MemoryPressure()   // Low RAM: Minimal context
}

@Singleton
class MemoryPressureMonitor @Inject constructor(
    @ApplicationContext private val context: Context
) {
    private val activityManager = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager

    fun getCurrentPressure(): MemoryPressure {
        val memoryInfo = ActivityManager.MemoryInfo()
        activityManager.getMemoryInfo(memoryInfo)

        val availablePercent = memoryInfo.availMem.toDouble() / memoryInfo.totalMem.toDouble()

        return when {
            availablePercent > 0.30 -> MemoryPressure.Optimal
            availablePercent > 0.15 -> MemoryPressure.Warning
            else -> MemoryPressure.Critical
        }
    }
}

2. The RAG Repository

The repository handles the heavy lifting of vector math. Note the use of withContext(Dispatchers.Default) to ensure we don't freeze the UI during the cosine similarity calculations.

data class DocumentChunk(val text: String, val embedding: FloatArray)

class RAGRepository @Inject constructor(
    private val memoryMonitor: MemoryPressureMonitor
) {
    private val knowledgeBase: List<DocumentChunk> = listOf(/* ... your document chunks ... */)

    suspend fun retrieveRelevantContext(queryEmbedding: FloatArray): String = withContext(Dispatchers.Default) {
        val pressure = memoryMonitor.getCurrentPressure()

        // Adaptive Top-K: Adjust retrieval depth based on RAM
        val topK = when (pressure) {
            is MemoryPressure.Optimal -> 3
            is MemoryPressure.Warning -> 2
            is MemoryPressure.Critical -> 1
        }

        knowledgeBase
            .map { chunk -> chunk to cosineSimilarity(queryEmbedding, chunk.embedding) }
            .sortedByDescending { it.second }
            .take(topK)
            .joinToString("\n") { it.first.text }
    }

    private fun cosineSimilarity(vecA: FloatArray, vecB: FloatArray): Float {
        // High-performance floating point math
        var dotProduct = 0.0f
        var normA = 0.0f
        var normB = 0.0f
        for (i in vecA.indices) {
            dotProduct += vecA[i] * vecB[i]
            normA += vecA[i] * vecA[i]
            normB += vecB[i] * vecB[i]
        }
        return dotProduct / (sqrt(normA) * sqrt(normB))
    }
}

3. The ViewModel Orchestrator

The ViewModel ties it all together, ensuring that we handle the "Augmentation" phase without creating massive string overhead.

@HiltViewModel
class RAGViewModel @Inject constructor(
    private val repository: RAGRepository
) : ViewModel() {

    private val _uiState = MutableStateFlow<RAGUiState>(RAGUiState.Idle)
    val uiState: StateFlow<RAGUiState> = _uiState.asStateFlow()

    fun askQuestion(userQuery: String) {
        viewModelScope.launch {
            _uiState.value = RAGUiState.Loading

            try {
                // 1. Embedding Phase (Simulated)
                val queryEmbedding = floatArrayOf(0.12f, 0.75f, 0.22f) 

                // 2. Retrieval Phase
                val context = repository.retrieveRelevantContext(queryEmbedding)

                // 3. Augmentation Phase with Truncation
                val augmentedPrompt = buildPrompt(userQuery, context)

                // 4. Generation Phase
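                // (generateResponse is assumed to delegate to the active Gemini Nano
                //  session, e.g. via the Context Receivers example shown earlier.)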
                val response = generateResponse(augmentedPrompt)

                _uiState.value = RAGUiState.Success(response)
            } catch (e: Exception) {
                _uiState.value = RAGUiState.Error(e.localizedMessage ?: "Unknown Error")
            }
        }
    }

    private fun buildPrompt(query: String, context: String): String {
        // Memory Optimization: Use StringBuilder and hard limits
        return StringBuilder().apply {
            append("Context: ${context.take(1000)}\n\n") 
            append("Question: $query\n\n")
            append("Answer concisely:")
        }.toString()
    }
}
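
The snippet above references a RAGUiState type that is not shown; a minimal definition consistent with how the ViewModel uses it might look like this:

sealed interface RAGUiState {
    data object Idle : RAGUiState
    data object Loading : RAGUiState
    data class Success(val answer: String) : RAGUiState
    data class Error(val message: String) : RAGUiState
}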

Critical Best Practices for On-Device AI

Never Skip the close() Method

This is the single most common cause of native memory leaks in Android AI apps. LLM models and TFLite interpreters reside in native memory (C++). The JVM Garbage Collector has no visibility into this heap. If you don't manually call llmInference.close() in your ViewModel's onCleared() method, that memory is lost until the OS kills your process.
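A minimal sketch of that pattern, assuming a MediaPipe LlmInference instance is injected into the ViewModel (construction and injection details omitted):

import androidx.lifecycle.ViewModel
import com.google.mediapipe.tasks.genai.llminference.LlmInference

class SummarizerViewModel(
    private val llmInference: LlmInference // lives in native (C++) memory
) : ViewModel() {

    override fun onCleared() {
        // The JVM GC cannot reclaim the native engine; release it explicitly.
        llmInference.close()
        super.onCleared()
    }
}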

Beware of the "Context Window" Limit

Every model has a hard limit on tokens (e.g., 2048 or 4096). If your RAG system retrieves a massive document, you might exceed this limit. This doesn't just result in poor answers; it can cause the underlying TFLite engine to throw a native exception and crash the app. Always truncate your retrieved context before sending it to the model.
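A simple guard is to enforce a character budget derived from the token limit before building the prompt. The four-characters-per-token ratio below is a rough heuristic, not a real tokenizer; if the runtime exposes a token counter, prefer that.

// Crude protection against blowing the context window.
fun fitToContextWindow(
    retrievedContext: String,
    maxTokens: Int = 2048,
    reservedForQueryAndAnswer: Int = 512,
    approxCharsPerToken: Int = 4 // rough heuristic, not a tokenizer
): String {
    val budgetChars = ((maxTokens - reservedForQueryAndAnswer) * approxCharsPerToken).coerceAtLeast(0)
    return if (retrievedContext.length <= budgetChars) retrievedContext
           else retrievedContext.take(budgetChars)
}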

Use Binary Serialization

When moving embeddings between your database and the model, avoid JSON. Parsing a large JSON array of floats creates thousands of short-lived String and Double objects, triggering frequent GC cycles and UI "jank." Use kotlinx.serialization with a binary format like ProtoBuf or a custom FloatArray serializer to keep the heap clean.
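As a minimal sketch of the "custom FloatArray serializer" option, a plain ByteBuffer round-trip stores each embedding as raw 4-byte floats with no intermediate String or Double objects:

import java.nio.ByteBuffer
import java.nio.ByteOrder

// Dependency-free binary round-trip for embeddings: 4 bytes per float, no GC churn.
fun FloatArray.toBytes(): ByteArray {
    val buffer = ByteBuffer.allocate(size * Float.SIZE_BYTES).order(ByteOrder.LITTLE_ENDIAN)
    buffer.asFloatBuffer().put(this)
    return buffer.array()
}

fun ByteArray.toFloatArray(): FloatArray {
    val floats = FloatArray(size / Float.SIZE_BYTES)
    ByteBuffer.wrap(this).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(floats)
    return floats
}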

Summary of Design Decisions

| Feature | Design Decision | Why? |
| --- | --- | --- |
| AICore | System-level provider | Prevents redundant model weights; centralizes NPU orchestration. |
| Gemini Nano | 4-bit quantization | Fits the model into mobile SRAM; reduces power consumption. |
| KV Cache | PagedAttention | Prevents memory fragmentation during long context windows. |
| Flow/Coroutines | Reactive streams | Avoids blocking the UI thread; minimizes peak memory via streaming. |
| Adaptive Windowing | Dynamic Top-K | Scales retrieval depth based on real-time device RAM availability. |

Conclusion

Building RAG applications on Android is a balancing act. By treating the AI model not as a simple library, but as a system resource—much like the GPU or the Camera—you can build apps that are both intelligent and incredibly stable.

The key is to be proactive. Monitor your memory pressure, use structured concurrency to manage AI lifecycles, and always respect the native heap. As on-device hardware continues to evolve, these memory management patterns will become the foundation of the next generation of mobile software.

Let's Discuss

  1. How are you handling the trade-off between retrieval accuracy (Top-K) and app performance on lower-end Android devices?
  2. With the introduction of AICore, do you think we will see a move away from custom TFLite models in favor of standardized system-level LLMs?

Leave a comment below and let's build the future of on-device AI together!

The concepts and code demonstrated here are drawn directly from the comprehensive roadmap laid out in the ebook
On-Device GenAI with Android Kotlin: Mastering Gemini Nano, AICore, and local LLM deployment using MediaPipe and Custom TFLite models. You can find it here: Leanpub.com

Also check out the other programming & AI ebooks covering Python, TypeScript, C#, Swift, and Kotlin: Leanpub.com

Android Kotlin & AI Masterclass:
Book 1: On-Device GenAI. Mastering Gemini Nano, AICore, and local LLM deployment using MediaPipe and Custom TFLite models.
Book 2: Edge AI Performance. Optimizing hardware acceleration via NPU (Neural Processing Unit), GPU, and DSP. Advanced quantization and model pruning.
Book 3: Android AI Agents. Building autonomous apps that use Tool Calling, Function Injection, and Screen Awareness to perform tasks for the user.
