The era of "dumb" search is officially over. For decades, mobile developers relied on lexical matching—the simple process of checking if a specific string of characters existed within a database. If a user searched for "canine" but your database only contained the word "dog," the search failed. It was rigid, literal, and increasingly out of step with how humans actually communicate.
Enter Semantic Search. By shifting from keyword matching to conceptual matching, we allow applications to understand the intent and meaning behind a query. When you combine this with the power of Large Language Models (LLMs) like Gemini Nano, you unlock a new architectural pattern: Retrieval-Augmented Generation (RAG).
Even more revolutionary is the fact that we can now do this entirely on-device. No cloud latency, no massive API bills, and total user privacy. In this deep dive, we will explore the theoretical core of semantic search, the system-level architecture of Android’s AICore, and how to implement a production-grade context injection pipeline using Kotlin 2.x and MediaPipe.
The Theoretical Core of Semantic Search
At its most fundamental level, semantic search represents a paradigm shift. Instead of looking for character overlaps, we project text into a high-dimensional mathematical space. In this space, words with similar meanings are physically close to one another, regardless of their spelling.
Vector Embeddings: The Mathematical Foundation
The engine of semantic search is the Embedding Model. An embedding is a dense vector—essentially a long list of floating-point numbers—that represents the "essence" of a piece of text.
To visualize this, imagine a 3D space where one axis represents "Living Thing," another "Size," and a third "Domestication."
- The phrase "Golden Retriever" would be plotted at a specific coordinate (High Living, Medium Size, High Domestication).
- "Labrador" would be plotted very close to it.
- "Toaster" would be plotted in a completely different quadrant (Low Living, Small Size, Low Domestication).
In production pipelines using Gemini Nano or MediaPipe, these vectors aren't 3D; they often span 768 or 1024 dimensions. This high dimensionality allows the model to capture incredibly subtle nuances in language, such as tone, technical vs. casual register, and complex relationships between abstract concepts.
Measuring Meaning: Cosine Similarity
How do we determine if two vectors are "close"? In semantic search, we typically use Cosine Similarity. Rather than measuring the Euclidean distance (a straight line between two points), we measure the angle between two vectors.
- Angle = 0° (Cosine = 1): The meanings are identical.
- Angle = 90° (Cosine = 0): The concepts are orthogonal or unrelated.
- Angle = 180° (Cosine = -1): The concepts are diametrically opposed.
For on-device AI, we focus on the direction of the vector because it represents the "concept" regardless of the length of the text. Whether it's a short sentence or a long paragraph, if they discuss the same topic, their vectors will point in the same direction.
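Formally, the score is the cosine of that angle, which is exactly what the calculateSimilarity function later in this article computes:

$$
\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}}
$$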
The RAG Pipeline: Context Injection Explained
LLMs, including Gemini Nano, have a "knowledge cutoff." They only know what they were trained on. If you ask Gemini Nano about a private company policy or a user's personal notes from yesterday, it will hallucinate or admit ignorance.
Retrieval-Augmented Generation (RAG) solves this by injecting real-time, private, or specific data into the prompt at runtime. The pipeline follows a strict four-stage sequence (a minimal prompt-building sketch follows the list):
- Indexing: Your documents are broken into chunks, passed through an embedding model, and stored in a Vector Database.
- Retrieval: When a user asks a question, their query is embedded. The system performs a vector search to find the "Top-K" most relevant chunks from your database.
- Augmentation: The system constructs a final prompt: "Using the following context: [Retrieved Chunks], answer the user's question: [Query]."
- Generation: This "augmented" prompt is sent to Gemini Nano, which generates a response grounded in the provided facts.
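To make the Augmentation step concrete, here is a minimal sketch of a prompt builder. The Chunk type and the exact prompt wording are illustrative assumptions, not a fixed API:

```kotlin
// Illustrative holder for a retrieved chunk; not a library type.
data class Chunk(val text: String, val score: Float)

/**
 * Builds the augmented prompt for the Generation step.
 * The template wording is an assumption; adapt it to your own prompt strategy.
 */
fun buildAugmentedPrompt(query: String, topChunks: List<Chunk>): String {
    val retrievedContext = topChunks
        .sortedByDescending { it.score }               // most relevant chunk first
        .joinToString(separator = "\n---\n") { it.text }
    return """
        |Using the following context:
        |$retrievedContext
        |
        |Answer the user's question: $query
        |If the context does not contain the answer, say so instead of guessing.
    """.trimMargin()
}
```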
AICore and the System-Level AI Provider Architecture
Google’s implementation of AICore is a strategic masterpiece for the Android ecosystem. Rather than bundling a 2GB LLM into every single APK, AICore acts as a system-level service.
Why AICore Matters
If every app bundled its own version of Gemini Nano, the Android ecosystem would collapse under three major weights:
- Storage Bloat: Ten apps using Gemini Nano would consume 20GB of disk space. With AICore, they share one instance.
- VRAM Exhaustion: Loading multiple LLMs into the GPU or NPU (Neural Processing Unit) would trigger the Android Low Memory Killer (LMK) instantly. AICore manages the model lifecycle, ensuring only one instance occupies memory while serving multiple apps.
- Update Fragmentation: When Google improves the model, they update AICore via the Google Play Store. Developers don't need to push a new APK to give their users a better AI.
The CameraX Analogy: Think of AICore like CameraX. CameraX abstracts the fragmented hardware of various camera vendors into a unified API. Similarly, AICore abstracts the underlying NPU and GPU acceleration, providing a consistent interface for developers regardless of whether the user is on a Pixel, a Samsung, or a Xiaomi device.
The "Migration" Challenge
One critical detail for developers: updating a local vector index is similar to a Room database migration. If you upgrade your embedding model (e.g., moving from a small TFLite model to a larger one), the "coordinate system" of your vector space changes. A vector generated by Model A is meaningless to Model B. If you change models, you must re-embed and re-index every single document in your local store.
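One way to guard against silent mismatches is to persist the embedding model identifier next to the index and rebuild the index whenever it changes. A minimal sketch, assuming a hypothetical storage layer (loadMetadata, loadAllDocuments, and rebuildIndex are placeholders for your own Room or DataStore code):

```kotlin
// Hypothetical metadata persisted next to the vector index (e.g., in Room or DataStore).
data class IndexMetadata(val embeddingModelId: String, val dimensions: Int)

// Assumption: an identifier you assign to the currently bundled embedding model.
const val CURRENT_MODEL_ID = "universal_sentence_encoder_v1"

/**
 * Rebuilds the whole index if it was produced by a different embedding model.
 */
suspend fun ensureIndexCompatible(
    loadMetadata: suspend () -> IndexMetadata?,
    loadAllDocuments: suspend () -> List<String>,
    rebuildIndex: suspend (docs: List<String>, modelId: String) -> Unit
) {
    val metadata = loadMetadata()
    if (metadata == null || metadata.embeddingModelId != CURRENT_MODEL_ID) {
        // Vectors from the old model live in a different coordinate system: re-embed everything.
        rebuildIndex(loadAllDocuments(), CURRENT_MODEL_ID)
    }
}
```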
Mapping Kotlin 2.x Features to AI Pipelines
Implementing high-performance AI pipelines requires handling high-latency asynchronous operations and complex data structures. Modern Kotlin provides the ideal toolset for this.
1. Asynchronous Streams with Flow
Retrieval is not a single event; it’s a pipeline. We use Flow to stream chunks of data from the vector database to the LLM. This ensures the UI remains responsive even when the system is performing heavy mathematical calculations on the NPU.
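As a rough illustration, retrieval can be exposed as a cold Flow so scoring stays off the main thread and matches arrive incrementally. The StoredChunk type is hypothetical, and the similarity lambda would typically be the cosine function shown later in this article:

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOn

// Hypothetical stored chunk: text plus its pre-computed embedding.
data class StoredChunk(val text: String, val vector: FloatArray)

fun retrieveRelevantChunks(
    queryVector: FloatArray,
    storedChunks: List<StoredChunk>,
    similarity: (FloatArray, FloatArray) -> Float,
    threshold: Float = 0.5f
): Flow<StoredChunk> = flow {
    for (chunk in storedChunks) {
        // Emit each match as soon as it is scored instead of waiting for the full list.
        if (similarity(queryVector, chunk.vector) >= threshold) emit(chunk)
    }
}.flowOn(Dispatchers.Default)   // keep the vector math off the main thread
```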
2. Type-Safe Data with kotlinx.serialization
Vectors are essentially FloatArrays. To store these in a local database (like Room) or cache them, kotlinx.serialization allows us to transform these high-dimensional arrays into efficient binary formats without the overhead of traditional reflection-based serialization.
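A minimal sketch, assuming the kotlinx-serialization-protobuf artifact is on the classpath; the EmbeddedDocument entity is illustrative:

```kotlin
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromByteArray
import kotlinx.serialization.encodeToByteArray
import kotlinx.serialization.protobuf.ProtoBuf

// Illustrative entity: a document together with its embedding vector.
@Serializable
data class EmbeddedDocument(val id: Long, val text: String, val vector: FloatArray)

@OptIn(ExperimentalSerializationApi::class)
fun toBlob(document: EmbeddedDocument): ByteArray = ProtoBuf.encodeToByteArray(document)

@OptIn(ExperimentalSerializationApi::class)
fun fromBlob(blob: ByteArray): EmbeddedDocument = ProtoBuf.decodeFromByteArray(blob)
```

CBOR or plain JSON would work the same way; ProtoBuf is shown here only because it produces compact binary blobs that store well in a BLOB column.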
3. Scoped Environments with Context Receivers
AI operations require a specific environment: an AICoreClient, a CoroutineScope, and a ModelConfiguration. Instead of passing these as parameters to every function ("parameter drilling"), context receivers let us define functions that require those contexts to be present in the calling scope. Note that the feature is still experimental and its syntax is evolving in newer Kotlin releases.
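A minimal sketch under the experimental -Xcontext-receivers flag; AICoreClient and ModelConfiguration here are stand-in types, not real SDK classes:

```kotlin
// Illustrative stand-ins for the dependencies described above.
class AICoreClient {
    suspend fun generate(prompt: String): String = TODO("call the on-device model here")
}
data class ModelConfiguration(val maxInputChars: Int)

// Requires the experimental -Xcontext-receivers compiler flag; newer Kotlin versions
// are evolving this syntax into "context parameters".
context(AICoreClient, ModelConfiguration)
suspend fun answerWithContext(query: String, retrievedContext: String): String {
    // generate() and maxInputChars are resolved from the context receivers,
    // so they never need to be passed down as parameters.
    val prompt = "Context:\n$retrievedContext\n\nQuestion: $query"
    return generate(prompt.take(maxInputChars))
}

// Call site: the contexts are brought into scope with `with`.
suspend fun example(client: AICoreClient, config: ModelConfiguration): String =
    with(client) { with(config) { answerWithContext("What is our remote work policy?", "…") } }
```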
Implementation: A Production-Ready Semantic Search Example
Let’s look at how to build a "Local Knowledge Base" using MediaPipe for embeddings and Kotlin for the orchestration.
The Embedding Repository
This repository handles the heavy lifting of converting text to vectors and calculating similarity.
```kotlin
import android.content.Context
import com.google.mediapipe.tasks.core.BaseOptions
import com.google.mediapipe.tasks.core.Delegate
import com.google.mediapipe.tasks.text.textembedder.TextEmbedder
import dagger.hilt.android.qualifiers.ApplicationContext
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import javax.inject.Inject
import javax.inject.Singleton
import kotlin.math.sqrt

@Singleton
class EmbeddingRepository @Inject constructor(
    @ApplicationContext private val context: Context
) {
    // Initialize the MediaPipe TextEmbedder lazily so the model is only loaded on first use
    private val textEmbedder: TextEmbedder by lazy {
        val options = TextEmbedder.TextEmbedderOptions.builder()
            .setBaseOptions(
                BaseOptions.builder()
                    .setModelAssetPath("universal_sentence_encoder.tflite")
                    .setDelegate(Delegate.GPU)
                    .build()
            )
            .build()
        TextEmbedder.createFromOptions(context, options)
    }

    /**
     * Converts text into a semantic vector.
     * Runs on Dispatchers.Default to keep inference off the main thread.
     */
    suspend fun embedText(text: String): FloatArray = withContext(Dispatchers.Default) {
        val result = textEmbedder.embed(text)
        result.embeddingResult().embeddings()[0].floatEmbedding()
    }

    /**
     * Mathematical implementation of Cosine Similarity.
     */
    fun calculateSimilarity(vectorA: FloatArray, vectorB: FloatArray): Float {
        var dotProduct = 0.0f
        var normA = 0.0f
        var normB = 0.0f
        for (i in vectorA.indices) {
            dotProduct += vectorA[i] * vectorB[i]
            normA += vectorA[i] * vectorA[i]
            normB += vectorB[i] * vectorB[i]
        }
        return dotProduct / (sqrt(normA) * sqrt(normB))
    }
}
```
The ViewModel Orchestrator
The ViewModel manages the state and ensures that we aren't performing redundant calculations.
```kotlin
import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import dagger.hilt.android.lifecycle.HiltViewModel
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.asStateFlow
import kotlinx.coroutines.launch
import javax.inject.Inject

// UI state exposed by the ViewModel
sealed interface SearchUiState {
    data object Idle : SearchUiState
    data object Loading : SearchUiState
    data class Success(val content: String, val confidence: Float) : SearchUiState
}

@HiltViewModel
class SemanticSearchViewModel @Inject constructor(
    private val repository: EmbeddingRepository
) : ViewModel() {

    private val _uiState = MutableStateFlow<SearchUiState>(SearchUiState.Idle)
    val uiState = _uiState.asStateFlow()

    // Mock knowledge base
    private val localDocs = listOf(
        "Remote work is allowed up to 3 days per week.",
        "The annual bonus is paid out in the first week of March.",
        "Parking passes are available in the basement level B2."
    )

    fun onSearchClicked(query: String) {
        viewModelScope.launch {
            _uiState.value = SearchUiState.Loading
            val queryVector = repository.embedText(query)

            // In production, pre-calculate doc vectors and store them in Room!
            val bestMatch = localDocs.map { doc ->
                val docVector = repository.embedText(doc)
                doc to repository.calculateSimilarity(queryVector, docVector)
            }.maxByOrNull { it.second }

            _uiState.value = SearchUiState.Success(
                content = bestMatch?.first ?: "No relevant info found.",
                confidence = bestMatch?.second ?: 0f
            )
        }
    }
}
```
Under the Hood: Memory and Constraints
When designing these pipelines for Android, you cannot ignore the hardware. Unlike a cloud server with 80GB of H100 VRAM, a mid-range Android phone might only have 6GB of total RAM.
The Context Window
Gemini Nano has a finite Context Window (the number of tokens it can process at once). If your semantic search retrieves 10 long documents, you might exceed the token limit. This causes the model to "forget" the beginning of the prompt or simply fail.
The Ranking Strategy
To solve this, senior AI engineers use a multi-stage approach (a sketch follows the list):
- Coarse Retrieval: Use a fast, low-dimension vector search to get 50 candidates.
- Reranking: Use a more expensive "Cross-Encoder" model to pick the top 3-5 most relevant candidates.
- Trimming: Use a tokenizer to ensure the final prompt fits within the model's token limit (typically 4k or 8k for Gemini Nano).
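A rough sketch of that funnel; rerank stands in for whatever cross-encoder you ship, and the four-characters-per-token estimate is a deliberately naive substitute for a real tokenizer:

```kotlin
data class ScoredDoc(val text: String, val score: Float)

/**
 * Funnels coarse candidates down to a prompt-sized context.
 */
fun selectContext(
    candidates: List<ScoredDoc>,
    rerank: (List<ScoredDoc>) -> List<ScoredDoc>,   // placeholder for a cross-encoder reranker
    maxTokens: Int = 2_000
): List<ScoredDoc> {
    val coarse = candidates.sortedByDescending { it.score }.take(50)  // 1. coarse retrieval
    val reranked = rerank(coarse).take(5)                             // 2. reranking
    // 3. trimming: keep documents while the rough token estimate fits the budget
    val selected = mutableListOf<ScoredDoc>()
    var budget = maxTokens
    for (doc in reranked) {
        val estimatedTokens = doc.text.length / 4   // naive approximation, not a real tokenizer
        if (estimatedTokens > budget) break
        selected += doc
        budget -= estimatedTokens
    }
    return selected
}
```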
Common Pitfalls to Avoid
- Main Thread Inference: Never call embed() on the main thread. TFLite inference is a CPU-heavy operation that will trigger an ANR (Application Not Responding) error.
- Redundant Embeddings: In the code example above, we embed the documents every time a search is performed. Do not do this in production. Embed your knowledge base once, store the vectors in a database, and only embed the user's query at runtime.
- Model Quantization: Always use quantized models (INT8 or FP16). They are significantly smaller and faster on mobile hardware with negligible loss in accuracy for most RAG tasks.
The Future of On-Device Intelligence
We are moving toward a world where apps are no longer just interfaces for remote databases. With AICore and Gemini Nano, apps are becoming intelligent agents capable of understanding the user's local context without ever compromising their privacy.
By mastering semantic search and RAG pipelines, you aren't just building a better search bar—you are building the foundation for the next generation of "Local-First" AI applications. Whether it's an intelligent note-taking app that remembers everything you've written or a corporate tool that answers policy questions offline, the tools are now in your hands.
Let's Discuss
- How do you plan to handle vector database migrations when you decide to upgrade your embedding model in a live app?
- Given the memory constraints of mobile devices, do you think RAG will eventually replace fine-tuning for most on-device AI use cases?
Leave a comment below and let's build the future of Android AI together!
The concepts and code demonstrated here are drawn directly from the comprehensive roadmap laid out in the ebook
On-Device GenAI with Android Kotlin: Mastering Gemini Nano, AICore, and local LLM deployment using MediaPipe and Custom TFLite models. You can find it here: Leanpub.com
Also check out the other programming & AI ebooks covering Python, TypeScript, C#, Swift, and Kotlin: Leanpub.com
Android Kotlin & AI Masterclass:
Book 1: On-Device GenAI. Mastering Gemini Nano, AICore, and local LLM deployment using MediaPipe and Custom TFLite models.
Book 2: Edge AI Performance. Optimizing hardware acceleration via NPU (Neural Processing Unit), GPU, and DSP. Advanced quantization and model pruning.
Book 3: Android AI Agents. Building autonomous apps that use Tool Calling, Function Injection, and Screen Awareness to perform tasks for the user.