
Programming Central

Originally published at programmingcentral.hashnode.dev

Beyond SQL: How to Build a High-Performance On-Device Vector Search Engine for Android

In the traditional world of Android development, we’ve spent years perfecting the art of the exact match. We write SQL queries like SELECT * FROM users WHERE id = 5 or WHERE name LIKE '%Apple%'. This works perfectly for structured data, but it fails miserably when we try to interact with the messy, nuanced world of human language.

Imagine a user searching their notes app for "the feeling of a rainy afternoon in Kyoto." A traditional database would look for those exact words. If the user’s note actually said, "The petrichor filled the air as I walked through the Gion district under a gray sky," the search would return zero results.

The gap between what a user means and what a computer sees is the final frontier of mobile UX. To bridge this gap, we have to move away from discrete symbols—strings and integers—and into the world of continuous high-dimensional space. We need to build a Vector Search Repository.
(This article is based on the ebook On-Device GenAI with Android Kotlin)

The Theoretical Foundation: Translating Meaning into Geometry

At its core, a Vector Search Repository is not a traditional database; it is a geometric engine. To understand how it works, we must first master the concept of Embeddings.

1. The Concept of Embeddings

An embedding is a numerical representation of data—be it text, images, or audio—as a dense vector of floating-point numbers. When we "embed" a piece of text, we are essentially plotting it as a point in a space that might have 512, 768, or even thousands of dimensions.

If we represent the word "Apple" in a 3D space, it might look like [0.12, -0.59, 0.88]. In a production-grade model like Gemini Nano, these vectors are far more complex. Each dimension represents a latent feature of the data—features the model learned during training, such as "fruit-ness," "technology-ness," or "sentiment."

The Geometry of Meaning:
In this high-dimensional space, semantic similarity is equivalent to geometric proximity. If two pieces of text are conceptually similar, their corresponding vectors will be positioned close to one another.

  • Semantic Proximity: "The king's crown" and "The monarch's headpiece" will result in vectors that are nearly identical because they describe the same concept.
  • Semantic Distance: "The king's crown" and "A recipe for chocolate cake" will result in vectors that are geometrically distant because they share no conceptual overlap.

2. Similarity Metrics: How We Measure "Closeness"

Once we have transformed our data into vectors, we need a mathematical way to calculate the distance between them. In on-device AI development, we generally rely on three primary metrics:

A. Cosine Similarity
This is the gold standard for Natural Language Processing (NLP). Instead of measuring the straight-line distance between two points, it measures the angle between two vectors.

  • Why it matters: It ignores the magnitude (length) of the vector and focuses on the direction. This is critical because a short sentence and a long paragraph might discuss the same topic; their vectors will point in the same direction even if the paragraph’s vector is "longer."

B. Euclidean Distance (L2)
This measures the straight-line distance between two points in space. It is most effective when the magnitude of the vector is just as important as its direction.

C. Dot Product
A mathematical operation that combines magnitude and angle. When vectors have already been normalized to unit length, the dot product is equivalent to cosine similarity, which is why high-performance neural networks often use it for lightning-fast calculations.
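To make these three metrics concrete, here is a minimal, self-contained Kotlin sketch. The 3-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

import kotlin.math.sqrt

// Toy 3-dimensional vectors (illustrative values, not real embeddings).
val crown = floatArrayOf(0.82f, 0.10f, 0.55f)       // "The king's crown"
val headpiece = floatArrayOf(0.79f, 0.15f, 0.58f)   // "The monarch's headpiece"
val cakeRecipe = floatArrayOf(-0.30f, 0.91f, -0.25f) // "A recipe for chocolate cake"

fun dot(a: FloatArray, b: FloatArray): Float =
    a.indices.fold(0f) { acc, i -> acc + a[i] * b[i] }

fun magnitude(v: FloatArray): Float = sqrt(dot(v, v))

// Cosine similarity: angle only, magnitude ignored. Range [-1, 1].
fun cosine(a: FloatArray, b: FloatArray): Float =
    dot(a, b) / (magnitude(a) * magnitude(b))

// Euclidean (L2) distance: straight-line distance between the two points.
fun euclidean(a: FloatArray, b: FloatArray): Float =
    sqrt(a.indices.fold(0f) { acc, i -> acc + (a[i] - b[i]) * (a[i] - b[i]) })

fun main() {
    println(cosine(crown, headpiece))    // close to 1.0 -> similar concepts
    println(cosine(crown, cakeRecipe))   // much lower -> unrelated concepts
    println(euclidean(crown, headpiece)) // small distance -> close in space
}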

AICore: The System-Level Revolution

Google’s introduction of AICore marks a massive shift in how we handle AI on Android. In the past, if you wanted to run a Large Language Model (LLM) or an embedding engine, you had to bundle the model within your app. This was a disaster for resources. A single model can take up gigabytes of RAM and drain the battery in minutes.

The Shared Provider Model

Just as CameraX abstracts fragmented camera hardware into a unified API, AICore acts as a system-level service that abstracts AI hardware (NPUs and TPUs).

  • Centralized Model Management: AICore manages the lifecycle of models like Gemini Nano. It handles the heavy lifting of downloading, updating, and loading models into the NPU.
  • Resource Arbitration: It ensures that multiple apps aren't fighting for the NPU simultaneously, managing the "scheduling" of AI inference tasks so the device stays responsive.
  • Privacy First: The data never leaves the device. AICore provides the interface for the app to send a prompt and receive a vector without any cloud round-trips.

Think of the transition from local app-specific models to AICore as similar to moving from raw SQLite cursors to Room. AICore is the "Room" for LLMs; it handles the "migration" of model weights and the "threading" of hardware acceleration.

Mapping AI Concepts to Modern Kotlin 2.x

Building a Vector Repository requires a bridge between the asynchronous, heavy-compute nature of AI and the reactive nature of the Android UI. Kotlin 2.x provides the perfect toolset for this.

  1. Coroutines and Structured Concurrency: Generating embeddings is a blocking, CPU/NPU-intensive operation. We utilize Dispatchers.Default for mathematical calculations to ensure we don't freeze the Main thread.
  2. Kotlin Flow for Streaming Results: Vector search often involves "Top-K" retrieval (e.g., "Give me the 5 most similar results"). As the repository scans the vector space, we can use Flow to stream results back to the UI as they are found, rather than waiting for the entire search to complete (see the sketch after this list).
  3. Context Parameters: Kotlin 2.x replaces the experimental context receivers with context parameters. Either mechanism lets any function performing a vector search access the EmbeddingEngine without it being passed explicitly as a parameter every time, leading to cleaner, more maintainable code.
  4. Serialization for Persistence: Vectors are essentially FloatArrays. To store these in a local database, we use kotlinx.serialization to efficiently encode these arrays into binary formats like ProtoBuf, minimizing disk I/O.
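As a minimal sketch of point 2, here is how a search could be exposed as a Flow that emits each scored item as it is computed. The embed and vectorStore parameters stand in for the repository members built later in this article:

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOn

// Hypothetical streaming variant of the search() function shown later.
fun searchAsFlow(
    query: String,
    embed: (String) -> FloatArray,
    vectorStore: List<Pair<String, FloatArray>>,
    similarity: (FloatArray, FloatArray) -> Float
): Flow<Pair<String, Float>> = flow {
    val queryVector = embed(query)
    for ((text, embedding) in vectorStore) {
        // Emit each scored item as soon as it is computed,
        // so the UI can render partial results immediately.
        emit(text to similarity(queryVector, embedding))
    }
}.flowOn(Dispatchers.Default) // keep the math off the Main thread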

Step-by-Step Implementation Guide

Let’s build a "Knowledge Base" where we store facts and search through them semantically using the MediaPipe Text Embedder.

1. Gradle Dependencies

First, we need to bring in the MediaPipe tasks and Hilt for dependency injection.

dependencies {
    // MediaPipe Text tasks for embedding generation
    implementation("com.google.mediapipe:tasks-text:0.10.14")

    // Jetpack Compose & Lifecycle
    implementation("androidx.lifecycle:lifecycle-viewmodel-ktx:2.7.0")
    implementation("androidx.lifecycle:lifecycle-runtime-compose:2.7.0")

    // Hilt for Dependency Injection
    implementation("com.google.dagger:hilt-android:2.50")
    kapt("com.google.dagger:hilt-compiler:2.50")
    // Note: with Kotlin 2.x you can use KSP instead of kapt:
    // ksp("com.google.dagger:hilt-compiler:2.50")
}
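Note: the repository below loads universal_sentence_encoder.tflite from the app's assets, so the model file must be copied into src/main/assets/ first. MediaPipe's Text Embedder documentation links a downloadable Universal Sentence Encoder model for exactly this purpose.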

2. The Vector Repository

The repository handles the "heavy lifting" of AI inference and the vector math required for Cosine Similarity.

import android.content.Context
import com.google.mediapipe.tasks.core.BaseOptions
import com.google.mediapipe.tasks.text.textembedder.TextEmbedder
import com.google.mediapipe.tasks.text.textembedder.TextEmbedder.TextEmbedderOptions
import com.google.mediapipe.tasks.text.textembedder.TextEmbedderResult
import dagger.hilt.android.qualifiers.ApplicationContext
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import javax.inject.Inject
import javax.inject.Singleton
import kotlin.math.sqrt

// A text snippet together with its embedding vector.
data class VectorItem(val text: String, val embedding: FloatArray)

@Singleton
class VectorSearchRepository @Inject constructor(
    @ApplicationContext context: Context // Hilt requires the qualifier to inject a Context
) {

    private val textEmbedder: TextEmbedder = TextEmbedder.createFromOptions(
        context,
        TextEmbedderOptions.builder()
            .setBaseOptions(
                BaseOptions.builder()
                    .setModelAssetPath("universal_sentence_encoder.tflite")
                    .build()
            )
            .build()
    )

    // Note: a production version should guard this list against concurrent access.
    private val vectorStore = mutableListOf<VectorItem>()

    suspend fun addTextToRepository(text: String) = withContext(Dispatchers.Default) {
        val result: TextEmbedderResult = textEmbedder.embed(text)
        val embedding = result.embeddingResult().embeddings().first().floatEmbedding()
        vectorStore.add(VectorItem(text, embedding))
    }

    suspend fun search(query: String): List<Pair<String, Float>> = withContext(Dispatchers.Default) {
        val queryResult = textEmbedder.embed(query)
        val queryVector = queryResult.embeddingResult().embeddings().first().floatEmbedding()

        vectorStore.map { item ->
            val similarity = calculateCosineSimilarity(queryVector, item.embedding)
            item.text to similarity
        }.sortedByDescending { it.second }
    }

    private fun calculateCosineSimilarity(vectorA: FloatArray, vectorB: FloatArray): Float {
        var dotProduct = 0.0f
        var normA = 0.0f
        var normB = 0.0f
        for (i in vectorA.indices) {
            dotProduct += vectorA[i] * vectorB[i]
            normA += vectorA[i] * vectorA[i]
            normB += vectorB[i] * vectorB[i]
        }
        return if (normA == 0f || normB == 0f) 0f else dotProduct / (sqrt(normA) * sqrt(normB))
    }
}
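Before wiring this into the UI, here is a minimal sketch of how the repository might be seeded and queried. The facts and the demo function are made up for illustration:

// Somewhere with a coroutine scope, e.g. inside a ViewModel or a test:
suspend fun demo(repository: VectorSearchRepository) {
    repository.addTextToRepository("Golden Retrievers are friendly, energetic dogs.")
    repository.addTextToRepository("The Eiffel Tower was completed in 1889.")
    repository.addTextToRepository("Kotlin coroutines simplify asynchronous code.")

    // "puppies" never appears in the stored facts, yet the dog fact
    // should rank first because its embedding is geometrically closest.
    val results = repository.search("Tell me about puppies")
    results.forEach { (text, score) -> println("%.3f  %s".format(score, text)) }
}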

3. The ViewModel and UI

We use a ViewModel to manage the UI state and ensure that our search operations don't leak memory or block the UI.

import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import dagger.hilt.android.lifecycle.HiltViewModel
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.asStateFlow
import kotlinx.coroutines.launch
import javax.inject.Inject

@HiltViewModel
class VectorViewModel @Inject constructor(
    private val repository: VectorSearchRepository
) : ViewModel() {

    private val _searchResults = MutableStateFlow<List<Pair<String, Float>>>(emptyList())
    val searchResults = _searchResults.asStateFlow()

    fun performSearch(query: String) {
        // viewModelScope cancels the search automatically when the ViewModel is cleared.
        viewModelScope.launch {
            val results = repository.search(query)
            _searchResults.value = results
        }
    }
}

In the Compose UI, the user enters a query like "Tell me about puppies," and the system retrieves the "Golden Retriever" fact, even if the word "puppy" was never explicitly used in the source text.
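A minimal Compose sketch of that screen might look like the following; the layout is illustrative, not prescriptive:

import androidx.compose.foundation.layout.Column
import androidx.compose.foundation.layout.fillMaxWidth
import androidx.compose.foundation.layout.padding
import androidx.compose.foundation.lazy.LazyColumn
import androidx.compose.foundation.lazy.items
import androidx.compose.material3.OutlinedTextField
import androidx.compose.material3.Text
import androidx.compose.runtime.*
import androidx.compose.ui.Modifier
import androidx.compose.ui.unit.dp
import androidx.lifecycle.compose.collectAsStateWithLifecycle

@Composable
fun VectorSearchScreen(viewModel: VectorViewModel) {
    var query by remember { mutableStateOf("") }
    val results by viewModel.searchResults.collectAsStateWithLifecycle()

    Column(Modifier.padding(16.dp)) {
        OutlinedTextField(
            value = query,
            onValueChange = {
                query = it
                viewModel.performSearch(it) // fire a semantic search on each edit
            },
            modifier = Modifier.fillMaxWidth()
        )
        LazyColumn {
            items(results) { (text, score) ->
                Text("%.2f  %s".format(score, text))
            }
        }
    }
}

In production you would debounce the query Flow before hitting the embedder, since running inference on every keystroke is wasteful.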

Advanced Implementation: Semantic Memory and RAG

In a production-grade application, a Vector Search Repository is more than just a search bar; it is the Semantic Memory of the application. This leads us to Retrieval-Augmented Generation (RAG).

RAG allows the device to search through thousands of local documents, retrieve the most relevant snippets, and feed those snippets into Gemini Nano. This "grounds" the LLM in factual, local context, sharply reducing the "hallucinations" that plague many AI models.

The Production Pipeline

  1. Input: The user asks a question.
  2. Retrieval: The app converts the question to a vector and searches the local Vector Repository (stored in Room via BLOBs).
  3. Augmentation: The top 3 most relevant snippets are retrieved.
  4. Generation: The snippets and the original question are sent to Gemini Nano via AICore.
  5. Output: The user receives a response that is both intelligent and factually accurate based on their own data.
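Here is a sketch of that pipeline in Kotlin, assuming the VectorSearchRepository above and a hypothetical generateWithGeminiNano() wrapper around the AICore/Gemini Nano call (the real on-device generation API is not shown here):

// Hypothetical suspend wrapper around the on-device generation API.
suspend fun generateWithGeminiNano(prompt: String): String = TODO("AICore call")

suspend fun answerWithRag(
    question: String,
    repository: VectorSearchRepository
): String {
    // Retrieval: rank all stored snippets against the question.
    val topSnippets = repository.search(question)
        .take(3) // Augmentation: keep the 3 most relevant snippets
        .map { (text, _) -> text }

    // Build a grounded prompt from local context plus the original question.
    val prompt = buildString {
        appendLine("Answer using only the context below.")
        appendLine("Context:")
        topSnippets.forEach { appendLine("- $it") }
        appendLine("Question: $question")
    }

    // Generation: send the augmented prompt to the on-device model.
    return generateWithGeminiNano(prompt)
}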

Common Pitfalls to Avoid

1. The Main Thread Trap

Running textEmbedder.embed() on the Main thread is a guaranteed way to trigger an Application Not Responding (ANR) dialog. AI inference is computationally expensive. Always wrap your AI calls in withContext(Dispatchers.Default).

2. Memory Leaks and Model Lifecycle

TFLite models occupy significant native memory. If you create multiple instances of a TextEmbedder, you will quickly run into OutOfMemoryError. Use Hilt’s @Singleton scope to ensure only one instance of the model exists for the entire application lifetime, and call textEmbedder.close() if you ever need to release the model explicitly.

3. Vector Normalization

If you use Euclidean Distance instead of Cosine Similarity without normalizing your vectors first, your results will be skewed by the length of the text rather than its meaning. Stick to Cosine Similarity for text-based applications as it inherently handles vector magnitude.
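If you do need Euclidean Distance, a minimal L2-normalization sketch looks like this:

import kotlin.math.sqrt

// Scale a vector to unit length so that distance comparisons
// reflect direction (meaning) rather than magnitude (text length).
fun l2Normalize(v: FloatArray): FloatArray {
    val norm = sqrt(v.fold(0f) { acc, x -> acc + x * x })
    return if (norm == 0f) v else FloatArray(v.size) { i -> v[i] / norm }
}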

4. Asset Loading Latency

Loading a .tflite model from assets can take several hundred milliseconds. If this happens during the first screen render, the user will experience a visible stutter. Initialize your repository lazily or use a splash screen to mask the loading time.
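One way to keep this cost off the critical path is to defer model creation until first use. A sketch, with an optional warm-up call for after app launch:

import android.content.Context
import com.google.mediapipe.tasks.core.BaseOptions
import com.google.mediapipe.tasks.text.textembedder.TextEmbedder
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

class LazyEmbedderHolder(private val context: Context) {
    // The embedder (and its .tflite asset) is not loaded until first access,
    // keeping model I/O off the first screen render.
    val textEmbedder: TextEmbedder by lazy {
        TextEmbedder.createFromOptions(
            context,
            TextEmbedder.TextEmbedderOptions.builder()
                .setBaseOptions(
                    BaseOptions.builder()
                        .setModelAssetPath("universal_sentence_encoder.tflite")
                        .build()
                )
                .build()
        )
    }

    // Optionally warm the model up from a background coroutine after launch,
    // so the first real query does not pay the loading cost either.
    suspend fun warmUp() = withContext(Dispatchers.Default) { textEmbedder }
}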

Conclusion: The New Standard for Android UX

The era of keyword-based search is coming to an end. As users demand more intuitive, "human-like" interactions with their devices, building a robust Vector Search Repository becomes a mandatory skill for modern Android developers.

By leveraging AICore, MediaPipe, and Kotlin 2.x, we can build applications that don't just store data—they understand it. We are moving from apps that are passive tools to apps that act as intelligent partners, capable of navigating the complex geometry of human meaning.

Let's Discuss

  1. How do you see semantic search changing the way users interact with productivity apps like Notes or Email?
  2. Given the privacy benefits of AICore, would you prefer on-device vector search over cloud-based solutions like Pinecone or Weaviate for your next project?

The concepts and code demonstrated here are drawn directly from the comprehensive roadmap laid out in the ebook On-Device GenAI with Android Kotlin: Mastering Gemini Nano, AICore, and local LLM deployment using MediaPipe and Custom TFLite models. You can find it here: Leanpub.com

Check out the other programming & AI ebooks covering Python, TypeScript, C#, Swift, and Kotlin: Leanpub.com

Android Kotlin & AI Masterclass:
Book 1: On-Device GenAI. Mastering Gemini Nano, AICore, and local LLM deployment using MediaPipe and Custom TFLite models.
Book 2: Edge AI Performance. Optimizing hardware acceleration via NPU (Neural Processing Unit), GPU, and DSP. Advanced quantization and model pruning.
Book 3: Android AI Agents. Building autonomous apps that use Tool Calling, Function Injection, and Screen Awareness to perform tasks for the user.
