---
title: "Running Vision-Language Models On-Device in Android"
published: true
description: "A hands-on guide to running quantized VLMs on Android using split-delegate architecture, CameraX integration, and Kotlin coroutines for real-time on-device image understanding."
tags: android, kotlin, architecture, mobile
canonical_url: https://blog.mvpfactory.co/running-vision-language-models-on-device-in-android
---
## What We Will Build
Let me show you how to run a vision-language model — think LLaVA or MobileVLM — directly on an Android device. By the end of this tutorial, you will have a pipeline that captures camera frames, encodes them through a CLIP vision encoder, and streams text responses from a language decoder. All on-device, no server round-trips.
The key pattern I use in every on-device multimodal project: **split-delegate architecture**. The vision encoder and language decoder run on different hardware delegates. This is the minimal setup to get real-time image understanding working without melting the device.
## Prerequisites
- Android device with Snapdragon 8 Gen 3, Tensor G4, or equivalent
- TFLite with GPU and NNAPI delegate support
- CameraX dependency in your project
- Familiarity with Kotlin coroutines and Flows
## Step 1: Understand the Dual-Model Reality
Vision-language models are not a single model. They are two models stitched together: a **CLIP-family vision encoder** that converts images into embedding vectors, and a **language model decoder** that consumes those embeddings to generate text.
Each component has a different computational profile and belongs on a different delegate:
| Component | Optimal Delegate | Quantization | Typical Latency (Pixel 8 Pro) | Memory Footprint |
|---|---|---|---|---|
| CLIP Vision Encoder | GPU Delegate | INT8 | ~40-80ms per frame | ~150-300MB |
| Language Decoder (1.3B-3B params) | NNAPI / CPU | INT4 (GPTQ/AWQ) | ~200-500ms per token | ~800MB-1.5GB |
| Projection Layer | CPU | FP16 | <5ms | Negligible |
The vision encoder is dense matrix math — it maps cleanly onto GPU shader cores via TFLite's GPU delegate. The language decoder, with its autoregressive token-by-token generation, runs better on NNAPI or optimized CPU paths with XNNPACK.
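As a sketch of what the split looks like in code — the model filenames and thread count here are illustrative, not from a real project — assuming the `tensorflow-lite`, `tensorflow-lite-gpu`, and NNAPI delegate artifacts are on the classpath:

```kotlin
import android.content.Context
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import org.tensorflow.lite.nnapi.NnApiDelegate
import java.io.FileInputStream
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// Memory-map a model from assets so the OS can page inactive segments.
fun mapModel(context: Context, assetName: String): MappedByteBuffer {
    context.assets.openFd(assetName).use { fd ->
        FileInputStream(fd.fileDescriptor).channel.use { channel ->
            return channel.map(
                FileChannel.MapMode.READ_ONLY,
                fd.startOffset,
                fd.declaredLength
            )
        }
    }
}

// Vision encoder on the GPU delegate: dense matmuls map well to shaders.
fun buildVisionEncoder(context: Context): Interpreter {
    val options = Interpreter.Options().addDelegate(GpuDelegate())
    return Interpreter(mapModel(context, "vision_encoder_int8.tflite"), options)
}

// Decoder on NNAPI, with multi-threaded CPU (XNNPACK) as the fallback path
// for ops the NNAPI driver rejects.
fun buildDecoder(context: Context): Interpreter {
    val options = Interpreter.Options()
        .addDelegate(NnApiDelegate())
        .setNumThreads(4)
    return Interpreter(mapModel(context, "decoder_int4.tflite"), options)
}
```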
## Step 2: Quantize Asymmetrically
Here is the gotcha that will save you hours: **do not apply the same quantization to both components.**
The vision tower is sensitive to aggressive quantization. Dropping CLIP to INT4 measurably degrades embedding quality, which cascades into worse language output. Use **INT8 symmetric quantization** — it preserves visual fidelity with minimal accuracy loss.
The language decoder tolerates INT4 well, especially with group-wise quantization (GPTQ with 128-group size or AWQ). A 3B-parameter decoder drops from ~6GB (FP16) to ~1.5GB (INT4). The perplexity increase is marginal, but the memory savings are real.
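The footprint numbers above are easy to sanity-check with back-of-the-envelope arithmetic. This ignores the small overhead of group-wise scales and zero-points, which adds a few percent for INT4:

```kotlin
// Approximate weight storage for a model: params × bits / 8 bytes.
fun weightBytes(params: Long, bitsPerWeight: Int): Long =
    params * bitsPerWeight / 8

fun main() {
    val params = 3_000_000_000L // 3B-parameter decoder
    println("FP16: ${weightBytes(params, 16) / 1e9} GB") // ~6 GB
    println("INT4: ${weightBytes(params, 4) / 1e9} GB")  // ~1.5 GB
}
```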
## Step 3: Build the CameraX Frame Buffer Pipeline
Feeding camera frames into the vision encoder requires careful buffer management. Here is the minimal setup:
```kotlin
import android.graphics.Bitmap
import androidx.camera.core.ImageAnalysis
import androidx.camera.core.ImageProxy
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.channels.BufferOverflow
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flowOn
import kotlinx.coroutines.flow.map
import kotlinx.coroutines.flow.receiveAsFlow
import org.tensorflow.lite.Interpreter

class VLMFrameAnalyzer(
    private val visionEncoder: Interpreter,
    private val scope: CoroutineScope
) : ImageAnalysis.Analyzer {

    // Capacity of 1 + DROP_OLDEST keeps only the freshest frame.
    private val frameChannel = Channel<Bitmap>(
        capacity = 1,
        onBufferOverflow = BufferOverflow.DROP_OLDEST
    )

    override fun analyze(imageProxy: ImageProxy) {
        val bitmap = imageProxy.toBitmap()
        frameChannel.trySend(bitmap)
        imageProxy.close() // always close immediately
    }

    fun embeddings(): Flow<FloatArray> = frameChannel.receiveAsFlow()
        .map { bitmap ->
            val input = preprocessForCLIP(bitmap, 224)
            val output = Array(1) { FloatArray(768) }
            visionEncoder.run(input, output)
            output[0]
        }
        .flowOn(Dispatchers.Default) // after map, so inference runs off the collector's thread
}
```
The `DROP_OLDEST` on the channel is critical. Under sustained inference, you will fall behind real-time. Dropping stale frames is correct behavior — users want the model to reason about what the camera sees *now*, not 400ms ago.
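The `preprocessForCLIP` helper is left undefined above. Its normalization step would typically apply CLIP's standard per-channel mean and std after scaling pixels to 0..1; the helper name and per-pixel shape here are mine, not a library API:

```kotlin
// Standard CLIP preprocessing constants (RGB channel order).
val CLIP_MEAN = floatArrayOf(0.48145466f, 0.4578275f, 0.40821073f)
val CLIP_STD = floatArrayOf(0.26862954f, 0.26130258f, 0.27577711f)

// Normalize one pixel's RGB values (already scaled to 0..1) per channel.
fun normalizeClipPixel(rgb: FloatArray): FloatArray =
    FloatArray(3) { c -> (rgb[c] - CLIP_MEAN[c]) / CLIP_STD[c] }
```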
## Step 4: Wire the Streaming Pipeline
Connect CameraX → vision encoder → projection → language decoder as a structured coroutine flow:
```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flowOn
import kotlinx.coroutines.flow.map
import kotlinx.coroutines.flow.sample

fun runVLMPipeline(
    analyzer: VLMFrameAnalyzer,
    decoder: LanguageDecoder,
    prompt: String
): Flow<String> = analyzer.embeddings()
    .sample(500) // limit to ~2 inferences/sec
    .map { embeddings -> decoder.generate(prompt, embeddings) }
    .flowOn(Dispatchers.Default)
```
The `sample(500)` operator is your thermal throttling knob. On sustained inference, SoC temperatures climb fast with dual-model workloads. Sampling at 500ms intervals keeps most devices under thermal limits.
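One way to make that knob adaptive is a small policy function driven by the thermal status reported via `PowerManager.addThermalStatusListener` (API 29+). The cutoffs below are my assumptions to tune on target hardware; the constants mirror the `PowerManager.THERMAL_STATUS_*` values:

```kotlin
// Mirrors android.os.PowerManager.THERMAL_STATUS_* values.
const val THERMAL_NONE = 0
const val THERMAL_MODERATE = 2
const val THERMAL_SEVERE = 3

// Back off the frame-sampling period as the SoC heats up.
fun samplePeriodMs(thermalStatus: Int): Long = when {
    thermalStatus >= THERMAL_SEVERE -> 2000L   // ~0.5 inferences/sec
    thermalStatus >= THERMAL_MODERATE -> 1000L // ~1 inference/sec
    else -> 500L                               // the default ~2/sec
}
```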
## Step 5: Manage Memory Pressure
Running two models on a device with 8-12GB total RAM (shared with the OS, other apps, and the camera HAL) takes discipline:
- **Lazy-load the language decoder.** Keep only the vision encoder resident during camera preview. Load the decoder on first query.
- **Memory-map model weights** via TFLite's `MappedByteBuffer`. This lets the OS page out inactive segments under pressure.
- **Monitor `ComponentCallbacks2`** and downgrade gracefully: drop to vision-only mode on `TRIM_MEMORY_RUNNING_LOW`.
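A minimal sketch of that downgrade policy, with the relevant `ComponentCallbacks2` trim-level constants inlined so the logic is visible as plain Kotlin (in a real app you would reference them from the framework class and call this from `Activity.onTrimMemory`):

```kotlin
// Mirrors ComponentCallbacks2.TRIM_MEMORY_RUNNING_LOW / _CRITICAL.
const val TRIM_MEMORY_RUNNING_LOW = 10
const val TRIM_MEMORY_RUNNING_CRITICAL = 15

class DecoderMemoryPolicy {
    var decoderLoaded = true
        private set

    // Drop to vision-only mode under memory pressure while foregrounded.
    fun onTrimMemory(level: Int) {
        if (level == TRIM_MEMORY_RUNNING_LOW || level == TRIM_MEMORY_RUNNING_CRITICAL) {
            decoderLoaded = false // release the Interpreter and let weights unmap here
        }
    }
}
```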
## Gotchas
- **Do not run both models on the same delegate.** You will hit contention and get worse throughput than splitting. GPU for vision, NNAPI/CPU for the decoder.
- **Test embedding cosine similarity against FP16 baselines before shipping.** The docs do not mention this, but INT8 quantization on the vision tower can silently degrade embedding quality in ways that only surface in downstream text generation.
- **Design for thermal steady-state, not peak throughput.** Instrument `ThermalStatusListener`. The fastest model is worthless if the device throttles to half speed after 30 seconds.
- **Always close `imageProxy` immediately** in your analyzer. Holding references will stall the CameraX pipeline and kill your preview frame rate.
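For the embedding-quality check in the second gotcha, plain cosine similarity between INT8 and FP16-baseline embeddings is enough; the pass threshold (say, > 0.99) is a judgment call to calibrate against your downstream text quality, not a documented standard:

```kotlin
import kotlin.math.sqrt

// Cosine similarity between a quantized and a baseline embedding.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size)
    var dot = 0f; var normA = 0f; var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}
```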
## Wrapping Up
On-device VLMs are viable today — but only if you respect the hardware constraints instead of fighting them. Split your delegates, quantize asymmetrically, sample frames at sustainable rates, and instrument thermals from day one. This pattern has worked reliably across every production on-device ML system I have shipped.
Start with the frame buffer pipeline above, verify your latency numbers on target hardware, and iterate from there.