SoftwareDevs mvpfactory.io

Posted on • Originally published at mvpfactory.io

Adaptive Bitrate Model Loading on Android: Dynamic GGUF Shard Selection Based on Runtime Memory Pressure and Thermal State

---
title: "Adaptive Bitrate Model Loading on Android"
published: true
description: "Build an adaptive GGUF model loader that swaps quantization shards based on real-time memory pressure and thermal state on Android."
tags: android, kotlin, architecture, mobile
canonical_url: https://blog.mvpfactory.co/adaptive-bitrate-model-loading-android
---

## What We Are Building

Let me show you a pattern I use for on-device LLM inference that borrows directly from video streaming. We will build an adaptive GGUF model loader that monitors memory pressure and thermal state at runtime, then dynamically selects between Q4_K_M, Q5_K_S, and Q8_0 quantization shards — including mid-session shard swapping with KV cache migration when conditions degrade.

By the end, you will have three components wired together: a `MemoryPressureMonitor`, a `ThermalStateObserver`, and a `ShardOrchestrator` that treats quantization tiers exactly like HLS/DASH bitrate tiers.

## Prerequisites

- Android project targeting API 29+ (for thermal callbacks)
- llama.cpp with JNI bindings integrated into your app
- Three GGUF shards of the same base model (Q8_0, Q5_K_S, Q4_K_M)
- Familiarity with Kotlin coroutines and `StateFlow`

## Step 1: Define Your Shard Tiers

```kotlin
enum class GgufTier(
    val fileName: String,
    val estimatedRamMb: Int,
    val qualityScore: Float
) {
    HIGH("model-q8_0.gguf", 7200, 0.95f),
    MEDIUM("model-q5_k_s.gguf", 4800, 0.88f),
    LOW("model-q4_k_m.gguf", 3400, 0.82f)
}
```


These RAM estimates target a 7B parameter model. The actual footprint varies by ~8-12% depending on context length and batch size, so always add a buffer.
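To make that buffer concrete, here is a hypothetical helper (not part of the classes built in this post) that checks whether a tier fits in the available headroom with a 15% safety margin, comfortably covering the 8-12% variance:

```kotlin
// Tier table repeated from Step 1 so this sketch is self-contained.
enum class GgufTier(val fileName: String, val estimatedRamMb: Int, val qualityScore: Float) {
    HIGH("model-q8_0.gguf", 7200, 0.95f),
    MEDIUM("model-q5_k_s.gguf", 4800, 0.88f),
    LOW("model-q4_k_m.gguf", 3400, 0.82f)
}

// A tier "fits" only if its estimate plus a safety buffer stays under headroom.
fun fitsInHeadroom(tier: GgufTier, headroomMb: Long, bufferFraction: Double = 0.15): Boolean =
    tier.estimatedRamMb * (1 + bufferFraction) <= headroomMb

// Walk the quality ladder top-down; fall back to LOW if nothing fits.
fun bestFittingTier(headroomMb: Long): GgufTier =
    GgufTier.entries.firstOrNull { fitsInHeadroom(it, headroomMb) } ?: GgufTier.LOW
```

With 9 GB of headroom this picks `HIGH` (7200 × 1.15 = 8280 MB fits); at 6 GB it drops to `MEDIUM`.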

## Step 2: Monitor Memory Pressure

```kotlin
class MemoryPressureMonitor(private val context: Context) {
    private val activityManager =
        context.getSystemService(ActivityManager::class.java)

    fun availableHeadroomMb(): Long {
        val memInfo = ActivityManager.MemoryInfo()
        activityManager.getMemoryInfo(memInfo)
        return (memInfo.availMem - memInfo.threshold) / (1024 * 1024)
    }

    fun recommendTier(): GgufTier {
        val headroom = availableHeadroomMb()
        return when {
            headroom > 8000 -> GgufTier.HIGH
            headroom > 5500 -> GgufTier.MEDIUM
            else -> GgufTier.LOW
        }
    }
}
```


Here is the minimal setup to get this working. `ActivityManager.getMemoryInfo()` gives you available RAM minus the low-memory threshold — that delta is your real headroom.
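The threshold logic is easy to unit-test if you pull it out of the Android-bound class. A sketch (hypothetical refactor, not part of `MemoryPressureMonitor` above):

```kotlin
// Minimal tier enum so this snippet stands alone.
enum class GgufTier { HIGH, MEDIUM, LOW }

// Same thresholds as recommendTier(), as a pure function testable off-device.
fun tierForHeadroom(headroomMb: Long): GgufTier = when {
    headroomMb > 8000 -> GgufTier.HIGH
    headroomMb > 5500 -> GgufTier.MEDIUM
    else -> GgufTier.LOW
}
```

Keeping the decision pure also makes it trivial to log or replay headroom readings from production devices.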

## Step 3: Observe Thermal State

The docs do not mention this, but thermal throttling murders inference throughput *before* it kills your process. On a Snapdragon 8 Gen 2 hitting `THERMAL_STATUS_MODERATE`, expect 30-40% throughput degradation on Q8_0. Dropping to Q5_K_S recovers most of that.

```kotlin
class ThermalStateObserver(context: Context) {
    private val powerManager =
        context.getSystemService(PowerManager::class.java)
    private val _thermalState = MutableStateFlow(PowerManager.THERMAL_STATUS_NONE)
    val thermalState: StateFlow<Int> = _thermalState.asStateFlow()

    init {
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q) {
            powerManager.addThermalStatusListener(Executors.newSingleThreadExecutor()) {
                _thermalState.value = it
            }
        }
    }

    fun shouldDownshift(): Boolean =
        _thermalState.value >= PowerManager.THERMAL_STATUS_MODERATE
}
```


## Step 4: Orchestrate Mid-Session Shard Swapping

This is the hard part. Naively swapping shards discards the KV cache and loses conversational context. The workaround: serialize the KV cache, unload the current shard, load the new one, then deserialize.

```kotlin
class ShardOrchestrator(
    private val memoryMonitor: MemoryPressureMonitor,
    private val thermalObserver: ThermalStateObserver
) {
    private var activeTier: GgufTier = GgufTier.MEDIUM
    private var llamaContext: Long = 0L // JNI pointer

    suspend fun evaluateAndSwap() {
        val targetTier = when {
            // Thermal pressure: step one tier down the ladder (HIGH -> MEDIUM -> LOW)
            thermalObserver.shouldDownshift() ->
                GgufTier.entries[minOf(activeTier.ordinal + 1, GgufTier.entries.lastIndex)]
            else -> memoryMonitor.recommendTier()
        }

        if (targetTier != activeTier) {
            val kvCacheBytes = LlamaBridge.serializeKvCache(llamaContext)
            LlamaBridge.freeContext(llamaContext)
            llamaContext = LlamaBridge.loadModel(targetTier.fileName)
            LlamaBridge.deserializeKvCache(llamaContext, kvCacheBytes)
            activeTier = targetTier
        }
    }
}
```


The JNI work to expose llama.cpp's `llama_copy_state_data` / `llama_set_state_data` is non-trivial but pays off immediately.

## Performance Under Pressure

| Scenario | Q8_0 | Q5_K_S | Q4_K_M |
|---|---|---|---|
| RAM usage (7B model) | ~7.2 GB | ~4.8 GB | ~3.4 GB |
| Tokens/sec (SD 8 Gen 2, cool) | ~12 | ~18 | ~24 |
| Tokens/sec (thermally throttled) | ~7 | ~14 | ~20 |
| Perplexity delta vs FP16 | +0.05 | +0.12 | +0.18 |

The throughput advantage of lower quantization tiers widens under thermal constraints — exactly when you need it.
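Working that claim from the (approximate) table figures makes it concrete — the relative slowdown per tier:

```kotlin
import kotlin.math.roundToInt

// Relative throughput loss under throttling, from the table's approximate numbers.
fun slowdownPercent(coolTps: Double, throttledTps: Double): Int =
    ((1 - throttledTps / coolTps) * 100).roundToInt()

// Q8_0:   12 -> 7  tok/s  ≈ 42% slowdown
// Q5_K_S: 18 -> 14 tok/s  ≈ 22% slowdown
// Q4_K_M: 24 -> 20 tok/s  ≈ 17% slowdown
```

Q8_0 loses roughly 42% of its throughput under throttling while Q4_K_M loses about 17% — the lighter the shard, the less the heat tax.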

## Gotchas

Three gotchas that will save you hours:

1. **KV cache dimension mismatch.** If your GGUF shards share the same base architecture and context length (generated from the same source model), the KV cache is compatible. Mismatched cache dimensions will produce garbage output or segfault through the JNI layer. Verify this in testing.
2. **Thermal before memory.** Prioritize thermal state over memory pressure. Memory warnings give you seconds to react; thermal throttling gives you milliseconds of degraded performance before the OS intervenes. Wire `PowerManager.addThermalStatusListener()` first.
3. **Static loading is the real bug.** Most teams treat model loading as a one-shot decision. In production, device conditions are non-stationary — a user opening a background music app can flip `lowMemory = true` instantly.
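Because conditions are non-stationary, naive re-evaluation will thrash between tiers when a reading oscillates around a threshold. A standard ABR trick — not shown in the orchestrator above, sketched here as an assumption — is hysteresis: downshift immediately, but require several consecutive favorable readings before upshifting.

```kotlin
// Lower ordinal = higher quality, matching the GgufTier ordering in Step 1.
enum class GgufTier { HIGH, MEDIUM, LOW }

// Hypothetical helper: asymmetric hysteresis around tier changes.
class TierHysteresis(private val upshiftAfter: Int = 3) {
    var current: GgufTier = GgufTier.MEDIUM
        private set
    private var favorableStreak = 0

    fun onReading(recommended: GgufTier): GgufTier {
        when {
            recommended.ordinal > current.ordinal -> { // worse tier recommended
                current = recommended                  // downshift immediately
                favorableStreak = 0
            }
            recommended.ordinal < current.ordinal -> { // better tier possible
                favorableStreak++
                if (favorableStreak >= upshiftAfter) { // upshift only after a streak
                    current = recommended
                    favorableStreak = 0
                }
            }
            else -> favorableStreak = 0
        }
        return current
    }
}
```

Feed `recommendTier()` readings through `onReading()` instead of acting on each one; a single transient headroom spike no longer triggers an expensive shard swap.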

## Wrapping Up

Treat quantization selection as a runtime decision, not a build-time one. Ship all three GGUF shards in your APK (or download them on demand via Play Asset Delivery) and let device conditions drive the choice. Invest in KV cache serialization early — mid-session shard swapping without cache migration destroys the user experience.
