---
title: "Adaptive Bitrate Model Loading on Android"
published: true
description: "Build an adaptive GGUF model loader that swaps quantization shards based on real-time memory pressure and thermal state on Android."
tags: android, kotlin, architecture, mobile
canonical_url: https://blog.mvpfactory.co/adaptive-bitrate-model-loading-android
---
## What We Are Building
Let me show you a pattern I use for on-device LLM inference that borrows directly from video streaming. We will build an adaptive GGUF model loader that monitors memory pressure and thermal state at runtime, then dynamically selects between Q4_K_M, Q5_K_S, and Q8_0 quantization shards — including mid-session shard swapping with KV cache migration when conditions degrade.
By the end, you will have three components wired together: a `MemoryPressureMonitor`, a `ThermalStateObserver`, and a `ShardOrchestrator` that treats quantization tiers exactly like HLS/DASH bitrate tiers.
## Prerequisites
- Android project targeting API 29+ (for thermal callbacks)
- llama.cpp with JNI bindings integrated into your app
- Three GGUF shards of the same base model (Q8_0, Q5_K_S, Q4_K_M)
- Familiarity with Kotlin coroutines and `StateFlow`
## Step 1: Define Your Shard Tiers
```kotlin
enum class GgufTier(
    val fileName: String,
    val estimatedRamMb: Int,
    val qualityScore: Float
) {
    HIGH("model-q8_0.gguf", 7200, 0.95f),
    MEDIUM("model-q5_k_s.gguf", 4800, 0.88f),
    LOW("model-q4_k_m.gguf", 3400, 0.82f)
}
```
These RAM estimates target a 7B parameter model. The actual footprint varies by ~8-12% depending on context length and batch size, so always add a buffer.
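To act on that advice, I pad the static estimate before comparing it against measured headroom. A minimal sketch; the 15% margin and the helper names are my assumptions, not measured constants:

```kotlin
// Hypothetical helpers: pad the static estimate to absorb the ~8-12%
// variance from context length and batch size (15% is an assumed margin).
fun GgufTier.requiredRamMb(bufferFactor: Float = 1.15f): Int =
    (estimatedRamMb * bufferFactor).toInt()

fun GgufTier.fitsWithin(headroomMb: Long): Boolean =
    headroomMb >= requiredRamMb()
```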
## Step 2: Monitor Memory Pressure
```kotlin
import android.app.ActivityManager
import android.content.Context

class MemoryPressureMonitor(private val context: Context) {

    private val activityManager =
        context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager

    fun availableHeadroomMb(): Long {
        val memInfo = ActivityManager.MemoryInfo()
        activityManager.getMemoryInfo(memInfo)
        // availMem minus the low-memory kill threshold is the RAM we can
        // actually claim before the OS starts reclaiming processes.
        return (memInfo.availMem - memInfo.threshold) / (1024 * 1024)
    }

    fun recommendTier(): GgufTier {
        val headroom = availableHeadroomMb()
        return when {
            headroom > 8000 -> GgufTier.HIGH
            headroom > 5500 -> GgufTier.MEDIUM
            else -> GgufTier.LOW
        }
    }
}
```
That is the entire setup. `ActivityManager.getMemoryInfo()` reports available RAM and the low-memory kill threshold; the difference between the two is your real headroom.
## Step 3: Observe Thermal State
The docs do not mention this, but thermal throttling murders inference throughput *before* it kills your process. On a Snapdragon 8 Gen 2 hitting `THERMAL_STATUS_MODERATE`, expect 30-40% throughput degradation on Q8_0. Dropping to Q5_K_S recovers most of that.
```kotlin
import android.content.Context
import android.os.Build
import android.os.PowerManager
import java.util.concurrent.Executors
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow

class ThermalStateObserver(context: Context) {

    private val powerManager =
        context.getSystemService(Context.POWER_SERVICE) as PowerManager

    private val _thermalState = MutableStateFlow(PowerManager.THERMAL_STATUS_NONE)
    val thermalState: StateFlow<Int> = _thermalState.asStateFlow()

    init {
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q) {
            // Callbacks arrive on the executor we provide; a single
            // background thread keeps them off the main thread.
            powerManager.addThermalStatusListener(Executors.newSingleThreadExecutor()) { status ->
                _thermalState.value = status
            }
        }
    }

    fun shouldDownshift(): Boolean =
        _thermalState.value >= PowerManager.THERMAL_STATUS_MODERATE
}
```
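One thing this snippet leaves out is teardown: the listener and its executor live as long as the process does. A sketch of a releasable variant, assuming you keep references to both (the class and method names here are mine):

```kotlin
import android.content.Context
import android.os.Build
import android.os.PowerManager
import java.util.concurrent.Executors

// Hypothetical variant with an explicit release hook so the thermal
// listener and its executor do not outlive the component that owns them.
class ReleasableThermalObserver(context: Context) {
    private val powerManager =
        context.getSystemService(Context.POWER_SERVICE) as PowerManager
    private val executor = Executors.newSingleThreadExecutor()
    private val listener = PowerManager.OnThermalStatusChangedListener { status ->
        // forward to a StateFlow as in ThermalStateObserver above
    }

    init {
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q) {
            powerManager.addThermalStatusListener(executor, listener)
        }
    }

    fun release() {
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q) {
            powerManager.removeThermalStatusListener(listener)
        }
        executor.shutdown()
    }
}
```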
## Step 4: Orchestrate Mid-Session Shard Swapping
This is the hard part. Naively swapping shards discards the KV cache and loses conversational context. The workaround: serialize the KV cache, unload the current shard, load the new one, then deserialize.
```kotlin
class ShardOrchestrator(
    private val memoryMonitor: MemoryPressureMonitor,
    private val thermalObserver: ThermalStateObserver
) {
    private var activeTier: GgufTier = GgufTier.MEDIUM
    private var llamaContext: Long = 0L // JNI pointer

    suspend fun evaluateAndSwap() {
        val targetTier = when {
            // Under thermal pressure, step one tier toward LOW (higher
            // ordinal), clamping at the last entry.
            thermalObserver.shouldDownshift() ->
                minOf(activeTier.ordinal + 1, GgufTier.entries.lastIndex)
                    .let { GgufTier.entries[it] }
            else -> memoryMonitor.recommendTier()
        }
        if (targetTier != activeTier) {
            // Serialize the KV cache so the conversation survives the swap.
            val kvCacheBytes = LlamaBridge.serializeKvCache(llamaContext)
            LlamaBridge.freeContext(llamaContext)
            llamaContext = LlamaBridge.loadModel(targetTier.fileName)
            LlamaBridge.deserializeKvCache(llamaContext, kvCacheBytes)
            activeTier = targetTier
        }
    }
}
```
The JNI work to expose llama.cpp's `llama_copy_state_data` / `llama_set_state_data` is non-trivial but pays off immediately.
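For orientation, the Kotlin side of that bridge can be as small as a handful of `external` declarations. These signatures match the calls used above but are otherwise my own naming, and the native library name is an assumption:

```kotlin
// Hypothetical JNI surface. The native side wraps llama.cpp's
// state-save/state-load functions referenced above.
object LlamaBridge {
    init {
        System.loadLibrary("llama_jni") // assumed native library name
    }

    external fun loadModel(fileName: String): Long // returns a context handle
    external fun freeContext(ctx: Long)
    external fun serializeKvCache(ctx: Long): ByteArray
    external fun deserializeKvCache(ctx: Long, state: ByteArray)
}
```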
## Performance Under Pressure
| Scenario | Q8_0 | Q5_K_S | Q4_K_M |
|---|---|---|---|
| RAM usage (7B model) | ~7.2 GB | ~4.8 GB | ~3.4 GB |
| Tokens/sec (SD 8 Gen 2, cool) | ~12 | ~18 | ~24 |
| Tokens/sec (thermally throttled) | ~7 | ~14 | ~20 |
| Perplexity delta vs FP16 | +0.05 | +0.12 | +0.18 |
The throughput advantage of lower quantization tiers widens under thermal constraints — exactly when you need it most.
## Gotchas
Here are three gotchas that will save you hours:
1. **KV cache dimension mismatch.** If your GGUF shards share the same base architecture and context length (generated from the same source model), the KV cache is compatible. Mismatched cache dimensions will produce garbage output or segfault through the JNI layer. Verify this in testing.
2. **Thermal before memory.** Prioritize thermal state over memory pressure. Memory warnings give you seconds to react; thermal throttling gives you milliseconds of degraded performance before the OS intervenes. Wire `PowerManager.addThermalStatusListener()` first.
3. **Static loading is the real bug.** Most teams treat model loading as a one-shot decision. In production, device conditions are non-stationary — a user opening a background music app can flip `lowMemory = true` instantly. Re-evaluate continuously, as in the sketch after this list.
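Putting gotchas 2 and 3 together, a minimal driver loop might look like this; the five-second polling interval is an assumption to tune for your workload:

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.isActive
import kotlinx.coroutines.launch

// Illustrative wiring: thermal transitions trigger an immediate
// re-evaluation; memory headroom has no push API, so it is polled.
fun startAdaptiveLoop(
    scope: CoroutineScope,
    orchestrator: ShardOrchestrator,
    thermalObserver: ThermalStateObserver
) {
    // Thermal first: react to every status change as it arrives.
    scope.launch {
        thermalObserver.thermalState.collect { orchestrator.evaluateAndSwap() }
    }
    // Memory second: periodic re-check because conditions are non-stationary.
    scope.launch {
        while (isActive) {
            orchestrator.evaluateAndSwap()
            delay(5_000) // assumed interval
        }
    }
}
```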
## Wrapping Up
Treat quantization selection as a runtime decision, not a build-time one. Ship all three GGUF shards in your APK (or download them on demand via Play Asset Delivery) and let device conditions drive the choice. Invest in KV cache serialization early — mid-session shard swapping without cache migration destroys the user experience.
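If you go the Play Asset Delivery route, the shards live in an on-demand asset pack. A sketch of the pack module's build file, assuming a module named `gguf_shards`:

```kotlin
// build.gradle.kts of a hypothetical "gguf_shards" asset-pack module
plugins {
    id("com.android.asset-pack")
}

assetPack {
    packName.set("gguf_shards")
    dynamicDelivery {
        deliveryType.set("on-demand") // fetched at runtime, not at install
    }
}
```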