---
title: "On-Device LLMs via KMP: A Production Architecture with llama.cpp"
published: true
description: "Build a KMP shared module wrapping llama.cpp with mmap loading, hardware acceleration, and thermal management for 3B-parameter models on mobile."
tags: kotlin, mobile, architecture, android
canonical_url: https://blog.mvp-factory.com/on-device-llms-via-kmp-production-architecture
---

## What We Will Build

In this workshop, I will walk you through a Kotlin Multiplatform shared module that wraps llama.cpp to run 3B-parameter LLMs directly on iOS and Android. By the end, you will understand mmap-based model loading, hardware accelerator delegation across Apple Neural Engine and Android NNAPI, quantization format tradeoffs, and the thermal throttling patterns that separate a demo from a shippable feature.

Let me show you a pattern I use in every project that touches on-device inference.

## Prerequisites

- Kotlin Multiplatform project configured for iOS and Android targets
- llama.cpp compiled as a static library for both platforms
- A Q4_K_M quantized 3B model (roughly 1.8 GB on disk)
- Test devices: flagship-tier (Pixel 8, iPhone 15 Pro or equivalent)

## Step 1: The KMP Bridge — cinterop and JNI

The shared module exposes a single `LlmEngine` interface using KMP's `expect/actual` pattern. On iOS, you bridge to llama.cpp through Kotlin/Native's cinterop, generating Kotlin bindings from C headers directly. On Android, you go through JNI with a thin C++ wrapper.



```kotlin
import kotlinx.coroutines.flow.Flow

expect class LlmEngine {
    fun loadModel(path: String, config: ModelConfig): Boolean
    fun generate(prompt: String, params: GenerationParams): Flow<String>
    fun currentThermalState(): ThermalState
}
```


Your feature layer never touches llama.cpp directly. This is the key — your app code stays completely platform-agnostic.
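To make the platform-agnostic claim concrete, here is a minimal sketch of a feature class written against a simplified stand-in for the engine (the `TextGenerator` interface and `SummarizeFeature` class are illustrative, not part of the real module). Because the feature depends only on the abstraction, you can unit-test it in common code with a fake engine and swap in the cinterop or JNI `actual` on device:

```kotlin
// Simplified, hypothetical stand-in for the shared LlmEngine interface.
interface TextGenerator {
    fun generate(prompt: String): Sequence<String>
}

// Feature code depends only on the abstraction; the cinterop/JNI actuals
// plug in behind it on each platform.
class SummarizeFeature(private val engine: TextGenerator) {
    fun summarize(text: String): String =
        engine.generate("Summarize in one sentence: $text").joinToString("")
}
```

A fake `TextGenerator` that emits canned tokens is all you need to test the feature layer on the JVM, with no native library in sight.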

## Step 2: Memory-Mapped Model Loading

Here is the gotcha that will save you hours: do not read the entire model into heap memory. A Q4_K_M quantized 3B model is roughly 1.8–2.0 GB. Loading that into the app's memory space on a device with 6 GB total RAM is a guaranteed OOM kill.

The solution is `mmap`. llama.cpp supports memory-mapped file access natively, letting the OS page model weights in and out of physical RAM on demand. Your resident memory footprint stays manageable because the kernel evicts pages under pressure instead of killing your process.
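One way to keep this safe is to make mmap the default in your shared `ModelConfig` and guard against heap loading at the boundary. The field names below are assumptions for this sketch; the wrapper would forward them into llama.cpp's model params (`use_mmap` and `use_mlock` are the corresponding flags there):

```kotlin
// Hypothetical ModelConfig fields; the JNI/cinterop wrapper forwards them
// into llama.cpp's model loading parameters.
data class ModelConfig(
    val useMmap: Boolean = true,   // map weights; the kernel pages them in on demand
    val useMlock: Boolean = false  // pinning ~2 GB of weights invites an OOM kill
)

// Guard rail: refuse heap loading when the model clearly cannot fit.
fun checkedConfig(config: ModelConfig, modelBytes: Long, deviceRamBytes: Long): ModelConfig {
    require(config.useMmap || modelBytes < deviceRamBytes / 3) {
        "A $modelBytes-byte model needs mmap on a $deviceRamBytes-byte device"
    }
    return config
}
```

The one-third-of-RAM threshold is a heuristic for this sketch, not a platform guarantee; tune it against real memory-pressure testing.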

## Step 3: Pick Your Quantization Format

Quantization format selection is a direct tradeoff between quality, speed, and memory pressure. Here is the minimal data to get this decision right:

| Format | Size (3B) | Peak RAM | Tokens/sec (Pixel 8) | Tokens/sec (iPhone 15 Pro) | Perplexity Delta |
|--------|-----------|----------|----------------------|---------------------------|-----------------|
| Q4_K_M | ~1.8 GB | ~2.1 GB | ~12–15 t/s | ~18–22 t/s | +0.3–0.5 |
| Q5_K_S | ~2.2 GB | ~2.5 GB | ~9–12 t/s | ~14–18 t/s | +0.1–0.2 |

Q4_K_M is the sweet spot for mobile. The perplexity difference is negligible for structured output tasks like JSON generation and classification. Reserve Q5_K_S for quality-critical use cases where you can guarantee flagship hardware.
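That decision rule can be encoded directly in the shared module. The 8 GB threshold below is an assumption for illustration; calibrate it against the peak-RAM column of your own benchmarks:

```kotlin
enum class QuantFormat { Q4_K_M, Q5_K_S }

// Heuristic from the table above: default to Q4_K_M, and step up to Q5_K_S
// only for quality-critical features on devices with RAM headroom.
fun pickQuantFormat(deviceRamGb: Double, qualityCritical: Boolean): QuantFormat =
    if (qualityCritical && deviceRamGb >= 8.0) QuantFormat.Q5_K_S
    else QuantFormat.Q4_K_M
```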

## Step 4: Hardware Accelerator Delegation

On iOS, delegate matrix operations to the Apple Neural Engine through CoreML integration. llama.cpp supports Metal acceleration out of the box, and ANE delegation via CoreML conversion pushes throughput significantly higher on A17/M-series silicon.

On Android, NNAPI delegation and GPU compute via Vulkan or OpenCL are available, but the gains vary across the fragmented device ecosystem. Pixel 8's Tensor G3 handles GPU delegation well; mid-range Snapdragon chips can actually *regress* in performance with NNAPI due to driver overhead. Profile per-device and fall back to CPU gracefully.
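"Profile per-device and fall back gracefully" can be as simple as racing the candidate backends with a short warm-up generation and keeping the fastest. In this sketch, `tokensPerSec` is a stand-in for a real throughput measurement against the loaded model:

```kotlin
enum class Backend { NNAPI, GPU_VULKAN, CPU }

// Run a brief warm-up generation on each candidate backend and keep the
// fastest; CPU is the fallback if every measurement fails.
fun selectBackend(tokensPerSec: (Backend) -> Double): Backend =
    Backend.values().maxByOrNull(tokensPerSec) ?: Backend.CPU
```

Cache the result per device model so you only pay the benchmarking cost on first launch.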

## Step 5: Thermal Throttling — Adaptive Generation

The docs do not mention this, but sustained inference generates heat. After 60–90 seconds of continuous generation, thermal throttling can drop your token rate by 40–60%.

Monitor thermal state through platform APIs (`ProcessInfo.ThermalState` on iOS, `PowerManager.THERMAL_STATUS_*` on Android) and implement adaptive generation:



```kotlin
// Returns adjusted generation params, or null when generation should be
// suspended and the user notified.
fun adaptToThermals(params: GenerationParams): GenerationParams? =
    when (currentThermalState()) {
        ThermalState.NOMINAL -> params.copy(throttleMs = 0)
        ThermalState.FAIR -> params.copy(throttleMs = 15)
        ThermalState.SERIOUS -> params.copy(throttleMs = 50, nPredict = 128)
        ThermalState.CRITICAL -> null // suspend generation, notify user
    }
```


Same principle behind any sustained mobile workload — deliberate pacing beats brute force. When I use [HealthyDesk](https://play.google.com/store/apps/details?id=com.healthydesk) to remind me to take breaks during long coding sessions, it is the same idea applied to humans: sustained output requires pacing.

## Step 6: Structured Output Parsing

For app-integrated features, use constrained grammar sampling (llama.cpp's GBNF grammars) to force valid JSON output. Parse it in the shared KMP layer using `kotlinx.serialization`. This eliminates retry loops and makes on-device LLM output as reliable as any API response.
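As a flavor of what constrained sampling looks like, here is a minimal GBNF sketch that forces output into a fixed two-field JSON object (the `label`/`score` field names are illustrative, not a schema from the source):

```
root   ::= "{" ws "\"label\"" ws ":" ws string ws "," ws "\"score\"" ws ":" ws number ws "}"
string ::= "\"" [a-zA-Z0-9 _-]* "\""
number ::= [0-9]+ ("." [0-9]+)?
ws     ::= [ \t\n]*
```

A grammar like this is supplied at sampling time (for example, via `--grammar-file` in the llama.cpp CLI), so the model physically cannot emit tokens that violate the structure, and the `kotlinx.serialization` decode on the other side never sees malformed JSON.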

## Gotchas

1. **Never allocate model weights on the heap.** Always use mmap. Let the OS manage paging, and your app survives memory pressure instead of getting killed.
2. **Do not trust NNAPI blindly on Android.** Mid-range Snapdragon drivers can cause performance regression. Always benchmark CPU fallback against accelerator delegation per device.
3. **Instrument thermal state from day one.** Demos run for 10 seconds; production runs for minutes. If you skip this, your first user complaint will be about the app freezing after a minute of use.
4. **Default to Q4_K_M.** Only step up to Q5_K_S when you have confirmed hardware headroom and a quality-critical use case.

## Wrapping Up

On-device inference on mobile is real and production-viable today with 3B-parameter models. The architecture — KMP shared module, mmap loading, adaptive thermal management, and constrained output parsing — gives you zero-latency responses, offline functionality, and data privacy that no API call can match. Start with Q4_K_M on flagship devices, instrument everything, and build up from there.

For deeper reference, check the [llama.cpp documentation](https://github.com/ggerganov/llama.cpp) and the [Kotlin Multiplatform docs](https://kotlinlang.org/docs/multiplatform.html).