Compile-Time Memory Layout Optimization for On-Device ML Models


---
title: "ART Memory Tuning: Cut On-Device ML GC Pauses 90%"
published: true
description: "Profile-guided allocation, object pinning, and RegionSpace tuning eliminate GC stalls during on-device ML inference. Practical ART memory strategies that actually work."
tags: android, kotlin, architecture, mobile
canonical_url: https://mvpfactory.co/blog/art-memory-tuning-cut-on-device-ml-gc-pauses-90
---

## What We're Building

Let me show you a pattern I use in every project that runs ML inference on Android. By the end of this tutorial, you'll know how to eliminate up to 90% of GC-induced frame drops during on-device inference using three concrete strategies: ART profile-guided compilation hints, large object space pinning, and JNI boundary isolation.

Most teams misdiagnose inference jank as a model performance problem. It's not. It's an allocation pattern problem — and I'll walk you through fixing it step by step.

## Prerequisites

- Familiarity with Android development in Kotlin
- An on-device ML pipeline (TFLite, ONNX Runtime, or MediaPipe)
- Android Studio with a debuggable build variant
- ADB access for profiling GC stats

## Step 1: Understand Where Your Allocations Land

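Before changing anything, you need to know what ART's Concurrent Copying (CC) collector does with your tensors. Here is the minimal mental model; treat the size boundaries as approximations, since the exact thresholds depend on the ART version and runtime configuration: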

| Allocation event | Where it lands | GC risk |
|---|---|---|
| Small tensors (<12KB) | RegionSpace TLAB | Low — thread-local, fast |
| Medium tensors (12KB-128KB) | RegionSpace shared regions | Medium — contention + region exhaustion |
| Large tensors (>128KB) | Large Object Space (LOS) | High — LOS collections are expensive |
| JNI native buffers | Native heap (outside ART) | None — invisible to GC |

The docs don't mention this, but most inference frameworks allocate intermediate buffers in the 12KB-256KB range. That's the danger zone where RegionSpace fills quickly and LOS triggers costly collections. I've seen blocking pauses from 5ms to 40ms here, enough to blow a 16ms frame budget.
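To make the table concrete, here is a minimal sketch that sorts a buffer size into the buckets above. The enum, the two limit values, and `floatTensorBytes` are hypothetical names for illustration, and the thresholds are this article's approximations rather than constants read from ART.

```kotlin
// Buckets mirror the table above; thresholds are approximations, not ART constants.
enum class AllocBucket { TLAB, SHARED_REGION, LARGE_OBJECT_SPACE }

val smallLimitBytes = 12 * 1024
val largeLimitBytes = 128 * 1024

// Maps an allocation size to the bucket the table predicts for it.
fun bucketFor(sizeBytes: Int): AllocBucket = when {
    sizeBytes < smallLimitBytes -> AllocBucket.TLAB
    sizeBytes <= largeLimitBytes -> AllocBucket.SHARED_REGION
    else -> AllocBucket.LARGE_OBJECT_SPACE
}

// Hypothetical helper: bytes needed for a float32 tensor of the given shape.
fun floatTensorBytes(vararg shape: Int): Int = shape.fold(4) { acc, dim -> acc * dim }
```

For example, a 1x64x64x3 float32 activation is 49,152 bytes, so `bucketFor(floatTensorBytes(1, 64, 64, 3))` lands in `SHARED_REGION`, the medium-risk row.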

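Profile first. I use `adb shell setprop dalvik.vm.gcstats 1` to capture allocation rates during inference; if your build ignores that property, ART's GC lines in logcat and the `Debug.getRuntimeStat` counters give you the same picture. Target that 12KB-256KB range.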
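If you would rather measure from inside the app, the sketch below wraps a block of work and reports how many collections it triggered. It assumes the `art.gc.gc-count` and `art.gc.blocking-gc-time` counters exposed by `android.os.Debug.getRuntimeStat` (API 23+); `GcSnapshot`, `snapshot`, and `measureGc` are hypothetical names, and the stat reader is injected so the code also compiles off-device.

```kotlin
// Snapshot of two GC counters. The runtime reports them as strings.
data class GcSnapshot(val gcCount: Long, val blockingGcTimeMs: Long)

// `readStat` is injected; on Android, pass android.os.Debug::getRuntimeStat.
fun snapshot(readStat: (String) -> String?): GcSnapshot = GcSnapshot(
    gcCount = readStat("art.gc.gc-count")?.toLongOrNull() ?: 0L,
    blockingGcTimeMs = readStat("art.gc.blocking-gc-time")?.toLongOrNull() ?: 0L
)

// Runs `block` and returns the change in both counters across it.
fun measureGc(readStat: (String) -> String?, block: () -> Unit): GcSnapshot {
    val before = snapshot(readStat)
    block()
    val after = snapshot(readStat)
    return GcSnapshot(
        gcCount = after.gcCount - before.gcCount,
        blockingGcTimeMs = after.blockingGcTimeMs - before.blockingGcTimeMs
    )
}
```

On a device this would be called as `measureGc(Debug::getRuntimeStat) { session.runInference(input) }`, where `session` stands in for whatever inference entry point you have. Run it over a few hundred frames, because a single frame rarely triggers a collection on its own.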

## Step 2: Add Profile-Guided Allocation Hints

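This is the lowest-effort, highest-impact change you can make. Baseline profiles tell ART which methods to compile ahead of time at install, so the hot inference path runs as optimized code from the first launch instead of passing through the interpreter and JIT while your model is already producing frames. That matters for GC because the optimizing compiler can remove some temporary allocations that interpreted code always performs.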

```text
# In your baseline profile rules (baseline-prof.txt)
# Mark inference-heavy methods as hot (H), startup (S), and post-startup (P)
HSPLcom/myapp/ml/InferenceSession;->runInference([F)[F
HSPLcom/myapp/ml/TensorBuffer;-><init>(I)V
```

ART compiles the profiled methods ahead of time, and the compiled code puts less pressure on TLABs and shared regions than the same path running interpreted. This alone can cut minor GC events by roughly 30-40% during inference bursts, though you should confirm the figure on your own devices. Most teams simply forget to include ML pipeline classes in their profile rules.
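Profile rules are easy to get subtly wrong by hand, since they use JVM descriptors rather than Kotlin signatures. As a sketch, a small formatter like the one below can generate them from a class name, a method name, and a descriptor. `profileRule` is a hypothetical helper, not part of any Android tooling.

```kotlin
// Builds one profile rule: flags + class descriptor + "->" + method name + method descriptor.
// Flags: H = hot, S = startup, P = post-startup.
fun profileRule(
    className: String,
    method: String,
    descriptor: String,
    flags: String = "HSP"
): String {
    require(flags.isNotEmpty() && flags.all { it in "HSP" }) { "flags must be drawn from H, S, P" }
    // com.myapp.ml.Foo becomes Lcom/myapp/ml/Foo;
    val classDescriptor = "L" + className.replace('.', '/') + ";"
    return flags + classDescriptor + "->" + method + descriptor
}
```

Calling `profileRule("com.myapp.ml.InferenceSession", "runInference", "([F)[F")` yields the first rule shown above. Constructors use the method name `<init>`, so a `TensorBuffer(Int)` constructor is written with descriptor `(I)V`.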

## Step 3: Pin Large Objects with Direct ByteBuffers

For tensor I/O, use direct `ByteBuffer` allocations that bypass RegionSpace entirely:

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Use direct ByteBuffers for large tensor I/O (4 bytes per float32 element)
val inputBuffer = ByteBuffer.allocateDirect(modelInputSize * 4)
    .order(ByteOrder.nativeOrder())

// The backing storage is never relocated by the CC collector,
// and native code can read and write it without a copy
val outputBuffer = ByteBuffer.allocateDirect(modelOutputSize * 4)
    .order(ByteOrder.nativeOrder())
```

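This removes the CC collector's copy overhead for large buffers, provided you allocate them once and reuse them rather than creating a fresh pair per frame. For buffers that must remain managed objects, pinning is only reachable through non-public ART internals (`sun.misc.Unsafe` and similar hidden APIs), which are restricted on current Android releases, so treat that route as experimental. Expect roughly a 50-60% GC pause reduction from this step alone.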
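The gain only holds if the buffers are reused, so here is a minimal sketch of a wrapper that allocates one direct buffer up front and refills it on every frame. `ReusableTensorBuffer` and `load` are hypothetical names, and the sketch assumes float32 tensors.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.FloatBuffer

// Allocates one direct buffer at construction and reuses it for every frame.
class ReusableTensorBuffer(elementCount: Int) {
    private val bytes: ByteBuffer = ByteBuffer
        .allocateDirect(elementCount * Float.SIZE_BYTES)
        .order(ByteOrder.nativeOrder())

    // Float view over the same storage, created once so load() allocates nothing.
    private val floats: FloatBuffer = bytes.asFloatBuffer()

    // Overwrites the buffer with `values` and returns it positioned at 0, ready to read.
    fun load(values: FloatArray): ByteBuffer {
        require(values.size <= floats.capacity()) { "input larger than buffer" }
        floats.clear()
        floats.put(values)
        bytes.position(0)
        bytes.limit(values.size * Float.SIZE_BYTES)
        return bytes
    }
}
```

Create one instance per model input when the model loads, then call `load(frameData)` inside the frame loop and hand the returned buffer to the interpreter. The same instance comes back every time, so the steady-state path creates no new buffers.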

## Step 4: Push the Pipeline Below the JNI Boundary

Here is the gotcha that will save you hours: most teams run inference through managed Kotlin wrappers that create dozens of intermediate managed objects per frame. The real fix is making the JNI boundary your GC firewall.

```kotlin
class NativeInferenceEngine {
    // All tensor allocation happens in native heap
    external fun initModel(modelPath: String): Long // returns native handle
    external fun runInference(handle: Long, input: FloatArray): FloatArray

    // Only crossing JNI for input/output:
    // intermediate tensors never touch managed heap
    external fun releaseModel(handle: Long)
}
```

Every tensor you keep off the managed heap is a GC pause you'll never see. This is the highest-effort strategy, but it delivers 80-90% pause reduction.
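A raw `Long` handle is easy to leak or release twice, and a leaked native model is memory the GC will never reclaim for you. One way to guard against that is a small owner class, sketched below. `InferenceBackend` and `InferenceSession` are hypothetical names, the interface simply mirrors the three JNI entry points so the lifecycle logic can be tested without a native library, and the sketch assumes `0` is never a valid handle.

```kotlin
// Hypothetical seam over the three JNI entry points.
interface InferenceBackend {
    fun initModel(modelPath: String): Long
    fun runInference(handle: Long, input: FloatArray): FloatArray
    fun releaseModel(handle: Long)
}

// Owns the native handle and releases it at most once.
class InferenceSession(
    private val backend: InferenceBackend,
    modelPath: String
) : AutoCloseable {
    // Assumption: the native side never returns 0 for a live model.
    private var handle: Long = backend.initModel(modelPath)

    fun run(input: FloatArray): FloatArray {
        check(handle != 0L) { "session is closed" }
        return backend.runInference(handle, input)
    }

    override fun close() {
        if (handle != 0L) {
            backend.releaseModel(handle)
            handle = 0L
        }
    }
}
```

In production the backend would be a thin adapter that forwards to the `external` functions. Tying `close()` to a lifecycle callback such as `onDestroy` keeps the native memory from outliving the screen that needed it.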

| Strategy | GC pause reduction | Implementation effort |
|---|---|---|
| Baseline profile hints | 30-40% | Low — profile rules only |
| Direct ByteBuffer for I/O | 50-60% | Medium — buffer management |
| Full JNI-boundary isolation | 80-90% | High — native pipeline |
| All three combined | ~90% | High — but worth it for real-time inference |

## Step 5: Tune RegionSpace for Remaining Managed Allocations

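For managed allocations you can't eliminate, tune RegionSpace behavior through system properties on debug builds or ART runtime flags. None of these knobs are part of the public SDK, so treat them as experiments for debug or custom builds rather than something to ship: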

- Larger regions (512KB vs default 256KB) reduce region exhaustion during bursts
- Increasing thread-local allocation buffer size absorbs more burst allocations before falling back to shared regions
- Adjusting the CC collector urgency threshold prevents premature blocking collections

## Gotchas

- **The 12KB-256KB danger zone**: This is where GC pressure concentrates during inference. Profile this range specifically before optimizing anything else.
- **Forgetting baseline profiles for ML classes**: Your inference pipeline classes need to appear in `baseline-prof.txt`. ART can't optimize what it doesn't know about.
- **Managed wrappers creating hidden allocations**: A single Kotlin convenience layer around your inference engine can generate dozens of managed objects per frame. Audit the allocation path, not just the inference call.
- **Misdiagnosing model performance**: If you see 5-40ms stalls during inference, check GC logs before reaching for a smaller model. The managed heap isn't your enemy — uncontrolled allocation patterns are.
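The hidden-allocation gotcha is easiest to see side by side. The sketch below shows a typical preprocessing step written two ways; both function names are hypothetical, and the 0-255 pixel scaling is just an illustrative workload.

```kotlin
// Allocating version: map() builds a new List<Float> and boxes every element, every frame.
fun normalizeAllocating(pixels: IntArray): List<Float> = pixels.map { it / 255f }

// Allocation-free version: writes into a caller-owned FloatArray reused across frames.
fun normalizeInto(pixels: IntArray, out: FloatArray) {
    require(out.size >= pixels.size) { "output array too small" }
    for (i in pixels.indices) {
        out[i] = pixels[i] / 255f
    }
}
```

The first version reads better, but at camera frame rates it produces one boxed `Float` per pixel per frame. The second produces none, which is the property you want on any code that runs inside the inference loop.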

## Wrapping Up

Start with baseline profiles (Step 2) — it's a few lines in a text file and delivers 30-40% improvement. Then move to direct `ByteBuffer` for I/O. Only invest in full JNI isolation when you need real-time inference alongside UI rendering. Control the allocation pattern, and GC pauses during inference stop being a problem.
