---
title: "Profile-Guided Optimization: Android Cold Start Under 400ms"
published: true
description: "A hands-on walkthrough of Baseline Profiles, Cloud Profiles, and DEX layout reordering — the pipeline that cut our cold start from 1.2s to 380ms."
tags: android, kotlin, mobile, performance
canonical_url: https://blog.mvpfactory.co/profile-guided-optimization-android-cold-start-under-400ms
---
## What We're Building
By the end of this tutorial, you'll have a complete profile-guided optimization pipeline that seeds ART's ahead-of-time compiler at install time, reorders your DEX layout for minimal page faults, and validates everything with Macrobenchmark. We took a 1.2s cold start down to 380ms with this exact setup. Let me show you each layer.
## Prerequisites
- Android Gradle Plugin 7.3+ (R8 enabled by default)
- Macrobenchmark library (`androidx.benchmark:benchmark-macro-junit4`)
- A physical device or managed device for profiling (emulators skew results)
- Familiarity with Perfetto traces (helpful, not required)
## Step 1: Understand What ART Does Without You
ART runs in multiple compilation modes, and the one your users hit on first launch is the worst one:
| Mode | When It Runs | Startup Impact |
|------|-------------|----------------|
| Interpret-only | First install, no profile | Slowest: bytecode interpreted at runtime |
| Speed-profile | After profile collection (idle maintenance) | Fast for profiled methods, slow for the rest |
| Speed | Full AOT (rare, OEM-triggered) | Fastest but largest on-disk footprint |
On a fresh install with no Baseline Profile, ART defaults to interpret-only. The JIT compiles hot methods at runtime, writes a profile to disk, and only during idle device maintenance does `bg-dexopt` AOT-compile your critical path. Your user's first session — the one that determines retention — runs on the slowest mode. That's the gap we fill.
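To feel the difference between these modes on your own device, you can force a compiler filter by hand. The sketch below only wraps the `adb shell cmd package compile` command. The helper names (`compileCommand`, `resetCommand`, `run`) are hypothetical, and it assumes `adb` is on your `PATH` with exactly one device attached.

```kotlin
// Builds the adb invocation that forces ART to recompile a package with the
// given compiler filter, for example "speed-profile" or "speed".
fun compileCommand(packageName: String, filter: String): List<String> =
    listOf("adb", "shell", "cmd", "package", "compile", "-m", filter, "-f", packageName)

// Builds the adb invocation that resets the package's compilation state,
// discarding collected profile data.
fun resetCommand(packageName: String): List<String> =
    listOf("adb", "shell", "cmd", "package", "compile", "--reset", packageName)

// Runs a command, streams its output to this console, and returns the exit code.
fun run(command: List<String>): Int =
    ProcessBuilder(command).inheritIO().start().waitFor()
```

Calling `run(resetCommand("com.example.app"))` before a launch approximates the fresh-install case, while `run(compileCommand("com.example.app", "speed"))` gives you the full-AOT upper bound to compare against.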
## Step 2: Generate Baseline Profiles in CI
A Baseline Profile lists classes and methods to AOT-compile at install time. Here is the minimal setup to get this working with Macrobenchmark:
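```kotlin
import androidx.benchmark.macro.junit4.BaselineProfileRule
import androidx.test.ext.junit.runners.AndroidJUnit4
import androidx.test.uiautomator.By
import androidx.test.uiautomator.Until
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4::class)
class StartupProfile {
    @get:Rule
    val rule = BaselineProfileRule()

    @Test
    fun generateBaselineProfile() = rule.collect(
        packageName = "com.example.app",
        maxIterations = 5,
        stableIterations = 3
    ) {
        // Cold-launch the default activity, then wait until the feed has content
        pressHome()
        startActivityAndWait()
        device.findObject(By.res("main_feed"))
            .wait(Until.hasObject(By.res("feed_item")), 5000)
    }
}
```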
Our generated profile covered about 12% of total DEX methods, but those methods accounted for 94% of wall-clock time in the first 500ms of startup. Wire profile generation into your release pipeline; it is not a one-time task.
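For the profile to ship with the app, the app module has to consume what the generator module produces. Here is a minimal sketch of the app-side Gradle Kotlin DSL, assuming you use the `androidx.baselineprofile` Gradle plugin (which needs AGP 8.0 or newer, above the minimum listed in the prerequisites) and that the test class above lives in a module named `:baselineprofile`. The module name and the `profileinstaller` version are assumptions, so adjust them to your project.

```kotlin
// app/build.gradle.kts (sketch, not a complete build file)
plugins {
    id("com.android.application")
    id("androidx.baselineprofile")
}

dependencies {
    // Installs the bundled profile on devices where the store does not do it
    implementation("androidx.profileinstaller:profileinstaller:1.3.1")

    // Hypothetical module name: the module that contains StartupProfile
    baselineProfile(project(":baselineprofile"))
}
```

With this in place, a CI step can run the plugin's profile generation task (`:app:generateBaselineProfile` in the setups I have seen) before assembling the release build.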
## Step 3: Enable DEX Layout Reordering
Here is the gotcha that will save you hours. Baseline Profiles alone got us from 1,204ms to 620ms. Adding DEX startup layout reordering pushed us to 445ms: another 175ms, or 28% off what remained after the profile. It is a one-line Gradle property:
```properties
android.enableStartupDex=true
```
This tells R8 to reorder classes in `classes.dex` so startup-critical classes are physically contiguous. It matters because of page faults: a 4KB memory page loaded from disk holds multiple class definitions, so scattered startup classes force the kernel to load pages that are mostly irrelevant data. Reordering cut our median page faults from 1,210 to 780, about 36%.
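The page-fault argument is easy to check with a toy model. The sketch below is not how R8 lays out a DEX file. It assumes every class occupies exactly 512 bytes (so eight fit in a 4KB page) and uses a made-up app of 1,000 classes with 120 startup classes spread as badly as possible. All names and numbers in it are hypothetical.

```kotlin
const val PAGE_SIZE = 4096   // bytes per memory page
const val CLASS_SIZE = 512   // hypothetical: every class definition occupies 512 bytes

// layout[i] is the id of the class stored at position i in the file.
// Returns how many distinct pages must be loaded to read every startup class.
fun pagesTouched(layout: List<Int>, startup: Set<Int>): Int =
    layout.withIndex()
        .filter { (_, classId) -> classId in startup }
        .flatMap { (position, _) ->
            val firstPage = position * CLASS_SIZE / PAGE_SIZE
            val lastPage = (position * CLASS_SIZE + CLASS_SIZE - 1) / PAGE_SIZE
            firstPage..lastPage
        }
        .toSet()
        .size

// Moves startup classes to the front, keeping relative order (sortedBy is stable).
fun reorder(layout: List<Int>, startup: Set<Int>): List<Int> =
    layout.sortedBy { if (it in startup) 0 else 1 }

fun main() {
    val defaultLayout = (0 until 1000).toList()
    // Worst case: 120 startup classes, one in every group of eight
    val startup = (0 until 1000 step 8).take(120).toSet()

    println(pagesTouched(defaultLayout, startup))                   // 120
    println(pagesTouched(reorder(defaultLayout, startup), startup)) // 15
}
```

Real layouts are never this extreme in either direction, which is why the measured reduction is far smaller than the 8x the toy model shows.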
## Step 4: Let Cloud Profiles Compound
On API 28 and above, Google Play aggregates anonymized runtime profiles from existing users and delivers them to new installs. Combined with the previous steps, here are our measured numbers:
| Configuration | Page Faults (Median) | Cold Start (P50) | Cold Start (P95) |
|--|--|--|--|
| No profile, default layout | 1,847 | 1,204ms | 1,890ms |
| Baseline Profile only | 1,210 | 620ms | 980ms |
| Baseline + DEX reorder | 780 | 445ms | 710ms |
| Baseline + DEX reorder + Cloud | 690 | 380ms | 590ms |
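To see what each layer contributes on its own, compare every row with the one before it. This small sketch does that arithmetic for the P50 column. The type and function names are mine, and the numbers are copied from the table.

```kotlin
import java.util.Locale

data class Measurement(val config: String, val coldStartP50Ms: Int)

data class Gain(val config: String, val savedMs: Int, val savedPercent: Double)

// Compares each configuration with the one directly before it
fun incrementalGains(rows: List<Measurement>): List<Gain> =
    rows.zipWithNext { previous, current ->
        val savedMs = previous.coldStartP50Ms - current.coldStartP50Ms
        Gain(current.config, savedMs, 100.0 * savedMs / previous.coldStartP50Ms)
    }

fun main() {
    val rows = listOf(
        Measurement("No profile, default layout", 1204),
        Measurement("Baseline Profile only", 620),
        Measurement("Baseline + DEX reorder", 445),
        Measurement("Baseline + DEX reorder + Cloud", 380),
    )
    for (gain in incrementalGains(rows)) {
        println(String.format(Locale.ROOT, "%-32s -%d ms (%.1f%%)", gain.config, gain.savedMs, gain.savedPercent))
    }
}
```

The steps come out at 584ms (48.5%), 175ms (28.2%), and 65ms (14.6%): the Baseline Profile is the largest single win, and each later layer still removes a meaningful share of what is left.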
## Step 5: Validate With Macrobenchmark
The docs do not mention this, but `CompilationMode.Full` will make your numbers look great and tell you nothing useful. Use `Partial` to simulate real-world conditions:
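```kotlin
import androidx.benchmark.macro.BaselineProfileMode
import androidx.benchmark.macro.CompilationMode
import androidx.benchmark.macro.StartupMode
import androidx.benchmark.macro.StartupTimingMetric
import androidx.benchmark.macro.junit4.MacrobenchmarkRule
import org.junit.Rule
import org.junit.Test

class StartupBenchmark {
    @get:Rule
    val benchmarkRule = MacrobenchmarkRule()

    @Test
    fun startupCold() = benchmarkRule.measureRepeated(
        packageName = "com.example.app",
        metrics = listOf(StartupTimingMetric()),
        iterations = 10,
        startupMode = StartupMode.COLD,
        // Compile only what the Baseline Profile lists; fail if no profile is found
        compilationMode = CompilationMode.Partial(
            baselineProfileMode = BaselineProfileMode.Require
        )
    ) {
        startActivityAndWait()
    }
}
```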
Benchmark across API level brackets: 24–27 (no Cloud Profile support), 28–30 (Cloud Profiles, older ART), and 31–35 (latest ART with improved compilation). A profile that cuts cold start by 60% on API 33 may only yield 30% on API 26.
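One way to turn those brackets into a release gate is to give each bracket its own budget and fail the build when any device in it goes over. The sketch below assumes you have already extracted a median cold start per API level from the benchmark output. The bracket names, the budgets, and the sample medians are all hypothetical.

```kotlin
// API brackets from the text; cloudProfiles marks where Play can deliver Cloud Profiles
enum class ApiBracket(val levels: IntRange, val cloudProfiles: Boolean) {
    PRE_CLOUD(24..27, false),
    CLOUD_OLDER_ART(28..30, true),
    LATEST_ART(31..35, true);

    companion object {
        fun of(apiLevel: Int): ApiBracket? = values().firstOrNull { apiLevel in it.levels }
    }
}

// Returns one message per bracket whose worst median cold start exceeds its budget.
// API levels outside every bracket, and brackets without a budget, are ignored.
fun budgetViolations(
    medianMsByApiLevel: Map<Int, Double>,
    budgetMs: Map<ApiBracket, Double>,
): List<String> =
    medianMsByApiLevel.entries
        .groupBy({ ApiBracket.of(it.key) }, { it.value })
        .mapNotNull { (bracket, medians) ->
            if (bracket == null) return@mapNotNull null
            val budget = budgetMs[bracket] ?: return@mapNotNull null
            val worst = medians.maxOrNull() ?: return@mapNotNull null
            if (worst > budget) "$bracket: $worst ms exceeds budget of $budget ms" else null
        }

fun main() {
    val budgets = mapOf(
        ApiBracket.PRE_CLOUD to 900.0,        // hypothetical budgets
        ApiBracket.CLOUD_OLDER_ART to 450.0,
        ApiBracket.LATEST_ART to 450.0,
    )
    val medians = mapOf(26 to 840.0, 30 to 470.0, 33 to 380.0) // hypothetical results
    budgetViolations(medians, budgets).forEach(::println)
    // prints "CLOUD_OLDER_ART: 470.0 ms exceeds budget of 450.0 ms"
}
```

Keeping a looser budget for the oldest bracket reflects the point above: the same profile buys less on older ART, so a single global threshold either blocks every release or hides regressions on newer devices.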
## Gotchas
- **Stale profiles kill gains silently.** Generate profiles in CI on every release. I've seen teams lose 200ms over a few months from profile drift alone.
- **R8 invalidates your profile.** Method inlining causes inlined methods to disappear from DEX. Always regenerate profiles post-R8.
- **Cloud Profiles lag 24–48 hours.** Early adopters — your most engaged users — get no benefit. Baseline Profiles give you deterministic, day-zero coverage. Relying solely on Cloud Profiles is a mistake.
- **Watch the right Perfetto trace points:** `bindApplication` (framework init), `activityStart` (your `onCreate`), and `reportFullyDrawn` (your declared "ready" signal).
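Profile drift is easier to catch if CI reports it. As a rough sketch, you can parse the human-readable profile (`baseline-prof.txt`, one rule per line: optional `H`/`S`/`P` flags, a class descriptor, and `->` plus a signature for method rules) and flag rules whose class is gone. How you obtain the set of live class descriptors is up to you, and it must use the same naming as the profile, so map obfuscated names back first. The function names are hypothetical, and wildcard rules are not handled.

```kotlin
// One rule from a human-readable Baseline Profile
data class ProfileRule(val flags: String, val classDescriptor: String, val method: String?)

private val RULE = Regex("""^([HSP]*)(L[^;]+;)(?:->(.+))?$""")

// Parses rule lines; blank lines, comments, and unrecognised lines are skipped
fun parseRules(profileText: String): List<ProfileRule> =
    profileText.lineSequence()
        .map { it.trim() }
        .mapNotNull { RULE.matchEntire(it) }
        .map { match ->
            val (flags, classDescriptor, method) = match.destructured
            ProfileRule(flags, classDescriptor, method.ifEmpty { null })
        }
        .toList()

// Rules whose class is not among the descriptors present in the current build
fun staleRules(rules: List<ProfileRule>, liveClasses: Set<String>): List<ProfileRule> =
    rules.filter { it.classDescriptor !in liveClasses }

fun main() {
    val profile = """
        HSPLcom/example/app/MainActivity;->onCreate(Landroid/os/Bundle;)V
        HSPLcom/example/app/FeedAdapter;->bind(I)V
        Lcom/example/app/FeedRepository;
    """.trimIndent()
    val live = setOf("Lcom/example/app/MainActivity;", "Lcom/example/app/FeedRepository;")

    staleRules(parseRules(profile), live).forEach { println(it.classDescriptor) }
    // prints "Lcom/example/app/FeedAdapter;"
}
```

A rising count of stale rules between releases is a cheap early signal that the profile needs regenerating, well before the regression shows up in startup numbers.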
## Wrapping Up
Here is the pattern I use in every project: Baseline Profile generation in CI, DEX reorder enabled, Macrobenchmark gating the release across API brackets. Each layer compounds on the last. Skip any one of them and you leave real performance on the table. Start with Step 2, measure the delta, then stack each layer and watch the numbers drop.