Stable Screenshot Tests and Flow Benchmarks in KMP

#kotlin #multiplatform #testing #ui

In the previous post on architecture, I argued that a good KMP test suite is a consequence of architecture: keep the core framework-free, depend on ports, and most of your behavior becomes pure commonTest code that runs without a device. That covered the fast tiers. This post is about the two slow ones that everybody actually fights with: screenshot tests and on-device benchmarks.

They're the flakiest tiers in any mobile suite. A snapshot that passes on your laptop and fails in CI; a startup benchmark that swings 40% run to run. The usual reaction is to distrust the tools and stop gating on them. But the flakiness almost never comes from Paparazzi or Macrobenchmark — it comes from non-determinism leaking into the thing under test. And the same component separation that powers the WeatherConditions core is exactly what seals those leaks.

Where the flakiness actually comes from

A screenshot diff and a benchmark number are both functions of their inputs. If any input wobbles, the output wobbles:

Source of wobble	Wrecks screenshots	Wrecks benchmarks
Live network / forecast data	different pixels each run	timing dominated by I/O jitter
Real clock (`now()`, "4pm", "in 2h")	timestamps shift the layout	—
Running animations	mid-frame capture	frame timing noise
Async loading order	captured during a spinner	warm-up variance
Font / locale / density differences	sub-pixel + reflow diffs	—

Every one of these is a dependency the UI shouldn't own. WeatherConditions already pushes all of them behind ports in CoreLib — ForecastService, LocationService, Clock. The trick is to let the test exploit that boundary.

The separation that makes pixels deterministic

The rule: the screen is a pure function of state. A composable that takes a UiState and renders it — no ViewModel lookup, no remember { fetch() }, no clock read inside the body. The presenter (which does call PlayabilityCalculator through the ports) lives one layer up.

// Stateless and total — same input always renders the same pixels.
@Composable
fun ScoreScreen(state: ScoreUiState, onRefresh: () -> Unit = {}) {
    when (state) {
        is ScoreUiState.Loading -> ScoreSkeleton()
        is ScoreUiState.Error   -> ErrorPanel(state.message, onRefresh)
        is ScoreUiState.Loaded  -> ScoreCard(state.score)   // ActivityScore + factors
    }
}
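For reference, here is a minimal sketch of the state type and of the step that produces it. The field names are inferred from the snippets in this post, and presentScore is a hypothetical stand-in for the real presenter, which would call the scoring code through the ports and would most likely be a suspend function.

```kotlin
// Types assumed from the surrounding snippets; the real CoreLib shapes may differ.
data class ScoreFactor(val label: String, val delta: Int)
data class ActivityScore(val value: Int, val factors: List<ScoreFactor>)

sealed interface ScoreUiState {
    data object Loading : ScoreUiState
    data class Error(val message: String) : ScoreUiState
    data class Loaded(val score: ActivityScore) : ScoreUiState
}

// Hypothetical presenter step: all I/O and scoring happens here, one layer
// above the composable, and collapses into a single immutable state value.
fun presentScore(loadScore: () -> ActivityScore): ScoreUiState =
    try {
        ScoreUiState.Loaded(loadScore())
    } catch (e: Exception) {
        ScoreUiState.Error(e.message ?: "Unknown error")
    }

fun main() {
    val ok = presentScore { ActivityScore(62, listOf(ScoreFactor("wind", -18))) }
    val failed = presentScore { error("No forecast") }
    println(ok is ScoreUiState.Loaded)                       // true
    println(failed == ScoreUiState.Error("No forecast"))     // true
}
```

Everything that can fail or vary sits inside presentScore; the composable only ever sees the finished value.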

Because ScoreScreen never reaches for data, a screenshot test just hands it a frozen state — no network, no clock, no async:

private val frozenScore = ActivityScore(
    value = 62,
    factors = listOf(
        ScoreFactor("wind", delta = -18),
        ScoreFactor("rain after 4pm", delta = -12),
        ScoreFactor("temperature", delta = +8),
    ),
)

@Test fun scoreScreen_loaded_dark() = paparazzi.snapshot {
    ViewPointTheme(dark = true) {
        ScoreScreen(ScoreUiState.Loaded(frozenScore))
    }
}

Two more things to nail down so the render is bit-stable:

Pin the clock. Anything that prints a time goes through the Clock port; the test injects FixedClock("2026-06-27T16:00Z"). No wall-clock read means no drifting timestamps.
Kill animations. Snapshot with the animation clock paused (Paparazzi/Roborazzi render a single settled frame), so you never catch a transition mid-flight.
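To make the clock point concrete, here is a rough sketch of a fixed clock behind a port. The port shape is an assumption: it uses epoch minutes to stay dependency-free, whereas the FixedClock mentioned above takes an ISO string and the real port probably returns an Instant. relativeLabel is a hypothetical formatter for labels like "in 2h".

```kotlin
// Hypothetical port: the only way UI-facing code learns the current time.
interface Clock {
    fun nowEpochMinutes(): Long
}

// Test double: always reports the same instant.
class FixedClock(private val epochMinutes: Long) : Clock {
    override fun nowEpochMinutes(): Long = epochMinutes
}

// Formats a relative label from the injected clock, never from the system clock.
fun relativeLabel(clock: Clock, eventEpochMinutes: Long): String {
    val delta = eventEpochMinutes - clock.nowEpochMinutes()
    return when {
        delta <= 0 -> "now"
        delta < 60 -> "in ${delta}m"
        else -> "in ${delta / 60}h"
    }
}

fun main() {
    val clock = FixedClock(epochMinutes = 1_000_000)
    println(relativeLabel(clock, 1_000_120)) // in 2h
    println(relativeLabel(clock, 1_000_030)) // in 30m
}
```

With the fixed clock injected, the label rendered into a snapshot is the same on every run and on every machine.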

That's the whole stability story: deterministic inputs in, identical pixels out. A golden now changes only when the UI genuinely changes — which is the entire point of a golden.

Screenshot-testing a flow, not just a widget

The same separation lets you snapshot an entire key user flow by enumerating its states instead of clicking through it. The score→rank journey has four states worth locking down — and because the screen is stateless, you can produce each one directly:

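@Test fun rankFlow_states() {
    snapshotState("loading", RankUiState.Loading)
    snapshotState("loaded",  RankUiState.Loaded(rankedFixture))   // 3 locations, deterministic order
    snapshotState("empty",   RankUiState.Empty)
    snapshotState("error",   RankUiState.Error("No forecast"))
}

// Assumed helper (not shown in the original): one named golden per state.
// RankScreen is a hypothetical stateless composable, like ScoreScreen above.
private fun snapshotState(name: String, state: RankUiState) =
    paparazzi.snapshot(name) { ViewPointTheme(dark = false) { RankScreen(state) } }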

No emulator drive, no waiting on a spinner, no "is it loaded yet" race. Each state of the flow is its own golden, so a regression tells you exactly which state broke. (When you do want real interaction such as scroll, click, or recomposition, Roborazzi runs the same style of golden test through Robolectric on the JVM. Its renderer differs from Paparazzi's, so record its goldens separately. Compose UI tests cover the Android side and simulator tests the iOS side, as in the testing post.)

The deterministic order in rankedFixture matters: rank_locations sorts by score, so the fixture must produce a stable tie-break. That's a domain property — and it's already covered by a fast commonTest, which is why the screenshot can trust it.
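A stable tie-break could look something like the sketch below. RankedLocation and rankLocations are illustrative names, not the actual CoreLib API, and breaking ties by name is just one reasonable choice.

```kotlin
// Hypothetical shape for a ranked entry; the real CoreLib type may differ.
data class RankedLocation(val name: String, val score: Int)

// Highest score first; equal scores fall back to name, so the result never
// depends on the order the inputs arrived in.
fun rankLocations(input: List<RankedLocation>): List<RankedLocation> =
    input.sortedWith(
        compareByDescending<RankedLocation> { it.score }.thenBy { it.name }
    )

fun main() {
    val locations = listOf(
        RankedLocation("Central Park", 62),
        RankedLocation("Battery Park", 62),
        RankedLocation("Prospect Park", 71),
    )
    // Both print [Prospect Park, Battery Park, Central Park]
    println(rankLocations(locations).map { it.name })
    println(rankLocations(locations.reversed()).map { it.name })
}
```

Without the secondary key, two locations with equal scores could swap rows depending on input order, and the golden would flip with them.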

Benchmarking the flow, not just cold start

The testing post measured cold StartupTimingMetric. Useful, but users don't feel "startup" — they feel the flow: open the app, pick a location, watch the ranked list paint. To benchmark that meaningfully you have to remove the one thing that makes the number meaningless — the network.

Separated components make this clean: point the benchmark build's ForecastService at a canned fixture adapter so every run scores the identical data. Now Macrobenchmark measures your compute and rendering, not Google Weather's latency that morning.
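As a sketch of what that canned adapter might look like: the port signature, the Forecast fields, and the fixture values below are all assumptions made for illustration. The real ForecastService is likely a suspend function with richer types.

```kotlin
// Hypothetical port shape and payload; simplified to keep the sketch self-contained.
data class Forecast(val windKph: Int, val rainChancePct: Int, val tempC: Int)

interface ForecastService {
    fun forecastFor(location: String): Forecast
}

// Canned adapter for the benchmark build: no I/O, same answer every run.
class FixtureForecastService(
    private val fixtures: Map<String, Forecast>,
) : ForecastService {
    override fun forecastFor(location: String): Forecast =
        fixtures[location] ?: error("No fixture for '$location'")
}

// Illustrative values only. Sharing one fixture set between benchmark and
// snapshot tests keeps a single source of truth.
val benchmarkFixtures = mapOf(
    "New York City" to Forecast(windKph = 24, rainChancePct = 60, tempC = 21),
)

fun main() {
    val service: ForecastService = FixtureForecastService(benchmarkFixtures)
    val first = service.forecastFor("New York City")
    val second = service.forecastFor("New York City")
    println(first == second) // true
}
```

Failing loudly on an unknown location is deliberate: a benchmark that silently falls back to the network would bring the noise straight back.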

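// TraceSectionMetric has been marked experimental in Macrobenchmark; opt in if your version requires it.
@OptIn(ExperimentalMetricApi::class)
@Test fun rankFlow_frames() = benchmarkRule.measureRepeated(
    packageName = "com.abyxcz.viewpoint",
    metrics = listOf(FrameTimingMetric(), TraceSectionMetric("score_rank")),
    iterations = 10,
    startupMode = StartupMode.WARM,
) {
    startActivityAndWait()
    // By.res only sees Compose test tags when the app enables testTagsAsResourceId.
    device.findObject(By.res("location-input")).text = "New York City"
    device.findObject(By.res("rank-button")).click()
    device.wait(Until.hasObject(By.res("rank-list")), 5_000)
}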

// In production code: a custom trace section scopes the work the benchmark cares about.
trace("score_rank") {
    val ranked = ActivityScorer(registry).rank(contexts, profile)  // CoreLib, shared
    emit(RankUiState.Loaded(ranked))
}

FrameTimingMetric catches jank in the list paint; TraceSectionMetric("score_rank") isolates the scoring compute. With the network swapped out, run-to-run variance collapses from "is this a real regression or just wifi?" to a tight distribution where a true regression actually stands out. Bake the wins in with a Baseline Profile and you've got a flow benchmark you can gate on.
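One way to turn "tight distribution" into something you can gate on is sketched below. The sample timings are made up purely to show the arithmetic, and the 10% tolerance and the isRegression helper are assumptions, not anything Macrobenchmark provides.

```kotlin
import kotlin.math.sqrt

// Coefficient of variation: sample standard deviation relative to the mean.
fun coefficientOfVariation(samples: List<Double>): Double {
    require(samples.size >= 2) { "need at least two samples" }
    val mean = samples.average()
    val variance = samples.sumOf { (it - mean) * (it - mean) } / (samples.size - 1)
    return sqrt(variance) / mean
}

// Hypothetical gate: flag only when the median grows past the tolerance.
fun isRegression(baselineMedianMs: Double, currentMedianMs: Double, tolerance: Double): Boolean =
    currentMedianMs > baselineMedianMs * (1 + tolerance)

fun main() {
    val noisy = listOf(80.0, 120.0, 95.0, 140.0)   // made-up: network in the loop
    val tight = listOf(100.0, 102.0, 99.0, 101.0)  // made-up: canned adapter
    println(coefficientOfVariation(noisy) > coefficientOfVariation(tight)) // true
    println(isRegression(100.0, 104.0, tolerance = 0.10)) // false
    println(isRegression(100.0, 115.0, tolerance = 0.10)) // true
}
```

A tolerance this narrow only makes sense once the run-to-run spread is well below it, which is exactly what removing the network buys you.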

The stability payoff

Tier	Before separation	After separation
Screenshot	flakes on data/time/animation; diffs ignored	golden changes only on real UI change
Flow benchmark	swings with network; too noisy to gate	tight variance; regressions visible
Diagnosis	"something on that screen broke"	"the error state broke" / "scoring got slower"

Both tiers went from "nice in theory, muted in practice" to "trustworthy enough to fail a PR" — without a fancier tool. The leverage was architectural.

Things to keep in mind

Goldens need an owner. Intentional UI changes mean re-recording images; review the diff, don't rubber-stamp it. Record on one canonical environment (CI) so font/density differences don't reintroduce the flakiness you just removed.
Fakes can drift. The benchmark's canned ForecastService and the snapshot fixtures must stay shaped like the real contracts. Generate them from the same commonTest fixtures so there's one source of truth.
One discipline to hold: the screen must stay a pure function of state. The day someone calls a ViewModel or reads the clock inside a composable, determinism leaks back in. An ArchUnit rule ("UI package must not depend on ports/use-cases directly") keeps that honest — same pattern the testing post used to protect the core.
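The ArchUnit rule could be sketched roughly as follows. The package patterns ("..ui..", "..ports..", "..usecase..") are placeholders for whatever the real module layout uses. ArchUnit inspects JVM bytecode, so this only guards the JVM and Android targets.

```kotlin
import com.tngtech.archunit.core.importer.ClassFileImporter
import com.tngtech.archunit.lang.ArchRule
import com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses

// Package patterns are placeholders; adjust them to the actual project layout.
val uiStaysPure: ArchRule =
    noClasses().that().resideInAPackage("..ui..")
        .should().dependOnClassesThat()
        .resideInAnyPackage("..ports..", "..usecase..")
        .because("screens must stay pure functions of UiState")

// Call this from a JVM unit test; check() throws an AssertionError on violations.
fun checkUiStaysPure() {
    val classes = ClassFileImporter().importPackages("com.abyxcz.viewpoint")
    uiStaysPure.check(classes)
}
```

A rule like this catches type-level dependencies. It will not notice a composable calling a platform time API directly, so code review still has a job to do.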

Slow and Stable Wins

Screenshot tests and flow benchmarks aren't flaky because the tools are bad — they're flaky because non-deterministic dependencies are baked into what you're measuring. Separate the stateless UI from the shared core the way WeatherConditions already does, feed it frozen state and canned adapters through the existing ports, and both tiers become deterministic: pixels that only move when the design moves, timings that only move when the code gets slower. The architecture you built for portability turns out to be the same architecture that makes the slow tests trustworthy.