I shipped an Android app whose daily insight is written by Gemini Nano running entirely on the phone — no cloud, no prompt or output ever leaving the device. The integration looked trivial in the docs. The production reality had three hard edges that reshaped my architecture. This is the writeup I wish I'd had before I started.
Quick context: Tawen reads sleep, HRV, and activity from Health Connect, computes a readiness score (0–100) on-device, and then uses the ML Kit GenAI Prompt API (which sits on AICore, Android's system service for on-device foundation models) to explain that score in plain English. Health data, prompts, and outputs all stay on the device.
Here's what actually bit.
1. Inference is foreground-only, and it's enforced
My first design pre-generated the day's narrative in a WorkManager job so it'd be instant when the user opened the app. Clean, idiomatic, and completely wrong: the GenAI API returns ErrorCode.BACKGROUND_USE_BLOCKED the moment you call it without a visible UI — including from a foreground service. AICore deliberately refuses inference unless your app is the top foreground application.
This isn't a quota you can request around; it's a design constraint. So the architecture inverts. Everything that can be computed without the model is precomputed (the score itself; more on that below), and the narration is generated live, triggered by the screen that needs it while that screen is in the foreground. If you're planning to "warm up" an on-device LLM in the background, plan for it not to work.
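One way to picture the inverted flow is a small gate that only lets a narration request through while the owning screen reports itself visible. This is a minimal sketch, not the app's actual code: `NarrationGate` and the `generate` lambda are hypothetical names standing in for whatever wraps the real model call, and the gate does not replace handling BACKGROUND_USE_BLOCKED when the platform returns it anyway.

```kotlin
// Hypothetical names throughout: NarrationGate and the generate lambda are
// illustrative stand-ins, not part of ML Kit or AICore.
class NarrationGate(private val generate: (String) -> String) {
    @Volatile
    private var screenVisible = false

    // Call these from the screen's lifecycle hooks (for example onStart/onStop).
    fun onScreenVisible() { screenVisible = true }
    fun onScreenHidden() { screenVisible = false }

    // Returns null instead of touching the model when no screen is showing.
    fun narrate(prompt: String): String? =
        if (screenVisible) generate(prompt) else null
}

fun main() {
    var modelCalls = 0
    val gate = NarrationGate { prompt -> modelCalls++; "narrative for: $prompt" }

    println(gate.narrate("score=72"))   // null: nothing is on screen yet
    gate.onScreenVisible()
    println(gate.narrate("score=72"))   // narrative for: score=72
    gate.onScreenHidden()
    println(gate.narrate("score=72"))   // null again
    println(modelCalls)                 // 1
}
```

The point of the shape is that a background job has no way to reach the model: the only entry point is one the visible screen opens.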
2. AICore is effectively single-threaded — concurrency returns BUSY
The shared Gemini Nano model on the device is a single resource, and AICore serializes access to it. Fire two inference calls close together — say, two composables that each want a narrative — and the second comes back ErrorCode.BUSY. There's also PER_APP_BATTERY_USE_QUOTA_EXCEEDED for longer-horizon overuse.
The fix that made this stable was to stop letting UI call the model directly. Every Nano request goes through a single-owner inference queue: one coroutine owns the model, requests are serialized, each has a hard timeout, and callers await a result instead of racing for the resource. Treat the on-device model like a single serial device (because it is), not like a stateless cloud endpoint you can fan out to.
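The single-owner idea can be sketched roughly as follows. To keep the example runnable on the JVM standard library alone, it swaps the owning coroutine for a single-thread executor; the serialization property is the same, but this is not the app's implementation. `InferenceQueue` and the `generate` lambda are hypothetical names, and the timeout here is measured from submission, so it includes time spent waiting in the queue.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.ExecutionException
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeoutException
import java.util.concurrent.atomic.AtomicInteger
import kotlin.concurrent.thread

// Hypothetical sketch: InferenceQueue and the generate lambda are illustrative
// names, not ML Kit APIs. One worker thread stands in for the owning coroutine.
class InferenceQueue(
    private val timeoutMs: Long,
    private val generate: (String) -> String,
) {
    // A single worker thread means at most one model call is in flight.
    private val worker = Executors.newSingleThreadExecutor()

    // Blocks the caller until a result or timeout; call it off the main thread.
    fun request(prompt: String): Result<String> {
        val future = worker.submit(Callable { generate(prompt) })
        return try {
            Result.success(future.get(timeoutMs, TimeUnit.MILLISECONDS))
        } catch (e: TimeoutException) {
            future.cancel(true) // interrupt the stuck call so the queue moves on
            Result.failure(e)
        } catch (e: ExecutionException) {
            Result.failure(e.cause ?: e) // surface the model's own error
        }
    }

    fun shutdown() { worker.shutdownNow() }
}

fun main() {
    val inFlight = AtomicInteger(0)
    val maxInFlight = AtomicInteger(0)
    val queue = InferenceQueue(timeoutMs = 2_000) { prompt ->
        val now = inFlight.incrementAndGet()
        maxInFlight.accumulateAndGet(now) { a, b -> maxOf(a, b) }
        Thread.sleep(50) // pretend the model is busy
        inFlight.decrementAndGet()
        "narrative for $prompt"
    }
    // Four callers race, as four composables might.
    val callers = (1..4).map { i -> thread { queue.request("p$i") } }
    callers.forEach { it.join() }
    println(maxInFlight.get()) // 1: calls never overlapped
    queue.shutdown()
}
```

Because callers only ever see a `Result`, a timeout or a platform error becomes one code path on the UI side: show the fallback text.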
3. Different Nano versions give different output — so don't let the model own anything that must be stable
The docs note it plainly: different versions of Gemini Nano can return different output for the same prompt. For a narrative, that's fine — it's prose. But it means the model cannot be the source of truth for anything a user might compare day to day.
This drove the core architectural decision of the app: the score is deterministic; the model only narrates it. A plain rule-based engine computes the readiness score from five weighted signals. Gemini Nano writes the explanation about that score and never computes it. The benefits compound:
- The number is identical whether Nano is available or not.
- On devices without Nano (it needs recent hardware), a deterministic rule-based explanation takes its place and the score is unchanged.
- "AI explains a transparent calculation" is a more honest and more debuggable shape than "AI emits a number you can't inspect." I label output as AI-written only when Nano actually wrote it. The rule-based fallback is never called "AI." That honesty turned out to matter more to users than the AI itself.
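The split described above, a deterministic score with optional narration on top, might look something like this. The post does not list the five signals or their weights, so the signal names, the weights, and the fallback wording below are all invented for illustration; `narrate` is a hypothetical hook that returns null when the model is unavailable or fails.

```kotlin
// Hypothetical signal names and weights: the post says "five weighted signals"
// but does not name them, so everything here is an illustrative assumption.
data class Signals(
    val sleepDuration: Int,     // each signal is pre-normalized to 0..100
    val sleepQuality: Int,
    val hrv: Int,
    val restingHeartRate: Int,
    val activityLoad: Int,
)

// Integer arithmetic only, so the same inputs always give the same score.
fun readinessScore(s: Signals): Int {
    val weighted = 30 * s.sleepDuration + 20 * s.sleepQuality + 25 * s.hrv +
        15 * s.restingHeartRate + 10 * s.activityLoad   // weights sum to 100
    return ((weighted + 50) / 100).coerceIn(0, 100)     // round to nearest
}

// Deterministic explanation used whenever the model did not write the text.
fun ruleBasedText(score: Int): String = when {
    score >= 70 -> "Readiness is high ($score). Recovery signals look solid."
    score >= 40 -> "Readiness is moderate ($score). Consider a lighter day."
    else -> "Readiness is low ($score). Prioritize rest."
}

data class Insight(val score: Int, val text: String, val aiWritten: Boolean)

// The model never sees the raw signals here; it only narrates the finished score.
fun buildInsight(s: Signals, narrate: ((Int) -> String?)?): Insight {
    val score = readinessScore(s)
    val aiText = narrate?.invoke(score)
    return if (aiText != null) Insight(score, aiText, aiWritten = true)
    else Insight(score, ruleBasedText(score), aiWritten = false)
}

fun main() {
    val today = Signals(80, 70, 60, 90, 50)
    val withoutNano = buildInsight(today, narrate = null)
    val withNano = buildInsight(today, narrate = { score -> "Model-written note about a score of $score." })

    println(withoutNano.score)      // 72
    println(withNano.score)         // 72
    println(withoutNano.aiWritten)  // false
    println(withNano.aiWritten)     // true
}
```

The `aiWritten` flag is what lets the UI label text as AI-written only when the model actually produced it.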
What I'd tell my past self
- Read the error codes first, design second. BACKGROUND_USE_BLOCKED, BUSY, and the version-variance note are not edge cases; they're the shape of the platform. Designing around them up front would have saved me a rewrite.
- Keep a deterministic core. Let the model do the soft, fuzzy, language part. Anything that needs to be stable, reproducible, or comparable should live in code you control. The fallback path you get for free is worth the discipline on its own.
- On-device AI's real product win is privacy, not magic. The reason to do this isn't that Nano is smarter than a cloud model (it isn't). It's that a sentence about someone's sleep and heart-rate data can be generated without that data ever leaving their phone. Build around that and the architecture mostly designs itself.
If you're integrating ML Kit GenAI / Gemini Nano, I'm happy to compare notes — the foreground-only and single-owner-queue parts especially. The official docs are here; everything above is what they don't quite prepare you for.