I write a safety-critical Android app that watches a phone 24/7 — motion, GPS, screen activity — and emails the family if an elderly person's behavior suddenly looks wrong. Install it on grandma's phone and forget about it. That's the idea.
I've written here before about a SAM lambda that hung my geocoder for 21 hours, and about the OTAs that silently strip every battery-optimisation exemption I worked to get. This one is sillier than either. This one is about the time my own service-recovery layers were so visibly enthusiastic that the manufacturer's battery manager looked at them and went, "yeah, that's a zombie, kill it harder."
The fix shipped last week. The bug is not really a bug - it's a category of mistake - and I think it's worth writing about because every Android dev I know who builds anything that has to run in the background eventually steps in this exact pile.
The setup that should have worked
Fresh install on an Honor phone running MagicOS 9.0 on top of Android 15. Out of the box, brand new, the friendliest possible test.
The user went through onboarding. Battery optimisation: "Don't optimise." Autostart: on. Recipients added. Setup finished.
My in-app sanity worker fires one hour after setup completes. It's a simple thing — re-runs every permission probe and snapshots the device state into the diagnostic log. At T+1h it reported health=Healthy. Battery exemption present. Foreground service running. IMPORTANCE_MIN notification visible. Everything fine.
About six hours after that, the family received this email:
> Battery optimisation is blocking the app. Tap "Open Settings" below, then set "How Are You?!" to "Don't optimise."
That email is automatic. The app sends it when a runtime health check detects that an exemption it was granted has been quietly revoked. So at some point between hour 1 and hour 7, the OS had said "yes, you have battery exemption" at lunchtime and "no, you don't" by evening, and absolutely nothing the user did caused the change.
The phone had been sitting on a counter the whole time. The app should have been bored.
What actually happened (with timestamps, because Android is a story told in timestamps)
The diagnostic export, when I opened it, was a small horror movie.
At 17:24:30 UTC, MagicOS killed the foreground service. No specific reason in the logs — Honor doesn't bother emitting one. The process just dies.
Then my 11-layer service-recovery chain woke up, exactly as designed. I built that chain over forty-odd app versions, each layer added because the previous ones weren't enough on some specific device. Eleven layers feels excessive until you ship to a Xiaomi user, after which it feels like a starting point.
Within the same wall-clock second, three of those layers fired at once:
```
17:24:31 [Application] Recovery: process started, service not running - starting service
17:24:31 [Application] Network recovery: service not running - starting service
17:24:31 [MonitoringService] onStartCommand startId=2
17:24:31 [MonitoringService] onStartCommand startId=3
```
What you're looking at:
- The process restarted, which fired `Application.onCreate` recovery. That layer noticed the foreground service wasn't running and called `MonitoringService.start()`.
- A network availability callback fired in the same instant. That layer also noticed the service wasn't running and called `MonitoringService.start()`.
- Behind the scenes, a SyncAdapter account-creation poke also fired. Third independent path. Third call.
All three calls reached the framework. The framework happily delivered all three to onStartCommand. My init code ran, ran again, ran a third time. Each invocation re-fired startForeground(). Each invocation re-scheduled the watchdog. Each invocation re-launched the monitoring coroutines.
I was, from the OS's point of view, aggressively visible.
By 19:24 UTC — exactly two hours later — the watchdog reported ServiceRecoveryJobService missing - rescheduling. My battery exemption was gone, and one of my JobScheduler registrations had been ripped out for good measure. The next health check confirmed the regression and shipped the email to the family.
It took MagicOS less than two hours from "this app got killed once" to "this app's exemption is hereby revoked."
The "process resurrection detector"
Here is the unkind realisation that took me an embarrassing length of time to land on.
Honor PowerGenie, Xiaomi MIUI Power Keeper, Samsung Smart Manager — they all run a heuristic that explicitly looks for what mine does: a process that gets killed and immediately tries to come back via multiple unrelated paths.
To the OEM, that pattern is one of two things. It's malware-shaped — rootkit-style autostart, ad-stuck zombies, miners that won't quit. Or it's a "stubborn" app that doesn't take the hint that the OS would prefer it to stay down. Either way, it earns demerits. Each concurrent recovery path that lights up in the same second is one tick on a process-resurrection counter. Hit the threshold and the OEM strips your trust.
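Honor doesn't document the counter, but its shape is easy to guess at. Here is a toy sketch — entirely hypothetical; the class name, window, and threshold are mine, not PowerGenie's — of why three concurrent recovery paths cost three demerits instead of one:

```kotlin
// Hypothetical sketch of a PowerGenie-style resurrection counter.
// The real heuristic is undocumented; this only illustrates the shape:
// every restart attempt observed inside a short window is one demerit,
// and crossing the threshold revokes trust.
class ResurrectionCounter(
    private val windowMs: Long = 1_000L,
    private val threshold: Int = 3,
) {
    private val events = ArrayDeque<Long>()

    /** Record one observed restart attempt at monotonic time [nowMs]. Returns true = "revoke trust". */
    fun onRestartObserved(nowMs: Long): Boolean {
        events.addLast(nowMs)
        // Drop attempts that fell out of the observation window.
        while (events.isNotEmpty() && nowMs - events.first() > windowMs) {
            events.removeFirst()
        }
        return events.size >= threshold
    }
}
```

Under this model, three layers firing in the same wall-clock second trip the threshold instantly, while a single deduplicated restart — even one repeated hours apart — never accumulates anything.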
Multi-path concurrent recovery is not, in PowerGenie's eyes, a clever defence-in-depth strategy. It's a symptom of a misbehaving app.
And of course every single one of my eleven layers is genuinely necessary on a different OEM. Pixel App Standby will let BOOT_COMPLETED fire but quietly throttle the AlarmManager chain into next Tuesday. Samsung lets the AlarmManager fire but JobScheduler registrations vanish through Doze. Xiaomi keeps the JobScheduler but stops delivering onTaskRemoved after a while. Whenever I removed any of these layers in the past, a different family stopped getting alerts. So the eleven of them are real. The problem is not that they exist — the problem is that, on a process restart, they all fire on the same trigger in the same second, and that's the part PowerGenie reads.
This is a problem you only have if your recovery is good enough that all of it works.
What broke, in three signals
Working backwards from the export, I could pin down three specific things MagicOS observed and graded me down on. None of the three was a clever, specific bug. They were the kind of thing you only see once you know to look.
Signal 1 — multi-path concurrent recovery. Three onStartCommand calls in the same second, two of them firing startForeground() redundantly, watchdog reschedule re-fired, monitoring coroutines re-launched. To PowerGenie, three increments instead of one.
Signal 2 — JobScheduler frequency. My periodic watchdog ran every fifteen minutes. That's about ninety-six runs per day. According to AOSP BatteryStatsService telemetry, that puts an app in the top one percent of UIDs by job count. Pixel App Standby Bucket demotion and Samsung's Sleeping Apps classifier read this number directly. The historical reason I picked fifteen minutes was that fifteen is the floor for PeriodicWorkRequest. I picked the floor because it was the floor, not because anyone measured it against alternatives.
Signal 3 — time-based location requests with no explicit distance hint. My GPS wake probe — a five-second requestLocationUpdates(HIGH_ACCURACY) that fires at most once every four hours — built its LocationRequest without setMinUpdateDistanceMeters(0f). The default is implementation-defined, undocumented, and on some Play Services versions non-zero. Samsung and Pixel battery analysers flag time-only location requests harder than the equivalent request that explicitly declares its intent. Two other call sites in my code already had the line. The wake probe didn't, because I'd written it earlier and never came back.
Each signal on its own is small. Together they added up to "this UID is suspicious," and that was enough for PowerGenie to revoke the app's trust.
The fix, three small surfaces
The pleasant surprise was that none of these three needed a clever solution. They needed the smallest possible change at the right place.
Anti-thrash gate on onStartCommand. Five seconds, monotonic clock, one @Volatile field on the Service:

```kotlin
companion object {
    private const val CONCURRENT_START_WINDOW_MS = 5_000L
}

@Volatile private var lastStartCommandHandledMs: Long = 0L

override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
    super.onStartCommand(intent, flags, startId)
    val now = SystemClock.elapsedRealtime()
    val sinceLast = now - lastStartCommandHandledMs
    if (lastStartCommandHandledMs > 0L && sinceLast < CONCURRENT_START_WINDOW_MS) {
        DiagnosticLogger.logInfo(
            "MonitoringService",
            "onStartCommand startId=$startId — duplicate within ${sinceLast}ms; skipping re-init"
        )
        return START_STICKY
    }
    lastStartCommandHandledMs = now
    // ...existing init body unchanged from here; it ends by returning START_STICKY
}
```
Two things worth being specific about.
SystemClock.elapsedRealtime(), not System.currentTimeMillis(), because the wall clock can jump backwards (NTP sync, manual time change), and you do not want a "duplicate" gate that ever rejects more aggressively than the window you wrote down.
The gate sits at the convergence point, not at every caller. The eleven layers above need to keep firing independently — each one is the only thing that works on some specific OEM, and every gate I added on the divergent caller side immediately created the next OEM-specific gap. Deduplication belongs where the calls converge, not where they diverge. onStartCommand is the chokepoint where every layer's effort eventually arrives, so it's the right place to be visible-once instead of visible-thrice.
The framework is fine with this, by the way. AOSP ActivityManagerService.ServiceRecord tracks one deadline per service, and a single successful startForeground() call clears every pending deadline regardless of how many startForegroundService() calls are stacked behind it. The first onStartCommand of the burst already satisfied the OS. The duplicates were doing redundant work — work that PowerGenie watched, and that the OS would have been just as happy without.
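To make the window behaviour testable off-device, the same gate can be sketched as a plain class with an injected monotonic clock. This is a hypothetical refactor, not the app's actual structure; in the Service the lambda would be `SystemClock::elapsedRealtime`:

```kotlin
// The onStartCommand dedup gate, extracted so the window logic can be
// unit-tested without an Android device. The clock is injected; only
// the first call inside the window runs the full init body.
class StartCommandGate(
    private val windowMs: Long = 5_000L,
    private val monotonicNowMs: () -> Long,
) {
    @Volatile private var lastHandledMs: Long = 0L

    /** Returns true if this start should run the full init body. */
    fun shouldHandle(): Boolean {
        val now = monotonicNowMs()
        val sinceLast = now - lastHandledMs
        if (lastHandledMs > 0L && sinceLast < windowMs) {
            return false // duplicate within the window: skip re-init
        }
        lastHandledMs = now
        return true
    }
}
```

Fed the burst from the diagnostic export — three starts inside one second — the gate lets exactly one through, while a genuine restart minutes later still initialises normally.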
setMinUpdateDistanceMeters(0f) on the GPS wake probe. One line:
```kotlin
LocationRequest.Builder(Priority.PRIORITY_HIGH_ACCURACY, 1000L)
    .setDurationMillis(GPS_WAKE_PROBE_DURATION_MS)
    .setMaxUpdates(5)
    .setMinUpdateDistanceMeters(0f) // declarative intent for OEM analysers
    .build()
```
Behavior identical. 0f says "deliver every fix regardless of distance," which is exactly what a five-second wake probe wanted anyway. The change is purely about handing the OEM's static analyser something explicit to read instead of letting it infer "time-based, no distance hint, suspicious."
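To see why an undocumented non-zero default would hurt, here is a toy model — not the real Play Services filter, just an illustration of the contract — of a provider-side distance gate: a fix is delivered only if it moved at least the minimum distance since the last delivered fix. A stationary phone's metre-scale GPS jitter never clears a non-zero threshold:

```kotlin
// Toy model of a provider-side distance filter (not the actual Play
// Services implementation): deliver a fix only if it moved at least
// minUpdateDistanceMeters from the last delivered fix.
data class Fix(val lat: Double, val lon: Double)

fun deliveredFixes(fixes: List<Fix>, minUpdateDistanceMeters: Float): List<Fix> {
    val out = mutableListOf<Fix>()
    for (fix in fixes) {
        val last = out.lastOrNull()
        if (last == null || roughDistanceMeters(last, fix) >= minUpdateDistanceMeters) {
            out += fix
        }
    }
    return out
}

// Equirectangular approximation — fine at metre scale.
fun roughDistanceMeters(a: Fix, b: Fix): Double {
    val mPerDegLat = 111_320.0
    val mPerDegLon = 111_320.0 * kotlin.math.cos(Math.toRadians(a.lat))
    val dy = (b.lat - a.lat) * mPerDegLat
    val dx = (b.lon - a.lon) * mPerDegLon
    return kotlin.math.sqrt(dx * dx + dy * dy)
}
```

With 0f every fix arrives; with a hypothetical 5-metre default, a phone lying on a counter delivers one fix and then goes silent — the worst case for a five-second wake probe whose whole job is confirming the device is alive.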
Watchdog cadence: 15 → 30 minutes. One constant. The watchdog is not the primary FGS-survival layer — that's the AlarmManager 5-min chain plus the 8-hour setAlarmClock() safety net. The WorkManager job is explicitly the backup. Halving its cadence (96 runs/day → 48 runs/day) cuts the per-UID job count in half. On Android 12+ the periodic worker is Doze-deferred anyway, so the "15 is more responsive than 30" claim doesn't hold during the exact moments — deep sleep, OEM kill cycle — when revocation actually happens.
I also flipped ExistingPeriodicWorkPolicy.KEEP to UPDATE for this release only, so existing installs migrate to the new cadence on first app open instead of waiting weeks for WorkManager to organically re-evaluate.
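For the record, the arithmetic behind those job counts, as a sketch — the 15-minute constant is WorkManager's documented PeriodicWorkRequest floor, and the function is an upper bound because Doze deferral only pushes runs later, it never adds extra ones:

```kotlin
// WorkManager's documented floor for PeriodicWorkRequest intervals
// (PeriodicWorkRequest.MIN_PERIODIC_INTERVAL_MILLIS = 15 minutes).
val PERIODIC_FLOOR_MINUTES = 15L

// Upper bound on daily runs of a periodic job at the given cadence.
fun maxRunsPerDay(intervalMinutes: Long): Long = 24 * 60 / intervalMinutes
```

At the floor that is 96 runs a day; at 30 minutes it is 48, which is the entire content of the fix.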
Three fixes. Three small surfaces. Total diff: maybe forty lines, including the KDoc explaining why the constants are what they are.
What I actually learned
Some recovery patterns look exactly like malware to a battery manager. I had been thinking of my eleven layers as defence in depth. From PowerGenie's side they look like a process that won't die quietly. The OEM doesn't care which I am. It grades the visible behaviour. Every concurrent path that wakes the same service in the same second is one increment on the resurrection counter — and if your eleven layers all fire on the same trigger (process restart, network change, boot), you have handed the heuristic an unambiguous reading. The fix is not to remove layers. It is to keep them and converge their effect.
The floor is not a free choice. I picked 15-minute watchdog cadence because it was the lowest periodic interval WorkManager allows. I never asked whether 15 was better than 30 for what the watchdog actually does. It was the floor and I assumed that was the answer. It wasn't. The floor was available. The right value was somewhere above it where the protection still held but the OEM signal stopped firing. Every "minimum allowed" comment in your codebase is a candidate for this question: did anyone ever actually pick this number, or did someone just take what was on the bottom shelf?
Declarative intent is cheap and load-bearing. setMinUpdateDistanceMeters(0f) was zero behaviour change. It existed purely to hand a sentence to a machine analyser that was going to read your LocationRequest whether you liked it or not. The lesson generalises. When a LocationRequest / JobInfo / WorkRequest has explicit fields and you leave them unset, you are letting an OEM-specific default fill in for you. That default is undocumented, varies between platform versions, and biases against you. Set every field that matters even if the value matches the platform default — at least then the analyser has something concrete to read instead of guessing.
The convergence point is the right place for dedup. I genuinely tried, briefly, to dedupe at every caller — gate Application.attemptServiceRecovery, gate NetworkConnectivityMonitor.onAvailable, gate the SyncAdapter trigger. Within an hour it was obvious that any one of those gates I added now created the next OEM-specific gap, because each layer is the only thing that works on some specific device. The dedup belongs at the place where the divergent paths converge — onStartCommand for service restarts, processLocation for GPS writes (a separate post one day). Wherever the system actually does the thing. That is the chokepoint where you can be both correct and visible-to-OEM-quiet.
Single-OEM forensics, multi-OEM patches. This whole story came from one Honor diagnostic export. Three of the fixes generalise — Pixel and Samsung both read the same job-count and concurrent-onStartCommand signals — but I would not have known to look without the Honor evidence. If you are shipping continuous-background work on Android, every diagnostic export from a device that misbehaved is worth more than a thousand emulator hours. The OEM behaviour that actually matters is undocumented, regional, version-specific, and only shows up under the exact conditions of a real install on a real phone in someone's drawer.
The really uncomfortable part of all of this, the one I keep coming back to, is that on Android good engineering can be the wrong choice. The recovery layers are correct. The eleven of them, individually, are each load-bearing on some device. Together they are the reason "install and forget" works for the user, in most cases (I can't do anything about an OTA silently resetting my configuration). And on Honor, the eleven of them firing in the same second of process restart is the trigger for "this is a zombie, lower its trust." There is no platform contract that says "thou shalt not recover thy service via more than two paths at once," and there cannot be, because the OEMs do not write contracts. They write laws. And the law this week is "concurrent recovery is suspicious."
So my eleven layers stay. They just learned to whisper.
The app is called "How Are You?!" — it's on Google Play. If you've shipped an Android app that has to run continuously, I'd love to hear which OEM caught you out, what the giveaway was in your logs, and how small the eventual fix turned out to be.