DEV Community: Stoyan Minchev

Honor's isIgnoringBatteryOptimizations() returns true in foreground and false in background — for the same app, at the same time

Stoyan Minchev — Tue, 05 May 2026 06:32:43 +0000

I maintain an Android app that monitors elderly people and emails their family if something looks wrong. It runs a foreground service 24/7, and the single most important thing it does at install time is get added to the battery-optimisation whitelist - the one behind PowerManager.isIgnoringBatteryOptimizations().

On my Honor device running MagicOS 9.0, I did a fresh install. I did everything right during setup. Tapped "Don't optimise." Enabled autostart. Finished onboarding. All looked good. I opened the app a few times and closed it. That is normal behavior: people open it, see that it works, and close it. When they wake up the next day they open it again out of curiosity, then close it.

A few hours later, I received an email:
Battery optimisation is blocking the app. Please open Settings and set it back to "Don't optimize."

I opened the OS settings and all looked good there. They were correct.

The logs showed something strange:

2026-05-01 19:27 UTC  [WatchdogWorker]  isIgnoringBatteryOptimizations = false → health = Degraded(battery_optimization_active)

2026-05-02 09:25 UTC  [MonitoringAutoRecover]  isIgnoringBatteryOptimizations = true → health = Healthy

Same device. Same package. Same API call: context.getSystemService(PowerManager::class.java).isIgnoringBatteryOptimizations(packageName).

The watchdog runs as a background process. No activity, no visible UI at that moment.
The auto-recover check runs when the user opens the app. It's a foreground process. Activity visible, window focused.

Between 19:27 and 09:25, the watchdog fired roughly every thirty minutes. Every single background check returned false. I opened the app the next morning, the foreground check returned true, the app said "everything's fine," and the recovery screen never appeared.

What Honor is actually doing

AOSP's isIgnoringBatteryOptimizations() reads the deviceidle whitelist in DeviceIdleController. That's a persistent list. You get added via ACTION_REQUEST_IGNORE_BATTERY_OPTIMIZATIONS or dumpsys deviceidle whitelist +<package>. Once you're on it, you stay on it until someone removes you.

Honor's PowerGenie sits on top of this. It has its own internal classification - a per-UID trust score that evolves based on the app's runtime behaviour (job frequency, wake patterns, GPS usage, the resurrection-detection heuristic I wrote about previously). When PowerGenie decides your trust is low enough, it doesn't remove you from the AOSP whitelist. That would be visible in dumpsys. Instead, it folds an enforcement bit into the return value of isIgnoringBatteryOptimizations().

And here's the part that cost me two days: the enforcement bit is only applied when the caller is a background process. When the calling process has a visible activity — when the user is looking at the app — PowerGenie shadows the read and returns the "real" AOSP whitelist value. Which is true, because the user never revoked the exemption.

In other words:

Background worker asks "am I whitelisted?" → PowerGenie says no
User opens the app, same code asks "am I whitelisted?" → PowerGenie says yes
The AOSP whitelist never changed

I spent two days convinced this was a race condition before I looked at the timestamps carefully enough to see that every single background read disagreed with every single foreground read, for roughly fourteen straight hours (19:27 to 09:25). That's not a race. That's policy.
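A minimal sketch of how that "race or policy?" question could be answered mechanically from a diagnostic export. Everything here is hypothetical and not taken from the app: the ExemptionReading record, the verdict names, and the idea that each reading is tagged with whether the calling process was in the foreground (on Android that flag could come from something like ActivityManager.getMyMemoryState(), which is an assumption on my part).

```kotlin
// Hypothetical record of one isIgnoringBatteryOptimizations() reading.
data class ExemptionReading(
    val epochMillis: Long,
    val callerInForeground: Boolean, // assumed to be captured at call time
    val ignoringOptimizations: Boolean
)

enum class ExemptionVerdict {
    CONSISTENT_TRUE,
    CONSISTENT_FALSE,
    CONTEXT_DEPENDENT, // foreground and background disagree, each internally stable
    MIXED,             // values flap within one context: a race or a real settings change
    INSUFFICIENT_DATA  // no readings from one of the two contexts
}

fun classifyReadings(readings: List<ExemptionReading>): ExemptionVerdict {
    // Collapse each context to the set of distinct values it ever returned.
    val foreground = readings.filter { it.callerInForeground }
        .map { it.ignoringOptimizations }.toSet()
    val background = readings.filter { !it.callerInForeground }
        .map { it.ignoringOptimizations }.toSet()

    if (foreground.isEmpty() || background.isEmpty()) return ExemptionVerdict.INSUFFICIENT_DATA
    if (foreground.size > 1 || background.size > 1) return ExemptionVerdict.MIXED

    val fg = foreground.single()
    val bg = background.single()
    return when {
        fg != bg -> ExemptionVerdict.CONTEXT_DEPENDENT
        fg -> ExemptionVerdict.CONSISTENT_TRUE
        else -> ExemptionVerdict.CONSISTENT_FALSE
    }
}
```

A CONTEXT_DEPENDENT verdict is the pattern described above: every background read says false, every foreground read says true, and neither context ever contradicts itself.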

The app is How Are You?! — elderly monitoring via AI pattern learning. If you've hit a vendor API that returns different values depending on the calling context, I'd genuinely like to know about it. The AOSP contract says nothing about this, and I suspect Honor isn't the only one doing it.

Honor watched my Android app come back from the dead — and revoked the battery exemption that let it

Stoyan Minchev — Wed, 29 Apr 2026 19:18:29 +0000

I write a safety-critical Android app that watches a phone 24/7 (motion, GPS, screen activity) and emails the family if an elderly person's behavior suddenly looks wrong. Install it on grandma's phone and forget about it. That's the idea.

I've written here before about a SAM lambda that hung my geocoder for 21 hours, and about the OTAs that silently strip every battery-optimisation exemption I worked to get. This one is sillier than either. This one is about the time my own service-recovery layers were so visibly enthusiastic that the manufacturer's battery manager looked at them and went, "yeah, that's a zombie, kill it harder."

The fix shipped last week. The bug is not really a bug - it's a category of mistake - and I think it's worth writing about because every Android dev I know who builds anything that has to run in the background eventually steps in this exact pile.

The setup that should have worked

Fresh install on an Honor phone running MagicOS 9.0 on top of Android 15. Out of the box, brand new, the friendliest possible test.

The user went through onboarding. Battery optimisation: "Don't optimise." Autostart: on. Recipients added. Setup finished.

My in-app sanity worker fires one hour after setup completes. It's a simple thing — re-runs every permission probe and snapshots the device state into the diagnostic log. At T+1h it reported health=Healthy. Battery exemption present. Foreground service running. IMPORTANCE_MIN notification visible. Everything fine.

About six hours after that, the family received this email:

Battery optimisation is blocking the app. Tap "Open Settings" below, then set "How Are You?!" to "Don't optimise."

That email is automatic. The app sends it when a runtime health check detects that an exemption it was granted has been quietly revoked. So at some point between hour 1 and hour 7, the OS had said "yes, you have battery exemption" at lunchtime and "no, you don't" by evening, and absolutely nothing the user did caused the change.

The phone had been sitting on a counter the whole time. The app should have been bored.

What actually happened (with timestamps, because Android is a story told in timestamps)

The diagnostic export, when I opened it, was a small horror movie.

At 17:24:30 UTC, MagicOS killed the foreground service. No specific reason in the logs — Honor doesn't bother emitting one. The process just dies.

Then my 11-layer service-recovery chain woke up, exactly as designed. I built that chain over forty-odd app versions, each layer added because the previous ones weren't enough on some specific device. Eleven layers feels excessive until you ship to a Xiaomi user, after which it feels like a starting point.

Within the same wall-clock second, three of those layers fired at once:

17:24:31  [Application] Recovery: process started, service not running - starting service
17:24:31  [Application] Network recovery: service not running - starting service
17:24:31  [MonitoringService] onStartCommand startId=2
17:24:31  [MonitoringService] onStartCommand startId=3

What you're looking at:

The process restarted, which fired Application.onCreate recovery. That layer noticed the foreground service wasn't running and called MonitoringService.start().
A network availability callback fired in the same instant. That layer also noticed the service wasn't running and called MonitoringService.start().
Behind the scenes, a SyncAdapter account-creation poke also fired. Third independent path. Third call.

All three calls reached the framework. The framework happily delivered all three to onStartCommand. My init code ran, ran again, ran a third time. Each invocation re-fired startForeground(). Each invocation re-scheduled the watchdog. Each invocation re-launched the monitoring coroutines.

I was, from the OS's point of view, aggressively visible.

By 19:24 UTC — exactly two hours later — the watchdog reported ServiceRecoveryJobService missing - rescheduling. My battery exemption was gone, and one of my JobScheduler registrations had been ripped out for good measure. The next health check confirmed the regression and shipped the email to the family.

It took MagicOS less than two hours from "this app got killed once" to "this app's exemption is hereby revoked."

The "process resurrection detector"

Here is the unkind realisation that took me an embarrassing length of time to land on.

Honor PowerGenie, Xiaomi MIUI Power Keeper, Samsung Smart Manager — they all run a heuristic that explicitly looks for what mine does: a process that gets killed and immediately tries to come back via multiple unrelated paths.

To the OEM, that pattern is one of two things. It's malware-shaped — rootkit-style autostart, ad-stuck zombies, miners that won't quit. Or it's a "stubborn" app that doesn't take the hint that the OS would prefer it to stay down. Either way, it earns demerits. Each concurrent recovery path that lights up in the same second is one tick on a process-resurrection counter. Hit the threshold and the OEM strips your trust.

Multi-path concurrent recovery is not, in PowerGenie's eyes, a clever defence-in-depth strategy. It's a symptom of a misbehaving app.

And of course every single one of my eleven layers is genuinely necessary on a different OEM. Pixel App Standby will let BOOT_COMPLETED fire but quietly throttle the AlarmManager chain into next Tuesday. Samsung lets the AlarmManager fire but JobScheduler registrations vanish through Doze. Xiaomi keeps the JobScheduler but stops delivering onTaskRemoved after a while. Whenever I removed any of these layers in the past, a different family stopped getting alerts. So the eleven of them are real. The problem is not that they exist — the problem is that, on a process restart, they all fire on the same trigger in the same second, and that's the part PowerGenie reads.

This is a problem you only have if your recovery is good enough that all of it works.

What broke, in three signals

Working backwards from the export, I could pin down three specific things MagicOS observed and graded me down on. None of the three was a clever, specific bug. They were the kind of thing you only see once you know to look.

Signal 1 — multi-path concurrent recovery. Three onStartCommand calls in the same second, two of them firing startForeground() redundantly, watchdog reschedule re-fired, monitoring coroutines re-launched. To PowerGenie, three increments instead of one.

Signal 2 — JobScheduler frequency. My periodic watchdog ran every fifteen minutes. That's about ninety-six runs per day. According to AOSP BatteryStatsService telemetry, that puts an app in the top one percent of UIDs by job count. Pixel App Standby Bucket demotion and Samsung's Sleeping Apps classifier read this number directly. The historical reason I picked fifteen minutes was that fifteen is the floor for PeriodicWorkRequest. I picked the floor because it was the floor, not because anyone measured it against alternatives.

Signal 3 — time-based location requests with no explicit distance hint. My GPS wake probe — a five-second requestLocationUpdates(HIGH_ACCURACY) that fires at most once every four hours — built its LocationRequest without setMinUpdateDistanceMeters(0f). The default is implementation-defined, undocumented, and on some Play Services versions non-zero. Samsung and Pixel battery analysers flag time-only location requests harder than the equivalent request that explicitly declares its intent. Two other call sites in my code already had the line. The wake probe didn't, because I'd written it earlier and never came back.

Each signal on its own is small. The three together added up to "this UID is suspicious," and, as far as I can tell, that is what made PowerGenie strip its trust.

The fix, three small surfaces

The pleasant surprise was that none of these three needed a clever solution. They needed the smallest possible change at the right place.

Anti-thrash gate on onStartCommand. Five seconds, monotonic clock, one @Volatile field on the Service:

private const val CONCURRENT_START_WINDOW_MS = 5_000L

@Volatile private var lastStartCommandHandledMs: Long = 0L

override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
    super.onStartCommand(intent, flags, startId)

    val now = SystemClock.elapsedRealtime()
    val sinceLast = now - lastStartCommandHandledMs
    if (lastStartCommandHandledMs > 0L && sinceLast < CONCURRENT_START_WINDOW_MS) {
        DiagnosticLogger.logInfo(
            "MonitoringService",
            "onStartCommand startId=$startId — duplicate within ${sinceLast}ms; skipping re-init"
        )
        return START_STICKY
    }
    lastStartCommandHandledMs = now

    // ...existing init body unchanged from here
}

Two things worth being specific about.

SystemClock.elapsedRealtime(), not System.currentTimeMillis(), because the wall clock can jump in either direction (NTP sync, manual time change). A backwards jump makes sinceLast negative, and you do not want a "duplicate" gate that ever rejects for longer than the window you wrote down.

The gate sits at the convergence point, not at every caller. The eleven layers above need to keep firing independently — each one is the only thing that works on some specific OEM, and every gate I added on the divergent caller side immediately created the next OEM-specific gap. Deduplication belongs where the calls converge, not where they diverge. onStartCommand is the chokepoint where every layer's effort eventually arrives, so it's the right place to be visible-once instead of visible-thrice.

The framework is fine with this, by the way. AOSP ActivityManagerService.ServiceRecord tracks one deadline per service, and a single successful startForeground() call clears every pending deadline regardless of how many startForegroundService() calls are stacked behind it. The first onStartCommand of the burst already satisfied the OS. The duplicates were doing redundant work — work that PowerGenie watched, and that the OS would have been just as happy without.
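The same gate can be pulled out of the Service so that it is unit-testable without Android. This is a sketch under assumptions, not the shipped code: StartCommandGate and shouldHandle are hypothetical names, and the clock is injected as a lambda that on a device would be { SystemClock.elapsedRealtime() }.

```kotlin
// Hypothetical extraction of the anti-thrash gate into a plain class.
class StartCommandGate(
    private val monotonicNowMs: () -> Long, // must be a monotonic clock, never wall time
    private val windowMs: Long = 5_000L
) {
    @Volatile
    private var lastHandledMs: Long = 0L

    /** Returns true if the caller should run the full init, false for a duplicate. */
    fun shouldHandle(): Boolean {
        val now = monotonicNowMs()
        val last = lastHandledMs
        // Only the first call of a burst is handled; duplicates do not extend the window.
        if (last > 0L && now - last < windowMs) return false
        lastHandledMs = now
        return true
    }
}
```

In onStartCommand the body would then start with a check of gate.shouldHandle() and return START_STICKY early when it is false. Because rejected calls do not move the timestamp, a steady trickle of duplicates cannot hold the gate shut beyond the five-second window.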

setMinUpdateDistanceMeters(0f) on the GPS wake probe. One line:

LocationRequest.Builder(Priority.PRIORITY_HIGH_ACCURACY, 1000L)
    .setDurationMillis(GPS_WAKE_PROBE_DURATION_MS)
    .setMaxUpdates(5)
    .setMinUpdateDistanceMeters(0f)   // declarative intent for OEM analysers
    .build()

Behavior identical. 0f says "deliver every fix regardless of distance," which is exactly what a five-second wake probe wanted anyway. The change is purely about handing the OEM's static analyser something explicit to read instead of letting it infer "time-based, no distance hint, suspicious."

Watchdog cadence: 15 → 30 minutes. One constant. The watchdog is not the primary FGS-survival layer — that's the AlarmManager 5-min chain plus the 8-hour setAlarmClock() safety net. The WorkManager job is explicitly the backup. Halving its cadence (96 runs/day → 48 runs/day) cuts the per-UID job count in half. On Android 12+ the periodic worker is Doze-deferred anyway, so the "15 is more responsive than 30" claim doesn't hold during the exact moments — deep sleep, OEM kill cycle — when revocation actually happens.

I also flipped ExistingPeriodicWorkPolicy.KEEP to UPDATE for this release only, so existing installs migrate to the new cadence on first app open instead of waiting weeks for WorkManager to organically re-evaluate.

Three fixes. Three small surfaces. Total diff: maybe forty lines, including the KDoc explaining why the constants are what they are.

What I actually learned

Some recovery patterns look exactly like malware to a battery manager. I had been thinking of my eleven layers as defence in depth. From PowerGenie's side they look like a process that won't die quietly. The OEM doesn't care which I am. It grades the visible behaviour. Every concurrent path that wakes the same service in the same second is one increment on the resurrection counter — and if your eleven layers all fire on the same trigger (process restart, network change, boot), you have handed the heuristic an unambiguous reading. The fix is not to remove layers. It is to keep them and converge their effect.

The floor is not a free choice. I picked 15-minute watchdog cadence because it was the lowest periodic interval WorkManager allows. I never asked whether 15 was better than 30 for what the watchdog actually does. It was the floor and I assumed that was the answer. It wasn't. The floor was available. The right value was somewhere above it where the protection still held but the OEM signal stopped firing. Every "minimum allowed" comment in your codebase is a candidate for this question: did anyone ever actually pick this number, or did someone just take what was on the bottom shelf?

Declarative intent is cheap and load-bearing. setMinUpdateDistanceMeters(0f) was zero behaviour change. It existed purely to hand a sentence to a machine analyser that was going to read your LocationRequest whether you liked it or not. The lesson generalises. When a LocationRequest / JobInfo / WorkRequest has explicit fields and you leave them unset, you are letting an OEM-specific default fill in for you. That default is undocumented, varies between platform versions, and biases against you. Set every field that matters even if the value matches the platform default — at least then the analyzer has something concrete to read instead of guessing.

The convergence point is the right place for dedup. I genuinely tried, briefly, to dedupe at every caller — gate Application.attemptServiceRecovery, gate NetworkConnectivityMonitor.onAvailable, gate the SyncAdapter trigger. Within an hour it was obvious that any one of those gates I added now created the next OEM-specific gap, because each layer is the only thing that works on some specific device. The dedup belongs at the place where the divergent paths converge — onStartCommand for service restarts, processLocation for GPS writes (a separate post one day). Wherever the system actually does the thing. That is the chokepoint where you can be both correct and visible-to-OEM-quiet.

Single-OEM forensics, multi-OEM patches. This whole story came from one Honor diagnostic export. All three fixes generalise, since Pixel and Samsung both read the same job-count and concurrent-onStartCommand signals, but I would not have known to look without the Honor evidence. If you are shipping continuous-background work on Android, every diagnostic export from a device that misbehaved is worth more than a thousand emulator hours. The OEM behaviour that actually matters is undocumented, regional, version-specific, and only shows up under the exact conditions of a real install on a real phone in someone's drawer.

The really uncomfortable part of all of this, the one I keep coming back to, is that on Android good engineering can be the wrong choice. The recovery layers are correct. The eleven of them, individually, are each load-bearing on some device. Together they are the reason "install and forget" works for the user, in most cases (I can't fight an OTA silently resetting my configuration). And on Honor, the eleven of them firing in the same second of process restart is the trigger for "this is a zombie, lower its trust." There is no platform contract that says "thou shalt not recover thy service via more than two paths at once," and there cannot be, because the OEMs do not write contracts. They write laws! And the law this week is "concurrent recovery is suspicious."

So my eleven layers stay. They just learned to whisper.

The app is called "How Are You?!" — it's on Google Play. If you've shipped an Android app that has to run continuously, I'd love to hear which OEM caught you out, what the giveaway was in your logs, and how small the eventual fix turned out to be.

I pushed my Android app to production. Then Android and the OEMs spent two weeks tearing it apart.

Stoyan Minchev — Wed, 22 Apr 2026 20:09:16 +0000

I build a safety-critical Android app that monitors elderly people living alone. It watches the phone 24/7 — motion, GPS, screen activity — and emails their family when something looks wrong. No buttons to press. No wearable to charge. Install it on your mum's phone and forget about it.

That last sentence — install and forget — is the entire product. It is the only reason this works for the target user, who is 65+ years old, does not open apps she did not put there herself, and will not read an email from us telling her to go into Settings and tap anything. If the app needs her attention to keep working, the app has failed.

In the last two weeks I found out that Android does not want me to keep that promise.

What broke first: the app was too quiet

I shipped a build. A tester installed it, went through setup, and never opened the app again, exactly as a real elderly user would. Two and a half weeks later I opened the diagnostic export and found a phone that had, from Android's point of view, essentially disappeared.

The battery-optimization exemption I had been granted at setup had not been reset. The problem was subtler. Because the app had not been opened in 18 days, every OEM-level "smart battery" heuristic had decided the app was unused and put it into the most aggressive standby bucket the platform has. My auto-recovery layers (WorkManager watchdog, AlarmManager safety net, boot receiver) were all there. They were all throttled to the point of being ornamental.

My 11 layers of recovery are all triggered by the device doing something. Boot. Screen on. Time change. App opened. Network regained. If the user does not open the app, the device stays in deep doze for days, and my "recovery" is just a sequence of alarms that will eventually fire when the OS decides they can.

This is the bootstrap paradox of Android background work. The app cannot wake itself up reliably. It needs an external caller to poke it. And if you are writing a safety-critical app where the entire value proposition is "you do not have to think about this," the user cannot be the external caller.

The fix: somebody else has to ring the doorbell

I needed an external, non-user-driven wake source. Something that would touch the phone on a predictable cadence regardless of whether the user opened the app.

FCM push messages are that thing, if you use them correctly. Here is the shape of it — none of this is private:

A GitHub Actions workflow runs on a cron. Mine is every 6 hours. Its only job is to call the FCM API and send a data message to a topic called heartbeat-all.
The Android app subscribes to that topic at first launch.
The app registers a FirebaseMessagingService. When a heartbeat arrives, the service does one thing: it runs my existing recovery classifier. Is the foreground service alive? Are permissions still granted? Is the learning phase progressing? If anything looks wrong, fix it. If everything is fine, do nothing.

A few things are worth being specific about.

Topic messaging, not device tokens. I do not want to store FCM tokens on a server. I have no server. With topic messaging, the server-side cron sends one message and Google's fan-out reaches every subscribed device. Zero per-device state. The external caller (GitHub Actions) does not know who my users are, and I like it that way.

Data messages with priority: HIGH, not notification messages. Notification messages are shown by the system and give the app almost no work budget. Data messages trigger onMessageReceived even in Doze, provided the priority is high. You pay for this with a quota — FCM will downgrade high-priority data messages if you overuse them — so 6-hour cadence is the floor. More frequent and you get throttled; less frequent and dormant phones drift too far before the next wake.

The handler always checks priority before acting. If FCM downgraded the message (which it will, eventually, on some device, for reasons Google does not fully document), the heartbeat handler does not run the full recovery path inline. It enqueues an expedited WorkManager job instead, because WorkManager is exempt from the foreground-service-while-in-use restrictions that killed the naive direct-execution path on Android 14. This one took me a minor-version patch to get right.

There is a compile-time kill switch. ENABLE_FCM_HEARTBEAT is a BuildConfig boolean. If a specific OEM starts misbehaving, I can ship a hotfix that disables the entire wake path without touching manifest registrations or Firebase configuration. Keep the lever small and reversible.
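The handler contract described above (kill switch first, then a priority check, then inline recovery or a deferred job) can be sketched as a pure decision function. The enum and function names are hypothetical. On a device, DeliveredPriority would be mapped from RemoteMessage.getPriority(), and heartbeatEnabled would be the BuildConfig flag.

```kotlin
// Hypothetical mirror of the priority FCM actually delivered the message with.
enum class DeliveredPriority { HIGH, NORMAL, UNKNOWN }

enum class HeartbeatAction {
    IGNORE,                 // kill switch is off: do nothing at all
    RUN_RECOVERY_INLINE,    // delivered as high priority: run the recovery classifier now
    ENQUEUE_EXPEDITED_WORK  // downgraded: hand off to an expedited WorkManager job
}

fun decideHeartbeatAction(
    heartbeatEnabled: Boolean,
    delivered: DeliveredPriority
): HeartbeatAction = when {
    !heartbeatEnabled -> HeartbeatAction.IGNORE
    delivered == DeliveredPriority.HIGH -> HeartbeatAction.RUN_RECOVERY_INLINE
    // NORMAL and UNKNOWN are both treated as "downgraded" in this sketch.
    else -> HeartbeatAction.ENQUEUE_EXPEDITED_WORK
}
```

Keeping the decision free of Android types means the contract itself can be covered by plain unit tests, and the FirebaseMessagingService only has to translate the result into the actual calls.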

That was roughly a week of work. FCM itself is easy to configure — the hard part is not "how do I send a push message," it is "what is the contract my handler signs when the push arrives, and what does it promise not to do on a misbehaving device." Get that contract right and you have bought yourself a non-user-driven wake channel that survives dormant installs.

Testing now before I ship it.

And then I started worrying about everything else.

I went looking for what else could silently break

Shipping the FCM heartbeat felt too quiet. A fix that works is always a little suspicious, especially on Android, and especially when the failure it is fixing took me two and a half weeks of dormant-phone diagnostics to even notice. So after the heartbeat stabilised I sat down with a blank page and forced myself to answer one question: what else can invalidate the state my recovery is built on top of, without the user doing anything?

I listed everything the app depends on to keep running. Battery-optimization exemption. Autostart permission on OEMs that have it. Foreground-service type declaration being honoured. Notification channel being visible. The device's standby bucket being something other than "restricted." All of these are things the app establishes at setup and then assumes, forever, because nothing in the normal operation of a phone should flip them back.

OTAs are not normal operation.

I do not have a dramatic in-production discovery story for this one. I do not have a log line. The common thread is that the OEM considers battery-optimization exemption to be a privilege the app was granted under the previous system image, and the new system image is entitled to re-evaluate.

And here is what makes it the worst possible failure mode for my app: the user will never know. The OTA completes overnight. The phone reboots. The boot receiver fires. The foreground service tries to start. On the exempted path, it runs. On the reset path, it either fails silently on Android 14+ because foregroundServiceType=location requires permissions that are no longer considered granted, or it starts but runs in a standby bucket that throttles every alarm I schedule into next week. The IMPORTANCE_MIN notification looks the same either way. The elderly user will not notice. The adult child who installed the app months ago has moved on. The app is sitting on the phone doing approximately nothing, and nobody knows.

A lot of work on 11 layers of recovery. All of it sitting on top of assumptions the OS is allowed to invalidate between Tuesday and Wednesday, without telling me.

Detecting an OTA without any help from the OS

If the platform will not tell you an OTA happened, you have to work it out. The trick is Build.FINGERPRINT.

Build.FINGERPRINT is a string that uniquely identifies the system image: brand, product, device, Android release, build ID, incremental build number, build type and build tags. In practice every OTA changes it. It changes at a point in time the app cannot be guaranteed to observe directly, but it is stable across every wake once the OTA has completed.

So: persist the last-seen fingerprint. On every wake — service start, boot receiver, heartbeat handler, everywhere — compare the current fingerprint against the stored one. If they differ, the device has been updated since the last time this app ran. At that instant, the app knows an OTA happened, even though nobody told it.
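A minimal sketch of that comparison, assuming the stored value lives behind a small key-value interface. FingerprintStore, InMemoryFingerprintStore and checkFingerprint are hypothetical names. On a device the store would be backed by something persistent such as SharedPreferences, and current would be Build.FINGERPRINT.

```kotlin
// Hypothetical persistence seam; swap in a SharedPreferences-backed version on Android.
interface FingerprintStore {
    fun load(): String?
    fun save(fingerprint: String)
}

class InMemoryFingerprintStore : FingerprintStore {
    private var value: String? = null
    override fun load(): String? = value
    override fun save(fingerprint: String) { value = fingerprint }
}

enum class FingerprintCheck { FIRST_RUN, UNCHANGED, CHANGED }

fun checkFingerprint(store: FingerprintStore, current: String): FingerprintCheck {
    val previous = store.load()
    return when {
        previous == null -> {
            // Nothing stored yet: record the baseline and report a first run.
            store.save(current)
            FingerprintCheck.FIRST_RUN
        }
        previous == current -> FingerprintCheck.UNCHANGED
        // Deliberately not saved here: the caller rotates the stored value
        // only after it has decided how to respond to the change.
        else -> FingerprintCheck.CHANGED
    }
}
```

Calling this from every wake path (service start, boot receiver, heartbeat handler) is cheap, since the common case is a single string comparison.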

That is the detector. The response is where it gets interesting, because you still cannot fix anything from code. The permissions have been revoked. The app cannot silently re-grant them. The user has to go into Settings. But the user will never open the app.

So the response is an email.

When the fingerprint changes, the app checks: are the critical exemptions still present? If yes, rotate the stored fingerprint, log it, move on — benign OTA. If no, the app flips a flag (ota_degradation_mode), emails the family contact with concrete two-step instructions ("Battery → Unrestricted, Autostart → On, single visit to Settings, five minutes"), and schedules a follow-up worker for seven days out. The email does not go to the elderly user. It goes to the adult child who originally set the app up, because they are the one who can either drive over or walk the parent through it on the phone.
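The benign-versus-degraded branch can be sketched as a pure planning step. The types and names below are hypothetical, and the sketch assumes the app can read both exemption states, which for autostart is OEM-specific and may only be a best guess. The instruction strings and the seven-day follow-up come from the description above.

```kotlin
// Hypothetical snapshot of the exemptions re-checked after a fingerprint change.
data class ExemptionSnapshot(
    val batteryExempt: Boolean,
    val autostartEnabled: Boolean // assumed readable; on some OEMs this is only a heuristic
)

sealed class OtaResponse {
    /** Benign OTA: rotate the stored fingerprint, log it, move on. */
    object Benign : OtaResponse()

    /** Degraded: set ota_degradation_mode, email the family contact, schedule a follow-up. */
    data class Degraded(
        val instructions: List<String>,
        val followUpAfterDays: Int
    ) : OtaResponse()
}

fun planOtaResponse(snapshot: ExemptionSnapshot): OtaResponse {
    val instructions = mutableListOf<String>()
    if (!snapshot.batteryExempt) instructions.add("Battery → Unrestricted")
    if (!snapshot.autostartEnabled) instructions.add("Autostart → On")

    // Whether the fingerprint is also rotated in the degraded case is not
    // stated in the post, so this sketch leaves that decision to the caller.
    return if (instructions.isEmpty()) OtaResponse.Benign
    else OtaResponse.Degraded(instructions, followUpAfterDays = 7)
}
```

Returning a plan instead of sending the email directly keeps the side effects (flag, email, follow-up worker) in one place in the caller and makes the branch logic testable off-device.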

Here is where I get angry

This is my first Android project in 14 years, and my first in Kotlin. I have been a developer for a long time (I have shipped on other platforms, on servers, on the web), but I had not written a line of Kotlin before this one. I came back to Android assuming that a mainstream mobile platform in 2026 would be a reasonably solved environment. It is not. This has been, by a clear margin, the most hostile platform I have ever written code for, and I mean that in a specific way: it is hostile to the developer.

Not because Android is hard. Android is fine as a platform. The problem is that there is no platform anymore. There are fifteen platforms pretending to be one, and the differences between them are not surface-level. They are the parts of the OS that determine whether your background work runs at all. Samsung, Xiaomi, Honor, OPPO, Vivo, Asus, OnePlus — each one has a proprietary battery manager, a proprietary autostart manager, a proprietary app-standby bucket scheme, and a proprietary philosophy about what apps are allowed to do while the user is not looking. None of these behaviours are documented in a way a developer could build against. All of them change without notice, including as a side effect of an OTA.

Google's published APIs promise one thing. The OEM fork delivers something weaker. The standards are not enforced. There is no compliance suite the OEM has to pass to ship "Android." The developer is left to reverse-engineer each OEM's behaviour on real hardware, discover the regression via user bug reports or, in my case, a phone not being opened for two and a half weeks, and then add another compatibility layer that will itself need updating the next time the OEM changes its mind.

I do not think this is by accident. The OEMs compete on battery life — the review sites measure it, the spec sheets advertise it — and the cheapest way to win a battery-life benchmark is to kill background apps harder than the next manufacturer. The developer's app is not a stakeholder in this fight. The developer is collateral. We are victims of the marketing. Good benchmarks, better marketing, more sales -> annoyed developers!

And here is the thing that really bothers me. My app is not trying to serve ads. It is not mining crypto. It is not stealing the user's contacts. It is trying, on behalf of the user who installed it, to notice if an elderly person stops moving for too long and tell their family. That is the entire feature. And every single one of these OEM-specific battery managers is designed, at its core, to stop exactly this kind of work from happening. Because it looks, from the kernel's perspective, identical to the bad actor.

I cannot fix that from code. No amount of layered recovery fixes it. The OEMs can invalidate my product's promise between Tuesday and Wednesday by pushing an OTA, and I will find out about it when a family emails me to ask why grandma's phone stopped sending heartbeats three weeks ago.

I have not seen a serious Android project at my day job in years. I used to wonder why. I do not wonder anymore. If you were deciding, today, where to spend your next year of engineering time, would you pick the platform where your core value proposition can be invalidated by a silent system update from a manufacturer you have no contact with? Or would you pick the one where the platform vendor publishes the API, enforces it, and ships the update themselves?

The answer is visible in the job listings. Serious product work has moved to iOS and to the web.

What I actually learned

"Install and forget" is achievable on Android, but it is not a one-time promise — it is an ongoing fight. You can reach a state where the app survives first install, first reboot, first weekend in a drawer. You cannot reach a state where it is guaranteed to survive the next OTA you did not know was coming. The best you can do is detect degradation on the next wake and recover gracefully, through the family, without the elderly user ever noticing.

If your background work matters, you need an external wake source. Every internal recovery mechanism — AlarmManager, WorkManager, boot receivers, sticky services — is eventually rate-limited by the device's opinion of whether your app matters. A dormant install, on a dormant phone, running aggressive OEM battery management, can reach a state where nothing internal will wake it for days. An external push with a server-side cron is the only thing I have found that reliably breaks that state. FCM topic messaging plus a scheduled cron job is roughly a week of engineering. If your app has safety-critical reliability requirements, it is the cheapest week of engineering you will ever spend.

The permission you got at setup is not the permission you will have next month. Treat every exemption — battery optimization, autostart, exact alarms, foreground service type — as a fact you re-verify on every wake, not a state you establish once and assume. Compare Build.FINGERPRINT to the last-seen value, re-check exemptions when it changes, and have an out-of-band path (email, SMS, something outside the device) to tell a human when an exemption has been revoked by the OS without the user's involvement.
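The fingerprint comparison can be sketched in a few lines of plain Kotlin. This is only an illustration under assumptions: FingerprintStore and buildChangedSinceLastWake are hypothetical names, the store would be backed by something persistent such as SharedPreferences on a real device, and the caller is assumed to pass in Build.FINGERPRINT.

```kotlin
// Hypothetical persistence abstraction (assumption); on Android this could
// wrap SharedPreferences so the value survives process death.
interface FingerprintStore {
    var lastSeen: String?
}

// In-memory stand-in, only so the sketch runs outside Android.
class InMemoryFingerprintStore(override var lastSeen: String? = null) : FingerprintStore

/**
 * Returns true when the build fingerprint differs from the last one recorded
 * (a first run counts as "changed"), and records the current value.
 * On a device the caller would pass Build.FINGERPRINT.
 */
fun buildChangedSinceLastWake(store: FingerprintStore, currentFingerprint: String): Boolean {
    val changed = store.lastSeen != currentFingerprint
    store.lastSeen = currentFingerprint
    return changed
}
```

A true result is the cue to re-check every exemption and, if one has been lost, to notify a human through the out-of-band path.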

Silent degradation is worse than a crash. This echoes the last piece I wrote about a different bug, and it keeps being true. If the app had crashed after the OTA, Play Console would have told me. It did not crash. It kept running, doing nothing, drawing its minimum-importance notification, while the thing it was supposed to protect against — an elderly person falling in a quiet house — was no longer being monitored. There is no Play Console alert for "app is running but has stopped doing anything useful." You have to build that alert yourself. I have now built three of them, and I expect to build a fourth before the year is out.
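One way to picture such an alert is a staleness check on the last piece of useful work, rather than on whether the process is alive. The sketch below is an assumption-laden illustration, not the app's actual code: Liveness and classifyLiveness are made-up names, and the maximum tolerated silence is whatever threshold fits the product.

```kotlin
import java.time.Duration
import java.time.Instant

enum class Liveness { HEALTHY, SILENT }

/**
 * Classifies a device by the age of its last useful write (for example, the
 * last location row that reached the database). A missing timestamp is
 * treated as SILENT. Names and threshold are illustrative assumptions.
 */
fun classifyLiveness(lastUsefulWrite: Instant?, now: Instant, maxSilence: Duration): Liveness {
    if (lastUsefulWrite == null) return Liveness.SILENT
    val silence = Duration.between(lastUsefulWrite, now)
    return if (silence > maxSilence) Liveness.SILENT else Liveness.HEALTHY
}
```

Run from somewhere that does not depend on the phone being awake, a SILENT result is what triggers the email to the family.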

The app is called "How Are You?! Senior Safety". It is on Google Play. If you have shipped an Android app that has to run continuously without the user opening it, I would like to hear what broke for you and how you found out.

A single Kotlin lambda silently broke my app for 21 hours - and I only found the bug because I crossed a border

Stoyan Minchev — Sun, 12 Apr 2026 20:38:19 +0000

I build a safety-critical Android app that monitors elderly people living alone. It watches their phone 24/7 — motion, GPS, screen activity — and emails their family when something looks wrong. No buttons to press, no wearable to charge. Install it on grandma's phone and forget about it.

I took a trip from Bulgaria to Romania in early April to test the app in real conditions and have a small vacation with my family. I drove across the Danube at the Vidin - Calafat bridge. Everything was working fine. Then at 14:55, the app went completely silent.

Not crashed. Not killed by the OS. Silent.

For the next 21 hours and 42 minutes, the motion sensor recorded 682 events. The GPS hardware was acquiring satellite fixes with 11-meter accuracy. The app was running, awake, doing its job. But not a single location reached the database.

The next morning, the AI looked at the last known position — a border crossing — and the 12-hour data gap, and did what it was designed to do: it sent an URGENT alert. Except I was fine. I was in Craiova, 200km away, sleeping in a hotel. The alert was anchored to a stale coordinate from the previous afternoon.

I spent two days tracing this. The root cause was one line of Kotlin.

The interface that lies to you

Android's Geocoder class converts GPS coordinates into street addresses. On API 33+, there's an async callback version:

geocoder.getFromLocation(latitude, longitude, 1) { addresses ->
    // do something with the result
}

That trailing lambda is Kotlin's SAM (Single Abstract Method) conversion. It looks clean. It compiles. It works perfectly — until it doesn't.

The interface behind this lambda is Geocoder.GeocodeListener:

public interface GeocodeListener {
    void onGeocode(@NonNull List<Address> addresses);

    default void onError(@Nullable String errorMessage) { }
}

See that second method? onError has a default empty implementation. When you use a SAM lambda, Kotlin only implements the single abstract method — onGeocode. The default onError stays empty.

So what happens when geocoding fails? Network timeout. No roaming data after crossing a border. Play Services killed by the OEM battery manager. Any of a dozen things that go wrong on real Android devices in real countries.

The framework calls onError(). The empty default runs. Nothing happens. The continuation is never resumed. The coroutine hangs forever.
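The trap can be reproduced without Android at all. The sketch below uses a made-up Kotlin fun interface, ResultListener, as a stand-in for GeocodeListener, and a hypothetical failingLookup that always reports failure through onError. It is an illustration of the mechanism, not the framework's code.

```kotlin
// Hypothetical stand-in for Geocoder.GeocodeListener: one abstract method
// plus an error callback with an empty default body.
fun interface ResultListener {
    fun onResult(value: String)
    fun onError(message: String?) { /* no-op default */ }
}

// Hypothetical service that always fails and reports it via onError.
fun failingLookup(listener: ResultListener) {
    listener.onError("network unreachable")
}

// SAM lambda: only onResult is implemented, so the failure is swallowed.
fun lambdaHearsFailure(): Boolean {
    var notified = false
    failingLookup { notified = true }
    return notified
}

// Explicit object: both callbacks are implemented, so the failure is seen.
fun objectHearsFailure(): Boolean {
    var notified = false
    failingLookup(object : ResultListener {
        override fun onResult(value: String) { notified = true }
        override fun onError(message: String?) { notified = true }
    })
    return notified
}
```

The lambda version never learns that the lookup failed, which is the same situation as a continuation that is never resumed.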

Why it killed everything, not just geocoding

If the geocoder had hung in isolation, it would have been a minor bug — one address lookup fails, you move on. But my code looked like this:

processLocationMutex.withLock {
    val address = reverseGeocode(latitude, longitude)  // hangs here
    insertLocationData(location, address)
}

The processLocationMutex exists for a good reason. Four independent systems can trigger a GPS write at the same time — the stillness detector, the periodic scheduler, the force probe, and the area stability detector. Without the mutex, they race on the stationarity filter and insert duplicate rows that defeat the drive-through filtering logic.

But when reverseGeocode() hung, the mutex was held forever. Every subsequent GPS fix from every trigger path called processLocation(), tried to acquire the mutex, and blocked. Behind a coroutine that would never wake up.

No exception. No crash. No log entry. Just a growing queue of frozen coroutines, each holding a perfectly good satellite fix that would never reach the database.

The motion sensor kept firing. The GPS kept acquiring. The diagnostic logs show two successful HIGH_ACCURACY fixes at 21:37 and 21:38 — 11-meter accuracy, acquired in 2.5 seconds — both of which entered processLocation() and silently queued behind the hung mutex holder from 7 hours earlier.

The only recovery was killing the process

At 12:19 the next day — almost 22 hours after the hang started — I force-stopped the app from Android settings. The process died. The singleton mutex died with it. On restart, everything worked again.

But by then, the damage was done. The AI had already sent a false URGENT alert based on 12-hour-old coordinates. And a weekly re-calibration job had run during the trip, learning the border crossing drive-through as a "frequent location," which caused a cascade of further false alerts over the following days.

One hung lambda. One stale coordinate. Days of downstream consequences.

The fix has three layers

I don't trust single fixes for problems that can kill 21 hours of data.

Layer 1: Explicit object, both methods implemented.

val listener = object : Geocoder.GeocodeListener {
    override fun onGeocode(addresses: MutableList<Address>) {
        if (!hasResumed && continuation.isActive) {
            hasResumed = true
            continuation.resume(formatAddress(addresses.firstOrNull()))
        }
    }

    override fun onError(errorMessage: String?) {
        if (!hasResumed && continuation.isActive) {
            hasResumed = true
            continuation.resume(null)
        }
    }
}

No SAM conversion. Both callbacks resume the continuation. The hasResumed flag guards against the race where both fire, or either fires after timeout.

Layer 2: Hard timeout ceiling.

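withTimeoutOrNull(10_000L) {
    suspendCancellableCoroutine<String?> { continuation ->
        // `listener` is the Layer 1 object; it has to be created in this scope
        // so that it captures `continuation`
        geocoder.getFromLocation(latitude, longitude, 1, listener)
    }
}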

Even if some future Android version adds a third callback method with another empty default, the coroutine dies after 10 seconds.
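Layers 1 and 2 can be put together in one function. The sketch below is a hedged approximation that runs off-device: LookupListener and AsyncLookup are hypothetical stand-ins for Geocoder.GeocodeListener and the geocoder call, an AtomicBoolean plays the role of the hasResumed flag, and it assumes kotlinx-coroutines-core is on the classpath.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean
import kotlin.coroutines.resume
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.suspendCancellableCoroutine
import kotlinx.coroutines.withTimeoutOrNull

// Hypothetical stand-in for Geocoder.GeocodeListener.
interface LookupListener {
    fun onResult(addresses: List<String>)
    fun onError(message: String?) {}
}

// Hypothetical stand-in for the asynchronous geocoder call.
fun interface AsyncLookup {
    fun lookup(listener: LookupListener)
}

/** Returns the first address, or null on error or after timeoutMs. */
suspend fun lookupOrNull(source: AsyncLookup, timeoutMs: Long = 10_000L): String? =
    withTimeoutOrNull(timeoutMs) {
        suspendCancellableCoroutine<String?> { continuation ->
            // Guards against both callbacks firing, possibly on different threads.
            val resumed = AtomicBoolean(false)
            source.lookup(object : LookupListener {
                override fun onResult(addresses: List<String>) {
                    if (resumed.compareAndSet(false, true)) {
                        continuation.resume(addresses.firstOrNull())
                    }
                }

                override fun onError(message: String?) {
                    if (resumed.compareAndSet(false, true)) {
                        continuation.resume(null)
                    }
                }
            })
        }
    }
```

A source that never calls back simply yields null once the timeout expires, so the caller always gets control back.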

Layer 3: Geocoding moved outside the mutex.

// Geocoding is slow and can hang — never inside the mutex
val address = reverseGeocodingService.reverseGeocode(lat, lng)

// Only the database insert is protected (50ms critical section, not 10s+)
val acquired = withTimeoutOrNull(60_000L) {
    processLocationMutex.withLock {
        insertLocationData(location, address)
    }
}

The mutex timeout is a tripwire. If something else wedges the lock in the future, we log a diagnostic error and drop the fix rather than queuing forever.

What I actually learned

SAM conversion is not a convenience. It's a contract you didn't read. When you write a trailing lambda, you're implementing one method and accepting the defaults for everything else. If those defaults are no-ops, you've written code that silently drops errors. The compiler won't warn you. The IDE won't flag it. It works perfectly until it doesn't.

The scary part is that GeocodeListener isn't unusual. Android has dozens of interfaces with default error methods. WebViewClient is a class rather than an interface, so it cannot be written as a lambda, but its onReceivedError() is just as empty until you override it. MediaPlayer.OnErrorListener has patterns where partial implementation looks complete. Every SAM-converted lambda on an interface with default methods is a potential silent failure.

Mutexes amplify hangs into outages. A 10-second geocoding timeout would have been invisible — one null address, one row without a street name, nobody notices. But a mutex turned a local hang into a system-wide 21-hour data loss. If you're using a mutex to serialize writes, the critical section should contain only writes. Anything that touches the network, the filesystem, or a third-party service belongs outside the lock.

Silent failures are worse than crashes. If the geocoder had thrown an exception, I would have found it in the first hour. Instead, it hung — producing no error, no log, no crash report. The only evidence was the absence of data in a database table. In a safety-critical app that monitors whether elderly people are still moving, silence is the most dangerous failure mode there is.

The app is called "How Are You?! Senior Safety". It will be released soon, once I am confident that no bad surprises are left. Have you ever been bitten by a default interface method you didn't know existed?

I built a 126K-line Android app with AI — here is the workflow that actually worked for me

Stoyan Minchev — Sun, 29 Mar 2026 08:58:53 +0000

Most developers trying AI coding tools hit the same wall. They open a chat, type "build me a todo app," get something that looks right, and then spend 3 hours fixing the mess. They try again with a bigger project and it falls apart faster. They conclude AI coding is overhyped.

I had the same experience. Then I changed my approach — not the tool, the process around it.

Over 4 months I built How Are You?!, a safety-critical Android app that monitors elderly people living alone. 126,000 lines of Kotlin. 144 versions. 130 test files. 3 languages. Solo developer with zero Kotlin experience when I started. The entire codebase was AI-generated — I never wrote Kotlin manually.

This article is not about the app. It is about the workflow that made this possible.

Why most people fail with AI coding

Two reasons:

Expectations are wrong. People expect to describe a feature in plain English and get production code. That works for a function. It does not work for a system. AI is not a replacement for engineering — it is an amplifier. If your input is vague, the output is vague.
No structure around the AI. They open a chat, prompt, get code, paste it, prompt again. There is no architecture. No shared context. No accumulated knowledge. Every conversation starts from zero.

The fix is not better prompting. It is better engineering process — with the AI as a participant.

Step 1: Architecture before code (BMAD)

Before writing a single line of code, I used BMAD (a structured methodology for AI-assisted development) to create:

Product Requirements Document — what the app does, who it is for, what the constraints are
Architecture document — module boundaries, layer responsibilities, error handling patterns, data flow
Project context — coding standards, naming conventions, DO/DON'T lists

This took about a week. It felt slow. It was the most valuable week of the entire project.

Why? Because every conversation with the AI after that point had a shared foundation. The AI was not guessing what my app looked like — it knew. Module boundaries were defined. Error handling was standardized. The AI could generate code that fit into a real system because the system was documented.

Without architecture docs, AI generates code that looks correct in isolation but conflicts with everything else. You spend all your time merging inconsistent outputs instead of building features.

Step 2: CLAUDE.md — the constitution

Claude Code loads a CLAUDE.md file from your project root at the start of every conversation. This is the most important file in my repository.

Mine contains:

Module boundaries enforced by Gradle (which module can import what)
Core patterns (all use cases return Result<T>, ViewModels expose StateFlow, never GlobalScope)
Critical DON'Ts — a condensed list of rules that came from production bugs
Subsystem quick reference — a table pointing to detailed rules for each area (AlarmManager, sensors, AI, email, billing, GPS, permissions)

Every rule in that file exists because I violated it once and something broke. The file grows with the project.

This is the key insight: CLAUDE.md turns one-time lessons into permanent constraints. The AI never forgets a rule I put there. I forget constantly.

Step 3: Living documentation with start/stop commands

I built custom slash commands that bookend every development session:

/howareyou-start — loads the developer briefing, critical rules, release notes, and current version. The AI reads everything before I write a single prompt. It takes 30 seconds and prevents 80% of the mistakes I used to make.

/howareyou-stop — updates release notes, archives old entries, updates CRITICAL_DONTS.md with any new lessons, updates the developer briefing, bumps the version, commits, and pushes.

The documentation is never stale because updating it is part of the release process, not a separate task. I do not update docs manually. The AI does it as part of shipping.

This creates a flywheel: better docs -> better AI output -> fewer bugs -> lessons captured -> better docs.

Step 4: Concrete technical specs

When I need a new feature, I do not say "add travel detection." I use BMAD's tech spec workflow to produce a document that specifies:

Exact state machine (HOME -> DAY_1 -> TRAVELING -> TRIP_ENDED)
Database schema changes (table names, column types, indexes)
Which existing classes are affected and how
Edge cases and error handling
What tests to write

The spec is 2-5 pages. Writing it takes 30 minutes with BMAD's guided conversation. It saves hours of back-and-forth with the AI during implementation and eliminates the "it generated something but it does not fit" problem.

The rule: if I cannot describe the feature precisely enough for a spec, I am not ready to build it. I brainstorm first (also with the AI), then spec, then build.

Step 5: Brainstorming sessions

I use BMAD brainstorming for everything — not just code. Pricing strategy. UX decisions. Marketing approaches. Whether to support SMS notifications or stick with email.

The pattern: open a session, describe the problem, let the AI challenge my assumptions. I keep the transcripts. Some of my best architectural decisions came from brainstorming sessions where the AI pointed out an edge case I had not considered.

Step 6: Automated audits that run weekly

My app has to survive Android OEM battery killers (Samsung, Xiaomi, Honor, OPPO — they all kill background apps differently). These OEMs ship updates constantly that can break my compatibility layer.

I built three audit commands: two focused ones and a third that combines them.

/howareyou-oem-audit — searches the web for recent OEM changelog entries and breaking changes, then scans my codebase for affected areas and proposes fixes.

/howareyou-gps-audit — does the same for GPS and location API changes (FusedLocationProvider updates, OEM GPS power management changes).

/howareyou-full-audit — runs both in parallel and produces a combined report with a prioritized action plan.

I run these weekly. They have caught breaking changes before they hit my users — Samsung silently resetting battery optimization exemptions after OTA updates, Honor changing wakelock tag whitelisting behavior, Google deprecating location API parameters.

This is the kind of thing that would take a human developer hours of manual searching. The AI does it in minutes and maps the findings directly to my source code.

Step 7: One-command publishing

/howareyou-build-test    → builds signed release AAB
/howareyou-publish-testingMode  → uploads to Google Play internal + closed testing

From "the code is ready" to "testers have the update" in under 5 minutes, without leaving the terminal. No browser, no Play Console clicking.

Step 8: Infrastructure monitoring

I use 6 Google Cloud projects for Gemini API key rotation (each project gets 10K free requests/day — 60K total). Things break. Billing gets disabled. Keys expire.

/howareyou-monitor — checks all 6 shards, reports which are healthy, which failed, and why.

/howareyou-fix-billing — automatically re-links disabled shards to the shared billing account.

These are not development tasks. They are operational tasks that I handle from the same terminal where I write code.

Step 9: Code reviews with a second model

After implementing a feature, I run a code review using BMAD's adversarial review workflow. It is configured to find 3-10 specific problems in every review — it never says "looks good." It checks:

Architecture compliance (are module boundaries respected?)
Test coverage (are edge cases tested?)
Security (any hardcoded keys? SQL injection? XSS?)
Performance (unnecessary allocations? missing indexes?)
Consistency with project patterns

This catches things I miss because I have been staring at the code for hours. The adversarial framing is important — a review that always approves is useless.

Step 10: Lessons learned as a living document

Every production bug becomes a rule in CRITICAL_DONTS.md. The file is organized by subsystem:

AlarmManager: never call setAlarmClock() more than 3x/day (Honor flags you)
Sensor: always flush FIFO and discard stale readings (Honor rebases timestamps)
Email: per-recipient sends, never batch (Resend delivery tracking breaks)
GPS: full priority fallback chain, never trust a single getCurrentLocation() call

There are 50+ rules in that file. Each one has a version number (when it was added) and a rationale (why it matters). The AI reads this file at the start of every session via the /howareyou-start command.

This is the most underrated part of the workflow. Most developers keep lessons in their head. Heads forget. Files do not.

The daily workflow

Here is what a typical development day looks like:

/howareyou-start — AI loads all context (30 seconds)
Describe the task — with a tech spec if it is a feature, or a bug description if it is a fix
AI implements — I review the diff, run tests
Iterate — usually 1-3 rounds
/howareyou-stop — docs updated, version bumped, committed, pushed
/howareyou-publish-testingMode — testers have the update

I ship multiple versions per day with this flow. Not because I rush — because the overhead between "code works" and "testers have it" is near zero.

What this is NOT

It is not "no-code." If you know the language, it is worth reviewing the output and correcting it where needed. Over time, fewer of these small fixes are required. It always pays to understand the architecture and to make the design decisions yourself.
It is not effortless. The workflow took months to build. The documentation is extensive.
It is not magic. The AI makes mistakes. The difference is that mistakes are caught by the process (tests, reviews, rules, audits) instead of by users.

The numbers

126,000 lines of Kotlin across 398 files
45,000 lines of tests across 130 files
144 versions shipped
3 languages (English, Bulgarian, German)
50+ production lessons captured in CRITICAL_DONTS.md
4 months from zero Kotlin experience to production app on Google Play
9 custom commands automating the full development lifecycle
0 lines of Kotlin written manually by me

The takeaway

AI coding tools are not magic code generators. They are force multipliers for engineering process. If your process is "open chat, type prompt, hope for the best," you will be disappointed.

If your process is "document the architecture, define the rules, automate the lifecycle, capture every lesson, review everything adversarially" — the AI becomes unreasonably effective.

The investment is not in better prompts. It is in better engineering.

The app is How Are You?! — AI safety monitoring for elderly parents. It will be released soon. The code workflow described here uses Claude Code with BMAD. Both are tools I use daily and genuinely recommend.

What Android OEMs do to background apps, and the 11 layers I built to survive it

Stoyan Minchev — Mon, 23 Mar 2026 12:02:33 +0000

I spent over a year building a safety monitoring app that runs 24/7 on elderly parents' phones. If it gets killed, nobody gets alerted when something goes wrong. That constraint forced me into the deepest, most frustrating corners of Android background execution.

This article covers what I learned about how Samsung, Xiaomi, Honor, OPPO, and Vivo actively kill background apps, why the standard Android approach is nowhere near sufficient, and the 11-layer recovery architecture I ended up building. I will also cover two related problems that surprised me: GPS hardware that silently stops working, and accelerometer data that lies about its age.

126,000 lines of Kotlin, 125+ versions, solo developer. The app is called How Are You?! — it learns an elderly person's daily routine over 7 days, then monitors around the clock and emails the family if something seems wrong. But this article is about the engineering, not the product.

The problem: Android wants your app dead

Stock Android already makes continuous background work difficult. Doze mode, App Standby, background execution limits — Google has been tightening the screws since Android 6. A foreground service with REQUEST_IGNORE_BATTERY_OPTIMIZATIONS is the standard answer.

That is necessary. It is nowhere near sufficient.

OEMs add their own proprietary battery management on top of stock Android, and they are far more aggressive. Here is what I encountered on the devices I tested:

Samsung maintains a "Sleeping Apps" list. If your app has no foreground activity for 3 days, Samsung kills it. OTA updates silently reset your battery optimization exemption. The user opted you out of optimization? Samsung un-opted you after the update.

Xiaomi (MIUI/HyperOS) kills background services aggressively and resets autostart permissions after OTA updates. Your app was whitelisted? Not anymore.

Honor and Huawei have PowerGenie, which monitors how often your app wakes the system. Call setAlarmClock() more than about 3 times per day and you get flagged as "frequently wakes your system." They also have HwPFWService, which kills apps holding wakelocks longer than 60 minutes with non-whitelisted tags.

OPPO (ColorOS) has "Sleep standby optimization" that freezes apps during the hours the phone detects the user is sleeping. A safety monitoring app for elderly people needs to run especially during sleep hours — that is when falls and medical events go unnoticed.

Vivo (Funtouch OS) has "AI sleep mode" that does the same thing.

Each manufacturer found a different way to kill you. No single workaround survives all of them.

The answer: 11 layers of recovery

The core insight is that no single mechanism is reliable across all OEMs and all device states. The answer is redundancy — each layer catches the failures of the layers above it.

Layer 1: Foreground service with START_STICKY

The foundation. startForeground() with a persistent notification. The notification channel must use IMPORTANCE_MIN — not IMPORTANCE_DEFAULT or higher. Why? OEMs auto-grant POST_NOTIFICATIONS on higher importance channels, bypassing the user's notification settings and making your persistent notification visible. IMPORTANCE_MIN keeps it silent while startForeground() still gives your process elevated priority.

START_STICKY tells the system to restart the service after a kill. But "restart" can take minutes or never happen on aggressive OEMs.

Layer 2: onDestroy recovery scheduling

When the system kills your service, onDestroy() fires (most of the time). Use this 50ms window to schedule everything that will bring you back:

override fun onDestroy() {
    super.onDestroy()
    ServiceWatchdogReceiver.scheduleWithBackup(this)
    MotionSnapshotReceiver.schedule(this)
}

This schedules both the AlarmManager chain and the motion snapshot chain. If onDestroy() does not fire (force-stop, OEM kill without callback), the other layers cover it.

Layer 3: AlarmManager watchdog chain

A self-chaining setExactAndAllowWhileIdle() alarm at 15-minute intervals during active use. When it fires, it checks whether the service is alive and restarts it if not.

The interval adapts to power state: 15 minutes when active, 30 minutes when idle, 60 minutes during deep sleep. This matters for OEM battery scoring — more frequent alarms get flagged.
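The interval selection itself is small enough to sketch. PowerState and the two function names below are assumptions made for illustration; only the 15/30/60-minute values come from the text above.

```kotlin
// Hypothetical power states; how the app detects them is not shown here.
enum class PowerState { ACTIVE, IDLE, DEEP_SLEEP }

// Watchdog interval in minutes for each power state.
fun watchdogIntervalMinutes(state: PowerState): Long = when (state) {
    PowerState.ACTIVE -> 15L
    PowerState.IDLE -> 30L
    PowerState.DEEP_SLEEP -> 60L
}

// Absolute trigger time for the next link in the self-chaining alarm.
fun nextTriggerAtMillis(nowMillis: Long, state: PowerState): Long =
    nowMillis + watchdogIntervalMinutes(state) * 60_000L
```

Each time the alarm fires, the receiver would compute the next trigger this way and pass it to setExactAndAllowWhileIdle().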

Important: never use Handler.postDelayed() as a replacement for AlarmManager. Handlers do not fire during CPU deep sleep. I learned this the hard way.

Layer 4: WorkManager periodic watchdog

A PeriodicWorkRequest at 15-minute intervals that does the same thing — checks the service and restarts if needed. WorkManager survives service kills and uses JobScheduler under the hood, which OEMs are more reluctant to interfere with.

But there is a subtle trap: ExistingPeriodicWorkPolicy.KEEP silently discards new requests if a worker is already enqueued, even if the existing one has a stale timer from hours ago. And REPLACE resets the countdown every time you call schedule(). The solution: query getWorkInfosForUniqueWork() first and only schedule when the worker is not already enqueued.

val workInfos = workManager
    .getWorkInfosForUniqueWork(WATCHDOG_WORK_NAME).await()
val isEnqueued = workInfos.any {
    it.state == WorkInfo.State.ENQUEUED || it.state == WorkInfo.State.RUNNING
}
if (!isEnqueued) {
    workManager.enqueueUniquePeriodicWork(
        WATCHDOG_WORK_NAME,
        ExistingPeriodicWorkPolicy.KEEP,
        watchdogRequest
    )
}

Layer 5: Boot recovery

BOOT_COMPLETED, LOCKED_BOOT_COMPLETED, QUICKBOOT_POWERON, and MY_PACKAGE_REPLACED receivers that re-establish the service and all alarm chains after reboot or app update.

Some OEMs reset permissions after OTA updates. OnePlus, Samsung, Xiaomi, Redmi, and POCO all do this. You need to detect the OTA and re-prompt the user for battery optimization exemption.

Layer 6: SyncAdapter for process priority

ContentResolver.addPeriodicSync() gives your process elevated priority through the sync framework. OEMs are reluctant to kill sync adapter processes because the sync framework is a system concept — killing it could break contacts, calendar, and email sync.

This is a ~1-hour periodic callback that checks service health. It will not bring you back fast, but it is extremely hard for OEMs to suppress.

Layer 7: AlarmClock safety net

setAlarmClock() at 8-hour intervals — approximately 3 calls per day. This is the nuclear option. AlarmClock alarms get the highest delivery priority on Android because they are designed to wake users up.

Why 8 hours and not shorter? Honor's PowerGenie specifically tracks AlarmClock frequency. At 15-minute intervals, it flags you as "frequently wakes your system" and kills you. At 8-hour intervals (~3/day), you fly under the radar.

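fun scheduleSafetyNet(context: Context) {
    val alarmManager = context.getSystemService(AlarmManager::class.java)
    // Broadcast target is assumed here to be the watchdog receiver
    val broadcast = Intent(context, ServiceWatchdogReceiver::class.java)
    val operation = PendingIntent.getBroadcast(
        context, REQUEST_CODE_ALARMCLOCK, broadcast,
        PendingIntent.FLAG_UPDATE_CURRENT or PendingIntent.FLAG_IMMUTABLE
    )
    val triggerAt = System.currentTimeMillis() + SAFETY_NET_INTERVAL_MS // 8 hours
    alarmManager.setAlarmClock(
        AlarmManager.AlarmClockInfo(triggerAt, null),
        operation
    )
}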

Layer 8: Exact alarm permission recovery

When the user revokes SCHEDULE_EXACT_ALARM, all pending AlarmManager chains die silently. No callback, no exception. Your watchdog, your snapshot receiver, your safety net — all gone.

Listen for ACTION_SCHEDULE_EXACT_ALARM_PERMISSION_STATE_CHANGED and re-establish everything on re-grant:

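class ExactAlarmPermissionReceiver : BroadcastReceiver() {
    override fun onReceive(context: Context, intent: Intent) {
        // canScheduleExactAlarms() lives on AlarmManager (API 31+)
        val alarmManager = context.getSystemService(AlarmManager::class.java)
        if (alarmManager.canScheduleExactAlarms()) {
            ServiceWatchdogReceiver.scheduleWithBackup(context)
            MotionSnapshotReceiver.schedule(context)
        }
    }
}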

Layer 9: Batched accelerometer sensing

This is the layer that surprised me most. Keep the accelerometer registered with maxReportLatencyUs during idle and deep sleep. The sensor HAL continuously samples into a hardware FIFO buffer and delivers readings via a sensor interrupt — this is completely invisible to OEM battery managers because it does not use AlarmManager, WorkManager, or any schedulable mechanism.

sensorManager.registerListener(
    batchedMotionListener,
    accelerometer,
    SensorManager.SENSOR_DELAY_NORMAL,
    maxReportLatencyUs  // 10 min in deep sleep
)

The HAL batches readings and delivers them all at once when the buffer fills or the latency expires. You get continuous motion awareness with zero wakes visible to the OEM.

One gotcha: a single SLIGHT_MOVEMENT reading (1.0-3.0 m/s^2) should not exit batched mode. Table vibrations and building micro-movements produce transient spikes. I require 3 consecutive SLIGHT_MOVEMENT readings (~15 seconds) before exiting. Anything above 3.0 m/s^2 (MODERATE_MOVEMENT) exits immediately.
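The exit rule can be written as a small counter. This is a sketch under assumptions: the class and function names are invented, the input is assumed to be a gravity-compensated acceleration magnitude in m/s^2, and a reading below the slight-movement band is assumed to reset the streak.

```kotlin
enum class Movement { NONE, SLIGHT, MODERATE }

// Thresholds from the text: 1.0-3.0 m/s^2 is slight, above 3.0 is moderate.
fun classifyMovement(magnitude: Double): Movement = when {
    magnitude > 3.0 -> Movement.MODERATE
    magnitude >= 1.0 -> Movement.SLIGHT
    else -> Movement.NONE
}

class BatchedModeExitDetector(private val requiredSlightReadings: Int = 3) {
    private var consecutiveSlight = 0

    /** Feeds one reading; returns true when batched mode should be exited. */
    fun onReading(magnitude: Double): Boolean = when (classifyMovement(magnitude)) {
        Movement.MODERATE -> {
            consecutiveSlight = 0
            true
        }
        Movement.SLIGHT -> {
            consecutiveSlight++
            if (consecutiveSlight >= requiredSlightReadings) {
                consecutiveSlight = 0
                true
            } else {
                false
            }
        }
        Movement.NONE -> {
            consecutiveSlight = 0
            false
        }
    }
}
```

A lone vibration spike followed by stillness never reaches the required streak, so the device stays in batched mode.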

Layer 10: Network restoration and app foreground triggers

CONNECTIVITY_ACTION receiver triggers a service health check when the network comes back. ProcessLifecycleOwner fires when the user opens the app. These are opportunistic — they catch edge cases where the service died during airplane mode or extended offline periods.

Layer 11: User-facing gap detection

When all 10 layers fail (and on some devices, they do), the app detects the gap and shows the user device-specific instructions: "Your [Manufacturer] phone is stopping background apps. Open Settings > Battery > [OEM-specific path] and disable optimization for How Are You?!"

This is the least satisfying layer because it requires user action. But on a few particularly aggressive OEM configurations, it is the only thing that works.

The wakelock tag problem on Honor

HwPFWService on Honor and Huawei devices maintains a whitelist of allowed wakelock tags. If your app holds a wakelock for more than 60 minutes with a tag that is not on the whitelist, HwPFWService kills your app.

The solution is embarrassingly simple: use a whitelisted tag on Honor/Huawei, your real tag everywhere else.

private val WAKELOCK_TAG: String = run {
    val manufacturer = Build.MANUFACTURER?.lowercase().orEmpty()
    if (manufacturer == "huawei" || manufacturer == "honor") {
        "LocationManagerService"  // Whitelisted by HwPFWService
    } else {
        "HowAreYou:PulseBurst"
    }
}

LocationManagerService is whitelisted because it is a system service tag. I am not proud of this, but it works.

getCurrentLocation() hangs forever

Once I had the service staying alive, I discovered a second problem: GPS does not work when you need it.

At approximately 12% battery on my Honor test device, the OEM battery saver silently cut off access to the GPS hardware. No exception, no error callback, no log entry. The foreground service was alive and the accelerometer worked, but getCurrentLocation(PRIORITY_HIGH_ACCURACY) simply never completed. The Task from Play Services hung indefinitely: neither the success listener nor the failure listener ever fired.

The code fell back to getLastLocation(), which returned a 5-hour-old cached position from a completely different city.

Fix 1: Always timeout

Every getCurrentLocation() call must be wrapped in a coroutine timeout:

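suspend fun getLocation(priority: Int): Location? {
    // One token per request, so a timeout can cancel the Play Services call
    val cts = CancellationTokenSource()
    return withTimeoutOrNull(30_000L) {
        suspendCancellableCoroutine { cont ->
            // Runs when the timeout fires or the caller is cancelled
            cont.invokeOnCancellation { cts.cancel() }
            fusedClient.getCurrentLocation(priority, cts.token)
                .addOnSuccessListener { cont.resume(it) }  // may deliver null
                .addOnFailureListener { cont.resume(null) }
        }
    }
}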

Fix 2: Priority fallback chain

GPS hardware being dead does not mean all location sources are dead. Cell towers and Wi-Fi still work. I built a sequential fallback:

PRIORITY_HIGH_ACCURACY (GPS, ~10m)
    | timeout or null
PRIORITY_BALANCED_POWER_ACCURACY (Wi-Fi + cell, ~40-300m)
    | timeout or null
PRIORITY_LOW_POWER (cell only, ~300m-3km)
    | timeout or null
getLastLocation() (cached, any age)
    | null
TotalFailure

Each step gets its own 30-second timeout. In practice, when GPS is killed, BALANCED_POWER_ACCURACY returns in 2-3 seconds because Wi-Fi scanning still works.
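The control flow of the chain can be sketched in plain Kotlin. This is an illustration rather than the app's code: Fix stands in for android.location.Location, and each fetch lambda is assumed to wrap the real Play Services call with its own 30-second timeout and to return null on timeout or failure. All type and function names are hypothetical.

```kotlin
// Hypothetical stand-in for android.location.Location.
data class Fix(val accuracyMeters: Float, val ageMs: Long)

enum class Source { HIGH_ACCURACY, BALANCED_POWER_ACCURACY, LOW_POWER, LAST_LOCATION }

// fetch is assumed to apply its own timeout and return null when it gives up.
class LocationSource(val source: Source, val fetch: () -> Fix?)

data class ChainResult(val source: Source, val fix: Fix)

/** Tries each source in order; returns null when every source fails (TotalFailure). */
fun resolveLocation(sources: List<LocationSource>): ChainResult? {
    for (candidate in sources) {
        val fix = candidate.fetch() ?: continue   // timeout or null: try the next source
        return ChainResult(candidate.source, fix)
    }
    return null
}

fun main() {
    val result = resolveLocation(
        listOf(
            LocationSource(Source.HIGH_ACCURACY) { null },                   // GPS suspended
            LocationSource(Source.BALANCED_POWER_ACCURACY) { Fix(80f, 0L) }, // Wi-Fi answers
            LocationSource(Source.LOW_POWER) { Fix(1500f, 0L) },
            LocationSource(Source.LAST_LOCATION) { Fix(10f, 18_000_000L) },
        )
    )
    println(result?.source)
}
```

Keeping the source alongside the fix means the caller never has to guess which step of the chain produced the position.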

Fix 3: GPS wake probe

Sometimes the GPS hardware is not permanently dead — it has been suspended by the battery manager. A brief requestLocationUpdates call can wake it:

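if (hoursSinceLastFreshGps > 4) {
    val probeRequest = LocationRequest.Builder(
        Priority.PRIORITY_HIGH_ACCURACY, 1000L
    )
        .setDurationMillis(5_000L)
        .setMaxUpdates(5)
        .build()

    // requestLocationUpdates() returns immediately, so wrapping it in a
    // timeout does not wait for fixes. Keep the request alive for the
    // probe window, then always remove the callback.
    fusedClient.requestLocationUpdates(probeRequest, callback, looper)
    try {
        delay(6_000L)
    } finally {
        fusedClient.removeLocationUpdates(callback)
    }
}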

Five seconds, maximum once every 4 hours. On Honor, this recovers the GPS roughly 40% of the time.

Fix 4: Explicit outcome types

The original code returned Location?. The caller had no way to distinguish a fresh 10-meter GPS fix from a 5-hour-old cached position. I changed the return type to make the quality of data explicit:

sealed interface GpsLocationOutcome {
    data class FreshGps(val accuracy: Float) : GpsLocationOutcome
    data class WakeProbeSuccess(val accuracy: Float) : GpsLocationOutcome
    data class CellFallback(val accuracy: Float) : GpsLocationOutcome
    data class StaleLastLocation(val ageMs: Long) : GpsLocationOutcome
    data object TotalFailure : GpsLocationOutcome
}

Now the consumer can make informed decisions. A 3km cell tower reading is low precision, but it answers "is this person in the expected city or 200km away?" For a safety app, that distinction matters.
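As an illustration of what a consumer might do with these types, the sketch below maps each outcome to a coarse trust level. The sealed interface is repeated so the example is self-contained; the Trust enum, the trustOf function and the one-hour staleness cutoff are hypothetical and not taken from the app.

```kotlin
sealed interface GpsLocationOutcome {
    data class FreshGps(val accuracy: Float) : GpsLocationOutcome
    data class WakeProbeSuccess(val accuracy: Float) : GpsLocationOutcome
    data class CellFallback(val accuracy: Float) : GpsLocationOutcome
    data class StaleLastLocation(val ageMs: Long) : GpsLocationOutcome
    data object TotalFailure : GpsLocationOutcome
}

// Hypothetical policy: how much weight an alert rule gives to a position.
enum class Trust { PRECISE, COARSE, STALE, NONE }

fun trustOf(
    outcome: GpsLocationOutcome,
    maxStaleAgeMs: Long = 60 * 60 * 1000L,   // assumed cutoff: one hour
): Trust = when (outcome) {
    is GpsLocationOutcome.FreshGps,
    is GpsLocationOutcome.WakeProbeSuccess -> Trust.PRECISE
    is GpsLocationOutcome.CellFallback -> Trust.COARSE
    is GpsLocationOutcome.StaleLastLocation ->
        if (outcome.ageMs <= maxStaleAgeMs) Trust.COARSE else Trust.STALE
    GpsLocationOutcome.TotalFailure -> Trust.NONE
}

fun main() {
    // The 5-hour-old cached position from the incident described earlier.
    println(trustOf(GpsLocationOutcome.StaleLastLocation(5 * 60 * 60 * 1000L)))
}
```

Because the `when` is exhaustive over the sealed interface, adding a new outcome type later forces every consumer to decide how to treat it.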

The sensor HAL lies about timestamps

At 3 AM, your app wakes up to check the accelerometer. You call registerListener(), and the sensor HAL returns data. You check event.timestamp against SystemClock.elapsedRealtimeNanos(). The delta is small. The data looks fresh.

It is not. It is 22-minute-old data sitting in the hardware FIFO buffer since the last time anyone read the sensor.

This is the normal behavior of hardware sensor FIFOs. When the CPU sleeps, the sensor continues sampling into its buffer. When you register a listener after wakeup, the HAL dumps the entire buffer contents at you. The timestamps are real (the readings were taken at those times), but the data is stale — it describes what happened 22 minutes ago, not what is happening now.

On most devices, you can catch this by comparing event.timestamp (CLOCK_BOOTTIME nanoseconds) against SystemClock.elapsedRealtimeNanos(). If the delta is large, the reading is stale.
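The delta check itself is simple arithmetic. A minimal sketch, with hypothetical function names, where both arguments are CLOCK_BOOTTIME nanoseconds (event.timestamp and SystemClock.elapsedRealtimeNanos() on a device) and the 500ms default mirrors the staleness limit mentioned later in this post:

```kotlin
const val NANOS_PER_MILLI = 1_000_000L

/** Age of a reading in milliseconds; both inputs are CLOCK_BOOTTIME nanoseconds. */
fun readingAgeMs(eventTimestampNs: Long, nowElapsedRealtimeNs: Long): Long =
    (nowElapsedRealtimeNs - eventTimestampNs) / NANOS_PER_MILLI

/** True when the reading is older than maxAgeMs and should be discarded. */
fun isStale(
    eventTimestampNs: Long,
    nowElapsedRealtimeNs: Long,
    maxAgeMs: Long = 500L,
): Boolean = readingAgeMs(eventTimestampNs, nowElapsedRealtimeNs) > maxAgeMs

fun main() {
    val twentyTwoMinutesNs = 22L * 60 * 1_000_000_000L
    // A reading taken at boot-time 0, inspected 22 minutes later.
    println(isStale(eventTimestampNs = 0L, nowElapsedRealtimeNs = twentyTwoMinutesNs))
}
```

This only works when the HAL reports the true sampling time, which is exactly the assumption that fails in the next paragraph.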

Honor broke this assumption. On Honor devices, the HAL rebases event.timestamp on FIFO flush, so the delta check shows the data as fresh even when it is not.

The fix: flush, wait for callback, then collect

Do not trust the first readings after registerListener(). Instead:

Call sensorManager.flush(this) to drain the stale FIFO data
Wait for the onFlushCompleted() callback from SensorEventListener2
Only start collecting readings after the flush completes
Set a 1000ms fallback timer in case the HAL never fires the callback

class MotionSnapshotReceiver : BroadcastReceiver(), SensorEventListener2 {
    private var isFlushPhase = true

    override fun onSensorChanged(event: SensorEvent) {
        if (isFlushPhase) return  // Discard stale FIFO data
        collectReading(event)
    }

    override fun onFlushCompleted(sensor: Sensor?) {
        endFlushPhase(byHal = true)
    }

    private fun endFlushPhase(byHal: Boolean) {
        if (!isFlushPhase) return  // Guard against double-trigger
        isFlushPhase = false
        handler.removeCallbacks(flushFallbackRunnable)
        // Now start collecting real readings
    }
}

The fallback timer at 1000ms is important. I originally used 200ms, which was insufficient for Honor devices — their deep FIFO drains at approximately 16Hz, and a full buffer can take over 200ms to flush.

As a secondary safety net, I use dual-clock comparison: both CLOCK_BOOTTIME and CLOCK_MONOTONIC deltas must agree that the reading is fresh. If either delta exceeds 500ms of staleness, the reading is discarded.

A race condition in GPS processing

I had multiple independent trigger paths (stillness detector, smart GPS scheduler, area stability detector) that could request GPS concurrently. Two of them fired within 33 milliseconds of each other. Both read the same getLastLocation(), both passed the stationarity filter, and both inserted a GPS reading.

My code uses a minimum-readings-per-cluster filter to discard drive-through locations — a place needs at least 2 GPS readings to count as a real visit. The duplicate entry from the race condition defeated this filter. A single drive-by at 60km/h became a "cluster of 2."

The fix is a Mutex around the entire location processing path:

private val processLocationMutex = Mutex()

suspend fun processLocation(location: Location) {
    processLocationMutex.withLock {
        val lastLocation = getLastLocation()
        // The second concurrent caller now sees the just-inserted
        // location and correctly skips as duplicate
    }
}

Battery result

After all 11 layers and three tiers of power state, the battery impact is under 1% per day. The key numbers:

Before optimization: ~4,300 AlarmManager wakes per day. Every active-mode pulse (15s/30s) used AlarmManager. Every watchdog check (every 5 minutes) used AlarmManager. Honor flagged the app within hours.
After optimization: ~240 wakes per day. Active-mode pulses use Handler.postDelayed() (zero AlarmManager wakes). Watchdog intervals extended from 5 to 15 minutes. AlarmClock safety net reduced from every 15 minutes to every 8 hours.

That is a 94% reduction in system wakes while maintaining the same monitoring reliability.
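For anyone checking the arithmetic, the figure follows directly from the two wake counts above (the function name is just for illustration):

```kotlin
/** Percentage reduction from `before` to `after`. */
fun reductionPercent(before: Int, after: Int): Double =
    (before - after) * 100.0 / before

fun main() {
    // ~4,300 wakes per day before, ~240 after.
    println("%.1f".format(reductionPercent(4_300, 240)))
}
```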

The insight: aggressive scheduling costs more in battery than it buys in reliability. A three-tier power state that backs off when the device is still (active at 15-second pulses, idle at 5-minute pulses, deep sleep at 30-minute pulses with the batched accelerometer as a safety net) achieves both low battery impact and high reliability.
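To get a feel for how much the back-off matters, here is a small sketch that counts pulses for a given split of the day across the three tiers. The intervals are the ones quoted above; the enum, the function and the example split of hours are hypothetical. Note that pulses are not the same thing as AlarmManager wakes, since active-mode pulses run on a Handler.

```kotlin
enum class PowerTier(val pulseIntervalSeconds: Long) {
    ACTIVE(15L),            // 15-second pulses
    IDLE(5L * 60L),         // 5-minute pulses
    DEEP_SLEEP(30L * 60L),  // 30-minute pulses
}

/** Total pulses for a day described as whole hours spent in each tier. */
fun estimatedPulses(hoursPerTier: Map<PowerTier, Long>): Long =
    hoursPerTier.entries.sumOf { (tier, hours) ->
        hours * 3_600L / tier.pulseIntervalSeconds
    }

fun main() {
    // Assumed split for illustration only: 4h active, 12h idle, 8h deep sleep.
    val mixedDay = mapOf(
        PowerTier.ACTIVE to 4L,
        PowerTier.IDLE to 12L,
        PowerTier.DEEP_SLEEP to 8L,
    )
    println(estimatedPulses(mixedDay))
    println(estimatedPulses(mapOf(PowerTier.ACTIVE to 24L)))
}
```

Under that assumed split the day costs 1,120 pulses, against 5,760 if the app stayed in the active tier around the clock.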

What I would do differently

Build the OEM compatibility layer first. I treated background reliability as something I would fix later. It took 40+ versions across several months to get right. It should have been the architectural foundation from Day 1.

Test on real OEM devices from the start. The Android emulator and Pixel devices tell you nothing about OEM battery management. I did not discover the Honor wakelock whitelist problem, the GPS hardware suspension, or the sensor FIFO timestamp rebasing until I tested on actual devices.

Never trust a single mechanism. Every Android background API has an OEM that breaks it. AlarmManager gets suppressed. WorkManager gets deferred. Foreground services get killed. The only reliable approach is layered redundancy where each mechanism independently tries to recover.

The app is called How Are You?! and is available on Google Play. It is still in closed testing phase — if you have an elderly parent on Android and want to try it, I would appreciate feedback, especially from OEM devices I have not tested yet. Email: developer@howareu.app

I am happy to answer questions about any of these techniques. The OEM compatibility rabbit hole goes much deeper than what I have covered here.

I spent several months building an AI safety app for my elderly parent — here is what I learned

Stoyan Minchev — Sat, 21 Mar 2026 21:10:16 +0000

My parent lives alone. After a fall that nobody noticed for hours, I decided to build something that would.

Four months, 121 versions, and approximately 79,000 lines of Kotlin later, the app is live on Google Play. Here is the story — the technical challenges, the things that broke, and what I would do differently.

What the app does

Install it on your parent's Android phone. It watches. That is it.

For 7 days, it learns their routine — when they wake up, how active they are, where they go. After that, it monitors 24/7 and emails your family if something seems wrong:

Unusual stillness (potential fall or medical event)
Did not wake up on time
At an unfamiliar location at an unusual hour
Phone silent for too long

No buttons to press. No wearable to charge. No daily check-in calls. Install and forget.

The technical stack

Kotlin + Jetpack Compose + Material Design 3
Room + SQLCipher for encrypted local storage
Google Gemini API for behavioral analysis (cloud, anonymized summaries)
Resend API for transactional email alerts
WorkManager + Foreground Service for 24/7 reliability
Clean architecture: :domain (pure Kotlin) -> :data -> :ui -> :app

## The hard part: staying alive on Android

This is where 80% of my development time went.

Android's job is to kill your app. OEMs make it worse. Here is what I learned:

### Problem 1: OEM battery killers

Samsung, Xiaomi, Honor, OPPO — they all have proprietary battery managers that kill background apps. The standard startForeground() is not enough.

My solution is an 11-layer service recovery:

Foreground service with IMPORTANCE_MIN channel
WorkManager periodic watchdog
AlarmManager backup chains
BOOT_COMPLETED receiver
SyncAdapter for process priority boost
Batched accelerometer sensing (survives CPU sleep)
Exact alarm permission recovery
OEM-specific wakelock tag spoofing (Honor whitelists "LocationManagerService")
START_STICKY restart
Safety net AlarmClock at 8-hour intervals
User-facing gap detection with OEM-specific guidance

Each layer was added because the previous ones were not enough on some device.

### Problem 2: Sensor data lies to you

At 3 AM, your app wakes up to check the accelerometer. The sensor HAL returns data. You think it is fresh. It is not — it is 22-minute-old data sitting in the hardware FIFO buffer since the last time anyone read the sensor.

On Honor devices, the HAL even rebases event.timestamp on flush, so a delta check against elapsedRealtimeNanos() thinks the data is fresh. The solution: call sensorManager.flush() explicitly, discard warm-up readings, wait for the onFlushCompleted() callback instead of relying on fixed timers, and use a dual-clock comparison as a safety net.

### Problem 3: GPS does not work when you need it

getCurrentLocation(PRIORITY_HIGH_ACCURACY) returns nothing. The OEM killed the GPS hardware to save power.

Solution: Priority fallback chain — HIGH_ACCURACY -> wake probe -> BALANCED_POWER -> LOW_POWER -> getLastLocation(). Returns a GpsLocationOutcome sealed class so the caller knows exactly what happened.

## The AI: from on-device to cloud

I started with Gemini Nano (fully on-device). It worked on Pixels. It did not work on anything else. The addressable market was tiny.

So I moved to Gemini Flash (cloud API). The privacy trade-off: detailed behavioral data stays on-device in an encrypted database, but anonymized summaries (including location context) are sent to Google's AI for weekly analysis. No names, no personal identifiers.

The key architectural decision: API key sharding. Each Google Cloud project gets 10,000 requests per day free. I created 6 projects with independent API keys. The app rotates through them on rate-limit errors (429/403). That is 60,000 requests per day — enough for thousands of users at zero cost.

## Travel intelligence

The biggest UX win. Without it, every vacation generated 5 to 7 URGENT emails (one per night when the hard-floor detector fired at an unfamiliar location). By day 5, families ignored all emails.

Now: Day 1 sends one "your parent appears to be traveling" notification. Days 2 through 6: silence (unless something actually changes). Return home: "they have returned to a familiar area."

The state machine: HOME -> DAY_1 -> TRAVELING(n) -> TRIP_ENDED. Single-writer rule through TravelStateManager to prevent state corruption from concurrent assessments.
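A minimal sketch of such a state machine, assuming one assessment per day that reports whether the phone is away from its familiar area. The state names follow the line above and TravelStateManager is the name used in the app, but the transition rules, the Notice type and this implementation are illustrative assumptions:

```kotlin
sealed interface TravelState {
    data object Home : TravelState
    data object Day1 : TravelState
    data class Traveling(val day: Int) : TravelState
    data object TripEnded : TravelState
}

enum class Notice { TRAVEL_STARTED, RETURNED }

data class Transition(val next: TravelState, val notice: Notice?)

/** Pure transition function: one step per daily assessment. */
fun step(state: TravelState, awayFromHome: Boolean): Transition = when (state) {
    TravelState.Home, TravelState.TripEnded ->
        if (awayFromHome) Transition(TravelState.Day1, Notice.TRAVEL_STARTED)
        else Transition(TravelState.Home, null)
    TravelState.Day1 ->
        if (awayFromHome) Transition(TravelState.Traveling(2), null)   // silence from day 2
        else Transition(TravelState.TripEnded, Notice.RETURNED)
    is TravelState.Traveling ->
        if (awayFromHome) Transition(TravelState.Traveling(state.day + 1), null)
        else Transition(TravelState.TripEnded, Notice.RETURNED)
}

/** Single writer: every state change goes through this one synchronized method. */
class TravelStateManager {
    private var state: TravelState = TravelState.Home

    @Synchronized
    fun onAssessment(awayFromHome: Boolean): Notice? {
        val transition = step(state, awayFromHome)
        state = transition.next
        return transition.notice
    }

    @Synchronized
    fun current(): TravelState = state
}

fun main() {
    val manager = TravelStateManager()
    // Five nights away, then back home.
    val notices = listOf(true, true, true, true, true, false).map { manager.onAssessment(it) }
    println(notices)
}
```

Keeping the transition logic in a pure function makes it easy to test, while the manager's lock ensures that two concurrent assessments cannot both act on the same starting state.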

## What I would do differently

Start with cloud AI from Day 1. I lost 2 months on Gemini Nano before accepting the device compatibility reality.
Build the OEM compatibility layer first. The 11-layer recovery took 40+ versions to get right. It should have been the foundation, not an afterthought.
Email before OAuth. I started with Gmail OAuth (user signs into their Google account). It was a UX nightmare. Resend API (transactional email, zero auth) took 1 day to implement and just works.

## Looking for early adopters

The app is free to download with a 21-day free trial (then $49 for the first year, $5 per year after that). I am looking for families to test it — install it on your parent's Android phone (Android 9 or newer), run it for a couple of weeks, and tell me what works and what does not.

Website: howareu.app