Ai2th

Pockr | Part 5 — Debugging the VM Restart Loop

A Race Condition in Kotlin

Part 5 of 6 — building Pockr, a single APK that runs Docker on non-rooted Android.
← Part 4: Making Docker Run Without Kernel Modules


The Symptom

Alpine Linux takes ~5 minutes on first boot to set up Docker, install Python packages, and start the API server. During that time, the app shows a loading state.

In real-device testing on Firebase Test Lab, we kept seeing the same cycle:

VM starts → health check fails at ~35s → VM restarts → repeat

Alpine was never given enough time to finish first-boot setup. Docker Hub pulls never happened.


Reading the Logcat

The logcat told the story:

VmManager: Starting VM...
VmManager: VM process launched
VmApiClient: Health check failed: timeout          ← 35s in
VmManager: Starting VM...                          ← restarted!
VmManager: Stopping existing VM before restart     ← killed Alpine mid-boot
VmManager: VM process launched
VmApiClient: Health check failed: timeout          ← again
VmManager: Starting VM...                          ← restarted again

The VM was restarting every ~35 seconds — exactly the health check timeout. But why? Nobody was tapping Stop.


The Race Condition

Here's getStatus() before the fix:

fun getStatus(): String {
    if (!isRunning) return "stopped"   // ← THE BUG
    vmProcess?.let {
        return try {
            it.exitValue()
            isRunning = false
            "stopped"
        } catch (_: IllegalThreadStateException) {
            "running"
        }
    }
    return "stopped"
}

And startVm() does this:

@Synchronized
fun startVm() {
    // Stop existing VM before restart
    if (isRunning || vmProcess != null) {
        stopVm()   // kills QEMU
    }
    // ... launches new QEMU
    isRunning = true
}

Do you see the window? Inside startVm():

  1. stopVm() is called → sets isRunning = false
  2. New QEMU not yet launched → vmProcess = null
  3. During this window, getStatus() is called by the Flutter health poll
  4. !isRunning is true → returns "stopped" — even though we're mid-restart
  5. Flutter sets _status = "stopped" → Start Engine button re-enables
  6. User or Robo taps Start Engine → calls startVm() AGAIN → kills Alpine mid-boot
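
The window can be reproduced with a deliberately tiny model. This is a sketch, not the Pockr source: FlagOnlyVm, stop(), launch(), and status() are hypothetical names, the process handle is replaced by a plain object, and the health poll is interleaved by hand instead of running on a second thread so that the result is deterministic.

```kotlin
// Hypothetical, simplified model of the pre-fix logic; not the Pockr source.
class FlagOnlyVm {
    private var isRunning = false
    private var process: Any? = null   // stands in for the QEMU process handle

    fun stop() {
        isRunning = false
        process = null
    }

    fun launch() {
        process = Any()
        isRunning = true
    }

    // Same shape as the pre-fix getStatus(): the flag is trusted first.
    fun status(): String =
        if (!isRunning || process == null) "stopped" else "running"
}

fun main() {
    val vm = FlagOnlyVm()
    vm.launch()
    println(vm.status())   // running

    // A restart is "stop, then launch". A poll that lands between the two
    // calls sees a stopped VM even though a start is in progress.
    vm.stop()
    println(vm.status())   // stopped  <- what the health poll observes
    vm.launch()
    println(vm.status())   // running
}
```

In the real app the poll arrives on another thread, so whether it lands inside the window depends on timing. The model only shows that the window exists.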

The Fix

Check the live process first, not the boolean flag:

fun getStatus(): String {
    // Check the process directly first — isRunning can be briefly false
    // during the startVm() window between stopVm() and isRunning=true
    vmProcess?.let {
        return try {
            it.exitValue()
            // Process actually exited — clean up
            isRunning = false
            vmProcess = null
            "stopped"
        } catch (_: IllegalThreadStateException) {
            "running"   // Process is alive
        }
    }
    return "stopped"
}

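If vmProcess is non-null, the process handle decides the answer. exitValue() throws IllegalThreadStateException while the process is still running, so we return "running". Only when the process has genuinely exited do we clear the state and return "stopped". When vmProcess is null there is no process to inspect, and the function falls through to "stopped".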
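
The exitValue() liveness check can be tried in isolation. The sketch below assumes a hypothetical FakeProcess that stands in for QEMU so the example runs on any JVM without spawning a real process. statusOf() is also a hypothetical helper, not a function from the Pockr code.

```kotlin
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream
import java.io.InputStream
import java.io.OutputStream

// Hypothetical stand-in for a QEMU process. exitCode == null means "still running".
class FakeProcess(private var exitCode: Int? = null) : Process() {
    fun finish(code: Int) { exitCode = code }

    override fun getOutputStream(): OutputStream = ByteArrayOutputStream()
    override fun getInputStream(): InputStream = ByteArrayInputStream(ByteArray(0))
    override fun getErrorStream(): InputStream = ByteArrayInputStream(ByteArray(0))
    override fun waitFor(): Int = exitCode ?: throw IllegalStateException("still running")

    // Mirrors the java.lang.Process contract: throws while the process is alive.
    override fun exitValue(): Int =
        exitCode ?: throw IllegalThreadStateException("process has not exited")

    override fun destroy() { exitCode = 143 }
}

// Asks the process handle, not a flag, whether the VM is alive.
fun statusOf(process: Process?): String {
    if (process == null) return "stopped"
    return try {
        process.exitValue()   // returns only if the process has exited
        "stopped"
    } catch (e: IllegalThreadStateException) {
        "running"             // exitValue() threw, so the process is alive
    }
}

fun main() {
    val p = FakeProcess()
    println(statusOf(p))      // running
    p.finish(0)
    println(statusOf(p))      // stopped
    println(statusOf(null))   // stopped
}
```

On Java 8+ and Android API 26+, Process.isAlive() gives the same answer without the exception. The try/catch form also works on older API levels.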


A Second Guard in Flutter

We also added a guard in the Dart layer to prevent double-starts:

Future<void> startVm() async {
    if (_status == 'running' || _status == 'starting') return;  // guard
    _isLoading = true;
    _status = 'starting';
    ...
}

Even if something pushes a second startVm() call, it no-ops if the VM is already in motion.
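
The same idea could also be mirrored on the Kotlin side with an explicit "starting" state. This is only a sketch under assumptions: VmState, GuardedVm, and markRunning() are hypothetical names, and the launch callback stands in for whatever actually starts QEMU. It is not how Pockr is implemented.

```kotlin
// Hypothetical names throughout; a sketch of the guard idea, not the Pockr source.
enum class VmState { STOPPED, STARTING, RUNNING }

class GuardedVm(private val launch: () -> Unit) {
    private var state = VmState.STOPPED

    @Synchronized
    fun status(): VmState = state

    // Returns true if this call started the VM, false if it was a no-op.
    @Synchronized
    fun start(): Boolean {
        if (state != VmState.STOPPED) return false   // already starting or running
        state = VmState.STARTING
        launch()
        return true
    }

    // Assumed hook: called once the first health check succeeds.
    @Synchronized
    fun markRunning() {
        if (state == VmState.STARTING) state = VmState.RUNNING
    }
}

fun main() {
    var launches = 0
    val vm = GuardedVm { launches++ }

    println(vm.start())    // true
    println(vm.status())   // STARTING
    println(vm.start())    // false
    vm.markRunning()
    println(vm.status())   // RUNNING
    println(launches)      // 1
}
```

With a third state, a poll during boot reports STARTING instead of "stopped", so the UI has no reason to re-enable the start button.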


Confirmed Fixed

Firebase Test Lab v33 logcat:

VmManager: Starting VM...           ← first boot
VmManager: VM process launched
VmApiClient: Health check failed: Connection reset   ← still booting, normal
... (2 minutes of booting) ...
VmApiClient: Container started: alpine_1772702979346  ← SUCCESS ✅

No restart loop. Alpine booted fully. Container started from Docker Hub.


Lessons

  1. Boolean flags lie during state transitions. Check the real source of truth — the OS process handle.
  2. Log everything. Logcat timestamps and thread IDs revealed the exact 3ms window where the race occurred.
  3. Robo testing is brutal. Firebase's Robo crawler taps every visible button, exposing race conditions that manual testing misses.

Next: Part 6 — Test Results, Firebase Test Lab, and What's Next

GitHub: github.com/AI2TH/Pockr



Pockr Series — Docker in Your Pocket

Pockr = Pocket + Docker. A single Android APK that runs real Docker containers in your pocket — no root, no Termux, no PC required.

#    Post     Topic
📖   Intro    What is Pockr? Start here
1    Part 1   The Idea and Architecture
2    Part 2   Executing Binaries — The SELinux Problem
3    Part 3   Bundling 50 Native Libraries
4    Part 4   Docker Without Kernel Modules
5    Part 5   Debugging the VM Restart Loop
6    Part 6   Test Results and What's Next

Software Engineer & Debugger: Kalvin Nathan
skalvinnathan@gmail.com · LinkedIn
