DEV Community

David McHale

That Time SQLite File Existence Lied to Us

The bug

Our bootstrap script was killing GoPhish before database migrations could finish. Took us way too long to figure out why.

Here's what we had:

while [ ! -f "$DB_FILE" ]; do
    sleep 1
done
kill "$GOPHISH_PID"  # DB exists, we're good, right?

Nope.

SQLite creates the database file as soon as the application first opens the database. Migrations run after that, so there is a window where gophish.db exists but its schema is missing or only partly built. We were killing the process inside that window.

Result: weird SQL errors about missing columns. Only in production. Fun times.

The fix

Wait for the schema, not just the file:

# Wait for file
while [ ! -f "$DB_FILE" ]; do
    sleep 1
done

# Wait for migrations (check for a column we know should exist)
while ! sqlite3 "$DB_FILE" ".schema users" 2>/dev/null | grep -q "password_change_required"; do
    sleep 1
done
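
Both loops above spin forever if GoPhish dies before it finishes migrating. Here is a minimal sketch of a bounded version, assuming a hypothetical `wait_until` helper and a 60-poll budget (neither comes from our original script):

```shell
#!/usr/bin/env bash

# wait_until MAX_POLLS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND about once per second until it
# succeeds. Gives up after MAX_POLLS failed attempts.
# Returns 0 on success, 1 on timeout.
wait_until() {
    local max_polls=$1 failed=0
    shift
    until "$@"; do
        failed=$((failed + 1))
        if [ "$failed" -ge "$max_polls" ]; then
            return 1
        fi
        sleep 1
    done
    return 0
}

# True once the users table has a column we know should exist
# after migrations have run.
schema_ready() {
    sqlite3 "$DB_FILE" ".schema users" 2>/dev/null | grep -q "password_change_required"
}

# Usage (not run here):
#   wait_until 60 test -f "$DB_FILE" || { echo "DB file never appeared" >&2; exit 1; }
#   wait_until 60 schema_ready       || { echo "migrations never finished" >&2; exit 1; }
```

A timeout turns a silent hang into an error message you can read on the console.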

Pre-flight checks

While we were in there, we added checks before starting the service:

preflight_check() {
    local missing=()

    [ ! -d "$MIGRATIONS_DIR" ] && missing+=("migrations directory")
    [ ! -f "$CONFIG_FILE" ] && missing+=("config.json")
    [ ! -d "$STATIC_DIR" ] && missing+=("static directory")

    if [ ${#missing[@]} -gt 0 ]; then
        printf 'ERROR: missing %s\n' "${missing[@]}" >&2
        exit 1
    fi
}

Catches broken deployments immediately instead of failing mysteriously later.

Debugging output that actually helps

When stuff breaks, we now dump:

  • Directory contents
  • Database schema (if it exists)
  • Common causes checklist
  • All output line-buffered with stdbuf -oL, so you can actually see it in real time

That last one matters a lot when you're debugging through a cloud serial console.
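
A rough sketch of such a dump, assuming a hypothetical `dump_diagnostics` function that takes the app directory and database path as arguments. The checklist entries are illustrative, not our full list:

```shell
#!/usr/bin/env bash

# dump_diagnostics APP_DIR DB_FILE
# Hypothetical helper: prints what we want to see when bootstrap fails.
dump_diagnostics() {
    local app_dir=$1 db_file=$2

    echo "=== Directory contents: $app_dir ==="
    ls -la "$app_dir" 2>&1

    echo "=== Database schema: $db_file ==="
    if [ ! -f "$db_file" ]; then
        echo "(database file does not exist)"
    elif ! command -v sqlite3 > /dev/null 2>&1; then
        echo "(sqlite3 CLI not installed)"
    else
        sqlite3 "$db_file" ".schema" 2>&1
    fi

    echo "=== Common causes ==="
    echo "- migrations directory missing from the deployment"
    echo "- config.json missing"
    echo "- process killed before migrations finished"
}

# Usage (not run here): line-buffer a command's stdout so it reaches the
# console as it is produced. some_command and LOG_FILE are placeholders.
#   stdbuf -oL some_command 2>&1 | tee -a "$LOG_FILE"
```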

Cloud VM patterns we use now

Network waiting. Cloud VMs don't always have network at boot:

wait_for_network() {
    local max_attempts=30 i
    for ((i=1; i<=max_attempts; i++)); do
        if curl -s --max-time 5 https://example.com > /dev/null; then
            return 0
        fi
        sleep 2
    done
    return 1
}

Docker readiness. A service reporting "started" doesn't mean the Docker daemon is actually ready to take requests:

wait_for_docker() {
    while ! docker info > /dev/null 2>&1; do
        sleep 1
    done
}

State files. Know if this is first boot or a restart:

STATE_FILE="/var/lib/myapp/state"
if [ -f "$STATE_FILE" ]; then
    do_routine_restart
else
    do_initial_setup
    mkdir -p "$(dirname "$STATE_FILE")"  # the directory may not exist on first boot
    touch "$STATE_FILE"
fi

Systemd stuff

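Use network-online.target, not network.target. network.target only means the network stack has been started; network-online.target waits until the network is actually configured. The ordering and dependency directives go in [Unit], not [Service]:

[Unit]
After=network-online.target docker.service
Wants=network-online.target
Requires=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
ProtectSystem=full
PrivateTmp=true
NoNewPrivileges=true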
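
One catch: network-online.target only delays anything if a wait-online service is enabled for whichever network manager the image uses. Below is a small sketch of a check, assuming a hypothetical `find_wait_online_unit` function. The two unit names are the ones shipped with systemd-networkd and NetworkManager; other network stacks may use something else:

```shell
#!/usr/bin/env bash

# find_wait_online_unit
# Hypothetical helper: prints the first enabled wait-online unit and
# returns 0, or returns 1 if neither of the known units is enabled.
find_wait_online_unit() {
    local unit
    for unit in systemd-networkd-wait-online.service \
                NetworkManager-wait-online.service; do
        if systemctl is-enabled --quiet "$unit" 2>/dev/null; then
            echo "$unit"
            return 0
        fi
    done
    return 1
}

# Usage (not run here):
#   find_wait_online_unit || echo "WARNING: no wait-online service enabled" >&2
```

If nothing is enabled, a unit ordered after network-online.target can still start before the network is usable, which is why we keep wait_for_network as a second line of defense.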

SSH key cleanup for marketplace images

If you're publishing VM images, clean up your dev keys:

rm -f ~/.ssh/id_* ~/.ssh/*.pub ~/.ssh/known_hosts ~/.ssh/config
find ~/.ssh -type f -exec rm -f {} \;

# Verify it worked
if find ~/.ssh -type f 2>/dev/null | grep -q .; then
    echo "WARNING: SSH files still present!"
fi

TL;DR

  1. Don't trust file existence as a completion signal
  2. Pre-flight checks save debugging time
  3. Cloud VMs need patience (network, Docker, everything)
  4. Line-buffer your output (stdbuf -oL)
  5. Verify your cleanup actually worked

These changes took us from "why does this randomly break" to "oh, it failed because X." That's the goal.
