The bug
Our bootstrap script was killing GoPhish before database migrations could finish. Took us way too long to figure out why.
Here's what we had:
while [ ! -f "$DB_FILE" ]; do
    sleep 1
done
kill $GOPHISH_PID  # DB exists, we're good right?
Nope.
SQLite creates the database file the moment you open a connection. Migrations run after that. We were killing the process as soon as gophish.db appeared, before the schema was even built.
Result: weird SQL errors about missing columns. Only in production. Fun times.
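You can watch the window yourself if sqlite3 is installed on the box: right after the process starts, the file is there but the schema isn't.

ls -l "$DB_FILE"                    # file exists almost immediately
sqlite3 "$DB_FILE" ".schema users"  # comes back empty until migrations finish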
The fix
Wait for the schema, not just the file:
# Wait for file
while [ ! -f "$DB_FILE" ]; do
    sleep 1
done

# Wait for migrations (check for a column we know should exist)
while ! sqlite3 "$DB_FILE" ".schema users" 2>/dev/null | grep -q "password_change_required"; do
    sleep 1
done
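One caveat: if migrations ever hang, that second loop spins forever. A bounded version is a small change (the limit is arbitrary, and wait_for_schema is just a name we made up for this sketch):

wait_for_schema() {
    local max_attempts=60
    for ((i=1; i<=max_attempts; i++)); do
        if sqlite3 "$DB_FILE" ".schema users" 2>/dev/null | grep -q "password_change_required"; then
            return 0
        fi
        sleep 1
    done
    echo "ERROR: users table schema still incomplete after ${max_attempts} attempts" >&2
    return 1
}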
Pre-flight checks
While we were in there, we added checks before starting the service:
preflight_check() {
    local missing=()

    [ ! -d "$MIGRATIONS_DIR" ] && missing+=("migrations directory")
    [ ! -f "$CONFIG_FILE" ] && missing+=("config.json")
    [ ! -d "$STATIC_DIR" ] && missing+=("static directory")

    if [ ${#missing[@]} -gt 0 ]; then
        echo "ERROR: Missing required files: ${missing[*]}"
        exit 1
    fi
}
Catches broken deployments immediately instead of failing mysteriously later.
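For context, the paths those variables point at are set near the top of the script. The layout below is just an example of how that might look, not something to copy verbatim:

MIGRATIONS_DIR="/opt/gophish/db/db_sqlite3/migrations"
CONFIG_FILE="/opt/gophish/config.json"
STATIC_DIR="/opt/gophish/static"

preflight_check   # bail out before starting the service if the deployment is incomplete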
Debugging output that actually helps
When stuff breaks, we now dump:
- Directory contents
- Database schema (if it exists)
- Common causes checklist
- All output unbuffered with stdbuf -oL so you can actually see it in real time
That last one matters a lot when you're debugging through a cloud serial console.
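Roughly what that looks like in the script; dump_debug_info, INSTALL_DIR, and the checklist text are our own example names here, not anything GoPhish ships:

dump_debug_info() {
    echo "=== Directory contents ==="
    ls -la "$INSTALL_DIR"

    echo "=== Database schema ==="
    if [ -f "$DB_FILE" ]; then
        sqlite3 "$DB_FILE" ".schema" 2>&1
    else
        echo "(no database file yet)"
    fi

    echo "=== Common causes ==="
    echo "- migrations directory missing or empty?"
    echo "- config.json pointing at the wrong paths?"
    echo "- process killed before migrations finished?"
}

# Line-buffer stdout so the serial console shows progress as it happens
stdbuf -oL ./bootstrap.sh 2>&1 | tee /var/log/bootstrap.log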
Cloud VM patterns we use now
Network waiting. Cloud VMs don't always have network at boot:
wait_for_network() {
    local max_attempts=30
    for ((i=1; i<=max_attempts; i++)); do
        if curl -s --max-time 5 https://example.com > /dev/null; then
            return 0
        fi
        sleep 2
    done
    return 1
}
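In the bootstrap we'd call it so failure is loud instead of silent (the error message is just an example):

if ! wait_for_network; then
    echo "ERROR: no network after ~60s, aborting bootstrap" >&2
    exit 1
fi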
Docker readiness. Service "started" doesn't mean Docker is actually ready:
wait_for_docker() {
    while ! docker info > /dev/null 2>&1; do
        sleep 1
    done
}
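The point is the ordering, something like this (sketch):

systemctl start docker   # may return before the daemon is actually accepting API calls
wait_for_docker          # now docker commands are actually safe to run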
State files. Know if this is first boot or a restart:
STATE_FILE="/var/lib/myapp/state"

if [ -f "$STATE_FILE" ]; then
    do_routine_restart
else
    do_initial_setup
    mkdir -p "$(dirname "$STATE_FILE")"  # on first boot the directory may not exist yet
    touch "$STATE_FILE"
fi
Systemd stuff
Use network-online.target, not network.target. They're different and it matters:
[Unit]
After=network-online.target docker.service
Wants=network-online.target
Requires=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
ProtectSystem=full
PrivateTmp=true
NoNewPrivileges=true
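One caveat worth adding: network-online.target only actually delays anything if a wait-online service is enabled on the image. Which one depends on the network stack the distro uses:

# systemd-networkd based images
systemctl enable systemd-networkd-wait-online.service

# NetworkManager based images
systemctl enable NetworkManager-wait-online.service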
SSH key cleanup for marketplace images
If you're publishing VM images, clean up your dev keys:
rm -f ~/.ssh/id_* ~/.ssh/*.pub ~/.ssh/known_hosts ~/.ssh/config
find ~/.ssh -type f -exec rm -f {} \;

# Verify it worked
if find ~/.ssh -type f 2>/dev/null | grep -q .; then
    echo "WARNING: SSH files still present!"
fi
TL;DR
- Don't trust file existence as a completion signal
- Pre-flight checks save debugging time
- Cloud VMs need patience (network, Docker, everything)
- Unbuffer your output (stdbuf -oL)
- Verify your cleanup actually worked
These changes took us from "why does this randomly break" to "oh, it failed because X." That's the goal.