The bug
Our bootstrap script was killing GoPhish before database migrations could finish. Took us way too long to figure out why.
Here's what we had:
while [ ! -f "$DB_FILE" ]; do
    sleep 1
done
kill $GOPHISH_PID  # DB exists, we're good right?
Nope.
SQLite creates the database file the moment you open a connection. Migrations run after that. We were killing the process as soon as gophish.db appeared, before the schema was even built.
Result: weird SQL errors about missing columns. Only in production. Fun times.
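You can watch the window yourself if sqlite3 is installed on the box: right after the process starts, the file is there but the schema isn't.

ls -l "$DB_FILE"                    # file exists almost immediately
sqlite3 "$DB_FILE" ".schema users"  # comes back empty until migrations finish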
The fix
Wait for the schema, not just the file:
# Wait for file
while [ ! -f "$DB_FILE" ]; do
    sleep 1
done

# Wait for migrations (check for a column we know should exist)
while ! sqlite3 "$DB_FILE" ".schema users" 2>/dev/null | grep -q "password_change_required"; do
    sleep 1
done
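One caveat: if migrations ever hang, that second loop spins forever. A bounded version is a small change (the limit is arbitrary, and wait_for_schema is just a name we made up for this sketch):

wait_for_schema() {
    local max_attempts=60
    for ((i=1; i<=max_attempts; i++)); do
        if sqlite3 "$DB_FILE" ".schema users" 2>/dev/null | grep -q "password_change_required"; then
            return 0
        fi
        sleep 1
    done
    echo "ERROR: users table schema still incomplete after ${max_attempts} attempts" >&2
    return 1
}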
Pre-flight checks
While we were in there, we added checks before starting the service:
preflight_check() {
    local missing=()

    [ ! -d "$MIGRATIONS_DIR" ] && missing+=("migrations directory")
    [ ! -f "$CONFIG_FILE" ] && missing+=("config.json")
    [ ! -d "$STATIC_DIR" ] && missing+=("static directory")

    if [ ${#missing[@]} -gt 0 ]; then
        echo "ERROR: Missing required files: ${missing[*]}"
        exit 1
    fi
}
Catches broken deployments immediately instead of failing mysteriously later.
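For context, the paths those variables point at are set near the top of the script. The layout below is just an example of how that might look, not something to copy verbatim:

MIGRATIONS_DIR="/opt/gophish/db/db_sqlite3/migrations"
CONFIG_FILE="/opt/gophish/config.json"
STATIC_DIR="/opt/gophish/static"

preflight_check   # bail out before starting the service if the deployment is incomplete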
Debugging output that actually helps
When stuff breaks, we now dump:
- Directory contents
- Database schema (if it exists)
- Common causes checklist
- All output unbuffered with stdbuf -oL so you can actually see it in real time
That last one matters a lot when you're debugging through a cloud serial console.
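Roughly what that looks like in the script; dump_debug_info, INSTALL_DIR, and the checklist text are our own example names here, not anything GoPhish ships:

dump_debug_info() {
    echo "=== Directory contents ==="
    ls -la "$INSTALL_DIR"

    echo "=== Database schema ==="
    if [ -f "$DB_FILE" ]; then
        sqlite3 "$DB_FILE" ".schema" 2>&1
    else
        echo "(no database file yet)"
    fi

    echo "=== Common causes ==="
    echo "- migrations directory missing or empty?"
    echo "- config.json pointing at the wrong paths?"
    echo "- process killed before migrations finished?"
}

# Line-buffer stdout so the serial console shows progress as it happens
stdbuf -oL ./bootstrap.sh 2>&1 | tee /var/log/bootstrap.log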
Cloud VM patterns we use now
Network waiting. Cloud VMs don't always have network at boot:
wait_for_network() {
    local max_attempts=30
    for ((i=1; i<=max_attempts; i++)); do
        if curl -s --max-time 5 https://example.com > /dev/null; then
            return 0
        fi
        sleep 2
    done
    return 1
}
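In the bootstrap we'd call it so failure is loud instead of silent (the error message is just an example):

if ! wait_for_network; then
    echo "ERROR: no network after ~60s, aborting bootstrap" >&2
    exit 1
fi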
Docker readiness. Service "started" doesn't mean Docker is actually ready:
wait_for_docker() {
    while ! docker info > /dev/null 2>&1; do
        sleep 1
    done
}
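The point is the ordering, something like this (sketch):

systemctl start docker   # may return before the daemon is actually accepting API calls
wait_for_docker          # now docker commands are actually safe to run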
State files. Know if this is first boot or a restart:
STATE_FILE="/var/lib/myapp/state"

if [ -f "$STATE_FILE" ]; then
    do_routine_restart
else
    do_initial_setup
    mkdir -p "$(dirname "$STATE_FILE")"  # on first boot the directory may not exist yet
    touch "$STATE_FILE"
fi
Systemd stuff
Use network-online.target, not network.target. They're different and it matters:
[Unit]
After=network-online.target docker.service
Wants=network-online.target
Requires=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
ProtectSystem=full
PrivateTmp=true
NoNewPrivileges=true
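One caveat worth adding: network-online.target only actually delays anything if a wait-online service is enabled on the image. Which one depends on the network stack the distro uses:

# systemd-networkd based images
systemctl enable systemd-networkd-wait-online.service

# NetworkManager based images
systemctl enable NetworkManager-wait-online.service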
SSH key cleanup for marketplace images
If you're publishing VM images, clean up your dev keys:
rm -f ~/.ssh/id_* ~/.ssh/*.pub ~/.ssh/known_hosts ~/.ssh/config
find ~/.ssh -type f -exec rm -f {} \;

# Verify it worked
if find ~/.ssh -type f 2>/dev/null | grep -q .; then
    echo "WARNING: SSH files still present!"
fi
TL;DR
- Don't trust file existence as a completion signal
- Pre-flight checks save debugging time
- Cloud VMs need patience (network, Docker, everything)
- Unbuffer your output (stdbuf -oL)
- Verify your cleanup actually worked
These changes took us from "why does this randomly break" to "oh, it failed because X." That's the goal.