Commands, explanations, and real output — for engineers who want to understand what's actually happening, not just copy-paste their way through it.
☁️ Pre-Flight Checklist
Before we taxi down the runway, here’s your flight plan. Keep it handy as you work through each phase. Welcome aboard the cloud! ☁️
🌥️ Takeoff
⛅️ Cruising Altitude
- Part 1 — Pre-Reboot Checklist
- Part 2 — Running the Reboot
- Part 3 — Post-Reboot Verification
- Part 4 — Measuring Time to Recovery (TTR)
🌤️ Landing & Taxi
Enjoy your flight! ☁️
There's a specific kind of anxiety that comes with running sudo reboot on a server with real users on it. You know the system should come back, but "should" feels a lot less reassuring at the moment your SSH session freezes. This guide removes the guesswork. It covers everything from reading your apt upgrade output intelligently, to verifying your stack is healthy after the reboot, to measuring your actual recovery time with real commands and real numbers so that the next time you need to do this, it's a procedure, not a gamble.
Prerequisites
This guide assumes:
- Ubuntu 22.04 on an OCI Compute instance (ARM or x86)
- Docker + Docker Compose managing your services
- All long-running services configured with `restart: always` in your `docker-compose.yml`
- SSH access to the instance
If restart: always isn't set on your services, your containers will not come back after a reboot. Check this first.
services:
  backend:
    image: your-backend-image
    restart: always                # ✅ restarts automatically after reboot or crash
  migrations:
    image: your-migrations-image
    # no restart policy            # ✅ correct — this should run once and exit
restart: always tells Docker to relaunch the container whenever it stops — whether from a crash or a full system reboot. The one exception to be deliberate about is one-shot containers like database migrations: they're designed to run once and exit cleanly, so no restart policy is the right call for those.
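One quick way to audit this on a live host (assuming your containers are already running) is to ask Docker directly for each container's restart policy:

```shell
# Print each running container with its configured restart policy.
# Any line ending in ": no" will NOT come back after a reboot.
docker inspect --format '{{.Name}}: {{.HostConfig.RestartPolicy.Name}}' $(docker ps -q)
```

Expect `always` next to every long-running service, and `no` (the default) only next to one-shot containers like migrations.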
Part 1 — Pre-Reboot Checklist
Never reboot without completing this checklist. It takes under two minutes and prevents the most common post-reboot problems.
1.1 Verify no critical process is mid-flight
docker ps
What to look for:
| STATUS | Meaning |
|---|---|
| `Up 2 days (healthy)` | Safe to reboot |
| `Up 3 minutes` | Something recently restarted — investigate |
| `Restarting (1)` | Container is crash-looping — fix before rebooting |
| `Up 2 hours (unhealthy)` | Health check is failing — fix before rebooting |
If everything shows Up [days/weeks] (healthy), you are clear.
Why this matters: If a database migration container is mid-run, or a background job is processing a large task, a reboot will kill it mid-execution. You want to reboot during a quiet moment.
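If you run many containers, eyeballing `docker ps` gets error-prone. A small filter (a sketch using the `--format` flag) prints only the containers that are not up and healthy:

```shell
# Print only containers whose STATUS is not "Up ... (healthy)".
# No matches (just the all-clear message) means you are safe to proceed.
docker ps --format '{{.Names}}\t{{.Status}}' \
  | grep -vE 'Up .*\(healthy\)' \
  || echo "all containers healthy"
```

Note that containers without a health check always show plain `Up ...` and will appear in this list; adjust the pattern if you have any.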
1.2 Validate your Compose configuration
cd ~/your-project
docker compose config
Expected output: Your full resolved docker-compose.yml printed to the terminal, with no errors.
Why this matters: docker compose config resolves all environment variables and validates YAML syntax. If there's a broken variable reference or a typo in your file, this command catches it now — not after the reboot when containers silently fail to start. A common mistake is editing a .env file or docker-compose.yml and not realising you've introduced a syntax error. This is your safety net.
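To turn this into a quick go/no-go check rather than a wall of YAML, `docker compose config` also has a `--quiet` mode that only validates:

```shell
# --quiet validates the file and prints nothing on success,
# so any output at all is an error to fix before rebooting.
docker compose config --quiet && echo "compose file OK"
```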
1.3 Read your apt upgrade output
When you run sudo apt update && sudo apt upgrade -y before a reboot, the output tells you exactly what changed on your system. Don't skip past it.
Here's a real upgrade output and what each part means:
The following packages will be upgraded:
containerd.io coreutils docker-ce docker-ce-cli
docker-ce-rootless-extras docker-compose-plugin docker-model-plugin
gitlab-runner gitlab-runner-helper-images libnftables1 nftables
python3-pyasn1
How to read this list:
| Package | What it is | Reboot needed? |
|---|---|---|
| `docker-ce`, `containerd.io`, `docker-ce-cli` | The Docker engine and its runtime | Recommended |
| `docker-compose-plugin` | The `docker compose` CLI plugin | No |
| `nftables`, `libnftables1` | Linux kernel firewall/networking | Yes |
| `coreutils` | Fundamental Linux utilities (`ls`, `cp`, etc.) | Recommended |
| `gitlab-runner`, `gitlab-runner-helper-images` | CI/CD runner agent | Service restarts during upgrade |
| `python3-pyasn1` | ASN.1 library used by Python crypto packages | No |
The rule of thumb: if the upgrade touches anything in the kernel, networking stack, or container runtime, reboot. If it's only application-level packages, a reboot is optional; the only real cost is the brief downtime.
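You don't have to rely on reading the package list alone: Ubuntu records the decision for you. When an upgraded package requires a reboot, it creates a flag file you can check directly:

```shell
# Ubuntu creates /var/run/reboot-required when an upgraded package
# (kernel, libc, etc.) needs a reboot to take full effect.
if [ -f /var/run/reboot-required ]; then
    echo "Reboot required by:"
    cat /var/run/reboot-required.pkgs 2>/dev/null
else
    echo "No reboot currently required"
fi
```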
1.4 Understand the service restart messages
After apt upgrade, Ubuntu's needrestart tool prints which services were restarted automatically and which were deferred:
Restarting services...
systemctl restart irqbalance.service ssh.service rsyslog.service ...
Service restarts being deferred:
systemctl restart networkd-dispatcher.service
systemctl restart systemd-logind.service
"Restarting services" — These were restarted immediately. Your SSH connection stayed alive because ssh.service restarts in-place without dropping existing sessions.
"Service restarts being deferred" — These require a full reboot to apply safely. systemd-logind manages user sessions; restarting it mid-session can cause issues, so Ubuntu defers it to the next clean boot.
No containers need to be restarted.
This line comes from needrestart's container check: none of your running containers are using outdated binaries or libraries, so none need to be recreated. This is expected if you haven't rebuilt your application images.
1.5 Check available disk space
df -h /
Example output:
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 48G 12G 36G 23% /
You want at least 20% free on your root partition. Docker image pulls and accumulated log files are the two most common causes of a full disk, which can prevent containers from starting after a reboot.
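The 20% threshold can be checked automatically; a small sketch using `df`'s `--output` flag (GNU coreutils):

```shell
# Warn if the root filesystem is more than 80% full (less than 20% free).
used=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$used" -gt 80 ]; then
    echo "WARNING: root filesystem is ${used}% full, free up space before rebooting"
else
    echo "OK: root filesystem is ${used}% full"
fi
```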
Tip: The apt upgrade process often reclaims space automatically by pruning unused Docker build cache layers. In a real upgrade run, this printed:
Total reclaimed space: 4.165GB
Part 2 — Running the Reboot
Once the checklist is complete:
sudo reboot
What happens next, step by step:
- The OS sends `SIGTERM` to all running processes, giving them time to shut down cleanly.
- Docker receives the signal and stops all containers gracefully.
- The kernel shuts down and the VM restarts.
- Your SSH session prints `Connection to [ip] closed by remote host.` and terminates. This is normal.
How long to wait: OCI ARM instances (Ampere A1) typically reboot in 45–90 seconds. Wait at least 60 seconds before trying to reconnect.
ssh -i ~/.ssh/id_rsa ubuntu@YOUR_IP
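Rather than guessing when the instance is back, you can let a retry loop do the waiting. A sketch (the key path and `YOUR_IP` are the same placeholders as above):

```shell
# Poll SSH every 5 seconds until the instance accepts connections,
# then open a real session. ConnectTimeout keeps each attempt short.
until ssh -o ConnectTimeout=5 -i ~/.ssh/id_rsa ubuntu@YOUR_IP true 2>/dev/null; do
    echo "not up yet, retrying in 5s..."
    sleep 5
done
ssh -i ~/.ssh/id_rsa ubuntu@YOUR_IP
```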
Part 3 — Post-Reboot Verification
Run these checks in order. Each one builds on the last.
3.1 Check the Docker daemon
sudo systemctl status docker
Expected output:
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled)
Active: active (running) since Mon 2026-03-30 15:55:51 UTC; 5min ago
Key things to check:
- `Active: active (running)` — the daemon is running ✅
- `enabled` — it is configured to auto-start on every future boot ✅
If the daemon isn't running:
sudo systemctl enable docker # ensure it starts on future reboots
sudo systemctl start docker # start it now
3.2 Check all containers are up
docker ps
Example output:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
fc46f84c7bd5 app-backend "uv run uvi…" 2 days ago Up 5 minutes (healthy) 8000/tcp app_backend
a3e9a2eeb160 redis:alpine "docker-ent…" 2 weeks ago Up 5 minutes (healthy) 6379/tcp app_redis
f4afe2edb00c caddy:alpine "caddy run …" 4 weeks ago Up 5 minutes (healthy) 80, 443 caddy_proxy
What to check:
- Every service you expect should be present. If one is missing, it crashed on startup.
- `STATUS` should be `Up` or `Up (healthy)`. `(health: starting)` is fine for the first 30 seconds after boot.
- The `CREATED` timestamp does not reset on reboot — it reflects when the container was first created with `docker compose up`. This is normal.
If a container is missing or in a restart loop:
docker compose logs [service_name] --tail=50
This shows the last 50 log lines for that specific service, which will usually tell you exactly why it failed.
3.3 Watch the live logs
cd ~/your-project
docker compose logs -f --tail=20
The -f flag follows the log stream in real time. --tail=20 shows the last 20 lines per service as a starting point.
What healthy output looks like:
app_gate | 127.0.0.1 - - [30/Mar/2026:16:00:00 +0000] "GET / HTTP/1.1" 200 4140
app_backend | INFO: 127.0.0.1:58562 - "GET /health HTTP/1.1" 200 OK
caddy_proxy | {"level":"info","msg":"received request","uri":"/config/"}
app_redis | * Ready to accept connections tcp
What a transient (non-critical) error looks like:
app_worker | redis.exceptions.ConnectionError: Error while reading
from redis:6379 : (104, 'Connection reset by peer')
app_worker | 15:56:15: Starting worker for 1 functions: process_message
app_worker | 15:56:15: redis_version=8.6.1 mem_usage=1.38M clients_connected=1
This pattern — an error followed immediately by a successful connection message — is normal during cold starts. When all containers launch simultaneously, a dependent service (like a worker) may attempt its first connection before its dependency (like Redis) has finished initialising. The container retries and connects successfully on the next attempt. This is expected behavior.
What a critical error looks like:
app_backend | sqlalchemy.exc.OperationalError: connection refused
app_backend | [after 5 retries] giving up
A critical error is one that does not resolve on its own. If you see continuous errors without a recovery line following them, press Ctrl+C and investigate that service.
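When logs scroll too fast to eyeball, a rough triage is to count suspicious lines per service. A sketch (the error keywords here are an assumption; tune them to your stack):

```shell
# Count error-looking lines per service in the recent logs, noisiest first.
# --no-color keeps the service-name prefix clean for awk to split on "|".
docker compose logs --tail=200 --no-color \
  | grep -iE 'error|exception|refused' \
  | awk -F'|' '{ gsub(/ +$/, "", $1); print $1 }' \
  | sort | uniq -c | sort -rn
```

A service with one or two hits is probably the normal cold-start retry described above; dozens of hits with no recovery line is where to dig.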
3.4 Check additional system services
If you run a CI/CD runner or similar agent alongside Docker:
sudo gitlab-runner status
Expected output:
gitlab-runner: Service is running
If it's not running:
sudo gitlab-runner start
Part 4 — Measuring Time to Recovery (TTR)
TTR is the total time from sudo reboot to the moment your application is serving healthy responses. Measuring it gives you accurate data for maintenance window planning and user communications.
4.1 Measure OS boot time
systemd-analyze
Example output:
Startup finished in 3.617s (kernel) + 19.608s (userspace) = 23.225s
graphical.target reached after 18.845s in userspace
Breaking this down:
| Phase | Time | What's happening |
|---|---|---|
| Kernel | 3.6s | The Linux kernel loads into memory and initialises hardware drivers |
| Userspace | 19.6s | All systemd services start in parallel (networking, Docker, SSH, etc.) |
| Total | 23.2s | OS is fully booted |
4.2 Find the bottleneck in the boot sequence
systemd-analyze blame | head -20
This lists every service sorted by how long it took to start, slowest first:
12.186s docker.service
4.821s cloud-init.service
1.204s snapd.service
38ms docker.socket
In this case, Docker itself accounted for 12 of the 23 total seconds. This is normal — Docker has to read its state from disk, re-attach networks, and prepare to launch containers.
Why this is useful: If your boot time is unexpectedly long, systemd-analyze blame tells you exactly which service is the bottleneck.
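To see how much of the boot the top offenders account for, you can total them straight from the blame output. A sketch using the sample numbers above:

```shell
# Sum the per-service startup times from the blame snippet above.
# awk's numeric coercion reads "12.186s" as 12.186; millisecond entries
# like "38ms" would need unit handling, so only second-denominated
# lines are included here.
blame="12.186s docker.service
4.821s cloud-init.service
1.204s snapd.service"
echo "$blame" | awk '{ total += $1 + 0 } END { printf "top 3 services: %.3fs of boot time\n", total }'
```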
4.3 Find the exact moment containers started
docker inspect --format='{{.Name}}: {{.State.StartedAt}}' $(docker ps -q)
Example output:
/app_ftp_bridge: 2026-03-30T15:55:57.766Z
/app_worker: 2026-03-30T15:55:57.695Z
/app_backend: 2026-03-30T15:55:57.646Z
/app_gate: 2026-03-30T15:55:57.830Z
/app_admin: 2026-03-30T15:55:57.742Z
/app_redis: 2026-03-30T15:55:57.794Z
/caddy_proxy: 2026-03-30T15:55:57.615Z
Every container launched within the same second. This is because Docker starts all containers in parallel as soon as the daemon is ready. Note: this timestamp reflects when Docker launched the container process, not when the application inside it was ready to serve traffic. A container may take a further 5–30 seconds to pass its health check after this point.
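If you want that startup spread as a number, GNU `date` can convert the timestamps to epoch seconds. A sketch using three of the sample timestamps above:

```shell
# Compute the gap between the earliest and latest container start times.
# In practice, feed in the full `docker inspect` output instead of this sample.
starts="2026-03-30T15:55:57.766Z
2026-03-30T15:55:57.695Z
2026-03-30T15:55:57.615Z"
first=$(echo "$starts" | sort | head -1)
last=$(echo "$starts" | sort | tail -1)
awk -v a="$(date -d "$first" +%s.%N)" -v b="$(date -d "$last" +%s.%N)" \
    'BEGIN { printf "startup spread: %.3fs\n", b - a }'
```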
4.4 Build your full TTR timeline
Combining the data from the above commands:
| Event | Time (relative to reboot) |
|---|---|
| `sudo reboot` executed | T+0s |
| SSH connection closed | T+~5s |
| Kernel boot complete | T+~8s |
| Userspace boot complete (OS ready) | T+~28s |
| Docker daemon ready | T+~28s (12s of the userspace phase) |
| All containers launched | T+~28s |
| Redis accepting connections | T+~30s |
| Backend `/health` returning 200 | T+~35s |
| All health checks passing | T+~55s |
| Total TTR | ~55–60 seconds |
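The application-level half of this timeline is easy to capture automatically: from a second machine, start a poll against your health endpoint right after issuing the reboot (a sketch; the URL is a placeholder for your own endpoint):

```shell
# Record the seconds from "now" until the health endpoint returns 200.
# -sf makes curl silent and treats HTTP error responses as failures.
start=$(date +%s)
until curl -sf -o /dev/null http://YOUR_IP:8000/health; do
    sleep 1
done
echo "TTR: $(( $(date +%s) - start ))s"
```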
4.5 Use TTR to plan user communications
With a measured TTR, you can set honest expectations.
Internal / engineering team:
"Maintenance reboot at [time]. Expected downtime: ~2 minutes."
The 2-minute internal window gives a buffer above the measured ~60 seconds for anything unexpected.
External users:
"Scheduled maintenance in progress. Services will be restored within 5 minutes."
The 5-minute external window is deliberately conservative. If a container fails its first health check and requires a full restart cycle (up to 5 retries × 5 seconds = 25 extra seconds), you're still within your stated window. Under-promise, over-deliver.
Quick Reference: All Commands
# --- PRE-REBOOT ---
docker ps # check container states
docker compose config # validate compose file syntax
df -h / # check available disk space
# --- REBOOT ---
sudo reboot # initiate the reboot
# --- POST-REBOOT ---
sudo systemctl status docker # confirm daemon is running
docker ps # confirm containers are up
docker compose logs -f --tail=20 # watch live logs
sudo gitlab-runner status # check runner (if applicable)
# --- TTR MEASUREMENT ---
systemd-analyze # total OS boot time
systemd-analyze blame | head -20 # per-service boot time breakdown
docker inspect --format='{{.Name}}: {{.State.StartedAt}}' $(docker ps -q)
# exact container start timestamps
Troubleshooting Reference
| Symptom | Likely cause | Fix |
|---|---|---|
| Container missing from `docker ps` | Crashed on startup | `docker compose logs [service] --tail=50` |
| Container stuck in `(health: starting)` after 2+ minutes | Health check command failing | `docker inspect [id]` → check `Health.Log` |
| Docker daemon not running | Not enabled in systemd | `sudo systemctl enable docker && sudo systemctl start docker` |
| SSH times out for more than 3 minutes | VM didn't boot cleanly | Check OCI console → instance serial console for kernel panic output |
| All containers up but app unreachable externally | Reverse proxy (Caddy/Nginx) issue | `docker compose logs caddy --tail=50` |
| Persistent container errors after cold start | Dependent service started before its dependency was ready | Wait 60 seconds, then re-check — most resolve automatically |
Cover photo by BoliviaInteligente on Unsplash