Commands, explanations, and real output — for engineers who want to understand what's actually happening, not just copy-paste their way through it.
☁️ Pre-Flight Checklist
Before we taxi down the runway, here’s your flight plan. Keep it handy as you work through each phase. Welcome aboard the cloud! ☁️
🌥️ Takeoff
⛅️ Cruising Altitude
- Part 1 — Pre-Reboot Checklist
- Part 2 — Running the Reboot
- Part 3 — Post-Reboot Verification
- Part 4 — Measuring Time to Recovery (TTR)
🌤️ Landing & Taxi
Enjoy your flight! ☁️
There's a specific kind of anxiety that comes with running sudo reboot on a server with real users on it. You know the system should come back, but "should" feels a lot less reassuring at the moment your SSH session freezes. This guide removes the guesswork. It covers everything from reading your apt upgrade output intelligently, to verifying your stack is healthy after the reboot, to measuring your actual recovery time with real commands and real numbers so that the next time you need to do this, it's a procedure, not a gamble.
Prerequisites
This guide assumes:
- Ubuntu 22.04 on an OCI Compute instance (ARM or x86)
- Docker + Docker Compose managing your services
- All long-running services configured with `restart: always` in your `docker-compose.yml`
- SSH access to the instance
If restart: always isn't set on your services, your containers will not come back after a reboot. Check this first.
services:
  backend:
    image: your-backend-image
    restart: always                # ✅ restarts automatically after reboot or crash
  migrations:
    image: your-migrations-image
    # no restart policy            # ✅ correct — this should run once and exit
restart: always tells Docker to relaunch the container whenever it stops — whether from a crash or a full system reboot. The one exception to be deliberate about is one-shot containers like database migrations: they're designed to run once and exit cleanly, so no restart policy is the right call for those.
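One quick way to audit this on a live host (assuming your containers are already running) is to ask Docker directly for each container's restart policy:

```shell
# Print each running container with its configured restart policy.
# Any line ending in ": no" will NOT come back after a reboot.
docker inspect --format '{{.Name}}: {{.HostConfig.RestartPolicy.Name}}' $(docker ps -q)
```

Expect `always` next to every long-running service, and `no` (the default) only next to one-shot containers like migrations.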
Part 1 — Pre-Reboot Checklist
Never reboot without completing this checklist. It takes under two minutes and prevents the most common post-reboot problems.
1.1 Verify no critical process is mid-flight
docker ps
What to look for:
| STATUS | Meaning |
|---|---|
| `Up 2 days (healthy)` | Safe to reboot |
| `Up 3 minutes` | Something recently restarted — investigate |
| `Restarting (1)` | Container is crash-looping — fix before rebooting |
| `Up 2 hours (unhealthy)` | Health check is failing — fix before rebooting |
If everything shows Up [days/weeks] (healthy), you are clear.
Why this matters: If a database migration container is mid-run, or a background job is processing a large task, a reboot will kill it mid-execution. You want to reboot during a quiet moment.
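If you run many containers, eyeballing `docker ps` gets error-prone. A small filter (a sketch using the `--format` flag) prints only the containers that are not up and healthy:

```shell
# Print only containers whose STATUS is not "Up ... (healthy)".
# No matches (just the all-clear message) means you are safe to proceed.
docker ps --format '{{.Names}}\t{{.Status}}' \
  | grep -vE 'Up .*\(healthy\)' \
  || echo "all containers healthy"
```

Note that containers without a health check always show plain `Up ...` and will appear in this list; adjust the pattern if you have any.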
1.2 Validate your Compose configuration
cd ~/your-project
docker compose config
Expected output: Your full resolved docker-compose.yml printed to the terminal, with no errors.
Why this matters: docker compose config resolves all environment variables and validates YAML syntax. If there's a broken variable reference or a typo in your file, this command catches it now — not after the reboot when containers silently fail to start. A common mistake is editing a .env file or docker-compose.yml and not realising you've introduced a syntax error. This is your safety net.
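To turn this into a quick go/no-go check rather than a wall of YAML, `docker compose config` also has a `--quiet` mode that only validates:

```shell
# --quiet validates the file and prints nothing on success,
# so any output at all is an error to fix before rebooting.
docker compose config --quiet && echo "compose file OK"
```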
1.3 Read your apt upgrade output
When you run sudo apt update && sudo apt upgrade -y before a reboot, the output tells you exactly what changed on your system. Don't skip past it.
Here's a real upgrade output and what each part means:
The following packages will be upgraded:
containerd.io coreutils docker-ce docker-ce-cli
docker-ce-rootless-extras docker-compose-plugin docker-model-plugin
gitlab-runner gitlab-runner-helper-images libnftables1 nftables
python3-pyasn1
How to read this list:
| Package | What it is | Reboot needed? |
|---|---|---|
| `docker-ce`, `containerd.io`, `docker-ce-cli` | The Docker engine and its runtime | Recommended |
| `docker-compose-plugin` | The `docker compose` CLI plugin | No |
| `nftables`, `libnftables1` | Linux kernel firewall/networking | Yes |
| `coreutils` | Fundamental Linux utilities (`ls`, `cp`, etc.) | Recommended |
| `gitlab-runner`, `gitlab-runner-helper-images` | CI/CD runner agent | Service restarts during upgrade |
| `python3-pyasn1` | ASN.1 library used by Python crypto packages | No |
The rule of thumb: if the upgrade touches anything in the kernel, networking stack, or container runtime, reboot. If it's only application-level packages, a reboot is optional; the only real cost is the brief downtime.
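You don't have to rely on reading the package list alone: Ubuntu records the decision for you. When an upgraded package requires a reboot, it creates a flag file you can check directly:

```shell
# Ubuntu creates /var/run/reboot-required when an upgraded package
# (kernel, libc, etc.) needs a reboot to take full effect.
if [ -f /var/run/reboot-required ]; then
    echo "Reboot required by:"
    cat /var/run/reboot-required.pkgs 2>/dev/null
else
    echo "No reboot currently required"
fi
```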
1.4 Understand the service restart messages
After apt upgrade, Ubuntu's needrestart tool prints which services were restarted automatically and which were deferred:
Restarting services...
systemctl restart irqbalance.service ssh.service rsyslog.service ...
Service restarts being deferred:
systemctl restart networkd-dispatcher.service
systemctl restart systemd-logind.service
"Restarting services" — These were restarted immediately. Your SSH connection stayed alive because ssh.service restarts in-place without dropping existing sessions.
"Service restarts being deferred" — These require a full reboot to apply safely. systemd-logind manages user sessions; restarting it mid-session can cause issues, so Ubuntu defers it to the next clean boot.
No containers need to be restarted.
This line comes from needrestart's container check: none of your running containers are using outdated binaries or libraries, so none need to be recreated. This is expected if you haven't rebuilt your application images.
1.5 Check available disk space
df -h /
Example output:
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 48G 12G 36G 23% /
You want at least 20% free on your root partition. Docker image pulls and accumulated log files are the two most common causes of a full disk, which can prevent containers from starting after a reboot.
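The 20% threshold can be checked automatically; a small sketch using `df`'s `--output` flag (GNU coreutils):

```shell
# Warn if the root filesystem is more than 80% full (less than 20% free).
used=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$used" -gt 80 ]; then
    echo "WARNING: root filesystem is ${used}% full, free up space before rebooting"
else
    echo "OK: root filesystem is ${used}% full"
fi
```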
Tip: The apt upgrade process often reclaims space automatically by pruning unused Docker build cache layers. In a real upgrade run, this printed:
Total reclaimed space: 4.165GB
Part 2 — Running the Reboot
Once the checklist is complete:
sudo reboot
What happens next, step by step:
- The OS sends `SIGTERM` to all running processes, giving them time to shut down cleanly.
- Docker receives the signal and stops all containers gracefully.
- The kernel shuts down and the VM restarts.
- Your SSH session prints `Connection to [ip] closed by remote host.` and terminates. This is normal.
How long to wait: OCI ARM instances (Ampere A1) typically reboot in 45–90 seconds. Wait at least 60 seconds before trying to reconnect.
ssh -i ~/.ssh/id_rsa ubuntu@YOUR_IP
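Rather than guessing when the instance is back, you can let a retry loop do the waiting. A sketch (the key path and `YOUR_IP` are the same placeholders as above):

```shell
# Poll SSH every 5 seconds until the instance accepts connections,
# then open a real session. ConnectTimeout keeps each attempt short.
until ssh -o ConnectTimeout=5 -i ~/.ssh/id_rsa ubuntu@YOUR_IP true 2>/dev/null; do
    echo "not up yet, retrying in 5s..."
    sleep 5
done
ssh -i ~/.ssh/id_rsa ubuntu@YOUR_IP
```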
Part 3 — Post-Reboot Verification
Run these checks in order. Each one builds on the last.
3.1 Check the Docker daemon
sudo systemctl status docker
Expected output:
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled)
Active: active (running) since Mon 2026-03-30 15:55:51 UTC; 5min ago
Key things to check:
- `Active: active (running)` — the daemon is running ✅
- `enabled` — it is configured to auto-start on every future boot ✅
If the daemon isn't running:
sudo systemctl enable docker # ensure it starts on future reboots
sudo systemctl start docker # start it now
3.2 Check all containers are up
docker ps
Example output:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
fc46f84c7bd5 app-backend "uv run uvi…" 2 days ago Up 5 minutes (healthy) 8000/tcp app_backend
a3e9a2eeb160 redis:alpine "docker-ent…" 2 weeks ago Up 5 minutes (healthy) 6379/tcp app_redis
f4afe2edb00c caddy:alpine "caddy run …" 4 weeks ago Up 5 minutes (healthy) 80, 443 caddy_proxy
What to check:
- Every service you expect should be present. If one is missing, it crashed on startup.
- `STATUS` should be `Up` or `Up (healthy)`. `(health: starting)` is fine for the first 30 seconds after boot.
- The `CREATED` timestamp does not reset on reboot — it reflects when the container was first created with `docker compose up`. This is normal.
If a container is missing or in a restart loop:
docker compose logs [service_name] --tail=50
This shows the last 50 log lines for that specific service, which will usually tell you exactly why it failed.
3.3 Watch the live logs
cd ~/your-project
docker compose logs -f --tail=20
The -f flag follows the log stream in real time. --tail=20 shows the last 20 lines per service as a starting point.
What healthy output looks like:
app_gate | 127.0.0.1 - - [30/Mar/2026:16:00:00 +0000] "GET / HTTP/1.1" 200 4140
app_backend | INFO: 127.0.0.1:58562 - "GET /health HTTP/1.1" 200 OK
caddy_proxy | {"level":"info","msg":"received request","uri":"/config/"}
app_redis | * Ready to accept connections tcp
What a transient (non-critical) error looks like:
app_worker | redis.exceptions.ConnectionError: Error while reading
from redis:6379 : (104, 'Connection reset by peer')
app_worker | 15:56:15: Starting worker for 1 functions: process_message
app_worker | 15:56:15: redis_version=8.6.1 mem_usage=1.38M clients_connected=1
This pattern — an error followed immediately by a successful connection message — is normal during cold starts. When all containers launch simultaneously, a dependent service (like a worker) may attempt its first connection before its dependency (like Redis) has finished initialising. The container retries and connects successfully on the next attempt. This is expected behavior.
What a critical error looks like:
app_backend | sqlalchemy.exc.OperationalError: connection refused
app_backend | [after 5 retries] giving up
A critical error is one that does not resolve on its own. If you see continuous errors without a recovery line following them, press Ctrl+C and investigate that service.
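When logs scroll too fast to eyeball, a rough triage is to count suspicious lines per service. A sketch (the error keywords here are an assumption; tune them to your stack):

```shell
# Count error-looking lines per service in the recent logs, noisiest first.
# --no-color keeps the service-name prefix clean for awk to split on "|".
docker compose logs --tail=200 --no-color \
  | grep -iE 'error|exception|refused' \
  | awk -F'|' '{ gsub(/ +$/, "", $1); print $1 }' \
  | sort | uniq -c | sort -rn
```

A service with one or two hits is probably the normal cold-start retry described above; dozens of hits with no recovery line is where to dig.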
3.4 Check additional system services
If you run a CI/CD runner or similar agent alongside Docker:
sudo gitlab-runner status
Expected output:
gitlab-runner: Service is running
If it's not running:
sudo gitlab-runner start
Part 4 — Measuring Time to Recovery (TTR)
TTR is the total time from sudo reboot to the moment your application is serving healthy responses. Measuring it gives you accurate data for maintenance window planning and user communications.
4.1 Measure OS boot time
systemd-analyze
Example output:
Startup finished in 3.617s (kernel) + 19.608s (userspace) = 23.225s
graphical.target reached after 18.845s in userspace
Breaking this down:
| Phase | Time | What's happening |
|---|---|---|
| Kernel | 3.6s | The Linux kernel loads into memory and initialises hardware drivers |
| Userspace | 19.6s | All systemd services start in parallel (networking, Docker, SSH, etc.) |
| Total | 23.2s | OS is fully booted |
4.2 Find the bottleneck in the boot sequence
systemd-analyze blame | head -20
This lists every service sorted by how long it took to start, slowest first:
12.186s docker.service
4.821s cloud-init.service
1.204s snapd.service
38ms docker.socket
In this case, Docker itself accounted for 12 of the 23 total seconds. This is normal — Docker has to read its state from disk, re-attach networks, and prepare to launch containers.
Why this is useful: If your boot time is unexpectedly long, systemd-analyze blame tells you exactly which service is the bottleneck.
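To see how much of the boot the top offenders account for, you can total them straight from the blame output. A sketch using the sample numbers above:

```shell
# Sum the per-service startup times from the blame snippet above.
# awk's numeric coercion reads "12.186s" as 12.186; millisecond entries
# like "38ms" would need unit handling, so only second-denominated
# lines are included here.
blame="12.186s docker.service
4.821s cloud-init.service
1.204s snapd.service"
echo "$blame" | awk '{ total += $1 + 0 } END { printf "top 3 services: %.3fs of boot time\n", total }'
```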
4.3 Find the exact moment containers started
docker inspect --format='{{.Name}}: {{.State.StartedAt}}' $(docker ps -q)
Example output:
/app_ftp_bridge: 2026-03-30T15:55:57.766Z
/app_worker: 2026-03-30T15:55:57.695Z
/app_backend: 2026-03-30T15:55:57.646Z
/app_gate: 2026-03-30T15:55:57.830Z
/app_admin: 2026-03-30T15:55:57.742Z
/app_redis: 2026-03-30T15:55:57.794Z
/caddy_proxy: 2026-03-30T15:55:57.615Z
Every container launched within the same second. This is because Docker starts all containers in parallel as soon as the daemon is ready. Note: this timestamp reflects when Docker launched the container process, not when the application inside it was ready to serve traffic. A container may take a further 5–30 seconds to pass its health check after this point.
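If you want that startup spread as a number, GNU `date` can convert the timestamps to epoch seconds. A sketch using three of the sample timestamps above:

```shell
# Compute the gap between the earliest and latest container start times.
# In practice, feed in the full `docker inspect` output instead of this sample.
starts="2026-03-30T15:55:57.766Z
2026-03-30T15:55:57.695Z
2026-03-30T15:55:57.615Z"
first=$(echo "$starts" | sort | head -1)
last=$(echo "$starts" | sort | tail -1)
awk -v a="$(date -d "$first" +%s.%N)" -v b="$(date -d "$last" +%s.%N)" \
    'BEGIN { printf "startup spread: %.3fs\n", b - a }'
```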
4.4 Build your full TTR timeline
Combining the data from the above commands:
| Event | Time (relative to reboot) |
|---|---|
| `sudo reboot` executed | T+0s |
| SSH connection closed | T+~5s |
| Kernel boot complete | T+~8s |
| Userspace boot complete (OS ready) | T+~28s |
| Docker daemon ready | T+~28s (12s of the userspace phase) |
| All containers launched | T+~28s |
| Redis accepting connections | T+~30s |
| Backend `/health` returning 200 | T+~35s |
| All health checks passing | T+~55s |
| Total TTR | ~55–60 seconds |
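The application-level half of this timeline is easy to capture automatically: from a second machine, start a poll against your health endpoint right after issuing the reboot (a sketch; the URL is a placeholder for your own endpoint):

```shell
# Record the seconds from "now" until the health endpoint returns 200.
# -sf makes curl silent and treats HTTP error responses as failures.
start=$(date +%s)
until curl -sf -o /dev/null http://YOUR_IP:8000/health; do
    sleep 1
done
echo "TTR: $(( $(date +%s) - start ))s"
```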
4.5 Use TTR to plan user communications
With a measured TTR, you can set honest expectations.
Internal / engineering team:
"Maintenance reboot at [time]. Expected downtime: ~2 minutes."
The 2-minute internal window gives a buffer above the measured ~60 seconds for anything unexpected.
External users:
"Scheduled maintenance in progress. Services will be restored within 5 minutes."
The 5-minute external window is deliberately conservative. If a container fails its first health check and requires a full restart cycle (up to 5 retries × 5 seconds = 25 extra seconds), you're still within your stated window. Under-promise, over-deliver.
Quick Reference: All Commands
# --- PRE-REBOOT ---
docker ps # check container states
docker compose config # validate compose file syntax
df -h / # check available disk space
# --- REBOOT ---
sudo reboot # initiate the reboot
# --- POST-REBOOT ---
sudo systemctl status docker # confirm daemon is running
docker ps # confirm containers are up
docker compose logs -f --tail=20 # watch live logs
sudo gitlab-runner status # check runner (if applicable)
# --- TTR MEASUREMENT ---
systemd-analyze # total OS boot time
systemd-analyze blame | head -20 # per-service boot time breakdown
docker inspect --format='{{.Name}}: {{.State.StartedAt}}' $(docker ps -q)
# exact container start timestamps
Troubleshooting Reference
| Symptom | Likely cause | Fix |
|---|---|---|
| Container missing from `docker ps` | Crashed on startup | `docker compose logs [service] --tail=50` |
| Container stuck in `(health: starting)` after 2+ minutes | Health check command failing | `docker inspect [id]` → check `Health.Log` |
| Docker daemon not running | Not enabled in systemd | `sudo systemctl enable docker && sudo systemctl start docker` |
| SSH times out for more than 3 minutes | VM didn't boot cleanly | Check OCI console → instance serial console for kernel panic output |
| All containers up but app unreachable externally | Reverse proxy (Caddy/Nginx) issue | `docker compose logs caddy --tail=50` |
| Persistent container errors after cold start | Dependent service started before its dependency was ready | Wait 60 seconds, then re-check — most resolve automatically |
Cover photo by BoliviaInteligente on Unsplash