Smyekh David-West
Rebooting a Production VM on Oracle Cloud: A Reference Guide

Commands, explanations, and real output — for engineers who want to understand what's actually happening, not just copy-paste their way through it.



There's a specific kind of anxiety that comes with running sudo reboot on a server with real users on it. You know the system should come back, but "should" feels a lot less reassuring at the moment your SSH session freezes. This guide removes the guesswork. It covers everything from reading your apt upgrade output intelligently, to verifying your stack is healthy after the reboot, to measuring your actual recovery time with real commands and real numbers so that the next time you need to do this, it's a procedure, not a gamble.

Prerequisites

This guide assumes:

  • Ubuntu 22.04 on an OCI Compute instance (ARM or x86)
  • Docker + Docker Compose managing your services
  • All long-running services configured with restart: always in your docker-compose.yml
  • SSH access to the instance

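If a service has no restart policy, Docker defaults to restart: "no" and its containers will not come back after a reboot. Check this first. (restart: unless-stopped also survives a reboot, provided you did not stop the container manually beforehand.)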

services:
  backend:
    image: your-backend-image
    restart: always  # ✅ restarts automatically after reboot or crash

  migrations:
    image: your-migrations-image
    # no restart policy ✅ correct: this should run once and exit

restart: always tells Docker to relaunch the container whenever it stops — whether from a crash or a full system reboot. The one exception to be deliberate about is one-shot containers like database migrations: they're designed to run once and exit cleanly, so no restart policy is the right call for those.
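To confirm what is configured on the running containers, rather than what the Compose file says, you can ask Docker directly. The sketch below is one way to do it. `flag_no_restart` is a hypothetical helper name, and it assumes input lines of the form `name policy`, which the `docker inspect` format string in the usage comment produces.

```shell
# Reads "name policy" lines on stdin and prints the containers that will
# NOT restart on their own after a reboot (policy "no" or empty).
flag_no_restart() {
  awk '$2 == "no" || $2 == "" { print $1 }'
}

# Usage on the VM (real docker CLI, hypothetical helper):
#   docker inspect --format '{{.Name}} {{.HostConfig.RestartPolicy.Name}}' $(docker ps -q) | flag_no_restart
```

An empty result means every running container has some restart policy. The helper only flags the policy `no`; `on-failure` is left for you to judge, since it restarts a container only after a non-zero exit.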

Part 1 — Pre-Reboot Checklist

Never reboot without completing this checklist. It takes under two minutes and prevents the most common post-reboot problems.

1.1 Verify no critical process is mid-flight

docker ps

What to look for:

| STATUS | Meaning |
|---|---|
| Up 2 days (healthy) | Safe to reboot |
| Up 3 minutes | Something recently restarted: investigate |
| Restarting (1) | Container is crash-looping: fix before rebooting |
| Up 2 hours (unhealthy) | Health check is failing: fix before rebooting |

If everything shows Up [days/weeks] (healthy), you are clear.

Why this matters: If a database migration container is mid-run, or a background job is processing a large task, a reboot will kill it mid-execution. You want to reboot during a quiet moment.
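If you would rather not eyeball the STATUS column, the same rules can be applied mechanically. This is a minimal sketch, not a complete classifier: `classify_status` is a hypothetical helper, the `|` separator is an arbitrary choice, and the patterns only cover the status strings discussed above.

```shell
# Reads "name|status" lines on stdin and prints one verdict per container,
# following the STATUS rules described above.
classify_status() {
  awk -F '|' '
    $2 ~ /^Restarting/                    { print $1 ": FIX FIRST (crash-looping)"; next }
    $2 ~ /\(unhealthy\)/                  { print $1 ": FIX FIRST (health check failing)"; next }
    $2 ~ /^Up [0-9]+ (second|minute)s?/   { print $1 ": INVESTIGATE (recently restarted)"; next }
    $2 ~ /^Up (About a minute|Less than)/ { print $1 ": INVESTIGATE (recently restarted)"; next }
    $2 ~ /^Up /                           { print $1 ": ok"; next }
                                          { print $1 ": INVESTIGATE (" $2 ")" }
  '
}

# Usage on the VM:
#   docker ps --format '{{.Names}}|{{.Status}}' | classify_status
```

Anything other than `ok` is a reason to pause and look before rebooting.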

1.2 Validate your Compose configuration

cd ~/your-project
docker compose config

Expected output: Your full resolved docker-compose.yml printed to the terminal, with no errors.

Why this matters: docker compose config resolves all environment variables and validates YAML syntax. If there's a broken variable reference or a typo in your file, this command catches it now — not after the reboot when containers silently fail to start. A common mistake is editing a .env file or docker-compose.yml and not realising you've introduced a syntax error. This is your safety net.
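When this check runs from a script, the exit code matters more than the printed file. The sketch below wraps the validation so it prints a one-line verdict. `check_compose` is a hypothetical helper that takes the Compose command as its arguments, and it relies on the `--quiet` flag of `docker compose config`, which validates without printing the resolved configuration.

```shell
# Runs "<compose command> config --quiet" and reports the result.
# The Compose command is passed in as arguments so the helper stays generic.
check_compose() {
  if "$@" config --quiet; then
    echo "compose config: OK"
  else
    echo "compose config: INVALID, do not reboot yet"
    return 1
  fi
}

# Usage on the VM:
#   cd ~/your-project && check_compose docker compose
```

On failure, Compose prints the underlying error itself before the verdict line.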

1.3 Read your apt upgrade output

When you run sudo apt update && sudo apt upgrade -y before a reboot, the output tells you exactly what changed on your system. Don't skip past it.

Here's a real upgrade output and what each part means:

The following packages will be upgraded:
  containerd.io coreutils docker-ce docker-ce-cli
  docker-ce-rootless-extras docker-compose-plugin docker-model-plugin
  gitlab-runner gitlab-runner-helper-images libnftables1 nftables
  python3-pyasn1

How to read this list:

| Package | What it is | Reboot needed? |
|---|---|---|
| docker-ce, containerd.io, docker-ce-cli | The Docker engine and its runtime | Recommended |
| docker-compose-plugin | The docker compose CLI plugin | No |
| nftables, libnftables1 | Userspace tooling and library for the kernel's nftables firewall | Yes |
| coreutils | Fundamental Linux utilities (ls, cp, etc.) | Recommended |
| gitlab-runner, gitlab-runner-helper-images | CI/CD runner agent | Service restarts during upgrade |
| python3-pyasn1 | Python ASN.1 encoding library | No |

The rule of thumb: If the upgrade touches anything in the kernel, networking stack, or container runtime — reboot. If it's only application-level packages — a reboot is optional but never harmful.
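Ubuntu also leaves a machine-readable hint: when an upgraded package asks for a reboot, a flag file appears at `/var/run/reboot-required`, with the requesting packages listed in `/var/run/reboot-required.pkgs`. The sketch below checks for it. `reboot_required` is a hypothetical helper, and the optional path argument exists only so it can be tried against a test file.

```shell
# Reports whether the reboot flag file exists.
# Defaults to Ubuntu's standard path; pass another path to test the helper.
reboot_required() {
  flag="${1:-/var/run/reboot-required}"
  if [ -f "$flag" ]; then
    echo "reboot required"
    # The companion .pkgs file lists the packages that requested the reboot.
    [ -f "$flag.pkgs" ] && cat "$flag.pkgs"
    return 0
  fi
  echo "no reboot flagged"
  return 1
}

# Usage on the VM, after apt upgrade:
#   reboot_required
```

Treat this as one more signal alongside the rule of thumb rather than a replacement for it: the flag is typically set by packages such as the kernel, not by Docker upgrades.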

1.4 Understand the service restart messages

After apt upgrade, Ubuntu's needrestart tool prints which services were restarted automatically and which were deferred:

Restarting services...
 systemctl restart irqbalance.service ssh.service rsyslog.service ...

Service restarts being deferred:
 systemctl restart networkd-dispatcher.service
 systemctl restart systemd-logind.service

"Restarting services" — These were restarted immediately. Your SSH connection stayed alive because ssh.service restarts in-place without dropping existing sessions.

"Service restarts being deferred" — These require a full reboot to apply safely. systemd-logind manages user sessions; restarting it mid-session can cause issues, so Ubuntu defers it to the next clean boot.

No containers need to be restarted.

This line also comes from needrestart, not from Docker. It means needrestart found no running containers that need a restart to pick up the upgraded libraries. It says nothing about whether your application images are current, so expect to see it whenever you haven't rebuilt them.

1.5 Check available disk space

df -h /

Example output:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        48G   12G   36G  23% /

You want at least 20% free on your root partition. Docker image pulls and accumulated log files are the two most common causes of a full disk, which can prevent containers from starting after a reboot.
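The 20% rule is easy to turn into a pass/fail check. This is a small sketch under a few assumptions: `check_disk` is a hypothetical helper, it expects the POSIX output format of `df -P` on stdin, and the default limit of 80% used mirrors the 20% free guideline above.

```shell
# Reads `df -P <mount>` output on stdin and warns when usage exceeds the
# limit (default: 80% used, i.e. less than 20% free).
check_disk() {
  limit="${1:-80}"
  awk -v limit="$limit" 'NR == 2 {
    used = $5
    sub(/%/, "", used)                      # "23%" -> "23"
    if (used + 0 > limit + 0) { print "WARN: " used "% used"; exit 1 }
    print "ok: " used "% used"
  }'
}

# Usage on the VM:
#   df -P / | check_disk        # default limit of 80
#   df -P / | check_disk 90     # custom limit
```

The non-zero exit code on a warning makes it easy to stop a maintenance script before the reboot step.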

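Tip: apt itself does not clean up Docker data. A line like the one below comes from a Docker prune command (for example docker system prune or docker builder prune), which removes unused data such as build cache layers. If a prune runs as part of your maintenance routine, it can free a meaningful amount of space. In a real maintenance run, this printed: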

Total reclaimed space: 4.165GB

Part 2 — Running the Reboot

Once the checklist is complete:

sudo reboot

What happens next, step by step:

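  1. systemd stops running services and sends SIGTERM to the remaining processes, giving them time to shut down cleanly.
  2. The Docker daemon receives the signal and stops all containers. A container that ignores SIGTERM is killed once its stop timeout (10 seconds by default) expires.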
  3. The kernel shuts down and the VM restarts.
  4. Your SSH session prints Connection to [ip] closed by remote host. and terminates. This is normal.

How long to wait: OCI ARM instances (Ampere A1) typically reboot in 45–90 seconds. Wait at least 60 seconds before trying to reconnect.

ssh -i ~/.ssh/id_rsa ubuntu@YOUR_IP
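Instead of retrying by hand, you can let a small loop do the waiting. The sketch below is a generic retry helper, not something specific to OCI. `retry` is a hypothetical function name, and the attempt count and delay in the usage comment are illustrative values you should tune.

```shell
# Retries a command up to <tries> times, sleeping <delay> seconds between
# attempts. Returns 0 on the first success, 1 if every attempt fails.
retry() {
  tries="$1"; delay="$2"; shift 2
  n=1
  while [ "$n" -le "$tries" ]; do
    if "$@"; then
      echo "succeeded on attempt $n"
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  echo "gave up after $tries attempts"
  return 1
}

# Usage from your workstation, after the initial 60-second wait:
#   retry 12 10 ssh -o ConnectTimeout=5 -o BatchMode=yes -i ~/.ssh/id_rsa ubuntu@YOUR_IP true
```

With those example values the loop gives up after roughly three minutes, which is about the point where the troubleshooting table suggests checking the OCI serial console.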

Part 3 — Post-Reboot Verification

Run these checks in order. Each one builds on the last.

3.1 Check the Docker daemon

sudo systemctl status docker

Expected output:

 docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled)
     Active: active (running) since Mon 2026-03-30 15:55:51 UTC; 5min ago

Key things to check:

  • Active: active (running) — the daemon is running ✅
  • enabled — it is configured to auto-start on every future boot ✅

If the daemon isn't running:

sudo systemctl enable docker   # ensure it starts on future reboots
sudo systemctl start docker    # start it now

3.2 Check all containers are up

docker ps

Example output:

CONTAINER ID   IMAGE              COMMAND        CREATED      STATUS                  PORTS    NAMES
fc46f84c7bd5   app-backend        "uv run uvi…"  2 days ago   Up 5 minutes (healthy)  8000/tcp app_backend
a3e9a2eeb160   redis:alpine       "docker-ent…"  2 weeks ago  Up 5 minutes (healthy)  6379/tcp app_redis
f4afe2edb00c   caddy:alpine       "caddy run …"  4 weeks ago  Up 5 minutes (healthy)  80, 443  caddy_proxy

What to check:

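  • Every service you expect should be present. docker ps lists only running containers, so a missing one has exited or crashed on startup; docker ps -a shows it along with its exit status.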
  • STATUS should be Up or Up (healthy). (health: starting) is fine for the first 30 seconds after boot.
  • The CREATED timestamp does not reset on reboot — it reflects when the container was first created with docker compose up. This is normal.

If a container is missing or in a restart loop:

docker compose logs [service_name] --tail=50

This shows the last 50 log lines for that specific service, which will usually tell you exactly why it failed.
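Spotting a missing service by eye gets harder as the stack grows. One way to automate the comparison is to diff the services defined in the Compose file against the ones currently running. The sketch below assumes Compose v2; `missing_services` is a hypothetical helper that takes two newline-separated lists.

```shell
# Prints every service named in the first list that is absent from the second.
#   $1: newline-separated expected services
#   $2: newline-separated running services
missing_services() {
  printf '%s\n' "$1" | while IFS= read -r svc; do
    [ -n "$svc" ] || continue
    # -x: match the whole line, -F: treat the name as a fixed string
    printf '%s\n' "$2" | grep -qxF -- "$svc" || printf '%s\n' "$svc"
  done
}

# Usage on the VM, from the project directory:
#   missing_services "$(docker compose config --services)" \
#                    "$(docker compose ps --services --status running)"
```

One-shot services such as migrations will show up in this list by design, since they are supposed to exit.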

3.3 Watch the live logs

cd ~/your-project
docker compose logs -f --tail=20

The -f flag follows the log stream in real time. --tail=20 shows the last 20 lines per service as a starting point.

What healthy output looks like:

app_gate    | 127.0.0.1 - - [30/Mar/2026:16:00:00 +0000] "GET / HTTP/1.1" 200 4140
app_backend | INFO: 127.0.0.1:58562 - "GET /health HTTP/1.1" 200 OK
caddy_proxy | {"level":"info","msg":"received request","uri":"/config/"}
app_redis   | * Ready to accept connections tcp

What a transient (non-critical) error looks like:

app_worker | redis.exceptions.ConnectionError: Error while reading
            from redis:6379 : (104, 'Connection reset by peer')
app_worker | 15:56:15: Starting worker for 1 functions: process_message
app_worker | 15:56:15: redis_version=8.6.1 mem_usage=1.38M clients_connected=1

This pattern — an error followed immediately by a successful connection message — is normal during cold starts. When all containers launch simultaneously, a dependent service (like a worker) may attempt its first connection before its dependency (like Redis) has finished initialising. The container retries and connects successfully on the next attempt. This is expected behavior.

What a critical error looks like:

app_backend | sqlalchemy.exc.OperationalError: connection refused
app_backend | [after 5 retries] giving up

A critical error is one that does not resolve on its own. If you see continuous errors without a recovery line following them, press Ctrl+C and investigate that service.

3.4 Check additional system services

If you run a CI/CD runner or similar agent alongside Docker:

sudo gitlab-runner status

Expected output:

gitlab-runner: Service is running

If it's not running:

sudo gitlab-runner start

Part 4 — Measuring Time to Recovery (TTR)

TTR is the total time from sudo reboot to the moment your application is serving healthy responses. Measuring it gives you accurate data for maintenance window planning and user communications.

4.1 Measure OS boot time

systemd-analyze

Example output:

Startup finished in 3.617s (kernel) + 19.608s (userspace) = 23.225s
graphical.target reached after 18.845s in userspace

Breaking this down:

| Phase | Time | What's happening |
|---|---|---|
| Kernel | 3.6s | The Linux kernel loads into memory and initialises hardware drivers |
| Userspace | 19.6s | All systemd services start in parallel (networking, Docker, SSH, etc.) |
| Total | 23.2s | OS is fully booted |
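If you want to record these numbers after every reboot, for example to spot a boot that is getting slower over time, the summary line can be split into its phases. This is a minimal sketch: `boot_phases` is a hypothetical helper, and it assumes the `Startup finished in ...` line format shown above.

```shell
# Splits the "Startup finished in A + B = TOTAL" line from systemd-analyze
# into one line per phase, followed by the total.
boot_phases() {
  awk '/^Startup finished in/ {
    sub(/^Startup finished in /, "")
    n = split($0, parts, / [+=] /)          # split on " + " and " = "
    for (i = 1; i < n; i++) print parts[i]
    print "total: " parts[n]
  }'
}

# Usage on the VM:
#   systemd-analyze | boot_phases
```

Because it splits on the separators rather than on fixed positions, it also handles machines that report extra phases such as firmware or loader.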

4.2 Find the bottleneck in the boot sequence

systemd-analyze blame | head -20

This lists every service sorted by how long it took to start, slowest first:

12.186s docker.service
 4.821s cloud-init.service
 1.204s snapd.service
   38ms docker.socket

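In this case, docker.service took about 12 seconds of the 23-second boot. This is normal: Docker has to read its state from disk, re-attach networks, and prepare to launch containers. Because services start in parallel, the times in this list overlap and do not add up to the total.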

Why this is useful: If your boot time is unexpectedly long, systemd-analyze blame tells you exactly which service is the bottleneck.

4.3 Find the exact moment containers started

docker inspect --format='{{.Name}}: {{.State.StartedAt}}' $(docker ps -q)

Example output:

/app_ftp_bridge:  2026-03-30T15:55:57.766Z
/app_worker:      2026-03-30T15:55:57.695Z
/app_backend:     2026-03-30T15:55:57.646Z
/app_gate:        2026-03-30T15:55:57.830Z
/app_admin:       2026-03-30T15:55:57.742Z
/app_redis:       2026-03-30T15:55:57.794Z
/caddy_proxy:     2026-03-30T15:55:57.615Z

Every container launched within the same second. This is because Docker starts all containers in parallel as soon as the daemon is ready. Note: this timestamp reflects when Docker launched the container process, not when the application inside it was ready to serve traffic. A container may take a further 5–30 seconds to pass its health check after this point.
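To see which containers have started but are not yet ready, you can ask Docker for each container's health status and filter the result. The sketch below is one way to do that. `not_ready` is a hypothetical helper, and the literal `no healthcheck` is a label chosen here for containers that define no health check.

```shell
# Reads "name: status" lines on stdin and prints the containers whose health
# status is anything other than "healthy" (for example "starting" or
# "unhealthy"). Containers without a health check are skipped.
not_ready() {
  awk -F ': ' '$2 != "healthy" && $2 != "no healthcheck" { print $1 }'
}

# Usage on the VM:
#   docker inspect --format '{{.Name}}: {{if .State.Health}}{{.State.Health.Status}}{{else}}no healthcheck{{end}}' $(docker ps -q) | not_ready
```

An empty result means every container with a health check is passing it, which is the moment that counts for your recovery time.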

4.4 Build your full TTR timeline

Combining the data from the above commands:

| Event | Time (relative to reboot) |
|---|---|
| sudo reboot executed | T+0s |
| SSH connection closed | T+~5s |
| Kernel boot complete | T+~8s |
| Userspace boot complete (OS ready) | T+~28s |
| Docker daemon ready | T+~28s (12s of the userspace phase) |
| All containers launched | T+~28s |
| Redis accepting connections | T+~30s |
| Backend /health returning 200 | T+~35s |
| All health checks passing | T+~55s |
| Total TTR | ~55–60 seconds |
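The timeline above is assembled from several commands after the fact. As a cross-check, you can also time the outage from the outside with a stopwatch loop. This is a sketch under stated assumptions: `measure_ttr` is a hypothetical helper, the probe is any command that succeeds only while your application is healthy, and `https://YOUR_DOMAIN/health` is a placeholder for your own endpoint.

```shell
# Times an outage as seen by a probe command.
#   $1: polling interval in seconds; remaining arguments: the probe command.
# Prints the whole seconds elapsed from the call until the probe succeeds
# again after having failed at least once.
measure_ttr() {
  interval="$1"; shift
  start=$(date +%s)
  # Phase 1: wait for the service to go down (the probe starts failing).
  while "$@"; do sleep "$interval"; done
  # Phase 2: wait for it to come back.
  until "$@"; do sleep "$interval"; done
  echo $(( $(date +%s) - start ))
}

# Usage from a second machine, started just before you run `sudo reboot`:
#   measure_ttr 2 curl -fsS -o /dev/null --max-time 2 https://YOUR_DOMAIN/health
```

The result is only as precise as the polling interval, and the loop waits forever if the probe never fails, so start it while the service is still up and about to go down.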

4.5 Use TTR to plan user communications

With a measured TTR, you can set honest expectations.

Internal / engineering team:

"Maintenance reboot at [time]. Expected downtime: ~2 minutes."

The 2-minute internal window gives a buffer above the measured ~60 seconds for anything unexpected.

External users:

"Scheduled maintenance in progress. Services will be restored within 5 minutes."

The 5-minute external window is deliberately conservative. If a container fails its first health check and requires a full restart cycle (up to 5 retries × 5 seconds = 25 extra seconds), you're still within your stated window. Under-promise, over-deliver.

Quick Reference: All Commands

# --- PRE-REBOOT ---
docker ps                              # check container states
docker compose config                  # validate compose file syntax
df -h /                                # check available disk space

# --- REBOOT ---
sudo reboot                            # initiate the reboot

# --- POST-REBOOT ---
sudo systemctl status docker           # confirm daemon is running
docker ps                              # confirm containers are up
docker compose logs -f --tail=20       # watch live logs
sudo gitlab-runner status              # check runner (if applicable)

# --- TTR MEASUREMENT ---
systemd-analyze                        # total OS boot time
systemd-analyze blame | head -20       # per-service boot time breakdown
docker inspect --format='{{.Name}}: {{.State.StartedAt}}' $(docker ps -q)
                                       # exact container start timestamps

Troubleshooting Reference

| Symptom | Likely cause | Fix |
|---|---|---|
| Container missing from docker ps | Crashed on startup | docker compose logs [service] --tail=50 |
| Container stuck in (health: starting) after 2+ minutes | Health check command failing | docker inspect [id] → check .State.Health.Log |
| Docker daemon not running | Not enabled in systemd | sudo systemctl enable docker && sudo systemctl start docker |
| SSH times out for more than 3 minutes | VM didn't boot cleanly | Check OCI console → instance serial console for kernel panic output |
| All containers up but app unreachable externally | Reverse proxy (Caddy/Nginx) issue | docker compose logs caddy --tail=50 |
| Container errors right after a cold start | A service started before its dependency was ready | Wait 60 seconds, then re-check; most resolve automatically |

Cover photo by BoliviaInteligente on Unsplash
