🚀 Executive Summary
TL;DR: Oracle Cloud ARM instances can become inaccessible after sshd_config errors, with the serial console failing due to its reliance on a local SSH session. Recovery involves using OCI’s Run Command, detaching and mounting the boot volume on a rescue instance, or, ideally, terminating and rebuilding the instance with automation.
🎯 Key Takeaways
- The OCI Serial Console attempts a local SSH session rather than opening a direct TTY, which causes it to immediately reset if the sshd daemon is misconfigured or fails to start.
- OCI’s “Run Command” feature provides a quick recovery method by executing scripts as root on the instance via the Oracle Cloud Agent, bypassing the need for an SSH login.
- The “Boot Volume Shuffle” is a universal recovery technique: detach the broken instance’s boot volume, attach it to a temporary “rescue” instance, and edit the sshd_config file directly before re-attaching it.
Locked out of your Oracle Cloud (OCI) ARM instance after a bad sshd_config edit? This guide explains why the serial console fails and provides three real-world methods to regain access and fix your server.
That Sinking Feeling: Locked Out of OCI by sshd_config
It was 4:55 PM on a Friday. “Just one last thing,” I thought. A quick change to /etc/ssh/sshd_config on a new staging server, stage-arm-worker-04, to disable password authentication as per our new security policy. I ran systemctl restart sshd, logged out, and tried to log back in to confirm. Connection refused. My heart sank. “No big deal,” I told myself, “I’ll just use the OCI Serial Console.” I connected, the login prompt appeared for a fraction of a second, and then… poof. Connection closed. I was well and truly locked out. If this sounds familiar, you’re not alone, and you’re not out of luck.
First, Why The Serial Console Betrays You
This is the part that throws everyone for a loop. We veterans expect a serial console to be a direct, low-level connection to a TTY, the ultimate back door. But in Oracle Cloud, the “Serial Console Connection” is a bit more sophisticated—and that’s its weakness here.
Instead of a simple login prompt, OCI’s console actually attempts to create a local SSH session for your user on the instance. When you broke your /etc/ssh/sshd_config file with a syntax error, the SSH daemon (sshd) on your instance failed to start or reload properly. So, when the serial console service tries to connect via SSH locally, the daemon immediately rejects the connection. The service sees this instant failure and correctly assumes the session is over, terminating your connection.
It’s not crashing; it’s doing exactly what it was designed to do. Infuriating, but logical.
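The cheapest insurance against this whole scenario is to validate the config before you ever restart the daemon, using sshd’s built-in test mode. Here’s a minimal sketch that demonstrates the idea against a throwaway temp file with a deliberately invalid value, rather than your live config:

```shell
# Write a deliberately broken config to a temp file to demonstrate sshd -t
cfg=$(mktemp)
printf 'PasswordAuthentication nope\n' > "$cfg"   # "nope" is not a valid value

# sshd -t parses the config and exits non-zero on any syntax error
if sshd -t -f "$cfg" 2>/dev/null; then
  echo "config OK - safe to restart"
else
  echo "config invalid - do NOT restart sshd yet"
fi
rm -f "$cfg"
```

In practice: run sshd -t (against the live /etc/ssh/sshd_config) after every edit, and only restart the daemon once it exits cleanly.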
The Recovery Playbook: Three Ways Out
Alright, enough theory. Let’s get you back into your server. Depending on your situation, you can pick the method that works best for you.
Solution 1: The Lifeline (Using OCI’s Run Command)
This is the fastest and easiest way back in, provided the Oracle Cloud Agent is running on your instance (it is by default). The “Run Command” feature lets you execute a script on the instance as the root user without needing to log in at all.
- Navigate to your instance’s details page in the OCI console.
- On the left side, under Resources, click Run Command.
- Click Create Run Command.
- Paste the following script into the Script box. This script safely moves your broken config, copies the default cloud-init config back in its place, and restarts SSH.
```bash
#!/bin/bash
# Move the broken config to a backup file for later review
mv /etc/ssh/sshd_config /etc/ssh/sshd_config.broken-$(date +%F)

# OCI Linux images usually have a good default config here.
# For Ubuntu, this might be a different file, but the concept is the same.
# The goal is to get *any* valid config in place.
if [ -f /etc/ssh/sshd_config.d/50-cloud-init.conf ]; then
  cp /etc/ssh/sshd_config.d/50-cloud-init.conf /etc/ssh/sshd_config
else
  # Fallback for other images: create a super basic config
  echo "Include /etc/ssh/sshd_config.d/*.conf" > /etc/ssh/sshd_config
  echo "PasswordAuthentication no" >> /etc/ssh/sshd_config
  echo "ChallengeResponseAuthentication no" >> /etc/ssh/sshd_config
fi

# Give it correct permissions
chmod 600 /etc/ssh/sshd_config

# Restart the SSH daemon (the service is named "ssh" on Ubuntu)
systemctl restart sshd 2>/dev/null || systemctl restart ssh
```
Leave the other options as default (“Run as user” should be root) and click Create. Wait a minute for the command to run, and you should be able to SSH back into your instance.
Heads Up: This method is a lifesaver, but it depends entirely on the OCI Agent running and being able to communicate with the OCI control plane. If the agent is stopped or network rules are blocking it, this will fail.
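Since everything hinges on the agent, it’s worth confirming it is healthy on your instances before you ever need it. A quick sketch (the service name matches Oracle Linux images; treat it as an assumption elsewhere):

```shell
# Check whether the Oracle Cloud Agent service is up.
# On a non-OCI machine (or one without systemd) this falls through to the else branch.
if systemctl is-active --quiet oracle-cloud-agent 2>/dev/null; then
  echo "Oracle Cloud Agent is running - Run Command should work"
else
  echo "Agent not running (or not an OCI instance) - plan on Solution 2"
fi
```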
Solution 2: The Surgical Procedure (The Boot Volume Shuffle)
This is the classic, old-school cloud recovery method. It’s more involved, but it will always work, regardless of the instance’s software state. The plan is to treat your instance’s boot disk like a USB drive that you plug into another computer to fix.
- Stop the broken instance. This is critical. You can’t detach a boot volume from a running instance.
- Go to the instance details, click the Boot Volume tab, click the three-dots menu, and select Detach. Confirm the detachment.
- Launch a temporary “rescue” instance. It can be the smallest, cheapest instance available, but it must be in the same Availability Domain as your original instance.
- Once the rescue instance is running, navigate to its instance details page, go to Attached Block Volumes, and click Attach Block Volume.
- Select Boot Volume as the type and choose your detached volume from the list. Attach it.
- SSH into your new rescue instance. Now, we mount the broken disk and fix the file.
```bash
# SSH into the rescue instance
# ssh opc@<rescue_instance_ip>

# List the attached disks to find your volume. It's often /dev/sdb
sudo lsblk

# Create a directory to mount it
sudo mkdir /mnt/rescue

# The root partition is usually the largest one. On Oracle Linux images, it's often partition 3.
# Use the name you found with lsblk, e.g., /dev/sdb3
sudo mount /dev/sdb3 /mnt/rescue

# Now the entire broken filesystem is available under /mnt/rescue. Fix the file:
sudo nano /mnt/rescue/etc/ssh/sshd_config

# After fixing the file, unmount the volume
sudo umount /mnt/rescue
```
After unmounting, go back to the OCI console, detach the volume from the rescue instance, re-attach it as the boot volume for your original instance, and start it up. You’re back in business. You can now terminate the rescue instance.
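If you’re unsure which partition holds the root filesystem, you can pick the largest one programmatically from lsblk’s raw byte output. This hypothetical helper (the device path and sample sizes below are illustrative) does exactly that:

```shell
# Given `lsblk -b -rno NAME,SIZE,TYPE <device>` output, print the name of
# the largest partition - almost always the root filesystem.
largest_part() {
  awk '$3 == "part" && $2+0 > max { max = $2; name = $1 } END { print name }'
}

# Illustrative captured output (sizes in bytes). On a real rescue box you
# would pipe: lsblk -b -rno NAME,SIZE,TYPE /dev/sdb | largest_part
sample='sdb 53687091200 disk
sdb1 104857600 part
sdb2 1073741824 part
sdb3 52508028928 part'
echo "$sample" | largest_part   # prints: sdb3
```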
Solution 3: The ‘Nuke and Pave’ (Embracing Immutable Infrastructure)
Sometimes, the best fix is a fresh start. This is my preferred “fix,” because it forces good habits. If your instance, like stage-arm-worker-04, doesn’t contain critical, unique state (e.g., its data is on a separate block volume or in a database), why spend 30 minutes performing surgery on it?
Terminate it. And build a new one.
This sounds drastic, but in a modern DevOps workflow, it’s standard practice. Your infrastructure should be treated as “cattle, not pets.” If one gets sick, you replace it; you don’t nurse it back to health.
Pro Tip: This incident is the perfect justification for improving your process. The real fix isn’t editing a file; it’s ensuring that manual edits are no longer necessary. Your server configuration, including SSH hardening, should be defined in code using tools like Terraform, Ansible, or simple cloud-init scripts. That way, you can spin up a perfectly configured replacement in minutes, and this kind of lockout becomes a non-issue.
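As a taste of what “defined in code” can look like, here is a hypothetical cloud-init user-data fragment that applies SSH hardening as a drop-in file at first boot, and validates the config before restarting the daemon (file path and option names are illustrative; adapt to your image):

```yaml
#cloud-config
# Hypothetical first-boot SSH hardening via a sshd_config.d drop-in
write_files:
  - path: /etc/ssh/sshd_config.d/99-hardening.conf
    permissions: "0600"
    content: |
      PasswordAuthentication no
      PermitRootLogin no
runcmd:
  # Only restart if the merged config actually parses
  - sshd -t && systemctl restart sshd
```

With this baked into your launch template, a replacement instance comes up hardened and correct every time, and there is nothing left to hand-edit.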
Locking yourself out is a rite of passage for any systems engineer. Don’t panic. Use one of these methods to get back in, then take a step back and think about how you can use automation to make sure “next time” never happens.
👉 Read the original article on TechResolve.blog