loading...

Saving Your Linux Machine from Certain Death

martinheinz profile image Martin Heinz Originally published at martinheinz.dev ・8 min read

Note: This was originally posted at martinheinz.dev

Troubleshooting damaged systems is essential skill of every SysAdmin, SRE or DevOps engineer. Everyone of us runs into OS related issues from time to time and it's better to be prepared when things go terribly wrong. It's especially beneficial to be able to identify and act on the issue quickly to prevent any significant damage. To help with that in this article - we will go over a few common problems that you might encounter as well as ways to gather information, troubleshoot and solve these issues.

Note: This article uses RHEL 8/CentOS. But, examples/concepts below can be applied to any Linux distribution.

Recovering root Password

What if you lose root password and you don't have access to any privileged user? If you still have access to the machine, then there is way to solve this inconvenient situation:

First, start by rebooting the machine. When machine starts, hit any key to access boot menu:

Boot Menu Screenshot

In boot menu hit e to edit boot options. Using arrows move to the line starting with linux and append rd.break - this breaks boot process early on. Optionally you can also append enforcing=0, to pause SELinux enforcing. Next, hit CTRL+X to let machine boot.

Edited Boot Menu Screenshot

After few seconds of booting, you should get shell. At this point you have access to system in read-only mode. So, in order to change anything in the system - like root password - we need to make filesystem read-write. We can do that by running mount -o remount,rw /sysroot. Next thing we need to do is enter root jail using chroot /sysroot - this changes root of filesystem to be /sysroot instead of /. This is required, so that any further commands we run will be in regards to /sysroot directory. Now we can change root password using passwd.

If you added enforcing=0 to boot options, you can now hit CTRL+D (or type exit) and let system fully boot. If not, run touch /.autorelabel to trigger SELinux system relabel. This is needed because changing password results in /etc/password having incorrect SELinux security context. Therefore we need to relabel whole filesystem during next boot (this can take some time, depending on size of filesystem).

Password Change Commands

As an alternative solution, you could also access Linux debug-shell. This can be done - again - by accessing GRUB during boot and appending systemd.debug-shell instead of rd.break. When you let system boot with this option, you will end up in normal shell session, which isn't really helpful. If you however try to access terminal 9 using CTRL+ALT+F9, you will open debug-shell with full root permissions. Here, you can change password normally. At this point you can switch back to normal shell (CTRL+ALT+F1) and login. You shouldn't forget to stop the debug-shell though, as it is huge vulnerability to the system. You can do that by running systemctl stop debug-shell.service (you can still switch back to debug-shell but it will be unresponsive - killed-off).

debug-shell

Fixing Unmountable Filesystems

Creating new partitions, creating filesystems, mounting filesystem, etc. are common tasks for most SysAdmins. But, even though these are basic tasks it's easy to make mistake which may render your system unbootable. Let's see how you can solve problems related unmountable filesystems:

As with previous solutions, we start by rebooting the machine, accessing the boot menu and editing it, this time appending systemd.unit=emergency.target. This tells your system to boot into emergency target instead of default one (multi-user or graphical). When system boots and we get shell, we login as root and we again remount the filesystem using mount -o remount,rw /. Now we can try mounting all filesystems by running mount -a. If there is a problem with mounting some specific filesystem, you might see error message like mount: /wrong-mount: mount point does not exist. or mount /wrong-mount: special device /dev/sdb1 does not exist.. These kind of an issues need to be fixed inside /etc/fstab:

/dev/sda1   /mnt           xfs   defaults   0 0
/dev/sdb1   /wrong-mount   xfs   defaults   0 0  <- fix, comment out or remove this line

After fixing the issue in /etc/fstab run systemctl daemon-reload, so that systemd picks up the changes. Now, run mount -a again. If the issue was indeed fixed, you should see no error (no news, is good news). You can now exit using CTRL+D and let system boot normally.

Aside from mistyped device or mount point name, you might also encounter issues with VDO (Virtual Data Optimizer) or Stratis, which require extra mount arguments e.g. x-systemd.requires=vdo.service or x-systemd.requires=stratisd.service, without which system won't boot properly. Another common and easily fixable mistake might be missing quote when using UUID="... to specify device (use /etc/fstab syntax highlighting, it can save you lot problems).

Troubleshooting SELinux Problems

This one is not a life and death kind of a situation, but it can cause a lot problems, so it's beneficial to be able to identify it quickly when it happens.

It's important to realise that most of the time, SELinux is doing it's job correctly. But it might just happen that you are trying to achieve something SELinux doesn't expect. Some of the problem you might encounter may include issues with incorrect file context, for example after moving file from one place to another. Sometimes the issue might be with overly restrictive policies (SELinux booleans) or blocked service ports.

One can troubleshoot all of these problems by first temporarily changing SELinux to non-enforcing mode using setenforce 0 and retrying the action that wasn't working previously. If the problem got fixed by switching SELinux to non-enforcing mode, then we know that the problem was caused by some SELinux violation.

Now, if we turn SELinux back on using setenforce 1, we can try to analyze and fix the violation:

First install setroubleshoot-server using yum -y install setroubleshoot-server. This troubleshooting server will listen to /var/log/audit/audit.log and send summary messages to /var/log/messages. Next, to analyze these messages run grep sealert /var/log/messages which should give you messages like this:

...
Apr  6 08:03:46 server1 setroubleshoot[24548]: SELinux is preventing /usr/sbin/httpd from name_bind access on the tcp_socket port 8012. For complete SELinux messages run: sealert -l d2c359a9-e9c5-48cb-af8b-ba73d982f21b

As an example here, I configured httpd to run on port 8012 which is blocked because of SELinux service's allowed ports. If we were not aware of this, then it would be quite hard to find root cause of this issue. The output above can help with that. We can see description of the SELinux violation as well as command that can help us troubleshoot further, so let's try it out:

[root@server1 ~]# sealert -l d2c359a9-e9c5-48cb-af8b-ba73d982f21b
SELinux is preventing /usr/sbin/httpd from name_bind access on the tcp_socket port 8012.

*****  Plugin bind_ports (92.2 confidence) suggests   ************************

If you want to allow /usr/sbin/httpd to bind to network port 8012
Then you need to modify the port type.
Do
# semanage port -a -t PORT_TYPE -p tcp 8012
    where PORT_TYPE is one of the following: http_cache_port_t, http_port_t, jboss_management_port_t, jboss_messaging_port_t, ntop_port_t, puppet_port_t.

... omitted

Additional Information:
Source Context                system_u:system_r:httpd_t:s0
Target Context                system_u:object_r:unreserved_port_t:s0
Target Objects                port 8012 [ tcp_socket ]
Source                        httpd
Source Path                   /usr/sbin/httpd
Port                          8012
Host                          ...
Source RPM Packages           httpd-2.4.37-10.module+el8+2764+7127e69e.x86_64
Target RPM Packages           
Policy RPM                    selinux-policy-3.14.1-61.el8.noarch
Selinux Enabled               True
Policy Type                   targeted
Enforcing Mode                Enforcing

... omitted

This produces full report of what caused the violation. Including suggested (not necessarily the most appropriate) fix. If you have some experience with SELinux you might realise that most appropriate way to fix this issue is to add the relevant port to SELinux service (http_port_t). This can be done by running semanage port -a -t http_port_t -p tcp 8012.

This pattern of replicating the violation, looking for sealert messages in var/log/messages and viewing the report and analyzing the report can be applied to any SELinux violation/problem, not just the one example above.

Alternatively you can also search directly in /var/log/audit/audit.log using ausearch. Specific command you would want to run: ausearch -m AVC -ts recent. This shows all recent denials. Output should look something like this (same information, little less user friendly):

time->Mon Apr  6 08:10:56 2020
type=PROCTITLE msg=audit(1586160656.860:235): proctitle=2F7573722F7362696E2F6874747064002D44464F524547524F554E44
type=SYSCALL msg=audit(1586160656.860:235): arch=c000003e syscall=49 success=no exit=-13 a0=3 a1=562ebb162280 a2=10 a3=7ffcd89eb21c items=0 ppid=1 pid=24612 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="httpd" exe="/usr/sbin/httpd" subj=system_u:system_r:httpd_t:s0 key=(null)
type=AVC msg=audit(1586160656.860:235): avc:  denied  { name_bind } for  pid=24612 comm="httpd" src=8012 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:unreserved_port_t:s0 tclass=tcp_socket permissive=0

Getting Logs From Crashing System

By default, logs stored in /run/log/journal are not persisted across system reboots. That might become a problem if you need to debug logs on crashing system. To preserve journal logs we need to modify /etc/systemd/journald.conf. More specifically the Storage parameter:

# /etc/systemd/journald.conf

[Journal]
#Storage=auto  # <- This one
#Compress=yes
#Seal=yes
#SplitMode=uid
...

By uncommenting and changing Storage to persistent, we tell systemd to store all logs in var/log/journal. Aside from this change we also need to run systemctl reload systemd-journald to make sure that the change takes effect.

Even though this change will persist logs on your system, it won't keep all of them forever. By default, journald is configured to not exceed 10% of filesystem or leave system with less then 15% of free space.

Now, to actually inspect the previously stored logs. First switch to root user. Run journalctl --list-boots. This will give you list like this:

-3 620d709db36440dc9a0b165f85f9f346 Sun 2020-03-08 18:58:41 CET—Fri 2020-04-03 08:29:25 CEST
-2 89a7ba7725ab4ac5bc91517d29fbc31e Fri 2020-04-03 08:33:19 CEST—Fri 2020-04-03 08:35:02 CEST
-1 e93ec544729f4008ac126b80b4b0c207 Fri 2020-04-03 08:35:53 CEST—Sun 2020-04-05 15:21:16 CEST
 0 58a205834d2d4d7b9b228d8a126b9e81 Sun 2020-04-05 15:21:40 CEST—Fri 2020-04-10 14:57:49 CEST

Based on the dates and times, choose from which boot you want to see logs. For example, to view logs from boot with id -2 with log level err or higher:

~ $ journalctl -b -2 -p err
-- Logs begin at Sun 2020-03-08 18:58:41 CET, end at Fri 2020-04-10 14:59:59 CEST. --
apr 03 08:33:21 my-pc bluetoothd[1187]: ...
apr 03 08:33:25 my-pc bluetoothd[1187]: ...
apr 03 08:33:25 my-pc kernel: ...
apr 03 08:33:36 my-pc wpa_supplicant[1198]: ...
apr 03 08:33:36 my-pc wpa_supplicant[1198]: ...
...

If the logs above are not enough to troubleshoot your issues, then there are other log files to check:

  • /var/log/messages - Most of the syslog messages should be here.
  • /var/log/boot.log - System startup messages (Not related to syslog)

Alternatively, if you can't boot your machine normally, then you can access emergency.target as shown above a view logs there the same way.

Conclusion

There is a lot more that can go wrong with Linux machine than what I shown in sections above. These examples/approaches though, can be applied to variety of other problems that you might encounter. Also, not all of them are life-and-death kind of a situation, but it's always preferable to be able solve them rather quickly, especially if this problematic machine is a production system. Solving most of the issues depends of getting the right information and being able to restore previous configurations, therefore it's crucial to always store logs and to backup critical files on your system before modifying them.

Posted on May 15 by:

martinheinz profile

Martin Heinz

@martinheinz

My name is Martin Heinz and I'm a software developer/DevOps engineer. I'm from Slovakia, living in Bratislava.

Discussion

markdown guide