This week we're going to look at some tools that report various CPU, memory, and disk metrics on Linux servers. You will want to make sure these are all installed on production systems so they're available when there's a problem.
System load, or the load average, is the average number of processes that are running or waiting to run on a CPU. On Linux it also counts processes in uninterruptible sleep, which are usually waiting on disk I/O. Linux tracks 1-, 5-, and 15-minute load averages, and all three can be viewed with the `uptime` and `top` utilities. In the case of `top`, the load averages are shown on the right side of the first line.
Both tools also show the system uptime, which is how long the operating system has been running, to the left of the load averages.
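On Linux, the kernel exposes these figures in /proc/loadavg (the three load averages come first) and /proc/uptime (the first field is seconds since boot), which is where procps tools such as `uptime` and `top` typically read them. The sketch below parses the same files. It assumes Python 3.9+; `parse_loadavg` and `format_uptime` are hypothetical helper names, and the `Nd HH:MM` output format is an illustrative choice, not what `uptime` prints.

```python
from pathlib import Path


def parse_loadavg(text: str) -> tuple[float, float, float]:
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg content."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


def format_uptime(text: str) -> str:
    """Format the first field of /proc/uptime (seconds since boot) as 'Nd HH:MM'."""
    seconds = int(float(text.split()[0]))
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    return f"{days}d {hours:02d}:{rest // 60:02d}"


if __name__ == "__main__":
    # Only meaningful on Linux, where these /proc files exist.
    if Path("/proc/loadavg").exists():
        print("load average:", parse_loadavg(Path("/proc/loadavg").read_text()))
        print("up", format_uptime(Path("/proc/uptime").read_text()))
```

For example, /proc/uptime content of `350735.47 234388.90` is formatted as `4d 01:25`.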
Typically, a high 1-minute load combined with low 5- and 15-minute loads means the system was mostly idle until just now, when something started putting heavy load on it. These three windows don't give you much granularity, so people often run `uptime` or `top` repeatedly in quick succession. A better tool is `sysstat`, which records the system load and many other statistics at regular intervals (every 10 minutes by default on common distributions) so you can review them later.
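The reason the 1-minute figure reacts first is that the three values are exponentially damped moving averages, which the kernel updates roughly every 5 seconds. The simulation below is a simplified floating-point model of that update (the kernel itself uses fixed-point arithmetic). `simulate_load` is a hypothetical helper that takes the number of runnable tasks seen at each 5-second sample.

```python
import math

SAMPLE_INTERVAL = 5.0           # seconds between kernel updates (approximately)
WINDOWS = (60.0, 300.0, 900.0)  # 1-, 5- and 15-minute windows, in seconds


def simulate_load(runnable_per_sample: list[int],
                  start: tuple[float, float, float] = (0.0, 0.0, 0.0)) -> tuple[float, ...]:
    """Apply the exponentially damped update once per 5-second sample.

    Each average moves a fraction (1 - decay) of the way toward the current
    number of runnable tasks; shorter windows use a smaller decay factor,
    so they react faster.
    """
    decays = [math.exp(-SAMPLE_INTERVAL / window) for window in WINDOWS]
    loads = list(start)
    for n in runnable_per_sample:
        loads = [load * d + n * (1.0 - d) for load, d in zip(loads, decays)]
    return tuple(loads)


if __name__ == "__main__":
    # Idle machine; then 4 busy processes run for 60 s (12 samples of 5 s each).
    one, five, fifteen = simulate_load([4] * 12)
    print(f"{one:.2f} {five:.2f} {fifteen:.2f}")  # prints "2.53 0.73 0.26"
```

A minute of sustained work on an otherwise idle machine pushes the 1-minute average to about 2.53 while the 5- and 15-minute averages are still at 0.73 and 0.26, which is the "high 1-minute, low 5- and 15-minute" pattern.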
The sysstat configuration file lives in /etc/default/sysstat on Debian-based systems and /etc/sysconfig/sysstat on Red Hat-based systems. On Debian-based systems, data collection is also controlled by the `ENABLED=` line in that file: make sure it is set to `ENABLED="true"`, and add the line if it's missing. Then run:
```shell
systemctl enable sysstat
systemctl start sysstat
```
If sysstat isn't installed yet, install it with your distribution's package manager first (for example, `apt install sysstat` on Debian or `dnf install sysstat` on Red Hat-based systems).
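To check the Debian setting from a script, something like the following sketch works. `sysstat_enabled` is a hypothetical helper, and it assumes the file uses simple shell-style `KEY=value` lines, as /etc/default/sysstat does; if the variable is assigned more than once, the last assignment wins, as it would when the shell sources the file.

```python
import shlex
from pathlib import Path

DEBIAN_DEFAULTS = Path("/etc/default/sysstat")


def sysstat_enabled(config_text: str) -> bool:
    """Return True if the last ENABLED= assignment in the text is set to true."""
    enabled = False
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line.startswith("ENABLED="):
            continue  # skips comments, blank lines and other settings
        # shlex removes shell-style quotes and, with comments=True, trailing comments.
        value = shlex.split(line.split("=", 1)[1], comments=True)
        enabled = value == ["true"]
    return enabled


if __name__ == "__main__":
    if DEBIAN_DEFAULTS.exists():
        print("sysstat collection enabled:", sysstat_enabled(DEBIAN_DEFAULTS.read_text()))
```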
Once it's enabled, sysstat periodically records statistics to a data file and starts a new file each day. By default the files are named saDD, where DD is the day of the month, and live under /var/log/sysstat on Debian or /var/log/sa on Red Hat-based systems. You can view the records with the `sar` tool. By default, it shows CPU statistics; the `-r` option makes it show memory statistics, the `-b` option shows disk I/O statistics, and the `-A` option displays all available statistics.
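Without an interval argument, `sar` reads the current day's file; to look at an earlier day, you pass that day's file with `-f`. The sketch below builds such a command line. It assumes the classic `saDD` naming and the default directories /var/log/sysstat (Debian) and /var/log/sa (Red Hat); newer sysstat versions can be configured to use other file names. `sa_file`, `sar_command`, and the report names are hypothetical.

```python
import datetime
from pathlib import Path

# Assumed default data directories; check your distribution's sysstat settings.
SA_DIRS = {"debian": Path("/var/log/sysstat"), "redhat": Path("/var/log/sa")}
# Report selector -> sar options (no option means the default CPU report).
REPORT_OPTIONS = {"cpu": [], "memory": ["-r"], "disk": ["-b"], "all": ["-A"]}


def sa_file(day: datetime.date, family: str = "debian") -> Path:
    """Daily data file using the classic saDD naming (DD = day of the month)."""
    return SA_DIRS[family] / f"sa{day.day:02d}"


def sar_command(report: str, day: datetime.date, family: str = "debian") -> list[str]:
    """Build an argv list that makes sar read one particular day's file via -f."""
    return ["sar", *REPORT_OPTIONS[report], "-f", str(sa_file(day, family))]


if __name__ == "__main__":
    yesterday = datetime.date.today() - datetime.timedelta(days=1)
    print(" ".join(sar_command("memory", yesterday)))
```

The resulting list can be passed to `subprocess.run` on a host where sysstat is collecting data.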
Let's take a look at the kind of CPU metrics you can expect to get from `sar`:
- User time and system time are the time the CPU spends running processes' own code in user mode and running kernel code on their behalf (for example, handling system calls), respectively.
- I/O wait time is the time the CPU sat idle while there was at least one outstanding disk I/O request. It generally does not include network I/O: Linux doesn't implement network reads and writes as a block device, and a process waiting on a network socket simply sleeps, so that time shows up as idle.
- Nice time is the user time of processes running with a positive niceness, that is, processes whose scheduling priority has been lowered with `nice` or `renice`.
- Steal time is the time a virtual CPU was ready to run but had to wait because the hypervisor was running something else, such as other guests' virtual CPUs. It's only meaningful on virtual machines; on physical hardware it is normally zero.
- Hardware interrupt and software interrupt time is the time the kernel spends servicing those two kinds of events.
- Finally, idle time has the most straightforward definition: it's the time the CPU had nothing to do. Idle time during which disk I/O is pending is counted as I/O wait instead.
All of these metrics appear in `top`'s output as well as in `sar`'s, although `sar` only shows the interrupt times when you ask for them with `sar -u ALL`.
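These percentages are derived from cumulative tick counters that the kernel exposes on the first line of /proc/stat, which is where `top` and sysstat's data collector read them. The sketch below takes two snapshots and divides each counter's increase by the total increase. It skips the trailing `guest` and `guest_nice` columns because the kernel already includes that time in `user` and `nice`. `parse_cpu_line`, `cpu_percentages`, and `read_cpu_line` are hypothetical helpers, and the sketch assumes a kernel that reports at least the first eight counters.

```python
import time
from pathlib import Path

# The first eight counters on the aggregate "cpu" line of /proc/stat, in order.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict[str, int]:
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    name, *values = line.split()
    if name != "cpu":
        raise ValueError(f"expected the aggregate 'cpu' line, got {name!r}")
    # guest and guest_nice (columns 9 and 10) are already included in user and
    # nice, so they are skipped to avoid counting that time twice.
    return dict(zip(FIELDS, (int(v) for v in values[:8])))


def cpu_percentages(before: dict[str, int], after: dict[str, int]) -> dict[str, float]:
    """Percentage of the elapsed ticks spent in each state between two snapshots."""
    deltas = {field: after[field] - before[field] for field in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {field: 0.0 for field in FIELDS}
    return {field: 100.0 * delta / total for field, delta in deltas.items()}


def read_cpu_line(path: Path = Path("/proc/stat")) -> dict[str, int]:
    """Read and parse the first (aggregate) line of /proc/stat."""
    return parse_cpu_line(path.read_text().splitlines()[0])


if __name__ == "__main__":
    if Path("/proc/stat").exists():
        first = read_cpu_line()
        time.sleep(1)
        second = read_cpu_line()
        for state, pct in cpu_percentages(first, second).items():
            print(f"{state:>8}: {pct:5.1f}%")
```

Comparing two snapshots rather than reading one is the important design choice: the counters grow from boot, so a single reading only gives averages since the machine started.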
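The memory metrics from `sar -r` are pretty straightforward, with columns for kilobytes of free and used RAM, the percentage of RAM in use, and how much RAM is being used for buffers and the page cache. Depending on the sysstat version, swap usage is either part of the same report or shown separately with `sar -S`.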
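The raw memory figures come from /proc/meminfo, where most values are in kB. The sketch below parses that file and computes a few `sar -r`-style quantities. Treating used memory as `MemTotal - MemFree` is an assumption for illustration; `sar` and `free` have used different formulas across versions, so the numbers may not match exactly. `parse_meminfo` and `memory_summary` are hypothetical names.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Map each /proc/meminfo key to its numeric value (most are in kB)."""
    info: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def memory_summary(info: dict[str, int]) -> dict[str, float]:
    """A few sar -r style figures. 'Used' here is simply MemTotal - MemFree,
    which counts buffers and page cache as used (an illustrative assumption)."""
    used = info["MemTotal"] - info["MemFree"]
    return {
        "mem_free_kb": info["MemFree"],
        "mem_used_kb": used,
        "mem_used_pct": 100.0 * used / info["MemTotal"],
        "buffers_kb": info.get("Buffers", 0),
        "cached_kb": info.get("Cached", 0),
        "swap_free_kb": info["SwapFree"],
        "swap_used_kb": info["SwapTotal"] - info["SwapFree"],
    }


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        for name, value in memory_summary(parse_meminfo(meminfo.read_text())).items():
            print(f"{name:>13}: {value:,.1f}")
```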
Finally, the disk metrics from `sar -b` tell you how many read and write requests were issued per second and how many blocks were read or written per second, where a block is a 512-byte sector. The `iostat` tool, which also ships with sysstat, reports similar figures to `sar -b` but breaks them down per device.
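Per-device figures like the ones `iostat` shows are based on the kernel's block-device counters, which are exposed in /proc/diskstats: cumulative completed reads and writes plus sectors transferred, always in 512-byte units. The sketch below takes two snapshots a known interval apart and turns the differences into rates. `parse_diskstats` and `disk_rates` are hypothetical helpers, and the sketch ignores counter wraparound and devices that appear between snapshots.

```python
import time
from pathlib import Path

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units
Snapshot = dict[str, tuple[int, int, int, int]]


def parse_diskstats(text: str) -> Snapshot:
    """Return {device: (reads, sectors_read, writes, sectors_written)}."""
    stats: Snapshot = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # skip blank or unexpected lines
        # f[0..2] = major, minor, name; f[3] reads completed, f[5] sectors read,
        # f[7] writes completed, f[9] sectors written.
        stats[f[2]] = (int(f[3]), int(f[5]), int(f[7]), int(f[9]))
    return stats


def disk_rates(before: Snapshot, after: Snapshot, interval: float) -> dict[str, dict[str, float]]:
    """Per-device request and throughput rates over `interval` seconds."""
    rates: dict[str, dict[str, float]] = {}
    for dev, (r1, sr1, w1, sw1) in after.items():
        if dev not in before:
            continue
        r0, sr0, w0, sw0 = before[dev]
        rates[dev] = {
            "reads_per_s": (r1 - r0) / interval,
            "writes_per_s": (w1 - w0) / interval,
            "kb_read_per_s": (sr1 - sr0) * SECTOR_BYTES / 1024 / interval,
            "kb_written_per_s": (sw1 - sw0) * SECTOR_BYTES / 1024 / interval,
        }
    return rates


if __name__ == "__main__":
    path = Path("/proc/diskstats")
    if path.exists():
        first = parse_diskstats(path.read_text())
        time.sleep(1)
        second = parse_diskstats(path.read_text())
        for dev, r in disk_rates(first, second, 1.0).items():
            print(dev, {k: round(v, 1) for k, v in r.items()})
```

The same 512-byte conversion applies when reading `sar -b`: multiply a per-second block count by 512 to get bytes per second.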